2020-02-27 12:28:16

Preliminaries

Announcements

Today’s agenda

  • General comments on homework
  • Version control
  • Version control with git and GitHub
  • RStudio projects

Comments on homework

Things to consider

  • Where are my data stored?
    • Can I create a reproducible workflow for importing them?
  • Automate as much as possible
  • Use names that are transparent to others

  • What are your document’s dependencies (e.g., packages, external scripts, or functions)?
  • Make your code as readable as possible.
    • Consider <COMMAND/CTRL>I to reindent lines
    • <SHIFT><COMMAND/CTRL>A to reformat code
  • Plot first, analyze later

Version control

What is version control?

Why do version control?

  • Keep record of who made what changes when
  • Revert to previous version
  • Transparency

How to do version control

Edit, Save, Attach

  • My typical scheme for collaborative writing
  • Requires consistent file-naming (and incrementing) conventions
    • gilmore-etal-nature-2017-01-28-1319.docx
    • gilmore-etal-nature-2017-01-28-1400.docx
    • or gilmore-etal-nature-2020-02-04v01.docx
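
The timestamped naming convention above can be generated instead of typed by hand. This is a minimal sketch in Python, not the workflow actually used for these papers. The helper name `versioned_name` is hypothetical, and the pattern `stem-YYYY-MM-DD-HHMM.ext` is assumed from the first two file names above.

```python
from datetime import datetime


def versioned_name(stem, when, ext="docx"):
    """Return a name like 'stem-YYYY-MM-DD-HHMM.ext' (hypothetical helper)."""
    # Zero-padded date and 24-hour time make names sort chronologically.
    return f"{stem}-{when:%Y-%m-%d-%H%M}.{ext}"


# Reproduce the first example name from the list above.
print(versioned_name("gilmore-etal-nature", datetime(2017, 1, 28, 13, 19)))
```

Passing `datetime.now()` as `when` stamps the file with the current time, so every collaborator who uses the helper produces names in the same format.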

Pros

  • No special software needed
  • Works with any file type

Cons

  • Not everyone uses the same file naming conventions
  • Conflicting edits must be reconciled by hand
  • File management becomes a burden as versions accumulate

Google Docs

  • Gilmore, R.O., Diaz, M.T., Wyble, B.A., & Yarkoni, T. (2017). Progress toward openness, transparency, and reproducibility in cognitive neuroscience. Annals of the New York Academy of Sciences, 1396, 5–18. doi: 10.1111/nyas.13325.

Pros

  • Works with docs and sheets
  • Unlimited, automatic track changes
  • Stored in cloud
  • No special naming conventions
  • Writers can work in parallel

Cons

  • NYAS Editor wanted Word doc with track changes!
  • Does not support all file types
  • Not everyone likes/knows how to use Google Docs

Cloud storage (Box, OneDrive)

  • Pros
    • Keep the same file name
    • Let Box or OneDrive do automatic versioning
    • Shared file space
    • Any kind of file supported

  • Cons
    • Limit on the number of versions (was 100 for Box)
    • Depends on Penn State’s enterprise license

Open Science Framework

  • Integrates with Box, Dropbox, GitHub
  • Pros
    • Free, open source, devoted to open science
    • Many different file types
  • Cons
    • Version control depends on the connected storage provider

Version control using git and GitHub

How to do version control

  • git
    • free, open source version control system
    • created by Linus Torvalds, creator of the Linux operating system, to manage that project’s software development

GitHub is a web-based git service

Cons

  • Designed by and for software developers
  • Works best with text files
  • Wonky
  • Longevity of web-based repos?
  • No automatic synching
  • Users have to specifically push (upload), pull (download) files, merge/manage conflicts

Pros

  • Users control exactly when files are pushed (uploaded), pulled (downloaded), and merged
  • Great for R code, data
  • Great for Jupyter notebooks (later in course)
  • Exploits the power of the web

  • Easy-to-use “Pages” feature
  • Supports Markdown, Jupyter notebooks

Distributed version control

  • Everyone works on their own local copies
  • Pull before they start work
  • Save and push when they’re finished

Learn from my mistakes

  • Use RStudio projects (next)
  • Have your local directory structure mirror GitHub
  • Pull before you start working

  • Work in a branch (I usually work in dev)
  • Commit often
  • Push to GitHub
  • Create pull request to merge your changes into master
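
The steps above map onto a short sequence of git commands. This sketch is in Python and lists the commands without running them unless asked. The helpers `workflow_commands` and `run_session` are hypothetical. It assumes the branch (`dev` here) already exists and that the GitHub remote is named `origin`. The final pull request is created on the GitHub website, so it does not appear in the list.

```python
import shlex
import subprocess


def workflow_commands(branch="dev", message="Describe the change"):
    """List the git commands for one work session (hypothetical helper)."""
    return [
        ["git", "pull"],                    # pull before you start working
        ["git", "checkout", branch],        # work in a branch
        ["git", "add", "--all"],            # stage your edits
        ["git", "commit", "-m", message],   # commit often
        ["git", "push", "origin", branch],  # push to GitHub
    ]


def run_session(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: shows the commands without touching any repository.
run_session(workflow_commands(message="Update homework notes"))
```

Calling `run_session(..., dry_run=False)` inside a cloned repository would execute the same commands and stop at the first one that fails.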

Let’s do this…

Complete Hello World exercise

Get git set up on your computer

Connect git to your GitHub account

RStudio Projects

Projects

  • Separate “home” directory for each piece of work
  • Switch between projects using the Open Project... command in the File menu
  • Metadata saved in *.Rproj (not saved on GitHub)

Connect RStudio with git and GitHub

Create new RStudio project

Rick’s workflow

  • Create repo on GitHub
  • Copy link
  • Open RStudio
  • Create new project in RStudio

  • Add *.Rproj to .gitignore
  • Edit README.md
  • Save and commit as ‘first commit’
  • Push to GitHub to test connection

Learn from my mistakes

  • Add sensitive files and directories to .gitignore, for example:

    .Rproj.user
    .Rhistory
    .RData
    .Ruserdata
    psy-525-reproducible-research-2020.Rproj
    psy-525-spring-2020.csv
    .databrary.RData
    hw
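
Entries like those listed above can be added to .gitignore from a script, so that nothing is listed twice. This is a minimal sketch in Python with a hypothetical helper, `add_to_gitignore`. It compares entries as plain text and does not interpret git's pattern syntax. The demonstration writes to a temporary directory, not a real project.

```python
import tempfile
from pathlib import Path


def add_to_gitignore(entries, path=".gitignore"):
    """Append entries not already listed; return the ones that were added."""
    target = Path(path)
    text = target.read_text() if target.exists() else ""
    listed = {line.strip() for line in text.splitlines()}
    new = []
    for entry in entries:
        if entry not in listed:
            new.append(entry)
            listed.add(entry)
    if new:
        # Start on a fresh line if the existing file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        with target.open("a") as fh:
            fh.write(prefix + "\n".join(new) + "\n")
    return new


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    ignore_file = Path(tmp) / ".gitignore"
    add_to_gitignore([".Rproj.user", ".Rhistory", ".RData"], ignore_file)
    added = add_to_gitignore([".RData", "hw"], ignore_file)
    print(added)  # only the entries that were not already present
```

Files that git already tracks stay tracked after they are added to .gitignore, so sensitive files should be listed before the first commit.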

Real-life scenarios less radically open than mine

  • Use git locally, but don’t push to GitHub
  • Use RStudio projects, but don’t use git or GitHub
  • Use a private repository (limit of 3 collaborators/repo)

Version control for RStudio Projects

  • packrat
  • Trying it again for this course. Progress report soon.

Next time

  • Simulation as a tool for reproducible and transparent science
  • Visualization tools in R

Resources

Software

This talk was produced on 2020-02-27 in RStudio using R Markdown. The code and materials used to generate the slides may be found at https://github.com/psu-psychology/psy-525-reproducible-research-2020. Information about the R Session that produced the code is as follows:

## R version 3.6.2 (2019-12-12)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets 
## [6] methods   base     
## 
## other attached packages:
##  [1] DiagrammeR_1.0.5 forcats_0.4.0    stringr_1.4.0   
##  [4] dplyr_0.8.3      purrr_0.3.3      readr_1.3.1     
##  [7] tidyr_1.0.0      tibble_2.1.3     ggplot2_3.2.1   
## [10] tidyverse_1.3.0  seriation_1.2-8 
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1      httr_1.4.1        
##  [3] tufte_0.5          jsonlite_1.6      
##  [5] viridisLite_0.3.0  foreach_1.4.8     
##  [7] modelr_0.1.5       gtools_3.8.1      
##  [9] assertthat_0.2.1   highr_0.8         
## [11] cellranger_1.1.0   yaml_2.2.0        
## [13] pillar_1.4.3       backports_1.1.5   
## [15] lattice_0.20-38    glue_1.3.1        
## [17] digest_0.6.23      RColorBrewer_1.1-2
## [19] rvest_0.3.5        colorspace_1.4-1  
## [21] htmltools_0.4.0    pkgconfig_2.0.3   
## [23] broom_0.5.3        haven_2.2.0       
## [25] scales_1.1.0       gdata_2.18.0      
## [27] farver_2.0.3       generics_0.0.2    
## [29] withr_2.1.2        lazyeval_0.2.2    
## [31] cli_2.0.1          magrittr_1.5      
## [33] crayon_1.3.4       readxl_1.3.1      
## [35] evaluate_0.14      fs_1.3.1          
## [37] fansi_0.4.1        nlme_3.1-142      
## [39] MASS_7.3-51.5      gplots_3.0.1.2    
## [41] xml2_1.2.2         tools_3.6.2       
## [43] registry_0.5-1     hms_0.5.3         
## [45] lifecycle_0.1.0    munsell_0.5.0     
## [47] reprex_0.3.0       cluster_2.1.0     
## [49] packrat_0.5.0      compiler_3.6.2    
## [51] caTools_1.18.0     rlang_0.4.4       
## [53] grid_3.6.2         iterators_1.0.12  
## [55] rstudioapi_0.10    visNetwork_2.0.9  
## [57] htmlwidgets_1.5.1  labeling_0.3      
## [59] bitops_1.0-6       rmarkdown_2.1     
## [61] gtable_0.3.0       codetools_0.2-16  
## [63] DBI_1.1.0          TSP_1.1-8         
## [65] R6_2.4.1           gridExtra_2.3     
## [67] lubridate_1.7.4    knitr_1.27        
## [69] pwr_1.2-2          utf8_1.1.4        
## [71] KernSmooth_2.23-16 dendextend_1.13.3 
## [73] stringi_1.4.5      Rcpp_1.0.3        
## [75] vctrs_0.2.2        gclus_1.3.2       
## [77] dbplyr_1.4.2       tidyselect_1.0.0  
## [79] xfun_0.12
