Variable definitions that control algorithmic decisions below. These are placed here to draw our attention to them – if we wish to change our decisions later, we can adjust them in one place rather than having to hunt them down throughout the pipeline.
#parameters that control our algorithmic decisions about what constitutes bad head motion
#we put these here to draw our attention to them
fd_spike_mm <- 0.3 #movement greater than this threshold (in mm) is considered a movement spike
p_spikes_cutoff <- 0.2 #if there are more than this proportion of spikes (based on fd_spike_mm), exclude the subject
fd_max_mm <- 10 #any movement greater than this amount should lead the subject to be excluded
One of the most important skills you learn along the data science path is not specific to R, but is crucial to solving data problems. People have called this different things, such as “thinking algorithmically,” “thinking like an engineer,” or “solving problems like a computer programmer.” If you’re not an expert on algorithms and are neither a computer scientist nor an engineer, you might ask, “what does this really mean?” The answer is hard to convey conceptually; instead, it depends heavily on the experiences you accrue as you solve data problems.
My goal here is to convey some of the lessons I’ve learned over the past 10 years and give you an approximate framework for solving data management and analysis problems that extend beyond the basics of the tidyverse.
I know most of you aren’t in the neuroimaging world, and my goal isn’t to confuse you with all of the details (as people like to say, “it’s complicated”). The example is not intended as a pitch to the imaging crowd, but rather to demonstrate a real-world data problem that has parallels to many other data types (behavior during experiments, self-reports distributed across files and folders, etc.).
With regard to fMRI, I’ll simply convey that head movement during a scan is a huge bugaboo because it fundamentally corrupts the signal. To wit:
There are different ways to mitigate the problem, but suffice it to say that we need to a) know how bad it is, b) detect and potentially remove motion ‘spikes’ from analyses, and c) consider whether to discard someone altogether if the motion is bad enough. We quantify head movement in fMRI by looking at how much the brain moves from image to image (over time). A common metric is called ‘framewise displacement’ (FD), which is an approximation of how much the head moved in any way/direction, measured in millimeters. So that’s what we’ll be working on…
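If you’re curious how FD is computed: a common definition (following Power and colleagues) sums the absolute frame-to-frame changes in the six realignment parameters, converting rotations to millimeters of arc length on an assumed head radius. A minimal sketch, assuming rp is a T x 6 matrix with translations (mm) in columns 1–3 and rotations (radians) in columns 4–6:

#minimal sketch of a common FD calculation; rp is a T x 6 realignment parameter matrix
compute_fd <- function(rp, head_radius=50) {
  rp[, 4:6] <- rp[, 4:6] * head_radius #convert rotations (radians) to mm of arc length
  deltas <- abs(diff(rp)) #absolute frame-to-frame change in each parameter
  c(0, rowSums(deltas)) #sum across parameters; the first volume has no predecessor, so FD = 0
}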
We’ll focus here on data problems that have multiple, often repetitive, steps. This problem will also involve tracking down and reading files from many different directories, where the directories are organized by subject (plus additional subdirectories).
The FD metric is written to a .txt file in every subject’s processed data directory. It lives in the motion_info folder.
It looks like this:
MMClock/
└── MR_Proc
├── 10637_20140304
│ └── mni_aroma_minimal_fsl
│ └── rest1
│ └── motion_info
│ └── fd.txt
├── 10638_20140507
│ └── mni_aroma_minimal_fsl
│ └── rest1
│ └── motion_info
│ └── fd.txt
├── 10662_20140415
│ └── mni_aroma_minimal_fsl
│ └── rest1
│ └── motion_info
│ └── fd.txt
This is a very structured, though somewhat complicated, format. So long as your folder and file structure is systematically organized (e.g., consistent use of upper- and lowercase letters), this sort of task is easily handled through programming.
We want all of these fd.txt files to be read into R so that we can diagnose FD problems for each subject. This means we need to loop over these in some way, read each of them, and compute motion metrics for each subject.
Our motion metrics, for each subject: the mean and maximum FD, the proportion of volumes that are movement spikes (FD at or above fd_spike_mm), and a logical flag for whether the subject should be excluded.
Let’s look at one FD file:
#location of an example file to play with
single_file <- file.path(working_dir, "mot_example", "MMClock", "MR_Proc", "10637_20140304",
"mni_aroma_minimal_fsl", "rest1", "motion_info", "fd.txt")
fd1 <- read.table(single_file)
str(fd1)
## 'data.frame': 300 obs. of 1 variable:
## $ V1: num 0 0.1396 0.0919 0.1559 0.1331 ...
head(fd1)
V1 |
---|
0.000 |
0.140 |
0.092 |
0.156 |
0.133 |
0.161 |
Here is a quick visual check on what the file contains (head movement over time).
fd1$time <- 1:nrow(fd1)
ggplot(fd1, aes(x=time, y=V1)) + geom_line()
We really just need $V1
for each dataset, since it’s a single vector (one estimate per timepoint). Going forward, we probably just need the 300 values as a vector – it doesn’t even merit a data.frame at this stage.
fdvec <- fd1$V1
How do we find all of these files? There are many ways, but list.files()
is pretty handy:
fd_files <- list.files(path=file.path(working_dir, "mot_example"), pattern = "fd.txt", recursive = TRUE)
head(fd_files)
## [1] "MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [2] "MMClock/MR_Proc/10638_20140507/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [3] "MMClock/MR_Proc/10662_20140415/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [4] "MMClock/MR_Proc/10711_20140826/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [5] "MMClock/MR_Proc/10717_20140813/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [6] "MMClock/MR_Proc/10767_20140814/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
length(fd_files)
## [1] 130
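One caution about list.files(): the pattern argument is a regular expression, not an exact filename, so "fd.txt" would also match a name like "fd_txt" (the unescaped dot matches any character). If you want an exact match, anchor and escape the pattern:

#stricter version: ^ and $ anchor the match, \\. matches a literal dot
fd_files <- list.files(path=file.path(working_dir, "mot_example"), pattern = "^fd\\.txt$", recursive = TRUE)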
As noted above, we want to get some summary statistics about FD for each subject such as the mean and max.
In terms of the success of this project, we also want to flag anyone who should be excluded from analysis altogether (data beyond repair). This is a conceptual choice in data analysis – how to handle outliers, including data removal or downweighting.
For my purposes, I consider a subject irreparable if: 1) more than 20% of volumes are movement spikes (FD >= fd_spike_mm, i.e., 0.3mm; see p_spikes_cutoff above), or 2) any single movement exceeds 10mm (fd_max_mm).
Either of these could render my connectivity estimates corrupt and untrustworthy – hence the exclusion.
The algorithm would be successful if we 1) got the summary statistics of interest, and 2) removed or flagged subjects who are irreparable for either reason above.
It’s useful to calibrate your effort on a coding challenge in proportion to its conceptual or methodological importance.
If you’ll never need to do this again, a ‘one-off’ solution may be worthwhile, and you might try to keep your effort on the order of minutes, if possible. If you anticipate needing to do this same operation regularly throughout your career, it may be time to use functions, comment your code extensively, validate it carefully, and so on.
In the ideal case, functions should be ironclad (resistant to errors) and well-validated. This is why R packages are great – their creators are usually statisticians, methodologists, and/or people with formal training in computer science, and they have usually written their functions to yield reliable, well-validated output. Thus, when possible, use existing R functions rather than writing custom code. There is a handy package called “sos” that can help you find functions relevant to your problem. The example below can be pasted into your R console if you want to search for functions having to do with ARIMA models:
p_load(sos) #p_load() from the pacman package installs sos if needed, then loads it
findFn("arima")
Writing your own functions can be fun, but it can also take considerable time as you’re learning! Likewise, functions that have many steps take longer to write. When possible, write compact functions that perform very specific actions, rather than long functions with many steps. The larger a function becomes, the greater the chance that it will need to be adapted for a similar, but not identical, project.
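For instance, the current problem could be decomposed into small, single-purpose helpers (hypothetical names) rather than one monolithic function:

#hypothetical decomposition into compact, single-purpose functions:
# read_fd(fd_file) -> numeric vector of FD values
# summarize_fd(fd, fd_spike) -> one-row data.frame of motion statistics
# flag_exclusion(stats, p_spikes_cutoff, fd_max_mm) -> logical exclusion decision
#each piece can be tested, reused, and adapted independently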
If this problem is absolutely crucial – like you might need to retract a paper if it’s wrong – slow down and make sure you check the results in detail for one case/example, as well as the global outputs (e.g., the overall distribution).
Here, accidentally keeping high-motion subjects could easily be seen as a critical flaw in review, or later, after publication (slight risk of retraction…), so I want to be careful! Likewise, head motion is a general problem in fMRI data, and since I have many fMRI projects, it would be nice to have friendly code that works well. This will make implementing best practices simple on future projects.
So, how do we implement the proposed head motion statistics? Let’s start with one person… We already read in the data above; let’s look at it a bit.
nrow(fd1)
## [1] 300
ggplot(fd1, aes(x=V1)) + geom_histogram(bins=12)
(mfd <- mean(fdvec))
## [1] 0.467
(maxfd <- max(fdvec))
## [1] 5.42
(nspikes <- sum(fdvec >= fd_spike_mm))
## [1] 117
(pspikes <- nspikes/length(fdvec))
## [1] 0.39
(bad_subj <- pspikes >= p_spikes_cutoff || maxfd >= fd_max_mm)
## [1] TRUE
This subject is bad on the basis of too many small head movements – 39% are above the threshold I set.
How do we put this together in a single result? A list or data.frame is the easiest.
(metrics <- data.frame(mfd=mfd, maxfd=maxfd, pspikes=pspikes, bad_subj=bad_subj))
mfd | maxfd | pspikes | bad_subj |
---|---|---|---|
0.467 | 5.42 | 0.39 | TRUE |
Take a look at the data again:
ggplot(fd1, aes(x=time, y=V1, color=V1 >= fd_spike_mm)) + geom_line(aes(color=NULL), color="black") + geom_point()
ggplot(fd1, aes(x=1, fill=V1 >= fd_spike_mm)) + geom_bar()
These are quick, but effective, ways to sanity-check the output of the prototype.
Here’s the ‘bad’ way to do this even though it is in the right spirit.
all_fd <- data.frame()
for (f in fd_files) {
  thisperson <- read.table(file.path(working_dir, "mot_example", f)) #read this subject's FD values
  all_fd <- rbind(all_fd, thisperson) #stack them onto everyone else's (note: no subject identifier!)
}
head(all_fd)
V1 |
---|
0.000 |
0.140 |
0.092 |
0.156 |
0.133 |
0.161 |
str(all_fd)
## 'data.frame': 39099 obs. of 1 variable:
## $ V1: num 0 0.1396 0.0919 0.1559 0.1331 ...
Problems include: 1) all subjects’ FD values are stacked into one 39099-row data.frame with no record of which rows belong to which subject, and 2) we never compute the per-subject summary statistics we actually need.
Perhaps this approach is a bit better? We could start with an empty data.frame
and build up statistics one row at a time.
all_fd <- data.frame()
for (f in fd_files) {
  thisperson <- read.table(file.path(working_dir, "mot_example", f))$V1
  #compute a one-row data.frame for this subject
  mmetric <- data.frame(file=f, mfd=mean(thisperson), maxfd=max(thisperson),
                        pspikes=sum(thisperson > fd_spike_mm)/length(thisperson),
                        bad_subj=max(thisperson) > fd_max_mm ||
                          sum(thisperson > fd_spike_mm)/length(thisperson) > p_spikes_cutoff)
  #add this subject's statistics to the overall data.frame
  all_fd <- rbind(all_fd, mmetric)
}
head(all_fd)
file | mfd | maxfd | pspikes | bad_subj |
---|---|---|---|---|
MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.467 | 5.416 | 0.390 | TRUE |
MMClock/MR_Proc/10638_20140507/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.148 | 1.579 | 0.063 | FALSE |
MMClock/MR_Proc/10662_20140415/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.153 | 0.455 | 0.053 | FALSE |
MMClock/MR_Proc/10711_20140826/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.584 | 4.684 | 0.460 | TRUE |
MMClock/MR_Proc/10717_20140813/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.172 | 2.252 | 0.073 | FALSE |
MMClock/MR_Proc/10767_20140814/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.123 | 1.493 | 0.033 | FALSE |
str(all_fd)
## 'data.frame': 130 obs. of 5 variables:
## $ file : Factor w/ 130 levels "MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mfd : num 0.467 0.148 0.153 0.584 0.172 ...
## $ maxfd : num 5.416 1.579 0.455 4.684 2.252 ...
## $ pspikes : num 0.39 0.0633 0.0533 0.46 0.0733 ...
## $ bad_subj: logi TRUE FALSE FALSE TRUE FALSE FALSE ...
There are a few problems here.
First, building up a large data.frame by repetitively rbinding means that there is a lot of processing time spent on memory management – the all_fd
object has to be reallocated on every iteration. We could handle this problem more elegantly using an lapply
+ dplyr::bind_rows
approach. Like so:
all_fd <- dplyr::bind_rows(lapply(fd_files, function(f) {
  thisperson <- read.table(file.path(working_dir, "mot_example", f))$V1
  #compute a one-row data.frame for this subject
  mmetric <- data.frame(file=f, mfd=mean(thisperson), maxfd=max(thisperson),
                        pspikes=sum(thisperson > fd_spike_mm)/length(thisperson),
                        bad_subj=max(thisperson) > fd_max_mm ||
                          sum(thisperson > fd_spike_mm)/length(thisperson) > p_spikes_cutoff,
                        stringsAsFactors=FALSE) #store file as character string to quiet bind_rows
}))
head(all_fd)
file | mfd | maxfd | pspikes | bad_subj |
---|---|---|---|---|
MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.467 | 5.416 | 0.390 | TRUE |
MMClock/MR_Proc/10638_20140507/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.148 | 1.579 | 0.063 | FALSE |
MMClock/MR_Proc/10662_20140415/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.153 | 0.455 | 0.053 | FALSE |
MMClock/MR_Proc/10711_20140826/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.584 | 4.684 | 0.460 | TRUE |
MMClock/MR_Proc/10717_20140813/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.172 | 2.252 | 0.073 | FALSE |
MMClock/MR_Proc/10767_20140814/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.123 | 1.493 | 0.033 | FALSE |
Note that bind_rows
from dplyr
does a row-wise concatenation of all elements in a list – here, the single-row data.frame objects for each subject.
Second, although we now have a record of subject identity because we’ve tacked on the filename to the resulting data.frame
, this is organized in terms of the file path, not by identifying variables such as subject ID or scan date. We could extract the ID more elegantly using regular expressions, though this has its own learning curve:
#tack on ID column based on file naming
all_fd$ID <- sub(".*/MR_Proc/([^/]+)/mni.*", "\\1", all_fd$file)
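#breaking down the pattern ".*/MR_Proc/([^/]+)/mni.*":
#  .*/MR_Proc/  matches everything up to and including "MR_Proc/"
#  ([^/]+)      captures the next folder name (one or more characters that are not "/")
#  /mni.*       matches the rest of the path, starting at "/mni"
#the "\\1" in the replacement refers to the captured folder name (the ID_date directory)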
head(all_fd)
file | mfd | maxfd | pspikes | bad_subj | ID |
---|---|---|---|---|---|
MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.467 | 5.416 | 0.390 | TRUE | 10637_20140304 |
MMClock/MR_Proc/10638_20140507/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.148 | 1.579 | 0.063 | FALSE | 10638_20140507 |
MMClock/MR_Proc/10662_20140415/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.153 | 0.455 | 0.053 | FALSE | 10662_20140415 |
MMClock/MR_Proc/10711_20140826/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.584 | 4.684 | 0.460 | TRUE | 10711_20140826 |
MMClock/MR_Proc/10717_20140813/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.172 | 2.252 | 0.073 | FALSE | 10717_20140813 |
MMClock/MR_Proc/10767_20140814/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt | 0.123 | 1.493 | 0.033 | FALSE | 10767_20140814 |
#rearrange columns in a friendly order. Put ID first, then file, then all the rest
all_fd <- all_fd %>% select(ID, file, everything())
str(all_fd)
## 'data.frame': 130 obs. of 6 variables:
## $ ID : chr "10637_20140304" "10638_20140507" "10662_20140415" "10711_20140826" ...
## $ file : chr "MMClock/MR_Proc/10637_20140304/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt" "MMClock/MR_Proc/10638_20140507/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt" "MMClock/MR_Proc/10662_20140415/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt" "MMClock/MR_Proc/10711_20140826/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt" ...
## $ mfd : num 0.467 0.148 0.153 0.584 0.172 ...
## $ maxfd : num 5.416 1.579 0.455 4.684 2.252 ...
## $ pspikes : num 0.39 0.0633 0.0533 0.46 0.0733 ...
## $ bad_subj: logi TRUE FALSE FALSE TRUE FALSE FALSE ...
Third, this approach assumes that all FD files found on disk are relevant to the task at hand. However, keep step 9 in mind: validate the global outputs. If we are interested in checking the motion statistics on the whole sample, and also want to know who should be included or excluded from analysis, reading every file we find may mislead us. For example, if we wish to report group statistics such as mean FD in the paper, we need to ensure that the files we read correspond exactly to the set of subjects to be analyzed.
That is, one problem with looping over files is that we may catch subjects we don’t want (for other exclusion criteria) or we may be missing a file for someone whose data we expect to be present. For this reason, it’s usually best in these kinds of batch operations to have a basic records file that keeps track of expected participants. Furthermore, this file should include information about any other exclusion criteria that should be accounted for at this step. For example, we may exclude someone after the fMRI scan if they fell asleep during the experimental task, or if their performance was remarkably poor.
Here, let’s read in a participant info file to guide us on whose fd.txt
files are relevant to our group analysis. Note the FMRI_Exclude
column. Furthermore, this is a study where the data were collected in two parallel substudies, which are organized into different top-level folders: MMClock
and SPECC
. So, we need to do a bit of dplyr-based data wrangling to get the expected structure set up.
groupdir <- file.path(working_dir, "mot_example")
mr_subdir <- "mni_aroma_minimal_fsl/rest1" #this is where we expect the processed fMRI data for each subject
expect_mr_file <- "rnawuktm_rest1.nii.gz" #this is the expected file name of the fMRI data
#read in and process data
specc_info <- read.csv(file.path("..", "data", "SPECC_Participant_Info.csv"), stringsAsFactors=FALSE) %>%
  filter(HasRest==1 & FMRI_Exclude==0) %>%
  mutate(ScanDate = as.Date(ScanDate, format="%m/%d/%y"), Luna_ID=as.character(Luna_ID)) %>%
  #If LunaMRI is 1, the data are in the MMClock substudy:
  #  reformat ScanDate (already a Date from the mutate above) as YYYYMMDD to match the folder naming of MMClock.
  #If LunaMRI is 0, use the SPECC folder and reformat the date as DDMONYR to match its folder naming.
  mutate(mr_dir=if_else(LunaMRI==1,
           paste0(groupdir, "/MMClock/MR_Proc/", Luna_ID, "_", format(ScanDate, "%Y%m%d")),
           paste0(groupdir, "/SPECC/MR_Proc/", tolower(SPECC_ID), "_", tolower(format(ScanDate, "%d%b%Y")))),
         mr_file=file.path(mr_dir, mr_subdir, expect_mr_file),
         fd_file=file.path(mr_dir, mr_subdir, "motion_info", "fd.txt"),
         mr_exists=file.exists(mr_file),
         file_id=if_else(LunaMRI==1, Luna_ID, SPECC_ID))
str(specc_info)
## 'data.frame': 90 obs. of 17 variables:
## $ NUM_ID : int 1 2 3 5 8 10 12 13 15 16 ...
## $ SPECC_ID : chr "001RA" "002HS" "003BU" "005AI" ...
## $ Luna_ID : chr "11131" "10997" "10895" "10644" ...
## $ BPD : int 0 0 0 0 1 0 0 1 1 0 ...
## $ ScanDate : Date, format: "2013-12-07" "2014-03-08" ...
## $ AgeAtScan : num 20.5 16.2 21.6 15.2 22.8 ...
## $ Female : int 0 1 1 1 1 1 1 1 1 0 ...
## $ HasRest : int 1 1 1 1 1 1 1 1 1 1 ...
## $ LunaMRI : int 0 1 1 0 0 1 1 0 0 1 ...
## $ HasClock : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FMRI_Exclude : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Wrong_fmap_dims: int 1 1 1 1 1 1 1 0 0 0 ...
## $ mr_dir : chr "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/001ra_07dec2013" "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10997_20140308" "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10895_20131204" "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/005ai_06nov2013" ...
## $ mr_file : chr "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/001ra_07dec2013/mni_arom"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10997_20140308/mni_aro"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10895_20131204/mni_aro"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/005ai_06nov2013/mni_arom"| __truncated__ ...
## $ fd_file : chr "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/001ra_07dec2013/mni_arom"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10997_20140308/mni_aro"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/10895_20131204/mni_aro"| __truncated__ "/var/folders/pw/z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/SPECC/MR_Proc/005ai_06nov2013/mni_arom"| __truncated__ ...
## $ mr_exists : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ file_id : chr "001RA" "10997" "10895" "005AI" ...
Here are the first 10 expected FD files based on the subject information file:
#just using sub here to trim off the working_dir from fd file paths to make it easier to see in the output
head(sub(working_dir, "", specc_info$fd_file, fixed=TRUE), n=10)
## [1] "/mot_example/SPECC/MR_Proc/001ra_07dec2013/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [2] "/mot_example/MMClock/MR_Proc/10997_20140308/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [3] "/mot_example/MMClock/MR_Proc/10895_20131204/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [4] "/mot_example/SPECC/MR_Proc/005ai_06nov2013/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [5] "/mot_example/SPECC/MR_Proc/008jh_13jan2014/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [6] "/mot_example/MMClock/MR_Proc/11250_20140228/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [7] "/mot_example/MMClock/MR_Proc/11252_20140213/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [8] "/mot_example/SPECC/MR_Proc/013jk_30apr2014/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [9] "/mot_example/SPECC/MR_Proc/015cw_03may2014/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
## [10] "/mot_example/MMClock/MR_Proc/11277_20140410/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt"
Note that I trimmed off the first part of the path to make it easier to see on the screen.
Now that we have the FD files we expect to read from relevant subjects, it would be useful to write a short function to compute relevant FD statistics. This essentially does what we were doing in the loop above, but gives us a reusable piece of code that takes a file and the relevant decision thresholds (e.g., how much movement is ‘too much’) and returns predictable outputs.
Importantly, as we will see in step 8 (self-healing code), functions also give us much finer control over how to handle unexpected or missing inputs (e.g., FD files that don’t match the typical format).
Let’s write a very simple function, then test it on our example file. This essentially goes back to step 6 for a moment (develop a working prototype), but with the goal of having a portable function that will help with scaling.
#simplest worker function to get statistics for one file
#note that the default arguments here may not match the thresholds in the overall pipeline
fd_stats <- function(fd_file, fd_spike=0.5, max_prop_spikes=.1, max_spike=5) {
  fd <- read.table(fd_file)$V1 #vector of FD values
  n_spikes <- sum(fd >= fd_spike) #number of spikes above the threshold
  p_spikes <- n_spikes/length(fd) #spikes as a proportion of total volumes
  bad_fd <- p_spikes > max_prop_spikes || any(fd > max_spike) #decision about subject exclusion
  ret <- data.frame(mean_fd=mean(fd), max_fd=max(fd), nvol=length(fd),
                    prop_spikes=p_spikes, bad_fd=bad_fd) #note that map_dfr from purrr (below) would accept a list, too
  return(ret)
}
Okay, here’s the output of our function for the test file:
#test this on a single case
fd_stats(single_file)
mean_fd | max_fd | nvol | prop_spikes | bad_fd |
---|---|---|---|---|
0.467 | 5.42 | 300 | 0.23 | TRUE |
Ack! Why doesn’t this match our hand calculations above for spike proportion? Note that in the function call above, we only provided single_file
as the argument to fd_stats
. But the default arguments for the function are:
fd_stats <- function(fd_file, fd_spike=0.5, max_prop_spikes=.1, max_spike=5) {
Thus, the default is to treat FD >= 0.5mm as a spike, whereas we chose 0.3. There are similar mismatches between the defaults for max FD and proportion of spikes. This highlights how important it is to know the default arguments, if any, of a function. And if the user should always specify every input explicitly, it may be better not to use default arguments at all.
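For example, one hypothetical way to force explicit choices is to omit the defaults and check for missing arguments up front, so that a forgotten threshold fails fast with a clear message rather than silently using a default:

#hypothetical strict variant: no defaults, and missing thresholds are an immediate error
fd_stats_strict <- function(fd_file, fd_spike, max_prop_spikes, max_spike) {
  if (missing(fd_spike) || missing(max_prop_spikes) || missing(max_spike)) {
    stop("Specify fd_spike, max_prop_spikes, and max_spike explicitly")
  }
  fd_stats(fd_file, fd_spike, max_prop_spikes, max_spike) #delegate to the function above
}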
Here’s how we correct this small oversight. We pass forward the thresholds set at the top of this document so that the decisions/calculations match our choices above.
fd_stats(single_file, fd_spike = fd_spike_mm, max_prop_spikes = p_spikes_cutoff, max_spike=fd_max_mm)
mean_fd | max_fd | nvol | prop_spikes | bad_fd |
---|---|---|---|---|
0.467 | 5.42 | 300 | 0.39 | TRUE |
Finally, we can use the map
functions from the purrr
package to scale up this sort of calculation rather easily. In particular, look at the map_dfr
function, which applies a function to each element of a list or vector and then binds the resulting data.frames (that’s the df
in the name) together row-wise (that’s the r
in the name) – here, one row per subject.
Remember from the dplyr
walkthrough that .
in a dplyr
pipeline refers to the current dataset.
Let’s add motion information as additional columns to our participant info data.frame using the purrr::map_dfr
approach:
specc_info <- specc_info %>%
bind_cols(map_dfr(.$fd_file, fd_stats, fd_spike=fd_spike_mm, max_prop_spikes=p_spikes_cutoff, max_spike=fd_max_mm))
#If you wanted the group fd data.frame alone
#vv <- map_dfr(specc_info$fd_file, fd_stats, fd_spike=0.5, max_prop_spikes=.20, max_spike=fd_max_mm)
#just print the motion-relevant parts
head(specc_info %>% select(NUM_ID, mean_fd, max_fd, nvol, prop_spikes, bad_fd))
NUM_ID | mean_fd | max_fd | nvol | prop_spikes | bad_fd |
---|---|---|---|---|---|
1 | 0.192 | 0.535 | 300 | 0.187 | FALSE |
2 | 0.085 | 0.266 | 300 | 0.000 | FALSE |
3 | 0.127 | 0.284 | 350 | 0.000 | FALSE |
5 | 0.133 | 0.378 | 300 | 0.007 | FALSE |
8 | 0.166 | 1.063 | 350 | 0.069 | FALSE |
10 | 0.218 | 5.772 | 300 | 0.087 | FALSE |
This is great, but what if the function encounters problems with the file that it cannot handle effectively? For example, what if the file does not exist? Or what if it has many columns instead of one? Or what if we expect there to be 300 volumes, but only 200 are found?
All of these failure conditions may require different treatment. But minimally, we (the user) want to know about them. If you write a function, start by validating the inputs. If one argument is a file, does it exist? Is an argument missing for which no sensible default exists? Check out warning
and message
to provide feedback to the user of your function (which is probably you!) via messages that do not stop the execution of the code (i.e., they won’t crash it). Check out stopifnot
and stop
to throw errors if the function shouldn’t proceed at all until the user fixes the problem.
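Here is a toy sketch (a hypothetical check_file function) showing how these mechanisms differ in severity:

#toy illustration of graded feedback in a function
check_file <- function(f) {
  stopifnot(is.character(f), length(f) == 1) #fatal: the argument isn't even a single string
  if (!file.exists(f)) {
    warning("File not found: ", f) #non-fatal: execution continues, but the user is alerted
  } else {
    message("Reading: ", f) #purely informational
  }
}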
Let’s examine the case of a file that is expected to be present, but is not. I’ve messed up the file structure a bit here, and we’ll just try to re-run our code above to get the motion stats.
vv <- map_dfr(specc_info$fd_file, fd_stats, fd_spike=fd_spike_mm, max_prop_spikes=p_spikes_cutoff, max_spike=fd_max_mm)
## Warning in file(file, "rt"): cannot open file '/var/folders/pw/
## z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/
## 11366_20150425/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt': No such
## file or directory
## Error in file(file, "rt"): cannot open the connection
In this case, R gives us a reasonably sane, somewhat intuitive output. In other cases, you may get very odd errors that don’t make much sense at first glance. Regardless, validating the inputs to your function can cut down on unexpected, strange errors.
Part of making code self-healing is to ask yourself: 1) what could go wrong with the inputs or intermediate steps?, 2) how should the code tell the user about a problem?, and 3) should a failure in one step halt the entire pipeline?
The third question is not a leading one. Rather, there are occasions when an entire pipeline should fail if one input is wrong because proceeding would invalidate other steps in the pipeline or give misleading results. In other cases, the failure of one step or iteration in a pipeline should not halt the broader process.
Here, if an FD file does not exist for a subject, I would lean toward telling the user about it, but not halting the broader pipeline. To make that work, the function should: 1) check whether the file exists before trying to read it, 2) issue a warning (not an error) when it does not, and 3) still return an output with the same structure as in the successful case.
With respect to the third item, one principle of functional programming is that the output structure should always be the same, so that the user of a function knows what to expect. If we are supposed to return a one-row data.frame with motion metrics, we should still do so when handling problems with warnings – that is, non-fatal failure conditions. If a function does not behave predictably in its outputs (e.g., returning NULL instead of a data.frame), it will be prone to failures in a broader pipeline.
Here is the amendment to our function to handle a missing FD file:
#worker function to get statistics for one file
fd_stats <- function(fd_file, fd_spike=0.5, max_prop_spikes=.1, max_spike=5) {
  if (!file.exists(fd_file)) { #this code is new
    warning("Could not read FD file: ", fd_file, ". Returning NAs.")
    return(data.frame(mean_fd=NA, max_fd=NA, nvol=NA, prop_spikes=NA, bad_fd=NA))
  }
  #this code is the same
  fd <- read.table(fd_file)$V1 #vector of FDs
  n_spikes <- sum(fd >= fd_spike)
  p_spikes <- n_spikes/length(fd)
  bad_fd <- p_spikes > max_prop_spikes || any(fd > max_spike)
  ret <- data.frame(mean_fd=mean(fd), max_fd=max(fd), nvol=length(fd),
                    prop_spikes=p_spikes, bad_fd=bad_fd)
  return(ret)
}
And here is how it handles the case of the missing file in the context of the broader pipeline:
vv <- map_dfr(specc_info$fd_file, fd_stats, fd_spike=fd_spike_mm, max_prop_spikes=p_spikes_cutoff, max_spike=fd_max_mm)
## Warning in .f(.x[[i]], ...): Could not read FD file: /var/folders/pw/
## z4ywwn8x5cg4m8_bbbqmh7g0vlwlvl/T//RtmpwLsE55/mot_example/MMClock/MR_Proc/
## 11366_20150425/mni_aroma_minimal_fsl/rest1/motion_info/fd.txt. Returning
## NAs.
print(vv)
## mean_fd max_fd nvol prop_spikes bad_fd
## 1 0.1924 0.535 300 0.18667 FALSE
## 2 0.0848 0.266 300 0.00000 FALSE
## 3 0.1273 0.284 350 0.00000 FALSE
## 4 0.1331 0.378 300 0.00667 FALSE
## 5 0.1664 1.063 350 0.06857 FALSE
## 6 0.2179 5.772 300 0.08667 FALSE
## 7 0.0867 0.566 300 0.01000 FALSE
## 8 0.2505 1.748 300 0.26333 TRUE
## 9 0.1741 1.196 300 0.08667 FALSE
## 10 0.1799 0.669 300 0.15333 FALSE
## 11 0.0899 0.234 300 0.00000 FALSE
## 12 0.1364 0.721 300 0.04000 FALSE
## 13 0.2135 1.622 300 0.15333 FALSE
## 14 0.2187 2.160 300 0.14000 FALSE
## 15 0.0996 0.235 300 0.00000 FALSE
## 16 0.1294 0.446 300 0.02000 FALSE
## 17 0.1191 0.421 300 0.01333 FALSE
## 18 0.1699 2.590 300 0.12667 FALSE
## 19 0.1507 0.421 300 0.05000 FALSE
## 20 0.4055 9.938 300 0.14667 FALSE
## 21 0.2418 2.720 300 0.13000 FALSE
## 22 0.1995 1.375 300 0.14000 FALSE
## 23 0.1174 0.838 300 0.03333 FALSE
## 24 0.1488 0.379 300 0.00333 FALSE
## 25 0.1584 0.790 300 0.06333 FALSE
## 26 0.1527 0.832 300 0.04000 FALSE
## 27 0.2976 1.416 300 0.36333 TRUE
## 28 0.6360 13.915 300 0.45333 TRUE
## 29 0.1090 0.528 300 0.02000 FALSE
## 30 0.4719 6.919 300 0.34333 TRUE
## 31 0.1642 1.023 300 0.08667 FALSE
## 32 0.1029 0.778 300 0.01667 FALSE
## 33 0.1256 0.344 300 0.01667 FALSE
## 34 0.1710 2.364 300 0.08667 FALSE
## 35 0.1414 0.299 300 0.00000 FALSE
## 36 0.1018 0.309 300 0.00333 FALSE
## 37 0.1735 1.984 300 0.06000 FALSE
## 38 0.1524 2.227 300 0.04667 FALSE
## 39 0.5838 4.684 300 0.46000 TRUE
## 40 0.2280 0.662 300 0.25333 TRUE
## 41 0.1135 0.728 300 0.01333 FALSE
## 42 0.1472 0.352 300 0.01667 FALSE
## 43 0.3695 3.708 300 0.30333 TRUE
## 44 0.0830 0.196 300 0.00000 FALSE
## 45 0.1189 1.743 300 0.03000 FALSE
## 46 0.1965 0.779 300 0.14667 FALSE
## 47 0.1149 0.610 300 0.02000 FALSE
## 48 0.1016 0.694 300 0.08000 FALSE
## 49 0.4840 8.800 300 0.37000 TRUE
## 50 0.1966 0.601 300 0.17333 FALSE
## 51 0.1437 3.184 300 0.01667 FALSE
## 52 0.1502 1.659 300 0.06333 FALSE
## 53 0.1525 0.455 300 0.05333 FALSE
## 54 0.1651 0.747 300 0.10667 FALSE
## 55 0.2939 2.436 300 0.31000 TRUE
## 56 0.1113 0.337 300 0.00667 FALSE
## 57 0.1213 0.476 300 0.01000 FALSE
## 58 0.4562 3.814 300 0.48000 TRUE
## 59 0.3325 1.184 300 0.50333 TRUE
## 60 0.1679 2.180 300 0.08667 FALSE
## 61 0.2845 4.832 300 0.26000 TRUE
## 62 0.1039 0.295 300 0.00000 FALSE
## 63 0.2015 2.564 300 0.13333 FALSE
## 64 0.1641 1.412 300 0.15667 FALSE
## 65 0.3127 3.622 300 0.29333 TRUE
## 66 0.2677 1.663 300 0.40333 TRUE
## 67 0.0910 0.317 300 0.00333 FALSE
## 68 0.2199 1.613 300 0.22667 TRUE
## 69 0.2918 1.039 300 0.44333 TRUE
## 70 0.1595 0.546 300 0.04000 FALSE
## 71 0.2050 0.548 300 0.17667 FALSE
## 72 0.1584 0.708 300 0.10333 FALSE
## 73 0.1136 0.294 300 0.00000 FALSE
## 74 0.2304 1.232 300 0.22333 TRUE
## 75 0.0949 0.975 300 0.00667 FALSE
## 76 1.1024 19.032 300 0.71667 TRUE
## 77 0.1420 0.354 300 0.01000 FALSE
## 78 0.1040 0.485 300 0.02333 FALSE
## 79 0.0889 0.266 300 0.00000 FALSE
## 80 0.3496 2.738 300 0.36000 TRUE
## 81 0.1748 0.484 300 0.10000 FALSE
## 82 0.1298 0.969 300 0.05667 FALSE
## 83 0.1448 0.332 300 0.01333 FALSE
## 84 0.1013 0.193 300 0.00000 FALSE
## 85 0.2062 1.278 300 0.17000 FALSE
## 86 0.1510 0.527 300 0.04333 FALSE
## 87 NA NA NA NA NA
## 88 0.1549 1.970 300 0.02667 FALSE
## 89 0.1241 0.709 300 0.02000 FALSE
## 90 0.1371 0.520 300 0.02333 FALSE
Note the NAs on row 87. This is what we expect (and want) – a graceful failure, with NAs propagated for that subject.
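Because map_dfr preserves the order of its input, the rows of vv line up with the rows of specc_info, so a quick way to list the subjects whose files failed is:

#identify expected subjects whose FD file could not be read
specc_info$file_id[is.na(vv$mean_fd)]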
For more general error handling in R, particularly in functions, check out the tryCatch
function, which tries to evaluate a given R expression but allows you to catch and handle any arbitrary error. For example, let’s try to evaluate whether yy
is greater than 2. But yy
is not defined anywhere in this document – that is, it doesn’t exist. If you just try yy > 2
, you will get a fatal error. What if, in this circumstance (undefined variable), we instead want to print the error but return NA
?
#this will generate an error if uncommented because yy is not defined anywhere in this document
#is_yy_big <- yy > 2
is_yy_big <- tryCatch(yy > 2, error=function(err) { print(err); return(NA) })
## <simpleError in doTryCatch(return(expr), name, parentenv, handler): object 'yy' not found>
print(is_yy_big)
## [1] NA
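The same idea could harden fd_stats further. Here is a sketch (not run here) of wrapping the read step itself, so that any read failure – a malformed file, a permissions problem, not just a missing file – becomes a warning plus a predictable NA row:

#inside fd_stats: convert any read error into a warning and the usual NA row
fd <- tryCatch(read.table(fd_file)$V1,
  error=function(e) {
    warning("Problem reading FD file: ", fd_file, " (", conditionMessage(e), ")")
    NULL
  })
if (is.null(fd)) {
  return(data.frame(mean_fd=NA, max_fd=NA, nvol=NA, prop_spikes=NA, bad_fd=NA))
}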
Now that we’re done with the section on handling failures gracefully, I’m going to put the missing fd.txt file back into place for the final step of global metrics.
Finally, now that our pipeline works, we should step out to the overall sample/global level. From step 3 above (defining success), we said that our pipeline should: 1) compute the summary statistics of interest (e.g., mean and max FD) for every expected subject, and 2) remove or flag subjects who are irreparable on either exclusion criterion.
So, did we achieve this goal? Here is the mean FD for those included versus excluded.
specc_info %>% group_by(bad_fd) %>% dplyr::summarize(group_mean=mean(mean_fd))
bad_fd | group_mean |
---|---|
FALSE | 0.147 |
TRUE | 0.393 |
Here is the number of people excluded versus included based on FD:
table(specc_info$bad_fd)
##
## FALSE TRUE
## 71 19
Here is the histogram of mean FD, colored by inclusion:
specc_info %>% ggplot(aes(x=mean_fd, fill=bad_fd)) + geom_histogram(bins=8) +
ggtitle("Histogram of mean FD by exclusion/inclusion")
And the histogram for max FD:
specc_info %>% ggplot(aes(x=max_fd, fill=bad_fd)) + geom_histogram(bins=8) +
ggtitle("Histogram of max FD by exclusion/inclusion")
Finally, filter down to the good data:
specc_info <- specc_info %>% filter(!bad_fd)
This would yield a data.frame for further analysis.
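At this point, you might also cache the screened data.frame so that downstream analysis scripts don’t need to repeat the motion screening (the file name here is just an example):

#save the motion-screened sample for downstream scripts
saveRDS(specc_info, file=file.path(working_dir, "specc_info_motion_screened.rds"))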