Tricks and Tips

Say you have a dataframe with a bunch of columns that you don’t want, but it also has some that you do want. Cortical thicknesses, say.

##   rh_rostralmiddlefrontal_volume lh_rostralmiddlefrontal_thickness
## 1                            0.1                              0.01
## 2                            0.2                              0.40
## 3                            0.3                              0.02
## 4                            0.2                              0.10
##   rh_rostralmiddlefrontal_thickness lh_precuneus_volume
## 1                               0.4                0.04
## 2                               0.3                0.04
## 3                               0.2                0.05
## 4                               0.1                0.10

For a small dataframe, it’s easy to select them manually. But for a bigger dataframe, that’s no good. You can use grepl (R’s version of grep) to grab every column whose name contains the string of characters you’re looking for.

df<-df.1[ , grepl( "_thickness" , names( df.1 ) ) ]
df

##   lh_rostralmiddlefrontal_thickness rh_rostralmiddlefrontal_thickness
## 1                              0.01                               0.4
## 2                              0.40                               0.3
## 3                              0.02                               0.2
## 4                              0.10                               0.1
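One caveat: the pattern "_thickness" matches anywhere in the name, so a hypothetical column like lh_thickness_zscore would come along for the ride. Here is a minimal sketch, rebuilding the toy dataframe from above, that anchors the pattern to the end of the name with `$` and adds `drop = FALSE` so the result stays a dataframe even when only one column matches. The name `df.thick` is just for illustration.

```r
# Rebuild the toy dataframe shown above
df.1 <- data.frame(
  rh_rostralmiddlefrontal_volume    = c(0.1, 0.2, 0.3, 0.2),
  lh_rostralmiddlefrontal_thickness = c(0.01, 0.40, 0.02, 0.10),
  rh_rostralmiddlefrontal_thickness = c(0.4, 0.3, 0.2, 0.1),
  lh_precuneus_volume               = c(0.04, 0.04, 0.05, 0.10)
)

# "$" anchors the match to the END of the column name
keep <- grepl("_thickness$", names(df.1))

# drop = FALSE keeps a dataframe even if only one column matches
df.thick <- df.1[, keep, drop = FALSE]
names(df.thick)
```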

Or maybe you have to deal with “pid_scandate”. You just want to figure out what date the participants were scanned, but the participant ID and an underscore are stuck on the front of every date. You can use a regular expression (regex) to keep just the part of the string that you need. Regex is …unnecessarily complicated…but here is a great quick guide that can get you through most of what you want: https://stevencarlislewalker.wordpress.com/2013/02/13/remove-or-replace-everything-before-or-after-a-specified-character-in-r-strings/ and here is a much longer and more thorough treatment: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

Here’s an example of regex at work:

df<-data.frame(c("21066A_042017", "21066A_01092011", "24311A_12042006", "21112A_090312"))
colnames(df)<-"pid_scandate"
df

##      pid_scandate
## 1   21066A_042017
## 2 21066A_01092011
## 3 24311A_12042006
## 4   21112A_090312

df$scandate<-sub('.*_', '', df$pid_scandate)  #".*" means "any run of characters", so ".*_" matches everything up to and including the underscore
df

##      pid_scandate scandate
## 1   21066A_042017   042017
## 2 21066A_01092011 01092011
## 3 24311A_12042006 12042006
## 4   21112A_090312   090312
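To see what the pattern is doing: `.*` matches any run of characters, and because it is greedy, `.*_` swallows everything up to and including the last underscore. Flip it around to `_.*` and you match the underscore plus everything after it. A small sketch that pulls out both halves (the variable names `x`, `scandate`, and `pid` are just for illustration):

```r
x <- c("21066A_042017", "21066A_01092011", "24311A_12042006", "21112A_090312")

# Delete everything up to and including the underscore -> scan date
scandate <- sub(".*_", "", x)

# Delete the underscore and everything after it -> participant ID
pid <- sub("_.*", "", x)

data.frame(pid, scandate)
```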

So we got rid of the participant IDs preceding the date. But the problem is that we now have a character string that looks like a date to humans, but unfortunately does not look like a date to a computer.

str(df)

## 'data.frame':    4 obs. of  2 variables:
##  $ pid_scandate: Factor w/ 4 levels "21066A_01092011",..: 2 1 4 3
##  $ scandate    : chr  "042017" "01092011" "12042006" "090312"

I was doing this in real life, and I thought I could just use “as.Date” to convert the character strings to dates. But that didn’t work, because in my real data the leading zero had been dropped for every month other than Oct/Nov/Dec. So I had to write a function that adds the leading zero back for months Jan-Sept.

To write a function in R, you set it up like this:

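#pad to 6 characters with leading zeros by hand (sprintf("%06s") only zero-pads strings on some platforms), then read as mmddyy
convertDates<-function(x){as.Date(paste0(strrep("0", pmax(0, 6-nchar(x))), x), "%m%d%y")}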

What’s happening inside the function isn’t super important. To create a function, you come up with a name and put it on the left-hand side of the arrow. Then you type “function(all the stuff you’re going to pass in to the function)”, and between curly brackets you tell R what you want the function to do.
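Here’s the same anatomy on a smaller example. `padZeros` is a made-up helper, not something from a package: the name goes on the left, the arguments go inside `function(...)`, and the body goes between the curly brackets. An argument can also have a default value, like `width = 6` here.

```r
# name <- function(arguments) { body }
padZeros <- function(x, width = 6) {
  # how many zeros each string is missing (never negative)
  n.missing <- pmax(0, width - nchar(x))
  # the value of the last expression is what the function returns
  paste0(strrep("0", n.missing), x)
}

padZeros(c("42017", "090312"))   # uses the default width of 6
padZeros("7", width = 3)         # overrides the default
```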

So now I want to use this function on every value in my scandate column.

I DO NOT USE A LOOP

(OK, I still use loops all the time, but they are usually clunkier than apply/sapply/lapply, and a loop that grows an object one row at a time can be waaaay slower)

lapply is great when you want to apply a function to every item in a single column. The tricky thing is that lapply outputs a list, so you have to take a second step to get the result back into dataframe form. Both steps are below. lapply is the most straightforward of the apply functions: you just do lapply(column or list, function to apply to each item in the column or list).

scandate2<-lapply(df$scandate, convertDates)
df$scandate2<-do.call(c, scandate2)  #c() glues the list back into one vector and keeps the Date class (paste0 would turn the dates back into text)

head(df)

##      pid_scandate scandate  scandate2
## 1   21066A_042017   042017 2017-04-20
## 2 21066A_01092011 01092011 2020-01-09
## 3 24311A_12042006 12042006 2020-12-04
## 4   21112A_090312   090312 2012-09-03
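Look closely at rows 2 and 3 above: the strings end in 2011 and 2006, but both dates came out as 2020. `%y` only reads two digits for the year, so "01092011" is parsed as month 01, day 09, year 20, and the trailing "11" is silently ignored. One possible fix, assuming six-character strings are mmddyy and eight-character strings are mmddyyyy, is to pick the format based on the length of the string. `convertDates2` is a hypothetical name for this sketch. Since as.Date already works on a whole vector at once, this version doesn’t need lapply at all.

```r
convertDates2 <- function(x) {
  x <- as.character(x)
  out <- rep(as.Date(NA), length(x))   # start with all-missing dates
  six   <- nchar(x) == 6               # assumed to be mmddyy
  eight <- nchar(x) == 8               # assumed to be mmddyyyy
  out[six]   <- as.Date(x[six],   format = "%m%d%y")
  out[eight] <- as.Date(x[eight], format = "%m%d%Y")
  out                                  # any other length stays NA
}

scandate <- c("042017", "01092011", "12042006", "090312")
convertDates2(scandate)
# "2017-04-20" "2011-01-09" "2006-12-04" "2012-09-03"
```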

I also use lists when I want to create plots of every region individually. (Notice that here I am using a loop, which is bad R practice, but it works and I don’t really care that it takes an extra 10 seconds to run). I’m not going to run this, but here is an example of writing a function to create plots, then using that function to make a plot for each FC network, and then writing those plots to a single pdf document that I can email out. Notice at the end that I use lapply to print each element of my list.

library(ggplot2)

#one scatter plot with a fitted line for a single network
 plot_data_column <- function (TimeBetween, Network, WillConverts){
   ggplot(df.plots2, aes_string(x = TimeBetween, y = Network, shape = WillConverts, color = WillConverts)) + geom_point()+
   geom_smooth(method=lm, aes(fill = WillConverts)) + labs(x = "EYO") + theme_classic()+ geom_vline(xintercept=0)+
     theme(legend.position = "none")+ylim(-2, 2)+xlim(-10,10)}

plist<-list()  #the list has to exist before the loop can fill it
#make a plot for every column of df.plots2 except the last 5
for(i in 1:(length(df.plots2)-5)){
  plist[[i]]<-plot_data_column(df.plots2$TimeBetween, names(df.plots2)[i], df.plots2$WillConvert)
}


#write all of my scatter plots for each region to a pdf

pdf("C:/Users/wischj/Documents/PracticeData/Cathy/Images_lm.pdf")
invisible(lapply(plist, print))
dev.off()
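The same pattern works without the loop. Here is a minimal sketch that uses base graphics instead of ggplot2 so that it runs without extra packages. The data, the column names (`EYO`, `DMN`, `SAL`), and the temporary output file are all made up for illustration.

```r
set.seed(1)
# toy data: one time column plus two made-up network columns
toy <- data.frame(EYO = runif(20, -10, 10),
                  DMN = rnorm(20),
                  SAL = rnorm(20))
networks <- c("DMN", "SAL")

# one function call draws one plot
plot_network <- function(network) {
  plot(toy$EYO, toy[[network]], xlab = "EYO", ylab = network)
  abline(v = 0)
}

# while a pdf device is open, every new plot becomes its own page
out.file <- tempfile(fileext = ".pdf")
pdf(out.file)
invisible(lapply(networks, plot_network))
dev.off()
```

Because base plots draw straight onto the open device, there is no list of plot objects to keep track of; lapply just calls the plotting function once per network.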

Also, here’s a function I wrote that I’ve gotten a ton of use out of already. It may be useful for you guys, too?

It combines two dataframes by nearest date and ID. So say you have BOLD and neuropsych data for the same cohort of people, and you want to match each scan to its nearest test date, you can use this function. You’d call it with MatchbyNearestDate(df.BOLD, df.NP, "ID", "name of the scan date column", "name of the test date column") and it would return a big dataframe with one row per scan, matched to the closest test date.

(Again I used loops even though they are really bad…I’m trying to transition to that apply lifestyle, I’m just not totally there yet.)

#datedf1 and datedf2 are the NAMES of the date columns in dataframe1 and dataframe2 (both columns should be class Date)
MatchbyNearestDate<-function(dataframe1, dataframe2, IDcolumnName, datedf1, datedf2){
  df.combined<-merge(dataframe1, dataframe2, by = IDcolumnName)  #pairs every date in dataframe1 with every date in dataframe2, within ID
  unique.id<-unique(df.combined[,IDcolumnName])
  k<-1
  df.result<-df.combined[FALSE,]  #empty dataframe with the same columns
  for(i in seq_along(unique.id)){  #seq_along is safe even if there are no IDs
    df.working<-subset(df.combined, df.combined[,IDcolumnName] == unique.id[i])
    scandates<-unique(df.working[,datedf1])
    for(j in seq_along(scandates)){
      df.working2<-subset(df.working, df.working[,datedf1] == scandates[j])
      ind.val<-which.min(abs(as.numeric(df.working2[,datedf1]-df.working2[,datedf2])))  #row with the smallest gap in days
      df.result[k,]<-df.working2[ind.val,]
      k<-k+1
    }
  }
  return(df.result)
}

##Example of using the function:
#note that the two dataframes have to have the same column name that you're matching on (probably ID, MAP, or Subject)
Mega<-MatchbyNearestDate(MRI, Clin, "ID", "MR_Date", "TESTDATE")
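
For what it’s worth, the same matching can be sketched without loops: merge, compute the gap in days, sort so that the smallest gap comes first within each ID and scan date, then keep the first row of each group. This is an alternative under the same assumptions (both date columns are class Date), not a drop-in replacement. `matchNearest` and the toy data below are hypothetical.

```r
matchNearest <- function(df1, df2, id, date1, date2) {
  both <- merge(df1, df2, by = id)
  gap  <- abs(as.numeric(both[[date1]] - both[[date2]]))
  # smallest gap first within each ID / date1 combination
  both <- both[order(both[[id]], both[[date1]], gap), ]
  # keep only the first (closest) row per ID / date1
  both <- both[!duplicated(both[, c(id, date1)]), ]
  rownames(both) <- NULL
  both
}

# made-up data: two scans for person 1, one scan for person 2
MRI <- data.frame(ID = c(1, 1, 2),
                  MR_Date = as.Date(c("2015-01-10", "2017-06-01", "2016-03-15")))
Clin <- data.frame(ID = c(1, 1, 2, 2),
                   TESTDATE = as.Date(c("2015-02-01", "2017-05-20",
                                        "2014-01-01", "2016-04-01")))

matchNearest(MRI, Clin, "ID", "MR_Date", "TESTDATE")
```

Each scan ends up on exactly one row, paired with whichever test date is closest to it.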

Julie Wisch

October 9, 2018