Preface:

You must first log into the Pitt network via Pulse Secure, and then access Zeus either through the terminal or through a remote desktop program.

If you have not installed the following R packages, do so now by copy/pasting the following commands to your R Console:
install.packages("dplyr")
install.packages("readxl")
install.packages("summarytools")
install.packages("haven")    # used in the clean-up step below
install.packages("expss")    # used in the clean-up step below

library(dplyr)
library(summarytools)
library(foreign)
library(readxl)
knitr::opts_chunk$set(echo = TRUE)

Step 1:

Using the Terminal tab in RStudio (not the Console tab), ssh into Zeus by typing the following command (then enter your password when prompted):
ssh ericksonlab@psych-0538.psychology.pitt.edu

Step 2:

Still in the terminal, move to the directory where your subject-level data is stored:
cd /Volumes/Disk1/EPICC/SubjectScans

Step 3:

Generate a subject ID list using the following command in the terminal:
ls -d -1 7*_1

Copy and paste the output into an Excel sheet. (I haven't quite figured out how to [simply] assign variables from the terminal into RStudio.) Name the column SubID. Save the spreadsheet as 'Baseline_Subs_Brain.csv'. We'll continue to add columns to this below.
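Side note: if your R session can see the SubjectScans volume directly (e.g., RStudio running on Zeus, or the disk mounted locally), you could skip the copy/paste step and pull the IDs straight into R with list.files(). A rough sketch, assuming that path is visible to R:

# Rough sketch: only works if /Volumes/Disk1/EPICC/SubjectScans is visible to this R session.
sub_ids <- list.files("/Volumes/Disk1/EPICC/SubjectScans", pattern = "^7.*_1$")  # same subjects as `ls -d 7*_1`
BRAIN_SUBS <- data.frame(SubID = sub_ids, stringsAsFactors = FALSE)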

Step 4:

Do the same thing we just did (using the terminal), but this time list out each subject's left hippocampus .feat directory:
ls -d -1 7*_1/rest/lhip.feat

Copy and paste the output into a new column in your Excel sheet, labeled 'L_HIPP_path'.

Step 5:

Do the same thing we just did, but this time list out each subject's RIGHT hippocampus .feat directory (or any other seeds that you wish):
ls -d -1 7*_1/rest/rhip.feat

Copy and paste the output into a new column in your Excel sheet, labeled 'R_HIPP_path'.

Step 6:

Save the spreadsheet as Baseline_Subs_Brain.csv and close it.

Step 7: Concatenate ROI paths

Import the .csv you just made.

BRAIN_SUBS<-read.csv("Baseline_Subs_Brain.csv") #Import the .csv you just made. 
BRAIN_SUBS$SubID<-as.character(BRAIN_SUBS$SubID) #read SubID as a character variable, not numeric.

Below, we use a few commands to split the subject IDs that we have (e.g., 7001_1) into subject IDs that can be merged with other databases (i.e., 7001).

x<-BRAIN_SUBS$SubID 
x<-as.character(x)
tmp<-strsplit(x, "_")   #strsplit: splits a character string at a fixed pattern, in this case at '_'
mat  <- matrix(unlist(tmp), ncol=2, byrow=TRUE) #this breaks the character string into 2 separate columns
df<-as.data.frame(mat) #make a data frame
df$ID<-as.character(df$V1) #make var ID
df$Session<-as.character(df$V2) #make var Session
df<-df %>% select(ID, Session) #extract ID & Session only (i.e., drop the identical columns V1 & V2)
BRAIN_SUBS<-cbind(df, BRAIN_SUBS) #bind/merge with original dataframe 
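As an aside, if you have the 'tidyr' package installed, the same split can be done with a single call instead of the block above. A rough equivalent (commented out so it isn't run twice):

# Rough equivalent of the split above, assuming the 'tidyr' package is installed.
# BRAIN_SUBS <- tidyr::separate(BRAIN_SUBS, SubID, into = c("ID", "Session"),
#                               sep = "_", remove = FALSE)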

Now that we've cleaned up our SubID variable to match other databases, we'll want to pull all of the ROI paths that we copied into our Excel spreadsheet Baseline_Subs_Brain.csv. To do so, the R package 'dplyr' has a great function to extract any columns that end with a specific set of characters.

Seed.df<-BRAIN_SUBS %>% 
  select(ID, Session, ends_with("path")) 
# This extracts any columns added to Baseline_Subs_Brain.csv whose names end with 'path'.
# This is why it's important to add column labels (first row of each column) for any ROI paths
# generated in Steps 4 & 5 above. Be sure each ROI path column label ends in 'path'
# (i.e., L_HIPP_path, R_HIPP_path, Bilat_HIPP_path, etc.). This is also case-sensitive,
# so make sure the '_path' is lower-case.


#The following 4 lines aren't super relevant right now. However, they may become important when we
#start to add numerous ROIs. I can likely create a quick loop to verify/count missing .gfeat paths
#(if needed). For example, it looks like 7105 has baseline brain data (i.e., they're in
#/EPICC/SubjectScans); however, they do not have L/R Hippo .gfeat directories created from
#first-level (and may still need individual processing, and/or further inquiry).
Seed.df$L_HIPP_path<-as.character(Seed.df$L_HIPP_path)
Seed.df$R_HIPP_path<-as.character(Seed.df$R_HIPP_path)
Seed.df$L_HIPP_path<-if_else( (Seed.df$L_HIPP_path==""), NA_character_, Seed.df$L_HIPP_path)
Seed.df$R_HIPP_path<-if_else( (Seed.df$R_HIPP_path==""), NA_character_, Seed.df$R_HIPP_path)
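In the meantime, here is a minimal sketch of the missing-path count mentioned in the comment above (assumes missing paths come through as NA or empty strings):

# Minimal sketch: count missing entries in every ROI-path column of Seed.df.
sapply(Seed.df %>% select(ends_with("path")),
       function(x) sum(is.na(x) | x %in% c("", "NA")))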

rm(list=setdiff(ls(), "Seed.df")) #This cleans up our RStudio Global Environment (variable list), removing everything except the dataframe we just created, 'Seed.df'. 

Step 8:

Import Fitness Data

VO2.df<-readxl::read_xlsx("EPICC Data.xlsx", sheet = "VO2 Data") # Import fitness database. 

# Define variables as numeric values (because R sometimes likes to make them character strings). 
VO2.df$`PEAK VO2/KG`<-as.numeric(VO2.df$`PEAK VO2/KG`)
VO2.df$`PEAK VO2`<-as.numeric(VO2.df$`PEAK VO2`)
VO2.df$BMI<-as.numeric(VO2.df$BMI)
VO2.df$AGE<-as.numeric(VO2.df$AGE)

# Create an ID variable (as a character string) from `LAB ID`, to match the ID variable in other databases (for future merging). 
VO2.df$ID<-as.character(VO2.df$`LAB ID`) 

# Only keep rows [x, ] where the `Pre / Post` column equals "PRE".
VO2.df<-VO2.df[VO2.df$`Pre / Post`=="PRE",] 

# Only include rows that have a value for ID (remove NA rows, which originally contained post data).
VO2.df<-VO2.df[complete.cases(VO2.df$ID),]

#############################################################################################
######## Select the fitness variables you would like to include in your final output ########
#############################################################################################
fittness.df<-VO2.df %>%
  select(ID, AGE, BMI, `PEAK VO2`, `PEAK VO2/KG`)

Step 9:

Import SPSS (.sav) data

EPIC_vars<-read.spss("merged EPICC data with 1024 &1069 & 674 & 7701 & 745.sav", to.data.frame=TRUE,use.value.labels = TRUE)
# View data if you want to pick other vars:
#Command: View(EPIC_vars)  -Viewer in R is slow, so commented out for now 

Rename variables for the exported sublist. In a future tutorial, we may be able to automatically rename variables using the 'codebook' package. For now, we'll just rename the variables we're interested in for analyses.

EPIC_vars$EDU<-EPIC_vars$BDH004   #Rename variables for exported sublist
EPIC_vars$Handedness<-EPIC_vars$BDH003 #Rename variables for exported sublist

Next select the variables you wish to include in the final output database. These will be included in the merging process below.

##############################################################################
###### Select any IV/DVs that you wish to include in your final output #######
##############################################################################
Demos.df<-EPIC_vars %>%  
  select(ID, EDU, Handedness)

Step 10: Join dataframes + mean center continuous IVs

First, assign the ID variables as character strings for merging. The 'dplyr' package (used for merging) will not be able to merge dataframes by ID if the ID variable has a different class in each dataframe. For example, if I want to merge two dataframes by the shared variable "ID", then "ID" must be a character vector in BOTH dataframes.

Demos.df$ID<-as.character(Demos.df$ID)
Seed.df$ID<-as.character(Seed.df$ID)
fittness.df$ID<-as.character(fittness.df$ID)

Next, use 'dplyr' (installed & loaded in the set-up) to merge the 3 databases we've just created. As a side note, the various '_join' functions from dplyr (left_join, right_join, full_join, etc.) are very useful to learn. Here, we'll use left_join() to combine two dataframes at a time.

  • For left_join(), whichever dataframe you input first (i.e., on the left) comes first in the merged dataframe.
  • IMPORTANTLY, the resulting dataframe will only include subjects from the LEFT dataframe (the first df input to the command).

Below, we enter Seed.df first (on the left), so that our resulting dataframe (data1) will only include subjects from Seed.df. Then, we'll merge the resulting dataframe, data1, with our fittness.df. If any participant from Seed.df has missing data within Demos.df, they'll just receive NAs for that column entry.

data1<-left_join(Seed.df, Demos.df, by = "ID")
BS_EPICC_Group1<-left_join(data1, fittness.df, by = "ID")
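A quick optional sanity check that the joins behaved as described: the row count should still match Seed.df, and subjects with no PRE fitness data should simply show NAs.

# Optional sanity check: left_join() should keep exactly the Seed.df subjects.
nrow(Seed.df) == nrow(BS_EPICC_Group1)        # should be TRUE
sum(is.na(BS_EPICC_Group1$`PEAK VO2/KG`))     # number of subjects with no PRE fitness data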

Step 11: Mean Centering

Assign numeric/continuous variable classes:

BS_EPICC_Group1$AGE<-as.numeric(BS_EPICC_Group1$AGE)
BS_EPICC_Group1$BMI<-as.numeric(BS_EPICC_Group1$BMI)
BS_EPICC_Group1$EDU<-as.numeric(BS_EPICC_Group1$EDU)
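Since this step is titled Mean Centering, here is a minimal sketch of grand-mean centering the continuous IVs into new columns (the *_c column names are just a suggestion; the raw values are kept):

# Minimal sketch of grand-mean centering; creates new *_c columns so the raw values are kept.
BS_EPICC_Group1$AGE_c <- BS_EPICC_Group1$AGE - mean(BS_EPICC_Group1$AGE, na.rm = TRUE)
BS_EPICC_Group1$BMI_c <- BS_EPICC_Group1$BMI - mean(BS_EPICC_Group1$BMI, na.rm = TRUE)
BS_EPICC_Group1$EDU_c <- BS_EPICC_Group1$EDU - mean(BS_EPICC_Group1$EDU, na.rm = TRUE)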

Step 12: Basic Sample Reports

BS_EPICC_Group<-BS_EPICC_Group1[,-c(1:4, 11:13)] # drop ID, Session, and ROI-path columns (1:4), plus columns 11:13, before summarizing

print(dfSummary(BS_EPICC_Group[,2:6], graph.magnif = 0.75, valid.col = FALSE, varnumbers = FALSE, na.col = FALSE , labels.col =FALSE), method = 'render')

Data Frame Summary
BS_EPICC_Group
Dimensions: 34 x 5
Duplicates: 0

Variable                 Stats / Values                         Freqs (% of Valid)
Handedness [factor]      1. Right                               22 (78.6%)
                         2. Left                                 6 (21.4%)
                         3. Both                                 0 (0.0%)
AGE [numeric]            Mean (sd): 63.6 (5.7)                  18 distinct values
                         min < med < max: 51 < 64 < 76
                         IQR (CV): 7.8 (0.1)
BMI [numeric]            Mean (sd): 31.8 (6.6)                  34 distinct values
                         min < med < max: 20.2 < 31.4 < 43.3
                         IQR (CV): 10.7 (0.2)
PEAK VO2 [numeric]       Mean (sd): 1.4 (0.3)                   34 distinct values
                         min < med < max: 0.9 < 1.4 < 2.1
                         IQR (CV): 0.3 (0.2)
PEAK VO2/KG [numeric]    Mean (sd): 16.8 (2.7)                  34 distinct values
                         min < med < max: 11.4 < 16.8 < 24
                         IQR (CV): 3.7 (0.2)

Generated by summarytools 0.9.6 (R version 3.6.1)
2020-06-01

Step 13: Clean up the data structure for basic statistical reports (work in progress)

  1. We'll use the following chunk to break up the SPSS database into its respective sections (i.e., BDH = Basic_Demos_Health, SWM = Spat_working_memory_task, etc.). From there, we can clean up the data structure a tad and create demographics reports.
    • For example, in the BDH section, BDH010A:HX represents binary logic (1 = yes, 2 = no) for categorical racial status (White, Black, Asian, Other). It is likely that these questions were presented as multiple-entry check-boxes, and thus they're imported as separate variables. We'll likely want to break this into some interpretation of Black/White/Other for a demographics report (a rough sketch follows the code chunk below).
  2. Another thing to consider/discuss is whether we'll be including left-handed participants.
    • There are 6 lefties in the bunch.
  3. Finally, after we've established which variables we're interested in, and cleaned up the basic database structure, we'll report some basic statistics.
    • I'll additionally add a working tutorial for the R add-on 'codebook' package (publication included in the R Project folder).
library(haven)
library(expss)
spss_data = haven::read_spss("merged EPICC data with 1024 &1069 & 674 & 7701 & 745.sav")
# add missing 'labelled' class
EPICC_data = add_labelled_class(spss_data) 
rm(list=setdiff(ls(), "EPICC_data"))


#Currently Working 
DEMOS<-EPICC_data %>%
  select(ID, starts_with("BDH"))


SWM<-EPICC_data %>%
  select(ID, starts_with("SWM"))
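As a rough illustration of point 1 above, here is a sketch of collapsing the race check-boxes into a single variable. The BDH010A–BDH010D column names and their order (White, Black, Asian, Other) are assumptions and should be checked against the codebook:

# Sketch only: hypothetical check-box columns BDH010A (White), BDH010B (Black),
# BDH010C (Asian), BDH010D (Other), each coded 1 = yes, 2 = no, as described above.
DEMOS$Race <- dplyr::case_when(
  DEMOS$BDH010A == 1 ~ "White",
  DEMOS$BDH010B == 1 ~ "Black",
  DEMOS$BDH010C == 1 ~ "Asian",
  TRUE               ~ "Other"
)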

Step 14: Working Codebook Example

# Best reference, aside from the protocol publication included in the R Project folder: https://github.com/rubenarslan/codebook
# codebook::new_codebook_rmd()
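For a first pass, the package's main entry point is codebook::codebook(), which is meant to be called from an R Markdown chunk. A sketch (commented out, like the template command above), assuming the labelled EPICC_data object from the clean-up step is still loaded:

# Sketch: render a codebook for the labelled SPSS data from within an .Rmd chunk
# (chunk option results='asis'); assumes EPICC_data from the clean-up step above.
# library(codebook)
# codebook::codebook(EPICC_data)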