Tutorial created for OSU’s Social Psychology Workshop Series

Prerequisites

The only things you need for this tutorial are:

  1. An RStudio installation
  2. An install of the packages listed below
  3. A link to the data for download

Note: if you haven’t installed these packages into R before, run install.packages('package name') for each one!
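
If you’d rather grab them all in one go, this one-liner installs every package used in this tutorial:

install.packages(c("haven", "tidyr", "dplyr", "ggplot2", "jtools"))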

library(haven)    # Plays nice with SPSS files
library(tidyr)    # For manipulating/cleaning data
library(dplyr)    # For manipulating/cleaning data
library(ggplot2)  # For making pretty plots!
library(jtools)   # For easily plotting interactions

Introduction

There comes a time in every data set’s life when it must transform from its rapidly growing mess of miscoded items, unruly text strings, and missing values into a beautiful, finalized analysis. While this key step is only a small part of the research we do, the time it takes is often underestimated.

Take, for example, this data collected from a group of data scientists: the Data Drudgery Poll collected by CrowdFlower

That means around 80% of data scientists’ time is spent doing data drudgery!!!

The goal of this tutorial is to demonstrate and supply reproducible code to speed up your next data wrangling session using R!

Tidy Data

Before we can arrange our data into perfection, we must first ask - what does perfection look like? The answer for most is Tidy Data.

What is tidy data?

  1. Each variable is saved in its own column
  2. Each observation is saved in its own row
  3. Each ‘type’ of observation is stored in a single table.

Tidy data, as illustrated in R for Data Science (Hadley Wickham and Garrett Grolemund)

Use SPSS? Fantastic! If you are performing analyses, you are already using a tidy data structure!
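
For a concrete sketch of the difference, consider a tiny hypothetical dataset with two subjects measured at two time points:

# Untidy: one column per time point (hypothetical data)
messy <- data.frame(subject = c(1, 2), t1 = c(5, 7), t2 = c(6, 8))

# Tidy: each variable is a column, each observation is a row
tidy <- data.frame(subject = c(1, 1, 2, 2),
                   time    = c("t1", "t2", "t1", "t2"),
                   score   = c(5, 6, 7, 8))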


Why even bother?

  1. Makes many functions (stats and graphing) a breeze!
  2. Compatible with many software types (SPSS, Stata, SAS, HLM, MPlus, Python and more!)
  3. Makes data sharing easier!

Data Wrangling

Great, so now that we know what perfection looks like, how can R help us get there?

First, a few helper functions to look at our data along the way and speed up our implementation

  1. tbl_df() - converts a data frame to a tibble, a class that prints a “preview” of the data frame (in recent versions of dplyr, as_tibble() supersedes tbl_df())
  2. View() - shows your entire data frame
  3. object %>% function() - the “pipe”; inserts object as the first argument of function(). Useful for chaining multi-step operations (see the short example just below this list).
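
As a quick illustration of the pipe, here are two equivalent calls using R’s built-in mtcars data:

# These two lines do exactly the same thing; the pipe sends mtcars
# in as the first argument of head()
head(mtcars, n = 3)
mtcars %>% head(n = 3)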

About the Dataset

For the purposes of this tutorial, we will be using a dataset that contains behavioral data from BeanFest and a battery of survey measures! BeanFest is a computer game in which participants interact with beans that vary in terms of their shape (from circular to oval to oblong) and number of speckles (from 1 to 10). Different types of beans (e.g., circular beans with few speckles) are associated with either positive or negative outcomes when they are selected. BeanFest was developed by Fazio, Eiser, and Shook (2004) as a paradigm for studying the development of attitudes toward novel objects and the subsequent generalization of those newly-developed attitudes to objects of varying similarity to the original stimuli.

Beans

The data we will be wrangling today consists of 40 “subjects” (created by bootstrapping data from 4 real subjects), organized into the following files:

  1. 40 individual subject files of BeanFest behavioral data. One row indicates one experimental trial. Created by Inquisit.
  2. 1 aggregate file with all survey measures. One row indicates one subject. Created by MediaLab. This data is super fake, but useful for tutorial purposes!

A huge thank you to Shelby Boggs for allowing us to access and utilize this data!

Setting Project Directory

It is often useful to set the project directory so that you do not have to write out the whole file path for each external file (e.g., .csv, .txt, etc.) that you need to either read in or write out. So much easier than typing out 40 (or 300) file paths! The code below shows how to do so. Note that in RStudio you can type quotes (“”) and then press the TAB key once inside the quotes to auto-complete file paths until you find the directory/file you are interested in.

To use the data from the link above, you’ll want to download it (or store it in your Dropbox). Then, save your R script somewhere convenient and set the working directory to the folder containing the data:

# This is the directory (i.e. folder) where all the data and codes will be found
setwd("~/Dropbox/R Data Processing Workshop/")
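
After running setwd(), you can confirm it worked as expected:

# Print the current working directory
getwd()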

Combining Multiple Subjects’ Data

Now that the directory has been set, we need to read in all the behavioral data. Because the behavioral data is in individual files for each subject, we need to find a way to read in each file and then bind them all together. The code below shows how to find the files you need in a given folder, loop through each file, and bind them all together so that we have a single, merged data.frame with all subjects’ data.

# Find file names in "Data" directory that contain the pattern "subject"
file_names <- list.files(path = "Data", pattern = "subject")

# Create empty variable that will store all subject "behavioral" data
beh_data <- NULL

# A loop! Here, "subj" will be assigned each of the individual subject files as 
# the loop iterates through each element of "file_names"
for (subj in file_names) {
  # Read in the individual subject data
  tmp_data <- read.csv(paste0("Data/", subj), stringsAsFactors = F)
  
  # Bind individual subject data to "beh_data" variable
  beh_data <- rbind(beh_data, tmp_data)
}

# "beh_data" now contains long-format data across all 40 subjects
tbl_df(beh_data)
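
As an aside, growing a data frame with rbind() inside a loop works fine here, but the same result can be had without an explicit loop. Here is one sketch using lapply() and dplyr’s bind_rows():

# An equivalent, loop-free alternative
beh_data <- file_names %>%
  lapply(function(subj) read.csv(paste0("Data/", subj), stringsAsFactors = FALSE)) %>%
  bind_rows()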

Scoring Trials Using a Key

For behavioral tasks, we often need to score results for each participant individually, and these summary measures are then used to compare groups, probe condition effects, or determine correlations between behavioral performance and external (e.g., self-report questionnaire) data. Below, the “beankey.csv” file contains the values that are associated with beans for each X-Y coordinate pair in the BeanFest grid (named xPos and yPos in the .csv file). Now, if the values associated with different beans are changed, only the beankey.csv file needs to be updated to reflect this change (no changes to the R code are necessary).

BeanKey

# Read in beanfest key
bean_key <- read.csv("beankey.csv")

# Take a look at the bean key!
tbl_df(bean_key)

Now for our first taste of some dplyr and tidyr functions. In the block below we use the mutate and full_join functions.

  • mutate allows us to make new variables based on the string inside the “values.currentbean” variable.
  • full_join allows us to combine the bean_key and beh_data data frames, while matching on the key variables (xPos and yPos).

# Compute behavioral summaries for later analysis and merge with beanfest key.
# Note that this does the value recoding automatically! Now individual-subject 
# summaries can be computed
recoded_data <- beh_data %>%
  mutate(xPos = as.numeric(substr(values.currentbean, start = 1, stop = 2)),
         yPos = as.numeric(substr(values.currentbean, start = 5, stop = 6))) %>%
  full_join(bean_key, by = c("xPos"="xPos", "yPos"="yPos"))

# "recoded_data" now contains value-recoded data from the key for each trial
tbl_df(recoded_data)
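
To see what those substr() calls are doing, here is a toy example. Note that the exact format of values.currentbean depends on how Inquisit logs it; the “03~~07” string below is purely hypothetical:

# Hypothetical bean label: characters 1-2 hold the x position, 5-6 the y position
as.numeric(substr("03~~07", start = 1, stop = 2))  # returns 3
as.numeric(substr("03~~07", start = 5, stop = 6))  # returns 7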

Computing Behavioral Summaries

Now that we have all the behavioral data in one place and trials have been scored appropriately, it is time to compute summary statistics for each subject. The code proceeds by first finding out which condition each subject was assigned to, adding the condition variable to every trial for each subject, and then coding the responses that subjects made so that we know both: (1) if they responded correctly given an old bean, and (2) how they responded to novel beans. Then, we can filter down to the test phase, group the data by the “subject” variable, and then take the mean of each subject’s response to novel, old negative, and old positive beans. Finally, we can compute weighting bias and learning asymmetry.

Remember that the pipe operator (%>%) just tells R to send the result of one line to the function on the next line, which allows us to perform functions on the data sequentially without having to save each step separately. Note that the “.” at the beginning of each function call stands for the data sent from the previous line; we included it here for completeness, although it is not strictly necessary.

Again, we are utilizing some helpful tidyr and dplyr functions:

  • mutate here is helping us create a condition variable.
  • case_when is a useful conditional function. It follows the format case_when(x ~ y, a ~ b, TRUE ~ q): if condition x holds, return y; otherwise, if condition a holds, return b; otherwise, return q.
  • group_by helps us group data by value, such that rows with the same value cluster.
  • summarize is similar to the AGGREGATE function in SPSS. It collapses the data into a summary statistic of choice.
  • full_join as above, helps us combine two datasets based on a value key.
  • filter is similar to a sort & filter in Excel or SPSS.

# Find condition and compute behavioral summaries for each subject 
scored_data <- recoded_data %>%
  mutate(., cond = case_when(yPos==1 & values.currentbeanvalue==10 & blocknum==2 & xPos==1 ~ 1, #Define condition
                             yPos==1 & values.currentbeanvalue==-10 & blocknum==2 & xPos==1 ~ 2, 
                             TRUE ~ NA_real_)) %>%
  group_by(., subject) %>%
  summarize(., condition = mean(cond, na.rm = T)) %>% # A specific trial denotes condition, and we are doing this
  full_join(., recoded_data, by = "subject") %>%      # to find and then enter that condition value for every trial
  mutate(., bean_val = case_when(condition==1 ~ recodedValue*1, 
                                 condition==2 ~ recodedValue*-1, 
                                 TRUE ~ NA_real_),
         rec_resp = case_when(response==37 ~ 1, 
                              response==32 ~ -1, 
                              TRUE ~ NA_real_)) %>%
  mutate(., cor_resp = case_when(bean_val==10 & rec_resp==1 ~ 1, 
                                 bean_val==10 & rec_resp==-1 ~ 0, 
                                 bean_val==-10 & rec_resp==1 ~ 0, 
                                 bean_val==-10 & rec_resp==-1 ~ 1, 
                                 TRUE ~ NA_real_)) %>%
  filter(., blockcode=="testblock") %>%
  group_by(., subject) %>%
  summarize(., avg_resp_nov = mean(rec_resp[presented==0 & bean_val==0], na.rm = T),
            neg_prop_cor = mean(cor_resp[presented==1 & bean_val==-10], na.rm = T),
            pos_prop_cor = mean(cor_resp[presented==1 & bean_val==10], na.rm = T),
            condition    = first(condition)) %>%
  mutate(., weight_bias = avg_resp_nov - (.59 * pos_prop_cor - .83 * neg_prop_cor + .08),
         learn_asymm = pos_prop_cor - neg_prop_cor)

# Beanfest data is now scored for each subject
tbl_df(scored_data)

For those unfamiliar with the variable names, they are interpreted as follows:

  • ‘avg_resp_nov’: an individual’s average response to the novel beans during the test phase
  • ‘pos_prop_cor’ / ‘neg_prop_cor’: proportion of positive (negative) game beans an individual got correct during the final test phase
  • ‘weight_bias’: an individual’s weighting bias, calculated using the normative equation from Fazio, Pietri, Rocklage, & Shook (2015)
  • ‘learn_asymm’: proportion of positive minus negative game beans correct

Comparison of R and SPSS for Behavioral Summaries

Follow this link to open the SPSS Syntax in your web browser. This code is 1320 lines long! R takes care of the same problem, plus all of the data aggregation in 35 annotated lines.

We can compare each part of the code above to the SPSS syntax.

Combining Behavioral and Self-report Data

Now that the behavioral data has been scored, we can merge it with the self-report data so that we can test any hypothesized relationships between behavior and other measures. The code below uses the “full_join” function once more to merge both files by the subject index. Note that this time, the subject ID column is named differently in each file (Subject vs. subject). We can tell “full_join” about this difference using the “by” argument as shown below.

# Read in non-behavioral data (e.g., questionnaire) and combine with Beanfest
all_data <- read.csv("Data/questionnaire_data.csv") %>%
  full_join(scored_data, by = c("Subject"="subject"))

# Now all the data is in one place
tbl_df(all_data)
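
Incidentally, this is where haven earns its keep if your survey data come straight out of SPSS: read_sav() reads a .sav file directly into a data frame. A sketch, assuming a hypothetical .sav version of the questionnaire file:

# Hypothetical: read the questionnaire data from an SPSS .sav file instead
questionnaire <- read_sav("Data/questionnaire_data.sav")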

Stats - Testing an interaction

This tutorial is not meant to replace a stats class, so we will only be covering one simple interaction, just enough to find something interesting to plot! We will show syntax for the plot below.

model1 <- lm(TDDSPath ~ weight_bias * FoodNeophobia, data = all_data)
summary(model1)
## 
## Call:
## lm(formula = TDDSPath ~ weight_bias * FoodNeophobia, data = all_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43519 -0.00962  0.00691  0.04630  0.35290 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.88108    0.10250  18.352  < 2e-16 ***
## weight_bias               -4.28424    0.35973 -11.910 4.79e-14 ***
## FoodNeophobia              1.32876    0.03887  34.186  < 2e-16 ***
## weight_bias:FoodNeophobia  1.32672    0.16492   8.045 1.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1721 on 36 degrees of freedom
## Multiple R-squared:  0.9723, Adjusted R-squared:   0.97 
## F-statistic: 421.4 on 3 and 36 DF,  p-value: < 2.2e-16

Wowee! Everything is significant!

Visualization (the best part!)

A brief introduction to ggplot philosophy

A great read to get in the mindset of ggplot2. A list of helpful terms for decoding ggplot syntax!

Prepare Data for ggplot

The R package ggplot2 is a great tool for visualization. To use ggplot2, however, the data need to be in long format. When we computed behavioral summaries for each subject and combined them with the self-report data, the data we were working with became wide format. To fix this, tidyr offers some helpful functions that allow us to transform data from wide to long format. The code below shows how to use the gather function to implement this.

gather

# Gather Beanfest measures into long format to visualize together
plot_data <- all_data %>%
  gather(key = "Measure", value = "Value", avg_resp_nov:learn_asymm) %>%
  mutate(Subject = as.factor(Subject), # Making these variables into factors helps for plotting later
         Gender  = as.factor(Gender))

# Scroll to the right and find the "Measure" column to see what happened
tbl_df(plot_data)
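
A quick note: in tidyr 1.0.0 and later, pivot_longer() supersedes gather(). gather() still works, but a sketch of the equivalent pivot_longer() call would be:

# Equivalent reshaping with the newer pivot_longer()
plot_data <- all_data %>%
  pivot_longer(cols = avg_resp_nov:learn_asymm,
               names_to = "Measure", values_to = "Value") %>%
  mutate(Subject = as.factor(Subject),
         Gender  = as.factor(Gender))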

Plotting Individual Subjects for each Measure

Since we have so few subjects, we can visualize each subject’s performance across all the measures. To do so, we will use the geom_bar layer from ggplot2, which allows us to make a bar graph and color different bars according to other variables in our dataset. In particular, we can make each subject a different color so that it is easy to compare performance across subjects. The below code shows how to make a bar plot with color representing subjects.

# Visualize measures within subjects
ggplot(plot_data, aes(x = Subject, y = Value, fill = Subject)) + # fill = Subject tells ggplot to color the bars differently for each Subject
  geom_bar(stat = "identity") +   # stat = "identity" tells geom_bar that Value contains the exact numbers to plot
  facet_wrap("Measure") +         # This line makes separate panels for each Measure
  theme_minimal(base_size = 20) + # theme_minimal is a nice theme that makes the plot simple. Try other themes! 
  theme(legend.position = "none", # Here, we are turning off the legend. Try changing "none" to "right" to see what happens
        axis.text.x = element_blank()) # This line is getting rid of subject ID labels along the x-axis for a less cluttered visualization
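
As an aside, ggplot2 (version 2.2.0 and later) provides geom_col() as shorthand for geom_bar(stat = "identity"), so that line could be written more compactly as geom_col().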

Plotting Performance Across Subjects

The above plot is useful for quickly checking the data, but it is not very helpful for making inference. To make inference, we may want to look at average performance across subjects, and we need to make sure we take uncertainty into account. To do so, we can compute summary statistics across subjects and create 95% confidence intervals to show the uncertainty in our estimates. The below code shows how to summarize the data across subjects so that confidence intervals can be calculated correctly.

# Visualize measures across subjects (with confidence intervals)
ci <- .95 # Calculate 95% confidence interval. Try changing this value to see what happens to the intervals (e.g., to .5)
plot_data %>%
  group_by(Measure) %>%
  summarize(Value_mu = mean(Value),
            Value_se = sd(Value)/sqrt(n()),
            n_subj   = n()) %>%
  ggplot(aes(x = Measure, y = Value_mu, fill = Measure)) + # The color is different for each Measure now instead of Subject
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = Value_mu - qt(ci/2+.5, n_subj-1)*Value_se,  # Calculating t-value with N-1 DF
                    ymax = Value_mu + qt(ci/2+.5, n_subj-1)*Value_se), 
                width = .2) +
  theme_minimal(base_size = 20) +
  theme(legend.position = "none")

Note that the labels are overlapping. Click on the first button from the left in the upper-right portion of the plot (only in the .Rmd file) to view it in a new window. Alternatively, we can change the font size/rotate the labels to make them easier to read:

plot_data %>%
  group_by(Measure) %>%
  summarize(Value_mu = mean(Value),
            Value_se = sd(Value)/sqrt(n()),
            n_subj   = n()) %>%
  ggplot(aes(x = Measure, y = Value_mu, fill = Measure)) + 
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = Value_mu - qt(ci/2+.5, n_subj-1)*Value_se, 
                    ymax = Value_mu + qt(ci/2+.5, n_subj-1)*Value_se), 
                width = .2) +
  theme_minimal(base_size = 18) + # Base font size changed from 20 -> 18
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 45, vjust = 0.9)) # Rotating the x-axis text 45 degrees and adjusting the vertical distance of the labels

Manually Visualizing Interactions

Above, we showed how to test an interaction using R syntax, but how should we visualize such interactions? The code below shows how to make a nice interaction plot:

# Visualize Disgust Sensitivity across the Weighting Bias * Food Neophobia interaction
ci <- .95 # Calculate 95% confidence interval
plot_data %>%
  filter(Measure=="weight_bias") %>%
  mutate(weight_bias = case_when(Value < median(Value) ~ "Low", 
                                 TRUE ~ "High"),
         FoodNeophobia = case_when(FoodNeophobia < median(FoodNeophobia) ~ "Low", 
                                 TRUE ~ "High")) %>%
  group_by(weight_bias, FoodNeophobia) %>%
  summarize(TDDSPath_mu = mean(TDDSPath),
            TDDSPath_se = sd(TDDSPath)/sqrt(n()),
            n_subj      = n()) %>%
  ggplot(aes(x = weight_bias, y = TDDSPath_mu, color = FoodNeophobia, group = FoodNeophobia)) +
  geom_point(size = 2, position = position_dodge(width = .05)) +
  geom_line(position = position_dodge(width = .05)) +
  geom_errorbar(aes(ymin = TDDSPath_mu - qt(ci/2+.5, n_subj-1)*TDDSPath_se,  # Calculating t-value with N-1 DF
                    ymax = TDDSPath_mu + qt(ci/2+.5, n_subj-1)*TDDSPath_se),
                width = .1, position = position_dodge(width = .05)) +
  xlab("Median Split on Weighting Bias") +
  ylab("Disgust Sensitivity") +
  theme_minimal(base_size = 20) +
  theme(legend.position = "right")

Packages for Interactions

There are some great tools that we can use to easily plot interactions after fitting an interaction model. Below, we use jtools to plot the interaction between weighting bias and food neophobia in predicting disgust sensitivity (because why not?).

# Fit the linear model with the interaction (the right way!)
model1 <- lm(TDDSPath ~  weight_bias * FoodNeophobia, data = all_data)
summary(model1)
## 
## Call:
## lm(formula = TDDSPath ~ weight_bias * FoodNeophobia, data = all_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43519 -0.00962  0.00691  0.04630  0.35290 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.88108    0.10250  18.352  < 2e-16 ***
## weight_bias               -4.28424    0.35973 -11.910 4.79e-14 ***
## FoodNeophobia              1.32876    0.03887  34.186  < 2e-16 ***
## weight_bias:FoodNeophobia  1.32672    0.16492   8.045 1.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1721 on 36 degrees of freedom
## Multiple R-squared:  0.9723, Adjusted R-squared:   0.97 
## F-statistic: 421.4 on 3 and 36 DF,  p-value: < 2.2e-16
# Save the final plot  
plot_for_JPSP <- interact_plot(model1, pred = "weight_bias", modx = "FoodNeophobia", 
                               x.label = "Weighting Bias", y.label = "Disgust Sensitivity", 
                               legend.main = "Food Neophobia", interval = TRUE, int.width = 0.8) + 
  theme_minimal(base_size = 15)

# Show plot
plot_for_JPSP

# Save plot in PDF format with journal-ready 300 DPIness
ggsave(plot_for_JPSP, filename = "paradigm_shifting_plot.pdf", units = "in",
       height = 4, width = 6, dpi = 300)
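
One caveat: in jtools 2.0.0 and later, interact_plot() was moved to the companion interactions package by the same author. If the call above complains about a missing function, install and load that package first:

# Only needed on newer jtools versions, where interact_plot() lives in "interactions"
install.packages("interactions")
library(interactions)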

Resources

  1. ?function - Not sure what a function is doing? Add a question mark before the function name to pull up its documentation.
  2. Want to learn more about the awesome features of Rstudio?
  3. More information on Tidy Data.
  4. Like this webpage? We made it with Rmd and published on Rpubs.
  5. Want to learn more about jtools? Here’s a tutorial for the jtools package designed for simple interaction plots.
  6. If you want to get into the beans, so to speak, the Fazio webpage has lots of useful information.