Starting note: The best reference for this material is Hadley Wickham and Garrett Grolemund’s R for Data Science (R4DS). My contribution here is to translate this reference for psychology.

Goals and Introduction

By the end of this tutorial, you will know:

What “tidy data” is and why it’s an awesome format
How to do some stuff with tidy data
How to get your data to be tidy
Some tips’n’tricks for dealing with “medium data” in R

This intro will describe a few concepts that you will need to know, using the famous iris dataset that ships with base R (in the datasets package).

Data frames

The basic data structure we’re working with is the data frame, or tibble (in the tidyverse reimplementation). Data frames have rows and columns, and each column holds values of a single data type. The implementation in Python’s pandas is distinct but most of the concepts are the same.
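As a quick check of the one-type-per-column idea, this sketch asks R for the class of each column of iris. It uses only base R; col_types is just an illustrative name.

```r
# Class of every column in the built-in iris data frame
col_types <- sapply(iris, class)
col_types

# Dimensions: rows (flowers) by columns (variables)
dim(iris)
```

The four measurement columns come back as numeric and Species as a factor, so each column really does have one type.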

iris is a data frame showing the measurements of a bunch of different instances of iris flowers from different species. (Sepals are the things outside the petals that protect them while the flower is blooming; petals are the actual petals of the flower.)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Exercise. R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use $ to access a column, as in iris$Sepal.Length or you can treat the data frame as a matrix, e.g. iris[1,1] or even as a list, as in iris[[1]]. You can also mix numeric references and named references, e.g. iris[["Sepal.Length"]]. Turn to your neighbor (and/or google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).

# fill me in with calls to the iris dataset that all return the same cell (third from the top, Petal Length).

Discussion. Why might some ways of doing this be better than others?

Tidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Here’s the basic idea: In tidy data, every row is a single observation (trial), and every column describes a variable with some value describing that trial.
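To make the one-row-per-observation idea concrete, here is a minimal sketch with made-up data. The data frame, its column names (subid, trial1, trial2, rt), and its values are all hypothetical and are not taken from any dataset in this tutorial. It uses base R only, so it runs without extra packages.

```r
# Hypothetical "wide" data: one row per participant, one column per trial
wide <- data.frame(
  subid  = c("S1", "S2"),
  trial1 = c(0.52, 0.61),
  trial2 = c(0.48, 0.70)
)

# Tidy version: one row per participant-trial observation,
# with the trial number and the measured value in their own columns
tidy <- data.frame(
  subid = rep(wide$subid, times = 2),
  trial = rep(c(1, 2), each = nrow(wide)),
  rt    = c(wide$trial1, wide$trial2)
)
tidy
```

The wide version has two rows and the tidy version has four, because each trial is now its own observation.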

And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:

“There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.”

iris is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
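The vectorisation point in the R4DS quote can be seen directly on iris. In this sketch, petal_ratio is an illustrative name and not a column of the dataset.

```r
# One vectorised expression operates on every row at once; no loop needed
petal_ratio <- iris$Petal.Length / iris$Petal.Width

length(petal_ratio)   # one value per row of iris
head(petal_ratio, 3)
```

Because each variable lives in its own column, arithmetic on whole columns produces one result per observation.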

Exercise. Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It’s a replication of a paper by Janiszewski & Uy (2008). Examine this dataset with your neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).

library(tidyverse) # loads readr (read_csv), dplyr, ggplot2, and the %>% pipe
ju <- read_csv("data/janiszewski_rep_cleaned.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Reward = col_double(),
##   MaxAssignments = col_double(),
##   AssignmentDurationInSeconds = col_double(),
##   AutoApprovalDelayInSeconds = col_double(),
##   NumberOfSimilarHITs = col_double(),
##   LifetimeInSeconds = col_logical(),
##   WorkerId = col_double(),
##   ApprovalTime = col_logical(),
##   RejectionTime = col_logical(),
##   RequesterFeedback = col_logical(),
##   WorkTimeInSeconds = col_double(),
##   Input.price1 = col_double(),
##   Input.price2 = col_double(),
##   Input.price3 = col_double(),
##   Answer.dog_cost = col_double(),
##   Answer.plasma_cost = col_double(),
##   Answer.sushi_cost = col_double()
## )

## See spec(...) for full column specifications.

head(ju)

## # A tibble: 6 x 34
##   HITId HITTypeId Title Description Keywords Reward CreationTime
##   <chr> <chr>     <chr> <chr>       <chr>     <dbl> <chr>       
## 1 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## 2 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## 3 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## 4 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## 5 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## 6 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …
## # … with 27 more variables: MaxAssignments <dbl>,
## #   RequesterAnnotation <chr>, AssignmentDurationInSeconds <dbl>,
## #   AutoApprovalDelayInSeconds <dbl>, Expiration <chr>,
## #   NumberOfSimilarHITs <dbl>, LifetimeInSeconds <lgl>,
## #   AssignmentId <chr>, WorkerId <dbl>, AssignmentStatus <chr>,
## #   AcceptTime <chr>, SubmitTime <chr>, AutoApprovalTime <chr>,
## #   ApprovalTime <lgl>, RejectionTime <lgl>, RequesterFeedback <lgl>,
## #   WorkTimeInSeconds <dbl>, LifetimeApprovalRate <chr>,
## #   Last30DaysApprovalRate <chr>, Last7DaysApprovalRate <chr>,
## #   Input.condition <chr>, Input.price1 <dbl>, Input.price2 <dbl>,
## #   Input.price3 <dbl>, Answer.dog_cost <dbl>, Answer.plasma_cost <dbl>,
## #   Answer.sushi_cost <dbl>

Functions and Pipes

Everything you typically want to do in statistical programming uses functions. mean is a good example. mean takes one required argument, a numeric vector (optional arguments such as na.rm control details like the handling of missing values).

mean(iris$Petal.Length)

## [1] 3.758

We’re going to call this applying the function mean to the variable Petal.Length.

Pipes are a way to write strings of functions more easily. They move the first argument of a function in front of the function call, so x %>% f(y) means f(x, y). So you can write:

iris$Petal.Length %>% mean()

## [1] 3.758

That’s not very useful yet, but when you start nesting functions, it gets better.

mean(unique(iris$Petal.Length))

## [1] 4.22093

iris$Petal.Length %>% unique() %>% mean()

## [1] 4.22093

iris$Petal.Length %>% unique %>% mean

## [1] 4.22093

round(mean(unique(iris$Petal.Length)), 
      digits = 2)

## [1] 4.22

iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)

## [1] 4.22

# indenting makes things even easier to read
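One way to lay out the last pipeline with indentation, as a sketch: each step goes on its own line, so the sequence reads top to bottom. rounded_mean is an illustrative name.

```r
library(magrittr) # provides %>%; also loaded by library(tidyverse)

# Same computation as the one-line version, one step per line
rounded_mean <- iris$Petal.Length %>%
  unique() %>%
  mean() %>%
  round(digits = 2)

rounded_mean
```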

This can be super helpful for writing strings of functions so that they are readable and distinct.

We’ll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
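As a small preview of piping into a function with more than one argument, this sketch uses quantile from base R. The piped value fills the first argument and the remaining arguments are written inside the call as usual. piped and unpiped are illustrative names.

```r
library(magrittr) # provides %>%; also loaded by library(tidyverse)

# The piped value becomes the first argument; other arguments stay in the call
piped   <- iris$Petal.Length %>% quantile(probs = c(0.25, 0.75))
unpiped <- quantile(iris$Petal.Length, probs = c(0.25, 0.75))

# Both forms give exactly the same result
identical(piped, unpiped)
```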

Exercise. Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:

length(unique(iris$Species)) # number of species

## [1] 3

Piped version:

`ggplot2` and tidy data

The last piece of our workflow here is going to be the addition of visualization elements. ggplot2 is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):

aes - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
geom - the plotting objects that represent the data (points, lines, shapes, etc.)

iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) + 
  geom_point()

And just to let you know my biases, I like theme_few from ggthemes and scale_color_solarized as my palette.

iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) + 
  geom_point() + 
  ggthemes::theme_few() + 
  ggthemes::scale_color_solarized()
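To see how aes and geom are separate choices, here is a sketch that keeps the same data but changes both the mapping and the plotting object. The choice of a boxplot of Petal.Length by Species is just an illustration, and p is an illustrative name.

```r
library(ggplot2)

# Map Species to x and Petal.Length to y, then draw boxplots instead of points
p <- ggplot(iris, aes(x = Species, y = Petal.Length, col = Species)) +
  geom_boxplot()

p
```

Swapping geom_boxplot() for another geom changes how the data are drawn without touching the mapping.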

Tidy Data Analysis with `dplyr`

Reference: R4DS Chapter 5

Let’s take a psychological dataset. Here are the raw data from Stiller, Goodman, & Frank (2015).

These data are tidy: each row describes a single trial, and each column describes some aspect of that trial, including the participant’s id (subid), age (age), condition (condition - “Label” is the experimental condition, “No Label” is the control), and item (item - which thing furble was trying to find).

We are going to manipulate these data using “verbs” from dplyr. I’ll only teach four verbs, the most common in my workflow (but there are many other useful ones):

filter - keep only the rows that satisfy some logical condition (removing the rest)
mutate - create new columns
group_by - group the data into subsets by some column
summarize - apply some function over columns in each group
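Before turning to the psychology data, here is a minimal sketch of the four verbs chained together on iris, which needs no data file. The derived column Petal.Ratio and the names species_summary, mean_ratio, and n are illustrative only.

```r
library(dplyr)

species_summary <- iris %>%
  filter(Species != "setosa") %>%                      # keep rows for two species
  mutate(Petal.Ratio = Petal.Length / Petal.Width) %>% # add a new column
  group_by(Species) %>%                                # one group per species
  summarize(mean_ratio = mean(Petal.Ratio),            # collapse to one row per group
            n = n())

species_summary
```

The result has one row per remaining species, with the group mean and the number of rows that went into it.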

Exploring and characterizing the dataset

sgf <- read_csv("data/stiller_scales_data.csv")

## Parsed with column specification:
## cols(
##   subid = col_character(),
##   item = col_character(),
##   correct = col_double(),
##   age = col_double(),
##   condition = col_character()
## )

sgf

## # A tibble: 588 x 5
##    subid item   correct   age condition
##    <chr> <chr>    <dbl> <dbl> <chr>    
##  1 M22   faces        1  2    Label    
##  2 M22   houses       1  2    Label    
##  3 M22   pasta        0  2    Label    
##  4 M22   beds         0  2    Label    
##  5 T22   beds         0  2.13 Label    
##  6 T22   faces        0  2.13 Label    
##  7 T22   houses       1  2.13 Label    
##  8 T22   pasta        1  2.13 Label    
##  9 T17   pasta        0  2.32 Label    
## 10 T17   faces        0  2.32 Label    
## # … with 578 more rows

Inspect the various variables before you start any analysis. Lots of people recommend summary but TBH I don’t find it useful.

summary(sgf)

##     subid               item              correct            age       
##  Length:588         Length:588         Min.   :0.0000   Min.   :2.000  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.:2.850  
##  Mode  :character   Mode  :character   Median :0.0000   Median :3.460  
##                                        Mean   :0.4473   Mean   :3.525  
##                                        3rd Qu.:1.0000   3rd Qu.:4.290  
##                                        Max.   :1.0000   Max.   :4.960  
##   condition        
##  Length:588        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

This output just feels overwhelming and uninformative.

You can look at each variable by itself:

unique(sgf$condition)

## [1] "Label"    "No Label"

sgf$subid %>%
 unique %>%
 length

## [1] 147

Analyzing Psychology Data in the Tidyverse

Mike Frank

10/1/2019
