Starting note: The best reference for this material is Hadley Wickham and Garrett Grolemund’s R for Data Science (R4DS). My contribution here is to translate this reference for psychology.

Goals and Introduction

By the end of this tutorial, you will know:

This intro will describe a few concepts that you will need to know, using the famous iris dataset that comes with base R.

Data frames

The basic data structure we’re working with is the data frame, or tibble (in the tidyverse reimplementation). Data frames have rows and columns, and each column holds values of a single data type (numeric, character, factor, and so on). The implementation in Python’s pandas is distinct, but most of the concepts are the same.

iris is a data frame of measurements taken on individual iris flowers from three species. (Sepals are the leaf-like parts outside the petals that protect the flower bud before it opens; petals are the actual petals of the flower.)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
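
To see the rows, columns, and column types of iris directly, base R has a few inspection functions. This is a minimal sketch that uses only the built-in dataset; the names n_rows, n_cols, and col_types are introduced here for illustration.

```r
# iris ships with base R (package datasets), so no library() call is needed
n_rows <- nrow(iris)   # number of observations (flowers)
n_cols <- ncol(iris)   # number of variables

# class of each column: four numeric measurements and one factor
col_types <- sapply(iris, class)
print(col_types)

# str() gives a compact overview of dimensions, types, and first values
str(iris)
```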

Exercise. R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use $ to access a column, as in iris$Sepal.Length, or you can treat the data frame as a matrix, e.g. iris[1,1], or even as a list, as in iris[[1]]. You can also use names in place of numbers, e.g. iris[["Sepal.Length"]], or mix the two, e.g. iris[1, "Sepal.Length"]. Turn to your neighbor (and/or Google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).

library(tidyverse) # provides %>%, select(), slice(), pull(), read_csv(), and ggplot()
# calls to the iris dataset that all return the same cell (row 3, Petal.Length)
iris %>% select(Petal.Length) %>% slice(3) %>% pull()
## [1] 1.3
iris[["Petal.Length"]][3]
## [1] 1.3
iris[3,3]
## [1] 1.3

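Discussion. Why might some ways of doing this be better than others? One consideration is readability: a call that names the column says what it retrieves, while iris[3,3] does not.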
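
Beyond readability, name-based access also tends to be more robust when the layout of a data frame changes. The sketch below assumes a hypothetical situation in which the columns of iris have been reordered (iris_reordered is a made-up name); it is only meant to illustrate the point, not to prescribe one style.

```r
# Hypothetical scenario: someone reorders the columns so Species comes first
iris_reordered <- iris[, c(5, 1:4)]

# Position-based access now silently returns a different variable (Sepal.Width)
by_position <- iris_reordered[3, 3]

# Name-based access still returns the petal length of the third flower
by_name <- iris_reordered[["Petal.Length"]][3]

print(by_position)
print(by_name)
```

Here by_position no longer matches the 1.3 returned by the calls above, while by_name does.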

Tidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Here’s the basic idea: In tidy data, every row is a single observation (trial), and every column describes a variable with some value describing that trial.

And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:

“There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.”

iris is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
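
For contrast, here is a rough sketch of what reshaping a non-tidy table can look like. The data are invented for illustration (two hypothetical participants with made-up reaction times), and the code assumes tidyr 1.0.0 or later, where pivot_longer() is available. It is one possible approach, not the only one.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one row per participant, one column per condition
wide <- tribble(
  ~participant, ~rt_congruent, ~rt_incongruent,
  "p1",         512,           640,
  "p2",         478,           599
)

# Tidy (long) version: one row per participant-by-condition observation
long <- pivot_longer(
  wide,
  cols = c(rt_congruent, rt_incongruent),
  names_to = "condition",    # new column holding the old column names
  names_prefix = "rt_",      # strip the shared prefix from those names
  values_to = "rt"           # new column holding the measured values
)

print(long)
```

In the long version each row is a single observation, and condition is an explicit variable rather than something encoded in column names.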

Exercise. Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It’s a replication of a paper by Janiszewski & Uy (2008). Examine this dataset with your neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).

ju <- read_csv("data/janiszewski_rep_cleaned.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Reward = col_double(),
##   MaxAssignments = col_double(),
##   AssignmentDurationInSeconds = col_double(),
##   AutoApprovalDelayInSeconds = col_double(),
##   NumberOfSimilarHITs = col_double(),
##   LifetimeInSeconds = col_logical(),
##   WorkerId = col_double(),
##   ApprovalTime = col_logical(),
##   RejectionTime = col_logical(),
##   RequesterFeedback = col_logical(),
##   WorkTimeInSeconds = col_double(),
##   Input.price1 = col_double(),
##   Input.price2 = col_double(),
##   Input.price3 = col_double(),
##   Answer.dog_cost = col_double(),
##   Answer.plasma_cost = col_double(),
##   Answer.sushi_cost = col_double()
## )
## See spec(...) for full column specifications.
head(ju)
## # A tibble: 6 x 34
##   HITId HITTypeId Title Description Keywords Reward CreationTime MaxAssignments
##   <chr> <chr>     <chr> <chr>       <chr>     <dbl> <chr>                 <dbl>
## 1 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 2 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 3 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 4 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 5 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 6 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## # … with 26 more variables: RequesterAnnotation <chr>,
## #   AssignmentDurationInSeconds <dbl>, AutoApprovalDelayInSeconds <dbl>,
## #   Expiration <chr>, NumberOfSimilarHITs <dbl>, LifetimeInSeconds <lgl>,
## #   AssignmentId <chr>, WorkerId <dbl>, AssignmentStatus <chr>,
## #   AcceptTime <chr>, SubmitTime <chr>, AutoApprovalTime <chr>,
## #   ApprovalTime <lgl>, RejectionTime <lgl>, RequesterFeedback <lgl>,
## #   WorkTimeInSeconds <dbl>, LifetimeApprovalRate <chr>,
## #   Last30DaysApprovalRate <chr>, Last7DaysApprovalRate <chr>,
## #   Input.condition <chr>, Input.price1 <dbl>, Input.price2 <dbl>,
## #   Input.price3 <dbl>, Answer.dog_cost <dbl>, Answer.plasma_cost <dbl>,
## #   Answer.sushi_cost <dbl>

Functions and Pipes

Everything you typically want to do in statistical programming uses functions. mean is a good example. mean takes one required argument, a numeric vector.

mean(iris$Petal.Length)
## [1] 3.758

We’re going to call this applying the function mean to the variable Petal.Length.
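
Functions can also take optional arguments that change their behavior. As a small illustration (the vector x and the result names are made up for this example), mean() accepts na.rm, which controls whether missing values are dropped before averaging:

```r
x <- c(1, NA, 3)                  # made-up vector with one missing value

m_default <- mean(x)              # missing values propagate by default
m_clean <- mean(x, na.rm = TRUE)  # drop the NA first, then average 1 and 3

print(m_default)
print(m_clean)
```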

Pipes are a way to write strings of functions more easily. The pipe operator %>% takes the value on its left and passes it as the first argument to the function on its right, so x %>% f(y) means f(x, y). So you can write:

iris$Petal.Length %>% mean
## [1] 3.758

That’s not very useful yet, but when you start nesting functions, it gets better.

mean(unique(iris$Petal.Length))
## [1] 4.22093
iris$Petal.Length %>% unique() %>% mean(na.rm=TRUE)
## [1] 4.22093

or

round(mean(unique(iris$Petal.Length)), digits = 2)
## [1] 4.22
iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)
## [1] 4.22
# indenting makes things even easier to read
iris$Petal.Length %>% 
  unique %>% 
  mean %>% 
  round(digits = 2)
## [1] 4.22

This can be super helpful for writing strings of functions so that they are readable and distinct.

We’ll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
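
As a preview of what that looks like, here is a hedged sketch using group_by() and summarise() from dplyr, which this section has not introduced yet. The result names species_means and species_means_nested are made up for the example.

```r
library(dplyr)

# Mean petal length per species, written as a pipeline of dplyr verbs
species_means <- iris %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal.Length))

print(species_means)

# The same computation without pipes has to be read from the inside out
species_means_nested <- summarise(
  group_by(iris, Species),
  mean_petal_length = mean(Petal.Length)
)
```

Each verb takes the data frame as its first argument, which is what lets the pipe hand the result of one step to the next.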

Exercise. Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:

length(unique(iris$Species)) # number of species
## [1] 3

Piped version:

iris$Species %>% 
  unique() %>%
  length
## [1] 3

ggplot2 and tidy data

The last piece of our workflow here is going to be the addition of visualization elements. ggplot2 is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):

  • aes - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
  • geom - the plotting objects that represent the data (points, lines, shapes, etc.)

iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) + 
  geom_point()
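
Because the mapping and the geom are separate pieces, either can be swapped out independently. As a sketch (the choice of variables and the name p are just for illustration), the same data can be drawn as boxplots by changing the aesthetic mapping and the geom:

```r
library(ggplot2)

# Same data, different mapping and geom: petal length distribution by species
p <- ggplot(iris, aes(x = Species, y = Petal.Length, col = Species)) +
  geom_boxplot()

print(p)
```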

And just to let you know my biases, I like theme_few from ggthemes and scale_color_solarized as my palette.

iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) + 
  geom_point() + 
  ggthemes::theme_few() + 
  ggthemes::scale_color_solarized()