Starting note: The best reference for this material is Hadley Wickham’s R for Data Science (R4DS). My contribution here is to translate this reference for psychology.
By the end of this tutorial, you will know:
This intro will describe a few concepts that you will need to know, using the famous iris dataset that comes with base R.
The basic data structure we’re working with is the data frame, or tibble (in the tidyverse reimplementation). Data frames have rows and columns, and each column has a distinct data type. The implementation in Python’s pandas is distinct but most of the concepts are the same.
iris is a data frame showing the measurements of a bunch of individual iris flowers from different species. (Sepals are the things outside the petals that protect the petals while the flower is blooming; petals are the actual petals of the flower.)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Exercise. R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use $ to access a column, as in iris$Sepal.Length, or you can treat the data frame as a matrix, e.g. iris[1,1], or even as a list, as in iris[[1]]. You can also mix numeric references and named references, e.g. iris[["Sepal.Length"]]. Turn to your neighbor (and/or google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).
# fill me in with calls to the iris dataset that all return the same cell (third from the top, Petal Length).
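To get you started without giving away the answer, here is a sketch of the access styles described above, applied to a different cell: the sepal length of the first iris (row 1). The variable names v1 to v5 are just placeholders for illustration.

```r
# Five ways to pull out the same cell: Sepal.Length of the first iris (row 1)
v1 <- iris$Sepal.Length[1]        # column by name with $, then element 1
v2 <- iris[1, 1]                  # matrix-style: row 1, column 1
v3 <- iris[[1]][1]                # list-style: first column, then element 1
v4 <- iris[["Sepal.Length"]][1]   # list-style by name, then element 1
v5 <- iris[1, "Sepal.Length"]     # mixed: numeric row, named column

c(v1, v2, v3, v4, v5)             # all five are 5.1, matching head(iris)
```

Notice that the versions using a column name keep working even if someone reorders the columns, while the purely numeric ones do not.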
Discussion. Why might some ways of doing this be better than others?
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Here’s the basic idea: In tidy data, every row is a single observation (trial), and every column describes a variable with some value describing that trial.
And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:
“There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.”
iris is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
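For contrast, here is a minimal sketch of what an untidy dataset can look like and one way to tidy it. The data are made up for illustration (participant ids, conditions, and reaction times are all invented), and pivot_longer from tidyr is just one possible tool for the reshaping.

```r
library(tibble)
library(tidyr)

# Made-up "wide" data: one row per participant, one column per condition.
# Each row holds TWO observations, so this is not tidy.
wide <- tibble(
  subid = c("s1", "s2"),
  easy  = c(450, 520),
  hard  = c(610, 700)
)

# Reshape so that each row is a single observation
# (one participant in one condition).
long <- pivot_longer(
  wide,
  cols      = c(easy, hard),
  names_to  = "condition",
  values_to = "rt"
)

long  # 4 rows, with columns subid, condition, rt
```

In the long version, condition is a variable in its own column rather than being hidden in the column names.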
Exercise. Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It’s a replication of a paper by Janiszewski & Uy (2008). Examine this dataset with your next-door neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).
library(tidyverse) # provides read_csv, %>%, ggplot2, and dplyr
ju <- read_csv("data/janiszewski_rep_cleaned.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Reward = col_double(),
## MaxAssignments = col_double(),
## AssignmentDurationInSeconds = col_double(),
## AutoApprovalDelayInSeconds = col_double(),
## NumberOfSimilarHITs = col_double(),
## LifetimeInSeconds = col_logical(),
## WorkerId = col_double(),
## ApprovalTime = col_logical(),
## RejectionTime = col_logical(),
## RequesterFeedback = col_logical(),
## WorkTimeInSeconds = col_double(),
## Input.price1 = col_double(),
## Input.price2 = col_double(),
## Input.price3 = col_double(),
## Answer.dog_cost = col_double(),
## Answer.plasma_cost = col_double(),
## Answer.sushi_cost = col_double()
## )
## See spec(...) for full column specifications.
head(ju)
## # A tibble: 6 x 34
## HITId HITTypeId Title Description Keywords Reward CreationTime
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## 2 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## 3 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## 4 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## 5 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## 6 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 …
## # … with 27 more variables: MaxAssignments <dbl>,
## # RequesterAnnotation <chr>, AssignmentDurationInSeconds <dbl>,
## # AutoApprovalDelayInSeconds <dbl>, Expiration <chr>,
## # NumberOfSimilarHITs <dbl>, LifetimeInSeconds <lgl>,
## # AssignmentId <chr>, WorkerId <dbl>, AssignmentStatus <chr>,
## # AcceptTime <chr>, SubmitTime <chr>, AutoApprovalTime <chr>,
## # ApprovalTime <lgl>, RejectionTime <lgl>, RequesterFeedback <lgl>,
## # WorkTimeInSeconds <dbl>, LifetimeApprovalRate <chr>,
## # Last30DaysApprovalRate <chr>, Last7DaysApprovalRate <chr>,
## # Input.condition <chr>, Input.price1 <dbl>, Input.price2 <dbl>,
## # Input.price3 <dbl>, Answer.dog_cost <dbl>, Answer.plasma_cost <dbl>,
## # Answer.sushi_cost <dbl>
Everything you typically want to do in statistical programming uses functions. mean is a good example. In its simplest use, mean takes one argument, a numeric vector.
mean(iris$Petal.Length)
## [1] 3.758
We’re going to call this applying the function mean to the variable Petal.Length.
Pipes are a way to write strings of functions more easily. They bring the first argument of the function to the beginning. So you can write:
iris$Petal.Length %>% mean()
## [1] 3.758
That’s not very useful yet, but when you start nesting functions, it gets better.
mean(unique(iris$Petal.Length))
## [1] 4.22093
iris$Petal.Length %>% unique() %>% mean()
## [1] 4.22093
iris$Petal.Length %>% unique %>% mean
## [1] 4.22093
or
round(mean(unique(iris$Petal.Length)),
digits = 2)
## [1] 4.22
iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)
## [1] 4.22
# indenting makes things even easier to read
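The indented layout that the comment refers to might look like the sketch below: one function per line, with each continuation line indented. The name result is only there for illustration.

```r
library(magrittr)

# One step per line: read top to bottom as "take this, then do that"
result <- iris$Petal.Length %>%
  unique() %>%
  mean() %>%
  round(digits = 2)

result  # 4.22, the same value as the one-line version
```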
This can be super helpful for writing strings of functions so that they are readable and distinct.
We’ll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
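To make the "first argument" rule concrete, here is a small sketch. The pipe hands whatever is on its left to the function on its right as that function's first argument, and any other arguments stay inside the parentheses. The names nested and piped are just for illustration.

```r
library(magrittr)

# x %>% f(y) is the same call as f(x, y)
nested <- round(mean(iris$Petal.Length), digits = 1)

piped <- iris$Petal.Length %>%
  mean() %>%
  round(digits = 1)   # the piped value fills round()'s first argument

identical(nested, piped)  # TRUE
```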
Exercise. Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:
length(unique(iris$Species)) # number of species
## [1] 3
Piped version:
ggplot2 and tidy data
The last piece of our workflow here is going to be the addition of visualization elements. ggplot2 is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):
aes - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
geom - the plotting objects that represent the data (points, lines, shapes, etc.)
iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) +
  geom_point()
And just to let you know my biases, I like theme_few from ggthemes and scale_color_solarized as my palette.
iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) +
  geom_point() +
  ggthemes::theme_few() +
  ggthemes::scale_color_solarized()
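To see how aes and geom work as separate parts, here is a sketch that changes both for the same dataset: species goes on the x axis, and a boxplot geom replaces the points. This particular plot is my own illustrative choice, not one from the original analysis.

```r
library(ggplot2)

# Same tidy data, different mapping and different geom:
# Species on x, Petal.Length on y, drawn as one box per species.
p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

p
```

Because the data are tidy, switching plots only means changing which columns are named in aes and which geom is added.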
dplyr
Reference: R4DS Chapter 5
Let’s take a psychological dataset. Here are the raw data from Stiller, Goodman, & Frank (2015).
These data are tidy: each row describes a single trial, and each column describes some aspect of that trial, including the participant’s id (subid), age (age), condition (condition - “Label” is the experimental condition, “No Label” is the control), and item (item - which thing Furble was trying to find).
We are going to manipulate these data using “verbs” from dplyr. I’ll only teach four verbs, the most common in my workflow (but there are many other useful ones):
filter - remove rows by some logical condition
mutate - create new columns
group_by - group the data into subsets by some column
summarize - apply some function over columns in each group
sgf <- read_csv("data/stiller_scales_data.csv")
## Parsed with column specification:
## cols(
## subid = col_character(),
## item = col_character(),
## correct = col_double(),
## age = col_double(),
## condition = col_character()
## )
sgf
## # A tibble: 588 x 5
## subid item correct age condition
## <chr> <chr> <dbl> <dbl> <chr>
## 1 M22 faces 1 2 Label
## 2 M22 houses 1 2 Label
## 3 M22 pasta 0 2 Label
## 4 M22 beds 0 2 Label
## 5 T22 beds 0 2.13 Label
## 6 T22 faces 0 2.13 Label
## 7 T22 houses 1 2.13 Label
## 8 T22 pasta 1 2.13 Label
## 9 T17 pasta 0 2.32 Label
## 10 T17 faces 0 2.32 Label
## # … with 578 more rows
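As a preview, here is a minimal sketch of the four verbs chained together. So that it runs on its own, it uses a tiny made-up tibble with the same column names as sgf. All values are invented, and the age cutoff and the age_group column are assumptions chosen only for illustration.

```r
library(dplyr)

# Made-up toy data with the same columns as sgf (values are invented)
toy <- tibble(
  subid     = c("A1", "A1", "B2", "B2", "C3", "C3"),
  item      = c("faces", "beds", "faces", "beds", "faces", "beds"),
  correct   = c(1, 0, 1, 1, 0, 0),
  age       = c(2.5, 2.5, 3.4, 3.4, 4.2, 4.2),
  condition = c("Label", "Label", "Label", "Label", "No Label", "No Label")
)

toy_summary <- toy %>%
  filter(age >= 3) %>%                  # keep only rows where age is 3 or more
  mutate(age_group = floor(age)) %>%    # new column: age in whole years
  group_by(condition, age_group) %>%    # one group per condition and age group
  summarize(prop_correct = mean(correct), .groups = "drop")

toy_summary  # one row per condition and age group
```

Each verb takes a data frame as its first argument and returns a data frame, which is why they chain together so cleanly with the pipe.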
Inspect the various variables before you start any analysis. Lots of people recommend summary but TBH I don’t find it useful.
summary(sgf)
## subid item correct age
## Length:588 Length:588 Min. :0.0000 Min. :2.000
## Class :character Class :character 1st Qu.:0.0000 1st Qu.:2.850
## Mode :character Mode :character Median :0.0000 Median :3.460
## Mean :0.4473 Mean :3.525
## 3rd Qu.:1.0000 3rd Qu.:4.290
## Max. :1.0000 Max. :4.960
## condition
## Length:588
## Class :character
## Mode :character
##
##
##
This output just feels overwhelming and uninformative.
You can look at each variable by itself:
unique(sgf$condition)
## [1] "Label" "No Label"
sgf$subid %>%
unique %>%
length
## [1] 147