Starting note: The best reference for this material is Hadley Wickham’s R for Data Science (R4DS). My contribution here is to translate this reference for psychology.

Goals and Introduction

By the end of this tutorial, you will know:

What “tidy data” is and why it’s an awesome format
How to do some stuff with tidy data
How to get your data to be tidy
Some tips’n’tricks for dealing with “medium data” in R

This intro will describe a few concepts that you will need to know, using the famous iris dataset that ships with base R (in the datasets package).

Data frames

The basic data structure we’re working with is the data frame, or tibble (in the tidyverse reimplementation). Data frames have rows and columns, and each column holds values of a single data type. The implementation in Python’s pandas is distinct but most of the concepts are the same.
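As a quick sketch of what "one type per column" means, the snippet below inspects the built-in iris data frame and converts it to a tibble. It assumes the tibble package is installed; the object names col_classes and iris_tbl are just illustrative.

```r
library(tibble)

# Each column of a data frame is a vector with a single type
col_classes <- sapply(iris, class)
col_classes

# A tibble is still a data frame, with stricter behaviour and tidier printing
iris_tbl <- as_tibble(iris)
class(iris_tbl)
dim(iris_tbl)
```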

iris is a data frame containing measurements of individual iris flowers from three different species. (Sepals are the leaf-like parts outside the petals that protect the flower bud before it opens; petals are the actual petals of the flower.)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Exercise. R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use $ to access a column, as in iris$Sepal.Length or you can treat the data frame as a matrix, e.g. iris[1,1] or even as a list, as in iris[[1]]. You can also mix numeric references and named references, e.g. iris[["Sepal.Length"]]. Turn to your neighbor (and/or google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).

# fill me in with calls to the iris dataset that all return the same cell (third from the top, Petal Length).
iris %>% select(Petal.Length) %>% slice(3) %>% pull()

## [1] 1.3

iris[["Petal.Length"]][3]

## [1] 1.3

iris[3,3]

## [1] 1.3

Discussion. Why might some ways of doing this be better than others? (One consideration: readability.)

Tidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Here’s the basic idea: In tidy data, every row is a single observation (trial), and every column describes a variable with some value describing that trial.

And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:

“There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.”

iris is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
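To make the contrast concrete, here is a minimal sketch with a made-up dataset (the subject ids, column names, and values are hypothetical, not from any real study). The first layout stores one subject per row with trials spread across columns; the second has one row per observation. It uses gather from tidyr, which is covered properly later in this tutorial.

```r
library(tibble)
library(tidyr)

# Hypothetical "messy" layout: one row per subject, one column per trial
messy <- tribble(
  ~subid, ~trial_1, ~trial_2,
  "S1",   1,        0,
  "S2",   1,        1
)

# Tidy layout: one row per observation (a subject on a given trial)
tidy <- gather(messy, trial, correct, trial_1, trial_2)
tidy
```

In the tidy version, "which trial" becomes a value in a column rather than part of a column name, so it can be filtered, grouped, and plotted like any other variable.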

Exercise. Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It’s a replication of a paper by Janiszewski & Uy (2008). Examine this dataset with your nextdoor neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).

ju <- read_csv("data/janiszewski_rep_cleaned.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Reward = col_double(),
##   MaxAssignments = col_double(),
##   AssignmentDurationInSeconds = col_double(),
##   AutoApprovalDelayInSeconds = col_double(),
##   NumberOfSimilarHITs = col_double(),
##   LifetimeInSeconds = col_logical(),
##   WorkerId = col_double(),
##   ApprovalTime = col_logical(),
##   RejectionTime = col_logical(),
##   RequesterFeedback = col_logical(),
##   WorkTimeInSeconds = col_double(),
##   Input.price1 = col_double(),
##   Input.price2 = col_double(),
##   Input.price3 = col_double(),
##   Answer.dog_cost = col_double(),
##   Answer.plasma_cost = col_double(),
##   Answer.sushi_cost = col_double()
## )

## See spec(...) for full column specifications.

head(ju)

## # A tibble: 6 x 34
##   HITId HITTypeId Title Description Keywords Reward CreationTime MaxAssignments
##   <chr> <chr>     <chr> <chr>       <chr>     <dbl> <chr>                 <dbl>
## 1 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 2 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 3 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 4 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 5 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## 6 261W… 2DVGTP9R… How … A quick tw… survey,…    0.1 Wed Jan 25 …             30
## # … with 26 more variables: RequesterAnnotation <chr>,
## #   AssignmentDurationInSeconds <dbl>, AutoApprovalDelayInSeconds <dbl>,
## #   Expiration <chr>, NumberOfSimilarHITs <dbl>, LifetimeInSeconds <lgl>,
## #   AssignmentId <chr>, WorkerId <dbl>, AssignmentStatus <chr>,
## #   AcceptTime <chr>, SubmitTime <chr>, AutoApprovalTime <chr>,
## #   ApprovalTime <lgl>, RejectionTime <lgl>, RequesterFeedback <lgl>,
## #   WorkTimeInSeconds <dbl>, LifetimeApprovalRate <chr>,
## #   Last30DaysApprovalRate <chr>, Last7DaysApprovalRate <chr>,
## #   Input.condition <chr>, Input.price1 <dbl>, Input.price2 <dbl>,
## #   Input.price3 <dbl>, Answer.dog_cost <dbl>, Answer.plasma_cost <dbl>,
## #   Answer.sushi_cost <dbl>

Functions and Pipes

Everything you typically want to do in statistical programming uses functions. mean is a good example. Its main argument is a numeric vector (it also accepts optional arguments such as na.rm).

mean(iris$Petal.Length)

## [1] 3.758

We’re going to call this applying the function mean to the variable Petal.Length.

Pipes are a way to write strings of functions more easily. They move the first argument of a function out in front of the function call. So you can write:

iris$Petal.Length %>% mean

## [1] 3.758

That’s not very useful yet, but when you start nesting functions, it gets better.

mean(unique(iris$Petal.Length))

## [1] 4.22093

iris$Petal.Length %>% unique() %>% mean(na.rm=TRUE)

## [1] 4.22093

round(mean(unique(iris$Petal.Length)), digits = 2)

## [1] 4.22

iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)

## [1] 4.22

# indenting makes things even easier to read
iris$Petal.Length %>% 
  unique %>% 
  mean %>% 
  round(digits = 2)

## [1] 4.22

This can be super helpful for writing strings of functions so that they are readable and distinct.

We’ll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
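The rewriting rule behind the pipe can be sketched as follows: x %>% f(y) is evaluated as f(x, y). The example below uses a made-up vector x and loads magrittr directly (dplyr and the tidyverse also provide %>%). The last lines use the native pipe |>, which assumes R 4.1 or later.

```r
library(magrittr)  # provides %>%

x <- c(1.234, 5.678, 9.1011)

# The left-hand side becomes the first argument of the call on the right
nested <- round(mean(x), digits = 1)
piped  <- x %>% mean() %>% round(digits = 1)
identical(nested, piped)

# R 4.1+ also has a built-in pipe that behaves the same way in this case
native <- x |> mean() |> round(digits = 1)
identical(nested, native)
```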

Exercise. Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:

length(unique(iris$Species)) # number of species

## [1] 3

Piped version:

iris$Species %>% 
  unique() %>%
  length

## [1] 3

`ggplot2` and tidy data

The last piece of our workflow here is going to be the addition of visualization elements. ggplot2 is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):

aes - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
geom - the plotting objects that represent the data (points, lines, shapes, etc.)

iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) + 
  geom_point()
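Because the aesthetic mapping and the geom are separate pieces, the same mapping can be reused with different geoms. A small sketch, assuming ggplot2 is installed (the object names are illustrative):

```r
library(ggplot2)

# One aesthetic mapping, stored once
base_plot <- ggplot(iris, aes(x = Species, y = Petal.Length, col = Species))

# The same data-to-visual mapping drawn with two different geoms
points_plot  <- base_plot + geom_jitter(width = 0.2, height = 0)
boxplot_plot <- base_plot + geom_boxplot()

points_plot
boxplot_plot
```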

And just to let you know my biases, I like theme_few from ggthemes and scale_color_solarized as my palette.

iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) + 
  geom_point() + 
  ggthemes::theme_few() + 
  ggthemes::scale_color_solarized()

Tidy Data Analysis with `dplyr`

Reference: R4DS Chapter 5

Let’s take a psychological dataset. Here are the raw data from [Stiller, Goodman, & Frank (2015)].

These data are tidy: each row describes a single trial, and each column describes some aspect of that trial, including the participant’s id (subid), age (age), condition (condition - “Label” is the experimental condition, “No Label” is the control), and item (item - which thing furble was trying to find).

We are going to manipulate these data using “verbs” from dplyr. I’ll only teach four verbs, the most common in my workflow (but there are many other useful ones):

filter - keep only the rows that meet some logical condition (removing the rest)
mutate - create new columns
group_by - group the data into subsets by some column
summarize - apply some function over columns in each group

Exploring and characterizing the dataset

sgf <- read_csv("data/stiller_scales_data.csv")

## Parsed with column specification:
## cols(
##   subid = col_character(),
##   item = col_character(),
##   correct = col_double(),
##   age = col_double(),
##   condition = col_character()
## )

sgf

## # A tibble: 588 x 5
##    subid item   correct   age condition
##    <chr> <chr>    <dbl> <dbl> <chr>    
##  1 M22   faces        1  2    Label    
##  2 M22   houses       1  2    Label    
##  3 M22   pasta        0  2    Label    
##  4 M22   beds         0  2    Label    
##  5 T22   beds         0  2.13 Label    
##  6 T22   faces        0  2.13 Label    
##  7 T22   houses       1  2.13 Label    
##  8 T22   pasta        1  2.13 Label    
##  9 T17   pasta        0  2.32 Label    
## 10 T17   faces        0  2.32 Label    
## # … with 578 more rows

Inspect the various variables before you start any analysis. Lots of people recommend summary but TBH I don’t find it useful.

summary(sgf)

##     subid               item              correct            age       
##  Length:588         Length:588         Min.   :0.0000   Min.   :2.000  
##  Class :character   Class :character   1st Qu.:0.0000   1st Qu.:2.850  
##  Mode  :character   Mode  :character   Median :0.0000   Median :3.460  
##                                        Mean   :0.4473   Mean   :3.525  
##                                        3rd Qu.:1.0000   3rd Qu.:4.290  
##                                        Max.   :1.0000   Max.   :4.960  
##   condition        
##  Length:588        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

This output just feels overwhelming and uninformative.

You can look at each variable by itself:

unique(sgf$condition)

## [1] "Label"    "No Label"

sgf$subid %>%
 unique %>%
 length

## [1] 147

Or use interactive tools like View or DT::datatable (which I really like).

View(sgf)


DT::datatable(sgf)

Filtering & Mutating

There are lots of reasons you might want to remove rows from your dataset, including getting rid of outliers, selecting subpopulations, etc. filter is a verb (function) that takes a data frame as its first argument, and then as its second takes the condition you want to filter on.

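So if you wanted to look only at two-year-olds, you could filter on age. You can give the two conditions separated by a comma, or join them with &, as in age > 2 & age < 3. Without pipes, the equivalent call is filter(sgf, age > 2, age < 3). (Note that the strict inequality age > 2 excludes any child whose age is recorded as exactly 2.)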

Note that we’re going to be using pipes with functions over data frames here. The way this works is that:

dplyr verbs always take the data frame as their first argument, and
because pipes pull out the first argument, the data frame just gets passed through successive operations
so you can read a pipe chain as “take this data frame and first do this, then do this, then do that.”

This is essentially the huge insight of dplyr: you can chain verbs into readable and efficient sequences of operations over dataframes, provided 1) the verbs all have the same syntax (which they do) and 2) the data all have the same structure (which they do if they are tidy).
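As a rough illustration of such a chain, here is a sketch on the built-in iris data (chosen so the example runs without any data files; the 1.5 cutoff and the petal_ratio column are arbitrary choices for illustration).

```r
library(dplyr)

# Read as: take iris, keep the longer-petaled flowers, add a ratio column,
# then average that ratio within each species
petal_summary <- iris %>%
  filter(Petal.Length > 1.5) %>%
  mutate(petal_ratio = Petal.Length / Petal.Width) %>%
  group_by(Species) %>%
  summarise(mean_ratio = mean(petal_ratio), n = n())

petal_summary
```

Each verb receives the data frame produced by the previous step as its first argument, which is why none of the calls mention iris after the first line.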

OK, so filtering:

sgf %>%
  filter(age > 2, 
         age < 3)

## # A tibble: 188 x 5
##    subid item   correct   age condition
##    <chr> <chr>    <dbl> <dbl> <chr>    
##  1 T22   beds         0  2.13 Label    
##  2 T22   faces        0  2.13 Label    
##  3 T22   houses       1  2.13 Label    
##  4 T22   pasta        1  2.13 Label    
##  5 T17   pasta        0  2.32 Label    
##  6 T17   faces        0  2.32 Label    
##  7 T17   houses       0  2.32 Label    
##  8 T17   beds         0  2.32 Label    
##  9 M3    faces        0  2.38 Label    
## 10 M3    houses       1  2.38 Label    
## # … with 178 more rows
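The boundary behaviour of these conditions is easy to check on a toy example. The sketch below uses made-up ages (not the sgf data) to show that comma-separated conditions act like &, and that dplyr's between() keeps both endpoints.

```r
library(dplyr)

# Hypothetical ages, including children at exactly 2 and exactly 3 years
kids <- tibble(subid = c("A", "B", "C", "D"),
               age   = c(2, 2.5, 3, 3.4))

# Comma-separated conditions are combined with a logical AND
strict_comma <- kids %>% filter(age > 2, age < 3)
strict_amp   <- kids %>% filter(age > 2 & age < 3)
all.equal(strict_comma, strict_amp)

# between() includes both endpoints, so ages 2 and 3 are kept as well
inclusive <- kids %>% filter(between(age, 2, 3))
inclusive$subid
```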

Exercise. Keep only the “faces” trials in the “Label” condition.

sgf %>%
  filter(condition == "Label", 
         item == "faces")

## # A tibble: 75 x 5
##    subid item  correct   age condition
##    <chr> <chr>   <dbl> <dbl> <chr>    
##  1 M22   faces       1  2    Label    
##  2 T22   faces       0  2.13 Label    
##  3 T17   faces       0  2.32 Label    
##  4 M3    faces       0  2.38 Label    
##  5 T19   faces       0  2.47 Label    
##  6 T20   faces       1  2.5  Label    
##  7 T21   faces       1  2.58 Label    
##  8 M26   faces       1  2.59 Label    
##  9 T18   faces       1  2.61 Label    
## 10 T12   faces       0  2.72 Label    
## # … with 65 more rows

sgf[sgf$condition == "Label" & sgf$item == "faces", ] # all the columns

## # A tibble: 75 x 5
##    subid item  correct   age condition
##    <chr> <chr>   <dbl> <dbl> <chr>    
##  1 M22   faces       1  2    Label    
##  2 T22   faces       0  2.13 Label    
##  3 T17   faces       0  2.32 Label    
##  4 M3    faces       0  2.38 Label    
##  5 T19   faces       0  2.47 Label    
##  6 T20   faces       1  2.5  Label    
##  7 T21   faces       1  2.58 Label    
##  8 M26   faces       1  2.59 Label    
##  9 T18   faces       1  2.61 Label    
## 10 T12   faces       0  2.72 Label    
## # … with 65 more rows

There are also times when you want to add or remove columns. You might want to remove columns to simplify the dataset. There’s not much to simplify here, but if you wanted to do that, the verb is select.

sgf %>%
  select(subid, age, correct)

## # A tibble: 588 x 3
##    subid   age correct
##    <chr> <dbl>   <dbl>
##  1 M22    2          1
##  2 M22    2          1
##  3 M22    2          0
##  4 M22    2          0
##  5 T22    2.13       0
##  6 T22    2.13       0
##  7 T22    2.13       1
##  8 T22    2.13       1
##  9 T17    2.32       0
## 10 T17    2.32       0
## # … with 578 more rows

sgf %>%
  select(-condition)

## # A tibble: 588 x 4
##    subid item   correct   age
##    <chr> <chr>    <dbl> <dbl>
##  1 M22   faces        1  2   
##  2 M22   houses       1  2   
##  3 M22   pasta        0  2   
##  4 M22   beds         0  2   
##  5 T22   beds         0  2.13
##  6 T22   faces        0  2.13
##  7 T22   houses       1  2.13
##  8 T22   pasta        1  2.13
##  9 T17   pasta        0  2.32
## 10 T17   faces        0  2.32
## # … with 578 more rows

sgf %>%
  select(1)

## # A tibble: 588 x 1
##    subid
##    <chr>
##  1 M22  
##  2 M22  
##  3 M22  
##  4 M22  
##  5 T22  
##  6 T22  
##  7 T22  
##  8 T22  
##  9 T17  
## 10 T17  
## # … with 578 more rows

sgf %>%
  select(starts_with("sub"))

## # A tibble: 588 x 1
##    subid
##    <chr>
##  1 M22  
##  2 M22  
##  3 M22  
##  4 M22  
##  5 T22  
##  6 T22  
##  7 T22  
##  8 T22  
##  9 T17  
## 10 T17  
## # … with 578 more rows

# learn about this with ?select

Perhaps more useful is adding columns. You might do this to compute some kind of derived variable. mutate is the verb for these situations - it allows you to add a column. Let’s add a discrete age group factor to our dataset.

sgf <- sgf %>%
  mutate(age_group = cut(age, 2:5, include.lowest = TRUE), 
         age_group_halfyear = cut(age, seq(2,5,.5), include.lowest = TRUE))

# sgf$age_group <- cut(sgf$age, 2:5, include.lowest = TRUE)
# sgf$age_group <- with(sgf, cut(age, 2:5, include.lowest = TRUE))

head(sgf$age_group)

## [1] [2,3] [2,3] [2,3] [2,3] [2,3] [2,3]
## Levels: [2,3] (3,4] (4,5]
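The include.lowest argument matters here because cut builds intervals that are open on the left by default. A small base R sketch with made-up ages shows what happens to a child who is exactly 2:

```r
ages <- c(2, 2.5, 3, 3.01, 5)

# Default intervals are (2,3], (3,4], (4,5]: an age of exactly 2 fits none of them
without_lowest <- cut(ages, breaks = 2:5)

# include.lowest = TRUE closes the first interval on the left: [2,3]
with_lowest <- cut(ages, breaks = 2:5, include.lowest = TRUE)

data.frame(ages, without_lowest, with_lowest)
```

Without include.lowest = TRUE, the age of exactly 2 comes back as NA and would silently drop out of any grouped summary.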

Standard psychological descriptives

We typically describe datasets at the level of subjects, not trials. We need two verbs to get a summary at the level of subjects: group_by and summarise (kiwi spelling). Grouping alone doesn’t do much.

sgf %>%
  group_by(age_group)

## # A tibble: 588 x 7
## # Groups:   age_group [3]
##    subid item   correct   age condition age_group age_group_halfyear
##    <chr> <chr>    <dbl> <dbl> <chr>     <fct>     <fct>             
##  1 M22   faces        1  2    Label     [2,3]     [2,2.5]           
##  2 M22   houses       1  2    Label     [2,3]     [2,2.5]           
##  3 M22   pasta        0  2    Label     [2,3]     [2,2.5]           
##  4 M22   beds         0  2    Label     [2,3]     [2,2.5]           
##  5 T22   beds         0  2.13 Label     [2,3]     [2,2.5]           
##  6 T22   faces        0  2.13 Label     [2,3]     [2,2.5]           
##  7 T22   houses       1  2.13 Label     [2,3]     [2,2.5]           
##  8 T22   pasta        1  2.13 Label     [2,3]     [2,2.5]           
##  9 T17   pasta        0  2.32 Label     [2,3]     [2,2.5]           
## 10 T17   faces        0  2.32 Label     [2,3]     [2,2.5]           
## # … with 578 more rows

All it does is add a grouping marker.

What summarise does is to apply a function to a part of the dataset to create a new summary dataset. So we can apply the function mean to the dataset and get the grand mean.

## DO NOT DO THIS!!!
# foo <- initialize_the_thing_being_bound()
# for (i in 1:length(unique(sgf$item))) {
#   for (j in 1:length(unique(sgf$condition))) {
#     this_data <- sgf[sgf$item == unique(sgf$item)[i] & 
#                       sgf$condition == unique(sgf$condition)[j],]
#     do_a_thing(this_data)
#     bind_together_somehow(this_data)
#   }
# }

sgf %>%
  summarise(correct = mean(correct))

## # A tibble: 1 x 1
##   correct
##     <dbl>
## 1   0.447

Note the syntax here: summarise takes one or more new_column_name = function_to_be_applied_to_data(data_column) entries, separated by commas. Using this syntax, we can also create more elaborate summary datasets:

sgf %>%
  summarise(correct = mean(correct), 
            n_observations = length(subid))

## # A tibble: 1 x 2
##   correct n_observations
##     <dbl>          <int>
## 1   0.447            588

Where these two verbs shine is in combination, though, because summarise applies functions to columns within each group of your grouped data, not just to the whole dataset!

So we can group by age or condition or whatever else we want and then carry out the same procedure, and all of a sudden we are doing something extremely useful!

sgf_means <- sgf %>%
  group_by(age_group, condition) %>%
  summarise(correct = mean(correct), 
            n_observations = length(subid))

## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)

sgf_means

## # A tibble: 6 x 4
## # Groups:   age_group [3]
##   age_group condition correct n_observations
##   <fct>     <chr>       <dbl>          <int>
## 1 [2,3]     Label       0.570            100
## 2 [2,3]     No Label    0.240             96
## 3 (3,4]     Label       0.712            104
## 4 (3,4]     No Label    0.240             96
## 5 (4,5]     Label       0.771             96
## 6 (4,5]     No Label    0.125             96
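The regrouping message printed above mentions a `.groups` argument. The sketch below, on a small made-up dataset in the same shape as sgf, shows one way to use it, along with n() as an alternative to length(subid) for counting rows. It assumes dplyr 1.0.0 or later, where `.groups` is available.

```r
library(dplyr)

# Hypothetical trial-level data (not the real sgf values)
trials <- tibble(
  subid     = c("S1", "S1", "S2", "S2", "S3", "S3"),
  condition = c("Label", "Label", "Label", "Label", "No Label", "No Label"),
  correct   = c(1, 0, 1, 1, 0, 0)
)

cond_means <- trials %>%
  group_by(condition) %>%
  summarise(correct = mean(correct),
            n_observations = n(),   # n() counts the rows in each group
            .groups = "drop")       # return an ungrouped tibble

cond_means
```

Dropping the grouping explicitly avoids surprises when the summary table is passed on to further verbs.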

These summary data are typically very useful for plotting.

ggplot(sgf_means, 
       aes(x = age_group, y = correct, col = condition, group = condition)) + 
  geom_line() + 
  ylim(0,1) +
  ggthemes::theme_few()

# sgf %>%
#   mutate(age_group) %>%
#   group_by() %>%
#   summarise %>%
#   ggplot()

Exercise. One of the most important analytic workflows for psychological data is to take some function (e.g., the mean) for each participant and then look at grand means and variability across participant means. This analytic workflow requires grouping, summarising, and then grouping again and summarising again! Use dplyr to make the same table as above (sgf_means) but with means (and SDs if you want) computed across subject means, not across all data points. (The means will be pretty similar as this is a balanced design but in a case with lots of missing data, they will vary. In contrast, the SD doesn’t even really make sense across the binary data before you aggregate across subjects.)

# exercise
sgf_sub_means <- sgf %>%
  group_by(age_group, condition, subid) %>%
  summarise(correct = mean(correct))

## `summarise()` regrouping output by 'age_group', 'condition' (override with `.groups` argument)

sgf_grand_means <- sgf_sub_means %>%
  group_by(age_group, condition) %>%
  summarise(mean_correct = mean(correct), 
            sd_correct = sd(correct))

## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)
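To see why the two approaches can diverge when the design is not balanced, consider a made-up example (hypothetical subjects and values) in which one subject contributes four trials and another only one:

```r
library(dplyr)

# Hypothetical unbalanced data: S1 has four trials, S2 has one
unbalanced <- tibble(
  subid   = c("S1", "S1", "S1", "S1", "S2"),
  correct = c(1, 1, 1, 1, 0)
)

# Mean over all trials: S1 dominates because it has more rows
trial_mean <- unbalanced %>%
  summarise(correct = mean(correct)) %>%
  pull(correct)

# Mean over subject means: each subject counts once
subject_mean <- unbalanced %>%
  group_by(subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(correct = mean(correct)) %>%
  pull(correct)

c(trial_mean = trial_mean, subject_mean = subject_mean)
```

Here the trial-level mean is 4/5 while the mean of subject means is 1/2, because the second approach weights each participant equally.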

Getting to Tidy with `tidyr`

Reference: R4DS Chapter 12

Psychological data often comes in two flavors: long and wide data. Long form data is tidy, but that format is less common. It’s much more common to get wide data, in which every row is a case (e.g., a subject), and each column is a variable. In this format multiple trials (observations) are stored as columns.

This can go a bunch of ways, for example, the most common might be to have subjects as rows and trials as columns. But here’s an example from a real dataset on “unconscious arithmetic” from Sklar et al. (2012). In it, items (particular arithmetic problems) are rows and subjects are columns.

sklar <- read_csv("data/sklar_data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   prime = col_character(),
##   congruent = col_character(),
##   operand = col_character()
## )

## See spec(...) for full column specifications.

head(sklar)

## # A tibble: 6 x 28
##   prime prime.result target congruent operand distance counterbalance   `1`
##   <chr>        <dbl>  <dbl> <chr>     <chr>      <dbl>          <dbl> <dbl>
## 1 =1+2…            8      9 no        A             -1              1   597
## 2 =1+3…            9     11 no        A             -2              1   699
## 3 =1+4…            8     12 no        A             -4              1   700
## 4 =1+6…           10     12 no        A             -2              1   628
## 5 =1+9…           12     11 no        A              1              1   768
## 6 =1+9…           13     12 no        A              1              1   595
## # … with 20 more variables: `2` <dbl>, `3` <dbl>, `4` <dbl>, `5` <dbl>,
## #   `6` <dbl>, `7` <dbl>, `8` <dbl>, `9` <dbl>, `10` <dbl>, `11` <dbl>,
## #   `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
## #   `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>

Tidy verbs

The two main verbs for tidying are gather and spread. (There are lots of others in the tidyr package if you want to split or merge columns etc.).

First, let’s go away from tidiness. We’re going to spread a tidy dataset. Remember that tidy data has one observation in each row, but we want to “spread” it out so it’s wide. (The metaphor works better in this description). This may not be helpful, but I think of the data as a long cream cheese pat, and I “spread” it over a wide bagel.

Let’s try it on the SGF data above. First we’ll spread it so it’s wide. I do this by indicating what column is going to be the column labels in the new data frame, here it’s item, and what column is going to have the values in those columns, here it’s correct:

sgf_wide <- sgf %>% 
  spread(item, correct)
head(sgf_wide)

## # A tibble: 6 x 9
##   subid   age condition age_group age_group_halfyear  beds faces houses pasta
##   <chr> <dbl> <chr>     <fct>     <fct>              <dbl> <dbl>  <dbl> <dbl>
## 1 C1     4.16 Label     (4,5]     (4,4.5]                1     1      1     1
## 2 C10    3.46 Label     (3,4]     (3,3.5]                1     0      0     1
## 3 C11    4.22 Label     (4,5]     (4,4.5]                1     1      0     1
## 4 C12    3.56 Label     (3,4]     (3.5,4]                1     1      0     1
## 5 C13    4.38 Label     (4,5]     (4,4.5]                1     0      1     0
## 6 C14    4.57 Label     (4,5]     (4.5,5]                1     1      1     0

Now you can see that there is no explicit specification that all those item columns, e.g. faces, beds are holding correct values, but the data are much more compact. (This form is easy to work with in Excel, so that’s probably why people use it in psych).

OK, let’s go back to our original format. gather is about making wide data into tidy (long) data. When you gather a dataset you are “gathering” a bunch of columns (maybe that you previously spread). You specify what all the columns have in common (e.g., in the Sklar example above, they are all subject ids), and you say what measure they all contain (there, they all hold RTs). So in that sense, it’s the flip of spread. You did spread(item, correct) and now you’ll gather(item, correct, ...). The one extra argument is that you need to specify the columns that will go into item!

sgf_long <- sgf_wide %>% 
  gather(item, correct, beds, faces, houses, pasta)
head(sgf_long)

## # A tibble: 6 x 7
##   subid   age condition age_group age_group_halfyear item  correct
##   <chr> <dbl> <chr>     <fct>     <fct>              <chr>   <dbl>
## 1 C1     4.16 Label     (4,5]     (4,4.5]            beds        1
## 2 C10    3.46 Label     (3,4]     (3,3.5]            beds        1
## 3 C11    4.22 Label     (4,5]     (4,4.5]            beds        1
## 4 C12    3.56 Label     (3,4]     (3.5,4]            beds        1
## 5 C13    4.38 Label     (4,5]     (4,4.5]            beds        1
## 6 C14    4.57 Label     (4,5]     (4.5,5]            beds        1

head(sgf)

## # A tibble: 6 x 7
##   subid item   correct   age condition age_group age_group_halfyear
##   <chr> <chr>    <dbl> <dbl> <chr>     <fct>     <fct>             
## 1 M22   faces        1  2    Label     [2,3]     [2,2.5]           
## 2 M22   houses       1  2    Label     [2,3]     [2,2.5]           
## 3 M22   pasta        0  2    Label     [2,3]     [2,2.5]           
## 4 M22   beds         0  2    Label     [2,3]     [2,2.5]           
## 5 T22   beds         0  2.13 Label     [2,3]     [2,2.5]           
## 6 T22   faces        0  2.13 Label     [2,3]     [2,2.5]

There are lots of flexible ways to specify these columns - you can enumerate their names like I did, give their positions, or use a helper like starts_with:

# gather(item, correct, 6:9)  # the item columns are columns 6 to 9 of sgf_wide
# gather(item, correct, starts_with("foo"))
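A note for readers on recent versions of tidyr: since tidyr 1.0.0, gather and spread are superseded by pivot_longer and pivot_wider (the older verbs still work). The sketch below shows the same round trip with the newer functions on a small made-up dataset; the subject ids and values are hypothetical.

```r
library(tidyr)
library(tibble)

# Hypothetical long data in the same shape as sgf
long <- tribble(
  ~subid, ~item,   ~correct,
  "S1",   "beds",  1,
  "S1",   "faces", 0,
  "S2",   "beds",  0,
  "S2",   "faces", 1
)

# pivot_wider() plays the role of spread(item, correct)
wide <- pivot_wider(long, names_from = item, values_from = correct)

# pivot_longer() plays the role of gather(item, correct, beds, faces)
long_again <- pivot_longer(wide, cols = c(beds, faces),
                           names_to = "item", values_to = "correct")

wide
long_again
```

The named arguments (names_from, values_from, names_to, values_to) spell out which column supplies the new column names and which supplies the cell values, which some people find easier to remember than the key/value order of gather and spread.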

Exercise. Take the Sklar data from above, where each column is a separate subject, and gather it so that it’s a tidy dataset. What challenges come up?

sklar

## # A tibble: 154 x 28
##    prime prime.result target congruent operand distance counterbalance   `1`
##    <chr>        <dbl>  <dbl> <chr>     <chr>      <dbl>          <dbl> <dbl>
##  1 =1+2…            8      9 no        A             -1              1   597
##  2 =1+3…            9     11 no        A             -2              1   699
##  3 =1+4…            8     12 no        A             -4              1   700
##  4 =1+6…           10     12 no        A             -2              1   628
##  5 =1+9…           12     11 no        A              1              1   768
##  6 =1+9…           13     12 no        A              1              1   595
##  7 =2+1…           12     11 no        A              1              1   664
##  8 =2+3…           11     10 no        A              1              1   803
##  9 =2+3…           12     11 no        A              1              1   767
## 10 =2+5…           13      9 no        A              4              1   700
## # … with 144 more rows, and 20 more variables: `2` <dbl>, `3` <dbl>, `4` <dbl>,
## #   `5` <dbl>, `6` <dbl>, `7` <dbl>, `8` <dbl>, `9` <dbl>, `10` <dbl>,
## #   `11` <dbl>, `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>,
## #   `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>

sklar_tidy <- sklar %>%
  gather(subid, rt, 8:28)

sklar_tidy

## # A tibble: 3,234 x 9
##    prime prime.result target congruent operand distance counterbalance subid
##    <chr>        <dbl>  <dbl> <chr>     <chr>      <dbl>          <dbl> <chr>
##  1 =1+2…            8      9 no        A             -1              1 1    
##  2 =1+3…            9     11 no        A             -2              1 1    
##  3 =1+4…            8     12 no        A             -4              1 1    
##  4 =1+6…           10     12 no        A             -2              1 1    
##  5 =1+9…           12     11 no        A              1              1 1    
##  6 =1+9…           13     12 no        A              1              1 1    
##  7 =2+1…           12     11 no        A              1              1 1    
##  8 =2+3…           11     10 no        A              1              1 1    
##  9 =2+3…           12     11 no        A              1              1 1    
## 10 =2+5…           13      9 no        A              4              1 1    
## # … with 3,224 more rows, and 1 more variable: rt <dbl>
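One challenge is visible in the output above: after gathering, subid is a character column, because it was built from column names. The sketch below shows one possible cleanup on a made-up miniature of the Sklar layout (the primes and RTs are hypothetical). Whether the real data contain empty cells is an assumption here; the filter step only matters if they do.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature: items in rows, subjects "1" and "2" in columns
mini <- tibble(prime = c("=1+2", "=1+3"),
               `1`   = c(597, 699),
               `2`   = c(NA, 640))

mini_tidy <- mini %>%
  gather(subid, rt, -prime) %>%          # gather every column except prime
  mutate(subid = as.integer(subid)) %>%  # column names arrive as character
  filter(!is.na(rt))                     # drop cells with no response time

mini_tidy
```

Selecting columns by exclusion (-prime) rather than by position also avoids having to count columns, which is easy to get wrong in a 28-column dataset.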

Let’s also go back and tidy an easier one: iris.

iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

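The next chunk reshapes iris from wide to long with gather(). In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), so here is a minimal sketch of what the same reshape might look like in the newer interface. The name iris_long is just an illustrative choice, and the column names measurement and centimeters mirror the call below. One difference to expect: pivot_longer() keeps the four measurements for each flower together, so the rows come out in a different order than the gather() output, even though the contents are the same.

```r
library(dplyr)
library(tidyr)

# Give each flower an id, then move the four measurement columns into
# name/value pairs (one row per flower per measurement).
iris_long <- iris %>%
  mutate(iris_id = row_number()) %>%
  pivot_longer(
    cols = c(Sepal.Length, Petal.Length, Sepal.Width, Petal.Width),
    names_to = "measurement",   # old column names end up here
    values_to = "centimeters"   # the measured values end up here
  )

# 150 flowers x 4 measurements each
nrow(iris_long)
```

A quick row count like this is a handy sanity check after any reshape: the long data should have (number of original rows) times (number of gathered columns) rows.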
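# Tag each flower with an id, then stack the four measurement columns into
# key/value pairs: `measurement` holds the old column name, `centimeters` the value.
iris %>%
  mutate(iris_id = 1:nrow(iris)) %>%
  gather(measurement, centimeters, Sepal.Length, Petal.Length, Sepal.Width, Petal.Width)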

##        Species iris_id  measurement centimeters
## 1       setosa       1 Sepal.Length         5.1
## 2       setosa       2 Sepal.Length         4.9
## 3       setosa       3 Sepal.Length         4.7
## 4       setosa       4 Sepal.Length         4.6
## 5       setosa       5 Sepal.Length         5.0
## 6       setosa       6 Sepal.Length         5.4
## 7       setosa       7 Sepal.Length         4.6
## 8       setosa       8 Sepal.Length         5.0
## 9       setosa       9 Sepal.Length         4.4
## 10      setosa      10 Sepal.Length         4.9
## 11      setosa      11 Sepal.Length         5.4
## 12      setosa      12 Sepal.Length         4.8
## 13      setosa      13 Sepal.Length         4.8
## 14      setosa      14 Sepal.Length         4.3
## 15      setosa      15 Sepal.Length         5.8
## 16      setosa      16 Sepal.Length         5.7
## 17      setosa      17 Sepal.Length         5.4
## 18      setosa      18 Sepal.Length         5.1
## 19      setosa      19 Sepal.Length         5.7
## 20      setosa      20 Sepal.Length         5.1
## 21      setosa      21 Sepal.Length         5.4
## 22      setosa      22 Sepal.Length         5.1
## 23      setosa      23 Sepal.Length         4.6
## 24      setosa      24 Sepal.Length         5.1
## 25      setosa      25 Sepal.Length         4.8
## 26      setosa      26 Sepal.Length         5.0
## 27      setosa      27 Sepal.Length         5.0
## 28      setosa      28 Sepal.Length         5.2
## 29      setosa      29 Sepal.Length         5.2
## 30      setosa      30 Sepal.Length         4.7
## 31      setosa      31 Sepal.Length         4.8
## 32      setosa      32 Sepal.Length         5.4
## 33      setosa      33 Sepal.Length         5.2
## 34      setosa      34 Sepal.Length         5.5
## 35      setosa      35 Sepal.Length         4.9
## 36      setosa      36 Sepal.Length         5.0
## 37      setosa      37 Sepal.Length         5.5
## 38      setosa      38 Sepal.Length         4.9
## 39      setosa      39 Sepal.Length         4.4
## 40      setosa      40 Sepal.Length         5.1
## 41      setosa      41 Sepal.Length         5.0
## 42      setosa      42 Sepal.Length         4.5
## 43      setosa      43 Sepal.Length         4.4
## 44      setosa      44 Sepal.Length         5.0
## 45      setosa      45 Sepal.Length         5.1
## 46      setosa      46 Sepal.Length         4.8
## 47      setosa      47 Sepal.Length         5.1
## 48      setosa      48 Sepal.Length         4.6
## 49      setosa      49 Sepal.Length         5.3
## 50      setosa      50 Sepal.Length         5.0
## 51  versicolor      51 Sepal.Length         7.0
## 52  versicolor      52 Sepal.Length         6.4
## 53  versicolor      53 Sepal.Length         6.9
## 54  versicolor      54 Sepal.Length         5.5
## 55  versicolor      55 Sepal.Length         6.5
## 56  versicolor      56 Sepal.Length         5.7
## 57  versicolor      57 Sepal.Length         6.3
## 58  versicolor      58 Sepal.Length         4.9
## 59  versicolor      59 Sepal.Length         6.6
## 60  versicolor      60 Sepal.Length         5.2
## 61  versicolor      61 Sepal.Length         5.0
## 62  versicolor      62 Sepal.Length         5.9
## 63  versicolor      63 Sepal.Length         6.0
## 64  versicolor      64 Sepal.Length         6.1
## 65  versicolor      65 Sepal.Length         5.6
## 66  versicolor      66 Sepal.Length         6.7
## 67  versicolor      67 Sepal.Length         5.6
## 68  versicolor      68 Sepal.Length         5.8
## 69  versicolor      69 Sepal.Length         6.2
## 70  versicolor      70 Sepal.Length         5.6
## 71  versicolor      71 Sepal.Length         5.9
## 72  versicolor      72 Sepal.Length         6.1
## 73  versicolor      73 Sepal.Length         6.3
## 74  versicolor      74 Sepal.Length         6.1
## 75  versicolor      75 Sepal.Length         6.4
## 76  versicolor      76 Sepal.Length         6.6
## 77  versicolor      77 Sepal.Length         6.8
## 78  versicolor      78 Sepal.Length         6.7
## 79  versicolor      79 Sepal.Length         6.0
## 80  versicolor      80 Sepal.Length         5.7
## 81  versicolor      81 Sepal.Length         5.5
## 82  versicolor      82 Sepal.Length         5.5
## 83  versicolor      83 Sepal.Length         5.8
## 84  versicolor      84 Sepal.Length         6.0
## 85  versicolor      85 Sepal.Length         5.4
## 86  versicolor      86 Sepal.Length         6.0
## 87  versicolor      87 Sepal.Length         6.7
## 88  versicolor      88 Sepal.Length         6.3
## 89  versicolor      89 Sepal.Length         5.6
## 90  versicolor      90 Sepal.Length         5.5
## 91  versicolor      91 Sepal.Length         5.5
## 92  versicolor      92 Sepal.Length         6.1
## 93  versicolor      93 Sepal.Length         5.8
## 94  versicolor      94 Sepal.Length         5.0
## 95  versicolor      95 Sepal.Length         5.6
## 96  versicolor      96 Sepal.Length         5.7
## 97  versicolor      97 Sepal.Length         5.7
## 98  versicolor      98 Sepal.Length         6.2
## 99  versicolor      99 Sepal.Length         5.1
## 100 versicolor     100 Sepal.Length         5.7
## 101  virginica     101 Sepal.Length         6.3
## 102  virginica     102 Sepal.Length         5.8
## 103  virginica     103 Sepal.Length         7.1
## 104  virginica     104 Sepal.Length         6.3
## 105  virginica     105 Sepal.Length         6.5
## 106  virginica     106 Sepal.Length         7.6
## 107  virginica     107 Sepal.Length         4.9
## 108  virginica     108 Sepal.Length         7.3
## 109  virginica     109 Sepal.Length         6.7
## 110  virginica     110 Sepal.Length         7.2
## 111  virginica     111 Sepal.Length         6.5
## 112  virginica     112 Sepal.Length         6.4
## 113  virginica     113 Sepal.Length         6.8
## 114  virginica     114 Sepal.Length         5.7
## 115  virginica     115 Sepal.Length         5.8
## 116  virginica     116 Sepal.Length         6.4
## 117  virginica     117 Sepal.Length         6.5
## 118  virginica     118 Sepal.Length         7.7
## 119  virginica     119 Sepal.Length         7.7
## 120  virginica     120 Sepal.Length         6.0
## 121  virginica     121 Sepal.Length         6.9
## 122  virginica     122 Sepal.Length         5.6
## 123  virginica     123 Sepal.Length         7.7
## 124  virginica     124 Sepal.Length         6.3
## 125  virginica     125 Sepal.Length         6.7
## 126  virginica     126 Sepal.Length         7.2
## 127  virginica     127 Sepal.Length         6.2
## 128  virginica     128 Sepal.Length         6.1
## 129  virginica     129 Sepal.Length         6.4
## 130  virginica     130 Sepal.Length         7.2
## 131  virginica     131 Sepal.Length         7.4
## 132  virginica     132 Sepal.Length         7.9
## 133  virginica     133 Sepal.Length         6.4
## 134  virginica     134 Sepal.Length         6.3
## 135  virginica     135 Sepal.Length         6.1
## 136  virginica     136 Sepal.Length         7.7
## 137  virginica     137 Sepal.Length         6.3
## 138  virginica     138 Sepal.Length         6.4
## 139  virginica     139 Sepal.Length         6.0
## 140  virginica     140 Sepal.Length         6.9
## 141  virginica     141 Sepal.Length         6.7
## 142  virginica     142 Sepal.Length         6.9
## 143  virginica     143 Sepal.Length         5.8
## 144  virginica     144 Sepal.Length         6.8
## 145  virginica     145 Sepal.Length         6.7
## 146  virginica     146 Sepal.Length         6.7
## 147  virginica     147 Sepal.Length         6.3
## 148  virginica     148 Sepal.Length         6.5
## 149  virginica     149 Sepal.Length         6.2
## 150  virginica     150 Sepal.Length         5.9
## 151     setosa       1 Petal.Length         1.4
## 152     setosa       2 Petal.Length         1.4
## 153     setosa       3 Petal.Length         1.3
## 154     setosa       4 Petal.Length         1.5
## 155     setosa       5 Petal.Length         1.4
## 156     setosa       6 Petal.Length         1.7
## 157     setosa       7 Petal.Length         1.4
## 158     setosa       8 Petal.Length         1.5
## 159     setosa       9 Petal.Length         1.4
## 160     setosa      10 Petal.Length         1.5
## 161     setosa      11 Petal.Length         1.5
## 162     setosa      12 Petal.Length         1.6
## 163     setosa      13 Petal.Length         1.4
## 164     setosa      14 Petal.Length         1.1
## 165     setosa      15 Petal.Length         1.2
## 166     setosa      16 Petal.Length         1.5
## 167     setosa      17 Petal.Length         1.3
## 168     setosa      18 Petal.Length         1.4
## 169     setosa      19 Petal.Length         1.7
## 170     setosa      20 Petal.Length         1.5
## 171     setosa      21 Petal.Length         1.7
## 172     setosa      22 Petal.Length         1.5
## 173     setosa      23 Petal.Length         1.0
## 174     setosa      24 Petal.Length         1.7
## 175     setosa      25 Petal.Length         1.9
## 176     setosa      26 Petal.Length         1.6
## 177     setosa      27 Petal.Length         1.6
## 178     setosa      28 Petal.Length         1.5
## 179     setosa      29 Petal.Length         1.4
## 180     setosa      30 Petal.Length         1.6
## 181     setosa      31 Petal.Length         1.6
## 182     setosa      32 Petal.Length         1.5
## 183     setosa      33 Petal.Length         1.5
## 184     setosa      34 Petal.Length         1.4
## 185     setosa      35 Petal.Length         1.5
## 186     setosa      36 Petal.Length         1.2
## 187     setosa      37 Petal.Length         1.3
## 188     setosa      38 Petal.Length         1.4
## 189     setosa      39 Petal.Length         1.3
## 190     setosa      40 Petal.Length         1.5
## 191     setosa      41 Petal.Length         1.3
## 192     setosa      42 Petal.Length         1.3
## 193     setosa      43 Petal.Length         1.3
## 194     setosa      44 Petal.Length         1.6
## 195     setosa      45 Petal.Length         1.9
## 196     setosa      46 Petal.Length         1.4
## 197     setosa      47 Petal.Length         1.6
## 198     setosa      48 Petal.Length         1.4
## 199     setosa      49 Petal.Length         1.5
## 200     setosa      50 Petal.Length         1.4
## 201 versicolor      51 Petal.Length         4.7
## 202 versicolor      52 Petal.Length         4.5
## 203 versicolor      53 Petal.Length         4.9
## 204 versicolor      54 Petal.Length         4.0
## 205 versicolor      55 Petal.Length         4.6
## 206 versicolor      56 Petal.Length         4.5
## 207 versicolor      57 Petal.Length         4.7
## 208 versicolor      58 Petal.Length         3.3
## 209 versicolor      59 Petal.Length         4.6
## 210 versicolor      60 Petal.Length         3.9
## 211 versicolor      61 Petal.Length         3.5
## 212 versicolor      62 Petal.Length         4.2
## 213 versicolor      63 Petal.Length         4.0
## 214 versicolor      64 Petal.Length         4.7
## 215 versicolor      65 Petal.Length         3.6
## 216 versicolor      66 Petal.Length         4.4
## 217 versicolor      67 Petal.Length         4.5
## 218 versicolor      68 Petal.Length         4.1
## 219 versicolor      69 Petal.Length         4.5
## 220 versicolor      70 Petal.Length         3.9
## 221 versicolor      71 Petal.Length         4.8
## 222 versicolor      72 Petal.Length         4.0
## 223 versicolor      73 Petal.Length         4.9
## 224 versicolor      74 Petal.Length         4.7
## 225 versicolor      75 Petal.Length         4.3
## 226 versicolor      76 Petal.Length         4.4
## 227 versicolor      77 Petal.Length         4.8
## 228 versicolor      78 Petal.Length         5.0
## 229 versicolor      79 Petal.Length         4.5
## 230 versicolor      80 Petal.Length         3.5
## 231 versicolor      81 Petal.Length         3.8
## 232 versicolor      82 Petal.Length         3.7
## 233 versicolor      83 Petal.Length         3.9
## 234 versicolor      84 Petal.Length         5.1
## 235 versicolor      85 Petal.Length         4.5
## 236 versicolor      86 Petal.Length         4.5
## 237 versicolor      87 Petal.Length         4.7
## 238 versicolor      88 Petal.Length         4.4
## 239 versicolor      89 Petal.Length         4.1
## 240 versicolor      90 Petal.Length         4.0
## 241 versicolor      91 Petal.Length         4.4
## 242 versicolor      92 Petal.Length         4.6
## 243 versicolor      93 Petal.Length         4.0
## 244 versicolor      94 Petal.Length         3.3
## 245 versicolor      95 Petal.Length         4.2
## 246 versicolor      96 Petal.Length         4.2
## 247 versicolor      97 Petal.Length         4.2
## 248 versicolor      98 Petal.Length         4.3
## 249 versicolor      99 Petal.Length         3.0
## 250 versicolor     100 Petal.Length         4.1
## 251  virginica     101 Petal.Length         6.0
## 252  virginica     102 Petal.Length         5.1
## 253  virginica     103 Petal.Length         5.9
## 254  virginica     104 Petal.Length         5.6
## 255  virginica     105 Petal.Length         5.8
## 256  virginica     106 Petal.Length         6.6
## 257  virginica     107 Petal.Length         4.5
## 258  virginica     108 Petal.Length         6.3
## 259  virginica     109 Petal.Length         5.8
## 260  virginica     110 Petal.Length         6.1
## 261  virginica     111 Petal.Length         5.1
## 262  virginica     112 Petal.Length         5.3
## 263  virginica     113 Petal.Length         5.5
## 264  virginica     114 Petal.Length         5.0
## 265  virginica     115 Petal.Length         5.1
## 266  virginica     116 Petal.Length         5.3
## 267  virginica     117 Petal.Length         5.5
## 268  virginica     118 Petal.Length         6.7
## 269  virginica     119 Petal.Length         6.9
## 270  virginica     120 Petal.Length         5.0
## 271  virginica     121 Petal.Length         5.7
## 272  virginica     122 Petal.Length         4.9
## 273  virginica     123 Petal.Length         6.7
## 274  virginica     124 Petal.Length         4.9
## 275  virginica     125 Petal.Length         5.7
## 276  virginica     126 Petal.Length         6.0
## 277  virginica     127 Petal.Length         4.8
## 278  virginica     128 Petal.Length         4.9
## 279  virginica     129 Petal.Length         5.6
## 280  virginica     130 Petal.Length         5.8
## 281  virginica     131 Petal.Length         6.1
## 282  virginica     132 Petal.Length         6.4
## 283  virginica     133 Petal.Length         5.6
## 284  virginica     134 Petal.Length         5.1
## 285  virginica     135 Petal.Length         5.6
## 286  virginica     136 Petal.Length         6.1
## 287  virginica     137 Petal.Length         5.6
## 288  virginica     138 Petal.Length         5.5
## 289  virginica     139 Petal.Length         4.8
## 290  virginica     140 Petal.Length         5.4
## 291  virginica     141 Petal.Length         5.6
## 292  virginica     142 Petal.Length         5.1
## 293  virginica     143 Petal.Length         5.1
## 294  virginica     144 Petal.Length         5.9
## 295  virginica     145 Petal.Length         5.7
## 296  virginica     146 Petal.Length         5.2
## 297  virginica     147 Petal.Length         5.0
## 298  virginica     148 Petal.Length         5.2
## 299  virginica     149 Petal.Length         5.4
## 300  virginica     150 Petal.Length         5.1
## 301     setosa       1  Sepal.Width         3.5
## 302     setosa       2  Sepal.Width         3.0
## 303     setosa       3  Sepal.Width         3.2
## 304     setosa       4  Sepal.Width         3.1
## 305     setosa       5  Sepal.Width         3.6
## 306     setosa       6  Sepal.Width         3.9
## 307     setosa       7  Sepal.Width         3.4
## 308     setosa       8  Sepal.Width         3.4
## 309     setosa       9  Sepal.Width         2.9
## 310     setosa      10  Sepal.Width         3.1
## 311     setosa      11  Sepal.Width         3.7
## 312     setosa      12  Sepal.Width         3.4
## 313     setosa      13  Sepal.Width         3.0
## 314     setosa      14  Sepal.Width         3.0
## 315     setosa      15  Sepal.Width         4.0
## 316     setosa      16  Sepal.Width         4.4
## 317     setosa      17  Sepal.Width         3.9
## 318     setosa      18  Sepal.Width         3.5
## 319     setosa      19  Sepal.Width         3.8
## 320     setosa      20  Sepal.Width         3.8
## 321     setosa      21  Sepal.Width         3.4
## 322     setosa      22  Sepal.Width         3.7
## 323     setosa      23  Sepal.Width         3.6
## 324     setosa      24  Sepal.Width         3.3
## 325     setosa      25  Sepal.Width         3.4
## 326     setosa      26  Sepal.Width         3.0
## 327     setosa      27  Sepal.Width         3.4
## 328     setosa      28  Sepal.Width         3.5
## 329     setosa      29  Sepal.Width         3.4
## 330     setosa      30  Sepal.Width         3.2
## 331     setosa      31  Sepal.Width         3.1
## 332     setosa      32  Sepal.Width         3.4
## 333     setosa      33  Sepal.Width         4.1
## 334     setosa      34  Sepal.Width         4.2
## 335     setosa      35  Sepal.Width         3.1
## 336     setosa      36  Sepal.Width         3.2
## 337     setosa      37  Sepal.Width         3.5
## 338     setosa      38  Sepal.Width         3.6
## 339     setosa      39  Sepal.Width         3.0
## 340     setosa      40  Sepal.Width         3.4
## 341     setosa      41  Sepal.Width         3.5
## 342     setosa      42  Sepal.Width         2.3
## 343     setosa      43  Sepal.Width         3.2
## 344     setosa      44  Sepal.Width         3.5
## 345     setosa      45  Sepal.Width         3.8
## 346     setosa      46  Sepal.Width         3.0
## 347     setosa      47  Sepal.Width         3.8
## 348     setosa      48  Sepal.Width         3.2
## 349     setosa      49  Sepal.Width         3.7
## 350     setosa      50  Sepal.Width         3.3
## 351 versicolor      51  Sepal.Width         3.2
## 352 versicolor      52  Sepal.Width         3.2
## 353 versicolor      53  Sepal.Width         3.1
## 354 versicolor      54  Sepal.Width         2.3
## 355 versicolor      55  Sepal.Width         2.8
## 356 versicolor      56  Sepal.Width         2.8
## 357 versicolor      57  Sepal.Width         3.3
## 358 versicolor      58  Sepal.Width         2.4
## 359 versicolor      59  Sepal.Width         2.9
## 360 versicolor      60  Sepal.Width         2.7
## 361 versicolor      61  Sepal.Width         2.0
## 362 versicolor      62  Sepal.Width         3.0
## 363 versicolor      63  Sepal.Width         2.2
## 364 versicolor      64  Sepal.Width         2.9
## 365 versicolor      65  Sepal.Width         2.9
## 366 versicolor      66  Sepal.Width         3.1
## 367 versicolor      67  Sepal.Width         3.0
## 368 versicolor      68  Sepal.Width         2.7
## 369 versicolor      69  Sepal.Width         2.2
## 370 versicolor      70  Sepal.Width         2.5
## 371 versicolor      71  Sepal.Width         3.2
## 372 versicolor      72  Sepal.Width         2.8
## 373 versicolor      73  Sepal.Width         2.5
## 374 versicolor      74  Sepal.Width         2.8
## 375 versicolor      75  Sepal.Width         2.9
## 376 versicolor      76  Sepal.Width         3.0
## 377 versicolor      77  Sepal.Width         2.8
## 378 versicolor      78  Sepal.Width         3.0
## 379 versicolor      79  Sepal.Width         2.9
## 380 versicolor      80  Sepal.Width         2.6
## 381 versicolor      81  Sepal.Width         2.4
## 382 versicolor      82  Sepal.Width         2.4
## 383 versicolor      83  Sepal.Width         2.7
## 384 versicolor      84  Sepal.Width         2.7
## 385 versicolor      85  Sepal.Width         3.0
## 386 versicolor      86  Sepal.Width         3.4
## 387 versicolor      87  Sepal.Width         3.1
## 388 versicolor      88  Sepal.Width         2.3
## 389 versicolor      89  Sepal.Width         3.0
## 390 versicolor      90  Sepal.Width         2.5
## 391 versicolor      91  Sepal.Width         2.6
## 392 versicolor      92  Sepal.Width         3.0
## 393 versicolor      93  Sepal.Width         2.6
## 394 versicolor      94  Sepal.Width         2.3
## 395 versicolor      95  Sepal.Width         2.7
## 396 versicolor      96  Sepal.Width         3.0
## 397 versicolor      97  Sepal.Width         2.9
## 398 versicolor      98  Sepal.Width         2.9
## 399 versicolor      99  Sepal.Width         2.5
## 400 versicolor     100  Sepal.Width         2.8
## 401  virginica     101  Sepal.Width         3.3
## 402  virginica     102  Sepal.Width         2.7
## 403  virginica     103  Sepal.Width         3.0
## 404  virginica     104  Sepal.Width         2.9
## 405  virginica     105  Sepal.Width         3.0
## 406  virginica     106  Sepal.Width         3.0
## 407  virginica     107  Sepal.Width         2.5
## 408  virginica     108  Sepal.Width         2.9
## 409  virginica     109  Sepal.Width         2.5
## 410  virginica     110  Sepal.Width         3.6
## 411  virginica     111  Sepal.Width         3.2
## 412  virginica     112  Sepal.Width         2.7
## 413  virginica     113  Sepal.Width         3.0
## 414  virginica     114  Sepal.Width         2.5
## 415  virginica     115  Sepal.Width         2.8
## 416  virginica     116  Sepal.Width         3.2
## 417  virginica     117  Sepal.Width         3.0
## 418  virginica     118  Sepal.Width         3.8
## 419  virginica     119  Sepal.Width         2.6
## 420  virginica     120  Sepal.Width         2.2
## 421  virginica     121  Sepal.Width         3.2
## 422  virginica     122  Sepal.Width         2.8
## 423  virginica     123  Sepal.Width         2.8
## 424  virginica     124  Sepal.Width         2.7
## 425  virginica     125  Sepal.Width         3.3
## 426  virginica     126  Sepal.Width         3.2
## 427  virginica     127  Sepal.Width         2.8
## 428  virginica     128  Sepal.Width         3.0
## 429  virginica     129  Sepal.Width         2.8
## 430  virginica     130  Sepal.Width         3.0
## 431  virginica     131  Sepal.Width         2.8
## 432  virginica     132  Sepal.Width         3.8
## 433  virginica     133  Sepal.Width         2.8
## 434  virginica     134  Sepal.Width         2.8
## 435  virginica     135  Sepal.Width         2.6
## 436  virginica     136  Sepal.Width         3.0
## 437  virginica     137  Sepal.Width         3.4
## 438  virginica     138  Sepal.Width         3.1
## 439  virginica     139  Sepal.Width         3.0
## 440  virginica     140  Sepal.Width         3.1
## 441  virginica     141  Sepal.Width         3.1
## 442  virginica     142  Sepal.Width         3.1
## 443  virginica     143  Sepal.Width         2.7
## 444  virginica     144  Sepal.Width         3.2
## 445  virginica     145  Sepal.Width         3.3
## 446  virginica     146  Sepal.Width         3.0
## 447  virginica     147  Sepal.Width         2.5
## 448  virginica     148  Sepal.Width         3.0
## 449  virginica     149  Sepal.Width         3.4
## 450  virginica     150  Sepal.Width         3.0
## 451     setosa       1  Petal.Width         0.2
## 452     setosa       2  Petal.Width         0.2
## 453     setosa       3  Petal.Width         0.2
## 454     setosa       4  Petal.Width         0.2
## 455     setosa       5  Petal.Width         0.2
## 456     setosa       6  Petal.Width         0.4
## 457     setosa       7  Petal.Width         0.3
## 458     setosa       8  Petal.Width         0.2
## 459     setosa       9  Petal.Width         0.2
## 460     setosa      10  Petal.Width         0.1
## 461     setosa      11  Petal.Width         0.2
## 462     setosa      12  Petal.Width         0.2
## 463     setosa      13  Petal.Width         0.1
## 464     setosa      14  Petal.Width         0.1
## 465     setosa      15  Petal.Width         0.2
## 466     setosa      16  Petal.Width         0.4
## 467     setosa      17  Petal.Width         0.4
## 468     setosa      18  Petal.Width         0.3
## 469     setosa      19  Petal.Width         0.3
## 470     setosa      20  Petal.Width         0.3
## 471     setosa      21  Petal.Width         0.2
## 472     setosa      22  Petal.Width         0.4
## 473     setosa      23  Petal.Width         0.2
## 474     setosa      24  Petal.Width         0.5
## 475     setosa      25  Petal.Width         0.2
## 476     setosa      26  Petal.Width         0.2
## 477     setosa      27  Petal.Width         0.4
## 478     setosa      28  Petal.Width         0.2
## 479     setosa      29  Petal.Width         0.2
## 480     setosa      30  Petal.Width         0.2
## 481     setosa      31  Petal.Width         0.2
## 482     setosa      32  Petal.Width         0.4
## 483     setosa      33  Petal.Width         0.1
## 484     setosa      34  Petal.Width         0.2
## 485     setosa      35  Petal.Width         0.2
## 486     setosa      36  Petal.Width         0.2
## 487     setosa      37  Petal.Width         0.2
## 488     setosa      38  Petal.Width         0.1
## 489     setosa      39  Petal.Width         0.2
## 490     setosa      40  Petal.Width         0.2
## 491     setosa      41  Petal.Width         0.3
## 492     setosa      42  Petal.Width         0.3
## 493     setosa      43  Petal.Width         0.2
## 494     setosa      44  Petal.Width         0.6
## 495     setosa      45  Petal.Width         0.4
## 496     setosa      46  Petal.Width         0.3
## 497     setosa      47  Petal.Width         0.2
## 498     setosa      48  Petal.Width         0.2
## 499     setosa      49  Petal.Width         0.2
## 500     setosa      50  Petal.Width         0.2
## 501 versicolor      51  Petal.Width         1.4
## 502 versicolor      52  Petal.Width         1.5
## 503 versicolor      53  Petal.Width         1.5
## 504 versicolor      54  Petal.Width         1.3
## 505 versicolor      55  Petal.Width         1.5
## 506 versicolor      56  Petal.Width         1.3
## 507 versicolor      57  Petal.Width         1.6
## 508 versicolor      58  Petal.Width         1.0
## 509 versicolor      59  Petal.Width         1.3
## 510 versicolor      60  Petal.Width         1.4
## 511 versicolor      61  Petal.Width         1.0
## 512 versicolor      62  Petal.Width         1.5
## 513 versicolor      63  Petal.Width         1.0
## 514 versicolor      64  Petal.Width         1.4
## 515 versicolor      65  Petal.Width         1.3
## 516 versicolor      66  Petal.Width         1.4
## 517 versicolor      67  Petal.Width         1.5
## 518 versicolor      68  Petal.Width         1.0
## 519 versicolor      69  Petal.Width         1.5
## 520 versicolor      70  Petal.Width         1.1
## 521 versicolor      71  Petal.Width         1.8
## 522 versicolor      72  Petal.Width         1.3
## 523 versicolor      73  Petal.Width         1.5
## 524 versicolor      74  Petal.Width         1.2
## 525 versicolor      75  Petal.Width         1.3
## 526 versicolor      76  Petal.Width         1.4
## 527 versicolor      77  Petal.Width         1.4
## 528 versicolor      78  Petal.Width         1.7
## 529 versicolor      79  Petal.Width         1.5
## 530 versicolor      80  Petal.Width         1.0
## 531 versicolor      81  Petal.Width         1.1
## 532 versicolor      82  Petal.Width         1.0
## 533 versicolor      83  Petal.Width         1.2
## 534 versicolor      84  Petal.Width         1.6
## 535 versicolor      85  Petal.Width         1.5
## 536 versicolor      86  Petal.Width         1.6
## 537 versicolor      87  Petal.Width         1.5
## 538 versicolor      88  Petal.Width         1.3
## 539 versicolor      89  Petal.Width         1.3
## 540 versicolor      90  Petal.Width         1.3
## 541 versicolor      91  Petal.Width         1.2
## 542 versicolor      92  Petal.Width         1.4
## 543 versicolor      93  Petal.Width         1.2
## 544 versicolor      94  Petal.Width         1.0
## 545 versicolor      95  Petal.Width         1.3
## 546 versicolor      96  Petal.Width         1.2
## 547 versicolor      97  Petal.Width         1.3
## 548 versicolor      98  Petal.Width         1.3
## 549 versicolor      99  Petal.Width         1.1
## 550 versicolor     100  Petal.Width         1.3
## 551  virginica     101  Petal.Width         2.5
## 552  virginica     102  Petal.Width         1.9
## 553  virginica     103  Petal.Width         2.1
## 554  virginica     104  Petal.Width         1.8
## 555  virginica     105  Petal.Width         2.2
## 556  virginica     106  Petal.Width         2.1
## 557  virginica     107  Petal.Width         1.7
## 558  virginica     108  Petal.Width         1.8
## 559  virginica     109  Petal.Width         1.8
## 560  virginica     110  Petal.Width         2.5
## 561  virginica     111  Petal.Width         2.0
## 562  virginica     112  Petal.Width         1.9
## 563  virginica     113  Petal.Width         2.1
## 564  virginica     114  Petal.Width         2.0
## 565  virginica     115  Petal.Width         2.4
## 566  virginica     116  Petal.Width         2.3
## 567  virginica     117  Petal.Width         1.8
## 568  virginica     118  Petal.Width         2.2
## 569  virginica     119  Petal.Width         2.3
## 570  virginica     120  Petal.Width         1.5
## 571  virginica     121  Petal.Width         2.3
## 572  virginica     122  Petal.Width         2.0
## 573  virginica     123  Petal.Width         2.0
## 574  virginica     124  Petal.Width         1.8
## 575  virginica     125  Petal.Width         2.1
## 576  virginica     126  Petal.Width         1.8
## 577  virginica     127  Petal.Width         1.8
## 578  virginica     128  Petal.Width         1.8
## 579  virginica     129  Petal.Width         2.1
## 580  virginica     130  Petal.Width         1.6
## 581  virginica     131  Petal.Width         1.9
## 582  virginica     132  Petal.Width         2.0
## 583  virginica     133  Petal.Width         2.2
## 584  virginica     134  Petal.Width         1.5
## 585  virginica     135  Petal.Width         1.4
## 586  virginica     136  Petal.Width         2.3
## 587  virginica     137  Petal.Width         2.4
## 588  virginica     138  Petal.Width         1.8
## 589  virginica     139  Petal.Width         1.8
## 590  virginica     140  Petal.Width         2.1
## 591  virginica     141  Petal.Width         2.4
## 592  virginica     142  Petal.Width         2.3
## 593  virginica     143  Petal.Width         1.9
## 594  virginica     144  Petal.Width         2.3
## 595  virginica     145  Petal.Width         2.5
## 596  virginica     146  Petal.Width         2.3
## 597  virginica     147  Petal.Width         1.9
## 598  virginica     148  Petal.Width         2.0
## 599  virginica     149  Petal.Width         2.3
## 600  virginica     150  Petal.Width         1.8

A bigger worked example: Wordbank data

We’re going to be using some data on vocabulary growth that we load from the Wordbank database. Wordbank is a database of children’s language learning.

(Go explore it for a moment).

We’re going to look at data from the English Words and Sentences form. These data describe the responses of parents to questions about whether their child says 680 different words.

dplyr really shines in this context.

# to avoid dependency on the wordbankr package, we cache these data. 
# ws <- wordbankr::get_administration_data(language = "English", 
#                                          form = "WS")

ws <- read_csv("data/ws.csv")

## Parsed with column specification:
## cols(
##   data_id = col_double(),
##   age = col_double(),
##   comprehension = col_double(),
##   production = col_double(),
##   language = col_character(),
##   form = col_character(),
##   birth_order = col_character(),
##   ethnicity = col_character(),
##   sex = col_character(),
##   zygosity = col_logical(),
##   norming = col_logical(),
##   longitudinal = col_logical(),
##   source_name = col_character(),
##   mom_ed = col_character()
## )

Take a look at the data that comes out.

DT::datatable(ws)

ggplot(ws, aes(x = age, y = production)) + 
  geom_point()

Aside: How can we fix this plot? Suggestions from group?

ggplot(ws, aes(x = age, y = production)) + 
  geom_jitter(size = .5, width = .25, height = 0, alpha = .3)

Ok, let’s plot the relationship between sex and productive vocabulary, using dplyr.

ggplot(ws, aes(x = age, y = production, col=sex)) + 
  geom_jitter(size = .5, width = .25, height = 0, alpha = .3)

This is a bit useless, because the variability is so high. So let’s summarise!

Exercise. Get means and SDs of productive vocabulary (production) by age and sex. Filter out the kids with missing data for sex (coded as NA).

HINT: is.na(x) is useful for filtering.

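# View(ws)
ws_sex <- ws %>%
  filter(!is.na(sex)) %>%
  group_by(age, sex) %>%
  # NB: production is overwritten by its mean first, so sd() then sees a
  # single value per group; this is why production_sd prints as NA below
  summarise(production = mean(production),
            production_sd = sd(production))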

## `summarise()` regrouping output by 'age' (override with `.groups` argument)

ws_sex

## # A tibble: 30 x 4
## # Groups:   age [15]
##      age sex    production production_sd
##    <dbl> <chr>       <dbl>         <dbl>
##  1    16 Female       66.9            NA
##  2    16 Male         52.4            NA
##  3    17 Female       85.7            NA
##  4    17 Male         65.7            NA
##  5    18 Female      137.             NA
##  6    18 Male        105.             NA
##  7    19 Female      172.             NA
##  8    19 Male        137.             NA
##  9    20 Female      198.             NA
## 10    20 Male        164.             NA
## # … with 20 more rows

Now plot:

ggplot(ws_sex, 
       aes(x = age, y = production, col = sex)) + 
  geom_line() + 
  geom_jitter(data = filter(ws, !is.na(sex)), 
              size = .5, width = .25, height = 0, alpha = .3) + 
  geom_linerange(aes(ymin = production - production_sd, 
                     ymax = production + production_sd), 
                 position = position_dodge(width = .2)) # keep SDs from overlapping

## Warning: Removed 30 rows containing missing values (geom_linerange).
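
A note on the missing SDs: summarise() evaluates its arguments in order, and each newly created column is visible to the arguments that follow it. Here is a minimal sketch of that behaviour and one way around it. The toy tibble and its values are invented for illustration and are not Wordbank data.

```r
library(dplyr)

# Toy data, invented for illustration (not Wordbank values)
toy <- tibble(
  sex = c("Female", "Female", "Female", "Male", "Male", "Male"),
  production = c(10, 20, 30, 5, 15, 25)
)

# Pitfall: `production` is replaced by its group mean before sd() runs,
# so sd() is applied to one value per group and returns NA
overwritten <- toy %>%
  group_by(sex) %>%
  summarise(production = mean(production),
            production_sd = sd(production))

# One fix: compute the SD first and give the mean its own name
fixed <- toy %>%
  group_by(sex) %>%
  summarise(production_sd = sd(production),
            production_mean = mean(production))
```

With a summary like fixed, the geom_linerange() call has real ymin and ymax values to draw. Note that the y aesthetic would then need to point at production_mean.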

Bonus: Compute effect size.

# instructor demo

Exciting stuff you can do with this workflow

Here are three little demos of exciting stuff that you can do (and that are facilitated by this workflow).

Reading bigger files, faster

A few other things will help you with “medium size data”:

read_csv - Much faster than read.csv and has better defaults.
dbplyr - For connecting directly to databases. This package got forked off of dplyr recently but is very useful.
feather - The feather package reads and writes a fast-loading binary format that is interoperable with Python. All you need to know is write_feather(d, "filename") and read_feather("filename").

Here’s a timing demo for read.csv, read_csv, and read_feather.

system.time(read.csv("data/ws.csv"))

##    user  system elapsed 
##   0.019   0.001   0.021

system.time(read_csv("data/ws.csv"))

## Parsed with column specification:
## cols(
##   data_id = col_double(),
##   age = col_double(),
##   comprehension = col_double(),
##   production = col_double(),
##   language = col_character(),
##   form = col_character(),
##   birth_order = col_character(),
##   ethnicity = col_character(),
##   sex = col_character(),
##   zygosity = col_logical(),
##   norming = col_logical(),
##   longitudinal = col_logical(),
##   source_name = col_character(),
##   mom_ed = col_character()
## )

##    user  system elapsed 
##   0.011   0.001   0.015

system.time(feather::read_feather("data/ws.feather"))

##    user  system elapsed 
##   0.010   0.002   0.020

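I typically see about a 2x speedup for read_csv (bigger for bigger files) and a 20x speedup for read_feather, although this file is small enough that the elapsed times printed above differ very little.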
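
To see the difference on a larger file, here is a rough sketch that writes a synthetic CSV to a temporary file and times both readers on it. The data frame, its column names, and the row count are all invented for illustration, and timings will vary by machine, so no expected numbers are shown.

```r
library(readr)

# Synthetic data, invented for illustration; the size is an arbitrary choice
n <- 200000
big <- data.frame(
  id = seq_len(n),
  age = sample(16:30, n, replace = TRUE),
  production = sample(0:680, n, replace = TRUE)
)

path <- tempfile(fileext = ".csv")
write_csv(big, path)

# Time each reader on the same file
t_base  <- system.time(d_base  <- read.csv(path))
# col_types = cols() guesses the types without printing the column spec
t_readr <- system.time(d_readr <- read_csv(path, col_types = cols()))

# Elapsed seconds for each reader
c(read.csv = t_base[["elapsed"]], read_csv = t_readr[["elapsed"]])

unlink(path)
```

Increasing n should make any gap between the two readers easier to see.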

Interactive visualization

The shiny package is a great way to do interactives in R. We’ll walk through constructing a simple shiny app for the wordbank data here.

Technically, this is embedded shiny as opposed to freestanding shiny apps (like Wordbank).

The two parts of a shiny app are ui and server. Both of these are funny in that they are lists of other things. The ui is a list of elements of an HTML page, and the server is a list of “reactive” elements. In brief, the UI says what should be shown, and the server specifies the mechanics of how to create those elements.

This little embedded shiny app shows a page with two elements: 1) a selector that lets you choose a demographic field, and 2) a plot of vocabulary split by that field.

The server then has the job of splitting the data by that field (for ws_split) and rendering the plot (agePlot).

The one fancy thing that’s going on here is that the app makes use of the calls group_by_ (in the dplyr chain) and aes_ (for the ggplot call). These _ functions are a little complex - they are an example of “standard evaluation” that lets you feed dplyr and ggplot2 a column name stored in a variable (here, the string in input$demographic) rather than a column name typed directly into the call. For more information, there is a nice vignette on standard and non-standard evaluation: try vignette("nse") (or vignette("programming") in newer versions of dplyr, as the deprecation warning below the app suggests).

library(shiny)
shinyApp(
  ui = fluidPage(
    selectInput("demographic", "Demographic Split Variable", 
                c("Sex" = "sex", "Maternal Education" = "mom_ed",
                  "Birth Order" = "birth_order", "Ethnicity" = "ethnicity")),
    plotOutput("agePlot")
  ),
  
  server = function(input, output) {
    ws_split <- reactive({
      ws %>%
        group_by_("age", input$demographic) %>%
        summarise(production_mean = mean(production))
    })
    
    output$agePlot <- renderPlot({
      ggplot(ws_split(), 
             aes_(quote(age), quote(production_mean), col = as.name(input$demographic))) + 
        geom_line() 
    })
  },
  
  options = list(height = 500)
)

## 
## Listening on http://127.0.0.1:6890

## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## `summarise()` regrouping output by 'age' (override with `.groups` argument)
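
The deprecation warning above points to group_by() as the replacement. Here is a sketch of how the same grouping by a column name held in a string might be written without the _ functions, assuming dplyr 1.0 or later and ggplot2 3.0 or later. The helper names split_by_demographic and plot_by_demographic are hypothetical, and the toy data frame is an invented stand-in for ws.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for ws, invented for illustration
toy_ws <- tibble(
  age = c(16, 16, 16, 16, 17, 17, 17, 17),
  sex = c("Female", "Female", "Male", "Male",
          "Female", "Female", "Male", "Male"),
  production = c(60, 70, 50, 54, 80, 90, 60, 70)
)

# Hypothetical helper: group by age and by the column named in `demographic`
split_by_demographic <- function(d, demographic) {
  d %>%
    group_by(across(all_of(c("age", demographic)))) %>%
    summarise(production_mean = mean(production), .groups = "drop")
}

# Hypothetical helper: colour the lines by the column named in `demographic`
plot_by_demographic <- function(d_split, demographic) {
  ggplot(d_split,
         aes(x = age, y = production_mean, col = .data[[demographic]])) +
    geom_line()
}

toy_split <- split_by_demographic(toy_ws, "sex")
p <- plot_by_demographic(toy_split, "sex")
```

Inside the shiny server, input$demographic would take the place of the literal "sex" in both calls.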

Function application

As I’ve tried to highlight, dplyr is actually all about applying functions. summarise is a verb that helps you apply functions to chunks of data and then bind them together. But that creates a requirement that all the functions return a single value (e.g., mean). There are lots of things you can do that summarise data but don’t return a single value. For example, maybe you want to run a linear regression and return the slope and the intercept.

For that, I want to highlight two things.

One is do, which allows function application to grouped data. The only tricky thing about using do is that you have to refer to the dataframe that you’re working on as . (a single dot).

The second is the amazing broom package, which provides methods to tidy the output of lots of different statistical models. So for example, you can run a linear regression on chunks of a dataset and get back out the coefficients in a data frame.

Here’s a toy example, again with Wordbank data.

ws %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  do(broom::tidy(lm(production ~ age, data = .)))

## # A tibble: 4 x 6
## # Groups:   sex [2]
##   sex    term        estimate std.error statistic   p.value
##   <chr>  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 Female (Intercept)   -515.     14.4       -35.6 9.92e-215
## 2 Female age             36.2     0.629      57.5 0.       
## 3 Male   (Intercept)   -494.     14.2       -34.9 3.04e-210
## 4 Male   age             33.5     0.615      54.5 0.

In recent years, this workflow in R has gotten really good. purrr is an amazing package that introduces consistent ways to map functions. It’s beyond the scope of the course.

Exercise solutions

Returning the third cell.

iris$Petal.Length[3]

## [1] 1.3

iris[3,3]

## [1] 1.3

iris[3,"Petal.Length"]

## [1] 1.3

iris[[3]][3]

## [1] 1.3

iris[["Petal.Length"]][3]

## [1] 1.3

# probably more?

Piped commands.

iris$Species %>%
  unique %>%
  length

## [1] 3

Mean of participant means.

sgf %>%
  group_by(age_group, subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(mean_correct = mean(correct), 
            sd_correct = sd(correct))

## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 3
##   age_group mean_correct sd_correct
##   <fct>            <dbl>      <dbl>
## 1 [2,3]            0.408      0.269
## 2 (3,4]            0.485      0.348
## 3 (4,5]            0.448      0.375

Sklar tidying.

sklar %>%
  gather(participant, RT, 8:28)

## # A tibble: 3,234 x 9
##    prime prime.result target congruent operand distance counterbalance
##    <chr>        <dbl>  <dbl> <chr>     <chr>      <dbl>          <dbl>
##  1 =1+2…            8      9 no        A             -1              1
##  2 =1+3…            9     11 no        A             -2              1
##  3 =1+4…            8     12 no        A             -4              1
##  4 =1+6…           10     12 no        A             -2              1
##  5 =1+9…           12     11 no        A              1              1
##  6 =1+9…           13     12 no        A              1              1
##  7 =2+1…           12     11 no        A              1              1
##  8 =2+3…           11     10 no        A              1              1
##  9 =2+3…           12     11 no        A              1              1
## 10 =2+5…           13      9 no        A              4              1
## # … with 3,224 more rows, and 2 more variables: participant <chr>, RT <dbl>

# might be a better way to select these columns than by number, e.g. regex
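
One alternative to selecting by number is to name the columns that should stay put and gather everything else. Here is a minimal sketch on a toy data frame. The participant column names `1`, `2`, `3` and all values are invented for illustration. For the real sklar data, this approach assumes that the design columns prime through counterbalance sit next to each other, as the printed output suggests.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for sklar, invented for illustration: two design columns
# followed by three participant columns
toy_sklar <- tibble(
  prime = c("a", "b"),
  counterbalance = c(1, 1),
  `1` = c(600, 610),
  `2` = c(580, 590),
  `3` = c(620, 640)
)

# Select by position, as in the solution above
by_position <- toy_sklar %>%
  gather(participant, RT, 3:5)

# Select by name: gather everything except the design columns
by_name <- toy_sklar %>%
  gather(participant, RT, -(prime:counterbalance))

# Check that both selections give the same long data frame
identical(by_position, by_name)
```

Selecting by name keeps working if columns are added or reordered elsewhere in the data frame, which positional indices do not.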

Sex means.

ws_sex <- ws %>%
  filter(!is.na(sex)) %>%
  group_by(age, sex) %>%
  summarise(production_sd = sd(production, na.rm=TRUE),
            production_mean = mean(production))

## `summarise()` regrouping output by 'age' (override with `.groups` argument)

Effect size. (Instructor demo)

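# standardized difference at each age: (female mean - male mean) divided by
# the average of the two groups' SDs
ws_es <- ws_sex %>%
  group_by(age) %>%
  summarise(es = (production_mean[sex=="Female"] - production_mean[sex=="Male"]) / 
              mean(production_sd))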

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(ws_es, aes(x = age, y = es)) + 
  geom_point() + 
  geom_smooth(span = 1) + 
  ylab("Female advantage (standard deviations)") + 
  xlab("Age (months)")

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Medium Data in the Tidyverse

Mike Frank

6/22/2017, updated 10/14/2019

Goals and Introduction

Data frames

Tidy data

Functions and Pipes

`ggplot2` and tidy data

Tidy Data Analysis with `dplyr`

Exploring and characterizing the dataset

Filtering & Mutating

Standard psychological descriptives

Getting to Tidy with `tidyr`

Tidy verbs

A bigger worked example: Wordbank data

Exciting stuff you can do with this workflow

Reading bigger files, faster

Interactive visualization

Function application

Exercise solutions
