Starting note: The best reference for this material is Hadley Wickham’s R for Data Science (R4DS). My contribution here is to translate this reference for psychology.
By the end of this tutorial, you will know:
This intro will describe a few concepts that you will need to know, using the famous iris dataset that comes with base R (in the datasets package).
The basic data structure we’re working with is the data frame, or tibble (in the tidyverse reimplementation). Data frames have rows and columns, and each column has a distinct data type. The implementation in Python’s pandas is distinct but most of the concepts are the same.
iris is a data frame showing the measurements of a bunch of different instances of iris flowers from different species. (Sepals are the things outside the petals that protect the flower while it’s blooming; petals are the actual petals of the flower.)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
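To make the "each column has a distinct data type" point concrete, here is a minimal sketch that asks each column of iris for its class. It uses only base R; the name col_types is mine, not part of the dataset.

```r
# Ask every column of the built-in iris data frame for its class.
col_types <- sapply(iris, class)
print(col_types)

# Rows are individual flowers, columns are variables.
print(dim(iris))
```

The four measurement columns are numeric and Species is a factor, which is why they behave differently in summaries and plots.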
Exercise. R is a very flexible programming language, which is both a strength and a weakness. There are many ways to get a particular value of a variable in a data frame. You can use $ to access a column, as in iris$Sepal.Length, or you can treat the data frame as a matrix, e.g. iris[1,1], or even as a list, as in iris[[1]]. You can also mix numeric references and named references, e.g. iris[["Sepal.Length"]]. Turn to your neighbor (and/or google) and find as many ways as you can to access the petal length of the third iris in the dataset (row 3).
# fill me in with calls to the iris dataset that all return the same cell (third from the top, Petal Length).
iris %>% select(Petal.Length) %>% slice(3) %>% pull()
## [1] 1.3
iris[["Petal.Length"]][3]
## [1] 1.3
iris[3,3]
## [1] 1.3
Discussion. Why might some ways of doing this be better than others? One answer: readability.
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
Here’s the basic idea: In tidy data, every row is a single observation (trial), and every column describes a variable with some value describing that trial.
And if you know that data are formatted this way, then you can do amazing things, basically because you can take a uniform approach to the dataset. From R4DS:
“There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine.”
iris is a tidy dataset. Each row is an observation of an individual iris, each column is a different variable.
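As a small illustration of the "vectorised nature" mentioned in the R4DS quote, here is a sketch that computes a derived value for every flower in one expression, with no loop. The variable name petal_ratio is hypothetical; I made it up for this example.

```r
# Because each variable is a column, arithmetic applies to whole columns at once.
petal_ratio <- iris$Petal.Length / iris$Petal.Width

# One ratio per row (per flower).
length(petal_ratio)
head(round(petal_ratio, 2))
```

This only works so cleanly because every row is one observation and every column is one variable.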
Exercise. Take a look at these data, as downloaded from Amazon Mechanical Turk. They describe an experiment where people had to estimate the price of a dog, a plasma TV, and a sushi dinner (and they were primed with anchors that differed across conditions). It’s a replication of a paper by Janiszewski & Uy (2008). Examine this dataset with your neighbor and sketch out what a tidy version of the dataset would look like (using paper and pencil).
ju <- read_csv("data/janiszewski_rep_cleaned.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Reward = col_double(),
## MaxAssignments = col_double(),
## AssignmentDurationInSeconds = col_double(),
## AutoApprovalDelayInSeconds = col_double(),
## NumberOfSimilarHITs = col_double(),
## LifetimeInSeconds = col_logical(),
## WorkerId = col_double(),
## ApprovalTime = col_logical(),
## RejectionTime = col_logical(),
## RequesterFeedback = col_logical(),
## WorkTimeInSeconds = col_double(),
## Input.price1 = col_double(),
## Input.price2 = col_double(),
## Input.price3 = col_double(),
## Answer.dog_cost = col_double(),
## Answer.plasma_cost = col_double(),
## Answer.sushi_cost = col_double()
## )
## See spec(...) for full column specifications.
head(ju)
## # A tibble: 6 x 34
## HITId HITTypeId Title Description Keywords Reward CreationTime MaxAssignments
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## 2 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## 3 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## 4 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## 5 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## 6 261W… 2DVGTP9R… How … A quick tw… survey,… 0.1 Wed Jan 25 … 30
## # … with 26 more variables: RequesterAnnotation <chr>,
## # AssignmentDurationInSeconds <dbl>, AutoApprovalDelayInSeconds <dbl>,
## # Expiration <chr>, NumberOfSimilarHITs <dbl>, LifetimeInSeconds <lgl>,
## # AssignmentId <chr>, WorkerId <dbl>, AssignmentStatus <chr>,
## # AcceptTime <chr>, SubmitTime <chr>, AutoApprovalTime <chr>,
## # ApprovalTime <lgl>, RejectionTime <lgl>, RequesterFeedback <lgl>,
## # WorkTimeInSeconds <dbl>, LifetimeApprovalRate <chr>,
## # Last30DaysApprovalRate <chr>, Last7DaysApprovalRate <chr>,
## # Input.condition <chr>, Input.price1 <dbl>, Input.price2 <dbl>,
## # Input.price3 <dbl>, Answer.dog_cost <dbl>, Answer.plasma_cost <dbl>,
## # Answer.sushi_cost <dbl>
Everything you typically want to do in statistical programming uses functions. mean is a good example. In its simplest use, mean takes one argument, a numeric vector.
mean(iris$Petal.Length)
## [1] 3.758
We’re going to call this applying the function mean to the variable Petal.Length.
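Functions can also take optional arguments. As a quick sketch (with a made-up vector called scores), here is what the na.rm argument of mean does when a value is missing:

```r
# A toy vector with one missing value (hypothetical data).
scores <- c(1, NA, 3)

# By default a single NA makes the mean NA.
default_mean <- mean(scores)
print(default_mean)

# na.rm = TRUE drops missing values before averaging.
clean_mean <- mean(scores, na.rm = TRUE)
print(clean_mean)
```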
Pipes are a way to write strings of functions more easily. They bring the first argument of the function to the beginning. So you can write:
iris$Petal.Length %>% mean
## [1] 3.758
That’s not very useful yet, but when you start nesting functions, it gets better.
mean(unique(iris$Petal.Length))
## [1] 4.22093
iris$Petal.Length %>% unique() %>% mean(na.rm=TRUE)
## [1] 4.22093
or
round(mean(unique(iris$Petal.Length)), digits = 2)
## [1] 4.22
iris$Petal.Length %>% unique %>% mean %>% round(digits = 2)
## [1] 4.22
# indenting makes things even easier to read
iris$Petal.Length %>%
  unique %>%
  mean %>%
  round(digits = 2)
## [1] 4.22
This can be super helpful for writing strings of functions so that they are readable and distinct.
We’ll be doing a lot of piping of functions with multiple arguments later, and it will really help keep our syntax simple.
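Here is a minimal sketch of what the pipe does when a function has more than one argument: the left-hand side becomes the first argument and anything in the parentheses follows it. I load magrittr directly here, assuming it is installed (the tidyverse packages re-export the same %>% operator); nested and piped are names I made up.

```r
library(magrittr)

# Nested call: read from the inside out.
nested <- round(mean(iris$Petal.Length), digits = 1)

# Piped call: read left to right; digits = 1 is still passed to round().
piped <- iris$Petal.Length %>%
  mean() %>%
  round(digits = 1)

identical(nested, piped)
```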
Exercise. Rewrite these commands using pipes and check that they do the same thing! (Or at least produce the same output). Unpiped version:
length(unique(iris$Species)) # number of species
## [1] 3
Piped version:
iris$Species %>%
unique() %>%
length
## [1] 3
ggplot2 and tidy data
The last piece of our workflow here is going to be the addition of visualization elements. ggplot2 is a plotting package that easily takes advantage of tidy data. ggplots have two important parts (there are of course more):
- aes - the aesthetic mapping, or which data variables get mapped to which visual variables (x, y, color, symbol, etc.)
- geom - the plotting objects that represent the data (points, lines, shapes, etc.)
iris %>%
  ggplot(aes(x = Sepal.Width, y = Sepal.Length, col = Species)) +
  geom_point()
And just to let you know my biases, I like theme_few from ggthemes and scale_color_solarized as my palette.
iris %>%
  ggplot(aes(Sepal.Width, Sepal.Length, col = Species)) +
  geom_point() +
  ggthemes::theme_few() +
  ggthemes::scale_color_solarized()
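One more sketch of why tidy data and ggplot2 fit together: because Species is just another column, it can drive panels instead of color. This uses facet_wrap, which is not covered above, so treat it as an optional extra rather than part of the core workflow; the object name p is mine.

```r
library(ggplot2)

# Same aes + geom recipe as above, but Species splits the plot into panels.
p <- ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point() +
  facet_wrap(~ Species)

print(p)
```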
dplyr
Reference: R4DS Chapter 5
Let’s take a psychological dataset. Here are the raw data from Stiller, Goodman, & Frank (2015).
These data are tidy: each row describes a single trial, and each column describes some aspect of that trial, including the participant’s id (subid), age (age), condition (condition - “Label” is the experimental condition, “No Label” is the control), and item (item - which thing furble was trying to find).
We are going to manipulate these data using “verbs” from dplyr. I’ll only teach four verbs, the most common in my workflow (but there are many other useful ones):
- filter - keep only the rows that satisfy some logical condition (removing the rest)
- mutate - create new columns
- group_by - group the data into subsets by some column
- summarize - apply some function over columns in each group
sgf <- read_csv("data/stiller_scales_data.csv")
## Parsed with column specification:
## cols(
## subid = col_character(),
## item = col_character(),
## correct = col_double(),
## age = col_double(),
## condition = col_character()
## )
sgf
## # A tibble: 588 x 5
## subid item correct age condition
## <chr> <chr> <dbl> <dbl> <chr>
## 1 M22 faces 1 2 Label
## 2 M22 houses 1 2 Label
## 3 M22 pasta 0 2 Label
## 4 M22 beds 0 2 Label
## 5 T22 beds 0 2.13 Label
## 6 T22 faces 0 2.13 Label
## 7 T22 houses 1 2.13 Label
## 8 T22 pasta 1 2.13 Label
## 9 T17 pasta 0 2.32 Label
## 10 T17 faces 0 2.32 Label
## # … with 578 more rows
Inspect the various variables before you start any analysis. Lots of people recommend summary but TBH I don’t find it useful.
summary(sgf)
## subid item correct age
## Length:588 Length:588 Min. :0.0000 Min. :2.000
## Class :character Class :character 1st Qu.:0.0000 1st Qu.:2.850
## Mode :character Mode :character Median :0.0000 Median :3.460
## Mean :0.4473 Mean :3.525
## 3rd Qu.:1.0000 3rd Qu.:4.290
## Max. :1.0000 Max. :4.960
## condition
## Length:588
## Class :character
## Mode :character
##
##
##
This output just feels overwhelming and uninformative.
You can look at each variable by itself:
unique(sgf$condition)
## [1] "Label" "No Label"
sgf$subid %>%
unique %>%
length
## [1] 147
Or use interactive tools like View or DT::datatable (which I really like).
View(sgf)
DT::datatable(sgf)
There are lots of reasons you might want to remove rows from your dataset, including getting rid of outliers, selecting subpopulations, etc. filter is a verb (function) that takes a data frame as its first argument, and then as its second takes the condition you want to filter on.
So if you wanted to look only at two year olds, you could do this. (Note that you can give two conditions separated by a comma; you could also write age > 2 & age < 3. Without pipes, the equivalent call is filter(sgf, age > 2, age < 3).)
Note that we’re going to be using pipes with functions over data frames here. The way this works is that:
- dplyr verbs always take the data frame as their first argument, and
- the pipe passes whatever is on its left in as that first argument.
This is essentially the huge insight of dplyr: you can chain verbs into readable and efficient sequences of operations over dataframes, provided 1) the verbs all have the same syntax (which they do) and 2) the data all have the same structure (which they do if they are tidy).
OK, so filtering:
sgf %>%
  filter(age > 2,
         age < 3)
## # A tibble: 188 x 5
## subid item correct age condition
## <chr> <chr> <dbl> <dbl> <chr>
## 1 T22 beds 0 2.13 Label
## 2 T22 faces 0 2.13 Label
## 3 T22 houses 1 2.13 Label
## 4 T22 pasta 1 2.13 Label
## 5 T17 pasta 0 2.32 Label
## 6 T17 faces 0 2.32 Label
## 7 T17 houses 0 2.32 Label
## 8 T17 beds 0 2.32 Label
## 9 M3 faces 0 2.38 Label
## 10 M3 houses 1 2.38 Label
## # … with 178 more rows
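If you don't have the data file handy, here is a small self-contained sketch of the same idea. The tibble toy and its values are made up to mimic the columns of sgf; they are not the real data. It shows that comma-separated conditions and & give the same rows, and that the strict inequality drops a child aged exactly 2.

```r
library(dplyr)

# Hypothetical stand-in for sgf: four children, one row each.
toy <- tibble(
  subid = c("S1", "S2", "S3", "S4"),
  age = c(2.0, 2.4, 2.9, 3.5),
  condition = c("Label", "No Label", "Label", "Label")
)

# Two conditions separated by a comma ...
with_commas <- toy %>% filter(age > 2, age < 3)

# ... are combined with a logical AND, so this is the same filter.
with_and <- toy %>% filter(age > 2 & age < 3)

all.equal(with_commas, with_and)

# age > 2 is strict, so S1 (exactly 2.0) is not kept.
with_commas$subid
```

If you want to keep children who are exactly 2, age >= 2 is the condition to use.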
Exercise. Keep only the “faces” trials in the “Label” condition.
sgf %>%
filter(condition == "Label",
item == "faces")
## # A tibble: 75 x 5
## subid item correct age condition
## <chr> <chr> <dbl> <dbl> <chr>
## 1 M22 faces 1 2 Label
## 2 T22 faces 0 2.13 Label
## 3 T17 faces 0 2.32 Label
## 4 M3 faces 0 2.38 Label
## 5 T19 faces 0 2.47 Label
## 6 T20 faces 1 2.5 Label
## 7 T21 faces 1 2.58 Label
## 8 M26 faces 1 2.59 Label
## 9 T18 faces 1 2.61 Label
## 10 T12 faces 0 2.72 Label
## # … with 65 more rows
sgf[sgf$condition == "Label" & sgf$item == "faces", ] # all the columns
## # A tibble: 75 x 5
## subid item correct age condition
## <chr> <chr> <dbl> <dbl> <chr>
## 1 M22 faces 1 2 Label
## 2 T22 faces 0 2.13 Label
## 3 T17 faces 0 2.32 Label
## 4 M3 faces 0 2.38 Label
## 5 T19 faces 0 2.47 Label
## 6 T20 faces 1 2.5 Label
## 7 T21 faces 1 2.58 Label
## 8 M26 faces 1 2.59 Label
## 9 T18 faces 1 2.61 Label
## 10 T12 faces 0 2.72 Label
## # … with 65 more rows
There are also times when you want to add or remove columns. You might want to remove columns to simplify the dataset. There’s not much to simplify here, but if you wanted to do that, the verb is select.
sgf %>%
select(subid, age, correct)
## # A tibble: 588 x 3
## subid age correct
## <chr> <dbl> <dbl>
## 1 M22 2 1
## 2 M22 2 1
## 3 M22 2 0
## 4 M22 2 0
## 5 T22 2.13 0
## 6 T22 2.13 0
## 7 T22 2.13 1
## 8 T22 2.13 1
## 9 T17 2.32 0
## 10 T17 2.32 0
## # … with 578 more rows
sgf %>%
select(-condition)
## # A tibble: 588 x 4
## subid item correct age
## <chr> <chr> <dbl> <dbl>
## 1 M22 faces 1 2
## 2 M22 houses 1 2
## 3 M22 pasta 0 2
## 4 M22 beds 0 2
## 5 T22 beds 0 2.13
## 6 T22 faces 0 2.13
## 7 T22 houses 1 2.13
## 8 T22 pasta 1 2.13
## 9 T17 pasta 0 2.32
## 10 T17 faces 0 2.32
## # … with 578 more rows
sgf %>%
select(1)
## # A tibble: 588 x 1
## subid
## <chr>
## 1 M22
## 2 M22
## 3 M22
## 4 M22
## 5 T22
## 6 T22
## 7 T22
## 8 T22
## 9 T17
## 10 T17
## # … with 578 more rows
sgf %>%
select(starts_with("sub"))
## # A tibble: 588 x 1
## subid
## <chr>
## 1 M22
## 2 M22
## 3 M22
## 4 M22
## 5 T22
## 6 T22
## 7 T22
## 8 T22
## 9 T17
## 10 T17
## # … with 578 more rows
# learn about this with ?select
Perhaps more useful is adding columns. You might do this to compute some kind of derived variable. mutate is the verb for these situations - it allows you to add a column. Let’s add a discrete age group factor to our dataset.
sgf <- sgf %>%
mutate(age_group = cut(age, 2:5, include.lowest = TRUE),
age_group_halfyear = cut(age, seq(2,5,.5), include.lowest = TRUE))
# sgf$age_group <- cut(sgf$age, 2:5, include.lowest = TRUE)
# sgf$age_group <- with(sgf, cut(age, 2:5, include.lowest = TRUE))
head(sgf$age_group)
## [1] [2,3] [2,3] [2,3] [2,3] [2,3] [2,3]
## Levels: [2,3] (3,4] (4,5]
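To see what cut is doing with those breaks, here is a minimal base R sketch on a handful of made-up ages (the vector ages is hypothetical). Intervals are closed on the right by default, and include.lowest = TRUE is what lets an age of exactly 2 fall into the first bin.

```r
# Hypothetical ages, including values that sit exactly on a break.
ages <- c(2, 2.5, 3, 3.1, 5)

# Same call shape as in the mutate() above.
age_group <- cut(ages, breaks = 2:5, include.lowest = TRUE)
as.character(age_group)
levels(age_group)

# Without include.lowest, the first interval is (2,3], so exactly 2 becomes NA.
no_lowest <- cut(2, breaks = 2:5)
is.na(no_lowest)
```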
We typically describe datasets at the level of subjects, not trials. We need two verbs to get a summary at the level of subjects: group_by and summarise (kiwi spelling). Grouping alone doesn’t do much.
sgf %>%
group_by(age_group)
## # A tibble: 588 x 7
## # Groups: age_group [3]
## subid item correct age condition age_group age_group_halfyear
## <chr> <chr> <dbl> <dbl> <chr> <fct> <fct>
## 1 M22 faces 1 2 Label [2,3] [2,2.5]
## 2 M22 houses 1 2 Label [2,3] [2,2.5]
## 3 M22 pasta 0 2 Label [2,3] [2,2.5]
## 4 M22 beds 0 2 Label [2,3] [2,2.5]
## 5 T22 beds 0 2.13 Label [2,3] [2,2.5]
## 6 T22 faces 0 2.13 Label [2,3] [2,2.5]
## 7 T22 houses 1 2.13 Label [2,3] [2,2.5]
## 8 T22 pasta 1 2.13 Label [2,3] [2,2.5]
## 9 T17 pasta 0 2.32 Label [2,3] [2,2.5]
## 10 T17 faces 0 2.32 Label [2,3] [2,2.5]
## # … with 578 more rows
All it does is add a grouping marker.
What summarise does is to apply a function to a part of the dataset to create a new summary dataset. So we can apply the function mean to the dataset and get the grand mean.
## DO NOT DO THIS!!!
# foo <- initialize_the_thing_being_bound()
# for (i in 1:length(unique(sgf$item))) {
# for (j in 1:length(unique(sgf$condition))) {
# this_data <- sgf[sgf$item == unique(sgf$item)[i] &
# sgf$condition == unique(sgf$condition)[j],]
# do_a_thing(this_data)
# bind_together_somehow(this_data)
# }
# }
sgf %>%
summarise(correct = mean(correct))
## # A tibble: 1 x 1
## correct
## <dbl>
## 1 0.447
Note the syntax here: summarise takes multiple new_column_name = function_to_be_applied_to_data(data_column) entries in a list. Using this syntax, we can create more elaborate summary datasets also:
sgf %>%
summarise(correct = mean(correct),
n_observations = length(subid))
## # A tibble: 1 x 2
## correct n_observations
## <dbl> <int>
## 1 0.447 588
Where these two verbs shine is in combination, though, because summarise applies functions to columns within each group of your grouped data, not just to the whole dataset!
So we can group by age or condition or whatever else we want and then carry out the same procedure, and all of a sudden we are doing something extremely useful!
sgf_means <- sgf %>%
  group_by(age_group, condition) %>%
  summarise(correct = mean(correct),
            n_observations = length(subid))
## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)
sgf_means
## # A tibble: 6 x 4
## # Groups: age_group [3]
## age_group condition correct n_observations
## <fct> <chr> <dbl> <int>
## 1 [2,3] Label 0.570 100
## 2 [2,3] No Label 0.240 96
## 3 (3,4] Label 0.712 104
## 4 (3,4] No Label 0.240 96
## 5 (4,5] Label 0.771 96
## 6 (4,5] No Label 0.125 96
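Here is a self-contained sketch of the same group_by plus summarise pattern on made-up data (the tibble toy is hypothetical, not the real sgf). It also uses two things not shown above: n(), dplyr's helper for counting rows in the current group, and the .groups argument that the regrouping message refers to, which I am assuming is available (dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical trial-level data: three children, two trials each.
toy <- tibble(
  subid = c("S1", "S1", "S2", "S2", "S3", "S3"),
  condition = c("Label", "Label", "No Label", "No Label", "Label", "Label"),
  correct = c(1, 0, 0, 0, 1, 1)
)

toy_means <- toy %>%
  group_by(condition) %>%
  summarise(correct = mean(correct),  # proportion correct per condition
            n_observations = n(),     # number of rows (trials) per condition
            .groups = "drop")         # return an ungrouped tibble

toy_means
```

Compare this with the nested loop in the "DO NOT DO THIS" comment: the grouping columns replace the loop indices, and the binding together happens for free.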
These summary data are typically very useful for plotting.
ggplot(sgf_means,
       aes(x = age_group, y = correct, col = condition, group = condition)) +
  geom_line() +
  ylim(0,1) +
  ggthemes::theme_few()
# sgf %>%
# mutate(age_group) %>%
# group_by() %>%
# summarise %>%
# ggplot()
Exercise. One of the most important analytic workflows for psychological data is to take some function (e.g., the mean) for each participant and then look at grand means and variability across participant means. This analytic workflow requires grouping, summarising, and then grouping again and summarising again! Use dplyr to make the same table as above (sgf_means) but with means (and SDs if you want) computed across subject means, not across all data points. (The means will be pretty similar as this is a balanced design but in a case with lots of missing data, they will vary. In contrast, the SD doesn’t even really make sense across the binary data before you aggregate across subjects.)
# exercise
sgf_sub_means <- sgf %>%
group_by(age_group, condition, subid) %>%
summarise(correct = mean(correct))
## `summarise()` regrouping output by 'age_group', 'condition' (override with `.groups` argument)
sgf_grand_means <- sgf_sub_means %>%
group_by(age_group, condition) %>%
summarise(mean_correct = mean(correct),
sd_correct = sd(correct))
## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)
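To see why the two kinds of mean can differ when the design is not balanced, here is a tiny worked sketch with made-up data: subject A contributes four trials and subject B only one. All names and values here are hypothetical.

```r
library(dplyr)

# Hypothetical unbalanced data: A has 4 trials (all correct), B has 1 (incorrect).
toy <- tibble(
  subid = c("A", "A", "A", "A", "B"),
  correct = c(1, 1, 1, 1, 0)
)

# Mean over all trials: A's four trials dominate.
trial_mean <- mean(toy$correct)

# Mean over subject means: each subject counts once.
sub_means <- toy %>%
  group_by(subid) %>%
  summarise(correct = mean(correct), .groups = "drop")
subject_mean <- mean(sub_means$correct)

c(trial_mean = trial_mean, subject_mean = subject_mean)
```

The trial-level mean is 0.8 and the mean of subject means is 0.5, which is the kind of gap the exercise text warns about for data with lots of missing trials.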
tidyr
Reference: R4DS Chapter 12
Psychological data often comes in two flavors: long and wide data. Long form data is tidy, but that format is less common. It’s much more common to get wide data, in which every row is a case (e.g., a subject), and each column is a variable. In this format multiple trials (observations) are stored as columns.
This can go a bunch of ways, for example, the most common might be to have subjects as rows and trials as columns. But here’s an example from a real dataset on “unconscious arithmetic” from Sklar et al. (2012). In it, items (particular arithmetic problems) are rows and subjects are columns.
sklar <- read_csv("data/sklar_data.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## prime = col_character(),
## congruent = col_character(),
## operand = col_character()
## )
## See spec(...) for full column specifications.
head(sklar)
## # A tibble: 6 x 28
## prime prime.result target congruent operand distance counterbalance `1`
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 =1+2… 8 9 no A -1 1 597
## 2 =1+3… 9 11 no A -2 1 699
## 3 =1+4… 8 12 no A -4 1 700
## 4 =1+6… 10 12 no A -2 1 628
## 5 =1+9… 12 11 no A 1 1 768
## 6 =1+9… 13 12 no A 1 1 595
## # … with 20 more variables: `2` <dbl>, `3` <dbl>, `4` <dbl>, `5` <dbl>,
## # `6` <dbl>, `7` <dbl>, `8` <dbl>, `9` <dbl>, `10` <dbl>, `11` <dbl>,
## # `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>, `17` <dbl>,
## # `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>
The two main verbs for tidying are gather and spread. (There are lots of others in the tidyr package if you want to split or merge columns etc.)
First, let’s go away from tidiness. We’re going to spread a tidy dataset. Remember that tidy data has one observation in each row, but we want to “spread” it out so it’s wide. (The metaphor works better in this description). This may not be helpful, but I think of the data as a long cream cheese pat, and I “spread” it over a wide bagel.
Let’s try it on the SGF data above. First we’ll spread it so it’s wide. I do this by indicating what column is going to be the column labels in the new data frame, here it’s item, and what column is going to have the values in those columns, here it’s correct:
sgf_wide <- sgf %>%
spread(item, correct)
head(sgf_wide)
## # A tibble: 6 x 9
## subid age condition age_group age_group_halfyear beds faces houses pasta
## <chr> <dbl> <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 C1 4.16 Label (4,5] (4,4.5] 1 1 1 1
## 2 C10 3.46 Label (3,4] (3,3.5] 1 0 0 1
## 3 C11 4.22 Label (4,5] (4,4.5] 1 1 0 1
## 4 C12 3.56 Label (3,4] (3.5,4] 1 1 0 1
## 5 C13 4.38 Label (4,5] (4,4.5] 1 0 1 0
## 6 C14 4.57 Label (4,5] (4.5,5] 1 1 1 0
Now you can see that there is no explicit specification that all those item columns, e.g. faces, beds are holding correct values, but the data are much more compact. (This form is easy to work with in Excel, so that’s probably why people use it in psych).
OK, let’s go back to our original format. gather is about making wide data into tidy (long) data. When you gather a dataset you are “gathering” a bunch of columns (maybe that you previously spread). You specify what all the columns have in common (e.g., they are all subject IDs in the Sklar example above), and you say what measure they all contain (they all have RTs). So in that sense, it’s the flip of spread. You did spread(item, correct) and now you’ll gather(item, correct, ...). The one extra argument is that you need to specify the columns that will go into item!
sgf_long <- sgf_wide %>%
gather(item, correct, beds, faces, houses, pasta)
head(sgf_long)
## # A tibble: 6 x 7
## subid age condition age_group age_group_halfyear item correct
## <chr> <dbl> <chr> <fct> <fct> <chr> <dbl>
## 1 C1 4.16 Label (4,5] (4,4.5] beds 1
## 2 C10 3.46 Label (3,4] (3,3.5] beds 1
## 3 C11 4.22 Label (4,5] (4,4.5] beds 1
## 4 C12 3.56 Label (3,4] (3.5,4] beds 1
## 5 C13 4.38 Label (4,5] (4,4.5] beds 1
## 6 C14 4.57 Label (4,5] (4.5,5] beds 1
head(sgf)
## # A tibble: 6 x 7
## subid item correct age condition age_group age_group_halfyear
## <chr> <chr> <dbl> <dbl> <chr> <fct> <fct>
## 1 M22 faces 1 2 Label [2,3] [2,2.5]
## 2 M22 houses 1 2 Label [2,3] [2,2.5]
## 3 M22 pasta 0 2 Label [2,3] [2,2.5]
## 4 M22 beds 0 2 Label [2,3] [2,2.5]
## 5 T22 beds 0 2.13 Label [2,3] [2,2.5]
## 6 T22 faces 0 2.13 Label [2,3] [2,2.5]
There are lots of flexible ways to specify these columns - you can enumerate their names like I did.
# gather(item, correct, 5:8)
# gather(item, correct, starts_with("foo"))
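A note for newer versions of tidyr: gather and spread still work, but the tidyr documentation marks them as superseded by pivot_longer and pivot_wider (tidyr 1.0.0 and later). Here is a minimal sketch of the same round trip with those functions, on a made-up two-subject data frame (toy_wide and its values are hypothetical). One visible difference is row order: pivot_longer goes row by row, while gather stacks column by column.

```r
library(tidyr)

# Hypothetical wide data: one row per subject, one column per item.
toy_wide <- data.frame(
  subid = c("S1", "S2"),
  beds = c(1, 0),
  faces = c(0, 1),
  stringsAsFactors = FALSE
)

# Wide to long: the counterpart of gather(item, correct, beds, faces).
toy_long <- pivot_longer(toy_wide,
                         cols = c(beds, faces),
                         names_to = "item",
                         values_to = "correct")

# Long to wide: the counterpart of spread(item, correct).
toy_back <- pivot_wider(toy_long,
                        names_from = item,
                        values_from = correct)

toy_long
toy_back
```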
Exercise. Take the Sklar data from above, where each column is a separate subject, and gather it so that it’s a tidy dataset. What challenges come up?
sklar
## # A tibble: 154 x 28
## prime prime.result target congruent operand distance counterbalance `1`
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 =1+2… 8 9 no A -1 1 597
## 2 =1+3… 9 11 no A -2 1 699
## 3 =1+4… 8 12 no A -4 1 700
## 4 =1+6… 10 12 no A -2 1 628
## 5 =1+9… 12 11 no A 1 1 768
## 6 =1+9… 13 12 no A 1 1 595
## 7 =2+1… 12 11 no A 1 1 664
## 8 =2+3… 11 10 no A 1 1 803
## 9 =2+3… 12 11 no A 1 1 767
## 10 =2+5… 13 9 no A 4 1 700
## # … with 144 more rows, and 20 more variables: `2` <dbl>, `3` <dbl>, `4` <dbl>,
## # `5` <dbl>, `6` <dbl>, `7` <dbl>, `8` <dbl>, `9` <dbl>, `10` <dbl>,
## # `11` <dbl>, `12` <dbl>, `13` <dbl>, `14` <dbl>, `15` <dbl>, `16` <dbl>,
## # `17` <dbl>, `18` <dbl>, `19` <dbl>, `20` <dbl>, `21` <dbl>
sklar_tidy <- sklar %>%
gather(subid, rt, 8:28)
sklar_tidy
## # A tibble: 3,234 x 9
## prime prime.result target congruent operand distance counterbalance subid
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 =1+2… 8 9 no A -1 1 1
## 2 =1+3… 9 11 no A -2 1 1
## 3 =1+4… 8 12 no A -4 1 1
## 4 =1+6… 10 12 no A -2 1 1
## 5 =1+9… 12 11 no A 1 1 1
## 6 =1+9… 13 12 no A 1 1 1
## 7 =2+1… 12 11 no A 1 1 1
## 8 =2+3… 11 10 no A 1 1 1
## 9 =2+3… 12 11 no A 1 1 1
## 10 =2+5… 13 9 no A 4 1 1
## # … with 3,224 more rows, and 1 more variable: rt <dbl>
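One challenge you can see in the output above is that subid comes out as a character column, because it was built from column names. Here is a small sketch of that on made-up data shaped like the Sklar file: toy_sklar, its item labels, and the RTs in column `2` are hypothetical, and converting with as.integer is just one way to handle it.

```r
library(tidyr)

# Hypothetical miniature of the Sklar layout: items in rows, subjects in columns.
toy_sklar <- data.frame(
  prime = c("p1", "p2"),
  `1` = c(597, 699),
  `2` = c(610, 702),
  check.names = FALSE,      # keep the numeric column names as they are
  stringsAsFactors = FALSE
)

# Column names that start with a digit need backticks.
toy_tidy <- gather(toy_sklar, subid, rt, `1`, `2`)

# subid was a column name, so it arrives as character.
class(toy_tidy$subid)

# Convert it if you want a numeric subject ID.
toy_tidy$subid <- as.integer(toy_tidy$subid)
toy_tidy
```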
Let’s also go back and tidy an easier one: iris.
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
## 51 7.0 3.2 4.7 1.4 versicolor
## 52 6.4 3.2 4.5 1.5 versicolor
## 53 6.9 3.1 4.9 1.5 versicolor
## 54 5.5 2.3 4.0 1.3 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 56 5.7 2.8 4.5 1.3 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 59 6.6 2.9 4.6 1.3 versicolor
## 60 5.2 2.7 3.9 1.4 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 62 5.9 3.0 4.2 1.5 versicolor
## 63 6.0 2.2 4.0 1.0 versicolor
## 64 6.1 2.9 4.7 1.4 versicolor
## 65 5.6 2.9 3.6 1.3 versicolor
## 66 6.7 3.1 4.4 1.4 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 68 5.8 2.7 4.1 1.0 versicolor
## 69 6.2 2.2 4.5 1.5 versicolor
## 70 5.6 2.5 3.9 1.1 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 72 6.1 2.8 4.0 1.3 versicolor
## 73 6.3 2.5 4.9 1.5 versicolor
## 74 6.1 2.8 4.7 1.2 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 76 6.6 3.0 4.4 1.4 versicolor
## 77 6.8 2.8 4.8 1.4 versicolor
## 78 6.7 3.0 5.0 1.7 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 80 5.7 2.6 3.5 1.0 versicolor
## 81 5.5 2.4 3.8 1.1 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 83 5.8 2.7 3.9 1.2 versicolor
## 84 6.0 2.7 5.1 1.6 versicolor
## 85 5.4 3.0 4.5 1.5 versicolor
## 86 6.0 3.4 4.5 1.6 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 89 5.6 3.0 4.1 1.3 versicolor
## 90 5.5 2.5 4.0 1.3 versicolor
## 91 5.5 2.6 4.4 1.2 versicolor
## 92 6.1 3.0 4.6 1.4 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 94 5.0 2.3 3.3 1.0 versicolor
## 95 5.6 2.7 4.2 1.3 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 97 5.7 2.9 4.2 1.3 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 99 5.1 2.5 3.0 1.1 versicolor
## 100 5.7 2.8 4.1 1.3 versicolor
## 101 6.3 3.3 6.0 2.5 virginica
## 102 5.8 2.7 5.1 1.9 virginica
## 103 7.1 3.0 5.9 2.1 virginica
## 104 6.3 2.9 5.6 1.8 virginica
## 105 6.5 3.0 5.8 2.2 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 107 4.9 2.5 4.5 1.7 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 109 6.7 2.5 5.8 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 112 6.4 2.7 5.3 1.9 virginica
## 113 6.8 3.0 5.5 2.1 virginica
## 114 5.7 2.5 5.0 2.0 virginica
## 115 5.8 2.8 5.1 2.4 virginica
## 116 6.4 3.2 5.3 2.3 virginica
## 117 6.5 3.0 5.5 1.8 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 120 6.0 2.2 5.0 1.5 virginica
## 121 6.9 3.2 5.7 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 124 6.3 2.7 4.9 1.8 virginica
## 125 6.7 3.3 5.7 2.1 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 127 6.2 2.8 4.8 1.8 virginica
## 128 6.1 3.0 4.9 1.8 virginica
## 129 6.4 2.8 5.6 2.1 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 133 6.4 2.8 5.6 2.2 virginica
## 134 6.3 2.8 5.1 1.5 virginica
## 135 6.1 2.6 5.6 1.4 virginica
## 136 7.7 3.0 6.1 2.3 virginica
## 137 6.3 3.4 5.6 2.4 virginica
## 138 6.4 3.1 5.5 1.8 virginica
## 139 6.0 3.0 4.8 1.8 virginica
## 140 6.9 3.1 5.4 2.1 virginica
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
iris %>%
mutate(iris_id = 1:nrow(iris)) %>%
gather(measurement, centimeters, Sepal.Length, Petal.Length, Sepal.Width, Petal.Width)
## Species iris_id measurement centimeters
## 1 setosa 1 Sepal.Length 5.1
## 2 setosa 2 Sepal.Length 4.9
## 3 setosa 3 Sepal.Length 4.7
## 4 setosa 4 Sepal.Length 4.6
## 5 setosa 5 Sepal.Length 5.0
## 6 setosa 6 Sepal.Length 5.4
## 7 setosa 7 Sepal.Length 4.6
## 8 setosa 8 Sepal.Length 5.0
## 9 setosa 9 Sepal.Length 4.4
## 10 setosa 10 Sepal.Length 4.9
## 11 setosa 11 Sepal.Length 5.4
## 12 setosa 12 Sepal.Length 4.8
## 13 setosa 13 Sepal.Length 4.8
## 14 setosa 14 Sepal.Length 4.3
## 15 setosa 15 Sepal.Length 5.8
## 16 setosa 16 Sepal.Length 5.7
## 17 setosa 17 Sepal.Length 5.4
## 18 setosa 18 Sepal.Length 5.1
## 19 setosa 19 Sepal.Length 5.7
## 20 setosa 20 Sepal.Length 5.1
## 21 setosa 21 Sepal.Length 5.4
## 22 setosa 22 Sepal.Length 5.1
## 23 setosa 23 Sepal.Length 4.6
## 24 setosa 24 Sepal.Length 5.1
## 25 setosa 25 Sepal.Length 4.8
## 26 setosa 26 Sepal.Length 5.0
## 27 setosa 27 Sepal.Length 5.0
## 28 setosa 28 Sepal.Length 5.2
## 29 setosa 29 Sepal.Length 5.2
## 30 setosa 30 Sepal.Length 4.7
## 31 setosa 31 Sepal.Length 4.8
## 32 setosa 32 Sepal.Length 5.4
## 33 setosa 33 Sepal.Length 5.2
## 34 setosa 34 Sepal.Length 5.5
## 35 setosa 35 Sepal.Length 4.9
## 36 setosa 36 Sepal.Length 5.0
## 37 setosa 37 Sepal.Length 5.5
## 38 setosa 38 Sepal.Length 4.9
## 39 setosa 39 Sepal.Length 4.4
## 40 setosa 40 Sepal.Length 5.1
## 41 setosa 41 Sepal.Length 5.0
## 42 setosa 42 Sepal.Length 4.5
## 43 setosa 43 Sepal.Length 4.4
## 44 setosa 44 Sepal.Length 5.0
## 45 setosa 45 Sepal.Length 5.1
## 46 setosa 46 Sepal.Length 4.8
## 47 setosa 47 Sepal.Length 5.1
## 48 setosa 48 Sepal.Length 4.6
## 49 setosa 49 Sepal.Length 5.3
## 50 setosa 50 Sepal.Length 5.0
## 51 versicolor 51 Sepal.Length 7.0
## 52 versicolor 52 Sepal.Length 6.4
## 53 versicolor 53 Sepal.Length 6.9
## 54 versicolor 54 Sepal.Length 5.5
## 55 versicolor 55 Sepal.Length 6.5
## 56 versicolor 56 Sepal.Length 5.7
## 57 versicolor 57 Sepal.Length 6.3
## 58 versicolor 58 Sepal.Length 4.9
## 59 versicolor 59 Sepal.Length 6.6
## 60 versicolor 60 Sepal.Length 5.2
## 61 versicolor 61 Sepal.Length 5.0
## 62 versicolor 62 Sepal.Length 5.9
## 63 versicolor 63 Sepal.Length 6.0
## 64 versicolor 64 Sepal.Length 6.1
## 65 versicolor 65 Sepal.Length 5.6
## 66 versicolor 66 Sepal.Length 6.7
## 67 versicolor 67 Sepal.Length 5.6
## 68 versicolor 68 Sepal.Length 5.8
## 69 versicolor 69 Sepal.Length 6.2
## 70 versicolor 70 Sepal.Length 5.6
## 71 versicolor 71 Sepal.Length 5.9
## 72 versicolor 72 Sepal.Length 6.1
## 73 versicolor 73 Sepal.Length 6.3
## 74 versicolor 74 Sepal.Length 6.1
## 75 versicolor 75 Sepal.Length 6.4
## 76 versicolor 76 Sepal.Length 6.6
## 77 versicolor 77 Sepal.Length 6.8
## 78 versicolor 78 Sepal.Length 6.7
## 79 versicolor 79 Sepal.Length 6.0
## 80 versicolor 80 Sepal.Length 5.7
## 81 versicolor 81 Sepal.Length 5.5
## 82 versicolor 82 Sepal.Length 5.5
## 83 versicolor 83 Sepal.Length 5.8
## 84 versicolor 84 Sepal.Length 6.0
## 85 versicolor 85 Sepal.Length 5.4
## 86 versicolor 86 Sepal.Length 6.0
## 87 versicolor 87 Sepal.Length 6.7
## 88 versicolor 88 Sepal.Length 6.3
## 89 versicolor 89 Sepal.Length 5.6
## 90 versicolor 90 Sepal.Length 5.5
## 91 versicolor 91 Sepal.Length 5.5
## 92 versicolor 92 Sepal.Length 6.1
## 93 versicolor 93 Sepal.Length 5.8
## 94 versicolor 94 Sepal.Length 5.0
## 95 versicolor 95 Sepal.Length 5.6
## 96 versicolor 96 Sepal.Length 5.7
## 97 versicolor 97 Sepal.Length 5.7
## 98 versicolor 98 Sepal.Length 6.2
## 99 versicolor 99 Sepal.Length 5.1
## 100 versicolor 100 Sepal.Length 5.7
## 101 virginica 101 Sepal.Length 6.3
## 102 virginica 102 Sepal.Length 5.8
## 103 virginica 103 Sepal.Length 7.1
## 104 virginica 104 Sepal.Length 6.3
## 105 virginica 105 Sepal.Length 6.5
## 106 virginica 106 Sepal.Length 7.6
## 107 virginica 107 Sepal.Length 4.9
## 108 virginica 108 Sepal.Length 7.3
## 109 virginica 109 Sepal.Length 6.7
## 110 virginica 110 Sepal.Length 7.2
## 111 virginica 111 Sepal.Length 6.5
## 112 virginica 112 Sepal.Length 6.4
## 113 virginica 113 Sepal.Length 6.8
## 114 virginica 114 Sepal.Length 5.7
## 115 virginica 115 Sepal.Length 5.8
## 116 virginica 116 Sepal.Length 6.4
## 117 virginica 117 Sepal.Length 6.5
## 118 virginica 118 Sepal.Length 7.7
## 119 virginica 119 Sepal.Length 7.7
## 120 virginica 120 Sepal.Length 6.0
## 121 virginica 121 Sepal.Length 6.9
## 122 virginica 122 Sepal.Length 5.6
## 123 virginica 123 Sepal.Length 7.7
## 124 virginica 124 Sepal.Length 6.3
## 125 virginica 125 Sepal.Length 6.7
## 126 virginica 126 Sepal.Length 7.2
## 127 virginica 127 Sepal.Length 6.2
## 128 virginica 128 Sepal.Length 6.1
## 129 virginica 129 Sepal.Length 6.4
## 130 virginica 130 Sepal.Length 7.2
## 131 virginica 131 Sepal.Length 7.4
## 132 virginica 132 Sepal.Length 7.9
## 133 virginica 133 Sepal.Length 6.4
## 134 virginica 134 Sepal.Length 6.3
## 135 virginica 135 Sepal.Length 6.1
## 136 virginica 136 Sepal.Length 7.7
## 137 virginica 137 Sepal.Length 6.3
## 138 virginica 138 Sepal.Length 6.4
## 139 virginica 139 Sepal.Length 6.0
## 140 virginica 140 Sepal.Length 6.9
## 141 virginica 141 Sepal.Length 6.7
## 142 virginica 142 Sepal.Length 6.9
## 143 virginica 143 Sepal.Length 5.8
## 144 virginica 144 Sepal.Length 6.8
## 145 virginica 145 Sepal.Length 6.7
## 146 virginica 146 Sepal.Length 6.7
## 147 virginica 147 Sepal.Length 6.3
## 148 virginica 148 Sepal.Length 6.5
## 149 virginica 149 Sepal.Length 6.2
## 150 virginica 150 Sepal.Length 5.9
## 151 setosa 1 Petal.Length 1.4
## 152 setosa 2 Petal.Length 1.4
## 153 setosa 3 Petal.Length 1.3
## 154 setosa 4 Petal.Length 1.5
## 155 setosa 5 Petal.Length 1.4
## 156 setosa 6 Petal.Length 1.7
## 157 setosa 7 Petal.Length 1.4
## 158 setosa 8 Petal.Length 1.5
## 159 setosa 9 Petal.Length 1.4
## 160 setosa 10 Petal.Length 1.5
## 161 setosa 11 Petal.Length 1.5
## 162 setosa 12 Petal.Length 1.6
## 163 setosa 13 Petal.Length 1.4
## 164 setosa 14 Petal.Length 1.1
## 165 setosa 15 Petal.Length 1.2
## 166 setosa 16 Petal.Length 1.5
## 167 setosa 17 Petal.Length 1.3
## 168 setosa 18 Petal.Length 1.4
## 169 setosa 19 Petal.Length 1.7
## 170 setosa 20 Petal.Length 1.5
## 171 setosa 21 Petal.Length 1.7
## 172 setosa 22 Petal.Length 1.5
## 173 setosa 23 Petal.Length 1.0
## 174 setosa 24 Petal.Length 1.7
## 175 setosa 25 Petal.Length 1.9
## 176 setosa 26 Petal.Length 1.6
## 177 setosa 27 Petal.Length 1.6
## 178 setosa 28 Petal.Length 1.5
## 179 setosa 29 Petal.Length 1.4
## 180 setosa 30 Petal.Length 1.6
## 181 setosa 31 Petal.Length 1.6
## 182 setosa 32 Petal.Length 1.5
## 183 setosa 33 Petal.Length 1.5
## 184 setosa 34 Petal.Length 1.4
## 185 setosa 35 Petal.Length 1.5
## 186 setosa 36 Petal.Length 1.2
## 187 setosa 37 Petal.Length 1.3
## 188 setosa 38 Petal.Length 1.4
## 189 setosa 39 Petal.Length 1.3
## 190 setosa 40 Petal.Length 1.5
## 191 setosa 41 Petal.Length 1.3
## 192 setosa 42 Petal.Length 1.3
## 193 setosa 43 Petal.Length 1.3
## 194 setosa 44 Petal.Length 1.6
## 195 setosa 45 Petal.Length 1.9
## 196 setosa 46 Petal.Length 1.4
## 197 setosa 47 Petal.Length 1.6
## 198 setosa 48 Petal.Length 1.4
## 199 setosa 49 Petal.Length 1.5
## 200 setosa 50 Petal.Length 1.4
## 201 versicolor 51 Petal.Length 4.7
## 202 versicolor 52 Petal.Length 4.5
## 203 versicolor 53 Petal.Length 4.9
## 204 versicolor 54 Petal.Length 4.0
## 205 versicolor 55 Petal.Length 4.6
## 206 versicolor 56 Petal.Length 4.5
## 207 versicolor 57 Petal.Length 4.7
## 208 versicolor 58 Petal.Length 3.3
## 209 versicolor 59 Petal.Length 4.6
## 210 versicolor 60 Petal.Length 3.9
## 211 versicolor 61 Petal.Length 3.5
## 212 versicolor 62 Petal.Length 4.2
## 213 versicolor 63 Petal.Length 4.0
## 214 versicolor 64 Petal.Length 4.7
## 215 versicolor 65 Petal.Length 3.6
## 216 versicolor 66 Petal.Length 4.4
## 217 versicolor 67 Petal.Length 4.5
## 218 versicolor 68 Petal.Length 4.1
## 219 versicolor 69 Petal.Length 4.5
## 220 versicolor 70 Petal.Length 3.9
## 221 versicolor 71 Petal.Length 4.8
## 222 versicolor 72 Petal.Length 4.0
## 223 versicolor 73 Petal.Length 4.9
## 224 versicolor 74 Petal.Length 4.7
## 225 versicolor 75 Petal.Length 4.3
## 226 versicolor 76 Petal.Length 4.4
## 227 versicolor 77 Petal.Length 4.8
## 228 versicolor 78 Petal.Length 5.0
## 229 versicolor 79 Petal.Length 4.5
## 230 versicolor 80 Petal.Length 3.5
## 231 versicolor 81 Petal.Length 3.8
## 232 versicolor 82 Petal.Length 3.7
## 233 versicolor 83 Petal.Length 3.9
## 234 versicolor 84 Petal.Length 5.1
## 235 versicolor 85 Petal.Length 4.5
## 236 versicolor 86 Petal.Length 4.5
## 237 versicolor 87 Petal.Length 4.7
## 238 versicolor 88 Petal.Length 4.4
## 239 versicolor 89 Petal.Length 4.1
## 240 versicolor 90 Petal.Length 4.0
## 241 versicolor 91 Petal.Length 4.4
## 242 versicolor 92 Petal.Length 4.6
## 243 versicolor 93 Petal.Length 4.0
## 244 versicolor 94 Petal.Length 3.3
## 245 versicolor 95 Petal.Length 4.2
## 246 versicolor 96 Petal.Length 4.2
## 247 versicolor 97 Petal.Length 4.2
## 248 versicolor 98 Petal.Length 4.3
## 249 versicolor 99 Petal.Length 3.0
## 250 versicolor 100 Petal.Length 4.1
## 251 virginica 101 Petal.Length 6.0
## 252 virginica 102 Petal.Length 5.1
## 253 virginica 103 Petal.Length 5.9
## 254 virginica 104 Petal.Length 5.6
## 255 virginica 105 Petal.Length 5.8
## 256 virginica 106 Petal.Length 6.6
## 257 virginica 107 Petal.Length 4.5
## 258 virginica 108 Petal.Length 6.3
## 259 virginica 109 Petal.Length 5.8
## 260 virginica 110 Petal.Length 6.1
## 261 virginica 111 Petal.Length 5.1
## 262 virginica 112 Petal.Length 5.3
## 263 virginica 113 Petal.Length 5.5
## 264 virginica 114 Petal.Length 5.0
## 265 virginica 115 Petal.Length 5.1
## 266 virginica 116 Petal.Length 5.3
## 267 virginica 117 Petal.Length 5.5
## 268 virginica 118 Petal.Length 6.7
## 269 virginica 119 Petal.Length 6.9
## 270 virginica 120 Petal.Length 5.0
## 271 virginica 121 Petal.Length 5.7
## 272 virginica 122 Petal.Length 4.9
## 273 virginica 123 Petal.Length 6.7
## 274 virginica 124 Petal.Length 4.9
## 275 virginica 125 Petal.Length 5.7
## 276 virginica 126 Petal.Length 6.0
## 277 virginica 127 Petal.Length 4.8
## 278 virginica 128 Petal.Length 4.9
## 279 virginica 129 Petal.Length 5.6
## 280 virginica 130 Petal.Length 5.8
## 281 virginica 131 Petal.Length 6.1
## 282 virginica 132 Petal.Length 6.4
## 283 virginica 133 Petal.Length 5.6
## 284 virginica 134 Petal.Length 5.1
## 285 virginica 135 Petal.Length 5.6
## 286 virginica 136 Petal.Length 6.1
## 287 virginica 137 Petal.Length 5.6
## 288 virginica 138 Petal.Length 5.5
## 289 virginica 139 Petal.Length 4.8
## 290 virginica 140 Petal.Length 5.4
## 291 virginica 141 Petal.Length 5.6
## 292 virginica 142 Petal.Length 5.1
## 293 virginica 143 Petal.Length 5.1
## 294 virginica 144 Petal.Length 5.9
## 295 virginica 145 Petal.Length 5.7
## 296 virginica 146 Petal.Length 5.2
## 297 virginica 147 Petal.Length 5.0
## 298 virginica 148 Petal.Length 5.2
## 299 virginica 149 Petal.Length 5.4
## 300 virginica 150 Petal.Length 5.1
## 301 setosa 1 Sepal.Width 3.5
## 302 setosa 2 Sepal.Width 3.0
## 303 setosa 3 Sepal.Width 3.2
## 304 setosa 4 Sepal.Width 3.1
## 305 setosa 5 Sepal.Width 3.6
## 306 setosa 6 Sepal.Width 3.9
## 307 setosa 7 Sepal.Width 3.4
## 308 setosa 8 Sepal.Width 3.4
## 309 setosa 9 Sepal.Width 2.9
## 310 setosa 10 Sepal.Width 3.1
## 311 setosa 11 Sepal.Width 3.7
## 312 setosa 12 Sepal.Width 3.4
## 313 setosa 13 Sepal.Width 3.0
## 314 setosa 14 Sepal.Width 3.0
## 315 setosa 15 Sepal.Width 4.0
## 316 setosa 16 Sepal.Width 4.4
## 317 setosa 17 Sepal.Width 3.9
## 318 setosa 18 Sepal.Width 3.5
## 319 setosa 19 Sepal.Width 3.8
## 320 setosa 20 Sepal.Width 3.8
## 321 setosa 21 Sepal.Width 3.4
## 322 setosa 22 Sepal.Width 3.7
## 323 setosa 23 Sepal.Width 3.6
## 324 setosa 24 Sepal.Width 3.3
## 325 setosa 25 Sepal.Width 3.4
## 326 setosa 26 Sepal.Width 3.0
## 327 setosa 27 Sepal.Width 3.4
## 328 setosa 28 Sepal.Width 3.5
## 329 setosa 29 Sepal.Width 3.4
## 330 setosa 30 Sepal.Width 3.2
## 331 setosa 31 Sepal.Width 3.1
## 332 setosa 32 Sepal.Width 3.4
## 333 setosa 33 Sepal.Width 4.1
## 334 setosa 34 Sepal.Width 4.2
## 335 setosa 35 Sepal.Width 3.1
## 336 setosa 36 Sepal.Width 3.2
## 337 setosa 37 Sepal.Width 3.5
## 338 setosa 38 Sepal.Width 3.6
## 339 setosa 39 Sepal.Width 3.0
## 340 setosa 40 Sepal.Width 3.4
## 341 setosa 41 Sepal.Width 3.5
## 342 setosa 42 Sepal.Width 2.3
## 343 setosa 43 Sepal.Width 3.2
## 344 setosa 44 Sepal.Width 3.5
## 345 setosa 45 Sepal.Width 3.8
## 346 setosa 46 Sepal.Width 3.0
## 347 setosa 47 Sepal.Width 3.8
## 348 setosa 48 Sepal.Width 3.2
## 349 setosa 49 Sepal.Width 3.7
## 350 setosa 50 Sepal.Width 3.3
## 351 versicolor 51 Sepal.Width 3.2
## 352 versicolor 52 Sepal.Width 3.2
## 353 versicolor 53 Sepal.Width 3.1
## 354 versicolor 54 Sepal.Width 2.3
## 355 versicolor 55 Sepal.Width 2.8
## 356 versicolor 56 Sepal.Width 2.8
## 357 versicolor 57 Sepal.Width 3.3
## 358 versicolor 58 Sepal.Width 2.4
## 359 versicolor 59 Sepal.Width 2.9
## 360 versicolor 60 Sepal.Width 2.7
## 361 versicolor 61 Sepal.Width 2.0
## 362 versicolor 62 Sepal.Width 3.0
## 363 versicolor 63 Sepal.Width 2.2
## 364 versicolor 64 Sepal.Width 2.9
## 365 versicolor 65 Sepal.Width 2.9
## 366 versicolor 66 Sepal.Width 3.1
## 367 versicolor 67 Sepal.Width 3.0
## 368 versicolor 68 Sepal.Width 2.7
## 369 versicolor 69 Sepal.Width 2.2
## 370 versicolor 70 Sepal.Width 2.5
## 371 versicolor 71 Sepal.Width 3.2
## 372 versicolor 72 Sepal.Width 2.8
## 373 versicolor 73 Sepal.Width 2.5
## 374 versicolor 74 Sepal.Width 2.8
## 375 versicolor 75 Sepal.Width 2.9
## 376 versicolor 76 Sepal.Width 3.0
## 377 versicolor 77 Sepal.Width 2.8
## 378 versicolor 78 Sepal.Width 3.0
## 379 versicolor 79 Sepal.Width 2.9
## 380 versicolor 80 Sepal.Width 2.6
## 381 versicolor 81 Sepal.Width 2.4
## 382 versicolor 82 Sepal.Width 2.4
## 383 versicolor 83 Sepal.Width 2.7
## 384 versicolor 84 Sepal.Width 2.7
## 385 versicolor 85 Sepal.Width 3.0
## 386 versicolor 86 Sepal.Width 3.4
## 387 versicolor 87 Sepal.Width 3.1
## 388 versicolor 88 Sepal.Width 2.3
## 389 versicolor 89 Sepal.Width 3.0
## 390 versicolor 90 Sepal.Width 2.5
## 391 versicolor 91 Sepal.Width 2.6
## 392 versicolor 92 Sepal.Width 3.0
## 393 versicolor 93 Sepal.Width 2.6
## 394 versicolor 94 Sepal.Width 2.3
## 395 versicolor 95 Sepal.Width 2.7
## 396 versicolor 96 Sepal.Width 3.0
## 397 versicolor 97 Sepal.Width 2.9
## 398 versicolor 98 Sepal.Width 2.9
## 399 versicolor 99 Sepal.Width 2.5
## 400 versicolor 100 Sepal.Width 2.8
## 401 virginica 101 Sepal.Width 3.3
## 402 virginica 102 Sepal.Width 2.7
## 403 virginica 103 Sepal.Width 3.0
## 404 virginica 104 Sepal.Width 2.9
## 405 virginica 105 Sepal.Width 3.0
## 406 virginica 106 Sepal.Width 3.0
## 407 virginica 107 Sepal.Width 2.5
## 408 virginica 108 Sepal.Width 2.9
## 409 virginica 109 Sepal.Width 2.5
## 410 virginica 110 Sepal.Width 3.6
## 411 virginica 111 Sepal.Width 3.2
## 412 virginica 112 Sepal.Width 2.7
## 413 virginica 113 Sepal.Width 3.0
## 414 virginica 114 Sepal.Width 2.5
## 415 virginica 115 Sepal.Width 2.8
## 416 virginica 116 Sepal.Width 3.2
## 417 virginica 117 Sepal.Width 3.0
## 418 virginica 118 Sepal.Width 3.8
## 419 virginica 119 Sepal.Width 2.6
## 420 virginica 120 Sepal.Width 2.2
## 421 virginica 121 Sepal.Width 3.2
## 422 virginica 122 Sepal.Width 2.8
## 423 virginica 123 Sepal.Width 2.8
## 424 virginica 124 Sepal.Width 2.7
## 425 virginica 125 Sepal.Width 3.3
## 426 virginica 126 Sepal.Width 3.2
## 427 virginica 127 Sepal.Width 2.8
## 428 virginica 128 Sepal.Width 3.0
## 429 virginica 129 Sepal.Width 2.8
## 430 virginica 130 Sepal.Width 3.0
## 431 virginica 131 Sepal.Width 2.8
## 432 virginica 132 Sepal.Width 3.8
## 433 virginica 133 Sepal.Width 2.8
## 434 virginica 134 Sepal.Width 2.8
## 435 virginica 135 Sepal.Width 2.6
## 436 virginica 136 Sepal.Width 3.0
## 437 virginica 137 Sepal.Width 3.4
## 438 virginica 138 Sepal.Width 3.1
## 439 virginica 139 Sepal.Width 3.0
## 440 virginica 140 Sepal.Width 3.1
## 441 virginica 141 Sepal.Width 3.1
## 442 virginica 142 Sepal.Width 3.1
## 443 virginica 143 Sepal.Width 2.7
## 444 virginica 144 Sepal.Width 3.2
## 445 virginica 145 Sepal.Width 3.3
## 446 virginica 146 Sepal.Width 3.0
## 447 virginica 147 Sepal.Width 2.5
## 448 virginica 148 Sepal.Width 3.0
## 449 virginica 149 Sepal.Width 3.4
## 450 virginica 150 Sepal.Width 3.0
## 451 setosa 1 Petal.Width 0.2
## 452 setosa 2 Petal.Width 0.2
## 453 setosa 3 Petal.Width 0.2
## 454 setosa 4 Petal.Width 0.2
## 455 setosa 5 Petal.Width 0.2
## 456 setosa 6 Petal.Width 0.4
## 457 setosa 7 Petal.Width 0.3
## 458 setosa 8 Petal.Width 0.2
## 459 setosa 9 Petal.Width 0.2
## 460 setosa 10 Petal.Width 0.1
## 461 setosa 11 Petal.Width 0.2
## 462 setosa 12 Petal.Width 0.2
## 463 setosa 13 Petal.Width 0.1
## 464 setosa 14 Petal.Width 0.1
## 465 setosa 15 Petal.Width 0.2
## 466 setosa 16 Petal.Width 0.4
## 467 setosa 17 Petal.Width 0.4
## 468 setosa 18 Petal.Width 0.3
## 469 setosa 19 Petal.Width 0.3
## 470 setosa 20 Petal.Width 0.3
## 471 setosa 21 Petal.Width 0.2
## 472 setosa 22 Petal.Width 0.4
## 473 setosa 23 Petal.Width 0.2
## 474 setosa 24 Petal.Width 0.5
## 475 setosa 25 Petal.Width 0.2
## 476 setosa 26 Petal.Width 0.2
## 477 setosa 27 Petal.Width 0.4
## 478 setosa 28 Petal.Width 0.2
## 479 setosa 29 Petal.Width 0.2
## 480 setosa 30 Petal.Width 0.2
## 481 setosa 31 Petal.Width 0.2
## 482 setosa 32 Petal.Width 0.4
## 483 setosa 33 Petal.Width 0.1
## 484 setosa 34 Petal.Width 0.2
## 485 setosa 35 Petal.Width 0.2
## 486 setosa 36 Petal.Width 0.2
## 487 setosa 37 Petal.Width 0.2
## 488 setosa 38 Petal.Width 0.1
## 489 setosa 39 Petal.Width 0.2
## 490 setosa 40 Petal.Width 0.2
## 491 setosa 41 Petal.Width 0.3
## 492 setosa 42 Petal.Width 0.3
## 493 setosa 43 Petal.Width 0.2
## 494 setosa 44 Petal.Width 0.6
## 495 setosa 45 Petal.Width 0.4
## 496 setosa 46 Petal.Width 0.3
## 497 setosa 47 Petal.Width 0.2
## 498 setosa 48 Petal.Width 0.2
## 499 setosa 49 Petal.Width 0.2
## 500 setosa 50 Petal.Width 0.2
## 501 versicolor 51 Petal.Width 1.4
## 502 versicolor 52 Petal.Width 1.5
## 503 versicolor 53 Petal.Width 1.5
## 504 versicolor 54 Petal.Width 1.3
## 505 versicolor 55 Petal.Width 1.5
## 506 versicolor 56 Petal.Width 1.3
## 507 versicolor 57 Petal.Width 1.6
## 508 versicolor 58 Petal.Width 1.0
## 509 versicolor 59 Petal.Width 1.3
## 510 versicolor 60 Petal.Width 1.4
## 511 versicolor 61 Petal.Width 1.0
## 512 versicolor 62 Petal.Width 1.5
## 513 versicolor 63 Petal.Width 1.0
## 514 versicolor 64 Petal.Width 1.4
## 515 versicolor 65 Petal.Width 1.3
## 516 versicolor 66 Petal.Width 1.4
## 517 versicolor 67 Petal.Width 1.5
## 518 versicolor 68 Petal.Width 1.0
## 519 versicolor 69 Petal.Width 1.5
## 520 versicolor 70 Petal.Width 1.1
## 521 versicolor 71 Petal.Width 1.8
## 522 versicolor 72 Petal.Width 1.3
## 523 versicolor 73 Petal.Width 1.5
## 524 versicolor 74 Petal.Width 1.2
## 525 versicolor 75 Petal.Width 1.3
## 526 versicolor 76 Petal.Width 1.4
## 527 versicolor 77 Petal.Width 1.4
## 528 versicolor 78 Petal.Width 1.7
## 529 versicolor 79 Petal.Width 1.5
## 530 versicolor 80 Petal.Width 1.0
## 531 versicolor 81 Petal.Width 1.1
## 532 versicolor 82 Petal.Width 1.0
## 533 versicolor 83 Petal.Width 1.2
## 534 versicolor 84 Petal.Width 1.6
## 535 versicolor 85 Petal.Width 1.5
## 536 versicolor 86 Petal.Width 1.6
## 537 versicolor 87 Petal.Width 1.5
## 538 versicolor 88 Petal.Width 1.3
## 539 versicolor 89 Petal.Width 1.3
## 540 versicolor 90 Petal.Width 1.3
## 541 versicolor 91 Petal.Width 1.2
## 542 versicolor 92 Petal.Width 1.4
## 543 versicolor 93 Petal.Width 1.2
## 544 versicolor 94 Petal.Width 1.0
## 545 versicolor 95 Petal.Width 1.3
## 546 versicolor 96 Petal.Width 1.2
## 547 versicolor 97 Petal.Width 1.3
## 548 versicolor 98 Petal.Width 1.3
## 549 versicolor 99 Petal.Width 1.1
## 550 versicolor 100 Petal.Width 1.3
## 551 virginica 101 Petal.Width 2.5
## 552 virginica 102 Petal.Width 1.9
## 553 virginica 103 Petal.Width 2.1
## 554 virginica 104 Petal.Width 1.8
## 555 virginica 105 Petal.Width 2.2
## 556 virginica 106 Petal.Width 2.1
## 557 virginica 107 Petal.Width 1.7
## 558 virginica 108 Petal.Width 1.8
## 559 virginica 109 Petal.Width 1.8
## 560 virginica 110 Petal.Width 2.5
## 561 virginica 111 Petal.Width 2.0
## 562 virginica 112 Petal.Width 1.9
## 563 virginica 113 Petal.Width 2.1
## 564 virginica 114 Petal.Width 2.0
## 565 virginica 115 Petal.Width 2.4
## 566 virginica 116 Petal.Width 2.3
## 567 virginica 117 Petal.Width 1.8
## 568 virginica 118 Petal.Width 2.2
## 569 virginica 119 Petal.Width 2.3
## 570 virginica 120 Petal.Width 1.5
## 571 virginica 121 Petal.Width 2.3
## 572 virginica 122 Petal.Width 2.0
## 573 virginica 123 Petal.Width 2.0
## 574 virginica 124 Petal.Width 1.8
## 575 virginica 125 Petal.Width 2.1
## 576 virginica 126 Petal.Width 1.8
## 577 virginica 127 Petal.Width 1.8
## 578 virginica 128 Petal.Width 1.8
## 579 virginica 129 Petal.Width 2.1
## 580 virginica 130 Petal.Width 1.6
## 581 virginica 131 Petal.Width 1.9
## 582 virginica 132 Petal.Width 2.0
## 583 virginica 133 Petal.Width 2.2
## 584 virginica 134 Petal.Width 1.5
## 585 virginica 135 Petal.Width 1.4
## 586 virginica 136 Petal.Width 2.3
## 587 virginica 137 Petal.Width 2.4
## 588 virginica 138 Petal.Width 1.8
## 589 virginica 139 Petal.Width 1.8
## 590 virginica 140 Petal.Width 2.1
## 591 virginica 141 Petal.Width 2.4
## 592 virginica 142 Petal.Width 2.3
## 593 virginica 143 Petal.Width 1.9
## 594 virginica 144 Petal.Width 2.3
## 595 virginica 145 Petal.Width 2.5
## 596 virginica 146 Petal.Width 2.3
## 597 virginica 147 Petal.Width 1.9
## 598 virginica 148 Petal.Width 2.0
## 599 virginica 149 Petal.Width 2.3
## 600 virginica 150 Petal.Width 1.8
We’re going to be using some data on vocabulary growth that we load from the Wordbank database. Wordbank is a database of children’s language learning.
(Go explore it for a moment).
We’re going to look at data from the English Words and Sentences form. These data describe the responses of parents to questions about whether their child says 680 different words.
dplyr really shines in this context.
# to avoid dependency on the wordbankr package, we cache these data.
# ws <- wordbankr::get_administration_data(language = "English",
# form = "WS")
ws <- read_csv("data/ws.csv")
## Parsed with column specification:
## cols(
## data_id = col_double(),
## age = col_double(),
## comprehension = col_double(),
## production = col_double(),
## language = col_character(),
## form = col_character(),
## birth_order = col_character(),
## ethnicity = col_character(),
## sex = col_character(),
## zygosity = col_logical(),
## norming = col_logical(),
## longitudinal = col_logical(),
## source_name = col_character(),
## mom_ed = col_character()
## )
Take a look at the data that comes out.
DT::datatable(ws)
ggplot(ws, aes(x = age, y = production)) +
geom_point()
Aside: How can we fix this plot? Suggestions from group?
ggplot(ws, aes(x = age, y = production)) +
geom_jitter(size = .5, width = .25, height = 0, alpha = .3)
Ok, let’s plot the relationship between sex and productive vocabulary, using dplyr.
ggplot(ws, aes(x = age, y = production, col=sex)) +
geom_jitter(size = .5, width = .25, height = 0, alpha = .3)
This is a bit useless, because the variability is so high. So let’s summarise!
Exercise. Get means and SDs of productive vocabulary (production) by age and sex. Filter the kids with missing data for sex (coded by NA).
HINT: is.na(x) is useful for filtering.
# View(ws)
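# Note: `production` is overwritten by its mean on the first line of summarise(),
# so sd(production) then sees a single number per group and returns NA (see the
# output below). Computing the SD before the mean avoids this.
ws_sex <- ws %>%
  filter(!is.na(sex)) %>%
  group_by(age, sex) %>%
  summarise(production = mean(production),
            production_sd = sd(production))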
## `summarise()` regrouping output by 'age' (override with `.groups` argument)
ws_sex
## # A tibble: 30 x 4
## # Groups: age [15]
## age sex production production_sd
## <dbl> <chr> <dbl> <dbl>
## 1 16 Female 66.9 NA
## 2 16 Male 52.4 NA
## 3 17 Female 85.7 NA
## 4 17 Male 65.7 NA
## 5 18 Female 137. NA
## 6 18 Male 105. NA
## 7 19 Female 172. NA
## 8 19 Male 137. NA
## 9 20 Female 198. NA
## 10 20 Male 164. NA
## # … with 20 more rows
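The NA values in production_sd above come from the order of the arguments to summarise(): each argument can see the columns created by the arguments before it. Here is a minimal sketch of that behaviour on a made-up data frame (toy, group and score are hypothetical names, not part of the Wordbank data).

```r
library(dplyr)

# Made-up data: two groups of three scores each.
toy <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                  score = c(1, 2, 3, 10, 20, 30))

# Mean first: `score` now refers to the one-number summary,
# so sd() is computed on a single value and returns NA.
mean_first <- toy %>%
  group_by(group) %>%
  summarise(score = mean(score),
            score_sd = sd(score))

# SD first: sd() still sees the raw scores within each group.
sd_first <- toy %>%
  group_by(group) %>%
  summarise(score_sd = sd(score),
            score = mean(score))
```

Giving the mean a new name (for example score_mean) is another way to keep the raw column available to later arguments.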
Now plot:
ggplot(ws_sex,
aes(x = age, y = production, col = sex)) +
geom_line() +
geom_jitter(data = filter(ws, !is.na(sex)),
size = .5, width = .25, height = 0, alpha = .3) +
geom_linerange(aes(ymin = production - production_sd,
ymax = production + production_sd),
position = position_dodge(width = .2)) # keep SDs from overlapping
## Warning: Removed 30 rows containing missing values (geom_linerange).
Bonus: Compute effect size.
# instructor demo
Here are three little demos of exciting stuff that you can do (and that are facilitated by this workflow).
A few other things will help you with “medium size data”:
read_csv - Much faster than read.csv and has better defaults.
dbplyr - For connecting directly to databases. This package got forked off of dplyr recently but is very useful.
feather - The feather package provides a fast-loading binary format that is interoperable with python. All you need to know is write_feather(d, "filename") and read_feather("filename").
Here’s a timing demo for read.csv, read_csv, and read_feather.
system.time(read.csv("data/ws.csv"))
## user system elapsed
## 0.019 0.001 0.021
system.time(read_csv("data/ws.csv"))
## Parsed with column specification:
## cols(
## data_id = col_double(),
## age = col_double(),
## comprehension = col_double(),
## production = col_double(),
## language = col_character(),
## form = col_character(),
## birth_order = col_character(),
## ethnicity = col_character(),
## sex = col_character(),
## zygosity = col_logical(),
## norming = col_logical(),
## longitudinal = col_logical(),
## source_name = col_character(),
## mom_ed = col_character()
## )
## user system elapsed
## 0.011 0.001 0.015
system.time(feather::read_feather("data/ws.feather"))
## user system elapsed
## 0.010 0.002 0.020
I see about a 2x speedup for read_csv (bigger for bigger files) and a 20x speedup for read_feather, although with a file as small as this one the elapsed times above are nearly identical.
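To see the difference on something larger than ws.csv, you can time the readers on a synthetic file. This is only a sketch: the data frame big, its columns and the row count are made up for illustration, and the timings will depend on your machine and package versions.

```r
library(readr)

# Made-up data, written to a temporary CSV file.
set.seed(1)
n <- 100000
big <- data.frame(id = seq_len(n),
                  age = sample(16:30, n, replace = TRUE),
                  production = sample(0:680, n, replace = TRUE))
path <- tempfile(fileext = ".csv")
write.csv(big, path, row.names = FALSE)

# Time each reader once; suppressMessages() hides read_csv's column report.
t_base  <- system.time(d_base  <- read.csv(path))
t_readr <- system.time(d_readr <- suppressMessages(read_csv(path)))

# Elapsed seconds for each reader.
c(read.csv = t_base[["elapsed"]], read_csv = t_readr[["elapsed"]])
```

A single run is noisy, so repeating the timing a few times gives a fairer comparison.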
The shiny package is a great way to do interactives in R. We’ll walk through constructing a simple shiny app for the wordbank data here.
Technically, this is embedded shiny as opposed to a freestanding shiny app (like Wordbank).
The two parts of a shiny app are ui and server. Both of these are funny in that they are lists of other things. The ui is a list of elements of an HTML page, and the server is a list of “reactive” elements. In brief, the UI says what should be shown, and the server specifies the mechanics of how to create those elements.
This little embedded shiny app shows a page with two elements: 1) a selector that lets you choose a demographic field, and 2) a plot of vocabulary split by that field.
The server then has the job of splitting the data by that field (for ws_split) and rendering the plot (agePlot).
The one fancy thing that’s going on here is that the app makes use of the calls group_by_ (in the dplyr chain) and aes_ (for the ggplot call). These _ functions are a little complex - they are an example of “standard evaluation” that lets you feed the value of a variable (here, the column name chosen in the selector) into ggplot2 and dplyr rather than typing the column name directly. For more information, there is a nice vignette on standard and non-standard evaluation: try vignette("nse"). Note that group_by_ is deprecated as of dplyr 0.7.0 (see the warning in the output below), which points to vignette("programming") for the replacement.
library(shiny)
shinyApp(
  ui = fluidPage(
    selectInput("demographic", "Demographic Split Variable",
                c("Sex" = "sex", "Maternal Education" = "mom_ed",
                  "Birth Order" = "birth_order", "Ethnicity" = "ethnicity")),
    plotOutput("agePlot")
  ),
  server = function(input, output) {
    ws_split <- reactive({
      ws %>%
        group_by_("age", input$demographic) %>%
        summarise(production_mean = mean(production))
    })
    output$agePlot <- renderPlot({
      ggplot(ws_split(),
             aes_(quote(age), quote(production_mean), col = as.name(input$demographic))) +
        geom_line()
    })
  },
  options = list(height = 500)
)
##
## Listening on http://127.0.0.1:6890
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## `summarise()` regrouping output by 'age' (override with `.groups` argument)
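The deprecation warning above suggests moving away from group_by_. Here is a minimal sketch of one replacement, assuming dplyr 1.0 or later: across(all_of(...)) lets group_by take column names stored as strings. The data frame toy_ws and the helper summarise_by are hypothetical names with made-up values, standing in for ws and the ws_split reactive.

```r
library(dplyr)

# Toy stand-in for the Wordbank data (made-up values, for illustration only).
toy_ws <- data.frame(
  age        = c(16, 16, 16, 16, 17, 17, 17, 17),
  sex        = c("Female", "Female", "Male", "Male",
                 "Female", "Female", "Male", "Male"),
  production = c(60, 70, 50, 54, 80, 90, 60, 70)
)

# split_var is a string, as it would be when it comes from input$demographic.
summarise_by <- function(d, split_var) {
  d %>%
    group_by(across(all_of(c("age", split_var)))) %>%
    summarise(production_mean = mean(production), .groups = "drop")
}

toy_split <- summarise_by(toy_ws, "sex")
toy_split
```

On the plotting side, recent ggplot2 versions accept the .data pronoun inside aes(), so aes(x = age, y = production_mean, col = .data[[input$demographic]]) can play the role that aes_ plays in the app.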
As I’ve tried to highlight, dplyr is actually all about applying functions. summarise is a verb that helps you apply functions to chunks of data and then bind them together. But that creates a requirement that all the functions return a single value (e.g., mean). There are lots of things you can do that summarise data but don’t return a single value. For example, maybe you want to run a linear regression and return the slope and the intercept.
For that, I want to highlight two things.
One is do, which allows function application to grouped data. The only tricky thing about using do is that you have to refer to the data frame that you’re working on as . (a single dot).
The second is the amazing broom package, which provides methods to tidy the output of lots of different statistical models. So for example, you can run a linear regression on chunks of a dataset and get back out the coefficients in a data frame.
Here’s a toy example, again with Wordbank data.
ws %>%
filter(!is.na(sex)) %>%
group_by(sex) %>%
do(broom::tidy(lm(production ~ age, data = .)))
## # A tibble: 4 x 6
## # Groups: sex [2]
## sex term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Female (Intercept) -515. 14.4 -35.6 9.92e-215
## 2 Female age 36.2 0.629 57.5 0.
## 3 Male (Intercept) -494. 14.2 -34.9 3.04e-210
## 4 Male age 33.5 0.615 54.5 0.
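If you don’t have data/ws.csv to hand, the same pattern can be tried on iris. This is a sketch rather than part of the Wordbank analysis: it fits a separate regression per species and builds the one-row data frame by hand instead of using broom. The names iris_fits, intercept and slope are my own.

```r
library(dplyr)

# Fit Petal.Length ~ Sepal.Length separately for each species.
# Inside do(), `.` is the subset of rows for the current group.
iris_fits <- iris %>%
  group_by(Species) %>%
  do({
    fit <- lm(Petal.Length ~ Sepal.Length, data = .)
    data.frame(intercept = coef(fit)[[1]],
               slope = coef(fit)[[2]])
  }) %>%
  ungroup()

iris_fits
```

The result has one row per species, which is the same shape that summarise would give, but the function applied to each chunk was free to return several values.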
In recent years, this workflow in R has gotten really good. purrr is an amazing package that introduces consistent ways to map functions. It’s beyond the scope of the course.
Returning the third cell.
iris$Petal.Length[3]
## [1] 1.3
iris[3,3]
## [1] 1.3
iris[3,"Petal.Length"]
## [1] 1.3
iris[[3]][3]
## [1] 1.3
iris[["Petal.Length"]][3]
## [1] 1.3
# probably more?
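One way to convince yourself that these really are equivalent is to collect them in a list and compare. A small sketch in base R (the list ways and its element names are just labels I made up, and the last two entries are extra options beyond those listed above):

```r
# Several ways of reaching row 3 of Petal.Length, gathered in one list.
ways <- list(
  dollar        = iris$Petal.Length[3],
  matrix_index  = iris[3, 3],
  matrix_named  = iris[3, "Petal.Length"],
  list_position = iris[[3]][3],
  list_named    = iris[["Petal.Length"]][3],
  with_helper   = with(iris, Petal.Length[3]),
  row_then_col  = iris[3, ]$Petal.Length
)

# TRUE if every approach returns exactly the same value as the first one.
all_same <- all(vapply(ways, function(v) identical(v, ways[[1]]), logical(1)))
all_same
```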
Piped commands.
iris$Species %>%
unique %>%
length
## [1] 3
Mean of participant means.
sgf %>%
group_by(age_group, subid) %>%
summarise(correct = mean(correct)) %>%
summarise(mean_correct = mean(correct),
sd_correct = sd(correct))
## `summarise()` regrouping output by 'age_group' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## age_group mean_correct sd_correct
## <fct> <dbl> <dbl>
## 1 [2,3] 0.408 0.269
## 2 (3,4] 0.485 0.348
## 3 (4,5] 0.448 0.375
Sklar tidying.
sklar %>%
gather(participant, RT, 8:28)
## # A tibble: 3,234 x 9
## prime prime.result target congruent operand distance counterbalance
## <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 =1+2… 8 9 no A -1 1
## 2 =1+3… 9 11 no A -2 1
## 3 =1+4… 8 12 no A -4 1
## 4 =1+6… 10 12 no A -2 1
## 5 =1+9… 12 11 no A 1 1
## 6 =1+9… 13 12 no A 1 1
## 7 =2+1… 12 11 no A 1 1
## 8 =2+3… 11 10 no A 1 1
## 9 =2+3… 12 11 no A 1 1
## 10 =2+5… 13 9 no A 4 1
## # … with 3,224 more rows, and 2 more variables: participant <chr>, RT <dbl>
# might be a better way to select these columns than by number, e.g. regex
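As the comment above hints, the columns can be picked by a pattern in their names instead of by position. The sketch below uses a made-up data frame, because the real participant column names in sklar are not shown here: toy_wide, the p01 to p03 columns and all the values are hypothetical.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per item, one RT column per participant.
toy_wide <- data.frame(
  prime     = c("item1", "item2"),
  congruent = c("no", "yes"),
  p01       = c(597, 699),
  p02       = c(640, 611),
  p03       = c(NA, 575)
)

# matches() selects every column whose name is "p" followed by digits,
# so the call keeps working if columns are added or reordered.
toy_long <- toy_wide %>%
  gather(participant, RT, matches("^p[0-9]+$"))

toy_long
```

Selecting by name is safer than 8:28 because a new column inserted earlier in the data frame would silently shift the numeric positions.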
Sex means.
ws_sex <- ws %>%
filter(!is.na(sex)) %>%
group_by(age, sex) %>%
summarise(production_sd = sd(production, na.rm=TRUE),
production_mean = mean(production))
## `summarise()` regrouping output by 'age' (override with `.groups` argument)
Effect size. (Instructor demo)
ws_es <- ws_sex %>%
group_by(age) %>%
summarise(es = (production_mean[sex=="Female"] - production_mean[sex=="Male"]) /
mean(production_sd))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(ws_es, aes(x = age, y = es)) +
geom_point() +
geom_smooth(span = 1) +
ylab("Female advantage (standard deviations)") +
xlab("Age (months)")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'