Make sure that you load the following packages: dplyr, ggplot2, tidyr, and readr.
Make sure that you install and load the packages wordbankr and dbplyr (note that this reads dbplyr, not dplyr).
Make sure that you download the template for the lab, as well as the lab data file (lab10_data.csv).
(After today) Make sure that you add a consent form to your app, by downloading the template from the Learn site.
Because it is increasingly easy to share data, the cognitive and developmental sciences are increasingly relying on large, shared databases. These include resources like the UK Biobank, which holds large stores of medical and neurophysiological data, as well as records of our interactions with the world, such as the British National Corpus, a large collection of spoken and written linguistic material.
A database can be considered a collection of related tables. When using a database for a research task, your goal is to identify the tables that contain the information you need, and then to filter and join them into a single dataset that you can analyze.
In the first part of this lab, you will practice filtering and joining simple dataframes that will stand in for tables from databases. In the second part of the lab, you will be importing data from a large online database of cross-cultural development, and using your new-found skills to plot and analyze the data in different ways.
In the chunk load_lab_data, use read_csv() to load in the lab data file, and then use head() to look at its form.
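If you are not sure where to start, here is a minimal sketch (this assumes lab10_data.csv is saved in the same folder as your lab template; adjust the path if not):

library(readr) # should already be loaded from the setup above
dataset = read_csv("lab10_data.csv") # read the lab data into a dataframe called dataset
head(dataset) # inspect the first six rows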
This lab datafile is from a jsPsych experiment, a bit like yours. In this study, participants had to choose between pictures of a woman and a man, in conditions where the people in the pictures were using funky poses, and conditions where they were using fresh poses.
## # A tibble: 6 x 9
## rt responses trial_type trial_index subject button_pressed choices
## <int> <chr> <chr> <int> <int> <int> <chr>
## 1 3226 "{\"Q0\"… survey-te… 0 0 NA <NA>
## 2 4179 <NA> button-re… 1 0 0 one
## 3 2263 <NA> button-re… 2 0 1 woman2
## 4 2905 <NA> button-re… 3 0 0 woman1
## 5 3339 <NA> button-re… 4 0 0 man1
## 6 1468 <NA> button-re… 5 0 0 man2
## # ... with 2 more variables: stimulus <chr>, condition <chr>
You’ll see a number of different columns, which carry information about a variety of things, e.g., the RT on each trial, the subject number, etc. For instance, the column condition indicates whether a trial was in the funky or fresh condition, while the column choices indicates which picture participants chose.
This table contains a variety of different rows, just like the datafiles your experiment produces, and not all of those rows have the same structure. For example, the first row is the result of the survey questions on the first screen. Most of the subsequent rows are responses on the different trials of the main task, but some of the rows are for filler trials in the task.
Our first job is to process the responses from that first screen. You can see that they have been read into R as a string with odd formatting: "{\"Q0\":\"Male\",\"Q1\":\"5\"}". This means that the answer to the first question was male, and the answer to the second question was age 5. What we would like to do is strip that information out of the original dataframe. Then, we will create a new dataframe that specifies the participant’s subject number, their gender, and their age. Finally, we will merge that dataframe back into the original dataset, so that it can be easily accessed.
To do this, first create a new tibble (the tidyverse’s version of a dataframe) that combines the participant information with the subject number.
subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1])

Then, use the function gsub() to remove all the noise characters from the participant information cell. gsub() takes a first argument that specifies a pattern to match, a second argument that specifies what to replace that pattern with, and a final argument that says where to search for that pattern.
You want to end up with a string that looks like this: Q0:Male,Q1:5. To do this, combine gsub() with the dplyr function mutate() in order to replace all of the noise characters (e.g., { and \) with blank text.
# Example of using gsub (run this in your R console)
gsub('[abcde]',"1","123_abc_xyz") # returns "123_111_xyz"
# Here, the first argument of gsub() is a set of characters inside square brackets. This is the pattern that will be matched (we will come back to what it means soon). The second argument says what the replacement will be. The third argument is the string that will be searched.
# gsub() works by searching for a "regular expression". Regular expressions are too complex to explain fully here but, in short, they provide a very flexible way to search. For example, the pattern above ('[abcde]') matches any single one of the letters a through e, wherever it appears. By contrast, the pattern 'abcde' (without the square brackets) would only match the exact sequence of letters abcde.
# Try to edit the regular expression above in order to return an appropriate pattern.
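# If you want to check your answer: the "noise" characters here are the curly
# braces and the quote marks (the backslashes you see are just how R displays a
# quote inside a string), so one pattern that works is:
gsub('[{}"]', "", subj_info$part_info) # returns "Q0:Male,Q1:5"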
# Integrating gsub into your dplyr workflow
# This is relatively straightforward
subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1]) %>%
  mutate(part_info = gsub(...))

You can then augment your code above with this dplyr code, which will split the string into two columns and return a nice tidy dataset:
%>%
  mutate(KeyValPairs = strsplit(as.character(part_info), ",")) %>%
  unnest(KeyValPairs) %>%
  separate(KeyValPairs, into = c("Question", "Answer"), ":") %>%
  select(-part_info) %>%
  spread(key = Question, value = Answer)

## # A tibble: 1 x 3
## subject Q0 Q1
## <int> <chr> <chr>
## 1 0 Male 5
Now, let’s join the original dataset with your new participant information dataset. To do that, we will use the function left_join(), which takes two main arguments: the first is the primary dataset, and the second is the secondary dataset. left_join() returns all the rows of the primary dataset, now merged with the columns of the secondary dataset. Importantly, it matches on column names, so if both datasets have a column called subject it will find those, and ensure that rows where subject = 0 in one dataset are matched to rows where subject = 0 in the other dataset (and similarly for subject = 1, 2, 3, ...).
dataset = left_join(dataset, subj_info)

Now, we are going to do a bit more pattern matching, to create our outcome variable, i.e., the data we will analyze. Remember that the column condition says whether the participants are in the funky or fresh condition, while the column choices indicates which picture they chose. If you look at those two columns, you will see that each trial in the condition column is either funky or fresh. However, in the choices column, the text is more variable. Sometimes participants chose woman2, sometimes they chose man3, etc. We can’t analyze such variable data directly. Instead, we want to create a new column, called pick_woman, which has value 1 if participants picked a woman picture, and value 0 if participants picked a man picture.
Using your dplyr knowledge, augment the code above with a call to mutate(). In the mutate() call, create a new column called pick_woman, where you use the function ifelse() (i.e., mutate(pick_woman = ifelse(...))).
In the ifelse() call, we will use the function grepl(). grepl() is a bit like gsub() in that it takes a pattern and tests whether that pattern is present. However, rather than replacing the pattern, it returns TRUE if the pattern is present, and FALSE otherwise. Thus, you can combine grepl() with ifelse() to return 1 if the column choices contains the word woman, and 0 otherwise:
dataset = left_join(dataset, subj_info) %>%
  mutate(pick_woman = ifelse(grepl('woman', choices), 1, 0))

Once you have got that working, use your dplyr magic to augment the code further. You first want to filter out all rows that aren’t critical trials (e.g., you want to remove the first row, where no picture was presented). Then, you want to group the trials by condition, and calculate the mean proportion of woman choices per condition. Specifically, use:
filter() to keep only the rows where is.na(stimulus) == FALSE (i.e., the rows where a picture was actually shown).
group_by() to group by the column condition.
summarise() and mean() to create a new column called pick_woman.mean.
Then, plot the resulting data using ggplot() [using geom_bar(stat = "identity")].
This is tricky, so feel free to ask Hugh for help (and you can use the help files, too).
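If you want something to check your final answer against, here is one possible pipeline (a sketch, assuming the dataset and pick_woman column you built above):

dataset %>%
  filter(!is.na(stimulus)) %>% # keep only trials where a picture was shown
  group_by(condition) %>% # one group per condition (funky vs. fresh)
  summarise(pick_woman.mean = mean(pick_woman)) %>% # proportion of woman choices
  ggplot(aes(x = condition, y = pick_woman.mean)) +
  geom_bar(stat = "identity")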
Nice job on that part of the lab. Hopefully that will help you process and graph your own experiment’s data.
In the next part of the lab we are going to scale those skills up, to learn how to use much larger databases. We are going to focus on one particular database today, called Wordbank, which is a cross-linguistic database of children’s early words. Wordbank relies on the MacArthur-Bates Communicative Development Inventory (the CDI), which is a set of checklist forms that parents fill in to describe their child’s knowledge and use of words and simple sentences. You can find Wordbank online, and you can see an example CDI form below. For a large number of different words, parents tick whether their child either Understands the word, or Understands and Says the word (i.e., produces the word).
First, play around with Wordbank online for just a couple of minutes, then return to the lab. For instance, you can look at the Analyses tab, and examine the age of acquisition for different words, or vocabulary norms for different ages.
Wordbank contains millions and millions of datapoints, and so it would not be practical for us to load the entire database onto our computers. Instead, there is an R package called wordbankr that allows us to analyze the data remotely, from our own terminals.
We use functions from the wordbankr package to query the Wordbank server for particular types of information, and the server then sends that information back in the form of a data table.
We can request three types of data table from the server: a table of items (the words and other entries on a CDI form), a table of administrations (one row per child per completed form), and a table of instrument data (each child’s response to each item).
A table of items can be gathered by using the function get_item_data() (type ?get_item_data for more information).
Here, you should request item data for the language “English (American)”, and for the CDI form “Words and Gestures”, which is the version of the CDI administered for young infants. You can use the code below to do this; make sure you understand what each part of the code does.
English_US_Items = get_item_data(
  language = "English (American)",
  form = "WG")

Then, explore the dataset using head() and summary(), and also use the function unique() to look at the column type (i.e., unique(English_US_Items$type)).
## # A tibble: 6 x 11
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_1 respname English… WG firs… <NA> <NA>
## 2 item_2 respno English… WG firs… <NA> <NA>
## 3 item_3 reactmd English… WG firs… <NA> <NA>
## 4 item_4 arehngry English… WG phra… <NA> <NA>
## 5 item_5 aretired English… WG phra… <NA> <NA>
## 6 item_6 becarefl English… WG phra… <NA> <NA>
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>, num_item_id <dbl>
## [1] "Unique values of column 'type'"
## [1] "first_signs" "phrases" "starting_to_talk"
## [4] "word" "gestures_first" "gestures_games"
## [7] "gestures_objects" "gestures_parent" "gestures_adult"
You’ll see there are quite a few types of item that we might not want to examine if we only want to focus on children’s knowledge of single words (which is the focus of this lab). The list includes things like gestures, whole phrases, signs, etc. So, let’s use filter() to get rid of those.
English_US_Items = English_US_Items %>%
  filter(...) # put some code in here to keep only rows where column type indicates a word

Once you have done this, use unique() to check the column type again, as well as other columns such as category, lexical_category, and so on.
## [1] "Unique values of column 'type'"
## [1] "word"
## [1] "Unique values of column 'category'"
## [1] "sounds" "animals" "vehicles"
## [4] "toys" "food_drink" "clothing"
## [7] "body_parts" "furniture_rooms" "household"
## [10] "outside" "people" "games_routines"
## [13] "action_words" "time_words" "descriptive_words"
## [16] "pronouns" "question_words" "locations"
## [19] "quantifiers"
## [1] "Unique values of column 'lexical_category'"
## [1] "other" "nouns" "predicates" "function_words"
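If your filter() call is giving you trouble: the output above was produced by keeping only the rows whose type is "word", i.e.:

English_US_Items = English_US_Items %>%
  filter(type == "word") # keep only single-word items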
Now, let’s get data on which American English kids were administered the Words and Gestures form.
English_US_Admin = get_administration_data(
  language = "English (American)",
  form = "WG")

Again, use head() to explore the dataset, and try to find out how many kids took part by using the function length().
## # A tibble: 6 x 15
## data_id age comprehension production language form birth_order
## <dbl> <int> <int> <int> <chr> <chr> <fct>
## 1 145913 18 150 109 English… WG Second
## 2 145914 18 185 18 English… WG First
## 3 145915 18 224 104 English… WG Second
## 4 145916 18 209 125 English… WG First
## 5 145917 18 283 129 English… WG First
## 6 145918 17 324 65 English… WG Second
## # ... with 8 more variables: ethnicity <fct>, sex <fct>, zygosity <chr>,
## # norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## # license <chr>
## [1] "How many administrations?"
## [1] 2435
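One way to get that count: each row of English_US_Admin is one administration, so counting the entries in any column does the job.

length(English_US_Admin$data_id) # number of administrations (2435 here)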
Using your R sleuthing skills, try to find out the item_id for the word dog from the table of items. As a hint, the column containing the list of words is definition.
Yell if you get lost, because it will be important for the next part.
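If you want a starting point, here is one way to sleuth it out (a sketch; this assumes the word is stored in the definition column exactly as "dog"):

English_US_Items %>%
  filter(definition == "dog") %>% # find the row(s) whose definition is "dog"
  select(item_id, definition) # show just the columns we care about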
Now, we will use info from the previous two data tables to import CDI data for the word dog, from American English kids who took part in the Words and Gestures CDI.
English_US_Dog = get_instrument_data(language = "English (American)",
form = "WG",
items = "ADD_the_item_number_for_dog (e.g., item_32)",
administrations = English_US_Admin)

## # A tibble: 6 x 17
## data_id value num_item_id age comprehension production language form
## <dbl> <chr> <dbl> <int> <int> <int> <chr> <chr>
## 1 145913 prod… 57 18 150 109 English… WG
## 2 145914 unde… 57 18 185 18 English… WG
## 3 145915 prod… 57 18 224 104 English… WG
## 4 145916 prod… 57 18 209 125 English… WG
## 5 145917 prod… 57 18 283 129 English… WG
## 6 145918 prod… 57 17 324 65 English… WG
## # ... with 9 more variables: birth_order <fct>, ethnicity <fct>,
## # sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## # longitudinal <lgl>, source_name <chr>, license <chr>
What are the different columns in this table?
data_id defines the subject id.
num_item_id defines the item id (in this case, item_57, i.e., dog).
comprehension and production describe the overall comprehension and production scores for each subject, over all items in the database.

What we would like to do is plot the probability of a child producing or understanding a word, over age. However, before we do this, we have to reckon with the fact that the most critical column for plotting, value, conflates producing and understanding. Ideally, we want our dataset to be very explicit about what kids do, and do not, know. So while our current data is similar to the figure on the left, we would like it to be more like the data on the right.
Achieving this is done in two steps. First, we use the function mutate() to create two columns, as below. The column produces is 1 if and only if the child is marked as producing the word. The column understands is 1 if the child is marked either as producing or as understanding the word. Once this is done, we then gather the columns together into one long dataframe.
To do this, I want you to take the code in the first chunk below, and then edit it so that it can be merged with the code in the second chunk below.
English_US_Dog$produces = ifelse(English_US_Dog$value == "produces", 1, 0)
English_US_Dog$understands = ifelse(English_US_Dog$value %in% c("produces","understands"), 1, 0)
English_US_Dog = gather(English_US_Dog, "mode", "score", produces, understands)

English_US_Dog = English_US_Dog %>%
  mutate(produces = ...,
         understands = ...) %>%
  gather(...)

Then use summary() and head() to examine the resulting dataframe. You might also want to use unique(English_US_Dog$sex), which shows that some participants do not have an indicated gender. Since this may indicate a mistake in the dataframe, you should filter out those particular participants.
English_US_Dog = English_US_Dog %>%
  filter(!is.na(sex))

Then, use ggplot(), geom_jitter(), and geom_smooth() to create a graph like the one below.
## data_id value num_item_id age
## Min. :145913 Length:4870 Min. :57 Min. : 8.00
## 1st Qu.:146537 Class :character 1st Qu.:57 1st Qu.:13.00
## Median :147148 Mode :character Median :57 Median :13.00
## Mean :147146 Mean :57 Mean :13.81
## 3rd Qu.:147757 3rd Qu.:57 3rd Qu.:16.00
## Max. :148366 Max. :57 Max. :18.00
##
## comprehension production language form
## Min. : 0.0 Min. : 0.00 Length:4870 Length:4870
## 1st Qu.: 62.0 1st Qu.: 4.00 Class :character Class :character
## Median :125.0 Median : 13.00 Mode :character Mode :character
## Mean :140.6 Mean : 28.92
## 3rd Qu.:207.8 3rd Qu.: 34.00
## Max. :396.0 Max. :386.00
##
## birth_order ethnicity sex zygosity
## First :1050 Asian : 92 Female:2314 Length:4870
## Second : 674 Black : 248 Male :2470 Class :character
## Third : 268 Other : 124 Other : 0 Mode :character
## Fourth : 74 White :1548 NA's : 86
## Sixth : 14 Hispanic: 122
## (Other): 14 NA's :2736
## NA's :2776
## norming mom_ed longitudinal source_name
## Mode :logical College : 588 Mode :logical Length:4870
## FALSE:2728 Some College : 534 FALSE:2298 Class :character
## TRUE :2142 Secondary : 508 TRUE :2572 Mode :character
## Graduate : 282
## Some Secondary: 138
## (Other) : 86
## NA's :2734
## license mode score
## Length:4870 Length:4870 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :1.0000
## Mean :0.6218
## 3rd Qu.:1.0000
## Max. :1.0000
## NA's :16
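If you are unsure how to set the plot up, here is one possible sketch (the jitter and transparency settings are just suggestions, and the target graph may look a little different):

ggplot(English_US_Dog, aes(x = age, y = score)) +
  geom_jitter(height = 0.05, alpha = 0.1) + # spread the 0/1 points out so they are visible
  geom_smooth() + # add a smoothed trend over age
  facet_grid(. ~ mode) # one panel for understands, one for produces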
This graph is a decent start, but it is also pretty ugly. Why don’t we summarize all those datapoints instead, i.e., plot the mean score at each age?
English_US_Dog_summary = English_US_Dog %>%
  select("which columns should you select?") %>%
  group_by("which columns should you group by?") %>%
  summarise(score.mean = mean(score))

The English_US_Dog data table contains columns with interesting demographic information, like the sex of the baby and the education of the mother. Now, edit your code from above to select() and group_by() those columns, and try to make a graph like the one below.
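If you are unsure which columns to pick, here is a plausible fill-in (it matches the columns used in the join code later in the lab):

English_US_Dog_summary = English_US_Dog %>%
  select(age, mode, score, mom_ed, sex) %>% # keep only the columns we need
  group_by(age, mode, mom_ed, sex) %>% # one mean per cell of the design
  summarise(score.mean = mean(score, na.rm = TRUE))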
Well done, you made a graph for the word dog! Now let’s add a whole bunch more words! Using get_instrument_data(), create a new data table. But this time, rather than selecting one particular item from English_US_Items, we will select all the items, as below!
Edit the code below, based on the code you wrote before, so that you pull in the relevant data table, and then process it to create columns called mode and score, and also filter out unwanted data.
English_US_AllWords = get_instrument_data(language = "English (American)",
                                          form = "WG",
                                          items = English_US_Items$item_id,
                                          administrations = English_US_Admin) %>%
  mutate(...) %>%
  gather(...) %>%
  filter(...)

Now, as you did before, edit the dplyr calls below to summarise the data and create the graph below. Remember that you can add a second aes() to geom_point(), where you specify that the color of the points depends on the mother’s education.
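Before you do: if the fill-ins in the chunk above gave you trouble, they mirror the dog analysis exactly. A sketch (assuming the same column names as before):

English_US_AllWords = get_instrument_data(language = "English (American)",
                                          form = "WG",
                                          items = English_US_Items$item_id,
                                          administrations = English_US_Admin) %>%
  mutate(produces = ifelse(value == "produces", 1, 0),
         understands = ifelse(value %in% c("produces", "understands"), 1, 0)) %>%
  gather("mode", "score", produces, understands) %>%
  filter(!is.na(sex)) # drop participants with no recorded sex, as before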
English_US_AllWords_summary = English_US_AllWords %>%
  select() %>%
  group_by() %>%
  summarise(score.mean = mean(score))

It would also be cool to be able to split these plots by the type of word being learned. For example, do kids learn animal names before they learn body part names? Do they learn nouns before verbs?
Use head() and summary() to see if we have information on things like lexical category in our data tables English_US_AllWords and English_US_Items. You should see that that information is present in one but not the other.
## # A tibble: 6 x 19
## data_id value num_item_id age comprehension production language form
## <dbl> <chr> <dbl> <int> <int> <int> <chr> <chr>
## 1 145913 "" 34 18 150 109 English… WG
## 2 145913 "" 35 18 150 109 English… WG
## 3 145913 "" 36 18 150 109 English… WG
## 4 145913 prod… 37 18 150 109 English… WG
## 5 145913 "" 38 18 150 109 English… WG
## 6 145913 "" 39 18 150 109 English… WG
## # ... with 11 more variables: birth_order <fct>, ethnicity <fct>,
## # sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## # longitudinal <lgl>, source_name <chr>, license <chr>, mode <chr>,
## # score <dbl>
## # A tibble: 6 x 11
## item_id definition language form type category lexical_category
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 item_34 baa baa English… WG word sounds other
## 2 item_35 choo choo English… WG word sounds other
## 3 item_36 cockadood… English… WG word sounds other
## 4 item_37 grrr English… WG word sounds other
## 5 item_38 meow English… WG word sounds other
## 6 item_39 moo English… WG word sounds other
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## # complexity_category <chr>, num_item_id <dbl>
By editing the code below, join these two tables up and then produce the graph below. (Hint: the ??? should be a column that English_US_Items contributes to the joined table, such as lexical_category.)
English_US_AllWords = left_join(English_US_AllWords, English_US_Items)
English_US_AllWords %>%
select(age, ???,mode, score, mom_ed, sex) %>%
group_by(age, ???,mode, mom_ed, sex) %>%
summarise(score.mean = mean(score)) %>%
ggplot(aes(x = age, y = score.mean, lty = sex))+
geom_point(aes(color=mom_ed))+
geom_smooth()+
facet_grid(???~mode)

This part is optional.
I’m sure you’ve been taught in your statistics classes that it is better to write functions than to copy out the same code again and again. So let’s do that! Create a function that, for any given language, produces a nicely summarised table of what kids do and do not know.
You can try and do this by yourself, or, if you need a hand, try to edit the function below, which downloads, processes, and summarises the data for any given language.

get_words = function(language){
items = get_item_data(language = ???,
form = "WG") %>%
filter("EDIT ME") # filter the items down to single words, as you did before (hint: type)
admin = get_administration_data(language = ???,
form = "WG")
words = get_instrument_data(language = ???,
form = "WG",
items = ???$item_id, # get all the items from your data table of items.
administrations = ???) %>% # your admin data table
mutate(produces = ifelse(value == "produces",1,0),
understands = ifelse(value %in% c("produces","understands"),1,0)) %>%
gather("mode","score", produces,understands) %>%
filter("EDIT ME") %>% # filter out participants whose sex is NA, as before
left_join(items) # What does this last bit do?
summary = words %>%
select("what factors should you select?") %>%
group_by("what factors should you group by?") %>%
summarise(score.mean = mean(score,na.rm=T))
return(summary)
}

Apply the function to a new language:
get_words("Russian") %>%
group_by(age,mode, sex,language)%>%
summarise(score.mean = mean(score.mean)) %>%
ggplot(aes(x = age, y = score.mean, color = sex))+
geom_point()+
geom_smooth()+
facet_grid(mode~language)

xl_data = bind_rows(get_words("English (American)"),
get_words("Russian"),
get_words("Norwegian"),
get_words("Italian"),
get_words("Korean"))
xl_data %>%
group_by(age,mode, sex,language)%>%
summarise(score.mean = mean(score.mean)) %>%
ggplot(aes(x = age, y = score.mean, color = sex))+
geom_point()+
geom_smooth()+
facet_grid(mode~language)