1 Big data

1.1 Preliminaries

Make sure that you load the following packages: dplyr, ggplot2, tidyr, and readr.

Make sure that you install and load the packages wordbankr and dbplyr (note that reads dbplyr, not dplyr).

Make sure that you download the template for the lab, as well as the lab data file (lab10_data.csv).

(After today) Make sure that you add a consent form to your app, by downloading the template from the Learn site.

2 Setting the scene

Because it is increasingly easy to share data, the cognitive and developmental sciences are increasingly relying on large, shared databases of information. This includes resources like the UK Biobank, which include large stores of medical and neurophysiological data, as well as measures of our interactions with the world (e.g., the British National Corpus, which is a large set of spoken and written linguistic material).

A database can be considered a selection of related tables. When using a database for a research task, you goal is to:

Import the right tables from your database.
Filter out unnecessary information.
Join multiple tables to link different information sources.

In the first part of this lab, you will practice filtering and joining simple dataframes, that will act in lieu of tables from databases. In the second part of the lab, you will be importing data from a large online database of cross-cultural development, and using your new-found skills to plot and analyze the data in different ways.

3 Practice filtering data

In the chunk load_lab_data, use read_csv() to load in the lab data file, and then use head() to look at its form.

This lab datafile is from a JSPsych experiment, a bit like yours. In this study, participants had to choose between pictures of a woman and a man, in conditions under which the people in the pictures were using funky poses, and conditions under which the people in the pictures were using fresh poses.

## # A tibble: 6 x 9
##      rt responses trial_type trial_index subject button_pressed choices
##   <int> <chr>     <chr>            <int>   <int>          <int> <chr>  
## 1  3226 "{\"Q0\"… survey-te…           0       0             NA <NA>   
## 2  4179 <NA>      button-re…           1       0              0 one    
## 3  2263 <NA>      button-re…           2       0              1 woman2 
## 4  2905 <NA>      button-re…           3       0              0 woman1 
## 5  3339 <NA>      button-re…           4       0              0 man1   
## 6  1468 <NA>      button-re…           5       0              0 man2   
## # ... with 2 more variables: stimulus <chr>, condition <chr>

You’ll see a number of different columns, which carry information about a variety of things, including e.g., the RT on each trial, the subject number, etc. For instance, the column condition indicates if a trial was in the funky or fresh condition, while the column choices indicates what picture participants chose

This table contains a variety of different rows, just like the datafiles your experiments produces, and not all of those rows have the same structure. For example, the first row is the result of the survey questions on the first screen. Most of the subsequent rows are responses on the different trials of the main task, but some of the rows are for filler trials in the task.

Our first job is to process the responses from that first screen. You can see that it has been read into R as a string with odd formatting: "{\"Q0\":\"Male\",\"Q1\":\"5\"}". This means that the answer to the first question was male, and the answer to the second question was age 5. What we would like to do, is to strip that information out of the original dataframe. Then, we will create a new dataframe that specifies the participant’s subject number, their gender, and their age. Finally, we will merge that dataframe back into the original dataset, so that it can be easily accessed.

To do this, first create a new tibble, which is a dplyr version of a dataframe, that combines the participant information, with the subject number.

subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1])

Then, use the function gsub() to remove all the noise characters from the participant information cell. gsub() takes a first argument that specifies a pattern to match, a second argument to specify what to replace that pattern with, and then a final argument that says where to search for that pattern.

You want to return a pattern that looks like: Q0:Male,Q1:5. To do this, combine gsub() with the dplyr function mutate() in order to replace all of the noise characters (e.g., { and \) with blank text.

# Example of using gsub (run this in your R console)
gsub('[abcde]',"1","123_abc_xyz") # returns "123_111_xyz"

# Here, the first argument of gsub() is a set of characters inside square brackets. This is the pattern that will be matched (we will come back to what it means soon). The second argument says what the replacement will be. The third argument says what the pattern to be searched is.

# gsub works by searching for a "regular expression". Regular expressions are too complex to explain here but, in short, they provide a very flexible way to search. For example, the pattern above ('[abcde]') indicates that the search should return TRUE whenever it finds any of the letters a through e. By contrast, the pattern 'abcde' (without the square brackets) would return TRUE whenever it finds the sequence of letters a through e.

# Try to edit the regular expression above in order to return an appropriate pattern.

# Integrating gsub into your dplyr workflow
# This is relatively straight forward
subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1]) %>%
            mutate(part_info = gsub(...))

You can then augment your code above with this dplyr code, that will split the string into two columns and return a nice tidy dataset

%>%
  mutate(KeyValPairs = strsplit(as.character(part_info), ",")) %>% 
  unnest(KeyValPairs) %>% 
  separate(KeyValPairs, into = c("Question", "Answer"), ":") %>%
  select(-info) %>%
  spread(key = Question, value = Answer)

## # A tibble: 1 x 3
##   subject Q0    Q1   
##     <int> <chr> <chr>
## 1       0 Male  5

4 Practice joining data

Now, let’s join the original dataset with your new participant information dataset. To do that, we will use the function left_join(), which takes two arguments. The first argument is the primary dataset, and the second argument is the secondary dataset. left_join() returns all the rows that are in the primary dataset, but now merged with all the columns in the secondary dataset. Importantly, it matches on column names, so if both datasets have a column called subject it will find those, and ensure that rows where subject=0 in one dataset are matched to rows where subject=0 in the other dataset (and similarly for subject = 1,2,3...).

dataset = left_join(dataset,part_info)
    ...

Now, we are going to do a bit more pattern matching, to create our outcome variable, i.e., the data we will analyze. Remember that the column condition says whether the participants are in the funky or fresh conditions, while the column choices indicates which picture they chose. If you look at those two columns, you will see that each trial in the condition column is either funky or fresh. However, in the choices column, the text is more variable. Sometimes participants chose woman2, sometimes they chose man3, etc. We can’t analyse such variable data. Instead, we want to create a new column, called pick_woman, which has value 1 if participants picked a woman picture, and value 0 if participants picked a man picture.

Using your dplyr knowledge, augment the code above with a call to mutate(). In the mutate() call, create a new column called pick_woman, where you use the function ifelse() (i.e., mutate(pick_woman = ifelse(...))).

In the ifelse() call, we will use the function grepl(). grepl() is a bit like gsub() in that it takes a pattern, and tests if that pattern is present in a dataset. However, rather than replacing that pattern, it returns TRUE if the pattern is present, and FALSE otherwise. Thus, you can combine grepl() with ifelse(), to return 1 if the column choices contains the word woman, and 0 otherwise:

dataset = left_join(dataset,part_info) %>%
    mutate(pick_woman = ifelse(grepl('woman',choices), 1,0))

Once you have got that working, use your dplyr magic to augment the code further. You want to first filter out all rows that aren’t critical trials (e.g., you want to remove the first row, where no picture was presented). Then, you want to group the trials by condition, and calculate the mean proportion of female choices per condition.

use filter() to remove rows where is.na(stimulus) == FALSE.
use group_by() to group by the column condition.
use summarise() and mean() to create a new column called pick_woman.mean.

Then, plot the resulting data using ggplot [using geom_bar(stat="identity")].

This is tricky, so feel free to ask Hugh for help (and you can use the help files, too).

5 Access data tables

Nice job on that part of the lab. Hopefully that will help you process and graph your own experiment’s data.

In the next part of the lab we are going to scale those skills up, to learn how to use much larger databases. We are going to focus on one particular database today, called WordBank, which is a cross-linguistic database of children’s early words. Wordbank relies on the MacArthur Communicative Development Inventory (the CDI), which is a set of checklist forms that parents fill in to describe their child’s knowledge and use of words and simple sentences. You can find wordbank online, and you can see an example CDI form below. For a large number of different words, parents tick whether their child either Understands the word, or Understands and Says the word (i.e., produces the word).

First, play around with wordbank online for just a couple of minutes, then return to the lab. For instance, you can look at the Analyses tab, and examine the age of acquisition for different words, or vocabulary norms for different ages.

5.1 Gather Table of Items

Wordbank contains millions and millions of datapoints, and so it would not be practical for us to load the entire database onto our computers. Instead, there is an R package called wordbankr that allows us to analyze the data remotely, from our own terminals.

Instead, we use functions from the wordbankr package to query the server for particular types of information. The server will then send that information back in terms of a data table.

We can request three types of data tables from the server.

A table of items (i.e., words). For example, we might request that the server give us the list of words that are in the American English version of the MacArthur CDI form.
A table of administrations (i.e., children). This would be a table providing information, such as demographic details, on the families and children who were administered the CDI. For example, we might request details on all the (thousands!) of children who have been administered the MacArthur CDI in the USA.
A table of CDI results. This is the largest possible table, where each row is a child’s performance on one item in the CDI (e.g., do they understand/produce the word dog?).

5.2 Table of items

A table of items can be gathered by usig the function get_item_data() (type ?get_item_data() for more information).

Here, you should request item data for the language “English (American)”, and for the CDI form “Words and Gestures”, which is the version of the CDI administered for young infants. You can use the code below to do this; make sure you understand what each part of the code does.

English_US_Items = get_item_data(
          language = "English (American)", 
          form = "WG")

Then, explore the dataset using head() and summary(), and also use the function unique() to look at the column type (i.e, unique(English_US_Items$type))

## # A tibble: 6 x 11
##   item_id definition language form  type  category lexical_category
##   <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>           
## 1 item_1  respname   English… WG    firs… <NA>     <NA>            
## 2 item_2  respno     English… WG    firs… <NA>     <NA>            
## 3 item_3  reactmd    English… WG    firs… <NA>     <NA>            
## 4 item_4  arehngry   English… WG    phra… <NA>     <NA>            
## 5 item_5  aretired   English… WG    phra… <NA>     <NA>            
## 6 item_6  becarefl   English… WG    phra… <NA>     <NA>            
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>, num_item_id <dbl>

## [1] "Unique values of column 'type'"

## [1] "first_signs"      "phrases"          "starting_to_talk"
## [4] "word"             "gestures_first"   "gestures_games"  
## [7] "gestures_objects" "gestures_parent"  "gestures_adult"

You’ll see there are quite a few types of item that we might not want to examine, if we only want to focus on children’s knowledge of single words (which is the focus of this lab). The list includes things like gestures, whole phrases, signs, etc etc. So, let’s use filter() to get rid of those.

English_US_Items = English_US_Items %>%
    filter(...["put some code in here to only select rows where column type indicates a word is present"])

Once you have done this, use unique() to check the column type again, as well as other columns such as category, lexical_category, and so on.

## [1] "Unique values of column 'type'"

## [1] "word"

## [1] "Unique values of column 'category'"

##  [1] "sounds"            "animals"           "vehicles"         
##  [4] "toys"              "food_drink"        "clothing"         
##  [7] "body_parts"        "furniture_rooms"   "household"        
## [10] "outside"           "people"            "games_routines"   
## [13] "action_words"      "time_words"        "descriptive_words"
## [16] "pronouns"          "question_words"    "locations"        
## [19] "quantifiers"

## [1] "Unique values of column 'lexical_category'"

## [1] "other"          "nouns"          "predicates"     "function_words"

5.2.1 Administation data

Now, let’s get data on which American English kids were administered the Words and Gestures form.

English_US_Admin = get_administration_data(
  language = "English (American)", 
  form = "WG")

Again, use head() to explore the dataset, and try to find out how many kids took part by using the function length().

## # A tibble: 6 x 15
##   data_id   age comprehension production language form  birth_order
##     <dbl> <int>         <int>      <int> <chr>    <chr> <fct>      
## 1  145913    18           150        109 English… WG    Second     
## 2  145914    18           185         18 English… WG    First      
## 3  145915    18           224        104 English… WG    Second     
## 4  145916    18           209        125 English… WG    First      
## 5  145917    18           283        129 English… WG    First      
## 6  145918    17           324         65 English… WG    Second     
## # ... with 8 more variables: ethnicity <fct>, sex <fct>, zygosity <chr>,
## #   norming <lgl>, mom_ed <fct>, longitudinal <lgl>, source_name <chr>,
## #   license <chr>

## [1] "How many administrations?"

## [1] 2435

5.3 Who let the dogs in?

Using your R sleuthing skills, try to find out the item_id for the word dog from the table of items. As a hint, the column containing the list of words is defintion.

Yell if you get lost, because it will be important for the next part.

Now, we will use info from the previous two data tables to import CDI data for the word dog, from American English kids, who took part in the Words and Gestures CDI,

English_US_Dog = get_instrument_data(language = "English (American)", 
                                     form = "WG",
                                     items = "ADD_the_item_number_for_dog (e.g., item_32)",
                                     administrations = English_US_Admin)

## # A tibble: 6 x 17
##   data_id value num_item_id   age comprehension production language form 
##     <dbl> <chr>       <dbl> <int>         <int>      <int> <chr>    <chr>
## 1  145913 prod…          57    18           150        109 English… WG   
## 2  145914 unde…          57    18           185         18 English… WG   
## 3  145915 prod…          57    18           224        104 English… WG   
## 4  145916 prod…          57    18           209        125 English… WG   
## 5  145917 prod…          57    18           283        129 English… WG   
## 6  145918 prod…          57    17           324         65 English… WG   
## # ... with 9 more variables: birth_order <fct>, ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>

What are the different columns in this table?

data_id defines the subject id.
num_item_id defines the item id (in this case, item_57, i.e., dog)
comprehension and production describe the overall comprehension and production scores for each subject, over all items in the database.
the key column is value, which takes one of three values:
produces indicates that the parent has heard the child use the word.
understands indicates that the parent believes the child understands the word.
And a blank value indicates that the parent does not believe the child understands the word.

5.3.1 Processing the data.

What we would like to do is to plot the probability of a child producing or understanding a word, over age. However, before we do this, we have to reckon with the fact that the most critcial column for plotting, value, conflates producing and understanding. Ideally, we want our dataset to be very explicit about what kids do, and do not, know. So while our current data is similar to the figure on the left, we would like it to be more like the data on the right.

Achieving this is done in two steps. First, we use the function mutate() to create two columns, as below. The column produces is 1 if and only if the child is marked as producing the word. The column understands is 1 if the child either is marked as producing or is marked as understanding the word. Once this is done, we then gather the columns together into one large datafile.

To do this, I want you to take the code in the first chunk below, and then edit it so that it can be merged with the code in the second chunk below.

English_US_Dog$produces = ifelse(English_US_Dog$value == "produces",1,0)
English_US_Dog$understands = ifelse(English_US_Dog$value %in% c("produces","understands"),1,0)
English_US_Dog = gather(English_US_Dog, "mode","score", produces,understands)

English_US_Dog = English_US_Dog %>%
  mutate(produces = ...,
        understands = ...) %>%
  gather()

Then use summary() and head() to examine the resulting dataframe. You might also want to use unique(English_US_Dog$sex) which shows that some participants do not have an indicated gender. Since this may indicate a mistake in the dataframe, you should filter out those particular participants.

English_US_Dog = English_US_Dog %>%
  filter(!is.na(sex))

Then, use ggplot(), geom_jitter, and geom_smooth(), to create a graph like the one below.

##     data_id          value            num_item_id      age       
##  Min.   :145913   Length:4870        Min.   :57   Min.   : 8.00  
##  1st Qu.:146537   Class :character   1st Qu.:57   1st Qu.:13.00  
##  Median :147148   Mode  :character   Median :57   Median :13.00  
##  Mean   :147146                      Mean   :57   Mean   :13.81  
##  3rd Qu.:147757                      3rd Qu.:57   3rd Qu.:16.00  
##  Max.   :148366                      Max.   :57   Max.   :18.00  
##                                                                  
##  comprehension     production       language             form          
##  Min.   :  0.0   Min.   :  0.00   Length:4870        Length:4870       
##  1st Qu.: 62.0   1st Qu.:  4.00   Class :character   Class :character  
##  Median :125.0   Median : 13.00   Mode  :character   Mode  :character  
##  Mean   :140.6   Mean   : 28.92                                        
##  3rd Qu.:207.8   3rd Qu.: 34.00                                        
##  Max.   :396.0   Max.   :386.00                                        
##                                                                        
##   birth_order      ethnicity        sex         zygosity        
##  First  :1050   Asian   :  92   Female:2314   Length:4870       
##  Second : 674   Black   : 248   Male  :2470   Class :character  
##  Third  : 268   Other   : 124   Other :   0   Mode  :character  
##  Fourth :  74   White   :1548   NA's  :  86                     
##  Sixth  :  14   Hispanic: 122                                   
##  (Other):  14   NA's    :2736                                   
##  NA's   :2776                                                   
##   norming                   mom_ed     longitudinal    source_name       
##  Mode :logical   College       : 588   Mode :logical   Length:4870       
##  FALSE:2728      Some College  : 534   FALSE:2298      Class :character  
##  TRUE :2142      Secondary     : 508   TRUE :2572      Mode  :character  
##                  Graduate      : 282                                     
##                  Some Secondary: 138                                     
##                  (Other)       :  86                                     
##                  NA's          :2734                                     
##    license              mode               score       
##  Length:4870        Length:4870        Min.   :0.0000  
##  Class :character   Class :character   1st Qu.:0.0000  
##  Mode  :character   Mode  :character   Median :1.0000  
##                                        Mean   :0.6218  
##                                        3rd Qu.:1.0000  
##                                        Max.   :1.0000  
##                                        NA's   :16

5.3.1.1 Make a better graph.

This graph is a decent start, but it is also pretty ugly. Why don’t we summarize all those datapoints, instead? i.e., get the mean for each point, instead?

English_US_Dog_summary = English_US_Dog %>%
  select("which columns should you select?") %>%
  group_by("which columns should you group by?") %>%
  summarise(score.mean = mean(score))

The English_US_Dog data table contains columns with interesting demographic information, like the sex of the baby and the education of the mother. Now, edit your code from above to select() and group_by() those columns, and try to make a graph like the one below.

5.4 Big data sets – all the words!

Well done, you made a graph for the word dog! Now let’s add a whole bunch more words! Using get_instrument_data(), create a new datatable. But this time, rather than selecting one particular item from English_US_Items, we will select all the items, as below!

Edit the code below, based on the code you wrote before, so that you pull in the relevant data table, and then process it to create columns called mode and score, and also filter out unwanted data.

English_US_AllWords = get_instrument_data(language = "English (American)", 
                                     form = "WG",
                                     items = English_US_Items$item_id,
                                     administrations = English_US_Admin) %>%
                      mutate(...) %>%
                      gather(...) %>%
                      filter %>%

5.4.0.1 Summarise the dataset

Now, as you did before, edit the below dplyr calls to summarise the data and create the graph below. Remember that you can add a second aes() to geom_point(), where you specify that the color of the points depends on the mother’s education.

English_US_AllWords_summary = English_US_AllWords %>%
                    select() %>%
                    group_by() %>%
                    summarise(score.mean = mean(score))

5.5 Joining data tables

It would also be cool to be able to split these plots by the type of word being learned. For example, do kids learn animal names before they learn body part names? Do they learn nouns before verbs?

Use head() and summary() to see if we have information on things like lexical category in our data tables English_US_AllWords and English_US_Items. You should see that that information is present in one but not the other.

## # A tibble: 6 x 19
##   data_id value num_item_id   age comprehension production language form 
##     <dbl> <chr>       <dbl> <int>         <int>      <int> <chr>    <chr>
## 1  145913 ""             34    18           150        109 English… WG   
## 2  145913 ""             35    18           150        109 English… WG   
## 3  145913 ""             36    18           150        109 English… WG   
## 4  145913 prod…          37    18           150        109 English… WG   
## 5  145913 ""             38    18           150        109 English… WG   
## 6  145913 ""             39    18           150        109 English… WG   
## # ... with 11 more variables: birth_order <fct>, ethnicity <fct>,
## #   sex <fct>, zygosity <chr>, norming <lgl>, mom_ed <fct>,
## #   longitudinal <lgl>, source_name <chr>, license <chr>, mode <chr>,
## #   score <dbl>

## # A tibble: 6 x 11
##   item_id definition language form  type  category lexical_category
##   <chr>   <chr>      <chr>    <chr> <chr> <chr>    <chr>           
## 1 item_34 baa baa    English… WG    word  sounds   other           
## 2 item_35 choo choo  English… WG    word  sounds   other           
## 3 item_36 cockadood… English… WG    word  sounds   other           
## 4 item_37 grrr       English… WG    word  sounds   other           
## 5 item_38 meow       English… WG    word  sounds   other           
## 6 item_39 moo        English… WG    word  sounds   other           
## # ... with 4 more variables: lexical_class <chr>, uni_lemma <chr>,
## #   complexity_category <chr>, num_item_id <dbl>

By editing the code below, join these two tables up and then produce the below graph.

English_US_AllWords = left_join(English_US_AllWords, English_US_Items)

English_US_AllWords %>%
  select(age, ???,mode, score, mom_ed, sex) %>%
  group_by(age, ???,mode, mom_ed, sex) %>%
  summarise(score.mean = mean(score)) %>%
  ggplot(aes(x = age, y = score.mean, lty = sex))+
  geom_point(aes(color=mom_ed))+
  geom_smooth()+
  facet_grid(???~mode)

5.6 Try to write a function to achieve all this

This part is optional.

I’m sure you’ve been taught in your statistics classes, that it is better to write functions than copy out the same code again and again. So let’s do that! Create a function that, for any given language, produces a nicely summarised table of what kids do and do not know.

You can try and do this by yourself, or if you need a hand, try to edit the function below, which

Takes a single argument called language.
Retrieves item data for that language, for the Words and Gestures form.
Retrives admin data for that language, for the Words and Gestures form.
Retrives all the CDI data for the administered children in the admin data, for all items in the item data.
Processes the data as before.
Summarizes the data, so you can make a graph.

get_words = function(language){
    items = get_item_data(language = ???, 
                          form = "WG") %>%
                    filter("EDIT ME") # Filter out when sex = NA 
    
    admin = get_administration_data(language = ???, 
                                    form = "WG")
    
    words = get_instrument_data(language = ???, 
                              form = "WG",
                              items = ???$item_id, # get all the items from your data table of items.
                              administrations = ???) %>% # your admin data table
                mutate(produces = ifelse(value == "produces",1,0),
                       understands = ifelse(value %in% c("produces","understands"),1,0)) %>%
                gather("mode","score", produces,understands) %>%
                left_join(items) # What does this last bit do?
    
  summary = words %>%
    select("what factors should you select?") %>%
    group_by("what factors should you group by?") %>%
    summarise(score.mean = mean(score,na.rm=T))
  return(summary)
}

5.6.1 Now, apply the function

Apply the function to a new language

get_words("Russian") %>%
  group_by(age,mode, sex,language)%>%
  summarise(score.mean = mean(score.mean)) %>%
  ggplot(aes(x = age, y = score.mean, color = sex))+
  geom_point()+
  geom_smooth()+
  facet_grid(mode~language)

5.6.1.1 Apply the function to lots of languages

xl_data = bind_rows(get_words("English (American)"),
                    get_words("Russian"),
                    get_words("Norwegian"),
                    get_words("Italian"),
                    get_words("Korean"))

xl_data %>%
  group_by(age,mode, sex,language)%>%
  summarise(score.mean = mean(score.mean)) %>%
  ggplot(aes(x = age, y = score.mean, color = sex))+
  geom_point()+
  geom_smooth()+
  facet_grid(mode~language)

6 Assignment for this week

Make sure that you can download and process a datafile from your app, such that you can turn it into an informative graph.
Make sure to append the consent form in this week’s lab to the start of your app.

Lab 8 - Big Data

Hugh Rabagliati

13/11/2018