1 Big data

1.1 Preliminaries

Make sure that you load the following packages: dplyr, ggplot2, tidyr, and readr.

Make sure that you install and load the packages wordbankr and dbplyr (note that reads dbplyr, not dplyr).

Make sure that you download the template for the lab, as well as the test file (lab10_data.csv).

(After today) Make sure that you add a consent form to your app, by downloading the template from the Learn site.

2 Setting the scene

Because it is increasingly easy to share data, the cognitive and developmental sciences are increasingly relying on large, shared databases of information. This includes resources like the UK Biobank, which include large stores of medical and neurophysiological data, as well as measures of out interactions with the world (e.g., the British National Corpus, which is a large set of spoken and written linguistic material).

As we discussed at the start, a database can be considered a selection of related tables. When using a database for a research task, you goal is to:

Import the right tables from your database.
Filter out unnecessary information.
Join multiple tables to link different information sources.

In the first part of this lab, you will practice filtering and joining simple dataframes, that will act in lieu of tables from databases. In the second part of the lab, you will be importing data from a large online database of cross-cultural development, and using your new-found skills to plot and analyze the data in different ways.

3 Practice filtering data

In the chunk load_lab_data, use read_csv() to load in the lab data file, and then use head() to look at its form.

This lab datafile is from a JSPsych experiment, a bit like yours. In this study, participants had to choose between pictures of a woman and a man, in conditions under which the people in the pictures were using funky poses, and conditions under which the people in the pictures were using fresh poses.

## # A tibble: 6 x 9
##      rt                        responses      trial_type trial_index
##   <int>                            <chr>           <chr>       <int>
## 1  3226 "{\"Q0\":\"Male\",\"Q1\":\"5\"}"     survey-text           0
## 2  4179                             <NA> button-response           1
## 3  2263                             <NA> button-response           2
## 4  2905                             <NA> button-response           3
## 5  3339                             <NA> button-response           4
## 6  1468                             <NA> button-response           5
## # ... with 5 more variables: subject <int>, button_pressed <int>,
## #   choices <chr>, stimulus <chr>, condition <chr>

You’ll see a number of different columns, which carry information about a variety of things, including e.g., the RT on each trial, the subject number, etc. For instance, the column condition indicates if a trial was in the funky or fresh condition, while the column choices indicates what choice participants made.

This table contains a variety of different rows, just like the datafiles your experiments produces, for example the first row is the result of the survey questions on the first screen, while most of the rows are responses on the different trials of the main task.

Our first task is to process the responses from that first screen. You can see that it has been read into R as a string with odd formatting: "{\"Q0\":\"Male\",\"Q1\":\"5\"}". This means that the answer to the first question was male, and the answer to the second question was age 5. What we would like to do, is to strip that information out of the original dataframe. Then, we will create a new dataframe that specifies the subject number, their gender, and their age. Finally, we will merge that dataframe back into the original dataset.

To do this, first create a new tibble that combines the participant information, with the subject number.

subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1])

Then, use the function gsub() to remove all the noise characters from the participant information cell. gsub() takes a first argument that specifies a pattern to match, a second argument to specify what to replace that pattern with, and then a final argument that says where to search for that pattern.

You want to return a pattern that looks like: Q0:Male,Q1:5

# Example of using gsub (run this in your R console)
gsub('[abcde]',"1","123_abc_xyz") # returns "123_111_xyz"

# Here, the first argument of gsub() is a set of characters inside square brackets. This is the pattern that will be matched (we will come back to what it means soon). The second argument says what the replacement will be. The third argument says what the pattern to be searched is.

# gsub works by searching for a "regular expression". Regular expressions are too complex to explain here but, in short, they provide a very flexible way to search. For example, the pattern above ('[abcde]') indicates that the search should return TRUE whenever it finds any of the letters a through e. By contrast, the pattern 'abcde' would return TRUE whenever it finds the sequence of letters a through e.

# Try to edit the regular expression above in order to return an appropriate pattern.

# Integrating gsub into your dplyr workflow
# This is relatively straight forward
subj_info = tibble(part_info = dataset$responses[1],
                   subject = dataset$subject[1]) %>%
            mutate(part_info = gsub(...))

You can then augment your code above with this dplyr code, that will split the string into two columns and return a nice tidy dataset

%>%
  mutate(KeyValPairs = strsplit(as.character(info), ",")) %>% 
  unnest(KeyValPairs) %>% 
  separate(KeyValPairs, into = c("Question", "Answer"), ":") %>%
  select(-info) %>%
  spread(key = Question, value = Answer)

## # A tibble: 1 x 3
##   subject    Q0    Q1
## *   <int> <chr> <chr>
## 1       0  Male     5

4 Practice joining data

Now, let’s join the original dataset with your new participant information dataset. To do that, we will use the function left_join(), which takes two arguments. The first argument is the primary dataset, and the second argument is the secondary dataset. left_join() returns all the rows that are in the primary dataset, but now merged with all the columns in the secondary dataset. Importantly, it matches on column names, so if both datasets have a column called subject it will find those, and ensure that rows where subject=0 in one dataset are matched to rows where subject=0 in the other dataset (and similarly for subject = 1,2,3...).

dataset = left_join(dataset,part_info) %
    ...

Now, we are going to do a bit more pattern matching, to create our outcome variable, i.e., the data will will analyze. Remember that the column condition says whether the participants are in the funky or fresh conditions, while the column choices indicates which picture they chose. If you look at those two columns, you will see that each trial in the condition column is either funky or fresh. However, in the choices column, the text is more variable. Sometimes participants chose woman2, sometimes they chose man3, etc. We want to create a new column, called pick_woman, which has value 1 if participants picked a woman picture, and value 0 if participants picked a man picture.

Using your dplyr knowledge, augment the code above with a call to mutate(). In the mutate() call, create a new column called pick_woman, where you use the function ifelse() (i.e., mutate(pick_woman = ifelse(...))).

In the ifelse() call, we will use the function grepl(). grepl() is a bit like gsub() in that it takes a pattern, and sees if that pattern is present in a dataset. However, rather than replacing that pattern, it returns TRUE if the pattern is present, and FALSE otherwise. Knowing this, you should be able to combine grepl() with ifelse(), to return 1 if the column choices contains the word woman, and 0 otherwise mutate(pick_woman = ifelse(grepl(‘woman’,choices), 1,0))

Once you have got that working, use your dplyr magic to augment the code further. You want to first filter out all rows that aren’t critical trials (e.g., you want to remove the first row, where no picture was presented). Then, you want to group the trials by condition, and calculate the mean proportion of female choices per condition.

filter() rows where is.na(stimulus) == FALSE.
use group_by() to group by the column condition.
use summarise() and mean() to create a new column called pick_woman.mean.

Then, plot the resulting data using ggplot [using `geom_bar(stat=“identity”)].

5 Access data tables

Nice job on that part of the lab. Hopefully that will help you graph your own experiment’s data.

In the next part of the lab we are going to scale those skills up, to learn how to use much larger databases. We are going to focus on one particular database today, called WordBank, which is a cross-linguistic database of children’s early words. Wordbank relies on the MacArthur Communicative Development Inventory, which is a set of checklist forms that parents fill in to describe their child’s knowledge and use of words and simple sentences. You can find wordbank online, and you can see an example CDI form below. For a large number of different words, parents tick whether their child either Understands the word, or Understands and says the word (i.e., produces the word).

First, play around with wordbank online for just a couple of minutes, then return to the lab. There is an R package called wordbankr that allows us to analyze the data ourselves, from our own terminals.

5.1 Gather Table of Items

Wordbank contains millions and millions of datapoints, and so it would not be practical for us to load the entire database onto our computers.

Instead, we use functions from the wordbankr package to query the server for particular types of information. The server will then send that information back in terms of a data table.

We can request three types of data tables from the server.

A table of items (i.e., words). For example, we might request that the server give us the list of words that are in the American English version of the MacArthur CDI form.
A table of administrations (i.e., children). This would be a table providing information, such as demographic details, on the families and children who were administered the CDI. For example, we might request details on all the (thousands!) of children who have been administered the MacArthur CDI in the USA.
A table of CDI results. This is the largest possible table, where each row is a child’s performance on one item in the CDI (e.g., do they understand/produce the word dog?).

5.2 Table of items

A table of items can be gathered by usig the function get_item_data() (type ?get_item_data() for more information).

Here, you should request item data for the language “English (American)”, and for the form “Words and Gestures”, which is the version of the CDI administered for children aged under 2. You can use the code below to do this.

English_US_Items = get_item_data(
          language = "English (American)", 
          form = "WG")

Then, explore the dataset using head() and summary(), and also use the function unique() to look at the column type (i.e, unique(English_US_Items$type))

## # A tibble: 6 x 11
##   item_id definition           language  form        type category
##     <chr>      <chr>              <chr> <chr>       <chr>    <chr>
## 1  item_1   respname English (American)    WG first_signs     <NA>
## 2  item_2     respno English (American)    WG first_signs     <NA>
## 3  item_3    reactmd English (American)    WG first_signs     <NA>
## 4  item_4   arehngry English (American)    WG     phrases     <NA>
## 5  item_5   aretired English (American)    WG     phrases     <NA>
## 6  item_6   becarefl English (American)    WG     phrases     <NA>
## # ... with 5 more variables: lexical_category <chr>, lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>

## [1] "Unique values of column 'type'"

## [1] "first_signs"      "phrases"          "starting_to_talk"
## [4] "word"             "gestures_first"   "gestures_games"  
## [7] "gestures_objects" "gestures_parent"  "gestures_adult"

You’ll see there are quite a few types of item that we might not want to examine, if we only want to focus on single words. This includes things like gestures, whole phrases, signs, etc etc. So, let’s use filter() to get rid of those.

English_US_Items = English_US_Items %>%
    filter(...["put some code in here to only select rows where column type indicates a word is present"])

Once you have done this, use unique() to check the column type again, as well as other columns such as category, lexical_category, and so on.

## [1] "Unique values of column 'type'"

## [1] "word"

## [1] "Unique values of column 'category'"

##  [1] "sounds"            "animals"           "vehicles"         
##  [4] "toys"              "food_drink"        "clothing"         
##  [7] "body_parts"        "furniture_rooms"   "household"        
## [10] "outside"           "people"            "games_routines"   
## [13] "action_words"      "time_words"        "descriptive_words"
## [16] "pronouns"          "question_words"    "locations"        
## [19] "quantifiers"

## [1] "Unique values of column 'lexical_category'"

## [1] "other"          "nouns"          "predicates"     "function_words"

5.2.1 Administation data

Now, let’s get data on which American English kids were administered the Words and Gestures form.

English_US_Admin = get_administration_data(
  language = "English (American)", 
  form = "WG")

Again, use head() to explore the dataset, and try to find out how many kids took part by using the function length().

## # A tibble: 6 x 14
##   data_id   age comprehension production           language  form
##     <dbl> <int>         <int>      <int>              <chr> <chr>
## 1  145913    18           150        109 English (American)    WG
## 2  145914    18           185         18 English (American)    WG
## 3  145915    18           224        104 English (American)    WG
## 4  145916    18           209        125 English (American)    WG
## 5  145917    18           283        129 English (American)    WG
## 6  145918    17           324         65 English (American)    WG
## # ... with 8 more variables: birth_order <fctr>, ethnicity <fctr>,
## #   sex <fctr>, zygosity <chr>, norming <lgl>, longitudinal <lgl>,
## #   source_name <chr>, mom_ed <fctr>

## [1] "How many administrations?"

## [1] 2435

5.3 Who let the dogs in?

Using your R sleuthing skills, try to find out the item_id for the word dog from the table of items. As a hint, the column containing the list of words is defintion.

Yell if you get lost, because it will be important for the next part.

Now, we will use info from the previous two data tables to import CDI data for the word dog, from American English kids, who took part in the Words and Gestures CDI,

English_US_Dog = get_instrument_data(instrument_language = "English (American)", 
                                     instrument_form = "WG",
                                     items = "item_number_for_dog (e.g., item_32)",
                                     administrations = English_US_Admin)

## # A tibble: 6 x 16
##   data_id       value num_item_id   age comprehension production
##     <dbl>       <chr>       <dbl> <int>         <int>      <int>
## 1  145913    produces          57    18           150        109
## 2  145914 understands          57    18           185         18
## 3  145915    produces          57    18           224        104
## 4  145916    produces          57    18           209        125
## 5  145917    produces          57    18           283        129
## 6  145918    produces          57    17           324         65
## # ... with 10 more variables: language <chr>, form <chr>,
## #   birth_order <fctr>, ethnicity <fctr>, sex <fctr>, zygosity <chr>,
## #   norming <lgl>, longitudinal <lgl>, source_name <chr>, mom_ed <fctr>

What are the different columns in this table?

data_id defines the subject id.
num_item_id defines the item id (in this case, item_57, i.e., dog)
comprehension and production describe the overall comprehension and production scores for each subject, over all items in the database.
the key column is value, which takes one of three values:
produces indicates that the parent has heard the child use the word.
understands indicates that the parent believes the child understands the word.
And a blank value indicates that the parent does not believe the child understands the word.

5.3.1 Processing the data.

What we would like to do is to plot the probability of a child producing or understanding a word, over age. However, before we do this, we have to reckon with the fact that the most critcial column for plotting, value, conflates producing and understanding. Ideally, we want our dataset to be very explicit about what kids do, and do not, know. So while our current data is similar to the figure on the left, we would like it to be more like the data on the right.

Achieving this is done in two steps. First, we use the function mutate() to create two columns, as below. The column produces is 1 if and only if the child is marked as producing the word. The column understands is 1 if the child either is marked as producing or is marked as understanding the word. Once this is done, we then gather the columns together into one large datafile.

To do this, I want you to take the code in the first chunk below, and then edit it and merge it with the code in the second chunk below.

English_US_Dog$produces = ifelse(English_US_Dog$value == "produces",1,0)
English_US_Dog$understands = ifelse(English_US_Dog$value %in% c("produces","understands"),1,0)
English_US_Dog = gather(English_US_Dog, "mode","score", produces,understands)

English_US_Dog = English_US_Dog %>%
  mutate(produces = ...,
        understands = ...) %>%
  gather()

Then use summary() and head() to examine the resulting dataframe. You might also want to use unique(English_US_Dog$sex) which shows that some participants do not have an indicated gender. Since this may indicate a mistake in the dataframe, why don’t you filter out those particular participants.

English_US_Dog = English_US_Dog %>%
  filter(!is.na(sex))

Then, use ggplot(), geom_jitter, and geom_smooth(), to create a graph like the one below.

##     data_id          value            num_item_id      age       
##  Min.   :145913   Length:4870        Min.   :57   Min.   : 8.00  
##  1st Qu.:146537   Class :character   1st Qu.:57   1st Qu.:13.00  
##  Median :147148   Mode  :character   Median :57   Median :13.00  
##  Mean   :147146                      Mean   :57   Mean   :13.81  
##  3rd Qu.:147757                      3rd Qu.:57   3rd Qu.:16.00  
##  Max.   :148366                      Max.   :57   Max.   :18.00  
##                                                                  
##  comprehension     production       language             form          
##  Min.   :  0.0   Min.   :  0.00   Length:4870        Length:4870       
##  1st Qu.: 62.0   1st Qu.:  4.00   Class :character   Class :character  
##  Median :125.0   Median : 13.00   Mode  :character   Mode  :character  
##  Mean   :140.6   Mean   : 28.92                                        
##  3rd Qu.:207.8   3rd Qu.: 34.00                                        
##  Max.   :396.0   Max.   :386.00                                        
##                                                                        
##   birth_order      ethnicity        sex         zygosity        
##  First  :1050   Asian   :  92   Female:2314   Length:4870       
##  Second : 674   Black   : 248   Male  :2470   Class :character  
##  Third  : 268   Other   : 124   Other :   0   Mode  :character  
##  Fourth :  74   White   :1548   NA's  :  86                     
##  Sixth  :  14   Hispanic: 122                                   
##  (Other):  14   NA's    :2736                                   
##  NA's   :2776                                                   
##   norming        longitudinal    source_name                   mom_ed    
##  Mode :logical   Mode :logical   Length:4870        College       : 588  
##  FALSE:2728      FALSE:2298      Class :character   Some College  : 534  
##  TRUE :2142      TRUE :2572      Mode  :character   Secondary     : 508  
##  NA's :0         NA's :0                            Graduate      : 282  
##                                                     Some Secondary: 138  
##                                                     (Other)       :  86  
##                                                     NA's          :2734  
##      mode               score       
##  Length:4870        Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :1.0000  
##                     Mean   :0.6218  
##                     3rd Qu.:1.0000  
##                     Max.   :1.0000  
##                     NA's   :16

5.3.1.1 Make a better graph.

This graph is a decent start, but it is also pretty ugly. Why don’t we summarize all those datapoints, instead? i.e., get the mean for each point, instead?

English_US_Dog_summary = English_US_Dog %>%
  select("which columns should you select?") %>%
  group_by("which columns should you group by?") %>%
  summarise(score.mean = mean(score))

The English_US_Dog data table contains columns with interesting demographic information, like the sex of the baby and the education of the mother. Now, edit your code from above to select() and group_by() those columns, and try to make a graph like the one below.

5.4 Big data sets – all the words!

Well done, you made a graph for the word dog! Now let’s add a whole bunch more words! Using get_instrument_data(), create a new datatable. But this time, rather than selecting one particular item from English_US_Items, we will select all the items, as below!

Edit the code below, based on the code you wrote before, so that you pull in the relevant data table, and then process it to create columns called mode and score, and also filter out unwanted data.

English_US_AllWords = get_instrument_data(instrument_language = "English (American)", 
                                     instrument_form = "WG",
                                     items = English_US_Items$item_id,
                                     administrations = English_US_Admin) %>%
                      mutate(...) %>%
                      gather(...) %>%
                      filter %>%

5.4.0.1 Summarise the dataset

Now, as before, edit these dplyr calls and then create the graph below. Remember that you can add a second aes() to geom_point(), where you specify that the color of the points depends on the mother’s education.

English_US_AllWords_summary = English_US_AllWords %>%
                    select() %>%
                    group_by() %>%
                    summarise(score.mean = mean(score))

5.5 Joining data tables

It would also be cool to be able to split these plots by the type of word being learned. For example, do kids learn animal names before they learn body part names? Do they learn nouns before verbs?

Use head() and summary() to see if we have information on things like lexical category in our data tables English_US_AllWords and English_US_Items. You should see that that information is present in

## # A tibble: 6 x 18
##   data_id    value num_item_id   age comprehension production
##     <dbl>    <chr>       <dbl> <int>         <int>      <int>
## 1  145913                   34    18           150        109
## 2  145913                   35    18           150        109
## 3  145913                   36    18           150        109
## 4  145913 produces          37    18           150        109
## 5  145913                   38    18           150        109
## 6  145913                   39    18           150        109
## # ... with 12 more variables: language <chr>, form <chr>,
## #   birth_order <fctr>, ethnicity <fctr>, sex <fctr>, zygosity <chr>,
## #   norming <lgl>, longitudinal <lgl>, source_name <chr>, mom_ed <fctr>,
## #   mode <chr>, score <dbl>

## # A tibble: 6 x 11
##   item_id     definition           language  form  type category
##     <chr>          <chr>              <chr> <chr> <chr>    <chr>
## 1 item_34        baa baa English (American)    WG  word   sounds
## 2 item_35      choo choo English (American)    WG  word   sounds
## 3 item_36 cockadoodledoo English (American)    WG  word   sounds
## 4 item_37           grrr English (American)    WG  word   sounds
## 5 item_38           meow English (American)    WG  word   sounds
## 6 item_39            moo English (American)    WG  word   sounds
## # ... with 5 more variables: lexical_category <chr>, lexical_class <chr>,
## #   uni_lemma <chr>, complexity_category <chr>, num_item_id <dbl>

Join these two tables up, as below, and then edit that code to produce the below graph.

English_US_AllWords = left_join(English_US_AllWords, English_US_Items)

English_US_AllWords %>%
  select(age, ???,mode, score, mom_ed, sex) %>%
  group_by(age, ???,mode, mom_ed, sex) %>%
  summarise(score.mean = mean(score)) %>%
  ggplot(aes(x = age, y = score.mean, lty = sex))+
  geom_point(aes(color=mom_ed))+
  geom_smooth()+
  facet_grid(???~mode)

5.6 Try to write a function to achieve all this

I’m sure you’ve been taught in your statistics classes, that it is better to write functions than copy out the same code again and again. So let’s do that!

Copy and edit the function below, which

Takes a single argument called language.
Retrieves item data for that language, for the Words and Gestures form.
Retrives admin data for that language, for the Words and Gestures form.
Retrives all the CDI data for the administered children in the admin data, for all items in the item data.
Processes the data as before.
Summarizes the data, so you can make a graph.

get_words = function(language){
    items = get_item_data(language = ???, 
                          form = "WG") %>%
                    filter("EDIT ME") # Filter out when sex = NA 
    
    admin = get_administration_data(language = ???, 
                                    form = "WG")
    
    words = get_instrument_data(instrument_language = ???, 
                              instrument_form = "WG",
                              items = ???$item_id, # get all the items from you data table of items.
                              administrations = ???) %>% # your admin data table
                mutate(produces = ifelse(value == "produces",1,0),
                       understands = ifelse(value %in% c("produces","understands"),1,0)) %>%
                gather("mode","score", produces,understands) %>%
                left_join(items) # What does this last bit do?
    
  summary = words %>%
    select("what factors should you select?") %>%
    group_by("what factors should you group by?") %>%
    summarise(score.mean = mean(score,na.rm=T))
  return(summary)
}

5.6.1 Now, apply the function

Apply the function to a new language

get_words("Russian") %>%
  group_by(age,mode, sex,language)%>%
  summarise(score.mean = mean(score.mean)) %>%
  ggplot(aes(x = age, y = score.mean, color = sex))+
  geom_point()+
  geom_smooth()+
  facet_grid(mode~language)

5.6.1.1 Apply the function to lots of languages

xl_data = bind_rows(get_words("English (American)"),
                    get_words("Russian"),
                    get_words("Norwegian"),
                    get_words("Spanish"),
                    get_words("Korean"))

xl_data %>%
  group_by(age,mode, sex,language)%>%
  summarise(score.mean = mean(score.mean)) %>%
  ggplot(aes(x = age, y = score.mean, color = sex))+
  geom_point()+
  geom_smooth()+
  facet_grid(mode~language)

Lab 10 - Big Data

Hugh Rabagliati

21/11/2017