Try to use tidyverse style coding for this where possible.
library(wordbankr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Open up the Words and Gestures data. This data contains several scores: productive and receptive vocabulary, and five gesture checklists.
Inst <- get_instrument_data(language = "English (American)", form = "WG")
Admin <- get_administration_data(language = "English (American)", form = "WG")
Item <- get_item_data(language = "English (American)", form = "WG")
Remember that Admin tells us about a particular administration: the details of the participant, and their scores on the productive and receptive vocabulary subsections. Item tells us the details of each item on the CDI. Inst tells us the score for each participant on each item. For this project, you’ll need to combine the three data sets, but first let’s take a look at the items on the CDI.
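As a rough sketch of what combining the three data sets can look like, the example below joins three small made-up tables. The key columns (`data_id`, `item_id`) and the other column names are assumptions, and they differ between wordbankr versions, so check `names(Inst)`, `names(Admin)` and `names(Item)` before adapting it.

```r
library(tidyverse)

# Toy stand-ins for Admin, Item and Inst. All column names here are
# assumptions; compare them with the real tables before reusing this.
admin_toy <- tibble(data_id = c(1, 2), age = c(10, 14))
item_toy  <- tibble(item_id = c("item_1", "item_2"),
                    type    = c("word", "gestures_first"))
inst_toy  <- tibble(data_id = c(1, 1, 2, 2),
                    item_id = c("item_1", "item_2", "item_1", "item_2"),
                    value   = c("understands", "not yet", "produces", "often"))

# One row per participant-by-item response, with the item details
# and the participant details attached
combined_toy <- inst_toy %>%
  left_join(item_toy, by = "item_id") %>%
  left_join(admin_toy, by = "data_id")

combined_toy
```

Starting from the response-level table and using `left_join()` keeps every response, even if an item or participant has no match in the other tables.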
If you look at the item-level data, you can see that there are multiple item types. There are words, which are reflected in each participant’s receptive or productive vocabulary score in the Admin dataset above. But there are other item types that are not in that dataset, such as various sorts of gestures.
Item
We can take a look at the number of items in each of these types with the code below.
Item %>%
  group_by(type) %>%
  count()
If we want to focus on particular types of items, we can use the code below. I encourage you to copy this and take a look at any other sets of items you’re interested in.
Item %>%
  filter(type == "phrases")
Using the data set above, answer the following questions.
Focus on any two of the gesture subscales from the MBCDI. Using the administration data, calculate total scores for each of the gesture subscales. NB: this will mean you have to convert the string variables to numeric variables. Note that you should end up with one score per subscale for each participant. (Look at what we did last week with the grammar subscale for an example.)
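The string-to-numeric step might look something like the sketch below, which uses made-up responses. The subscale labels and response strings are assumptions (run `distinct(type, value)` on your own joined data to see the real ones), and the 0/1 coding is only one simple scheme, not an official scoring rule.

```r
library(tidyverse)

# Toy long-format responses: one row per participant per gesture item.
# Type labels and response strings are assumptions for illustration.
gesture_toy <- tribble(
  ~data_id, ~type,            ~value,
  1,        "gestures_first", "often",
  1,        "gestures_first", "not yet",
  1,        "gestures_games", "yes",
  2,        "gestures_first", "sometimes",
  2,        "gestures_first", "often",
  2,        "gestures_games", "no"
)

gesture_scores <- gesture_toy %>%
  # Recode strings to numbers: 1 if the gesture is reported, 0 otherwise
  mutate(score = if_else(value %in% c("yes", "sometimes", "often"), 1, 0)) %>%
  # Total per participant within each subscale
  group_by(data_id, type) %>%
  summarise(total = sum(score), .groups = "drop") %>%
  # One row per participant, one column per subscale
  pivot_wider(names_from = type, values_from = total)

gesture_scores
```

The wide table can then be joined to the administration data by the participant identifier, giving one row per participant.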
Using this new data set, produce some descriptive statistics and graphs to show the distribution of (a) age, (b) receptive vocabulary, (c) productive vocabulary, and (d) each of the two gesture scores. Describe the distribution of these variables: for example, are they skewed, and do you see any outliers?
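One possible way to get descriptives and histograms for several variables at once is to reshape to long format first. The data below are invented and the column names are placeholders for whatever your combined data set uses.

```r
library(tidyverse)

# Made-up participant-level data; column names are hypothetical
scores_toy <- tibble(
  age           = c(8, 10, 12, 14, 16, 18),
  comprehension = c(5, 30, 80, 150, 220, 390),
  production    = c(0, 1, 4, 10, 40, 200)
)

# Long format: one row per participant per variable
scores_long <- scores_toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "score")

# Mean, SD and range, one row per variable
descriptives <- scores_long %>%
  group_by(variable) %>%
  summarise(mean = mean(score), sd = sd(score),
            min = min(score), max = max(score))

# One histogram panel per variable, each with its own axis scales
dist_plot <- ggplot(scores_long, aes(x = score)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ variable, scales = "free")

descriptives
dist_plot
```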
Look at scatterplots and correlation matrices to see if these variables are related to each other. Interpret the relationships.
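A minimal sketch of a correlation matrix and a single scatterplot, again on invented data with placeholder column names:

```r
library(tidyverse)

# Made-up participant-level data; column names are hypothetical
scores_toy <- tibble(
  age           = c(8, 10, 12, 14, 16, 18),
  comprehension = c(5, 30, 80, 150, 220, 390),
  production    = c(0, 1, 4, 10, 40, 200)
)

# Pearson correlations between every pair of variables;
# pairwise deletion keeps participants who are missing only some scores
cor_matrix <- scores_toy %>%
  select(age, comprehension, production) %>%
  cor(use = "pairwise.complete.obs")

# Scatterplot of one pair, with a straight line of best fit
scatter <- ggplot(scores_toy, aes(x = age, y = comprehension)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

round(cor_matrix, 2)
scatter
```

Swapping the `x` and `y` aesthetics lets you look at each pair in turn. If a relationship looks curved, Spearman correlations (`method = "spearman"` in `cor()`) may describe it better than Pearson.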
Run a linear regression (we did this at the end of the last class) predicting receptive vocabulary from age and the two gesture variables. Which variables predict receptive vocabulary?
Run another linear regression predicting productive vocabulary from age and the two gesture variables.
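As a reminder of the syntax, here is a sketch of a regression with three predictors, fitted to simulated data. The variable names are hypothetical and the simulated effects are arbitrary, so the output says nothing about the real CDI data.

```r
library(tidyverse)

# Simulated data purely to illustrate the syntax; names are hypothetical
set.seed(1)
reg_toy <- tibble(
  age            = runif(50, 8, 18),
  gestures_first = rpois(50, 5),
  gestures_games = rpois(50, 3),
  comprehension  = 10 * age + 5 * gestures_first + rnorm(50, sd = 20)
)

# Receptive vocabulary predicted from age and the two gesture scores
m_comp <- lm(comprehension ~ age + gestures_first + gestures_games,
             data = reg_toy)

summary(m_comp)
```

The productive vocabulary model follows the same pattern with a different outcome variable on the left of the `~`.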
Write (in your Markdown file) what the relationship seems to be between gesture and vocabulary.
Look at the various model diagnostics for the two regression models above. Do you see any reason to change the models, e.g., by transforming variables or dropping problematic observations?
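The sketch below shows a few diagnostics you could start with, using simulated, deliberately skewed data. The variable names are hypothetical, and the `4 / n` cut-off for Cook's distance is only a common rule of thumb, not a strict criterion.

```r
library(tidyverse)

# Simulated, right-skewed outcome to illustrate the checks
set.seed(2)
diag_toy <- tibble(
  age        = runif(40, 8, 18),
  production = exp(0.3 * age + rnorm(40, sd = 0.5))
)

m_prod <- lm(production ~ age, data = diag_toy)

# The four standard diagnostic plots: residuals vs fitted, Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(m_prod)

# Flag potentially influential observations by Cook's distance
cooks <- cooks.distance(m_prod)
influential <- which(cooks > 4 / nrow(diag_toy))

# One possible remedy for a skewed count-like outcome: log(1 + y)
m_log <- lm(log1p(production) ~ age, data = diag_toy)
```

Comparing the diagnostic plots for `m_prod` and `m_log` shows whether the transformation helps.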
CDI scores have a ceiling and a floor. One solution for analysing data with a ceiling and floor is to convert the scores to proportions and run a beta regression. See the betareg package for this.
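A hedged sketch of the conversion step follows. Beta regression needs values strictly between 0 and 1, so scores at the floor or ceiling have to be nudged inwards; the squeeze used here, `(y * (n - 1) + 0.5) / n` with `n` the sample size, is one commonly used option. The data are simulated, and `n_items` is an assumed number of word items that you should check against your own form with `count(Item, type)`.

```r
library(tidyverse)

# Assumed number of word items on the form; verify before using
n_items <- 396

# Simulated scores, purely to illustrate the workflow
set.seed(3)
beta_toy <- tibble(
  age           = runif(50, 8, 18),
  comprehension = rbinom(50, n_items, plogis(-4 + 0.3 * age))
) %>%
  mutate(
    prop    = comprehension / n_items,        # proportion of items known
    prop_sq = (prop * (n() - 1) + 0.5) / n()  # squeezed into (0, 1)
  )

# Fit the model only if betareg is installed
if (requireNamespace("betareg", quietly = TRUE)) {
  m_beta <- betareg::betareg(prop_sq ~ age, data = beta_toy)
  print(summary(m_beta))
}
```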
There are many datasets available on Wordbank from different administrations of the CDI, in different languages, countries, etc. Look through the available languages on the website http://wordbank.stanford.edu/. Also take a look at the item-level data for various data sets to find one that has some variables that interest you. Using the skills you’ve practiced, choose any question you want to address, and produce a report in this markdown file to answer it. For example, you could compare vocabulary growth across two of the languages, look at how some of the demographic variables relate to vocabulary size, etc. This doesn’t have to be statistically sophisticated, as this isn’t a stats class.