Data wrangling using tidyr

Author

Shota Momma

Throughout this course, we will make an extensive use of the programming language R. Here we will introduce the basic methods of loading, manipulating, summarizing and visualizing data using the tidyverse package, which may make your life a bit easier later on.

Dataset

We will use the acceptability norming dataset from Momma & Yoshida (2023). The sentences like the following were used:

  • Which movie did the critic watch and praise? [Across-The-Board movement]
  • Which movie did the critic watch before praising? [Parasitic gap sentences]

The dataset also contains obviously good-sounding and obviously bad-sounding filler sentences of various types.

Loading data as a ‘tibble’ object.

The data are stored in csv format, so you need to import them. To do so, you can use the function read_csv(). Note that this is different from the read.csv() function in base r. the read_csv function loads the data in a ‘tibble’ format, which is a table format often used in the tidyverse framework.

## install.packages(tidyverse)
library(tidyverse)
df <- read_csv("/Users/shota/Documents/Teaching/Fall2024/Statistics/data/ATBvsPG.csv")
df
# A tibble: 3,328 × 6
   Subj_ID                          Rating Sentence        Condi…¹ Item_ID List 
   <chr>                             <dbl> <chr>           <chr>     <dbl> <chr>
 1 ddbc836a2df857339b8aa9115a6d93fb      6 Which policy d… ATB          33 A    
 2 ddbc836a2df857339b8aa9115a6d93fb      6 Which movie di… ATB           9 A    
 3 ddbc836a2df857339b8aa9115a6d93fb      7 Which manuscri… ATB           5 A    
 4 ddbc836a2df857339b8aa9115a6d93fb      6 Which vase did… ATB          39 A    
 5 ddbc836a2df857339b8aa9115a6d93fb      6 Which sculptur… ATB          15 A    
 6 ddbc836a2df857339b8aa9115a6d93fb      7 Which manuscri… ATB          21 A    
 7 ddbc836a2df857339b8aa9115a6d93fb      7 Which chair di… ATB          29 A    
 8 ddbc836a2df857339b8aa9115a6d93fb      7 Which question… ATB          49 A    
 9 ddbc836a2df857339b8aa9115a6d93fb      7 Which guitar d… ATB          11 A    
10 ddbc836a2df857339b8aa9115a6d93fb      7 Which car did … ATB          23 A    
# … with 3,318 more rows, and abbreviated variable name ¹​Condition

A nice feature of a tibble format is that it tells you the type of each column, e.g., = character, = double.

Manipulating and summarizing data using dplyr

filter()

The dataset contains both filler and experimental sentences. Let’s say you want to create a data frame where the data from filler trials are removed. You can use the filter() function.

df_exp <- df %>% filter(Condition == 'PG' | Condition == 'ATB')
df_exp
# A tibble: 2,048 × 6
   Subj_ID                          Rating Sentence        Condi…¹ Item_ID List 
   <chr>                             <dbl> <chr>           <chr>     <dbl> <chr>
 1 ddbc836a2df857339b8aa9115a6d93fb      6 Which policy d… ATB          33 A    
 2 ddbc836a2df857339b8aa9115a6d93fb      6 Which movie di… ATB           9 A    
 3 ddbc836a2df857339b8aa9115a6d93fb      7 Which manuscri… ATB           5 A    
 4 ddbc836a2df857339b8aa9115a6d93fb      6 Which vase did… ATB          39 A    
 5 ddbc836a2df857339b8aa9115a6d93fb      6 Which sculptur… ATB          15 A    
 6 ddbc836a2df857339b8aa9115a6d93fb      7 Which manuscri… ATB          21 A    
 7 ddbc836a2df857339b8aa9115a6d93fb      7 Which chair di… ATB          29 A    
 8 ddbc836a2df857339b8aa9115a6d93fb      7 Which question… ATB          49 A    
 9 ddbc836a2df857339b8aa9115a6d93fb      7 Which guitar d… ATB          11 A    
10 ddbc836a2df857339b8aa9115a6d93fb      7 Which car did … ATB          23 A    
# … with 2,038 more rows, and abbreviated variable name ¹​Condition

%>% is known as pipe, and it is an operator that takes the output of one function and passes it onto another function as an argument. The benefit of using this becomes clear in a minute.

You can also filter the data using a certain numerical threshold. Let’s say you want to subset the data so they only acceptability scores lower than 4.

df_exp_low <- df_exp %>% filter(Rating < 4)
df_exp_low
# A tibble: 129 × 6
   Subj_ID                          Rating Sentence        Condi…¹ Item_ID List 
   <chr>                             <dbl> <chr>           <chr>     <dbl> <chr>
 1 ddbc836a2df857339b8aa9115a6d93fb      1 Which movie di… ATB          43 A    
 2 ddbc836a2df857339b8aa9115a6d93fb      1 Which book did… ATB          27 A    
 3 49189b961f14912e503b9424d79991b0      3 Which guitar d… ATB          11 A    
 4 608824109a1b7d5ebbff3faa89f26f57      3 Which formula … ATB          31 A    
 5 608824109a1b7d5ebbff3faa89f26f57      3 Which girl did… ATB          61 A    
 6 608824109a1b7d5ebbff3faa89f26f57      3 Which movie di… ATB           9 A    
 7 608824109a1b7d5ebbff3faa89f26f57      3 Which question… ATB          49 A    
 8 608824109a1b7d5ebbff3faa89f26f57      2 Which horse di… ATB          25 A    
 9 608824109a1b7d5ebbff3faa89f26f57      3 Which cake did… ATB          17 A    
10 608824109a1b7d5ebbff3faa89f26f57      2 Which paper di… ATB          47 A    
# … with 119 more rows, and abbreviated variable name ¹​Condition

select()

Sometimes, you want to remove the columns that you don’t need. To do so, you can use the select() function from the dplyr package, which is included in the tidyverse package. To remove a column, you can use the ‘-’ operator. For example, suppose that you don’t want to remove the “Sentence” column.

df_clean <- df %>% select(-Sentence)
df_clean
# A tibble: 3,328 × 5
   Subj_ID                          Rating Condition Item_ID List 
   <chr>                             <dbl> <chr>       <dbl> <chr>
 1 ddbc836a2df857339b8aa9115a6d93fb      6 ATB            33 A    
 2 ddbc836a2df857339b8aa9115a6d93fb      6 ATB             9 A    
 3 ddbc836a2df857339b8aa9115a6d93fb      7 ATB             5 A    
 4 ddbc836a2df857339b8aa9115a6d93fb      6 ATB            39 A    
 5 ddbc836a2df857339b8aa9115a6d93fb      6 ATB            15 A    
 6 ddbc836a2df857339b8aa9115a6d93fb      7 ATB            21 A    
 7 ddbc836a2df857339b8aa9115a6d93fb      7 ATB            29 A    
 8 ddbc836a2df857339b8aa9115a6d93fb      7 ATB            49 A    
 9 ddbc836a2df857339b8aa9115a6d93fb      7 ATB            11 A    
10 ddbc836a2df857339b8aa9115a6d93fb      7 ATB            23 A    
# … with 3,318 more rows

You can also keep the columns you want by simply telling which column you want to keep.

df_subset <- df %>% select(Subj_ID, Item_ID, Rating)
df_subset
# A tibble: 3,328 × 3
   Subj_ID                          Item_ID Rating
   <chr>                              <dbl>  <dbl>
 1 ddbc836a2df857339b8aa9115a6d93fb      33      6
 2 ddbc836a2df857339b8aa9115a6d93fb       9      6
 3 ddbc836a2df857339b8aa9115a6d93fb       5      7
 4 ddbc836a2df857339b8aa9115a6d93fb      39      6
 5 ddbc836a2df857339b8aa9115a6d93fb      15      6
 6 ddbc836a2df857339b8aa9115a6d93fb      21      7
 7 ddbc836a2df857339b8aa9115a6d93fb      29      7
 8 ddbc836a2df857339b8aa9115a6d93fb      49      7
 9 ddbc836a2df857339b8aa9115a6d93fb      11      7
10 ddbc836a2df857339b8aa9115a6d93fb      23      7
# … with 3,318 more rows

group_by() %>% summarize()

Often, you want to compute descriptive statistics separately by condition, by subject, by item, etc. In such cases you can the combinations of two functions: group_by() and summarize(). For instance, you can compute the mean rating by condition.

mean_bycondition <- df %>% group_by(Condition) %>% summarise(mean = mean(Rating))
mean_bycondition
# A tibble: 4 × 2
  Condition    mean
  <chr>       <dbl>
1 ATB          6.07
2 filler_bad   2.34
3 filler_good  6.18
4 PG           5.95

You can also compute the mean rating by condition by subject.

mean_bycondition_bysubject <- df %>% group_by(Subj_ID, Condition) %>% summarize(mean = mean(Rating))
`summarise()` has grouped output by 'Subj_ID'. You can override using the
`.groups` argument.
mean_bycondition_bysubject
# A tibble: 128 × 3
# Groups:   Subj_ID [32]
   Subj_ID                          Condition    mean
   <chr>                            <chr>       <dbl>
 1 08db5078ac9dcf705b13ac28eb838e2d ATB          5.88
 2 08db5078ac9dcf705b13ac28eb838e2d filler_bad   2.04
 3 08db5078ac9dcf705b13ac28eb838e2d filler_good  5.93
 4 08db5078ac9dcf705b13ac28eb838e2d PG           5.34
 5 2d4c21f83c3728a4e4d53f58eae56a9a ATB          6.44
 6 2d4c21f83c3728a4e4d53f58eae56a9a filler_bad   1.36
 7 2d4c21f83c3728a4e4d53f58eae56a9a filler_good  6.47
 8 2d4c21f83c3728a4e4d53f58eae56a9a PG           6.41
 9 3b3b63a75981b844f04ce7452361e2db ATB          5.88
10 3b3b63a75981b844f04ce7452361e2db filler_bad   1.24
# … with 118 more rows

You can also compute various descriptive statistics simultaneously. For example:

descriptive_stats_by_condition <- df %>% group_by(Condition) %>% summarize(mean = mean(Rating), median = median(Rating), sd = sd(Rating), max = max(Rating), min = min(Rating))
descriptive_stats_by_condition
# A tibble: 4 × 6
  Condition    mean median    sd   max   min
  <chr>       <dbl>  <dbl> <dbl> <dbl> <dbl>
1 ATB          6.07      7  1.31     7     1
2 filler_bad   2.34      2  1.76     7     1
3 filler_good  6.18      7  1.40     7     1
4 PG           5.95      7  1.36     7     1

The benefit of using the pipe operator is that you can chain multiple operations. For instance, you can first filter the dataset and then compute the descriptive statistics.

desc_stats <- df %>% filter(Condition == 'PG' | Condition == 'ATB') %>% group_by(Condition) %>% summarize(mean = mean(Rating))
desc_stats
# A tibble: 2 × 2
  Condition  mean
  <chr>     <dbl>
1 ATB        6.07
2 PG         5.95

mutate()

You sometimes want to create a new column by using information in the other columns. In that case, you can use the mutate() function. For instance, suppose that you want to add a column with standardized rating scores (z-score, i.e., how many standard deviations away from the mean a value is) by subject. This is often useful when analyzing acceptability judgment data:

df_stan <- df %>% group_by(Subj_ID) %>% mutate(Rating_z = scale(Rating)) 
df_stan
# A tibble: 3,328 × 7
# Groups:   Subj_ID [32]
   Subj_ID                          Rating Sente…¹ Condi…² Item_ID List  Ratin…³
   <chr>                             <dbl> <chr>   <chr>     <dbl> <chr>   <dbl>
 1 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          33 A       0.383
 2 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB           9 A       0.383
 3 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB           5 A       0.752
 4 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          39 A       0.383
 5 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          15 A       0.383
 6 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          21 A       0.752
 7 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          29 A       0.752
 8 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          49 A       0.752
 9 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          11 A       0.752
10 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          23 A       0.752
# … with 3,318 more rows, and abbreviated variable names ¹​Sentence, ²​Condition,
#   ³​Rating_z[,1]

define your own function

It is useful to know how to define your own function in R. As an exercise, let’s define a function that computes a z-score (that’s what the scale() function does).

## equivalently:
z <- function(vec){
    (vec - mean(vec))/sd(vec)
}
df_stan2 <- df %>% group_by(Subj_ID) %>% mutate(Rating_z = z(Rating)) 
df_stan2
# A tibble: 3,328 × 7
# Groups:   Subj_ID [32]
   Subj_ID                          Rating Sente…¹ Condi…² Item_ID List  Ratin…³
   <chr>                             <dbl> <chr>   <chr>     <dbl> <chr>   <dbl>
 1 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          33 A       0.383
 2 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB           9 A       0.383
 3 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB           5 A       0.752
 4 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          39 A       0.383
 5 ddbc836a2df857339b8aa9115a6d93fb      6 Which … ATB          15 A       0.383
 6 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          21 A       0.752
 7 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          29 A       0.752
 8 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          49 A       0.752
 9 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          11 A       0.752
10 ddbc836a2df857339b8aa9115a6d93fb      7 Which … ATB          23 A       0.752
# … with 3,318 more rows, and abbreviated variable names ¹​Sentence, ²​Condition,
#   ³​Rating_z

Data visualization using ggplot2

It’s always useful to visualize your data to understand what’s going on. The tidiverse package comes with a package named ggplot2, and it is a powerful tool for data visualization. There are a whole bunch of things we can do with ggplot2, but we are gonna create two basic types of graphs for now.

Let’s first make a histogram. Histograms are pretty useful for visualizing how a variable of interest is distributed, and often, the first thing you want to do with your data is to create the histogram.

df_exp <- df %>% filter(Condition == 'ATB'| Condition == 'PG')

ggplot(df_exp) + 
  geom_histogram(aes(x = Rating))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can also craete two graphs by conditions:

ggplot(df_exp) + 
  geom_histogram(aes(x = Rating)) +
  facet_wrap(~Condition)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sometimes, you may want to look at subject-wise or item-wise distributions. Let’s take a look at the distribution of average ratings of the bad filler sentences by subject.

df_subj <- df %>% filter(Condition == 'filler_bad') %>% group_by(Subj_ID) %>% summarize(mean_rating = mean(Rating))

ggplot(df_subj) + 
  geom_histogram(aes(x = mean_rating))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You may notice that 1 subject didn’t really judge bad fillers as bad. This may be a signal that that person didn’t really pay attention to the task. Generally, plotting distributions is very useful for noticing something odd in your data.

Another widely used type of plots are bar plots. Barplots can be created by using the geom_barplot() function. Say you want to plot the mean ratings of each condition.

df$Condition <- factor(df$Condition, levels = c("ATB","PG", "filler_good", "filler_bad") ) # let's reorder the factor
df_means <- df %>% group_by(Condition) %>% summarize(mean_rating = mean(Rating))

ggplot(df_means, aes(x = Condition,y = mean_rating)) + geom_bar(stat='identity')

Barplots are useful (especially when you put error bars - we don’t put error bars here because computing ‘errors’ is actually complicated. We will come back to this issue later), but they do not contain any information about how variables are distributed. In some cases, you may want to make a boxplot instead:

df_boxplot <- df %>% group_by(Subj_ID, Condition) %>% summarize(mean_rating = mean(Rating))
`summarise()` has grouped output by 'Subj_ID'. You can override using the
`.groups` argument.
ggplot(df_boxplot) + geom_boxplot(aes(x = Condition,y = mean_rating))

HW exercise (do this by the next class)

  1. Compute the mean and the standard deviation of the z-scored acceptability rating using group_by() %>% summarize(). z-score should be computed for each subject separately.

  2. Suppose that you want to identify ‘bad’ subjects, who did not reliably discriminate between good and bad filler sentences. Using various functions introduced above, create a tibble object that shows mean rating differences between good and bad filler sentences for each subject.

  3. Do the same but for each item instead of for each subject.

  4. Create a boxplot plotting the acceptability rating by condition, where each data point corresponds to an item (rather than subject) average.