Instructions

Exercise 1: Vectors & matrices [1 point]

Exercise 1a: create a vector

Create a named Boolean vector and assign it to the variable passed. This vector should contain the following information about who passed and who didn’t pass the final exam: Jax passed, Jamie passed, Jason didn’t, Geralt did.

passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

Exercise 1b: who passed

Compute the proportion of participants who passed by applying a single function XXX to the vector passed, like XXX(passed). Hint: Remember that certain numerical operations will automatically cast a Boolean value to a number.

mean(passed)
## [1] 0.75
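
The hint about automatic casting can be checked directly. The following is a minimal sketch in base R, reusing the vector from Exercise 1a; the helper names n_passed and prop_passed are only for illustration.

```r
passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

# Casting to numeric makes the implicit conversion visible: TRUE -> 1, FALSE -> 0
as.numeric(passed)

# sum() counts the TRUE entries; dividing by the length gives the proportion
n_passed    <- sum(passed)
prop_passed <- n_passed / length(passed)

# mean() performs both steps at once
isTRUE(all.equal(prop_passed, mean(passed)))
```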

Exercise 1c: use stringr

Use an appropriate function from the stringr package to obtain a Boolean vector indicating, for each entry in passed, whether its name (i.e., the name of a participant in our class) starts with the letter “J”. Use this Boolean vector to obtain the subset of entries from passed for all participants whose name starts with the letter “J”.

passed[names(passed) %>% str_detect("^J")]
##   Jax Jamie Jason 
##  TRUE  TRUE FALSE
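
The one-line solution hides the intermediate Boolean vector that the exercise asks for. As a sketch, the two steps can be written out separately; the variable names starts_with_J and j_subset are chosen here for illustration only.

```r
library(stringr)

passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

# Step 1: one Boolean per name; "^J" anchors the match at the start of the string
starts_with_J <- str_detect(names(passed), "^J")
starts_with_J

# Step 2: Boolean indexing keeps exactly the entries where the vector is TRUE
j_subset <- passed[starts_with_J]
j_subset
```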

Exercise 1d: create a matrix

Use a single line of R code to return a matrix with the following values by using seq and piping %>%.

 3   9    15 
 5   11   17
 7   13   19
seq(from = 3, to = 19, by = 2) %>% matrix(nrow = 3)
##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    5   11   17
## [3,]    7   13   19
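
The solution works because matrix fills its cells column by column by default. A small base R sketch of the difference to row-wise filling (the names by_column and by_row are illustrative):

```r
values <- seq(from = 3, to = 19, by = 2)

# Default: the first three values go into the first column
by_column <- matrix(values, nrow = 3)

# byrow = TRUE: the first three values go into the first row instead
by_row <- matrix(values, nrow = 3, byrow = TRUE)

by_column[2, 1]  # 5
by_row[2, 1]     # 9

# Row-wise filling gives the transpose of column-wise filling here
identical(by_row, t(by_column))
```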

Exercise 2: Data frames and tibbles [2 points]

Exercise 2a: create a tibble with tribble

Use the function tribble to represent the following information in a tibble:

# this chunk does not evaluate (notice eval = FALSE)
# it's here to show you the format that you should achieve in the output
participant   condition     response
S01           A             32
S01           B             54
S02           A             23
S02           B             03
tribble(
~participant,   ~condition,     ~response,
"S01",           "A",             32,
"S01",           "B",             54,
"S02",           "A",             23,
"S02",           "B",             03
)
## # A tibble: 4 × 3
##   participant condition response
##   <chr>       <chr>        <dbl>
## 1 S01         A               32
## 2 S01         B               54
## 3 S02         A               23
## 4 S02         B                3

Exercise 2b: create a tibble with tibble

Create the exact same tibble, but now using the command tibble

tibble(
  participant = c("S01", "S01", "S02", "S02"),
  condition   = c("A", "B", "A", "B"),
  response    = c(32, 54, 23, 3)
)
## # A tibble: 4 × 3
##   participant condition response
##   <chr>       <chr>        <dbl>
## 1 S01         A               32
## 2 S01         B               54
## 3 S02         A               23
## 4 S02         B                3

Exercise 2c: identify factors

Which (if any) of these columns could reasonably be represented as a factor (the data type factor in R)?

Solution:

Whatever the “response” is, it is likely a numerical measure, so not something that falls in a small set of relevant categories. But there are finitely many participants and conditions, so it might make sense to treat these columns as factors.
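
If one decides to treat these two columns as factors, the conversion could look like the following sketch. It rebuilds the tibble from Exercise 2b; the names d and d_factors are made up for this example.

```r
library(tibble)
library(dplyr)

d <- tibble(
  participant = c("S01", "S01", "S02", "S02"),
  condition   = c("A", "B", "A", "B"),
  response    = c(32, 54, 23, 3)
)

# Convert the two categorical columns; response stays numeric
d_factors <- d %>%
  mutate(
    participant = factor(participant),
    condition   = factor(condition)
  )

# The levels are the finitely many categories mentioned above
levels(d_factors$participant)
levels(d_factors$condition)
```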

Exercise 2d: fix code

Create a tibble with data from a larger imaginary experiment. Concretely, your tibble should have 100 rows and four columns called participant, trial, condition and response. There are five conditions, labelled A, B, C, D and E. There are ten participants, each of whom sees all five conditions exactly twice. Participants are represented with strings like “S1”, “S2” and so on. You can make up random real numbers sampled uniformly at random from 0 to 100 (rounded to two digits after the decimal point) to fill the response column. Below, we’ve already started on the code, but some of our code is wrong, and some parts might give good results but are really clumsy and inelegant. Finish the job and improve where you can! (Check out the help pages for all commands you do not know.)

exp_data <- tibble(
  participant = str_c("", seq(from = 1, to = 100, by = 1), sep = "S"),
  condition   = rep(1:10, each = 10), 
  trial       = c("A", "B", "C", "D", "E"),
  response    = (runif(100) - 5) %>% ceiling()
)

Solution:

exp_data <- tibble(
  participant = str_c("S", 1:10) %>% rep(each = 10),
  condition   = c("A", "B", "C", "D", "E") %>%  rep(each = 2) %>% rep(times = 10),
  trial       = rep(1:10, times = 10), 
  response    = (runif(100) * 100) %>% round(2)
)
exp_data
## # A tibble: 100 × 4
##    participant condition trial response
##    <chr>       <chr>     <int>    <dbl>
##  1 S1          A             1    19.9 
##  2 S1          A             2    83.1 
##  3 S1          B             3    34.7 
##  4 S1          B             4    79.1 
##  5 S1          C             5    52.9 
##  6 S1          C             6    89.9 
##  7 S1          D             7    16.9 
##  8 S1          D             8    51.3 
##  9 S1          E             9    28.2 
## 10 S1          E            10     7.37
## # ℹ 90 more rows
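
One way to convince yourself that the corrected code meets the design requirements is to count the rows per design cell. This is only a sketch of such a check; cell_counts is a name introduced here, and the responses differ on every run because no seed is set.

```r
library(tibble)
library(dplyr)
library(stringr)

exp_data <- tibble(
  participant = str_c("S", 1:10) %>% rep(each = 10),
  condition   = c("A", "B", "C", "D", "E") %>% rep(each = 2) %>% rep(times = 10),
  trial       = rep(1:10, times = 10),
  response    = (runif(100) * 100) %>% round(2)
)

# Every participant-condition pair should occur exactly twice
cell_counts <- exp_data %>% count(participant, condition)

nrow(cell_counts)        # 10 participants x 5 conditions = 50 cells
all(cell_counts$n == 2)  # TRUE if each cell has two trials

# Responses should lie between 0 and 100
range(exp_data$response)
```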

Exercise 2e: process strings

A friend gives you some useful information in a useless format (a standard problem of data analysis):

great_info_in_crap_format <- "Johnny_Rotten->Sex_Pistols*Johnny_Ramone->Ramones*Johnny_Cash->The_Tennessee_Three"

Use piping and some magic from the stringr package to produce output (as close as possible to something) like this:

Johnny Rotten was part of the Sex Pistols.
Johnny Ramone was part of the Ramones.
Johnny Cash played with The Tennessee Three.

Hint: Using "\n" inserts a line break, but it shows only when using cat to print the output.

great_info_in_crap_format %>%
  str_replace_all(pattern = "_", replacement = " ") %>% 
  str_replace_all(pattern = "\\*", replacement = ".\n") %>% 
  str_replace(pattern = '->The Tennessee Three', replacement = ' played with The Tennessee Three') %>% 
  str_replace_all(pattern = '->', replacement = ' was part of the ') %>%
  str_c(".") %>% 
  cat()
## Johnny Rotten was part of the Sex Pistols.
## Johnny Ramone was part of the Ramones.
## Johnny Cash played with The Tennessee Three.
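
The solution above hard-codes the band name to pick the right verb. As an alternative sketch, one could first split the string into one entry per person and then build each sentence. The rule used here (bands whose name starts with "The " get "played with") is an assumption invented for this example, not part of the exercise.

```r
library(stringr)
library(magrittr)

info <- "Johnny_Rotten->Sex_Pistols*Johnny_Ramone->Ramones*Johnny_Cash->The_Tennessee_Three"

# One string per person: replace underscores, then split at the literal "*"
entries <- info %>%
  str_replace_all("_", " ") %>%
  str_split("\\*") %>%
  unlist()

# Two-column character matrix: person name and band name
parts <- str_split_fixed(entries, "->", n = 2)

# Assumed rule: band names that already start with "The " take "played with"
verb <- ifelse(str_detect(parts[, 2], "^The "), " played with ", " was part of the ")

sentences <- str_c(parts[, 1], verb, parts[, 2], ".")
cat(sentences, sep = "\n")
```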

Exercise 3: Functions and piping [3 points]

Exercise 3: even cumulative mean calculation

Write a named function called even_cumulative_mean that takes a numeric vector input as argument. The function should first check whether the input is numeric (using is.numeric) and whether it has more than one element. If these conditions are not fulfilled, informative error messages should be given (with the stop command). If the input check succeeds, it should return a tibble with three columns (choose appropriate names yourself):

  1. the list of all even indices for which entries exist in input,
  2. all of the entries in input with an even index
  3. a vector whose \(i\)th entry is the mean over all entries in the second column up to index \(i\)

Apply your function to the following vectors:

input_1 <- c(12, 43, 56, 87, 98)
input_2 <- c(12, 43, 56, 87, 98, 101)
even_cumulative_mean <- function(input) {
  if (! is.numeric(input)) {
    stop("Input to 'even_cumulative_mean' should be numeric.")
  }
  if (length(input) <= 1) {
    stop("Input to 'even_cumulative_mean' must have at least two elements.")
  }
  last_even_index <- ifelse(length(input) %% 2 == 0, length(input), length(input)-1)
  even_numbers <- seq(from = 2, to = last_even_index, by = 2)
  tibble(
    even_numbers = even_numbers,
    input_at_even = input[even_numbers],
    cumulative_mean = (map_dbl(1:length(input_at_even), ~ input_at_even[1:.x] %>% mean))
  ) 
}
even_cumulative_mean(input_1)
## # A tibble: 2 × 3
##   even_numbers input_at_even cumulative_mean
##          <dbl>         <dbl>           <dbl>
## 1            2            43              43
## 2            4            87              65
even_cumulative_mean(input_2)
## # A tibble: 3 × 3
##   even_numbers input_at_even cumulative_mean
##          <dbl>         <dbl>           <dbl>
## 1            2            43              43
## 2            4            87              65
## 3            6           101              77
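
The cumulative mean in the third column can also be computed without map_dbl. A minimal base R sketch, using the even-indexed entries of input_2 as example values (x and cum_mean are illustrative names): the running sum divided by the number of entries seen so far gives the running mean.

```r
# Entries of input_2 at the even indices 2, 4 and 6
x <- c(43, 87, 101)

# cumsum(x) is 43, 130, 231; seq_along(x) is 1, 2, 3
cum_mean <- cumsum(x) / seq_along(x)
cum_mean  # 43 65 77

# Same result as averaging each prefix explicitly
prefix_means <- sapply(seq_along(x), function(i) mean(x[1:i]))
isTRUE(all.equal(cum_mean, prefix_means))
```

The package dplyr also offers cummean for this purpose.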

Exercise 4: Tidy data [2 points]

Exercise 4a: tidy up the mess

Here’s a messy data set from an experiment in which participants saw three critical conditions and had to respond by pressing a button for either option A or option B. There were four participants in the experiment, identified anonymously in the variable subject_id. The button presses and associated reaction times of each of the three trials are stored, respectively, in the columns choices and reaction_times (in milliseconds), each as a string which separates the data from different trials either with a comma (for choices) or a single white space (for reaction_times).

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

Use tidyverse tools to tidy up this data set. Check the html file with instructions to see what the output should look like.

choice_data <- messy_data %>% 
  # separate choices
  separate(
    col  = choices, 
    into = str_c("C_", 1:3),
    sep = ","
    ) %>% 
  # make longer
  pivot_longer(
    cols      = contains("C_"), 
    names_to  = "condition",
    values_to = "response" 
  )

RT_data <- messy_data %>% 
  # separate RTs
  separate(
    col  = reaction_times, 
    into = str_c("C_", 1:3),
    sep = " ",
    convert = TRUE
    ) %>% 
  # make longer
  pivot_longer(
    cols      = contains("C_"), 
    names_to  = "condition",
    values_to = "RT" 
  )

tidy_data <- full_join(choice_data, RT_data, by = c("subject_id", "condition")) %>% 
  select(subject_id, condition, response, RT)
tidy_data
## # A tibble: 12 × 4
##    subject_id condition response    RT
##         <dbl> <chr>     <chr>    <int>
##  1          1 C_1       A          312
##  2          1 C_2       B          433
##  3          1 C_3       B          365
##  4          2 C_1       B          393
##  5          2 C_2       A          491
##  6          2 C_3       B          327
##  7          3 C_1       B          356
##  8          3 C_2       A          313
##  9          3 C_3       A          475
## 10          4 C_1       A          292
## 11          4 C_2       B          352
## 12          4 C_3       B          378

Hint: Many roads lead to Rome. One of them is to tidy up messy_data in two steps. Create a tidy data set for the choice data (using some combination of separate, a pivoting function and possibly select), and another one for the reaction time data (using basically the same chain of operations). You would then use a joining operation, e.g., full_join, possibly followed by massaging the output one more time with select. Careful: make sure that the column RT in the final output is of type numeric (integer or double does not matter).
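
Another road to the same result, shown only as a sketch and not as the intended solution, avoids the join by splitting both string columns in parallel with separate_rows from tidyr. The name tidy_alt is made up here, and note that recent tidyr versions mark separate_rows as superseded, although it remains available.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

tidy_alt <- messy_data %>%
  # split both columns at commas or spaces, one row per trial;
  # convert = TRUE turns the reaction times into integers
  separate_rows(choices, reaction_times, sep = "[, ]", convert = TRUE) %>%
  # number the trials within each participant to recover the condition label
  group_by(subject_id) %>%
  mutate(condition = str_c("C_", row_number())) %>%
  ungroup() %>%
  select(subject_id, condition, response = choices, RT = reaction_times)

tidy_alt
```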

Exercise 4b: summarize the reaction times

Use the final tidy representation of the messy_data from the previous exercise, stored in a variable tidy_data. If you have not managed to produce this representation with tools from the tidyverse, you can write the desired tibble by hand (without loss of points for this exercise). Produce a summary table of mean reaction times per condition, using the tools from the tidyverse. Take a look at the html file with instructions to see what your output should look like:

tidy_data %>% 
  group_by(condition) %>% 
  summarise(mean_RT = mean(RT))
## # A tibble: 3 × 2
##   condition mean_RT
##   <chr>       <dbl>
## 1 C_1          338.
## 2 C_2          397.
## 3 C_3          386.

Now produce a table giving the mean reaction times for each participant. But make sure that, in this case, the mean reaction times are rounded to full integers. (Hint: you can use mutate in a final step or round inside of a call to summarise). The output should look like this:

tidy_data %>% 
  group_by(subject_id) %>% 
  summarise(mean_RT = mean(RT) %>% round)
## # A tibble: 4 × 2
##   subject_id mean_RT
##        <dbl>   <dbl>
## 1          1     370
## 2          2     404
## 3          3     381
## 4          4     341

Exercise 4c: extract numeric information

Consider this vector of weirdly specified reaction times (similar to Exercise 2.14 from the web-book).

weird_RTs <- c("RT = 323", "ReactTime = 345", "howfast? -> 23 (wow!)", "speed = 421", "RT:50")

Starting with that vector use a chain of pipes to:

  • extract the numeric information from the strings,
  • cast the information into a vector of type numeric,
  • remove all RT entries lower than 100
    • if you can, use an anonymous function defined in situ; otherwise define a named function;
    • if you can, use Booleans and indexing, not some other ready-made function,
  • take the log, take the mean,
  • round to 3 significant digits

Hint: Check the use of regular expressions on the cheat sheet of the stringr package.

weird_RTs %>% 
  str_extract("[:digit:]+") %>% 
  as.numeric() %>% 
  (function(x) {x[x >= 100]}) %>% 
  log %>% 
  mean %>% 
  signif(3)
## [1] 5.89
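
To see what each step of the chain contributes, the same computation can be unrolled into named intermediate results. This is a sketch with illustrative variable names; it reads "remove all entries lower than 100" as keeping every value of at least 100.

```r
library(stringr)

weird_RTs <- c("RT = 323", "ReactTime = 345", "howfast? -> 23 (wow!)", "speed = 421", "RT:50")

# First run of digits in each string, still of type character
digits <- str_extract(weird_RTs, "[0-9]+")
digits  # "323" "345" "23" "421" "50"

# Cast to numeric
rts <- as.numeric(digits)

# Boolean indexing: keep only entries of at least 100
keep <- rts >= 100
kept <- rts[keep]
kept  # 323 345 421

# Mean of the logs, rounded to three significant digits
result <- signif(mean(log(kept)), 3)
result  # 5.89
```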

Exercise 5: The King of France visits IDA [2 points]

We will work with the King of France experiment. For a detailed description of the theoretical background and the procedure, see Appendix D.4 of your course book.

Here is a condensed description of the materials. The data set consists of five vignettes:

Where each vignette consists of five critical conditions. The following five sentences are examples of the critical conditions for the first vignette.

Additionally, for each vignette there exists a background check. This sentence is intended to find out whether participants know whether the relevant presuppositions are true. The five background checks are:

Finally, there are also 110 filler sentences, which do not have a presupposition, but also require common world knowledge for a correct answer. We will use the filler sentences also as controls, because there is a “correct” answer to each of these.

Exercise 5a: experimental design

Look into the procedure described in the Appendix D.4 of your script and answer the following questions:

  1. Is the “King of France” experiment an instance of a factorial design? If so, what is/are the factor(s), and what are the levels of each factor?

Solution:

It’s a one-factor factorial design. The factor is ‘condition’ and it has five levels: C0, C1, C6, … Vignette is NOT a factor. This is item-level variation to make sure that we are not too repetitive. This shows in there not being any theoretically motivated hypothesis that hinges on a contrast or comparison among vignettes. We treat all vignettes equally for all current purposes (until we engage in hierarchical modeling).

  2. Is this experiment a within-subjects or a between-subjects design?

Solution:

The experiment is a within-subjects design: every participant gives one observation for each design cell (each condition).

  3. Give one advantage and one disadvantage for this design-type (within- vs between-subjects).

Solution:

Advantage: fewer participants needed

Disadvantage: possible cross-contamination between conditions

  4. Is this experiment a repeated-measures design?

Solution:

No, since every participant only gives exactly one data point for each design cell.

  5. Indicate the dependent variable of the experiment (give the column name in the data representation) and the corresponding variable type.

Solution:

Dependent variable: “response”

Variable type: binary

Exercise 5b: exploring IDA’s King of France

Load the King of France data:

data_KoF_raw_IDA <- aida::data_KoF_raw
  1. How many rows does the data set in data_KoF_raw_IDA contain? (Hint: use the nrow function!)
nrow(data_KoF_raw_IDA)
## [1] 2813
  2. How many participants took part in the study? (Hint: use a sequence of operations pull, unique and length.)
data_KoF_raw_IDA %>% pull(submission_id) %>% unique %>% length
## [1] 97
  3. Calculate the grand average of the variable age. (Hint: As soon as a vector contains missing data (an entry NA), its mean is NA as well. Try removing the missing values when calculating the mean, e.g., by checking the documentation of the function mean for anything helpful.)
data_KoF_raw_IDA %>% pull(age) %>% mean(na.rm = TRUE)
## [1] 32.36842
  4. Give the type of each of the following variables included in the data set (i.e., state whether it is ordinal, metric, etc.).
  • submission_id: nominal

  • RT: metric

  • correct: binary

  • education: ordinal

  • item_version: nominal

  • question: nominal

  • response: binary

  • timeSpent: metric

  • trial_name: nominal

  • trial_number: metric

  • trial_type: nominal

  • vignette: nominal
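
The counting of participants and the handling of missing values in this exercise can be tried out without the aida package. The following base R sketch uses a small made-up data frame (toy and its values are hypothetical and unrelated to the real data) with one row per trial, so that participant IDs repeat and some ages are missing.

```r
# Hypothetical toy data: three participants, five trial rows, one unknown age
toy <- data.frame(
  submission_id = c(1, 1, 2, 2, 3),
  age           = c(30, 30, NA, NA, 40)
)

# Number of rows is not the number of participants
nrow(toy)                               # 5
n_participants <- length(unique(toy$submission_id))
n_participants                          # 3

# A single NA makes the mean NA ...
mean(toy$age)                           # NA

# ... unless missing values are removed first
mean_age <- mean(toy$age, na.rm = TRUE)
mean_age                                # (30 + 30 + 40) / 3
```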