Instructions

Please remember to annotate your code
Be consistent in naming conventions
Surround operators with spaces
If you need help, take a look at the suggested readings in the lecture, and make use of the cheat sheets and R’s built-in help.
Make sure you have R and RStudio installed. If you are an advanced user and aren’t using RStudio, make sure you have everything installed in order to ‘knit’ the HTML output.
Create an Rmd-file with your group number (equivalent to StudIP group) in the ‘author’ heading and answer the following questions.
When all answers are ready, ‘Knit’ the document to produce an HTML file.
Create a ZIP archive called “IDA_HW01-Group-XYZ.zip” (where ‘XYZ’ is your group number) containing:
- an R Markdown file “IDA_HW1-Group-XYZ.Rmd”
- a knitted HTML document “IDA_HW01-Group-XYZ.html”
- any other files necessary to compile your Rmarkdown to HTML (pictures etc.)
Upload the ZIP archive on Stud.IP in the respective Vips assignment before the deadline (see above). You may upload as many times as you like before the deadline; only your final submission will count.

Exercise 1: Vectors & matrices [1 point]

Exercise 1a: create a vector

Create a named Boolean vector and assign it to variable passed. This vector should contain the following information about who passed and didn’t pass the final exam: Jax passed, Jamie passed, Jason didn’t, Geralt did.

passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

Exercise 1b: who passed

Compute the proportion of participants who passed by applying a single function XXX to the vector passed, like XXX(passed). Hint: Remember that certain numerical operations will automatically cast a Boolean value to a number.

mean(passed)

## [1] 0.75
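The hint can be made concrete with a small sketch: in arithmetic, TRUE counts as 1 and FALSE as 0, so the mean of a Boolean vector is the share of TRUE entries. The vector is repeated here only so that the snippet runs on its own; the helper names n_passed, n_total and prop_passed are made up for illustration.

```r
# Same vector as in Exercise 1a, repeated so the snippet is self-contained
passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

# Arithmetic functions coerce TRUE to 1 and FALSE to 0
n_passed <- sum(passed)     # number of participants who passed
n_total  <- length(passed)  # number of participants overall

# The proportion computed by hand agrees with mean(passed)
prop_passed <- n_passed / n_total
prop_passed == mean(passed)
```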

Exercise 1c: use `stringr`

Use an appropriate function from the stringr package to obtain a Boolean vector indicating, for each entry in passed, whether its name (i.e., the name of a participant in our class) starts with the letter “J”. Use this Boolean vector to obtain the subset of entries from passed for all participants whose name starts with the letter “J”.

passed[names(passed) %>% str_detect("^J")]

##   Jax Jamie Jason 
##  TRUE  TRUE FALSE
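As a cross-check only (the exercise itself asks for stringr), the same Boolean vector can be built with the base R function startsWith. This sketch also makes the intermediate Boolean vector visible before it is used for indexing; the name starts_with_j is made up for illustration.

```r
# Same vector as in Exercise 1a, repeated so the snippet is self-contained
passed <- c("Jax" = TRUE, "Jamie" = TRUE, "Jason" = FALSE, "Geralt" = TRUE)

# One Boolean per name: does the name start with "J"?
starts_with_j <- startsWith(names(passed), "J")
starts_with_j

# Indexing with a Boolean vector keeps the entries where it is TRUE
passed[starts_with_j]
```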

Exercise 1d: create a matrix

Use a single line of R code to return a matrix with the following values by using seq and piping %>%.

 3   9    15 
 5   11   17
 7   13   19

seq(from = 3, to = 19, by = 2) %>% matrix(nrow = 3)

##      [,1] [,2] [,3]
## [1,]    3    9   15
## [2,]    5   11   17
## [3,]    7   13   19
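The solution works because matrix fills its cells column by column unless told otherwise. A short sketch of the difference, using plain function calls instead of the pipe; the names values, by_column and by_row are made up for illustration.

```r
values <- seq(from = 3, to = 19, by = 2)

# Default: values run down the first column, then the second, and so on
by_column <- matrix(values, nrow = 3)

# With byrow = TRUE the same values run along the rows instead
by_row <- matrix(values, nrow = 3, byrow = TRUE)

by_column
by_row

# The row-wise version is the transpose of the column-wise one
identical(by_row, t(by_column))
```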

Exercise 2: Data frames and tibbles [2 points]

Exercise 2a: create a tibble with `tribble`

Use the function tribble to represent the following information in a tibble:

# this chunk does not evaluate (notice eval = FALSE)
# it's here to show you the format that you should achieve in the output
participant   condition     response
S01           A             32
S01           B             54
S02           A             23
S02           B             03

tribble(
~participant,   ~condition,     ~response,
"S01",           "A",             32,
"S01",           "B",             54,
"S02",           "A",             23,
"S02",           "B",             03
)

## # A tibble: 4 × 3
##   participant condition response
##   <chr>       <chr>        <dbl>
## 1 S01         A               32
## 2 S01         B               54
## 3 S02         A               23
## 4 S02         B                3

Exercise 2b: create a tibble with `tibble`

Create the exact same tibble, but now using the command tibble.

tibble(
  participant = c("S01", "S01", "S02", "S02"),
  condition   = c("A", "B", "A", "B"),
  response    = c(32, 54, 23, 3)
)

## # A tibble: 4 × 3
##   participant condition response
##   <chr>       <chr>        <dbl>
## 1 S01         A               32
## 2 S01         B               54
## 3 S02         A               23
## 4 S02         B                3

Exercise 2c: identify factors

Which (if any) of these columns could reasonably be represented as a factor (the data type factor in R)?

Solution:

Whatever the “response” is, it is likely a numerical measure, so not something that falls in a small set of relevant categories. But there are finitely many participants and conditions, so it might make sense to treat these columns as factors.
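A minimal sketch of what treating these columns as factors could look like, using plain vectors with the values from Exercise 2a. The names participant_f and condition_f are made up for illustration.

```r
participant <- c("S01", "S01", "S02", "S02")
condition   <- c("A", "B", "A", "B")

# factor() stores each value as one of a fixed, finite set of levels
participant_f <- factor(participant)
condition_f   <- factor(condition)

levels(condition_f)     # the distinct categories
nlevels(participant_f)  # how many participants there are

# Cross-tabulating two factors shows how often each combination occurs
table(participant_f, condition_f)
```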

Exercise 2d: fix code

Create a tibble with data from a larger imaginary experiment. Concretely, your tibble should have 100 rows and four columns called participant, trial, condition and response. There are five conditions, labelled A, B, C, D and E. There are ten participants, each of whom sees all five conditions exactly twice. Participants are represented with strings like “S1”, “S2” and so on. To fill the response column, you can make up real numbers sampled uniformly at random from 0 to 100 (rounded to two digits after the decimal point). Below, we’ve already started on the code, but some of it is wrong, and some parts might give good results but are really clumsy and inelegant. Finish the job and improve where you can! (Check out the help pages for all commands you do not know.)

exp_data <- tibble(
  participant = str_c("", seq(from = 1, to = 100, by = 1), sep = "S"),
  condition   = rep(1:10, each = 10), 
  trial       = c("A", "B", "C", "D", "E"),
  response    = (runif(100) - 5) %>% ceiling()
)

exp_data <- tibble(
  participant = str_c("S", 1:10) %>% rep(each = 10),
  condition   = c("A", "B", "C", "D", "E") %>%  rep(each = 2) %>% rep(times = 10),
  trial       = rep(1:10, times = 10), 
  response    = (runif(100) * 100) %>% round(2)
)

exp_data

## # A tibble: 100 × 4
##    participant condition trial response
##    <chr>       <chr>     <int>    <dbl>
##  1 S1          A             1    19.9 
##  2 S1          A             2    83.1 
##  3 S1          B             3    34.7 
##  4 S1          B             4    79.1 
##  5 S1          C             5    52.9 
##  6 S1          C             6    89.9 
##  7 S1          D             7    16.9 
##  8 S1          D             8    51.3 
##  9 S1          E             9    28.2 
## 10 S1          E            10     7.37
## # ℹ 90 more rows
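One way to convince yourself that the design is right is to count how often each participant sees each condition. The sketch below rebuilds the same design as a base R data frame so that it runs without extra packages; the name design and the seed value are assumptions made only for illustration.

```r
set.seed(2023)  # hypothetical seed, only to make the random responses reproducible

design <- data.frame(
  participant = rep(paste0("S", 1:10), each = 10),
  condition   = rep(rep(c("A", "B", "C", "D", "E"), each = 2), times = 10),
  trial       = rep(1:10, times = 10),
  response    = round(runif(100, min = 0, max = 100), 2)
)

# Rows are participants, columns are conditions, cells are counts
cell_counts <- table(design$participant, design$condition)

# Every participant should see every condition exactly twice
all(cell_counts == 2)

# Responses should stay within the sampled range
range(design$response)
```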

Exercise 2e: process strings

A friend gives you some useful information in a useless format (a standard problem of data analysis):

great_info_in_crap_format <- "Johnny_Rotten->Sex_Pistols*Johnny_Ramone->Ramones*Johnny_Cash->The_Tennessee_Three"

Use piping and some magic from the stringr package to produce output (as close as possible to something) like this:

Johnny Rotten was part of the Sex Pistols.
Johnny Ramone was part of the Ramones.
Johnny Cash played with The Tennessee Three.

Hint: Using "\n" inserts a line break, but it shows only when using cat to print the output.

great_info_in_crap_format %>%
  str_replace_all(pattern = "_", replacement = " ") %>% 
  str_replace_all(pattern = "\\*", replacement = ".\n") %>% 
  str_replace(pattern = '->The Tennessee Three', replacement = ' played with The Tennessee Three') %>% 
  str_replace_all(pattern = '->', replacement = ' was part of the ') %>%
  str_c(".") %>% 
  cat()

## Johnny Rotten was part of the Sex Pistols.
## Johnny Ramone was part of the Ramones.
## Johnny Cash played with The Tennessee Three.

Exercise 3: Functions and piping [3 points]

Exercise 3: even cumulative mean calculation

Write a named function called even_cumulative_mean that takes a numeric vector input as argument. The function should first check whether the input is numeric (using is.numeric) and whether it has more than one element. If these conditions are not fulfilled, informative error messages should be given (with the stop command). If the input check succeeds, it should return a tibble with three columns (choose appropriate names yourself):

- the list of all even indices for which entries exist in input,
- all of the entries in input with an even index,
- a vector whose \(i\)th entry is the mean over all entries in the second column up to index \(i\).

Apply your function to the following vectors:

input_1 <- c(12, 43, 56, 87, 98)
input_2 <- c(12, 43, 56, 87, 98, 101)

even_cumulative_mean <- function(input) {
  # input checks: stop with an informative message if they fail
  if (!is.numeric(input)) {
    stop("Input to 'even_cumulative_mean' should be numeric.")
  }
  if (length(input) <= 1) {
    stop("Input to 'even_cumulative_mean' must have at least two elements.")
  }
  # largest even index for which an entry exists
  last_even_index <- ifelse(length(input) %% 2 == 0, length(input), length(input) - 1)
  even_numbers <- seq(from = 2, to = last_even_index, by = 2)
  tibble(
    even_numbers = even_numbers,
    input_at_even = input[even_numbers],
    # running mean over the first i even-indexed entries
    cumulative_mean = map_dbl(seq_along(input_at_even), ~ mean(input_at_even[1:.x]))
  )
}

even_cumulative_mean(input_1)

## # A tibble: 2 × 3
##   even_numbers input_at_even cumulative_mean
##          <dbl>         <dbl>           <dbl>
## 1            2            43              43
## 2            4            87              65

even_cumulative_mean(input_2)

## # A tibble: 3 × 3
##   even_numbers input_at_even cumulative_mean
##          <dbl>         <dbl>           <dbl>
## 1            2            43              43
## 2            4            87              65
## 3            6           101              77
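The cumulative mean can also be computed without iterating over indices: the running sum divided by the running count gives the same column. This is an alternative sketch, not the approach used in the solution; the names even_entries and cumulative_mean are chosen only for illustration.

```r
input_2 <- c(12, 43, 56, 87, 98, 101)

# Entries at the even indices 2, 4 and 6
even_entries <- input_2[seq(from = 2, to = length(input_2), by = 2)]

# Running sum divided by the number of entries seen so far
cumulative_mean <- cumsum(even_entries) / seq_along(even_entries)
cumulative_mean
```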

Exercise 4: Tidy data [2 points]

Exercise 4a: tidy up the mess

Here’s a messy data set from an experiment in which participants saw three critical conditions, and had to respond with pressing a button for either option A or option B. There were four participants in the experiment, identified anonymously in variable subject_id. The button press and associated reaction times of each of three trials are stored, respectively, in columns choices and reaction_times (in milliseconds) in a string which separates the data from different trials either with a comma (for choices) or a single white space (for reaction_times).

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

Use tidyverse tools to tidy up this data set. Check the html file with instructions to see what the output should look like.

choice_data <- messy_data %>% 
  # separate choices
  separate(
    col  = choices, 
    into = str_c("C_", 1:3),
    sep = ","
    ) %>% 
  # make longer
  pivot_longer(
    cols      = contains("C_"), 
    names_to  = "condition",
    values_to = "response" 
  )

RT_data <- messy_data %>% 
  # separate RTs
  separate(
    col  = reaction_times, 
    into = str_c("C_", 1:3),
    sep = " ",
    convert = TRUE
    ) %>% 
  # make longer
  pivot_longer(
    cols      = contains("C_"), 
    names_to  = "condition",
    values_to = "RT" 
  )

tidy_data <- full_join(choice_data, RT_data, by = c("subject_id", "condition")) %>% 
  select(subject_id, condition, response, RT)
tidy_data

## # A tibble: 12 × 4
##    subject_id condition response    RT
##         <dbl> <chr>     <chr>    <int>
##  1          1 C_1       A          312
##  2          1 C_2       B          433
##  3          1 C_3       B          365
##  4          2 C_1       B          393
##  5          2 C_2       A          491
##  6          2 C_3       B          327
##  7          3 C_1       B          356
##  8          3 C_2       A          313
##  9          3 C_3       A          475
## 10          4 C_1       A          292
## 11          4 C_2       B          352
## 12          4 C_3       B          378

Hint: Many roads lead to Rome. One way to reach the output shown here is to tidy up messy_data in two steps. Create a tidy data set for the choice data (using some combination of separate, a pivoting function and possibly select), and another one for the reaction time data (using basically the same chain of operations). You would then use a joining operation, e.g., full_join, possibly followed by massaging the output one more time with select. Careful: make sure that the column RT in the final output is of type numeric (integer or double does not matter).
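As an illustration of another road to the same result, the sketch below splits both string columns in parallel with separate_rows from tidyr and then numbers the trials within each participant. This is an assumption-laden alternative, not the approach described in the hint; the name tidy_alt is made up, and the regular expression "[, ]" is chosen here to match either separator.

```r
library(tibble)
library(tidyr)
library(dplyr)

messy_data <- tribble(
  ~subject_id,  ~choices,  ~reaction_times,
  1,            "A,B,B",   "312 433 365",
  2,            "B,A,B",   "393 491 327",
  3,            "B,A,A",   "356 313 475",
  4,            "A,B,B",   "292 352 378"
)

tidy_alt <- messy_data %>%
  # split both columns at once, producing one row per trial;
  # convert = TRUE turns the reaction times into numbers
  separate_rows(choices, reaction_times, sep = "[, ]", convert = TRUE) %>%
  # label the trials within each participant as C_1, C_2, C_3
  group_by(subject_id) %>%
  mutate(condition = paste0("C_", row_number())) %>%
  ungroup() %>%
  # reorder and rename to match the target format
  select(subject_id, condition, response = choices, RT = reaction_times)

tidy_alt
```

This relies on the order of entries within each string matching the order of conditions, which is how the data set is described above.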

Exercise 4b: summarize the reaction times

Use the final tidy representation of the messy_data from the previous exercise, stored in a variable tidy_data. If you have not managed to produce this representation with tools from the tidyverse, you can write the desired tibble by hand (without loss of points for this exercise). Produce a summary table of mean reaction times per condition, using the tools from the tidyverse. Take a look at the html file with instructions to see what your output should look like:

tidy_data %>% 
  group_by(condition) %>% 
  summarise(mean_RT = mean(RT))

## # A tibble: 3 × 2
##   condition mean_RT
##   <chr>       <dbl>
## 1 C_1          338.
## 2 C_2          397.
## 3 C_3          386.

Now produce a table giving the mean reaction times for each participant. But make sure that, in this case, the mean reaction times are rounded to full integers. (Hint: you can use mutate in a final step or round inside of a call to summarise). The output should look like this:

tidy_data %>% 
  group_by(subject_id) %>% 
  summarise(mean_RT = mean(RT) %>% round)

## # A tibble: 4 × 2
##   subject_id mean_RT
##        <dbl>   <dbl>
## 1          1     370
## 2          2     404
## 3          3     381
## 4          4     341
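The hint mentions a second option, rounding with mutate in a final step. A sketch of that variant follows, with tidy_data written out by hand so that the snippet runs on its own; the name mean_by_subject is made up for illustration.

```r
library(tibble)
library(dplyr)

# tidy_data written by hand, with the values from Exercise 4a
tidy_data <- tibble(
  subject_id = rep(1:4, each = 3),
  condition  = rep(c("C_1", "C_2", "C_3"), times = 4),
  response   = c("A", "B", "B", "B", "A", "B", "B", "A", "A", "A", "B", "B"),
  RT         = c(312, 433, 365, 393, 491, 327, 356, 313, 475, 292, 352, 378)
)

mean_by_subject <- tidy_data %>%
  group_by(subject_id) %>%
  # first compute the unrounded means ...
  summarise(mean_RT = mean(RT)) %>%
  # ... then round them in a separate final step
  mutate(mean_RT = round(mean_RT))

mean_by_subject
```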

Exercise 4c: extract numeric information

Consider this vector of weirdly specified reaction times (similar to Exercise 2.14 from the web-book).

weird_RTs <- c("RT = 323", "ReactTime = 345", "howfast? -> 23 (wow!)", "speed = 421", "RT:50")

Starting with that vector use a chain of pipes to:

extract the numeric information from the strings,
cast the information into a vector of type numeric,
remove all RT entries lower than 100
- if you can, use an anonymous function defined in situ; otherwise define a named function;
- if you can, use Booleans and indexing, not some other ready-made function,
take the log, take the mean,
round to 3 significant digits

Hint: Check the use of regular expressions on the cheat sheet of the stringr package.

weird_RTs %>% 
  str_extract("[:digit:]+") %>% 
  as.numeric() %>% 
  (function(x) {x[x >= 100]}) %>% 
  log %>% 
  mean %>% 
  signif(3)

## [1] 5.89
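To see what each step of the chain produces, the same computation can be spelled out with intermediate variables. The sketch uses base R string functions in place of stringr, purely as a cross-check; the names digits, rts, kept and result are made up for illustration.

```r
weird_RTs <- c("RT = 323", "ReactTime = 345", "howfast? -> 23 (wow!)", "speed = 421", "RT:50")

# Step 1: pull the first run of digits out of each string
digits <- regmatches(weird_RTs, regexpr("[0-9]+", weird_RTs))

# Step 2: cast from character to numeric
rts <- as.numeric(digits)

# Step 3: Boolean indexing drops all entries lower than 100
kept <- rts[rts >= 100]

# Steps 4 to 6: log, mean, round to 3 significant digits
result <- signif(mean(log(kept)), 3)
result
```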

Exercise 5: The King of France visits IDA [2 points]

We will work with the King of France experiment. For a detailed description of the theoretical background and the procedure, see Appendix D.4 of your course book.

Here is a condensed description of the materials. The data set consists of five vignettes:

V1. The King of France is bald.
V2. The Emperor of Canada is fond of sushi.
V3. The Pope’s wife is a lawyer.
V4. The Belgian rainforest provides a habitat for many species.
V5. The volcanoes of Germany dominate the landscape.

Each vignette comes in five critical conditions. The following five sentences are examples of the critical conditions for the first vignette.

C0. The King of France is bald.
C1. France has a king, and he is bald.
C6. The King of France isn’t bald.
C9. The King of France, he did not call Emmanuel Macron last night.
C10. Emmanuel Macron, he did not call the King of France last night.

Additionally, for each vignette there exists a background check. This sentence is intended to find out whether participants know whether the relevant presuppositions are true. The five background checks are:

BC1. France has a king.
BC2. The Pope is currently not married.
BC3. Canada is a democracy.
BC4. Belgium has rainforests.
BC5. Germany has volcanoes.

Finally, there are also 110 filler sentences, which do not have a presupposition, but also require common world knowledge for a correct answer. We will use the filler sentences also as controls, because there is a “correct” answer to each of these.

Exercise 5a: experimental design

Look into the procedure described in the Appendix D.4 of your script and answer the following questions:

Is the “King of France” experiment an instance of a factorial design? If so, what is/are the factor(s), and what are the levels of each factor?

Solution:

It’s a one-factor factorial design. The factor is ‘condition’ and it has five levels: C0, C1, C6, … Vignette is NOT a factor. This is item-level variation to make sure that we are not too repetitive. This shows in there not being any theoretically motivated hypothesis that hinges on a contrast or comparison among vignettes. We treat all vignettes equally for all current purposes (until we engage in hierarchical modeling).

Is this experiment a within-subjects or a between-subjects design?

Solution:

The experiment is a within-subjects design: every participant gives one observation for each design cell (each condition).

Give one advantage and one disadvantage for this design-type (within- vs between-subjects).

Solution:

Advantage: fewer participants needed

Disadvantage: possible cross-contamination between conditions

Is this experiment a repeated-measures design?

Solution:

No, since every participant only gives exactly one data point for each design cell.

Indicate the dependent variable of the experiment (give the column name in the data representation) and the corresponding variable type.

Solution:

Dependent variable: “response”

Variable type: binary

Exercise 5b: exploring IDA’s King of France

Load the King of France data:

data_KoF_raw_IDA <- aida::data_KoF_raw

How many rows does the data set in data_KoF_raw_IDA contain? (Hint: use the nrow function!)

nrow(data_KoF_raw_IDA)

## [1] 2813

How many participants took part in the study? (Hint: use a sequence of operations pull, unique and length.)

data_KoF_raw_IDA %>% pull(submission_id) %>% unique %>% length

## [1] 97

Calculate the grand average of the variable age. (Hint: As soon as a vector contains missing data (an entry NA), its mean is NA as well. Try removing the missing values when calculating the mean, e.g., by checking the documentation of the function mean for anything helpful.)

data_KoF_raw_IDA %>% pull(age) %>% mean(na.rm = TRUE)

## [1] 32.36842
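The effect of missing values on the mean is easy to see on a toy vector. The values below are made up for illustration and are not taken from the data set.

```r
ages <- c(25, NA, 31, 40)  # hypothetical ages with one missing entry

# A single NA makes the whole mean NA
mean(ages)

# na.rm = TRUE drops missing values before averaging
mean(ages, na.rm = TRUE)

# Counting the missing entries shows how many values were dropped
sum(is.na(ages))
```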

Give the type of each of the following variables included in the data set (i.e., state whether it is ordinal, metric, etc.).

submission_id: nominal
RT: metric
correct: binary
education: ordinal
item_version: nominal
question: nominal
response: binary
timeSpent: metric
trial_name: nominal
trial_number: metric
trial_type: nominal
vignette: nominal

Homework Sheet 1 – Basics of R and data wrangling

Due: Wednesday, November 15 by 11:59 CET
