Create a named Boolean vector and assign it to variable
passed. This vector should contain the following
information about who passed and didn’t pass the final exam: Jax passed,
Jamie passed, Jason didn’t, Geralt did.
Compute the proportion of participants who passed by applying a
single function XXX to the vector passed, like
XXX(passed). Hint: Remember that certain numerical
operations will automatically cast a Boolean value to a
number.
## [1] 0.75
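One possible solution, given here as a minimal sketch: a named logical vector can be built directly with c(), and mean() casts TRUE/FALSE to 1/0, so it returns the proportion of TRUE entries.

```r
# Named logical vector: names are the participants, values say who passed
passed <- c(Jax = TRUE, Jamie = TRUE, Jason = FALSE, Geralt = TRUE)

# mean() coerces TRUE/FALSE to 1/0, so this is the proportion who passed
mean(passed)
```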
stringr
Use an appropriate function from the stringr package to
obtain a Boolean vector indicating whether the name of each entry in
passed (i.e., the name of each participant in our class)
starts with the letter “J”. Use this Boolean vector to obtain the subset
of entries from passed for all participants whose name
starts with the letter “J”.
## Jax Jamie Jason
## TRUE TRUE FALSE
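A sketch of one way to get this output, assuming str_detect with an anchored pattern is an acceptable choice (str_starts from the same package would also work). The helper name starts_with_J is made up for this illustration.

```r
library(stringr)

passed <- c(Jax = TRUE, Jamie = TRUE, Jason = FALSE, Geralt = TRUE)

# TRUE for every name that begins with "J" ("^" anchors the match at the start)
starts_with_J <- str_detect(names(passed), "^J")

# Logical indexing keeps only the entries flagged TRUE
passed[starts_with_J]
```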
tribble
Use the function tribble to represent the following
information in a tibble:
# this chunk does not evaluate (notice eval = FALSE)
# it's here to show you the format that you should achieve in the output
participant condition response
S01 A 32
S01 B 54
S02 A 23
S02 B 03
tribble(
~participant, ~condition, ~response,
"S01", "A", 32,
"S01", "B", 54,
"S02", "A", 23,
"S02", "B", 03
)
## # A tibble: 4 × 3
## participant condition response
## <chr> <chr> <dbl>
## 1 S01 A 32
## 2 S01 B 54
## 3 S02 A 23
## 4 S02 B 3
tibble
Create exactly the same tibble, but now using the command
tibble:
tibble(
participant = c("S01", "S01", "S02", "S02"),
condition = c("A", "B", "A", "B"),
response = c(32, 54, 23, 3)
)
## # A tibble: 4 × 3
## participant condition response
## <chr> <chr> <dbl>
## 1 S01 A 32
## 2 S01 B 54
## 3 S02 A 23
## 4 S02 B 3
Which (if any) of these columns could reasonably be represented as a
factor (the data type factor in R)?
Solution:
Whatever the “response” is, it is likely a numerical measure, so not something that falls in a small set of relevant categories. But there are finitely many participants and conditions, so it might make sense to treat these columns as factors.
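As an illustration of the point above, the two categorical columns can be converted with factor() inside mutate. This is only a sketch; the names small_data and small_data_fct are made up here.

```r
library(tibble)
library(dplyr)

small_data <- tibble(
  participant = c("S01", "S01", "S02", "S02"),
  condition   = c("A", "B", "A", "B"),
  response    = c(32, 54, 23, 3)
)

# Convert the two categorical columns to factors; response stays numeric
small_data_fct <- small_data %>%
  mutate(
    participant = factor(participant),
    condition   = factor(condition)
  )

levels(small_data_fct$condition)
```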
Create a tibble with data from a larger imaginary experiment.
Concretely, your tibble should have 100 rows, and four columns called
participant, trial, condition and
response. There are five conditions, labelled A, B, C, D
and E. There are ten participants, each of whom sees all five
conditions exactly twice. Participants are represented with strings like
“S1”, “S2” and so on. Fill the response column with random real numbers
sampled uniformly from 0 to 100 (rounded to two decimal
places). Below, we’ve already
started on the code, but some of it is wrong, and other parts might
give good results but are clumsy and inelegant. Finish the job
and improve where you can! (Check the help pages for all commands you do
not know.)
exp_data <- tibble(
participant = str_c("", seq(from = 1, to = 100, by = 1), sep = "S"),
condition = rep(1:10, each = 10),
trial = c("A", "B", "C", "D", "E"),
response = (runif(100) - 5) %>% ceiling()
)
exp_data <- tibble(
participant = str_c("S", 1:10) %>% rep(each = 10),
condition = c("A", "B", "C", "D", "E") %>% rep(each = 2) %>% rep(times = 10),
trial = rep(1:10, times = 10),
response = (runif(100) * 100) %>% round(2)
)
## # A tibble: 100 × 4
## participant condition trial response
## <chr> <chr> <int> <dbl>
## 1 S1 A 1 19.9
## 2 S1 A 2 83.1
## 3 S1 B 3 34.7
## 4 S1 B 4 79.1
## 5 S1 C 5 52.9
## 6 S1 C 6 89.9
## 7 S1 D 7 16.9
## 8 S1 D 8 51.3
## 9 S1 E 9 28.2
## 10 S1 E 10 7.37
## # ℹ 90 more rows
A friend gives you some useful information in a useless format (a standard problem of data analysis):
great_info_in_crap_format <- "Johnny_Rotten->Sex_Pistols*Johnny_Ramone->Ramones*Johnny_Cash->The_Tennessee_Three"
Use piping and some magic from the stringr package to
produce output (as close as possible to something) like this:
Johnny Rotten was part of the Sex Pistols.
Johnny Ramone was part of the Ramones.
Johnny Cash played with The Tennessee Three.
Hint: Using "\n" inserts a line break,
but it shows only when using cat to print the output.
great_info_in_crap_format %>%
str_replace_all(pattern = "_", replacement = " ") %>%
str_replace_all(pattern = "\\*", replacement = ".\n") %>%
str_replace(pattern = '->The Tennessee Three', replacement = ' played with The Tennessee Three') %>%
str_replace_all(pattern = '->', replacement = ' was part of the ') %>%
str_c(".") %>%
cat()
## Johnny Rotten was part of the Sex Pistols.
## Johnny Ramone was part of the Ramones.
## Johnny Cash played with The Tennessee Three.
Write a named function called even_cumulative_mean that
takes a numeric vector input as argument. The function
should first check whether the input is numeric (using
is.numeric) and whether it has more than one element. If
these conditions are not fulfilled, informative error messages should be
given (with the stop command). If the input check succeeds,
it should return a tibble with three columns (choose appropriate names
yourself):
the even indices of input,
the elements of input with an even index,
the cumulative mean of those elements.
Apply your function to the following vectors:
even_cumulative_mean <- function(input) {
  if (!is.numeric(input)) {
    stop("Input to 'even_cumulative_mean' should be numeric.")
  }
  if (length(input) <= 1) {
    stop("Input to 'even_cumulative_mean' must have at least two elements.")
  }
  # seq() stops at the last even index that does not exceed length(input)
  even_numbers <- seq(from = 2, to = length(input), by = 2)
  input_at_even <- input[even_numbers]
  tibble(
    even_numbers = even_numbers,
    input_at_even = input_at_even,
    # mean of the first .x even-indexed elements, for each .x
    cumulative_mean = map_dbl(seq_along(input_at_even), ~ mean(input_at_even[1:.x]))
  )
}
## # A tibble: 2 × 3
## even_numbers input_at_even cumulative_mean
## <dbl> <dbl> <dbl>
## 1 2 43 43
## 2 4 87 65
## # A tibble: 3 × 3
## even_numbers input_at_even cumulative_mean
## <dbl> <dbl> <dbl>
## 1 2 43 43
## 2 4 87 65
## 3 6 101 77
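The cumulative mean can also be computed without map_dbl, as a running sum divided by a running count. The sketch below is an alternative, not the reference solution: the function name even_cumulative_mean_alt and the example vector are made up for illustration (the vector is not one of those from the exercise), and the input checks are condensed into stopifnot.

```r
library(tibble)

# Hypothetical variant; input checks condensed into stopifnot()
even_cumulative_mean_alt <- function(input) {
  stopifnot(is.numeric(input), length(input) > 1)
  even_index <- seq(from = 2, to = length(input), by = 2)
  input_at_even <- input[even_index]
  tibble(
    even_index      = even_index,
    input_at_even   = input_at_even,
    # running sum divided by running count gives the cumulative mean
    cumulative_mean = cumsum(input_at_even) / seq_along(input_at_even)
  )
}

# Made-up example vector
even_cumulative_mean_alt(c(1, 43, 2, 87, 3, 101))
```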
Here’s a messy data set from an experiment in which participants saw
three critical conditions, and had to respond by pressing a button for
either option A or option B. There were four participants in the
experiment, identified anonymously in variable subject_id.
The button press and associated reaction times of each of three trials
are stored, respectively, in columns choices and
reaction_times (in milliseconds) in a string which
separates the data from different trials either with a comma (for
choices) or a single white space (for
reaction_times).
messy_data <- tribble(
~subject_id, ~choices, ~reaction_times,
1, "A,B,B", "312 433 365",
2, "B,A,B", "393 491 327",
3, "B,A,A", "356 313 475",
4, "A,B,B", "292 352 378"
)
Use tidyverse tools to tidy up this data set. Check the html file with instructions to see what the output should look like.
choice_data <- messy_data %>%
# separate choices
separate(
col = choices,
into = str_c("C_", 1:3),
sep = ","
) %>%
# make longer
pivot_longer(
cols = contains("C_"),
names_to = "condition",
values_to = "response"
)
RT_data <- messy_data %>%
# separate RTs
separate(
col = reaction_times,
into = str_c("C_", 1:3),
sep = " ",
convert = TRUE
) %>%
# make longer
pivot_longer(
cols = contains("C_"),
names_to = "condition",
values_to = "RT"
)
tidy_data <- full_join(choice_data, RT_data, by = c("subject_id", "condition")) %>%
select(subject_id, condition, response, RT)
tidy_data
## # A tibble: 12 × 4
## subject_id condition response RT
## <dbl> <chr> <chr> <int>
## 1 1 C_1 A 312
## 2 1 C_2 B 433
## 3 1 C_3 B 365
## 4 2 C_1 B 393
## 5 2 C_2 A 491
## 6 2 C_3 B 327
## 7 3 C_1 B 356
## 8 3 C_2 A 313
## 9 3 C_3 A 475
## 10 4 C_1 A 292
## 11 4 C_2 B 352
## 12 4 C_3 B 378
Hint: Many roads lead to Rome. One road leading
to the current Rome is to tidy up messy_data in two steps.
Create a tidy data set for the choice data (using some combination of
separate, a pivoting function and possibly
select), and another one for the reaction time data (using
basically the same chain of operations). You would then use a joining
operation, e.g., full_join, possibly followed by massaging
the output one more time with select. Careful: make sure
that the column RT in the final output is of type
numeric (integer or double does not matter).
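As a sketch of another road to Rome, both string columns can be split in parallel with separate_rows from tidyr, so no join is needed. This assumes every row has the same number of entries in choices and reaction_times. The name tidy_alt is made up for this illustration.

```r
library(tibble)
library(dplyr)
library(tidyr)
library(stringr)

messy_data <- tribble(
  ~subject_id, ~choices, ~reaction_times,
  1, "A,B,B", "312 433 365",
  2, "B,A,B", "393 491 327",
  3, "B,A,A", "356 313 475",
  4, "A,B,B", "292 352 378"
)

tidy_alt <- messy_data %>%
  # split both columns at once; "[, ]" matches a comma or a single space
  separate_rows(choices, reaction_times, sep = "[, ]", convert = TRUE) %>%
  # number the trials within each participant to recover the condition label
  group_by(subject_id) %>%
  mutate(condition = str_c("C_", row_number())) %>%
  ungroup() %>%
  select(subject_id, condition, response = choices, RT = reaction_times)

tidy_alt
```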
Use the final tidy representation of the messy_data from
the previous exercise, stored in a variable tidy_data. If
you have not managed to produce this representation with tools from the
tidyverse, you can write the desired tibble by hand (without loss of
points for this exercise). Produce a summary table of mean reaction
times per condition, using the tools from the tidyverse. Take a look at
the html file with instructions to see what your output should look
like:
## # A tibble: 3 × 2
## condition mean_RT
## <chr> <dbl>
## 1 C_1 338.
## 2 C_2 397.
## 3 C_3 386.
Now produce a table giving the mean reaction times for each
participant. But make sure that, in this case, the mean reaction times
are rounded to full integers. (Hint: you can use
mutate in a final step or round inside of a call to
summarise). The output should look like this:
## # A tibble: 4 × 2
## subject_id mean_RT
## <dbl> <dbl>
## 1 1 370
## 2 2 404
## 3 3 381
## 4 4 341
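Both summary tables above can be produced with group_by followed by summarise. The sketch below writes tidy_data out by hand so that it runs on its own; the names mean_RT_by_condition and mean_RT_by_subject are made up for illustration.

```r
library(tibble)
library(dplyr)

# tidy_data written out by hand, matching the tidy table from the previous exercise
tidy_data <- tibble(
  subject_id = rep(c(1, 2, 3, 4), each = 3),
  condition  = rep(c("C_1", "C_2", "C_3"), times = 4),
  response   = c("A", "B", "B", "B", "A", "B", "B", "A", "A", "A", "B", "B"),
  RT         = c(312, 433, 365, 393, 491, 327, 356, 313, 475, 292, 352, 378)
)

# mean reaction time per condition
mean_RT_by_condition <- tidy_data %>%
  group_by(condition) %>%
  summarise(mean_RT = mean(RT))

# mean reaction time per participant, rounded inside summarise
mean_RT_by_subject <- tidy_data %>%
  group_by(subject_id) %>%
  summarise(mean_RT = round(mean(RT)))

mean_RT_by_condition
mean_RT_by_subject
```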
Consider this vector of weirdly specified reaction times (similar to Exercise 2.14 from the web-book).
Starting with that vector use a chain of pipes to:
numeric,
Hint: Check the use of regular expressions on the cheat
sheet of the stringr package.
weird_RTs %>%
str_extract("[:digit:]+") %>%
as.numeric() %>%
(function(x) {x[x>100]}) %>%
log %>%
mean %>%
signif(3)
## [1] 5.89
We will work with the King of France experiment. For a detailed description of the theoretical background and the procedure, see Appendix D.4 of your course book.
Here is a condensed description of the materials. The data set consists of five vignettes:
Each vignette consists of five critical conditions. The following five sentences are examples of the critical conditions for the first vignette.
Additionally, for each vignette there exists a background check. This sentence is intended to find out whether participants know whether the relevant presuppositions are true. The five background checks are:
Finally, there are also 110 filler sentences, which do not have a presupposition, but also require common world knowledge for a correct answer. We will also use the filler sentences as controls, because there is a “correct” answer to each of these.
Look into the procedure described in the Appendix D.4 of your script and answer the following questions:
Solution:
It’s a one-factor factorial design. The factor is ‘condition’ and it has five levels: C0, C1, C6, … Vignette is NOT a factor. It is item-level variation to make sure that we are not too repetitive. This shows in the absence of any theoretically motivated hypothesis that hinges on a contrast or comparison among vignettes. We treat all vignettes equally for all current purposes (until we engage in hierarchical modeling).
Solution:
The experiment is a within-subjects design: every participant gives one observation for each design cell (each condition).
Solution:
Advantage: fewer participants needed
Disadvantage: possible cross-contamination between conditions
Solution:
No, since every participant only gives exactly one data point for each design cell.
Solution:
Dependent variable: “response”
Variable type: binary
Load the King of France data:
data_KoF_raw_IDA
contain? (Hint: use the nrow
function!)
## [1] 2813
pull,
unique and length.)
## [1] 97
age
(Hint: As soon as a vector contains missing data (an
entry NA), its mean is NA as well. Try
removing the missing values when calculating the mean, e.g., by checking
the documentation of the function mean for anything
helpful.)
## [1] 32.36842
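The three hints above can be tried out on a small stand-in data set. The sketch below uses a made-up tibble called toy_data (it is not data_KoF_raw_IDA, and its values are invented) and assumes that submission_id identifies participants.

```r
library(tibble)
library(dplyr)

# Toy stand-in with made-up values, two rows per participant
toy_data <- tibble(
  submission_id = c(1, 1, 2, 2, 3, 3),
  age           = c(25, 25, NA, NA, 40, 40)
)

# number of rows
nrow(toy_data)

# number of distinct participants
toy_data %>%
  pull(submission_id) %>%
  unique() %>%
  length()

# NA, because age contains missing values
mean(toy_data$age)

# missing values are dropped before averaging
mean(toy_data$age, na.rm = TRUE)
```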
submission_id: nominal
RT: metric
correct: binary
education: ordinal
item_version: nominal
question: nominal
response: binary
timeSpent: metric
trial_name: nominal
trial_number: metric
trial_type: nominal
vignette: nominal