Welcome to the PSYC3361 coding W3 self test. The test assesses your ability to use the coding skills covered in the Week 3 online coding modules.
In particular, it assesses your ability to…
It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.
Your notes should also document the troubleshooting process you went through to arrive at the code that worked.
For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.
Good luck!!
Jenny
PS: if you get stuck, have a look in the /images folder for inspiration.
I am loading the tidyverse and here packages. ggplot2 is already part of the tidyverse, so loading it separately is redundant but harmless.
library(tidyverse)
library(here)
library(ggplot2)
alone <- read.csv("data/alone.csv")
We are mostly interested in gender, age, the days they lasted and whether contestants were medically evacuated. Use select() to make a smaller data frame containing just the relevant variables. Rename the variable called medically_evacuated to make it shorter and easier to type.
medically_evacuated <- alone %>%
select(gender, age, days_lasted, medically_evacuated)
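The code above keeps the four variables but does not rename medically_evacuated. A minimal sketch of how select() can rename a column while selecting it, assuming a small made-up data frame (the toy values and the new name med_evac are illustrative assumptions, not part of the original answer):

```r
library(dplyr)

# Toy stand-in for the alone data (values are made up for illustration)
toy <- data.frame(
  gender = c("Male", "Female"),
  age = c(40, 35),
  days_lasted = c(56, 80),
  medically_evacuated = c(FALSE, TRUE),
  season = c(1, 2)
)

# select() keeps only the listed columns;
# new_name = old_name renames a column in the same step
toy_small <- toy %>%
  select(gender, age, days_lasted, med_evac = medically_evacuated)

names(toy_small)
```

Any later code that refers to the column would then need to use the new name.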
Use the arrange function to sort age in descending order. The oldest male is 61 years old, and the oldest female is 57 years old.
medically_evacuated %>%
arrange(desc(age))
## gender age days_lasted medically_evacuated
## 1 Male 61 74 FALSE
## 2 Female 57 75 FALSE
## 3 Male 55 21 FALSE
## 4 Male 55 4 TRUE
## 5 Male 53 51 FALSE
## 6 Male 50 66 FALSE
## 7 Male 50 36 FALSE
## 8 Male 49 73 TRUE
## 9 Female 49 46 TRUE
## 10 Male 48 2 FALSE
## 11 Male 48 6 FALSE
## 12 Female 47 9 TRUE
## 13 Male 47 100 FALSE
## 14 Male 47 24 FALSE
## 15 Male 46 4 FALSE
## 16 Male 46 41 FALSE
## 17 Female 46 21 FALSE
## 18 Male 46 27 TRUE
## 19 Male 45 59 FALSE
## 20 Female 45 57 FALSE
## 21 Female 45 49 FALSE
## 22 Female 45 28 FALSE
## 23 Male 45 22 TRUE
## 24 Male 44 6 FALSE
## 25 Male 44 64 FALSE
## 26 Female 44 8 FALSE
## 27 Male 44 14 FALSE
## 28 Male 44 5 TRUE
## 29 Female 44 52 TRUE
## 30 Male 43 19 FALSE
## 31 Male 43 10 FALSE
## 32 Female 43 37 TRUE
## 33 Male 43 19 FALSE
## 34 Female 42 73 FALSE
## 35 Male 42 22 FALSE
## 36 Male 41 1 FALSE
## 37 Female 41 78 FALSE
## 38 Male 41 56 FALSE
## 39 Male 40 56 FALSE
## 40 Male 40 35 FALSE
## 41 Male 40 49 FALSE
## 42 Male 40 58 FALSE
## 43 Male 40 74 FALSE
## 44 Female 40 69 TRUE
## 45 Male 39 72 FALSE
## 46 Male 39 69 TRUE
## 47 Male 39 20 FALSE
## 48 Male 38 8 TRUE
## 49 Male 37 8 FALSE
## 50 Male 37 6 FALSE
## 51 Male 37 2 FALSE
## 52 Female 36 7 TRUE
## 53 Male 36 87 FALSE
## 54 Male 36 32 FALSE
## 55 Male 36 67 TRUE
## 56 Male 36 52 FALSE
## 57 Male 35 35 FALSE
## 58 Male 35 75 FALSE
## 59 Male 35 77 FALSE
## 60 Male 35 43 FALSE
## 61 Male 34 43 FALSE
## 62 Male 34 51 FALSE
## 63 Male 34 40 FALSE
## 64 Male 33 14 FALSE
## 65 Female 33 80 FALSE
## 66 Male 33 44 FALSE
## 67 Male 32 39 FALSE
## 68 Male 32 75 FALSE
## 69 Male 32 24 TRUE
## 70 Male 31 0 FALSE
## 71 Male 31 5 TRUE
## 72 Male 31 35 FALSE
## 73 Female 31 48 TRUE
## 74 Female 31 89 TRUE
## 75 Male 31 44 FALSE
## 76 Male 31 63 FALSE
## 77 Female 30 5 TRUE
## 78 Male 30 12 TRUE
## 79 Male 30 78 FALSE
## 80 Male 30 42 TRUE
## 81 Male 29 73 FALSE
## 82 Male 28 21 FALSE
## 83 Female 28 86 TRUE
## 84 Female 27 72 FALSE
## 85 Male 26 74 FALSE
## 86 Male 24 4 FALSE
## 87 Male 24 60 FALSE
## 88 Male 24 7 FALSE
## 89 Male 23 1 TRUE
## 90 Male 23 15 FALSE
## 91 Male 22 55 FALSE
## 92 Male 22 8 TRUE
## 93 Male 19 2 FALSE
## 94 Male 19 1 TRUE
The goal is to find the mean amount of time that contestants lasted per season. I have created a new data frame from the alone data set by piping it into group_by() to group the rows by season, and then using summarise() to calculate the mean, SD, n and standard error of days lasted for each season.
days_lasted <- alone %>%
group_by(season) %>%
summarise(
mean_days = mean(days_lasted),
sd_days = sd(days_lasted),
n = n(),
stderr = sd_days/sqrt(n))
print(days_lasted)
## # A tibble: 9 × 5
## season mean_days sd_days n stderr
## <int> <dbl> <dbl> <int> <dbl>
## 1 1 21.6 23.6 10 7.45
## 2 2 34.4 25.0 10 7.89
## 3 3 54.3 30.9 10 9.76
## 4 4 31.4 32.4 14 8.65
## 5 5 30.1 19.4 10 6.14
## 6 6 45.4 28.0 10 8.86
## 7 7 49.9 31.6 10 9.99
## 8 8 41.2 26.6 10 8.40
## 9 9 46.1 21.6 10 6.84
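The stderr column is the standard deviation divided by the square root of the group size. A minimal sketch of that calculation on its own, assuming a hypothetical helper called se() and made-up values rather than the alone data:

```r
# Standard error of the mean for a numeric vector: sd / sqrt(n)
# (se is a hypothetical helper name, not a base R function)
se <- function(x) sd(x) / sqrt(length(x))

# Made-up values, not taken from the alone data
days <- c(10, 20, 30, 40)
se(days)
```

Wrapping the formula in a function is only a convenience; computing sd_days / sqrt(n) inside summarise(), as above, gives the same result per group.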
HINT: can you make a line graph that has error bars around the mean for each season?
# creating a line graph of mean days per season with geom_line, adding points and error bars of +/- 1 standard error
ggplot(days_lasted) +
geom_line(aes(
x = season,
y = mean_days
)) +
geom_point(aes(
x = season,
y = mean_days
)) +
geom_errorbar(aes(
x = season,
ymin = mean_days - stderr,
ymax = mean_days + stderr),
width = 0.2,
colour = "blue"
) +
labs(
title = "Mean number of days Alone contestants survive in each season",
y = "Mean number of days",
x = "Season") +
theme_light() +
scale_x_continuous(breaks = 1:9)
I am piping from the alone data frame and grouping by gender to find the mean days lasted for males and females.
alone %>%
group_by(gender) %>%
summarise(mean_days = mean(days_lasted))
## # A tibble: 2 × 2
## gender mean_days
## <chr> <dbl>
## 1 Female 49.4
## 2 Male 36.2
HINT: can you make a plot that captures the median and distribution of days survived, by gender?
gender_days_lasted <- medically_evacuated %>%
select(gender, days_lasted)
ggplot(gender_days_lasted) +
geom_boxplot(mapping = aes(
x = gender,
y = days_lasted,
fill = gender
)) +
geom_jitter(aes(
x = gender,
y = days_lasted),
width = 0.1,
alpha = 0.5,
size = 2
) +
theme_light() +
labs(
title = "Distribution of Days Survived by Gender",
y = "Number of days lasted",
x = "Gender")
HINT: Use case_when to create a new variable that groups participants by age in decades
alone <- alone %>%
mutate(decade = case_when(
age < 20 ~ "teenager",
age >= 20 & age < 30 ~ "twenties",
age >= 30 & age < 40 ~ "thirties",
age >= 40 & age < 50 ~ "forties",
age >= 50 & age < 60 ~ "fifties",
age >= 60 & age < 70 ~ "sixties"
))
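As an alternative to listing every range in case_when, the decade can be derived arithmetically. This is a sketch under assumptions: the toy ages are made up, and the labels ("10s", "20s", ...) differ from the word labels used above.

```r
library(dplyr)

# Made-up ages for illustration
toy <- data.frame(age = c(19, 24, 30, 47, 61))

# Integer division by 10 drops the last digit; multiplying by 10 gives the decade,
# and paste0() turns it into a text label
toy <- toy %>%
  mutate(decade = paste0(age %/% 10 * 10, "s"))

toy$decade
```

This scales to any age without adding new conditions, at the cost of less readable labels than case_when allows.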
HINT: what is the mean length of time in the game for each age group? How many participants fall into each group?
alone %>%
group_by(decade) %>%
summarise(m_time = mean(days_lasted), n = n())
## # A tibble: 6 × 3
## decade m_time n
## <chr> <dbl> <int>
## 1 fifties 42.2 6
## 2 forties 37.1 37
## 3 sixties 74 1
## 4 teenager 1.5 2
## 5 thirties 41.4 36
## 6 twenties 39.7 12
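The groups above are listed alphabetically because decade is a character variable. A minimal sketch of one way to get them in age order, assuming made-up data and a hypothetical decade_levels vector: converting decade to a factor with explicit levels before grouping.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  decade = c("twenties", "teenager", "thirties", "twenties"),
  days_lasted = c(10, 2, 30, 20)
)

# Desired display order, youngest to oldest
decade_levels <- c("teenager", "twenties", "thirties",
                   "forties", "fifties", "sixties")

# A factor sorts by its levels rather than alphabetically
toy_summary <- toy %>%
  mutate(decade = factor(decade, levels = decade_levels)) %>%
  group_by(decade) %>%
  summarise(m_time = mean(days_lasted), n = n())

toy_summary
```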
HINT: filter the dataset to keep only those contestants who didn’t win, then calculate the mean age, separately for those who were medically evacuated vs. not.
Among contestants who did not win, those who were medically evacuated were on average about 2 years younger than those who were not (36.7 vs. 38.6 years). The size of this gap differs by gender: it is about 4.6 years among females (37.9 vs. 42.5 years) and about 1.9 years among males (35.9 vs. 37.8 years). Females were older than males on average in both groups.
# Mean age of non-winners (result > 1), by medical evacuation status
med_evac <- alone %>%
filter(result > 1) %>%
group_by(medically_evacuated) %>%
summarise(mean_age = mean(age))
print(med_evac)
## # A tibble: 2 × 2
## medically_evacuated mean_age
## <lgl> <dbl>
## 1 FALSE 38.6
## 2 TRUE 36.7
# Mean age of non-winners (result > 1), by medical evacuation status and gender
med_gender <- alone %>%
filter(result > 1) %>%
group_by(medically_evacuated, gender) %>%
summarise(mean_age = mean(age))
## `summarise()` has grouped output by 'medically_evacuated'. You can override
## using the `.groups` argument.
print(med_gender)
## # A tibble: 4 × 3
## # Groups: medically_evacuated [2]
## medically_evacuated gender mean_age
## <lgl> <chr> <dbl>
## 1 FALSE Female 42.5
## 2 FALSE Male 37.8
## 3 TRUE Female 37.9
## 4 TRUE Male 35.9
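The message printed by summarise() above is informational: the result is still grouped by medically_evacuated. A minimal sketch, using made-up data, of how the .groups argument mentioned in that message can return an ungrouped result and silence the message:

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  medically_evacuated = c(TRUE, TRUE, FALSE, FALSE),
  gender = c("Female", "Male", "Female", "Male"),
  age = c(30, 40, 50, 60)
)

# .groups = "drop" removes all grouping from the summarised result
toy_summary <- toy %>%
  group_by(medically_evacuated, gender) %>%
  summarise(mean_age = mean(age), .groups = "drop")

# Check whether any grouping remains
is_grouped_df(toy_summary)
```

Dropping the groups does not change the summarised values; it only affects how later verbs treat the data frame.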
HINT: make a column graph of the data you just summarised
ggplot(med_gender) +
geom_col(aes(
x = gender,
y = mean_age,
fill = medically_evacuated),
position = "dodge"
) +
theme_light() +
labs(
title = "Mean Age of Non-Winning Contestants by Gender and Medical Evacuation",
y = "Mean age",
x = "Gender") + # adding appropriate labels to graph and axes
scale_fill_discrete(name = "Medically Evacuated") # renaming legend title to remove _ symbol