Welcome to the PSYC3361 coding W3 self test. The test assesses your ability to use the coding skills covered in the Week 3 online coding modules.
In particular, it assesses your ability to…
It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.
Your notes should also document the troubleshooting process you went through to arrive at the code that worked.
For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.
Good luck!!
Jenny
I am going to use the tidyverse package, which includes both ggplot2 and dplyr, along with here (which is useful for telling R where the data is). I am also including the janitor package, because its tabyl() function is useful for counting things.
library(tidyverse)
library(here)
library(janitor)
The data is in .csv format so I am going to use the read_csv() function. This call tells R to find the data “here” within the data folder and to make a new object called alone.
alone <- read_csv(here("data", "alone.csv"))
We are mostly interested in gender, age, the days they lasted and whether contestants were medically evacuated. Use select() to make a smaller dataframe containing just the relevant variables. Rename the variable called medically_evacuated to make it shorter and easier to type.
Here I am overwriting the alone data with a new smaller dataframe that uses select() to pull just the relevant variables. I am also renaming the medically_evacuated variable to something shorter using the rename() function.
alone <- alone %>%
  select(season, name, age, gender, days_lasted, result, medically_evacuated) %>%
  rename(medi_vac = medically_evacuated)
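As a side note, select() can also rename a column while selecting it, which would do the same job in one step. The sketch below shows this on a tiny made-up tibble; `toy`, `toy_small`, and all of the values are invented for illustration and are not from the alone data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for the real alone dataframe
toy <- tibble(
  name = c("A", "B"),
  age = c(30, 45),
  medically_evacuated = c(TRUE, FALSE),
  extra = c(1, 2)
)

# Writing new_name = old_name inside select() keeps the column and renames it
toy_small <- toy %>%
  select(name, age, medi_vac = medically_evacuated)

names(toy_small)
```

Keeping rename() as a separate step, as in the chunk above, is arguably easier to read; this is just a shorter alternative.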
Here I am using the arrange() function with desc() to sort the dataframe from oldest to youngest. The two oldest contestants, Pete (61) and Karie (57), also happen to be the oldest male and the oldest female, so I can use the slice() function to select just the first and second observations.
alone %>%
  arrange(desc(age)) %>%
  slice(1:2)
## # A tibble: 2 × 7
##   season name              age gender days_lasted result medi_vac
##    <dbl> <chr>           <dbl> <chr>        <dbl>  <dbl> <lgl>
## 1      4 Pete Brockdorff    61 Male            74      2 FALSE
## 2      9 Karie Lee Knoke    57 Female          75      2 FALSE
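Taking the first two rows after sorting only works here because the two oldest contestants happen to be one man and one woman. A more general approach is to group by gender and keep the oldest row in each group with slice_max(). Here is a minimal sketch on made-up data; `toy`, `oldest`, and the names and ages are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; names and ages are invented for illustration
toy <- tibble(
  name = c("A", "B", "C", "D"),
  gender = c("Male", "Male", "Male", "Female"),
  age = c(61, 59, 40, 57)
)

# Sorting and taking the top two rows would return two male contestants here,
# so group by gender first and keep the oldest row within each group
oldest <- toy %>%
  group_by(gender) %>%
  slice_max(age, n = 1) %>%
  ungroup()

oldest
```

Note that slice_max() keeps ties by default, so it can return more than one row per group if two contestants share the maximum age.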
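I am piping the alone dataframe into group_by() so that I can get the mean, SD, n, and standard error of days lasted separately for each season. Contestants in later seasons generally seem to last longer than those in Season 1, although the trend is not steady (Season 3 has the highest mean).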
time <- alone %>%
  group_by(season) %>%
  summarise(mean_time = mean(days_lasted),
            sd_time = sd(days_lasted),
            n = n(),
            stderr = sd_time / sqrt(n))
time
## # A tibble: 9 × 5
##   season mean_time sd_time     n stderr
##    <dbl>     <dbl>   <dbl> <int>  <dbl>
## 1      1      21.6    23.6    10   7.45
## 2      2      34.4    25.0    10   7.89
## 3      3      54.3    30.9    10   9.76
## 4      4      31.4    32.4    14   8.65
## 5      5      30.1    19.4    10   6.14
## 6      6      45.4    28.0    10   8.86
## 7      7      49.9    31.6    10   9.99
## 8      8      41.2    26.6    10   8.40
## 9      9      46.1    21.6    10   6.84
HINT: can you make a line graph that has error bars around the mean for each season?
I need to convert the season variable within the time dataframe to a factor so that the x axis lists each season separately. Then I am plotting season on the x axis and mean_time on the y axis. Because season is now a factor, I had to set the group aesthetic to a constant so that geom_line() treats all the seasons as a single group and connects the dots; I used group = “season”, although group = 1 is the more common way to write this. I have added error bars using geom_errorbar(), which has its own aesthetics (ymin and ymax, here the mean minus and plus one standard error); the width controls the length of the lines at the top and bottom of each bar. I've added labels and used theme_minimal() to get rid of the ugly grey.
Season 3 seems like a bit of an anomaly, but generally the amount of time that contestants last has increased gradually over time.
time$season <- as.factor(time$season)
time %>%
  ggplot(aes(x = season, y = mean_time, group = "season")) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = mean_time - stderr, ymax = mean_time + stderr), width = 0.2) +
  labs(title = "Mean number of days Alone contestants survive in each season", y = "Mean number of days", x = "Season") +
  theme_minimal()
Here I am grouping by gender and then calculating the mean length of time that contestants last on the show. Not surprisingly, female contestants last on average about 13 days longer than the male contestants (49.4 vs. 36.2 days).
alone %>%
  group_by(gender) %>%
  summarise(mean = mean(days_lasted))
## # A tibble: 2 × 2
##   gender  mean
##   <chr>  <dbl>
## 1 Female  49.4
## 2 Male    36.2
HINT: can you make a plot that captures the median and distribution of days survived, by gender?
A boxplot is the best option for capturing the median of days lasted, and here I can use geom_jitter() to add the individual points on top. I have made the points a little transparent with alpha = 0.5.
It looks like many more male contestants drop out in the first few days of the competition.
alone %>%
  ggplot(aes(x = gender, y = days_lasted, fill = gender)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  theme_minimal() +
  labs(title = "The distribution of days lasted on Alone as a function of gender", y = "Number of days lasted", x = "Gender")
HINT: Use case_when to create a new variable that groups participants by age in decades
Here I am using mutate() to create a new variable called age_group and using case_when() to define the range of ages that should be coded as teen, twenties, thirties, etc.
I am using tabyl() to check that the new age_group variable is working as expected.
alone <- alone %>%
  mutate(age_group = case_when(age < 20 ~ "teen",
                               age >= 20 & age < 30 ~ "twenties",
                               age >= 30 & age < 40 ~ "thirties",
                               age >= 40 & age < 50 ~ "forties",
                               age >= 50 & age < 60 ~ "fifties",
                               age >= 60 & age < 70 ~ "sixties"))
alone %>%
  tabyl(age_group)
##  age_group  n    percent
##    fifties  6 0.06382979
##    forties 37 0.39361702
##    sixties  1 0.01063830
##       teen  2 0.02127660
##   thirties 36 0.38297872
##   twenties 12 0.12765957
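Because age_group is a character variable, the table lists the groups alphabetically rather than in age order. One option is to turn it into a factor with the levels in the order you want. The sketch below uses made-up ages; `toy` and `decades` are invented for illustration. It also relies on case_when() checking its conditions in order and using the first one that matches, so the lower bounds can be left out.

```r
library(dplyr)
library(tibble)

# Hypothetical toy ages, invented for illustration
toy <- tibble(age = c(19, 24, 35, 38, 41, 61))

# The order we want the groups to appear in
decades <- c("teen", "twenties", "thirties", "forties", "fifties", "sixties")

toy <- toy %>%
  mutate(
    # conditions are checked top to bottom; the first TRUE one wins
    age_group = case_when(
      age < 20 ~ "teen",
      age < 30 ~ "twenties",
      age < 40 ~ "thirties",
      age < 50 ~ "forties",
      age < 60 ~ "fifties",
      age < 70 ~ "sixties"
    ),
    # a factor with explicit levels sorts in age order, not alphabetically
    age_group = factor(age_group, levels = decades)
  )

toy %>% count(age_group)
```

Any age that matches none of the conditions (70 or over in this sketch) would be coded as NA, so it is worth checking the counts for missing values.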
HINT: what is the mean length of time in the game for each age group? How many participants fall into each group?
OK I am grouping by age group and then summarising both the M length of time and the number of participants falling into each category. Good to see the numbers line up with the tabyl() output above.
alone %>%
  group_by(age_group) %>%
  summarise(Mtime = mean(days_lasted),
            n = n())
## # A tibble: 6 × 3
##   age_group Mtime     n
##   <chr>     <dbl> <int>
## 1 fifties    42.2     6
## 2 forties    37.1    37
## 3 sixties    74       1
## 4 teen        1.5     2
## 5 thirties   41.4    36
## 6 twenties   39.7    12
HINT: filter the dataset to keep only those contestants who didn’t win, then calculate the mean age, separately for those who were medically evacuated vs. not.
I am filtering the dataset so that we drop the participants who won (i.e. result = 1), then grouping by both medi_vac and gender and calculating the mean age in each group. In order to make the plot labels work, I needed to make the medi_vac variable a factor and change the order of TRUE / FALSE using fct_relevel() (the default order is alphabetical, which puts FALSE first).
med_gender <- alone %>%
  filter(result > 1) %>%
  group_by(medi_vac, gender) %>%
  summarise(mean_age = mean(age))
## `summarise()` has grouped output by 'medi_vac'. You can override using the
## `.groups` argument.
med_gender$medi_vac <- as.factor(med_gender$medi_vac)
med_gender$medi_vac <- fct_relevel(med_gender$medi_vac, c("TRUE", "FALSE"))
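The message printed after summarise() is informational, not an error: when you group by two variables, the result stays grouped by the first one. If you would rather get an ungrouped tibble and silence the message, you can set the .groups argument. Here is a minimal sketch on made-up data; `toy`, `toy_summary`, and the values are invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, invented for illustration
toy <- tibble(
  medi_vac = c(TRUE, TRUE, FALSE, FALSE),
  gender = c("Male", "Female", "Male", "Male"),
  age = c(30, 40, 50, 60)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
toy_summary <- toy %>%
  group_by(medi_vac, gender) %>%
  summarise(mean_age = mean(age), .groups = "drop")

toy_summary
```

Dropping the groups is a matter of preference here; the plot works either way.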
HINT: make a column graph of the data you just summarised
Here I am making a column graph of the mean age of male and female contestants, depending on whether they were medically evacuated or not. I need to use position = “dodge” to make the bars sit next to each other.
It seems like participants who were medically evacuated were on average YOUNGER than those who tapped out of the competition; this pattern is particularly the case for female contestants.
med_gender %>%
  ggplot(aes(x = gender, y = mean_age, fill = medi_vac)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Mean age of contestants medically evacuated vs not, as a function of gender", y = "Mean Age", x = "Gender")