Welcome to the PSYC3361 coding W3 self test. The test assesses your ability to use the coding skills covered in the Week 3 online coding modules.

It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.

Your notes should also document the troubleshooting process you went through to arrive at the code that worked.

For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.

load the packages you will need

I am going to use the tidyverse package, which contains both the ggplot2 and dplyr packages, along with here (which is useful for telling R where the data is). I am also including the janitor package, because the tabyl() function is useful for counting things.

library(tidyverse)
library(here)
library(janitor)

read the data

The data is in .csv format so I am going to use the read_csv() function. This call tells R to find the data “here” within the data folder and to make a new object called alone.

alone <- read_csv(here("data", "alone.csv"))

1. make a smaller dataset

We are mostly interested in gender, age, the days they lasted and whether contestants were medically evacuated. Use select() to make a smaller dataframe containing just the relevant variables. Rename the variable called medically_evacuated to make it shorter and easier to type.

Here I am overwriting the alone data with a new smaller dataframe that uses select() to pull just the relevant variables. I am also renaming the medically_evacuated variable to something shorter using the rename() function.

alone <- alone %>%
  select(season, name, age, gender, days_lasted, result, medically_evacuated) %>%
  rename(medi_vac = medically_evacuated)

2. write code to determine how old the oldest male and female contestant are

Here I am using arrange() with desc() to sort the dataframe from oldest to youngest. The two oldest contestants happen to be one man and one woman, Pete (61) and Karie (57), so I can use the slice() function to select just the first and second rows.

alone %>%
  arrange(desc(age)) %>%
  slice(1:2)

## # A tibble: 2 × 7
##   season name              age gender days_lasted result medi_vac
##    <dbl> <chr>           <dbl> <chr>        <dbl>  <dbl> <lgl>   
## 1      4 Pete Brockdorff    61 Male            74      2 FALSE   
## 2      9 Karie Lee Knoke    57 Female          75      2 FALSE
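
The arrange() and slice(1:2) approach only answers the question because the two oldest contestants happen to be of different genders. A minimal sketch of a more general approach is to group by gender and keep the oldest row in each group with slice_max(). The toy dataframe below is hypothetical (made-up names and ages), not the real alone data.

```r
library(dplyr)

# Hypothetical toy data: note the two oldest people are both male,
# so sorting and taking the first two rows would miss the oldest woman
toy <- tibble(
  name = c("A", "B", "C", "D", "E"),
  gender = c("Male", "Male", "Male", "Female", "Female"),
  age = c(61, 58, 45, 57, 33)
)

# Keep the row with the highest age within each gender
oldest <- toy %>%
  group_by(gender) %>%
  slice_max(age, n = 1) %>%
  ungroup()

oldest
```

By default slice_max() keeps ties, so if two contestants of the same gender shared the top age, both rows would be returned.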

3. has the average length of time that alone contestants lasted changed over seasons?

I am piping the alone dataframe into a group_by so that I can get M, SD, n, and stderror separately for each season. It definitely looks like time in the competition has increased over time.

time <- alone %>%
  group_by(season) %>%
  summarise(mean_time = mean(days_lasted), 
            sd_time = sd(days_lasted), 
            n = n(), 
            stderr = sd_time/sqrt(n))

time

## # A tibble: 9 × 5
##   season mean_time sd_time     n stderr
##    <dbl>     <dbl>   <dbl> <int>  <dbl>
## 1      1      21.6    23.6    10   7.45
## 2      2      34.4    25.0    10   7.89
## 3      3      54.3    30.9    10   9.76
## 4      4      31.4    32.4    14   8.65
## 5      5      30.1    19.4    10   6.14
## 6      6      45.4    28.0    10   8.86
## 7      7      49.9    31.6    10   9.99
## 8      8      41.2    26.6    10   8.40
## 9      9      46.1    21.6    10   6.84
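
As a quick check on the stderr column, the standard error of the mean is the standard deviation divided by the square root of n. For season 1, dividing the printed sd of 23.6 by sqrt(10) gives roughly 7.46, which is close to the 7.45 in the table; the small gap is presumably because the printed sd is rounded. Here is a small sketch of the same calculation in base R, using hypothetical days_lasted values rather than the real data.

```r
# Hypothetical days-lasted values for one made-up season
days <- c(2, 10, 25, 40, 56)

n <- length(days)

# Standard error of the mean: sd divided by the square root of n
se_manual <- sd(days) / sqrt(n)

# Equivalent form: square root of (variance / n)
se_check <- sqrt(var(days) / n)

all.equal(se_manual, se_check)
```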

HINT: can you make a line graph that has error bars around the mean for each season?

I need to convert the season variable within the time dataframe to a factor so that the x axis lists each season separately. Then I am plotting season on the x axis and mean_time on the y axis. I had to define group = "season" to make the lines connect the dots. I have added error bars using geom_errorbar(), which has its own aesthetics; the width controls the length of the lines at the top and bottom of each bar. I've added labels and used theme_minimal() to get rid of the ugly grey.

Season 3 seems like a bit of an anomaly, but generally the amount of time that contestants last has increased gradually over seasons.

time$season <- as.factor(time$season)

time %>%
  ggplot(aes(x = season, y = mean_time, group = "season")) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = mean_time - stderr, ymax = mean_time + stderr), width = 0.2) +
  labs(title = "Mean number of days Alone contestants survive in each season", y = "Mean number of days", x = "Season") +
  theme_minimal()

4. do women on average last longer in the game than men? Are men more likely to leave early?

Here I am grouping by gender and then calculating the mean length of time that contestants last on the show. Not surprisingly, female contestants last on average more than 10 days longer than the male contestants.

alone %>%
  group_by(gender) %>%
  summarise(mean = mean(days_lasted))

## # A tibble: 2 × 2
##   gender  mean
##   <chr>  <dbl>
## 1 Female  49.4
## 2 Male    36.2

HINT: can you make a plot that captures the median and distribution of days survived, by gender?

A boxplot is a good option for capturing the median of days lasted, and here I can use geom_jitter() to add the individual points on top. I have made the points a little transparent with alpha = 0.5.

It looks like many more male contestants drop out in the first few days of the competition.

alone %>%
  ggplot(aes(x = gender, y = days_lasted, fill = gender)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
  theme_minimal() +
  labs(title = "The distribution of days lasted on Alone as a function of gender", y = "Number of days lasted", x = "Gender")

5. do older contestants last longer?

HINT: Use case_when to create a new variable that groups participants by age in decades

Here I am using mutate() to create a new variable called age_group and using case_when() to define the range of ages that should be coded as teen, twenties, thirties, and so on.

I am using tabyl() to check that the new age_group variable is working as expected

alone <- alone %>%
  mutate(age_group = case_when(age < 20 ~ "teen",
                               age >= 20 & age < 30 ~ "twenties",
                               age >= 30 & age < 40 ~ "thirties",
                               age >= 40 & age < 50 ~ "forties",
                               age >= 50 & age < 60 ~ "fifties",
                               age >= 60 & age < 70 ~ "sixties"))

alone %>%
  tabyl(age_group)

##  age_group  n    percent
##    fifties  6 0.06382979
##    forties 37 0.39361702
##    sixties  1 0.01063830
##       teen  2 0.02127660
##   thirties 36 0.38297872
##   twenties 12 0.12765957
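
The tabyl() output lists the age groups alphabetically because age_group is a character variable. One possible way to get the groups in age order is to turn age_group into a factor with explicit levels. This is a sketch on hypothetical ages, not the real data; the decades vector is a name I have made up for illustration. Because case_when() uses the first condition that matches, the lower bounds can be left out here.

```r
library(dplyr)

# Hypothetical ages, not the real alone data
toy <- tibble(age = c(19, 24, 35, 38, 41, 52, 61))

# The order I want the groups to appear in tables and plots
decades <- c("teen", "twenties", "thirties", "forties", "fifties", "sixties")

toy <- toy %>%
  mutate(
    # Conditions are checked top to bottom; the first TRUE wins
    age_group = case_when(
      age < 20 ~ "teen",
      age < 30 ~ "twenties",
      age < 40 ~ "thirties",
      age < 50 ~ "forties",
      age < 60 ~ "fifties",
      age < 70 ~ "sixties"
    ),
    # Explicit levels so the groups follow age order, not alphabetical order
    age_group = factor(age_group, levels = decades)
  )

counts <- toy %>%
  count(age_group)

counts
```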

HINT: what is the mean length of time in the game for each age group? How many participants fall into each group?

OK I am grouping by age group and then summarising both the M length of time and the number of participants falling into each category. Good to see the numbers line up with the tabyl() output above.

alone %>%
  group_by(age_group) %>%
  summarise(Mtime = mean(days_lasted), 
            n = n())

## # A tibble: 6 × 3
##   age_group Mtime     n
##   <chr>     <dbl> <int>
## 1 fifties    42.2     6
## 2 forties    37.1    37
## 3 sixties    74       1
## 4 teen        1.5     2
## 5 thirties   41.4    36
## 6 twenties   39.7    12

6. Are contestants who are medically evacuated, on average, older than those who pull out themselves? Does that differ by gender?

HINT: filter the dataset to keep only those contestants who didn’t win, then calculate the mean age, separately for those who were medically evacuated vs. not.

I am filtering the dataset to drop the participants who won (i.e. result == 1) and then grouping by both medi_vac and gender to get the mean age of each group. To make the plot labels work, I needed to make the medi_vac variable a factor and change the order of TRUE / FALSE using fct_relevel() (the default order is alphabetical, so FALSE comes first).

med_gender <- alone %>% 
  filter(result > 1) %>%
  group_by(medi_vac, gender) %>%
  summarise(mean_age = mean(age))

## `summarise()` has grouped output by 'medi_vac'. You can override using the
## `.groups` argument.
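
The message above is just dplyr letting me know that the summarised output is still grouped by medi_vac. A minimal sketch of one way to silence it is to set the .groups argument explicitly; here I use .groups = "drop" on a hypothetical toy dataframe so the result comes back ungrouped.

```r
library(dplyr)

# Hypothetical toy data standing in for the filtered alone dataframe
toy <- tibble(
  medi_vac = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  gender = c("Male", "Female", "Male", "Female", "Female"),
  age = c(30, 40, 50, 36, 44)
)

# .groups = "drop" removes all grouping from the summarised output
summary_tbl <- toy %>%
  group_by(medi_vac, gender) %>%
  summarise(mean_age = mean(age), .groups = "drop")

# Check whether the result is still a grouped dataframe
is_grouped_df(summary_tbl)
```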

med_gender$medi_vac <- as.factor(med_gender$medi_vac)

med_gender$medi_vac <- fct_relevel(med_gender$medi_vac, c("TRUE", "FALSE"))
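
To see what the releveling step is doing, here is a small sketch on a hypothetical logical vector (not the real medi_vac column). Converting a logical to a factor puts FALSE first, and fct_relevel() from the forcats package moves the named levels to the front.

```r
library(forcats)

# Hypothetical logical vector standing in for medi_vac
evacuated <- c(TRUE, FALSE, FALSE, TRUE)

# Default level order after conversion is FALSE then TRUE
f <- as.factor(evacuated)
levels(f)

# Move TRUE to the front so it is plotted and labelled first
f2 <- fct_relevel(f, "TRUE", "FALSE")
levels(f2)
```

Only the order of the levels changes; the values stored in the factor stay the same.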

HINT: make a column graph of the data you just summarised

Here I am making a column graph of the mean age of male and female contestants, depending on whether or not they were medically evacuated. I need to use position = "dodge" to make the bars sit next to each other.

It seems like participants who were medically evacuated were on average YOUNGER than those who tapped out of the competition; this pattern is particularly the case for female contestants.

med_gender %>% 
  ggplot(aes(x = gender, y = mean_age, fill = medi_vac)) +
  geom_col(position = "dodge") +
  theme_minimal() +
  labs(title = "Mean age of contestants medically evacuated vs not, as a function of gender", y = "Mean Age", x = "Gender")

w3 self test answer

model answer

Jen Richmond

2023-05-10