Welcome to the PSYC3361 coding W3 self test. The test assesses your ability to use the coding skills covered in the Week 3 online coding modules.

In particular, it assesses your ability to…

It is IMPORTANT to document the code that you write so that someone who is looking at your code can understand what it is doing. Above each chunk, write a few sentences outlining which packages/functions you have chosen to use and what the function is doing to your data. Where relevant, also write a sentence that interprets the output of your code.

Your notes should also document the troubleshooting process you went through to arrive at the code that worked.

For each of the challenges below, the documentation is JUST AS IMPORTANT as the code.

Good luck!! Jenny

PS- if you get stuck have a look in the /images folder for inspiration

load the packages you will need

Loading tidyverse, which bundles together dplyr (for select/filter/group_by/summarise/mutate), readr (for read_csv), and ggplot2 (for plots) - everything needed for this self test in one package.

library(tidyverse)

read the Alone data

Using read_csv() from readr to import the alone.csv file into a data frame called alone.

alone <- read_csv("data/alone.csv")

1. make a smaller dataset

We are mostly interested in gender, age, the days they lasted and whether contestants were medically evacuated. Use select() to make a smaller dataframe containing just the relevant variables. Rename the variable called medically_evacuated to make it shorter and easier to type.

Using select() to keep only the four variables I need (gender, age, days_lasted, medically_evacuated), dropping everything else. Then piping into rename() to shorten medically_evacuated to medic_evac so it’s quicker to type in later code. New name goes on the left, old name on the right.

alone_small <- alone %>%
  select(gender, age, days_lasted, medically_evacuated) %>%
  rename(medic_evac = medically_evacuated)
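
As a small sketch of the select()/rename() step, here is the same pipeline on a made-up toy data frame (the toy values and the names toy, toy_small and toy_small2 are hypothetical, not from alone.csv). It also shows that select() can rename while selecting, which should give the same result in one step.

```r
library(dplyr)

# Hypothetical toy data standing in for alone.csv (values are made up)
toy <- tibble(
  name = c("A", "B", "C"),
  gender = c("Female", "Male", "Male"),
  age = c(34, 45, 29),
  days_lasted = c(40, 12, 56),
  medically_evacuated = c(FALSE, TRUE, FALSE)
)

# Two-step version: select, then rename (new name = old name)
toy_small <- toy %>%
  select(gender, age, days_lasted, medically_evacuated) %>%
  rename(medic_evac = medically_evacuated)

# One-step version: select() accepts new_name = old_name as well
toy_small2 <- toy %>%
  select(gender, age, days_lasted, medic_evac = medically_evacuated)

names(toy_small)
```

Either version works; keeping rename() as a separate step just makes the renaming easier to spot when reading the code back.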

2. write code to determine how old the oldest male and female contestant are

Using group_by(gender) to split the data into male/female groups, then summarise() to collapse each group down to one number: the maximum age in that group, via max(age).

Interpretation: the output gives the oldest age for each gender. The oldest male contestant is 61 and the oldest female contestant is 57, so the oldest contestant overall is male.

alone_small %>%
  group_by(gender) %>%
  summarise(oldest = max(age))
## # A tibble: 2 × 2
##   gender oldest
##   <chr>   <dbl>
## 1 Female     57
## 2 Male       61
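
One thing worth checking with max() is missing data. The sketch below uses a made-up toy data frame (hypothetical values) to show that max(age) returns NA for a group containing a missing age unless na.rm = TRUE is set. The real output above has no NA values, so this does not appear to be a problem for the Alone data.

```r
library(dplyr)

# Hypothetical toy data with one missing age (values are made up)
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  age = c(31, NA, 44, 52)
)

toy_oldest <- toy %>%
  group_by(gender) %>%
  summarise(
    oldest = max(age),                    # NA if any age in the group is missing
    oldest_no_na = max(age, na.rm = TRUE) # ignores missing ages
  )

toy_oldest
```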

3. has the average length of time that alone contestants lasted changed over seasons?

Going back to the full alone dataset here since season wasn’t kept in alone_small. group_by(season) splits by season, summarise() calculates the average days_lasted per season using mean().

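Interpretation: gives one average per season, so I can scan down the table to see if it’s trending up, down, or flat across seasons. Season 1 has the lowest mean (21.6 days) and season 3 the highest (54.3). Seasons 6 to 9 (41.2 to 49.9) all sit above seasons 1, 2, 4 and 5, so contestants in later seasons tend to last longer, but the rise is not steady.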

alone %>%
  group_by(season) %>%
  summarise(mean_days = mean(days_lasted))
## # A tibble: 9 × 2
##   season mean_days
##    <dbl>     <dbl>
## 1      1      21.6
## 2      2      34.4
## 3      3      54.3
## 4      4      31.4
## 5      5      30.1
## 6      6      45.4
## 7      7      49.9
## 8      8      41.2
## 9      9      46.1

HINT: can you make a line graph that has error bars around the mean for each season?

Same group_by/summarise setup as above, but also calculating sd() (standard deviation), n() (sample size per season), and se_days (standard error = sd / sqrt(n)) so I have something to build error bars from.

Then plotting with ggplot(): geom_line() + geom_point() show the trend in mean days lasted across seasons, and geom_errorbar() adds bars stretching from mean - se to mean + se around each point, showing how precise each season’s mean estimate is.

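Interpretation: the bars are plus/minus one standard error, so they are narrower than 95% confidence intervals. If error bars overlap a lot between seasons, differences in the means are probably just noise. Clearly separated bars suggest a more genuine difference between seasons, but this is a rough visual check rather than a formal test.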

alone_summary <- alone %>%
  group_by(season) %>%
  summarise(
    mean_days = mean(days_lasted),
    sd_days = sd(days_lasted),
    n = n(),
    se_days = sd_days / sqrt(n)
  )

ggplot(
  data = alone_summary,
  mapping = aes(x = season, y = mean_days)
) +
  geom_line() +
  geom_point() +
  geom_errorbar(
    mapping = aes(ymin = mean_days - se_days, ymax = mean_days + se_days),
    width = 0.2
  )
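
To make the standard error arithmetic concrete, here is a worked sketch on a made-up vector of days (hypothetical values, base R only). It follows the same sd / sqrt(n) formula used for se_days, and also computes a t-based 95% half-width purely for comparison. That interval is an alternative I am assuming might be of interest, not something the plot above uses.

```r
# Hypothetical days_lasted values for one season (made up)
days <- c(10, 20, 30, 40)

n <- length(days)
mean_days <- mean(days)
sd_days <- sd(days)
se_days <- sd_days / sqrt(n)   # standard error of the mean

# Limits that geom_errorbar() would draw for this season (mean +/- 1 SE)
c(ymin = mean_days - se_days, ymax = mean_days + se_days)

# For comparison only: half-width of a t-based 95% confidence interval
ci_half <- qt(0.975, df = n - 1) * se_days
ci_half
```

With small groups the 95% half-width is noticeably larger than one standard error, which is why SE bars can look more separated than the data really justify.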

4. do women on average last longer in the game than men? Are men more likely to leave early?

group_by(gender) then summarise(): mean_days gives average days_lasted per gender. For “leaving early”, using medic_evac (a TRUE/FALSE column) as a stand-in for getting pulled early - taking mean() of a logical column works because R reads TRUE as 1 and FALSE as 0, so prop_evac ends up being the proportion of each gender that was medically evacuated.

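Interpretation: compare mean_days across genders to answer the first part. Women last longer on average (49.4 vs 36.2 days), so the answer is yes. Compare prop_evac to answer the second (higher proportion = more likely to leave early via medical evac). The proportion is higher for women (0.5) than for men (0.203), so on this measure the answer is no. Medical evacuation is only one way of leaving early, so this is a partial answer.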

alone_small %>%
  group_by(gender) %>%
  summarise(
    mean_days = mean(days_lasted),
    prop_evac = mean(medic_evac)
  )
## # A tibble: 2 × 3
##   gender mean_days prop_evac
##   <chr>      <dbl>     <dbl>
## 1 Female      49.4     0.5  
## 2 Male        36.2     0.203
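
The trick of taking mean() of a TRUE/FALSE column can be checked on a small made-up vector (hypothetical values, base R only):

```r
# Hypothetical evacuation flags for five contestants (made up)
evac <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

as.numeric(evac)  # TRUE becomes 1, FALSE becomes 0
sum(evac)         # number of TRUE values
mean(evac)        # proportion of TRUE values: sum / length
```

Here two of the five values are TRUE, so mean(evac) is 2 / 5.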

HINT: can you make a plot that captures the median and distribution of days survived, by gender?

Using geom_boxplot() since it shows the median (middle line), the spread of the middle 50% of the data (the box), and the overall range/outliers (whiskers and dots) all in one go, split by gender on the x-axis.

Interpretation: compare median lines between the two boxes for typical performance, and compare box/whisker size for variability - taller box or longer whiskers = more spread out results for that gender.

ggplot(
  data = alone_small,
  mapping = aes(x = gender, y = days_lasted)
) +
  geom_boxplot()
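
As a rough sketch of the numbers behind a boxplot, the code below computes the median, quartiles and the usual 1.5 x IQR outlier cut-offs by hand on a made-up vector (hypothetical values, base R only). I am assuming the default geom_boxplot() settings, where points beyond 1.5 x IQR from the box are drawn as dots.

```r
# Hypothetical days_lasted values for one gender (made up)
days <- c(1, 10, 20, 30, 40, 50, 100)

med <- median(days)                       # middle line of the box
q1 <- unname(quantile(days, 0.25))        # bottom edge of the box
q3 <- unname(quantile(days, 0.75))        # top edge of the box
iqr <- q3 - q1                            # height of the box

# Points beyond these fences are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- days[days < lower_fence | days > upper_fence]

c(median = med, q1 = q1, q3 = q3, iqr = iqr)
outliers
```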

5. do older contestants last longer?

HINT: Use case_when to create a new variable that groups participants by age in decades

Using mutate() to add a new age_group column, with case_when() checking each contestant’s age against a series of conditions top-to-bottom and assigning the first matching label, e.g. age < 30 gets labelled “20s”, age < 40 gets “30s”, and so on. The final TRUE ~ “70s+” catches anyone not matched by an earlier condition, which means 70+ but would also include any missing ages. Likewise, anyone under 20 would be labelled “20s” by the first condition.

alone_small <- alone_small %>%
  mutate(
    age_group = case_when(
      age < 30 ~ "20s",
      age < 40 ~ "30s",
      age < 50 ~ "40s",
      age < 60 ~ "50s",
      age < 70 ~ "60s",
      TRUE ~ "70s+"
    )
  )
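
To check how case_when() treats edge cases, here is a sketch on made-up ages (hypothetical values, including one under 20 and one missing). The second version adds an explicit is.na() condition first, which is one possible way to keep missing ages out of the “70s+” bucket. That extra condition is my assumption about what might be wanted, not part of the original task.

```r
library(dplyr)

# Hypothetical ages covering the edge cases (made up)
toy <- tibble(age = c(19, 25, 39, 40, 71, NA))

toy <- toy %>%
  mutate(
    # Same conditions as above: first matching condition wins
    age_group = case_when(
      age < 30 ~ "20s",
      age < 40 ~ "30s",
      age < 50 ~ "40s",
      age < 60 ~ "50s",
      age < 70 ~ "60s",
      TRUE ~ "70s+"
    ),
    # Variant that labels missing ages as NA before the other checks
    age_group_na = case_when(
      is.na(age) ~ NA_character_,
      age < 30 ~ "20s",
      age < 40 ~ "30s",
      age < 50 ~ "40s",
      age < 60 ~ "50s",
      age < 70 ~ "60s",
      TRUE ~ "70s+"
    )
  )

toy
```

In the first column the missing age falls through every comparison and lands in “70s+”; in the second it stays NA.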

HINT: what is the mean length of time in the game for each age group? How many participants fall into each group?

group_by(age_group) splits the data into the decade buckets just created. summarise() then gives the average days_lasted and the count of contestants (n()) in each group.

Interpretation: check whether mean_days rises/falls/stays flat as age_group increases - tells me if older contestants tend to last longer. Also check the n column - groups with very few contestants (e.g. “60s”, which has only one contestant, or “50s” with six) give a less reliable mean than groups with more contestants. No one falls into the “70s+” group, so it does not appear in the output.

alone_small %>%
  group_by(age_group) %>%
  summarise(
    mean_days = mean(days_lasted),
    n = n()
  )
## # A tibble: 5 × 3
##   age_group mean_days     n
##   <chr>         <dbl> <int>
## 1 20s            34.2    14
## 2 30s            41.4    36
## 3 40s            37.1    37
## 4 50s            42.2     6
## 5 60s            74       1

6. Are contestants who are medically evacuated, on average, older than those who pull out themselves? Does that differ by gender?

HINT: filter the dataset to keep only those contestants who didn’t win, then calculate the mean age, separately for those who were medically evacuated vs. not.

First checked names(alone) to find the winner column, then unique(alone$result) to see how it’s coded - turned out result is finishing placement (1-10) per season, so 1 = winner each season.

Using filter(result != 1) to drop the winner from each season (note: this removes one winner per season, not one overall, since placements are season-relative). Then group_by(gender, medically_evacuated) splits the remaining contestants into 4 groups (male/evac, male/not, female/evac, female/not), and summarise(mean_age = mean(age)) gives the average age in each.

alone_evac_summary <- alone %>%
  filter(result != 1) %>%
  group_by(gender, medically_evacuated) %>%
  summarise(mean_age = mean(age))
## `summarise()` has grouped output by 'gender'. You can override using the
## `.groups` argument.
alone_evac_summary
## # A tibble: 4 × 3
## # Groups:   gender [2]
##   gender medically_evacuated mean_age
##   <chr>  <lgl>                  <dbl>
## 1 Female FALSE                   42.5
## 2 Female TRUE                    37.9
## 3 Male   FALSE                   37.8
## 4 Male   TRUE                    35.9
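
The message from summarise() above is informational: after grouping by two variables, the result stays grouped by the first one (gender). A minimal sketch on made-up data (hypothetical values) shows how the .groups argument mentioned in the message can return an ungrouped table and silence it:

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  gender = c("Female", "Female", "Male", "Male", "Male"),
  medically_evacuated = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  age = c(30, 40, 35, 45, 55)
)

toy_summary <- toy %>%
  group_by(gender, medically_evacuated) %>%
  summarise(mean_age = mean(age), .groups = "drop")  # drop all grouping

is_grouped_df(toy_summary)  # FALSE: the result is a plain tibble
toy_summary
```

For plotting with geom_col() the leftover grouping makes no practical difference, so this is mainly about keeping the knitted output tidy.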

HINT: make a column graph of the data you just summarised

Using geom_col() to plot the mean_age values as bars, with fill = gender colouring bars by gender and position = “dodge” placing male/female bars side-by-side within each medically_evacuated category (rather than stacked), so they’re easy to compare directly.

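Interpretation: compare bar heights within each gender across TRUE/FALSE to see if evacuated contestants are older on average. Compare the size of the gap between TRUE/FALSE bars across genders to see if the pattern differs by gender. In the summary table, evacuated contestants are actually younger on average in both genders (Female 37.9 vs 42.5; Male 35.9 vs 37.8), and the gap is larger for women.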

ggplot(
  data = alone_evac_summary,
  mapping = aes(x = medically_evacuated, y = mean_age, fill = gender)
) +
  geom_col(position = "dodge")

7. knit your document to pdf