Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df2 |>
group_by(year) |>
summarize(mean(age))
## # A tibble: 3 × 2
## year `mean(age)`
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
Question 4: Using the olympic_gymnasts dataset, group by year and find the mean age for each year; call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |>
group_by(year) |>
summarize(mean(age))
min(oly_year$`mean(age)`)
## [1] 19.86606
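As a side note on the optional part, the sketch below shows one way to also see which year has the minimum average age, not just the value. It assumes a small made-up data frame (`toy`) in place of olympic_gymnasts, and the names `mean_age`, `toy_year`, and `youngest` are illustrative only; naming the summary column avoids the backticks needed for `mean(age)`.

```r
library(dplyr)

# Toy stand-in for olympic_gymnasts (made-up values, not the real data)
toy <- data.frame(
  year = c(2008, 2008, 2012, 2012, 2016),
  age  = c(20, 22, 19, 21, 24)
)

# Name the summary column so it can be referenced without backticks
toy_year <- toy |>
  group_by(year) |>
  summarize(mean_age = mean(age))

# Keep the row (year and value) with the smallest average age
youngest <- toy_year |>
  slice_min(mean_age, n = 1)
```

On the toy data, `youngest` holds the single row for 2012. The same two steps should carry over to olympic_gymnasts, though the result there was not run for this note.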
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
Q: Which 5 gymnasts have made the most Olympic appearances, counted both by Games attended and by events entered?
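# Approach 1: using only verbs we've learned
# Each row of olympic_gymnasts is one gymnast in one event,
# so counting rows per name gives event appearances
p511 <- olympic_gymnasts |>
  select(name, year) |>
  count(name) |>
  arrange(desc(n))
p512 <- head(p511, 5) # top 5 by event appearances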
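# Approach 2 (my style): count Games attended, then attach the event counts
p521 <- olympic_gymnasts |>
  select(name, year) |>
  distinct() |> # one row per gymnast per Games
  count(name) |> # n = number of Games attended
  arrange(desc(n)) |>
  left_join(p511, by = "name") |> # both tables have n, so they become n.x and n.y
  rename(
    n_games = n.x,
    n_events = n.y
  )
# Top 5 by Games attended, with ties broken by event appearances
p522 <- p521 |>
  arrange(desc(n_games), desc(n_events)) |>
  head(5)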
Discussion:
I wondered which 5 gymnasts had appeared in the most Olympic Games and which had made the most event appearances. The two rankings don’t line up exactly, though there is some overlap, which isn’t very surprising. p512 is the basic answer to the question “which gymnasts made the most event appearances?” Each row of olympic_gymnasts is one gymnast in one event, so a select, a count, and some light rearranging are enough.
Figuring out which gymnasts had attended the most Olympic Games required some extra work. I first reduced the data to distinct name-year rows and then used count and arrange, so that n counts Games rather than events. I then had some fun using a left outer join to combine this working data frame with the event counts from the earlier dataset p511 (this became p521). This was all to compare the gymnasts with the most event appearances to those with the most Olympic Games appearances. p522 is the culmination of my solution. While Josy Stoffel participated in the most events (39), Oksana Chusovitina holds the record for showing up to the most Olympic Games (8).
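To make the difference between the two counts concrete, here is a minimal sketch on made-up data. The data frame `toy`, the column `athlete` (standing in for name), and the result names are assumptions for illustration; only the verbs mirror the chunk above.

```r
library(dplyr)

# Made-up entries: one row per athlete per event
toy <- data.frame(
  athlete = c("A", "A", "A", "B", "B", "B", "B"),
  year    = c(2008, 2008, 2012, 2008, 2008, 2008, 2008)
)

# Event appearances: count every row per athlete
events <- toy |>
  count(athlete, name = "n_events")

# Games attended: collapse to one row per athlete-year first, then count
games <- toy |>
  distinct(athlete, year) |>
  count(athlete, name = "n_games")

# Combine both counts and rank by Games, then by events
both <- games |>
  left_join(events, by = "athlete") |>
  arrange(desc(n_games), desc(n_events))
```

Here athlete B has more event appearances (4 vs. 3), but athlete A attended more Games (2 vs. 1), so A ranks first in `both`. Passing `name =` to count() gives each count its own column name, which sidesteps the n.x / n.y rename after the join.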