Do not change anything in the following chunk
You will be working on the olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
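Both branches of the case_when() call above only test whether medal is missing, so the same logical column can also be written as !is.na(medal). The sketch below checks that equivalence on a small made-up tibble. The toy data and the column names medalist_case and medalist_short are assumptions for illustration and are not part of the Olympics file.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  name  = c("A", "B", "C", "D"),
  medal = c("Gold", NA, "Bronze", NA)
)

toy_flagged <- toy %>%
  mutate(
    # Same two-branch logic as the chunk above
    medalist_case = case_when(
      is.na(medal)  ~ FALSE,
      !is.na(medal) ~ TRUE
    ),
    # Shorter form: TRUE exactly when medal is not missing
    medalist_short = !is.na(medal)
  )

# Compare the two columns row by row
identical(toy_flagged$medalist_case, toy_flagged$medalist_short)
```

The provided chunk is left as given; the shorter form is only a possible simplification.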
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts %>%
  select(name, sex, age, team, year, medalist)
head(df)
## # A tibble: 6 × 6
##   name                    sex     age team     year medalist
##   <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE
## 2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE
## 3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE
## 4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE
## 5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE
## 6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df %>%
  filter(year %in% c(2008, 2012, 2016))
head(df2)
## # A tibble: 6 × 6
##   name              sex     age team   year medalist
##   <chr>             <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M        23 Spain  2016 FALSE
## 2 Nstor Abad Sanjun M        23 Spain  2016 FALSE
## 3 Nstor Abad Sanjun M        23 Spain  2016 FALSE
## 4 Nstor Abad Sanjun M        23 Spain  2016 FALSE
## 5 Nstor Abad Sanjun M        23 Spain  2016 FALSE
## 6 Nstor Abad Sanjun M        23 Spain  2016 FALSE
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df2_summary <- df2 %>%
  group_by(year) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
df2_summary
## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2
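A group mean is easier to read when the number of rows behind it is shown as well. The sketch below chains the filter from Question 2 with the grouped summary from Question 3 and adds a row count with n(). The toy data and the n_rows column are assumptions for illustration, not values from the Olympics file.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
toy <- tibble(
  year = c(2004, 2008, 2008, 2012, 2012, 2012, 2016),
  age  = c(30, 20, 22, 19, 21, 23, 25)
)

toy_summary <- toy %>%
  filter(year %in% c(2008, 2012, 2016)) %>%  # drops the 2004 row
  group_by(year) %>%
  summarize(
    mean_age = mean(age, na.rm = TRUE),      # average age per year
    n_rows   = n()                           # rows contributing to each mean
  )

toy_summary
```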
Question 4: Using the olympic_gymnasts dataset, group by year and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts %>%
  group_by(year) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
# Display dataset
oly_year
## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows
# Year with minimum average age
oly_year %>%
  filter(mean_age == min(mean_age))
## # A tibble: 1 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  1988     19.9
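The filter(mean_age == min(mean_age)) step keeps every row tied for the minimum. One possible alternative is slice_min(), available in dplyr 1.0.0 and later, which also keeps ties by default. The sketch below compares the two on made-up yearly means. The toy_year values are hypothetical and chosen to include a tie.

```r
library(dplyr)

# Hypothetical yearly means, invented for illustration only (note the tie at 20.0)
toy_year <- tibble(
  year     = c(2000, 2004, 2008, 2012),
  mean_age = c(21.5, 20.0, 22.1, 20.0)
)

# Approach used above: compare each row to the overall minimum
via_filter <- toy_year %>%
  filter(mean_age == min(mean_age))

# Alternative: slice_min() keeps tied rows because with_ties = TRUE by default
via_slice <- toy_year %>%
  slice_min(mean_age, n = 1)

via_filter
via_slice
```

Both approaches return the same set of years here; slice_min() with with_ties = FALSE would instead return a single row.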
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
# Question: which teams had the oldest gymnasts on average in 2016?
oldest_team_2016 <- df %>%
  filter(year == 2016) %>%
  group_by(team) %>%
  summarize(mean_age = mean(age, na.rm = TRUE)) %>%
  arrange(desc(mean_age))
oldest_team_2016
## # A tibble: 60 × 2
##    team        mean_age
##    <chr>          <dbl>
##  1 Uzbekistan      35
##  2 Greece          30
##  3 Venezuela       30
##  4 Israel          29
##  5 North Korea     28.3
##  6 Chile           28
##  7 Armenia         27
##  8 Romania         26.8
##  9 Vietnam         25.3
## 10 Egypt           25
## # ℹ 50 more rows
Discussion: I wanted to explore which teams had the oldest gymnasts on average at the 2016 Olympics. I first filtered the dataset to 2016, then grouped by team and calculated the mean age. Sorting in descending order made it easy to identify the teams with the oldest gymnasts. The pipeline uses four verbs, filter(), group_by(), summarize(), and arrange(), which satisfies the requirement of at least two. Because df has one row per athlete per event, these means are computed over event entries rather than unique athletes.
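The head(df) output shows the same athlete repeated once per event, so a team mean taken over rows weights each gymnast by the number of events entered. The sketch below contrasts that with a per-athlete mean obtained by calling distinct() before grouping. The toy data is made up, and treating name plus team as an athlete identifier is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data: athlete "A" appears in three events, "B" and "C" in one each
toy <- tibble(
  name = c("A", "A", "A", "B", "C"),
  team = c("X", "X", "X", "X", "Y"),
  age  = c(30, 30, 30, 20, 25),
  year = 2016
)

# Mean over event entries (the approach used in the answer above)
by_entry <- toy %>%
  filter(year == 2016) %>%
  group_by(team) %>%
  summarize(mean_age = mean(age, na.rm = TRUE)) %>%
  arrange(desc(mean_age))

# Mean over unique athletes: collapse repeated rows first
by_athlete <- toy %>%
  filter(year == 2016) %>%
  distinct(name, team, age) %>%
  group_by(team) %>%
  summarize(mean_age = mean(age, na.rm = TRUE)) %>%
  arrange(desc(mean_age))

by_entry
by_athlete
```

In this toy example team X averages 27.5 over entries but 25 over athletes, so the two definitions can rank teams differently.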