Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist) # keep only the six requested columns
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df, create df2 that contains only the years 2008, 2012, and 2016.
df2 <- df[df$year %in% c(2008, 2012, 2016), ]
df2
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
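The base R subsetting used for df2 can also be written with dplyr's filter(). The sketch below compares the two on a small made-up table; toy_df and its rows are hypothetical stand-ins, not values from the Olympics data.

```r
library(dplyr)

# Hypothetical stand-in for df (made-up rows, not from the real dataset)
toy_df <- tibble(
  name = c("A", "B", "C", "D"),
  year = c(2004, 2008, 2012, 2016)
)

# Base R subsetting, the approach used for df2
toy_base <- toy_df[toy_df$year %in% c(2008, 2012, 2016), ]

# Equivalent dplyr version using filter()
toy_dplyr <- toy_df |>
  filter(year %in% c(2008, 2012, 2016))
```

Both versions keep the same three rows here. The filter() form chains more naturally with the other verbs used in the later questions.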
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df2 |>
group_by(year) |>
summarize(
mean_age = mean(age, na.rm = TRUE),
n = n()
)
## # A tibble: 3 × 3
## year mean_age n
## <dbl> <dbl> <int>
## 1 2008 21.6 994
## 2 2012 21.9 848
## 3 2016 22.2 861
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |>
group_by(year) |>
summarize(
n = n(),
mean_age = mean(age, na.rm = TRUE),
max_age = max(age, na.rm = TRUE)
)
oly_year
## # A tibble: 29 × 4
## year n mean_age max_age
## <dbl> <int> <dbl> <dbl>
## 1 1896 73 24.3 31
## 2 1900 33 22.2 31
## 3 1904 317 25.1 37
## 4 1906 70 24.7 35
## 5 1908 240 23.2 49
## 6 1912 310 24.2 38
## 7 1920 206 26.7 45
## 8 1924 499 27.6 38
## 9 1928 561 25.6 39
## 10 1932 140 23.9 34
## # ℹ 19 more rows
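The optional part of Question 4 (the minimum average age) is not computed above. One possible way to do it is sketched below on a made-up table; toy_oly_year and its values are hypothetical and only mirror the year and mean_age columns of oly_year.

```r
library(dplyr)

# Hypothetical stand-in for oly_year (made-up values)
toy_oly_year <- tibble(
  year = c(1996, 2000, 2004),
  mean_age = c(20.5, 19.8, 21.1)
)

# Row for the year with the smallest mean age
youngest_year <- toy_oly_year |>
  slice_min(mean_age, n = 1)

# Or only the minimum value itself
min_mean_age <- min(toy_oly_year$mean_age)
```

Applying the same two steps to oly_year should return the year with the youngest average gymnast and that average age.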
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
# Keep only the rows where the medal won is gold
df2 <- olympic_gymnasts |>
filter(medal == "Gold")
df2
## # A tibble: 785 × 16
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 17 "Paavo … M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 2 17 "Paavo … M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 3 17 "Paavo … M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 4 521 "Isak A… M 21 NA NA Norw… NOR 1912… 1912 Summer Stoc…
## 5 697 "Fausto… M 22 NA NA Swed… SWE 1920… 1920 Summer Antw…
## 6 1109 "Lavini… F 16 148 40 Roma… ROU 1984… 1984 Summer Los …
## 7 1211 "Estell… F 19 NA NA Neth… NED 1928… 1928 Summer Amst…
## 8 1483 "Nobuyu… M 25 154 53 Japan JPN 1960… 1960 Summer Roma
## 9 1483 "Nobuyu… M 25 154 53 Japan JPN 1960… 1960 Summer Roma
## 10 2347 "Georg … M 30 NA NA Denm… DEN 1920… 1920 Summer Antw…
## # ℹ 775 more rows
## # ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>
Discussion: In this step, I filtered olympic_gymnasts to keep only the rows where the medal type is “Gold”. The result, df2, has 785 rows and all 16 columns. Each row is one athlete in one event, so a gymnast with several golds appears several times; for example, the first three rows all belong to the same athlete (id 17) at the 1948 Games. Using filter() is useful here because it quickly narrows the data to the results I am interested in studying. Instead of looking at all gymnasts, I can focus on gold-medal performances. Note that this assignment overwrites the df2 created in Question 2.
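Since Question 5 asks for at least two verbs, one possible follow-up is to count the gold-medal rows per team after filtering. This is only a sketch on made-up data: toy_gymnasts, its rows, and the gold_rows column name are hypothetical, and only the team and medal column names come from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for olympic_gymnasts (made-up rows)
toy_gymnasts <- tibble(
  name  = c("A", "B", "C", "D", "E"),
  team  = c("X", "X", "Y", "Y", "Z"),
  medal = c("Gold", "Gold", "Gold", "Silver", NA)
)

gold_by_team <- toy_gymnasts |>
  filter(medal == "Gold") |>            # verb 1: keep gold rows (NA medals are dropped)
  count(team, name = "gold_rows") |>    # verb 2: number of gold rows per team
  arrange(desc(gold_rows))              # verb 3: largest count first
```

On the real data this would count rows rather than distinct medals, because a team event contributes one row for each team member.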