Do not change anything in the following chunk
You will be working on the olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
df2
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
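A note on the filter used for Question 2: the sketch below, built on a small made-up tibble (`toy` and its values are hypothetical and not taken from the dataset), suggests why `%in%` is the safer way to match several years. With `==`, R recycles the vector of years down the column and compares position by position, so matching rows can be dropped without any warning.

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not from olympic_gymnasts)
toy <- tibble(
  name = c("A", "B", "C", "D", "E", "F"),
  year = c(2016, 2008, 2012, 2004, 2016, 2008)
)

# %in% checks each year against the whole set of target years
with_in <- toy |>
  filter(year %in% c(2008, 2012, 2016))

# == recycles c(2008, 2012, 2016) down the column and compares row by row,
# so a row is kept only if its year lines up with the recycled value
with_eq <- toy |>
  filter(year == c(2008, 2012, 2016))

nrow(with_in)
nrow(with_eq)
```

In this toy example, `%in%` keeps the five rows from 2008, 2012, and 2016, while `==` keeps none of them.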
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df2 |>
  group_by(year) |>
  summarize(age_mean = mean(age, na.rm = TRUE))
## # A tibble: 3 × 2
## year age_mean
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(age_mean = mean(age, na.rm = TRUE))
oly_year
## # A tibble: 29 × 2
## year age_mean
## <dbl> <dbl>
## 1 1896 24.3
## 2 1900 22.2
## 3 1904 25.1
## 4 1906 24.7
## 5 1908 23.2
## 6 1912 24.2
## 7 1920 26.7
## 8 1924 27.6
## 9 1928 25.6
## 10 1932 23.9
## # ℹ 19 more rows
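For the optional part of Question 4, one possible approach is `slice_min()` from dplyr. The sketch below is not run on the full data: `oly_year_demo` is a hypothetical stand-in holding only the first three rows printed above, so its result says nothing about the true minimum across all 29 years.

```r
library(dplyr)

# Stand-in for oly_year: only the first three rows printed above,
# NOT the full 29-row result
oly_year_demo <- tibble(
  year = c(1896, 1900, 1904),
  age_mean = c(24.3, 22.2, 25.1)
)

# Keep the row(s) with the smallest average age
youngest <- oly_year_demo |>
  slice_min(age_mean, n = 1)

youngest
```

On the real data, the same call would presumably be `oly_year |> slice_min(age_mean, n = 1)`, which returns the year alongside its average age rather than the bare number that `min(oly_year$age_mean)` would give.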
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
# Question: Which gymnasts have won the most Olympic medals?
# Group olympic_gymnasts by gymnast, count each gymnast's total medals,
# keep only gymnasts with at least one medal, and sort from most to fewest.
medals_per_gymnast <- olympic_gymnasts |>
  group_by(name, id) |>
  summarize(total_medals = sum(medalist, na.rm = TRUE)) |>
  filter(total_medals > 0) |>
  arrange(desc(total_medals))
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
medals_per_gymnast
## # A tibble: 1,252 × 3
## # Groups: name [1,252]
## name id total_medals
## <chr> <dbl> <int>
## 1 Larysa Semenivna Latynina (Diriy-) 67046 18
## 2 Nikolay Yefimovich Andrianov 4198 15
## 3 Borys Anfiyanovych Shakhlin 109161 13
## 4 Takashi Ono 89187 13
## 5 Aleksey Yuryevich Nemov 85286 12
## 6 Sawao Kato 57998 12
## 7 Viktor Ivanovych Chukarin 21402 11
## 8 Vra slavsk (-Odloilov) 18826 11
## 9 Akinori Nakayama 84381 10
## 10 Aleksandr Nikolayevich Dityatin 28790 10
## # ℹ 1,242 more rows
Discussion:
For this question, I wanted to find out which gymnasts have won the most Olympic medals. I grouped the data by gymnast, using both name and id so that different athletes who share a name are not combined. Because medalist is a logical column, summing it counts the TRUE values, which gives each gymnast's total number of medals. I then filtered the results to keep only athletes with at least one medal and sorted them in descending order so that the most decorated gymnasts appear first. Combined with other, more detailed data, these totals could also be used to compare athletic performance across countries or regions.
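To illustrate the grouping choice described in the discussion, here is a minimal sketch on made-up data. The tibble `toy`, the athlete names, and the ids are all hypothetical. It compares grouping by name alone with grouping by name and id when two different athletes share a name.

```r
library(dplyr)

# Hypothetical toy data: two different athletes share the name "Alex Kim"
toy <- tibble(
  id       = c(1, 1, 2, 2, 3),
  name     = c("Alex Kim", "Alex Kim", "Alex Kim", "Alex Kim", "Sam Lee"),
  medalist = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Grouping by name alone merges the two athletes named "Alex Kim"
by_name <- toy |>
  group_by(name) |>
  summarize(total_medals = sum(medalist), .groups = "drop")

# Grouping by name and id keeps them separate;
# sum() on a logical column counts the TRUE values
by_name_id <- toy |>
  group_by(name, id) |>
  summarize(total_medals = sum(medalist), .groups = "drop") |>
  filter(total_medals > 0) |>
  arrange(desc(total_medals))

by_name
by_name_id
```

In this toy case, grouping by name alone credits a single "Alex Kim" with three medals, while grouping by name and id reports two athletes with two medals and one medal. Passing `.groups = "drop"` here is an optional choice that returns an ungrouped tibble and suppresses the grouping message from `summarize()`.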