Homework 3

Do not change anything in the following chunk

You will be working on olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df<- olympic_gymnasts|>
  select(name, sex, age, team, year, medalist)
df

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df create df2 that only have year of 2008 2012, and 2016

df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
df2

## # A tibble: 2,703 × 6
##    name              sex     age team     year medalist
##    <chr>             <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  3 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  4 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  5 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  6 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  7 Katja Abel        F        25 Germany  2008 FALSE   
##  8 Katja Abel        F        25 Germany  2008 FALSE   
##  9 Katja Abel        F        25 Germany  2008 FALSE   
## 10 Katja Abel        F        25 Germany  2008 FALSE   
## # ℹ 2,693 more rows

Question 3 Group by these three years (2008,2012, and 2016) and summarize the mean of the age in each group.

df2 |>
  group_by(year) |>
  summarize(age_mean = mean(age, na.rm = T))

## # A tibble: 3 × 2
##    year age_mean
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2

Question 4 Use olympic_gymnasts dataset, group by year, and find the mean of the age for each year, call this dataset oly_year. (optional after creating the dataset, find the minimum average age)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(age_mean = mean(age, na.rm = T))

oly_year

## # A tibble: 29 × 2
##     year age_mean
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows

Question 5 This question is open ended. Create a question that requires you to use at least two verbs. Create a code that answers your question. Then below the chunk, reflect on your question choice and coding procedure

#Create a subset dataset from olympic_gymnasts that groups by gymnast and reads their total number of medals, only if they have at least one. Sort by number of medals most to least.

medals_per_gymnast <- olympic_gymnasts |>
  group_by(name, id) |>
  summarize(total_medals = sum(medalist, na.rm = T)) |>
  filter(total_medals > 0) |>
  arrange(desc(total_medals))

## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.

medals_per_gymnast

## # A tibble: 1,252 × 3
## # Groups:   name [1,252]
##    name                                   id total_medals
##    <chr>                               <dbl>        <int>
##  1 Larysa Semenivna Latynina (Diriy-)  67046           18
##  2 Nikolay Yefimovich Andrianov         4198           15
##  3 Borys Anfiyanovych Shakhlin        109161           13
##  4 Takashi Ono                         89187           13
##  5 Aleksey Yuryevich Nemov             85286           12
##  6 Sawao Kato                          57998           12
##  7 Viktor Ivanovych Chukarin           21402           11
##  8 Vra slavsk (-Odloilov)              18826           11
##  9 Akinori Nakayama                    84381           10
## 10 Aleksandr Nikolayevich Dityatin     28790           10
## # ℹ 1,242 more rows

Discussion: Enter your discussion of results here.

In this question, I wanted to see which gymnasts won the most Olympic medals. To answer this, I grouped the data by gymnast, using both name and ID to ensure athletes with the same name were not combined. I then calculated the total number of medals that each gymnast earned by summing the medalist column to count the number of TRUE values per name + id. I filtered the results to ensure only athletes who had at least one medal showed up, and sorted the list so that the most decorated gymnasts appear first. This data could also be used (in conjunction with other, more refined data) to analyze athletic performance across regions or countries based on participating olympic countries.