Homework 3

Do not change anything in the following chunk

You will be working on olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
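
Because `is.na()` already returns a logical vector, the two-branch `case_when()` above should give the same column as negating `is.na(medal)` directly. A minimal sketch on a hypothetical three-row tibble (`toy` and its values are made up for illustration and are not taken from the dataset):

```r
library(dplyr)

# Hypothetical toy data standing in for the medal column
toy <- tibble(medal = c("Gold", NA, "Bronze"))

toy <- toy |>
  mutate(
    medalist_case = case_when(       # same two branches as the chunk above
      is.na(medal) ~ FALSE,
      !is.na(medal) ~ TRUE
    ),
    medalist_short = !is.na(medal)   # same logic in one step
  )

# TRUE if both columns agree row by row
identical(toy$medalist_case, toy$medalist_short)
```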

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df, create df2 that contains only the years 2008, 2012, and 2016.

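df2 <- df |>
  # %in% tests membership in the set of years. `year == c(2008, 2012, 2016)`
  # recycles the vector row by row and keeps only part of the matching rows;
  # the output printed below was generated with that `==` version.
  filter(year %in% c(2008, 2012, 2016))
df2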

## # A tibble: 886 × 6
##    name                           sex     age team     year medalist
##    <chr>                          <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun              M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun              M        23 Spain    2016 FALSE   
##  3 Katja Abel                     F        25 Germany  2008 FALSE   
##  4 Denis Mikhaylovich Ablyazin    M        19 Russia   2012 TRUE    
##  5 Denis Mikhaylovich Ablyazin    M        19 Russia   2012 FALSE   
##  6 Denis Mikhaylovich Ablyazin    M        24 Russia   2016 TRUE    
##  7 Denis Mikhaylovich Ablyazin    M        24 Russia   2016 TRUE    
##  8 Andreea Roxana Acatrinei       F        16 Romania  2008 TRUE    
##  9 Jonna Eva-Maj Adlerteg         F        17 Sweden   2012 FALSE   
## 10 Kseniya Dmitriyevna Afanasyeva F        16 Russia   2008 FALSE   
## # ℹ 876 more rows
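
When a column has to match any of several values, `==` and `%in%` behave differently: `==` recycles the right-hand vector down the column, while `%in%` checks each row against the whole set. The sketch below shows the difference on a hypothetical six-row tibble (`toy` is made up for illustration; every row is in one of the three target years):

```r
library(dplyr)

# Hypothetical toy data: six rows, all within the target years
toy <- tibble(year = c(2008, 2008, 2012, 2012, 2016, 2016))

# `==` recycles c(2008, 2012, 2016) down the column, so each row is
# compared with only one of the three years
n_equal <- toy |> filter(year == c(2008, 2012, 2016)) |> nrow()

# `%in%` asks, for every row, whether year is any of the three values
n_in <- toy |> filter(year %in% c(2008, 2012, 2016)) |> nrow()

c(equal = n_equal, in_set = n_in)
# equal = 2, in_set = 6
```

Only rows whose year happens to line up with the recycled position survive the `==` filter, which is why the row counts differ.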

Question 3: Group df2 by year (2008, 2012, and 2016) and summarize the mean age in each group.

df2 |> 
  group_by(year) |>
  summarize(
    mean_age = mean(age)
  )

## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.7
## 2  2012     22.0
## 3  2016     22.2

Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(
    mean_age = mean(age),
    min_age = min(age)
  )
oly_year

## # A tibble: 29 × 3
##     year mean_age min_age
##    <dbl>    <dbl>   <dbl>
##  1  1896     24.3      10
##  2  1900     22.2      17
##  3  1904     25.1      18
##  4  1906     24.7      14
##  5  1908     23.2      16
##  6  1912     24.2      18
##  7  1920     26.7      17
##  8  1924     27.6      19
##  9  1928     25.6      11
## 10  1932     23.9      15
## # ℹ 19 more rows

# mean of the per-year minimum ages
mean(oly_year$min_age)

## [1] 14.58621
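
The optional part of Question 4 asks for the minimum of the yearly average ages. That is a different quantity from `mean(oly_year$min_age)`, which averages the youngest age recorded in each year. A minimal sketch of one way to get the minimum average, using a hypothetical stand-in for `oly_year` (`toy_year` and its numbers are made up for illustration):

```r
library(dplyr)

# Hypothetical stand-in for oly_year; values are illustrative only
toy_year <- tibble(
  year     = c(2000, 2004, 2008),
  mean_age = c(22.5, 21.0, 23.1),
  min_age  = c(15, 16, 14)
)

# Smallest yearly average, as a single number
min(toy_year$mean_age)

# The row holding that smallest average, so the year is visible too
youngest <- toy_year |> slice_min(mean_age, n = 1)
youngest
```

Applied to the real `oly_year`, the same two calls would report the lowest yearly mean age and the year it occurred in.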

Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.

Find the average physical characteristics of Olympic gold, silver, and bronze medalists.

# Keep medal-winning rows, then average age, height, and weight per medal
medalists <- olympic_gymnasts |>
  filter(medalist) |>
  group_by(medal) |>
  summarize(
    n = n(),
    mean_age = mean(age),
    mean_height = mean(height, na.rm = TRUE),   # ignore missing heights
    mean_weight = mean(weight, na.rm = TRUE)    # ignore missing weights
  )
medalists

## # A tibble: 3 × 5
##   medal      n mean_age mean_height mean_weight
##   <chr>  <int>    <dbl>       <dbl>       <dbl>
## 1 Bronze   675     23.2        162.        55.7
## 2 Gold     785     23.6        161.        54.7
## 3 Silver   727     23.4        161.        54.9

Discussion:

I chose this question to see whether physical characteristics are related to which medal a gymnast receives. To investigate, I filtered the olympic_gymnasts dataset to the rows where a medal was earned and grouped them by medal. I then calculated the average age, height, and weight within each medal group. The group means are nearly identical: they differ by at most about 0.4 years of age, roughly 1 cm of height, and about 1 kg of weight. I did not run a statistical test, so I cannot make a claim about statistical significance, but the table gives no indication that these characteristics separate the three medal groups.
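
The averages above pool male and female gymnasts, so a difference within either sex could be hidden. One possible follow-up is to add `sex` to the grouping. The sketch below uses a small made-up tibble (`toy`, with hypothetical values) rather than the real data, so it only illustrates the shape of the code:

```r
library(dplyr)

# Hypothetical toy data; heights are invented for illustration
toy <- tibble(
  sex    = c("F", "F", "M", "M", "F", "M"),
  medal  = c("Gold", "Silver", "Gold", "Silver", "Gold", "Gold"),
  height = c(150, 152, 168, 166, NA, 170)
)

# One row per sex-medal combination instead of one row per medal
by_sex <- toy |>
  group_by(sex, medal) |>
  summarize(
    n = n(),
    mean_height = mean(height, na.rm = TRUE),  # ignore missing heights
    .groups = "drop"                           # return an ungrouped tibble
  )
by_sex
```

On the real data, the same pattern would be `olympic_gymnasts |> filter(medalist) |> group_by(sex, medal)` followed by the existing `summarize()` call.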