Homework 3

Do not change anything in the following chunk

You will be working on olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df<- olympic_gymnasts|>
  select(name, sex, age, team, year, medalist)
df

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df create df2 that only have year of 2008 2012, and 2016

df2 <- df |>
  filter(year %in% c(2012, 2008, 2016))
df2

## # A tibble: 2,703 × 6
##    name              sex     age team     year medalist
##    <chr>             <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  3 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  4 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  5 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  6 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  7 Katja Abel        F        25 Germany  2008 FALSE   
##  8 Katja Abel        F        25 Germany  2008 FALSE   
##  9 Katja Abel        F        25 Germany  2008 FALSE   
## 10 Katja Abel        F        25 Germany  2008 FALSE   
## # ℹ 2,693 more rows

Question 3 Group by these three years (2008,2012, and 2016) and summarize the mean of the age in each group.

df2 |>
  group_by(year) |>
  summarize(mean(age))

## # A tibble: 3 × 2
##    year `mean(age)`
##   <dbl>       <dbl>
## 1  2008        21.6
## 2  2012        21.9
## 3  2016        22.2

Question 4 Use olympic_gymnasts dataset, group by year, and find the mean of the age for each year, call this dataset oly_year. (optional after creating the dataset, find the minimum average age)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean(age))
oly_year

## # A tibble: 29 × 2
##     year `mean(age)`
##    <dbl>       <dbl>
##  1  1896        24.3
##  2  1900        22.2
##  3  1904        25.1
##  4  1906        24.7
##  5  1908        23.2
##  6  1912        24.2
##  7  1920        26.7
##  8  1924        27.6
##  9  1928        25.6
## 10  1932        23.9
## # ℹ 19 more rows

min(oly_year)

## [1] 19.86606

Question 5 This question is open ended. Create a question that requires you to use at least two verbs. Create a code that answers your question. Then below the chunk, reflect on your question choice and coding procedure Which tier of medal had the tallest gymnasts on average?

# Your R code here
meanog <- mean(olympic_gymnasts$height, na.rm = TRUE)

olympic_gymnasts |>
  mutate(heights = ifelse(is.na(height), meanog, height)) |>
  group_by(medal) |>
  summarize(mean(heights))

## # A tibble: 4 × 2
##   medal  `mean(heights)`
##   <chr>            <dbl>
## 1 Bronze            162.
## 2 Gold              162.
## 3 Silver            162.
## 4 <NA>              163.

Discussion: Enter your discussion of results here.

The question was created to see if being taller or shorter was better on average for gymnastics. Looking at the average height of each medal in gymnastics is a way this can be looked at. In the code above I first got the mean of the heights of gymnasts without the NAs, as there were many NAs present, and I took this mean to replace the NAs in the data set. I then mutated the data set and added a column called heights which had the mean height of the gymnasts as the value for the observations that were originally NAs. I then grouped them by medal, summarized, and got the mean heights of each medal, Gold, Silver, Bronze, and NA (no medal). As shown by the table the average heights would decrease as the medal ranking increased, no medal being taller on average and gold medal being shorter on average.