Homework 3

Do not change anything in the following chunk

You will be working with the olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
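
The two-branch case_when() call above maps missing medals to FALSE and everything else to TRUE, which is the same as negating is.na() directly. A minimal sketch of that equivalence, assuming a made-up tibble (toy and its values are hypothetical, not taken from the dataset):

```r
library(dplyr)

# Hypothetical toy data standing in for the medal column
toy <- tibble(medal = c("Gold", NA, "Bronze", NA))

toy <- toy |>
  mutate(
    # same two-branch logic as the chunk above
    medalist_cw  = case_when(is.na(medal) ~ FALSE, !is.na(medal) ~ TRUE),
    # shorter equivalent: TRUE whenever medal is not missing
    medalist_neg = !is.na(medal)
  )

# Compare the two columns element by element
same_result <- identical(toy$medalist_cw, toy$medalist_neg)
same_result
```

The case_when() form is more verbose but leaves room for extra branches, for example separating Gold from the other medals.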

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df, create df2 that only contains the years 2008, 2012, and 2016.

df2 <- df |> 
  filter(year %in% c(2008, 2012, 2016))
df2

## # A tibble: 2,703 × 6
##    name              sex     age team     year medalist
##    <chr>             <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  3 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  4 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  5 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  6 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  7 Katja Abel        F        25 Germany  2008 FALSE   
##  8 Katja Abel        F        25 Germany  2008 FALSE   
##  9 Katja Abel        F        25 Germany  2008 FALSE   
## 10 Katja Abel        F        25 Germany  2008 FALSE   
## # ℹ 2,693 more rows
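
The filter above relies on %in% rather than ==. A small sketch of why that matters, using a hypothetical vector of years (years and wanted are made-up names): == recycles the right-hand side position by position, so it can silently miss matches.

```r
# Hypothetical vector of years, not taken from the dataset
years  <- c(2008, 2012, 2016, 2016, 2004, 2008)
wanted <- c(2008, 2012, 2016)

# == compares element by element, recycling `wanted` along `years`
hits_eq <- years == wanted

# %in% asks, for each year, whether it appears anywhere in `wanted`
hits_in <- years %in% wanted

sum(hits_eq)  # rows kept by the recycled comparison
sum(hits_in)  # rows kept by set membership
```

Here the recycled comparison keeps fewer rows than %in%, even though five of the six years are in the wanted set.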

Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean of the age in each group.

df3 <- df2 |> 
  group_by(year) |> 
  mutate(age_mean = mean(age)) |> 
  group_by(year, age_mean) |> 
  summarize()

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by year and age_mean.
## ℹ Output is grouped by year.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(year, age_mean))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.

df3

## # A tibble: 3 × 2
## # Groups:   year [3]
##    year age_mean
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2
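
The same kind of per-year mean can also be computed by passing a named summary straight to summarize(), which skips the intermediate mutate() and the second group_by(). A minimal sketch on made-up data (toy and its values are hypothetical; only the column names mirror df2):

```r
library(dplyr)

# Hypothetical toy data with the same column names as df2
toy <- tibble(
  year = c(2008, 2008, 2012, 2012, 2016),
  age  = c(20, 22, 19, 23, 25)
)

toy_means <- toy |>
  group_by(year) |>
  # one row per year; .groups = "drop" returns an ungrouped tibble
  summarize(age_mean = mean(age), .groups = "drop")

toy_means
```

Setting .groups explicitly is optional here, but it makes the grouping of the result clear to the reader.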

Question 4: Using the olympic_gymnasts dataset, group by year and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)

oly_year <- olympic_gymnasts |> 
  select(year, age) |> 
  group_by(year) |> 
  mutate(avg_age = mean(age)) |>   # mean age within each year
  count(year, avg_age)             # one row per year; n is the number of rows in that year
oly_year

## # A tibble: 29 × 3
## # Groups:   year [29]
##     year avg_age     n
##    <dbl>   <dbl> <int>
##  1  1896    24.3    73
##  2  1900    22.2    33
##  3  1904    25.1   317
##  4  1906    24.7    70
##  5  1908    23.2   240
##  6  1912    24.2   310
##  7  1920    26.7   206
##  8  1924    27.6   499
##  9  1928    25.6   561
## 10  1932    23.9   140
## # ℹ 19 more rows

min(oly_year$avg_age)

## [1] 19.86606
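
min() returns the smallest average age but not the year it belongs to. A sketch, on hypothetical data, of using slice_min() to keep the whole row instead (toy_year and its values are made up; only the column names mirror oly_year):

```r
library(dplyr)

# Hypothetical per-year averages, not the real results
toy_year <- tibble(
  year    = c(2000, 2004, 2008),
  avg_age = c(21.5, 19.8, 22.1)
)

# Keep the row with the smallest avg_age, including its year
youngest <- toy_year |>
  slice_min(avg_age, n = 1)

youngest
```

On a grouped tibble such as oly_year, slice_min() works within each group, so an ungroup() call would be needed first.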

Question 5: Using the olympic_gymnasts dataset, find the mean height of a gymnast from each nation.

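# Overall mean height across all gymnasts, ignoring missing values
avg_h <- mean(olympic_gymnasts$height, na.rm = TRUE)

dfm <- olympic_gymnasts |> 
  select(team, height) |> 
  group_by(team) |> 
  mutate(avg_height = mean(height)) |>                          # NA if any height in the team is missing
  mutate(impute_avg = ifelse(is.na(height), avg_h, height)) |>  # fill missing heights with avg_h
  count(team, avg_height, impute_avg)                           # one row per team and distinct height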

dfm

## # A tibble: 1,088 × 4
## # Groups:   team [108]
##    team      avg_height impute_avg     n
##    <chr>          <dbl>      <dbl> <int>
##  1 Algeria         167.       155      4
##  2 Algeria         167.       164      6
##  3 Algeria         167.       170      7
##  4 Algeria         167.       175      7
##  5 Argentina        NA        156      5
##  6 Argentina        NA        157      5
##  7 Argentina        NA        158      4
##  8 Argentina        NA        163.    60
##  9 Argentina        NA        164     14
## 10 Argentina        NA        165      3
## # ℹ 1,078 more rows

Discussion: To handle the missing heights, I first computed avg_h, the mean height across all Olympic gymnasts, with na.rm = TRUE so that NA values are ignored. I then built dfm in a single pipe. The pipe selects team and height from olympic_gymnasts, groups by team, and uses mutate() to add avg_height, the mean height within each team. A second mutate() adds impute_avg, which replaces each missing height with avg_h and keeps the recorded height otherwise. Finally, count(team, avg_height, impute_avg) collapses duplicate rows before dfm is displayed. Two limitations show up in the output. First, avg_height is computed without na.rm = TRUE, so it is NA for any team with at least one missing height (Argentina, for example). Second, because count() also groups on impute_avg, dfm has one row per team and distinct height value (1,088 rows across 108 teams) rather than a single mean per team. Imputing the overall mean keeps every row in the data, but it also pulls a team's average toward avg_h, so whether it improves accuracy depends on how many heights are missing for that team.
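
One way to get a single mean height per nation is to summarize within each team and drop missing heights with na.rm = TRUE. A minimal sketch on made-up data (toy, team_heights, and n_missing are hypothetical names; only team and height mirror the dataset's columns):

```r
library(dplyr)

# Hypothetical toy data: team A has one missing height
toy <- tibble(
  team   = c("A", "A", "A", "B", "B"),
  height = c(160, NA, 170, 150, 154)
)

team_heights <- toy |>
  group_by(team) |>
  summarize(
    avg_height = mean(height, na.rm = TRUE),  # mean of the recorded heights only
    n_missing  = sum(is.na(height)),          # how many heights were missing
    .groups = "drop"
  )

team_heights
```

This gives one row per team. A team whose heights are all missing would get NaN for avg_height, so reporting n_missing alongside the mean helps flag averages that rest on few observations.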