Homework 3

Do not change anything in the following chunk

You will be working with the olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
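The case_when() call above sends NA medals to FALSE and everything else to TRUE. As a minimal sketch on a made-up data frame (the names and medals below are invented for illustration, not taken from the olympics file), the same column can also be written as !is.na(medal):

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  name  = c("A", "B", "C"),
  medal = c("Gold", NA, "Bronze")
)

toy <- toy |>
  mutate(
    medalist_case_when = case_when(
      is.na(medal)  ~ FALSE,   # no medal recorded
      !is.na(medal) ~ TRUE     # Gold, Silver, or Bronze
    ),
    medalist_short = !is.na(medal)  # same logical column in one step
  )

# Both columns hold the same logical values
identical(toy$medalist_case_when, toy$medalist_short)
```

The chunk above is left unchanged as instructed; this only shows why its two branches cover every row.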

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df <- olympic_gymnasts |> # Create "df" as a subset of the original dataset "olympic_gymnasts".
  select(name, sex, age, team, year, medalist) # select() keeps only the six requested columns and drops the rest.
df # Print the new dataset with the six selected columns.

## # A tibble: 25,528 × 6
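A quick way to confirm that a subset has exactly the requested columns is to compare names() against the list from the question. This is a sketch on a made-up one-row data frame; the values are invented and only the column names match the assignment:

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  id       = 1,
  name     = "A",
  sex      = "F",
  age      = 20,
  team     = "Team X",
  year     = 2008,
  medalist = FALSE
)

wanted <- c("name", "sex", "age", "team", "year", "medalist")

# select() returns the columns in the order they are listed
subset_toy <- toy |> select(name, sex, age, team, year, medalist)

identical(names(subset_toy), wanted)
```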

Question 2: From df create df2 that only has the years 2008, 2012, and 2016.

df2 <- olympic_gymnasts |> # Create "df2" with only the rows from the years 2008, 2012, and 2016.
  filter(year %in% c(2008, 2012, 2016)) # filter() keeps the rows whose year matches one of the three values; group_by() on these conditions would only label rows, not remove them.
df2 # Print the new dataset df2.

## # A tibble: 2,703 × 16
##       id name     sex     age height weight team  noc   games  year season city 
##    <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
##  1    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  2    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  3    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  4    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  5    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  6    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  7   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  8   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  9   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
## 10   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
## # ℹ 2,693 more rows
## # ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>

# Alternative dataset
alt_df2 <- olympic_gymnasts |>
  filter(year %in% c(2008, 2012, 2016)) |> # %in% keeps every row whose year is 2008, 2012, or 2016, which selects the three years precisely.
  group_by(year) # Group the remaining rows by year.
alt_df2

## # A tibble: 2,703 × 16
## # Groups:   year [3]
##       id name     sex     age height weight team  noc   games  year season city 
##    <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
##  1    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  2    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  3    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  4    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  5    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  6    51 Nstor A… M        23    167     64 Spain ESP   2016…  2016 Summer Rio …
##  7   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  8   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  9   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
## 10   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
## # ℹ 2,693 more rows
## # ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>
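The difference between grouping on a condition and filtering on it can be seen on a small example. This is a minimal sketch with a made-up four-row data frame (names and years invented for illustration): group_by() with logical expressions keeps every row and adds one logical column per expression, while filter() with %in% drops the rows that do not match.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  year = c(2004, 2008, 2012, 2016)
)

# Grouping on conditions: nothing is removed
grouped <- toy |> group_by(year == 2008, year == 2012, year == 2016)

# Filtering on membership: only matching rows remain
filtered <- toy |> filter(year %in% c(2008, 2012, 2016))

nrow(grouped)   # all 4 rows are still present
ncol(grouped)   # 2 original columns plus 3 logical grouping columns
nrow(filtered)  # only the 3 rows from the requested years
```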

Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean of the age in each group.

df3 <- olympic_gymnasts |> # Create "df3" with the mean age for each of the selected years: 2008, 2012, and 2016.
  filter(year %in% c(2008, 2012, 2016)) |> # Keep only the rows whose year is 2008, 2012, or 2016.
  group_by(year) |> # Group the rows by year so the summary is computed once per year.
  summarize(n = n(), # n() counts the rows in each year. Each row is one athlete-event entry, so an athlete can be counted more than once.
            mean_age = mean(age, na.rm = TRUE) # Average age per year. Rows with missing age were already removed in the setup chunk, so na.rm = TRUE is only a safeguard.
            )
df3 # Print the new dataset.

## # A tibble: 3 × 3
##    year     n mean_age
##   <dbl> <int>    <dbl>
## 1  2008   994     21.6
## 2  2012   848     21.9
## 3  2016   861     22.2
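Because each row is one athlete-event entry, n() counts entries rather than people, and the mean age weights athletes by the number of events they entered. The sketch below uses a made-up data frame (ids, years, and ages invented for illustration) to compare n() with n_distinct(id), and a per-entry mean with a per-athlete mean:

```r
library(dplyr)

# Toy data, made up for illustration only: athlete 1 entered three events
toy <- data.frame(
  id   = c(1, 1, 1, 2, 3, 3),
  year = c(2008, 2008, 2008, 2008, 2012, 2012),
  age  = c(20, 20, 20, 24, 18, 18)
)

# Per-entry summary, as in the question above
counts <- toy |>
  group_by(year) |>
  summarize(
    entries  = n(),             # rows per year
    athletes = n_distinct(id),  # distinct people per year
    mean_age = mean(age)        # each entry counts once
  )

# Per-athlete summary: collapse to one row per athlete and year first
per_athlete <- toy |>
  distinct(id, year, age) |>
  group_by(year) |>
  summarize(mean_age = mean(age))

counts
per_athlete
```

Which of the two means is appropriate depends on the question being asked; the assignment wording does not say, so this is only an alternative to consider.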

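# Earlier attempt, kept commented out. It returns no rows because filter()
# combines comma-separated conditions with AND, and a year cannot equal
# 2008, 2012, and 2016 at the same time.
#alt_df3 <- olympic_gymnasts |>
  #filter(year == 2008, year == 2012, year == 2016) |>
  #group_by(year) |>
  #summarize(n = n(),
            #mean_age = mean(age, na.rm = TRUE)
            #)
#alt_df3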
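To see how filter() treats several conditions, here is a minimal sketch on a made-up one-column data frame (years invented for illustration). Conditions separated by commas must all hold at once, while | or %in% accepts a row when any one of them holds:

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(year = c(2004, 2008, 2012, 2016))

# Commas mean AND: no single year equals all three values
and_result <- toy |> filter(year == 2008, year == 2012, year == 2016)

# | means OR: a row is kept if any condition is TRUE
or_result <- toy |> filter(year == 2008 | year == 2012 | year == 2016)

# %in% is a shorter way to write the same OR
in_result <- toy |> filter(year %in% c(2008, 2012, 2016))

nrow(and_result)  # 0 rows
nrow(or_result)   # 3 rows
nrow(in_result)   # 3 rows
```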

Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean of the age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)

oly_year <- olympic_gymnasts |> # Create "oly_year" with the mean age for every Olympic year in the data.
  group_by(year) |> # Group the rows by year so each year gets its own summary row.
  summarize(n = n(), # Number of rows (athlete-event entries) in each year.
            mean_age = mean(age, na.rm = TRUE) # Average age within each year.
            )
oly_year # Print the new dataset.

## # A tibble: 29 × 3
##     year     n mean_age
##    <dbl> <int>    <dbl>
##  1  1896    73     24.3
##  2  1900    33     22.2
##  3  1904   317     25.1
##  4  1906    70     24.7
##  5  1908   240     23.2
##  6  1912   310     24.2
##  7  1920   206     26.7
##  8  1924   499     27.6
##  9  1928   561     25.6
## 10  1932   140     23.9
## # ℹ 19 more rows

# Optional task: minimum of the yearly mean ages. I used base R column access with $ from the data frame activity because I could not do this with the data wrangling verbs.
min(oly_year$mean_age, na.rm = TRUE)

## [1] 19.86606
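min() returns the smallest mean age but not the year it belongs to. One possible data wrangling approach, sketched here on a made-up summary table (the years and means are invented, not the real oly_year values), is dplyr's slice_min(), which returns the whole row with the smallest value:

```r
library(dplyr)

# Toy summary table, made up for illustration only
toy_year <- data.frame(
  year     = c(2000, 2004, 2008),
  mean_age = c(21.5, 19.9, 22.3)
)

# Keep the single row with the lowest mean_age
youngest <- toy_year |> slice_min(mean_age, n = 1)

youngest$year      # year with the lowest mean age in the toy table
youngest$mean_age  # the corresponding mean age
```

Applied to oly_year, the same call would be expected to return the row whose mean_age matches the min() result above, together with its year.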

Question 5: This question is open ended. Create a question that requires you to use at least two verbs. Create code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.

*** Custom Question *** Use the olympic_gymnasts dataset to find all athletes taller than 170 cm, group by year, count them, and find their mean height for each year of the Olympics, rounded to the nearest hundredth. Store the result as a new dataset called “athlete_height”.

# Your R code here
athlete_height <- olympic_gymnasts |> # Create "athlete_height" with, for each year, the count and mean height of entries taller than 170 cm.
  filter(height > 170) |> # Keep only rows with height strictly above 170 cm; rows with missing height are dropped as well.
  group_by(year) |> # Group the rows by year so each year gets its own summary row.
  summarize(Amount_greater_than_170_cm = n(), # Number of rows (athlete-event entries) taller than 170 cm in each year.
            avg_height = round(mean(height, na.rm = TRUE), 2) # Average height of those rows in each year, rounded to the nearest hundredth.
            )
athlete_height # Print the new dataset.

## # A tibble: 29 × 3
##     year Amount_greater_than_170_cm avg_height
##    <dbl>                      <int>      <dbl>
##  1  1896                          1       188 
##  2  1900                          1       175 
##  3  1904                         25       177.
##  4  1906                          4       182 
##  5  1908                          5       176.
##  6  1912                          2       177 
##  7  1920                          2       175 
##  8  1924                         18       178 
##  9  1928                         14       175 
## 10  1932                         13       174.
## # ℹ 19 more rows
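Since the rows are filtered to heights above 170 cm before averaging, avg_height describes only the taller entries, not all gymnasts in a year. A possible complement, sketched on a made-up data frame (years and heights invented for illustration), is to skip the filter and compute the share of measured entries above 170 cm per year:

```r
library(dplyr)

# Toy data, made up for illustration only; one height is missing
toy <- data.frame(
  year   = c(2008, 2008, 2008, 2012, 2012),
  height = c(165, 172, 180, 160, NA)
)

tall_share <- toy |>
  group_by(year) |>
  summarize(
    measured   = sum(!is.na(height)),                         # entries with a known height
    taller     = sum(height > 170, na.rm = TRUE),             # entries above 170 cm
    share_tall = round(mean(height > 170, na.rm = TRUE), 2)   # proportion above 170 cm
  )

tall_share
```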

Discussion: Enter your discussion of results here.