Homework 3

Do not change anything in the following chunk

You will be working on the olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success with medals
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )
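To see what the medalist column does, here is a minimal sketch on a made-up three-row tibble. The `toy` data and the column names `medalist_case` and `medalist_short` are invented for illustration and are not part of the assignment. It suggests that the two-branch case_when() behaves the same as the shorter expression !is.na(medal).

```r
library(dplyr)

# Hypothetical toy data: one medal winner, two athletes without a medal
toy <- tibble(
  name  = c("A", "B", "C"),
  medal = c("Gold", NA, NA)
)

toy_flagged <- toy |>
  mutate(
    medalist_case = case_when(
      is.na(medal)  ~ FALSE,         # no medal recorded
      !is.na(medal) ~ TRUE           # any medal recorded
    ),
    medalist_short = !is.na(medal)   # same logic in a single expression
  )

toy_flagged

# Compare the two columns element by element
identical(toy_flagged$medalist_case, toy_flagged$medalist_short)
```

The provided chunk should stay as it is; this only illustrates the logic behind it.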

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df

## # A tibble: 25,528 × 6
##    name                    sex     age team     year medalist
##    <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32 Finland  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32 Finland  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df, create df2 that only includes the years 2008, 2012, and 2016.

df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))   # keep rows from the three requested years
df2

## # A tibble: 2,703 × 6
##    name              sex     age team     year medalist
##    <chr>             <chr> <dbl> <chr>   <dbl> <lgl>   
##  1 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  2 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  3 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  4 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  5 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  6 Nstor Abad Sanjun M        23 Spain    2016 FALSE   
##  7 Katja Abel        F        25 Germany  2008 FALSE   
##  8 Katja Abel        F        25 Germany  2008 FALSE   
##  9 Katja Abel        F        25 Germany  2008 FALSE   
## 10 Katja Abel        F        25 Germany  2008 FALSE   
## # ℹ 2,693 more rows
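As a small check of the year filter, the sketch below applies the same `%in%` condition to a made-up tibble, once with dplyr's filter() and once with base R's subset(). The `toy` data and the `wanted_years` vector are assumptions made for illustration only.

```r
library(dplyr)

# Hypothetical toy data: four athletes, one from a year we do not want
toy <- tibble(
  name = c("A", "B", "C", "D"),
  year = c(2004, 2008, 2012, 2016)
)

wanted_years <- c(2008, 2012, 2016)

# %in% is TRUE for each row whose year appears in wanted_years
toy_filter <- toy |> filter(year %in% wanted_years)
toy_subset <- toy |> subset(year %in% wanted_years)

toy_filter

# Both approaches keep the same athletes
identical(toy_filter$name, toy_subset$name)
```

On this toy data both calls keep the same three rows. Using `%in%` avoids chaining several `year == ...` comparisons with `|`.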

Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.

df2_year <- df2 |>
  group_by(year) |>                     # put the rows for each year into their own group
  summarise(average_age = mean(age))    # average age within each year
df2_year

## # A tibble: 3 × 2
##    year average_age
##   <dbl>       <dbl>
## 1  2008        21.6
## 2  2012        21.9
## 3  2016        22.2

df2 |>
  summarise(average_age = mean(age))    # average age across all three years combined

## # A tibble: 1 × 1
##   average_age
##         <dbl>
## 1        21.9
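The grouped and ungrouped summaries answer different questions, and the overall mean is generally not the plain average of the per-year means when the groups differ in size. The following sketch uses invented numbers (the `toy` tibble is hypothetical) to show the difference.

```r
library(dplyr)

# Hypothetical toy data: three athletes in 2008, one in 2012
toy <- tibble(
  year = c(2008, 2008, 2008, 2012),
  age  = c(20, 22, 24, 30)
)

by_year <- toy |>
  group_by(year) |>
  summarise(average_age = mean(age), n = n())   # per-year mean and group size

overall <- toy |>
  summarise(average_age = mean(age))            # mean over all rows

by_year
overall

mean(by_year$average_age)                       # unweighted mean of the group means: 26
weighted.mean(by_year$average_age, by_year$n)   # weighted by group size: 24, matches overall
```

Here the per-year means are 22 and 30, the overall mean is 24, and only the size-weighted average of the group means reproduces it.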

Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)

df3 <- olympic_gymnasts |>
  group_by(year)                      # grouped copy, reused in Question 5

oly_year <- df3 |>
  summarise(mean_age = mean(age))     # mean age for each year
oly_year

## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows

min_avrg_age <- min(oly_year$mean_age)   # smallest of the yearly mean ages
min_avrg_age

## [1] 19.86606
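min() returns only the value, not the year it belongs to. One possible way to recover the whole row is dplyr's slice_min(), sketched here on an invented per-year summary. The `toy_year` tibble and its numbers are hypothetical and are not results from the gymnastics data.

```r
library(dplyr)

# Hypothetical per-year summary table with made-up values
toy_year <- tibble(
  year     = c(1996, 2000, 2004),
  mean_age = c(21.5, 19.8, 20.9)
)

# Keep the single row with the smallest mean_age
youngest <- toy_year |>
  slice_min(mean_age, n = 1)

youngest
```

The same call on the real per-year summary would return the year alongside its minimum average age.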

Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.

My Question: Find the maximum age in the entire dataset, then filter the dataset to include only ages between 18 and 25. After filtering, confirm the maximum age within this group and calculate the average age for each year.

max_age <- max(df3$age)   # maximum age across all gymnasts

# df3 is grouped by year, so summarise() returns one row per year
filtered_age <- df3 |>
  filter(age >= 18 & age <= 25) |>
  summarise(max_age = max(age), filtered_avg_age = mean(age))

max_age

## [1] 49

filtered_age

## # A tibble: 29 × 3
##     year max_age filtered_avg_age
##    <dbl>   <dbl>            <dbl>
##  1  1896      25             21.4
##  2  1900      25             22.1
##  3  1904      25             22.6
##  4  1906      25             22.4
##  5  1908      25             21.5
##  6  1912      25             21.9
##  7  1920      25             22.5
##  8  1924      25             23.1
##  9  1928      25             21.9
## 10  1932      25             22.3
## # ℹ 19 more rows

Discussion: My goal for this question was to gain a better understanding of how to use the pipe operator “|>”. I first calculated the maximum age in the entire dataset (49), then filtered the data to gymnasts aged 18 to 25 and used “filter()” and “summarise()” to calculate the maximum and average age for each year. One challenge I encountered was trying to assign variables inside the pipeline, which I resolved by performing the calculations directly within “summarise()”. Overall, this exercise improved my understanding of data transformations and how to chain operations efficiently in R.
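The assignment pitfall mentioned in the discussion can be reproduced on a small example. The sketch below, which uses an invented `toy` tibble, compares `<-` and `=` inside summarise(): with `<-`, the whole expression becomes the column name, while `=` names the column as intended.

```r
library(dplyr)

# Hypothetical toy data: two athletes in each of two years
toy <- tibble(
  year = c(2008, 2008, 2012, 2012),
  age  = c(19, 24, 21, 25)
)

# Using <- inside summarise(): the expression text is used as the column name
with_arrow <- toy |>
  group_by(year) |>
  summarise(max_age <- max(age))

# Using = inside summarise(): the column is named max_age
with_equals <- toy |>
  group_by(year) |>
  summarise(max_age = max(age))

names(with_arrow)
names(with_equals)
```

Both versions compute the same values; only the resulting column name differs, which matters as soon as a later step refers to the column by name.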