Homework 3

Do not change anything in the following chunk

You will be working on the olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)  # the six columns the question asks for
df

## # A tibble: 25,528 × 5
##    name                    sex     age  year medalist
##    <chr>                   <chr> <dbl> <dbl> <lgl>   
##  1 Paavo Johannes Aaltonen M        28  1948 TRUE    
##  2 Paavo Johannes Aaltonen M        28  1948 TRUE    
##  3 Paavo Johannes Aaltonen M        28  1948 FALSE   
##  4 Paavo Johannes Aaltonen M        28  1948 TRUE    
##  5 Paavo Johannes Aaltonen M        28  1948 FALSE   
##  6 Paavo Johannes Aaltonen M        28  1948 FALSE   
##  7 Paavo Johannes Aaltonen M        28  1948 FALSE   
##  8 Paavo Johannes Aaltonen M        28  1948 TRUE    
##  9 Paavo Johannes Aaltonen M        32  1952 FALSE   
## 10 Paavo Johannes Aaltonen M        32  1952 TRUE    
## # ℹ 25,518 more rows

Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.

df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))  # year is numeric, so compare against numbers
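
One way to confirm that a filter like this kept only the intended years is to list the distinct values left in the year column. The sketch below uses a small made-up tibble (toy_df and toy_df2 are hypothetical names, not part of the homework data) so that it runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for df; the values are made up for illustration
toy_df <- tibble(
  name = c("A", "B", "C", "D"),
  year = c(2004, 2008, 2012, 2016)
)

toy_df2 <- toy_df |>
  filter(year %in% c(2008, 2012, 2016))  # keep rows whose year is in the set

# The distinct years that remain after filtering
remaining_years <- toy_df2 |>
  distinct(year) |>
  pull(year)

remaining_years
```

Running the same distinct() and pull() steps on df2 should show only 2008, 2012, and 2016 if the filter worked as intended.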

Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.

df2 |>
  group_by(year) |>
  summarise(
    mean_age = mean(age)
  )

## # A tibble: 3 × 2
##    year mean_age
##   <dbl>    <dbl>
## 1  2008     21.6
## 2  2012     21.9
## 3  2016     22.2

Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarise(
    mean_age = mean(age)
  )

oly_year

## # A tibble: 29 × 2
##     year mean_age
##    <dbl>    <dbl>
##  1  1896     24.3
##  2  1900     22.2
##  3  1904     25.1
##  4  1906     24.7
##  5  1908     23.2
##  6  1912     24.2
##  7  1920     26.7
##  8  1924     27.6
##  9  1928     25.6
## 10  1932     23.9
## # ℹ 19 more rows
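
For the optional part of the question, one possible approach is slice_min() from dplyr, which keeps the row with the smallest value of a column. The sketch below is only an illustration: oly_year_head is a hypothetical stand-in built from the ten rows printed above, so its result is not the answer for the full 29-year dataset.

```r
library(dplyr)

# Stand-in for oly_year: only the first ten rows printed above,
# so the result here is not the answer for the full dataset
oly_year_head <- tibble(
  year     = c(1896, 1900, 1904, 1906, 1908, 1912, 1920, 1924, 1928, 1932),
  mean_age = c(24.3, 22.2, 25.1, 24.7, 23.2, 24.2, 26.7, 27.6, 25.6, 23.9)
)

youngest <- oly_year_head |>
  slice_min(mean_age, n = 1)  # row(s) with the lowest mean age

youngest
```

Applying the same slice_min() call to oly_year itself would return the year with the lowest mean age across all the Games in the data.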

Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.

Question: Filter the olympic_gymnasts dataset to only the host cities in Asia and save the result as a new dataset titled “oly_asia”. Then use mutate to add a column showing age status (minor or adult).

unique(olympic_gymnasts$city)

##  [1] "London"         "Helsinki"       "Antwerpen"      "Rio de Janeiro"
##  [5] "Sydney"         "Munich"         "Beijing"        "Roma"          
##  [9] "Berlin"         "Stockholm"      "Mexico City"    "Tokyo"         
## [13] "Moskva"         "Los Angeles"    "Amsterdam"      "Seoul"         
## [17] "Melbourne"      "Barcelona"      "Athina"         "Atlanta"       
## [21] "St. Louis"      "Montreal"       "Paris"

oly_asia <- olympic_gymnasts |>
    filter(city %in% c("Tokyo", "Beijing", "Seoul")) |>
    mutate(age_status = ifelse(age < 18, "minor", "adult"))
oly_asia

## # A tibble: 3,695 × 17
##       id name     sex     age height weight team  noc   games  year season city 
##    <dbl> <chr>    <chr> <dbl>  <dbl>  <dbl> <chr> <chr> <chr> <dbl> <chr>  <chr>
##  1   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  2   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  3   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  4   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  5   396 Katja A… F        25    165     55 Germ… GER   2008…  2008 Summer Beij…
##  6   610 Ginko A… F        26    148     46 Japan JPN   1964…  1964 Summer Tokyo
##  7   610 Ginko A… F        26    148     46 Japan JPN   1964…  1964 Summer Tokyo
##  8   610 Ginko A… F        26    148     46 Japan JPN   1964…  1964 Summer Tokyo
##  9   610 Ginko A… F        26    148     46 Japan JPN   1964…  1964 Summer Tokyo
## 10   610 Ginko A… F        26    148     46 Japan JPN   1964…  1964 Summer Tokyo
## # ℹ 3,685 more rows
## # ℹ 5 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>,
## #   age_status <chr>

Discussion: I saw the wide variety of host cities in the dataset and realized they could be grouped by continent. I wanted to narrow the dataset down to only the Asian cities. I checked all the cities and found that the ones located in Asia are Tokyo, Beijing, and Seoul. I used the filter function to keep these cities and saved the result as a new dataset. Next, I noticed that there is a wide range of ages in the dataset. To flag whether each person is an adult or a minor, I used mutate to add a column built with an ifelse statement: if the variable “age” is less than 18, the column shows minor; otherwise, it shows adult.
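
The ifelse step can be tried on its own to see how the cutoff at 18 behaves. This is a minimal sketch with made-up ages (the ages vector is hypothetical, not taken from the dataset):

```r
# Made-up ages for illustration only
ages <- c(15, 17, 18, 30)

# ifelse() tests each element: TRUE gives "minor", FALSE gives "adult"
age_status <- ifelse(ages < 18, "minor", "adult")
age_status

# Tally of each label
table(age_status)
```

Because the test is age < 18, someone who is exactly 18 is labeled adult.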

Emme Gunther