Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist) # all six columns the question asks for
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016)) # year is already in df and is numeric
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df2 |>
group_by(year) |>
summarise(
mean_age = mean(age)
)
## # A tibble: 3 × 2
## year mean_age
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |>
group_by(year) |>
summarise(
mean_age = mean(age)
)
oly_year
## # A tibble: 29 × 2
## year mean_age
## <dbl> <dbl>
## 1 1896 24.3
## 2 1900 22.2
## 3 1904 25.1
## 4 1906 24.7
## 5 1908 23.2
## 6 1912 24.2
## 7 1920 26.7
## 8 1924 27.6
## 9 1928 25.6
## 10 1932 23.9
## # ℹ 19 more rows
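One way to answer the optional part of Question 4 is with slice_min(). This is a minimal sketch, assuming dplyr 1.0.0 or later. oly_year_demo is a hypothetical stand-in built from the first three rows printed above so that the block runs on its own; it is not the full result.

```r
library(dplyr)

# Hypothetical stand-in for oly_year, copied from the first three printed rows
oly_year_demo <- tibble::tibble(
  year = c(1896, 1900, 1904),
  mean_age = c(24.3, 22.2, 25.1)
)

# Keep the row (or rows, if tied) with the smallest mean age
youngest <- oly_year_demo |>
  slice_min(mean_age, n = 1)

youngest
```

Swapping oly_year_demo for oly_year should give the year with the lowest average age across all 29 Games.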
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
Question: Filter the olympic_gymnasts dataset to Games hosted in Asian cities and store the result in a new dataset titled “oly_asia”. Then mutate a column showing age status (minor or adult).
unique(olympic_gymnasts$city)
## [1] "London" "Helsinki" "Antwerpen" "Rio de Janeiro"
## [5] "Sydney" "Munich" "Beijing" "Roma"
## [9] "Berlin" "Stockholm" "Mexico City" "Tokyo"
## [13] "Moskva" "Los Angeles" "Amsterdam" "Seoul"
## [17] "Melbourne" "Barcelona" "Athina" "Atlanta"
## [21] "St. Louis" "Montreal" "Paris"
oly_asia <- olympic_gymnasts |>
filter(city %in% c("Tokyo", "Beijing", "Seoul")) |>
mutate(age_status = ifelse(age < 18, "minor", "adult"))
oly_asia
## # A tibble: 3,695 × 17
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 2 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 3 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 4 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 5 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 6 610 Ginko A… F 26 148 46 Japan JPN 1964… 1964 Summer Tokyo
## 7 610 Ginko A… F 26 148 46 Japan JPN 1964… 1964 Summer Tokyo
## 8 610 Ginko A… F 26 148 46 Japan JPN 1964… 1964 Summer Tokyo
## 9 610 Ginko A… F 26 148 46 Japan JPN 1964… 1964 Summer Tokyo
## 10 610 Ginko A… F 26 148 46 Japan JPN 1964… 1964 Summer Tokyo
## # ℹ 3,685 more rows
## # ℹ 5 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>,
## # age_status <chr>
Discussion: I saw the wide variety of host cities in the dataset and realized they can be grouped by continent. I wanted to narrow the dataset down to Games held in Asia. I checked all the cities with unique() and found that the ones located in Asia are Tokyo, Beijing, and Seoul. I used the filter function to keep only these cities and stored the result in a new dataset. Next, I noticed that there is a wide range of ages in the dataset. To specify whether each athlete is an adult or a minor, I mutated a column with an ifelse statement: if the variable “age” is less than 18, the column shows “minor”; otherwise, it shows “adult”.
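The idea of grouping cities by continent could also be written as a column instead of a hard-coded filter. The sketch below is only an illustration: host_cities is a hypothetical toy table with five of the cities from the unique() output, and the continent labels are my own assumption, not part of the dataset.

```r
library(dplyr)

# Hypothetical toy table: a few host cities from the unique() output
host_cities <- tibble::tibble(
  city = c("Tokyo", "Beijing", "Seoul", "London", "Sydney")
)

# Label each city with a continent; anything not listed becomes "Other"
host_cities <- host_cities |>
  mutate(
    continent = case_when(
      city %in% c("Tokyo", "Beijing", "Seoul") ~ "Asia",
      city == "London" ~ "Europe",
      city == "Sydney" ~ "Oceania",
      TRUE ~ "Other"
    )
  )

# Keep only the Asian host cities, mirroring the filter used for oly_asia
asia_cities <- host_cities |>
  filter(continent == "Asia")
```

With a continent column in place, the same approach would work for any other continent by changing one value in the filter.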