Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df, create df2 that contains only the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
df2
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
Question 3: Group df2 by these three years (2008, 2012, and 2016) and summarize the mean age in each group.
df3 <- df2 |>
group_by(year) |>
mutate(age_mean = mean(age)) |>
group_by(year, age_mean) |>
summarize()
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by year and age_mean.
## ℹ Output is grouped by year.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(year, age_mean))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
df3
## # A tibble: 3 × 2
## # Groups: year [3]
## year age_mean
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
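The same per-year means can also be obtained in a single step, since summarise() collapses each group to one row. A minimal sketch on made-up data follows; toy_df, toy_means, and their values are hypothetical stand-ins for df2, not taken from the Olympics dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for df2 (values are invented)
toy_df <- tibble(
  year = c(2008, 2008, 2012, 2012, 2016),
  age  = c(20, 22, 21, 25, 24)
)

# One row per year; .groups = "drop" returns an ungrouped tibble
toy_means <- toy_df |>
  group_by(year) |>
  summarise(age_mean = mean(age), .groups = "drop")

toy_means
```

Because the grouping is dropped explicitly, this form does not leave the result grouped by year.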
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
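oly_year <- olympic_gymnasts |> # the question asks for olympic_gymnasts, not df
  select(year, age) |>
  group_by(year) |>
  mutate(avg_age = mean(age)) |> # mean age within each year
  count(year, avg_age) # one row per year, with n rows behind each mean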
oly_year
## # A tibble: 29 × 3
## # Groups: year [29]
## year avg_age n
## <dbl> <dbl> <int>
## 1 1896 24.3 73
## 2 1900 22.2 33
## 3 1904 25.1 317
## 4 1906 24.7 70
## 5 1908 23.2 240
## 6 1912 24.2 310
## 7 1920 26.7 206
## 8 1924 27.6 499
## 9 1928 25.6 561
## 10 1932 23.9 140
## # ℹ 19 more rows
min(oly_year$avg_age)
## [1] 19.86606
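min() returns only the value of the lowest average age, not the year it belongs to. One way to recover the year as well is slice_min(). The sketch below uses made-up numbers; toy_year and youngest are hypothetical names and the values are not from the dataset. Since oly_year as printed above is grouped by year, an ungroup() is included first, because otherwise slice_min() would pick the minimum within every single-year group.

```r
library(dplyr)

# Hypothetical per-year averages (values are invented)
toy_year <- tibble(
  year    = c(2008, 2012, 2016),
  avg_age = c(21.5, 19.9, 22.0)
)

# Keep the single row with the smallest average age
youngest <- toy_year |>
  ungroup() |>               # harmless here; needed if the input is grouped
  slice_min(avg_age, n = 1)

youngest
```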
Question 5: Using the olympic_gymnasts dataset, find the mean height of a gymnast from each nation.
avg_h <- mean(olympic_gymnasts$height, na.rm = TRUE)
dfm <- olympic_gymnasts |>
select(team, height) |>
group_by(team) |>
mutate(avg_height = mean(height)) |>
mutate(impute_avg = ifelse(is.na(height), avg_h, height)) |>
count(team, avg_height, impute_avg)
dfm
## # A tibble: 1,088 × 4
## # Groups: team [108]
## team avg_height impute_avg n
## <chr> <dbl> <dbl> <int>
## 1 Algeria 167. 155 4
## 2 Algeria 167. 164 6
## 3 Algeria 167. 170 7
## 4 Algeria 167. 175 7
## 5 Argentina NA 156 5
## 6 Argentina NA 157 5
## 7 Argentina NA 158 4
## 8 Argentina NA 163. 60
## 9 Argentina NA 164 14
## 10 Argentina NA 165 3
## # ℹ 1,078 more rows
Discussion: To handle the NA values without dropping rows, I first
computed avg_h, the mean of all recorded heights across every Olympic
gymnast (with na.rm = TRUE). I then built the dfm data frame with a
pipe that selects only team and height from the olympic_gymnasts
dataset, groups by team, and uses mutate() to add two columns:
avg_height, the mean height within each team, and impute_avg, which
replaces a missing height with avg_h and otherwise keeps the recorded
height. Filling gaps with the overall mean keeps every row in the
analysis without introducing outliers. The final step counts the
combinations of team, avg_height, and impute_avg and displays dfm.
Two caveats show up in the output. Because avg_height is computed
without na.rm = TRUE, it is NA for any team with at least one missing
height (Argentina, for example). And because count() keeps one row per
distinct impute_avg value, dfm has 1,088 rows for 108 teams rather
than a single mean height per nation.
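For comparison, one mean height per nation can be produced by collapsing each team with summarise() and skipping missing values via na.rm = TRUE. This is a minimal sketch on made-up data, not the analysis above; toy_gymnasts, team_heights, n_measured, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for olympic_gymnasts (values are invented)
toy_gymnasts <- tibble(
  team   = c("A", "A", "A", "B", "B", "C"),
  height = c(160, 170, NA, 150, 154, NA)
)

team_heights <- toy_gymnasts |>
  group_by(team) |>
  summarise(
    avg_height = mean(height, na.rm = TRUE), # ignore missing heights
    n_measured = sum(!is.na(height)),        # heights actually recorded
    .groups = "drop"
  )

team_heights
```

A team with no recorded heights at all (team C here) gets NaN rather than a number, which the n_measured column makes easy to spot.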