Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |> # Creates a new dataset called "df" as a column subset of the original dataset "olympic_gymnasts".
select(name, sex, age) # select() keeps only the columns name, sex, and age.
df # Prints the new dataset with the three selected columns.
## # A tibble: 25,528 × 3
## name sex age
## <chr> <chr> <dbl>
## 1 Paavo Johannes Aaltonen M 28
## 2 Paavo Johannes Aaltonen M 28
## 3 Paavo Johannes Aaltonen M 28
## 4 Paavo Johannes Aaltonen M 28
## 5 Paavo Johannes Aaltonen M 28
## 6 Paavo Johannes Aaltonen M 28
## 7 Paavo Johannes Aaltonen M 28
## 8 Paavo Johannes Aaltonen M 28
## 9 Paavo Johannes Aaltonen M 32
## 10 Paavo Johannes Aaltonen M 32
## # ℹ 25,518 more rows
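Question 1 asks for six columns (name, sex, age, team, year, and medalist), while the select() call above keeps only three. A minimal sketch of the six-column selection follows. It runs on a small hypothetical tibble, toy_gymnasts, whose values are invented for illustration and are not taken from the real data.

```r
library(dplyr)

# Hypothetical three-row stand-in for olympic_gymnasts (illustrative values only)
toy_gymnasts <- tibble::tibble(
  name     = c("Athlete A", "Athlete B", "Athlete C"),
  sex      = c("F", "M", "F"),
  age      = c(19, 24, 22),
  height   = c(160, 172, 158),
  team     = c("Team X", "Team Y", "Team X"),
  year     = c(2008, 2012, 2016),
  medalist = c(TRUE, FALSE, FALSE)
)

# Keep exactly the six columns listed in Question 1; height is dropped
toy_df <- toy_gymnasts |>
  select(name, sex, age, team, year, medalist)

names(toy_df)
```

The same select() call should work on olympic_gymnasts itself, since all six column names exist there.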
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- olympic_gymnasts |> # Creates "df2" from olympic_gymnasts, grouped by whether each row's year is 2008, 2012, or 2016.
group_by(year == 2008, year == 2012, year == 2016) # group_by() adds one logical column per condition and groups by them; it does not drop any rows, so all 25,528 rows remain.
df2 # Prints out the new dataset df2.
## # A tibble: 25,528 × 19
## # Groups: year == 2008, year == 2012, year == 2016 [4]
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 2 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 3 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 4 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 5 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 6 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 7 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 8 17 Paavo J… M 28 175 64 Finl… FIN 1948… 1948 Summer Lond…
## 9 17 Paavo J… M 32 175 64 Finl… FIN 1952… 1952 Summer Hels…
## 10 17 Paavo J… M 32 175 64 Finl… FIN 1952… 1952 Summer Hels…
## # ℹ 25,518 more rows
## # ℹ 7 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>,
## # `year == 2008` <lgl>, `year == 2012` <lgl>, `year == 2016` <lgl>
# Alternative dataset
alt_df2 <- olympic_gymnasts |>
filter(year %in% c(2008, 2012, 2016)) |> # I wanted to see whether filter() with %in% would select the years more precisely; it keeps only the rows whose year column is 2008, 2012, or 2016.
group_by(year)
alt_df2
## # A tibble: 2,703 × 16
## # Groups: year [3]
## id name sex age height weight team noc games year season city
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 2 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 3 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 4 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 5 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 6 51 Nstor A… M 23 167 64 Spain ESP 2016… 2016 Summer Rio …
## 7 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 8 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 9 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## 10 396 Katja A… F 25 165 55 Germ… GER 2008… 2008 Summer Beij…
## # ℹ 2,693 more rows
## # ℹ 4 more variables: sport <chr>, event <chr>, medal <chr>, medalist <lgl>
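The two outputs above differ because group_by() only labels rows, while filter() removes them. The sketch below contrasts the two on a hypothetical four-row tibble, toy_df (an assumed stand-in for df, with invented values). Starting from df rather than olympic_gymnasts, as Question 2 asks, would follow the same pattern.

```r
library(dplyr)

# Hypothetical stand-in for df (illustrative values only)
toy_df <- tibble::tibble(
  name = c("Athlete A", "Athlete B", "Athlete C", "Athlete D"),
  year = c(1948, 2008, 2012, 2016)
)

# group_by() on logical expressions adds grouping columns but keeps every row
grouped <- toy_df |>
  group_by(year == 2008, year == 2012, year == 2016)
nrow(grouped)   # same number of rows as toy_df

# filter() with %in% drops the rows from all other years
toy_df2 <- toy_df |>
  filter(year %in% c(2008, 2012, 2016))
nrow(toy_df2)   # the 1948 row is gone
```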
Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean of the age in each group.
df3 <- olympic_gymnasts |> # Creates a new dataset "df3" containing the mean age for each of the selected years: 2008, 2012, and 2016.
filter(year %in% c(2008, 2012, 2016)) |> # filter() keeps only the rows whose year is 2008, 2012, or 2016.
group_by(year) |> # Groups the remaining rows by year.
summarize(n = n(), # n() counts the rows (one per athlete-event entry) in each year and stores the count as "n".
mean_age = mean(age, na.rm = TRUE) # mean_age stores the average age for each year, ignoring any NA ages.
)
df3 # Prints the new dataset.
## # A tibble: 3 × 3
## year n mean_age
## <dbl> <int> <dbl>
## 1 2008 994 21.6
## 2 2012 848 21.9
## 3 2016 861 22.2
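The column n above counts rows, and an athlete who enters several events appears on several rows, so n is a count of entries rather than of people. A minimal sketch of counting both, assuming the id column identifies an athlete; the tibble toy and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: athlete 1 enters two events in 2008
toy <- tibble::tibble(
  id   = c(1, 1, 2, 3),
  year = c(2008, 2008, 2008, 2012),
  age  = c(20, 20, 24, 22)
)

toy_summary <- toy |>
  group_by(year) |>
  summarize(
    entries  = n(),             # rows, i.e. athlete-event entries
    athletes = n_distinct(id),  # unique athletes
    mean_age = mean(age)        # mean over entries, not over unique athletes
  )

toy_summary
```

Note that mean_age here is still weighted by the number of events each athlete entered, which is also how the df3 summary behaves.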
#alt_df3 <- olympic_gymnasts |>
#filter(year == 2008, year == 2012, year == 2016) |>
#group_by(year) |>
#summarize(n = n(),
#mean_age = mean(age, na.rm = TRUE)
#)
#alt_df3
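The commented-out alt_df3 attempt above would return no rows if it were run: filter() combines comma-separated conditions with AND, and a single year cannot equal 2008, 2012, and 2016 at once. The sketch below shows this on a hypothetical tibble, toy, with invented values, along with an OR version that matches the %in% approach.

```r
library(dplyr)

# Hypothetical data (illustrative values only)
toy <- tibble::tibble(
  year = c(2008, 2012, 2016, 1948),
  age  = c(21, 22, 23, 28)
)

# Comma-separated conditions must all hold at once, so no row can match
none <- toy |>
  filter(year == 2008, year == 2012, year == 2016)
nrow(none)

# Joining the conditions with | (OR) keeps the same rows as year %in% c(...)
some <- toy |>
  filter(year == 2008 | year == 2012 | year == 2016)
nrow(some)
```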
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean of the age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |> # Creates "oly_year", which holds the average age for every Olympic year in the data.
group_by(year) |> # group_by() groups the rows by year, so the summary has one row per year.
summarize(n = n(), # n() counts the rows (one per athlete-event entry) in each year.
mean_age = mean(age, na.rm = TRUE) # mean_age stores the average age for each year, ignoring any NA ages.
)
oly_year # Prints out the new dataset.
## # A tibble: 29 × 3
## year n mean_age
## <dbl> <int> <dbl>
## 1 1896 73 24.3
## 2 1900 33 22.2
## 3 1904 317 25.1
## 4 1906 70 24.7
## 5 1908 240 23.2
## 6 1912 310 24.2
## 7 1920 206 26.7
## 8 1924 499 27.6
## 9 1928 561 25.6
## 10 1932 140 23.9
## # ℹ 19 more rows
# Optional task (I used the data frame activity to help me with this, since I couldn't do it in the data wrangling section).
min(oly_year$mean_age, na.rm = TRUE)
## [1] 19.86606
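min() returns the smallest average age but not the year it belongs to. One way to get both with a data-wrangling verb is dplyr's slice_min(), sketched here on a hypothetical summary table, toy_year, whose values are invented for illustration.

```r
library(dplyr)

# Hypothetical per-year summary shaped like oly_year (illustrative values only)
toy_year <- tibble::tibble(
  year     = c(2000, 2004, 2008),
  mean_age = c(21.5, 19.9, 22.1)
)

# Keep the single row with the smallest mean_age
youngest <- toy_year |>
  slice_min(mean_age, n = 1)

youngest
```

Applied to oly_year, the same call should return the row whose mean_age matches the min() result above.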
Question 5: This question is open ended. Create a question that requires you to use at least two verbs. Create code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
*** Custom Question *** Use the olympic_gymnasts dataset to find all athletes taller than 170 cm, group by year, find the mean height of those athletes for each year of the Olympics, and round it to the nearest hundredth. Store the result as a new dataset called “athlete_height”.
# Your R code here
athlete_height <- olympic_gymnasts |> # Creates "athlete_height", which counts the entries taller than 170 cm in each year and gives their average height.
filter(height > 170) |> # filter() keeps only the rows with a height above 170 cm.
group_by(year) |> # group_by() groups the remaining rows by year.
summarize(Amount_greater_than_170_cm = n(), # n() counts the rows with a height above 170 cm in each year.
avg_height = round(mean(height, na.rm = TRUE), 2) # avg_height stores each year's average height among those rows, rounded to the nearest hundredth.
)
athlete_height # Prints the new dataset.
## # A tibble: 29 × 3
## year Amount_greater_than_170_cm avg_height
## <dbl> <int> <dbl>
## 1 1896 1 188
## 2 1900 1 175
## 3 1904 25 177.
## 4 1906 4 182
## 5 1908 5 176.
## 6 1912 2 177
## 7 1920 2 175
## 8 1924 18 178
## 9 1928 14 175
## 10 1932 13 174.
## # ℹ 19 more rows
Discussion: Enter your discussion of results here.