Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df <- olympic_gymnasts |>
  select(name, sex, age, team, year, medalist)
df
Question 2: From df, create df2 that only has the years 2008, 2012, and 2016.
df2 <- df |>
  filter(year %in% c(2008, 2012, 2016))
df2
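Question 3: Group by these three years (2008, 2012, and 2016) and summarize the mean of the age in each group.
# df2 already contains only 2008, 2012, and 2016, so grouping by year gives one row per year.
# Grouping df by `year %in% c(2008, 2012, 2016)` would instead give just two groups (TRUE and FALSE).
df2 |>
  group_by(year) |>
  summarise(mean_age = mean(age))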
Question 4: Use the olympic_gymnasts dataset, group by year, and find the mean of the age for each year. Call this dataset oly_year. (Optional: after creating the dataset, find the minimum average age.)
oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarise(mean_age = mean(age)) # name the column so it is easy to refer to later
oly_year
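For the optional part of Question 4, here is a minimal sketch of one way to find the minimum average age. It runs on a small made-up table (`toy_gymnasts`, not real Olympic data) so it works on its own; the names `toy_gymnasts`, `toy_year`, `mean_age`, and `youngest_year` are chosen for illustration. Swapping `toy_gymnasts` for `olympic_gymnasts` should apply the same steps to the real data.

```r
library(dplyr)

# Made-up toy data standing in for olympic_gymnasts (illustration only)
toy_gymnasts <- tibble(
  year = c(2008, 2008, 2012, 2012, 2016, 2016),
  age  = c(20, 22, 18, 20, 24, 26)
)

# Mean age per year, with an explicit column name so it is easy to refer to
toy_year <- toy_gymnasts |>
  group_by(year) |>
  summarise(mean_age = mean(age))

# Keep the row with the smallest mean age
youngest_year <- toy_year |>
  slice_min(mean_age, n = 1)

youngest_year
```

Using `slice_min()` keeps the year next to its average, whereas `min()` on the column alone would return only the number.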
Question 5: This question is open-ended. Create a question that requires you to use at least two verbs. Write code that answers your question. Then, below the chunk, reflect on your question choice and coding procedure.
My Question: Using the olympic_gymnasts dataset, keep only the gymnasts who won a gold medal, count the total number of gold medals for each team (country), then display only the top 10 countries with the most gold medals.
oly_teams_gold_medals <- olympic_gymnasts |>
  filter(medal == "Gold") |> # keep only the rows where a gold medal was won
  # count() tallies the gold medals each team earned across all years;
  # the `name =` argument names the new count column
  count(team, name = "Gold_Medal_Total") |>
  arrange(desc(Gold_Medal_Total)) |> # sort teams from highest to lowest total
  slice_head(n = 10) # keep the top 10 teams with the most gold medals
oly_teams_gold_medals
Discussion: Running this code shows that the top ten countries are the Soviet Union, Sweden, Italy, Japan, United States, China, Germany, Norway, Romania, and Switzerland. At first I displayed only the top 5 countries, and I was shocked to see that the U.S. had more gold medals than China, so I expanded the list to the top ten to see whether China even made it in. Keep in mind that the data only cover 1896-2016, so these numbers differ from the totals each country has accumulated to date, but they are still very interesting to see.

I ran into a few troubles while coding my question. I changed my question three times because I couldn’t figure out how to code the earlier versions. I also had trouble with the counting step: I had forgotten that count() existed, so I was using group_by() and summarise(), and my code wasn’t working correctly. Once I figured out that count() can do both steps for me (grouping the rows by team and totaling the gold medals for each country), I felt successful. Overall, this was a great learning experience that made me think a lot.
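The point about count() doing the work of group_by() and summarise() can be checked on a small example. This is a minimal sketch on made-up data (`toy_gold` is not real Olympic data, and the team labels "A", "B", "C" are placeholders); it assumes each row represents one gold medal, as in the filtered dataset above.

```r
library(dplyr)

# Made-up toy data: one row per gold medal won (illustration only)
toy_gold <- tibble(
  team  = c("A", "B", "A", "C", "A", "B"),
  medal = "Gold"
)

# Option 1: count() groups by team and tallies the rows in one step
with_count <- toy_gold |>
  count(team, name = "Gold_Medal_Total")

# Option 2: the same result spelled out with group_by() and summarise()
with_summarise <- toy_gold |>
  group_by(team) |>
  summarise(Gold_Medal_Total = n())

# Compare the two tables; they should hold the same teams and totals
all.equal(as.data.frame(with_count), as.data.frame(with_summarise))
```

In the longer version, `n()` is what counts the rows in each group, which is the piece that is easy to miss when writing the summarise() step by hand.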