Homework 3

Do not change anything in the following chunk

You will be working on olympic_gymnasts dataset. Do not change the code below:

olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')

olympic_gymnasts <- olympics %>% 
  filter(!is.na(age)) %>%             # only keep athletes with known age
  filter(sport == "Gymnastics") %>%   # keep only gymnasts
  mutate(
    medalist = case_when(             # add column for success in medaling
      is.na(medal) ~ FALSE,           # NA values go to FALSE
      !is.na(medal) ~ TRUE            # non-NA values (Gold, Silver, Bronze) go to TRUE
    )
  )

More information about the dataset can be found at

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md

Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.

df<- olympic_gymnasts|>
  select(name, sex, age, team, year, medalist)
head(df)

## # A tibble: 6 × 6
##   name                    sex     age team     year medalist
##   <chr>                   <chr> <dbl> <chr>   <dbl> <lgl>   
## 1 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
## 2 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
## 3 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
## 4 Paavo Johannes Aaltonen M        28 Finland  1948 TRUE    
## 5 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE   
## 6 Paavo Johannes Aaltonen M        28 Finland  1948 FALSE

Question 2: From df create df2 that only have year of 2008 2012, and 2016

df2 <- df |>
  select(name, sex, age, team, year, medalist) |>
  filter(year == c("2008", "2012", "2016"))

## Warning: There was 1 warning in `filter()`.
## ℹ In argument: `year == c("2008", "2012", "2016")`.
## Caused by warning in `year == c("2008", "2012", "2016")`:
## ! longer object length is not a multiple of shorter object length

head(df2)

## # A tibble: 6 × 6
##   name                        sex     age team     year medalist
##   <chr>                       <chr> <dbl> <chr>   <dbl> <lgl>   
## 1 Nstor Abad Sanjun           M        23 Spain    2016 FALSE   
## 2 Nstor Abad Sanjun           M        23 Spain    2016 FALSE   
## 3 Katja Abel                  F        25 Germany  2008 FALSE   
## 4 Denis Mikhaylovich Ablyazin M        19 Russia   2012 TRUE    
## 5 Denis Mikhaylovich Ablyazin M        19 Russia   2012 FALSE   
## 6 Denis Mikhaylovich Ablyazin M        24 Russia   2016 TRUE

Question 3 Group by these three years (2008,2012, and 2016) and summarize the mean of the age in each group.

df3 <- df2 |>
  group_by(year) |>
  summarize(mean = mean(age))
head(df3)

## # A tibble: 3 × 2
##    year  mean
##   <dbl> <dbl>
## 1  2008  21.7
## 2  2012  22.0
## 3  2016  22.2

Question 4 Use olympic_gymnasts dataset, group by year, and find the mean of the age for each year, call this dataset oly_year. (optional after creating the dataset, find the minimum average age)

oly_year <- olympic_gymnasts |>
  group_by(year) |>
  summarize(mean = mean(age))
  head(oly_year)

## # A tibble: 6 × 2
##    year  mean
##   <dbl> <dbl>
## 1  1896  24.3
## 2  1900  22.2
## 3  1904  25.1
## 4  1906  24.7
## 5  1908  23.2
## 6  1912  24.2

Question 5 This question is open ended. Create a question that requires you to use at least two verbs. Create a code that answers your question. Then below the chunk, reflect on your question choice and coding procedure

Which countries had the top six Gold Medal counts, and what are the counts?

country_medals <- olympic_gymnasts|>
  filter(!is.na(team) & !is.na(medal)) |>
  filter(medal == "Gold") |>
  select(team, medal) |>
  group_by(team) |> 
  count(medal) |>
arrange(desc(n))
head(country_medals)

## # A tibble: 6 × 3
## # Groups:   team [6]
##   team          medal     n
##   <chr>         <chr> <int>
## 1 Soviet Union  Gold    141
## 2 Sweden        Gold     95
## 3 Italy         Gold     68
## 4 Japan         Gold     65
## 5 United States Gold     55
## 6 China         Gold     45

Discussion:

I chose this question because I believe it is the easiest to go into depth with. I first started by filtering out NA values for the team and medal variables, as it allowed me to have a cleaner data set. Next, I filtered the medals to Gold only, as I felt that they are the most meaningful and recognizable in the real world. Next, I selected the team and medal variables as those were the ones that I would be using. Then I used group_by, to group my count by the team variable, without this line we would not be able to see the team column in the tibble. I then used count to count the medals. Lastly, to make my question viable, I arranged the values in descending order, going from most to least medals, which made it possible to answer my question. So, the top six countries with the most gold medal counts was 1st. Soviet Union (141), 2nd. Sweden (95), 3rd. Italy (68), 4th. Japan (65), 5th. United States (55), 6th. China(45).