Do not change anything in the following chunk
You will be working on olympic_gymnasts dataset. Do not change the code below:
olympics <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/olympics.csv')
olympic_gymnasts <- olympics %>%
filter(!is.na(age)) %>% # only keep athletes with known age
filter(sport == "Gymnastics") %>% # keep only gymnasts
mutate(
medalist = case_when( # add column for success in medaling
is.na(medal) ~ FALSE, # NA values go to FALSE
!is.na(medal) ~ TRUE # non-NA values (Gold, Silver, Bronze) go to TRUE
)
)
More information about the dataset can be found at
https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-07-27/readme.md
Question 1: Create a subset dataset with the following columns only: name, sex, age, team, year and medalist. Call it df.
df<- olympic_gymnasts|>
select(name, sex, age, team, year, medalist)
df
## # A tibble: 25,528 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 2 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 3 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 4 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 5 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 6 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 7 Paavo Johannes Aaltonen M 28 Finland 1948 FALSE
## 8 Paavo Johannes Aaltonen M 28 Finland 1948 TRUE
## 9 Paavo Johannes Aaltonen M 32 Finland 1952 FALSE
## 10 Paavo Johannes Aaltonen M 32 Finland 1952 TRUE
## # ℹ 25,518 more rows
Question 2: From df create df2 that only have year of 2008 2012, and 2016
# at least a couple of ways this can be done: using logical tests and OR operator
df2 <- df |>
filter (year == 2008 | year == 2012 | year == 2016)
df2
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
# or in a more compact way, using the %in% operator
df2 <- df |>
filter (year %in% c(2008, 2012, 2016))
df2
## # A tibble: 2,703 × 6
## name sex age team year medalist
## <chr> <chr> <dbl> <chr> <dbl> <lgl>
## 1 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 2 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 3 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 4 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 5 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 6 Nstor Abad Sanjun M 23 Spain 2016 FALSE
## 7 Katja Abel F 25 Germany 2008 FALSE
## 8 Katja Abel F 25 Germany 2008 FALSE
## 9 Katja Abel F 25 Germany 2008 FALSE
## 10 Katja Abel F 25 Germany 2008 FALSE
## # ℹ 2,693 more rows
Question 3 Group by these three years (2008,2012, and 2016) and summarize the mean of the age in each group.
df2 |> group_by (year) |>
summarize (avg_age = mean(age, na.rm=TRUE)) #NAs were removed at the top, but as a precaution, I'm adding the na.rm argument
## # A tibble: 3 × 2
## year avg_age
## <dbl> <dbl>
## 1 2008 21.6
## 2 2012 21.9
## 3 2016 22.2
Question 4 Use olympic_gymnasts dataset, group by year, and find the mean of the age for each year, call this dataset oly_year. (optional after creating the dataset, find the minimum average age)
oly_year <- olympic_gymnasts |> group_by (year) |>
summarize (avg_age = mean (age, na.rm=TRUE)) #age NA were removed at the top so it's not necessary to repeat, but done just for consistency
oly_year
## # A tibble: 29 × 2
## year avg_age
## <dbl> <dbl>
## 1 1896 24.3
## 2 1900 22.2
## 3 1904 25.1
## 4 1906 24.7
## 5 1908 23.2
## 6 1912 24.2
## 7 1920 26.7
## 8 1924 27.6
## 9 1928 25.6
## 10 1932 23.9
## # ℹ 19 more rows
# the minimum avg age can be directly found as:
oly_year |> min(oly_year$avg_age, na.rm=TRUE) #na.rm done for consistency
## [1] 19.86606
# but it is valuable to show the year that minimum occurred. For that, I'll use filter():
oly_year |> filter (avg_age == min(avg_age, na.rm=TRUE)) # find row with the min avg_age
## # A tibble: 1 × 2
## year avg_age
## <dbl> <dbl>
## 1 1988 19.9
Question 5 This question is open ended. Create a question that requires you to use at least two verbs. Create a code that answers your question. Then below the chunk, reflect on your question choice and coding procedure
My question is: how are teams/countries ranked according to “efficacy at winning medals” over the entire history covered by this dataset? “Efficacy” can be defined as the fraction of gymnasts that won medals for each team/country. Although this fraction varies by year, for simplicity I calculate the average efficacy over all the years. Teams are better identified by the variable NOC (national olympic committee code) and correspond to countries but not exactly. For example ITA corresponds to two team: Italy and Pistoja/Firenze. This was dealt with as discussed below.
# LONG APPROACH: I first went through a long and stepwise approach: grouping the dataset by noc, summarizing to get the total number of gymnast entries by noc, filtering the entries that had medalists, summarizing again to get the number of medals by noc, and then dividing number of medals by total entries to get the efficacy ratios. This took 8 lines in R plus more lines to merge intermediate dataframes and display intermediate results.
# SHORT APPROACH: after days of playing with this problem and reading more about the use of summarize() I realized I could get all the calculations done within **a single summarize() call**. Thus, in the shorter approach, I grouped the dataset by noc, and inside the call to summarize(), I added the number of entries by noc using n(), added the number of medalists using sum(), which coerces TRUE to 1 and FALSE to 0, and calculated efficacy as the ratio of number of medalists to number of entries by noc. Finally, I arranged the dataframe in descending order of efficacy. This approach took only 4 lines of R within the same pipe. Since the ratio of number of medalists to number of entries in each noc is by definition the mean value of medalists in each noc, I then simplified the calculation even further by calling mean() of medalists inside the summarize() function. Mean(medalist) automatically adds up all medalists and adds up all entries, and calculates the fraction (here called 'efficacy').
#SHORT APPROACH, by piping all verbs and not making duplicate calls to summarize():
eff_sorted <- olympic_gymnasts |> # pipe the original olympic_gymnasts dataset created above
select(team, noc, medalist) |> # select columns: team, noc, medalist, to create a reduced dataframe
group_by(noc) |> # group by noc
summarize( # use summarize to
ntot = n(), # 1. count the number of entries per noc
nmedals = sum(medalist, na.rm = TRUE), # 2. sum the number of medalists after removing the NAs
efficacy = nmedals/ntot) |> # 3. calculate efficacy as ratio of medalists to total entries by noc
arrange(desc(efficacy)) # then, arrange in descending order of efficacy.
eff_sorted
## # A tibble: 93 × 4
## noc ntot nmedals efficacy
## <chr> <int> <int> <dbl>
## 1 DEN 292 110 0.377
## 2 URS 797 288 0.361
## 3 NOR 346 121 0.350
## 4 EUN 84 28 0.333
## 5 SWE 489 103 0.211
## 6 GDR 420 86 0.205
## 7 RUS 408 83 0.203
## 8 CHN 641 109 0.170
## 9 FIN 713 99 0.139
## 10 JPN 1216 166 0.137
## # ℹ 83 more rows
#AN ALTERNATIVE APPROACH, by exploiting the definition of mean() of medalists within each group, which is the sum of medalists divided by number of entries, which is what I called 'efficacy'
eff_sorted2 <- olympic_gymnasts |> # pipe the olympic_gymnasts dataset
select(team, noc, medalist) |> # select columns: team, noc, medalist
group_by(noc) |> # group by noc
summarize( # use summarize to
efficacy = mean (medalist, na.rm = TRUE)) |> # calculate efficacy using mean(): which is the ratio of number of medalists to total entries by noc
arrange(desc(efficacy)) # then, arrange in descending order of efficacy.
eff_sorted2
## # A tibble: 93 × 2
## noc efficacy
## <chr> <dbl>
## 1 DEN 0.377
## 2 URS 0.361
## 3 NOR 0.350
## 4 EUN 0.333
## 5 SWE 0.211
## 6 GDR 0.205
## 7 RUS 0.203
## 8 CHN 0.170
## 9 FIN 0.139
## 10 JPN 0.137
## # ℹ 83 more rows
Discussion: Enter your discussion of results here.
The simple question of which teams/countries have historically been the most efficient in winning medals turned out to be more interesting and difficult to deal with in R than I thought. I had to take a simple approach and not deal with all of the various nuances. For example, some countries have changed names and geographical area over the historic period covered by the dataset. Also, some of the team names in the dataset are not countries (for example team: Philadelphia Turngemeinde, Philadelphia). Since the noc for that team is USA, I interpreted that its athletes competed for USA. In addition, the answers could have been more interesting if the color of medal had been taken into account. Or if the question had also asked about the efficacy of individual athletes. I did not deal with those issues to avoid converting this question into a massive project.
The results were somewhat unexpected to me based on my incorrect assumption that countries with higher number of gymnasts are the most efficient. However, Denmark is the most efficient team (efficacy: 0.38), without having the largest number of gymnasts. Norway’s high efficacy (0.35) is also unexpected. Other teams followed my assumption, particularly USSR/Russia/EUN, Germany/East Germany/GDR and China. The USA, having one the largest number of gymnasts, is thirteenth in terms of efficacy in winning medals (0.10). Interestingly, one can explore historical questions such as: how have countries performed before and after major disruptions. As simplistic examples, take USSR before and after splitting. USSR was second (efficacy of 0.36), EUN (Unified Team of 1992 olympics) was fourth (efficacy of 0.33) and after splitting, Russia is seventh (efficacy of 0.20). Before the Berlin wall collapse, East Germany was sixth (efficacy of 0.20), West Germany was thirty-sixth and after reunification, Germany is sixteenth (efficacy of 0.08).
In terms of R, I found that the functions covered in the lectures were sufficient to deal with my question. However, it took much effort to learn the appropriate way to use summarize(); for example that summarize removes the columns that are not explicitly entered. Also that group_by() two variables (noc, team) prior to summarize() does not produce the desired results by noc (e.g., Italy and Pistoja/Firenze are grouped separately although both belong to noc ITA). Finally, I learned that summarize() allows calculations as arguments to produce new columns (similar to mutate()), simplifying the pipe significantly.
All and all, it was a useful exercise for learning R. This dataset gives an opportunity to explore various types of questions, even of world history.
Publish in RPubs
Published at https://rpubs.com/rmiranda/1348319