In this report, we analyze different aspects of an Olympic athlete dataset. The dataset is comprised of roughly 271,000 unique athlete-event-year tuples, as well as whether or not this athlete (in a given year and event) had received a medal. Olympic games from 1896 to 2016 are considered in this analysis.
Here is a sample of the first few columns of the first few entries of the data:
Name Sex Age Height Weight Team NOC
1 A Dijiang M 24 180 80 China CHN
2 A Lamusi M 23 170 60 China CHN
3 Gunnar Nielsen Aaby M 24 NA NA Denmark DEN
4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN
5 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
6 Christine Jacoba Aaftink F 21 185 82 Netherlands NED
7 Christine Jacoba Aaftink F 25 185 82 Netherlands NED
8 Christine Jacoba Aaftink F 25 185 82 Netherlands NED
9 Christine Jacoba Aaftink F 27 185 82 Netherlands NED
10 Christine Jacoba Aaftink F 27 185 82 Netherlands NED
It is well-known that the USA typically dominates the Olympics in terms of raw counts of medals won. The USA has accumulated 5637 medals: 2638 gold, 1641 silver, and 1358 bronze. Second place is, in fact, the USSR (wow!), with a total of 2503 medals. Then, Germany with 2165, Great Britain with 2068, and so on:
The average number of medals/country is about 173, the median is 2 (heavily unbalanced), and there are 81 countries that have earned zero medals (as of the 2016 Olympics).
2. Medals by sport
Which sports tend to have the most medals awarded?
Depending on the type of the sport (individual or team), the age of the sport, and the sport’s popularity, not all Olympic sports have awarded the same counts of medals over the years. So, we ask, what does the distribution look like?
Athletics includes the track-and-field events (things like pole-vaulting, xx-meter sprints, hurdles, shotput, and more). As such, given the diversity of events present within “Athletics”, it makes sense to see so many awarded medals! Generally, however, much of the above graphic can be easily explained by the total number of Olympic games that the sport was present in.
Here is an adjustment for medals/game, so that we can see more clearly which games give many medals and which give few:
Now we see what we expected. Athletics still gives many of medals each time it is present in the Olympics (again, lots of events in “Athletics”), but the team sports now show many awarded medals each year. Notably, there are few winter games among the top medal-giving games! This is likely because winter games tend to be more solitary than -team-based, with exceptions such as hockey and curling.
3. Athlete ages
Are there any trends related to the ages of the participants in the Olympics?
Most likely. We can analyze both the distribution of ages, as well as inspect for trends:
Ages of Olympic athletes
Ages of athletes by year
Code
# Make density plotggplot(data = oly |>filter(Age <50), aes(x = Year, y = Age)) +ylim(0, 50) +geom_density_2d_filled() +scale_fill_manual(values =colorRampPalette(c("white", "blue4"))(8),name ="Density" ) +labs(title ="Athlete ages and Olympic games years",subtitle ="2d kernel density estimate",y ="Age (years)" ) +theme_classic()
The above plot gives us an idea of the shape of the distribution of ages of athletes by year. We can see increasing density, and therefore increasing numbers of (as the distribution remains tight) athletes as the years continue, especially since ~1980. Additionally, the average age of an athlete looks to be about 25 years old:
Indeed, this is the case. We can also note the series of mean athlete ages by year:
Code
count.ages <- oly |>group_by(Year) |>summarize(age.mean =mean(Age, na.rm =TRUE))# Time plot of mean agesggplot(data = count.ages, aes(x = Year, y = age.mean)) +ylim(0, 50) +geom_line() +labs(title ="Mean athlete ages by year",y ="Mean age (years)" ) +theme_classic()
Aside from early fluctuations in early games’ years (due to small athlete counts), the mean age of the athletes has remained quite consistent at 25-26 years old. Studies suggest that athleticism/physical ability peaks are around this age, so the trend will probably reflect this (and remain constant) in years to come.
There exists a general clustering of ages around the mean, then significant right-skew into the older ages. This is likely explained by the presence of some games that don’t necessarily demand raw athleticism; instead, some other developed technique or skill.
4. Athlete sexes
How has the sex/gender distribution of athletes changed over the years?
Sexes of Olympic athletes
Sexes of Olympic athletes
Code
# Calculate summary statisticscount.sex <- oly |>group_by(Year, ID) |>slice(1) |>ungroup() |>group_by(Year, Sex, Season) |>summarize(total =n())count.sex.winter <- count.sex |>filter(Season =="Winter")count.sex.summer <- count.sex |>filter(Season =="Summer")# Plot over timeggplot(data = count.sex.summer, aes(x = Year, y = total, fill = Sex)) +geom_area(position ="stack") +scale_fill_manual(values =c(M ="skyblue", F =alpha("hotpink2", alpha =0.75))) +labs(title ="Male vs. Female athletes",subtitle ="Summer Olympics",y ="Athletes (count)" ) +theme_classic()
Code
ggplot(data = count.sex.winter, aes(x = Year, y = total, fill = Sex)) +geom_area(position ="stack") +scale_fill_manual(values =c(M ="skyblue", F =alpha("hotpink2", alpha =0.75))) +labs(title ="Male vs. Female athletes",subtitle ="Winter Olympics",y ="Athletes (count)" ) +theme_classic()
We can see that the relative proportions of male to female athletes has always been large on the male side, holding a constant majority of athlete pools year-by-year (for both summer and winter Olympic games). The number of female athletes is increasing relative to men, and nearing a 50-50 split, especially in the summer games!
Additionally, it’s interesting that the summer games see far more athletes in general than the winter games do. This aligns with earlier conclusions about medal distributions, as the winter games typically has a greater number of solitary events than the summer games.
5. Host country performance
Does the performance of a country change when they are hosting the Olympics?
Comparison of medals won by host-country status
Performance of Canada
Code
# Find Canada host yearscan.hostyears <- oly.host |>filter(NOC =="CAN") |>pull(Year)can.presyears <- oly |>filter(NOC =="CAN") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% can.presyears)# Find awarded medals (to Canada) by yearcan.perf <- oly |>filter(NOC =="CAN", !is.na(Medal)) |>group_by(Year) |>summarize(total =n())# Find proportions of all medals won by USA, mark host yearcan.perf <- can.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% can.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(can.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(can.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (Canada)"results
Results (Canada)
1 Mean proportion of medals won (hosting): 6.85%
2 Mean proportion of medals won (not hosting): 4.57%
Code
# Make plotggplot(data = can.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="royalblue", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of Canada by year",y ="Proportion" ) +theme_classic()
Clearly, for Canada at least, they win a higher proportion of the medals (on average) while hosting than while not hosting. Note that there may be some explanation in that Canada tends to perform well in the winter games, and 2/3 of their hosting years were winter games years.
Performance of France
We can perform the same analysis with France as well:
Code
# Find France host yearsfra.hostyears <- oly.host |>filter(NOC =="FRA") |>pull(Year)fra.presyears <- oly |>filter(NOC =="FRA") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% fra.presyears)# Find awarded medals (to France) by yearfra.perf <- oly |>filter(NOC =="FRA", !is.na(Medal)) |>group_by(Year) |>summarize(total =n())# Find proportions of all medals won by USA, mark host yearfra.perf <- fra.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% fra.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(fra.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(fra.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (France)"results
Results (France)
1 Mean proportion of medals won (hosting): 14.41%
2 Mean proportion of medals won (not hosting): 4.09%
Code
# Make plotggplot(data = fra.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="red4", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of France by year",y ="Proportion" ) +theme_classic()
This one may be more thoroughly explained by the fact that, in the early modern Olympics (~1900s), there were far fewer countries present in the games, and France was one of them. By year, after that, the proportion of medals won by France seems consistent, regardless of whether or not they are hosting the games.
Performance of Italy
We can perform the same analysis with France as well:
Code
# Find France host yearsita.hostyears <- oly.host |>filter(NOC =="ITA") |>pull(Year)ita.presyears <- oly |>filter(NOC =="ITA") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% ita.presyears)# Find awarded medals (to France) by yearita.perf <- oly |>filter(NOC =="ITA") |>group_by(Year) |>summarize(total =sum(!is.na(Medal)))# Find proportions of all medals won by USA, mark host yearita.perf <- ita.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% ita.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(ita.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(ita.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (Italy)"results
Results (Italy)
1 Mean proportion of medals won (hosting): 6.14%
2 Mean proportion of medals won (not hosting): 4.2%
Code
# Make plotggplot(data = ita.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="forestgreen", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of Italy by year",y ="Proportion" ) +theme_classic()
For Italy, there is somewhat of a better case for hosting being an indicator of athlete success.
In total, we can can conclude (informally) that whether or not a country is hosting the games may impact performance of the country, in terms of proportions of total medals earned; but not greatly. There are typically too few instances of each country hosting to draw real generalizations from, with the vast majority of countries (that have hosted) hosting less than 5 times total.
Source Code
---title: "Olympic Athletes"subtitle: "STAT 360, Project 5"date: "`r Sys.Date()`"format: html: toc: true code-fold: true code-tools: true code-block-bg: true code-block-border-left: "skyblue" highlight-style: "github"editor: visual---### IntroductionIn this report, we analyze different aspects of an Olympic athlete dataset. The dataset is comprised of roughly 271,000 unique athlete-event-year tuples, as well as whether or not this athlete (in a given year and event) had received a medal. Olympic games from 1896 to 2016 are considered in this analysis.Here is a sample of the first few columns of the first few entries of the data::::{.callout collapse=true title="Sample of Olympic athlete data"}```{r, message = FALSE, warning = FALSE}library(ggplot2)library(dplyr)library(readxl)oly.host <-read.csv("/home/user/School/STAT360/Project 5 (Olympics)/olym.csv")oly <-read.csv("/home/user/School/STAT360/Project 5 (Olympics)/athlete_events.csv")oly$Host <-ifelse(paste0(oly$NOC, oly$Year) %in%paste0(oly.host$NOC, oly.host$Year), TRUE, FALSE)oly[1:10, 2:8]```:::{.callout collapse=true title="Column names of athlete data"}```{r}colnames(oly)```::::::### 1. Medal Distribution#### Which countries have historically dominated the Olympics in terms of medals won?Some key countries consistently win medals in specific events. This may be because of geography, national/social interest, lots of things::::{.callout collapse=true title="Distribution of medals by country"}#### Distribution of Olympic medal awards by country```{r, cache = TRUE}# Calculate summary statisticscount.countries <- oly |>group_by(NOC) |>summarize(gold =sum(Medal =="Gold", na.rm =TRUE),silver =sum(Medal =="Silver", na.rm =TRUE),bronze =sum(Medal =="Bronze", na.rm =TRUE) ) |>mutate(total = gold + silver + bronze) |>arrange(desc(total))# Create stacked barplotmedals.bp <-barplot(height =t(as.matrix(count.countries[1:25, -(-4:-2)])),col =c("darkorange4", "grey", "gold"),main ="Olympic medals by country",xlab ="Country",ylab ="Medals")legend(x ="topright",fill =c("gold", "grey", "darkorange4"),legend =c("Gold", "Silver", "Bronze"),bty ="n")text(x = medals.bp,y =-300,labels = count.countries$NOC[1:25],adj =1,srt =60,xpd =TRUE)text(x =median(medals.bp),y =6000,labels ="Top 25 countries",xpd =TRUE)```It is well-known that the USA typically dominates the Olympics in terms of raw counts of medals won. The USA has accumulated 5637 medals: 2638 gold, 1641 silver, and 1358 bronze. Second place is, in fact, the USSR (wow!), with a total of 2503 medals. Then, Germany with 2165, Great Britain with 2068, and so on::::{.callout collapse=true title="Ranking of countries by Olympic medals (long)"}```{r}as.data.frame(count.countries)```:::The average number of medals/country is about 173, the median is 2 (heavily unbalanced), and there are 81 countries that have earned *zero* medals (as of the 2016 Olympics).:::### 2. Medals by sport#### Which sports tend to have the most medals awarded?Depending on the type of the sport (individual or team), the age of the sport, and the sport's popularity, not all Olympic sports have awarded the same counts of medals over the years. So, we ask, what does the distribution look like?:::{.callout collapse=true title="Distribution of medals by sport"}#### Medals awarded by sport```{r, message = FALSE, warning = FALSE, cache = TRUE}# Calculate summary statisticscount.games <- oly |>group_by(Sport, Season) |>summarize(total =sum(!is.na(Medal)) ) |>arrange(desc(total)) |>mutate(color =case_when( Season =="Summer"~"khaki", Season =="Winter"~"skyblue" ) )par(oma =c(2, 0, 0, 0))# Create barplotgames.bp <-barplot(height = count.games$total[1:25],col = count.games$color[1:25],# col = colorRampPalette(c("grey", "skyblue3"))(25),main ="Olympic medals by sport",xlab ="",ylab ="Medals (absolute count)")text(x = games.bp,y =-250,labels =c(count.games$Sport[1:19], "CC Skiing", count.games$Sport[21:25]),cex =0.8,adj =1,srt =60,xpd =TRUE)text(x =median(medals.bp),y =4225,labels ="Top 25 medal-giving games",xpd =TRUE)legend(x ="topright",fill =c("khaki", "skyblue"),legend =c("Summer", "Winter"),title ="Event season",cex =0.8,bty ="n")mtext("Games",side =1,outer =TRUE)par(oma =c(0, 0, 0, 0))```Athletics includes the track-and-field events (things like pole-vaulting, xx-meter sprints, hurdles, shotput, and more). As such, given the diversity of events present within "Athletics", it makes sense to see so many awarded medals! Generally, however, much of the above graphic can be easily explained by the total number of Olympic games that the sport was present in.Here is an adjustment for medals/game, so that we can see more clearly which games give many medals and which give few:```{r, message = FALSE, warning = FALSE}# Calculate summary statisticscount.gamesadj <- oly |>group_by(Sport, Season) |>summarize(total =sum(!is.na(Medal)),years =length(unique(Year)) ) |>mutate(peryear = total/years) |>arrange(desc(peryear)) |>mutate(color =case_when( Season =="Summer"~"khaki", Season =="Winter"~"skyblue" ) )par(oma =c(2, 0, 0, 0))# Create barplotgamesadj.bp <-barplot(height = count.gamesadj$peryear[1:25],col = count.gamesadj$color[1:25],main ="Olympic medals by sport",xlab ="",ylab ="Medals (per occurance)",cex.axis =0.86)text(x = gamesadj.bp,y =-7,labels =c(count.gamesadj$Sport[1:20], "ST Speed Skating", count.gamesadj$Sport[22:25]),cex =0.8,adj =1,srt =60,xpd =TRUE)text(x =median(gamesadj.bp),y =145,labels ="Top 25 medal-giving games, (adjusted for sport presence)",cex =0.8,xpd =TRUE)legend(x ="topright",fill =c("khaki", "skyblue"),legend =c("Summer", "Winter"),title ="Event season",cex =0.8,bty ="n")mtext("Games",side =1,outer =TRUE)par(oma =c(0, 0, 0, 0))```Now we see what we expected. Athletics still gives many of medals each time it is present in the Olympics (again, lots of events in "Athletics"), but the team sports now show many awarded medals each year. Notably, there are few winter games among the top medal-giving games! This is likely because winter games tend to be more solitary than -team-based, with exceptions such as hockey and curling.:::### 3. Athlete ages#### Are there any trends related to the ages of the participants in the Olympics?Most likely. We can analyze both the distribution of ages, as well as inspect for trends::::{.callout collapse=true title="Ages of Olympic athletes"}#### Ages of athletes by year```{r, message = FALSE, warning = FALSE, cache = TRUE}# Make density plotggplot(data = oly |>filter(Age <50), aes(x = Year, y = Age)) +ylim(0, 50) +geom_density_2d_filled() +scale_fill_manual(values =colorRampPalette(c("white", "blue4"))(8),name ="Density" ) +labs(title ="Athlete ages and Olympic games years",subtitle ="2d kernel density estimate",y ="Age (years)" ) +theme_classic()```The above plot gives us an idea of the shape of the distribution of ages of athletes by year. We can see increasing density, and therefore increasing numbers of (as the distribution remains tight) athletes as the years continue, especially since ~1980. Additionally, the average age of an athlete looks to be about 25 years old:```{r}print(paste0("Mean age: ", trunc(mean(oly[oly$Age <50, ]$Age, na.rm =TRUE)*100)/100, " years"))```Indeed, this is the case. We can also note the series of mean athlete ages by year:```{r}count.ages <- oly |>group_by(Year) |>summarize(age.mean =mean(Age, na.rm =TRUE))# Time plot of mean agesggplot(data = count.ages, aes(x = Year, y = age.mean)) +ylim(0, 50) +geom_line() +labs(title ="Mean athlete ages by year",y ="Mean age (years)" ) +theme_classic()```Aside from early fluctuations in early games' years (due to small athlete counts), the mean age of the athletes has remained quite consistent at 25-26 years old. Studies suggest that athleticism/physical ability peaks are around this age, so the trend will probably reflect this (and remain constant) in years to come.#### Ages of 2010 Olympic athletesLets look at an example of a specific year, 2010:```{r}# Age histogramhist( oly[!is.na(oly$Age) & oly$Year ==2010, ]$Age,breaks =37,col =colorRampPalette(c("grey", "skyblue"))(37),main ="Olympic athlete age distribution",xlab ="Age (years)",ylab ="Count",xaxt ="n")axis(side =1,at =2*(7:25) +1,labels =2*(8:26))text(x =33.5,y =472,labels ="Winter games (2010)",cex =0.8,xpd =TRUE)abline(v =mean(oly[!is.na(oly$Age) & oly$Year ==2010, ]$Age) -1,col ="red3",lwd =2,lty =2)legend(x ="topright",fill ="red3",legend ="Mean: 26.12",cex =0.8,bty ="n")```There exists a general clustering of ages around the mean, then significant right-skew into the older ages. This is likely explained by the presence of some games that don't necessarily demand raw athleticism; instead, some other developed technique or skill.:::### 4. Athlete sexes#### How has the sex/gender distribution of athletes changed over the years?:::{.callout collapse=true title="Sexes of Olympic athletes"}#### Sexes of Olympic athletes```{r, message = FALSE, warning = FALSE, cache = TRUE}# Calculate summary statisticscount.sex <- oly |>group_by(Year, ID) |>slice(1) |>ungroup() |>group_by(Year, Sex, Season) |>summarize(total =n())count.sex.winter <- count.sex |>filter(Season =="Winter")count.sex.summer <- count.sex |>filter(Season =="Summer")# Plot over timeggplot(data = count.sex.summer, aes(x = Year, y = total, fill = Sex)) +geom_area(position ="stack") +scale_fill_manual(values =c(M ="skyblue", F =alpha("hotpink2", alpha =0.75))) +labs(title ="Male vs. Female athletes",subtitle ="Summer Olympics",y ="Athletes (count)" ) +theme_classic()ggplot(data = count.sex.winter, aes(x = Year, y = total, fill = Sex)) +geom_area(position ="stack") +scale_fill_manual(values =c(M ="skyblue", F =alpha("hotpink2", alpha =0.75))) +labs(title ="Male vs. Female athletes",subtitle ="Winter Olympics",y ="Athletes (count)" ) +theme_classic()```We can see that the relative proportions of male to female athletes has always been large on the male side, holding a constant majority of athlete pools year-by-year (for both summer and winter Olympic games). The number of female athletes *is increasing* relative to men, and nearing a 50-50 split, especially in the summer games!Additionally, it's interesting that the summer games see far more athletes in general than the winter games do. This aligns with earlier conclusions about medal distributions, as the winter games typically has a greater number of solitary events than the summer games.:::### 5. Host country performance#### Does the performance of a country change when they are hosting the Olympics?:::{.callout collapse=true title="Comparison of medals won by host-country status"}:::{.callout collapse=true title="Performance of Canada"}```{r}# Find Canada host yearscan.hostyears <- oly.host |>filter(NOC =="CAN") |>pull(Year)can.presyears <- oly |>filter(NOC =="CAN") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% can.presyears)# Find awarded medals (to Canada) by yearcan.perf <- oly |>filter(NOC =="CAN", !is.na(Medal)) |>group_by(Year) |>summarize(total =n())# Find proportions of all medals won by USA, mark host yearcan.perf <- can.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% can.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(can.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(can.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (Canada)"results# Make plotggplot(data = can.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="royalblue", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of Canada by year",y ="Proportion" ) +theme_classic()```Clearly, for Canada at least, they win a higher proportion of the medals (on average) while hosting than while not hosting. Note that there may be some explanation in that Canada tends to perform well in the winter games, and 2/3 of their hosting years were winter games years.::::::{.callout collapse=true title="Performance of France"}We can perform the same analysis with France as well:```{r}# Find France host yearsfra.hostyears <- oly.host |>filter(NOC =="FRA") |>pull(Year)fra.presyears <- oly |>filter(NOC =="FRA") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% fra.presyears)# Find awarded medals (to France) by yearfra.perf <- oly |>filter(NOC =="FRA", !is.na(Medal)) |>group_by(Year) |>summarize(total =n())# Find proportions of all medals won by USA, mark host yearfra.perf <- fra.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% fra.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(fra.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(fra.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (France)"results# Make plotggplot(data = fra.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="red4", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of France by year",y ="Proportion" ) +theme_classic()```This one may be more thoroughly explained by the fact that, in the early modern Olympics (~1900s), there were far fewer countries present in the games, and France was one of them. By year, after that, the proportion of medals won by France seems consistent, regardless of whether or not they are hosting the games.::::::{.callout collapse=true title="Performance of Italy"}We can perform the same analysis with France as well:```{r}# Find France host yearsita.hostyears <- oly.host |>filter(NOC =="ITA") |>pull(Year)ita.presyears <- oly |>filter(NOC =="ITA") |>group_by(Year) |>slice(1) |>pull(Year)# Find awarded medals by yearcount.years <- oly |>filter(!is.na(Medal)) |>group_by(Year) |>summarize(total =n()) |>filter(Year %in% ita.presyears)# Find awarded medals (to France) by yearita.perf <- oly |>filter(NOC =="ITA") |>group_by(Year) |>summarize(total =sum(!is.na(Medal)))# Find proportions of all medals won by USA, mark host yearita.perf <- ita.perf |>mutate(prop = total/count.years$total,host =ifelse(Year %in% ita.hostyears, TRUE, FALSE) )# Share resultsresults <-data.frame(c(paste0("Mean proportion of medals won (hosting): ", trunc(mean(ita.perf |>filter(host ==TRUE) |>pull(prop)*10000))/100, "%"),paste0("Mean proportion of medals won (not hosting): ", trunc(mean(ita.perf |>filter(host ==FALSE) |>pull(prop)*10000))/100, "%") ))colnames(results) <-"Results (Italy)"results# Make plotggplot(data = ita.perf, aes(x = Year, y = prop, fill = host)) +geom_col() +scale_fill_manual(name ="Host status",values =c("TRUE"="forestgreen", "FALSE"="grey50") ) +labs(title ="Proportion of total medals won",subtitle ="Performance of Italy by year",y ="Proportion" ) +theme_classic()```For Italy, there is somewhat of a better case for hosting being an indicator of athlete success.::::::In total, we can can conclude (informally) that whether or not a country is hosting the games *may* impact performance of the country, in terms of proportions of total medals earned; but not greatly. There are typically too few instances of each country hosting to draw real generalizations from, with the vast majority of countries (that have hosted) hosting less than 5 times *total*.