BAIS Final Project
Final Project: What characteristics help people in winning an Olympic Medal?
This analysis uses two data sets to determine which characteristics help create an Olympic champion: a historical data set from Kaggle with athlete information from 1896 to 2016, and one scraped from Olympics.com that focuses on the 2024 Paris Games. The Paris data set is limited to United States athletes in three popular sports: athletics (track and field and other footraces), artistic gymnastics, and volleyball. The data includes various Olympic athletes, some of their characteristics, and the highest medal that they won. I spent much of the summer watching the Olympic Games, so I found this topic interesting.
Although I am exploring generally what characteristics help people to win in the Olympics, my country of focus is the United States Olympic team.
One thing to note is that the Winter and Summer Olympics were held in the same year through 1992; since 1994 they have alternated every two years. Because the Paris Olympics was a Summer Games, comparisons with the historical data, which covers both seasons, need to take the season into account.
Below is the data dictionary from Kaggle for the larger data set. Most of the variables are straightforward: height, age, weight, sport, etc. NOC stands for National Olympic Committee, and this variable holds each country’s three-letter code (USA, MEX, FRA, etc.). The 2024 Olympics data includes similar variables, plus a general interest variable: free text about the athlete, such as how they got into the sport, their motivations, and other personal information they chose to include.
Understanding the Variables
Data Dictionary
Variable Name | Description |
---|---|
ID | Unique number for each athlete |
Name | Athlete’s name |
Sex | Gender of the athlete, M or F |
Age | Age of the athlete |
Height (cm) | Height of the athlete in centimeters |
Weight (kg) | Weight of the athlete in kilograms |
Team | Team name |
NOC | National Olympic Committee 3-letter code |
Games | The Olympic Games year and season |
Year | Year of the Games |
Season | Summer or Winter |
City | The city where the Games are held |
Sport | The sport in which the athlete competes |
Event | The event within the sport |
Medal | Gold, Silver, Bronze, or NA (The medal won, if any) |
Summary Statistics
Variable | Mean | Median | Min | Max |
---|---|---|---|---|
Age | 25.56 | 24 | 10 | 97 |
Height (cm) | 175.3 | 175 | 127 | 226 |
Weight (kg) | 70.7 | 70 | 25 | 214 |
The variables all have very broad ranges, which likely vary with the sport the athlete competes in. The average age is about 25.6, the average height is 175.3 centimeters (about 5’9”), and the average weight is 70.7 kilograms (about 156 lbs).
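The imperial equivalents of the mean height and weight can be checked with a quick conversion. This is a minimal sketch in base R; the variable names are illustrative and the two means are copied from the summary table above.

```r
# Conversion factors (both are exact by definition)
cm_per_inch <- 2.54
kg_per_lb   <- 0.45359237

# Means copied from the summary statistics table
mean_height_cm <- 175.3
mean_weight_kg <- 70.7

total_inches <- mean_height_cm / cm_per_inch
feet      <- total_inches %/% 12          # whole feet
inches    <- round(total_inches %% 12)    # remaining inches, rounded
weight_lb <- round(mean_weight_kg / kg_per_lb)

cat(sprintf("%d ft %d in, %d lb\n", feet, inches, weight_lb))
```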
This distribution shows the large range of ages and the presence of age outliers. Generally, athletes are on the younger end, when people tend to be in their athletic prime.
Another interesting thing to note is that the gender gap has shrunk as the Olympics have progressed over the years. In the Paris Olympics, there was almost a 1:1 ratio of male to female athletes, but this has not been the case historically.
We can see that men far outnumber women in the historical Olympic data. To examine these changes over time, I created a time series graph divided by the season of the Olympics, because the Winter and Summer Games were held in the same year through 1992 and in alternating even years after that, which complicates a single time series. Over time, entries into the Olympics have increased overall, and the gap between the numbers of men and women competing has narrowed, which is more noticeable in the graph depicting the Summer Olympics.
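The counts behind a time series like this could be built by tallying entries per year, season, and sex. The sketch below uses dplyr with a tiny made-up table; the `athletes` object and its values are placeholders, and only the column names (`Year`, `Season`, `Sex`) come from the data dictionary.

```r
library(dplyr)

# Toy stand-in for the Kaggle athlete table (values are made up)
athletes <- tibble(
  Year   = c(1900, 1900, 1900, 2016, 2016, 2016, 2016),
  Season = rep("Summer", 7),
  Sex    = c("M", "M", "F", "M", "F", "F", "M")
)

# Entries per Games and sex, plus each sex's share within its Games
entries_by_sex <- athletes %>%
  count(Year, Season, Sex, name = "entries") %>%
  group_by(Year, Season) %>%
  mutate(share = entries / sum(entries)) %>%
  ungroup()

print(entries_by_sex)
```

Plotting `entries` against `Year`, colored by `Sex` and split by `Season`, would give one panel per season.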
Analysis
Is Age a Determinant of Winning a Medal?
To answer this question, I created box plots comparing the ages of athletes by the medal they won (Gold, Silver, Bronze, or none).
There doesn’t seem to be a large visual difference in age among the medalist groups, but the median age of non-medalists appears to be slightly lower than that of medalists. This could indicate that older athletes, who have more experience, are able to perform better on the Olympic stage.
What About Height or Weight?
For both Height and Weight, we see trends similar to Age: non-medalists appear to have slightly lower heights and weights than medalists, while the Gold, Silver, and Bronze groups do not seem to differ much from one another. Olympic medalists appear to be taller and heavier than the average Olympian.
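The visual comparison in the box plots can be backed up with group medians. The following is a sketch under stated assumptions, not the code used for the plots: the toy `athletes` table is invented, and a medalist is assumed to be any row whose `Medal` is not `NA`, as the data dictionary suggests.

```r
library(dplyr)

# Toy stand-in for the athlete table (values are made up)
athletes <- tibble(
  Age    = c(22, 25, 30, 19, 27, NA),
  Height = c(170, 182, 178, 165, NA, 190),
  Weight = c(60, 80, 75, 55, 72, 95),
  Medal  = c(NA, "Gold", "Bronze", NA, "Silver", NA)
)

# Median age, height, and weight for medalists vs. non-medalists,
# ignoring missing measurements
medians_by_status <- athletes %>%
  mutate(medalist = !is.na(Medal)) %>%
  group_by(medalist) %>%
  summarise(
    across(c(Age, Height, Weight), ~ median(.x, na.rm = TRUE)),
    n = n()
  )

print(medians_by_status)
```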
Is the U.S. Really More Likely to Win?
The US has more entries than any other country, so I wanted to explore whether the US is actually more likely to win or whether it simply has the most athletes competing, which results in higher medal counts. To explore this, I computed the average number of medals per entry for each country and created a scatter plot and a table of the results.
The dot in the top right is the United States. The US leads in total medals by a significant margin, but it sits below the regression line, meaning the model would predict even more medals for that many entries. Because the US is such an extreme outlier, though, the regression is likely not accurate for it.
# A tibble: 147 × 4
NOC total_wins total_entries win_rate
<chr> <int> <int> <dbl>
1 URS 2503 5685 0.440
2 GDR 1005 2645 0.380
3 ANZ 29 86 0.337
4 EUN 279 864 0.323
5 USA 5637 18853 0.299
6 RUS 1165 5143 0.227
7 GER 2165 9830 0.220
8 SRB 85 392 0.217
9 PAK 121 562 0.215
10 NOR 1033 4960 0.208
# ℹ 137 more rows
The funny thing about this table is that the four teams above the USA no longer exist, and the team just below the US is currently not allowed to compete as a national team. The top four teams are the USSR, East Germany, Australasia, and the Unified Team, which consisted of athletes from former Soviet republics following the fall of the USSR. Given the doping scandals that were later uncovered, it is unsurprising that some of those teams have high win rates (the USSR's is 44%!). Based on this table, the USA clearly has a very impressive win rate at almost 30%, meaning almost 1 in 3 entries by US athletes ends in a medal. So, being an athlete in the United States looks like it provides an upper hand over competitors. This might be because the US is a rich country and can provide more resources to prepare its athletes than other nations can. Among the top 10, the nations that are still able to compete include other generally rich countries, such as Germany and Norway.
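A table with the same columns as the one above could be produced along these lines. This is a sketch on invented data, assuming one row per entry and that a missing `Medal` means no medal; the original analysis may also have filtered out countries with very few entries or no medals, which is not shown here.

```r
library(dplyr)

# Toy stand-in: one row per athlete entry (values are made up)
entries <- tibble(
  NOC   = c("USA", "USA", "USA", "NOR", "NOR", "MEX"),
  Medal = c("Gold", NA, "Silver", NA, "Bronze", NA)
)

# Medals, entries, and medals per entry for each NOC
win_rates <- entries %>%
  group_by(NOC) %>%
  summarise(
    total_wins    = sum(!is.na(Medal)),
    total_entries = n(),
    win_rate      = total_wins / total_entries
  ) %>%
  arrange(desc(win_rate))

print(win_rates)
```

Because the rate is computed per entry, an athlete who competes in several events contributes several rows.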
In Which Season Does the US Win the Most?
Since we determined that the US has the best win rate of any currently competing Olympic nation, I wanted to dive deeper into areas where the US can improve its medal count. To do this, I explored which season the US is most successful in.
Based on this line graph, the Winter Olympics yield significantly fewer medals than the Summer Olympics, but this could simply be because the Winter Olympics have fewer events than the Summer Olympics, which are stacked with track and field and swimming events that the US generally excels in. There appears to be a trend in which the US has fairly consistently increased its medal count from one Olympics to the next. After 1992, when the Winter and Summer Games stopped being held in the same year, there appears to have been an uptick in the number of events and possible medals based on this graph. This presents an opportunity for the US to increase its winnings. It also loosely mirrors the earlier time series gender graph, which showed how entries into the Games have increased. As entries and events increase, it makes sense that medal counts would also increase.
# A tibble: 50 × 3
# Groups: Year [35]
Year Season Wins
<dbl> <chr> <dbl>
1 1904 Summer 394
2 1984 Summer 352
3 2008 Summer 317
4 2016 Summer 264
5 2004 Summer 263
6 1996 Summer 259
7 2012 Summer 248
8 2000 Summer 242
9 1992 Summer 224
10 1988 Summer 207
# ℹ 40 more rows
We also see on that graph and in this table that the US has not been able to beat their peak medal count from the 1904 Summer Olympics at 394 medals. It would be interesting to explore this outlier to see what caused this spike in winnings.
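The per-Games counts in the table above could be reproduced with a grouped summary like the one below. It is a sketch on invented rows, assuming one row per athlete per event, so a team medal would be counted once for every team member.

```r
library(dplyr)

# Toy stand-in: one row per athlete entry (values are made up)
entries <- tibble(
  NOC    = c("USA", "USA", "USA", "USA", "FRA", "USA"),
  Year   = c(1904, 1904, 1992, 1992, 1992, 1992),
  Season = c("Summer", "Summer", "Summer", "Winter", "Winter", "Winter"),
  Medal  = c("Gold", "Silver", "Bronze", NA, "Gold", "Silver")
)

# US medal entries per Games, largest first
us_wins <- entries %>%
  filter(NOC == "USA") %>%
  group_by(Year, Season) %>%
  summarise(Wins = sum(!is.na(Medal)), .groups = "drop") %>%
  arrange(desc(Wins))

print(us_wins)
```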
Paris 2024 Comparisons
To continue my analysis, I scraped the Olympics website for data on the 2024 Paris Olympics to compare the new and modern Olympics to the historical data.
Gender Comparison
As we explored previously, the Olympics historically skewed male, but this has been changing in more recent Games, so I created a bar chart to demonstrate this change.
This graph clearly demonstrates how the demographics of the Olympics have changed over time. Historically, the ratio looks like roughly 3:1 male to female, but in 2024 it has evened out. Because the Paris data set is a limited sample rather than a complete data set like the historical one, more female athletes happen to appear in it, even though men still outnumbered women in Paris.
Sentiment/Text Analysis
Finally, I thought it would be interesting to explore what emotions appear in the general interest variable for the athletes. These entries describe the athlete personally and professionally and are generally written in a positive light, so I expect the results to be generally positive. I removed words that would be difficult to interpret or wouldn’t be helpful, like "olympics", "games", and other similar words. I also removed any words that contained numbers or other non-alphabetic characters, any word with two or fewer characters, and the names of sports.
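The cleaning rules described above might look like the following. This is a minimal sketch using base R string functions with dplyr, not the code from the original analysis (which may well have used `tidytext::unnest_tokens()` and its `stop_words` table); the `bios` text and the `custom_stop` list are invented for illustration.

```r
library(dplyr)

# Made-up examples of general interest text
bios <- c(
  "She trained at the University of Texas with her coach since 2015.",
  "His coach and team pushed him to the Olympics games."
)

# Hypothetical stop list: common words plus uninformative domain terms
custom_stop <- c("olympics", "games", "the", "and", "her", "his",
                 "him", "she", "with", "since")

word_counts <- tibble(word = unlist(strsplit(tolower(bios), "\\s+"))) %>%
  # strip leading and trailing punctuation from each token
  mutate(word = gsub("^[[:punct:]]+|[[:punct:]]+$", "", word)) %>%
  filter(
    grepl("^[a-z]+$", word),   # letters only: drops numbers and mixed tokens
    nchar(word) > 2,           # drops words of two or fewer characters
    !word %in% custom_stop     # drops stop words and domain terms
  ) %>%
  count(word, sort = TRUE)

print(word_counts)
```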
First, I found the word counts, then visualized them, and then did a sentiment analysis.
# A tibble: 1,550 × 2
word n
<chr> <int>
1 university 22
2 coach 20
3 high 17
4 team 17
5 life 15
6 athletes 14
7 day 14
8 year 14
9 jump 13
10 school 13
# ℹ 1,540 more rows
The top words include university, high, school, team, coach, and life. These athletes are likely describing their path to the Olympics (through university or high school, maybe with the help of a coach) and their lives in general, which probably revolve around their teams and their coaches. I decided to create a visualization of these words to show the most prevalent ones and to see if any other interesting words appear.
In this visualization, we also see the athletes describe training, winning, work, and family.
Finally, I ran a sentiment analysis using the NRC lexicon, which tags each word with the sentiments it conveys.
As expected, many of the words convey positive, trust, anticipation, and joy sentiments. There are significantly fewer words that convey anger, disgust, sadness, or surprise, although there are quite a few negative words. These could be words describing the hard work that athletes have to endure.
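Lexicon-based tagging of this kind amounts to joining the word counts to a word-to-sentiment table and summing. The sketch below uses a tiny hypothetical lexicon in the same `word`/`sentiment` layout so that it runs without downloads; in practice the real NRC table is available through `tidytext::get_sentiments("nrc")`, which needs the textdata package. All words and counts here are invented.

```r
library(dplyr)

# Made-up word counts in the same shape as the table above
word_counts <- tibble(
  word = c("coach", "win", "injury", "team"),
  n    = c(20, 9, 4, 17)
)

# Hypothetical mini-lexicon; a word can carry several sentiments
toy_lexicon <- tibble(
  word      = c("coach", "win", "win", "injury", "injury"),
  sentiment = c("trust", "positive", "joy", "negative", "sadness")
)

# Words missing from the lexicon (here "team") drop out of the join
sentiment_totals <- word_counts %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment, wt = n, name = "total", sort = TRUE)

print(sentiment_totals)
```

Weighting by `n` means a frequent word contributes more to its sentiment totals than a rare one.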
Conclusion
Age, Height, and Weight are all slightly higher for Medalists compared to non-medalists
The US has the highest win rate of any currently active nation
The US is most successful in the Summer Olympics for medal counts
Gender Inequality in the Olympics has decreased over the years
Athletes’ descriptions of their General Interests are typically positive