BAIS Final Project

Author

Emily

Final Project: What Characteristics Help Athletes Win an Olympic Medal?

This analysis uses two data sets to explore which characteristics are associated with winning an Olympic medal. The first, from Kaggle, contains athlete information from the 1896 through 2016 Games. The second was scraped from Olympics.com and covers the 2024 Paris Games. The Paris data set is limited to United States athletes in three popular sports: “athletics” (track and field and other footraces), artistic gymnastics, and volleyball. The data include various Olympic athletes, some of their characteristics, and the highest medal that they won. I spent much of the summer watching the Olympic Games, so I found this topic interesting.

Although I am exploring what helps athletes win at the Olympics in general, my focus is on the United States Olympic team.

One thing to note is that the Winter and Summer Games were held in the same year through 1992; since 1994 they have been staggered two years apart. The historical data set therefore mixes Winter and Summer results, while Paris 2024 was a Summer Games only, so comparisons between the two should keep season in mind.

Below is the data dictionary from Kaggle for the larger data set. Most of the variables are straightforward: height, age, weight, sport, etc. NOC stands for National Olympic Committee, and this variable holds each country’s three-letter code (USA, MEX, FRA, etc.). The 2024 Olympics data includes similar variables, plus a general interest variable: free text describing the athlete, such as how they got into the sport, their motivations, and other personal information they chose to include.

Understanding the Variables

Data Dictionary

Variable Name   Description
ID              Unique number for each athlete
Name            Athlete’s name
Sex             Gender of the athlete, M or F
Age             Age of the athlete
Height (cm)     Height of the athlete in centimeters
Weight (kg)     Weight of the athlete in kilograms
Team            Team name
NOC             National Olympic Committee 3-letter code
Games           The Olympic Games year and season
Year            Year of the Games
Season          Summer or Winter
City            The city where the Games are held
Sport           The sport in which the athlete competes
Event           The event within the sport
Medal           Gold, Silver, Bronze, or NA (the medal won, if any)

Summary Statistics

Variable      Mean    Median   Min   Max
Age           25.56   24       10    97
Height (cm)   175.3   175      127   226
Weight (kg)   70.7    70       25    214

The variables all have very broad ranges that likely vary with the sport the athlete competes in. The average age is about 25.6, the average height is 175.3 centimeters (about 5’9”), and the average weight is 70.7 kilograms (about 156 lbs).
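The unit conversions above can be checked with a couple of small helpers. This is only a sketch in base R; the function names `cm_to_ft_in` and `kg_to_lb` are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helpers for converting the reported means to imperial units.
cm_to_ft_in <- function(cm) {
  total_in <- round(cm / 2.54)  # 1 inch is exactly 2.54 cm
  c(feet = total_in %/% 12, inches = total_in %% 12)
}

kg_to_lb <- function(kg) {
  kg * 2.20462  # 1 kg is approximately 2.20462 lb
}

mean_height <- cm_to_ft_in(175.3)  # mean height from the summary table
mean_weight <- kg_to_lb(70.7)      # mean weight from the summary table

print(mean_height)         # feet = 5, inches = 9
print(round(mean_weight))  # 156
```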

The age distribution shows the large range and the presence of outliers (ages run from 10 to 97). Most athletes are on the younger end, when people tend to be in their athletic prime.

Another interesting thing to note is that, as the Olympics have progressed over the years, the gender gap has shrunk. In the Paris Olympics the ratio of male to female athletes was almost 1:1, but this has not been the case historically.

We can see that men far outnumber women in the historical Olympic data. To examine these changes over time, I created a time series graph divided by season, because the Winter and Summer Games shared a year until 1992 and have alternated since, which makes a single combined series hard to read. Over time, entries into the Olympics have increased overall, and the gap between the number of men and women competing has narrowed, which is more noticeable in the graph for the Summer Olympics.
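One way to put a number on this trend is to compute the share of female entries for each year and season. The sketch below uses a tiny made-up data frame (the values are hypothetical, not from the Kaggle file) with the same `Year`, `Season`, and `Sex` columns as the data dictionary.

```r
# Toy data: one row per athlete entry. All values are hypothetical.
entries <- data.frame(
  Year   = c(1924, 1924, 1924, 1924, 1924, 1924,
             2016, 2016, 2016, 2016, 2014, 2014),
  Season = c("Summer", "Summer", "Summer", "Summer", "Winter", "Winter",
             "Summer", "Summer", "Summer", "Summer", "Winter", "Winter"),
  Sex    = c("M", "M", "M", "F", "M", "M",
             "M", "F", "M", "F", "M", "F")
)

# Share of entries that are female, by Year (rows) and Season (columns).
# Combinations with no Games (e.g. Summer 2014) come back as NA.
female_share <- with(entries, tapply(Sex == "F", list(Year, Season), mean))
print(female_share)
```

Plotting each column of `female_share` against year would give one line per season, which avoids mixing the two series once the Games stopped sharing a year.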

Analysis

Is Age a Determinant of Winning a Medal?

To answer this question, I created box plots comparing the ages of athletes by the medal they won (or no medal).

There doesn’t seem to be a large visual difference in age among the medal groups, but the median age of non-medalists appears slightly lower than that of medalists. This could indicate that older athletes with more experience perform better on the Olympic stage.
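A comparison like this can be sketched in base R as follows. The data frame is a small hypothetical sample, and the `Status` column is an assumed helper that turns the missing `Medal` values into a "No medal" group; it is not taken from the original code.

```r
# Toy data: ages and medals for ten hypothetical athletes.
athletes <- data.frame(
  Age   = c(19, 22, 24, 21, 23, 26, 25, 28, 24, 27),
  Medal = c(NA, NA, NA, NA, NA, "Gold", "Silver", "Bronze", "Gold", "Silver")
)

# Treat a missing medal as its own group so it appears in the plot.
athletes$Status <- ifelse(is.na(athletes$Medal), "No medal", athletes$Medal)
athletes$Status <- factor(athletes$Status,
                          levels = c("No medal", "Bronze", "Silver", "Gold"))

# Median age per group, the number the box plot's center line shows.
medians <- tapply(athletes$Age, athletes$Status, median, na.rm = TRUE)
print(medians)

# One box per group.
boxplot(Age ~ Status, data = athletes, ylab = "Age (years)")
```

The same pattern works for height and weight by swapping the variable on the left of the formula.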

What About Height or Weight?

For both height and weight, we see trends similar to age: non-medalists appear to have slightly lower heights and weights than medalists, while gold, silver, and bronze medalists do not differ noticeably from one another. Olympic medalists appear to be taller and heavier than the average Olympian.

Is the U.S. Really More Likely to Win?

The US has more entries than any other country, so I wanted to explore whether the US is actually more likely to win or simply has the most athletes competing, which would also produce higher medal counts. To explore this, I calculated the average number of medals per entry for each country and created a scatter plot and a table of the results.

The dot in the top right is the United States. The US leads in total medals by a significant margin, but it falls below the regression line, meaning the fit would predict even more medals for that many entries. Since the US is such an extreme outlier, however, the regression is likely not reliable for it.
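The concern about the outlier can be illustrated with a small regression sketch. It is fitted only to the ten countries printed in this report’s win-rate table, so its numbers will not match a fit on all 147 countries; it simply shows how one might check how much a single point drives the line.

```r
# The ten rows printed in the win-rate table of this report.
top <- data.frame(
  NOC           = c("URS", "GDR", "ANZ", "EUN", "USA",
                    "RUS", "GER", "SRB", "PAK", "NOR"),
  total_wins    = c(2503, 1005, 29, 279, 5637, 1165, 2165, 85, 121, 1033),
  total_entries = c(5685, 2645, 86, 864, 18853, 5143, 9830, 392, 562, 4960)
)

# Simple linear fit of medals on entries.
fit <- lm(total_wins ~ total_entries, data = top)

# Leverage measures how far a point's x-value sits from the rest.
top$leverage <- hatvalues(fit)
top$residual <- residuals(fit)
most_influential <- top$NOC[which.max(top$leverage)]
print(most_influential)

# Refit without that point and predict what the line would expect for it.
fit_without <- lm(total_wins ~ total_entries,
                  data = top[top$NOC != most_influential, ])
predicted <- predict(fit_without,
                     newdata = top[top$NOC == most_influential, ])
print(predicted)
```

Comparing `predicted` with the actual medal total shows whether the point sits above or below a line that it did not help to fit.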

# A tibble: 147 × 4
   NOC   total_wins total_entries win_rate
   <chr>      <int>         <int>    <dbl>
 1 URS         2503          5685    0.440
 2 GDR         1005          2645    0.380
 3 ANZ           29            86    0.337
 4 EUN          279           864    0.323
 5 USA         5637         18853    0.299
 6 RUS         1165          5143    0.227
 7 GER         2165          9830    0.220
 8 SRB           85           392    0.217
 9 PAK          121           562    0.215
10 NOR         1033          4960    0.208
# ℹ 137 more rows
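A table like the one above could be built along these lines. The tibble output suggests the original used dplyr, but that is an assumption; this sketch sticks to base R and a made-up results table in which every row is one entry and `Medal` is `NA` when no medal was won.

```r
# Toy results: one row per entry. All values are hypothetical.
results <- data.frame(
  NOC   = c("USA", "USA", "USA", "USA", "NOR", "NOR", "NOR", "MEX", "MEX"),
  Medal = c("Gold", NA, "Silver", NA, "Bronze", NA, NA, NA, NA)
)

# An entry counts as a win if any medal was recorded.
results$won <- !is.na(results$Medal)

total_wins    <- tapply(results$won, results$NOC, sum)
total_entries <- tapply(results$won, results$NOC, length)

win_rates <- data.frame(
  NOC           = names(total_wins),
  total_wins    = as.vector(total_wins),
  total_entries = as.vector(total_entries)
)
win_rates$win_rate <- win_rates$total_wins / win_rates$total_entries

# Highest win rate first, as in the table above.
win_rates <- win_rates[order(-win_rates$win_rate), ]
print(win_rates)

# Sanity check against the USA row of the report's table.
usa_rate <- round(5637 / 18853, 3)
print(usa_rate)  # 0.299
```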

The funny thing about this table is that the four teams above the USA no longer exist, and the team just below the US, Russia, is currently barred from competing under its own flag. The top four teams are the USSR, East Germany, Australasia, and the Unified Team, which was made up of athletes from former Soviet republics after the fall of the USSR. Given the doping scandals that were later uncovered, it is perhaps unsurprising that some of those teams have high win rates (the USSR’s is 44%!). Australasia’s rate rests on only 86 entries, so it is less reliable than the others. Based on this table, the USA has a very impressive win rate of almost 30%, meaning roughly 3 in 10 US entries end in a medal. Note that the rate is calculated per entry rather than per unique athlete, so, assuming each row is one athlete in one event, an athlete who competes in several events is counted several times. Being a US athlete therefore looks like it gives a competitor an upper hand. This might be because the US is a rich country that can devote more resources to preparing its athletes than other nations can. The nations in the top 10 that can still compete also include other generally rich countries, such as Germany and Norway.

In Which Season Does the US Win the Most?

Since we determined that the US has the best win rate of any nation still competing, I wanted to look for areas where the US could win more. To do this, I explored which season the US is most successful in.

Based on this line graph, the Winter Olympics yield far fewer US medals than the Summer Olympics. This could simply be because the Winter Olympics have fewer events than the Summer Olympics, which are stacked with track and field and swimming events that the US generally excels in. Aside from a few spikes, US medal counts show a general upward trend over time. After 1992, when the Winter and Summer Games moved to alternating years, there appears to have been an uptick in the number of events and possible medals. This presents an opportunity for the US to increase its winnings. The pattern loosely mirrors the earlier time series gender graph, which showed entries into the Games increasing. As entries and events increase, it makes sense that medal counts would increase as well.

# A tibble: 50 × 3
# Groups:   Year [35]
    Year Season  Wins
   <dbl> <chr>  <dbl>
 1  1904 Summer   394
 2  1984 Summer   352
 3  2008 Summer   317
 4  2016 Summer   264
 5  2004 Summer   263
 6  1996 Summer   259
 7  2012 Summer   248
 8  2000 Summer   242
 9  1992 Summer   224
10  1988 Summer   207
# ℹ 40 more rows

We also see in that graph and in this table that the US has not beaten its peak of 394 medals from the 1904 Summer Olympics. It would be interesting to explore this outlier to see what caused the spike; one likely factor is that the 1904 Games were held in St. Louis and drew relatively few athletes from other countries.

Paris 2024 Comparisons

To continue my analysis, I scraped the Olympics website for data on the 2024 Paris Olympics to compare the most recent Games with the historical data.

Gender Comparison

As we explored previously, the Olympics have historically skewed male, but this has been changing in more recent Games, so I created a bar chart to show the change.

This graph clearly demonstrates how the demographics of the Olympics have changed over time. In the historical data the ratio looks to be about 3:1 male to female, while in the 2024 data it has roughly equalized. Because the Paris data set is only a partial sample, unlike the combined historical one, it happens to contain more female than male athletes, even though men still outnumbered women in Paris overall.

Sentiment/Text Analysis

Finally, I thought it would be interesting to explore what emotions appear in the general interest variable for the athletes. These entries describe the athlete personally and professionally and are generally written in a positive light, so I expected the results to be mostly positive. I removed words that would be difficult to interpret or unhelpful, like “olympics,” “games,” and similar words. I also removed any words that contained numbers or other non-alphabetic characters, any word with two or fewer characters, and the names of sports.

First, I found the word counts, then visualized them, and then did a sentiment analysis.

# A tibble: 1,550 × 2
   word           n
   <chr>      <int>
 1 university    22
 2 coach         20
 3 high          17
 4 team          17
 5 life          15
 6 athletes      14
 7 day           14
 8 year          14
 9 jump          13
10 school        13
# ℹ 1,540 more rows
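The cleaning rules described before the table can be sketched in base R as below. The three example biographies and the `drop_words` list are invented for illustration; the actual stop-word list and tokenizer used for the report are not shown in the source.

```r
# Hypothetical general-interest text for three athletes.
bios <- c(
  "She started gymnastics at age 5 and trained with her coach every day.",
  "He ran track in high school, then at university, and credits his coach.",
  "Her team won in 2021; she says the Olympics games are a dream."
)

# Assumed list of unhelpful words: topic words, sport names, common fillers.
drop_words <- c("olympics", "games", "gymnastics", "track",
                "the", "and", "she", "her", "his", "with",
                "then", "says", "are")

# Lowercase, then split on anything that is not a letter or digit.
words <- unlist(strsplit(tolower(bios), "[^a-z0-9]+"))
words <- words[words != ""]

words <- words[!grepl("[^a-z]", words)]  # drop tokens containing digits
words <- words[nchar(words) > 2]         # drop words of 2 characters or fewer
words <- words[!words %in% drop_words]   # drop unhelpful words

# Frequency table, most common word first.
counts <- sort(table(words), decreasing = TRUE)
print(head(counts, 5))
```

In this toy input only “coach” appears twice, so it sorts to the top, much as it ranks near the top of the real table.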

The top words include university, high and school, team, coach, and life. These athletes are likely describing their path to the Olympics (university or high school, perhaps through a coach) and their lives more generally, which probably revolve around their teams and coaches. I decided to visualize these words to show the most prevalent ones and see whether any other interesting words appear.

In this visualization, we also see the athletes describe training, winning, work, and family.

Finally, I ran a sentiment analysis using the NRC lexicon, which labels each word with the sentiments and emotions it conveys.

As expected, many of the words convey positive, trust, anticipation, and joy sentiments. Far fewer words convey anger, disgust, sadness, or surprise, although there are quite a few negative words. These could be words describing the hard work that athletes have to endure.
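The tallying step behind a chart like this can be sketched as a join between word counts and a lexicon. The lexicon below is a made-up stand-in with the same word/sentiment shape; its labels are not real NRC entries. With the tidytext package the real lexicon is typically loaded with `get_sentiments("nrc")`, but whether the report did so is an assumption.

```r
# Hypothetical stand-in lexicon: a word can carry several labels.
toy_lexicon <- data.frame(
  word      = c("coach", "coach", "team", "dream", "dream",
                "injury", "injury"),
  sentiment = c("positive", "trust", "trust", "positive", "joy",
                "negative", "sadness")
)

# Hypothetical word counts in the same shape as the word-count table.
word_counts <- data.frame(
  word = c("coach", "team", "dream", "injury", "school"),
  n    = c(20, 17, 5, 3, 13)
)

# Inner join: words missing from the lexicon (here "school") drop out.
tagged <- merge(word_counts, toy_lexicon, by = "word")

# Total occurrences per sentiment, largest first.
totals <- sort(tapply(tagged$n, tagged$sentiment, sum), decreasing = TRUE)
print(totals)
```

Because one word can map to several labels, the totals across sentiments can exceed the number of words, which is worth keeping in mind when comparing bar heights.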

Conclusion

  • Age, height, and weight are all slightly higher for medalists than for non-medalists

  • The US has the highest win rate of any nation that still competes

  • The US is most successful in the Summer Olympics for medal counts

  • Gender inequality in the Olympics has decreased over the years

  • Athletes’ descriptions of their general interests are typically positive