The purpose was to understand Olympian profiles and visualise changes to the Olympic history over time, with a focus on Summer Olympics.
I had also combined Olympian data to country medal tally to determine if there is a relationship between height and weight of athletes to the total number of medals a country can win.
Finally, we will like to explore the correlation of such with other macro-economy measures such as GDP, population and Human Development Indices etc to determine if any of these factors contributes to a country’s performance in the Olympics.
Desired visualisations:
1. Number of Athletes participating in the Olympics over the years by Season
2. Number of Athletes participating in the Olympics over the years by Gender
3. Age density plots of participating athletes
4. Top countries by medal counts over the years
5. Correlation of Total Medal tally with athlete attributes and Country macro-environment factors
Dataset was downloaded from Kaggle. It detailed 120 years of Olympian data, including names, height, weight, gender, Games, Season, Event and Medal etc.
You may find the link to the dataset here: https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results
In addition, as we are going to compare against country macro-economic factors to Olympic performance, data such as GDP, population figures, HDI etc are extracted from Gapminder: https://www.gapminder.org/
The Olympian dataset contained only medals won by each and every individual athlete that participated in the games over the years. If a basketball team had won the gold medal for a particular event, then all of the members will receive a gold medal. As such, we are unable to perform analysis of the actual number of medals won by any country using the data as it is due to duplication. Besides sports events, art competitions that took place in the earlier years were also included in the dataset.
Data was thus passed through JMP for cleaning and consolidating. Using country name and code as the combining field, I had also combined other country data such as GDP, population, human development index and internet penetration rate etc for potential other forms of analysis.
Steps of data wrangling in JMP includes:
1. Review team events and remove duplicated medals.
2. Compute sum of athletes representing their country in any particular year.
3. Compute average Height and Weight of athletes for each country.
4. Combine country GDP, HDI, Population etc with Olympic data using Country code as joint.
5. Compute total medal tally, Total Gold, Total Silver and Total Bronze for each country.
6. Compute total Male and Female athletes for each country over the different years.
Loading of Olympian data and preparing data type for the various fields.
athletes = read_csv("athlete_events.csv",
col_types = cols(
ID = col_character(),
Name = col_character(),
Sex = col_factor(levels = c("M","F")),
Age = col_integer(),
Height = col_double(),
Weight = col_double(),
Team = col_character(),
NOC = col_character(),
Games = col_character(),
Year = col_integer(),
Season = col_factor(levels = c("Summer","Winter")),
City = col_character(),
Sport = col_character(),
Event = col_character(),
Medal = col_factor(levels = c("Gold","Silver","Bronze"))
)
)
sports = athletes %>% filter(Sport != "Art Competitions") #remove Art Competitions from the dataset as we are only interested in sports eventsLoading of cleaned country medal tally with joint data retrieved from Gapminder.org.
Firstly, we examine if there is a change in the overall number of participants in the Olympics over the years. From the graph below, we can see that there is a oveall increasing trend in the number of Olympians participating in both Summer and Winter games over the years. Also, there are more athletes participating in the Summer Olympics than the Winter Olympics.
participation = sports %>%
group_by(Year, Season) %>%
summarise(
Participants=length(unique(ID)),
)
ggplot(participation,aes(x=Year, y=Participants, group = Season, color=Season))+
geom_point(size=2)+
geom_line(size=1)+
scale_color_manual(values = c("orange","lightblue"))We can also observe from the graph that there are significant dips in the participation rates in the 1980 Olympics and in the 1930s. After performing research on the internet, we note that these dips corresponds to historical boycotts of the Olympic Games back in the years.
The first dip corresponded to the 1936 Olympics dubbed the Nazi Olympics that was held in Berlin. The Olympic Games that year was seen as a powerful propaganda tool used by Nazi Germany to promote a strong and united Germany while downplaying its treatment towards Jews and political opponents. More details can be found on: https://encyclopedia.ushmm.org/content/en/article/the-nazi-olympics-berlin-1936
The second dip observed in the chart corresponds to the 1980s summer olympics hosted by Soviet Union. 65 countries that were invited to the games did not participate. More deatils can be found on: https://en.wikipedia.org/wiki/1980_Summer_Olympics_boycott
Focusing on the Summer Olympic Games, we narrow it down to gender participation throughout Olympic history. There are generally less female representation in the Summer olympics than male until after the 1980s. More and more female athletes started to participate in the games and the numbers are almost on par with the male athletes in the latest games in 2016.
summer = sports %>% filter(Season != "Winter")
participation = summer %>%
group_by(Year, Sex) %>%
summarise(
Participants=length(unique(ID)),
)
ggplot(participation,aes(x= Year, y=Participants, group = Sex, color=Sex))+
geom_point(size=2)+
geom_line(size=1)+
scale_color_manual(values = c("lightblue","lightpink"))An age density graph was plotted for all participating athletes according to the types of medals that they won over the years. Not much difference was observed between the different medal winners and between the genders, although the age distribution of female athletes seem to be inching towards the right compared to the male.
ggplot(data = sports, aes(x=Age, fill=Sex))+
geom_density(alpha=0.4)+
labs(x = "Age", title = "Age Distribution of Olympians by Sex and Awards")+
theme_minimal()+
facet_wrap(~Medal)The following chart shows the top 10 countries with the highest medal tally for the Summer Olympic Games in 2016. The top winning country is unsurprisingly United States and is comparatively more than any other countries in the list.
counts_2016 = country %>% filter(Year==2016)
topdown = counts_2016[order(counts_2016$`Overall Medal Tally`, decreasing = TRUE),]
top10 = head(topdown,10)
top101 = top10
top101$Country = factor(top101$Country,
levels = c("USA","China","UK","Russia","Germany","France","Japan","Australia","Italy","Canada"))
ggplot(top101,aes(`Country`,`Overall Medal Tally`))+
geom_col()A correlation plot was used to visualise the relationship of athlete attributes with every country’s macro-environment factors.
From the figure below, we can identify that overall medal tally positively and strongly correlates with the number of athletes that a country sends to participate in the games at +0.84 and moderately correlates with GDP as +0.67.
We also note that there is a relationship between mean height and mean weight of athletes.
combined = country %>% select(5,9,10,11,12,15,16,17,19,20)
olympic.cor = cor(combined)
corrplot.mixed(olympic.cor,
lower="ellipse",
upper="number",
tl.pos="lt",
diag = "l",
tl.col = "black",
number.cex=0.7)Using the combined dataset and the information presented in the corelation plot, users can potentially perform multiple linear regression to determine what are the factors that contribute to country performance in the Olympic Games.