Introduction
This dataset is based on the Olympic Games which took place between the period of 1896 to 2016. It contains information on athletes regarding their physiques, such as weight and height. It also contains demographic information such as age, gender and country of origin. Lastly, information regarding the sport and the specific event is also available. The main aim of this analysis is to analyze and compare key characteristics and patterns amongst participants as well as analyzing trends at a sport and country level.
Data Cleaning
The primary data cleaning which had to be done was for the team variable which contains the country name. Many countries had multiple ways in which their name was stored so that had to be rectified. The code used for that has been included below. Lastly, a custom theme for all subsequent visuals was also set.
medal <- medal[team %in% c('East Germany','East Germany-1','East Germany-2','East Germany-3',
'West Germany','West Germany-1','West Germany-2','West Germany-3',
'Germany-1', 'Germany-2','Germany-3'), team := 'Germany']
US <- paste0('United States-',1:14)
medal <- medal[team %in% US, team := 'United States']
SU <- paste0('Soviet Union-',1:3)
RU <- paste0('Russia-',1:3)
medal <- medal[team %in% c(SU,RU, 'Soviet Union'), team := 'Russia']
GBR <- paste0('Great Britain-',1:4)
medal <- medal[team %in% c(GBR, 'England','England-1'), team := 'Great Britain']
FRA <- paste0('France-',1:4)
medal <- medal[team %in% FRA, team := 'France']
SWE <- paste0('Sweden-',1:4)
medal <- medal[team %in% SWE, team := 'Sweden']
ITA <- paste0('Italy-',1:3)
medal <- medal[team %in% c(ITA,'Italia'), team := 'Italy']
HUN <- paste0('Hungary-',1:3)
medal <- medal[team %in% c('Hungaria','Hungaria Evezos Egylet',HUN), team := 'Hungary']
AUS <- paste0('Australia-',1:3)
medal <- medal[team %in% AUS, team := 'Australia']
CAN <- paste0('Canada-',1:3)
medal <- medal[team %in% CAN, team := 'Canada']
## Setting the theme ##
theme_custom <- function( base_size = 11, base_family = "") {
theme_bw() %+replace%
theme(
plot.background = element_rect(fill = "gray99", colour = "gray99") ,
panel.grid.major = element_line(color = NA),
panel.background = element_rect(fill = "gray99", colour = "gray99"),
panel.border = element_rect(linetype = "solid", fill = NA),
axis.line = element_line(color = "black", size = 1),
axis.text = element_text(color = "black"),
axis.ticks = element_line(color = "royalblue4", size = 1),
axis.ticks.length.y = unit(.25, "cm"),
axis.ticks.length.x = unit(.25, "cm"),
axis.text.x = element_text(margin = margin(t = .3, unit = "cm")),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
}
theme_set(theme_custom())Characteristic distribution of athletes
The starting point for this analysis was to look at athletes at a holistic level and view differences in features such as age, weight, height and gender between the medalists and the non-medalists.
#1) characteristics distribution of athletes
g1 <- ggplot(df[age != 'NA' & age< 50], aes( x = age, fill = medal_won )) +
geom_boxplot() + scale_fill_excel_new() +
labs( x = 'Age', y = 'Relative Frequency', title = 'Age distribution comparison')
g2 <- ggplot(df[weight != 'NA'], aes( x = weight, fill = medal_won )) +
geom_boxplot() + scale_fill_excel_new() +
labs( x = 'Weight', y = 'Relative Frequency', title = 'Weight distribution comparison')
g3 <- ggplot(df[height != 'NA'], aes( x = height, fill = medal_won )) +
geom_boxplot() + scale_fill_excel_new() +
labs( x = 'Height', y = 'Relative Frequency', title = 'Height distribution comparison')
g4 <- ggplot(df[sex != 'NA'], aes( x = sex, fill = medal_won )) +
geom_bar(col = 'black')+ scale_fill_excel_new() +
labs( x = 'Gender', y = 'Relative Frequency', title = 'Gender distribution comparison')
grid.arrange(g1, g2, g3, g4, ncol = 2)The age distribution is pretty similar for medalists and non-medalists, indicating that inherently, age is not a key factor in determining whether an athlete wins a medal or not. An interesting find though is in the case of the weight and height distributions where the median weight for medalists is higher than non-medalists and so are the upper and lower quartiles. While the difference may not seem significant in absolute terms, this can be a good indication for prospective participants in the upcoming years to aim for an ideal weight. This can even be a guiding point for team selectors, who may prefer to choose players within the ideal weight and height brackets in order to maximise the likelihood of their team winning a medal. Lastly, we can see that in absolute terms, men have participated more than women till date, and also have a higher absolute number of medals won by athletes. However, it can be observed that the medals won to total participants ratio is similar for both genders.
#2) explore height to weight ratio b/w male and female athletes
ggplot(df[, medal_won := ifelse(medal_won ==1, 'Medal won','No medal') ],
aes( x = weight, y = height, color = sex)) +
geom_point(alpha = 0.2) + geom_smooth(method = 'lm') +
scale_color_excel_new() + facet_wrap(~df$medal_won) +
labs( x = 'Weight', y = 'Height', title = 'Analyzing the weight to height ratio')Next, the weights and heights are compared between male and female medalists as well as male and female non-medalists. In the non-medalist category, it can be seen that the males have a higher weight-to-height ratio, indicating that for the same weight, male athletes tend to have higher heights. In the case of medalists, the disparity between male and female athletes is lesser (comparing those with equal weights).
#3) line graph
p_gender <- df[, .(number_of_participants = .N), by = .(year, sex)]
gender <- medal[, .(number_of_medals = .N), by = .(year, sex)]
ggplot() +
geom_line(data = p_gender,
aes( x = year, y = number_of_participants, col = sex), size = 1) +
labs(x = 'Year', y = 'Number of participants',
title = 'Gender-segregated number of participants year-wise') +
scale_color_excel_new() +
transition_reveal(year) ggplot() +
geom_line(data = gender,
aes( x = year, y = number_of_medals, col = sex), size = 1) +
labs(x = 'Year', y ='Number of medalists',
title = 'Gender-segregated number of medalists year-wise') +
scale_color_excel_new() +
transition_reveal(year)Finally, the trend for the year-wise number of athletes, segregated by gender, was analyzed. It can be seen that initially, there was a significantly higher proportion of males who were competing in the olympics whereas in the later years, the number of women participants has steadily increased. In the recent years there has been some fluctuation in the total number of participants, but it is encouraging to see that there is now a somewhat more equal representation of both genders. A very similar pattern is accordingly also observed in the case of the medalists as well.This particular aspect of the analysis will come in handy when we want to compare physical features of athletes (males and females) in the next section (particularly in respect to the sample size.
Sport and Event level granularity
Now that we have an idea about the physical characteristic distributions of the medalists, we can go into a greater level of granularity.
#4) sports with most medals
sports <- medal[, .(number_of_medals = .N), by = .(sport)][order(-number_of_medals)][, head(.SD, 10)]
ggplot(sports, aes(x = reorder(sport, number_of_medals), y = number_of_medals)) + geom_bar(stat = 'identity', col = 'black',fill = 'orange') +
labs( x = 'Sport', y = 'Number of Medals', title = 'Total Number of medals by sport') +
coord_flip()At a sport level, we can see that the greatest participation and the most of number of medals have been won in athletics even though there are other team-based sports in which there are more number of medals available for grabs. Therefore, to slice and dice through the data, we can go into events pertaining to the sport of athletics as they will have the most number of observations for comparison.
#4) i) Males
sport_sub <- medal [season=="Summer" &
year >= 1992 &
sex=="M" &
sport =="Athletics"]
#table(as.factor(sport_sub$event))
sport_sub <- sport_sub[event %in% c("Athletics Men's 4 x 400 metres Relay",
"Athletics Men's High Jump",
"Athletics Men's Javelin Throw",
"Athletics Men's Pole Vault")]
s1 <- sport_sub[!is.na(weight) & !is.na(height)] %>%
ggplot(aes(weight, height, col=event)) +
geom_point() +
theme_custom() +
labs(x="Weight", y="Height",
title= "Olympic medalists 1992-2016, Men") +
xlim(60,110) +ylim(160,210) + scale_color_excel_new()
# 4) ii) Females
sport_sub_f <- medal [season=="Summer" &
year >= 1992 &
sex=="F" &
sport =="Athletics"]
#table(as.factor(sport_sub_f$event))
sport_sub_f <- sport_sub_f[event %in% c("Athletics Women's 4 x 400 metres Relay",
"Athletics Women's High Jump",
"Athletics Women's Javelin Throw",
"Athletics Women's Pole Vault")]
s2 <- sport_sub_f[!is.na(weight) & !is.na(height)] %>%
ggplot(aes(weight, height, col=event)) +
geom_point() +
theme_custom() +
labs(x="Weight", y="Height",
title= "Olympic medalists 1992-2016, Women") +
xlim(60,110) +ylim(160,210) + scale_color_excel_new()
grid.arrange(s1, s2)For athletics medalists, we can see pretty clear clustering of data points among different events for males as well as for females. For instance, javelin throw participants tend to have higher weights than other event participants. Similarly, High jump participants, in the case of both genders, tend to have lower weights and higher heights than other sport participants. Again, this can be a good guiding point for team selectors to aim for an ideal weight-to-height ratio so that they can improve their chances of securing a medal in the future.
Next, to identify similar clusters in terms of the age, weight and height distributions, hierarchical clustering was done for different sports. The package Nbclust was used to gain insight regarding the ideal number of clusters for our dataset. It turns out that 4 clusters would be ideal in this case.
The dendrogram below shows sports divided into 4 distinct groups. Each group contains sports with athletes having similar features in terms of age, height and weight. For instance, sports such as shooting, golf etc which do not require much athleticism are grouped together.
# plotting the dendrogram
plot(hc1, cex = 0.6, hang = -1)
rect.hclust(hc1, k = 4)Country-level trends
#5) Map visual of number of athletes rn (medals won) from each country
medal$team <- replace(medal$team, medal$team == 'Soviet Union', 'Russia')
medal$team <- replace(medal$team, medal$team == 'East Germany', 'Germany')
medal$team <- replace(medal$team, medal$team == 'West Germany', 'Germany')
countries <- medal[, .(number_of_medals = .N), by = team][order(-number_of_medals)][, head(.SD, 20)]
geocodes <- geocode(countries, 'team')
worldmap <- map_data('world')
ggplot() +
geom_polygon(
data = worldmap,
aes(x = long, y = lat, group = group),
fill = 'white', color = 'black') +
geom_point(
data = geocodes,
aes(long, lat, size = number_of_medals),
color = 'orange') +
theme(legend.position = 'top') +
xlab('') + ylab('') +
coord_fixed(1.3) +
labs( title = 'Number of olympic medals won by each country till date')The final level of analysis has been done at a country level. The map visual indicates the medals won by the top 20 performing countries in the olympics till date. It can be seen that the US has dominated in this category by some margin. Region-wise clusters can be observed. The most prominent one for instance is the one in Europe, where most European countries have performed exceedingly well. Very few Asian countries have fared well, such as China and Japan. No Middle Eastern countries nor African countries are observed in the countries with the best performance till date.
#6) Cumulative medals animation
medal$year <- as.factor(medal$year)
medal$team <- as.factor(medal$team)
medal_c <- medal[season == 'Summer', .( n = .N), by = .(year, team)]
medal_c <- medal_c[team %in% c("United States", "Russia","Germany","Great Britain","France",
"Italy","Sweden","Canada","Australia","Hungary")]
medal_c <- medal_c[order(rank(team), year)]
medal_c <- medal_c[,CumulativeSums := cumsum(n),by = team]
medal_c$year = as.numeric(as.character(medal_c$year))
d <- ggplot(medal_c, aes ( x = reorder(team, CumulativeSums), y = CumulativeSums)) +
geom_bar(stat = 'identity', col = 'black',fill = 'orange') +
coord_flip() + transition_states(year) +
labs( x = 'Top Performing Countries', y = 'Total number of medals won',
title = 'Cumulative olympic medal totals till {closest_state}') + coord_flip()
animate(d, duration = 30)The final visualization also highlights the cumulative medals won over time by the top performing countries. It can be seen that the US maintains a lead from all other countries.
Conclusion
Based on the exploratory data analysis done above, it can be seen that certain physical features such as weight and height can be critical to success in an Olympic sport or event. Furthermore, based on the data, country of origin also tends to impact the success achieved in an Olympic event. While some of the likelihood of success can be attributed to opportunities and the infrastructure relating to the sport in a country, country of origin still seems to have a part to play in a participant’s chances of success. This can be seen from the fact that several Asian nations which invest greatly in sport and have significant amounts of athletes (e.g China and Japan), still tend to underperform compared to several European countries and the US.