Finding the Recipe For Winter Olympic Success

Retrieving the Data

I started my project by seeking out data that would touch on three main categories: economic, demographic, and geographic. Most of my data came from the World Bank which provides a wealth of different reports that are broken down by country and can be exported to an excel file. (http://data.worldbank.org/) From here I gathered several reports covering a range of topics to see if I could find a link between them and the 2014 Winter Olympic results. In addition, I wanted to see if Olympic medal count correlated to a countries proximity to the North Pole. Wikipedia provided a table ranking all countries based on their northern most point, this was formatted and loaded into the workspace via an excel worksheet. (https://en.wikipedia.org/wiki/List_of_countries_by_northernmost_point)

Aggregation

First, I formatted the 2014 data into a standardized format that would allow the introduction of historical data for direct comparisons and aggregate analysis. The variables chosen for study were added to the dataframes as columns by merging the two dataframes.

2014 Winter Olympic Analysis

Location, Location, Location

The first relationship I wanted to study was a Country’s medal count when compared to its proximity to the North Pole. The plot above shows how visualizing data can be distorted by one extreme value. For the sake of properly assessing the majority of the data I decided to map the plot to focus on all values except Australia.

#Outlier Disproportionately disrupts Proximity to Pole Data
ggplot(ttl_2014_agg, aes(Prox, count)) + geom_point(col = "blue", size = 4) +
  geom_smooth() + geom_text(aes(label = Country),check_overlap = TRUE, hjust = "inward", nudge_y = 2.2) +
  labs(x = "Descending Rank Of Country Northernmost Point", y = "Winter Olympic Medal Count") + 
  ggtitle("2014 Winter Olympic Success: Proximity to North Pole") +
  theme_fivethirtyeight()
FALSE `geom_smooth()` using method = 'loess'

The plot above shows how visualizing data can be distorted by one extreme value. For the sake of properly assessing the majority of the data I decided to map the plot to focus on all values except Australia.

#Allows us to see a relationship between Proximity to North Pole and Medal Totals
ggplot(ttl_2014_agg, aes(Prox, count)) + geom_point(col = "blue", size = 3, alpha = .6) +
  geom_text(aes(label = Country),size = 3, alpha = .5,check_overlap = TRUE, hjust = "inward", nudge_y = 2.2) +
  geom_smooth(se = FALSE) + coord_cartesian(xlim = c(0,75)) +
  theme_fivethirtyeight() +
  labs(x = "Descending Rank Of Country Northernmost Point", y = "Winter Olympic Medal Count") + 
  ggtitle("2014 Winter Olympic Success: Proximity to North Pole")
## `geom_smooth()` using method = 'loess'

The scatterplot and smooth line both reveal a strong negative correlation between medal count and proximity to North Pole. As a Country’s geographic location does not change (at least not from our time horizon!) the x-values are reliably stable and adding additional Countries or points from historical data maintains the plot’s integrity.

Medal Count and GDP Per Capita

I used a strip plot to establish if there is a relationship between a Country’s gross domestic product per capita and 2014 Olympic medals. Countries are aligned on the x-axis by GDP per capita rank, with the leftmost (Ukraine) having the lowest GDP of the group and rightmost (Norway) the highest. We can see that a possible positive correlation may exist but several values, such as Russia and Australia, buck the trend and represent opposite extremes.

#Strip Chart Countries ordered left to right by GDP
ttl_2014_agg$Country <- reorder(ttl_2014_agg$Country, ttl_2014_agg$GDP)
ggplot(drop_na(ttl_2014_agg, GDP), aes(Country, count)) + geom_jitter(size = 2, color = "blue") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), panel.background = element_rect(fill = "white"), panel.grid.major.y = element_line(color = "light blue")) +
  ggtitle("Ranked GDP per Capita From Left to Right")

Worth The Wait: Life Expectancy and Olymmpic Results

The bar chart below also uses the order of the values to convey meaning. Countries are aligned on the y-axis by average life expectancy (combined male/female) with the topmost value having being highest (Japan) and the bottommost value the lowest (Russia). Although the top half of the plot contains the majority of the values, both the United States and Russia weaken the case that life expectancy can be considered a factor for performance at the Winter Olympics. Further analysis will allow us to discount the data from a single year and take an aggregate view.

ttl_2014_agg$Country <- reorder(ttl_2014_agg$Country, ttl_2014_agg$Life_Exp)
ggplot(ttl_2014_agg, aes(Country, count)) + geom_col(fill = "skyblue") + scale_color_brewer(palette = 5) + coord_flip() + theme_minimal() +
  labs(y = "Winter Olympic Medal Count", x = "Ranked Life Expectancy: Descending Order") +
  ggtitle("Average Life Expectancy Vs. Olympic Medals (1994, 2002, 2014)")

Comparison To 1994 and 2002 Winter olympic Results

To verify that the variables selected are correlated with Olympic performance, I compiled dataframes containing the results of the 1994 and 2002 games. This provided a consistent time frame of ~10 years allowing us to study changes over time and the magnitude of their impact on medal count.

Proximity To North Pole

Proximity to the North Pole as a factor of medal count seems to hold up as all three Olympic results are presented together. I have altered the the plot size and masked Australia, which is the only Country represented below the equator. As this Country is the single outlier of its type, I believe that its removal from the plot space maintains the integrity of the data and does not mislead the viewer.

#Removing extreme outlier, Australia, to show Correlation
ggplot(comb_ttl_agg, aes(Prox, count)) + geom_point(aes(col = factor(Year)), size = 2, alpha = .8) + geom_smooth(se = FALSE) + 
  coord_cartesian(xlim = c(0,65)) + ggtitle("Winter Olympic Success Factors: Proximity to North Pole") +
  labs(x = "Descending Rank Of Country Northernmost Point", y = "Winter Olympic Medal Count") + scale_color_discrete(name = "Year") +
  theme_few()
FALSE `geom_smooth()` using method = 'loess'

The trend line indicates that there is a strong relationship between how far North a Country is and its performance at the Winter Olympics. Historical data has backed up our claim and this variable can be considered as an indicator of Winter Olympic performance.

Life Proximity

We have replicated the plot comparing life expectancy to medal count. Again, Countries are listed from highest to lowest life expectancy. While it does appear the top half of the plot contains more values (with the United States and Russia as extremes) it would be more intuitive to see the progression over time. Once we facet the same chart into the 3 time periods we see how Country’s with a higher life expectancy have been gathering more share of the medal count. Russia being the special case, we can see that medal growth seems to be related to high life expectancy by observing the increase in total medals in the top 50% of the plot. A possible explanation for this growth is the argument that developed Countries experience both an increased presence at the Olympic Games and an increase in life expectancy. While this could certainly factor in, most of the Countries with the highest life expectancy had been fully developed prior to 1994. Life expectancy, while arguably linked to many other interrelated factors, has become a reliable indicator for Winter Olympic performance.

GDP Per Capita

Let’s revisit the third variable identified, GDP per capita. The strip plot below again shows countries ranked from lowest to highest GDP with medal counts plotted along the y axis. With three periods represented we can better see that upward slope of the points as GDP per capita increases.

comb_ttl_agg$Country <- reorder(comb_ttl_agg$Country, comb_ttl_agg$GDP)
ggplot(comb_ttl_agg, aes(Country, count)) + geom_jitter(size = 3, width =0, color = "light pink") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), panel.background = element_rect(fill = "white"), panel.grid.major.y = element_line(color = "light blue")) +
  ggtitle("Total Medals By Increasing GDP Per Capita") +
  labs(x = "Ranked GDP per Capita Ascending Left to Right", y = "Winter Olympic Medal Count")

While GDP per capita’s relationship seems stronger with 3 datasets, why not experiment with a last variable? If the plot’s meaning isn’t lost, additional primary data vehicles (color, size, geom) can add meaning to the image. In this last graph we have filled the points with a gradient representing the percent of the population who live in urban areas. Judging by the results we could further explore this if needed as there does seem to be an increased urban population correlated both to higher GDP per capita, and Olympic medal count.

ggplot(comb_ttl_agg, aes(Country, count)) + geom_jitter(aes(col = Urban_Perc),size = 3, width =0) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_color_gradientn(name = "Urban %", colors = terrain.colors(6)) +
  ggtitle("Total Medals By Increasing GDP Per Capita and % Urban Population") +
  labs(x = "Ranked GDP per Capita Ascending Left to Right", y = "Winter Olympic Medal Count")

Reflection and Conclusion

Correlation Coefficients For the Three Variables Studied

#Assess Effectiveness of Visualization
cor(comb_ttl_agg$Life_Exp, comb_ttl_agg$count, method = "pearson")
## [1] 0.2152628
cor(comb_ttl_agg$Prox, comb_ttl_agg$count, method = "pearson")
## [1] -0.3150113
cor(comb_ttl_agg$GDP, comb_ttl_agg$count, method = "pearson")
## [1] 0.428735

I have intentionally relied on the visualizations in this project instead of numerical values to tell the story through images. I did want to perform a spot check though to ensure that I hadn’t obscured the data to narrate a story that didn’t exist. Below are the correlation values for each of the 3 variables discussed. GDP it turns out was the most likely indicator and Life expectancy the lowest. I was actually quite surprised, believing that proximity to the North Pole would be the most effective measurement. Upon reflection, the same value I struggled with (Australia) would have been factored into the correlation measurement. By being able to use our eyes rather than a function, we were better able to see the relationship through data visualization. Thank you for taking the time to read my analysis of why GDP per capita, proximity to the North Pole, and average life expectancy could be correlated to Winter Olympic results.