I started my project by seeking out data that would touch on three main categories: economic, demographic, and geographic. Most of my data came from the World Bank which provides a wealth of different reports that are broken down by country and can be exported to an excel file. (http://data.worldbank.org/) From here I gathered several reports covering a range of topics to see if I could find a link between them and the 2014 Winter Olympic results. In addition, I wanted to see if Olympic medal count correlated to a countries proximity to the North Pole. Wikipedia provided a table ranking all countries based on their northern most point, this was formatted and loaded into the workspace via an excel worksheet. (https://en.wikipedia.org/wiki/List_of_countries_by_northernmost_point)
First, I formatted the 2014 data into a standardized format that would allow the introduction of historical data for direct comparisons and aggregate analysis. The variables chosen for study were added to the dataframes as columns by merging the two dataframes.
The first relationship I wanted to study was a Country’s medal count when compared to its proximity to the North Pole. The plot above shows how visualizing data can be distorted by one extreme value. For the sake of properly assessing the majority of the data I decided to map the plot to focus on all values except Australia.
#Outlier Disproportionately disrupts Proximity to Pole Data
ggplot(ttl_2014_agg, aes(Prox, count)) + geom_point(col = "blue", size = 4) +
geom_smooth() + geom_text(aes(label = Country),check_overlap = TRUE, hjust = "inward", nudge_y = 2.2) +
labs(x = "Descending Rank Of Country Northernmost Point", y = "Winter Olympic Medal Count") +
ggtitle("2014 Winter Olympic Success: Proximity to North Pole") +
theme_fivethirtyeight()
FALSE `geom_smooth()` using method = 'loess'
The plot above shows how visualizing data can be distorted by one extreme value, Australia. For the sake of properly assessing the data we’re going to have to find a different way to structure the data. Instead of using just a nominal rank value for each country, let’s bring the South Pole into perspective. Also fostering a polar climate countries closer to the South Pole may experience a boost in Olympic performance as well. The data for the plot below has been restructure by absolute value of its latitude allowing us to add Australia in its relative position for comparison.
## Warning in merge.data.frame(ttl_2014_agg, life_exp, by.x = "Country", by.y
## = "Country Name", : column names '2014.x', '2014.y' are duplicated in the
## result
#By Latitude
ggplot(ttl_2014_agg, aes(Prox, count)) + geom_point(col = "blue", size = 4, alpha = .6) +
geom_text(aes(label = Country),size = 3, alpha = .5,check_overlap = TRUE, hjust = "inward", nudge_y = 1.4) +
geom_smooth(se = TRUE) +
theme(axis.text.x = element_text(size=5, angle=45),panel.background = element_rect(fill = "white"), panel.grid.major.y = element_line(color = "light blue")) +
labs(x = "Descending Rank From Latitude Absolute Value", y = "Winter Olympic Medal Count") +
ggtitle("2014 Winter Olympic Success: Proximity to Pole Relation to Medal Count")
## `geom_smooth()` using method = 'loess'
The scatterplot and smooth line both reveal a strong negative correlation between medal count and proximity to Poles. As a Country’s geographic location does not change (at least not from our time horizon!) the x-values are reliably stable and adding additional Countries or points from historical data maintains the plot’s integrity.
I used a strip plot to establish if there is a relationship between a Country’s gross domestic product per capita and 2014 Olympic medals. Countries are aligned on the x-axis by GDP per capita rank, with the leftmost (Ukraine) having the lowest GDP of the group and rightmost (Norway) the highest. We can see that a possible positive correlation may exist but several values, such as Russia and Australia, buck the trend and represent opposite extremes.
#Strip Chart Countries ordered left to right by GDP
ttl_2014_agg$Country <- reorder(ttl_2014_agg$Country, ttl_2014_agg$GDP)
ggplot(drop_na(ttl_2014_agg, GDP), aes(Country, count)) + geom_jitter(size = 2, color = "blue") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), panel.background = element_rect(fill = "white"), panel.grid.major.y = element_line(color = "light blue")) +
labs(y = "2014 Winter Olympic Medal Count") +
ggtitle("Ranked GDP per Capita From Left to Right")
The bar chart below also uses the order of the values to convey meaning. Countries are aligned on the y-axis by average life expectancy (combined male/female) with the topmost value having being highest (Japan) and the bottommost value the lowest (Russia). The black bar in the center splits the Countries into the lower/upper life expectancy ranges. Although the top half of the plot contains the majority of the values, both the United States and Russia weaken the case that life expectancy can be considered a factor for performance at the Winter Olympics. Further analysis will allow us to discount the data from a single year and take an aggregate view.
ttl_2014_agg$Country <- reorder(ttl_2014_agg$Country, ttl_2014_agg$Life_Exp)
ggplot(ttl_2014_agg, aes(Country, count)) + geom_col(fill = "skyblue") + scale_color_brewer(palette = 5) + coord_flip() + theme_minimal() +
labs(y = "Winter Olympic Medal Count", x = "Ranked Life Expectancy: Descending Order") +
ggtitle("Average Life Expectancy Vs. Olympic Medals (2014)") +
geom_vline(xintercept = 13.5) +
annotate("text", label = "Top 50% of Medaling Countries By Life Expectancy", x = 15, y = 20, color = "dark gray") +
annotate("text", label = "Lower 50% of Medaling Countries By Life Expectancy", x = 12, y = 20, color = "dark gray")
To verify that the variables selected are correlated with Olympic performance, I compiled dataframes containing the results of the 1994 and 2002 games. This provided a consistent time frame of ~10 years allowing us to study changes over time and the magnitude of their impact on medal count.
Proximity to the planet’s poles as a factor of medal count seems to hold up as all three Olympic results are presented together. We have continued to rank countries by the absolute value of their latitude. Australia is represented here in the same plot space as Russia. We can see that being in the top 10 northernmost Countries seems to have the largest effect with the top five seeing an extreme rise in medal count. The trend line does still show a gradual uptrend throughout the plot though and we can declare proximity to a Pole a possible factor of Olympic success.
ggplot(comb_ttl_agg, aes(Prox, count)) + geom_point(aes(col = factor(Year)), size = 3, alpha = .8) + geom_smooth(method = "loess") +
scale_color_discrete(name = "Year") +
theme_few() +
labs(x = "Descending Rank From Latitude Absolute Value", y = "Winter Olympic Medal Count") +
ggtitle("Comparison: Proximity to Pole Relation to Medal Count (1994,2002,2014)")
The trend line indicates that there is a strong relationship between how far North a Country is and its performance at the Winter Olympics. Historical data has backed up our claim and this variable can be considered as an indicator of Winter Olympic performance.
We have replicated the plot comparing life expectancy to medal count. Again, Countries are listed from highest to lowest life expectancy. The line in the middle of the plot again splits the top and bottom 50% of medaling Countries by life expectancy for comparison. While it does appear the top half of the plot contains more values (with the United States and Russia as extremes) it would be more intuitive to see the progression over time. Once we facet the same chart into the 3 time periods we see how Country’s with a higher life expectancy have been gathering more share of the medal count. Russia being the special case, we can see that medal growth seems to be related to high life expectancy by observing the increase in total medals in the top 50% of the plot. A possible explanation for this growth is the argument that developed Countries experience both an increased presence at the Olympic Games and an increase in life expectancy. While this could certainly factor in, most of the Countries with the highest life expectancy had been fully developed prior to 1994. Life expectancy, while arguably linked to many other interrelated factors, has become a reliable indicator for Winter Olympic performance.
ggplot(comb_ttl_agg,aes(Life_Exp, count)) + geom_bin2d(binwidth = c(2,5)) +
geom_smooth(col = "sky blue") +
labs(x = "Life Expectancy", y = "Winter Olympic Medal Count") +
ggtitle("Comparison: Life Expectancy as a Factor of Winter Olympic Success") +
theme_few()
## `geom_smooth()` using method = 'loess'
The plot above avoids looking at life expectancy from a Country level and instead focuses on the variable independently. By splitting into bins we can see that the majority of medal winning occurences happened in COuntries with a higher life expectancy based on the fill coverage on the right side of the graph. By this alone it would appear life expectancy plays an oversized role in Olympic achievement. The smoothing line is introduced to show that while life expectancy does seem to play a role between a 65-80 average, the few Countries higher than 80 do not exhibit a continuation of the trend.
Let’s revisit the third variable identified, GDP per capita. The strip plot below again shows countries ranked from lowest to highest GDP with medal counts plotted along the y axis. With three periods represented we can better see that upward slope of the points as GDP per capita increases.
comb_ttl_agg$Country <- reorder(comb_ttl_agg$Country, comb_ttl_agg$GDP)
ggplot(comb_ttl_agg, aes(Country, count)) + geom_jitter(size = 3, width =0, color = "light pink") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), panel.background = element_rect(fill = "white"), panel.grid.major.y = element_line(color = "light blue")) +
ggtitle("Total Medals By Increasing GDP Per Capita") +
labs(x = "Ranked GDP per Capita Ascending Left to Right", y = "Winter Olympic Medal Count")
While GDP per capita’s relationship seems stronger with 3 datasets, why not experiment with a last variable? If the plot’s meaning isn’t lost, additional primary data vehicles (color, size, geom) can add meaning to the image. In this last graph we have filled the points with a gradient representing the percent of the population who live in urban areas. Judging by the results we could further explore this if needed as there does seem to be an increased urban population correlated both to higher GDP per capita, and Olympic medal count.
ggplot(comb_ttl_agg, aes(Country, count)) + geom_jitter(aes(col = Urban_Perc),size = 3, width =0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_color_gradientn(name = "Urban %", colors = terrain.colors(6)) +
ggtitle("Total Medals By Increasing GDP Per Capita and % Urban Population") +
labs(x = "Ranked GDP per Capita Ascending Left to Right", y = "Winter Olympic Medal Count")
#Assess Effectiveness of Visualization
cor(comb_ttl_agg$Life_Exp, comb_ttl_agg$count, method = "pearson")
## [1] 0.2152628
cor(comb_ttl_agg$GDP, comb_ttl_agg$count, method = "pearson")
## [1] 0.428735
I have intentionally relied on the visualizations in this project instead of numerical values to tell the story through images. I did want to perform a spot check though to ensure that I hadn’t obscured the data to narrate a story that didn’t exist. Above are the correlation values for each of the 3 variables discussed. Pole proximity turns out was the most likely indicator and Life expectancy the lowest (because of the manipulation performed I could not get the correlation to show in the knitr console, the relation between medal count and pole proximity was -0.4448048). It was at the suggestion at a class mate that I look at Pole absolute value (taking into account North/South hemisphere) rather than simply the North Pole. It’s amazing how changing this variable showed the truth about the effect a Country’s proximity to a Pole has on its medal count. By being able to use our eyes and minds rather than a function, we were better able to see the relationship through data visualization. Thank you for taking the time to read my analysis of why GDP per capita, proximity to the Poles, and average life expectancy could be correlated to Winter Olympic results.