Final Project

This report examines the 2014 Winter Olympic medal winners and tries to explain what might be a driver for their success. Specifically, it will examine the relationship between the medals won in 2014 and the following for each nation:

GDP per Capita
Population
Latitude
Past Medals at the Winter Games

It will then explore these drivers in three other instances in the past, using the 1960, 1980, and 1998 Winter Games.

To complete this analysis data was gathered from a variety of sources, including:

GDP Per Capita (nominal $ amounts) for all available nations
Population data for all nations
Latitude/Longitude Coordinates for all nations
2014 Olympic Results by event
Historic Olympic Results from 1924 through 2006
2010 Olympic Results from 2010

Method and Assumptions

Data was cleaned and aggregated where appropriate, largely driven by the International Olympic Committee’s designation of NOC (National Olympic Committees). More information on this topic can be found here. The data was cleaned to use the NOC naming convention in all data sets to make data manipulation and compilation easier later. This also eliminated any non-Olympic nations, making the data set purer.

Many assumptions were made to make the data comparable, especially when considering the 1960 and 1980 results. Nations have changed names, expanded and contracted, or skipped games altogether for political reasons. Some of the more notable assumptions made here when examining historical data include:

Consolidating nations:
- Germany (GER) includes Germany, United Team of Germany, East Germany, West Germany
- Russia (RUS) includes Russia, USSR, Soviet Union, Russian Federation
- Czech Republic (CZE) includes Czech Republic, Czechoslovakia, Slovakia (except for 1st visual below)
Renaming disparate nation names such as South and North Korea, China, Great Britain
Using 2000 data for population and GDP as a reference point for the 1998 games (would have had year 2000 games if they had not changed the olympic schedule, affecting all games from 1992 on); using 2015 data for population and GPD as a reference point for 2014 games.
Absolute value of latitude was used to put southern and northern hemispheres on approximately the same scale

2014 Olympic Results

The results of the 2014 Olympics in Sochi, Russia are shown below by medal count. Is it just a fluke that Russia came out on top - or are there more meaningful patterns to explain the success of all medal winning nations? One aspect this analysis will not focus on, but may be relevant, is the success of the host team - in this case Russia. There are a few general observations to explore further just from glancing at this chart. Most winners are first world nations, are located in regions where colder weather is present, and most of them are seemingly well populated. These observations served as a guide for further analysis.

#2014 Olympic Medal Counts Barchart
MedalData <- read.csv("2014AllData.csv", stringsAsFactors = FALSE)
Medals2014_long <- subset(MedalData, select = c("NOC", "Gold", "Silver", "Bronze")) %>% melt(id.var = "NOC")
black.text <- element_text(color = "black")
black.45angle.text <- element_text(angle = 45, hjust = 1, color = "black")

Medals2014.plot <- ggplot(Medals2014_long, aes(x = reorder(NOC, -value), y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  labs(title = "Medal Counts at 2014 Winter Olympics", x = "NOC (Country)", y = "Number of Medals Won", fill = "Medal") +
  theme(axis.text.x = black.45angle.text,axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
  scale_fill_manual(values = c("Gold" = "gold", "Silver" = "gray60", "Bronze" = "darkorange4"))

Medals2014.plot

Other Drivers for Success

GDP per Capita

Using 2015 GDP per Capita data and the 2014 Olympic Results there appears to be a relationship to medaling in the Olympic Games. This makes logical sense as wealthier nations have the ability to invest more in their Olympic teams. It could also be surmised that wealthier nations have more and better trained athletes in general, leading to more Olympians. This article explains more on specific sport investment in the U.S. While the actual investment in each sport may have a very strong correlation, that data was not widely available for all nations, and certainly not historically. The biggest outlier in this relationship is Russia. They had a fairly meager GDP per Capita, but produced the most medals in 2014. If they are excluded from the data set, the relationship is even stronger. That said, excluding the top medal winner from this analysis would not reflect reality. With Russia included we see a correlation of .46 between GDP per Capita and medals won in 2014.

#GDP2014
GDP2014Data <- MedalData %>% select(NOC,CountryName, Total, GDP2014)
GDP2014.plot <- ggplot(GDP2014Data, aes(x = GDP2014/1000, y = Total, label = NOC)) +
  geom_point(aes(col = CountryName), show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Medal Counts by 2014 GDP Per Capita", x = "GDP Per Capita ($thousands)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = dollar_format())


GDP2014.plot

cor(MedalData$GDP2014, MedalData$Total)

## [1] 0.4571301

Population

As mentioned above, common sense would argue: more people > more potential athletes > more potential Olympians. The following analysis looks at 2015 populations by nation versus the medals won by each nation in 2014. One big caveat - China was excluded - not because it distorts the relationship significantly, but because in 2015 they had an estimated population of 1.4 billion people. Plotting China below renders the chart completely illegible. Overall (with China included) there is a much weaker relationship than with GDP. Looking closer at the data, there are a number of small European nations that perform well at the Olympics (Norway, Sweden, Switzerland, etc.). Perhaps there is something unique about this winning bloc of nations…

#Population2015
#Dropped China (population of 1.4 billion)
Pop2015Data <- MedalData %>% select(NOC,CountryName, Total, Pop2015) %>% arrange(desc(Pop2015)) %>% slice(2:26)
Pop2015.plot <- ggplot(Pop2015Data, aes(x = Pop2015/1000, y = Total, label = NOC)) +
  geom_point(aes(col = CountryName), show.legend = FALSE) +
  geom_text(data=subset(Pop2015Data, Pop2015/1000 > 5), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Medal Counts by 2015 Country Populations", x = "2015 Country Population (in millions)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = comma_format())

Pop2015.plot

cor(MedalData$Pop2015, MedalData$Total)

## [1] 0.07069048

Latitude

There are a couple ways to analyze this data by geography - longitude, altitude (of highest peak), latitude, etc. Longitude seems like the least likely driver for teams, and would capture lots of oceans in between. Looking at latitude looks more holistically at a nations location moreso than a single peak. Therefore, colder nations, regardless of one single point, may be better performers in the Winter Olympics. The downside is that to pick one coordinate per nation may not characterize the “colder weather” aspect properly (as the coordinates used to identify the U.S. are for Kansas). Latitude is measured in degrees from -90 (south pole), to 0 (equator), to 90 (north pole). To normalize the data, all latitude was transformed as an absolute value. The hypothesis is that nations closer to the equator, closer to 0 degrees, would not be great performers at the Winter Olympics, whereas those nations farther away would be more inclined to participate in events that are available in their nation (what about Cool Runnings?)

The first visualization below shows all nations that medaled in 2014 by latitude. It does appear to be densely populated in a narrow band - but without all other nations as reference points, this visual does not tell the full story. The relationship has a .28 correlation, indicating there is evidence that the farther away from the equator a team is, the more medals it wins - compared only with the other Olympic teams in 2014.

#Latitude
MedalData$abslatitude <- abs(MedalData$latitude)
Lat2014Data <- MedalData %>% select(NOC,CountryName, Total, abslatitude)
Lat2014.plot <- ggplot(Lat2014Data, aes(x = abslatitude, y = Total, label = NOC)) +
  geom_point(aes(col = CountryName), size = 4, show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Medal Counts by Latitude", x = "Latitude (absolute value)", y = "Number of Medals Won") +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(limits = c(25, 65))

Lat2014.plot

cor(MedalData$abslatitude, MedalData$Total)

## [1] 0.2807579

To get a more complete picture, the following plot shows all Olympic Teams (both summer and winter) with those that earned one or more medals in the 2014 Olympics indicated with blue dots. This dotplot does a much better job of showing the full field of Olympic Teams, and the density of latitude of those that were successful in the 2014 games. This would indicate that latitude of the nation plays a very significant role in its ability to medal at the Winter Olympics.

#Latitude histogram
AllCountryLat <- read.csv("AllCountryLat.csv", stringsAsFactors = FALSE)
AllCountryLat$abslatitude <- abs(AllCountryLat$latitude)
Lat2014.hist <- ggplot(AllCountryLat, aes(AllCountryLat$latitude, fill = Medal2014)) +
  geom_dotplot(position = "identity", binwidth = NULL, stackgroups = TRUE, binpositions = "all") +
  ylim(0, 22) +
  labs(title = "Latitude of Teams that Medaled in 2014 Olympics", x = "Latitude (absolute value of degrees)", y = "Count", fill = "Medaled") +
  scale_x_continuous(limits = c(0, 75)) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5))

Lat2014.hist

Past Success

Finally, the big winning teams at the 2014 games sound familiar - because they have been the winners in the past. No one is surprised when Russia wins medals - or the U.S. - or Canada. But is that really a driver for future success? According to the data, very much so. The correlation of past medals to the 2014 medal winners is .85 - the strongest relationship thus far. The only exceptions are teams that have really grown into success as of late such as China. But the takeaway here is that the nations with stronger teams continue to breed success game after game.

Note: The names of the teams with less than 75 past medal wins were dropped from the chart as there were too many to see clearly near the axis origin.

#Past wins
TotalMedalCounts2010 <- read.csv("TotalMedalCounts2010.csv", stringsAsFactors = FALSE)
PastWins2014 <- left_join(TotalMedalCounts2010, MedalData, by = "NOC")
PastWins2014[is.na(PastWins2014)] <- 0

PastWins2014.plot <- ggplot(PastWins2014, aes(x = Total2010, y = Total, label = NOC)) +
  geom_point(aes(col = Country), size = 4, show.legend = FALSE) +
  geom_text(data=subset(PastWins2014, Total2010 > 75), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Medal Counts by Previous Medals", x = "Previous Medals Won (at Winter Olympics)", y = "Number of Medals Won in 2014") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5))

PastWins2014.plot

cor(PastWins2014$Total2010, PastWins2014$Total)

## [1] 0.8457907

Historical perspective

Do these drivers prove valid over time? This is explored by looking back to the games in 1960, 1980, and 1998. Using the same metrics as above, comparisons were made for these three reference years, as well as 2014. In general, all of the chosen drivers support the claim that there are patterns and causes to winning medals at the Winter Olympics.

Note: To aggregate the GDP and Population plots the multiplot feature was used. This was done because the scale changes vastly over time - the representations of GDP between 1960 and 1980 would have had almost no visable points.

GDP generally shows a positive pattern to winning medals at the winter games. Russian data (at that time the Soviet Union) was impossible to find, but was available for years 1980 going forward. This impact is seen in 1980, as Russia won a number of medals but had a very low GDP Per Capita. While 1980 might show a negative pattern between GDP and medal winning, there are a large number of nations that did not win any medals, and overwhelmingly have low GDPs. This was excluded to focus only on the teams that did win at least one medal in each game. Had this analysis been more inclusive, there would have been a large number of data points near the axis origin (low GDP Per Capita and zero medal wins), improving the relationship.

#GDP Historical Analysis
#GDP1960
#missing Russia and 2 others 1960 GDP data
HistoricalMedalCount <- read.csv("HistoricalMedalCount.csv", stringsAsFactors = FALSE)
GDP1960Data <-  HistoricalMedalCount %>% select(NOC, year, MedalCount) %>% filter(year == "1960") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(MedalCount))
GDP1960.plot <- ggplot(GDP1960Data, aes(x = GDP1960/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1960 Olympics", x = "GDP Per Capita ($thousands)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = dollar_format())

#GDP1980
#review missing data
GDP1980Data <-  HistoricalMedalCount %>% select(NOC, year, MedalCount) %>% filter(year == "1980") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(MedalCount))
GDP1980.plot <- ggplot(GDP1980Data, aes(x = GDP1980/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1980 Olympics", x = "GDP Per Capita ($thousands)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = dollar_format())

#GDP2000
#review missing data
HistoricalMedalCount <- read.csv("HistoricalMedalCount.csv", stringsAsFactors = FALSE)
GDP2000Data <-  HistoricalMedalCount %>% select(NOC, year, MedalCount) %>% filter(year == "1998") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(MedalCount))
GDP2000.plot <- ggplot(GDP2000Data, aes(x = GDP2000/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1998 Olympics", x = "GDP Per Capita ($thousands)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = dollar_format())

#GDP2014
#Taking out Russia makes this even a stronger relationship (outlier) - Given it was in their country - there is probably some relationship there.
#Overlapping labels
GDP2014A.plot <- ggplot(GDP2014Data, aes(x = GDP2014/1000, y = Total, label = NOC)) +
  geom_point(aes(col = CountryName), show.legend = FALSE) +
  geom_text(aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Olympics", x = "GDP Per Capita ($thousands)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = dollar_format())

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
multiplot(GDP1960.plot, GDP2000.plot, GDP1980.plot, GDP2014A.plot, cols=2)

Population as a driver also held its own over past years. Again, due to population growth the following plots are on individual axes, and China was dropped in 1998 and 2014, otherwise the plots would be unreadable.

#Population Plots

#Population1960
#Consider standardizing y-axis scale (impractical for x-axis)
Pop1960Data <- HistoricalMedalCount %>% select(NOC,year, MedalCount) %>% filter(year == "1960") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(Pop1960)) %>% slice(1:26)
Pop1960.plot <- ggplot(Pop1960Data, aes(x = Pop1960/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(data=subset(Pop1960Data, Pop1960/1000 > 25), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1960 Olympics", x = "1960 Country Population (in millions)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = comma_format())

#Population1980
Pop1980Data <- HistoricalMedalCount %>% select(NOC,year, MedalCount) %>% filter(year == "1980") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(Pop1980)) %>% slice(1:26)
Pop1980.plot <- ggplot(Pop1980Data, aes(x = Pop1980/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(data=subset(Pop1980Data, Pop1980/1000 > 25), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1980 Olympics", x = "1980 Country Population (in millions)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = comma_format())

#Population2000
#dropped China due to large population
Pop2000Data <- HistoricalMedalCount %>% select(NOC,year, MedalCount) %>% filter(year == "1998") %>% left_join(MedalData, HistoricalMedalCount, by = "NOC") %>% arrange(desc(Pop2000)) %>% slice(2:26)
Pop2000.plot <- ggplot(Pop2000Data, aes(x = Pop2000/1000, y = MedalCount, label = NOC)) +
  geom_point(aes(col = NOC), show.legend = FALSE) +
  geom_text(data=subset(Pop2000Data, Pop2000/1000 > 25), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "1998 Olympics", x = "2000 Country Population (in millions)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = comma_format())

#Population2015
#dropped China due to large population
Pop2015A.plot <- ggplot(Pop2015Data, aes(x = Pop2015/1000, y = Total, label = NOC)) +
  geom_point(aes(col = CountryName), show.legend = FALSE) +
  geom_text(data=subset(Pop2015Data, Pop2015/1000 > 25), aes(label = NOC), hjust = 1, vjust = 1, size = 3, angle = 0) +
  labs(title = "2014 Olympics", x = "2015 Country Population (in millions)", y = "Number of Medals Won") +
  stat_smooth(method = 'lm', level = 0) +
  #Low outlier (~25) is Austrailia, which seems to be an exception every time.  stat_smooth(method = 'lm', level = 0) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
   scale_x_continuous(label = comma_format())

multiplot(Pop1960.plot, Pop2000.plot, Pop1980.plot, Pop2015A.plot, cols=2)

Looking to historical wins by latitude also shows strong support that geography plays a part in winning medals. As seen in the following plot, many teams are repeat winners. And since their physical locations did not move, the nations plotted below show a core winning bloc approximately between 40 and 60 degrees of absolute latitude. The team that medaled repeatedly and has a latitude around 25 degrees is Australia - the only prominent winning team from the southern hemisphere. This seems very meaningful, as southern hemisphere nations such as Argentina and Chile have ample cold weather and some of the best conditions for skiing and other winter sports.

#Latitude Facet
Melted_AllCountryLat <- AllCountryLat %>% select("Medal1960", "Medal1980", "Medal1998", "Medal2014", "abslatitude") %>% melt(id.var = "abslatitude")

#Low outlier (~25) is Austrailia, which seems to be an exception every time.
ggplot(Melted_AllCountryLat, aes(abslatitude,fill=value)) + 
 geom_dotplot(position = "identity", binwidth = NULL, stackgroups = TRUE, binpositions = "all") + 
 ylim(0, 18) +
  labs(title = "Latitude of all Olympic Teams Over Time", x = "Latitude (absolute value of degrees)", y = "Count", fill = "Medaled") +
  scale_x_continuous(limits = c(0, 76), breaks = c(0, 15, 30, 45, 60, 75)) +
  theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~ variable)

Lastly, the strongest relationship examined here was the past performance of teams. It is undeniable that those teams that have won medals in the past, continued to so, as evidenced of the following plots. Each year only evaluated medals won previous to that game (ie. 1960 results only considered past medals won from 1956 or earlier). A subject for further research would be those teams that are outliers. It could be suggested that those nations under that linear regression line are under performing (team is in a decline) and those above are over performing (rising stars and upcoming powerhouses) - especially if those trends repeated over the course of a few winter games.

#Past wins Facet
HistoricalMedalCount2 <- read.csv("HistoricalMedalCount2.csv", stringsAsFactors = FALSE)
PastWins <- HistoricalMedalCount2 

PastWins.plot <- ggplot(PastWins, aes(x = PastWins, y = Medal, fill = NOC)) +
  geom_point(aes(color = NOC)) +
    labs(title = "Medals at Olympic Games by Past Performance", x = "Previous Medals Won at Winter Olympics", y = "Medals Won (at each specified game)") +
     geom_smooth(method="lm", se= FALSE, size = .8, color = "black", aes(group = "NOC")) +
    theme(axis.text.x = black.text, axis.text.y = black.text, plot.title = element_text(hjust = 0.5), legend.position = "none") +
  facet_wrap(~year)

PastWins.plot

Conclusion

This analysis supports that all variables evaluated here are drivers for success at the Winter Olympics, especially a nation’s past performance and geography (latitude). This is also supported historically.

Notes

Difficulties with the the collection and analysis of the data included:

Retrieving data on the Soviet Union in 1960 and 1980, namely GDP
Inability to include all nations in the world or even all Olympic nations due to the vast amount of data required, and inability to retrieve all necessary data for all nations (such as GDP Per Capita for all nations, especially previous century data)
Fairly comparing nations that changed (geographically, names) over time. See the assumptions noted at the beginning of this report.
Improvement could be made to show statistical relationships relevance and confidence (multiple regression model)

Further research

Outliers of all data sets:
- Russia (low GDP)
- China (abnormally high population)
Inclusion of all Olympic teams in GDP analysis
Similar analysis for the summer games - would latitude still be relevant or are summer sports more universal?