Olympic Breakfast

Breakfast?

The name is borrowed' from a BBC Radio 5 show which for a while can be found here as well as an item on the menu at the Little Chef chain of restuarants in the UK - and from the fact that I started writing this article at breakfast time (coffee and toast, not the Little Chef option). I'll now stop talking about breakfast and concentrate on the olympics.

Olympic!

This is an exercise in looking at data about the olympics using R. A particular question I am interested in here is whether particular countries tend to concentrate on particular categories of events - and if so to see which countries tend to compete against each other for medals, and in which events do such rivalries occur. Openly available data, plus some use of R allows us to do this.

For the data, the Guardian's Datablog provides a pointer to this Google docs page, which gives details of every medal winner in the modern Olympiad (that is, since Athens 1896). If you want to work through the R code given here, firstly create a new directory, and set R to work in this directory. If you go to the Google docs link above, it is possible to download the data in csv form - make sure the current sheet is ALL MEDALISTS then click on File, then Download as… then select Comma Separated Values. Next, change the current sheet to IOC COUNTRY CODES and repeat the download process. In both cases download to the directory you have just created. You can now read the data into R:

csv.files <- dir(pattern = "csv$")  # Find the csv file names in the directory.
csv.files  # List them.
## [1] "Summer%20Olympic%20medallists%20since%201896%20-%20ALL%20MEDALISTS.csv"      
## [2] "Summer%20Olympic%20medallists%20since%201896%20-%20IOC%20COUNTRY%20CODES.csv"
# Read in the medal data and the IOC country codes
medals <- read.csv(csv.files[1], skip = 4, as.is = TRUE)
codes <- read.csv(csv.files[2], as.is = TRUE)

In the future, I'd like to replace the clumsy GUI part above with R code that reads each file directly from URLs - the current approach is quite slow for a human to carry out (compared to running a script with read.csv comands referring to URLs) and isn't reproducible. This page will get updated if I succeed in doing this.

Next, take a look at the data - here are the first 10 records:

medals[1:10, ]
##      City Edition    Sport Discipline               Athlete NOC Gender
## 1  Athens    1896 Aquatics   Swimming         HAJOS, Alfred HUN    Men
## 2  Athens    1896 Aquatics   Swimming      HERSCHMANN, Otto AUT    Men
## 3  Athens    1896 Aquatics   Swimming     DRIVAS, Dimitrios GRE    Men
## 4  Athens    1896 Aquatics   Swimming    MALOKINIS, Ioannis GRE    Men
## 5  Athens    1896 Aquatics   Swimming    CHASAPIS, Spiridon GRE    Men
## 6  Athens    1896 Aquatics   Swimming CHOROPHAS, Efstathios GRE    Men
## 7  Athens    1896 Aquatics   Swimming         HAJOS, Alfred HUN    Men
## 8  Athens    1896 Aquatics   Swimming      ANDREOU, Joannis GRE    Men
## 9  Athens    1896 Aquatics   Swimming CHOROPHAS, Efstathios GRE    Men
## 10 Athens    1896 Aquatics   Swimming         NEUMANN, Paul AUT    Men
##                         Event Event_gender  Medal
## 1              100m freestyle            M   Gold
## 2              100m freestyle            M Silver
## 3  100m freestyle for sailors            M Bronze
## 4  100m freestyle for sailors            M   Gold
## 5  100m freestyle for sailors            M Silver
## 6             1200m freestyle            M Bronze
## 7             1200m freestyle            M   Gold
## 8             1200m freestyle            M Silver
## 9              400m freestyle            M Bronze
## 10             400m freestyle            M   Gold

Each record contains several fields. Perhaps the most interesting ones here are: Discipline - broadly stating the kind of sport for which the medal was awarded, NOC - which stands for National Olympic Committee - these are three-letter abbreviations for each country used by the International Olympic Committee (IOC) and Edition which is the year of the Olympics in which the medal was one. At the time of writing this the data does not include any 2012 results.

The variable codes provides a look-up between this and a full country name - again here are the first 10:

codes[1:10, ]
##                Country Int.Olympic.Committee.code ISO.code
## 1          Afghanistan                        AFG       AF
## 2              Albania                        ALB       AL
## 3              Algeria                        ALG       DZ
## 4      American Samoa*                        ASA       AS
## 5              Andorra                        AND       AD
## 6               Angola                        ANG       AO
## 7  Antigua and Barbuda                        ANT       AG
## 8            Argentina                        ARG       AR
## 9              Armenia                        ARM       AM
## 10              Aruba*                        ARU       AW
##              Country.1
## 1          Afghanistan
## 2              Albania
## 3              Algeria
## 4      American Samoa*
## 5              Andorra
## 6               Angola
## 7  Antigua and Barbuda
## 8            Argentina
## 9              Armenia
## 10              Aruba*

There is some repetition in here - in that there are two comuns giving the name of the country - but the idea is to use the data frame codes create a dictionary variable - rather like the ones used in Python - as a look up tool.

code.dict <- codes[, 1]
names(code.dict) <- codes[, 2]

code.dict is a standard R character vector, with named elements. The name of the element is the IOC code for a country, and the corresponding element is the full country name. Thus it can be used to translate the codes to full names, and used in R expressions:

# Translate a single code
code.dict["HUN"]
##       HUN 
## "Hungary" 
# Translate a list of codes
code.dict[medals$NOC[1:7]]
##       HUN       AUT       GRE       GRE       GRE       GRE       HUN 
## "Hungary" "Austria"  "Greece"  "Greece"  "Greece"  "Greece" "Hungary" 
# Top ten total medals 1896 to 2008
rev(sort(table(code.dict[medals$NOC])))[1:10]
## 
##  United States United Kingdom         France          Italy        Germany 
##           4335           1594           1314           1228           1211 
##      Australia        Hungary         Sweden    Netherlands          Japan 
##           1075           1053           1021            782            704 

Analysing the Data

Using cross-tabulation is a good way to start looking at the relationship between countries and the disciplines they tend to win medals in. As a start, lets just look at the results in 2008, and focus on gold medals:

# Select the subset of gold medals from 2008
medals2008 <- subset(medals, Edition == 2008 & Medal == "Gold", select = c(NOC, 
    Discipline))
# Cross-tabulate the country and discipline
tab2008 <- table(medals2008$NOC, medals2008$Discipline)
dim(tab2008)  # Write out the dimension
## [1] 54 41

There are 54 counties and 41 disciplines included in this data set. Note that this need not include all countries represented, as to appear in tab2008 a country needs to win at least one medal. The whole crosstabulation is quite large, but here is the top-left hand corner.

# First replace the IOC-code rownames with the full country names
rownames(tab2008) <- code.dict[rownames(tab2008)]
# Now print out a 10x10 chunk from the table
tab2008[1:10, 1:10]
##             
##              Archery Artistic G. Athletics BMX Badminton Baseball
##   Argentina        0           0         0   0         0        0
##   Australia        0           0         1   0         0        0
##   Azerbaijan       0           0         0   0         0        0
##   Belgium          0           0         1   0         0        0
##   Belarus          0           0         1   0         0        0
##   Brazil           0           0         1   0         0        0
##   Bulgaria         0           0         0   0         0        0
##   Canada           0           0         0   0         0        0
##   China            1          19         0   0         4        0
##   Cameroon         0           0         1   0         0        0
##             
##              Basketball Beach volley. Boxing Canoe / Kayak F
##   Argentina           0             0      0               0
##   Australia           0             0      0               1
##   Azerbaijan          0             0      0               0
##   Belgium             0             0      0               0
##   Belarus             0             0      0               6
##   Brazil              0             0      0               0
##   Bulgaria            0             0      0               0
##   Canada              0             0      0               0
##   China               0             0      2               2
##   Cameroon            0             0      0               0

One thing that is apparent is that the cross-tabulation is quite sparse - not really surprising as most countries don't win medals in most sports. Because the table is a little too large to display on a single page, some kind of visual approach is helpful. One 'off the shelf' approach is correspondence analysis - available in the MASS package.

require(MASS)  # load the MASS package if not already loaded
# Do the correspondence analysis add draw a biplot -- nf=2 tells 'corresp'
# to return the first two dimensions of scores
biplot(corresp(~NOC + Discipline, data = medals2008, nf = 2), cex = 0.7)

plot of chunk corresp_1

This approach attempts to represent the rows and the columns of a contingency table (ie a cross-tabulation) as two sets of points in a low-dimensional space. When the space has dimension 2, the biplot draws both sets of points on the same graph, labelling the sets of points with their respective row or column names. For now I have not turned the IOC country codes into full names, since, as can be seen, the figure is already fairly cluttered.

However, what may be seen is that although a number of outliers exist, projecting the information onto the \( x \)-axis still encapsulates much of the information in this plot. Although there are some obvious outliers (such as Slovakia) the two-dimensional outliers occur at the extremes of the \( x \)-dimension. This suggests that the countries and sports may be reasonably represented as points in one-dimensional space. One reason for doing this is to unclutter the display (with minimal loss of information in terms of spatial pattern in the scores) by separating out the countries from the disciplines, and arranging them regularly. However, this takes a little more coding. The idea here is to compute the 1-dimensional row and column scores, and create a plot of each on separate, parallel axes:

# Do the correspondence analysis
ca.results.1d <- corresp(~NOC + Discipline, data = medals2008, nf = 1)
# Pull out the row and column scores - and give y-locations
row.scores <- cbind(ca.results.1d$rscore, 0.05)
col.scores <- cbind(ca.results.1d$cscore, -0.05)
plot(rbind(row.scores, col.scores), type = "n", axes = FALSE, xlab = "", 
    ylab = "", ylim = c(-1.5, 1.5))
text(col.scores, labels = rownames(col.scores), srt = 90, adj = c(1, 
    0.5), col = "red")
text(row.scores, labels = code.dict[rownames(row.scores)], srt = 90, 
    adj = c(0, 0.5))
box()

plot of chunk corresp_2

The idea is that the locations of the disciplines (written below) should be close to the locations of countries (written above) that have won medals on those disciplines.

However, the display is still cluttered. One solution might be to replace the row and column scores with their ranks:

# Do the correspondence analysis
ca.results.1d <- corresp(~NOC + Discipline, data = medals2008, nf = 1)
# Pull out the row and column scores - and give y-locations
r.row.scores <- cbind(scale(rank(ca.results.1d$rscore, ties.method = "first")), 
    0.05)
r.col.scores <- cbind(scale(rank(ca.results.1d$cscore, ties.method = "first")), 
    -0.05)
plot(rbind(r.row.scores, r.col.scores), type = "n", axes = FALSE, 
    xlab = "", ylab = "", ylim = c(-1.5, 1.5))
text(r.col.scores, labels = rownames(r.col.scores), srt = 90, adj = c(1, 
    0.5), col = "red")
text(r.row.scores, labels = code.dict[rownames(r.row.scores)], srt = 90, 
    adj = c(0, 0.5), cex = 0.7)
box()

plot of chunk corresp_3

Note that I have also re-scaled the ranks to have a mean of zero and a standard deviation of one. This spreads both sets of scores evenly over roughly the same interval - if this wasn't done, the discipline locations would bunch up to one side, as there are notably fewer of these than countries.

One thing that becomes apparent is that there is a gap in the country names. After some probing, it turns out that two of the country codes given in the spreadsheet are different from those use in the database of medals won. In particular, the codes “ROU” and “SRB” are not supplied in the code look-up. An internet check shows that these are codes for Romania and Serbia respectively - the codes spreadsheet had given 'ROM' and 'SGC' for these. The code dictionary may be amended to allow for this:

names(code.dict)[which(code.dict == "Romania")] <- "ROU"
names(code.dict)[which(code.dict == "Serbia")] <- "SRB"

Also, because the locations of the countries and disciplines will have moved due to replacing locations with ranks (and also re-scaling them) it is perhaps informative to add lines joining the countries to the disciplines in which they have won medals. The lines are semi-transparent - in this way, some impression of the number of medals a country has won in a given discipline can be shown - multiple overlayed lines will show up more than single lines. Also, colour-coding is applied to these lines (on the basis of country) to allow the identification of countries that span a wide range of disciplines. Finally, because I've seen loads of eye-catching visualisations using light colours on a black background, I'm going to do the same…

# Do the correspondence analysis
ca.results.1d <- corresp(~NOC + Discipline, data = medals2008, nf = 1)
# Pull out the row and column scores - and give y-locations
r.row.scores <- cbind(scale(rank(ca.results.1d$rscore, ties.method = "first")), 
    0.35)
r.col.scores <- cbind(scale(rank(ca.results.1d$cscore, ties.method = "first")), 
    -0.35)
par(bg = "black", fg = "wheat")
plot(rbind(r.row.scores, r.col.scores), type = "n", axes = FALSE, 
    xlab = "", ylab = "", ylim = c(-1.2, 1.2), asp = 1)
text(r.col.scores, labels = rownames(r.col.scores), srt = 90, adj = c(1, 
    0.5), col = "red", cex = 0.7)
text(r.row.scores, labels = code.dict[rownames(r.row.scores)], srt = 90, 
    adj = c(0, 0.5), cex = 0.4)
cloc <- r.row.scores[medals2008$NOC, ]

rloc <- r.col.scores[medals2008$Discipline, ]

whatcol <- function(x) 360 * (x - min(x))/(max(x) - min(x))

segments(cloc[, 1], cloc[, 2] - 0.02, rloc[, 1], rloc[, 2] + 0.02, 
    col = hcl(whatcol(cloc[, 1]), alpha = 0.3))

plot of chunk corresp_5

A number of features become apparent. Firstly countries at either ends of the list tend to be ones who have specialised in a small number of disciplines - for example Slovakia in canoeing, or Iran and Mexico in taekwondo. Also, countries with a large range of medals across the board tend to be near the centre of the list - for example the United States. However, distinctions between countries winning large numbers of medals can also be seen - for example, China on the right hand of this diagram have excelled in artistic gymnastics, diving, table tennis, weightlifting and badminton. The USA has excelled in a very different set of disciplines, including football, volleyball, basketball, swimming and rowing. The United Kingdom is more in line with the US - for example having won gold medals in rowing, but also sharing some other disciplines in which they won gold. There is less overlap between the UK and China or South Korea, for example.

Epilogue

This is really just a starting point - a number of issues need to be addressed. Firstly, this is a rather ad hoc approach. Essentially the method here visualises a bipartite graph - a graph in which nodes are paritioned into two distinct groups and all edges join a node in one group to a node in the other - and there are a number of algorithms for visualising this. These may be more effective in reducing the number of line crossings, or making the 'fan out' of lines more compact for countries winning several gold medals, for example.

Secondly, alternative layout schemes could be explored - for example arranging the points on a circle, rather than a pair of lines. Finally, and most importantly perhaps, I haven't analysed the 2012 data yet. I think that I can't really do this until the 2012 Olympics is completed, but watch this space…