Olympic Breakfast Revisited

2012 Catchup

This is a follow up to my RPub about olympic gold medals in 2008. Now that the 2012 closing ceremony is over, there has been time to put together some data about the 2012 results. As promised, I'll carry out a similar analysis to the one in the earlier entry - but this time with 2012 data. One lesson learned here is that although all of this data is free (good), the formats are not always consistent (not so good), and the data is not 100% reliable (also not so good). I had hoped that the original database I used earlier might be updated, but as yet, there is no sign of this. However, the results (in terms of medal winners) are discussed here and a related spreadsheet of results is available here. As suggested above, the database is not perfect - a quick inspection of the previous link shows Mo Farah as only winning one gold medal, when he actually won two. Also, scrolling down to the comments from the first link there is a comment noting other problems. Sadly, the file is a bit of an olympic dogs breakfast. However, in trying to get hold of the data, I found out a few more things about data manipulation in R, so I'll outline these here. However, the final result might need to be updated…

Getting The Data

One useful tip I have found since the last post on this subject is that it is possible to download the spreadsheet referenced above as a csv file, via R commands. This means that some lines in the R script can be used to download the file that will be analysed later in the script. One this command has been pasted in, this means that data can be downloaded relatively rapidly (rather than relying on manual interaction with a web browser) and is also reproducible - anyone running this code should be able to access exactly the same data from the same source as illustrated in this article.

The download may be achieved via the RCurl package - you may need to install this if you have not already done so. This is helpful here, since is provides an interface to the GET method of composing HTTP requests. In the code below, the package is first loaded, and then the getForm function is used to create an HTTP request. The key keyword provides the unique ID for the spreadsheet in question, and the tqx keyword is used to request that the file be provided in csv form. The .opts keyword provides options for the getForm function itself - followlocation tells the function to follow any URL redirections that might occur after sending the initial request, and verbose tells it to provide textuasl feedback of progress - useful if there are any problems.

When the function is called, it returns the response to the request as raw text. This is stored in the variable temp.text. Now, this variable is just a single string, but the information would be more helpfully stored as a data frame. The final line in the code addresses this – the textConnection function creates a psuedo-file handle from the string variable - basically it enables data to be read from the string as though it was a file. Thus, via the read.csv file function, the data is placed into a data frame called medalists:

library(RCurl)
temp.text <- getForm("https://docs.google.com/spreadsheet/tq", key = "0AuKpKzUJbSqtdEdDR29BY0JsRDFlbHQ1SVRHcjlsLWc", 
    tqx = "out:csv", .opts = list(followlocation = TRUE, verbose = TRUE))
medalists <- read.csv(textConnection(temp.text), as.is = TRUE)

Cleaning the Data

The data consists of a number of records - one for each athlete who won a medal at the 2012 games. Firstly, it is helpful to look at the content of the data frame. Here are the first 5 entries:

medalists[1:5, ]
##                        Name                  Country Age Age.group
## 1                A Lam Shin        Republic of Korea  25  21 to 25
## 2              Abby Wambach United States of America  32  31 to 40
## 3         Abdalaati Iguider                  Morocco  25  21 to 25
## 4 Abdullah Waleed Sharbatly             Saudi Arabia  29  26 to 30
## 5         Abel Kiprop Mutai                    Kenya  23  21 to 25
##   Height..cm Weight Sex       DOB             From      Sport
## 1        167     57   F 9/23/1986                     Fencing
## 2        178     81   F  6/2/1980                    Football
## 3        173     57   M 3/25/1987 ERRACHIDIA (MAR)  Athletics
## 4       null   null   M 9/21/1982     London (GBR) Equestrian
## 5        187     73   M 10/2/1988      Nandi (KEN)  Athletics
##                                        Event Medal.winner.    G    S    B
## 1 Women's Individual Epee, Women's Team Epee           YES null  0.5 null
## 2                           Women's Football           YES 0.09 null null
## 3                   Men's 1500m, Men's 5000m           YES null null    1
## 4   Individual Jumping, Team Jumping, SULTAN           YES null null 0.25
## 5                   Men's 3000m Steeplechase           YES null null    1
##   Total
## 1   0.5
## 2  0.09
## 3     1
## 4  0.25
## 5     1
# table(medalists$Country,medalists$Sport)[1:6,1:6]

Firstly, note the G column. This states how many gold medals each athlete won. There are also columns called S and B for silver and bronze counts. One complication is that the original spreadsheet left the cell blank if the corresponding G, S or B count was zero - this has translated to the character string 'null' in the csv file. Because a column in a data frame must be composed of elements of the same kind, this forces all of the G, S and B values to be characters. However, if the intention is to filter out the gold medals and analyse these - as in the earlier Olympic post, this can be achieved by working with character expressions.

gold <- subset(medalists, G != "null")

As noted before, the data does contain some errors. The Mo Farrah error can be seen here:

gold[219, ]
##              Name       Country Age Age.group Height..cm Weight Sex
## 674 Mohamed Farah Great Britain  29  26 to 30        165     58   M
##           DOB            From     Sport                      Event
## 674 3/23/1983 Mogadishu (SOM) Athletics Men's 5000m, Men's 10,000m
##     Medal.winner. G    S    B Total
## 674           YES 1 null null     1

The G value here should be 2, not 1. For now, medal counts of individual athletes won't be considered - just whether or not they were awarded a medal. It is hoped that this will still represent a reasonable indicator of which countries tend to win medals in which disciplines - as was investigated in the first article. If this data gets sorted out (unlikely, as correspondence on the data set is closed), or a more reliable set is found, then this article will be updated.

Applying the Bipartite Graph Visualisation

Having obtained a data set relating to 2012 medal winners the next step is to create a visualisation similar to the one for the 2008 Olympics in the earlier post. The first thing to do is to re-create the cross-tabulation of countries with disciplines. Another inconsisty with the original data is in the naming of variables. In a nutshell, these are the differences that matter here:

Old Name New Name
NOC Country
Discipline Sport

There are also some differences in the names of the counties - for example “People's Republic of China” instead of “China”, and so on. For now, since the meanings of these are clearly understood, these will not be changed to match the original names.

Since the cross-tabulation is based on a count of goldmedals by country, and gold contains counts for individual athletes, a tapply function is used to provide categorical sums of G for Sport crossed with Country. It is also important to recall that initially G is a character variable, and needs to be converted to numeric.

gold.scores <- as.numeric(gold$G)
# Do the tapply summations
medals2012 <- tapply(gold.scores, list(gold$Sport, gold$Country), sum)
# Note that non-appearence of the combinations in `tapply` leads to an NA
# in the cross-tabulation.  Change the NA's into zeroes:
medals2012[is.na(medals2012)] <- 0

The above code gives us a cross-tabulation. However, there are still some problems linked to the data set. The G variable awarded fractional medals - for example a half medal to a person in a two-person team event, one eleventh to a member of a football team and so on. This is good as a general idea - otherwise any country winning a football gold would increase their medal haul by eleven. So much for the efforts of Usain Bolt, Micheal Phelps or Mo Farrah! However, there are yet more difficulties. The numbers are rounded to 3 decimal places, so that the total medal counts for some sports are non-integer 0.990, or similar. In some cases, as noted by one of the links above, they are just plain wrong - ie USA's 0.76 of gold in Water Polo. Presumably this is as the result of an omission, or possibly mis-reporting of a fractional medal. Similarly, the already much-discussed Mo Farrah medal count error is a result of mis-reporting. Although this is something of a sellotape-and-string botch, The next steps attempt to correct this, by rounding counts, and adding 1 to the UK medal count in athletics.

medals2012["Athletics", "Great Britain"] <- medals2012["Athletics", "Great Britain"] + 
    1
medals2012 <- round(medals2012)

Just to give a quick reality check - the rows of medals2012 represent events, and the columns countries. Thus, column sums should give gold medal counts per country. Here, the top ten countries are shown according to this data:

rev(sort(colSums(medals2012)))[1:10]
##   United States of America People's Republic of China 
##                         38                         36 
##              Great Britain          Republic of Korea 
##                         27                         13 
##         Russian Federation                    Germany 
##                         12                         12 
##                    Hungary                     France 
##                          8                          8 
##                  Australia                      Italy 
##                          8                          7 

These can be seen to be reasonably similar to the exact results, by checking this website. So I might trust general trends regarding this visualisation, but would hestitate to draw more detailed conclusions. I would also note that there is a notable undercount for the US. In summary, what he have here is a 'noisy' medal table, which could be thought of as the true medal table as an underlying trend, but with some error superimposed on this…

The Visualisation

The final stage is to re-run the visualisation code, with some minor modificatiions. The key differences are those relating to the new variable names, but it is also worth noting that corresp is called with a matrix argument, rather than a formula. This is because of the need to 'clean' the cross-tabulation prior to analysis.

require(MASS)
## Loading required package: MASS
# Do the correspondence analysis
ca.results.1d <- corresp(t(medals2012), nf = 1)
# Pull out the row and column scores - and give y-locations
r.row.scores <- cbind(scale(rank(ca.results.1d$rscore, ties.method = "first")), 
    0.35)
r.col.scores <- cbind(scale(rank(ca.results.1d$cscore, ties.method = "first")), 
    -0.35)
par(bg = "black", fg = "wheat", mar = c(0, 0, 0, 0) + 0.1)
plot(rbind(r.row.scores, r.col.scores), type = "n", axes = FALSE, xlab = "", 
    ylab = "", ylim = c(-1.2, 1.2), asp = 1)
text(r.col.scores, labels = rownames(r.col.scores), srt = 90, adj = c(1, 0.5), 
    col = "red", cex = 0.9)
text(r.row.scores, labels = rownames(r.row.scores), srt = 90, adj = c(0, 0.5), 
    cex = 0.8)
cloc <- r.row.scores[gold$Country, ]

rloc <- r.col.scores[gold$Sport, ]

whatcol <- function(x) 360 * (x - min(x))/(max(x) - min(x))

segments(cloc[, 1], cloc[, 2] - 0.02, rloc[, 1], rloc[, 2] + 0.02, col = hcl(whatcol(cloc[, 
    1]), alpha = 0.3))

plot of chunk corresp

Epilogue

What started out as a quick post updating my earlier visualisation with 2012 data ended up being more concerned about data cleaning and data quality. No gold medals there, I'm afraid. These are important factors to consider when working with volunteered data. It is of value that such data is made available - but it should be open to scrutiny. Fortunately in this case, it is. Although there are flaws in the data, at least it has been possible to identify them. We have some idea of how much reliability one can attribute to an analysis of this data. If the data were not available for scrutiny we would have to take any published analysis on trust - and might wrongly assume that the data contains no errors.

Thus, I am at least aware that my stop-gap analysis could be improved if more accurate data were available. If I find such data I will re-run the code. In the mean time, if anyone has access to such data (either by carrying out a more rigorous search, or by manual compilation) and are will to put a link to it in the comments section to this post, I'll do my best to re-run the analysis based on that…