Data Aggregation Fun Times!

This assignment will once again make use of the Gapminder dataset, but this time we will be having fun with the plyr package.

Before beginning, I load the dataset and the needed packages, and then do a quick sanity check that my data is what I expect it to be. As a matter of personal preference, I use the stringsAsFactors = FALSE argument when I load the data, to keep R from turning the country and continent variables into factors.

gDat <- read.delim("gapminderDataFiveYear.txt", stringsAsFactors = FALSE)
library(plyr)
library(xtable)
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Well, it looks like the data is what I want, and even better, I stopped R from turning the country and continent variables into factors: both show up as chr in the output.
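
The visual check of the str() output can also be written as explicit assertions, so that a wrong column type stops the script instead of going unnoticed. A minimal sketch, assuming a small made-up data frame (toyDat and its values are hypothetical stand-ins for gDat, used so the snippet runs without the data file):

```r
# toyDat is a hypothetical stand-in for gDat; the values are invented
toyDat <- data.frame(
  country   = c("A", "B"),
  continent = c("X", "X"),
  lifeExp   = c(40.5, 72.1),
  stringsAsFactors = FALSE
)

# stopifnot() raises an error if any condition is not TRUE
stopifnot(
  is.character(toyDat$country),
  is.character(toyDat$continent),
  is.numeric(toyDat$lifeExp)
)
```

Swapping toyDat for gDat would apply the same checks to the real data.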

I start with some simple data analysis.

I am curious to see what the maximum and minimum GDP per capita are for each continent. I determine this with the code below. The arrange() call sorts the result by minimum GDP per capita, which is a more meaningful order than alphabetical.

gdpCont <- ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap), maxGDP = max(gdpPercap))
gdpCont <- arrange(gdpCont, minGDP)
gdpTable <- xtable(gdpCont)

The results are shown in the table below:

print(gdpTable, type = "html", include.rownames = FALSE)

continent	minGDP	maxGDP
Africa	241.17	21951.21
Asia	331.00	113523.13
Europe	973.53	49357.19
Americas	1201.64	42951.65
Oceania	10039.60	34435.37

This is somewhat interesting, but ultimately of little value; however it does match up with what I had expected to find.
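
For readers who have not used plyr, the split-apply-combine idea behind the ddply call can be sketched in base R. This is only an illustration on a made-up data frame (toyDat, the continent labels "X" and "Y", and the numbers are all hypothetical), not the code used to build the table above:

```r
# toyDat is a hypothetical stand-in for gDat; the values are invented
toyDat <- data.frame(
  continent = c("X", "X", "Y", "Y", "Y"),
  gdpPercap = c(300, 1200, 950, 4000, 2500),
  stringsAsFactors = FALSE
)

# Split: one numeric vector of gdpPercap values per continent
pieces <- split(toyDat$gdpPercap, toyDat$continent)

# Apply and combine: min and max of each piece, glued into one data frame
toyGdp <- data.frame(
  continent = names(pieces),
  minGDP    = unname(sapply(pieces, min)),
  maxGDP    = unname(sapply(pieces, max)),
  stringsAsFactors = FALSE
)

# Sort by the minimum, mirroring the arrange() step
toyGdp <- toyGdp[order(toyGdp$minGDP), ]
print(toyGdp)
```

ddply does the splitting, applying, and combining in a single call, which is why the plyr version is so much shorter.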

For my next trick I want to tackle something a little …trickier. This will entail writing my own function. The code for the function, called minMax, is displayed below. The function takes in a dataframe and outputs a new 2-by-3 matrix. The top row consists of the value of the minimum life expectancy in the input dataframe, followed by a cell with the word “Min”, and finally a cell with the name of the country corresponding to the minimum life expectancy. The second row does the same, but with the maximum life expectancy.

It now may be a bit clearer why I prevented R from turning the country variable into a factor; this was to make the final output easier to understand and to make sure that R didn't end up outputting a numerical value for the country.

minMax <- function(x) {
    values <- matrix(c(min(x$lifeExp), "Min", x$country[(x$lifeExp == min(x$lifeExp))], 
        max(x$lifeExp), "Max", x$country[(x$lifeExp == max(x$lifeExp))]), nrow = 2, 
        ncol = 3, byrow = TRUE)
    colnames(values) <- c("Life Expectancy", "Min or Max", "Country")
    return(values)
}
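
Two properties of minMax are worth keeping in mind. Because c() mixes numbers and strings, everything in the matrix is coerced to character, so the life expectancy values are stored as text. And if two countries tie for the minimum or maximum, the country subset returns more than one name and the values no longer fit a 2-by-3 matrix. A possible variant that avoids both is sketched below; minMaxDf and toyDat are hypothetical names for illustration and are not used elsewhere in this report.

```r
# minMaxDf is a hypothetical variant of minMax that returns a data frame,
# so lifeExp stays numeric and ties resolve to the first matching row
minMaxDf <- function(x) {
  lo <- which.min(x$lifeExp)  # index of the first minimum
  hi <- which.max(x$lifeExp)  # index of the first maximum
  data.frame(
    lifeExp = x$lifeExp[c(lo, hi)],
    stat    = c("Min", "Max"),
    country = x$country[c(lo, hi)],
    stringsAsFactors = FALSE
  )
}

# toyDat is made up; countries "A" and "C" tie for the minimum on purpose
toyDat <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(40.5, 72.1, 40.5),
  stringsAsFactors = FALSE
)
res <- minMaxDf(toyDat)
print(res)
```

Taking only the first match hides ties, which may or may not be what is wanted; the trade-off is a result with a predictable shape.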

I now take this function and plunk it into ddply, using the continent variable to split the original gDat. This means I will end up with a new dataframe that has two rows for each continent and four columns: the continent, plus the three columns returned by minMax.

lifeMinMax <- ddply(gDat, ~continent, minMax)
lifeTable <- xtable(lifeMinMax)

It is displayed below, and while interesting, it does not necessarily show me the “interesting” countries.

print(lifeTable, type = "html", include.rownames = FALSE)

continent	Life Expectancy	Min or Max	Country
Africa	23.599	Min	Rwanda
Africa	76.442	Max	Reunion
Americas	37.579	Min	Haiti
Americas	80.653	Max	Canada
Asia	28.801	Min	Afghanistan
Asia	82.603	Max	Japan
Europe	43.585	Min	Turkey
Europe	81.757	Max	Iceland
Oceania	69.12	Min	Australia
Oceania	81.235	Max	Australia

One way to look for interesting countries is to determine which country most often has the lowest life expectancy and which most often has the highest life expectancy.

To do this I start by creating lifeMinMaxYear using ddply, splitting on the year variable. For each year I will pull out the country with the lowest life expectancy and the country with the highest life expectancy.

lifeMinMaxYear <- ddply(gDat, ~year, summarize, lowest = country[lifeExp == 
    min(lifeExp)], highest = country[lifeExp == max(lifeExp)])

But I am not done there, as I still have a fairly unwieldy (not displayed) list of years and countries. I want to automate the counting process. To do this I create two new objects. One lists each country that ever has the highest life expectancy in a given year, along with the number of years in which it does so. The other does the same thing for the countries with the lowest life expectancy.

Finally, I pull out the names of the countries with the highest frequency in each object and stick them together in a new vector. The first element in this vector is the name of the country that most often has the highest life expectancy, and the second element is the name of the country that most often has the lowest life expectancy.

highCount <- count(lifeMinMaxYear$highest)
lowCount <- count(lifeMinMaxYear$lowest)

highLow <- c(toString(highCount$x[highCount$freq == max(highCount$freq)]), 
    toString(lowCount$x[lowCount$freq == max(lowCount$freq)]))

names(highLow) <- c("high", "low")

This lets us know that the country which, on a year-by-year basis, most frequently had the highest life expectancy was Japan, and the country that most often had the lowest life expectancy was Afghanistan. This suggests that these two countries may be interesting in some way and worth following up.
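
The counting step can also be sketched in base R with table(), which may help make clear what count() is doing. The vector below is made up for illustration (highest and its values are hypothetical, standing in for lifeMinMaxYear$highest):

```r
# Hypothetical per-year "winners"; the names are invented
highest <- c("A", "B", "B", "C", "B")

# table() tallies how often each name appears
counts <- table(highest)

# Keep every name that reaches the top frequency, so ties are not dropped
top <- names(counts)[as.vector(counts) == max(counts)]
print(top)
```

As with the toString() approach above, a tie would yield more than one name here rather than silently picking one.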