STAT 545A Homework 4

Matt Gingerich

Loading and Checking Data

This assignment is once again using the Gapminder data set downloaded from http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt. This first chunk of code loads the dataset from a local folder and uses the peek function to randomly view rows of the dataframe as a sanity check that the data imported properly. The peek function samples six rows of the data frame in a manner similar to R's standard head and tail functions; however, unlike those functions, this peek function requires a data frame and cannot operate on an array or list.

library(plyr)
gDat <- read.delim("gapminderDataFiveYear.txt")
peek <- function(df) df[sample(nrow(df), 6), ]
peek(gDat)
##       country year       pop continent lifeExp gdpPercap
## 123     Benin 1962   2151895    Africa   42.62     949.5
## 1228   Poland 1967  31785378    Europe   69.61    6557.2
## 796     Japan 1967 100825279      Asia   71.43    9847.8
## 1525 Thailand 1952  21289402      Asia   50.85     757.8
## 458     Egypt 1957  25009741    Africa   44.44    1458.9
## 1193 Paraguay 1972   2614104  Americas   65.81    2523.3

As an additional sanity check, the summary function lists the quartile values of numeric variables and the number of rows associated with each level of the country and continent factors (though the list of countries is truncated for length). There's nothing in this summary to suggest that there were problems importing the data, but there is a noticeably tiny number of rows associated with Oceania relative to the other continents.

summary(gDat)
##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
## 

Out with Oceania

In the summary above, we see that each country (or at least each of the six countries displayed), is associated with 12 rows in the data frame. This is expected, since the gDat dataset contains information on every fifth year from 1952 to 2007, so each row corresponds to a different year. Knowing this, and observing that there are only 24 rows in total for Oceania, we can conclude that there are only two countries in Oceania (a more direct way of obtaining this count is shown at the end of my previous homework submission – the number of countries in Oceania was also discussed in class, so these methods are all simply confirming that there really are only two Oceania countries in this dataset).

With so few countries in the continent, Oceania is less interesting to analyze than the other continents, so we'll begin by creating a new dataset in which Oceania has been removed. The droplevels command is used to completely remove Oceania as a level in the continent factor. A table of the number of countries per continent over time is printed using the new dataset to ensure that Oceania has actually been removed and that the remaining continents all have the expected number of countries.

The table is printed as an HTML table using xtable, and jasonm23's markdown7.css file, as outlined in this tutorial.

hDat <- droplevels(subset(gDat, continent != "Oceania"))
library(xtable)
print(xtable(table(hDat$continent, hDat$year)), type = "html")
1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 52 52 52 52 52 52 52 52 52 52 52 52
Americas 25 25 25 25 25 25 25 25 25 25 25 25
Asia 33 33 33 33 33 33 33 33 33 33 33 33
Europe 30 30 30 30 30 30 30 30 30 30 30 30

Reproducing Data

I decided to use Jonathan Zhang's homework 3 submission as the basis for the next part of this assignment. I thought his exploration of residuals for countries was an interesting starting point for an analysis and I was interested in seeing whether I could reproduce and extend his results.

The following code is copied from Jonathan's submission, with the data variable swapped for the Oceania-free hDat data frame that was constructed above.

yearMin <- min(hDat$year)
yearNames <- as.character(unique(hDat$year))
jRes <- function(x) {
    res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
    index <- which.max(res)
    return(yearNames[index])
}
largestRES <- ddply(hDat, ~country, jRes)
names(largestRES) <- c("country", "yearLargestRES")
randomRES <- largestRES[sample(nrow(largestRES), size = 10), ]
print(xtable(randomRES, caption = "The year corresponding to the largest residual in each least squares fit"), 
    type = "html")
The year corresponding to the largest residual in each least squares fit
country yearLargestRES
5 Argentina 1957
122 Switzerland 1977
92 Niger 2007
43 Finland 1982
20 Canada 1972
133 United States 1972
12 Bosnia and Herzegovina 1952
66 Japan 1952
19 Cameroon 1987
139 Zambia 1982

The output of the preceding chunk does not match what Jonathan showed, for a couple reasons. The first reason is that the dataset's been modified to remove Oceania, but the second – and more significant – reason is that the table that is shown is randomly sampling 10 rows from the large data frame of all countries and without knowing the seed in the original run, the exact results can't be reproduced (without a huge number of random guesses). Fortunately, Jonathan's exact results aren't too important here, but having noted the potential difficulty, it's worth explicitly setting a seed now so the remainder of this work could be reproduced exactly if desired.

set.seed(613)

Visualizing the “Largest Residual” Data

The first difficulty in visualizing the largestRES data is that the class of year is no longer a factor or even a number:

class(largestRES$yearLargestRES)
## [1] "character"

The other problem is that we've lost most of the data in the original dataset, so there's not much to plot agains. Both problems can be addressed with a simple ammendment to the jRes function:

jRes <- function(x) {
    res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
    index <- which.max(res)
    return(x[index, ])  # the only changed line; index into the original data frame
}
largestRES <- ddply(hDat, ~country, jRes)
print(xtable(peek(largestRES)), type = "html")
country year pop continent lifeExp gdpPercap
138 Yemen, Rep. 1952 4963829.00 Asia 32.55 781.72
133 United States 1972 209896000.00 Americas 71.34 21806.04
15 Bulgaria 1952 7274900.00 Europe 59.60 2444.29
113 Slovak Republic 1952 3558137.00 Europe 64.36 5074.66
45 Gabon 2007 1454867.00 Africa 56.73 13206.48
126 Thailand 1952 21289402.00 Asia 50.85 757.80

Our data's now in a form which is more suitable for creating figures and we can finally generate a plot of the data frame. The following graph plots the year at which the largest residual error was found for a country against the life expectancy during that year. Points are coloured by the continent of the country.

library(lattice)
xyplot(year ~ lifeExp, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))

plot of chunk firstPlot

What can we make of this graph? The majority of points are clumped at either 1952 or 2007, which suggests that the linear model is simply struggling to fit the data at the boundary of the observed years. To check this observation, we can check the density plot showing the distribution of years in largestRES.

densityplot(~year, largestRES)

plot of chunk secondPlot

Interestingly, the distribution actually appears trimodal, although the two largest peaks are indeed on 1952 and 2007. The middle peak is puzzling, but it makes more sense when the density plots are separated by continent, like so:

densityplot(~year, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))

plot of chunk thirdPlot

This figure makes it obvious that each continent tends to have a bimodal distribution of “interesting” years, but the relative height of each peak varies quite a lot between the continents. The third peak is simply an artifact created by averaging together the other distributions (the resulting distribution is starting to converge towards the central limit).

A natural question to ask of the previous plot is which continents correspond to which curves? This is clearer in the following graph, which breaks each continent out into a separate subplot.

densityplot(~year | continent, largestRES)

plot of chunk fourthPlot

It appears that the linear models of life expectancy for developing countries tend to have their largest error in recent years, while the same models for developed countries tend to have their largest error in earlier years. I've randomly sampled and plotted a number of European and African countries to determine why this is the case (the graph of Romania, Spain, Niger, and Lesotho, shown below, is representative of the examples I've seen), but from a limited number of examples it's difficult to draw any conclusions.

It was my guess that European countries had a period of exponential growth around the 1950's that is throwing off the linear regression, while African countries are experiencing these shifts now, but the random sampling of countries I've looked at doesn't particularly support that.

xyplot(lifeExp ~ year | country, hDat, subset = (country == "Lesotho" | country == 
    "Niger" | country == "Romania" | country == "Spain"), type = c("p", "r", 
    "smooth"))

plot of chunk fifthPlot

Having seen these visualizations, I believe that the main problem with checking residual error as a measure of “interestingness” is that the residuals don't provide context about the larger trends for the country. A country with a perfectly linear decrease in life expectancy every year would still be extremely atypical, despite not having any particular years with exceptional events.

Possible further things to try