STAT 545A Homework 4

Matt Gingerich

Loading and Checking Data

This assignment is once again using the Gapminder data set downloaded from http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt. This first chunk of code loads the dataset from a local folder and uses the peek function to randomly view rows of the dataframe as a sanity check that the data imported properly. The peek function samples six rows of the data frame in a manner similar to R's standard head and tail functions; however, unlike those functions, this peek function requires a data frame and cannot operate on an array or list.

library(plyr)
gDat <- read.delim("gapminderDataFiveYear.txt")
peek <- function(df) df[sample(nrow(df), 6), ]
peek(gDat)

##       country year       pop continent lifeExp gdpPercap
## 123     Benin 1962   2151895    Africa   42.62     949.5
## 1228   Poland 1967  31785378    Europe   69.61    6557.2
## 796     Japan 1967 100825279      Asia   71.43    9847.8
## 1525 Thailand 1952  21289402      Asia   50.85     757.8
## 458     Egypt 1957  25009741    Africa   44.44    1458.9
## 1193 Paraguay 1972   2614104  Americas   65.81    2523.3

As an additional sanity check, the summary function lists the quartile values of numeric variables and the number of rows associated with each level of the country and continent factors (though the list of countries is truncated for length). There's nothing in this summary to suggest that there were problems importing the data, but there is a noticeably tiny number of rows associated with Oceania relative to the other continents.

summary(gDat)

##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
##

Out with Oceania

In the summary above, we see that each country (or at least each of the six countries displayed), is associated with 12 rows in the data frame. This is expected, since the gDat dataset contains information on every fifth year from 1952 to 2007, so each row corresponds to a different year. Knowing this, and observing that there are only 24 rows in total for Oceania, we can conclude that there are only two countries in Oceania (a more direct way of obtaining this count is shown at the end of my previous homework submission – the number of countries in Oceania was also discussed in class, so these methods are all simply confirming that there really are only two Oceania countries in this dataset).

With so few countries in the continent, Oceania is less interesting to analyze than the other continents, so we'll begin by creating a new dataset in which Oceania has been removed. The droplevels command is used to completely remove Oceania as a level in the continent factor. A table of the number of countries per continent over time is printed using the new dataset to ensure that Oceania has actually been removed and that the remaining continents all have the expected number of countries.

The table is printed as an HTML table using xtable, and jasonm23's markdown7.css file, as outlined in this tutorial.

hDat <- droplevels(subset(gDat, continent != "Oceania"))
library(xtable)
print(xtable(table(hDat$continent, hDat$year)), type = "html")

	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Africa	52	52	52	52	52	52	52	52	52	52	52	52
Americas	25	25	25	25	25	25	25	25	25	25	25	25
Asia	33	33	33	33	33	33	33	33	33	33	33	33
Europe	30	30	30	30	30	30	30	30	30	30	30	30

Reproducing Data

I decided to use Jonathan Zhang's homework 3 submission as the basis for the next part of this assignment. I thought his exploration of residuals for countries was an interesting starting point for an analysis and I was interested in seeing whether I could reproduce and extend his results.

The following code is copied from Jonathan's submission, with the data variable swapped for the Oceania-free hDat data frame that was constructed above.

yearMin <- min(hDat$year)
yearNames <- as.character(unique(hDat$year))
jRes <- function(x) {
    res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
    index <- which.max(res)
    return(yearNames[index])
}
largestRES <- ddply(hDat, ~country, jRes)
names(largestRES) <- c("country", "yearLargestRES")
randomRES <- largestRES[sample(nrow(largestRES), size = 10), ]
print(xtable(randomRES, caption = "The year corresponding to the largest residual in each least squares fit"), 
    type = "html")

The year corresponding to the largest residual in each least squares fit
	country	yearLargestRES
5	Argentina	1957
122	Switzerland	1977
92	Niger	2007
43	Finland	1982
20	Canada	1972
133	United States	1972
12	Bosnia and Herzegovina	1952
66	Japan	1952
19	Cameroon	1987
139	Zambia	1982

The output of the preceding chunk does not match what Jonathan showed, for a couple reasons. The first reason is that the dataset's been modified to remove Oceania, but the second – and more significant – reason is that the table that is shown is randomly sampling 10 rows from the large data frame of all countries and without knowing the seed in the original run, the exact results can't be reproduced (without a huge number of random guesses). Fortunately, Jonathan's exact results aren't too important here, but having noted the potential difficulty, it's worth explicitly setting a seed now so the remainder of this work could be reproduced exactly if desired.

set.seed(613)

Visualizing the “Largest Residual” Data

The first difficulty in visualizing the largestRES data is that the class of year is no longer a factor or even a number:

class(largestRES$yearLargestRES)

## [1] "character"

The other problem is that we've lost most of the data in the original dataset, so there's not much to plot agains. Both problems can be addressed with a simple ammendment to the jRes function:

jRes <- function(x) {
    res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
    index <- which.max(res)
    return(x[index, ])  # the only changed line; index into the original data frame
}
largestRES <- ddply(hDat, ~country, jRes)
print(xtable(peek(largestRES)), type = "html")

	country	year	pop	continent	lifeExp	gdpPercap
138	Yemen, Rep.	1952	4963829.00	Asia	32.55	781.72
133	United States	1972	209896000.00	Americas	71.34	21806.04
15	Bulgaria	1952	7274900.00	Europe	59.60	2444.29
113	Slovak Republic	1952	3558137.00	Europe	64.36	5074.66
45	Gabon	2007	1454867.00	Africa	56.73	13206.48
126	Thailand	1952	21289402.00	Asia	50.85	757.80

Our data's now in a form which is more suitable for creating figures and we can finally generate a plot of the data frame. The following graph plots the year at which the largest residual error was found for a country against the life expectancy during that year. Points are coloured by the continent of the country.

library(lattice)
xyplot(year ~ lifeExp, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))

plot of chunk firstPlot

What can we make of this graph? The majority of points are clumped at either 1952 or 2007, which suggests that the linear model is simply struggling to fit the data at the boundary of the observed years. To check this observation, we can check the density plot showing the distribution of years in largestRES.

densityplot(~year, largestRES)

plot of chunk secondPlot

Interestingly, the distribution actually appears trimodal, although the two largest peaks are indeed on 1952 and 2007. The middle peak is puzzling, but it makes more sense when the density plots are separated by continent, like so:

densityplot(~year, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))

plot of chunk thirdPlot

This figure makes it obvious that each continent tends to have a bimodal distribution of “interesting” years, but the relative height of each peak varies quite a lot between the continents. The third peak is simply an artifact created by averaging together the other distributions (the resulting distribution is starting to converge towards the central limit).

A natural question to ask of the previous plot is which continents correspond to which curves? This is clearer in the following graph, which breaks each continent out into a separate subplot.

densityplot(~year | continent, largestRES)

plot of chunk fourthPlot

It appears that the linear models of life expectancy for developing countries tend to have their largest error in recent years, while the same models for developed countries tend to have their largest error in earlier years. I've randomly sampled and plotted a number of European and African countries to determine why this is the case (the graph of Romania, Spain, Niger, and Lesotho, shown below, is representative of the examples I've seen), but from a limited number of examples it's difficult to draw any conclusions.

It was my guess that European countries had a period of exponential growth around the 1950's that is throwing off the linear regression, while African countries are experiencing these shifts now, but the random sampling of countries I've looked at doesn't particularly support that.

xyplot(lifeExp ~ year | country, hDat, subset = (country == "Lesotho" | country == 
    "Niger" | country == "Romania" | country == "Spain"), type = c("p", "r", 
    "smooth"))

plot of chunk fifthPlot

Having seen these visualizations, I believe that the main problem with checking residual error as a measure of “interestingness” is that the residuals don't provide context about the larger trends for the country. A country with a perfectly linear decrease in life expectancy every year would still be extremely atypical, despite not having any particular years with exceptional events.

Possible further things to try

Checking for the country with the maximum absolute value of the slope of the life expectancy linear model.
Taking a derivative of the life expectancy and finding the country with the most negative derivative for a single year.
Taking the second derivative of life expectancy and searching for countries with the largest absolute values of that measure (countries with the largest inflections in the life expectancy curve).