Matt Gingerich
This assignment is once again using the Gapminder data set downloaded from http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt. This first chunk of code loads the dataset from a local folder and uses the peek function to randomly view rows of the dataframe as a sanity check that the data imported properly. The peek function samples six rows of the data frame in a manner similar to R's standard head and tail functions; however, unlike those functions, this peek function requires a data frame and cannot operate on an array or list.
library(plyr)
gDat <- read.delim("gapminderDataFiveYear.txt")
peek <- function(df) df[sample(nrow(df), 6), ]
peek(gDat)
## country year pop continent lifeExp gdpPercap
## 123 Benin 1962 2151895 Africa 42.62 949.5
## 1228 Poland 1967 31785378 Europe 69.61 6557.2
## 796 Japan 1967 100825279 Asia 71.43 9847.8
## 1525 Thailand 1952 21289402 Asia 50.85 757.8
## 458 Egypt 1957 25009741 Africa 44.44 1458.9
## 1193 Paraguay 1972 2614104 Americas 65.81 2523.3
As an additional sanity check, the summary function lists the quartile values of numeric variables and the number of rows associated with each level of the country and continent factors (though the list of countries is truncated for length). There's nothing in this summary to suggest that there were problems importing the data, but there is a noticeably tiny number of rows associated with Oceania relative to the other continents.
summary(gDat)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.00e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.79e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.02e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.96e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.96e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.32e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.6 Min. : 241
## 1st Qu.:48.2 1st Qu.: 1202
## Median :60.7 Median : 3532
## Mean :59.5 Mean : 7215
## 3rd Qu.:70.8 3rd Qu.: 9325
## Max. :82.6 Max. :113523
##
In the summary above, we see that each country (or at least each of the six countries displayed), is associated with 12 rows in the data frame. This is expected, since the gDat dataset contains information on every fifth year from 1952 to 2007, so each row corresponds to a different year. Knowing this, and observing that there are only 24 rows in total for Oceania, we can conclude that there are only two countries in Oceania (a more direct way of obtaining this count is shown at the end of my previous homework submission – the number of countries in Oceania was also discussed in class, so these methods are all simply confirming that there really are only two Oceania countries in this dataset).
With so few countries in the continent, Oceania is less interesting to analyze than the other continents, so we'll begin by creating a new dataset in which Oceania has been removed. The droplevels command is used to completely remove Oceania as a level in the continent factor. A table of the number of countries per continent over time is printed using the new dataset to ensure that Oceania has actually been removed and that the remaining continents all have the expected number of countries.
The table is printed as an HTML table using xtable, and jasonm23's markdown7.css file, as outlined in this tutorial.
hDat <- droplevels(subset(gDat, continent != "Oceania"))
library(xtable)
print(xtable(table(hDat$continent, hDat$year)), type = "html")
| 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 |
| Americas | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| Asia | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 | 33 |
| Europe | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
I decided to use Jonathan Zhang's homework 3 submission as the basis for the next part of this assignment. I thought his exploration of residuals for countries was an interesting starting point for an analysis and I was interested in seeing whether I could reproduce and extend his results.
The following code is copied from Jonathan's submission, with the data variable swapped for the Oceania-free hDat data frame that was constructed above.
yearMin <- min(hDat$year)
yearNames <- as.character(unique(hDat$year))
jRes <- function(x) {
res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
index <- which.max(res)
return(yearNames[index])
}
largestRES <- ddply(hDat, ~country, jRes)
names(largestRES) <- c("country", "yearLargestRES")
randomRES <- largestRES[sample(nrow(largestRES), size = 10), ]
print(xtable(randomRES, caption = "The year corresponding to the largest residual in each least squares fit"),
type = "html")
| country | yearLargestRES | |
|---|---|---|
| 5 | Argentina | 1957 |
| 122 | Switzerland | 1977 |
| 92 | Niger | 2007 |
| 43 | Finland | 1982 |
| 20 | Canada | 1972 |
| 133 | United States | 1972 |
| 12 | Bosnia and Herzegovina | 1952 |
| 66 | Japan | 1952 |
| 19 | Cameroon | 1987 |
| 139 | Zambia | 1982 |
The output of the preceding chunk does not match what Jonathan showed, for a couple reasons. The first reason is that the dataset's been modified to remove Oceania, but the second – and more significant – reason is that the table that is shown is randomly sampling 10 rows from the large data frame of all countries and without knowing the seed in the original run, the exact results can't be reproduced (without a huge number of random guesses). Fortunately, Jonathan's exact results aren't too important here, but having noted the potential difficulty, it's worth explicitly setting a seed now so the remainder of this work could be reproduced exactly if desired.
set.seed(613)
The first difficulty in visualizing the largestRES data is that the class of year is no longer a factor or even a number:
class(largestRES$yearLargestRES)
## [1] "character"
The other problem is that we've lost most of the data in the original dataset, so there's not much to plot agains. Both problems can be addressed with a simple ammendment to the jRes function:
jRes <- function(x) {
res <- as.vector(abs(resid(lm(lifeExp ~ I(year - yearMin), x))))
index <- which.max(res)
return(x[index, ]) # the only changed line; index into the original data frame
}
largestRES <- ddply(hDat, ~country, jRes)
print(xtable(peek(largestRES)), type = "html")
| country | year | pop | continent | lifeExp | gdpPercap | |
|---|---|---|---|---|---|---|
| 138 | Yemen, Rep. | 1952 | 4963829.00 | Asia | 32.55 | 781.72 |
| 133 | United States | 1972 | 209896000.00 | Americas | 71.34 | 21806.04 |
| 15 | Bulgaria | 1952 | 7274900.00 | Europe | 59.60 | 2444.29 |
| 113 | Slovak Republic | 1952 | 3558137.00 | Europe | 64.36 | 5074.66 |
| 45 | Gabon | 2007 | 1454867.00 | Africa | 56.73 | 13206.48 |
| 126 | Thailand | 1952 | 21289402.00 | Asia | 50.85 | 757.80 |
Our data's now in a form which is more suitable for creating figures and we can finally generate a plot of the data frame. The following graph plots the year at which the largest residual error was found for a country against the life expectancy during that year. Points are coloured by the continent of the country.
library(lattice)
xyplot(year ~ lifeExp, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))
What can we make of this graph? The majority of points are clumped at either 1952 or 2007, which suggests that the linear model is simply struggling to fit the data at the boundary of the observed years. To check this observation, we can check the density plot showing the distribution of years in largestRES.
densityplot(~year, largestRES)
Interestingly, the distribution actually appears trimodal, although the two largest peaks are indeed on 1952 and 2007. The middle peak is puzzling, but it makes more sense when the density plots are separated by continent, like so:
densityplot(~year, largestRES, groups = continent, auto.key = list(columns = nlevels(largestRES$continent)))
This figure makes it obvious that each continent tends to have a bimodal distribution of “interesting” years, but the relative height of each peak varies quite a lot between the continents. The third peak is simply an artifact created by averaging together the other distributions (the resulting distribution is starting to converge towards the central limit).
A natural question to ask of the previous plot is which continents correspond to which curves? This is clearer in the following graph, which breaks each continent out into a separate subplot.
densityplot(~year | continent, largestRES)
It appears that the linear models of life expectancy for developing countries tend to have their largest error in recent years, while the same models for developed countries tend to have their largest error in earlier years. I've randomly sampled and plotted a number of European and African countries to determine why this is the case (the graph of Romania, Spain, Niger, and Lesotho, shown below, is representative of the examples I've seen), but from a limited number of examples it's difficult to draw any conclusions.
It was my guess that European countries had a period of exponential growth around the 1950's that is throwing off the linear regression, while African countries are experiencing these shifts now, but the random sampling of countries I've looked at doesn't particularly support that.
xyplot(lifeExp ~ year | country, hDat, subset = (country == "Lesotho" | country ==
"Niger" | country == "Romania" | country == "Spain"), type = c("p", "r",
"smooth"))
Having seen these visualizations, I believe that the main problem with checking residual error as a measure of “interestingness” is that the residuals don't provide context about the larger trends for the country. A country with a perfectly linear decrease in life expectancy every year would still be extremely atypical, despite not having any particular years with exceptional events.