2013-09-29
Gapminder data can be found here.
> gDat <- read.delim("gapminderDataFiveYear.txt")
Perform a superficial check that data import went OK.
> str(gDat)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
Load the lattice, plyr and xtable packages.
> library(xtable)
> library(plyr)
> library(lattice)
Drop Oceania right from the start.
> gDat <- droplevels(subset(gDat, continent != "Oceania"))
> ## Double-check it's really gone
> table(gDat$continent)
Africa Americas Asia Europe
624 300 396 360
The code below is borrowed from Jenny's menu of data aggregation tasks.
> jTrim <- 0.2 ## the trim level for the trimmed means is 0.2
> foo <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = jTrim))
> xyplot(tMean ~ year, foo, grid=T, type=c("p", "l"))
Plot the gdpPercap over year for each continent, and connect the dots for the year specific median
> bwplot(gdpPercap ~ as.factor(year) | continent, gDat)
Note there are extreme large values of gdpPercap in Asia so the sacle of the figure is affected by these outliers.
The task is inspired by Jonathan Baik's Homework #3
> foo <- ddply(gDat, ~ year + continent, function(x) {
+ gdpPercap <- range(x$gdpPercap)
+ return(data.frame(gdpPercap, stat = c("min", "max")))
+ })
> xyplot(gdpPercap ~ year | continent, foo, group=stat, auto.key = T, type=c("p", "l"))
Focus on the year 2007.
> hDat <- subset(gDat, year==2007)
> bwplot(gdpPercap ~ continent, hDat,
+ panel = function(..., box.ratio) {
+ panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE,
+ box.ratio = box.ratio)
+ panel.bwplot(..., fill = NULL, box.ratio = 0.1)
+ })
> stripplot(gdpPercap ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p", "a"), fun=median)
> densityplot( ~ gdpPercap, hDat, plot.points = FALSE, ref = TRUE, group = continent,
+ auto.key = list(columns = nlevels(hDat$continent)))
The task is inspired by Jenny's menu of data aggregation tasks.
Any life expectancy which is less than the 20th percentile of the whole lifeExp is defined as low life expectancy.
> (bMark <- quantile(gDat$lifeExp, 0.2))
20%
45.68
> foo <- ddply(gDat, ~ continent + year, function(x) {
+ c(prop = sum(x$lifeExp <= bMark) / nrow(x))
+ })
> xyplot(prop ~ year, foo, group=continent, auto.key = T, type=c("p", "l"))
The task is inspired by Sean Jewell's Homework #3.
Fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes.
> yearMin <- min(gDat$year)
> jFun <- function(x) {
+ estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
+ names(estCoefs) <- c("intercept", "slope")
+ return(estCoefs)
+ }
> jCoefs <- ddply(gDat, ~ country + continent, jFun)
Report the country with min and max intercept within each continent from the above jCoefs dataset.
> foo <- ddply(jCoefs, ~ continent, function(x) {
+ lerange <- c(which.min(x$slope), which.max(x$slope))
+ cbind(x[lerange, c("continent", "country", "intercept", "slope")], stat=c("min.slope", "max.slope"))
+ })
> print(xtable(foo), type = "html", include.rownames = FALSE, digits = c(0, 0, 2, 2, 0))
| continent | country | intercept | slope | stat |
|---|---|---|---|---|
| Africa | Zimbabwe | 55.22 | -0.09 | min.slope |
| Africa | Libya | 42.10 | 0.63 | max.slope |
| Americas | Paraguay | 62.48 | 0.16 | min.slope |
| Americas | Nicaragua | 43.05 | 0.56 | max.slope |
| Asia | Iraq | 50.11 | 0.24 | min.slope |
| Asia | Oman | 37.21 | 0.77 | max.slope |
| Europe | Denmark | 71.03 | 0.12 | min.slope |
| Europe | Turkey | 46.02 | 0.50 | max.slope |
Note the results displayed in the above table is consistent with those in the course website
Subset the whole dataset, only include information about these “interesting” countries and add the description of these fitted slopes (max.slope vs min.slope)
> hitcountry <- foo$country
> iDat <- subset(gDat, country %in% hitcountry)
> iDat <- merge(iDat, foo[, c("country", "stat")], by = "country")
Plot lifeExp over year with fitted linear regression line for the above interesting countries, seperated by different continents.
> xyplot(lifeExp ~ year | continent, iDat, group=stat, auto.key = T, layout = c(2 ,2),
+ panel = panel.superpose,
+ panel.groups=function(x, y, ...){
+ panel.xyplot(x, y, ...)
+ panel.lmline(x, y, ...)
+ }
+ )
Look at the country of Irap or Zimbabwe, which has the mininal fitted slope in Asia or Afica. Most of their points don't fall on the fitted regression line, so a polynomial curve may be better fitted.