STAT 545A Homework 4

Yumian Hu

2013-09-29

Load the Gapminder data.

Gapminder data can be found here.

> gDat <- read.delim("gapminderDataFiveYear.txt")

Perform a superficial check that data import went OK.

> str(gDat)
'data.frame':   1704 obs. of  6 variables:
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Load the lattice, plyr and xtable packages.

> library(xtable)
> library(plyr)
> library(lattice)

Drop Oceania right from the start.

> gDat <- droplevels(subset(gDat, continent != "Oceania"))
> ## Double-check it's really gone
> table(gDat$continent)

  Africa Americas     Asia   Europe 
     624      300      396      360 

Examine trimmed mean of life expectancy for different years.

The code below is borrowed from Jenny's menu of data aggregation tasks.

> jTrim <- 0.2 ## the trim level for the trimmed means is 0.2
> foo <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = jTrim))
> xyplot(tMean ~ year, foo, grid=T, type=c("p", "l"))

plot of chunk unnamed-chunk-6

How is life expectancy changing over time on different continents?

Plot the gdpPercap over year for each continent, and connect the dots for the year specific median

> bwplot(gdpPercap ~ as.factor(year) | continent, gDat)

plot of chunk unnamed-chunk-7

Note there are extreme large values of gdpPercap in Asia so the sacle of the figure is affected by these outliers.

Depict the maximum and minimum of GDP per capita for all continents

The task is inspired by Jonathan Baik's Homework #3

> foo <- ddply(gDat, ~ year + continent, function(x) {
+   gdpPercap <- range(x$gdpPercap)
+   return(data.frame(gdpPercap, stat = c("min", "max")))
+   })
> xyplot(gdpPercap ~ year | continent, foo, group=stat, auto.key = T, type=c("p", "l"))

plot of chunk unnamed-chunk-8

Look at the spread of GDP per capita within the continents in year 2007.

Focus on the year 2007.

> hDat <- subset(gDat, year==2007) 

Boxplot

> bwplot(gdpPercap ~ continent, hDat, 
+        panel = function(..., box.ratio) {
+          panel.violin(..., col = "transparent", border = "grey60", varwidth = FALSE, 
+                       box.ratio = box.ratio)
+          panel.bwplot(..., fill = NULL, box.ratio = 0.1)
+          })

plot of chunk unnamed-chunk-10

Stripplot

> stripplot(gdpPercap ~ continent, hDat, jitter.data = TRUE, grid = "h", type = c("p", "a"), fun=median)

plot of chunk unnamed-chunk-11

Densityplot

> densityplot( ~ gdpPercap, hDat, plot.points = FALSE, ref = TRUE, group = continent, 
+             auto.key = list(columns = nlevels(hDat$continent)))

plot of chunk unnamed-chunk-12

Depict the proportion of countries with low life expectancy over time by continent.

The task is inspired by Jenny's menu of data aggregation tasks.

Any life expectancy which is less than the 20th percentile of the whole lifeExp is defined as low life expectancy.

> (bMark <- quantile(gDat$lifeExp, 0.2))
  20% 
45.68 
> foo <- ddply(gDat, ~ continent + year, function(x) {
+   c(prop = sum(x$lifeExp <= bMark) / nrow(x))
+ })
> xyplot(prop ~ year, foo, group=continent, auto.key = T, type=c("p", "l"))

plot of chunk unnamed-chunk-13

Find countries exhibit extremely rapid and slow life expectancy gains.

The task is inspired by Sean Jewell's Homework #3.

Fit a linear regression for each country, modelling life expectancy as a function of the year and then retain the estimated intercepts and slopes.

> yearMin <- min(gDat$year)
> jFun <- function(x) {
+   estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
+   names(estCoefs) <- c("intercept", "slope")
+   return(estCoefs)
+ }
> jCoefs <- ddply(gDat, ~ country + continent, jFun)

Report the country with min and max intercept within each continent from the above jCoefs dataset.

> foo <- ddply(jCoefs, ~ continent, function(x) {
+   lerange <- c(which.min(x$slope), which.max(x$slope))
+   cbind(x[lerange, c("continent", "country", "intercept", "slope")], stat=c("min.slope", "max.slope"))
+ })
> print(xtable(foo), type = "html", include.rownames = FALSE, digits = c(0, 0, 2, 2, 0))
continent country intercept slope stat
Africa Zimbabwe 55.22 -0.09 min.slope
Africa Libya 42.10 0.63 max.slope
Americas Paraguay 62.48 0.16 min.slope
Americas Nicaragua 43.05 0.56 max.slope
Asia Iraq 50.11 0.24 min.slope
Asia Oman 37.21 0.77 max.slope
Europe Denmark 71.03 0.12 min.slope
Europe Turkey 46.02 0.50 max.slope

Note the results displayed in the above table is consistent with those in the course website

Subset the whole dataset, only include information about these “interesting” countries and add the description of these fitted slopes (max.slope vs min.slope)

> hitcountry <- foo$country
> iDat <- subset(gDat, country %in% hitcountry)
> iDat <- merge(iDat, foo[, c("country", "stat")], by = "country")

Plot lifeExp over year with fitted linear regression line for the above interesting countries, seperated by different continents.

> xyplot(lifeExp ~ year | continent, iDat, group=stat, auto.key = T, layout = c(2 ,2),
+        panel = panel.superpose,
+        panel.groups=function(x, y, ...){
+            panel.xyplot(x, y, ...)
+            panel.lmline(x, y, ...)
+          }
+        )

plot of chunk unnamed-chunk-17

Look at the country of Irap or Zimbabwe, which has the mininal fitted slope in Asia or Afica. Most of their points don't fall on the fitted regression line, so a polynomial curve may be better fitted.