stat545a-2013-hw04_woollard-geo

## Attaching package: 'reshape'
## 
## The following object is masked from 'package:plyr':
## 
## rename, round_any

Load the data and drop the continent Oceania, since it has only 2 countries.

Critique of another student's code

This code comes from homework 3: gao-wen source | report

print_table

She first made a generalized function called print_table, which keeps the code compact. One improvement would be to include an option to round certain columns as desired.

print_table = function(df) {
    print(xtable(df), type = "html", include.rownames = FALSE)
}

Change in life expectancy

She sometimes makes the code uber compact (2 lines!).

Pro: avoids placeholder names like foo for objects (like data.frames) that are not going to be used later
Con: can make the line really long and hard to read, at least on rpubs, which doesn't preserve the whitespace formatting of the source

In fact the original source is actually much more readable than the rpubs version

original source

trimFrac <- 0.15
print_table(
  ddply(gDat, ~year, summarize,
        trimmedMean = mean(lifeExp, trim=trimFrac)))

rpubs

print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac)))
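For readers unfamiliar with the trim argument used in both versions: mean(x, trim = f) sorts the values and discards the fraction f from each end before averaging, so a few extreme countries cannot dominate the yearly summary. A minimal sketch with made-up numbers (the vector x is hypothetical, not Gapminder data):

```r
# Hypothetical values: nine ordinary observations and one extreme outlier.
x <- c(1:9, 100)
trimFrac <- 0.15

plainMean   <- mean(x)                   # pulled upward by the outlier
trimmedMean <- mean(x, trim = trimFrac)  # drops floor(10 * 0.15) = 1 value from each end

print(c(plain = plainMean, trimmed = trimmedMean))
```

Here the trimmed mean is the average of 2 through 9, while the plain mean is dragged up by the single value of 100.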

Plot the time course of trimmed mean world life expectancy

Plotting this data shows that it is increasing, but at a slower and slower rate

trimFrac <- 0.15
df <- ddply(gDat, ~year, summarize, meanLifeExp = mean(lifeExp, trim = trimFrac))
xyplot(meanLifeExp ~ year, df, type = c("a", "p"))

plot of chunk unnamed-chunk-5

Rich and Poor Continents

These questions use just a subset of the data from year 2007. Why the year 2007? It is the final year in the data for every country. It would be better to store the year in a variable instead of hardcoding 2007, since it is referenced in multiple places. With a single variable controlling it, the year could be changed by modifying just one spot.

gdp_stat <- function(gDatCntnt) {
  minGDP <- min(gDatCntnt$gdpPercap)
  maxGDP <- max(gDatCntnt$gdpPercap)
  return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP)))
}
gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat)
gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap)

One quick fix is to get the year from the data itself and then replace all references to 2007 with finalYear:

(finalYear = max(gDat$year))

## [1] 2007
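As a rough sketch of what that replacement could look like, the snippet below reruns the min/max summary with finalYear in place of the literal 2007. The toyDat data frame and its values are made up for illustration (only the column names mirror gDat), and gdp_stat is a slightly condensed version of the function above:

```r
library(plyr)

# Hypothetical stand-in for gDat: same column names, invented values.
toyDat <- data.frame(
  continent = rep(c("Africa", "Asia"), each = 4),
  country   = rep(c("A1", "A2", "B1", "B2"), each = 2),
  year      = rep(c(2002, 2007), times = 4),
  gdpPercap = c(500, 600, 900, 1000, 2000, 2500, 30000, 35000)
)

# Same idea as gdp_stat above: min and max GDP per capita in tall form.
gdp_stat <- function(gDatCntnt) {
  data.frame(stat.type   = c("min", "max"),
             GDP.per.cap = range(gDatCntnt$gdpPercap))
}

# The year is derived from the data once and reused everywhere.
finalYear <- max(toyDat$year)
gdpFinal  <- ddply(subset(toyDat, year == finalYear), ~continent, gdp_stat)
gdpFinal  <- arrange(gdpFinal, GDP.per.cap)
print(gdpFinal)
```

If the data were later extended with a newer year, only the data would change; the subset would follow automatically.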

Plot the max and min GDP per capita in each continent

there's more than one way to skin a cat

The above results tell me something about the ordering of the continents (omitting Oceania):

MIN: Africa < Asia < Americas < Europe
MAX: Africa < Americas < Asia < Europe
But this is just for the year 2007… what's going on in other years? Since there are only a few continents, we can use panel plots. The tall table has the statistic type (max or min) in one column and the value in another. This makes it very easy to plot both on the same graph, since we can set groups = stat.type.

gdpByContYear <- ddply(gDat, ~continent + year, gdp_stat)
xyplot(GDP.per.cap ~ year | continent,
       data = gdpByContYear,
       groups = stat.type, auto.key=TRUE, type=c("a","p")
       )

plot of chunk unnamed-chunk-8

There is not as much information as in a box plot or violin plot, but the presentation is perhaps cleaner. But what are we really interested in? We want to compare the different continents, right? I don't really like looking back and forth between panels, so we can swap the conditioning argument (|) and groups: | A, groups = B --> | B, groups = A.

xyplot(GDP.per.cap ~ year | stat.type, data = gdpByContYear, 
       groups = continent, auto.key=TRUE, type=c("a","p"), scales="free"
       )

plot of chunk unnamed-chunk-9

Note the scales="free" argument, which gives each panel its own axis limits instead of sharing one scale across panels.
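One possible refinement, sketched here on made-up data: since both panels cover the same years, only the y axis really needs to be freed. lattice accepts a per-axis list for this. The toy data frame below is a hypothetical stand-in for gdpByContYear with invented values.

```r
library(lattice)

# Hypothetical stand-in for gdpByContYear; year varies fastest, then continent.
toy <- expand.grid(year      = c(1997, 2002, 2007),
                   continent = c("Africa", "Asia"),
                   stat.type = c("min", "max"))
toy$GDP.per.cap <- c(300, 350, 400,      500, 600, 700,        # min rows
                     8000, 9000, 10000,  30000, 40000, 50000)  # max rows

# Free only the y axis; the year axis stays shared between the two panels.
p <- xyplot(GDP.per.cap ~ year | stat.type, data = toy,
            groups = continent, auto.key = TRUE, type = c("a", "p"),
            scales = list(y = list(relation = "free")))
# In a script, call print(p) to draw the plot.
```

This keeps the min and max panels directly comparable along the time axis while still letting each use a sensible GDP range.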

Interesting deviation from predicted life expectancy

We can find “outlying” countries by finding countries that have a large residual. The approach is:

build a function that gets the max residual
feed it into ddply
order the output by the residuals

Note that the function takes the absolute value of the residual, which can be positive or negative depending on whether the point is respectively above or below the fit. The function max_residual is fed into ddply without the summarize function. It is customized to give back what we want, in this case the maximum residual and the year.

max_residual <- function(cDat){
  # linear fit of lifeExp against years elapsed since the country's first observation
  clm <- lm(lifeExp ~ I(year-min(cDat$year)), data=cDat)
  # absolute residuals, one per observation
  largestResidual <- abs(clm$residuals)
  maxRsdl <- max(largestResidual)
  return(c(Max.Residual=maxRsdl,
           Year=cDat[which.max(largestResidual),'year']))
}
interesting <- ddply(gDat, ~country, max_residual)
# sort countries from largest to smallest maximum residual
interesting <- arrange(interesting, abs(Max.Residual), decreasing=TRUE)
print_table(head(interesting))

country	Max.Residual	Year
Rwanda	17.31	1992.00
Cambodia	15.69	1977.00
Swaziland	12.00	2007.00
Zimbabwe	10.58	2002.00
Lesotho	10.04	2007.00
Botswana	9.33	2002.00

At the end we checked the output with head. Here we see an issue with the print_table function: it prints .00 after each year, even though years are always integers. What we need instead is the digits argument of xtable. This could easily be packaged up into print_table, since digits works well with data.frames (there aren't issues with rounding factors).

My partial solution

numCountries <- 11  # number of interesting countries
rounded <- xtable(head(interesting, numCountries), digits = c(0, 0, 2, 0))  # just use 0 for row number and country
print(rounded, type = "html", include.rownames = FALSE)

country	Max.Residual	Year
Rwanda	17.31	1992
Cambodia	15.69	1977
Swaziland	12.00	2007
Zimbabwe	10.58	2002
Lesotho	10.04	2007
Botswana	9.33	2002
South Africa	9.31	2007
China	8.00	1962
Namibia	7.21	2002
Gabon	6.77	2007
Iraq	6.70	1987
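Packaging this up could look roughly like the sketch below. The digits parameter on print_table is a hypothetical extension, not part of the original homework code, and it is simply handed through to xtable. The toy data frame repeats the first two rows of the table above.

```r
library(xtable)

# Hypothetical extension of print_table: digits is passed straight to xtable().
# digits = NULL keeps xtable's defaults; otherwise give one entry for the
# row names plus one entry per column.
print_table <- function(df, digits = NULL) {
  print(xtable(df, digits = digits), type = "html", include.rownames = FALSE)
}

# First two rows of the table above, typed in by hand for illustration.
toy <- data.frame(country      = c("Rwanda", "Cambodia"),
                  Max.Residual = c(17.31, 15.69),
                  Year         = c(1992, 1977))

# Capture the HTML so it can be inspected, then show it.
htmlLines <- capture.output(print_table(toy, digits = c(0, 0, 2, 0)))
cat(htmlLines, sep = "\n")
```

Calling print_table(toy) without digits reproduces the old behaviour, so existing calls would not need to change.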

Plot the top 11 interesting countries

We used the residuals to find countries and their associated years. We can see the big residuals if we plot lifeExp vs year with a fitted line: just look for the data points that are far from the line.

interesting <- arrange(interesting, abs(Max.Residual), decreasing = T)
xyplot(lifeExp ~ year | country, gDat, 
       subset = country %in% head(interesting,numCountries)$country, 
       type = c("p", "r"))

plot of chunk unnamed-chunk-12

As we can see with Iraq, the largest residual is in 1987. This isn't some dramatic year in itself, but the year before a change in trend. There have been so many wars in Iraq that it is difficult to relate historical events to the data:

1980-1988 Iran-Iraq War
1990-1991 Gulf War
2003 - present Iraq War

Choose your own adventure

This answers the question “Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures.” Why does the life expectancy decrease so dramatically? Two striking examples were Cambodia and Rwanda, which had clear periods of conflict in their history (see Khmer Rouge rule of Cambodia in the 1970s and the Rwandan Genocide in the 1990s). But the life expectancy doesn't just reflect an overall loss of life, it reflects changes in the average age of the population. Roughly speaking, the life expectancy changes when on average more young people are dying than old people, or vice versa. So I decided to look into the relation between population and life expectancy. Obviously I need to somehow normalize, since population is on the order of 10^6, while life expectancy tends to fluctuate in the range 20-100.

# scale(center=FALSE) divides each series by its root mean square;
# as.numeric() drops the one-column matrix that scale() returns
wide  <- ddply(subset(gDat, subset = country %in% head(interesting,numCountries)$country), 
               ~country, summarize, 
               normPop=as.numeric(scale(pop, center=FALSE)), 
               normLifeExp=as.numeric(scale(lifeExp, center=FALSE)), 
               year=year
               )
tall  <- melt(wide, id=c('country','year'))
xyplot(value ~ year | country, data=tall, 
       groups = variable, auto.key=TRUE
       )

plot of chunk unnamed-chunk-13
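The normalization above relies on how scale behaves with center = FALSE: per the R documentation it divides each value by the root mean square, sqrt(sum(x^2) / (n - 1)), rather than by the standard deviation. A small check on a made-up population vector (toyPop is hypothetical) shows the equivalence, and why the result is of order 1 regardless of the original units:

```r
# Hypothetical populations for one country over five time points.
toyPop <- c(7.0e6, 7.5e6, 8.1e6, 7.2e6, 6.9e6)

# Root mean square as defined in ?scale (note the n - 1 denominator).
rms <- sqrt(sum(toyPop^2) / (length(toyPop) - 1))

byHand  <- toyPop / rms
byScale <- as.numeric(scale(toyPop, center = FALSE))

print(all.equal(byHand, byScale))
print(round(byScale, 3))
```

Because both pop and lifeExp are divided by their own root mean square within each country, the two curves land on a comparable scale in each panel, but the absolute units are lost.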

I think the take-home message here is that life expectancy is a much more sensitive variable than overall population for visualizing major shifts in the population. There are only a few cases where the overall population decreases over a 5-year interval (Lesotho, Rwanda, Cambodia), and at these time points the life expectancy has a more pronounced negative trend.