stat545a-2013-hw03_gao-wen.rmd

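library(plyr)    # ddply(), daply(), arrange(), summarize()
library(xtable)  # HTML table rendering
gDat <- read.table("data/gapminderDataFiveYear.txt", sep = "\t", quote = "\"", 
    header = TRUE)

# Print a data frame as an HTML table without row names
print_table <- function(df) {
    print(xtable(df), type = "html", include.rownames = FALSE)
}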

For some of the analyses below, I limited the data to the year 2007, because it makes little sense to compare continents on maximum GDP per capita values that come from different years.

Rich and Poor Continents

Display maximum and minimum GDP per capita for all continents in year 2007, in a “wide” format, sorted by increasing maximum GDP per capita:

gdpByContinent <- ddply(subset(gDat, year == 2007), ~continent, summarize, maxGDPperCap = max(gdpPercap), 
    minGDPperCap = min(gdpPercap))
maxGdpByContinent <- arrange(gdpByContinent, maxGDPperCap)
print_table(maxGdpByContinent)
continent maxGDPperCap minGDPperCap
Africa 13206.48 277.55
Oceania 34435.37 25185.01
Americas 42951.65 1201.64
Asia 47306.99 944.00
Europe 49357.19 5937.03

The ranking of continents by GDP per capita depends on whether the maximum or the minimum GDP per capita is used in the comparison. Oceania has no poor countries, even though its richest country falls behind the richest countries in several other continents. Asia, on the other hand, has both very rich and very poor countries.

Same as above, but in a long format

gdp_stat <- function(gDatCntnt) {
    minGDP <- min(gDatCntnt$gdpPercap)
    maxGDP <- max(gDatCntnt$gdpPercap)
    return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP)))
}
gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat)
gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap)
print_table(gdpByContinentTall)
continent stat.type GDP.per.cap
Africa min 277.55
Asia min 944.00
Americas min 1201.64
Europe min 5937.03
Africa max 13206.48
Oceania min 25185.01
Oceania max 34435.37
Americas max 42951.65
Asia max 47306.99
Europe max 49357.19

Interesting that Oceania's “poorest” country is richer than the richest country in Africa.
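
To make the relation between the two layouts concrete, here is a minimal base-R sketch that stacks a wide table into the long one without plyr. It only uses two continents, with the 2007 values copied from the tables above; the variable names wide and long are just for illustration.

```r
# Wide table: one row per continent (2007 values copied from the table above)
wide <- data.frame(
    continent = c("Africa", "Oceania"),
    maxGDPperCap = c(13206.48, 34435.37),
    minGDPperCap = c(277.55, 25185.01)
)

# Long table: stack the min and max columns, one row per continent and statistic
long <- rbind(
    data.frame(continent = wide$continent, stat.type = "min",
        GDP.per.cap = wide$minGDPperCap),
    data.frame(continent = wide$continent, stat.type = "max",
        GDP.per.cap = wide$maxGDPperCap)
)

# Sort by GDP per capita, as in the long table above
long <- long[order(long$GDP.per.cap), ]
print(long)
```

After sorting, both Africa rows come before both Oceania rows, which is the long-format way of seeing that Oceania's minimum exceeds Africa's maximum.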

The gap between the rich and the poor in a continent

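Tabulate the variation of GDP per capita within each continent, as of year 2007. Three different measures of spread are used: standard deviation (STD), median absolute deviation (MAD), and interquartile range (IQR):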

vGdpByContinent <- ddply(subset(gDat, year == 2007), ~continent, summarize, 
    STD = sd(gdpPercap), MAD = mad(gdpPercap), IQR = IQR(gdpPercap))
vGdpByContinent <- arrange(vGdpByContinent, STD)
print_table(vGdpByContinent)
continent STD MAD IQR
Africa 3618.16 1032.21 3130.55
Oceania 6540.99 6857.29 4625.18
Americas 9713.21 4773.60 6249.22
Europe 11800.34 12506.17 19006.06
Asia 14154.94 4566.12 19863.98

It looks like different measures of spread of GDP per capita may produce different rankings of these continents. For the 2007 data, STD and IQR rank the continents the same way, but MAD gives a different order. I guess the shape of the distribution and the presence of outliers affect these statistics differently. But it's hard to imagine how without making a density plot…
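
The guess about outliers can be checked on a toy sample. This is only a sketch with made-up numbers (x, xOut, and the helper spread are hypothetical, not gapminder data): one extreme value is added and the three measures are recomputed.

```r
# Hypothetical sample, not gapminder data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
# Same sample with the largest value replaced by an extreme outlier
xOut <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Compute the three spread measures used in the table above
spread <- function(v) c(STD = sd(v), MAD = mad(v), IQR = IQR(v))

print(rbind(clean = spread(x), outlier = spread(xOut)))
```

In this toy case the single outlier inflates STD, while MAD and IQR do not move at all, because both depend only on the middle of the sorted data. Real GDP distributions are messier, so this does not by itself explain the continent rankings.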

Change in life expectancy

This shows how life expectancy changed between 1952 and 2007. For each year, the mean is computed after trimming the top and bottom 15% of the countries' life expectancies:

trimFrac <- 0.15
print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac)))
year trimmedMean
1952 48.20
1957 51.01
1962 53.41
1967 55.80
1972 58.08
1977 60.26
1982 62.32
1987 64.24
1992 65.55
1997 66.45
2002 67.26
2007 68.64

Between 1952 and 2007, the trimmed mean of life expectancy increased steadily, from 48.20 to 68.64 years.
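
To see what the trim argument actually does, here is a small sketch on made-up values (the vector le and the names nDrop, kept, byHand are hypothetical). It compares mean(..., trim = ) with dropping the extremes by hand.

```r
# Hypothetical life expectancies for 10 countries, with one extreme low value
le <- c(20, 48, 50, 52, 55, 60, 62, 65, 70, 72)
trimFrac <- 0.15

# mean() drops floor(n * trim) observations from each end of the sorted data
nDrop <- floor(length(le) * trimFrac)   # 1 value per end for n = 10
kept <- sort(le)[(nDrop + 1):(length(le) - nDrop)]

trimmed <- mean(le, trim = trimFrac)
byHand <- mean(kept)
print(c(plain = mean(le), trimmed = trimmed, byHand = byHand))
```

The trimmed mean and the by-hand version agree, and both sit above the plain mean because the extreme low value is discarded.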

How is life expectancy changing over time on different continents: (Note that only the first few lines of a very long table are displayed here)

cyLifeExp <- ddply(gDat, ~continent + year, summarize, Mean.Life.Exp = mean(lifeExp))
print_table(head(cyLifeExp, n = 15))
continent year Mean.Life.Exp
Africa 1952 39.14
Africa 1957 41.27
Africa 1962 43.32
Africa 1967 45.33
Africa 1972 47.45
Africa 1977 49.58
Africa 1982 51.59
Africa 1987 53.34
Africa 1992 53.63
Africa 1997 53.60
Africa 2002 53.33
Africa 2007 54.81
Americas 1952 53.28
Americas 1957 55.96
Americas 1962 58.40

It is very hard to read data displayed this way.

How is life expectancy changing over time on different continents (using the wide format)

cyLifeExpWide <- daply(gDat, ~year + continent, summarize, Mean.Life.Exp = mean(lifeExp))
print(xtable(cyLifeExpWide), type = "html")
year Africa Americas Asia Europe Oceania
1952 39.14 53.28 46.31 64.41 69.25
1957 41.27 55.96 49.32 66.70 70.30
1962 43.32 58.40 51.56 68.54 71.09
1967 45.33 60.41 54.66 69.74 71.31
1972 47.45 62.39 57.32 70.78 71.91
1977 49.58 64.39 59.61 71.94 72.85
1982 51.59 66.23 62.62 72.81 74.29
1987 53.34 68.09 64.85 73.64 75.32
1992 53.63 69.57 66.54 74.44 76.94
1997 53.60 71.15 68.02 75.51 78.19
2002 53.33 72.42 69.23 76.70 79.74
2007 54.81 73.61 70.73 77.65 80.72

Show how the proportion of countries in each continent with life expectancy below 40 changes over time:

ageCut <- 40
lowLECountriesByCntnt <- daply(gDat, ~year + continent, summarize, sum(lifeExp < 
    ageCut)/length(unique(country)))
print(xtable(lowLECountriesByCntnt), type = "html")
year Africa Americas Asia Europe Oceania
1952 0.56 0.04 0.30 0.00 0.00
1957 0.44 0.00 0.15 0.00 0.00
1962 0.29 0.00 0.09 0.00 0.00
1967 0.19 0.00 0.06 0.00 0.00
1972 0.12 0.00 0.06 0.00 0.00
1977 0.06 0.00 0.06 0.00 0.00
1982 0.06 0.00 0.03 0.00 0.00
1987 0.02 0.00 0.00 0.00 0.00
1992 0.06 0.00 0.00 0.00 0.00
1997 0.04 0.00 0.00 0.00 0.00
2002 0.04 0.00 0.00 0.00 0.00
2007 0.02 0.00 0.00 0.00 0.00

The proportion of countries with low life expectancy decreases dramatically in Africa, Asia, and the Americas; Europe and Oceania have no such countries in any year. As of 2007, only about 2% of African countries, and none elsewhere, have a life expectancy below 40 years.
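
Since each country appears once per year, the count divided by the number of countries is just the mean of a logical vector. A minimal base-R sketch of the same computation, on a made-up data frame (toy, its continents "A" and "B", and all values are hypothetical):

```r
# Hypothetical data: two continents, two countries each, two years
toy <- data.frame(
    continent = rep(c("A", "B"), each = 4),
    year = rep(c(1952, 1952, 2007, 2007), times = 2),
    country = c("a1", "a2", "a1", "a2", "b1", "b2", "b1", "b2"),
    lifeExp = c(35, 38, 39, 52, 45, 36, 60, 62)
)
ageCut <- 40

# Proportion of countries below the cutoff, in a year-by-continent matrix;
# the mean of a logical vector is the fraction of TRUE values
prop <- with(toy, tapply(lifeExp < ageCut, list(year, continent), mean))
print(prop)
```

This gives the same year-by-continent layout as the daply() call, with rows named by year and columns by continent.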

Interesting deviation from predicted life expectancy

This shows the countries with the largest absolute residual from a linear model fit of life expectancy against time, and the year in which that residual occurs:

# For one country's data, fit lifeExp ~ years since first observation and
# return the largest absolute residual together with the year it occurs in
max_residual <- function(cDat) {
    clm <- lm(lifeExp ~ I(year - min(cDat$year)), data = cDat)
    absResiduals <- abs(clm$residuals)
    maxRsdl <- max(absResiduals)
    return(c(Max.Residual = maxRsdl, Year = cDat[which.max(absResiduals), 
        "year"]))
}
interesting <- ddply(gDat, ~country, max_residual)
interesting <- arrange(interesting, abs(Max.Residual), decreasing = TRUE)
print_table(head(interesting))
country Max.Residual Year
Rwanda 17.31 1992.00
Cambodia 15.69 1977.00
Swaziland 12.00 2007.00
Zimbabwe 10.58 2002.00
Lesotho 10.04 2007.00
Botswana 9.33 2002.00

I sort of expected a deviation from predicted life expectancy in Rwanda after the genocide (1994), but the largest residual actually appears in 1992, before the genocide. Maybe it's due to the ongoing civil war?
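
As a sanity check on the residual approach, here is a sketch on a made-up country (all values hypothetical): life expectancy follows a straight line except for a sharp drop in one observation, and the same kind of fit is used to find it.

```r
# Hypothetical country: life expectancy rises 0.3 years per year from 40,
# except for a 15-year drop in the 1992 observation
year <- seq(1952, 2007, by = 5)
lifeExp <- 40 + 0.3 * (year - 1952)
lifeExp[year == 1992] <- lifeExp[year == 1992] - 15

# Fit life expectancy against years since the first observation
fit <- lm(lifeExp ~ I(year - min(year)))
absResiduals <- abs(resid(fit))
worstYear <- year[which.max(absResiduals)]
print(c(Max.Residual = max(absResiduals), Year = worstYear))
```

The largest residual lands on the year of the drop, but it is smaller than the drop itself, because the fitted line is pulled toward the low point. So the residuals in the table above probably understate how far those countries fell from their earlier trend.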