STAT 545A Homework 4

Sep 29 2013


The Data

We will be taking another look at the gapminder data set. In the previous homework, we carried out a set of data aggregation tasks to explore the data set. In this homework, we will take a look at a few data aggregation results and generate one or more accompanying figures.

We load the data and perform a subset operation to remove the Oceania continent, as recommended by Jenny. As a quick sanity check, we inspect the structure of the data set with str before (not shown) and after (shown below) the subset operation.

str(gDat)
## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
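The loading and subsetting code is not shown in this report. As a rough sketch of the subset step only, the following uses a small made-up data frame (the name toy and all values in it are assumptions for illustration, not the gapminder data) and base R's subset and droplevels:

```r
# Made-up stand-in for the gapminder data; values are illustrative only.
toy <- data.frame(
  country   = factor(c("Australia", "Canada", "Japan", "Kenya")),
  continent = factor(c("Oceania", "Americas", "Asia", "Africa")),
  year      = 2007L,
  pop       = c(20, 33, 127, 36)
)

# Remove the Oceania rows, then drop factor levels that no longer occur.
toySub <- droplevels(subset(toy, continent != "Oceania"))

# Sanity check: Oceania should be gone from both the rows and the levels.
str(toySub)
```

Without droplevels, the continent factor would still list Oceania as a level with zero rows, which can produce empty panels or empty groups later on.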

Figures for HW3 Data Aggregation Tasks

Depict the maximum and minimum of GDP per capita for all continents

This task is from Jenny's menu of data aggregation tasks (link to task). We alter the task very slightly by including the 10% trimmed mean in addition to the minimum and maximum GDP per capita.
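To illustrate what a 10% trimmed mean does, here is a minimal sketch on made-up numbers (the vector x is an assumption for illustration). With trim = 0.1, base R's mean drops the lowest 10% and the highest 10% of the observations before averaging, which makes the result less sensitive to extreme values:

```r
# Made-up values with one extreme observation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)

plainMean <- mean(x)              # pulled upward by the value 1000
trimMean  <- mean(x, trim = 0.1)  # drops floor(10 * 0.1) = 1 value from each end

# The trimmed mean equals the plain mean of the middle eight sorted values.
middleMean <- mean(sort(x)[2:9])
c(plain = plainMean, trimmed = trimMean, middle = middleMean)
```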

We begin by using ddply to aggregate the data. We use a custom function in ddply to retrieve, for each year and continent, the countries with the minimum and maximum GDP per capita, along with the 10% trimmed mean.

# Min, max, and 10% trimmed mean of GDP per capita by year and continent
minMaxGDP <- ddply(gDat, ~year + continent, .drop = FALSE, function(x) {
    # Countries holding the minimum and maximum GDP per capita in this group
    minCountry <- as.character(x$country[which.min(x$gdpPercap)])
    minGdp <- min(x$gdpPercap)
    maxCountry <- as.character(x$country[which.max(x$gdpPercap)])
    maxGdp <- max(x$gdpPercap)
    # The trimmed mean is not tied to any single country
    trimMean <- mean(x$gdpPercap, trim = 0.1)
    # Return one row per statistic ("long" format)
    return(data.frame(gdpPercap = c(minGdp, maxGdp, trimMean),
        country = c(minCountry, maxCountry, ""),
        stat = c("minGdpPercap", "maxGdpPercap", "trimMean")))
})
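For readers unfamiliar with plyr, the split-apply-combine idea behind ddply can be sketched in base R. This is not the code used in the report; it swaps in split, lapply, and do.call(rbind, ...) and runs on a made-up data frame (toy and its values are assumptions for illustration):

```r
# Made-up toy data: two continents, two years, two countries each.
toy <- data.frame(
  year      = rep(c(1952, 1957), each = 4),
  continent = rep(c("A", "A", "B", "B"), times = 2),
  country   = rep(c("a1", "a2", "b1", "b2"), times = 2),
  gdpPercap = c(100, 300, 500, 900, 150, 350, 600, 1000),
  stringsAsFactors = FALSE
)

# Split: one piece per year/continent combination.
pieces <- split(toy, list(toy$year, toy$continent), drop = TRUE)

# Apply: build one summary row per piece.
summaries <- lapply(pieces, function(x) {
  data.frame(
    year       = x$year[1],
    continent  = x$continent[1],
    minCountry = x$country[which.min(x$gdpPercap)],
    maxCountry = x$country[which.max(x$gdpPercap)],
    stringsAsFactors = FALSE
  )
})

# Combine: stack the pieces back into a single data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
result
```

ddply performs these three steps in one call and attaches the splitting variables to the output automatically.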

We display the aggregated data in a “long” table below. To save some space, we display only the first 20 rows of the 144 rows in the table. It is (at least for me) very hard to extract interesting information from this table.

year  continent  gdpPercap  country                 stat
1952  Africa        298.85  Lesotho                 minGdpPercap
1952  Africa       4725.30  South Africa            maxGdpPercap
1952  Africa       1085.16                          trimMean
1952  Americas     1397.72  Dominican Republic      minGdpPercap
1952  Americas    13990.48  United States           maxGdpPercap
1952  Americas     3494.33                          trimMean
1952  Asia          331.00  Myanmar                 minGdpPercap
1952  Asia       108382.35  Kuwait                  maxGdpPercap
1952  Asia         1690.45                          trimMean
1952  Europe        973.53  Bosnia and Herzegovina  minGdpPercap
1952  Europe      14734.23  Switzerland             maxGdpPercap
1952  Europe       5436.62                          trimMean
1957  Africa        336.00  Lesotho                 minGdpPercap
1957  Africa       5487.10  South Africa            maxGdpPercap
1957  Africa       1176.76                          trimMean
1957  Americas     1544.40  Dominican Republic      minGdpPercap
1957  Americas    14847.13  United States           maxGdpPercap
1957  Americas     4037.75                          trimMean
1957  Asia          350.00  Myanmar                 minGdpPercap
1957  Asia       113523.13  Kuwait                  maxGdpPercap

Now, we create the accompanying figure for the above table. For each continent, we plot the progression of the minimum, the maximum, and the 10% trimmed mean of the GDP per capita over the years in the data set. The maximum is plotted in blue, the minimum in pink, and the trimmed mean in green. A locally fitted (LOESS) smooth curve is drawn through the points for each statistic.

# Plot points and a loess smoothed fit
xyplot(gdpPercap ~ year | continent, groups = stat, data = minMaxGDP, grid = "h", 
    auto.key = TRUE, type = c("p", "smooth"))

[Figure: minimum, maximum, and 10% trimmed mean of GDP per capita versus year, one panel per continent, with LOESS fits]

Now, in contrast to the table we created, it is fairly easy to see in the plots how the different continents are faring in terms of GDP per capita. Overall, each continent shows growth in average GDP per capita, as we can see from the trimmed mean (green line).

We notice something funny about Asia: the country with the greatest GDP per capita starts with the highest global GDP per capita (which we identified in our previous assignment as Kuwait), but the maximum seems to fall steadily over the years until about 1990, when it starts a modest increasing trend. Note that the country with the highest GDP per capita in Asia was Kuwait in every year except 1982 (Saudi Arabia) and 2002 (Singapore).

Europe and the Americas show similar trends to each other, but countries in Europe are doing better on average than countries in the Americas. Africa seems to be the poorest of the continents, and its maximum GDP line also shows an interesting trend. It seems that some country (or countries) in Africa experienced relatively strong growth in GDP per capita between about 1967 and 1982, after which the maximum levelled off around 1990 and stayed roughly flat through 2007.

Absolute and relative world population in each of the continents

Inspired by Dean Attali's Homework 3 Report (Source)

We are interested in the populations of the continents relative to the total world population for each year. The original author could not aggregate the data with a simple summarise call in ddply and instead wrote a custom function. We do the same to generate a table. The first 20 rows are displayed below.

# Get the yearly populations (Note: Without Oceania)
yearlyPop <- ddply(gDat, ~year, summarise, population = sum(pop))

# Now calculate the relative populations by continent each year
relativePop <- ddply(gDat, ~continent + year, function(x) {
    population <- sum(x$pop)
    worldPop <- yearlyPop$population[which(unique(x$year) == yearlyPop$year)]
    percent <- population/worldPop
    return(data.frame(continent = x$continent[1], year = x$year[1], population, 
        percent))
})

continent  year  population  percent
Africa     1952   237640501    0.099
Africa     1957   264837738    0.100
Africa     1962   296516865    0.103
Africa     1967   335289489    0.105
Africa     1972   379879541    0.107
Africa     1977   433061021    0.111
Africa     1982   499348587    0.117
Africa     1987   574834110    0.123
Africa     1992   659081517    0.129
Africa     1997   743832984    0.135
Africa     2002   833723916    0.142
Africa     2007   929539692    0.149
Americas   1952   345152446    0.144
Americas   1957   386953916    0.146
Americas   1962   433270254    0.150
Americas   1967   480746623    0.150
Americas   1972   529384210    0.149
Americas   1977   578067699    0.148
Americas   1982   630290920    0.148
Americas   1987   682753971    0.146
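As an aside, the same share-of-total calculation can be sketched without a custom function by using base R's ave, which returns a group-wise total aligned with every row. This is an alternative technique, not the code used in this report, and the data frame toy with its values is made up for illustration:

```r
# Made-up populations for two continents over two years.
toy <- data.frame(
  continent  = c("A", "B", "A", "B"),
  year       = c(1952, 1952, 1957, 1957),
  population = c(100, 300, 150, 350)
)

# For every row, ave() gives the total population of that row's year.
toy$worldPop <- ave(toy$population, toy$year, FUN = sum)

# Each continent's share of that year's total.
toy$percent <- toy$population / toy$worldPop
toy
```

Within each year the shares sum to 1, which is a quick way to check that the totals were matched to the right rows.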

We ordered the table by continent, then year. We can see that the population is increasing over the years, but it is hard to judge from the table how quickly the population of Africa is growing relative to, say, the Americas. We rectify this by plotting the populations, relative to the world population, over time for each continent. To unclutter the plot, we facet it by continent. A second plot is generated to show the actual populations over time.

# Make another data set with the world
yearlyPop <- cbind(continent = "World", yearlyPop, percent = 1)
relativePop2 <- rbind(relativePop, yearlyPop)

# Plot the lattice
xyplot(percent ~ year | continent, data = relativePop, grid = "h", type = c("p", 
    "r"), auto.key = TRUE)

[Figure: share of world population versus year, one panel per continent, with linear fits]

xyplot(population ~ year, groups = continent, data = relativePop2, grid = "h", 
    type = c("p", "r"), auto.key = TRUE)

[Figure: population versus year for each continent and for the world, with linear fits]

Looking at the first plot, we can see that the relative populations of Asia and Africa have been on the rise since the 1950s. The Americas seem to be holding steady at around 15%, while Europe has dropped below 10% recently.

Looking at the second plot above, we see that the populations of all continents have been on the rise since the 1950s. It seems that the rate of growth in Asia and Africa is higher than in the other continents, as was reflected in the first plot.

Note that our numbers may not match exactly with Dean's original work. This is probably due to the removal of Oceania from our data set.

Finding countries that deviate from the trend when analyzing life expectancy vs. time

Inspired by Sean Jewell's Homework 3 Report (Source)

We now take a look at life expectancy by country, and we are interested in finding the countries with “strange” trends. We notice that if we try plotting a simple linear fit for the global life expectancy or the continental life expectancy, we do not see anything that particularly stands out (except maybe some countries with obviously low life expectancies).

# Plot global life expectancy
xyplot(lifeExp ~ year, data = gDat, type = c("p", "r"))

[Figure: life expectancy versus year for all countries, with a linear fit]


# Now by continent
xyplot(lifeExp ~ year | continent, data = gDat, type = c("p", "r"))

[Figure: life expectancy versus year, one panel per continent, with linear fits]

We fit a linear regression model of lifeExp versus year for each country. Then, to isolate countries that stand out more than others, we look for countries whose largest absolute residual from that fit is very high.

# Get the minimum year in the dataset
yearMin <- min(gDat$year)

# Regression function for life expectancy
lifeExpRegression <- function(x) {
    fit <- lm(lifeExp ~ I(year - yearMin), data = x)
    res <- max(abs(fit$residuals))
    names(res) <- c("maxAbsResid")
    return(res)
}

# Get all the residuals, sorted by Max residual
resids <- arrange(ddply(gDat, .(continent, country), lifeExpRegression), desc(maxAbsResid))
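To see why the largest absolute residual flags countries with sudden changes, here is a minimal sketch on made-up data. The series steady and shock and the helper maxAbsResid are assumptions for illustration; the helper mirrors the regression function above but is not the report's code:

```r
yearMin <- 1952
years <- seq(1952, 2007, by = 5)

# Made-up series: "steady" rises linearly, "shock" has a sudden drop in 1992.
steady <- data.frame(year = years, lifeExp = 40 + 0.3 * (years - yearMin))
shock <- steady
shock$lifeExp[shock$year == 1992] <- 25

# Largest absolute residual from a straight-line fit of lifeExp on year.
maxAbsResid <- function(x) {
  fit <- lm(lifeExp ~ I(year - yearMin), data = x)
  max(abs(resid(fit)))
}

c(steady = maxAbsResid(steady), shock = maxAbsResid(shock))
```

The perfectly linear series has a maximum residual of essentially zero, while the single drop in the second series cannot be absorbed by a straight line and shows up as a large residual.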

We show the 20 countries with the largest absolute residuals in a table to try and gather a quick intuition of the data. Save for Bulgaria, the top 20 unusual countries are located in Africa and Asia.

continent  country                   maxAbsResid
Africa     Rwanda                          17.31
Asia       Cambodia                        15.69
Africa     Swaziland                       12.00
Africa     Zimbabwe                        10.58
Africa     Lesotho                         10.04
Africa     Botswana                         9.33
Africa     South Africa                     9.31
Asia       China                            8.00
Africa     Namibia                          7.21
Africa     Gabon                            6.77
Asia       Iraq                             6.70
Africa     Kenya                            6.34
Europe     Bulgaria                         6.14
Africa     Zambia                           5.98
Africa     Cote d'Ivoire                    5.24
Africa     Central African Republic         5.24
Africa     Uganda                           5.17
Asia       Myanmar                          5.09
Africa     Congo, Rep.                      5.03
Asia       Korea, Dem. Rep.                 5.01

Next, we plot life expectancy versus time for these 20 countries with large residuals.

[Figure: life expectancy versus year for the 20 countries with the largest absolute residuals, one panel per country]

Note: the lattice panels follow the order of the table above, ordered from bottom to top, left to right, starting in the bottom left.
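The code for this figure is not shown in the report. One way such a plot might be produced with lattice is sketched below on made-up data; toy, topCountries, and plotDat are hypothetical names, and the values are illustrative only. Setting the factor levels to the table order is what makes the panels follow that order:

```r
library(lattice)

# Made-up toy data standing in for gDat; country "Q" has a sudden drop.
years <- seq(1952, 2007, by = 5)
toy <- data.frame(
  country = rep(c("P", "Q", "R"), each = length(years)),
  year    = rep(years, times = 3),
  lifeExp = c(40 + 0.30 * (years - 1952),
              50 + 0.20 * (years - 1952),
              45 + 0.25 * (years - 1952))
)
toy$lifeExp[toy$country == "Q" & toy$year == 1992] <- 30

# Hypothetical selection, e.g. country names taken from a residual table.
topCountries <- c("Q", "R")

# Keep only the selected countries and order the panels as in topCountries.
plotDat <- subset(toy, country %in% topCountries)
plotDat$country <- factor(plotDat$country, levels = topCountries)

# One panel per country, with points and a linear fit.
p <- xyplot(lifeExp ~ year | country, data = plotDat,
            type = c("p", "r"), grid = "h")
print(p)
```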

It appears that the majority of the cases displayed in the above plot are cases where, quite unfortunately, the life expectancy of a country suddenly drops for one reason or another. These can mostly be linked to war or genocide (e.g. Rwanda).

Conclusions

I found that it is much, much easier to find meaning in the data when using figures. It is infinitely harder to grasp an idea from a table display of the data. From my experience in the previous homework, the data aggregation results often contained hundreds of rows of data to digest at a time, resulting in my scrutinizing the computer screen for a long time.

I enjoyed recreating some of the data experiments by other students, and by re-writing the code, I found it very easy to understand and use thanks to the simple syntax of the plyr functions. It was also interesting to follow other people's ideas or hypotheses.

I have been trying to learn the ggplot2 package recently, and I found it hard to create plots as I wanted them using lattice functions. I miss the layering ability in ggplot2, and did not understand how to create panel functions in lattice, though it is probably due to my lack of familiarity with lattice. However, I did find some tasks much simpler with lattice, such as adding a regression line or a LOESS smoothed curve to a scatter plot with a simple type argument. All in all, I believe that learning both packages will be beneficial.

For the code used to generate this report, click here