STAT 545A Homework 4

Sep 29 2013

The Data
Figures for HW3 Data Aggregation Tasks
- Depict the maximum and minimum of GDP per capita for all continents
- Absolute and relative world population in each of the continents
- Finding countries that deviate from the trend when analyzing life expectancy vs. time
Conclusions

The Data

We will be taking another look at the gapminder data set. In the previous homework, we carried out a set of data aggregation tasks to explore the data set. In this homework, we will take a look at a few data aggregation results and generate one or more accompanying figures.

We load the data and perform a subset operation to remove the Oceania continent, as recommended by Jenny. A quick sanity check is done by checking the structure of the data set with str before (not shown) and after (shown below) the subset operation.

str(gDat)

## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
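The loading and subsetting code is not shown in this report. As a minimal sketch of the subset step only, the toy data frame below (made-up rows, hypothetical name `toy`) assumes the unused factor levels are removed with `droplevels()`, which would be consistent with the 4 continent levels reported by `str` above.

```r
# Toy stand-in for the gapminder data (rows are made up for illustration)
toy <- data.frame(
  country   = factor(c("Canada", "Canada", "Australia", "Australia", "Japan")),
  continent = factor(c("Americas", "Americas", "Oceania", "Oceania", "Asia")),
  year      = c(1952L, 1957L, 1952L, 1957L, 1952L)
)

# Keep every row that is not in Oceania, then drop the now-unused factor levels
toySub <- droplevels(subset(toy, continent != "Oceania"))

# Sanity check, in the same spirit as the str() call in the report
str(toySub)
```

Without `droplevels()`, "Oceania" would remain as an empty factor level and could show up as an empty panel or group in later plots.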

Figures for HW3 Data Aggregation Tasks

Depict the maximum and minimum of GDP per capita for all continents

This task is from Jenny's menu of data aggregation tasks (link to task). We alter the task very slightly by including the 10% trimmed mean in addition to the minimum and maximum GDP per capita.
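To illustrate why a trimmed mean is a useful companion to the minimum and maximum, here is a small sketch on made-up numbers (not gapminder values). With `trim = 0.1` and ten observations, `mean` discards one value from each end before averaging, so a single extreme observation no longer dominates the result.

```r
# Ten made-up values with one extreme outlier
x <- c(1:9, 1000)

plainMean   <- mean(x)              # pulled up by the outlier
trimmedMean <- mean(x, trim = 0.1)  # drops floor(10 * 0.1) = 1 value from each end

c(plain = plainMean, trimmed = trimmedMean)
```

Here the plain mean is 104.5 while the trimmed mean is 5.5, the average of the values 2 through 9.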

We begin by using ddply to aggregate the data. A custom function passed to ddply retrieves, for each year and continent, the countries with the minimum and maximum GDP per capita, along with the 10% trimmed mean.

library(plyr)  # provides ddply and arrange

# Min/max GDP per capita and 10% trimmed mean, by year and continent
minMaxGDP <- ddply(gDat, ~year + continent, .drop = FALSE, function(x) {
    minCountry = as.character(x$country[which.min(x$gdpPercap)])
    minGdp = min(x$gdpPercap)
    maxCountry = as.character(x$country[which.max(x$gdpPercap)])
    maxGdp = max(x$gdpPercap)
    trimMean = mean(x$gdpPercap, trim = 0.1)
    return(data.frame(gdpPercap = c(minGdp, maxGdp, trimMean), country = c(minCountry, 
        maxCountry, ""), stat = c("minGdpPercap", "maxGdpPercap", "trimMean")))
})
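For readers less familiar with plyr, the split-apply-combine idea behind the ddply call above can be sketched in base R. This is an illustration on a made-up toy data frame, not the code used for the report; the names `toy`, `pieces`, `summaries` and `result` are hypothetical, and the sketch drops empty year-by-continent combinations rather than keeping them.

```r
# Toy data (made-up values) with the same column names as gDat
toy <- data.frame(
  year      = c(1952, 1952, 1952, 1952, 1957, 1957),
  continent = c("Asia", "Asia", "Europe", "Europe", "Asia", "Asia"),
  country   = c("A", "B", "C", "D", "A", "B"),
  gdpPercap = c(300, 900, 1000, 5000, 350, 950),
  stringsAsFactors = FALSE
)

# Split: one piece per year-by-continent combination that actually occurs
pieces <- split(toy, list(toy$year, toy$continent), drop = TRUE)

# Apply: summarise each piece as a one-row data frame
summaries <- lapply(pieces, function(x) {
  data.frame(
    year       = x$year[1],
    continent  = x$continent[1],
    minCountry = x$country[which.min(x$gdpPercap)],
    maxCountry = x$country[which.max(x$gdpPercap)],
    stringsAsFactors = FALSE
  )
})

# Combine: stack the per-group results back into one data frame
result <- do.call(rbind, summaries)
rownames(result) <- NULL
result
```

ddply performs these three steps in a single call and attaches the grouping columns automatically, which is why the version above is so much shorter.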

We display the aggregated data in a “long” table below. To save space, we show only the first 20 of its 144 rows. It is (at least for me) very hard to extract interesting information from this table.

year	continent	gdpPercap	country	stat
1952	Africa	298.85	Lesotho	minGdpPercap
1952	Africa	4725.30	South Africa	maxGdpPercap
1952	Africa	1085.16		trimMean
1952	Americas	1397.72	Dominican Republic	minGdpPercap
1952	Americas	13990.48	United States	maxGdpPercap
1952	Americas	3494.33		trimMean
1952	Asia	331.00	Myanmar	minGdpPercap
1952	Asia	108382.35	Kuwait	maxGdpPercap
1952	Asia	1690.45		trimMean
1952	Europe	973.53	Bosnia and Herzegovina	minGdpPercap
1952	Europe	14734.23	Switzerland	maxGdpPercap
1952	Europe	5436.62		trimMean
1957	Africa	336.00	Lesotho	minGdpPercap
1957	Africa	5487.10	South Africa	maxGdpPercap
1957	Africa	1176.76		trimMean
1957	Americas	1544.40	Dominican Republic	minGdpPercap
1957	Americas	14847.13	United States	maxGdpPercap
1957	Americas	4037.75		trimMean
1957	Asia	350.00	Myanmar	minGdpPercap
1957	Asia	113523.13	Kuwait	maxGdpPercap

Now we create the accompanying figure for the above table. For each continent, we plot the minimum, the maximum, and the 10% trimmed mean of GDP per capita over the years in the data set. The maximum is plotted in blue, the minimum in pink, and the trimmed mean in green. A locally fitted (LOESS) smooth curve is drawn through the points of each statistic.

# Plot points and a loess smoothed fit
library(lattice)  # provides xyplot

xyplot(gdpPercap ~ year | continent, groups = stat, data = minMaxGDP, grid = "h", 
    auto.key = TRUE, type = c("p", "smooth"))

plot of chunk plot1-gdpMinMax

In contrast to the table, the plots make it fairly easy to see how the different continents are faring in terms of GDP per capita. Overall, each continent shows growth in average GDP per capita, as we can see from the trimmed mean (green line).

We notice something odd about Asia: its maximum starts out as the highest GDP per capita in the world (Kuwait, as we identified in our previous assignment), falls steadily until about 1990, and then begins a modest upward trend. Note that the country with the highest GDP per capita in Asia was Kuwait in every year except 1982 (Saudi Arabia) and 2002 (Singapore).

Europe and the Americas show similar trends, but countries in Europe are doing better on average than countries in the Americas. Africa seems to be the poorest of the continents, and its maximum GDP per capita line also shows an interesting pattern: some country (or countries) in Africa experienced relatively strong growth in GDP per capita between about 1967 and 1982, after which the maximum levelled off around 1990 and stayed roughly flat through 2007.

Absolute and relative world population in each of the continents

Inspired by Dean Attali's Homework 3 Report (Source)

We are interested in the population of each continent relative to the total world population in each year. The original author could not express this aggregation as a simple summarise call in ddply, and wrote a custom function instead. We do the same to generate a table. The first 20 rows are displayed below.

# Get the yearly populations (Note: Without Oceania)
yearlyPop <- ddply(gDat, ~year, summarise, population = sum(pop))

# Now calculate the relative populations by continent each year
relativePop <- ddply(gDat, ~continent + year, function(x) {
    population <- sum(x$pop)
    worldPop <- yearlyPop$population[which(unique(x$year) == yearlyPop$year)]
    percent <- population/worldPop
    return(data.frame(continent = x$continent[1], year = x$year[1], population, 
        percent))
})

continent	year	population	percent
Africa	1952	237640501	0.099
Africa	1957	264837738	0.100
Africa	1962	296516865	0.103
Africa	1967	335289489	0.105
Africa	1972	379879541	0.107
Africa	1977	433061021	0.111
Africa	1982	499348587	0.117
Africa	1987	574834110	0.123
Africa	1992	659081517	0.129
Africa	1997	743832984	0.135
Africa	2002	833723916	0.142
Africa	2007	929539692	0.149
Americas	1952	345152446	0.144
Americas	1957	386953916	0.146
Americas	1962	433270254	0.150
Americas	1967	480746623	0.150
Americas	1972	529384210	0.149
Americas	1977	578067699	0.148
Americas	1982	630290920	0.148
Americas	1987	682753971	0.146

We ordered the table by continent, then year. We can see that the population is increasing over the years, but it is hard to judge from the table how quickly the population of Africa is growing relative to, say, the Americas. We rectify this by plotting each continent's population, as a fraction of the world population, over time. To unclutter the plot, we facet by continent. A second plot shows the absolute populations over time.

# Make another data set with the world
yearlyPop <- cbind(continent = "World", yearlyPop, percent = 1)
relativePop2 <- rbind(relativePop, yearlyPop)

# Plot the lattice
xyplot(percent ~ year | continent, data = relativePop, grid = "h", type = c("p", 
    "r"), auto.key = TRUE)

plot of chunk relativePop-plot

xyplot(population ~ year, groups = continent, data = relativePop2, grid = "h", 
    type = c("p", "r"), auto.key = TRUE)

plot of chunk relativePop-plot

Looking at the first plot, we can see that the relative populations of Asia and Africa have been on the rise since the 1950s. The Americas seem to be holding steady at around 15%, while Europe has dropped below 10% recently.

Looking at the second plot above, we see that the population of every continent has been on the rise since the 1950s. The rate of growth in Asia and Africa appears to be higher than in the other continents, as was reflected in the first plot.

Note that our numbers may not match Dean's original work exactly. This is probably due to the removal of Oceania from our data set.

Finding countries that deviate from the trend when analyzing life expectancy vs. time

Inspired by Sean Jewell's Homework 3 Report (Source)

We now take a look at life expectancy by country, and we are interested in finding the countries with “strange” trends. If we plot a simple linear fit for the global life expectancy or the continental life expectancy, nothing in particular stands out (except maybe some countries with obviously low life expectancies).

# Plot global life expectancy
xyplot(lifeExp ~ year, data = gDat, type = c("p", "r"))

plot of chunk lifeExpData-plot-global


# Now by continent
xyplot(lifeExp ~ year | continent, data = gDat, type = c("p", "r"))

plot of chunk lifeExpData-plot-global

We fit a linear regression model of lifeExp versus year for each country. Then, to isolate the countries that stand out more than others, we look for those whose largest absolute residual from the fit is very high.

# Get the minimum year in the dataset
yearMin <- min(gDat$year)

# Regression function for life expectancy
lifeExpRegression <- function(x) {
    fit <- lm(lifeExp ~ I(year - yearMin), data = x)
    res <- max(abs(fit$residuals))
    names(res) <- c("maxAbsResid")
    return(res)
}

# Get all the residuals, sorted by Max residual
resids <- arrange(ddply(gDat, .(continent, country), lifeExpRegression), desc(maxAbsResid))
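To see why the maximum absolute residual is a reasonable flag for unusual trends, consider the following sketch on two made-up series (hypothetical names `steady` and `shock`; the helper `maxAbsResid` is written only for this illustration). A series that grows linearly has essentially zero residuals, while a single sudden drop produces one large residual.

```r
# Two made-up life expectancy series observed every five years
years  <- seq(1952, 2007, by = 5)
steady <- 40 + 0.3 * (years - 1952)                # perfectly linear growth
shock  <- steady
shock[years == 1992] <- shock[years == 1992] - 15  # one sudden drop

# Largest absolute residual from a straight-line fit
maxAbsResid <- function(lifeExp, year) {
  fit <- lm(lifeExp ~ I(year - min(year)))
  max(abs(resid(fit)))
}

c(steady = maxAbsResid(steady, years), shock = maxAbsResid(shock, years))
```

The residual for the shocked series is somewhat smaller than the 15-year drop itself, because the fitted line is pulled slightly toward the outlying point.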

We show the 20 countries with the largest absolute residuals in a table to try and gather a quick intuition of the data. Save for Bulgaria, the top 20 unusual countries are located in Africa and Asia.

continent	country	maxAbsResid
Africa	Rwanda	17.31
Asia	Cambodia	15.69
Africa	Swaziland	12.00
Africa	Zimbabwe	10.58
Africa	Lesotho	10.04
Africa	Botswana	9.33
Africa	South Africa	9.31
Asia	China	8.00
Africa	Namibia	7.21
Africa	Gabon	6.77
Asia	Iraq	6.70
Africa	Kenya	6.34
Europe	Bulgaria	6.14
Africa	Zambia	5.98
Africa	Cote d'Ivoire	5.24
Africa	Central African Republic	5.24
Africa	Uganda	5.17
Asia	Myanmar	5.09
Africa	Congo, Rep.	5.03
Asia	Korea, Dem. Rep.	5.01

Next, we plot life expectancy versus time for these 20 countries with large residuals.

plot of chunk lifeExp-plot

Note: the lattice panels follow the order of the table above, ordered from bottom to top, left to right, starting in the bottom left.
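The code for this figure is not shown in the report. One way such a plot could be produced is sketched below on made-up toy data; the names `toyResids`, `toyDat`, `topN` and `topDat` are hypothetical, and the approach (reordering the factor levels to control panel order) is an assumption rather than the author's actual code.

```r
# Toy stand-ins (made-up values): a ranking table and the full data
toyResids <- data.frame(country = c("B", "C", "A"), maxAbsResid = c(9, 5, 1),
                        stringsAsFactors = FALSE)
toyDat <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 3),
  lifeExp = c(40, 41, 42, 50, 35, 52, 60, 58, 62)
)

# Keep only the countries with the largest residuals
topN <- 2
topCountries <- head(toyResids$country, topN)
topDat <- subset(toyDat, country %in% topCountries)

# Factor levels set the panel order; by default lattice fills panels
# starting from the bottom left
topDat$country <- factor(topDat$country, levels = topCountries)

# Build the plot only if lattice is installed; print(p) would draw it
if (requireNamespace("lattice", quietly = TRUE)) {
  p <- lattice::xyplot(lifeExp ~ year | country, data = topDat,
                       type = c("p", "r"))
}
```

Passing `as.table = TRUE` to `xyplot` would instead fill the panels from the top left, which some readers find easier to match against a table.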

It appears that in the majority of the cases displayed in the above plot, the life expectancy of a country suddenly drops for one reason or another. Unfortunately, these drops can mostly be linked to war or genocide (e.g. Rwanda).

Conclusions

I found that it is much, much easier to find meaning in the data when using figures. It is infinitely harder to grasp an idea from a table display of the data. From my experience in the previous homework, the data aggregation results often contained hundreds of rows of data to digest at a time, resulting in my scrutinizing the computer screen for a long time.

I enjoyed recreating some of the data experiments by other students. When re-writing their code, I found it very easy to understand and use, thanks to the simple syntax of the plyr functions. It was also interesting to follow other people's ideas or hypotheses.

I have been trying to learn the ggplot2 package recently, and I found it hard to create plots the way I wanted them using lattice functions. I miss the layering ability of ggplot2 and did not understand how to create panel functions in lattice, though this is probably due to my lack of familiarity with lattice. However, I did find some tasks much simpler with lattice, such as adding a regression line or a LOESS smoothed curve to a scatter plot with a simple type= argument. All in all, I believe that learning both packages will be beneficial.

For the code used to generate this report, click here