We will be taking another look at the gapminder data set. In the previous homework, we carried out a set of data aggregation tasks to explore the data set. In this homework, we will take a look at a few data aggregation results and generate one or more accompanying figures.
We load the data and perform a subset operation to remove the Oceania continent, as recommended by Jenny. A quick sanity check is done by checking the structure of the data set with str before (not shown) and after (shown below) the subset operation.
str(gDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
This task is from Jenny's menu of data aggregation tasks (link to task). We alter the task very slightly by including the 10% trimmed mean in addition to the minimum and maximum GDP per capita.
We begin by using ddply to aggregate the data. We use a custom function in ddply to retrieve the countries with the minimum and maximum GDP per capita for each year.
# Min, max, and 10% trimmed mean of GDP per capita by year and continent
minMaxGDP <- ddply(gDat, ~year + continent, .drop = FALSE, function(x) {
minCountry = as.character(x$country[which.min(x$gdpPercap)])
minGdp = min(x$gdpPercap)
maxCountry = as.character(x$country[which.max(x$gdpPercap)])
maxGdp = max(x$gdpPercap)
trimMean = mean(x$gdpPercap, trim = 0.1)
return(data.frame(gdpPercap = c(minGdp, maxGdp, trimMean), country = c(minCountry,
maxCountry, ""), stat = c("minGdpPercap", "maxGdpPercap", "trimMean")))
})
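To make the two less obvious pieces of the function above concrete, here is a minimal base R sketch on made-up numbers. The vector `gdp` and the `country` labels are hypothetical and are not gapminder values. It shows how `which.min` and `which.max` are used to look up the matching label, and what `trim = 0.1` does to the mean.

```r
# Toy GDP-like values for ten hypothetical countries (not real data)
gdp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
country <- paste0("country", LETTERS[1:10])

# which.min / which.max return the position of the extreme value,
# which we use to look up the matching country name
minCountry <- country[which.min(gdp)]
maxCountry <- country[which.max(gdp)]

# trim = 0.1 drops the lowest 10% and the highest 10% of the sorted values
# (here, one value from each end) before averaging
plainMean <- mean(gdp)
trimMean <- mean(gdp, trim = 0.1)

print(c(plain = plainMean, trimmed = trimMean))
```

With these numbers the plain mean is 14.5, while the trimmed mean is 5.5 because the outlier of 100 (and the smallest value, 1) is discarded. This is why a trimmed mean is less sensitive to an extreme value such as Kuwait's than the ordinary mean would be.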
We display the aggregated data in a “long” table below. To save some space, we display only the first 20 rows of the 144 rows in the table. It is (at least for me) very hard to extract interesting information from this table.
| year | continent | gdpPercap | country | stat |
|---|---|---|---|---|
| 1952 | Africa | 298.85 | Lesotho | minGdpPercap |
| 1952 | Africa | 4725.30 | South Africa | maxGdpPercap |
| 1952 | Africa | 1085.16 | | trimMean |
| 1952 | Americas | 1397.72 | Dominican Republic | minGdpPercap |
| 1952 | Americas | 13990.48 | United States | maxGdpPercap |
| 1952 | Americas | 3494.33 | | trimMean |
| 1952 | Asia | 331.00 | Myanmar | minGdpPercap |
| 1952 | Asia | 108382.35 | Kuwait | maxGdpPercap |
| 1952 | Asia | 1690.45 | | trimMean |
| 1952 | Europe | 973.53 | Bosnia and Herzegovina | minGdpPercap |
| 1952 | Europe | 14734.23 | Switzerland | maxGdpPercap |
| 1952 | Europe | 5436.62 | | trimMean |
| 1957 | Africa | 336.00 | Lesotho | minGdpPercap |
| 1957 | Africa | 5487.10 | South Africa | maxGdpPercap |
| 1957 | Africa | 1176.76 | | trimMean |
| 1957 | Americas | 1544.40 | Dominican Republic | minGdpPercap |
| 1957 | Americas | 14847.13 | United States | maxGdpPercap |
| 1957 | Americas | 4037.75 | | trimMean |
| 1957 | Asia | 350.00 | Myanmar | minGdpPercap |
| 1957 | Asia | 113523.13 | Kuwait | maxGdpPercap |
Now, we create the accompanying figure for the above table. For each continent, we plot the progression of the minimum, the maximum, and the 10% trimmed mean of the GDP per capita over the years in the data set. The max is plotted in blue, the minimum is plotted in pink, and the trimmed mean is plotted in green. A locally fitted (LOESS) regression line is plotted over the points in the data.
# Plot points and a loess smoothed fit
xyplot(gdpPercap ~ year | continent, group = stat, data = minMaxGDP, grid = "h",
auto.key = TRUE, type = c("p", "smooth"))
Now, in contrast to the table we created, it is fairly easy to see in the plots how the different continents are faring in terms of GDP per capita. Overall, each continent shows growth in average GDP per capita, as we can see from the trimmed mean (green line).
We notice something odd about Asia: the maximum GDP per capita starts out as the highest in the world (held by Kuwait, as we identified in our previous assignment), but seems to fall steadily over the years until about 1990, when it begins a modest upward trend. Note that the country with the highest GDP per capita in Asia was Kuwait in every year except 1982 (Saudi Arabia) and 2002 (Singapore).
Europe and the Americas show similar trends, but countries in Europe are doing better on average than countries in the Americas. Africa seems to be the poorest of the continents, and its maximum GDP per capita line also shows an interesting trend. It seems that some country (or countries) in Africa experienced relatively strong growth in GDP per capita between about 1967 and 1982, but the maximum then levelled off, settling by 1990 into a flat trend that continued through 2007.
We are interested in the populations of the continents relative to the total world population for each year. The original author could not aggregate the data in a simple summarize command in the ddply function and created a custom function to aggregate the data. We do the same to generate a table. The first 20 rows are displayed below.
# Get the yearly populations (Note: Without Oceania)
yearlyPop <- ddply(gDat, ~year, summarise, population = sum(pop))
# Now calculate the relative populations by continent each year
relativePop <- ddply(gDat, ~continent + year, function(x) {
population <- sum(x$pop)
worldPop <- yearlyPop$population[which(unique(x$year) == yearlyPop$year)]
percent <- population/worldPop
return(data.frame(continent = x$continent[1], year = x$year[1], population,
percent))
})
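As an aside, the same relative populations can probably be computed without a custom function. The sketch below uses base R's `aggregate` and `ave` on a small made-up data frame instead of plyr. The data frame `toy`, its values, and the name `contPop` are hypothetical, so this is an alternative illustration and not the code used for this report.

```r
# Hypothetical toy data: two continents, two countries each, two years
toy <- data.frame(
  continent = rep(c("A", "A", "B", "B"), times = 2),
  country   = rep(c("a1", "a2", "b1", "b2"), times = 2),
  year      = rep(c(1952, 1957), each = 4),
  pop       = c(10, 30, 20, 40, 20, 40, 30, 110)
)

# Total population per continent and year
contPop <- aggregate(pop ~ continent + year, data = toy, FUN = sum)
names(contPop)[names(contPop) == "pop"] <- "population"

# ave() returns the yearly total repeated on every row of that year,
# so dividing gives each continent's share of that year's total
contPop$percent <- contPop$population /
  ave(contPop$population, contPop$year, FUN = sum)

print(contPop)
```

A quick sanity check on a table like this is that the shares within each year sum to 1.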
| continent | year | population | percent |
|---|---|---|---|
| Africa | 1952 | 237640501 | 0.099 |
| Africa | 1957 | 264837738 | 0.100 |
| Africa | 1962 | 296516865 | 0.103 |
| Africa | 1967 | 335289489 | 0.105 |
| Africa | 1972 | 379879541 | 0.107 |
| Africa | 1977 | 433061021 | 0.111 |
| Africa | 1982 | 499348587 | 0.117 |
| Africa | 1987 | 574834110 | 0.123 |
| Africa | 1992 | 659081517 | 0.129 |
| Africa | 1997 | 743832984 | 0.135 |
| Africa | 2002 | 833723916 | 0.142 |
| Africa | 2007 | 929539692 | 0.149 |
| Americas | 1952 | 345152446 | 0.144 |
| Americas | 1957 | 386953916 | 0.146 |
| Americas | 1962 | 433270254 | 0.150 |
| Americas | 1967 | 480746623 | 0.150 |
| Americas | 1972 | 529384210 | 0.149 |
| Americas | 1977 | 578067699 | 0.148 |
| Americas | 1982 | 630290920 | 0.148 |
| Americas | 1987 | 682753971 | 0.146 |
We ordered the table by continent, then year. We can see that the population is increasing over the years, but it is hard to compare, relatively, how the population of Africa is increasing compared to, say, the Americas. We rectify this by plotting the populations, relative to the world population, over time for each continent. To unclutter the plot, we facet the plots by continent. A second plot is generated to show the actual populations over time.
# Make another data set with the world
yearlyPop <- cbind(continent = "World", yearlyPop, percent = 1)
relativePop2 <- rbind(relativePop, yearlyPop)
# Plot the lattice
xyplot(percent ~ year | continent, data = relativePop, grid = "h", type = c("p",
"r"), auto.key = TRUE)
xyplot(population ~ year, group = continent, data = relativePop2, grid = "h",
type = c("p", "r"), auto.key = TRUE)
Looking at the first plot, we can see that the relative populations of Asia and Africa have been on the rise since the 1950s. The Americas seem to be holding steady at around 15%, while Europe has dropped below 10% recently.
Looking at the second plot above, we see that the populations in every continent have been on the rise since the 1950s. It seems that the rate of growth in Asia and Africa is higher than the other continents, as was reflected in the first plot.
Note that our numbers may not match Dean's original work exactly. This is probably due to the removal of Oceania from our data set.
We now take a look at life expectancy by country, and we are interested in finding the countries with “strange” trends. We notice that if we try plotting a simple linear fit for the global life expectancy or the continental life expectancy, we do not see anything that particularly stands out (except maybe some countries with obviously low life expectancies).
# Plot global life expectancy
xyplot(lifeExp ~ year, data = gDat, type = c("p", "r"))
# Now by continent
xyplot(lifeExp ~ year | continent, data = gDat, type = c("p", "r"))
We try fitting a linear regression model of lifeExp versus year for each country. Then, to isolate countries that may stand out more than others, we look for those whose largest absolute residual from the fit is very high.
# Get the minimum year in the dataset
yearMin <- min(gDat$year)
# Regression function for life expectancy
lifeExpRegression <- function(x) {
fit <- lm(lifeExp ~ I(year - yearMin), data = x)
res <- max(abs(fit$residuals))
names(res) <- c("maxAbsResid")
return(res)
}
# Get all the residuals, sorted by Max residual
resids <- arrange(ddply(gDat, .(continent, country), lifeExpRegression), desc(maxAbsResid))
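To see why the largest absolute residual is a reasonable flag for "strange" trends, here is a small sketch on made-up data. The two countries, their life expectancies, and the helper name `maxAbsResid` are all hypothetical. One series is perfectly linear and the other is identical except for a sudden 15-year drop in 1992.

```r
# Hypothetical life expectancy series for two made-up countries
year <- seq(1952, 2007, by = 5)
steady <- data.frame(country = "Steadyland", year = year,
                     lifeExp = 40 + 0.5 * (year - 1952))
shock <- steady
shock$country <- "Shockland"
shock$lifeExp[shock$year == 1992] <- shock$lifeExp[shock$year == 1992] - 15

# Same idea as lifeExpRegression: largest absolute residual of a linear fit
maxAbsResid <- function(x) {
  fit <- lm(lifeExp ~ I(year - min(year)), data = x)
  max(abs(resid(fit)))
}

toyResids <- sapply(list(Steadyland = steady, Shockland = shock), maxAbsResid)
print(round(toyResids, 2))
```

The linear series has a maximum residual of essentially zero, while the series with the drop has a large one. The value is somewhat smaller than the 15-year drop itself, because the fitted line is pulled toward the outlying point.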
We show the 20 countries with the largest absolute residuals in a table to try and gather a quick intuition of the data. Save for Bulgaria, the top 20 unusual countries are located in Africa and Asia.
| continent | country | maxAbsResid |
|---|---|---|
| Africa | Rwanda | 17.31 |
| Asia | Cambodia | 15.69 |
| Africa | Swaziland | 12.00 |
| Africa | Zimbabwe | 10.58 |
| Africa | Lesotho | 10.04 |
| Africa | Botswana | 9.33 |
| Africa | South Africa | 9.31 |
| Asia | China | 8.00 |
| Africa | Namibia | 7.21 |
| Africa | Gabon | 6.77 |
| Asia | Iraq | 6.70 |
| Africa | Kenya | 6.34 |
| Europe | Bulgaria | 6.14 |
| Africa | Zambia | 5.98 |
| Africa | Cote d'Ivoire | 5.24 |
| Africa | Central African Republic | 5.24 |
| Africa | Uganda | 5.17 |
| Asia | Myanmar | 5.09 |
| Africa | Congo, Rep. | 5.03 |
| Asia | Korea, Dem. Rep. | 5.01 |
Next, we plot life expectancy versus time for these 20 countries with large residuals.
Note: the lattice panels follow the order of the table above, ordered from bottom to top, left to right, starting in the bottom left.
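The plotting code for this figure is not reproduced in the report, so the following is only a sketch of one way such a panel ordering could be obtained, assuming the lattice package is available. It uses made-up data: `toyDat`, `ranking`, and the country names are hypothetical stand-ins for the full data set and the sorted residual table. The key step is setting the factor levels to the ranked order, since lattice lays out panels in factor-level order.

```r
library(lattice)

# Hypothetical stand-ins for the full data and the sorted residual table
year <- seq(1952, 2007, by = 5)
toyDat <- data.frame(
  country = rep(c("Aland", "Bland", "Cland"), each = length(year)),
  year    = rep(year, times = 3),
  lifeExp = c(40 + 0.5 * (year - 1952),
              50 + 0.3 * (year - 1952),
              45 + 0.4 * (year - 1952))
)
ranking <- c("Cland", "Aland")  # pretend these have the largest residuals

# Keep only the ranked countries and fix the factor level order
topDat <- subset(toyDat, country %in% ranking)
topDat$country <- factor(topDat$country, levels = ranking)

# One panel per country; with the default as.table = FALSE the first
# level is drawn in the bottom-left panel
p <- xyplot(lifeExp ~ year | country, data = topDat,
            type = c("p", "r"), grid = "h")
print(p)
```

Passing `as.table = TRUE` to `xyplot` would instead start the panels in the top left, which may be easier to read against a table.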
It appears that the majority of the cases displayed in the above plot are ones where, unfortunately, the life expectancy of a country suddenly drops for one reason or another. These drops can mostly be linked to war or genocide (e.g. Rwanda).
I found that it is much, much easier to find meaning in the data when using figures. It is infinitely harder to grasp an idea from a table display of the data. From my experience in the previous homework, the data aggregation results often contained hundreds of rows of data to digest at a time, resulting in my scrutinizing the computer screen for a long time.
I enjoyed recreating some of the data experiments by other students, and by re-writing the code, I found it very easy to understand and use thanks to the simple syntax of the plyr functions. It was also interesting to follow other people's ideas or hypotheses.
I have been trying to learn the ggplot2 package recently, and I found it hard to create plots the way I wanted them using lattice functions. I miss the layering ability in ggplot2, and I did not understand how to create panel functions in lattice, though this is probably due to my lack of familiarity with lattice. However, I did find some tasks much simpler with lattice, such as adding a regression line or a LOESS smoothed regression line to a scatter plot with a simple type= call. All in all, I believe that learning both packages will be beneficial.
For the code used to generate this report, click here