stat545a-2013-hw04_lee-woo

Presets:

Sys.setenv(lang = "EN")
library("lattice")

The following work is based on the work of Christian Okkels: code | report

All that I write here will appear within the quote box as this sentence itself. My work will appear at the end of each section.

Introduction & loading of libraries and data

This homework assignment considers various data aggregation tasks in R using the plyr library. The data are those of the Gapminder project (available here).

First, we load the necessary libraries:

## library(lattice) # not needed here, because we aren't doing plots.
library(plyr)
library(xtable)

Then we import the Gapminder data into the data.frame gDat:

commented the first line since I don't copy it to my directory.

# gDat <- read.delim('gapminderDataFiveYear.txt')
gDat <- read.delim("http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")

A quick check that the data import went well is done by use of the str() function:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

From the results we see that we have to do with a data.frame with 1704 observations of 6 variables. The variables are: country, year, pop (population), continent, lifeExp (life expectancy), and gdpPercap (GDP per capita). The types of the variables and their contents seem to make sense.

Now we begin the data aggregation, in which we consider the following tasks:

Maximum and minimum GDP per capita for all continents (“wide format”)

In this task, we construct a data.frame with 3 variables: the continent, the minimum GDP per capita for that continent (among all the countries in it), and the maximum GDP per capita. This is achieved in plyr by use of the ddply() function:

maxminGDPperCapByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap), 
    maxGDPperCap = max(gdpPercap))
str(maxminGDPperCapByCont)
## 'data.frame':    5 obs. of  3 variables:
##  $ continent   : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ minGDPperCap: num  241 1202 331 974 10040
##  $ maxGDPperCap: num  21951 42952 113523 49357 34435

From the last command we see that the saved quantity maxminGDPperCapByCont is a data.frame itself. However, it only consists of 5 observations (corresponding to the 5 continents) of 3 variables (continent, minimum GDP per capita, and maximum GDP per capita). As it is such a short data.frama, we can as well show the full size. This can be done in a nice table as follows:

tableData <- xtable(maxminGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)
continent minGDPperCap maxGDPperCap
Africa 241.17 21951.21
Americas 1201.64 42951.65
Asia 331.00 113523.13
Europe 973.53 49357.19
Oceania 10039.60 34435.37

Right now, these results are sorted alphabetically according the continent. We can sort on e.g. the maximum GDP per capita using the arrange() function:

tableData <- arrange(tableData, maxGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)
continent minGDPperCap maxGDPperCap
Africa 241.17 21951.21
Oceania 10039.60 34435.37
Americas 1201.64 42951.65
Europe 973.53 49357.19
Asia 331.00 113523.13

First, we notice that Asia is far ahead in the lead when it comes to the maximum GDP per capita, with a value more than twice as large as the second place. After Asia comes Europe and the Americas, which are very close. Then we have Oceania, and finally Asia. It is interesting to see that the continents rank very differently with respect to the maximum GDP per capita than they would have for the minimum GDP per capita. For instance, Asia–although ranking in first place for the maximum–still has the second lowest minimum GDP per capita. This indicates a very large spread across the continent's countries. The situation is less extreme for Europe and the Americas; these ranked 2nd and 3rd, respectively, with respect to the maximum, and they would rank 3rd and 2nd, respectively, if sorted by the minimum GDP per capita. Another rather extreme case is Oceania, which has the second lowest maximum GDP per capita. However, it springs to the eye that this is the continent with the largest minimum GDP per capita, with a value 8-10 times greater than the 2nd and 3rd place. Finally, there is Asia, which preserves its ranking; it has both the lowest maximum and the lowest minimum GDP per capita.

Here I provide a strip plot where continents are ordered by the difference between the minimum and the maximum of GDP per capita:

maxminGDPperCapByCont.Reorder = maxminGDPperCapByCont[order(maxminGDPperCapByCont$maxGDPperCap - 
    maxminGDPperCapByCont$minGDPperCap), ]
maxminGDPperCapByCont.Reorder = within(maxminGDPperCapByCont.Reorder, continent <- reorder(continent, 
    maxGDPperCap - minGDPperCap))

stripplot(maxGDPperCap + minGDPperCap ~ continent, maxminGDPperCapByCont.Reorder, 
    type = "l", auto.key = TRUE)

plot of chunk unnamed-chunk-7

Now it is clear that the difference between maximum and minimum is more affected by maximum than minimum.

The spread of GDP per capita within the continents

In this task we consider the spread of the GDP per capita within the continents. The different measures of spread are:

We now create a data.frame containing these measures of the spread of the GDP per capita for each continent. This is again done by means of ddply(), where we just compute even more things than above.

spreadGDPperCapByCont <- ddply(gDat, ~continent, summarize, sdGDPperCap = sd(gdpPercap), 
    varGDPperCap = var(gdpPercap), madGDPperCap = mad(gdpPercap), iqrGDPperCap = IQR(gdpPercap))
str(spreadGDPperCapByCont)
## 'data.frame':    5 obs. of  5 variables:
##  $ continent   : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ sdGDPperCap : num  2828 6397 14045 9355 6359
##  $ varGDPperCap: num  8.00e+06 4.09e+07 1.97e+08 8.75e+07 4.04e+07
##  $ madGDPperCap: num  775 3269 2821 8846 6459
##  $ iqrGDPperCap: num  1616 4402 7492 13248 8072
# spreadGDPperCapByCont

Here, we also see that the saved variable spreadGDPperCapByCont retains the data.frame structure. It contains 5 variables; the continent and the four measures of spread of the GDP per capita. Let's show the results in a table:

tableData <- xtable(spreadGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)
continent sdGDPperCap varGDPperCap madGDPperCap iqrGDPperCap
Africa 2827.93 7997187.31 775.32 1616.17
Americas 6396.76 40918591.10 3269.33 4402.43
Asia 14045.37 197272505.85 2820.83 7492.26
Europe 9355.21 87520019.60 8846.05 13248.30
Oceania 6358.98 40436668.87 6459.10 8072.26

Again, this is sorted in the default alphabetical order. Sorting on the standard deviation gives us:

tableData <- arrange(tableData, sdGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)
continent sdGDPperCap varGDPperCap madGDPperCap iqrGDPperCap
Africa 2827.93 7997187.31 775.32 1616.17
Oceania 6358.98 40436668.87 6459.10 8072.26
Americas 6396.76 40918591.10 3269.33 4402.43
Europe 9355.21 87520019.60 8846.05 13248.30
Asia 14045.37 197272505.85 2820.83 7492.26

From highest to lowest standard deviation of the GDP per capita, the continents are thus: Asia, Europe, Americas, Oceania, and Africa. In fact, the task from above (with the minimum and maximum GDP per capita) gave some indication of this ranking. We saw there that Asia–although having the largest maximum by far–had the second lowest minimum. This indicates a very large spread, and that is exactly what wee see in the table above. In the task above, Asia was the continent with both the lowest minimum and maximum, indicating a low spread. And this is also precisely what the table above tells us. Now, the continents obviously rank the same with respect to the standard deviation and the variance, seeing that the latter is merely the square of the former. However, it is interesting to notice that this ranking order is, in most cases, not retained for the two other measures of spread. For instance, Europe has the highest spread in GDP per capita when looking at the mean absolute deviation and the interquartile range. And Asia, with the highest value for the standard deviation, is pushed back to the second last place when considering the MAD measure. It is also interesting to see that Oceania and the Americas are ever so close with respect to the standard deviation, but differ wildly–up to a factor of 2–for the MAD and IQR measures. Africa, on the other hand, is the only continent that retains its position, having the lowest spread of GDP per capita across all of the different meaures considered. What we can gather qualitatively from this is, of course, that all of the countries in Africa are relatively poor, whereas the other continents generally have both poor and much richer countries.

Here I provide a bar chart for standard deviation and IQR where countries are ordered by standard deviation:

spreadGDPperCapByCont.Reorder = within(spreadGDPperCapByCont, continent <- reorder(continent, 
    sdGDPperCap))
barchart(sdGDPperCap + iqrGDPperCap ~ continent, spreadGDPperCapByCont.Reorder, 
    auto.key = TRUE)

plot of chunk unnamed-chunk-11

Comparing SD and IQR, we can guess that the large SD of Asia is due to the presence of outliers and the large SD of Europe is not due to outliers. To verify this, a violin plot is provided:

bwplot(gdpPercap ~ continent, within(gDat, continent <- reorder(continent, gdpPercap, 
    FUN = sd)), panel = panel.violin)

plot of chunk unnamed-chunk-12

The plot verifies our guess that Europe has large variation and yet has few outliers.

A trimmed mean of life expectancy by year

This task concerns the average life expectancy for different years. We will also consider an alternative measure, namely the so-called trimmed, or truncated, mean.

First, we find the average life expectancy (averaged over all countries) for each year:

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp))
str(avgLifeExpByYear)
## 'data.frame':    12 obs. of  2 variables:
##  $ year      : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ avgLifeExp: num  49.1 51.5 53.6 55.7 57.6 ...

The last command gives us a feel of the data in avgLifeExpByYear. Evidently, we have a data.frame with 12 observations (corresponding to the 12 years, which are separated by 5 year gaps) and two variables. The variables are, of course, the year and the average life expectancy. The results are shown in the table below.

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
year avgLifeExp
1952 49.06
1957 51.51
1962 53.61
1967 55.68
1972 57.65
1977 59.57
1982 61.53
1987 63.21
1992 64.16
1997 65.01
2002 65.69
2007 67.01

We observe a clear trend in these data; the average life expectancy (across all countries in the world) has increased throughout the years. But not only that; looking more closely at the values, we see that the increases in the 5-year time periods are roughly 2 for the first many elements. This pattern of constant increases is broken, however, when we get to about 1990; here, the increases in the average life expectancy start to sort of flatten out, with differences in value between two consecutive periods of below 1. The increase seems to pick up again, though, between the last two data points, i.e. between the years 2002 and 2007.

Next, we will apply the trimmed, or truncated mean. This is another statistical measure of so-called central tendency, much like the regular mean and the median. As the name suggests, the trimmed mean calculates the mean after “trimming” the data sample by discarding a certain amount of entries at the high and low end. The trimmed mean is included in R through the trim() argument in the function mean(). The trim factor, which determines the amount of data entries to be discarded, is a fraction between 0 and 0.5. A trim factor of zero thus reduces the trimmed mean to the regular mean, whereas a trim factor of e.g. 0.4 implies that 40% of observations are discarded from each end of the data before the mean is computed. Usually, [for most statistical applications, 5 to 25 percent of the ends are discarded][wikipedia:trimmed-mean], corresponding to a trim factor between 0.05 and 0.25.

Using a trim factor of 0.1 to discard 10% of the data from each end gives:

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp, 
    trim = 0.1))
str(avgLifeExpByYear)
## 'data.frame':    12 obs. of  2 variables:
##  $ year      : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ avgLifeExp: num  48.6 51.3 53.6 55.9 58 ...

Let's show these data in a table:

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
year avgLifeExp
1952 48.58
1957 51.27
1962 53.58
1967 55.87
1972 58.01
1977 60.10
1982 62.12
1987 63.92
1992 65.19
1997 66.02
2002 66.72
2007 68.11

Evidently, the trend remains the same; the average life expectancy increases with time. The numbers, however, are not the same as above. Let's make it easier to compare the regular mean of the life expectancy with the trimmed mean by creating a data.frame containing both measures as variables.

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp), 
    avgTrimLifeExp = mean(lifeExp, trim = 0.1))

The data are shown in the table below.

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
year avgLifeExp avgTrimLifeExp
1952 49.06 48.58
1957 51.51 51.27
1962 53.61 53.58
1967 55.68 55.87
1972 57.65 58.01
1977 59.57 60.10
1982 61.53 62.12
1987 63.21 63.92
1992 64.16 65.19
1997 65.01 66.02
2002 65.69 66.72
2007 67.01 68.11

Although they are naturally not equal, the mean values for each year are rather close. The general trend of life expectancy increasing with time also remains the same. Using the trimmed mean also results in increases of magnitude roughly 2. Like it was for the regular mean, however, this trend flattens out a bit around 1992, after which the increases are smaller. It does seem to pick up again for the newest data, though we would need more observations to properly comment on whether the trend increases or flattens out.

The most appropriate plot for this section is of course xyplot:

xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear, type = "l")

plot of chunk unnamed-chunk-19

I also try xyplot for life expectancy of some countries, which is sometimes a good way to see both the trend and the variation over the years.

set.seed(100)
sampleCont = sample(unique(gDat$country), 40)
xyplot(lifeExp ~ year, subset(gDat, country %in% sampleCont), group = country, 
    type = "l")

plot of chunk unnamed-chunk-20

According to the plot, there are some countries whose life expectancy is greatly dropped after 1990, which makes the mean life expectancy grow slower after 1990. Also, because of some countries who have low life expentancy throughout the years, the trimmed mean does not catch this drop well. In other words, the trimmed mean fails to throw away the countries with great drop, which just leads to roughly the same value as untrimmed mean, therefore providing no meaningful information.

The change of life expectancy over time for different continents

This task considers the question of how life expectancy changes over time on different continents. Thus, compared to the above task, we do not just want the average life expectancy (across all countries) for each year; rather, we want the average life expectancy (across the countries for each respective continent) for each year. In plyr, this can be achieved by adding ~ continent + year (rather than just ~ continent) in the ddply() function call.

avgLifeExpByContAndYear <- ddply(gDat, ~continent + year, summarize, avgLifeExp = mean(lifeExp))

Showing these results requires a slightly longer table:

tableData <- xtable(avgLifeExpByContAndYear)
print(tableData, type = "html", include.rownames = FALSE)
continent year avgLifeExp
Africa 1952 39.14
Africa 1957 41.27
Africa 1962 43.32
Africa 1967 45.33
Africa 1972 47.45
Africa 1977 49.58
Africa 1982 51.59
Africa 1987 53.34
Africa 1992 53.63
Africa 1997 53.60
Africa 2002 53.33
Africa 2007 54.81
Americas 1952 53.28
Americas 1957 55.96
Americas 1962 58.40
Americas 1967 60.41
Americas 1972 62.39
Americas 1977 64.39
Americas 1982 66.23
Americas 1987 68.09
Americas 1992 69.57
Americas 1997 71.15
Americas 2002 72.42
Americas 2007 73.61
Asia 1952 46.31
Asia 1957 49.32
Asia 1962 51.56
Asia 1967 54.66
Asia 1972 57.32
Asia 1977 59.61
Asia 1982 62.62
Asia 1987 64.85
Asia 1992 66.54
Asia 1997 68.02
Asia 2002 69.23
Asia 2007 70.73
Europe 1952 64.41
Europe 1957 66.70
Europe 1962 68.54
Europe 1967 69.74
Europe 1972 70.78
Europe 1977 71.94
Europe 1982 72.81
Europe 1987 73.64
Europe 1992 74.44
Europe 1997 75.51
Europe 2002 76.70
Europe 2007 77.65
Oceania 1952 69.25
Oceania 1957 70.30
Oceania 1962 71.09
Oceania 1967 71.31
Oceania 1972 71.91
Oceania 1977 72.85
Oceania 1982 74.29
Oceania 1987 75.32
Oceania 1992 76.94
Oceania 1997 78.19
Oceania 2002 79.74
Oceania 2007 80.72

For Africa, the general trend for the first many time periods is that the average life expectancy increases. However, something sad–but interesting from a statistical point of view–happens after 1987. The average life expectancy flattens out. In fact, from 1992-1997 and 1997-2002, the average life expectancy decreases! However, it increases again in the period 2002-2007. The average life expectancy over time for the Americas shows the same general trend. Here, however, there is no real flattening, nor any decreases; just a steady increase throughout the years. It is also interesting to notice the magnitude of the average life expectancy for e.g. Africa and the Americas; by 2007, the average life expectancy for Africa has only reached about the same level as the Americas had in 1952! Asia also undergoes a gradual increase in life expectancy over time. It starts lower than the Americas, but the increases are larger, and by 2007 Asia has almost reached the same level as the Americas. The average life expectancy in Europe has also increased with time. But the increases are small; Europe starts out with a very high life expectancy (much higher than what Africa had in 2007) and ends in 2007 at an even higher level. But the increases are still small; whereas e.g. Asia goes from roughly 46 to 71 (an increase of 25) from 1952-2007, Europe goes from 64 to 78 (an increase of only 14). Lastly, there is Oceania. Here, we also observe an increasing trend. The average life expectancy starts out in 1952 at an even higher level than Europe did in the same year (and actually at about the same level that Asia had reached in 2007), and it also ends even higher than Europe by 2007. Oceania is thus the continent with the largest initial (in the year 1952) and final (in 2007) average life expectancy. But it is also the continent with the smallest total increase; the increase from 1952 to 2007 is merely about 11—even lower than Europe.

On a final note, it is both a timely and a tiring task to assess these trends from the data.frame and the table of data. It would be much better, easier, and informative to visualize the average life expectancy over time for each continent as five different plots in the same figure.

Again, xyplot is the most appropriate plot for this section.

xyplot(avgLifeExp ~ year, avgLifeExpByContAndYear, group = continent, type = "l", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-23

Combining this plot with the previous section, some African countries seem to drive the slow growth in the life expectancy after 1990. If we exclude Africa, the mean life expectancy over the years becomes:

avgLifeExpByYear.NAfrica <- ddply(subset(gDat, continent != "Africa"), ~year, 
    summarize, avgLifeExp = mean(lifeExp), avgTrimLifeExp = mean(lifeExp, trim = 0.1))

xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear.NAfrica, type = "l")

plot of chunk unnamed-chunk-24

Now the plot seems to exhibit a linear trend throughout the observation period, which means that we should look at African countries if we are concerned with low life expectancy growth after 1990.

Counting the number of countries with low life expectancy over time by continent

In this task we will count the number of countries with a life expectancy lower than some threshold, for each year. We will do this in a few short steps.

First, we consider the subset of gDat for which ´lifeExp´ is less than a certain threshold (this number can be e.g. the mean life expectancy, or any other other value).

lifeExpThreshold = 35
jDat <- subset(gDat, lifeExp < lifeExpThreshold)

Then we can use this data.frame jDat as input to ddply(). In the call to ddply() we also specify that we want to consider each continent and each year. Moreover, we define a variable that holds the number of countries in that continent. The result gives us the number of countries with a life expectancy lower than the specified threshold over time and by continent.

nCountriesWithLowLifeExpOverTime <- ddply(jDat, ~continent + year, summarize, 
    nCountriesWithLowLifeExp = length(unique(country)))
str(nCountriesWithLowLifeExpOverTime)
## 'data.frame':    10 obs. of  3 variables:
##  $ continent               : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 3 3 3 3 3
##  $ year                    : int  1952 1957 1962 1967 1992 1952 1957 1962 1967 1977
##  $ nCountriesWithLowLifeExp: int  12 8 4 1 1 2 2 1 1 1

Of course, the new data.frame jDat didn't have to specified. The computations can be put into a single command:

nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 35), ~continent + 
    year, summarize, nCountriesWithLowLifeExp = length(unique(country)))

The results (using 35 as the threshold/benchmark value) are shown in the table below.

tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)
continent year nCountriesWithLowLifeExp
Africa 1952 12
Africa 1957 8
Africa 1962 4
Africa 1967 1
Africa 1992 1
Asia 1952 2
Asia 1957 2
Asia 1962 1
Asia 1967 1
Asia 1977 1

First, we notice that only Africa and Asia are represented. As we saw in the above task, these were also the continents with the lowest average life expectancy over time. Now, from the last column—the number of countries in the continent with a life expectancy lower than the benchmark, for a given year—we observe a decreasin trend. Fortunately, the number of countries with a life expectancy lower than the threshold decreases with time. Of course, as there are so few countries in Asia represented here, this is mostly relevant for Africa. Let's try to increase the threshold value to 50 to get some more data:

nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 50), ~continent + 
    year, summarize, nCountriesWithLowLifeExp = length(unique(country)))
tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)
continent year nCountriesWithLowLifeExp
Africa 1952 50
Africa 1957 49
Africa 1962 47
Africa 1967 39
Africa 1972 36
Africa 1977 28
Africa 1982 24
Africa 1987 20
Africa 1992 20
Africa 1997 20
Africa 2002 22
Africa 2007 18
Americas 1952 9
Americas 1957 8
Americas 1962 6
Americas 1967 2
Americas 1972 2
Americas 1977 1
Asia 1952 22
Asia 1957 18
Asia 1962 17
Asia 1967 12
Asia 1972 6
Asia 1977 5
Asia 1982 3
Asia 1987 1
Asia 1992 1
Asia 1997 1
Asia 2002 1
Asia 2007 1
Europe 1952 1
Europe 1957 1

Now we also have a few data entries for the Americas and Europe. And the numbers of countries in the last column are also larger. There was only a single country in Europe with a life expectancy lower than 50 in 1952 and 1957. For the later years, there are none. For the Americas, the trend is a decrease. And after 1977, there are not any countries here with a life expectancy lower than 50 either. The trends for Asia and Africa are the same; the number of countries with a life expectancy lower than 50 decreases with time. Asia only has one such country from 1987 and all the way up to the most recent time. Africa, on the other, still has as many as 18 such countries in 2007. Looking back in time, though, this is still better than the initial number of 50 such countries in 1952.

Finally, it is not much fun to assess these trends from the long table of data above. It would be much better to visualize the data for the different continents by plotting the number of countries against time. This would make trends more visible, allow for easier comparisons between continents, and also make the data more understandable for others.

xyplot for the data with life expectancy threshold 50 is as follows:

xyplot(nCountriesWithLowLifeExp ~ year, nCountriesWithLowLifeExpOverTime, group = continent, 
    type = "l", auto.key = TRUE, )

plot of chunk unnamed-chunk-31

Removing the data points with no country whose life expectancy is below 50 was really good with respect to data visualization because we can clearly see in the plot when there is no country whose life expectancy is below 50 for each continent. Also, because of treating zero as missing, we can see without paying much attention that there are still some countries in Asia with low life expectancy although the number of such countries is close to 0.