STAT 545A Homework 3

Sep 23 2013


Introduction

We will continue to look at the Gapminder data provided by Jenny Bryan for the STAT 545A class.

The data contains 6 variables (country, year, pop, continent, lifeExp, gdpPercap), with 1704 rows of data. A short summary of the data and the variables can be found in the following table:


  Variable   Name             Info
1 country    Country          factor, 142 countries
2 year       Year             integer, 12 years, range: 1952 - 2007
3 pop        Population       numeric, quantiles: 0%=60011, 25%=2793664, 50%=7023595.5, 75%=19585221.75, 100%=1318683096
4 continent  Continent        factor, values: Asia, Europe, Africa, Americas, Oceania
5 lifeExp    Life Expectancy  numeric, quantiles: 0%=23.599, 25%=48.198, 50%=60.7125, 75%=70.8455, 100%=82.603
6 gdpPercap  GDP per capita   numeric, quantiles: 0%=241.1658765, 25%=1202.06030925, 50%=3531.8469885, 75%=9325.462346, 100%=113523.1329

A quick look into the structure of the data is provided by the str function in R:

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
## NULL

Using the Gapminder data set, we will focus on data aggregation tasks using the plyr package written by Hadley Wickham. We will stress the importance of figures in data exploration by trying to extract information from the data set using only tables. As will soon become painfully obvious, a table is not the best display when one is trying to find interesting relationships, trends, unique properties, or secrets that may lie within the data.

Data Aggregation Tasks

The Rich and the Poor (according to GDP per capita), by Continent

We begin by investigating the quantitative variable “GDP per capita”. We are interested in the lowest and highest GDP per capita found in each continent, as well as the difference between the two. These are displayed in the table below, sorted in ascending order of minimum GDP per capita.

continent minGdpPercap maxGdpPercap diffGdpPercap
1 Africa 241.17 21951.21 21710.05
3 Asia 331.00 113523.13 113192.13
4 Europe 973.53 49357.19 48383.66
2 Americas 1201.64 42951.65 41750.02
5 Oceania 10039.60 34435.37 24395.77

There are several interesting points in this table. Africa contains the nation with the smallest GDP per capita, and its maximum GDP per capita is also the lowest of any continent. Asia has the greatest gap between its minimum and maximum GDP per capita, and it is also home to the highest GDP per capita in the data set. Oceania has the richest “poorest” country when compared to the other continents.
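A minimal sketch of how such a table could be produced with plyr is shown here. It is not the original report code: the data frame `gdp` and the result name `gdpRange` are hypothetical, and `gdp` holds only twelve rows copied from tables in this report. Because those rows include the overall extremes for Asia and Oceania, the output should agree with the corresponding rows of the table above.

```r
library(plyr)

# Twelve rows copied from the tables in this report (not the full data set).
gdp <- data.frame(
  continent = rep(c("Asia", "Oceania"), each = 6),
  country   = rep(c("Kuwait", "Myanmar", "Australia", "New Zealand"), each = 3),
  year      = rep(c(1952, 1957, 2007), times = 4),
  gdpPercap = c(108382.35, 113523.13, 47306.99,  # Kuwait
                331.00, 350.00, 944.00,          # Myanmar
                10039.60, 10949.65, 34435.37,    # Australia
                10556.58, 12247.40, 25185.01)    # New Zealand
)

# Split by continent, summarize each piece, combine into one data frame.
gdpRange <- ddply(gdp, ~ continent, summarize,
                  minGdpPercap  = min(gdpPercap),
                  maxGdpPercap  = max(gdpPercap),
                  diffGdpPercap = max(gdpPercap) - min(gdpPercap))

# Sort by ascending minimum GDP per capita.
gdpRange <- arrange(gdpRange, minGdpPercap)
print(gdpRange)
```

On the full Gapminder data frame the same `ddply` call would return one row per continent.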

Let us try displaying the above table in “Tall” format to check if it is easier or harder to decipher the data:

continent stat type
1 Africa 241.17 min
2 Africa 21951.21 max
3 Africa 21710.05 diff
4 Americas 1201.64 min
5 Americas 42951.65 max
6 Americas 41750.02 diff
7 Asia 331.00 min
8 Asia 113523.13 max
9 Asia 113192.13 diff
10 Europe 973.53 min
11 Europe 49357.19 max
12 Europe 48383.66 diff
13 Oceania 10039.60 min
14 Oceania 34435.37 max
15 Oceania 24395.77 diff

It seems that the “wide” table format is easier to read in this first example. This may be because the “tall” format spreads the three statistics for a continent over separate rows, so comparing continents means scanning up and down the table instead of reading across a single row. If I had to choose one table to display, I would keep the “wide” table.
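One way to obtain the “tall” layout with plyr alone is to let the function passed to `ddply` return several rows per continent. The sketch below assumes the same hypothetical twelve-row subset `gdp` (values copied from tables in this report); `gdpTall` is likewise a made-up name.

```r
library(plyr)

# Twelve rows copied from the tables in this report (not the full data set).
gdp <- data.frame(
  continent = rep(c("Asia", "Oceania"), each = 6),
  country   = rep(c("Kuwait", "Myanmar", "Australia", "New Zealand"), each = 3),
  year      = rep(c(1952, 1957, 2007), times = 4),
  gdpPercap = c(108382.35, 113523.13, 47306.99,  # Kuwait
                331.00, 350.00, 944.00,          # Myanmar
                10039.60, 10949.65, 34435.37,    # Australia
                10556.58, 12247.40, 25185.01)    # New Zealand
)

# Return three rows (min, max, diff) for each continent instead of one.
gdpTall <- ddply(gdp, ~ continent, function(d) {
  lo <- min(d$gdpPercap)
  hi <- max(d$gdpPercap)
  data.frame(stat = c(lo, hi, hi - lo),
             type = c("min", "max", "diff"))
})
print(gdpTall)
```

The only difference from the “wide” version is the shape of what the per-continent function returns: one row with three columns versus three rows with a label column.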

The next question that arises from these tables is “Which country in each continent has the lowest/highest GDP per capita?” It is a natural extension of the tables above, so we add the country for both the minimum and the maximum GDP per capita. We leave out the difference between the minimum and maximum in the next table.

  continent  minCntry                minGDP    maxCntry       maxGDP
1 Africa     Congo, Dem. Rep.        241.17    Libya          21951.21
3 Asia       Myanmar                 331.00    Kuwait         113523.13
4 Europe     Bosnia and Herzegovina  973.53    Norway         49357.19
2 Americas   Haiti                   1201.64   United States  42951.65
5 Oceania    Australia               10039.60  Australia      34435.37

We were initially a little confused when Australia turned out to be the country with both the minimum and the maximum GDP per capita in Oceania. If we delve further into the data set, we find two interesting facts. First, Oceania contains only two countries, Australia and New Zealand. Second, Australia had a smaller GDP per capita than New Zealand in 1952. However, Australia's GDP per capita later overtook New Zealand's, so the Oceania maximum in the above table is Australia's value in 2007, the last year in the data. Due to my ineptitude in geography, I cannot say much about the other countries in the table, except that the United States having the greatest GDP per capita in the Americas was not surprising.
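The country lookup can be sketched with `which.min` and `which.max` inside the function given to `ddply`. This is an illustration under assumptions, not the report's own code: `gdp` is a hypothetical twelve-row subset copied from tables in this report, and the `minYear` and `maxYear` columns are added here only to make the Australia puzzle visible.

```r
library(plyr)

# Twelve rows copied from the tables in this report (not the full data set).
gdp <- data.frame(
  continent = rep(c("Asia", "Oceania"), each = 6),
  country   = rep(c("Kuwait", "Myanmar", "Australia", "New Zealand"), each = 3),
  year      = rep(c(1952, 1957, 2007), times = 4),
  gdpPercap = c(108382.35, 113523.13, 47306.99,  # Kuwait
                331.00, 350.00, 944.00,          # Myanmar
                10039.60, 10949.65, 34435.37,    # Australia
                10556.58, 12247.40, 25185.01)    # New Zealand
)

# For each continent, pull out the full rows holding the minimum and maximum.
extremes <- ddply(gdp, ~ continent, function(d) {
  lo <- d[which.min(d$gdpPercap), ]
  hi <- d[which.max(d$gdpPercap), ]
  data.frame(minCntry = lo$country, minYear = lo$year, minGDP = lo$gdpPercap,
             maxCntry = hi$country, maxYear = hi$year, maxGDP = hi$gdpPercap)
})
print(extremes)
```

Keeping the year next to each country shows that Oceania's minimum and maximum both belong to Australia, but in different years.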

Next, we take a look at the spread of GDP per capita in each continent, and also the number of countries in each continent. We will look at the standard deviation of the GDP per capita variable.

continent sdGdpPercap nCountry
1 Africa 2827.93 52
5 Oceania 6358.98 2
2 Americas 6396.76 25
4 Europe 9355.21 30
3 Asia 14045.37 33

The table is sorted from smallest to largest standard deviation of GDP per capita. Africa has the smallest spread and Asia the largest. Looking back at our first table, this may partly be explained by the narrow and wide gaps between the minimum and maximum GDP per capita within those two continents. Keep in mind that each continent's observations span all twelve years, so the spread reflects changes over time as well as differences between countries. With that caveat, GDP per capita appears most homogeneous in Africa and most diverse in Asia.
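A spread-and-count summary of this kind might be written as below. The sketch assumes the hypothetical twelve-row subset `gdp` (values copied from tables in this report), so the standard deviations it prints will not match the table above; only the shape of the call is the point.

```r
library(plyr)

# Twelve rows copied from the tables in this report (not the full data set).
gdp <- data.frame(
  continent = rep(c("Asia", "Oceania"), each = 6),
  country   = rep(c("Kuwait", "Myanmar", "Australia", "New Zealand"), each = 3),
  year      = rep(c(1952, 1957, 2007), times = 4),
  gdpPercap = c(108382.35, 113523.13, 47306.99,  # Kuwait
                331.00, 350.00, 944.00,          # Myanmar
                10039.60, 10949.65, 34435.37,    # Australia
                10556.58, 12247.40, 25185.01)    # New Zealand
)

# Standard deviation over all rows of a continent, plus distinct countries.
spread <- ddply(gdp, ~ continent, summarize,
                sdGdpPercap = sd(gdpPercap),
                nCountry    = length(unique(country)))
spread <- arrange(spread, sdGdpPercap)
print(spread)
```

Counting `unique(country)` instead of rows matters because each country appears once per year.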

Life Expectancy over Time and by Continent

Next, we turn our focus to life expectancy. It would be interesting to look at global life expectancy and how it changed over the years. For each year, we calculate the mean, the trimmed mean, and the median life expectancy. The trimmed mean is calculated with 10% of the observations trimmed from each end of the life expectancy data in each year.

year mean trimMean median
1 1952 49.06 48.58 45.14
2 1957 51.51 51.27 48.36
3 1962 53.61 53.58 50.88
4 1967 55.68 55.87 53.83
5 1972 57.65 58.01 56.53
6 1977 59.57 60.10 59.67
7 1982 61.53 62.12 62.44
8 1987 63.21 63.92 65.83
9 1992 64.16 65.19 67.70
10 1997 65.01 66.02 69.39
11 2002 65.69 66.72 70.83
12 2007 67.01 68.11 71.94

Life expectancy shows a clear upward trend from 1952 to 2007. The three measures of center agree on the trend, though not on the level: the median starts below the mean in 1952 (45.14 vs. 49.06) and ends above it in 2007 (71.94 vs. 67.01). The median also shows the greatest change of the three statistics, rising by 26.80 years.
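To make the trimmed mean concrete, here is a small sketch using the `trim` argument of base R's `mean`. The life expectancies are made up (ten hypothetical values per year, not Gapminder data), chosen so that `trim = 0.1` drops exactly one value from each end; the column names `meanLife`, `trimLife`, and `medLife` are also illustrative.

```r
library(plyr)

# Hypothetical life expectancies: ten made-up values per year.
life <- data.frame(
  year    = rep(c(1952, 2007), each = 10),
  lifeExp = c(30, 35, 40, 42, 45, 48, 50, 60, 65, 70,
              40, 52, 58, 65, 70, 72, 74, 76, 79, 82)
)

# Three measures of center per year; trim = 0.1 removes 10% from each end.
lifeByYear <- ddply(life, ~ year, summarize,
                    meanLife = mean(lifeExp),
                    trimLife = mean(lifeExp, trim = 0.1),
                    medLife  = median(lifeExp))
print(lifeByYear)
```

For the made-up 1952 values, the trimmed mean discards 30 and 70 and averages the remaining eight numbers, giving 48.125 compared with an untrimmed mean of 48.5.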

How about we look at the mean and median life expectancy by continent and year? Since we are calculating some summary statistic, we also include the number of observations used to calculate the statistic.

continent year meanLife medLife nCountry
1 Africa 1952 39.14 38.83 52
2 Africa 1957 41.27 40.59 52
3 Africa 1962 43.32 42.63 52
4 Africa 1967 45.33 44.70 52
5 Africa 1972 47.45 47.03 52
6 Africa 1977 49.58 49.27 52
7 Africa 1982 51.59 50.76 52
8 Africa 1987 53.34 51.64 52
9 Africa 1992 53.63 52.43 52
10 Africa 1997 53.60 52.76 52
11 Africa 2002 53.33 51.24 52
12 Africa 2007 54.81 52.93 52
13 Americas 1952 53.28 54.74 25
14 Americas 1957 55.96 56.07 25
15 Americas 1962 58.40 58.30 25
16 Americas 1967 60.41 60.52 25
17 Americas 1972 62.39 63.44 25
18 Americas 1977 64.39 66.35 25
19 Americas 1982 66.23 67.41 25
20 Americas 1987 68.09 69.50 25
21 Americas 1992 69.57 69.86 25
22 Americas 1997 71.15 72.15 25
23 Americas 2002 72.42 72.05 25
24 Americas 2007 73.61 72.90 25
25 Asia 1952 46.31 44.87 33
26 Asia 1957 49.32 48.28 33
27 Asia 1962 51.56 49.33 33
28 Asia 1967 54.66 53.66 33
29 Asia 1972 57.32 56.95 33
30 Asia 1977 59.61 60.77 33
31 Asia 1982 62.62 63.74 33
32 Asia 1987 64.85 66.30 33
33 Asia 1992 66.54 68.69 33
34 Asia 1997 68.02 70.27 33
35 Asia 2002 69.23 71.03 33
36 Asia 2007 70.73 72.40 33
37 Europe 1952 64.41 65.90 30
38 Europe 1957 66.70 67.65 30
39 Europe 1962 68.54 69.53 30
40 Europe 1967 69.74 70.61 30
41 Europe 1972 70.78 70.89 30
42 Europe 1977 71.94 72.34 30
43 Europe 1982 72.81 73.49 30
44 Europe 1987 73.64 74.81 30
45 Europe 1992 74.44 75.45 30
46 Europe 1997 75.51 76.12 30
47 Europe 2002 76.70 77.54 30
48 Europe 2007 77.65 78.61 30
49 Oceania 1952 69.25 69.25 2
50 Oceania 1957 70.30 70.30 2
51 Oceania 1962 71.09 71.09 2
52 Oceania 1967 71.31 71.31 2
53 Oceania 1972 71.91 71.91 2
54 Oceania 1977 72.85 72.85 2
55 Oceania 1982 74.29 74.29 2
56 Oceania 1987 75.32 75.32 2
57 Oceania 1992 76.94 76.94 2
58 Oceania 1997 78.19 78.19 2
59 Oceania 2002 79.74 79.74 2
60 Oceania 2007 80.72 80.72 2

We notice, again, that the “tall” data format is not the best way to view this data. It may be easy to follow the life expectancy of a single continent over time, but it is not easy to compare different continents directly in the same year.

We apply some plyr kung-fu using the daply function to force our data to be aggregated as a “wide” table. (Normally, this task can be easily accomplished with Hadley Wickham's reshape or reshape2 package, but Jenny challenged us to use only plyr.) Since we are going to a “wide” table, let us calculate only the mean life expectancy for each year, to keep things simple and readable.

Africa Americas Asia Europe Oceania
1952 39.14 53.28 46.31 64.41 69.25
1957 41.27 55.96 49.32 66.70 70.30
1962 43.32 58.40 51.56 68.54 71.09
1967 45.33 60.41 54.66 69.74 71.31
1972 47.45 62.39 57.32 70.78 71.91
1977 49.58 64.39 59.61 71.94 72.85
1982 51.59 66.23 62.62 72.81 74.29
1987 53.34 68.09 64.85 73.64 75.32
1992 53.63 69.57 66.54 74.44 76.94
1997 53.60 71.15 68.02 75.51 78.19
2002 53.33 72.42 69.23 76.70 79.74
2007 54.81 73.61 70.73 77.65 80.72

Again, we see the advantage of using a “wide” table format over a “tall” format when we want to manually interpret the data in a table. It looks like the secret of long life resides in Oceania. Africa has the lowest mean life expectancy in every year, and although it increased quite a bit since 1952, by 2007 it has only just passed the level the Americas had already reached in 1952 (54.81 vs. 53.28 years)! Asia shows the greatest increase in mean life expectancy over the range of the data, gaining 24.42 years.
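The daply trick can be sketched as follows: splitting on two variables and returning a single number per piece yields a two-dimensional array with one dimension per splitting variable. The data here are entirely hypothetical (made-up continents `ContA` and `ContB`, made-up countries and values), so only the mechanics carry over.

```r
library(plyr)

# Hypothetical data: two made-up continents, two countries each, two years.
life <- data.frame(
  continent = rep(c("ContA", "ContB"), each = 4),
  country   = rep(c("a1", "a2", "b1", "b2"), each = 2),
  year      = rep(c(1952, 2007), times = 4),
  lifeExp   = c(40, 50,   # a1
                44, 62,   # a2
                60, 70,   # b1
                66, 78)   # b2
)

# One scalar per (year, continent) cell gives a year-by-continent matrix.
wideMean <- daply(life, ~ year + continent, function(d) mean(d$lifeExp))
print(wideMean)
```

The order of the variables in the formula decides which one becomes the rows, so `~ year + continent` puts years down the side and continents across the top.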

Let us define an arbitrary value for “low life expectancy”. Say, if a country has a life expectancy smaller than the overall median life expectancy (calculated using the entire data set), then it is labelled as having low life expectancy. We tabulate the proportion of countries within each continent that have life expectancies smaller than the overall median (which is equal to 60.7125 years).

Africa Americas Asia Europe Oceania
1952 1.000 0.760 0.909 0.233 0.000
1957 1.000 0.640 0.818 0.100 0.000
1962 1.000 0.520 0.788 0.033 0.000
1967 0.981 0.520 0.758 0.033 0.000
1972 0.962 0.400 0.606 0.033 0.000
1977 0.962 0.280 0.485 0.033 0.000
1982 0.885 0.200 0.364 0.000 0.000
1987 0.788 0.080 0.303 0.000 0.000
1992 0.769 0.080 0.242 0.000 0.000
1997 0.846 0.040 0.212 0.000 0.000
2002 0.788 0.040 0.152 0.000 0.000
2007 0.788 0.000 0.091 0.000 0.000

We skip the “tall” format of this table and go directly to the “wide” format. Again, this is done using only plyr functions (namely, daply). This table tells a similar story to the table above it: Africa's proportion stays high throughout (0.788 in 2007), while every other continent falls below 0.1 by 2007.
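The proportion calculation relies on the fact that the mean of a logical vector is the fraction of `TRUE` values. A minimal sketch, again on entirely hypothetical data (made-up continents, countries, and life expectancies):

```r
library(plyr)

# Hypothetical data: two made-up continents, two countries each, two years.
life <- data.frame(
  continent = rep(c("ContA", "ContB"), each = 4),
  country   = rep(c("a1", "a2", "b1", "b2"), each = 2),
  year      = rep(c(1952, 2007), times = 4),
  lifeExp   = c(40, 50,   # a1
                44, 62,   # a2
                60, 70,   # b1
                66, 78)   # b2
)

# Threshold: the median over the entire data set, computed once up front.
overallMed <- median(life$lifeExp)

# Fraction of countries below the threshold in each (year, continent) cell.
propLow <- daply(life, ~ year + continent,
                 function(d) mean(d$lifeExp < overallMed))
print(propLow)
```

Computing the threshold outside the `daply` call matters: taking the median inside the function would compare each cell against its own median instead of the overall one.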

An Interesting Side Story

While we were investigating GDP per capita, we were interested in countries with the lowest and highest GDP per capita within each continent at each year.

Our investigations produced a rather unwieldy and hard-to-read table.

continent year minCntry minGDP maxCntry maxGDP
1 Africa 1952 Lesotho 298.85 South Africa 4725.30
2 Africa 1957 Lesotho 336.00 South Africa 5487.10
3 Africa 1962 Burundi 355.20 Libya 6757.03
4 Africa 1967 Burundi 412.98 Libya 18772.75
5 Africa 1972 Burundi 464.10 Libya 21011.50
6 Africa 1977 Mozambique 502.32 Libya 21951.21
7 Africa 1982 Mozambique 462.21 Libya 17364.28
8 Africa 1987 Mozambique 389.88 Gabon 11864.41
9 Africa 1992 Mozambique 410.90 Gabon 13522.16
10 Africa 1997 Congo, Dem. Rep. 312.19 Gabon 14722.84
11 Africa 2002 Congo, Dem. Rep. 241.17 Gabon 12521.71
12 Africa 2007 Congo, Dem. Rep. 277.55 Gabon 13206.48
13 Americas 1952 Dominican Republic 1397.72 United States 13990.48
14 Americas 1957 Dominican Republic 1544.40 United States 14847.13
15 Americas 1962 Dominican Republic 1662.14 United States 16173.15
16 Americas 1967 Haiti 1452.06 United States 19530.37
17 Americas 1972 Haiti 1654.46 United States 21806.04
18 Americas 1977 Haiti 1874.30 United States 24072.63
19 Americas 1982 Haiti 2011.16 United States 25009.56
20 Americas 1987 Haiti 1823.02 United States 29884.35
21 Americas 1992 Haiti 1456.31 United States 32003.93
22 Americas 1997 Haiti 1341.73 United States 35767.43
23 Americas 2002 Haiti 1270.36 United States 39097.10
24 Americas 2007 Haiti 1201.64 United States 42951.65
25 Asia 1952 Myanmar 331.00 Kuwait 108382.35
26 Asia 1957 Myanmar 350.00 Kuwait 113523.13
27 Asia 1962 Myanmar 388.00 Kuwait 95458.11
28 Asia 1967 Myanmar 349.00 Kuwait 80894.88
29 Asia 1972 Myanmar 357.00 Kuwait 109347.87
30 Asia 1977 Myanmar 371.00 Kuwait 59265.48
31 Asia 1982 Myanmar 424.00 Saudi Arabia 33693.18
32 Asia 1987 Myanmar 385.00 Kuwait 28118.43
33 Asia 1992 Myanmar 347.00 Kuwait 34932.92
34 Asia 1997 Myanmar 415.00 Kuwait 40300.62
35 Asia 2002 Myanmar 611.00 Singapore 36023.11
36 Asia 2007 Myanmar 944.00 Kuwait 47306.99
37 Europe 1952 Bosnia and Herzegovina 973.53 Switzerland 14734.23
38 Europe 1957 Bosnia and Herzegovina 1353.99 Switzerland 17909.49
39 Europe 1962 Bosnia and Herzegovina 1709.68 Switzerland 20431.09
40 Europe 1967 Bosnia and Herzegovina 2172.35 Switzerland 22966.14
41 Europe 1972 Bosnia and Herzegovina 2860.17 Switzerland 27195.11
42 Europe 1977 Bosnia and Herzegovina 3528.48 Switzerland 26982.29
43 Europe 1982 Albania 3630.88 Switzerland 28397.72
44 Europe 1987 Albania 3738.93 Norway 31540.97
45 Europe 1992 Albania 2497.44 Norway 33965.66
46 Europe 1997 Albania 3193.05 Norway 41283.16
47 Europe 2002 Albania 4604.21 Norway 44683.98
48 Europe 2007 Albania 5937.03 Norway 49357.19
49 Oceania 1952 Australia 10039.60 New Zealand 10556.58
50 Oceania 1957 Australia 10949.65 New Zealand 12247.40
51 Oceania 1962 Australia 12217.23 New Zealand 13175.68
52 Oceania 1967 New Zealand 14463.92 Australia 14526.12
53 Oceania 1972 New Zealand 16046.04 Australia 16788.63
54 Oceania 1977 New Zealand 16233.72 Australia 18334.20
55 Oceania 1982 New Zealand 17632.41 Australia 19477.01
56 Oceania 1987 New Zealand 19007.19 Australia 21888.89
57 Oceania 1992 New Zealand 18363.32 Australia 23424.77
58 Oceania 1997 New Zealand 21050.41 Australia 26997.94
59 Oceania 2002 New Zealand 23189.80 Australia 30687.75
60 Oceania 2007 New Zealand 25185.01 Australia 34435.37

Nothing seems to immediately stand out in the table. Usually, countries with the smallest or largest GDP per capita remain so for at least several decades. Also, the minimum and maximum GDP per capita seem to be generally increasing over time, though this is not always the case.

Then we have Asia, the maximum GDP per capita winner according to our previous tables. However, while the rest of the world is slowly increasing in GDP per capita, the richest Asian country, Kuwait, shows a declining trend, even losing its crown as the top GDP per capita country twice in the process (once to Saudi Arabia in 1982, and once to Singapore in 2002).

For the countries in Asia, we can perform a simple linear regression of GDP per capita on year and find the slope and the intercept for each country. As we did in class, we shift the year variable by subtracting the smallest year in the data (1952), so that the intercept is the fitted GDP per capita in 1952. The estimated intercepts and slopes are shown in the next table, which is sorted from lowest to highest slope.

continent country intercept slope
16 Asia Kuwait 108891.72 -1583.96
10 Asia Iraq 9149.28 -48.64
1 Asia Afghanistan 814.79 -0.44
20 Asia Myanmar 258.41 6.58
21 Asia Nepal 512.02 9.84
3 Asia Bangladesh 528.71 10.50
14 Asia Korea, Dem. Rep. 2287.16 11.08
4 Asia Cambodia 246.51 15.59
31 Asia Vietnam 317.54 25.46
7 Asia India 286.26 28.04
24 Asia Philippines 1398.12 28.24
33 Asia Yemen, Rep. 689.36 32.00
19 Asia Mongolia 764.46 33.76
23 Asia Pakistan 490.30 34.51
27 Asia Sri Lanka 551.88 47.38
28 Asia Syria 1702.39 47.52
13 Asia Jordan 1759.26 49.78
8 Asia Indonesia 301.45 52.36
5 Asia China -303.78 65.17
32 Asia West Bank and Gaza 1863.86 68.95
17 Asia Lebanon 5168.04 76.41
9 Asia Iran 4110.93 118.75
30 Asia Thailand -322.31 122.48
18 Asia Malaysia -24.07 197.46
25 Asia Saudi Arabia 13417.84 248.87
2 Asia Bahrain 10391.29 279.50
11 Asia Israel 3692.09 380.69
15 Asia Korea, Rep. -2826.11 401.58
22 Asia Oman 721.73 415.16
29 Asia Taiwan -3477.60 498.27
12 Asia Japan 2413.51 557.72
6 Asia Hong Kong, China -1843.04 657.15
26 Asia Singapore -4389.13 793.25

It seems that, according to our very simple linear regression model*, Kuwait starts with a very high intercept but has a very large negative slope! On the other hand, Singapore, which we found overtook Kuwait's GDP per capita in 2002, started off with the most negative intercept in the table but caught up thanks to the largest slope!

* Note that the assumptions for the linear regression fits were not checked. We doubt that a linear regression model is suitable for our data.
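The per-country fits can be sketched by handing `ddply` a function that runs `lm` on one country's rows and returns the two coefficients. This is an assumed reconstruction, not the report's code: `fitLine`, `gdp`, and `coefs` are hypothetical names, and the data hold only two countries whose twelve values were copied from the minimum/maximum table above. Australia belongs to Oceania, not Asia, and is included only so that there is more than one group to split on.

```r
library(plyr)

years <- seq(1952, 2007, by = 5)

# GDP per capita for two countries, copied from the table earlier in this section.
gdp <- data.frame(
  country   = rep(c("Myanmar", "Australia"), each = 12),
  year      = rep(years, times = 2),
  gdpPercap = c(331, 350, 388, 349, 357, 371,
                424, 385, 347, 415, 611, 944,                    # Myanmar
                10039.60, 10949.65, 12217.23, 14526.12, 16788.63,
                18334.20, 19477.01, 21888.89, 23424.77, 26997.94,
                30687.75, 34435.37)                              # Australia
)

yearMin <- min(gdp$year)  # 1952; shifting makes the intercept the 1952 fit

# Fit gdpPercap ~ (year - yearMin) for one country and return both coefficients.
fitLine <- function(d) {
  fit <- lm(gdpPercap ~ I(year - yearMin), data = d)
  setNames(coef(fit), c("intercept", "slope"))
}

coefs <- ddply(gdp, ~ country, fitLine)
print(arrange(coefs, slope))
```

For Myanmar this gives an intercept of about 258.41 and a slope of about 6.58 per year, in line with the Myanmar row of the table above.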

It would be interesting to investigate the GDP per capita of Asia at a future date with plots instead of just tables.

Conclusions

In the end, we found it quite challenging to explore the Gapminder data set using only tables. On the other hand, this homework provided great practice with the data aggregation package plyr. Using ddply made everything much easier, and I have gained an appreciation for the “Split-Apply-Combine” methodology preached by Hadley Wickham and recommended by Jenny. I believe that it will be a very useful and powerful tool for data analysis when combined with plots.

For the code used to generate this report, click here