stat545a-2013-hw03_khosravi-mah

Data Aggregation

In this homework, we want to execute a series of data aggregation tasks on a set of data.

Max. and min. of GDP for different continents.

Let's import the Gapminder data set again:

SourceDat <- read.delim(file = "gapminderDataFiveYear.txt")

We can check the structure and the last rows of the imported source data, for example with str() and tail(), to see whether the data were imported properly:

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##       country year      pop continent lifeExp gdpPercap
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

Now we want to take a look at the minimum and maximum of GDP per capita for each continent. In order to do that,
we can use the ddply() function from the plyr package.

library(plyr)
GDPExterWide <- ddply(SourceDat, ~continent, summarize, MinGDP = round(min(gdpPercap), 
    2), MaxGDP = round(max(gdpPercap), 2))

To show the result in a more readable way, we can use the xtable package to draw a table, here sorted by MinGDP.

library(xtable)
print(xtable(arrange(GDPExterWide, MinGDP)), include.rownames = FALSE, type = "html")
continent MinGDP MaxGDP
Africa 241.17 21951.21
Asia 331.00 113523.13
Europe 973.53 49357.19
Americas 1201.64 42951.65
Oceania 10039.60 34435.37

As the results show, the continents would not be in the same order if we sorted them by MaxGDP instead. In other words, the continent containing the country with the lowest minimum GDP per capita in the world does not necessarily also have the lowest maximum GDP per capita. At the same time, we should be careful not to misinterpret the gap between the minimum and maximum GDP per capita within each continent. An important factor in that regard is the number of countries in each continent: (The chunk I wrote for this table is hideous, so I have left it out here.)

Continent Africa Americas Asia Europe Oceania
Countries 52 25 33 30 2
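
The country counts above can be reproduced in a few lines. The following is only a minimal base-R sketch, not the chunk that was left out: the `toy` data frame and its values are made up for illustration, and `n_countries` is a hypothetical name.

```r
# Toy stand-in for the Gapminder data (made-up values, for illustration only).
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(2002, 2007, 2002, 2007, 2002, 2007),
  stringsAsFactors = TRUE
)

# Every country appears once per year, so we count unique countries
# within each continent rather than counting rows.
n_countries <- tapply(toy$country, toy$continent,
                      function(x) length(unique(x)))
n_countries
```

With plyr loaded, the same idea could presumably be written as ddply(SourceDat, ~continent, summarize, Countries = length(unique(country))).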

Now we can see that, for example, Oceania, which is the only continent whose minimum and maximum GDP per capita are of the same order of magnitude, comprises only two countries!

Spread of GDP within the continents

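With a very similar method we can review the spread of GDP per capita within each continent, measured by the standard deviation (SD), the median absolute deviation (MAD) and the interquartile range (IQR). Here, the table is sorted by median absolute deviation.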

continent SD MAD IQR
Africa 2827.93 775.32 1616.17
Asia 14045.37 2820.83 7492.26
Americas 6396.76 3269.33 4402.43
Oceania 6358.98 6459.10 8072.26
Europe 9355.21 8846.05 13248.30
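
Since the chunk behind this table is not shown, here is one way the three spread measures could be computed per group. It is a base-R sketch on made-up data; `toy`, `spread_stats` and `spread` are hypothetical names, and the original presumably used ddply() instead.

```r
# Toy data: group X contains one extreme value, group Y is evenly spread.
toy <- data.frame(
  continent = rep(c("X", "Y"), each = 5),
  gdpPercap = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50)
)

# Three measures of spread for one numeric vector.
spread_stats <- function(x) {
  c(SD = sd(x), MAD = mad(x), IQR = IQR(x))
}

# Apply per continent; sapply() returns measures in rows, so transpose.
spread <- t(sapply(split(toy$gdpPercap, toy$continent), spread_stats))

# Sort by median absolute deviation, as in the table above.
spread <- spread[order(spread[, "MAD"]), ]
round(spread, 2)
```

In the toy group X a single extreme value inflates SD while MAD and IQR barely move. That is one possible reading of a row such as Asia, where SD is several times larger than MAD.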

As we can see again, the continents do not follow the same order for standard deviation and interquartile range. However, at this stage, it is not quite clear to me what these statistics tell us.

Trimmed mean of life expectancy for different years

In this part, we want to check the average life expectancy around the world in each year of our study. Two further columns, the 12.5% and 25% (interquartile) trimmed means, are also shown in the following table:

# There are two choices for the trim fraction. This chunk might be nicer, and more
# practical, if we defined a function, I think.
TP1 <- 0.125
TP2 <- 0.25
AveLExPYear <- ddply(SourceDat, ~year, summarize, Average = mean(lifeExp), Dumm1 = mean(lifeExp, 
    trim = TP1), Dumm2 = mean(lifeExp, trim = TP2))
names(AveLExPYear)[3:4] <- paste0("Tmean", 100 * c(TP1, TP2))
AveLExPYear <- arrange(AveLExPYear, Average)
# It was already sorted, but we can do it just to make sure
print(xtable(AveLExPYear), type = "html", include.rownames = FALSE)
year Average Tmean12.5 Tmean25
1952 49.06 48.42 47.34
1957 51.51 51.16 50.28
1962 53.61 53.51 52.79
1967 55.68 55.85 55.43
1972 57.65 58.05 58.08
1977 59.57 60.17 60.47
1982 61.53 62.22 62.70
1987 63.21 64.07 64.77
1992 64.16 65.34 66.19
1997 65.01 66.21 67.25
2002 65.69 66.95 68.31
2007 67.01 68.34 69.69
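
The comment in the chunk above suggests wrapping the trimmed means in a function. A minimal sketch of that idea follows, in base R and on made-up data; `trimmed_means`, `toy` and `by_year` are hypothetical names, not part of the original analysis.

```r
# Toy data: eight made-up life expectancies for each of two years.
toy <- data.frame(
  year    = rep(c(1952, 2007), each = 8),
  lifeExp = c(30, 40, 45, 50, 52, 55, 60, 70,
              40, 60, 70, 72, 74, 76, 78, 80)
)

# Return the plain mean plus one trimmed mean per requested trim fraction.
# Column names follow the pattern used in the table above.
trimmed_means <- function(x, trims = c(0, 0.125, 0.25)) {
  out <- sapply(trims, function(tr) mean(x, trim = tr))
  names(out) <- ifelse(trims == 0, "Average", paste0("Tmean", 100 * trims))
  out
}

# One row per year, one column per trim fraction.
by_year <- t(sapply(split(toy$lifeExp, toy$year), trimmed_means))
round(by_year, 2)
```

Adding another trim fraction then only means extending the `trims` vector, with no column renaming afterwards.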

Interestingly, both the average and the trimmed mean life expectancy have increased monotonically over time around the world. However, this does not remove the need to investigate how positive factors (e.g. health science and general comfort) and negative factors (e.g. war and disease) correlate with life expectancy.

Also, as we can see here, the trimmed means show a more pronounced increase in life expectancy over time: from 1952 to 2007 the plain average rises by 17.95 years, the 12.5% trimmed mean by 19.92 and the 25% trimmed mean by 22.35. The more data are trimmed from the tails, the more noticeable the increase becomes.

Variation of life expectancy over time for different continents (tall and wide)

An easy way to acquire this data is to use the ddply() function with two grouping variables at once, namely continent and year.

AvelExPYearCont <- ddply(SourceDat, ~continent + year, summarize, Average = mean(lifeExp))

But, let's first take a look at the structure of the output data.frame using str() :

## 'data.frame':    60 obs. of  3 variables:
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ Average  : num  39.1 41.3 43.3 45.3 47.5 ...

As we can see here, we have a data.frame with 60 rows (5 continents × 12 years). Since the default result of a ddply() call is in tall format, it can be awkward to present as a regular table. Here is a table showing 10 random rows of this data.frame:

Dumm1 <- AvelExPYearCont[sample(nrow(AvelExPYearCont), 10), ]
print(xtable(Dumm1), type = "html", include.rownames = FALSE)
continent year Average
Asia 1997 68.02
Asia 2002 69.23
Asia 1962 51.56
Americas 1977 64.39
Americas 1962 58.40
Europe 1957 66.70
Oceania 1987 75.32
Africa 1972 47.45
Europe 1972 70.78
Asia 1977 59.61

However, this is apparently a common challenge, and people have already found a solution. Following the instructions from a discussion on Stack Overflow, we can transform our data.frame to the wide format using reshape(). The table below shows the average life expectancy for each continent in each year.

AvelExPYearContW <- reshape(AvelExPYearCont, idvar = "continent", timevar = "year", 
    direction = "wide")
names(AvelExPYearContW)[-1] <- unique(SourceDat$year)
print(xtable(AvelExPYearContW), type = "html", include.rownames = FALSE)
continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 39.14 41.27 43.32 45.33 47.45 49.58 51.59 53.34 53.63 53.60 53.33 54.81
Americas 53.28 55.96 58.40 60.41 62.39 64.39 66.23 68.09 69.57 71.15 72.42 73.61
Asia 46.31 49.32 51.56 54.66 57.32 59.61 62.62 64.85 66.54 68.02 69.23 70.73
Europe 64.41 66.70 68.54 69.74 70.78 71.94 72.81 73.64 74.44 75.51 76.70 77.65
Oceania 69.25 70.30 71.09 71.31 71.91 72.85 74.29 75.32 76.94 78.19 79.74 80.72
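
As a side note, a wide layout like the one above can also be produced directly, without going through the tall format first. This is only an alternative sketch in base R on made-up data, not the approach used in this report; `toy` and `wide` are hypothetical names.

```r
# Toy data: two continents, two years, made-up life expectancies.
toy <- data.frame(
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year      = c(2002, 2002, 2007, 2007, 2002, 2007),
  lifeExp   = c(50, 60, 52, 64, 70, 72)
)

# Grouping by two variables at once gives a matrix with
# continents in rows and years in columns.
wide <- with(toy, tapply(lifeExp, list(continent, year), mean))
wide
```

The result is a matrix rather than a data.frame, so the column names are already the years and no renaming step is needed.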

Interpretation of the results is much easier this way. We can see that the trend is upward for all five continents. Except for Africa in 1997 and 2002, where we can see a slight decrease, the average life expectancy in every continent has increased gradually over time.

Number of countries with low life expectancy over time by continent.

Now we want to get some sense of the number of countries in each continent that have a life expectancy lower than a certain “global” life expectancy index. To do this, we obviously need to define the global life expectancy first.
Given the increasing trend over time that we saw earlier, a single constant global life expectancy would not be a good idea. Therefore, we should have a global life expectancy for each year. One way to do this is to average the continent averages for each year in the AvelExPYearContW data.frame, which holds the average life expectancy for each continent and year of our study. However, this approach ignores the number of countries in each continent and gives a relatively higher global life expectancy, so it is better to average life expectancy over all countries together, as in the AveLExPYear data.frame. For the year 2007, compare 71.5021, obtained by averaging the continent averages, with 67.0074, taken from the AveLExPYear data.frame.

AveLExGlob <- round(AveLExPYear$Average, 2)
names(AveLExGlob) <- AveLExPYear$year
AveLExGlob
##  1952  1957  1962  1967  1972  1977  1982  1987  1992  1997  2002  2007 
## 49.06 51.51 53.61 55.68 57.65 59.57 61.53 63.21 64.16 65.01 65.69 67.01
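
The difference between averaging the continent averages and averaging over all countries together can be seen on a small example. The sketch below uses made-up numbers and hypothetical names (`toy`, `pooled`, `mean_of_means`); it only illustrates why the two values differ when group sizes are unequal.

```r
# Toy data: one large group with low values, one single-country group
# with a high value (made-up numbers).
toy <- data.frame(
  continent = c(rep("Big", 4), "Small"),
  lifeExp   = c(50, 52, 54, 56, 80)
)

# Average over all countries together: every country counts once.
pooled <- mean(toy$lifeExp)

# Average of the continent averages: every continent counts once,
# so the single country in "Small" gets as much weight as all of "Big".
continent_means <- tapply(toy$lifeExp, toy$continent, mean)
mean_of_means   <- mean(continent_means)

# Weighting each continent average by its number of countries
# recovers the pooled value.
reweighted <- weighted.mean(continent_means,
                            as.vector(table(toy$continent)))

c(pooled = pooled, mean_of_means = mean_of_means, reweighted = reweighted)
```

Here the unweighted mean of means comes out higher than the pooled mean, in the same direction as the 2007 comparison above, because the small group happens to have the high value.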

Now, based on this global life expectancy vector, we count the number of countries in each continent and year whose life expectancy is lower than the corresponding vector element. Since the initial tall result is not pleasant to show, we use the reshape() function once again.

LowLExCount <- ddply(SourceDat, ~continent + year, summarize, countries = length(lifeExp[lifeExp < 
    AveLExGlob[as.character(year)]]))
LowLExCount <- reshape(LowLExCount, direction = "wide", idvar = "continent", 
    timevar = "year")
names(LowLExCount)[-1] <- unique(SourceDat$year)
print(xtable(LowLExCount), type = "html", include.rownames = FALSE)
continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 50 50 50 50 50 49 48 46 46 45 45 45
Americas 9 9 8 6 6 7 7 5 3 2 2 2
Asia 22 20 19 18 18 14 12 12 11 10 10 10
Europe 1 1 1 1 1 1 1 1 0 0 0 0
Oceania 0 0 0 0 0 0 0 0 0 0 0 0
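
The counting step can be hard to follow inside a single ddply() call, so here is the same logic spelled out on made-up data. This is a base-R sketch with hypothetical names (`toy`, `benchmark`, `low_count`), not a replacement for the chunk above.

```r
# Toy data: three countries in two continents, two years (made-up values).
toy <- data.frame(
  country   = c("A", "B", "C", "A", "B", "C"),
  continent = c("X", "X", "Y", "X", "X", "Y"),
  year      = rep(c(2002, 2007), each = 3),
  lifeExp   = c(40, 60, 80, 45, 70, 80)
)

# Yearly benchmark: mean life expectancy over all countries in that year.
benchmark <- tapply(toy$lifeExp, toy$year, mean)

# Look up each row's benchmark by year (as a character key), then flag
# the rows that are strictly below it.
toy$below <- toy$lifeExp < as.vector(benchmark[as.character(toy$year)])

# Count the flagged countries per continent and year, in wide layout.
low_count <- with(toy, tapply(below, list(continent, year), sum))
low_count
```

Because the comparison is strict, a country sitting exactly on the yearly benchmark (country B in 2002 in the toy data) is not counted as low.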

Looking through this table, we realize that the Americas and Asia have been improving over the years: the number of countries in these continents with a low life expectancy, based on our definition, is decreasing. Africa persistently holds a large number of countries with low life expectancy, although the count has been slowly falling since 1977. Life expectancy in Oceania has been above the global value every year, and in Europe only a single country fell below it, and only until 1987.