Homework #3 Data Aggregation

In this homework we start to deal with data aggregation methods in R.

Loading Data File

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

library(plyr)
library(xtable)

Maximum and Minimum of GDP Per Capita for All Continents

We have sorted the data for different continents based on the maximum GDP per capita in each continent. As we see, excepting Africa, the other four continents are sorted on their minimum GDP per capita from the highest to the lowest- with the reverse order of the maximum GDP per capita. This means roughly that the continents with the richest country have also the poorest one. In other words, the wealth gap between the richest and the poorest country among different countries in a continent increases with the level of wealth of the richest. This general rule has an exception, Africa, which not only has the poorest country, but also its wealthiest country is not as wealthy as those of other continents.


# Produce a data.frame with 3 variables: a continent factor, the minimum GDP
# per capita, the maximum GDP per capita.
MinMaxPPP <- ddply(gDat, ~continent, summarize, ContMinPPP = min(gdpPercap), 
    ContMaxPPP = max(gdpPercap))
# Sort on the maximum
MaxSortedPPP <- arrange(MinMaxPPP, ContMaxPPP)
MaxPPPTable <- xtable(MaxSortedPPP)
print(MaxPPPTable, type = "html", include.rownames = FALSE)
continent ContMinPPP ContMaxPPP
Africa 241.17 21951.21
Oceania 10039.60 34435.37
Americas 1201.64 42951.65
Europe 973.53 49357.19
Asia 331.00 113523.13

Maximum and Minimum of GDP Per Capita for All Continents in a “Tall” Format

We have used a function which returns for each continent, the minimum life expectancy as well as the maximum life expectancy. At last, everything is reported in a table.


# Get the maximum and minimum of Life Expectancy for all continents in a
# 'tall' format.

# this function for each input makes a data frame of two rows, one
# containing the maximum life expectancy and the other has the minimum
# (among countries in a continent)
MinMaxLife <- function(x) {
    MinLife <- min(x$lifeExp)
    MaxLife <- max(x$lifeExp)
    staname <- c("Minimum Life Expectancy", "Maximum Life Expectancy")
    val <- c(MinLife, MaxLife)
    fdat <- data.frame(staname, val)
    return(fdat)
}
MinMaxLifeData <- ddply(gDat, ~continent, MinMaxLife)
MinMaxLifeTable <- xtable(MinMaxLifeData)
print(MinMaxLifeTable, type = "html", include.rownames = FALSE)
continent staname val
Africa Minimum Life Expectancy 23.60
Africa Maximum Life Expectancy 76.44
Americas Minimum Life Expectancy 37.58
Americas Maximum Life Expectancy 80.65
Asia Minimum Life Expectancy 28.80
Asia Maximum Life Expectancy 82.60
Europe Minimum Life Expectancy 43.59
Europe Maximum Life Expectancy 81.76
Oceania Minimum Life Expectancy 69.12
Oceania Maximum Life Expectancy 81.23

Spread of GDP Per Capita Within the Continents

We have produced a data.frame with one row per continent and variables containing the different measures of spread for GDP per capita. Then we sorted on Median Absolute Deviation –which we prefer due to our belief in its robustness. Ignoring Asia for a minute, we can observe that, the data based on other two measures is also fairly sorted. About Asia, we would rather say that it has the second least Median Absolute Deviation. On the other hand, it has the highest standard deviation. This implies that, in Asia, countries with extreme levels of GDP per capita –outlier countries- have extremely high weights in their GDP levels, which influence standard deviation a lot but since are they include small number of countries, their impact on Median Absolute Deviation is not that much. This is somehow the case in Americas as well. This partly shows that, sometimes, it is not reasonable to trust on standard deviation as a measure of the spread of the data.

# Spread of GDP per capita within the continents
gdpSpread <- ddply(gDat, ~continent, summarize, StdDev = sd(gdpPercap), MedAbsDev = mad(gdpPercap), 
    InQuartRange = IQR(gdpPercap))
SortedgdpSpread <- arrange(gdpSpread, MedAbsDev)
SortedSpreadTable <- xtable(SortedgdpSpread)
print(SortedSpreadTable, type = "html", include.rownames = FALSE)
continent StdDev MedAbsDev InQuartRange
Africa 2827.93 775.32 1616.17
Asia 14045.37 2820.83 7492.26
Americas 6396.76 3269.33 4402.43
Oceania 6358.98 6459.10 8072.26
Europe 9355.21 8846.05 13248.30

Trimmed Mean of Life Expectancy for Different Years

In this part we see the trimmed mean of life expectancy among countries in different years. We have chosen a trim fraction of 0.2 (or 20%). The results are as follows:

## Computing a trimmed mean (with 20% trim level) of life expectancy for
## different years
TrimLevel <- 0.2
TrimMeanLifeExp <- ddply(gDat, ~year, summarize, meanlifeExp = mean(lifeExp, 
    TrimLevel))
names(TrimMeanLifeExp) <- c("Year", "Trimmed Mean- level: 20%")
TrimMeanTable <- xtable(TrimMeanLifeExp)
print(TrimMeanTable, type = "html", include.rownames = FALSE)
Year Trimmed Mean- level: 20%
1952 47.75
1957 50.64
1962 53.13
1967 55.64
1972 58.12
1977 60.39
1982 62.47
1987 64.48
1992 65.89
1997 66.84
2002 67.77
2007 69.17

The Trend of Change in Life Expectancy Over Time on Different Continents

We have produced a data.frame with 3 variables: continent, year, and median life expectancy. In general we observe an increase over time in median life expectancy in all of the continents. But it is not too straightforward to compare the rate of this increase across different continents. I think a graph will work way better than this table.

## The trend of life expectancy median over time in different continents?
MedLifeExp <- ddply(gDat, ~continent + year, summarize, ContMedlifeExp = median(lifeExp))
names(MedLifeExp) <- c("continent", "year", "MedianLifeExpectancy")
MedLifeExpTable <- xtable(MedLifeExp)
print(MedLifeExpTable, type = "html", include.rownames = FALSE)
continent year MedianLifeExpectancy
Africa 1952 38.83
Africa 1957 40.59
Africa 1962 42.63
Africa 1967 44.70
Africa 1972 47.03
Africa 1977 49.27
Africa 1982 50.76
Africa 1987 51.64
Africa 1992 52.43
Africa 1997 52.76
Africa 2002 51.24
Africa 2007 52.93
Americas 1952 54.74
Americas 1957 56.07
Americas 1962 58.30
Americas 1967 60.52
Americas 1972 63.44
Americas 1977 66.35
Americas 1982 67.41
Americas 1987 69.50
Americas 1992 69.86
Americas 1997 72.15
Americas 2002 72.05
Americas 2007 72.90
Asia 1952 44.87
Asia 1957 48.28
Asia 1962 49.33
Asia 1967 53.66
Asia 1972 56.95
Asia 1977 60.77
Asia 1982 63.74
Asia 1987 66.30
Asia 1992 68.69
Asia 1997 70.27
Asia 2002 71.03
Asia 2007 72.40
Europe 1952 65.90
Europe 1957 67.65
Europe 1962 69.53
Europe 1967 70.61
Europe 1972 70.89
Europe 1977 72.34
Europe 1982 73.49
Europe 1987 74.81
Europe 1992 75.45
Europe 1997 76.12
Europe 2002 77.54
Europe 2007 78.61
Oceania 1952 69.25
Oceania 1957 70.30
Oceania 1962 71.09
Oceania 1967 71.31
Oceania 1972 71.91
Oceania 1977 72.85
Oceania 1982 74.29
Oceania 1987 75.32
Oceania 1992 76.94
Oceania 1997 78.19
Oceania 2002 79.74
Oceania 2007 80.72

However, in order to get a better insight into the rate of this increase, I have added another table which shows the slope and intercept of a linear regression of the median of life expectancy based on an adjusted year, for different continents. As we see, Asia has the most rapid increase in life expectancy, even though it has started with a relatively small life expectancy at 1952.

# Finding the Trend of change of the median life expectancy over time across
# continents
yearMin <- 1952
RegLifeYear <- function(x) {
    RegCoefs <- coef(lm(MedianLifeExpectancy ~ I(year - yearMin), x))
    names(RegCoefs) <- c("intercept", "slope")
    return(RegCoefs)
}

LifeYearTrend <- ddply(MedLifeExp, ~continent, RegLifeYear)
LifeYearTrendTable <- xtable(LifeYearTrend)
print(LifeYearTrendTable, type = "html", include.rownames = FALSE)
continent intercept slope
Africa 40.68 0.26
Americas 55.54 0.35
Asia 45.90 0.53
Europe 66.74 0.22
Oceania 68.54 0.21

An Extension : The Trend of Change in Life Expectancy Over Time

In this part I got a data.frame with 1 + 5 = 6 variables: year and then one variable per continent, giving median life expectancy. Compared to the last part, this seems to be more compact and insightful. We have used the potentials of daply() in this part.

## The same Task above using daply (data.frame with 1+5 variables)
MedLifeExp2 <- daply(gDat, ~year + continent, summarize, median(lifeExp))
MedLifeExpTable2 <- xtable(MedLifeExp2)
print(MedLifeExpTable2, type = "html")
Africa Americas Asia Europe Oceania
1952 38.83 54.74 44.87 65.90 69.25
1957 40.59 56.07 48.28 67.65 70.30
1962 42.63 58.30 49.33 69.53 71.09
1967 44.70 60.52 53.66 70.61 71.31
1972 47.03 63.44 56.95 70.89 71.91
1977 49.27 66.35 60.77 72.34 72.85
1982 50.76 67.41 63.74 73.49 74.29
1987 51.64 69.50 66.30 74.81 75.32
1992 52.43 69.86 68.69 75.45 76.94
1997 52.76 72.15 70.27 76.12 78.19
2002 51.24 72.05 71.03 77.54 79.74
2007 52.93 72.90 72.40 78.61 80.72

The Number of Countries with Low Life Expectancy Over Time by Continent.

I have computed the average of worldwide life expectancy (about 59.5) . Then I determined how many countries on each continent have a life expectancy less than this benchmark, for each year. I produced a data.frame with 3 variables: continent, year, and a country count. We see an overall decrease in the number of low life expectancy over time across different continents. For some continents, such as Asia, this decrease occurs with a higher rate, while for some others such as Europe it is relatively slow. It is not fun to assess this from this tall table, at all! Maybe a graph or a linear regression might be more meaningful.


## Counting the number of countries with low life expectancy over time by
## continent
Benchmark <- mean(gDat$lifeExp)  ## benchmark is the average of Life Expectancy over time
LowLifeCount <- ddply(gDat, ~continent + year, summarize, LowCount = sum(lifeExp < 
    Benchmark))
LowLifeCountTable <- xtable(LowLifeCount)
print(LowLifeCountTable, type = "html", include.rownames = FALSE)
continent year LowCount
Africa 1952 52
Africa 1957 52
Africa 1962 51
Africa 1967 50
Africa 1972 50
Africa 1977 49
Africa 1982 43
Africa 1987 39
Africa 1992 38
Africa 1997 39
Africa 2002 41
Africa 2007 40
Americas 1952 19
Americas 1957 15
Americas 1962 13
Americas 1967 10
Americas 1972 8
Americas 1977 7
Americas 1982 5
Americas 1987 2
Americas 1992 1
Americas 1997 1
Americas 2002 1
Americas 2007 0
Asia 1952 29
Asia 1957 26
Asia 1962 25
Asia 1967 23
Asia 1972 19
Asia 1977 14
Asia 1982 11
Asia 1987 8
Asia 1992 7
Asia 1997 6
Asia 2002 3
Asia 2007 1
Europe 1952 5
Europe 1957 3
Europe 1962 1
Europe 1967 1
Europe 1972 1
Europe 1977 0
Europe 1982 0
Europe 1987 0
Europe 1992 0
Europe 1997 0
Europe 2002 0
Europe 2007 0
Oceania 1952 0
Oceania 1957 0
Oceania 1962 0
Oceania 1967 0
Oceania 1972 0
Oceania 1977 0
Oceania 1982 0
Oceania 1987 0
Oceania 1992 0
Oceania 1997 0
Oceania 2002 0
Oceania 2007 0

Proportion of Countries with Low Life Expectancy Over Time by Continent.

The continents have very different numbers of countries. First, I counted the number of countries in each continent simply-counting the number of countries with a life expectancy more than zero in each continent. Then the ratio of the low life expectancy countries has been computed. To be more precise I have added another column showing the low life expectancy count besides the ratio. The ratios are more meaningful if we want to compare continents based on some measures describing their growth in life expectancy. However ratios do not tell everything. Sometimes the number of countries that have improved their life expectancy is important for us. So I think adding one column does not cost a lot but gives more comprehensive insight into the data.



## Computing the proportion of countries with low life expectancy over time
## by continent.
Benchmark <- mean(gDat$lifeExp)  ## benchmark is the average of Life Expectancy over time
LowLifeRatio <- ddply(gDat, ~continent + year, summarize, LowRatio = sum(lifeExp < 
    Benchmark)/sum(lifeExp > 0), LowCount = sum(lifeExp < Benchmark))
names(LowLifeRatio) <- c("continent", "year", "Low Life Expectancy Ratio", "Low Life Expectancy Count")
LowLifeRatioTable <- xtable(LowLifeRatio)
print(LowLifeRatioTable, type = "html", include.rownames = FALSE)
continent year Low Life Expectancy Ratio Low Life Expectancy Count
Africa 1952 1.00 52
Africa 1957 1.00 52
Africa 1962 0.98 51
Africa 1967 0.96 50
Africa 1972 0.96 50
Africa 1977 0.94 49
Africa 1982 0.83 43
Africa 1987 0.75 39
Africa 1992 0.73 38
Africa 1997 0.75 39
Africa 2002 0.79 41
Africa 2007 0.77 40
Americas 1952 0.76 19
Americas 1957 0.60 15
Americas 1962 0.52 13
Americas 1967 0.40 10
Americas 1972 0.32 8
Americas 1977 0.28 7
Americas 1982 0.20 5
Americas 1987 0.08 2
Americas 1992 0.04 1
Americas 1997 0.04 1
Americas 2002 0.04 1
Americas 2007 0.00 0
Asia 1952 0.88 29
Asia 1957 0.79 26
Asia 1962 0.76 25
Asia 1967 0.70 23
Asia 1972 0.58 19
Asia 1977 0.42 14
Asia 1982 0.33 11
Asia 1987 0.24 8
Asia 1992 0.21 7
Asia 1997 0.18 6
Asia 2002 0.09 3
Asia 2007 0.03 1
Europe 1952 0.17 5
Europe 1957 0.10 3
Europe 1962 0.03 1
Europe 1967 0.03 1
Europe 1972 0.03 1
Europe 1977 0.00 0
Europe 1982 0.00 0
Europe 1987 0.00 0
Europe 1992 0.00 0
Europe 1997 0.00 0
Europe 2002 0.00 0
Europe 2007 0.00 0
Oceania 1952 0.00 0
Oceania 1957 0.00 0
Oceania 1962 0.00 0
Oceania 1967 0.00 0
Oceania 1972 0.00 0
Oceania 1977 0.00 0
Oceania 1982 0.00 0
Oceania 1987 0.00 0
Oceania 1992 0.00 0
Oceania 1997 0.00 0
Oceania 2002 0.00 0
Oceania 2007 0.00 0

Relative Abundance of Countries with Low Life Expectancy Over Time by Continent

In this part I have created a table with one row per year and one column per continent. In that column, I have reported the relative abundance of “low life expectancy” countries. I have took advantage of daply() with one line of programming and a more compact table than that of last part.


## An extension on Computing the proportion of countries with low life
## expectancy over time by continent.
Benchmark <- mean(gDat$lifeExp)  ## benchmark is the average of Life Expectancy over time
LowLifeRatio2 <- daply(gDat, ~continent + year, summarize, LowRatio = sum(lifeExp < 
    Benchmark)/sum(lifeExp > 0))
LowLifeRatioTable2 <- xtable(LowLifeRatio2)
print(LowLifeRatioTable2, type = "html")
1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Africa 1.00 1.00 0.98 0.96 0.96 0.94 0.83 0.75 0.73 0.75 0.79 0.77
Americas 0.76 0.60 0.52 0.40 0.32 0.28 0.20 0.08 0.04 0.04 0.04 0.00
Asia 0.88 0.79 0.76 0.70 0.58 0.42 0.33 0.24 0.21 0.18 0.09 0.03
Europe 0.17 0.10 0.03 0.03 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Oceania 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00