In this homework we start to deal with data aggregation methods in R.
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
We have sorted the data for different continents based on the maximum GDP per capita in each continent. As we see, excepting Africa, the other four continents are sorted on their minimum GDP per capita from the highest to the lowest- with the reverse order of the maximum GDP per capita. This means roughly that the continents with the richest country have also the poorest one. In other words, the wealth gap between the richest and the poorest country among different countries in a continent increases with the level of wealth of the richest. This general rule has an exception, Africa, which not only has the poorest country, but also its wealthiest country is not as wealthy as those of other continents.
# Produce a data.frame with 3 variables: a continent factor, the minimum GDP
# per capita, the maximum GDP per capita.
MinMaxPPP <- ddply(gDat, ~continent, summarize, ContMinPPP = min(gdpPercap),
ContMaxPPP = max(gdpPercap))
# Sort on the maximum
MaxSortedPPP <- arrange(MinMaxPPP, ContMaxPPP)
MaxPPPTable <- xtable(MaxSortedPPP)
print(MaxPPPTable, type = "html", include.rownames = FALSE)
| continent | ContMinPPP | ContMaxPPP |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Oceania | 10039.60 | 34435.37 |
| Americas | 1201.64 | 42951.65 |
| Europe | 973.53 | 49357.19 |
| Asia | 331.00 | 113523.13 |
We have used a function which returns for each continent, the minimum life expectancy as well as the maximum life expectancy. At last, everything is reported in a table.
# Get the maximum and minimum of Life Expectancy for all continents in a
# 'tall' format.
# this function for each input makes a data frame of two rows, one
# containing the maximum life expectancy and the other has the minimum
# (among countries in a continent)
MinMaxLife <- function(x) {
MinLife <- min(x$lifeExp)
MaxLife <- max(x$lifeExp)
staname <- c("Minimum Life Expectancy", "Maximum Life Expectancy")
val <- c(MinLife, MaxLife)
fdat <- data.frame(staname, val)
return(fdat)
}
MinMaxLifeData <- ddply(gDat, ~continent, MinMaxLife)
MinMaxLifeTable <- xtable(MinMaxLifeData)
print(MinMaxLifeTable, type = "html", include.rownames = FALSE)
| continent | staname | val |
|---|---|---|
| Africa | Minimum Life Expectancy | 23.60 |
| Africa | Maximum Life Expectancy | 76.44 |
| Americas | Minimum Life Expectancy | 37.58 |
| Americas | Maximum Life Expectancy | 80.65 |
| Asia | Minimum Life Expectancy | 28.80 |
| Asia | Maximum Life Expectancy | 82.60 |
| Europe | Minimum Life Expectancy | 43.59 |
| Europe | Maximum Life Expectancy | 81.76 |
| Oceania | Minimum Life Expectancy | 69.12 |
| Oceania | Maximum Life Expectancy | 81.23 |
We have produced a data.frame with one row per continent and variables containing the different measures of spread for GDP per capita. Then we sorted on Median Absolute Deviation which we prefer due to our belief in its robustness. Ignoring Asia for a minute, we can observe that, the data based on other two measures is also fairly sorted. About Asia, we would rather say that it has the second least Median Absolute Deviation. On the other hand, it has the highest standard deviation. This implies that, in Asia, countries with extreme levels of GDP per capita outlier countries- have extremely high weights in their GDP levels, which influence standard deviation a lot but since are they include small number of countries, their impact on Median Absolute Deviation is not that much. This is somehow the case in Americas as well. This partly shows that, sometimes, it is not reasonable to trust on standard deviation as a measure of the spread of the data.
# Spread of GDP per capita within the continents
gdpSpread <- ddply(gDat, ~continent, summarize, StdDev = sd(gdpPercap), MedAbsDev = mad(gdpPercap),
InQuartRange = IQR(gdpPercap))
SortedgdpSpread <- arrange(gdpSpread, MedAbsDev)
SortedSpreadTable <- xtable(SortedgdpSpread)
print(SortedSpreadTable, type = "html", include.rownames = FALSE)
| continent | StdDev | MedAbsDev | InQuartRange |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Oceania | 6358.98 | 6459.10 | 8072.26 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
In this part we see the trimmed mean of life expectancy among countries in different years. We have chosen a trim fraction of 0.2 (or 20%). The results are as follows:
## Computing a trimmed mean (with 20% trim level) of life expectancy for
## different years
TrimLevel <- 0.2
TrimMeanLifeExp <- ddply(gDat, ~year, summarize, meanlifeExp = mean(lifeExp,
TrimLevel))
names(TrimMeanLifeExp) <- c("Year", "Trimmed Mean- level: 20%")
TrimMeanTable <- xtable(TrimMeanLifeExp)
print(TrimMeanTable, type = "html", include.rownames = FALSE)
| Year | Trimmed Mean- level: 20% |
|---|---|
| 1952 | 47.75 |
| 1957 | 50.64 |
| 1962 | 53.13 |
| 1967 | 55.64 |
| 1972 | 58.12 |
| 1977 | 60.39 |
| 1982 | 62.47 |
| 1987 | 64.48 |
| 1992 | 65.89 |
| 1997 | 66.84 |
| 2002 | 67.77 |
| 2007 | 69.17 |
We have produced a data.frame with 3 variables: continent, year, and median life expectancy. In general we observe an increase over time in median life expectancy in all of the continents. But it is not too straightforward to compare the rate of this increase across different continents. I think a graph will work way better than this table.
## The trend of life expectancy median over time in different continents?
MedLifeExp <- ddply(gDat, ~continent + year, summarize, ContMedlifeExp = median(lifeExp))
names(MedLifeExp) <- c("continent", "year", "MedianLifeExpectancy")
MedLifeExpTable <- xtable(MedLifeExp)
print(MedLifeExpTable, type = "html", include.rownames = FALSE)
| continent | year | MedianLifeExpectancy |
|---|---|---|
| Africa | 1952 | 38.83 |
| Africa | 1957 | 40.59 |
| Africa | 1962 | 42.63 |
| Africa | 1967 | 44.70 |
| Africa | 1972 | 47.03 |
| Africa | 1977 | 49.27 |
| Africa | 1982 | 50.76 |
| Africa | 1987 | 51.64 |
| Africa | 1992 | 52.43 |
| Africa | 1997 | 52.76 |
| Africa | 2002 | 51.24 |
| Africa | 2007 | 52.93 |
| Americas | 1952 | 54.74 |
| Americas | 1957 | 56.07 |
| Americas | 1962 | 58.30 |
| Americas | 1967 | 60.52 |
| Americas | 1972 | 63.44 |
| Americas | 1977 | 66.35 |
| Americas | 1982 | 67.41 |
| Americas | 1987 | 69.50 |
| Americas | 1992 | 69.86 |
| Americas | 1997 | 72.15 |
| Americas | 2002 | 72.05 |
| Americas | 2007 | 72.90 |
| Asia | 1952 | 44.87 |
| Asia | 1957 | 48.28 |
| Asia | 1962 | 49.33 |
| Asia | 1967 | 53.66 |
| Asia | 1972 | 56.95 |
| Asia | 1977 | 60.77 |
| Asia | 1982 | 63.74 |
| Asia | 1987 | 66.30 |
| Asia | 1992 | 68.69 |
| Asia | 1997 | 70.27 |
| Asia | 2002 | 71.03 |
| Asia | 2007 | 72.40 |
| Europe | 1952 | 65.90 |
| Europe | 1957 | 67.65 |
| Europe | 1962 | 69.53 |
| Europe | 1967 | 70.61 |
| Europe | 1972 | 70.89 |
| Europe | 1977 | 72.34 |
| Europe | 1982 | 73.49 |
| Europe | 1987 | 74.81 |
| Europe | 1992 | 75.45 |
| Europe | 1997 | 76.12 |
| Europe | 2002 | 77.54 |
| Europe | 2007 | 78.61 |
| Oceania | 1952 | 69.25 |
| Oceania | 1957 | 70.30 |
| Oceania | 1962 | 71.09 |
| Oceania | 1967 | 71.31 |
| Oceania | 1972 | 71.91 |
| Oceania | 1977 | 72.85 |
| Oceania | 1982 | 74.29 |
| Oceania | 1987 | 75.32 |
| Oceania | 1992 | 76.94 |
| Oceania | 1997 | 78.19 |
| Oceania | 2002 | 79.74 |
| Oceania | 2007 | 80.72 |
However, in order to get a better insight into the rate of this increase, I have added another table which shows the slope and intercept of a linear regression of the median of life expectancy based on an adjusted year, for different continents. As we see, Asia has the most rapid increase in life expectancy, even though it has started with a relatively small life expectancy at 1952.
# Finding the Trend of change of the median life expectancy over time across
# continents
yearMin <- 1952
RegLifeYear <- function(x) {
RegCoefs <- coef(lm(MedianLifeExpectancy ~ I(year - yearMin), x))
names(RegCoefs) <- c("intercept", "slope")
return(RegCoefs)
}
LifeYearTrend <- ddply(MedLifeExp, ~continent, RegLifeYear)
LifeYearTrendTable <- xtable(LifeYearTrend)
print(LifeYearTrendTable, type = "html", include.rownames = FALSE)
| continent | intercept | slope |
|---|---|---|
| Africa | 40.68 | 0.26 |
| Americas | 55.54 | 0.35 |
| Asia | 45.90 | 0.53 |
| Europe | 66.74 | 0.22 |
| Oceania | 68.54 | 0.21 |
In this part I got a data.frame with 1 + 5 = 6 variables: year and then one variable per continent, giving median life expectancy. Compared to the last part, this seems to be more compact and insightful. We have used the potentials of daply() in this part.
## The same Task above using daply (data.frame with 1+5 variables)
MedLifeExp2 <- daply(gDat, ~year + continent, summarize, median(lifeExp))
MedLifeExpTable2 <- xtable(MedLifeExp2)
print(MedLifeExpTable2, type = "html")
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 38.83 | 54.74 | 44.87 | 65.90 | 69.25 |
| 1957 | 40.59 | 56.07 | 48.28 | 67.65 | 70.30 |
| 1962 | 42.63 | 58.30 | 49.33 | 69.53 | 71.09 |
| 1967 | 44.70 | 60.52 | 53.66 | 70.61 | 71.31 |
| 1972 | 47.03 | 63.44 | 56.95 | 70.89 | 71.91 |
| 1977 | 49.27 | 66.35 | 60.77 | 72.34 | 72.85 |
| 1982 | 50.76 | 67.41 | 63.74 | 73.49 | 74.29 |
| 1987 | 51.64 | 69.50 | 66.30 | 74.81 | 75.32 |
| 1992 | 52.43 | 69.86 | 68.69 | 75.45 | 76.94 |
| 1997 | 52.76 | 72.15 | 70.27 | 76.12 | 78.19 |
| 2002 | 51.24 | 72.05 | 71.03 | 77.54 | 79.74 |
| 2007 | 52.93 | 72.90 | 72.40 | 78.61 | 80.72 |
I have computed the average of worldwide life expectancy (about 59.5) . Then I determined how many countries on each continent have a life expectancy less than this benchmark, for each year. I produced a data.frame with 3 variables: continent, year, and a country count. We see an overall decrease in the number of low life expectancy over time across different continents. For some continents, such as Asia, this decrease occurs with a higher rate, while for some others such as Europe it is relatively slow. It is not fun to assess this from this tall table, at all! Maybe a graph or a linear regression might be more meaningful.
## Counting the number of countries with low life expectancy over time by
## continent
Benchmark <- mean(gDat$lifeExp) ## benchmark is the average of Life Expectancy over time
LowLifeCount <- ddply(gDat, ~continent + year, summarize, LowCount = sum(lifeExp <
Benchmark))
LowLifeCountTable <- xtable(LowLifeCount)
print(LowLifeCountTable, type = "html", include.rownames = FALSE)
| continent | year | LowCount |
|---|---|---|
| Africa | 1952 | 52 |
| Africa | 1957 | 52 |
| Africa | 1962 | 51 |
| Africa | 1967 | 50 |
| Africa | 1972 | 50 |
| Africa | 1977 | 49 |
| Africa | 1982 | 43 |
| Africa | 1987 | 39 |
| Africa | 1992 | 38 |
| Africa | 1997 | 39 |
| Africa | 2002 | 41 |
| Africa | 2007 | 40 |
| Americas | 1952 | 19 |
| Americas | 1957 | 15 |
| Americas | 1962 | 13 |
| Americas | 1967 | 10 |
| Americas | 1972 | 8 |
| Americas | 1977 | 7 |
| Americas | 1982 | 5 |
| Americas | 1987 | 2 |
| Americas | 1992 | 1 |
| Americas | 1997 | 1 |
| Americas | 2002 | 1 |
| Americas | 2007 | 0 |
| Asia | 1952 | 29 |
| Asia | 1957 | 26 |
| Asia | 1962 | 25 |
| Asia | 1967 | 23 |
| Asia | 1972 | 19 |
| Asia | 1977 | 14 |
| Asia | 1982 | 11 |
| Asia | 1987 | 8 |
| Asia | 1992 | 7 |
| Asia | 1997 | 6 |
| Asia | 2002 | 3 |
| Asia | 2007 | 1 |
| Europe | 1952 | 5 |
| Europe | 1957 | 3 |
| Europe | 1962 | 1 |
| Europe | 1967 | 1 |
| Europe | 1972 | 1 |
| Europe | 1977 | 0 |
| Europe | 1982 | 0 |
| Europe | 1987 | 0 |
| Europe | 1992 | 0 |
| Europe | 1997 | 0 |
| Europe | 2002 | 0 |
| Europe | 2007 | 0 |
| Oceania | 1952 | 0 |
| Oceania | 1957 | 0 |
| Oceania | 1962 | 0 |
| Oceania | 1967 | 0 |
| Oceania | 1972 | 0 |
| Oceania | 1977 | 0 |
| Oceania | 1982 | 0 |
| Oceania | 1987 | 0 |
| Oceania | 1992 | 0 |
| Oceania | 1997 | 0 |
| Oceania | 2002 | 0 |
| Oceania | 2007 | 0 |
The continents have very different numbers of countries. First, I counted the number of countries in each continent simply-counting the number of countries with a life expectancy more than zero in each continent. Then the ratio of the low life expectancy countries has been computed. To be more precise I have added another column showing the low life expectancy count besides the ratio. The ratios are more meaningful if we want to compare continents based on some measures describing their growth in life expectancy. However ratios do not tell everything. Sometimes the number of countries that have improved their life expectancy is important for us. So I think adding one column does not cost a lot but gives more comprehensive insight into the data.
## Computing the proportion of countries with low life expectancy over time
## by continent.
Benchmark <- mean(gDat$lifeExp) ## benchmark is the average of Life Expectancy over time
LowLifeRatio <- ddply(gDat, ~continent + year, summarize, LowRatio = sum(lifeExp <
Benchmark)/sum(lifeExp > 0), LowCount = sum(lifeExp < Benchmark))
names(LowLifeRatio) <- c("continent", "year", "Low Life Expectancy Ratio", "Low Life Expectancy Count")
LowLifeRatioTable <- xtable(LowLifeRatio)
print(LowLifeRatioTable, type = "html", include.rownames = FALSE)
| continent | year | Low Life Expectancy Ratio | Low Life Expectancy Count |
|---|---|---|---|
| Africa | 1952 | 1.00 | 52 |
| Africa | 1957 | 1.00 | 52 |
| Africa | 1962 | 0.98 | 51 |
| Africa | 1967 | 0.96 | 50 |
| Africa | 1972 | 0.96 | 50 |
| Africa | 1977 | 0.94 | 49 |
| Africa | 1982 | 0.83 | 43 |
| Africa | 1987 | 0.75 | 39 |
| Africa | 1992 | 0.73 | 38 |
| Africa | 1997 | 0.75 | 39 |
| Africa | 2002 | 0.79 | 41 |
| Africa | 2007 | 0.77 | 40 |
| Americas | 1952 | 0.76 | 19 |
| Americas | 1957 | 0.60 | 15 |
| Americas | 1962 | 0.52 | 13 |
| Americas | 1967 | 0.40 | 10 |
| Americas | 1972 | 0.32 | 8 |
| Americas | 1977 | 0.28 | 7 |
| Americas | 1982 | 0.20 | 5 |
| Americas | 1987 | 0.08 | 2 |
| Americas | 1992 | 0.04 | 1 |
| Americas | 1997 | 0.04 | 1 |
| Americas | 2002 | 0.04 | 1 |
| Americas | 2007 | 0.00 | 0 |
| Asia | 1952 | 0.88 | 29 |
| Asia | 1957 | 0.79 | 26 |
| Asia | 1962 | 0.76 | 25 |
| Asia | 1967 | 0.70 | 23 |
| Asia | 1972 | 0.58 | 19 |
| Asia | 1977 | 0.42 | 14 |
| Asia | 1982 | 0.33 | 11 |
| Asia | 1987 | 0.24 | 8 |
| Asia | 1992 | 0.21 | 7 |
| Asia | 1997 | 0.18 | 6 |
| Asia | 2002 | 0.09 | 3 |
| Asia | 2007 | 0.03 | 1 |
| Europe | 1952 | 0.17 | 5 |
| Europe | 1957 | 0.10 | 3 |
| Europe | 1962 | 0.03 | 1 |
| Europe | 1967 | 0.03 | 1 |
| Europe | 1972 | 0.03 | 1 |
| Europe | 1977 | 0.00 | 0 |
| Europe | 1982 | 0.00 | 0 |
| Europe | 1987 | 0.00 | 0 |
| Europe | 1992 | 0.00 | 0 |
| Europe | 1997 | 0.00 | 0 |
| Europe | 2002 | 0.00 | 0 |
| Europe | 2007 | 0.00 | 0 |
| Oceania | 1952 | 0.00 | 0 |
| Oceania | 1957 | 0.00 | 0 |
| Oceania | 1962 | 0.00 | 0 |
| Oceania | 1967 | 0.00 | 0 |
| Oceania | 1972 | 0.00 | 0 |
| Oceania | 1977 | 0.00 | 0 |
| Oceania | 1982 | 0.00 | 0 |
| Oceania | 1987 | 0.00 | 0 |
| Oceania | 1992 | 0.00 | 0 |
| Oceania | 1997 | 0.00 | 0 |
| Oceania | 2002 | 0.00 | 0 |
| Oceania | 2007 | 0.00 | 0 |
In this part I have created a table with one row per year and one column per continent. In that column, I have reported the relative abundance of “low life expectancy” countries. I have took advantage of daply() with one line of programming and a more compact table than that of last part.
## An extension on Computing the proportion of countries with low life
## expectancy over time by continent.
Benchmark <- mean(gDat$lifeExp) ## benchmark is the average of Life Expectancy over time
LowLifeRatio2 <- daply(gDat, ~continent + year, summarize, LowRatio = sum(lifeExp <
Benchmark)/sum(lifeExp > 0))
LowLifeRatioTable2 <- xtable(LowLifeRatio2)
print(LowLifeRatioTable2, type = "html")
| 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 1.00 | 1.00 | 0.98 | 0.96 | 0.96 | 0.94 | 0.83 | 0.75 | 0.73 | 0.75 | 0.79 | 0.77 |
| Americas | 0.76 | 0.60 | 0.52 | 0.40 | 0.32 | 0.28 | 0.20 | 0.08 | 0.04 | 0.04 | 0.04 | 0.00 |
| Asia | 0.88 | 0.79 | 0.76 | 0.70 | 0.58 | 0.42 | 0.33 | 0.24 | 0.21 | 0.18 | 0.09 | 0.03 |
| Europe | 0.17 | 0.10 | 0.03 | 0.03 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Oceania | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |