This homeork assignment is an excercise in data aggregation with a focus on writing code that is clean and readable.
# read from internet gDat <-
# read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')
# read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw03", "GapMinder"),
"gapminderDataFiveYear.txt"))
I begin with the “harder” sections. I hope this saves the TA some time in marking.
We compare the mean life expectancy by year to the trimmed mean (20% from each end).
mean.lifeExp_by_year <- arrange(ddply(gDat, ~year, summarize, mean = mean(lifeExp),
trimmed.mean = mean(lifeExp, trim = 0.2)), year)
print(xtable(mean.lifeExp_by_year, digits = 2), type = "html", include.rownames = F)
| year | mean | trimmed.mean |
|---|---|---|
| 1952 | 49.06 | 47.75 |
| 1957 | 51.51 | 50.64 |
| 1962 | 53.61 | 53.13 |
| 1967 | 55.68 | 55.64 |
| 1972 | 57.65 | 58.12 |
| 1977 | 59.57 | 60.39 |
| 1982 | 61.53 | 62.47 |
| 1987 | 63.21 | 64.48 |
| 1992 | 64.16 | 65.89 |
| 1997 | 65.01 | 66.84 |
| 2002 | 65.69 | 67.77 |
| 2007 | 67.01 | 69.17 |
Curiously, the mean value of earlier years are higher than the trimmed mean, while the reverse is true for the later years…
### performing the task with 1 line of code
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize,
mean = mean(lifeExp), median = median(lifeExp)) #note multiple stats can be computed with one call!
This is an odd excercise, that involves essentially “nested” daply() calls. The “inner” call takes data frame for a given year, returns a “array/matrix” of 2 dimensions: 1st dimension the factor levels you split by, and 2nd dimensions the stats you make it calculate. The “outer” call takes the resulting “array/matrix” and inserts another dimension to this array representing the year.
### data structure: array of 3 dimensions
class(mean.lifeExp_by_year_by_cont.wide)
## [1] "array"
dim(mean.lifeExp_by_year_by_cont.wide)
## [1] 12 5 2
To print a 2 dimensional table, one would use the multidimensional array indexing, [,…,]. As an example myMultDimArray[,,“statName”] retrieves the matrix with information on a statistic called “statName”.
### print 2d tables of 'mean'
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "mean"], digits = 2), type = "html",
include.rownames = T)
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 39.14 | 53.28 | 46.31 | 64.41 | 69.25 |
| 1957 | 41.27 | 55.96 | 49.32 | 66.70 | 70.30 |
| 1962 | 43.32 | 58.40 | 51.56 | 68.54 | 71.09 |
| 1967 | 45.33 | 60.41 | 54.66 | 69.74 | 71.31 |
| 1972 | 47.45 | 62.39 | 57.32 | 70.78 | 71.91 |
| 1977 | 49.58 | 64.39 | 59.61 | 71.94 | 72.85 |
| 1982 | 51.59 | 66.23 | 62.62 | 72.81 | 74.29 |
| 1987 | 53.34 | 68.09 | 64.85 | 73.64 | 75.32 |
| 1992 | 53.63 | 69.57 | 66.54 | 74.44 | 76.94 |
| 1997 | 53.60 | 71.15 | 68.02 | 75.51 | 78.19 |
| 2002 | 53.33 | 72.42 | 69.23 | 76.70 | 79.74 |
| 2007 | 54.81 | 73.61 | 70.73 | 77.65 | 80.72 |
### print 2d table of 'median', with table caption
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "median"], digits = 2, caption = "Median life expectancy of 5 continents by year"),
type = "html", include.rownames = T, caption.placement = "top")
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 38.83 | 54.74 | 44.87 | 65.90 | 69.25 |
| 1957 | 40.59 | 56.07 | 48.28 | 67.65 | 70.30 |
| 1962 | 42.63 | 58.30 | 49.33 | 69.53 | 71.09 |
| 1967 | 44.70 | 60.52 | 53.66 | 70.61 | 71.31 |
| 1972 | 47.03 | 63.44 | 56.95 | 70.89 | 71.91 |
| 1977 | 49.27 | 66.35 | 60.77 | 72.34 | 72.85 |
| 1982 | 50.76 | 67.41 | 63.74 | 73.49 | 74.29 |
| 1987 | 51.64 | 69.50 | 66.30 | 74.81 | 75.32 |
| 1992 | 52.43 | 69.86 | 68.69 | 75.45 | 76.94 |
| 1997 | 52.76 | 72.15 | 70.27 | 76.12 | 78.19 |
| 2002 | 51.24 | 72.05 | 71.03 | 77.54 | 79.74 |
| 2007 | 52.93 | 72.90 | 72.40 | 78.61 | 80.72 |
This task becomes very easy with the use of nested daply()!
(thresh <- quantile(gDat$lifeExp, 0.1))
## 10%
## 41.51
(low_lifeExp <- daply(gDat, ~year, daply, ~continent, summarize, countLowLifeExp = sum(lifeExp <
thresh), propLowLifeExp = mean(lifeExp < thresh))) #note that all of these stats can be computed in one call!
## , , = countLowLifeExp
##
## continent
## year Africa Americas Asia Europe Oceania
## 1952 35 2 11 0 0
## 1957 29 1 8 0 0
## 1962 22 0 4 0 0
## 1967 14 0 3 0 0
## 1972 10 0 3 0 0
## 1977 5 0 2 0 0
## 1982 3 0 1 0 0
## 1987 3 0 1 0 0
## 1992 5 0 0 0 0
## 1997 4 0 0 0 0
## 2002 4 0 0 0 0
## 2007 1 0 0 0 0
##
## , , = propLowLifeExp
##
## continent
## year Africa Americas Asia Europe Oceania
## 1952 0.6731 0.08 0.3333 0 0
## 1957 0.5577 0.04 0.2424 0 0
## 1962 0.4231 0 0.1212 0 0
## 1967 0.2692 0 0.09091 0 0
## 1972 0.1923 0 0.09091 0 0
## 1977 0.09615 0 0.06061 0 0
## 1982 0.05769 0 0.0303 0 0
## 1987 0.05769 0 0.0303 0 0
## 1992 0.09615 0 0 0 0
## 1997 0.07692 0 0 0 0
## 2002 0.07692 0 0 0 0
## 2007 0.01923 0 0 0 0
Note that printing the object results in this “slicing” of multidimensional arrays by the inner-most dimension- in our case the dimension representing statistics we created.
This task actually embeds my knowledge of how to create counts and proportions using comparisons/masks (e.g. propLowLifeExp= mean(lifeExp < thresh)). It is probably unnecessary to print so many tables for 1 assignment. saves times for the marker.
### print 2d tables of 'mean'
print(xtable(low_lifeExp[, , "countLowLifeExp"], digits = 0, caption = "The counts of countries with low life expectancy in each continent by year"),
type = "html", include.rownames = T, caption.placement = "top")
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 35 | 2 | 11 | 0 | 0 |
| 1957 | 29 | 1 | 8 | 0 | 0 |
| 1962 | 22 | 0 | 4 | 0 | 0 |
| 1967 | 14 | 0 | 3 | 0 | 0 |
| 1972 | 10 | 0 | 3 | 0 | 0 |
| 1977 | 5 | 0 | 2 | 0 | 0 |
| 1982 | 3 | 0 | 1 | 0 | 0 |
| 1987 | 3 | 0 | 1 | 0 | 0 |
| 1992 | 5 | 0 | 0 | 0 | 0 |
| 1997 | 4 | 0 | 0 | 0 | 0 |
| 2002 | 4 | 0 | 0 | 0 | 0 |
| 2007 | 1 | 0 | 0 | 0 | 0 |
### print 2d table of 'median', with table caption
print(xtable(low_lifeExp[, , "propLowLifeExp"], digits = 2, caption = "The proportion of countries with low life expectancy in each continent by year"),
type = "html", include.rownames = T, caption.placement = "top")
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 0.67 | 0.08 | 0.33 | 0.00 | 0.00 |
| 1957 | 0.56 | 0.04 | 0.24 | 0.00 | 0.00 |
| 1962 | 0.42 | 0.00 | 0.12 | 0.00 | 0.00 |
| 1967 | 0.27 | 0.00 | 0.09 | 0.00 | 0.00 |
| 1972 | 0.19 | 0.00 | 0.09 | 0.00 | 0.00 |
| 1977 | 0.10 | 0.00 | 0.06 | 0.00 | 0.00 |
| 1982 | 0.06 | 0.00 | 0.03 | 0.00 | 0.00 |
| 1987 | 0.06 | 0.00 | 0.03 | 0.00 | 0.00 |
| 1992 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1997 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2002 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2007 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
Create a table with columns “Continent”, “Minimum GDP per Capita” and “Maximum GDP per Capita”. The continents are arranged by ascending minimum GDP.
minmax.gdp_by_continent.wide <- arrange(ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap),
maxGDP = max(gdpPercap)), minGDP)
print(xtable(minmax.gdp_by_continent.wide, digits = 0), type = "html", include.rownames = F)
| continent | minGDP | maxGDP |
|---|---|---|
| Africa | 241 | 21951 |
| Asia | 331 | 113523 |
| Europe | 974 | 49357 |
| Americas | 1202 | 42952 |
| Oceania | 10040 | 34435 |
An alternative look at minimum and maximum GDP per capita by continent. Here the “type” of statistic occupies its one column followed by the results of applying that summary statistic.
# this is not very elegant
fct <- function(dframe) {
res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
res <- rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
}
minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name,
GDP)
print(xtable(minmax.gdp_by_continent.tall, digits = 0), type = "html", include.rownames = F)
| continent | factor_name | GDP |
|---|---|---|
| Africa | min | 241 |
| Asia | min | 331 |
| Europe | min | 974 |
| Americas | min | 1202 |
| Oceania | min | 10040 |
| Africa | max | 21951 |
| Oceania | max | 34435 |
| Americas | max | 42952 |
| Europe | max | 49357 |
| Asia | max | 113523 |
Here we investigate the spread of GDP per capita in each continent by 3 different measures, the standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR).
spread.gdp_by_continent <- arrange(ddply(gDat, ~continent, summarize, sd = sd(gdpPercap),
mad = mad(gdpPercap), IQR = IQR(gdpPercap)), IQR)
print(xtable(spread.gdp_by_continent, digits = 0), type = "html", include.rownames = F)
| continent | sd | mad | IQR |
|---|---|---|---|
| Africa | 2828 | 775 | 1616 |
| Americas | 6397 | 3269 | 4402 |
| Asia | 14045 | 2821 | 7492 |
| Oceania | 6359 | 6459 | 8072 |
| Europe | 9355 | 8846 | 13248 |
The above table is sorted by ascending IQR values. Notice that ordering by standard deviations is very different from this ordering by IQR, while the order by mad follows the order of IQR more closely. This suggests that sd may not always be the most meaningful spread to report and interpret for strangely shaped distributions. One often interprets sd as width containing 60% of the data assuming normal distribution, but for highly skewed data this interpretation is far from correct. An IQR, based on actual proporions from the data may be a more useful measure of spread.