Homework03

This homeork assignment is an excercise in data aggregation with a focus on writing code that is clean and readable.

# read from internet gDat <-
# read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')

# read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw03", "GapMinder"), 
    "gapminderDataFiveYear.txt"))

I begin with the “harder” sections. I hope this saves the TA some time in marking.

Life Expectancy by Year

Mean versus trimmed mean

We compare the mean life expectancy by year to the trimmed mean (20% from each end).

mean.lifeExp_by_year <- arrange(ddply(gDat, ~year, summarize, mean = mean(lifeExp), 
    trimmed.mean = mean(lifeExp, trim = 0.2)), year)

print(xtable(mean.lifeExp_by_year, digits = 2), type = "html", include.rownames = F)
year mean trimmed.mean
1952 49.06 47.75
1957 51.51 50.64
1962 53.61 53.13
1967 55.68 55.64
1972 57.65 58.12
1977 59.57 60.39
1982 61.53 62.47
1987 63.21 64.48
1992 64.16 65.89
1997 65.01 66.84
2002 65.69 67.77
2007 67.01 69.17

Curiously, the mean value of earlier years are higher than the trimmed mean, while the reverse is true for the later years…

(Challenge) Life expectancy over time on different continents: the “Wide” format

### performing the task with 1 line of code
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize, 
    mean = mean(lifeExp), median = median(lifeExp))  #note multiple stats can be computed with one call!

This is an odd excercise, that involves essentially “nested” daply() calls. The “inner” call takes data frame for a given year, returns a “array/matrix” of 2 dimensions: 1st dimension the factor levels you split by, and 2nd dimensions the stats you make it calculate. The “outer” call takes the resulting “array/matrix” and inserts another dimension to this array representing the year.

### data structure: array of 3 dimensions
class(mean.lifeExp_by_year_by_cont.wide)
## [1] "array"
dim(mean.lifeExp_by_year_by_cont.wide)
## [1] 12  5  2

To print a 2 dimensional table, one would use the multidimensional array indexing, [,…,]. As an example myMultDimArray[,,“statName”] retrieves the matrix with information on a statistic called “statName”.

### print 2d tables of 'mean'
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "mean"], digits = 2), type = "html", 
    include.rownames = T)
Africa Americas Asia Europe Oceania
1952 39.14 53.28 46.31 64.41 69.25
1957 41.27 55.96 49.32 66.70 70.30
1962 43.32 58.40 51.56 68.54 71.09
1967 45.33 60.41 54.66 69.74 71.31
1972 47.45 62.39 57.32 70.78 71.91
1977 49.58 64.39 59.61 71.94 72.85
1982 51.59 66.23 62.62 72.81 74.29
1987 53.34 68.09 64.85 73.64 75.32
1992 53.63 69.57 66.54 74.44 76.94
1997 53.60 71.15 68.02 75.51 78.19
2002 53.33 72.42 69.23 76.70 79.74
2007 54.81 73.61 70.73 77.65 80.72

### print 2d table of 'median', with table caption
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "median"], digits = 2, caption = "Median life expectancy of 5 continents by year"), 
    type = "html", include.rownames = T, caption.placement = "top")
Median life expectancy of 5 continents by year
Africa Americas Asia Europe Oceania
1952 38.83 54.74 44.87 65.90 69.25
1957 40.59 56.07 48.28 67.65 70.30
1962 42.63 58.30 49.33 69.53 71.09
1967 44.70 60.52 53.66 70.61 71.31
1972 47.03 63.44 56.95 70.89 71.91
1977 49.27 66.35 60.77 72.34 72.85
1982 50.76 67.41 63.74 73.49 74.29
1987 51.64 69.50 66.30 74.81 75.32
1992 52.43 69.86 68.69 75.45 76.94
1997 52.76 72.15 70.27 76.12 78.19
2002 51.24 72.05 71.03 77.54 79.74
2007 52.93 72.90 72.40 78.61 80.72

(Basically the same Challenge) Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. “Wide” format. Extension of another task of unknown difficulty / value.

This task becomes very easy with the use of nested daply()!

(thresh <- quantile(gDat$lifeExp, 0.1))
##   10% 
## 41.51
(low_lifeExp <- daply(gDat, ~year, daply, ~continent, summarize, countLowLifeExp = sum(lifeExp < 
    thresh), propLowLifeExp = mean(lifeExp < thresh)))  #note that all of these stats can be computed in one call!
## , ,  = countLowLifeExp
## 
##       continent
## year   Africa Americas Asia Europe Oceania
##   1952 35     2        11   0      0      
##   1957 29     1        8    0      0      
##   1962 22     0        4    0      0      
##   1967 14     0        3    0      0      
##   1972 10     0        3    0      0      
##   1977 5      0        2    0      0      
##   1982 3      0        1    0      0      
##   1987 3      0        1    0      0      
##   1992 5      0        0    0      0      
##   1997 4      0        0    0      0      
##   2002 4      0        0    0      0      
##   2007 1      0        0    0      0      
## 
## , ,  = propLowLifeExp
## 
##       continent
## year   Africa  Americas Asia    Europe Oceania
##   1952 0.6731  0.08     0.3333  0      0      
##   1957 0.5577  0.04     0.2424  0      0      
##   1962 0.4231  0        0.1212  0      0      
##   1967 0.2692  0        0.09091 0      0      
##   1972 0.1923  0        0.09091 0      0      
##   1977 0.09615 0        0.06061 0      0      
##   1982 0.05769 0        0.0303  0      0      
##   1987 0.05769 0        0.0303  0      0      
##   1992 0.09615 0        0       0      0      
##   1997 0.07692 0        0       0      0      
##   2002 0.07692 0        0       0      0      
##   2007 0.01923 0        0       0      0

Note that printing the object results in this “slicing” of multidimensional arrays by the inner-most dimension- in our case the dimension representing statistics we created.

This task actually embeds my knowledge of how to create counts and proportions using comparisons/masks (e.g. propLowLifeExp= mean(lifeExp < thresh)). It is probably unnecessary to print so many tables for 1 assignment. saves times for the marker.

### print 2d tables of 'mean'
print(xtable(low_lifeExp[, , "countLowLifeExp"], digits = 0, caption = "The counts of countries with low life expectancy in each continent by year"), 
    type = "html", include.rownames = T, caption.placement = "top")
The counts of countries with low life expectancy in each continent by year
Africa Americas Asia Europe Oceania
1952 35 2 11 0 0
1957 29 1 8 0 0
1962 22 0 4 0 0
1967 14 0 3 0 0
1972 10 0 3 0 0
1977 5 0 2 0 0
1982 3 0 1 0 0
1987 3 0 1 0 0
1992 5 0 0 0 0
1997 4 0 0 0 0
2002 4 0 0 0 0
2007 1 0 0 0 0

### print 2d table of 'median', with table caption
print(xtable(low_lifeExp[, , "propLowLifeExp"], digits = 2, caption = "The proportion of countries with low life expectancy in each continent by year"), 
    type = "html", include.rownames = T, caption.placement = "top")
The proportion of countries with low life expectancy in each continent by year
Africa Americas Asia Europe Oceania
1952 0.67 0.08 0.33 0.00 0.00
1957 0.56 0.04 0.24 0.00 0.00
1962 0.42 0.00 0.12 0.00 0.00
1967 0.27 0.00 0.09 0.00 0.00
1972 0.19 0.00 0.09 0.00 0.00
1977 0.10 0.00 0.06 0.00 0.00
1982 0.06 0.00 0.03 0.00 0.00
1987 0.06 0.00 0.03 0.00 0.00
1992 0.10 0.00 0.00 0.00 0.00
1997 0.08 0.00 0.00 0.00 0.00
2002 0.08 0.00 0.00 0.00 0.00
2007 0.02 0.00 0.00 0.00 0.00

GDP per capita by Continent

A wide table

Create a table with columns “Continent”, “Minimum GDP per Capita” and “Maximum GDP per Capita”. The continents are arranged by ascending minimum GDP.

minmax.gdp_by_continent.wide <- arrange(ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap), 
    maxGDP = max(gdpPercap)), minGDP)
print(xtable(minmax.gdp_by_continent.wide, digits = 0), type = "html", include.rownames = F)
continent minGDP maxGDP
Africa 241 21951
Asia 331 113523
Europe 974 49357
Americas 1202 42952
Oceania 10040 34435

A Tall table

An alternative look at minimum and maximum GDP per capita by continent. Here the “type” of statistic occupies its one column followed by the results of applying that summary statistic.

# this is not very elegant
fct <- function(dframe) {
    res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
    res <- rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
}

minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name, 
    GDP)
print(xtable(minmax.gdp_by_continent.tall, digits = 0), type = "html", include.rownames = F)
continent factor_name GDP
Africa min 241
Asia min 331
Europe min 974
Americas min 1202
Oceania min 10040
Africa max 21951
Oceania max 34435
Americas max 42952
Europe max 49357
Asia max 113523

Spread of GDP per capita

Here we investigate the spread of GDP per capita in each continent by 3 different measures, the standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR).

spread.gdp_by_continent <- arrange(ddply(gDat, ~continent, summarize, sd = sd(gdpPercap), 
    mad = mad(gdpPercap), IQR = IQR(gdpPercap)), IQR)

print(xtable(spread.gdp_by_continent, digits = 0), type = "html", include.rownames = F)
continent sd mad IQR
Africa 2828 775 1616
Americas 6397 3269 4402
Asia 14045 2821 7492
Oceania 6359 6459 8072
Europe 9355 8846 13248

The above table is sorted by ascending IQR values. Notice that ordering by standard deviations is very different from this ordering by IQR, while the order by mad follows the order of IQR more closely. This suggests that sd may not always be the most meaningful spread to report and interpret for strangely shaped distributions. One often interprets sd as width containing 60% of the data assuming normal distribution, but for highly skewed data this interpretation is far from correct. An IQR, based on actual proporions from the data may be a more useful measure of spread.