Homework03

This homeork assignment is an excercise in data aggregation with a focus on writing code that is clean and readable.

# read from internet gDat <-
# read.delim('http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt')

# read locally
gDat <- read.delim(file.path(str_replace(getwd(), "stat545a-2013-hw03", "GapMinder"), 
    "gapminderDataFiveYear.txt"))

I begin with the “harder” sections. I hope this saves the TA some time in marking.

Life Expectancy by Year

Mean versus trimmed mean

We compare the mean life expectancy by year to the trimmed mean (20% from each end).

mean.lifeExp_by_year <- arrange(ddply(gDat, ~year, summarize, mean = mean(lifeExp), 
    trimmed.mean = mean(lifeExp, trim = 0.2)), year)

print(xtable(mean.lifeExp_by_year, digits = 2), type = "html", include.rownames = F)

year	mean	trimmed.mean
1952	49.06	47.75
1957	51.51	50.64
1962	53.61	53.13
1967	55.68	55.64
1972	57.65	58.12
1977	59.57	60.39
1982	61.53	62.47
1987	63.21	64.48
1992	64.16	65.89
1997	65.01	66.84
2002	65.69	67.77
2007	67.01	69.17

Curiously, the mean value of earlier years are higher than the trimmed mean, while the reverse is true for the later years…

(Challenge) Life expectancy over time on different continents: the “Wide” format

### performing the task with 1 line of code
mean.lifeExp_by_year_by_cont.wide <- daply(gDat, ~year, daply, ~continent, summarize, 
    mean = mean(lifeExp), median = median(lifeExp))  #note multiple stats can be computed with one call!

This is an odd excercise, that involves essentially “nested” daply() calls. The “inner” call takes data frame for a given year, returns a “array/matrix” of 2 dimensions: 1st dimension the factor levels you split by, and 2nd dimensions the stats you make it calculate. The “outer” call takes the resulting “array/matrix” and inserts another dimension to this array representing the year.

### data structure: array of 3 dimensions
class(mean.lifeExp_by_year_by_cont.wide)

## [1] "array"

dim(mean.lifeExp_by_year_by_cont.wide)

## [1] 12  5  2

To print a 2 dimensional table, one would use the multidimensional array indexing, [,…,]. As an example myMultDimArray[,,“statName”] retrieves the matrix with information on a statistic called “statName”.

### print 2d tables of 'mean'
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "mean"], digits = 2), type = "html", 
    include.rownames = T)

	Africa	Americas	Asia	Europe	Oceania
1952	39.14	53.28	46.31	64.41	69.25
1957	41.27	55.96	49.32	66.70	70.30
1962	43.32	58.40	51.56	68.54	71.09
1967	45.33	60.41	54.66	69.74	71.31
1972	47.45	62.39	57.32	70.78	71.91
1977	49.58	64.39	59.61	71.94	72.85
1982	51.59	66.23	62.62	72.81	74.29
1987	53.34	68.09	64.85	73.64	75.32
1992	53.63	69.57	66.54	74.44	76.94
1997	53.60	71.15	68.02	75.51	78.19
2002	53.33	72.42	69.23	76.70	79.74
2007	54.81	73.61	70.73	77.65	80.72


### print 2d table of 'median', with table caption
print(xtable(mean.lifeExp_by_year_by_cont.wide[, , "median"], digits = 2, caption = "Median life expectancy of 5 continents by year"), 
    type = "html", include.rownames = T, caption.placement = "top")

Median life expectancy of 5 continents by year
	Africa	Americas	Asia	Europe	Oceania
1952	38.83	54.74	44.87	65.90	69.25
1957	40.59	56.07	48.28	67.65	70.30
1962	42.63	58.30	49.33	69.53	71.09
1967	44.70	60.52	53.66	70.61	71.31
1972	47.03	63.44	56.95	70.89	71.91
1977	49.27	66.35	60.77	72.34	72.85
1982	50.76	67.41	63.74	73.49	74.29
1987	51.64	69.50	66.30	74.81	75.32
1992	52.43	69.86	68.69	75.45	76.94
1997	52.76	72.15	70.27	76.12	78.19
2002	51.24	72.05	71.03	77.54	79.74
2007	52.93	72.90	72.40	78.61	80.72

(Basically the same Challenge) Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. “Wide” format. Extension of another task of unknown difficulty / value.

This task becomes very easy with the use of nested daply()!

(thresh <- quantile(gDat$lifeExp, 0.1))

##   10% 
## 41.51

(low_lifeExp <- daply(gDat, ~year, daply, ~continent, summarize, countLowLifeExp = sum(lifeExp < 
    thresh), propLowLifeExp = mean(lifeExp < thresh)))  #note that all of these stats can be computed in one call!

## , ,  = countLowLifeExp
## 
##       continent
## year   Africa Americas Asia Europe Oceania
##   1952 35     2        11   0      0      
##   1957 29     1        8    0      0      
##   1962 22     0        4    0      0      
##   1967 14     0        3    0      0      
##   1972 10     0        3    0      0      
##   1977 5      0        2    0      0      
##   1982 3      0        1    0      0      
##   1987 3      0        1    0      0      
##   1992 5      0        0    0      0      
##   1997 4      0        0    0      0      
##   2002 4      0        0    0      0      
##   2007 1      0        0    0      0      
## 
## , ,  = propLowLifeExp
## 
##       continent
## year   Africa  Americas Asia    Europe Oceania
##   1952 0.6731  0.08     0.3333  0      0      
##   1957 0.5577  0.04     0.2424  0      0      
##   1962 0.4231  0        0.1212  0      0      
##   1967 0.2692  0        0.09091 0      0      
##   1972 0.1923  0        0.09091 0      0      
##   1977 0.09615 0        0.06061 0      0      
##   1982 0.05769 0        0.0303  0      0      
##   1987 0.05769 0        0.0303  0      0      
##   1992 0.09615 0        0       0      0      
##   1997 0.07692 0        0       0      0      
##   2002 0.07692 0        0       0      0      
##   2007 0.01923 0        0       0      0

Note that printing the object results in this “slicing” of multidimensional arrays by the inner-most dimension- in our case the dimension representing statistics we created.

This task actually embeds my knowledge of how to create counts and proportions using comparisons/masks (e.g. propLowLifeExp= mean(lifeExp < thresh)). It is probably unnecessary to print so many tables for 1 assignment. saves times for the marker.

### print 2d tables of 'mean'
print(xtable(low_lifeExp[, , "countLowLifeExp"], digits = 0, caption = "The counts of countries with low life expectancy in each continent by year"), 
    type = "html", include.rownames = T, caption.placement = "top")

The counts of countries with low life expectancy in each continent by year
	Africa	Americas	Asia
1952	35	2	11
1957	29	1	8
1962	22	0	4
1967	14	0	3
1972	10	0	3
1977	5	0	2
1982	3	0	1
1987	3	0	1
1992	5	0	0
1997	4	0	0
2002	4	0	0
2007	1	0	0


### print 2d table of 'median', with table caption
print(xtable(low_lifeExp[, , "propLowLifeExp"], digits = 2, caption = "The proportion of countries with low life expectancy in each continent by year"), 
    type = "html", include.rownames = T, caption.placement = "top")

The proportion of countries with low life expectancy in each continent by year
	Africa	Americas	Asia
1952	0.67	0.08	0.33
1957	0.56	0.04	0.24
1962	0.42	0.00	0.12
1967	0.27	0.00	0.09
1972	0.19	0.00	0.09
1977	0.10	0.00	0.06
1982	0.06	0.00	0.03
1987	0.06	0.00	0.03
1992	0.10	0.00	0.00
1997	0.08	0.00	0.00
2002	0.08	0.00	0.00
2007	0.02	0.00	0.00

GDP per capita by Continent

A wide table

Create a table with columns “Continent”, “Minimum GDP per Capita” and “Maximum GDP per Capita”. The continents are arranged by ascending minimum GDP.

minmax.gdp_by_continent.wide <- arrange(ddply(gDat, ~continent, summarize, minGDP = min(gdpPercap), 
    maxGDP = max(gdpPercap)), minGDP)
print(xtable(minmax.gdp_by_continent.wide, digits = 0), type = "html", include.rownames = F)

continent	minGDP	maxGDP
Africa	241	21951
Asia	331	113523
Europe	974	49357
Americas	1202	42952
Oceania	10040	34435

A Tall table

An alternative look at minimum and maximum GDP per capita by continent. Here the “type” of statistic occupies its one column followed by the results of applying that summary statistic.

# this is not very elegant
fct <- function(dframe) {
    res <- data.frame(factor_name = "min", GDP = min(dframe$gdpPercap))
    res <- rbind(res, data.frame(factor_name = "max", GDP = max(dframe$gdpPercap)))
}

minmax.gdp_by_continent.tall <- arrange(ddply(gDat, ~continent, fct), factor_name, 
    GDP)
print(xtable(minmax.gdp_by_continent.tall, digits = 0), type = "html", include.rownames = F)

continent	factor_name	GDP
Africa	min	241
Asia	min	331
Europe	min	974
Americas	min	1202
Oceania	min	10040
Africa	max	21951
Oceania	max	34435
Americas	max	42952
Europe	max	49357
Asia	max	113523

Spread of GDP per capita

Here we investigate the spread of GDP per capita in each continent by 3 different measures, the standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR).

spread.gdp_by_continent <- arrange(ddply(gDat, ~continent, summarize, sd = sd(gdpPercap), 
    mad = mad(gdpPercap), IQR = IQR(gdpPercap)), IQR)

print(xtable(spread.gdp_by_continent, digits = 0), type = "html", include.rownames = F)

continent	sd	mad	IQR
Africa	2828	775	1616
Americas	6397	3269	4402
Asia	14045	2821	7492
Oceania	6359	6459	8072
Europe	9355	8846	13248

The above table is sorted by ascending IQR values. Notice that ordering by standard deviations is very different from this ordering by IQR, while the order by mad follows the order of IQR more closely. This suggests that sd may not always be the most meaningful spread to report and interpret for strangely shaped distributions. One often interprets sd as width containing 60% of the data assuming normal distribution, but for highly skewed data this interpretation is far from correct. An IQR, based on actual proporions from the data may be a more useful measure of spread.