stat545a-2013-hw03_inskip-jes

Homework 3: Data aggregation

Getting started:

Load packages that will be used and import gapminder dataset and str data to check that import has been successful.

library(plyr)
library(xtable)
library(knitr)
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Get the maximum and minimum GDP per capita for all continents in a “wide” format.

Wide format refers to each row variable (in this case continent) having multiple columns rather than repeating the row variable for every new column variable (this would be “tall” format). Here, for every continent row, the minimum GDP per capita (minGdpPercap) and maximum GDP per capita (maxGdpPercap) are reported.

gdpPercapByCont <- ddply(gDat, ~continent, summarise, minGdpPercap = min(gdpPercap), 
    maxGdpPercap = max(gdpPercap))
tgdpPercapByCont <- arrange(gdpPercapByCont, minGdpPercap)
tgdpPercapByCont <- xtable(tgdpPercapByCont)
print(tgdpPercapByCont, type = "html", include.rownames = FALSE)
continent minGdpPercap maxGdpPercap
Africa 241.17 21951.21
Asia 331.00 113523.13
Europe 973.53 49357.19
Americas 1201.64 42951.65
Oceania 10039.60 34435.37

This table is a bit confusing because it doesn't give information about the years when the maximum or mimimum is occuring. I think that the story would be more meaningful if we include when the maximum and mimimum occurred. At the very least it would be important to explain this in the figure caption that the range is 1952-2007.

Look at the spread of GDP per captia within the continents.

Here we will look at three different measures of spread within the continents: standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR). Again, there is only one row per continent.

gdpSpread <- ddply(gDat, ~continent, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), IQRGdpPercap = IQR(gdpPercap))
gdpSpreadTable <- arrange(gdpSpread, sdGdpPercap)
gdpSpreadTable <- xtable(gdpSpreadTable)
print(gdpSpreadTable, type = "html", include.rownames = FALSE)
continent sdGdpPercap madGdpPercap IQRGdpPercap
Africa 2827.93 775.32 1616.17
Oceania 6358.98 6459.10 8072.26
Americas 6396.76 3269.33 4402.43
Europe 9355.21 8846.05 13248.30
Asia 14045.37 2820.83 7492.26

The table is sorted by the standard deviation of the GDP per captia. In general, the relationship seems to hold between the different measures of spread. One exception is the spread of GDP per captia in the Americas, which stands out as smaller when considering the median absolute divation and interquartile range, compared to the standard deviation. The spread of GDP per capita is small in Africa across all measurements. For me, it would be easier to visualise the range with error bars.

Get the maximum and minimum of GDP per captia for all continents in a “tall” format.

To make this a “tall” format, need to write function so that two lines are returned for each continent. The function maxMin was written to make a dataset with two rows, and then the function can be called per continent with ddply.

maxMin <- function(x) {
    values = c(max(x$gdpPercap), min(x$gdpPercap))
    indexes = c("Max", "Min")
    data.frame(GDPPercap = values, MaxorMin = indexes)
}
gdpMaxMinPerCont <- ddply(gDat, ~continent, maxMin)

I think it would be easier to interpret this figure if the maximum and minimum were grouped.
If making this publication ready, would like to merge the rows that are the same column (if they are adjacent) to reduce text and make the table cleaner.

Count the number of countries with low life expectancey over time by continent. “Tall” format.

I chose 50 years as the cutoff for life expectancy, so the following table shows how many countries per continent have a life expectancy less than 50 years. In the “tall” the number of countries with a life expectancy below 50 years (numCountriesBelow50yrs) is reported for each continent*year combination.

t <- ddply(gDat, ~year + continent, summarize, numCountriesBelow50yrs = sum(lifeExp < 
    50))
tablet <- arrange(t, continent)
tablefinal <- xtable(tablet)
print(tablefinal, type = "html", include.rownames = FALSE)
year continent numCountriesBelow50yrs
1952 Africa 50
1957 Africa 49
1962 Africa 47
1967 Africa 39
1972 Africa 36
1977 Africa 28
1982 Africa 24
1987 Africa 20
1992 Africa 20
1997 Africa 20
2002 Africa 22
2007 Africa 18
1952 Americas 9
1957 Americas 8
1962 Americas 6
1967 Americas 2
1972 Americas 2
1977 Americas 1
1982 Americas 0
1987 Americas 0
1992 Americas 0
1997 Americas 0
2002 Americas 0
2007 Americas 0
1952 Asia 22
1957 Asia 18
1962 Asia 17
1967 Asia 12
1972 Asia 6
1977 Asia 5
1982 Asia 3
1987 Asia 1
1992 Asia 1
1997 Asia 1
2002 Asia 1
2007 Asia 1
1952 Europe 1
1957 Europe 1
1962 Europe 0
1967 Europe 0
1972 Europe 0
1977 Europe 0
1982 Europe 0
1987 Europe 0
1992 Europe 0
1997 Europe 0
2002 Europe 0
2007 Europe 0
1952 Oceania 0
1957 Oceania 0
1962 Oceania 0
1967 Oceania 0
1972 Oceania 0
1977 Oceania 0
1982 Oceania 0
1987 Oceania 0
1992 Oceania 0
1997 Oceania 0
2002 Oceania 0
2007 Oceania 0

It is not very intuitive to make conclusions from this tall table. However, when the table is arranged by continent it is a little easier to see the trends. As many continents have very few countries where the life expectancy is less than 50, this table really highlights Africa. Once you have focused solely on one contient, the trends are easier to get a handle on. A graphical representation would be a lot easier for me personally to digest. NB: Use sum rather than count: count gives frequency of TRUE and FALSE values, while sum counts the occurences of TRUE.

Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. “Wide” format.

This follows on from the last task, but using a “Wide” format, it condenses the final output nicely, making it easier to interpret the relationship over time.

wideLifeExp <- function(gDat) {
    daply(gDat, .(continent), summarise, numCountriesBelow50yrs = sum(lifeExp < 
        50))
}
lifeExpByTime <- daply(gDat, ~year, wideLifeExp)
print(xtable(lifeExpByTime, caption = "Number of countries per continent with life expectancy less than 50 years", 
    caption.placement = "top"), type = "html")
Number of countries per continent with life expectancy less than 50 years
Africa Americas Asia Europe Oceania
1952 50.00 9.00 22.00 1.00 0.00
1957 49.00 8.00 18.00 1.00 0.00
1962 47.00 6.00 17.00 0.00 0.00
1967 39.00 2.00 12.00 0.00 0.00
1972 36.00 2.00 6.00 0.00 0.00
1977 28.00 1.00 5.00 0.00 0.00
1982 24.00 0.00 3.00 0.00 0.00
1987 20.00 0.00 1.00 0.00 0.00
1992 20.00 0.00 1.00 0.00 0.00
1997 20.00 0.00 1.00 0.00 0.00
2002 22.00 0.00 1.00 0.00 0.00
2007 18.00 0.00 1.00 0.00 0.00

To make this table, I used a double daply: one embedded in the other. If I had used just one with ~continent + year it would have made one line for every contient*year combination. With two daply functions we can combine all of the by continent information within years. I couldn't get the location of the caption to move to the top on this one…even with caption.placement = "top" I have to investigate this further.