stat545a-2013-hw03_inskip-jes

Homework 3: Data aggregation

Getting started:

Load packages that will be used and import gapminder dataset and str data to check that import has been successful.

library(plyr)
library(xtable)
library(knitr)
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Get the maximum and minimum GDP per capita for all continents in a “wide” format.

Wide format refers to each row variable (in this case continent) having multiple columns rather than repeating the row variable for every new column variable (this would be “tall” format). Here, for every continent row, the minimum GDP per capita (minGdpPercap) and maximum GDP per capita (maxGdpPercap) are reported.

gdpPercapByCont <- ddply(gDat, ~continent, summarise, minGdpPercap = min(gdpPercap), 
    maxGdpPercap = max(gdpPercap))
tgdpPercapByCont <- arrange(gdpPercapByCont, minGdpPercap)
tgdpPercapByCont <- xtable(tgdpPercapByCont)
print(tgdpPercapByCont, type = "html", include.rownames = FALSE)

continent	minGdpPercap	maxGdpPercap
Africa	241.17	21951.21
Asia	331.00	113523.13
Europe	973.53	49357.19
Americas	1201.64	42951.65
Oceania	10039.60	34435.37

This table is a bit confusing because it doesn't give information about the years when the maximum or mimimum is occuring. I think that the story would be more meaningful if we include when the maximum and mimimum occurred. At the very least it would be important to explain this in the figure caption that the range is 1952-2007.

Look at the spread of GDP per captia within the continents.

Here we will look at three different measures of spread within the continents: standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR). Again, there is only one row per continent.

gdpSpread <- ddply(gDat, ~continent, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), IQRGdpPercap = IQR(gdpPercap))
gdpSpreadTable <- arrange(gdpSpread, sdGdpPercap)
gdpSpreadTable <- xtable(gdpSpreadTable)
print(gdpSpreadTable, type = "html", include.rownames = FALSE)

continent	sdGdpPercap	madGdpPercap	IQRGdpPercap
Africa	2827.93	775.32	1616.17
Oceania	6358.98	6459.10	8072.26
Americas	6396.76	3269.33	4402.43
Europe	9355.21	8846.05	13248.30
Asia	14045.37	2820.83	7492.26

The table is sorted by the standard deviation of the GDP per captia. In general, the relationship seems to hold between the different measures of spread. One exception is the spread of GDP per captia in the Americas, which stands out as smaller when considering the median absolute divation and interquartile range, compared to the standard deviation. The spread of GDP per capita is small in Africa across all measurements. For me, it would be easier to visualise the range with error bars.

Get the maximum and minimum of GDP per captia for all continents in a “tall” format.

To make this a “tall” format, need to write function so that two lines are returned for each continent. The function maxMin was written to make a dataset with two rows, and then the function can be called per continent with ddply.

maxMin <- function(x) {
    values = c(max(x$gdpPercap), min(x$gdpPercap))
    indexes = c("Max", "Min")
    data.frame(GDPPercap = values, MaxorMin = indexes)
}
gdpMaxMinPerCont <- ddply(gDat, ~continent, maxMin)

I think it would be easier to interpret this figure if the maximum and minimum were grouped.
If making this publication ready, would like to merge the rows that are the same column (if they are adjacent) to reduce text and make the table cleaner.

Count the number of countries with low life expectancey over time by continent. “Tall” format.

I chose 50 years as the cutoff for life expectancy, so the following table shows how many countries per continent have a life expectancy less than 50 years. In the “tall” the number of countries with a life expectancy below 50 years (numCountriesBelow50yrs) is reported for each continent*year combination.

t <- ddply(gDat, ~year + continent, summarize, numCountriesBelow50yrs = sum(lifeExp < 
    50))
tablet <- arrange(t, continent)
tablefinal <- xtable(tablet)
print(tablefinal, type = "html", include.rownames = FALSE)

year	continent	numCountriesBelow50yrs
1952	Africa	50
1957	Africa	49
1962	Africa	47
1967	Africa	39
1972	Africa	36
1977	Africa	28
1982	Africa	24
1987	Africa	20
1992	Africa	20
1997	Africa	20
2002	Africa	22
2007	Africa	18
1952	Americas	9
1957	Americas	8
1962	Americas	6
1967	Americas	2
1972	Americas	2
1977	Americas	1
1982	Americas	0
1987	Americas	0
1992	Americas	0
1997	Americas	0
2002	Americas	0
2007	Americas	0
1952	Asia	22
1957	Asia	18
1962	Asia	17
1967	Asia	12
1972	Asia	6
1977	Asia	5
1982	Asia	3
1987	Asia	1
1992	Asia	1
1997	Asia	1
2002	Asia	1
2007	Asia	1
1952	Europe	1
1957	Europe	1
1962	Europe	0
1967	Europe	0
1972	Europe	0
1977	Europe	0
1982	Europe	0
1987	Europe	0
1992	Europe	0
1997	Europe	0
2002	Europe	0
2007	Europe	0
1952	Oceania	0
1957	Oceania	0
1962	Oceania	0
1967	Oceania	0
1972	Oceania	0
1977	Oceania	0
1982	Oceania	0
1987	Oceania	0
1992	Oceania	0
1997	Oceania	0
2002	Oceania	0
2007	Oceania	0

It is not very intuitive to make conclusions from this tall table. However, when the table is arranged by continent it is a little easier to see the trends. As many continents have very few countries where the life expectancy is less than 50, this table really highlights Africa. Once you have focused solely on one contient, the trends are easier to get a handle on. A graphical representation would be a lot easier for me personally to digest. NB: Use sum rather than count: count gives frequency of TRUE and FALSE values, while sum counts the occurences of TRUE.

Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. “Wide” format.

This follows on from the last task, but using a “Wide” format, it condenses the final output nicely, making it easier to interpret the relationship over time.

wideLifeExp <- function(gDat) {
    daply(gDat, .(continent), summarise, numCountriesBelow50yrs = sum(lifeExp < 
        50))
}
lifeExpByTime <- daply(gDat, ~year, wideLifeExp)
print(xtable(lifeExpByTime, caption = "Number of countries per continent with life expectancy less than 50 years", 
    caption.placement = "top"), type = "html")

Number of countries per continent with life expectancy less than 50 years
	Africa	Americas	Asia	Europe
1952	50.00	9.00	22.00	1.00
1957	49.00	8.00	18.00	1.00
1962	47.00	6.00	17.00	0.00
1967	39.00	2.00	12.00	0.00
1972	36.00	2.00	6.00	0.00
1977	28.00	1.00	5.00	0.00
1982	24.00	0.00	3.00	0.00
1987	20.00	0.00	1.00	0.00
1992	20.00	0.00	1.00	0.00
1997	20.00	0.00	1.00	0.00
2002	22.00	0.00	1.00	0.00
2007	18.00	0.00	1.00	0.00

To make this table, I used a double daply: one embedded in the other. If I had used just one with ~continent + year it would have made one line for every contient*year combination. With two daply functions we can combine all of the by continent information within years. I couldn't get the location of the caption to move to the top on this one…even with caption.placement = "top" I have to investigate this further.