Homework 3: Data aggregation
Getting started:
Load packages that will be used and import gapminder dataset and str data to check that import has been successful.
library(plyr)
library(xtable)
library(knitr)
gDat <- read.delim(file = "gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Get the maximum and minimum GDP per capita for all continents in a “wide” format.
Wide format refers to each row variable (in this case continent) having multiple columns rather than repeating the row variable for every new column variable (this would be “tall” format). Here, for every continent row, the minimum GDP per capita (minGdpPercap) and maximum GDP per capita (maxGdpPercap) are reported.
gdpPercapByCont <- ddply(gDat, ~continent, summarise, minGdpPercap = min(gdpPercap),
maxGdpPercap = max(gdpPercap))
tgdpPercapByCont <- arrange(gdpPercapByCont, minGdpPercap)
tgdpPercapByCont <- xtable(tgdpPercapByCont)
print(tgdpPercapByCont, type = "html", include.rownames = FALSE)
| continent | minGdpPercap | maxGdpPercap |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Americas | 1201.64 | 42951.65 |
| Oceania | 10039.60 | 34435.37 |
This table is a bit confusing because it doesn't give information about the years when the maximum or mimimum is occuring. I think that the story would be more meaningful if we include when the maximum and mimimum occurred. At the very least it would be important to explain this in the figure caption that the range is 1952-2007.
Look at the spread of GDP per captia within the continents.
Here we will look at three different measures of spread within the continents: standard deviation (sd), median absolute deviation (mad), and interquartile range (IQR). Again, there is only one row per continent.
gdpSpread <- ddply(gDat, ~continent, summarize, sdGdpPercap = sd(gdpPercap),
madGdpPercap = mad(gdpPercap), IQRGdpPercap = IQR(gdpPercap))
gdpSpreadTable <- arrange(gdpSpread, sdGdpPercap)
gdpSpreadTable <- xtable(gdpSpreadTable)
print(gdpSpreadTable, type = "html", include.rownames = FALSE)
| continent | sdGdpPercap | madGdpPercap | IQRGdpPercap |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Oceania | 6358.98 | 6459.10 | 8072.26 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
The table is sorted by the standard deviation of the GDP per captia. In general, the relationship seems to hold between the different measures of spread. One exception is the spread of GDP per captia in the Americas, which stands out as smaller when considering the median absolute divation and interquartile range, compared to the standard deviation. The spread of GDP per capita is small in Africa across all measurements. For me, it would be easier to visualise the range with error bars.
Get the maximum and minimum of GDP per captia for all continents in a “tall” format.
To make this a “tall” format, need to write function so that two lines are returned for each continent. The function maxMin was written to make a dataset with two rows, and then the function can be called per continent with ddply.
maxMin <- function(x) {
values = c(max(x$gdpPercap), min(x$gdpPercap))
indexes = c("Max", "Min")
data.frame(GDPPercap = values, MaxorMin = indexes)
}
gdpMaxMinPerCont <- ddply(gDat, ~continent, maxMin)
I think it would be easier to interpret this figure if the maximum and minimum were grouped.
If making this publication ready, would like to merge the rows that are the same column (if they are adjacent) to reduce text and make the table cleaner.
Count the number of countries with low life expectancey over time by continent. “Tall” format.
I chose 50 years as the cutoff for life expectancy, so the following table shows how many countries per continent have a life expectancy less than 50 years. In the “tall” the number of countries with a life expectancy below 50 years (numCountriesBelow50yrs) is reported for each continent*year combination.
t <- ddply(gDat, ~year + continent, summarize, numCountriesBelow50yrs = sum(lifeExp <
50))
tablet <- arrange(t, continent)
tablefinal <- xtable(tablet)
print(tablefinal, type = "html", include.rownames = FALSE)
| year | continent | numCountriesBelow50yrs |
|---|---|---|
| 1952 | Africa | 50 |
| 1957 | Africa | 49 |
| 1962 | Africa | 47 |
| 1967 | Africa | 39 |
| 1972 | Africa | 36 |
| 1977 | Africa | 28 |
| 1982 | Africa | 24 |
| 1987 | Africa | 20 |
| 1992 | Africa | 20 |
| 1997 | Africa | 20 |
| 2002 | Africa | 22 |
| 2007 | Africa | 18 |
| 1952 | Americas | 9 |
| 1957 | Americas | 8 |
| 1962 | Americas | 6 |
| 1967 | Americas | 2 |
| 1972 | Americas | 2 |
| 1977 | Americas | 1 |
| 1982 | Americas | 0 |
| 1987 | Americas | 0 |
| 1992 | Americas | 0 |
| 1997 | Americas | 0 |
| 2002 | Americas | 0 |
| 2007 | Americas | 0 |
| 1952 | Asia | 22 |
| 1957 | Asia | 18 |
| 1962 | Asia | 17 |
| 1967 | Asia | 12 |
| 1972 | Asia | 6 |
| 1977 | Asia | 5 |
| 1982 | Asia | 3 |
| 1987 | Asia | 1 |
| 1992 | Asia | 1 |
| 1997 | Asia | 1 |
| 2002 | Asia | 1 |
| 2007 | Asia | 1 |
| 1952 | Europe | 1 |
| 1957 | Europe | 1 |
| 1962 | Europe | 0 |
| 1967 | Europe | 0 |
| 1972 | Europe | 0 |
| 1977 | Europe | 0 |
| 1982 | Europe | 0 |
| 1987 | Europe | 0 |
| 1992 | Europe | 0 |
| 1997 | Europe | 0 |
| 2002 | Europe | 0 |
| 2007 | Europe | 0 |
| 1952 | Oceania | 0 |
| 1957 | Oceania | 0 |
| 1962 | Oceania | 0 |
| 1967 | Oceania | 0 |
| 1972 | Oceania | 0 |
| 1977 | Oceania | 0 |
| 1982 | Oceania | 0 |
| 1987 | Oceania | 0 |
| 1992 | Oceania | 0 |
| 1997 | Oceania | 0 |
| 2002 | Oceania | 0 |
| 2007 | Oceania | 0 |
It is not very intuitive to make conclusions from this tall table. However, when the table is arranged by continent it is a little easier to see the trends. As many continents have very few countries where the life expectancy is less than 50, this table really highlights Africa. Once you have focused solely on one contient, the trends are easier to get a handle on. A graphical representation would be a lot easier for me personally to digest.
NB: Use sum rather than count: count gives frequency of TRUE and FALSE values, while sum counts the occurences of TRUE.
Report the absolute and/or relative abundance of countries with low life expectancy over time by continent. “Wide” format.
This follows on from the last task, but using a “Wide” format, it condenses the final output nicely, making it easier to interpret the relationship over time.
wideLifeExp <- function(gDat) {
daply(gDat, .(continent), summarise, numCountriesBelow50yrs = sum(lifeExp <
50))
}
lifeExpByTime <- daply(gDat, ~year, wideLifeExp)
print(xtable(lifeExpByTime, caption = "Number of countries per continent with life expectancy less than 50 years",
caption.placement = "top"), type = "html")
| Africa | Americas | Asia | Europe | Oceania | |
|---|---|---|---|---|---|
| 1952 | 50.00 | 9.00 | 22.00 | 1.00 | 0.00 |
| 1957 | 49.00 | 8.00 | 18.00 | 1.00 | 0.00 |
| 1962 | 47.00 | 6.00 | 17.00 | 0.00 | 0.00 |
| 1967 | 39.00 | 2.00 | 12.00 | 0.00 | 0.00 |
| 1972 | 36.00 | 2.00 | 6.00 | 0.00 | 0.00 |
| 1977 | 28.00 | 1.00 | 5.00 | 0.00 | 0.00 |
| 1982 | 24.00 | 0.00 | 3.00 | 0.00 | 0.00 |
| 1987 | 20.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 1992 | 20.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 1997 | 20.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 2002 | 22.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| 2007 | 18.00 | 0.00 | 1.00 | 0.00 | 0.00 |
To make this table, I used a double daply: one embedded in the other. If I had used just one with ~continent + year it would have made one line for every contient*year combination. With two daply functions we can combine all of the by continent information within years. I couldn't get the location of the caption to move to the top on this one…even with caption.placement = "top" I have to investigate this further.