In HW #3, we were given the opportunity to use the PLYR package to practice data aggregation. However, as part of the excerise, we were specifically asked not to make any plots in order to focus on the task at hand and also to appreciate the benefits of data visualization. Well the time has come…..
Again, we will be working with the Gapminder Data.
gDat <- read.delim("gapminderDataFiveYear.txt")
# Removing Oceania from the data due to the small sample of countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))
library(plyr)
library(xtable)
library(lattice)
First, I reviewed Sean Jewell's HW #3 to see if I could decipher his code and then attempt to make some companion figures. I noticed that he wrote a short function called printTable to help create tables. I thought this was useful and so I will borrow it for my code.
printTable <- function(df) {
print(xtable(df), type = "html", include.rownames = F)
}
Sean is exploring how life expectancy is changing over time:
lifeExpCont <- ddply(iDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp),
medianLifeExp = median(lifeExp))
printTable(lifeExpCont)
| continent | year | meanLifeExp | medianLifeExp |
|---|---|---|---|
| Africa | 1952 | 39.14 | 38.83 |
| Africa | 1957 | 41.27 | 40.59 |
| Africa | 1962 | 43.32 | 42.63 |
| Africa | 1967 | 45.33 | 44.70 |
| Africa | 1972 | 47.45 | 47.03 |
| Africa | 1977 | 49.58 | 49.27 |
| Africa | 1982 | 51.59 | 50.76 |
| Africa | 1987 | 53.34 | 51.64 |
| Africa | 1992 | 53.63 | 52.43 |
| Africa | 1997 | 53.60 | 52.76 |
| Africa | 2002 | 53.33 | 51.24 |
| Africa | 2007 | 54.81 | 52.93 |
| Americas | 1952 | 53.28 | 54.74 |
| Americas | 1957 | 55.96 | 56.07 |
| Americas | 1962 | 58.40 | 58.30 |
| Americas | 1967 | 60.41 | 60.52 |
| Americas | 1972 | 62.39 | 63.44 |
| Americas | 1977 | 64.39 | 66.35 |
| Americas | 1982 | 66.23 | 67.41 |
| Americas | 1987 | 68.09 | 69.50 |
| Americas | 1992 | 69.57 | 69.86 |
| Americas | 1997 | 71.15 | 72.15 |
| Americas | 2002 | 72.42 | 72.05 |
| Americas | 2007 | 73.61 | 72.90 |
| Asia | 1952 | 46.31 | 44.87 |
| Asia | 1957 | 49.32 | 48.28 |
| Asia | 1962 | 51.56 | 49.33 |
| Asia | 1967 | 54.66 | 53.66 |
| Asia | 1972 | 57.32 | 56.95 |
| Asia | 1977 | 59.61 | 60.77 |
| Asia | 1982 | 62.62 | 63.74 |
| Asia | 1987 | 64.85 | 66.30 |
| Asia | 1992 | 66.54 | 68.69 |
| Asia | 1997 | 68.02 | 70.27 |
| Asia | 2002 | 69.23 | 71.03 |
| Asia | 2007 | 70.73 | 72.40 |
| Europe | 1952 | 64.41 | 65.90 |
| Europe | 1957 | 66.70 | 67.65 |
| Europe | 1962 | 68.54 | 69.53 |
| Europe | 1967 | 69.74 | 70.61 |
| Europe | 1972 | 70.78 | 70.89 |
| Europe | 1977 | 71.94 | 72.34 |
| Europe | 1982 | 72.81 | 73.49 |
| Europe | 1987 | 73.64 | 74.81 |
| Europe | 1992 | 74.44 | 75.45 |
| Europe | 1997 | 75.51 | 76.12 |
| Europe | 2002 | 76.70 | 77.54 |
| Europe | 2007 | 77.65 | 78.61 |
Although, it is possible to get a feel for how life expectancy is changing over time in regards to the continents, it is more difficult to grasp this instantly while examining the above table. So, I will attempt to make a graph which will capture this trend over time.
xyplot(meanLifeExp + medianLifeExp ~ year | continent, lifeExpCont, type = c("p",
"a"), auto.key = T, ylab = "Life Expectancy")
The plot above is pretty effective at telling the story related to the previous table. Not only can you examine the changing life expectancy with regards to year on a specific continent but you can also compare that continent against all of the others.
Note that in Sean's HW #3, he converted the tall table to a wide format but to create the figure above, it was much easier to use the tall table.
Next, Sean established a measure of life expectancy and then determined the number of countries per year whose life expectancy was below that measure for each continent. I made a few minor adjustments to his code and changed the baseline measure of life expectancy to be the mean life expectancy per year. (This was mostly to practice creating custom functions and using them with ddply)
The table below depicts the baseline measure of life expectancy per year:
bl_lifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
printTable(bl_lifeExp)
| year | meanLifeExp |
|---|---|
| 1952 | 48.77 |
| 1957 | 51.24 |
| 1962 | 53.36 |
| 1967 | 55.45 |
| 1972 | 57.44 |
| 1977 | 59.38 |
| 1982 | 61.35 |
| 1987 | 63.04 |
| 1992 | 63.98 |
| 1997 | 64.83 |
| 2002 | 65.49 |
| 2007 | 66.81 |
lowLifeInstance <- function(x) {
meanLifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
lowLifeExp <- subset(meanLifeExp, year = mean(x$year))
belowAvg <- sum(x$lifeExp <= lowLifeExp)
names(belowAvg) <- "lowLifeExp"
return(belowAvg)
}
continentLifeExp <- ddply(iDat, .(continent, year), lowLifeInstance)
printTable(continentLifeExp)
| continent | year | lowLifeExp |
|---|---|---|
| Africa | 1952 | 24 |
| Africa | 1957 | 24 |
| Africa | 1962 | 24 |
| Africa | 1967 | 24 |
| Africa | 1972 | 23 |
| Africa | 1977 | 23 |
| Africa | 1982 | 22 |
| Africa | 1987 | 22 |
| Africa | 1992 | 20 |
| Africa | 1997 | 22 |
| Africa | 2002 | 22 |
| Africa | 2007 | 21 |
| Americas | 1952 | 20 |
| Americas | 1957 | 19 |
| Americas | 1962 | 16 |
| Americas | 1967 | 16 |
| Americas | 1972 | 15 |
| Americas | 1977 | 13 |
| Americas | 1982 | 13 |
| Americas | 1987 | 12 |
| Americas | 1992 | 12 |
| Americas | 1997 | 12 |
| Americas | 2002 | 12 |
| Americas | 2007 | 12 |
| Asia | 1952 | 23 |
| Asia | 1957 | 21 |
| Asia | 1962 | 20 |
| Asia | 1967 | 19 |
| Asia | 1972 | 18 |
| Asia | 1977 | 18 |
| Asia | 1982 | 18 |
| Asia | 1987 | 17 |
| Asia | 1992 | 17 |
| Asia | 1997 | 15 |
| Asia | 2002 | 15 |
| Asia | 2007 | 15 |
| Europe | 1952 | 17 |
| Europe | 1957 | 15 |
| Europe | 1962 | 13 |
| Europe | 1967 | 12 |
| Europe | 1972 | 12 |
| Europe | 1977 | 12 |
| Europe | 1982 | 12 |
| Europe | 1987 | 12 |
| Europe | 1992 | 12 |
| Europe | 1997 | 12 |
| Europe | 2002 | 12 |
| Europe | 2007 | 12 |
stripplot(lowLifeExp ~ reorder(continent, lowLifeExp), continentLifeExp, jitter.data = T,
type = c("p", "a"))
Each point depicts the number of countries below our baseline measure of life expectancy on each continent. Africa has the most countries below the world wide base expectancy for that year which make sense with our previous plot that showed Africa has the average lowest life expectancy for each year of our data. However, the effect of time is not evident from this strip plot.
Lastly, I will revisit my own previous assignment an visual the spread of GDP per Capita.
spreadGDP <- ddply(iDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap),
iqrGDP = IQR(gdpPercap))
printTable(spreadGDP)
| continent | sdGDP | madGDP | iqrGDP |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
While all of these measures give us an idea of the spread of our data, it is very difficult to grasp the distribution of GDP per capita on each continent.
bwplot(gdpPercap ~ reorder(continent, gdpPercap), iDat, panel = function(...,
box.ratio) {
panel.violin(..., col = "transparent", border = "red", varwidth = FALSE,
box.ratio = box.ratio)
panel.bwplot(..., fill = NULL, box.ratio = 0.1)
})
This combination box/violin plot gives us a much better idea of the spread of the data. It also useful for identify outliers in the case of Asia. However, a density plot might be an even better way of examining the spread of GDP per capita.
densityplot(~gdpPercap, iDat, n = 200, adjust = 5, groups = continent, plot.points = F,
ref = T, auto.key = list(columns = nlevels(iDat$continent)))