STAT545a HW #4 Exploring a Quantitative Variable

In HW #3, we were given the opportunity to use the PLYR package to practice data aggregation. However, as part of the excerise, we were specifically asked not to make any plots in order to focus on the task at hand and also to appreciate the benefits of data visualization. Well the time has come…..

Again, we will be working with the Gapminder Data.

gDat <- read.delim("gapminderDataFiveYear.txt")
# Removing Oceania from the data due to the small sample of countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))
library(plyr)
library(xtable)
library(lattice)

First, I reviewed Sean Jewell's HW #3 to see if I could decipher his code and then attempt to make some companion figures. I noticed that he wrote a short function called printTable to help create tables. I thought this was useful and so I will borrow it for my code.

printTable <- function(df) {
    print(xtable(df), type = "html", include.rownames = F)
}

Visualizing the Change in Life Expectancy Over Time

Sean is exploring how life expectancy is changing over time:

lifeExpCont <- ddply(iDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
printTable(lifeExpCont)
continent year meanLifeExp medianLifeExp
Africa 1952 39.14 38.83
Africa 1957 41.27 40.59
Africa 1962 43.32 42.63
Africa 1967 45.33 44.70
Africa 1972 47.45 47.03
Africa 1977 49.58 49.27
Africa 1982 51.59 50.76
Africa 1987 53.34 51.64
Africa 1992 53.63 52.43
Africa 1997 53.60 52.76
Africa 2002 53.33 51.24
Africa 2007 54.81 52.93
Americas 1952 53.28 54.74
Americas 1957 55.96 56.07
Americas 1962 58.40 58.30
Americas 1967 60.41 60.52
Americas 1972 62.39 63.44
Americas 1977 64.39 66.35
Americas 1982 66.23 67.41
Americas 1987 68.09 69.50
Americas 1992 69.57 69.86
Americas 1997 71.15 72.15
Americas 2002 72.42 72.05
Americas 2007 73.61 72.90
Asia 1952 46.31 44.87
Asia 1957 49.32 48.28
Asia 1962 51.56 49.33
Asia 1967 54.66 53.66
Asia 1972 57.32 56.95
Asia 1977 59.61 60.77
Asia 1982 62.62 63.74
Asia 1987 64.85 66.30
Asia 1992 66.54 68.69
Asia 1997 68.02 70.27
Asia 2002 69.23 71.03
Asia 2007 70.73 72.40
Europe 1952 64.41 65.90
Europe 1957 66.70 67.65
Europe 1962 68.54 69.53
Europe 1967 69.74 70.61
Europe 1972 70.78 70.89
Europe 1977 71.94 72.34
Europe 1982 72.81 73.49
Europe 1987 73.64 74.81
Europe 1992 74.44 75.45
Europe 1997 75.51 76.12
Europe 2002 76.70 77.54
Europe 2007 77.65 78.61

Although, it is possible to get a feel for how life expectancy is changing over time in regards to the continents, it is more difficult to grasp this instantly while examining the above table. So, I will attempt to make a graph which will capture this trend over time.

xyplot(meanLifeExp + medianLifeExp ~ year | continent, lifeExpCont, type = c("p", 
    "a"), auto.key = T, ylab = "Life Expectancy")

plot of chunk unnamed-chunk-4

The plot above is pretty effective at telling the story related to the previous table. Not only can you examine the changing life expectancy with regards to year on a specific continent but you can also compare that continent against all of the others.

Note that in Sean's HW #3, he converted the tall table to a wide format but to create the figure above, it was much easier to use the tall table.

Visualizing the Number of Countries per Continent with Low Life Expectancy

Next, Sean established a measure of life expectancy and then determined the number of countries per year whose life expectancy was below that measure for each continent. I made a few minor adjustments to his code and changed the baseline measure of life expectancy to be the mean life expectancy per year. (This was mostly to practice creating custom functions and using them with ddply)

The table below depicts the baseline measure of life expectancy per year:

bl_lifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
printTable(bl_lifeExp)
year meanLifeExp
1952 48.77
1957 51.24
1962 53.36
1967 55.45
1972 57.44
1977 59.38
1982 61.35
1987 63.04
1992 63.98
1997 64.83
2002 65.49
2007 66.81

lowLifeInstance <- function(x) {
    meanLifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
    lowLifeExp <- subset(meanLifeExp, year = mean(x$year))
    belowAvg <- sum(x$lifeExp <= lowLifeExp)
    names(belowAvg) <- "lowLifeExp"
    return(belowAvg)
}

continentLifeExp <- ddply(iDat, .(continent, year), lowLifeInstance)
printTable(continentLifeExp)
continent year lowLifeExp
Africa 1952 24
Africa 1957 24
Africa 1962 24
Africa 1967 24
Africa 1972 23
Africa 1977 23
Africa 1982 22
Africa 1987 22
Africa 1992 20
Africa 1997 22
Africa 2002 22
Africa 2007 21
Americas 1952 20
Americas 1957 19
Americas 1962 16
Americas 1967 16
Americas 1972 15
Americas 1977 13
Americas 1982 13
Americas 1987 12
Americas 1992 12
Americas 1997 12
Americas 2002 12
Americas 2007 12
Asia 1952 23
Asia 1957 21
Asia 1962 20
Asia 1967 19
Asia 1972 18
Asia 1977 18
Asia 1982 18
Asia 1987 17
Asia 1992 17
Asia 1997 15
Asia 2002 15
Asia 2007 15
Europe 1952 17
Europe 1957 15
Europe 1962 13
Europe 1967 12
Europe 1972 12
Europe 1977 12
Europe 1982 12
Europe 1987 12
Europe 1992 12
Europe 1997 12
Europe 2002 12
Europe 2007 12
stripplot(lowLifeExp ~ reorder(continent, lowLifeExp), continentLifeExp, jitter.data = T, 
    type = c("p", "a"))

plot of chunk unnamed-chunk-6

Each point depicts the number of countries below our baseline measure of life expectancy on each continent. Africa has the most countries below the world wide base expectancy for that year which make sense with our previous plot that showed Africa has the average lowest life expectancy for each year of our data. However, the effect of time is not evident from this strip plot.

Visualizing the Spread of the Data

Lastly, I will revisit my own previous assignment an visual the spread of GDP per Capita.

spreadGDP <- ddply(iDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap), 
    iqrGDP = IQR(gdpPercap))
printTable(spreadGDP)
continent sdGDP madGDP iqrGDP
Africa 2827.93 775.32 1616.17
Americas 6396.76 3269.33 4402.43
Asia 14045.37 2820.83 7492.26
Europe 9355.21 8846.05 13248.30

While all of these measures give us an idea of the spread of our data, it is very difficult to grasp the distribution of GDP per capita on each continent.

bwplot(gdpPercap ~ reorder(continent, gdpPercap), iDat, panel = function(..., 
    box.ratio) {
    panel.violin(..., col = "transparent", border = "red", varwidth = FALSE, 
        box.ratio = box.ratio)
    panel.bwplot(..., fill = NULL, box.ratio = 0.1)
})

plot of chunk unnamed-chunk-8

This combination box/violin plot gives us a much better idea of the spread of the data. It also useful for identify outliers in the case of Asia. However, a density plot might be an even better way of examining the spread of GDP per capita.

densityplot(~gdpPercap, iDat, n = 200, adjust = 5, groups = continent, plot.points = F, 
    ref = T, auto.key = list(columns = nlevels(iDat$continent)))

plot of chunk unnamed-chunk-9