STAT545a HW #4 Exploring a Quantitative Variable

In HW #3, we were given the opportunity to use the plyr package to practice data aggregation. However, as part of the exercise, we were specifically asked not to make any plots, both to focus on the task at hand and to appreciate the benefits of data visualization. Well, the time has come...

Again, we will be working with the Gapminder Data.

gDat <- read.delim("gapminderDataFiveYear.txt")
# Removing Oceania from the data due to the small sample of countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))
library(plyr)
library(xtable)
library(lattice)

First, I reviewed Sean Jewell's HW #3 to see if I could decipher his code and then attempt to make some companion figures. I noticed that he wrote a short function called printTable to help create tables. I thought this was useful and so I will borrow it for my code.

printTable <- function(df) {
    print(xtable(df), type = "html", include.rownames = FALSE)
}

Visualizing the Change in Life Expectancy Over Time

Sean is exploring how life expectancy is changing over time:

lifeExpCont <- ddply(iDat, .(continent, year), summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
printTable(lifeExpCont)

continent	year	meanLifeExp	medianLifeExp
Africa	1952	39.14	38.83
Africa	1957	41.27	40.59
Africa	1962	43.32	42.63
Africa	1967	45.33	44.70
Africa	1972	47.45	47.03
Africa	1977	49.58	49.27
Africa	1982	51.59	50.76
Africa	1987	53.34	51.64
Africa	1992	53.63	52.43
Africa	1997	53.60	52.76
Africa	2002	53.33	51.24
Africa	2007	54.81	52.93
Americas	1952	53.28	54.74
Americas	1957	55.96	56.07
Americas	1962	58.40	58.30
Americas	1967	60.41	60.52
Americas	1972	62.39	63.44
Americas	1977	64.39	66.35
Americas	1982	66.23	67.41
Americas	1987	68.09	69.50
Americas	1992	69.57	69.86
Americas	1997	71.15	72.15
Americas	2002	72.42	72.05
Americas	2007	73.61	72.90
Asia	1952	46.31	44.87
Asia	1957	49.32	48.28
Asia	1962	51.56	49.33
Asia	1967	54.66	53.66
Asia	1972	57.32	56.95
Asia	1977	59.61	60.77
Asia	1982	62.62	63.74
Asia	1987	64.85	66.30
Asia	1992	66.54	68.69
Asia	1997	68.02	70.27
Asia	2002	69.23	71.03
Asia	2007	70.73	72.40
Europe	1952	64.41	65.90
Europe	1957	66.70	67.65
Europe	1962	68.54	69.53
Europe	1967	69.74	70.61
Europe	1972	70.78	70.89
Europe	1977	71.94	72.34
Europe	1982	72.81	73.49
Europe	1987	73.64	74.81
Europe	1992	74.44	75.45
Europe	1997	75.51	76.12
Europe	2002	76.70	77.54
Europe	2007	77.65	78.61

Although it is possible to get a feel for how life expectancy changes over time on each continent, the trend is hard to grasp at a glance from the table above. So I will make a graph that captures this trend over time.

xyplot(meanLifeExp + medianLifeExp ~ year | continent, lifeExpCont, type = c("p", 
    "a"), auto.key = TRUE, ylab = "Life Expectancy")

[Figure: mean and median life expectancy by year, one panel per continent]

The plot above is pretty effective at telling the story behind the previous table. Not only can you follow the change in life expectancy over the years on a specific continent, but you can also compare that continent against all of the others.

Note that in Sean's HW #3, he converted the tall table to a wide format but to create the figure above, it was much easier to use the tall table.
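To make the tall/wide distinction concrete, here is a minimal sketch using four rows copied from the table above. Base R's reshape() is used only as an assumption for illustration; it is not necessarily how Sean produced his wide table, and the names tall and wide are hypothetical.

```r
# Tall table: one row per continent/year pair (values copied from the table above).
tall <- data.frame(
    continent   = c("Africa", "Africa", "Europe", "Europe"),
    year        = c(1952, 1957, 1952, 1957),
    meanLifeExp = c(39.14, 41.27, 64.41, 66.70)
)

# Wide table: one row per year, one meanLifeExp column per continent.
wide <- reshape(tall, idvar = "year", timevar = "continent", direction = "wide")
wide
```

The tall layout keeps continent as a single column, which is what a lattice formula such as `y ~ year | continent` conditions on. In the wide layout that information is spread across column names, so it would have to be gathered back before plotting.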

Visualizing the Number of Countries per Continent with Low Life Expectancy

Next, Sean established a threshold for life expectancy and then counted, for each continent and year, the countries whose life expectancy fell below it. I made a few minor adjustments to his code and changed the baseline to the mean life expectancy per year. (This was mostly to practice creating custom functions and using them with ddply.)

The table below depicts the baseline measure of life expectancy per year:

bl_lifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
printTable(bl_lifeExp)

year	meanLifeExp
1952	48.77
1957	51.24
1962	53.36
1967	55.45
1972	57.44
1977	59.38
1982	61.35
1987	63.04
1992	63.98
1997	64.83
2002	65.49
2007	66.81


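lowLifeInstance <- function(x) {
    # Baseline: mean life expectancy across all countries, per year
    meanLifeExp <- ddply(iDat, .(year), summarize, meanLifeExp = mean(lifeExp))
    # Each piece holds a single year. Note the `==`: an argument written as
    # `year = ...` is silently ignored by subset(), which then returns every year.
    lowLifeExp <- subset(meanLifeExp, year == x$year[1])$meanLifeExp
    # Count the countries at or below that year's baseline
    belowAvg <- sum(x$lifeExp <= lowLifeExp)
    names(belowAvg) <- "lowLifeExp"
    return(belowAvg)
}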

continentLifeExp <- ddply(iDat, .(continent, year), lowLifeInstance)
printTable(continentLifeExp)

continent	year	lowLifeExp
Africa	1952	24
Africa	1957	24
Africa	1962	24
Africa	1967	24
Africa	1972	23
Africa	1977	23
Africa	1982	22
Africa	1987	22
Africa	1992	20
Africa	1997	22
Africa	2002	22
Africa	2007	21
Americas	1952	20
Americas	1957	19
Americas	1962	16
Americas	1967	16
Americas	1972	15
Americas	1977	13
Americas	1982	13
Americas	1987	12
Americas	1992	12
Americas	1997	12
Americas	2002	12
Americas	2007	12
Asia	1952	23
Asia	1957	21
Asia	1962	20
Asia	1967	19
Asia	1972	18
Asia	1977	18
Asia	1982	18
Asia	1987	17
Asia	1992	17
Asia	1997	15
Asia	2002	15
Asia	2007	15
Europe	1952	17
Europe	1957	15
Europe	1962	13
Europe	1967	12
Europe	1972	12
Europe	1977	12
Europe	1982	12
Europe	1987	12
Europe	1992	12
Europe	1997	12
Europe	2002	12
Europe	2007	12
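
One thing worth checking in a helper like lowLifeInstance is how the baseline for a single year is looked up. subset() expects a logical condition such as `year == 1957`; an argument written as `year = 1957` is not a condition, so it is silently ignored and every row is returned. The counts in the table above never fall below 12, which is also the number of rows in the baseline table, so it may be worth confirming that only one year's mean enters the comparison. A minimal sketch of the difference, using three rows copied from the baseline table (the names baseline, allRows and oneRow are only for illustration):

```r
# Toy baseline: three rows copied from the bl_lifeExp table above.
baseline <- data.frame(year = c(1952, 1957, 1962),
                       meanLifeExp = c(48.77, 51.24, 53.36))

# `year = 1957` is passed as an extra named argument, not as a condition,
# so no filtering happens and all rows come back.
allRows <- subset(baseline, year = 1957)

# `year == 1957` is a logical comparison, so only the matching row is kept.
oneRow <- subset(baseline, year == 1957)

nrow(allRows)
nrow(oneRow)
oneRow$meanLifeExp
```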

stripplot(lowLifeExp ~ reorder(continent, lowLifeExp), continentLifeExp, jitter.data = TRUE, 
    type = c("p", "a"))

[Figure: strip plot of lowLifeExp by continent, continents ordered by lowLifeExp]

Each point depicts the number of countries at or below our baseline life expectancy for one continent in one year. Africa has the most such countries, which makes sense given the previous plot, where Africa had the lowest average life expectancy in every year of our data. However, the effect of time is not evident from this strip plot.
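
Because the strip plot pools all years, one way to bring time back in is to put year on the x-axis and draw one line per continent. The sketch below shows the idea on hypothetical counts (toyCounts and its values are made up for illustration and are not Gapminder results).

```r
library(lattice)

# Hypothetical counts for illustration only (not Gapminder results).
toyCounts <- data.frame(
    continent  = factor(rep(c("A", "B"), each = 3)),
    year       = rep(c(1952, 1957, 1962), times = 2),
    lowLifeExp = c(10, 8, 7, 3, 2, 2)
)

# Year on the x-axis and one line per continent make the change over time visible.
timePlot <- xyplot(lowLifeExp ~ year, toyCounts, groups = continent,
    type = c("p", "l"), auto.key = list(points = TRUE, lines = TRUE),
    ylab = "Countries at or below baseline")
print(timePlot)
```

With the real data, the same call should work with continentLifeExp in place of toyCounts, since it has the same three columns.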

Visualizing the Spread of the Data

Lastly, I will revisit my own previous assignment and visualize the spread of GDP per capita.

spreadGDP <- ddply(iDat, ~continent, summarize, sdGDP = sd(gdpPercap), madGDP = mad(gdpPercap), 
    iqrGDP = IQR(gdpPercap))
printTable(spreadGDP)

continent	sdGDP	madGDP	iqrGDP
Africa	2827.93	775.32	1616.17
Americas	6396.76	3269.33	4402.43
Asia	14045.37	2820.83	7492.26
Europe	9355.21	8846.05	13248.30

While all of these measures give us an idea of the spread of our data, it is very difficult to grasp the distribution of GDP per capita on each continent.
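
The three measures also react differently to extreme values, which may be part of why Asia's sdGDP is so much larger than its madGDP. A minimal sketch on made-up numbers (x, xOutlier and the helper spread are hypothetical names, and the values are not Gapminder data):

```r
# Made-up values for illustration (not Gapminder data).
x        <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
xOutlier <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)  # same values, but the largest is extreme

# Collect the three spread measures used above for one vector.
spread <- function(v) c(sd = sd(v), mad = mad(v), iqr = IQR(v))

rbind(x = spread(x), xOutlier = spread(xOutlier))
```

Here the single extreme value inflates the standard deviation, while the MAD and IQR stay the same because they depend on the middle of the data rather than on its tails.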

bwplot(gdpPercap ~ reorder(continent, gdpPercap), iDat, panel = function(..., 
    box.ratio) {
    panel.violin(..., col = "transparent", border = "red", varwidth = FALSE, 
        box.ratio = box.ratio)
    panel.bwplot(..., fill = NULL, box.ratio = 0.1)
})

[Figure: combined violin and box plot of gdpPercap by continent]

This combination box/violin plot gives us a much better idea of the spread of the data. It is also useful for identifying outliers, as in the case of Asia. However, a density plot might be an even better way of examining the spread of GDP per capita.

densityplot(~gdpPercap, iDat, n = 200, adjust = 5, groups = continent, plot.points = FALSE, 
    ref = TRUE, auto.key = list(columns = nlevels(iDat$continent)))

[Figure: density plot of gdpPercap, grouped by continent]