STAT 545A Homework #4

Prepared by: Amanda Yuen

For homework #4, we will evaluate code prepared by someone else for last week's data aggregation tasks and produce graphics to accompany these tasks.

As always, we begin by importing the Gapminder data into R and doing a quick check that it has been imported properly.

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Next, we load the libraries that we will need for this assignment.

library(plyr)
## Warning: package 'plyr' was built under R version 2.15.3
library(xtable)
## Warning: package 'xtable' was built under R version 2.15.2
library(lattice)

We will also drop Oceania from the data set because it contains so few countries. The table confirms that Oceania has been removed.

iDat <- droplevels(subset(gDat, continent != "Oceania"))
table(iDat$continent)
## 
##   Africa Americas     Asia   Europe 
##      624      300      396      360
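
The droplevels() call matters here: subset() on its own would keep Oceania as an empty factor level, which would still show up in table() with a count of zero. A minimal sketch on a toy data frame (the `toy` object and its values are made up for illustration and are not Gapminder data):

```r
# Toy data frame (hypothetical values), with continent stored as a factor
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Oceania", "Asia")),
  lifeExp   = c(50, 60, 70, 65)
)

# subset() alone keeps "Oceania" as an unused level with a count of zero
kept <- subset(toy, continent != "Oceania")
table(kept$continent)

# droplevels() removes the unused level entirely
dropped <- droplevels(kept)
table(dropped$continent)
```

The first table still lists Oceania, while the second lists only the continents that remain in the data.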

I decided to take a look at the work completed by Justin Chu and try to understand his code as well as produce graphics to accompany his work.

First, he displayed the maximum and minimum GDP per capita for each continent in 3 different ways: 1) “Wide” format sorted by minimum GDP per capita, 2) “Wide” format sorted by maximum GDP per capita, and 3) “Tall” format. Since it is easier to generate graphs from data in a “tall” format, we can focus on his code for the third way.

maxMinGdpByContTall <- ddply(iDat, ~continent, summarize, factor = c("min", 
    "max"), GDP = c(min = min(gdpPercap), max = max(gdpPercap)))
tallGdpTbl <- xtable(maxMinGdpByContTall)
print(tallGdpTbl, type = "html", include.rownames = FALSE)
continent factor GDP
Africa min 241.17
Africa max 21951.21
Americas min 1201.64
Americas max 42951.65
Asia min 331.00
Asia max 113523.13
Europe min 973.53
Europe max 49357.19

The table looks good; however, it might be easier to compare the minimum and maximum values by displaying the information in a graph. Let's try a bar graph with a separate panel for each continent and see how it looks.

barchart(GDP ~ factor | continent, maxMinGdpByContTall, main = "Maximum and Minimum GDP Per Capita By Continent", 
    ylab = "GDP Per Capita", col = "lightblue")

[Figure: bar chart "Maximum and Minimum GDP Per Capita By Continent", one panel per continent; y-axis: GDP Per Capita]

The bar graph clearly shows that Asia has the largest difference between the maximum and minimum GDP per capita, while Africa has the smallest.
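
The same comparison can be checked numerically. The sketch below rebuilds the tall table from the values printed above (the `gdpTall` and `gdpRange` names are made up for illustration) and computes the gap between maximum and minimum for each continent in base R:

```r
# Values copied from the min/max table above
gdpTall <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), each = 2),
  factor    = rep(c("min", "max"), times = 4),
  GDP       = c(241.17, 21951.21, 1201.64, 42951.65,
                331.00, 113523.13, 973.53, 49357.19)
)

# Difference between the largest and smallest GDP per capita per continent
gdpRange <- tapply(gdpTall$GDP, gdpTall$continent,
                   function(x) max(x) - min(x))

# Smallest gap first, largest gap last
sort(gdpRange)
```

Sorting the result puts Africa first and Asia last, in line with what the bar graph suggests.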

Next, Justin examined the spread of the GDP per capita by continent using the following measures: standard deviation, variance, and interquartile range. His code is reproduced below.

spreadStats <- ddply(iDat, ~continent, summarize, gdpStdDev = sd(gdpPercap), 
    gdpVar = var(gdpPercap), gdpIQR = IQR(gdpPercap))
attach(spreadStats)
spreadTbl <- xtable(spreadStats[order(gdpStdDev), ])
detach(spreadStats)
print(spreadTbl, type = "html", include.rownames = FALSE)
continent gdpStdDev gdpVar gdpIQR
Africa 2827.93 7997187.31 1616.17
Americas 6396.76 40918591.10 4402.43
Europe 9355.21 87520019.60 13248.30
Asia 14045.37 197272505.85 7492.26

While the code gets the job done, the attach() and detach() calls are not necessary. They can be replaced by plyr's arrange() function, which sorts the data frame (in this case, by standard deviation) before it is sent to a table to be printed. Let's give it a try.

spreadStats <- ddply(iDat, ~continent, summarize, gdpStdDev = sd(gdpPercap), 
    gdpVar = var(gdpPercap), gdpIQR = IQR(gdpPercap))
spreadStats <- arrange(spreadStats, gdpStdDev)
spreadStats <- xtable(spreadStats)
print(spreadStats, type = "html", include.rownames = FALSE)
continent gdpStdDev gdpVar gdpIQR
Africa 2827.93 7997187.31 1616.17
Americas 6396.76 40918591.10 4402.43
Europe 9355.21 87520019.60 13248.30
Asia 14045.37 197272505.85 7492.26
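
For comparison, the same ordering can also be obtained in base R, again without attach(), by calling order() on the column directly. The sketch below rebuilds part of the summary from the printed values (the `spreadDemo` and `sortedDemo` names are made up for illustration) so that it runs without the Gapminder file or plyr:

```r
# Summary statistics copied from the table above, deliberately unsorted
spreadDemo <- data.frame(
  continent = c("Asia", "Africa", "Europe", "Americas"),
  gdpStdDev = c(14045.37, 2827.93, 9355.21, 6396.76),
  gdpIQR    = c(7492.26, 1616.17, 13248.30, 4402.43)
)

# Sort by standard deviation; no attach()/detach() needed
sortedDemo <- spreadDemo[order(spreadDemo$gdpStdDev), ]
sortedDemo$continent

# Sorting by interquartile range gives a different ranking
spreadDemo$continent[order(spreadDemo$gdpIQR)]
```

Note that the ranking depends on the measure: by standard deviation Asia has the widest spread, whereas by interquartile range Europe does.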

Now, let's visualize the distribution of GDP per capita by continent. We can try generating a density plot and see if the visuals correspond with the numbers in the table.

densityplot(~gdpPercap, iDat, main = "Distribution of GDP Per Capita By Continent", 
    plot.points = FALSE, ref = TRUE, groups = continent, auto.key = list(space = "right"), 
    n = 300, adjust = 3)

[Figure: density plot "Distribution of GDP Per Capita By Continent", one curve per continent; x-axis: gdpPercap]

Indeed, one of the most striking aspects of the density plot is the extremely narrow distribution of GDP per capita in Africa, which corresponds with the numbers in the table. We also notice how wide the distribution is for Asia, and for Europe as well (although not as much as Asia).

Let's do one more task. Justin examined how the mean and median life expectancy changed over time by continent. He examined both the trimmed numbers (trimmed by 10%) and the untrimmed numbers. His code for the untrimmed numbers is reproduced below.

lifeExpMeanByYearCont <- ddply(iDat, ~continent ~ year, summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
lifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearCont)
print(lifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
continent year meanLifeExp medianLifeExp
Africa 1952 39.14 38.83
Africa 1957 41.27 40.59
Africa 1962 43.32 42.63
Africa 1967 45.33 44.70
Africa 1972 47.45 47.03
Africa 1977 49.58 49.27
Africa 1982 51.59 50.76
Africa 1987 53.34 51.64
Africa 1992 53.63 52.43
Africa 1997 53.60 52.76
Africa 2002 53.33 51.24
Africa 2007 54.81 52.93
Americas 1952 53.28 54.74
Americas 1957 55.96 56.07
Americas 1962 58.40 58.30
Americas 1967 60.41 60.52
Americas 1972 62.39 63.44
Americas 1977 64.39 66.35
Americas 1982 66.23 67.41
Americas 1987 68.09 69.50
Americas 1992 69.57 69.86
Americas 1997 71.15 72.15
Americas 2002 72.42 72.05
Americas 2007 73.61 72.90
Asia 1952 46.31 44.87
Asia 1957 49.32 48.28
Asia 1962 51.56 49.33
Asia 1967 54.66 53.66
Asia 1972 57.32 56.95
Asia 1977 59.61 60.77
Asia 1982 62.62 63.74
Asia 1987 64.85 66.30
Asia 1992 66.54 68.69
Asia 1997 68.02 70.27
Asia 2002 69.23 71.03
Asia 2007 70.73 72.40
Europe 1952 64.41 65.90
Europe 1957 66.70 67.65
Europe 1962 68.54 69.53
Europe 1967 69.74 70.61
Europe 1972 70.78 70.89
Europe 1977 71.94 72.34
Europe 1982 72.81 73.49
Europe 1987 73.64 74.81
Europe 1992 74.44 75.45
Europe 1997 75.51 76.12
Europe 2002 76.70 77.54
Europe 2007 77.65 78.61
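
The trimmed variant mentioned earlier is not reproduced here, but in R a trimmed mean is usually obtained through the trim argument of mean(). A small sketch with made-up values (not Gapminder data), assuming "trimmed by 10%" means dropping 10% of the observations from each end, which is how mean() interprets trim = 0.1:

```r
# Hypothetical life expectancies with one unusually low value
x <- c(30, 60, 62, 63, 64, 65, 66, 67, 68, 70)

plainMean   <- mean(x)              # ordinary mean, pulled down by the low value
trimmedMean <- mean(x, trim = 0.1)  # drops the lowest and highest value first
med         <- median(x)

c(mean = plainMean, trimmed = trimmedMean, median = med)
```

With ten observations, trim = 0.1 removes one value from each end, so the trimmed mean sits much closer to the median than the ordinary mean does.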

It is quite difficult to gauge the overall trend of the mean and median life expectancies by looking at the table. It would be much easier if we depicted these numbers in a line plot.

xyplot(meanLifeExp + medianLifeExp ~ year, lifeExpMeanByYearCont, main = "Mean and Median Life Expectancies By Continent", 
    groups = continent, type = "l", auto.key = list(points = FALSE, lines = TRUE, 
        space = "right"))

[Figure: line plot "Mean and Median Life Expectancies By Continent"; x-axis: year]

From the plot, we can clearly see that Europe has the highest mean and median life expectancy throughout the period, followed by the Americas, Asia, and then Africa. We can also see that the median numbers show greater movement than the mean; in general, however, they both depict similar trends over time.