Prepared by: Amanda Yuen
For homework #4, we will evaluate code prepared by someone else for last week's data aggregation tasks and produce graphics to accompany these tasks.
As always, we begin by importing the Gapminder data into R and doing a quick check that the data has been imported properly.
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
Next, we load the libraries that we will need for this assignment.
library(plyr)
## Warning: package 'plyr' was built under R version 2.15.3
library(xtable)
## Warning: package 'xtable' was built under R version 2.15.2
library(lattice)
We will also drop Oceania from the data set because it contains so few countries. The table confirms that Oceania has been removed.
iDat <- droplevels(subset(gDat, continent != "Oceania"))
table(iDat$continent)
##
## Africa Americas Asia Europe
## 624 300 396 360
I decided to take a look at the work completed by Justin Chu and try to understand his code as well as produce graphics to accompany his work.
First, he displayed the maximum and minimum GDP per capita for each continent in 3 different ways: 1) “Wide” format sorted by minimum GDP per capita, 2) “Wide” format sorted by maximum GDP per capita, and 3) “Tall” format. Since it is easier to generate graphs from data in a “tall” format, we can focus on his code for the third way.
maxMinGdpByContTall <- ddply(iDat, ~continent, summarize, factor = c("min",
"max"), GDP = c(min = min(gdpPercap), max = max(gdpPercap)))
tallGdpTbl <- xtable(maxMinGdpByContTall)
print(tallGdpTbl, type = "html", include.rownames = FALSE)
| continent | factor | GDP |
|---|---|---|
| Africa | min | 241.17 |
| Africa | max | 21951.21 |
| Americas | min | 1201.64 |
| Americas | max | 42951.65 |
| Asia | min | 331.00 |
| Asia | max | 113523.13 |
| Europe | min | 973.53 |
| Europe | max | 49357.19 |
The table looks good; however, it might be easier to compare the minimum and maximum values in a graph. Let's try a bar graph with a separate panel for each continent and see how it looks.
barchart(GDP ~ factor | continent, maxMinGdpByContTall, main = "Maximum and Minimum GDP Per Capita By Continent",
ylab = "GDP Per Capita", col = "lightblue")
The bar graph shows that Asia has the largest difference between the maximum and minimum GDP per capita (roughly 113,000), while Africa has the smallest (roughly 21,700).
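To put numbers on that comparison, here is a minimal sketch that computes the gap between maximum and minimum for each continent. It uses only the values printed in the table above; the names `tallGdp`, `stat`, and `gdpGap` are made up for this illustration and do not appear in Justin's code.

```r
# Values copied from the min/max table above.
# tallGdp, stat, and gdpGap are illustrative names only.
tallGdp <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), each = 2),
  stat = rep(c("min", "max"), times = 4),
  GDP = c(241.17, 21951.21, 1201.64, 42951.65,
          331.00, 113523.13, 973.53, 49357.19)
)

# Split GDP by continent and take max - min within each group
gdpGap <- sapply(split(tallGdp$GDP, tallGdp$continent),
                 function(v) max(v) - min(v))

# Smallest gap first, largest gap last
sort(gdpGap)
```

Sorting the result puts Africa first and Asia last, which agrees with what the bar graph suggests.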
Next, Justin examined the spread of the GDP per capita by continent using the following measures: standard deviation, variance, and interquartile range. His code is reproduced below.
spreadStats <- ddply(iDat, ~continent, summarize, gdpStdDev = sd(gdpPercap),
gdpVar = var(gdpPercap), gdpIQR = IQR(gdpPercap))
attach(spreadStats)
spreadTbl <- xtable(spreadStats[order(gdpStdDev), ])
detach(spreadStats)
print(spreadTbl, type = "html", include.rownames = FALSE)
| continent | gdpStdDev | gdpVar | gdpIQR |
|---|---|---|---|
| Africa | 2827.93 | 7997187.31 | 1616.17 |
| Americas | 6396.76 | 40918591.10 | 4402.43 |
| Europe | 9355.21 | 87520019.60 | 13248.30 |
| Asia | 14045.37 | 197272505.85 | 7492.26 |
While the code gets the job done, the attach() and detach() functions may not be necessary. These two calls could be replaced by plyr's arrange() function, which sorts the rows (in this case, by standard deviation) before the result is sent to a table to be printed. Let's give it a try.
spreadStats <- ddply(iDat, ~continent, summarize, gdpStdDev = sd(gdpPercap),
gdpVar = var(gdpPercap), gdpIQR = IQR(gdpPercap))
spreadStats <- arrange(spreadStats, gdpStdDev)
spreadStats <- xtable(spreadStats)
print(spreadStats, type = "html", include.rownames = FALSE)
| continent | gdpStdDev | gdpVar | gdpIQR |
|---|---|---|---|
| Africa | 2827.93 | 7997187.31 | 1616.17 |
| Americas | 6396.76 | 40918591.10 | 4402.43 |
| Europe | 9355.21 | 87520019.60 | 13248.30 |
| Asia | 14045.37 | 197272505.85 | 7492.26 |
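For comparison, the same sort can also be done in base R without attach(), detach(), or plyr. The sketch below uses a small data frame built from the standard deviations printed in the table; `spreadToy`, `byIndex`, and `byWith` are hypothetical names chosen for this example.

```r
# Standard deviations copied from the table above, deliberately out of order.
# spreadToy, byIndex, and byWith are illustrative names only.
spreadToy <- data.frame(
  continent = c("Asia", "Africa", "Europe", "Americas"),
  gdpStdDev = c(14045.37, 2827.93, 9355.21, 6396.76)
)

# Option 1: index the rows with order() on the fully qualified column
byIndex <- spreadToy[order(spreadToy$gdpStdDev), ]

# Option 2: with() looks the column up inside the data frame,
# without changing the search path the way attach() does
byWith <- with(spreadToy, spreadToy[order(gdpStdDev), ])

# Both approaches give the same ordering
identical(byIndex, byWith)
byIndex
```

Either option avoids attach(), which can leave variables masked on the search path if detach() is forgotten.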
Now, let's visualize the distribution of GDP per capita by continent. We can try generating a density plot and see if the visuals correspond with the numbers in the table.
densityplot(~gdpPercap, iDat, main = "Distribution of GDP Per Capita By Continent",
plot.points = FALSE, ref = TRUE, groups = continent, auto.key = list(space = "right"),
n = 300, adjust = 3)
Indeed, one of the most striking aspects of the density plot is the extremely narrow distribution of GDP per capita in Africa, which corresponds with the numbers in the table. We also notice how wide the distribution is for Asia, and for Europe as well (although not as much as Asia).
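One detail in the spread table is worth a second look: Asia has the largest standard deviation, yet Europe has the largest interquartile range. A plausible reading (an assumption here, not something checked against the full data) is that a few very high values in Asia inflate its standard deviation, while the IQR only reflects the middle half of the data. The sketch below illustrates the idea with made-up vectors `x` and `y`, which are not Gapminder data.

```r
# Toy vectors, made up for illustration (not Gapminder data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(1, 2, 3, 4, 5, 6, 7, 100)  # same as x except for one extreme value

# Variance is the square of the standard deviation
all.equal(sd(x)^2, var(x))

# IQR is the distance between the 75th and 25th percentiles
q <- quantile(x, c(0.25, 0.75))
all.equal(IQR(x), unname(q[2] - q[1]))

# One extreme value inflates the standard deviation
# but leaves the IQR unchanged
spreadCompare <- data.frame(
  vector = c("x", "y"),
  stdDev = c(sd(x), sd(y)),
  iqr = c(IQR(x), IQR(y))
)
spreadCompare
```

This is why it can be useful to report both measures, as Justin did: they answer slightly different questions about spread.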
Let's do one more task. Justin examined how the mean and median life expectancy changed over time by continent. He examined both the trimmed numbers (trimmed by 10%) and the untrimmed numbers. His code for the untrimmed numbers is reproduced below.
lifeExpMeanByYearCont <- ddply(iDat, ~continent ~ year, summarize, meanLifeExp = mean(lifeExp),
medianLifeExp = median(lifeExp))
lifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearCont)
print(lifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
| continent | year | meanLifeExp | medianLifeExp |
|---|---|---|---|
| Africa | 1952 | 39.14 | 38.83 |
| Africa | 1957 | 41.27 | 40.59 |
| Africa | 1962 | 43.32 | 42.63 |
| Africa | 1967 | 45.33 | 44.70 |
| Africa | 1972 | 47.45 | 47.03 |
| Africa | 1977 | 49.58 | 49.27 |
| Africa | 1982 | 51.59 | 50.76 |
| Africa | 1987 | 53.34 | 51.64 |
| Africa | 1992 | 53.63 | 52.43 |
| Africa | 1997 | 53.60 | 52.76 |
| Africa | 2002 | 53.33 | 51.24 |
| Africa | 2007 | 54.81 | 52.93 |
| Americas | 1952 | 53.28 | 54.74 |
| Americas | 1957 | 55.96 | 56.07 |
| Americas | 1962 | 58.40 | 58.30 |
| Americas | 1967 | 60.41 | 60.52 |
| Americas | 1972 | 62.39 | 63.44 |
| Americas | 1977 | 64.39 | 66.35 |
| Americas | 1982 | 66.23 | 67.41 |
| Americas | 1987 | 68.09 | 69.50 |
| Americas | 1992 | 69.57 | 69.86 |
| Americas | 1997 | 71.15 | 72.15 |
| Americas | 2002 | 72.42 | 72.05 |
| Americas | 2007 | 73.61 | 72.90 |
| Asia | 1952 | 46.31 | 44.87 |
| Asia | 1957 | 49.32 | 48.28 |
| Asia | 1962 | 51.56 | 49.33 |
| Asia | 1967 | 54.66 | 53.66 |
| Asia | 1972 | 57.32 | 56.95 |
| Asia | 1977 | 59.61 | 60.77 |
| Asia | 1982 | 62.62 | 63.74 |
| Asia | 1987 | 64.85 | 66.30 |
| Asia | 1992 | 66.54 | 68.69 |
| Asia | 1997 | 68.02 | 70.27 |
| Asia | 2002 | 69.23 | 71.03 |
| Asia | 2007 | 70.73 | 72.40 |
| Europe | 1952 | 64.41 | 65.90 |
| Europe | 1957 | 66.70 | 67.65 |
| Europe | 1962 | 68.54 | 69.53 |
| Europe | 1967 | 69.74 | 70.61 |
| Europe | 1972 | 70.78 | 70.89 |
| Europe | 1977 | 71.94 | 72.34 |
| Europe | 1982 | 72.81 | 73.49 |
| Europe | 1987 | 73.64 | 74.81 |
| Europe | 1992 | 74.44 | 75.45 |
| Europe | 1997 | 75.51 | 76.12 |
| Europe | 2002 | 76.70 | 77.54 |
| Europe | 2007 | 77.65 | 78.61 |
It is quite difficult to gauge the overall trend of the mean and median life expectancies by looking at the table. It would be much easier if we depicted these numbers in a line plot.
xyplot(meanLifeExp + medianLifeExp ~ year, lifeExpMeanByYearCont, main = "Mean and Median Life Expectancies By Continent",
groups = continent, type = "l", auto.key = list(points = FALSE, lines = TRUE,
space = "right"))
From the plot, we can clearly see that Europe has the highest mean and median life expectancy over time, followed by the Americas, Asia, and then Africa. We can also see that the median numbers show greater movement than the mean; in general, however, both depict similar trends over time.
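Since the trimmed version of this task is mentioned but not reproduced here, the following is a minimal sketch of what a 10% trimmed mean does, using R's built-in `trim` argument to mean(). The vector `toyLifeExp` is made up for illustration and is not Gapminder data; how Justin actually computed his trimmed numbers is not shown in this report, so treat the ddply() comment as an assumption.

```r
# Made-up values with one unusually high entry (not Gapminder data)
toyLifeExp <- c(40, 42, 44, 46, 48, 50, 52, 54, 56, 80)

# Ordinary mean: pulled upward by the extreme value
untrimmedMean <- mean(toyLifeExp)            # 51.2

# trim = 0.1 drops the lowest 10% and highest 10% of the sorted values
# (here, 40 and 80) before averaging
trimmedMean <- mean(toyLifeExp, trim = 0.1)  # 49

# The median is unaffected by the extreme value as well
medianValue <- median(toyLifeExp)            # 49

c(untrimmed = untrimmedMean, trimmed = trimmedMean, median = medianValue)

# Inside ddply(), the trimmed summary would presumably look like
# mean(lifeExp, trim = 0.1); this is an assumption, not Justin's code.
```

In this toy case the trimmed mean lands on the median, which shows how trimming reduces the influence of extreme countries on a continent's average.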