STAT 545A Homework #5

Prepared by: Amanda Yuen

Up to this point, we have learned how to generate plots using the lattice package. For Homework #5, we will practice generating plots using the ggplot2 package. I will try to reproduce some plots from Homework #4 in ggplot2 so that the results from the two packages can be compared. The goal is to make at least two figures: one stripplot-type figure (one quantitative variable and one categorical variable) and one scatterplot (two quantitative variables).

As always, we start by importing the Gapminder data into R and then doing a quick check that the data has been imported properly.

gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
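A note for anyone re-running this on a newer R: since R 4.0.0, read.delim() no longer converts strings to factors by default, so country and continent would show up as chr instead of Factor unless stringsAsFactors = TRUE is passed. The sketch below is only an illustration. It reads a tiny inline table (the first three Afghanistan rows, with values rounded as in the str() output above) instead of the real file, and toyText and toyDat are names I made up for the example.

```r
# Toy stand-in for gapminderDataFiveYear.txt: three rows, rounded values
toyText <- "country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap
Afghanistan\t1952\t8425333\tAsia\t28.8\t779
Afghanistan\t1957\t9240934\tAsia\t30.3\t821
Afghanistan\t1962\t10267083\tAsia\t32\t853"

# stringsAsFactors = TRUE restores the factor columns seen in the str() output
toyDat <- read.delim(text = toyText, stringsAsFactors = TRUE)
str(toyDat)

# A few explicit checks that the import looks as expected
stopifnot(is.factor(toyDat$country),
          is.factor(toyDat$continent),
          is.numeric(toyDat$lifeExp),
          nrow(toyDat) == 3)
```

The same stopifnot() pattern could be applied to gDat itself, for example by checking that it has 1704 rows and 6 columns.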

We also load the packages needed for this assignment.

library(plyr)
## Warning: package 'plyr' was built under R version 2.15.3
library(xtable)
## Warning: package 'xtable' was built under R version 2.15.2
library(lattice)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 2.15.3

We drop Oceania from the data set and check that it has been done successfully.

iDat <- droplevels(subset(gDat, continent != "Oceania"))
table(iDat$continent)
## 
##   Africa Americas     Asia   Europe 
##      624      300      396      360

I'm planning to make three figures. I'll start with a multi-panelled bar plot just to see how it goes, and then meet the requirements for this assignment by making a scatterplot-type graph and a stripplot-type graph.

Maximum and Minimum GDP Per Capita for Each Continent

Let's begin by displaying the maximum and minimum GDP per capita for each continent in a “tall” table format.

maxMinGdpByContTall <- ddply(iDat, ~continent, summarize, factor = c("min", 
    "max"), GDP = c(min = min(gdpPercap), max = max(gdpPercap)))
tallGdpTbl <- xtable(maxMinGdpByContTall)
print(tallGdpTbl, type = "html", include.rownames = FALSE)
continent factor GDP
Africa min 241.17
Africa max 21951.21
Americas min 1201.64
Americas max 42951.65
Asia min 331.00
Asia max 113523.13
Europe min 973.53
Europe max 49357.19

Last week, I made a multi-panelled bar plot using the lattice package. It is shown below.

barchart(GDP ~ factor | continent, maxMinGdpByContTall, main = "Maximum and Minimum GDP Per Capita By Continent", 
    ylab = "GDP Per Capita", col = "lightblue")

[Figure: lattice bar chart, "Maximum and Minimum GDP Per Capita By Continent", one panel per continent]

Now, we will attempt to create a similar bar plot using the ggplot2 package.

ggplot(maxMinGdpByContTall, aes(y = GDP, x = factor)) + geom_bar() + facet_wrap(~continent)
## Mapping a variable to y and also using stat="bin".  With stat="bin", it
## will attempt to set the y value to the count of cases in each group.  This
## can result in unexpected behavior and will not be allowed in a future
## version of ggplot2.  If you want y to represent counts of cases, use
## stat="bin" and don't map a variable to y.  If you want y to represent
## values in the data, use stat="identity".  See ?geom_bar for examples.
## (Deprecated; last used in version 0.9.2)

[Figure: ggplot2 bar chart of minimum and maximum GDP per capita, one panel per continent, default styling]

Almost immediately, we see three differences between the two plots:
1) The colour scheme.
2) The order of the panels (i.e. the continents). lattice appears to order the continents alphabetically starting from the lower left corner, while ggplot2 appears to order them alphabetically starting from the upper left corner. I find the second ordering easier to follow.
3) The bottom of the bars. The bars begin exactly at 0 GDP per capita with ggplot2, but slightly below 0 with lattice. The biggest effect is that in the ggplot2 graph, the minimum GDP per capita in Africa and Asia is so small that its bar cannot be seen.
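For what it's worth, differences 2) and 3) can probably be removed on the lattice side as well. The sketch below is my own guess at how to do that, not part of the original homework: as.table = TRUE asks lattice to fill the panels from the top left, and origin = 0 makes every bar start at exactly 0. The data frame gdpTall is a hypothetical hand-typed copy of the table above.

```r
library(lattice)

# Values copied by hand from the min/max GDP table above
gdpTall <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), each = 2),
  factor = rep(c("min", "max"), times = 4),
  GDP = c(241.17, 21951.21, 1201.64, 42951.65,
          331.00, 113523.13, 973.53, 49357.19),
  stringsAsFactors = TRUE
)

p <- barchart(GDP ~ factor | continent, gdpTall,
              as.table = TRUE,  # panel order starts at the top left
              origin = 0,       # bars start at exactly 0
              main = "Maximum and Minimum GDP Per Capita By Continent",
              ylab = "GDP Per Capita", col = "lightblue")
print(p)
```

With these two arguments the lattice layout should be much closer to what facet_wrap() produces by default, which makes the remaining differences (mainly the colour scheme) easier to compare.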

After playing around for a bit, I found that setting both the fill and colour options was the easiest way to make all the minimum GDP per capita bars visible. “fill” determines the colour of the bar interiors and “colour” determines the colour of the bar outlines, so even a very short bar still shows up as an outline. It is also useful to specify the title and axis labels.

ggplot(maxMinGdpByContTall, aes(y = GDP, x = factor)) + geom_bar(fill = "dark blue", 
    colour = "dark blue") + facet_wrap(~continent) + ylab("GDP Per Capita") + 
    xlab(" ") + ggtitle("Maximum and Minimum GDP Per Capita By Continent")
## Mapping a variable to y and also using stat="bin".  With stat="bin", it
## will attempt to set the y value to the count of cases in each group.  This
## can result in unexpected behavior and will not be allowed in a future
## version of ggplot2.  If you want y to represent counts of cases, use
## stat="bin" and don't map a variable to y.  If you want y to represent
## values in the data, use stat="identity".  See ?geom_bar for examples.
## (Deprecated; last used in version 0.9.2)

[Figure: ggplot2 bar chart, "Maximum and Minimum GDP Per Capita By Continent", dark blue bars, one panel per continent]
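Both ggplot2 calls in this section trigger the long deprecation message about stat="bin". Following the advice in that message, a version with stat = "identity" should give the same picture without the warning. This is a minimal sketch under my own assumptions: gdpTall is a hypothetical hand-typed copy of the table above, and the colour is spelled "darkblue".

```r
library(ggplot2)

# Values copied by hand from the min/max GDP table above
gdpTall <- data.frame(
  continent = rep(c("Africa", "Americas", "Asia", "Europe"), each = 2),
  factor = rep(c("min", "max"), times = 4),
  GDP = c(241.17, 21951.21, 1201.64, 42951.65,
          331.00, 113523.13, 973.53, 49357.19),
  stringsAsFactors = TRUE
)

# stat = "identity" tells geom_bar() to use the GDP values as bar heights
# instead of counting rows
p <- ggplot(gdpTall, aes(x = factor, y = GDP)) +
  geom_bar(stat = "identity", fill = "darkblue", colour = "darkblue") +
  facet_wrap(~continent) +
  ylab("GDP Per Capita") + xlab(" ") +
  ggtitle("Maximum and Minimum GDP Per Capita By Continent")
print(p)
```

Newer ggplot2 releases also provide geom_col(), which is a shortcut for geom_bar(stat = "identity"). In those releases, mapping y without stat = "identity" is an error rather than a warning, as the message predicted.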

Mean and Median of Life Expectancy By Continent Over Time

Again, let's display the information in a table first.

lifeExpMeanByYearCont <- ddply(iDat, ~continent + year, summarize, meanLifeExp = mean(lifeExp), 
    medianLifeExp = median(lifeExp))
lifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearCont)
print(lifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
continent year meanLifeExp medianLifeExp
Africa 1952 39.14 38.83
Africa 1957 41.27 40.59
Africa 1962 43.32 42.63
Africa 1967 45.33 44.70
Africa 1972 47.45 47.03
Africa 1977 49.58 49.27
Africa 1982 51.59 50.76
Africa 1987 53.34 51.64
Africa 1992 53.63 52.43
Africa 1997 53.60 52.76
Africa 2002 53.33 51.24
Africa 2007 54.81 52.93
Americas 1952 53.28 54.74
Americas 1957 55.96 56.07
Americas 1962 58.40 58.30
Americas 1967 60.41 60.52
Americas 1972 62.39 63.44
Americas 1977 64.39 66.35
Americas 1982 66.23 67.41
Americas 1987 68.09 69.50
Americas 1992 69.57 69.86
Americas 1997 71.15 72.15
Americas 2002 72.42 72.05
Americas 2007 73.61 72.90
Asia 1952 46.31 44.87
Asia 1957 49.32 48.28
Asia 1962 51.56 49.33
Asia 1967 54.66 53.66
Asia 1972 57.32 56.95
Asia 1977 59.61 60.77
Asia 1982 62.62 63.74
Asia 1987 64.85 66.30
Asia 1992 66.54 68.69
Asia 1997 68.02 70.27
Asia 2002 69.23 71.03
Asia 2007 70.73 72.40
Europe 1952 64.41 65.90
Europe 1957 66.70 67.65
Europe 1962 68.54 69.53
Europe 1967 69.74 70.61
Europe 1972 70.78 70.89
Europe 1977 71.94 72.34
Europe 1982 72.81 73.49
Europe 1987 73.64 74.81
Europe 1992 74.44 75.45
Europe 1997 75.51 76.12
Europe 2002 76.70 77.54
Europe 2007 77.65 78.61

Last week, I graphically depicted this information using a line plot (essentially a scatterplot with lines connecting the dots). This is shown below.

xyplot(meanLifeExp + medianLifeExp ~ year, lifeExpMeanByYearCont, main = "Mean and Median Life Expectancies By Continent", 
    group = continent, type = c("l"), auto.key = list(points = FALSE, lines = TRUE, 
        space = "right"))

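[Figure: lattice line plot, "Mean and Median Life Expectancies By Continent", life expectancy against year, grouped by continent]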

We will now try to reproduce this plot using ggplot2.

ggplot(lifeExpMeanByYearCont, aes(y = meanLifeExp + medianLifeExp, x = year, 
    color = continent)) + geom_line() + facet_wrap(meanLifeExp + medianLifeExp)
## Error: object 'meanLifeExp' not found

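Uh oh, that won't work. The error comes from facet_wrap(), which expects a formula or column names rather than a bare expression, so R looks for meanLifeExp outside the data frame. Even without that error, mapping y to meanLifeExp + medianLifeExp would plot the sum of the two columns rather than two separate series. The only way that I can think of to achieve what I want in ggplot2 is to reshape the data to a “tall” format.

Question: how can we avoid reshaping the data in ggplot2? (We saw that it is possible in lattice.)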
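One possible answer to the question above, offered as a sketch and not as the recommended ggplot2 way: keep the data wide and add one geom_line() layer per column, each with its own y mapping. Mapping linetype to a constant string inside aes() gives a legend entry for each statistic. The data frame lifeWide is a hypothetical hand-typed subset (Africa and Europe, 1952 to 1962) of the wide table above.

```r
library(ggplot2)

# A few rows copied by hand from the wide mean/median table above
lifeWide <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 3),
  year = rep(c(1952, 1957, 1962), times = 2),
  meanLifeExp = c(39.14, 41.27, 43.32, 64.41, 66.70, 68.54),
  medianLifeExp = c(38.83, 40.59, 42.63, 65.90, 67.65, 69.53)
)

# One layer per summary column; no reshaping needed
p <- ggplot(lifeWide, aes(x = year, colour = continent)) +
  geom_line(aes(y = meanLifeExp, linetype = "mean")) +
  geom_line(aes(y = medianLifeExp, linetype = "median")) +
  labs(y = "Life Expectancy", linetype = "statistic") +
  ggtitle("Mean and Median Life Expectancies By Continent")
print(p)
```

This puts the mean and median in one panel, distinguished by line type, instead of in two facets. It gets tedious with many columns, which is probably why the tall format is usually suggested for ggplot2.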

lifeExpMeanByYearContTall <- ddply(iDat, ~continent + year, summarize, factor = c("mean", 
    "median"), LifeExp = c(meanLifeExp = mean(lifeExp), medianLifeExp = median(lifeExp)))
TalllifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearContTall)
print(TalllifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
continent year factor LifeExp
Africa 1952 mean 39.14
Africa 1952 median 38.83
Africa 1957 mean 41.27
Africa 1957 median 40.59
Africa 1962 mean 43.32
Africa 1962 median 42.63
Africa 1967 mean 45.33
Africa 1967 median 44.70
Africa 1972 mean 47.45
Africa 1972 median 47.03
Africa 1977 mean 49.58
Africa 1977 median 49.27
Africa 1982 mean 51.59
Africa 1982 median 50.76
Africa 1987 mean 53.34
Africa 1987 median 51.64
Africa 1992 mean 53.63
Africa 1992 median 52.43
Africa 1997 mean 53.60
Africa 1997 median 52.76
Africa 2002 mean 53.33
Africa 2002 median 51.24
Africa 2007 mean 54.81
Africa 2007 median 52.93
Americas 1952 mean 53.28
Americas 1952 median 54.74
Americas 1957 mean 55.96
Americas 1957 median 56.07
Americas 1962 mean 58.40
Americas 1962 median 58.30
Americas 1967 mean 60.41
Americas 1967 median 60.52
Americas 1972 mean 62.39
Americas 1972 median 63.44
Americas 1977 mean 64.39
Americas 1977 median 66.35
Americas 1982 mean 66.23
Americas 1982 median 67.41
Americas 1987 mean 68.09
Americas 1987 median 69.50
Americas 1992 mean 69.57
Americas 1992 median 69.86
Americas 1997 mean 71.15
Americas 1997 median 72.15
Americas 2002 mean 72.42
Americas 2002 median 72.05
Americas 2007 mean 73.61
Americas 2007 median 72.90
Asia 1952 mean 46.31
Asia 1952 median 44.87
Asia 1957 mean 49.32
Asia 1957 median 48.28
Asia 1962 mean 51.56
Asia 1962 median 49.33
Asia 1967 mean 54.66
Asia 1967 median 53.66
Asia 1972 mean 57.32
Asia 1972 median 56.95
Asia 1977 mean 59.61
Asia 1977 median 60.77
Asia 1982 mean 62.62
Asia 1982 median 63.74
Asia 1987 mean 64.85
Asia 1987 median 66.30
Asia 1992 mean 66.54
Asia 1992 median 68.69
Asia 1997 mean 68.02
Asia 1997 median 70.27
Asia 2002 mean 69.23
Asia 2002 median 71.03
Asia 2007 mean 70.73
Asia 2007 median 72.40
Europe 1952 mean 64.41
Europe 1952 median 65.90
Europe 1957 mean 66.70
Europe 1957 median 67.65
Europe 1962 mean 68.54
Europe 1962 median 69.53
Europe 1967 mean 69.74
Europe 1967 median 70.61
Europe 1972 mean 70.78
Europe 1972 median 70.89
Europe 1977 mean 71.94
Europe 1977 median 72.34
Europe 1982 mean 72.81
Europe 1982 median 73.49
Europe 1987 mean 73.64
Europe 1987 median 74.81
Europe 1992 mean 74.44
Europe 1992 median 75.45
Europe 1997 mean 75.51
Europe 1997 median 76.12
Europe 2002 mean 76.70
Europe 2002 median 77.54
Europe 2007 mean 77.65
Europe 2007 median 78.61

OK, let's try producing the figure in ggplot2 now.

ggplot(lifeExpMeanByYearContTall, aes(y = LifeExp, x = year, color = continent)) + 
    geom_line() + facet_wrap(~factor) + ggtitle("Mean and Median Life Expectancies By Continent")

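[Figure: ggplot2 line plot, "Mean and Median Life Expectancies By Continent", one panel each for mean and median, coloured by continent]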

Success!! And the key is included without having to explicitly state it, nice!

GDP Per Capita By Continent Over Time

For the third figure, we can examine how GDP per capita changes over time by continent. We can visually depict this information using a stripplot. Let's do it using lattice first and then try to do it in ggplot.

stripplot(gdpPercap ~ factor(year), iDat, jitter.data = TRUE, group = reorder(continent, 
    gdpPercap), type = c("p", "a"), alpha = 0.6, grid = "h", main = "GDP Per Capita By Continent", 
    ylab = "GDP Per Capita", xlab = "Year", auto.key = list(space = "right"))

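[Figure: lattice stripplot, "GDP Per Capita By Continent", GDP per capita against year with jittered points and average lines, grouped by continent]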

This plot is effective in showing how each continent fares with regard to the range of its countries' GDP per capita levels. We can clearly see that Asia has the most extreme high-GDP outliers, yet it also has many countries with some of the lowest GDP per capita in the world. Now let's try to make the plot using ggplot2.

ggplot(iDat, aes(y = gdpPercap, x = year, color = continent)) + geom_point()

[Figure: ggplot2 scatterplot of GDP per capita against year, coloured by continent, no jitter]

Two main observations: 1) Within each year, the points are stacked on a single vertical line, so they overlap and are hard to see. 2) The x-axis tick marks do not correspond to the years in the data.

Let's redo the figure by adding some random noise to make the points easier to see and try to customize the tick marks so that they will correspond with the actual years in the data.

ggplot(iDat, aes(y = gdpPercap, x = year, color = continent)) + geom_jitter(position = position_jitter(width = 0.5)) + 
    scale_x_continuous(breaks = seq(1952, 2007, by = 5)) + ggtitle("GDP Per Capita By Continent")

[Figure: ggplot2 jittered scatterplot, "GDP Per Capita By Continent", x-axis ticks at the survey years]

Success!! This shows how versatile ggplot2 can be when it comes to customizing a graph to present the data the way you want.
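The lattice stripplot earlier in this section also drew a line through the yearly averages of each continent (type = c("p", "a")), which the ggplot2 version above does not. A rough sketch of how that might be added follows. It treats year as a factor, which also puts the ticks on the actual years, and uses stat_summary() for the average lines. The data here are made up with rlnorm() purely to keep the example self-contained, toyDat is a hypothetical name, and the fun argument assumes a reasonably recent ggplot2 (older versions call it fun.y).

```r
library(ggplot2)

set.seed(545)  # made-up data, NOT the Gapminder values
toyDat <- data.frame(
  continent = rep(c("Africa", "Asia"), each = 30),
  year = rep(rep(c(1952, 1957, 1962), each = 10), times = 2),
  gdpPercap = c(rlnorm(30, meanlog = 7), rlnorm(30, meanlog = 8))
)

p <- ggplot(toyDat, aes(x = factor(year), y = gdpPercap, colour = continent)) +
  # horizontal jitter only, so the GDP values themselves are not moved
  geom_jitter(position = position_jitter(width = 0.2, height = 0), alpha = 0.6) +
  # one line per continent through the yearly means, like lattice's type "a"
  stat_summary(aes(group = continent), fun = mean, geom = "line") +
  xlab("Year") + ylab("GDP Per Capita") +
  ggtitle("GDP Per Capita By Continent")
print(p)
```

The explicit group = continent is needed because a discrete x-axis would otherwise make each year its own group, leaving nothing for the line to connect.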