Prepared by: Amanda Yuen
Up to this point, we have learned how to generate plots using the lattice package. For Homework #5, we will practice generating plots with the ggplot2 package. I will try to reproduce some plots from Homework #4 in ggplot2 so that the results from the two packages can be compared. The goal is to make at least two figures: a stripplot-type figure (one quantitative variable and one categorical variable) and a scatterplot (two quantitative variables).
As always, we start by importing the Gapminder data into R and then doing a quick check that the data has been imported properly.
gDat <- read.delim("gapminderDataFiveYear.txt")
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
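The `str()` output above shows `country` and `continent` as factors, which was the default behaviour of `read.delim()` before R 4.0.0. As a hedged sketch, assuming a newer R where strings are read as character by default, passing `stringsAsFactors = TRUE` should reproduce the same structure. The sample file below holds only two rows, with values rounded as they appear in the `str()` output; it is an illustration, not the Gapminder file.

```r
# Write a tiny tab-delimited sample (two rounded rows taken from the str() output above)
sampleFile <- tempfile(fileext = ".txt")
writeLines(c("country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap",
             "Afghanistan\t1952\t8425333\tAsia\t28.8\t779",
             "Afghanistan\t1957\t9240934\tAsia\t30.3\t821"),
           sampleFile)

# stringsAsFactors = TRUE forces factor columns regardless of the R version
sDat <- read.delim(sampleFile, stringsAsFactors = TRUE)

# Programmatic versions of the visual check done with str()
stopifnot(is.factor(sDat$country), is.factor(sDat$continent))
stopifnot(identical(names(sDat),
                    c("country", "year", "pop", "continent", "lifeExp", "gdpPercap")))
str(sDat)
```

Checks like these fail loudly if a column is missing or has the wrong type, which is harder to overlook than a line in printed output.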
We also need to load the libraries needed for this assignment.
library(plyr)
library(xtable)
library(lattice)
library(ggplot2)
We drop Oceania from the data set and check that it has been done successfully.
iDat <- droplevels(subset(gDat, continent != "Oceania"))
table(iDat$continent)
##
## Africa Americas Asia Europe
## 624 300 396 360
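The call to `droplevels()` matters because `subset()` alone keeps Oceania as an empty factor level, which would still appear in tables and plot legends. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical toy data: a factor with three continent labels
toy <- data.frame(continent = factor(c("Africa", "Asia", "Oceania", "Asia")),
                  value = 1:4)

kept <- subset(toy, continent != "Oceania")
nlevels(kept$continent)      # Oceania is still a level, now with zero rows
table(kept$continent)

cleaned <- droplevels(kept)
nlevels(cleaned$continent)   # the unused level has been dropped
table(cleaned$continent)
```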
I'm planning to make three figures. I'll start by trying to make a multi-panelled bar plot just to see how it goes, and then I'll meet the requirement for this assignment by making a scatterplot-type and then a stripplot-type of graph.
Let's begin by displaying the maximum and minimum GDP per capita for each continent in a “tall” table format.
maxMinGdpByContTall <- ddply(iDat, ~continent, summarize, factor = c("min",
"max"), GDP = c(min = min(gdpPercap), max = max(gdpPercap)))
tallGdpTbl <- xtable(maxMinGdpByContTall)
print(tallGdpTbl, type = "html", include.rownames = FALSE)
| continent | factor | GDP |
|---|---|---|
| Africa | min | 241.17 |
| Africa | max | 21951.21 |
| Americas | min | 1201.64 |
| Americas | max | 42951.65 |
| Asia | min | 331.00 |
| Asia | max | 113523.13 |
| Europe | min | 973.53 |
| Europe | max | 49357.19 |
Last week, I made a multi-panelled bar plot using the lattice package. It is shown below.
barchart(GDP ~ factor | continent, maxMinGdpByContTall, main = "Maximum and Minimum GDP Per Capita By Continent",
ylab = "GDP Per Capita", col = "lightblue")
Now, we will attempt to create a similar bar plot using the ggplot2 package.
# GDP already holds the bar heights, so the bars need stat = "identity";
# the default stat counts rows and raises a deprecation warning in ggplot2 0.9.x.
ggplot(maxMinGdpByContTall, aes(y = GDP, x = factor)) + geom_bar(stat = "identity") + facet_wrap(~continent)
Almost immediately, we see three differences between the two plots: 1) The colour scheme. 2) The order of the panels (i.e. continents). The lattice package appears to order the continents alphabetically starting from the lower left corner, while ggplot2 appears to order them alphabetically starting from the upper left corner. I find the second ordering easier to follow. 3) The bars start exactly at 0 GDP per capita in ggplot2, but slightly below 0 in lattice. The most visible consequence is that, in the ggplot2 graph, the minimum GDP per capita in Africa and Asia is so small relative to the maximum that its bar cannot be seen.
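The panel order and the bar baseline look like lattice defaults rather than fixed behaviour. As a sketch, assuming the `as.table` and `origin` arguments of `barchart()` behave as documented, the lattice plot can be pushed toward the ggplot2 layout. The data frame `toyGdp` is a stand-in for `maxMinGdpByContTall` that holds only the Africa and Asia rows from the table above.

```r
library(lattice)

# Stand-in for maxMinGdpByContTall (Africa and Asia rows from the table above)
toyGdp <- data.frame(continent = factor(rep(c("Africa", "Asia"), each = 2)),
                     factor = factor(rep(c("min", "max"), times = 2),
                                     levels = c("min", "max")),
                     GDP = c(241.17, 21951.21, 331.00, 113523.13))

# as.table = TRUE fills the panels starting from the top-left corner;
# origin = 0 anchors the base of every bar at zero
latticeBars <- barchart(GDP ~ factor | continent, toyGdp,
                        as.table = TRUE, origin = 0,
                        ylab = "GDP Per Capita", col = "lightblue")
print(latticeBars)
```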
After playing around for a bit, I found that adding the fill and colour options was the easiest way to make all the minimum GDP per capita bars visible. “Fill” determines the colour of the bars and “colour” determines the colour of the bar outlines. Also, it may be useful to specify the graph and axis labels.
ggplot(maxMinGdpByContTall, aes(y = GDP, x = factor)) + geom_bar(stat = "identity",
fill = "dark blue", colour = "dark blue") + facet_wrap(~continent) + ylab("GDP Per Capita") +
xlab(" ") + ggtitle("Maximum and Minimum GDP Per Capita By Continent")
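To see where the ggplot2 bars start and end, it can help to inspect the geometry that ggplot2 computes for the bar layer. The sketch below assumes a ggplot2 release in which `layer_data()` is available, and uses a stand-in data frame, `toyGdp`, holding only the Africa and Asia rows from the earlier table.

```r
library(ggplot2)

# Stand-in for maxMinGdpByContTall (Africa and Asia rows only)
toyGdp <- data.frame(continent = rep(c("Africa", "Asia"), each = 2),
                     factor = rep(c("min", "max"), times = 2),
                     GDP = c(241.17, 21951.21, 331.00, 113523.13))

# stat = "identity" draws each bar at the height stored in GDP
barPlot <- ggplot(toyGdp, aes(x = factor, y = GDP)) +
  geom_bar(stat = "identity", fill = "darkblue", colour = "darkblue") +
  facet_wrap(~continent)

# layer_data() returns the computed bar geometry: ymin is the base, ymax the top
barGeometry <- layer_data(barPlot)
barGeometry[, c("PANEL", "x", "ymin", "ymax")]
```

Every `ymin` is zero, so a bar a few hundred units tall on an axis that runs past 100,000 is only a sliver. The `colour` outline gives that sliver a minimum drawn thickness, which is presumably why it becomes visible.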
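For the second figure, we look at the mean and median life expectancy for each continent over time. Again, let's display the information in a table first.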
lifeExpMeanByYearCont <- ddply(iDat, ~continent + year, summarize, meanLifeExp = mean(lifeExp),
medianLifeExp = median(lifeExp))
lifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearCont)
print(lifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
| continent | year | meanLifeExp | medianLifeExp |
|---|---|---|---|
| Africa | 1952 | 39.14 | 38.83 |
| Africa | 1957 | 41.27 | 40.59 |
| Africa | 1962 | 43.32 | 42.63 |
| Africa | 1967 | 45.33 | 44.70 |
| Africa | 1972 | 47.45 | 47.03 |
| Africa | 1977 | 49.58 | 49.27 |
| Africa | 1982 | 51.59 | 50.76 |
| Africa | 1987 | 53.34 | 51.64 |
| Africa | 1992 | 53.63 | 52.43 |
| Africa | 1997 | 53.60 | 52.76 |
| Africa | 2002 | 53.33 | 51.24 |
| Africa | 2007 | 54.81 | 52.93 |
| Americas | 1952 | 53.28 | 54.74 |
| Americas | 1957 | 55.96 | 56.07 |
| Americas | 1962 | 58.40 | 58.30 |
| Americas | 1967 | 60.41 | 60.52 |
| Americas | 1972 | 62.39 | 63.44 |
| Americas | 1977 | 64.39 | 66.35 |
| Americas | 1982 | 66.23 | 67.41 |
| Americas | 1987 | 68.09 | 69.50 |
| Americas | 1992 | 69.57 | 69.86 |
| Americas | 1997 | 71.15 | 72.15 |
| Americas | 2002 | 72.42 | 72.05 |
| Americas | 2007 | 73.61 | 72.90 |
| Asia | 1952 | 46.31 | 44.87 |
| Asia | 1957 | 49.32 | 48.28 |
| Asia | 1962 | 51.56 | 49.33 |
| Asia | 1967 | 54.66 | 53.66 |
| Asia | 1972 | 57.32 | 56.95 |
| Asia | 1977 | 59.61 | 60.77 |
| Asia | 1982 | 62.62 | 63.74 |
| Asia | 1987 | 64.85 | 66.30 |
| Asia | 1992 | 66.54 | 68.69 |
| Asia | 1997 | 68.02 | 70.27 |
| Asia | 2002 | 69.23 | 71.03 |
| Asia | 2007 | 70.73 | 72.40 |
| Europe | 1952 | 64.41 | 65.90 |
| Europe | 1957 | 66.70 | 67.65 |
| Europe | 1962 | 68.54 | 69.53 |
| Europe | 1967 | 69.74 | 70.61 |
| Europe | 1972 | 70.78 | 70.89 |
| Europe | 1977 | 71.94 | 72.34 |
| Europe | 1982 | 72.81 | 73.49 |
| Europe | 1987 | 73.64 | 74.81 |
| Europe | 1992 | 74.44 | 75.45 |
| Europe | 1997 | 75.51 | 76.12 |
| Europe | 2002 | 76.70 | 77.54 |
| Europe | 2007 | 77.65 | 78.61 |
Last week, I graphically depicted this information using a line plot (essentially a scatterplot with lines connecting the dots). This is shown below.
xyplot(meanLifeExp + medianLifeExp ~ year, lifeExpMeanByYearCont, main = "Mean and Median Life Expectancies By Continent",
groups = continent, type = "l", auto.key = list(points = FALSE, lines = TRUE,
space = "right"))
We will now try to reproduce this plot using ggplot.
ggplot(lifeExpMeanByYearCont, aes(y = meanLifeExp + medianLifeExp, x = year,
color = continent)) + geom_line() + facet_wrap(meanLifeExp + medianLifeExp)
## Error: object 'meanLifeExp' not found
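Uh oh, that won't work: facet_wrap() expects a formula such as ~variable that names a faceting variable, and here the mean and the median sit in two separate columns rather than in one column that could be faceted on. The only way that I can think of to achieve what I want in ggplot is to reshape the data to a “tall” format.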
Question: how can reshaping the data be avoided in ggplot? (We saw that it is possible in lattice.)
lifeExpMeanByYearContTall <- ddply(iDat, ~continent + year, summarize, factor = c("mean",
"median"), LifeExp = c(meanLifeExp = mean(lifeExp), medianLifeExp = median(lifeExp)))
TalllifeExpMeanByYCTbl <- xtable(lifeExpMeanByYearContTall)
print(TalllifeExpMeanByYCTbl, type = "html", include.rownames = FALSE)
| continent | year | factor | LifeExp |
|---|---|---|---|
| Africa | 1952 | mean | 39.14 |
| Africa | 1952 | median | 38.83 |
| Africa | 1957 | mean | 41.27 |
| Africa | 1957 | median | 40.59 |
| Africa | 1962 | mean | 43.32 |
| Africa | 1962 | median | 42.63 |
| Africa | 1967 | mean | 45.33 |
| Africa | 1967 | median | 44.70 |
| Africa | 1972 | mean | 47.45 |
| Africa | 1972 | median | 47.03 |
| Africa | 1977 | mean | 49.58 |
| Africa | 1977 | median | 49.27 |
| Africa | 1982 | mean | 51.59 |
| Africa | 1982 | median | 50.76 |
| Africa | 1987 | mean | 53.34 |
| Africa | 1987 | median | 51.64 |
| Africa | 1992 | mean | 53.63 |
| Africa | 1992 | median | 52.43 |
| Africa | 1997 | mean | 53.60 |
| Africa | 1997 | median | 52.76 |
| Africa | 2002 | mean | 53.33 |
| Africa | 2002 | median | 51.24 |
| Africa | 2007 | mean | 54.81 |
| Africa | 2007 | median | 52.93 |
| Americas | 1952 | mean | 53.28 |
| Americas | 1952 | median | 54.74 |
| Americas | 1957 | mean | 55.96 |
| Americas | 1957 | median | 56.07 |
| Americas | 1962 | mean | 58.40 |
| Americas | 1962 | median | 58.30 |
| Americas | 1967 | mean | 60.41 |
| Americas | 1967 | median | 60.52 |
| Americas | 1972 | mean | 62.39 |
| Americas | 1972 | median | 63.44 |
| Americas | 1977 | mean | 64.39 |
| Americas | 1977 | median | 66.35 |
| Americas | 1982 | mean | 66.23 |
| Americas | 1982 | median | 67.41 |
| Americas | 1987 | mean | 68.09 |
| Americas | 1987 | median | 69.50 |
| Americas | 1992 | mean | 69.57 |
| Americas | 1992 | median | 69.86 |
| Americas | 1997 | mean | 71.15 |
| Americas | 1997 | median | 72.15 |
| Americas | 2002 | mean | 72.42 |
| Americas | 2002 | median | 72.05 |
| Americas | 2007 | mean | 73.61 |
| Americas | 2007 | median | 72.90 |
| Asia | 1952 | mean | 46.31 |
| Asia | 1952 | median | 44.87 |
| Asia | 1957 | mean | 49.32 |
| Asia | 1957 | median | 48.28 |
| Asia | 1962 | mean | 51.56 |
| Asia | 1962 | median | 49.33 |
| Asia | 1967 | mean | 54.66 |
| Asia | 1967 | median | 53.66 |
| Asia | 1972 | mean | 57.32 |
| Asia | 1972 | median | 56.95 |
| Asia | 1977 | mean | 59.61 |
| Asia | 1977 | median | 60.77 |
| Asia | 1982 | mean | 62.62 |
| Asia | 1982 | median | 63.74 |
| Asia | 1987 | mean | 64.85 |
| Asia | 1987 | median | 66.30 |
| Asia | 1992 | mean | 66.54 |
| Asia | 1992 | median | 68.69 |
| Asia | 1997 | mean | 68.02 |
| Asia | 1997 | median | 70.27 |
| Asia | 2002 | mean | 69.23 |
| Asia | 2002 | median | 71.03 |
| Asia | 2007 | mean | 70.73 |
| Asia | 2007 | median | 72.40 |
| Europe | 1952 | mean | 64.41 |
| Europe | 1952 | median | 65.90 |
| Europe | 1957 | mean | 66.70 |
| Europe | 1957 | median | 67.65 |
| Europe | 1962 | mean | 68.54 |
| Europe | 1962 | median | 69.53 |
| Europe | 1967 | mean | 69.74 |
| Europe | 1967 | median | 70.61 |
| Europe | 1972 | mean | 70.78 |
| Europe | 1972 | median | 70.89 |
| Europe | 1977 | mean | 71.94 |
| Europe | 1977 | median | 72.34 |
| Europe | 1982 | mean | 72.81 |
| Europe | 1982 | median | 73.49 |
| Europe | 1987 | mean | 73.64 |
| Europe | 1987 | median | 74.81 |
| Europe | 1992 | mean | 74.44 |
| Europe | 1992 | median | 75.45 |
| Europe | 1997 | mean | 75.51 |
| Europe | 1997 | median | 76.12 |
| Europe | 2002 | mean | 76.70 |
| Europe | 2002 | median | 77.54 |
| Europe | 2007 | mean | 77.65 |
| Europe | 2007 | median | 78.61 |
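Since the wide summary already exists, the tall table could also be derived from it instead of being recomputed from the raw data. A minimal base R sketch, using a two-row stand-in, `wideToy`, that holds the 1952 values for Africa and Europe from the wide table (the names `wideToy` and `tallToy` are introduced here for illustration only):

```r
# Stand-in for lifeExpMeanByYearCont (1952 rows for Africa and Europe)
wideToy <- data.frame(continent = c("Africa", "Europe"),
                      year = c(1952, 1952),
                      meanLifeExp = c(39.14, 64.41),
                      medianLifeExp = c(38.83, 65.90))

# Stack the two statistic columns and record which statistic each row holds
tallToy <- rbind(
  data.frame(wideToy[c("continent", "year")], factor = "mean",
             LifeExp = wideToy$meanLifeExp),
  data.frame(wideToy[c("continent", "year")], factor = "median",
             LifeExp = wideToy$medianLifeExp))

# Match the row order of the tall table: by continent, year, then statistic
tallToy <- tallToy[order(tallToy$continent, tallToy$year, tallToy$factor), ]
tallToy
```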
Ok, let's try producing the figure in ggplot now.
ggplot(lifeExpMeanByYearContTall, aes(y = LifeExp, x = year, color = continent)) +
geom_line() + facet_wrap(~factor) + ggtitle("Mean and Median Life Expectancies By Continent")
Success!! And the key is included without having to explicitly state it, nice!
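On the earlier question of avoiding the reshape: one option, offered as a sketch rather than a recommended approach, is to add one `geom_line()` layer per column and let each layer supply its own `y` aesthetic. This overlays the mean and median in a single panel instead of faceting them. The data frame `wideToy` is a small stand-in built from the 1952 and 2007 rows for Africa and Europe in the wide table.

```r
library(ggplot2)

# Stand-in for lifeExpMeanByYearCont (Africa and Europe, 1952 and 2007)
wideToy <- data.frame(continent = rep(c("Africa", "Europe"), each = 2),
                      year = rep(c(1952, 2007), times = 2),
                      meanLifeExp = c(39.14, 54.81, 64.41, 77.65),
                      medianLifeExp = c(38.83, 52.93, 65.90, 78.61))

# One layer per statistic; a constant inside aes() creates the linetype legend
noReshapePlot <- ggplot(wideToy, aes(x = year, colour = continent)) +
  geom_line(aes(y = meanLifeExp, linetype = "mean")) +
  geom_line(aes(y = medianLifeExp, linetype = "median")) +
  labs(y = "Life expectancy", linetype = "statistic") +
  ggtitle("Mean and Median Life Expectancies By Continent")
print(noReshapePlot)
```

This works for two columns but grows awkward with more, and it cannot be faceted by statistic, so the tall format still seems the more natural fit for ggplot2.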
For the third figure, we can examine how GDP per capita changes over time by continent. We can visually depict this information using a stripplot. Let's do it using lattice first and then try to do it in ggplot.
stripplot(gdpPercap ~ factor(year), iDat, jitter.data = TRUE, groups = reorder(continent,
gdpPercap), type = c("p", "a"), alpha = 0.6, grid = "h", main = "GDP Per Capita By Continent",
ylab = "GDP Per Capita", xlab = "Year", auto.key = list(space = "right"))
This plot is effective in showing how the continents compare in the range of their countries' GDP per capita. We can clearly see that Asia has the most extreme high outliers, yet it also has many countries with some of the lowest GDP per capita in the world. Now let's try to make the plot using ggplot.
ggplot(iDat, aes(y = gdpPercap, x = year, color = continent)) + geom_point()
Two main observations: 1) The points for each year fall on a single vertical line, so they overlap and are hard to see. 2) The x-axis tick marks do not correspond to the years in the data.
Let's redo the figure by adding some random noise to make the points easier to see and try to customize the tick marks so that they will correspond with the actual years in the data.
ggplot(iDat, aes(y = gdpPercap, x = year, color = continent)) + geom_jitter(position = position_jitter(width = 0.5)) +
scale_x_continuous(breaks = seq(1952, 2007, by = 5)) + ggtitle("GDP Per Capita By Continent")
Success!! This shows how versatile ggplot can be in terms of being able to customize graphs to fit the way that you want to present the data.
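One thing the lattice stripplot had that this figure lacks is the line joining the continent averages (`type = "a"`). A possible way to approximate it in ggplot2 is sketched below on made-up data. It assumes a version of `stat_summary()` that accepts the `fun` argument (older releases used `fun.y`), and `toyDat` and its values are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: two continents, two years, two countries per cell
toyDat <- data.frame(continent = rep(c("Africa", "Asia"), each = 4),
                     year = rep(c(1952, 1952, 1957, 1957), times = 2),
                     gdpPercap = c(500, 700, 600, 900, 800, 1200, 1000, 1600))

stripPlot <- ggplot(toyDat, aes(x = year, y = gdpPercap, colour = continent)) +
  # height = 0 keeps the GDP values exact; only the x position is jittered
  geom_jitter(width = 0.5, height = 0, alpha = 0.6) +
  # one line per continent through the yearly means
  stat_summary(fun = mean, geom = "line") +
  # take the tick positions from the data instead of typing them out
  scale_x_continuous(breaks = unique(toyDat$year))
print(stripPlot)
```

Setting `height = 0` is a deliberate choice here: jitter is useful for separating points along the year axis, but the GDP values themselves should not be perturbed.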