The advantage of the ggplot framework is that it makes advanced chart preparation (relatively) simple (relative, that is, to doing all of this with the plot() command and its add-ons). In this example, we will work through applications (somewhat) more relevant to international relations and oil politics (for more on why this matters for oil politics, see Liou & Musgrave, International Studies Quarterly, 2016).
AlcoholData.dta contains six variables:
cname, the country name
ccodealp, the three-letter country code from a dataset named “ALP”
ht_region, a region code from people named Hadenius and Teorell
pwt_rgdpch, the Penn World Tables’ real GDP per capita (chained)
alcohollitres, UN statistics on average per-capita alcohol consumption
p_polity2, the 21-point (-10 to 10) democracy score
Set your working directory appropriately and then load the data:
library(foreign)
alcohol <- read.dta("AlcoholData.dta")
names(alcohol)
## [1] "cname" "ccodealp" "ht_region" "p_polity2"
## [5] "pwt_rgdpch" "alcohollitres"
library(pastecs)
## Loading required package: boot
options(scipen=100) ## Don't worry about what these options() commands are doing for now
options(digits=2)
stat.desc(alcohol,norm=FALSE)
## cname ccodealp ht_region p_polity2 pwt_rgdpch alcohollitres
## nbr.val NA NA NA 162.00 185.0 187.00
## nbr.null NA NA NA 6.00 0.0 4.00
## nbr.na NA NA NA 38.00 15.0 13.00
## min NA NA NA -10.00 361.5 0.00
## max NA NA NA 10.00 65492.8 16.20
## range NA NA NA 20.00 65131.3 16.20
## sum NA NA NA 523.00 1973398.4 962.90
## median NA NA NA 6.00 6017.6 4.50
## mean NA NA NA 3.23 10667.0 5.15
## SE.mean NA NA NA 0.52 866.7 0.30
## CI.mean NA NA NA 1.02 1709.9 0.60
## var NA NA NA 43.31 138956294.4 17.20
## std.dev NA NA NA 6.58 11788.0 4.15
## coef.var NA NA NA 2.04 1.1 0.81
Hmm, that’s a lot of NAs to see in a data set. I wonder what’s going on?
head(alcohol)
## cname ccodealp ht_region p_polity2
## 1 Belize BLZ 10. The Caribbean NA
## 2 Cape Verde CPV 4. Sub-Saharan Africa NA
## 3 Finland FIN 5. Western Europe and North America 10
## 4 Poland POL 1. Eastern Europe and post Soviet Union 10
## 5 Lesotho LSO 4. Sub-Saharan Africa 8
## 6 New Zealand NZL 5. Western Europe and North America 10
## pwt_rgdpch alcohollitres
## 1 8614 5.8
## 2 6294 2.5
## 3 27300 10.0
## 4 11115 9.5
## 5 1800 1.9
## 6 22804 9.3
Ah, it looks like the first three columns are text (or factor) variables, so of course we can’t take averages of these. Instead, let’s re-run stat.desc, this time focusing only on the numeric variables (both continuous and categorical):
stat.desc(alcohol[,c(4:6)],norm=FALSE)
## p_polity2 pwt_rgdpch alcohollitres
## nbr.val 162.00 185.0 187.00
## nbr.null 6.00 0.0 4.00
## nbr.na 38.00 15.0 13.00
## min -10.00 361.5 0.00
## max 10.00 65492.8 16.20
## range 20.00 65131.3 16.20
## sum 523.00 1973398.4 962.90
## median 6.00 6017.6 4.50
## mean 3.23 10667.0 5.15
## SE.mean 0.52 866.7 0.30
## CI.mean.0.95 1.02 1709.9 0.60
## var 43.31 138956294.4 17.20
## std.dev 6.58 11788.0 4.15
## coef.var 2.04 1.1 0.81
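Selecting columns by position (4:6) works here, but it quietly picks the wrong columns if the column order ever changes. A minimal sketch of selecting by type instead, using a small made-up data frame (toy and its values are hypothetical stand-ins, not rows from AlcoholData.dta):

```r
# Hypothetical stand-in for the alcohol data; values are invented for illustration
toy <- data.frame(
  cname = c("A", "B", "C"),
  p_polity2 = c(10, NA, -7),
  alcohollitres = c(9.5, 2.5, 0.4),
  stringsAsFactors = TRUE
)

# TRUE for each column that holds numbers, FALSE for text/factor columns
is_num <- sapply(toy, is.numeric)

# Keep only the numeric columns, whatever their position
toy_numeric <- toy[, is_num, drop = FALSE]
names(toy_numeric)
```

With the real data, the same idea would be stat.desc(alcohol[, sapply(alcohol, is.numeric)], norm=FALSE), assuming pastecs is loaded as above.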
There are, potentially, many interesting relationships to plot here. I tend to start with a basic scatterplot, like this one:
library(ggplot2)
base.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
base.plot + geom_point() +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold"))
ggsave("AlcoholBasePlot.png") ## Saves the plot displayed to your working directory
And as we’ve seen, we can use other information to display categories and other variables using bubble charts, colors, shapes, and so on. Here, we’ll make an unusual choice: using size to display p_polity2, an ordinal score that R stores as a number and that ggplot therefore treats as continuous, mostly to show how the code works.
sized.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
sized.plot + geom_point(aes(size=p_polity2)) +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold"))
ggsave("AlcoholSizePlot.png")
And now, let’s use regions as the basis for color:
color.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
color.plot + geom_point(aes(color=ht_region)) +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold"))
Once you remember that a chart is just an xy coordinate plane, it becomes possible to envision using other sorts of plotting characters to convey information. We are often interested in showing which units are displaying which tendencies, not just in learning the overall trends.
name.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres,label=ccodealp))
name.plot + geom_text(aes(color=ht_region)) +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold"))
This is a messy plot, because our aim here is to show the code, but it does help show outliers clearly (e.g., Luxembourg, Estonia, Czechia, Qatar, Brunei…).
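One way to make a labelled plot less messy is to draw every country as a point and add text only for the units you want to call out. The sketch below does this by passing a subset of the data to geom_text(); the data frame toy, its codes and values, and the cutoffs used to pick "interesting" rows are all hypothetical choices made for illustration.

```r
library(ggplot2)

# Hypothetical toy data; codes and values are invented for illustration
toy <- data.frame(
  code = c("AAA", "BBB", "CCC", "DDD"),
  gdp = c(2000, 9000, 30000, 60000),
  litres = c(2.0, 5.5, 15.0, 1.0)
)

# Hypothetical rule for which rows to label: unusually high consumption or income
to_label <- subset(toy, litres > 12 | gdp > 50000)

p <- ggplot(toy, aes(gdp, litres)) +
  geom_point() +
  # Only the rows in to_label get text; vjust nudges the label above the point
  geom_text(data = to_label, aes(label = code), vjust = -1)
p
```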
A useful strength of ggplot is the ease with which it allows us to visualize different relationships by a categorical variable using the facet_wrap() command. You can learn more about using facets at this link, but here we’ll work on a quick example, subsetting by region:
facet.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold")) +
facet_wrap(~ht_region)
This is actually a fine plot, except that it includes NA as a region. We know that NA is R’s way of saying “missing data”, and there’s no country in the world that exists in the Missing Data region.
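So we want to use the subset command to exclude any country for which we don’t have region data. The syntax here is subset(dataname,logicalstatement); use, as always, ?subset() to learn more about it. The logical condition we will use is ht_region != "NA". Strictly speaking, this compares each region to the text "NA" rather than testing for missingness; for a country with a missing region the comparison itself evaluates to NA, and subset() drops any row whose condition is NA, so the countries without region data are excluded.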
facet.plot <- ggplot(subset(alcohol,ht_region!="NA"),aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold")) +
facet_wrap(~ht_region)
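The comparison against "NA" used above gets the right answer somewhat indirectly. A short sketch of what it actually returns, next to the more direct is.na() test, using a hypothetical three-row toy data set (the litres values are invented):

```r
# Hypothetical toy factor with one true missing value
region <- factor(c("2. Latin America", NA, "6. East Asia"))

# Comparing against the string "NA" returns NA (not FALSE) for the missing
# entry; subset() treats an NA condition as "drop this row"
region != "NA"

# is.na() tests for missingness directly and always returns TRUE or FALSE
keep <- !is.na(region)
keep

# Both conditions keep the same rows when used inside subset()
toy <- data.frame(region = region, litres = c(5.1, 3.2, 6.4))
nrow(subset(toy, !is.na(region)))
nrow(subset(toy, region != "NA"))
```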
This is pretty close to good enough (for now; remember you could do all the tricks with shape, size, and color within each facet, too). But the label for each facet is a little difficult to read at this size. We’ll be talking about factors; for now, the code below shortens the labels by recoding the levels of the region factor.
alcohol$new_region <- alcohol$ht_region ### just copying the original data
## don't directly manipulate original data; work from a copy instead
## use the levels() command to see what the existing levels of a factor
## variable are.
levels(alcohol$new_region)
## [1] "1. Eastern Europe and post Soviet Union"
## [2] "2. Latin America"
## [3] "3. North Africa & the Middle East"
## [4] "4. Sub-Saharan Africa"
## [5] "5. Western Europe and North America"
## [6] "6. East Asia"
## [7] "7. South-East Asia"
## [8] "8. South Asia"
## [9] "9. The Pacific"
## [10] "10. The Caribbean"
## Just like colnames() and row.names(), levels() is both a command
## that tells you what's in it and that lets you reassign values
## Here, we're going to use code from PS2 solution set to recode
## the levels of the regions to be much, much shorter
## note that each new value has to be in the same place as the element
## it's replacing, and that we have to specify the values of _everything_
## we want in the absence of subscripting.
levels(alcohol$new_region) <- c("E. Europe + Fmr USSR",
"Latin America",
"N. Africa and Mideast",
"Sub-Saharan Africa",
"W. Europe and N. America",
"East Asia",
"South-East Asia",
"South Asia",
"The Pacific",
"The Caribbean")
## and now write code to use the new region codes
facet.plot <- ggplot(subset(alcohol,new_region!="NA"),aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold")) +
facet_wrap(~new_region)
## Warning: Removed 18 rows containing missing values (geom_point).
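The recoding above replaces all ten levels at once, by position, which is why every value had to be listed in order. When only one or two labels need changing, subscripting levels() by name avoids that bookkeeping. A minimal sketch on a hypothetical toy factor that borrows two of the original region labels:

```r
# Hypothetical toy factor using two of the original region labels
region <- factor(c("6. East Asia", "2. Latin America", "6. East Asia"))

# Replace only the level whose current name matches; other levels are untouched
levels(region)[levels(region) == "6. East Asia"] <- "East Asia"

levels(region)
as.character(region)
```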
Putting labels on legends is important; it’s actually fairly straightforward. Just note whether the scale you are using is “discrete” (that is, treated as categorical) or “continuous” (that is, … continuous) and whether it’s the “colour” (or color; the guy who wrote this is from New Zealand) or “size” etc. For instance, look at what scale_colour_discrete and scale_size_continuous are doing here:
color.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
color.plot + geom_point(aes(size=p_polity2,color=ht_region)) +
ylab("Litres of Alcohol Consumed Per Person Annually") +
xlab("Real Per Capita GDP") +
labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
theme(plot.title=element_text(face="bold")) +
scale_colour_discrete(name="Region") +
scale_size_continuous(name="Polity Score")
ggsave("AlcoholFancy.png")
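If all you want to change is the legend title, labs() offers a shortcut that does not require knowing whether the scale is discrete or continuous: name the aesthetic and give it a title. A small sketch, assuming a hypothetical toy data frame whose column names and values are invented for illustration:

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  gdp = c(1800, 11115, 27300),
  litres = c(1.9, 9.5, 10.0),
  polity = c(8, 10, 10),
  region = c("Sub-Saharan Africa", "E. Europe", "W. Europe")
)

p <- ggplot(toy, aes(gdp, litres)) +
  geom_point(aes(size = polity, color = region)) +
  # labs() sets each legend title by aesthetic name
  labs(color = "Region", size = "Polity Score")
p
```

The scale_*() functions remain the better choice when you also need to change breaks, labels, or the range of the scale, since labs() only touches titles.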