The advantage of the ggplot framework is that it makes advanced chart preparation (relatively) simple (relative, that is, to doing all of this with the plot() command and its add-ons). In this example, we will work through applications (somewhat) more relevant to international relations and oil politics (for more on why this matters for oil politics, see Liou & Musgrave, International Studies Quarterly, 2016).

About the data

AlcoholData.dta contains six variables:

cname, the country name
ccodealp, the three-letter alphabetic country code (these match ISO 3166-1 alpha-3 codes such as FIN and NZL)
ht_region, a region code from people named Hadenius and Teorell
pwt_rgdpch, the Penn World Tables’ real GDP per capita (chained)
alcohollitres, UN statistics on average per-capita alcohol consumption
p_polity2, the 21-point (-10 to 10) democracy score

Loading the data

Set your working directory appropriately and then load the data:

library(foreign)
alcohol <- read.dta("AlcoholData.dta")

names(alcohol)

## [1] "cname"         "ccodealp"      "ht_region"     "p_polity2"    
## [5] "pwt_rgdpch"    "alcohollitres"

library(pastecs)

## Loading required package: boot

options(scipen=100) ## Don't worry about what these options() commands are doing for now
options(digits=2) 
stat.desc(alcohol,norm=FALSE)

##          cname ccodealp ht_region p_polity2  pwt_rgdpch alcohollitres
## nbr.val     NA       NA        NA    162.00       185.0        187.00
## nbr.null    NA       NA        NA      6.00         0.0          4.00
## nbr.na      NA       NA        NA     38.00        15.0         13.00
## min         NA       NA        NA    -10.00       361.5          0.00
## max         NA       NA        NA     10.00     65492.8         16.20
## range       NA       NA        NA     20.00     65131.3         16.20
## sum         NA       NA        NA    523.00   1973398.4        962.90
## median      NA       NA        NA      6.00      6017.6          4.50
## mean        NA       NA        NA      3.23     10667.0          5.15
## SE.mean     NA       NA        NA      0.52       866.7          0.30
## CI.mean     NA       NA        NA      1.02      1709.9          0.60
## var         NA       NA        NA     43.31 138956294.4         17.20
## std.dev     NA       NA        NA      6.58     11788.0          4.15
## coef.var    NA       NA        NA      2.04         1.1          0.81

Hmm, that’s a lot of NAs to see in a data set. I wonder what’s going on?

head(alcohol)

##         cname ccodealp                               ht_region p_polity2
## 1      Belize      BLZ                       10. The Caribbean        NA
## 2  Cape Verde      CPV                   4. Sub-Saharan Africa        NA
## 3     Finland      FIN     5. Western Europe and North America        10
## 4      Poland      POL 1. Eastern Europe and post Soviet Union        10
## 5     Lesotho      LSO                   4. Sub-Saharan Africa         8
## 6 New Zealand      NZL     5. Western Europe and North America        10
##   pwt_rgdpch alcohollitres
## 1       8614           5.8
## 2       6294           2.5
## 3      27300          10.0
## 4      11115           9.5
## 5       1800           1.9
## 6      22804           9.3

Ah, it looks like the first three columns are text (or factor) variables, so of course we can’t take averages of these. Instead, let’s re-run stat.desc but this time only focusing on the numerical variables (both continuous and categorical):

stat.desc(alcohol[,c(4:6)],norm=FALSE)

##              p_polity2  pwt_rgdpch alcohollitres
## nbr.val         162.00       185.0        187.00
## nbr.null          6.00         0.0          4.00
## nbr.na           38.00        15.0         13.00
## min             -10.00       361.5          0.00
## max              10.00     65492.8         16.20
## range            20.00     65131.3         16.20
## sum             523.00   1973398.4        962.90
## median            6.00      6017.6          4.50
## mean              3.23     10667.0          5.15
## SE.mean           0.52       866.7          0.30
## CI.mean.0.95      1.02      1709.9          0.60
## var              43.31 138956294.4         17.20
## std.dev           6.58     11788.0          4.15
## coef.var          2.04         1.1          0.81
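
Picking columns by position (4:6) works here, but it silently breaks if the column order ever changes. A minimal sketch of checking column types and selecting by type instead, using a small made-up data frame (toy and all of its values are invented for illustration; they are not drawn from AlcoholData.dta):

```r
# Toy stand-in for the alcohol data; all values here are invented.
toy <- data.frame(
  cname         = c("A", "B", "C"),
  ht_region     = factor(c("East", "West", "East")),
  p_polity2     = c(10, -3, NA),
  alcohollitres = c(9.5, 1.9, 4.2)
)

# What type is each column? Text and factor columns explain the NAs.
sapply(toy, class)

# Keep only the numeric columns, whatever their position.
numeric.cols <- sapply(toy, is.numeric)
toy.numeric  <- toy[, numeric.cols, drop = FALSE]
names(toy.numeric)

# Base R means of the numeric columns; na.rm = TRUE skips missing values.
sapply(toy.numeric, mean, na.rm = TRUE)
```

The same numeric-only data frame could then be handed to stat.desc() in place of alcohol[,c(4:6)].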

Plotting the data

There are, potentially, many interesting relationships to plot here. I tend to start with a basic scatterplot, like this one:

library(ggplot2)

base.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
base.plot + geom_point() + 
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold"))

ggsave("AlcoholBasePlot.png") ## Saves the most recently displayed plot to your working directory
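
By default ggsave() saves whichever plot was displayed last, which is easy to get wrong in a long script. A hedged sketch of saving a named plot object at a fixed size instead; the toy data, the object names, and the output file name are all invented for illustration:

```r
library(ggplot2)

# Invented values standing in for the alcohol data.
toy <- data.frame(
  pwt_rgdpch    = c(1800, 6294, 11115, 27300),
  alcohollitres = c(1.9, 2.5, 9.5, 10.0)
)

# Store the finished plot in an object so it can be saved by name.
toy.plot <- ggplot(toy, aes(pwt_rgdpch, alcohollitres)) +
    geom_point() +
    xlab("Real Per Capita GDP") +
    ylab("Litres of Alcohol Consumed Per Person Annually")

# Save that specific plot at a fixed size (in inches) to a temporary file.
out.file <- file.path(tempdir(), "ToyBasePlot.png")
ggsave(out.file, plot = toy.plot, width = 6, height = 4, dpi = 300)
```

Fixing width and height makes the saved image the same on every machine, rather than depending on the size of the plotting window.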

Size based on other variables

And as we’ve seen, we can use other information to start to display categories and other variables using bubble charts, colors, shapes, and so on. Here, we’ll make an unusual choice: using size to display an ordinal variable, p_polity2 (a score that takes whole-number values from -10 to 10), mostly to show how the code works.

sized.plot <-  ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
sized.plot + geom_point(aes(size=p_polity2)) +
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold"))

ggsave("AlcoholSizePlot.png")

Color for Categories

And now, let’s use regions as the basis for color:

color.plot <-  ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
color.plot + geom_point(aes(color=ht_region)) +
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold"))

Plotting names instead of dots

Once you remember that a chart is just an xy coordinate plane, it becomes possible to envision using other sorts of plotting characters to convey information. We are often interested in showing which units are displaying which tendencies, not just in learning the overall trends.

name.plot <-  ggplot(alcohol,aes(pwt_rgdpch,alcohollitres,label=ccodealp))
name.plot + geom_text(aes(color=ht_region)) +
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold"))

This is a messy plot, because our aim here is to show the code, but it does help show outliers clearly (e.g., Luxembourg, Estonia, Czechia, Qatar, Brunei…).
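
One way to keep the plot readable is to draw points for every country and print names only for the cases worth calling out. A minimal sketch under stated assumptions: the toy data are invented, and the cut-offs used to define an "outlier" are hypothetical, chosen only to illustrate the idea:

```r
library(ggplot2)

# Invented values standing in for the alcohol data.
toy <- data.frame(
  ccodealp      = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  pwt_rgdpch    = c(1800, 6294, 11115, 27300, 65000),
  alcohollitres = c(1.9, 2.5, 9.5, 10.0, 13.0)
)

# Hypothetical cut-offs for "worth labelling"; choose your own.
outliers <- subset(toy, pwt_rgdpch > 40000 | alcohollitres > 12)

# Points for everyone, text only for the outliers. check_overlap = TRUE
# tells geom_text() to skip any label that would cover an earlier one.
label.plot <- ggplot(toy, aes(pwt_rgdpch, alcohollitres)) +
    geom_point() +
    geom_text(data = outliers, aes(label = ccodealp),
              vjust = -1, check_overlap = TRUE)
```

Passing a different data frame to a single layer through its data argument leaves the other layers untouched, so the scatterplot still shows the full set of countries.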

Facetting code

A useful strength of ggplot is the ease with which it allows us to visualize different relationships by categorical variable using the facet_wrap() command. You can learn more about using facets at this link, but here we’ll work on a quick example, subsetting by region:

facet.plot <- ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() + 
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold")) +
  facet_wrap(~ht_region)

This is actually a fine plot, except that it includes NA as a region. We know that NA is R’s way of saying “missing data”, and there’s no country in the world that exists in the Missing Data region.

So we want to use the subset command to exclude any country for which we don’t have region data. The syntax here is subset(dataname,logicalstatement); use, as always, ?subset to learn more about it. The logical condition we will use is !is.na(ht_region), which means “the region is not missing”. (Writing ht_region != "NA" happens to return the same rows, but only because subset() drops rows where the condition itself evaluates to NA; "NA" in quotes is an ordinary string, not R’s missing value.)

facet.plot <- ggplot(subset(alcohol,!is.na(ht_region)),aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() + 
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold")) +
  facet_wrap(~ht_region)
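
A note on testing for missing values: NA is not the same thing as the string "NA", and the difference shows up as soon as you leave subset(). A small base R sketch with an invented three-country data frame (toy and its values are made up for illustration):

```r
# Invented three-country example; the middle country has no region recorded.
toy <- data.frame(
  cname     = c("A", "B", "C"),
  ht_region = factor(c("East", NA, "West"))
)

# Comparing against the string "NA" does not return FALSE for the missing
# row; it returns NA, because R cannot say what a missing value equals.
toy$ht_region != "NA"

# is.na() asks the question directly and always answers TRUE or FALSE.
is.na(toy$ht_region)

# Keep only rows with a recorded region.
kept <- subset(toy, !is.na(ht_region))
nrow(kept)

# With ordinary bracket subsetting, the NA condition is not dropped:
# it produces a row of NAs, which is why is.na() is the safer habit.
nrow(toy[toy$ht_region != "NA", ])
```

subset() quietly treats an NA condition as "do not keep", while bracket subsetting does not, so code that relies on comparing to "NA" can behave differently depending on how the rows are selected.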

This is pretty close to good enough (for now; remember you could do all the tricks with shape, size, and color within each facet, too). But the labels for the factor levels are a little difficult to read at this size. We’ll be talking about factors. For now, we can shorten the labels by recoding the levels of the region factor:

alcohol$new_region <- alcohol$ht_region  ### just copying the original data

## don't directly manipulate original data; work from a copy instead


## use the levels() command to see what the existing levels of a factor
## variable are.
levels(alcohol$new_region)

##  [1] "1. Eastern Europe and post Soviet Union"
##  [2] "2. Latin America"                       
##  [3] "3. North Africa & the Middle East"      
##  [4] "4. Sub-Saharan Africa"                  
##  [5] "5. Western Europe and North America"    
##  [6] "6. East Asia"                           
##  [7] "7. South-East Asia"                     
##  [8] "8. South Asia"                          
##  [9] "9. The Pacific"                         
## [10] "10. The Caribbean"

## Just like colnames() and row.names(), levels() is both a command
## that tells you what's in it and that lets you reassign values

## Here, we're going to use code from PS2 solution set to recode
## the levels of the regions to be much, much shorter

## note that each new value has to be in the same place as the element
## it's replacing, and that we have to specify the values of _everything_
## we want in the absence of subscripting.

levels(alcohol$new_region) <- c("E. Europe + Fmr USSR",
                               "Latin America",
                               "N. Africa and Mideast", 
                               "Sub-Saharan Africa",
                               "W. Europe and N. America",
                               "East Asia", 
                               "South-East Asia",
                               "South Asia",
                               "The Pacific",
                               "The Caribbean")

## and now write code to use the new region codes

facet.plot <- ggplot(subset(alcohol,!is.na(new_region)),aes(pwt_rgdpch,alcohollitres))
facet.plot + geom_point() + 
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold")) +
  facet_wrap(~new_region)

## Warning: Removed 18 rows containing missing values (geom_point).
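
Because assigning to levels() replaces labels purely by position, it is worth checking on a small case that every observation ends up with the label you intended. A minimal base R sketch with an invented factor (region and short.region are names made up for this illustration):

```r
# Invented factor with long labels and one missing value.
region <- factor(c("2. Latin America",
                   "1. Eastern Europe and post Soviet Union",
                   "2. Latin America",
                   NA))

# By default factor() sorts levels alphabetically, so "1. ..." comes first.
levels(region)

# Work on a copy, then replace the levels in the same order they are listed.
short.region <- region
levels(short.region) <- c("E. Europe + Fmr USSR", "Latin America")

# Every observation keeps its category; only the label text has changed,
# and the missing value is still missing.
as.character(short.region)

# Cross-tabulating old against new labels confirms the one-to-one mapping.
table(region, short.region, useNA = "ifany")
```

If the cross-tabulation shows any old level spread across more than one new level, or two old levels landing on the same new one unexpectedly, the replacement vector was in the wrong order.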

Labelling legends appropriately

Putting labels on legends is important, and it’s actually fairly straightforward. Just note whether the scale you are using is “discrete” (that is, treated as categorical) or “continuous” (that is, … continuous) and whether it’s the “colour” (or color; the guy who wrote ggplot2 is from New Zealand) or “size” etc. For instance, look at what scale_colour_discrete and scale_size_continuous are doing here:

color.plot <-  ggplot(alcohol,aes(pwt_rgdpch,alcohollitres))
color.plot + geom_point(aes(size=p_polity2,color=ht_region)) +
    ylab("Litres of Alcohol Consumed Per Person Annually") + 
    xlab("Real Per Capita GDP") + 
    labs(title="Alcohol Consumption\nby Real Per Capita GDP") +
    theme(plot.title=element_text(face="bold")) +
    scale_colour_discrete(name="Region") +
    scale_size_continuous(name="Polity Score")

ggsave("AlcoholFancy.png")
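
If all you need is to rename the legends, labs() also accepts aesthetic names, so the axis titles and legend titles can be set in one place. A hedged sketch of that alternative, with invented data and region names (this is one option, not a replacement for the scale_ functions, which are still needed for breaks, limits, or palettes):

```r
library(ggplot2)

# Invented values standing in for the alcohol data.
toy <- data.frame(
  pwt_rgdpch    = c(1800, 6294, 11115, 27300),
  alcohollitres = c(1.9, 2.5, 9.5, 10.0),
  p_polity2     = c(8, -3, 10, 10),
  ht_region     = factor(c("Africa", "Africa", "Europe", "Europe"))
)

# Each argument to labs() names the aesthetic whose title it sets:
# x and y for the axes, colour and size for the two legends.
legend.plot <- ggplot(toy, aes(pwt_rgdpch, alcohollitres)) +
    geom_point(aes(size = p_polity2, colour = ht_region)) +
    labs(x = "Real Per Capita GDP",
         y = "Litres of Alcohol Consumed Per Person Annually",
         colour = "Region",
         size = "Polity Score")
```

The names inside labs() have to match the aesthetics used in aes(); a title given for an aesthetic the plot does not map is simply ignored.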

Handout 12: More with ggplot

Paul Musgrave

October 27, 2016