Sample Program

To start learning ggplot2, build a chart incrementally. Let’s begin by reviewing base R graphics. This illustration uses the dataset from the gapminder package, which has six variables: country, continent, year, life expectancy, population and GDP per capita.

Let’s install the package, load the data and examine the dataset:

install.packages("gapminder", repos = "https://cran.r-project.org")
## 
## The downloaded binary packages are in
##  /var/folders/qp/s6y46pq11y13t0gpnf4_v9vm0000gp/T//RtmpTJttwU/downloaded_packages
library(gapminder)
summary(gapminder) # Summary of Gapminder dataset
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
str(gapminder) # Structure of dataset
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gm <- gapminder # Short alias used in the rest of the examples
head(gm)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
summary(gm)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Base R graphics come with simple plotting functions such as hist() and plot():

# Plot one variable 
hist(gm$lifeExp) # Histogram of variable lifeExp (Life expectancy)

The plot() function can plot two variables against each other, given a formula of the form y ~ x:

# Plot two variables with logged version of x
plot(lifeExp ~ gdpPercap, gm, subset = year == 2007, log = "x", pch=16)

# Plot two variables with selected country
plot(lifeExp ~ year, gm, subset = country == "Cambodia", type = "p")

# Try different plot types 
plot(lifeExp ~ year, gm, subset = country == "Cambodia", type = "l")

# Different symbols
plot(lifeExp ~ year, gm, subset = country == "Cambodia", type = "b", pch=18) 

# Add labels to axes
plot(lifeExp ~ year, gm, subset = country == "Cambodia", type = "b", pch=18, xlab="Year", ylab="Life Expectancy") 

# Specify font
plot(lifeExp ~ year, gm, subset = country == "Cambodia", type = "b", pch=18, xlab="Year", ylab="Life Expectancy",family="Palatino") 
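The subset argument used in the plot() calls above filters the rows before anything is drawn. As a rough sketch, the same chart can be produced by filtering first with subset() and then plotting the smaller data frame. The object name cambodia is introduced here only for illustration.

```r
library(gapminder)

gm <- gapminder

# Filter the rows first (the name `cambodia` is illustrative, not from the original)
cambodia <- subset(gm, country == "Cambodia")
nrow(cambodia)  # one row per survey year for this country

# Equivalent to plot(lifeExp ~ year, gm, subset = country == "Cambodia", ...)
plot(lifeExp ~ year, data = cambodia, type = "b", pch = 18,
     xlab = "Year", ylab = "Life Expectancy")
```

Filtering up front can be convenient when the same subset is reused across several plots.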

With ggplot2, we can apply the Grammar of Graphics approach and modify the chart with more fine-tuning and attention to detail:

Let’s install the package, load it and try it step by step:

install.packages("ggplot2", repos = "https://cran.r-project.org")
library(ggplot2)

# More layered plots using ggplot2, with regression line

p <- ggplot2::ggplot(data = gm) 

Why is there no chart?

Add a layer?

How about now?

library(ggplot2)
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))  # Shows nothing. Why?
p + geom_point()

First, we supply the data component, then add the aesthetic mapping that defines the basics (i.e., which variables go on which axes), followed by the geometric objects that actually draw something. Here is an alternative, more compact form:

# Alternative
p <- ggplot(data=gm, aes(x=gdpPercap, y=lifeExp))

What is still missing?

p + geom_point()
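One way to check the data, aesthetics, geometry sequence described above is to count the layers stored in the plot object. The sketch below uses a tiny data frame built from the rounded values in the head(gm) output. The names df, p_base and p_points are illustrative only.

```r
library(ggplot2)

# Three rows copied (rounded) from the head(gm) output above, to keep the sketch small
df <- data.frame(gdpPercap = c(779, 821, 853),
                 lifeExp   = c(28.8, 30.3, 32.0))

# Data + aesthetic mapping only: the axes are known, but there is nothing to draw
p_base <- ggplot(data = df, mapping = aes(x = gdpPercap, y = lifeExp))
length(p_base$layers)    # no layers yet

# Adding a geom creates the first layer
p_points <- p_base + geom_point()
length(p_points$layers)  # one layer now
```

A plot with zero layers renders as an empty panel, which is why geom_point() is the missing piece.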

Now, we can add more features to the chart.

# Add some color grouping
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp, color=continent))
p + geom_point()

You may try different markers using the pch option (ggplot2 also accepts the same value under the name shape):

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp, color=continent))
p + geom_point(pch=6)

Add a regression line and drop the color grouping:

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(pch=16) + geom_smooth(method="lm") 

geom_smooth() offers several methods for fitting the line:

# geom_smooth methods:
# "auto", "lm", "glm", "gam", "loess"
# Add a smoothed line (color grouping dropped) using another method
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(pch=16) + geom_smooth(method="loess") 
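To get a feel for how the "lm" and "loess" methods differ, it may help to fit both models directly with the base R functions lm() and loess() and compare their predictions. This is only a sketch of the idea: it assumes the smoothers use the formula lifeExp ~ gdpPercap with default settings. The three GDP values in new_gdp are arbitrary illustration points.

```r
library(gapminder)

# Straight-line fit, the model behind method = "lm"
fit_lm <- lm(lifeExp ~ gdpPercap, data = gapminder)

# Local regression with default settings, the model behind method = "loess"
fit_loess <- loess(lifeExp ~ gdpPercap, data = gapminder)

# Arbitrary GDP-per-capita values chosen for illustration
new_gdp <- data.frame(gdpPercap = c(1000, 10000, 40000))

pred_lm    <- predict(fit_lm, new_gdp)
pred_loess <- predict(fit_loess, new_gdp)

# Side-by-side comparison of the two fitted curves at those points
cbind(new_gdp, lm = pred_lm, loess = pred_loess)
```

The linear model forces a constant slope, while the local regression can bend to follow the data.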

Let’s put the data in perspective. Some variables, such as GDP per capita, are highly skewed. Apply a log scale to spread the points out and make them easier to plot:

# Add a regression line with logged x, dropped the color grouping
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(pch=16) + geom_smooth(method="lm") + 
  scale_x_log10()
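To see why the log scale helps, it can be useful to look at the numbers themselves. The sketch below takes the minimum, median, mean and maximum of gdpPercap from the summary() output shown earlier and compares them on the raw and log10 scales. The vector names gdp_stats and log_stats are only illustrative.

```r
# Summary statistics of gdpPercap copied from the summary() output above
gdp_stats <- c(min = 241.2, median = 3531.8, mean = 7215.3, max = 113523.1)

# On the raw scale the maximum dwarfs everything else
round(gdp_stats / gdp_stats[["max"]], 3)

# On the log10 scale the same values are spread far more evenly
log_stats <- log10(gdp_stats)
round(log_stats, 2)
```

Note that scale_x_log10() transforms the data before the smoother is fitted, so the line in the chart above is fitted against log10(gdpPercap) rather than the raw values.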

Focus on color now!

p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = "purple"))
p + geom_point() +
  geom_smooth(method = "loess") +
  scale_x_log10() # Why is it not purple?

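Make sure the color option is applied in the right place. Inside aes(), "purple" is treated as a data value to be mapped, not as a color name. To set a fixed color, pass it to the geom, outside aes().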

#  How about now?
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10()
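The difference between mapping and setting a color can be inspected with layer_data(), which returns the aesthetics ggplot2 actually computed for a layer. The sketch below uses a small data frame built from the rounded head(gm) values. The names df, p_mapped and p_set are illustrative only.

```r
library(ggplot2)

# Three rows copied (rounded) from the head(gm) output above
df <- data.frame(gdpPercap = c(779, 821, 853),
                 lifeExp   = c(28.8, 30.3, 32.0))

# Mapped: "purple" is treated as a one-level categorical variable
p_mapped <- ggplot(df, aes(x = gdpPercap, y = lifeExp, color = "purple")) +
  geom_point()

# Set: "purple" is passed to the geom as a fixed color
p_set <- ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "purple")

col_mapped <- unique(layer_data(p_mapped)$colour)
col_set    <- unique(layer_data(p_set)$colour)

col_mapped  # a color picked from the default palette, not purple
col_set     # the color that was set explicitly
```

The mapped version also produces a legend with a single entry labelled "purple", which is a useful hint that the value was treated as data.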

Swap the default theme (gray background with grid lines) for a cleaner one:

# Add theme
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple", pch=20) +
  theme_bw() +
  geom_smooth(method = "loess") + scale_x_log10()

Here comes the title.

# Add title
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  theme_bw() + 
  ggtitle("Life Expectancy and GDP Per Capita (logged)") 

There is a better way to set the title, axis labels and caption all at once: the labs() function.

# Add title, labels and caption (located at bottom)
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  theme_bw() +
  labs(title="Life Expectancy and GDP Per Capita (logged)", 
       x="GDP Per Capita",y="Life Expectancy",caption="") 

Aligning the title and caption takes a bit more work. Use the theme() function to specify the alignment.

# Why is the title not centered? Titles are left-aligned by default; hjust = 0.5 centers them
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  theme_bw() +
  labs(title="Life Expectancy and GDP Per Capita (logged)", 
       x="GDP Per Capita",y="Life Expectancy") + 
  theme(plot.title = element_text(hjust = 0.5))
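The same approach extends to the caption. As a minimal sketch, the example below centers the title and left-aligns the caption through the plot.title and plot.caption theme elements. The caption text is a placeholder chosen for this sketch, and the name p_aligned is illustrative.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# The caption text below is a placeholder chosen for this sketch
p_aligned <- p + geom_point(color = "purple") +
  scale_x_log10() +
  theme_bw() +
  labs(title = "Life Expectancy and GDP Per Capita (logged)",
       x = "GDP Per Capita", y = "Life Expectancy",
       caption = "Source: gapminder package") +
  theme(plot.title = element_text(hjust = 0.5),   # 0.5 = centered
        plot.caption = element_text(hjust = 0))   # 0 = left, 1 = right

p_aligned
```

The theme() call comes after theme_bw() on purpose: a complete theme added later would overwrite the alignment settings.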

And the font (the chosen family must be installed on your system):

# Set font
p <- ggplot(data = gm,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(color = "purple") +
  geom_smooth(method = "loess") + scale_x_log10() +
  theme_bw() +
  labs(title="Life Expectancy and GDP Per Capita (logged)", 
       x="GDP Per Capita",y="Life Expectancy") + 
  theme(plot.title = element_text(hjust = 0.5),
        text=element_text(size=16,family="Palatino"))

Revised: 3/13/2019