ggplot2 Tutor

GGPLOT2

ggplot2 works with data frames, not individual vectors. All the data needed to make a plot is typically contained in a data frame supplied to ggplot() itself, or it can be supplied to the respective geoms.
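As a minimal sketch of the two ways of supplying data, the snippet below builds the same scatterplot twice: once with the data frame given to ggplot(), and once with it given to the geom. It assumes the copy of midwest bundled with ggplot2 has the same area and poptotal columns as the CSV used in this tutorial; the names mw, p_global and p_layer are illustrative only.

```r
library(ggplot2)

# Assumption: the midwest data bundled with ggplot2, stored under the
# illustrative name `mw` so it does not clash with other objects.
mw <- ggplot2::midwest

# Data and mapping supplied once to ggplot(); every layer inherits them.
p_global <- ggplot(mw, aes(x = area, y = poptotal)) + geom_point()

# Data and mapping supplied to the geom itself; ggplot() starts empty.
p_layer <- ggplot() + geom_point(data = mw, aes(x = area, y = poptotal))
```

Both objects should render the same points. Supplying data per geom is mostly useful when different layers need different data frames.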

##Getting data for all the plots starting here
midwest <- read.csv("http://goo.gl/G1K41K")
head(midwest)

##   PID    county state  area poptotal popdensity popwhite popblack
## 1 561     ADAMS    IL 0.052    66090  1270.9615    63917     1702
## 2 562 ALEXANDER    IL 0.014    10626   759.0000     7054     3496
## 3 563      BOND    IL 0.022    14991   681.4091    14477      429
## 4 564     BOONE    IL 0.017    30806  1812.1176    29344      127
## 5 565     BROWN    IL 0.018     5836   324.2222     5264      547
## 6 566    BUREAU    IL 0.050    35688   713.7600    35157       50
##   popamerindian popasian popother percwhite  percblack percamerindan
## 1            98      249      124  96.71206  2.5752761     0.1482826
## 2            19       48        9  66.38434 32.9004329     0.1788067
## 3            35       16       34  96.57128  2.8617170     0.2334734
## 4            46      150     1139  95.25417  0.4122574     0.1493216
## 5            14        5        6  90.19877  9.3728581     0.2398903
## 6            65      195      221  98.51210  0.1401031     0.1821340
##    percasian  percother popadults  perchsd percollege percprof
## 1 0.37675897 0.18762294     43298 75.10740   19.63139 4.355859
## 2 0.45172219 0.08469791      6724 59.72635   11.24331 2.870315
## 3 0.10673071 0.22680275      9669 69.33499   17.03382 4.488572
## 4 0.48691813 3.69733169     19272 75.47219   17.27895 4.197800
## 5 0.08567512 0.10281014      3979 68.86152   14.47600 3.367680
## 6 0.54640215 0.61925577     23444 76.62941   18.90462 3.275891
##   poppovertyknown percpovertyknown percbelowpoverty percchildbelowpovert
## 1           63628         96.27478        13.151443             18.01172
## 2           10529         99.08714        32.244278             45.82651
## 3           14235         94.95697        12.068844             14.03606
## 4           30337         98.47757         7.209019             11.17954
## 5            4815         82.50514        13.520249             13.02289
## 6           35107         98.37200        10.399635             14.15882
##   percadultpoverty percelderlypoverty inmetro category
## 1        11.009776          12.443812       0      AAR
## 2        27.385647          25.228976       0      LHR
## 3        10.852090          12.697410       0      AAR
## 4         5.536013           6.217047       1      ALU
## 5        11.143211          19.200000       0      AAR
## 6         8.179287          11.008586       0      AAR

### Loading the ggplot2 package
library(ggplot2,quietly = FALSE)

## Warning: package 'ggplot2' was built under R version 3.4.3

## 
## Attaching package: 'ggplot2'

## The following object is masked _by_ '.GlobalEnv':
## 
##     midwest

options(scipen = 999) ### turn off scientific notation like 1e-09
## Init Ggplot ##
## let's initialise a basic ggplot based on midwest dataset.
ggplot(midwest, aes(x=area,y=poptotal)) ## area and poptotal are columns in midwest

A blank ggplot is drawn. Even though x and y are specified, there are no points or lines in it. This is because ggplot doesn’t assume that you meant a scatterplot or a line chart to be drawn. I have only told ggplot what dataset to use and what columns should be used for the X and Y axes. I haven’t explicitly asked it to draw any points.

Also note that the aes() function is used to specify the X and Y axes. That’s because any information that is part of the source data frame has to be specified inside the aes() function.
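One way to see why the plot is blank is to count its layers. This is a small sketch, assuming the midwest data bundled with ggplot2 (kept here as mw, an illustrative name) and assuming the plot object exposes its layers through $layers, which is an implementation detail rather than something this tutorial relies on.

```r
library(ggplot2)

mw <- ggplot2::midwest  # assumption: bundled copy of the midwest data

# Only data and aesthetics are declared: no geoms have been added.
blank <- ggplot(mw, aes(x = area, y = poptotal))
length(blank$layers)        # number of layers before adding a geom

# Adding a geom adds a layer, which is what actually draws something.
with_points <- blank + geom_point()
length(with_points$layers)  # number of layers after adding geom_point()
```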

Scatterplot

Let’s make a scatterplot on top of the blank ggplot by adding points using a geom layer called geom_point().

ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()

We can see that most of the points are concentrated in the bottom portion of the plot. Like geom_point(), there are many such layers that can be added to the existing plot. One example is geom_smooth(method = 'lm'). Since the method is set to lm (linear model), it draws a line of best fit.

g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method = 'lm')
plot(g)

The line of best fit is in blue. Can you find out what other method options are available for geom_smooth()? (Note: see ?geom_smooth.) You might have noticed that the majority of points lie at the bottom of the chart, which makes it hard to read. So, let’s change the Y-axis limits to focus on the lower half.

Adjusting the X and Y Axis Limits

There are two ways to control the X and Y limits: deleting the points outside the range, or zooming in.

Deleting the points outside the specified limits

g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method = 'lm') + xlim(0,0.1) +ylim(0,1000000)
plot(g)

## Warning: Removed 5 rows containing non-finite values (stat_smooth).

## Warning: Removed 5 rows containing missing values (geom_point).

Notice that the line of best fit became more horizontal compared to the original plot. This is because, when using xlim() and ylim(), points outside the specified range are deleted and are not considered while drawing the line of best fit.

Zooming in

The other method is to zoom in on the region of interest without deleting any points. This is done using coord_cartesian().

g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method = 'lm')

g1 <- g + coord_cartesian(xlim = c(0,0.1), ylim = c(0,1000000)) ### Zoom in
plot(g1)
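The difference between the two approaches can be checked outside ggplot2 with a plain lm() fit. The sketch below is an approximation of what happens, not ggplot2's internal code: it assumes the bundled midwest data (as mw) and fits the same linear model on all rows and on only the rows that fall inside the limits used above. The names inside, fit_all and fit_trimmed are illustrative.

```r
library(ggplot2)

mw <- ggplot2::midwest  # assumption: bundled copy of the midwest data

# Rows that survive xlim(0, 0.1) and ylim(0, 1000000)
inside <- with(mw, area >= 0 & area <= 0.1 &
                   poptotal >= 0 & poptotal <= 1000000)
sum(!inside)  # number of rows that xlim()/ylim() would drop

# coord_cartesian(): the fit still uses every row
fit_all <- lm(poptotal ~ area, data = mw)

# xlim()/ylim(): the fit only sees the rows inside the limits
fit_trimmed <- lm(poptotal ~ area, data = mw[inside, ])

# Compare the slopes of the two fitted lines
c(all_rows = coef(fit_all)[["area"]],
  trimmed  = coef(fit_trimmed)[["area"]])
```

If the two slopes differ, the choice between limits and zooming changes the statistical summary drawn on the plot, not just the visible window.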

How to Change Axis labels and Titles

I have stored this as g1. Let’s add the plot title and labels for the X and Y axes. This can be done in one go using the labs() function with the title, x and y arguments. Another option is to use ggtitle(), xlab() and ylab().

g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm")

g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in

g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Equivalent call using ggtitle(), xlab() and ylab():
# g1 + ggtitle("Area Vs Population", subtitle="From midwest dataset") + xlab("Area") + ylab("Population")

Full plot call

ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point() + 
  geom_smooth(method="lm") + 
  coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

How to Change the Color and Size of Points

We can modify the aesthetics of the geoms directly. Here the color of the points and of the line is set to a static value.

ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(col="steelblue", size=3) +   # Set static color and size for points
  geom_smooth(method="lm", col="firebrick") +  # change the color of line
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

How to Change the Color to Reflect Categories in Another Column?

In the plot below, each point is colored according to the state it belongs to because of aes(col=state). Not just color, but size, shape, stroke (thickness of boundary) and fill (fill color) can also be used to discriminate groupings.

gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
plot(gg)
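The distinction between setting an aesthetic to a constant and mapping it to a column can be sketched as follows. This assumes the bundled midwest data (as mw); p_set and p_map are illustrative names, and the choice of shape and size as extra mapped aesthetics is only an example.

```r
library(ggplot2)

mw <- ggplot2::midwest  # assumption: bundled copy of the midwest data

# Setting: constants outside aes() apply to every point; no legend is drawn.
p_set <- ggplot(mw, aes(x = area, y = poptotal)) +
  geom_point(col = "steelblue", size = 3)

# Mapping: columns inside aes() vary the aesthetic row by row
# and produce a legend for each mapped aesthetic.
p_map <- ggplot(mw, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state, shape = state, size = popdensity))
```

A common slip is writing aes(col = "steelblue"), which maps color to a one-level constant instead of setting it, so the points are not drawn in steelblue.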

We can also change the color palette entirely.

gg + scale_color_brewer(palette = "Set1") ## change color palette

More such color palettes can be found in the RColorBrewer package.

library(RColorBrewer)
display.brewer.all()
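To inspect a single palette rather than the whole chart, the colors can be pulled out directly. A short sketch, assuming RColorBrewer is installed; set1_cols is an illustrative name and five colors are requested only because the midwest data has five states.

```r
library(RColorBrewer)

# Five colors from the "Set1" palette, returned as hex strings
set1_cols <- brewer.pal(n = 5, name = "Set1")
set1_cols

# Metadata for the palette: maximum number of colors, category,
# and whether it is considered colorblind friendly
brewer.pal.info["Set1", ]
```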

How to Change the X and Y Axis Text and Its Location?

Let’s see how to change the X and Y axis text and its location. This involves two aspects: breaks and labels.

Step 1: Set the breaks. The breaks should be on the same scale as the X axis variable. Note that I am using scale_x_continuous() because the X axis variable is continuous. Had it been a date variable, scale_x_date() could be used. Like scale_x_continuous(), an equivalent scale_y_continuous() is available for the Y axis.

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change breaks
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

Step 2: Change the labels. You can optionally change the labels at the axis ticks. labels takes a vector of the same length as breaks. Let me demonstrate by setting the labels to the letters a to k (though they carry no meaning in this context).

gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change breaks + label
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = letters[1:11])
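Since labels must match breaks in length, it can help to build both from the same vector. The following is a sketch under that assumption, using the bundled midwest data (as mw); brks, brk_labels and p_labeled are illustrative names, and the two-decimal format is just one possible choice.

```r
library(ggplot2)

mw <- ggplot2::midwest  # assumption: bundled copy of the midwest data

# Derive the labels from the breaks so the two always have equal length.
brks <- seq(0, 0.1, 0.01)
brk_labels <- sprintf("%.2f", brks)
length(brks) == length(brk_labels)

p_labeled <- ggplot(mw, aes(x = area, y = poptotal)) +
  geom_point() +
  scale_x_continuous(breaks = brks, labels = brk_labels)
```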

If you need to reverse the scale, use scale_x_reverse().

gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Reverse X Axis Scale
gg + scale_x_reverse()

How to Customize the Entire Theme in One Shot using Pre-Built Themes?

Instead of changing the theme components individually, we can change the entire theme itself using pre-built themes. The help page ?theme_bw shows all the available built-in themes. This is commonly done in a couple of ways:

* Use theme_set() to set the theme before drawing the ggplot. Note that this setting will affect all future plots.
* Draw the ggplot and then add the overall theme setting (e.g. theme_bw()).

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) + 
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) + 
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

gg <- gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

# method 1: Using theme_set()
theme_set(theme_classic())  # sets the default theme for all later plots
gg

# method 2: Adding theme Layer itself.
gg + theme_bw() + labs(subtitle="BW Theme")

gg + theme_classic() + labs(subtitle="Classic Theme")
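Because theme_set() changes the default for every later plot, it is often useful to keep the previous default and put it back afterwards. theme_set() returns the old theme invisibly, so a minimal sketch looks like this (old_theme is an illustrative name):

```r
library(ggplot2)

# theme_set() returns the previously active theme, so it can be saved.
old_theme <- theme_set(theme_classic())

# ... plots drawn here use theme_classic() by default ...

# Restore whatever default was active before.
theme_set(old_theme)
```

Adding the theme as a layer (method 2) avoids this bookkeeping, since it only affects the plot it is added to.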

Modifying theme components

Plot and axis titles and the axis text are part of the plot’s theme. Therefore, they can be modified using the theme() function. Each argument of theme() accepts one of the four element functions: element_text(), element_line(), element_rect() or element_blank(). Since the plot and axis titles are textual components, element_text() is used to modify them.

Below, I have changed the size, color, face and line-height. The axis text can be rotated by changing the angle.

# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) + 
  labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")

# Modify theme components -------------------------------------------
gg + theme(plot.title=element_text(size=20, 
                                    face="bold", 
                                    family="American Typewriter",
                                    color="tomato",
                                    hjust=0.5,
                                    lineheight=1.2),  # title
            plot.subtitle=element_text(size=15, 
                                       family="American Typewriter",
                                       face="bold",
                                       hjust=0.5),  # subtitle
            plot.caption=element_text(size=15),  # caption
            axis.title.x=element_text(vjust=10,  
                                      size=15),  # X axis title
            axis.title.y=element_text(size=15),  # Y axis title
            axis.text.x=element_text(size=10, 
                                     angle = 30,
                                     vjust=.5),  # X axis text
            axis.text.y=element_text(size=10))  # Y axis text

## Warning: Removed 15 rows containing non-finite values (stat_smooth).

## Warning: Removed 15 rows containing missing values (geom_point).

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

vjust controls the vertical spacing between the title (or label) and the plot.

hjust controls the horizontal spacing. Setting it to 0.5 centers the title.

family sets a new font.

face sets the font face ("plain", "italic", "bold", "bold.italic").

How to Change Legend Labels and Point Colors for Categories

This can be done using the respective scale_<aesthetic>_manual() function, for example scale_color_manual() for the color aesthetic. The new legend labels are supplied as a character vector to the labels argument. If you want to change the color of the categories, the colors can be assigned to the values argument as shown in the example below.

gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) + 
  labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")

gg + scale_color_manual(name="State", 
                        labels = c("Illinois", 
                                   "Indiana", 
                                   "Michigan", 
                                   "Ohio", 
                                   "Wisconsin"), 
                        values = c("IL"="blue", 
                                   "IN"="red", 
                                   "MI"="green", 
                                   "OH"="brown", 
                                   "WI"="orange"))

## Warning: Removed 15 rows containing non-finite values (stat_smooth).

## Warning: Removed 15 rows containing missing values (geom_point).
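In the call above the labels are matched by position while the values are matched by name, so the two can drift apart if one vector is reordered. One way to guard against that, sketched here with the bundled midwest data (as mw), is to key both vectors by state code and pass the codes as breaks. The names state_names, state_cols and p_states are illustrative.

```r
library(ggplot2)

mw <- ggplot2::midwest  # assumption: bundled copy of the midwest data

# Both lookups are keyed by the state code found in the data.
state_names <- c(IL = "Illinois", IN = "Indiana", MI = "Michigan",
                 OH = "Ohio", WI = "Wisconsin")
state_cols  <- c(IL = "blue", IN = "red", MI = "green",
                 OH = "brown", WI = "orange")

p_states <- ggplot(mw, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state)) +
  scale_color_manual(name   = "State",
                     breaks = names(state_names),   # legend order by code
                     labels = unname(state_names),  # same order as breaks
                     values = state_cols)           # matched by name
```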

How to Style the Legend Title, Text and Key

The styling of the legend title, text, key and guide can also be adjusted. The legend’s key is a figure-like element, so it has to be set using the element_rect() function.

# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) + 
  labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")

gg + theme(legend.title = element_text(size=12, color = "firebrick"), 
           legend.text = element_text(size=10),
           legend.key=element_rect(fill='springgreen')) + 
  guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))

## Warning: Removed 15 rows containing non-finite values (stat_smooth).

## Warning: Removed 15 rows containing missing values (geom_point).

Faceting: Drawing Multiple Plots Within One Figure

Let us use the mpg dataset for this section.

data(mpg, package="ggplot2")  # load data
# mpg <- read.csv("http://goo.gl/uEeRGu")  # alt data source

g <- ggplot(mpg, aes(x=displ, y=hwy)) + 
      geom_point() + 
      labs(title="hwy vs displ", caption = "Source: mpg") +
      geom_smooth(method="lm", se=FALSE) + 
      theme_bw()  # apply bw theme
plot(g)

We have a simple chart of highway mileage (hwy) against the engine displacement (displ) for the whole dataset. But what if you want to study how this relationship varies for different classes of vehicles?

facet_wrap() is used to break down a large plot into multiple small plots for individual categories. It takes a formula as the main argument; the variables in the formula define the panels, which are wrapped into a grid whose shape is controlled by nrow and ncol. (It is facet_grid() in which the items to the left of ~ form the rows and those to the right form the columns.) By default, all the plots share the same scale on both the X and Y axes. You can set them free with scales='free', but this makes it harder to compare between groups.

g <- ggplot(mpg, aes(x=displ, y=hwy)) + 
      geom_point() + 
      geom_smooth(method="lm", se=FALSE) + 
      theme_bw()  # apply bw theme

# Facet wrap with common scales
g + facet_wrap( ~ class, nrow=3) + labs(title="hwy vs displ", caption = "Source: mpg", subtitle="Ggplot2 - Faceting - Multiple plots in one figure")  # Shared scales

# Facet wrap with free scales
g + facet_wrap( ~ class, scales = "free") + labs(title="hwy vs displ", caption = "Source: mpg", subtitle="Ggplot2 - Faceting - Multiple plots in one figure with free scales")  # Scales free

So, what do you infer from this? For one, most 2-seater cars have higher engine displacement, while the minivan and compact vehicles are on the lower side. This is evident from where the points are placed along the X-axis.
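The same reading can be checked numerically by summarising displacement per class. This is a small sketch using base R's aggregate() on the mpg data bundled with ggplot2; mpg_df and med_displ are illustrative names, and the median is just one reasonable summary.

```r
library(ggplot2)

mpg_df <- ggplot2::mpg  # bundled mpg data under an illustrative name

# Median engine displacement for each vehicle class
med_displ <- aggregate(displ ~ class, data = mpg_df, FUN = median)

# Sort from largest to smallest displacement
med_displ[order(med_displ$displ, decreasing = TRUE), ]
```

Classes near the top of the table are the ones whose points sit furthest to the right in the faceted plots.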

Modifying Plot Background, Major and Minor Axis

How to Change Plot background

# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) + 
      geom_point() + 
      geom_smooth(method="lm", se=FALSE) + 
      theme_bw()  # apply bw theme

# Change Plot Background elements -----------------------------------
g + theme(panel.background = element_rect(fill = 'khaki'),
          panel.grid.major = element_line(colour = "burlywood", size=1.5),
          panel.grid.minor = element_line(colour = "tomato", 
                                          size=.25, 
                                          linetype = "dashed"),
          panel.border = element_blank(),
          axis.line.x = element_line(colour = "darkorange", 
                                     size=1.5, 
                                     lineend = "butt"),
          axis.line.y = element_line(colour = "darkorange", 
                                     size=1.5)) +
    labs(title="Modified Background", 
         subtitle="How to Change Major and Minor grid, Axis Lines, No Border")

# Change Plot Margins -----------------------------------------------
g + theme(plot.background=element_rect(fill="salmon"), 
          plot.margin = unit(c(2, 2, 1, 1), "cm")) +  # top, right, bottom, left
    labs(title="Modified Background", subtitle="How to Change Plot Margin")
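The order of the four margin values is easy to mix up. As an alternative sketch, ggplot2's margin() helper takes named arguments (t, r, b, l) and a unit, which makes the intent explicit. The names plot_margin and p_margin are illustrative; the plot itself mirrors the one above.

```r
library(ggplot2)

# Named sides instead of a positional top/right/bottom/left vector
plot_margin <- margin(t = 2, r = 2, b = 1, l = 1, unit = "cm")

p_margin <- ggplot(ggplot2::mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(plot.background = element_rect(fill = "salmon"),
        plot.margin = plot_margin) +
  labs(title = "Modified Background", subtitle = "Plot margin via margin()")
```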

With this we come to the end of the basic tutorial on geoms, aesthetics, labels, titles, colors and themes in ggplot2.