The Basics of ggplot2 for Data Science

We will begin by covering the basics of data visualization using ggplot2.

First we need to install the ggplot2 package.

We will do so by installing the broader tidyverse collection of data science packages, which includes ggplot2, tibble, tidyr, readr, purrr, and dplyr. Each of these packages supports data handling and programming for statistical analysis.
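A minimal sketch of the install-and-load step described above, assuming an internet connection and a configured CRAN mirror. The install line is commented out so that it only runs when you need it.

```r
# One-time installation from CRAN (uncomment on first use).
# install.packages("tidyverse")

# Attach the core tidyverse packages, including ggplot2, for this session.
library(tidyverse)

# Check that ggplot2 is now on the search path.
"package:ggplot2" %in% search()
```

Installation is needed once per machine, while library() has to be called in every new R session.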

An R package is a collection of functions, data, and documentation that expands the capabilities of base R.

ggplot2 implements the grammar of graphics.

In the example below we use :: to tell R to take an object, here the mpg data frame, from the package ggplot2. The same package::name form works for functions such as ggplot2::ggplot().

Now let’s practice with the mpg data frame found in ggplot2 and view the resulting tibble.

ggplot2::mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl   
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr>
##  1 audi         a4         1.8  1999     4 auto(l~ f        18    29 p    
##  2 audi         a4         1.8  1999     4 manual~ f        21    29 p    
##  3 audi         a4         2    2008     4 manual~ f        20    31 p    
##  4 audi         a4         2    2008     4 auto(a~ f        21    30 p    
##  5 audi         a4         2.8  1999     6 auto(l~ f        16    26 p    
##  6 audi         a4         2.8  1999     6 manual~ f        18    26 p    
##  7 audi         a4         3.1  2008     6 auto(a~ f        18    27 p    
##  8 audi         a4 quat~   1.8  1999     4 manual~ 4        18    26 p    
##  9 audi         a4 quat~   1.8  1999     4 auto(l~ 4        16    25 p    
## 10 audi         a4 quat~   2    2008     4 manual~ 4        20    28 p    
## # ... with 224 more rows, and 1 more variable: class <chr>
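The :: operator is also handy when a package has not been attached with library(). The sketch below qualifies every ggplot2 name explicitly; the object name p is only illustrative.

```r
# Build a plot without attaching ggplot2: every name is qualified with ::
p <- ggplot2::ggplot(data = ggplot2::mpg) +
  ggplot2::geom_point(mapping = ggplot2::aes(x = displ, y = hwy))

# Printing the object draws the plot.
print(p)
```

In everyday work it is usually more convenient to attach the package once and drop the prefixes.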

Now we will use the mpg data to plot displ (a car’s engine size in liters) on the x-axis and hwy (a car’s highway fuel efficiency in miles per gallon) on the y-axis.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  theme_minimal() #optional step that applies one of several themes in the ggplot2 package

The point of creating the scatter-plot above with geom_point() was to analyze the relationship between displ, a car’s engine size, and hwy, a car’s fuel efficiency in miles per gallon.

The scatter-plot shows that as a car’s engine size (displ) increases, its fuel efficiency (hwy) decreases, which suggests a negative relationship between displ and hwy.

What inference can we draw from the negative relationship?

Cars with large engines tend to use more fuel than cars with smaller engines.
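One way to put a number on this visual impression is a correlation coefficient. The sketch below is an addition that uses base R's cor(); a negative value is consistent with the downward trend seen in the scatter-plot.

```r
library(ggplot2)  # provides the mpg data frame

# Pearson correlation between engine size and highway fuel efficiency.
r <- cor(mpg$displ, mpg$hwy)
r

# TRUE when hwy tends to fall as displ rises.
r < 0
```

A correlation only summarizes the linear part of the relationship, so it complements the plot rather than replacing it.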

The Basics of ggplot()

When using ggplot2 you begin with the function ggplot():

  • ggplot() creates a coordinate system that you can add layers to.

  • the first argument of ggplot() is the dataset to use in the graph, for example ggplot(data = mpg).

  • once ggplot() knows which dataset to use, you complete the graph by adding one or more layers, for example with the function geom_point().

  • geom_point() adds a layer of points to your plot, which creates a scatter-plot.
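The steps above can be sketched as a two-stage build: start with ggplot(), then add a geom layer. The object names base and p are hypothetical.

```r
library(ggplot2)

# Step 1: ggplot() alone creates an empty coordinate system for the mpg data.
base <- ggplot(data = mpg)

# Step 2: adding geom_point() puts a layer of points on top of it.
p <- base + geom_point(mapping = aes(x = displ, y = hwy))

length(base$layers)  # number of layers before adding a geom
length(p$layers)     # number of layers after adding geom_point()

p
```

Printing base on its own draws an empty panel, which shows that the data is not displayed until a layer is added.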

Aesthetic Mappings

  • an aesthetic is a visual property of the objects in your plot.

  • aesthetics include things like the size, shape, and color of your points.

  • each geom function in ggplot2 takes a mapping argument.

  • the mapping argument defines how variables in your dataset, e.g. displ and hwy, are mapped to visual properties.

  • the mapping argument is always paired with aes(); the x and y arguments of aes() specify which variables to map to the x and y axes.

  • you can add a third variable to a two-dimensional scatter-plot by mapping it to an aesthetic.

Below we will use the color aesthetic to map the colors of our points to the class variable.

Note: Occasionally you may see the word colour instead of color auto-populate within RStudio or in someone else’s code.

Either spelling is fine; colour is simply the British English spelling.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Now, instead of mapping the colors of our points to the class variable, let’s map size to the class variable.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = class)) +
  theme_light() #optional theme (you may choose not to include this line of R code)

As we can see in the output above, mapping size to the class variable is of little use in helping us analyze the relationship between displ and hwy.

This teaches a broader lesson: using size for a discrete, unordered variable is ill-advised, because size implies an ordering that class does not have. ggplot2 prints a warning to that effect.

Below we will use the alpha and shape aesthetics and map each to the variable class.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class, color = class))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class, color = class))
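By default, ggplot2's shape palette supplies only six shapes. When a variable has more levels than that, the extra groups are left unplotted and a warning is printed. The sketch below checks how many classes there are and supplies one shape per class with scale_shape_manual(). The choice of shape numbers starting at 0 is arbitrary, and n_class is a hypothetical name.

```r
library(ggplot2)

# How many distinct car classes are there?
n_class <- length(unique(mpg$class))
n_class

# Supply one shape per class so that no class is dropped from the plot.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:(n_class - 1))
```

Even with enough shapes, many categories are hard to tell apart by shape alone, which is one reason color is often the better choice here.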

For x and y aesthetics ggplot2 does not create a legend but it does create an axis line with tick marks and a label.

You can also set an aesthetic manually, outside of aes(), for example to make all of the points in your scatter-plot “blue”.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ), color = "blue")
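Whether color sits inside or outside aes() changes its meaning. The sketch below contrasts the two; p_mapped and p_set are hypothetical names. Inside aes(), "blue" is treated as a data value to be mapped through the color scale, so the points are not drawn blue and a legend appears. Outside aes(), it is applied directly as the color of every point.

```r
library(ggplot2)

# Inside aes(): "blue" becomes a one-level variable mapped to the default palette.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ, color = "blue"))

# Outside aes(): the color is set directly, so every point is drawn in blue.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ), color = "blue")

p_mapped
p_set
```

As a rule of thumb, use aes() when the appearance should depend on a variable and a plain argument when it should be constant.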

R has built-in point shapes that are identified by the numbers 0 to 25 (e.g. 1 is an empty circle).

Try running the code below and changing the shape to another number between 0 and 25.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = displ), shape = 24, color = "orange") +
  theme_minimal() #choose a theme you have not used before

Facets

  • you may split your plot into facets.

  • facets are subplots that each display one subset of your data.

  • facets are especially useful with categorical variables.

  • common examples of categorical variables are race, sex, and age group.

  • you facet your plot by using facet_wrap().

  • the first argument of facet_wrap() should be a formula and the variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  # ~ is a tilde that can be read as "on", "by", or "according to"
  # nrow = 2 arranges the panels in 2 rows; you can also use ncol =
  facet_wrap(~class, nrow = 2)
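As a small variation on the example above, the layout can be controlled by column count instead of row count. This is a sketch; the value 4 is an arbitrary choice and p is a hypothetical name.

```r
library(ggplot2)

# Same faceted scatter-plot, but the panels are arranged in 4 columns.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class, ncol = 4)

p
```

ggplot2 works out the number of rows from the number of panels, so usually only one of nrow and ncol needs to be given.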

Geometric Objects

A geom is a geometrical object that a plot uses to represent data.

ggplot2 contains over 30 geoms, which are described in the ggplot2 documentation; additional geoms from extension packages are listed at the ggplot2-exts website.

geom_abline() #adds a diagonal reference line given a slope and intercept (geom_hline() and geom_vline() add horizontal and vertical lines)

geom_bar() #makes the height of each bar proportional to the number of cases in each group

geom_col() #makes the heights of the bars represent values in the data

geom_boxplot() #creates a boxplot to visualize five summary statistics

geom_contour() #visualizes 3d surfaces in 2d

geom_curve() #draws a curved line from the point (x, y) to the point (xend, yend)

geom_density() #draws a kernel density estimate, a smoothed version of the histogram

geom_dotplot() #draws a dot plot in which the width of each dot corresponds to the bin width

geom_histogram() #bins a continuous variable and displays the counts with bars

geom_quantile() #fits a quantile regression to the data and draws the fitted quantile lines

geom_smooth() #adds a smoothed line that aids the eye in seeing patterns in the presence of overplotting

Let’s compare some commonly used geoms to the geom_point() function.

ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = displ))

ggplot(data = mpg) +
  stat_bin(mapping = aes(x = displ))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))
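The first two plots above look the same because stat_bin() is the default stat of geom_histogram(). Both fall back to 30 bins and print a message suggesting that you pick a better value. The sketch below sets binwidth explicitly; 0.5 liters is an arbitrary value chosen for illustration.

```r
library(ggplot2)

# Histogram of engine size with bins that are 0.5 liters wide.
p <- ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = displ), binwidth = 0.5)

p

# The bin counts computed by the stat add up to the number of cars.
sum(layer_data(p)$count) == nrow(mpg)
```

Trying a few different bin widths is worthwhile, since the apparent shape of a distribution can change with the binning.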

Now let’s use two frequently used visual aids in statistics: the histogram and the box-plot.

ggplot(data = mpg) +
  geom_histogram(mapping = aes(x = displ), color = "blue", fill = "red") +
  theme_minimal() #optional line of code used to select the minimalist theme

#displ is continuous, so we group it into bins 0.5 liters wide and draw one box per bin
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = displ, y = hwy, group = cut_width(displ, 0.5)))

Not every aesthetic works with every geom.

  • as an example, you can set the shape of a point, but you cannot set the shape of a line.

  • while you cannot set the shape of a line, you can set its line-type with the linetype aesthetic.

  • geom_smooth() will draw a different line, with a different line-type, for each unique value of the variable that you map to linetype.

Let’s view geom_smooth() in practice by using it in the example below.

In our example geom_smooth() separates the cars into three distinct lines based on their drv value, which describes a car’s drive-train.

Note: a car’s drive-train is the group of components that deliver power to the driving wheels.

  • f = front-wheel drive

  • 4 = four-wheel drive

  • r = rear-wheel drive

#note: displ refers to the engine displacement (its size) in liters
#note: hwy refers to the highway miles-per-gallon
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv))

Let’s now try to display multiple geoms in one plot.

#the code below repeats the same mapping in each layer
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

#this code produces the same plot as the code above without the duplication
#you can avoid the mapping repetition seen in the first example...
#by passing a set of mappings directly to ggplot() as we have done below
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

You can display different aesthetics in different layers: mappings placed in ggplot() apply to every layer, while mappings placed in a geom function apply to that layer only.

ggplot(data = mpg, mapping = aes(x = hwy, y = displ)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
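The same idea extends to data: a layer can be given its own data argument, which overrides the plot-level data for that layer only. The sketch below uses base R's subset() and an arbitrarily chosen class, "subcompact"; se = FALSE simply hides the confidence band.

```r
library(ggplot2)

# Points use all cars; the smooth line is fitted to subcompact cars only.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = subset(mpg, class == "subcompact"), se = FALSE)

p
```

The x and y mappings are still inherited from ggplot(), so only the rows fed to the smoothing layer change.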

Statistical Processes

We can use the diamonds dataset that comes with ggplot2.

To view the variables within the diamonds dataset we can type ?diamonds.

Below we create a bar chart displaying the quality of the diamond cuts within the diamonds dataset.

The chart will show that there are more diamonds available with “very good”, “premium”, or “ideal” cuts than there are with “fair” or “good” cuts.

Note: bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

?diamonds #provides a description of the diamonds dataset

#create a bar chart displaying the quality of the diamond cuts within the diamonds dataset
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

The algorithm used to calculate new values for a graph is called a stat, which is short for statistical transformation.
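To see the values a stat computes, one option is ggplot2's layer_data() helper, which returns the data a layer draws after the transformation has been applied. In this sketch the names p and computed are hypothetical.

```r
library(ggplot2)

# Bar chart of diamond cuts; the counts are not in diamonds, the stat computes them.
p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Extract what the first layer actually draws.
computed <- layer_data(p)

# One row per cut, including the computed count and prop columns.
computed[, c("x", "count", "prop")]
```

The count column should match a simple tabulation such as table(diamonds$cut).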

You can learn which stat a geom uses by inspecting the default value for the stat argument.

For example, ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().

Usually, you can use geoms and stats interchangeably; we can compare their differences, or lack thereof, below.

#example using a geom
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

#example using a stat
ggplot(data = diamonds) +
  stat_count(mapping = aes(x = cut))

#are the resulting outputs identical?

The results are identical because every geom has a default stat and every stat has a default geom.

This allows you to use both interchangeably without worrying about altering the underlying statistical transformation.
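The pairing of geoms and stats can also be checked from code. The sketch below inspects the layer objects that geom_bar() and stat_count() return; it relies on the stat and geom fields of those objects, and the names bar_layer and count_layer are hypothetical.

```r
library(ggplot2)

# Create the two layers without adding them to a plot.
bar_layer <- geom_bar()
count_layer <- stat_count()

# The stat used by geom_bar() and the geom used by stat_count().
class(bar_layer$stat)[1]
class(count_layer$geom)[1]
```

Reading the help page is the more conventional route, but this check is handy inside scripts.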

You can override the default stat if you choose to.

First we will create a tibble with tribble().

#tribble() comes from the tibble package and creates a tibble using an easier row-by-row layout
demo <- tribble(
  ~a,        ~b,
  "bar_1",   20,
  "bar_2",   30,
  "bar_3",   40
)

#now we create a bar chart using our "demo" tibble
ggplot(data = demo) +
  geom_bar(
    mapping = aes(x = a, y = b), 
    stat = "identity")

You can also override the default mapping from transformed variables to aesthetics.

#after_stat(prop) uses the proportion computed by the stat; older ggplot2 code writes ..prop..
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

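You may also want to draw greater attention to the statistical transformation in your code.

  • you can do so by using stat_summary(), which summarizes the y values for each unique x value.

#fun, fun.min, and fun.max replace the deprecated fun.y, fun.ymin, and fun.ymax (ggplot2 3.3.0 and later)
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )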

Position Adjustments

You can color a bar chart using either the color aesthetic or fill.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, color = cut))

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut))

You can also map the fill aesthetic to another variable.

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument.

If you do not want a stacked bar chart then your options are "identity", "dodge", or "fill".

  • position = "identity" will place each object exactly where it falls in the context of the graph; for bars this means that they overlap.

  • to see that overlap, make the bars slightly transparent by setting alpha to a small value, e.g. 1/5.

  • alternatively, make the bars completely transparent by setting fill = NA.

ggplot(
  data = diamonds,
  mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

ggplot(
  data = diamonds,
  mapping = aes(x = cut, color = clarity)) +
  geom_bar(fill = NA, position = "identity")

  • position = "fill" works like stacking, but makes each set of stacked bars the same height.

  • this makes it easier to compare proportions across groups.

#Compares proportions of clarity depending upon the diamonds "cut"
ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity), 
    position = "fill")
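The remaining option, position = "dodge", places overlapping objects directly beside one another, which makes it easier to compare individual values. A sketch following the same pattern as the examples above; p is a hypothetical name.

```r
library(ggplot2)

# One bar per clarity level, placed side by side within each cut.
p <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "dodge")

p
```

Compared with stacking, dodging gives every bar a common baseline, at the cost of a wider plot.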

One common issue faced when using scatter-plots is overplotting.

  • overplotting makes it challenging to view the bulk of your data points and their spacing.

  • one way to avoid this issue is to use position = "jitter".

  • position = "jitter" adds a small amount of random noise to each data point. Because the noise is random, no two data points are likely to receive the same amount, so the points spread out.

  • because position = "jitter" is such a useful and popular operation, ggplot2 provides a shorthand for geom_point(position = "jitter"): geom_jitter().

Let’s use position = "jitter" in an example below.

#adding random noise/error to spread our points out and better analyze the data
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

#utilize the shorthand geom_jitter()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_jitter()

The previous scatter-plot is slightly less accurate at small scales because of the random noise we have added to each point.

However, the upside is that your plot is now more revealing at large scales.

Also, notice that the two scatter-plots are not identical: the noise in each plot is drawn at random, which creates a different spread among our x and y data points.
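If you need the same jittered plot every time, for example in a report, one option is the seed argument of position_jitter(). The sketch below builds the plot twice and compares the jittered x positions; the seed value 42 and the helper name make_plot are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical helper that builds the same jittered scatter-plot on each call.
make_plot <- function() {
  ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
    geom_point(position = position_jitter(seed = 42))
}

# Extract the jittered positions from two separate builds.
first <- layer_data(make_plot())
second <- layer_data(make_plot())

# TRUE when both builds received exactly the same noise.
identical(first$x, second$x)
```

Without a fixed seed, each rendering draws fresh noise, which is why the two plots above differ.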

Coordinate Systems

The default coordinate system within ggplot2 is the Cartesian coordinate system.

  • The Cartesian coordinate system is a system where the x and y positions act independently to find the location of each point.

There are several different coordinate systems that you may find helpful as an R user:

  • the first is coord_flip(), which switches the x and y axes; you might find this especially helpful when creating horizontal box-plots or when working with long labels.

#this code creates your default vertical boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

#here we are flipping the coordinates to create a horizontal boxplot using coord_flip()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  coord_flip()

  • the second is coord_quickmap() which sets the aspect ratio correctly for maps.

  • coord_quickmap() works exceptionally well plotting spatial data using ggplot2.

Let’s code an example that makes use of coord_quickmap() using the New Zealand basic map, which is provided by the maps package (install it with install.packages("maps") if needed).

newzea <- map_data("nz")

#we need to find the variables associated with the "nz" map data
?map_data
?maps::nz

#plot the map of new zealand without the correct aspect ratio
ggplot(data = newzea, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")

#plot the map of new zealand with the correct aspect ratio using coord_quickmap()
ggplot(data = newzea, mapping = aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()

  • the third coordinate system we are going to explore is coord_polar().

  • polar coordinates reveal the connection between a bar-chart and a Coxcomb chart.

We will explore the connection between the bar-chart and the Coxcomb chart below:

#a quick reminder of our potential variables
?"diamonds"

bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut),
           show.legend = FALSE,
           width = 1) +
  theme(aspect.ratio = 1) + #allows you to customize non-data aspects of your plot
  labs(x = NULL, y = NULL) #allows you the option to assign labels to your plot

#barchart which switches the x and y axes using coord_flip()
bar + coord_flip()

#Coxcomb chart created with polar coordinates
bar + coord_polar()

Bibliography

Grolemund, Garrett & Wickham, Hadley. R for Data Science. Sebastopol, CA. O’Reilly Media, Inc. 2017.