ggplot2 for Data Science
We will begin by covering the basics of data visualization using ggplot2.
First we need to install the ggplot2 package.
We will do so by installing the broader tidyverse collection of data science packages, which includes ggplot2, tibble, tidyr, readr, purrr, and dplyr, each of which supports data handling and programming for statistical analysis.
An R package is a collection of functions, data, and documentation that expands the capabilities of base R.
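A minimal sketch of the setup step, assuming an internet connection and a configured CRAN mirror. The install line is commented out because it only needs to be run once per machine:

```r
# Install the tidyverse once per machine (uncomment to run).
# install.packages("tidyverse")

# Load ggplot2 for the current session; library(tidyverse) would
# attach ggplot2 together with the other core tidyverse packages.
library(ggplot2)

# Confirm that the package is attached and report its version.
loaded <- "package:ggplot2" %in% search()
packageVersion("ggplot2")
```

Installing happens once, while library() has to be called again in every new R session.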
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
If you need to be explicit about where a function (or dataset) comes from, you can use a double colon ::.
For example, ggplot2::ggplot() tells R to use the function ggplot() from the package ggplot2, and ggplot2::mpg, used below, refers to the mpg dataset from the same package.
Now let’s practice with the mpg data frame found in ggplot2 and view the resulting tibble.
A tibble is an alternative to base R’s traditional data.frame.
Tibbles are data frames, but they tweak some of base R’s older behaviors to make working with data easier.
ggplot2::mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto(l~ f 18 29 p
## 2 audi a4 1.8 1999 4 manual~ f 21 29 p
## 3 audi a4 2 2008 4 manual~ f 20 31 p
## 4 audi a4 2 2008 4 auto(a~ f 21 30 p
## 5 audi a4 2.8 1999 6 auto(l~ f 16 26 p
## 6 audi a4 2.8 1999 6 manual~ f 18 26 p
## 7 audi a4 3.1 2008 6 auto(a~ f 18 27 p
## 8 audi a4 quat~ 1.8 1999 4 manual~ 4 18 26 p
## 9 audi a4 quat~ 1.8 1999 4 auto(l~ 4 16 25 p
## 10 audi a4 quat~ 2 2008 4 manual~ 4 20 28 p
## # ... with 224 more rows, and 1 more variable: class <chr>
Now we will use the mpg data to plot displ (a car’s engine size in liters) on the x-axis and hwy (a car’s highway fuel efficiency in miles per gallon) on the y-axis.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
theme_minimal() #optional step and adds one of several themes in the ggplot2 package
The point of creating the scatter-plot above using geom_point() was to analyze the relationship between displ, a car’s engine size, and hwy, a car’s fuel efficiency in miles per gallon.
The scatter-plot shows that as a car’s engine size (displ) increases, its fuel efficiency (hwy) decreases, suggesting a negative relationship between displ and hwy.
What inference can we draw from the negative relationship?
That cars with large engines tend to use more fuel than cars with smaller engines.
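One way to check the direction of this relationship numerically is a correlation coefficient. The sketch below uses base R’s cor(), which is not part of the original walkthrough, and it only summarizes the linear part of the trend:

```r
library(ggplot2)

# Pearson correlation between engine size and highway fuel efficiency.
r <- cor(mpg$displ, mpg$hwy)

# A negative value means hwy tends to fall as displ rises.
r
```

A single number cannot replace the scatter-plot, which also reveals clusters and outliers, but it confirms the sign of the relationship.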
ggplot()
When using ggplot2 you begin with the function ggplot():
ggplot() creates a coordinate system that you can add layers to.
The first argument of ggplot() is the dataset to use in the graph, for example ggplot(data = mpg).
After you have called ggplot() with the dataset to be used in your graph, you can add one or more layers to it using geom functions such as geom_point().
geom_point() adds a layer of points to your plot, which creates a scatter-plot.
An aesthetic is a visual property of the objects in your plot.
Aesthetics include things like the size, shape, and color of your points.
Each geom function in ggplot2 takes a mapping argument.
The mapping argument defines how variables in your dataset, e.g. displ and hwy, will be mapped to visual properties.
The mapping argument is always paired with aes(); the x and y arguments of aes() specify which variables to map to the x and y axes.
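The steps described above can be sketched by building the plot one piece at a time. The object names empty_plot and scatter are illustrative only:

```r
library(ggplot2)

# Step 1: ggplot() alone creates an empty coordinate system for the data.
empty_plot <- ggplot(data = mpg)

# Step 2: adding a geom adds a layer; aes() maps variables to x and y.
scatter <- empty_plot + geom_point(mapping = aes(x = displ, y = hwy))

length(empty_plot$layers)  # number of layers before adding a geom
length(scatter$layers)     # number of layers after adding geom_point()
```

Storing the intermediate plot in a variable is optional, but it makes clear that each + adds a layer on top of what ggplot() created.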
You can add a third variable to a two-dimensional scatter-plot by mapping it to an aesthetic.
Below we will use the aesthetic color to map the colors of our points to the class variable.
Note: Occasionally you may see the word colour instead of color auto-populate within RStudio or appear in someone else’s code.
Either spelling is fine; colour is simply the British English spelling.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
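As a quick check of the note on spelling, the sketch below compares the two forms. It relies on aes() standardising aesthetic names internally, so both spellings should produce the same mapping; the variable names us and uk are illustrative:

```r
library(ggplot2)

# Build the same mapping with the American and the British spelling.
us <- aes(x = displ, y = hwy, color = class)
uk <- aes(x = displ, y = hwy, colour = class)

# Inspect the aesthetic names that ggplot2 stores for each mapping.
names(us)
names(uk)
same_names <- identical(names(us), names(uk))
same_names
```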
Now, instead of mapping the colors of our points to the class variable, let’s map size to the class variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class)) +
theme_light() #optional theme (you may choose not to include this line of R code)
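As we can see in the output above, mapping size to the class variable is of little use in helping us analyze the relationship between displ and hwy.
This teaches a broader lesson: using size for a discrete variable is ill-advised, and ggplot2 issues a warning when you do so.
Below we will use the alpha and shape aesthetics and map each to the variable class.
Note that ggplot2 uses only six shapes at a time by default, so with seven classes one class (suv) goes unplotted when class is mapped to shape.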
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class, color = class))
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class, color = class))
For the x and y aesthetics ggplot2 does not create a legend, but it does create an axis line with tick marks and a label.
You can also set an aesthetic manually by placing it outside of aes(), for example to make all of the points in your scatter-plot blue.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = displ), color = "blue")
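The difference between setting an aesthetic and mapping it is easy to miss, so here is a small sketch contrasting the two. The object names set_blue and mapped_blue are illustrative, and the last two lines inspect the layer objects directly, which is an internal detail that may differ between ggplot2 versions:

```r
library(ggplot2)

# Setting: color is outside aes(), so every point is drawn in blue.
set_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: color is inside aes(), so "blue" is treated as a data value
# with a single level; the points get a default palette color and a legend.
mapped_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Fixed (set) aesthetics of the first layer.
set_blue$layers[[1]]$aes_params

# Mapped aesthetics of the first layer.
names(mapped_blue$layers[[1]]$mapping)
```

If points do not come out in the color you asked for, checking whether the argument sits inside or outside aes() is a good first step.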
R has built-in point shapes that are identified by the numbers 0 to 25 (e.g. 1 is an empty circle).
Try running the code below and changing the shape to another number between 0 and 25.
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = displ), shape = 24, color = "orange") +
theme_minimal() #choose a theme you have not used before
You may split your plot into facets.
Facets are subplots that each display one subset of your data.
Facets are especially useful with categorical variables.
Common examples of categorical variables are race, sex, and age group.
You facet your plot by using facet_wrap().
The first argument of facet_wrap() should be a formula, and the variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class, nrow = 2) # ~ is a tilde that can be read as "on", "by", "according to"
# nrow creates an output with 2 rows, you can also use ncol =
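As the comment above mentions, ncol can be used in place of nrow. A short sketch of that variant, where the choice of four columns is arbitrary:

```r
library(ggplot2)

# Same faceted scatter-plot, laid out with a fixed number of columns.
by_class <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~class, ncol = 4)

by_class
```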
A geom is a geometrical object that a plot uses to represent data.
There are over 30 geoms in ggplot2; these geoms and additional extension geoms can be found at the ggplot2-exts website.
geom_abline() #adds a diagonal reference line to a plot (geom_hline() and geom_vline() add horizontal and vertical ones)
geom_bar() #makes the height of the bar proportional to the number of cases in each group
geom_col() #heights of the bars are used to represent values in the data
geom_boxplot() #creates a boxplot to visualize the five summary statistics
geom_contour() #visualize 3d surfaces in 2d
geom_curve() #draws a curved line between the points (x, y) and (xend, yend)
geom_density() #draws a kernel density estimate, a smoothed version of the histogram
geom_dotplot() #dot plot, the size of the dot corresponds to bin width
geom_histogram() #displays the count with bars
geom_quantile() #fits a quantile regression to the data & draws the fitted quantile lines
geom_smooth() #aids the eye in seeing patterns in the presence of overplotting
Let’s compare some commonly used geoms to the geom_point() function.
ggplot(data = mpg) +
geom_histogram(mapping = aes(x = displ))
ggplot(data = mpg) +
stat_bin(mapping = aes(x = displ))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Now let’s use two frequently used visual aids in statistics: the histogram and the box-plot.
ggplot(data = mpg) +
geom_histogram(mapping = aes(x = displ), color = "blue", fill = "red") +
theme_minimal() #optional line of code used to select the minimalist theme
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy)) #box-plots need a discrete x; a continuous x such as displ gives a single box and a warning
Not every aesthetic works with every geom.
For example, you can set the shape of a point, but you cannot set the shape of a line.
While you cannot set the shape of a line, you can set its linetype.
When you map a variable to linetype, geom_smooth() draws a different line, with a different linetype, for each unique value of that variable.
Let’s view geom_smooth() in practice by using it in the example below.
In our example geom_smooth() separates the cars into three distinct lines based on their drv value; drv describes a car’s drive-train.
Note: a car’s drive-train is the group of components that deliver power to the driving wheels.
f = front-wheel drive, 4 = four-wheel drive, r = rear-wheel drive
#note: displ refers to the engine displacement (its size) in liters
#note: hwy refers to the highway miles-per-gallon
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv, color = drv))
Let’s now try to display multiple geoms in one plot.
#the code below repeats the same mapping in each layer
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
#this code will produce the same output as the code above without duplication
#you can avoid the mapping repetition seen in the first example...
#by passing a set of mappings directly to ggplot() as we have done below
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
Mappings passed to ggplot() apply to every layer, while mappings placed in a geom function apply to that layer only, so you can display different aesthetics in different layers.
ggplot(data = mpg, mapping = aes(x = hwy, y = displ)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
We can use the diamonds dataset that comes with ggplot2.
To view the variables within the diamonds dataset we can type ?diamonds.
Below we create a bar chart displaying the quality of the diamond cuts within the diamonds dataset.
The chart shows that there are more diamonds available with “very good”, “premium”, or “ideal” cuts than there are with “fair” or “good” cuts.
Bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
?diamonds #provides a description of the diamonds dataset
#create a bar chart displaying the quality of the diamond cuts within the diamonds dataset
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The algorithm used to calculate new values for a graph is called a stat, which is short for statistical transformation.
You can learn which stat a geom uses by inspecting the default value for the stat argument.
?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().
Usually, you can use geoms and stats interchangeably; we can compare their differences, or lack thereof, below.
#example using a geom
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
#example using a stat
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
#are the resulting outputs identical?
The results are identical because every geom has a default stat and every stat has a default geom.
This allows you to use both interchangeably without worrying about altering the underlying statistical transformation.
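The pairing between geoms and stats can also be inspected from code rather than from the help pages. The sketch below looks inside the layer objects returned by the constructors; the stat and geom fields are internal details, so treat this as an illustration rather than a stable interface:

```r
library(ggplot2)

# A geom constructor returns a layer that records its default stat ...
class(geom_bar()$stat)[1]

# ... and a stat constructor returns a layer that records its default geom.
class(stat_count()$geom)[1]
```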
You can override the default stat if you choose to.
First we will create a small tibble with tribble(), which is available once the tidyverse (or the tibble package) is loaded.
#tribbles create tibbles using an easier row-by-row layout
demo <- tribble(
~a, ~b,
"bar_1", 20,
"bar_2", 30,
"bar_3", 40
)
#now we create a bar chart using our "demo" tribble
ggplot(data = demo) +
geom_bar(
mapping = aes(x = a, y = b),
stat = "identity")
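An alternative to overriding the stat is geom_col(), listed among the geoms earlier, which uses the values in the data as bar heights by default. A minimal sketch, rebuilding the same demo data with base R’s data.frame() so that only ggplot2 is required:

```r
library(ggplot2)

# Same values as the demo tribble, built with base R.
demo <- data.frame(a = c("bar_1", "bar_2", "bar_3"), b = c(20, 30, 40))

# geom_col() uses stat = "identity" by default, so bar heights come from b.
col_chart <- ggplot(data = demo) +
  geom_col(mapping = aes(x = a, y = b))

col_chart
```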
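You can also override the default mapping from transformed variables to aesthetics, for example to display a bar chart of proportions rather than counts.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) #after_stat(prop) replaces the older ..prop.. notation (ggplot2 3.3.0 or later)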
You may also want to draw greater attention to the statistical transformation in your code.
For example, stat_summary() summarizes the y values for each unique x value.
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
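fun.min = min, #fun.min, fun.max, and fun replace the deprecated fun.ymin, fun.ymax, and fun.y (ggplot2 3.3.0 or later)
fun.max = max,
fun = median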
)
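To see the numbers behind that summary plot, the same three statistics can be computed directly. This is a sketch using base R’s split() and sapply(), and the name depth_by_cut is illustrative:

```r
library(ggplot2)

# For each cut, compute the minimum, median, and maximum depth,
# i.e. the three values summarized per x value in the plot above.
depth_by_cut <- sapply(
  split(diamonds$depth, diamonds$cut),
  function(d) c(min = min(d), median = median(d), max = max(d))
)

# One column per cut; rows are min, median, and max.
depth_by_cut
```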
You can color a bar chart using either the color aesthetic, which colors the outline of each bar, or fill, which colors its interior.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, color = cut))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
You can also map the fill aesthetic to another variable.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position argument.
If you do not want a stacked bar chart then your options are "identity", "dodge", or "fill".
position = "identity" will place each object exactly where it falls in the context of the graph, which causes the bars to overlap.
To make the bars slightly transparent, set alpha to a small value, e.g. 1/5.
To make the bars completely transparent, set fill = NA.
ggplot(
data = diamonds,
mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(
data = diamonds,
mapping = aes(x = cut, color = clarity)) +
geom_bar(fill = NA, position = "identity")
position = "fill" works like stacking, but makes each set of stacked bars the same height.
This makes it easier to compare proportions across groups.
#Compares proportions of clarity depending upon the diamonds "cut"
ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = clarity),
position = "fill")
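The remaining option named earlier, "dodge", has no example in this walkthrough, so here is a brief sketch. It places the bars for each clarity level beside one another within each cut, which makes individual counts easier to compare; the name dodged is illustrative:

```r
library(ggplot2)

# Bars for each clarity level are drawn side by side instead of stacked.
dodged <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    position = "dodge")

dodged
```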
One common issue faced when using scatter-plots is overplotting.
Overplotting makes it challenging to view the bulk of your data points and their spacing.
One way to avoid this issue is to use position = "jitter".
position = "jitter" adds a small amount of random noise to each data point, which spreads the points out because no two points are likely to receive the same amount of noise.
Because position = "jitter" is such a useful and popular operation, ggplot2 provides a shorthand for geom_point(position = "jitter"): geom_jitter().
Let’s use position = "jitter" in an example below.
#adding random noise/error to spread our points out and better analyze the data
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
#utilize the shorthand geom_jitter()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_jitter()
The previous scatter-plot is slightly less accurate at small scales, given the random noise we have added to each point.
However, the upside is that your plot is now more revealing at large scales.
Also, notice that the two scatter-plots are not identical: the noise in each plot is drawn at random, creating different spreads among our x and y data points.
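If you need the jittered plot to look the same on every run, one option is position_jitter(), which exposes the amount of noise and a seed. In the sketch below the width, height, and seed values are arbitrary choices for illustration:

```r
library(ggplot2)

# width and height control how far points may move in each direction;
# seed fixes the random noise so repeated runs give the same plot.
jittered <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 42))

jittered
```

Smaller width and height values keep points closer to their true positions, at the cost of separating overlapping points less.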
The default coordinate system in ggplot2 is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point.
There are several other coordinate systems that you may find helpful as an R user:
The first is coord_flip(), which switches the x and y axes. You might find this especially helpful when creating horizontal box-plots or when working with long labels.
#this code creates your default vertical boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
#here we are flipping the coordinates to create a horizontal boxplot using coord_flip()
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
The second is coord_quickmap(), which sets the aspect ratio correctly for maps.
coord_quickmap() works well when plotting spatial data using ggplot2.
Let’s code an example that makes use of coord_quickmap() using the New Zealand basic map, which map_data() reads from the maps package (that package must be installed).
newzea <- map_data("nz")
#we need to find the variables associated with the "nz" map_data
head(newzea)
?map_data
#plot the map of new zealand without the correct aspect ratio
ggplot(data = newzea, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black")
#plot the map of new zealand with the correct aspect ratio using coord_quickmap()
ggplot(data = newzea, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon(fill = "white", color = "black") +
coord_quickmap()
The third coordinate system we are going to explore is coord_polar().
Polar coordinates reveal the connection between a bar-chart and a Coxcomb chart.
We will explore the connection between the bar-chart and the Coxcomb chart below:
#a quick reminder of our potential variables
?"diamonds"
bar <- ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1) +
theme(aspect.ratio = 1) + #allows you to customize non-data aspects of your plot
labs(x = NULL, y = NULL) #allows you the option to assign labels to your plot
#barchart which switches the x and y axes using coord_flip()
bar + coord_flip()
#Coxcomb chart created with polar coordinates
bar + coord_polar()
Grolemund, Garrett & Wickham, Hadley. R for Data Science. Sebastopol, CA. O’Reilly Media, Inc. 2017.