An R Learner’s Diary: ggplot2

For the graphical representation of data, the ggplot2 package is widely used:

library(tidyverse)

## -- Attaching packages ----------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

R for Data Science: Ch. 3: Data visualisation

Scatter Plot

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_vline(xintercept = mean(mpg$displ), size = 1) +
  geom_hline(yintercept = mean(mpg$hwy), size = 1) +
  geom_abline(intercept = 0, slope = 10, size = 1)

The command ggplot(data = mpg) creates an empty graph; empty, but ready to use the variables in the mpg data frame to make graphs.

The function geom_point() adds a layer of points. Each point in a scatter plot represents one observation (row) in the mpg data frame. It must have x and y values, at the minimum. So, you must say which data variable will be plotted on the x axis and which data variable will be plotted on the y axis. Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

The functions geom_vline(), geom_hline(), and geom_abline() add vertical, horizontal, and sloped lines, respectively. They do not need the data frame: it is merely incidental that the horizontal and vertical lines are in this case at the mean values of the corresponding variables in the data frame. The geom_vline() command could just as easily have been geom_vline(xintercept = 3, size = 1).

Returning to the scatter plot, each point may also have other features (besides its coordinates) such as color, size, shape, transparency, etc. These features may be tied to other variables in the mpg data frame. To nail down the horizontal location of the point representing the 23rd observation, you may use the value of the displ variable. Similarly, to nail down the color of the point representing the 23rd observation, you may use the value of the class variable for that observation.

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

If, however, you want all dots to be blue and shape to be determined by class:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, shape = class), color = "blue")

## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.

## Warning: Removed 62 rows containing missing values (geom_point).
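The first warning suggests specifying shapes manually. A minimal sketch, assuming the shape codes 0 to 6 are acceptable (they are an arbitrary pick for illustration, not something from the original post), uses scale_shape_manual() so that every class gets a shape and no rows are dropped:

```r
library(ggplot2)

# One shape code per class; codes 0, 1, 2, ... are an arbitrary choice
n_classes <- length(unique(mpg$class))

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class), color = "blue") +
  scale_shape_manual(values = seq_len(n_classes) - 1)
p
```

With seven classes, distinguishing the shapes by eye is still hard, which is the point of the warning; color is usually the easier aesthetic to read here.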

Puzzle: Suppose you wish to highlight the observations for class == "compact". How would you do it? Easy. Create, within the mpg data frame, a dummy (binary) variable called, say, compactcars that is 1 for compact cars and 0 otherwise. Then, in the color-by-class scatter plot above, use geom_point(mapping = aes(x = displ, y = hwy, color = compactcars)) instead of geom_point(mapping = aes(x = displ, y = hwy, color = class)).
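The dummy-variable idea can be sketched as follows. The names mpg_flagged and compactcars are illustrative. Wrapping the variable in factor() is an assumption made here so that ggplot2 uses two discrete colors rather than a continuous gradient for the numeric 0/1 values.

```r
library(ggplot2)
library(dplyr)

# Add a 0/1 indicator for compact cars (names are illustrative)
mpg_flagged <- mutate(mpg, compactcars = as.integer(class == "compact"))

# factor() makes ggplot2 treat the indicator as a discrete variable
p <- ggplot(data = mpg_flagged) +
  geom_point(mapping = aes(x = displ, y = hwy, color = factor(compactcars)))
p
```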


Facets

If you want multiple scatter plots, with each plot using observations that have the same value of the variable class, you can do this:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class)

If you want the plots to be arrayed in two rows:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

If you want multiple scatter plots, with each plot using observations that have the same value of the variable drv and the same value of the variable cyl, you can do this:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

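Think of the plots arrayed in a table. In facet_grid(drv ~ cyl), the variable to the left of the ~ defines the rows and the variable to the right defines the columns: each row of the table corresponds to a value of drv, and each column to a value of cyl.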
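The row and column roles can also be spelled out explicitly. This sketch assumes ggplot2 3.0.0 or later, where facet_grid() accepts rows and cols arguments built with vars(). The cross-tabulation, stored under the illustrative name panel_counts, shows which combinations of drv and cyl have observations and therefore which panels come out empty.

```r
library(ggplot2)

# Counts per (drv, cyl) combination; cells with zero become empty panels
panel_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
panel_counts

# Same layout as facet_grid(drv ~ cyl): drv in rows, cyl in columns
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p
```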

Plots that represent data collectively, not individually

Consider the geom_smooth() graphical object:

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Note that this is a graphical representation of a statistical model of the displ and hwy variables taken as a collection of observations; the displ and hwy values of the individual observations are not shown. To draw the fitted curve, however, you also need to decide various properties such as line thickness, color, line type, etc. And, as in the case of the points that make a scatter plot, these other properties of the fitted curve may be assigned to specific variables in the mpg data frame. For example, if the linetype property is assigned to the drv variable, then separate geom_smooth fitted curves are drawn for the separate sets of observations that share a particular value of the drv variable.

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

You could have used, say, color instead of linetype in the example above.
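As a quick sketch of that variation, the only change from the previous command is the name of the aesthetic:

```r
library(ggplot2)

# One fitted curve per drv value, distinguished by color instead of linetype
p <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
p
```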

If you want separate fitted curves for separate values of drv but you want the linetype to be the same for all curves:

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Of course, you can have separate curves for observations that share values for multiple variables. You may imagine splitting the data into groups of observations such that all observations in a group have the same value of drv and the same value of cyl, and then creating a separate geom_smooth() curve for each such group.
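One way to sketch this grouping is with interaction(), which builds one group per combination of drv and cyl. Two assumptions are made here that are not in the original post: method = "lm" replaces the default smoother because several of the groups are very small, and group_sizes is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Number of observations in each (drv, cyl) combination present in the data
group_sizes <- count(mpg, drv, cyl)
group_sizes

# One fitted line per combination; straight lines are used because
# some groups have only a handful of observations
p <- ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, group = interaction(drv, cyl)),
    method = "lm", se = FALSE
  )
p
```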

Multiple Graphs Superimposed

The scatter plot and the fitted curve that we created above can be superimposed in the same graph:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Stop! Stop! You are killing me. This is too much fun!

In this case, the following command is equivalent and requires less typing:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

If you want the fitted curve for only the observations with class == "compact" and no confidence intervals:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "compact"), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Regression Line

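Finally, specifying method = "lm" within the geom_smooth() command draws a straight regression line instead of the smooth curve. (Without a method argument, geom_smooth() picks a smoother automatically; for a data set as small as mpg that is method = 'loess', as the messages above report.)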

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy), method = "lm", se = FALSE)
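The line drawn by method = "lm" corresponds to an ordinary least-squares fit of y on x. As a small sketch, the same fit can be inspected directly with lm(); the object name fit is illustrative.

```r
# Fit hwy on displ, the same simple regression that the plot draws
fit <- lm(hwy ~ displ, data = ggplot2::mpg)

# Intercept and slope of the regression line
coef(fit)
```

The slope is negative: cars with larger engines tend to get fewer highway miles per gallon.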

Labels

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy), method = "lm", se = FALSE) +
  labs(title = "Cars", x = "Displacement", y = "Highway Mileage")

Bar Charts

In this case, the bars measure the count of observations that share a specific value or range of values of some variable. So, the variable that is being used to split the observations into bins must be provided.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

The geom_bar() command is a bit like the geom_smooth() command in the sense that it does not represent each observation in the data set individually; it is a summary. That summary is the count or number of observations in the above example. But the summary formula does not have to be the count. It could be the mean, the standard deviation, or some such statistical summary formula.
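As one illustration of a summary other than the count, the sketch below computes the mean price for each cut with dplyr and draws the result with geom_col(). Pre-computing the summary is an approach chosen here for simplicity, not the only way to do it, and mean_price_by_cut is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# One row per cut, holding the mean price of the diamonds with that cut
mean_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price))

# geom_col() draws bars whose heights are taken directly from the data
p <- ggplot(data = mean_price_by_cut) +
  geom_col(mapping = aes(x = cut, y = mean_price))
p
```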

Although bar charts really accommodate a single variable, you can squeeze in some information on a second variable.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))

The binning for the bars is still done with cut. But different colors are used within each bar to express the count for the various values of clarity.

If you want side-by-side bars, with cut as the primary binning variable and clarity as the secondary binning variable:

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Line Graphs

These are basically scatter plots with the dots joined by lines.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_line()

The observations are ordered by displ, the variable plotted on the x axis. Each line segment connects two observations that are adjacent in that ordering (or that share the same displ value).
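The ordering can be seen with a toy example. The data frame toy and its values are made up for illustration: the x values are deliberately out of order, yet the line is drawn from left to right.

```r
library(ggplot2)

# Toy data with x deliberately out of order (values are made up)
toy <- data.frame(x = c(3, 1, 2), y = c(30, 10, 20))

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_line()
p

# Data of the second layer (the line) as ggplot2 prepares it for drawing
layer_data(p, 2)$x
```

If you want the points joined in the order in which they appear in the data instead, geom_path() is the layer to use.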

Following the strategy discussed above for scatter plots, we can have separate lines for the different values of the drv variable.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_line(aes(color = drv))

We can make both the dots and the lines distinguishable by some variable, such as class:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() + 
  geom_line()

Histogram

ggplot(data = mpg) + geom_histogram(mapping = aes(x = hwy))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Next, the histogram is redone with density on the y-axis. Density is relative frequency divided by bin width, so with binwidth = 1 the bar heights are also the relative frequencies:

ggplot(data = mpg) + geom_histogram(mapping = aes(x = hwy, y = ..density..), fill = "salmon", binwidth=1, colour = "black")
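The relationship between density and relative frequency can be checked numerically. This sketch uses base R's hist() rather than ggplot2, with breaks chosen for illustration so that each bin has width 1 and is centred on an integer hwy value; the names bin_width, breaks, h, and rel_freq are illustrative.

```r
library(ggplot2)  # only for the mpg data

# Bins of width 1 centred on the integer hwy values
bin_width <- 1
breaks <- seq(min(mpg$hwy) - 0.5, max(mpg$hwy) + 0.5, by = bin_width)
h <- hist(mpg$hwy, breaks = breaks, plot = FALSE)

# Share of observations falling in each bin
rel_freq <- h$counts / sum(h$counts)

# Density is relative frequency divided by bin width
all.equal(h$density, rel_freq / bin_width)

# On the density scale, the bar areas sum to 1
sum(h$density * bin_width)
```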

Density Graph

ggplot(data = mpg) + geom_density(mapping = aes(x = hwy))

Density curve and density histogram superimposed!

ggplot(data = mpg) + 
  geom_histogram(mapping = aes(x = hwy, y = ..density..), fill = "salmon", binwidth=1, colour = "black") +
  geom_density(mapping = aes(x = hwy))

Next, separate histograms for separate values of a variable (here, cyl):

ggplot(data = mpg) + geom_histogram(mapping = aes(hwy)) + facet_wrap(~cyl)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

That’s all for now.