1 Loading

library("ggplot2")
library("patchwork")

ggplot2 has a pre-built dataset called ‘mpg’, so we are going to use it.

head(mpg) # Fuel economy data

1.1 Components

Every ggplot2 plot contains 3 key components ~>

data ~> Data frame (input)
aesthetic mappings ~> A set of mappings between variables in the data and visual properties
geom function ~> At least one layer which describes how to render each observation.

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point()

So we had a scatter plot with ~>

data ~> mpg
Aesthetic mapping ~> Engine size (displ) is mapped to x position and fuel economy (hwy) is mapped to y position.
Layer ~> points.
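To make the three components easier to spot, here is a minimal sketch of the same scatter plot with the argument names written out. The object name `p` is only an illustration.

```r
library(ggplot2)

# Same scatter plot as above, with each component named explicitly.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # data + aesthetic mappings
  geom_point()                                                # one layer: points

# A plot object can be stored and displayed later.
p
```

Writing `data =` and `mapping =` is optional, but it can help while the grammar is still new.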

1.2 Exploration

Relationship between cty and hwy ~>

ggplot(mpg, aes(cty, hwy)) + geom_point()

A positive relationship can be observed, but many points overlap, so it is hard to judge how strong it is. Later we will see it more clearly.

Relationship between model and manufacturer ~>

ggplot(mpg, aes(model, manufacturer)) + geom_point()

I can’t find any relationship. Can you? Both variables are categorical, so the plot only shows which models belong to which manufacturer.

2 Other aesthetic attributes

To add additional variables to a plot, we can use other aesthetics such as color, shape, and size.

2.1 Based on color

ggplot(mpg, aes(displ, cty, color = class)) +
  geom_point()

2.2 Based on shape

ggplot(mpg, aes(displ, cty, shape = drv)) +
  geom_point()

2.3 Based on size

ggplot(mpg, aes(displ, hwy, size = cty)) +
  geom_point()

3 Comparison

Let’s compare the following two plots.

p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = "blue"))

p2 <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")

p1 + p2

In the first plot, ‘blue’ sits inside aes(), so it is treated as a data value: it is scaled to a pinkish color, and a legend is added. In the second plot, color is set outside aes(), so the points are simply given the R color blue.
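One way to check the difference between mapping and setting is to look at the colors each layer is actually drawn with. This is a small sketch using `layer_data()` from ggplot2; the variable names `mapped_colours` and `set_colours` are made up for the example.

```r
library(ggplot2)

p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = "blue"))  # mapped
p2 <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")       # set

# layer_data() returns the data a layer is drawn with, after scaling.
mapped_colours <- unique(layer_data(p1)$colour)
set_colours <- unique(layer_data(p2)$colour)

mapped_colours  # a single colour picked by the default discrete colour scale
set_colours     # the literal colour "blue"
```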

4 Facetting

Another technique for displaying additional categorical variables on a plot is facetting. Here we will split the data into subsets and display the same graph for each subset.

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)

Here we split the graph by class, which is a discrete variable. But what happens if we facet by a continuous variable (like hwy)?

ggplot(mpg, aes(displ, cyl)) + geom_point() + facet_wrap(~hwy)

Let’s compare it with a more discrete variable like cyl.

ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~cyl)

So facet_wrap() draws one panel per distinct value: a discrete variable like cyl gives only a handful of panels, while a continuous variable like hwy gives many panels with few points in each.
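The number of panels can be checked before plotting by counting distinct values. The second half of this sketch shows one possible workaround that is not part of the notes above: binning the continuous variable with `cut_width()` first. The column name `hwy_bin` and the bin width of 10 are arbitrary choices for illustration.

```r
library(ggplot2)

# facet_wrap() draws one panel per distinct value of the facetting variable.
n_panels_cyl <- length(unique(mpg$cyl))
n_panels_hwy <- length(unique(mpg$hwy))
n_panels_cyl
n_panels_hwy

# Possible workaround (an assumption, not from the text above):
# bin the continuous variable first, then facet on the bins.
mpg_binned <- mpg
mpg_binned$hwy_bin <- cut_width(mpg$hwy, width = 10)

ggplot(mpg_binned, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~hwy_bin)
```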

5 Plot geoms

As we already know, a geom function adds a layer which describes how to render each observation. Here are some commonly used plot types.

geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
geom_boxplot() produces a box-and-whisker plot to summarize the distribution of a set of points.
geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
geom_bar() shows the distribution of categorical variables.
geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.

5.1 Smoothing a plot

ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

If you want to turn off the standard error band (the grayish point-wise confidence interval), set se = FALSE.

ggplot(mpg, aes(displ, hwy)) +
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

If you run the plots above, geom_smooth() will throw a message about “loess”. So let’s talk about that.

Generally, geom_smooth() takes an argument called ‘method’, which lets us choose the type of model used to fit the smooth curve.

5.1.1 Methods

method = “loess” ~> Fits a locally weighted regression; it is the default for small datasets. The wiggliness of the line is controlled by the span parameter, which ranges from 0 (extremely wiggly) to 1 (not so wiggly).

plot1 <- ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth(method = "loess", span = 0.2)

plot2 <- ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth(method = "loess", span = 1)

plot1 + plot2

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

This time the message no longer reports the method, because we chose it ourselves, but it still reports the formula that is being used.
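If the remaining message is unwanted, supplying both the method and the formula should make geom_smooth() stay quiet. A minimal sketch, where the span of 0.5 is just an example value:

```r
library(ggplot2)

# Both method and formula are given, so nothing is left for geom_smooth() to guess.
p_quiet <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.5)

p_quiet
```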

method = “gam” ~> It’s a generalized additive model provided by the ‘mgcv’ package. So let’s load it first.

library("mgcv")

Now we can use a formula like ‘formula = y ~ s(x)’ or ‘formula = y ~ s(x, bs = "cs")’ (the latter is meant for larger data, meaning more than 1,000 points).

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x))
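For comparison, here is a sketch of the second formula with the ‘cs’ basis. mpg has far fewer than 1,000 points, so this is only to show the syntax; the object name `p_gam` is made up.

```r
library(ggplot2)
library(mgcv)  # provides the gam() fitting function used by method = "gam"

p_gam <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

p_gam
```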

method = “lm” ~> It’s a linear model, giving the line of best fit.

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula = 'y ~ x'
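To see what the line of best fit means, the same straight line can be fitted by hand with lm() and drawn with geom_abline(). This is a sketch for illustration; `fit` is a made-up name.

```r
library(ggplot2)

# Fit the straight line that method = "lm" with formula y ~ x corresponds to.
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope of the line of best fit

# Draw that line explicitly from the fitted coefficients.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_abline(
    intercept = coef(fit)[["(Intercept)"]],
    slope = coef(fit)[["displ"]]
  )
```

Unlike geom_smooth(), geom_abline() draws no confidence band and extends across the whole panel.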

5.2 Boxplot and jittered points

When a dataset includes a categorical variable and one or more continuous variables, we often want to see how the values of the continuous variables vary with the levels of the categorical variable.

ggplot(mpg, aes(drv, hwy)) + geom_point()

As we can see, there is a lot of overplotting: many points are plotted in the same location, which makes it difficult to see the distribution. Let’s see some useful techniques that help to avoid this problem.

Jittering ~> geom_jitter() ~> Adds a little random noise to the data which can help avoid over plotting.

ggplot(mpg, aes(drv, hwy)) + geom_jitter()

Boxplots, geom_boxplot() ~> Summarize the shape of the distribution with a handful of summary statistics.

ggplot(mpg, aes(drv, hwy)) + geom_boxplot()

Violin, geom_violin() ~> Show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.

ggplot(mpg, aes(drv, hwy)) + geom_violin()
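These techniques can also be layered. A minimal sketch that puts jittered points on top of a boxplot; the width and alpha values are arbitrary choices, and `p_combo` is a made-up name.

```r
library(ggplot2)

p_combo <- ggplot(mpg, aes(drv, hwy)) +
  geom_boxplot(outlier.shape = NA) +                 # hide outliers; the raw points show them
  geom_jitter(width = 0.2, height = 0, alpha = 0.4)  # jitter sideways only

p_combo
```

Setting height = 0 keeps the hwy values exact, so only the horizontal position is randomized.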

5.3 Histogram and frequency polygons

Histograms and frequency polygons show the distribution of a single numeric variable.

ggplot(mpg, aes(hwy)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mpg, aes(hwy)) + geom_freqpoly()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Both histograms and frequency polygons work in the same way ~>

Bin the data
Then count the number of observations in each bin.
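The two steps can be sketched by hand with cut_width() and table(). This only illustrates the idea; the bin boundaries that geom_histogram() picks may differ from the ones here.

```r
library(ggplot2)

# Step 1: bin the data. cut_width() assigns each value to a bin of fixed width.
hwy_bins <- cut_width(mpg$hwy, width = 2.5)

# Step 2: count the number of observations in each bin.
bin_counts <- table(hwy_bins)
bin_counts
```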

If we want to adjust the width of the bins, we use the ‘binwidth’ argument.

plot1 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)

plot2 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)

plot1 + plot2

5.3.1 Colorization

ggplot(mpg, aes(displ, color = drv)) +
  geom_freqpoly(binwidth = 0.5)

5.3.2 Facetting

ggplot(mpg, aes(displ, fill = drv)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~drv, ncol = 1)

5.4 Bar charts

The discrete analogue of the histogram is the bar chart, geom_bar().

ggplot(mpg, aes(manufacturer)) + geom_bar()

geom_bar() is meant for unsummarized data: each observation contributes one unit to the height of its bar. By default it counts the rows for each x value, so it doesn’t expect a y value, since it’s going to count that up itself. In fact, it will raise an error if you give it one while the default count stat is in use.

If we explicitly say stat = “identity” in geom_bar(), then we are telling ggplot2 to skip the aggregation because we will be providing the y values ourselves. Let’s see.

ggplot(mpg, aes(manufacturer, cty)) +
  geom_bar(stat = "identity")
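Since mpg has many rows per manufacturer, the rows are stacked, so each bar ends up as tall as the sum of cty for that manufacturer. A small sketch to check the totals; `p_bar` and `cty_totals` are made-up names.

```r
library(ggplot2)

p_bar <- ggplot(mpg, aes(manufacturer, cty)) +
  geom_bar(stat = "identity")

# With stat = "identity" the rows are stacked, so each bar's height
# is the sum of cty over that manufacturer's rows.
cty_totals <- tapply(mpg$cty, mpg$manufacturer, sum)
cty_totals
```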

5.4.1 Presummarized data

enzyme <- data.frame(
  Genes = c("Gene A", "Gene B", "Gene C"),
  Expression = c(4.2, 9.7, 6.6)
)

enzyme

plot1 <- ggplot(enzyme, aes(Genes, Expression)) +
  geom_bar(stat = "identity")

plot2 <- ggplot(enzyme, aes(Genes, Expression)) +
  geom_point()

plot1 + plot2

Here we are telling geom_bar() not to run its default stat, which counts the rows for each x value, and to use the y values as they are.
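ggplot2 also provides geom_col(), which uses the y values as bar heights directly. A minimal sketch with the same enzyme data; `p_col` is a made-up name.

```r
library(ggplot2)

enzyme <- data.frame(
  Genes = c("Gene A", "Gene B", "Gene C"),
  Expression = c(4.2, 9.7, 6.6)
)

# geom_col() uses the y values as bar heights without any counting.
p_col <- ggplot(enzyme, aes(Genes, Expression)) +
  geom_col()

p_col
```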

5.5 Time series (line and path)

For time series, line and path plots are typically used. A line plot joins the points from left to right, while a path plot joins them in the order that they appear in the dataset.

head(economics)

5.5.1 Line plot

ggplot(economics, aes(date, unemploy)) + geom_line()

5.5.2 Path plot

Let’s plot unemployment rate vs. length of unemployment and join the individual observations with a path.

ggplot(economics, aes(unemploy/pop, uempmed)) +
  geom_path() +
  geom_point()

ggplot(economics, aes(unemploy/pop, uempmed)) +
  geom_path(color = "gray60") +
  geom_point(aes(color = date))

6 Modifying the axes

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3)

6.1 Giving axes labels

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3) +
  xlab("City driving") +
  ylab("Highway driving")
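Both labels can also be set in one call with labs(), which accepts a title as well. A sketch, where the title text and the "(mpg)" unit suffixes are made up for the example:

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(alpha = 1/3) +
  labs(
    x = "City driving (mpg)",
    y = "Highway driving (mpg)",
    title = "Fuel economy: city vs. highway"  # title text is a made-up example
  )

p_labs
```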

6.2 Removing axes labels

ggplot(mpg, aes(cty, hwy)) + 
  geom_point(alpha = 1/3) +
  xlab(NULL) +
  ylab(NULL)

7 Modifying the limits of axes

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25)

7.1 Giving axes limits

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25) +
  xlim("f", "r") +
  ylim(25, 35)

## Warning: Removed 148 rows containing missing values (`geom_point()`).

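We get a warning message saying that 148 rows containing missing values were removed (geom_point). xlim() and ylim() turn every value outside the limits into NA, and those rows are then dropped.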
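The rows that fall outside the limits can be counted by hand. The second half of this sketch shows coord_cartesian(), an alternative not used in the notes above, which zooms in without discarding data. The names `outside` and `p_zoom` are made up.

```r
library(ggplot2)

# Rows outside xlim("f", "r") or ylim(25, 35) are the ones expected to be dropped.
outside <- !(mpg$drv %in% c("f", "r")) | mpg$hwy < 25 | mpg$hwy > 35
sum(outside)

# coord_cartesian() zooms the view without discarding any rows.
p_zoom <- ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25) +
  coord_cartesian(ylim = c(25, 35))

p_zoom
```

The distinction matters most for layers that compute summaries, such as boxplots or smooths, because dropped rows change the summary while zooming does not.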

7.2 Set NA to one limit

ggplot(mpg, aes(drv, hwy)) + 
  geom_jitter(width = 0.25, na.rm = TRUE) +
  ylim(NA, 35)

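Here ylim(NA, 35) sets only the upper limit and leaves the lower limit to be computed from the data, while na.rm = TRUE silently drops the rows that fall outside the limits, so no warning is shown.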

8 Output

We can save the plot to a variable and manipulate it, like the following ~>

plot <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

plot

8.1 Saving as png

To save the last plot, we will use ggsave() with a defined width and height.

ggsave("neko.png", width = 5, height = 5)
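ggsave() saves the last displayed plot by default, which can be surprising in longer scripts. A sketch that passes the plot explicitly and writes to a temporary file; `p_save`, `out_file`, and the dpi value are choices made for this example.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

# Write to a temporary file so the sketch does not clutter the working directory.
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p_save, width = 5, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```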

8.2 Summarize

summary(plot)

## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
##   class [234x11]
## mapping:  x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

Data Visualization with ggplot2 One

Neko_Chan666

2023-07-16