library("ggplot2")
library("patchwork")
ggplot2 has a pre-built dataset called ‘mpg’, so we are going to use it.
head(mpg) # Fuel economy data
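Every ggplot2 plot has three key components: data, a set of aesthetic mappings between variables in the data and visual properties, and at least one layer that describes how to render each observation ~>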
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
This gives us a scatter plot with ~>
data ~> mpg
Aesthetic mapping ~> Engine size (displ) is mapped to x position and fuel economy (hwy) is mapped to y position.
Layer ~> points.
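As a minimal sketch, the same scatter plot can be written with the argument names spelled out, so that each of the three components is visible in the call. The object name p is only an illustration and is not used elsewhere in this tutorial.

```r
library(ggplot2)

# Same scatter plot as above, with each component named explicitly.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # data + aesthetic mapping
  geom_point()                                                 # layer: points

# Printing the object draws the plot.
p
```

The shorter form used in the rest of this tutorial relies on positional matching: the first argument of ggplot() is the data, and the first two arguments of aes() are x and y.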
ggplot(mpg, aes(cty, hwy)) + geom_point()
There is a positive relationship, but many points overlap, so it is hard to judge how strong it is. Later we will see it more clearly.
ggplot(mpg, aes(model, manufacturer)) + geom_point()
Both variables are categorical here, and I can’t find any relationship. Can you?
To add additional variables to a plot, we can use other aesthetics like: color, shape, size.
ggplot(mpg, aes(displ, cty, color = class)) +
geom_point()
ggplot(mpg, aes(displ, cty, shape = drv)) +
geom_point()
ggplot(mpg, aes(displ, hwy, size = cty)) +
geom_point()
Let’s compare the following two plots.
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point(aes(color = "blue"))
p2 <- ggplot(mpg, aes(displ, hwy)) + geom_point(color = "blue")
p1 + p2
In the first plot, the value ‘blue’ is scaled to a pinkish color, and a legend is added. But in the second plot, the points are given the R color blue.
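The difference between mapping and setting can be sketched as follows. Inside aes(), "blue" is treated as a data value and passed through the colour scale; outside aes(), it is used directly as a colour. If we really want values inside aes() to be read as literal colour names, one option is scale_color_identity(). The object names p_set and p_identity are illustrative only.

```r
library(ggplot2)

# Setting: a constant colour given outside aes() is used as-is.
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "blue")

# Mapping: inside aes(), "blue" is treated as data.
# scale_color_identity() asks ggplot2 to interpret the mapped values
# as literal colour names instead of scaling them.
p_identity <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = "blue")) +
  scale_color_identity()

p_set
p_identity
```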
Another technique for displaying additional categorical variables on a plot is facetting. Here we will split the data into subsets and display the same graph for each subset.
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)
Here we split the graph by class, which is a discrete variable. But what will happen if we wrap a graph on a continuous variable (like hwy)?
ggplot(mpg, aes(displ, cyl)) + geom_point() + facet_wrap(~hwy)
Let’s compare it with a variable that takes only a few distinct values, like cyl.
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~cyl)
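facet_wrap draws one panel per distinct value of the facetting variable. A discrete variable with a few levels gives a handful of readable panels, while a continuous variable with many distinct values gives many panels, each holding only a few points.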
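To make the number of panels concrete, here is a small sketch that counts the distinct values of each facetting variable and then bins hwy before facetting. It assumes we are happy with four equal-width bins; the names mpg_binned and hwy_bin are made up for this example. cut_interval() is a ggplot2 helper that cuts a numeric vector into n intervals of equal width.

```r
library(ggplot2)

# Number of panels facet_wrap() would create for each variable.
n_cyl <- length(unique(mpg$cyl))
n_hwy <- length(unique(mpg$hwy))
n_cyl
n_hwy

# Bin the continuous variable first, then facet on the bins.
mpg_binned <- mpg
mpg_binned$hwy_bin <- cut_interval(mpg_binned$hwy, n = 4)

p <- ggplot(mpg_binned, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~hwy_bin)
p
```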
As we already know, a geom function adds a layer that describes how to render each observation. There are many commonly used plot types.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
To hide the standard error band (the gray point-wise confidence interval around the curve), set se = FALSE.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
If you draw the plots above with smoothing, geom_smooth will print a message about “loess”. So let’s talk about that.
Generally, geom_smooth takes an argument that is called ‘method’. Method allows us to choose which type of model we want to use to fit the smooth curve.
plot1 <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "loess", span = 0.2)
plot2 <- ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "loess", span = 1)
plot1 + plot2
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The message no longer mentions the method, because we specified it ourselves, but there is still a message about the formula.
library("mgcv")
Now we can use a formula such as ‘y ~ s(x)’ or, for larger datasets (more than 1,000 points), ‘y ~ s(x, bs = "cs")’.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "gam", formula = y ~ s(x))
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
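As a rough illustration of what method = "lm" with formula y ~ x corresponds to, the sketch below fits the same straight line directly with lm() and then passes both method and formula to geom_smooth(). Supplying the formula explicitly should also stop the message about the formula from being printed. The names fit and p are illustrative only.

```r
library(ggplot2)

# The straight line drawn by method = "lm" is an ordinary linear regression
# of y on x; here y is hwy and x is displ.
fit <- lm(hwy ~ displ, data = mpg)
coef(fit)  # intercept and slope of the fitted line

# Stating method and formula explicitly leaves nothing for geom_smooth to guess.
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
p
```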
When a dataset includes a categorical variable and one or more continuous variables, we often want to know how the values of the continuous variables vary with the levels of the categorical variable.
ggplot(mpg, aes(drv, hwy)) + geom_point()
As we can see, there is a lot of overplotting: many points are plotted in the same location, which makes it difficult to see the distribution. Let’s look at some useful techniques that help to avoid this problem.
ggplot(mpg, aes(drv, hwy)) + geom_jitter()
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
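To get a feel for how much overplotting there is in the plain point plot of drv against hwy, the following sketch counts how many rows share their (drv, hwy) combination with an earlier row. The variable names are illustrative and not part of ggplot2.

```r
library(ggplot2)

# Keep only the two variables that determine a point's position.
pairs <- mpg[, c("drv", "hwy")]

n_total    <- nrow(pairs)          # points drawn
n_distinct <- nrow(unique(pairs))  # distinct positions on the plot
n_hidden   <- n_total - n_distinct # points that land on top of another point

n_total
n_distinct
n_hidden
```

Jittering, boxplots and violin plots are different ways of showing the points that are hidden this way.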
Histograms and frequency polygons show the distribution of a single numeric variable.
ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
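Both histograms and frequency polygons work in the same way: they bin the data and then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines.
To adjust the width of the bins, we use the ‘binwidth’ argument.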
plot1 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 2.5)
plot2 <- ggplot(mpg, aes(hwy)) + geom_freqpoly(binwidth = 1)
plot1 + plot2
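To see the binning and counting step itself, we can ask ggplot2 for the data it computed for a layer. This sketch uses layer_data(), which returns the computed data frame for one layer of a plot, and checks that the bin counts add up to the number of rows in mpg. The names p and bins are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2.5)

# One row per bin: bin edges (xmin, xmax) and the number of observations (count).
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]

# Every observation falls into exactly one bin.
sum(bins$count) == nrow(mpg)
```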
ggplot(mpg, aes(displ, color = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
The discrete analogue of the histogram is the bar chart, geom_bar().
ggplot(mpg, aes(manufacturer)) + geom_bar()
geom_bar() is meant for unsummarized data, where each observation contributes one unit to the height of its bar. Its default behavior is to count the rows for each x value, so it doesn’t expect a y value, since it computes that itself. In fact, it will raise an error if you give it one.
If we explicitly set stat = "identity" in geom_bar(), we are telling ggplot2 to skip the aggregation because we will provide the y values ourselves. Let’s see.
ggplot(mpg, aes(manufacturer, cty)) +
geom_bar(stat = "identity")
enzyme <- data.frame(
Genes = c("Gene A", "Gene B", "Gene C"),
Expression = c(4.2, 9.7, 6.6)
)
enzyme
plot1 <- ggplot(enzyme, aes(Genes, Expression)) +
geom_bar(stat = "identity")
plot2 <- ggplot(enzyme, aes(Genes, Expression)) +
geom_point()
plot1 + plot2
Here we are telling geom_bar to not run the default stat which bins and counts the data.
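ggplot2 also provides geom_col(), which draws bars whose heights are the y values in the data, so it can be used in place of geom_bar(stat = "identity"). A small sketch with the same enzyme data frame, which is repeated here so the example runs on its own:

```r
library(ggplot2)

# Pre-summarized data: one row per bar, with the height already known.
enzyme <- data.frame(
  Genes = c("Gene A", "Gene B", "Gene C"),
  Expression = c(4.2, 9.7, 6.6)
)

# geom_col() uses the y values as bar heights without counting rows.
p <- ggplot(enzyme, aes(Genes, Expression)) +
  geom_col()
p
```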
For time series, line and path plots are typically used. A line plot joins the points from left to right, while a path plot joins them in the order in which they appear in the dataset.
head(economics)
ggplot(economics, aes(date, unemploy)) + geom_line()
Let’s plot unemployment rate vs. length of unemployment and join the individual observations with a path.
ggplot(economics, aes(unemploy/pop, uempmed)) +
geom_path() +
geom_point()
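The difference between the two geoms is easiest to see on a tiny dataset whose rows are not sorted by x. The data frame df below is made up purely for illustration.

```r
library(ggplot2)

# Three points stored in an order that is NOT sorted by x.
df <- data.frame(x = c(3, 1, 2), y = c(1, 2, 3))

# geom_line() connects the points in order of x: 1 -> 2 -> 3.
p_line <- ggplot(df, aes(x, y)) + geom_line() + geom_point()

# geom_path() connects the points in row order: 3 -> 1 -> 2.
p_path <- ggplot(df, aes(x, y)) + geom_path() + geom_point()

p_line
p_path
```

In the economics example, the rows are ordered by date, so the path traces how the two unemployment measures moved together over time.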
ggplot(economics, aes(unemploy/pop, uempmed)) +
geom_path(color = "gray60") +
geom_point(aes(color = date))
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1/3)
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1/3) +
xlab("City driving ") +
ylab("Highway driving")
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1/3) +
xlab(NULL) +
ylab(NULL)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25) +
xlim("f", "r") +
ylim(25, 35)
## Warning: Removed 148 rows containing missing values (`geom_point()`).
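We get a warning message saying that 148 rows containing missing values were removed (geom_point). The mpg data has no missing values here: xlim and ylim convert every value outside the limits to NA, and those rows are then dropped.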
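As a sanity check, we can count the rows that fall outside the limits ourselves; this number should correspond to the rows reported in the warning. The sketch also shows coord_cartesian(), which zooms the y axis without discarding any data. The variable names are illustrative.

```r
library(ggplot2)

# Rows outside xlim("f", "r") or outside ylim(25, 35).
outside_x <- !(mpg$drv %in% c("f", "r"))
outside_y <- mpg$hwy < 25 | mpg$hwy > 35
n_removed <- sum(outside_x | outside_y)
n_removed

# Alternative for the y axis: zoom in without turning values into NA.
p_zoom <- ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(width = 0.25) +
  coord_cartesian(ylim = c(25, 35))
p_zoom
```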
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25, na.rm = TRUE) +
ylim(NA, 35)
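Here na.rm = TRUE drops the values that fall outside the limits silently, so the jitter plot is drawn without the warning. Passing NA as a limit, as in ylim(NA, 35), leaves that end of the axis to be computed from the data.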
We can save a plot to a variable and manipulate it later, like this ~>
plot <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
geom_point()
plot
To save the most recently displayed plot, we use ggsave with an explicit width and height.
ggsave("neko.png", width = 5, height = 5)
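ggsave() saves the last displayed plot by default, but it also accepts a plot argument, which avoids depending on what happened to be drawn last. A minimal sketch, assuming we write to a temporary directory; the names p and out_file are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = factor(cyl))) +
  geom_point()

# Write to a temporary location so the example does not clutter the working directory.
out_file <- file.path(tempdir(), "neko.png")

# Pass the plot object explicitly instead of relying on the last plot drawn.
ggsave(out_file, plot = p, width = 5, height = 5)

file.exists(out_file)
```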
summary(plot)
## data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
## class [234x11]
## mapping: x = ~displ, y = ~hwy, colour = ~factor(cyl)
## faceting: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity