# Fuel economy data
library(ggplot2)
mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
Every ggplot2 plot has three key components:
1. Data,
2. A set of aesthetic mappings between variables in the data and visual properties, and
3. At least one layer which describes how to render each observation. Layers are usually created with a geom function.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() + geom_line()
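As a rough check of how the three components fit together, the sketch below stores a plot in a variable and asks ggplot2 for the data its first layer will draw. The names `p` and `built` are only illustrative; `layer_data()` is the ggplot2 helper that returns the computed data for one layer.

```r
library(ggplot2)

# Data (mpg) + aesthetic mappings (displ -> x, hwy -> y) + one layer (points)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# layer_data() returns the data frame that layer 1 draws:
# one row per observation, with the mapped variables renamed to x and y
built <- layer_data(p, 1)
head(built[, c("x", "y")])
nrow(built)  # same number of rows as mpg
```

Because the layer is a plain scatterplot, each row of `mpg` becomes exactly one point.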
## Exercise
How would you describe the relationship between cty and hwy? Do you have any concerns about drawing conclusions from that plot?
ggplot(mpg, aes(cty, hwy)) + geom_point(colour = 'red') + geom_line(colour = 'skyblue')
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Colour, size, shape and other aesthetic attributes
ggplot(mpg, aes(displ, hwy, colour = class)) +
geom_point()
This gives each point a unique colour corresponding to its class. The legend allows us to read data values from the colour, showing us that the group of cars with unusually high fuel economy for their engine size are two seaters: cars with big engines, but lightweight bodies.
If you want to set an aesthetic to a fixed value, without scaling it, do so in the individual layer outside of aes(). Compare the following two plots:
ggplot(mpg, aes(displ, hwy)) + geom_point(aes(colour = "blue"))
ggplot(mpg, aes(displ, hwy)) + geom_point(colour = "blue") + geom_line(colour = 'red')
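One way to see the difference between mapping and setting is to inspect the colour that each layer ends up drawing. This is a minimal sketch using `layer_data()`; the object names `mapped` and `fixed` are made up for illustration, and the extra line layer from the plot above is left out to keep the comparison focused.

```r
library(ggplot2)

# Mapped: "blue" is treated as a one-level variable and passed through the colour scale
mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = "blue"))

# Set: every point is drawn in the literal colour blue, with no scaling and no legend
fixed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

unique(layer_data(mapped)$colour)  # a colour picked by the scale, not "blue"
unique(layer_data(fixed)$colour)   # "blue"
```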
Different types of aesthetic attributes work better with different types
of variables. For example, colour and shape work well with categorical
variables, while size works well for continuous variables. The amount of
data also makes a difference: if there is a lot of data it can be hard
to distinguish different groups. An alternative solution is to use
faceting, as described next.
When using aesthetics in a plot, less is usually more. It’s difficult to see the simultaneous relationships among colour and shape and size, so exercise restraint when using aesthetics. Instead of trying to make one very complex plot that shows everything at once, see if you can create a series of simple plots that tell a story, leading the reader from ignorance to knowledge.
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = displ, size = displ))
ggplot(mpg, aes(cty, hwy)) + geom_point(aes(colour = model, shape = trans))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 10 values. Consider specifying shapes manually if you need
## that many of them.
## Warning: Removed 96 rows containing missing values or values outside the scale range
## (`geom_point()`).
Another technique for displaying additional categorical variables on a plot is faceting. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset. There are two types of faceting: grid and wrapped. Wrapped is the most useful, so we’ll discuss it here, and you can learn about grid faceting later. To facet a plot you simply add a faceting specification with facet_wrap(), which takes the name of a variable preceded by ~.
ggplot(mpg, aes(displ, hwy)) + geom_point() + facet_wrap(~class)
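The "split into subsets" idea can be checked directly. The sketch below, which assumes nothing beyond ggplot2 itself, facets by class with an explicit `nrow` and then confirms that the number of panels matches the number of classes. The names `p` and `panels` are illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class, nrow = 2)  # nrow (or ncol) controls the panel layout

# Each observation is assigned to exactly one panel, one panel per class
panels <- layer_data(p)$PANEL
length(unique(panels))
length(unique(mpg$class))
```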
# Plot geoms
You might guess that by substituting geom_point() for a
different geom function, you’d get a different type of plot. That’s a
great guess! In the following sections, you’ll learn about some of the
other important geoms provided in ggplot2. This isn’t an exhaustive
list, but should cover the most commonly used plot types.
- geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
- geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.
- geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
- geom_bar() shows the distribution of categorical variables.
- geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.

## Adding a smoother to a plot
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
This overlays the scatterplot with a smooth curve, including an assessment of uncertainty in the form of point-wise confidence intervals shown in grey. If you’re not interested in the confidence interval, turn it off with geom_smooth(se = FALSE).
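For reference, a minimal sketch of the same plot with the confidence band switched off; only the `se` argument differs from the code above.

```r
library(ggplot2)

# Same scatterplot and smoother, but without the grey confidence band
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
p
```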
An important argument to geom_smooth() is the method, which allows you to choose which type of model is used to fit the smooth curve:
method = "loess", the default for small n, uses a smooth local regression (as described in ?loess). The wiggliness of the line is controlled by the span parameter, which ranges from 0 (exceedingly wiggly) to 1 (not so wiggly).
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 0.2)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(span = 1)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(mtcars, aes(x = wt, y = mpg)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue", linetype = "dashed")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point() +
geom_smooth(mapping = aes(x = depth, y = table),
method = "glm",
formula = y ~ x,
se = TRUE,
color = "blue",
fill = "lightblue",
linetype = "dashed")
ggplot(diamonds, aes(carat, price)) +
geom_point() +
geom_smooth(
method = "glm",
formula = y ~ x,
se = TRUE,
color = "blue",
fill = "lightblue",
linetype = "dashed"
)
ggplot(mpg, aes(drv, hwy)) +
geom_point()
Because there are few unique values of both drv and hwy, there is a lot
of overplotting. Many points are plotted in the same location, and it’s
difficult to see the distribution. There are three useful techniques
that help alleviate the problem:
- Jittering, geom_jitter(), adds a little random noise to the data which can help avoid overplotting.
- Boxplots, geom_boxplot(), summarise the shape of the distribution with a handful of summary statistics.
- Violin plots, geom_violin(), show a compact representation of the “density” of the distribution, highlighting the areas where more points are found.

These are illustrated below:
ggplot(mpg, aes(drv, hwy)) + geom_jitter()
ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
ggplot(mpg, aes(drv, hwy)) + geom_violin()
# Histograms and frequency polygons
ggplot(mpg, aes(hwy)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(mpg, aes(hwy)) + geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Both histograms and frequency polygons work in the same way: they bin the data, then count the number of observations in each bin. The only difference is the display: histograms use bars and frequency polygons use lines.
You can control the width of the bins with the binwidth argument (if you don’t want evenly spaced bins you can use the breaks argument). It is very important to experiment with the bin width. The default just splits your data into 30 bins, which is unlikely to be the best choice. You should always try many bin widths, and you may find you need multiple bin widths to tell the full story of your data.
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 2.5)
ggplot(mpg, aes(hwy)) +
geom_freqpoly(binwidth = 1)
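To make the bin-then-count step concrete, the sketch below builds a histogram and looks at the table of bins behind it. The bin width of 2.5 is an arbitrary choice for illustration, and `bins` is just a local name.

```r
library(ggplot2)

# Bin hwy into 2.5-mpg-wide bins
p <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2.5)

# One row per bin: its boundaries and the number of observations inside
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]

# Every observation lands in exactly one bin
sum(bins$count) == nrow(mpg)
```

Re-running this with a different `binwidth` shows how strongly the apparent shape of the distribution depends on that choice.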
An alternative to the frequency polygon is the density plot,
geom_density(). A little care is required if you’re using density plots:
compared to frequency polygons they are harder to interpret since the
underlying computations are more complex. They also make assumptions
that are not true for all data, namely that the underlying distribution
is continuous, unbounded, and smooth.
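A minimal density plot might look like the sketch below. The second plot uses the `adjust` argument, a multiplier on the default bandwidth, to show how much the curve depends on the amount of smoothing; the value 1/2 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Kernel density estimate of highway mileage; the y axis is a density, not a count
p_default <- ggplot(mpg, aes(hwy)) +
  geom_density()

# Half the default bandwidth: a wigglier curve that follows the data more closely
p_narrow <- ggplot(mpg, aes(hwy)) +
  geom_density(adjust = 1 / 2)

p_default
p_narrow
```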
To compare the distributions of different subgroups, you can map a categorical variable to either fill (for geom_histogram()) or colour (for geom_freqpoly()). It’s easier to compare distributions using the frequency polygon because the underlying perceptual task is easier. You can also use faceting: this makes comparisons a little harder, but it’s easier to see the distribution of each group.
ggplot(mpg, aes(displ, colour = drv)) +
geom_freqpoly(binwidth = 0.5)
ggplot(mpg, aes(displ, fill = drv)) +
geom_histogram(binwidth = 0.5) +
facet_wrap(~drv, ncol = 1)
ggplot(mpg, aes(manufacturer)) + geom_bar(fill = "skyblue")
Bar charts can be confusing because there are two rather different plots
that are both commonly called bar charts. The above form expects you to
have unsummarised data, and each observation contributes one unit to the
height of each bar. The other form of bar chart is used for
presummarised data. For example, you might have three drugs with their
average effect:
drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)
To display this sort of data, you need to tell geom_bar() to not run the default stat which bins and counts the data. However, we think it’s even better to use geom_point() because points take up less space than bars, and don’t require that the y axis includes 0.
ggplot(drugs, aes(drug, effect)) + geom_bar(stat = "identity", fill = "green", colour = "black")
ggplot(drugs, aes(drug, effect)) + geom_point()
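ggplot2 also provides geom_col(), which draws bars whose heights are the y values in the data and so avoids spelling out stat = "identity". The sketch below repeats the drugs example with it; the data frame is redefined only so the block runs on its own.

```r
library(ggplot2)

drugs <- data.frame(
  drug = c("a", "b", "c"),
  effect = c(4.2, 9.7, 6.1)
)

# geom_col() uses the y values as bar heights, with no counting step
p <- ggplot(drugs, aes(drug, effect)) +
  geom_col()
p
```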
ggplot(mpg, aes(x = class, color = class)) +
geom_bar(fill = "white")
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3)
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab("city driving (mpg)") +
ylab("highway driving (mpg)")
# Remove the axis labels with NULL
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 1 / 3) +
xlab(NULL) +
ylab(NULL)
xlim() and ylim() modify the limits of axes:
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25)
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25) +
xlim("f", "r") +
ylim(20, 30)
## Warning: Removed 136 rows containing missing values or values outside the scale range
## (`geom_point()`).
# For continuous scales, use NA to set only one limit
ggplot(mpg, aes(drv, hwy)) +
geom_jitter(width = 0.25, na.rm = TRUE) +
ylim(NA, 30)
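The warnings above arise because values outside the axis limits are converted to NA before drawing. A rough way to confirm this is to count the values that survive. The sketch uses geom_point() rather than geom_jitter() so the result does not depend on random noise; `drawn` is an illustrative name.

```r
library(ggplot2)

p <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(na.rm = TRUE) +  # na.rm = TRUE suppresses the warning
  ylim(20, 30)

# Values outside the limits do not survive as drawable points
drawn <- layer_data(p)
sum(!is.na(drawn$y))                # points that remain
sum(mpg$hwy >= 20 & mpg$hwy <= 30)  # observations inside the limits
```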
geom_jitter() adds random noise (horizontal/vertical) to separate
overlapping points and show the true structure of your data.
# Basic scatter plot (points will overlap)
ggplot(mpg, aes(x = drv, y = hwy)) +
geom_point(color = "blue") +
ggtitle("Without Jitter (geom_point)") +
theme_minimal()
# With jitter (overlapping points are spread apart)
ggplot(mpg, aes(x = drv, y = hwy)) +
geom_jitter(width = 0.2, color = "darkgreen", alpha = 0.6) +
ggtitle("With Jitter (geom_jitter)") +
theme_minimal()
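Because the noise is random, a jittered plot looks slightly different each time it is drawn. If a reproducible figure is needed, one option is to pass position_jitter() with a fixed seed to geom_point(). The sketch below also sets height = 0 so that only the horizontal position is perturbed; the seed value 1 is arbitrary.

```r
library(ggplot2)

# position_jitter() exposes the jitter settings, including a seed
p <- ggplot(mpg, aes(drv, hwy)) +
  geom_point(
    position = position_jitter(width = 0.2, height = 0, seed = 1),
    alpha = 0.6
  )

# The same seed gives the same horizontal offsets on every build
identical(layer_data(p)$x, layer_data(p)$x)
```

Keeping height = 0 leaves the hwy values exactly where the data put them, which matters when the y axis carries the quantity of interest.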