Let’s start by looking at the `gapminder` dataset again.
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
We’ll begin with the `ggplot()` command. The two crucial components you need are the data and the mapping. Both of these are arguments of the `ggplot()` function. The data refers to the dataset you are currently using, which should be a data frame or a tibble.
Note that `mapping` is an argument of `ggplot()`, while `aes()` is a separate function, called inside the `ggplot()` call, with its own arguments (in this case, `x` and `y`).
Notice how our code can extend past a single line. We also indent for readability.
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp))
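To see that `aes()` really is its own function, here is a small sketch that calls it outside of `ggplot()`. The object name `m` is just for illustration.

```r
library(ggplot2)

# aes() records which variables go with which aesthetics;
# it does not look the variables up until a plot is built
m <- aes(x = gdpPercap, y = lifeExp)

m          # printing shows the aesthetic-to-variable pairs
names(m)   # the aesthetics that were mapped
```

The result of `aes()` is what gets handed to the `mapping` argument of `ggplot()`.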
We can set up the plot, but nothing is drawn until we add a `geom`.
Next, we add the point geom. Notice that this is a second layer on top of the base layer, added with the `+` symbol. (Note: in the rest of the tidyverse, we use `%>%` or `|>` as the piping operator.)
ggplot(data = gapminder,
       mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
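The two operators can appear in the same statement. The sketch below assumes R 4.1 or later (for the native `|>` pipe) and uses a made-up object name, `piped_plot`.

```r
library(ggplot2)
library(gapminder)

# |> hands gapminder to ggplot() as its first argument (data);
# from that point on, layers are added with +
piped_plot <- gapminder |>
  ggplot(mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

piped_plot
```

A common slip is to keep piping after `ggplot()`; once you are inside a plot, every new layer is joined with `+`.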
Alternatively, if we have saved the base plot as an object, we can add the geom layer to it:
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point()
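One way to check the layering idea is to count the layers stored in each plot object. This is only a sketch; `base_plot` and `with_points` are illustrative names, and it relies on the `layers` element of a ggplot object.

```r
library(ggplot2)
library(gapminder)

base_plot <- ggplot(data = gapminder,
                    mapping = aes(x = gdpPercap, y = lifeExp))
with_points <- base_plot + geom_point()

length(base_plot$layers)    # the base plot has no layers, so nothing is drawn
length(with_points$layers)  # adding geom_point() gives one layer
```

Adding a layer returns a new object, so `base_plot` itself stays unchanged and can be reused with other geoms.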
One thing we often want to do is to add a smoother. This allows us to visually highlight the general trend in the data. Think of it as a visual analogue to a correlation (or regression) coefficient.
To add a linear smoother:
p +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm")
Notice that the visual presentation shows that a linear best fit summary of the data is likely to be very misleading. This would not be as apparent with just a regression coefficient.
What about a loess-type smoother? This is a non-parametric smoother that is more flexible than a linear smoother.
p + geom_smooth(formula = y ~ x, method = "loess")
When exploring the data, it’s useful to include the original data points on the plot. This can help you see how the smoother is fitting the data.
p + geom_point() + geom_smooth(formula = y ~ x, method="loess")
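How flexible the loess curve is can be adjusted with the `span` argument of `geom_smooth()`. The values 0.3 and 1 below are arbitrary choices for illustration, as are the object names.

```r
library(ggplot2)
library(gapminder)

p_demo <- ggplot(data = gapminder,
                 mapping = aes(x = gdpPercap, y = lifeExp))

# Smaller span: each local fit uses fewer neighbouring points (wigglier line)
wiggly <- p_demo + geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x, span = 0.3)

# Larger span: each local fit uses more of the data (smoother line)
smoother <- p_demo + geom_point(alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x, span = 1)

wiggly
smoother
```

Comparing the two plots side by side is a quick way to judge how much of the curve's shape depends on the smoothing setting.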
Let’s try transforming the x-axis scale of GDP to deal with the bunched-up data. We can do this, for example, by adding a log scale. In this case, let’s add a log base 10 scale to the plot with `scale_x_log10()`.
p + geom_point() + geom_smooth(formula = y ~ x, method = "lm") + scale_x_log10()
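ggplot2 applies scale transformations before it computes statistics, so the line drawn above is a linear fit of life expectancy on log10 GDP. As a rough numerical check, we can fit both versions with `lm()` and compare them. The object names are illustrative.

```r
library(gapminder)

# Linear fit on the raw GDP scale
fit_raw <- lm(lifeExp ~ gdpPercap, data = gapminder)

# Linear fit on the log10 scale, mirroring what scale_x_log10() does
fit_log <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log10 = r2_log)
```

A higher R-squared for the log-scale fit is consistent with what the plot suggests: the relationship looks much closer to a straight line once GDP is logged.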
Let’s change the x-axis labels from scientific notation to actual dollars, using the `dollar` function from the scales package. Writing `scales::dollar` means we don’t need to load the package first.
p + geom_point() +
  geom_smooth(method = "loess", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP per Capita", y = "Life Expectancy")
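The `labels` argument accepts a function that turns break values into strings. To get a feel for what the dollar labeller does, here is a quick sketch that calls it directly on a few made-up numbers.

```r
library(scales)

# dollar() formats numbers as currency strings
dollar(c(500, 1000, 10000))

# label_dollar() builds a labelling function that behaves the same way
fmt <- label_dollar()
fmt(c(500, 1000, 10000))
```

Both calls should give "$500", "$1,000", and "$10,000"; the scale runs the same function on the axis breaks.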
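We can set properties like color, size, or transparency to fixed values. We do this by passing them as arguments to the geom (here `geom_point()` and `geom_smooth()`), outside of `aes()`. Only properties that should vary with a variable in the data go inside `aes()`.
p + geom_point(alpha = 0.2) +
  geom_smooth(color = "orange", se = FALSE, linewidth = 2, method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar)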
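The difference between setting a property and mapping it is easy to mix up, so here is a small sketch of both. The color "purple" and the object names are arbitrary choices for illustration.

```r
library(ggplot2)
library(gapminder)

p_demo <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Setting: a fixed value outside aes(); every point looks the same, no legend
set_plot <- p_demo + geom_point(colour = "purple", alpha = 0.2)

# Mapping: a variable inside aes(); ggplot2 adds a scale and a legend
map_plot <- p_demo + geom_point(aes(colour = continent), alpha = 0.2)

set_plot$layers[[1]]$aes_params  # fixed values stored on the layer
map_plot$layers[[1]]$mapping     # variable-to-aesthetic mapping
```

A typical mistake is writing `aes(colour = "purple")`, which treats "purple" as a one-value variable rather than as a color.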
Here’s a polished version of the plot, with alpha transparency, a dollar x-axis and log scale, a smoother, titles and labels, and a caption.
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = seq(20, 100, by = 10)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.")
## Introduction to Themes in ggplot2
Themes in `ggplot2` allow you to control the overall look of your plots. They can change the background color, gridlines, fonts, and many other elements. Using themes can help make your visualizations more readable and aesthetically pleasing. By default, `ggplot2` uses `theme_grey()`, but there are many other built-in themes you can use.
We’re going to work in the ggplot2 way: first, we set up a plot, then we add a different theme as a layer on top of the plot.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = seq(20, 100, by = 10)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_bw()
Or let’s try the minimal theme…
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.2) +
  scale_y_continuous(breaks = seq(20, 100, by = 10)) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal()
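Built-in themes can be adjusted further. The sketch below shows two options, with arbitrary settings chosen for illustration: tweaking individual elements with `theme()`, and changing the session default with `theme_set()`.

```r
library(ggplot2)
library(gapminder)

p_demo <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2) +
  scale_x_log10(labels = scales::dollar)

# Start from a built-in theme, then override individual elements
styled <- p_demo +
  theme_minimal(base_size = 14) +
  theme(panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"))
styled

# theme_set() changes the default for later plots and returns the old theme
old_theme <- theme_set(theme_bw())
p_demo                # now drawn with theme_bw()
theme_set(old_theme)  # restore the previous default
```

Setting the theme once at the top of a script saves repeating the same theme layer on every plot.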
Let’s try saving the plot at the width and height we want. By default, `ggsave()` saves the last plot that was displayed, which here is the `theme_minimal()` version above rather than the bare `p` object. You can save a different plot by passing it to the `plot` argument of `ggsave()`. But I never do this… The `Plots` folder should already exist in your working directory.
ggsave(filename = "Plots/economic growth and health.png", width = 8, height = 5)
ggsave(filename = "Plots/economic growth and health.pdf", width = 8, height = 5)
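If you would rather be explicit about what gets saved, a sketch along these lines stores the plot in an object and passes it to `ggsave()`. The temporary folder, the file name, and the `dpi` value are assumptions made for the example.

```r
library(ggplot2)
library(gapminder)

final_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.2) +
  scale_x_log10(labels = scales::dollar) +
  theme_minimal()

# A temporary folder stands in for the project's Plots folder here
out_dir <- file.path(tempdir(), "Plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "economic-growth-and-health.png")

# plot = makes the call independent of whichever plot was displayed last
ggsave(filename = out_file, plot = final_plot,
       width = 8, height = 5, dpi = 300)

file.exists(out_file)
```

Width and height are in inches by default, and the file type is picked from the extension.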
We can complicate the plot by separating out continents. We do this by mapping the color aesthetic to the continent variable. Notice that a legend now appears automatically, along with five smoothers, one for each continent.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = continent))
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal()
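We get one smoother per continent because the color mapping sits in `ggplot()` and is inherited by every layer. If you want colored points but a single overall trend line, one option is to map color only inside `geom_point()`. This is a sketch, and `one_trend` is an illustrative name.

```r
library(ggplot2)
library(gapminder)

one_trend <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(colour = continent), alpha = 0.2) +  # colour applies here only
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              colour = "black") +                     # no grouping: one line
  scale_x_log10(labels = scales::dollar) +
  theme_minimal()
one_trend

# Number of groups the smoother (layer 2) was fitted to
length(unique(layer_data(one_trend, 2)$group))
```

Where an aesthetic is mapped decides which layers are split into groups by it.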
## facet_wrap()
`facet_wrap()` is used to create a grid of plots based on a single categorical variable. Let’s use it to create separate plots for each continent.
p <- ggplot(data = gapminder %>% filter(year == 2007),
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  facet_wrap(~ continent) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal()
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
In this example:
- `facet_wrap(~ continent)` creates a separate plot for each continent.
- Each plot shares the same scales, making it easy to compare across continents.
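The warnings printed with the plot most likely come from a facet with too few points to estimate a confidence band. A quick way to check is to count the countries per continent in the 2007 data. This is a sketch, and `counts_2007` is an illustrative name.

```r
library(gapminder)
library(dplyr)

counts_2007 <- gapminder %>%
  filter(year == 2007) %>%
  count(continent)

counts_2007
```

Oceania has only two countries, so a straight line fits them exactly and leaves nothing to estimate uncertainty from. Adding `se = FALSE` to `geom_smooth()` should avoid the warnings, because no confidence band is then computed.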
You can customize the appearance of facets to improve readability and aesthetics.
Control the layout of the facets using the `nrow` or `ncol` parameters.
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar) +
  facet_wrap(~ continent, ncol = 2) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal()
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
Here, `ncol = 2` arranges the facets into two columns.
Sometimes, it’s useful to allow each facet to have its own scale for better visualization of individual trends.
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", formula = y ~ x) +
  scale_x_log10(labels = scales::dollar_format(accuracy = 1)) +
  facet_wrap(~ continent, scales = "free") +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data points are country-years",
       caption = "Source: Gapminder.") +
  theme_minimal()
## Warning in qt((1 - level)/2, df): NaNs produced
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
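`scales = "free"` allows each facet to have its own x and y scales.
Next, let’s look at a different dataset: `state.x77`, a built-in matrix of statistics for the 50 US states.
state.x77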
## Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20
## Alaska 365 6315 1.5 69.31 11.3 66.7 152
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
## California 21198 5114 1.1 71.71 10.3 62.6 20
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
## Delaware 579 4809 0.9 70.06 6.2 54.6 103
## Florida 8277 4815 1.3 70.66 10.7 52.6 11
## Georgia 4931 4091 2.0 68.54 13.9 40.6 60
## Hawaii 868 4963 1.9 73.60 6.2 61.9 0
## Idaho 813 4119 0.6 71.87 5.3 59.5 126
## Illinois 11197 5107 0.9 70.14 10.3 52.6 127
## Indiana 5313 4458 0.7 70.88 7.1 52.9 122
## Iowa 2861 4628 0.5 72.56 2.3 59.0 140
## Kansas 2280 4669 0.6 72.58 4.5 59.9 114
## Kentucky 3387 3712 1.6 70.10 10.6 38.5 95
## Louisiana 3806 3545 2.8 68.76 13.2 42.2 12
## Maine 1058 3694 0.7 70.39 2.7 54.7 161
## Maryland 4122 5299 0.9 70.22 8.5 52.3 101
## Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103
## Michigan 9111 4751 0.9 70.63 11.1 52.8 125
## Minnesota 3921 4675 0.6 72.96 2.3 57.6 160
## Mississippi 2341 3098 2.4 68.09 12.5 41.0 50
## Missouri 4767 4254 0.8 70.69 9.3 48.8 108
## Montana 746 4347 0.6 70.56 5.0 59.2 155
## Nebraska 1544 4508 0.6 72.60 2.9 59.3 139
## Nevada 590 5149 0.5 69.03 11.5 65.2 188
## New Hampshire 812 4281 0.7 71.23 3.3 57.6 174
## New Jersey 7333 5237 1.1 70.93 5.2 52.5 115
## New Mexico 1144 3601 2.2 70.32 9.7 55.2 120
## New York 18076 4903 1.4 70.55 10.9 52.7 82
## North Carolina 5441 3875 1.8 69.21 11.1 38.5 80
## North Dakota 637 5087 0.8 72.78 1.4 50.3 186
## Ohio 10735 4561 0.8 70.82 7.4 53.2 124
## Oklahoma 2715 3983 1.1 71.42 6.4 51.6 82
## Oregon 2284 4660 0.6 72.13 4.2 60.0 44
## Pennsylvania 11860 4449 1.0 70.43 6.1 50.2 126
## Rhode Island 931 4558 1.3 71.90 2.4 46.4 127
## South Carolina 2816 3635 2.3 67.96 11.6 37.8 65
## South Dakota 681 4167 0.5 72.08 1.7 53.3 172
## Tennessee 4173 3821 1.7 70.11 11.0 41.8 70
## Texas 12237 4188 2.2 70.90 12.2 47.4 35
## Utah 1203 4022 0.6 72.90 4.5 67.3 137
## Vermont 472 3907 0.6 71.64 5.5 57.1 168
## Virginia 4981 4701 1.4 70.08 9.5 47.8 85
## Washington 3559 4864 0.6 71.72 4.3 63.5 32
## West Virginia 1799 3617 1.4 69.48 6.7 41.6 100
## Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149
## Wyoming 376 4566 0.6 70.29 6.9 62.9 173
## Area
## Alabama 50708
## Alaska 566432
## Arizona 113417
## Arkansas 51945
## California 156361
## Colorado 103766
## Connecticut 4862
## Delaware 1982
## Florida 54090
## Georgia 58073
## Hawaii 6425
## Idaho 82677
## Illinois 55748
## Indiana 36097
## Iowa 55941
## Kansas 81787
## Kentucky 39650
## Louisiana 44930
## Maine 30920
## Maryland 9891
## Massachusetts 7826
## Michigan 56817
## Minnesota 79289
## Mississippi 47296
## Missouri 68995
## Montana 145587
## Nebraska 76483
## Nevada 109889
## New Hampshire 9027
## New Jersey 7521
## New Mexico 121412
## New York 47831
## North Carolina 48798
## North Dakota 69273
## Ohio 40975
## Oklahoma 68782
## Oregon 96184
## Pennsylvania 44966
## Rhode Island 1049
## South Carolina 30225
## South Dakota 75955
## Tennessee 41328
## Texas 262134
## Utah 82096
## Vermont 9267
## Virginia 39780
## Washington 66570
## West Virginia 24070
## Wisconsin 54464
## Wyoming 97203
states <- as_tibble(state.x77, rownames = "state") %>%
  janitor::clean_names()
states
## # A tibble: 50 × 9
## state population income illiteracy life_exp murder hs_grad frost area
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alabama 3615 3624 2.1 69.0 15.1 41.3 20 50708
## 2 Alaska 365 6315 1.5 69.3 11.3 66.7 152 566432
## 3 Arizona 2212 4530 1.8 70.6 7.8 58.1 15 113417
## 4 Arkansas 2110 3378 1.9 70.7 10.1 39.9 65 51945
## 5 California 21198 5114 1.1 71.7 10.3 62.6 20 156361
## 6 Colorado 2541 4884 0.7 72.1 6.8 63.9 166 103766
## 7 Connecticut 3100 5348 1.1 72.5 3.1 56 139 4862
## 8 Delaware 579 4809 0.9 70.1 6.2 54.6 103 1982
## 9 Florida 8277 4815 1.3 70.7 10.7 52.6 11 54090
## 10 Georgia 4931 4091 2 68.5 13.9 40.6 60 58073
## # ℹ 40 more rows
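The two steps above do different jobs: `as_tibble(..., rownames = "state")` moves the matrix row names into a regular column, and `clean_names()` rewrites the column names in snake case. Here is a small sketch of each step on its own; `states_demo` is an illustrative name.

```r
library(tibble)
library(janitor)

# Original column names, with spaces and capitals
colnames(state.x77)

# make_clean_names() shows what clean_names() does to each name
make_clean_names(colnames(state.x77))

# Row names become an ordinary column called "state"
states_demo <- as_tibble(state.x77, rownames = "state")
names(states_demo)
```

Names without spaces, such as `life_exp`, can be used inside `aes()` without backticks.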
ggplot(data = states, aes(x = income, y = life_exp)) +
  geom_text(aes(label = state), colour = "darkblue") +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Income vs. Life Expectancy (1977)",
    x = "Per-capita income (USD)",
    y = "Life expectancy (years)"
  ) +
  theme_minimal()
This is good, but the labels run together. We could make the text smaller, but we can also have collisions handled automatically using the `geom_text_repel()` function from the `ggrepel` package, which adjusts the label positions to avoid overlap.
ggplot(data = states, aes(income, life_exp)) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey40") +
  geom_text_repel(aes(label = state), colour = "darkblue", size = 3) +
  labs(
    title = "Income vs. Life Expectancy (1977)",
    subtitle = "Labels repel to avoid overlap (full state names)",
    x = "Per-capita income (USD)",
    y = "Life expectancy (years)"
  ) +
  theme_minimal()
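Even with repelling, fifty labels make for a busy plot. Another option is to draw every state as a point and label only a subset. The sketch below labels the five highest and five lowest life expectancies; that cutoff and the names `states_demo` and `extremes` are arbitrary choices for illustration.

```r
library(ggplot2)
library(ggrepel)
library(dplyr)
library(tibble)

states_demo <- as_tibble(state.x77, rownames = "state") %>%
  janitor::clean_names()

# Keep only the states we want to label
extremes <- bind_rows(
  slice_max(states_demo, life_exp, n = 5, with_ties = FALSE),
  slice_min(states_demo, life_exp, n = 5, with_ties = FALSE)
)

labelled <- ggplot(states_demo, aes(income, life_exp)) +
  geom_point(colour = "grey60") +
  geom_text_repel(data = extremes, aes(label = state),
                  colour = "darkblue", size = 3) +
  theme_minimal()
labelled
```

Passing a different `data` argument to a single layer is a general ggplot2 pattern for highlighting a subset of the observations.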