This document was composed from Dr. Snopkowski’s ANTH 504 Week 3 lecture and from Introduction to Data Science: Data analysis and prediction algorithms with R by Rafael A.Irizarry.
There are 3 components of our ggplot2 function: 1. Data 2. Geometry – type of chart (scatterplot , barchart, histogram) 3. Aesthetic (aes) mapping – x-axis, y- axis, color
library(dslabs)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data(murders)
#ggplot(data=murders)
# OR
murders %>% ggplot()
We can associate a plot object like this:
# p <- ggplot(data=murders)
# OR
# p <- ggplot(murders)
# OR
p <- murders %>% ggplot()
What is the class of p?
class(p)
## [1] "gg" "ggplot"
What happens when you print p?
p
In ggplot2, we create graphs by adding layers:
Data %>% ggplot() + LAYER1 + LAYER2 + LAYER3 + …+ LAYER n
Usually, first layer defines the geometry. In this case we want to make
a scatter plot. Let’s look at our Data Visualization (cheat) sheet and
try to find the code that represents a scatter
plot.geom_point()
For geom_point() to work, we need to tell R which data we want to plot. If we look at the help file for geom_point(), and look at the “Aesthetics” section, we’ll see the arguments for this function
murders %>% ggplot() +
geom_point(aes(population, total))
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
murders %>% ggplot() +
geom_point(aes(population/10^6, total))
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
Note: aes is required Within aes you don’t need to refer to the dataset name.
We want to add a label for each point to identify the state. Looking at our cheat sheet, what function do we need to add labels (or plot text)? geom_text() with argments: x, y, label So, we can add a layer
murders %>% ggplot() + geom_point(aes(population/10^6,
total)) + geom_text(aes(population/10^6, total, label
= abb))
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
geom_point() makes the
dot.murders %>% ggplot() + geom_text(aes(population/10^6, total,
label = abb))
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
We might want to have dots and move the label.
Each geometry function has many arguments that you can adjust. For example, in the geom_point() function, you can adjust the size of the points.
murders %>% ggplot() + geom_point(aes(population/10^6, total),
size =4) + geom_text(aes(population/10^6, total, label = abb))
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
The nudge_x argument moves the text slightly to the right or left
murders %>% ggplot() + geom_point(aes(population/10^6, total),
size =1) + geom_text(aes(population/10^6, total, label = abb),
nudge_x=1)
In the previous slide, we included aes(population/10^6, total) twice. To avoid this, we can set a global aesthetic mapping so that all geometries.
murders %>% ggplot(aes(population/10^6, total, label=abb)) +
geom_point(size=1) +
geom_text(nudge_x=1.5)
murders
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
## 7 Connecticut CT Northeast 3574097 97
## 8 Delaware DE South 897934 38
## 9 District of Columbia DC South 601723 99
## 10 Florida FL South 19687653 669
## 11 Georgia GA South 9920000 376
## 12 Hawaii HI West 1360301 7
## 13 Idaho ID West 1567582 12
## 14 Illinois IL North Central 12830632 364
## 15 Indiana IN North Central 6483802 142
## 16 Iowa IA North Central 3046355 21
## 17 Kansas KS North Central 2853118 63
## 18 Kentucky KY South 4339367 116
## 19 Louisiana LA South 4533372 351
## 20 Maine ME Northeast 1328361 11
## 21 Maryland MD South 5773552 293
## 22 Massachusetts MA Northeast 6547629 118
## 23 Michigan MI North Central 9883640 413
## 24 Minnesota MN North Central 5303925 53
## 25 Mississippi MS South 2967297 120
## 26 Missouri MO North Central 5988927 321
## 27 Montana MT West 989415 12
## 28 Nebraska NE North Central 1826341 32
## 29 Nevada NV West 2700551 84
## 30 New Hampshire NH Northeast 1316470 5
## 31 New Jersey NJ Northeast 8791894 246
## 32 New Mexico NM West 2059179 67
## 33 New York NY Northeast 19378102 517
## 34 North Carolina NC South 9535483 286
## 35 North Dakota ND North Central 672591 4
## 36 Ohio OH North Central 11536504 310
## 37 Oklahoma OK South 3751351 111
## 38 Oregon OR West 3831074 36
## 39 Pennsylvania PA Northeast 12702379 457
## 40 Rhode Island RI Northeast 1052567 16
## 41 South Carolina SC South 4625364 207
## 42 South Dakota SD North Central 814180 8
## 43 Tennessee TN South 6346105 219
## 44 Texas TX South 25145561 805
## 45 Utah UT West 2763885 22
## 46 Vermont VT Northeast 625741 2
## 47 Virginia VA South 8001024 250
## 48 Washington WA West 6724540 93
## 49 West Virginia WV South 1852994 27
## 50 Wisconsin WI North Central 5686986 97
## 51 Wyoming WY West 563626 5
Label can go in global or local
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size=1) +
geom_text(nudge_x = 1.5)
This allows for multiple types a graphs.
If we want to override the global mapping by defining a new mapping within each layer. These local definitions override the global.
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size = 3) + geom_text(aes(x=10, y=800, label =
"Label at 10,800"))
Our desired scale is log-scale. This needs to be changed through a
scales layer. Take a look at your cheat sheet to see if you can see how
to adjust this. There are 2 that may be useful:
scale_x_log10() and scale_y_log10() Or
scale_x_continuous() which lets us transform a value along
with setting labels, position of the axis, specifying a secondary
axis
p <- murders %>% ggplot(aes(population/10^6, total, label = abb))
p + geom_point(size=1) + geom_text(nudge_x = 0.05) + scale_x_log10() +
scale_y_log10()
### LABELS AND TITLES Looking at our cheat sheet under “Labels” we can
see that we can add labels / titles using labs
p + geom_point(size = 1) + geom_text(nudge_x = 0.05) +
scale_x_log10() + scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010")
We can change the color of the points using the col argument in the geom_point function Let’s store our previous layers into p
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010")
p
Now let’s add geom_point, using a color (or col) argument
p + geom_point(size = 1, color = "blue")
What we want is a different color for each region. To do this, we can assign color = region, but since this is an aesthetic mapping, we need to use aes
p + geom_point(aes(col=region), size=2)
ggplot2 automatically will add a legend. To remove it, use:
p + geom_point(aes(col = region), size = 2, show.legend=FALSE)
### Annotation, Shapes, and Adjustments We often want to add “stuff” to
our figure other than the data. For instance, lines, shaded areas,
etc.
r <- murders %>% summarize(rate = sum(total) / sum(population)*10^6)
r
## rate
## 1 30.34555
class(r)
## [1] "data.frame"
Let’s add a line representing the average US murder rate (per 1,000,000 people – since that’s our x-axis scale). To compute this value:
r <- murders %>%
summarize(rate = sum(total) / sum(population)*10^6) %>%
pull(rate)
class(r)
## [1] "numeric"
The line of the rate is defined as: y=rx, but since we are on log scale, we get: log(y) = log(r) + log(x), so our line has an intercept of log(r) and a slope of 1. To add a line, we use geom_abline (ab represents the a(intercept) and b(slope). The default is intercept = 0 and slope = 1, so in our case, we only have to define the intercept.
murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010") +
geom_point(aes(col=region), size=2) +
geom_abline(intercept = log10(r))
lty = linetype; 0=blank, 1=solid (default), 2=dashed, 3=dotted,
4=dotdash, 5=longdash, 6=twodash
We can change the line type to a color, and we can draw it first so it doesn’t go over the points.
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_point(aes(col=region), size=2) +
geom_text(nudge_x = 0.05) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010") +
geom_abline(intercept = log10(r), lty=2, color ="darkgrey")
p
There are add-on packages that further augment the ggplot2 package. We will use: ggthemes #adds a variety of themes ggrepel #positions labels
#install.packages("ggthemes")
library(ggthemes)
p + theme_economist()
#install.packages("ggrepel")
library(ggrepel)
p <- murders %>% ggplot(aes(population/10^6, total, label = abb)) +
geom_point(aes(col=region), size=2) +
geom_text_repel() +
scale_x_log10() +
scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010") +
geom_abline(intercept = log10(r), lty=2, color ="darkgrey") +
theme_economist()
p
Change geom_text with geom_text_repel which moves the labels so that don’t overlap Provides details on themes available: https://www.datanovia.com/en/blog/gg plot-themes-gallery/#use-ggthemes
murders %>% ggplot(aes(population/10^6, total, label= abb)) +
geom_point(aes(col=region), size=3) +
scale_x_log10() +
scale_y_log10() +
labs(x = "Populations in millions (log scale)",
y = "Total number of murders (log scale)",
title = "US Gun Murders in 2010",
col="Region") +
geom_abline(intercept=log10(r), lty=2, color="darkgrey") +
geom_text_repel() +
theme_economist()
While ggplot is great for creating beautiful figures, sometimes we want to make quick plots. GGplot also has a qplot() function
qplot(log10(murders$population), log10(murders$total))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
#install.packages('gridExtra')
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#dplyer::combine
p1 <- murders %>%
mutate(rate = total/population*10^5) %>%
filter(population < 2*10^6) %>%
ggplot(aes(population/10^6,
rate, label = abb)) +
geom_text() +
ggtitle("Small States")
p1
p2 <- murders %>%
mutate(rate = total/population*10^5) %>%
filter(population >= 2*10^6) %>%
ggplot(aes(population/10^6,
rate, label = abb)) +
geom_text() +
ggtitle("Large States")
p2
Place the graphs side by side with ncol().
grid.arrange(p1, p2, ncol = 2)
Defalt is in rows.
grid.arrange(p1, p2)