We will learn about the following topics in this section.
If you have installed the tidyverse, you already have ggplot2. If not, install it with:
install.packages("tidyverse", dependencies = TRUE)
Loading tidyverse will automatically get ggplot2 ready.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
For historical reasons, R has at least three graphical systems that are commonly used.
ggplot2 has become the most popular of these, so we will focus on it. For a side-by-side comparison of ggplot2 and base graphics, see here.
ggplot2 can be thought of as a mini-language (domain-specific language) within the R language. It is an R implementation of Wilkinson’s Grammar of Graphics book. A Layered Grammar of Graphics is a paper on ggplot2’s design.
Conceptually, this means that a statistical graphic is a mapping from variables to aesthetic attributes (x axis value, y axis value, color, shape, size) of geometric objects (points, lines, bars), optionally after a statistical transformation of the raw data. In ggplot2, you describe what you want to get as a result (declarative language) rather than state how to draw the graphic (imperative language).
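As a minimal sketch of this declarative style, the snippet below uses the built-in mtcars data (an assumption made so the example runs without the Framingham file). It declares a mapping first and adds a geom afterwards; nothing is drawn until the object is printed.

```r
library(ggplot2)

## Declare the data and the variable-to-aesthetic mapping. No geom yet.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl)))

## Adding a geometric object says how each row should be represented.
p_points <- p + geom_point()

## The plot is an ordinary R object that stores its layers.
length(p_points$layers)

## Printing the object is what actually draws the graphic.
## print(p_points)
```

The same mapping can be reused with a different geom (for example geom_line()) without restating how to draw anything.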
ggplot2 generally works better with long-format data. In the tidyverse, the gather() function transforms wide-format data into long-format data (newer versions of tidyr also offer pivot_longer() for the same task). Here we combine the systolic and diastolic blood pressures into one BP variable for easier plotting.
## Load CSV file
framingham <- read_csv("./framingham.csv") %>%
rename(MALE = SEX)
## Parsed with column specification:
## cols(
## .default = col_integer(),
## SYSBP = col_double(),
## DIABP = col_double(),
## BMI = col_double(),
## TIMEAP = col_double(),
## TIMEMI = col_double(),
## TIMEMIFC = col_double(),
## TIMECHD = col_double(),
## TIMESTRK = col_double(),
## TIMECVD = col_double(),
## TIMEDTH = col_double(),
## TIMEHYP = col_double()
## )
## See spec(...) for full column specifications.
## Create a long format dataset.
fram_bp_long <- framingham %>%
## Add ID variable. Sequence of 1,2,...,n along AGE variable.
mutate(ID = seq_along(AGE)) %>%
## Change variable position for better view.
select(ID, MALE, AGE, PREVCHD, SYSBP, DIABP) %>%
## SYSBP and DIABP are transformed into one BP column, labelled by PHASE.
gather(key = PHASE, value = BP, SYSBP, DIABP)
fram_bp_long
## # A tibble: 8,868 x 6
## ID MALE AGE PREVCHD PHASE BP
## <int> <int> <int> <int> <chr> <dbl>
## 1 1 1 39 0 SYSBP 106.0
## 2 2 0 46 0 SYSBP 121.0
## 3 3 1 48 0 SYSBP 127.5
## 4 4 0 61 0 SYSBP 150.0
## 5 5 0 46 0 SYSBP 130.0
## 6 6 0 43 0 SYSBP 180.0
## 7 7 0 63 0 SYSBP 138.0
## 8 8 0 45 0 SYSBP 100.0
## 9 9 1 52 0 SYSBP 141.5
## 10 10 1 43 0 SYSBP 162.0
## # ... with 8,858 more rows
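In current tidyr, gather() is marked as superseded and pivot_longer() is the suggested replacement. The sketch below shows the equivalent reshaping on a tiny made-up table; the ID and SYSBP values echo the first rows above, while the DIABP values are invented for illustration only.

```r
library(tibble)
library(tidyr)

## Toy wide-format data. DIABP values are made up for this example.
bp_wide <- tibble(ID    = 1:3,
                  SYSBP = c(106, 121, 127.5),
                  DIABP = c(70, 81, 80))

## Stack SYSBP and DIABP into one BP column, labelled by PHASE.
bp_long <- pivot_longer(bp_wide,
                        cols      = c(SYSBP, DIABP),
                        names_to  = "PHASE",
                        values_to = "BP")
bp_long
```

Note that pivot_longer() keeps the rows of each ID together, whereas gather() stacks all SYSBP rows before all DIABP rows. The content is the same; only the row order differs.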
## Map AGE to X axis and BP to Y axis. Use point geometry.
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP)) +
geom_point()
## Also map PHASE to color.
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point()
## Also map MALE to shape.
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE, shape = factor(MALE))) +
geom_point()
Different geometric objects (Geoms) draw different kinds of graphs. Geoms come with sensible default statistical transformations (Stats) and positions, so you do not need to override them unless you are doing something non-standard. To find out more, see examples.
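One way to check which Stat and position a geom uses by default is to look at the layer object it creates. The helper name layer_defaults() below is hypothetical, and it relies on the stat and position fields of the layer object, which are implementation details that may change between ggplot2 versions.

```r
library(ggplot2)

## Hypothetical helper: report the class names of a layer's stat and position.
layer_defaults <- function(layer) {
  c(stat     = class(layer$stat)[1],
    position = class(layer$position)[1])
}

layer_defaults(geom_point())
layer_defaults(geom_histogram())
layer_defaults(geom_bar())
layer_defaults(geom_pointrange())
```

The documented defaults are also listed in the usage section of each help page, for example ?geom_bar.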
## histogram
ggplot(data = fram_bp_long, mapping = aes(x = BP)) +
## default stat = "bin", position = "stack"
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## density
ggplot(data = fram_bp_long, mapping = aes(x = BP)) +
## default stat = "density", position = "identity"
geom_density()
## Use factor() to create a categorical variable
ggplot(data = framingham, mapping = aes(x = factor(bmicat))) +
## default stat = "count", position = "stack"
geom_bar()
## scatter plot
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP)) +
## default stat = "identity", position = "identity"
geom_point()
## Geoms can be layered.
## scatter plot with rug plots in the margins
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP)) +
## default stat = "identity", position = "identity"
geom_point() +
## default stat = "identity", position = "identity"
geom_rug()
## scatter plot and smoothing plot (mean tendency)
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP)) +
## default stat = "identity", position = "identity"
geom_point() +
## default stat = "smooth", position = "identity"
geom_smooth(method = "lm")
## geom_smooth takes color as grouping
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
## default stat = "identity", position = "identity"
geom_point() +
## default stat = "smooth", position = "identity"
geom_smooth(method = "lm")
## scatter plot and quantile plot
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP)) +
## default stat = "identity", position = "identity"
geom_point() +
## default stat = "quantile", position = "identity"
geom_quantile(quantiles = seq(from = 0.10, to = 0.90, by = 0.1)) +
## Another quantile plot to emphasize 0.5
geom_quantile(quantiles = 0.5, size = 2)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## boxplot
ggplot(data = framingham, mapping = aes(x = factor(bmicat), y = SYSBP)) +
## default stat = "boxplot", position = "dodge"
geom_boxplot()
## violin plot
ggplot(data = framingham, mapping = aes(x = factor(bmicat), y = SYSBP)) +
## default stat = "ydensity", position = "dodge"
geom_violin()
## point range to plot mean, min, max by BMI categories
framingham %>%
group_by(bmicat) %>%
summarize(mean = mean(BMI, na.rm = TRUE),
min = min(BMI, na.rm = TRUE),
max = max(BMI, na.rm = TRUE)) %>%
ggplot(mapping = aes(x = bmicat, y = mean, ymax = max, ymin = min)) +
## default stat = "identity", position = "identity"
geom_pointrange()
## Warning: Removed 1 rows containing missing values (geom_pointrange).
Each aesthetic has a corresponding scale_* function for manipulation.
## bmicat is left numeric here, so the X axis is a continuous scale
ggplot(data = framingham, mapping = aes(x = bmicat)) +
geom_bar()
## Warning: Removed 19 rows containing non-finite values (stat_count).
## Limit X and Y axis range. Rows outside the limits are dropped (see warnings).
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60)) +
scale_y_continuous(limits = c(0,NA))
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## Limiting X and Y range using coord_cartesian() is possible.
## No data point trimming is performed
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
coord_cartesian(xlim = c(40,60), ylim = c(0,max(fram_bp_long$BP)))
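The difference between scale limits and coord_cartesian() can be checked on a small synthetic dataset (made up here so the example does not depend on the Framingham file). This sketch fits a straight line to curved data and compares the x range over which the fitted line is computed under each approach.

```r
library(ggplot2)

## Synthetic data: a curved relationship, invented for illustration.
d <- data.frame(x = 1:100, y = (1:100)^2)

base <- ggplot(d, aes(x = x, y = y)) +
  geom_smooth(method = "lm", formula = y ~ x)

## Scale limits: values outside 1..50 are set to NA before the fit.
p_scale <- base + scale_x_continuous(limits = c(1, 50))

## Coordinate limits: the fit uses every row, only the view is cropped.
p_coord <- base + coord_cartesian(xlim = c(1, 50))

## layer_data() returns the data computed for a layer (here the fitted line).
fit_scale <- suppressWarnings(layer_data(p_scale, 1))
fit_coord <- layer_data(p_coord, 1)

range(fit_scale$x)
range(fit_coord$x)
```

Because the scale-limited fit never sees the rows with x above 50, its line can differ noticeably from the one fitted to the full data. Use coord_cartesian() when the goal is only to zoom.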
## Place tick marks at 40, 50, 60 with breaks
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA))
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## Sometimes flipping and using log is useful
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
## reverse X axis
scale_x_reverse() +
## Log10 Y axis without transforming the original variable
## Breaks are specified on the raw variable
scale_y_log10(breaks = c(50, 90, 100, 140, 200, 250))
The most useful scales to manipulate may be the color and fill scales. The color scale controls the colors of lines and points. The fill scale controls the colors of areas.
## Default fill for discrete bmicat
ggplot(data = framingham, mapping = aes(x = factor(bmicat), fill = factor(bmicat))) +
geom_bar()
## scale_fill_brewer
## See ?scale_fill_brewer for palettes available.
ggplot(data = framingham, mapping = aes(x = factor(bmicat), fill = factor(bmicat))) +
geom_bar() +
scale_fill_brewer(palette = "Accent")
## Manual manipulation (hard to do it nicely)
ggplot(data = framingham, mapping = aes(x = factor(bmicat), fill = factor(bmicat))) +
geom_bar() +
scale_fill_manual(values = c("1" = "red", "2" = "blue", "3" = "yellow", "4" = "purple"))
## Default color for continuous AGE
ggplot(data = framingham, mapping = aes(x = AGE, y = BMI, color = AGE)) +
geom_point()
## Warning: Removed 19 rows containing missing values (geom_point).
## scale_color_distiller
ggplot(data = framingham, mapping = aes(x = AGE, y = BMI, color = AGE)) +
geom_point() +
scale_color_distiller(palette = "Reds")
## Warning: Removed 19 rows containing missing values (geom_point).
## scale_color_gradient2
ggplot(data = framingham, mapping = aes(x = AGE, y = BMI, color = BMI)) +
geom_point() +
scale_color_gradient2(low = "blue", high = "red", mid = "violet", midpoint = 24)
## Warning: Removed 19 rows containing missing values (geom_point).
labs() can be used to label various aspects of the graphic.
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
labs(title = "Title",
subtitle = "Subtitle",
x = "X scale label",
y = "Y scale label",
color = "Color scale label")
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
Sometimes multi-panel plots are more informative. Here it is probably better to plot systolic and diastolic BP in different panels.
## Column by PHASE, no row variable
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(. ~ PHASE)
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## Columns by PHASE, rows by MALE
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(MALE ~ PHASE)
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
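When there is only one faceting variable, facet_wrap() is a related ggplot2 function that wraps the panels into rows and columns. It is not used in the examples above; the sketch below shows it on the built-in mtcars data as an illustration.

```r
library(ggplot2)

## One panel per number of cylinders (4, 6, 8), wrapped into a grid.
p_wrap <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)

## Each row of the layer data records which panel it belongs to.
panel_count <- nlevels(layer_data(p_wrap, 1)$PANEL)
panel_count
```

facet_grid() is usually the better fit when both a row and a column variable are present, as in the MALE ~ PHASE example.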
Themes tweak the details of how graphs look. I do not like the default gray background, so I usually use theme_bw().
## Cleaner plot with theme_bw()
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(MALE ~ PHASE) +
theme_bw()
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## theme_minimal()
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(MALE ~ PHASE) +
theme_minimal()
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## theme_classic()
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(MALE ~ PHASE) +
theme_classic()
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
## theme_void()
ggplot(data = fram_bp_long, mapping = aes(x = AGE, y = BP, color = PHASE)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_continuous(limits = c(40,60), breaks = c(40,50,60)) +
scale_y_continuous(limits = c(0,NA)) +
facet_grid(MALE ~ PHASE) +
theme_void()
## Warning: Removed 2442 rows containing non-finite values (stat_smooth).
## Warning: Removed 2442 rows containing missing values (geom_point).
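Individual theme elements can also be adjusted with theme() on top of a complete theme. The following sketch, again on the built-in mtcars data as a stand-in, moves the legend below the plot after applying theme_bw().

```r
library(ggplot2)

p_theme <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  ## Start from a complete theme ...
  theme_bw() +
  ## ... then override a single element.
  theme(legend.position = "bottom")

p_theme$theme$legend.position
```

Order matters: a complete theme such as theme_bw() added after theme() would replace the earlier tweak.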
ggplot2 is extensible. Some people have extended it to plot maps (ggmap) and networks (ggraph).
## Map
library(ggmap)
get_stamenmap(bbox = c(left = -74, bottom = 41.5, right = -69.5, top = 43),
zoom = 8, maptype = "toner-lite") %>%
ggmap()
## Map from URL : http://tile.stamen.com/toner-lite/8/75/94.png
## Map from URL : http://tile.stamen.com/toner-lite/8/76/94.png
## Map from URL : http://tile.stamen.com/toner-lite/8/77/94.png
## Map from URL : http://tile.stamen.com/toner-lite/8/78/94.png
## Map from URL : http://tile.stamen.com/toner-lite/8/75/95.png
## Map from URL : http://tile.stamen.com/toner-lite/8/76/95.png
## Map from URL : http://tile.stamen.com/toner-lite/8/77/95.png
## Map from URL : http://tile.stamen.com/toner-lite/8/78/95.png
## Networks
library(ggraph)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## %>%, as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## %>%, compose, simplify
## The following objects are masked from 'package:tidyr':
##
## %>%, crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
# Create graph of highschool friendships
graph <- graph_from_data_frame(as_tibble(highschool))
V(graph)$Popularity <- degree(graph, mode = 'in')
# plot using ggraph
ggraph(graph, layout = 'kk') +
geom_edge_fan(aes(alpha = ..index..), show.legend = FALSE) +
geom_node_point(aes(size = Popularity)) +
facet_edges(~year) +
theme_graph(foreground = 'steelblue', fg_text_colour = 'white')