Histograms display continuous single-variable data, binned into categories or “bins”. How we create those bins is up to us.
We’re switching to the midwest data set now, which contains information on a bunch of counties in the Midwest.
head(midwest,n=20)
## # A tibble: 20 × 28
## PID county state area poptotal popdensity popwhite popblack popamerindian
## <int> <chr> <chr> <dbl> <int> <dbl> <int> <int> <int>
## 1 561 ADAMS IL 0.052 66090 1271. 63917 1702 98
## 2 562 ALEXAN… IL 0.014 10626 759 7054 3496 19
## 3 563 BOND IL 0.022 14991 681. 14477 429 35
## 4 564 BOONE IL 0.017 30806 1812. 29344 127 46
## 5 565 BROWN IL 0.018 5836 324. 5264 547 14
## 6 566 BUREAU IL 0.05 35688 714. 35157 50 65
## 7 567 CALHOUN IL 0.017 5322 313. 5298 1 8
## 8 568 CARROLL IL 0.027 16805 622. 16519 111 30
## 9 569 CASS IL 0.024 13437 560. 13384 16 8
## 10 570 CHAMPA… IL 0.058 173025 2983. 146506 16559 331
## 11 571 CHRIST… IL 0.042 34418 819. 34176 82 51
## 12 572 CLARK IL 0.03 15921 531. 15842 10 26
## 13 573 CLAY IL 0.028 14460 516. 14403 4 17
## 14 574 CLINTON IL 0.029 33944 1170. 32688 1021 48
## 15 575 COLES IL 0.03 51644 1721. 50177 925 92
## 16 576 COOK IL 0.058 5105067 88018. 3204947 1317147 10289
## 17 577 CRAWFO… IL 0.026 19464 749. 19300 63 34
## 18 578 CUMBER… IL 0.02 10670 534. 10627 5 6
## 19 579 DE KALB IL 0.038 77932 2051. 72968 2069 123
## 20 580 DE WITT IL 0.023 16516 718. 16387 25 37
## # ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
## # percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## # popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## # poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## # percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## # percelderlypoverty <dbl>, inmetro <int>, category <chr>
ggplot(data = midwest,
mapping = aes(x = area))
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_histogram() + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If we don’t set the number of bins, ggplot2 falls back to a default of 30, as the message above notes. We can choose our own with the bins argument.
p + geom_histogram(bins=10)
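The bins argument fixes how many bins there are, while binwidth fixes how wide each one is in data units. As a rough sketch (the value 0.01 is only an illustrative choice, not a recommendation), we can build both versions and use layer_data() to inspect the binned counts behind each plot:

```r
library(ggplot2)

p <- ggplot(data = midwest, mapping = aes(x = area))

# Fix the number of bins ...
h_bins <- p + geom_histogram(bins = 10)
# ... or fix the width of each bin in units of `area` (0.01 is an arbitrary choice)
h_width <- p + geom_histogram(binwidth = 0.01)

# layer_data() returns the computed data (one row per bin) for a layer
bins_data <- layer_data(h_bins)
width_data <- layer_data(h_width)

nrow(bins_data)       # number of bins when bins = 10
nrow(width_data)      # number of bins implied by binwidth = 0.01
sum(bins_data$count)  # each county is counted in exactly one bin
```

Either way the same counties are being counted; only the grouping changes, which is why the shape of a histogram can shift noticeably with the bin choice.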
We can compare histograms, too. Let’s subset by two states, OH and WI, using the %in% operator.
# Base R version
oh_wi <- subset(midwest, state %in% c("OH", "WI"))
# Equivalent dplyr version (needs library(dplyr) for filter() and %>%)
oh_wi <- midwest %>% filter(state %in% c("OH", "WI"))
p <- ggplot(data = oh_wi,
mapping = aes(x = area, color = state, fill = state))
p + geom_histogram(alpha = 0.2, bins = 20)
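If the %in% operator used in the subsetting step is unfamiliar, here is a tiny standalone sketch of what it returns. The states vector is made up for illustration:

```r
# A small made-up vector of state abbreviations
states <- c("IL", "OH", "WI", "MI")

# %in% returns one TRUE/FALSE per element on the left-hand side
keep <- states %in% c("OH", "WI")
keep          # FALSE  TRUE  TRUE FALSE

# Indexing with that logical vector keeps only the matching elements
states[keep]  # "OH" "WI"
```

subset() and filter() use the same kind of logical vector to decide which rows of midwest to keep.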
Try: What about a small multiples approach? Notice how facet_wrap() gives every panel the same x and y scales. That’s usually, but not always, what you want.
p + geom_histogram(alpha = 0.4, bins = 20) +
facet_wrap(~state)
A density plot is an alternative to a histogram when plotting continuous data. geom_density() will calculate a kernel density estimate of the underlying distribution of the data.
p <- ggplot(data = midwest,
mapping = aes(x = area))
p + geom_density()
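To get a feel for what a kernel density estimate is, the sketch below computes one directly with base R's density() and then uses the adjust argument of geom_density() to change the amount of smoothing. The adjust values 2 and 0.5 are arbitrary, and the base R curve is only roughly what ggplot2 draws, since the two can treat the ends of the data range differently:

```r
library(ggplot2)

# Base R kernel density estimate of county area (Gaussian kernel, default bandwidth)
kde <- density(midwest$area)
kde$bw  # the bandwidth that was picked automatically

# A density curve should integrate to roughly 1 over its evaluation grid
step <- diff(kde$x)[1]
sum(kde$y) * step

# In ggplot2, `adjust` multiplies the default bandwidth:
# values above 1 give a smoother curve, values below 1 a wigglier one
p <- ggplot(data = midwest, mapping = aes(x = area))
p_smooth <- p + geom_density(adjust = 2)
p_wiggly <- p + geom_density(adjust = 0.5)
p_smooth
p_wiggly
```

Bandwidth plays the same role for a density plot that bin width plays for a histogram, so it is worth trying a couple of values before settling on one.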
How about breaking out by state? Let’s map color and fill to state, and map x to the variable we want to plot, area.
p <- ggplot(data = midwest,
mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3) +
labs(x = "Area",
y = "Density",
title = "Area by State") + facet_wrap(~state)
# Same plot, but scales = "free_x" lets each panel choose its own x-axis range
p <- ggplot(data = midwest,
mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3) +
labs(x = "Area",
y = "Density",
title = "Area by State") + facet_wrap(~state, scales="free_x")
We can use different color palettes to make our plots more visually appealing. The scale_fill_brewer() and scale_color_brewer() functions use palettes from the RColorBrewer package; you can preview all of them by running RColorBrewer::display.brewer.all().
p <- ggplot(data = midwest,
mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.7) +
labs(x = "Area",
y = "Density",
title = "Area by State") + facet_wrap(~state) +
scale_fill_brewer(palette = "Blues") +
scale_color_brewer(palette = "Blues")
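A palette name such as "Blues" stands for a small set of hex colors. As a quick sketch, assuming the RColorBrewer package is installed, we can look up how many colors the palette offers and pull out five of them, one per state in midwest:

```r
library(RColorBrewer)

# Maximum number of distinct colors the "Blues" palette provides
brewer.pal.info["Blues", "maxcolors"]

# Five shades of blue as hex color codes, from light to dark
blues <- brewer.pal(n = 5, name = "Blues")
blues
```

Sequential palettes like "Blues" imply an ordering, so for unordered categories such as states a qualitative palette (for example "Set2") may be the better fit.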
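# Dot plot: each dot is one county; dots are stacked within bins along x,
# so the tick values on the y axis are not meaningful
p <- ggplot(data = midwest,
mapping = aes(x = area, fill = state, color = state))
p + geom_dotplot(alpha = 0.7) +
labs(x = "Area",
y = "Count",
title = "Area by State") + facet_wrap(~state) + theme_bw()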
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.
Box plots provide a visual summary of the distribution of a dataset. They display the median, quartiles, and potential outliers. The box represents the interquartile range (IQR), and the line inside the box indicates the median. Whiskers extend to the smallest and largest values within 1.5 * IQR from the lower and upper quartiles, respectively. Points outside this range are considered outliers and plotted individually.
# Create a box plot
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
geom_boxplot()
p_box
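The pieces of a box plot can be computed by hand. The sketch below does this for a single class ("suv", picked arbitrarily) using quantile(). Other tools, such as boxplot.stats(), define the quartiles slightly differently, so their numbers may not match exactly:

```r
library(ggplot2)  # for the mpg data

# Highway mileage for one class of car
hwy_suv <- mpg$hwy[mpg$class == "suv"]

# Quartiles and interquartile range
q <- quantile(hwy_suv, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points beyond these limits are drawn individually as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
inside <- hwy_suv[hwy_suv >= lower_fence & hwy_suv <= upper_fence]
whiskers <- range(inside)
outliers <- hwy_suv[hwy_suv < lower_fence | hwy_suv > upper_fence]

q         # lower quartile, median, upper quartile
whiskers  # where the whiskers end
outliers  # values plotted as individual points
```

Note that the whiskers stop at actual observations, not at the fences themselves, which is why they are often shorter than 1.5 * IQR.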
Jitter plots add random noise to the position of data points to prevent overplotting, especially when dealing with categorical data. This technique helps in visualizing the distribution and spread of individual data points. Jitter plots are useful when there are many overlapping points, as they allow you to see the density and clustering of data.
# Create a jitter plot
p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
geom_jitter(width = 0.2, height = 0)
p_jitter
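Because the noise is random, a jitter plot looks slightly different every time it is drawn. One way to make it reproducible, sketched here, is to use geom_point() with position_jitter() and a seed (the seed value 42 is arbitrary):

```r
library(ggplot2)

# position_jitter() with a seed gives the same random offsets on every render
p_jitter_seeded <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 42))
p_jitter_seeded

# Inspect the jittered coordinates that ggplot2 will actually draw
jittered <- layer_data(p_jitter_seeded)
offsets <- as.numeric(jittered$x) - round(as.numeric(jittered$x))
range(offsets)  # horizontal noise stays within +/- width
```

Setting height = 0 leaves the y values untouched, which matters here because hwy is the quantity being measured; only the position within each class is randomized.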
Violin plots combine aspects of box plots and density plots. They show the distribution of the data across different categories by displaying a kernel density estimate, which provides a smooth approximation of the data distribution. The shape of the violin indicates the density of the data at different values, with wider sections representing higher density. Many violin plot implementations also include a marker for the median and a box representing the interquartile range; ggplot2's geom_violin() draws only the density outline by default.
# Example data
data(mpg)
# Violin plot
ggplot(mpg, aes(x = class, y = hwy)) +
geom_violin() +
theme_bw() +
labs(title = "Violin Plot of Highway Mileage by Car Class")
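geom_violin() on its own draws only the density outline. One common way to show the median and interquartile range as well, sketched here, is to overlay a narrow box plot. The width, fill color, and title are arbitrary choices:

```r
library(ggplot2)

# Violin outline for the density, narrow box plot on top for median and IQR
p_violin_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin(fill = "grey90") +
  geom_boxplot(width = 0.1, outlier.shape = NA) +  # NA hides the outlier points
  theme_bw() +
  labs(title = "Violin Plot with Median and IQR Overlay")
p_violin_box
```

Layer order matters: the box plot is added after the violin so that it is drawn on top.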
Sina plots, available in the ggforce package, combine the features of violin plots and jitter plots. They show the distribution of the data while preserving the individual observations. Sina plots display the density of the data like a violin plot but also jitter the data points within the distribution, providing a more detailed view of each individual data point.
# Create a sina plot
p_sina <- ggplot(mpg, aes(x = class, y = hwy)) +
geom_violin(alpha = 0.5) +
ggforce::geom_sina()
p_sina
Ridge plots, also known as joy plots, are used to visualize the distribution of a continuous variable across different categories. They are essentially a series of overlapping density plots, where each density plot represents the distribution of the variable for a different category. Ridge plots are useful for comparing the distributions across multiple categories simultaneously.
# Ridge plot (geom_density_ridges() comes from the ggridges package)
library(ggridges)
ggplot(mpg, aes(x = hwy, y = class, fill = class)) +
geom_density_ridges(scale = 0.9) +
theme_bw() +
labs(title = "Ridge Plot of Highway Mileage by Car Class") +
scale_fill_brewer(palette = "Blues") +
scale_color_brewer(palette = "Blues")
## Picking joint bandwidth of 0.966
We can always add facets.
library(ggplot2)
library(ggridges)
# Ridge plot with facet_wrap
ggplot(mpg, aes(x = hwy, y = class, fill = class)) +
geom_density_ridges(scale = 0.9) +
theme_bw() +
labs(
title = "Ridge Plot of Highway Mileage by Car Class",
x = "Highway Mileage",
y = "Car Class"
) +
facet_wrap(~ drv)
## Picking joint bandwidth of 4.03
## Picking joint bandwidth of 1.21
## Picking joint bandwidth of 0.841
# Ridge plot with faceting using the diamonds dataset
# (scale below 1 keeps neighboring ridges from overlapping)
ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
geom_density_ridges(scale = 0.8, alpha = 0.7) +
#scale_x_continuous(labels = scales::dollar) +
theme_minimal() +
labs(
title = "Ridge Plot of Diamond Prices by Cut",
x = "Price",
y = "Cut"
) +
facet_wrap(~ color)
## Picking joint bandwidth of 559
## Picking joint bandwidth of 517
## Picking joint bandwidth of 585
## Picking joint bandwidth of 689
## Picking joint bandwidth of 755
## Picking joint bandwidth of 958
## Picking joint bandwidth of 1010
load("Data/survey/health22.Rdata")
glimpse(health22)
## Rows: 1,000
## Columns: 5
## $ party <chr> "X", "D", "R", "R", "D", "X", "R", "D", "X", "X", "D"…
## $ st <chr> "WI", "IL", "MI", "OR", "TX", "TN", "WI", "IL", "NE",…
## $ ideo7 <dbl> 4, 5, 6, 5, 2, 4, 4, 4, 3, 4, 6, 5, 7, 6, 4, 3, 6, 1,…
## $ medicaid_2 <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ prolifeprochoice <dbl> NA, 1, 0, 0, 0, NA, 0, 1, 1, 1, 0, 0, 1, 0, NA, 0, 0,…
# Density plot
ggplot(health22, aes(x = ideo7, fill=party)) +
geom_density(alpha=0.2) +
theme_bw() +
labs(title = "Density Plot of Ideology")
Heatmaps are a powerful visualization tool for displaying the intensity of data at the intersection of two variables. By dividing the plotting area into a grid of bins, heatmaps use color to represent the number of observations within each bin, making it easy to identify areas of high concentration and detect patterns between variables. In ggplot2, the geom_bin2d() function creates heatmaps by dividing the data space into a grid and counting the number of observations within each bin. Note that with ggplot2's default fill gradient, lighter blues mark higher counts. The resulting plot provides a clear way to visualize the joint distribution of two continuous variables, making it particularly useful when data points are densely packed and individual points may overlap.
# heatmap
ggplot(mpg, aes(x = class, y = manufacturer)) +
geom_bin2d() +
theme_minimal() +
labs(title = "Heatmap of Car Class vs. Manufacturer")
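The example above bins two categorical variables, so each tile is one class/manufacturer combination. As a sketch of the continuous case described in the text, the same geom can bin displ and hwy into a grid. The choice of 10 bins per axis and the plot title are arbitrary:

```r
library(ggplot2)

# Bin two continuous variables into a 10 x 10 grid (10 is an arbitrary choice)
p_heat <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_bin2d(bins = 10) +
  theme_minimal() +
  labs(title = "Heatmap of Displacement vs. Highway Mileage")
p_heat

# The counts behind the tiles, one row per non-empty grid cell
tiles <- layer_data(p_heat)
sum(tiles$count)  # every car in mpg falls into exactly one cell
```

As with histograms, the number of bins controls how coarse or fine the picture is, so it is worth trying more than one value.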
Hexbin plots are similar to heatmaps but use hexagonal bins instead of rectangular ones. The hexagonal tiling provides a more efficient and visually appealing way to represent the density of data points, especially when the data is unevenly distributed. Hexagonal bins reduce visual artifacts and provide a better approximation of the data distribution. The geom_hex() function in ggplot2 creates hexbin plots by dividing the data space into hexagons and counting the number of observations within each hexagon. The color intensity of each hexagon represents the density of data points, allowing for easy identification of regions with high or low concentrations. Hexbin plots are particularly useful for visualizing large datasets where traditional scatter plots would result in significant overplotting.
library(hexbin)
# Hexbin plot
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_hex() +
theme_minimal() +
labs(title = "Hexbin Plot of Displacement vs. Highway Mileage")
Density contour plots use contours to represent the density of data points across two continuous variables, providing a topographic map-like visualization of data distribution. The geom_density_2d_filled() function in ggplot2 creates these plots by estimating the density of data points and drawing filled contour lines that connect regions of equal density. The filled contours help to highlight areas of higher concentration and can reveal patterns or relationships between variables.
# Density contour plot
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_density_2d_filled() +
theme_minimal() +
labs(title = "Density Contour Plot of Displacement vs. Highway Mileage")
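The density estimate behind a contour plot can also be computed on its own. ggplot2's 2D density stats are documented as using MASS::kde2d(), so the sketch below calls that function directly, assuming the MASS package is installed. The 50 x 50 grid size is an arbitrary choice:

```r
library(ggplot2)  # for the mpg data

# 2D kernel density estimate on a 50 x 50 grid (grid size is an arbitrary choice)
kde2 <- MASS::kde2d(mpg$displ, mpg$hwy, n = 50)

dim(kde2$z)  # one density value per grid cell: rows follow x, columns follow y

# Locate the grid cell with the highest estimated density
peak <- which(kde2$z == max(kde2$z), arr.ind = TRUE)
c(displ = kde2$x[peak[1, 1]], hwy = kde2$y[peak[1, 2]])
```

The contour bands in the plot are level sets of a surface like kde2$z, so the peak found here should sit inside the innermost filled region.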