Histograms and Density Plots

Histograms display continuous single-variable data, binned into categories or “bins.” How we create those bins is our decision.

We’re switching to the midwest data set now, which contains information on a bunch of counties in the Midwest.

head(midwest,n=20)
## # A tibble: 20 × 28
##      PID county  state  area poptotal popdensity popwhite popblack popamerindian
##    <int> <chr>   <chr> <dbl>    <int>      <dbl>    <int>    <int>         <int>
##  1   561 ADAMS   IL    0.052    66090      1271.    63917     1702            98
##  2   562 ALEXAN… IL    0.014    10626       759      7054     3496            19
##  3   563 BOND    IL    0.022    14991       681.    14477      429            35
##  4   564 BOONE   IL    0.017    30806      1812.    29344      127            46
##  5   565 BROWN   IL    0.018     5836       324.     5264      547            14
##  6   566 BUREAU  IL    0.05     35688       714.    35157       50            65
##  7   567 CALHOUN IL    0.017     5322       313.     5298        1             8
##  8   568 CARROLL IL    0.027    16805       622.    16519      111            30
##  9   569 CASS    IL    0.024    13437       560.    13384       16             8
## 10   570 CHAMPA… IL    0.058   173025      2983.   146506    16559           331
## 11   571 CHRIST… IL    0.042    34418       819.    34176       82            51
## 12   572 CLARK   IL    0.03     15921       531.    15842       10            26
## 13   573 CLAY    IL    0.028    14460       516.    14403        4            17
## 14   574 CLINTON IL    0.029    33944      1170.    32688     1021            48
## 15   575 COLES   IL    0.03     51644      1721.    50177      925            92
## 16   576 COOK    IL    0.058  5105067     88018.  3204947  1317147         10289
## 17   577 CRAWFO… IL    0.026    19464       749.    19300       63            34
## 18   578 CUMBER… IL    0.02     10670       534.    10627        5             6
## 19   579 DE KALB IL    0.038    77932      2051.    72968     2069           123
## 20   580 DE WITT IL    0.023    16516       718.    16387       25            37
## # ℹ 19 more variables: popasian <int>, popother <int>, percwhite <dbl>,
## #   percblack <dbl>, percamerindan <dbl>, percasian <dbl>, percother <dbl>,
## #   popadults <int>, perchsd <dbl>, percollege <dbl>, percprof <dbl>,
## #   poppovertyknown <int>, percpovertyknown <dbl>, percbelowpoverty <dbl>,
## #   percchildbelowpovert <dbl>, percadultpoverty <dbl>,
## #   percelderlypoverty <dbl>, inmetro <int>, category <chr>
 ggplot(data = midwest,
            mapping = aes(x = area))

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram() + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

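If we don’t set the number of bins, ggplot2 falls back to its default of 30, as the message above notes. That default is rarely the best choice, so it is worth setting bins or binwidth ourselves.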

p + geom_histogram(bins=10)
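
Instead of a bin count, we can also fix the width of each bin. The sketch below assumes a bin width of 0.005 (in the units of area), a value picked purely for illustration, and uses layer_data() to check which bins ggplot2 actually computed.

```r
library(ggplot2)

# Same base plot as above
p <- ggplot(data = midwest, mapping = aes(x = area))

# Fix the bin width rather than the number of bins.
# 0.005 is an illustrative choice, not a recommendation.
p_bw <- p + geom_histogram(binwidth = 0.005, boundary = 0)

# Inspect the bins that were computed for the histogram layer
bins_used <- layer_data(p_bw)
head(bins_used[, c("xmin", "xmax", "count")])

p_bw
```

Setting boundary = 0 aligns bin edges with 0, which tends to make the breaks easier to read off the axis.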

We can compare histograms, too. Let’s subset the data to two states, OH and WI, using the %in% operator.

# Base R version
oh_wi <- subset(midwest, state %in% c("OH", "WI"))

# Equivalent dplyr version (needs library(dplyr) or library(tidyverse))
oh_wi <- midwest %>% filter(state %in% c("OH", "WI"))

p <- ggplot(data = oh_wi,
            mapping = aes(x = area, color = state, fill = state))
p + geom_histogram(alpha = 0.2, bins = 20)
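
If %in% is new to you, the short sketch below shows what it returns on a small vector and then uses the same test to pick rows by hand. The name oh_wi_check is made up for this example.

```r
library(ggplot2)  # provides the midwest data

# %in% tests each element on the left for membership in the set on the right
c("OH", "IL", "WI") %in% c("OH", "WI")
## [1]  TRUE FALSE  TRUE

# The same logical vector can be used to pick rows directly
keep <- midwest$state %in% c("OH", "WI")
oh_wi_check <- midwest[keep, ]
unique(oh_wi_check$state)
```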

Try: What about a small multiples approach? Notice how it fixes the x and y axes to have the same scale. That’s usually, but not always, what you want.

p + geom_histogram(alpha = 0.4, bins = 20) + 
    facet_wrap(~state)

A density plot is an alternative to a histogram when plotting continuous data. geom_density() will calculate a kernel density estimate of the underlying distribution of the data.

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_density()
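
To get a feel for what a kernel density estimate is, we can compute one directly with base R’s density() function. This is only a sketch of the idea, not a claim about every detail of what geom_density() does internally. The adjust value of 0.5 is an arbitrary choice to show the effect of a narrower bandwidth.

```r
library(ggplot2)

# Base R's density() computes a Gaussian kernel density estimate by default
d <- density(midwest$area)
d$bw  # the bandwidth that was chosen automatically

# The area under the estimated curve should be close to 1
area_under_curve <- sum(d$y) * diff(d$x[1:2])
area_under_curve

# adjust rescales the bandwidth: below 1 gives a wigglier curve,
# above 1 a smoother one
p <- ggplot(data = midwest, mapping = aes(x = area))
p + geom_density(adjust = 0.5)
```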

How about breaking out by state? Let’s map color and fill to state, and map the variable to be plotted, area, to x.

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3) +
    labs(x = "Area",
         y = "Density",
         title = "Area by State") + facet_wrap(~state)

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3) +
    labs(x = "Area",
         y = "Density",
         title = "Area by State") + facet_wrap(~state, scales="free_x")

Color palettes

We can use different color palettes to make our plots more visually appealing. The palettes used here come from RColorBrewer, and its documentation describes the ones that are available.

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.7) +
    labs(x = "Area",
         y = "Density",
         title = "Area by State") + facet_wrap(~state) +
  scale_fill_brewer(palette = "Blues") +
  scale_color_brewer(palette = "Blues")
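
We can also look the palettes up from inside R. The sketch below uses the RColorBrewer package, which is installed alongside ggplot2. Which palette suits a given plot is a judgment call, so treat the suggestion in the comments as one option rather than a rule.

```r
library(RColorBrewer)

# Metadata for one palette: its maximum number of colors and its category
brewer.pal.info["Blues", ]

# The hex codes for a 5-color version of the palette
brewer.pal(5, "Blues")

# Names of the qualitative palettes, which suit unordered
# categories such as state
qual_palettes <- rownames(brewer.pal.info)[brewer.pal.info$category == "qual"]
qual_palettes

# Draw every palette in the plotting window
display.brewer.all()
```

"Blues" is a sequential palette, so it suggests an ordering among the states. Swapping in a qualitative palette, for example scale_fill_brewer(palette = "Set2"), avoids that impression.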

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_dotplot(alpha = 0.7) +
    labs(x = "Area",
         y = "Density",
         title = "Area by State") + facet_wrap(~state) + theme_bw()
## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

Box Plots

Box plots provide a visual summary of the distribution of a dataset. They display the median, quartiles, and potential outliers. The box represents the interquartile range (IQR), and the line inside the box indicates the median. Whiskers extend to the smallest and largest values within 1.5 * IQR from the lower and upper quartiles, respectively. Points outside this range are considered outliers and plotted individually.

# Create a box plot
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
    geom_boxplot()

p_box
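
To make the pieces of a box plot concrete, we can compute them by hand for a single group. This sketch uses the suv class only as an example and follows the 1.5 * IQR rule described above. The variable names are made up for illustration.

```r
library(ggplot2)  # provides the mpg data

# Highway mileage for one class, chosen only as an example
hwy_suv <- mpg$hwy[mpg$class == "suv"]

# Quartiles and interquartile range
q <- quantile(hwy_suv, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q[3] - q[1])

# Fences sit 1.5 * IQR beyond the lower and upper quartiles
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside <- hwy_suv >= lower_fence & hwy_suv <= upper_fence
whiskers <- range(hwy_suv[inside])

# Anything outside the fences is drawn as an individual point
outliers <- hwy_suv[!inside]

q
whiskers
outliers
```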

Jitter Plots

Jitter plots add random noise to the position of data points to prevent overplotting, especially when dealing with categorical data. This technique helps in visualizing the distribution and spread of individual data points. Jitter plots are useful when there are many overlapping points, as they allow you to see the density and clustering of data.

# Create a jitter plot
p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
    geom_jitter(width = 0.2, height = 0)

p_jitter
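
Because the noise is random, a jitter plot looks slightly different each time it is drawn. If we want the same picture every time, one option is to pass a seed to position_jitter(). The seed value 1 below is arbitrary.

```r
library(ggplot2)

# position_jitter() accepts a seed, so the noise is the same on every render
p_jitter_fixed <- ggplot(mpg, aes(x = class, y = hwy)) +
    geom_jitter(position = position_jitter(width = 0.2, height = 0, seed = 1))

# Two separate computations of the layer give the same point positions
first  <- layer_data(p_jitter_fixed)
second <- layer_data(p_jitter_fixed)
all.equal(as.numeric(first$x), as.numeric(second$x))

p_jitter_fixed
```

With height = 0 the points move only sideways, so the hwy values on the y axis stay exact.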

Violin Plots

Violin plots combine aspects of box plots and density plots. They show the distribution of the data across different categories by displaying a kernel density estimate, which provides a smooth approximation of the data distribution. The shape of the violin indicates the density of the data at different values, with wider sections representing higher density. Violin plots often also include a marker for the median and a box representing the interquartile range, although geom_violin() on its own draws only the density outline.

# Example data
data(mpg)

# Violin plot
ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin() +
  theme_bw() +
  labs(title = "Violin Plot of Highway Mileage by Car Class")
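
One common way to add the median and interquartile range to a violin is to overlay a narrow box plot. This is a sketch of that approach. The width of 0.1 is just a value that tends to fit inside the violins, and outlier.shape = NA hides the outlier points so they don’t clutter the plot.

```r
library(ggplot2)

# Violin outline with a slim box plot drawn on top
p_violin_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin() +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  theme_bw() +
  labs(title = "Violin Plot with Box Plot Overlay")

p_violin_box
```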

Sina Plots

Sina plots, available in the ggforce package, combine the features of violin plots and jitter plots. They show the distribution of the data while preserving the individual observations. Sina plots display the density of the data like a violin plot but also jitter the data points within the distribution, providing a more detailed view of each individual data point.

# Create a sina plot
p_sina <- ggplot(mpg, aes(x = class, y = hwy)) +
    geom_violin(alpha = 0.5) +
    ggforce::geom_sina()

p_sina

Ridge Plots

Ridge plots, also known as joy plots, are used to visualize the distribution of a continuous variable across different categories. They are essentially a series of overlapping density plots, where each density plot represents the distribution of the variable for a different category. Ridge plots are useful for comparing the distributions across multiple categories simultaneously.

# geom_density_ridges() comes from the ggridges package
library(ggridges)

# Ridge plot
ggplot(mpg, aes(x = hwy, y = class, fill = class)) +
  geom_density_ridges(scale = 0.9) +
  theme_bw() +
  labs(title = "Ridge Plot of Highway Mileage by Car Class") +
  scale_fill_brewer(palette = "Blues") +
  scale_color_brewer(palette = "Blues")
## Picking joint bandwidth of 0.966

We can always add facets.

library(ggplot2)
library(ggridges)

# Ridge plot with facet_wrap
ggplot(mpg, aes(x = hwy, y = class, fill = class)) +
  geom_density_ridges(scale = 0.9) +
  theme_bw() +
  labs(
    title = "Ridge Plot of Highway Mileage by Car Class",
    x = "Highway Mileage",
    y = "Car Class"
  ) +
  facet_wrap(~ drv)
## Picking joint bandwidth of 4.03
## Picking joint bandwidth of 1.21
## Picking joint bandwidth of 0.841

# Ridge plot with faceting using the diamonds dataset
# (scale below 1 keeps neighboring ridges from overlapping)
ggplot(diamonds, aes(x = price, y = cut, fill = cut)) +
  geom_density_ridges(scale = 0.8, alpha = 0.7) +
  #scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(
    title = "Ridge Plot of Diamond Prices by Cut",
    x = "Price",
    y = "Cut"
  ) +
  facet_wrap(~ color)
## Picking joint bandwidth of 559
## Picking joint bandwidth of 517
## Picking joint bandwidth of 585
## Picking joint bandwidth of 689
## Picking joint bandwidth of 755
## Picking joint bandwidth of 958
## Picking joint bandwidth of 1010

load("Data/survey/health22.Rdata")

glimpse(health22)
## Rows: 1,000
## Columns: 5
## $ party            <chr> "X", "D", "R", "R", "D", "X", "R", "D", "X", "X", "D"…
## $ st               <chr> "WI", "IL", "MI", "OR", "TX", "TN", "WI", "IL", "NE",…
## $ ideo7            <dbl> 4, 5, 6, 5, 2, 4, 4, 4, 3, 4, 6, 5, 7, 6, 4, 3, 6, 1,…
## $ medicaid_2       <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ prolifeprochoice <dbl> NA, 1, 0, 0, 0, NA, 0, 1, 1, 1, 0, 0, 1, 0, NA, 0, 0,…
# Density plot

ggplot(health22, aes(x = ideo7, fill=party)) +
  geom_density(alpha=0.2) +
  theme_bw() +
  labs(title = "Density Plot of Ideology")

Heatmaps

Heatmaps display how many observations fall at each combination of two variables. The plotting area is divided into a grid of bins, and color represents the number of observations within each bin, making it easy to identify areas of high concentration and detect patterns between variables. Note that with ggplot2’s default fill scale, lighter blues mark higher counts and darker blues mark lower ones.

In ggplot2, the geom_bin2d function creates heatmaps by dividing the data space into a grid and counting the number of observations within each bin. With two continuous variables, the result shows their joint distribution, which is particularly useful when data points are densely packed and individual points overlap. The example below uses two categorical variables instead, so each tile simply counts the cars with that combination of class and manufacturer.

# heatmap
ggplot(mpg, aes(x = class, y = manufacturer)) +
  geom_bin2d() +
  theme_minimal() +
  labs(title = "Heatmap of Car Class vs. Manufacturer")
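
To see the numbers behind the tiles, we can count the combinations ourselves and draw them with geom_tile(). This is a sketch of an alternative route to a similar picture. The counts data frame and the white-to-dark-blue gradient are choices made for this example, with the gradient picked so that darker tiles mean more cars.

```r
library(ggplot2)

# Count how many cars fall into each class / manufacturer combination
counts <- as.data.frame(table(class = mpg$class,
                              manufacturer = mpg$manufacturer))

# Drop combinations that never occur so they are left blank
counts <- counts[counts$Freq > 0, ]
head(counts)

# Draw one tile per combination, with darker fill for higher counts
ggplot(counts, aes(x = class, y = manufacturer, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_minimal() +
  labs(title = "Heatmap of Car Class vs. Manufacturer", fill = "Count")
```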

Hexbin Plots

Hexbin plots are similar to heatmaps but use hexagonal bins instead of rectangular ones. The hexagonal tiling provides a more efficient and visually appealing way to represent the density of data points, especially when the data is unevenly distributed. Hexagonal bins reduce visual artifacts and provide a better approximation of the data distribution. The geom_hex function in ggplot2 creates hexbin plots by dividing the data space into hexagons and counting the number of observations within each hexagon. The color intensity of each hexagon represents the density of data points, allowing for easy identification of regions with high or low concentrations. Hexbin plots are particularly useful for visualizing large datasets where traditional scatter plots would result in significant overplotting.

library(hexbin)

# Hexbin plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_hex() +
  theme_minimal() +
  labs(title = "Hexbin Plot of Displacement vs. Highway Mileage")
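
As with histograms, the number of bins changes how the plot reads. The sketch below sets bins = 15, an arbitrary value chosen for illustration. Fewer bins give larger hexagons and a coarser picture. Drawing the plot needs the hexbin package to be installed, so the example checks for it first.

```r
library(ggplot2)

# Fewer, larger hexagons than the default
p_hex <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_hex(bins = 15) +
  theme_minimal() +
  labs(title = "Hexbin Plot with 15 Bins per Axis")

# geom_hex() relies on the hexbin package when the plot is drawn
if (requireNamespace("hexbin", quietly = TRUE)) {
  print(p_hex)
}
```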

Density Contour Plots

Density contour plots use contours to represent the density of data points across two continuous variables, providing a topographic map-like visualization of the data distribution. The geom_density_2d_filled function in ggplot2 creates these plots by estimating the two-dimensional density of the data and filling the bands between contour lines of equal density. The filled bands highlight areas of higher concentration and can reveal patterns or relationships between variables.

# Density contour plot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_density_2d_filled() +
  theme_minimal() +
  labs(title = "Density Contour Plot of Displacement vs. Highway Mileage")
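
The density surface behind these contours can be estimated directly with MASS::kde2d(), which the ggplot2 documentation names as the estimator used by its 2D density stat. The sketch below assumes a 100 x 100 grid, an arbitrary resolution, and then draws the unfilled variant, geom_density_2d(), over the raw points.

```r
library(ggplot2)  # provides the mpg data

# Estimate the joint density of displ and hwy on a 100 x 100 grid
dens <- MASS::kde2d(mpg$displ, mpg$hwy, n = 100)
dim(dens$z)

# Locate the grid cell with the highest estimated density
# (rows of z follow dens$x, columns follow dens$y)
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
c(displ = dens$x[peak[1, 1]], hwy = dens$y[peak[1, 2]])

# The same idea as contour lines drawn over the raw points
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.3) +
  geom_density_2d() +
  theme_minimal()
```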