What is a Distribution?

A distribution describes how values of a variable are spread across possible values.

Questions we often ask:

Distribution plots answer these questions.

Histograms

A histogram divides data into bins and counts how many observations fall in each bin. Each bar represents a range of values.
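
As a rough illustration of what binning does, the sketch below counts observations per bin by hand in base R. The mass values are made up for illustration, and count_bins() is a hypothetical helper, not a ggplot2 function. ggplot2 places its bin edges differently by default, so these counts need not match geom_histogram() exactly.

```r
# Hypothetical body masses in grams (made-up values, not the penguins data)
mass <- c(3250, 3400, 3450, 3700, 3800, 3950, 4200, 4650, 5000, 5400)

# Hypothetical helper: count observations per bin of a given width.
# Bins are left-closed, [a, b), except the last one, which also includes b.
count_bins <- function(x, binwidth) {
  breaks <- seq(floor(min(x) / binwidth) * binwidth,
                ceiling(max(x) / binwidth) * binwidth,
                by = binwidth)
  table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
}

narrow <- count_bins(mass, binwidth = 200)   # 11 narrow bins, some empty
wide   <- count_bins(mass, binwidth = 1000)  # 3 wide bins with counts 6, 2, 2

print(narrow)
print(wide)
```

Every observation lands in exactly one bin, so both tables sum to the number of observations. Only the level of detail changes with the bin width.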

Key idea: bin width matters

#install.packages("ggplot2")
library(ggplot2)

data(penguins) #this loads two data sets; use the cleaned one (penguins), not the raw data (penguins_raw)

#basics of a histogram
ggplot(penguins, aes(x = body_mass)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

#like in base R, you can alter bin width
ggplot(penguins, aes(x = body_mass)) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

#smaller binwidth = more detail
#larger binwidth = smoother summary
# You can look up summary statistics and use the median/quartiles to indicate where

Histogram with Density Scaling

#This switches the histogram's y-axis from "count" to "density", which allows overlaying density curves later.

ggplot(penguins, aes(x = body_mass, y = after_stat(density))) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
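
To see what density scaling means numerically, the following sketch uses base R's hist() on made-up values. It assumes equal-width bins, in which case each bar height is count / (n * binwidth) and the bar areas sum to 1. The variable names are illustrative only.

```r
# Hypothetical body masses in grams (made-up values, not the penguins data)
mass <- c(3250, 3400, 3450, 3700, 3800, 3950, 4200, 4650, 5000, 5400)
binwidth <- 500

# Compute the histogram without drawing it
h <- hist(mass, breaks = seq(3000, 5500, by = binwidth), plot = FALSE)

# Density scaling by hand: divide each count by (n * binwidth)
manual_density <- h$counts / (length(mass) * binwidth)

# hist() reports the same values in its density component
print(all.equal(manual_density, h$density))

# Total bar area (height * width) adds up to 1
total_area <- sum(manual_density * binwidth)
print(total_area)
```

Because the bars now cover a total area of 1, they share a scale with a density curve, which is why the two can be drawn on the same axes.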

Density Plots

A density plot is a smoothed version of a histogram. Instead of bins, it estimates the probability density function.

Good for:

  • smoother visualization
  • comparing groups

ggplot(penguins, aes(x = body_mass)) +
  geom_density() #Area under curve = 1
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

ggplot(penguins, aes(x = body_mass, fill = species)) +
  geom_density(alpha = 0.4)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
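
The smoothing behind a density plot is a kernel density estimate. The sketch below uses base R's density() on simulated data (not the penguins data) to check two things: the area under the curve is approximately 1, and the adjust argument scales the bandwidth. geom_density() also accepts an adjust argument, but treating the two as equivalent here is an assumption made for illustration.

```r
set.seed(1)
# Simulated two-group sample (made-up values for illustration)
x <- c(rnorm(100, mean = 3700, sd = 400),
       rnorm(60,  mean = 5000, sd = 450))

# Kernel density estimate with the default Gaussian kernel and bandwidth
kde <- density(x)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
print(area)  # close to 1

# Halving the bandwidth gives a wigglier, more detailed curve
kde_detailed <- density(x, adjust = 0.5)
print(c(default = kde$bw, halved = kde_detailed$bw))
```

A smaller bandwidth plays the same role as a smaller bin width in a histogram: more detail, but also more noise.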

Frequency Polygons

A frequency polygon shows the same bin counts as a histogram, but instead of drawing bars it connects the counts at the bin midpoints with lines.

Advantage:

  • Easier comparison across groups

ggplot(penguins, aes(x = body_mass)) +
  geom_freqpoly(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(penguins, aes(x = body_mass, color = species)) +
  geom_freqpoly(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
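
The points that a frequency polygon connects can be computed by hand. This sketch uses base R's hist() on made-up fuel-efficiency values. The padding with an empty bin at each end mimics how the line is usually brought back down to zero, and is an assumption here rather than a description of ggplot2's internals.

```r
# Hypothetical highway mpg values (made-up, for illustration only)
hwy <- c(17, 18, 20, 21, 24, 25, 26, 26, 27, 29, 31, 35)
binwidth <- 4

# Bin the data without drawing anything
h <- hist(hwy, breaks = seq(16, 36, by = binwidth), plot = FALSE)

# A frequency polygon connects (bin midpoint, count) pairs
poly <- data.frame(midpoint = h$mids, count = h$counts)
print(poly)

# Add an empty bin on each side so the line starts and ends at zero
padded <- data.frame(
  midpoint = c(min(h$mids) - binwidth, h$mids, max(h$mids) + binwidth),
  count    = c(0, h$counts, 0)
)
print(padded)
```

Calling plot(padded$midpoint, padded$count, type = "l") would draw the polygon in base graphics.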

Overlaying histogram and density

ggplot(penguins, aes(x = body_mass)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 200,
                 fill = "grey80") +
  geom_density(color = "blue", linewidth = 1)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

#histogram = binned empirical distribution (rescaled to density here)
#density = smoothed estimate

Showing relationships alongside Distribution

library(ggplot2)
library(ggExtra)
library(ggthemes)

penguins_clean <- na.omit(penguins)

#create the graph and save it under a new name
#At this stage it is just a normal scatterplot.
p <- ggplot(penguins_clean, aes(x = bill_len,
                                y = bill_dep,
                                color = species)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_minimal() +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Penguin Bill Measurements by Species"
  )

ggMarginal(p, type = "density", groupColour = TRUE, groupFill = TRUE)

Marginal plots are useful because:

  • They show distribution and relationships simultaneously
  • They help detect clusters or separation between groups
  • They reveal skewness or multimodal distributions

Ridgeline Plots with the ggridges Package

# library
library(ggridges) #this is the new package you need 
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
#library(hrbrthemes)

# Plot
#fill = after_stat(x) maps color to the temperature value itself, producing a gradient across each ridge.
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, #Controls how much the ridges overlap vertically. Larger values create more overlap.
                               rel_min_height = 0.01) + #Removes extremely small density tails so the ridges look cleaner.
  scale_fill_viridis(name = "Temp. [F]", option = "C") +
  labs(title = 'Temperatures in Lincoln NE in 2016') +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"), #Spacing between facet panels; it has no effect here because the plot is not faceted.
    strip.text.x = element_text(size = 8) #Adjusts label size if facets are present.
  )
## Picking joint bandwidth of 3.37

Ridgeline plots are useful for comparing how a distribution changes across categories. In this case, the plot shows how temperature distributions vary by month in Lincoln, Nebraska during 2016.

They are particularly helpful when you want to compare many distributions at once without the heavy overlap that comes from drawing every curve on the same baseline, as in a standard density plot.

Each horizontal ridge represents the distribution of temperatures for a specific month.

So the plot answers questions like:

  1. How do winter and summer temperature distributions differ?
  2. Which months have the widest range of temperatures?
  3. When do high temperatures occur most frequently?

Instead of looking at 12 separate density plots, the ridgeline stacks them vertically so patterns across months are easy to see.
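
Conceptually, a ridgeline plot is one density estimate per category, all drawn with a common bandwidth and stacked vertically. The sketch below reproduces that idea in base R on simulated temperatures (not the lincoln_weather data). Taking the shared bandwidth as the mean of the per-group defaults is an assumption made for illustration, not a statement about how ggridges picks its joint bandwidth.

```r
set.seed(42)
# Simulated daily mean temperatures in F for three months (made-up values)
temps <- data.frame(
  month = rep(c("January", "April", "July"), each = 30),
  temp  = c(rnorm(30, mean = 25, sd = 8),
            rnorm(30, mean = 52, sd = 9),
            rnorm(30, mean = 78, sd = 5))
)

# One vector of temperatures per month
by_month <- split(temps$temp, temps$month)

# Assumed shared bandwidth: the mean of the default bandwidth of each group
shared_bw <- mean(sapply(by_month, bw.nrd0))

# One density estimate per month, all using the same bandwidth
ridges <- lapply(by_month, density, bw = shared_bw)

# Temperature at which each month's density peaks
peaks <- sapply(ridges, function(d) d$x[which.max(d$y)])
print(round(peaks, 1))
```

Using the same bandwidth for every group keeps the ridges comparable, since differences in smoothness then reflect the data rather than the smoothing.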

Density 2D Plots

A 2D density chart displays the relationship between two numeric variables. One is mapped to the x-axis and the other to the y-axis, as in a scatterplot. The number of observations falling within each region of the 2D space is then counted and represented by a color gradient.
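
The counting step can be sketched in base R by cutting each variable into intervals and cross-tabulating them. The data below are simulated stand-ins for bill length and depth, and the choice of five intervals per axis is arbitrary.

```r
set.seed(7)
# Simulated bill measurements in mm (made-up values, not the penguins data)
bill_len <- rnorm(200, mean = 44, sd = 5)
bill_dep <- rnorm(200, mean = 17, sd = 2)

# Cut each axis into 5 equal-width intervals
x_bin <- cut(bill_len, breaks = 5)
y_bin <- cut(bill_dep, breaks = 5)

# Cross-tabulate: each cell is the number of points in one rectangle
counts <- table(x_bin, y_bin)
print(counts)

# The busiest rectangle and how many points it holds
busiest <- which(counts == max(counts), arr.ind = TRUE)
print(busiest)
print(max(counts))
```

A 2D histogram draws this table as a grid of tiles, with the fill color of each tile encoding its count.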

#2d histogram with geom_bin2d()

# 2d histogram with default option
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_bin2d() +
  theme_bw() +
  labs(
    title = "2D Histogram of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)"
  )
## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

# Bin size control + color palette
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_bin2d(bins = 20) +
  scale_fill_viridis_c() +
  theme_bw() +
  labs(
    title = "2D Histogram with Custom Bin Number",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

#Hexbin chart with geom_hex()
#Very similar to the 2d histogram above, but the plot area is divided into hexagons instead of squares.

library(hexbin) 

ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_hex() +
  theme_bw() +
  labs(
    title = "Hexbin Plot of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

# Bin size control + color palette
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_hex(bins = 20) +
  scale_fill_viridis_c() +
  theme_bw() +
  labs(
    title = "Hexbin Plot with Custom Bin Number",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

# 2d distribution with stat_density_2d

# Show the contour only
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_density_2d() +
  theme_bw() +
  labs(
    title = "2D Density Contours of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)"
  )

# Show the area only
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  theme_bw() +
  labs(
    title = "Filled 2D Density Plot",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density level"
  )

# Area + contour
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(
    aes(fill = after_stat(level)),
    geom = "polygon",
    colour = "white"
  ) +
  theme_bw() +
  labs(
    title = "Filled 2D Density Plot with Contours",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density level"
  )

# Using raster
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(
    aes(fill = after_stat(density)),
    geom = "raster",
    contour = FALSE
  ) +
  scale_fill_viridis_c() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw() +
  labs(
    title = "Raster 2D Density Plot",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density"
  )

# Scatterplot with density contours overlaid
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_point(aes(color = species), alpha = 0.5) +
  geom_density_2d(color = "black") +
  theme_bw() +
  labs(
    title = "Penguin Bill Measurements with Density Contours",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    color = "Species"
  )

These plots are useful when a scatterplot becomes crowded and overlapping points hide where the data are concentrated.

  1. geom_bin2d() shows counts in square bins
  2. geom_hex() shows counts in hexagons, which often look smoother
  3. geom_density_2d() shows where points are most concentrated
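
For the density-based option, the underlying quantity is a 2D kernel density estimate evaluated on a grid. The sketch below computes one with kde2d() from the MASS package (shipped with most R installations) on simulated data. The grid size of 50 and the variable names are arbitrary choices for illustration.

```r
set.seed(3)
# Simulated bill measurements in mm (made-up values, not the penguins data)
bill_len <- rnorm(300, mean = 44, sd = 5)
bill_dep <- rnorm(300, mean = 17, sd = 2)

# 2D kernel density estimate on a 50 x 50 grid spanning the data range
dens <- MASS::kde2d(bill_len, bill_dep, n = 50)

# dens$x and dens$y are the grid coordinates, dens$z the estimated density
print(dim(dens$z))

# Grid location where the estimated density is highest
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_len <- dens$x[peak[1, "row"]]
peak_dep <- dens$y[peak[1, "col"]]
print(c(bill_len = peak_len, bill_dep = peak_dep))
```

Contour lines join grid locations with equal estimated density, while a raster display colors every grid cell by its density value.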

Homework

In this assignment you will practice visualizing data distributions using several common ggplot2 tools:

  1. Histograms
  2. Density plots
  3. Frequency polygons
  4. 2D density plots

Understanding how data are distributed is an essential first step in exploratory data analysis.

Task 1

Create a histogram showing the distribution of highway fuel efficiency (hwy) from the mpg dataset.

Requirements:

  1. Use geom_histogram()
  2. Set the bin width to 2
  3. Use a minimal theme
  4. Include appropriate axis labels and a title

ggplot(mpg, aes(hwy))+
  geom_histogram(binwidth=2)+
  labs(title="Highway Fuel Efficiency of Cars",x="Highway Fuel Efficiency (miles per gallon)",y="# of Cars")+
  theme_minimal()

  1. What range of highway fuel efficiency values is most common? Approximately 26 miles per gallon.

  2. Does the distribution appear symmetric, skewed, or multimodal? It looks to be multimodal. It has two “peaks”.

Task 2

Create a histogram showing the distribution of engine displacement (displ).

Requirements:

  1. Use geom_histogram()
  2. Choose a reasonable bin width
  3. Fill the bars with a color of your choice

mpg2 <- mpg
mpg2$year <- as.factor(mpg2$year) #treat year as a category so it can be used as a fill

ggplot(mpg2,aes(displ, fill=year))+
  geom_histogram(binwidth=1)+
  theme_dark()+
  labs(title="Car Engine Displacement Counts By Manufacture Year",x="Engine Displacement",y="# of Cars")

Question

  1. What does the histogram suggest about the types of engines represented in the dataset? Cars with higher-displacement engines were less common in 1999 (compared to in 2008) and appear to be less common in general.

Task 3

Density plots provide a smoothed estimate of the distribution of a variable.

Create a density plot of engine displacement (displ).

Requirements:

  1. Use geom_density()
  2. Fill the density curve with a color
  3. Set transparency (alpha) so the shape is visible

#fill is a fixed color here, so it is set in geom_density() rather than mapped inside aes()
ggplot(mpg, aes(displ))+
  geom_density(color="red",fill="red",alpha=0.3)

Question

  1. How does the density plot compare to the histogram you created earlier? The shape shows the same distribution.

Task 4

Create density plots comparing engine displacement across vehicle classes.

Requirements:

  1. Map fill = class
  2. Use transparency so the curves overlap
  3. Include a legend

ggplot(mpg, aes(displ, fill=class))+
  geom_density(alpha=0.3)+
  labs(title="This plot is a mess")

Question

  1. Which vehicle class tends to have the largest engines? Two-seaters tend to have the largest engines.

Task 5

Create a frequency polygon comparing highway fuel efficiency across vehicle classes.

Requirements:

  1. Map color = class
  2. Use the same bin width as before

ggplot(mpg, aes(hwy,color=class))+
  geom_freqpoly(binwidth=2)

Question

  1. Why might frequency polygons be easier to interpret than overlapping histograms when comparing groups? Independent spiky shapes are more distinguishable from each other than a bunch of stacked squares.

Task 6

Create a filled 2D density plot.

Requirements:

  1. Use stat_density_2d()
  2. Set geom = "polygon"
  3. Map fill to density level

ggplot(mpg, aes(displ,hwy))+
  stat_density_2d(geom="polygon",aes(fill=after_stat(level)))

Question

  1. What additional information does the filled density plot provide compared to the contour plot? Color-coding the fill by density makes it WAY more intuitive to read. I don’t think it provides extra info per se, but it’s clearer.

Task 7

Create a scatterplot of carat vs price using the diamonds dataset.

Requirements:

  1. Use geom_point()
  2. Set alpha = 0.2 so overlapping points are visible
  3. Use theme_minimal()
  4. Use the ggExtra package to add marginal density plots to the scatterplot.

#install.packages("ggExtra")
library(ggExtra)

#build the scatterplot first, then pass it to ggMarginal()
dee <- ggplot(diamonds,aes(carat,price))+
  geom_point(alpha=0.2)+
  theme_minimal()

ggMarginal(dee, type="density")

Question

  1. Why might marginal plots be useful when analyzing relationships between variables? Marginal plots show each variable’s own distribution right alongside the relationship between the two. You can analyze them separately AND together.