What is a Distribution?

A distribution describes how values of a variable are spread across possible values.

Questions we often ask:

Distribution plots answer these questions.

Histograms

A histogram divides data into bins and counts how many observations fall in each bin. Each bar represents a range of values.
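To make the binning step concrete, here is a minimal base R sketch that does by hand what a histogram does: assign each value to a bin, then count. The `mass` values are made up for illustration and are not taken from the penguins data.

```r
# Hypothetical body masses in grams (made-up values, not the penguins data)
mass <- c(3200, 3450, 3700, 3750, 3900, 4100, 4200, 4650, 5000, 5400)

# Bin edges every 500 g; with right = FALSE each bin covers [lower, upper)
breaks <- seq(3000, 5500, by = 500)

# cut() assigns each value to a bin, table() counts the values in each bin
bins <- cut(mass, breaks = breaks, right = FALSE, dig.lab = 4)
bin_counts <- table(bins)

print(bin_counts)
```

Each count in `bin_counts` corresponds to the height of one histogram bar.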

Key idea: bin width matters

install.packages("ggplot2")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.5'
## (as 'lib' is unspecified)
library(ggplot2)

data(penguins) #this loads two data sets, penguins and penguins_raw; use the cleaned penguins data, not penguins_raw.

#basics of a histogram
ggplot(penguins, aes(x = body_mass)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

#like in base R, you can alter bin width
ggplot(penguins, aes(x = body_mass)) +
  geom_histogram(binwidth = 500)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

#smaller binwidth = more detail
#larger binwidth = smoother summary
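
The trade-off in the two comments above can be checked without plotting. The sketch below uses simulated values and a hypothetical helper, `count_bins()`, to count how many bins a given bin width produces. It uses base R's `hist()`, so the exact bin edges may differ from the ones ggplot2 chooses.

```r
set.seed(1)
x <- rnorm(200, mean = 4200, sd = 800)  # simulated values, for illustration only

# Hypothetical helper: number of bins needed to cover x at a given bin width
count_bins <- function(x, binwidth) {
  lo <- floor(min(x) / binwidth) * binwidth    # first edge at or below the minimum
  hi <- ceiling(max(x) / binwidth) * binwidth  # last edge at or above the maximum
  h <- hist(x, breaks = seq(lo, hi, by = binwidth), plot = FALSE)
  length(h$counts)
}

narrow <- count_bins(x, 200)   # many bins: more detail, more noise
wide <- count_bins(x, 1000)    # few bins: smoother, less detail

print(c(narrow = narrow, wide = wide))
```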

Histogram with Density Scaling

#This switches the histogram from "count" to "density" scaling, which makes it possible to overlay density curves later.

ggplot(penguins, aes(x = body_mass, y = after_stat(density))) +
  geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
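
As a rough check on what density scaling means, the sketch below rescales counts by hand: each bar's density is its count divided by (number of observations × bin width), so the bar areas add up to 1. The `mass` values are made up for illustration.

```r
# Hypothetical body masses in grams (made-up values, not the penguins data)
mass <- c(3200, 3450, 3700, 3750, 3900, 4100, 4200, 4650, 5000, 5400)
binwidth <- 500

# Bin the data without drawing anything
h <- hist(mass, breaks = seq(3000, 5500, by = binwidth), plot = FALSE)

# density = count / (n * binwidth)
manual_density <- h$counts / (sum(h$counts) * binwidth)

# Total bar area = sum of (height * width); this should equal 1
total_area <- sum(manual_density * binwidth)

print(manual_density)
print(total_area)
```

Because the bars and a density curve both have a total area of 1, they share the same y scale and can be drawn on one plot.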

Density Plots

A density plot is a smoothed version of a histogram. Instead of bins, it estimates the probability density function.

Good for:

  • smoother visualization
  • comparing groups
ggplot(penguins, aes(x = body_mass)) +
  geom_density() #Area under curve = 1
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).
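
The "area under curve = 1" comment can be checked numerically. This is a minimal sketch using base R's `density()` on simulated values; the trapezoid rule used here is only an approximation of the area.

```r
set.seed(42)
x <- rnorm(300, mean = 4200, sd = 800)  # simulated values, for illustration only

# Kernel density estimate (Gaussian kernel and default bandwidth)
d <- density(x)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

print(area)
```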

ggplot(penguins, aes(x = body_mass, fill = species)) +
  geom_density(alpha = 0.4)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

Frequency Polygons

A frequency polygon is like a histogram, but instead of drawing bars it connects the counts at the bin midpoints with lines.

Advantage:

  • Easier comparison across groups
ggplot(penguins, aes(x = body_mass)) +
  geom_freqpoly(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(penguins, aes(x = body_mass, color = species)) +
  geom_freqpoly(binwidth = 200)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
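
To see where the points of a frequency polygon come from, the sketch below bins some made-up values with base R and pairs each bin midpoint with its count. It illustrates the idea and is not ggplot2's internal code.

```r
# Hypothetical body masses in grams (made-up values, not the penguins data)
mass <- c(3200, 3450, 3700, 3750, 3900, 4100, 4200, 4650, 5000, 5400)

# Bin the data without drawing anything
h <- hist(mass, breaks = seq(3000, 5500, by = 500), plot = FALSE)

# A frequency polygon joins these (midpoint, count) pairs with straight lines
polygon_points <- data.frame(midpoint = h$mids, count = h$counts)

print(polygon_points)
```

Passing the two columns to `plot()` with `type = "l"` would draw the polygon in base R.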

Overlaying histogram and density

ggplot(penguins, aes(x = body_mass)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 200,
                 fill = "grey80") +
  geom_density(color = "blue", linewidth = 1)
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_density()`).

#histogram = empirical distribution in bins (rescaled from counts to density here)
#density = smoothed estimate

Showing relationships alongside Distribution

library(ggplot2)
library(ggExtra)
library(ggthemes)

penguins_clean <- na.omit(penguins) # this drops every row that has an NA in any column; columns are never removed

#create the graph and save it under a new name
#At this stage it is just a normal scatterplot.
p <- ggplot(penguins_clean, aes(x = bill_len,
                                y = bill_dep,
                                color = species)) +
  geom_point(size = 2, alpha = 0.8) +
  theme_minimal() +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Penguin Bill Measurements by Species"
  )

ggMarginal(p, type = "density", groupColour = TRUE, groupFill = TRUE)

# ggMarginal() draws the density plots in the margins of the scatterplot

Why marginal plots are useful:

  • They show distribution and relationships simultaneously
  • They help detect clusters or separation between groups
  • They reveal skewness or multimodal distributions
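
The `na.omit()` step used earlier in this section is easy to misread, so here is a small sketch of what it does. The `toy` data frame is made up; only its column names mirror the penguins data.

```r
# Made-up data frame with missing values in two different columns
toy <- data.frame(
  bill_len = c(39.1, NA, 40.3, 36.7),
  bill_dep = c(18.7, 17.4, NA, 19.3),
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo")
)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)

print(nrow(toy))        # rows before
print(nrow(toy_clean))  # rows after: only the complete rows remain
print(ncol(toy_clean))  # the number of columns is unchanged
```

One thing to keep in mind: `na.omit()` also drops rows whose only missing value sits in a column the plot does not use. Filtering with `complete.cases()` on just the plotted columns is one way to keep those rows.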

Working with the ggridges Package

# library
library(ggridges) #this is the new package you need 
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
# Plot
ggplot(lincoln_weather, aes(x = `Mean Temperature [F]`, y = `Month`, fill = after_stat(x))) + #This maps color to the temperature value itself, producing a gradient across each ridge.
  geom_density_ridges_gradient(scale = 3, #Controls how much the ridges overlap vertically. Larger values create more overlap.
                               rel_min_height = 0.1) + #Cuts off density tails lower than 10% of the tallest point so the ridges look cleaner.
  scale_fill_viridis(name = "Temp. [F]", option = "C") +
  labs(title = "Temperatures in Lincoln NE in 2016") +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"), #Spacing between facet panels; it has no visible effect here because the plot is not faceted.
    strip.text.x = element_text(size = 8) #Facet label size; it only matters if facets are present.
  )
## Picking joint bandwidth of 3.37

A ridgeline plot is useful for comparing how a distribution changes across categories. In this case, it shows how temperature distributions vary by month in Lincoln, Nebraska during 2016.

Ridgeline plots are particularly helpful when you want to compare many distributions at once, because overlaying that many curves in a standard density plot quickly becomes unreadable.

Each horizontal ridge represents the distribution of temperatures for a specific month.

So the plot answers questions like:

  1. How do winter and summer temperature distributions differ?
  2. Which months have the widest range of temperatures?
  3. When do high temperatures occur most frequently?

Instead of looking at 12 separate density plots, the ridgeline stacks them vertically so patterns across months are easy to see.
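
A rough sketch of the idea behind a ridgeline plot, in base R: estimate one density per category on a shared x grid with a shared bandwidth, so the curves can be compared directly. The temperatures are simulated, and averaging the per-group bandwidths is an assumption made for illustration, not necessarily what ggridges does internally.

```r
set.seed(7)

# Simulated temperatures for three months (not the lincoln_weather data)
temps <- data.frame(
  month = rep(c("January", "April", "July"), each = 60),
  temp_f = c(rnorm(60, 25, 8), rnorm(60, 52, 9), rnorm(60, 77, 5))
)

# One vector of temperatures per month
by_month <- split(temps$temp_f, temps$month)

# Shared bandwidth: mean of the per-group rule-of-thumb bandwidths (assumption)
joint_bw <- mean(sapply(by_month, bw.nrd0))

# Shared x grid so that every ridge is evaluated at the same temperatures
grid_from <- min(temps$temp_f) - 3 * joint_bw
grid_to <- max(temps$temp_f) + 3 * joint_bw
ridges <- lapply(by_month, density, bw = joint_bw, from = grid_from, to = grid_to)

# Temperature at which each month's density is highest
peak_temp <- sapply(ridges, function(d) d$x[which.max(d$y)])
print(round(peak_temp))
```

A ridgeline plot then draws each of these curves on its own baseline, one above the other.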

Density 2D Plots

A 2D density chart displays the relationship between two numeric variables. One is shown on the x axis and the other on the y axis, as in a scatterplot. The number of observations within each small area of the 2D space is then counted and represented by a color gradient.
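
The counting step described above can be sketched in base R by binning both variables and cross-tabulating the bins. The values are simulated and only the variable names mirror the penguins data.

```r
set.seed(3)

# Simulated measurements (not the penguins data)
bill_len <- rnorm(100, mean = 44, sd = 5)
bill_dep <- rnorm(100, mean = 17, sd = 2)

# Split each variable into 5 equal-width bins
len_bin <- cut(bill_len, breaks = 5)
dep_bin <- cut(bill_dep, breaks = 5)

# Cross-tabulate: each cell is the number of points in one rectangle
counts_2d <- table(len_bin, dep_bin)

print(counts_2d)
```

A 2D histogram colors each rectangle according to the count in the matching cell.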

#2d histogram with geom_bin2d()

# 2d histogram with default option
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_bin2d() +
  theme_bw() +
  labs(
    title = "2D Histogram of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)"
  )
## `stat_bin2d()` using `bins = 30`. Pick better value `binwidth`.

# Bin size control + color palette
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_bin2d(bins = 20) +
  scale_fill_viridis_c() +
  theme_bw() +
  labs(
    title = "2D Histogram with Custom Bin Number",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

#Hexbin chart with geom_hex()
#Very similar to the 2d histogram above, but the plot area is split into hexagons instead of squares.

library(hexbin) #geom_hex() needs the hexbin package to be installed

ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_hex() +
  theme_bw() +
  labs(
    title = "Hexbin Plot of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

# Bin size control + color palette
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_hex(bins = 20) +
  scale_fill_viridis_c() +
  theme_bw() +
  labs(
    title = "Hexbin Plot with Custom Bin Number",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Count"
  )

# 2d distribution with stat_density_2d

# Show the contour only
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_density_2d() +
  theme_bw() +
  labs(
    title = "2D Density Contours of Penguin Bill Measurements",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)"
  )

# Show the area only
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  theme_bw() +
  labs(
    title = "Filled 2D Density Plot",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density level"
  )

# Area + contour
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(
    aes(fill = after_stat(level)),
    geom = "polygon",
    colour = "white"
  ) +
  theme_bw() +
  labs(
    title = "Filled 2D Density Plot with Contours",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density level"
  )

# Using raster
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  stat_density_2d(
    aes(fill = after_stat(density)),
    geom = "raster",
    contour = FALSE
  ) +
  scale_fill_viridis_c() +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme_bw() +
  labs(
    title = "Raster 2D Density Plot",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    fill = "Density"
  )

# Scatterplot with density contours drawn on top
ggplot(penguins_clean, aes(x = bill_len, y = bill_dep)) +
  geom_point(aes(color = species), alpha = 0.5) +
  geom_density_2d(color = "black") +
  theme_bw() +
  labs(
    title = "Penguin Bill Measurements with Density Contours",
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    color = "Species"
  )

These plots are useful when a scatterplot gets crowded and overlapping points hide where most of the data sit.

  1. geom_bin2d() shows counts in square bins
  2. geom_hex() shows counts in hexagons, which often look smoother
  3. geom_density_2d() shows where points are most concentrated
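
For the third item, the quantity being drawn is a 2D kernel density estimate. ggplot2's documentation says `stat_density_2d()` relies on `MASS::kde2d()`, so the sketch below calls that function directly on simulated data to show what the estimate looks like as numbers. The two clusters are made up for illustration.

```r
library(MASS)  # ships with standard R installations

set.seed(5)

# Two simulated clusters of points (not the penguins data)
bill_len <- c(rnorm(80, 39, 2.5), rnorm(80, 48, 3))
bill_dep <- c(rnorm(80, 18.5, 1), rnorm(80, 15, 1))

# Estimate the density on a 50 x 50 grid
dens <- kde2d(bill_len, bill_dep, n = 50)

# Grid position where the estimated density is highest
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)
peak_len <- dens$x[peak[1, "row"]]
peak_dep <- dens$y[peak[1, "col"]]

print(dim(dens$z))
print(c(peak_len = peak_len, peak_dep = peak_dep))
```

Contour lines and filled polygons are two ways of drawing the `dens$z` matrix.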

Homework

In this assignment you will practice visualizing data distributions using several common ggplot2 tools:

  1. Histograms
  2. Density plots
  3. Frequency polygons
  4. 2D density plots

Understanding how data are distributed is an essential first step in exploratory data analysis.

Task 1

Create a histogram showing the distribution of highway fuel efficiency (hwy) from the mpg dataset.

Requirements:

  1. Use geom_histogram()
  2. Set the bin width to 2
  3. Use a minimal theme
  4. Include appropriate axis labels and a title

ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2) +
  labs(
    title = "Distribution of Highway Fuel Efficiency in the mpg Dataset",
    x = "Highway MPG",
    y = "Number of Cars"
  ) +
  theme_minimal()

1. What range of highway fuel efficiency values is most common?

Values around 26 mpg appear to be the most common.

2. Does the distribution appear symmetric, skewed, or multimodal?

The distribution appears multimodal, with more than one distinct peak.

Task 2

Create a histogram showing the distribution of engine displacement (displ).

Requirements:

  1. Use geom_histogram()
  2. Choose a reasonable bin width
  3. Fill the bars with a color of your choice

ggplot(mpg, aes(displ)) +
  geom_histogram(color = "darkblue", fill = "skyblue", binwidth = 0.3) +
  labs(
    title = "Distribution of Engine Displacement",
    x = "Engine Displacement (litres)",
    y = "Number of Cars"
  ) +
  theme_minimal()

Question

1. What does the histogram suggest about the types of engines represented in the dataset?

The histogram suggests that the dataset covers a wide variety of engine sizes, with the highest count at a displacement of about 1.9 litres.

Task 3

Density plots provide a smoothed estimate of the distribution of a variable.

Create a density plot of engine displacement (displ).

Requirements:

  1. Use geom_density()
  2. Fill the density curve with a color
  3. Set transparency (alpha) so the shape is visible

ggplot(mpg, aes(displ)) +
  geom_density(fill = "pink", alpha = 0.5) +
  labs(
    title = "Distribution of Engine Displacement",
    x = "Engine Displacement (litres)",
    y = "Density" # a density plot shows density, not a count of cars
  ) +
  theme_minimal()

Question

1. How does the density plot compare to the histogram you created earlier?

The density plot shows the same overall shape as the histogram. Both graphs show that most cars have an engine size between 2 and 3 litres and that the number of cars decreases as engines get larger.

Task 4

Create density plots comparing engine displacement across vehicle classes.

Requirements:

  1. Map fill = class
  2. Use transparency so the curves overlap
  3. Include a legend

ggplot(mpg, aes(displ, fill = class)) +
  geom_density(alpha = 0.4) +
  labs(
    title = "Distribution of Engine Displacement by Vehicle Class",
    x = "Engine Displacement (litres)",
    y = "Density", # a density plot shows density, not a count of cars
    fill = "Vehicle Class"
  ) +
  theme_minimal()

Question

1. Which vehicle class tends to have the largest engines?

The 2seater vehicle class tends to have the largest engines.

Task 5

Create a frequency polygon comparing highway fuel efficiency across vehicle classes.

Requirements:

  1. Map color = class
  2. Use the same bin width as before

ggplot(mpg, aes(hwy, color = class)) +
  geom_freqpoly(binwidth = 2) + # same bin width as the hwy histogram in Task 1
  labs(
    title = "Distribution of Highway Fuel Efficiency by Vehicle Class",
    x = "Highway MPG",
    y = "Number of Cars",
    color = "Vehicle Class"
  )

Question

1. Why might frequency polygons be easier to interpret than overlapping histograms when comparing groups?

A frequency polygon may be easier to interpret than an overlapping histogram because the lines do not hide one another the way overlapping bars do, so it is easier to see which class has the highest count at each peak.

Task 6

Create a filled 2D density plot.

Requirements:

  1. Use stat_density_2d()
  2. Set geom = "polygon"
  3. Map fill to density level

ggplot(mpg, aes(displ, hwy)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") + # filled density contours, as the task requires
  theme_bw() +
  labs(
    title = "Filled 2D Density of Engine Displacement and Highway MPG",
    x = "Engine Displacement (litres)",
    y = "Highway MPG",
    fill = "Density level"
  )

Question

1. What additional information does the filled density plot provide compared to the contour plot?

A filled density plot uses color to show how high the density is in each region, so areas of high and low density can be read directly. A contour plot only outlines the regions around the highest density.

Task 7

Create a scatterplot of carat vs price using the diamonds dataset.

Requirements:

  1. Use geom_point()
  2. Set alpha = 0.2 so overlapping points are visible
  3. Use theme_minimal()
  4. Use the ggExtra package to add marginal density plots to the scatterplot.

# save the scatterplot so ggMarginal() can add the marginal plots to it
carat_plot <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.2) +
  labs(
    title = "Scatterplot of Carat vs Price Using the Diamonds Dataset",
    x = "Carat",
    y = "Price"
  ) +
  theme_minimal()

ggMarginal(carat_plot, type = "density")

Question

1. Why might marginal plots be useful when analyzing relationships between variables?

Marginal plots are useful when analyzing relationships between variables because they make it easy to see quickly where most of the values of each variable are concentrated. They also make it easier to tell whether each variable is skewed.