Program 5

Author

Manoj

Implement an R program to create a histogram illustrating the distribution of a continuous variable, with overlays of density curves for each group, using ggplot2.


Overview of Steps

In this program, we will follow these steps:

  1. Load the required library
  2. Explore the dataset
  3. Identify the continuous and grouping variables
  4. Initialize the plot with aesthetic mappings
  5. Add the histogram layer
  6. Add group-wise density curves
  7. Add labels and theme
  8. Display and interpret the final plot

Step 1: Load Required Library

We first load the ggplot2 package, which is used for data visualization in R.

# Load ggplot2 package for visualization
library(ggplot2)

Step 2: Explore the Inbuilt Dataset

We use the built-in iris dataset. In this dataset:

  • Petal.Length is the continuous variable
  • Species is the categorical grouping variable

Before creating a graph, we inspect the structure and the first few rows of the dataset.

# Use the built-in 'iris' dataset
# 'Petal.Length' is a continuous variable
# 'Species' is a categorical grouping variable

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Explanation

  • str(iris) helps us understand the structure and data types of the variables
  • head(iris) displays the first few rows of the dataset
  • This step is important because it helps us confirm that Petal.Length is numerical and Species is categorical
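
The same checks can also be made programmatically instead of by reading the printed output. The following is a minimal sketch using only base R; the variable names (petal_is_numeric, species_is_factor, species_counts) are introduced here for illustration and are not part of the original program.

```r
# Confirm the variable types that str(iris) reported
petal_is_numeric  <- is.numeric(iris$Petal.Length)  # TRUE for a numeric (continuous) column
species_is_factor <- is.factor(iris$Species)        # TRUE for a factor (categorical) column

# Count the observations in each group
species_counts <- table(iris$Species)

print(petal_is_numeric)
print(species_is_factor)
print(species_counts)
```

Checking the group sizes is useful here because groups of very different sizes would make raw-count histograms hard to compare.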

Step 3: Create Histogram with Group-wise Density Curves

Now we begin building the visualization step by step.


Step 3.1: Initialize the ggplot with Aesthetic Mappings

We start by creating a basic ggplot object and mapping the variables.

# Start ggplot with iris dataset
# Map Petal.Length to x-axis and fill by Species

p <- ggplot(data = iris, aes(x = Petal.Length, fill = Species))
p

Explanation

This step initializes the plot and tells ggplot2 how to use the variables:

  • Petal.Length is mapped to the x-axis because it is the continuous variable we want to study
  • Species is mapped to the fill aesthetic so that different groups can be visually distinguished

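This does not yet create the complete graph. Printing p at this point shows only an empty panel with the x-axis, because no geom layer has been added; the call simply sets up the plotting framework.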
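
To see that the object holds mappings but no layers yet, it can be inspected directly. This is a sketch that assumes the plot object exposes its layers and mapping fields through `$`, as ggplot2 3.x does; p_base, n_layers, and mapped_aes are names introduced here.

```r
library(ggplot2)

# Rebuild the base plot object with the same call as in this step
p_base <- ggplot(data = iris, aes(x = Petal.Length, fill = Species))

# No geom has been added, so the list of layers is empty
n_layers <- length(p_base$layers)
print(n_layers)

# The aesthetics stored in the plot-level mapping
mapped_aes <- names(p_base$mapping)
print(mapped_aes)
```

Every layer added later with + inherits these plot-level mappings unless the layer overrides them.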


Step 3.2: Add Histogram Layer

Next, we add a histogram layer to show the distribution of the continuous variable.

# Add histogram with density scaling

p <- p + geom_histogram(aes(y = after_stat(density)),
                        alpha = 0.4,
                        position = "identity",
                        bins = 30)
p

Explanation

  • aes(y = after_stat(density)) scales the histogram so that the y-axis represents density instead of raw counts; after_stat() replaces the older ..density.. notation, which was deprecated in ggplot2 3.4.0
  • alpha = 0.4 makes the bars semi-transparent so overlapping groups can still be seen
  • position = "identity" draws every group's bars from the baseline so the histograms overlap, rather than stacking the bars on top of one another (the default)
  • bins = 30 sets the number of intervals (bins) used in the histogram

This layer helps us understand how the values of Petal.Length are distributed for different species.
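
Density scaling means that, within each group, the bar heights multiplied by the bar widths sum to 1. This puts the histogram on the same scale as a density curve. The sketch below checks that property with base R's hist(). It uses hist()'s default breaks, not the 30 bins used in the plot, so the bar heights differ from the figure while the total area does not. The name bar_area is introduced here.

```r
# For each species, bin Petal.Length and add up the bar areas
bar_area <- sapply(split(iris$Petal.Length, iris$Species), function(x) {
  h <- hist(x, plot = FALSE)        # compute the bins without drawing anything
  sum(h$density * diff(h$breaks))   # bar height times bar width, summed over bars
})

print(bar_area)
```

Because each group is scaled separately, the three histograms are comparable in shape regardless of group size.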


Step 3.3: Add Density Curve Layer

To make the distribution smoother and easier to interpret, we overlay density curves for each group.

# Overlay density curves for each group

p <- p +
  geom_density(aes(color = Species),
               linewidth = 1.2)
p

Explanation

This step adds smooth density curves on top of the histogram.

  • aes(color = Species) assigns a different line color to each species
  • linewidth = 1.2 controls the thickness of the density curves (linewidth replaces size for lines as of ggplot2 3.4.0)

The density curves help students compare the shape, spread, and concentration of the distributions across groups more clearly than the histogram alone.
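
Because fill = Species is set in ggplot(), the density layer inherits it and also shades the area under each curve. If only the outlines are wanted, one option is to override fill in that layer. The following is a sketch of that variant, not the original program's plot; it assumes ggplot2 3.4.0 or later (for after_stat() and linewidth), and p_outline is a name introduced here.

```r
library(ggplot2)

p_outline <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4, position = "identity", bins = 30) +
  # A constant fill = NA overrides the inherited fill mapping for this layer,
  # so only the coloured outline of each density curve is drawn
  geom_density(aes(color = Species), fill = NA, linewidth = 1.2)

p_outline
```

An alternative is to keep the shading and give geom_density() its own alpha value so the histogram bars stay visible underneath.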


Step 3.4: Add Labels and Theme

Now we improve the appearance of the graph by adding a title, axis labels, and a clean theme.

# Add title and axis labels, and apply clean theme

p <- p + labs(
title = "Distribution of Petal Length with Group-wise Density Curves",
x = "Petal Length",
y = "Density") +
theme_minimal()

p

Explanation

  • labs() is used to add a meaningful title and axis labels
  • The title explains what the graph is showing
  • The x-axis label identifies the continuous variable
  • The y-axis label indicates that the graph is scaled by density
  • theme_minimal() gives the plot a simple and clean appearance

This step improves readability and presentation quality.


Step 3.5: Display the Plot

Finally, we render the complete plot.

# Finally, render the plot
p


Interpretation

After generating the plot, we study the distribution of Petal.Length for each species.

The histogram shows how the values are spread across intervals, while the density curves provide a smooth summary of the distribution for each group.

This makes it easier to compare:

  • where most values are concentrated
  • how wide or narrow each distribution is
  • whether the groups overlap or are clearly separated
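
These visual comparisons can be backed up with numbers. The sketch below computes per-species summaries of Petal.Length with base R and checks whether the ranges of neighbouring groups overlap; petal_summary, setosa_separated, and upper_two_overlap are names introduced here.

```r
# Minimum, mean, maximum and standard deviation of Petal.Length per species
petal_summary <- do.call(rbind, lapply(
  split(iris$Petal.Length, iris$Species),
  function(x) c(min = min(x), mean = mean(x), max = max(x), sd = sd(x))
))
print(round(petal_summary, 2))

# Is setosa's range entirely below versicolor's range?
setosa_separated <- petal_summary["setosa", "max"] < petal_summary["versicolor", "min"]

# Do the versicolor and virginica ranges overlap?
upper_two_overlap <- petal_summary["versicolor", "max"] > petal_summary["virginica", "min"]

print(setosa_separated)
print(upper_two_overlap)
```

In iris, the largest setosa petal length lies below the smallest versicolor value, while the versicolor and virginica ranges overlap, which matches what the plot shows.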

Summary

In this program, we:

  • Used the built-in iris dataset
  • Selected Petal.Length as the continuous variable
  • Used Species as the grouping variable
  • Created a histogram to visualize the distribution
  • Added density curves for each group
  • Improved the graph using labels and a minimal theme

Discussion Points

  1. Why is Petal.Length considered a continuous variable?
  2. Why is Species used as a grouping variable?
  3. What is the purpose of using aes(y = after_stat(density)) (written as aes(y = ..density..) in older ggplot2 code) in the histogram?
  4. Why is transparency useful when histograms overlap?
  5. How do density curves improve the interpretation of the graph?
  6. What can be learned by comparing the distributions of the three species?

Follow-up Exercises

  1. Change the number of bins and observe how the histogram changes
  2. Create the same plot using Sepal.Length instead of Petal.Length
  3. Use a different theme such as theme_bw()
  4. Try changing the transparency level using alpha
  5. Remove the density curves and compare the graph with the original
  6. Create separate histograms for each species using faceting

Conclusion

In this exercise, we learned how to:

  • explore a dataset before visualization
  • identify continuous and categorical variables
  • create a histogram for a continuous variable
  • overlay group-wise density curves
  • interpret grouped distributions effectively using ggplot2