Program 5

Author

Manoj

Implement an R program to create a histogram illustrating the distribution of a continuous variable, with overlays of density curves for each group, using ggplot2.

Overview of Steps

In this program, we will follow these steps:

Load the required library
Explore the dataset
Identify the continuous and grouping variables
Initialize the plot with aesthetic mappings
Add the histogram layer
Add group-wise density curves
Add labels and theme
Display and interpret the final plot

Step 1: Load Required Library

We first load the ggplot2 package, which is used for data visualization in R.

# Load ggplot2 package for visualization
library(ggplot2)
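If ggplot2 is not installed, library() stops with an error. The following is a minimal sketch of a guard that checks for the package first. It assumes you are free to install packages from CRAN in your environment; the variable name has_ggplot2 is only illustrative.

```r
# Check for ggplot2 without attaching it; requireNamespace() returns TRUE or FALSE
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (!has_ggplot2) {
  # Assumption: a CRAN mirror is configured and reachable
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
} else {
  library(ggplot2)
  # after_stat() needs ggplot2 3.3.0 or later, linewidth needs 3.4.0 or later
  print(packageVersion("ggplot2"))
}
```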

Step 2: Explore the Inbuilt Dataset

We use the built-in iris dataset. In this dataset:

Petal.Length is the continuous variable
Species is the categorical grouping variable

Before creating a graph, we inspect the structure and the first few rows of the dataset.

# Use the built-in 'iris' dataset
# 'Petal.Length' is a continuous variable
# 'Species' is a categorical grouping variable

str(iris)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Explanation

str(iris) helps us understand the structure and data types of the variables
head(iris) displays the first few rows of the dataset
This step is important because it confirms that Petal.Length is numeric (num) and Species is a factor with three levels, which is what the histogram and the grouping require
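The same checks can also be done programmatically instead of reading the printed output. The sketch below uses only base R; the variable names species_counts, petal_range, and n_missing are illustrative.

```r
# Confirm the variable types programmatically
stopifnot(is.numeric(iris$Petal.Length))  # continuous variable
stopifnot(is.factor(iris$Species))        # grouping variable

# Number of observations in each group
species_counts <- table(iris$Species)
print(species_counts)

# Range of the continuous variable, useful later when choosing bins
petal_range <- range(iris$Petal.Length)
print(petal_range)

# Count missing values in the continuous variable
n_missing <- sum(is.na(iris$Petal.Length))
print(n_missing)
```

Each species has the same number of rows, so differences between the group histograms reflect the measurements rather than unequal group sizes.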

Step 3: Create Histogram with Group-wise Density Curves

Now we begin building the visualization step by step.

Step 3.1: Initialize the ggplot with Aesthetic Mappings

We start by creating a basic ggplot object and mapping the variables.

# Start ggplot with iris dataset
# Map Petal.Length to x-axis and fill by Species

p <- ggplot(data = iris, aes(x = Petal.Length, fill = Species))
p

Explanation

This step initializes the plot and tells ggplot2 how to use the variables:

Petal.Length is mapped to the x-axis because it is the continuous variable we want to study
Species is mapped to the fill aesthetic so that different groups can be visually distinguished

This does not yet create the complete graph, but it sets up the plotting framework.
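To see that the object holds only the data and mappings at this point, you can inspect it directly. This is a small sketch; it assumes the plot object exposes its layers and mapping fields through $, which is the case in the ggplot2 versions I am aware of. The name p0 is illustrative and is used so the p object from the main steps is not overwritten.

```r
library(ggplot2)

p0 <- ggplot(data = iris, aes(x = Petal.Length, fill = Species))

# No geoms have been added yet, so the list of layers is empty
n_layers <- length(p0$layers)
print(n_layers)

# The mappings are stored on the plot and inherited by layers added later
print(p0$mapping)
```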

Step 3.2: Add Histogram Layer

Next, we add a histogram layer to show the distribution of the continuous variable.

# Add histogram scaled to density rather than counts

p <- p + geom_histogram(aes(y = after_stat(density)),
                        alpha = 0.4,
                        position = "identity",
                        bins = 30)
p

Note: Older code writes this mapping as aes(y = ..density..). The dot-dot notation was deprecated in ggplot2 3.4.0 in favor of after_stat(density), and using it now produces a warning.

Explanation

aes(y = after_stat(density)) scales the histogram so that the y-axis represents density instead of raw counts. The bars of each group then have a total area of 1, which puts the histogram on the same scale as a density curve
alpha = 0.4 makes the bars semi-transparent so overlapping groups can still be seen
position = "identity" draws every group's bars from the baseline so that the groups overlap, rather than being stacked on top of one another (the default)
bins = 30 sets the number of intervals (bins) used in the histogram

This layer helps us understand how the values of Petal.Length are distributed for different species.
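The density scaling can be checked by hand. The sketch below uses base R's hist() rather than ggplot2, so the bin edges differ from those in the plot, but the arithmetic is the same: density is the count divided by the number of observations times the bin width. The choice of the setosa group and of breaks = 10 is only for illustration.

```r
# Petal lengths for a single group
x <- iris$Petal.Length[iris$Species == "setosa"]

# Bin the values without drawing a plot
h <- hist(x, breaks = 10, plot = FALSE)
bin_width <- diff(h$breaks)

# Count scale: bar heights add up to the number of observations
n_obs <- sum(h$counts)

# Density scale: count / (n * bin width)
manual_density <- h$counts / (n_obs * bin_width)

# On the density scale, the bar areas (height * width) add up to 1
total_area <- sum(h$density * bin_width)
print(total_area)
```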

Step 3.3: Add Density Curve Layer

To make the distribution smoother and easier to interpret, we overlay density curves for each group.

# Overlay density curves for each group

p <- p +
  geom_density(aes(color = Species),
               linewidth = 1.2)

Note: Older code sets the line thickness with size = 1.2. Using size for lines was deprecated in ggplot2 3.4.0 in favor of linewidth, and using it now produces a warning.

Explanation

This step adds smooth density curves on top of the histogram.

aes(color = Species) assigns a different line color to each species
linewidth = 1.2 controls the thickness of the density curves

The density curves help students compare the shape, spread, and concentration of the distributions across groups more clearly than the histogram alone.
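One thing to watch for: because fill = Species is set in ggplot(), the density layer inherits it and fills the area under each curve, which can hide the histogram bars behind it. If only the outlines are wanted, one option is to set fill = NA as a fixed value in geom_density(). The sketch below is a variation on the tutorial's plot, not a replacement for it; it assumes ggplot2 3.4.0 or later, and the name p_outline is illustrative.

```r
library(ggplot2)

p_outline <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4, position = "identity", bins = 30) +
  # fill = NA is a fixed setting, so it overrides the inherited fill mapping
  geom_density(aes(color = Species), fill = NA, linewidth = 1.2)

p_outline
```

Another option is to keep the fill and lower its opacity, for example with alpha = 0.2 in geom_density().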

Step 3.4: Add Labels and Theme

Now we improve the appearance of the graph by adding a title, axis labels, and a clean theme.

# Add title and axis labels, and apply clean theme

p <- p + labs(
  title = "Distribution of Petal Length with Group-wise Density Curves",
  x = "Petal Length",
  y = "Density") +
  theme_minimal()

p

Explanation

labs() is used to add a meaningful title and axis labels
The title explains what the graph is showing
The x-axis label identifies the continuous variable
The y-axis label indicates that the graph is scaled by density
theme_minimal() gives the plot a simple and clean appearance

This step improves readability and presentation quality.

Step 3.5: Display the Plot

Finally, we render the complete plot.

# Finally, render the plot
p
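For reference, the steps can also be written as one chained expression. This sketch uses after_stat(density) and linewidth, which are the current spellings as of ggplot2 3.4.0; the name p_full is illustrative.

```r
library(ggplot2)

p_full <- ggplot(data = iris, aes(x = Petal.Length, fill = Species)) +
  # Histogram on the density scale, groups overlapping and semi-transparent
  geom_histogram(aes(y = after_stat(density)),
                 alpha = 0.4, position = "identity", bins = 30) +
  # One density curve per species
  geom_density(aes(color = Species), linewidth = 1.2) +
  labs(title = "Distribution of Petal Length with Group-wise Density Curves",
       x = "Petal Length",
       y = "Density") +
  theme_minimal()

p_full
```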

Interpretation

After generating the plot, we study the distribution of Petal.Length for each species.

The histogram shows how the values are spread across intervals, while the density curves provide a smooth summary of the distribution for each group.

This makes it easier to compare:

where most values are concentrated
how wide or narrow each distribution is
whether the groups overlap or are clearly separated
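The visual comparison can be backed up with a few numbers. The sketch below uses base R to summarise each species and to check whether the observed ranges overlap; the variable names are illustrative.

```r
# Mean, spread, and range of Petal.Length for each species
stats <- aggregate(Petal.Length ~ Species, data = iris,
                   FUN = function(x) c(mean = mean(x), sd = sd(x),
                                       min = min(x), max = max(x)))
print(stats)

# Observed range per species, as a named list
groups <- split(iris$Petal.Length, iris$Species)
rng <- lapply(groups, range)

# setosa is separated from versicolor if its maximum is below versicolor's minimum
setosa_separate <- rng$setosa[2] < rng$versicolor[1]

# versicolor and virginica overlap if versicolor's maximum exceeds virginica's minimum
upper_overlap <- rng$versicolor[2] > rng$virginica[1]

print(c(setosa_separate = setosa_separate, upper_overlap = upper_overlap))
```

Both checks come out TRUE: setosa is clearly separated from the other two species, while versicolor and virginica share part of their range.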

Summary

In this program, we:

Used the built-in iris dataset
Selected Petal.Length as the continuous variable
Used Species as the grouping variable
Created a histogram to visualize the distribution
Added density curves for each group
Improved the graph using labels and a minimal theme

Discussion Points

Why is Petal.Length considered a continuous variable?
Why is Species used as a grouping variable?
What is the purpose of mapping y to density in the histogram, written as aes(y = after_stat(density)) or, in older ggplot2 versions, aes(y = ..density..)?
Why is transparency useful when histograms overlap?
How do density curves improve the interpretation of the graph?
What can be learned by comparing the distributions of the three species?

Follow-up Questions

Change the number of bins and observe how the histogram changes
Create the same plot using Sepal.Length instead of Petal.Length
Use a different theme such as theme_bw()
Try changing the transparency level using alpha
Remove the density curves and compare the graph with the original
Create separate histograms for each species using faceting
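As a possible starting point for some of these tasks, the sketch below combines a different bin count, theme_bw(), and faceting in one plot. The specific values (bins = 15, one column of panels) are arbitrary choices meant to be changed, and the name p_facet is illustrative.

```r
library(ggplot2)

p_facet <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  # Fewer bins than the original plot; try other values and compare
  geom_histogram(aes(y = after_stat(density)), alpha = 0.4, bins = 15) +
  # One panel per species, stacked vertically so the x-axes line up
  facet_wrap(~ Species, ncol = 1) +
  labs(x = "Petal Length", y = "Density") +
  theme_bw()

p_facet
```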

Conclusion

In this exercise, we learned how to:

explore a dataset before visualization
identify continuous and categorical variables
create a histogram for a continuous variable
overlay group-wise density curves
interpret grouped distributions effectively using ggplot2