LA-Report

Author

K Atharsh, Kaushik BV

📘 LA-1: Age Distribution by Region

📌 Objective

The goal of this analysis is to visualize the distribution of age across various regions using density ridgeline plots. These plots provide an intuitive way to compare distributions, highlighting differences in central tendency, spread, and skewness across regions.

🔧 Step 1: Load Necessary Libraries

# Load necessary libraries
library(ggplot2)
library(ggridges)

The ggplot2 package is used for creating static graphics, while ggridges extends ggplot2 to create ridgeline plots, which are particularly useful for visualizing the distribution of a continuous variable across different categories.

🧪 Step 2: Generate Synthetic Data

set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

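In this step, synthetic data is generated with rnorm(n, mean, sd), which draws n random values from a normal distribution with the given mean and standard deviation. Each of the seven regions receives 100 simulated ages, for 700 rows in total, with means rising from 30 to 90 and standard deviations from 5 to 20. Calling set.seed(123) first makes the draws reproducible. The age variable holds each individual's age, and the region variable holds the label of the region that individual belongs to.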
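As a quick sanity check on the simulated data, the per-region sample size, mean, and standard deviation can be tabulated with base R. This is a minimal sketch: the helper names groups and region_summary are illustrative and not part of the original report.

```r
# Rebuild the synthetic data exactly as in Step 2 so this check is self-contained.
set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

# Split ages by region, then compute one summary row per region.
# 'groups' and 'region_summary' are illustrative names.
groups <- split(data$age, data$region)
region_summary <- data.frame(
  region = names(groups),
  n      = vapply(groups, length, integer(1)),
  mean   = vapply(groups, mean, numeric(1)),
  sd     = vapply(groups, sd, numeric(1)),
  row.names = NULL
)

# Sort from the youngest to the oldest region by sample mean.
region_summary <- region_summary[order(region_summary$mean), ]
print(region_summary, digits = 3)
```

The sample means and standard deviations should land close to the values passed to rnorm(), which confirms that each region's label is attached to the intended block of 100 draws.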

📊 Step 3: Basic Ridgeline Plot

ggplot(data, aes(x = age, y = region, fill = region)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Age Distribution by Region", x = "Age", y = "Region")
Picking joint bandwidth of 4.05

This code creates a basic ridgeline plot where:

  • x = age: The age variable is mapped to the x-axis.

  • y = region: The region variable is mapped to the y-axis.

  • fill = region: Each region is assigned a different fill color.

  • geom_density_ridges(alpha = 0.7): Adds density ridgelines with a transparency level of 0.7.

  • The plot provides a visual representation of how age distributions vary across different regions.
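Because region is a character column, ggplot2 orders the y-axis alphabetically by default. One possible refinement, sketched here as a suggestion rather than as part of the original analysis, is to reorder the regions by median age with base R's reorder() so the ridgelines stack from youngest at the bottom to oldest at the top.

```r
library(ggplot2)
library(ggridges)

# Rebuild the synthetic data from Step 2 so the example is self-contained.
set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

# Turn region into a factor whose levels are sorted by median age.
data$region <- reorder(data$region, data$age, FUN = median)

# Same plot as Step 3; only the order of the ridgelines changes.
p <- ggplot(data, aes(x = age, y = region, fill = region)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Age Distribution by Region", x = "Age", y = "Region")
print(p)
```

Ordering by a summary statistic makes the upward drift in age easier to see than the alphabetical default.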

🎨 Step 4: Enhanced Visualization with Minimal Theme

ggplot(data, aes(x = age, y = region, fill = region)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Age Distribution by Region", x = "Age", y = "Region") +
  theme_minimal()
Picking joint bandwidth of 4.05

Applying theme_minimal() removes the grey panel background, the panel border, and the axis ticks, while keeping light grid lines and the axis labels. With less visual clutter, attention stays on the shapes of the distributions.
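As an alternative to theme_minimal(), ggridges ships its own theme_ridges(), which is designed for ridgeline plots. The sketch below is a suggestion, not part of the original report: it also hides the legend, on the assumption that the fill colors only repeat what the y-axis labels already say.

```r
library(ggplot2)
library(ggridges)

# Rebuild the synthetic data from Step 2 so the example is self-contained.
set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

p <- ggplot(data, aes(x = age, y = region, fill = region)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "Age Distribution by Region", x = "Age", y = "Region") +
  theme_ridges() +                  # theme bundled with ggridges
  theme(legend.position = "none")   # legend duplicates the y-axis labels
print(p)
```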

📈 Interpretation of Results

The ridgeline plot reveals several key insights:

  • Central Tendency: The peak of each ridgeline marks the mode of the age distribution for that region.

  • Spread: The width of the ridgeline shows the variability in age within each region.

  • Skewness: The asymmetry of the ridgeline can indicate skewness in the age distribution.

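  • For instance, in this synthetic data “North” (generated with mean age 30) and “South” (mean 40) are the youngest groups, while “East-West” (mean 90) and “South-West” (mean 80) are the oldest. The ridgelines of the older groups are also visibly wider, because their standard deviations are larger.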
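The three properties listed above can also be checked numerically. The following base R sketch estimates the mode as the peak of a kernel density estimate, the spread as the standard deviation, and the skewness as the third standardized moment. The helpers density_mode() and skewness() are illustrative and are defined here, not taken from any package. Note that density() picks a separate default bandwidth for each region, so the peaks may differ slightly from the plotted ridgelines, which share one joint bandwidth.

```r
# Rebuild the synthetic data from Step 2 so the example is self-contained.
set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

# Mode estimate: x position of the highest point of a kernel density estimate.
density_mode <- function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
}

# Sample skewness: third central moment divided by variance^(3/2).
# Values near 0 indicate a roughly symmetric distribution.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

groups <- split(data$age, data$region)
shape_summary <- data.frame(
  region   = names(groups),
  mode     = vapply(groups, density_mode, numeric(1)),
  sd       = vapply(groups, sd, numeric(1)),
  skewness = vapply(groups, skewness, numeric(1)),
  row.names = NULL
)
print(shape_summary, digits = 3)
```

Since every region is drawn from a normal distribution, the skewness values should stay close to zero. Real demographic data would typically show clearer asymmetry.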

🛠️ Customization Options

The ggridges package offers various parameters to customize the appearance of ridgeline plots:

  • scale: Controls the height of the ridgelines relative to the spacing between their baselines. A value of 1 means the tallest ridgeline just touches the baseline of the one above it; larger values make the ridgelines overlap.

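  • rel_min_height: Sets a cutoff relative to the highest point among all ridgelines. Parts of a ridgeline below this cutoff are removed, which trims the long, flat tails; for example, 0.01 drops everything at or below 1% of the maximum height.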

  • alpha: Adjusts the transparency of the ridgelines.

  • fill: Specifies the fill color of the ridgelines, either as a single constant color or mapped to a variable inside aes().

  • For more advanced customization, you can refer to the Introduction to ggridges vignette.
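The parameters listed above can be combined in a single geom_density_ridges() call. The values below (scale = 1.5, rel_min_height = 0.01, alpha = 0.5, and the "steelblue" fill) are arbitrary illustrations chosen for this sketch, not recommendations from the original report.

```r
library(ggplot2)
library(ggridges)

# Rebuild the synthetic data from Step 2 so the example is self-contained.
set.seed(123)
data <- data.frame(
  age = c(
    rnorm(100, 30, 5), rnorm(100, 40, 7), rnorm(100, 50, 10),
    rnorm(100, 60, 12), rnorm(100, 70, 13), rnorm(100, 80, 16),
    rnorm(100, 90, 20)
  ),
  region = rep(
    c("North", "South", "East", "West", "North-South", "South-West", "East-West"),
    each = 100
  )
)

p <- ggplot(data, aes(x = age, y = region)) +
  geom_density_ridges(
    scale = 1.5,            # ridgelines overlap the row above by up to 50%
    rel_min_height = 0.01,  # trim tails below 1% of the overall maximum height
    alpha = 0.5,            # semi-transparent so overlapping ridges stay visible
    fill = "steelblue"      # one constant fill color instead of one per region
  ) +
  labs(title = "Age Distribution by Region", x = "Age", y = "Region") +
  theme_minimal()
print(p)
```

Setting fill as a constant outside aes() also removes the legend, which is reasonable here because the y-axis already identifies each region.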

📚 Conclusion

Density ridgeline plots are a powerful tool for visualizing the distribution of a continuous variable across different categories. By applying this technique, we can see how age distributions vary across regions, which supports demographic analysis and decision-making. For further exploration, consider experimenting with different datasets and customization options to tailor the plots to specific analytical needs.