1. Introduction and Data Simulation

Visualization is not just about making pretty pictures; it is a critical tool for Exploratory Data Analysis (EDA), assumption checking, and communicating complex research findings. In this module, we use ggplot2, which implements the grammar of graphics in R.

1.1 Simulated Health Dataset

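We will generate a simulated dataset representing a Clinical Trial with 500 patients, plus a small longitudinal table for the line plots. Every variable is drawn independently of the others, so the plots below illustrate technique rather than real associations (for example, Age and Cholesterol are uncorrelated by construction).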

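# Packages used in this module (install once with install.packages() if needed)
library(ggplot2)     # core plotting
library(dplyr)       # %>%, select(), count()
library(tidyr)       # pivot_longer()
library(ggridges)    # geom_density_ridges()
library(ggcorrplot)  # ggcorrplot()
# geom_hex() additionally requires the hexbin package to be installed

set.seed(555)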
n <- 500

# 1. Patient Data (Cross-sectional)
data_clinical <- data.frame(
  ID = 1:n,
  Age = round(rnorm(n, 55, 12)),
  BMI = round(rnorm(n, 28, 5), 1),
  Cholesterol = round(rnorm(n, 200, 40)),
  # Skewed variable (e.g., Biomarker X)
  Biomarker = rlnorm(n, meanlog = 1, sdlog = 0.8),
  # Count variable (Hospital Visits)
  Visits = rpois(n, lambda = 2),
  # Categorical Variables
  Gender = sample(c("Male", "Female"), n, replace = TRUE),
  Treatment = sample(c("Placebo", "Low Dose", "High Dose"), n, replace = TRUE),
  Outcome = sample(c("Recovered", "Stable", "Deteriorated"), n, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  Hospital = sample(c("General", "St. Mary's", "University"), n, replace = TRUE)
)

# 2. Time Series Data (Longitudinal) for Line plots
time_seq <- 1:12
data_longitudinal <- data.frame(
  Month = rep(time_seq, 3),
  Group = rep(c("Placebo", "Low Dose", "High Dose"), each = 12),
  Avg_Pain_Score = c(
    sort(runif(12, 5, 8), decreasing = TRUE), # Placebo (slow drop)
    sort(runif(12, 3, 8), decreasing = TRUE), # Low Dose
    sort(runif(12, 1, 8), decreasing = TRUE)  # High Dose (fast drop)
  )
)
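
Before plotting, it can help to confirm that the simulation behaves as intended. The sketch below rebuilds a reduced, hypothetical version of the patient table (the name sim and the choice of three columns are assumptions made for illustration) and checks two properties that follow from the generators: independent draws should be nearly uncorrelated, and a log-normal variable should be right-skewed.

```r
# Hypothetical reduced version of the simulation (same generators, fewer columns)
set.seed(555)
n <- 500
sim <- data.frame(
  Age = round(rnorm(n, 55, 12)),
  Cholesterol = round(rnorm(n, 200, 40)),
  Biomarker = rlnorm(n, meanlog = 1, sdlog = 0.8)
)

str(sim)      # column types and first values
summary(sim)  # ranges, quartiles, means

# Independent draws: the sample correlation should be close to zero
age_chol_cor <- cor(sim$Age, sim$Cholesterol)

# A log-normal variable is right-skewed: the mean exceeds the median
biomarker_right_skewed <- mean(sim$Biomarker) > median(sim$Biomarker)
```

The same calls (str(), summary(), cor()) can be run on data_clinical directly once it has been created.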

2. Visualizing Continuous and Count Data

This section covers techniques for variables measured on a continuous scale (e.g., BMI, Age) or discrete counts (e.g., Visits).

2.1 Distribution Analysis (Univariate)

1. Histogram

  • Definition: Bins continuous data into intervals and counts observations per bin.
  • Research Use: Assessing normality, skewness, and modality.
ggplot(data_clinical, aes(x = Cholesterol)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "white") +
  labs(title = "1. Histogram of Cholesterol", subtitle = "Checking for Normality")
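
The appearance of a histogram depends heavily on the bin width, and the value of 10 used above is a manual choice. One common data-driven alternative is the Freedman-Diaconis rule, binwidth = 2 × IQR / n^(1/3). The sketch below applies it to a stand-in for the Cholesterol column (the data frame chol is an assumption, simulated with the same generator as above).

```r
library(ggplot2)

# Assumption: a stand-in for data_clinical$Cholesterol, simulated the same way
set.seed(555)
chol <- data.frame(Cholesterol = round(rnorm(500, 200, 40)))

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(chol$Cholesterol) / nrow(chol)^(1/3)
fd_binwidth

ggplot(chol, aes(x = Cholesterol)) +
  geom_histogram(binwidth = fd_binwidth, fill = "#69b3a2", color = "white") +
  labs(title = "Histogram with Freedman-Diaconis bin width")
```

It is usually worth trying two or three bin widths, since a single choice can hide or exaggerate modes.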

2. Density Plot

  • Definition: A smoothed version of the histogram using Kernel Density Estimation (KDE).
  • Research Use: Visualizing an estimate of the probability density function (PDF); the apparent shape depends on the smoothing bandwidth.
ggplot(data_clinical, aes(x = Cholesterol)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "2. Density Plot")
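
To see what the smoothing involves, the base R function density() can be inspected directly. This is a minimal sketch on a simulated stand-in vector (chol is an assumption); it shows the default bandwidth, the effect of the adjust multiplier, and that the estimated curve integrates to roughly 1.

```r
# Assumption: a stand-in for data_clinical$Cholesterol
set.seed(555)
chol <- round(rnorm(500, 200, 40))

# Gaussian kernel with the default rule-of-thumb bandwidth (bw.nrd0)
kde <- density(chol)
kde$bw

# adjust multiplies the bandwidth: larger values give a smoother curve
kde_smooth <- density(chol, adjust = 2)
kde_smooth$bw

# Riemann-sum approximation of the area under the estimated density
area <- sum(kde$y) * diff(kde$x[1:2])
area
```

geom_density() accepts the same adjust argument, so geom_density(adjust = 2) would draw the smoother version.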

3. Frequency Polygon

  • Definition: Similar to a histogram but uses lines connecting bin counts.
  • Research Use: Comparing multiple distributions on the same plot without clutter.
ggplot(data_clinical, aes(x = Cholesterol, color = Gender)) +
  geom_freqpoly(binwidth = 10, linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "3. Frequency Polygon by Gender")

4. Area Plot (Binned Counts)

  • Definition: A binned count curve (as in a frequency polygon) where the area under the curve is filled.
  • Research Use: Emphasizing the volume of the distribution.
ggplot(data_clinical, aes(x = Age)) +
  geom_area(stat = "bin", binwidth = 5, fill = "lightcoral", alpha = 0.6) +
  labs(title = "4. Area Plot of Age")

5. Rug Plot

  • Definition: Marks individual data points on the axis.
  • Research Use: Supplementing histograms to show exact data location (identifying clustering).
ggplot(data_clinical, aes(x = BMI)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  labs(title = "5. Density with Rug Plot")

6. ECDF Plot

  • Definition: Empirical Cumulative Distribution Function.
  • Research Use: Estimating percentiles (e.g., “What % of patients have BMI < 30?”).
ggplot(data_clinical, aes(x = BMI)) +
  stat_ecdf(geom = "step", color = "blue") +
  labs(title = "6. ECDF of BMI", y = "Cumulative Probability")
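
The percentile question in the bullet above can also be answered numerically. The sketch below uses base R's ecdf() on a simulated stand-in for the BMI column (bmi is an assumption). Note that the ECDF returns the proportion at or below a value, which can differ from the strictly-below proportion when values are rounded.

```r
# Assumption: a stand-in for data_clinical$BMI
set.seed(555)
bmi <- round(rnorm(500, 28, 5), 1)

bmi_ecdf <- ecdf(bmi)

# Proportion of patients with BMI <= 30 (what the ECDF curve shows at x = 30)
p_le_30 <- bmi_ecdf(30)

# Proportion strictly below 30 (differs only if some values equal exactly 30)
p_lt_30 <- mean(bmi < 30)

# The inverse question: which BMI value marks the 90th percentile?
q90 <- quantile(bmi, 0.90)

c(p_le_30 = p_le_30, p_lt_30 = p_lt_30, q90 = unname(q90))
```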

7. Q-Q Plot

  • Definition: Plots sample quantiles against theoretical normal quantiles.
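  • Research Use: A standard visual check of the Normality assumption. In regression/ANOVA the assumption applies to the model residuals, so the plot is most informative when drawn on residuals rather than on a raw variable.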
ggplot(data_clinical, aes(sample = Cholesterol)) +
  stat_qq() + stat_qq_line(color = "red") +
  labs(title = "7. Q-Q Plot (Normality Check)")
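
Since the normality assumption in regression and ANOVA concerns the residuals, one option is to fit the model first and draw the Q-Q plot on its residuals. The sketch below assumes a simple linear model of Cholesterol on Age, fitted to a simulated stand-in data frame (d and the model formula are assumptions chosen for illustration).

```r
library(ggplot2)

# Assumption: a stand-in for two columns of data_clinical
set.seed(555)
d <- data.frame(
  Age = round(rnorm(500, 55, 12)),
  Cholesterol = round(rnorm(500, 200, 40))
)

# Fit the model, then extract its residuals
fit <- lm(Cholesterol ~ Age, data = d)
res <- data.frame(resid = resid(fit))

# Q-Q plot of the residuals against the theoretical normal quantiles
ggplot(res, aes(sample = resid)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Q-Q Plot of Model Residuals")
```

Points that follow the red line closely are consistent with normal residuals; systematic curvature at the ends suggests heavy tails or skewness.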


2.2 Comparisons and Relationships (Bivariate/Multivariate)

8. Boxplot

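  • Definition: Displays the median, the quartiles (Q1, Q3), and whiskers. In ggplot2 the whiskers extend to the most extreme observations within 1.5 × IQR of the box; points beyond them are drawn individually as potential outliers.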
  • Research Use: Detecting outliers and comparing distributions across groups.
ggplot(data_clinical, aes(x = Treatment, y = BMI, fill = Treatment)) +
  geom_boxplot() +
  labs(title = "8. Boxplot of BMI by Treatment")
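
The numbers behind a boxplot can be inspected with base R's boxplot.stats(), which applies the same 1.5 × IQR whisker rule. This is a sketch on a simulated stand-in vector (bmi is an assumption); ggplot2 computes its hinges with quantile(), so its values can differ slightly from the hinges returned here.

```r
# Assumption: a stand-in for data_clinical$BMI
set.seed(555)
bmi <- round(rnorm(500, 28, 5), 1)

s <- boxplot.stats(bmi)  # coef = 1.5 by default

# Lower whisker, lower hinge, median, upper hinge, upper whisker
s$stats

# Observations beyond the whiskers (drawn as individual points)
s$out

# The fences that define the whisker limits
box_height <- s$stats[4] - s$stats[2]
lower_fence <- s$stats[2] - 1.5 * box_height
upper_fence <- s$stats[4] + 1.5 * box_height
c(lower_fence = lower_fence, upper_fence = upper_fence)
```

The whisker ends are actual observations, so they usually sit slightly inside the fences rather than exactly on them.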

9. Violin Plot

  • Definition: A mirrored kernel density plot drawn for each group; it is often combined with a boxplot drawn inside it.
  • Research Use: Showing the shape of the distribution (bimodality) which boxplots hide.
ggplot(data_clinical, aes(x = Treatment, y = BMI, fill = Treatment)) +
  geom_violin(trim = FALSE) +
  labs(title = "9. Violin Plot")

10. Jitter Plot

  • Definition: A scatter plot for categorical x-axis where points are randomly shifted to prevent overlap.
  • Research Use: Viewing sample size and raw data spread.
ggplot(data_clinical, aes(x = Treatment, y = BMI)) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "10. Jitter Plot")

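11. Violin with Jitter (Sina-Style Plot)

  • Definition: Overlays jittered raw points on a violin plot. A true sina plot (e.g., geom_sina() in the ggforce package) additionally scales the jitter width to the local density.
  • Research Use: Showing the distribution shape, the sample size, and the individual observations in one view.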
ggplot(data_clinical, aes(x = Treatment, y = BMI)) +
  geom_violin(alpha = 0.3) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  labs(title = "11. Combined Violin and Jitter")
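
The overlay above uses uniform jitter. For comparison, the sketch below draws a sina plot in the stricter sense, where the horizontal spread of the points follows the density so that they fill the violin outline. It assumes the ggforce package is installed and uses a simulated stand-in data frame (d is an assumption).

```r
library(ggplot2)
library(ggforce)  # assumption: installed from CRAN; provides geom_sina()

# Assumption: a stand-in for two columns of data_clinical
set.seed(555)
d <- data.frame(
  Treatment = sample(c("Placebo", "Low Dose", "High Dose"), 500, replace = TRUE),
  BMI = round(rnorm(500, 28, 5), 1)
)

p_sina <- ggplot(d, aes(x = Treatment, y = BMI)) +
  geom_violin(alpha = 0.3) +
  geom_sina(alpha = 0.4) +   # jitter width follows the estimated density
  labs(title = "Sina Plot (ggforce)")
p_sina
```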

12. Ridgeline Plot

  • Definition: Partially overlapping density plots stacked vertically.
  • Research Use: Comparing changes in distribution across many groups or time points.
ggplot(data_clinical, aes(x = Cholesterol, y = Outcome, fill = Outcome)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "12. Ridgeline Plot")

13. Scatter Plot

  • Definition: Plots two continuous variables against each other.
  • Research Use: Assessing correlation (linear or non-linear).
ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_point(alpha = 0.6) +
  labs(title = "13. Scatter Plot: Age vs Cholesterol")

14. Scatter Plot with Trend Line (Smooth)

  • Definition: Scatter plot with a regression line (Linear or Loess).
  • Research Use: Visualizing the trend and confidence intervals.
ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "14. Scatter with Linear Regression")
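
The red line and grey band drawn by geom_smooth(method = "lm") correspond to an ordinary least-squares fit and, by default, a 95% confidence interval for the mean response. The sketch below reproduces those numbers with lm() on a simulated stand-in data frame (d and the ages in new_ages are assumptions).

```r
# Assumption: a stand-in for two columns of data_clinical
set.seed(555)
d <- data.frame(
  Age = round(rnorm(500, 55, 12)),
  Cholesterol = round(rnorm(500, 200, 40))
)

fit <- lm(Cholesterol ~ Age, data = d)

coef(fit)     # intercept and slope of the plotted line
confint(fit)  # 95% confidence intervals for both coefficients

# Fitted value and confidence band at two example ages
new_ages <- data.frame(Age = c(40, 60))
pred <- predict(fit, newdata = new_ages, interval = "confidence", level = 0.95)
pred
```

Because Age and Cholesterol are simulated independently here, the slope should be near zero and its interval will typically include zero.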

15. Bubble Chart

  • Definition: A scatter plot where point size represents a third variable.
  • Research Use: Visualizing 3 dimensions of continuous data.
ggplot(data_clinical, aes(x = Age, y = BMI, size = Biomarker)) +
  geom_point(alpha = 0.5, color = "purple") +
  labs(title = "15. Bubble Chart (Size = Biomarker)")

16. Hexbin Plot

  • Definition: Divides the plane into hexagons and counts points in each.
  • Research Use: Handling “Overplotting” in very large datasets (N > 10,000).
ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_hex(bins = 20) +
  scale_fill_viridis_c() +
  labs(title = "16. Hexbin Plot (Density of Points)")

17. 2D Density Contour

  • Definition: Topographic map of point density.
  • Research Use: Identifying clusters in bivariate data.
ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_density_2d() +
  labs(title = "17. 2D Contour Plot")

18. Correlogram (Heatmap)

  • Definition: A matrix visualizing correlation coefficients.
  • Research Use: Feature selection and identifying collinearity.
corr_matrix <- cor(data_clinical %>% select(Age, BMI, Cholesterol, Biomarker, Visits))
ggcorrplot(corr_matrix, lab = TRUE, type = "lower", title = "18. Correlation Matrix")

19. Line Chart (Time Series)

  • Definition: Connects data points in temporal order.
  • Research Use: Tracking longitudinal changes or trends.
ggplot(data_longitudinal, aes(x = Month, y = Avg_Pain_Score, color = Group)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point() +
  labs(title = "19. Longitudinal Line Chart")

20. Step Plot

  • Definition: Line chart that changes only at specific intervals (steps).
  • Research Use: Survival analysis (Kaplan-Meier curves) or inventory changes.
ggplot(data_longitudinal, aes(x = Month, y = Avg_Pain_Score, color = Group)) +
  geom_step() +
  labs(title = "20. Step Plot")


3. Visualizing Categorical Data

This section covers techniques for nominal (e.g., Gender) and ordinal (e.g., Outcome) data.
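
A character column carries no ordering, so ggplot2 arranges its categories alphabetically. For an ordinal variable such as Outcome, one way to get a meaningful axis order is to convert it to an ordered factor first. The sketch below uses a short hypothetical vector (outcome); the worst-to-best level order is an assumption about how the categories should be ranked.

```r
# Hypothetical example values
outcome <- c("Recovered", "Stable", "Deteriorated", "Stable")

# Default: alphabetical order (Deteriorated, Recovered, Stable)
sort(unique(outcome))

# Ordered factor: levels listed from worst to best (assumed ranking)
outcome_f <- factor(
  outcome,
  levels = c("Deteriorated", "Stable", "Recovered"),
  ordered = TRUE
)

levels(outcome_f)
table(outcome_f)             # counts follow the level order
outcome_f[1] > outcome_f[2]  # comparisons now work: Recovered > Stable
```

Applying the same factor() call to data_clinical$Outcome before plotting would make the bar charts below follow that order.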

3.1 Proportions and Counts

21. Simple Bar Chart

  • Definition: Height of bar represents count of a category.
  • Research Use: Comparing frequency of groups.
ggplot(data_clinical, aes(x = Outcome)) +
  geom_bar(fill = "steelblue") +
  labs(title = "21. Simple Bar Chart")

22. Horizontal Bar Chart

  • Definition: Flipped coordinates.
  • Research Use: Better when category labels are long.
ggplot(data_clinical, aes(x = Outcome)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  labs(title = "22. Horizontal Bar Chart")

23. Stacked Bar Chart

  • Definition: Segments bars by a second categorical variable.
  • Research Use: Showing total counts and sub-group composition.
ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "stack") +
  labs(title = "23. Stacked Bar Chart")

24. Grouped (Dodged) Bar Chart

  • Definition: Bars for subgroups are placed side-by-side.
  • Research Use: Direct comparison of subgroups within categories.
ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "dodge") +
  labs(title = "24. Grouped Bar Chart")

25. Percent (Filled) Bar Chart

  • Definition: Bars are stretched to 100%.
  • Research Use: Comparing proportions rather than raw counts (e.g., “Did the High Dose group have a higher % of recovery?”).
ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "25. 100% Stacked Bar Chart")
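
The proportions shown by a filled bar chart can be tabulated directly, which is useful for reporting exact figures next to the plot. The sketch below builds a contingency table from simulated stand-in vectors (trt and out are assumptions, generated like the Treatment and Outcome columns) and converts it to row proportions.

```r
# Assumption: stand-ins for data_clinical$Treatment and data_clinical$Outcome
set.seed(555)
trt <- sample(c("Placebo", "Low Dose", "High Dose"), 500, replace = TRUE)
out <- sample(c("Recovered", "Stable", "Deteriorated"), 500,
              replace = TRUE, prob = c(0.4, 0.4, 0.2))

# Cross-tabulate counts
tab <- table(Treatment = trt, Outcome = out)

# margin = 1 divides each row by its total, so each treatment sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 3)
```

Each row of row_prop corresponds to one bar in the filled chart, and each cell to one colored segment.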

26. Lollipop Chart

  • Definition: A dot connected to a baseline by a line.
  • Research Use: A modern, cleaner alternative to bar charts for many categories.
data_summary <- data_clinical %>% count(Hospital)
ggplot(data_summary, aes(x = Hospital, y = n)) +
  geom_segment(aes(x=Hospital, xend=Hospital, y=0, yend=n), color="grey") +
  geom_point(size=4, color="orange") +
  labs(title = "26. Lollipop Chart")

27. Cleveland Dot Plot

  • Definition: Similar to lollipop but often horizontal and sorted.
  • Research Use: Comparing values with high precision.
ggplot(data_summary, aes(x = n, y = reorder(Hospital, n))) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "27. Cleveland Dot Plot", y = "Hospital")

28. Pie Chart

  • Definition: Circular chart divided into sectors.
  • Research Use: Showing part-to-whole relationships (Use with caution; bar charts are usually easier to read accurately, because readers compare lengths more reliably than angles or areas).
ggplot(data_summary, aes(x = "", y = n, fill = Hospital)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_void() +
  labs(title = "28. Pie Chart")

29. Donut Chart

  • Definition: A pie chart with a hole in the center.
  • Research Use: Aesthetically lighter than a pie chart.
ggplot(data_summary, aes(x = 2, y = n, fill = Hospital)) +
  geom_bar(stat = "identity", width = 1) +
  xlim(0.5, 2.5) +
  coord_polar("y") +
  theme_void() +
  labs(title = "29. Donut Chart")

30. Rose Plot (Coxcomb)

  • Definition: A bar chart plotted in polar coordinates.
  • Research Use: Used in meteorology or cyclical data (famous usage by Florence Nightingale).
ggplot(data_clinical, aes(x = Outcome, fill = Treatment)) +
  geom_bar() +
  coord_polar() +
  labs(title = "30. Rose Plot")

31. Heatmap (Categorical Frequency)

  • Definition: A grid where color intensity represents the count of cross-tabulated categories.
  • Research Use: Visualizing contingency tables.
table_data <- data_clinical %>% count(Treatment, Outcome)
ggplot(table_data, aes(x = Treatment, y = Outcome, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white") +
  labs(title = "31. Categorical Heatmap")

32. Treemap

  • Definition: Rectangles sized by count/value, nested by hierarchy.
  • Research Use: Displaying hierarchical data (e.g., Hospital -> Department -> Patient Count).
  • Note: A true treemap requires the treemapify package; the code below is a simplified tile grid in which color, not rectangle size, encodes frequency.
# Simplified representation using tiles (area is not mapped; color encodes frequency)
ggplot(table_data, aes(x = Treatment, y = Outcome, fill = n)) +
  geom_tile(color = "white") +
  labs(title = "32. Tile Map (Treemap Alternative)", subtitle = "Color = Frequency")
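
For comparison, the sketch below draws an area-proportional treemap with the treemapify package mentioned in the note. It assumes treemapify is installed, and it rebuilds the count table from simulated stand-in vectors (trt, out, and the grouping of treatments within outcomes are assumptions made for illustration).

```r
library(ggplot2)
library(treemapify)  # assumption: installed from CRAN

# Assumption: stand-ins for data_clinical$Treatment and data_clinical$Outcome
set.seed(555)
trt <- sample(c("Placebo", "Low Dose", "High Dose"), 500, replace = TRUE)
out <- sample(c("Recovered", "Stable", "Deteriorated"), 500,
              replace = TRUE, prob = c(0.4, 0.4, 0.2))

# One row per Treatment x Outcome combination, with its count in column n
table_data <- as.data.frame(table(Treatment = trt, Outcome = out),
                            responseName = "n")

# Rectangle area is proportional to n; rectangles are grouped by Outcome
ggplot(table_data, aes(area = n, fill = Outcome,
                       label = Treatment, subgroup = Outcome)) +
  geom_treemap() +
  geom_treemap_subgroup_border(colour = "white") +
  geom_treemap_text(colour = "white", place = "centre") +
  labs(title = "Treemap: Treatment within Outcome")
```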

33. Dumbbell Plot

  • Definition: Two points connected by a line, showing change between two states.
  • Research Use: Comparing “Before vs After” or “Group A vs Group B” for categorical items.
dumb_data <- data.frame(
  Metric = c("Pain", "Mobility", "Sleep"),
  Placebo = c(6, 4, 5),
  Drug = c(3, 7, 8)
) %>% pivot_longer(cols = c("Placebo", "Drug"), names_to = "Group", values_to = "Score")

ggplot(dumb_data, aes(x = Score, y = Metric)) +
  geom_line(aes(group = Metric), color = "grey") +
  geom_point(aes(color = Group), size = 3) +
  labs(title = "33. Dumbbell Plot (Effect Size)")

34. Radar (Spider) Chart

  • Definition: Multivariate data plotted on axes starting from the same center.
  • Research Use: Comparing profiles (e.g., comparing a patient’s health metrics against the average).
# Simulated via Polar Coordinates in ggplot
radar_data <- data.frame(
  Metric = c("Physical", "Mental", "Social", "Pain", "General"),
  Score = c(80, 60, 90, 40, 70)
)
ggplot(radar_data, aes(x = Metric, y = Score, group = 1)) +
  # The polygon outline closes the shape; geom_line() would leave a gap between the last and first metric
  geom_polygon(fill = "blue", color = "blue", alpha = 0.2) +
  coord_polar() +
  labs(title = "34. Radar Chart Profile")

35. Diverging Bar Chart

  • Definition: Bars extending left and right from a central zero line.
  • Research Use: Visualizing Likert scale data (Agree vs Disagree) or demographic pyramids.
likert_data <- data.frame(
  Question = c("Q1", "Q2", "Q3"),
  Score = c(-20, 15, -5) # Net Promoter Score or similar
)
ggplot(likert_data, aes(x = Question, y = Score, fill = Score > 0)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "35. Diverging Bar Chart")


4. Summary

In this module, we explored the versatility of R and ggplot2 for data visualization.

  1. Continuous plots (Histograms, Boxplots, Scatterplots) help us understand distribution and correlation.
  2. Categorical plots (Bars, Lollipops, Heatmaps) help us understand frequency and proportions.

Assignment: Select the mtcars built-in dataset in R. Produce a report containing:

  1. A histogram of mpg.
  2. A scatter plot of hp vs wt colored by cyl.
  3. A boxplot of mpg grouped by gear.


End of Module V