1. Introduction and Data Simulation

Visualization is not just about making pretty pictures; it is a critical tool for Exploratory Data Analysis (EDA), assumption checking, and communicating complex research findings. In this module, we utilize the ggplot2 grammar of graphics.

1.1 Simulated Health Dataset

We will generate a complex dataset representing a Clinical Trial with 500 patients.

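# Packages used throughout this module
library(ggplot2)     # grammar-of-graphics plotting
library(dplyr)       # %>%, select(), count()
library(tidyr)       # pivot_longer()
library(ggridges)    # geom_density_ridges()
library(ggcorrplot)  # ggcorrplot()
# geom_hex() additionally requires the hexbin package to be installed

set.seed(555)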
n <- 500

# 1. Patient Data (Cross-sectional)
data_clinical <- data.frame(
  ID = 1:n,
  Age = round(rnorm(n, 55, 12)),
  BMI = round(rnorm(n, 28, 5), 1),
  Cholesterol = round(rnorm(n, 200, 40)),
  # Skewed variable (e.g., Biomarker X)
  Biomarker = rlnorm(n, meanlog = 1, sdlog = 0.8),
  # Count variable (Hospital Visits)
  Visits = rpois(n, lambda = 2),
  # Categorical Variables
  Gender = sample(c("Male", "Female"), n, replace = TRUE),
  Treatment = sample(c("Placebo", "Low Dose", "High Dose"), n, replace = TRUE),
  Outcome = sample(c("Recovered", "Stable", "Deteriorated"), n, replace = TRUE, prob = c(0.4, 0.4, 0.2)),
  Hospital = sample(c("General", "St. Mary's", "University"), n, replace = TRUE)
)

# 2. Time Series Data (Longitudinal) for Line plots
time_seq <- 1:12
data_longitudinal <- data.frame(
  Month = rep(time_seq, 3),
  Group = rep(c("Placebo", "Low Dose", "High Dose"), each = 12),
  Avg_Pain_Score = c(
    sort(runif(12, 5, 8), decreasing = TRUE), # Placebo (slow drop)
    sort(runif(12, 3, 8), decreasing = TRUE), # Low Dose
    sort(runif(12, 1, 8), decreasing = TRUE)  # High Dose (fast drop)
  )
)

2. Visualizing Continuous and Count Data

This section covers techniques for variables measured on a continuous scale (e.g., BMI, Age) or discrete counts (e.g., Visits).

2.1 Distribution Analysis (Univariate)

1. Histogram

Definition: Bins continuous data into intervals and counts observations per bin.
Research Use: Assessing normality, skewness, and modality.

ggplot(data_clinical, aes(x = Cholesterol)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "white") +
  labs(title = "1. Histogram of Cholesterol", subtitle = "Checking for Normality")
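
To make the binning step concrete, the sketch below reproduces it in base R with cut() and table(). The vector chol is a simulated stand-in, and the bin edges (multiples of 10) are an assumption for illustration. ggplot2 aligns its bins by its own defaults, adjustable through the boundary or center arguments, so these counts may differ slightly from the bars in the plot.

```r
# Illustrative data only: a simulated stand-in for Cholesterol.
set.seed(1)
chol <- round(rnorm(500, mean = 200, sd = 40))

# Bin edges 10 units wide that cover the full range of the data.
# The extra "+ 10" guarantees the maximum falls inside the last [a, b) bin.
breaks <- seq(floor(min(chol) / 10) * 10,
              ceiling(max(chol) / 10) * 10 + 10,
              by = 10)

# cut() assigns each value to an interval; table() counts values per interval.
bin_counts <- table(cut(chol, breaks = breaks, right = FALSE))

head(bin_counts)

# Every observation lands in exactly one bin.
sum(bin_counts) == length(chol)
```

Changing the bin width changes the apparent shape of the distribution, so it is worth trying more than one value before drawing conclusions about skewness or modality.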

2. Density Plot

Definition: A smoothed version of the histogram using Kernel Density Estimation (KDE).
Research Use: Visualizing the probability density function (PDF).

ggplot(data_clinical, aes(x = Cholesterol)) +
  geom_density(fill = "skyblue", alpha = 0.5) +
  labs(title = "2. Density Plot")
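
As a rough illustration of what the smoothing computes, base R's density() performs the same kind of kernel density estimate. The sketch assumes simulated data and relies on the Gaussian kernel and "nrd0" bandwidth rule, which are the defaults of both density() and ggplot2's density stat. It then checks numerically that the estimated curve integrates to approximately 1, which is what makes the y-axis a density rather than a count.

```r
# Illustrative data only: a simulated stand-in for Cholesterol.
set.seed(1)
chol <- rnorm(500, mean = 200, sd = 40)

# Kernel density estimate with the default Gaussian kernel and "nrd0" bandwidth.
kde <- density(chol)

# Approximate the area under the curve with the trapezoid rule.
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)

kde$bw          # the bandwidth that was selected
round(area, 2)  # should be close to 1
```

A smaller bandwidth (for example density(chol, adjust = 0.5)) follows the data more closely, while a larger one gives a smoother curve.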

3. Frequency Polygon

Definition: Similar to a histogram but uses lines connecting bin counts.
Research Use: Comparing multiple distributions on the same plot without clutter.

ggplot(data_clinical, aes(x = Cholesterol, color = Gender)) +
  geom_freqpoly(binwidth = 10, linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "3. Frequency Polygon by Gender")

4. Area Plot (Density)

Definition: A distribution curve with the area under it filled; the code below fills the binned counts (stat = "bin") rather than a kernel density estimate.
Research Use: Emphasizing the volume of the distribution.

ggplot(data_clinical, aes(x = Age)) +
  geom_area(stat = "bin", fill = "lightcoral", alpha = 0.6) +
  labs(title = "4. Area Plot of Age")

5. Rug Plot

Definition: Marks individual data points on the axis.
Research Use: Supplementing histograms to show exact data location (identifying clustering).

ggplot(data_clinical, aes(x = BMI)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  labs(title = "5. Density with Rug Plot")

6. ECDF Plot

Definition: Empirical Cumulative Distribution Function.
Research Use: Estimating percentiles (e.g., “What % of patients have BMI < 30?”).

ggplot(data_clinical, aes(x = BMI)) +
  stat_ecdf(geom = "step", color = "blue") +
  labs(title = "6. ECDF of BMI", y = "Cumulative Probability")
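
The percentile question can also be answered numerically. The sketch below uses base R's ecdf() on a simulated BMI vector (an illustrative stand-in, not the module's dataset). Note that an ECDF evaluated at q gives the proportion of values less than or equal to q, so the strict "BMI < 30" proportion is computed separately.

```r
# Illustrative data only: a simulated stand-in for BMI.
set.seed(1)
bmi <- round(rnorm(500, mean = 28, sd = 5), 1)

# ecdf() returns a step function: ecdf_bmi(q) is the proportion of values <= q.
ecdf_bmi <- ecdf(bmi)

p_le_30 <- ecdf_bmi(30)    # proportion with BMI <= 30
p_lt_30 <- mean(bmi < 30)  # proportion with BMI strictly below 30

# Reading the curve the other way: the BMI value at the 90th percentile.
q90 <- quantile(bmi, probs = 0.90)

c(le_30 = p_le_30, lt_30 = p_lt_30)
q90
```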

7. Q-Q Plot

Definition: Plots sample quantiles against theoretical normal quantiles.
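Research Use: A standard visual diagnostic for the Normality assumption in regression/ANOVA, where it is usually applied to model residuals; it complements formal tests rather than replacing them.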

ggplot(data_clinical, aes(sample = Cholesterol)) +
  stat_qq() + stat_qq_line(color = "red") +
  labs(title = "7. Q-Q Plot (Normality Check)")
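
To see what the plot is built from, the following sketch extracts the Q-Q coordinates with base R's qqnorm() and compares a roughly normal sample with a right-skewed one. Both samples are simulated for illustration. The correlation between theoretical and sample quantiles is used here only as an informal summary of how straight the points are; it is not a formal normality test.

```r
# Illustrative data only: one approximately normal and one skewed sample.
set.seed(1)
x_normal <- rnorm(500, mean = 200, sd = 40)
x_skewed <- rlnorm(500, meanlog = 1, sdlog = 0.8)

# qqnorm(plot.it = FALSE) returns the coordinates without drawing:
# $x holds the theoretical normal quantiles, $y the sample values.
qq_normal <- qqnorm(x_normal, plot.it = FALSE)
qq_skewed <- qqnorm(x_skewed, plot.it = FALSE)

# Points that lie on a straight line have a correlation close to 1.
r_normal <- cor(qq_normal$x, qq_normal$y)
r_skewed <- cor(qq_skewed$x, qq_skewed$y)

round(c(normal = r_normal, skewed = r_skewed), 3)
```

In the plot itself, the skewed sample would bend away from the reference line in the upper tail, which is the pattern to look for with variables such as Biomarker.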

2.2 Comparisons and Relationships (Bivariate/Multivariate)

8. Boxplot

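Definition: Displays the median and quartiles (Q1, Q3) as a box, with whiskers reaching the most extreme points within 1.5 × IQR of the box; points beyond the whiskers are drawn individually as outliers.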
Research Use: Detecting outliers and comparing distributions across groups.

ggplot(data_clinical, aes(x = Treatment, y = BMI, fill = Treatment)) +
  geom_boxplot() +
  labs(title = "8. Boxplot of BMI by Treatment")
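
The whisker and outlier positions follow a simple rule that can be checked by hand. The sketch below applies the 1.5 × IQR rule, which geom_boxplot() uses by default, to a small made-up vector with one extreme value. The numbers are illustrative and not taken from the trial data.

```r
# Small illustrative vector with one extreme value.
bmi <- c(21.5, 23.0, 24.8, 26.1, 27.4, 28.0, 29.3, 31.2, 33.5, 48.0)

# Quartiles and interquartile range.
q   <- quantile(bmi, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences sit 1.5 * IQR beyond the quartiles.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
whisker_low  <- min(bmi[bmi >= lower_fence])
whisker_high <- max(bmi[bmi <= upper_fence])

# Anything outside the fences is drawn as an individual outlier point.
outliers <- bmi[bmi < lower_fence | bmi > upper_fence]

c(whisker_low = whisker_low, whisker_high = whisker_high)
outliers
```

Here the upper whisker stops at 33.5 and the value 48.0 is flagged as an outlier, so the top of the whisker is not the sample maximum.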

9. Violin Plot

Definition: A mirrored kernel density plot drawn for each group; it is often combined with a boxplot drawn inside it.
Research Use: Showing the shape of the distribution (bimodality) which boxplots hide.

ggplot(data_clinical, aes(x = Treatment, y = BMI, fill = Treatment)) +
  geom_violin(trim = FALSE) +
  labs(title = "9. Violin Plot")

10. Jitter Plot

Definition: A scatter plot for categorical x-axis where points are randomly shifted to prevent overlap.
Research Use: Viewing sample size and raw data spread.

ggplot(data_clinical, aes(x = Treatment, y = BMI)) +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "10. Jitter Plot")

11. Sina Plot (Violin + Jitter)

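Definition: Overlays raw points on a violin plot. (A true sina plot, e.g. geom_sina() from the ggforce package, restricts the jitter to the width of the density; the code below approximates it with a fixed-width jitter.)
Research Use: Showing the distribution shape, the sample size, and every raw observation in a single view of group differences.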

ggplot(data_clinical, aes(x = Treatment, y = BMI)) +
  geom_violin(alpha = 0.3) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  labs(title = "11. Combined Violin and Jitter")

12. Ridgeline Plot

Definition: Partially overlapping density plots stacked vertically.
Research Use: Comparing changes in distribution across many groups or time points.

ggplot(data_clinical, aes(x = Cholesterol, y = Outcome, fill = Outcome)) +
  geom_density_ridges(alpha = 0.7) +
  labs(title = "12. Ridgeline Plot")

13. Scatter Plot

Definition: Plots two continuous variables against each other.
Research Use: Assessing correlation (linear or non-linear).

ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_point(alpha = 0.6) +
  labs(title = "13. Scatter Plot: Age vs Cholesterol")

14. Scatter Plot with Trend Line (Smooth)

Definition: Scatter plot with a regression line (Linear or Loess).
Research Use: Visualizing the trend and confidence intervals.

ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "14. Scatter with Linear Regression")
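
The red line and grey band drawn by geom_smooth(method = "lm") correspond to an ordinary least-squares fit and its 95% confidence interval for the mean. The sketch below reproduces those numbers with lm() and predict(). The data are simulated with an assumed slope of 0.8 purely for illustration; in the module's dataset Age and Cholesterol are generated independently, so the fitted line there should be close to flat.

```r
# Illustrative data only, with an assumed linear relationship.
set.seed(1)
n   <- 200
age <- round(rnorm(n, mean = 55, sd = 12))
chol <- 150 + 0.8 * age + rnorm(n, sd = 30)
dat <- data.frame(age = age, chol = chol)

# Ordinary least-squares fit: the same model geom_smooth(method = "lm") draws.
fit <- lm(chol ~ age, data = dat)
coef(fit)

# 95% confidence band for the fitted mean on a grid of ages.
grid <- data.frame(age = seq(min(dat$age), max(dat$age), length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Columns: fit (the line), lwr and upr (the band edges).
head(cbind(grid, band))
```

The band is narrowest near the mean age and widens toward the extremes, which is why trend lines should not be over-interpreted at the edges of the data.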

15. Bubble Chart

Definition: A scatter plot where point size represents a third variable.
Research Use: Visualizing 3 dimensions of continuous data.

ggplot(data_clinical, aes(x = Age, y = BMI, size = Biomarker)) +
  geom_point(alpha = 0.5, color = "purple") +
  labs(title = "15. Bubble Chart (Size = Biomarker)")

16. Hexbin Plot

Definition: Divides the plane into hexagons and counts points in each.
Research Use: Handling “Overplotting” in very large datasets (N > 10,000).

ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_hex(bins = 20) +
  scale_fill_viridis_c() +
  labs(title = "16. Hexbin Plot (Density of Points)")

17. 2D Density Contour

Definition: Topographic map of point density.
Research Use: Identifying clusters in bivariate data.

ggplot(data_clinical, aes(x = Age, y = Cholesterol)) +
  geom_density_2d() +
  labs(title = "17. 2D Contour Plot")

18. Correlogram (Heatmap)

Definition: A matrix visualizing correlation coefficients.
Research Use: Feature selection and identifying collinearity.

corr_matrix <- cor(data_clinical %>% select(Age, BMI, Cholesterol, Biomarker, Visits))
ggcorrplot(corr_matrix, lab = TRUE, type = "lower", title = "18. Correlation Matrix")
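
The matrix passed to the plotting function is an ordinary Pearson correlation matrix. The sketch below builds one with cor() on simulated data, in which Cholesterol is made to depend loosely on BMI (an assumption for illustration only), and confirms the two properties that justify showing just the lower triangle.

```r
# Illustrative data only: Cholesterol is constructed to track BMI loosely.
set.seed(1)
n <- 500
dat <- data.frame(
  Age = rnorm(n, mean = 55, sd = 12),
  BMI = rnorm(n, mean = 28, sd = 5)
)
dat$Cholesterol <- 120 + 3 * dat$BMI + rnorm(n, sd = 30)

# Pairwise Pearson correlations between all numeric columns.
corr_matrix <- cor(dat, method = "pearson")
round(corr_matrix, 2)

# The matrix is symmetric with ones on the diagonal, so the upper
# triangle repeats the information in the lower triangle.
isSymmetric(corr_matrix)
diag(corr_matrix)
```

For skewed variables such as Biomarker, method = "spearman" is a common alternative because it depends only on ranks.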

19. Line Chart (Time Series)

Definition: Connects data points in temporal order.
Research Use: Tracking longitudinal changes or trends.

ggplot(data_longitudinal, aes(x = Month, y = Avg_Pain_Score, color = Group)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point() +
  labs(title = "19. Longitudinal Line Chart")

20. Step Plot

Definition: Line chart that changes only at specific intervals (steps).
Research Use: Survival analysis (Kaplan-Meier curves) or inventory changes.

ggplot(data_longitudinal, aes(x = Month, y = Avg_Pain_Score, color = Group)) +
  geom_step() +
  labs(title = "20. Step Plot")

3. Visualizing Categorical Data

This section covers techniques for nominal (e.g., Gender) and ordinal (e.g., Outcome) data.
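
One practical point for ordinal variables: a character column is plotted in alphabetical order unless it is converted to a factor with explicit levels. The sketch below shows the difference on a short made-up vector; the worst-to-best ordering of the outcome categories is an assumption chosen for illustration.

```r
# Short illustrative vector of outcome labels.
outcome <- c("Recovered", "Stable", "Deteriorated", "Stable", "Recovered")

# Default: levels are sorted alphabetically, which is also the axis order.
default_levels <- levels(factor(outcome))
default_levels

# Explicit ordering from worst to best (assumed clinical ordering).
outcome_ord <- factor(
  outcome,
  levels  = c("Deteriorated", "Stable", "Recovered"),
  ordered = TRUE
)
levels(outcome_ord)

# Ordered factors support comparisons: is "Recovered" better than "Stable"?
outcome_ord[1] > outcome_ord[2]
```

Applying the same conversion to a data frame column before plotting makes bars, legends, and facets follow the intended order.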

3.1 Proportions and Counts

21. Simple Bar Chart

Definition: Height of bar represents count of a category.
Research Use: Comparing frequency of groups.

ggplot(data_clinical, aes(x = Outcome)) +
  geom_bar(fill = "steelblue") +
  labs(title = "21. Simple Bar Chart")

22. Horizontal Bar Chart

Definition: A bar chart with flipped coordinates, so that categories sit on the vertical axis and bars extend horizontally.
Research Use: Better when category labels are long.

ggplot(data_clinical, aes(x = Outcome)) +
  geom_bar(fill = "steelblue") +
  coord_flip() +
  labs(title = "22. Horizontal Bar Chart")

23. Stacked Bar Chart

Definition: Segments bars by a second categorical variable.
Research Use: Showing total counts and sub-group composition.

ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "stack") +
  labs(title = "23. Stacked Bar Chart")

24. Grouped (Dodged) Bar Chart

Definition: Bars for subgroups are placed side-by-side.
Research Use: Direct comparison of subgroups within categories.

ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "dodge") +
  labs(title = "24. Grouped Bar Chart")

25. Percent (Filled) Bar Chart

Definition: Bars are stretched to 100%.
Research Use: Comparing proportions rather than raw counts (e.g., “Did the High Dose group have a higher % of recovery?”).

ggplot(data_clinical, aes(x = Treatment, fill = Outcome)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", title = "25. 100% Stacked Bar Chart")
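
The proportions behind a filled bar chart can be tabulated directly, which is useful when exact percentages need to be reported alongside the figure. The sketch below uses table() and prop.table() on simulated treatment and outcome labels (illustrative stand-ins generated independently, so no real treatment effect is implied).

```r
# Illustrative data only: independent treatment and outcome labels.
set.seed(1)
n <- 300
treatment <- sample(c("Placebo", "Low Dose", "High Dose"), n, replace = TRUE)
outcome   <- sample(c("Recovered", "Stable", "Deteriorated"), n, replace = TRUE)

# Cross-tabulated counts: rows are treatments, columns are outcomes.
counts <- table(Treatment = treatment, Outcome = outcome)

# margin = 1 normalizes within each row, i.e. within each treatment group,
# matching what position = "fill" does within each bar.
props <- prop.table(counts, margin = 1)
round(props, 2)

# Each treatment group sums to 1 (100%).
rowSums(props)
```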

26. Lollipop Chart

Definition: A dot connected to a baseline by a line.
Research Use: A modern, cleaner alternative to bar charts for many categories.

data_summary <- data_clinical %>% count(Hospital)
ggplot(data_summary, aes(x = Hospital, y = n)) +
  geom_segment(aes(x=Hospital, xend=Hospital, y=0, yend=n), color="grey") +
  geom_point(size=4, color="orange") +
  labs(title = "26. Lollipop Chart")

27. Cleveland Dot Plot

Definition: Similar to lollipop but often horizontal and sorted.
Research Use: Comparing values with high precision.

ggplot(data_summary, aes(x = n, y = reorder(Hospital, n))) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "27. Cleveland Dot Plot", y = "Hospital")

28. Pie Chart

Definition: Circular chart divided into sectors.
Research Use: Showing part-to-whole relationships. Use with caution: readers judge bar lengths more accurately than angles or areas, so a bar chart is usually the better choice.

ggplot(data_summary, aes(x = "", y = n, fill = Hospital)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  theme_void() +
  labs(title = "28. Pie Chart")

29. Donut Chart

Definition: A pie chart with a hole in the center.
Research Use: Aesthetically lighter than a pie chart.

ggplot(data_summary, aes(x = 2, y = n, fill = Hospital)) +
  geom_bar(stat = "identity", width = 1) +
  xlim(0.5, 2.5) +
  coord_polar("y") +
  theme_void() +
  labs(title = "29. Donut Chart")

30. Rose Plot (Coxcomb)

Definition: A bar chart plotted in polar coordinates.
Research Use: Displaying cyclical data, for example in meteorology (famously used by Florence Nightingale).

ggplot(data_clinical, aes(x = Outcome, fill = Treatment)) +
  geom_bar() +
  coord_polar() +
  labs(title = "30. Rose Plot")

31. Heatmap (Categorical Frequency)

Definition: A grid where color intensity represents the count of cross-tabulated categories.
Research Use: Visualizing contingency tables.

table_data <- data_clinical %>% count(Treatment, Outcome)
ggplot(table_data, aes(x = Treatment, y = Outcome, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white") +
  labs(title = "31. Categorical Heatmap")

32. Treemap

Definition: Rectangles sized by count/value, nested by hierarchy.
Research Use: Displaying hierarchical data (e.g., Hospital -> Department -> Patient Count).
Note: A true treemap requires the treemapify package; the code below approximates it with equal-sized tiles colored by count.

# Simplified representation using tiles (tile size is constant; only color encodes frequency)
ggplot(table_data, aes(x = Treatment, y = Outcome, fill = n)) +
  geom_tile(color = "white") +
  labs(title = "32. Tile Map (Treemap Alternative)", subtitle = "Color = Frequency")

33. Dumbbell Plot

Definition: Two points connected by a line, showing change between two states.
Research Use: Comparing “Before vs After” or “Group A vs Group B” for categorical items.

dumb_data <- data.frame(
  Metric = c("Pain", "Mobility", "Sleep"),
  Placebo = c(6, 4, 5),
  Drug = c(3, 7, 8)
) %>% pivot_longer(cols = c("Placebo", "Drug"), names_to = "Group", values_to = "Score")

ggplot(dumb_data, aes(x = Score, y = Metric)) +
  geom_line(aes(group = Metric), color = "grey") +
  geom_point(aes(color = Group), size = 3) +
  labs(title = "33. Dumbbell Plot (Effect Size)")

34. Radar (Spider) Chart

Definition: Multivariate data plotted on axes starting from the same center.
Research Use: Comparing profiles (e.g., comparing a patient’s health metrics against the average).

# Simulated via Polar Coordinates in ggplot
radar_data <- data.frame(
  Metric = c("Physical", "Mental", "Social", "Pain", "General"),
  Score = c(80, 60, 90, 40, 70)
)
ggplot(radar_data, aes(x = Metric, y = Score, group = 1)) +
  geom_polygon(fill = "blue", alpha = 0.2) +
  geom_line(color = "blue") +
  coord_polar() +
  labs(title = "34. Radar Chart Profile")

35. Diverging Bar Chart

Definition: Bars extending left and right from a central zero line.
Research Use: Visualizing Likert scale data (Agree vs Disagree) or demographic pyramids.

likert_data <- data.frame(
  Question = c("Q1", "Q2", "Q3"),
  Score = c(-20, 15, -5) # Net Promoter Score or similar
)
ggplot(likert_data, aes(x = Question, y = Score, fill = Score > 0)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "35. Diverging Bar Chart")

4. Summary

In this module, we explored the versatility of R and ggplot2 for data visualization.

1. Continuous plots (Histograms, Boxplots, Scatterplots) help us understand distribution and correlation.
2. Categorical plots (Bars, Lollipops, Heatmaps) help us understand frequency and proportions.

Assignment: Select the mtcars built-in dataset in R. Produce a report containing:

1. A histogram of mpg.
2. A scatter plot of hp vs wt colored by cyl.
3. A boxplot of mpg grouped by gear.

End of Module V

Module V: Advanced Data Visualization in R

Course: Statistical Programming and Data Management

Abdisalam Hassan Muse (PhD)

2025-12-15