Introduction

This presentation explores data visualization using ggplot2 with the Pima Indians Diabetes Dataset. We will go through data cleaning, basic and advanced plotting techniques, adding interactivity with plotly, and applying key principles of graphic design.

Graphic design

Graphics design refers to the creation of visual representations of data, such as plots, charts, and diagrams, to facilitate data analysis, interpretation, and presentation. R offers a wide array of tools and packages to produce high-quality graphics, making it a powerful language for data visualization.

Key Concepts in Graphics Design in R

Base Graphics System:

R’s original graphics system, providing functions like plot(), hist(), boxplot(), and barplot(). Allows detailed customization of plots through parameters and functions.

Grid Graphics System:

Provides a flexible approach using the grid package for complex and custom layouts. Enables precise control over graphical objects and their positioning.

Lattice Graphics:

Built on top of the grid system, designed for creating multi-panel conditioned plots. Functions like xyplot(), bwplot(), densityplot() facilitate multi-variable visualization.

ggplot2:

A popular and powerful package based on the Grammar of Graphics concept. Allows building complex layered graphics systematically. Provides extensive customization options for aesthetics, themes, and annotations.
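
To make the differences between these systems concrete, here is a minimal sketch that draws a similar scatterplot with base graphics, lattice, and ggplot2. It uses the built-in mtcars data rather than the Pima data, purely as an assumption so the block runs on its own; the object names lattice_plot and gg_plot are illustrative.

```r
library(lattice)
library(ggplot2)

# Base graphics: a single call draws straight to the graphics device
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "MPG",
     main = "Base graphics")

# Lattice: formula interface; '| factor(cyl)' gives one panel per cylinder count
lattice_plot <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
                       xlab = "Weight", ylab = "MPG")
print(lattice_plot)

# ggplot2: data + aesthetic mapping + layers, combined with '+'
gg_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(x = "Weight", y = "MPG", title = "ggplot2")
print(gg_plot)
```

The lattice and ggplot2 calls return plot objects that can be stored and printed later, whereas the base call draws immediately.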

Principles of Good Graphics Design in R

Clarity: Ensure the visualization clearly conveys the intended message.

Simplicity: Avoid unnecessary clutter; focus on key data insights.

Consistency: Use consistent color schemes, scales, and formats.

Aesthetics: Make visuals appealing with appropriate themes and styles.

Interactivity: Use packages like plotly or shiny for interactive graphics.

Graphics design is a vital aspect of data analysis, enabling the transformation of raw data into insightful visual stories. Whether using base R graphics, lattice, or ggplot2, R provides versatile tools to create informative, attractive, and publication-quality visualizations tailored to various analytical needs.

The following principles guide effective and aesthetic data visualization, along with the tools and ggplot2 features used to achieve them:

1. Clarity: Ensure your message is easy to interpret at a glance. Tools Used: labs(), theme(), proper axis labels, clear titles and subtitles.

2. Consistency: Use consistent color schemes, fonts, and layouts. Tools Used: theme_minimal(), custom fonts (base_family), consistent scale_fill_manual()/scale_color_manual() across plots.

3. Hierarchy: Use size, color, and position to prioritize elements. Tools Used: annotate(), element_text(), scale_* functions for size and color.

4. Balance: Distribute visual weight evenly. Tools Used: Plot layout (e.g., grid.arrange()), spacing in theme().

5. Contrast: Emphasize key areas with contrasting elements. Tools Used: scale_fill_manual(), annotate() with distinct fill colors, theme_dark() or light/dark backgrounds.

6. Whitespace: Prevent clutter and improve readability. Tools Used: theme_*() functions, margin() settings within themes, simplifying plots by removing grid lines or redundant axes.
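
One way to apply the Consistency and Whitespace principles in practice is to define the palette and theme once and reuse them in every plot. The sketch below is only an illustration: the names test_colors and theme_pima(), the two colors, and the small demo data frame are assumptions, not part of the original presentation (the counts 500 and 268 are taken from the data summary shown later).

```r
library(ggplot2)

# Hypothetical shared palette: one fixed color per test result
test_colors <- c(Negative = "steelblue", Positive = "firebrick")

# Hypothetical shared theme: minimal base, bold title, fewer grid lines, even margins
theme_pima <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank(),
      plot.margin = margin(10, 10, 10, 10)
    )
}

# Small demo data frame so the block runs on its own
demo <- data.frame(test = factor(c("Negative", "Positive")), n = c(500, 268))

demo_plot <- ggplot(demo, aes(x = test, y = n, fill = test)) +
  geom_col() +
  scale_fill_manual(values = test_colors) +
  labs(title = "Reusable palette and theme", x = "Test Result", y = "Count") +
  theme_pima()
demo_plot
```

Adding the same scale and theme calls to each plot keeps colors and typography identical across a presentation.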

Libraries and Data Loading

library(ggplot2)      # For plotting
library(dplyr)        # For data manipulation
library(tidyr)        # For data tidying
library(plotly)       # For interactivity
library(gridExtra)    # For combining plots
library(faraway)      # For the Pima dataset

data(pima)

Data Cleaning

Zero values in medical measurements such as glucose, blood pressure, or BMI are physiologically implausible, so they are treated as missing values (NA). Converting the ‘test’ variable to a labeled factor makes the plots easier to interpret.

# Convert zero values in specific columns to NA
cols_with_zero_na <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima[cols_with_zero_na] <- lapply(pima[cols_with_zero_na], function(x) ifelse(x == 0, NA, x))


# Convert 'test' to a factor with labels
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

# Summary of cleaned data
summary(pima)

##     pregnant         glucose        diastolic         triceps     
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     insulin            bmi           diabetes           age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00  
##  NA's   :374      NA's   :11                                      
##        test    
##  Negative:500  
##  Positive:268  
##                
##                
##                
##                
##
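
The same zero-to-NA conversion can also be written with dplyr, which is already loaded above. This is a sketch of an alternative, not the method used in this presentation; pima_clean and na_counts are hypothetical names, and the original data frame is left untouched.

```r
library(dplyr)
library(faraway)   # provides the pima data frame

data(pima)
cols_with_zero_na <- c("glucose", "diastolic", "triceps", "insulin", "bmi")

# na_if() replaces every value equal to 0 with NA in the selected columns
pima_clean <- pima %>%
  mutate(across(all_of(cols_with_zero_na), ~ na_if(.x, 0)))

# Missing values per column; these should agree with the NA's row of summary()
na_counts <- colSums(is.na(pima_clean[cols_with_zero_na]))
print(na_counts)
```

Checking the NA counts against the summary output is a quick way to confirm that only the intended columns were changed.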

Scatterplot

This scatterplot shows the relationship between glucose level and BMI, with points colored by diabetes test result. The interactive version allows dynamic exploration of individual data points.

geom_point() – Plots raw data points (used to highlight relationships between glucose and BMI).

aes(color = test) – Encodes the diabetes test result as color for comparison.

labs() – Enhances clarity with labeled axes and titles.

theme_minimal() – Minimalistic theme for readability.

plotly::ggplotly() – Makes it interactive with zoom, hover tooltips.

The graphic design principles applied in this scatterplot are:

Clarity (labels), Contrast (colors), Interactivity (plotly), Whitespace (theme), Hierarchy (color distinctions).

scatter_plot <- ggplot(pima, aes(x = glucose, y = bmi, color = test)) +
  geom_point(alpha = 0.6) +
  labs(title = "BMI vs. Glucose by Diabetes Test Result",
       x = "Glucose", y = "BMI") +
  theme_minimal()

# Interactive plot
plotly::ggplotly(scatter_plot)
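
The hover tooltip can be tailored through the tooltip argument of ggplotly(). The following is a minimal sketch under the assumption that showing age in the hover box is useful; tooltip_plot and interactive_plot are illustrative names, and the data preparation is repeated only so the block runs on its own.

```r
library(ggplot2)
library(plotly)
library(faraway)

data(pima)
# Same preparation as above, limited to the columns used here
pima[c("glucose", "bmi")] <- lapply(pima[c("glucose", "bmi")],
                                    function(x) replace(x, x == 0, NA))
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

# 'text' is not drawn by geom_point(); ggplotly() can pick it up for the tooltip
tooltip_plot <- ggplot(pima, aes(x = glucose, y = bmi, color = test,
                                 text = paste("Age:", age))) +
  geom_point(alpha = 0.6) +
  labs(x = "Glucose", y = "BMI") +
  theme_minimal()

# Restrict the hover box to the listed aesthetics
interactive_plot <- ggplotly(tooltip_plot, tooltip = c("x", "y", "text"))
interactive_plot
```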

Lines and Smoothers

The LOESS smoother helps identify trends in BMI across different ages for each test result category.

geom_smooth(method = "loess") – Adds a trend line with a confidence band for BMI vs. age.

geom_point() – Adds actual observations.

theme_light() – Ensures legibility.

The graphic design principles applied here are:

Hierarchy (trend emphasized), Clarity (smooth visual cue), Balance (layout and color).

smooth_plot <- ggplot(pima, aes(x = age, y = bmi, color = test)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) +
  labs(title = "BMI vs. Age with LOESS Smoother", x = "Age", y = "BMI") +
  theme_light()

plotly::ggplotly(smooth_plot)
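
To see the numbers behind a LOESS curve, the smoother can be fitted directly with stats::loess(). This sketch fits a single curve to all observations, whereas the plot above fits one curve per test result; the ages 25, 40, and 60 are arbitrary illustrative values, and smoothed_df is a hypothetical name.

```r
library(faraway)

data(pima)
pima$bmi[pima$bmi == 0] <- NA   # treat zero BMI as missing, as in the cleaning step

# Fit LOESS directly; span = 0.75 is the loess() default, rows with NA are omitted
fit <- loess(bmi ~ age, data = pima, span = 0.75)

# Smoothed BMI and its standard error at a few illustrative ages
new_ages <- data.frame(age = c(25, 40, 60))
smoothed <- predict(fit, newdata = new_ages, se = TRUE)

smoothed_df <- data.frame(age = new_ages$age,
                          fit = smoothed$fit,
                          se  = smoothed$se.fit)
print(smoothed_df)
```

A smaller span follows local wiggles more closely, while a larger span gives a smoother curve.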

Bars and Columns

This bar chart displays the distribution of positive and negative diabetes test results.

geom_bar() – Counts diabetes outcomes.

scale_fill_manual() – Custom colors to contrast outcomes.

theme_classic() – Simple lines and no distractions.

The graphic design principles applied are:

Contrast (colors), Balance (symmetry), Clarity (simple theme), Consistency (colors reused).

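bar_chart <- ggplot(pima, aes(x = test, fill = test)) +
  geom_bar() +
  # Custom fill colors to contrast the two outcomes (color choices are illustrative)
  scale_fill_manual(values = c("Negative" = "steelblue", "Positive" = "firebrick")) +
  labs(title = "Count of Diabetes Test Results", x = "Test Result", y = "Count") +
  theme_classic()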

ggplotly(bar_chart)
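
geom_bar() counts rows itself, while geom_col() plots values that have already been computed. A minimal sketch of the column variant, showing each result as a share of all tests, might look like this; test_counts and share_chart are illustrative names.

```r
library(ggplot2)
library(dplyr)
library(faraway)

data(pima)
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

# Pre-compute counts and shares; geom_col() plots the values as given
test_counts <- pima %>%
  count(test) %>%
  mutate(share = n / sum(n))
print(test_counts)

share_chart <- ggplot(test_counts, aes(x = test, y = share, fill = test)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Share of Diabetes Test Results", x = "Test Result", y = "Share") +
  theme_classic()
share_chart
```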

Histograms

This histogram illustrates the distribution of glucose levels, highlighting differences between test result groups.

geom_histogram() – Distribution of glucose levels.

alpha = 0.6 – Transparency to show overlaps.

bins = 30 – Controls granularity.

scale_fill_brewer() – Cohesive color palette.

Design principles used are:

Hierarchy (overlapping distributions), Contrast, Clarity (bin control), Consistency (same fill system).

histogram <- ggplot(pima, aes(x = glucose, fill = test)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  # ColorBrewer fill palette (the palette choice is illustrative)
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Distribution of Glucose Levels", x = "Glucose", y = "Frequency") +
  theme_bw()

ggplotly(histogram)
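
Because the two groups differ in size (500 negative versus 268 positive results), raw counts make the smaller group look flatter. One possible variation, sketched here as an assumption rather than the presentation's method, rescales each group to a density so the shapes are directly comparable; the bin width of 10 and the name histogram_density are illustrative.

```r
library(ggplot2)
library(faraway)

data(pima)
pima$glucose[pima$glucose == 0] <- NA   # zero glucose treated as missing
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

# after_stat(density) rescales each group's bars so their total area is 1
histogram_density <- ggplot(pima, aes(x = glucose, fill = test)) +
  geom_histogram(aes(y = after_stat(density)),
                 position = "identity", alpha = 0.6, binwidth = 10) +
  labs(title = "Glucose Levels (density scale)", x = "Glucose", y = "Density") +
  theme_bw()
histogram_density
```

ggplot2 warns that rows with missing glucose values were removed, which is expected after the cleaning step.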

Boxplots

These boxplots provide a summary of BMI distributions, indicating medians and variability within each test result group.

geom_boxplot() – Summarizes BMI distribution by outcome.

outlier.color – Highlights outliers for emphasis.

theme_minimal() – Avoids clutter.

scale_fill_manual() – Controls aesthetics consistently.

The graphic design principles applied are:

Clarity (whisker logic), Emphasis (outliers), Contrast (fill), Whitespace (theme).

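boxplot <- ggplot(pima, aes(x = test, y = bmi, fill = test)) +
  # Outliers drawn in a distinct color for emphasis (color choice is illustrative)
  geom_boxplot(outlier.color = "red") +
  # Same illustrative fill colors as the bar chart, for consistency
  scale_fill_manual(values = c("Negative" = "steelblue", "Positive" = "firebrick")) +
  labs(title = "BMI Distribution by Test Result", x = "Test Result", y = "BMI") +
  theme_minimal()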

ggplotly(boxplot)
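
The quantities a boxplot draws can also be computed explicitly, which helps when reading the whiskers and outliers. The sketch below reproduces the quartiles and the usual 1.5 × IQR outlier rule with dplyr; bmi_summary and its column names are illustrative.

```r
library(dplyr)
library(faraway)

data(pima)
pima$bmi[pima$bmi == 0] <- NA   # zero BMI treated as missing
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

bmi_summary <- pima %>%
  filter(!is.na(bmi)) %>%
  group_by(test) %>%
  summarise(
    q1  = quantile(bmi, 0.25),   # lower edge of the box
    med = median(bmi),           # line inside the box
    q3  = quantile(bmi, 0.75),   # upper edge of the box
    iqr = IQR(bmi),
    # Points more than 1.5 * IQR beyond the box are plotted as outliers
    n_outliers = sum(bmi < q1 - 1.5 * iqr | bmi > q3 + 1.5 * iqr),
    .groups = "drop"
  )
print(bmi_summary)
```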

Modifying Plot Appearance

Background

We can switch the plot’s appearance to a dark panel background with theme_dark().

Design Contribution:

Contrast: Highlights colored points against a dark background.

Accessibility: Can suit dark-mode presentations and may help viewers who find bright backgrounds tiring.

Aesthetic: Helps shift tone and focus in presentations.

scatter_plot + theme_dark()

Axes

Adjusting axis labels and limits focuses the plot on specific data ranges and improves clarity.

scatter_plot +
  scale_x_continuous(name = "Glucose Level", limits = c(50, 200)) +
  scale_y_continuous(name = "BMI", limits = c(10, 70))
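
Limits set through scale_x_continuous() or scale_y_continuous() discard observations outside the range before anything is drawn, while coord_cartesian() only zooms the view. The sketch below contrasts the two; base_plot, limited, zoomed, and n_outside are illustrative names, and the data preparation is repeated so the block runs on its own.

```r
library(ggplot2)
library(faraway)

data(pima)
pima[c("glucose", "bmi")] <- lapply(pima[c("glucose", "bmi")],
                                    function(x) replace(x, x == 0, NA))
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

base_plot <- ggplot(pima, aes(x = glucose, y = bmi, color = test)) +
  geom_point(alpha = 0.6)

# Scale limits: values outside the range become NA and are dropped with a warning
limited <- base_plot +
  scale_x_continuous(name = "Glucose Level", limits = c(50, 200))

# coord_cartesian(): zooms the view only; all observations stay in the data
zoomed <- base_plot +
  coord_cartesian(xlim = c(50, 200), ylim = c(10, 70))

# Non-missing glucose readings that the scale limits would discard
n_outside <- sum(pima$glucose < 50 | pima$glucose > 200, na.rm = TRUE)
print(n_outside)
```

The distinction matters most when a smoother or summary layer is present, because scale limits change the data those layers are computed from.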

Scales

Using different color palettes can make plots more accessible and visually appealing. Here we use a ColorBrewer palette; ColorBrewer palettes are designed to give distinguishable, consistent colors.

scatter_plot + scale_color_brewer(palette = "Set1")
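
Not every ColorBrewer palette is flagged as colorblind-friendly. The RColorBrewer package records this in brewer.pal.info, so a palette can be checked before it is used. The sketch below is illustrative: the choice of Dark2 and the tiny demo data frame are assumptions.

```r
library(ggplot2)
library(RColorBrewer)

# Qualitative palettes and whether RColorBrewer flags them as colorblind-friendly
qual_palettes <- subset(brewer.pal.info, category == "qual")
print(qual_palettes[, c("maxcolors", "colorblind")])

# Illustrative plot using one of the flagged palettes (Dark2 is an arbitrary pick)
demo <- data.frame(x = c(1, 2), y = c(1, 2),
                   test = factor(c("Negative", "Positive")))

palette_demo <- ggplot(demo, aes(x = x, y = y, color = test)) +
  geom_point(size = 4) +
  scale_color_brewer(palette = "Dark2")
palette_demo
```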

Legends

Customizing the legend enhances interpretability and integrates it seamlessly into the plot layout. Renaming the legend title makes the color encoding explicit, and moving the legend to the bottom improves balance and spacing.

scatter_plot +
  labs(color = "Diabetes Test Result") +
  theme(legend.position = "bottom")

Annotations

Annotations draw attention to specific regions or points of interest within the plot. Here, a shaded box and a label highlight a region of interest (e.g., high-risk glucose/BMI levels).

scatter_plot +
  annotate("rect", xmin = 150, xmax = 200, ymin = 30, ymax = 50, alpha = 0.2, fill = "blue") +
  annotate("text", x = 175, y = 52, label = "High Risk Zone", color = "blue")

Titles and Themes

Titles and captions provide context and source information, making the plot self-explanatory. Here this is done with layered textual context: a main title, an explanatory subtitle, and a data source caption.

scatter_plot +
  labs(title = "BMI vs. Glucose Levels",
       subtitle = "Differentiated by Diabetes Test Result",
       caption = "Source: Pima Indians Diabetes Dataset")

Combine Plots

Combining plots facilitates comparative analysis and gives a comprehensive view of different aspects of the data. grid.arrange() places multiple plots in a grid layout, allowing side-by-side comparisons.

grid.arrange(scatter_plot, boxplot, histogram, ncol = 2)
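
For more control over the arrangement, gridExtra also accepts a layout_matrix and a shared title. The following is a minimal sketch with three small plots built only for this example; the plot names, the layout, and the title text are assumptions.

```r
library(ggplot2)
library(gridExtra)
library(faraway)

data(pima)
pima$bmi[pima$bmi == 0] <- NA
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))

p_hist <- ggplot(pima, aes(x = bmi)) +
  geom_histogram(bins = 30) +
  theme_minimal()
p_box <- ggplot(pima, aes(x = test, y = bmi, fill = test)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none")
p_bar <- ggplot(pima, aes(x = test, fill = test)) +
  geom_bar() +
  theme_minimal() +
  theme(legend.position = "none")

# layout_matrix: the histogram spans the top row, the other two share the bottom row
combined <- arrangeGrob(p_hist, p_box, p_bar,
                        layout_matrix = rbind(c(1, 1), c(2, 3)),
                        top = "Pima data overview")

grid::grid.newpage()
grid::grid.draw(combined)
```

arrangeGrob() builds the layout without drawing it, so the combined object can be stored or passed to ggsave().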

Conclusion

This presentation demonstrated how to visualize and analyze the Pima Indians Diabetes dataset using ggplot2, while applying design principles to enhance clarity, engagement, and interpretability. Effective data visualization is both a science and an art, relying on the thoughtful combination of aesthetics and analytics.

Data Visualization, ggplot2 and graphics designs

KEVIN KIPMUTAI

2025-05-23