This presentation explores data visualization using
ggplot2 with the Pima Indians Diabetes
Dataset. We will go through data cleaning, basic and advanced
plotting techniques, adding interactivity with plotly,
and applying key principles of graphic design.
Graphic design
Graphics design refers to the creation of visual representations of data, such as plots, charts, and diagrams, to facilitate data analysis, interpretation, and presentation. R offers a wide array of tools and packages to produce high-quality graphics, making it a powerful language for data visualization.
Key Concepts in Graphics Design in R
Base Graphics System:
R’s original graphics system, providing functions like plot(), hist(), boxplot(), and barplot(). Allows detailed customization of plots through parameters and functions.
Grid Graphics System:
Provides a flexible approach using the grid package for complex and custom layouts. Enables precise control over graphical objects and their positioning.
Lattice Graphics:
Built on top of the grid system, designed for creating multi-panel conditioned plots. Functions like xyplot(), bwplot(), densityplot() facilitate multi-variable visualization.
ggplot2:
A popular and powerful package based on the Grammar of Graphics concept. Allows building complex layered graphics systematically. Provides extensive customization options for aesthetics, themes, and annotations.
Principles of Good Graphics Design in R
Clarity: Ensure the visualization clearly conveys the intended message.
Simplicity: Avoid unnecessary clutter; focus on key data insights.
Consistency: Use consistent color schemes, scales, and formats.
Aesthetics: Make visuals appealing with appropriate themes and styles.
Interactivity: Use packages like plotly or shiny for interactive graphics.
Graphics design is a vital aspect of data analysis, enabling the transformation of raw data into insightful visual stories. Whether using base R graphics, lattice, or ggplot2, R provides versatile tools to create informative, attractive, and publication-quality visualizations tailored to various analytical needs.
The following principles guide effective and aesthetic data visualization, along with the tools and ggplot2 features used to achieve them:
1. Clarity: Ensure your message is easy to interpret at a glance. Tools Used: labs(), theme(), proper axis labels, clear titles and subtitles.
2. Consistency: Use consistent color schemes, fonts, and layouts. Tools Used: theme_minimal(), custom fonts (base_family), consistent scale_fill_manual()/scale_color_manual() across plots.
3. Hierarchy: Use size, color, and position to prioritize elements. Tools Used: annotate(), element_text(), scale_* functions for size and color.
4. Balance: Distribute visual weight evenly. Tools Used: Plot layout (e.g., grid.arrange()), spacing in theme().
5. Contrast: Emphasize key areas with contrasting elements. Tools Used: scale_fill_manual(), annotate() with distinct fill colors, theme_dark() or light/dark backgrounds.
6. Whitespace: Give plot elements room so the figure does not feel crowded. Tools Used: theme_*() functions, margin() settings within themes, simplifying plots by removing grid lines or redundant axes.
Libraries and Data Loading
library(ggplot2) # For plotting
library(dplyr) # For data manipulation
library(tidyr) # For data tidying
library(plotly) # For interactivity
library(gridExtra) # For combining plots
library(faraway) # For the Pima dataset
data(pima)
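Zero values in measurements such as glucose, diastolic blood pressure, triceps skinfold thickness, insulin, or BMI are physiologically implausible, so they are treated as missing values (NA). Zeros in 'pregnant' are legitimate counts and are left unchanged. Converting the 'test' variable to a factor with descriptive labels makes legends and axes easier to read in plots.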
# Convert zero values in specific columns to NA
cols_with_zero_na <- c("glucose", "diastolic", "triceps", "insulin", "bmi")
pima[cols_with_zero_na] <- lapply(pima[cols_with_zero_na], function(x) ifelse(x == 0, NA, x))
# Convert 'test' to a factor with labels
pima$test <- factor(pima$test, levels = c(0, 1), labels = c("Negative", "Positive"))
# Summary of cleaned data
summary(pima)
##     pregnant         glucose        diastolic         triceps
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##                   NA's   :5       NA's   :35       NA's   :227
##     insulin            bmi           diabetes           age
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780   Min.   :21.00
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437   1st Qu.:24.00
##  Median :125.00   Median :32.30   Median :0.3725   Median :29.00
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719   Mean   :33.24
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200   Max.   :81.00
##  NA's   :374      NA's   :11
##        test
##  Negative:500
##  Positive:268
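A quick sanity check for this cleaning step is to confirm that each column gains exactly as many NAs as it had zeros, and that columns where zero is a valid value stay untouched. The sketch below uses a small made-up data frame rather than the real pima data, and `zeros_to_na()` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: recode zeros to NA in the named columns only
zeros_to_na <- function(df, cols) {
  df[cols] <- lapply(df[cols], function(x) replace(x, which(x == 0), NA))
  df
}

# Toy stand-in for the pima data (values are made up)
toy <- data.frame(
  pregnant = c(0, 2, 5),        # zero is a legitimate count here
  glucose  = c(148, 0, 183),    # zero is implausible
  bmi      = c(33.6, 26.6, 0)   # zero is implausible
)
cols <- c("glucose", "bmi")

zero_counts <- colSums(toy[cols] == 0)          # zeros before cleaning
cleaned     <- zeros_to_na(toy, cols)
na_counts   <- colSums(is.na(cleaned[cols]))    # NAs after cleaning

# The two counts should agree column by column
print(rbind(zero_counts, na_counts))
```

Applied to the real data, the same comparison should reproduce the NA counts reported by summary(pima).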
This scatterplot shows the relationship between BMI and glucose levels, differentiated by diabetes test result. The interactive version allows dynamic exploration of individual data points.
geom_point() – Plots raw data points (used to highlight relationships between glucose and BMI).
aes(color = test) – Encodes the diabetes test result as color for comparison.
labs() – Enhances clarity with labeled axes and titles.
theme_minimal() – Minimalistic theme for readability.
plotly::ggplotly() – Makes it interactive with zoom, hover tooltips.
The graphic design principles applied in this scatterplot are:
Clarity (labels), Contrast (colors), Interactivity (plotly), Whitespace (theme), Hierarchy (color distinctions).
# Static scatterplot: glucose on x, BMI on y, colored by test result
scatter_plot <- ggplot(pima, aes(x = glucose, y = bmi, color = test)) +
  geom_point(alpha = 0.6) +
  labs(title = "BMI vs. Glucose by Diabetes Test Result",
       x = "Glucose", y = "BMI") +
  theme_minimal()
# Interactive plot
plotly::ggplotly(scatter_plot)
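The hover tooltips can also be tailored. One common pattern, sketched here on a small made-up data frame, is to build a label with an extra `text` aesthetic and ask ggplotly() to show only that label through its `tooltip` argument. The toy values and the label wording are assumptions for illustration.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for the cleaned pima data (values are made up)
toy <- data.frame(
  glucose = c(89, 137, 166, 118, 103),
  bmi     = c(28.1, 43.1, 25.8, 45.8, 43.3),
  test    = factor(c("Negative", "Positive", "Positive", "Positive", "Negative"))
)

# The 'text' aesthetic is not drawn by ggplot2; ggplotly() picks it up for the tooltip
p <- ggplot(toy, aes(x = glucose, y = bmi, color = test,
                     text = paste0("Glucose: ", glucose,
                                   "<br>BMI: ", bmi,
                                   "<br>Test: ", test))) +
  geom_point(alpha = 0.6) +
  theme_minimal()

# Show only the custom label when hovering over a point
interactive_p <- ggplotly(p, tooltip = "text")
```

Printing `interactive_p` in an interactive session or an HTML document renders the widget.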
The LOESS smoother helps identify trends in BMI across different ages for each test result category.
geom_smooth(method = "loess") – Adds trend line with confidence band for BMI vs. age.
geom_point() – Adds actual observations.
theme_light() – Ensures legibility.
The graphic design principles applied here are:
Hierarchy (trend emphasized), Clarity (smooth visual cue), Balance (layout and color).
smooth_plot <- ggplot(pima, aes(x = age, y = bmi, color = test)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = TRUE) +  # one smoother per test group
  labs(title = "BMI vs. Age with LOESS Smoother", x = "Age", y = "BMI") +
  theme_light()
plotly::ggplotly(smooth_plot)
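To see roughly what the smoother computes, the sketch below fits a LOESS curve directly with base R's loess() and builds an approximate 95% band from the standard errors. It uses simulated data, not the pima data. The span of 0.75 and degree of 2 are loess() defaults, and we assume they correspond to what geom_smooth() uses here; geom_smooth() additionally fits one curve per color group.

```r
set.seed(1)

# Simulated stand-in for age and BMI (not the real pima data)
toy <- data.frame(age = runif(200, min = 21, max = 81))
toy$bmi <- 30 + 4 * sin(toy$age / 15) + rnorm(200, sd = 3)

# Local quadratic regression with the default span
fit <- loess(bmi ~ age, data = toy, span = 0.75, degree = 2)

# Predict on a grid of ages inside the observed range, with standard errors
grid <- data.frame(age = seq(25, 75, by = 10))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate 95% confidence band around the fitted curve
half_width <- qt(0.975, pred$df) * pred$se.fit
band <- data.frame(
  age   = grid$age,
  fit   = pred$fit,
  lower = pred$fit - half_width,
  upper = pred$fit + half_width
)
print(band)
```

A smaller span follows the data more closely at the cost of a wigglier curve; a larger span gives a smoother trend.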
This bar chart displays the distribution of positive and negative diabetes test results.
geom_bar() – Counts diabetes outcomes.
scale_fill_manual() – Custom colors to contrast outcomes.
theme_classic() – Simple lines and no distractions.
The graphic design principles applied are:
Contrast (colors), Balance (symmetry), Clarity (simple theme), Consistency (colors reused).
bar_chart <- ggplot(pima, aes(x = test, fill = test)) +
  geom_bar() +  # counts rows per test result
  labs(title = "Count of Diabetes Test Results", x = "Test Result", y = "Count") +
  theme_classic()
ggplotly(bar_chart)
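The bar chart code above relies on ggplot2's default fill colors. A minimal sketch of how scale_fill_manual() could supply custom colors is shown below on made-up data. The named vector `test_colors` and its two colors are assumptions chosen for illustration, not values from this presentation; defining the palette once and reusing it in every plot is one way to keep colors consistent.

```r
library(ggplot2)

# Illustrative palette (an assumption): names must match the factor levels of 'test'
test_colors <- c(Negative = "steelblue", Positive = "firebrick")

# Toy stand-in for the test results (values are made up)
toy <- data.frame(
  test = factor(c("Negative", "Negative", "Negative", "Positive", "Positive"),
                levels = c("Negative", "Positive"))
)

bar_sketch <- ggplot(toy, aes(x = test, fill = test)) +
  geom_bar() +
  scale_fill_manual(values = test_colors) +  # same vector can be reused in other plots
  labs(title = "Count of Diabetes Test Results", x = "Test Result", y = "Count") +
  theme_classic()
```

The same vector can be passed to scale_color_manual() for point or line colors.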
This histogram illustrates the distribution of glucose levels, highlighting differences between test result groups.
geom_histogram() – Distribution of glucose levels.
alpha = 0.6 – Transparency to show overlaps.
bins = 30 – Controls granularity.
scale_fill_brewer() – Cohesive color palette.
Design principles used are:
Hierarchy (overlapping distributions), Contrast, Clarity (bin control), Consistency (same fill system).
histogram <- ggplot(pima, aes(x = glucose, fill = test)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  scale_fill_brewer(palette = "Set1") +  # palette choice is an assumption, matching the Set1 palette used later
  labs(title = "Distribution of Glucose Levels", x = "Glucose", y = "Frequency") +
  theme_bw()
ggplotly(histogram)
These boxplots summarize the BMI distribution, showing the median and variability within each test result group.
geom_boxplot() – Summarizes BMI distribution by outcome.
outlier.color – Highlights outliers for emphasis.
theme_minimal() – Avoids clutter.
scale_fill_manual() – Controls aesthetics consistently.
The graphic design principles applied are:
Clarity (whisker logic), Emphasis (outliers), Contrast (fill), Whitespace (theme).
boxplot <- ggplot(pima, aes(x = test, y = bmi, fill = test)) +
  geom_boxplot() +  # one box per test result
  labs(title = "BMI Distribution by Test Result", x = "Test Result", y = "BMI") +
  theme_minimal()
ggplotly(boxplot)
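The whisker logic mentioned above can be reproduced by hand. The sketch below, on made-up BMI-like values, computes the quartiles, places fences at 1.5 times the interquartile range beyond them, and flags anything outside the fences as an outlier. `iqr_outliers()` is a hypothetical helper; we assume it mirrors the default rule of geom_boxplot(), where whiskers extend to the most extreme observation inside the fences.

```r
# Hypothetical helper: flag values outside the 1.5 * IQR fences
iqr_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]                                # drop missing values first
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # first and third quartiles
  iqr <- q[2] - q[1]
  lower <- q[1] - coef * iqr
  upper <- q[2] + coef * iqr
  list(
    lower_fence = lower,
    upper_fence = upper,
    outliers    = x[x < lower | x > upper]
  )
}

# Toy BMI-like values (made up); the last one is far from the rest
toy_bmi <- c(22, 25, 27, 29, 31, 33, 35, 37, 40, 67)
res <- iqr_outliers(toy_bmi)
print(res)
```

In a plot, such points can be emphasized through the outlier.colour argument of geom_boxplot().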
We can switch the plot's appearance to a dark panel background with theme_dark().
Design Contribution:
Contrast: Highlights colored points against a dark background.
Accessibility: Useful for dark-mode presentations or for viewers with visual impairments.
Aesthetic: Helps shift tone and focus in presentations.
scatter_plot + theme_dark()
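Adjusting axis labels and limits focuses the plot on specific data ranges and improves clarity. Note that limits set through scale_x_continuous() and scale_y_continuous() drop observations outside the range (with a warning) rather than just cropping the view.
scatter_plot +
  scale_x_continuous(name = "Glucose Level", limits = c(50, 200)) +
  scale_y_continuous(name = "BMI", limits = c(10, 70))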
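When the goal is only to zoom, coord_cartesian() is an alternative that crops the view without discarding data. The sketch below contrasts the two approaches on a made-up data frame in which one point (glucose 44) lies below the lower limit. `kept()` is a hypothetical helper that counts the observations still available to the layer.

```r
library(ggplot2)

# Toy data (made up); the first point lies below the x limit of 50
toy <- data.frame(
  glucose = c(44, 90, 120, 150, 199),
  bmi     = c(22, 28, 35, 41, 67)
)
base_plot <- ggplot(toy, aes(x = glucose, y = bmi)) + geom_point()

# Scale limits: out-of-range observations become missing and are not drawn
p_scale <- base_plot + scale_x_continuous(limits = c(50, 200))

# Coordinate limits: the view is cropped, but every observation is kept
p_zoom <- base_plot + coord_cartesian(xlim = c(50, 200), ylim = c(10, 70))

# Hypothetical helper: count observations with usable x and y in the first layer
kept <- function(p) {
  d <- layer_data(p)
  sum(!is.na(d$x) & !is.na(d$y))
}

print(c(scale_limits = kept(p_scale), coord_limits = kept(p_zoom)))
```

The difference matters most for layers that compute statistics, such as geom_smooth() or geom_boxplot(), because scale limits change the data those statistics see.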
Using different color palettes can make plots more accessible and visually appealing. Here we use a ColorBrewer palette; these palettes are designed to give consistent, clearly distinguishable colors, although not every ColorBrewer palette is colorblind-safe.
scatter_plot + scale_color_brewer(palette = "Set1")
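To check which ColorBrewer palettes are flagged as colorblind-friendly, the RColorBrewer package ships a lookup table, `brewer.pal.info`. The sketch below filters it for qualitative palettes, which suit a categorical variable such as the test result. We assume RColorBrewer is installed; it normally comes along with ggplot2's dependencies.

```r
library(RColorBrewer)

# Metadata for every ColorBrewer palette: maxcolors, category, colorblind
info <- brewer.pal.info

# Qualitative palettes flagged as colorblind-friendly
qual_safe <- rownames(info)[info$category == "qual" & info$colorblind]
print(qual_safe)

# Is the palette used above in that list?
print("Set1" %in% qual_safe)
```

Any name from `qual_safe` can be passed to scale_color_brewer(palette = ...) in place of "Set1".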
Customizing the legend enhances interpretability and integrates it into the plot layout. Renaming the legend title makes the color encoding explicit, and moving the legend to the bottom improves balance and spacing.
scatter_plot +
  labs(color = "Diabetes Test Result") +
  theme(legend.position = "bottom")
Annotations draw attention to specific regions or points of interest within the plot. Here, a shaded box and a label highlight a region of interest (e.g., high-risk glucose/BMI levels).
scatter_plot +
  annotate("rect", xmin = 150, xmax = 200, ymin = 30, ymax = 50, alpha = 0.2, fill = "blue") +
  annotate("text", x = 175, y = 52, label = "High Risk Zone", color = "blue")
Titles and captions provide context and source information, making the plot self-explanatory. Here, three layers of text are added: a main title, an explanatory subtitle, and a data-source caption.
scatter_plot +
  labs(title = "BMI vs. Glucose Levels",
       subtitle = "Differentiated by Diabetes Test Result",
       caption = "Source: Pima Indians Diabetes Dataset")
Combining plots facilitates comparative analysis and gives a comprehensive view of different aspects of the data. grid.arrange() arranges multiple plots in a grid layout, allowing side-by-side comparisons.
grid.arrange(scatter_plot, boxplot, histogram, ncol = 2)
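With three plots in two columns, one cell of the grid stays empty. A possible refinement, sketched below on made-up data, is the `layout_matrix` argument of gridExtra, which lets one plot span a full row. The toy data and plot names are assumptions for illustration. arrangeGrob() builds the layout without drawing it; grid.arrange() accepts the same arguments and draws immediately.

```r
library(ggplot2)
library(gridExtra)

# Toy stand-in for the cleaned pima data (values are made up)
toy <- data.frame(
  glucose = c(89, 137, 166, 118, 103, 126),
  bmi     = c(28.1, 43.1, 25.8, 45.8, 43.3, 39.3),
  test    = factor(c("Negative", "Positive", "Positive",
                     "Positive", "Negative", "Negative"))
)

p_scatter <- ggplot(toy, aes(glucose, bmi, color = test)) + geom_point()
p_box     <- ggplot(toy, aes(test, bmi, fill = test)) + geom_boxplot()
p_hist    <- ggplot(toy, aes(glucose, fill = test)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 5)

# Plot 1 spans the whole top row; plots 2 and 3 share the bottom row
layout <- rbind(c(1, 1),
                c(2, 3))
combined <- arrangeGrob(p_scatter, p_box, p_hist, layout_matrix = layout)

# To render: grid::grid.newpage(); grid::grid.draw(combined)
```

Giving the scatterplot the full top row also supports the Hierarchy principle, since the largest panel draws attention first.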
This presentation demonstrated how to visualize and analyze the Pima
Indians Diabetes dataset using ggplot2, while applying
design principles to enhance clarity, engagement, and interpretability.
Effective data visualization is both a science and an
art, relying on the thoughtful combination of aesthetics and
analytics.