Visualization, Part 1

Spring 2025

Basics of Visualization

Good Data Visualization

Important points are emphasized / annotated
Axes, symbols, and colors are described
Visual content clarifies (does not distract)
Is accurate, clear, and improves understanding
An “effective graph” communicates clearly

What Do We “Look” for in Data?

Patterns
Relationships \(\leadsto\) compare & contrast values
Anomalies
Focus / reduction of information

The Role of the Subjective

What is the role of the following in data visualization:
- Art / aesthetics?
- Entertainment / engagement?
We want to see the truth, to support some set of claims with data, but how can we do that if people can’t “see” it because of a lack of interest?
- Hans Rosling on Third vs. Western World
How you present data is important … but entirely secondary to presenting data clearly

Graphic Design Process

Bad Data Visualizations

Hide or obfuscate data
Lack context, labeling, or description
Are inaccurate or misleading
Focus more on art, iconography, or technology than delivering content, ideas, and data

Limitations in Data Visualization

Avoiding pie charts
Avoiding 3D plot elements
Complications with multivariate visualization
Bad baselining examples
Complexity of judging differences

The Many Reasons to Avoid Pie Charts

It is bad at doing what it is designed to do: Difficult to judge relative size of the pie slices
Inefficient / inflexible use of space
Need many colors and high contrast to make wedges distinct
We’re much worse at estimating area than length — we’re especially bad at perceiving small differences in area
Pie charts make judging trends difficult
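
A minimal sketch of the comparison, using made-up values (the data frame `shares` and its columns are hypothetical). In ggplot2 a pie chart is a stacked bar drawn in polar coordinates, so the same data can be shown both ways and the readability compared directly:

```r
library(ggplot2)

# Hypothetical data: five categories with similar values
shares <- data.frame(
  Brand = c("A", "B", "C", "D", "E"),
  Share = c(22, 21, 20, 19, 18)
)

# Pie chart: one stacked bar wrapped into polar coordinates
pie <- ggplot(shares, aes(x = "", y = Share, fill = Brand)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")

# Bar chart: the same values encoded as lengths on a common baseline
bars <- ggplot(shares, aes(x = Brand, y = Share)) +
  geom_col()

print(pie)
print(bars)
```

With values this close together, the ordering is usually much easier to read from the bars than from the wedges.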

Example: Pie Chart

Example: Bar Chart

Obfuscating Charts

3D effects make graphs harder to read

Are we to judge length? Area? Volume?
Display looks 3D when the angular perspective is offset, which makes referencing values on the axes harder
Display looks 3D when shading is employed, which clutters the graph and makes it harder to read

Making a Single Number Unnecessarily Hard to Read

Limitations of Bar Plots

When you have multiple variables to compare, there are several possibilities:

Stacked bar plots
- Efficient use of space & clean
- Variables at bottom easier to compare than variables at top
Separate bar plots
- Good scaling control & clean
- Extremely inefficient with space; cross-variable comparisons can be difficult
Grouped bar plots
- Cross-variable comparisons are natural
- Creates chart clutter and can be difficult to read
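
The three layouts can be sketched from one plot object. The data frame `scores` is hypothetical, and using `facet_wrap()` for the "separate" case is one reasonable choice, not the only one:

```r
library(ggplot2)

# Hypothetical long-form data: two variables measured for three groups
scores <- data.frame(
  Group    = rep(c("G1", "G2", "G3"), times = 2),
  Variable = rep(c("Midterm", "Final"), each = 3),
  Value    = c(70, 85, 90, 75, 80, 95)
)

base <- ggplot(scores, aes(x = Group, y = Value, fill = Variable))

stacked  <- base + geom_col(position = "stack")          # stacked bars
grouped  <- base + geom_col(position = "dodge")          # grouped (side-by-side) bars
separate <- base + geom_col() + facet_wrap(~ Variable)   # one panel per variable

print(stacked)
print(grouped)
print(separate)
```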

Stacked Bar Plots

Separate Bar Plots

Grouped Bar Plots

Adding Another Dimension w/ Area

Plotting 2D values using a scatter plot is easy
If we have a categorical variable, we can sometimes use shading or color to add a third dimension
But if we have another numeric dimension, it’s challenging
Why not use point size (area)?

Bubble Plots

Unfortunately …

People don’t judge small differences in area very well
Using radius distorts values since the area increases with the square of the radius
Double the radius means quadruple the area
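
The distortion can be checked with a little arithmetic. This is a small base R sketch with hypothetical values, comparing a radius-proportional encoding with an area-proportional one:

```r
# Hypothetical values: the second is twice the first
values <- c(10, 20)

# Naive encoding: radius proportional to value
radius_naive <- values
area_naive   <- pi * radius_naive^2
ratio_naive  <- area_naive[2] / area_naive[1]   # about 4: looks four times as large

# Area encoding: radius proportional to the square root of the value
radius_area <- sqrt(values / pi)
area_area   <- pi * radius_area^2
ratio_area  <- area_area[2] / area_area[1]      # about 2: area tracks the value

print(c(naive = ratio_naive, area = ratio_area))
```

In ggplot2, `scale_size_area()` scales point area (rather than radius) in proportion to the value, which avoids the first kind of distortion.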

Projected Population Sizes in Europe, July 2015

Baselines & Scales

Items compared should have the same baseline for comparison
That baseline should not distort the true data values
Scaling should be set properly for comparison (apples-to-apples)
Scaling should not distort the true data values
Data should always be properly adjusted
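
As an illustration of adjusting data and keeping a zero baseline, here is a sketch with invented numbers (the `regions` data frame is hypothetical). Raw counts are converted to a per-capita rate before plotting:

```r
library(ggplot2)

# Hypothetical counts for two regions of very different size
regions <- data.frame(
  Region     = c("North", "South"),
  Cases      = c(500, 2000),
  Population = c(100000, 1000000)
)

# Adjust raw counts to a rate per 100,000 residents
regions$Rate <- regions$Cases / regions$Population * 100000

# Bars start at zero by default; expand_limits() makes the baseline explicit
p <- ggplot(regions, aes(x = Region, y = Rate)) +
  geom_col() +
  expand_limits(y = 0) +
  ylab("Cases per 100,000 residents")
print(p)
```

South has four times the raw count of North, but once adjusted for population its rate is lower.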

Steep Increases?

Similar Data, Properly Baselined

Similar Data, Adjusted by Population

Data-Ink Ratio

A notional concept by Edward Tufte that argues for keeping visualizations as simple as possible
The idea is to maximize the ratio: \(\frac{\text{ink required to represent actual data}}{\text{total ink used in the graphic}}\)
What is “data ink”? Anything whose erasure would remove underlying data (values, proportions, etc.)
What is “non-data ink”? Anything that can be erased without damaging the underlying data (e.g., iconographic images, non-data shading, borders, etc.)
“Above all else show the data.” (Tufte, 1983)
It is a good rule of thumb, but shouldn’t be followed blindly
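
One way to experiment with the idea in ggplot2 is to strip non-data elements from a default plot. This is only a sketch; which elements count as removable is a judgment call:

```r
library(ggplot2)

# Default plot: gray panel background plus major and minor grid lines
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# theme_minimal() drops the gray panel background;
# the theme() call additionally removes the minor grid lines
lean <- p +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

print(p)
print(lean)
```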

ggplot2 Overview

Why ggplot2?

The ggplot2 library:

uses a consistent grammar of graphics
provides a high-level plot specification
allows users to think in terms of a layered data visualization pipeline
provides extensive visualization functionality for many common graphics
is widely used

Pipeline / Grammar

Users specify building blocks for a plot, then layer them as desired
Building blocks include:
- data sets
- aesthetic mapping (what fields map to what visual elements)
- geometry objects (how to visually encode those elements)
- transformations, coordinate systems, and scaling mechanisms
- thematic fine-tuning (fonts, annotations, position adjustments)
- faceting (multi-panel plotting)
Each layer is added onto the old layer, making it intuitive and natural
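
The building blocks above can be sketched one layer at a time. The example uses the built-in `mtcars` data set; the particular scale, facet, and theme choices are illustrative only:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg))   # data set + aesthetic mapping
p <- p + geom_point()                        # geometry: encode rows as points
p <- p + scale_x_log10()                     # scale: log-transform the x axis
p <- p + facet_wrap(~ cyl)                   # faceting: one panel per cylinder count
p <- p + theme_bw()                          # theme: non-data appearance
p <- p + labs(x = "Weight (1000 lbs, log scale)", y = "Miles per gallon")
print(p)
```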

Basics of ggplot2

ggplot() is the foundational function in the ggplot2 library
It initializes a plotting object, setting up:
- The data set to use
- Which variables map to what plot elements (x, y, fill, size, etc.)
- The core plot object
The object returned by ggplot() is not a visualization by itself
It is extended and interpreted with subsequent function calls for visualization

Data Frames

Most plot tools have functions that take explicit data types as arguments (e.g., the built-in plot() function takes R vectors for x and y)
ggplot consumes R data frames, and it understands the concept of variables (fields) in a data frame
ggplot understands the difference between numeric and categorical data and treats them differently
You can flexibly map variables to whatever visual elements you prefer
Data frames should be in long (statistical) form
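
A minimal sketch of what long form means, using a hypothetical wide table of exam scores (the names `wide`, `long`, `Exam1`, and `Exam2` are illustrative). Base R is used here to avoid extra packages:

```r
# Wide form: one column per exam
wide <- data.frame(
  Student = c("Bob", "Sue"),
  Exam1   = c(90, 80),
  Exam2   = c(85, 88)
)

# Long form: one row per (student, exam) observation
long <- data.frame(
  Student = rep(wide$Student, times = 2),
  Exam    = rep(c("Exam1", "Exam2"), each = nrow(wide)),
  Score   = c(wide$Exam1, wide$Exam2)
)

print(long)
```

In long form, `Exam` becomes an ordinary variable that can be mapped to an aesthetic such as fill or color.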

Plot Element Visualization Attributes

Elements you choose to visualize may be set explicitly, or may be mapped to a variable using aes()

x – Position along the x-axis
y – Position along the y-axis
size – Point width, line thickness, etc.
linetype – Type of line (dotted, dashed, solid, etc.)
color – Usually the color of the outer border of something
fill – The color of the inner fill of some shape

Plot Element Visualization Attributes (2)

Additional elements you choose to visualize may be set explicitly, or may be mapped to a variable using aes()

shape – The symbol being used for some point
alpha – The transparency of an object
label – Text labels associated with an object
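
The difference between setting an attribute and mapping it can be sketched as follows. The `cars` copy of `mtcars` and its `Model` column are introduced only for this illustration:

```r
library(ggplot2)

cars <- mtcars
cars$Model <- rownames(mtcars)   # hypothetical helper column for labels

p <- ggplot(cars, aes(x = wt, y = mpg)) +
  # Mapped inside aes(): shape varies with the data (cyl as a factor).
  # Set outside aes(): every point gets the same size and transparency.
  geom_point(aes(shape = factor(cyl)), size = 3, alpha = 0.6) +
  # Mapped: label text comes from the Model column
  geom_text(aes(label = Model), size = 2.5, vjust = -1)
print(p)
```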

Creating a Plot Object with ggplot()

Load the ggplot2 library
Call ggplot(), providing:
- The data frame to use
- The variable mapping, aes()

library(ggplot2)
rd = data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74),
  LetterGrade = factor(c("A","B","A","C")) )

p = ggplot(rd, aes(y=NumberGrade))
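
To see that the object returned by ggplot() is not yet a visualization, one can inspect its layers. This sketch assumes the plot object exposes its layers as `p$layers`, which holds for the ggplot2 versions I am aware of:

```r
library(ggplot2)

rd <- data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74)
)

p <- ggplot(rd, aes(x = Student, y = NumberGrade))
n_before <- length(p$layers)   # 0: no geoms have been added yet
print(p)                       # draws axes and grid only, with no data marks

p2 <- p + geom_point()         # adding a geom gives ggplot2 something to draw
n_after <- length(p2$layers)   # 1
print(p2)
```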

Encoding Plot Elements with Geometric Objects

ggplot2 interprets plot elements using geom objects:

geom_point – Points
geom_bar – Bars
geom_line – Lines
geom_smooth – Smoothed curves
geom_polygon – Polygons
geom_boxplot – Boxplot

Geometric Objects (2)

geom_density – Density plots
geom_histogram – Histograms

There are other geometry objects: http://docs.ggplot2.org/current/

Other Types of Layers

Aside from geom objects, there are other kinds of layers:

Scales – Scaling controls for the mapping between data and aesthetics (scale_x_discrete(), scale_size_continuous(), etc.). In general: scale_AESTHETIC_QUALIFIER
Coordinate systems – For transforming data to other coordinate systems (coord_flip(), coord_polar(), etc.)
Faceting – Splitting up data into trellis displays (facet_grid(), facet_wrap(), etc.)
Themes – Changing non-data elements of plot (element_text(), etc.)
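
A short sketch that combines one layer of each kind on the built-in `mtcars` data. The specific palette, facet variable, and font size are arbitrary choices for illustration:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  scale_color_brewer(palette = "Dark2", name = "Cylinders") +  # scale
  coord_flip() +                                               # coordinate system
  facet_wrap(~ am) +                                           # faceting
  theme(axis.title = element_text(size = 14))                  # theme
print(p)
```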

http://docs.ggplot2.org/current/

Adding Layers to the Plot for Visualization

Use the “+” operator to add visualization layers to the plot object
Layers are placed in the order that they are added, in a pipeline
It’s easy to keep each layer separate and explicit

library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +   # Build the plot object
    geom_point(size=5) +                     # Encode visually using points
    xlab("Student Name") +                   # Label the X axis
    ylab("Numeric Grade") +                  # Label the Y axis
    ggtitle("Course Grade Results")          # Give the plot a title

Interpreting the Plot for Visualization

The same basic plot object can be interpreted differently:

library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +
    geom_bar(stat="identity") +              # Only line that changed...
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")

Interpreting the Plot for Visualization

library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +
    geom_bar(stat="identity") + 
    coord_flip() +
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")

Interpreting the Plot for Visualization

Other variables can be mapped to other plot features, as well

library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade,fill=LetterGrade)) +
    geom_bar(stat="identity") + 
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")

Color vs. Fill

In ggplot2:
- color refers to the color of borders
- fill refers to the color of the inside fill
Except with the default point shape, which is actually a font – and so uses color only
Unless you change the shape to a drawn shape (e.g., shape 21)

Using the Default Point Shape (Font)

library(ggplot2)

myData = data.frame(Furbletude=rnorm(30),
                    Blehmekness=rnorm(30))

ggplot(myData, aes(x=Furbletude, y=Blehmekness)) +
  geom_point(color="lightblue", fill="darkblue", size=4)

Using a Drawn Point Shape

library(ggplot2)

myData = data.frame(Furbletude=rnorm(30),
                    Blehmekness=rnorm(30))

ggplot(myData, aes(x=Furbletude, y=Blehmekness)) +
  geom_point(color="darkblue", fill="lightblue", size=4, shape=21)

Prebuilt Colors vs. Customized Colors

R has hundreds of prebuilt colors (type colors() at the console to list)
But you can also specify custom colors in several ways, including:
- In RGB hex via a string – e.g., “#992B1A”
- Using the rgb() function – e.g., rgb(0.26, 0.52, 0.87)
- Using the hsv() function – e.g., hsv(0.17, 0.98, 0.66)
You can construct palettes as lists of these
Or you can use prebuilt palettes
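
A sketch of the three ways to specify a color, and of a hand-built palette applied with `scale_fill_manual()`. The `grades` data frame and `my_palette` are hypothetical names:

```r
library(ggplot2)

# Three ways to specify a custom color; rgb() and hsv() return hex strings
from_hex <- "#992B1A"
from_rgb <- rgb(1, 0, 0)      # "#FF0000" (pure red)
from_hsv <- hsv(0, 1, 1)      # "#FF0000" (hue 0, full saturation and value)

# A palette is a character vector of colors
my_palette <- c(from_hex, rgb(0.26, 0.52, 0.87), hsv(0.17, 0.98, 0.66))

grades <- data.frame(Letter = c("A", "B", "C"), Count = c(2, 1, 1))
p <- ggplot(grades, aes(x = Letter, y = Count, fill = Letter)) +
  geom_col() +
  scale_fill_manual(values = my_palette)
print(p)
```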

Customizing Colors

library(ggplot2)

myData = data.frame(Count=sample(1:10, 30, replace=T),
    Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count)) +
  geom_bar(stat="identity", color="white", fill=rgb(0.12, 0.76, 0.9))

For Data-Driven Properties Like Color, Use aes()

library(ggplot2)

myData = data.frame(Count=sample(1:10, 30, replace=T),
        Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T),
        TypeOfThing=sample(c("A", "B", "C"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count, fill=TypeOfThing)) +
  geom_bar(stat="identity", color="black")

Customizing Palettes

In ggplot2, the scale_…() functions are used to override data-driven properties like colors
Get used to using RColorBrewer
It’s nice because it has some good pre-built discrete and continuous palettes
So install it, if you have not already
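
A brief sketch of installing the package and exploring its palettes before using them in a plot. The variable name `set2` is illustrative:

```r
# install.packages("RColorBrewer")   # run once if the package is missing
library(RColorBrewer)

set2 <- brewer.pal(3, "Set2")        # three hex colors from the Set2 palette
print(set2)

# Metadata for one palette: max colors, category, and colorblind flag
print(brewer.pal.info["Set2", ])

# Preview only the palettes flagged as colorblind friendly
display.brewer.all(colorblindFriendly = TRUE)
```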

Selecting a Pre-Built Palette from RColorBrewer

library(ggplot2)
library(RColorBrewer)

myData = data.frame(Count=sample(1:10, 30, replace=T),
        Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T),
        TypeOfThing=sample(c("A", "B", "C"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count, fill=TypeOfThing)) +
  geom_bar(stat="identity") +
  scale_fill_brewer(palette="Set2") +  # Set2 is a color-blind friendly palette
  theme_bw()  # White background with light gray grid lines and a dark panel border

Mapping Continuous Colors and Sizes

library(ggplot2)

ggplot(mtcars, aes(x=mpg,y=hp)) + 
  geom_smooth(linewidth=1.5, color="darkgray") +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(aes(size=gear,color=cyl)) +
  xlab("Miles per Gallon") + 
  ylab("Horse Power")

Changing Font & Font Size

library(ggplot2)

ggplot(mtcars, aes(x=mpg,y=hp)) + 
  geom_point(size=4, shape=21, fill="lightblue", color="darkblue") +  
  xlab("Miles per Gallon") + 
  ylab("Horse Power") +
  theme(text=element_text(size=18, family="Times"))

Histograms

library(ggplot2)

ggplot(diamonds, aes(carat)) + 
  geom_histogram(binwidth=0.5, fill="wheat", color="black") +  
  xlab("Carat") + 
  ylab("Count") +
  ggtitle("Diamond Carat Distribution") +   # "+" is required so theme() is applied
  theme(text=element_text(size=18, family="Times"))

More Resources

Harvard’s R Graphics Tutorial
R-Statistics’ ggplot2 Tutorial
DataCamp’s Visualization with ggplot2 course
R CookBook’s Colors (ggplot2)