Spring 2025

Basics of Visualization

Good Data Visualization

  • Important points are emphasized / annotated

  • Axes, symbols, and colors are described

  • Visual content clarifies (does not distract)

  • Is accurate, clear, and improves understanding

  • An “effective graph” communicates clearly

What Do We “Look” for in Data?

  • Patterns

  • Relationships \(\leadsto\) compare & contrast values

  • Anomalies

  • Focus / reduction of information

The Role of the Subjective

  • What is the role of the following in data visualization:
    • Art / aesthetics?
    • Entertainment / engagement?
  • We want to see the truth, to support some set of claims with data, but how can we do that if people can’t “see” it because of disinterest?
  • How you present data is important … but entirely secondary to presenting data clearly

Graphic Design Process

Bad Data Visualizations

  • Hide or obfuscate data
  • Lack context, labeling, or description
  • Are inaccurate or misleading
  • Focus more on art, iconography, or technology than delivering content, ideas, and data

Limitations in Data Visualization

  • Avoiding pie charts
  • Avoiding 3D plot elements
  • Complications with multivariate visualization
  • Bad baselining examples
  • Complexity of judging differences

The Many Reasons to Avoid Pie Charts

  • It is bad at doing what it is designed to do: Difficult to judge relative size of the pie slices

  • Inefficient / inflexible use of space

  • Need many colors and high contrast to make wedges distinct

  • We’re much worse at estimating area than length — we’re especially bad at perceiving small differences in area

  • Pie charts make judging trends difficult

Example: Pie Chart

Example: Bar Chart
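
A minimal sketch of the pie-versus-bar comparison, assuming made-up market-share values (the `shares` data frame and its column names are hypothetical, not taken from the original slides). The same data is drawn once as a pie chart, which ggplot2 builds as a stacked bar in polar coordinates, and once as a bar chart:

```r
library(ggplot2)

# Hypothetical data: five categories with similar shares
shares <- data.frame(
  Brand = c("A", "B", "C", "D", "E"),
  Share = c(22, 21, 20, 19, 18)
)

# Pie chart: a single stacked bar wrapped into polar coordinates
pie <- ggplot(shares, aes(x = "", y = Share, fill = Brand)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  ggtitle("Pie: ranking the slices is hard")

# Bar chart: the same values encoded as lengths on a common baseline
bars <- ggplot(shares, aes(x = Brand, y = Share)) +
  geom_col() +
  ggtitle("Bars: ranking is immediate")

print(pie)
print(bars)
```

With values this close together, the bar chart should make the ordering readable at a glance, while the pie slices look nearly identical.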

Obfuscating Charts

3D effects make graphs harder to read

  • Are we to judge length? Area? Volume?

  • Display looks 3D when the angular perspective is offset, which makes referencing values on the axes harder

  • Display looks 3D when shading is employed, which clutters the graph and makes it harder to read

Making a Single Number Unnecessarily Hard to Read

Limitations of Bar Plots

When you have multiple variables to compare, there are several possibilities:

  • Stacked bar plots
    • Efficient use of space & clean
    • Variables at bottom easier to compare than variables at top
  • Separate bar plots
    • Good scaling control & clean
    • Extremely inefficient with space; cross-variable comparisons can be difficult
  • Grouped bar plots
    • Cross-variable comparisons are natural
    • Creates chart clutter and can be difficult to read
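
The three layouts above can be sketched from one plot object, assuming a small hypothetical `sales` data frame in long form (the data and variable names are illustrative only):

```r
library(ggplot2)

# Hypothetical long-form data: units sold for three products over four quarters
sales <- data.frame(
  Quarter = rep(c("Q1", "Q2", "Q3", "Q4"), times = 3),
  Product = rep(c("X", "Y", "Z"), each = 4),
  Units   = c(10, 12, 15, 14,  8, 9, 7, 11,  5, 6, 9, 12)
)

base <- ggplot(sales, aes(x = Quarter, y = Units, fill = Product))

# Stacked: compact, but upper segments do not share a common baseline
stacked <- base + geom_col(position = "stack")

# Grouped: bars side by side, so cross-variable comparisons are direct
grouped <- base + geom_col(position = "dodge")

# Separate: one panel per product, at the cost of more space
separate <- base + geom_col() + facet_wrap(~ Product, ncol = 1)

print(stacked)
print(grouped)
print(separate)
```

Only the position adjustment or the faceting changes between the three versions; the data and aesthetic mapping stay the same.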

Stacked Bar Plots

Separate Bar Plots

Grouped Bar Plots

Adding Another Dimension w/ Area

  • Plotting 2D values using a scatter plot is easy

  • If we have a categorical variable, we can sometimes use shading or color to add a third dimension

  • But if we have another numeric dimension, it’s challenging

  • Why not use point size (area)?

Bubble Plots

Unfortunately …

  • People don’t judge small differences in area very well
  • Using radius distorts values since the area increases with the square of the radius
  • Double the radius means quadruple the area
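
A short sketch of the radius-versus-area issue, assuming hypothetical values that double at each step. `scale_radius()` and `scale_size_area()` are existing ggplot2 scales; the data frame `pts` and the helper `circle_area()` are made up for illustration:

```r
library(ggplot2)

# Area of a circle as a function of its radius
circle_area <- function(r) pi * r^2

# Ratio of areas when the radius doubles
area_ratio <- circle_area(2) / circle_area(1)

# Hypothetical data: the third variable doubles at each step
pts <- data.frame(x = 1:4, y = 1, value = c(10, 20, 40, 80))

# scale_radius(): point radius varies linearly with value,
# so the drawn areas exaggerate the differences
by_radius <- ggplot(pts, aes(x = x, y = y, size = value)) +
  geom_point() +
  scale_radius(range = c(2, 16))

# scale_size_area(): point area is proportional to value,
# and a value of zero maps to zero area
by_area <- ggplot(pts, aes(x = x, y = y, size = value)) +
  geom_point() +
  scale_size_area(max_size = 16)
```

Area scaling removes the distortion, but it does not fix the underlying perceptual problem: small differences in area remain hard to judge.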

Projected Population Sizes in Europe, July 2015


Baselines & Scales

  • Items compared should have the same baseline for comparison

  • That baseline should not distort the true data values

  • Scaling should be set properly for comparison (apples-to-apples)

  • Scaling should not distort the true data values

  • Data should always be properly adjusted
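
A minimal sketch of the baseline and adjustment points above, assuming invented counts and populations for two regions (none of these numbers come from the original slides):

```r
library(ggplot2)

# Hypothetical counts for two regions with very different populations
cases <- data.frame(
  Region     = c("North", "South"),
  Count      = c(5200, 5600),
  Population = c(1.0e6, 4.0e6)
)

# Adjust for population so the comparison is apples-to-apples
cases$Per100k <- cases$Count / cases$Population * 1e5

# Truncated baseline: a small difference in raw counts looks dramatic
truncated <- ggplot(cases, aes(x = Region, y = Count)) +
  geom_col() +
  coord_cartesian(ylim = c(5000, 5700))

# Zero baseline on adjusted data: bar lengths are proportional to the rates
adjusted <- ggplot(cases, aes(x = Region, y = Per100k)) +
  geom_col() +
  ylab("Cases per 100,000 residents")
```

In this made-up example the raw counts suggest South is slightly worse, while the population-adjusted rates point the other way.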

Steep Increases?


Similar Data, Properly Baselined

Similar Data, Adjusted by Population

Data-Ink Ratio

  • A notional concept by Edward Tufte that argues for keeping visualizations as simple as possible
  • The idea is to maximize: \(\frac{\text{ink required to represent actual data}}{\text{total ink used in the graphic}}\)
  • What is “data ink”? Anything that, were it erased, the underlying data would be removed (values, proportions, etc.)
  • What is “non-data ink”? Anything that can be erased without damaging the underlying data (e.g., iconographic images, non-data shading, borders, etc.)
  • “Above all else show the data.” (Tufte, 1983)
  • It is a good rule of thumb, but shouldn’t be followed blindly
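
One way to act on the data-ink idea in ggplot2 is to strip non-data elements through the theme. The sketch below is only an illustration using the built-in `mtcars` data; which elements count as dispensable is a judgment call:

```r
library(ggplot2)

# Default theme: gray panel background plus major and minor grid lines
default_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Reduced non-data ink: no background fill and no minor grid lines
lean_plot <- default_plot +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

print(default_plot)
print(lean_plot)
```

The points themselves (the data ink) are identical in both plots; only the surrounding decoration differs.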

ggplot2 Overview

Why ggplot2?

The ggplot2 library:

  • uses a consistent grammar of graphics
  • provides a high-level plot specification
  • allows users to think in terms of a layered data visualization pipeline
  • provides extensive visualization functionality for many common graphics
  • is widely used

Pipeline / Grammar

  • Users specify building blocks for a plot, then layer them as desired
  • Building blocks include:
    • data sets
    • aesthetic mapping (what fields map to what visual elements)
    • geometry objects (how to visually encode those elements)
    • transformations, coordinate systems, and scaling mechanisms
    • thematic fine-tuning (fonts, annotations, position adjustments)
    • faceting (multi-panel plotting)
  • Each layer is added onto the old layer, making it intuitive and natural

Basics of ggplot2

  • ggplot() is the foundational function in the ggplot2 library
  • It initializes a plotting object, setting up:
    • The data set to use
    • Which variables map to what plot elements (x, y, fill, size, etc.)
    • The core plot object
  • The object returned by ggplot() is not a visualization by itself
  • It is extended and interpreted with subsequent function calls for visualization

Data Frames

  • Most plot tools have functions that take explicit data types as arguments (e.g., the built-in plot() function takes R vectors for x and y)
  • ggplot consumes R data frames, and it understands the concept of variables (fields) in a data frame
  • ggplot understands the difference between numeric and categorical data and treats them differently
  • You can flexibly map variables to whatever visual elements you prefer
  • Data frames should be in long (statistical) form
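
A brief sketch of what long form means in practice, assuming a hypothetical wide table with one column per exam. Base R's `reshape()` is used here to avoid extra dependencies; other tools for this exist:

```r
library(ggplot2)

# Hypothetical wide-form data: one column per exam
wide <- data.frame(
  Student = c("Bob", "Sue", "Cat"),
  Midterm = c(88, 79, 93),
  Final   = c(96, 82, 97)
)

# Reshape to long form: one row per (student, exam) observation
long <- reshape(
  wide,
  direction = "long",
  idvar     = "Student",
  varying   = c("Midterm", "Final"),
  v.names   = "Score",
  timevar   = "Exam",
  times     = c("Midterm", "Final")
)

# In long form, the exam is just another variable that can be mapped
exam_plot <- ggplot(long, aes(x = Student, y = Score, fill = Exam)) +
  geom_col(position = "dodge")
print(exam_plot)
```

Once the exam is a column rather than part of the column names, it can be mapped to fill, color, or a facet like any other variable.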

Plot Element Visualization Attributes

Elements you choose to visualize may be set explicitly, or may be mapped to a variable using aes()

  • x – Position along the x-axis
  • y – Position along the y-axis
  • size – Point size (line thickness is set with linewidth in ggplot2 3.4.0 and later)
  • linetype – Type of line (dotted, dashed, solid, etc.)
  • color – Usually the color of the outer border of something
  • fill – The color of the inner fill of some shape

Plot Element Visualization Attributes (2)

Additional elements you choose to visualize may be set explicitly, or may be mapped to a variable using aes()

  • shape – The symbol being used for some point
  • linetype – The style of line objects (e.g., solid, dashed)
  • alpha – The transparency of an object
  • label – Text labels associated with an object
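
The difference between setting an attribute explicitly and mapping it with aes() can be sketched as follows. The data frame matches the grade example used elsewhere in these slides; `layer_data()` is a ggplot2 helper that returns the values computed for a layer:

```r
library(ggplot2)

rd <- data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74),
  LetterGrade = factor(c("A", "B", "A", "C"))
)

# Set explicitly: every point gets the same constant color (no legend)
set_plot <- ggplot(rd, aes(x = Student, y = NumberGrade)) +
  geom_point(color = "darkblue", size = 5)

# Mapped with aes(): color is driven by the LetterGrade variable (legend added)
mapped_plot <- ggplot(rd, aes(x = Student, y = NumberGrade)) +
  geom_point(aes(color = LetterGrade), size = 5)

# Inspect the colors ggplot2 actually computed for each point
set_colors    <- unique(layer_data(set_plot)$colour)
mapped_colors <- unique(layer_data(mapped_plot)$colour)
```

A constant belongs outside aes(); a variable name belongs inside it. Putting a constant such as "darkblue" inside aes() would treat it as a one-level variable rather than as a color.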

Creating a Plot Object with ggplot()

  • Load the ggplot2 library
  • Call ggplot(), providing:
    • The data frame to use
    • The variable mapping, aes()
library(ggplot2)
rd = data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74),
  LetterGrade = factor(c("A","B","A","C")) )

p = ggplot(rd, aes(y=NumberGrade))
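
As a rough check that the object returned by ggplot() is not yet a visualization, the sketch below inspects its layer list before and after a geom is added. Accessing `p$layers` relies on the plot object's internal structure, so treat it as an illustrative assumption rather than a stable interface:

```r
library(ggplot2)

rd <- data.frame(
  Student = c("Bob", "Sue", "Cat", "Lin"),
  NumberGrade = c(96, 82, 97, 74),
  LetterGrade = factor(c("A", "B", "A", "C"))
)

# The plot object holds data and mappings, but no layers yet
p <- ggplot(rd, aes(y = NumberGrade))
layers_before <- length(p$layers)

# Adding a geom (with the missing x mapping) gives ggplot2 something to draw
p2 <- p + geom_point(aes(x = Student))
layers_after <- length(p2$layers)
```

Printing `p` on its own draws an empty panel with a y-axis only; the points appear once a geom layer is present.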

Encoding Plot Elements with Geometric Objects

ggplot2 interprets plot elements using geom objects:

  • geom_point – Points
  • geom_bar – Bars
  • geom_line – Lines
  • geom_smooth – Smoothed curves
  • geom_polygon – Polygons
  • geom_boxplot – Boxplot

Geometric Objects (2)

Other Types of Layers

Aside from geom objects, there are other kinds of layers:

  • Scales – Scaling controls for the mapping between data and aesthetics (scale_x_discrete(), scale_size_continuous(), etc.). In general: scale_AESTHETIC_QUALIFIER
  • Coordinate systems – For transforming data to other coordinate systems (coord_flip(), coord_polar(), etc.)
  • Faceting – Splitting up data into trellis displays (facet_grid(), facet_wrap(), etc.)
  • Themes – Changing non-data elements of plot (element_text(), etc.)

http://docs.ggplot2.org/current/
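
A compact sketch that combines one layer of each kind listed above, using the built-in `mtcars` data. The particular choices (boxplots, flipped coordinates, faceting by transmission type) are arbitrary and only meant to show where each piece goes:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +                              # geom: how the data is drawn
  scale_x_discrete(name = "Cylinders") +        # scale: controls the x mapping
  coord_flip() +                                # coordinate system: swap x and y
  facet_wrap(~ am) +                            # faceting: one panel per value of am
  theme(axis.title = element_text(size = 14))   # theme: non-data styling

print(p)
```

Note how `scale_x_discrete()` follows the scale_AESTHETIC_QUALIFIER naming pattern: the aesthetic is x and the qualifier is discrete.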

Adding Layers to the Plot for Visualization

  • Use the “+” operator to add visualization layers to the plot object
  • Layers are placed in the order that they are added, in a pipeline
  • It’s easy to keep each layer separate and explicit
library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +   # Build the plot object
    geom_point(size=5) +                     # Encode visually using points
    xlab("Student Name") +                   # Label the X axis
    ylab("Numeric Grade") +                  # Label the Y axis
    ggtitle("Course Grade Results")          # Give the plot a title

Interpreting the Plot for Visualization

  • The same basic plot object can be interpreted differently:
library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +
    geom_bar(stat="identity") +              # Only line that changed...
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")
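
As a side note, ggplot2 also provides geom_col(), which draws bars whose heights come directly from the y variable, the same encoding as geom_bar(stat="identity"). The sketch below compares the two on the grade data; the variable names `with_identity` and `with_col` are illustrative:

```r
library(ggplot2)

rd <- data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74))

# Bar height taken directly from y via stat = "identity"
with_identity <- ggplot(rd, aes(x = Student, y = NumberGrade)) +
  geom_bar(stat = "identity")

# Shorthand for the same encoding
with_col <- ggplot(rd, aes(x = Student, y = NumberGrade)) +
  geom_col()

# Compare the bar heights each layer computed
same_heights <- isTRUE(all.equal(layer_data(with_identity)$y,
                                 layer_data(with_col)$y))
```

Without stat="identity", geom_bar() counts rows per x value instead of using a y variable, which is why the argument is needed in the slide code.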

Interpreting the Plot for Visualization

library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade)) +
    geom_bar(stat="identity") + 
    coord_flip() +
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")

Interpreting the Plot for Visualization

  • Other variables can be mapped to other plot features, as well
library(ggplot2)
rd = data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                NumberGrade = c(96, 82, 97, 74),
                LetterGrade = factor(c("A","B","A","C")) )

ggplot(rd, aes(x=Student,y=NumberGrade,fill=LetterGrade)) +
    geom_bar(stat="identity") + 
    xlab("Student Name") + 
    ylab("Numeric Grade") + 
    ggtitle("Course Grade Results")


Color vs. Fill

  • In ggplot2:
    • color refers to the color of borders
    • fill refers to the color of the inside fill
  • Except with the default point shape (shape 19), which is drawn like a font glyph in a single color and so uses color only (fill is ignored)
  • Unless you change the shape to a drawn shape (e.g., shape 21)

Using the Default Point Shape (Font)

library(ggplot2)

myData = data.frame(Furbletude=rnorm(30),
                    Blehmekness=rnorm(30))

ggplot(myData, aes(x=Furbletude, y=Blehmekness)) +
  geom_point(color="lightblue", fill="darkblue", size=4)


Using a Drawn Point Shape

library(ggplot2)

myData = data.frame(Furbletude=rnorm(30),
                    Blehmekness=rnorm(30))

ggplot(myData, aes(x=Furbletude, y=Blehmekness)) +
  geom_point(color="darkblue", fill="lightblue", size=4, shape=21)


Prebuilt Colors vs. Customized Colors

  • R has hundreds of prebuilt colors (type colors() at the console to list)
  • But you can also specify custom colors in several ways, including:
    • In RGB hex via a string – e.g., "#992B1A"
    • Using the rgb() function – e.g., rgb(0.26, 0.52, 0.87)
    • Using the hsv() function – e.g., hsv(0.17, 0.98, 0.66)
  • You can construct palettes as vectors of these
  • Or you can use prebuilt palettes
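
A small sketch of the color specifications above, assembled into a custom palette and applied with `scale_fill_manual()`. The palette name `my_palette` is made up, and the color values are the examples from the list, not a recommended scheme:

```r
library(ggplot2)

# Three ways to specify a custom color; each yields a hex color string
hex_color <- "#992B1A"
rgb_color <- rgb(0.26, 0.52, 0.87)   # red, green, blue in [0, 1]
hsv_color <- hsv(0.17, 0.98, 0.66)   # hue, saturation, value in [0, 1]

# A custom palette is a character vector of colors
my_palette <- c(hex_color, rgb_color, hsv_color)

rd <- data.frame(Student = c("Bob", "Sue", "Cat", "Lin"),
                 NumberGrade = c(96, 82, 97, 74),
                 LetterGrade = factor(c("A", "B", "A", "C")))

# scale_fill_manual() assigns the palette to the levels of the fill variable
palette_plot <- ggplot(rd, aes(x = Student, y = NumberGrade, fill = LetterGrade)) +
  geom_col() +
  scale_fill_manual(values = my_palette)
print(palette_plot)
```

The palette needs at least as many colors as the fill variable has levels (three here: A, B, and C).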

Customizing Colors

library(ggplot2)

myData = data.frame(Count=sample(1:10, 30, replace=T),
    Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count)) +
  geom_bar(stat="identity", color="white", fill=rgb(0.12, 0.76, 0.9))


For Data-Driven Properties Like Color, Use aes()

library(ggplot2)

myData = data.frame(Count=sample(1:10, 30, replace=T),
        Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T),
        TypeOfThing=sample(c("A", "B", "C"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count, fill=TypeOfThing)) +
  geom_bar(stat="identity", color="black")


Customizing Palettes

  • In ggplot2, the scale_…() functions are used to override data-driven properties like colors
  • Get used to using RColorBrewer
  • It’s nice because it has some good pre-built discrete and continuous palettes
  • So install it, if you have not already

Selecting a Pre-Built Palette from RColorBrewer

library(ggplot2)
library(RColorBrewer)

myData = data.frame(Count=sample(1:10, 30, replace=T),
        Awesomeness=sample(c("CoolThings", "SillyThings", "Meh"), 30, replace=T),
        TypeOfThing=sample(c("A", "B", "C"), 30, replace=T))

ggplot(myData, aes(x=Awesomeness, y=Count, fill=TypeOfThing)) +
  geom_bar(stat="identity") +
  scale_fill_brewer(palette="Set2") +  # Set2 is a color-blind friendly palette
  theme_bw()  # White background, light gray grid lines, dark panel border


Mapping Continuous Colors and Sizes

library(ggplot2)

ggplot(mtcars, aes(x=mpg,y=hp)) + 
  geom_smooth(linewidth=1.5, color="darkgray") +   # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(aes(size=gear,color=cyl)) +
  xlab("Miles per Gallon") + 
  ylab("Horse Power")
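
In the example above, `cyl` and `gear` are stored as numbers, so ggplot2 treats them as continuous and draws a color gradient and a continuous size legend. If a variable is really categorical, one option (sketched here as an illustration, not as part of the original example) is to wrap it in `factor()`:

```r
library(ggplot2)

# cyl is numeric in mtcars, which is why it gets a continuous color scale
cyl_is_numeric <- is.numeric(mtcars$cyl)

# Wrapping cyl in factor() makes it categorical, giving a discrete color scale
discrete_plot <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(size = gear, color = factor(cyl))) +
  labs(x = "Miles per Gallon", y = "Horsepower",
       color = "Cylinders", size = "Gears")

print(discrete_plot)
```

The legend then shows one distinct color per cylinder count instead of a gradient.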


Changing Font & Font Size

library(ggplot2)

ggplot(mtcars, aes(x=mpg,y=hp)) + 
  geom_point(size=4, shape=21, fill="lightblue", color="darkblue") +  
  xlab("Miles per Gallon") + 
  ylab("Horse Power") +
  theme(text=element_text(size=18, family="Times"))


Histograms

library(ggplot2)

ggplot(diamonds, aes(carat)) + 
  geom_histogram(binwidth=0.5, fill="wheat", color="black") +  
  xlab("Carat") + 
  ylab("Count") +
  ggtitle("Diamond Carat Distribution") +
  theme(text=element_text(size=18, family="Times"))


More Resources