Summer 2020

Outline

  1. Course orientation
  2. Statistics Review
  3. Terminology & Concepts
  4. History of Visualization
  5. Traits of Meaningful Data
  6. Effective Graphing
  7. Limitations in Data Visualization

Course Orientation

Course Objective

This course:

  • Provides an interdisciplinary introduction to the principles and fundamentals of visualizing statistical information
  • Will cover:
    • limitations of common visualization methods
    • effective visualization methods
    • perception and representation issues
    • general principles in the design of effective data visualizations
  • Is a hands-on and project-based course

Course Materials

  • Reading Materials
    • Now You See It, by Stephen Few
    • Visualize This by Nathan Yau
    • My slides, examples, & lecture notes
    • Various assigned readings TBD
  • Software
    • Labs and examples will use the open-source statistical package R
    • We will also use the RStudio environment
    • And the ggplot2 visualization library for R

Course Requirements

To do well in this course, you must

  • Participate in (some of) the online Zoom sessions
  • Complete all assigned reading
  • Perform well on the R Lab Assignments (50%)
  • Perform well on the Midterm Project (20%)
  • Perform well on the Final Project (30%)

Course Topics Schedule (1)

  1. Introduction to Data Visualization; Introduction to R
  2. Perception & Representation; Handling Data; R Datatypes
  3. Analysis Interaction & Simple, Effective Graphs; Introduction to ggplot2 and RColorBrewer
  4. Techniques & Practices
  5. Analytical Patterns & Designing with Purpose
  6. Midterm project presentations

Course Topics Schedule (2)

  1. Visualizing & Analyzing Time and Proportions
  2. Visualizing & Analyzing Differences, Deviations, and Distributions
  3. Visualizing & Analyzing Relationships and Multidimensional Plotting
  4. Scientific Visualization
  5. Plotting in Other Tools (e.g., Python with Matplotlib & Seaborn)
  6. Final project presentations

General Data Visualization vs. R/ggplot2

  • The point of this class is not to teach specifically R and ggplot2
  • At first, I will keep R/ggplot2 lectures separate from general data viz topics
  • Once the basics are developed (by week 4), I will integrate them and try to provide example R/ggplot2 code for almost all plots I use
  • The labs are to make sure you understand enough of the tools to follow these examples
  • The projects can be done in any tool you like

Final Observations

  • This course will make heavy use of online materials via WebCourses
  • We will have hands-on online Zoom sessions each week
  • You are expected to read in advance of attendance
  • We will spend most of our time in discussion or in hands-on activities (usually in R)
  • Be ready to share your screen to participate in class activities

Statistics Review

Overview

  • The real world (like most simulations) is random and uncertain
  • We need a way to describe, predict, and draw conclusions from such observations
  • We do this using statistics
    • population - the thing in which we are interested … “truth”
    • parameters - the defining characteristics of a population
  • We usually cannot accurately know populations or the parameters describing a population

Statistics

  • So we have to collect examples from the population and characterize these examples
    • sample - subset of a population that we collect and can describe
  • Types of statistics:
    • Collecting data and analyzing it (descriptive statistics)
    • Using this to draw conclusions about a population (inferential statistics)
  • We can organize this data in a data matrix (a table)
    • Each row represents a specific thing observed
    • Each column represents a variable

Data Matrix Example (p.1)

Data collected on students in a statistics class on a variety of variables:

  • Each row is a single observation of a student
  • Each column is a variable we are measuring

Data Matrix Example (p.2)

Student   Gender   Intro/Extra   …   Dread
1         male     extravert     …   3
2         female   extravert     …   2
3         female   introvert     …   4
4         female   extravert     …   2
⋮         ⋮        ⋮             ⋱   ⋮
86        male     extravert     …   3
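
As a minimal sketch, the first four rows of this table could be held in an R data frame, which is R's usual form of a data matrix. The column names `student`, `gender`, `intro_extra`, and `dread` are assumptions made for illustration; the columns hidden behind "…" are left out.

```r
# First four rows of the example table; the column names are assumptions
students <- data.frame(
  student     = 1:4,
  gender      = c("male", "female", "female", "female"),
  intro_extra = c("extravert", "extravert", "introvert", "extravert"),
  dread       = c(3, 2, 4, 2)
)

nrow(students)    # number of observations (rows)
ncol(students)    # number of variables (columns)
students[3, ]     # one observation: everything recorded for student 3
students$dread    # one variable: the dread rating of every student
```

Indexing by row returns an observation, while indexing by column returns a variable, which mirrors the row/column roles described above.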

Types of variables

  • Numerical - numeric values
    • discrete - integers, counting numbers (e.g., \(1, 2, 3\))
    • continuous - real values (e.g., \(-1.2, \pi, 5.2\))


  • Categorical - non-numeric values
    • nominal - values that cannot be ordered (e.g., red, blue, green)
    • ordinal - values that can be ordered (e.g., tall, medium, short)
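
One way these four variable types might be represented in R is sketched below. The variable names and values are hypothetical, chosen only to match the examples in the bullets.

```r
# Hypothetical values chosen only to illustrate each variable type
siblings    <- c(1L, 2L, 3L)                    # numerical, discrete
temperature <- c(-1.2, pi, 5.2)                 # numerical, continuous
color       <- factor(c("red", "blue", "green"))  # categorical, nominal
height      <- factor(c("tall", "medium", "short"),
                      levels = c("short", "medium", "tall"),
                      ordered = TRUE)           # categorical, ordinal

is.integer(siblings)     # TRUE
is.double(temperature)   # TRUE
is.ordered(color)        # FALSE: there is no ordering among colors
is.ordered(height)       # TRUE
height[1] > height[2]    # TRUE: "tall" ranks above "medium"
```

Declaring the level order explicitly matters: without `levels`, R would sort the labels alphabetically, which is rarely the intended ordinal order.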

Associated vs. independent

  • Associated or dependent variables
    • show some connection with one another
  • Independent variables are those with no evident connection between the variables
  • Variables might be:
    • Negatively associated - As one increases in value or proportion, the other tends to decrease
    • Positively associated - As one increases in value or proportion, the other also tends to increase

Point vs. Summary Statistics

  • We sometimes use the word “statistic” to refer to a characteristic of data
  • Point Statistic - a specific measured value (e.g., I am 1.86 meters tall)
  • Summary Statistic - a value representing a characteristic of many values (e.g., average)
  • Summaries are abstract descriptions of a sample or population
  • Many inferential statistical methods deal with distributions of summary statistics (not point statistics)
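
The last point can be illustrated with a small simulation. This is only a sketch: the population (normal, mean 10, sd 3), the sample size of 100, the 1000 repetitions, and the seed are all assumptions picked for the example.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# One sample of 100 point values from a normal population (mean 10, sd 3)
one_sample <- rnorm(n = 100, mean = 10, sd = 3)

# The mean of each of 1000 such samples: a distribution of a summary statistic
sample_means <- replicate(1000, mean(rnorm(n = 100, mean = 10, sd = 3)))

sd(one_sample)     # spread of the point values, near 3
sd(sample_means)   # spread of the sample means, near 3 / sqrt(100) = 0.3
```

The sample means vary far less than the individual values do, which is why inferential methods reason about the distribution of the summary rather than of the raw points.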

Measures of Center

  • mean - statistical average of numeric data, \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)
  • median - middle-most value of numeric data, when sorted
  • mode - most common value of categorical data
  • proportion - frequency with which some categorical value occurs
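
A short sketch of how each measure might be computed in R, using made-up data. Note that base R has no built-in function for the statistical mode (its `mode()` function reports a storage type instead), so the example tabulates the values.

```r
# Hypothetical data for illustration
dread  <- c(3, 2, 4, 2, 3, 5, 1)
gender <- c("male", "female", "female", "female", "male", "female", "male")

mean(dread)     # arithmetic average: 20 / 7
median(dread)   # middle value of the sorted data: 3

# Mode: tabulate the categories and take the most frequent one
counts <- table(gender)
mode_value <- names(counts)[which.max(counts)]   # "female"

# Proportion of observations equal to a given category
prop_female <- mean(gender == "female")          # 4 / 7
```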

Mean

  • The sample mean, denoted \(\bar{x}\), is the average of observed values in a sample
  • The population mean is computed the same way but is denoted \(\mu\) (it typically cannot be calculated)
  • The sample mean is a sample statistic and serves as a point estimate of the population mean

# Draw 100 values from a normal population with mean 10 and sd 3
x <- rnorm(n = 100, mean = 10, sd = 3)
mean(x)   # sample mean; the exact value varies from run to run
## [1] 9.913674

Measures of Spread, p.1

  • range - difference between max and min values of numerical data
  • quartiles - values occurring at the \(\frac{1}{4}\) (\(Q_1\)), \(\frac{1}{2}\) (\(Q_2\)) and \(\frac{3}{4}\) (\(Q_3\)) positions in numeric data when sorted
  • interquartile range - difference between the \(Q_3\) and \(Q_1\) quartile points in numeric data
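  • variance - (roughly) the average squared distance between each point and the mean, \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\) for a sample
  • standard deviation - square root of the variance, \(s = \sqrt{s^2}\)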

Measures of Spread, p.2

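# 100 random draws; the output below is from one run and will differ without a fixed seed
x <- rnorm(n = 100, mean = 10, sd = 3)
range(x)        # minimum and maximum
## [1]  3.555334 17.820299
quantile(x)     # min, Q1, median (Q2), Q3, max
##        0%       25%       50%       75%      100% 
##  3.555334  8.118230  9.781637 11.764143 17.820299
IQR(x)          # Q3 - Q1
## [1] 3.645912
var(x); sd(x)   # variance and its square root
## [1] 9.169551
## [1] 3.028127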
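
To connect these functions back to the definitions, the following sketch recomputes each measure by hand on a small made-up vector and compares the result with R's built-ins. It assumes the sample form of the variance (dividing by \(n - 1\)), which is what `var()` uses.

```r
x <- c(2, 4, 4, 5, 7, 8)   # hypothetical data
n <- length(x)

v   <- sum((x - mean(x))^2) / (n - 1)   # sample variance: 24 / 5 = 4.8
s   <- sqrt(v)                          # standard deviation
q   <- quantile(x, c(0.25, 0.75))       # Q1 and Q3
iqr <- unname(q[2] - q[1])              # interquartile range

all.equal(v, var(x))     # TRUE
all.equal(s, sd(x))      # TRUE
all.equal(iqr, IQR(x))   # TRUE
diff(range(x))           # range as a single number: 8 - 2 = 6
```

Note that `range()` returns the minimum and maximum rather than their difference, so `diff(range(x))` is needed to get the range as defined on the previous slide.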

Measures of Relationship

  • correlation coefficient - strength and direction of the (linear) relationship between two numeric variables
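
A brief sketch of the correlation coefficient in R, using hypothetical measurements. The sign shows the direction of the association (positive or negative), and the magnitude shows its strength.

```r
# Hypothetical measurements
hours_studied <- c(1, 2, 3, 4, 5, 6)
exam_score    <- c(52, 60, 61, 70, 74, 83)   # tends to rise with hours
errors_made   <- c(9, 8, 8, 5, 4, 2)         # tends to fall with hours

r_pos <- cor(hours_studied, exam_score)    # close to +1: strong positive association
r_neg <- cor(hours_studied, errors_made)   # close to -1: strong negative association

# Pearson's r by hand: covariance scaled by both standard deviations
r_manual <- cov(hours_studied, exam_score) / (sd(hours_studied) * sd(exam_score))
all.equal(r_manual, r_pos)   # TRUE
```

By default `cor()` computes Pearson's coefficient, which only captures linear relationships; a value near zero does not rule out a strong nonlinear one.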

Terminology & Concepts

Term: Information Vis. vs. Scientific Vis.

  • Data visualization: umbrella term to cover all types of visual representations to support exploration, examination, and communication of data
  • Information visualization: type of visual representations that use computer-supported and possibly interactive mechanisms to “amplify cognition” (whatever that means)
  • Scientific visualization: type of visual representations of scientific information
  • Stephen Few sees this distinction as being about visualizing abstract information vs. visualizing physical phenomena
  • Alternatively, some people see the distinction as visualizing primarily non-spatial information (Info) vs. primarily spatial (Scientific)

Term: Statistical Data Visualization vs. Infographics

  • Infographic perspective: infographics provide context and tell a story (aesthetically), while statistical data visualizations are largely simple charts and graphs without context
  • Statistics perspective: statistical data visualizations communicate data relationships and patterns as clearly as possible, while infographics are more focused on persuading people with little attention to actual data
  • Really: Goals are entirely different and (somewhat) opposing

History of Visualization

Sumerian Tables; 2,600 BCE

Planetary Movements, 10th Century

Sunspots, Scheiner 1626

National Debt, William Playfair, 1786

Cholera Outbreak, John Snow, 1854

Preventable Deaths, Florence Nightingale, 1855

Napoleon’s Troops, Charles Minard, 1869

UK Exports, Arthur Bowley, 1901

Rise of Textbooks

  • Jacques Bertin, 1967
  • Edward Tufte, 1983
  • William Cleveland, 1985
  • IEEE Visualization Conference, 1990
  • Since then: An explosion of literature

Traits of Meaningful Data

What Makes Data Useful?

To be able to produce useful visualization, data must:

  • Quantity: exist in sufficient quantity
  • Consistent: be renderable on common scales, along common baselines
  • Structure: be referenceable in a common organizational structure
  • Clean & Clear: have few errors, omissions, and ambiguities
  • Trustworthy: come from a reputable source

Traits of Even More Useful Data

To produce the most meaningful results, data should also:

  • Atomic: give the data analyst access to the unaggregated, atomic components
  • Multivariate: permit the analyst to see how different variables compare
  • Contextual: permit the analyst to place the data in some context (space, time, group, etc.)

Dealing with Data

  • Most data isn’t all of these things
  • Some data isn’t any of these things …
  • The hardest part of data visualization is dealing with the raw data

Effective Graphing

Activity: What is “Effective Data Visualization”?

  • Discussion point: What makes a data visualization “good”?

Good Data Visualization

  • Important points are emphasized / annotated

  • Axes, symbols, and colors are described

  • Visual content clarifies (does not distract)

  • Is accurate, clear, and improves understanding

  • An “effective graph” communicates clearly

Good Data Visualization (p.2)

  • “Chart and graph design isn’t just about making statistical visualizations but also about explaining what the visualization shows.” (Nathan Yau)

  • Journalism in the Age of Data

  • A good visualization tells a story from data!

What Do We “Look” for in Data?

  • Patterns

  • Relationships \(\leadsto\) compare & contrast values

  • Anomalies

  • Focus / reduction of information

The Role of the Subjective

  • What is the role of the following in data visualization:
    • Art / aesthetics?
    • Entertainment / engagement?
  • We want to see the truth, to support some set of claims with data, but how can we do that if people can’t “see” it because they have lost interest?
  • How you present data is important … but entirely secondary to presenting data clearly

Graphic Design Process

Limitations in Data Visualization

Activity: What are some limitations in data visualization?

  • Discussion point: What makes a data visualization “bad”?

Bad Data Visualizations

  • Hide or obfuscate data
  • Lack context, labeling, or description
  • Are inaccurate or misleading
  • Focus more on art, iconography, or technology than delivering content, ideas, and data

Limitations in Data Visualization

We will discuss all of this in more detail as we go, but for now …

  • Avoiding pie charts
  • Avoiding 3D plot elements
  • Complications with multivariate visualization
  • Bad baselining examples
  • Complexity of judging differences

The Many Reasons to Avoid Pie Charts

  • It is bad at doing what it is designed to do: Difficult to judge relative size of the pie slices

  • Inefficient / inflexible use of space

  • Need many colors and high contrast to make wedges distinct

  • We’re much worse at estimating area than length — we’re especially bad at perceiving small differences in area

  • Pie charts make judging trends difficult
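
A small sketch of the comparison, using base R graphics and made-up shares that differ only slightly. The category names and values are hypothetical; the point is simply to draw the same numbers both ways.

```r
# Hypothetical shares for five categories (made-up numbers that sum to 100)
shares <- c(A = 23, B = 21, C = 20, D = 19, E = 17)

# Draw both charts side by side from the same data
op <- par(mfrow = c(1, 2))
pie(shares, main = "Pie chart")
bar_mids <- barplot(sort(shares, decreasing = TRUE),
                    main = "Bar chart", ylab = "Share (%)")
par(op)
```

In the pie chart the wedges look nearly identical, while the sorted bars make the ranking and the small gaps readable against a common baseline.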

Example: Pie Chart

Example: Bar Chart

Dedicated Process Hours

Dedicated Process Hours, v.2

A Series of Pie Charts

A Grouped Series of Bar Charts

Obfuscating Charts

3D effects make graphs harder to read

  • Are we to judge length? Area? Volume?

  • Display looks 3D when the angular perspective is offset, which makes referencing values on the axes harder

  • Display looks 3D when shading is employed, which clutters the graph and makes it harder to read

Making a Single Number Unnecessarily Hard to Read

3D vs. 2D Pie Chart Example

3D vs. 2D Bar Chart Example

Limitations of Bar Plots

When you have multiple variables to compare, there are several possibilities:

  • Stacked bar plots
    • Efficient use of space & clean
    • Variables at bottom easier to compare than variables at top
  • Separate bar plots
    • Good scaling control & clean
    • Extremely inefficient with space; cross-variable comparisons can be difficult
  • Grouped bar plots
    • Cross-variable comparisons are natural
    • Creates chart clutter and can be difficult to read
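
As a rough sketch, base R's `barplot()` can draw the stacked and grouped variants from the same matrix by toggling the `beside` argument. The counts, variable names (`X`, `Y`, `Z`), and group names (`G1` to `G3`) are hypothetical.

```r
# Hypothetical counts: rows are variables, columns are groups (filled column by column)
counts <- matrix(c(12, 7, 5,
                   10, 9, 6,
                    8, 11, 9),
                 nrow = 3,
                 dimnames = list(c("X", "Y", "Z"), c("G1", "G2", "G3")))

op <- par(mfrow = c(1, 2))
# Stacked: one bar per group, with the variables piled on top of each other
stacked_mids <- barplot(counts, main = "Stacked", legend.text = TRUE)
# Grouped: one bar per variable within each group, all on a shared baseline
grouped_mids <- barplot(counts, beside = TRUE, main = "Grouped", legend.text = TRUE)
par(op)
```

In the stacked version only the bottom variable sits on the baseline, which is why the upper segments are harder to compare; the grouped version fixes that at the cost of three times as many bars.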

Stacked Bar Plots

Separate Bar Plots

Grouped Bar Plots

Adding Another Dimension w/ Area

  • Plotting 2D values using a scatter plot is easy

  • If we have a categorical variable, we can sometimes use shading or color to add a third dimension

  • But if we have another numeric dimension, it’s challenging

  • Why not use point size (area)?

Bubble Plots

Unfortunately …

  • People don’t judge small differences in area very well
  • Using radius distorts values since the area increases with the square of the radius
  • Doubling the radius quadruples the area
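
The arithmetic behind this distortion can be checked directly. The sketch below uses three made-up values and compares mapping the value to the radius against mapping it to the area.

```r
values <- c(10, 20, 40)   # hypothetical data values

# Naive: radius proportional to value, so area grows with the square of the value
radius_naive <- values
area_naive   <- pi * radius_naive^2
area_naive / area_naive[1]     # 1, 4, 16: differences are exaggerated

# Area-proportional: radius proportional to the square root of the value
radius_scaled <- sqrt(values / pi)
area_scaled   <- pi * radius_scaled^2
area_scaled / area_scaled[1]   # 1, 2, 4: matches values / values[1]
```

In ggplot2, `scale_size_area()` is intended for this kind of area-proportional mapping, though readers will still struggle to judge small differences in area.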

Projected Population Sizes in Europe, July 2015

Baselines & Scales

  • Items compared should have the same baseline for comparison

  • That baseline should not distort the true data values

  • Scaling should be set properly for comparison (apples-to-apples)

  • Scaling should not distort the true data values

  • Data should always be properly adjusted
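
One common adjustment is converting raw counts to per-capita rates before comparing regions. The sketch below uses entirely made-up regions, counts, and populations; the column name `rate_per_100k` is also just an illustrative choice.

```r
# Hypothetical regions and counts, for illustration only
regions <- data.frame(
  region     = c("A", "B", "C"),
  cases      = c(500, 300, 120),
  population = c(1000000, 250000, 60000)
)

# Adjust raw counts to a common scale: cases per 100,000 people
regions$rate_per_100k <- regions$cases / regions$population * 1e5

# Sort by the adjusted rate, highest first
regions[order(-regions$rate_per_100k), ]
```

With these numbers the ranking reverses after adjustment: region A has the most cases but the lowest rate (50 per 100,000), while region C has the fewest cases but the highest rate (200 per 100,000).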

Steep Increases?

Similar Data, Properly Baselined

Similar Data, Adjusted by Population

Humans Are Bad at Judging Differences

  • Humans are good at finding patterns

  • We’re not so good at judging the differences between things

  • We re-orient things in our mind without realizing it, focusing on where things are most similar

  • So when we judge the difference between two curves (e.g., inflation in the US over time vs. inflation in Europe), we minimize differences by finding the points where they are the closest, regardless of orientation

  • This is particularly a problem when judging differences along a y-axis (which is how we typically plot things)

How Do These Curves Differ?

Visualize What You Want to Show

  • If you want to show the difference between two things then …
    • Don’t show \(\text{Thing A}\) and also \(\text{Thing B}\)
    • Instead show \((\text{Thing A} - \text{Thing B})\)
    • In other words: Don’t let the reader infer the difference, show the difference explicitly
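
A sketch of this idea in base R, using made-up yearly inflation rates for two hypothetical regions A and B. The left panel leaves the reader to infer the gap; the right panel plots the gap itself.

```r
# Hypothetical yearly inflation rates (%), made up for illustration
year   <- 2011:2016
rate_a <- c(3.1, 2.1, 1.5, 1.6, 0.1, 1.3)
rate_b <- c(2.7, 2.5, 1.4, 0.4, 0.0, 0.2)

gap <- rate_a - rate_b   # the quantity the reader actually cares about

op <- par(mfrow = c(1, 2))
# Two curves: the reader must estimate the gap by eye
matplot(year, cbind(rate_a, rate_b), type = "l", lty = 1,
        ylab = "Rate (%)", main = "A and B")
# One series: the gap is plotted directly against a zero baseline
plot(year, gap, type = "h", lwd = 4,
     ylab = "A - B (points)", main = "A - B")
abline(h = 0)
par(op)
```

Plotting the difference against zero also makes sign changes obvious, such as the one year here in which B exceeds A.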

Often Differences Are Surprising

Data-Ink Ratio

  • A notional concept by Edward Tufte that argues for keeping visualizations as simple as possible
  • Idea is to maximize: \(\frac{\text{ink required to represent actual data}}{\text{total ink used in the graphic}}\)
  • What is “data ink”? Anything whose erasure would remove underlying data (values, proportions, etc.)
  • What is “non-data ink”? Anything that can be erased without damaging the underlying data (e.g., iconographic images, non-data shading, borders, etc.)
  • “Above all else show the data.” (Tufte, 1983)
  • It is a good rule of thumb, but shouldn’t be followed blindly