Summer 2020

Outline

  1. What to look for in time
  2. Discrete data over time
  3. Continuous data over time
  4. Issues with plotting cycles

What to Look for

Where the Focus is

Yau says, when visualizing patterns over time, we typically focus on:

  • Illustrating (dramatic) changes
  • Describing general trends
  • Describing some periodic pattern
  • Making implicit and explicit predictions

Time Series Patterns

Few has a similar list:

  • Trends
  • Variability
  • Co-variation
  • Cycles
  • Exceptions

Rules of Thumb

  • Typical to lay out time along the x-axis from left-to-right

  • When time data is discretized into “blocks” of time, use a discrete data visualization (e.g., bar plots, dot plots, etc.)

  • When time data is continuous, use a continuous data visualization (e.g., line plot, stacked area plot, step plot, etc.)

  • It’s okay to annotate specific points in time to support your narative or for comparison

  • Nathan Yau talks uses standard R functions, but we’ll use libraries like reshape2, dplyr, RColorBrewer, and ggplot2 to make our lives a lot easier

Time Series Displays

  • Line graphs
  • Bar graphs
  • Dot plots
  • Radar graphs, Cox-Comb plots, etc.
  • Heatmaps

Best Practices (1)

When viewing a time series, keep these in mind:

  • Aggregate to various time intervals
  • View time periods in context
  • Group related time intervals
  • Smooth, where appropriate
  • Treat missing data correctly
  • Make sure the aspect ratio is not misleading
  • Use log scales to compare rates of change

Best Practices (2)

Also these are sometimes helpful for trends and cycles:

  • Use cycle plots to examine trends and cycles together
  • Shift time to compare leading and lagging indicators
  • Stack line graphs to compare multiple variables

Some Interesting Examples:

Further Reading of Visualization Patterns over Time

Discrete Data over Time

Bar Graphs Over Time

  • Use bar graphs when:
    • You have numeric data
    • Data occurs in discrete intervals of time
  • Typically, the itervals appear along the x-axis
  • Use annotation, color, or shading to help emphasize key changes that occur

Hotdog Eating Contest

Hotdog Eating Contest Source, p.1

library(ggplot2)
library(dplyr)

# Get the data and add a more useful "new world record" field
hotdogs = read.csv("http://datasets.flowingdata.com/hot-dog-contest-winners.csv")
hotdogs = mutate(hotdogs, 
                 New.record=factor(c("Old record stands",
                                     "New world record")[hotdogs$New.record+1],
                                   ordered=T,
                                   levels=c("Old record stands","New world record")))

# Set a color palette with two colors (light grey and green)
hdPalette <- c("#999999","#20B400")

# Your basic bar plot call for ggplot2
ggplot(hotdogs, aes(x=Year, y=Dogs.eaten, fill=New.record)) + 
  geom_bar(stat="identity") + 
  # More on the next slide ...

Hotdog Eating Contest Source, p.2

  # Change the fill color values to use the palette we made
  scale_fill_manual(values=hdPalette) +
  #
  # Make sure the y-axis ticks every 10 HDBs and label everything
  scale_y_continuous(breaks = seq(from=0,to=70,by=10)) +
  ylab("Hotdogs and Buns Eaten (HDBs)") +
  ggtitle("Nathan's Hot Dog Context Results, 1980-2010") +
  #
  # We don't need a legend title b/c we made the new record
  # labels easier to understand
  guides(fill=guide_legend(title=NULL)) +
  #
  # I like serif fonts, so use Times instead of the default font
  # and make the font bigger
  theme_bw(base_family="Times") +
  theme(text=element_text(size=18)) +
  #
  # More on the next slide ...

Hotdog Eating Contest Source, p.3

  #
  # Get rid of all the grid lines and the border then instead
  # generate Tufte-style "anti-grid" lines *through* the bars
  theme(panel.grid.major=element_blank(), 
        panel.grid.minor=element_blank(),
        panel.border = element_blank()) +
  geom_hline(yintercept=seq(from=0,to=70,by=10), col="white") +
  #
  # Annotate when important changes occurred
  annotate("text",2001,56,label="Takeru Koyahsi", hjust=0) + 
  annotate("segment",x=2000.6,xend=2006.4,y=55,yend=55,size=1.2) +
  annotate("text",2007,70,label="Joey Chestnut", hjust=0) + 
  annotate("segment",x=2006.6,xend=2010.4,y=69,yend=69,size=1.2)) 

Stacked Bar Graphs Over Time

Stacked Bar Graphs Over Time Source, p1

library(ggplot2)      # For plotting
library(dplyr)        # For mutate()
library(RColorBrewer) # For color scaling
library(reshape2)     # For melt()

# Get the data in the right format, with the Year as an ordered
# factor and removing the prefix 'X' from the strings in the 
# dataset
rawHotdogs = read.csv("http://datasets.flowingdata.com/hot-dog-places.csv")
rawHotdogs$Place = rownames(rawHotdogs)
hotdogs = melt(rawHotdogs)
YearStr=as.character(gsub("X","",as.character(hotdogs$variable)))
hotdogs = mutate(hotdogs,
                 Year=factor(YearStr, 
                             ordered=TRUE, 
                             levels=unique(YearStr)))

# More on the next slide ...

Stacked Bar Graphs Over Time Source, p2

# Build a linear color palette of greens, then reverse it
# so that the darkest green is first and so on.
hdPalette <- rev(colorRampPalette(brewer.pal(5,"Greens"))(4))

# Basic stacked bar plot
ggplot(hotdogs, aes(x=Year,y=value,fill=Place)) + 
  geom_bar(stat="identity") +
  #
  # Use the provided color palette for "fill"
  scale_fill_manual(values=hdPalette) +
  #
  # More on the next slide ...

Stacked Bar Graphs Over Time Source, p3

  #
  # Use a white background and a large serif font
  theme_bw(base_family="Times") +
  theme(text=element_text(size=18)) +
  #
  # Tick every 25 HDBs, label the y-axis properly, and give title
  scale_y_continuous(breaks = seq(from=0,to=150,by=25)) +
  ylab("Hotdogs and Buns Eaten (HDBs)") +
  ggtitle("Hot Dog Eating Context Results, 2000-2010")

2-Directional Stacked Bar Graphs

2-Directional Stacked Bar Graphs Source, p.1

# Get data directly from the Bureau of Labor Statistics
blsDataURL = "https://download.bls.gov/pub/time.series/ce/ce.data.00a.TotalNonfarm.Employment"
empl <- read.table(blsDataURL, header=F,flush=T,skip=1,
                   col.names=c("ID","Year","Period","Value"))
emplSmall = filter(empl,
                   (Year>2000) & (Year<2015), # Get only 2001-2014
                   Period != "M13",           # Remove any year-end redundancies
                   ID=="CES0000000001")       # Select Total National data series

# Create a "RawYear" variable that includes months in numeric value 
emplSmall = mutate(emplSmall,
                   Month = as.numeric(gsub("M","",as.character(Period))),
                   RawYear = Year + (Month-1)/12)
#
# More on the next slide ...

2-Directional Stacked Bar Graphs Source, p.1

# Number of rows (observations)
nr = nrow(emplSmall)

# Since we are computing the *change* in jobs, we must subtract
# each period from the preceding period.  This also means re-indexing
# from the second period onward.  While we're at it, let's add a
# categorical variable for the administration.
Employment = data.frame(
  EmplChange = emplSmall$Value[2:nr] - emplSmall$Value[1:(nr-1)],  # Delta per period
  Year = emplSmall$RawYear[2:nr],
  Administration = factor(c("Bush","Obama")[1+(emplSmall$RawYear[2:nr] > 2009.08)])
 )
hdPalette <- c("firebrick","deepskyblue") # Custom bar colors

# Standard plot command
ggplot(Employment, aes(x=Year, y=EmplChange, fill=Administration)) +
  geom_rect(aes(xmin=Year-(1/24), xmax=Year+(1/24), ymin=0, ymax=EmplChange), color="white") +
  scale_fill_manual(values=hdPalette) +
# More on the next slide ...

2-Directional Stacked Bar Graphs Source, p.3

  # Draw the baseline 
  geom_hline(yintercept=0, col="black", size=1) +  
  #
  # Set tick spacing, label the axis and title, set the font
  scale_y_continuous(breaks = seq(from=-800,to=600,by=200)) +
  ylab("Employees (Thousands)") +
  ggtitle("New Jobs in the United States, 2001-2014") +  
  theme(text=element_text(size=18, family="Times")) +
  #
  # Annotate the sides of the plot corresponding with gains vs. loss
  annotate("text",2001,450,label="Gain in Jobs",
           hjust=0, family="Times", fontface="bold") +
  annotate("text",2001,-450,label="Loss in Jobs",
           hjust=0, family="Times", fontface="bold")

Dot Plots Over Time

Dot Plots Over Time Source, p.1

# Get and reformat the data
subscribers = read.csv("http://datasets.flowingdata.com/flowingdata_subscribers.csv")
subscribers = transform(subscribers,
                        Subscribers = Subscribers/1000,  # we'll show val per K
                        Date = strptime(as.character(Date),"%m-%d-%Y"),
                        Day = 1:length(Date))

# Nathan Yau's x-axis shows ticks every day and labels every 5 days ...
# Except Day 1, which is also labeled.  I construct this to use later
# to deal with this strangeness.
weirdXLabels = c("1",rep("",3),
                 "5",rep("",4),
                 "10",rep("",4),
                 "15",rep("",4),
                 "20",rep("",4),
                 "25",rep("",4),
                 "30")
# Start the plot
ggplot(subscribers, aes(x=Day, y=Subscribers)) +
  # More on the next slide ...

Dot Plots Over Time Source, p.2

  # Place all the annotations.  The segments, at least, have to
  # come first so that the points will occlude them...
  #
  # Annotate the errant points
  annotate("text",14,11,label="Reporting Error",
           hjust=0, family="Times", fontface="bold") +
  annotate("text",14,9,hjust=0, family="Times",
           label="A source reported incorrect subscriber\n counts for these days") +
  #
  # Annotate the first and end point, drawing line segments to
  # the actual point.
  annotate("text",1,22,label="25,047", family="Times", fontface="bold") +
  annotate("segment",x=1,y=22.5,xend=1,yend=25.047,size=0.5) +
  annotate("text",30,22,label="27,611", family="Times", fontface="bold") +  
  annotate("text",30,21,label="(+10%)", family="Times") +
  annotate("segment",x=30,y=22.5,xend=30,yend=27.611,size=0.5) +
  #
  # A weird hack to get the top of the graph to look like his does ...
  annotate("text",x=0.5,y=30,label="thousand subscribers",hjust=0, family="Times") +
  # More on the next slide

Dot Plots Over Time Source, p.3

  # Draw the actual points
  geom_point(shape=21, fill="firebrick", size=5) +
  #
  # Set all the theme elements 
  theme_bw() +
  theme(text=element_text(size=18,family="Times"), # OK, so he doesn't use serif ...
        panel.grid.major.x=element_blank(),  # Get rid of major x grid lines
        panel.grid.minor.x=element_blank(),  # Get rid of minor x grid lines
        panel.border = element_blank(),      # Get rid of the border
        axis.ticks.y = element_blank()) +    # Get rid of y-axis tick marks
  #
  # Set the grid lines and labels on the axes, draw the y-axis baseline back in
  scale_y_continuous(breaks = seq(from=0,to=30,by=5)) +
  scale_x_discrete(breaks = 1:30, labels=weirdXLabels) +
  geom_hline(yintercept=0, col="black", size=0.5) +
  ylab("") +
  xlab("January 2010") +
  ggtitle("Increase in RSS & Email Subscribers in January 2001") 

Continuous Data over Time

Connecting the Dots

  • The simplest way to communicate continuity or connection is by connecting points using line segments

  • Since all data is discrete at some point, the distinction between when to use a dot plot for “discrete” data and when to use a line plot is not always clear

  • In the previous examples, we were making observations within a discete interval of time (e.g., all the subscribers in a give day)

  • Alternatively, we might be simply sampling some value at discrete intervals (e.g., the amount of CO2 in the atmosphere at periodic intervals)

  • We sometimes refer to the latter as a time series

Types of Time-Series Pattern Analysis

  • Trends – overall tendency of a series (e.g., increase vs. decrease)

  • Variability – degree of change from one point in time to the next over some time span

  • Rate of change – overal rate of change over some time span

  • Co-variation – how variation in one time series affects another

  • Cycles – tendency for a time series to exibit periodic patterns

  • Exceptions – sudden changes in how a time series proceeds

Expenditure Over Time

Expenditure Over Time Source

library(dplyr)
library(ggplot2)
ggplot(economics,aes(x=date, y=pce)) + 
  geom_line(size=1.25,color="darkblue") +
  xlab("Date") +
  ylab("Personal Consumption Expenditures (billions of dollars)") +
  theme(text=element_text(size=18))

Time Series Data in R, p.1

  • R provides a special kind of data frame called a time series (ts)
    • Typically a vector or vectors of values, as well as
    • A start point, end point, and frequency specifier
  • Basic R plotting commands understand ts

  • Some basic R routines for ts analysis
    • stl() – decomposes a time series into seasonal, trending, and irregular components
    • HoltWinters() – Performs a kind of exponential smoothing over time series data for cleaning visualization
    • arima() – fits a different kind of model to a ts and can be used for prediction

Time Series Data in R, p.2

  • Some installable packages provide additional modeling and prediction tools for ts
    • e.g., forecast
  • Sadly, ggplot2 does not understand ts or ts models

  • We’ll talk more about “models” another week

CO2 Data from 1959-1997

plot(co2,lwd=2,col="darkblue");grid()

Seasonaly Decomposition of CO2 Data

plot(stl(co2,"per"),lwd=2,col="darkblue");grid()

Predicting with the ARIMA model

Predicting with the ARIMA model Source, p.1

# Build a model using appropriate model parameters
myModel = arima(co2, order=c(10,2,1))
myPrediction = predict(myModel,22*12)

# Setup the initial plot
plot(co2,xlim=c(1959,2020),ylim=c(300,410),
     lwd=1.25,col="darkblue",
     xlab="Date",ylab="Atmospheric Concentration of CO2 (ppm)")
grid()

# Sketch out today's measure
# From:  http://co2now.org/current-co2/co2-now/
segments(y0=400.26, y1=400.26, x0=0, x1=2015+2/12, col="darkorange", lty=2)
segments(y0=0, y1=400.26, x0=2015+2/12, x1=2015+2/12, col="darkorange", lty=2)
points(2015+2/12,400.26,pch=19,col="darkorange")
text(2006,402,"Atmospheric Concentration of CO2 in Feb 2015",col="darkorange")
segments(y0=416.27, y1=416.27, x0=0, x1=2020+6/12, col="darkred", lty=2)
segments(y0=0, y1=416.27, x0=2020+6/12, x1=2020+6/12, col="darkred", lty=2)
points(2020+6/12,416.27,pch=19,col="darkred")
text(2011,417.7,"Atmospheric Concentration of CO2 in Jun 2020",col="darkred")

# Add the predictions
lines(myPrediction$pred,lwd=4,col="steelblue")

Step Plots

  • Line plots imply steady (linear) change between the sample points
  • For some time-based data, changes:
    • Occur at specific moments in time
    • Do not change between observations
  • For this kind of data, it’s more appropriate to use a step-plot

Postage Price Change

Postage Price Change Source, p.1

# Get the data
postage = read.csv("http://datasets.flowingdata.com/us-postage.csv")

# Make labels for prices to put over steps
priceLabels = as.character(100*postage$Price)
priceLabels[1] = paste(priceLabels[1],"cents")

# Make the special, irregular x-axis year labels
yearLabels = c("1991",rep("",3),
                "'95",rep("",3),
                "'99",rep("",1),
                "2001",
                "'02",rep("",3),
                "'06", "'07", "'08", "'09",
               "")

Postage Price Change Source, p.2

# Create the basic plot
ggplot(postage, aes(x=Year, y=Price)) +
  geom_step(size=1.5, color="firebrick") +
  geom_text(aes(label=priceLabels), vjust=-0.5, hjust=0) + # Label the prices
  #
  # Blank out axis elements and set the font
  theme(panel.grid.major=element_blank(), 
        panel.grid.minor=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title=element_blank(),
        text=element_text(size=18,family="Times")) +
  #
  # Add the irregular x-axis labels then title the thing
  scale_x_continuous(breaks=seq(from=1991,to=2010,by=1), labels=yearLabels) +
  ggtitle("United States Postage Rate, 1991-2010")

Smoothing over Rough Spots

  • Sometimes creating a line plot is misleading because:
    • There’s too much noise in the data, which hides trends
    • There’s missing data over which we’d like to generalize
    • There are so many points, the plot is distracting
  • One way to address this is to try to fit some kind of model to the data, and plot that model instead of, or over top the real data
  • For example:
    • Fitting a line to data (linear regression)
    • Fitting a curve to data (generalized regression)
    • Fitting piece-wise curves to data (LOESS)

Fitting a Line

Fitting a Line Source

ggplot(economics, aes(x=date,y=psavert)) +
  geom_point(shape=21, fill="white", size=3) +
  geom_smooth(method=lm, se=FALSE, color="darkblue", size=1.5) +
  theme(text=element_text(size=18, family="Times")) +
  xlab("Date") + ylab("Personal Savings Rate") +
  ggtitle("United States Personal Savings Rates, 1967-2007")

Fitting a Smoothed Curve

Fitting a Smoothed Curve Source, p.1

# Get the data
utilURL ="http://eecs.ucf.edu/~wiegand/ids6938/datasets/February-2015-SystemUtilization.txt"
sysutil = read.csv(utilURL)
sysutil = transform(sysutil,
                 Date=strptime(DateTime,"%m/%d/%Y %H:%M"),
                 PctUtil=100*System.Utilization)

# Create the plot
ggplot(sysutil, aes(x=Date, y=PctUtil)) + 
  # Set axis and title lables, also set y-axis limits
  xlab("Date") + ylab("System Utilization (%)") + ylim(c(0,100)) +
  ggtitle("Stokes System Utilization for February 2015") +
  #
  # More on next slide ...

Fitting a Smoothed Curve Source, p.2

  # Draw the points
  geom_point(shape=21,size=2,color="pink",fill="pink") +
  #
  # Use a white background and set the font properties
  theme_bw() +
  theme(text=element_text(size=18, family="Times")) +
  #
  # Draw the piece-wise, smoothed curve fit using LOESS
  stat_smooth(size=1.75, color="firebrick", method=loess)

Issues with Plotting Cycles

The Challenge of Cyclical Time Periods

When data is cyclical, it can be complicated to visualize

  • Points that might be placed close together are often physically separated

  • Plot types that resolve this can be difficult to understand

  • Multiple data trends on one plot can be confusing and cluttered

January and December Aren’t Really Far Apart

January and December Aren’t Really Far Apart

Year = ordered(c(rep(1974,12), rep(1975,12), rep(1976,12), rep(1977,12), rep(1978,12), rep(1979,12)))
MonthList = c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec')
Month = factor(rep(MonthList,6),
               levels=MonthList,
               ordered=T)
Deaths = as.numeric(ldeaths)
UKDeaths = data.frame(Year, Month, Deaths)

ggplot(UKDeaths, aes(x=Month, y=Deaths, fill=factor(Year))) + 
  geom_bar(stat="identity", position="dodge", color="black", width=0.75) + 
  geom_hline(yintercept=0, size=1.1) +
  scale_fill_brewer(palette="YlOrRd", name="Year") +
  theme(text=element_text(size=18, family="Times")) +
  ggtitle("Deaths for Lung Disease in the UK")

But Radar Plots Are Hard To Understand

But Radar Plots Are Hard To Understand

ggplot(UKDeaths, aes(x=Month, y=Deaths, group=Year, color=factor(Year))) + 
  geom_line(size=2) + 
  coord_polar() + 
  scale_color_brewer(palette="Set1", name="Year") +
  theme(text=element_text(size=18, family="Times")) +
  ggtitle("Deaths for Lung Disease in the UK")

And Heat Maps Maybe Have Both Problems …

And Heat Maps Maybe Have Both Problems …

ggplot(UKDeaths, aes(x=Month, y=Year, fill=Deaths)) + 
  geom_tile(color="white") + 
  scale_fill_gradient(low="white", high="steelblue" ) +
  theme(text=element_text(size=18, family="Times")) +
  ggtitle("Deaths for Lung Disease in the UK")  

As Ever, Summarizing Can Avoid Clutter

As Ever, Summarizing Can Avoid Clutter

ggplot(UKDeaths, aes(x=Month, y=Deaths)) + 
  geom_boxplot() + 
  theme(text=element_text(size=18, family="Times")) +
  ggtitle("Deaths for Lung Disease in the UK (1974-1979)")  

No Easy Answers

There are no easy answers to how to address this

  • What is your objective?

  • What will your audience understand?

  • Somtimes: What is the least annoying option?