Data visualization, Part 1

Statistical Learning
James Scott (UT-Austin)

Types of features in a data set

Categorical: the answer to a multiple-choice question:

  • Chevy/Honda/Tesla
  • ice cream/cake/pie

Ordinal: categorical, where the answers have an ordering but not a magnitude

  • Poor, Moderate, Good, Great
  • Private, Corporal, Lieutenant, Colonel, General

Numerical: numbers, whether integer or real-valued

  • Beware the “faux numerical” ordinal scale (CIS!)
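
As a rough illustration of how these three feature types might be represented in R, here is a minimal sketch. The variable names and values (`dessert`, `rating`, `temps`) are made up for this example and are not from the course data.

```r
# Categorical: unordered labels, stored as a factor
dessert = factor(c('ice cream', 'cake', 'pie', 'cake'))

# Ordinal: an ordered factor knows the ordering but not the spacing
rating = factor(c('Good', 'Poor', 'Great', 'Moderate'),
                levels=c('Poor', 'Moderate', 'Good', 'Great'),
                ordered=TRUE)

# Comparisons respect the declared ordering (here: Good vs Poor)
rating[1] > rating[2]

# Numerical: arithmetic is meaningful
temps = c(61.2, 58.7, 70.1)
mean(temps)

# The "faux numerical" trap: coding the ratings as 1-4 lets R average them,
# but that silently assumes the gaps between levels are all equal
mean(as.integer(rating))
```

Keeping ordinal variables as ordered factors, rather than as integer codes, makes it harder to average them by accident.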

Five basic building blocks of data vis

  • Tables (categorical vs categorical)
  • Histograms
  • Boxplots/dotplots
  • Scatter plots
  • Lattice/panel plots
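
To make the list concrete, here is a sketch of one base-R call per building block. It uses the built-in `mtcars` data set, which is an assumption of this example and not part of the course materials, and it uses base R's `coplot` as a stand-in for a lattice-style panel plot.

```r
# Built-in example data (an assumption; not the course data)
data(mtcars)

# Table: categorical vs categorical
cyl_by_gear = xtabs(~ cyl + gear, data=mtcars)
print(cyl_by_gear)

# Histogram: distribution of one numerical variable
hist(mtcars$mpg, breaks=10)

# Boxplot and dotplot: a numerical variable split by groups
boxplot(mpg ~ cyl, data=mtcars)
stripchart(mpg ~ cyl, data=mtcars, method='jitter', vertical=TRUE)

# Scatter plot: numerical vs numerical
plot(mpg ~ wt, data=mtcars)

# Panel plot: the same scatter plot, one panel per group
coplot(mpg ~ wt | factor(cyl), data=mtcars)
```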

Good principles in action: histograms

Let's load some data on temperatures in Rapid City and San Diego:

# Adjust the path to wherever the file lives on your machine
citytemps = read.csv('data/citytemps.csv')
summary(citytemps)
      Year          Month             Day        Temp.SanDiego  
 Min.   :1995   Min.   : 1.000   Min.   : 1.00   Min.   :45.10  
 1st Qu.:1999   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:58.70  
 Median :2003   Median : 7.000   Median :16.00   Median :63.00  
 Mean   :2003   Mean   : 6.492   Mean   :15.71   Mean   :63.08  
 3rd Qu.:2007   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:67.30  
 Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :81.30  
 Temp.RapidCity  
 Min.   :-19.00  
 1st Qu.: 33.30  
 Median : 47.60  
 Mean   : 47.28  
 3rd Qu.: 63.95  
 Max.   : 91.90  

Histograms

A simple histogram:

hist(citytemps$Temp.SanDiego)

[Plot: histogram of San Diego temperatures with default bins]

The `breaks` argument

hist(citytemps$Temp.SanDiego, breaks=30)

[Plot: histogram of San Diego temperatures with breaks=30]
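
One detail worth checking: when `breaks` is a single number, R treats it as a suggestion and rounds to "pretty" break points, so the number of bins drawn may differ from what was asked for. The sketch below inspects the binning without plotting. It uses simulated values as a stand-in for `citytemps$Temp.SanDiego`, since the real file may not be at hand.

```r
# Simulated stand-in for the San Diego column (an assumption, not the real data)
set.seed(1)
temps = rnorm(5000, mean=63, sd=6)

# plot=FALSE returns the binning without drawing anything
h = hist(temps, breaks=30, plot=FALSE)

# A single number is only a suggestion: inspect what R actually used
length(h$breaks) - 1      # number of bins actually used
diff(h$breaks)            # bin widths (all equal)

# A vector of break points is taken literally
h2 = hist(temps, breaks=seq(30, 100, by=1), plot=FALSE)
diff(h2$breaks)           # every bin is exactly 1 unit wide
```

Passing an explicit vector of break points is the way to get exactly the bins you want.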

Adding bells and whistles

muSanDiego = mean(citytemps$Temp.SanDiego)
hist(citytemps$Temp.SanDiego, breaks=30)
abline(v=muSanDiego, col='red', lwd=5)

[Plot: histogram of San Diego temperatures with a red vertical line at the mean]
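
A few more optional embellishments, sketched on simulated data (the values, title, and colors here are illustrative assumptions, not the course's choices): an informative title and axis label, a second reference line for the median, and a legend to tell the lines apart.

```r
# Simulated stand-in for the temperature column (an assumption)
set.seed(1)
temps = rnorm(5000, mean=63, sd=6)
mu = mean(temps)
med = median(temps)

hist(temps, breaks=30,
     main='Daily temperatures (simulated)',
     xlab='Temperature', col='grey80', border='white')

# Reference lines: solid red for the mean, dashed blue for the median
abline(v=mu, col='red', lwd=3)
abline(v=med, col='blue', lwd=3, lty=2)

# Legend so the reader knows which line is which
legend('topright', legend=c('mean', 'median'),
       col=c('red', 'blue'), lwd=3, lty=c(1, 2))
```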

Comparing two histograms

We can make a multi-frame plot.

par(mfrow=c(2,1))  # 2 rows, 1 column of panels
hist(citytemps$Temp.SanDiego)
hist(citytemps$Temp.RapidCity)

R's syntax here is somewhat hacky. We'll see something better when we learn ggplot!
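
One practical wrinkle: `par(mfrow=...)` changes the layout for every later plot on the same device until it is reset. A common pattern, sketched here on simulated data (the vectors `a` and `b` are placeholders, not the city temperatures), is to save the old settings and restore them afterward.

```r
# Simulated placeholder data (an assumption)
set.seed(1)
a = rnorm(1000, mean=63, sd=6)
b = rnorm(1000, mean=47, sd=20)

# par() returns the previous values of the settings it changes
old_par = par(mfrow=c(2,1))
hist(a)
hist(b)

# Restore the original single-panel layout
par(old_par)
```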

Comparing two histograms

[Plot: default histograms of San Diego (top) and Rapid City (bottom) temperatures]

What's wrong with this picture?

Core principle: plots should be truthful about magnitude. Here the two panels have:

  • unequal bin widths
  • noncomparable y axes
  • noncomparable x axes

Let's fix these:

# custom breakpoints: equal 2-degree bins spanning both cities' ranges
mybreaks = seq(-20, 92, by=2)
par(mfrow=c(2,1))
hist(citytemps$Temp.SanDiego, breaks=mybreaks,
  xlim=c(-20,100), ylim=c(0, 760))
hist(citytemps$Temp.RapidCity, breaks=mybreaks,
  xlim=c(-20,100), ylim=c(0, 760))
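
The hard-coded limits above work for this data set. As a sketch of how the same fix could be derived from the data instead, the code below computes shared breaks and a shared y-axis limit. It runs on simulated stand-ins for the two temperature columns, and the names `sd_temps`, `rc_temps`, and `binwidth` are hypothetical.

```r
# Simulated stand-ins for the two temperature columns (an assumption)
set.seed(1)
sd_temps = rnorm(5000, mean=63, sd=6)
rc_temps = rnorm(5000, mean=47, sd=20)

# Shared, equal-width breaks covering both data sets
binwidth = 2
lo = floor(min(sd_temps, rc_temps) / binwidth) * binwidth
hi = ceiling(max(sd_temps, rc_temps) / binwidth) * binwidth
shared_breaks = seq(lo, hi, by=binwidth)

# Bin without plotting to find the tallest bar in either histogram
h_sd = hist(sd_temps, breaks=shared_breaks, plot=FALSE)
h_rc = hist(rc_temps, breaks=shared_breaks, plot=FALSE)
ymax = max(h_sd$counts, h_rc$counts)

# Draw both panels with identical bins, x range, and y range
old_par = par(mfrow=c(2,1))
plot(h_sd, xlim=c(lo, hi), ylim=c(0, ymax), main='San Diego (simulated)')
plot(h_rc, xlim=c(lo, hi), ylim=c(0, ymax), main='Rapid City (simulated)')
par(old_par)
```

Computing the limits this way keeps the two panels comparable even if the data change.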

Comparing two histograms

[Plot: San Diego (top) and Rapid City (bottom) histograms with shared bins and axes]