On July 2, John Oliver had a segment on his TV show “Last Week Tonight” (https://www.youtube.com/watch?v=5cBV8KFFasY) about the auction of life-size wax figures from the presidential wing of a museum in Gettysburg. I instantly wondered “How much did Oliver spend on his 5 figures?”. For \(n = 5\) figures, the answer is relatively easy to compute, but here I wish to practice some data mining skill in R.

Obtaining the Data

The data from the auction was freely available (http://www.paonsiteauction.com/Jan14/011417prices.pdf) in a PDF, and fortunately there is a website that will convert PDF files to text files (http://pdftotext.com/). From there, let’s try out the read.table function to load the text file.

df <- read.table("011417prices.txt", 
                 fill = TRUE, #since rows have varying amounts of elements
                 header = FALSE)

Oh, that returns a data frame huh? For now, I just need line-by-line reads, so we can try readLines instead.

dataLines <- readLines("011417prices.txt")
## Warning in readLines("011417prices.txt"): incomplete final line found on
## '011417prices.txt'
head(dataLines)
## [1] "Pa. OnSite Auction"                                                                                   
## [2] "Sat. Jan. 14, 2017"                                                                                   
## [3] "Hall of Presidents Catalog"                                                                           
## [4] ""                                                                                                     
## [5] "FINAL SALE PRICES INCLUDING BUYERS PREMIUM"                                                           
## [6] "001. President George Washington 1789-1797 life size wax figure & name plaque, 6â<U+0080><U+0099> 2â<U+0080> tall $5,610."

Regular Expressions

While there were lines of description (most of them referring to auction items), I only want to focus on the “life size wax figure” data. I am a novice when it comes to regular expressions, and I will try out the stringr package with help from a vignette (https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html).

library("stringr")

In order to look for “life size wax figure”, I think that the simplest search is for “size”. I will create a separate list of lines that focuses on these wax figures from the auction.

waxFigures <- dataLines[str_detect(dataLines, "size")]
head(waxFigures)
## [1] "001. President George Washington 1789-1797 life size wax figure & name plaque, 6â<U+0080><U+0099> 2â<U+0080> tall $5,610."                
## [2] "004. President John Adams 1797-1801 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 7â<U+0080> tall $1,760."                      
## [3] "006. President Thomas Jefferson 1801-1809 life size wax figure w/ name plaque, 6â<U+0080><U+0099> 3â<U+0080> tall, missing 1 hand $2,640."
## [4] "009. President James Madison 1809-1817 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 4â<U+0080> tall $1,430."                   
## [5] "017. President James Monroe 1817-1825 life size wax figure w/ name plaque, 6â<U+0080> tall $1,100."                         
## [6] "020. President John Quincy Adams 1825-1829 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 7â<U+0080> tall $2,090."

My goals here are to extract from each line the

I can extract the ‘years in office’ data with the following.

yearPattern <- "[:digit:]+-[:digit:]+"
years <- str_extract_all(waxFigures, yearPattern)

Next, I can extract the dollar amounts with the following (but this will return strings instead of numbers).

dollarPattern <- "[:digit:]+,[:digit:]+"
dollars <- str_extract_all(waxFigures, dollarPattern)

To extract the president’s names, I am trying a strategy of skipping the first 5 characters in each line (e.g. “001.”) and then gathering all of the letters and spaces. Ideally, this would stop at each instance of years served.

namePattern <- "(?<=^.....)[[:alpha:][:space:]]+"
presNames <- str_extract_all(waxFigures, namePattern)

Data Wrangling

In R, most a lot of data analysis starts with a data frame. Before we get to that, we should convert the dollars (e.g. “$1,234”) into numbers that a computer can handle.

cost <- as.numeric(gsub('[$,]', '', dollars))
## Warning: NAs introduced by coercion

Now we can create that data frame!

waxPresidents <- as.data.frame(cbind(unlist(presNames), unlist(years), cost))
names(waxPresidents) <- c("president", "years", "cost")
head(waxPresidents)
##                      president     years cost
## 1 President George Washington  1789-1797 5610
## 2        President John Adams  1797-1801 1760
## 3  President Thomas Jefferson  1801-1809 2640
## 4     President James Madison  1809-1817 1430
## 5      President James Monroe  1817-1825 1100
## 6 President John Quincy Adams  1825-1829 2090

Now that I have some grasp on the data, there are a few analyses that we can do. It might be useful to separate the years served column.

library("tidyr")
waxPresidents <- waxPresidents %>% separate(years, into = c("start", "end"), sep = "-")
head(waxPresidents)
##                      president start  end cost
## 1 President George Washington   1789 1797 5610
## 2        President John Adams   1797 1801 1760
## 3  President Thomas Jefferson   1801 1809 2640
## 4     President James Madison   1809 1817 1430
## 5      President James Monroe   1817 1825 1100
## 6 President John Quincy Adams   1825 1829 2090

I should note that there were 0 missing entries for the end-of-term numbers, and this would affect later calculations.

waxPresidents[is.na(waxPresidents$end),]
## [1] president start     end       cost     
## <0 rows> (or 0-length row.names)

Let me try to quickly fix those values.

waxPresidents[22, "start"] <- 1885
waxPresidents[22, "end"] <- 1899
waxPresidents[34, "start"] <- 1953
waxPresidents[34, "end"] <- 1961
waxPresidents[38, "start"] <- 1974
waxPresidents[38, "end"] <- 1977
waxPresidents[46, "start"] <- 2009
waxPresidents[46, "end"] <- 2016

Data Analysis

First, I simply want a plot of the auction prices. Maybe I will put the start of each president’s term on the horizontal axis.

waxPresidents$start <- as.numeric(waxPresidents$start)
waxPresidents$end <- as.numeric(waxPresidents$end)
waxPresidents$cost <- as.numeric(as.character(waxPresidents$cost)) #remove factor levels
waxPresidents <- waxPresidents[!is.na(waxPresidents$cost),] #removed unsold figures

library("ggplot2")
ggplot(waxPresidents, aes(x = start, y = cost)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  labs(caption = "(data from http://www.paonsiteauction.com/Jan14/011417prices.pdf)") +
  labs(title = "PA Auction of US President Wax Figures") + 
  labs(x = "start of presidential term") + 
  labs(y = "cost of wax figurine (in USD)")

Which presidents’ wax figures fetched the highest sale prices?

head(waxPresidents[order(-cost),])
##                           president start  end cost
## 16            dent Abraham Lincoln   1861 1865 9350
## 26 President Thomas Woodrow Wilson   1901 1909 2310
## 18 President Ulysses Simpson Grant   1869 1877 6820
## 1      President George Washington   1789 1797 5610
## 7         President Andrew Jackson   1829 1837 5610
## 15        President James Buchanan   1857 1861 4400

Which wax figures fetched the lowest sale prices?

head(waxPresidents[order(cost),])
##                         president start  end cost
## 5         President James Monroe   1817 1825 1100
## 4        President James Madison   1809 1817 1430
## 14     President Franklin Pierce   1853 1857 1430
## 2           President John Adams   1797 1801 1760
## 17      President Andrew Johnson   1865 1869 1870
## 21 President Chester Alan Arthur   1881 1885 1870

John Oliver’s Haul

Finally, I will focus on the 5 figures that John Oliver purchased for his TV show *Last Week Tonight.

what_Oliver_Owns <- c("Carter", "Clinton", "Harrison", "Harding", "Nixon")
OliverFigures <- waxPresidents[grep(paste(what_Oliver_Owns, collapse = "|"), waxPresidents$president),]
OliverFigures
##                               president start  end cost
## 9     President William Henry Harrison   1841 1841 3520
## 27        dent Warren Gamaliel Harding   1909 1913 2970
## 36     President Richard Milhous Nixon   1963 1969 1870
## 39       President James Earl Carter Jr  1977 1981 2090
## 42 President William Jefferson Clinton   1993 2001 3300

For the heck of it, here is a graph.

ggplot(OliverFigures, aes(x = president, y = cost)) +
  geom_bar(stat = "identity") + 
  labs(title = "Last Week Tonight's Presidential Wax Figures") +
  labs(x = "") + 
  labs(y = "cost of wax figure (in USD)") +
  theme(axis.text.x=element_text(angle=45,hjust=1))

Finally, we can still add up the cost of the 5 wax figures and conclude that Last Week Tonight spent $ 13,750.0000 at that auction.