On July 2, John Oliver had a segment on his TV show “Last Week Tonight” (https://www.youtube.com/watch?v=5cBV8KFFasY) about the auction of life-size wax figures from the presidential wing of a museum in Gettysburg. I instantly wondered “How much did Oliver spend on his 5 figures?”. For \(n = 5\) figures, the answer is relatively easy to compute, but here I wish to practice some data mining skill in R.
The data from the auction was freely available (http://www.paonsiteauction.com/Jan14/011417prices.pdf) in a PDF, and fortunately there is a website that will convert PDF files to text files (http://pdftotext.com/). From there, let’s try out the read.table function to load the text file.
df <- read.table("011417prices.txt",
fill = TRUE, #since rows have varying amounts of elements
header = FALSE)
Oh, that returns a data frame huh? For now, I just need line-by-line reads, so we can try readLines instead.
dataLines <- readLines("011417prices.txt")
## Warning in readLines("011417prices.txt"): incomplete final line found on
## '011417prices.txt'
head(dataLines)
## [1] "Pa. OnSite Auction"
## [2] "Sat. Jan. 14, 2017"
## [3] "Hall of Presidents Catalog"
## [4] ""
## [5] "FINAL SALE PRICES INCLUDING BUYERS PREMIUM"
## [6] "001. President George Washington 1789-1797 life size wax figure & name plaque, 6â<U+0080><U+0099> 2â<U+0080> tall $5,610."
While there were lines of description (most of them referring to auction items), I only want to focus on the “life size wax figure” data. I am a novice when it comes to regular expressions, and I will try out the stringr package with help from a vignette (https://cran.r-project.org/web/packages/stringr/vignettes/stringr.html).
library("stringr")
In order to look for “life size wax figure”, I think that the simplest search is for “size”. I will create a separate list of lines that focuses on these wax figures from the auction.
waxFigures <- dataLines[str_detect(dataLines, "size")]
head(waxFigures)
## [1] "001. President George Washington 1789-1797 life size wax figure & name plaque, 6â<U+0080><U+0099> 2â<U+0080> tall $5,610."
## [2] "004. President John Adams 1797-1801 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 7â<U+0080> tall $1,760."
## [3] "006. President Thomas Jefferson 1801-1809 life size wax figure w/ name plaque, 6â<U+0080><U+0099> 3â<U+0080> tall, missing 1 hand $2,640."
## [4] "009. President James Madison 1809-1817 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 4â<U+0080> tall $1,430."
## [5] "017. President James Monroe 1817-1825 life size wax figure w/ name plaque, 6â<U+0080> tall $1,100."
## [6] "020. President John Quincy Adams 1825-1829 life size wax figure w/ name plaque, 5â<U+0080><U+0099> 7â<U+0080> tall $2,090."
My goals here are to extract from each line the
president’s full name
years in office
final price of his wax figurine
I can extract the ‘years in office’ data with the following.
yearPattern <- "[:digit:]+-[:digit:]+"
years <- str_extract_all(waxFigures, yearPattern)
Next, I can extract the dollar amounts with the following (but this will return strings instead of numbers).
dollarPattern <- "[:digit:]+,[:digit:]+"
dollars <- str_extract_all(waxFigures, dollarPattern)
To extract the president’s names, I am trying a strategy of skipping the first 5 characters in each line (e.g. “001.”) and then gathering all of the letters and spaces. Ideally, this would stop at each instance of years served.
namePattern <- "(?<=^.....)[[:alpha:][:space:]]+"
presNames <- str_extract_all(waxFigures, namePattern)
In R, most a lot of data analysis starts with a data frame. Before we get to that, we should convert the dollars (e.g. “$1,234”) into numbers that a computer can handle.
cost <- as.numeric(gsub('[$,]', '', dollars))
## Warning: NAs introduced by coercion
Now we can create that data frame!
waxPresidents <- as.data.frame(cbind(unlist(presNames), unlist(years), cost))
names(waxPresidents) <- c("president", "years", "cost")
head(waxPresidents)
## president years cost
## 1 President George Washington 1789-1797 5610
## 2 President John Adams 1797-1801 1760
## 3 President Thomas Jefferson 1801-1809 2640
## 4 President James Madison 1809-1817 1430
## 5 President James Monroe 1817-1825 1100
## 6 President John Quincy Adams 1825-1829 2090
Now that I have some grasp on the data, there are a few analyses that we can do. It might be useful to separate the years served column.
library("tidyr")
waxPresidents <- waxPresidents %>% separate(years, into = c("start", "end"), sep = "-")
head(waxPresidents)
## president start end cost
## 1 President George Washington 1789 1797 5610
## 2 President John Adams 1797 1801 1760
## 3 President Thomas Jefferson 1801 1809 2640
## 4 President James Madison 1809 1817 1430
## 5 President James Monroe 1817 1825 1100
## 6 President John Quincy Adams 1825 1829 2090
I should note that there were 0 missing entries for the end-of-term numbers, and this would affect later calculations.
waxPresidents[is.na(waxPresidents$end),]
## [1] president start end cost
## <0 rows> (or 0-length row.names)
Let me try to quickly fix those values.
waxPresidents[22, "start"] <- 1885
waxPresidents[22, "end"] <- 1899
waxPresidents[34, "start"] <- 1953
waxPresidents[34, "end"] <- 1961
waxPresidents[38, "start"] <- 1974
waxPresidents[38, "end"] <- 1977
waxPresidents[46, "start"] <- 2009
waxPresidents[46, "end"] <- 2016
First, I simply want a plot of the auction prices. Maybe I will put the start of each president’s term on the horizontal axis.
waxPresidents$start <- as.numeric(waxPresidents$start)
waxPresidents$end <- as.numeric(waxPresidents$end)
waxPresidents$cost <- as.numeric(as.character(waxPresidents$cost)) #remove factor levels
waxPresidents <- waxPresidents[!is.na(waxPresidents$cost),] #removed unsold figures
library("ggplot2")
ggplot(waxPresidents, aes(x = start, y = cost)) +
geom_point() +
geom_smooth(method = "lm") +
labs(caption = "(data from http://www.paonsiteauction.com/Jan14/011417prices.pdf)") +
labs(title = "PA Auction of US President Wax Figures") +
labs(x = "start of presidential term") +
labs(y = "cost of wax figurine (in USD)")
Which presidents’ wax figures fetched the highest sale prices?
head(waxPresidents[order(-cost),])
## president start end cost
## 16 dent Abraham Lincoln 1861 1865 9350
## 26 President Thomas Woodrow Wilson 1901 1909 2310
## 18 President Ulysses Simpson Grant 1869 1877 6820
## 1 President George Washington 1789 1797 5610
## 7 President Andrew Jackson 1829 1837 5610
## 15 President James Buchanan 1857 1861 4400
Which wax figures fetched the lowest sale prices?
head(waxPresidents[order(cost),])
## president start end cost
## 5 President James Monroe 1817 1825 1100
## 4 President James Madison 1809 1817 1430
## 14 President Franklin Pierce 1853 1857 1430
## 2 President John Adams 1797 1801 1760
## 17 President Andrew Johnson 1865 1869 1870
## 21 President Chester Alan Arthur 1881 1885 1870
Finally, I will focus on the 5 figures that John Oliver purchased for his TV show *Last Week Tonight.
what_Oliver_Owns <- c("Carter", "Clinton", "Harrison", "Harding", "Nixon")
OliverFigures <- waxPresidents[grep(paste(what_Oliver_Owns, collapse = "|"), waxPresidents$president),]
OliverFigures
## president start end cost
## 9 President William Henry Harrison 1841 1841 3520
## 27 dent Warren Gamaliel Harding 1909 1913 2970
## 36 President Richard Milhous Nixon 1963 1969 1870
## 39 President James Earl Carter Jr 1977 1981 2090
## 42 President William Jefferson Clinton 1993 2001 3300
For the heck of it, here is a graph.
ggplot(OliverFigures, aes(x = president, y = cost)) +
geom_bar(stat = "identity") +
labs(title = "Last Week Tonight's Presidential Wax Figures") +
labs(x = "") +
labs(y = "cost of wax figure (in USD)") +
theme(axis.text.x=element_text(angle=45,hjust=1))
Finally, we can still add up the cost of the 5 wax figures and conclude that Last Week Tonight spent $ 13,750.0000 at that auction.