1. This… is… Jeopardy!

“This… is… Jeopardy!” These words will ring a bell for anyone who has watched the American game show Jeopardy!. This iconic TV show could be described as quiz bowl with gambling. In each 30-minute episode, three contestants compete to answer questions that each carry a specific monetary value, accumulating and wagering their earnings throughout each round. It’s no pocket change, either: a recent Jeopardy! champion, James Holzhauer, walked away with $2.46 million after winning 32 consecutive games. However, for an aspiring Jeopardy! champion, the amount of knowledge required to excel at the game might seem discouraging at first glance. How can it be possible to know everything about everything? Some Jeopardy! enthusiasts have turned to data analysis for the answers, and we’ll do just that. In this project, we’ll use basic text mining techniques on Jeopardy! data to spot trends in the types of questions asked. Let’s start by loading in the dataset and the packages we’ll need.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
library(wordcloud)
## Loading required package: RColorBrewer
jeopardy<-read_csv('datasets/jeopardy.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   show_number = col_double(),
##   air_date = col_character(),
##   round = col_character(),
##   category = col_character(),
##   value = col_character(),
##   question = col_character(),
##   answer = col_character()
## )

2. A glimpse ahead

Here are the basic rules of the game. Three contestants compete against each other in three rounds: Jeopardy, Double Jeopardy, and Final Jeopardy. The Jeopardy and Double Jeopardy rounds each have six categories, with five answers per category. After an answer is read by the show’s host, Alex Trebek, the contestants race to be the first to come up with the correct question to the answer. Each answer has a monetary value based on its difficulty. The monetary values in the Double Jeopardy round are double the values of the answers in the Jeopardy round. In Final Jeopardy, the contestants bet any amount from their accumulated earnings on one difficult answer. For a complete breakdown of the rules, check out the Jeopardy! Wikipedia page. Knowing the rules of the game will make the jeopardy dataset easier to understand!

glimpse(jeopardy)
## Rows: 116,837
## Columns: 7
## $ show_number <dbl> 4031, 4031, 4031, 4031, 4031, 4031, 4031, 4031, 4031, 4031…
## $ air_date    <chr> "2/25/2002", "2/25/2002", "2/25/2002", "2/25/2002", "2/25/…
## $ round       <chr> "Jeopardy!", "Jeopardy!", "Jeopardy!", "Jeopardy!", "Jeopa…
## $ category    <chr> "AMERICAN HISTORY", "FIREFIGHTING", "GEOGRAPH\"E\"", "GIVE…
## $ value       <chr> "$200", "$200", "$200", "$200", "$200", "$200", "$400", "$…
## $ question    <chr> "In 1805 this territory was created from the Indiana one, …
## $ answer      <chr> "Michigan", "the Hall of Flame", "Etna", "Gary Burghoff", …
head(jeopardy)
## # A tibble: 6 x 7
##   show_number air_date  round  category    value question               answer  
##         <dbl> <chr>     <chr>  <chr>       <chr> <chr>                  <chr>   
## 1        4031 2/25/2002 Jeopa… "AMERICAN … $200  "In 1805 this territo… Michigan
## 2        4031 2/25/2002 Jeopa… "FIREFIGHT… $200  "The firefighting mus… the Hal…
## 3        4031 2/25/2002 Jeopa… "GEOGRAPH\… $200  "Sicilians call this … Etna    
## 4        4031 2/25/2002 Jeopa… "GIVE THE … $200  "TV, 1972-1979: Walte… Gary Bu…
## 5        4031 2/25/2002 Jeopa… "WED TO TH… $200  "The sacrament of mar… matrimo…
## 6        4031 2/25/2002 Jeopa… "CRIMINAL … $200  "Ice can refer to dia… to kill
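
The glimpse shows that `value` and `air_date` are both stored as text. If we ever wanted to compare dollar amounts or dates, we would need to convert them first. Here is a minimal sketch on a few made-up rows that mimic the format above; the `toy` data frame, the new column names, and the comma-separated amount are assumptions for illustration, not part of the dataset.

```r
# Toy rows mimicking the format seen in glimpse(jeopardy); not real data
toy <- data.frame(
  air_date = c("2/25/2002", "2/25/2002", "3/28/2011"),
  value    = c("$200", "$1,200", "$400"),
  stringsAsFactors = FALSE
)

# Strip "$" and "," so the dollar amounts can be treated as numbers
toy$value_num <- as.numeric(gsub("[$,]", "", toy$value))

# Parse month/day/year strings into Date objects
toy$air_date_parsed <- as.Date(toy$air_date, format = "%m/%d/%Y")

str(toy)
```

The same two lines could be applied to the `jeopardy` columns, though any values that are not dollar amounts would come out as `NA` and deserve a look.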

3. Corpus of categories

Whew. Where do we even start? Jeopardy! questions and answers include all kinds of words: places, people, even obscure vocabulary. Did you know that “philately” means “love of stamp collecting”? Check out the data from show number 6108. It might be better to start with a small-scale text analysis, so let’s look at the categories. In addition to having clever and sometimes downright funny names, they’ll tell us a little more about the content of the questions without our having to analyze the question text. We need to take the categories data and convert it to an easily workable body of text; in other words, a corpus.

categories<-jeopardy %>%
    select(round,category) %>%
    filter(round == 'Jeopardy!')

# VectorSource(x) treats each element of x as one document. It is made for working with character vectors in R.
# Note: a data frame is a list of columns, so passing the two-column `categories` tibble yields two elements (one per column). That is why $length is 2 in the output below.

# A corpus is a collection of text documents.

# VCorpus() takes a source object and builds a volatile corpus. "Volatile" means the corpus is held in memory:
# all changes affect only the corresponding R object, and once that object is destroyed, the corpus is destroyed too.

    
categories_source<-VectorSource(categories)
categories_source
## $encoding
## [1] ""
## 
## $length
## [1] 2
## 
## $position
## [1] 0
## 
## $reader
## function (elem, language, id) 
## {
##     if (!is.null(elem$uri)) 
##         id <- basename(elem$uri)
##     PlainTextDocument(elem$content, id = id, language = language)
## }
## <bytecode: 0x7ffdbe0ee5f8>
## <environment: namespace:tm>
## 
## $content
## # A tibble: 58,079 x 2
##    round     category               
##    <chr>     <chr>                  
##  1 Jeopardy! "AMERICAN HISTORY"     
##  2 Jeopardy! "FIREFIGHTING"         
##  3 Jeopardy! "GEOGRAPH\"E\""        
##  4 Jeopardy! "GIVE THE ROLE TO GARY"
##  5 Jeopardy! "WED TO THE IDEA"      
##  6 Jeopardy! "CRIMINAL CONVERSATION"
##  7 Jeopardy! "AMERICAN HISTORY"     
##  8 Jeopardy! "FIREFIGHTING"         
##  9 Jeopardy! "GEOGRAPH\"E\""        
## 10 Jeopardy! "GIVE THE ROLE TO GARY"
## # … with 58,069 more rows
## 
## attr(,"class")
## [1] "VectorSource" "SimpleSource" "Source"
categories_corp<-VCorpus(categories_source)
categories_corp
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2
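
The output above reports two documents because `VectorSource()` counts the elements of whatever it is given, and a tibble has one element per column. The sketch below contrasts that with passing a single character column, which gives one document per category title. It assumes the tm package is installed; the `toy` rows and object names are made up for illustration.

```r
library(tm)

# Toy stand-in for the filtered categories table (hypothetical rows)
toy <- data.frame(
  round    = c("Jeopardy!", "Jeopardy!", "Jeopardy!"),
  category = c("AMERICAN HISTORY", "FIREFIGHTING", "WED TO THE IDEA"),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so this source has 2 elements
by_column <- VectorSource(toy)
by_column$length

# A character vector gives one element per category title
by_title <- VectorSource(toy$category)
by_title$length

# The corpus built from the character vector has one document per title
title_corp <- VCorpus(by_title)
length(title_corp)
```

Either layout still lets us total word counts across documents. With the column-wise layout, though, the `round` column becomes a document too, so its text would be counted alongside the category titles.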

4. Cleaning the categories

Jeopardy! categories are notorious for being witty and unique. An example of a category title is “Element, Spel-ement” (from the episode aired on March 28, 2011). Every question in this category gave the contestant a list of chemical element names, and the contestant had to spell the word created by the symbols of those elements (example: boron, aluminum, potassium = “balk”). Some categories have more straightforward titles, such as the “Indonesia” category (from the episode on March 25, 2011), which had questions all about Indonesia. You can imagine that text mining from these wordy and specific categories might be difficult—and this would probably be correct. Some cleaning is in order! Bust out the vacuums (or in this case, some tm package verbs).

# Clean the corpus

# tm_map() applies a transformation to every document in the corpus. Here we use it to convert the text to lowercase, remove punctuation, collapse extra whitespace, and remove common stopwords such as "the" and "we". Stopwords are so common in a language that they carry almost no information, so it is useful to remove them before further analysis.

# transforming all text to lowercase
clean_corp <- tm_map(categories_corp, content_transformer(tolower))

# removing punctuation
clean_corp <- tm_map(clean_corp, removePunctuation)

# stripping whitespace
clean_corp <- tm_map(clean_corp, stripWhitespace)

# removing English stopwords
clean_corp <- tm_map(clean_corp, removeWords, stopwords("en"))

# Create a term-document matrix (TDM) from the clean corpus
# A term-document matrix represents text data as a matrix: each row is a term (word), each column is a document, and each cell holds the number of times that term appears in that document.
categories_tdm <- TermDocumentMatrix(clean_corp)
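
To make the shape of a term-document matrix concrete, here is a small sketch built from two made-up documents. It assumes the tm package is installed; `docs`, `toy_tdm`, and `toy_m` are hypothetical names used only for this illustration.

```r
library(tm)

# Two tiny documents (hypothetical category titles, already lowercase)
docs <- c("american history", "world history")
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

toy_m <- as.matrix(toy_tdm)
dim(toy_m)        # number of terms, number of documents
rownames(toy_m)   # the terms
rowSums(toy_m)    # total count of each term across documents
```

Terms sit in the rows and documents in the columns, which is why summing across rows gives one total per word. As far as we know, `TermDocumentMatrix()` also drops words shorter than three characters by default, so very short words may not show up as terms.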

5. Favorite topics

A basic yet fairly effective analysis here would be a word-frequency analysis. If certain words popped up in category titles more often than others, we could reasonably assume that there are recurring themes in Jeopardy! categories. First, we will need to turn the TDM into a plain matrix. Then, we will rank the most frequent words.

categories_m <- as.matrix(categories_tdm)
term_frequency<- sort(rowSums(categories_m), decreasing = TRUE)

# The las argument controls the orientation of the axis labels:

# 0: always parallel to the axis
# 1: always horizontal
# 2: always perpendicular to the axis
# 3: always vertical.
barplot(term_frequency[1:12], las=2, col = 'Blue' )
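
The ranking step is just a row sum followed by a sort. Here is a minimal base R sketch on a hand-built matrix; the counts and object names are invented for illustration.

```r
# Hand-built term-document counts (hypothetical numbers)
toy_m <- matrix(
  c(2, 0,
    1, 3,
    0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(Terms = c("history", "words", "world"),
                  Docs  = c("1", "2"))
)

# Sum each row to get a term's total count, then rank from high to low
toy_freq <- sort(rowSums(toy_m), decreasing = TRUE)
toy_freq

# head() returns at most n entries, so it is safe when there are fewer than n terms
head(toy_freq, 12)
```

With the full dataset there are far more than 12 terms, so `term_frequency[1:12]` and `head(term_frequency, 12)` should give the same result; `head()` simply avoids `NA` padding on small inputs.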

6. Removing unwanted words

That is a nice barplot…but we’re in this for the money and Jeopardy! fame. Let’s improve the bar plot by removing some unhelpful words: “time,” “new,” “first,” and “lets.”

# Remove additional words from the corpus
cleaner_corp <- tm_map(clean_corp, removeWords, 
                        c(stopwords("en"), "time", "new", "first", "lets")) 
# Create a TDM 
cleaner_tdm <- TermDocumentMatrix(cleaner_corp)

# Recompute the term frequencies from the new TDM and plot the top 12 terms
categories_m <- as.matrix(cleaner_tdm)    
term_frequency <- sort(rowSums(categories_m), decreasing = TRUE)
barplot(term_frequency[1:12], col = "honeydew3", las = 2)
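
For a quick look, an alternative to rebuilding the corpus is to drop the unwanted terms straight from the frequency vector. This is not what the code above does; it is only a sketch of a shortcut, using invented frequencies and hypothetical object names.

```r
# Hypothetical frequencies, already sorted from most to least common
toy_freq <- c(words = 40, time = 35, world = 30, new = 28, history = 25)

# Words judged unhelpful (same list as in the text)
unwanted <- c("time", "new", "first", "lets")

# Keep only the terms whose names are not in the unwanted list
filtered_freq <- toy_freq[!names(toy_freq) %in% unwanted]
filtered_freq
```

Filtering the vector keeps the original sort order, so the result can go directly into `barplot()`. Removing the words from the corpus, as done above, is the better choice when the cleaned corpus will be reused later.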

7. Creating better tools, part 1

A few of our top-ranking category words are: “words,” “world,” “state,” “name,” and “history.” “Words” most likely refers to the wordplay or vocabulary categories, which appear often on the show. The other four words suggest that a Jeopardy! champion will need to know a lot about history, geography, and significant historical figures. However, when we go further down the plot, there’s an interesting term: the 11th most common term is “American.” Considering this is an American game show, it would make sense that the game requires the contestants to be most familiar with American history. We should look into this! But first, let’s save some time by condensing many lines of code into a single call. We’ll write simple functions for the cleaning and term-frequency extraction processes.

# Create a cleaning function
speed_clean <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  return(corpus)
}
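
To see what the helper does, we can run it on a one-document toy corpus. The sketch assumes the tm package is installed; the input string and the `toy_corp`, `cleaned`, and `cleaned_words` names are made up for illustration.

```r
library(tm)

# Same helper as defined above, repeated so the sketch runs on its own
speed_clean <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  return(corpus)
}

# One-document toy corpus (hypothetical input)
toy_corp <- VCorpus(VectorSource("GIVE THE   ROLE TO GARY!"))
cleaned <- speed_clean(toy_corp)

# Inspect the cleaned text and split it into words
cleaned_text <- content(cleaned[[1]])
cleaned_words <- strsplit(trimws(cleaned_text), "\\s+")[[1]]
cleaned_words
```

Because stopwords are removed after whitespace is stripped, the cleaned text can keep small gaps where the stopwords used to be. That should be harmless for counting terms, since the words are split on whitespace anyway.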

8. Creating better tools, part 2

We can incorporate the speed_clean() function we just made into a new function that builds a corpus from raw text, cleans it, and returns its terms ranked by frequency. Then, we’ll move on to Final Jeopardy, the last and toughest round (also, the one with the iconic Jeopardy! song).

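# Create freq_terms function: raw text in, sorted term frequencies out
freq_terms <- function(text) {
  # Build a corpus from the text (one document per element)
  text_source <- VectorSource(text)
  corpus <- VCorpus(text_source)
  # Lowercase, strip whitespace and punctuation, and drop stopwords
  clean_corpus <- speed_clean(corpus)
  # Count each term across all documents and rank from most to least frequent
  tdm <- TermDocumentMatrix(clean_corpus)
  tdm_matrix <- as.matrix(tdm)
  term_frequency <- sort(rowSums(tdm_matrix), decreasing = TRUE)
  return(term_frequency)
}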
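
A quick way to check the function is to feed it a handful of titles whose counts we can verify by eye. The sketch below assumes the tm package is installed and repeats compact versions of both helpers so it runs on its own; the titles and the `toy_titles` and `toy_freq` names are invented.

```r
library(tm)

# Compact copies of the two helpers, repeated so the sketch is self-contained
speed_clean <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removePunctuation)
  tm_map(corpus, removeWords, stopwords("en"))
}
freq_terms <- function(text) {
  corpus <- speed_clean(VCorpus(VectorSource(text)))
  sort(rowSums(as.matrix(TermDocumentMatrix(corpus))), decreasing = TRUE)
}

# Hypothetical category titles
toy_titles <- c("American History", "World History", "American Authors")
toy_freq <- freq_terms(toy_titles)
toy_freq
```

Here “american” and “history” should each be counted twice, and “world” and “authors” once, which matches a manual tally of the three titles.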

9. Think!

Final Jeopardy is arguably the most important round in the entire game—contestants bet any amount from their accumulated earnings on one answer. This answer is supposedly more difficult than all the questions in the previous rounds. The contestants make their bets before the answer is read and are given 30 seconds to write down their questions. You can probably imagine how much of a game-changer this round is (check out this for proof). Since we’ve already looked at the categories, let’s look at some of the correct answers to Final Jeopardy questions.

# Create the answers variable
answers <- jeopardy %>% 
  filter(round == "Final Jeopardy!") %>%
  select(answer)

# Retrieve word frequency
ans_frequency <- freq_terms(answers)

# Retrieve names
ans_names <- names(ans_frequency)

# Create wordcloud
wordcloud(ans_names, ans_frequency, max.words = 40, 
          colors = c("wheat", "plum", "salmon"))
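
A wordcloud shows relative size well, but exact counts are hard to read off it. One option is to tabulate the top terms from the same named frequency vector. This is a minimal base R sketch with invented numbers standing in for `ans_frequency`; the object names are hypothetical.

```r
# Hypothetical frequencies standing in for ans_frequency (sorted, high to low)
toy_freq <- c(john = 12, william = 9, james = 8, henry = 7, new = 5)

# Turn the named vector into a two-column table of the top terms
top_n <- 3
top_terms <- data.frame(
  word = names(toy_freq)[seq_len(top_n)],
  freq = unname(toy_freq)[seq_len(top_n)],
  stringsAsFactors = FALSE
)
top_terms
```

As a side note, `wordcloud()` places words with some randomness, so calling `set.seed()` beforehand should make the layout repeatable between runs.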

10. A few insights

John, William, James, and Henry… who might these people be? We don’t know exactly, but the wordcloud seems to support and expand upon a hunch we had a little while ago: many Jeopardy! questions are drawn from American or European history. While it is certainly possible to get a category like “Indonesia,” contestants are much more likely to be tested on the history, literature, or pop culture of the West. This might not be surprising, but there are plenty of other insights to be drawn from the dataset using the text mining techniques we have explored in this project.

As an aspiring Jeopardy! champion, which textbook might best help you study?

a. “Geography of Indonesia”
b. “Chemistry 101”
c. “U.S. History”
d. “Introduction to Text Mining in R”