What is: A Data-Driven Method To Win Jeopardy?

Author

Soren Miner

Introduction

Jeopardy! is a popular, long-running game show, hosted by the late Alex Trebek until 2020 and by Ken Jennings following Trebek’s passing. During each show, three contestants compete to answer trivia questions and win a cash prize. This case study investigates a corpus of nearly 20,000 individual Jeopardy! questions and responses, with the goal of developing a data-driven strategy to prepare for and win the game.

Analysis: Category

A crucial first step in this analysis is building a tool that efficiently counts how often each subject category occurs. (Subject categories were obtained by manually stemming question strings and loosely interpreting the results as categories.) For example, subject_category_count(c("history", "science")) counts the number of questions that fall into each of those two categories:

subject	count
history	260
science	137
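As a rough illustration of how such a counting tool might work, here is a minimal sketch in base R. The data frame `questions`, its `category` column, and the substring-matching rule are assumptions made for this example; the original implementation is not shown in this report and may differ.

```r
# Hypothetical sample data; the real corpus has nearly 20,000 rows.
questions <- data.frame(
  category = c("AMERICAN HISTORY", "SCIENCE & NATURE", "WORLD HISTORY", "POTPOURRI"),
  stringsAsFactors = FALSE
)

# Count the questions whose category label contains each subject keyword.
# Assumption: a subject "matches" when its stem appears anywhere in the label.
subject_category_count <- function(subjects, data = questions) {
  labels <- tolower(data$category)
  counts <- vapply(
    subjects,
    function(s) sum(grepl(s, labels, fixed = TRUE)),
    integer(1)
  )
  data.frame(subject = subjects, count = unname(counts), stringsAsFactors = FALSE)
}

subject_category_count(c("history", "science"))
```

On the toy data above, this returns one row per subject, in the same shape as the table shown in this section.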

We’ll start our strategy analysis by investigating patterns of question values. While values of questions within subjects fall into a fairly standardized scheme (lowest-value questions are worth ~$200, highest are worth ~$1000), there remains a degree of nuance that can inform our strategy. Firstly, some categories may appear more often in the higher-value “Final Jeopardy!” rounds, making them more valuable on average. Additionally, the average values of questions have changed over time since the show’s first airing.

Fig. 1 shows a marked increase in average question value around 2000, likely to adjust for inflation of the United States dollar and to ensure that contestants are competing for a prize pot of appropriate size. Already, we have a concrete strategy to maximize Jeopardy! winnings: participate after the year 2000!
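The yearly averages behind a chart like this could be computed along the following lines. This is only a sketch: the column names `air_date` and `value`, and the assumption that values are stored as strings such as "$1,000", are hypothetical and not taken from the original data.

```r
# Hypothetical sample rows; column names and formats are assumptions.
games <- data.frame(
  air_date = as.Date(c("1995-03-01", "1995-11-20", "2004-05-10", "2004-09-14")),
  value = c("$200", "$500", "$400", "$1,000"),
  stringsAsFactors = FALSE
)

# Strip the dollar sign and thousands separator, then convert to numeric.
games$value_usd <- as.numeric(gsub("[$,]", "", games$value))

# Extract the air year and average the question values within each year.
games$year <- as.integer(format(games$air_date, "%Y"))
avg_value_by_year <- aggregate(value_usd ~ year, data = games, FUN = mean)
avg_value_by_year
```

Plotting `value_usd` against `year` from the resulting data frame would give a line chart of average value over time.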

Fig. 2 reinforces this change, and provides us with more valuable information. Firstly, there is an uneven distribution of question counts between categories: Mathematics does not appear often as a category, while Punny Questions (i.e. any question that involves a play on words or letters) come up very frequently. Additionally, Science, Punny Questions, and Mathematics questions tend to be valued slightly higher than other categories.

(Note: this analysis, like all later Subject Category analyses, uses only questions that have an assigned subject category. Approximately half of the data did not have identifiable categories; those questions will be used for other analyses later in this report.)
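A per-category summary of this kind, restricted to categorized questions, might be sketched as follows. The data frame `scored` and its columns `subject` and `value_usd` are hypothetical names, with `NA` standing in for questions that have no identifiable category.

```r
# Hypothetical sample; NA marks questions with no identifiable subject category.
scored <- data.frame(
  subject = c("Science", "Science", "History", NA, "Mathematics", NA),
  value_usd = c(400, 800, 200, 600, 1000, 200),
  stringsAsFactors = FALSE
)

# Keep only the questions that have an assigned subject category.
categorized <- scored[!is.na(scored$subject), ]

# Question count and mean value for each subject category.
n_questions <- table(categorized$subject)
mean_value <- tapply(categorized$value_usd, categorized$subject, mean)

summary_by_subject <- data.frame(
  subject = names(mean_value),
  n_questions = as.integer(n_questions[names(mean_value)]),
  mean_value = as.numeric(mean_value),
  stringsAsFactors = FALSE
)
summary_by_subject
```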

I next investigated changes in the relative proportions of subject categories over Jeopardy!’s history. Fig. 3 shows the proportions of different question categories since the show’s beginning in 1984.

This figure shows that different categories have become more and less popular over the years. Sports, History, and Science were significantly more common in 1984, but have declined in frequency overall, particularly since the turn of the century. Meanwhile, People and Music have remained relatively steady, and Punny Questions has bounced back following a decrease in the mid-1990s.
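The yearly proportions discussed above can be derived from a year-by-category contingency table. The sketch below uses made-up rows and hypothetical column names (`year`, `subject`); it illustrates the general approach rather than the exact code behind the figure.

```r
# Hypothetical sample; the real data spans 1984 onward.
aired <- data.frame(
  year = c(1984, 1984, 1984, 1984, 2010, 2010, 2010, 2010),
  subject = c("History", "History", "Sports", "Music",
              "History", "Music", "Music", "Punny Questions"),
  stringsAsFactors = FALSE
)

# Cross-tabulate years (rows) against subject categories (columns).
counts <- table(aired$year, aired$subject)

# Convert each row to proportions so that every year sums to 1.
proportions <- prop.table(counts, margin = 1)
round(proportions, 2)
```

Because each row sums to 1, years with different numbers of recorded questions remain directly comparable.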

Important: to use a similar, interactive version of this chart, please visit: https://sauron-245.shinyapps.io/portfolio4-app/

Analysis: Question

The Question column contains the verbatim text of each question posed to contestants. This section of the analysis investigates further trends in subject category, using words that frequently appear in questions, in order to confirm our earlier findings based purely on subject categories. To accomplish this, we’ll split each question into its constituent words, remove all ‘stopwords’ that are unlikely to be of interest, then identify the words that appear most frequently in Jeopardy! questions.
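The tokenize, filter, and count steps described here could look roughly like this in base R. The sample questions and the short stopword list are invented for illustration; a real analysis would more likely use a full stopword lexicon from a text-mining package.

```r
# Hypothetical sample questions.
question_text <- c(
  "This first president was called the father of his country",
  "Name the first American woman in space",
  "One of the first novels was called Don Quixote"
)

# Assumption: a tiny hand-written stopword list, for illustration only.
stopwords <- c("this", "was", "the", "of", "his", "in", "a", "an")

# Lowercase, split on anything that is not a letter, and drop empty tokens.
tokens <- unlist(strsplit(tolower(question_text), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Remove stopwords, then rank the remaining words by frequency.
tokens <- tokens[!tokens %in% stopwords]
word_counts <- sort(table(tokens), decreasing = TRUE)
top_words <- head(word_counts, 2)
top_words
```

Replacing `2` with `30` in the final step would give the kind of top-30 list that can feed a wordcloud.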

Words such as ‘name’, ‘one’, ‘called’, and ‘first’ appear most frequently in the corpus of Jeopardy! questions, as demonstrated by their presence in this top 30 wordcloud. This suggests that a strong command of historical figures, popular culture, and history would be valuable (particularly American history, given the presence of the words ‘American’ and ‘U.S.’ in the wordcloud). Let’s confirm that by investigating question category counts!

Fig. 4 removes all questions that don’t contain any of the 30 most popular words, then counts the subject categories associated with those questions. This chart confirms what we’ve already investigated and gives new insights. When counted by popular Question words, History and Language come up more frequently than they have in other parts of this investigation. Additionally, Punny Questions seems to be far overrepresented: it’s possible that misclassification of this category has artificially inflated its prevalence as a category.
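The filtering step behind this chart might be sketched as follows. The data frame `tagged`, its columns, and the three-word `top_words` vector are hypothetical stand-ins for the real data and the actual 30 most popular words.

```r
# Hypothetical sample of categorized questions.
tagged <- data.frame(
  question = c("Name the first man on the moon",
               "This element is called quicksilver",
               "A five-letter fruit",
               "The first capital of the country"),
  subject = c("Science", "Science", "Punny Questions", "History"),
  stringsAsFactors = FALSE
)

# Assumption: stand-in for the 30 most popular words.
top_words <- c("name", "first", "called")

# Build a whole-word pattern and flag questions that contain any top word.
pattern <- paste0("\\b(", paste(top_words, collapse = "|"), ")\\b")
has_top_word <- grepl(pattern, tolower(tagged$question), perl = TRUE)

# Count the subject categories among the questions that remain.
subject_counts <- sort(table(tagged$subject[has_top_word]), decreasing = TRUE)
subject_counts
```

Matching on word boundaries avoids counting a top word that merely appears inside a longer word, such as ‘name’ inside ‘named’; whether the original analysis made the same choice is not stated.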

Conclusion & Strategy

So far, we’ve investigated:

Change in value of Jeopardy! questions over time;
Number of questions associated with each category;
Change in proportion of categories over time;
Average value of different categories;
Most common words in Jeopardy! questions;
And the categories associated with those most common words.

How should one use this information to maximize winnings at Jeopardy!? The first step is the most straightforward: play after the year 2000! Regardless of the reasoning behind the change, average question values increased significantly around 2000, so current players stand to win more prize money than early contestants. Additionally, one should focus on studying American history, pop culture, and language in order to build a targeted scope of knowledge that translates directly to wins. Being able to negotiate puns and plays on words also seems to be an important skill, although many questions categorized as Punny in this analysis may, in fact, belong to other categories. Even so, wordplay is likely to matter: the words that most frequently appear in Jeopardy! questions correspond to straightforward answers (such as names or places), and these answers are easy enough that the showrunners would likely want to add a degree of uncertainty to challenge Jeopardy!’s contestants.