Jeopardy! is a highly successful American trivia game show that has run on syndicated television for decades. In each game, three contestants compete to answer questions (or “question answers,” for those familiar with the show’s format) worth varying dollar amounts. Generally, the difficulty of the questions increases with the dollar amount. Each show is divided into three rounds: “Jeopardy!”, “Double Jeopardy!”, and “Final Jeopardy!”. In “Final Jeopardy!”, contestants wager their accumulated winnings on a single question. The show’s impact on mainstream culture is evident in its numerous pop culture parodies, and its format is recognized almost universally (Fig. 1).
Fig. 1 Typical format of the Jeopardy! gameboard.
The goal of this study is to identify statistically significant differences in the “obscurity” of a question’s answer (as measured by the number of Google Search results) across four factors associated with each question (Year, Month, Value, and Round).
Reddit user trexmatt scraped the fan-created Jeopardy question archive website http://www.j-archive.com/ and created a .JSON file with the data on over 200,000 Jeopardy questions.1 This large dataset required some cleaning prior to the analysis. The columns in the raw dataset included category, air_date, question, value, answer, round, and show_number. The jsonlite library was useful for importing the JSON dataset into R as a single data frame.
# import JSON file into a data frame for R (the local file directory must be assigned to the "file_location" string)
jeopardy_data <- fromJSON(file_location)
# view the first 6 rows of the data frame
head(jeopardy_data)
## category air_date
## 1 HISTORY 2004-12-31
## 2 ESPN's TOP 10 ALL-TIME ATHLETES 2004-12-31
## 3 EVERYBODY TALKS ABOUT IT... 2004-12-31
## 4 THE COMPANY LINE 2004-12-31
## 5 EPITAPHS & TRIBUTES 2004-12-31
## 6 3-LETTER WORDS 2004-12-31
## question
## 1 'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'
## 2 'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'
## 3 'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'
## 4 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'
## 5 'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States'
## 6 'In the title of an Aesop fable, this insect shared billing with a grasshopper'
## value answer round show_number
## 1 $200 Copernicus Jeopardy! 4680
## 2 $200 Jim Thorpe Jeopardy! 4680
## 3 $200 Arizona Jeopardy! 4680
## 4 $200 McDonald\\'s Jeopardy! 4680
## 5 $200 John Adams Jeopardy! 4680
## 6 $200 the ant Jeopardy! 4680
# view the last 6 rows of the data frame
tail(jeopardy_data)
## category air_date
## 216925 OFF-BROADWAY 2006-05-11
## 216926 RIDDLE ME THIS 2006-05-11
## 216927 "T" BIRDS 2006-05-11
## 216928 AUTHORS IN THEIR YOUTH 2006-05-11
## 216929 QUOTATIONS 2006-05-11
## 216930 HISTORIC NAMES 2006-05-11
## question
## 216925 'In 2006 the cast of this long-running hit embarked on <a href="http://www.j-archive.com/media/2006-05-11_DJ_26.wmv">an exuberant & noisy campaign</a> to clean up New York City'
## 216926 'This Puccini opera turns on the solution to 3 riddles posed by the heroine'
## 216927 'In North America this term is properly applied to only 4 species that are crested, including the tufted'
## 216928 'In Penny Lane, where this "Hellraiser" grew up, the barber shaves another customer--then flays him alive!'
## 216929 'From Ft. Sill, Okla. he made the plea, Arizona is my land, my home, my father's land, to which I now ask to... return"'
## 216930 'A silent movie title includes the last name of this 18th c. statesman & favorite of Catherine the Great'
## value answer round show_number
## 216925 $2000 Stomp Double Jeopardy! 4999
## 216926 $2000 Turandot Double Jeopardy! 4999
## 216927 $2000 a titmouse Double Jeopardy! 4999
## 216928 $2000 Clive Barker Double Jeopardy! 4999
## 216929 $2000 Geronimo Double Jeopardy! 4999
## 216930 <NA> Grigori Alexandrovich Potemkin Final Jeopardy! 4999
#view the structure of the data frame
str(jeopardy_data)
## 'data.frame': 216930 obs. of 7 variables:
## $ category : chr "HISTORY" "ESPN's TOP 10 ALL-TIME ATHLETES" "EVERYBODY TALKS ABOUT IT..." "THE COMPANY LINE" ...
## $ air_date : chr "2004-12-31" "2004-12-31" "2004-12-31" "2004-12-31" ...
## $ question : chr "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'" "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'" "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'" "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'" ...
## $ value : chr "$200" "$200" "$200" "$200" ...
## $ answer : chr "Copernicus" "Jim Thorpe" "Arizona" "McDonald\\'s" ...
## $ round : chr "Jeopardy!" "Jeopardy!" "Jeopardy!" "Jeopardy!" ...
## $ show_number: chr "4680" "4680" "4680" "4680" ...
A requirement of this study was to select four factors upon which to perform the statistical analysis. The Jeopardy! Questions and Answers themselves were not considered to be factors in this experiment. Similarly, while the Category variable sounds tempting to introduce as a categorical variable, this was not considered because there are too many unique categories on Jeopardy! and too few questions within a given category. By process of elimination, the factors chosen for this dataset were Value, Round, Year Aired, and Month Aired.
In the first round (“Jeopardy!”), the values of the questions are $200, $400, $600, $800, and $1000. In the second round (“Double Jeopardy!”), the values of each question are doubled to $400, $800, $1200, $1600, and $2000.
Questions from the “Final Jeopardy!” round were removed from the dataset during cleaning because their values were listed as <NA>. Hidden among the first two rounds are “Daily Double” questions, where the value is determined by the contestant before seeing the question. Unfortunately, the dataset did not include an identifier for “Daily Double” questions, so the dataset was cleaned by removing all questions whose value did not conform to an acceptable value for its round (e.g. $400, $800, $1200, $1600, or $2000 for “Double Jeopardy!”). Of course, this method is not foolproof (e.g. a contestant could have wagered $1200 on a $400 Daily Double), but it removed values that would have added a lone extra level to the data (e.g. $3652) that could only have occurred on a “Daily Double.”
NOTE: On Nov. 26, 2001, Jeopardy! doubled the values awarded on its show.2 Prior to this episode, values were set to $100, $200, $300, $400, and $500 in the first round and $200, $400, $600, $800, and $1000 in the second round. To compensate for this sudden “inflation” of values, the lubridate library was used to identify air dates prior to the transition and retroactively double the value amounts. This change was necessary in order to more accurately compare pre-transition question difficulties alongside more recent question difficulties.
As mentioned above, there are two rounds considered in this analysis: “Jeopardy!” and “Double Jeopardy!”. All “Final Jeopardy!” and “Tiebreaker” rounds were removed from the dataset because the values are set by the contestants rather than the show’s producers. Therefore, this factor has only 2 levels.
Using the lubridate library and the year() function enabled parsing of the air_date variable into a new column titled year. The levels in the dataset span all years from 1984 to 2012.
Again, the lubridate library was used, this time with the month() function, to parse air_date into a new month column. The levels of this factor are all 12 months of the year (Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec).
All the factors considered in this study are categorical variables, meaning they have discrete values. The only variable that may be considered continuous is the response variable (defined below).
The raw dataset did not include a response variable that could be directly used in this type of analysis. As a result, a response variable had to be created. There are many ways to analyze the “difficulty” of a question on Jeopardy!, but one can approximate the difficulty of a question based on the obscurity of the answer. Thus, a function was utilized that takes an input string and outputs the number of Google Search results for that given string.3
# Determining the number of Google Search results for a string
# function credit to Drew Conway at https://gist.github.com/drewconway/791559
google.counts <- function(s){
  # build the search URL, replacing spaces in the query with "+"
  search.url <- paste("http://www.google.com/search?q=", gsub(" ", "+", s), sep = "")
  # download and parse the HTML of the results page
  search.html <- getURL(search.url)
  parse.search <- htmlTreeParse(search.html, useInternalNodes = TRUE)
  # locate the div that contains the result count
  search.nodes <- getNodeSet(parse.search, "//div[@id='resultStats']")
  # the count is the second word of the div text; strip commas and convert to numeric
  search.value <- strsplit(xmlValue(search.nodes[[1]]), " ", fixed = TRUE)[[1]][2]
  return(as.numeric(gsub(",", "", search.value, fixed = TRUE)))
}
This function was applied to the answer column in order to create a new results column.
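As a quick spot check, the scraping function can be called on a single answer string. This is a hypothetical one-off call for illustration; the returned count depends on Google's live index and current page markup, and the RCurl and XML libraries must be loaded first.
# hypothetical spot check of the scraper on one answer from the dataset
# (the returned count depends on Google's live index and current page markup)
google.counts("Jim Thorpe")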
The factors and response variable were selected according to the general selection rules for ANOVA,4 which state that an ANOVA is appropriate when the independent variables are categorical and the dependent (response) variables are continuous. The best categorical factors were chosen based on the limited dataset available.
The dataset required significant cleaning to account for its various quirks, such as “Daily Double” values and the Nov. 2001 transition when Jeopardy! doubled the dollar value of questions in each round. Additionally, because the answer strings were inputs to the function that determined the response variable, care had to be taken to remove certain special characters that would otherwise have skewed the number of Google Search results. Cleaning and preparing the data for analysis was the most time-consuming aspect of the project.
# We do not care about the Question, so delete this column
jeopardy_data$question <- NULL
# We also do not care about the Category Name (there are too many different categories for this to be useful as a factor)
jeopardy_data$category <- NULL
# remove Final Jeopardy! and Tiebreaker rows (as these are not associated with a pre-selected "value" for the question)
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Final Jeopardy!"),]
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Tiebreaker"),]
# remove dollar signs and commas from "value" strings (because they are inconsistently used in dataset)
jeopardy_data[,"value"] <- gsub('\\$|,', '', jeopardy_data[,"value"])
# cast "value" as numerics
jeopardy_data$value <- as.numeric(jeopardy_data$value)
#cast "air_date" as datetimes using lubridate
jeopardy_data$air_date <- ymd(jeopardy_data$air_date)
# NOTE: on Nov. 26, 2001 (ep. 3966), the prize values were doubled for each round
# we will retroactively double the "value" for episodes aired PRIOR to the transition to compare apples to apples
transition_date <- ymd("2001-11-26")
# remove Daily Doubles (where value is not as listed on board e.g. $3562)
# this is obviously an approx. since contestant may have picked an 'acceptable value' for their DD bet
acceptable_values0 <- c("100", "200", "300", "400", "500") #pre-transition Jeopardy! values
acceptable_values1 <- c("200", "400", "600", "800", "1000") #pre-transition Double Jeopardy! values, post-transition Jeopardy! values
acceptable_values2 <- c("400", "800", "1200", "1600", "2000") #post-transition Double Jeopardy! values
jeopardy_data <- jeopardy_data[(jeopardy_data$air_date < transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values0) |
                               (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values1) |
                               (jeopardy_data$air_date < transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values1) |
                               (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values2),]
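A quick sanity check of the cleaned data frame (a sketch; the exact counts depend on the dataset version and are not shown here) confirms that only the two board rounds and their standard values remain.
# sanity check: only "Jeopardy!" and "Double Jeopardy!" rows with standard board values should remain
nrow(jeopardy_data)
table(jeopardy_data$round)
sort(unique(jeopardy_data$value))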
In order to assess which factors might have an effect on the “obscurity” of a Jeopardy answer, an ANOVA design was selected because the continuous response variable could be compared to the categorical factors described above. The Google Search results were used as a quantitative proxy for the otherwise nebulous concept of “obscurity” of a given answer. Limitations to this approach are elaborated upon in the Contingencies section below.
With over 200,000 questions available and a function that requires 500-600ms to generate the value of the response variable for a given question, it was not feasible to test the entire dataset (which would have taken 24-48 hours just to generate the data). Therefore, it was necessary to select an appropriately sized random sample from the dataset. The importance of randomization for ANOVA is detailed further in section 4 below.
Upon selecting a random sample, the data was cleaned further to eliminate extraneous character strings from the answer data, and the year and month columns were added.
# NOTE ON CHOOSING SAMPLE SIZE: Google-search function implemented in this code takes 500~600ms PER ITERATION
# rule of thumb: will take ~1 min for every 100 items in the sample, so choose sample size wisely
# sample size must be less than the number of rows in the dataset (which is 200,000... so do not worry)
sample_size <- 500
# set seed for reproducibility of random sample for writing report
#(CHANGE OR REMOVE THE LINE BELOW TO ASSESS OVERALL REPEATABILITY OF RESULTS)
set.seed(1)
# select sample from jeopardy data
#randomly sample rows from the dataset (without replacement is the default)
brief <- jeopardy_data[sample(nrow(jeopardy_data), sample_size),]
# format the "answer" column to remove backslashes (strings in the dataset using apostrophes had escape-backslashes)
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\\\", "")
# format the "answer" to remove the string "or ", which has special meaning in Google searches
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\(or ", " ")
# format the "answer" to remove the character ", which is used non-uniformly in dataset and vastly impacts Google Search results
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\"", "")
#define separate columns for month and year of air_date
brief$month <- month(brief$air_date, label=TRUE)
brief$year <- year(brief$air_date)
#subdivide Year into a new column called Decade (to simplify main effect analysis later)
brief$year <- as.numeric(as.character(brief$year))
brief$decade <- cut(brief$year, breaks = c(1980,1990,2000,2010,2020), labels = c("1980s", "1990s", "2000s", "2010s"))
Following the cleaning, the response variable column, results, was populated using the Google Search function shown in section 1. In addition, the value in the value column was doubled if the episode had aired prior to the transition date in Nov. 2001.
# initialize a new column for Google Search results to be placed into
brief$results <- 0
# apply the for loop to the entire sample
for(i in 1:nrow(brief)) {
# run the Google search function on all "answers", put in "results"
brief[i, "results"] <- google.counts(brief[i,"answer"])
# retroactively double the value if the episode was aired before the transition date in 2001
if (brief[i, "air_date"] < transition_date) {
brief[i, "value"] <- brief[i, "value"] * 2
}
}
Jeopardy! occasionally recycles a question category, but the show does not re-use questions. Therefore, the nature of the experimental setting does not lend itself to replication or repeated measurements.
Blocking was briefly considered for this experiment with respect to the transition in Nov. 2001 (see Contingencies section below), but a simple “inflation adjustment” served as a more convenient workaround.
Future work could consider blocking by round, but this would eliminate one of the four required factors for this ANOVA.
The four factors being considered in this study (round, value, year, and month) were cast as factors.
#define columns of interest as factors
brief$round <- as.factor(brief$round)
brief$value <- as.factor(brief$value)
brief$year <- as.factor(brief$year)
brief$month <- as.factor(brief$month)
# re-arrange Round factor levels to "Jeopardy!" before "Double Jeopardy!" (otherwise is alphabetical)
brief$round = factor(brief$round, c("Jeopardy!", "Double Jeopardy!"))
The respective levels of each factor represented in the sample are:
#view levels of each factor
levels(brief$round)
## [1] "Jeopardy!" "Double Jeopardy!"
levels(brief$value)
## [1] "200" "400" "600" "800" "1000" "1200" "1600" "2000"
levels(brief$year)
## [1] "1984" "1985" "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993"
## [11] "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003"
## [21] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012"
levels(brief$month)
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
Initially, only the raw results were plotted in a histogram. However, the results data varies across several orders of magnitude (roughly 10^5 to 10^10), so the log10() function was implemented to make the plot more visually useful. A log-based histogram of the results seems to show a roughly normal distribution (Fig. 2).
Fig. 2 Histogram of the log of the results data.
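For reference, the histogram in Fig. 2 was generated with the following call (also included in the full script at the end of this report):
#Create a histogram of Google Search Results ('brief$results') on a LOG SCALE
hist(log10(brief$results), main = "Google Search Results for Jeopardy Answers", xlab = "log10(results)")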
Given the strong skew of the raw results (and their extremely non-normal distribution on a linear scale), a new logresults column was created and subsequently treated as the response variable.
#take the common log of the "results" column
brief$logresults <- log10(brief$results)
Boxplots were used to visually explore the main effects of each factor.
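The boxplots were generated with the following calls, reproduced from the full script at the end of this report (the figures themselves are omitted here):
# view boxplots of results for each factor
boxplot(logresults~value, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Value")
boxplot(logresults~round, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Round")
boxplot(logresults~year, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Year")
boxplot(logresults~month, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Month")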
To simplify the main effects computations for year (which has 29 levels), a new column called decade was created that bins each year into the 1980s, 1990s, 2000s, or 2010s. The main effect for each pair of decades was computed as the difference between the mean log-results of the two decades. It is assumed that, due to randomization, the sample is sufficiently balanced and that this approach is valid.
#compute main effects by DECADE
print(me_decade1 <- mean(subset(brief$logresults, brief$decade == "1990s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] -0.03837641
print(me_decade2 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] -0.003630541
print(me_decade3 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] 0.03342239
print(me_decade4 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
## [1] 0.03474587
print(me_decade5 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
## [1] 0.0717988
print(me_decade6 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "2000s")))
## [1] 0.03705293
The main effect of round was computed by subtracting the mean of the “Jeopardy!” round results from the mean of the “Double Jeopardy!” round results. Again, it is assumed that the random sample is sufficiently balanced. A negative main effect for round indicates that, as the round advances from “Jeopardy!” to “Double Jeopardy!”, the answers become more obscure (fewer Google Search results).
#compute main effects by ROUND
print(me_round <- mean(subset(brief$logresults, brief$round == "Double Jeopardy!")) - mean(subset(brief$logresults, brief$round == "Jeopardy!")))
## [1] -0.2332999
Only a select portion of the main effects of value were computed due to the many levels of this factor. It is also assumed that the data follow a general decreasing trend with increasing value. Therefore, all the main effects using 2000 as the high point were computed.
#compute main effects by VALUE
print(me_value1 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 200)))
## [1] -1.16066
print(me_value2 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 400)))
## [1] -0.7223347
print(me_value3 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 600)))
## [1] -0.6218618
print(me_value4 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 800)))
## [1] -0.5262369
print(me_value5 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1000)))
## [1] -0.4126626
print(me_value6 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1200)))
## [1] -0.3182218
print(me_value7 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1600)))
## [1] -0.5560816
The main effects of month combinations were not computed because there is no apparent trend in the boxplots.
Interaction plots were generated for all six combinations of the four factors (4C2 = 6).
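The plots were produced with the interaction.plot calls reproduced below from the full script; note that the script uses the decade column in place of year (the figures themselves are omitted here).
#2-factor interaction plots
interaction.plot(brief$decade, brief$value, brief$logresults)
interaction.plot(brief$decade, brief$round, brief$logresults)
interaction.plot(brief$decade, brief$month, brief$logresults)
interaction.plot(brief$round, brief$value, brief$logresults)
interaction.plot(brief$value, brief$month, brief$logresults)
interaction.plot(brief$round, brief$month, brief$logresults)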
To examine the statistical significance of the exploratory analyses above, a series of two-way ANOVAs was computed for all six factor combinations.
#compute the anova for each 2-combination of factors and print the summary
summary(aov(brief$logresults ~ brief$year*brief$month))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 0.994 0.478
## brief$month 10 10.9 1.086 0.834 0.596
## brief$year:brief$month 148 176.9 1.195 0.918 0.721
## Residuals 313 407.6 1.302
summary(aov(brief$logresults ~ brief$year*brief$value))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 1.082 0.357
## brief$value 7 43.6 6.226 5.207 1.16e-05 ***
## brief$year:brief$value 117 137.0 1.171 0.979 0.546
## Residuals 347 414.9 1.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$year*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 1.026 0.4313
## brief$round 1 8.1 8.099 6.418 0.0116 *
## brief$year:brief$round 22 22.0 1.000 0.792 0.7362
## Residuals 448 565.3 1.262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$month*brief$value))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$month 10 13.4 1.341 1.115 0.349
## brief$value 7 39.8 5.692 4.735 3.94e-05 ***
## brief$month:brief$value 70 83.2 1.188 0.989 0.508
## Residuals 412 495.2 1.202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$month*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$month 10 13.4 1.341 1.070 0.3840
## brief$round 1 5.2 5.151 4.110 0.0432 *
## brief$month:brief$round 10 14.1 1.408 1.123 0.3426
## Residuals 478 599.0 1.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$value*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$value 7 41.3 5.905 4.941 2.05e-05 ***
## brief$round 1 0.2 0.194 0.163 0.6868
## brief$value:brief$round 1 4.5 4.539 3.798 0.0519 .
## Residuals 490 585.6 1.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Across all two-factor ANOVAs containing value as a factor, the F-statistic associated with value was statistically significant (p << 0.05), indicating that the differences in the number of Google Search results across question values are unlikely to be explained by randomness alone. The factor round was statistically significant (p < 0.05) in all relevant ANOVAs except the two-factor ANOVA between value and round. In that particular ANOVA, the interaction between value and round had a p-value of 0.052, just above the cutoff for statistical significance but suggestive of a possible interaction between value and round.
The biggest assumption made in this experiment is that the number of Google Search results for a given Jeopardy! answer is inversely related to the obscurity, and therefore the difficulty, of the question. It is easy to come up with counterexamples, such as an obscure question whose answer is an otherwise commonly used noun. It was hoped that the nuisance factors associated with this imperfect measure of “answer obscurity” would be distributed evenly among groups, and would not be significant enough to alter the results of the analysis.
In addition, anyone who has pressed “Ctrl+U” while visiting a Google Search results page will recognize that the scraping function depends on the current structure of Google’s HTML (specifically the “resultStats” element), so the response variable is only as reliable as that page-parsing step.
A factorial design differs from the best-guess and one-factor-at-a-time (OFAT) approaches in that it tests all possible combinations of a given set of factors. By removing subjectivity from the experimental design and analysis, it is more likely that one will uncover the true behavior across the parameter space.
When the factor levels are categorical and the response variable is continuous, one should perform a t-test or an analysis of variance (ANOVA). When performing an ANOVA, one must assign factors, levels, and replicates randomly to remove all systematic biases from the analysis. While the experiment is controlled, there may be nuisance factors that impact the response variable of interest. Variation in the response variable is expected, and the later analysis compares the controlled variations to the uncontrolled (random) variations.
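As a minimal illustration (a sketch assuming the brief data frame prepared above), a two-level factor such as round can be tested equivalently with a pooled-variance t-test or a one-way ANOVA, since the ANOVA F-statistic is the square of the t-statistic in that case:
# one-way ANOVA on the two-level factor "round"
summary(aov(logresults ~ round, data = brief))
# equivalent pooled-variance t-test; the squared t-statistic equals the ANOVA F-statistic above
t.test(logresults ~ round, data = brief, var.equal = TRUE)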
The main effect looks at the averages across levels of a given factor. In a simple 2-factor, 2-level design (shown in section 5.1 of Montgomery), the main effect is “the difference between the average response at the low level of A and the average response at the high level of A.”5 When a factor has many levels, a main effects plot or side-by-side boxplots are often the most informative. In an ideal world, one assigns equal numbers of trials to each group, leading to a balanced design. However, in an ex post facto analysis such as the Jeopardy! analysis performed above, we did not have the luxury of an organized dataset to work from. There were various quirks in the dataset (e.g. “Daily Double” scores) that had to be removed prior to sampling. While it is not feasible to obtain a perfectly balanced sample from the dataset, one can take a sample large enough that the influence of any missing levels balances out due to randomization.
When the source of a nuisance factor is known, one may apply blocking to remove the variation associated with that nuisance factor from the experiment. In this analysis, blocking was not performed, but a recommendation was given for future work to examine blocking by Jeopardy! round. However, this would have eliminated one of the four required factors for a dataset that does not contain much information about each question to begin with. If feasible, it would have been helpful if the dataset had included information such as “Daily Double (Y/N)” or whether the question was answered correctly on the first, second, or third attempt (or not at all).
Statistical significance is assessed by performing the ANOVA. It is assumed that there will be variation in the response variable, and the ANOVA seeks to attribute that variation to the factors, the interactions between factors, or pure randomness. The F-statistic is used, which is the ratio of the mean squares between groups to the mean squares within groups. A large F-statistic indicates a higher chance that something other than randomness determined the variation in the response variable. Knowing the degrees of freedom, one can look up the F-distribution to obtain the critical F-value for a given confidence level, which determines whether the result is statistically significant.
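For example, using the degrees of freedom from the value and round ANOVA above (7 and 490), the critical F-value at the 95% confidence level and the p-value for the observed F-statistic can be obtained directly from the F-distribution in R (a brief sketch):
# critical F-value at the 95% confidence level for 7 and 490 degrees of freedom
qf(0.95, df1 = 7, df2 = 490)
# p-value for the observed F-statistic of the "value" factor (F = 4.941)
pf(4.941, df1 = 7, df2 = 490, lower.tail = FALSE)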
The JSON file (52.9 MB) of the raw data1 can be downloaded from the link provided at https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/.
# Analysis of the Obscurity of Jeopardy Answers by Google Search Results
# M. Deagen
# ISYE 6020
# 2016/10/08
# the following libraries must be installed and loaded
require(jsonlite)
require(lubridate)
require(XML)
require(RCurl)
require(stringr)
# Download dataset from https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
# ----> EDIT the "file_location" string with the directory of the file on your computer
file_location <- "C:/Users/Deagen/Desktop/JEOPARDY_QUESTIONS1.json"
# import JSON file into a data frame for R (the local file directory must be assigned to the "file_location" string)
jeopardy_data <- fromJSON(file_location)
# view the first 6 rows of the data frame
head(jeopardy_data)
# view the last 6 rows of the data frame
tail(jeopardy_data)
#view the structure of the data frame
str(jeopardy_data)
# Determining the number of Google Search results for a string
# function credit to Drew Conway at https://gist.github.com/drewconway/791559
google.counts<-function(s){
search.url<-paste("http://www.google.com/search?q=",gsub(" ","+",s),sep="")
search.html<-getURL(search.url)
parse.search<-htmlTreeParse(search.html,useInternalNodes = TRUE)
search.nodes<-getNodeSet(parse.search,"//div[@id='resultStats']")
search.value<-strsplit(xmlValue(search.nodes[[1]])," ",fixed=TRUE)[[1]][2]
return(as.numeric(gsub(",","",search.value,fixed=TRUE)))
}
# We do not care about the Question, so delete this column
jeopardy_data$question <- NULL
# We also do not care about the Category Name (there are too many different categories for this to be useful as a factor)
jeopardy_data$category <- NULL
# remove Final Jeopardy! and Tiebreaker rows (as these are not associated with a pre-selected "value" for the question)
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Final Jeopardy!"),]
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Tiebreaker"),]
# remove dollar signs and commas from "value" strings (because they are inconsistently used in dataset)
jeopardy_data[,"value"] <- gsub('\\$|,', '', jeopardy_data[,"value"])
# cast "value" as numerics
jeopardy_data$value <- as.numeric(jeopardy_data$value)
#cast "air_date" as datetimes using lubridate
jeopardy_data$air_date <- ymd(jeopardy_data$air_date)
# NOTE: on Nov. 26, 2001 (ep. 3966), the prize values were doubled for each round
# we will retroactively double the "value" for episodes aired PRIOR to the transition to compare apples to apples
transition_date <- ymd("2001-11-26")
# remove Daily Doubles (where value is not as listed on board e.g. $3562)
# this is obviously an approx. since contestant may have picked an 'acceptable value' for their DD bet
acceptable_values0 <- c("100", "200", "300", "400", "500") #pre-transition Jeopardy! values
acceptable_values1 <- c("200", "400", "600", "800", "1000") #pre-transition Double Jeopardy! values, post-transition Jeopardy! values
acceptable_values2 <- c("400", "800", "1200", "1600", "2000") #post-transition Double Jeopardy! values
jeopardy_data <- jeopardy_data[(jeopardy_data$air_date < transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values0) | (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values1) | (jeopardy_data$air_date < transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values1) | (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values2),]
# NOTE ON CHOOSING SAMPLE SIZE: Google-search function implemented in this code takes 500~600ms PER ITERATION
# rule of thumb: will take ~1 min for every 100 items in the sample, so choose sample size wisely
# sample size must be less than the number of rows in the dataset (which is 200,000... so do not worry)
sample_size <- 500
# set seed for reproducibility of random sample for writing report
#(CHANGE OR REMOVE THE LINE BELOW TO ASSESS OVERALL REPEATABILITY OF RESULTS)
set.seed(1)
# select sample from jeopardy data
#randomly sample rows from the dataset (without replacement is the default)
brief <- jeopardy_data[sample(nrow(jeopardy_data), sample_size),]
# format the "answer" column to remove backslashes (strings in the dataset using apostrophes had escape-backslashes)
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\\\", "")
# format the "answer" to remove the string "or ", which has special meaning in Google searches
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\(or ", " ")
# format the "answer" to remove the character ", which is used non-uniformly in dataset and vastly impacts Google Search results
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\"", "")
#define separate columns for month and year of air_date
brief$month <- month(brief$air_date, label=TRUE)
brief$year <- year(brief$air_date)
#subdivide Year into a new column called Decade (to simplify main effect analysis later)
brief$year <- as.numeric(as.character(brief$year))
brief$decade <- cut(brief$year, breaks = c(1980,1990,2000,2010,2020), labels = c("1980s", "1990s", "2000s", "2010s"))
# initialize a new column for Google Search results to be placed into
brief$results <- 0
# apply the for loop to the entire sample
for(i in 1:nrow(brief)) {
# run the Google search function on all "answers", put in "results"
brief[i, "results"] <- google.counts(brief[i,"answer"])
# retroactively double the value if the episode was aired before the transition date in 2001
if (brief[i, "air_date"] < transition_date) {
brief[i, "value"] <- brief[i, "value"] * 2
}
}
#Create a histogram of Google Search Results ('brief$results') on a LOG SCALE
par(mfrow=c(1,1))
hist(log10(brief$results), main = "Google Search Results for Jeopardy Answers", xlab = "log10(results)")
#take the common log of the "results" column into a new column "logresults"
brief$logresults <- log10(brief$results)
#define columns of interest as factors
brief$round <- as.factor(brief$round)
brief$value <- as.factor(brief$value)
brief$year <- as.factor(brief$year)
brief$month <- as.factor(brief$month)
# re-arrange Round factor levels to "Jeopardy!" before "Double Jeopardy!" (otherwise is alphabetical)
brief$round = factor(brief$round, c("Jeopardy!", "Double Jeopardy!"))
#view levels of each factor
levels(brief$round)
levels(brief$value)
levels(brief$year)
levels(brief$month)
# view boxplots of results for each factor
boxplot(logresults~value, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Value")
boxplot(logresults~round, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Round")
boxplot(logresults~year, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Year")
boxplot(logresults~month, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Month")
#compute main effects by DECADE
print(me_decade1 <- mean(subset(brief$logresults, brief$decade == "1990s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade2 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade3 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade4 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
print(me_decade5 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
print(me_decade6 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "2000s")))
boxplot(logresults~decade, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Decade")
#compute main effects by ROUND
print(me_round <- mean(subset(brief$logresults, brief$round == "Double Jeopardy!")) - mean(subset(brief$logresults, brief$round == "Jeopardy!")))
#compute main effects by VALUE
print(me_value1 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 200)))
print(me_value2 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 400)))
print(me_value3 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 600)))
print(me_value4 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 800)))
print(me_value5 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1000)))
print(me_value6 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1200)))
print(me_value7 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1600)))
#2-factor interaction plots
par(mfrow=c(1,1))
interaction.plot(brief$decade, brief$value, brief$logresults)
interaction.plot(brief$decade, brief$round, brief$logresults)
interaction.plot(brief$decade, brief$month, brief$logresults)
interaction.plot(brief$round, brief$value, brief$logresults)
interaction.plot(brief$value, brief$month, brief$logresults)
interaction.plot(brief$round, brief$month, brief$logresults)
#compute the anova for each 2-combination of factors and print the summary
summary(aov(brief$logresults ~ brief$year*brief$month))
summary(aov(brief$logresults ~ brief$year*brief$value))
summary(aov(brief$logresults ~ brief$year*brief$round))
summary(aov(brief$logresults ~ brief$month*brief$value))
summary(aov(brief$logresults ~ brief$month*brief$round))
summary(aov(brief$logresults ~ brief$value*brief$round))