Jeopardy! is a highly successful American trivia game show that has run on syndicated television for decades. In each game, three contestants compete to answer questions (or “question answers,” for those familiar with the show’s format) worth varying dollar amounts. Generally, the difficulty of the questions increases with the dollar amount. Each show is divided into three rounds: “Jeopardy!”, “Double Jeopardy!”, and “Final Jeopardy!”. In “Final Jeopardy!”, contestants wager their accumulated winnings on a single question. The show’s impact on mainstream culture is evident in its numerous pop culture parodies, and its format is recognized almost universally (Fig. 1).
Fig. 1 Typical format of the Jeopardy! gameboard.
The goal of this study is to identify statistically significant differences in the “obscurity” of a question’s answer (as measured by the number of Google Search results) across four factors associated with each question (Year, Month, Value, and Round).
Reddit user trexmatt scraped the fan-created Jeopardy question archive website http://www.j-archive.com/ and created a .JSON file with the data on over 200,000 Jeopardy questions.1 This large dataset required some cleaning prior to the analysis. The columns in the raw dataset included category, air_date, question, value, answer, round, and show_number. The jsonlite library was useful for importing the JSON dataset into R as a single data frame.
# import JSON file into a data frame for R (the local file directory must be assigned to the "file_location" string)
jeopardy_data <- fromJSON(file_location)
# view the first 6 rows of the data frame
head(jeopardy_data)
## category air_date
## 1 HISTORY 2004-12-31
## 2 ESPN's TOP 10 ALL-TIME ATHLETES 2004-12-31
## 3 EVERYBODY TALKS ABOUT IT... 2004-12-31
## 4 THE COMPANY LINE 2004-12-31
## 5 EPITAPHS & TRIBUTES 2004-12-31
## 6 3-LETTER WORDS 2004-12-31
## question
## 1 'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'
## 2 'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'
## 3 'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'
## 4 'In 1963, live on "The Art Linkletter Show", this company served its billionth burger'
## 5 'Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States'
## 6 'In the title of an Aesop fable, this insect shared billing with a grasshopper'
## value answer round show_number
## 1 $200 Copernicus Jeopardy! 4680
## 2 $200 Jim Thorpe Jeopardy! 4680
## 3 $200 Arizona Jeopardy! 4680
## 4 $200 McDonald\\'s Jeopardy! 4680
## 5 $200 John Adams Jeopardy! 4680
## 6 $200 the ant Jeopardy! 4680
# view the last 6 rows of the data frame
tail(jeopardy_data)
## category air_date
## 216925 OFF-BROADWAY 2006-05-11
## 216926 RIDDLE ME THIS 2006-05-11
## 216927 "T" BIRDS 2006-05-11
## 216928 AUTHORS IN THEIR YOUTH 2006-05-11
## 216929 QUOTATIONS 2006-05-11
## 216930 HISTORIC NAMES 2006-05-11
## question
## 216925 'In 2006 the cast of this long-running hit embarked on <a href="http://www.j-archive.com/media/2006-05-11_DJ_26.wmv">an exuberant & noisy campaign</a> to clean up New York City'
## 216926 'This Puccini opera turns on the solution to 3 riddles posed by the heroine'
## 216927 'In North America this term is properly applied to only 4 species that are crested, including the tufted'
## 216928 'In Penny Lane, where this "Hellraiser" grew up, the barber shaves another customer--then flays him alive!'
## 216929 'From Ft. Sill, Okla. he made the plea, Arizona is my land, my home, my father's land, to which I now ask to... return"'
## 216930 'A silent movie title includes the last name of this 18th c. statesman & favorite of Catherine the Great'
## value answer round show_number
## 216925 $2000 Stomp Double Jeopardy! 4999
## 216926 $2000 Turandot Double Jeopardy! 4999
## 216927 $2000 a titmouse Double Jeopardy! 4999
## 216928 $2000 Clive Barker Double Jeopardy! 4999
## 216929 $2000 Geronimo Double Jeopardy! 4999
## 216930 <NA> Grigori Alexandrovich Potemkin Final Jeopardy! 4999
#view the structure of the data frame
str(jeopardy_data)
## 'data.frame': 216930 obs. of 7 variables:
## $ category : chr "HISTORY" "ESPN's TOP 10 ALL-TIME ATHLETES" "EVERYBODY TALKS ABOUT IT..." "THE COMPANY LINE" ...
## $ air_date : chr "2004-12-31" "2004-12-31" "2004-12-31" "2004-12-31" ...
## $ question : chr "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'" "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'" "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'" "'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'" ...
## $ value : chr "$200" "$200" "$200" "$200" ...
## $ answer : chr "Copernicus" "Jim Thorpe" "Arizona" "McDonald\\'s" ...
## $ round : chr "Jeopardy!" "Jeopardy!" "Jeopardy!" "Jeopardy!" ...
## $ show_number: chr "4680" "4680" "4680" "4680" ...
A requirement of this study was to select four factors upon which to perform the statistical analysis. The Jeopardy! Questions and Answers themselves were not considered to be factors in this experiment. Similarly, while the Category variable sounds tempting to introduce as a categorical variable, this was not considered because there are too many unique categories on Jeopardy! and too few questions within a given category. By process of elimination, the factors chosen for this dataset were Value, Round, Year Aired, and Month Aired.
In the first round (“Jeopardy!”), the values of the questions are $200, $400, $600, $800, and $1000. In the second round (“Double Jeopardy!”), the values of each question are doubled to $400, $800, $1200, $1600, and $2000.
Questions from the “Final Jeopardy!” round were removed from the dataset during cleaning because their values were listed as <NA>. Hidden among the first two rounds are “Daily Double” questions, where the value is determined by the contestant before seeing the question. Unfortunately, the dataset did not include an identifier for “Daily Double” questions, so the dataset was cleaned by removing all questions whose value did not conform to an acceptable value for its round (e.g. $400, $800, $1200, $1600, or $2000 for “Double Jeopardy!”). Of course, this method is not foolproof (e.g. a contestant could have wagered $1200 on a $400 Daily Double), but it removed values that would have added a lone extra level to the data (e.g. $3652) that could only have occurred on a “Daily Double.”
NOTE: On Nov. 26, 2001, Jeopardy! doubled the values awarded on its show.2 Prior to this episode, values were set to $100, $200, $300, $400, and $500 in the first round and $200, $400, $600, $800, and $1000 in the second round. To compensate for this sudden “inflation” of values, the lubridate library was used to identify air dates prior to the transition and retroactively double the value amounts. This change was necessary in order to more accurately compare pre-transition question difficulties alongside more recent question difficulties.
As mentioned above, there are two rounds considered in this analysis: “Jeopardy!” and “Double Jeopardy!”. All “Final Jeopardy!” and “Tiebreaker” rounds were removed from the dataset because the values are set by the contestants rather than the show’s producers. Therefore, this factor has only 2 levels.
Using the lubridate library and the year() function enabled parsing of the air_date variable into a new column titled year. The levels in the dataset span all years from 1984 to 2012.
Again, the lubridate library was used, this time with the month() function, to parse air_date into a new month column. The levels of this factor are all 12 months of the year (Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec).
All the factors considered in this study are categorical variables, meaning they have discrete values. The only variable that may be considered continuous is the response variable (defined below).
The raw dataset did not include a response variable that could be directly used in this type of analysis. As a result, a response variable had to be created. There are many ways to analyze the “difficulty” of a question on Jeopardy!, but one can approximate the difficulty of a question based on the obscurity of the answer. Thus, a function was utilized that takes an input string and outputs the number of Google Search results for that given string.3
# Determining the number of Google Search results for a string
# function credit to Drew Conway at https://gist.github.com/drewconway/791559
google.counts <- function(s){
  # build the search URL, replacing spaces in the query with "+"
  search.url <- paste("http://www.google.com/search?q=", gsub(" ", "+", s), sep = "")
  # download and parse the HTML of the results page
  search.html <- getURL(search.url)
  parse.search <- htmlTreeParse(search.html, useInternalNodes = TRUE)
  # locate the div that contains the result count
  search.nodes <- getNodeSet(parse.search, "//div[@id='resultStats']")
  # the count is the second word of the div text; strip commas and convert to numeric
  search.value <- strsplit(xmlValue(search.nodes[[1]]), " ", fixed = TRUE)[[1]][2]
  return(as.numeric(gsub(",", "", search.value, fixed = TRUE)))
}
This function was applied to the answer column in order to create a new results column.
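As a quick spot check, the scraping function can be called on a single answer string. This is a hypothetical one-off call for illustration; the returned count depends on Google's live index and current page markup, and the RCurl and XML libraries must be loaded first.
# hypothetical spot check of the scraper on one answer from the dataset
# (the returned count depends on Google's live index and current page markup)
google.counts("Jim Thorpe")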
The factors and response variable were selected according to the general selection rules for ANOVA,4 which state that an ANOVA is appropriate when the independent variables are categorical and the dependent (response) variables are continuous. The best categorical factors were chosen based on the limited dataset available.
The dataset required significant cleaning to account for its various quirks, such as “Daily Double” values and the Nov. 2001 transition when Jeopardy! doubled the dollar value of questions in each round. Additionally, because the answer strings were inputs to the function that determined the response variable, care had to be taken to remove certain special characters that would otherwise have skewed the number of Google Search results. Cleaning and preparing the data for analysis was the most time-consuming aspect of the project.
# We do not care about the Question, so delete this column
jeopardy_data$question <- NULL
# We also do not care about the Category Name (there are too many different categories for this to be useful as a factor)
jeopardy_data$category <- NULL
# remove Final Jeopardy! and Tiebreaker rows (as these are not associated with a pre-selected "value" for the question)
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Final Jeopardy!"),]
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Tiebreaker"),]
# remove dollar signs and commas from "value" strings (because they are inconsistently used in dataset)
jeopardy_data[,"value"] <- gsub('\\$|,', '', jeopardy_data[,"value"])
# cast "value" as numerics
jeopardy_data$value <- as.numeric(jeopardy_data$value)
#cast "air_date" as datetimes using lubridate
jeopardy_data$air_date <- ymd(jeopardy_data$air_date)
# NOTE: on Nov. 26, 2001 (ep. 3966), the prize values were doubled for each round
# we will retroactively double the "value" for episodes aired PRIOR to the transition to compare apples to apples
transition_date <- ymd("2001-11-26")
# remove Daily Doubles (where value is not as listed on board e.g. $3562)
# this is obviously an approx. since contestant may have picked an 'acceptable value' for their DD bet
acceptable_values0 <- c("100", "200", "300", "400", "500") #pre-transition Jeopardy! values
acceptable_values1 <- c("200", "400", "600", "800", "1000") #pre-transition Double Jeopardy! values, post-transition Jeopardy! values
acceptable_values2 <- c("400", "800", "1200", "1600", "2000") #post-transition Double Jeopardy! values
jeopardy_data <- jeopardy_data[(jeopardy_data$air_date < transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values0) |
                               (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values1) |
                               (jeopardy_data$air_date < transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values1) |
                               (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values2),]
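A quick sanity check of the cleaned data frame (a sketch; the exact counts depend on the dataset version and are not shown here) confirms that only the two board rounds and their standard values remain.
# sanity check: only "Jeopardy!" and "Double Jeopardy!" rows with standard board values should remain
nrow(jeopardy_data)
table(jeopardy_data$round)
sort(unique(jeopardy_data$value))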
In order to assess which factors might have an effect on the “obscurity” of a Jeopardy answer, an ANOVA design was selected because the continuous response variable could be compared to the categorical factors described above. The Google Search results were used as a quantitative proxy for the otherwise nebulous concept of “obscurity” of a given answer. Limitations to this approach are elaborated upon in the Contingencies section below.
With over 200,000 questions available and a function that requires 500-600ms to generate the value of the response variable for a given question, it was not feasible to test the entire dataset (which would have taken 24-48 hours just to generate the data). Therefore, it was necessary to select an appropriately sized random sample from the dataset. The importance of randomization for ANOVA is detailed further in section 4 below.
Upon selecting a random sample, the data was cleaned further to eliminate extraneous character strings from the answer data, and the year and month columns were added.
# NOTE ON CHOOSING SAMPLE SIZE: Google-search function implemented in this code takes 500~600ms PER ITERATION
# rule of thumb: will take ~1 min for every 100 items in the sample, so choose sample size wisely
# sample size must be less than the number of rows in the dataset (which is 200,000... so do not worry)
sample_size <- 500
# set seed for reproducibility of random sample for writing report
#(CHANGE OR REMOVE THE LINE BELOW TO ASSESS OVERALL REPEATABILITY OF RESULTS)
set.seed(1)
# select sample from jeopardy data
#randomly sample rows from the dataset (without replacement is the default)
brief <- jeopardy_data[sample(nrow(jeopardy_data), sample_size),]
# format the "answer" column to remove backslashes (strings in the dataset using apostrophes had escape-backslashes)
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\\\", "")
# format the "answer" to remove the string "or ", which has special meaning in Google searches
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\(or ", " ")
# format the "answer" to remove the character ", which is used non-uniformly in dataset and vastly impacts Google Search results
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\"", "")
#define separate columns for month and year of air_date
brief$month <- month(brief$air_date, label=TRUE)
brief$year <- year(brief$air_date)
#subdivide Year into a new column called Decade (to simplify main effect analysis later)
brief$year <- as.numeric(as.character(brief$year))
brief$decade <- cut(brief$year, breaks = c(1980,1990,2000,2010,2020), labels = c("1980s", "1990s", "2000s", "2010s"))
Following the cleaning, the response variable column, results, was populated using the Google Search function shown in section 1. In addition, the value in the value column was doubled if the episode had aired prior to the transition date in Nov. 2001.
# initialize a new column for Google Search results to be placed into
brief$results <- 0
# apply the for loop to the entire sample
for(i in 1:nrow(brief)) {
# run the Google search function on all "answers", put in "results"
brief[i, "results"] <- google.counts(brief[i,"answer"])
# retroactively double the value if the episode was aired before the transition date in 2001
if (brief[i, "air_date"] < transition_date) {
brief[i, "value"] <- brief[i, "value"] * 2
}
}
Jeopardy! occasionally recycles a question category, but the show does not re-use questions. Therefore, the nature of the experimental setting does not lend itself to replication or repeated measurements.
Blocking was briefly considered for this experiment with respect to the transition in Nov. 2001 (see Contingencies section below), but a simple “inflation adjustment” served as a more convenient workaround.
Future work could consider blocking by round, but this would eliminate one of the four required factors for this ANOVA.
The four factors being considered in this study (round, value, year, and month) were cast as factors.
#define columns of interest as factors
brief$round <- as.factor(brief$round)
brief$value <- as.factor(brief$value)
brief$year <- as.factor(brief$year)
brief$month <- as.factor(brief$month)
# re-arrange Round factor levels to "Jeopardy!" before "Double Jeopardy!" (otherwise is alphabetical)
brief$round = factor(brief$round, c("Jeopardy!", "Double Jeopardy!"))
The respective levels of each factor represented in the sample are:
#view levels of each factor
levels(brief$round)
## [1] "Jeopardy!" "Double Jeopardy!"
levels(brief$value)
## [1] "200" "400" "600" "800" "1000" "1200" "1600" "2000"
levels(brief$year)
## [1] "1984" "1985" "1986" "1987" "1988" "1989" "1990" "1991" "1992" "1993"
## [11] "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003"
## [21] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012"
levels(brief$month)
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
Initially, only the raw results were plotted in a histogram. However, the results data varies across several orders of magnitude (roughly 10^5 to 10^10), so the log10() function was implemented to make the plot more visually useful. A log-based histogram of the results seems to show a roughly normal distribution (Fig. 2).
Fig. 2 Histogram of the log of the results data.
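For reference, the histogram in Fig. 2 was generated with the following call (also included in the full script at the end of this report):
#Create a histogram of Google Search Results ('brief$results') on a LOG SCALE
hist(log10(brief$results), main = "Google Search Results for Jeopardy Answers", xlab = "log10(results)")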
Given the strong skew of the raw results (and their extremely non-normal distribution on a linear scale), a new logresults column was created and subsequently treated as the response variable.
#take the common log of the "results" column
brief$logresults <- log10(brief$results)
Boxplots were used to visually explore the main effects of each factor.
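The boxplots were generated with the following calls, reproduced from the full script at the end of this report (the figures themselves are omitted here):
# view boxplots of results for each factor
boxplot(logresults~value, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Value")
boxplot(logresults~round, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Round")
boxplot(logresults~year, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Year")
boxplot(logresults~month, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Month")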
To simplify the main effects computations for year (which has 29 levels), a new column called decade was created that bins each year into the 1980s, 1990s, 2000s, or 2010s. The main effect for each pair of decades was computed as the difference between the mean log-results of the two decades. It is assumed that, due to randomization, the sample is sufficiently balanced and that this approach is valid.
#compute main effects by DECADE
print(me_decade1 <- mean(subset(brief$logresults, brief$decade == "1990s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] -0.03837641
print(me_decade2 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] -0.003630541
print(me_decade3 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
## [1] 0.03342239
print(me_decade4 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
## [1] 0.03474587
print(me_decade5 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
## [1] 0.0717988
print(me_decade6 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "2000s")))
## [1] 0.03705293
The main effect of round was computed by subtracting the mean of the “Jeopardy!” round results from the mean of the “Double Jeopardy!” round results. Again, it is assumed that the random sample is sufficiently balanced. A negative main effect for round indicates that, as the round advances from “Jeopardy!” to “Double Jeopardy!”, the answers become more obscure (fewer Google Search results).
#compute main effects by ROUND
print(me_round <- mean(subset(brief$logresults, brief$round == "Double Jeopardy!")) - mean(subset(brief$logresults, brief$round == "Jeopardy!")))
## [1] -0.2332999
Only a select portion of the main effects of value were computed due to the many levels of this factor. It is also assumed that the data follow a general decreasing trend with increasing value. Therefore, all the main effects using 2000 as the high point were computed.
#compute main effects by VALUE
print(me_value1 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 200)))
## [1] -1.16066
print(me_value2 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 400)))
## [1] -0.7223347
print(me_value3 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 600)))
## [1] -0.6218618
print(me_value4 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 800)))
## [1] -0.5262369
print(me_value5 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1000)))
## [1] -0.4126626
print(me_value6 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1200)))
## [1] -0.3182218
print(me_value7 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1600)))
## [1] -0.5560816
The main effects of month combinations were not computed because there is no apparent trend in the boxplots.
Interaction plots were generated for all six combinations of the four factors (4C2 = 6).
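The plots were produced with the interaction.plot calls reproduced below from the full script; note that the script uses the decade column in place of year (the figures themselves are omitted here).
#2-factor interaction plots
interaction.plot(brief$decade, brief$value, brief$logresults)
interaction.plot(brief$decade, brief$round, brief$logresults)
interaction.plot(brief$decade, brief$month, brief$logresults)
interaction.plot(brief$round, brief$value, brief$logresults)
interaction.plot(brief$value, brief$month, brief$logresults)
interaction.plot(brief$round, brief$month, brief$logresults)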
To examine the statistical significance of the exploratory analyses above, a series of two-way ANOVAs was computed for all six factor combinations.
#compute the anova for each 2-combination of factors and print the summary
summary(aov(brief$logresults ~ brief$year*brief$month))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 0.994 0.478
## brief$month 10 10.9 1.086 0.834 0.596
## brief$year:brief$month 148 176.9 1.195 0.918 0.721
## Residuals 313 407.6 1.302
summary(aov(brief$logresults ~ brief$year*brief$value))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 1.082 0.357
## brief$value 7 43.6 6.226 5.207 1.16e-05 ***
## brief$year:brief$value 117 137.0 1.171 0.979 0.546
## Residuals 347 414.9 1.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$year*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$year 28 36.2 1.294 1.026 0.4313
## brief$round 1 8.1 8.099 6.418 0.0116 *
## brief$year:brief$round 22 22.0 1.000 0.792 0.7362
## Residuals 448 565.3 1.262
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$month*brief$value))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$month 10 13.4 1.341 1.115 0.349
## brief$value 7 39.8 5.692 4.735 3.94e-05 ***
## brief$month:brief$value 70 83.2 1.188 0.989 0.508
## Residuals 412 495.2 1.202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$month*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$month 10 13.4 1.341 1.070 0.3840
## brief$round 1 5.2 5.151 4.110 0.0432 *
## brief$month:brief$round 10 14.1 1.408 1.123 0.3426
## Residuals 478 599.0 1.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(brief$logresults ~ brief$value*brief$round))
## Df Sum Sq Mean Sq F value Pr(>F)
## brief$value 7 41.3 5.905 4.941 2.05e-05 ***
## brief$round 1 0.2 0.194 0.163 0.6868
## brief$value:brief$round 1 4.5 4.539 3.798 0.0519 .
## Residuals 490 585.6 1.195
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Across all two-factor ANOVAs containing value as a factor, the F-statistic associated with value was statistically significant (p << 0.05), indicating that the differences in the number of Google Search results across question values are unlikely to be explained by randomness alone. The factor round was statistically significant (p < 0.05) in all relevant ANOVAs except the two-factor ANOVA between value and round. In that particular ANOVA, the interaction between value and round had a p-value of 0.052, just above the cutoff for statistical significance but suggestive of a possible interaction between value and round.
The biggest assumption made in this experiment is that the number of Google Search results for a given Jeopardy! answer is inversely related to the obscurity, and therefore the difficulty, of the question. It is easy to come up with counterexamples, such as an obscure question whose answer is an otherwise commonly used noun. It was hoped that the nuisance factors associated with this imperfect measure of “answer obscurity” would be distributed evenly among groups, and would not be significant enough to alter the results of the analysis.
In addition, anyone who has pressed “Ctrl+U” while visiting a Google Search results page will recognize that the scraping function depends on the current structure of Google’s HTML (specifically the “resultStats” element), so the response variable is only as reliable as that page-parsing step.
A factorial design differs from the best-guess and one-factor-at-a-time (OFAT) approaches in that it tests all possible combinations of a given set of factors. By removing subjectivity from the experimental design and analysis, it is more likely that one will uncover the true behavior across the parameter space.
When the factor levels are categorical and the response variable is continuous, one should perform a t-test or an analysis of variance (ANOVA). When performing an ANOVA, one must assign factors, levels, and replicates randomly to remove all systematic biases from the analysis. While the experiment is controlled, there may be nuisance factors that impact the response variable of interest. Variation in the response variable is expected, and the later analysis compares the controlled variations to the uncontrolled (random) variations.
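As a minimal illustration (a sketch assuming the brief data frame prepared above), a two-level factor such as round can be tested equivalently with a pooled-variance t-test or a one-way ANOVA, since the ANOVA F-statistic is the square of the t-statistic in that case:
# one-way ANOVA on the two-level factor "round"
summary(aov(logresults ~ round, data = brief))
# equivalent pooled-variance t-test; the squared t-statistic equals the ANOVA F-statistic above
t.test(logresults ~ round, data = brief, var.equal = TRUE)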
The main effect looks at the averages across levels of a given factor. In a simple 2-factor, 2-level design (shown in section 5.1 of Montgomery), the main effect is “the difference between the average response at the low level of A and the average response at the high level of A.”5 When a factor has many levels, a main effects plot or side-by-side boxplots are often the most informative. In an ideal world, one assigns equal numbers of trials to each group, leading to a balanced design. However, in an ex post facto analysis such as the Jeopardy! analysis performed above, we did not have the luxury of an organized dataset to work from. There were various quirks in the dataset (e.g. “Daily Double” scores) that had to be removed prior to sampling. While it is not feasible to obtain a perfectly balanced sample from the dataset, one can take a sample large enough that the influence of any missing levels balances out due to randomization.
When the source of a nuisance factor is known, one may apply blocking to remove the variation associated with that nuisance factor from the experiment. In this analysis, blocking was not performed, but a recommendation was given for future work to examine blocking by Jeopardy! round. However, this would have eliminated one of the four required factors for a dataset that does not contain much information about each question to begin with. If feasible, it would have been helpful if the dataset had included information such as “Daily Double (Y/N)” or whether the question was answered correctly on the first, second, or third attempt (or not at all).
Statistical significance is assessed by performing the ANOVA. It is assumed that there will be variation in the response variable, and the ANOVA seeks to attribute that variation to the factors, the interactions between factors, or pure randomness. The F-statistic is used, which is the ratio of the mean squares between groups to the mean squares within groups. A large F-statistic indicates a higher chance that something other than randomness determined the variation in the response variable. Knowing the degrees of freedom, one can look up the F-distribution to obtain the critical F-value for a given confidence level, which determines whether the result is statistically significant.
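For example, using the degrees of freedom from the value and round ANOVA above (7 and 490), the critical F-value at the 95% confidence level and the p-value for the observed F-statistic can be obtained directly from the F-distribution in R (a brief sketch):
# critical F-value at the 95% confidence level for 7 and 490 degrees of freedom
qf(0.95, df1 = 7, df2 = 490)
# p-value for the observed F-statistic of the "value" factor (F = 4.941)
pf(4.941, df1 = 7, df2 = 490, lower.tail = FALSE)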
The JSON file (52.9 MB) of the raw data1 can be downloaded from the link provided at https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/.
# Analysis of the Obscurity of Jeopardy Answers by Google Search Results
# M. Deagen
# ISYE 6020
# 2016/10/08
# the following libraries must be installed and loaded
require(jsonlite)
require(lubridate)
require(XML)
require(RCurl)
require(stringr)
# Download dataset from https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
# ----> EDIT the "file_location" string with the directory of the file on your computer
file_location <- "C:/Users/Deagen/Desktop/JEOPARDY_QUESTIONS1.json"
# import JSON file into a data frame for R (the local file directory must be assigned to the "file_location" string)
jeopardy_data <- fromJSON(file_location)
# view the first 6 rows of the data frame
head(jeopardy_data)
# view the last 6 rows of the data frame
tail(jeopardy_data)
#view the structure of the data frame
str(jeopardy_data)
# Determining the number of Google Search results for a string
# function credit to Drew Conway at https://gist.github.com/drewconway/791559
google.counts<-function(s){
search.url<-paste("http://www.google.com/search?q=",gsub(" ","+",s),sep="")
search.html<-getURL(search.url)
parse.search<-htmlTreeParse(search.html,useInternalNodes = TRUE)
search.nodes<-getNodeSet(parse.search,"//div[@id='resultStats']")
search.value<-strsplit(xmlValue(search.nodes[[1]])," ",fixed=TRUE)[[1]][2]
return(as.numeric(gsub(",","",search.value,fixed=TRUE)))
}
# We do not care about the Question, so delete this column
jeopardy_data$question <- NULL
# We also do not care about the Category Name (there are too many different categories for this to be useful as a factor)
jeopardy_data$category <- NULL
# remove Final Jeopardy! and Tiebreaker rows (as these are not associated with a pre-selected "value" for the question)
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Final Jeopardy!"),]
jeopardy_data <- jeopardy_data[!(jeopardy_data$round=="Tiebreaker"),]
# remove dollar signs and commas from "value" strings (because they are inconsistently used in dataset)
jeopardy_data[,"value"] <- gsub('\\$|,', '', jeopardy_data[,"value"])
# cast "value" as numerics
jeopardy_data$value <- as.numeric(jeopardy_data$value)
#cast "air_date" as datetimes using lubridate
jeopardy_data$air_date <- ymd(jeopardy_data$air_date)
# NOTE: on Nov. 26, 2001 (ep. 3966), the prize values were doubled for each round
# we will retroactively double the "value" for episodes aired PRIOR to the transition to compare apples to apples
transition_date <- ymd("2001-11-26")
# remove Daily Doubles (where value is not as listed on board e.g. $3562)
# this is obviously an approx. since contestant may have picked an 'acceptable value' for their DD bet
acceptable_values0 <- c("100", "200", "300", "400", "500") #pre-transition Jeopardy! values
acceptable_values1 <- c("200", "400", "600", "800", "1000") #pre-transition Double Jeopardy! values, post-transition Jeopardy! values
acceptable_values2 <- c("400", "800", "1200", "1600", "2000") #post-transition Double Jeopardy! values
jeopardy_data <- jeopardy_data[(jeopardy_data$air_date < transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values0) | (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Jeopardy!" & jeopardy_data$value %in% acceptable_values1) | (jeopardy_data$air_date < transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values1) | (jeopardy_data$air_date >= transition_date & jeopardy_data$round=="Double Jeopardy!" & jeopardy_data$value %in% acceptable_values2),]
# NOTE ON CHOOSING SAMPLE SIZE: Google-search function implemented in this code takes 500~600ms PER ITERATION
# rule of thumb: will take ~1 min for every 100 items in the sample, so choose sample size wisely
# sample size must be less than the number of rows in the dataset (which is 200,000... so do not worry)
sample_size <- 500
# set seed for reproducibility of random sample for writing report
#(CHANGE OR REMOVE THE LINE BELOW TO ASSESS OVERALL REPEATABILITY OF RESULTS)
set.seed(1)
# select sample from jeopardy data
#randomly sample rows from the dataset (without replacement is the default)
brief <- jeopardy_data[sample(nrow(jeopardy_data), sample_size),]
# format the "answer" column to remove backslashes (strings in the dataset using apostrophes had escape-backslashes)
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\\\", "")
# format the "answer" to remove the string "or ", which has special meaning in Google searches
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\\(or ", " ")
# format the "answer" to remove the character ", which is used non-uniformly in dataset and vastly impacts Google Search results
brief[,"answer"] <- str_replace_all(brief[,"answer"], "\"", "")
#define separate columns for month and year of air_date
brief$month <- month(brief$air_date, label=TRUE)
brief$year <- year(brief$air_date)
#subdivide Year into a new column called Decade (to simplify main effect analysis later)
brief$year <- as.numeric(as.character(brief$year))
brief$decade <- cut(brief$year, breaks = c(1980,1990,2000,2010,2020), labels = c("1980s", "1990s", "2000s", "2010s"))
# initialize a new column for Google Search results to be placed into
brief$results <- 0
# apply the for loop to the entire sample
for(i in 1:nrow(brief)) {
# run the Google search function on all "answers", put in "results"
brief[i, "results"] <- google.counts(brief[i,"answer"])
# retroactively double the value if the episode was aired before the transition date in 2001
if (brief[i, "air_date"] < transition_date) {
brief[i, "value"] <- brief[i, "value"] * 2
}
}
#Create a histogram of Google Search Results ('brief$results') on a LOG SCALE
par(mfrow=c(1,1))
hist(log10(brief$results), main = "Google Search Results for Jeopardy Answers", xlab = "log10(results)")
#take the common log of the "results" column into a new column "logresults"
brief$logresults <- log10(brief$results)
#define columns of interest as factors
brief$round <- as.factor(brief$round)
brief$value <- as.factor(brief$value)
brief$year <- as.factor(brief$year)
brief$month <- as.factor(brief$month)
# re-arrange Round factor levels to "Jeopardy!" before "Double Jeopardy!" (otherwise is alphabetical)
brief$round = factor(brief$round, c("Jeopardy!", "Double Jeopardy!"))
#view levels of each factor
levels(brief$round)
levels(brief$value)
levels(brief$year)
levels(brief$month)
# view boxplots of results for each factor
boxplot(logresults~value, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Value")
boxplot(logresults~round, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Round")
boxplot(logresults~year, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Year")
boxplot(logresults~month, data=brief, vertical=TRUE, las=2, ylab="log(results)", main="Google Search results as a function of Month")
#compute main effects by DECADE
print(me_decade1 <- mean(subset(brief$logresults, brief$decade == "1990s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade2 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade3 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1980s")))
print(me_decade4 <- mean(subset(brief$logresults, brief$decade == "2000s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
print(me_decade5 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "1990s")))
print(me_decade6 <- mean(subset(brief$logresults, brief$decade == "2010s")) - mean(subset(brief$logresults, brief$decade == "2000s")))
boxplot(logresults~decade, data=brief, vertical=TRUE, las=1, ylab="log(results)", main="Google Search results as a function of Decade")
#compute main effects by ROUND
print(me_round <- mean(subset(brief$logresults, brief$round == "Double Jeopardy!")) - mean(subset(brief$logresults, brief$round == "Jeopardy!")))
#compute main effects by VALUE
print(me_value1 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 200)))
print(me_value2 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 400)))
print(me_value3 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 600)))
print(me_value4 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 800)))
print(me_value5 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1000)))
print(me_value6 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1200)))
print(me_value7 <- mean(subset(brief$logresults, brief$value == 2000)) - mean(subset(brief$logresults, brief$value == 1600)))
#2-factor interaction plots
par(mfrow=c(1,1))
interaction.plot(brief$decade, brief$value, brief$logresults)
interaction.plot(brief$decade, brief$round, brief$logresults)
interaction.plot(brief$decade, brief$month, brief$logresults)
interaction.plot(brief$round, brief$value, brief$logresults)
interaction.plot(brief$value, brief$month, brief$logresults)
interaction.plot(brief$round, brief$month, brief$logresults)
#compute the anova for each 2-combination of factors and print the summary
summary(aov(brief$logresults ~ brief$year*brief$month))
summary(aov(brief$logresults ~ brief$year*brief$value))
summary(aov(brief$logresults ~ brief$year*brief$round))
summary(aov(brief$logresults ~ brief$month*brief$value))
summary(aov(brief$logresults ~ brief$month*brief$round))
summary(aov(brief$logresults ~ brief$value*brief$round))