Jeopardy!

Are longer answers worth more?

Er… Is that longer questions?

TL;DR - A log/log transform finds a trend, but do three characters make a difference?

Earlier today Yhat, Inc. tweeted about a data set of over 200,000 Jeopardy! questions scraped from www.j-archive.com. JSON and CSV files were linked from Reddit: http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

Looked pretty cool for sure. A quick question (er… answer?) popped into my head: given the data available in this file (see below), is there a relationship between the length of a player's answer (the Jeopardy "Question") and the value that answer is worth? Simply put, do bigger values require bigger words?

This is a pretty odd-ball question and I am sure there are plenty of confounding issues, but it gave me a fun excuse to do a little data cleaning practice via dplyr and could lead to something interesting.

DISCLAIMER: This was purely for fun and has no guarantee. You break it you buy it…

Let's get started:

rm(list=ls(all=TRUE))
sessionInfo()
## R version 3.0.3 (2014-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.6
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0    stringr_0.6.2  tools_3.0.3

Packages you will need:

require("dplyr")
require("ggplot2")
# install.packages('ggthemes', dependencies = TRUE)
library("ggthemes")

I downloaded the CSV from the Reddit page linked above (~34 MB).

jData <- read.csv("C:/TEMP/jData/JEOPARDY_CSV.csv", stringsAsFactors=FALSE)
str(jData)
## 'data.frame':    216930 obs. of  7 variables:
##  $ Show.Number: int  4680 4680 4680 4680 4680 4680 4680 4680 4680 4680 ...
##  $ Air.Date   : chr  "2004-12-31" "2004-12-31" "2004-12-31" "2004-12-31" ...
##  $ Round      : chr  "Jeopardy!" "Jeopardy!" "Jeopardy!" "Jeopardy!" ...
##  $ Category   : chr  "HISTORY" "ESPN's TOP 10 ALL-TIME ATHLETES" "EVERYBODY TALKS ABOUT IT..." "THE COMPANY LINE" ...
##  $ Value      : chr  "$200" "$200" "$200" "$200" ...
##  $ Question   : chr  "For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory" "No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves" "The city of Yuma in this state has a record average of 4,055 hours of sunshine each year" "In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger" ...
##  $ Answer     : chr  "Copernicus" "Jim Thorpe" "Arizona" "McDonald's" ...
unique(jData$Value)
##   [1] "$200"    "$400"    "$600"    "$800"    "$2,000"  "$1000"   "$1200"  
##   [8] "$1600"   "$2000"   "$3,200"  "None"    "$5,000"  "$100"    "$300"   
##  [15] "$500"    "$1,000"  "$1,500"  "$1,200"  "$4,800"  "$1,800"  "$1,100" 
##  [22] "$2,200"  "$3,400"  "$3,000"  "$4,000"  "$1,600"  "$6,800"  "$1,900" 
##  [29] "$3,100"  "$700"    "$1,400"  "$2,800"  "$8,000"  "$6,000"  "$2,400" 
##  [36] "$12,000" "$3,800"  "$2,500"  "$6,200"  "$10,000" "$7,000"  "$1,492" 
##  [43] "$7,400"  "$1,300"  "$7,200"  "$2,600"  "$3,300"  "$5,400"  "$4,500" 
##  [50] "$2,100"  "$900"    "$3,600"  "$2,127"  "$367"    "$4,400"  "$3,500" 
##  [57] "$2,900"  "$3,900"  "$4,100"  "$4,600"  "$10,800" "$2,300"  "$5,600" 
##  [64] "$1,111"  "$8,200"  "$5,800"  "$750"    "$7,500"  "$1,700"  "$9,000" 
##  [71] "$6,100"  "$1,020"  "$4,700"  "$2,021"  "$5,200"  "$3,389"  "$4,200" 
##  [78] "$5"      "$2,001"  "$1,263"  "$4,637"  "$3,201"  "$6,600"  "$3,700" 
##  [85] "$2,990"  "$5,500"  "$14,000" "$2,700"  "$6,400"  "$350"    "$8,600" 
##  [92] "$6,300"  "$250"    "$3,989"  "$8,917"  "$9,500"  "$1,246"  "$6,435" 
##  [99] "$8,800"  "$2,222"  "$2,746"  "$10,400" "$7,600"  "$6,700"  "$5,100" 
## [106] "$13,200" "$4,300"  "$1,407"  "$12,400" "$5,401"  "$7,800"  "$1,183" 
## [113] "$1,203"  "$13,000" "$11,600" "$14,200" "$1,809"  "$8,400"  "$8,700" 
## [120] "$11,000" "$5,201"  "$1,801"  "$3,499"  "$5,700"  "$601"    "$4,008" 
## [127] "$50"     "$2,344"  "$2,811"  "$18,000" "$1,777"  "$3,599"  "$9,800" 
## [134] "$796"    "$3,150"  "$20"     "$1,810"  "$22"     "$9,200"  "$1,512" 
## [141] "$8,500"  "$585"    "$1,534"  "$13,800" "$5,001"  "$4,238"  "$16,400"
## [148] "$1,347"  "$2547"   "$11,200"

Now for the data munging… (I tried chaining with %>%, but the step-by-step version below was more readable for now; a chained equivalent is sketched after the output.)

Issues:
1) There are lots of values here for questions outside of the common Jeopardy! and Double Jeopardy! rounds; they are removed below.
2) This database starts as early as 1984, but values over $1,000 were not added until Super Jeopardy! in 1990 and the current version of Jeopardy! in 2001. We'll have to live with these (though a date filter, sketched just below, could sidestep it).
3) There are many more observations for values under $1,000 than over, but that is just part of the fun…
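Not done here, but to sidestep issue 2 you could restrict the data to shows aired under the modern value scheme. A minimal sketch, with the cutoff date as a rough assumption rather than the exact switchover:

# optional: keep only shows aired after the value doubling (rough, assumed cutoff)
jDataModern <- filter(jData, as.Date(Air.Date) >= as.Date("2002-01-01"))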

So, how to clean? (following the sequence below)
1) Final Jeopardy! questions have a value of "None" and are not of interest.
2) I am only interested in the Value and Answer columns.
3) Mutate to get the length of each answer and convert Value to numeric.
4) Filter out values over $2,000, as they are not part of typical Jeopardy! play.
5) Filter out values where Value/10 is odd (keeping only even multiples of $10), as these odd amounts are not part of typical Jeopardy! play.
6) Filter out two values ($20 and $1,020) that have only single examples (and frankly mess with the analysis).

x <- filter(jData, Value != "None")                                   # drop "None" values (Final Jeopardy!, etc.)
x <- filter(x, Round == "Double Jeopardy!" | Round == "Jeopardy!")    # keep the two main rounds
x <- select(x, Value, Answer)                                         # only the columns we need
x <- mutate(x, Alength = nchar(Answer))                               # answer length in characters
x <- mutate(x, Value = sub("$", "", Value, fixed = TRUE))             # strip the dollar sign
x <- mutate(x, Value = as.numeric(sub(",", "", Value, fixed = TRUE))) # strip the comma and make numeric
x <- filter(x, Value <= 2000)                                         # typical board values only
x <- filter(x, (Value / 10) %% 2 == 0)                                # keep even multiples of $10
x <- filter(x, Value != 20)                                           # single-example oddball
x <- filter(x, Value != 1020)                                         # single-example oddball
str(x)
## 'data.frame':    210408 obs. of  3 variables:
##  $ Value  : num  200 200 200 200 200 200 400 400 400 400 ...
##  $ Answer : chr  "Copernicus" "Jim Thorpe" "Arizona" "McDonald's" ...
##  $ Alength: int  10 10 7 10 10 7 14 14 10 14 ...
# nobs per value group
data.frame(table(x$Value))
##    Var1  Freq
## 1   100  9029
## 2   200 30455
## 3   300  8663
## 4   400 42244
## 5   500  9016
## 6   600 20377
## 7   700   203
## 8   800 31860
## 9   900   114
## 10 1000 21640
## 11 1100    63
## 12 1200 11772
## 13 1300    75
## 14 1400   228
## 15 1500   546
## 16 1600 11040
## 17 1700    44
## 18 1800   182
## 19 1900    28
## 20 2000 12829
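As promised above, here is the same cleaning as a single %>% chain, a sketch that should be equivalent to the step-by-step version (gsub("[$,]", "", Value) is just shorthand for the two sub calls):

# equivalent cleaning as one dplyr chain (sketch)
xChained <- jData %>%
  filter(Value != "None",
         Round %in% c("Jeopardy!", "Double Jeopardy!")) %>%
  select(Value, Answer) %>%
  mutate(Alength = nchar(Answer),
         Value   = as.numeric(gsub("[$,]", "", Value))) %>%
  filter(Value <= 2000,
         (Value / 10) %% 2 == 0,
         !Value %in% c(20, 1020))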

Take that data frame and summarise the mean answer length by value group.

x1 <- summarise(group_by(x, Value), mean=mean(Alength))

Can you predict the length of a player's answer from the value of the question? Well, a relationship exists and the explanatory variable "Value" has a t-value of 1.87 (p = 0.0781), but clearly the residuals are large and the fit is fairly flat. The trend seems to be there, but it is subtle. Perhaps a little transformation will help…

f <- lm(mean ~ Value, data = x1)

Plot this thing…

ggplot(x1, aes(Value,mean)) + 
  geom_point(colour = "black", size = 3, shape = 19) + 
    stat_smooth(method="lm", se=TRUE) +
    theme_economist() +
    ggtitle("Jeopardy! Mean Answer Length Over Question Value") +
    xlab("Question Value (dollars) ") +
    ylab("Mean Answer Length (characters) ")

[Plot: "Jeopardy! Mean Answer Length Over Question Value", points with a linear fit and confidence band]

Fit summary…

summary(f)
## 
## Call:
## lm(formula = mean ~ Value, data = x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.270 -0.812 -0.461  0.395  3.304 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.14e+01   6.23e-01   18.38  4.1e-13 ***
## Value       9.71e-04   5.20e-04    1.87    0.078 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.34 on 18 degrees of freedom
## Multiple R-squared:  0.162,  Adjusted R-squared:  0.116 
## F-statistic: 3.49 on 1 and 18 DF,  p-value: 0.0781
print(x1)
## Source: local data frame [20 x 2]
## 
##    Value  mean
## 1    100 10.57
## 2    200 10.84
## 3    300 11.38
## 4    400 11.17
## 5    500 12.28
## 6    600 11.47
## 7    700 13.98
## 8    800 11.46
## 9    900 14.99
## 10  1000 12.14
## 11  1100 11.75
## 12  1200 11.59
## 13  1300 16.01
## 14  1400 14.13
## 15  1500 13.43
## 16  1600 11.73
## 17  1700 12.84
## 18  1800 12.36
## 19  1900 13.07
## 20  2000 12.15

Log/Log a go-go

Transforming both x and y to the log scale gives a much better representation of the trend.

A better fit comes from transforming both axes by the natural log: log(Value) now has a t-value of 2.736 (p = 0.0136). Not setting the world on fire, but fun nonetheless; more has been built on less…

flxy <- lm(log(mean) ~ log(Value), data = x1)
summary(flxy)
## 
## Call:
## lm(formula = log(mean) ~ log(Value), data = x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1029 -0.0633 -0.0235  0.0305  0.2234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.0252     0.1811   11.18  1.6e-09 ***
## log(Value)    0.0732     0.0268    2.74    0.014 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0948 on 18 degrees of freedom
## Multiple R-squared:  0.294,  Adjusted R-squared:  0.254 
## F-statistic: 7.49 on 1 and 18 DF,  p-value: 0.0136

Well, there you have it; a small positive relationship exists between the log value of an answer and the log number of characters (length) of that answer (er… question). The grand mean of the per-value mean answer lengths (mean(x1$mean)) is about 12.5 characters, the lowest mean is 10.574 characters for questions worth $100, and the highest is a mean length of 16.013 characters for questions valued at $1,300 (only 75 observations…).
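A quick sanity check on those numbers (a sketch; which.min and which.max just pull out the rows quoted above):

mean(x1$mean)             # grand mean of the per-value mean lengths (~12.5 characters)
x1[which.min(x1$mean), ]  # the $100 group, shortest mean answer length
x1[which.max(x1$mean), ]  # the $1,300 group, longest mean answer length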

PLOT:

If you can see it, then it is real!!!!

plot(log(x1$mean) ~ log(x1$Value))
lines(log(x1$Value), predict(flxy), col = "red")

[Plot: log(mean answer length) vs. log(question value), with the fitted line in red]

Predicted values in original units (dollars and length in characters):

e <- 2.718281828                       # base of the natural log (exp() would do the same job)
data.frame(x1$Value, e^predict(flxy))  # back-transform the predictions from the log scale
##    x1.Value e.predict.flxy.
## 1       100           10.62
## 2       200           11.17
## 3       300           11.50
## 4       400           11.75
## 5       500           11.94
## 6       600           12.10
## 7       700           12.24
## 8       800           12.36
## 9       900           12.47
## 10     1000           12.56
## 11     1100           12.65
## 12     1200           12.73
## 13     1300           12.81
## 14     1400           12.88
## 15     1500           12.94
## 16     1600           13.00
## 17     1700           13.06
## 18     1800           13.12
## 19     1900           13.17
## 20     2000           13.22

Conclusion:

If I am on Jeopardy! and I get the first buzz on a question worth $2,000, and I have two possible answers in my head of length 10 and 13 characters… I'll pick the 13-character one. If I get to be in that situation a few thousand times, I should come out on top!
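For what it's worth, that 13-character hunch lines up with the model. A quick sketch of the back-transformed prediction for a $2,000 clue (it matches the 13.22 in the table above):

exp(predict(flxy, newdata = data.frame(Value = 2000)))  # roughly 13.2 characters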