Jeopardy Database Mind your P's and Q's

Earlier this week Yhat, Inc. tweeted about a data set of over 200,000 Jeopardy! questions scraped from www.j-archive.com. JSON and CSV files were linked from Reddit: http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

I did a quick analysis to see if there was a definable trend between the length of the answer and the value of the question. Hint: there is, if you look at it the right (wrong?) way. It is here: http://rpubs.com/mharris/jeopardy

Below is a quick run-through of the general descriptive attributes of the Jeopardy! data.

Begin…

sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.6
## 
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0    stringr_0.6.2  tools_3.1.0

requirements…

require("dplyr")
## Warning: package 'dplyr' was built under R version 3.1.2
require("ggplot2")
# install.packages('ggthemes', dependencies = TRUE)
require("ggthemes")

dimensions of the data

jData <- read.csv("C:/TEMP/jData/JEOPARDY_CSV.csv", stringsAsFactors=FALSE)
str(jData)
## 'data.frame':    216930 obs. of  7 variables:
##  $ Show.Number: int  4680 4680 4680 4680 4680 4680 4680 4680 4680 4680 ...
##  $ Air.Date   : chr  "2004-12-31" "2004-12-31" "2004-12-31" "2004-12-31" ...
##  $ Round      : chr  "Jeopardy!" "Jeopardy!" "Jeopardy!" "Jeopardy!" ...
##  $ Category   : chr  "HISTORY" "ESPN's TOP 10 ALL-TIME ATHLETES" "EVERYBODY TALKS ABOUT IT..." "THE COMPANY LINE" ...
##  $ Value      : chr  "$200" "$200" "$200" "$200" ...
##  $ Question   : chr  "For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory" "No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves" "The city of Yuma in this state has a record average of 4,055 hours of sunshine each year" "In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger" ...
##  $ Answer     : chr  "Copernicus" "Jim Thorpe" "Arizona" "McDonald's" ...
dim(jData)
## [1] 216930      7

Ok, so 216,930 rows of data, each with seven attributes: Show.Number, Air.Date, Round, Category, Value, Question, and Answer.

The data come from a total of 3,640 shows, with air dates spanning 1984-09-10 to 2012-01-27.
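The post does not show how the show count and date range were obtained. A minimal sketch, assuming the count is the number of distinct Show.Number values and that Air.Date holds ISO "YYYY-MM-DD" strings; the `toy` data frame is a hypothetical stand-in for jData:

```r
# Toy stand-in for jData (hypothetical rows); swap in jData for the real figures.
toy <- data.frame(
  Show.Number = c(4680, 4680, 5957, 3751),
  Air.Date    = c("2004-12-31", "2004-12-31", "2010-07-06", "2000-12-18"),
  stringsAsFactors = FALSE
)

# Number of distinct shows
n_shows <- length(unique(toy$Show.Number))

# Earliest and latest air dates; as.Date() parses the ISO date strings
date_range <- range(as.Date(toy$Air.Date))

n_shows
date_range
```

Run against the full jData, the same two expressions should reproduce the figures quoted above, provided the assumptions hold.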

Categories

20 most frequently used categories

catFreq <- arrange(data.frame(table(jData$Category)), desc(Freq))
head(catFreq,20)
##                       Var1 Freq
## 1           BEFORE & AFTER  547
## 2                  SCIENCE  519
## 3               LITERATURE  496
## 4         AMERICAN HISTORY  418
## 5                POTPOURRI  401
## 6            WORLD HISTORY  377
## 7             WORD ORIGINS  371
## 8  COLLEGES & UNIVERSITIES  351
## 9                  HISTORY  349
## 10                  SPORTS  342
## 11             U.S. CITIES  339
## 12         WORLD GEOGRAPHY  338
## 13         BODIES OF WATER  327
## 14                 ANIMALS  324
## 15          STATE CAPITALS  314
## 16     BUSINESS & INDUSTRY  311
## 17                 ISLANDS  301
## 18          WORLD CAPITALS  300
## 19          U.S. GEOGRAPHY  299
## 20                RELIGION  297

20 most lengthy category names

catLength <- unique(arrange(data.frame(Category = jData$Category, nChar = nchar(jData$Category)), desc(nChar)))
head(catLength,20)
##                                             Category nChar
## 1  TWINKLE TWINKLE LITTLE WORD THAT RHYMES WITH STAR    49
## 6   THE NATIONAL ASSOCIATION OF CHILDREN'S HOSPITALS    48
## 10    THE NEW YORK FIRE DEPARTMENT THROUGH THE YEARS    46
## 15    AMERICAN HERITAGE DICTIONARY PREFERRED PLURALS    46
## 25    THE NEW YORK TIMES BEST 1,000 MOVIES EVER MADE    46
## 30    EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT SEX    46
## 35    PULITZER-WINNING BIOGRAPHIES & AUTOBIOGRAPHIES    46
## 40     LAST PAGE OF THE AMERICAN HERITAGE DICTIONARY    45
## 45      INSIDE THE ARTIST'S STUDIO WITH JAMES LIPTON    44
## 48      NATIONAL GEOGRAPHIC CHANNEL GREAT MIGRATIONS    44
## 53      TREASURES OF THE NEW-YORK HISTORICAL SOCIETY    44
## 58      INGREDIENTS IN THE MACBETH WITCHES' CAULDRON    44
## 63       THE NEW YORK TIMES 2009 FICTION BESTSELLERS    43
## 68       ROLLING STONE'S 50 BEST SONGS OF THE DECADE    43
## 73       COUNTRY FEMALE VOCALIST OF THE YEAR GRAMMYS    43
## 77       MARY, QUEEN OF SCOTS, PLAYMATE OF THE MONTH    43
## 82        THE THIRD-MOST POPULAR PRESIDENTIAL CHOICE    42
## 84        THE WORLD ALMANAC'S WIDELY KNOWN AMERICANS    42
## 89        HOME COUNTRY OF THE U.N. SECRETARY-GENERAL    42
## 94        THERE'S NO BUSINESS LIKE BUSINESS BUSINESS    42

…20 shortest category names

catLengthA <- unique(arrange(data.frame(Category = jData$Category, nChar = nchar(jData$Category)), nChar))
head(catLengthA,20)
##    Category nChar
## 1         2     1
## 6         4     1
## 11        X     1
## 20        7     1
## 25        Q     1
## 28        3     1
## 33        6     1
## 38        M     1
## 43       47     2
## 48       25     2
## 53       TV     2
## 58       17     2
## 63       22     2
## 68       U2     2
## 73       E3     2
## 78       GM     2
## 83       13     2
## 88       26     2
## 90       GO     2
## 95       AA     2

Very heavily skewed distribution of category use frequency

hist(catFreq$Freq, breaks = 100)

Much nicer distribution of the length of category names

hist(catLength$nChar, breaks=20)

prep some data for ggplot2

catPlot <- jData[jData$Category %in% head(catFreq$Var1,20), "Category"]
catTbl <- data.frame(table(catPlot))
colnames(catTbl) <- c("Category", "Frequency")
catTbl <- arrange(catTbl, desc(Frequency))

Frequency of appearance for 20 most common category names

ggplot(catTbl, aes(x = reorder(Category, -Frequency), y = Frequency)) + 
  geom_bar(stat="identity") +
  xlab("Category") +
  theme_wsj() +
  theme(axis.text.x = element_text(angle = 45)) +
  ggtitle("I'll take... Frequency of Jeopardy! Categories (top 20)")

Who is, what is, where is… Answers

Well, as you know, in Jeopardy! the clues are phrased as answers and the contestants respond in the form of a question. Here are descriptives of the correct responses (the Answer column), which the players gave posed as questions…

20 most common correct answers

answFreq <- arrange(data.frame(table(jData$Answer)), desc(Freq))
head(answFreq,20)
##            Var1 Freq
## 1         China  216
## 2     Australia  215
## 3         Japan  196
## 4       Chicago  194
## 5        France  193
## 6         India  185
## 7    California  180
## 8        Canada  176
## 9         Spain  171
## 10       Mexico  164
## 11       Alaska  161
## 12        Italy  160
## 13       Hawaii  157
## 14        Texas  153
## 15        Paris  149
## 16      Germany  147
## 17       Russia  141
## 18      Florida  140
## 19 South Africa  139
## 20      Ireland  136

the 20 longest answers… but many are actually answers that have multiple possibilities; see #5 for the longest single answer

answLength <- unique(arrange(data.frame(Answer = jData$Answer, nChar = nchar(jData$Answer)), desc(nChar)))
head(answLength,20)
##                                                                                                                                    Answer
## 1  (2 of) the Montreal Expos, the Seattle Mariners, the California Angels, the Texas Rangers, the Toronto Blue Jays, & the Houston Astros
## 2         (1 of) Beth Henley (for Crimes of the Heart), Marsha Norman (for 'Night, Mother) & Wendy Wasserstein (for The Heidi Chronicles)
## 3                   the Atlanta-Fulton County Stadium (or the Kingdome, Milwaukee County Stadium, or the Oakland-Alameda County Coliseum)
## 4                        Cardinals (St. Louis [baseball] & Phoenix [football]) or Giants (San Francisco [baseball] & New York [football])
## 5                                A Midsummer Night's Sex Comedy & Everything You Always Wanted to Know About Sex (But Were Afraid to Ask)
## 6                                     Rain, rain, go away/Come again another day (Rain, rain, go away/Come again some other day accepted)
## 7                                      (1 of) Ameritech, Bell Atlantic, Bell South, NYNEX, Southwestern Bell, Pacific Telesis, or US West
## 8                                      Declaration of Independence, Articles of Confederation, Articles of Association & The Constitution
## 9                                       escaping the Earth's gravity (and go off into outer space, on your way to the moon, for instance)
## 10                                       (2 of) the Seattle Seahawks, the Jacksonville Jaguars, the Buffalo Bills, & the Tennessee Titans
## 11                                         Dag Hammarskjold (the secretary-general of the UN who was killed in the plane crash in Africa)
## 12                                         Son (Jefferson, Madison, Jackson, W.H. Harrison, A. Johnson, B. Harrison, Wilson & L. Johnson)
## 13                                          Speaker of the House of Orange Julius (Speaker of the House of Orange Julius Caesar accepted)
## 14                                             Massachusetts Institute of Technology & California Institute of Technology (MIT & Caltech)
## 15                                               Aquarius (W.H. Harrison, Lincoln, McKinley & FDR were the presidents who died in office)
## 16                                                 (1 of) Ann Richards (of Texas), Barbara Roberts (of Oregon) or Joan Finney (of Kansas)
## 17                                                 a microphone & the masks of comedy & tragedy (a TV set, a movie camera & a phonograph)
## 18                                                   Chile (which goes down around Tierra Del Fuego & sweeps a little south of Argentina)
## 19                                                     hiding your light under a bushel (keeping your light underneath a bushel accepted)
## 20                                                     (2 of 5) Marlon Brando, Gary Cooper, Dustin Hoffman, Fredric March & Spencer Tracy
##    nChar
## 1    134
## 2    127
## 3    117
## 4    112
## 5    104
## 6     99
## 7     98
## 8     98
## 9     97
## 10    96
## 11    94
## 12    94
## 13    93
## 14    90
## 15    88
## 16    86
## 17    86
## 18    84
## 19    82
## 20    82
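To separate single answers from the multi-option ones, one could flag the latter with a simple text heuristic. This is only a sketch under my own assumptions, not how the post picked #5: the patterns (a leading "(N of", the word "accepted", or " or ") are guesses based on the listing above, and `answers` is a small hand-copied sample.

```r
# Hypothetical sample: three answers copied from the listing above plus a short one.
answers <- c(
  "(1 of) Ameritech, Bell Atlantic, Bell South, NYNEX, Southwestern Bell, Pacific Telesis, or US West",
  "A Midsummer Night's Sex Comedy & Everything You Always Wanted to Know About Sex (But Were Afraid to Ask)",
  "Rain, rain, go away/Come again another day (Rain, rain, go away/Come again some other day accepted)",
  "Copernicus"
)

# Heuristic (an assumption): an answer lists alternatives if it starts with
# "(N of", mentions "accepted", or contains the word " or ".
multi <- grepl("^\\([0-9]+ of", answers) |
  grepl("accepted", answers, fixed = TRUE) |
  grepl(" or ", answers, fixed = TRUE)

# Longest answer among those not flagged as multi-option
single  <- answers[!multi]
longest <- single[which.max(nchar(single))]
longest
```

A heuristic like this will miss some cases and over-flag others, so the flagged rows are worth eyeballing before trusting any ranking built on them.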

very skewed distribution of answer frequency

hist(answFreq$Freq, breaks = 100)

less skewed distribution of answer length

hist(answLength$nChar, breaks=20)

some ggplot2 data prep

answPlot <- jData[jData$Answer %in% head(answFreq$Var1,20), "Answer"]
answTbl <- data.frame(table(answPlot))
colnames(answTbl) <- c("Answer", "Frequency")
answTbl <- arrange(answTbl, desc(Frequency))

Frequency of 20 most common correct answers on Jeopardy! (all place names!)

ggplot(answTbl, aes(x = reorder(Answer, -Frequency), y = Frequency)) + 
    geom_bar(stat="identity") +
    xlab("Answer") +
    theme_wsj() +
    theme(axis.text.x = element_text(angle = 45)) +
    ggtitle("What is... ? Frequency of Jeopardy! Answers (top 20)")

Rounds

These data contain information on different rounds within the game.

data.frame(table(jData$Round))
##               Var1   Freq
## 1 Double Jeopardy! 105912
## 2  Final Jeopardy!   3631
## 3        Jeopardy! 107384
## 4       Tiebreaker      3
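The raw counts are easier to compare as shares of all questions. A small sketch using the four counts printed above; the vector `round_counts` is typed in by hand here rather than derived from jData:

```r
# Counts copied from the table above
round_counts <- c(
  "Double Jeopardy!" = 105912,
  "Final Jeopardy!"  = 3631,
  "Jeopardy!"        = 107384,
  "Tiebreaker"       = 3
)

# The counts should add up to the number of rows in the data
sum(round_counts)

# Percentage of all questions that fall in each round
round_share <- round(100 * prop.table(round_counts), 2)
round_share
```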

Question/Answer Values

While the other post linked at the top looked at the relationship between answers and values, these are just the basics of the Value attribute.

# Keep only the Value column and drop rows with no dollar value
Values <- select(jData, Value)
Values <- filter(Values, Value != "None")
# Strip the "$" and the thousands separator, then convert to numeric
Values <- mutate(Values, Value = sub("$", "", Value, fixed = TRUE))
Values <- mutate(Values, Value = as.numeric(sub(",", "", Value, fixed = TRUE)))
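The cleaning steps above can be tried on a few sample strings. A minimal base-R sketch, assuming values look like "$2,000" or "None"; `raw_values` is made up for illustration, and a single gsub() with a character class stands in for the two sub() calls:

```r
# Hypothetical sample of raw Value strings
raw_values <- c("$200", "$2,000", "None", "$1,000")

# Drop the entries with no dollar value
kept <- raw_values[raw_values != "None"]

# Remove every "$" and "," in one pass, then convert to numeric
parsed <- as.numeric(gsub("[$,]", "", kept))
parsed
```

Inside a character class the "$" is treated literally, so no escaping or fixed = TRUE is needed.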

highly skewed distribution of question value frequencies…

hist(Values$Value, breaks= 100)

but… general Jeopardy! play only goes up to a value of $2,000 per question in Double Jeopardy!, so let's look at that.

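smallValues <- filter(Values, Value <= 2000) %>%
  # Value/10 is even only for multiples of 20, which keeps every multiple of 100
  filter((Value/10) %% 2 == 0) %>%
  # drop the stray values 20 and 1020 (multiples of 20 but not of 100)
  filter(Value != 20) %>%
  filter(Value != 1020)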
data.frame(table(smallValues))
##    smallValues  Freq
## 1          100  9029
## 2          200 30455
## 3          300  8663
## 4          400 42244
## 5          500  9016
## 6          600 20377
## 7          700   203
## 8          800 31860
## 9          900   114
## 10        1000 21640
## 11        1100    63
## 12        1200 11772
## 13        1300    75
## 14        1400   228
## 15        1500   546
## 16        1600 11040
## 17        1700    44
## 18        1800   182
## 19        1900    28
## 20        2000 12829
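The chain of filters above ends up keeping only multiples of $100. A minimal sketch on a made-up vector (`vals` is hypothetical) suggests that a single modulo test gives the same result; this is an alternative, not what the post ran:

```r
# Hypothetical values, including the odd ones the filters are meant to drop
vals <- c(20, 100, 250, 300, 1020, 2000, 2500)

# The approach used above: multiples of 20, minus the two stray values
chained <- vals[vals <= 2000 & (vals / 10) %% 2 == 0 & vals != 20 & vals != 1020]

# Alternative: keep multiples of 100 directly
direct <- vals[vals <= 2000 & vals %% 100 == 0]

identical(chained, direct)
```

The direct test would also drop any other multiple of 20 that is not a multiple of 100, without having to list each one by hand.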

Frequency of question values up to $2,000 (multiples of $100 only)

ggplot(data.frame(table(smallValues)), aes(x = reorder(smallValues, -Freq), y = Freq)) + 
    geom_bar(stat="identity") +
    xlab("Value") +
    theme_wsj() +
    theme(axis.text.x = element_text(angle = 45)) +
    ggtitle("For... ? Frequency of Jeopardy! Question Values (top 20)")

What's a quince? It's a food, Billy, that starts with the letter Q, and I got seven more!

A classic line from Rosie Perez in White Men Can't Jump (1992).
At the end of the movie, this answer comes in handy during the Double Jeopardy! round.
Has “Quince” ever been used as an answer on Jeopardy!? Why yes, three times as a matter of fact. Interestingly, all three came after that movie was released.

Occurrences of “Quince” as a correct answer.

quince <- arrange(jData, desc(Answer)) %>%
          mutate(temp = ifelse(Answer == "Quince",1,0)) %>%
          filter(temp == 1)
print(quince[ , c(7, 1:5)])
##   Answer Show.Number   Air.Date            Round
## 1 Quince        5694 2009-05-14 Double Jeopardy!
## 2 Quince        2576 1995-11-13        Jeopardy!
## 3 Quince        3008 1997-10-01 Double Jeopardy!
##                               Category Value
## 1    MIND YOUR SHAKESPEARE "P"s & "Q"s $2000
## 2                MIND YOUR "P"s & "Q"s  $500
## 3 FOODS THAT BEGIN WITH THE LETTER "Q"  $800
print(quince[ , c(6)])
## [1] "Fruity surname of Peter in \"A Midsummer Night's Dream\""                                   
## [2] "Used in jellies & compotes, this fruit of the rose family puckers the mouth when tasted raw"
## [3] "This yellow-skinned fruit has been used for centuries in jams & jellies"
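The claim that all three appearances came after the movie can be checked from the air dates printed above. A small sketch, with the dates typed in by hand; the cutoff is an assumption on my part: the last day of 1992, the film's release year, used as a conservative bound rather than the exact release date:

```r
# Air dates of the three "Quince" answers, copied from the output above
quince_dates <- as.Date(c("2009-05-14", "1995-11-13", "1997-10-01"))

# Assumed conservative cutoff: the end of the film's release year
cutoff <- as.Date("1992-12-31")

# TRUE if every appearance falls after the cutoff
all(quince_dates > cutoff)

# Earliest appearance
min(quince_dates)
```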

Rosie's character says she has seven more foods that start with the letter “Q”. There are perhaps more than seven, but below are seven additional candidates (one of them, quinine, is really a medicine).

foodQ <- c("Quesadilla", "Quahog", "Quiche", "Quail", "Queso", "Quinine", "Quinoa")
qFoods <- arrange(jData, desc(Answer)) %>%
          mutate(temp = ifelse(Answer %in% foodQ,1,0)) %>%
          filter(temp == 1)
print(qFoods[ , c(7, 1:5)])
##        Answer Show.Number   Air.Date            Round
## 1     Quinine        3688 2000-09-20 Double Jeopardy!
## 2     Quinine        2576 1995-11-13        Jeopardy!
## 3     Quinine        3367 1999-04-06 Double Jeopardy!
## 4     Quinine        2674 1996-03-28        Jeopardy!
## 5     Quinine        2819 1996-11-28 Double Jeopardy!
## 6      Quiche        3251 1998-10-26        Jeopardy!
## 7      Quiche        2900 1997-03-21        Jeopardy!
## 8      Quiche        3008 1997-10-01 Double Jeopardy!
## 9  Quesadilla        3008 1997-10-01 Double Jeopardy!
## 10      Quail        2847 1997-01-07        Jeopardy!
## 11      Quail        3008 1997-10-01 Double Jeopardy!
## 12     Quahog        3890 2001-06-29 Double Jeopardy!
## 13     Quahog        3529 1999-12-30 Double Jeopardy!
##                                Category Value
## 1                         "NINE" ACROSS  $400
## 2                 MIND YOUR "P"s & "Q"s  $400
## 3                        PLANTS & TREES  $600
## 4                       GENERAL SCIENCE  $300
## 5                          FOOD & DRINK  $400
## 6                            EGGS & HAM  $200
## 7                                ON "Q"  $100
## 8  FOODS THAT BEGIN WITH THE LETTER "Q"  $600
## 9  FOODS THAT BEGIN WITH THE LETTER "Q"  $400
## 10                     SCIENCE & NATURE  $300
## 11 FOODS THAT BEGIN WITH THE LETTER "Q"  $200
## 12                              GEE "Q" $1000
## 13                        ANIMAL A.K.A. $1000
print(qFoods[ , c(6)])
##  [1] "Malaria remedy (7)"                                                                                                     
##  [2] "Today most of this medicine comes from cinchona trees in Java"                                                          
##  [3] "The cinchona tree, the source for this malaria drug, was once plentiful in South America; now it's grown mainly on Java"
##  [4] "This malaria-fighting substance is extracted from the bark of the cinchona tree"                                        
##  [5] "Dubonnet is an aperitif flavored with herbs & a form of this malaria medicine"                                          
##  [6] "The ham & cheese version of this egg dish is prepared in much the same manner as the Lorraine"                          
##  [7] "Made with eggs, cheese & other fillings in a pastry shell, it's what real men don't eat"                                
##  [8] "This savory pie can have bacon bits added to the custard filling"                                                       
##  [9] "It's a tortilla that's been filled, folded & fried"                                                                     
## [10] "North American types of this bird include Montezuma, mountain & bobwhite"                                               
## [11] "This member of the pheasant family can be roasted, broiled or fried"                                                    
## [12] "A thick-shelled edible clam of the north Atlantic coast"                                                                
## [13] "Hard-shell or littleneck clam"

All the “Q” foods.

Qs <- merge(quince, qFoods, all=TRUE) 
Qs <- data.frame(table(Qs$Answer))
colnames(Qs) <- c("Food", "Frequency")
arrange(Qs, desc(Frequency))
##         Food Frequency
## 1    Quinine         5
## 2     Quiche         3
## 3     Quince         3
## 4     Quahog         2
## 5      Quail         2
## 6 Quesadilla         1
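Only six foods show up in the table, so some of the candidates never appeared as an exact answer. A quick sketch to list them, with `found` copied by hand from the table above:

```r
# Candidate "Q" foods searched for
foodQ <- c("Quesadilla", "Quahog", "Quiche", "Quail", "Queso", "Quinine", "Quinoa")

# Foods that actually appear in the table above (Quince was searched separately)
found <- c("Quinine", "Quiche", "Quince", "Quahog", "Quail", "Quesadilla")

# Candidates with no exact match among the answers
never_used <- setdiff(foodQ, found)
never_used
```

Because the match is exact, variants such as a lowercase or plural form of a food would not be counted, so these two may still appear in the data in another spelling.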

Frequency of “Q” foods as correct answers on Jeopardy!

ggplot(Qs, aes(x = reorder(Food, -Frequency), y = Frequency)) + 
    geom_bar(stat="identity") +
    xlab("Answer") +
    theme_wsj() +
    theme(axis.text.x = element_text(angle = 45)) +
    ggtitle("Seven Foods Starting with the Letter Q")
