Earlier this week Yhat, Inc. tweeted about a data set of over 200,000 Jeopardy! questions scraped from www.j-archive.com. JSON and CSV files were linked from Reddit: http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
I did a quick analysis to see whether there was a definable trend between the length of the answer and the value of the question. Hint: there is, if you look at it the right (wrong?) way. It is here: http://rpubs.com/mharris/jeopardy
Below is a quick run-through of the general descriptive attributes of the Jeopardy! data.
Begin…
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.6
##
## loaded via a namespace (and not attached):
## [1] evaluate_0.5.5 formatR_1.0 stringr_0.6.2 tools_3.1.0
requirements…
require("dplyr")
## Warning: package 'dplyr' was built under R version 3.1.2
require("ggplot2")
# install.packages('ggthemes', dependencies = TRUE)
require("ggthemes")
dimensions of the data
jData <- read.csv("C:/TEMP/jData/JEOPARDY_CSV.csv", stringsAsFactors=FALSE)
str(jData)
## 'data.frame': 216930 obs. of 7 variables:
## $ Show.Number: int 4680 4680 4680 4680 4680 4680 4680 4680 4680 4680 ...
## $ Air.Date : chr "2004-12-31" "2004-12-31" "2004-12-31" "2004-12-31" ...
## $ Round : chr "Jeopardy!" "Jeopardy!" "Jeopardy!" "Jeopardy!" ...
## $ Category : chr "HISTORY" "ESPN's TOP 10 ALL-TIME ATHLETES" "EVERYBODY TALKS ABOUT IT..." "THE COMPANY LINE" ...
## $ Value : chr "$200" "$200" "$200" "$200" ...
## $ Question : chr "For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory" "No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves" "The city of Yuma in this state has a record average of 4,055 hours of sunshine each year" "In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger" ...
## $ Answer : chr "Copernicus" "Jim Thorpe" "Arizona" "McDonald's" ...
dim(jData)
## [1] 216930 7
Ok, so there are 216930 rows of data with the attributes Show.Number, Air.Date, Round, Category, Value, Question, and Answer.
The data come from a total of 3640 shows, with air dates spanning 1984-09-10 to 2012-01-27.
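The show count and date range quoted above are not computed in the code shown, so here is a minimal sketch of how such figures could be derived. The data frame `toy` is a made-up stand-in for `jData` (the rows are hypothetical); only the column names Show.Number and Air.Date come from the real file.

```r
# Toy stand-in for jData (hypothetical rows, not the real file)
toy <- data.frame(
  Show.Number = c(4680, 4680, 5957, 3751),
  Air.Date    = c("2004-12-31", "2004-12-31", "2010-07-06", "2000-12-18"),
  stringsAsFactors = FALSE
)

# Number of distinct shows
nShows <- length(unique(toy$Show.Number))

# Earliest and latest air dates; as.Date() parses the "YYYY-MM-DD" strings
dateRange <- range(as.Date(toy$Air.Date))

print(nShows)
print(format(dateRange))
```

Swapping `toy` for `jData` should give the corresponding figures for the full data set.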
20 most frequently used categories
catFreq <- arrange(data.frame(table(jData$Category)), desc(Freq))
head(catFreq,20)
## Var1 Freq
## 1 BEFORE & AFTER 547
## 2 SCIENCE 519
## 3 LITERATURE 496
## 4 AMERICAN HISTORY 418
## 5 POTPOURRI 401
## 6 WORLD HISTORY 377
## 7 WORD ORIGINS 371
## 8 COLLEGES & UNIVERSITIES 351
## 9 HISTORY 349
## 10 SPORTS 342
## 11 U.S. CITIES 339
## 12 WORLD GEOGRAPHY 338
## 13 BODIES OF WATER 327
## 14 ANIMALS 324
## 15 STATE CAPITALS 314
## 16 BUSINESS & INDUSTRY 311
## 17 ISLANDS 301
## 18 WORLD CAPITALS 300
## 19 U.S. GEOGRAPHY 299
## 20 RELIGION 297
20 most lengthy category names
catLength <- unique(arrange(data.frame(Category = jData$Category, nChar = nchar(jData$Category)), desc(nChar)))
head(catLength,20)
## Category nChar
## 1 TWINKLE TWINKLE LITTLE WORD THAT RHYMES WITH STAR 49
## 6 THE NATIONAL ASSOCIATION OF CHILDREN'S HOSPITALS 48
## 10 THE NEW YORK FIRE DEPARTMENT THROUGH THE YEARS 46
## 15 AMERICAN HERITAGE DICTIONARY PREFERRED PLURALS 46
## 25 THE NEW YORK TIMES BEST 1,000 MOVIES EVER MADE 46
## 30 EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT SEX 46
## 35 PULITZER-WINNING BIOGRAPHIES & AUTOBIOGRAPHIES 46
## 40 LAST PAGE OF THE AMERICAN HERITAGE DICTIONARY 45
## 45 INSIDE THE ARTIST'S STUDIO WITH JAMES LIPTON 44
## 48 NATIONAL GEOGRAPHIC CHANNEL GREAT MIGRATIONS 44
## 53 TREASURES OF THE NEW-YORK HISTORICAL SOCIETY 44
## 58 INGREDIENTS IN THE MACBETH WITCHES' CAULDRON 44
## 63 THE NEW YORK TIMES 2009 FICTION BESTSELLERS 43
## 68 ROLLING STONE'S 50 BEST SONGS OF THE DECADE 43
## 73 COUNTRY FEMALE VOCALIST OF THE YEAR GRAMMYS 43
## 77 MARY, QUEEN OF SCOTS, PLAYMATE OF THE MONTH 43
## 82 THE THIRD-MOST POPULAR PRESIDENTIAL CHOICE 42
## 84 THE WORLD ALMANAC'S WIDELY KNOWN AMERICANS 42
## 89 HOME COUNTRY OF THE U.N. SECRETARY-GENERAL 42
## 94 THERE'S NO BUSINESS LIKE BUSINESS BUSINESS 42
…20 shortest category names
catLengthA <- unique(arrange(data.frame(Category = jData$Category, nChar = nchar(jData$Category)), nChar))
head(catLengthA,20)
## Category nChar
## 1 2 1
## 6 4 1
## 11 X 1
## 20 7 1
## 25 Q 1
## 28 3 1
## 33 6 1
## 38 M 1
## 43 47 2
## 48 25 2
## 53 TV 2
## 58 17 2
## 63 22 2
## 68 U2 2
## 73 E3 2
## 78 GM 2
## 83 13 2
## 88 26 2
## 90 GO 2
## 95 AA 2
Very heavily skewed distribution of category use frequency
hist(catFreq$Freq, breaks = 100)
Much nicer distribution of the length of category names
hist(catLength$nChar, breaks=20)
prep some data for ggplot2
catPlot <- jData[jData$Category %in% head(catFreq$Var1,20), "Category"]
catTbl <- data.frame(table(catPlot))
colnames(catTbl) <- c("Category", "Frequency")
catTbl <- arrange(catTbl, desc(Frequency))
Frequency of appearance for 20 most common category names
ggplot(catTbl, aes(x = reorder(Category, -Frequency), y = Frequency)) +
geom_bar(stat="identity") +
xlab("Category") +
theme_wsj() +
theme(axis.text.x = element_text(angle = 45)) +
ggtitle("I'll take... Frequency of Jeopardy! Categories (top 20)")
Well, as you know, in Jeopardy! the answers are really questions and the questions are posed as answers. Here are descriptives of the answers the players gave, posed as questions…
20 most common correct answers
answFreq <- arrange(data.frame(table(jData$Answer)), desc(Freq))
head(answFreq,20)
## Var1 Freq
## 1 China 216
## 2 Australia 215
## 3 Japan 196
## 4 Chicago 194
## 5 France 193
## 6 India 185
## 7 California 180
## 8 Canada 176
## 9 Spain 171
## 10 Mexico 164
## 11 Alaska 161
## 12 Italy 160
## 13 Hawaii 157
## 14 Texas 153
## 15 Paris 149
## 16 Germany 147
## 17 Russia 141
## 18 Florida 140
## 19 South Africa 139
## 20 Ireland 136
the 20 longest answers… but many are actually answers that have multiple possibilities; see #5 for the longest single answer
answLength <- unique(arrange(data.frame(Answer = jData$Answer, nChar = nchar(jData$Answer)), desc(nChar)))
head(answLength,20)
## Answer
## 1 (2 of) the Montreal Expos, the Seattle Mariners, the California Angels, the Texas Rangers, the Toronto Blue Jays, & the Houston Astros
## 2 (1 of) Beth Henley (for Crimes of the Heart), Marsha Norman (for 'Night, Mother) & Wendy Wasserstein (for The Heidi Chronicles)
## 3 the Atlanta-Fulton County Stadium (or the Kingdome, Milwaukee County Stadium, or the Oakland-Alameda County Coliseum)
## 4 Cardinals (St. Louis [baseball] & Phoenix [football]) or Giants (San Francisco [baseball] & New York [football])
## 5 A Midsummer Night's Sex Comedy & Everything You Always Wanted to Know About Sex (But Were Afraid to Ask)
## 6 Rain, rain, go away/Come again another day (Rain, rain, go away/Come again some other day accepted)
## 7 (1 of) Ameritech, Bell Atlantic, Bell South, NYNEX, Southwestern Bell, Pacific Telesis, or US West
## 8 Declaration of Independence, Articles of Confederation, Articles of Association & The Constitution
## 9 escaping the Earth's gravity (and go off into outer space, on your way to the moon, for instance)
## 10 (2 of) the Seattle Seahawks, the Jacksonville Jaguars, the Buffalo Bills, & the Tennessee Titans
## 11 Dag Hammarskjold (the secretary-general of the UN who was killed in the plane crash in Africa)
## 12 Son (Jefferson, Madison, Jackson, W.H. Harrison, A. Johnson, B. Harrison, Wilson & L. Johnson)
## 13 Speaker of the House of Orange Julius (Speaker of the House of Orange Julius Caesar accepted)
## 14 Massachusetts Institute of Technology & California Institute of Technology (MIT & Caltech)
## 15 Aquarius (W.H. Harrison, Lincoln, McKinley & FDR were the presidents who died in office)
## 16 (1 of) Ann Richards (of Texas), Barbara Roberts (of Oregon) or Joan Finney (of Kansas)
## 17 a microphone & the masks of comedy & tragedy (a TV set, a movie camera & a phonograph)
## 18 Chile (which goes down around Tierra Del Fuego & sweeps a little south of Argentina)
## 19 hiding your light under a bushel (keeping your light underneath a bushel accepted)
## 20 (2 of 5) Marlon Brando, Gary Cooper, Dustin Hoffman, Fredric March & Spencer Tracy
## nChar
## 1 134
## 2 127
## 3 117
## 4 112
## 5 104
## 6 99
## 7 98
## 8 98
## 9 97
## 10 96
## 11 94
## 12 94
## 13 93
## 14 90
## 15 88
## 16 86
## 17 86
## 18 84
## 19 82
## 20 82
very skewed distribution of answer frequency
hist(answFreq$Freq, breaks = 100)
less skewed distribution of answer length
hist(answLength$nChar, breaks=20)
some ggplot2 data prep
answPlot <- jData[jData$Answer %in% head(answFreq$Var1,20), "Answer"]
answTbl <- data.frame(table(answPlot))
colnames(answTbl) <- c("Answer", "Frequency")
answTbl <- arrange(answTbl, desc(Frequency))
Frequency of 20 most common correct answers on Jeopardy! (all place names!)
ggplot(answTbl, aes(x = reorder(Answer, -Frequency), y = Frequency)) +
geom_bar(stat="identity") +
xlab("Answer") +
theme_wsj() +
theme(axis.text.x = element_text(angle = 45)) +
ggtitle("What is... ? Frequency of Jeopardy! Answers (top 20)")
These data contain information on different rounds within the game.
data.frame(table(jData$Round))
## Var1 Freq
## 1 Double Jeopardy! 105912
## 2 Final Jeopardy! 3631
## 3 Jeopardy! 107384
## 4 Tiebreaker 3
While the other post I linked to at the top looked at a relationship between answers and values, these are just the basics of the values attribute
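# keep only the Value column and drop rows that have no dollar value
Values <- select(jData, Value)
Values <- filter(Values, Value != "None")
# strip the "$" sign and the thousands separator, then convert to numeric
Values <- mutate(Values, Value = sub("$", "", Value, fixed = TRUE))
Values <- mutate(Values, Value = as.numeric(sub(",", "", Value, fixed = TRUE)))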
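The clean-up above can also be done in one pass with a regular expression. This is only a sketch in base R: `parseValue` is a hypothetical helper name and the input vector is made up. It maps "None" to NA instead of dropping the row, which is a different choice from the filter used above.

```r
# Hypothetical helper: turn strings like "$2,000" into numbers.
parseValue <- function(x) {
  x[x == "None"] <- NA_character_   # no dollar value recorded
  # inside [ ] the "$" is a literal character, so this strips "$" and ","
  as.numeric(gsub("[$,]", "", x))
}

vals <- parseValue(c("$200", "$2,000", "None", "$1,200"))
print(vals)
```

Keeping the NA rows leaves the result the same length as the input, so it could be assigned straight back as a column; dropping them afterwards with `!is.na()` would match the filtered version.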
highly skewed distribution of question value frequencies…
hist(Values$Value, breaks= 100)
but… general Jeopardy! play only goes up to a value of $2,000 per question in Double Jeopardy!, so let's look at that.
smallValues <- filter(Values, Value <= 2000) %>%
filter((Value/10) %% 2 == 0) %>%
filter(Value != 20) %>%
filter(Value != 1020)
data.frame(table(smallValues))
## smallValues Freq
## 1 100 9029
## 2 200 30455
## 3 300 8663
## 4 400 42244
## 5 500 9016
## 6 600 20377
## 7 700 203
## 8 800 31860
## 9 900 114
## 10 1000 21640
## 11 1100 63
## 12 1200 11772
## 13 1300 75
## 14 1400 228
## 15 1500 546
## 16 1600 11040
## 17 1700 44
## 18 1800 182
## 19 1900 28
## 20 2000 12829
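The filter chain above keeps values whose tenth is even (in other words, multiples of 20) and then drops 20 and 1020 by hand. Assuming the intent is simply to keep round hundreds up to 2000, a modulo-100 test says that more directly. A sketch on made-up numbers (the vector `vals` is hypothetical):

```r
# Hypothetical mix of standard board values and odd wager amounts
vals <- c(100, 200, 20, 1020, 2000, 2500, 350, 367)

# Keep only multiples of 100 that are no larger than 2000
standard <- vals[vals <= 2000 & vals %% 100 == 0]
print(standard)
```

Every value listed in the table above is a multiple of 100, so this test should select the same twenty values from the real data, though that has not been re-run here.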
Frequency of the 20 question values in multiples of $100, up to $2,000
ggplot(data.frame(table(smallValues)), aes(x = reorder(smallValues, -Freq), y = Freq)) +
geom_bar(stat="identity") +
xlab("Value") +
theme_wsj() +
theme(axis.text.x = element_text(angle = 45)) +
ggtitle("For... ? Frequency of Jeopardy! Question Values (top 20)")
classic line from Rosie Perez in White Men Can't Jump (1992).
At the end of the movie, this answer comes in handy during the Double Jeopardy! round.
Has “Quince” ever been used as an answer in Jeopardy! ? Why yes, three times as a matter of fact. Interestingly, they all came after that movie was released.
Occurrences of “Quince” as a correct answer.
quince <- arrange(jData, desc(Answer)) %>%
mutate(temp = ifelse(Answer == "Quince",1,0)) %>%
filter(temp == 1)
print(quince[ , c(7, 1:5)])
## Answer Show.Number Air.Date Round
## 1 Quince 5694 2009-05-14 Double Jeopardy!
## 2 Quince 2576 1995-11-13 Jeopardy!
## 3 Quince 3008 1997-10-01 Double Jeopardy!
## Category Value
## 1 MIND YOUR SHAKESPEARE "P"s & "Q"s $2000
## 2 MIND YOUR "P"s & "Q"s $500
## 3 FOODS THAT BEGIN WITH THE LETTER "Q" $800
print(quince[ , c(6)])
## [1] "Fruity surname of Peter in \"A Midsummer Night's Dream\""
## [2] "Used in jellies & compotes, this fruit of the rose family puckers the mouth when tasted raw"
## [3] "This yellow-skinned fruit has been used for centuries in jams & jellies"
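The lookup above is an exact, case-sensitive match on the Answer column, so a variant such as "quince" or an entry with stray spaces would not be counted. Whether the real file contains any such variants was not checked here. A sketch of a more forgiving match on made-up answers (the `toy` data frame is hypothetical):

```r
# Hypothetical answers, including variants an exact match would miss
toy <- data.frame(
  Answer = c("Quince", "quince", " Quince ", "Quincy Jones", "Arizona"),
  stringsAsFactors = FALSE
)

# Trim leading/trailing whitespace, lower-case, then compare exactly
normalized <- tolower(gsub("^\\s+|\\s+$", "", toy$Answer))
isQuince <- normalized == "quince"

print(toy[isQuince, , drop = FALSE])
print(sum(isQuince))
```

Comparing the whole normalized string, instead of searching for "quince" as a substring, keeps unrelated answers like "Quincy Jones" out of the result.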
Rosie's character says she has seven more foods that start with the letter “Q”. While there are perhaps more than seven, below are some additional foods (or medicine in the case of Quinine).
foodQ <- c("Quesadilla", "Quahog", "Quiche", "Quail", "Queso", "Quinine", "Quinoa")
qFoods <- arrange(jData, desc(Answer)) %>%
mutate(temp = ifelse(Answer %in% foodQ,1,0)) %>%
filter(temp == 1)
print(qFoods[ , c(7, 1:5)])
## Answer Show.Number Air.Date Round
## 1 Quinine 3688 2000-09-20 Double Jeopardy!
## 2 Quinine 2576 1995-11-13 Jeopardy!
## 3 Quinine 3367 1999-04-06 Double Jeopardy!
## 4 Quinine 2674 1996-03-28 Jeopardy!
## 5 Quinine 2819 1996-11-28 Double Jeopardy!
## 6 Quiche 3251 1998-10-26 Jeopardy!
## 7 Quiche 2900 1997-03-21 Jeopardy!
## 8 Quiche 3008 1997-10-01 Double Jeopardy!
## 9 Quesadilla 3008 1997-10-01 Double Jeopardy!
## 10 Quail 2847 1997-01-07 Jeopardy!
## 11 Quail 3008 1997-10-01 Double Jeopardy!
## 12 Quahog 3890 2001-06-29 Double Jeopardy!
## 13 Quahog 3529 1999-12-30 Double Jeopardy!
## Category Value
## 1 "NINE" ACROSS $400
## 2 MIND YOUR "P"s & "Q"s $400
## 3 PLANTS & TREES $600
## 4 GENERAL SCIENCE $300
## 5 FOOD & DRINK $400
## 6 EGGS & HAM $200
## 7 ON "Q" $100
## 8 FOODS THAT BEGIN WITH THE LETTER "Q" $600
## 9 FOODS THAT BEGIN WITH THE LETTER "Q" $400
## 10 SCIENCE & NATURE $300
## 11 FOODS THAT BEGIN WITH THE LETTER "Q" $200
## 12 GEE "Q" $1000
## 13 ANIMAL A.K.A. $1000
print(qFoods[ , c(6)])
## [1] "Malaria remedy (7)"
## [2] "Today most of this medicine comes from cinchona trees in Java"
## [3] "The cinchona tree, the source for this malaria drug, was once plentiful in South America; now it's grown mainly on Java"
## [4] "This malaria-fighting substance is extracted from the bark of the cinchona tree"
## [5] "Dubonnet is an aperitif flavored with herbs & a form of this malaria medicine"
## [6] "The ham & cheese version of this egg dish is prepared in much the same manner as the Lorraine"
## [7] "Made with eggs, cheese & other fillings in a pastry shell, it's what real men don't eat"
## [8] "This savory pie can have bacon bits added to the custard filling"
## [9] "It's a tortilla that's been filled, folded & fried"
## [10] "North American types of this bird include Montezuma, mountain & bobwhite"
## [11] "This member of the pheasant family can be roasted, broiled or fried"
## [12] "A thick-shelled edible clam of the north Atlantic coast"
## [13] "Hard-shell or littleneck clam"
All the “Q” foods.
Qs <- merge(quince, qFoods, all=TRUE)
Qs <- data.frame(table(Qs$Answer))
colnames(Qs) <- c("Food", "Frequency")
arrange(Qs, desc(Frequency))
## Food Frequency
## 1 Quinine 5
## 2 Quiche 3
## 3 Quince 3
## 4 Quahog 2
## 5 Quail 2
## 6 Quesadilla 1
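Only six foods show up in the table even though eight were searched for (Quince plus the seven in foodQ): Queso and Quinoa had no exact matches, and `table()` only counts values that are present. To keep the zero counts visible, one option is to tabulate a factor whose levels are the full search list. A sketch with made-up answers (the `answers` vector is hypothetical):

```r
# Full list of foods searched for
foods <- c("Quince", "Quesadilla", "Quahog", "Quiche",
           "Quail", "Queso", "Quinine", "Quinoa")

# Hypothetical answers standing in for jData$Answer
answers <- c("Quinine", "Quiche", "Quince", "Quinine", "Copernicus")

# Values outside 'foods' become NA and are dropped by table();
# foods that never occur still get a row with a count of 0
counts <- table(factor(answers, levels = foods))
qCounts <- data.frame(Food = names(counts), Frequency = as.vector(counts),
                      stringsAsFactors = FALSE)

print(qCounts[order(-qCounts$Frequency), ])
```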
Frequency of “Q” foods as correct answers on Jeopardy!
ggplot(Qs, aes(x = reorder(Food, -Frequency), y = Frequency)) +
geom_bar(stat="identity") +
xlab("Answer") +
theme_wsj() +
theme(axis.text.x = element_text(angle = 45)) +
ggtitle("Seven Foods Starting with the Letter Q")