The objective of this exercise is to build a predictive model and evaluate how well it predicts the next word for each of the following quiz sentences:
-“The guy in front of me just bought a pound of bacon, a bouquet, and a case of”
-“You’re the reason why I smile everyday. Can you follow me please? It would mean the”
-“Hey sunshine, can you follow me and make me the”
-“Very early observations on the Bills game: Offense still struggling but the”
-“Go on a romantic date at the”
-“Well I’m pretty sure my granny has some old bagpipes in her garage I’ll dust them off and be on my”
-“Ohhhhh #PointBreak is on tomorrow. Love that film and haven’t seen it in quite some”
-“After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little”
-“Be grateful for the good times and keep the faith during the”
-“If this isn’t the cutest thing you’ve ever seen, then you must be”
library(tm);library(dplyr);library(ggplot2)
library(pryr);library(stringr);library(RWeka)
# Quiz List 1
qb1<-"The guy in front of me just bought a pound of bacon, a bouquet, and a case of"
qb2<-"You're the reason why I smile everyday. Can you follow me please? It would mean the"
qb3<-"Hey sunshine, can you follow me and make me the"
qb4<-"Very early observations on the Bills game: Offense still struggling but the"
qb5<-"Go on a romantic date at the"
qb6<-"Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my"
qb7<-"Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some"
qb8<-"After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
qb9<-"Be grateful for the good times and keep the faith during the"
qb10<-"If this isn't the cutest thing you've ever seen, then you must be"
# Quiz List 2
qc1<-"When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"
qc2<-"Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"
qc3<-"I'd give anything to see arctic monkeys this"
qc4<-"Talking to your mom has the same effect as a hug and helps reduce your"
qc5<-"When you were in Holland you were like 1 inch away from me but you hadn't time to take a"
qc6<-"I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"
qc7<-"I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"
qc8<-"Every inch of you is perfect from the bottom to the"
qc9<-"I'm thankful my childhood was filled with imagination and bruises from playing"
qc10<-"I like how the same people are in almost all of Adam Sandler's"
qblist<-list(qb1,qb2,qb3,qb4,qb5,qb6,qb7,qb8,qb9,qb10)
qclist<-list(qc1,qc2,qc3,qc4,qc5,qc6,qc7,qc8,qc9,qc10)
# Load data
df1 = readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
df.blogs = readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
df.news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
To make predictions, the last n words of each sentence (its trailing n-gram) are first extracted.
ngram.extract = function(str, n) {
sep.chr = tail(strsplit(str, split = " ")[[1]], n)
return(paste(sep.chr, sep = "", collapse = " "))
}
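As a quick check of the idea behind this helper, the sketch below restates it on a toy sentence. The name `last.ngram` and the whitespace handling are illustrative assumptions, not part of the original code.

```r
# Hypothetical variant of ngram.extract: returns the last n words of a string.
# Splitting on runs of whitespace (an assumption) avoids empty tokens
# when the input contains double or trailing spaces.
last.ngram = function(str, n) {
  words = strsplit(trimws(str), "\\s+")[[1]]
  paste(tail(words, n), collapse = " ")
}

example = "a pound of bacon, a bouquet, and a case of"
tri = last.ngram(example, 3)   # "a case of"
bi = last.ngram(example, 2)    # "case of"
print(tri)
print(bi)
```

The trailing 3-gram is what gets searched for in the corpus, and the 2-gram is a shorter alternative for the same sentence.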
This approach is then applied to the sample 3-gram “a case of”: every occurrence in the Twitter data is matched together with the word that follows it.
#x = grep("a case of", s_df1, value = TRUE)
#x
match = regmatches(df1, regexpr("a case of (.*?) ", df1))
match
[1] "a case of Red "
[2] "a case of the "
[3] "a case of water "
[4] "a case of the "
[5] "a case of water "
[6] "a case of plagiarism "
[7] "a case of the "
[8] "a case of Shameka "
[9] "a case of the "
[10] "a case of carpal "
[11] "a case of whiplash "
[12] "a case of Anchor "
[13] "a case of keystone "
[14] "a case of beer, "
[15] "a case of the "
[16] "a case of rum "
[17] "a case of the "
[18] "a case of the "
[19] "a case of the "
[20] "a case of SPRING "
[21] "a case of Miller "
[22] "a case of sweet "
[23] "a case of the "
[24] "a case of #SheThinksShesHot...My "
[25] "a case of the "
[26] "a case of the "
[27] "a case of \"parents "
[28] "a case of mind "
[29] "a case of greatness "
[30] "a case of beer "
[31] "a case of from "
[32] "a case of \"The "
[33] "a case of IC "
[34] "a case of beer "
[35] "a case of the "
[36] "a case of the "
[37] "a case of the "
[38] "a case of the "
[39] "a case of the "
[40] "a case of extremes, "
[41] "a case of 2006 "
[42] "a case of that "
[43] "a case of Renee "
[44] "a case of the "
[45] "a case of the "
[46] "a case of Thermoplastic "
[47] "a case of the "
[48] "a case of Mondays "
[49] "a case of the "
[50] "a case of Nugget "
[51] "a case of lead "
[52] "a case of the "
[53] "a case of food "
[54] "a case of the "
[55] "a case of the "
[56] "a case of the "
[57] "a case of the "
[58] "a case of excusable "
[59] "a case of bud "
[60] "a case of Pepsi, "
[61] "a case of Miller "
[62] "a case of the "
[63] "a case of jet "
[64] "a case of the "
[65] "a case of the "
[66] "a case of the "
[67] "a case of the "
[68] "a case of books. "
[69] "a case of sexual "
[70] "a case of the "
[71] "a case of \"too "
[72] "a case of mind "
[73] "a case of the "
[74] "a case of the "
[75] "a case of \"Arrested "
[76] "a case of the "
[77] "a case of psychic "
[78] "a case of beer "
[79] "a case of the "
[80] "a case of C+Swiss "
[81] "a case of CDs "
[82] "a case of Miller "
[83] "a case of the "
[84] "a case of the "
[85] "a case of \"The "
[86] "a case of poor "
[87] "a case of Franks "
[88] "a case of Gin! "
[89] "a case of jealousy "
[90] "a case of mistaken "
[91] "a case of the "
[92] "a case of Boxer. "
[93] "a case of idle "
[94] "a case of the "
[95] "a case of an "
[96] "a case of do "
[97] "a case of spring "
[98] "a case of beer. "
[99] "a case of the "
[100] "a case of this "
[101] "a case of Surge "
[102] "a case of this "
[103] "a case of writer<U+0092>s "
[104] "a case of the "
[105] "a case of beer. "
[106] "a case of beer "
[107] "a case of the "
[108] "a case of ProPenn "
[109] "a case of duct "
[110] "a case of the "
[111] "a case of damaged "
[112] "a case of disgusting "
[113] "a case of Monday "
[114] "a case of the "
[115] "a case of silver "
[116] "a case of #vernors "
[117] "a case of beer "
[118] "a case of the "
[119] "a case of the "
[120] "a case of the "
[121] "a case of high "
[122] "a case of no "
[123] "a case of beer "
[124] "a case of Lena "
[125] "a case of the "
[126] "a case of mountain "
[127] "a case of a "
[128] "a case of the "
[129] "a case of wine "
[130] "a case of \"Bad "
[131] "a case of the "
[132] "a case of hairspray "
[133] "a case of suds, "
[134] "a case of the "
[135] "a case of the "
[136] "a case of Dundee. "
[137] "a case of the "
[138] "a case of beer "
[139] "a case of the "
[140] "a case of luck "
[141] "a case of \"write "
[142] "a case of the "
[143] "a case of knowing "
[144] "a case of the "
[145] "a case of cold "
[146] "a case of bananas "
[147] "a case of the "
[148] "a case of sunglasses "
[149] "a case of Coors "
matchlist = gsub("a case of | $","", match)
matchnames = names(table(matchlist))
matchfreq = table(matchlist)
caseofmatch = data.frame(item = matchnames, freq = matchfreq)
caseofmatch = data.frame(item = caseofmatch$item, freq = caseofmatch$freq.Freq)
caseofmatch = caseofmatch[order(-caseofmatch$freq),]
Based on the Twitter sample, the most frequent word after “a case of” is “the” (57 of the 149 matches). The second most frequent word is “beer” (7 matches).
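The listing above keeps case and punctuation, so “beer”, “beer,” and “beer.” are counted as different words. A minimal sketch of one way to normalize the following words before counting is given below, on a small hand-made vector. The vector `toy.match` and the helper `next.word.freq` are assumptions for illustration and do not reproduce the Twitter counts.

```r
# Toy matches in the same "a case of <word> " shape as the output above
# (hand-made for illustration; not the corpus data).
toy.match = c("a case of the ", "a case of beer ", "a case of beer, ",
              "a case of Beer. ", "a case of the ", "a case of water ")

# Hypothetical helper: strip the prefix, lowercase, drop punctuation, then count.
next.word.freq = function(matches, prefix) {
  words = sub(prefix, "", matches, fixed = TRUE)   # remove the leading n-gram
  words = tolower(trimws(words))
  words = gsub("[[:punct:]]", "", words)           # "beer," and "beer." become "beer"
  words = words[nzchar(words)]                     # drop tokens that were only punctuation
  freq = sort(table(words), decreasing = TRUE)
  data.frame(item = as.character(names(freq)), freq = as.integer(freq),
             stringsAsFactors = FALSE)
}

toy.freq = next.word.freq(toy.match, "a case of ")
print(toy.freq)
```

On this toy input the three spellings of “beer” are merged into one entry. Whether stripping punctuation is appropriate depends on the data, since it also alters tokens such as hashtags.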
The last 3-gram (the final three words) of each quiz sentence is extracted.
qblist.3gram = sapply(qblist, ngram.extract, 3)
Next, the last 2-gram of each sentence is extracted, along with the last 3-gram of each sentence in the second quiz list.
qblist.2gram = sapply(qblist, ngram.extract, 2)
qclist.3gram = sapply(qclist, ngram.extract, 3)
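A function is then created to extract the word that follows a given 3-gram in the source file. The loop after it tabulates, for each 3-gram, the total number of matches and the three most frequent following words, which are later assembled into a data frame.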
word3.predict = function(datasource, term) {
match = NULL
matchlist = NULL
match = regmatches(datasource, regexpr(paste(term, "(.*?) "), datasource))
matchlist = gsub(paste(term, "| $"),"", match)
return(matchlist)
}
match.list = sapply(qblist.3gram, word3.predict, datasource = df1)
cgramText = vector()
cgramCount = vector()
nw1Text = vector()
nw1Count = vector()
nw2Text = vector()
nw2Count = vector()
nw3Text = vector()
nw3Count = vector()
for (i in 1:length(match.list)) {
temp.vec = match.list[i]
cgramText[i] = qblist.3gram[i]
cgramCount[i] = sum(table(match.list[i]))
if (is.na(table(match.list[i])[1])) {
nw1Text[i] = NA
}
else {
nw1Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[1])
}
nw1Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[1])
if (is.na(table(match.list[i])[2])) {
nw2Text[i] = NA
}
else {
nw2Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[2])
}
nw2Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[2])
if (is.na(table(match.list[i])[3])) {
nw3Text[i] = NA
}
else {
nw3Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[3])
}
nw3Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[3])
}
The result of the prediction is as follows:
pred.rts = data.frame(cgramText, cgramCount, nw1Text, nw1Count, nw2Text, nw2Count, nw3Text, nw3Count)
pred.rts
cgramText cgramCount nw1Text nw1Count nw2Text nw2Count
1 a case of 149 the 57 beer 7
2 would mean the 171 world 151 WORLD 4
3 make me the 44 happiest 24 most 3
4 struggling but the 0 <NA> NA <NA> NA
5 date at the 12 "art 1 App 1
6 be on my 144 way 24 show 6
7 in quite some 5 time. 3 time! 1
8 with his little 6 bitty 1 brother 1
9 faith during the 0 <NA> NA <NA> NA
10 you must be 218 a 27 so 15
nw3Text nw3Count
1 Miller 3
2 world! 3
3 16th 2
4 <NA> NA
5 bottom 1
6 own 5
7 time!! 1
8 brother. 1
9 <NA> NA
10 on 8
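Two properties of `word3.predict` affect these counts: `regexpr` reports only the first match in each line, and the pattern requires a space after the following word, so an occurrence at the very end of a line is not counted. The sketch below shows a possible variant that collects every occurrence with `gregexpr`. The name `next.words` and the toy lines are assumptions for illustration, and `term` is still interpreted as a regular expression.

```r
# Hypothetical variant of word3.predict that keeps every occurrence per line.
next.words = function(datasource, term) {
  pattern = paste0(term, " [^[:space:]]+")          # term, a space, one non-space token
  hits = regmatches(datasource, gregexpr(pattern, datasource))
  hits = unlist(hits)                               # flatten the per-line results
  sub(paste0("^", term, " "), "", hits)             # keep only the following token
}

toy.lines = c("I am on my way home and on my own",
              "nothing to see here",
              "be on my way")
nw = next.words(toy.lines, "on my")
print(nw)
print(sort(table(nw), decreasing = TRUE))
```

Here the first toy line contributes two following words and the last line contributes one even though nothing follows it.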
After this initial test, the code above is packaged as a function so that it can be applied to any list of n-grams and any data source.
pred.word.fun = function (data.input, gramlist) {
match.list = sapply(gramlist, word3.predict, datasource = data.input)
cgramText = vector()
cgramCount = vector()
nw1Text = vector()
nw1Count = vector()
nw2Text = vector()
nw2Count = vector()
nw3Text = vector()
nw3Count = vector()
for (i in 1:length(match.list)) {
temp.vec = match.list[i]
cgramText[i] = gramlist[i]
cgramCount[i] = sum(table(match.list[i]))
if (is.na(table(match.list[i])[1])) {
nw1Text[i] = NA
}
else {
nw1Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[1])
}
nw1Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[1])
if (is.na(table(match.list[i])[2])) {
nw2Text[i] = NA
}
else {
nw2Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[2])
}
nw2Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[2])
if (is.na(table(match.list[i])[3])) {
nw3Text[i] = NA
}
else {
nw3Text[i] = names(sort(table(match.list[i]), decreasing = TRUE)[3])
}
nw3Count [i] = as.numeric(sort(table(match.list[i]), decreasing = TRUE)[3])
}
pred.rts = data.frame(cgramText, cgramCount, nw1Text, nw1Count, nw2Text, nw2Count, nw3Text, nw3Count)
return(pred.rts)
}
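The three near-identical if/else blocks in this function could be collapsed into a single helper. The following is a minimal sketch, assuming each element of the match list is a character vector of following words. The names `top.k`, `toy.matches` and `toy.rts` are hypothetical and not part of the original report.

```r
# Hypothetical helper: the k most frequent values of a character vector,
# padded with NA when fewer than k distinct values exist.
top.k = function(words, k = 3) {
  freq = sort(table(words), decreasing = TRUE)
  text = names(freq)[seq_len(k)]             # indexing past the end yields NA
  count = as.integer(freq)[seq_len(k)]
  if (is.null(text)) text = rep(NA_character_, k)   # guard for input without names
  list(text = text, count = count)
}

# Toy stand-in for the list returned by sapply(gramlist, word3.predict, ...)
toy.matches = list("be on my" = c("way", "own", "way", "mind", "way", "own"),
                   "struggling but the" = character(0))

rows = lapply(names(toy.matches), function(g) {
  top = top.k(toy.matches[[g]])
  data.frame(cgramText = g, cgramCount = length(toy.matches[[g]]),
             nw1Text = top$text[1], nw1Count = top$count[1],
             nw2Text = top$text[2], nw2Count = top$count[2],
             nw3Text = top$text[3], nw3Count = top$count[3],
             stringsAsFactors = FALSE)
})
toy.rts = do.call(rbind, rows)
print(toy.rts)
```

Building one row per n-gram and binding the rows keeps each text column aligned with its count column, including for n-grams without any match.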
Using this function, predictions can now be generated from the other sources, news and blogs.
news.pred.rts = pred.word.fun(gramlist = qblist.3gram, data.input = df.news )
blog.pred.rts = pred.word.fun(gramlist = qblist.3gram, data.input = df.blogs )
Two additional data frames are created with the match information for the quiz 3-grams, one from the news source and one from the blogs source. They are shown below after the Twitter results.
pred.rts
cgramText cgramCount nw1Text nw1Count nw2Text nw2Count
1 a case of 149 the 57 beer 7
2 would mean the 171 world 151 WORLD 4
3 make me the 44 happiest 24 most 3
4 struggling but the 0 <NA> NA <NA> NA
5 date at the 12 "art 1 App 1
6 be on my 144 way 24 show 6
7 in quite some 5 time. 3 time! 1
8 with his little 6 bitty 1 brother 1
9 faith during the 0 <NA> NA <NA> NA
10 you must be 218 a 27 so 15
nw3Text nw3Count
1 Miller 3
2 world! 3
3 16th 2
4 <NA> NA
5 bottom 1
6 own 5
7 time!! 1
8 brother. 1
9 <NA> NA
10 on 8
news.pred.rts
cgramText cgramCount nw1Text nw1Count nw2Text nw2Count
1 a case of 10 beer 1 double-dipping 1
2 would mean the 4 airport 1 Century 1
3 make me the 0 <NA> NA <NA> NA
4 struggling but the 0 <NA> NA <NA> NA
5 date at the 4 end 1 Four 1
6 be on my 0 <NA> NA <NA> NA
7 in quite some 1 time. 1 <NA> NA
8 with his little 0 <NA> NA <NA> NA
9 faith during the 0 <NA> NA <NA> NA
10 you must be 2 a 1 <NA> 1
nw3Text nw3Count
1 first 1
2 state 1
3 <NA> NA
4 <NA> NA
5 time 1
6 <NA> NA
7 <NA> NA
8 <NA> NA
9 <NA> NA
10 <NA> NA
blog.pred.rts
cgramText cgramCount nw1Text nw1Count nw2Text nw2Count
1 a case of 245 the 31 beer 5
2 would mean the 10 world 2 £1 1
3 make me the 17 best 2 bun, 1
4 struggling but the 0 <NA> NA <NA> NA
5 date at the 11 end 2 Cake 1
6 be on my 69 own 7 mind 6
7 in quite some 22 time. 12 time, 7
8 with his little 24 bedroom. 1 brother 1
9 faith during the 1 worship 1 <NA> NA
10 you must be 118 a 18 able 10
nw3Text nw3Count
1 24 3
2 angles 1
3 daughter 1
4 <NA> NA
5 Driskill 1
6 way 6
7 time 2
8 brother. 1
9 <NA> NA
10 the 5
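Some 3-grams, such as “struggling but the”, have no match in any of the three sources. One possible use of the 2-grams extracted earlier is a simple back-off: try the 3-gram first and fall back to a shorter n-gram when it has no match. The sketch below illustrates this on toy lines. The names `follow.words` and `backoff.predict` and the toy corpus are assumptions for illustration, and the sketch was not run on the quiz data.

```r
# Hypothetical helper: tokens that follow `term` anywhere in `lines`.
follow.words = function(lines, term) {
  pattern = paste0(term, " [^[:space:]]+")
  hits = unlist(regmatches(lines, gregexpr(pattern, lines)))
  sub(paste0("^", term, " "), "", hits)
}

# Hypothetical back-off: use the longest trailing n-gram with at least one match.
backoff.predict = function(lines, sentence, max.n = 3) {
  none = list(ngram = NA_character_, word = NA_character_)
  words = strsplit(trimws(sentence), "\\s+")[[1]]
  if (length(words) == 0) return(none)
  for (n in seq(min(max.n, length(words)), 1)) {
    term = paste(tail(words, n), collapse = " ")
    found = follow.words(lines, term)
    if (length(found) > 0) {
      best = names(sort(table(found), decreasing = TRUE))[1]
      return(list(ngram = term, word = best))
    }
  }
  none
}

# Toy corpus (hand-made; an assumption, not the course data)
toy.corpus = c("slow start but the defense held",
               "tired but the defense is fine",
               "but the crowd is loud")
res = backoff.predict(toy.corpus, "Offense still struggling but the")
print(res)
```

In this toy example the 3-gram has no match, so the prediction comes from the 2-gram “but the”. A shorter n-gram matches more often but carries less context, so its suggestions should be treated with more caution.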