Executive Summary

The objective of this exercise is to build a predictive model and evaluate how well it predicts the missing final word of each of the following quiz sentences:

- “The guy in front of me just bought a pound of bacon, a bouquet, and a case of”
- “You’re the reason why I smile everyday. Can you follow me please? It would mean the”
- “Hey sunshine, can you follow me and make me the”
- “Very early observations on the Bills game: Offense still struggling but the”
- “Go on a romantic date at the”
- “Well I’m pretty sure my granny has some old bagpipes in her garage I’ll dust them off and be on my”
- “Ohhhhh #PointBreak is on tomorrow. Love that film and haven’t seen it in quite some”
- “After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little”
- “Be grateful for the good times and keep the faith during the”
- “If this isn’t the cutest thing you’ve ever seen, then you must be”

Preprocessing

library(tm);library(dplyr);library(ggplot2)
library(pryr);library(stringr);library(RWeka)

# Quiz List 1
qb1<-"The guy in front of me just bought a pound of bacon, a bouquet, and a case of"
qb2<-"You're the reason why I smile everyday. Can you follow me please? It would mean the"
qb3<-"Hey sunshine, can you follow me and make me the"
qb4<-"Very early observations on the Bills game: Offense still struggling but the"
qb5<-"Go on a romantic date at the"
qb6<-"Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my"
qb7<-"Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some"
qb8<-"After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
qb9<-"Be grateful for the good times and keep the faith during the"
qb10<-"If this isn't the cutest thing you've ever seen, then you must be"

# Quiz List 2
qc1<-"When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"
qc2<-"Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"
qc3<-"I'd give anything to see arctic monkeys this"
qc4<-"Talking to your mom has the same effect as a hug and helps reduce your"
qc5<-"When you were in Holland you were like 1 inch away from me but you hadn't time to take a"
qc6<-"I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"
qc7<-"I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"
qc8<-"Every inch of you is perfect from the bottom to the"
qc9<-"I'm thankful my childhood was filled with imagination and bruises from playing"
qc10<-"I like how the same people are in almost all of Adam Sandler's"

qblist<-list(qb1,qb2,qb3,qb4,qb5,qb6,qb7,qb8,qb9,qb10)
qclist<-list(qc1,qc2,qc3,qc4,qc5,qc6,qc7,qc8,qc9,qc10)

# Load data
df1 = readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")   # Twitter
df.blogs = readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
df.news = readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
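
As a quick check that the files loaded correctly, the line counts and in-memory sizes can be inspected; a small illustrative snippet using object_size() from the pryr package loaded above:

# Number of lines (tweets) and memory footprint of the Twitter corpus
length(df1)
object_size(df1)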

Prediction

To make a prediction, the last n words of the sentence (its final n-gram) are first extracted.

# Return the last n words of a string as a single space-separated n-gram
ngram.extract = function(str, n) {
        sep.chr = tail(strsplit(str, split = " ")[[1]], n)
        return(paste(sep.chr, collapse = " "))
}
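
For example, applied to the first quiz sentence with n = 3 (an illustrative call):

ngram.extract(qb1, 3)
[1] "a case of"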

As a first test, the next word is looked up for the sample 3-gram “a case of” from the first quiz question: every occurrence of the phrase in the Twitter data is extracted together with the word that follows it.

match = regmatches(df1, regexpr("a case of (.*?) ", df1))
head(match, 10)
 [1] "a case of Red "
 [2] "a case of the "
 [3] "a case of water "
 [4] "a case of the "
 [5] "a case of water "
 [6] "a case of plagiarism "
 [7] "a case of the "
 [8] "a case of Shameka "
 [9] "a case of the "
[10] "a case of carpal "
# Strip the "a case of " prefix and the trailing space, leaving only the next word
matchlist = gsub("a case of | $", "", match)
matchfreq = table(matchlist)
caseofmatch = data.frame(item = names(matchfreq), freq = as.numeric(matchfreq))
caseofmatch = caseofmatch[order(-caseofmatch$freq), ]

Based on this sample, the most frequent word following “a case of” is “the” (57 occurrences); the second most frequent is “beer” (7).
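
For reference, the top of the frequency table can be inspected directly (a small illustrative check; the leading entries match the Twitter prediction table reported below):

# Three most frequent follow words: "the", "beer", "Miller"
head(caseofmatch, 3)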

Quiz Predictions

The last 3-gram of each quiz sentence is extracted.

qblist.3gram = sapply(qblist, ngram.extract, 3)

Next, the last 2-gram of each sentence is extracted as a fallback context, along with the last 3-gram of each sentence in the second quiz list.

qblist.2gram = sapply(qblist, ngram.extract, 2)
qclist.3gram = sapply(qclist, ngram.extract, 3)
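
For example, the first few extracted 3-grams are:

qblist.3gram[1:3]
[1] "a case of"      "would mean the" "make me the"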

A function is then created that extracts, from a given source file, every word that follows each of these 3-gram terms; the results are then tabulated into a data frame listing the most frequent candidates.

# Return every word that immediately follows `term` in `datasource`,
# one entry per occurrence
word3.predict = function(datasource, term) {
        # Match "term <word> ", then strip the term and the trailing space
        match = regmatches(datasource, regexpr(paste(term, "(.*?) "), datasource))
        matchlist = gsub(paste(term, "| $"), "", match)
        return(matchlist)
}
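# Quick sanity check (illustrative): this should reproduce the
# "a case of" candidates extracted earlier
head(word3.predict(df1, "a case of"))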
match.list = sapply(qblist.3gram, word3.predict, datasource = df1)
cgramText = vector(); cgramCount = vector()
nw1Text = vector(); nw1Count = vector()
nw2Text = vector(); nw2Count = vector()
nw3Text = vector(); nw3Count = vector()
for (i in 1:length(match.list)) {
        # Frequency table of the observed next words, most frequent first
        word.freq = sort(table(match.list[[i]]), decreasing = TRUE)
        cgramText[i] = qblist.3gram[i]
        cgramCount[i] = sum(word.freq)
        # Top three candidates and their counts (NA when fewer exist)
        nw1Text[i] = if (is.na(word.freq[1])) NA else names(word.freq)[1]
        nw1Count[i] = as.numeric(word.freq[1])
        nw2Text[i] = if (is.na(word.freq[2])) NA else names(word.freq)[2]
        nw2Count[i] = as.numeric(word.freq[2])
        nw3Text[i] = if (is.na(word.freq[3])) NA else names(word.freq)[3]
        nw3Count[i] = as.numeric(word.freq[3])
}

The result of the prediction is as follows:

pred.rts = data.frame(cgramText, cgramCount, nw1Text, nw1Count, nw2Text, nw2Count, nw3Text, nw3Count)
pred.rts
            cgramText cgramCount  nw1Text nw1Count nw2Text nw2Count
1           a case of        149      the       57    beer        7
2      would mean the        171    world      151   WORLD        4
3         make me the         44 happiest       24    most        3
4  struggling but the          0     <NA>       NA    <NA>       NA
5         date at the         12     "art        1     App        1
6            be on my        144      way       24    show        6
7       in quite some          5    time.        3   time!        1
8     with his little          6    bitty        1 brother        1
9    faith during the          0     <NA>       NA    <NA>       NA
10        you must be        218        a       27      so       15
    nw3Text nw3Count
1    Miller        3
2    world!        3
3      16th        2
4      <NA>       NA
5    bottom        1
6       own        5
7    time!!        1
8  brother.        1
9      <NA>       NA
10       on        8
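
Two of the 3-grams (“struggling but the” and “faith during the”) return zero matches in the Twitter sample. The 2-grams extracted earlier can serve as a backoff for these cases; the following is a minimal sketch (not part of the original run), reusing word3.predict on the shorter context:

# Back off from "struggling but the" to its last 2-gram, "but the"
backoff.matches = word3.predict(df1, ngram.extract(qb4, 2))
head(sort(table(backoff.matches), decreasing = TRUE))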

After this initial test, the code above is packaged as a function that can generate predictions for any list of n-grams from any source file.

pred.word.fun = function (data.input, gramlist) {
        match.list = sapply(gramlist, word3.predict, datasource = data.input)
        cgramText = vector(); cgramCount = vector()
        nw1Text = vector(); nw1Count = vector()
        nw2Text = vector(); nw2Count = vector()
        nw3Text = vector(); nw3Count = vector()
        for (i in 1:length(match.list)) {
                # Frequency table of the observed next words, most frequent first
                word.freq = sort(table(match.list[[i]]), decreasing = TRUE)
                cgramText[i] = gramlist[i]
                cgramCount[i] = sum(word.freq)
                # Top three candidates and their counts (NA when fewer exist)
                nw1Text[i] = if (is.na(word.freq[1])) NA else names(word.freq)[1]
                nw1Count[i] = as.numeric(word.freq[1])
                nw2Text[i] = if (is.na(word.freq[2])) NA else names(word.freq)[2]
                nw2Count[i] = as.numeric(word.freq[2])
                nw3Text[i] = if (is.na(word.freq[3])) NA else names(word.freq)[3]
                nw3Count[i] = as.numeric(word.freq[3])
        }
        pred.rts = data.frame(cgramText, cgramCount, nw1Text, nw1Count,
                              nw2Text, nw2Count, nw3Text, nw3Count)
        return(pred.rts)
}

Using this function, predictions can now be generated from the other sources, blogs and news.

news.pred.rts = pred.word.fun(gramlist = qblist.3gram, data.input = df.news)
blog.pred.rts = pred.word.fun(gramlist = qblist.3gram, data.input = df.blogs)

Together with the Twitter results above, there is now one data frame of match information per source for the quiz questions.

pred.rts
            cgramText cgramCount  nw1Text nw1Count nw2Text nw2Count
1           a case of        149      the       57    beer        7
2      would mean the        171    world      151   WORLD        4
3         make me the         44 happiest       24    most        3
4  struggling but the          0     <NA>       NA    <NA>       NA
5         date at the         12     "art        1     App        1
6            be on my        144      way       24    show        6
7       in quite some          5    time.        3   time!        1
8     with his little          6    bitty        1 brother        1
9    faith during the          0     <NA>       NA    <NA>       NA
10        you must be        218        a       27      so       15
    nw3Text nw3Count
1    Miller        3
2    world!        3
3      16th        2
4      <NA>       NA
5    bottom        1
6       own        5
7    time!!        1
8  brother.        1
9      <NA>       NA
10       on        8
news.pred.rts
            cgramText cgramCount nw1Text nw1Count        nw2Text nw2Count
1           a case of         10    beer        1 double-dipping        1
2      would mean the          4 airport        1        Century        1
3         make me the          0    <NA>       NA           <NA>       NA
4  struggling but the          0    <NA>       NA           <NA>       NA
5         date at the          4     end        1           Four        1
6            be on my          0    <NA>       NA           <NA>       NA
7       in quite some          1   time.        1           <NA>       NA
8     with his little          0    <NA>       NA           <NA>       NA
9    faith during the          0    <NA>       NA           <NA>       NA
10        you must be          2       a        1           <NA>        1
   nw3Text nw3Count
1    first        1
2    state        1
3     <NA>       NA
4     <NA>       NA
5     time        1
6     <NA>       NA
7     <NA>       NA
8     <NA>       NA
9     <NA>       NA
10    <NA>       NA
blog.pred.rts
            cgramText cgramCount  nw1Text nw1Count nw2Text nw2Count
1           a case of        245      the       31    beer        5
2      would mean the         10    world        2      £1        1
3         make me the         17     best        2    bun,        1
4  struggling but the          0     <NA>       NA    <NA>       NA
5         date at the         11      end        2    Cake        1
6            be on my         69      own        7    mind        6
7       in quite some         22    time.       12   time,        7
8     with his little         24 bedroom.        1 brother        1
9    faith during the          1  worship        1    <NA>       NA
10        you must be        118        a       18    able       10
    nw3Text nw3Count
1        24        3
2    angles        1
3  daughter        1
4      <NA>       NA
5  Driskill        1
6       way        6
7      time        2
8  brother.        1
9      <NA>       NA
10      the        5
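
To compare the top prediction across sources side by side, the three result frames can be combined into a single table (an illustrative convenience using the objects created above):

# Top predicted word per quiz 3-gram, one column per source
comparison = data.frame(cgram = pred.rts$cgramText,
                        twitter = pred.rts$nw1Text,
                        news = news.pred.rts$nw1Text,
                        blogs = blog.pred.rts$nw1Text)
comparison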