Quiz 2: Natural language processing I

Data Science Capstone: https://www.coursera.org/learn/data-science-project/
Quiz: https://www.coursera.org/learn/data-science-project/exam/QbBvW/quiz-2-natural-language-processing-i

Note: For how the text input prediction models were developed, see this report: http://rpubs.com/Nov05/459931, in which only the Twitter text was explored. For the actual models, however, I might use all the English text files (Twitter, blogs, and news), or at least a sample drawn from each of them. In the coming weeks, for the actual product, I will also reduce the N-gram dictionary size to improve performance, and possibly use lemmatization and/or stemming to push it further; a rough sketch of both ideas follows.
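
A minimal sketch of those two ideas, assuming an arbitrary frequency cutoff of 1 (to be tuned against held-out accuracy) and the SnowballC stemmer that tm::stemDocument() wraps:

library(SnowballC) # Porter stemmer; tm::stemDocument() wraps it

# Prune: drop hapax trigrams (freq == 1) to shrink the dictionary
df <- readRDS("D:/R/capstone/data/3gram_twitter_en.rds")
df_small <- df[df$freq > 1, ]

# Stem: collapse inflected forms so they share counts
wordStem(c("struggles", "struggling", "struggled"))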

N-gram Modeling With Markov Chains
https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/

For each of the sentence fragments below use your natural language processing algorithm to predict the next word in the sentence.

library(tm) # for removeWords() and stopwords()

# constants
co_text_attr_en            <- "D:/R/capstone/data/text_attr_en.rds"
co_tidy_twitter_en         <- "D:/R/capstone/data/tidy_twitter_en.rds"
co_tidy_nostop_twitter_en  <- "D:/R/capstone/data/tidy_nostop_twitter_en.rds"
co_1gram_twitter_en        <- "D:/R/capstone/data/1gram_twitter_en.rds"
co_2gram_twitter_en        <- "D:/R/capstone/data/2gram_twitter_en.rds"
co_3gram_twitter_en        <- "D:/R/capstone/data/3gram_twitter_en.rds"
co_1gram_nostop_twitter_en <- "D:/R/capstone/data/1gram_nostop_twitter_en.rds"
co_2gram_nostop_twitter_en <- "D:/R/capstone/data/2gram_nostop_twitter_en.rds"
co_3gram_nostop_twitter_en <- "D:/R/capstone/data/3gram_nostop_twitter_en.rds"

df        <- readRDS(co_3gram_twitter_en)         # 3-grams, stop words kept
df_nostop <- readRDS(co_3gram_nostop_twitter_en)  # 3-grams, stop words removed
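
All the lookups below follow the same Markov-chain idea: condition on the last two words of the fragment and rank the candidate third words by frequency. A hypothetical helper (the name predict_next and its interface are mine, not the report's) wrapping the repeated grep:

# Hypothetical helper: top n trigrams that start with a two-word prefix
predict_next <- function(prefix, ngram_df, n = 10) {
  head(ngram_df[grep(paste0("^", prefix, " "), ngram_df[, 1]), ], n)
}
# e.g. predict_next("case of", df) reproduces the first lookup below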

1. The guy in front of me just bought a pound of bacon, a bouquet, and a case of

Options: pretzels, soda, beer, cheese

Answer: beer

head(df[grep("^case of", df[,1]),], 10)
##                    ngrams freq         prop
## 7530         case of the   133 6.510105e-06
## 77955          case of a    20 9.789632e-07
## 106553      case of beer    15 7.342224e-07
## 228403        case of an     8 3.915853e-07
## 268752 case of emergency     7 3.426371e-07
## 282067   case of divorce     7 3.426371e-07
## 369578  case of benjamin     5 2.447408e-07
## 458838      case of wine     4 1.957926e-07
## 458878        case of my     4 1.957926e-07
## 466934    case of attack     4 1.957926e-07

2. You’re the reason why I smile everyday. Can you follow me please? It would mean the

Options: world, best, most, universe

Answer: world

head(df[grep("^mean the ", df[,1]),], 10)
##                      ngrams freq         prop
## 2730        mean the world   276 1.350969e-05
## 128969       mean the same    13 6.363261e-07
## 139616      mean the whole    12 5.873779e-07
## 153377       mean the most    11 5.384297e-07
## 162734        mean the one    11 5.384297e-07
## 364652        mean the end     5 2.447408e-07
## 433088 mean the difference     4 1.957926e-07
## 586131      mean the other     3 1.468445e-07
## 682154     mean the entire     3 1.468445e-07
## 877981       mean the rest     2 9.789632e-08

3. Hey sunshine, can you follow me and make me the

Options: bluest, smelliest, saddest, happiest

Answer: happiest

Note: The most frequent “me the” 3-gram is “me the f*ck”. Interesting. Probably need to add the f-word to the stop word list? Lol. (A sketch of extending the list follows the output below.)

head(df[grep("^me the", df[,1]),], 10)
##                 ngrams freq         prop
## 12445     me the fuck    90 4.405334e-06
## 13997     me the link    82 4.013749e-06
## 15839     me the most    74 3.622164e-06
## 24344     me the same    53 2.594252e-06
## 25005      me the way    52 2.545304e-06
## 34738     me the best    40 1.957926e-06
## 39638 me the happiest    35 1.713186e-06
## 42812    me the wrong    33 1.615289e-06
## 47137    me the other    31 1.517393e-06
## 52263  me the details    28 1.370548e-06
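
A minimal sketch of that idea, using tm's removeWords()/stopwords() with the f-word appended as a hypothetical one-entry profanity list:

extra_stops <- c(stopwords("en"), "fuck") # extended stop word list (assumption)
removeWords("you make me the fuck happiest", extra_stops)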

4. Very early observations on the Bills game: Offence still struggling but the

Options: crowd, defense, referees, players (wrong)

Answer: defense

Note: Nothing matching the options was found in the Twitter N-gram dictionary alone; models built from all the English files would probably help. (A reusable stop-word backoff sketch follows this question’s output.)

head(df[grep("^struggling but", df[,1]),], 10) # It didn't match anything in the options.
##                            ngrams freq         prop
## 2215502 struggling but westbrook     1 4.894816e-08
## 7218901  struggling but remember     1 4.894816e-08
str <- "Very early observations on the Bills game: Offence still struggling but the"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "Very early observations Bills game: Offence still struggling "
head(df_nostop[grep("^still struggling", df_nostop[,1]),], 10)
##                               ngrams freq         prop
## 611809        still struggling wake     1 1.186304e-07
## 1028875   still struggling -effects     1 1.186304e-07
## 1147811       still struggling just     1 1.186304e-07
## 2367348      still struggling title     1 1.186304e-07
## 2536987    still struggling furnace     1 1.186304e-07
## 3108443         still struggling eh     1 1.186304e-07
## 3828318     still struggling adjust     1 1.186304e-07
## 3888773    still struggling impress     1 1.186304e-07
## 3957642 still struggling electronic     1 1.186304e-07
## 4261034      still struggling catch     1 1.186304e-07
rbind(df_nostop[grep("^still struggling crowd", df_nostop[,1]),], 
      df_nostop[grep("^still struggling defense", df_nostop[,1]),],
      df_nostop[grep("^still struggling referees", df_nostop[,1]),],
      df_nostop[grep("^still struggling players", df_nostop[,1]),])
## [1] ngrams freq   prop  
## <0 rows> (or 0-length row.names)
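
The fallback above (strip stop words from the fragment, then query the no-stop-word table with the remaining tail) can be wrapped in one function. This is a sketch under my own naming (backoff_predict), not code from the report; it assumes library(tm) is loaded as above:

# Hypothetical backoff: remove stop words, then look up the last two
# remaining words in the no-stop-word trigram table
backoff_predict <- function(fragment, ngram_df, n = 10) {
  s <- removeWords(tolower(fragment), stopwords("en"))
  words <- strsplit(trimws(gsub("\\s+", " ", s)), " ")[[1]]
  prefix <- paste(tail(words, 2), collapse = " ")
  head(ngram_df[grep(paste0("^", prefix, " "), ngram_df[, 1]), ], n)
}
# e.g. backoff_predict("Offence still struggling but the", df_nostop)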

5. Go on a romantic date at the

Options: mall, grocery (wrong), movies, beach

Answer: beach

Note: Should sentiment be taken into account? (A sentiment-scoring sketch follows this question’s output.)

head(df[grep("^date at", df[,1]),], 10)
##                        ngrams freq         prop
## 179320           date at the    10 4.894816e-07
## 2006055        date at least     1 4.894816e-08
## 2740455          date at www     1 4.894816e-08
## 2913398          date at toc     1 4.894816e-08
## 3349665           date at rj     1 4.894816e-08
## 3998177     date at johnny's     1 4.894816e-08
## 4750393 date at mydateishere     1 4.894816e-08
## 4880412         date at ikea     1 4.894816e-08
## 4924022           date at so     1 4.894816e-08
## 4979357         date at work     1 4.894816e-08
head(df_nostop[grep("^romantic date", df_nostop[,1]),], 10)
##                             ngrams freq         prop
## 2316983 romantic date valentine's     1 1.186304e-07
## 4382783      romantic date hunter     1 1.186304e-07
## 6493635    romantic dates hookers     1 1.186304e-07
head(df[grep("^date at", df[,1]),], 10) # The only match was "date atl interview"
##                        ngrams freq         prop
## 179320           date at the    10 4.894816e-07
## 2006055        date at least     1 4.894816e-08
## 2740455          date at www     1 4.894816e-08
## 2913398          date at toc     1 4.894816e-08
## 3349665           date at rj     1 4.894816e-08
## 3998177     date at johnny's     1 4.894816e-08
## 4750393 date at mydateishere     1 4.894816e-08
## 4880412         date at ikea     1 4.894816e-08
## 4924022           date at so     1 4.894816e-08
## 4979357         date at work     1 4.894816e-08
rbind(df[grep("date at mall", df[,1]),], 
      df[grep("date at grocery", df[,1]),],
      df[grep("date at movies", df[,1]),], 
      df[grep("date at beach", df[,1]),])
## [1] ngrams freq   prop  
## <0 rows> (or 0-length row.names)
rbind(df_nostop[grep("date mall", df_nostop[,1]),], 
      df_nostop[grep("date grocery", df_nostop[,1]),],
      df_nostop[grep("date movies", df_nostop[,1]),], # "third-wheel date movies"
      df_nostop[grep("date beach", df_nostop[,1]),]) # "asked date beach"
##                           ngrams freq         prop
## 185190       date grocery store     2 2.372608e-07
## 2998739      going date grocery     1 1.186304e-07
## 6601542     picked date grocery     1 1.186304e-07
## 1805559 third-wheel date movies     1 1.186304e-07
## 3494901        asked date beach     1 1.186304e-07
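
A minimal sketch of that idea, assuming the syuzhet package and its default lexicon (whether that lexicon actually separates these four nouns is untested here): re-rank the options by sentiment so that a positive prefix like "romantic date" prefers positively scored words.

library(syuzhet)

# Hypothetical re-ranking of the quiz options by sentiment score;
# all-zero ties would need a frequency-based fallback
opts <- c("mall", "grocery", "movies", "beach")
scores <- get_sentiment(opts) # default "syuzhet" lexicon
opts[which.max(scores)]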

6. Well I’m pretty sure my granny has some old bagpipes in her garage I’ll dust them off and be on my

Options: way, horse, motorcycle, phone

Answer: way

head(df[grep("^on my ", df[,1]),], 10)
##               ngrams freq         prop
## 97        on my way  1947 9.530206e-05
## 797      on my mind   627 3.069050e-05
## 828     on my phone   610 2.985838e-05
## 1931     on my face   348 1.703396e-05
## 2708      on my own   277 1.355864e-05
## 3362       on my tl   239 1.169861e-05
## 3803     on my list   218 1.067070e-05
## 4642     on my ipod   190 9.300150e-06
## 4860   on my iphone   185 9.055409e-06
## 5257 on my birthday   175 8.565928e-06

7. Ohhhhh #PointBreak is on tomorrow. Love that film and haven’t seen it in quite some

Options: thing, weeks, time, years

Answer: time

head(df[grep("^quite some ", df[,1]),], 10)
##                      ngrams freq         prop
## 39304      quite some time    36 1.762134e-06
## 1025288 quite some company     2 9.789632e-08
## 3728560    quite some news     1 4.894816e-08
## 5458287 quite some freedom     1 4.894816e-08
## 7821006    quite some hair     1 4.894816e-08

8. After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little

Options: fingers, eyes, ears, toes

Answer: fingers

head(df[grep("^his little ", df[,1]),], 10)
##                       ngrams freq         prop
## 424701   his little brother     5 2.447408e-07
## 655334    his little league     3 1.468445e-07
## 775586    his little sister     3 1.468445e-07
## 883271       his little ass     2 9.789632e-08
## 1074846     his little girl     2 9.789632e-08
## 1124613    his little heart     2 9.789632e-08
## 1944370 his little nameless     1 4.894816e-08
## 2106328      his little nay     1 4.894816e-08
## 2118569  his little sweetie     1 4.894816e-08
## 2135059    his little girly     1 4.894816e-08
str <- "After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "After ice bucket challenge Louis will push long wet hair eyes little"
head(df_nostop[grep("^eyes little ", df_nostop[,1]),], 10)
##                    ngrams freq         prop
## 2035404  eyes little bit     1 1.186304e-07
## 5308905  eyes little red     1 1.186304e-07
## 5752525   eyes little xd     1 1.186304e-07
## 7066989 eyes little sore     1 1.186304e-07
rbind(df[grep("his little fingers", df[,1]),], 
      df[grep("his little eyes", df[,1]),],
      df[grep("his little ears", df[,1]),], 
      df[grep("his little toes", df[,1]),])
## [1] ngrams freq   prop  
## <0 rows> (or 0-length row.names)
rbind(df[grep("his little finger", df[,1]),], 
      df[grep("his little eye", df[,1]),],
      df[grep("his little ear", df[,1]),], 
      df[grep("his little toe", df[,1]),])
##                      ngrams freq         prop
## 4547214 this little finger     1 4.894816e-08
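
The manual singular-form fallback above is exactly what stemming would automate. A minimal sketch with SnowballC (an assumption, since these dictionaries are not stemmed):

library(SnowballC)

# Stemming collapses singular/plural, so one lookup covers both forms
wordStem(c("fingers", "eyes", "ears", "toes")) # "finger" "eye" "ear" "toe"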

9. Be grateful for the good times and keep the faith during the

Options: worse, bad, hard, sad

Answer: bad

head(df[grep("^during the ", df[,1]),], 10)
##                     ngrams freq         prop
## 3824       during the day   218 1.067070e-05
## 11034     during the week    99 4.845868e-06
## 15267     during the game    77 3.769008e-06
## 16240   during the summer    73 3.573216e-06
## 25605 during the holidays    51 2.496356e-06
## 42253     during the show    34 1.664237e-06
## 44400    during the first    32 1.566341e-06
## 48183     during the last    30 1.468445e-06
## 56195   during the season    26 1.272652e-06
## 68964    during the month    22 1.076859e-06
head(df_nostop[grep("^faith during ", df_nostop[,1]),], 10)
## [1] ngrams freq   prop  
## <0 rows> (or 0-length row.names)
rbind(df[grep("during the worse", df[,1]),], 
      df[grep("during the bad", df[,1]),],
      df[grep("during the hard", df[,1]),], 
      df[grep("during the sad", df[,1]),])
##                      ngrams freq         prop
## 922362      during the bad     2 9.789632e-08
## 7780139 during the badgers     1 4.894816e-08

10. If this isn’t the cutest thing you’ve ever seen, then you must be

Options: asleep, insensitive, callous, insane

Answer: insane

head(df[grep("^must be ", df[,1]),], 10)
##                   ngrams freq         prop
## 1818          must be a   363 1.776818e-05
## 4354        must be the   198 9.691735e-06
## 12882        must be in    88 4.307438e-06
## 14825        must be so    79 3.866904e-06
## 15652      must be nice    75 3.671112e-06
## 17585 must be following    69 3.377423e-06
## 24582        must be an    52 2.545304e-06
## 26868        must be on    49 2.398460e-06
## 28011     must be doing    47 2.300563e-06
## 29458 must be something    45 2.202667e-06
str <- "If this isn't the cutest thing you've ever seen, then you must be"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "If cutest thing ever seen, must "
head(df_nostop[grep("^seen must ", df_nostop[,1]),], 10) # seen must keep
##                  ngrams freq         prop
## 2045173 seen must keep     1 1.186304e-07
rbind(df[grep("must be asleep", df[,1]),], 
      df[grep("must be insensitive", df[,1]),],
      df[grep("must be callous", df[,1]),], 
      df[grep("must be insane", df[,1]),])
##                  ngrams freq         prop
## 1650565 must be asleep     2 9.789632e-08
## 375819  must be insane     5 2.447408e-07

Reference reports:
http://rstudio-pubs-static.s3.amazonaws.com/387645_d494b67fb45e4d3792fb679eb274291c.html
https://rpubs.com/redneckz/smart-keyboard-basic-modeling