Data Science Capstone: https://www.coursera.org/learn/data-science-project/
Quiz: https://www.coursera.org/learn/data-science-project/exam/QbBvW/quiz-2-natural-language-processing-i
Note: For background on how the text input prediction models were developed, see this report: http://rpubs.com/Nov05/459931. It explores only the Twitter text. For the actual models, I might use all the English text files (Twitter, blogs, and news), or at least a sample of text from each file. In the coming weeks, for the actual product, I will also reduce the N-gram dictionary size to improve performance, and may use lemmatization and/or stemming to improve it further.
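One simple way to reduce the N-gram dictionary size is to drop rare entries. The sketch below is only an illustration: `prune_ngrams` and `min_freq` are hypothetical names, and the toy table uses made-up counts rather than the real Twitter data.

```r
# Toy n-gram table with made-up counts (illustrative only).
ngram_tbl <- data.frame(
  ngrams = c("a b c", "a b d", "a b e", "x y z"),
  freq   = c(40, 12, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only n-grams seen at least `min_freq` times.
prune_ngrams <- function(tbl, min_freq = 2) {
  kept <- tbl[tbl$freq >= min_freq, , drop = FALSE]
  rownames(kept) <- NULL
  kept
}

pruned <- prune_ngrams(ngram_tbl, min_freq = 2)
cat("rows before:", nrow(ngram_tbl), "after:", nrow(pruned), "\n")
```

Singletons usually make up a large share of an N-gram table, so even a threshold of 2 can shrink it a lot, at the cost of losing rare but valid phrases.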
N-gram Modeling With Markov Chains
https://sookocheff.com/post/nlp/ngram-modeling-with-markov-chains/
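Under the Markov assumption, the next word depends only on the previous two words, so a trigram count table can be turned into conditional probabilities. The following is a minimal sketch with made-up counts; `next_word_probs` is a hypothetical helper, and the denominator is the sum of trigram counts sharing the prefix, which only approximates the bigram count.

```r
# Toy trigram counts (made up for illustration).
trigrams <- data.frame(
  ngrams = c("a b c", "a b d", "a b e", "x y z"),
  freq   = c(6, 3, 1, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: estimate P(w3 | w1 w2) for every trigram
# that starts with the two-word prefix.
next_word_probs <- function(tbl, prefix) {
  key  <- paste0(prefix, " ")                      # trailing space = whole-word prefix
  hits <- tbl[startsWith(tbl$ngrams, key), , drop = FALSE]
  hits$word      <- substring(hits$ngrams, nchar(key) + 1)
  hits$cond_prob <- hits$freq / sum(hits$freq)     # relative frequency within the prefix
  hits[order(-hits$cond_prob), c("word", "freq", "cond_prob")]
}

probs <- next_word_probs(trigrams, "a b")
print(probs)
```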
library(tm)
# constants
co_text_attr_en = "D:/R/capstone/data/text_attr_en.rds"
co_tidy_twitter_en = "D:/R/capstone/data/tidy_twitter_en.rds"
co_tidy_nostop_twitter_en = "D:/R/capstone/data/tidy_nostop_twitter_en.rds"
co_1gram_twitter_en = "D:/R/capstone/data/1gram_twitter_en.rds"
co_2gram_twitter_en = "D:/R/capstone/data/2gram_twitter_en.rds"
co_3gram_twitter_en = "D:/R/capstone/data/3gram_twitter_en.rds"
co_1gram_nostop_twitter_en = "D:/R/capstone/data/1gram_nostop_twitter_en.rds"
co_2gram_nostop_twitter_en = "D:/R/capstone/data/2gram_nostop_twitter_en.rds"
co_3gram_nostop_twitter_en = "D:/R/capstone/data/3gram_nostop_twitter_en.rds"
df <- readRDS(co_3gram_twitter_en)
df_nostop <- readRDS(co_3gram_nostop_twitter_en)
1. The guy in front of me just bought a pound of bacon, a bouquet, and a case of
Options: pretzels, soda, beer, cheese
Answer: beer
head(df[grep("^case of", df[,1]),], 10)
## ngrams freq prop
## 7530 case of the 133 6.510105e-06
## 77955 case of a 20 9.789632e-07
## 106553 case of beer 15 7.342224e-07
## 228403 case of an 8 3.915853e-07
## 268752 case of emergency 7 3.426371e-07
## 282067 case of divorce 7 3.426371e-07
## 369578 case of benjamin 5 2.447408e-07
## 458838 case of wine 4 1.957926e-07
## 458878 case of my 4 1.957926e-07
## 466934 case of attack 4 1.957926e-07
2. You’re the reason why I smile everyday. Can you follow me please? It would mean the
Options: world, best, most, universe
Answer: world
head(df[grep("^mean the ", df[,1]),], 10)
## ngrams freq prop
## 2730 mean the world 276 1.350969e-05
## 128969 mean the same 13 6.363261e-07
## 139616 mean the whole 12 5.873779e-07
## 153377 mean the most 11 5.384297e-07
## 162734 mean the one 11 5.384297e-07
## 364652 mean the end 5 2.447408e-07
## 433088 mean the difference 4 1.957926e-07
## 586131 mean the other 3 1.468445e-07
## 682154 mean the entire 3 1.468445e-07
## 877981 mean the rest 2 9.789632e-08
3. Hey sunshine, can you follow me and make me the
Options: bluest, smelliest, saddest, happiest
Answer: happiest
Note: The most frequent 3-gram is “me the f*ck”. Interesting. Probably need to add the f-word to the stop word list? Lol.
head(df[grep("^me the", df[,1]),], 10)
## ngrams freq prop
## 12445 me the fuck 90 4.405334e-06
## 13997 me the link 82 4.013749e-06
## 15839 me the most 74 3.622164e-06
## 24344 me the same 53 2.594252e-06
## 25005 me the way 52 2.545304e-06
## 34738 me the best 40 1.957926e-06
## 39638 me the happiest 35 1.713186e-06
## 42812 me the wrong 33 1.615289e-06
## 47137 me the other 31 1.517393e-06
## 52263 me the details 28 1.370548e-06
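Filtering unwanted words out of the candidates could look roughly like this. It is a sketch only: `drop_blocked` and the one-word `blocklist` are hypothetical, and the counts are the top four rows of the output above.

```r
# Top rows copied from the "^me the" lookup above.
top <- data.frame(
  ngrams = c("me the fuck", "me the link", "me the most", "me the same"),
  freq   = c(90, 82, 74, 53),
  stringsAsFactors = FALSE
)

# Hypothetical, illustrative blocklist (a real one would be much longer).
blocklist <- c("fuck")

# Hypothetical helper: drop every n-gram that contains a blocked word.
drop_blocked <- function(tbl, blocklist) {
  words   <- strsplit(tbl$ngrams, " ", fixed = TRUE)
  blocked <- vapply(words, function(w) any(w %in% blocklist), logical(1))
  tbl[!blocked, , drop = FALSE]
}

clean <- drop_blocked(top, blocklist)
print(clean)
```

Matching on whole words rather than substrings avoids accidentally dropping harmless n-grams that merely contain a blocked string.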
4. Very early observations on the Bills game: Offence still struggling but the
Options: crowd, defense, referees, players (wrong)
Answer: defense
Note: None of the options matched when using the Twitter N-gram dictionary. Probably need to use models generated from all the English files.
head(df[grep("^struggling but", df[,1]),], 10) # It didn't match anything in the options.
## ngrams freq prop
## 2215502 struggling but westbrook 1 4.894816e-08
## 7218901 struggling but remember 1 4.894816e-08
str <- "Very early observations on the Bills game: Offence still struggling but the"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "Very early observations Bills game: Offence still struggling "
head(df_nostop[grep("^still struggling", df_nostop[,1]),], 10)
## ngrams freq prop
## 611809 still struggling wake 1 1.186304e-07
## 1028875 still struggling -effects 1 1.186304e-07
## 1147811 still struggling just 1 1.186304e-07
## 2367348 still struggling title 1 1.186304e-07
## 2536987 still struggling furnace 1 1.186304e-07
## 3108443 still struggling eh 1 1.186304e-07
## 3828318 still struggling adjust 1 1.186304e-07
## 3888773 still struggling impress 1 1.186304e-07
## 3957642 still struggling electronic 1 1.186304e-07
## 4261034 still struggling catch 1 1.186304e-07
rbind(df_nostop[grep("^still struggling crowd", df_nostop[,1]),],
df_nostop[grep("^still struggling defense", df_nostop[,1]),],
df_nostop[grep("^still struggling referees", df_nostop[,1]),],
df_nostop[grep("^still struggling players", df_nostop[,1]),])
## [1] ngrams freq prop
## <0 rows> (or 0-length row.names)
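When a trigram lookup returns nothing, as it did here, one common fallback is to back off to a shorter context. The sketch below assumes a bigram table is available alongside the trigram table; `lookup` and `predict_backoff` are hypothetical helpers, and all counts are made up.

```r
# Toy tables with made-up counts (illustrative only).
trigrams <- data.frame(
  ngrams = c("a b c", "a b d"),
  freq   = c(5, 3),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  ngrams = c("b c", "b d", "z q", "z r"),
  freq   = c(9, 6, 2, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows whose n-gram starts with `prefix` as whole words,
# most frequent first.
lookup <- function(tbl, prefix) {
  hits <- tbl[startsWith(tbl$ngrams, paste0(prefix, " ")), , drop = FALSE]
  hits[order(-hits$freq), , drop = FALSE]
}

# Hypothetical helper: try the last two words against the trigram table,
# then fall back to the last word against the bigram table.
predict_backoff <- function(words, trigrams, bigrams) {
  n <- length(words)
  if (n >= 2) {
    hits <- lookup(trigrams, paste(words[n - 1], words[n]))
    if (nrow(hits) > 0) return(hits)
  }
  if (n >= 1) return(lookup(bigrams, words[n]))
  bigrams[0, , drop = FALSE]
}

hit_tri <- predict_backoff(c("a", "b"), trigrams, bigrams)  # trigram match
hit_bi  <- predict_backoff(c("y", "z"), trigrams, bigrams)  # falls back to bigrams
```

A shorter context matches more often but carries less information, so the bigram result is best treated as a weaker guess than a trigram match.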
5. Go on a romantic date at the
Options: mall, grocery (wrong), movies, beach
Answer: beach
Note: Should the model consider sentiment?
head(df[grep("^date at", df[,1]),], 10)
## ngrams freq prop
## 179320 date at the 10 4.894816e-07
## 2006055 date at least 1 4.894816e-08
## 2740455 date at www 1 4.894816e-08
## 2913398 date at toc 1 4.894816e-08
## 3349665 date at rj 1 4.894816e-08
## 3998177 date at johnny's 1 4.894816e-08
## 4750393 date at mydateishere 1 4.894816e-08
## 4880412 date at ikea 1 4.894816e-08
## 4924022 date at so 1 4.894816e-08
## 4979357 date at work 1 4.894816e-08
head(df_nostop[grep("^romantic date", df_nostop[,1]),], 10)
## ngrams freq prop
## 2316983 romantic date valentine's 1 1.186304e-07
## 4382783 romantic date hunter 1 1.186304e-07
## 6493635 romantic dates hookers 1 1.186304e-07
rbind(df[grep("date at mall", df[,1]),],
df[grep("date at grocery", df[,1]),],
df[grep("date at movies", df[,1]),],
df[grep("date at beach", df[,1]),])
## [1] ngrams freq prop
## <0 rows> (or 0-length row.names)
rbind(df_nostop[grep("date mall", df_nostop[,1]),],
df_nostop[grep("date grocery", df_nostop[,1]),],
df_nostop[grep("date movies", df_nostop[,1]),], # "third-wheel date movies"
df_nostop[grep("date beach", df_nostop[,1]),]) # "asked date beach"
## ngrams freq prop
## 185190 date grocery store 2 2.372608e-07
## 2998739 going date grocery 1 1.186304e-07
## 6601542 picked date grocery 1 1.186304e-07
## 1805559 third-wheel date movies 1 1.186304e-07
## 3494901 asked date beach 1 1.186304e-07
6. Well I’m pretty sure my granny has some old bagpipes in her garage I’ll dust them off and be on my
Options: way, horse, motorcycle, phone
Answer: way
head(df[grep("^on my ", df[,1]),], 10)
## ngrams freq prop
## 97 on my way 1947 9.530206e-05
## 797 on my mind 627 3.069050e-05
## 828 on my phone 610 2.985838e-05
## 1931 on my face 348 1.703396e-05
## 2708 on my own 277 1.355864e-05
## 3362 on my tl 239 1.169861e-05
## 3803 on my list 218 1.067070e-05
## 4642 on my ipod 190 9.300150e-06
## 4860 on my iphone 185 9.055409e-06
## 5257 on my birthday 175 8.565928e-06
7. Ohhhhh #PointBreak is on tomorrow. Love that film and haven’t seen it in quite some
Options: thing, weeks, time, years
Answer: time
head(df[grep("^quite some ", df[,1]),], 10)
## ngrams freq prop
## 39304 quite some time 36 1.762134e-06
## 1025288 quite some company 2 9.789632e-08
## 3728560 quite some news 1 4.894816e-08
## 5458287 quite some freedom 1 4.894816e-08
## 7821006 quite some hair 1 4.894816e-08
8. After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little
Options: fingers, eyes, ears, toes
Answer: fingers
head(df[grep("^his little ", df[,1]),], 10)
## ngrams freq prop
## 424701 his little brother 5 2.447408e-07
## 655334 his little league 3 1.468445e-07
## 775586 his little sister 3 1.468445e-07
## 883271 his little ass 2 9.789632e-08
## 1074846 his little girl 2 9.789632e-08
## 1124613 his little heart 2 9.789632e-08
## 1944370 his little nameless 1 4.894816e-08
## 2106328 his little nay 1 4.894816e-08
## 2118569 his little sweetie 1 4.894816e-08
## 2135059 his little girly 1 4.894816e-08
str <- "After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "After ice bucket challenge Louis will push long wet hair eyes little"
head(df_nostop[grep("^eyes little ", df_nostop[,1]),], 10)
## ngrams freq prop
## 2035404 eyes little bit 1 1.186304e-07
## 5308905 eyes little red 1 1.186304e-07
## 5752525 eyes little xd 1 1.186304e-07
## 7066989 eyes little sore 1 1.186304e-07
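The prefix "eyes little" above was picked by hand from the cleaned sentence. A small helper could extract it instead. This is a base-R sketch under simple assumptions (lowercase everything, treat anything that is not a letter or apostrophe as a separator); `last_n_words` is a hypothetical name, and the stop words are assumed to have been removed already.

```r
# Hypothetical helper: return the last `n` words of a sentence, lowercased,
# with punctuation treated as whitespace.
last_n_words <- function(text, n = 2) {
  cleaned <- tolower(gsub("[^[:alpha:]' ]", " ", text))
  words   <- strsplit(trimws(cleaned), "\\s+")[[1]]
  words   <- words[nzchar(words)]
  paste(tail(words, n), collapse = " ")
}

# Input is the stop-word-free sentence printed above.
prefix <- last_n_words("After ice bucket challenge Louis will push long wet hair eyes little")
print(prefix)

# The resulting pattern can then be used as in the report, e.g.
# grep(paste0("^", prefix, " "), df_nostop[, 1])
```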
rbind(df[grep("his little fingers", df[,1]),],
df[grep("his little eyes", df[,1]),],
df[grep("his little ears", df[,1]),],
df[grep("his little toes", df[,1]),])
## [1] ngrams freq prop
## <0 rows> (or 0-length row.names)
rbind(df[grep("his little finger", df[,1]),],
df[grep("his little eye", df[,1]),],
df[grep("his little ear", df[,1]),],
df[grep("his little toe", df[,1]),])
## ngrams freq prop
## 4547214 this little finger 1 4.894816e-08
9. Be grateful for the good times and keep the faith during the
Options: worse, bad, hard, sad
Answer: bad
head(df[grep("^during the ", df[,1]),], 10)
## ngrams freq prop
## 3824 during the day 218 1.067070e-05
## 11034 during the week 99 4.845868e-06
## 15267 during the game 77 3.769008e-06
## 16240 during the summer 73 3.573216e-06
## 25605 during the holidays 51 2.496356e-06
## 42253 during the show 34 1.664237e-06
## 44400 during the first 32 1.566341e-06
## 48183 during the last 30 1.468445e-06
## 56195 during the season 26 1.272652e-06
## 68964 during the month 22 1.076859e-06
head(df_nostop[grep("^faith during ", df_nostop[,1]),], 10)
## [1] ngrams freq prop
## <0 rows> (or 0-length row.names)
rbind(df[grep("during the worse", df[,1]),],
df[grep("during the bad", df[,1]),],
df[grep("during the hard", df[,1]),],
df[grep("during the sad", df[,1]),])
## ngrams freq prop
## 922362 during the bad 2 9.789632e-08
## 7780139 during the badgers 1 4.894816e-08
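Two of the lookups above returned unintended rows ("this little finger" and "during the badgers") because the patterns were not anchored, so `grep` matched them as substrings. A minimal sketch of the difference, using only n-grams that appear in the outputs above:

```r
ngrams <- c("this little finger", "during the bad", "during the badgers")

# Unanchored patterns match anywhere inside the string.
loose_finger <- grepl("his little finger", ngrams)
loose_bad    <- grepl("during the bad", ngrams)

# Anchoring with ^ and $ (or comparing for equality) matches the whole n-gram only.
strict_finger <- grepl("^his little finger$", ngrams)
strict_bad    <- ngrams == "during the bad"

print(data.frame(ngrams, loose_finger, strict_finger, loose_bad, strict_bad))
```

For a stem-like search such as "finger" versus "fingers", an explicit pattern like `"^his little fingers?$"` keeps the intent visible without matching unrelated words.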
10. If this isn’t the cutest thing you’ve ever seen, then you must be
Options: asleep, insensitive, callous, insane
Answer: insane
head(df[grep("^must be ", df[,1]),], 10)
## ngrams freq prop
## 1818 must be a 363 1.776818e-05
## 4354 must be the 198 9.691735e-06
## 12882 must be in 88 4.307438e-06
## 14825 must be so 79 3.866904e-06
## 15652 must be nice 75 3.671112e-06
## 17585 must be following 69 3.377423e-06
## 24582 must be an 52 2.545304e-06
## 26868 must be on 49 2.398460e-06
## 28011 must be doing 47 2.300563e-06
## 29458 must be something 45 2.202667e-06
str <- "If this isn't the cutest thing you've ever seen, then you must be"
str <- removeWords(str, stopwords("en")); str <- gsub("\\s+", " ", str); str
## [1] "If cutest thing ever seen, must "
head(df_nostop[grep("^seen must ", df_nostop[,1]),], 10) # seen must keep
## ngrams freq prop
## 2045173 seen must keep 1 1.186304e-07
rbind(df[grep("must be asleep", df[,1]),],
df[grep("must be insensitive", df[,1]),],
df[grep("must be callous", df[,1]),],
df[grep("must be insane", df[,1]),])
## ngrams freq prop
## 1650565 must be asleep 2 9.789632e-08
## 375819 must be insane 5 2.447408e-07
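The repeated `rbind(grep(...), ...)` pattern for scoring the quiz options could be wrapped in one helper. This is a sketch only: `rank_options` is a hypothetical name, and the small table holds just the two rows returned above.

```r
# The two matching rows from the output above.
hits <- data.frame(
  ngrams = c("must be asleep", "must be insane"),
  freq   = c(2, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up "<prefix> <option>" for each option by exact
# match and rank the options by frequency (0 when the n-gram is absent).
rank_options <- function(tbl, prefix, options) {
  candidates <- paste(prefix, options)
  freq <- tbl$freq[match(candidates, tbl$ngrams)]
  freq[is.na(freq)] <- 0
  out <- data.frame(option = options, freq = freq, stringsAsFactors = FALSE)
  out <- out[order(-out$freq), ]
  rownames(out) <- NULL
  out
}

ranked <- rank_options(hits, "must be", c("asleep", "insensitive", "callous", "insane"))
print(ranked)
```

Exact matching also sidesteps the substring problem of unanchored `grep` patterns, and options with no evidence stay visible with a count of 0 instead of silently disappearing.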
Reference reports:
http://rstudio-pubs-static.s3.amazonaws.com/387645_d494b67fb45e4d3792fb679eb274291c.html
https://rpubs.com/redneckz/smart-keyboard-basic-modeling