We analyze here a dataset provided by Swiftkey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains sub-datasets in four languages (Finnish, Russian, German, and English); we analyze the English data. Here are the steps in this analysis:
Task 2: Exploratory Data Analysis of the Swiftkey English Dataset as an N-Gram Model
Explore 1-grams, 2-grams, and 3-grams in a 70 percent training sample of the dataset to:
Use wordnet (https://wordnet.princeton.edu/) to:
Task 3: Draft “Next Word” Prediction Model
Sys.setenv(JAVA_HOME="")
Sys.setenv(WNHOME="C:\\Program Files (x86)\\WordNet\\2.1\\dict")
library(textreadr)
library(tm)
library(caret)
library(tidyverse)
library(rJava)
library(RWeka)
library(knitr)
library(quanteda)
library(tidytext)
library(data.table)
library(wordnet)
setDict("C:\\Program Files (x86)\\WordNet\\2.1\\dict")
We load the data and, outside this markdown, segment it into small chunks that we analyze individually. This ‘divide and conquer’ method is a quick way to analyze the data given limited computing resources (a laptop).
en_blogs <-
readLines("./../course-data/en_US/en_US.blogs.txt",skipNul = TRUE,warn=FALSE)
en_news <-
readLines("./../course-data/en_US/en_US.news.txt",skipNul = TRUE,warn=FALSE)
en_twitter <-
readLines("./../course-data/en_US/en_US.twitter.txt",skipNul = TRUE,warn=FALSE)
str(en_blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235." ...
str(en_news)
## chr [1:77259] "He wasn't home alone, apparently." ...
str(en_twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
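As noted above, we segment each source into small chunks outside of this report. Here is a minimal sketch of that segmentation for the blogs file; the chunk size of 10,000 lines and the output folder are illustrative assumptions, and the folder is assumed to already exist.
chunk_size <- 10000
n_chunks <- ceiling(length(en_blogs) / chunk_size)
for (i in seq_len(n_chunks)) {
  #first and last line of this chunk
  first <- (i - 1) * chunk_size + 1
  last <- min(i * chunk_size, length(en_blogs))
  writeLines(en_blogs[first:last],
             sprintf("./../course-data/chunks/en_blogs_chunk_%03d.txt", i))
}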
Our first step is to remove symbols. We show an example of a sentence with symbols that we need to remove. Notice the symbols before and after “gods” in “pagan gods.”
en_blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
Here is the step in which we remove these non-ASCII symbols, by converting the text from latin1 to ASCII and dropping any characters that cannot be converted.
en_blogs <- iconv(en_blogs, "latin1", "ASCII", sub="")
en_news <- iconv(en_news, "latin1", "ASCII", sub="")
en_twitter <- iconv(en_twitter, "latin1", "ASCII", sub="")
str(en_blogs)
## chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods." ...
str(en_news)
## chr [1:77259] "He wasn't home alone, apparently." ...
str(en_twitter)
## chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...
Here is an example of symbol removal. Notice the symbols before and after “gods” in “pagan gods” are now gone.
en_blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."
We next select approximately 70 percent of each of the three data sources (blogs, news, twitter) for training, leaving 20 percent for later validation and 10 percent as a testing dataset. Once we start using the validation data, we plan to add data incrementally in small chunks, to model how a real-world application is exposed to words and phrases it has not previously encountered. Because of the limitations of running R Markdown on large files, we run on very small selections within this markdown file and on much larger samples outside of markdown. The number of sentences used to train our model is:
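These counts come from a split performed outside of this markdown. Here is a minimal sketch of how such a split can be made for the blogs file, assuming an index-based split with a fixed, illustrative seed; the final length() call reports the number of blog sentences available for training.
set.seed(1234)
n <- length(en_blogs)
idx <- sample(seq_len(n))
#70 percent training, 20 percent validation, 10 percent testing
train_idx <- idx[1:floor(0.7 * n)]
valid_idx <- idx[(floor(0.7 * n) + 1):floor(0.9 * n)]
test_idx <- idx[(floor(0.9 * n) + 1):n]
en_blogs_train <- en_blogs[train_idx]
en_blogs_valid <- en_blogs[valid_idx]
en_blogs_test <- en_blogs[test_idx]
length(en_blogs_train)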
First, we address the following:
Here is the code for 1-grams, run on a very small sample of the blogs data; running on the full corpora inside markdown is very slow. We show the code for a very small sample of the blogs text here, and run it on the full samples outside of markdown.
matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
#corpus to TDM
corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01,
removeWords,c(stopwords(source = "smart"),"english"))
myTDM_blogs_01 = TermDocumentMatrix(myCorpus_blogs_01,
control = list(minWordLength = 1))
matrix_TDM <- as.matrix(myTDM_blogs_01)
matrixsums_TDM <- rowSums(matrix_TDM)
matrixDF_TDM <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
matrixDF_TDM_save <- matrixDF_TDM_all
}
Here are the word frequencies for this very small sample, along with a table of the plotted data.
matrixDF_TDM_toplot <- aggregate(freq ~ word,
data=matrixDF_TDM_all,FUN=sum,
na.rm = TRUE)
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
head(matrixDF_TDM_toplot)
## word freq
## 311 time 20
## 228 peopl 14
## 242 veri 12
## 211 make 11
## 174 thing 10
## 179 love 9
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, freq >= 7)
matrixDF_TDM_toplot_filter %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq)) +
geom_col() +
xlab(NULL) +
coord_flip()
kable(matrixDF_TDM_toplot_filter)
| word | freq |
|---|---|
| time | 20 |
| peopl | 14 |
| veri | 12 |
| make | 11 |
| thing | 10 |
| love | 9 |
| work | 9 |
| face | 8 |
| tri | 8 |
| imag | 8 |
| made | 8 |
| good | 8 |
| today | 7 |
| onli | 7 |
| year | 7 |
| mani | 7 |
| nice | 7 |
| bad | 7 |
Here is the code for 2-grams running on a very small sample of the blogs data.
BigramTokenizer_2gram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
#corpus to TDM
corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01,
removeWords,c(stopwords(source = "smart"),"english"))
txtTdmBi_blogs_2gram <- TermDocumentMatrix(myCorpus_blogs_01,
control = list(tokenize = BigramTokenizer_2gram))
matrix_TDM <- as.matrix(txtTdmBi_blogs_2gram)
matrixsums_TDM <- rowSums(matrix_TDM)
matrixDF_TDM <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
matrixDF_TDM_save <- matrixDF_TDM_all
}
Here are the 2-gram frequencies for this very small sample, shown as a table.
matrixDF_TDM_toplot <- aggregate(freq ~ word,
data=matrixDF_TDM_all,FUN=sum,
na.rm = TRUE)
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, freq >= 2)
kable(matrixDF_TDM_toplot_filter)
| word | freq |
|---|---|
| mani time | 3 |
| start rub | 2 |
| greet card | 2 |
| becaus mani | 2 |
| john barricelli | 2 |
| south norwalk | 2 |
| watch peopl | 2 |
| stuff veri | 2 |
| veri nice | 2 |
| twinkl light | 2 |
| lovin peopl | 2 |
| nice person | 2 |
| veri bad | 2 |
| wast time | 2 |
| home mom | 2 |
| devic deton | 2 |
| winter time | 2 |
| impress plate | 2 |
Here is the code for 3-grams running on a very small sample of the blogs data.
BigramTokenizer_3gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
#corpus to TDM
corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
myCorpus_blogs_01 = tm_map(myCorpus_blogs_01,
removeWords,c(stopwords(source = "smart"),"english"))
txtTdmBi_blogs_3gram <- TermDocumentMatrix(myCorpus_blogs_01,
control = list(tokenize = BigramTokenizer_3gram))
matrix_TDM <- as.matrix(txtTdmBi_blogs_3gram)
matrixsums_TDM <- rowSums(matrix_TDM)
matrixDF_TDM <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
matrixDF_TDM_save <- matrixDF_TDM_all
}
Here are the 3-gram frequencies for this very small sample, shown as a table.
matrixDF_TDM_toplot <- aggregate(freq ~ word,
data=matrixDF_TDM_all,FUN=sum,
na.rm = TRUE)
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
head(matrixDF_TDM_toplot)
## word freq
## 1 acceler high kinet 1
## 2 achiev nozzl effect 1
## 3 addit fit precis 1
## 4 amount water acceler 1
## 5 ankl maxim turn 1
## 6 axi extrem outer 1
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, str_detect(word, "^a[rst]"))
kable(matrixDF_TDM_toplot_filter)
| word | freq |
|---|---|
| art deco interior | 1 |
| ate big geek | 1 |
| artist esperanza spald | 1 |
| area plan guidelin | 1 |
| arti butler organ | 1 |
| assist mother organ | 1 |
| assert honest outspoken | 1 |
| armchair toe lift | 1 |
| attend companyfund xmas | 1 |
| asianmiddl eastern descent | 1 |
| area warn obama | 1 |
| atom bomb end | 1 |
| attend friend kendra | 1 |
| area abl skip | 1 |
| astound support marshal | 1 |
| assam waffl leaf | 1 |
| artist kind turn | 1 |
For some of the exploratory work, and also because we need to understand the Swiftkey dataset, we want to know how word frequencies are correlated between small selections of the twitter, blogs, and news subsets of the data, both within a source (e.g., one twitter selection compared to another twitter selection) and across sources (e.g., twitter compared to blogs).
We load small samples of each of the three subsets:
gram_1_twitter_01 <- read.csv("./../skims/gram_1_twitter/gram_1_twitter_01_1.txt")
gram_1_twitter_02 <- read.csv("./../skims/gram_1_twitter/gram_1_twitter_01_2.txt")
gram_1_twitter_01 <- gram_1_twitter_01 %>% mutate(doc = "doc_01_1", rank = row_number(),
term_freq = freq/sum(freq),
logtf = log10(term_freq),
logrank = log10(rank))
gram_1_twitter_02 <- gram_1_twitter_02 %>% mutate(doc = "doc_01_2", rank = row_number(),
term_freq = freq/sum(freq),
logtf = log10(term_freq),
logrank = log10(rank))
head(gram_1_twitter_01)
## X word freq doc rank term_freq logtf logrank
## 1 47 love 1004 doc_01_1 1 0.008852054 -2.052956 0.0000000
## 2 73 day 915 doc_01_1 2 0.008067360 -2.093269 0.3010300
## 3 238 good 866 doc_01_1 3 0.007635338 -2.117172 0.4771213
## 4 259 dont 746 doc_01_1 4 0.006577323 -2.181951 0.6020600
## 5 106 time 706 doc_01_1 5 0.006224652 -2.205885 0.6989700
## 6 195 follow 693 doc_01_1 6 0.006110034 -2.213956 0.7781513
gram_1_blogs_01 <- read.csv("./../skims/gram_1_blogs/gram_1_blogs_01_1.txt")
gram_1_blogs_02 <- read.csv("./../skims/gram_1_blogs/gram_1_blogs_01_2.txt")
gram_1_blogs_01 <- gram_1_blogs_01 %>% mutate(doc = "doc_01_1", rank = row_number(),
term_freq = freq/sum(freq),
logtf = log10(term_freq), logrank = log10(rank))
gram_1_blogs_02 <- gram_1_blogs_02 %>% mutate(doc = "doc_01_2", rank = row_number(),
term_freq = freq/sum(freq),
logtf = log10(term_freq), logrank = log10(rank))
head(gram_1_blogs_01)
## X word freq doc rank term_freq logtf logrank
## 1 417 time 1241 doc_01_1 1 0.007010072 -2.154278 0.0000000
## 2 187 make 930 doc_01_1 2 0.005253317 -2.279566 0.3010300
## 3 120 love 758 doc_01_1 3 0.004281736 -2.368380 0.4771213
## 4 422 day 743 doc_01_1 4 0.004197005 -2.377061 0.6020600
## 5 130 year 688 doc_01_1 5 0.003886325 -2.410461 0.6989700
## 6 374 becaus 671 doc_01_1 6 0.003790297 -2.421327 0.7781513
gram_1_news <- read.csv("./../skims/gram_all_news/gram_1_news.txt")
gram_1_news_corr <- gram_1_news %>% mutate(doc = "doc_01_1", rank = row_number(),
term_freq = freq/sum(freq),
logtf = log10(term_freq), logrank = log10(rank))
head(gram_1_news)
## X word freq
## 1 175 year 2747
## 2 133 time 1664
## 3 96 state 1642
## 4 102 work 1271
## 5 280 make 1268
## 6 675 peopl 1248
Here are the within-source correlations: one twitter selection against another, and one blogs selection against another.
par(mfrow=c(1,2))
freq_join_gram_1_twitter <- merge(gram_1_twitter_01, gram_1_twitter_02, by = "word")
freq_join_gram_1_twitter <- filter(freq_join_gram_1_twitter, freq.x != 0)
head(freq_join_gram_1_twitter)
## word X.x freq.x doc.x rank.x term_freq.x logtf.x logrank.x X.y
## 1 aaawww 15211 1 doc_01_1 15318 8.816787e-06 -5.054690 4.185202 7721
## 2 aaron 3426 7 doc_01_1 2361 6.171751e-05 -4.209592 3.373096 3602
## 3 abandon 15204 3 doc_01_1 4841 2.645036e-05 -4.577568 3.684935 9220
## 4 abbey 2866 6 doc_01_1 2621 5.290072e-05 -4.276538 3.418467 13214
## 5 abbi 15690 1 doc_01_1 15766 8.816787e-06 -5.054690 4.197722 3161
## 6 abc 9609 5 doc_01_1 3235 4.408394e-05 -4.355720 3.509874 5133
## freq.y doc.y rank.y term_freq.y logtf.y logrank.y
## 1 1 doc_01_2 9723 8.743781e-06 -5.058301 3.987800
## 2 12 doc_01_2 1601 1.049254e-04 -3.979119 3.204391
## 3 3 doc_01_2 4590 2.623134e-05 -4.581179 3.661813
## 4 2 doc_01_2 6605 1.748756e-05 -4.757271 3.819873
## 5 1 doc_01_2 7623 8.743781e-06 -5.058301 3.882126
## 6 13 doc_01_2 1517 1.136692e-04 -3.944357 3.180986
plot(freq_join_gram_1_twitter$freq.x,freq_join_gram_1_twitter$freq.y,
type = "p", col = "black", lwd = 1,
xlab = "Twitter Selection 1",
ylab = "Twitter Selection 2")
cor.test(freq_join_gram_1_twitter$freq.x,freq_join_gram_1_twitter$freq.y)
##
## Pearson's product-moment correlation
##
## data: freq_join_gram_1_twitter$freq.x and freq_join_gram_1_twitter$freq.y
## t = 727.94, df = 8083, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9921245 0.9927799
## sample estimates:
## cor
## 0.9924593
freq_join_gram_1_blogs <- merge(gram_1_blogs_01, gram_1_blogs_02, by = "word")
freq_join_gram_1_blogs <- filter(freq_join_gram_1_blogs, freq.x != 0)
head(freq_join_gram_1_blogs)
## word X.x freq.x doc.x rank.x term_freq.x logtf.x logrank.x X.y
## 1 aam 6275 1 doc_01_1 11330 5.648728e-06 -5.248049 4.054230 13858
## 2 aaron 1137 4 doc_01_1 4877 2.259491e-05 -4.645989 3.688153 2048
## 3 aback 16350 1 doc_01_1 17093 5.648728e-06 -5.248049 4.232818 9124
## 4 abandon 3130 14 doc_01_1 2233 7.908219e-05 -4.101921 3.348889 2000
## 5 abat 16592 1 doc_01_1 17279 5.648728e-06 -5.248049 4.237519 8635
## 6 abba 17127 1 doc_01_1 17698 5.648728e-06 -5.248049 4.247924 9534
## freq.y doc.y rank.y term_freq.y logtf.y logrank.y
## 1 1 doc_01_2 15321 5.601237e-06 -5.251716 4.185287
## 2 12 doc_01_2 2456 6.721484e-05 -4.172535 3.390228
## 3 3 doc_01_2 6409 1.680371e-05 -4.774595 3.806790
## 4 15 doc_01_2 2121 8.401855e-05 -4.075625 3.326541
## 5 2 doc_01_2 8128 1.120247e-05 -4.950686 3.909984
## 6 5 doc_01_2 4702 2.800618e-05 -4.552746 3.672283
plot(freq_join_gram_1_blogs$freq.x,freq_join_gram_1_blogs$freq.y,
type = "p", col = "black", lwd = 1,
xlab = "Blogs Selection 1",
ylab = "Blogs Selection 2")
cor.test(freq_join_gram_1_blogs$freq.x,freq_join_gram_1_blogs$freq.y)
##
## Pearson's product-moment correlation
##
## data: freq_join_gram_1_blogs$freq.x and freq_join_gram_1_blogs$freq.y
## t = 730.23, df = 10900, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9895502 0.9903025
## sample estimates:
## cor
## 0.9899333
Here are the cross correlations between twitter, blogs, and news.
par(mfrow=c(1,3))
#### twitter and blogs
freq_join_gram_1_twitter_blogs <- merge(gram_1_twitter_01, gram_1_blogs_01, by = "word")
freq_join_gram_1_twitter_blogs <- filter(freq_join_gram_1_twitter_blogs, freq.x != 0)
head(freq_join_gram_1_twitter_blogs)
## word X.x freq.x doc.x rank.x term_freq.x logtf.x logrank.x X.y
## 1 aam 6289 2 doc_01_1 5559 1.763357e-05 -4.753660 3.744997 6275
## 2 aand 14092 1 doc_01_1 14326 8.816787e-06 -5.054690 4.156125 15858
## 3 aaron 3426 7 doc_01_1 2361 6.171751e-05 -4.209592 3.373096 1137
## 4 abandon 15204 3 doc_01_1 4841 2.645036e-05 -4.577568 3.684935 3130
## 5 abbey 2866 6 doc_01_1 2621 5.290072e-05 -4.276538 3.418467 14388
## 6 abbi 15690 1 doc_01_1 15766 8.816787e-06 -5.054690 4.197722 9520
## freq.y doc.y rank.y term_freq.y logtf.y logrank.y
## 1 1 doc_01_1 11330 5.648728e-06 -5.248049 4.054230
## 2 1 doc_01_1 16718 5.648728e-06 -5.248049 4.223184
## 3 4 doc_01_1 4877 2.259491e-05 -4.645989 3.688153
## 4 14 doc_01_1 2233 7.908219e-05 -4.101921 3.348889
## 5 3 doc_01_1 6917 1.694618e-05 -4.770928 3.839918
## 6 7 doc_01_1 3786 3.954110e-05 -4.402951 3.578181
plot(freq_join_gram_1_twitter_blogs$freq.x,freq_join_gram_1_twitter_blogs$freq.y,
type = "p", col = "black", lwd = 1,
xlab = "Twitter Selection 1",
ylab = "Blogs Selection 1")
cor.test(freq_join_gram_1_twitter_blogs$freq.x,freq_join_gram_1_twitter_blogs$freq.y)
##
## Pearson's product-moment correlation
##
## data: freq_join_gram_1_twitter_blogs$freq.x and freq_join_gram_1_twitter_blogs$freq.y
## t = 124.84, df = 8162, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8025374 0.8174532
## sample estimates:
## cor
## 0.8101264
freq_join_gram_1_twitter_news <- merge(gram_1_twitter_01, gram_1_news_corr, by = "word")
freq_join_gram_1_twitter_news <- filter(freq_join_gram_1_twitter_news, freq.x != 0)
head(freq_join_gram_1_twitter_news)
## word X.x freq.x doc.x rank.x term_freq.x logtf.x logrank.x X.y
## 1 aapl 11921 1 doc_01_1 12550 8.816787e-06 -5.054690 4.098644 33208
## 2 aaron 3426 7 doc_01_1 2361 6.171751e-05 -4.209592 3.373096 4301
## 3 aaup 10721 1 doc_01_1 11657 8.816787e-06 -5.054690 4.066587 22931
## 4 abandon 15204 3 doc_01_1 4841 2.645036e-05 -4.577568 3.684935 10687
## 5 abbey 2866 6 doc_01_1 2621 5.290072e-05 -4.276538 3.418467 23819
## 6 abbi 15690 1 doc_01_1 15766 8.816787e-06 -5.054690 4.197722 23527
## freq.y doc.y rank.y term_freq.y logtf.y logrank.y
## 1 1 doc_01_1 33330 2.373690e-06 -5.624576 4.522835
## 2 32 doc_01_1 2368 7.595808e-05 -4.119426 3.374382
## 3 1 doc_01_1 25299 2.373690e-06 -5.624576 4.403103
## 4 41 doc_01_1 1995 9.732129e-05 -4.011792 3.299943
## 5 4 doc_01_1 9755 9.494760e-06 -5.022516 3.989227
## 6 3 doc_01_1 11881 7.121070e-06 -5.147455 4.074853
plot(freq_join_gram_1_twitter_news$freq.x,freq_join_gram_1_twitter_news$freq.y,
type = "p", col = "black", lwd = 1,
xlab = "Twitter Selection 1",
ylab = "News Selection 1")
cor.test(freq_join_gram_1_twitter_news$freq.x,freq_join_gram_1_twitter_news$freq.y)
##
## Pearson's product-moment correlation
##
## data: freq_join_gram_1_twitter_news$freq.x and freq_join_gram_1_twitter_news$freq.y
## t = 78.917, df = 9372, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.619526 0.643854
## sample estimates:
## cor
## 0.6318456
freq_join_gram_1_blogs_news <- merge(gram_1_blogs_01, gram_1_news_corr, by = "word")
freq_join_gram_1_blogs_news <- filter(freq_join_gram_1_blogs_news, freq.x != 0)
head(freq_join_gram_1_blogs_news)
## word X.x freq.x doc.x rank.x term_freq.x logtf.x logrank.x X.y
## 1 aaron 1137 4 doc_01_1 4877 2.259491e-05 -4.645989 3.688153 4301
## 2 ababa 5890 1 doc_01_1 11230 5.648728e-06 -5.248049 4.050380 31513
## 3 aback 16350 1 doc_01_1 17093 5.648728e-06 -5.248049 4.232818 25703
## 4 abandon 3130 14 doc_01_1 2233 7.908219e-05 -4.101921 3.348889 10687
## 5 abat 16592 1 doc_01_1 17279 5.648728e-06 -5.248049 4.237519 31986
## 6 abba 17127 1 doc_01_1 17698 5.648728e-06 -5.248049 4.247924 16724
## freq.y doc.y rank.y term_freq.y logtf.y logrank.y
## 1 32 doc_01_1 2368 7.595808e-05 -4.119426 3.374382
## 2 1 doc_01_1 31819 2.373690e-06 -5.624576 4.502687
## 3 1 doc_01_1 27191 2.373690e-06 -5.624576 4.434425
## 4 41 doc_01_1 1995 9.732129e-05 -4.011792 3.299943
## 5 2 doc_01_1 16954 4.747380e-06 -5.323546 4.229272
## 6 3 doc_01_1 11234 7.121070e-06 -5.147455 4.050534
plot(freq_join_gram_1_blogs_news$freq.x,freq_join_gram_1_blogs_news$freq.y,
type = "p", col = "black", lwd = 1,
xlab = "Blogs Selection 1",
ylab = "Newss Selection 1")
cor.test(freq_join_gram_1_blogs_news$freq.x,freq_join_gram_1_blogs_news$freq.y)
##
## Pearson's product-moment correlation
##
## data: freq_join_gram_1_blogs_news$freq.x and freq_join_gram_1_blogs_news$freq.y
## t = 143.78, df = 12168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7867177 0.7998864
## sample estimates:
## cor
## 0.7933949
All of the scripts used to generate the data we explore here are in the following github repository: (https://github.com/rubiera/RCapstone_Week2Writeup).
The N-Gram counts in the training dataset (combining twitter, blogs, and news datasets) are:
Outside of this report, we aggregate counts from the large chunk files, write the results to csv files, and read those csv files in here.
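That aggregation looks roughly like the following sketch; the input file pattern and the output file name are illustrative assumptions (the exact scripts are in the repository linked above).
gram_files <- list.files("./../skims/gram_1_blogs", pattern = "txt$", full.names = TRUE)
gram_list <- lapply(gram_files, read.csv, stringsAsFactors = FALSE)
gram_all <- do.call(rbind, gram_list)
#sum the counts for each word across all chunk files
gram_aggr <- aggregate(freq ~ word, data = gram_all, FUN = sum)
gram_aggr <- gram_aggr[order(-gram_aggr$freq), ]
write.csv(gram_aggr, "./../skims/merged_grams/en_gram_1_aggregated_sketch.csv")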
en_gram_1 <-
read.csv("./../skims/merged_grams/en_gram_1_plot.csv", stringsAsFactors = FALSE)
en_gram_2 <-
read.csv("./../skims/merged_grams/en_gram_2_plot.csv", stringsAsFactors = FALSE)
en_gram_3 <-
read.csv("./../skims/merged_grams/en_gram_3_plot.csv", stringsAsFactors = FALSE)
str(en_gram_1)
## 'data.frame': 19 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "time" "love" "day" "make" ...
## $ freq: int 128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
str(en_gram_2)
## 'data.frame': 20 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "happi birthday" "year ago" "good morn" "follow back" ...
## $ freq: int 6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
str(en_gram_3)
## 'data.frame': 19 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "happi mother day" "cinco de mayo" "follow follow back" "love love love" ...
## $ freq: int 2222 722 658 547 473 448 353 337 334 326 ...
Here are the most common words found in the training dataset.
en_gram_1_plot <- filter(en_gram_1, freq > 50000)
par(mfrow=c(1,1))
en_gram_1_plot %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq)) +
geom_col() +
xlab(NULL) +
coord_flip()
kable(en_gram_1_plot)
| X | V1 | word | freq |
|---|---|---|---|
| 1 | 1 | time | 128289 |
| 2 | 2 | love | 123586 |
| 3 | 3 | day | 119183 |
| 4 | 4 | make | 102951 |
| 5 | 5 | good | 101047 |
| 6 | 6 | dont | 96947 |
| 7 | 7 | work | 84019 |
| 8 | 8 | peopl | 76762 |
| 9 | 9 | year | 74511 |
| 10 | 10 | back | 73138 |
| 11 | 11 | great | 71493 |
| 12 | 12 | thing | 70235 |
| 13 | 13 | today | 65769 |
| 14 | 14 | follow | 63389 |
| 15 | 15 | realli | 63276 |
| 16 | 16 | onli | 61547 |
| 17 | 17 | becaus | 54739 |
| 18 | 18 | feel | 52073 |
| 19 | 19 | week | 50729 |
Here are the most common pairs of words (2-grams) found in the training dataset.
en_gram_2_plot <- filter(en_gram_2, freq > 2400)
par(mfrow=c(1,1))
en_gram_2_plot %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq)) +
geom_col() +
xlab(NULL) +
coord_flip()
kable(en_gram_2_plot)
| X | V1 | word | freq |
|---|---|---|---|
| 1 | 1 | happi birthday | 6022 |
| 2 | 2 | year ago | 5505 |
| 3 | 3 | good morn | 5387 |
| 4 | 4 | follow back | 4724 |
| 5 | 5 | good luck | 4455 |
| 6 | 6 | mother day | 4254 |
| 7 | 7 | high school | 3897 |
| 8 | 8 | everi day | 3851 |
| 9 | 9 | im gonna | 3547 |
| 10 | 10 | social media | 3307 |
| 11 | 11 | everi time | 3298 |
| 12 | 12 | long time | 3152 |
| 13 | 13 | great day | 3116 |
| 14 | 14 | good thing | 3011 |
| 15 | 15 | dont forget | 2920 |
| 16 | 16 | pleas follow | 2843 |
| 17 | 17 | mani peopl | 2822 |
| 18 | 18 | littl bit | 2528 |
| 19 | 19 | im glad | 2410 |
| 20 | 20 | make feel | 2405 |
Here are the most common triples of words (3-grams) found in the training dataset.
en_gram_3_plot <- filter(en_gram_3, freq > 275)
par(mfrow=c(1,1))
en_gram_3_plot %>%
mutate(word = reorder(word, freq)) %>%
ggplot(aes(word, freq)) +
geom_col() +
xlab(NULL) +
coord_flip()
kable(en_gram_3_plot)
| X | V1 | word | freq |
|---|---|---|---|
| 1 | 1 | happi mother day | 2222 |
| 2 | 2 | cinco de mayo | 722 |
| 3 | 3 | follow follow back | 658 |
| 4 | 4 | love love love | 547 |
| 5 | 5 | st patrick day | 473 |
| 6 | 6 | happi valentin day | 448 |
| 7 | 7 | pleas pleas pleas | 353 |
| 8 | 8 | coupl week ago | 337 |
| 9 | 9 | good morn everyon | 334 |
| 10 | 10 | mother day mom | 326 |
| 11 | 11 | hope everyon great | 326 |
| 12 | 12 | follow back pleas | 321 |
| 13 | 13 | pleas follow back | 307 |
| 14 | 14 | item abov pp | 305 |
| 15 | 15 | hope great day | 298 |
| 16 | 16 | long time ago | 296 |
| 17 | 17 | ha ha ha | 296 |
| 18 | 18 | incorpor item abov | 284 |
| 19 | 19 | dont understand whi | 279 |
We next assess how many unique words are needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the English training dataset. We use the stemmed dataset we have worked with so far, and leave as a next step the use of a pre-stemmed dataset in which every word has been validated against wordnet.
We find a total of 19,974,831 word instances across 477,305 unique words in the 1-gram file. The most common entries in our 1-gram file that are found “in the wild” but are not actually valid words are:
select_http_www <- read.csv("./../skims/merged_grams/select_http_www.csv", stringsAsFactors = FALSE)
select_long_words <- read.csv("./../skims/merged_grams/select_long_words.csv", stringsAsFactors = FALSE)
select_many_vowels <- read.csv("./../skims/merged_grams/select_many_vowels.csv", stringsAsFactors = FALSE)
select_many_consonants <- read.csv("./../skims/merged_grams/select_many_consonants.csv", stringsAsFactors = FALSE)
found_once <- read.csv("./../skims/merged_grams/found_once.csv", stringsAsFactors = FALSE)
Here is a sample of some of these.
select_http_www
## X V1 word freq
## 1 1 11112 http 111
## 2 2 15484 wwwradiotagrcomwfuv 61
## 3 3 15900 www 58
## 4 4 16756 wwwcrowdclickcom 53
## 5 5 17895 wwwwatchnhllivecom 47
## 6 6 18337 wwwknowledgesafaricom 45
## 7 7 18598 wwwgpartycom 44
## 8 8 21479 wwwtinyurlcomelegantlivingfurnitur 34
## 9 9 22601 wwwsnoclothingcom 31
## 10 10 24379 wwwfreekshowradiocom 27
## 11 11 25458 wwwzesmocom 25
## 12 12 25464 wwwussportspagescom 25
## 13 13 26013 wwwtncnewscom 24
## 14 14 26641 wwwbirdofprayerorg 23
## 15 15 28823 wwwwedrawthelinescagov 20
select_long_words
## X V1 word freq
## 1 1 14790 someoneiwanttomeet 66
## 2 2 15058 thatawkwardmomentwhen 64
## 3 3 15484 wwwradiotagrcomwfuv 61
## 4 4 16755 wordsyouwillneverhearmesay 53
## 5 5 16758 thoughtsduringschool 53
## 6 6 17490 whenifirstjoinedtwitt 49
## 7 7 17894 thingsireallycantstand 47
## 8 8 17895 wwwwatchnhllivecom 47
## 9 9 18337 wwwknowledgesafaricom 45
## 10 10 18592 mypastrelationship 44
## 11 11 19336 sometimesyouhaveto 41
## 12 12 19613 everyonehasthatfriend 40
## 13 13 20199 wbeyondthescoreboardnet 38
## 14 14 20483 firstsongsonshuffl 37
## 15 15 20486 incaseyoudidntknow 37
select_many_vowels
## X V1 word freq
## 1 1 10446 sooooo 123
## 2 2 15057 lmaoooo 64
## 3 3 16336 soooooo 55
## 4 4 20192 lmfaoooo 38
## 5 5 21471 lmaooooo 34
## 6 6 21472 lmfaooooo 34
## 7 7 26624 yaaaaay 23
## 8 8 28053 lmfaoooooo 21
## 9 9 30054 nooooo 18
## 10 10 30633 lmaoooooo 18
## 11 11 32719 yooooo 16
## 12 12 33809 hooooo 15
## 13 13 33812 yeeeee 15
## 14 14 34174 waaaaay 14
## 15 15 35228 yaaaaaay 14
select_many_consonants
## X V1 word freq
## 1 1 20440 grrrrr 37
## 2 2 22187 pptpltpsstp 32
## 3 3 23855 grrrrrr 28
## 4 4 26011 plzzzz 24
## 5 5 26013 wwwtncnewscom 24
## 6 6 26377 msnbcs 23
## 7 7 29574 brrrrr 19
## 8 8 29660 wwwkrqecom 19
## 9 9 31583 wwwdrwiggyscom 17
## 10 10 33958 kstptv 15
## 11 11 34040 dwntwn 15
## 12 12 34718 bldgblog 14
## 13 13 37042 httpwwwdagnomusiccom 13
## 14 14 38661 grrrrrrr 12
## 15 15 40980 waystotellppltheyreug 11
found_once
## X x
## 1 1 arguefight
## 2 2 arianrhod
## 3 3 armsswear
## 4 4 aroo
## 5 5 arounddo
## 6 6 aroundor
## 7 7 arquinton
## 8 8 artbas
## 9 9 aryabhatta
## 10 10 asidecook
## 11 11 aska
## 12 12 aslevel
## 13 13 aspeg
## 14 14 assembleia
## 15 15 astier
## 16 16 attatar
## 17 17 attemptth
## 18 18 augard
## 19 19 aul
## 20 20 autoerotic
## 21 21 autumnsh
## 22 22 avalabl
## 23 23 avarist
## 24 24 avial
## 25 25 aviana
## 26 26 axarquia
## 27 27 aylay
## 28 28 aztext
## 29 29 baauer
## 30 30 babenzl
## 31 31 babyactiv
We use all of these regular expressions to exclude words that we estimate should not be in our dictionary, with the caveat that this is a draft of our final dictionary and that these criteria require considerable optimization. This rough clean-up takes us from 477,305 words to 440,682, subtracting 36,623 words. Additional words need to be subtracted before we finalize this dictionary. As a rough working dictionary, we take the dictionary minus our regular-expression criteria and also exclude all words found only once. This leaves a 155,365-word dictionary that we can use to assess how many unique words are needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the English training dataset. These 155,365 words are found in a total of 19,624,213 instances. 50 percent of these instances is 9,812,107, or approximately 9.8 million; 90 percent is 17,661,792, or approximately 17.7 million. The top 600 words account for 9,776,698 instances, or approximately 9.8 million; the top 7,500 words account for 17,679,382 instances, or approximately 17.7 million.
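Here is a sketch of the kind of regular-expression filters behind the selections above. The exact patterns and thresholds are still being tuned, so the ones shown here are assumptions, and en_gram_1 stands in for the full 477,305-word table that we process outside of this report.
is_url <- grepl("^(http|www)", en_gram_1$word)
is_too_long <- nchar(en_gram_1$word) >= 18
many_vowels <- grepl("[aeiou]{4,}", en_gram_1$word)
many_consonants <- grepl("[^aeiou]{5,}", en_gram_1$word)
#drop any word matching one of the exclusion criteria
exclude <- is_url | is_too_long | many_vowels | many_consonants
en_gram_1_screened <- en_gram_1[!exclude, ]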
en_gram_1_round_2_2plus <-
read.csv("./../skims/merged_grams/en_gram_1_round_2_2plus.csv", stringsAsFactors = FALSE)
str(en_gram_1_round_2_2plus)
## 'data.frame': 155365 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "time" "love" "day" "make" ...
## $ freq: int 128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
sum(en_gram_1_round_2_2plus$freq)
## [1] 19624213
19624213*0.5
## [1] 9812107
19624213*0.9
## [1] 17661792
sum(en_gram_1_round_2_2plus[1:600,]$freq)
## [1] 9776698
sum(en_gram_1_round_2_2plus[1:7500,]$freq)
## [1] 17679382
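The same coverage figures can be read directly from a cumulative sum over the frequency-sorted dictionary; here is a short sketch using the file loaded above.
#cumulative share of all word instances covered by the top-ranked words
cum_cover <- cumsum(en_gram_1_round_2_2plus$freq) / sum(en_gram_1_round_2_2plus$freq)
which(cum_cover >= 0.5)[1] #first rank that reaches 50 percent coverage
which(cum_cover >= 0.9)[1] #first rank that reaches 90 percent coverage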
We use the wordnet thesaurus to determine whether a word in our dictionary is in the English language. Because stemming changes many of our words into forms that are not in wordnet (e.g., “happy” becomes “happi”), we generate a small sample of pre-stemmed word frequencies to explore checking our dictionary against wordnet. We leave a full implementation of this aspect of the prediction model for later.
wordnet_gram_1_check <- read.csv("./../model/test_03/gram_1_twitter_nostem_01_1.txt", stringsAsFactors = FALSE)
str(wordnet_gram_1_check)
## 'data.frame': 22754 obs. of 3 variables:
## $ X : int 45 231 70 254 203 104 33 582 31 191 ...
## $ word: chr "love" "good" "day" "dont" ...
## $ freq: int 857 848 747 745 639 626 586 540 506 438 ...
Here is a rudimentary function that checks whether a word is in wordnet. Wordnet does not include stopwords, and it only includes four types of words: “ADJECTIVE”, “ADVERB”, “NOUN”, and “VERB”. From the wordnet web site: “excluded words include determiners, prepositions, pronouns, conjunctions, and particles.” We check the function with three Spanish words that are not commonly used in English, in contrast to Spanish words such as “Cinco de Mayo,” “burrito,” and “taco,” which are commonly used in English.
checkWordnet <- function(checkword,found) {
word <- checkword
filter <- getTermFilter("ExactMatchFilter", word, TRUE)
terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
terms_NOUN <- getIndexTerms("NOUN", 1, filter)
terms_VERB <- getIndexTerms("VERB", 1, filter)
if(is.null(terms_ADJECTIVE) == TRUE &&
is.null(terms_ADVERB) == TRUE &&
is.null(terms_NOUN) == TRUE &&
is.null(terms_VERB) == TRUE){
print(paste0(checkword," is not in wordnet"))
found <- FALSE
}
if(!is.null(terms_ADJECTIVE) == TRUE ||
!is.null(terms_ADVERB) == TRUE ||
!is.null(terms_NOUN) == TRUE ||
!is.null(terms_VERB) == TRUE){
print(paste0(checkword," is in wordnet"))
found <- TRUE
}
return(found)
}
checkWordnet("orden")
## [1] "orden is not in wordnet"
## [1] FALSE
checkWordnet("socorro")
## [1] "socorro is not in wordnet"
## [1] FALSE
checkWordnet("serpiente")
## [1] "serpiente is not in wordnet"
## [1] FALSE
These are words that are found in wordnet.
checkWordnet("happy")
## [1] "happy is in wordnet"
## [1] TRUE
checkWordnet("make")
## [1] "make is in wordnet"
## [1] TRUE
checkWordnet("may")
## [1] "may is in wordnet"
## [1] TRUE
We select the top 298 words in this small sample and check them in wordnet. We find 252, or 84.6 percent, of these words in wordnet. We run a smaller sample here to minimize markdown bloat: of the 68 words here, we find 57, or 83.8 percent. The words not found are mostly apostrophe-less contractions (e.g., “dont,” “youre,” “ive”) and internet slang. It is puzzling that we did not find “guys.” We leave working through the quirks of wordnet for a later installment of this work.
wordnet_gram_1_check_example <- wordnet_gram_1_check %>% arrange(desc(freq)) %>% filter(freq > 150)
str(wordnet_gram_1_check_example)
## 'data.frame': 68 obs. of 3 variables:
## $ X : int 45 231 70 254 203 104 33 582 31 191 ...
## $ word: chr "love" "good" "day" "dont" ...
## $ freq: int 857 848 747 745 639 626 586 540 506 438 ...
checkWordnet <- function(checkword,found) {
word <- checkword
filter <- getTermFilter("ExactMatchFilter", word, TRUE)
terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
terms_NOUN <- getIndexTerms("NOUN", 1, filter)
terms_VERB <- getIndexTerms("VERB", 1, filter)
if(is.null(terms_ADJECTIVE) == TRUE &&
is.null(terms_ADVERB) == TRUE &&
is.null(terms_NOUN) == TRUE &&
is.null(terms_VERB) == TRUE){
found <- FALSE
}
if(!is.null(terms_ADJECTIVE) == TRUE ||
!is.null(terms_ADVERB) == TRUE ||
!is.null(terms_NOUN) == TRUE ||
!is.null(terms_VERB) == TRUE){
found <- TRUE
}
return(found)
}
count <- 0
for (i in 1:nrow(wordnet_gram_1_check_example)){
outcome <- checkWordnet(wordnet_gram_1_check_example$word[i])
if (outcome == TRUE) {
count <- count + 1
}
if(outcome ==FALSE && count < 50){
print("Not found in wordnet")
print(wordnet_gram_1_check_example$word[i])
}
if(i == nrow(wordnet_gram_1_check_example)){
print("count")
print(count)
}
}
## [1] "Not found in wordnet"
## [1] "dont"
## [1] "Not found in wordnet"
## [1] "lol"
## [1] "Not found in wordnet"
## [1] "youre"
## [1] "Not found in wordnet"
## [1] "haha"
## [1] "Not found in wordnet"
## [1] "hey"
## [1] "Not found in wordnet"
## [1] "guys"
## [1] "Not found in wordnet"
## [1] "ive"
## [1] "Not found in wordnet"
## [1] "lets"
## [1] "Not found in wordnet"
## [1] "gonna"
## [1] "count"
## [1] 57
For the final exploration of Task 2, we look at ways to increase coverage using synonyms from wordnet.
checkWordnet_synonyms <- function(checkword) {
word <- checkword
filter <- getTermFilter("ExactMatchFilter", word, TRUE)
terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
terms_NOUN <- getIndexTerms("NOUN", 1, filter)
terms_VERB <- getIndexTerms("VERB", 1, filter)
if(!is.null(terms_ADJECTIVE) == TRUE)
{
print(paste0(word," is an ADJECTIVE"))
terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 10, filter)
print(getSynonyms(terms_ADJECTIVE[[1]]))
}
if(!is.null(terms_ADVERB) == TRUE)
{
print(paste0(word," is an ADVERB"))
terms_ADVERB <- getIndexTerms("ADVERB", 10, filter)
print(getSynonyms(terms_ADVERB[[1]]))
}
if(!is.null(terms_NOUN) == TRUE)
{
print(paste0(word," is a NOUN"))
terms_NOUN <- getIndexTerms("NOUN", 10, filter)
print(getSynonyms(terms_NOUN[[1]]))
}
if(!is.null(terms_VERB) == TRUE)
{
print(paste0(word," is a VERB"))
terms_VERB <- getIndexTerms("VERB", 10, filter)
print(getSynonyms(terms_VERB[[1]]))
}
}
Checking some of the more common words in our small sample, we find many synonyms. We can use the synonyms that are in wordnet but not in our dictionary to augment the dictionary. We can assign them a term frequency similar to that of the words already in our dictionary, or devise an algorithm that adapts to how common a missing synonym is. We leave the expansion of the initial dictionary with wordnet synonyms for a later installment of our prediction model.
checkWordnet_synonyms("happy")
## [1] "happy is an ADJECTIVE"
## [1] "felicitous" "glad" "happy" "well-chosen"
checkWordnet_synonyms("follow")
## [1] "follow is a VERB"
## [1] "abide by" "accompany" "adopt" "be"
## [5] "come" "come after" "comply" "conform to"
## [9] "espouse" "fall out" "follow" "keep abreast"
## [13] "keep an eye on" "keep up" "observe" "play along"
## [17] "postdate" "pursue" "stick to" "stick with"
## [21] "succeed" "surveil" "survey" "take after"
## [25] "trace" "travel along" "watch" "watch over"
checkWordnet_synonyms("good")
## [1] "good is an ADJECTIVE"
## [1] "adept" "beneficial" "dear" "dependable" "effective"
## [6] "estimable" "expert" "full" "good" "honorable"
## [11] "in effect(p)" "in force(p)" "just" "near" "practiced"
## [16] "proficient" "respectable" "right" "ripe" "safe"
## [21] "salutary" "secure" "serious" "skilful" "skillful"
## [26] "sound" "unspoiled" "unspoilt" "upright" "well(p)"
## [1] "good is an ADVERB"
## [1] "good" "soundly" "thoroughly" "well"
## [1] "good is a NOUN"
## [1] "commodity" "good" "goodness" "trade good"
checkWordnet_synonyms("back")
## [1] "back is an ADJECTIVE"
## [1] "back(a)" "hind(a)" "hinder(a)"
## [1] "back is an ADVERB"
## [1] "back" "backward" "backwards" "in reply" "rearward" "rearwards"
## [1] "back is a NOUN"
## [1] "back" "backbone" "backrest" "binding"
## [5] "book binding" "cover" "dorsum" "rachis"
## [9] "rear" "spinal column" "spine" "vertebral column"
## [1] "back is a VERB"
## [1] "back" "back up" "bet on" "endorse" "gage" "game"
## [7] "indorse" "plump for" "plunk for" "punt" "second" "stake"
## [13] "support"
checkWordnet_synonyms("wait")
## [1] "wait is a NOUN"
## [1] "delay" "hold" "postponement" "time lag" "wait"
## [6] "waiting"
## [1] "wait is a VERB"
## [1] "await" "expect" "hold back" "hold off" "look" "wait"
## [7] "waitress"
checkWordnet_synonyms("amazing")
## [1] "amazing is an ADJECTIVE"
## [1] "amazing" "astonishing" "awe-inspiring" "awesome"
## [5] "awful" "awing"
checkWordnet_synonyms("watch")
## [1] "watch is a NOUN"
## [1] "lookout" "lookout man" "picket" "scout" "sentinel"
## [6] "sentry" "spotter" "ticker" "vigil" "watch"
## [1] "watch is a VERB"
## [1] "ascertain" "catch" "check" "determine"
## [5] "find out" "follow" "keep an eye on" "learn"
## [9] "look on" "look out" "observe" "see"
## [13] "take in" "view" "watch" "watch out"
## [17] "watch over"
checkWordnet_synonyms("nice")
## [1] "nice is an ADJECTIVE"
## [1] "courteous" "dainty" "decent" "discriminate" "gracious"
## [6] "nice" "overnice" "prissy" "skillful" "squeamish"
## [1] "nice is a NOUN"
## [1] "Nice"
checkWordnet_synonyms("start")
## [1] "start is a NOUN"
## [1] "beginning" "commencement" "first" "get-go"
## [5] "head start" "jump" "kickoff" "offset"
## [9] "outset" "showtime" "start" "starting"
## [13] "starting line" "starting signal" "starting time" "startle"
## [1] "start is a VERB"
## [1] "begin" "bug out" "bulge" "bulge out" "come out" "commence"
## [7] "depart" "embark on" "get" "get down" "get going" "go"
## [13] "initiate" "jump" "lead off" "originate" "part" "pop"
## [19] "pop out" "protrude" "set about" "set forth" "set off" "set out"
## [25] "start" "start out" "start up" "startle" "take off" "take up"
We first trim each of the large N-gram count files to a manageable size for testing a draft model. We select:
gram_1_aggr <- read.csv("./../model/training/gram_1_aggr_freq2plus.csv", stringsAsFactors = FALSE)
gram_2_aggr <- read.csv("./../model/training/gram_2_aggr_freq20plus.csv", stringsAsFactors = FALSE)
gram_3_aggr <- read.csv("./../model/training/gram_3_aggr_freq4plus.csv", stringsAsFactors = FALSE)
str(gram_1_aggr)
## 'data.frame': 160415 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "time" "love" "day" "make" ...
## $ freq: int 128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
str(gram_2_aggr)
## 'data.frame': 88756 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "happi birthday" "year ago" "good morn" "follow back" ...
## $ freq: int 6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
str(gram_3_aggr)
## 'data.frame': 96429 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word: chr "happi mother day" "cinco de mayo" "follow follow back" "love love love" ...
## $ freq: int 2222 722 658 547 473 448 353 337 334 326 ...
We split each file to separate out the words in each N-gram, and select only the 2-grams whose first word is “time.” We want to follow “time” as “word1” as the N-grams increase from N=2 to N=5.
gram_2_aggr_split <- gram_2_aggr %>% separate(word, c("word1", "word2"), sep = " ")
gram_3_aggr_split <- gram_3_aggr %>% separate(word, c("word1", "word2", "word3"), sep = " ")
head(gram_2_aggr_split)
## X V1 word1 word2 freq
## 1 1 1 happi birthday 6022
## 2 2 2 year ago 5505
## 3 3 3 good morn 5387
## 4 4 4 follow back 4724
## 5 5 5 good luck 4455
## 6 6 6 mother day 4254
head(gram_3_aggr_split)
## X V1 word1 word2 word3 freq
## 1 1 1 happi mother day 2222
## 2 2 2 cinco de mayo 722
## 3 3 3 follow follow back 658
## 4 4 4 love love love 547
## 5 5 5 st patrick day 473
## 6 6 6 happi valentin day 448
gram_2_aggr_split <- filter(gram_2_aggr_split, word1 == "time")
head(gram_2_aggr_split,20)
## X V1 word1 word2 freq
## 1 50 50 time year 1757
## 2 91 91 time time 1292
## 3 129 129 time im 1121
## 4 182 182 time make 955
## 5 187 187 time day 947
## 6 308 308 time week 711
## 7 311 311 time work 706
## 8 313 313 time start 698
## 9 323 323 time ive 685
## 10 349 349 time befor 666
## 11 411 411 time tri 610
## 12 439 439 time sinc 590
## 13 458 458 time dont 579
## 14 537 537 time read 523
## 15 580 580 time life 503
## 16 590 590 time back 496
## 17 643 643 time love 468
## 18 652 652 time chang 460
## 19 658 658 time today 458
## 20 665 665 time realli 454
str(gram_2_aggr_split)
## 'data.frame': 949 obs. of 5 variables:
## $ X : int 50 91 129 182 187 308 311 313 323 349 ...
## $ V1 : int 50 91 129 182 187 308 311 313 323 349 ...
## $ word1: chr "time" "time" "time" "time" ...
## $ word2: chr "year" "time" "im" "make" ...
## $ freq : int 1757 1292 1121 955 947 711 706 698 685 666 ...
We calculate the 2-gram frequencies and select 2-grams with frequency >= 300.
gram_2_aggr_time_freqs <- gram_2_aggr_split %>% mutate(term_freq2 = freq/sum(freq))
gram_2_aggr_time_freqs <- gram_2_aggr_time_freqs %>%
mutate(freq2 = freq) %>% select(-freq)
gram_2_aggr_time_freqs <- gram_2_aggr_time_freqs %>% filter(freq2 >= 300)
gram_2_aggr_time_freqs
## X V1 word1 word2 term_freq2 freq2
## 1 50 50 time year 0.022358810 1757
## 2 91 91 time time 0.016441424 1292
## 3 129 129 time im 0.014265353 1121
## 4 182 182 time make 0.012152910 955
## 5 187 187 time day 0.012051106 947
## 6 308 308 time week 0.009047874 711
## 7 311 311 time work 0.008984246 706
## 8 313 313 time start 0.008882441 698
## 9 323 323 time ive 0.008717009 685
## 10 349 349 time befor 0.008475223 666
## 11 411 411 time tri 0.007762592 610
## 12 439 439 time sinc 0.007508081 590
## 13 458 458 time dont 0.007368100 579
## 14 537 537 time read 0.006655468 523
## 15 580 580 time life 0.006400957 503
## 16 590 590 time back 0.006311878 496
## 17 643 643 time love 0.005955562 468
## 18 652 652 time chang 0.005853758 460
## 19 658 658 time today 0.005828307 458
## 20 665 665 time realli 0.005777404 454
## 21 688 688 time watch 0.005662874 445
## 22 758 758 time ago 0.005382912 423
## 23 780 780 time talk 0.005268382 414
## 24 836 836 time play 0.005052048 397
## 25 865 865 time write 0.004962969 390
## 26 891 891 time becaus 0.004899341 385
## 27 903 903 time peopl 0.004848439 381
## 28 916 916 time onli 0.004797536 377
## 29 948 948 time good 0.004670281 367
## 30 954 954 time spent 0.004644830 365
## 31 969 969 time ill 0.004593927 361
## 32 977 977 time find 0.004568476 359
## 33 1003 1003 time feel 0.004504848 354
## 34 1023 1023 time famili 0.004441221 349
## 35 1126 1126 time put 0.004212161 331
## 36 1275 1275 time togeth 0.003919473 308
## 37 1336 1336 time move 0.003817668 300
The next word is going to have a really low probability of being correct for common words like “time.” The relevant concept here is perplexity, which we use loosely to mean the branching factor from “time” as “word1” into the possible next words in the sentence. The perplexity of “time” into “word2” (the second word in the bigram) is high enough (37 in the training data, after keeping instances that occur 300 times or more) that choosing the most common “word2” as the prediction will be wrong most of the time.
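A quick way to see this branching factor for every word1 at once is to count the distinct continuations in the trimmed 2-gram table. Here is a sketch (we recompute the split because gram_2_aggr_split was filtered to word1 == “time” above); as with the other sketches, we do not execute it in this markdown.
gram_2_branching <- gram_2_aggr %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  count(word1) %>%
  arrange(desc(n))
head(gram_2_branching) #n is the number of distinct next words for each word1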
In this section we address the following issue in Task 3:
We will at times encounter a 3-gram (word1, word2, word3) for which the information we have on the 2-gram formed by (word1, word2) and/or the 2-gram formed by (word2, word3) may increase our predictive ability. We will test this code in a future iteration of our model; for now, we show how it works.
First, we place the most common 2-grams in a placeholder data frame.
gram_2_aggr_split_backtrack <- gram_2_aggr %>% separate(word, c("word1", "word2"), sep = " ")
head(gram_2_aggr_split_backtrack)
## X V1 word1 word2 freq
## 1 1 1 happi birthday 6022
## 2 2 2 year ago 5505
## 3 3 3 good morn 5387
## 4 4 4 follow back 4724
## 5 5 5 good luck 4455
## 6 6 6 mother day 4254
As an example of the lowest perplexity, a bigram beginning with the word “abbey” is found only once in our 2-gram data frame. This is an example of a low-perplexity N-gram for which we should have the highest predictive power: if someone enters the word “abbey,” then based on our training data the only possible next word is “road,” of Beatles fame.
gram_2_aggr_split_word_001 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 1)
head(gram_2_aggr_split_word_001)
## # A tibble: 6 x 2
## word1 n
## <chr> <int>
## 1 aaron 1
## 2 ab 1
## 3 abbey 1
## 4 abe 1
## 5 abercrombi 1
## 6 abort 1
nrow(gram_2_aggr_split_word_001)
## [1] 2437
gram_2_aggr_split_backtrack %>% filter(word1 == "abbey")
## X V1 word1 word2 freq
## 1 86912 86912 abbey road 20
Here is a sample 2-gram with perplexity = 5 (word1 = “agent”).
gram_2_aggr_split_word_005 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 5)
head(gram_2_aggr_split_word_005)
## # A tibble: 6 x 2
## word1 n
## <chr> <int>
## 1 actor 5
## 2 adopt 5
## 3 agent 5
## 4 airport 5
## 5 alot 5
## 6 alright 5
nrow(gram_2_aggr_split_word_005)
## [1] 220
gram_2_aggr_split_backtrack %>% filter(word1 == "agent")
## X V1 word1 word2 freq
## 1 22892 22892 agent publish 51
## 2 35900 35900 agent editor 38
## 3 55733 55733 agent work 27
## 4 76703 76703 agent agent 22
## 5 87484 87484 agent sign 20
Here is a sample 2-gram with perplexity = 10 (word1 = “amount”).
gram_2_aggr_split_word_010 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 10)
head(gram_2_aggr_split_word_010)
## # A tibble: 6 x 2
## word1 n
## <chr> <int>
## 1 accomplish 10
## 2 advic 10
## 3 ahead 10
## 4 amount 10
## 5 ancient 10
## 6 annual 10
nrow(gram_2_aggr_split_word_010)
## [1] 79
gram_2_aggr_split_backtrack %>% filter(word1 == "amount")
## X V1 word1 word2 freq
## 1 670 670 amount time 451
## 2 2654 2654 amount money 203
## 3 13658 13658 amount work 73
## 4 20198 20198 amount peopl 56
## 5 29301 29301 amount food 43
## 6 30291 30291 amount water 42
## 7 49293 49293 amount inform 30
## 8 56773 56773 amount weight 27
## 9 69887 69887 amount energi 23
## 10 82894 82894 amount alcohol 20
Here is a sample 2-gram with perplexity = 20 (word1 = “collect”).
gram_2_aggr_split_word_020 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 20)
head(gram_2_aggr_split_word_020)
## # A tibble: 6 x 2
## word1 n
## <chr> <int>
## 1 ah 20
## 2 app 20
## 3 began 20
## 4 bill 20
## 5 center 20
## 6 collect 20
nrow(gram_2_aggr_split_word_020)
## [1] 29
gram_2_aggr_split_backtrack %>% filter(word1 == "collect")
## X V1 word1 word2 freq
## 1 18437 18437 collect bargain 60
## 2 30332 30332 collect short 42
## 3 31390 31390 collect work 41
## 4 32554 32554 collect book 40
## 5 33758 33758 collect year 39
## 6 34093 34093 collect dust 39
## 7 39631 39631 collect includ 35
## 8 41262 41262 collect call 34
## 9 42129 42129 collect thing 34
## 10 47405 47405 collect love 31
## 11 52748 52748 collect data 29
## 12 53680 53680 collect stori 28
## 13 57423 57423 collect make 27
## 14 61159 61159 collect donat 26
## 15 72705 72705 collect develop 23
## 16 73323 73323 collect item 22
## 17 75291 75291 collect inform 22
## 18 75292 75292 collect mani 22
## 19 83039 83039 collect money 20
## 20 84756 84756 collect avail 20
As we progress to N-grams with higher perplexity in our model, the N required increases. We explore this issue after exploring back-tracking to lower-order N-grams and how to tackle missing words using wordnet synonyms.
We use the 37 high-frequency word1 = “time” bigrams above as an example of how to combine a trigram frequency measure with the measures of its two constituent 2-grams, (word1, word2) and (word2, word3). We do not execute this code in this markdown report; instead we generate output files that we read in.
gram_3_aggr_time <- 0
gram_2_aggr_time_backout <- 0
for (i in 1:nrow(gram_2_aggr_time_freqs)) {
gram_3_aggr_time <- 0
for (j in 1:nrow(gram_3_aggr_split)) {
if( (gram_2_aggr_time_freqs$word1[i] == gram_3_aggr_split$word1[j] &&
gram_2_aggr_time_freqs$word2[i] == gram_3_aggr_split$word2[j]) ){
if (!exists("gram_3_aggr_time")){
gram_3_aggr_time <- 0
}
# if the merged dataset does exist, append to it
if (exists("gram_3_aggr_time")){
gram_2_aggr_time_backout <- 0
freq23 <- NA #reset so a stale freq23 from a previous iteration is not carried over
for (k in 1:nrow(gram_2_aggr_split_backtrack)) {
#back track to 2 gram for word2 and word3 and get it from gram_2_aggr_split
if( (gram_2_aggr_split_backtrack$word1[k] == gram_3_aggr_split$word2[j] &&
gram_2_aggr_split_backtrack$word2[k] == gram_3_aggr_split$word3[j]) ){
gram_2_aggr_time_backout <- gram_2_aggr_split_backtrack[k,]
gram_2_aggr_time_backout <- gram_2_aggr_time_backout %>%
rename(
freq23 = freq
)
freq23 <- gram_2_aggr_time_backout$freq23
}#back track
}#k for loop
freq2 <- gram_2_aggr_time_freqs$freq2[i] #select the freq2 count column by name
temp_dataset <- cbind(freq2,freq23, gram_3_aggr_split[j,])
gram_3_aggr_time <- rbind(gram_3_aggr_time, temp_dataset)
rm(temp_dataset)
}#gram_3_aggr_time exists
}#top if loop
}#j for loop
gram_3_aggr_time <- filter(gram_3_aggr_time, freq != 0)
gram_3_aggr_time <- gram_3_aggr_time %>% arrange(desc(freq))
gram_3_aggr_time <- gram_3_aggr_time %>% mutate(freq3 = freq) %>% select(-freq)
gram_3_aggr_time_freqs <- gram_3_aggr_time %>% mutate(term_freq3 = freq3/sum(freq3))
gram_3_aggr_time_freqs <- gram_3_aggr_time_freqs %>%
mutate(term_freq23 = freq23/sum(freq23))
write.csv(gram_3_aggr_time_freqs,paste0("./../model/training/test_01_time/gram_3_aggr_time_",gram_2_aggr_time_freqs$word2[i],"_freqs.csv"))
}#i for loop
Once we have the relative frequencies of a trigram, we can decide how to combine these frequencies, and with which weights. These weights are referred to as lambdas in the NLP literature. We sketch one such combination after the two examples below.
In the example of word1 = “time” and word2 = “day”, the two highest freq3 values are 13. A choice with higher predictive power for this example is “im”, or “I am”. In a more optimized version of our language model, we plan to investigate how (or whether) to incorporate pronouns; wordnet, for instance, excludes them, and we may opt to treat pronouns as stopwords. Our trigram frequency is only 7 percent for “time day im,” due to the high perplexity of this trigram.
time_day <- read.csv("./../model/training/test_01_time/gram_3_aggr_time_day_freqs.csv")
time_day
## X freq2 freq23 V1 word1 word2 word3 freq3 term_freq3 term_freq23
## 1 1 947 1093 7899 time day im 13 0.06989247 0.084329913
## 2 2 947 667 8148 time day time 13 0.06989247 0.051462079
## 3 3 947 1353 10879 time day day 11 0.05913978 0.104390093
## 4 4 947 448 12182 time day night 10 0.05376344 0.034565234
## 5 5 947 1073 13087 time day befor 10 0.05376344 0.082786822
## 6 6 947 439 18131 time day good 8 0.04301075 0.033870843
## 7 7 947 478 18712 time day start 8 0.04301075 0.036879870
## 8 8 947 566 18876 time day dont 8 0.04301075 0.043669470
## 9 9 947 321 19181 time day everi 8 0.04301075 0.024766608
## 10 10 947 322 22612 time day becaus 7 0.03763441 0.024843762
## 11 11 947 760 24863 time day ago 7 0.03763441 0.058637451
## 12 12 947 1101 26447 time day work 7 0.03763441 0.084947149
## 13 13 947 1201 30221 time day week 6 0.03225806 0.092662603
## 14 14 947 588 30614 time day make 6 0.03225806 0.045366870
## 15 15 947 296 33013 time day lol 6 0.03225806 0.022837744
## 16 16 947 190 34853 time day thing 6 0.03225806 0.014659363
## 17 17 947 671 35628 time day love 6 0.03225806 0.051770697
## 18 18 947 323 41551 time day onli 5 0.02688172 0.024920917
## 19 19 947 174 48472 time day hour 5 0.02688172 0.013424890
## 20 20 947 121 61604 time day talk 4 0.02150538 0.009335699
## 21 21 947 68 64544 time day listen 4 0.02150538 0.005246509
## 22 22 947 46 66350 time day minut 4 0.02150538 0.003549109
## 23 23 947 76 66351 time day morn 4 0.02150538 0.005863745
## 24 24 947 76 72689 time day adult 4 0.02150538 0.005863745
## 25 25 947 174 73290 time day mani 4 0.02150538 0.013424890
## 26 26 947 148 76318 time day read 4 0.02150538 0.011418872
## 27 27 947 119 81016 time day sit 4 0.02150538 0.009181390
## 28 28 947 69 81491 time day differ 4 0.02150538 0.005323663
In the example of word1 = “time” and word2 = “write”, the highest-frequency word3 yields a 20.9 percent success rate, which is much higher than in the previous example due to the lower perplexity of this trigram.
time_write <- read.csv("./../model/training/test_01_time/gram_3_aggr_time_write_freqs.csv")
time_write
## X freq2 freq23 V1 word1 word2 word3 freq3 term_freq3 term_freq23
## 1 1 390 371 15217 time write blog 9 0.20930233 0.17624703
## 2 2 390 310 18137 time write post 8 0.18604651 0.14726841
## 3 3 390 264 32521 time write song 6 0.13953488 0.12541568
## 4 4 390 221 33256 time write letter 6 0.13953488 0.10498812
## 5 5 390 269 41563 time write someth 5 0.11627907 0.12779097
## 6 6 390 553 47900 time write book 5 0.11627907 0.26270784
## 7 7 390 117 80762 time write time 4 0.09302326 0.05558195
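Returning to the lambdas mentioned above, here is a minimal sketch of a linear interpolation of the trigram estimate with the backed-off (word2, word3) bigram estimate, applied to the “time day” table loaded above. The weights are illustrative assumptions and would need to be tuned, for instance on the validation set.
lambda3 <- 0.7 #weight on the trigram relative frequency (assumed, to be tuned)
lambda23 <- 0.3 #weight on the (word2, word3) bigram relative frequency
time_day_scored <- time_day %>%
  mutate(score = lambda3 * term_freq3 + lambda23 * term_freq23) %>%
  arrange(desc(score))
head(time_day_scored[, c("word1", "word2", "word3", "score")])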
In this section we address the following issue in Task 3:
We have previously shown the use of wordnet. In this section we expand on that work by generating the pre-stemmed 1-grams for all of the testing data in the blogs and twitter files. The generated file has 558,809 “words,” some of which need to be trimmed using regular expressions. We leave that trimming for a later installment, and instead select from this file the words found with frequency >= 3. For this example, we use the word “good.”
gram_1_aggr_no_stem_sel <- read.csv("./../skims/no_stem/dataset_gram_1_aggregated_no_stem_select.txt",
stringsAsFactors = FALSE)
head(gram_1_aggr_no_stem_sel,20)
## X word freq
## 1 1 time 107875
## 2 2 love 98437
## 3 3 good 97317
## 4 4 dont 96148
## 5 5 day 92261
## 6 6 people 73163
## 7 7 back 70807
## 8 8 great 69554
## 9 9 make 64528
## 10 10 today 61114
## 11 11 work 51539
## 12 12 life 47997
## 13 13 lol 44744
## 14 14 happy 40311
## 15 15 things 39691
## 16 16 year 38713
## 17 17 night 38607
## 18 18 ive 38030
## 19 19 made 36820
## 20 20 week 35904
We reuse the checkWordnet and checkWordnet_synonyms functions defined earlier.
checkWordnet_synonyms("good")
## [1] "good is an ADJECTIVE"
## [1] "adept" "beneficial" "dear" "dependable" "effective"
## [6] "estimable" "expert" "full" "good" "honorable"
## [11] "in effect(p)" "in force(p)" "just" "near" "practiced"
## [16] "proficient" "respectable" "right" "ripe" "safe"
## [21] "salutary" "secure" "serious" "skilful" "skillful"
## [26] "sound" "unspoiled" "unspoilt" "upright" "well(p)"
## [1] "good is an ADVERB"
## [1] "good" "soundly" "thoroughly" "well"
## [1] "good is a NOUN"
## [1] "commodity" "good" "goodness" "trade good"
We check the single-word synonyms of “good” against our 1-gram and 2-gram files. One way to incorporate the synonyms of “good” that we do not find in our data is to replace them in the respective N-gram with “good” and assign these unfound N-grams the lowest frequency observed among the synonyms of “good” in our 1-gram and 2-gram files, which in this case is freq = 6. We plan to automate this procedure in the future.
good_synonyms <- c("adept","beneficial","dear","dependable","full","honorable","just","near",
"practiced","proficient","respectable","right","ripe","safe","salutary",
"secure","serious","skilful","skillful","sound","unspoiled","unspoilt",
"upright","soundly","thoroughly","well","commodity","goodness")
str(gram_2_aggr_split_backtrack)
## 'data.frame': 88756 obs. of 5 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ V1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ word1: chr "happi" "year" "good" "follow" ...
## $ word2: chr "birthday" "ago" "morn" "back" ...
## $ freq : int 6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
gram_1_aggr %>% filter(word %in% good_synonyms)
## X V1 word freq
## 1 145 145 sound 18186
## 2 222 222 full 13935
## 3 650 650 safe 6048
## 4 718 718 dear 5482
## 5 6242 6242 ripe 293
## 6 6795 6795 upright 255
## 7 11114 11114 adept 110
## 8 56575 56575 unspoilt 6
Given the large sample we are using, it seems reasonable to assign a very low frequency to an unseen N-gram. In addition, the synonyms of “good” that we do find sometimes appear in N-grams with an entirely different meaning, in which case a word-for-word substitution does not fit the context or meaning of the unseen N-gram.
head(gram_2_aggr_split_backtrack %>% filter(word1 %in% good_synonyms),10)
## X V1 word1 word2 freq
## 1 70 70 sound good 1424
## 2 346 346 sound great 668
## 3 620 620 dear friend 482
## 4 685 685 full time 445
## 5 740 740 sound fun 429
## 6 1191 1191 full day 320
## 7 1289 1289 full moon 306
## 8 1356 1356 safe travel 298
## 9 2208 2208 sound awesom 226
## 10 2255 2255 sound amaz 223
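As an illustrative sketch of the substitution described above (the helper name and example query are ours; column names follow the tables above), a lookup that maps an unseen synonym of “good” back to “good” and reports the floor frequency could look like this:
library(dplyr)
# Sketch (our own helper): look up a bigram; if word1 is a synonym of "good"
# that never appears as word1 in our bigram table, retry the lookup as
# "good <word2>" and report it with the floor frequency (6, per the discussion above).
lookup_with_good_backoff <- function(w1, w2, gram_2, synonyms,
                                     target = "good", floor_freq = 6) {
  hit <- gram_2 %>% filter(word1 == w1, word2 == w2)
  if (nrow(hit) > 0) return(hit)                          # bigram observed directly
  if (w1 %in% synonyms && !(w1 %in% gram_2$word1)) {      # unseen synonym of "good"
    hit <- gram_2 %>% filter(word1 == target, word2 == w2)
    if (nrow(hit) > 0) return(mutate(hit, word1 = w1, freq = floor_freq))
  }
  hit                                                     # empty data frame if nothing found
}
# Example: lookup_with_good_backoff("unspoilt", "morn",
#                                   gram_2_aggr_split_backtrack, good_synonyms)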
In this section we conclude by addressing the remaining issues in Task 3.
To explore the correct N for our model, we use “time” as word1. This selection considerably reduces the size of the 4-grams and 5-grams we work with, while still adequately addressing the question we are attempting to answer.
gram_2_aggr_split_time <- gram_2_aggr_split %>% filter(word1 == "time")
gram_3_aggr_split_time <- gram_3_aggr_split %>% filter(word1 == "time")
gram_4_aggr_split_time <- read.csv("./../model/training/gram_4_aggr_split_time.txt", stringsAsFactors = FALSE)
gram_5_aggr_split_time <- read.csv("./../model/training/gram_5_aggr_split_time.txt", stringsAsFactors = FALSE)
In order to determine the highest N that should yield an accurate model, we begin with the most common word, “time,” as word1, and add one word at a time until even the most common N-grams have very low perplexity (<= 2). With this result, we should expect at worst a 50 percent error on the most common N-grams.
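To make the perplexity criterion concrete, here is a small sketch (our own helper, not part of the pipeline) that treats the observed continuation counts for a context as an empirical distribution and computes its perplexity; a value <= 2 means the context effectively narrows the next word down to about two choices.
# Sketch (our own helper): perplexity of a set of continuation counts,
# computed as 2 raised to the Shannon entropy of their relative frequencies.
context_perplexity <- function(freqs) {
  p <- freqs / sum(freqs)
  2 ^ (-sum(p * log2(p)))
}
# Example: context_perplexity((gram_3_aggr_split_time %>%
#            filter(word2 == "year"))$freq)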
The top 5 words in our current model (the 70 percent training data) are “year,” “time,” “im,” “make,” and “day.”
top5_time <- c("year","time","im","make","day")
gram_2_aggr_split_time %>% filter(word2 %in% top5_time)
## X V1 word1 word2 freq
## 1 50 50 time year 1757
## 2 91 91 time time 1292
## 3 129 129 time im 1121
## 4 182 182 time make 955
## 5 187 187 time day 947
Looking at “time year,” we find 62 3-grams; looking at “time year ago,” we find 33 4-grams.
head(gram_3_aggr_split_time %>% filter(word2 == "year") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 freq
## 1 840 840 time year ago 42
## 2 3154 3154 time year time 21
## 3 3569 3569 time year im 20
## 4 4140 4140 time year ive 18
## 5 6044 6044 time year love 15
## 6 6569 6569 time year ill 14
## 7 7949 7949 time year start 13
## 8 9051 9051 time year onli 12
## 9 9373 9373 time year alway 12
## 10 12029 12029 time year mani 10
count(gram_3_aggr_split_time %>% filter(word2 == "year") %>% arrange(desc(freq)))
## # A tibble: 1 x 1
## n
## <int>
## 1 62
head(gram_4_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 freq
## 1 1005 136263 time year ago day 3
## 2 11827 1803595 time year ago im 2
## 3 4965 743837 time year ago consid 1
## 4 11826 1803594 time year ago allman 1
## 5 24540 3789729 time year ago continu 1
## 6 25518 3942130 time year ago blog 1
## 7 29344 4546307 time year ago british 1
## 8 29345 4546308 time year ago lot 1
## 9 31293 4852439 time year ago drive 1
## 10 31294 4852440 time year ago work 1
count(gram_4_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
arrange(desc(freq)))
## # A tibble: 1 x 1
## n
## <int>
## 1 33
Once we look at 5-grams for this, our most common 4-gram, we reach a point of very low perplexity: only word4 = “im” has a 50/50 chance of being a correct prediction, and all others have a 100 percent chance of a correct prediction.
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
filter(word4 == "day") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 53115 8228619 time year ago day fester 1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
filter(word4 == "im") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 11254 1712954 time year ago im im 1
## 2 49311 7648354 time year ago im number 1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
filter(word4 == "drive") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 29755 4615815 time year ago drive back 1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%
filter(word4 == "work") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 29756 4615816 time year ago work show 1
Looking at “time time,” we find 45 3-grams; looking at “time time time,” we find 20 4-grams. We find 2 5-grams for “time time time time.”
head(gram_3_aggr_split_time %>% filter(word2 == "time") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 freq
## 1 2299 2299 time time time 25
## 2 3723 3723 time time make 19
## 3 11345 11345 time time life 11
## 4 12026 12026 time time alway 10
## 5 12979 12979 time time ive 10
## 6 17879 17879 time time veri 8
## 7 18715 18715 time time im 8
## 8 20544 20544 time time year 8
## 9 22621 22621 time time lol 7
## 10 23236 23236 time time onli 7
count(gram_3_aggr_split_time %>% filter(word2 == "time") %>% arrange(desc(freq)))
## # A tibble: 1 x 1
## n
## <int>
## 1 45
head(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>%
arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 freq
## 1 11712 1803480 time time time time 2
## 2 11711 1803479 time time time higher 1
## 3 13713 2105333 time time time pain 1
## 4 13714 2105334 time time time understand 1
## 5 24453 3789642 time time time circl 1
## 6 32234 5002433 time time time carolyn 1
## 7 36082 5605047 time time time becom 1
## 8 42816 6656808 time time time energi 1
## 9 44757 6960747 time time time person 1
## 10 45651 7109046 time time time fli 1
count(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>%
arrange(desc(freq)))
## # A tibble: 1 x 1
## n
## <int>
## 1 20
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>%
filter(word4 == "time") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 11142 1712842 time time time time higher 1
## 2 30655 4758591 time time time time carolyn 1
We find a single 5-gram each for our two most frequent 4-grams, “time time make salsa” and “time time make work.”
head(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%
arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 freq
## 1 896 136154 time time make salsa 2
## 2 58985 9165679 time time make work 2
## 3 6869 1048094 time time make overtim 1
## 4 11705 1803473 time time make consist 1
## 5 31177 4852323 time time make bad 1
## 6 38940 6053714 time time make huge 1
## 7 44752 6960742 time time make juic 1
## 8 45648 7109043 time time make decis 1
## 9 51639 8032956 time time make purer 1
## 10 53617 8337843 time time make import 1
count(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%
arrange(desc(freq)))
## # A tibble: 1 x 1
## n
## <int>
## 1 17
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%
filter(word4 == "salsa") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 850 129156 time time make salsa verd 2
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%
filter(word4 == "work") %>% arrange(desc(freq)),10)
## X V1 word1 word2 word3 word4 word5 freq
## 1 56157 8719875 time time make work harder 2
We therefore conclude that N=5 will yield a model with sufficiently high predictive accuracy.
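As a closing sketch of how these tables could be combined into a simple back-off predictor (the function name is ours, ties are broken by row order, and the 4- and 5-gram tables loaded above cover only word1 = “time,” so the defaults are purely illustrative):
library(dplyr)
# Sketch (our own helper): given the preceding words, try the 5-gram table
# first, then back off to 4-, 3-, and 2-grams, returning the most frequent
# continuation found. Table and column names match those used above.
predict_next <- function(words,
                         gram_2 = gram_2_aggr_split,
                         gram_3 = gram_3_aggr_split,
                         gram_4 = gram_4_aggr_split_time,
                         gram_5 = gram_5_aggr_split_time) {
  n <- length(words)
  if (n >= 4) {
    hit <- gram_5 %>%
      filter(word1 == words[n - 3], word2 == words[n - 2],
             word3 == words[n - 1], word4 == words[n]) %>%
      arrange(desc(freq))
    if (nrow(hit) > 0) return(hit$word5[1])
  }
  if (n >= 3) {
    hit <- gram_4 %>%
      filter(word1 == words[n - 2], word2 == words[n - 1], word3 == words[n]) %>%
      arrange(desc(freq))
    if (nrow(hit) > 0) return(hit$word4[1])
  }
  if (n >= 2) {
    hit <- gram_3 %>%
      filter(word1 == words[n - 1], word2 == words[n]) %>%
      arrange(desc(freq))
    if (nrow(hit) > 0) return(hit$word3[1])
  }
  hit <- gram_2 %>% filter(word1 == words[n]) %>% arrange(desc(freq))
  if (nrow(hit) > 0) return(hit$word2[1])
  NA_character_   # no match in any table
}
# Example: predict_next(c("time", "year", "ago", "day"))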