Introduction

We analyze a dataset provided by SwiftKey, which is now part of Microsoft (https://www.microsoft.com/en-us/swiftkey?activetab=pivot_1%3aprimaryr2). The dataset contains text in four languages (Finnish, Russian, German, and English); we analyze the English data here. The steps in this analysis are:

Task 2: Exploratory Data Analysis of the SwiftKey English Dataset as an N-Gram Model

Explore 1-grams, 2-grams, and 3-grams in a 70 percent training sample of the dataset to characterize the distributions of word, word-pair, and word-triplet frequencies.

Use wordnet (https://wordnet.princeton.edu/) to check whether words in our dictionary are valid English words and to explore synonyms that could extend coverage.

Task 3: Draft “Next Word” Prediction Model

Sys.setenv(JAVA_HOME="")
Sys.setenv(WNHOME="C:\\Program Files (x86)\\WordNet\\2.1\\dict")
library(textreadr)
library(tm)
library(caret)
library(tidyverse)
library(rJava)
library(RWeka)
library(knitr)
library(quanteda)
library(tidytext)
library(data.table)
library(wordnet)
setDict("C:\\Program Files (x86)\\WordNet\\2.1\\dict")

We load the data and, outside this markdown, segment it into small chunks that we analyze individually. This ‘divide and conquer’ method is a quick way to analyze the data given limited computing resources (a laptop).

en_blogs <- 
  readLines("./../course-data/en_US/en_US.blogs.txt",skipNul = TRUE,warn=FALSE)
en_news <- 
  readLines("./../course-data/en_US/en_US.news.txt",skipNul = TRUE,warn=FALSE)
en_twitter <- 
  readLines("./../course-data/en_US/en_US.twitter.txt",skipNul = TRUE,warn=FALSE)

str(en_blogs)
##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235." ...
str(en_news)
##  chr [1:77259] "He wasn't home alone, apparently." ...
str(en_twitter)
##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...

Our first step is to remove symbols. We show an example of a sentence with symbols that we need to remove. Notice the symbols before and after “gods” in “pagan gods.”

en_blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."

Here is the step in which we remove all of these non-ASCII symbols.

en_blogs <- iconv(en_blogs, "latin1", "ASCII", sub="")
en_news <- iconv(en_news, "latin1", "ASCII", sub="")
en_twitter <- iconv(en_twitter, "latin1", "ASCII", sub="")

str(en_blogs)
##  chr [1:899288] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods." ...
str(en_news)
##  chr [1:77259] "He wasn't home alone, apparently." ...
str(en_twitter)
##  chr [1:2360148] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long." ...

Here is an example of symbol removal. Notice the symbols before and after “gods” in “pagan gods” are now gone.

en_blogs[1]
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan gods."

Training, Validation, and Testing Datasets

We next select approximately 70 percent of each of the three data sources (blogs, news, twitter) for training, leaving 20 percent for later validation and 10 percent as a testing dataset (a sketch of this split follows the list below). Once we start using the validation data, we plan to incrementally add data in small chunks to model how a real-world application would be exposed to additional words and phrases not previously encountered. Because of the limitations of running R Markdown with large files, we run on very small selections within this markdown file, and run on much larger samples outside of markdown. The numbers of sentences used to train our model are:

  • “blogs”: 600,000 out of 899,288 sentences.
  • “news”: 25,000 out of 77,259 sentences.
  • “twitter”: 1,500,000 out of 2,360,148 sentences.
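Here is a minimal sketch of how such a split can be drawn for one source; the seed and the use of sample() are illustrative choices on our part, and the actual selection is done outside this markdown.

# Illustrative 70/20/10 split of the blogs lines into training,
# validation, and testing sets (seed is arbitrary).
set.seed(1234)
n <- length(en_blogs)
idx <- sample(seq_len(n))
train_idx <- idx[1:floor(0.7 * n)]
valid_idx <- idx[(floor(0.7 * n) + 1):floor(0.9 * n)]
test_idx  <- idx[(floor(0.9 * n) + 1):n]
en_blogs_train <- en_blogs[train_idx]
en_blogs_valid <- en_blogs[valid_idx]
en_blogs_test  <- en_blogs[test_idx]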

Task 2: Exploratory Data Analysis Part 1 (word counts)

First, we address the following:

Distributions of word frequencies (1-grams)

Here is the code for 1-grams running on a very small sample of the blogs data. Running on the full corpora inside markdown is very slow, so we show the code on a very small sample of the blogs text and run on all of our training data outside of markdown.

matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
  #corpus to TDM
  corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
  myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, 
                             removeWords,c(stopwords(source = "smart"),"english"))
  myTDM_blogs_01 = TermDocumentMatrix(myCorpus_blogs_01,
                                      control = list(minWordLength = 1))
  matrix_TDM <- as.matrix(myTDM_blogs_01)
  matrixsums_TDM  <- rowSums(matrix_TDM)
  matrixDF_TDM  <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
  matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
  matrixDF_TDM_save <- matrixDF_TDM_all
}

Here are the word frequencies for this very small sample, with a table for the data we have plotted.

matrixDF_TDM_toplot <- aggregate(freq ~ word, 
                                 data=matrixDF_TDM_all,FUN=sum, 
                                 na.rm = TRUE) 
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
head(matrixDF_TDM_toplot)
##      word freq
## 311  time   20
## 228 peopl   14
## 242  veri   12
## 211  make   11
## 174 thing   10
## 179  love    9
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, freq >= 7)
matrixDF_TDM_toplot_filter %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(word, freq)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

kable(matrixDF_TDM_toplot_filter) 
word freq
time 20
peopl 14
veri 12
make 11
thing 10
love 9
work 9
face 8
tri 8
imag 8
made 8
good 8
today 7
onli 7
year 7
mani 7
nice 7
bad 7

Distributions of word frequencies (2-grams)

Here is the code for 2-grams running on a very small sample of the blogs data.

BigramTokenizer_2gram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
  #corpus to TDM
  corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
  myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, 
                             removeWords,c(stopwords(source = "smart"),"english"))
  txtTdmBi_blogs_2gram <- TermDocumentMatrix(myCorpus_blogs_01, 
                                             control = list(tokenize = BigramTokenizer_2gram))
  
  matrix_TDM <- as.matrix(txtTdmBi_blogs_2gram)
  matrixsums_TDM  <- rowSums(matrix_TDM)
  matrixDF_TDM  <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
  matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
  matrixDF_TDM_save <- matrixDF_TDM_all
}

Here are the 2-gram frequencies for this very small sample, shown as a table.

matrixDF_TDM_toplot <- aggregate(freq ~ word, 
                                 data=matrixDF_TDM_all,FUN=sum, 
                                 na.rm = TRUE) 
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, freq >= 2)
kable(matrixDF_TDM_toplot_filter) 
word freq
mani time 3
start rub 2
greet card 2
becaus mani 2
john barricelli 2
south norwalk 2
watch peopl 2
stuff veri 2
veri nice 2
twinkl light 2
lovin peopl 2
nice person 2
veri bad 2
wast time 2
home mom 2
devic deton 2
winter time 2
impress plate 2

Distributions of word frequencies (3-grams)

Here is the code for 3-grams running on a very small sample of the blogs data.

BigramTokenizer_3gram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

matrixDF_TDM_save <- data.frame(word = "", freq = 0 )
matrixDF_TDM_all <- data.frame(word = "", freq = 0 )
for (i in 1:100) {
  #corpus to TDM
  corpus_blogs_01 <- VCorpus(VectorSource(en_blogs[i]))
  myCorpus_blogs_01 = tm_map(corpus_blogs_01, content_transformer(tolower))
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removePunctuation)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, removeNumbers)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, stemDocument)
  myCorpus_blogs_01 = tm_map(myCorpus_blogs_01, 
                             removeWords,c(stopwords(source = "smart"),"english"))
  txtTdmBi_blogs_3gram <- TermDocumentMatrix(myCorpus_blogs_01, 
                                             control = list(tokenize = BigramTokenizer_3gram))
  
  matrix_TDM <- as.matrix(txtTdmBi_blogs_3gram)
  matrixsums_TDM  <- rowSums(matrix_TDM)
  matrixDF_TDM  <- data.frame(word=names(matrixsums_TDM),freq=matrixsums_TDM)
  matrixDF_TDM_all <- rbind(matrixDF_TDM,matrixDF_TDM_save)
  matrixDF_TDM_save <- matrixDF_TDM_all
}

Here are the 3-gram frequencies for this very small sample. Almost all 3-grams occur only once, so we show the top of the sorted table and a filtered selection (3-grams whose first word begins with "ar", "as", or "at").

matrixDF_TDM_toplot <- aggregate(freq ~ word, 
                                 data=matrixDF_TDM_all,FUN=sum, 
                                 na.rm = TRUE) 
matrixDF_TDM_toplot <- matrixDF_TDM_toplot[order(-matrixDF_TDM_toplot$freq),]
head(matrixDF_TDM_toplot)
##                   word freq
## 1   acceler high kinet    1
## 2  achiev nozzl effect    1
## 3     addit fit precis    1
## 4 amount water acceler    1
## 5      ankl maxim turn    1
## 6     axi extrem outer    1
matrixDF_TDM_toplot_filter <- filter(matrixDF_TDM_toplot, str_detect(word, "^a[rst]")) 
kable(matrixDF_TDM_toplot_filter)
word freq
art deco interior 1
ate big geek 1
artist esperanza spald 1
area plan guidelin 1
arti butler organ 1
assist mother organ 1
assert honest outspoken 1
armchair toe lift 1
attend companyfund xmas 1
asianmiddl eastern descent 1
area warn obama 1
atom bomb end 1
attend friend kendra 1
area abl skip 1
astound support marshal 1
assam waffl leaf 1
artist kind turn 1

For some of the exploratory work, and also because we need to understand the SwiftKey dataset, we want to know how word frequencies are correlated between small selections of the twitter, blogs, and news subsets of the data, both within a source (e.g., one twitter selection compared to another) and across sources (e.g., twitter compared to blogs).

We load small samples of each of the three subsets:

  • Twitter
gram_1_twitter_01 <- read.csv("./../skims/gram_1_twitter/gram_1_twitter_01_1.txt")
gram_1_twitter_02 <- read.csv("./../skims/gram_1_twitter/gram_1_twitter_01_2.txt")

gram_1_twitter_01 <- gram_1_twitter_01 %>% mutate(doc = "doc_01_1", rank = row_number(), 
                                                      term_freq = freq/sum(freq), 
                                                      logtf = log10(term_freq), 
                                                      logrank = log10(rank))
gram_1_twitter_02 <- gram_1_twitter_02 %>% mutate(doc = "doc_01_2", rank = row_number(),
                                                      term_freq = freq/sum(freq), 
                                                      logtf = log10(term_freq), 
                                                      logrank = log10(rank))

head(gram_1_twitter_01)
##     X   word freq      doc rank   term_freq     logtf   logrank
## 1  47   love 1004 doc_01_1    1 0.008852054 -2.052956 0.0000000
## 2  73    day  915 doc_01_1    2 0.008067360 -2.093269 0.3010300
## 3 238   good  866 doc_01_1    3 0.007635338 -2.117172 0.4771213
## 4 259   dont  746 doc_01_1    4 0.006577323 -2.181951 0.6020600
## 5 106   time  706 doc_01_1    5 0.006224652 -2.205885 0.6989700
## 6 195 follow  693 doc_01_1    6 0.006110034 -2.213956 0.7781513
  • Blogs
gram_1_blogs_01 <- read.csv("./../skims/gram_1_blogs/gram_1_blogs_01_1.txt")
gram_1_blogs_02 <- read.csv("./../skims/gram_1_blogs/gram_1_blogs_01_2.txt")


gram_1_blogs_01 <- gram_1_blogs_01 %>% mutate(doc = "doc_01_1", rank = row_number(), 
                                                  term_freq = freq/sum(freq), 
                                                  logtf = log10(term_freq), logrank = log10(rank))
gram_1_blogs_02 <- gram_1_blogs_02 %>% mutate(doc = "doc_01_2", rank = row_number(),
                                                  term_freq = freq/sum(freq), 
                                                  logtf = log10(term_freq), logrank = log10(rank))

head(gram_1_blogs_01)
##     X   word freq      doc rank   term_freq     logtf   logrank
## 1 417   time 1241 doc_01_1    1 0.007010072 -2.154278 0.0000000
## 2 187   make  930 doc_01_1    2 0.005253317 -2.279566 0.3010300
## 3 120   love  758 doc_01_1    3 0.004281736 -2.368380 0.4771213
## 4 422    day  743 doc_01_1    4 0.004197005 -2.377061 0.6020600
## 5 130   year  688 doc_01_1    5 0.003886325 -2.410461 0.6989700
## 6 374 becaus  671 doc_01_1    6 0.003790297 -2.421327 0.7781513
  • News
gram_1_news <- read.csv("./../skims/gram_all_news/gram_1_news.txt")

gram_1_news_corr <- gram_1_news %>% mutate(doc = "doc_01_1", rank = row_number(), 
                                                term_freq = freq/sum(freq), 
                                                logtf = log10(term_freq), logrank = log10(rank))

head(gram_1_news)
##     X  word freq
## 1 175  year 2747
## 2 133  time 1664
## 3  96 state 1642
## 4 102  work 1271
## 5 280  make 1268
## 6 675 peopl 1248

Here are the within-source correlations, comparing one twitter selection with another and one blogs selection with another.

par(mfrow=c(1,2))
freq_join_gram_1_twitter <- merge(gram_1_twitter_01, gram_1_twitter_02, by = "word")
freq_join_gram_1_twitter <- filter(freq_join_gram_1_twitter,  freq.x != 0)
head(freq_join_gram_1_twitter)
##      word   X.x freq.x    doc.x rank.x  term_freq.x   logtf.x logrank.x   X.y
## 1  aaawww 15211      1 doc_01_1  15318 8.816787e-06 -5.054690  4.185202  7721
## 2   aaron  3426      7 doc_01_1   2361 6.171751e-05 -4.209592  3.373096  3602
## 3 abandon 15204      3 doc_01_1   4841 2.645036e-05 -4.577568  3.684935  9220
## 4   abbey  2866      6 doc_01_1   2621 5.290072e-05 -4.276538  3.418467 13214
## 5    abbi 15690      1 doc_01_1  15766 8.816787e-06 -5.054690  4.197722  3161
## 6     abc  9609      5 doc_01_1   3235 4.408394e-05 -4.355720  3.509874  5133
##   freq.y    doc.y rank.y  term_freq.y   logtf.y logrank.y
## 1      1 doc_01_2   9723 8.743781e-06 -5.058301  3.987800
## 2     12 doc_01_2   1601 1.049254e-04 -3.979119  3.204391
## 3      3 doc_01_2   4590 2.623134e-05 -4.581179  3.661813
## 4      2 doc_01_2   6605 1.748756e-05 -4.757271  3.819873
## 5      1 doc_01_2   7623 8.743781e-06 -5.058301  3.882126
## 6     13 doc_01_2   1517 1.136692e-04 -3.944357  3.180986
plot(freq_join_gram_1_twitter$freq.x,freq_join_gram_1_twitter$freq.y, 
     type = "p", col = "black", lwd = 1, 
     xlab = "Twitter Selection 1",
     ylab = "Twitter Selection 2")

cor.test(freq_join_gram_1_twitter$freq.x,freq_join_gram_1_twitter$freq.y)
## 
##  Pearson's product-moment correlation
## 
## data:  freq_join_gram_1_twitter$freq.x and freq_join_gram_1_twitter$freq.y
## t = 727.94, df = 8083, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9921245 0.9927799
## sample estimates:
##       cor 
## 0.9924593
freq_join_gram_1_blogs <- merge(gram_1_blogs_01, gram_1_blogs_02, by = "word")
freq_join_gram_1_blogs <- filter(freq_join_gram_1_blogs,  freq.x != 0)
head(freq_join_gram_1_blogs)
##      word   X.x freq.x    doc.x rank.x  term_freq.x   logtf.x logrank.x   X.y
## 1     aam  6275      1 doc_01_1  11330 5.648728e-06 -5.248049  4.054230 13858
## 2   aaron  1137      4 doc_01_1   4877 2.259491e-05 -4.645989  3.688153  2048
## 3   aback 16350      1 doc_01_1  17093 5.648728e-06 -5.248049  4.232818  9124
## 4 abandon  3130     14 doc_01_1   2233 7.908219e-05 -4.101921  3.348889  2000
## 5    abat 16592      1 doc_01_1  17279 5.648728e-06 -5.248049  4.237519  8635
## 6    abba 17127      1 doc_01_1  17698 5.648728e-06 -5.248049  4.247924  9534
##   freq.y    doc.y rank.y  term_freq.y   logtf.y logrank.y
## 1      1 doc_01_2  15321 5.601237e-06 -5.251716  4.185287
## 2     12 doc_01_2   2456 6.721484e-05 -4.172535  3.390228
## 3      3 doc_01_2   6409 1.680371e-05 -4.774595  3.806790
## 4     15 doc_01_2   2121 8.401855e-05 -4.075625  3.326541
## 5      2 doc_01_2   8128 1.120247e-05 -4.950686  3.909984
## 6      5 doc_01_2   4702 2.800618e-05 -4.552746  3.672283
plot(freq_join_gram_1_blogs$freq.x,freq_join_gram_1_blogs$freq.y, 
     type = "p", col = "black", lwd = 1, 
     xlab = "Blogs Selection 1",
     ylab = "Blogs Selection 2")

cor.test(freq_join_gram_1_blogs$freq.x,freq_join_gram_1_blogs$freq.y)
## 
##  Pearson's product-moment correlation
## 
## data:  freq_join_gram_1_blogs$freq.x and freq_join_gram_1_blogs$freq.y
## t = 730.23, df = 10900, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9895502 0.9903025
## sample estimates:
##       cor 
## 0.9899333

Here are the cross correlations between twitter, blogs, and news.

par(mfrow=c(1,3))

#### twitter and blogs

freq_join_gram_1_twitter_blogs <- merge(gram_1_twitter_01, gram_1_blogs_01, by = "word")
freq_join_gram_1_twitter_blogs <- filter(freq_join_gram_1_twitter_blogs,  freq.x != 0)
head(freq_join_gram_1_twitter_blogs)
##      word   X.x freq.x    doc.x rank.x  term_freq.x   logtf.x logrank.x   X.y
## 1     aam  6289      2 doc_01_1   5559 1.763357e-05 -4.753660  3.744997  6275
## 2    aand 14092      1 doc_01_1  14326 8.816787e-06 -5.054690  4.156125 15858
## 3   aaron  3426      7 doc_01_1   2361 6.171751e-05 -4.209592  3.373096  1137
## 4 abandon 15204      3 doc_01_1   4841 2.645036e-05 -4.577568  3.684935  3130
## 5   abbey  2866      6 doc_01_1   2621 5.290072e-05 -4.276538  3.418467 14388
## 6    abbi 15690      1 doc_01_1  15766 8.816787e-06 -5.054690  4.197722  9520
##   freq.y    doc.y rank.y  term_freq.y   logtf.y logrank.y
## 1      1 doc_01_1  11330 5.648728e-06 -5.248049  4.054230
## 2      1 doc_01_1  16718 5.648728e-06 -5.248049  4.223184
## 3      4 doc_01_1   4877 2.259491e-05 -4.645989  3.688153
## 4     14 doc_01_1   2233 7.908219e-05 -4.101921  3.348889
## 5      3 doc_01_1   6917 1.694618e-05 -4.770928  3.839918
## 6      7 doc_01_1   3786 3.954110e-05 -4.402951  3.578181
plot(freq_join_gram_1_twitter_blogs$freq.x,freq_join_gram_1_twitter_blogs$freq.y,
          type = "p", col = "black", lwd = 1, 
          xlab = "Twitter Selection 1",
          ylab = "Blogs Selection 1")

cor.test(freq_join_gram_1_twitter_blogs$freq.x,freq_join_gram_1_twitter_blogs$freq.y)
## 
##  Pearson's product-moment correlation
## 
## data:  freq_join_gram_1_twitter_blogs$freq.x and freq_join_gram_1_twitter_blogs$freq.y
## t = 124.84, df = 8162, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8025374 0.8174532
## sample estimates:
##       cor 
## 0.8101264
freq_join_gram_1_twitter_news <- merge(gram_1_twitter_01, gram_1_news_corr, by = "word")
freq_join_gram_1_twitter_news <- filter(freq_join_gram_1_twitter_news,  freq.x != 0)
head(freq_join_gram_1_twitter_news)
##      word   X.x freq.x    doc.x rank.x  term_freq.x   logtf.x logrank.x   X.y
## 1    aapl 11921      1 doc_01_1  12550 8.816787e-06 -5.054690  4.098644 33208
## 2   aaron  3426      7 doc_01_1   2361 6.171751e-05 -4.209592  3.373096  4301
## 3    aaup 10721      1 doc_01_1  11657 8.816787e-06 -5.054690  4.066587 22931
## 4 abandon 15204      3 doc_01_1   4841 2.645036e-05 -4.577568  3.684935 10687
## 5   abbey  2866      6 doc_01_1   2621 5.290072e-05 -4.276538  3.418467 23819
## 6    abbi 15690      1 doc_01_1  15766 8.816787e-06 -5.054690  4.197722 23527
##   freq.y    doc.y rank.y  term_freq.y   logtf.y logrank.y
## 1      1 doc_01_1  33330 2.373690e-06 -5.624576  4.522835
## 2     32 doc_01_1   2368 7.595808e-05 -4.119426  3.374382
## 3      1 doc_01_1  25299 2.373690e-06 -5.624576  4.403103
## 4     41 doc_01_1   1995 9.732129e-05 -4.011792  3.299943
## 5      4 doc_01_1   9755 9.494760e-06 -5.022516  3.989227
## 6      3 doc_01_1  11881 7.121070e-06 -5.147455  4.074853
plot(freq_join_gram_1_twitter_news$freq.x,freq_join_gram_1_twitter_news$freq.y,
          type = "p", col = "black", lwd = 1, 
          xlab = "Twitter Selection 1",
          ylab = "News Selection 1")

cor.test(freq_join_gram_1_twitter_news$freq.x,freq_join_gram_1_twitter_news$freq.y)
## 
##  Pearson's product-moment correlation
## 
## data:  freq_join_gram_1_twitter_news$freq.x and freq_join_gram_1_twitter_news$freq.y
## t = 78.917, df = 9372, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.619526 0.643854
## sample estimates:
##       cor 
## 0.6318456
freq_join_gram_1_blogs_news <- merge(gram_1_blogs_01, gram_1_news_corr, by = "word")
freq_join_gram_1_blogs_news <- filter(freq_join_gram_1_blogs_news,  freq.x != 0)
head(freq_join_gram_1_blogs_news)
##      word   X.x freq.x    doc.x rank.x  term_freq.x   logtf.x logrank.x   X.y
## 1   aaron  1137      4 doc_01_1   4877 2.259491e-05 -4.645989  3.688153  4301
## 2   ababa  5890      1 doc_01_1  11230 5.648728e-06 -5.248049  4.050380 31513
## 3   aback 16350      1 doc_01_1  17093 5.648728e-06 -5.248049  4.232818 25703
## 4 abandon  3130     14 doc_01_1   2233 7.908219e-05 -4.101921  3.348889 10687
## 5    abat 16592      1 doc_01_1  17279 5.648728e-06 -5.248049  4.237519 31986
## 6    abba 17127      1 doc_01_1  17698 5.648728e-06 -5.248049  4.247924 16724
##   freq.y    doc.y rank.y  term_freq.y   logtf.y logrank.y
## 1     32 doc_01_1   2368 7.595808e-05 -4.119426  3.374382
## 2      1 doc_01_1  31819 2.373690e-06 -5.624576  4.502687
## 3      1 doc_01_1  27191 2.373690e-06 -5.624576  4.434425
## 4     41 doc_01_1   1995 9.732129e-05 -4.011792  3.299943
## 5      2 doc_01_1  16954 4.747380e-06 -5.323546  4.229272
## 6      3 doc_01_1  11234 7.121070e-06 -5.147455  4.050534
plot(freq_join_gram_1_blogs_news$freq.x,freq_join_gram_1_blogs_news$freq.y,
          type = "p", col = "black", lwd = 1, 
          xlab = "Blogs Selection 1",
          ylab = "Newss Selection 1")

cor.test(freq_join_gram_1_blogs_news$freq.x,freq_join_gram_1_blogs_news$freq.y)
## 
##  Pearson's product-moment correlation
## 
## data:  freq_join_gram_1_blogs_news$freq.x and freq_join_gram_1_blogs_news$freq.y
## t = 143.78, df = 12168, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7867177 0.7998864
## sample estimates:
##       cor 
## 0.7933949

Exploratory Data Analysis of the Training Dataset

All of the scripts used to generate the data we explore here are in the following github repository: (https://github.com/rubiera/RCapstone_Week2Writeup).

The N-Gram counts in the training dataset (combining twitter, blogs, and news datasets) are:

  • 1-grams: 477,305 unique words.
  • 2-grams: 7,884,566 pairs of words.
  • 3-grams: 14,923,896 triple combinations of words.

Outside of this report, we manipulate the large files in which we have aggregated the counts, write the results to csv files, and then read those csv files in here.

en_gram_1 <- 
  read.csv("./../skims/merged_grams/en_gram_1_plot.csv", stringsAsFactors = FALSE)
en_gram_2 <- 
  read.csv("./../skims/merged_grams/en_gram_2_plot.csv", stringsAsFactors = FALSE)
en_gram_3 <- 
  read.csv("./../skims/merged_grams/en_gram_3_plot.csv", stringsAsFactors = FALSE)

str(en_gram_1)
## 'data.frame':    19 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "time" "love" "day" "make" ...
##  $ freq: int  128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
str(en_gram_2)
## 'data.frame':    20 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "happi birthday" "year ago" "good morn" "follow back" ...
##  $ freq: int  6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
str(en_gram_3)
## 'data.frame':    19 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "happi mother day" "cinco de mayo" "follow follow back" "love love love" ...
##  $ freq: int  2222 722 658 547 473 448 353 337 334 326 ...

Here are the most common words found in the training dataset.

en_gram_1_plot <- filter(en_gram_1, freq > 50000) 
par(mfrow=c(1,1))
en_gram_1_plot %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(word, freq)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

kable(en_gram_1_plot) 
X V1 word freq
1 1 time 128289
2 2 love 123586
3 3 day 119183
4 4 make 102951
5 5 good 101047
6 6 dont 96947
7 7 work 84019
8 8 peopl 76762
9 9 year 74511
10 10 back 73138
11 11 great 71493
12 12 thing 70235
13 13 today 65769
14 14 follow 63389
15 15 realli 63276
16 16 onli 61547
17 17 becaus 54739
18 18 feel 52073
19 19 week 50729

Here are the most common pairs of words found in the training dataset.

en_gram_2_plot <- filter(en_gram_2, freq > 2400) 
par(mfrow=c(1,1))
en_gram_2_plot %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(word, freq)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

kable(en_gram_2_plot) 
X V1 word freq
1 1 happi birthday 6022
2 2 year ago 5505
3 3 good morn 5387
4 4 follow back 4724
5 5 good luck 4455
6 6 mother day 4254
7 7 high school 3897
8 8 everi day 3851
9 9 im gonna 3547
10 10 social media 3307
11 11 everi time 3298
12 12 long time 3152
13 13 great day 3116
14 14 good thing 3011
15 15 dont forget 2920
16 16 pleas follow 2843
17 17 mani peopl 2822
18 18 littl bit 2528
19 19 im glad 2410
20 20 make feel 2405

Here are the most common triples of words found in the training dataset.

en_gram_3_plot <- filter(en_gram_3, freq > 275) 
par(mfrow=c(1,1))
en_gram_3_plot %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(word, freq)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

kable(en_gram_3_plot) 
X V1 word freq
1 1 happi mother day 2222
2 2 cinco de mayo 722
3 3 follow follow back 658
4 4 love love love 547
5 5 st patrick day 473
6 6 happi valentin day 448
7 7 pleas pleas pleas 353
8 8 coupl week ago 337
9 9 good morn everyon 334
10 10 mother day mom 326
11 11 hope everyon great 326
12 12 follow back pleas 321
13 13 pleas follow back 307
14 14 item abov pp 305
15 15 hope great day 298
16 16 long time ago 296
17 17 ha ha ha 296
18 18 incorpor item abov 284
19 19 dont understand whi 279

We next assess how many unique words are needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the English training dataset. We use the stemmed dataset we have been using so far, and leave as a next step using an unstemmed dataset in which every word has been validated against wordnet.

The 477,305 unique 1-grams account for a total of 19,974,831 word instances. The most common kinds of tokens in our 1-gram file that are found "in the wild" but are not actually valid words are the following (a sketch of the corresponding regular-expression filters follows this list):

  • Words from other languages. Example from Spanish: “socorro,” which means “help.”
  • Common acronyms used in internet discourse, such as “lol” and “lmao.”
  • Words expanded with extra characters for effect, such as “aaannndddddd.”
  • Internet addresses, beginning with “http” or “www”
  • Words that have been merged together for effect, usually with periods, which are removed from the corpus in one of our steps, such as “and.that.sucks.” These words can be taken out with a length limit regular expression.
  • In such a large number of words, a word with a very low count (e.g., found only once) is often a misspelled word or a word from another language.
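Here is a sketch of the pattern-based selections used to build the files loaded below. The thresholds (17 characters, four repeated vowels, five consecutive consonants, the 10-character cap on words found once) and the name en_gram_1_full for the full 1-gram table handled outside this markdown are illustrative assumptions.

# Illustrative regular-expression filters over the full 1-gram table
# (en_gram_1_full is a placeholder name; thresholds are assumptions).
w <- en_gram_1_full$word
select_http_www        <- en_gram_1_full[grepl("^(http|www)", w), ]
select_long_words      <- en_gram_1_full[nchar(w) > 17, ]
select_many_vowels     <- en_gram_1_full[grepl("[aeiou]{4,}", w), ]
select_many_consonants <- en_gram_1_full[grepl("[^aeiou]{5,}", w), ]
found_once             <- en_gram_1_full[en_gram_1_full$freq == 1 & nchar(w) <= 10, "word"]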
select_http_www <- read.csv("./../skims/merged_grams/select_http_www.csv", stringsAsFactors = FALSE)
select_long_words <- read.csv("./../skims/merged_grams/select_long_words.csv", stringsAsFactors = FALSE)
select_many_vowels <- read.csv("./../skims/merged_grams/select_many_vowels.csv", stringsAsFactors = FALSE)
select_many_consonants <- read.csv("./../skims/merged_grams/select_many_consonants.csv", stringsAsFactors = FALSE)
found_once <- read.csv("./../skims/merged_grams/found_once.csv", stringsAsFactors = FALSE)

Here is a sample of some of these.

  • Internet web addresses.
select_http_www
##     X    V1                               word freq
## 1   1 11112                               http  111
## 2   2 15484                wwwradiotagrcomwfuv   61
## 3   3 15900                                www   58
## 4   4 16756                   wwwcrowdclickcom   53
## 5   5 17895                 wwwwatchnhllivecom   47
## 6   6 18337              wwwknowledgesafaricom   45
## 7   7 18598                       wwwgpartycom   44
## 8   8 21479 wwwtinyurlcomelegantlivingfurnitur   34
## 9   9 22601                  wwwsnoclothingcom   31
## 10 10 24379               wwwfreekshowradiocom   27
## 11 11 25458                        wwwzesmocom   25
## 12 12 25464                wwwussportspagescom   25
## 13 13 26013                      wwwtncnewscom   24
## 14 14 26641                 wwwbirdofprayerorg   23
## 15 15 28823             wwwwedrawthelinescagov   20
  • Very long words, some also internet web addresses.
select_long_words
##     X    V1                       word freq
## 1   1 14790         someoneiwanttomeet   66
## 2   2 15058      thatawkwardmomentwhen   64
## 3   3 15484        wwwradiotagrcomwfuv   61
## 4   4 16755 wordsyouwillneverhearmesay   53
## 5   5 16758       thoughtsduringschool   53
## 6   6 17490      whenifirstjoinedtwitt   49
## 7   7 17894     thingsireallycantstand   47
## 8   8 17895         wwwwatchnhllivecom   47
## 9   9 18337      wwwknowledgesafaricom   45
## 10 10 18592         mypastrelationship   44
## 11 11 19336         sometimesyouhaveto   41
## 12 12 19613      everyonehasthatfriend   40
## 13 13 20199    wbeyondthescoreboardnet   38
## 14 14 20483         firstsongsonshuffl   37
## 15 15 20486         incaseyoudidntknow   37
  • Words with repeated vowels for effect.
select_many_vowels
##     X    V1       word freq
## 1   1 10446     sooooo  123
## 2   2 15057    lmaoooo   64
## 3   3 16336    soooooo   55
## 4   4 20192   lmfaoooo   38
## 5   5 21471   lmaooooo   34
## 6   6 21472  lmfaooooo   34
## 7   7 26624    yaaaaay   23
## 8   8 28053 lmfaoooooo   21
## 9   9 30054     nooooo   18
## 10 10 30633  lmaoooooo   18
## 11 11 32719     yooooo   16
## 12 12 33809     hooooo   15
## 13 13 33812     yeeeee   15
## 14 14 34174    waaaaay   14
## 15 15 35228   yaaaaaay   14
  • Words with repeated consonants for effect.
select_many_consonants
##     X    V1                  word freq
## 1   1 20440                grrrrr   37
## 2   2 22187           pptpltpsstp   32
## 3   3 23855               grrrrrr   28
## 4   4 26011                plzzzz   24
## 5   5 26013         wwwtncnewscom   24
## 6   6 26377                msnbcs   23
## 7   7 29574                brrrrr   19
## 8   8 29660            wwwkrqecom   19
## 9   9 31583        wwwdrwiggyscom   17
## 10 10 33958                kstptv   15
## 11 11 34040                dwntwn   15
## 12 12 34718              bldgblog   14
## 13 13 37042  httpwwwdagnomusiccom   13
## 14 14 38661              grrrrrrr   12
## 15 15 40980 waystotellppltheyreug   11
  • Words found only once; here, examples with ten or fewer characters.
found_once
##     X          x
## 1   1 arguefight
## 2   2  arianrhod
## 3   3  armsswear
## 4   4       aroo
## 5   5   arounddo
## 6   6   aroundor
## 7   7  arquinton
## 8   8     artbas
## 9   9 aryabhatta
## 10 10  asidecook
## 11 11       aska
## 12 12    aslevel
## 13 13      aspeg
## 14 14 assembleia
## 15 15     astier
## 16 16    attatar
## 17 17  attemptth
## 18 18     augard
## 19 19        aul
## 20 20 autoerotic
## 21 21   autumnsh
## 22 22    avalabl
## 23 23    avarist
## 24 24      avial
## 25 25     aviana
## 26 26   axarquia
## 27 27      aylay
## 28 28     aztext
## 29 29     baauer
## 30 30    babenzl
## 31 31  babyactiv

We apply all of these regular expressions to exclude words that we estimate should not be in our dictionary, with the caveat that this is a draft of our final dictionary and that the criteria require considerable optimization. This rough clean-up takes us from 477,305 words to 440,682, removing 36,623 words; additional words will need to be removed before we finalize the dictionary. As a rough working dictionary, we use the dictionary minus our regular-expression criteria, and also exclude all words found only once. This leaves a 155,365-word dictionary that we can use to assess how many unique words are needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances in the English training dataset. These 155,365 words account for a total of 19,624,213 instances. 50 percent of these instances is 9,812,107 (approximately 9.8 million), and 90 percent is 17,661,792 (approximately 17.7 million). The top 600 words account for 9,776,698 instances (approximately 9.8 million), and the top 7,500 words account for 17,679,382 instances (approximately 17.7 million).

en_gram_1_round_2_2plus <- 
  read.csv("./../skims/merged_grams/en_gram_1_round_2_2plus.csv", stringsAsFactors = FALSE)
str(en_gram_1_round_2_2plus)
## 'data.frame':    155365 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "time" "love" "day" "make" ...
##  $ freq: int  128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
sum(en_gram_1_round_2_2plus$freq)
## [1] 19624213
19624213*0.5
## [1] 9812107
19624213*0.9
## [1] 17661792
sum(en_gram_1_round_2_2plus[1:600,]$freq)
## [1] 9776698
sum(en_gram_1_round_2_2plus[1:7500,]$freq)
## [1] 17679382
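An alternative way to locate such cutoffs is a cumulative sum over the frequency-sorted dictionary already loaded above; here is a small check.

# Number of top-ranked words needed to reach 50% and 90% coverage of
# all word instances (freq is already sorted in decreasing order).
coverage <- cumsum(en_gram_1_round_2_2plus$freq) / sum(en_gram_1_round_2_2plus$freq)
which(coverage >= 0.5)[1]  # first rank at or above 50% coverage
which(coverage >= 0.9)[1]  # first rank at or above 90% coverage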

Task 2: Exploratory Data Analysis Part 2 (wordnet)

We use the wordnet thesaurus to determine whether a word in our dictionary is in the English language. Because stemming leaves many of our words absent from wordnet (e.g., "happy" becomes "happi"), we generate a small sample of unstemmed word frequencies to explore checking our dictionary against wordnet. We leave a full implementation of this aspect of the prediction model for later.

wordnet_gram_1_check <- read.csv("./../model/test_03/gram_1_twitter_nostem_01_1.txt", stringsAsFactors = FALSE)
str(wordnet_gram_1_check)
## 'data.frame':    22754 obs. of  3 variables:
##  $ X   : int  45 231 70 254 203 104 33 582 31 191 ...
##  $ word: chr  "love" "good" "day" "dont" ...
##  $ freq: int  857 848 747 745 639 626 586 540 506 438 ...

Here is a rudimentary function that checks whether a word is in wordnet. Wordnet does not include stopwords, and only includes four types of words: "ADJECTIVE", "ADVERB", "NOUN", and "VERB". (From the wordnet web site:) wordnet's "excluded words include determiners, prepositions, pronouns, conjunctions, and particles". We check it for three Spanish words that are not commonly used in English ("orden," "socorro," and "serpiente"). By contrast, Spanish-origin terms such as "Cinco de Mayo," "burrito," and "taco" are commonly used in English.

checkWordnet <- function(checkword,found) {
  word <- checkword
  filter <- getTermFilter("ExactMatchFilter", word, TRUE)
  terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
  terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
  terms_NOUN <- getIndexTerms("NOUN", 1, filter)
  terms_VERB <- getIndexTerms("VERB", 1, filter)
  if(is.null(terms_ADJECTIVE) == TRUE &&
     is.null(terms_ADVERB) == TRUE &&
     is.null(terms_NOUN) == TRUE &&
     is.null(terms_VERB) == TRUE){
    print(paste0(checkword," is not in wordnet"))
    found <- FALSE
  } 
  if(!is.null(terms_ADJECTIVE) == TRUE ||
     !is.null(terms_ADVERB) == TRUE ||
     !is.null(terms_NOUN) == TRUE ||
     !is.null(terms_VERB) == TRUE){
    print(paste0(checkword," is in wordnet"))
    found <- TRUE    
  } 
  return(found)
}

checkWordnet("orden")
## [1] "orden is not in wordnet"
## [1] FALSE
checkWordnet("socorro")
## [1] "socorro is not in wordnet"
## [1] FALSE
checkWordnet("serpiente")
## [1] "serpiente is not in wordnet"
## [1] FALSE

These are words that are found in wordnet.

checkWordnet("happy")
## [1] "happy is in wordnet"
## [1] TRUE
checkWordnet("make")
## [1] "make is in wordnet"
## [1] TRUE
checkWordnet("may")
## [1] "may is in wordnet"
## [1] TRUE

We select the top 298 words in this small sample and check them in wordnet; we find 252, or 84.6 percent, of these words in wordnet. We run a smaller sample here to minimize markdown bloat: of the 68 words here, we find 57, or 83.8 percent. The words not found are mostly pronouns and internet acronyms. It is puzzling that we did not find "guys." We leave working through the quirks of wordnet for a later installment of this work.

wordnet_gram_1_check_example <- wordnet_gram_1_check %>% arrange(desc(freq)) %>% filter(freq > 150)
str(wordnet_gram_1_check_example)
## 'data.frame':    68 obs. of  3 variables:
##  $ X   : int  45 231 70 254 203 104 33 582 31 191 ...
##  $ word: chr  "love" "good" "day" "dont" ...
##  $ freq: int  857 848 747 745 639 626 586 540 506 438 ...
checkWordnet <- function(checkword,found) {
  word <- checkword
  filter <- getTermFilter("ExactMatchFilter", word, TRUE)
  terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
  terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
  terms_NOUN <- getIndexTerms("NOUN", 1, filter)
  terms_VERB <- getIndexTerms("VERB", 1, filter)
  if(is.null(terms_ADJECTIVE) == TRUE &&
     is.null(terms_ADVERB) == TRUE &&
     is.null(terms_NOUN) == TRUE &&
     is.null(terms_VERB) == TRUE){
    found <- FALSE
  } 
  if(!is.null(terms_ADJECTIVE) == TRUE ||
     !is.null(terms_ADVERB) == TRUE ||
     !is.null(terms_NOUN) == TRUE ||
     !is.null(terms_VERB) == TRUE){
    found <- TRUE    
  } 
  return(found)
}

count <- 0
for (i in 1:nrow(wordnet_gram_1_check_example)){
  outcome <- checkWordnet(wordnet_gram_1_check_example$word[i])
  if (outcome == TRUE) {
    count <- count + 1
  }
  if(outcome ==FALSE && count < 50){
    print("Not found in wordnet")
    print(wordnet_gram_1_check_example$word[i])
  }
  if(i == nrow(wordnet_gram_1_check_example)){
    print("count")
    print(count)
  }
}
## [1] "Not found in wordnet"
## [1] "dont"
## [1] "Not found in wordnet"
## [1] "lol"
## [1] "Not found in wordnet"
## [1] "youre"
## [1] "Not found in wordnet"
## [1] "haha"
## [1] "Not found in wordnet"
## [1] "hey"
## [1] "Not found in wordnet"
## [1] "guys"
## [1] "Not found in wordnet"
## [1] "ive"
## [1] "Not found in wordnet"
## [1] "lets"
## [1] "Not found in wordnet"
## [1] "gonna"
## [1] "count"
## [1] 57

For the final exploration of task 2, we explore ways to increase coverage using synonyms from wordnet.

checkWordnet_synonyms <- function(checkword) {
      word <- checkword
      filter <- getTermFilter("ExactMatchFilter", word, TRUE)
      terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
      terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
      terms_NOUN <- getIndexTerms("NOUN", 1, filter)
      terms_VERB <- getIndexTerms("VERB", 1, filter)
      
      if(!is.null(terms_ADJECTIVE) == TRUE)
      {
        print(paste0(word," is an ADJECTIVE"))
        terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 10, filter)
        print(getSynonyms(terms_ADJECTIVE[[1]]))
      }
      if(!is.null(terms_ADVERB) == TRUE)
      {
        print(paste0(word," is an ADVERB"))
        terms_ADVERB <- getIndexTerms("ADVERB", 10, filter)
        print(getSynonyms(terms_ADVERB[[1]]))
      }
      if(!is.null(terms_NOUN) == TRUE)
      {
        print(paste0(word," is a NOUN"))
        terms_NOUN <- getIndexTerms("NOUN", 10, filter)
        print(getSynonyms(terms_NOUN[[1]]))
      }
      if(!is.null(terms_VERB) == TRUE)
      {
        print(paste0(word," is a VERB"))
        terms_VERB <- getIndexTerms("VERB", 10, filter)
        print(getSynonyms(terms_VERB[[1]]))
      }

}

Checking some of the more common words in our small sample, we find many synonyms. We can use the synonyms that are in wordnet but not in our dictionary to augment our dictionary. We can assign them a term frequency similar to that of the corresponding words already in our dictionary, or we can come up with an algorithm that adapts to how common an out-of-dictionary synonym is (a sketch follows the examples below). We leave expanding the initial dictionary with wordnet synonyms for a later installment of our prediction model.

checkWordnet_synonyms("happy")
## [1] "happy is an ADJECTIVE"
## [1] "felicitous"  "glad"        "happy"       "well-chosen"
checkWordnet_synonyms("follow")
## [1] "follow is a VERB"
##  [1] "abide by"       "accompany"      "adopt"          "be"            
##  [5] "come"           "come after"     "comply"         "conform to"    
##  [9] "espouse"        "fall out"       "follow"         "keep abreast"  
## [13] "keep an eye on" "keep up"        "observe"        "play along"    
## [17] "postdate"       "pursue"         "stick to"       "stick with"    
## [21] "succeed"        "surveil"        "survey"         "take after"    
## [25] "trace"          "travel along"   "watch"          "watch over"
checkWordnet_synonyms("good")
## [1] "good is an ADJECTIVE"
##  [1] "adept"        "beneficial"   "dear"         "dependable"   "effective"   
##  [6] "estimable"    "expert"       "full"         "good"         "honorable"   
## [11] "in effect(p)" "in force(p)"  "just"         "near"         "practiced"   
## [16] "proficient"   "respectable"  "right"        "ripe"         "safe"        
## [21] "salutary"     "secure"       "serious"      "skilful"      "skillful"    
## [26] "sound"        "unspoiled"    "unspoilt"     "upright"      "well(p)"     
## [1] "good is an ADVERB"
## [1] "good"       "soundly"    "thoroughly" "well"      
## [1] "good is a NOUN"
## [1] "commodity"  "good"       "goodness"   "trade good"
checkWordnet_synonyms("back")
## [1] "back is an ADJECTIVE"
## [1] "back(a)"   "hind(a)"   "hinder(a)"
## [1] "back is an ADVERB"
## [1] "back"      "backward"  "backwards" "in reply"  "rearward"  "rearwards"
## [1] "back is a NOUN"
##  [1] "back"             "backbone"         "backrest"         "binding"         
##  [5] "book binding"     "cover"            "dorsum"           "rachis"          
##  [9] "rear"             "spinal column"    "spine"            "vertebral column"
## [1] "back is a VERB"
##  [1] "back"      "back up"   "bet on"    "endorse"   "gage"      "game"     
##  [7] "indorse"   "plump for" "plunk for" "punt"      "second"    "stake"    
## [13] "support"
checkWordnet_synonyms("wait")
## [1] "wait is a NOUN"
## [1] "delay"        "hold"         "postponement" "time lag"     "wait"        
## [6] "waiting"     
## [1] "wait is a VERB"
## [1] "await"     "expect"    "hold back" "hold off"  "look"      "wait"     
## [7] "waitress"
checkWordnet_synonyms("amazing")
## [1] "amazing is an ADJECTIVE"
## [1] "amazing"       "astonishing"   "awe-inspiring" "awesome"      
## [5] "awful"         "awing"
checkWordnet_synonyms("watch")
## [1] "watch is a NOUN"
##  [1] "lookout"     "lookout man" "picket"      "scout"       "sentinel"   
##  [6] "sentry"      "spotter"     "ticker"      "vigil"       "watch"      
## [1] "watch is a VERB"
##  [1] "ascertain"      "catch"          "check"          "determine"     
##  [5] "find out"       "follow"         "keep an eye on" "learn"         
##  [9] "look on"        "look out"       "observe"        "see"           
## [13] "take in"        "view"           "watch"          "watch out"     
## [17] "watch over"
checkWordnet_synonyms("nice")
## [1] "nice is an ADJECTIVE"
##  [1] "courteous"    "dainty"       "decent"       "discriminate" "gracious"    
##  [6] "nice"         "overnice"     "prissy"       "skillful"     "squeamish"   
## [1] "nice is a NOUN"
## [1] "Nice"
checkWordnet_synonyms("start")
## [1] "start is a NOUN"
##  [1] "beginning"       "commencement"    "first"           "get-go"         
##  [5] "head start"      "jump"            "kickoff"         "offset"         
##  [9] "outset"          "showtime"        "start"           "starting"       
## [13] "starting line"   "starting signal" "starting time"   "startle"        
## [1] "start is a VERB"
##  [1] "begin"     "bug out"   "bulge"     "bulge out" "come out"  "commence" 
##  [7] "depart"    "embark on" "get"       "get down"  "get going" "go"       
## [13] "initiate"  "jump"      "lead off"  "originate" "part"      "pop"      
## [19] "pop out"   "protrude"  "set about" "set forth" "set off"   "set out"  
## [25] "start"     "start out" "start up"  "startle"   "take off"  "take up"

Task 3: An N=3 N-Gram prediction model Part 1 (without wordnet synonyms)

We first trim each of the large N-gram count files down to a manageable size to test a draft model. We select 1-grams with frequency >= 2, 2-grams with frequency >= 20, and 3-grams with frequency >= 4:

gram_1_aggr <- read.csv("./../model/training/gram_1_aggr_freq2plus.csv", stringsAsFactors = FALSE)
gram_2_aggr <- read.csv("./../model/training/gram_2_aggr_freq20plus.csv", stringsAsFactors = FALSE)
gram_3_aggr <- read.csv("./../model/training/gram_3_aggr_freq4plus.csv", stringsAsFactors = FALSE)
str(gram_1_aggr)
## 'data.frame':    160415 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "time" "love" "day" "make" ...
##  $ freq: int  128289 123586 119183 102951 101047 96947 84019 76762 74511 73138 ...
str(gram_2_aggr)
## 'data.frame':    88756 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "happi birthday" "year ago" "good morn" "follow back" ...
##  $ freq: int  6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
str(gram_3_aggr)
## 'data.frame':    96429 obs. of  4 variables:
##  $ X   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word: chr  "happi mother day" "cinco de mayo" "follow follow back" "love love love" ...
##  $ freq: int  2222 722 658 547 473 448 353 337 334 326 ...

We split each file to separate out the words in each N-gram, and select only the 2-grams whose first word is "time." We want to follow "time" as "word1" as the N-grams increase from N=2 to N=5.

gram_2_aggr_split <- gram_2_aggr %>% separate(word, c("word1", "word2"), sep = " ")
gram_3_aggr_split <- gram_3_aggr %>% separate(word, c("word1", "word2", "word3"), sep = " ")
head(gram_2_aggr_split)
##   X V1  word1    word2 freq
## 1 1  1  happi birthday 6022
## 2 2  2   year      ago 5505
## 3 3  3   good     morn 5387
## 4 4  4 follow     back 4724
## 5 5  5   good     luck 4455
## 6 6  6 mother      day 4254
head(gram_3_aggr_split)
##   X V1  word1    word2 word3 freq
## 1 1  1  happi   mother   day 2222
## 2 2  2  cinco       de  mayo  722
## 3 3  3 follow   follow  back  658
## 4 4  4   love     love  love  547
## 5 5  5     st  patrick   day  473
## 6 6  6  happi valentin   day  448
gram_2_aggr_split <- filter(gram_2_aggr_split, word1 == "time")
head(gram_2_aggr_split,20)
##      X  V1 word1  word2 freq
## 1   50  50  time   year 1757
## 2   91  91  time   time 1292
## 3  129 129  time     im 1121
## 4  182 182  time   make  955
## 5  187 187  time    day  947
## 6  308 308  time   week  711
## 7  311 311  time   work  706
## 8  313 313  time  start  698
## 9  323 323  time    ive  685
## 10 349 349  time  befor  666
## 11 411 411  time    tri  610
## 12 439 439  time   sinc  590
## 13 458 458  time   dont  579
## 14 537 537  time   read  523
## 15 580 580  time   life  503
## 16 590 590  time   back  496
## 17 643 643  time   love  468
## 18 652 652  time  chang  460
## 19 658 658  time  today  458
## 20 665 665  time realli  454
str(gram_2_aggr_split)
## 'data.frame':    949 obs. of  5 variables:
##  $ X    : int  50 91 129 182 187 308 311 313 323 349 ...
##  $ V1   : int  50 91 129 182 187 308 311 313 323 349 ...
##  $ word1: chr  "time" "time" "time" "time" ...
##  $ word2: chr  "year" "time" "im" "make" ...
##  $ freq : int  1757 1292 1121 955 947 711 706 698 685 666 ...

We calculate the 2-gram frequencies and select 2-grams with frequency >= 300.

gram_2_aggr_time_freqs <- gram_2_aggr_split %>% mutate(term_freq2 = freq/sum(freq))
gram_2_aggr_time_freqs <- gram_2_aggr_time_freqs %>% 
  mutate(freq2 = freq) %>% select(-freq)
gram_2_aggr_time_freqs <- gram_2_aggr_time_freqs %>% filter(freq2 >= 300)
gram_2_aggr_time_freqs
##       X   V1 word1  word2  term_freq2 freq2
## 1    50   50  time   year 0.022358810  1757
## 2    91   91  time   time 0.016441424  1292
## 3   129  129  time     im 0.014265353  1121
## 4   182  182  time   make 0.012152910   955
## 5   187  187  time    day 0.012051106   947
## 6   308  308  time   week 0.009047874   711
## 7   311  311  time   work 0.008984246   706
## 8   313  313  time  start 0.008882441   698
## 9   323  323  time    ive 0.008717009   685
## 10  349  349  time  befor 0.008475223   666
## 11  411  411  time    tri 0.007762592   610
## 12  439  439  time   sinc 0.007508081   590
## 13  458  458  time   dont 0.007368100   579
## 14  537  537  time   read 0.006655468   523
## 15  580  580  time   life 0.006400957   503
## 16  590  590  time   back 0.006311878   496
## 17  643  643  time   love 0.005955562   468
## 18  652  652  time  chang 0.005853758   460
## 19  658  658  time  today 0.005828307   458
## 20  665  665  time realli 0.005777404   454
## 21  688  688  time  watch 0.005662874   445
## 22  758  758  time    ago 0.005382912   423
## 23  780  780  time   talk 0.005268382   414
## 24  836  836  time   play 0.005052048   397
## 25  865  865  time  write 0.004962969   390
## 26  891  891  time becaus 0.004899341   385
## 27  903  903  time  peopl 0.004848439   381
## 28  916  916  time   onli 0.004797536   377
## 29  948  948  time   good 0.004670281   367
## 30  954  954  time  spent 0.004644830   365
## 31  969  969  time    ill 0.004593927   361
## 32  977  977  time   find 0.004568476   359
## 33 1003 1003  time   feel 0.004504848   354
## 34 1023 1023  time famili 0.004441221   349
## 35 1126 1126  time    put 0.004212161   331
## 36 1275 1275  time togeth 0.003919473   308
## 37 1336 1336  time   move 0.003817668   300

Next word prediction for bigrams

The next word is going to have a low probability of being correct for common words like "time." The relevant concept here is perplexity, which we use loosely to mean the branching factor from "time" as "word1" to the possible next words. The branching from "time" into "word2" (the second word in the bigram) is high enough (37 continuations in the training data, after keeping only 2-grams that occur 300 times or more) that choosing the most common "word2" as the prediction will be wrong most of the time.
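For concreteness, here is a small check of that branching factor computed directly from the trimmed 2-gram file loaded above; it should reproduce the 949 unfiltered continuations of "time" seen in the str() output earlier (the 37 figure applies after the freq >= 300 filter).

# Count the distinct 2-gram continuations for each first word; this
# row count is our working proxy for perplexity.
gram_2_aggr %>%
  separate(word, c("word1", "word2"), sep = " ") %>%
  count(word1) %>%
  filter(word1 == "time")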

Next word prediction using lower order N-grams

In this section we address the following issue in Task 3:

  • Explore predictions based on the (N-1) gram to compare use of back-off to the (N-1) gram and/or the use of multiple lower order N-Grams.

We will at times encounter a 3-gram (word1, word2, word3) for which the information we have on the 2-gram (word1, word2) and/or the 2-gram (word2, word3) may increase our predictive ability. We will be testing this code in a future iteration of our model; for now, we show how it works.

First, we place the most common 2-grams in a placeholder data frame.

gram_2_aggr_split_backtrack <- gram_2_aggr %>% separate(word, c("word1", "word2"), sep = " ")
head(gram_2_aggr_split_backtrack)
##   X V1  word1    word2 freq
## 1 1  1  happi birthday 6022
## 2 2  2   year      ago 5505
## 3 3  3   good     morn 5387
## 4 4  4 follow     back 4724
## 5 5  5   good     luck 4455
## 6 6  6 mother      day 4254

As an example of the lowest perplexity, only one bigram beginning with the word "abbey" is found in our 2-grams data frame. This is the kind of low-perplexity N-gram for which we should have the highest predictive power: if someone enters the word "abbey," then based on our training data the only possible next word is "road," of Beatles fame.

gram_2_aggr_split_word_001 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 1)
head(gram_2_aggr_split_word_001)
## # A tibble: 6 x 2
##   word1          n
##   <chr>      <int>
## 1 aaron          1
## 2 ab             1
## 3 abbey          1
## 4 abe            1
## 5 abercrombi     1
## 6 abort          1
nrow(gram_2_aggr_split_word_001)
## [1] 2437
gram_2_aggr_split_backtrack %>% filter(word1 == "abbey")
##       X    V1 word1 word2 freq
## 1 86912 86912 abbey  road   20

Here is a sample 2-gram with perplexity = 5 (word1 = “agent”).

gram_2_aggr_split_word_005 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 5)
head(gram_2_aggr_split_word_005)
## # A tibble: 6 x 2
##   word1       n
##   <chr>   <int>
## 1 actor       5
## 2 adopt       5
## 3 agent       5
## 4 airport     5
## 5 alot        5
## 6 alright     5
nrow(gram_2_aggr_split_word_005)
## [1] 220
gram_2_aggr_split_backtrack %>% filter(word1 == "agent")
##       X    V1 word1   word2 freq
## 1 22892 22892 agent publish   51
## 2 35900 35900 agent  editor   38
## 3 55733 55733 agent    work   27
## 4 76703 76703 agent   agent   22
## 5 87484 87484 agent    sign   20

Here is a sample 2-gram with perplexity = 10 (word1 = “amount”).

gram_2_aggr_split_word_010 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 10)
head(gram_2_aggr_split_word_010)
## # A tibble: 6 x 2
##   word1          n
##   <chr>      <int>
## 1 accomplish    10
## 2 advic         10
## 3 ahead         10
## 4 amount        10
## 5 ancient       10
## 6 annual        10
nrow(gram_2_aggr_split_word_010)
## [1] 79
gram_2_aggr_split_backtrack %>% filter(word1 == "amount")
##        X    V1  word1   word2 freq
## 1    670   670 amount    time  451
## 2   2654  2654 amount   money  203
## 3  13658 13658 amount    work   73
## 4  20198 20198 amount   peopl   56
## 5  29301 29301 amount    food   43
## 6  30291 30291 amount   water   42
## 7  49293 49293 amount  inform   30
## 8  56773 56773 amount  weight   27
## 9  69887 69887 amount  energi   23
## 10 82894 82894 amount alcohol   20

Here is a sample 2-gram with perplexity = 20 (word1 = “collect”).

gram_2_aggr_split_word_020 <- count(gram_2_aggr_split_backtrack, word1) %>% filter(n == 20)
head(gram_2_aggr_split_word_020)
## # A tibble: 6 x 2
##   word1       n
##   <chr>   <int>
## 1 ah         20
## 2 app        20
## 3 began      20
## 4 bill       20
## 5 center     20
## 6 collect    20
nrow(gram_2_aggr_split_word_020)
## [1] 29
gram_2_aggr_split_backtrack %>% filter(word1 == "collect")
##        X    V1   word1   word2 freq
## 1  18437 18437 collect bargain   60
## 2  30332 30332 collect   short   42
## 3  31390 31390 collect    work   41
## 4  32554 32554 collect    book   40
## 5  33758 33758 collect    year   39
## 6  34093 34093 collect    dust   39
## 7  39631 39631 collect  includ   35
## 8  41262 41262 collect    call   34
## 9  42129 42129 collect   thing   34
## 10 47405 47405 collect    love   31
## 11 52748 52748 collect    data   29
## 12 53680 53680 collect   stori   28
## 13 57423 57423 collect    make   27
## 14 61159 61159 collect   donat   26
## 15 72705 72705 collect develop   23
## 16 73323 73323 collect    item   22
## 17 75291 75291 collect  inform   22
## 18 75292 75292 collect    mani   22
## 19 83039 83039 collect   money   20
## 20 84756 84756 collect   avail   20

As we progress to N-grams with higher perplexity in our model, the required N increases. We explore this issue after exploring back-tracking to lower order N-grams and how to tackle missing words using wordnet synonyms.
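Here is a minimal sketch of the back-off idea over the trimmed tables loaded above: look up the (word1, word2) context in the 3-gram table, fall back to the 2-grams keyed on word2 if nothing is found, and finally fall back to the most frequent 1-grams. This only illustrates the control flow; it is not our final model (no smoothing or weighting yet).

# Simple back-off next-word lookup over the trimmed N-gram tables.
predict_next <- function(w1, w2, top_n = 3) {
  hits3 <- gram_3_aggr_split %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(freq))
  if (nrow(hits3) > 0) return(head(hits3$word3, top_n))
  hits2 <- gram_2_aggr_split_backtrack %>% filter(word1 == w2) %>% arrange(desc(freq))
  if (nrow(hits2) > 0) return(head(hits2$word2, top_n))
  head(gram_1_aggr$word, top_n)  # unigram fallback
}

# e.g. predict_next("happi", "mother") should rank "day" first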

Generation of frequencies for predicting the third word

We use the 37 high-frequency 2-grams with word1 = "time" as an example of how to combine a trigram frequency measure with the frequencies of its two component 2-grams, (word1, word2) and (word2, word3). We do not execute this code in this markdown report; instead we generate output files that we read back in.

gram_3_aggr_time <- 0
gram_2_aggr_time_backout <- 0
for (i in 1:nrow(gram_2_aggr_time_freqs)) {
  # fresh accumulator for this (word1, word2) pair; the placeholder row created
  # by the first rbind() is dropped below with filter(freq != 0)
  gram_3_aggr_time <- 0
  for (j in 1:nrow(gram_3_aggr_split)) {
    if (gram_2_aggr_time_freqs$word1[i] == gram_3_aggr_split$word1[j] &&
        gram_2_aggr_time_freqs$word2[i] == gram_3_aggr_split$word2[j]) {

      # back-track to the (word2, word3) 2-gram and fetch its frequency;
      # freq23 stays 0 if that bigram is absent from the back-track table
      freq23 <- 0
      gram_2_aggr_time_backout <- 0
      for (k in 1:nrow(gram_2_aggr_split_backtrack)) {
        if (gram_2_aggr_split_backtrack$word1[k] == gram_3_aggr_split$word2[j] &&
            gram_2_aggr_split_backtrack$word2[k] == gram_3_aggr_split$word3[j]) {
          gram_2_aggr_time_backout <- gram_2_aggr_split_backtrack[k, ] %>%
            rename(freq23 = freq)
          freq23 <- gram_2_aggr_time_backout$freq23
        }#back track
      }#k for loop

      # combine the (word1, word2) frequency, the (word2, word3) frequency,
      # and the matching trigram row
      freq2 <- gram_2_aggr_time_freqs[i, 5]
      temp_dataset <- cbind(freq2, freq23, gram_3_aggr_split[j, ])
      gram_3_aggr_time <- rbind(gram_3_aggr_time, temp_dataset)
      rm(temp_dataset)
    }#matching (word1, word2) prefix
  }#j for loop

  # drop the placeholder row, sort by trigram frequency, and rename freq to freq3
  gram_3_aggr_time <- filter(gram_3_aggr_time, freq != 0)
  gram_3_aggr_time <- gram_3_aggr_time %>% arrange(desc(freq))
  gram_3_aggr_time <- gram_3_aggr_time %>% mutate(freq3 = freq) %>% select(-freq)

  # relative frequencies of the trigram and of its back-off (word2, word3) bigram
  gram_3_aggr_time_freqs <- gram_3_aggr_time %>% mutate(term_freq3 = freq3/sum(freq3))
  gram_3_aggr_time_freqs <- gram_3_aggr_time_freqs %>% 
    mutate(term_freq23 = freq23/sum(freq23)) 
  write.csv(gram_3_aggr_time_freqs,
            paste0("./../model/training/test_01_time/gram_3_aggr_time_",
                   gram_2_aggr_time_freqs$word2[i], "_freqs.csv"))

}#i for loop

Analysis of a trigram for perplexity and lambdas

Once we have the relative frequencies of a trigram, we can decide how to combine these frequencies, and with which weights. These weights are referred to as lambdas in the NLP literature. The frequencies involved are:

  • freq2 : frequency of (word1,word2)
  • freq23 : frequency of (word2,word3)
  • freq3 : frequency of (word1,word2,word3)

In the example of word1 = “time” and word2 = “day”, the two highest freq3 values are 13, for word3 = “im” and word3 = “time”. The choice with higher predictive power for this example is “im”, or “I am”. In a more-optimized version of our language model, we plan to investigate whether and how to incorporate pronouns; wordnet, for instance, excludes them, and we may opt to treat pronouns as stopwords. The trigram relative frequency for “time day im” is only about 7 percent, due to the high perplexity of this trigram.

time_day <- read.csv("./../model/training/test_01_time/gram_3_aggr_time_day_freqs.csv")
time_day
##     X freq2 freq23    V1 word1 word2  word3 freq3 term_freq3 term_freq23
## 1   1   947   1093  7899  time   day     im    13 0.06989247 0.084329913
## 2   2   947    667  8148  time   day   time    13 0.06989247 0.051462079
## 3   3   947   1353 10879  time   day    day    11 0.05913978 0.104390093
## 4   4   947    448 12182  time   day  night    10 0.05376344 0.034565234
## 5   5   947   1073 13087  time   day  befor    10 0.05376344 0.082786822
## 6   6   947    439 18131  time   day   good     8 0.04301075 0.033870843
## 7   7   947    478 18712  time   day  start     8 0.04301075 0.036879870
## 8   8   947    566 18876  time   day   dont     8 0.04301075 0.043669470
## 9   9   947    321 19181  time   day  everi     8 0.04301075 0.024766608
## 10 10   947    322 22612  time   day becaus     7 0.03763441 0.024843762
## 11 11   947    760 24863  time   day    ago     7 0.03763441 0.058637451
## 12 12   947   1101 26447  time   day   work     7 0.03763441 0.084947149
## 13 13   947   1201 30221  time   day   week     6 0.03225806 0.092662603
## 14 14   947    588 30614  time   day   make     6 0.03225806 0.045366870
## 15 15   947    296 33013  time   day    lol     6 0.03225806 0.022837744
## 16 16   947    190 34853  time   day  thing     6 0.03225806 0.014659363
## 17 17   947    671 35628  time   day   love     6 0.03225806 0.051770697
## 18 18   947    323 41551  time   day   onli     5 0.02688172 0.024920917
## 19 19   947    174 48472  time   day   hour     5 0.02688172 0.013424890
## 20 20   947    121 61604  time   day   talk     4 0.02150538 0.009335699
## 21 21   947     68 64544  time   day listen     4 0.02150538 0.005246509
## 22 22   947     46 66350  time   day  minut     4 0.02150538 0.003549109
## 23 23   947     76 66351  time   day   morn     4 0.02150538 0.005863745
## 24 24   947     76 72689  time   day  adult     4 0.02150538 0.005863745
## 25 25   947    174 73290  time   day   mani     4 0.02150538 0.013424890
## 26 26   947    148 76318  time   day   read     4 0.02150538 0.011418872
## 27 27   947    119 81016  time   day    sit     4 0.02150538 0.009181390
## 28 28   947     69 81491  time   day differ     4 0.02150538 0.005323663

In the example of word1 = “time” and word2 = “write”, the highest-frequency word3 yields a 20.9 percent success rate, which is much higher than in the previous example due to the lower perplexity of this trigram.

time_write <- read.csv("./../model/training/test_01_time/gram_3_aggr_time_write_freqs.csv")
time_write
##   X freq2 freq23    V1 word1 word2  word3 freq3 term_freq3 term_freq23
## 1 1   390    371 15217  time write   blog     9 0.20930233  0.17624703
## 2 2   390    310 18137  time write   post     8 0.18604651  0.14726841
## 3 3   390    264 32521  time write   song     6 0.13953488  0.12541568
## 4 4   390    221 33256  time write letter     6 0.13953488  0.10498812
## 5 5   390    269 41563  time write someth     5 0.11627907  0.12779097
## 6 6   390    553 47900  time write   book     5 0.11627907  0.26270784
## 7 7   390    117 80762  time write   time     4 0.09302326  0.05558195

Task 3: An N=3 Prediction Model Part 2 (with wordnet synonyms)

In this section we address a further issue in Task 3: handling words that do not appear in our training N-grams by using wordnet synonyms.

We have previously shown the use of wordnet. In this section, we expand on that work by generating the pre-stemming (unstemmed) 1-grams for all of the testing data in the blogs and twitter files. The generated file has 558,809 “words,” some of which still need to be trimmed using regular expressions. We leave that work for a later installment and instead select from this file the words with frequency >= 3. For this example, we use the word “good.”

gram_1_aggr_no_stem_sel <- read.csv("./../skims/no_stem/dataset_gram_1_aggregated_no_stem_select.txt",
                                    stringsAsFactors = FALSE)
head(gram_1_aggr_no_stem_sel,20)
##     X   word   freq
## 1   1   time 107875
## 2   2   love  98437
## 3   3   good  97317
## 4   4   dont  96148
## 5   5    day  92261
## 6   6 people  73163
## 7   7   back  70807
## 8   8  great  69554
## 9   9   make  64528
## 10 10  today  61114
## 11 11   work  51539
## 12 12   life  47997
## 13 13    lol  44744
## 14 14  happy  40311
## 15 15 things  39691
## 16 16   year  38713
## 17 17  night  38607
## 18 18    ive  38030
## 19 19   made  36820
## 20 20   week  35904
checkWordnet <- function(checkword) {
  # Returns TRUE if checkword appears in wordnet under any of the four
  # parts of speech it indexes, FALSE otherwise.
  filter <- getTermFilter("ExactMatchFilter", checkword, TRUE)
  terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
  terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
  terms_NOUN <- getIndexTerms("NOUN", 1, filter)
  terms_VERB <- getIndexTerms("VERB", 1, filter)
  if (is.null(terms_ADJECTIVE) &&
      is.null(terms_ADVERB) &&
      is.null(terms_NOUN) &&
      is.null(terms_VERB)) {
    print(paste0(checkword, " is not in wordnet"))
    found <- FALSE
  } else {
    print(paste0(checkword, " is in wordnet"))
    found <- TRUE
  }
  return(found)
}
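
# A hypothetical call (not executed in this report) illustrating the intended use
# of checkWordnet: since "good" is in wordnet, as the synonym listing below
# confirms, this call should print "good is in wordnet" and return TRUE.
# checkWordnet("good")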

checkWordnet_synonyms <- function(checkword) {
  # Prints the part(s) of speech of checkword and the wordnet synonyms for each.
  filter <- getTermFilter("ExactMatchFilter", checkword, TRUE)
  terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 1, filter)
  terms_ADVERB <- getIndexTerms("ADVERB", 1, filter)
  terms_NOUN <- getIndexTerms("NOUN", 1, filter)
  terms_VERB <- getIndexTerms("VERB", 1, filter)
  
  if (!is.null(terms_ADJECTIVE)) {
    print(paste0(checkword, " is an ADJECTIVE"))
    terms_ADJECTIVE <- getIndexTerms("ADJECTIVE", 10, filter)
    print(getSynonyms(terms_ADJECTIVE[[1]]))
  }
  if (!is.null(terms_ADVERB)) {
    print(paste0(checkword, " is an ADVERB"))
    terms_ADVERB <- getIndexTerms("ADVERB", 10, filter)
    print(getSynonyms(terms_ADVERB[[1]]))
  }
  if (!is.null(terms_NOUN)) {
    print(paste0(checkword, " is a NOUN"))
    terms_NOUN <- getIndexTerms("NOUN", 10, filter)
    print(getSynonyms(terms_NOUN[[1]]))
  }
  if (!is.null(terms_VERB)) {
    print(paste0(checkword, " is a VERB"))
    terms_VERB <- getIndexTerms("VERB", 10, filter)
    print(getSynonyms(terms_VERB[[1]]))
  }
  
}

checkWordnet_synonyms("good")
## [1] "good is an ADJECTIVE"
##  [1] "adept"        "beneficial"   "dear"         "dependable"   "effective"   
##  [6] "estimable"    "expert"       "full"         "good"         "honorable"   
## [11] "in effect(p)" "in force(p)"  "just"         "near"         "practiced"   
## [16] "proficient"   "respectable"  "right"        "ripe"         "safe"        
## [21] "salutary"     "secure"       "serious"      "skilful"      "skillful"    
## [26] "sound"        "unspoiled"    "unspoilt"     "upright"      "well(p)"     
## [1] "good is an ADVERB"
## [1] "good"       "soundly"    "thoroughly" "well"      
## [1] "good is a NOUN"
## [1] "commodity"  "good"       "goodness"   "trade good"

We check the single-word synonyms of “good” in our 1-gram and 2-gram files. One way to incorporate the single-word synonyms of “good” that do not appear in our data is to replace them with “good” in the respective N-gram, and to assign these unseen N-grams the lowest frequency observed among the synonyms of “good” that do appear in our 1-gram and 2-gram files, which in this case is freq = 6. We plan to automate this procedure in the future; a sketch of this substitution follows the code below.

good_synonyms <- c("adept","beneficial","dear","dependable","full","honorable","just","near",
                   "practiced","proficient","respectable","right","ripe","safe","salutary",
                   "secure","serious","skilful","skillful","sound","unspoiled","unspoilt",  
                   "upright","soundly","thoroughly","well","commodity","goodness")  
str(gram_2_aggr_split_backtrack)
## 'data.frame':    88756 obs. of  5 variables:
##  $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V1   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ word1: chr  "happi" "year" "good" "follow" ...
##  $ word2: chr  "birthday" "ago" "morn" "back" ...
##  $ freq : int  6022 5505 5387 4724 4455 4254 3897 3851 3547 3307 ...
gram_1_aggr %>% filter(word %in% good_synonyms)
##       X    V1     word  freq
## 1   145   145    sound 18186
## 2   222   222     full 13935
## 3   650   650     safe  6048
## 4   718   718     dear  5482
## 5  6242  6242     ripe   293
## 6  6795  6795  upright   255
## 7 11114 11114    adept   110
## 8 56575 56575 unspoilt     6
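
As a minimal sketch of the substitution described above (the helper name substitute_synonym is ours, for illustration only; it assumes the gram_1_aggr and good_synonyms objects defined above), an unseen synonym of “good” is mapped back to “good”, and the lowest frequency observed among the synonyms that do appear, freq = 6 here, is used as its assumed frequency:

# Lowest frequency observed among the synonyms of "good" present in the 1-grams
# (6 in the current training sample, from "unspoilt").
min_syn_freq <- gram_1_aggr %>%
  filter(word %in% good_synonyms) %>%
  summarise(m = min(freq)) %>%
  pull(m)

# Map a word to "good" when it is a single-word synonym of "good" not seen in our data.
substitute_synonym <- function(w, known_words, synonyms, target = "good") {
  if (!(w %in% known_words) && (w %in% synonyms)) target else w
}

substitute_synonym("goodness", gram_1_aggr$word, good_synonyms)  # -> "good"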

It seems correct to assign a very low frequency to an unseen N-gram, given the large sample we are using. In addition, the synonyms of “good” that we do find sometimes appear in N-grams with an entirely different meaning, in which case a word-for-word substitution does not fit the context or meaning of the unseen N-gram.

head(gram_2_aggr_split_backtrack %>% filter(word1 %in% good_synonyms),10)
##       X   V1 word1  word2 freq
## 1    70   70 sound   good 1424
## 2   346  346 sound  great  668
## 3   620  620  dear friend  482
## 4   685  685  full   time  445
## 5   740  740 sound    fun  429
## 6  1191 1191  full    day  320
## 7  1289 1289  full   moon  306
## 8  1356 1356  safe travel  298
## 9  2208 2208 sound awesom  226
## 10 2255 2255 sound   amaz  223

Task 3: An N>3 Prediction Model Part 3

In this section we conclude Task 3 by addressing the remaining question: how large should N be for our prediction model?

To explore the correct N for our model, we use “time” as word1. This selection considerably reduces the size of the 4-grams and 5-grams we work with, while still adequately addressing the question we are attempting to answer.

gram_2_aggr_split_time <- gram_2_aggr_split
gram_3_aggr_split_time <- gram_3_aggr_split %>% filter(word1 == "time")
gram_4_aggr_split_time <- read.csv("./../model/training/gram_4_aggr_split_time.txt", stringsAsFactors = FALSE)
gram_5_aggr_split_time <- read.csv("./../model/training/gram_5_aggr_split_time.txt", stringsAsFactors = FALSE)

In order to determine the highest N that should yield an accurate model, we begin with the most common word, “time”, as word1, and add one more word at a time until we reach very low perplexity (<= 2) for even the most common N-grams. With this result, we should expect, at worst, a 50 percent error rate on the most common N-grams.
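
As a minimal sketch of this measurement (the helper name count_continuations is ours, for illustration only; it assumes the gram_N_aggr_split_time data frames loaded above), we approximate perplexity by the number of distinct continuations observed after a given prefix:

count_continuations <- function(ngram_df, prefix) {
  df <- ngram_df
  for (i in seq_along(prefix)) {
    df <- df %>% filter(.data[[paste0("word", i)]] == prefix[i])
  }
  nrow(df)  # number of distinct observed next words for this prefix
}

count_continuations(gram_3_aggr_split_time, c("time", "year"))         # 62, as shown below
count_continuations(gram_4_aggr_split_time, c("time", "year", "ago"))  # 33, as shown below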

The five most frequent words following “time” in our current model (the 70 percent training data) are “year”, “time”, “im”, “make”, and “day”.

top5_time <-  c("year","time","im","make","day")
gram_2_aggr_split_time %>% filter(word2 %in% top5_time)
##     X  V1 word1 word2 freq
## 1  50  50  time  year 1757
## 2  91  91  time  time 1292
## 3 129 129  time    im 1121
## 4 182 182  time  make  955
## 5 187 187  time   day  947

Looking at “time year,” we find 62 3-grams; and at “time year ago,” 33 4-grams.

head(gram_3_aggr_split_time %>% filter(word2 == "year") %>% arrange(desc(freq)),10)
##        X    V1 word1 word2 word3 freq
## 1    840   840  time  year   ago   42
## 2   3154  3154  time  year  time   21
## 3   3569  3569  time  year    im   20
## 4   4140  4140  time  year   ive   18
## 5   6044  6044  time  year  love   15
## 6   6569  6569  time  year   ill   14
## 7   7949  7949  time  year start   13
## 8   9051  9051  time  year  onli   12
## 9   9373  9373  time  year alway   12
## 10 12029 12029  time  year  mani   10
count(gram_3_aggr_split_time %>% filter(word2 == "year") %>% arrange(desc(freq)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    62
head(gram_4_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>% 
       arrange(desc(freq)),10)
##        X      V1 word1 word2 word3   word4 freq
## 1   1005  136263  time  year   ago     day    3
## 2  11827 1803595  time  year   ago      im    2
## 3   4965  743837  time  year   ago  consid    1
## 4  11826 1803594  time  year   ago  allman    1
## 5  24540 3789729  time  year   ago continu    1
## 6  25518 3942130  time  year   ago    blog    1
## 7  29344 4546307  time  year   ago british    1
## 8  29345 4546308  time  year   ago     lot    1
## 9  31293 4852439  time  year   ago   drive    1
## 10 31294 4852440  time  year   ago    work    1
count(gram_4_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>% 
        arrange(desc(freq)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    33

Once we look at 5-grams extending our most common 4-grams, we reach a point of very low perplexity: only word4 = “im” leaves a 50/50 chance of a correct prediction, while all the other continuations give a 100 percent chance of a correct prediction.

head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%  
       filter(word4 == "day") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4  word5 freq
## 1 53115 8228619  time  year   ago   day fester    1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%  
       filter(word4 == "im") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4  word5 freq
## 1 11254 1712954  time  year   ago    im     im    1
## 2 49311 7648354  time  year   ago    im number    1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%  
       filter(word4 == "drive") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4 word5 freq
## 1 29755 4615815  time  year   ago drive  back    1
head(gram_5_aggr_split_time %>% filter(word2 == "year") %>% filter(word3 == "ago") %>%  
       filter(word4 == "work") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4 word5 freq
## 1 29756 4615816  time  year   ago  work  show    1

Looking at “time time,” we find 45 3-grams; and at “time time time,” 20 4-grams. We find 2 5-grams with “time time time time.”

head(gram_3_aggr_split_time %>% filter(word2 == "time") %>% arrange(desc(freq)),10)
##        X    V1 word1 word2 word3 freq
## 1   2299  2299  time  time  time   25
## 2   3723  3723  time  time  make   19
## 3  11345 11345  time  time  life   11
## 4  12026 12026  time  time alway   10
## 5  12979 12979  time  time   ive   10
## 6  17879 17879  time  time  veri    8
## 7  18715 18715  time  time    im    8
## 8  20544 20544  time  time  year    8
## 9  22621 22621  time  time   lol    7
## 10 23236 23236  time  time  onli    7
count(gram_3_aggr_split_time %>% filter(word2 == "time") %>% arrange(desc(freq)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    45
head(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>% 
       arrange(desc(freq)),10)
##        X      V1 word1 word2 word3      word4 freq
## 1  11712 1803480  time  time  time       time    2
## 2  11711 1803479  time  time  time     higher    1
## 3  13713 2105333  time  time  time       pain    1
## 4  13714 2105334  time  time  time understand    1
## 5  24453 3789642  time  time  time      circl    1
## 6  32234 5002433  time  time  time    carolyn    1
## 7  36082 5605047  time  time  time      becom    1
## 8  42816 6656808  time  time  time     energi    1
## 9  44757 6960747  time  time  time     person    1
## 10 45651 7109046  time  time  time        fli    1
count(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>% 
        arrange(desc(freq)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    20
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "time") %>%  
       filter(word4 == "time") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4   word5 freq
## 1 11142 1712842  time  time  time  time  higher    1
## 2 30655 4758591  time  time  time  time carolyn    1

We find a single 5-gram for each of the two most frequent 4-grams beginning with “time time make” (“time time make salsa” and “time time make work”).

head(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>% 
       arrange(desc(freq)),10)
##        X      V1 word1 word2 word3   word4 freq
## 1    896  136154  time  time  make   salsa    2
## 2  58985 9165679  time  time  make    work    2
## 3   6869 1048094  time  time  make overtim    1
## 4  11705 1803473  time  time  make consist    1
## 5  31177 4852323  time  time  make     bad    1
## 6  38940 6053714  time  time  make    huge    1
## 7  44752 6960742  time  time  make    juic    1
## 8  45648 7109043  time  time  make   decis    1
## 9  51639 8032956  time  time  make   purer    1
## 10 53617 8337843  time  time  make  import    1
count(gram_4_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>% 
        arrange(desc(freq)))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    17
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%  
       filter(word4 == "salsa") %>% arrange(desc(freq)),10)
##     X     V1 word1 word2 word3 word4 word5 freq
## 1 850 129156  time  time  make salsa  verd    2
head(gram_5_aggr_split_time %>% filter(word2 == "time") %>% filter(word3 == "make") %>%  
       filter(word4 == "work") %>% arrange(desc(freq)),10)
##       X      V1 word1 word2 word3 word4  word5 freq
## 1 56157 8719875  time  time  make  work harder    2

We therefore conclude that N=5 will yield a model with sufficiently high predictive accuracy.