Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, the three words might be "gym", "store", or "restaurant". In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
This project is the capstone of the Data Science Specialization offered by Johns Hopkins University (JHU) on Coursera.
In this project I use basic NLP and an n-gram model to predict the next word, given a phrase entered by the user through a Shiny app hosted on shinyapps.io.
date()  # project start date: 25th September 2019
## [1] "Thu Sep 26 23:49:28 2019"
set.seed(369)
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships we observe in the data and prepare to build our first linguistic models.
This document covers loading and summarizing the data, cleaning and tokenizing a sample corpus, exploring n-gram frequencies, and outlining the plan for the prediction model and Shiny app.
# Required R libraries:
library(stringi)
library(ggplot2)
library(data.table)
library(htmlwidgets)
library(magrittr)
library(webshot)
#webshot::install_phantomjs()
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud2)
library(wordcloud)
library(tm)
library(NLP)
library(qdap)
library(devtools)
library(plotrix)
setwd('/home/himank/Documents/JHUCapstone/final/en_US')
The data is provided by the JHU Data Science Capstone course and can be downloaded from here.
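For reproducibility, the download and extraction can be scripted as below. This is a sketch; the URL is my assumption of the course's Coursera-SwiftKey dataset link, so verify it against the course page.
# download and unpack the dataset (URL assumed; verify against the course page)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # expected to create final/en_US/ and the other language folders
}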
#reading the three files in full (only a 5000-line sample from each is used below)
news<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt")
blogs<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt")
twitter<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt")
#sampling only 5000 lines from each file for performance efficiency:
fewnews<-sample(news,5000)
fewblogs<-sample(blogs,5000)
fewtwit<-sample(twitter,5000)
#combining text from the three samples; paste() joins them element-wise into 5000 combined documents
fewtext<-paste(fewnews,fewblogs,fewtwit)
fewtext[[1]]
## [1] "The Senior Thrift & Caring Center has been caring for the needy for 33 years. They have gleaned fields, farms, food packing companies, bakeries, frozen food plants, food markets and collected food from the South Jersey Regional Food Bank. It's about cultivating a sense of responsibility - that you have a duty to report the world without bias. That you remain objective, no matter how deep into the morass or private hell you go. I can name at least a dozen different kinds of construction machinery."
sizeblogs<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt')/1024^2
sizenews<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt')/1024^2
sizetwitter<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt')/1024^2
wordsblogs<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
wordsnews<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
wordstwitter<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
charsblogs<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
charsnews<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
charstwitter<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
linesblogs<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
linesnews<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
linestwitter<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
# Displaying details of files:
data.frame(Name=c("US_blogs", "US_news", "US_twitter"),
Lines=c(linesblogs,linesnews,linestwitter),
Words=c(wordsblogs,wordsnews,wordstwitter),
Size_in_MB=c(sizeblogs,sizenews,sizetwitter))
## Name Lines Words Size_in_MB
## 1 US_blogs 899288 37334117 200.4242
## 2 US_news 1010242 34365936 196.2775
## 3 US_twitter 2360148 30373559 159.3641
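As a cross-check, and as a platform-independent alternative to shelling out to wc, roughly the same summary can be computed with stringi (already loaded above); this is a sketch, and its word counts may differ slightly from wc -w.
# platform-independent counts with stringi
data.frame(Name=c("US_blogs","US_news","US_twitter"),
           Lines=c(length(blogs),length(news),length(twitter)),
           Words=c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter))),
           Size_in_MB=c(sizeblogs,sizenews,sizetwitter))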
#making corpus
corpus<-VCorpus(VectorSource(fewtext))
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# stopword removal is skipped: common words such as "the" and "of" are exactly what a next-word predictor must be able to return
#corpus<- tm_map(corpus,removeWords,stopwords('english'))
# strip whitespaces left and right
corpus <- tm_map(corpus, stripWhitespace)
# convert all chars to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
#removing profane words (list read from a local file)
badwords<-readLines('badwords.txt')
corpus <- tm_map(corpus, removeWords, badwords)
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5000
# stemming words to their roots (e.g. "caring" -> "care")
corpus<-tm_map(corpus,stemDocument,language='english')
# flattening the corpus content into a data.table for the tokenizers below
cleanData<-data.table(text=(sapply(corpus,'[','content')),stringsAsFactors = FALSE)
unitok<-NGramTokenizer(cleanData,Weka_control(min=1,max=1))
bitok <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
head(unitok)
## [1] "list" "the" "senior" "thrift" "care" "center"
tetratok<-NGramTokenizer(cleanData,Weka_control(min=4,max=4,delimiters = " \\r\\n\\t.,;:\"()?!"))
head(bitok)
## [1] "list the" "the senior" "senior thrift" "thrift care"
## [5] "care center" "center has"
head(tritok)
## [1] "list the senior" "the senior thrift" "senior thrift care"
## [4] "thrift care center" "care center has" "center has been"
unidf<-data.table(table(unitok))
unidf<-unidf[order(-N),]
colnames(unidf)<-c('word','freq')
bitokdf<-data.table(table(bitok))
bitokdf<-bitokdf[order(-N),]
colnames(bitokdf)<-c('word','freq')
tritokdf<-data.table(table(tritok))
tritokdf<-tritokdf[order(-N),]
colnames(tritokdf)<-c('word','freq')
tetratokdf<-data.table(table(tetratok))
tetratokdf<-tetratokdf[order(-N),]
colnames(tetratokdf)<-c('word','freq')
bitokdf[1:15,]
## word freq
## 1: of the 2088
## 2: in the 1998
## 3: to the 996
## 4: on the 903
## 5: for the 834
## 6: to be 667
## 7: and the 617
## 8: at the 591
## 9: in a 573
## 10: with the 532
## 11: is a 443
## 12: it was 439
## 13: with a 414
## 14: go to 411
## 15: for a 407
tritokdf[1:15,]
## word freq
## 1: one of the 173
## 2: a lot of 127
## 3: some of the 81
## 4: part of the 78
## 5: i want to 75
## 6: be abl to 73
## 7: to be a 70
## 8: as well as 65
## 9: you want to 65
## 10: it was a 64
## 11: go to be 63
## 12: out of the 63
## 13: the end of 63
## 14: look forward to 55
## 15: the rest of 53
### 1-Gram
wordcloud2(unidf[1:2000,])
splot<-ggplot(unidf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='red',colour='blue')+
geom_text(aes(label=freq),vjust=-1)+labs(title='15 Most Frequent unigrams',x='1-Gram',y='Frequency')
splot
### 2-Gram
wc2<-wordcloud2(bitokdf[1:1000,])
saveWidget(wc2,"2.html",selfcontained = F)
webshot::webshot("2.html","2.png",vwidth = 1000, vheight = 800, delay =10)
bplot<-ggplot(bitokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='blue',colour='red')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent bigrams',x='2-Gram',y='Frequency')
bplot
### 3-Gram
wc3<-wordcloud2(tritokdf[1:200],minSize = 0.001,size=0.5)
saveWidget(wc3,"3.html",selfcontained = F)
webshot::webshot("3.html","3.png",vwidth = 1000, vheight = 800, delay =10)
tplot<-ggplot(tritokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='dark green',colour='yellow')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent trigrams',x='3-Gram',y='Frequency')
tplot
### 4-Gram
wc4<-wordcloud2(tetratokdf[1:200],minSize=0.001,size=0.3)
saveWidget(wc4,"4.html",selfcontained = F)
webshot::webshot("4.html","4.png",vwidth = 1000, vheight = 800, delay =10)
ttplot<-ggplot(tetratokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='yellow',colour='dark green')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent tetragrams',x='4-Gram',y='Frequency')
ttplot
Creating a function that returns the minimum number of tokens required to cover a given percentage of the total word frequency:
explained<-function(df,percentage){
  cumperc=0
  n=0
  totalfreq=sum(df$freq)
  for (i in 1:nrow(df)){
    perc=(df$freq[i]/totalfreq)*100
    n=n+1
    cumperc=cumperc+perc
    if(cumperc>=percentage){
      return(n)
    }
  }
}
explained(unidf,50)
## [1] 123
percentages<-seq(30,90,10)
thresholdwords<-sapply(percentages,function(p) explained(unidf,p))
qplot(percentages,thresholdwords,geom=c('line','point'))+geom_text(aes(label=thresholdwords),hjust=2,vjust=-1)
We could evaluate how many words come from foreign languages (or are simply typos) by checking the tokens against an English dictionary and counting how many would be removed; a sketch of this check is shown below. Comparing against dictionaries of other languages would identify foreign words more precisely, but that would be computationally expensive.
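A minimal sketch of that dictionary check, assuming the GradyAugmented English word list from qdapDictionaries (installed alongside the already-loaded qdap) and the unigram table built above:
# share of unigram occurrences not found in an English word list (GradyAugmented, ~122k words)
eng<-qdapDictionaries::GradyAugmented
nonenglish<-unidf[!(word %in% eng)]
sum(nonenglish$freq)/sum(unidf$freq)
Because the corpus was stemmed above, valid stems such as "abl" would also be flagged, so this check would ideally be run on the unstemmed tokens.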
We can increase coverage by converting all words to lowercase. We can also use stemming, which reduces words to their word stems; this significantly increases the frequency of the common stems and reduces the number of distinct words in the dictionary.
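For illustration, stemDocument() (already applied above) collapses inflected variants onto one stem, which is why forms such as "abl" and "care" appear in the n-gram tables:
# Porter stemming collapses inflected variants onto a single stem
stemDocument(c("caring", "cares", "able", "going"))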
The goal here is to build the first simple model for the relationship between words. This is the first step in building a predictive text mining application. We will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
We have already built 1-gram, 2-gram, 3-gram and 4-gram frequency tables for our model.
We still need to:

* Efficiently store the n-gram model and reduce its size and runtime.
* Use an appropriate n in the n-gram model; here we use n = 1, 2, 3 and 4.
* Smooth the probabilities.
* Evaluate the efficiency and accuracy of the model.
* Use backoff models to deal with unobserved n-grams (a minimal sketch of such a lookup follows this list).
* Keep in mind that currently available predictive text models run on mobile phones, which typically have limited memory and processing power compared to desktop computers.
* Finally, build a Shiny app that takes input from the user and predicts the next word with our n-gram model; the app will run on the shinyapps.io server.
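A minimal sketch of such a backoff lookup, built directly on the frequency tables above; the helper names (prep, predict_next) and the prefix/nextword columns are illustrative assumptions, not the final app code.
# a simple 'stupid backoff' style lookup over the existing n-gram frequency tables
prep<-function(dt){
  dt<-copy(dt)
  toks<-strsplit(dt$word," ",fixed=TRUE)
  dt[,prefix:=vapply(toks,function(x) paste(head(x,-1),collapse=" "),character(1))]
  dt[,nextword:=vapply(toks,function(x) tail(x,1),character(1))]
  setkey(dt,prefix)
  dt
}
bigrams<-prep(bitokdf)
trigrams<-prep(tritokdf)
tetragrams<-prep(tetratokdf)

# back off from the 4-gram table to the 2-gram table, then fall back to the top unigrams
predict_next<-function(phrase,n=3){
  toks<-strsplit(tolower(phrase),"\\s+")[[1]]
  for(dt in list(tetragrams,trigrams,bigrams)){
    k<-length(strsplit(dt$word[1]," ",fixed=TRUE)[[1]])-1  # prefix length of this table
    if(length(toks)<k) next
    hits<-dt[paste(tail(toks,k),collapse=" "),nomatch=0]
    if(nrow(hits)>0) return(head(hits[order(-freq)]$nextword,n))
  }
  head(unidf$word,n)
}
predict_next("one of")  # e.g. expected to suggest "the" given the trigram counts above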