Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:
I went to the
the keyboard presents three options for what the next word might be; for example, the three words might be "gym", "store", or "restaurant". In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.
This project is the capstone of the Data Science Specialization offered by Johns Hopkins University (JHU) on Coursera.
In this project I use basic NLP and an n-gram model to predict the next word, given a phrase entered by the user through a Shiny app hosted on shinyapps.io.
date()  # project start date: 25th September 2019
## [1] "Thu Sep 26 23:49:28 2019"
set.seed(369)
The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships we observe in the data and prepare to build our first linguistic models.
This document covers loading and summarizing the data, cleaning and tokenizing a sample corpus, exploring n-gram frequencies, and outlining the plan for the prediction model and Shiny app.
# Required R libraries:
library(stringi)
library(ggplot2)
library(data.table)
library(htmlwidgets)
library(magrittr)
library(webshot)
#webshot::install_phantomjs()
library(markdown)
library(RWeka)
library(openNLP)
library(wordcloud2)
library(wordcloud)
library(tm)
library(NLP)
library(qdap)
library(devtools)
library(plotrix)
setwd('/home/himank/Documents/JHUCapstone/final/en_US')
The data is provided by the JHU Data Science Capstone course and can be downloaded from here.
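For reproducibility, the download and extraction can be scripted as below. This is a sketch; the URL is my assumption of the course's Coursera-SwiftKey dataset link, so verify it against the course page.
# download and unpack the dataset (URL assumed; verify against the course page)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")  # expected to create final/en_US/ and the other language folders
}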
#reading the three files in full (only a 5000-line sample from each is used below)
news<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt")
blogs<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt")
twitter<-readLines("/home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt")
#sampling only 5000 lines from each file for performance efficiency:
fewnews<-sample(news,5000)
fewblogs<-sample(blogs,5000)
fewtwit<-sample(twitter,5000)
#combining text from the three samples; paste() joins them element-wise into 5000 combined documents
fewtext<-paste(fewnews,fewblogs,fewtwit)
fewtext[[1]]
## [1] "The Senior Thrift & Caring Center has been caring for the needy for 33 years. They have gleaned fields, farms, food packing companies, bakeries, frozen food plants, food markets and collected food from the South Jersey Regional Food Bank. It's about cultivating a sense of responsibility - that you have a duty to report the world without bias. That you remain objective, no matter how deep into the morass or private hell you go. I can name at least a dozen different kinds of construction machinery."
sizeblogs<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt')/1024^2
sizenews<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt')/1024^2
sizetwitter<-file.size('/home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt')/1024^2
wordsblogs<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
wordsnews<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
wordstwitter<-strsplit(system("wc -w /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
charsblogs<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
charsnews<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
charstwitter<-strsplit(system("wc -c /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
linesblogs<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.blogs.txt",intern=TRUE),' ')[[1]][1]
linesnews<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.news.txt",intern=TRUE),' ')[[1]][1]
linestwitter<-strsplit(system("wc -l /home/himank/Documents/JHUCapstone/final/en_US/en_US.twitter.txt",intern=TRUE),' ')[[1]][1]
# Displaying details of files:
data.frame(Name=c("US_blogs", "US_news", "US_twitter"),
Lines=c(linesblogs,linesnews,linestwitter),
Words=c(wordsblogs,wordsnews,wordstwitter),
Size_in_MB=c(sizeblogs,sizenews,sizetwitter))
## Name Lines Words Size_in_MB
## 1 US_blogs 899288 37334117 200.4242
## 2 US_news 1010242 34365936 196.2775
## 3 US_twitter 2360148 30373559 159.3641
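As a cross-check, and as a platform-independent alternative to shelling out to wc, roughly the same summary can be computed with stringi (already loaded above); this is a sketch, and its word counts may differ slightly from wc -w.
# platform-independent counts with stringi
data.frame(Name=c("US_blogs","US_news","US_twitter"),
           Lines=c(length(blogs),length(news),length(twitter)),
           Words=c(sum(stri_count_words(blogs)),sum(stri_count_words(news)),sum(stri_count_words(twitter))),
           Size_in_MB=c(sizeblogs,sizenews,sizetwitter))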
#making corpus
corpus<-VCorpus(VectorSource(fewtext))
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# stopword removal is skipped: common words such as "the" and "of" are exactly what a next-word predictor must be able to return
#corpus<- tm_map(corpus,removeWords,stopwords('english'))
# strip whitespaces left and right
corpus <- tm_map(corpus, stripWhitespace)
# convert all chars to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
#removing profane words (list read from a local file)
badwords<-readLines('badwords.txt')
corpus <- tm_map(corpus, removeWords, badwords)
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 5000
# stemming words to their roots (e.g. "caring" -> "care")
corpus<-tm_map(corpus,stemDocument,language='english')
# flattening the corpus content into a data.table for the tokenizers below
cleanData<-data.table(text=(sapply(corpus,'[','content')),stringsAsFactors = FALSE)
unitok<-NGramTokenizer(cleanData,Weka_control(min=1,max=1))
bitok <- NGramTokenizer(cleanData, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritok <- NGramTokenizer(cleanData, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
head(unitok)
## [1] "list" "the" "senior" "thrift" "care" "center"
tetratok<-NGramTokenizer(cleanData,Weka_control(min=4,max=4,delimiters = " \\r\\n\\t.,;:\"()?!"))
head(bitok)
## [1] "list the" "the senior" "senior thrift" "thrift care"
## [5] "care center" "center has"
head(tritok)
## [1] "list the senior" "the senior thrift" "senior thrift care"
## [4] "thrift care center" "care center has" "center has been"
unidf<-data.table(table(unitok))
unidf<-unidf[order(-N),]
colnames(unidf)<-c('word','freq')
bitokdf<-data.table(table(bitok))
bitokdf<-bitokdf[order(-N),]
colnames(bitokdf)<-c('word','freq')
tritokdf<-data.table(table(tritok))
tritokdf<-tritokdf[order(-N),]
colnames(tritokdf)<-c('word','freq')
tetratokdf<-data.table(table(tetratok))
tetratokdf<-tetratokdf[order(-N),]
colnames(tetratokdf)<-c('word','freq')
bitokdf[1:15,]
## word freq
## 1: of the 2088
## 2: in the 1998
## 3: to the 996
## 4: on the 903
## 5: for the 834
## 6: to be 667
## 7: and the 617
## 8: at the 591
## 9: in a 573
## 10: with the 532
## 11: is a 443
## 12: it was 439
## 13: with a 414
## 14: go to 411
## 15: for a 407
tritokdf[1:15,]
## word freq
## 1: one of the 173
## 2: a lot of 127
## 3: some of the 81
## 4: part of the 78
## 5: i want to 75
## 6: be abl to 73
## 7: to be a 70
## 8: as well as 65
## 9: you want to 65
## 10: it was a 64
## 11: go to be 63
## 12: out of the 63
## 13: the end of 63
## 14: look forward to 55
## 15: the rest of 53
### 1-Gram
wordcloud2(unidf[1:2000,])
splot<-ggplot(unidf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='red',colour='blue')+
geom_text(aes(label=freq),vjust=-1)+labs(title='15 Most Frequent unigrams',x='1-Gram',y='Frequency')
splot
### 2-Gram
wc2<-wordcloud2(bitokdf[1:1000,])
saveWidget(wc2,"2.html",selfcontained = F)
webshot::webshot("2.html","2.png",vwidth = 1000, vheight = 800, delay =10)
bplot<-ggplot(bitokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='blue',colour='red')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent bigrams',x='2-Gram',y='Frequency')
bplot
### 3-Gram
wc3<-wordcloud2(tritokdf[1:200],minSize = 0.001,size=0.5)
saveWidget(wc3,"3.html",selfcontained = F)
webshot::webshot("3.html","3.png",vwidth = 1000, vheight = 800, delay =10)
tplot<-ggplot(tritokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='dark green',colour='yellow')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent trigrams',x='3-Gram',y='Frequency')
tplot
### 4-Gram
wc4<-wordcloud2(tetratokdf[1:200],minSize=0.001,size=0.3)
saveWidget(wc4,"4.html",selfcontained = F)
webshot::webshot("4.html","4.png",vwidth = 1000, vheight = 800, delay =10)
ttplot<-ggplot(tetratokdf[1:15],aes(x=word,y=freq))+geom_bar(stat='identity',fill='yellow',colour='dark green')+geom_text(aes(label=freq),vjust=-1)+
theme(axis.text.x=element_text(angle = 45,hjust=1))+labs(title='15 Most Frequent tetragrams',x='4-Gram',y='Frequency')
ttplot
Creating a function that returns the minimum number of tokens required to cover a given percentage of the total word frequency:
explained<-function(df,percentage){
  cumperc=0
  n=0
  totalfreq=sum(df$freq)
  for (i in 1:nrow(df)){
    perc=(df$freq[i]/totalfreq)*100
    n=n+1
    cumperc=cumperc+perc
    if(cumperc>=percentage){
      return(n)
    }
  }
}
explained(unidf,50)
## [1] 123
percentages<-seq(30,90,10)
thresholdwords<-sapply(percentages,function(p) explained(unidf,p))
qplot(percentages,thresholdwords,geom=c('line','point'))+geom_text(aes(label=thresholdwords),hjust=2,vjust=-1)
We could evaluate how many words come from foreign languages (or are simply typos) by checking the tokens against an English dictionary and counting how many would be removed; a sketch of this check is shown below. Comparing against dictionaries of other languages would identify foreign words more precisely, but that would be computationally expensive.
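A minimal sketch of that dictionary check, assuming the GradyAugmented English word list from qdapDictionaries (installed alongside the already-loaded qdap) and the unigram table built above:
# share of unigram occurrences not found in an English word list (GradyAugmented, ~122k words)
eng<-qdapDictionaries::GradyAugmented
nonenglish<-unidf[!(word %in% eng)]
sum(nonenglish$freq)/sum(unidf$freq)
Because the corpus was stemmed above, valid stems such as "abl" would also be flagged, so this check would ideally be run on the unstemmed tokens.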
We can increase coverage by converting all words to lowercase. We can also use stemming, which reduces words to their word stems; this significantly increases the frequency of the common stems and reduces the number of distinct words in the dictionary.
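For illustration, stemDocument() (already applied above) collapses inflected variants onto one stem, which is why forms such as "abl" and "care" appear in the n-gram tables:
# Porter stemming collapses inflected variants onto a single stem
stemDocument(c("caring", "cares", "able", "going"))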
The goal here is to build the first simple model for the relationship between words. This is the first step in building a predictive text mining application. We will explore simple models and discover more complicated modeling techniques.
Tasks to accomplish
We have already built 1-gram, 2-gram, 3-gram and 4-gram frequency tables for our model.
We still need to:

* Efficiently store the n-gram model and reduce its size and runtime.
* Use an appropriate n in the n-gram model; here we use n = 1, 2, 3 and 4.
* Smooth the probabilities.
* Evaluate the efficiency and accuracy of the model.
* Use backoff models to deal with unobserved n-grams (a minimal sketch of such a lookup follows this list).
* Keep in mind that currently available predictive text models run on mobile phones, which typically have limited memory and processing power compared to desktop computers.
* Finally, build a Shiny app that takes input from the user and predicts the next word with our n-gram model; the app will run on the shinyapps.io server.
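A minimal sketch of such a backoff lookup, built directly on the frequency tables above; the helper names (prep, predict_next) and the prefix/nextword columns are illustrative assumptions, not the final app code.
# a simple 'stupid backoff' style lookup over the existing n-gram frequency tables
prep<-function(dt){
  dt<-copy(dt)
  toks<-strsplit(dt$word," ",fixed=TRUE)
  dt[,prefix:=vapply(toks,function(x) paste(head(x,-1),collapse=" "),character(1))]
  dt[,nextword:=vapply(toks,function(x) tail(x,1),character(1))]
  setkey(dt,prefix)
  dt
}
bigrams<-prep(bitokdf)
trigrams<-prep(tritokdf)
tetragrams<-prep(tetratokdf)

# back off from the 4-gram table to the 2-gram table, then fall back to the top unigrams
predict_next<-function(phrase,n=3){
  toks<-strsplit(tolower(phrase),"\\s+")[[1]]
  for(dt in list(tetragrams,trigrams,bigrams)){
    k<-length(strsplit(dt$word[1]," ",fixed=TRUE)[[1]])-1  # prefix length of this table
    if(length(toks)<k) next
    hits<-dt[paste(tail(toks,k),collapse=" "),nomatch=0]
    if(nrow(hits)>0) return(head(hits[order(-freq)]$nextword,n))
  }
  head(unidf$word,n)
}
predict_next("one of")  # e.g. expected to suggest "the" given the trigram counts above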