Mobile devices have become ubiquitous and integral to our lives. Most of the time we communicate through emails, social media, texts and messaging apps. Almost every action involves typing, and typing long texts on a small device can be strenuous. The good news is that companies such as SwiftKey are trying to make typing on mobile devices easier for us. They are building sophisticated text prediction applications using state-of-the-art techniques and concepts from natural language processing.
This capstone project tests all the skills we have acquired in the specialization. The goal is to come up with a model that can predict, with a high level of accuracy, the next word the user is likely to type.
R's memory limitations are well known, so to avoid memory issues later on we need to plan ahead and sample the data in a way that represents the population while still allowing accurate prediction of the next word.
- Corpus sources: news, blogs, twitter
- Tokenization and normalization performed on the data
- Exploratory analysis performed
- Unigram, bigram and trigram tables created using the quanteda package, then hashed and stored for reference
- The features here are words

The following script downloads the data, creates a folder and unzips the files into it:
source('dataDownload.R')
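dataDownload.R itself is not reproduced here; a minimal sketch of what it does is below. The dataset URL and the target folder name are assumptions.

```r
# Hypothetical sketch of dataDownload.R: fetch the SwiftKey corpus zip and unpack it.
zipUrl  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipFile <- file.path("data", "Coursera-SwiftKey.zip")

if (!dir.exists("data")) dir.create("data")                      # create the target folder
if (!file.exists(zipFile)) download.file(zipUrl, zipFile, mode = "wb")
unzip(zipFile, exdir = "data")                                   # unpack the text files
```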
The next script checks for the required packages and installs any that are not found:
source("packInstall.R")
# "hash","ggplot2", "shiny","quanteda","dplyr",sringi","hellno","magrittr","parallel"
source("functExplorefile.R")
# file names for input to function
blogfileName <- 'en_US/en_US.blogs.txt'
newsfileName <- 'en_US/en_US.news.txt'
twitfileName <- 'en_US/en_US.twitter.txt'
profName <- 'bad-words.txt'
The exploratory analysis covers:

- A basic summary of the text files: file name, file size, number of lines and total number of words
- The words which form 95% of the corpus
- Frequency plots for the blogs, news and twitter samples
knitr::kable(summaryDf)
| File | Size (MB) | Lines | Longest Line | Word Count | Singleton Count |
|---|---|---|---|---|---|
| en_US.blogs.txt | 210.16 | 899288 | 40833 | 37034929 | 557523 |
| en_US.news.txt | 205.81 | 1010242 | 11384 | 33836302 | 462997 |
| en_US.twitter.txt | 167.11 | 2360148 | 140 | 29557879 | 640489 |
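The summary table above can be assembled along the following lines. Since functExplorefile.R is not listed, the helper name and the use of stringi for word counting are assumptions.

```r
library(stringi)

# Hypothetical sketch: summarise one corpus file (size, line count, longest line,
# word count and the number of words occurring only once).
summariseFile <- function(path) {
  txt      <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words    <- unlist(stri_extract_all_words(txt))
  wordFreq <- table(words)
  data.frame(File           = basename(path),
             SizeMB         = round(file.size(path) / 1024^2, 2),
             Lines          = length(txt),
             LongestLine    = max(nchar(txt)),
             WordCount      = length(words),
             SingletonCount = sum(wordFreq == 1))
}

summaryDf <- do.call(rbind, lapply(c(blogfileName, newsfileName, twitfileName), summariseFile))
```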
As noted above, R's limitations are well known: because of lexical scoping, R keeps its objects in memory, so sampling becomes important.
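# txtReadb, txtReadn and txtReadt hold the lines read earlier from the blogs, news
# and twitter files; draw 40,000 random line indices from each source.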
set.seed(3456)
blog_ind <- sample(seq_len(length(txtReadb)),size=40000)
news_ind <- sample(seq_len(length(txtReadn)),size=40000)
tweet_ind <- sample(seq_len(length(txtReadt)),size=40000)
- Built using the quanteda package
- 40,000 samples from the corpus

## Unigram frequency plot
source("grams.R")
uniGram <- grams(txtspcharRemove,ngrams=1)
head(uniGram[[1]],5)
## Source: local data frame [5 x 2]
##
## Feature Frequency
## (chr) (int)
## 1 the 66766
## 2 to 52128
## 3 and 48984
## 4 a 48637
## 5 of 42433
uniGram[[2]]
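grams.R is sourced but not listed; a rough sketch of how grams() could be implemented with quanteda is shown below. It returns a frequency table and a top-features plot, matching how the function is used in this report, but the internals are an assumption.

```r
library(quanteda)
library(ggplot2)

# Hypothetical sketch of grams(): tokenize the sampled text, build an n-gram
# document-feature matrix, and return a frequency table plus a bar plot.
grams <- function(txt, ngrams = 1, top = 20) {
  toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_ngrams(toks, n = ngrams, concatenator = " ")
  freq <- sort(colSums(dfm(toks)), decreasing = TRUE)
  freqDf <- data.frame(Feature = names(freq), Frequency = as.integer(freq),
                       stringsAsFactors = FALSE)
  plt <- ggplot(head(freqDf, top),
                aes(x = reorder(Feature, Frequency), y = Frequency)) +
    geom_col() + coord_flip() + labs(x = "Feature", y = "Frequency")
  list(freqDf, plt)
}
```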
biGram <- grams(txtspcharRemove,ngrams=2)
head(biGram[[1]],5)
## Source: local data frame [5 x 2]
##
## Feature Frequency
## (chr) (int)
## 1 of the 13462
## 2 in the 12931
## 3 to the 7191
## 4 on the 6465
## 5 for the 6229
biGram[[2]]
triGram <- grams(txtspcharRemove,ngrams=3)
head(triGram[[1]],5)
## Source: local data frame [5 x 2]
##
## Feature Frequency
## (chr) (int)
## 1 one of the 1252
## 2 a lot of 1065
## 3 to be a 634
## 4 out of the 575
## 5 the end of 540
triGram[[2]]
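The n-gram tables are hashed and stored for lookup, as noted earlier; a minimal sketch of that idea with the hash package follows. The key/value layout (first word mapped to its most frequent completions) is an assumption.

```r
library(hash)

# Hypothetical sketch: key each bigram by its first word, keeping the top
# completions as the value, so prediction becomes a single hash lookup.
biDf    <- biGram[[1]]
parts   <- strsplit(biDf$Feature, " ", fixed = TRUE)
biDf$w1 <- vapply(parts, `[`, character(1), 1)
biDf$w2 <- vapply(parts, `[`, character(1), 2)

biHash <- hash()
for (w in unique(biDf$w1)) {
  biHash[[w]] <- head(biDf$w2[biDf$w1 == w], 3)  # table is already sorted by frequency
}

biHash[["of"]]  # most likely next words after "of"
```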
From Wikipedia:
Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.
It is an effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability’s lower order terms to omit n-grams with lower frequencies.
This approach has been considered equally effective for both higher and lower order n-grams.
A common example that illustrates the concept behind this method is the frequency of the bigram “San Francisco”. If it appears several times in a training corpus, the frequency of the unigram “Francisco” will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results; however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.
The equations for the lowest-order (unigram), middle-order and highest-order terms are given below.
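These follow the standard interpolated formulation (Chen & Goodman), where $c(\cdot)$ is a corpus count, $\delta$ the absolute discount, and $N_{1+}(\bullet\, w) = |\{w' : c(w'\,w) > 0\}|$ counts the distinct left contexts of $w$.

Lowest order (unigram):

$$P_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\, \bullet)}$$

Middle orders (built from continuation counts):

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\big(N_{1+}(\bullet\, w_{i-n+1}^{i}) - \delta,\ 0\big)}{N_{1+}(\bullet\, w_{i-n+1}^{i-1}\, \bullet)} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

Highest order (built from raw counts):

$$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\big(c(w_{i-n+1}^{i}) - \delta,\ 0\big)}{\sum_{w'} c(w_{i-n+1}^{i-1}\, w')} + \lambda(w_{i-n+1}^{i-1})\, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

Here $\lambda(w_{i-n+1}^{i-1})$ is the interpolation weight that redistributes the discounted probability mass; for the highest order it is $\frac{\delta}{\sum_{w'} c(w_{i-n+1}^{i-1}\, w')}\,\big|\{w' : c(w_{i-n+1}^{i-1}\, w') > 0\}\big|$.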