1 Background

Mobile devices have become ubiquitous and integral to our lives. Most of the time we communicate through emails, social media, texts, and messaging apps. Almost every action involves typing, and typing long texts on a mobile device can be strenuous. The good thing is that companies such as SwiftKey are trying to make typing on mobile devices easier for us. They are building sophisticated text prediction applications using state-of-the-art techniques and concepts from natural language processing.

This Capstone project tests all the skills we have acquired in the specialization. The goal is to come up with a model that can predict, with a high level of accuracy, the next word the user might type.

Since R’s memory limitations are well known, we need to plan ahead and sample the data in a way that represents the full corpus while still allowing the next word to be predicted with high accuracy.

3 Methodology

  • Read the data completely
  • Analyze basic patterns
  • Take a sample from the data
  • Include more samples from Twitter, as tweets are very similar to the way we chat and text these days
  • Keep test data separate for testing the accuracy of the model
  • Scrub and munge the data
  • Perform tokenization and normalization on the data
  • Remove special characters and punctuation
  • Do not remove stop words, as that would eliminate the word ‘the’ and several others
  • Perform descriptive and exploratory analysis
  • Create unigram, bigram, and trigram tables using the quanteda package
  • Use visual analysis (ggplot) to check the most frequently used words
  • Train the model and smooth probabilities using Kneser-Ney smoothing
  • Split files into several smaller parts for faster access
  • Create a separate key-value table using a hash function and store it for reference
  • The feature vectors here are words
  • Character vectors take more space, so words are mapped to numbers using a hash function to reduce the size of the table object (see the sketch after this list)
  • Create the prediction model, which asks for user input and suggests the next word based on that input
  • Suggest three words
  • When the input box is empty, use the most probable first word from the unigram table
  • In addition to the next-word suggestion, while a word is being typed, dynamically offer several completions based on the characters typed up to that point; for example, if "kno" is typed then "knowledge", "know" and "knows" will be suggested
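As noted in the list above, words are mapped to integer ids so that the n-gram lookup tables store small integers instead of character strings. A minimal sketch of the idea using the hash package; the vocabulary and variable names here are illustrative, not the project's actual code:

library(hash)

# map each word to an integer id, and keep a reverse lookup by position
vocab  <- c("the", "to", "and", "a", "of")               # example vocabulary
wordId <- hash(keys = vocab, values = seq_along(vocab))  # word -> id
idWord <- vocab                                          # id -> word

wordId[["the"]]            # 1
idWord[wordId[["and"]]]    # "and"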

4 Setup and Getting Data

The script downloads the data, creates a folder, and unzips the files into it.

source('dataDownload.R')
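A minimal sketch of what dataDownload.R might contain; the dataset URL below is the commonly used Capstone dataset link and, like the folder name, is an assumption rather than the project's actual code:

url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed dataset URL
dest <- "Coursera-SwiftKey.zip"

if (!dir.exists("data")) dir.create("data")                               # create a folder for the data
if (!file.exists(dest)) download.file(url, destfile = dest, mode = "wb")  # download the zip once
unzip(dest, exdir = "data")                                               # unzip the files into the folder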

5 Packages used

The script checks for the required packages and installs any that are not found.

source("packInstall.R")

# "hash","ggplot2", "shiny","quanteda","dplyr",sringi","hellno","magrittr","parallel"

6 Data preprocessing and exploration

source("functExplorefile.R")

# file names for input to function
    
    blogfileName <- 'en_US/en_US.blogs.txt' 
    newsfileName <- 'en_US/en_US.news.txt' 
    twitfileName <-'en_US/en_US.twitter.txt' 
    profName <-'bad-words.txt'

7 Summary Stats

Basic summary of the text files: file name, file size (MB), number of lines, longest line, total word count, and singleton count (words appearing only once).
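A hedged sketch of how such a per-file summary could be computed; functExplorefile.R may differ, and fileSummary is an illustrative name:

fileSummary <- function(fileName) {
  txt     <- readLines(fileName, skipNul = TRUE, encoding = "UTF-8")
  words   <- unlist(strsplit(txt, "\\s+"))
  wordTab <- table(words)
  data.frame(File        = basename(fileName),
             SizeMB      = round(file.info(fileName)$size / 1024^2, 2),
             Lines       = length(txt),
             LongestLine = max(nchar(txt)),
             WordCount   = length(words),
             Singletons  = sum(wordTab == 1),        # words that appear exactly once
             stringsAsFactors = FALSE)
}

summaryDf <- do.call(rbind,
                     lapply(c(blogfileName, newsfileName, twitfileName), fileSummary))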

Words which form 95% of the corpus
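A short sketch of how the 95% coverage figure could be computed, assuming the unigram frequency table uniGram[[1]] (Feature, Frequency) built in section 9:

freq     <- sort(uniGram[[1]]$Frequency, decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)          # cumulative share of all tokens
which(coverage >= 0.95)[1]                    # number of distinct words covering 95% of the corpus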

Plots for blogs, news, and twitter

knitr::kable(summaryDf)
File                  Size in MB     Lines   Longest Line   Word Count   Singleton Count
------------------   -----------   -------   ------------   ----------   ---------------
en_US.blogs.txt           210.16    899288          40833     37034929            557523
en_US.news.txt            205.81   1010242          11384     33836302            462997
en_US.twitter.txt         167.11   2360148            140     29557879            640489
  • From Wikipedia: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences.

Because R keeps all of its objects in memory, working with the full corpus would quickly run into memory problems, so sampling becomes important.

8 Sampling

set.seed(3456)

blog_ind <- sample(seq_len(length(txtReadb)),size=40000)
news_ind <-  sample(seq_len(length(txtReadn)),size=40000)
tweet_ind <-  sample(seq_len(length(txtReadt)),size=40000)
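A hedged sketch (not the original code) of how the sampled indices could be combined and a test split held out, as described in the methodology; the 10% hold-out share is an assumption:

sampleTxt <- c(txtReadb[blog_ind], txtReadn[news_ind], txtReadt[tweet_ind])

test_ind <- sample(seq_len(length(sampleTxt)), size = round(0.1 * length(sampleTxt)))  # assumed 10% hold-out
testTxt  <- sampleTxt[test_ind]    # kept aside to check model accuracy
trainTxt <- sampleTxt[-test_ind]   # used to build the n-gram tables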

9 N-gram generation

N-grams are generated using the quanteda package on the 40,000 sampled lines from each corpus.

9.1 Unigram frequency plot

source("grams.R")

uniGram <- grams(txtspcharRemove,ngrams=1)
head(uniGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##   Feature Frequency
##     (chr)     (int)
## 1     the     66766
## 2      to     52128
## 3     and     48984
## 4       a     48637
## 5      of     42433
uniGram[[2]]

9.2 Bigram frequency plot

biGram <- grams(txtspcharRemove,ngrams=2)
head(biGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##   Feature Frequency
##     (chr)     (int)
## 1  of the     13462
## 2  in the     12931
## 3  to the      7191
## 4  on the      6465
## 5 for the      6229
biGram[[2]]

9.3 Trigram frequency plot

triGram <- grams(txtspcharRemove,ngrams=3)
head(triGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##      Feature Frequency
##        (chr)     (int)
## 1 one of the      1252
## 2   a lot of      1065
## 3    to be a       634
## 4 out of the       575
## 5 the end of       540
triGram[[2]]

10 Algorithm

From Wikipedia:

  • Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.

  • It is an effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability’s lower order terms to omit n-grams with lower frequencies.

  • This approach has been considered equally effective for both higher and lower order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram “San Francisco”. If it appears several times in a training corpus, the frequency of the unigram “Francisco” will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results; however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

Lowest order (unigram) equation:

$$p_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\ \bullet)}$$

Middle order equation (continuation counts):

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(N_{1+}(\bullet\, w_{i-n+1}^{i}) - \delta,\ 0\right)}{N_{1+}(\bullet\, w_{i-n+1}^{i-1}\, \bullet)} + \lambda(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

Highest order equation (raw counts):

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c(w_{i-n+1}^{i}) - \delta,\ 0\right)}{\sum_{w'} c(w_{i-n+1}^{i-1}\, w')} + \lambda(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

where $\delta$ is the fixed discount and $\lambda(\cdot)$ is the back-off weight that redistributes the discounted probability mass to the lower-order model.
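To make the “San Francisco” example concrete, a hedged sketch of interpolated Kneser-Ney for bigrams in R, assuming a data frame bi with one row per distinct bigram and columns w1, w2, Frequency (names are illustrative, not the project's actual objects):

kneserNeyBigram <- function(bi, w1, w2, delta = 0.75) {
  c_w1w2 <- sum(bi$Frequency[bi$w1 == w1 & bi$w2 == w2])  # raw count of the bigram (w1, w2)
  c_w1   <- sum(bi$Frequency[bi$w1 == w1])                # total count of histories starting with w1
  pCont  <- sum(bi$w2 == w2) / nrow(bi)                   # continuation probability: distinct contexts ending in w2
  if (c_w1 == 0) return(pCont)                            # unseen history: fall back to the continuation probability
  lambda <- (delta / c_w1) * sum(bi$w1 == w1)             # back-off weight for the discounted mass
  max(c_w1w2 - delta, 0) / c_w1 + lambda * pCont
}

# e.g. kneserNeyBigram(bi, "san", "francisco")

Because pCont is based on how many distinct words precede w2 rather than its raw frequency, a word like “francisco”, which almost always follows “san”, gets a small continuation probability even though its unigram count is high.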