1 Background

Mobile devices have become ubiquitous and integral to our lives. Most of the time we communicate through emails, social media, texts, and messaging apps. Almost every action involves typing, and typing long texts on a mobile device can be strenuous. The good thing is that companies such as SwiftKey are trying to make typing on mobile devices easier for us. They are building sophisticated text prediction applications using state-of-the-art techniques and concepts from natural language processing.

This Capstone project tests all the skills we have acquired in the specialization. The goal is to come up with a model that can predict, with a high level of accuracy, the next word the user might type.

Since R’s memory limitations are well known, we need to plan ahead and sample the data in a way that represents the full corpus while still allowing the next word to be predicted with high accuracy.

3 Methodology

  • Read the data completely
  • Analyze basic patterns
  • Take a sample from the data
  • Include more samples from Twitter, as tweets are very similar to the way we chat and text these days
  • Keep test data separate for testing the accuracy of the model
  • Scrub and munge the data
  • Perform tokenization and normalization on the data
  • Remove special characters and punctuation
  • Do not remove stop words, as that would eliminate the word ‘the’ and several others
  • Perform descriptive and exploratory analysis
  • Create unigram, bigram, and trigram tables using the quanteda package
  • Use visual analysis (ggplot) to check the most frequently used words
  • Train the model and smooth probabilities using Kneser-Ney smoothing
  • Split files into several smaller parts for faster access
  • Create a separate key-value table using a hash function and store it for reference
  • The feature vectors here are words
  • Character vectors take more space, so words are mapped to numbers using a hash function to reduce the size of the table object (see the sketch after this list)
  • Create the prediction model, which asks for user input and suggests the next word based on that input
  • Suggest three words
  • When the input box is empty, use the most probable first word from the unigram table
  • In addition to the next-word suggestion, while a word is being typed, dynamically offer several completions based on the characters typed up to that point; for example, if "kno" is typed then "knowledge", "know" and "knows" will be suggested
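As noted in the list above, words are mapped to integer ids so that the n-gram lookup tables store small integers instead of character strings. A minimal sketch of the idea using the hash package; the vocabulary and variable names here are illustrative, not the project's actual code:

library(hash)

# map each word to an integer id, and keep a reverse lookup by position
vocab  <- c("the", "to", "and", "a", "of")               # example vocabulary
wordId <- hash(keys = vocab, values = seq_along(vocab))  # word -> id
idWord <- vocab                                          # id -> word

wordId[["the"]]            # 1
idWord[wordId[["and"]]]    # "and"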

4 Setup and Getting Data

The script downloads the data, creates a folder, and unzips the files into it.

source('dataDownload.R')
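A minimal sketch of what dataDownload.R might contain; the dataset URL below is the commonly used Capstone dataset link and, like the folder name, is an assumption rather than the project's actual code:

url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"  # assumed dataset URL
dest <- "Coursera-SwiftKey.zip"

if (!dir.exists("data")) dir.create("data")                               # create a folder for the data
if (!file.exists(dest)) download.file(url, destfile = dest, mode = "wb")  # download the zip once
unzip(dest, exdir = "data")                                               # unzip the files into the folder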

5 Packages used

The script checks for the required packages and installs any that are not found.

source("packInstall.R")

# "hash","ggplot2", "shiny","quanteda","dplyr",sringi","hellno","magrittr","parallel"

6 Data preprocessing and exploration

source("functExplorefile.R")

# file names for input to function
    
    blogfileName <- 'en_US/en_US.blogs.txt' 
    newsfileName <- 'en_US/en_US.news.txt' 
    twitfileName <-'en_US/en_US.twitter.txt' 
    profName <-'bad-words.txt'

7 Summary Stats

Basic summary of the text files: file name, file size (MB), number of lines, longest line, total word count, and singleton count (words appearing only once).
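A hedged sketch of how such a per-file summary could be computed; functExplorefile.R may differ, and fileSummary is an illustrative name:

fileSummary <- function(fileName) {
  txt     <- readLines(fileName, skipNul = TRUE, encoding = "UTF-8")
  words   <- unlist(strsplit(txt, "\\s+"))
  wordTab <- table(words)
  data.frame(File        = basename(fileName),
             SizeMB      = round(file.info(fileName)$size / 1024^2, 2),
             Lines       = length(txt),
             LongestLine = max(nchar(txt)),
             WordCount   = length(words),
             Singletons  = sum(wordTab == 1),        # words that appear exactly once
             stringsAsFactors = FALSE)
}

summaryDf <- do.call(rbind,
                     lapply(c(blogfileName, newsfileName, twitfileName), fileSummary))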

Words which form 95% of the corpus
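A short sketch of how the 95% coverage figure could be computed, assuming the unigram frequency table uniGram[[1]] (Feature, Frequency) built in section 9:

freq     <- sort(uniGram[[1]]$Frequency, decreasing = TRUE)
coverage <- cumsum(freq) / sum(freq)          # cumulative share of all tokens
which(coverage >= 0.95)[1]                    # number of distinct words covering 95% of the corpus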

Plots for blogs, news, and twitter

knitr::kable(summaryDf)
File                  Size in MB     Lines   Longest Line   Word Count   Singleton Count
------------------   -----------   -------   ------------   ----------   ---------------
en_US.blogs.txt           210.16    899288          40833     37034929            557523
en_US.news.txt            205.81   1010242          11384     33836302            462997
en_US.twitter.txt         167.11   2360148            140     29557879            640489
  • From Wikipedia: Zipf’s law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.: the rank-frequency distribution is an inverse relation. For example, in the Brown Corpus of American English text, the word “the” is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences.

Because R keeps all of its objects in memory, working with the full corpus would quickly run into memory problems, so sampling becomes important.

8 Sampling

set.seed(3456)

blog_ind <- sample(seq_len(length(txtReadb)),size=40000)
news_ind <-  sample(seq_len(length(txtReadn)),size=40000)
tweet_ind <-  sample(seq_len(length(txtReadt)),size=40000)
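A hedged sketch (not the original code) of how the sampled indices could be combined and a test split held out, as described in the methodology; the 10% hold-out share is an assumption:

sampleTxt <- c(txtReadb[blog_ind], txtReadn[news_ind], txtReadt[tweet_ind])

test_ind <- sample(seq_len(length(sampleTxt)), size = round(0.1 * length(sampleTxt)))  # assumed 10% hold-out
testTxt  <- sampleTxt[test_ind]    # kept aside to check model accuracy
trainTxt <- sampleTxt[-test_ind]   # used to build the n-gram tables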

9 N-gram generation

N-grams are generated using the quanteda package on the 40,000 sampled lines from each corpus.

9.1 Unigram frequency plot

source("grams.R")

uniGram <- grams(txtspcharRemove,ngrams=1)
head(uniGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##   Feature Frequency
##     (chr)     (int)
## 1     the     66766
## 2      to     52128
## 3     and     48984
## 4       a     48637
## 5      of     42433
uniGram[[2]]

9.2 Bigram frequency plot

biGram <- grams(txtspcharRemove,ngrams=2)
head(biGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##   Feature Frequency
##     (chr)     (int)
## 1  of the     13462
## 2  in the     12931
## 3  to the      7191
## 4  on the      6465
## 5 for the      6229
biGram[[2]]

9.3 Trigram frequency plot

triGram <- grams(txtspcharRemove,ngrams=3)
head(triGram[[1]],5)
## Source: local data frame [5 x 2]
## 
##      Feature Frequency
##        (chr)     (int)
## 1 one of the      1252
## 2   a lot of      1065
## 3    to be a       634
## 4 out of the       575
## 5 the end of       540
triGram[[2]]

10 Algorithm

From Wikipedia:

  • Kneser–Ney smoothing is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.

  • It is an effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability’s lower order terms to omit n-grams with lower frequencies.

  • This approach has been considered equally effective for both higher and lower order n-grams.

A common example that illustrates the concept behind this method is the frequency of the bigram “San Francisco”. If it appears several times in a training corpus, the frequency of the unigram “Francisco” will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results; however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

Lowest order (unigram) equation:

$$p_{KN}(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\ \bullet)}$$

Middle order equation (continuation counts):

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(N_{1+}(\bullet\, w_{i-n+1}^{i}) - \delta,\ 0\right)}{N_{1+}(\bullet\, w_{i-n+1}^{i-1}\, \bullet)} + \lambda(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

Highest order equation (raw counts):

$$p_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\left(c(w_{i-n+1}^{i}) - \delta,\ 0\right)}{\sum_{w'} c(w_{i-n+1}^{i-1}\, w')} + \lambda(w_{i-n+1}^{i-1})\, p_{KN}(w_i \mid w_{i-n+2}^{i-1})$$

where $\delta$ is the fixed discount and $\lambda(\cdot)$ is the back-off weight that redistributes the discounted probability mass to the lower-order model.
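To make the “San Francisco” example concrete, a hedged sketch of interpolated Kneser-Ney for bigrams in R, assuming a data frame bi with one row per distinct bigram and columns w1, w2, Frequency (names are illustrative, not the project's actual objects):

kneserNeyBigram <- function(bi, w1, w2, delta = 0.75) {
  c_w1w2 <- sum(bi$Frequency[bi$w1 == w1 & bi$w2 == w2])  # raw count of the bigram (w1, w2)
  c_w1   <- sum(bi$Frequency[bi$w1 == w1])                # total count of histories starting with w1
  pCont  <- sum(bi$w2 == w2) / nrow(bi)                   # continuation probability: distinct contexts ending in w2
  if (c_w1 == 0) return(pCont)                            # unseen history: fall back to the continuation probability
  lambda <- (delta / c_w1) * sum(bi$w1 == w1)             # back-off weight for the discounted mass
  max(c_w1w2 - delta, 0) / c_w1 + lambda * pCont
}

# e.g. kneserNeyBigram(bi, "san", "francisco")

Because pCont is based on how many distinct words precede w2 rather than its raw frequency, a word like “francisco”, which almost always follows “san”, gets a small continuation probability even though its unigram count is high.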