This Capstone project is held in collaboration with SwiftKey. Its main goal is to create an algorithm that predicts the next possible word while a text fragment is being typed into an input field, as many people know from their mobile devices. Because these devices have limited storage and RAM, keeping huge databases on the device to predict the next word is not practical, so predictive algorithms are used instead.

This intermediate report provides a short overview and some exploratory results for our training data set. The English texts are used for the exploratory analysis.
Blog, Twitter, and news text files are available for this analysis. The data sets for this project are reasonably large, and reading a whole data set into memory at once may cause problems, so a sample of 5000 lines of text is used for each category (a sketch of this sampling step follows the summary table below).
| Description | Blogs | News | Twitter |
|---|---|---|---|
| Total lines | 899288 | 1010242 | 2360148 |
| Total words | 37334131 | 34372530 | 30373543 |
| File size (MB) | 200.42 | 196.28 | 159.36 |
| Sample lines | 5000 | 5000 | 5000 |
| Sample word count | 205555 | 63747 | 170940 |
| Sample word count (after cleanup) | 104347 | 35947 | 96239 |
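As an illustration, the sampling could be done along the following lines. This is a minimal sketch, not the exact code used for the table above; the file paths and the seed are assumptions.

```r
set.seed(1234)                                     # reproducible sampling (assumed seed)
sample_lines <- function(path, n = 5000) {
  con <- file(path, open = "r", encoding = "UTF-8")
  lines <- readLines(con, skipNul = TRUE)          # read the full file once
  close(con)
  sample(lines, n)                                 # keep a random subset of lines
}

blogs_sample   <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```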
A corpus is created from the text file of each category, and cleanup operations are performed on it as part of tokenization.

Tokenization breaks the text into words and cleans up the corpus by removing special characters, punctuation, numbers, and extra whitespace; profanity is also removed. A sketch of this cleanup step is shown below.
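The exact cleanup code is not reproduced here; the following is a minimal sketch of how such a corpus could be built and cleaned with the tm package. The function name `build_clean_corpus`, the `blogs_sample` input, and the empty profanity list are assumptions for illustration.

```r
library(tm)

build_clean_corpus <- function(text_lines, profanity_words = character(0)) {
  corpus <- VCorpus(VectorSource(text_lines))
  corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case everything
  corpus <- tm_map(corpus, removePunctuation)              # drop punctuation / special characters
  corpus <- tm_map(corpus, removeNumbers)                  # drop numbers
  corpus <- tm_map(corpus, removeWords, profanity_words)   # profanity filtering (word list assumed)
  corpus <- tm_map(corpus, stripWhitespace)                # collapse extra whitespace
  corpus
}

blogs_corpus <- build_clean_corpus(blogs_sample)
```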
In many cases, words need to be stemmed to retrieve their radicals. For instance, “example” and “examples” are both stemmed to “exampl”. Afterwards, one may want to complete the stems back to their original forms so that the words look “normal” again. Tokenization in this project includes this stemming step, as illustrated below.
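A hedged illustration of the stemming step with the tm and SnowballC packages, assuming the hypothetical `blogs_corpus` object from the sketch above:

```r
library(tm)
library(SnowballC)

stemDocument(c("example", "examples", "predicting"))   # e.g. "exampl" "exampl" "predict"

# stem every document in the corpus
blogs_corpus <- tm_map(blogs_corpus, stemDocument)

# optionally complete a stem back to a readable word, given a dictionary
stemCompletion("exampl", dictionary = c("example", "examples"))
```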
The exploratory analysis includes determining which terms are used most often. It includes word clouds of the 100 highest-frequency terms in each category; a sketch of the frequency calculation is shown below.
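A minimal sketch of how the term frequencies and the word cloud could be produced with the tm and wordcloud packages, again assuming the hypothetical `blogs_corpus` object:

```r
library(tm)
library(wordcloud)

dtm  <- TermDocumentMatrix(blogs_corpus)
freq <- sort(rowSums(as.matrix(dtm)), decreasing = TRUE)   # term frequencies across the sample

head(freq, 10)                                  # most frequent terms
wordcloud(names(freq), freq, max.words = 100)   # word cloud of the top 100 terms
```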
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram” (or, less commonly, a “digram”), and size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
“Memorylessness” means that the probability distribution of the next word depends only on the current word or the previous one to three words, and not on the whole sequence of words that preceded them. This specific kind of memorylessness is called the Markov property.
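A standard way to write this approximation (a general statement of the Markov assumption for an n-gram model, not something specific to this project) is:

$$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$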
In this project, n-grams are collected from the text corpus of each category, and bigram, trigram, and four-gram estimates are used, in line with the Markov property. A sketch of the n-gram extraction is shown below.
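A hedged sketch of the n-gram extraction, assuming the RWeka tokenizer together with tm; the object names are the hypothetical ones used in the earlier sketches:

```r
library(tm)
library(RWeka)

bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = bigram_tokenizer))
trigram_tdm <- TermDocumentMatrix(blogs_corpus, control = list(tokenize = trigram_tokenizer))

bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)                           # most frequent bigrams in the sample
```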
While the strategy for modeling and prediction has not been finalized, an n-gram model with a frequency look-up table might be used, based on the analysis above. A possible method of prediction is to use the 4-gram model to find the most likely next word first; if none is found, the 3-gram model is used, and so forth. Furthermore, stemming might also be applied during data preprocessing. A sketch of this back-off look-up is shown below.
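A minimal sketch of such a back-off look-up, assuming named frequency vectors (`fourgram_freq`, `trigram_freq`, `bigram_freq`) whose names are space-separated word sequences; this illustrates the idea and is not the final model:

```r
predict_next_word <- function(phrase, fourgram_freq, trigram_freq, bigram_freq) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))

  lookup <- function(freq_table, context) {
    # keep only n-grams that start with the given context
    hits <- freq_table[startsWith(names(freq_table), paste0(context, " "))]
    if (length(hits) == 0) return(NULL)
    top <- names(hits)[which.max(hits)]          # most frequent matching n-gram
    tail(unlist(strsplit(top, " ")), 1)          # its last word is the prediction
  }

  n <- length(words)
  result <- NULL
  if (n >= 3) result <- lookup(fourgram_freq, paste(tail(words, 3), collapse = " "))
  if (is.null(result) && n >= 2) result <- lookup(trigram_freq, paste(tail(words, 2), collapse = " "))
  if (is.null(result) && n >= 1) result <- lookup(bigram_freq, tail(words, 1))
  result
}
```

For example, `predict_next_word("thanks for the", ...)` would first search the four-gram table for entries beginning with “thanks for the” and only fall back to shorter n-grams if no match is found.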
For the Shiny app, the plan is to create an app with a simple interface where the user can enter a string of text; the prediction model then suggests the most likely next words. A minimal sketch of such an interface follows.
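This sketch assumes the hypothetical `predict_next_word` function and frequency tables from above are available in the app's environment:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$user_text)                         # wait until the user has typed something
    predict_next_word(input$user_text,
                      fourgram_freq, trigram_freq, bigram_freq)
  })
}

shinyApp(ui = ui, server = server)
```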
The next steps for this project are to finalize the n-gram prediction model described above and to build and deploy the Shiny app.
Natural language processing is a completely new topic for me, so my analysis may have some inconsistencies. However, I enjoyed learning NLP techniques.