To begin with, I am using only the English version of the files. All three files, en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, are read in using readLines. For example, the blogs file was read in as follows:
blogsFile <- "./Coursera-Swiftkey/final/en_US/en_US.blogs.txt"
# Open the file in binary mode so that embedded control characters do not
# truncate the read; skipNul drops any embedded nul characters.
con <- file(blogsFile, "rb")
blogsVec <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
Before cleaning the data, I wanted to get an idea of the size, line count and word count of each file. The following table summarises this basic information for the three original data files: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The counts were obtained by running the following shell commands in Git Bash against each file:
wc -c
wc -l
wc -w
| file name | size (MB) | #lines | #words |
|---|---|---|---|
| en_US.blogs.txt | 210 | 899288 | 37334131 |
| en_US.news.txt | 206 | 1010242 | 34372530 |
| en_US.twitter.txt | 167 | 2360148 | 30373583 |
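For reference, roughly equivalent figures can also be computed directly in R. The sketch below is illustrative only: the paths match the layout used above, the helper name fileStats is my own, and the word count simply splits on whitespace, so it may differ slightly from wc -w.

```r
# Sketch: file size, line count and whitespace-delimited word count in R.
files <- c("./Coursera-Swiftkey/final/en_US/en_US.blogs.txt",
           "./Coursera-Swiftkey/final/en_US/en_US.news.txt",
           "./Coursera-Swiftkey/final/en_US/en_US.twitter.txt")

fileStats <- function(path) {
  con   <- file(path, "rb")                        # binary mode, as above
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  data.frame(file   = basename(path),
             sizeMB = round(file.size(path) / 1024^2, 1),
             lines  = length(lines),
             words  = sum(lengths(strsplit(lines, "\\s+"))))
}

do.call(rbind, lapply(files, fileStats))
```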
In the first phase, I applied a conservative set of cleaning transformations to the raw text. This strategy is admittedly too conservative; however, my first goal, as per the advice given in the DS Capstone Survival Guide, is to get a working data product. Once that is done, I will make the prediction tool more robust by using a less conservative approach to cleaning.
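For concreteness, a conservative cleaning pass along these lines could be expressed with the tm package. This is a sketch only: the transformations shown (lower-casing and removal of punctuation, numbers and extra whitespace) are illustrative rather than an exact record of my pipeline, and the object name cleanCorpus is arbitrary.

```r
library(tm)

# Illustrative conservative cleaning pass on a character vector such as blogsVec.
cleanCorpus <- VCorpus(VectorSource(blogsVec))
cleanCorpus <- tm_map(cleanCorpus, content_transformer(tolower))
cleanCorpus <- tm_map(cleanCorpus, removePunctuation)
cleanCorpus <- tm_map(cleanCorpus, removeNumbers)
cleanCorpus <- tm_map(cleanCorpus, stripWhitespace)
```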
Next, we construct some basic plots to get more insight into the data sets, starting with the distribution of word lengths.
We plot histograms of word lengths in each file.
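A minimal sketch of how such a histogram can be produced is shown below, using nchar on whitespace-split tokens; blogsVec is the vector read in earlier, while blogWords and the axis labels are illustrative.

```r
# Word-length histogram for the blogs file (the same idea applies to news and twitter).
blogWords   <- unlist(strsplit(blogsVec, "\\s+"))
blogWords   <- blogWords[nzchar(blogWords)]        # drop empty tokens
wordLengths <- nchar(blogWords)

hist(wordLengths[wordLengths <= 20],               # trim a few extreme outliers
     breaks = 20,
     main   = "Word lengths in en_US.blogs.txt",
     xlab   = "word length (characters)")
```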
As can be seen from the plots, it may be sufficient to look at only words of length 5-10, together with the n words surrounding them, to compute the n-grams, where n = 2, 3, 4, 5, etc.
We compute the frequencies of words in each data file and sort them in decreasing order of frequency. The tables below give the number of unique words needed in a frequency-sorted dictionary to cover 50% (first table) and 90% (second table) of all word instances in each file; a sketch of this coverage computation follows the tables.
| file name | #unique words (50%) | list of unique words (50%) |
|---|---|---|
| en_US.blogs.txt | 2 | one, will |
| en_US.news.txt | 2 | said, will |
| en_US.twitter.txt | 1 | im |
| file name | #unique words (90%) | list of unique words (90%) |
|---|---|---|
| en_US.blogs.txt | 39 | one, will, just, like, can, time, get, im, now, know, day, new, well, also, back, make, little, people, first, really, see, love, much, good, us, even, dont, think, way, go, two, made, going, things, last, many, still, year, life |
| en_US.news.txt | 55 | said, will, one, year, new, two, also, can, first, time, last, years, just, state, like, people, get, m, s, three, city, percent, now, school, back, game, million, make, says, day, home, county, many, even, well, good, going, may, high, made, season, team, police, p, way, dont, work, u, much, still, st, four, go, take, old |
| en_US.twitter.txt | 19 | im, just, like, get, love, good, will, day, thanks, dont, can, rt, now, one, u, know, time, great, today |
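The coverage counts above can be obtained with a helper along the following lines. This is a sketch: the function name coverage is my own, it takes a plain word vector (e.g. tokens from the cleaned corpus), and the commented example calls assume the blogWords vector from the histogram sketch above.

```r
# How many of the most frequent words are needed to cover a given share of all
# word instances in a word vector?
coverage <- function(words, threshold = 0.9) {
  freq   <- sort(table(words), decreasing = TRUE)
  cumCov <- cumsum(freq) / sum(freq)
  n      <- which(cumCov >= threshold)[1]
  list(nWords = n, words = names(freq)[seq_len(n)])
}

# Example (using a cleaned word vector such as blogWords):
# coverage(blogWords, 0.5)
# coverage(blogWords, 0.9)
```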
To get a sense of the distribution of the unique words that cover 90% of the data in each of the three files, we plot word clouds.
Word cloud of unique words that cover 90% of en_US.blogs.txt
Word cloud of unique words that cover 90% of en_US.news.txt
Word cloud of unique words that cover 90% of en_US.twitter.txt
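The word clouds can be generated with the wordcloud package (which attaches RColorBrewer). The sketch below reuses the coverage helper and blogWords vector from the sketches above; the palette and layout options are illustrative.

```r
library(wordcloud)
library(RColorBrewer)

# Word cloud of the unique words covering 90% of a file (blogWords as above).
freq <- sort(table(blogWords), decreasing = TRUE)
cov  <- coverage(blogWords, 0.9)
wordcloud(words        = cov$words,
          freq         = as.numeric(freq[cov$words]),
          colors       = brewer.pal(8, "Dark2"),
          random.order = FALSE)
```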
Based on the distribution of the unique words, I gather that en_US.news.txt draws on the largest set of words, followed by en_US.blogs.txt, while en_US.twitter.txt uses the smallest set of unique words. So I should probably prepare a sample in which 30% of the text comes from en_US.blogs.txt, 50% from en_US.news.txt and 20% from en_US.twitter.txt.
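One way to draw such a sample is sketched below. It assumes newsVec and twitterVec have been read in the same way as blogsVec; the total sample size, seed, helper name sampleLines and output path are all illustrative.

```r
# Sketch: build a combined sample whose composition is roughly
# 30% blogs, 50% news and 20% twitter (proportions of the sample, not of each file).
set.seed(1234)
sampleSize <- 100000   # illustrative total number of lines in the sample

sampleLines <- function(lines, n) lines[sample(length(lines), n)]

combinedSample <- c(sampleLines(blogsVec,   0.3 * sampleSize),
                    sampleLines(newsVec,    0.5 * sampleSize),
                    sampleLines(twitterVec, 0.2 * sampleSize))

dir.create("./sample", showWarnings = FALSE)
writeLines(combinedSample, "./sample/en_US.sample.txt")
```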