Executive Summary

This report is part of the Coursera Data Science Capstone project. The goal of the project is to build a Shiny web application that predicts the word most likely to follow a given sequence of words. A practical use of such an app is increasing typing speed.

This milestone report covers getting the data, exploratory data analysis, and data preparation. The application is based on an N-gram model. An N-gram is a sequence of n words; N-gram models estimate the probability of the last word given the preceding words.

Getting the data

The dataset consists of text from three sources: news, blogs, and Twitter. Four languages are available for analysis: German, English, Finnish, and Russian. This project uses the English dataset.

The raw data comes as a zip file containing .txt files. The source of the data is here: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
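A minimal sketch of fetching and unpacking the data (destination paths are illustrative; the English files sit under final/en_US/ inside the archive):

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
# extracts to data/final/en_US/, data/final/de_DE/, etc.
unzip("Coursera-SwiftKey.zip", exdir = "data")
```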

Exploratory data analysis

Dataset description

The full dataset contains 4,269,678 lines and occupies 831.5 MB of disk space. The twitter file has the most lines, while the blogs file is the largest in size. The table below summarizes file size, number of lines, and the length of the longest line:

| file    | size_MB  | length  | longest_line |
|---------|----------|---------|--------------|
| blogs   | 200.4242 | 899288  | 40833        |
| news    | 196.2775 | 1010242 | 11384        |
| twitter | 159.3641 | 2360148 | 140          |

Pre-processing

We combine the three files into one and draw a 20% random sample prior to further processing. Random sampling saves time and resources while still giving a good approximation of the results we would obtain using all the data. We then split the sample into training and testing sets in an 80/20 proportion.

| file     | size_MB   | length |
|----------|-----------|--------|
| training | 133.24413 | 683148 |
| testing  | 33.35484  | 170787 |
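A sketch of the sampling and splitting step, assuming the files were unpacked as above (the seed is illustrative):

```r
set.seed(1234)  # illustrative seed for reproducibility
blogs   <- readLines("data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

combined <- c(blogs, news, twitter)
sampled  <- sample(combined, floor(length(combined) * 0.20))  # 20% random sample

train_idx <- sample(seq_along(sampled), floor(length(sampled) * 0.80))
training  <- sampled[train_idx]   # 80% of the sample
testing   <- sampled[-train_idx]  # remaining 20%
```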

Data cleaning

Data cleaning applies the following transformations, in order (a sketch follows the list):
1. split the text into sentences;
2. exclude sentences that contain profanity;
3. remove punctuation;
4. remove symbols;
5. remove numbers;
6. remove URLs;
7. remove excess whitespace;
8. collapse runs of whitespace into a single space;
9. split hyphenated words;
10. convert all characters to lower case.
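A hedged sketch of these steps using base R regular expressions; `profanity_words` is assumed to be a character vector loaded separately, and URL removal is done before punctuation stripping so that URLs still match intact:

```r
clean_lines <- function(lines, profanity_words) {
  # 1. rough sentence split on terminal punctuation
  sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  # 2. drop whole sentences that contain a profanity word
  bad <- grepl(paste0("\\b(", paste(profanity_words, collapse = "|"), ")\\b"),
               sentences, ignore.case = TRUE)
  sentences <- sentences[!bad]
  # 6. remove URLs first, while they are still intact
  sentences <- gsub("(https?://|www\\.)\\S+", " ", sentences)
  # 9. split hyphenated words
  sentences <- gsub("-", " ", sentences)
  # 3.-5. remove punctuation, symbols, and numbers in one pass
  sentences <- gsub("[^[:alpha:][:space:]]", " ", sentences)
  # 7.-8. collapse whitespace runs and trim the ends
  sentences <- trimws(gsub("\\s+", " ", sentences))
  # 10. lower-case everything
  tolower(sentences)
}
```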

We do not remove stopwords, the most frequent words in English. Users of the application will most likely type stopwords, so keeping them results in better predictive power for the algorithm.

We remove a whole sentence whenever it contains a profanity word; this reduces the number of meaningless word combinations. The list of profanity words was obtained from Wikipedia.

N-gram frequencies

We use the tidy data approach to build the N-grams. The quality of the model depends on n-gram size: the longer the n-gram, the higher the probability of accurately predicting the next word. Our model uses five n-gram orders: 5-grams, 4-grams, 3-grams, 2-grams, and 1-grams.

We build document-feature matrices to examine n-gram frequencies, using the dfm() function from the quanteda package. The frequencies are shown in the histograms below.
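A sketch of the counting step with quanteda, assuming `sentences` is the cleaned text from the previous section:

```r
library(quanteda)

toks <- tokens(sentences)
count_ngrams <- function(toks, n) {
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  topfeatures(dfm(ng), 20)  # 20 most frequent n-grams of order n
}
# frequencies for each of the five n-gram orders
freqs <- lapply(1:5, function(n) count_ngrams(toks, n))
```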

Next steps

1. Rank next word candidates

The ranking algorithm takes the last 4 words of a line as input and searches for next-word candidates in the 5-gram dataset. If there are fewer than 5 candidates, it drops the first word and searches the 4-gram dataset, continuing to drop words until there are at least 5 candidates.

Ranking itself uses the Stupid Backoff algorithm, described [here](https://www.aclweb.org/anthology/D07-1090.pdf). Once there are at least 5 candidates, the algorithm assigns a score to each candidate and returns the top 5 sorted by descending score.
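A hedged sketch of this scoring scheme; `ngrams` is assumed to be a list of data frames, one per order, with columns prefix (the preceding words joined by spaces, empty for unigrams), word, and count, and the denominator approximates the prefix count by the total count of its continuations:

```r
score_candidates <- function(ngrams, context, alpha = 0.4) {
  scores  <- data.frame(word = character(0), score = numeric(0))
  penalty <- 1
  for (k in length(context):0) {
    prefix <- paste(tail(context, k), collapse = " ")
    tbl    <- ngrams[[k + 1]]                    # table of (k+1)-grams
    hits   <- tbl[tbl$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits$score <- penalty * hits$count / sum(hits$count)
      scores <- rbind(scores, hits[, c("word", "score")])
    }
    if (length(unique(scores$word)) >= 5) break  # enough candidates
    penalty <- penalty * alpha                   # Stupid Backoff penalty
  }
  scores <- scores[!duplicated(scores$word), ]   # keep the highest-order score
  head(scores[order(-scores$score), ], 5)
}
```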

2. Eliminate low-frequency n-grams

Low-frequency n-grams have low scores, so the algorithm is unlikely to show them among the top 5 candidates. At the same time, they consume a lot of memory and slow down lookups. We have to eliminate low-frequency n-grams in order to save memory and increase speed.
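A minimal pruning sketch over the hypothetical `ngrams` tables from above; the count threshold is an assumption to be tuned against accuracy and memory use:

```r
prune  <- function(tbl, min_count = 2) tbl[tbl$count >= min_count, ]
ngrams <- lapply(ngrams, prune)  # drop n-grams seen fewer than 2 times
```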

3. Test the algorithm

One interim step is to test the algorithm manually for its ability to handle empty or unintelligible input; here we can also test n-grams that are not present in the dataset. Another is to evaluate it on the testing set obtained during pre-processing, which allows us to check the predictive power of the algorithm.
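A rough sketch of such an evaluation, assuming the testing lines have been cleaned the same way as the training data and reusing the hypothetical score_candidates() from above:

```r
n_eval <- 0; correct <- 0
for (line in head(testing, 1000)) {    # a sample of held-out lines
  words <- strsplit(line, "\\s+")[[1]]
  if (length(words) < 2) next
  context <- tail(head(words, -1), 4)  # up to the last 4 preceding words
  target  <- tail(words, 1)            # the word we try to predict
  top5    <- score_candidates(ngrams, context)
  n_eval  <- n_eval + 1
  if (target %in% top5$word) correct <- correct + 1
}
accuracy <- correct / n_eval           # share of targets hit in the top 5
```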