For this part of the project I will be using the dataset supplied by SwiftKey. It contains text from three sources (blogs, Twitter, and news) in four different languages; here I will only use the English files.
The overall goal of the project is to build a model that predicts the next word in a sentence.
The data files contain a text entry on each line.
| File | Lines | Words |
|---|---|---|
| en_US.twitter.txt | 2,360,148 | 30,359,804 |
| en_US.news.txt | 1,010,242 | 1,010,242 |
| en_US.blogs.txt | 899,288 | 37,334,114 |
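As a rough illustration, counts like these can be produced with a few lines of Python. The file names follow the table above and the files are assumed to sit in the working directory; note that word counts depend on how tokens are split, so the exact figures may differ slightly.

```python
# Hypothetical setup: files assumed to be in the working directory.
files = ["en_US.twitter.txt", "en_US.news.txt", "en_US.blogs.txt"]

for name in files:
    lines = words = 0
    with open(name, encoding="utf-8", errors="replace") as f:
        for line in f:
            lines += 1
            words += len(line.split())  # whitespace tokenization
    print(f"{name}: {lines} lines, {words} words")
```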
After loading these files into memory, my first step was to find the most frequent words. For the purpose of this analysis I will only use a subset of the data.
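A minimal sketch of that subsetting and counting step, assuming a simple random sample of lines and whitespace tokenization (the sampling fraction and the choice of file are illustrative):

```python
import random
from collections import Counter

def sample_lines(path, fraction=0.05, seed=42):
    """Keep roughly `fraction` of the file's lines, chosen at random."""
    rng = random.Random(seed)
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line for line in f if rng.random() < fraction]

# Word frequencies over the sample: lowercased, whitespace-split.
counts = Counter()
for line in sample_lines("en_US.twitter.txt"):
    counts.update(line.lower().split())

print(counts.most_common(10))  # ten most frequent words in the sample
```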
Let's consider n-grams: sequences of n consecutive words. For example, the 2-grams of "to be or not" are "to be", "be or", and "or not".
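A small helper that extracts n-grams from a list of tokens might look like this (a sketch, not tied to any particular library):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```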
My current plan for a prediction algorithm is as follows: the input is a sentence and the output is one word, the predicted next word. The algorithm will work on a precomputed model of frequent n-grams (4-, 3-, and 2-grams). It will look for the most frequent n-gram whose leading words match the end of the sentence, backing off from the longer n-grams to the shorter ones. I still have to figure out how to weight the different n-gram orders and how to make the search fast.
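As a rough sketch of this back-off idea (the prefix-table layout, the toy corpus, and the simple "most frequent continuation" rule are my assumptions, not a final design):

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_model(tokens, max_n=4):
    """Map each (n-1)-word prefix to a Counter of observed next words,
    for n = 2, 3, 4."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for gram in ngrams(tokens, n):
            model[gram[:-1]][gram[-1]] += 1
    return model

def predict(model, sentence, max_n=4):
    """Back off from a 3-word prefix down to a 1-word prefix; return the
    most frequent continuation, or None if nothing matches."""
    words = sentence.lower().split()
    for k in range(max_n - 1, 0, -1):
        prefix = tuple(words[-k:])
        if len(prefix) == k and prefix in model:
            return model[prefix].most_common(1)[0][0]
    return None

# Toy usage with a tiny corpus:
corpus = "the cat sat on the mat the cat sat on the floor".split()
model = build_model(corpus)
print(predict(model, "the cat sat on the"))  # 'mat' or 'floor'
```

A weighted combination of the different n-gram orders (rather than strict back-off) is one of the things I still need to experiment with.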