Overview

This is a milestone report for the Data Science Capstone project. The report includes an exploratory analysis of the three .txt files in the en_US data folder: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. The distribution of word frequencies in the files is examined, along with the frequencies of 2-grams and 3-grams. The report closes with a summary of plans for building a prediction algorithm and an app that uses the resulting model and demonstrates its performance.

Exploratory Analysis

Below is some basic information on the files: each file’s size, number of lines, longest line length, and word count.

getDataSummary()  
##          size lines longestLine  words
## blogs   200MB  899K 40833 chars 37.33M
## news    196MB 1010K 11384 chars 34.37M
## twitter 159MB 2360K   140 chars 30.37M
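The getDataSummary() function itself is not shown in this report. Below is a minimal sketch of how such a summary could be computed; the en_US/ file paths are assumptions.

# Hypothetical sketch of the file summary above; paths are assumed.
summarizeFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    size        = sprintf("%.0fMB", file.size(path) / 1024^2),   # size in megabytes
    lines       = length(lines),                                  # number of lines
    longestLine = max(nchar(lines)),                              # longest line, in characters
    words       = sum(lengths(strsplit(lines, "\\s+")))           # rough word count
  )
}

files <- c(blogs   = "en_US/en_US.blogs.txt",
           news    = "en_US/en_US.news.txt",
           twitter = "en_US/en_US.twitter.txt")
do.call(rbind, lapply(files, summarizeFile))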

Each file has been subsampled: each line was kept with probability 0.1.
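The sampling code is not shown in this report; a minimal sketch of one way to do it follows. The seed and output path are assumptions.

# Hypothetical sketch of the subsampling step: keep each line with probability 0.1.
set.seed(1234)  # assumed seed, for reproducibility
sampleFile <- function(inPath, outPath, p = 0.1) {
  lines <- readLines(inPath, encoding = "UTF-8", skipNul = TRUE)
  keep  <- rbinom(length(lines), size = 1, prob = p) == 1   # Bernoulli(0.1) per line
  writeLines(lines[keep], outPath)
}
sampleFile("en_US/en_US.blogs.txt", "sample/blogs.sample.txt")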

Each sampled file has been tokenized and cleaned. Among other steps, every character that is not a letter, a space, or an apostrophe has been removed; hashtags (particularly from the twitter file) have been stripped; single-letter words have been dropped; all text has been converted to lowercase; and profanity has been filtered out. Plenty of non-dictionary words remain, however.
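The cleaning code is not shown here either. The sketch below is one possible base-R pipeline matching the steps just described; the profanity list and the single-letter-word regex are assumptions, and the latter is only approximate around apostrophes.

# One possible cleaning pipeline matching the steps described above.
cleanLines <- function(lines, profanity = character(0)) {
  lines <- tolower(lines)                          # convert to lowercase
  lines <- gsub("#\\S+", " ", lines)               # drop hashtags (mostly twitter)
  lines <- gsub("[^a-z' ]", " ", lines)            # keep only letters, apostrophes, spaces
  if (length(profanity) > 0) {
    pat   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    lines <- gsub(pat, " ", lines)                 # drop profanity (word list assumed)
  }
  lines <- gsub("\\b[a-z]\\b", " ", lines)         # drop single-letter words (approximate)
  gsub("\\s+", " ", trimws(lines))                 # collapse whitespace
}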

The frequencies of unigrams (1-grams), bigrams (2-grams), and trigrams (3-grams) are plotted below. Stopwords such as “the,” “and,” and “for” were not removed before computing these frequencies, so they naturally top the list of most frequent unigrams.

Most frequent 1-grams.

plotFrequencies(1)

Most frequent 2-grams.

plotFrequencies(2)

Most frequent 3-grams.

plotFrequencies(3)
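plotFrequencies() is not shown in this report. A rough sketch of how the n-gram counts behind these plots could be computed is given below; cleanedLines is a hypothetical character vector of cleaned text such as cleanLines() above might produce.

# Hypothetical sketch of the n-gram counting behind the plots above.
countNgrams <- function(lines, n) {
  tokens <- strsplit(lines, " ", fixed = TRUE)     # split each line into words
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))        # skip lines shorter than n words
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(ngrams), decreasing = TRUE)           # frequency table, most frequent first
}

# e.g. the 20 most frequent trigrams in the cleaned sample:
# head(countNgrams(cleanedLines, 3), 20)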

It is worth noting how much of the corpora consists of repeated words: a small number of very frequent words accounts for a large share of all word instances.

getCoverage()
## [1] "The 10 most frequent words represent 16.6% of all words instances in the corpora."
## [1] "The 50 most frequent words represent 29.4% of all words instances in the corpora."
## [1] "The 250 most frequent words represent 47.5% of all words instances in the corpora."
## [1] "The 1000 most frequent words represent 64.8% of all words instances in the corpora."
## [1] "The 5000 most frequent words represent 84.2% of all words instances in the corpora."

Model Plans

The information gathered in learning the structure of the data, particularly the frequencies of different n-grams, will be used to build a model for predicting the next word given a string of words. If the string is “long” (three or more words), its last three words will be matched against the 4-grams of the sampled, tokenized corpora, specifically against those 4-grams whose first three words equal the last three words of the string; if there are matches, the last word of the most frequent matching 4-gram will be returned. If there are no matches, the same procedure will be applied to the last two words of the string, matched against the 3-grams of the corpora. If that also yields no matches, the last word of the string will be matched against the 2-grams. If no match is found at any level, “the” will be returned, since it is the most common word in the corpora.
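A minimal sketch of this backoff lookup is shown below. It assumes the n-gram frequency tables (freq4, freq3, freq2, names hypothetical) are named numeric vectors keyed by the space-separated n-gram, such as countNgrams() above would produce; the real model has not been built yet.

# Hypothetical sketch of the backoff prediction described above.
predictNextWord <- function(input, freq4, freq3, freq2) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]

  lookup <- function(prefixWords, freqTable) {
    prefix  <- paste(prefixWords, collapse = " ")
    matches <- freqTable[startsWith(names(freqTable), paste0(prefix, " "))]
    if (length(matches) == 0) return(NULL)
    best <- names(matches)[which.max(matches)]     # most frequent matching n-gram
    tail(strsplit(best, " ")[[1]], 1)              # return its last word
  }

  n <- length(words)
  if (n >= 3) {
    hit <- lookup(words[(n - 2):n], freq4)         # last 3 words vs 4-grams
    if (!is.null(hit)) return(hit)
  }
  if (n >= 2) {
    hit <- lookup(words[(n - 1):n], freq3)         # back off to 3-grams
    if (!is.null(hit)) return(hit)
  }
  if (n >= 1) {
    hit <- lookup(words[n], freq2)                 # back off to 2-grams
    if (!is.null(hit)) return(hit)
  }
  "the"                                            # fall back to the most common word
}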

This model will be used to build a simple app in which a user types in a string and receives a prediction for the next word, i.e. the app will try to predict the next word of whatever sentence or query the user has typed.
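A bare-bones sketch of such an app is given below, assuming shiny as the framework (an assumption; the app has not been built yet) and the predictNextWord() function and frequency tables sketched above.

# Hypothetical sketch of the prediction app, using shiny.
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predictNextWord(input$phrase, freq4, freq3, freq2)   # tables as sketched above
  })
}

shinyApp(ui, server)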