Project Overview

The goal of this project is to present the exploratory analysis for the Data Science Capstone Project, which aims to create a word-prediction algorithm using the SwiftKey dataset. The texts used for this project include US blogs, news, and tweets. Note that prior to the exploratory analysis, the texts were cleaned by converting everything to lowercase, removing extra white space, removing special characters, removing offensive words, converting all end-of-sentence characters to periods, converting all numbers to the symbol “#”, and separating the data into sentences. In addition, only the first 25,000 lines of each text source were included for exploratory analysis, yielding a total of 149,256 lines for analysis (64,517 from blogs, 56,700 from news, and 28,039 from tweets).

Objectives

The objectives of this milestone report are to…

Data Cleaning

Once a corpus has been created, it is cleaned to ease further analysis. First, all letters are converted to lower case. Next, all numbers are removed. Small, frequent words like “and” and “the”, referred to as stop words, are removed because they add little valuable information for text analysis. Finally, punctuation and extra white space are removed so that only text remains.
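A minimal sketch of this pipeline using the tm package is shown below; the input vector sampleText is hypothetical, and this is one common way to implement the steps, not necessarily the exact code behind this report.

```r
library(tm)

# Build a corpus from a character vector of documents
# (sampleText is a hypothetical input vector)
corpus <- VCorpus(VectorSource(sampleText))

# Lower-case all letters
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove all numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove English stop words such as "and" and "the"
corpus <- tm_map(corpus, removeWords, stopwords("en"))
# Remove punctuation, then collapse extra white space
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
```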

Distribution of Word Frequencies

The first step in exploring the data is simply to identify the distribution of word frequencies. To do this, the cleaned corpus was transformed into a document-term matrix (DTM).
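As a sketch, assuming the cleaned corpus from the previous section: the wordLengths control shown here is one way to drop the very short words mentioned below.

```r
# Transform the cleaned corpus into a document-term matrix,
# keeping only terms of three or more letters
dtm <- DocumentTermMatrix(corpus,
                          control = list(wordLengths = c(3, Inf)))
```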

Using this document-term matrix, it is easy to find the most common terms. Two-letter words were excluded in the creation of this document-term matrix, so the top five most common words are “will”, “just”, “said”, “one”, and “like”. To compare the relative frequencies of the most common terms, a barplot was used.
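One way to recover these frequencies from the matrix (a sketch; converting a large sparse DTM to a dense matrix is memory-hungry, so this assumes a modest sample like the one used here):

```r
# Sum each term's count across all documents and sort
freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freqs, 5)  # e.g. "will", "just", "said", "one", "like"

# Barplot of the 20 most common terms
barplot(head(freqs, 20), las = 2, ylab = "Frequency",
        main = "Most common words")
```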

Figure 1 below plots the frequencies of the most common terms in the document-term matrix.

Distribution of Word Combinations (2-Grams)

Next, we consider the most frequent two-word combinations (i.e., “2-grams”). As with individual words, we can quantify the number and frequency of 2-grams.
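A common way to tokenize 2-grams is RWeka's NGramTokenizer; the sketch below assumes the cleaned corpus from earlier and mirrors the unigram analysis above.

```r
library(RWeka)

# Tokenizer that emits two-word combinations
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# Document-term matrix whose terms are 2-grams
dtm2 <- DocumentTermMatrix(corpus,
                           control = list(tokenize = bigramTokenizer))

# Most frequent 2-grams
freqs2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
head(freqs2, 10)
```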

Figure 2 below plots the frequency of 2-grams in the tweetsCorpus.

Word Cloud visualization

This word cloud provides a nice visualization of the words used most frequently across the three data sets.
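A sketch of how such a cloud can be drawn with the wordcloud package, reusing the term frequencies computed earlier:

```r
library(wordcloud)
library(RColorBrewer)

# Up to 100 of the most frequent words, scaled by frequency
wordcloud(names(freqs), freqs,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```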

Modeling Strategy

This exploratory analysis suggests the following strategy for developing a predictive text model: use the observed frequencies of individual words and 2-grams to predict the most likely next word.

While these steps may be helpful, I am admittedly struggling with how to actually develop a predictive text model from this information. Suggestions for a modeling strategy are welcome and appreciated!
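As one purely illustrative starting point (a sketch, not a finished strategy): store the 2-gram counts computed earlier in a lookup table, return the most frequent continuation of the preceding word, and back off to the overall most common word when the 2-gram table has no match.

```r
# Split each 2-gram (computed earlier) into its first and second word
parts <- strsplit(names(freqs2), " ")
bigrams <- data.frame(first  = sapply(parts, `[`, 1),
                      second = sapply(parts, `[`, 2),
                      count  = as.numeric(freqs2),
                      stringsAsFactors = FALSE)

# Return the most frequent continuation of the previous word,
# backing off to the overall most common word when nothing matches
predictNext <- function(word) {
  matches <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(matches) == 0) return(names(freqs)[1])
  matches$second[which.max(matches$count)]
}

predictNext("last")  # e.g. "year", depending on the sample
```

A real model would need higher-order n-grams and a principled backoff or smoothing scheme, but the table-lookup structure above generalizes to those cases.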