With the increased prevalence of handheld devices with touch-based text entry, efficient and accurate word-recommender systems have become critical to usability. This report provides an exploratory data analysis of text from blogs, news articles, and Twitter tweets from around the internet as a prelude to developing a next-word recommender system. However, it should be noted that we are still experimenting with different NLP R packages and technologies, so the results shown should be taken as "very preliminary". We discuss the results and the next steps in the discussion.
This section provides an overview of the data sources, the R packages used, and basic cleaning (i.e., removal of inappropriate or banned words).
The data was provided by SwiftKey, made available via the Coursera Data Science Capstone course, and downloaded from the following URL:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The extracted files contained the following numbers of lines of text:
File | Lines | Size (MB) |
---|---|---|
en_US.blogs.txt | 899288 | 210.2 |
en_US.news.txt | 1010242 | 205.8 |
en_US.twitter.txt | 2360148 | 167.1 |
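For reference, the line counts and file sizes above can be reproduced with base R roughly as follows; the file paths are assumed to match the layout of the extracted archive.

```r
# Count lines and report file sizes for the three source files.
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

data.frame(
  File  = basename(files),
  Lines = sapply(files, function(f) length(readLines(f, skipNul = TRUE))),
  MB    = round(file.size(files) / 1024^2, 1)
)
```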
The analysis utilized a number of open-source R packages, which are loaded in the code below. Since the data includes Twitter posts, a number of translations were applied to expand common shorthand (e.g., 'u', 'ur'), and the data was also screened for inappropriate or derogatory words, using the list of Words Banned by Google as a reference.
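As a rough illustration only (not the exact transformations used in this report), the shorthand expansion and banned-word screening might look like the following sketch, where `banned_words` is a hypothetical character vector populated from the banned-words list:

```r
# Expand a couple of common Twitter shorthands to full words.
expand_shorthand <- function(lines) {
  lines <- gsub("\\bu\\b",  "you",  lines, ignore.case = TRUE)
  lines <- gsub("\\bur\\b", "your", lines, ignore.case = TRUE)
  lines
}

# Drop any line containing a banned word (`banned_words` is assumed to exist).
remove_banned <- function(lines, banned_words) {
  pattern <- paste0("\\b(", paste(banned_words, collapse = "|"), ")\\b")
  lines[!grepl(pattern, lines, ignore.case = TRUE)]
}
```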
Note: To accommodate speedy compilation of the RMarkdown document, and in the interest of making the Data Science Milestone deadline, the following analysis was performed on a randomly sampled (without replacement) subset of lines from the original files, representing approximately 10% of the original data size.
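A minimal sketch of the ~10% sampling step is shown below; the seed and file paths are assumptions for illustration, not necessarily those used in the analysis.

```r
set.seed(1234)  # assumed seed, for reproducibility of the sketch

# Read a file and keep a random ~10% of its lines (without replacement).
sample_lines <- function(path, fraction = 0.10) {
  lines <- readLines(path, skipNul = TRUE)
  lines[sample(length(lines), size = floor(fraction * length(lines)))]
}

blogs_sample   <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```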
Before computing a word-frequency matrix, the corpus was cleaned in a number of ways typical of many Natural Language Processing tasks. Punctuation, numbers, extra white space, and English 'stop words' (and, I, you, etc.) were removed. Words in the corpus were also stemmed to their roots to avoid double counting.
Load the necessary libraries.
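The exact package list is not reproduced here; `tm` and `SnowballC` are assumed for the cleaning, stemming, and document-term matrices described above.

```r
library(tm)         # corpus construction, cleaning, document-term matrices
library(SnowballC)  # word stemming
```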
Define a few convenience functions.
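One plausible convenience function, implementing the cleaning steps described earlier (the actual helpers used in the report may differ):

```r
# Build a tm corpus from a character vector and apply the standard cleaning:
# lower-casing, punctuation/number removal, stop-word removal, whitespace
# stripping, and stemming.
clean_corpus <- function(text_lines) {
  corp <- VCorpus(VectorSource(text_lines))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  corp <- tm_map(corp, stripWhitespace)
  corp <- tm_map(corp, stemDocument)
  corp
}
```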
Load the data and create document-term matrices for each of the blogs, news, and Twitter samples.
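Using the sampled lines and the `clean_corpus()` helper sketched above:

```r
# One document-term matrix per source.
blogs_dtm   <- DocumentTermMatrix(clean_corpus(blogs_sample))
news_dtm    <- DocumentTermMatrix(clean_corpus(news_sample))
twitter_dtm <- DocumentTermMatrix(clean_corpus(twitter_sample))
```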
The three different sources of text, blogs, news media, and Twitter, appear to have very similar word-frequency distributions, with many words occurring at very low counts.
Many of the top-20 words are common across the three datasets (e.g., go, like, new).
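For reference, per-source term frequencies and the top-20 words can be obtained from the document-term matrices roughly as follows (assuming the objects sketched above; `slam` is a dependency of `tm`):

```r
# Total count of each term across all documents in one source, top 20 shown.
blogs_freq <- sort(slam::col_sums(blogs_dtm), decreasing = TRUE)
head(blogs_freq, 20)
```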
Similarities in the top-20 words from each dataset suggest that a small set of words is common to all three corpora; it would be interesting to combine the three data sources and perform a combined analysis. While analyzing individual words is interesting, it is unlikely to be very predictive, as word pairs or sequences of words (i.e., bi-grams or tri-grams) are intuitively required to predict the next word. Realizing this, we will change our approach to first identify sentences and then proceed to a frequency analysis of the words derived from the detected sentences, as sketched below.
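As an illustration of the planned shift toward word sequences, here is a minimal base-R sketch of bi-gram extraction; it is not yet part of the analysis.

```r
# Split a sentence into lower-case words and paste adjacent pairs (bi-grams).
bigrams <- function(sentence) {
  words <- unlist(strsplit(tolower(sentence), "\\W+"))
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Example: bi-gram frequencies across a couple of toy sentences.
sentences <- c("Looking forward to the weekend",
               "Looking forward to seeing you")
sort(table(unlist(lapply(sentences, bigrams))), decreasing = TRUE)
```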
Moreover, the preliminary nature of this analysis reflects the fact that we are still in the experimental stages of using these Natural Language Processing R packages. We are very much in the mechanics-building stage, and it seems that scaling up the analysis to the full dataset will be a challenge. The data is currently difficult to load completely into the R environment on an eight-year-old laptop, suggesting that I may have to batch the analysis, perform some kind of dimensionality-reduction procedure, or move the analysis to a more performant machine.