The goal of this task was to familiarize myself with the content of the three datasets provided. Each file was loaded into R, and I first ran some summary statistics to get a general understanding of their size and shape:
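A minimal sketch of how this loading and sizing step might look in base R is shown below. The file names and paths are assumptions (the report does not list them), and the word count is a rough whitespace-based approximation.

```r
# Sketch of the loading and summary step; the file names here are
# assumptions and should be replaced with the actual dataset paths.
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summaries <- lapply(files, function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    size_mb    = round(file.info(path)$size / 1024^2, 1),  # file size on disk
    line_count = length(lines),                            # number of lines
    word_count = sum(lengths(strsplit(lines, "\\s+")))     # rough word count
  )
})

do.call(rbind, summaries)
```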
I then cleaned the data and analyzed aggregate word counts, 2-gram counts and 3-gram counts in each dataset.
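The report does not name the packages used for this step, so the sketch below shows one possible approach using tidytext and dplyr. The object `lines` stands in for a character vector of cleaned text from one dataset; the names are illustrative rather than the exact ones used in the analysis.

```r
library(dplyr)
library(tidytext)

text_df <- tibble(text = lines)  # `lines` = character vector of one dataset's text

count_ngrams <- function(df, n) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%  # lowercases and strips punctuation
    count(ngram, sort = TRUE)                                # frequency table, most common first
}

word_counts    <- count_ngrams(text_df, 1)
bigram_counts  <- count_ngrams(text_df, 2)
trigram_counts <- count_ngrams(text_df, 3)
```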
When reviewing the outcomes of the EDA process, several consistent themes emerged:
The most common words are, on their own, not particularly indicative of the semantic content of the larger body of text they belong to. Words such as “a”, “the” and “with” offer relatively little value. In other words, the frequency of a word's occurrence appears to be strongly negatively correlated with its “value”.
The top 20 2-grams and 3-grams represented a relatively small share of all n-gram occurrences, so any model that is developed will need to rely on a much broader swathe of the data to be useful (a quick way to check this coverage is sketched below).
The larger the “n” in an n-gram, the more valuable the insights might be for a predictive text model; however, this would likely come at the cost of greater computational intensity, so there will be trade-offs.
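To make the first two themes concrete, here is a minimal sketch of how they could be quantified, assuming the frequency tables produced in the earlier step (`word_counts` and `bigram_counts`, with columns `ngram` and `n`); the top-20 cutoff simply mirrors the tables reviewed during the EDA.

```r
library(dplyr)
library(tidytext)

# Share of all word occurrences accounted for by stop words such as
# "a", "the" and "with" (high frequency, low predictive value).
stopword_share <- word_counts %>%
  mutate(is_stop = ngram %in% stop_words$word) %>%
  summarise(share = sum(n[is_stop]) / sum(n))

# Share of all 2-gram occurrences covered by the 20 most frequent 2-grams
# (the counts are already sorted in descending order).
top20_coverage <- sum(head(bigram_counts$n, 20)) / sum(bigram_counts$n)
```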
If the EDA is indicative of calculation times, the computational load of building the model will be very high. Because of this, I will need to develop one or more ways to process the information efficiently, possibly in chunks that are then re-aggregated (one possible approach is sketched below).
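As an illustration only, one way the chunking could work is to count n-grams on manageable slices of the text and then sum the partial counts. The chunk size and the `count_ngrams()` helper from the earlier sketch are assumptions, not the method actually settled on.

```r
library(dplyr)

chunk_size <- 50000  # assumed chunk size; tune to available memory
chunks <- split(lines, ceiling(seq_along(lines) / chunk_size))

# Count 2-grams within each chunk separately.
partial_counts <- lapply(chunks, function(chunk) {
  count_ngrams(tibble(text = chunk), 2)
})

# Re-aggregate: sum the counts of identical 2-grams across chunks.
bigram_counts <- bind_rows(partial_counts) %>%
  group_by(ngram) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n))
```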
I will need to strike a balance between accuracy and speed.
I may need to run some of the calculations externally on Google Colab, where I can scale up processing capacity while experimenting with different methodologies to complete this task.