This report documents the Exploratory Data Analysis performed on Blogs, News and Twitter (tweets) data from a corpus called HC Corpora (www.corpora.heliohost.org). This analysis was a precursor to building a word prediction algorithm for the Capstone Data Science Project.
In summary, 3 data files containing over 100 million words were analysed. Exploratory Data Analysis was performed to understand the relationships between the words. In addition, some limited analysis of the use of Profanity within the corpus was conducted. The report ends with a high-level design for the prediction algorithm and further considerations which could have enhanced this analysis.
This report is intended to be understandable to a non-data-scientist manager.
The top 5 most frequently used words across the corpus were: “of”, “a”, “and”, “to” and “the”.
- Calculate the average words per line
- Remove strange characters
- Build n-grams, including 4-grams (quad-grams) such as “the quick brown fox”, “quick brown fox jumped”
- Determine the frequency of each unique n-gram and plot it (a sketch of these steps follows this list)
- Perform some simple Profanity Analysis on the words
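For illustration, the Python sketch below shows one way the cleaning, n-gram extraction and frequency counting steps could be implemented; the cleaning rule and the file path are simplifying assumptions rather than the exact approach used in the analysis.

```python
import re
from collections import Counter

def clean_line(line):
    """Remove strange characters: keep letters, apostrophes and spaces (a simplified rule)."""
    return re.sub(r"[^a-z' ]+", " ", line.lower())

def ngrams(tokens, n):
    """Yield n-grams, e.g. n=4 gives quad-grams such as 'the quick brown fox'."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def ngram_frequencies(path, n):
    """Count the frequency of each unique n-gram in one corpus file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(ngrams(clean_line(line).split(), n))
    return counts

# Hypothetical usage (the path assumes the extracted archive layout):
# quad = ngram_frequencies("final/en_US/en_US.blogs.txt", 4)
# quad.most_common(10)   # the 10 most frequent quad-grams, ready for plotting
```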
The 3 files were extracted from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
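For illustration, a minimal Python sketch of how the archive could be downloaded and extracted; the local file name is an assumption, not part of the original analysis.

```python
import urllib.request
import zipfile

URL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
ZIP_PATH = "Coursera-SwiftKey.zip"   # local file name (an assumption)

urllib.request.urlretrieve(URL, ZIP_PATH)   # download the archive
with zipfile.ZipFile(ZIP_PATH) as archive:
    archive.extractall(".")                 # the English files sit under final/en_US/
```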
Corpus | Number of Lines | Number of Words | Avg. Words/Line |
---|---|---|---|
Blogs | 899,288 | 37,334,131 | 41.51 |
News | 1,010,242 | 34,372,529 | 34.02 |
Twitter | 2,360,148 | 30,373,583 | 12.86 |
Total | 4,269,678 | 102,080,243 | 23.91 |
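For illustration, a Python sketch of how the line counts, word counts and averages in the table above could be computed; the file names assume the extracted `final/en_US` folder and are not taken from the original analysis.

```python
def corpus_summary(path):
    """Return (lines, words, average words per line) for one corpus file."""
    lines = words = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            lines += 1
            words += len(line.split())
    return lines, words, words / lines

for name in ("blogs", "news", "twitter"):
    lines, words, avg = corpus_summary(f"final/en_US/en_US.{name}.txt")
    print(f"{name:8s} {lines:>10,} {words:>12,} {avg:6.2f}")
```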
The Twitter corpus had the lowest number of words per line (12.86); this is likely explained by the 140-character limit on tweets. Note that any future analysis of tweets may show a different outcome should Twitter relax this character limit, as it has been reported to be considering.
A distinctive feature of the News corpus is the use of reporting language such as “according to the” and “said in a statement”.
I wanted to understand the use of Profanity (abusive, vulgar or irreverent language) in each of the 3 data sets. I used “The 10 Most Popular Swear Words on Facebook (for both genders)” (http://www.slate.com/blogs/lexicon_valley/2013/09/11/top_swear_words_most_popular_curse_words_on_facebook.html) to identify differences between the data sets; a sketch of the check appears after the table below.
Data Type | Lines containing Profanity |
---|---|
Blogs | 1.7% |
News | 0.5% |
Twitter | 2.5% |
To avoid offending the readers of this report I have chosen not to list the top 10 words!
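For illustration, a Python sketch of the Profanity check; the `PROFANITY` set is a hypothetical placeholder, since the actual top-10 list is deliberately not reproduced in this report.

```python
import re

# Hypothetical placeholder: the real top-10 list from the cited article is deliberately omitted.
PROFANITY = {"swearword1", "swearword2"}

def profanity_rate(path, bad_words):
    """Return the percentage of lines containing at least one listed Profanity word."""
    total = flagged = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            total += 1
            if set(re.findall(r"[a-z']+", line.lower())) & bad_words:
                flagged += 1
    return 100 * flagged / total

# Hypothetical usage:
# profanity_rate("final/en_US/en_US.twitter.txt", PROFANITY)
```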
Next, we calculate the number of unique words needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances, as well as how much coverage just 100 words would achieve.
n-gram | Coverage from 100 words | Words for 50% coverage | Words for 90% coverage |
---|---|---|---|
Uni-gram | 46% | 150 words | 7,000 words |
For uni-grams, 100 words provide 46% coverage, whilst ~150 words provide 50% coverage; a sketch of how this coverage can be calculated is shown below.
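For illustration, a Python sketch of the coverage calculation, assuming `counts` is a `collections.Counter` of uni-gram frequencies such as the one built earlier; this is not the original implementation.

```python
def coverage(counts, top_n):
    """Fraction of all word instances covered by the top_n most frequent words."""
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total

def words_needed(counts, target):
    """Smallest number of frequency-sorted words needed to reach the target coverage."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank

# Hypothetical usage, with `uni` a Counter of uni-gram frequencies:
# coverage(uni, 100)        # ~0.46 for this corpus
# words_needed(uni, 0.50)   # ~150 words
# words_needed(uni, 0.90)   # ~7,000 words
```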
A Word Cloud is a visual representation of text data, typically used to depict keywords. Tags are usually single words, and the importance of each tag is shown through font size; that is, a higher word frequency equates to a larger font size.
The figure below illustrates a word cloud for bi-grams:
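For illustration, a Python sketch of how such a bi-gram word cloud could be generated, assuming the third-party `wordcloud` package is installed and `bi` is a `Counter` of bi-gram frequencies; this is an assumed approach, not the one used to produce the figure.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud   # third-party package: pip install wordcloud

def plot_word_cloud(frequencies):
    """Render a word cloud where font size reflects n-gram frequency."""
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Hypothetical usage, with `bi` a Counter of bi-gram frequencies:
# plot_word_cloud(dict(bi.most_common(200)))
```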
Through this analysis I identified other areas which could have enhanced this model; they include:
Taking into account sentence boundaries in lines of text - for my model the whole line was treated as a contiguous sentence, irrespective of full-stops.
It would have been interesting to remove Stop Words (the most common words in a language, such as and, the, at, on) during the Exploratory Data Analysis to highlight any particular themes. Although this would be useful analysis, I considered it would not benefit word prediction.
I did not investigate the application of Stemming (reducing inflected, or sometimes derived, words to their word stem); a short sketch of Stop Word removal and Stemming is shown after this list.
Although a high-level analysis of Profanity words was performed, I chose not to remove them from the analysis.
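For illustration, a Python sketch of Stop Word removal and Stemming using NLTK; the use of NLTK here is an assumption for demonstration purposes and was not part of the original analysis.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")                    # one-off download of the Stop Word list

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the quick brown foxes jumped over the lazy dogs".split()
without_stops = [t for t in tokens if t not in STOP_WORDS]   # drops 'the', 'over'
stems = [stemmer.stem(t) for t in without_stops]             # 'foxes' -> 'fox', 'jumped' -> 'jump'
print(stems)
```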