This report documents the Exploratory Data Analysis performed on Blogs, News and Twitter (tweets) data from a corpus called HC Corpora (www.corpora.heliohost.org). This analysis was a precursor to building a word prediction algorithm for the Capstone Data Science Project.
In summary, 3 data files containing over 100 million words were analysed. Exploratory Data Analysis was performed to understand the relationships between the words. In addition, some limited analysis of the use of Profanity within the corpus was conducted. The report ends with a high-level design for the prediction algorithm and further considerations which could have enhanced this analysis.
This report is intended to be understandable to a non-data-scientist manager.
The top 5 most frequently used words across the corpus were: “of”, “a”, “and”, “to” and “the”.
- Calculate the average words per line
- Remove strange characters
- Build n-grams, including 4-grams (quad-grams) such as “the quick brown fox”, “quick brown fox jumped”
- Determine the frequency of each unique n-gram and plot it (a sketch of these steps follows this list)
- Perform some simple Profanity Analysis on the words
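For illustration, the Python sketch below shows one way the cleaning, n-gram extraction and frequency counting steps could be implemented; the cleaning rule and the file path are simplifying assumptions rather than the exact approach used in the analysis.

```python
import re
from collections import Counter

def clean_line(line):
    """Remove strange characters: keep letters, apostrophes and spaces (a simplified rule)."""
    return re.sub(r"[^a-z' ]+", " ", line.lower())

def ngrams(tokens, n):
    """Yield n-grams, e.g. n=4 gives quad-grams such as 'the quick brown fox'."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def ngram_frequencies(path, n):
    """Count the frequency of each unique n-gram in one corpus file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(ngrams(clean_line(line).split(), n))
    return counts

# Hypothetical usage (the path assumes the extracted archive layout):
# quad = ngram_frequencies("final/en_US/en_US.blogs.txt", 4)
# quad.most_common(10)   # the 10 most frequent quad-grams, ready for plotting
```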
The 3 files were extracted from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
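For illustration, a minimal Python sketch of how the archive could be downloaded and extracted; the local file name is an assumption, not part of the original analysis.

```python
import urllib.request
import zipfile

URL = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
ZIP_PATH = "Coursera-SwiftKey.zip"   # local file name (an assumption)

urllib.request.urlretrieve(URL, ZIP_PATH)   # download the archive
with zipfile.ZipFile(ZIP_PATH) as archive:
    archive.extractall(".")                 # the English files sit under final/en_US/
```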
Corpus | Number of Lines | Number of Words | Avg. Words/Line |
---|---|---|---|
Blogs | 899,288 | 37,334,131 | 41.51 |
News | 1,010,242 | 34,372,529 | 34.02 |
Twitter | 2,360,148 | 30,373,583 | 12.86 |
Total | 4,269,678 | 102,080,243 | 23.91 |
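For illustration, a Python sketch of how the line counts, word counts and averages in the table above could be computed; the file names assume the extracted `final/en_US` folder and are not taken from the original analysis.

```python
def corpus_summary(path):
    """Return (lines, words, average words per line) for one corpus file."""
    lines = words = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            lines += 1
            words += len(line.split())
    return lines, words, words / lines

for name in ("blogs", "news", "twitter"):
    lines, words, avg = corpus_summary(f"final/en_US/en_US.{name}.txt")
    print(f"{name:8s} {lines:>10,} {words:>12,} {avg:6.2f}")
```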
The Twitter corpus had the lowest number of words per line (12.86); this is likely explained by the 140-character limit on tweets. Note that any future analysis of tweets may show a different outcome should Twitter relax this character limit, as it has been reported to be considering.
A distinctive feature of the News corpus is the use of reporting language such as “according to the” and “said in a statement”.
I wanted to understand the use of Profanity (abusive, vulgar or irreverent language) in each of the 3 data sets. I used “The 10 Most Popular Swear Words on Facebook (for both genders)” (http://www.slate.com/blogs/lexicon_valley/2013/09/11/top_swear_words_most_popular_curse_words_on_facebook.html) to identify differences between the data sets; a sketch of the check appears after the table below.
Data Type | Lines containing Profanity |
---|---|
Blogs | 1.7% |
News | 0.5% |
Twitter | 2.5% |
To avoid offending the readers of this report I have chosen not to list the top 10 words!
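For illustration, a Python sketch of the Profanity check; the `PROFANITY` set is a hypothetical placeholder, since the actual top-10 list is deliberately not reproduced in this report.

```python
import re

# Hypothetical placeholder: the real top-10 list from the cited article is deliberately omitted.
PROFANITY = {"swearword1", "swearword2"}

def profanity_rate(path, bad_words):
    """Return the percentage of lines containing at least one listed Profanity word."""
    total = flagged = 0
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            total += 1
            if set(re.findall(r"[a-z']+", line.lower())) & bad_words:
                flagged += 1
    return 100 * flagged / total

# Hypothetical usage:
# profanity_rate("final/en_US/en_US.twitter.txt", PROFANITY)
```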
Next, we calculate the number of unique words needed in a frequency-sorted dictionary to cover 50% and 90% of all word instances, as well as how much coverage just 100 words would achieve.
n-gram | Coverage from 100 words | Words for 50% coverage | Words for 90% coverage |
---|---|---|---|
Uni-gram | 46% | 150 words | 7,000 words |
For uni-grams, 100 words provide 46% coverage, whilst ~150 words provide 50% coverage; a sketch of how this coverage can be calculated is shown below.
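For illustration, a Python sketch of the coverage calculation, assuming `counts` is a `collections.Counter` of uni-gram frequencies such as the one built earlier; this is not the original implementation.

```python
def coverage(counts, top_n):
    """Fraction of all word instances covered by the top_n most frequent words."""
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total

def words_needed(counts, target):
    """Smallest number of frequency-sorted words needed to reach the target coverage."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running / total >= target:
            return rank

# Hypothetical usage, with `uni` a Counter of uni-gram frequencies:
# coverage(uni, 100)        # ~0.46 for this corpus
# words_needed(uni, 0.50)   # ~150 words
# words_needed(uni, 0.90)   # ~7,000 words
```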
A Word Cloud is a visual representation of text data, typically used to depict keywords. Tags are usually single words, and the importance of each tag is shown through font size; that is, a higher word frequency equates to a larger font size.
The figure below illustrates a word cloud for bi-grams:
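For illustration, a Python sketch of how such a bi-gram word cloud could be generated, assuming the third-party `wordcloud` package is installed and `bi` is a `Counter` of bi-gram frequencies; this is an assumed approach, not the one used to produce the figure.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud   # third-party package: pip install wordcloud

def plot_word_cloud(frequencies):
    """Render a word cloud where font size reflects n-gram frequency."""
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

# Hypothetical usage, with `bi` a Counter of bi-gram frequencies:
# plot_word_cloud(dict(bi.most_common(200)))
```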
Through this analysis I identified other areas which could have enhanced this model; they include:
Taking into account sentence boundaries in lines of text - for my model the whole line was treated as a contiguous sentence, irrespective of full-stops.
It would have been interesting to remove Stop Words (the most common words in a language, such as and, the, at, on) during the Exploratory Data Analysis to highlight any particular themes. Although this would be useful analysis, I considered it would not benefit word prediction.
I did not investigate the application of Stemming (reducing inflected, or sometimes derived, words to their word stem); a short sketch of Stop Word removal and Stemming is shown after this list.
Although a high-level analysis of Profanity words was performed, I chose not to remove them from the analysis.
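For illustration, a Python sketch of Stop Word removal and Stemming using NLTK; the use of NLTK here is an assumption for demonstration purposes and was not part of the original analysis.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")                    # one-off download of the Stop Word list

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = "the quick brown foxes jumped over the lazy dogs".split()
without_stops = [t for t in tokens if t not in STOP_WORDS]   # drops 'the', 'over'
stems = [stemmer.stem(t) for t in without_stops]             # 'foxes' -> 'fox', 'jumped' -> 'jump'
print(stems)
```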