Milestone Report

Introduction

This report presents an exploratory data analysis of the most common words in three major channels of media consumption in the United States:

  • Twitter

  • Blogs

  • News

This report is meant to be friendly to citizen data scientists and is written so that it generalizes to a variety of text data. The report-level function source code is set to echo = FALSE and is therefore not shown in this report. That code is, however, available at github.com/KKONZ/Data-Sceince-Capstone. To give an idea of what we are working with, a table of summary statistics from the full files is displayed below.

twitterPath <- '~/final/en_US/en_US.twitter.txt'
twitterSummary <- textSummary(twitterPath)

blogPath <- '~/final/en_US/en_US.blogs.txt'
blogSummary <- textSummary(blogPath)

newsPath <- '~/final/en_US/en_US.news.txt'
newsSummary <- textSummary(newsPath)

See https://github.com/kkonz/DataScienceCapstone for the full markdown with the report function calls.

df <- setNames(data.frame(twitterSummary, blogSummary, newsSummary,                
                          row.names = c(
                            "Number of Lines", 
                            "Number of Non-blank Lines", 
                            "Number of Characters", 
                            "Number of Non-blank Characters")), 
               c("twitter", "blogs", "news")
               )
formatTable(df)
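
Since the source of textSummary is not shown in this report, the following is only a minimal sketch of how the four statistics in the table could be computed. The function name text_summary_sketch is hypothetical, it takes a character vector of lines rather than a file path, and the author's actual function in the repository may differ.

```r
# Hypothetical stand-in for the report's textSummary(); the real function
# lives in the author's repository and may be implemented differently.
text_summary_sketch <- function(lines) {
  # A line counts as non-blank if anything remains after trimming whitespace
  non_blank <- lines[nzchar(trimws(lines))]
  c(
    lines           = length(lines),
    non_blank_lines = length(non_blank),
    chars           = sum(nchar(lines)),
    # Non-blank characters: drop all whitespace before counting
    non_blank_chars = sum(nchar(gsub("[[:space:]]", "", lines)))
  )
}

example_lines  <- c("Hello world", "", "  ", "R is fun")
summary_counts <- text_summary_sketch(example_lines)
print(summary_counts)
```

For a real file, the lines could be read first with readLines() and then passed to the sketch.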

Next, 0.5% of the overall data is sampled and cleaned: the text is lowercased, and punctuation and stopwords are removed. The 100 most frequent words are plotted below:
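
The sampling and cleaning steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for this report: the helper names sample_lines and top_words, the seed, and the very short stopword list are all hypothetical, and a real analysis would likely use a fuller stopword list from a text-mining package.

```r
# Tiny illustrative stopword list (an assumption, not the list used in the report)
stopwords_sketch <- c("the", "a", "an", "and", "is", "to", "of", "in", "i", "it")

# Draw a random fraction of the lines (0.5% by default), at least one line
sample_lines <- function(lines, fraction = 0.005, seed = 42) {
  set.seed(seed)
  n <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), n)]
}

# Lowercase, strip punctuation, drop stopwords, and count the n most frequent words
top_words <- function(lines, stopwords, n = 100) {
  text   <- tolower(lines)
  text   <- gsub("[[:punct:]]", " ", text)
  words  <- unlist(strsplit(text, "[[:space:]]+"))
  words  <- words[nzchar(words) & !(words %in% stopwords)]
  tab    <- table(words)
  counts <- setNames(as.integer(tab), names(tab))
  counts <- sort(counts, decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

toy_corpus  <- c("I like the news.", "Get the news, get it fast!", "Like it? Get it.")
word_counts <- top_words(toy_corpus, stopwords_sketch, n = 3)
print(word_counts)

# Sampling 0.5% of 1000 lines keeps 5 of them
sampled <- sample_lines(as.character(1:1000))
```

The resulting named vector of counts is the kind of input a word cloud or bar chart of top words would be drawn from.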

Twitter Word Cloud

I am not too surprised by the most common words in the Twitter data set. I think of Twitter as an incredibly fast medium that is often the first platform to report on major events. Here we see "get" as the most common word. Twitter is well known for emotional posts, so "like" as the second most frequent word is also not too surprising.

10 Most Frequently Sampled Twitter Words

Blog Word Cloud

Here we see many frequent words that would be expected from a blog, such as "time", "know", and "people".

10 Most Frequently Sampled Blog Words

News Word Cloud

"Said" is by far the most frequent word in the news data set. It may also be worth reiterating that this data set is much smaller than the other two, so its counts may be more prone to random chance. Seeing "said" as the most common news word makes sense as well: most news stories are narrative in essence, so they are commonly written in the third person and quote statements from people relevant to the story.

10 Most Frequently Sampled News Words

Conclusion

We can make some fairly solid deductions from the overviews above of the three datasets provided by SwiftKey for the Data Science Capstone from Johns Hopkins. However, what we have done so far does not provide much in the way of actionable insight. For the final project, we will create a predictive Shiny app that finishes a phrase from a given platform.
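
To make the idea of finishing a phrase concrete, here is a minimal sketch of one possible approach: suggest the word that most often follows a given word in the sampled text (a simple bigram count). The function predict_next is hypothetical and is not the model planned for the final app, which has not yet been built.

```r
# Hypothetical bigram sketch: return the most frequent follower of `word`
predict_next <- function(lines, word) {
  # Lowercase, strip punctuation, and split each line into tokens
  cleaned <- tolower(gsub("[[:punct:]]", " ", lines))
  tokens  <- lapply(strsplit(cleaned, "[[:space:]]+"), function(w) w[nzchar(w)])

  # Collect every token that directly follows `word` within a line
  followers <- unlist(lapply(tokens, function(w) {
    if (length(w) < 2) return(character(0))
    w[-1][w[-length(w)] == tolower(word)]
  }))

  # No evidence for this word: return NA rather than guess
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

toy_lines  <- c("Thank you so much", "thank you", "Thank goodness")
suggestion <- predict_next(toy_lines, "thank")
print(suggestion)
```

A fuller model would likely use longer n-grams and a fallback for unseen phrases; this sketch only shows the basic counting step.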

For more computationally complex initiatives, it would be advantageous to use Gensim and Python. For a similar project of mine, see my post at https://kkonz.github.io/2017-12-10-topic-modeling-with-yelp-pizza2vec/.