Executive Summary

The overall purpose of this capstone project is to develop a predictive model of English text, delivered via a Shiny app. The data on which the model will be built comes from the following source: data. It comprises three files of text scraped from tweets, blogs, and news articles, providing a broad sample of English-language expression. This report presents some exploratory analysis of these files and outlines the next steps in building and improving the prospective model.

Data Summary

File      Size       # Lines (000s)   # Words (000s)   # Unique Words (000s)   Avg. Words per Line
Twitter   301.4 MB        2,360           30,218                384                     13
Blogs     248.5 MB          899           38,154                361                     42
News       19.2 MB           77            2,694                 89                     35
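
The original processing code is not shown in this report; the sketch below is one way such a summary could be computed in R. The file names are assumptions based on the standard course download.

```r
library(dplyr)
library(stringr)

# Assumed file names from the standard course download; adjust paths as needed.
files <- c(Twitter = "en_US.twitter.txt",
           Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(str_split(lines, "\\s+"))
  tibble(size_mb        = file.size(path) / 1024^2,
         n_lines        = length(lines),
         n_words        = length(words),
         n_unique       = n_distinct(tolower(words)),
         words_per_line = round(length(words) / length(lines)))
}

bind_rows(lapply(files, summarise_file), .id = "File")
```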

From this summary there are a number of interesting findings:

- The Twitter file has by far the most lines (2.36 million) but the shortest lines, averaging only 13 words, consistent with the platform's character limit.
- The blogs file contains the most words overall (roughly 38 million) and the longest lines, averaging 42 words.
- The news file is an order of magnitude smaller than the other two in size, line count, and word count.

Word Frequency

Here we examine the most common words in each dataset, starting with the top 10 from each:

Note: Even after removing stop words and numbers, it is clear that some accented characters have snuck into the cleaned-up text. This will have to be dealt with in future analysis.
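
The tokenization code is likewise not shown here; below is a sketch using tidytext of how the top-10 lists might be produced, along with one possible fix for the stray accented characters. The helper name and cleaning choices are assumptions, not the report's actual pipeline.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: top 10 words in one dataset after basic cleaning.
top_words <- function(lines, dataset) {
  tibble(dataset = dataset, text = lines) %>%
    unnest_tokens(word, text) %>%            # lowercases and strips punctuation
    anti_join(stop_words, by = "word") %>%   # remove common English stop words
    filter(!grepl("[0-9]", word)) %>%        # remove tokens containing digits
    count(word, sort = TRUE) %>%
    slice_head(n = 10)
}

# One option for the accent issue: transliterate to ASCII before tokenizing,
# dropping characters that have no ASCII equivalent.
# lines <- iconv(lines, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")
```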

Word Concentration

Now we seek to answer the question: how many words are necessary to cover 80% of English writing? First we compile a list of word frequencies across all three files, then examine the cumulative percentage of occurrences accounted for as words are added in order of frequency.

word          n       freq      cumsum
the     2942618  0.0414067   0.0414067
to      1926898  0.0271141   0.0685208
and     1600372  0.0225194   0.0910402
a       1577903  0.0222033   0.1132435
i       1503847  0.0211612   0.1344047
of      1295395  0.0182280   0.1526327

Remarkably, of the 634,935 unique words in the corpus, only 2,113 (0.33%) are required to cover 80% of all word occurrences. This will have implications for future analysis and for the development of the predictive model.
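
A sketch of the calculation behind these figures, assuming a word_counts data frame with one row per unique word and its total count n aggregated across all three files (an assumed intermediate; it reconstructs the freq and cumsum columns shown above):

```r
library(dplyr)

# `word_counts` is assumed: one row per unique word, with total count `n`
# aggregated across the Twitter, blogs, and news files.
coverage <- word_counts %>%
  arrange(desc(n)) %>%
  mutate(freq   = n / sum(n),
         cumsum = cumsum(freq))

# Smallest vocabulary needed to cover 80% of all word occurrences
n_80 <- sum(coverage$cumsum < 0.80) + 1
n_80                      # 2113 in this corpus
n_80 / nrow(coverage)     # ~0.33% of the 634935 unique words
```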

Future Steps

This exploratory analysis provided valuable direction for the work ahead. Next steps include cleaning the residual accented characters noted above, exploiting the heavy concentration of word usage to keep the model's vocabulary manageable, and building, evaluating, and refining the predictive model for delivery via the Shiny app.