The overall purpose of this capstone project is to develop a predictive model of English text, delivered via a Shiny app. The data on which the model will be built comes from the following source: data. It comprises three files containing large amounts of text scraped from tweets, blogs, and news articles, providing a broad sample of English-language expression. This report presents some exploratory analysis of these files and outlines the next steps in building and improving the prospective model.
| File | Size | # Lines (000s) | # Words (000s) | # Unique Words (000s) | Avg. Words per Line |
|---|---|---|---|---|---|
| Twitter | 301.4 MB | 2360 | 30218 | 384 | 13 |
| Blogs | 248.5 MB | 899 | 38154 | 361 | 42 |
| News | 19.2 MB | 77 | 2694 | 89 | 35 |
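The summary above can be reproduced with simple line and word counts. Below is a minimal sketch, assuming the standard corpus file names (`en_US.twitter.txt`, `en_US.blogs.txt`, `en_US.news.txt`) and the `stringi` package for word extraction; these are assumptions for illustration rather than the exact code used to build the table.

```r
library(stringi)

# Assumed file names for the three corpora
files <- c(Twitter = "en_US.twitter.txt",
           Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- unlist(stri_extract_all_words(lines))
  words <- words[!is.na(words)]
  data.frame(
    size_MB        = round(file.size(path) / 1024^2, 1),
    lines_000s     = round(length(lines) / 1000),
    words_000s     = round(length(words) / 1000),
    unique_000s    = round(length(unique(tolower(words))) / 1000),
    words_per_line = round(length(words) / length(lines))
  )
}

do.call(rbind, lapply(files, summarise_file))
```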
From this summary there are a number of interesting findings: the Twitter file has by far the most lines but the shortest ones, averaging only 13 words per line, while blog posts average the longest lines at 42 words; the news file is much smaller than the other two, yet its lines are nearly as long as those of the blogs.
Here we examine the most common words in each dataset, starting with the top 10 words from each:
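As a rough sketch of how these top-10 lists might be computed, here is one possible `tidytext`/`dplyr` pipeline; the package choice and the `corpus` data frame (with columns `source` and `text`) are assumptions for illustration, not necessarily the exact code used here.

```r
library(dplyr)
library(tidytext)

# `corpus` is assumed to hold one row per document, with columns `source` and `text`
top_words <- corpus %>%
  unnest_tokens(word, text) %>%            # split text into one word per row
  anti_join(stop_words, by = "word") %>%   # remove common English stop words
  filter(!grepl("[0-9]", word)) %>%        # drop tokens containing digits
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 10) %>%                 # keep the 10 most frequent words per dataset
  ungroup()
```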
Note: even after removing stop words and numbers, it is clear that some words containing accented characters have snuck into the cleaned-up version. This is something that will have to be dealt with in future analysis.
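One simple option for that future cleaning, shown purely as an illustration, would be to drop any token that is not pure ASCII using `stringi`; whether such words should instead be transliterated is still an open question.

```r
library(stringi)

# Keep only tokens composed entirely of ASCII characters
drop_non_ascii <- function(words) {
  words[stri_enc_isascii(words)]
}

drop_non_ascii(c("caf\u00e9", "hello", "world"))  # returns "hello" "world"
```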
Now we’ll seek to answer the question: how many words are necessary to comprehend 80% of English writing? First we compile a list of word frequencies across all three files, then look at the cumulative percentage of occurrences each word contributes; a sketch of this computation follows the table below.
| Word | Count | Frequency | Cumulative Frequency |
|---|---|---|---|
| the | 2942618 | 0.0414067 | 0.0414067 |
| to | 1926898 | 0.0271141 | 0.0685208 |
| and | 1600372 | 0.0225194 | 0.0910402 |
| a | 1577903 | 0.0222033 | 0.1132435 |
| i | 1503847 | 0.0211612 | 0.1344047 |
| of | 1295395 | 0.0182280 | 0.1526327 |
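The table and the coverage figure below can be derived along the following lines; `all_words` is a hypothetical data frame holding one tokenised word per row, and `dplyr` is assumed.

```r
library(dplyr)

coverage <- all_words %>%
  count(word, sort = TRUE) %>%       # n: number of occurrences of each word
  mutate(freq   = n / sum(n),        # share of all word occurrences
         cumsum = cumsum(freq))      # running (cumulative) share

# Smallest number of top-ranked words whose cumulative share reaches 80%
words_for_80 <- which(coverage$cumsum >= 0.8)[1]
words_for_80 / nrow(coverage)        # as a fraction of the unique vocabulary
```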
Amazingly, of the 634,935 unique words, only 2,113 (0.33%) are required to cover 80% of all word occurrences. This will have implications for the further analysis and development of the predictive model.
This exploratory analysis provides a solid foundation for the next phase: building, testing, and refining the predictive text model that will power the Shiny app.