This is an exploratory analysis report of the text corpus provided as part of the Coursera Capstone. The English corpus (the scope of this project) consists of three separate sets: news articles, blogs and tweets.
Each corpus was ingested as a character vector. The following transformations were then applied:
Word transformations: Compound words were decomposed into separate words. Also, abbreviated words were expanded to their full forms. The specific transformations were as follows:

* “can’t” to “cannot”
* “won’t” to “would not”
* “n’t” to " not"
* “I’m” to “I am”
* “it’s” to “It is”
* “Mr.” to “Mr”
* “Mrs.” to “Mrs”
* “Sr.” to “Sr”
The last three transformations preserve only those periods that mark the end of a sentence, not the periods that form part of abbreviations (a sketch of the ingestion and these substitutions follows).
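A minimal sketch of the ingestion and word-level substitutions described above; the file path, the helper function name and the exact regular expressions are assumptions, not the original script:

```r
# Ingest one corpus file as a character vector (file path is an assumption)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Expand contractions and drop periods from common abbreviations
expand_words <- function(x) {
  x <- gsub("\u2019", "'", x)                        # normalize curly apostrophes (assumption)
  x <- gsub("can't", "cannot", x, ignore.case = TRUE)
  x <- gsub("won't", "would not", x, ignore.case = TRUE)
  x <- gsub("n't", " not", x, fixed = TRUE)
  x <- gsub("I'm", "I am", x, fixed = TRUE)
  x <- gsub("it's", "it is", x, ignore.case = TRUE)
  x <- gsub("Mr\\.", "Mr", x)
  x <- gsub("Mrs\\.", "Mrs", x)
  x <- gsub("Sr\\.", "Sr", x)
  x
}
blogs <- expand_words(blogs)
```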
End-of-sentence transformations: When creating bigrams or trigrams, sentence boundaries have to be preserved so that an n-gram never spans two sentences. This was achieved by splitting every element of the corpus on periods, exclamation marks and round braces, which produced a character vector in which each element is a single sentence.
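A sketch of the sentence splitting, assuming the boundary characters named above (period, exclamation mark and round braces):

```r
# Split each document into sentences so that n-grams never cross a boundary
sentences <- unlist(strsplit(blogs, "[.!()]+"))
sentences <- trimws(sentences)
sentences <- sentences[nchar(sentences) > 0]   # drop empty fragments
```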
Removing special characters: With end-of-sentence markers and abbreviation periods already accounted for as above, there was no need to retain any non-alphanumeric characters. The sentence vector was therefore stripped of everything other than alphanumeric characters.
Converting to lower case: This could not be done earlier because of special non-UTF-8 characters such as emoticons. The step is necessary before any tokenization and frequency analysis.
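These two steps amount to a couple of substitutions followed by `tolower()`; the exact character class used in the project is an assumption:

```r
# Keep only alphanumerics and spaces, then fold to lower case
sentences <- gsub("[^[:alnum:] ]", " ", sentences)
sentences <- gsub("\\s+", " ", sentences)   # collapse repeated whitespace
sentences <- tolower(sentences)
```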
Removing profanities: A list of profane words was obtained from https://www.cs.cmu.edu/~biglou/resources/ and these words were stripped from the corpus vectors. A regex approach ensured that the words were identified even when they occurred inside longer words.
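A sketch of the profanity filter; the local file name `bad-words.txt` is an assumption, and a single alternation pattern is one simple (if slow) way to match profane words even when they are embedded inside longer words. Entries containing regex metacharacters would need escaping first.

```r
# Profanity list downloaded from https://www.cs.cmu.edu/~biglou/resources/
profanity <- readLines("bad-words.txt", warn = FALSE)
profanity <- tolower(trimws(profanity))
profanity <- profanity[nchar(profanity) > 0]

# One alternation pattern; no word boundaries, so profane substrings
# inside longer words are removed as well
pattern <- paste(profanity, collapse = "|")
sentences <- gsub(pattern, "", sentences)
```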
Having created a clean sentence set, the quanteda package was used to create tokens. Unigrams, bigrams, trigrams, 4-grams and 5-grams were generated along with their frequencies.
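A sketch of the n-gram generation with quanteda; `tokens_ngrams()`, `dfm()` and `topfeatures()` are standard quanteda functions, but the exact options used in the project are assumptions:

```r
library(quanteda)

# Tokenize the cleaned sentences
toks <- tokens(sentences)

# Build n-grams of order 1 to 5 and tabulate the most frequent of each
ngram_freqs <- lapply(1:5, function(n) {
  topfeatures(dfm(tokens_ngrams(toks, n = n)), n = 20)
})
names(ngram_freqs) <- paste0(1:5, "-grams")
```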
What is the nature and size of the corpus?
| Corpus type | No. of documents in corpus |
|---|---|
| Blogs | 899288 |
| News | 77259 |
| Tweets | 2360148 |
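The document counts above are simply the lengths of the ingested character vectors; a sketch, assuming the vectors `blogs`, `news` and `tweets` from the ingestion step:

```r
# Number of documents (lines) in each corpus
data.frame(
  corpus    = c("Blogs", "News", "Tweets"),
  documents = c(length(blogs), length(news), length(tweets))
)
```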
What is the distribution of document size for each corpus?