Exploratory analyses

Using a basic word counter which took each instance of a space as a separate word, the Twitter dataset contained 28,343,285 words, the news dataset contained 2,643,187 words, and the blogs dataset contained 37,205,236 words.

In terms of lines, the Twitter dataset contained 2,302,307 lines, news contained 77,258, and blogs contained 898,384.

Once profanity was filtered out, it was reduced to a smaller subsample via random sampling, leaving us with 50,000 lines.

Combined Dataset

Punctuations, numbers, and whitespace were removed from the dataset, and all words were converted to lowercase. Now let’s look at the top 10 most common terms, as shown in the table below.

Also bigrams

And trigrams

Plans for analyses

One approach could be to: