Executive summary of exploratory data analysis

The original database contains 4,269,678 lines of text spread across news, blogs and tweets in English. Preprocessing the full database took about 20 minutes on a laptop with an Intel i3 processor, 8 GB RAM and 54 GB swap, running Debian 8.6 (Jessie), RStudio version 1.0.44 and R version 3.3.2, “Sincere Pumpkin Patch”.

I. Preprocessing consisted of

  1. Removing anything other than English letters or space;
  2. Converting all letters to lowercase;
  3. Removing “stopwords” (common words) that usually have no analytic value;
  4. Removing extra whitespace.
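
The four steps above can be sketched as follows. This is a minimal illustration in Python, not the R script used for this report; the function name and the tiny stopword list are assumptions made for the example, since the actual stopword list is not given here.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for illustration only).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}

def preprocess(line, stopwords=STOPWORDS):
    """Apply the four cleaning steps to one line of text."""
    # 1. Remove anything other than English letters or spaces.
    text = re.sub(r"[^A-Za-z ]", "", line)
    # 2. Convert all letters to lowercase.
    text = text.lower()
    # 3. Remove stopwords.
    tokens = [t for t in text.split() if t not in stopwords]
    # 4. Remove extra whitespace by rejoining tokens with single spaces.
    return " ".join(tokens)

print(preprocess("I can't wait to see the NEW   York City!!"))
```

Because step 1 drops apostrophes, contractions such as “can't” become “cant”, which is why forms like “cant wait” show up among the n-grams.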

The script used for processing can be downloaded here.

The processed database (full) can be downloaded here.

After cleaning the database, a 10% sample was drawn, proportional to the categories blog, Twitter and news; it can be downloaded here.
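
A proportional 10% sample of this kind can be drawn category by category. The sketch below is a Python illustration under stated assumptions: each category is held in memory as a list of lines, and the function name, the seed and the toy data are hypothetical, not taken from the original script.

```python
import random

def stratified_sample(strata, fraction=0.10, seed=42):
    """Draw the same fraction of lines from every category (stratum)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for name, lines in strata.items():
        k = round(len(lines) * fraction)  # sample size proportional to the stratum
        sample[name] = rng.sample(lines, k)
    return sample

# Toy data standing in for the three cleaned files.
strata = {
    "blogs": [f"blog line {i}" for i in range(100)],
    "news": [f"news line {i}" for i in range(200)],
    "twitter": [f"tweet {i}" for i in range(50)],
}
sample = stratified_sample(strata)
print({name: len(lines) for name, lines in sample.items()})
```

Sampling within each category keeps the category proportions of the sample equal to those of the full database.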

II. File sizes and very basic summary

The dataset comprises three files in English; these will be kept separate since they likely have different n-gram statistics. They are summarized below:

  1. en_US.blogs.txt -> 899,288 lines 37,334,690 words
  2. en_US.news.txt -> 1,010,242 lines 34,372,720 words
  3. en_US.twitter.txt -> 2,360,148 lines 30,374,206 words

III. Unigrams, Bigrams, Trigrams

N-grams were computed for n = 1, 2 and 3. The 20 most frequent of each are listed in the “TOP 20 - MOST USED WORDS …” charts. The most frequent unigrams were: “to say, just, will, one and like”. For bigrams, “right now, new york, cant wait, dont know”. For trigrams, “cant wait see, new york city, happy mothers day” stood out as the most used.
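
Counting n-grams and taking the top of the list can be sketched as below. This is a Python illustration with hypothetical function names and toy input, assuming the lines have already been cleaned and can be tokenized by splitting on spaces; it is not the code that produced the results in this report.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(lines, n, k=20):
    """Count n-grams line by line and return the k most frequent."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)

# Toy input (already cleaned); n-grams never cross line boundaries here.
lines = [
    "cant wait see new york city",
    "new york city right now",
    "cant wait see",
]
print(top_ngrams(lines, 2, k=3))
print(top_ngrams(lines, 3, k=2))
```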

IV. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? How do you evaluate how many of the words come from foreign languages? Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

The number of unique n-grams needed grows steeply with the coverage required, quickly reaching very large values. For unigrams, raising coverage from 50% to 90% means going from 981 unique words to 15,067, approximately 15.36x as many. For bigrams and trigrams the increases are about 2.22x and 1.82x respectively. That is, for n-grams with n > 1 the rate of increase tends to fall, but the result is still a heavy database: at 90% coverage the unigrams needed are only 2.69% of all unique unigrams, while the trigrams needed reach 88.84% of all unique trigrams.
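
The coverage figures come from walking a frequency-sorted dictionary until the cumulative count reaches the target share of all instances. A minimal Python sketch of that calculation, with a hypothetical function name and toy counts (not the report's data):

```python
from collections import Counter

def words_for_coverage(counts, target):
    """Number of most-frequent entries needed to cover `target` of all instances."""
    total = sum(counts.values())
    covered = 0
    # Walk the dictionary from most to least frequent, accumulating instances.
    for needed, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered >= target * total:
            return needed
    return len(counts)

# Toy frequency table: 100 word instances over 5 unique words.
counts = Counter({"a": 50, "b": 30, "c": 12, "d": 5, "e": 3})
print(words_for_coverage(counts, 0.5), words_for_coverage(counts, 0.9))
```

With the values in the appendix table, the growth factors quoted above are simple ratios, for example 15067 / 981 for unigrams.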

The idea of the prediction model is to use the given natural language sample. This sample may contain neologisms, adopted foreign words and slang that follow a syntax of their own, so the ideal is to use what people actually use: if we want a prediction model for a particular audience, we should use the natural language of that group. If it is nevertheless necessary to evaluate words from foreign languages, we can use the “tm_map” function with “removeWords” and a dictionary of the language in question. The difference in word count before and after removal gives an estimate of the number of words from that particular language in the corpus.
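
The word-count difference described above can be illustrated outside R as well. The Python sketch below mirrors the idea only; it does not call the tm package, and the function name and the three-word “dictionary” are assumptions made for the example.

```python
def count_language_words(tokens, language_dictionary):
    """Count tokens removed when filtering against a language dictionary."""
    kept = [t for t in tokens if t not in language_dictionary]
    # The drop in word count is the number of tokens found in the dictionary.
    return len(tokens) - len(kept)

# Hypothetical mini-dictionary of foreign words (assumption, for illustration).
foreign_dictionary = {"bonjour", "merci", "amigo"}
tokens = "new york bonjour city amigo right now".split()

removed = count_language_words(tokens, foreign_dictionary)
print(removed, removed / len(tokens))
```

Words shared between languages would be counted as foreign by this approach, so the result is better read as an upper estimate.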

There are several ways to increase coverage. One is to adapt the Pareto rule to achieve the highest coverage with the fewest words. Another is to use a library of synonyms. A third would be a system that learns from changes in the user’s writing and adapts when its accuracy falls outside the predicted confidence interval.

V. Bonus exploratory data analysis - Text sentiment

A reasonable point to consider is sentiment analysis, that is, how can we perceive texts written by netizens?

For that, the sentiment lexicons “bing”, “AFINN” and “nrc” (see more here) were used. Some interesting results:

  1. In general we have more positive words than negative words.
  2. On the AFINN scale of -5 to 5, sentiments generally fall between -3 and 3, that is, without extreme feelings.
  3. The way people use words is strongly tied to their positive and negative feelings.
  4. When it comes to negative texts there is a tendency to use more words.
  5. In some cases feelings like sadness, anger, fear and disgust (negatives) have been used in a positive context, so they may be a form of sarcasm. The same occurs with surprise and anticipation (positives) appearing in a negative context.

More details can be seen in the charts and tables in the appendix. For every sentiment evaluated, the metric (mean or proportion) was estimated together with its standard error, using sampling techniques.
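
For reference, the textbook standard errors of a mean and of a proportion can be computed as in the Python sketch below. It assumes simple random sampling; the report does not state which sampling technique or formula was actually used, so this is an illustration with hypothetical function names and toy data, not a reproduction of the estimates in the appendix.

```python
import math

def mean_with_se(values):
    """Sample mean and its standard error, sqrt(s^2 / n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(variance / n)

def proportion_with_se(flags):
    """Sample proportion of 1s and its standard error, sqrt(p(1-p)/n)."""
    n = len(flags)
    p = sum(flags) / n
    return p, math.sqrt(p * (1 - p) / n)

# Toy data: AFINN-style scores and positive/negative flags.
print(mean_with_se([1, 2, 3, 4, 5]))
print(proportion_with_se([1, 1, 1, 0]))
```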

VI. Get feedback on your plans for creating a prediction algorithm and Shiny app

To predict words, the plan is to use a sampling approach (10% of the database, proportional to the strata blog, Twitter and news).

I believe that, for a Shiny application, I will have to give up some accuracy in order to balance robustness against computational effort. A 2-gram model should therefore be enough to find a good compromise between the two.

The modeling will probably be done with neural networks.
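
To make the 2-gram idea concrete, here is a count-based next-word lookup in Python. It is a simple frequency baseline, not the neural network mentioned above and not the planned Shiny implementation; the function names and toy training lines are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(lines):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word, k=3):
    """Return up to k most frequent followers of `word`, or [] if unseen."""
    if word not in model:
        return []
    return [w for w, _ in model[word].most_common(k)]

# Toy training data (already cleaned).
model = train_bigram_model(["cant wait see", "cant wait", "cant believe"])
print(predict_next(model, "cant", k=1))
```

A lookup table like this is cheap to query, which is the kind of trade-off between accuracy and computational effort discussed above.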

Appendix

Unique words to cover 50% and 90% of all instances in the language
                     1-gram.5   1-gram.9    2-gram.5    2-gram.9    3-gram.5    3-gram.9
unique words (abs)     981.00   15067.00   170167.00   377207.00   232489.00   422914.00
unique words (%)         0.18       2.69       32.88       72.88       48.84       88.84

Summary of word board sentiments
word              freq              bing           AFINN             nrc            prob               AFINN.score
Length:1969       Min.   :   1.00   negative:1378  Min.   :-5.0000   negative:468   Min.   :1.417e-05  Min.   :-0.0161903
Class :character  1st Qu.:   4.00   positive: 591  1st Qu.:-2.0000   sadness :236   1st Qu.:5.666e-05  1st Qu.:-0.0003824
Mode  :character  Median :  10.00                  Median :-2.0000   anger   :233   Median :1.416e-04  Median :-0.0001133
                  Mean   :  35.85                  Mean   :-0.7659   positive:220   Mean   :5.079e-04  Mean   : 0.0002656
                  3rd Qu.:  30.00                  3rd Qu.: 2.0000   fear    :216   3rd Qu.:4.249e-04  3rd Qu.: 0.0001417
                  Max.   :1791.00                  Max.   : 5.0000   disgust :174   Max.   :2.537e-02  Max.   : 0.0761070
                                                                     (Other) :422

Sentiments Estimators
                        metric       se
AFINN.mean             0.52306  0.02521
bing.pos.prop          0.57853  0.00266
bing.neg.prop          0.42147  0.00266
nrc.anger.prop         0.06990  0.00266
nrc.antecipation.prop  0.10084  0.00266
nrc.disgust.prop       0.05475  0.00266
nrc.fear.prop          0.07410  0.00266
nrc.joy.prop           0.14054  0.00266
nrc.negative.prop      0.13285  0.00266
nrc.positive.prop      0.18019  0.00266