The original database contains 4,269,678 lines of text distributed across news, blogs and tweets in English. Preprocessing the full database took about 20 minutes on a notebook with an Intel i3 processor, 8 GB RAM and 54 GB swap, running Debian Jessie 8.6, RStudio 1.0.44 and R 3.3.2 ("Sincere Pumpkin Patch").
I. Preprocessing
The script used for processing can be downloaded here.
The processed database (full) can be downloaded here.
After cleaning the database, a 10% sample, proportional to the Blog, Twitter and News categories, can be downloaded here.
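The cleaning script itself is linked above rather than reproduced here; as a rough illustration, a minimal tm-based pipeline along these lines might look as follows (the file name, the seed and the exact transformations are assumptions, and the linked script may differ):

```r
# Minimal sketch of a tm-based cleaning pipeline (assumptions: file name, seed,
# exact transformation steps).
library(tm)

lines  <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
corpus <- VCorpus(VectorSource(lines))

corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated spaces

# 10% sample of the cleaned lines, taken per source file so the
# blog/news/twitter proportions are preserved
set.seed(2017)
keep <- sample(length(lines), size = round(0.10 * length(lines)))
```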
II. File sizes and basic summary
The dataset comprises three files in English; these are kept separate since they likely have different n-gram statistics. They are summarized below:
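A summary of this kind (file size, line count and word count per file) can be computed with base R alone; the en_US.* file names below are an assumption:

```r
# Size, line and word counts per file, using base R only.
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")  # assumed names

summary_tbl <- do.call(rbind, lapply(files, function(f) {
  txt <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(file    = f,
             size_MB = round(file.info(f)$size / 1024^2, 1),
             lines   = length(txt),
             words   = sum(lengths(strsplit(txt, "\\s+"))))
}))
summary_tbl
```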
III. Unigrams, Bigrams, Trigrams
N-grams were computed for n = 1, 2, 3, and the twenty most frequent of each are listed in the “TOP 20 - MOST USED WORDS …” charts. The most frequent unigrams were “say”, “just”, “will”, “one” and “like”; the most frequent bigrams were “right now”, “new york”, “cant wait” and “dont know”; and among trigrams, “cant wait see”, “new york city” and “happy mothers day” stood out.
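For illustration, counts like these can be produced with the tidytext package; the data frame `sample_df` (one line of text per row, in a `text` column) is an assumption:

```r
# Unigram, bigram and trigram frequency tables with tidytext.
library(dplyr)
library(tidytext)

unigrams <- sample_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

head(unigrams, 20)  # the "TOP 20" plots are built from counts like these
```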
IV. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%? How do you evaluate how many of the words come from foreign languages? Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
Coverage grows very quickly with dictionary size, so the number of unique n-grams required reaches very large values. For unigrams, raising coverage from 50% to 90% means going from 981 unique words to 15,067, roughly 15.36 times as many. For bigrams and trigrams the corresponding factors are 2.22x and 1.81x; that is, for n-grams with n > 1 the rate of increase is smaller, but the dictionary is still heavy, because the unique words needed for 90% coverage are only 2.69% of all distinct 1-grams, whereas 88.84% of all distinct 3-grams are needed at the same coverage.
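As a sketch of how such coverage figures are obtained: sort the frequency table, accumulate the shares, and take the first index that reaches the target (using the `unigrams` table from the sketch above):

```r
# Number of unique words needed to cover a given share of all word instances.
coverage <- function(freq_table, target) {
  cum_share <- cumsum(freq_table$n) / sum(freq_table$n)
  which(cum_share >= target)[1]
}

coverage(unigrams, 0.50)  # 981 unique words in this corpus
coverage(unigrams, 0.90)  # 15,067 unique words
```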
The idea of the prediction model is to use the natural language sample as it is given. The sample may contain neologisms, adopted foreign words and slang that follow their own syntax, so the ideal is to use what people actually write: if we want a prediction model for a particular audience, we should use the natural language of that group. If it is nevertheless necessary to evaluate words from foreign languages, we can use the “tm_map” function to remove words based on a language dictionary; the difference in word counts before and after removal gives an estimate of how many words from that language appear in the corpus.
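A minimal sketch of that check, assuming the `corpus` object from the preprocessing sketch and a character vector `english_dictionary` loaded from an English word list:

```r
# Count words before and after removing everything found in an English word
# list; the leftover count is a rough estimate of non-English words (plus slang,
# typos and names). `english_dictionary` is an assumed character vector.
library(tm)

count_words <- function(corp) {
  sum(sapply(corp, function(doc) {
    tokens <- unlist(strsplit(as.character(doc), "\\s+"))
    sum(nzchar(tokens))
  }))
}

before <- count_words(corpus)
after  <- count_words(tm_map(corpus, removeWords, english_dictionary))

before - after  # words matched by the English dictionary
after           # words left over: foreign words, slang, typos, names
```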
There are several ways to increase coverage. One is to adapt a Pareto-style rule to obtain the highest coverage with the smallest number of words. Another is to use a synonym library, so that rare words are covered by more frequent equivalents. A third is a system that learns from changes in the user’s writing and adapts when its accuracy falls outside the predicted confidence interval.
V. Bonus exploratory data analysis - Sentiment in the texts
A reasonable point to consider is sentiment analysis, that is, how can we characterize the emotional tone of the texts written by netizens?
For that, the sentiment lexicons “bing”, “AFINN” and “nrc” (see more here) were used. Some interesting results follow:
More details can be seen in the charts and tables in the appendix. For every sentiment evaluated, the metrics (mean and proportion) were calculated together with their standard errors using sampling techniques.
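As an illustration, the “bing” lexicon can be joined to the unigram counts with tidytext, and a standard error for the positive-word proportion obtained by resampling; the 1,000 bootstrap replicates and the `unigrams` table are assumptions, and “AFINN” and “nrc” can be joined the same way:

```r
# Join the "bing" lexicon to the unigram counts and estimate the proportion of
# positive words with a bootstrap standard error.
library(dplyr)
library(tidytext)

bing_words <- unigrams %>%
  inner_join(get_sentiments("bing"), by = "word")  # adds a `sentiment` column

boot <- replicate(1000, {
  idx <- sample(nrow(bing_words), replace = TRUE)
  mean(bing_words$sentiment[idx] == "positive")
})

c(bing.pos.prop = mean(bing_words$sentiment == "positive"),
  se            = sd(boot))
```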
VI. Get feedback on your plans for creating a prediction algorithm and Shiny app
To predict words, the plan is to use a sampling approach (10% of the database, proportional to the blog, twitter and news strata).
I believe that, for a Shiny application, some accuracy has to be sacrificed in the trade-off between robustness and computational effort. A 2-gram model should therefore be enough to find a good balance between these two points.
The modeling will probably be done with neural networks.
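As a simple baseline against which such a model could be compared, a plain 2-gram lookup table already yields a next-word prediction; this is only a sketch, and the `bigrams` table and the object names are assumptions:

```r
# Baseline next-word predictor from the 2-gram counts: for the last word typed,
# return its most frequent continuations.
library(dplyr)
library(tidyr)

bigram_model <- bigrams %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  group_by(w1) %>%
  top_n(3, n) %>%      # keep the three most frequent continuations per word
  ungroup()

predict_next <- function(last_word) {
  bigram_model %>%
    filter(w1 == tolower(last_word)) %>%
    arrange(desc(n)) %>%
    pull(w2)
}

predict_next("new")  # e.g. "york" should rank highly in this corpus
```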
| | 1-gram (50% cov.) | 1-gram (90% cov.) | 2-gram (50% cov.) | 2-gram (90% cov.) | 3-gram (50% cov.) | 3-gram (90% cov.) |
|---|---|---|---|---|---|---|
| unique words (abs) | 981.00 | 15067.00 | 170167.00 | 377207.00 | 232489.00 | 422914.00 |
| unique words (%) | 0.18 | 2.69 | 32.88 | 72.88 | 48.84 | 88.84 |
| word | freq | bing | AFINN | nrc | prob | AFINN.score |
|---|---|---|---|---|---|---|
| Length:1969 | Min. : 1.00 | negative:1378 | Min. :-5.0000 | negative:468 | Min. :1.417e-05 | Min. :-0.0161903 |
| Class :character | 1st Qu.: 4.00 | positive: 591 | 1st Qu.:-2.0000 | sadness :236 | 1st Qu.:5.666e-05 | 1st Qu.:-0.0003824 |
| Mode :character | Median : 10.00 | | Median :-2.0000 | anger :233 | Median :1.416e-04 | Median :-0.0001133 |
| | Mean : 35.85 | | Mean :-0.7659 | positive:220 | Mean :5.079e-04 | Mean : 0.0002656 |
| | 3rd Qu.: 30.00 | | 3rd Qu.: 2.0000 | fear :216 | 3rd Qu.:4.249e-04 | 3rd Qu.: 0.0001417 |
| | Max. :1791.00 | | Max. : 5.0000 | disgust :174 | Max. :2.537e-02 | Max. : 0.0761070 |
| | | | | (Other) :422 | | |
| metric | estimate | se |
|---|---|---|
| AFINN.mean | 0.52306 | 0.02521 |
| bing.pos.prop | 0.57853 | 0.00266 |
| bing.neg.prop | 0.42147 | 0.00266 |
| nrc.anger.prop | 0.06990 | 0.00266 |
| nrc.anticipation.prop | 0.10084 | 0.00266 |
| nrc.disgust.prop | 0.05475 | 0.00266 |
| nrc.fear.prop | 0.07410 | 0.00266 |
| nrc.joy.prop | 0.14054 | 0.00266 |
| nrc.negative.prop | 0.13285 | 0.00266 |
| nrc.positive.prop | 0.18019 | 0.00266 |