This report explores and analyzes the data provided for the project. The data was first randomly sampled, then analyzed word by word (UniGrams) for occurrence distributions, and finally analyzed as phrases of two or three words (BiGrams and TriGrams) for occurrences and distributions. Most phrases turned out to occur only once in the sample (up to 94% of the three-word phrases). This report is by no means definitive of the content of the final product, but it provides exploratory details of the dataset for further analysis.
The data was provided from three sources: news articles, blogs, and tweets from twitter.com. In order to provide an efficient algorithm that could be operated on a basic device (such as a mobile device), the data was randomly sampled. Each file was sampled with a probability chosen to yield a similar number of lines, so that no single source is weighted more heavily than the others. After cleaning the data, the news file provided 4,321 lines, the blogs file provided 5,339 lines, and the tweets file provided 5,427 lines. Each file was then “tokenized” into UniGrams (single words) to create a dictionary of unique words for each file. The number of occurrences, percentage of the total words, and cumulative percentage were calculated for each word.
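The sampling and counting steps described above can be sketched in base R. This is only an illustration under stated assumptions: the helper names `sample_lines` and `word_table`, the toy sentences, and the simple "split on anything that is not a letter" rule are hypothetical and are not the tokenizer used for this report.

```r
# Hypothetical helpers for illustration; base R only.
set.seed(1)

# Keep each line independently with probability p (a Bernoulli sample).
sample_lines <- function(lines, p) {
  lines[rbinom(length(lines), size = 1, prob = p) == 1]
}

# Lowercase the text, split on anything that is not a letter,
# and count how often each word occurs.
word_table <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by the split
  counts <- sort(table(tokens), decreasing = TRUE)
  out <- data.frame(word = names(counts),
                    freq = as.integer(counts),
                    stringsAsFactors = FALSE)
  out$percent <- 100 * out$freq / sum(out$freq)
  out$cumulative_percent <- cumsum(out$percent)
  out
}

# Made-up lines standing in for a sampled file.
toy <- c("The cat sat on the mat.", "The dog's bone", "A cat and a dog")
wt <- word_table(toy)
head(wt, 3)
```

Because this toy tokenizer splits at the apostrophe, "dog's" yields a separate "s" entry, the same kind of stranded letter that appears in the tables of this report.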
The sample news file contained 119,609 total words and 16,189 unique words. This means that approximately 86% of the word occurrences were repeats of a word already used.
The sample blog file contained 135,136 total words and 16,279 unique words. Close to the sample news file results, approximately 88% of word occurrences were repeats.
The sample tweets file contained 66,062 total words and only 9,538 unique words. Approximately 86% of word occurrences were repeats, similar to both the news and blogs sample files.
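The reuse percentages follow directly from the reported totals. A short sketch of the arithmetic, using the counts quoted above (the variable names are illustrative):

```r
# Total and unique word counts as reported for the three sample files.
total_words  <- c(news = 119609, blogs = 135136, tweets = 66062)
unique_words <- c(news = 16189,  blogs = 16279,  tweets = 9538)

# Share of word occurrences that repeat a word already seen, in percent.
reuse_pct <- 100 * (1 - unique_words / total_words)
round(reuse_pct)

# Average number of occurrences per unique word.
mean_freq <- total_words / unique_words
round(mean_freq, 3)
```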
The ten most common words in each file were very similar. Below are the ten most common words, their frequency in the text, the percentage of the total words, and the cumulative percentage for the news, blogs, and tweets sample files (in that order):
## [1] "The ten most common words in the news sample file:"
## word freq percent cumulative_percent
## 14450 the 6762 5.6534207 5.653421
## 14648 to 3200 2.6753840 8.328805
## 105 a 3047 2.5474672 10.876272
## 696 and 2925 2.4454682 13.321740
## 10054 of 2721 2.2749124 15.596652
## 7280 in 2245 1.8769491 17.473601
## 12590 s 1392 1.1637920 18.637394
## 14448 that 1286 1.0751699 19.712563
## 5799 for 1275 1.0659733 20.778537
## 12623 said 1048 0.8761882 21.654725
## [1] "The ten most common words in the blogs sample file:"
## word freq percent cumulative_percent
## 14513 the 6545 4.843269 4.843269
## 14733 to 3893 2.880802 7.724071
## 755 and 3685 2.726883 10.450953
## 7248 i 3168 2.344305 12.795258
## 60 a 3164 2.341345 15.136603
## 10214 of 3006 2.224426 17.361029
## 7389 in 2037 1.507370 18.868399
## 7810 it 1707 1.263172 20.131571
## 14506 that 1628 1.204712 21.336283
## 7781 is 1601 1.184732 22.521016
## [1] "The ten most common words in the tweets sample file:"
## word freq percent cumulative_percent
## 8357 the 2043 3.092549 3.092549
## 4130 i 2020 3.057734 6.150283
## 8509 to 1758 2.661137 8.811420
## 9479 you 1383 2.093488 10.904908
## 51 a 1277 1.933033 12.837940
## 4419 it 936 1.416851 14.254791
## 328 and 908 1.374466 15.629257
## 3229 for 823 1.245799 16.875057
## 4230 in 808 1.223093 18.098150
## 4407 is 769 1.164058 19.262208
As expected, function words such as articles, conjunctions, and prepositions were the most frequent words in the data.
The personal pronoun “I” is common in both the blogs and tweets sample files, but is not present in the top ten words of the news sample file. This follows expectations because news articles are very often written from a third person perspective, whereas blogs and tweets are commonly written in first person.
The personal pronoun “you” is also very common in tweets, evidence of the conversational nature of social media.
It is also worth noting the lone “s” in the top ten words of the news sample file. Because the tokenizer splits words at punctuation, the “s” after an apostrophe is separated from the word it belongs to and counted as its own UniGram. This suggests that news articles may use more contractions and possessives than the other sources.
Attention should also be drawn to the share of the text that these top ten words represent: approximately 22% of the total words in the news sample file, 23% in the blogs sample file, and 19% in the tweets sample file.
The data presented are very skewed. The following statistics show the severity of the skewness for the news, blogs, and tweets sample files, respectively:
## [1] "Descriptive statistics of the news sample file:"
## freq percent
## Min. : 1.000 Min. :0.000836
## 1st Qu.: 1.000 1st Qu.:0.000836
## Median : 1.000 Median :0.000836
## Mean : 7.388 Mean :0.006177
## 3rd Qu.: 3.000 3rd Qu.:0.002508
## Max. :6762.000 Max. :5.653421
## [1] "Descriptive statistics of the blogs sample file:"
## freq percent
## Min. : 1.000 Min. :0.000740
## 1st Qu.: 1.000 1st Qu.:0.000740
## Median : 1.000 Median :0.000740
## Mean : 8.301 Mean :0.006143
## 3rd Qu.: 3.000 3rd Qu.:0.002220
## Max. :6545.000 Max. :4.843269
## [1] "Descriptive statistics of the tweets sample file:"
## freq percent
## Min. : 1.000 Min. :0.001514
## 1st Qu.: 1.000 1st Qu.:0.001514
## Median : 1.000 Median :0.001514
## Mean : 6.926 Mean :0.010484
## 3rd Qu.: 3.000 3rd Qu.:0.004541
## Max. :2043.000 Max. :3.092549
These descriptive statistics are similar: the mean occurrences for the news, blogs, and tweets sample files are approximately 7, 8, and 7 occurrences, respectively. The median is 1 for all three.
The figure below shows, for each of the news, blogs, and tweets sample files, how many words occur a given number of times (e.g. 8937 words occur only once in the news sample file). To give a better view of the bulk of the data, the x-axis has been limited to 50 occurrences.
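The quantity plotted in such a figure is a "frequency of frequencies": for each occurrence count, the number of words that have exactly that count. A minimal sketch, assuming a made-up named vector `freq` in place of the real word counts:

```r
# Made-up occurrence counts for seven words (not taken from the data).
freq <- c(the = 5, to = 3, a = 3, said = 1, city = 1, rain = 1, vote = 1)

# How many words share each occurrence count.
freq_of_freq <- table(freq)
freq_of_freq

# Keep only occurrence counts up to 50, mirroring the x-axis limit.
limited <- freq_of_freq[as.integer(names(freq_of_freq)) <= 50]
limited
```

The resulting table can be passed to a plotting function such as `barplot()` to draw the histogram.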
The files were compiled to be analyzed as a single corpus (a collection of text documents, not to be confused with the R programming objects VCorpus or DCorpus, which will be used later). The compiled file had a total of 15,087 lines of data. Again, the data was tokenized into UniGrams and the number of occurrences, percentage of the total words, and cumulative percentage were calculated for each word.
The compiled file contained 320,807 words, 28,415 of which were unique. In the compiled data approximately 91% of word occurrences were repeats.
Logically, the top ten words were very similar to those of the individual data sets:
## word freq percent cumulative_percent
## 25233 the 15350 4.784808 4.784808
## 25603 to 8851 2.758980 7.543788
## 1287 and 7518 2.343465 9.887253
## 159 a 7488 2.334114 12.221367
## 17881 of 6486 2.021776 14.243143
## 12597 i 5858 1.826020 16.069163
## 12827 in 5090 1.586624 17.655787
## 13485 it 3642 1.135262 18.791049
## 25221 that 3489 1.087570 19.878619
## 10066 for 3409 1.062633 20.941251
Again, articles and conjunctions represented a large portion of the word usage. “I” is still in the top ten list, influenced by the high usage in both the blogs and tweets files.
The data still appear heavily skewed: the top ten words account for approximately 21% of all word occurrences in the compiled file.
## [1] "Descriptive statistics of the compiled files:"
## freq percent
## Min. : 1.00 Min. :0.000312
## 1st Qu.: 1.00 1st Qu.:0.000312
## Median : 1.00 Median :0.000312
## Mean : 11.29 Mean :0.003519
## 3rd Qu.: 3.00 3rd Qu.:0.000935
## Max. :15350.00 Max. :4.784808
The descriptive statistics now indicate even heavier skewness. While the median (1) and 3rd quartile (3) have not changed, the mean has risen to about 11 and the maximum, from “the”, is 15,350 occurrences. The next most frequent word, “to”, occurs 8,851 times, only about 58% of the maximum.
Another histogram shows the frequency of the number of word occurrences:
Prediction algorithms, however, typically make use of Bi-, Tri-, and QuadGrams, which are sequences of words (2, 3, and 4, respectively), as opposed to UniGrams, which are single words.
For example, the line “Tyger Tyger, burning bright” has four UniGrams (three of which are distinct): “Tyger”, “Tyger”, “burning”, and “bright”. There are three BiGrams: “Tyger Tyger”, “Tyger, burning”, and “burning bright” (although the tokenizer may regard the comma as a separate NGram), and there are two TriGrams: “Tyger Tyger, burning” and “Tyger, burning bright”.
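The example can be reproduced with a few lines of base R. This is a sketch only: `make_ngrams` is a hypothetical helper, and punctuation is simply stripped before splitting on whitespace, which may differ from how the tokenizer used in this report treats the comma.

```r
# Hypothetical helper: build every run of n consecutive tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

line <- "Tyger Tyger, burning bright"

# Assumption: drop punctuation, then split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", line), "\\s+")[[1]]

unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

bigrams
trigrams
```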
NGrams are analyzed probabilistically, so it is helpful to explore the Bi- and TriGram tokenized data.
There are 305,727 total BiGrams and 166,327 unique BiGrams. Approximately 46% of the BiGrams are reused.
There are 290,855 total TriGrams and 256,778 unique TriGrams. Approximately 12% are reused.
As the size of the NGram increases, the number of unique NGrams grows while the number of repeats falls.
Below are the top ten Bi- and TriGrams, respectively, and their corresponding frequencies, percentages of the total NGrams and the cumulative percentage:
## [1] "The top ten BiGrams are:"
## word freq percent cumulative_percent
## 98852 of the 1377 0.4504018 0.4504018
## 70560 in the 1283 0.4196554 0.8700573
## 148421 to the 687 0.2247103 1.0947676
## 53425 for the 619 0.2024682 1.2972358
## 100717 on the 615 0.2011599 1.4983956
## 74403 it s 602 0.1969077 1.6953033
## 68007 i m 546 0.1785907 1.8738940
## 146765 to be 523 0.1710677 2.0449617
## 17097 at the 439 0.1435922 2.1885538
## 12582 and the 397 0.1298544 2.3184083
## [1] "The top ten TriGrams are:"
## word freq percent cumulative_percent
## 98561 i don t 147 0.05054065 0.05054065
## 4465 a lot of 126 0.04332055 0.09386120
## 153757 one of the 100 0.03438139 0.12824260
## 198647 thanks for the 71 0.02441079 0.15265338
## 83252 going to be 62 0.02131646 0.17396985
## 99678 i m not 62 0.02131646 0.19528631
## 113560 it was a 62 0.02131646 0.21660277
## 204692 the end of 57 0.01959739 0.23620017
## 113049 it s a 56 0.01925358 0.25545375
## 100682 i want to 55 0.01890977 0.27436351
No Bi- or TriGram represents even a half of a percent of the entire collection. The top ten BiGrams only represent approximately 2% of all of the BiGrams and the top ten TriGrams don’t even represent half of a percent of all of the TriGrams.
The “stranded” letters, such as “m” (from “I’m”) or “s” (from a possessive or contraction), are the characters that follow an apostrophe. These will be helpful in the predictive algorithm, as contractions are clearly among the most common phrases. The predictive algorithm will need to restore the apostrophes removed by the tokenizer so that the user is offered correctly punctuated text.
The following show the descriptive statistics of both NGram sets:
## [1] "Descriptive statistics of the BiGram collection:"
## freq percent
## Min. : 1.000 Min. :0.0003271
## 1st Qu.: 1.000 1st Qu.:0.0003271
## Median : 1.000 Median :0.0003271
## Mean : 1.838 Mean :0.0006012
## 3rd Qu.: 1.000 3rd Qu.:0.0003271
## Max. :1377.000 Max. :0.4504018
## [1] "Descriptive statistics of the TriGram:"
## freq percent
## Min. : 1.000 Min. :0.0003438
## 1st Qu.: 1.000 1st Qu.:0.0003438
## Median : 1.000 Median :0.0003438
## Mean : 1.133 Mean :0.0003894
## 3rd Qu.: 1.000 3rd Qu.:0.0003438
## Max. :147.000 Max. :0.0505406
The descriptive statistics show a marked change in the distribution. The mean of both is less than 2, and the maximum counts are far lower: the most frequent BiGram occurs 1,377 times (recall that the maximum for the compiled UniGrams was 15,350) and the most frequent TriGram only 147 times.
The figures below illustrate the distribution. Notice that while the overall shape of the distribution is similar, the x-axis has been further condensed to zoom in on the majority of the data.
Of course, a significant number of NGrams occur only once in both sets. In fact, 136,471 BiGrams occur only once (82% of the BiGrams) and 214,392 TriGrams occur only once (94% of the TriGrams).
NGrams that occur only once are the most valuable for the data set, because a context that can be followed by many different words gives a lower probability of offering the correct prediction. For example, when “a” is typed into the application, almost any singular noun could follow. However, when one types “macaroni and”, there are fewer options, with “cheese” being a very likely option in American English.
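The intuition can be made concrete by estimating the probability of each next word as count(previous word, next word) / count(previous word). The sketch below is an assumption for illustration only: the toy lines are made up, `predict_next` is a hypothetical helper, and the report does not commit to this estimator for the final product.

```r
# Made-up toy corpus, one phrase per line.
lines <- c("macaroni and cheese", "macaroni and cheese", "macaroni and salad",
           "a dog", "a cat", "a bird", "a car")

# Collect every (previous word, next word) pair.
pairs <- do.call(rbind, lapply(strsplit(lines, " "), function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Relative frequency of each word that follows prev_word, most likely first.
predict_next <- function(prev_word, pairs) {
  following <- pairs$nxt[pairs$prev == prev_word]
  if (length(following) == 0) return(numeric(0))
  sort(table(following) / length(following), decreasing = TRUE)
}

p_and <- predict_next("and", pairs)  # few candidates, one dominant
p_a   <- predict_next("a", pairs)    # many candidates, none dominant
p_and
p_a
```

In this toy data the context “and” concentrates its probability on one word, while “a” spreads it evenly over four, which is why the more specific context gives the better suggestion.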
The project goal is to provide increased utility to the user when typing by predicting the next word they will type, allowing them to quickly select from a list rather than type the entire word. This is particularly useful on devices with cumbersome keypads, e.g. small mobile devices. The data was provided from news and blog sentences as well as tweets. The sources were randomly sampled and tokenized into Uni-, Bi-, and TriGrams. The tokenized data was then analyzed to determine the frequency of occurrences and the distribution of those occurrences. It was shown that BiGrams and TriGrams yield far more unique entries than UniGrams, most of which occur only once, and so provide more specific context for prediction.