
Executive Summary

The data provided for the project is explored and analyzed. The data was first randomly sampled. It was then analyzed word by word (UniGrams) for occurrence distributions, and again in phrases of two or three words for occurrences and distributions. It was shown that the large majority of multi-word phrases occur only once in the data (up to 94% of three-word phrases), which makes them especially valuable for prediction. This report is by no means definitive of the content of the final product, but it provides exploratory details of the dataset provided for further analysis.

Analysis of sources

The data was provided from three sources: news articles, blogs, and tweets from twitter.com. In order to produce an algorithm efficient enough to run on a basic device (such as a mobile phone), the data was randomly sampled. Each file was sampled with a probability chosen to yield a similar number of lines from each source, so that no single source was weighted more heavily than the others. After cleaning, the news file provided 4,321 lines, the blogs file provided 5,339 lines, and the tweets file provided 5,427 lines. Each file was then “tokenized” into UniGrams (single words) to create a dictionary of unique words for each file. The number of occurrences, the percentage of the total words, and the cumulative percentage were calculated for each word.
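The exact cleaning and tokenization code is not reproduced in this report, but a minimal sketch of the sampling and UniGram counting step is shown below. The file name and sampling probability are illustrative assumptions, and a simple regular-expression tokenizer stands in for whichever tokenizer was actually used.

```r
# Sketch only: file name, sampling probability, and the regex tokenizer are assumptions.
set.seed(1234)

sample_file <- function(path, prob) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[as.logical(rbinom(length(lines), 1, prob))]   # keep each line with probability `prob`
}

unigram_table <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z]+"))  # lower-case and split on non-letters
  tokens <- tokens[tokens != ""]
  freq   <- sort(table(tokens), decreasing = TRUE)
  pct    <- 100 * as.integer(freq) / sum(freq)
  data.frame(word = names(freq), freq = as.integer(freq),
             percent = pct, cumulative_percent = cumsum(pct))
}

news_sample <- sample_file("en_US.news.txt", prob = 0.05)
news_freq   <- unigram_table(news_sample)
head(news_freq, 10)   # the ten most common words, as in the tables below
```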

The sample news file contained 119,609 total words, and 16,189 unique words. This means that approximately 86% of the words were used multiple times.

The sample blog file contained 135,136 total words, and 16,279 unique words. Close to the sample news file results, approximately 88% of words were reused.

The sample tweets file contained 66,062 total words, and only 9,538 unique words. Approximately 86% of words were reused, similar to both the news and blogs sample files.

The ten most common words in each file were very similar. Below are the ten most common words, their frequency in the text, the percentage of the total words, and the cumulative percentage for the news, blogs, and tweets sample files (in that order):

## [1] "The ten most common words in the news sample file:"
##       word freq   percent cumulative_percent
## 14450  the 6762 5.6534207           5.653421
## 14648   to 3200 2.6753840           8.328805
## 105      a 3047 2.5474672          10.876272
## 696    and 2925 2.4454682          13.321740
## 10054   of 2721 2.2749124          15.596652
## 7280    in 2245 1.8769491          17.473601
## 12590    s 1392 1.1637920          18.637394
## 14448 that 1286 1.0751699          19.712563
## 5799   for 1275 1.0659733          20.778537
## 12623 said 1048 0.8761882          21.654725
## [1] "The ten most common words in the blogs sample file:"
##       word freq  percent cumulative_percent
## 14513  the 6545 4.843269           4.843269
## 14733   to 3893 2.880802           7.724071
## 755    and 3685 2.726883          10.450953
## 7248     i 3168 2.344305          12.795258
## 60       a 3164 2.341345          15.136603
## 10214   of 3006 2.224426          17.361029
## 7389    in 2037 1.507370          18.868399
## 7810    it 1707 1.263172          20.131571
## 14506 that 1628 1.204712          21.336283
## 7781    is 1601 1.184732          22.521016
## [1] "The ten most common words in the tweets sample file:"
##      word freq  percent cumulative_percent
## 8357  the 2043 3.092549           3.092549
## 4130    i 2020 3.057734           6.150283
## 8509   to 1758 2.661137           8.811420
## 9479  you 1383 2.093488          10.904908
## 51      a 1277 1.933033          12.837940
## 4419   it  936 1.416851          14.254791
## 328   and  908 1.374466          15.629257
## 3229  for  823 1.245799          16.875057
## 4230   in  808 1.223093          18.098150
## 4407   is  769 1.164058          19.262208

As was to be expected, articles, conjunctions, and prepositions represented the largest number of words in the data.

The personal pronoun “I” is common in both the blogs and tweets sample files, but it is not present in the top ten words of the news sample file. This follows expectations, because news articles are very often written from a third-person perspective, whereas blogs and tweets are commonly written in the first person.

The personal pronoun “you” is also very common in tweets, evidence of the conversational nature of social media.

It is also worth noting the lone “s” in the top ten words of the news sample file. Because the tokenizer reads punctuation as separate UniGrams (single words), the “s” after an apostrophe is separated from the actual word. This indicates that news articles may utilize more contractions and possessives than other sources.

Attention should also be drawn to the percentage of the text these top ten words represent. The top ten words of the news sample file represent approximately 20% of the total words, the top ten words of the blogs sample file represent approximately 22% of the total words, and the top ten words of the tweets sample file represent approximately 19% of the total words.

The data presented are very skewed. The following statistics show the severity of the skewness for the news, blogs, and tweets sample files, respectively:

## [1] "Descrptives statistics of the news sample file:"
##       freq             percent        
##  Min.   :   1.000   Min.   :0.000836  
##  1st Qu.:   1.000   1st Qu.:0.000836  
##  Median :   1.000   Median :0.000836  
##  Mean   :   7.388   Mean   :0.006177  
##  3rd Qu.:   3.000   3rd Qu.:0.002508  
##  Max.   :6762.000   Max.   :5.653421
## [1] "Descrptives statistics of the blogs sample file:"
##       freq             percent        
##  Min.   :   1.000   Min.   :0.000740  
##  1st Qu.:   1.000   1st Qu.:0.000740  
##  Median :   1.000   Median :0.000740  
##  Mean   :   8.301   Mean   :0.006143  
##  3rd Qu.:   3.000   3rd Qu.:0.002220  
##  Max.   :6545.000   Max.   :4.843269
## [1] "Descrptives statistics of the tweets sample file:"
##       freq             percent        
##  Min.   :   1.000   Min.   :0.001514  
##  1st Qu.:   1.000   1st Qu.:0.001514  
##  Median :   1.000   Median :0.001514  
##  Mean   :   6.926   Mean   :0.010484  
##  3rd Qu.:   3.000   3rd Qu.:0.004541  
##  Max.   :2043.000   Max.   :3.092549

These descriptive statistics are similar: the mean number of occurrences per word for the news, blogs, and tweets sample files is approximately 7, 8, and 7, respectively. The median is 1 for all three.

The figure below shows the frequency of the number of occurrences for the news, blogs, and tweets sample files (e.g., 8,937 words occur only once in the news sample file). In order to get a better view of the majority of the data, the x-axis has been limited to 50 occurrences.
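A minimal ggplot2 sketch of such a histogram is shown below; it assumes `news_freq` is the UniGram frequency table built in the sampling sketch above.

```r
library(ggplot2)

# Distribution of word occurrence counts, zoomed in on the bulk of the data
ggplot(news_freq, aes(x = freq)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 50) +
  labs(x = "Number of occurrences of a word",
       y = "Number of words",
       title = "News sample: frequency of word occurrences")
```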

Analysis of corpus

The files were compiled to be analyzed as a single corpus (a collection of text documents, not to be confused with the R programming objects VCorpus or DCorpus, which will be used later). The compiled file had a total of 15,087 lines of data. Again, the data was tokenized into UniGrams, and the number of occurrences, percentage of the total words, and cumulative percentage were calculated for each word.
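A minimal sketch of compiling the sampled files into a single tm corpus is shown below; the object names carry over from the earlier sampling sketch, and the cleaning steps are assumptions about the preprocessing used rather than the project's exact pipeline.

```r
library(tm)

# Combine the three sampled files (objects assumed from the sampling step)
all_lines <- c(news_sample, blogs_sample, tweets_sample)

# Build a volatile corpus and apply basic cleaning
corpus <- VCorpus(VectorSource(all_lines))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
```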

The compiled file contained 320,807 total words, 28,415 of which were unique. In the compiled data, approximately 91% of words were reused.

As expected, the top ten words are very similar to those of the individual data sets:

##       word  freq  percent cumulative_percent
## 25233  the 15350 4.784808           4.784808
## 25603   to  8851 2.758980           7.543788
## 1287   and  7518 2.343465           9.887253
## 159      a  7488 2.334114          12.221367
## 17881   of  6486 2.021776          14.243143
## 12597    i  5858 1.826020          16.069163
## 12827   in  5090 1.586624          17.655787
## 13485   it  3642 1.135262          18.791049
## 25221 that  3489 1.087570          19.878619
## 10066  for  3409 1.062633          20.941251

Again, articles and conjunctions represented a large portion of the word usage. “I” is still in the top ten list, influenced by the high usage in both the blogs and tweets files.

The data still appear heavily skewed. The top ten words represent approximately 21% of the total words in the compiled file.

## [1] "Descrptives statistics of the compiled files:"
##       freq             percent        
##  Min.   :    1.00   Min.   :0.000312  
##  1st Qu.:    1.00   1st Qu.:0.000312  
##  Median :    1.00   Median :0.000312  
##  Mean   :   11.29   Mean   :0.003519  
##  3rd Qu.:    3.00   3rd Qu.:0.000935  
##  Max.   :15350.00   Max.   :4.784808

The descriptive statistics now indicate even more severe skewness. While the median (1) and 3rd quartile (3) have not changed, the maximum number of occurrences comes from “the,” which repeats 15,350 times. The next highest count, for “to,” is 8,851 occurrences, only a little more than half of the maximum.

Another histogram shows the frequency of the number of word occurrences:

Prediction algorithms, however, typically make use of Bi-, Tri-, and QuadGrams, which are sequences of 2, 3, and 4 words, respectively, as opposed to UniGrams, which are single words.

For example, the line “Tyger Tyger, burning bright” has four UniGrams (three of which are unique): “Tyger”, “Tyger”, “burning”, and “bright”. There are three BiGrams: “Tyger Tyger”, “Tyger, burning”, and “burning bright” (although the tokenizer may regard the comma as a separate NGram), and there are two TriGrams: “Tyger Tyger, burning” and “Tyger, burning bright”.

NGrams are analyzed probabilistically, so it is helpful to explore the Bi- and TriGram tokenized data.
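Since the NLP package is already loaded, its ngrams() helper offers a compact way to build Bi- and TriGrams from a vector of tokens. The sketch below is illustrative only, using the “Tyger” line from the example above; the frequency-table step mirrors the UniGram analysis.

```r
library(NLP)

tokens <- c("tyger", "tyger", "burning", "bright")   # illustrative tokens from the example line

bigrams  <- vapply(ngrams(tokens, 2L), paste, character(1), collapse = " ")
trigrams <- vapply(ngrams(tokens, 3L), paste, character(1), collapse = " ")

bigrams    # "tyger tyger"  "tyger burning"  "burning bright"
trigrams   # "tyger tyger burning"  "tyger burning bright"

# Frequencies are then tabulated exactly as for UniGrams
sort(table(bigrams), decreasing = TRUE)
```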

There are 305,727 total BiGrams and 166,327 unique BiGrams. Approximately 46% of the BiGrams are reused.

There are 290,855 total TriGrams and 256,778 unique TriGrams. Approximately 12% are reused.

As the size of the NGram increases, more unique NGrams become available while there are fewer repeats.

Below are the top ten Bi- and TriGrams, respectively, along with their corresponding frequencies, percentages of the total NGrams, and cumulative percentages:

## [1] "The top ten BiGrams are:"
##           word freq   percent cumulative_percent
## 98852   of the 1377 0.4504018          0.4504018
## 70560   in the 1283 0.4196554          0.8700573
## 148421  to the  687 0.2247103          1.0947676
## 53425  for the  619 0.2024682          1.2972358
## 100717  on the  615 0.2011599          1.4983956
## 74403     it s  602 0.1969077          1.6953033
## 68007      i m  546 0.1785907          1.8738940
## 146765   to be  523 0.1710677          2.0449617
## 17097   at the  439 0.1435922          2.1885538
## 12582  and the  397 0.1298544          2.3184083
## [1] "The top ten TriGrams are:"
##                  word freq    percent cumulative_percent
## 98561         i don t  147 0.05054065         0.05054065
## 4465         a lot of  126 0.04332055         0.09386120
## 153757     one of the  100 0.03438139         0.12824260
## 198647 thanks for the   71 0.02441079         0.15265338
## 83252     going to be   62 0.02131646         0.17396985
## 99678         i m not   62 0.02131646         0.19528631
## 113560       it was a   62 0.02131646         0.21660277
## 204692     the end of   57 0.01959739         0.23620017
## 113049         it s a   56 0.01925358         0.25545375
## 100682      i want to   55 0.01890977         0.27436351

No Bi- or TriGram represents even half a percent of the entire collection. The top ten BiGrams together represent only approximately 2% of all BiGrams, and the top ten TriGrams do not represent even half a percent of all TriGrams.

The “stranded” letters (e.g., the “m” from “I’m” or the “s” from a possessive or contraction) are the characters left behind after an apostrophe is split off. These will be helpful in the predictive algorithm, as contractions are clearly some of the most common phrases. The predictive algorithm will need to account for these apostrophes removed by the tokenizer and, of course, offer the user the correct punctuation.
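One possible way to handle the stranded letters, sketched below as an assumption rather than the project's actual approach, is to reattach the apostrophe before a prediction is shown to the user.

```r
# Sketch (assumption): reattach apostrophes to stranded contraction/possessive letters
restore_apostrophes <- function(phrase) {
  fixed <- gsub("\\b([A-Za-z]+) (s|m|t|d|re|ve|ll)\\b", "\\1'\\2", phrase)
  sub("^i\\b", "I", fixed)                      # capitalise a leading "i"
}

restore_apostrophes("i m")      # "I'm"
restore_apostrophes("it s a")   # "it's a"
```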

The following show the descriptive statistics of both NGram sets:

## [1] "Descrptives statistics of the BiGram collection:"
##       freq             percent         
##  Min.   :   1.000   Min.   :0.0003271  
##  1st Qu.:   1.000   1st Qu.:0.0003271  
##  Median :   1.000   Median :0.0003271  
##  Mean   :   1.838   Mean   :0.0006012  
##  3rd Qu.:   1.000   3rd Qu.:0.0003271  
##  Max.   :1377.000   Max.   :0.4504018
## [1] "Descrptives statistics of the TriGram:"
##       freq            percent         
##  Min.   :  1.000   Min.   :0.0003438  
##  1st Qu.:  1.000   1st Qu.:0.0003438  
##  Median :  1.000   Median :0.0003438  
##  Mean   :  1.133   Mean   :0.0003894  
##  3rd Qu.:  1.000   3rd Qu.:0.0003438  
##  Max.   :147.000   Max.   :0.0505406

The descriptive statistics show a massive shift in skewness. The means of both are less than 2. The maximum BiGram count is 1,377 (recall that the maximum for the compiled UniGram file was 15,350), and the maximum TriGram count is only 147.

The figures below illustrate the distributions. Notice that while the overall shape of the distribution is similar, the x-axis has been further condensed to zoom in on the majority of the data.

Of course, a significant number of NGrams occur only once in both sets. In fact, 136,471 BiGrams occur only once (82% of the BiGrams), as do 214,392 TriGrams (94% of the TriGrams).

These NGrams that occur only once are the most valuable for the data set. NGrams whose beginning can lead to many different words provide a lower probability of offering the correct prediction. For example, when “a” is typed into the application, any singular, indefinite noun could follow. However, when one types “macaroni and”, there are far fewer options, with “cheese” being a very likely option in American English.
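A minimal sketch of how the TriGram table could drive the eventual prediction is shown below; `trigram_freq` is assumed to be a data frame with a `word` column holding the three-word phrase and a `freq` column holding its count, as in the tables above.

```r
# Sketch (assumption): rank candidate next words given the user's last two words
predict_next <- function(last_two, trigram_freq, n = 3) {
  hits <- trigram_freq[startsWith(trigram_freq$word, paste0(last_two, " ")), ]
  hits <- hits[order(-hits$freq), ]
  head(vapply(strsplit(hits$word, " "), function(w) w[3], character(1)), n)
}

# predict_next("macaroni and", trigram_freq)   # "cheese" would be expected near the top
```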

Conclusion

The project goal is to provide increased utility to the user when typing by predicting the next word they will type, allowing them to quickly select from a list rather than type the entire word. This is particularly useful on devices with cumbersome keypads, e.g., small mobile devices. The data was provided from news articles and blog posts as well as tweets. The sources were randomly sampled and then tokenized into Uni-, Bi-, and TriGrams. The tokenized data was analyzed to determine the frequency of occurrences and the distribution of those occurrences. It was shown that Bi- and TriGrams are far more often unique than single words, which should allow the prediction algorithm to provide larger coverage of the dataset.