In this report, I download the data and create a basic report of summary statistics about the data sets. I point out some interesting findings and outline my plan and approach for creating a prediction algorithm and Shiny app.
setwd("C:/Users/Public/Documents/Documents/Data Science/Capstone/files/")
set.seed(1234)
sampleSize <- 0.05 #Sample of 5%
# Read in blog text data and sample it
con <- file("en_US.blogs.txt","rb")
blog <- readLines(con, encoding="UTF-8")
close(con)
blogsubset <- sample(blog,length(blog)*sampleSize)
# Read in news text data and sample it
con1 <- file("en_US.news.txt","rb")
news <- readLines(con1,encoding="UTF-8")
close(con1)
newssubset <- sample(news, length(news)*sampleSize)
# Read in twitter data and sample it
con2 <- file("en_US.twitter.txt","rb")
twitter <- suppressWarnings(readLines(con2,encoding="UTF-8"))
close(con2)
twittersubset <- sample(twitter, length(twitter)*sampleSize)
Here we consolidate the sampled text files and create a corpus. Then we tokenize the corpus into unigrams, bigrams, trigrams, and quadrigrams (4-grams).
# Create a corpus of the combined sampled data
# (note: paste() combines the three samples element-wise, recycling the shorter vectors)
mycorpus <- corpus(paste(blogsubset,newssubset,twittersubset))
# Create tokenization (unigram, bigram, trigram and 4-gram )
unigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 1)
bigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 2)
trigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 3)
quadgram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 4)
As shown below, the sampled corpus consists of about 659,000 sentences comprising roughly 10.2 million words.
However, the corpus is built from only about 142 thousand word types (i.e. unique unigrams), roughly 56 thousand of which occur only once or twice. So we really have a vocabulary of about 86 thousand words supporting the corpus. Moreover, the top 140 unique words (in a frequency-sorted dictionary) account for about 50% of the total word volume in the corpus, and about 8,500 unique words account for roughly 90%. This suggests that we should consider pruning low-frequency unigrams to make our prediction algorithm smaller and faster.
# How many sentences in the corpus sampled at 5%? About 659,000 sentences.
sum(nsentence(mycorpus))
## [1] 659077
# How many total words (i.e. N) in the corpus? About 10.2 million.
sum(ntoken(unigram))
## [1] 10230522
# How many word types (i.e. unigram features)? The vocabulary consists of about 142,000 words.
unigram.dfm <- dfm(unigram) #Note that we are not removing stopwords, nor stemming
##
## ... indexing documents: 118,007 documents
## ... indexing features: 141,810 feature types
## ... created a 118007 x 141810 sparse dfm
## ... complete.
## Elapsed time: 28.8 seconds.
nfeature(unigram.dfm)
## [1] 141810
# How many word types only show up once and twice respectively in the Corpus?
topuni <- topfeatures(unigram.dfm,n=nfeature(unigram.dfm))
unigram.feat <- data.frame(topuni)
unigram.feat <- cbind(names(topuni),unigram.feat)
names(unigram.feat) <- c("feature","frequency")
# 25,828 features only show up once in the corpus
nrow(unigram.feat[unigram.feat$frequency == 1,])
## [1] 25828
# 29,752 features only show up twice in the corpus
nrow(unigram.feat[unigram.feat$frequency == 2,])
## [1] 29752
# How many unique words are needed in a frequency sorted dictionary to cover 50% (or 5.1 million) of all word instances in the corpus (which has 10.2 million words)? Answer: only about 140 words are needed!
sum(unigram.feat$frequency[1:140])
## [1] 5110234
# How many unique words are needed in a frequency sorted dictionary to cover 90% (or 9.1 million) of all word instances in the corpus (which has 10.2 million words)? Answer: only about 8,500 words are needed!
sum(unigram.feat$frequency[1:8500])
## [1] 9231933
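For reference, the two coverage checks above generalize to any target percentage. Below is a minimal helper (a sketch only; the name words_for_coverage is introduced here for illustration, and it assumes unigram.feat is sorted by descending frequency, as produced by topfeatures()):
words_for_coverage <- function(feat, target = 0.5) {
  covered <- cumsum(feat$frequency)                    # word volume covered by the top-k features
  min(which(covered >= target * sum(feat$frequency)))  # smallest k reaching the target share
}
# words_for_coverage(unigram.feat, 0.5)  # roughly 140 words
# words_for_coverage(unigram.feat, 0.9)  # on the order of 8,000-8,500 words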
# What are the top 100 unigram features in terms of frequency? Note that many are stopwords.
topfeatures(unigram.dfm,n=100)
## the to and a of in i that for
## 523859 283048 269220 250140 222117 176461 154846 111448 107785
## is it on you with was at this as
## 107534 93364 81476 77163 75824 69610 57384 56289 54654
## be my have but are he not we from
## 54307 53923 53807 50864 48805 48297 42399 42307 41807
## so said his by they or will all me
## 36188 35453 34713 34709 34686 34141 32486 32359 31547
## an about one out has up if when who
## 31529 30401 29970 28781 28430 28025 27021 26939 26911
## just what had your more like can her their
## 26860 26092 25627 25291 25267 24581 24563 24090 22885
## do it's she there time would some been were
## 22312 22304 22093 21958 21319 21176 20778 20271 20179
## get no new our i'm them which how now
## 19899 18986 18782 18543 17421 17255 16995 16479 16347
## people also into good than after don't know first
## 15659 15052 14899 14804 14769 14553 14383 14300 14242
## other day back because then him over its only
## 14176 13913 13526 13340 13198 13161 13089 12980 12712
## last two see make go love could think much
## 12689 12671 12668 12577 12333 12199 12094 11956 11859
## year
## 11762
# Bar chart of the top 50 unigram features by frequency
a <- ggplot(unigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Unigrams by Frequency")
There are about 1.7 million distinct bigrams in the sampled corpus. This represents only about 0.02% of the total universe of possible bigrams, given a vocabulary of 86 thousand words (i.e. 1.7 million divided by 86,000^2). In addition, about 950 thousand bigrams (roughly 55% of the total) show up only once or twice in the corpus.
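As a quick back-of-the-envelope check on that sparsity figure (illustrative only; the 86,000-word vocabulary is the rounded estimate from the unigram section above):
# observed distinct bigrams divided by all possible word pairs over an ~86,000-word vocabulary
1720704 / 86000^2 # roughly 0.00023, i.e. about 0.02%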
# How many bigrams are there? The corpus contains about 1.7 million distinct bigrams.
bigram.dfm <- dfm(bigram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 1,720,704 feature types
## ... created a 118007 x 1720704 sparse dfm
## ... complete.
## Elapsed time: 15.6 seconds.
nfeature(bigram.dfm)
## [1] 1720704
# How many bigrams only show up once and twice respectively in the Corpus?
topbi <- topfeatures(bigram.dfm,n=nfeature(bigram.dfm))
bigram.feat <- data.frame(topbi)
bigram.feat <- cbind(names(topbi),bigram.feat)
names(bigram.feat) <- c("feature","frequency")
# 475,163 bigrams only show up once in the corpus
nrow(bigram.feat[bigram.feat$frequency == 1,])
## [1] 475163
# 475,533 bigrams only show up twice in the corpus
nrow(bigram.feat[bigram.feat$frequency == 2,])
## [1] 475533
# What are the top 100 bigram features in terms of frequency? Note that many are built from stopwords.
topfeatures(bigram.dfm,n=100)
## of_the in_the to_the on_the for_the to_be at_the
## 49104 45878 23454 20939 19501 16602 15040
## and_the in_a with_the it_was is_a from_the for_a
## 14666 13325 12068 10603 10534 10294 9460
## with_a it_is i_was and_i of_a i_have will_be
## 9102 8962 8884 8791 8685 8610 8323
## as_a one_of i_am is_the going_to that_the by_the
## 7964 7848 7596 7380 7339 7229 7161
## if_you have_a to_get and_a this_is was_a want_to
## 6793 6699 6662 6495 6426 6283 6226
## the_first that_i out_of to_a have_to to_do i_think
## 6192 6156 5989 5949 5934 5875 5595
## on_a have_been the_same you_can he_was to_make a_lot
## 5577 5568 5442 5217 5180 5175 5171
## i_don't as_the all_the has_been a_few but_i to_see
## 5164 5109 5028 4994 4982 4964 4893
## more_than about_the of_my into_the be_a he_said a_little
## 4680 4596 4592 4573 4562 4559 4521
## would_be i_had when_i there_is to_have need_to in_my
## 4412 4377 4186 4115 4057 3977 3919
## they_are over_the you_have had_a to_go there_are you_are
## 3910 3884 3880 3836 3781 3757 3686
## i_can i_love that_is is_not i_know a_good the_most
## 3660 3659 3641 3637 3609 3566 3560
## we_are but_the part_of and_then lot_of so_i the_world
## 3549 3534 3482 3447 3447 3414 3412
## the_best we_have the_new was_the a_new some_of of_his
## 3400 3394 3373 3369 3361 3358 3292
## at_least can_be
## 3237 3217
# Bar chart of the top 50 bigram features by frequency
a <- ggplot(bigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Bigrams by Frequency")
There are about 3.8 million distinct trigrams in the sampled corpus. Of these, about 2.4 million (or 63% of the total) show up only once or twice. Thus, we have only about 1.4 million trigrams on which we can reasonably base predictions.
# How many trigrams are there? The corpus contains about 3.8 million distinct trigrams.
trigram.dfm <- dfm(trigram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 3,783,631 feature types
## ... created a 118007 x 3783631 sparse dfm
## ... complete.
## Elapsed time: 15.74 seconds.
nfeature(trigram.dfm)
## [1] 3783631
# How many trigrams only show up once and twice respectively in the Corpus?
toptri <- topfeatures(trigram.dfm,n=nfeature(trigram.dfm))
trigram.feat <- data.frame(toptri)
trigram.feat <- cbind(names(toptri),trigram.feat)
names(trigram.feat) <- c("feature","frequency")
# 1.2 million trigrams only show up once in the corpus
nrow(trigram.feat[trigram.feat$frequency == 1,])
## [1] 1217173
# 1.2 million trigrams only show up twice in the corpus
nrow(trigram.feat[trigram.feat$frequency == 2,])
## [1] 1168639
# What are the top 100 trigram features in terms of frequency? Note that many are built from stopwords.
topfeatures(trigram.dfm,n=100)
## one_of_the a_lot_of to_be_a
## 3858 3325 1805
## out_of_the as_well_as some_of_the
## 1718 1663 1660
## going_to_be the_end_of it_was_a
## 1603 1590 1518
## be_able_to part_of_the i_want_to
## 1424 1370 1362
## thanks_for_the the_rest_of this_is_a
## 1301 1232 1155
## a_couple_of i_have_a i_don't_know
## 1104 1080 1068
## the_fact_that there_is_a end_of_the
## 1068 1045 1031
## i_have_to you_have_to the_first_time
## 1011 1008 1008
## most_of_the in_the_first you_want_to
## 953 949 942
## this_is_the it_would_be according_to_the
## 941 909 908
## in_the_world it_is_a there_is_no
## 895 895 874
## is_going_to the_united_states at_the_end
## 872 871 860
## is_one_of one_of_my i'm_going_to
## 848 836 826
## i_have_been to_have_a rest_of_the
## 823 813 811
## all_of_the i_had_to in_order_to
## 803 781 780
## to_make_a for_the_first in_front_of
## 767 754 738
## i_don't_think to_go_to in_the_past
## 736 735 733
## of_the_year it_will_be i_need_to
## 733 727 724
## to_be_the at_the_same back_to_the
## 723 722 715
## don't_want_to i_had_a was_going_to
## 713 712 711
## as_much_as a_bit_of the_same_time
## 689 686 681
## if_you_are looking_forward_to of_the_most
## 681 662 660
## at_the_time i_wanted_to when_i_was
## 650 646 643
## thank_you_for you_have_a to_do_with
## 641 637 632
## if_you_have want_to_be there_was_a
## 631 623 620
## in_the_middle to_get_a one_of_those
## 619 608 607
## i_am_not a_little_bit i_think_i
## 606 604 603
## more_than_a have_to_be i_love_you
## 601 595 591
## to_see_the are_going_to i_will_be
## 588 588 586
## a_chance_to of_the_day as_long_as
## 584 581 571
## will_be_a to_make_it can't_wait_to
## 568 568 562
## it_was_the we_need_to to_make_the
## 561 560 555
## because_of_the side_of_the would_like_to
## 553 547 537
## that_i_have
## 536
# Bar chart of the top 50 trigram features by frequency
a <- ggplot(trigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Trigrams by Frequency")
There are about 4.7 million distinct quadrigrams in the sampled corpus. Of these, about 3.2 million (or 67% of the total) show up only once or twice. Thus, we have only about 1.6 million quadrigrams on which we can reasonably base predictions.
# How many quadrigrams are there? The corpus contains about 4.7 million distinct 4-grams.
quadgram.dfm <- dfm(quadgram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 4,744,496 feature types
## ... created a 118007 x 4744496 sparse dfm
## ... complete.
## Elapsed time: 32.33 seconds.
nfeature(quadgram.dfm)
## [1] 4744496
# How many quadrigrams only show up once and twice respectively in the Corpus?
topquad <- topfeatures(quadgram.dfm,n=nfeature(quadgram.dfm))
quadgram.feat <- data.frame(topquad)
quadgram.feat <- cbind(names(topquad),quadgram.feat)
names(quadgram.feat) <- c("feature","frequency")
# 1.7 million quadrigrams only show up once in the corpus
nrow(quadgram.feat[quadgram.feat$frequency == 1,])
## [1] 1671832
# 1.5 million quadrigrams only show up twice in the corpus
nrow(quadgram.feat[quadgram.feat$frequency == 2,])
## [1] 1514250
# What are the top 100 4-grams in terms of frequency? Note that many are built from stopwords.
topfeatures(quadgram.dfm,n=100)
## the_end_of_the the_rest_of_the at_the_end_of
## 870 796 703
## at_the_same_time for_the_first_time one_of_the_most
## 625 561 506
## when_it_comes_to in_the_middle_of is_one_of_the
## 466 461 453
## to_be_able_to is_going_to_be thanks_for_the_follow
## 387 373 346
## in_the_united_states if_you_want_to going_to_be_a
## 332 324 313
## i_don't_want_to one_of_the_best a_lot_of_people
## 310 268 262
## a_bit_of_a for_the_rest_of by_the_end_of
## 259 253 248
## the_middle_of_the on_the_other_hand will_be_able_to
## 243 242 239
## as_well_as_the it's_going_to_be said_in_a_statement
## 236 235 235
## i_was_going_to the_bottom_of_the the_top_of_the
## 235 224 219
## i_am_going_to a_member_of_the at_the_university_of
## 213 210 204
## was_one_of_the i_would_like_to one_of_my_favorite
## 203 200 200
## as_a_result_of can't_wait_to_see i_can't_wait_to
## 199 195 194
## in_front_of_the going_to_have_to a_few_years_ago
## 188 184 183
## as_much_as_i was_going_to_be thanks_for_the_rt
## 180 179 178
## a_little_bit_of thank_you_so_much have_a_lot_of
## 176 175 175
## to_go_to_the for_a_long_time i_don't_know_what
## 174 172 171
## in_the_first_place what_do_you_think for_the_most_part
## 170 170 170
## a_lot_of_the as_part_of_the turned_out_to_be
## 169 168 166
## the_back_of_the you_don't_have_to i_don't_know_if
## 165 165 165
## at_the_top_of i_would_love_to thank_you_for_the
## 162 162 162
## i_want_to_be i_was_able_to in_the_block_of
## 162 162 159
## as_well_as_a the_first_time_in if_you_have_a
## 158 154 151
## the_beginning_of_the you_have_to_do i_have_no_idea
## 151 149 149
## the_first_time_i the_other_side_of in_the_first_half
## 148 147 146
## be_one_of_the to_be_in_the at_the_beginning_of
## 146 145 144
## on_the_other_side i_wish_i_could to_do_with_the
## 144 144 143
## are_going_to_be end_of_the_day in_one_of_the
## 141 140 138
## to_come_up_with most_of_the_time in_the_face_of
## 138 136 136
## nothing_to_do_with i_didn't_want_to there_will_be_a
## 135 134 131
## as_one_of_the in_the_case_of i_don't_know_how
## 131 130 129
## you_don't_want_to there_are_so_many i_have_to_say
## 128 127 125
## one_of_the_first on_top_of_the to_get_back_to
## 125 121 120
## to_get_out_of
## 120
# Bar chart of the top 50 quadrigram (4-gram) features by frequency
a <- ggplot(quadgram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Trigrams by Frequency")
Below is my plan and approach for the project:
There will be an input box where the user types a sentence. To keep it simple, input will be limited to one (1) sentence.
As the user starts to type the sentence, predictions will be displayed in an output box.
To simplify, predictions will be generated from the last three words typed. That is, no prediction is offered until the user has typed at least three words.
One (1) prediction will be generated at a time (not multiple).
To keep it simple, we will NOT build a facility for the user to select the predicted word and have the app insert it into the input box.
The core of the algorithm will be an N-gram model. More advanced techniques, such as semantic analysis, could probably boost accuracy but are beyond the scope of this project.
Specifically, a quadrigram language model will be created to generate predictions, using a training dataset and following the process above.
“Backoff” will be utilized, meaning that the algorithm will first seek the highest-frequency quadrigram that matches the trigram (the last three words) entered by the user. If no match is found, the process is repeated with the next lower-order N-gram: the algorithm takes the last two words (a bigram) and seeks the highest-frequency matching trigram, and so on. (A minimal sketch of this backoff lookup appears after this list.)
As noted above, given the high sparsity of higher-order N-grams, the algorithm must be adjusted for unseen words and N-grams.
The algorithm will select predictions based on conditional probabilities rather than raw frequency counts. That is, the N-gram with the highest probability, given the observed (N-1)-gram, will be selected.
Conditional probabilities will incorporate Good-Turing smoothing, which reallocates (i.e. smooths) probability mass from n-grams seen only once to the n-grams that were never seen in the training dataset. As described in the Coursera NLP videos, this involves adjusting the counts used when calculating conditional probabilities (the adjustment is included in the sketch after this list).
The algorithm needs to be optimized for size and runtime so that the Shiny app can run on the shinyapps.io server.
One approach, to be investigated in the upcoming weeks, is pruning (i.e. removing) N-grams that occur only once. As noted above, a high percentage of N-grams occur only once. Specifically, we will investigate pruning N-grams that occur only once and have a conditional probability of 1.
An alternative approach is to reduce the training dataset to about 50,000 sentences; studies of various models suggest that accuracy improvements level off at approximately that size.
Finally, we will evaluate the performance (i.e. accuracy) of our language model using the perplexity measure.
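To make the backoff and smoothing items above concrete, here is a minimal sketch written against the unigram.feat, bigram.feat, trigram.feat and quadgram.feat tables built earlier in this report. The helper names gt_adjust and predict_next are hypothetical, the n-gram features are assumed to be joined with underscores (as in the topfeatures output above), and the full-table scans are far too slow for the final Shiny app; this illustrates the planned logic, not the final implementation.
# Good-Turing adjusted count: c* = (c + 1) * N(c+1) / N(c),
# where N(c) is the number of n-grams observed exactly c times.
# (In practice the adjustment is usually applied only to small counts.)
gt_adjust <- function(freq) {
  nc <- table(freq)                     # N(c) for each observed count c
  counts <- as.integer(names(nc))
  Nc <- as.numeric(nc)
  Nc1 <- Nc[match(counts + 1, counts)]  # N(c+1); NA when no n-gram has count c+1
  adj_by_count <- ifelse(is.na(Nc1), counts, (counts + 1) * Nc1 / Nc)
  adj_by_count[match(freq, counts)]     # map each raw count to its adjusted count
}
# Pre-compute adjusted counts once per table
quadgram.feat$adjusted <- gt_adjust(quadgram.feat$frequency)
trigram.feat$adjusted  <- gt_adjust(trigram.feat$frequency)
bigram.feat$adjusted   <- gt_adjust(bigram.feat$frequency)
# Simple backoff: try the quadrigram table on the last three words typed,
# then the trigram table on the last two, then the bigram table on the last one.
predict_next <- function(phrase) {
  words  <- unlist(strsplit(tolower(phrase), "\\s+"))
  tables <- list(quadgram.feat, trigram.feat, bigram.feat)
  context_sizes <- c(3, 2, 1)           # words of context each table conditions on
  for (i in seq_along(tables)) {
    n <- context_sizes[i]
    if (length(words) < n) next
    context <- paste(tail(words, n), collapse = "_")
    feats <- as.character(tables[[i]]$feature)
    hits  <- tables[[i]][startsWith(feats, paste0(context, "_")), ]
    if (nrow(hits) > 0) {
      # Within a fixed context the denominator of the conditional probability is
      # shared, so ranking the matches by adjusted count is equivalent to ranking
      # them by conditional probability.
      best <- hits[which.max(hits$adjusted), ]
      return(sub(".*_", "", as.character(best$feature)))  # last token = prediction
    }
  }
  as.character(unigram.feat$feature[1])  # final fallback: the most frequent unigram
}
# Example: predict_next("at the end") would likely return "of", given the
# quadrigram counts shown above.
Perplexity on a held-out set can then be computed from the per-word conditional probabilities p as exp(-mean(log(p))), which is the evaluation referred to in the last item above.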
I appreciate your feedback on this plan and approach!