In this report, I download the data and create a basic report of summary statistics about the data sets. I point out some interesting findings and outline my plan and approach for creating a prediction algorithm and Shiny app.
setwd("C:/Users/Public/Documents/Documents/Data Science/Capstone/files/")
set.seed(1234)
sampleSize <- 0.05 #Sample of 5%
# Read in blog text data and sample it
con <- file("en_US.blogs.txt","rb")
blog <- readLines(con, encoding="UTF-8")
close(con)
blogsubset <- sample(blog,length(blog)*sampleSize)
# Read in news text data and sample it
con1 <- file("en_US.news.txt","rb")
news <- readLines(con1,encoding="UTF-8")
close(con1)
newssubset <- sample(news, length(news)*sampleSize)
# Read in twitter data and sample it
con2 <- file("en_US.twitter.txt","rb")
twitter <- suppressWarnings(readLines(con2,encoding="UTF-8"))
close(con2)
twittersubset <- sample(twitter, length(twitter)*sampleSize)
Here we consolidate the sampled text files and create a corpus. Then we tokenize the corpus into unigrams, bigrams, trigrams, and quadrigrams (4-grams).
# Create a corpus of the combined sampled data
# (note: paste() combines the three samples element-wise, recycling the shorter vectors)
mycorpus <- corpus(paste(blogsubset,newssubset,twittersubset))
# Create tokenization (unigram, bigram, trigram and 4-gram )
unigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 1)
bigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 2)
trigram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 3)
quadgram <- tokenize(toLower(mycorpus),removePunct = TRUE, removeNumbers = TRUE,removeTwitter = TRUE,removeSeparators=TRUE,ngrams = 4)
As shown below, the sampled corpus consists of about 659,000 sentences comprising roughly 10.2 million words.
However, the corpus is built from only about 142 thousand word types (i.e. unique unigrams), roughly 56 thousand of which occur only once or twice. So we really have a vocabulary of about 86 thousand words supporting the corpus. Moreover, the top 140 unique words (in a frequency-sorted dictionary) account for about 50% of the total word volume in the corpus, and about 8,500 unique words account for roughly 90%. This suggests that we should consider pruning low-frequency unigrams to make our prediction algorithm smaller and faster.
# How many sentences in the corpus sampled at 5%? About 659,000 sentences.
sum(nsentence(mycorpus))
## [1] 659077
# How many total words (i.e. N) in the corpus? About 10.2 million.
sum(ntoken(unigram))
## [1] 10230522
# How many word types (i.e. unigram features)? The vocabulary consists of about 142,000 words.
unigram.dfm <- dfm(unigram) #Note that we are not removing stopwords, nor stemming
##
## ... indexing documents: 118,007 documents
## ... indexing features: 141,810 feature types
## ... created a 118007 x 141810 sparse dfm
## ... complete.
## Elapsed time: 28.8 seconds.
nfeature(unigram.dfm)
## [1] 141810
# How many word types only show up once and twice respectively in the Corpus?
topuni <- topfeatures(unigram.dfm,n=nfeature(unigram.dfm))
unigram.feat <- data.frame(topuni)
unigram.feat <- cbind(names(topuni),unigram.feat)
names(unigram.feat) <- c("feature","frequency")
# 25,828 features only show up once in the corpus
nrow(unigram.feat[unigram.feat$frequency == 1,])
## [1] 25828
# 29,752 features only show up twice in the corpus
nrow(unigram.feat[unigram.feat$frequency == 2,])
## [1] 29752
# How many unique words are needed in a frequency sorted dictionary to cover 50% (or 5.1 million) of all word instances in the corpus (which has 10.2 million words)? Answer: only about 140 words are needed!
sum(unigram.feat$frequency[1:140])
## [1] 5110234
# How many unique words are needed in a frequency sorted dictionary to cover 90% (or 9.1 million) of all word instances in the corpus (which has 10.2 million words)? Answer: only about 8,500 words are needed!
sum(unigram.feat$frequency[1:8500])
## [1] 9231933
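For reference, the two coverage checks above generalize to any target percentage. Below is a minimal helper (a sketch only; the name words_for_coverage is introduced here for illustration, and it assumes unigram.feat is sorted by descending frequency, as produced by topfeatures()):
words_for_coverage <- function(feat, target = 0.5) {
  covered <- cumsum(feat$frequency)                    # word volume covered by the top-k features
  min(which(covered >= target * sum(feat$frequency)))  # smallest k reaching the target share
}
# words_for_coverage(unigram.feat, 0.5)  # roughly 140 words
# words_for_coverage(unigram.feat, 0.9)  # on the order of 8,000-8,500 words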
# What are the top 100 unigram features in terms of frequency? Note that many are stopwords.
topfeatures(unigram.dfm,n=100)
## the to and a of in i that for
## 523859 283048 269220 250140 222117 176461 154846 111448 107785
## is it on you with was at this as
## 107534 93364 81476 77163 75824 69610 57384 56289 54654
## be my have but are he not we from
## 54307 53923 53807 50864 48805 48297 42399 42307 41807
## so said his by they or will all me
## 36188 35453 34713 34709 34686 34141 32486 32359 31547
## an about one out has up if when who
## 31529 30401 29970 28781 28430 28025 27021 26939 26911
## just what had your more like can her their
## 26860 26092 25627 25291 25267 24581 24563 24090 22885
## do it's she there time would some been were
## 22312 22304 22093 21958 21319 21176 20778 20271 20179
## get no new our i'm them which how now
## 19899 18986 18782 18543 17421 17255 16995 16479 16347
## people also into good than after don't know first
## 15659 15052 14899 14804 14769 14553 14383 14300 14242
## other day back because then him over its only
## 14176 13913 13526 13340 13198 13161 13089 12980 12712
## last two see make go love could think much
## 12689 12671 12668 12577 12333 12199 12094 11956 11859
## year
## 11762
# Bar chart of the top 50 unigram features by frequency
a <- ggplot(unigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Unigrams by Frequency")
There are about 1.7 million distinct bigrams in the sampled corpus. This represents only about 0.02% of the total universe of possible bigrams, given a vocabulary of 86 thousand words (i.e. 1.7 million divided by 86,000^2). In addition, about 950 thousand bigrams (roughly 55% of the total) show up only once or twice in the corpus.
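As a quick back-of-the-envelope check on that sparsity figure (illustrative only; the 86,000-word vocabulary is the rounded estimate from the unigram section above):
# observed distinct bigrams divided by all possible word pairs over an ~86,000-word vocabulary
1720704 / 86000^2 # roughly 0.00023, i.e. about 0.02%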
# How many bigrams are there? The corpus contains about 1.7 million distinct bigrams.
bigram.dfm <- dfm(bigram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 1,720,704 feature types
## ... created a 118007 x 1720704 sparse dfm
## ... complete.
## Elapsed time: 15.6 seconds.
nfeature(bigram.dfm)
## [1] 1720704
# How many bigrams only show up once and twice respectively in the Corpus?
topbi <- topfeatures(bigram.dfm,n=nfeature(bigram.dfm))
bigram.feat <- data.frame(topbi)
bigram.feat <- cbind(names(topbi),bigram.feat)
names(bigram.feat) <- c("feature","frequency")
# 475,163 bigrams only show up once in the corpus
nrow(bigram.feat[bigram.feat$frequency == 1,])
## [1] 475163
# 475,533 bigrams only show up twice in the corpus
nrow(bigram.feat[bigram.feat$frequency == 2,])
## [1] 475533
# What are the top 100 bigram features in terms of frequency? Note that many are built from stopwords.
topfeatures(bigram.dfm,n=100)
## of_the in_the to_the on_the for_the to_be at_the
## 49104 45878 23454 20939 19501 16602 15040
## and_the in_a with_the it_was is_a from_the for_a
## 14666 13325 12068 10603 10534 10294 9460
## with_a it_is i_was and_i of_a i_have will_be
## 9102 8962 8884 8791 8685 8610 8323
## as_a one_of i_am is_the going_to that_the by_the
## 7964 7848 7596 7380 7339 7229 7161
## if_you have_a to_get and_a this_is was_a want_to
## 6793 6699 6662 6495 6426 6283 6226
## the_first that_i out_of to_a have_to to_do i_think
## 6192 6156 5989 5949 5934 5875 5595
## on_a have_been the_same you_can he_was to_make a_lot
## 5577 5568 5442 5217 5180 5175 5171
## i_don't as_the all_the has_been a_few but_i to_see
## 5164 5109 5028 4994 4982 4964 4893
## more_than about_the of_my into_the be_a he_said a_little
## 4680 4596 4592 4573 4562 4559 4521
## would_be i_had when_i there_is to_have need_to in_my
## 4412 4377 4186 4115 4057 3977 3919
## they_are over_the you_have had_a to_go there_are you_are
## 3910 3884 3880 3836 3781 3757 3686
## i_can i_love that_is is_not i_know a_good the_most
## 3660 3659 3641 3637 3609 3566 3560
## we_are but_the part_of and_then lot_of so_i the_world
## 3549 3534 3482 3447 3447 3414 3412
## the_best we_have the_new was_the a_new some_of of_his
## 3400 3394 3373 3369 3361 3358 3292
## at_least can_be
## 3237 3217
# Bar chart of the top 50 bigram features by frequency
a <- ggplot(bigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Bigrams by Frequency")
There are about 3.8 million distinct trigrams in the sampled corpus. Of these, about 2.4 million (or 63% of the total) show up only once or twice. Thus, we have only about 1.4 million trigrams on which we can reasonably base predictions.
# How many trigrams are there? The corpus contains about 3.8 million distinct trigrams.
trigram.dfm <- dfm(trigram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 3,783,631 feature types
## ... created a 118007 x 3783631 sparse dfm
## ... complete.
## Elapsed time: 15.74 seconds.
nfeature(trigram.dfm)
## [1] 3783631
# How many trigrams only show up once and twice respectively in the Corpus?
toptri <- topfeatures(trigram.dfm,n=nfeature(trigram.dfm))
trigram.feat <- data.frame(toptri)
trigram.feat <- cbind(names(toptri),trigram.feat)
names(trigram.feat) <- c("feature","frequency")
# 1.2 million trigrams only show up once in the corpus
nrow(trigram.feat[trigram.feat$frequency == 1,])
## [1] 1217173
# 1.2 million trigrams only show up twice in the corpus
nrow(trigram.feat[trigram.feat$frequency == 2,])
## [1] 1168639
# What are the top 100 trigram features in terms of frequency? Note that many are built from stopwords.
topfeatures(trigram.dfm,n=100)
## one_of_the a_lot_of to_be_a
## 3858 3325 1805
## out_of_the as_well_as some_of_the
## 1718 1663 1660
## going_to_be the_end_of it_was_a
## 1603 1590 1518
## be_able_to part_of_the i_want_to
## 1424 1370 1362
## thanks_for_the the_rest_of this_is_a
## 1301 1232 1155
## a_couple_of i_have_a i_don't_know
## 1104 1080 1068
## the_fact_that there_is_a end_of_the
## 1068 1045 1031
## i_have_to you_have_to the_first_time
## 1011 1008 1008
## most_of_the in_the_first you_want_to
## 953 949 942
## this_is_the it_would_be according_to_the
## 941 909 908
## in_the_world it_is_a there_is_no
## 895 895 874
## is_going_to the_united_states at_the_end
## 872 871 860
## is_one_of one_of_my i'm_going_to
## 848 836 826
## i_have_been to_have_a rest_of_the
## 823 813 811
## all_of_the i_had_to in_order_to
## 803 781 780
## to_make_a for_the_first in_front_of
## 767 754 738
## i_don't_think to_go_to in_the_past
## 736 735 733
## of_the_year it_will_be i_need_to
## 733 727 724
## to_be_the at_the_same back_to_the
## 723 722 715
## don't_want_to i_had_a was_going_to
## 713 712 711
## as_much_as a_bit_of the_same_time
## 689 686 681
## if_you_are looking_forward_to of_the_most
## 681 662 660
## at_the_time i_wanted_to when_i_was
## 650 646 643
## thank_you_for you_have_a to_do_with
## 641 637 632
## if_you_have want_to_be there_was_a
## 631 623 620
## in_the_middle to_get_a one_of_those
## 619 608 607
## i_am_not a_little_bit i_think_i
## 606 604 603
## more_than_a have_to_be i_love_you
## 601 595 591
## to_see_the are_going_to i_will_be
## 588 588 586
## a_chance_to of_the_day as_long_as
## 584 581 571
## will_be_a to_make_it can't_wait_to
## 568 568 562
## it_was_the we_need_to to_make_the
## 561 560 555
## because_of_the side_of_the would_like_to
## 553 547 537
## that_i_have
## 536
# Bar chart of the top 50 trigram features by frequency
a <- ggplot(trigram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Trigrams by Frequency")
There are about 4.7 million distinct quadrigrams in the sampled corpus. Of these, about 3.2 million (or 67% of the total) show up only once or twice. Thus, we have only about 1.6 million quadrigrams on which we can reasonably base predictions.
# How many quadrigrams are there? The corpus contains about 4.7 million distinct 4-grams.
quadgram.dfm <- dfm(quadgram)
##
## ... indexing documents: 118,007 documents
## ... indexing features: 4,744,496 feature types
## ... created a 118007 x 4744496 sparse dfm
## ... complete.
## Elapsed time: 32.33 seconds.
nfeature(quadgram.dfm)
## [1] 4744496
# How many quadrigrams only show up once and twice respectively in the Corpus?
topquad <- topfeatures(quadgram.dfm,n=nfeature(quadgram.dfm))
quadgram.feat <- data.frame(topquad)
quadgram.feat <- cbind(names(topquad),quadgram.feat)
names(quadgram.feat) <- c("feature","frequency")
# 1.7 million quadrigrams only show up once in the corpus
nrow(quadgram.feat[quadgram.feat$frequency == 1,])
## [1] 1671832
# 1.5 million quadrigrams only show up twice in the corpus
nrow(quadgram.feat[quadgram.feat$frequency == 2,])
## [1] 1514250
# What are the top 100 4-grams in terms of frequency? Note that many are built from stopwords.
topfeatures(quadgram.dfm,n=100)
## the_end_of_the the_rest_of_the at_the_end_of
## 870 796 703
## at_the_same_time for_the_first_time one_of_the_most
## 625 561 506
## when_it_comes_to in_the_middle_of is_one_of_the
## 466 461 453
## to_be_able_to is_going_to_be thanks_for_the_follow
## 387 373 346
## in_the_united_states if_you_want_to going_to_be_a
## 332 324 313
## i_don't_want_to one_of_the_best a_lot_of_people
## 310 268 262
## a_bit_of_a for_the_rest_of by_the_end_of
## 259 253 248
## the_middle_of_the on_the_other_hand will_be_able_to
## 243 242 239
## as_well_as_the it's_going_to_be said_in_a_statement
## 236 235 235
## i_was_going_to the_bottom_of_the the_top_of_the
## 235 224 219
## i_am_going_to a_member_of_the at_the_university_of
## 213 210 204
## was_one_of_the i_would_like_to one_of_my_favorite
## 203 200 200
## as_a_result_of can't_wait_to_see i_can't_wait_to
## 199 195 194
## in_front_of_the going_to_have_to a_few_years_ago
## 188 184 183
## as_much_as_i was_going_to_be thanks_for_the_rt
## 180 179 178
## a_little_bit_of thank_you_so_much have_a_lot_of
## 176 175 175
## to_go_to_the for_a_long_time i_don't_know_what
## 174 172 171
## in_the_first_place what_do_you_think for_the_most_part
## 170 170 170
## a_lot_of_the as_part_of_the turned_out_to_be
## 169 168 166
## the_back_of_the you_don't_have_to i_don't_know_if
## 165 165 165
## at_the_top_of i_would_love_to thank_you_for_the
## 162 162 162
## i_want_to_be i_was_able_to in_the_block_of
## 162 162 159
## as_well_as_a the_first_time_in if_you_have_a
## 158 154 151
## the_beginning_of_the you_have_to_do i_have_no_idea
## 151 149 149
## the_first_time_i the_other_side_of in_the_first_half
## 148 147 146
## be_one_of_the to_be_in_the at_the_beginning_of
## 146 145 144
## on_the_other_side i_wish_i_could to_do_with_the
## 144 144 143
## are_going_to_be end_of_the_day in_one_of_the
## 141 140 138
## to_come_up_with most_of_the_time in_the_face_of
## 138 136 136
## nothing_to_do_with i_didn't_want_to there_will_be_a
## 135 134 131
## as_one_of_the in_the_case_of i_don't_know_how
## 131 130 129
## you_don't_want_to there_are_so_many i_have_to_say
## 128 127 125
## one_of_the_first on_top_of_the to_get_back_to
## 125 121 120
## to_get_out_of
## 120
# Bar chart of the top 50 quadrigram (4-gram) features by frequency
a <- ggplot(quadgram.feat[1:50,], aes(reorder(feature,-frequency),frequency))
a + geom_bar(stat="identity")+xlab("Features") + theme(text = element_text(size=14),axis.text.x = element_text(angle = 90, hjust = 1,vjust=0))+ggtitle("Top 50 Trigrams by Frequency")
Below is my plan and approach for the project:
There will be an input box where the user types a sentence. To keep it simple, input will be limited to one (1) sentence.
As the user starts to type the sentence, predictions will be displayed in an output box.
To simplify, predictions will be generated from the last three words typed. That is, no prediction is offered until the user has typed at least three words.
One (1) prediction will be generated at a time (not multiple).
To keep it simple, we will NOT build a facility for the user to select the predicted word and have the app insert it into the input box.
The core of the algorithm will be an N-gram model. More advanced techniques, such as semantic analysis, could probably boost accuracy but are beyond the scope of this project.
Specifically, a quadrigram language model will be created to generate predictions, using a training dataset and following the process above.
“Backoff” will be utilized, meaning that the algorithm will first seek the highest-frequency quadrigram that matches the trigram (the last three words) entered by the user. If no match is found, the process is repeated with the next lower-order N-gram: the algorithm takes the last two words (a bigram) and seeks the highest-frequency matching trigram, and so on. (A minimal sketch of this backoff lookup appears after this list.)
As noted above, given the high sparsity of higher-order N-grams, the algorithm must be adjusted for unseen words and N-grams.
The algorithm will select predictions based on conditional probabilities rather than raw frequency counts. That is, the N-gram with the highest probability, given the observed (N-1)-gram, will be selected.
Conditional probabilities will incorporate Good-Turing smoothing, which reallocates (i.e. smooths) probability mass from n-grams seen only once to the n-grams that were never seen in the training dataset. As described in the Coursera NLP videos, this involves adjusting the counts used when calculating conditional probabilities (the adjustment is included in the sketch after this list).
The algorithm needs to be optimized for size and runtime so that the Shiny app can run on the shinyapps.io server.
One approach, to be investigated in the upcoming weeks, is pruning (i.e. removing) N-grams that occur only once. As noted above, a high percentage of N-grams occur only once. Specifically, we will investigate pruning N-grams that occur only once and have a conditional probability of 1.
An alternative approach is to reduce the training dataset to about 50,000 sentences; studies of various models suggest that accuracy improvements level off at approximately that size.
Finally, we will evaluate the performance (i.e. accuracy) of our language model using the perplexity measure.
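To make the backoff and smoothing items above concrete, here is a minimal sketch written against the unigram.feat, bigram.feat, trigram.feat and quadgram.feat tables built earlier in this report. The helper names gt_adjust and predict_next are hypothetical, the n-gram features are assumed to be joined with underscores (as in the topfeatures output above), and the full-table scans are far too slow for the final Shiny app; this illustrates the planned logic, not the final implementation.
# Good-Turing adjusted count: c* = (c + 1) * N(c+1) / N(c),
# where N(c) is the number of n-grams observed exactly c times.
# (In practice the adjustment is usually applied only to small counts.)
gt_adjust <- function(freq) {
  nc <- table(freq)                     # N(c) for each observed count c
  counts <- as.integer(names(nc))
  Nc <- as.numeric(nc)
  Nc1 <- Nc[match(counts + 1, counts)]  # N(c+1); NA when no n-gram has count c+1
  adj_by_count <- ifelse(is.na(Nc1), counts, (counts + 1) * Nc1 / Nc)
  adj_by_count[match(freq, counts)]     # map each raw count to its adjusted count
}
# Pre-compute adjusted counts once per table
quadgram.feat$adjusted <- gt_adjust(quadgram.feat$frequency)
trigram.feat$adjusted  <- gt_adjust(trigram.feat$frequency)
bigram.feat$adjusted   <- gt_adjust(bigram.feat$frequency)
# Simple backoff: try the quadrigram table on the last three words typed,
# then the trigram table on the last two, then the bigram table on the last one.
predict_next <- function(phrase) {
  words  <- unlist(strsplit(tolower(phrase), "\\s+"))
  tables <- list(quadgram.feat, trigram.feat, bigram.feat)
  context_sizes <- c(3, 2, 1)           # words of context each table conditions on
  for (i in seq_along(tables)) {
    n <- context_sizes[i]
    if (length(words) < n) next
    context <- paste(tail(words, n), collapse = "_")
    feats <- as.character(tables[[i]]$feature)
    hits  <- tables[[i]][startsWith(feats, paste0(context, "_")), ]
    if (nrow(hits) > 0) {
      # Within a fixed context the denominator of the conditional probability is
      # shared, so ranking the matches by adjusted count is equivalent to ranking
      # them by conditional probability.
      best <- hits[which.max(hits$adjusted), ]
      return(sub(".*_", "", as.character(best$feature)))  # last token = prediction
    }
  }
  as.character(unigram.feat$feature[1])  # final fallback: the most frequent unigram
}
# Example: predict_next("at the end") would likely return "of", given the
# quadrigram counts shown above.
Perplexity on a held-out set can then be computed from the per-word conditional probabilities p as exp(-mean(log(p))), which is the evaluation referred to in the last item above.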
I appreciate your feedback on this plan and approach!