Synopsis
The ultimate goal of the Data Science Capstone Project is to use the skills acquired throughout the nine data science courses to build an application that predicts the next word. The application will be built as a Shiny app that accepts text as input and returns a few possible recommended next words produced by a predictive model.
This milestone report presents the exploratory analysis carried out in producing the predictive model. The model will be trained on a corpus, a collection of written text called HC Corpora, which consists of data from three media sources: blogs, news, and Twitter.
1. The Analysis
We gather the following information from the three data sets:
1. Number of lines
2. File size
3. Number of words
4. Number of characters
The table below shows the results.
## [1] "**NO OF LINES** in Twitter: 2360148 , Blogs : 899288 ,News : 77259"
## [1] "**FILE SIZE** for Twitter: 167105338 , Blogs : 210160014 ,News : 205811889"
## [1] "**NO OF WORDS** in Twitter: 30578891 , Blogs : 37865888 ,News : 2665742"
## [1] "**NO OF CHAR** in Twitter: 125769312 , Blogs : 163325412 ,News : 12502954"
2. The Sample Data
For ease of analysis, we randomly sample about 20,000 text lines from the three data sets.
## [1] "**NO OF LINES** of a new sampling data 20000"
3. Clean up Sample Data & Tokenization
From the 20,000 sampled lines, we create a corpus, a collection of the three data sets, and then proceed to clean it. For ease of cleaning, we use the document-feature matrix (dfm) functionality from the quanteda library, which, applied to the corpus, converts all letters to lowercase and removes whitespace, special characters, and numbers. Once the sample data has been cleaned, we tokenize it into unigrams, bigrams, and trigrams. A sketch of this pipeline is shown below, followed by some of the observations:
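This is a rough sketch of the cleaning and tokenization pipeline with quanteda, assuming `sample_lines` holds the 20,000 sampled lines from the previous step; it is not the exact code behind the output below.

```r
library(quanteda)

corp <- corpus(sample_lines)

# Basic clean-up: drop punctuation, numbers, symbols, and URLs, then lowercase.
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE, remove_url = TRUE)
toks <- tokens_tolower(toks)

# Build unigram, bigram, and trigram document-feature matrices.
dfm_uni <- dfm(toks)
dfm_bi  <- dfm(tokens_ngrams(toks, n = 2))
dfm_tri <- dfm(tokens_ngrams(toks, n = 3))

# Inspect the 20 most frequent features of each n-gram set.
topfeatures(dfm_uni, 20)
topfeatures(dfm_bi, 20)
topfeatures(dfm_tri, 20)
```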
Top 20 most frequent unigrams
## the to and a of in i that for is it on
## 89136 48618 45716 42406 38061 29704 26191 18883 18477 18125 15769 13809
## you with was at this as my be
## 12971 12952 11777 9629 9571 9454 9362 9259
Top 20 most frequent bigrams
## of_the in_the to_the on_the for_the to_be at_the and_the
## 8325 7634 3977 3551 3335 2873 2512 2505
## in_a with_the it_was is_a from_the for_a with_a it_is
## 2348 1986 1775 1766 1681 1653 1601 1574
## i_was of_a and_i i_have
## 1562 1556 1485 1467
Top 20 most frequent trigrams
## one_of_the a_lot_of to_be_a as_well_as the_end_of
## 689 599 299 267 266
## going_to_be out_of_the it_was_a i_don't i_want_to
## 265 254 253 246 237
## be_able_to some_of_the part_of_the the_fact_that the_rest_of
## 235 231 212 198 194
## i_have_a thanks_for_the the_first_time a_couple_of this_is_a
## 190 186 184 184 183
4. Bar Charts for the N-grams
(Bar charts of the top 20 unigrams, bigrams, and trigrams appear here.)
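A minimal sketch of how one of these charts could be produced is shown below, assuming `dfm_uni` from the tokenization step above; the bigram and trigram charts follow the same pattern.

```r
library(ggplot2)

top_uni <- topfeatures(dfm_uni, 20)
plot_df <- data.frame(term = names(top_uni), freq = as.numeric(top_uni))

# Horizontal bar chart of the 20 most frequent unigrams.
ggplot(plot_df, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Unigrams", x = NULL, y = "Frequency")
```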
These diagrams conclude the exploratory analysis of the sample data.
5. Moving Forward
Moving forward, I will be looking into building a Shiny app that demonstrates prediction of the next word given a word or phrase as input. I will probably use either a Markov chain or a backoff predictive model in an attempt to predict the next word; a rough sketch of the backoff idea is included below. I hope to publish the app in a month or so from now. Thank you for reading my milestone report.
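The sketch below illustrates the backoff idea only at a conceptual level: look up the last two words in a trigram table, fall back to bigrams, and finally to the most frequent unigrams. The objects `tri_freq`, `bi_freq`, and `uni_freq` are assumed to be data frames built from the n-gram counts above, with columns `prefix`, `next_word`, and `count`; these names are hypothetical, not the final app code.

```r
predict_next <- function(phrase, tri_freq, bi_freq, uni_freq, n = 3) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  last2 <- paste(tail(words, 2), collapse = "_")
  last1 <- tail(words, 1)

  # Try trigrams first, then bigrams, then fall back to top unigrams.
  hits <- tri_freq[tri_freq$prefix == last2, ]
  if (nrow(hits) == 0) hits <- bi_freq[bi_freq$prefix == last1, ]
  if (nrow(hits) == 0) return(head(uni_freq$next_word, n))

  head(hits$next_word[order(-hits$count)], n)
}
```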