Synopsis

The ultimate goal of the Data Science Capstone Project is to use the skills acquired throughout the nine data science courses to create an application that predicts the next word. The application will be built as a Shiny app that receives text as input and returns a few possible next words using a predictive model.

This milestone report presents the exploratory analysis behind the predictive model. The model will be trained on a corpus, a collection of written text called HC Corpora, which consists of data from three media sources: blogs, news and Twitter.

1. The Analysis

We gather the following information from the three datasets:

1. No of lines
2. File size
3. No of words
4. No of characters

The output below shows the results; file sizes are in bytes.

## [1] "**NO OF LINES** in Twitter: 2360148 , Blogs : 899288 ,News : 77259"
## [1] "**FILE SIZE** for Twitter: 167105338 , Blogs : 210160014 ,News : 205811889"
## [1] "**NO OF WORDS** in Twitter: 30578891 , Blogs : 37865888 ,News : 2665742"
## [1] "**NO OF CHAR** in Twitter: 125769312 , Blogs : 163325412 ,News : 12502954"

2. The Sample Data

For easier analysis, we randomly sample about 20,000 text lines from the three datasets.

## [1] "**NO OF LINES** of a new sampling data 20000"

3. Clean up Sample Data & Tokenization

From the 20,000 sampled lines, we create a corpus, a collection of the three datasets, and then clean it. For convenient cleaning, we use the document-feature matrix (dfm) functionality from the quanteda library: applied to the corpus, it converts all letters to lowercase and removes whitespace, special characters and numbers. Once the sample data has been cleaned, we tokenize it into unigrams, bigrams and trigrams. Some of the observations are shown below.
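The following is a sketch of this cleaning and tokenization step using the current quanteda tokens/dfm workflow; the exact arguments are assumptions and may differ from the original script.

```r
library(quanteda)

sample_corpus <- corpus(sample_lines)

# Clean: strip punctuation, numbers and symbols, then lowercase
toks <- tokens(sample_corpus,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)

# Build unigram, bigram and trigram document-feature matrices
dfm_uni <- dfm(toks)
dfm_bi  <- dfm(tokens_ngrams(toks, n = 2, concatenator = "_"))
dfm_tri <- dfm(tokens_ngrams(toks, n = 3, concatenator = "_"))

topfeatures(dfm_uni, 20)  # top 20 unigrams
topfeatures(dfm_bi, 20)   # top 20 bigrams
topfeatures(dfm_tri, 20)  # top 20 trigrams
```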

Top 20 most frequent unigrams

##   the    to   and     a    of    in     i  that   for    is    it    on 
## 89136 48618 45716 42406 38061 29704 26191 18883 18477 18125 15769 13809 
##   you  with   was    at  this    as    my    be 
## 12971 12952 11777  9629  9571  9454  9362  9259

Top 20 most frequent bigrams

##   of_the   in_the   to_the   on_the  for_the    to_be   at_the  and_the 
##     8325     7634     3977     3551     3335     2873     2512     2505 
##     in_a with_the   it_was     is_a from_the    for_a   with_a    it_is 
##     2348     1986     1775     1766     1681     1653     1601     1574 
##    i_was     of_a    and_i   i_have 
##     1562     1556     1485     1467

Top 20 most frequent trigrams

##     one_of_the       a_lot_of        to_be_a     as_well_as     the_end_of 
##            689            599            299            267            266 
##    going_to_be     out_of_the       it_was_a        i_don't      i_want_to 
##            265            254            253            246            237 
##     be_able_to    some_of_the    part_of_the  the_fact_that    the_rest_of 
##            235            231            212            198            194 
##       i_have_a thanks_for_the the_first_time    a_couple_of      this_is_a 
##            190            186            184            184            183

4. Bar Charts for the n-grams

These bar charts of the most frequent n-grams conclude the exploratory analysis of the sample data.
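One possible way to draw such a bar chart with ggplot2 is sketched below; the plotting choices are assumptions, not the original code, and reuse the `dfm_uni` object from the tokenization sketch above.

```r
library(ggplot2)

top_uni <- topfeatures(dfm_uni, 20)
plot_df <- data.frame(token = names(top_uni), freq = as.numeric(top_uni))

# Horizontal bar chart of the top 20 unigrams, ordered by frequency
ggplot(plot_df, aes(x = reorder(token, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unigram", y = "Frequency", title = "Top 20 unigrams")
```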

5. Moving Forward

Moving forward, I will be looking into coding a Shiny app that demonstrates prediction of the next word given a word or phrase as input. I will probably use either a Markov or a back-off predictive model in an attempt to predict the next word. I hope to publish the app in a month or so from now. Thank you for reading my milestone report.
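To make the back-off idea concrete, here is a very small illustration: look the input up in the trigram counts first, fall back to bigrams, and finally to the most frequent unigram. The frequency vectors are assumed to be named, sorted by decreasing frequency (as returned by `topfeatures`); this is only a sketch, not the final prediction model.

```r
predict_next <- function(phrase, tri_freq, bi_freq, uni_freq) {
  words <- tail(tolower(unlist(strsplit(phrase, "\\s+"))), 2)

  # Try trigrams that start with the last two words of the input
  if (length(words) == 2) {
    hits <- grep(paste0("^", words[1], "_", words[2], "_"),
                 names(tri_freq), value = TRUE)
    if (length(hits) > 0) return(sub(".*_", "", hits[1]))
  }

  # Back off to bigrams that start with the last word
  hits <- grep(paste0("^", tail(words, 1), "_"), names(bi_freq), value = TRUE)
  if (length(hits) > 0) return(sub(".*_", "", hits[1]))

  # Final fallback: the single most frequent unigram
  names(uni_freq)[1]
}

# Example call, reusing the n-gram matrices from the tokenization sketch:
# predict_next("one of",
#              topfeatures(dfm_tri, 1e5),
#              topfeatures(dfm_bi, 1e5),
#              topfeatures(dfm_uni, 20))
```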