Overview:

This report was created for the Coursera Capstone Milestone. It summarizes the basic data analysis completed after ingesting the English Blog, News, and Twitter datasets that make up the reference corpus for the project.

This dataset was obtained from the course website on Coursera (https://class.coursera.org/dsscapstone-003/lecture/7) and is derived from the HC Corpus (http://www.corpora.heliohost.org/aboutcorpus.html).


Analysis Steps

1.

The first step in this project was to load the raw data from the individual “.TXT” files. I opened each file in binary mode and read it with the readLines function, which resolved some import issues caused by non-standard characters.
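
A minimal sketch of this kind of binary-mode read is shown below. The helper name read_corpus_file, the UTF-8 encoding, and the skipNul setting are assumptions made for illustration, and the temporary file only stands in for one of the corpus files.

```r
# Hypothetical helper: open a text file in binary mode and read every line.
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # always release the connection
  # Assumption: the files are UTF-8 and may contain embedded NUL bytes.
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small stand-in file so the sketch runs without the corpus itself.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("First record", "Second record"), demo_path)
demo_lines <- read_corpus_file(demo_path)
length(demo_lines)
```

Reading through an explicit connection keeps the file handling in one place, so the same helper could be applied to each of the three files in turn.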

Next, I processed the data and performed the following steps:
1. Transformed all characters to lowercase
2. Removed all symbols
3. Removed all numerical characters
4. Collapsed repeated whitespace
5. Removed leading and trailing whitespace
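
The five steps above can be sketched in base R roughly as follows. This is an illustration rather than the exact code used for the report: the function name clean_text is hypothetical, and treating "symbols" as the POSIX punctuation class (deleted rather than replaced by a space) is an assumption.

```r
# Hypothetical cleaning helper that applies the five steps in order.
clean_text <- function(x) {
  x <- tolower(x)                     # 1. lowercase everything
  x <- gsub("[[:punct:]]", "", x)     # 2. drop symbols (assumed: punctuation class)
  x <- gsub("[[:digit:]]", "", x)     # 3. drop numerical characters
  x <- gsub("[[:space:]]+", " ", x)   # 4. collapse runs of whitespace
  trimws(x)                           # 5. strip leading/trailing whitespace
}

clean_text("  Hello,   World! 123  ")
# "hello world"
```

Because gsub and trimws are vectorized, the same function can be applied directly to a whole character vector of records.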


2.

I then computed summary statistics for the three English-language files I imported and processed. The basic statistics for these datasets are shown in the table below:

summary_data<-matrix(c(
        en_US_blog_word_count_sum,en_US_blog_word_count_average,
        en_US_blog_character_count_sum,en_US_blog_character_count_average,
        en_US_news_word_count_sum,en_US_news_word_count_average,
        en_US_news_character_count_sum,en_US_news_character_count_average,
        en_US_twitter_word_count_sum,en_US_twitter_word_count_average,
        en_US_twitter_character_count_sum,en_US_twitter_character_count_average),
        ncol=4, byrow=TRUE)
colnames(summary_data)<-c("Word Count Total","   Average Words/Record",
                          "  Character Count Total","  Average Characters/Record")
rownames(summary_data)<-c("Blog", "News", "Twitter")
summary_data<-as.table(summary_data)
summary_data
##         Word Count Total    Average Words/Record   Character Count Total   Average Characters/Record
## Blog          36816799.0                    40.9             198718413.0                       221.0
## News          33468818.0                    33.1             192987990.0                       191.0
## Twitter       29354815.0                    12.4             151971948.0                        64.4
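
The four values per dataset that feed the table above could be computed along these lines. This is a sketch under assumptions: the helper name text_stats is hypothetical, words are assumed to be separated by single spaces (as they are after cleaning), and character counts are assumed to include spaces.

```r
# Hypothetical helper: totals and per-record averages for one cleaned dataset.
text_stats <- function(lines) {
  # Assumption: cleaned text, so one space separates consecutive words.
  words_per_record <- lengths(strsplit(lines, " ", fixed = TRUE))
  chars_per_record <- nchar(lines)    # counts spaces as characters
  c(word_count_sum          = sum(words_per_record),
    word_count_average      = mean(words_per_record),
    character_count_sum     = sum(chars_per_record),
    character_count_average = mean(chars_per_record))
}

demo_stats <- text_stats(c("hello world", "a b c"))
demo_stats
```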


3.

For comparison with the statistics computed above, I also imported the matching values reported for the original (unprocessed) HC Corpus. This shows the overall effect that the data processing (filtering) had on the data. The comparison can be seen in the table below:

Corpus_data<-matrix(c(
        37242000,41.41,
        206824000,NA,
        34275000,33.93,
        203223000,NA,
        29876000,12.66,
        162122000,NA),
        ncol=4, byrow=TRUE)
colnames(Corpus_data)<-c("Word Count Total","   Average Words/Record",
                          "  Character Count Total","  Average Characters/Record")
rownames(Corpus_data)<-c("HC_Corp_Blog", "HC_Corp_News", "HC_Corp_Twitter")
Corpus_data<-as.table(Corpus_data)
Corpus_data
##                 Word Count Total    Average Words/Record   Character Count Total   Average Characters/Record
## HC_Corp_Blog         3.72420e+07             4.14100e+01             2.06824e+08                            
## HC_Corp_News         3.42750e+07             3.39300e+01             2.03223e+08                            
## HC_Corp_Twitter      2.98760e+07             1.26600e+01             1.62122e+08



Here is a plot that allows you to see the impact of data processing more clearly:
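
One way such a comparison could be drawn is sketched below, using only the word count totals from the two tables above. The percentage calculation and the chart layout are assumptions for illustration and may differ from the original figure.

```r
# Word count totals copied from the two summary tables in this report.
word_counts <- rbind(
  Original  = c(Blog = 37242000, News = 34275000, Twitter = 29876000),
  Processed = c(Blog = 36816799, News = 33468818, Twitter = 29354815)
)

# Share of the original words that remain after processing, per dataset.
retained_pct <- 100 * word_counts["Processed", ] / word_counts["Original", ]
round(retained_pct, 1)

# Grouped bars: original vs. processed totals, in millions of words.
barplot(word_counts / 1e6, beside = TRUE, legend.text = TRUE,
        ylab = "Words (millions)",
        main = "Word counts before and after processing")
```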


4.

The next steps in this project are to take the processed HC Corpus data and use it to build a word prediction tool based on a user's real-time input. I plan to use Shiny for the interactive dashboard interface and to follow the development steps outlined below:

    1. Create ngrams in lengths of 2, 3, and 4 words.
    2. Create a sorted frequency table of the ngrams based on the processed corpus.
    3. Create a training and test dataset for model building (machine learning) based on 80/20 split.
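    4. Use the prediction model to output a word based on the last 1, 2, or 3 words of the input string, matched against the 2-, 3-, and 4-word ngrams respectively, falling back from the longest match to the shortest (cascading logic).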
    5. Incorporate the concept in an interactive Shiny Dashboard, and create a final project presentation.
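
The planned steps can be sketched in base R as follows. This is only an illustration under stated assumptions, not the final implementation: the function names (split_train_test, make_ngrams, ngram_freq, predict_next) are hypothetical, the toy sentences stand in for the processed corpus, and the sketch assumes that an ngram of length n is matched on its first n - 1 words so that its last word becomes the prediction.

```r
# 80/20 split of records into training and test sets (hypothetical helper).
split_train_test <- function(lines, train_share = 0.8) {
  idx <- sample(length(lines), size = floor(train_share * length(lines)))
  list(train = lines[idx], test = lines[-idx])
}

# All ngrams of length n in one cleaned, space-separated line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Named integer vector of ngram counts, most frequent first.
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  if (is.null(grams)) grams <- character(0)
  counts <- table(grams)
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

# Cascading lookup: try the longest ngrams first, then fall back to shorter ones.
predict_next <- function(input, freq_tables) {
  words <- strsplit(trimws(tolower(input)), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  for (n in sort(as.integer(names(freq_tables)), decreasing = TRUE)) {
    k <- n - 1                                  # context words for this length
    freq <- freq_tables[[as.character(n)]]
    if (length(words) < k || length(freq) == 0) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- freq[startsWith(names(freq), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]      # most frequent matching ngram
      return(sub(".* ", "", best))              # keep only its last word
    }
  }
  NA_character_                                 # nothing matched at any length
}

# Toy data standing in for the processed corpus.
toy_lines <- c("the cat sat on the mat", "the cat sat on the sofa", "the cat ran")
freq_tables <- lapply(c("2" = 2, "3" = 3, "4" = 4),
                      function(n) ngram_freq(toy_lines, n))
predict_next("I saw the cat", freq_tables)
# "sat"
```

In this toy example no 4-word ngram starts with "saw the cat", so the lookup falls back to the 3-word ngrams, where "the cat sat" is more frequent than "the cat ran". On the real corpus, the frequency tables would be built from the training portion returned by split_train_test and the predictions checked against the test portion.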