Executive Summary

This report presents an initial exploratory analysis for the Data Science Capstone project.
English text is analyzed from three main sources: blogs, news articles, and Twitter feeds.

Analyzing All Files: Basic Summaries

The following is a basic summary of the three English files that we will be using for the analysis:

##         Num.of.Lines Size.in.mb
## Blog          899288   200.4242
## Twitter      2360148   159.3641
## News           77259   196.2775
##            Min. 1st Qu. Median      Mean 3rd Qu.  Max.
## blog_chars    1      47    157 231.69601     331 40835
## twit_chars    2      37     64  68.80281     100   213
## news_chars    2     111    186 203.00243     270  5760
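
The chunk that produced these summaries is not echoed in the report; the following is a minimal sketch of how they might be reproduced. The file paths are assumptions about the repository layout, not taken from the report.

# Read each file; paths below are illustrative
files <- c(Blog    = "final/en_US/en_US.blogs.txt",
           Twitter = "final/en_US/en_US.twitter.txt",
           News    = "final/en_US/en_US.news.txt")

blog_lines <- readLines(files["Blog"],    encoding = "UTF-8", skipNul = TRUE)
twit_lines <- readLines(files["Twitter"], encoding = "UTF-8", skipNul = TRUE)
news_lines <- readLines(files["News"],    encoding = "UTF-8", skipNul = TRUE)

# Line counts and file sizes in MB (first table above)
data.frame(Num.of.Lines = c(length(blog_lines), length(twit_lines), length(news_lines)),
           Size.in.mb   = file.size(files) / 1024^2,
           row.names    = names(files))

# Per-line character counts, summarised as in the second table above
blog_chars <- nchar(blog_lines)
twit_chars <- nchar(twit_lines)
news_chars <- nchar(news_lines)
rbind(blog_chars = summary(blog_chars),
      twit_chars = summary(twit_chars),
      news_chars = summary(news_chars))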

Term Frequency Analysis: Plots, Wordclouds, Word Coverage

For the purposes of our analysis, a small subset of the complete files (blog, Twitter, and news) will be used to develop term frequency plots, wordclouds, and word coverage, due to hardware/computational limitations. The subset was randomly sampled within the ‘Getting and Cleaning the Data.R’ script in this repository.
We will use the qdap package to develop term frequency data from our subset. As our text data is likely to contain many stop words, we will analyze three different sets of our data (a sketch of how these sets might be built follows the list):
- freq_allwords: All words, no stopwords removed
- freq_TOP100stopwords: The top 100 stopwords (from the ‘qdap’ package) removed
- freq_TMstopwords: All stopwords (from the ‘tm’ package) removed
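
The chunk that builds these sets is not echoed; the following is a minimal sketch assuming the line vectors from the summary sketch above. The 5% sampling fraction and the intermediate names are illustrative.

library(qdap)  # freq_terms() and the Top100Words dictionary
library(tm)    # stopwords("english")

# Pool the three sources and draw a random subset (fraction is illustrative)
all_lines   <- c(blog_lines, twit_lines, news_lines)
set.seed(1234)
sample_text <- sample(all_lines, round(length(all_lines) * 0.05))

# 'top' is set high enough to retain every unique term
freq_allwords        <- freq_terms(sample_text, top = 100000, at.least = 1)
freq_TOP100stopwords <- freq_terms(sample_text, top = 100000, at.least = 1,
                                   stopwords = qdapDictionaries::Top100Words)
freq_TMstopwords     <- freq_terms(sample_text, top = 100000, at.least = 1,
                                   stopwords = tm::stopwords("english"))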


Term Frequency: All Words

Term Frequency: Top 100 Stop Words Removed

Term Frequency: All ‘tm’ Stop Words Removed
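
The frequency plots under the three headings above are generated from freq_terms objects; a minimal sketch of how each might be drawn using qdap's plot method (re-running freq_terms() with a small 'top' keeps the bar charts readable):

plot(freq_terms(sample_text, top = 20))                                              # all words
plot(freq_terms(sample_text, top = 20, stopwords = qdapDictionaries::Top100Words))   # top 100 qdap stopwords removed
plot(freq_terms(sample_text, top = 20, stopwords = tm::stopwords("english")))        # all tm stopwords removed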

We will also perform a coverage analysis to determine the number of unique terms needed to cover certain percentages of the total corpus length. To do this, we develop a function, ‘wordcoverage’, which takes the terms (and their frequencies) as input, along with the coverage percentage we wish to analyze and the total count of words in the corpus. In this case, with our cleaned data sitting in the ‘freq_allwords’ variable, we simply use the sum of the frequencies in this variable to determine the total word count.

library(ngram)

# Total word occurrences in the subset = sum of the FREQ column of freq_allwords
totalwordcount <- sum(freq_allwords[[2]])

# Returns the number of top-ranked unique terms needed to cover
# 'coverage' (a proportion) of 'totalwords' word occurrences
wordcoverage <- function(terms, coverage, totalwords){
      
      # Stop early if the terms provided cannot reach the requested coverage
      if(sum(terms[[2]]) < totalwords*coverage){
            stop("The frequencies in the terms provided do not reach the requested coverage of the total word count.")
      }
      
      # Cumulative frequencies, with terms ordered from most to least frequent
      terms[['Cumulative Freqs']] <- cumsum(terms[[2]])
      
      # First rank at which the cumulative frequency exceeds the coverage target
      index <- min(which(terms[['Cumulative Freqs']] > totalwords*coverage))
      
      index
}
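
For example, the number of top-ranked unique terms needed to cover half of all word occurrences in the all-words set is given by:

wordcoverage(freq_allwords, 0.50, totalwordcount)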

coverage_df <- data.frame(
      'Total Instances' = c(
                  sum(freq_allwords[2]), 
                  sum(freq_TOP100stopwords[2]), 
                  sum(freq_TMstopwords[2])
            ),
      'Total Word Count' = totalwordcount,
      'Coverage of total Word Count' = c(
                  sum(freq_allwords[2])/totalwordcount,
                  sum(freq_TOP100stopwords[2])/totalwordcount,
                  sum(freq_TMstopwords[2])/totalwordcount
            ),
      'Unique Terms' = c(
                  nrow(freq_allwords),
                  nrow(freq_TOP100stopwords),
                  nrow(freq_TMstopwords)
            )
)

rownames(coverage_df) = c("All Words","Top 100 Stopwords Removed", "All tm Stopwords Removed")
coverage_df
##                           Total.Instances Total.Word.Count
## All Words                         1997407          1997407
## Top 100 Stopwords Removed         1582992          1997407
## All tm Stopwords Removed          1076628          1997407
##                           Coverage.of.total.Word.Count Unique.Terms
## All Words                                    1.0000000        48488
## Top 100 Stopwords Removed                    0.7925235        35936
## All tm Stopwords Removed                     0.5390128        35792

From the above, the total number of word occurrences in our subset corpus is 1,997,407. With the top 100 stopwords removed, the remaining terms cover only ~79% of those occurrences (as the top stopwords are, as expected, very high in frequency); the drop in coverage is more extreme when all tm stopwords are removed (third row above).
Now we can use our ‘wordcoverage’ function to determine the number of unique words required from each set to cover X% of the total number of word occurrences (i.e., of 1,997,407):

##                           X0.25 X0.50 X0.75 X0.90
## All Words                    15   149  1616  6680
## Top 100 Stopwords Removed    61   948 11726    NA
## All tm Stopwords Removed    856 12405    NA    NA
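
The NA entries mark targets a set cannot reach at all: with stopwords removed, the remaining terms account for only ~79% or ~54% of all occurrences, so ‘wordcoverage’ stops with an error. A sketch of how the table might have been assembled, wrapping those errors as NA (the chunk itself is not echoed, so names and formatting are illustrative):

coverages <- c(0.25, 0.50, 0.75, 0.90)
term_sets <- list("All Words"                 = freq_allwords,
                  "Top 100 Stopwords Removed" = freq_TOP100stopwords,
                  "All tm Stopwords Removed"  = freq_TMstopwords)

coverage_counts <- t(sapply(term_sets, function(terms) {
      sapply(coverages, function(p) {
            # wordcoverage() stops when a set cannot reach the target,
            # so record NA in that case
            tryCatch(wordcoverage(terms, p, totalwordcount),
                     error = function(e) NA)
      })
}))
colnames(coverage_counts) <- coverages
coverage_counts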

From the above, we see that only 149 unique words are required to cover 50% of the total word occurrences when we do NOT remove any stopwords (first row). However, when we remove all stopwords from the ‘tm’ package, the number needed to cover 50% of word occurrences jumps to 12,405! This highlights just how disproportionately frequent the top stopwords are.
To further explore, we can build wordclouds of the top terms in each set:

Wordcloud: Top 100 Stop Words Removed

Since the top two words are stopwords with disproportionately high frequencies, we simply remove them for this wordcloud to get a better overall picture:
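
The chunk behind that wordcloud is not echoed; a minimal sketch of the idea, assuming it is built from freq_TOP100stopwords (as the heading suggests) and uses the same palette as the cloud below:

library(wordcloud)

# Drop the two highest-frequency terms before plotting
trimmed <- freq_TOP100stopwords[-(1:2), ]
wordcloud(trimmed$WORD, trimmed$FREQ,
          max.words = 50,
          colors = c("turquoise2", "darkgoldenrod1", "tomato"))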

Wordcloud: All ‘tm’ Stop Words Removed

# Wordcloud of the 50 most frequent terms with all 'tm' stopwords removed
# (wordcloud() comes from the 'wordcloud' package loaded above)
wordcloud(freq_TMstopwords$WORD, freq_TMstopwords$FREQ, 
          max.words = 50, 
          colors = c("turquoise2","darkgoldenrod1","tomato"))

Conclusion

Moving forward, we will develop n-gram tokenizations of the terms and will likely keep some stopwords in the final model: since we are developing a predictive text app, we cannot remove all stopwords, as they are clearly an integral part of natural language. In addition, we will further explore tools for working within memory limitations, likely by writing the tokenizations to separate files and reading from them dynamically instead of storing everything in the workspace.
When building the model, we will explore various techniques for predicting text (e.g., back-off models), and we will aim to measure accuracy in the usual machine-learning fashion (i.e., feeding our model held-out ‘test’ fragments of n-gram-sized sentences and checking whether it predicts the true upcoming words).
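
As a pointer toward that next step, a minimal sketch of the kind of n-gram tokenization planned, using the ngram package loaded earlier (with sample_text as assumed above):

library(ngram)

# Bigram and trigram tokenizations of the sampled text;
# get.phrasetable() returns each n-gram with its frequency and proportion
bigrams  <- ngram(concatenate(sample_text), n = 2)
trigrams <- ngram(concatenate(sample_text), n = 3)

head(get.phrasetable(bigrams))
head(get.phrasetable(trigrams))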