The following report is an initial exploratory analysis for the Data Science Capstone project.
English text is analyzed from three main sources: blogs, news, and Twitter feeds.
The following is a basic summary of the three English files that we will be using for the analysis:
##         Num.of.Lines Size.in.mb
## Blog          899288   200.4242
## Twitter      2360148   159.3641
## News           77259   196.2775
##            Min. 1st Qu. Median      Mean 3rd Qu.  Max.
## blog_chars    1      47    157 231.69601     331 40835
## twit_chars    2      37     64  68.80281     100   213
## news_chars    2     111    186 203.00243     270  5760
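A minimal sketch of how such a summary could be computed is shown below; the file paths assume the standard en_US file layout for this project and are assumptions, not taken from the report:

blog_file    <- "final/en_US/en_US.blogs.txt"     # assumed paths
twitter_file <- "final/en_US/en_US.twitter.txt"
news_file    <- "final/en_US/en_US.news.txt"

blog    <- readLines(blog_file,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_file, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_file,    encoding = "UTF-8", skipNul = TRUE)

# Line counts and file sizes (in MB)
data.frame(
  Num.of.Lines = c(length(blog), length(twitter), length(news)),
  Size.in.mb   = c(file.size(blog_file), file.size(twitter_file), file.size(news_file)) / 1024^2,
  row.names    = c("Blog", "Twitter", "News")
)

# Per-line character counts (repeat for twitter and news)
summary(nchar(blog))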
For the purposes of our analysis, a small subset of the complete Blog, Twitter, and News files will be used to develop term frequency plots, wordclouds, and word coverage figures, due to hardware/computational limitations. The subset was randomly sampled from within the ‘Getting and Cleaning the Data.R’ script in this repository.
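The sampling itself lives in that script; the following is only a minimal sketch of the idea, where the seed and sampling fraction are illustrative assumptions rather than the script’s actual values:

set.seed(1234)       # illustrative seed for reproducibility
sample_frac <- 0.02  # illustrative sampling fraction
sample_text <- c(
  sample(blog,    round(length(blog)    * sample_frac)),
  sample(twitter, round(length(twitter) * sample_frac)),
  sample(news,    round(length(news)    * sample_frac))
)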
We will use the qdap package to develop term frequency data from our subset. As our text data is likely to contain many stop words, we will analyze three different sets of our data (a sketch of how these could be built follows the list):
- freq_allwords: All words, no stopwords removed
- freq_TOP100stopwords: The top 100 stopwords (from within the ‘qdap’ package) removed
- freq_TMstopwords: All stopwords (from the ‘tm’ package) removed
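A minimal sketch of how these three tables might be built with qdap::freq_terms(), assuming the sampled lines are held in the character vector sample_text from above (the top argument just needs to exceed the vocabulary size so every unique term is kept):

library(qdap)  # freq_terms(); also attaches qdapDictionaries, which holds Top100Words
library(tm)    # stopwords("english")

freq_allwords        <- freq_terms(sample_text, top = 100000)
freq_TOP100stopwords <- freq_terms(sample_text, top = 100000,
                                   stopwords = qdapDictionaries::Top100Words)
freq_TMstopwords     <- freq_terms(sample_text, top = 100000,
                                   stopwords = tm::stopwords("english"))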
We will also perform a coverage analysis to determine how many unique terms are needed to cover given percentages of the total corpus length. To do this, we develop a function, ‘wordcoverage’, which takes the terms (and their frequencies) as input, along with the coverage percentage we wish to analyze and the total word count of the corpus. In this case, with our cleaned data sitting in the ‘freq_allwords’ variable, we simply use the sum of the frequencies in that variable as the total word count.
library(ngram)

# Total number of word occurrences in the sampled corpus (sum of all term frequencies)
totalwordcount <- sum(freq_allwords$FREQ)

# Returns the number of top-frequency terms required to cover the proportion
# 'coverage' of 'totalwords' word occurrences.
wordcoverage <- function(terms, coverage, totalwords){
  if(sum(terms$FREQ) < totalwords * coverage){
    stop("The frequencies in the terms provided do not reach the requested coverage of the total word count.")
  }
  terms$CUMFREQ <- cumsum(terms$FREQ)
  # Index of the first term at which the cumulative frequency exceeds the threshold
  min(which(terms$CUMFREQ > totalwords * coverage))
}
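For example, applying the function to the full word set at a 50% coverage level returns the value reported in the coverage table further below:

wordcoverage(freq_allwords, 0.50, totalwordcount)   # 149 unique words needed, per the table below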
coverage_df <- data.frame(
  'Total Instances' = c(
    sum(freq_allwords$FREQ),
    sum(freq_TOP100stopwords$FREQ),
    sum(freq_TMstopwords$FREQ)
  ),
  'Total Word Count' = totalwordcount,
  'Coverage of total Word Count' = c(
    sum(freq_allwords$FREQ) / totalwordcount,
    sum(freq_TOP100stopwords$FREQ) / totalwordcount,
    sum(freq_TMstopwords$FREQ) / totalwordcount
  ),
  'Unique Terms' = c(
    nrow(freq_allwords),
    nrow(freq_TOP100stopwords),
    nrow(freq_TMstopwords)
  )
)
rownames(coverage_df) = c("All Words","Top 100 Stopwords Removed", "All tm Stopwords Removed")
coverage_df
##                           Total.Instances Total.Word.Count
## All Words                         1997407          1997407
## Top 100 Stopwords Removed         1582992          1997407
## All tm Stopwords Removed          1076628          1997407
##                           Coverage.of.total.Word.Count Unique.Terms
## All Words                                    1.0000000        48488
## Top 100 Stopwords Removed                    0.7925235        35936
## All tm Stopwords Removed                     0.5390128        35792
From the above, the total number of word occurrences in our subset corpus is 1,997,407. With the top 100 stopwords removed, the remaining terms cover only ~79% of that total (the top stopwords are, as expected, very high in frequency); the drop in coverage is even more pronounced when all tm stopwords are removed (third row above).
Now we can use our ‘wordcoverage’ function to determine the number of unique words we require from each set to cover ‘X’% of the total number of word occurrences (i.e., of 1,997,407):
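One way the table below could be produced (the row and column labels mirror the printed output; an NA is recorded where a set’s frequencies cannot reach the requested share of the full word count):

coverages <- c(0.25, 0.50, 0.75, 0.90)
term_sets <- list(freq_allwords, freq_TOP100stopwords, freq_TMstopwords)

# Count the unique terms required for each set at each coverage level
coverage_counts <- t(sapply(term_sets, function(terms) {
  sapply(coverages, function(p) {
    tryCatch(wordcoverage(terms, p, totalwordcount), error = function(e) NA)
  })
}))
coverage_counts <- as.data.frame(coverage_counts)
colnames(coverage_counts) <- paste0("X", format(coverages, nsmall = 2))
rownames(coverage_counts) <- c("All Words", "Top 100 Stopwords Removed",
                               "All tm Stopwords Removed")
coverage_counts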
##                           X0.25 X0.50 X0.75 X0.90
## All Words                    15   149  1616  6680
## Top 100 Stopwords Removed    61   948 11726    NA
## All tm Stopwords Removed    856 12405    NA    NA
From the table above, we see that only 149 unique words are required to cover 50% of the total word occurrences when we do NOT remove any stopwords (first row). However, when all stopwords from the ‘tm’ package are removed, the number of words needed to cover 50% of occurrences jumps to 12,405. This shows just how much of the corpus is accounted for by the most frequent stopwords. The NA entries arise because, once stopwords are removed, the remaining terms no longer add up to the higher coverage thresholds of the original word count (~79% of the total remains after removing the top 100 stopwords, and only ~54% after removing all tm stopwords), so those coverage levels cannot be reached.
To further explore, we can build wordclouds of the top terms in each set:
Since the top 2 words are stopwords that have such disproportionately high frequencies, we can simply remove them for this wordcloud to get a better ‘overall’ picture:
library(wordcloud)
wordcloud(freq_TMstopwords$WORD, freq_TMstopwords$FREQ,
          max.words = 50,
          colors = c("turquoise2", "darkgoldenrod1", "tomato"))
Moving forward, we will develop n-gram tokenizations of the terms and will likely keep some stopwords in the final model: since we are developing a predictive text app, we cannot remove all stopwords, as they are clearly an integral part of natural language. In addition, we will further explore ways of minimizing memory pressure, likely by writing the tokenizations to separate files and reading from them dynamically instead of storing everything in the workspace.
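As a rough sketch of that direction, assuming the sampled corpus is in sample_text and using the ‘ngram’ package loaded earlier (the output file name is illustrative):

library(ngram)

# Build a bigram frequency table from the sampled text
bigrams      <- ngram(concatenate(sample_text), n = 2)
bigram_freqs <- get.phrasetable(bigrams)
head(bigram_freqs)

# Write the table to disk so it can be read back on demand rather than held in memory
saveRDS(bigram_freqs, "bigram_freqs.rds")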
When building the model, we will explore various techniques for predicting text (e.g., back-off models), and we will aim to measure accuracy in standard machine learning fashion (i.e., inputting held-out ‘test’ fragments of n-gram-sized sentences and having our model predict the true upcoming words).
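Purely as an illustration of the back-off idea (not the final model), the sketch below assumes trigram and bigram frequency tables in the format returned by ngram::get.phrasetable, with ‘ngrams’ and ‘freq’ columns:

# Predict the next word for a two-word context, backing off from trigrams to bigrams
predict_next <- function(context, trigram_freqs, bigram_freqs) {
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 2)

  # Look for trigrams whose first two words match the context
  pattern <- paste0("^", paste(words, collapse = " "), " ")
  hits <- trigram_freqs[grepl(pattern, trigram_freqs$ngrams), ]

  # Back off to bigrams keyed on the last context word if nothing matches
  if (nrow(hits) == 0) {
    pattern <- paste0("^", tail(words, 1), " ")
    hits <- bigram_freqs[grepl(pattern, bigram_freqs$ngrams), ]
  }
  if (nrow(hits) == 0) return(NA_character_)

  # Return the final word of the most frequent matching n-gram
  best <- hits$ngrams[which.max(hits$freq)]
  tail(strsplit(trimws(best), "\\s+")[[1]], 1)
}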