Load Libraries and Set WD

library(pillar)
## Warning: package 'pillar' was built under R version 4.0.4
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:pillar':
## 
##     dim_desc
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
library(tm)
## Warning: package 'tm' was built under R version 4.0.4
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 4.0.3
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.4
## Loading required package: RColorBrewer
library(stringi)
## Warning: package 'stringi' was built under R version 4.0.3
library(ngram)
## Warning: package 'ngram' was built under R version 4.0.3
library(xfun)
## Warning: package 'xfun' was built under R version 4.0.4
## 
## Attaching package: 'xfun'
## The following objects are masked from 'package:base':
## 
##     attr, isFALSE
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 4.0.3
library(textmineR)
## Warning: package 'textmineR' was built under R version 4.0.4
## Loading required package: Matrix
## 
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
## 
##     update
## The following object is masked from 'package:stats':
## 
##     update
library(RWeka)
## Warning: package 'RWeka' was built under R version 4.0.4
setwd("C:/Users/themo/OneDrive/Data Science JHU/Capstone")

Download and Unzip Data

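url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
# Download only if the archive is not already present; "wb" keeps the zip intact on Windows.
if (!file.exists(zip_file)) download.file(url, destfile = zip_file, mode = "wb")
unzip(zip_file)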

Read in each Dataset

blog_path <- file.path("./final/en_US/en_US.blogs.txt")
news_path <- file.path("./final/en_US/en_US.news.txt")
twitter_path <- file.path("./final/en_US/en_US.twitter.txt")

blog_data <- readLines(blog_path, encoding="UTF-8", skipNul=TRUE)
news_data <- readLines(news_path, encoding="UTF-8", skipNul=TRUE)
## Warning in readLines(news_path, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on './final/en_US/en_US.news.txt'
twitter_data <- readLines(twitter_path, encoding="UTF-8", skipNul=TRUE)

Exploratory Data Analysis and Stats

Length of each dataset

length(blog_data)
## [1] 899288
length(news_data)
## [1] 77259
length(twitter_data)
## [1] 2360148

Max character string length

max(nchar(blog_data))
## [1] 40833
max(nchar(news_data))
## [1] 5760
max(nchar(twitter_data))
## [1] 140

Word count per dataset

wordcount(blog_data, sep = " ", count.function = sum)
## [1] 37334131
wordcount(news_data, sep = " ", count.function = sum)
## [1] 2643969
wordcount(twitter_data, sep = " ", count.function = sum)
## [1] 30373583
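
The "incomplete final line" warning for en_US.news.txt, together with that file's comparatively low line count, may indicate that the read stopped early. One commonly reported cause on Windows is a control character that text-mode reading treats as end of file; whether that applies here is an assumption, not something the output above confirms. A minimal sketch of reading through a binary-mode connection instead (read_lines_binary is a hypothetical helper name, and the temporary file is only for illustration):

```r
# Hypothetical helper: read a text file through a binary-mode connection,
# so that no byte in the file is interpreted as an end-of-file marker.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained check: a temporary file whose second line contains
# the SUB control character (0x1A) between "b" and "c".
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x0a, 0x62, 0x1a, 0x63, 0x0a, 0x64, 0x0a)), tmp)
demo_lines <- read_lines_binary(tmp)

# All three lines should be read, including the one after the control character.
print(length(demo_lines))
```

If the line count for the news file changes when read this way, the counts and samples that follow would need to be regenerated.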

Create subsets of each data set.

# Fix an (arbitrary) seed so the same 3,000-line samples are drawn on every run.
set.seed(1234)
blog_data_subset <- sample(blog_data, 3000)
twitter_data_subset <- sample(twitter_data, 3000)
news_data_subset <- sample(news_data, 3000)

# Combine the three samples into one corpus; the reader has to be passed by name.
working_corpus <- VCorpus(VectorSource(c(blog_data_subset, twitter_data_subset, news_data_subset)), readerControl = list(reader = readPlain, language = "en"))

Removing undesired terms

working_corpus <- tm_map(working_corpus, stripWhitespace)
working_corpus <- tm_map(working_corpus, removePunctuation)
working_corpus <- tm_map(working_corpus, removeNumbers)
# Stop-word matching is case-sensitive and runs before lowercasing here,
# so capitalized stop words such as "The" are not removed at this step.
working_corpus <- tm_map(working_corpus, removeWords, stopwords("english"))
# content_transformer() keeps each document a PlainTextDocument after tolower().
working_corpus <- tm_map(working_corpus, content_transformer(tolower))
working_corpus <- tm_map(working_corpus, stemDocument)
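
The order of the cleaning steps affects the result. The sketch below uses base R only to show why lowercasing before stop-word removal removes more stop words. drop_words is a hypothetical stand-in for removeWords, written on the assumption that it matches whole words case-sensitively; the three-word stop list and the sample sentence are made up for illustration.

```r
# Hypothetical stand-in for removeWords: case-sensitive whole-word removal.
drop_words <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

# Collapse runs of whitespace and trim the ends.
squash <- function(x) trimws(gsub("\\s+", " ", x))

stop_words <- c("the", "is", "on")   # illustrative stop list
text <- "The cat is on the mat"      # illustrative sentence

# Order used in the pipeline above: remove stop words, then lowercase.
remove_then_lower <- squash(tolower(drop_words(text, stop_words)))

# Alternative order: lowercase first, then remove stop words.
lower_then_remove <- squash(drop_words(tolower(text), stop_words))

print(remove_then_lower)   # the sentence-initial "The" survives as "the"
print(lower_then_remove)   # every stop word is removed
```

In the tm pipeline this would correspond to moving the tolower step ahead of removeWords. The removePunctuation documentation in tm also describes a ucp argument for Unicode punctuation, which may help with curly quotes and dashes; its effect on this corpus has not been tested here.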

Remove Sparse Terms

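# dtm is not defined earlier in this report; it is assumed here to be the
# document-term matrix of working_corpus (it is reused in the association step).
dtm <- DocumentTermMatrix(working_corpus)
inspect(removeSparseTerms(dtm, 0.4))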

I did not display the output for this code because it was very long.

This function is described on Stack Overflow as follows: in the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), sparsity is smaller as it approaches 1.0. (Note that sparsity cannot take values of 0 or 1.0, only values in between.)

For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99. The exact interpretation for sparse = 0.99 is that for term \(j\), you will retain all terms for which \(df_j > N * (1 - 0.99)\), where \(N\) is the number of documents. In this case, probably all terms will be retained.

https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work
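
To make the threshold concrete, here is a small base-R sketch of the rule \(df_j > N * (1 - sparse)\) quoted above. It does not call tm. keep_terms is a hypothetical helper, and the 5-by-3 count matrix is made up for illustration.

```r
# Toy document-term counts: 5 documents (rows) by 3 terms (columns).
counts <- matrix(
  c(1, 1, 0,
    2, 0, 0,
    1, 0, 0,
    1, 1, 0,
    3, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("doc", 1:5), c("common", "medium", "rare"))
)

# Hypothetical helper: names of the terms whose document frequency
# exceeds N * (1 - sparse), following the rule quoted above.
keep_terms <- function(counts, sparse) {
  n_docs <- nrow(counts)
  doc_freq <- colSums(counts > 0)   # number of documents containing each term
  names(doc_freq)[doc_freq > n_docs * (1 - sparse)]
}

print(keep_terms(counts, 0.99))  # threshold 0.05 documents: every term is kept
print(keep_terms(counts, 0.4))   # threshold 3 documents: only "common" is kept
```

With sparse = 0.4, as used in the call above, a term has to appear in more than 60% of documents to be retained, which is a very strict filter for short text samples.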

Word Frequencies (by Ngram)

NGrams

# Unigram counts: sum each term's row of the term-document matrix.
one_word_gm <- as.data.frame(as.matrix(TermDocumentMatrix(working_corpus)))
one_word_gmv <- sort(rowSums(one_word_gm), decreasing = TRUE)
one_word_gmd <- data.frame(word = names(one_word_gmv), freq = one_word_gmv)
one_word_gmd[1:15, ]
##      word freq
## the   the 1430
## said said  903
## will will  843
## one   one  812
## like like  719
## get   get  714
## just just  687
## time time  622
## can   can  602
## year year  582
## make make  511
## day   day  498
## new   new  478
## know know  445
## now   now  442
ggplot(one_word_gmd[1:15,], aes(x=reorder(word, freq),y=freq)) + 
        geom_bar(stat="identity", width=0.5, fill="green") + 
        labs(title="One_Word_gms")+
        xlab("Unigrams") + ylab("Frequency") + 
        theme(axis.text.x=element_text(angle=65, vjust=0.6))

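# Bigram tokenizer (RWeka) and the resulting term-document matrix.
two_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
two_word_gmDTM <- TermDocumentMatrix(working_corpus, control = list(tokenize = two_word_gm))
# First 20 bigrams in the matrix's term order, not ranked by frequency.
two_word_freqTerm_tidy <- two_word_gmDTM
two_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]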
##  [1] "'s 's"       "'s clean"    "'s live"     "'s may"      "'s mopac"   
##  [6] "'s year"     "– ’re"       "– ’s"        "– “turtles”" "– a"        
## [11] "– airline’"  "– almost"    "– animos"    "– anoth"     "– anthoni"  
## [16] "– anyth"     "– aren’t"    "– audio"     "– austen"    "– barrel"
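# Trigram tokenizer and document-term matrix.
three_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
three_word_gmDTM <- DocumentTermMatrix(working_corpus, control = list(tokenize = three_word_gm))
# Trigrams that occur at least 15 times (stored but not printed here).
three_word_freqTerm <- findFreqTerms(three_word_gmDTM, lowfreq = 15)
# First 20 trigrams in the matrix's term order, not ranked by frequency.
three_word_freqTerm_tidy <- three_word_gmDTM
three_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]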
##  [1] "'s 's year"        "'s clean interfac" "'s live still"    
##  [4] "'s may even"       "'s mopac locomot"  "'s year simpl"    
##  [7] "– ’re often"       "– ’s tough"        "– “turtles” way"  
## [10] "– a busi"          "– airline’ custom" "– almost half"    
## [13] "– animos never"    "– anoth daynbsp"   "– anthoni barte"  
## [16] "– anyth –"         "– aren’t subject"  "– audio frequenc" 
## [19] "– austen wife"     "– barrel barrel"
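# Four-gram tokenizer and document-term matrix.
four_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
four_word_gmDTM <- DocumentTermMatrix(working_corpus, control = list(tokenize = four_word_gm))
# Four-grams that occur at least 15 times (stored but not printed here).
four_word_freqTerm <- findFreqTerms(four_word_gmDTM, lowfreq = 15)
# First 20 four-grams in the matrix's term order, not ranked by frequency.
four_word_freqTerm_tidy <- four_word_gmDTM
four_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]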
##  [1] "'s 's year simpl"         "'s clean interfac relat" 
##  [3] "'s live still touch"      "'s may even 's"          
##  [5] "'s mopac locomot renumb"  "'s year simpl broken"    
##  [7] "– ’re often context"      "– ’s tough world"        
##  [9] "– a busi broken"          "– airline’ custom servic"
## [11] "– almost half space"      "– animos never reach"    
## [13] "– anoth daynbsp everyon"  "– anthoni barte slate"   
## [15] "– anyth – need"           "– aren’t subject matter" 
## [17] "– audio frequenc amplifi" "– austen wife whose"     
## [19] "– barrel barrel delici"   "– big enough allow"
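
The term lists above are in term order rather than ranked by count. As a rough illustration of how n-grams are formed and counted, here is a base-R sketch that does not use RWeka or tm. make_ngrams is a hypothetical helper and the token sequence is made up.

```r
# Hypothetical helper: build all n-grams from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Illustrative token sequence.
tokens <- strsplit("one two one two three", " ", fixed = TRUE)[[1]]

# Build bigrams and rank them by how often they occur.
bigrams <- make_ngrams(tokens, 2)
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

print(bigram_freq)
```

The same ranking idea applies to the matrices built above: summing each term's counts across documents and sorting in decreasing order would give a frequency table comparable to the unigram one.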

Associations with DocumentTermMatrix

The math of findAssocs() is based on the standard function cor() in the stats package of R. Given two numeric vectors, cor() computes their covariance divided by the product of their standard deviations.

So given a DocumentTermMatrix dtm containing terms "word1" and "word2" such that findAssocs(dtm, "word1", 0) returns "word2" with a value of x, the correlation of the term vectors for "word1" and "word2" is x (per Stack Overflow: https://stackoverflow.com/questions/14267199/math-of-tmfindassocs-how-does-this-function-work).
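
As a small check of that description, the sketch below computes the correlation between two term vectors by hand and compares it with cor(). The count matrix is made up for illustration and is not drawn from the corpus; findAssocs() itself is not called.

```r
# Toy document-term counts: rows are documents, columns are terms.
toy <- cbind(
  deficit = c(2, 0, 1, 0, 3),
  budget  = c(1, 0, 1, 0, 2),
  sprint  = c(0, 1, 0, 2, 0)
)

# Covariance divided by the product of the two standard deviations.
manual <- cov(toy[, "deficit"], toy[, "budget"]) /
  (sd(toy[, "deficit"]) * sd(toy[, "budget"]))

# The same quantity from the built-in function.
builtin <- cor(toy[, "deficit"], toy[, "budget"])

print(c(manual = manual, builtin = builtin))

# Terms that rarely share a document get a negative correlation.
print(cor(toy[, "deficit"], toy[, "sprint"]))
```

A lower correlation limit such as the .08 used below only filters which of these pairwise correlations are reported.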

Associations for two random words

findAssocs(dtm, "deficit", .08)
## $deficit
##  “needless      acton  customari    inexcus    levels”   million”    outlast 
##       0.38       0.38       0.38       0.38       0.38       0.38       0.38 
## secondrank     twoset      curri    circumv        cor   freehold      galen 
##       0.38       0.38       0.31       0.27       0.27       0.27       0.27 
##    surplus        “in     trojan        usc commission    county’    semifin 
##       0.27       0.22       0.22       0.22       0.20       0.19       0.19 
##     sprint      minus      irvin    proport        tax      salem       hike 
##       0.19       0.17       0.15       0.15       0.15       0.14       0.13 
##     joseph   reminisc        gap      pacif    pleasur      ralli      addit 
##       0.13       0.13       0.11       0.11       0.11       0.11       0.10 
##   meanwhil    million       debt     histor  necessari     reserv tournament 
##       0.10       0.10       0.09       0.09       0.09       0.09       0.09 
##     budget     counti       over 
##       0.08       0.08       0.08
findAssocs(dtm, "history", .08)
## $history
## numeric(0)

The empty result for "history" is most likely because the corpus was stemmed before the matrix was built, so the term is stored in its stemmed form rather than as the full word.

Upfront, I initially experienced a lot of system issues working with the size of the data, including crashes of both RStudio and my browser. I finally got things to run smoothly after many restarts and killing processes. I also had to do a thorough clean-up of Chrome. Additionally, I noticed that some portions of my code were duplicative and/or out of sequence.

Interesting findings/observations: The size of the data certainly provided a challenge. I had to keep circling back through my code, adjusting my sample sizes down until it would run.

The removal of terms, punctuation, spaces, etc. has not yet produced the result I expected. I will need to circle back with additional code to make the corpus cleaner.

Plans for creating a prediction algorithm and Shiny app:

My next step is to get the data even cleaner in preparation for the predictive modeling component of the project. Honestly, I am not yet sure how I am going to approach developing the model.