library(pillar)
## Warning: package 'pillar' was built under R version 4.0.4
library(magrittr)
## Warning: package 'magrittr' was built under R version 4.0.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:pillar':
##
## dim_desc
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.0.4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.4
library(tm)
## Warning: package 'tm' was built under R version 4.0.4
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 4.0.3
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.0.4
## Loading required package: RColorBrewer
library(stringi)
## Warning: package 'stringi' was built under R version 4.0.3
library(ngram)
## Warning: package 'ngram' was built under R version 4.0.3
library(xfun)
## Warning: package 'xfun' was built under R version 4.0.4
##
## Attaching package: 'xfun'
## The following objects are masked from 'package:base':
##
## attr, isFALSE
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 4.0.3
library(textmineR)
## Warning: package 'textmineR' was built under R version 4.0.4
## Loading required package: Matrix
##
## Attaching package: 'textmineR'
## The following object is masked from 'package:Matrix':
##
## update
## The following object is masked from 'package:stats':
##
## update
library(RWeka)
## Warning: package 'RWeka' was built under R version 4.0.4
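setwd("C:/Users/themo/OneDrive/Data Science JHU/Capstone")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download and unzip only if the archive is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}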
blog_path <- file.path("./final/en_US/en_US.blogs.txt")
news_path <- file.path("./final/en_US/en_US.news.txt")
twitter_path <- file.path("./final/en_US/en_US.twitter.txt")
blog_data <- readLines(blog_path, encoding="UTF-8", skipNul=TRUE)
twitter_data <- readLines(twitter_path, encoding="UTF-8", skipNul=TRUE)
news_data <- readLines(news_path, encoding="UTF-8", skipNul=TRUE)
## Warning in readLines(news_path, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on './final/en_US/en_US.news.txt'
# Number of lines per dataset
length(blog_data)
## [1] 899288
length(twitter_data)
## [1] 2360148
length(news_data)
## [1] 77259
# Length of the longest line per dataset
max(nchar(blog_data))
## [1] 40833
max(nchar(twitter_data))
## [1] 140
max(nchar(news_data))
## [1] 5760
# Word count per dataset
wordcount(blog_data, sep = " ", count.function = sum)
## [1] 37334131
wordcount(twitter_data, sep = " ", count.function = sum)
## [1] 30373583
wordcount(news_data, sep = " ", count.function = sum)
## [1] 2643969
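The incomplete-final-line warning for en_US.news.txt, together with a line count of only 77259 for that file, suggests the read stopped early. A commonly reported cause on Windows is a control character (0x1A) inside the file that text-mode reading treats as end-of-file. The sketch below, under that assumption, reads through a binary connection instead. `read_lines_binary` and the demo file are hypothetical names, not part of the original analysis.

```r
# Hypothetical helper: read all lines through a binary-mode connection so
# that an embedded 0x1A byte is not interpreted as end-of-file.
read_lines_binary <- function(path, encoding = "UTF-8") {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = encoding, skipNul = TRUE)
}

# Small demo file with a 0x1A byte in the middle of the second line
demo_path <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1a),
    charToRaw(" half\nthird line\n")),
  demo_path
)

demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

If this is indeed the cause, calling the helper on `news_path` should return noticeably more lines than the count shown above.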
# Draw 3000 random lines from each source
blog_data_subset <- sample(blog_data, 3000)
twitter_data_subset <- sample(twitter_data, 3000)
news_data_subset <- sample(news_data, 3000)
working_corpus <- VCorpus(VectorSource(c(blog_data_subset, twitter_data_subset, news_data_subset)), readerControl = list(reader = readPlain, language = "en"))
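The subsets above are random draws, so the frequencies reported later will change from run to run unless a seed is fixed. A minimal sketch of a reproducible draw follows. The helper name `draw_subset`, the toy data, and the seed value 2021 are assumptions for illustration only.

```r
# Hypothetical helper: draw n lines reproducibly by fixing the seed first.
draw_subset <- function(lines, n, seed) {
  set.seed(seed)
  sample(lines, min(n, length(lines)))
}

toy_lines <- paste("line", 1:100)
first_draw <- draw_subset(toy_lines, 10, seed = 2021)
second_draw <- draw_subset(toy_lines, 10, seed = 2021)

# Same seed, same sample
identical(first_draw, second_draw)
```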
Removing undesired terms
Strip extra whitespace
Remove punctuation
Remove numbers
Remove stopwords (e.g. “and”, “or”, “not”, “is”)
Convert to lowercase
Convert to plain-text documents
Stem words
# Cleaning steps, applied in this order
working_corpus <- tm_map(working_corpus, stripWhitespace)
working_corpus <- tm_map(working_corpus, removePunctuation)
working_corpus <- tm_map(working_corpus, removeNumbers)
# Stopword matching is case-sensitive and runs before lowercasing here,
# so capitalised stopwords such as "The" survive this step
working_corpus <- tm_map(working_corpus, removeWords, stopwords("english"))
working_corpus <- tm_map(working_corpus, content_transformer(tolower))
working_corpus <- tm_map(working_corpus, PlainTextDocument)
working_corpus <- tm_map(working_corpus, stemDocument)
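The order of the cleaning steps matters because stopword matching is case-sensitive. The base R sketch below imitates the idea with a regular expression rather than calling tm itself. The helper names and the three-word stopword list are hypothetical.

```r
# Hypothetical mini stopword list (all lowercase, like stopwords("english"))
stop_words <- c("the", "and", "is")

# Remove whole-word, case-sensitive matches of the given words
remove_words <- function(text, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", text, perl = TRUE)
}

# Collapse runs of whitespace and trim the ends
squish <- function(text) trimws(gsub("\\s+", " ", text))

raw_text <- "The cat and the dog"

# Order used in the pipeline above: stopwords first, lowercase second
stop_then_lower <- squish(tolower(remove_words(raw_text, stop_words)))

# Alternative order: lowercase first, stopwords second
lower_then_stop <- squish(remove_words(tolower(raw_text), stop_words))

stop_then_lower
lower_then_stop
```

In the first ordering the capitalised "The" is not matched and only becomes "the" afterwards, which is one plausible reason a stopword can still appear among the most frequent terms.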
I did not display the output for this code because it was very long. The step relies on removeSparseTerms().
This function is described on Stack Overflow in this way: In the sense of the sparse argument to removeSparseTerms(), sparsity refers to the threshold of relative document frequency for a term, above which the term will be removed. Relative document frequency here means a proportion. As the help page for the command states (although not very clearly), sparsity is smaller as it approaches 1.0. (Note that sparsity cannot take values of 0 or 1.0, only values in between.)
For example, if you set sparse = 0.99 as the argument to removeSparseTerms(), then this will remove only terms that are more sparse than 0.99. The exact interpretation for sparse = 0.99 is that for term \(j\), you will retain all terms for which \(df_j > N * (1 - 0.99)\), where \(N\) is the number of documents. In this case probably all terms will be retained.
https://stackoverflow.com/questions/28763389/how-does-the-removesparseterms-in-r-work
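The retention rule quoted above, \(df_j > N * (1 - sparse)\), can be checked by hand on a toy matrix. The sketch below uses base R only. `toy_dtm` and `keep_terms` are hypothetical names, and the function mirrors the quoted inequality rather than calling removeSparseTerms() itself.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns)
toy_dtm <- matrix(
  c(1, 0, 2,
    1, 0, 0,
    3, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("common", "rare", "medium"))
)

# Keep terms whose document frequency exceeds N * (1 - sparse)
keep_terms <- function(dtm, sparse) {
  doc_freq <- colSums(dtm > 0)          # number of documents containing each term
  threshold <- nrow(dtm) * (1 - sparse)
  names(doc_freq)[doc_freq > threshold]
}

keep_terms(toy_dtm, sparse = 0.6)   # threshold 1.6: "rare" (1 document) is dropped
keep_terms(toy_dtm, sparse = 0.99)  # threshold 0.04: every term is kept
```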
NGrams
one_word_gm <- as.data.frame(as.matrix(TermDocumentMatrix(working_corpus)))
one_word_gmv <- sort(rowSums(one_word_gm), decreasing = TRUE)
one_word_gmd <- data.frame(word = names(one_word_gmv), freq = one_word_gmv)
one_word_gmd[1:15, ]
## word freq
## the the 1430
## said said 903
## will will 843
## one one 812
## like like 719
## get get 714
## just just 687
## time time 622
## can can 602
## year year 582
## make make 511
## day day 498
## new new 478
## know know 445
## now now 442
ggplot(one_word_gmd[1:15, ], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.5, fill = "green") +
  labs(title = "One_Word_gms") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
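Converting the whole term-document matrix with as.matrix() creates a dense matrix, which can be costly in memory on larger samples. One possible alternative, sketched here on a toy corpus, sums the rows of the sparse matrix directly with `slam::row_sums()` (slam is installed as a dependency of tm). The names `toy_corpus`, `toy_tdm`, and `term_totals` are hypothetical.

```r
library(tm)

# Toy corpus standing in for working_corpus
toy_corpus <- VCorpus(VectorSource(c(
  "apple banana apple",
  "banana cherry",
  "apple cherry cherry"
)))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Row sums on the sparse representation, without a dense conversion
term_totals <- sort(slam::row_sums(toy_tdm), decreasing = TRUE)
toy_freq <- data.frame(word = names(term_totals), freq = term_totals)
toy_freq
```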
two_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
two_word_gmDTM <- TermDocumentMatrix(working_corpus, control = list(tokenize = two_word_gm))
two_word_freqTerm_tidy <- two_word_gmDTM
# First 20 terms as stored in the matrix (not ranked by frequency)
two_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]
## [1] "'s 's" "'s clean" "'s live" "'s may" "'s mopac"
## [6] "'s year" "– ’re" "– ’s" "– “turtles”" "– a"
## [11] "– airline’" "– almost" "– animos" "– anoth" "– anthoni"
## [16] "– anyth" "– aren’t" "– audio" "– austen" "– barrel"
three_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
three_word_gmDTM <- DocumentTermMatrix(working_corpus, control = list(tokenize = three_word_gm))
three_word_freqTerm <- findFreqTerms(three_word_gmDTM, lowfreq = 15)
three_word_freqTerm_tidy <- three_word_gmDTM
three_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]
## [1] "'s 's year" "'s clean interfac" "'s live still"
## [4] "'s may even" "'s mopac locomot" "'s year simpl"
## [7] "– ’re often" "– ’s tough" "– “turtles” way"
## [10] "– a busi" "– airline’ custom" "– almost half"
## [13] "– animos never" "– anoth daynbsp" "– anthoni barte"
## [16] "– anyth –" "– aren’t subject" "– audio frequenc"
## [19] "– austen wife" "– barrel barrel"
four_word_gm <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
four_word_gmDTM <- DocumentTermMatrix(working_corpus, control = list(tokenize = four_word_gm))
four_word_freqTerm <- findFreqTerms(four_word_gmDTM, lowfreq = 15)
four_word_freqTerm_tidy <- four_word_gmDTM
four_word_freqTerm_tidy[["dimnames"]][["Terms"]][1:20]
## [1] "'s 's year simpl" "'s clean interfac relat"
## [3] "'s live still touch" "'s may even 's"
## [5] "'s mopac locomot renumb" "'s year simpl broken"
## [7] "– ’re often context" "– ’s tough world"
## [9] "– a busi broken" "– airline’ custom servic"
## [11] "– almost half space" "– animos never reach"
## [13] "– anoth daynbsp everyon" "– anthoni barte slate"
## [15] "– anyth – need" "– aren’t subject matter"
## [17] "– audio frequenc amplifi" "– austen wife whose"
## [19] "– barrel barrel delici" "– big enough allow"
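The n-gram listings above show the first 20 stored terms, not the most frequent ones. As an illustration of ranking n-grams by count, here is a base R sketch that does not use RWeka. `count_ngrams` and the toy sentences are hypothetical, and the whitespace tokenisation is deliberately simplistic.

```r
# Hypothetical helper: count n-grams of size n across a character vector
count_ngrams <- function(texts, n) {
  grams <- unlist(lapply(texts, function(text) {
    tokens <- strsplit(trimws(text), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    # Paste each window of n consecutive tokens into one n-gram
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

toy_texts <- c("i like green tea", "i like black tea", "green tea is nice")
bigram_counts <- count_ngrams(toy_texts, 2)
head(bigram_counts, 3)
```

The same function with `n = 3` or `n = 4` gives trigram and four-gram counts on the toy data.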
The math of findAssocs() is based on the standard function cor() in the stats package of R. Given two numeric vectors, cor() computes their covariance divided by the product of their standard deviations.
So given a DocumentTermMatrix dtm containing terms "word1" and "word2" such that findAssocs(dtm, "word1", 0) returns "word2" with a value of x, the correlation of the term vectors for "word1" and "word2" is x (per Stack Overflow: https://stackoverflow.com/questions/14267199/math-of-tmfindassocs-how-does-this-function-work).
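The relationship between cor(), covariance, and standard deviations can be verified on two small vectors. The counts below are made up for illustration and stand in for the per-document frequencies of two terms.

```r
# Hypothetical per-document counts for two terms across five documents
word1 <- c(2, 0, 1, 3, 0)
word2 <- c(1, 0, 1, 2, 0)

# Covariance divided by the product of the standard deviations
manual_cor <- cov(word1, word2) / (sd(word1) * sd(word2))

# Should agree with the built-in Pearson correlation
all.equal(manual_cor, cor(word1, word2))
```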
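# Associations for two arbitrary words
# Note: dtm is not defined in the code shown; it is presumably the
# DocumentTermMatrix produced in the undisplayed step above
findAssocs(dtm, "deficit", 0.08)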
## $deficit
## “needless acton customari inexcus levels” million” outlast
## 0.38 0.38 0.38 0.38 0.38 0.38 0.38
## secondrank twoset curri circumv cor freehold galen
## 0.38 0.38 0.31 0.27 0.27 0.27 0.27
## surplus “in trojan usc commission county’ semifin
## 0.27 0.22 0.22 0.22 0.20 0.19 0.19
## sprint minus irvin proport tax salem hike
## 0.19 0.17 0.15 0.15 0.15 0.14 0.13
## joseph reminisc gap pacif pleasur ralli addit
## 0.13 0.13 0.11 0.11 0.11 0.11 0.10
## meanwhil million debt histor necessari reserv tournament
## 0.10 0.10 0.09 0.09 0.09 0.09 0.09
## budget counti over
## 0.08 0.08 0.08
findAssocs(dtm, "history", .08)
## $history
## numeric(0)
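The empty result for "history" is probably a consequence of stemming: the corpus was passed through stemDocument, so the unstemmed word is unlikely to exist as a term. A sketch of stemming the query term first, using SnowballC (already loaded above), follows. `query_term` and `stemmed_query` are hypothetical names.

```r
library(SnowballC)

query_term <- "history"

# Stem the query the same way the corpus terms were stemmed
stemmed_query <- wordStem(query_term, language = "english")
stemmed_query

# With the stemmed corpus, the lookup would then be written as:
# findAssocs(dtm, stemmed_query, 0.08)
```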
Upfront: I initially ran into a lot of system issues because of the size of the data, including crashes of both RStudio and my browser. I finally got things to run smoothly after many restarts and killing processes. I also had to do a thorough clean-up of Chrome. Additionally, I noticed that some portions of my code were duplicative and/or out of sequence.
Interesting findings/observations: The size of the data certainly provided a challenge. I had to keep circling back through my code, adjusting my sample sizes down until it would run.
The removal of terms, punctuation, spaces, etc. has not yet produced the result I expected. I will need to circle back with additional coding to make the data cleaner.
Plans for creating a prediction algorithm and Shiny app:
My next step is to get the data even cleaner in preparation for the predictive modeling component of the project. Honestly, I am not yet sure how I am going to approach developing the model.