This report discusses progress so far on the natural language processing project, which involves processing three documents to build a prediction model that guesses the next word after a two- or three-word phrase is given. Model code has been omitted, as recommended by mentor Fiona in the Discussion boards.
The original files are large. The file sizes are: en_US.blogs.txt - 210.2 MB, en_US.news.txt - 205 MB, en_US.twitter.txt - 167 MB.
The number of lines for the blog, news, and twitter files, respectively:
## [1] 899288
## [1] 1010242
## [1] 2360148
The number of words in the blog, news, and twitter files:
## [1] "37334690"
## [1] "34372720"
## [1] "30374206"
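The counting code is not shown; as an illustration only, line and word counts like these can be obtained roughly as follows (the whitespace-based word split is an assumption, not the exact method used above).

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
length(blogs)                                 # number of lines
sum(sapply(strsplit(blogs, "\\s+"), length))  # rough word count via whitespace split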
Since the files are large, it takes a long time to load them and to run the tm functions on them, so I will use a sample of the original files. I tried 20 and 5 percent, but the tm functions still took a while to run against that much data, so for now I am using around 1 percent of each file, sampled randomly using rbinom in a loop. The code also verifies that the character format is UTF-8 and removes NAs before saving the lines to separate sample files. Separate files are created so they can be reloaded as needed while experimenting with model development, without having to recreate the samples.
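As a rough sketch of that sampling step (the exact code differs: the file and directory names here are assumptions, and rbinom is applied as a vector here rather than in a loop):

set.seed(1234)
sampleFile <- function(infile, outfile, prob = 0.01) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1   # keep roughly 1% of lines
  sampled <- iconv(lines[keep], from = "UTF-8", to = "UTF-8") # enforce UTF-8; invalid lines become NA
  sampled <- sampled[!is.na(sampled)]                         # remove NAs before saving
  writeLines(sampled, outfile)
}
sampleFile("en_US.blogs.txt", "sample01p/blogs_sample.txt")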
So far I have been using the tm package in R, as suggested. I may explore other packages as limitations in tm become apparent; there have already been a number of issues with it. The quanteda and text2vec packages were mentioned in the discussion forum, and I plan to evaluate them. I suspect this will become more desirable as we work toward efficient models.
I am going to skip stemming because of the time it takes to perform. I have used the tm functions and some custom grep/gsub functions to clean the data. I was considering adding more of these, but found that the removeSparseTerms function takes care of a lot of the issues with odd words that don't have meaning.
library("tm")
library("ggplot2")
library("dplyr")
cname <- file.path("~", "Documents", "datasciencecoursera", "capstoneProject", "sample01p")
SampDocs <- Corpus(DirSource(cname))

## basic cleaning with tm transformations
CorpSample <- tm_map(SampDocs, content_transformer(tolower))
CorpSample <- tm_map(CorpSample, removePunctuation)
CorpSample <- tm_map(CorpSample, removeNumbers)
CorpSample <- tm_map(CorpSample, removeWords, stopwords("english"))

## custom transformation to drop stray special characters seen in the samples
removeSpecialChars <- function(x) gsub("[⁰•½¼卐]", " ", x)
CorpSample <- tm_map(CorpSample, content_transformer(removeSpecialChars))

## strip extra whitespace at the end of the other cleaning steps
CorpSample <- tm_map(CorpSample, stripWhitespace)
CorpSample <- tm_map(CorpSample, PlainTextDocument)
After creating the corpus and cleaning it, we can see the number of lines in each of the three documents within the corpus. The total content is significantly smaller than the three original documents.
length(SampDocs[[1]]$content)
## [1] 8887
length(SampDocs[[2]]$content)
## [1] 10082
length(SampDocs[[3]]$content)
## [1] 23732
I have built three models to represent uni-, bi-, and tri-grams (code omitted). The rest of this section looks at some of the characteristics of these models.
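Although the model code is omitted, the sketch below shows one common way such n-gram term-document matrices could be built with tm plus the RWeka NGramTokenizer. The tokenizer setup and the sparsity threshold passed to removeSparseTerms are assumptions, not the code behind the numbers that follow.

library("RWeka")
## one-off tokenizers for bi-grams and tri-grams
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm1 <- TermDocumentMatrix(CorpSample)
tdm2 <- TermDocumentMatrix(CorpSample, control = list(tokenize = BigramTokenizer))
tdm3 <- TermDocumentMatrix(CorpSample, control = list(tokenize = TrigramTokenizer))
## drop very sparse terms; the 0.67 threshold is an assumption
tdm1s <- removeSparseTerms(tdm1, 0.67)
tdm2s <- removeSparseTerms(tdm2, 0.67)
tdm3s <- removeSparseTerms(tdm3, 0.67)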
Here is the length of each model, which tells me how many n-grams are represented in each of my three models (uni-, bi-, and tri-gram).
length(rowSums(as.matrix(tdm1s)))
## [1] 9642
length(rowSums(as.matrix(tdm2s)))
## [1] 3755
length(rowSums(as.matrix(tdm3s)))
## [1] 92
This bar chart shows the top single words (uni-grams) in the model.
library("ggplot2")
library("tm")
library("dplyr")
##UNI PLOT
freq1 <- sort(rowSums(as.matrix(tdm1s)), decreasing = TRUE)
freq1DF <- data.frame(freq1)
freq1DF <- data.frame(add_rownames(freq1DF, "word"))
colnames(freq1DF) <- c("word", "frequency")
g <- ggplot(freq1DF[1:20, ], aes(x = reorder(word, +frequency), y = frequency))
g <- g + geom_bar(stat = "identity") + coord_flip() + ggtitle("Top 20 Unigrams")
g + labs(x = "Unigram words", y = "number of occurrences")
I also tried a word cloud of the top words.
#wordcloud
library(wordcloud)
#setting the same seed each time ensures consistent look across clouds
set.seed(2016)
#limit words by specifying min frequency
wordcloud(names(freq1),freq1, min.freq=1000, colors=brewer.pal(6,"Set1"))
This next bar chart shows the top bi-grams in the bi-gram model.
library("ggplot2")
library("tm")
library("dplyr")
##Bigram PLOT
freq2 <- sort(rowSums(as.matrix(tdm2s)), decreasing = TRUE)
freq2DF <- data.frame(freq2)
freq2DF <- data.frame(add_rownames(freq2DF, "word"))
colnames(freq2DF) <- c("word", "frequency")
g <- ggplot(freq2DF[1:20, ], aes(x = reorder(word, +frequency), y = frequency))
g <- g + geom_bar(stat = "identity") + coord_flip() + ggtitle("Top 20 Bigrams")
g + labs(x = "Bigram words", y = "number of occurrences")
This bar chart shows the top tri-grams in the tri-gram model.
library("ggplot2")
library("tm")
library("dplyr")
##TRIgram PLOT
freq3 <- sort(rowSums(as.matrix(tdm3s)), decreasing = TRUE)
freq3DF <- data.frame(freq3)
freq3DF <- data.frame(add_rownames(freq3DF, "word"))
colnames(freq3DF) <- c("word", "frequency")
g <- ggplot(freq3DF[1:20, ], aes(x = reorder(word, +frequency), y = frequency))
g <- g + geom_bar(stat = "identity") + coord_flip() + ggtitle("Top 20 Trigrams")
g + labs(x = "Trigram words", y = "number of occurrences")
As expected, there are far fewer tri-grams than uni- or bi-grams. The number of actual tri-grams and the number of times they occur seem low to me, and I wonder if I will need to increase the sample size. Looking at this chart also makes me wonder whether I should keep the stopwords that were removed while cleaning the data. Phrases like “dont get wrong” or “im big fan” make me wonder how useful they will be with the stop words missing.
I am excited about building an app to predict the next word in a phrase. I will look closely at building an app that is accurate but does not use too much memory or computing power on the Shiny app server. I plan to move forward with the tm package, but I will definitely look at other packages such as quanteda and text2vec.
I want to look at the Kneser-Ney smoothing algorithm for handling n-grams that do not show up in the training data (unseen words and phrases).
I have questions about using bi-grams vs. tri-grams and about removing stopwords. I will explore models with and without these features and test them for accuracy.
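As a very rough illustration of the kind of lookup the app could do with the bi-gram and tri-gram frequency tables above (freq2DF, freq3DF), here is a simple backoff-style helper. The name predictNext is hypothetical, this is a longest-match-first lookup rather than Kneser-Ney smoothing, and it assumes the input phrase contains no regex special characters.

predictNext <- function(phrase, tri = freq3DF, bi = freq2DF) {
  tri$word <- as.character(tri$word)            # in case the word column is a factor
  bi$word  <- as.character(bi$word)
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) {                                 # try the trigram table first
    prefix <- paste(words[n - 1], words[n])
    hits <- tri[grepl(paste0("^", prefix, " "), tri$word), ]
    if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
  }
  prefix <- words[n]                            # back off to the bigram table
  hits <- bi[grepl(paste0("^", prefix, " "), bi$word), ]
  if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
  NA_character_                                 # nothing matched
}
predictNext("thanks for the")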