This paper describes a preliminary approach to the Johns Hopkins University Data Science Capstone project. The goal of the Capstone project is to build an interactive product that takes an ordered series of words as input and predicts the next word in the series. Given an initial corpus of text data drawn from a set of blogs, news articles, and Twitter feeds, the project instructions are to use principles and techniques from the fields of text mining, natural language processing, statistics, and data science to build a model that predicts the next word in an ordered series of words, and to deliver that model in a Shiny app. Included in this paper are the approach to basic data cleaning, exploratory data analysis, and initial prediction model ideas and considerations.
The first step, of course, is to load the data into an appropriate data structure in R and then to clean it to remove unwanted symbols, extra spacing, punctuation, etc. The text mining package in R, “tm”, provides much of the functionality needed for this step; in particular, data files can be loaded into a corpus data structure using its Corpus function. While the “tm” package does include functions for cleaning data (removing punctuation, multiple/trailing spaces, etc.), I have found through experimentation that loading the files with the base R readLines function and then cleaning them with gsub performs much faster than using the “tm” cleaning functions. Therefore, I begin by converting each text file to an R character vector and applying a series of gsub calls so that only words remain. I then write the cleaned files out to a directory and load them back in using the tm Corpus function. (It may sound like extra work, but it appears to be much faster when later generating the document-term matrices needed for extracting n-grams.)
#Make sure libraries are loaded
if (!require(reshape2)) install.packages("reshape2", dependencies = TRUE)
library(reshape2)
if (!require(tm)) install.packages("tm", dependencies = TRUE)
library(tm)
if (!require(RWeka)) install.packages("RWeka", dependencies = TRUE)
library(RWeka)
if (!require(ggplot2)) install.packages("ggplot2", dependencies = TRUE)
library(ggplot2)
if (!require(wordcloud)) install.packages("wordcloud", dependencies = TRUE)
library(wordcloud)
if (!require(quanteda)) install.packages("quanteda", dependencies = TRUE)
library(quanteda)
#Set file path names
doc_blogs <- "C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Uncompressed\\en_US.blogs.txt"
doc_news <- "C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Uncompressed\\en_US.news.txt"
doc_twitter <- "C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Uncompressed\\en_US.twitter.txt"
#Clean the data before creating the corpus
#Read data from file into vector data structure
text2vector <- function(filepath) {
  con <- file(filepath, "r")
  lines <- readLines(con, -1)
  close(con)  #close the connection to avoid leaking it
  return(lines)
}
#Pull data from individual files into corresponding vector data structures
blogsData <- text2vector(doc_blogs)
newsData <- text2vector(doc_news)
twitterData <- text2vector(doc_twitter)
#function for cleaning data
Text_To_Clean <- function(text_blob) {
  # swap all sentence-ending punctuation with the code 'ootoo'
  text_blob <- gsub(pattern=';|\\.|!|\\?', x=text_blob, replacement='ootoo')
  # swap all apostrophes with the code 'ooaoo'
  text_blob <- gsub(pattern="\\'", x=text_blob, replacement='ooaoo')
  # remove all non-alpha text (numbers, symbols, etc.)
  text_blob <- gsub(pattern="[^a-zA-Z]", x=text_blob, replacement=' ')
  # force all characters to lower case
  text_blob <- tolower(text_blob)
  # add apostrophes back in
  text_blob <- gsub(pattern="ooaoo", x=text_blob, replacement="'")
  # collapse contiguous spaces
  text_blob <- gsub(pattern="\\s+", x=text_blob, replacement=' ')
  # split into sentences at the split code
  sentence_vector <- unlist(strsplit(x=text_blob, split='ootoo', fixed=TRUE))
  return(sentence_vector)
}
corpus_blogs <- Text_To_Clean(blogsData)
corpus_news <- Text_To_Clean(newsData)
corpus_twitter <- Text_To_Clean(twitterData)
#Save the cleaned text as plain-text files so Corpus/DirSource can read them back
writeLines(corpus_blogs, con="C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Clean\\blogs.txt")
writeLines(corpus_news, con="C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Clean\\news.txt")
writeLines(corpus_twitter, con="C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Clean\\twitter.txt")
#create corpus from saved clean files
docs <- Corpus(DirSource("C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Clean"))
Looking first at the individual source files (blogs, news, and twitter), one can note the size of each file and the number of lines it contains.
#Preliminary Stats of individual data files
#function to count the number of lines in the file
LinesInFile <- function(filepath) {
  con <- file(filepath, "r")
  numlines <- 0
  while (TRUE) {
    line <- readLines(con, n = 1)
    if (length(line) == 0) {
      break
    }
    numlines <- numlines + 1
  }
  close(con)
  return(numlines)
}
blogsLines <- LinesInFile(doc_blogs)
newsLines <- LinesInFile(doc_news)
twitterLines <- LinesInFile(doc_twitter)
print(paste("The file 'en_US.blogs.txt' is 205,235 Mbytes long and contains", as.character(blogsLines), "lines.", sep=" "))
## [1] "The file 'en_US.blogs.txt' is 205,235 Mbytes long and contains 899288 lines."
print(paste("The file 'en_US.news.txt' is 200,989 Mbytes long and contains", as.character(newsLines), "lines.", sep=" "))
## [1] "The file 'en_US.news.txt' is 200,989 Mbytes long and contains 77259 lines."
print(paste("The file 'en_US.twitter.txt' is 163,189 Mbytes long and contains", as.character(twitterLines), "lines.", sep=" "))
## [1] "The file 'en_US.twitter.txt' is 163,189 Mbytes long and contains 2360148 lines."
It’s also possible to use another package, quanteda, to extract summary statistics:
#Make sure libraries are loaded
if (!require(quanteda)) install.packages("quanteda", dependencies = TRUE)
library(quanteda)
#create corpus
textfiles <- textfile(file = "C:\\Users\\jcosta\\Documents\\Data Science\\Capstone\\Dataset\\Uncompressed\\*.txt")
myCorpus <- corpus(textfiles)
#Preliminary Stats of individual data files
summary(myCorpus)
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences
## en_US.blogs.txt 446614 44936366 2019467
## en_US.news.txt 104029 3193405 140506
## en_US.twitter.txt 531354 37197445 2564201
##
## Source: C:/Users/jcosta/Dropbox/Family Folder/DataScience/Capstone - Jess/Week2/* on x86-64 by jcosta
## Created: Sat Sep 03 16:54:51 2016
## Notes:
The table above shows the Types (i.e., the number of distinct “words” or “terms”), Tokens (i.e., the total number of “words” or “terms”), and Sentences (i.e., units of written language) on a per-document basis for the corpus.
A document-term matrix can be used to derive more interesting and useful information about the corpus. The “tm” package provides the DocumentTermMatrix function for generating one.
## DocumentTermMatrix does the tokenizing and forms a document term matrix
dtm <- DocumentTermMatrix(docs)
The summary information from the Document Term Matrix reveals some additional information:
print(dtm)
## <<DocumentTermMatrix (documents: 3, terms: 35057)>>
## Non-/sparse entries: 45924/59247
## Sparsity : 56%
## Maximal term length: 21
## Weighting : term frequency (tf)
As revealed, the document-term matrix has 3 documents (rows) and 35,057 terms (columns), and 56% of its entries are zero. The maximum length of any term in the matrix is 21 characters.
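These figures can be checked directly (a quick sketch, assuming the dtm object just created; expanding the matrix densely is feasible here only because there are just three documents):
m <- as.matrix(dtm)
dim(m)        #3 documents by 35057 terms
mean(m == 0)  #proportion of zero entries, approximately 0.56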
The frequency of occurrence of each term in the corpus can be calculated by summing each column.
freq_monograms <- colSums(as.matrix(dtm))
The 20 most frequently occurring terms are:
freq_monograms<-sort(freq_monograms, decreasing=TRUE)
head(freq_monograms, 20)
## you the and good thank love thanks that what not
## 12958 7569 5435 5084 4587 4572 4019 3477 3325 2915
## i'm great for it's was yes this just are too
## 2807 2572 2456 2286 2050 1928 1912 1839 1777 1760
monograms <- data.frame(word=names(freq_monograms), freq=freq_monograms)
A more visual presentation of the most frequent words in the corpus is given by a bar chart and a wordcloud:
#plotting theme
mytheme <- theme(plot.title=element_text(face="bold", size=14, color="darkblue"),
                 axis.title=element_text(face="bold", size=10, color="black"),
                 axis.text=element_text(size=9, color="darkblue"),
                 axis.text.x=element_text(angle=45, hjust=1),
                 legend.position="top")
ggplot(monograms[1:20,], aes(word, freq)) + geom_bar(stat = "identity", fill="magenta") + mytheme + ggtitle("Frequency of Single Words (Monograms) in Document Corpus")
## Display a wordcloud
wordcloud(names(freq_monograms), freq_monograms, max.words=200, colors=brewer.pal(8, "Dark2"), rot.per=0.3)
Given the initial exploratory analysis, the next step is to build a predictive model that takes as input an ordered series of words and predicts the next word in the series.
One potential model for predicting the next word in an ordered series of words is to collect the frequencies of n-grams (n from 1 to 4 or 5) found in the corpus and convert these frequencies into probabilities. Then, given an ordered series of n-1 words, we can calculate the probabilities of various candidate next words from the a priori knowledge of the n-gram frequencies.
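As a simple illustration, the conditional probability of a next word can be estimated by maximum likelihood: the count of the n-gram divided by the count of its (n-1)-word prefix. A minimal sketch of the bigram case (the bigram_probability helper is hypothetical; it assumes the freq_monograms vector computed above and the freq_bigrams vector built below):
#Estimate P(nextword | previous) as count(previous nextword) / count(previous)
bigram_probability <- function(previous, nextword) {
  bigram <- paste(previous, nextword)
  count_bigram <- if (bigram %in% names(freq_bigrams)) freq_bigrams[[bigram]] else 0
  count_prev <- if (previous %in% names(freq_monograms)) freq_monograms[[previous]] else NA
  count_bigram / count_prev
}
#Example: P("you" | "thank") should be high in this corpus
bigram_probability("thank", "you")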
To implement this model, it is necessary to build a set of n-grams (n = 2, 3, 4, or more) and calculate the frequency of each n-gram. The bigrams, trigrams, and quadgrams can be generated using the NGramTokenizer function from the RWeka package.
BigramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
dtm_bigrams<-DocumentTermMatrix(docs, control=list(tokenize=BigramTokenizer))
print(dtm_bigrams)
## <<DocumentTermMatrix (documents: 3, terms: 123259)>>
## Non-/sparse entries: 134396/235381
## Sparsity : 64%
## Maximal term length: 31
## Weighting : term frequency (tf)
TrigramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
dtm_trigrams<-DocumentTermMatrix(docs, control=list(tokenize=TrigramTokenizer))
print(dtm_trigrams)
## <<DocumentTermMatrix (documents: 3, terms: 103686)>>
## Non-/sparse entries: 107010/204048
## Sparsity : 66%
## Maximal term length: 40
## Weighting : term frequency (tf)
quadgramTokenizer<-function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
dtm_quadgrams<-DocumentTermMatrix(docs, control=list(tokenize=quadgramTokenizer))
print(dtm_quadgrams)
## <<DocumentTermMatrix (documents: 3, terms: 68795)>>
## Non-/sparse entries: 69185/137200
## Sparsity : 66%
## Maximal term length: 48
## Weighting : term frequency (tf)
Looking at the summary statistics of the document-term matrices for the bigrams, trigrams, and quadgrams, we observe that while the number of distinct terms decreases as n increases (since more words are merged into each term, there are 123,259 unique bigrams, 103,686 unique trigrams, and only 68,795 unique quadgrams), the sparsity of the resulting document-term matrices remains nearly constant at around 64 to 66%.
The most frequent bigrams in our corpus are illustrated below. NOTE: In this implementation, contractions (i.e., two words contracted into a single word containing an apostrophe) are counted as two words; the tokenizer removes the apostrophe and splits the contraction, which is defensible since a contraction is in fact two words contracted into one. If one prefers to count a contraction as a single word, it is possible to tokenize contractions that way with more coding effort, as sketched below.
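For example, one possible workaround (a hypothetical sketch, reusing the placeholder idea from Text_To_Clean above) is to protect apostrophes with a letter code before tokenizing and then restore them in the resulting term names:
#Protect apostrophes so the Weka tokenizer does not split contractions
protect_contractions <- function(x) gsub("'", "ooaoo", x, fixed = TRUE)
restore_contractions <- function(terms) gsub("ooaoo", "'", terms, fixed = TRUE)
docs_protected <- tm_map(docs, content_transformer(protect_contractions))
dtm_bigrams_protected <- DocumentTermMatrix(docs_protected, control=list(tokenize=BigramTokenizer))
freq_bigrams_protected <- sort(colSums(as.matrix(dtm_bigrams_protected)), decreasing=TRUE)
names(freq_bigrams_protected) <- restore_contractions(names(freq_bigrams_protected))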
##Plot Bigram Frequencies
freq_bigrams <-sort(colSums(as.matrix(dtm_bigrams)), decreasing=TRUE)
head(freq_bigrams, 20)
## thank you i m it s i love love you that s
## 4319 3049 2592 2180 1305 1230
## don t can t good luck no problem i am love it
## 1158 1036 1025 985 900 897
## i know t wait see you i can it was follow me
## 825 693 673 661 625 599
## what s let s
## 579 571
bigrams <-data.frame(word=names(freq_bigrams), freq=freq_bigrams)
ggplot(bigrams[1:20,], aes(word, freq)) + geom_bar(stat = "identity", fill="magenta") + mytheme + ggtitle("Frequency of Bigrams in Document Corpus")
## Display a wordcloud
wordcloud(names(freq_bigrams), freq_bigrams, max.words=60, colors=brewer.pal(8, "Dark2"), rot.per=0.3)
The most frequent trigrams in our corpus are illustrated below (subject to the NOTE above about counting contractions as two words):
##Plot Trigrams Frequencies
freq_trigrams <- sort(colSums(as.matrix(dtm_trigrams)), decreasing=TRUE)
head(freq_trigrams, 20)
## i love it i love you can t wait i don t check it out
## 773 689 682 482 441
## i can t what s up don t know see you soon i m so
## 423 376 331 325 306
## i miss you that s it wish me luck i m sorry i like it
## 276 263 253 242 240
## let s go it and amazon here we go life is good you got it
## 221 213 203 191 189
trigrams <- data.frame(word=names(freq_trigrams), freq=freq_trigrams)
ggplot(trigrams[1:20,], aes(word, freq)) + geom_bar(stat = "identity", fill="magenta") + mytheme + ggtitle("Frequency of Trigrams in Document Corpus")
## Display a wordcloud
wordcloud(names(freq_trigrams), freq_trigrams, max.words=30, colors=brewer.pal(8, "Dark2"), rot.per=0.3)
The most frequent quadgrams in our corpus are illustrated below (subject to the NOTE above about counting contractions as two words):
##Plot Quadgrams Frequencies
freq_quadgrams <- sort(colSums(as.matrix(dtm_quadgrams)), decreasing=TRUE)
head(freq_quadgrams, 12)
## i don t know i can t wait i ll be there i know i know let s get it
## 298 283 105 87 77
## you can do it i ll take it i m not sure hey what s up im a big fan
## 77 67 66 63 59
## i don t care i m so sorry
## 56 56
quadgrams <- data.frame(word=names(freq_quadgrams), freq=freq_quadgrams)
ggplot(quadgrams[1:20,], aes(word, freq)) + geom_bar(stat = "identity", fill="magenta") + mytheme + ggtitle("Frequency of Quadgrams in Document Corpus")
## Display a wordcloud
wordcloud(names(freq_quadgrams), freq_quadgrams, max.words=20, colors=brewer.pal(8, "Dark2"), rot.per=0.3)
Another way to predict the next word in an ordered series of words is to build a probability tree; once the tree is built, simply traverse it to the most probable leaf node and, if no matching word is found, return the most frequent word in the database.
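The same backoff idea can be sketched directly against the n-gram frequency tables built above, without an explicit tree structure. The predict_next_word helper below is hypothetical, a minimal sketch assuming the monograms, bigrams, trigrams, and quadgrams data frames created earlier (terms stored as space-separated, lower-case words):
#Look up the longest matching prefix in the quad-, tri-, then bigram tables;
#back off to the most frequent single word if no match is found
predict_next_word <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  ngram_tables <- list(quadgrams, trigrams, bigrams)
  ngram_orders <- c(4, 3, 2)
  for (i in seq_along(ngram_tables)) {
    n <- ngram_orders[i]
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      #cleaned terms contain only letters, apostrophes, and spaces, so a plain regex prefix match is safe
      matches <- ngram_tables[[i]][grepl(paste0("^", prefix, " "), ngram_tables[[i]]$word), ]
      if (nrow(matches) > 0) {
        best <- as.character(matches$word[which.max(matches$freq)])
        return(tail(unlist(strsplit(best, " ")), 1))  #last word of the most frequent matching n-gram
      }
    }
  }
  as.character(monograms$word[which.max(monograms$freq)])
}
#Example usage
predict_next_word("thank you for the")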
During my initial research into text mining, I read that there are packages that take sentence structure and grammar into account (for example, part-of-speech taggers such as the openNLP package). This is a possibility for improving the performance of the prediction: by understanding the sentence structure of the input, one could increase the weighting of words from the prediction pool that match the expected type of the next word (e.g., noun, verb, adjective, article, etc.).
One question of interest is how much effect the inclusion or removal of end-of-sentence markers during the initial data cleaning has on the accuracy of the model. In theory, retaining the sentence boundaries should result in a more accurate model, since the way sentences flow together does not have the same predictability as the way words flow together within a sentence.
Finally, increasing the variety and quantity of the sources in the corpus should also improve the quality of the prediction function.