The Capstone Project was initiated to build a predictive word model based on an input phrase of multiple words. This milestone report details the status of each item in the project task list.
The project has proved especially challenging given the lack of prior familiarity with natural language processing. At the same time, these challenges have been exciting because of the amount of new knowledge being acquired.
The milestone report includes a review of the source data and exploratory data analysis, along with forward-looking plans for the remaining project tasks.
The project began with research into Natural Language Processing, text mining, and the associated tools in R, based on links provided via the Coursera course page:
Natural language processing Wikipedia page
Text mining infrastructure in R
CRAN Task View: Natural Language Processing
Coursera course on NLP (not in R)
After exhausting the above materials, general research from a variety of Google searches and online book resources was collected in Evernote. Those materials will be referenced, where used, in the final project report.
Three data sources from the Capstone dataset were utilized and loaded into R using the readLines command. The following summary statistics describe the source data files:
| File | Line Count | Word Count | Character Count | Avg. Char. Count | Std. Dev. Char. Count |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899,288 | 39,126,759 | 206,912,084 | 230.1 | 258.7 |
| en_US.news.txt | 1,010,242 | 36,723,514 | 203,177,180 | 201.1 | 133.2 |
| en_US.twitter.txt | 2,360,148 | 32,794,523 | 162,091,410 | 68.7 | 37.2 |
| Totals | 4,269,678 | 108,644,796 | 572,180,674 | | |
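The statistics above can be reproduced in outline with base R. The following is a minimal sketch rather than the code used for the report: the helper name summarizeLines and the toy input are hypothetical, and it assumes that words are counted by splitting each line on runs of whitespace.

```r
# Summarize a character vector of text lines (one element per line).
# Assumption: a "word" is any run of non-whitespace characters.
summarizeLines <- function(lines) {
  trimmed <- trimws(lines)
  chars <- nchar(lines, type = "chars")
  words <- lengths(strsplit(trimmed, "\\s+"))
  words[!nzchar(trimmed)] <- 0L  # blank lines contain no words
  data.frame(
    lineCount = length(lines),
    wordCount = sum(words),
    charCount = sum(chars),
    avgChars  = round(mean(chars), 1),
    sdChars   = round(sd(chars), 1)
  )
}

# Toy input standing in for the result of readLines() on a source file
toyLines <- c("the quick brown fox", "jumps over", "the lazy dog")
toyStats <- summarizeLines(toyLines)
print(toyStats)
```

Applied to the vector returned by readLines for each file, a helper of this kind would yield one row per data source.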
Given the large size of the raw data sources, a 1% sample of each data object was constructed after loading the source data into R. The samples were then summarized for comparison with the original data sources to ensure that a representative sample had been selected.
| R Object | Line Count | Word Count | Character Count | Avg. Char. Count | Std. Dev. Char. Count |
|---|---|---|---|---|---|
| blogsSample | 8,992 | 389,686 | 2,065,345 | 229.7 | 254.6 |
| newsSample | 10,102 | 366,565 | 2,025,874 | 200.5 | 132.7 |
| tweetsSample | 23,601 | 329,084 | 1,626,085 | 68.9 | 37.2 |
| Totals | 42,695 | 1,085,335 | 5,717,304 | | |
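One way to draw such a sample is sketched below. The seed, the toy source vector, and the variable names are assumptions for illustration only. The use of floor is also an assumption, although it is consistent with the sample line counts in the table (for example, 899,288 lines giving 8,992).

```r
set.seed(1234)  # assumed seed, only so that the sketch is reproducible

# Toy stand-in for one of the loaded source vectors (e.g. blogs)
toySource <- vapply(
  seq_len(10000),
  function(i) paste(rep("word", 1 + i %% 20), collapse = " "),
  character(1)
)

# Draw a 1% sample of lines without replacement
sampleSize <- floor(length(toySource) * 0.01)
toySample  <- sample(toySource, sampleSize)

# Compare average line length in the source and in the sample
comparison <- c(
  sourceAvgChars = mean(nchar(toySource)),
  sampleAvgChars = mean(nchar(toySample))
)
print(round(comparison, 1))
```

If the two averages (and the standard deviations, computed the same way) are close, the sample is plausibly representative of the source in terms of line length.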
Several steps were taken to cleanse each sample data set prior to processing them into corpus objects:
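Set the encoding of each object to UTF-8 (e.g. Encoding(blogs) <- "UTF-8")
Removed invalid characters with the iconv command (e.g. blogsUTF8 <- iconv(blogs, "UTF-8", "UTF-8", sub=''))
Normalized the text to ASCII with the stringi package, per code from R.K. Lancer’s GitHub repository:
library("stringi")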
asciify <- function(text) {
  text <- stri_trans_general(text, 'Any-Latin')
  text <- stri_trans_general(text, 'Any-Publishing')
  text <- stri_trans_general(text, 'Latin-Ascii')
  text
}
tweetsNormalized <- asciify(tweets)
The cleansed sample text was then collected into a single object (rawText) and processed into a corpus called docs:
docs <- Corpus(VectorSource(rawText))
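The effect of the iconv step can be illustrated on a toy vector. This is a sketch under the assumption that the goal of that step is to drop bytes that are not valid UTF-8; the input strings and variable names are hypothetical.

```r
# Toy input: the second element contains a byte (0xE9) that is not valid UTF-8
rawToy <- c("plain text", "caf\xe9 latte")

# Re-encode from UTF-8 to UTF-8; sub = "" drops bytes that cannot be converted
cleanToy <- iconv(rawToy, from = "UTF-8", to = "UTF-8", sub = "")

# Check which elements are valid UTF-8 before and after the conversion
print(validUTF8(rawToy))
print(validUTF8(cleanToy))
```

After this step every element should pass validUTF8, which makes the later transliteration and corpus functions less likely to fail on malformed input.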
Cleansing within the corpus docs was then executed as follows prior to tokenization:
Defined toSpace as a content_transformer that replaces some special characters with spaces:
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# The pattern is a bracket expression so that each listed character is matched individually
docs <- tm_map(docs, toSpace, "[/@|#~!$%^&*()_+:<>?,.;\"\u2614\u26a1]", mc.cores=1)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, content_transformer(removeNumbers))
docs <- tm_map(docs, content_transformer(removePunctuation))
docs <- tm_map(docs, removeWords, profanityFilter), where profanityFilter is the list of English words in the GitHub repository
# Note: A loop was the only way to achieve execution of the stripWhitespace
# command due to errors encountered trying to use the content_transformer() function
for (x in seq_along(docs)) {
  docs[[x]] <- stripWhitespace(docs[[x]])
}
Note: No stop-word removal (removeWords with stopwords) was performed on the corpus text, since the predictive model itself needs to include these words.
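The order of the cleansing steps can be seen on a single string with base R. The sketch below swaps in gsub for the tm transformations, so it is not the pipeline used on the corpus; the function name cleanText and the word list toyFilter are hypothetical stand-ins.

```r
# Hypothetical stand-in for the profanity list used in the report
toyFilter <- c("darn", "heck")

cleanText <- function(x, banned = character(0)) {
  x <- tolower(x)                       # lower-case everything
  x <- gsub("[[:digit:]]+", "", x)      # drop numbers
  x <- gsub("[[:punct:]]+", "", x)      # drop punctuation
  if (length(banned) > 0) {
    # Match only whole words from the list
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x <- gsub("[[:space:]]+", " ", x)     # collapse runs of whitespace
  trimws(x)
}

cleaned <- cleanText("Well, DARN it: 3 cats   arrived!", banned = toyFilter)
print(cleaned)
```

Whitespace is collapsed last because each earlier removal can leave extra spaces behind.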
After the corpus data cleansing was completed, an exploratory analysis was undertaken by forming a DocumentTermMatrix for the corpus text.
library("tm")
dtmFinal <- DocumentTermMatrix(docs)
dtmFinal
Given the high sparsity rate (100%) and the high count of terms (56293), the next step was to apply removeSparseTerms, leaving a reasonable set of terms (428 at the sparsity threshold applied) for additional exploratory analysis.
dtmFinal.ns <- removeSparseTerms(dtmFinal, 0.995)
dtmFinal.ns
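To make the sparsity threshold concrete, the sketch below rebuilds the idea with a small base R matrix instead of a tm DocumentTermMatrix. It assumes, per the tm documentation, that a term is removed when at least the given share of its entries is empty, so a term survives only if it occurs in more than N × (1 − sparse) of the N documents. The toy documents and the threshold of 0.6 are illustrative only.

```r
# Toy corpus: four "documents" that have already been cleansed
toyDocs <- c("the cat sat", "the dog sat", "the bird flew", "the cat ran")

tokens <- strsplit(toyDocs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: TRUE where a term occurs in a document
dtm <- t(vapply(tokens, function(tok) terms %in% tok, logical(length(terms))))
colnames(dtm) <- terms

# Overall sparsity: share of empty cells in the matrix
sparsity <- 1 - sum(dtm) / length(dtm)
print(round(sparsity, 3))

# Keep terms that occur in more than (1 - threshold) of the documents
threshold <- 0.6
docFreq   <- colSums(dtm)
keptTerms <- names(docFreq)[docFreq > nrow(dtm) * (1 - threshold)]
print(keptTerms)
```

Under the same assumption, and treating each sampled line as one document, a threshold of 0.995 on 42,695 documents would keep only terms that appear in more than about 213 of them.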
A term frequency matrix was constructed from the non-sparse document-term matrix, and an ordered histogram was plotted for the 25 most frequent words in the frequency matrix:
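A frequency ranking of this kind can be sketched in base R as follows. The example counts tokens directly with table rather than summing the columns of a document-term matrix, and the toy documents and the cut-off of three terms are assumptions chosen to keep the output small.

```r
# Toy documents standing in for the cleansed corpus text
toyDocs <- c("the cat sat on the mat", "the dog sat", "a cat ran")

# Count every word and order the counts from most to least frequent
words <- unlist(strsplit(toyDocs, " ", fixed = TRUE))
freq  <- sort(table(words), decreasing = TRUE)

# Keep the top entries (the report uses the top 25)
topN <- head(freq, 3)
topTable <- data.frame(word = names(topN), count = as.integer(topN),
                       stringsAsFactors = FALSE)
print(topTable)

# An ordered bar chart could then be drawn with, for example:
# barplot(topN, las = 2, ylab = "Frequency")
```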
A word cloud was also constructed from the frequency matrix:
The corpus text was next converted to a data frame for applying RWeka to form n-grams (1-gram, 2-gram, 3-gram, and 4-gram):
# Convert corpus "docs" to Data Frame for application of RWeka functions
cleanedText <- data.frame(text=unlist(sapply(docs, '[', "content")), stringsAsFactors=FALSE)
# Build the n-grams for the model
library("RWeka")
unigrams <- NGramTokenizer(cleanedText, Weka_control(min=1, max=1))
bigrams <- NGramTokenizer(cleanedText, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
trigrams <- NGramTokenizer(cleanedText, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
quadgrams <- NGramTokenizer(cleanedText, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))
Frequency tables were then built and saved to disk from the n-grams for further exploratory analysis. The n-gram frequency tables were then sorted and plotted based on the Top 25 most frequent entries in each n-gram collection:
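The construction of an n-gram frequency table can be sketched without RWeka. The helpers makeNgrams and ngramFreq, the toy lines, and the file name in the final comment are all hypothetical; the sketch also assumes that n-grams do not span line boundaries, which may differ from the tokenizer output used in the report.

```r
# Build word n-grams of size n from one cleansed line of text
makeNgrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all lines and sort by descending frequency
ngramFreq <- function(lines, n) {
  grams <- unlist(lapply(lines, makeNgrams, n = n))
  freq  <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(freq), count = as.integer(freq),
             stringsAsFactors = FALSE)
}

toyLines   <- c("i want to go", "i want to stay", "to go home")
bigramFreq <- ngramFreq(toyLines, 2)
print(head(bigramFreq, 25))

# The table could then be saved for later use, for example:
# saveRDS(bigramFreq, "bigramFreq.rds")  # hypothetical file name
```

Sorting before taking the head gives the top entries directly, which is the form needed for a Top 25 plot.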