This is an exercise in NLP. I have adopted a "literate statistics" approach, which makes all of my code accessible and my results reproducible. The reader uninterested in reproducing the results is invited to disregard the code displayed in the windows below.
The assignment came with five questions to consider, namely:
1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
I will tackle these questions one by one.
I start out by installing the required R packages and by loading and reading three of the text files provided for this exercise, namely, those in English. The same exercise could, however, be executed, with minor changes to the code, on the files provided in other languages.
# install any missing packages, then load all of them
for (package in c('knitr', 'tm', 'RWeka', 'stringi', 'stringr', 'ggplot2', 'dplyr', 'wordcloud', 'NLP', 'openNLP', 'qdap', 'textcat')) {
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package, character.only=T, warn.conflicts=F, verbose=F, quietly=T)
}
}
opts_chunk$set(echo=TRUE)
set.seed(33)
# define source and target for download
targetFile <- "Coursera-SwiftKey.zip"
sourceFile <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# download source to target
download.file(sourceFile, targetFile)
# unzip file
unzip(targetFile)
# open connections
twitterFile <- file("./final/en_US/en_US.twitter.txt", "rb")
newsFile <- file("./final/en_US/en_US.news.txt", "rb")
blogsFile <- file("./final/en_US/en_US.blogs.txt", "rb")
# read files
twitter_crudo <- readLines(twitterFile, encoding = "UTF-8", skipNul = TRUE)
news_crudo <- readLines(newsFile, encoding = "UTF-8", skipNul = TRUE)
blogs_crudo <- readLines(blogsFile, encoding = "UTF-8", skipNul = TRUE)
# close connections
close(twitterFile)
close(newsFile)
close(blogsFile)
The next chunk of code computes basic information about the three selected text files.
News_numWordsPerLine <- stri_count_words(news_crudo)
News_totWords <- sum(News_numWordsPerLine)
News_sizeMb <- file.info("./final/en_US/en_US.news.txt")$size/1024^2
News_totLines <- length(news_crudo)
News_meanWordsPerLine <- mean(News_numWordsPerLine)
News_maxWordsPerLine <- max(News_numWordsPerLine)
News_characters <- sum(nchar(news_crudo))
News_maxCharactersPerLine <- max(nchar(news_crudo))
Blogs_numWordsPerLine <- stri_count_words(blogs_crudo)
Blogs_totWords <- sum(Blogs_numWordsPerLine)
Blogs_sizeMb <- file.info("./final/en_US/en_US.blogs.txt")$size/1024^2
Blogs_totLines <- length(blogs_crudo)
Blogs_meanWordsPerLine <- mean(Blogs_numWordsPerLine)
Blogs_maxWordsPerLine <- max(Blogs_numWordsPerLine)
Blogs_characters <- sum(nchar(blogs_crudo))
Blogs_maxCharactersPerLine <- max(nchar(blogs_crudo))
Twitter_numWordsPerLine <- stri_count_words(twitter_crudo)
Twitter_totWords <- sum(Twitter_numWordsPerLine)
Twitter_sizeMb <- file.info("./final/en_US/en_US.twitter.txt")$size/1024^2
Twitter_totLines <- length(twitter_crudo)
Twitter_meanWordsPerLine <- mean(Twitter_numWordsPerLine)
Twitter_maxWordsPerLine <- max(Twitter_numWordsPerLine)
Twitter_characters <- sum(nchar(twitter_crudo))
Twitter_maxCharactersPerLine <- max(nchar(twitter_crudo))
I display here summary information about the text files, namely, their names, number of words, number of characters, MB size, number of text lines, mean number of words per line, and maximum number of words per line.
summaryTable <- data.frame(filename = c("blogs","news","twitter","total"),
totWords = c(Blogs_totWords,News_totWords,Twitter_totWords,Blogs_totWords+
News_totWords+Twitter_totWords),
totCharacters = c(Blogs_characters, News_characters, Twitter_characters,
Blogs_characters+News_characters+Twitter_characters),
sizeMb = c(Blogs_sizeMb, News_sizeMb, Twitter_sizeMb,
Blogs_sizeMb+News_sizeMb+Twitter_sizeMb),
totLines = c(Blogs_totLines,News_totLines,Twitter_totLines,
Blogs_totLines+News_totLines+Twitter_totLines),
meanWordsPerLine = c(Blogs_meanWordsPerLine,
News_meanWordsPerLine, Twitter_meanWordsPerLine, (Blogs_totWords+News_totWords+Twitter_totWords)/(Blogs_totLines+News_totLines+
Twitter_totLines)),
maxWordsPerLine = c(Blogs_maxWordsPerLine, News_maxWordsPerLine, Twitter_maxWordsPerLine,
((Blogs_maxWordsPerLine*Blogs_sizeMb)+(News_maxWordsPerLine*News_sizeMb)+(Twitter_maxWordsPerLine*Twitter_sizeMb))/(Blogs_sizeMb+News_sizeMb+Twitter_sizeMb)))
summaryTable
## filename totWords totCharacters sizeMb totLines meanWordsPerLine
## 1 blogs 37546246 206824505 200.4242 899288 41.75108
## 2 news 34762395 203223159 196.2775 1010242 34.40997
## 3 twitter 30093410 162096241 159.3641 2360148 12.75065
## 4 total 102402051 572143905 556.0658 4269678 23.98355
## maxWordsPerLine
## 1 6726.000
## 2 1796.000
## 3 47.000
## 4 3071.683
As a personal memento, I include here the code used to answer some of the questions in quiz 1. Please disregard if uninterested.
#quiz 1.4
love_occurs <- sum(grepl(pattern = "love", x = twitter_crudo)) # of sentences where there is
# one or more occurrences of "love"
hate_occurs <- sum(grepl(pattern = "hate", x = twitter_crudo)) # same for word "hate"
answer_4 <- love_occurs/hate_occurs
#quiz1.5
answer_5 <- twitter_crudo[grep(pattern = "biostat", x = twitter_crudo)]
#quiz 1.6
answer_6 <- sum(grepl(pattern="A computer once beat me at chess, but it was no match for me at kickboxing",x=twitter_crudo))
Since our assignment invites us to purge our text files of profanities, here is my source of profane words.
profanities <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", encoding = "UTF-8")
Next I assemble a training set from a 1% random sample of each file and display the same basic information about it as shown above for the three source files.
blogsSample <- sample(blogs_crudo, as.integer(Blogs_totLines*0.01))
newsSample <- sample(news_crudo, as.integer(News_totLines*0.01))
twitterSample <- sample(twitter_crudo, as.integer(Twitter_totLines*0.01))
#combine the 3 samples
threeSamples <- c(blogsSample,newsSample, twitterSample)
write.table(threeSamples, "threeSamples.txt", sep="\t")
# collect data for summary statistics of the data set obtained by concatenating the three samples
threeSamples_numWordsPerLine <- stri_count_words(threeSamples)
threeSamples_totWords <- sum(threeSamples_numWordsPerLine)
threeSamples_sizeMb <- file.info("threeSamples.txt")$size/1024^2
threeSamples_totLines <- length(threeSamples)
threeSamples_meanWordsPerLine <- mean(threeSamples_numWordsPerLine)
threeSamples_maxWordsPerLine <- max(threeSamples_numWordsPerLine)
threeSamples_characters <- sum(nchar(threeSamples))
threeSamples_maxCharactersPerLine <- max(nchar(threeSamples))
summaryTable2 <- data.frame(filename = "threeSamples",
totWords = threeSamples_totWords,
totCharacters = threeSamples_characters,
sizeMb = threeSamples_sizeMb,
totLines = threeSamples_totLines,
meanWordsPerLine = threeSamples_meanWordsPerLine,
maxWordsPerLine = threeSamples_maxWordsPerLine)
summaryTable2
## filename totWords totCharacters sizeMb totLines meanWordsPerLine
## 1 threeSamples 1021430 5713476 5.919318 42695 23.92388
## maxWordsPerLine
## 1 638
The next two windows of code contain most of the crucial NLP coding. Each section of code is complemented by a concise explanatory comment. All in all, the following lines of code:
1. create a text corpus out of the training text file,
2. preprocess the corpus, ridding it of unhelpful elements and details;
2.1. separately from the above preprocessing, rid the corpus of stopwords, which allows for a more fruitful exploratory analysis (this step is kept separate from the overall corpus preprocessing because I will not apply it at the word-prediction stage of the next assignment, since I deem stopwords relevant to good predictive results);
3. create a term document matrix out of the corpus, in which each distinct line of text is a column (or document) and each distinct word is a row (or term);
4. remove the sparse terms from this matrix;
5. tokenize the corpus to generate separate uni-, bi-, and tri-grams.
# 6 sorts of elements to eliminate from text file
hashtags <- "#[[:alnum:]]+"
special <- c("®","™", "¥", "£", "¢", "€", "#", "â€" , "ð" , "Ÿ˜","Š","í", "½","ð","$")
urls <- "(f|ht)tp(s?)://(.*)[.][a-z]+"
email <- "[[:alnum:].-]+@[[:alnum:].-]+"
date <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"
control_characters <- '[[:cntrl:]]'
# helper function to remove above 6 elements
removeFunction <- function(textFile) {
  textFile <- gsub(control_characters, "", textFile)
  textFile <- gsub(urls, "", textFile)
  textFile <- gsub(email, "", textFile)
  textFile <- gsub(paste0(special, collapse = '|'), "", textFile)
  textFile <- gsub(hashtags, "", textFile)
  textFile <- gsub(date, "", textFile)
  return (textFile)
}
# helper function to preprocess corpus
preProcessFunction <- function(myCorpus) {
  myCorpus <- tm_map(myCorpus, content_transformer(tolower))
  myCorpus <- tm_map(myCorpus, removePunctuation)
  myCorpus <- tm_map(myCorpus, removeWords, profanities)
  #myCorpus <- tm_map(myCorpus, removeWords, stopwords("en"))
  myCorpus <- tm_map(myCorpus, removeNumbers)
  #myCorpus <- tm_map(myCorpus, stemDocument)
  myCorpus <- tm_map(myCorpus, stripWhitespace)
  return (myCorpus)
}
# helper function to rid the corpus of stopwords
preProcessFunction2 <- function(myCorpus2) {
  myCorpus2 <- tm_map(myCorpus2, removeWords, stopwords("en"))
  return (myCorpus2)
}
# helper function to build a word-frequency data frame from a term document matrix
data_frame <- function(matrice){
  frequenze <- sort(rowSums(as.matrix(matrice)), decreasing=TRUE)
  frame <- data.frame(word=names(frequenze), freq=frequenze)
  return(frame)
}
# helper functions to generate (tokenize) multi-grams
biGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triGramTokenize <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
# apply removeFunction()
threeSamples <- removeFunction(threeSamples)
write.table(threeSamples, "threeSamples.txt", sep="\t")
#create corpus
threeSamplesAsCorpus <- VCorpus(VectorSource(threeSamples))
preProcessedCorpus <- preProcessFunction(threeSamplesAsCorpus)
preProcessedCorpus2 <- preProcessFunction2(preProcessedCorpus)
# exploratory analysis is to be executed on preProcessedCorpus2
# predictive work will be executed on preProcessedCorpus (inclusive of stopwords)
termDocMatrix2 <- TermDocumentMatrix(preProcessedCorpus2)
# remove sparse elements
termDocMatrix2e <- removeSparseTerms(termDocMatrix2, 0.997)
freqDataFrame2 <- data_frame(termDocMatrix2e)
## Sets the default number of threads to use in parallel library
# see brian.keng at http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka
options(mc.cores=1)
termDocMatrixByTwo <- TermDocumentMatrix(preProcessedCorpus2, control=list(tokenize=biGramTokenize))
termDocMatrixByTwoE <- removeSparseTerms(termDocMatrixByTwo, 0.99973)
freqDataFrameByTwoE <- data_frame(termDocMatrixByTwoE)
termDocMatrixByThree <- TermDocumentMatrix(preProcessedCorpus2, control=list(tokenize=triGramTokenize))
termDocMatrixByThreeB <- removeSparseTerms(termDocMatrixByThree, 0.99997)
freqDataFrameByThreeB <- data_frame(termDocMatrixByThreeB)
An important lesson to be learned from the above computations is how sensitive the results are to the sparsity factor passed to the removeSparseTerms() function: minimal variations in its value have substantial consequences for the size of the term document matrix.
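One can check this sensitivity directly by comparing the dimensions of the term document matrix at a few nearby thresholds. This is only an illustrative check on the objects built above; the exact dimensions depend on the random sample drawn earlier.
# the number of retained terms (rows) changes sharply with small threshold changes
dim(removeSparseTerms(termDocMatrix2, 0.99))
dim(removeSparseTerms(termDocMatrix2, 0.997))
dim(removeSparseTerms(termDocMatrix2, 0.999))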
The frequency data frame of my unigrams contains 632 unigrams.
The frequency data frame of my bigrams contains 929 bigrams.
The frequency data frame of my trigrams contains 3460 trigrams.
Next I display the top 20 terms of each one of these frequency data frames.
freqDataFrame2[1:20,]
## word freq
## will will 3181
## said said 3080
## just just 2916
## one one 2886
## like like 2690
## can can 2501
## get get 2221
## time time 2031
## new new 1976
## now now 1786
## good good 1767
## dont dont 1692
## day day 1652
## people people 1568
## love love 1539
## know know 1500
## see see 1416
## back back 1355
## think think 1316
## make make 1306
freqDataFrameByTwoE[1:20,]
## word freq
## right now right now 237
## new york new york 204
## cant wait cant wait 199
## last year last year 194
## dont know dont know 187
## last night last night 171
## high school high school 145
## feel like feel like 137
## years ago years ago 136
## im going im going 123
## last week last week 119
## im sure im sure 116
## first time first time 114
## dont think dont think 111
## looking forward looking forward 107
## happy birthday happy birthday 106
## can get can get 97
## looks like looks like 96
## make sure make sure 96
## st louis st louis 93
freqDataFrameByThreeB[1:20,]
## word freq
## cant wait see cant wait see 43
## happy mothers day happy mothers day 36
## happy new year happy new year 24
## new york city new york city 24
## im pretty sure im pretty sure 22
## let us know let us know 21
## president barack obama president barack obama 20
## two years ago two years ago 17
## gov chris christie gov chris christie 15
## st louis county st louis county 14
## dont get wrong dont get wrong 13
## world war ii world war ii 12
## dont even know dont even know 11
## dont feel like dont feel like 11
## happy valentines day happy valentines day 11
## high school students high school students 11
## im looking forward im looking forward 11
## looking forward seeing looking forward seeing 11
## rock n roll rock n roll 11
## cant wait get cant wait get 10
Next I plot bar charts of the twelve most frequent terms of each kind.
#plot the unigrams
ggplot(freqDataFrame2[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Unigrams")
# plot the bigrams
ggplot(freqDataFrameByTwoE[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Bigrams")
# plot the trigrams
ggplot(freqDataFrameByThreeB[1:12,], aes(x=reorder(word,freq), y=freq, fill=freq)) +
geom_bar(stat="identity") +
theme(axis.title.y = element_blank()) +
coord_flip() +
labs(y="Frequency", title="Most Common Trigrams")
Next I display two word-cloud representations: the first of my twenty most frequent bigrams, the second of my twenty most frequent trigrams.
# compute word cloud of bigrams
nuvolaDeiBigrams <- wordcloud(freqDataFrameByTwoE[,1], freqDataFrameByTwoE[,2],max.words=20, random.order=FALSE,
rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
# compute word cloud of trigrams
nuvolaDeiTrigrams <- wordcloud(freqDataFrameByThreeB[,1], freqDataFrameByThreeB[,2],max.words=20, random.order=FALSE,
rot.per=0.2, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
Thus far I have tackled the first two of the five questions under consideration. Next comes the third question: "How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?" Let me write a suitable function and extend my answer beyond unique words to bi-grams and tri-grams.
wordsForTargetCoverage <- function(freqDataFrame, targetCoverage) {
  nwords <- 0  # running count of word instances covered so far
  aimedAtCoverage <- targetCoverage*sum(freqDataFrame$freq)
  for (i in 1:nrow(freqDataFrame)) {
    nwords <- nwords + freqDataFrame$freq[i]
    if (nwords >= aimedAtCoverage) {
      return (i)
    }
  }
}
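The coverage figures reported below were obtained with calls of the following form (a usage sketch; the exact values depend on the 1% sample drawn above and on the chosen sparsity thresholds).
wordsForTargetCoverage(freqDataFrame2, 0.5)        # unique words needed for 50% coverage
wordsForTargetCoverage(freqDataFrame2, 0.9)        # unique words needed for 90% coverage
wordsForTargetCoverage(freqDataFrameByTwoE, 0.5)   # bi-grams needed for 50% coverage
wordsForTargetCoverage(freqDataFrameByThreeB, 0.9) # tri-grams needed for 90% coverage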
In order to cover 50% of all word instances in the language (precisely, in my training text file, which, as we saw above, includes 1021430 words), I need 129 words.
In order to cover 90% of all word instances, I need 482 words.
In order to cover 50% of all bi-gram instances in the language (precisely, in my training text file, which includes, after tokenization and sparsity decrease, 929 bi-grams), I need 236 bi-grams.
In order to cover 90% of all bi-gram instances, I need 751 bi-grams.
In order to cover 50% of all tri-gram instances in the language (precisely, in my training text file, which includes, after tokenization and sparsity decrease, 3460 tri-grams), I need 1356 tri-grams.
In order to cover 90% of all tri-gram instances, I need 3040 tri-grams.
Question 4: “How do you evaluate how many of the words come from foreign languages?”
With the function textcat() from the package "textcat" one ought to be able to loop through the words of a text file and remove those that are not identified as English. One could then compare the word counts of the purged and original text files to estimate the proportion of foreign words. This, however, requires further probing of the package's functionality. Moreover, since English borrows innumerable words from foreign languages, the risk of this procedure is that it would deprive the original text file of meaningful sentences.
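As a rough and untested sketch of this idea (it reuses the sample assembled above; textcat() is designed for longer stretches of text, so per-word guesses are noisy and the resulting ratio would be a crude approximation at best):
# classify a small random subset of words and estimate the share not recognized as English
wordsInSample <- unlist(strsplit(tolower(threeSamples), "\\s+"))
someWords <- sample(wordsInSample, min(1000, length(wordsInSample)))
guessedLanguages <- textcat(someWords)
mean(guessedLanguages != "english", na.rm = TRUE)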
Question 5: “Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?”
Beyond the approaches suggested in the question itself, two further ways to increase coverage are context clustering and word clustering. Both pertain to the task of attributing specific meanings, or senses, to a word, so they are both related to word-sense disambiguation. The former aims at clustering the contexts in which the target word occurs, the latter at clustering the words related to the target word. It seems to me that a proper use of these two methods could increase coverage without increasing the number of words. But it's all a bit above my head for now.
Another way to increase coverage is through stemming. In the above preprocessing function, I deliberately commented out the command < tm_map(myCorpus, stemDocument) >, which would have reduced the words in my training file to their stems. Stemming has the great advantage that, for instance, it turns the noun "stop" and the verb "stopped" into the same word: "stop." So, yes, it increases coverage by decreasing the number of words needed to cover a certain desired share of the language. But stemming suffers from a serious disadvantage for word prediction, which is the main goal of this capstone project: if you want to know what comes next after "I stopped by the," my application is unlikely to give you a good result if the verb "to stop" never appears in my data set in a past tense form.
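As a small illustration of this trade-off (the sketch assumes the SnowballC package, on which tm's stemDocument() relies, is installed):
library(SnowballC)
# the base form, plural, past tense and gerund all collapse to the single stem "stop"
wordStem(c("stop", "stops", "stopped", "stopping"), language = "english")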
Lastly, the second part of this capstone project entails implementing a next-word prediction algorithm and displaying the result in a Shiny app. For my algorithm, I plan to rely on the Markov chain principle, which enables one to disregard the overall semantic and contextual history of the queried phrase and to rely instead on the most recent history conveyed by bi-grams and tri-grams. Secondly, I plan to rely on the backoff principle, which enables one to fall back to shorter n-grams when the longer n-grams (my longest n-grams will be tri-grams) do not provide predictive results. Thirdly, in the light of the Kneser-Ney smoothing paradigm, designed to counter data sparsity via discounting (i.e. by attributing slightly lower probability to n-grams present in the corpus and some small probability to n-grams absent from it), I plan to increase the range of my available n-grams by adopting "wild cards." A wild card stands for "any word" (in my case, any word from the corpus). More specifically, in my quest for the next word, I will first check whether there are any tri-grams whose first two words correspond to the last two words of the queried phrase. If yes, I will select, as my predicted word, the last word of the tri-gram with the highest probability. If not, I will check whether there are any tri-grams whose first word is a wild card and whose second word corresponds to the last word of the queried phrase. If yes, the third word of the most likely of these tri-grams (which, to all intents and purposes, are the bi-grams from my corpus) will provide my predicted word. If not, I will switch to examining the tri-grams whose middle word is a wild card. And so on, consistently with the backoff principle.
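To make this plan concrete, here is a minimal sketch of the intended backoff lookup. The helper predictNextWord() is hypothetical (it is not part of the code above); it simply reuses the frequency-sorted tri-gram and bi-gram tables built earlier, and it leaves out wild cards and smoothing.
predictNextWord <- function(phrase,
                            triGrams = freqDataFrameByThreeB,
                            biGrams = freqDataFrameByTwoE) {
  tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
  lastTwo <- paste(tail(tokens, 2), collapse = " ")
  lastOne <- tail(tokens, 1)
  # 1) tri-grams whose first two words match the end of the phrase
  hits <- triGrams[grepl(paste0("^", lastTwo, " "), triGrams$word), ]
  if (nrow(hits) > 0) return(word(as.character(hits$word[1]), 3))  # word() is from stringr
  # 2) back off to bi-grams whose first word matches the last word of the phrase
  hits <- biGrams[grepl(paste0("^", lastOne, " "), biGrams$word), ]
  if (nrow(hits) > 0) return(word(as.character(hits$word[1]), 2))
  # 3) no match: a fuller implementation would fall back to the most frequent unigrams
  NA_character_
}
predictNextWord("I live in new")  # e.g. backs off to the bi-gram "new york" and returns "york"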