Introduction

The purpose of this document is to initiate the process that will lead to the creation of a Natural Language Processing (NLP) tool that predicts the next word in a sentence as it is being typed. Most smartphones come with such functionality; in fact SwiftKey, a company that produces one such keyboard for smartphones, is involved in this project.

The Data

Data for this project was sourced from a corpus called HC Corpora. Only the English language corpus was processed. It consists of three text files:

  * Blog posts
  * News articles
  * Twitter messages

The data had to be cleaned of offensive and profane words. A balance had to be reached so that words with a dual meaning (e.g. balls or penis) are not removed. I decided in favour of retaining words that have dual meanings and removing only clearly offensive words.
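As a rough illustration, this kind of filtering can be done by removing any word that appears on a profanity blacklist while keeping an explicit list of dual-meaning words. The sketch below is only an assumption of how this could look; profanity.txt and the keep-list are hypothetical, not the actual lists used here.

profanity <- readLines("profanity.txt", warn = FALSE)      # hypothetical blacklist file
keepList  <- c("balls")                                    # dual-meaning words to retain
profanity <- setdiff(tolower(profanity), keepList)

# Remove blacklisted words from a character vector of text lines
removeProfanity <- function(lines, badWords) {
    pattern <- paste0("\\b(", paste(badWords, collapse = "|"), ")\\b")
    gsub(pattern, "", lines, ignore.case = TRUE)
}

# e.g. dataBlogs <- removeProfanity(dataBlogs, profanity)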

Loading Data

The following is some basic information about the raw text files that will be processed:

  * File size of the Blogs file: 200.4 MB
  * File size of the News file: 196.3 MB
  * File size of the Twitter file: 159.4 MB
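The sizes above can be reproduced directly from the files on disk, for example like this (the paths are assumptions based on the usual HC Corpora download layout):

# File sizes in megabytes (paths assumed)
files <- c("data/final/en_US/en_US.blogs.txt",
           "data/final/en_US/en_US.news.txt",
           "data/final/en_US/en_US.twitter.txt")
round(file.info(files)$size / 1024^2, 1)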

The data was originally downloaded on Fri Mar 13 09:45:13 2015.

In order to reduce processing time, the source file loading and profanity cleanup are done once, and the intermediate files are loaded automatically on subsequent runs. One needs to remove the file data/dsscapstone-003-001.RData to run the process from the beginning.
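A minimal sketch of that caching pattern, assuming the file names and object names used elsewhere in this report, looks like this:

cacheFile <- "data/dsscapstone-003-001.RData"
if (file.exists(cacheFile)) {
    # subsequent runs: load the already cleaned data
    load(cacheFile)
} else {
    # first run: read the raw files (paths assumed), clean them and cache the result
    dataBlogs   <- readLines("data/final/en_US/en_US.blogs.txt",   warn = FALSE, encoding = "UTF-8")
    dataNews    <- readLines("data/final/en_US/en_US.news.txt",    warn = FALSE, encoding = "UTF-8")
    dataTwitter <- readLines("data/final/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
    # ... profanity cleanup goes here ...
    save(dataBlogs, dataNews, dataTwitter, file = cacheFile)
}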

Information about the data that will be used (profane words removed):

Text Source        Lines         Words
Blogs            899,288    39,142,668
News           1,010,242    36,749,803
Twitter        2,360,148    32,868,702
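The averages plotted below (avgB, avgN, avgT) are the word counts divided by the line counts for each source. A minimal sketch of that computation, assuming the stringi package was used for word counting, is:

library(stringi)

# Lines, total words and average words per line per source (sketch only;
# the report may have counted words differently)
lenB <- length(dataBlogs);   wrdB <- sum(stri_count_words(dataBlogs));   avgB <- wrdB / lenB
lenN <- length(dataNews);    wrdN <- sum(stri_count_words(dataNews));    avgN <- wrdN / lenN
lenT <- length(dataTwitter); wrdT <- sum(stri_count_words(dataTwitter)); avgT <- wrdT / lenT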
barplot(c(avgB, avgN, avgT), border="tan2", names.arg=c("Blogs", "News", "Twitter"), ylab="Words per line", xlab="Source", main="Average Words / Posting")

#cleanup
rm (lenB, wrdB, avgB, lenN, wrdN, avgN, lenT, wrdT, avgT)

Cleaning the Corpora

The code below takes a sample of the data and cleans the resulting corpora.

# https://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
# https://stackoverflow.com/questions/24311561/how-to-use-stemdocument-in-r
# Function that cleans up the passed corpus and returns the cleaned text
CleanUp <- function (corpus) {
    # content_transformer() keeps each document a PlainTextDocument, so a
    # separate PlainTextDocument conversion step is not needed
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, stemDocument)
    
    return (corpus)
}

dataBlogs <- sample(dataBlogs, 1000)
dataNews <- sample(dataNews, 1000)
dataTwitter <- sample(dataTwitter, 1000)

corpusBlogs <- Corpus(VectorSource(dataBlogs))
corpusNews <- Corpus(VectorSource(dataNews))
corpusTwitter <- Corpus(VectorSource(dataTwitter))

# cleanup 
rm (dataBlogs, dataNews, dataTwitter)

corpusData <- list(corpusBlogs, corpusNews, corpusTwitter)

# cleanup 
rm (corpusBlogs, corpusNews, corpusTwitter)

# Clean up the data
for (i in seq_along(corpusData)) {
    corpusData[[i]] <- CleanUp(corpusData[[i]])
}

Word Cloud

The word clouds give an idea of the most popular words in each corpus.

par(mfrow = c(1,3))
Titles <- c("Blogs", "News", "Twitter")

for(i in 1:3){
    dtm <- DocumentTermMatrix(corpusData[[i]])
    # plot word cloud (col_sums() comes from the slam package)
    wordcloud(words=colnames(dtm), freq=col_sums(dtm), scale=c(5,0.5), max.words=100, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
    title(Titles[i])
}

# cleanup
rm (dtm)

n-Grams

Various n-Gram charts of the different text sources are shown below. To keep the charts readable, only the 25 most frequent tokens of each are shown.

# Calculate word frequencies
for(i in 1:3) {
    tdm <- TermDocumentMatrix(corpusData[[i]])
    wordFreq <- findFreqTerms(tdm, lowfreq=200)
    print (paste0("Word frequency for ", Titles[i]))
    # head() avoids the NA padding produced by wordFreq[1:20] when fewer than
    # 20 terms reach the 200-occurrence threshold in a 1,000 line sample
    print (head(wordFreq, 20))
}
## [1] "Word frequency for Blogs"
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [1] "Word frequency for News"
##  [1] "said" NA     NA     NA     NA     NA     NA     NA     NA     NA    
## [11] NA     NA     NA     NA     NA     NA     NA     NA     NA     NA    
## [1] "Word frequency for Twitter"
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# CleanUp
rm (tdm, wordFreq)
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
par(mfrow = c(3,1))
for(i in 1:3) {
    # For each item compute an analysis of different token lengths
    dtm <- DocumentTermMatrix(corpusData[[i]], control=list(tokenize=UnigramTokenizer))
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)[1:25]
    
    bar <- barplot(freq, axes=FALSE, axisnames=FALSE, ylab="Frequency", main=paste0("Frequency of 1-Grams  for ",Titles[i]))
    text(bar, par("usr")[3], labels=names(freq), srt=60, adj=c(1.1,1.1), xpd=TRUE, cex=0.9)
    axis(2)
}

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
par(mfrow = c(3,1))
for(i in 1:3) {
    # For each item compute an analysis of different token lengths
    dtm <- DocumentTermMatrix(corpusData[[i]], control=list(tokenize=BigramTokenizer))
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)[1:25]
    
    bar <- barplot(freq, axes=FALSE, axisnames=FALSE, ylab="Frequency", main=paste0("Frequency of 2-Grams for ",Titles[i]))
    text(bar, par("usr")[3], labels=names(freq), srt=60, adj=c(1.1,1.1), xpd=TRUE, cex=0.9)
    axis(2)
}

TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
par(mfrow = c(3,1))
for(i in 1:3) {
    # For each item compute an analysis of different token lengths
    dtm <- DocumentTermMatrix(corpusData[[i]], control=list(tokenize=TrigramTokenizer))
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)[1:25]
    
    bar <- barplot(freq, axes=FALSE, axisnames=FALSE, ylab="Frequency", main=paste0("Frequency of 3-Grams for ",Titles[i]))
    text(bar, par("usr")[3], labels=names(freq), srt=60, adj=c(1.1,1.1), xpd=TRUE, cex=0.9)
    axis(2)
}

QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
par(mfrow = c(3,1))
for(i in 1:3) {
    # For each item compute an analysis of different token lengths
    dtm <- DocumentTermMatrix(corpusData[[i]], control=list(tokenize=QuadgramTokenizer))
    freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)[1:25]
    
    bar <- barplot(freq, axes=FALSE, axisnames=FALSE, ylab="Frequency", main=paste0("Frequency of 4-Grams for ",Titles[i]))
    text(bar, par("usr")[3], labels=names(freq), srt=60, adj=c(1.1,1.1), xpd=TRUE, cex=0.9)
    axis(2)
}

rm (dtm, freq, bar)

It can be observed that while some words are common to all sources, each source seems to have its own style of writing.

Tasks that need to be accomplished

This data will be used to create the NLP algorithm. The general steps that need to be performed are the following:

  1. Take a sample of the data that will be used to build and test a model. This is also necessary because the algorithm must operate in a reasonable amount of time and computing resources are limited.
  2. Build the prediction algorithm (a rough back-off sketch follows this list).
  3. Once the algorithm has been fine-tuned, develop and publish a Shiny app.
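As a rough illustration of step 2, the prediction can be sketched as a simple back-off over the n-gram frequency tables explored above: look up the last few typed words in the longest n-gram table first, and fall back to shorter n-grams when there is no match. The table layout (data frames with prefix and nextWord columns) and the function below are assumptions, not the final design.

# ngramTables[[k]] is assumed to hold (k+1)-grams as a data frame with a
# k-word 'prefix' column and a 'nextWord' column, sorted by frequency
predictNext <- function(phrase, ngramTables) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    for (k in rev(seq_along(ngramTables))) {      # try the longest prefix first
        if (length(words) >= k) {
            prefix <- paste(tail(words, k), collapse = " ")
            hits <- ngramTables[[k]][ngramTables[[k]]$prefix == prefix, ]
            if (nrow(hits) > 0) return(hits$nextWord[1])
        }
    }
    "the"                                         # no match: back off to a very common word
}

# Example call with hypothetical tables:
# predictNext("thanks for the", list(bigrams, trigrams, quadgrams))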