The goal of the capstone project is to build a predictive text model like those used by SwiftKey. The model should predict the next word based on a user input of one or more words. To build such an application, one needs to understand Natural Language Processing and how words are put together. As a first step, this involves an analysis of text data to understand the distribution of and relationships between words, tokens and phrases.
In this report, we perform Exploratory Data Analysis on the corpora that we will use to build the predictive text model. The data comes from HC Corpora and can be downloaded here.
The downloaded file contains a number of text files in English and other languages. For both our analysis and the application, we will focus only on the English files, namely “en_US.blogs.txt”, “en_US.news.txt” and “en_US.twitter.txt”. We will study the distributions of word and N-gram frequencies in this dataset. These 3 files are stored in the txt folder of the working directory.
We have created a few functions to simplify and standardise the coding process. The following functions will be initialised. The code for these functions is hidden to keep this report concise, but a description of each function is given in the table below, and an illustrative sketch of a few of these helpers follows the table.
| Function | Description |
|---|---|
| readFile | Function to make a connection to a text file, read and return its content and close the connection |
| dataSummary | Function to gather the summary of the text files |
| numOfWords | Function to count the number of words in a word string |
| numOfPunc | Function to count the number of punctuation marks and special characters in a word string |
| numOfUniqueWords | Function to count the number of unique words in a word string |
| filterBadWords | Function to detect whether a word string contains bad words |
| mySample | Function to sample a smaller percentage of a dataset |
| formatNGram | Function to organise the N-Gram Tokenizer results into a data frame |
| orderNGram | Function to order the tokenized words or phrases in descending order of frequency |
| plotCloud | Function to plot a wordCloud of the tokenized words or phrases |
| plotFreq | Function to plot a barchart of the tokenized words or phrases with high frequency |
| plotCumFreq | Function to plot a cumulative frequency line chart of the tokenized words or phrases indicating the number of tokens to cover 50% and 90% of all instances |
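To give a flavour of these helpers, a minimal sketch of three of them is shown below, assuming the stringr package; the actual implementations used in the analysis may differ slightly.
### Illustrative sketches of readFile, numOfWords and numOfPunc (assumed implementations)
library(stringr)
readFile <- function(filePath) {
  # Open a connection, read all lines (skipping embedded nulls) and close it
  con <- file(filePath, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
numOfWords <- function(x) {
  # Count whitespace-separated words in each string
  str_count(x, "\\S+")
}
numOfPunc <- function(x) {
  # Count characters that are neither alphanumeric nor whitespace
  str_count(x, "[^[:alnum:][:space:]]")
}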
Note that for profanity (bad word) filtering, we will use a list of words banned by Google. This is downloaded from here.
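The filterBadWords helper is likewise only summarised above; one possible sketch, assuming the downloaded list is saved as “badwords.txt” in the txt folder (an illustrative file name and location), is shown below.
### Possible filterBadWords sketch (file name and details are assumptions)
badWords <- readLines(paste0(subDir, "badwords.txt"), skipNul = TRUE)
filterBadWords <- function(x) {
  # TRUE if the record contains none of the banned words (i.e. the record is OK)
  vapply(strsplit(tolower(x), "[^a-z']+"),
         function(w) !any(w %in% tolower(badWords)),
         logical(1))
}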
We also created a function to clean up the corpora for better text mining analysis. This ensures that when we tokenize each word or phrase, each token will be meaningful and useful for text prediction.
### formatCorpus - Function to clean the data for text mining.
### Cleaning includes:
### 1. Removing all non-English characters and symbols
### 2. Changing all English characters to lower case
### 3. Removing all punctuation
### 4. Removing all numbers
### 5. Removing all stopwords (e.g. words that are too common, such as “the”, “a”, “for”)
### 6. Removing extra whitespace
### Note that stemming is commented out as it is not necessary in this analysis
formatCorpus <- function(d) {
#create the toSpace content transformer (replaces a pattern with a space; defined for completeness but not used below)
toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x)) })
#create the toBlank content transformer (removes a pattern entirely)
toBlank <- content_transformer(function(x, pattern) { return (gsub(pattern, "", x)) })
# Set all special characters to blank
d <- tm_map(d, toBlank, "[^a-zA-Z0-9 ]")
# Transform to lower case (need to wrap in content_transformer)
d <- tm_map(d, content_transformer(tolower))
# Remove any punctuation marks that remain after the step above
d <- tm_map(d, content_transformer(removePunctuation))
# Strip digits (remove all numbers)
d <- tm_map(d, content_transformer(removeNumbers))
# Remove stopwords using the standard list in tm
d <- tm_map(d, removeWords, stopwords("english"))
# Strip whitespace
d <- tm_map(d, stripWhitespace)
# Stem document
# dOrg <- d
# d <- tm_map(d, stemDocument)
# d <- tm_map(d, stemCompletion, dictionary=dOrg)
d
}
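As a quick illustration (toy input, not part of the actual pipeline), formatCorpus behaves roughly as follows:
### Toy example of formatCorpus (illustrative only)
toyDoc <- VCorpus(VectorSource(c("The QUICK brown fox, 123 times!")))
toyClean <- formatCorpus(toyDoc)
sapply(toyClean, function(x) x[[1]])
## roughly: "quick brown fox times"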
# Loading all the downloaded data (raw)
twiData <- data.frame(Record = readFile(paste0(subDir,"en_US.twitter.txt")), stringsAsFactors = FALSE)
newsData <- data.frame(Record = readFile(paste0(subDir,"en_US.news.txt")), stringsAsFactors = FALSE)
blogData <- data.frame(Record = readFile(paste0(subDir,"en_US.blogs.txt")), stringsAsFactors = FALSE)
rbind(dataSummary(twiData$Record, "Twitter"),
dataSummary(newsData$Record, "News"),
dataSummary(blogData$Record, "Blog"))
## Data No.of.Lines Max.No.Char Min.No.Char Avg.No.Char
## 1 Twitter 2360148 140 2 68.68045
## 2 News 77259 5760 2 202.42830
## 3 Blog 899288 40833 1 229.98695
We observe that the Twitter file contains the largest number of lines; however, each line is limited to a maximum of 140 characters. Blog posts tend to be longer than news articles. Notice also that there are records with only 1-2 characters, while the average number of characters for news and blog posts is only slightly above 200. This indicates a significant number of short or invalid posts, and it informs which records we should filter out and which we should select for the text mining analysis.
Due to system limitations and the large size of the data sets, it is not practical to perform exploratory data analysis on the full data. We approach the sampling process by first filtering out as many records as possible that are deemed not useful for text prediction.
The filtering considerations are as follows:
### Filter Records with a high percentage of characters that are punctuation marks
twiData$nChar <- nchar(twiData$Record)
newsData$nChar <- nchar(newsData$Record)
blogData$nChar <- nchar(blogData$Record)
twiData$numPunc <- numOfPunc(twiData$Record)
newsData$numPunc <- numOfPunc(newsData$Record)
blogData$numPunc <- numOfPunc(blogData$Record)
twiData$pctPunc <- twiData$numPunc/twiData$nChar
newsData$pctPunc <- newsData$numPunc/newsData$nChar
blogData$pctPunc <- blogData$numPunc/blogData$nChar
twiData <- twiData[twiData$pctPunc < quantile(twiData$pctPunc,0.1,na.rm=TRUE),]
newsData <- newsData[newsData$pctPunc < quantile(newsData$pctPunc,0.35,na.rm=TRUE),]
blogData <- blogData[blogData$pctPunc < quantile(blogData$pctPunc,0.35,na.rm=TRUE),]
### Filter Records with Bad Words
twiData$wordOK <- filterBadWords(twiData$Record)
newsData$wordOK <- filterBadWords(newsData$Record)
blogData$wordOK <- filterBadWords(blogData$Record)
twiData <- twiData[twiData$wordOK == TRUE,]
newsData <- newsData[newsData$wordOK == TRUE,]
blogData <- blogData[blogData$wordOK == TRUE,]
### Filter Records with a high ratio of words to character length (i.e. records made up of many very short words)
twiData$numWords <- numOfWords(twiData$Record)
newsData$numWords <- numOfWords(newsData$Record)
blogData$numWords <- numOfWords(blogData$Record)
twiData$ratioWC <- twiData$numWords/twiData$nChar
newsData$ratioWC <- newsData$numWords/newsData$nChar
blogData$ratioWC <- blogData$numWords/blogData$nChar
twiData <- twiData[twiData$ratioWC < quantile(twiData$ratioWC,0.8,na.rm=TRUE),]
newsData <- newsData[newsData$ratioWC < quantile(newsData$ratioWC,0.8,na.rm=TRUE),]
blogData <- blogData[blogData$ratioWC < quantile(blogData$ratioWC,0.8,na.rm=TRUE),]
### Filter Records with only one word
twiData <- twiData[twiData$numWords > 1,]
newsData <- newsData[newsData$numWords > 1,]
blogData <- blogData[blogData$numWords > 1,]
### Filter Records with many repeated words
twiData$uniqueWords <- numOfUniqueWords(twiData$Record)
newsData$uniqueWords <- numOfUniqueWords(newsData$Record)
blogData$uniqueWords <- numOfUniqueWords(blogData$Record)
twiData$pctRepeat <- twiData$uniqueWords/twiData$numWords
newsData$pctRepeat <- newsData$uniqueWords/newsData$numWords
blogData$pctRepeat <- blogData$uniqueWords/blogData$numWords
twiData <- twiData[twiData$pctRepeat > quantile(twiData$pctRepeat,0.1,na.rm=TRUE)
| twiData$uniqueWords > 20,]
newsData <- newsData[newsData$pctRepeat > quantile(newsData$pctRepeat,0.1,na.rm=TRUE)
| newsData$uniqueWords > 20,]
blogData <- blogData[blogData$pctRepeat > quantile(blogData$pctRepeat,0.1,na.rm=TRUE)
| blogData$uniqueWords > 20,]
### Select Records with fewer words from Twitter and Records with more words from news and blogs
twiData <- twiData[twiData$numWords < quantile(twiData$numWords,0.5,na.rm=TRUE),]
newsData <- newsData[newsData$numWords > quantile(newsData$numWords,0.5,na.rm=TRUE),]
blogData <- blogData[blogData$numWords > quantile(blogData$numWords,0.5,na.rm=TRUE),]
rbind(dataSummary(twiData$Record, "Twitter"),
dataSummary(newsData$Record, "News"),
dataSummary(blogData$Record, "Blog"))
## Data No.of.Lines Max.No.Char Min.No.Char Avg.No.Char
## 1 Twitter 73393 140 9 21.6236
## 2 News 10326 1929 195 318.4692
## 3 Blog 115412 37191 206 470.8799
This is the final list of records ready for sampling and text mining analysis. Note that the total number of lines in the 3 datasets has been significantly reduced.
To further work around the system and memory limitations, we will break each text mining analysis into smaller datasets sampled from the final dataset. The results of all the analyses will then be consolidated.
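The mySample helper simply draws a percentage of the lines at random; a minimal sketch (the actual implementation may differ) is shown below.
### Minimal sketch of mySample (assumed implementation)
mySample <- function(x, pct) {
  x[sample.int(length(x), size = round(length(x) * pct / 100))]
}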
### Final data for text mining
tDataFinal <- twiData$Record
nDataFinal <- newsData$Record
bDataFinal <- blogData$Record
token_delim <- " \\t\\r\\n.!?,;\"()"
reSamp <- 20
pctage <- 5
for (i in c(1:reSamp)) {
myCorpus <- c(mySample(tDataFinal,pctage),mySample(nDataFinal,pctage),mySample(bDataFinal,pctage))
print(paste(Sys.time(), "-- Processing sample", i, "-", length(myCorpus), "lines"))
doc <- VCorpus(VectorSource(myCorpus))
doc2 <- formatCorpus(doc)
doc2 <- data.frame(sapply(doc2, function(x) {x[[1]]}),stringsAsFactors=F)
n1Grams <- NGramTokenizer(doc2, Weka_control(min = 1, max = 1, delimiters = token_delim))
n2Grams <- NGramTokenizer(doc2, Weka_control(min = 2, max = 2, delimiters = token_delim))
n3Grams <- NGramTokenizer(doc2, Weka_control(min = 3, max = 3, delimiters = token_delim))
dfN1Grams <- formatNGram(table(n1Grams))
dfN2Grams <- formatNGram(table(n2Grams))
dfN3Grams <- formatNGram(table(n3Grams))
if (i == 1) {
dfN1GramsComb <- dfN1Grams
dfN2GramsComb <- dfN2Grams
dfN3GramsComb <- dfN3Grams
} else {
dfN1GramsComb <- rbind(dfN1GramsComb, dfN1Grams)
dfN2GramsComb <- rbind(dfN2GramsComb, dfN2Grams)
dfN3GramsComb <- rbind(dfN3GramsComb, dfN3Grams)
}
dfN1GramsComb <- orderNGram(dfN1GramsComb)
dfN2GramsComb <- orderNGram(dfN2GramsComb)
dfN3GramsComb <- orderNGram(dfN3GramsComb)
}
dfN1GramsComb$cumFreq <- cumsum(dfN1GramsComb$Freq)/sum(dfN1GramsComb$Freq)
A few of the most popular unigrams are “one”, “will” and “can”. These are shown in the wordcloud and frequency barplot below. We have also plotted the cumulative frequency of all unigrams. A total of 1097 and 14796 unigrams are needed to cover 50% and 90% of all 107190 word instances respectively.
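For reference, the coverage figures quoted above can be read off the cumulative frequency column, for example with a small helper added here purely for illustration:
### Number of distinct unigrams needed to cover a given share of all word instances
coverage <- function(df, p) min(which(df$cumFreq >= p))
coverage(dfN1GramsComb, 0.5)  # 1097 unigrams cover 50% of instances
coverage(dfN1GramsComb, 0.9)  # 14796 unigrams cover 90% of instances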
plotCloud(dfN1GramsComb,2,0.5,10,200)
plotFreq(dfN1GramsComb,10)
plotCumFreq(dfN1GramsComb)
A few of the most popular bigrams are “years ago”, “new york” and “even though”, as shown in the wordcloud and frequency barplot below.
plotCloud(dfN2GramsComb,2,0.2,10,150)
plotFreq(dfN2GramsComb,10)
Similarly, the most popular trigrams are shown in the wordcloud and frequency barplot below:
plotCloud(dfN3GramsComb,2,0.3,10,100)
plotFreq(dfN3GramsComb,10)
Now that we understand the characteristics of the raw data, know how to clean them, and have identified the most popular words and phrases, we can move on to the next step: building an n-gram model and deciding how the model should handle words not seen in the data. Techniques such as smoothing and backoff models will be considered.
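As a rough preview of the backoff idea (a sketch only, not the final model, and assuming formatNGram stores the n-gram text in a column named Token), a prediction could fall back from trigrams to bigrams to the most frequent unigram:
### Illustrative backoff lookup over the combined n-gram tables (column name Token is an assumption)
predictNext <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  hit <- data.frame()
  # Trigrams whose first two words match the last two words of the input
  if (length(words) == 2)
    hit <- dfN3GramsComb[grepl(paste0("^", paste(words, collapse = " "), " "), dfN3GramsComb$Token), ]
  # Back off to bigrams starting with the last word, then to the top unigram
  if (nrow(hit) == 0)
    hit <- dfN2GramsComb[grepl(paste0("^", tail(words, 1), " "), dfN2GramsComb$Token), ]
  if (nrow(hit) == 0) hit <- dfN1GramsComb
  # Return the last word of the most frequent matching n-gram
  tail(strsplit(hit$Token[1], " ")[[1]], 1)
}
Given the bigram counts above, a call like predictNext("years") would likely return “ago”.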
In the process of building the prediction model, we will split the final dataset into a training set and a test set: the training set for training the model and the test set for validating its accuracy. Different types of n-gram models will be explored. The best model will be built into a Shiny App with an input field for the user to enter a word or phrase; the application will return its prediction of the next word.
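A simple way to form such a split over the filtered records (an illustrative 80/20 split; the final proportions may differ) would be:
### Illustrative 80/20 train/test split of the filtered Twitter records
set.seed(123)
trainIdx <- sample.int(length(tDataFinal), size = round(0.8 * length(tDataFinal)))
twiTrain <- tDataFinal[trainIdx]
twiTest <- tDataFinal[-trainIdx]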
Separately, the system performance of the application will also be evaluated to ensure that its memory footprint and run time stay within reasonable limits, so that all users have a good experience when using the text prediction program.
Libraries required for this project: RWeka, R.utils, stringr, SnowballC, tm, slam, dplyr, wordcloud, ggplot2, parallel, doParallel
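These are loaded at the start of the analysis script, for example:
### Load the required packages
library(RWeka); library(R.utils); library(stringr); library(SnowballC)
library(tm); library(slam); library(dplyr); library(wordcloud)
library(ggplot2); library(parallel); library(doParallel)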