Milestone Report - Exploratory Data Analysis


Introduction

The goal of the capstone project is to build a predictive text model like those used by SwiftKey. The model should predict the next word based on a user input of one or more words. To build such an application, one needs to understand Natural Language Processing and how words are put together. As a first step, this involves an analysis of text data to understand the distribution of, and relationships between, words, tokens and phrases.

In this report, we perform Exploratory Data Analysis on the corpora that we will use to build the predictive text model. The data comes from HC Corpora and can be downloaded here.

The downloaded file contains a number of text files in English and other languages. For both the analysis and the application, we will focus only on the English files, namely “en_US.blogs.txt”, “en_US.news.txt” and “en_US.twitter.txt”. We will study the distributions of word and N-gram frequencies in these datasets. The three files are stored in the txt folder of the working directory.

Pre-analysis Initialisation

We have created a few functions to simplify and standardise the coding process. The following functions will be initialised. The code for these functions is hidden to keep this report concise, but a description of each function is given in the table below.

Function Description
readFile Function to make a connection to a text file, read and return its content and close the connection
dataSummary Function to gather the summary of the text files
numOfWords Function to count the number of words in a word string
numOfPunc Function to count the number of punctuation marks and special characters in a word string
numOfUniqueWords Function to count the number of unique words in a word string
filterBadWords Function to detect whether a word string contains bad words
mySample Function to sample a smaller percentage of a dataset
formatNGram Function to organise the N-Gram Tokenizer results into a data frame
orderNGram Function to order the tokenized words or phrases in descending order of frequency
plotCloud Function to plot a word cloud of the tokenized words or phrases
plotFreq Function to plot a bar chart of the tokenized words or phrases with high frequency
plotCumFreq Function to plot a cumulative frequency line chart of the tokenized words or phrases indicating the number of tokens to cover 50% and 90% of all instances
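
As an illustration, minimal sketches of a few of these helpers are shown below. The bodies are assumptions made for this report; the actual (hidden) implementations may differ.

### Illustrative sketches of some of the hidden helpers (assumptions, not the actual code)

readFile <- function(filePath) {
    # Open a connection, read all lines, and close the connection on exit
    con <- file(filePath, "r")
    on.exit(close(con))
    readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

numOfWords <- function(x) {
    # Count whitespace-separated words in each string
    stringr::str_count(x, "\\S+")
}

numOfPunc <- function(x) {
    # Count punctuation marks and special characters in each string
    stringr::str_count(x, "[^a-zA-Z0-9 ]")
}

numOfUniqueWords <- function(x) {
    # Count distinct words in each string
    sapply(strsplit(tolower(x), "\\s+"), function(w) length(unique(w)))
}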

Note that for profanity (bad word) filtering, we will use a list of banned words published by Google. This is downloaded from here.
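
The filterBadWords helper itself is hidden. A minimal sketch, assuming the banned-word list has been saved as "badwords.txt" (one word per line) in the working directory, might look like the following; the TRUE/FALSE convention matches its later use, where records with wordOK == TRUE are kept.

### Illustrative sketch of filterBadWords (an assumption, not the actual code)

badWords <- readLines("badwords.txt", warn = FALSE)
badWordsPattern <- paste0("\\b(", paste(badWords, collapse = "|"), ")\\b")

filterBadWords <- function(x) {
    # Return TRUE for strings that contain none of the banned words
    !grepl(badWordsPattern, tolower(x))
}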

We also created a function to clean up the corpora for better text mining analysis, so that when we tokenize each word or phrase, each token is meaningful and useful for text prediction.

### formatCorpus - Function to clean the data for text mining.
### Cleaning includes:
###     1. Removing all non-English characters and symbols
###     2. Changing all English characters to lower case
###     3. Removing all punctuation
###     4. Removing all numbers
###     5. Removing all stopwords (e.g. words that are too common, like "the", "a", "for", etc.)
###     6. Removing any extra whitespace
### Note that stemming is commented out as it is not necessary in this analysis

formatCorpus <- function(d) {
    # Create the toSpace content transformer (defined here but not used below)
    toSpace <- content_transformer(function(x, pattern) { return (gsub(pattern, " ", x)) })
    # Create the toBlank content transformer
    toBlank <- content_transformer(function(x, pattern) { return (gsub(pattern, "", x)) })
    # Set all special characters to blank
    d <- tm_map(d, toBlank, "[^a-zA-Z0-9 ]")
    # Transform to lower case (need to wrap in content_transformer)
    d <- tm_map(d, content_transformer(tolower))
    # Remove any remaining punctuation marks
    d <- tm_map(d, content_transformer(removePunctuation))  
    # Strip digits
    d <- tm_map(d, content_transformer(removeNumbers))      
    # Remove stopwords using the standard list in tm
    d <- tm_map(d, removeWords, stopwords("english"))       
    # Strip whitespace
    d <- tm_map(d, stripWhitespace)                         
    # Stem document
    # dOrg <- d
    # d <- tm_map(d, stemDocument)                           
    # d <- tm_map(d, stemCompletion, dictionary=dOrg)
    d
}
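
As a quick illustrative check of formatCorpus, applying it to a small toy corpus behaves roughly as follows; the example text is made up for demonstration.

### Illustrative usage of formatCorpus on a toy corpus

toyText <- c("Hello World! 123", "The CAT sat on the mat...")
toyCorpus <- VCorpus(VectorSource(toyText))
toyClean <- formatCorpus(toyCorpus)
sapply(toyClean, as.character)
# Expected to return roughly "hello world" and "cat sat mat"
# (case is lowered, punctuation and numbers are stripped, and
#  stopwords such as "the" and "on" are removed)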

Data Loading

# Loading all the downloaded data (RAW)

twiData <- data.frame(Record = readFile(paste0(subDir,"en_US.twitter.txt")), stringsAsFactors = FALSE)
newsData <- data.frame(Record = readFile(paste0(subDir,"en_US.news.txt")), stringsAsFactors = FALSE)
blogData <- data.frame(Record = readFile(paste0(subDir,"en_US.blogs.txt")), stringsAsFactors = FALSE)

rbind(dataSummary(twiData$Record, "Twitter"),
      dataSummary(newsData$Record, "News"),
      dataSummary(blogData$Record, "Blog"))
##      Data No.of.Lines Max.No.Char Min.No.Char Avg.No.Char
## 1 Twitter     2360148         140           2    68.68045
## 2    News       77259        5760           2   202.42830
## 3    Blog      899288       40833           1   229.98695

We observe that the Twitter file contains the largest number of lines; however, each line is limited to a maximum of 140 characters. Blog posts tend to be longer than news articles. Notice that there are records with only 1-2 characters, and that the average number of characters for news and blog posts is only slightly more than 200. This indicates a significant number of short or invalid posts, and informs which records we should filter out and which we should keep for the text mining analysis.
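
For reference, the summary table above could be produced by a dataSummary function along the following lines; this is a sketch only, and the actual hidden implementation may differ.

### Illustrative sketch of dataSummary (an assumption, not the actual code)

dataSummary <- function(records, label) {
    # Summarise line count and character-length statistics for a set of records
    nChars <- nchar(records)
    data.frame(Data = label,
               No.of.Lines = length(records),
               Max.No.Char = max(nChars),
               Min.No.Char = min(nChars),
               Avg.No.Char = mean(nChars),
               stringsAsFactors = FALSE)
}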

Data Sampling and Processing

Due to system limitations and the size of the data, it is not practical to perform exploratory data analysis on the full datasets. We approach the sampling process by first filtering out as many records as possible that are deemed not useful for text prediction.

The filtering considerations are as follows:

  1. Filter records with a high percentage of punctuation marks and special characters. In properly constructed English sentences or paragraphs, we do not expect many special characters, so such records are of no use for text prediction modeling.
### Filter Records with a high percentage of characters that are punctuations

twiData$nChar <- nchar(twiData$Record)
newsData$nChar <- nchar(newsData$Record)
blogData$nChar <- nchar(blogData$Record)

twiData$numPunc <- numOfPunc(twiData$Record)
newsData$numPunc <- numOfPunc(newsData$Record)
blogData$numPunc <- numOfPunc(blogData$Record)

twiData$pctPunc <- twiData$numPunc/twiData$nChar
newsData$pctPunc <- newsData$numPunc/newsData$nChar
blogData$pctPunc <- blogData$numPunc/blogData$nChar

twiData <- twiData[twiData$pctPunc < quantile(twiData$pctPunc,0.1,na.rm=TRUE),]
newsData <- newsData[newsData$pctPunc < quantile(newsData$pctPunc,0.35,na.rm=TRUE),]
blogData <- blogData[blogData$pctPunc < quantile(blogData$pctPunc,0.35,na.rm=TRUE),]
  2. Filter records containing bad words. As we do not want the text prediction model to return a bad word as a prediction, we avoid this scenario by excluding such records from the dataset.
### Filter Records with Bad Words

twiData$wordOK <- filterBadWords(twiData$Record)
newsData$wordOK <- filterBadWords(newsData$Record)
blogData$wordOK <- filterBadWords(blogData$Record)

twiData <- twiData[twiData$wordOK == TRUE,]
newsData <- newsData[newsData$wordOK == TRUE,]
blogData <- blogData[blogData$wordOK == TRUE,]
  3. Filter records with many words but a small total number of characters. This is highly unlikely in a proper sentence structure: if there are many words, we expect the string to be long and hence contain many characters. However, such records can occur when there are many short forms, like ‘u’ for ‘you’ or ‘m’ for ‘am’. We aim to avoid short forms in the text prediction model.
### Filter the records with high words to record character length ratio

twiData$numWords <- numOfWords(twiData$Record)
newsData$numWords <- numOfWords(newsData$Record)
blogData$numWords <- numOfWords(blogData$Record)

twiData$ratioWC <- twiData$numWords/twiData$nChar
newsData$ratioWC <- newsData$numWords/newsData$nChar
blogData$ratioWC <- blogData$numWords/blogData$nChar

twiData <- twiData[twiData$ratioWC < quantile(twiData$ratioWC,0.8,na.rm=TRUE),]
newsData <- newsData[newsData$ratioWC < quantile(newsData$ratioWC,0.8,na.rm=TRUE),]
blogData <- blogData[blogData$ratioWC < quantile(blogData$ratioWC,0.8,na.rm=TRUE),]
  4. Filter records with only one word. Since the text prediction model predicts the next word, such records are not useful.
### Filter Records with only one word

twiData <- twiData[twiData$numWords > 1,]
newsData <- newsData[newsData$numWords > 1,]
blogData <- blogData[blogData$numWords > 1,]
  5. Filter records with too many repeated words, like ‘ha ha ha’, ‘yes yes yes’, etc. Such records do not follow a natural sentence structure and hence are not useful for text prediction; they may also bias the prediction model.
### Filter Records with many repeated words

twiData$uniqueWords <- numOfUniqueWords(twiData$Record)
newsData$uniqueWords <- numOfUniqueWords(newsData$Record)
blogData$uniqueWords <- numOfUniqueWords(blogData$Record)

twiData$pctRepeat <- twiData$uniqueWords/twiData$numWords
newsData$pctRepeat <- newsData$uniqueWords/newsData$numWords
blogData$pctRepeat <- blogData$uniqueWords/blogData$numWords

twiData <- twiData[twiData$pctRepeat > quantile(twiData$pctRepeat,0.1,na.rm=TRUE) 
                   | twiData$uniqueWords > 20,]
newsData <- newsData[newsData$pctRepeat > quantile(newsData$pctRepeat,0.1,na.rm=TRUE) 
                     | newsData$uniqueWords > 20,]
blogData <- blogData[blogData$pctRepeat > quantile(blogData$pctRepeat,0.1,na.rm=TRUE) 
                     | blogData$uniqueWords > 20,]
  6. To further limit the records for analysis, we perform a selection based on the number of words in each record. We select Twitter records with fewer words, since Twitter records tend to contain shorter word groupings like ‘Happy Birthday’, ‘Good Morning’, etc. On the other hand, we expect news and blog posts to have longer and more natural sentence structures, hence we select news and blog records with more words.
### Select Records with few words for twitter records and more words with news and blog records

twiData <- twiData[twiData$numWords < quantile(twiData$numWords,0.5,na.rm=TRUE),]
newsData <- newsData[newsData$numWords > quantile(newsData$numWords,0.5,na.rm=TRUE),]
blogData <- blogData[blogData$numWords > quantile(blogData$numWords,0.5,na.rm=TRUE),]

rbind(dataSummary(twiData$Record, "Twitter"),
      dataSummary(newsData$Record, "News"),
      dataSummary(blogData$Record, "Blog"))
##      Data No.of.Lines Max.No.Char Min.No.Char Avg.No.Char
## 1 Twitter       73393         140           9     21.6236
## 2    News       10326        1929         195    318.4692
## 3    Blog      115412       37191         206    470.8799

This is the final list of records ready for sampling and text mining analysis. Note that the total number of lines in the three datasets has been reduced considerably.

Text Mining Analysis

To further work around the system and memory limitations, we break the text mining analysis into smaller datasets sampled repeatedly from the final dataset. The results from all samples are consolidated at the end.
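
The sampling and N-gram helpers used in the loop below (mySample, formatNGram and orderNGram) are hidden. Minimal sketches of how they might work are shown here; these are assumptions for illustration, not the actual implementations.

### Illustrative sketches of the sampling and N-gram helpers (assumptions)

mySample <- function(dat, pct) {
    # Randomly sample pct percent of the lines in a character vector
    dat[sample(length(dat), size = ceiling(length(dat) * pct / 100))]
}

formatNGram <- function(tokenTable) {
    # Organise a table of token counts into a data frame of tokens and frequencies
    data.frame(Token = names(tokenTable),
               Freq  = as.integer(tokenTable),
               stringsAsFactors = FALSE)
}

orderNGram <- function(df) {
    # Sum the frequencies of duplicate tokens across samples and
    # sort in descending order of frequency
    df <- aggregate(Freq ~ Token, data = df, FUN = sum)
    df[order(-df$Freq), ]
}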

### Final data for text mining

tDataFinal <- twiData$Record
nDataFinal <- newsData$Record
bDataFinal <- blogData$Record

token_delim <- " \\t\\r\\n.!?,;\"()"
reSamp <- 20
pctage <- 5

for (i in c(1:reSamp)) {
    
    myCorpus <- c(mySample(tDataFinal,pctage),mySample(nDataFinal,pctage),mySample(bDataFinal,pctage))
    
    print(paste(Sys.time(), "-- Processing sample", i, "with", length(myCorpus), "lines"))
    
    doc <- VCorpus(VectorSource(myCorpus))
    doc2 <- formatCorpus(doc)
    doc2 <- data.frame(sapply(doc2, function(x) {x[[1]]}),stringsAsFactors=F)
    
    n1Grams <- NGramTokenizer(doc2, Weka_control(min = 1, max = 1, delimiters = token_delim))
    n2Grams <- NGramTokenizer(doc2, Weka_control(min = 2, max = 2, delimiters = token_delim))
    n3Grams <- NGramTokenizer(doc2, Weka_control(min = 3, max = 3, delimiters = token_delim))
    
    dfN1Grams <- formatNGram(table(n1Grams))
    dfN2Grams <- formatNGram(table(n2Grams))
    dfN3Grams <- formatNGram(table(n3Grams))
    
    if (i == 1) {
        dfN1GramsComb <- dfN1Grams
        dfN2GramsComb <- dfN2Grams
        dfN3GramsComb <- dfN3Grams
    } else {
        dfN1GramsComb <- rbind(dfN1GramsComb, dfN1Grams)
        dfN2GramsComb <- rbind(dfN2GramsComb, dfN2Grams)
        dfN3GramsComb <- rbind(dfN3GramsComb, dfN3Grams)
    }
    
    dfN1GramsComb <- orderNGram(dfN1GramsComb)
    dfN2GramsComb <- orderNGram(dfN2GramsComb)
    dfN3GramsComb <- orderNGram(dfN3GramsComb)    
    
}  

dfN1GramsComb$cumFreq <- cumsum(dfN1GramsComb$Freq)/sum(dfN1GramsComb$Freq)

Results

Unigram

A few of the most popular unigrams are “one”, “will” and “can”. These are shown in the word cloud and frequency barplot below. We have also plotted the cumulative frequency of all unigrams: a total of 1097 and 14796 unigrams are needed to cover 50% and 90% of all 107190 word instances respectively.
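
The 50% and 90% coverage counts quoted above can be read off directly from the cumulative frequency column, for example:

### Number of unigrams needed to cover 50% and 90% of all word instances

min(which(dfN1GramsComb$cumFreq >= 0.5))
min(which(dfN1GramsComb$cumFreq >= 0.9))
# For this run these return 1097 and 14796 respectively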

plotCloud(dfN1GramsComb,2,0.5,10,200)

plotFreq(dfN1GramsComb,10)

plotCumFreq(dfN1GramsComb)

Bigram

A few of the most popular bigrams are “years ago”, “new york” and “even though”, as shown in the word cloud and frequency barplot below.

plotCloud(dfN2GramsComb,2,0.2,10,150)

plotFreq(dfN2GramsComb,10)

Trigram

Similarly, the most popular trigrams are shown in the word cloud and frequency barplot below:

plotCloud(dfN3GramsComb,2,0.3,10,100)

plotFreq(dfN3GramsComb,10)

Conclusion

Now that we understand the characteristics of the raw data, know how to clean it, and have identified the most popular words and phrases, we can move on to the next step of building an n-gram model and deciding how the model should handle words not present in the data. Techniques like smoothing and backoff models will be considered.
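
As an indication of the direction, a very simple (unsmoothed) backoff lookup over the trigram and bigram tables built above might look like the sketch below. This is an illustrative assumption only, not the final model; the Token column name follows the helper sketches earlier in this report, and proper smoothing and backoff weighting are still to be added.

### Illustrative sketch of a naive backoff lookup (not the final model)

predictNextWord <- function(phrase) {
    words <- unlist(strsplit(tolower(phrase), "\\s+"))
    n <- length(words)
    # Try the trigram table using the last two words of the input
    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- dfN3GramsComb$Token[startsWith(dfN3GramsComb$Token, paste0(prefix, " "))]
        if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    }
    # Back off to the bigram table using the last word only
    hits <- dfN2GramsComb$Token[startsWith(dfN2GramsComb$Token, paste0(words[n], " "))]
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
    NA_character_
}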

In the process of building the prediction model, we will split the final dataset into a training set and a test set: the training set for training the model and the test set for validating its accuracy. Different types of n-gram models will be explored. The best model will be built into a Shiny App with an input for the user to enter a word or phrase; the application will return its prediction of the next word.

Separately, the system performance of the application will also be evaluated to ensure that memory usage and run time stay within reasonable limits, so that all users have a good experience when using the text prediction program.


End of Assignment - Thank you for your time reviewing my work. Have a nice day!

Libraries required for this assignment project: RWeka, R.utils, stringr, SnowballC, tm, slam, dplyr, wordcloud, ggplot2, parallel, doParallel