Abstract

This is the milestone project report for the Coursera Data Science Capstone project. The objective of this project is to create a predictive text model, based on concepts similar to the SwiftKey application, using a large corpus of text documents as training data. Natural language processing techniques will be used to perform the analysis and build the predictive model.

This milestone report describes the major features of the training data with exploratory data analysis and summarizes the plans for creating the predictive model. As such, the intent of this report is to cover the following tasks of the project:

 - Task 0 : Understanding the Problem
 - Task 1 : Data Acquisition and Cleaning
 - Task 2 : Exploratory Analysis
 - Task 3 : Modeling 

As part of this first milestone report, a plan for the next tasks of the project, such as prediction and creative exploration and the final data product development, is also described.

Problem Statement

The overall objective of this capstone project is to develop an accurate and efficient text prediction algorithm based on analysis of a large corpus of text documents. A major part of this activity is familiarizing myself with NLP (Natural Language Processing) and text mining: learning the basics of NLP and how it relates to the data science process I have learned in the Data Science Specialization.

Getting The Data

The data for the analysis are obtained from the course web site in the form of a downloadable zip file containing the text files.

## [1] "de_DE" "en_US" "fi_FI" "ru_RU"

The text data are provided in 4 different languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian.

In this project the focus is on the United States English data sets, and hence the en_US folder is used for the text analysis.

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

The data sets consist of text from 3 different sources: 1) Blogs, 2) News and 3) Twitter feeds.

library(stringi)

# Get file sizes
blogs.size <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("./final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024 ^ 2

# Read the blogs and Twitter data into R
blogsText <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
newsText <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitterText <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Get words in files
blogs.words <- stri_count_words(blogsText)
news.words <- stri_count_words(newsText)
twitter.words <- stri_count_words(twitterText)

# Summary of the data sets
dataInfo <- data.frame(source = c("blogs.txt", "news.txt", "twitter.txt"),
           fileSizeInMB = c(blogs.size, news.size, twitter.size),
           numOfLines = c(length(blogsText), length(newsText), length(twitterText)),
           numOfWords = c(sum(blogs.words), sum(news.words), sum(twitter.words)))
          

dataInfo
##        source fileSizeInMB numOfLines numOfWords
## 1   blogs.txt     200.4242     899288   37546246
## 2    news.txt     196.2775    1010242   34762395
## 3 twitter.txt     159.3641    2360148   30093410

Cleaning The Data

As a next step the data must be cleaned. This step involves removing special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case.

Since the data sets are quite large, I created a sample data set by randomly selecting 1% of the lines of each of the three original files and combining them into a single data set; this sample is used to demonstrate the data cleaning and exploratory analysis.

I performed the following cleaning tasks: transformed the text to lower case; removed numbers and punctuation; removed basic stopwords (common words typically excluded from statistical text analysis); and performed basic profanity removal. For profanity removal I used Luis von Ahn's "bad words" list [4]. I also stemmed the documents, which reduces variations of words to their common 'stem'.

library(tm)

# set the seed for the random number
set.seed(56789)

# get the random data sample for 1% of total corpus 
sampleTextData <- c(sample(blogsText,length(blogsText)*0.01), 
                    sample(newsText, length(newsText)*0.01),
                    sample(twitterText,length(twitterText)*0.01))

# length of the sample data size
length(sampleTextData)
## [1] 42695
# create the corpus data for the 1% data size
corpus <- VCorpus(VectorSource(sampleTextData))

# create the functions for cleaning the corpus
removeWildChar <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# get the profanity words list to clean the corpus
profaneList <- read.table("bad-words.txt", quote="\"", stringsAsFactors=FALSE)
profanityWords <- c(profaneList$V1)


#Now clean the corpus data
corpusClean <- tm_map(corpus, removeWildChar, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corpusClean <- tm_map(corpusClean, removeWildChar, "@[^\\s]+")
corpusClean <- tm_map(corpusClean, content_transformer(tolower))
corpusClean <- tm_map(corpusClean, removeNumbers)
corpusClean <- tm_map(corpusClean, removePunctuation, preserve_intra_word_dashes = FALSE)
corpusClean <- tm_map(corpusClean, removeWords, stopwords("english"))
corpusClean <- tm_map(corpusClean, stemDocument)
corpusClean <- tm_map(corpusClean, stripWhitespace)
corpusClean <- tm_map(corpusClean, removeWords, profanityWords)
corpusClean <- tm_map(corpusClean, PlainTextDocument)

Exploratory Analysis

Now the corpus is somewhat clean and ready for exploratory analysis. It is useful to tokenize the corpus so that the frequency of repeated words/terms can be examined.

We compare the distributions of tokenized words in the corpus for unigram (single word), bigram (two word), trigram (three word), and quadgram (four word) tokenizers.

library(RWeka)
library(ggplot2)

# define the n-Gram tokenizer functions

UniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BiGramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TriGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))

# set the options for the cores 
options(mc.cores=1)

# Define the function that returns the n-Gram frequency of the terms from the term document matrix
 getNGramFreq <- function(tdm) {
    count <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    return(data.frame(word = names(count), freq = count))
 }
 
UniGramFreq <- getNGramFreq(removeSparseTerms(TermDocumentMatrix(corpusClean, 
    control = list(tokenize = UniGramTokenizer)), 0.9999))
BiGramFreq <- getNGramFreq(removeSparseTerms(TermDocumentMatrix(corpusClean, 
    control = list(tokenize = BiGramTokenizer)), 0.9999))
TriGramFreq <- getNGramFreq(removeSparseTerms(TermDocumentMatrix(corpusClean, 
    control = list(tokenize = TriGramTokenizer)), 0.9999))
QuadGramFreq <- getNGramFreq(removeSparseTerms(TermDocumentMatrix(corpusClean, 
    control = list(tokenize = QuadGramTokenizer)), 0.9999))

Generate the Plots
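The most frequent terms in each n-gram set can then be visualised as bar charts with ggplot2. The following is a minimal sketch using the frequency data frames computed above; the helper function plotNGramFreq is illustrative, not part of any package.

# Sketch: plot the 20 most frequent terms of an n-gram frequency data frame.
# The data frames returned by getNGramFreq() are already sorted by frequency.
plotNGramFreq <- function(freqData, title) {
    ggplot(head(freqData, 20), aes(x = reorder(word, freq), y = freq)) +
        geom_bar(stat = "identity") +
        coord_flip() +
        labs(title = title, x = "Term", y = "Frequency")
}

plotNGramFreq(UniGramFreq, "Top 20 Unigrams")
plotNGramFreq(BiGramFreq, "Top 20 Bigrams")
plotNGramFreq(TriGramFreq, "Top 20 Trigrams")
plotNGramFreq(QuadGramFreq, "Top 20 Quadgrams")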

Modeling Task

The model is developed using n-gram tokenization. However, the cleaned 1% sample corpus is highly unlikely to represent the whole set of English words, and hence the model still needs to be developed further to represent the complete corpus.

The key challenge is that the structure of R, combined with the memory limits of completing this project on a personal computer, makes many text mining tasks computationally cumbersome in terms of clock time and feasibility. Hence a better statistical model needs to be developed, which may require k-fold cross-validation, bootstrapping, or batch processing of the data. It may also be necessary to weight the terms based on the complete corpus; I need to explore weighting of the term-document matrix and see the impact of weightTfIdf. Calculations also had to be batched into a smaller number of chunks on the personal computer to avoid performance problems.
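As a minimal sketch of the weighting idea, the tm package can build a TF-IDF weighted term-document matrix directly from the cleaned sample corpus; whether this weighting actually helps the prediction model still needs to be evaluated on the complete corpus.

# Sketch: TF-IDF weighted term-document matrix on the cleaned sample corpus,
# for comparison against the raw term-frequency matrices built above.
tdmTfIdf <- TermDocumentMatrix(corpusClean,
    control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
inspect(tdmTfIdf[1:10, 1:5])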

Another challenge in using this data is that one of the three training data sets is highly context specific. Twitter has its own semantics; hence, in the development of the model, all hashtag terms (e.g. #savethedate) and usernames (e.g. @username) will be removed. This exploratory analysis also highlighted the need to do some spell checking/cleaning of the documents for creative spellings ('aaahhhh' and 'ahhhhh' each appear multiple times) before creating the prediction model.
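As a sketch, hashtag terms could be stripped with the same removeWildChar transformer already used for URLs and usernames during cleaning; the regular expression here is illustrative.

# Sketch: remove Twitter hashtag terms such as #savethedate from the corpus
corpusClean <- tm_map(corpusClean, removeWildChar, "#[a-zA-Z0-9_]+")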

Hence the current model is basic and requires further refinement before the data product is developed in the next steps.

Plan for Next Tasks

This concludes the exploratory analysis. At this stage, I am familiar with the data and the basic text mining infrastructure in R. The next steps of this capstone project are to develop a text prediction algorithm, refine it into a data product, and deploy it as a Shiny app.

I will start from a basic n-gram model using the exploratory analysis and explore more sophisticated algorithms if possible.

My next steps to produce a predictive algorithm and the final project submission are:
  1. I will develop a text prediction algorithm, starting from a basic n-gram model based on the exploratory analysis and exploring more sophisticated algorithms if possible. This model will match the last three words of a phrase against the stored 4-grams; if the pattern is not found, it 'backs off' to check the last two words against the 3-grams, and so on (see the sketch after this list).

  2. I will also propose metrics for model efficiency and accuracy using cross-validation.

  3. I need to incorporate statistical smoothing that can account for entirely unseen values. I am still unsure how to implement the model so that it provides a plausible prediction even when the training data does not include the particular pattern of words given.

  4. I will build a data product as a Shiny app that accepts an n-gram phrase and predicts the next word. The user interface of the Shiny app will consist of a text input box that allows a user to enter a phrase.

  5. Finally, once I have a working prediction app, I need to revisit performance and tweak the model.

  6. I will also create the report and a presentation explaining how I accomplished this project.
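The following is a minimal sketch of the backoff lookup described in step 1, assuming the n-gram frequency data frames computed earlier in this report. The helper functions splitNGram and predictNextWord and the prefix/nextWord column names are illustrative, not the final implementation, and no smoothing is applied yet.

# Sketch: split each n-gram into a prefix (all but the last word) and the
# following word, then look up a phrase against the 4-gram, 3-gram and 2-gram
# tables in turn, backing off to a shorter prefix when no match is found.
library(stringi)

splitNGram <- function(freqData) {
    words <- stri_split_fixed(as.character(freqData$word), " ")
    data.frame(prefix   = sapply(words, function(w) paste(head(w, -1), collapse = " ")),
               nextWord = sapply(words, function(w) tail(w, 1)),
               freq     = freqData$freq,
               stringsAsFactors = FALSE)
}

quadTable <- splitNGram(QuadGramFreq)   # prefixes of three words
triTable  <- splitNGram(TriGramFreq)    # prefixes of two words
biTable   <- splitNGram(BiGramFreq)     # prefixes of one word

predictNextWord <- function(phrase) {
    tokens <- unlist(stri_split_fixed(tolower(phrase), " "))
    tables <- list(quadTable, triTable, biTable)
    for (i in seq_along(tables)) {
        n <- 4 - i                       # prefix length: 3, then 2, then 1
        if (length(tokens) < n) next
        prefix  <- paste(tail(tokens, n), collapse = " ")
        matches <- tables[[i]][tables[[i]]$prefix == prefix, ]
        if (nrow(matches) > 0) {
            return(matches$nextWord[which.max(matches$freq)])
        }
    }
    NA                                   # no match in any table
}

# example call (may well return NA on the small 1% sample)
predictNextWord("thanks for the")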

References

Following are the references that have been used so far for this project:

[1] Natural language processing Wikipedia page : https://en.wikipedia.org/wiki/Natural_language_processing

[2] Text mining infrastructure in R : http://www.jstatsoft.org/article/view/v025i05

[3] CRAN Task View: Natural Language Processing: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

[4] Profanity words list from Luis von Ahn's Research Group : http://www.cs.cmu.edu/~biglou/resources/bad-words.txt