The goal of the capstone project is to create a predictive text model using a large corpus of text documents as training data. Natural language processing (NLP) techniques will be used to perform the analysis and build the predictive model. This report gives an introductory look at:
1. Extracting and cleaning the data.
2. Summarizing the major features of the training data through exploratory data analysis.
3. Describing the plans for creating the predictive model.
library(stringi) # String processing and summary statistics
library(NLP); library(openNLP)
library(tm) # Text mining
library(rJava)
library(RWeka) # tokenizer - create unigrams, bigrams, trigrams
library(RWekajars)
library(SnowballC) # Stemming
library(RColorBrewer) # Color palettes
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following objects are masked from 'package:base':
##
## Filter, proportions
library(ggplot2) #visualization
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
##
## %+%
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
The data is first downloaded from the source link and then loaded into R from the local disk.
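A sketch of the one-time download step is shown below; the dataset URL and the file paths inside the zip archive are assumptions based on the course instructions, so adjust them if they differ.
#One-time download of the corpus (URL and zip paths are assumed, not verified here)
if (!file.exists("en_US.blogs.txt")) {
  zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  #Extracting only the English files into the working directory
  unzip("Coursera-SwiftKey.zip", junkpaths = TRUE,
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"))
}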
blog<-readLines("en_US.blogs.txt", skipNul = TRUE, warn= FALSE)
news<-readLines("en_US.news.txt", skipNul = TRUE, warn=FALSE)
twitter<-readLines("en_US.twitter.txt", skipNul = TRUE, warn=FALSE)
create_summary_table <- function(twitter,blog,news){
stats <- data.frame(source = c("twitter","blog","news"),
arraySizeMB = c(object.size(twitter)/1024^2,object.size(blog)/1024^2,object.size(news)/1024^2),
fileSizeMB = c(file.info("en_US.twitter.txt")$size/1024^2,file.info("en_US.blogs.txt")$size/1024^2,file.info("en_US.news.txt")$size/1024^2),
lineCount = c(length(twitter),length(blog),length(news)),
wordCount = c(sum(stri_count_words(twitter)),sum(stri_count_words(blog)),sum(stri_count_words(news))),
charCount = c(stri_stats_general(twitter)[3],stri_stats_general(blog)[3],stri_stats_general(news)[3])
)
print(stats)
}
create_summary_table(twitter,blog,news)
## source arraySizeMB fileSizeMB lineCount wordCount charCount
## 1 twitter 318.98975 159.3641 2360148 30218166 162385035
## 2 blog 255.35453 200.4242 899288 38154238 208361438
## 3 news 19.76917 196.2775 77259 2693898 15683765
The datasets are quite large, so 10,000 lines are sampled from each one and combined into a single dataset.
set.seed(1805)
sampleData <- c(sample(twitter,10000),sample(blog,10000),sample(news,10000))
Here the data is first transformed into the core data type of NLP analysis, a corpus. After that, a set of cleaning procedures (e.g. removing extra whitespace, numbers, URLs, punctuation and so on) takes place in order to get meaningful insights from the data.
corpus <- VCorpus(VectorSource(sampleData))
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern," ",x))})
#Replacing all non-graphical characters with spaces
corpus <- tm_map(corpus,toSpace,"[^[:graph:]]")
#Transforming all data to lower case
corpus <- tm_map(corpus,content_transformer(tolower))
#Deleting all English stopwords and any stray letters left by the non-ASCII removal
corpus <- tm_map(corpus,removeWords,c(stopwords("english"),letters))
#Removing Punctuation
corpus <- tm_map(corpus,removePunctuation)
#Removing Numbers
corpus <- tm_map(corpus,removeNumbers)
#Removing Profanities
profanities <- readLines("bad-words.txt")
corpus <- tm_map(corpus, removeWords, profanities)
#Removing all stray letters left by the last two calls
corpus <- tm_map(corpus,removeWords,letters)
#Stripping all extra whitespace
corpus <- tm_map(corpus,stripWhitespace)
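As a quick sanity check of the cleaning steps, one of the sampled lines can be compared with its cleaned counterpart in the corpus (a sketch; the output is not reproduced here).
#Comparing a raw sampled line with its cleaned version
sampleData[1]
as.character(corpus[[1]])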
Now we are going to perform exploratory data analysis on the data. First, document-term matrices are created for n-grams with n = 1, 2 and 3, and then the most frequent terms are found. The following tokenizer functions are used to extract 1-grams, 2-grams and 3-grams from the text corpus before the frequent terms are identified.
An n-gram is a sequence of n text items (e.g. phonemes, syllables, letters, words or base pairs), as illustrated below.
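For illustration, the RWeka tokenizer loaded above can be applied to a short example sentence (a made-up phrase, shown only to make the n-gram idea concrete):
#Word n-grams of a short example sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 1, max = 1)) # unigrams: "thanks" "for" "the" "follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2)) # bigrams: "thanks for" "for the" "the follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 3, max = 3)) # trigrams: "thanks for the" "for the follow"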
#Creating a unigram DTM
unigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
unigrams <- DocumentTermMatrix(corpus, control = list(tokenize = unigramTokenizer))
#Creating a bigram DTM
BigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
bigrams <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
#Creating a trigram DTM
TrigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
trigrams <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))
The top n-grams for n = 1, 2 and 3 are shown below.
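The plots expect frequency tables named unigrams_freq_df, bigrams_freq_df and trigrams_freq_df. A minimal sketch of how they can be built from the document-term matrices is given here, using slam::col_sums to avoid converting the sparse matrices to dense form and keeping only the top 20 terms for plotting.
library(slam) # Sparse column sums for the document-term matrices
#Helper: term frequencies of a DTM, sorted in decreasing order, top terms only
freq_df <- function(dtm, top = 20) {
  freq <- sort(col_sums(dtm), decreasing = TRUE)
  head(data.frame(word = names(freq), frequency = freq, row.names = NULL), top)
}
unigrams_freq_df <- freq_df(unigrams)
bigrams_freq_df <- freq_df(bigrams)
trigrams_freq_df <- freq_df(trigrams)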
g <- ggplot(unigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Unigram") + ylab("Frequency") +labs(title="Most common unigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
g <- ggplot(bigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Bigram") + ylab("Frequency") +labs(title="Most common bigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
g <- ggplot(trigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Trigram") + ylab("Frequency") +labs(title="Most common trigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
The next steps of this capstone project are to finalize the predictive algorithm and deploy it as a Shiny app.
The predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above.
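A rough sketch of this frequency-lookup idea is given below; it is not the final algorithm, and it reuses the freq_df helper above (with top = Inf) to get full, frequency-sorted n-gram tables.
#Full, frequency-sorted n-gram tables (reusing the freq_df helper sketched earlier)
bigram_tab <- freq_df(bigrams, top = Inf)
trigram_tab <- freq_df(trigrams, top = Inf)
#Simple backoff: try trigrams first, then bigrams, then the most frequent unigram
predict_next_word <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    hit <- grep(paste0("^", words[1], " ", words[2], " "), trigram_tab$word, value = TRUE)
    if (length(hit) > 0) return(tail(strsplit(hit[1], " ")[[1]], 1))
  }
  hit <- grep(paste0("^", tail(words, 1), " "), bigram_tab$word, value = TRUE)
  if (length(hit) > 0) return(tail(strsplit(hit[1], " ")[[1]], 1))
  as.character(unigrams_freq_df$word[1])
}
predict_next_word("thanks for the")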
The user interface of the Shiny app will consist of a text input box that lets the user enter a phrase. The app will then use the algorithm to suggest the most likely next word after a short delay.
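A minimal sketch of that interface, using standard shiny functions and the predict_next_word function sketched above, might look like this:
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("nextWord")
)
server <- function(input, output) {
  output$nextWord <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next_word(input$phrase)
  })
}
#shinyApp(ui, server) # uncomment to launch the app locally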