The goal of the capstone project is to create a predictive text model using a large corpus of text documents as training data. Natural language processing (NLP) techniques will be used to perform the analysis and build the predictive model. This report gives an introductory look at:
1. Extracting and cleaning the data.
2. Summarizing the major features of the training data through exploratory data analysis.
3. Describing the plans for creating the predictive model.
library(stringi) # String processing and summary statistics
library(NLP); library(openNLP)
library(tm) # Text mining
library(rJava)
library(RWeka) # tokenizer - create unigrams, bigrams, trigrams
library(RWekajars)
library(SnowballC) # Stemming
library(RColorBrewer) # Color palettes
library(qdap)
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
##
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
##
## ngrams
## The following objects are masked from 'package:base':
##
## Filter, proportions
library(ggplot2) #visualization
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:qdapRegex':
##
## %+%
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
The data is first downloaded from the source link and then loaded into R from the local disk.
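A sketch of the one-time download step is shown below; the dataset URL and the file paths inside the zip archive are assumptions based on the course instructions, so adjust them if they differ.
#One-time download of the corpus (URL and zip paths are assumed, not verified here)
if (!file.exists("en_US.blogs.txt")) {
  zipUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(zipUrl, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  #Extracting only the English files into the working directory
  unzip("Coursera-SwiftKey.zip", junkpaths = TRUE,
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"))
}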
blog<-readLines("en_US.blogs.txt", skipNul = TRUE, warn= FALSE)
news<-readLines("en_US.news.txt", skipNul = TRUE, warn=FALSE)
twitter<-readLines("en_US.twitter.txt", skipNul = TRUE, warn=FALSE)
create_summary_table <- function(twitter,blog,news){
stats <- data.frame(source = c("twitter","blog","news"),
arraySizeMB = c(object.size(twitter)/1024^2,object.size(blog)/1024^2,object.size(news)/1024^2),
fileSizeMB = c(file.info("en_US.twitter.txt")$size/1024^2,file.info("en_US.blogs.txt")$size/1024^2,file.info("en_US.news.txt")$size/1024^2),
lineCount = c(length(twitter),length(blog),length(news)),
wordCount = c(sum(stri_count_words(twitter)),sum(stri_count_words(blog)),sum(stri_count_words(news))),
charCount = c(stri_stats_general(twitter)[3],stri_stats_general(blog)[3],stri_stats_general(news)[3])
)
print(stats)
}
create_summary_table(twitter,blog,news)
## source arraySizeMB fileSizeMB lineCount wordCount charCount
## 1 twitter 318.98975 159.3641 2360148 30218166 162385035
## 2 blog 255.35453 200.4242 899288 38154238 208361438
## 3 news 19.76917 196.2775 77259 2693898 15683765
The datasets are quite large, so 10,000 lines are sampled from each one and combined into a single dataset.
set.seed(1805)
sampleData <- c(sample(twitter,10000),sample(blog,10000),sample(news,10000))
Here the data is first transformed into the core data type of NLP analysis, a corpus. After that, a set of cleaning procedures (e.g. removing extra whitespace, numbers, URLs, punctuation and so on) takes place in order to get meaningful insights from the data.
corpus <- VCorpus(VectorSource(sampleData))
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern," ",x))})
#Replacing all non-graphical characters with spaces
corpus <- tm_map(corpus,toSpace,"[^[:graph:]]")
#Transforming all data to lower case
corpus <- tm_map(corpus,content_transformer(tolower))
#Deleting all English stopwords and any stray letters left by the non-ASCII removal
corpus <- tm_map(corpus,removeWords,c(stopwords("english"),letters))
#Removing Punctuation
corpus <- tm_map(corpus,removePunctuation)
#Removing Numbers
corpus <- tm_map(corpus,removeNumbers)
#Removing Profanities
profanities <- readLines("bad-words.txt")
corpus <- tm_map(corpus, removeWords, profanities)
#Removing all stray letters left by the last two calls
corpus <- tm_map(corpus,removeWords,letters)
#Stripping all extra whitespace
corpus <- tm_map(corpus,stripWhitespace)
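As a quick sanity check of the cleaning steps, one of the sampled lines can be compared with its cleaned counterpart in the corpus (a sketch; the output is not reproduced here).
#Comparing a raw sampled line with its cleaned version
sampleData[1]
as.character(corpus[[1]])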
Now we are going to perform exploratory data analysis on the data. First, document-term matrices are created for n-grams with n = 1, 2 and 3, and then the most frequent terms are found. The following tokenizer functions are used to extract 1-grams, 2-grams and 3-grams from the text corpus before the frequent terms are identified.
An n-gram is a sequence of n text items (e.g. phonemes, syllables, letters, words or base pairs), as illustrated below.
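For illustration, the RWeka tokenizer loaded above can be applied to a short example sentence (a made-up phrase, shown only to make the n-gram idea concrete):
#Word n-grams of a short example sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 1, max = 1)) # unigrams: "thanks" "for" "the" "follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2)) # bigrams: "thanks for" "for the" "the follow"
NGramTokenizer("thanks for the follow", Weka_control(min = 3, max = 3)) # trigrams: "thanks for the" "for the follow"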
#Creating a unigram DTM
unigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
unigrams <- DocumentTermMatrix(corpus, control = list(tokenize = unigramTokenizer))
#Creating a bigram DTM
BigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
bigrams <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
#Creating a trigram DTM
TrigramTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
trigrams <- DocumentTermMatrix(corpus, control = list(tokenize = TrigramTokenizer))
The top n-grams for n = 1, 2 and 3 are shown below.
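The plots expect frequency tables named unigrams_freq_df, bigrams_freq_df and trigrams_freq_df. A minimal sketch of how they can be built from the document-term matrices is given here, using slam::col_sums to avoid converting the sparse matrices to dense form and keeping only the top 20 terms for plotting.
library(slam) # Sparse column sums for the document-term matrices
#Helper: term frequencies of a DTM, sorted in decreasing order, top terms only
freq_df <- function(dtm, top = 20) {
  freq <- sort(col_sums(dtm), decreasing = TRUE)
  head(data.frame(word = names(freq), frequency = freq, row.names = NULL), top)
}
unigrams_freq_df <- freq_df(unigrams)
bigrams_freq_df <- freq_df(bigrams)
trigrams_freq_df <- freq_df(trigrams)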
g <- ggplot(unigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Unigram") + ylab("Frequency") +labs(title="Most common unigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
g <- ggplot(bigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Bigram") + ylab("Frequency") +labs(title="Most common bigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
g <- ggplot(trigrams_freq_df,aes(x=reorder(word,-frequency),y=frequency))+geom_bar(stat="identity",fill="darkolivegreen4") + xlab("Trigram") + ylab("Frequency") +labs(title="Most common trigrams") + theme(axis.text.x=element_text(angle=55, hjust=1))
g
The next steps of this capstone project are to finalize the predictive algorithm and deploy it as a Shiny app.
The predictive algorithm will use an n-gram model with frequency lookup, similar to the exploratory analysis above.
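A rough sketch of this frequency-lookup idea is given below; it is not the final algorithm, and it reuses the freq_df helper above (with top = Inf) to get full, frequency-sorted n-gram tables.
#Full, frequency-sorted n-gram tables (reusing the freq_df helper sketched earlier)
bigram_tab <- freq_df(bigrams, top = Inf)
trigram_tab <- freq_df(trigrams, top = Inf)
#Simple backoff: try trigrams first, then bigrams, then the most frequent unigram
predict_next_word <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    hit <- grep(paste0("^", words[1], " ", words[2], " "), trigram_tab$word, value = TRUE)
    if (length(hit) > 0) return(tail(strsplit(hit[1], " ")[[1]], 1))
  }
  hit <- grep(paste0("^", tail(words, 1), " "), bigram_tab$word, value = TRUE)
  if (length(hit) > 0) return(tail(strsplit(hit[1], " ")[[1]], 1))
  as.character(unigrams_freq_df$word[1])
}
predict_next_word("thanks for the")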
The user interface of the Shiny app will consist of a text input box that lets the user enter a phrase. The app will then use the algorithm to suggest the most likely next word after a short delay.
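A minimal sketch of that interface, using standard shiny functions and the predict_next_word function sketched above, might look like this:
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("nextWord")
)
server <- function(input, output) {
  output$nextWord <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next_word(input$phrase)
  })
}
#shinyApp(ui, server) # uncomment to launch the app locally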