Introduction

This milestone report explores the data for the Capstone Project of the Coursera Data Science Specialization.

Coursera and SwiftKey are partnering on this project, which applies data science to the area of natural language processing.

The project uses a large text corpus of documents to predict the next word based on the preceding input.

The data is extracted from the provided files, cleaned, and then used in a Shiny application.

Here we summarize the corpus and outline a plan for building the predictive model.

Loading Data

library(tm)
library(wordcloud)
library(RWeka)
library(stringi)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(qdap)

#### Download and save data 
#specify the source and destination of the download
#destination_file <- "20180808_Coursera_SwiftKey.zip"
#source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# execute the download
#download.file(source_file, destination_file)

# extract the files from the zip file
#unzip(destination_file)
#
#url_profanity <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
#profanity_file_destination  <- "final/en_US/profanity.txt"
#download.file(url_profanity, profanity_file_destination)


blogs = readLines("final/en_US/en_US.blogs.txt", skipNul = T, encoding="UTF-8")
news = readLines("final/en_US/en_US.news.txt",skipNul = T, encoding="UTF-8")
twitter = readLines("final/en_US/en_US.twitter.txt",skipNul = T, encoding="UTF-8")

First, let's look at the structure of the data and get an overview.

blog.size = round((file.info("final/en_US/en_US.blogs.txt")$size/1024^2),2)
new.size= round((file.info("final/en_US/en_US.news.txt")$size/1024^2),2)
twitter.size = round((file.info("final/en_US/en_US.twitter.txt")$size/1024^2),2)

sum.tab=data.frame(file=c("Blogs","News","Twitter"),
                   size=c(blog.size,new.size,twitter.size),
                   lines=c(length(blogs),length(news),length(twitter)),
                   words=c(sum(stri_count_words(blogs)),
                           sum(stri_count_words(news)),
                           sum(stri_count_words(twitter))))           
names(sum.tab)=c("File","Size(Mb)","Number of Lines","Number of Words")

kable(sum.tab) %>%
  kable_styling(bootstrap_options="striped",full_width=F)
File       Size(Mb)   Number of Lines   Number of Words
Blogs        200.42            899288          37546239
News         196.28           1010242          34762395
Twitter      159.36           2360148          30093413

Since the full corpus is very large, a random sample of 2,000 lines from each file is used for the cleaning and exploratory analysis in this milestone report. Below is the summary table of the sample data.

set.seed(2131)
sample.size=2000

sample.blog=sample(blogs,sample.size)
sample.new=sample(news,sample.size)
sample.twitter=sample(twitter,sample.size)

sum.samtab=data.frame(file=c("Sample Blogs","Sample News","Sample Twitter"),
                      size=round((sample.size/c(length(blogs),length(news),length(twitter)))
                                 *c(blog.size,new.size,twitter.size),2),
                      lines=c(length(sample.blog),length(sample.new),length(sample.twitter)),
                      words=c(sum(stri_count_words(sample.blog)),
                              sum(stri_count_words(sample.new)),
                              sum(stri_count_words(sample.twitter))))
names(sum.samtab)=names(sum.tab)

kable(sum.samtab) %>%
  kable_styling(bootstrap_options="striped",full_width=F)
File             Size(Mb)   Number of Lines   Number of Words
Sample Blogs         0.45              2000             82596
Sample News          0.39              2000             70124
Sample Twitter       0.14              2000             25229

Cleaning Data

The tm package was used to clean the data, following the tm documentation and Text Mining in R. The profanity word list comes from CMU.

To prepare the text data for the text mining steps that follow, we apply the cleaning steps below:

#####Data Cleaning

sample=c(sample.blog,sample.new,sample.twitter)

##remove website link and twitter @
sample=gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample)
sample=gsub("@[^\\s]+"," ",sample)

# Remove text within brackets
sample=bracketX(sample)

##remove latin1 words
latin.sym=grep("[^NOT_ASCII](NOT_ASCII){2}[^NOT_ASCII]",iconv(sample, "latin1", "ASCII", sub="NOT_ASCII"))
sample[latin.sym]=stri_trans_general(sample[latin.sym], "latin-ascii")
sample=gsub('[^\x20-\x7E]', "'", sample)

##replace abbreviate words with their full terms
sample=replace_abbreviation(sample)

##replace contractions with their base words
sample=replace_contraction(sample)

##lower case
sample=tolower(sample)

##remove stopwords and unwanted abbreviation
sample=removeWords(sample,stopwords("en"))
sample=gsub("'[A-z]+", " ", sample)

##remove punctuations
sample=gsub("[[:punct:]]", " ", sample)

##remove numbers
sample=removeNumbers(sample)

##remove profinity
profanity = read.table(file ="final/en_US/profanity.txt", stringsAsFactors=F)
sample=removeWords(sample,profanity[,1])

##remove extra space
sample=stripWhitespace(sample)

corpus = VCorpus(VectorSource(sample))
corpus = tm_map(corpus, PlainTextDocument)

rm(sample.twitter,sample.blog,sample.new)
rm(blogs,news,twitter,profanity,latin.sym)

Tokenization

We now need to break the text into words and sentences and turn it into n-grams. This process is called tokenization because we are breaking the text up into units of meaning, called tokens.

In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations.

Tokenization is performed in R with the RWeka package. RWeka's NGramTokenizer is used (inside the getFreq function below) to extract n-grams from the text corpus; here we extract 1-grams, 2-grams and 3-grams, and the same approach extends to 4-grams.
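As a quick, purely illustrative example (not part of the report's pipeline), the tokenizer can be called directly on a toy sentence to see the n-grams it produces:

## Illustrative only: tokenize a toy sentence into bigrams with RWeka's NGramTokenizer
example.text = "thanks for the follow"
NGramTokenizer(example.text, Weka_control(min = 2, max = 2))
## returns: "thanks for" "for the" "the follow"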

Frequency Table and Visualization by WordCloud

We will examine the top 20 most frequent word combinations, specifically unigrams, bigrams and trigrams. The frequencies will be shown as a table, a histogram and a word cloud.

top=20 #number of top frequent appear words

#Frequency table: build a term-document matrix of n-grams from the corpus
#and return a data frame of terms sorted by decreasing frequency
getFreq = function(corp, ngram) {
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corp, control = list(tokenizer = gram))
  freq1 = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq = data.frame(word = names(freq1), freq = freq1)
  freq$word = as.character(freq$word)
  return(freq)
}

#Frequency Histogram
makePlot = function(table, label) {
  ggplot(table, aes(reorder(word, freq), freq)) +
    geom_bar(stat = "identity", fill = I("blue"))+
    coord_flip()+
    labs(x = label, y = "Frequency")
}

1. Unigram

a. Frequency Table

uni.freq = getFreq(corpus,1)

kable(uni.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
will 686
said 597
can 512
one 462
like 433
just 428
time 363
get 330
new 287
first 286
now 281
people 271
day 262
also 249
year 243
back 238
know 231
two 226
good 221
much 206

b. Frequency Histogram

makePlot(uni.freq[1:top,], "20 Most Common Unigrams")

c. Frequency Wordcloud

wordcloud(words = uni.freq$word, freq = uni.freq$freq, min.freq = 1,
            max.words=50, random.order=T, rot.per=0.35, random.color = F,scale=c(4,.2),
            colors=brewer.pal(8,name= "Set1"))

2. Bigram

a. Frequency Table

bi.freq = getFreq(corpus,2)

kable(bi.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
u s 57
year old 47
last year 42
new york 39
dot com 31
right now 31
let us 30
st louis 29
years ago 26
can get 25
first time 25
high school 25
new jersey 23
www dot 23
feel like 21
can wait 19
looks like 18
make sure 16
san francisco 15
even though 14

b. Frequency Histogram

makePlot(bi.freq[1:top,], "20 Most Common Bigrams")

c. Frequency Wordcloud

wordcloud(words = bi.freq$word, freq = bi.freq$freq, min.freq = 1,
            max.words=30, random.order=T, rot.per=0.35, random.color = F,scale=c(3,.1),
            colors=brewer.pal(8,name= "Set1"))

3. Trigram

a. Frequency Table

tri.freq = getFreq(corpus,3)

kable(tri.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
new york times 5
president barack obama 5
digs service points 4
let us get 4
nice hair guy 4
age grade range 3
aug age grade 3
can wait see 3
chicago chicago illinois 3
cloths cold water 3
five times week 3
four years ago 3
gov chris christie 3
grade range yo 3
happy mother day 3
happy mothers day 3
hope great day 3
let us make 3
like year old 3
new york state 3

b. Frequency Histogram

makePlot(tri.freq[1:top,], "20 Most Common Trigrams")

c. Frequency Wordcloud

wordcloud(words = tri.freq$word, freq = tri.freq$freq, min.freq = 1,
            max.words=30, random.order=T, rot.per=0.35, random.color = T,scale=c(2,.1),
            colors=brewer.pal(8,name= "Set1"))

Findings

After the exploratory analysis we can conclude that processing the full corpus is very heavy and requires a lot of processing power and RAM. Many of the frequent words are repeated across the three sources, and the more complex the n-gram, the lower its frequency.

What Next?

Prediction model and plans for the Shiny app: while the strategy for modeling and prediction has not been finalized, an n-gram model with a frequency look-up table might be used, based on the analysis above. A possible method of prediction is to use the 4-gram model to find the most likely next word first; if none is found, the 3-gram model is used, and so forth (a sketch of this back-off lookup follows). Furthermore, stemming might also be done during data preprocessing.
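The sketch below shows one way such a back-off lookup could work. It is only a rough draft, not the final model; it assumes hypothetical frequency tables freq2, freq3 and freq4 built with getFreq(corpus, 2), getFreq(corpus, 3) and getFreq(corpus, 4), each sorted by decreasing frequency as above.

## Rough back-off sketch (assumes freq2, freq3, freq4 built with getFreq as above)
predict.next = function(input, freq4, freq3, freq2) {
  tokens = unlist(strsplit(tolower(input), "\\s+"))
  for (n in 4:2) {
    if (length(tokens) < n - 1) next                      # not enough context for this order
    context = paste(tail(tokens, n - 1), collapse = " ")  # last n-1 words of the input
    freq = switch(as.character(n), "4" = freq4, "3" = freq3, "2" = freq2)
    hits = freq[startsWith(freq$word, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      return(word(hits$word[1], -1))                      # last word of the top matching n-gram
    }
  }
  NA                                                      # no match found in any table
}

## hypothetical example call: predict.next("let us", freq4, freq3, freq2)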

For the app, a simple, minimalistic design is planned, displaying the probability of the next word or offering a button to insert the predicted word (a rough interface sketch is given below).
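Purely as an illustration of that layout (the app has not been built yet), a minimal Shiny skeleton could look like the following; it reuses the hypothetical predict.next function and frequency tables from the sketch above:

## Minimal Shiny skeleton (illustrative only; assumes predict.next and freq2-freq4 exist)
library(shiny)

ui = fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server = function(input, output) {
  output$prediction = renderText({
    if (nchar(input$phrase) == 0) return("")
    predict.next(input$phrase, freq4, freq3, freq2)
  })
}

shinyApp(ui = ui, server = server)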