Introduction

Large databases of text in a target language are commonly used to build language models for a variety of purposes. In this project we explore the major features of the text data provided for the Coursera Data Science Capstone offered through Johns Hopkins University and sponsored by SwiftKey. The final goal is a text-prediction application, built with R Shiny, that predicts the next word using a natural language processing model. Four language databases are provided; we will work with the English database rather than the others (Russian, German and Finnish).

The first step, and the goal of this report, is to take a basic overview of the data and perform the cleaning needed to become familiar with the database and to prepare for building the prediction model. Specifically, we remove numbers, symbols, punctuation and other tokens that should not be predicted, in order to improve prediction accuracy. We then look at the most frequent words and phrases: single words and two- and three-word combinations.

Loading Data

library(tm)
library(wordcloud)
library(RWeka)
library(stringi)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(qdap)
#setwd("D:/test/Coursera/Capstone_final project")
blogs = readLines("final/en_US/en_US.blogs.txt", skipNul = T, encoding="UTF-8")
news = readLines("final/en_US/en_US.news.txt",skipNul = T, encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.news.txt", skipNul = T, encoding =
## "UTF-8"): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter = readLines("final/en_US/en_US.twitter.txt",skipNul = T, encoding="UTF-8")
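The "incomplete final line" warning suggests the news file was cut short: en_US.news.txt is known to contain control characters that can stop a text-mode read on some platforms. A common workaround (a sketch, using the same file path) is to read it through a binary connection:

# sketch: read the news file through a binary connection so embedded
# control characters do not truncate the read (common workaround)
con = file("final/en_US/en_US.news.txt", open = "rb")
news = readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)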

First, let's look at the structure of the data and get an overview of the three files.

blog.size = round((file.info("final/en_US/en_US.blogs.txt")$size/1024^2),2)
new.size= round((file.info("final/en_US/en_US.news.txt")$size/1024^2),2)
twitter.size = round((file.info("final/en_US/en_US.twitter.txt")$size/1024^2),2)

sum.tab=data.frame(file=c("Blogs","News","Twitter"),
                   size=c(blog.size,new.size,twitter.size),
                   lines=c(length(blogs),length(news),length(twitter)),
                   words=c(sum(stri_count_words(blogs)),
                           sum(stri_count_words(news)),
                           sum(stri_count_words(twitter))))           
names(sum.tab)=c("File","Size(Mb)","Number of Lines","Number of Words")

kable(sum.tab) %>%
  kable_styling(bootstrap_options="striped",full_width=F)
File      Size(Mb)   Number of Lines   Number of Words
Blogs     200.42     899288            37546246
News      196.28     77259             2674536
Twitter   159.36     2360148           30093410

Since the given database is very large, a sample of 1,000 lines from each file will be used for the cleaning and exploratory analysis in this milestone report. Below is the summary table for the sample data.

set.seed(123)
sample.size=1000

sample.blog=sample(blogs,sample.size)
sample.new=sample(news,sample.size)
sample.twitter=sample(twitter,sample.size)

sum.samtab=data.frame(file=c("Sample Blogs","Sample News","Sample Twitter"),
                      size=round((sample.size/c(length(blogs),length(news),length(twitter)))
                                 *c(blog.size,new.size,twitter.size),2),
                      lines=c(length(sample.blog),length(sample.new),length(sample.twitter)),
                      words=c(sum(stri_count_words(sample.blog)),
                              sum(stri_count_words(sample.new)),
                              sum(stri_count_words(sample.twitter))))
names(sum.samtab)=names(sum.tab)

kable(sum.samtab) %>%
  kable_styling(bootstrap_options="striped",full_width=F)
File             Size(Mb)   Number of Lines   Number of Words
Sample Blogs     0.22       1000              40768
Sample News      2.54       1000              34559
Sample Twitter   0.07       1000              12195

Cleaning Data

To prepare the text data for the text-mining steps that follow, we remove URLs and Twitter handles, text within brackets, non-ASCII characters, stopwords, punctuation, numbers, profanity and extra whitespace, replace abbreviations and contractions with their full forms, and convert everything to lower case:

#####CLEANING

sample=c(sample.blog,sample.new,sample.twitter)

##remove website links and twitter handles
sample=gsub("(f|ht)tps?://\\S+", " ", sample)
sample=gsub("@\\S+"," ",sample)

# Remove text within brackets
sample=bracketX(sample)

##transliterate latin1/non-ASCII characters to ASCII, then replace what remains
latin.sym=grep("NOT_ASCII", iconv(sample, "latin1", "ASCII", sub="NOT_ASCII"), fixed=TRUE)
sample[latin.sym]=stri_trans_general(sample[latin.sym], "latin-ascii")
sample=gsub('[^\x20-\x7E]', "'", sample)

##replace abbreviate words with their full terms
sample=replace_abbreviation(sample)

##replace contractions with their base words
sample=replace_contraction(sample)

##lower case
sample=tolower(sample)

##remove stopwords and leftover apostrophe fragments (e.g. 's, 't)
sample=removeWords(sample,stopwords("en"))
sample=gsub("'[a-z]+", " ", sample)

##remove punctuations
sample=gsub("[[:punct:]]", " ", sample)

##remove numbers
sample=removeNumbers(sample)

##remove profanity
swear.words = read.table(file ="swearWords.txt", stringsAsFactors=F)
sample=removeWords(sample,swear.words[,1])

##remove extra space
sample=stripWhitespace(sample)

corpus = VCorpus(VectorSource(sample))
corpus = tm_map(corpus, PlainTextDocument)

rm(sample.twitter,sample.blog,sample.new)
rm(blogs,news,twitter,swear.words,latin.sym)
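Since the final Shiny app will have to apply the same cleaning to user input before predicting, it is convenient to wrap these steps in a reusable helper. A minimal sketch (clean_text is a hypothetical name and only the core steps are included):

# sketch: core cleaning steps wrapped as a reusable function for the
# prediction stage (clean_text is a hypothetical helper, not used above)
clean_text = function(x) {
  x = gsub("(f|ht)tps?://\\S+", " ", x)   # strip URLs
  x = gsub("@\\S+", " ", x)               # strip twitter handles
  x = tolower(x)
  x = removeWords(x, stopwords("en"))     # tm stopword removal
  x = gsub("[[:punct:]]", " ", x)
  x = removeNumbers(x)
  stripWhitespace(x)
}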

Frequency Table and Visualization by WordCloud

We will examine the top 30 most frequent word combinations: unigrams, digrams (two-word phrases) and trigrams. The frequencies are shown as a table, a histogram and a wordcloud.

top=30 # number of most frequent words/phrases to display

#Frequency table
getFreq = function(corp, ngram) {
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corp, control = list(tokenizer = gram))
  freq1 = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq = data.frame(word = names(freq1), freq = freq1, stringsAsFactors = FALSE)
  return(freq)
}

#Frequency Histogram
makePlot = function(table, label) {
  ggplot(table, aes(reorder(word, freq), freq)) +
    geom_bar(stat = "identity", fill = I("royalblue3"))+
    coord_flip()+
    labs(x = label, y = "Frequency")
}
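Note that as.matrix(tdm) expands the sparse term-document matrix into a dense one, which may exhaust memory once the full corpus is used instead of the 1,000-line samples. A memory-friendlier variant (a sketch, assuming the slam package, on which tm already depends) computes the same totals on the sparse matrix directly:

# sketch: frequency counts on the sparse TDM without densifying it
getFreqSparse = function(corp, ngram) {
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corp, control = list(tokenizer = gram))
  freq1 = sort(slam::row_sums(tdm), decreasing = TRUE)
  data.frame(word = names(freq1), freq = freq1, stringsAsFactors = FALSE)
}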

1. Unigram

a. Frequency Table

uni.freq = getFreq(corpus,1)

kable(uni.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
will 387
said 304
one 258
just 240
can 236
like 203
time 199
people 166
new 153
day 145
year 145
get 142
now 130
first 127
also 123
know 117
see 116
good 111
think 110
back 109
two 103
love 102
way 99
make 96
much 96
really 96
want 95
work 94
last 93
many 93

b. Frequency Histogram

makePlot(uni.freq[1:top,], "30 Most Common Unigrams")

c. Frequency Wordcloud

wordcloud(words = uni.freq$word, freq = uni.freq$freq, min.freq = 1,
            max.words=50, random.order=T, rot.per=0.35, random.color = F,scale=c(4,.2),
            colors=brewer.pal(8,name= "Set1"))

2. Digram

a. Frequency Table

bi.freq = getFreq(corpus,2)

kable(bi.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
last year 24
new york 22
high school 19
dot com 18
year old 18
let us 16
new jersey 15
first time 14
right now 14
even though 13
will make 13
last week 11
one day 10
st louis 10
united states 10
www dot 10
years ago 10
can get 9
can wait 9
every day 9
many people 9
feel like 8
just like 8
make sure 8
one thing 8
will see 8
anyone else 7
can also 7
can see 7
felt like 7

b. Frequency Histogram

makePlot(bi.freq[1:top,], "30 Most Common Digrams")

c. Frequency Wordcloud

wordcloud(words = bi.freq$word, freq = bi.freq$freq, min.freq = 1,
            max.words=30, random.order=T, rot.per=0.35, random.color = F,scale=c(3,.1),
            colors=brewer.pal(8,name= "Set1"))

3. Trigram

a. Frequency Table

tri.freq = getFreq(corpus,3)

kable(tri.freq[1:top,],row.names = F) %>%
  kable_styling(bootstrap_options="striped",full_width=F)%>%
  scroll_box(width = "100%", height = "400px")
word freq
cup cup cup 5
jobs north dakota 4
pharmacist jobs north 4
around around around 3
can wait see 3
ho chi minh 3
let us get 3
ngo dinh diem 3
president barack obama 3
year old boy 3
american banker magazine 2
approach raising one 2
average per doctor 2
beat us place 2
brave new world 2
carmel valley ranch 2
casino hotel complex 2
chief executive officer 2
church adult choir 2
city business community 2
commuter rail line 2
conference call reporters 2
county sheriff department 2
defence budget continues 2
democratic candidates running 2
double opt double 2
duong van minh 2
eleanor roosevelt baruch 2
even though think 2
first quarter year 2

b. Frequency Histogram

makePlot(tri.freq[1:top,], "30 Most Common Trigrams")

c. Frequency Wordcloud

wordcloud(words = tri.freq$word, freq = tri.freq$freq, min.freq = 1,
            max.words=30, random.order=T, rot.per=0.35, random.color = T,scale=c(2,.1),
            colors=brewer.pal(8,name= "Set1"))

Plans for the Prediction Process and Shiny App

We are now familiar with the database and have an overview of its contents. The next step is to build a prediction model based on the frequencies of the common word combinations created above.
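As a rough illustration of that approach, the sketch below looks up the last one or two words of a phrase in the trigram table and backs off to the digram table when no match is found (predict_word is a hypothetical helper; the final model will be trained on far more data than the samples used here):

# sketch: naive next-word lookup with backoff from the trigram to the
# digram frequency table (illustrative only)
predict_word = function(phrase, tri.freq, bi.freq) {
  words = unlist(strsplit(tolower(phrase), "\\s+"))
  n = length(words)
  if (n >= 2) {
    key = paste(words[n - 1], words[n])
    hits = tri.freq[grepl(paste0("^", key, " "), tri.freq$word), ]
    if (nrow(hits) > 0) return(word(hits$word[1], 3))  # stringr::word
  }
  hits = bi.freq[grepl(paste0("^", words[n], " "), bi.freq$word), ]
  if (nrow(hits) > 0) return(word(hits$word[1], 2))
  NA_character_
}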

Finally, we will apply the prediction algorithm in a Shiny app, together with an R Presentation to demonstrate and test the result.
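A minimal skeleton of the planned Shiny interface might look as follows (purely illustrative; the input/output names and the call to the hypothetical predict_word helper sketched above are placeholders):

# sketch of the planned Shiny app: a text box for the phrase and a text
# output showing the predicted next word
library(shiny)
ui = fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("nextword")
)
server = function(input, output) {
  output$nextword = renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_word(input$phrase, tri.freq, bi.freq)
  })
}
# shinyApp(ui, server)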