Large databases comprising text in a target language are commonly used to build language models for various purposes. In this project, we explore the main features of the text data provided for the Coursera Data Science Capstone through Johns Hopkins University, sponsored by SwiftKey. The final goal is to build a text prediction application with R Shiny that predicts the next word using a natural language processing model. Four language databases are provided; we will work with the English database rather than the others (Russian, German and Finnish).
The first step, and the goal of this report, is to give a basic overview of the data and perform the necessary cleaning, in order to get familiar with the database and to prepare for building the prediction model. More specifically, we will remove numbers, symbols, punctuation and other tokens that should not be predicted, which improves prediction accuracy. Then we will look at the most frequently appearing words and phrases, covering one-, two- and three-word combinations.
library(tm)
library(wordcloud)
library(RWeka)
library(stringi)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(qdap)
#setwd("D:/test/Coursera/Capstone_final project")
blogs = readLines("final/en_US/en_US.blogs.txt", skipNul = T, encoding="UTF-8")
news = readLines("final/en_US/en_US.news.txt",skipNul = T, encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.news.txt", skipNul = T, encoding =
## "UTF-8"): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter = readLines("final/en_US/en_US.twitter.txt",skipNul = T, encoding="UTF-8")
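The warning above only means that the news file does not end with a newline. If desired, it can be avoided by reading the file through a binary connection; the lines below are an optional sketch of that workaround, assuming the same file path as above, and simply replace the news readLines call.
## optional: read the news file through a binary connection to avoid the
## "incomplete final line" warning (same path and options as above)
con = file("final/en_US/en_US.news.txt", open="rb")
news = readLines(con, skipNul = T, encoding="UTF-8")
close(con)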
First, let's look at the structure of the data and get an overview of its size.
blog.size = round((file.info("final/en_US/en_US.blogs.txt")$size/1024^2),2)
new.size= round((file.info("final/en_US/en_US.news.txt")$size/1024^2),2)
twitter.size = round((file.info("final/en_US/en_US.twitter.txt")$size/1024^2),2)
sum.tab=data.frame(file=c("Blogs","News","Twitter"),
size=c(blog.size,new.size,twitter.size),
lines=c(length(blogs),length(news),length(twitter)),
words=c(sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))))
names(sum.tab)=c("File","Size(Mb)","Number of Lines","Number of Words")
kable(sum.tab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546246 |
| News | 196.28 | 77259 | 2674536 |
| Twitter | 159.36 | 2360148 | 30093410 |
Since the given database is very large, a sample of 1000 lines from each file will be used to demonstrate the cleaning and exploratory analysis in this milestone report. Below is the summary table of the sample data.
set.seed(123)
sample.size=1000
sample.blog=sample(blogs,sample.size)
sample.new=sample(news,sample.size)
sample.twitter=sample(twitter,sample.size)
sum.samtab=data.frame(file=c("Sample Blogs","Sample News","Sample Twitter"),
size=round((sample.size/c(length(blogs),length(news),length(twitter)))
*c(blog.size,new.size,twitter.size),2),
lines=c(length(sample.blog),length(sample.new),length(sample.twitter)),
words=c(sum(stri_count_words(sample.blog)),
sum(stri_count_words(sample.new)),
sum(stri_count_words(sample.twitter))))
names(sum.samtab)=names(sum.tab)
kable(sum.samtab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Sample Blogs | 0.22 | 1000 | 40768 |
| Sample News | 2.54 | 1000 | 34559 |
| Sample Twitter | 0.07 | 1000 | 12195 |
To clean the text data for the subsequent text mining process, we follow the steps below (each step is commented in the code): remove website links and Twitter handles, drop text within brackets, convert non-ASCII characters, expand abbreviations and contractions, lower-case the text, remove stopwords, punctuation, numbers and profanity, and finally strip extra whitespace.
#####CLEANING
sample=c(sample.blog,sample.new,sample.twitter)
## remove website links and Twitter handles
sample=gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample)
sample=gsub("@[^\\s]+"," ",sample)
# Remove text within brackets
sample=bracketX(sample)
## find lines containing latin-1 characters and transliterate them to ASCII
latin.sym=grep("[^NOT_ASCII](NOT_ASCII){2}[^NOT_ASCII]",iconv(sample, "latin1", "ASCII", sub="NOT_ASCII"))
sample[latin.sym]=stri_trans_general(sample[latin.sym], "latin-ascii")
## replace any remaining non-ASCII characters (e.g. curly quotes) with a plain apostrophe
sample=gsub('[^\x20-\x7E]', "'", sample)
##replace abbreviate words with their full terms
sample=replace_abbreviation(sample)
##replace contractions with their base words
sample=replace_contraction(sample)
##lower case
sample=tolower(sample)
## remove English stopwords and leftover apostrophe fragments
sample=removeWords(sample,stopwords("en"))
sample=gsub("'[A-Za-z]+", " ", sample)
##remove punctuations
sample=gsub("[[:punct:]]", " ", sample)
##remove numbers
sample=removeNumbers(sample)
## remove profanity
swear.words = read.table(file ="swearWords.txt", stringsAsFactors=F)
sample=removeWords(sample,swear.words[,1])
##remove extra space
sample=stripWhitespace(sample)
corpus = VCorpus(VectorSource(sample))
corpus = tm_map(corpus, PlainTextDocument)
rm(sample.twitter,sample.blog,sample.new)
rm(blogs,news,twitter,swear.words,latin.sym)
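As an optional sanity check (not part of the report output), a few cleaned lines can be inspected before building the n-grams to confirm that the steps above behaved as intended.
## optional sanity check: view a few cleaned lines before tokenization
head(sample, 3)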
We will examine the top 30 most frequent word combinations: unigrams, bigrams and trigrams. The frequencies are shown as a table, a histogram and a word cloud.
top=30 #number of top frequent appear words
# Frequency table: build an n-gram term-document matrix from the corpus
# and return a data frame of terms sorted by decreasing frequency
getFreq = function(corpus, ngram) {
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corpus, control = list(tokenizer = gram))
  freq1 = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq = data.frame(word = names(freq1), freq = freq1)
  freq$word = as.character(freq$word)
  return(freq)
}
# Frequency histogram: horizontal bar chart of the supplied frequency table
makePlot = function(table, label) {
  ggplot(table, aes(reorder(word, freq), freq)) +
    geom_bar(stat = "identity", fill = I("royalblue3")) +
    coord_flip() +
    labs(x = label, y = "Frequency")
}
uni.freq = getFreq(corpus,1)
kable(uni.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| will | 387 |
| said | 304 |
| one | 258 |
| just | 240 |
| can | 236 |
| like | 203 |
| time | 199 |
| people | 166 |
| new | 153 |
| day | 145 |
| year | 145 |
| get | 142 |
| now | 130 |
| first | 127 |
| also | 123 |
| know | 117 |
| see | 116 |
| good | 111 |
| think | 110 |
| back | 109 |
| two | 103 |
| love | 102 |
| way | 99 |
| make | 96 |
| much | 96 |
| really | 96 |
| want | 95 |
| work | 94 |
| last | 93 |
| many | 93 |
makePlot(uni.freq[1:top,], "30 Most Common Unigrams")
wordcloud(words = uni.freq$word, freq = uni.freq$freq, min.freq = 1,
max.words=50, random.order=T, rot.per=0.35, random.color = F,scale=c(4,.2),
colors=brewer.pal(8,name= "Set1"))
bi.freq = getFreq(corpus,2)
kable(bi.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| last year | 24 |
| new york | 22 |
| high school | 19 |
| dot com | 18 |
| year old | 18 |
| let us | 16 |
| new jersey | 15 |
| first time | 14 |
| right now | 14 |
| even though | 13 |
| will make | 13 |
| last week | 11 |
| one day | 10 |
| st louis | 10 |
| united states | 10 |
| www dot | 10 |
| years ago | 10 |
| can get | 9 |
| can wait | 9 |
| every day | 9 |
| many people | 9 |
| feel like | 8 |
| just like | 8 |
| make sure | 8 |
| one thing | 8 |
| will see | 8 |
| anyone else | 7 |
| can also | 7 |
| can see | 7 |
| felt like | 7 |
makePlot(bi.freq[1:top,], "30 Most Common Bigrams")
wordcloud(words = bi.freq$word, freq = bi.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = F,scale=c(3,.1),
colors=brewer.pal(8,name= "Set1"))
tri.freq = getFreq(corpus,3)
kable(tri.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| cup cup cup | 5 |
| jobs north dakota | 4 |
| pharmacist jobs north | 4 |
| around around around | 3 |
| can wait see | 3 |
| ho chi minh | 3 |
| let us get | 3 |
| ngo dinh diem | 3 |
| president barack obama | 3 |
| year old boy | 3 |
| american banker magazine | 2 |
| approach raising one | 2 |
| average per doctor | 2 |
| beat us place | 2 |
| brave new world | 2 |
| carmel valley ranch | 2 |
| casino hotel complex | 2 |
| chief executive officer | 2 |
| church adult choir | 2 |
| city business community | 2 |
| commuter rail line | 2 |
| conference call reporters | 2 |
| county sheriff department | 2 |
| defence budget continues | 2 |
| democratic candidates running | 2 |
| double opt double | 2 |
| duong van minh | 2 |
| eleanor roosevelt baruch | 2 |
| even though think | 2 |
| first quarter year | 2 |
makePlot(tri.freq[1:top,], "30 Most Common Trigrams")
wordcloud(words = tri.freq$word, freq = tri.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = T,scale=c(2,.1),
colors=brewer.pal(8,name= "Set1"))
We are now familiar with the database and have an overview of its contents. For the next step, we will build a prediction model based on the frequencies of the common word combinations created above.
Finally, we will apply the prediction algorithm in a Shiny app, together with an R presentation, to demonstrate and test the result.
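As a preview of that next step, below is a minimal sketch of how the frequency tables above could feed a simple next-word predictor using a back-off from trigrams to bigrams to unigrams. The helper name predictWord is illustrative only, and the frequencies from the 1000-line sample are used purely for demonstration, not as the final model.
## illustrative back-off predictor (a sketch, not the final model)
## assumes uni.freq, bi.freq and tri.freq from above exist in the workspace
predictWord = function(input) {
  words = unlist(strsplit(tolower(input), "\\s+"))
  n = length(words)
  # try trigrams first: match the last two words of the input as a prefix
  if (n >= 2) {
    prefix = paste(words[n-1], words[n])
    hits = tri.freq[startsWith(tri.freq$word, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(word(hits$word[1], 3))  # third word of the top trigram
  }
  # back off to bigrams: match the last word only
  hits = bi.freq[startsWith(bi.freq$word, paste0(words[n], " ")), ]
  if (nrow(hits) > 0) return(word(hits$word[1], 2))    # second word of the top bigram
  # final fallback: the single most frequent unigram
  return(uni.freq$word[1])
}
# e.g. predictWord("right") might return "now" given the sample frequencies above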