Large databases comprising text in a target language are commonly used to build language models for various purposes. In this project, we explore the main features of the text data provided for the Coursera Data Science Capstone through Johns Hopkins University, sponsored by SwiftKey. The final goal is to build a text prediction application with R Shiny that predicts the next word using a natural language processing model. Four language databases are provided; we will work with the English database rather than the others (Russian, German and Finnish).
The first step, and the goal of this report, is to give a basic overview of the data and perform the necessary cleaning, in order to get familiar with the database and to prepare for building the prediction model. More specifically, we will remove numbers, symbols, punctuation and other tokens that should not be predicted, which improves prediction accuracy. Then we will look at the most frequently appearing words and phrases, covering one-, two- and three-word combinations.
library(tm)
library(wordcloud)
library(RWeka)
library(stringi)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(qdap)
#setwd("D:/test/Coursera/Capstone_final project")
blogs = readLines("final/en_US/en_US.blogs.txt", skipNul = T, encoding="UTF-8")
news = readLines("final/en_US/en_US.news.txt",skipNul = T, encoding="UTF-8")
## Warning in readLines("final/en_US/en_US.news.txt", skipNul = T, encoding =
## "UTF-8"): incomplete final line found on 'final/en_US/en_US.news.txt'
twitter = readLines("final/en_US/en_US.twitter.txt",skipNul = T, encoding="UTF-8")
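The warning above only means that the news file does not end with a newline. If desired, it can be avoided by reading the file through a binary connection; the lines below are an optional sketch of that workaround, assuming the same file path as above, and simply replace the news readLines call.
## optional: read the news file through a binary connection to avoid the
## "incomplete final line" warning (same path and options as above)
con = file("final/en_US/en_US.news.txt", open="rb")
news = readLines(con, skipNul = T, encoding="UTF-8")
close(con)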
First, let's look at the structure of the data and get an overview of its size.
blog.size = round((file.info("final/en_US/en_US.blogs.txt")$size/1024^2),2)
new.size= round((file.info("final/en_US/en_US.news.txt")$size/1024^2),2)
twitter.size = round((file.info("final/en_US/en_US.twitter.txt")$size/1024^2),2)
sum.tab=data.frame(file=c("Blogs","News","Twitter"),
size=c(blog.size,new.size,twitter.size),
lines=c(length(blogs),length(news),length(twitter)),
words=c(sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))))
names(sum.tab)=c("File","Size(Mb)","Number of Lines","Number of Words")
kable(sum.tab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546246 |
| News | 196.28 | 77259 | 2674536 |
| Twitter | 159.36 | 2360148 | 30093410 |
Since the given database is very large, a sample of 1000 lines from each file will be used to demonstrate the cleaning and exploratory analysis in this milestone report. Below is the summary table of the sample data.
set.seed(123)
sample.size=1000
sample.blog=sample(blogs,sample.size)
sample.new=sample(news,sample.size)
sample.twitter=sample(twitter,sample.size)
sum.samtab=data.frame(file=c("Sample Blogs","Sample News","Sample Twitter"),
size=round((sample.size/c(length(blogs),length(news),length(twitter)))
*c(blog.size,new.size,twitter.size),2),
lines=c(length(sample.blog),length(sample.new),length(sample.twitter)),
words=c(sum(stri_count_words(sample.blog)),
sum(stri_count_words(sample.new)),
sum(stri_count_words(sample.twitter))))
names(sum.samtab)=names(sum.tab)
kable(sum.samtab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Sample Blogs | 0.22 | 1000 | 40768 |
| Sample News | 2.54 | 1000 | 34559 |
| Sample Twitter | 0.07 | 1000 | 12195 |
To clean the text data for the subsequent text mining process, we follow the steps below (each step is commented in the code): remove website links and Twitter handles, drop text within brackets, convert non-ASCII characters, expand abbreviations and contractions, lower-case the text, remove stopwords, punctuation, numbers and profanity, and finally strip extra whitespace.
#####CLEANING
sample=c(sample.blog,sample.new,sample.twitter)
## remove website links and Twitter handles
sample=gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample)
sample=gsub("@[^\\s]+"," ",sample)
# Remove text within brackets
sample=bracketX(sample)
## find lines containing latin-1 characters and transliterate them to ASCII
latin.sym=grep("[^NOT_ASCII](NOT_ASCII){2}[^NOT_ASCII]",iconv(sample, "latin1", "ASCII", sub="NOT_ASCII"))
sample[latin.sym]=stri_trans_general(sample[latin.sym], "latin-ascii")
## replace any remaining non-ASCII characters (e.g. curly quotes) with a plain apostrophe
sample=gsub('[^\x20-\x7E]', "'", sample)
##replace abbreviate words with their full terms
sample=replace_abbreviation(sample)
##replace contractions with their base words
sample=replace_contraction(sample)
##lower case
sample=tolower(sample)
## remove English stopwords and leftover apostrophe fragments
sample=removeWords(sample,stopwords("en"))
sample=gsub("'[A-Za-z]+", " ", sample)
##remove punctuations
sample=gsub("[[:punct:]]", " ", sample)
##remove numbers
sample=removeNumbers(sample)
## remove profanity
swear.words = read.table(file ="swearWords.txt", stringsAsFactors=F)
sample=removeWords(sample,swear.words[,1])
##remove extra space
sample=stripWhitespace(sample)
corpus = VCorpus(VectorSource(sample))
corpus = tm_map(corpus, PlainTextDocument)
rm(sample.twitter,sample.blog,sample.new)
rm(blogs,news,twitter,swear.words,latin.sym)
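As an optional sanity check (not part of the report output), a few cleaned lines can be inspected before building the n-grams to confirm that the steps above behaved as intended.
## optional sanity check: view a few cleaned lines before tokenization
head(sample, 3)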
We will examine the top 30 most frequent word combinations: unigrams, bigrams and trigrams. The frequencies are shown as a table, a histogram and a word cloud.
top=30 #number of top frequent appear words
# Frequency table: build an n-gram term-document matrix from the corpus
# and return a data frame of terms sorted by decreasing frequency
getFreq = function(corpus, ngram) {
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corpus, control = list(tokenizer = gram))
  freq1 = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq = data.frame(word = names(freq1), freq = freq1)
  freq$word = as.character(freq$word)
  return(freq)
}
# Frequency histogram: horizontal bar chart of the supplied frequency table
makePlot = function(table, label) {
  ggplot(table, aes(reorder(word, freq), freq)) +
    geom_bar(stat = "identity", fill = I("royalblue3")) +
    coord_flip() +
    labs(x = label, y = "Frequency")
}
uni.freq = getFreq(corpus,1)
kable(uni.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| will | 387 |
| said | 304 |
| one | 258 |
| just | 240 |
| can | 236 |
| like | 203 |
| time | 199 |
| people | 166 |
| new | 153 |
| day | 145 |
| year | 145 |
| get | 142 |
| now | 130 |
| first | 127 |
| also | 123 |
| know | 117 |
| see | 116 |
| good | 111 |
| think | 110 |
| back | 109 |
| two | 103 |
| love | 102 |
| way | 99 |
| make | 96 |
| much | 96 |
| really | 96 |
| want | 95 |
| work | 94 |
| last | 93 |
| many | 93 |
makePlot(uni.freq[1:top,], "30 Most Common Unigrams")
wordcloud(words = uni.freq$word, freq = uni.freq$freq, min.freq = 1,
max.words=50, random.order=T, rot.per=0.35, random.color = F,scale=c(4,.2),
colors=brewer.pal(8,name= "Set1"))
bi.freq = getFreq(corpus,2)
kable(bi.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| last year | 24 |
| new york | 22 |
| high school | 19 |
| dot com | 18 |
| year old | 18 |
| let us | 16 |
| new jersey | 15 |
| first time | 14 |
| right now | 14 |
| even though | 13 |
| will make | 13 |
| last week | 11 |
| one day | 10 |
| st louis | 10 |
| united states | 10 |
| www dot | 10 |
| years ago | 10 |
| can get | 9 |
| can wait | 9 |
| every day | 9 |
| many people | 9 |
| feel like | 8 |
| just like | 8 |
| make sure | 8 |
| one thing | 8 |
| will see | 8 |
| anyone else | 7 |
| can also | 7 |
| can see | 7 |
| felt like | 7 |
makePlot(bi.freq[1:top,], "30 Most Common Bigrams")
wordcloud(words = bi.freq$word, freq = bi.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = F,scale=c(3,.1),
colors=brewer.pal(8,name= "Set1"))
tri.freq = getFreq(corpus,3)
kable(tri.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| cup cup cup | 5 |
| jobs north dakota | 4 |
| pharmacist jobs north | 4 |
| around around around | 3 |
| can wait see | 3 |
| ho chi minh | 3 |
| let us get | 3 |
| ngo dinh diem | 3 |
| president barack obama | 3 |
| year old boy | 3 |
| american banker magazine | 2 |
| approach raising one | 2 |
| average per doctor | 2 |
| beat us place | 2 |
| brave new world | 2 |
| carmel valley ranch | 2 |
| casino hotel complex | 2 |
| chief executive officer | 2 |
| church adult choir | 2 |
| city business community | 2 |
| commuter rail line | 2 |
| conference call reporters | 2 |
| county sheriff department | 2 |
| defence budget continues | 2 |
| democratic candidates running | 2 |
| double opt double | 2 |
| duong van minh | 2 |
| eleanor roosevelt baruch | 2 |
| even though think | 2 |
| first quarter year | 2 |
makePlot(tri.freq[1:top,], "30 Most Common Trigrams")
wordcloud(words = tri.freq$word, freq = tri.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = T,scale=c(2,.1),
colors=brewer.pal(8,name= "Set1"))
We are now familiar with the database and have an overview of its contents. For the next step, we will build a prediction model based on the frequencies of the common word combinations created above.
Finally, we will apply the prediction algorithm in a Shiny app, together with an R presentation, to demonstrate and test the result.
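As a preview of that next step, below is a minimal sketch of how the frequency tables above could feed a simple next-word predictor using a back-off from trigrams to bigrams to unigrams. The helper name predictWord is illustrative only, and the frequencies from the 1000-line sample are used purely for demonstration, not as the final model.
## illustrative back-off predictor (a sketch, not the final model)
## assumes uni.freq, bi.freq and tri.freq from above exist in the workspace
predictWord = function(input) {
  words = unlist(strsplit(tolower(input), "\\s+"))
  n = length(words)
  # try trigrams first: match the last two words of the input as a prefix
  if (n >= 2) {
    prefix = paste(words[n-1], words[n])
    hits = tri.freq[startsWith(tri.freq$word, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) return(word(hits$word[1], 3))  # third word of the top trigram
  }
  # back off to bigrams: match the last word only
  hits = bi.freq[startsWith(bi.freq$word, paste0(words[n], " ")), ]
  if (nrow(hits) > 0) return(word(hits$word[1], 2))    # second word of the top bigram
  # final fallback: the single most frequent unigram
  return(uni.freq$word[1])
}
# e.g. predictWord("right") might return "now" given the sample frequencies above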