This Milestone Report explores the data for the Capstone Project of the Coursera Data Science Specialization. Coursera and SwiftKey are partnering on this project, which applies data science to natural language processing. The project uses a large text corpus of documents to predict the next word based on the preceding input. The data is extracted from the raw files, cleaned, and later used in a Shiny application. Here we summarize the corpus and outline a plan for building the predictive model.
library(tm)
library(wordcloud)
library(RWeka)
library(stringi)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(qdap)
#### Download and save data
#specify the source and destination of the download
#destination_file <- "20180808_Coursera_SwiftKey.zip"
#source_file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
#download.file(source_file, destination_file)
# extract the files from the zip file
#unzip(destination_file)
#
#url_profanity <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
#profanity_file_destination <- "final/en_US/profanity.txt"
#download.file(url_profanity, profanity_file_destination)
blogs = readLines("final/en_US/en_US.blogs.txt", skipNul = T, encoding="UTF-8")
news = readLines("final/en_US/en_US.news.txt",skipNul = T, encoding="UTF-8")
twitter = readLines("final/en_US/en_US.twitter.txt",skipNul = T, encoding="UTF-8")
First of all, let's look at the structure of the data and get an overview.
blog.size = round((file.info("final/en_US/en_US.blogs.txt")$size/1024^2),2)
new.size= round((file.info("final/en_US/en_US.news.txt")$size/1024^2),2)
twitter.size = round((file.info("final/en_US/en_US.twitter.txt")$size/1024^2),2)
sum.tab=data.frame(file=c("Blogs","News","Twitter"),
size=c(blog.size,new.size,twitter.size),
lines=c(length(blogs),length(news),length(twitter)),
words=c(sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))))
names(sum.tab)=c("File","Size(Mb)","Number of Lines","Number of Words")
kable(sum.tab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546239 |
| News | 196.28 | 1010242 | 34762395 |
| Twitter | 159.36 | 2360148 | 30093413 |
Since the given corpus is very large, a sample of 2,000 lines from each file is used for the cleaning and exploratory analysis in this Milestone report. Below is the summary table of the sample data.
set.seed(2131)
sample.size=2000
sample.blog=sample(blogs,sample.size)
sample.new=sample(news,sample.size)
sample.twitter=sample(twitter,sample.size)
sum.samtab=data.frame(file=c("Sample Blogs","Sample News","Sample Twitter"),
size=round((sample.size/c(length(blogs),length(news),length(twitter)))
*c(blog.size,new.size,twitter.size),2),
lines=c(length(sample.blog),length(sample.new),length(sample.twitter)),
words=c(sum(stri_count_words(sample.blog)),
sum(stri_count_words(sample.new)),
sum(stri_count_words(sample.twitter))))
names(sum.samtab)=names(sum.tab)
kable(sum.samtab) %>%
kable_styling(bootstrap_options="striped",full_width=F)
| File | Size(Mb) | Number of Lines | Number of Words |
|---|---|---|---|
| Sample Blogs | 0.45 | 2000 | 82596 |
| Sample News | 0.39 | 2000 | 70124 |
| Sample Twitter | 0.14 | 2000 | 25229 |
The tm package is used to clean the data, following the tm documentation and Text Mining in R. The profanity word list comes from CMU.
To prepare the text data for further text mining, we follow the steps below:
##### Data Cleaning
sample=c(sample.blog,sample.new,sample.twitter)
## remove website links and Twitter handles
sample=gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", " ", sample)
sample=gsub("@[^\\s]+"," ",sample)
# Remove text within brackets
sample=bracketX(sample)
## transliterate latin1 characters to ASCII
latin.sym=grep("[^NOT_ASCII](NOT_ASCII){2}[^NOT_ASCII]",iconv(sample, "latin1", "ASCII", sub="NOT_ASCII"))
sample[latin.sym]=stri_trans_general(sample[latin.sym], "latin-ascii")
sample=gsub('[^\x20-\x7E]', "'", sample)
##replace abbreviate words with their full terms
sample=replace_abbreviation(sample)
##replace contractions with their base words
sample=replace_contraction(sample)
##lower case
sample=tolower(sample)
##remove stopwords and unwanted abbreviation
sample=removeWords(sample,stopwords("en"))
sample=gsub("'[A-z]+", " ", sample)
##remove punctuations
sample=gsub("[[:punct:]]", " ", sample)
##remove numbers
sample=removeNumbers(sample)
## remove profanity
profanity = read.table(file ="final/en_US/profanity.txt", stringsAsFactors=F)
sample=removeWords(sample,profanity[,1])
##remove extra space
sample=stripWhitespace(sample)
corpus = VCorpus(VectorSource(sample))
corpus = tm_map(corpus, PlainTextDocument)
rm(sample.twitter,sample.blog,sample.new)
rm(blogs,news,twitter,profanity,latin.sym)
We now need to break the text into words and sentences and turn it into n-grams. This step is called tokenization because we break the text up into units of meaning, called tokens.
In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations.
Tokenization is available in R through the RWeka package. The following function uses RWeka to extract n-grams (here 1-grams, 2-grams, and 3-grams) from the text corpus.
We will examine the top 20 most frequent word combinations for unigrams, bigrams, and trigrams. The frequencies are shown as a table, a histogram, and a word cloud.
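As a small illustration of what the tokenizer produces (the toy sentence below is just an example; RWeka is already loaded above), the same call pattern used inside getFreq() gives:
# toy example of RWeka's NGramTokenizer -- the same call pattern used in getFreq() below
toy = "the quick brown fox jumps"
NGramTokenizer(toy, Weka_control(min = 1, max = 1))  # unigrams: "the", "quick", "brown", ...
NGramTokenizer(toy, Weka_control(min = 2, max = 2))  # bigrams: "the quick", "quick brown", ...
NGramTokenizer(toy, Weka_control(min = 3, max = 3))  # trigrams: "the quick brown", ...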
top=20 #number of top frequent appear words
#Frequency table
getFreq = function(corp, ngram) {
  # tokenize the corpus into n-grams of the requested order
  gram = function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
  tdm = TermDocumentMatrix(corp, control = list(tokenizer = gram))
  # sum the term frequencies across documents and sort in decreasing order
  freq1 = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  freq = data.frame(word = names(freq1), freq = freq1)
  freq$word = as.character(freq$word)
  return(freq)
}
#Frequency Histogram
makePlot = function(table, label) {
ggplot(table, aes(reorder(word, freq), freq)) +
geom_bar(stat = "identity", fill = I("blue"))+
coord_flip()+
labs(x = label, y = "Frequency")
}
uni.freq = getFreq(corpus,1)
kable(uni.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| will | 686 |
| said | 597 |
| can | 512 |
| one | 462 |
| like | 433 |
| just | 428 |
| time | 363 |
| get | 330 |
| new | 287 |
| first | 286 |
| now | 281 |
| people | 271 |
| day | 262 |
| also | 249 |
| year | 243 |
| back | 238 |
| know | 231 |
| two | 226 |
| good | 221 |
| much | 206 |
makePlot(uni.freq[1:top,], "20 Most Common Unigrams")
wordcloud(words = uni.freq$word, freq = uni.freq$freq, min.freq = 1,
max.words=50, random.order=T, rot.per=0.35, random.color = F,scale=c(4,.2),
colors=brewer.pal(8,name= "Set1"))
bi.freq = getFreq(corpus,2)
kable(bi.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| u s | 57 |
| year old | 47 |
| last year | 42 |
| new york | 39 |
| dot com | 31 |
| right now | 31 |
| let us | 30 |
| st louis | 29 |
| years ago | 26 |
| can get | 25 |
| first time | 25 |
| high school | 25 |
| new jersey | 23 |
| www dot | 23 |
| feel like | 21 |
| can wait | 19 |
| looks like | 18 |
| make sure | 16 |
| san francisco | 15 |
| even though | 14 |
makePlot(bi.freq[1:top,], "20 Most Common Bigrams")
wordcloud(words = bi.freq$word, freq = bi.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = F,scale=c(3,.1),
colors=brewer.pal(8,name= "Set1"))
tri.freq = getFreq(corpus,3)
kable(tri.freq[1:top,],row.names = F) %>%
kable_styling(bootstrap_options="striped",full_width=F)%>%
scroll_box(width = "100%", height = "400px")
| word | freq |
|---|---|
| new york times | 5 |
| president barack obama | 5 |
| digs service points | 4 |
| let us get | 4 |
| nice hair guy | 4 |
| age grade range | 3 |
| aug age grade | 3 |
| can wait see | 3 |
| chicago chicago illinois | 3 |
| cloths cold water | 3 |
| five times week | 3 |
| four years ago | 3 |
| gov chris christie | 3 |
| grade range yo | 3 |
| happy mother day | 3 |
| happy mothers day | 3 |
| hope great day | 3 |
| let us make | 3 |
| like year old | 3 |
| new york state | 3 |
makePlot(tri.freq[1:top,], "20 Most Common Trigrams")
wordcloud(words = tri.freq$word, freq = tri.freq$freq, min.freq = 1,
max.words=30, random.order=T, rot.per=0.35, random.color = T,scale=c(2,.1),
colors=brewer.pal(8,name= "Set1"))
After this exploratory analysis we can conclude that the process is computationally heavy and requires a lot of processing power and RAM. Many of the frequent words repeat across the three sources, and the more complex the n-gram, the lower its frequency.
#### Prediction model and plans for Shiny app
While the strategy for modeling and prediction has not been finalized, an n-gram model with a frequency look-up table might be used, based on the analysis above. A possible prediction method is to start with the 4-gram model to find the most likely next word; if no match is found, the model backs off to the 3-gram model, and so forth. Stemming might also be added to the data preprocessing.
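As a rough sketch of this back-off idea, assuming the frequency tables built above are available (predict_next() is an illustrative helper, not the final model; a real implementation would also apply the same cleaning to the input, since the tables were built after stopword removal):
# Illustrative back-off look-up over the n-gram frequency tables built above
# (tri.freq, bi.freq, uni.freq). predict_next() is a sketch, not the final model.
predict_next = function(phrase, n = 3) {
  words = tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  # try trigrams whose first two words match the end of the input
  if (length(words) == 2) {
    hits = tri.freq[startsWith(tri.freq$word, paste0(paste(words, collapse = " "), " ")), ]
    if (nrow(hits) > 0) return(head(sub("^.+ ", "", hits$word), n))
  }
  # back off to bigrams keyed on the last input word
  if (length(words) >= 1) {
    hits = bi.freq[startsWith(bi.freq$word, paste0(tail(words, 1), " ")), ]
    if (nrow(hits) > 0) return(head(sub("^.+ ", "", hits$word), n))
  }
  # final fallback: the most frequent unigrams
  head(uni.freq$word, n)
}
predict_next("happy mothers")  # e.g. "day" appears first, based on the sample trigrams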
For the app, a simple, minimalistic design is planned, displaying the most probable next word(s), or offering a button that inserts the predicted next word.
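A minimal sketch of such an interface (assuming the illustrative predict_next() helper above; the final layout may differ):
# Minimal Shiny sketch of the planned interface: a text box, a predict button,
# and the suggested next words. Relies on the illustrative predict_next() above.
library(shiny)
ui = fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  actionButton("go", "Predict"),
  verbatimTextOutput("prediction")
)
server = function(input, output) {
  output$prediction = renderText({
    input$go  # re-evaluate when the button is pressed
    isolate(paste(predict_next(input$phrase), collapse = ", "))
  })
}
# shinyApp(ui, server)  # run once the n-gram tables and helper are loaded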