This is the initial report of a “predictive text model” project. The goal of the project is to build a data product that predicts the next word of a sentence typed by the user. This report addresses the requirements of the Milestone Report of the Capstone Project course in the Coursera Data Science Specialization.
A predictive text model belongs to the field of Natural Language Processing (NLP). This is a new field for us, but we will apply the same approach used throughout the Data Science Specialization: getting, cleaning and exploring the data in a first stage (here), and later selecting and building a prediction algorithm to be applied in the model.
The data set used in the project can be downloaded from the link given in the appendix. The data comes from a corpus named HC Corpora.
We downloaded the zip file, unzipped it and loaded the three American English data files into R. The text files contain documents, one document per line, taken from blogs, news sites and Twitter. We considered loading only a subset of lines, and in fact the next step is a sampling process; in the end, however, for greater flexibility and because the initial load was fast enough, the code loads the full files directly.
The code for this step, as for the whole project, is in the Appendix.
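As a minimal sketch of the loading step (assuming, as in the appendix, that the zip file has been unzipped into ./data/final/en_US/), the three files can be read line by line with the readr package; read_lines() also accepts an n_max argument if only a subset of lines were wanted:

library(readr)

# assumed location of the unzipped American English files (see the appendix)
blogs_o   <- read_lines("./data/final/en_US/en_US.blogs.txt")
news_o    <- read_lines("./data/final/en_US/en_US.news.txt")
twitter_o <- read_lines("./data/final/en_US/en_US.twitter.txt")
# read_lines(file, n_max = 10000) would load only the first 10000 lines instead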
A summary of the three files is displayed in the following table:
We can see the following features:
The files are very large, which caused many memory problems with the “tm” package and is the reason for working with a data sample. The total number of words is similar across the three text types.
The longest documents are the blogs, followed by the news; the Twitter documents are the shortest, reflecting Twitter’s character limit. Furthermore, the mean number of words per document has, in Twitter’s case, the smallest standard deviation (6.9), which points to intensive use of that limit.
The blogs have the largest number of words per document, but also the largest deviation: blog lengths vary widely.
Finally, we can observe that the news have the largest number of words per sentence.
Given these features, we decided to extract a random sample of the data, with a sample size of 5 % of the documents of each file, the same proportion for all three, given their similar word counts.
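A minimal sketch of the sampling step (mirroring the appendix; a small helper, sample_source(), not in the original code, keeps it compact):

set.seed(123)   # reproducible sample
t <- 0.05       # 5 % of the documents of each source

# helper (not in the appendix): draw a t-fraction of the documents of one source
sample_source <- function(x, source, t) {
  index <- sample(seq_along(x), round(length(x) * t))
  data.frame(text = x[index], source = source, stringsAsFactors = FALSE)
}
# one data frame with the sampled texts and their sources
text_s <- rbind(sample_source(blogs_o,   "blogs",   t),
                sample_source(news_o,    "news",    t),
                sample_source(twitter_o, "twitter", t))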
Additionally, there are two more data sources whose use is explained in the data cleaning section: first, the English stop word dictionary (174 entries) included in the “tm” package in R, and second, a profanity dictionary (377 entries) loaded from the link given in the appendix.
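Both dictionaries can be loaded as plain character vectors; a sketch, using tm’s built-in stop word list and the profanity list from the repository referenced in the appendix:

library(tm)      # provides stopwords()
library(readr)

stop_words  <- stopwords("english")   # the 174 English stop words shipped with tm
profanities <- read_lines(paste0(
  "https://raw.githubusercontent.com/LDNOOBW/",
  "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"))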
The data cleaning process was developed with tools from the “tm” and “quanteda” packages in R. The word transformations have to be chosen according to the specific goal of the project, selecting the next word in a sentence; different goals would imply different transformations.
We want to predict the next word of an incomplete sentence written by the user, so the basic unit of our data set should be the sentence, not the document. Loading the data into R gives us three text objects in which each line is a document. We must therefore split each document into its sentences; for this we used a function from the “quanteda” package.
A corpus is a data class that, in the NLP field, makes it easy to work with text data, basically for cleaning it and for creating a Document Term Matrix that counts n-grams, as we will see later.
A question here is: will it be necessary to filter the corpus by data source? When a user writes a sentence and waits for the three or four predicted words, we do not know whether he is a Twitter user, a blogger or a journalist. He could be anybody, and we cannot apply a different model to each one. So we will generalize and treat the three files as a single set: the corpus combines the three files into one.
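A sketch of these two steps as they appear in the appendix: split the sampled texts into sentences with quanteda and build a single tm corpus with one sentence per document (tokenize() with what = "sentence" matches the quanteda version used here; more recent versions expose the same functionality under other names).

library(quanteda)
library(tm)

# one element per sentence, regardless of the source (blogs, news or twitter)
text_sent <- unlist(tokenize(text_s$text, what = "sentence"))
# a single corpus: each sentence becomes a document
text_c <- VCorpus(VectorSource(text_sent))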
We will transform the raw texts into a regularized word set that helps to optimize the language models built on them. The decisions about what to do and what not to do are the following:
In conclusion, the cleaning function will apply only the first four types of transformations.
Profanity, as the Merriam-Webster dictionary says, is “an offensive word”, or “offensive language” (Wikipedia). In order to avoid predicting this kind of word, profanities are removed from the corpus.
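A condensed sketch of the cleaning pipeline (mirroring the clean_c() function in the appendix, which additionally trims leading and trailing spaces; profanities is the profanity vector loaded above):

library(tm)

removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)  # keep letters and spaces
removeURL      <- function(x) gsub("http[^[:space:]]*", "", x)       # drop URLs
profanities_cl <- tolower(removeNumPunct(profanities))               # normalized profanity list

text_cl <- text_c
text_cl <- tm_map(text_cl, content_transformer(removeNumPunct))
text_cl <- tm_map(text_cl, content_transformer(tolower))
text_cl <- tm_map(text_cl, content_transformer(removeURL))
text_cl <- tm_map(text_cl, removeWords, profanities_cl)   # profanity filtering
text_cl <- tm_map(text_cl, stripWhitespace)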
First, the raw text of a document from the sample:
## [1] "Today<U+0092>s TNT (This-N-That) post from Club Creative Studio is about technique in your creative process. I often find myself playing <U+0093>the waiting game<U+0094>. You know what this means to an individual on a daily basis. We wait in line, we wait for the mail to come, we wait in traffic, we wait for the dryer to signal the clothes are dry. We wait for the text response, we wait for something to download, we WAIT, WAIT, WAIT!"
Second, the raw text divided into sentences:
## $`53`
## [1] "Today<U+0092>s TNT (This-N-That) post from Club Creative Studio is about technique in your creative process."
##
## $`54`
## [1] "I often find myself playing <U+0093>the waiting game<U+0094>."
##
## $`55`
## [1] "You know what this means to an individual on a daily basis."
##
## $`56`
## [1] "We wait in line, we wait for the mail to come, we wait in traffic, we wait for the dryer to signal the clothes are dry."
##
## $`57`
## [1] "We wait for the text response, we wait for something to download, we WAIT, WAIT, WAIT!"
And finally, the cleaned sentences:
## $`53`
## [1] "todays tnt thisnthat post from club creative studio is about technique in your creative process"
##
## $`54`
## [1] "i often find myself playing the waiting game"
##
## $`55`
## [1] "you know what this means to an individual on a daily basis"
##
## $`56`
## [1] "we wait in line we wait for the mail to come we wait in traffic we wait for the dryer to signal the clothes are dry"
##
## $`57`
## [1] "we wait for the text response we wait for something to download we wait wait wait"
The exploratory analysis studies the distribution of the words and the relationships between them. We need a transformation that lets us treat the text strings statistically: the N-gram model. We will work with unigrams, bigrams and trigrams, i.e. sequences of one, two and three words respectively.
To build the N-grams, we used the “RWeka” and “tm” packages. The memory problems we had with RWeka were significant: the code only worked after splitting the corpus into groups of a thousand documents (or fewer), running the tokenizer on each group and summing the partial results (see the Appendix).
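A sketch of that chunked approach (mirroring the appendix): the cleaned corpus is split into blocks of at most 1000 documents, each block is tokenized with RWeka, and the partial counts are summed.

library(tm)
library(RWeka)
library(dplyr)

# split the cleaned corpus into chunks of at most 1000 documents
chunks <- split(text_cl, ceiling(seq_along(text_cl) / 1000))

# bigram counts for a single chunk (the unigram and trigram cases only differ
# in the min/max values of the Weka control)
bigram_counts <- function(chunk) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(chunk, control = list(wordLengths = c(1, Inf),
                                                  tokenize = tok))
  counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(token = names(counts), freq = counts, stringsAsFactors = FALSE)
}

# apply the tokenizer chunk by chunk and sum the partial frequencies
freq_bi <- do.call(rbind, lapply(chunks, bigram_counts)) %>%
  group_by(token) %>%
  summarize(freq = sum(freq)) %>%
  arrange(desc(freq))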
An initial visual review showed, in the set of tokens with frequency = 1, a large number of strange words, typos and meaningless expressions. We decided to remove them from the three N-gram sets.
Plot 1 shows the distribution of the words (unigrams). To keep the plot readable, the x-axis shows only the first 400 words, identified by number and sorted by frequency.
We can see that the distribution is strongly concentrated, with almost all the mass in the leftmost, most frequent words.
Plot 2 and Plot 3 are magnified views of Plot 1 that let us see the specific words. Plot 2 shows the top 30 words ordered by frequency; Plot 3, in “wordcloud” format, shows the top 200.
All of the top 30 are stop words. To see words with a specific meaning, we would have to lower the frequency threshold in the bar plot or look at the smaller words in the wordcloud.
We can also analyze the coverage index of the language, i.e. the cumulative percentage of the total tokens covered by a given number of word types. It is shown in the left panel of Plot 6 in the next section. A relatively small number of words accounts for a high percentage of the total: only 2119 word types cover 80 % of the total tokens, and 6333 cover 90 %. This shows, from another point of view, the high concentration of the unigram distribution.
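A sketch of how these coverage numbers can be obtained from the unigram frequency table (a data frame with a freq column sorted in decreasing order, as built in the appendix):

# cumulative percentage of the total tokens covered by the first i word types
coverage <- cumsum(unigram$freq) * 100 / sum(unigram$freq)

# number of word types needed to cover 80 % and 90 % of the tokens
which(coverage >= 80)[1]
which(coverage >= 90)[1]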
Bigrams and trigrams are the most important results for the goal of our project: predicting the next word. We need not only unigrams but ordered sequences of words and their frequencies. To predict the word that follows the ones already typed, it helps to know how often each candidate appears after the last word typed (bigrams) and, even better, after the last two words typed (trigrams). This is the information that the bigram and trigram distributions give us.
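As an illustration of how the trigram table could later be used for prediction (a hypothetical helper, not part of the milestone code): given the last two words typed, we would look up the most frequent trigrams that start with them and return their third words.

library(dplyr)

# hypothetical lookup; freq_tri has the columns w1, w2, w3 and freq (see the appendix)
predict_next <- function(freq_tri, last_two, n = 3) {
  freq_tri %>%
    filter(w1 == last_two[1], w2 == last_two[2]) %>%
    arrange(desc(freq)) %>%
    head(n) %>%
    pull(w3) %>%
    as.character()
}
# usage: predict_next(freq$freq_tri, c("one", "of")) returns the three most
# frequent third words observed after "one of"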
Plot 4 shows the distribution of the bigrams and trigrams. They receive the same treatment as the unigrams: the first 400 tokens, identified by number.
The frequencies are in general lower than those of the unigrams in Plot 1. The distribution is therefore less concentrated, reflecting the huge number of options the language offers for building phrases and sentences.
The next plot, Plot 5, displays the most frequent bigrams and trigrams. They are among the most commonly used phrases in English.
We can see that, as with the unigrams, every one of the top 30 tokens contains a stop word; overall, 68.4 % of the bigrams and 96.1 % of the trigrams contain at least one stop word. It is therefore more than likely that one of these words will appear in the set of predicted words.
Finally, as we already did for the unigrams, we can look at the coverage index. It is displayed in the two panels on the right of Plot 6.
Bigrams and trigrams show coverage curves closer to the straight line than that of the unigrams. This is due to the greater dispersion of their distributions: covering 80 % of the total bigrams requires 90475 bigram types; for the trigrams, the number is 176987.
# milestone report.R
#
# locate this file in the working directory with functions.R
## notation: files saved in ./data and the R objects they contain
# text_o     : original texts as R vectors: blogs_o, news_o, twitter_o
# sum_table  : summary table of the data: sum_table
# text_s     : sample of the original data: text_s
# text_c     : corpus built from text_s: text_c
# text_cl    : cleaned corpus (stop words kept): text_cl
# freq_types : per-source frequencies: freq_blogs, freq_news, freq_twitter
# freq       : token frequencies (stop words kept): freq
#
#------------------ open libraries----------------------------------------------
library(readr)
library(tm)
library(quanteda)
library(stringi)
library(dplyr)
library(RWeka)
library(ggplot2)
library(grid)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
source("functions.R")
#----------------- Loading the data---------------------------------------------
# data folder
if (!file.exists("./data")) { # data folder
dir.create("data")
}
# paths and file names
url<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/
Coursera-SwiftKey.zip"
zipfile <- "./data/Coursera-SwiftKey.zip"
blogs_t<-"./data/final/en_US/en_US.blogs.txt"
news_t<-"./data/final/en_US/en_US.news.txt"
twitter_t<-"./data/final/en_US/en_US.twitter.txt"
# downloading the zip data file
if (!file.exists(zipfile)){
download.file(url,zipfile)
write.csv(date(),"./data/date_download.txt")
}
# unzipping the data file
if (!file.exists(blogs_t) | !file.exists(news_t) |!file.exists(twitter_t)){
unzip(zipfile,exdir = "./data")
}
#-------------- loading files in R------------------------------------------
if (!exists("blogs_o") | !exists("news_o") | !exists("twitter_o")){
if(!file.exists("./data/text_o")){
blogs_o<-read_lines(blogs_t)
news_o<-read_lines(news_t)
twitter_o<-read_lines(twitter_t)
save(list=c("blogs_o","news_o","twitter_o"),file = "./data/text_o")
}
else{
load("./data/text_o")
}
}
#----------------------- basic summaries---------------------------------------
if (!exists("sum_table")){
if(!file.exists("./data/sum_table")){
blogs_w<-stri_count_words(blogs_o)
news_w<-stri_count_words(news_o)
twitter_w<-stri_count_words(twitter_o)
blogs_st<-stri_count_boundaries(blogs_o,type="sentence")
news_st<-stri_count_boundaries(news_o,type="sentence")
twitter_st<-stri_count_boundaries(twitter_o,type="sentence")
file_size_Mb<-round(c(file.info(blogs_t)$size/1048576,
file.info(news_t)$size/1048576,
file.info(twitter_t)$size/1048576),2)
sum_table<-data.frame(file=c("blogs","news","twitter"),size_Mb=file_size_Mb,
docs= c(length(blogs_o),length(news_o),length(twitter_o)),
words=c(sum(blogs_w),sum(news_w),sum(twitter_w)),
words_mean=round(c(mean(blogs_w),mean(news_w),
mean(twitter_w)),2),
words_sd=round(c(sd(blogs_w),sd(news_w),sd(twitter_w)),2),
sentences=c(sum(blogs_st),sum(news_st),sum(twitter_st)),
sentences_mean=round(c(mean(blogs_st),mean(news_st),
mean(twitter_st)),2),
words_by_sent=round(c(sum(blogs_w)/sum(blogs_st),
sum(news_w)/sum(news_st),
sum(twitter_w)/sum(twitter_st)),2))
sum_table<-t(as.matrix(sum_table))
save("sum_table",file = "./data/sum_table")
}
else{
load("./data/sum_table")
}
}
rm("blogs_w","news_w","twitter_w","blogs_st","news_st","twitter_st")
#----------------------- data sampling ---------------------------------------
set.seed(123)
t=0.05
if (!exists("text_s")){
if(!file.exists("./data/text_s")){
index<-sample(1:length(blogs_o),length(blogs_o)*t)
blogs_s<-data.frame(text=blogs_o[index],source=rep("blogs",length(index)))
index<-sample(1:length(news_o),length(news_o)*t)
news_s<-data.frame(text=news_o[index],source=rep("news",length(index)))
index<-sample(1:length(twitter_o),length(twitter_o)*t)
twitter_s<-data.frame(text=twitter_o[index],source=rep("twitter",
length(index)))
# a data frame with texts and sources.
text_s<-rbind(blogs_s,news_s,twitter_s)
text_s$text<-as.character(text_s$text)
save("text_s",file = "./data/text_s")
}
else{
load("./data/text_s")
}
}
# memory cleaning
rm (list=c("blogs_o","news_o","twitter_o","blogs_s","news_s","twitter_s",
"blogs_t","news_t","twitter_t","index"))
#----------------------- data corpus ---------------------------------------
if (!exists("text_c")){
if(!file.exists("./data/text_c")){
# splitting documents in sentences ( quanteda package).
text_sent<-unlist(tokenize(text_s$text,what="sentence"))
# corpus
text_c<-VCorpus(VectorSource(text_sent))
save("text_c",file = "./data/text_c")
}
else{
load("./data/text_c")
}
}
#----------------------- cleaning corpus ----------------
if (!exists("text_cl")){
if(!file.exists("./data/text_cl")){
# cleaning
text_cl<-clean_c(text_c)
save("text_cl",file = "./data/text_cl")
}
else{
load("./data/text_cl")
}
}
#--------running freq function-------------------------------------------------
# frequencies
if (!exists("freq")|!exists("freq_sw")){
if(!file.exists("./data/freq")){
c_sp<-sp(text_cl) # splitting large corpus
freq<-freq_f(c_sp) # frequencies
save("freq",file="./data/freq")
}
else{
load("./data/freq")
}
}
# exploring data
# after exploring data: removing tokens with frequency = 1
unigram<-filter(freq$freq_uni,freq > 1)
bigram<-filter(freq$freq_bi, freq > 1)
trigram<-filter(freq$freq_tri,freq > 1)
# adding cum data and percentages
unigram<-cum(unigram)
bigram<-cum(bigram)
trigram<-cum(trigram)
#---------------------barplots ------------------------------------------------
bar_uni<-bar(unigram[1:400,],Plot_1_:Top_400_unigrams)
bar_uni_flip<-bar_flip(unigram[1:30,],Plot_2_:Top_30_unigrams)
bar_bi<-bar(bigram[1:400,],Bigrams)
bar_bi_flip<-bar_flip(bigram[1:30,],Bigrams)
bar_tri<-bar(trigram[1:400,],Trigrams)
bar_tri_flip<-bar_flip(trigram[1:30,],Trigrams)
#---------------coverage percentage plot-----------
index<-sample(1:nrow(unigram),500)
line_cover1<-line_cover(unigram[index,],Unigrams)
index<-sample(1:nrow(bigram),500)
line_cover2<-line_cover(bigram[index,],Bigrams)
index<-sample(1:nrow(trigram),500)
line_cover3<-line_cover(trigram[index,],Trigrams)
## plotting unigrams
bar_uni
bar_uni_flip
wc(unigram,200,Plot_3_:The_first_200_unigrams)
## plotting bigrams and trigrams
grid.arrange(bar_bi,bar_tri,ncol=2,
top=textGrob("Plot 4:The first 400 tokens",
gp=gpar(fontsize=15,font=2)))
grid.arrange(bar_bi_flip,bar_tri_flip,ncol=2,
top=textGrob("Plot 5:The first 30 tokens",
gp=gpar(fontsize=15,font=2)))
wc(bigram,100,Plot_6_:The_first_100_bigrams)
wc(trigram,100,Plot_7_:The_first_100_trigrams)
# plotting coverage
grid.arrange(line_cover1,line_cover2,line_cover3,ncol=3,
top=textGrob("Plot 8: Coverage of the total text",
gp=gpar(fontsize=15,font=2)))
#---------------------data numbers------------------------------------
# Unigrams
# number of word types needed to cover 80 % of the tokens
unigram[findInterval(80,unigram$perc),6]
# number of word types needed to cover 90 % of the tokens
unigram[findInterval(90,unigram$perc),6]
# percentage of word types that are stop words
round(nrow(unigram[unigram$sw == "TRUE",])*100/nrow(unigram),1)
# Bigrams
# number of bigram types needed to cover 80 % of the tokens
bigram[findInterval(80,bigram$perc),8]
# percentage of bigram types containing a stop word
round(nrow(bigram[bigram$sw == "TRUE",])*100/nrow(bigram),1)
# Trigrams
# number of trigram types needed to cover 80 % of the tokens
trigram[findInterval(80,trigram$perc),9]
# percentage of trigram types containing a stop word
round(nrow(trigram[trigram$sw == "TRUE",])*100/nrow(trigram),1)
# functions.R
library(tm)
library(stringi)
library(dplyr)
library(RWeka)
library(ggplot2)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
#----------------------- defining cleaning functions ---------------------------
# 1 -removing anything other than English letters or space ( Zhao)
removeNumPunct<- function(x){
gsub("[^[:alpha:][:space:]]*","",x)
}
# 1.1 -removing spaces at the beginning or at the end of a sentence.
removeSpaces_LeadTail<-function(x){
gsub("^\\s+|\\s+$", "", x)
}
# 2 -removing URLs
removeURL <- function(x){
gsub("http[^[:space:]]*","",x)
}
# 3 -cleaning function for a vector
#first: loading and cleaning profanities
load("./data/profanities")
# data from https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
profanities_cl<-removeNumPunct(profanities)
profanities_cl<-tolower(profanities_cl)
clean_v<- function(v){
v<- removeNumPunct(v)
v<- tolower(v)
v<- removeURL(v)
v<- removeWords(v,profanities_cl)
v<- stripWhitespace(v)
v<- removeSpaces_LeadTail(v)
return(v)
}
# 4 -cleaning function for a corpus
clean_c<- function(c){
c<- tm_map(c,content_transformer(removeNumPunct))
c<- tm_map(c,content_transformer(tolower))
c<- tm_map(c,content_transformer(removeURL))
c<- tm_map(c,removeWords,profanities_cl)
c<- tm_map(c,stripWhitespace)
c<- tm_map(c,content_transformer(removeSpaces_LeadTail))
return(c)
}
#----------------freq functions-------------------------------------------------
# 5 -splitting a corpus into chunks (to work around memory problems)
# max number of documents in each chunk
max=1000
sp <- function(cl){
f<-ceiling(seq_along(cl)/max)
cs<-split(cl,f)
return(cs)
}
# 6 -Unigrams
dtm_uni<-function(cs){
tokenizer_1 <- function (x) NGramTokenizer(x, Weka_control(min=1, max = 1))
dtm_uni <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_1))
freq_uni <- sort(rowSums(as.matrix(dtm_uni)), decreasing = TRUE) # total freq
freq_uni <- data.frame(token = names(freq_uni), freq = freq_uni)# data frame
return(freq_uni)
}
# 7 -Bigrams
dtm_bi<- function(cs){
tokenizer_2 <- function (x) NGramTokenizer(x, Weka_control(min=2, max = 2))
dtm_bi <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_2))
freq_bi <- sort(rowSums(as.matrix(dtm_bi)), decreasing = TRUE) # total freq
freq_bi <- data.frame(token = names(freq_bi), freq = freq_bi) # data frame
return(freq_bi)
}
# 8 -Trigrams
dtm_tri<- function(cs){
tokenizer_3 <- function (x) NGramTokenizer(x, Weka_control(min=3, max = 3))
dtm_tri <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_3))
freq_tri <- sort(rowSums(as.matrix(dtm_tri)), decreasing = TRUE) # total freq
freq_tri <- data.frame(token = names(freq_tri), freq = freq_tri)#data frame
return(freq_tri)
}
# 9 -freq_f: applying the dtm functions to a split corpus cs
#first: cleaning stopwords
stopwords_cl<-clean_v(stopwords("english"))
freq_f<- function(cs){
# Unigrams
freq_uni<-lapply(cs,dtm_uni)
freq_uni<-do.call(rbind.data.frame, freq_uni)
freq_uni<-summarize(group_by(freq_uni,token),freq=sum(freq))
freq_uni<-arrange(freq_uni,desc(freq))
## stopwords
freq_uni$sw<-freq_uni$token %in% stopwords_cl
# Bigrams
freq_bi<-lapply(cs,dtm_bi)
freq_bi<-do.call(rbind.data.frame, freq_bi)
freq_bi<-summarize(group_by(freq_bi,token),freq=sum(freq))
freq_bi<-arrange(freq_bi,desc(freq))
## the two words in the bigrams
n<-nrow(freq_bi)
w<-matrix(nrow=n,ncol=2)
colnames(w)=c("w1","w2")
for (i in 1:n){
w[i,]=unlist(strsplit(as.character(freq_bi$token[i])," "))
}
wdf<-as.data.frame(w)
freq_bi<-cbind(freq_bi,wdf)
freq_bi<-tbl_df(freq_bi)
## stopwords
freq_bi$sw<-freq_bi$w1 %in% stopwords_cl | freq_bi$w2 %in% stopwords_cl
# Trigrams
freq_tri<-lapply(cs,dtm_tri)
freq_tri<-do.call(rbind.data.frame, freq_tri)
freq_tri<-summarize(group_by(freq_tri,token),freq=sum(freq))
freq_tri<-arrange(freq_tri,desc(freq))
## the three words in the trigrams
n<-nrow(freq_tri)
w<-matrix(nrow=n,ncol=3)
colnames(w)=c("w1","w2","w3")
for (i in 1:n){
w[i,]=unlist(strsplit(as.character(freq_tri$token[i])," "))
}
wdf<-as.data.frame(w)
freq_tri<-cbind(freq_tri,wdf)
freq_tri<-tbl_df(freq_tri)
## stopwords
freq_tri$sw<-freq_tri$w1 %in% stopwords_cl | freq_tri$w2 %in% stopwords_cl |
freq_tri$w3 %in% stopwords_cl
res<-list(freq_uni=freq_uni,freq_bi=freq_bi,freq_tri=freq_tri)
return(res)
}
# 10 - cum frequencies and percentages
# adds to freq: cum_freq, perc and voc (number of token types)
cum<-function(freq){
  tot<-sum(freq$freq)
  freq$cum_freq<-cumsum(freq$freq)            # cumulative frequencies
  freq$perc<-round(freq$cum_freq*100/tot,4)   # cumulative percentage of total tokens
  freq$voc<-1:nrow(freq)                      # running count of token types
  return(freq)
}
#----------------plot functions-------------------------------------------------
# 11 - word cloud
wc <- function(freq,max,tit){
layout(matrix(c(1,2),nrow=2),heights = c(1,10))
par(mar=rep(0,4));plot.new()
tit<-gsub("_"," ",deparse(substitute(tit)))
text(x=0.5,y=0.5,tit)
wordcloud(words=freq$token, freq = freq$freq,
          min.freq=1, max.words = max,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2")) # the title is drawn by text() above
}
# 12 - bar plots with flip
bar_flip<-function(freq,tit){
arrange_all<-arrange(freq,freq)
freq$token<-factor(freq$token,levels=arrange_all$token)
freq<-arrange(freq,desc(freq))
tit<-gsub("_"," ",deparse(substitute(tit)))
bar_plot<-ggplot(freq,aes(x=token,y=freq,fill=sw))+
geom_bar(stat="identity")+coord_flip()+
ylab("frecuency")+xlab("Token")+
ggtitle(tit) +
theme(plot.title = element_text(lineheight=.7, face="bold"))+
theme(legend.title=element_blank())+
scale_fill_discrete(breaks=c("TRUE", "FALSE"),
labels=c("stop word", "normal word"))
return(bar_plot)
}
# 13 - bar plots
bar<-function(freq,tit){
arrange_all<-arrange(freq,freq)
freq$token<-factor(freq$token,levels=arrange_all$token)
freq<-arrange(freq,desc(freq))
tit<-gsub("_"," ",deparse(substitute(tit)))
bar_plot<-ggplot(freq,aes(x=voc,y=freq))+
geom_bar(stat="identity")+
ylab("frecuency")+xlab("Token")+
ggtitle(tit) +
theme(plot.title = element_text(lineheight=.7, face="bold"))
return(bar_plot)
}
# 14 - coverage line plot
line_cover<-function(freq,tit){
line_plot<-ggplot(freq,aes(x=voc,y=perc))+geom_line()+
ylab("cum percentage")+xlab("Token type number")+
ggtitle(deparse(substitute(tit))) +
theme(plot.title = element_text(lineheight=.7, face="bold"))
return(line_plot)
}