This is the initial report of a “predictive text model” project. The goal of the project is to build a data product that predicts the next word of a sentence typed by the user. This report addresses the requirements of the Milestone Report of the Capstone Project course in the Coursera Data Science Specialization.
A predictive text model belongs to the field of Natural Language Processing (NLP). This is a new field for us, but we will apply the same approach used throughout the Data Science Specialization: getting, cleaning and exploring the data in a first stage (here), and later selecting and building a prediction algorithm to be applied in the model.
The data set used in the project can be downloaded from the link given in the appendix. The data comes from a corpus named HC Corpora.
We downloaded the zip file, unzipped it and loaded the three American English data files into R. The text files contain documents, one document per line, taken from blogs, news sites and Twitter. We considered loading only a subset of lines, and in fact the next step is a sampling process; in the end, however, for greater flexibility and because the initial load was fast enough, the code loads the full files directly.
The code for this step, as for the whole project, is in the Appendix.
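As a minimal sketch of the loading step (assuming, as in the appendix, that the zip file has been unzipped into ./data/final/en_US/), the three files can be read line by line with the readr package; read_lines() also accepts an n_max argument if only a subset of lines were wanted:

library(readr)

# assumed location of the unzipped American English files (see the appendix)
blogs_o   <- read_lines("./data/final/en_US/en_US.blogs.txt")
news_o    <- read_lines("./data/final/en_US/en_US.news.txt")
twitter_o <- read_lines("./data/final/en_US/en_US.twitter.txt")
# read_lines(file, n_max = 10000) would load only the first 10000 lines instead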
A summary of the three files is displayed in the following table:
We can see the following features:
The files are very large, which caused many memory problems with the “tm” package and is the reason for working with a data sample. The total number of words is similar across the three text types.
The longest documents are the blogs, followed by the news; the Twitter documents are the shortest, reflecting Twitter’s character limit. Furthermore, the mean number of words per document has, in Twitter’s case, the smallest standard deviation (6.9), which points to intensive use of that limit.
The blogs have the largest number of words per document, but also the largest deviation: blog lengths vary widely.
Finally, we can observe that the news have the largest number of words per sentence.
Given these features, we decided to extract a random sample of the data, with a sample size of 5 % of the documents of each file, the same proportion for all three, given their similar word counts.
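A minimal sketch of the sampling step (mirroring the appendix; a small helper, sample_source(), not in the original code, keeps it compact):

set.seed(123)   # reproducible sample
t <- 0.05       # 5 % of the documents of each source

# helper (not in the appendix): draw a t-fraction of the documents of one source
sample_source <- function(x, source, t) {
  index <- sample(seq_along(x), round(length(x) * t))
  data.frame(text = x[index], source = source, stringsAsFactors = FALSE)
}
# one data frame with the sampled texts and their sources
text_s <- rbind(sample_source(blogs_o,   "blogs",   t),
                sample_source(news_o,    "news",    t),
                sample_source(twitter_o, "twitter", t))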
Additionally, there are two more data sources whose use is explained in the data cleaning section: first, the English stop word dictionary (174 entries) included in the “tm” package in R, and second, a profanity dictionary (377 entries) loaded from the link given in the appendix.
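Both dictionaries can be loaded as plain character vectors; a sketch, using tm’s built-in stop word list and the profanity list from the repository referenced in the appendix:

library(tm)      # provides stopwords()
library(readr)

stop_words  <- stopwords("english")   # the 174 English stop words shipped with tm
profanities <- read_lines(paste0(
  "https://raw.githubusercontent.com/LDNOOBW/",
  "List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"))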
The data cleaning process was developed with tools from the “tm” and “quanteda” packages in R. The word transformations have to be chosen according to the specific goal of the project, selecting the next word in a sentence; different goals would imply different transformations.
We want to predict the next word of an incomplete sentence written by the user, so the basic unit of our data set should be the sentence, not the document. Loading the data into R gives us three text objects in which each line is a document. We must therefore split each document into its sentences; for this we used a function from the “quanteda” package.
A corpus is a data class that, in the NLP field, makes it easy to work with text data, basically for cleaning it and for creating a Document Term Matrix that counts n-grams, as we will see later.
A question here is: will it be necessary to filter the corpus by data source? When a user writes a sentence and waits for the three or four predicted words, we do not know whether he is a Twitter user, a blogger or a journalist. He could be anybody, and we cannot apply a different model to each one. So we will generalize and treat the three files as a single set: the corpus combines the three files into one.
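A sketch of these two steps as they appear in the appendix: split the sampled texts into sentences with quanteda and build a single tm corpus with one sentence per document (tokenize() with what = "sentence" matches the quanteda version used here; more recent versions expose the same functionality under other names).

library(quanteda)
library(tm)

# one element per sentence, regardless of the source (blogs, news or twitter)
text_sent <- unlist(tokenize(text_s$text, what = "sentence"))
# a single corpus: each sentence becomes a document
text_c <- VCorpus(VectorSource(text_sent))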
We will transform the raw texts into a regularized word set that helps to optimize the language models built on them. The decisions about what to do and what not to do are the following:
In conclusion, the cleaning function will apply only the first four types of transformations.
Profanity, as the Merriam-Webster dictionary says, is “an offensive word”, or “offensive language” (Wikipedia). In order to avoid predicting this kind of word, profanities are removed from the corpus.
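A condensed sketch of the cleaning pipeline (mirroring the clean_c() function in the appendix, which additionally trims leading and trailing spaces; profanities is the profanity vector loaded above):

library(tm)

removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)  # keep letters and spaces
removeURL      <- function(x) gsub("http[^[:space:]]*", "", x)       # drop URLs
profanities_cl <- tolower(removeNumPunct(profanities))               # normalized profanity list

text_cl <- text_c
text_cl <- tm_map(text_cl, content_transformer(removeNumPunct))
text_cl <- tm_map(text_cl, content_transformer(tolower))
text_cl <- tm_map(text_cl, content_transformer(removeURL))
text_cl <- tm_map(text_cl, removeWords, profanities_cl)   # profanity filtering
text_cl <- tm_map(text_cl, stripWhitespace)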
First, the raw text of a document from the sample:
## [1] "Today<U+0092>s TNT (This-N-That) post from Club Creative Studio is about technique in your creative process. I often find myself playing <U+0093>the waiting game<U+0094>. You know what this means to an individual on a daily basis. We wait in line, we wait for the mail to come, we wait in traffic, we wait for the dryer to signal the clothes are dry. We wait for the text response, we wait for something to download, we WAIT, WAIT, WAIT!"
Second, the raw text divided into sentences:
## $`53`
## [1] "Today<U+0092>s TNT (This-N-That) post from Club Creative Studio is about technique in your creative process."
##
## $`54`
## [1] "I often find myself playing <U+0093>the waiting game<U+0094>."
##
## $`55`
## [1] "You know what this means to an individual on a daily basis."
##
## $`56`
## [1] "We wait in line, we wait for the mail to come, we wait in traffic, we wait for the dryer to signal the clothes are dry."
##
## $`57`
## [1] "We wait for the text response, we wait for something to download, we WAIT, WAIT, WAIT!"
And finally, the cleaned sentences:
## $`53`
## [1] "todays tnt thisnthat post from club creative studio is about technique in your creative process"
##
## $`54`
## [1] "i often find myself playing the waiting game"
##
## $`55`
## [1] "you know what this means to an individual on a daily basis"
##
## $`56`
## [1] "we wait in line we wait for the mail to come we wait in traffic we wait for the dryer to signal the clothes are dry"
##
## $`57`
## [1] "we wait for the text response we wait for something to download we wait wait wait"
The exploratory analysis studies the distribution of the words and the relationships between them. We need a transformation that lets us treat the text strings statistically: the N-gram model. We will work with unigrams, bigrams and trigrams, i.e. sequences of one, two and three words respectively.
To build the N-grams, we used the “RWeka” and “tm” packages. The memory problems we had with RWeka were significant: the code only worked after splitting the corpus into groups of a thousand documents (or fewer), running the tokenizer on each group and summing the partial results (see the Appendix).
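A sketch of that chunked approach (mirroring the appendix): the cleaned corpus is split into blocks of at most 1000 documents, each block is tokenized with RWeka, and the partial counts are summed.

library(tm)
library(RWeka)
library(dplyr)

# split the cleaned corpus into chunks of at most 1000 documents
chunks <- split(text_cl, ceiling(seq_along(text_cl) / 1000))

# bigram counts for a single chunk (the unigram and trigram cases only differ
# in the min/max values of the Weka control)
bigram_counts <- function(chunk) {
  tok <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(chunk, control = list(wordLengths = c(1, Inf),
                                                  tokenize = tok))
  counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  data.frame(token = names(counts), freq = counts, stringsAsFactors = FALSE)
}

# apply the tokenizer chunk by chunk and sum the partial frequencies
freq_bi <- do.call(rbind, lapply(chunks, bigram_counts)) %>%
  group_by(token) %>%
  summarize(freq = sum(freq)) %>%
  arrange(desc(freq))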
An initial visual review showed, in the set of tokens with frequency = 1, a large number of strange words, typos and meaningless expressions. We decided to remove them from the three N-gram sets.
Plot 1 shows the distribution of the words (unigrams). To keep the plot readable, the x-axis shows only the first 400 words, identified by number and sorted by frequency.
We can see that the distribution is strongly concentrated, with almost all the mass in the leftmost, most frequent words.
Plot 2 and Plot 3 are magnified views of Plot 1 that let us see the specific words. Plot 2 shows the top 30 words ordered by frequency; Plot 3, in “wordcloud” format, shows the top 200.
All of the top 30 are stop words. To see words with a specific meaning, we would have to lower the frequency threshold in the bar plot or look at the smaller words in the wordcloud.
We can also analyze the coverage index of the language, i.e. the cumulative percentage of the total tokens covered by a given number of word types. It is shown in the left panel of Plot 6 in the next section. A relatively small number of words accounts for a high percentage of the total: only 2119 word types cover 80 % of the total tokens, and 6333 cover 90 %. This shows, from another point of view, the high concentration of the unigram distribution.
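A sketch of how these coverage numbers can be obtained from the unigram frequency table (a data frame with a freq column sorted in decreasing order, as built in the appendix):

# cumulative percentage of the total tokens covered by the first i word types
coverage <- cumsum(unigram$freq) * 100 / sum(unigram$freq)

# number of word types needed to cover 80 % and 90 % of the tokens
which(coverage >= 80)[1]
which(coverage >= 90)[1]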
Bigrams and trigrams are the most important results for the goal of our project: predicting the next word. We need not only unigrams but ordered sequences of words and their frequencies. To predict the word that follows the ones already typed, it helps to know how often each candidate appears after the last word typed (bigrams) and, even better, after the last two words typed (trigrams). This is the information that the bigram and trigram distributions give us.
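As an illustration of how the trigram table could later be used for prediction (a hypothetical helper, not part of the milestone code): given the last two words typed, we would look up the most frequent trigrams that start with them and return their third words.

library(dplyr)

# hypothetical lookup; freq_tri has the columns w1, w2, w3 and freq (see the appendix)
predict_next <- function(freq_tri, last_two, n = 3) {
  freq_tri %>%
    filter(w1 == last_two[1], w2 == last_two[2]) %>%
    arrange(desc(freq)) %>%
    head(n) %>%
    pull(w3) %>%
    as.character()
}
# usage: predict_next(freq$freq_tri, c("one", "of")) returns the three most
# frequent third words observed after "one of"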
Plot 4 shows the distribution of the bigrams and trigrams. They receive the same treatment as the unigrams: the first 400 tokens, identified by number.
The frequencies are in general lower than those of the unigrams in Plot 1. The distribution is therefore less concentrated, reflecting the huge number of options the language offers for building phrases and sentences.
The next plot, Plot 5, displays the most frequent bigrams and trigrams. They are among the most commonly used phrases in English.
We can see that, as with the unigrams, every one of the top 30 tokens contains a stop word; overall, 68.4 % of the bigrams and 96.1 % of the trigrams contain at least one stop word. It is therefore more than likely that one of these words will appear in the set of predicted words.
Finally, as we already did for the unigrams, we can look at the coverage index. It is displayed in the two panels on the right of Plot 6.
Bigrams and trigrams show coverage curves closer to the straight line than that of the unigrams. This is due to the greater dispersion of their distributions: covering 80 % of the total bigrams requires 90475 bigram types; for the trigrams, the number is 176987.
# milestone report.R
#
# locate this file in the working directory with functions.R
## notation: files saved in ./data and the R objects they contain
# text_o     : original texts as R vectors: blogs_o, news_o, twitter_o
# sum_table  : summary table of the data: sum_table
# text_s     : sample of the original data: text_s
# text_c     : corpus built from text_s: text_c
# text_cl    : cleaned corpus (stop words kept): text_cl
# freq_types : per-source frequencies: freq_blogs, freq_news, freq_twitter
# freq       : token frequencies (stop words kept): freq
#
#------------------ open libraries----------------------------------------------
library(readr)
library(tm)
library(quanteda)
library(stringi)
library(dplyr)
library(RWeka)
library(ggplot2)
library(grid)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
source("functions.R")
#----------------- Loading the data---------------------------------------------
# data folder
if (!file.exists("./data")) { # data folder
dir.create("data")
}
# paths and file names
url<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/
Coursera-SwiftKey.zip"
zipfile <- "./data/Coursera-SwiftKey.zip"
blogs_t<-"./data/final/en_US/en_US.blogs.txt"
news_t<-"./data/final/en_US/en_US.news.txt"
twitter_t<-"./data/final/en_US/en_US.twitter.txt"
# downloading the zip data file
if (!file.exists(zipfile)){
download.file(url,zipfile)
write.csv(date(),"./data/date_download.txt")
}
# unzipping the data file
if (!file.exists(blogs_t) | !file.exists(news_t) |!file.exists(twitter_t)){
unzip(zipfile,exdir = "./data")
}
#-------------- loading files in R------------------------------------------
if (!exists("blogs_o") | !exists("news_o") | !exists("twitter_o")){
if(!file.exists("./data/text_o")){
blogs_o<-read_lines(blogs_t)
news_o<-read_lines(news_t)
twitter_o<-read_lines(twitter_t)
save(list=c("blogs_o","news_o","twitter_o"),file = "./data/text_o")
}
else{
load("./data/text_o")
}
}
#----------------------- basic summaries---------------------------------------
if (!exists("sum_table")){
if(!file.exists("./data/sum_table")){
blogs_w<-stri_count_words(blogs_o)
news_w<-stri_count_words(news_o)
twitter_w<-stri_count_words(twitter_o)
blogs_st<-stri_count_boundaries(blogs_o,type="sentence")
news_st<-stri_count_boundaries(news_o,type="sentence")
twitter_st<-stri_count_boundaries(twitter_o,type="sentence")
file_size_Mb<-round(c(file.info(blogs_t)$size/1048576,
file.info(news_t)$size/1048576,
file.info(twitter_t)$size/1048576),2)
sum_table<-data.frame(file=c("blogs","news","twitter"),size_Mb=file_size_Mb,
docs= c(length(blogs_o),length(news_o),length(twitter_o)),
words=c(sum(blogs_w),sum(news_w),sum(twitter_w)),
words_mean=round(c(mean(blogs_w),mean(news_w),
mean(twitter_w)),2),
words_sd=round(c(sd(blogs_w),sd(news_w),sd(twitter_w)),2),
sentences=c(sum(blogs_st),sum(news_st),sum(twitter_st)),
sentences_mean=round(c(mean(blogs_st),mean(news_st),
mean(twitter_st)),2),
words_by_sent=round(c(sum(blogs_w)/sum(blogs_st),
sum(news_w)/sum(news_st),
sum(twitter_w)/sum(twitter_st)),2))
sum_table<-t(as.matrix(sum_table))
save("sum_table",file = "./data/sum_table")
}
else{
load("./data/sum_table")
}
}
rm("blogs_w","news_w","twitter_w","blogs_st","news_st","twitter_st")
#----------------------- data sampling ---------------------------------------
set.seed(123)
t=0.05
if (!exists("text_s")){
if(!file.exists("./data/text_s")){
index<-sample(1:length(blogs_o),length(blogs_o)*t)
blogs_s<-data.frame(text=blogs_o[index],source=rep("blogs",length(index)))
index<-sample(1:length(news_o),length(news_o)*t)
news_s<-data.frame(text=news_o[index],source=rep("news",length(index)))
index<-sample(1:length(twitter_o),length(twitter_o)*t)
twitter_s<-data.frame(text=twitter_o[index],source=rep("twitter",
length(index)))
# a data frame with texts and sources.
text_s<-rbind(blogs_s,news_s,twitter_s)
text_s$text<-as.character(text_s$text)
save("text_s",file = "./data/text_s")
}
else{
load("./data/text_s")
}
}
# memory cleaning
rm (list=c("blogs_o","news_o","twitter_o","blogs_s","news_s","twitter_s",
"blogs_t","news_t","twitter_t","index"))
#----------------------- data corpus ---------------------------------------
if (!exists("text_c")){
if(!file.exists("./data/text_c")){
# splitting documents in sentences ( quanteda package).
text_sent<-unlist(tokenize(text_s$text,what="sentence"))
# corpus
text_c<-VCorpus(VectorSource(text_sent))
save("text_c",file = "./data/text_c")
}
else{
load("./data/text_c")
}
}
#----------------------- cleaning corpus ----------------
if (!exists("text_cl")){
if(!file.exists("./data/text_cl")){
# cleaning
text_cl<-clean_c(text_c)
save("text_cl",file = "./data/text_cl")
}
else{
load("./data/text_cl")
}
}
#--------running freq function-------------------------------------------------
# frequencies
if (!exists("freq")|!exists("freq_sw")){
if(!file.exists("./data/freq")){
c_sp<-sp(text_cl) # splitting large corpus
freq<-freq_f(c_sp) # frequencies
save("freq",file="./data/freq")
}
else{
load("./data/freq")
}
}
# exploring data
# after exploring data: removing tokens with frequency = 1
unigram<-filter(freq$freq_uni,freq > 1)
bigram<-filter(freq$freq_bi, freq > 1)
trigram<-filter(freq$freq_tri,freq > 1)
# adding cum data and percentages
unigram<-cum(unigram)
bigram<-cum(bigram)
trigram<-cum(trigram)
#---------------------barplots ------------------------------------------------
bar_uni<-bar(unigram[1:400,],Plot_1_:Top_400_unigrams)
bar_uni_flip<-bar_flip(unigram[1:30,],Plot_2_:Top_30_unigrams)
bar_bi<-bar(bigram[1:400,],Bigrams)
bar_bi_flip<-bar_flip(bigram[1:30,],Bigrams)
bar_tri<-bar(trigram[1:400,],Trigrams)
bar_tri_flip<-bar_flip(trigram[1:30,],Trigrams)
#---------------coverage percentage plot-----------
index<-sample(1:nrow(unigram),500)
line_cover1<-line_cover(unigram[index,],Unigrams)
index<-sample(1:nrow(bigram),500)
line_cover2<-line_cover(bigram[index,],Bigrams)
index<-sample(1:nrow(trigram),500)
line_cover3<-line_cover(trigram[index,],Trigrams)
## plotting unigrams
bar_uni
bar_uni_flip
wc(unigram,200,Plot_3_:The_first_200_unigrams)
## plotting bigrams and trigrams
grid.arrange(bar_bi,bar_tri,ncol=2,
top=textGrob("Plot 4:The first 400 tokens",
gp=gpar(fontsize=15,font=2)))
grid.arrange(bar_bi_flip,bar_tri_flip,ncol=2,
top=textGrob("Plot 5:The first 30 tokens",
gp=gpar(fontsize=15,font=2)))
wc(bigram,100,Plot_6_:The_first_100_bigrams)
wc(trigram,100,Plot_7_:The_first_100_trigrams)
# plotting coverage
grid.arrange(line_cover1,line_cover2,line_cover3,ncol=3,
top=textGrob("Plot 8: Coverage of the total text",
gp=gpar(fontsize=15,font=2)))
#---------------------data numbers------------------------------------
# Unigrams
# number of word types needed to cover 80 % of the tokens
unigram[findInterval(80,unigram$perc),6]
# number of word types needed to cover 90 % of the tokens
unigram[findInterval(90,unigram$perc),6]
# percentage of word types that are stop words
round(nrow(unigram[unigram$sw == "TRUE",])*100/nrow(unigram),1)
# Bigrams
# number of bigram types needed to cover 80 % of the tokens
bigram[findInterval(80,bigram$perc),8]
# percentage of bigram types containing a stop word
round(nrow(bigram[bigram$sw == "TRUE",])*100/nrow(bigram),1)
# Trigrams
# number of trigram types needed to cover 80 % of the tokens
trigram[findInterval(80,trigram$perc),9]
# percentage of trigram types containing a stop word
round(nrow(trigram[trigram$sw == "TRUE",])*100/nrow(trigram),1)
# functions.R
library(tm)
library(stringi)
library(dplyr)
library(RWeka)
library(ggplot2)
library(gridExtra)
library(wordcloud)
library(RColorBrewer)
#----------------------- defining cleaning functions ---------------------------
# 1 -removing anything other than English letters or space ( Zhao)
removeNumPunct<- function(x){
gsub("[^[:alpha:][:space:]]*","",x)
}
# 1.1 -removing spaces at the beginning or at the end of a sentence.
removeSpaces_LeadTail<-function(x){
gsub("^\\s+|\\s+$", "", x)
}
# 2 -removing URLs
removeURL <- function(x){
gsub("http[^[:space:]]*","",x)
}
# 3 -cleaning function for a vector
#first: loading and cleaning profanities
load("./data/profanities")
# data from https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en
profanities_cl<-removeNumPunct(profanities)
profanities_cl<-tolower(profanities_cl)
clean_v<- function(v){
v<- removeNumPunct(v)
v<- tolower(v)
v<- removeURL(v)
v<- removeWords(v,profanities_cl)
v<- stripWhitespace(v)
v<- removeSpaces_LeadTail(v)
return(v)
}
# 4 -cleaning function for a corpus
clean_c<- function(c){
c<- tm_map(c,content_transformer(removeNumPunct))
c<- tm_map(c,content_transformer(tolower))
c<- tm_map(c,content_transformer(removeURL))
c<- tm_map(c,removeWords,profanities_cl)
c<- tm_map(c,stripWhitespace)
c<- tm_map(c,content_transformer(removeSpaces_LeadTail))
return(c)
}
#----------------freq functions-------------------------------------------------
# 5 -splitting a corpus into chunks (to work around memory problems)
# max number of documents in each chunk
max=1000
sp <- function(cl){
f<-ceiling(seq_along(cl)/max)
cs<-split(cl,f)
return(cs)
}
# 6 -Unigrams
dtm_uni<-function(cs){
tokenizer_1 <- function (x) NGramTokenizer(x, Weka_control(min=1, max = 1))
dtm_uni <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_1))
freq_uni <- sort(rowSums(as.matrix(dtm_uni)), decreasing = TRUE) # total freq
freq_uni <- data.frame(token = names(freq_uni), freq = freq_uni)# data frame
return(freq_uni)
}
# 7 -Bigrams
dtm_bi<- function(cs){
tokenizer_2 <- function (x) NGramTokenizer(x, Weka_control(min=2, max = 2))
dtm_bi <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_2))
freq_bi <- sort(rowSums(as.matrix(dtm_bi)), decreasing = TRUE) # total freq
freq_bi <- data.frame(token = names(freq_bi), freq = freq_bi) # data frame
return(freq_bi)
}
# 8 -Trigrams
dtm_tri<- function(cs){
tokenizer_3 <- function (x) NGramTokenizer(x, Weka_control(min=3, max = 3))
dtm_tri <- TermDocumentMatrix(cs, control = list(wordLengths=c(1,Inf),
tokenize =tokenizer_3))
freq_tri <- sort(rowSums(as.matrix(dtm_tri)), decreasing = TRUE) # total freq
freq_tri <- data.frame(token = names(freq_tri), freq = freq_tri)#data frame
return(freq_tri)
}
# 9 -freq_f: applying the dtm functions to a split corpus cs
#first: cleaning stopwords
stopwords_cl<-clean_v(stopwords("english"))
freq_f<- function(cs){
# Unigrams
freq_uni<-lapply(cs,dtm_uni)
freq_uni<-do.call(rbind.data.frame, freq_uni)
freq_uni<-summarize(group_by(freq_uni,token),freq=sum(freq))
freq_uni<-arrange(freq_uni,desc(freq))
## stopwords
freq_uni$sw<-freq_uni$token %in% stopwords_cl
# Bigrams
freq_bi<-lapply(cs,dtm_bi)
freq_bi<-do.call(rbind.data.frame, freq_bi)
freq_bi<-summarize(group_by(freq_bi,token),freq=sum(freq))
freq_bi<-arrange(freq_bi,desc(freq))
## the two words in the bigrams
n<-nrow(freq_bi)
w<-matrix(nrow=n,ncol=2)
colnames(w)=c("w1","w2")
for (i in 1:n){
w[i,]=unlist(strsplit(as.character(freq_bi$token[i])," "))
}
wdf<-as.data.frame(w)
freq_bi<-cbind(freq_bi,wdf)
freq_bi<-tbl_df(freq_bi)
## stopwords
freq_bi$sw<-freq_bi$w1 %in% stopwords_cl | freq_bi$w2 %in% stopwords_cl
# Trigrams
freq_tri<-lapply(cs,dtm_tri)
freq_tri<-do.call(rbind.data.frame, freq_tri)
freq_tri<-summarize(group_by(freq_tri,token),freq=sum(freq))
freq_tri<-arrange(freq_tri,desc(freq))
## the three words in the trigrams
n<-nrow(freq_tri)
w<-matrix(nrow=n,ncol=3)
colnames(w)=c("w1","w2","w3")
for (i in 1:n){
w[i,]=unlist(strsplit(as.character(freq_tri$token[i])," "))
}
wdf<-as.data.frame(w)
freq_tri<-cbind(freq_tri,wdf)
freq_tri<-tbl_df(freq_tri)
## stopwords
freq_tri$sw<-freq_tri$w1 %in% stopwords_cl | freq_tri$w2 %in% stopwords_cl |
freq_tri$w3 %in% stopwords_cl
res<-list(freq_uni=freq_uni,freq_bi=freq_bi,freq_tri=freq_tri)
return(res)
}
# 10 - cum frequencies and percentages
# adds to freq: cum_freq, perc and voc (number of token types)
cum<-function(freq){
  tot<-sum(freq$freq)
  freq$cum_freq<-cumsum(freq$freq)            # cumulative frequencies
  freq$perc<-round(freq$cum_freq*100/tot,4)   # cumulative percentage of total tokens
  freq$voc<-1:nrow(freq)                      # running count of token types
  return(freq)
}
#----------------plot functions-------------------------------------------------
# 11 - word cloud
wc <- function(freq,max,tit){
layout(matrix(c(1,2),nrow=2),heights = c(1,10))
par(mar=rep(0,4));plot.new()
tit<-gsub("_"," ",deparse(substitute(tit)))
text(x=0.5,y=0.5,tit)
wordcloud(words=freq$token, freq = freq$freq,
          min.freq=1, max.words = max,
          random.order = FALSE,
          rot.per = 0.35,
          colors = brewer.pal(8, "Dark2")) # the title is drawn by text() above
}
# 12 - bar plots with flip
bar_flip<-function(freq,tit){
arrange_all<-arrange(freq,freq)
freq$token<-factor(freq$token,levels=arrange_all$token)
freq<-arrange(freq,desc(freq))
tit<-gsub("_"," ",deparse(substitute(tit)))
bar_plot<-ggplot(freq,aes(x=token,y=freq,fill=sw))+
geom_bar(stat="identity")+coord_flip()+
ylab("frecuency")+xlab("Token")+
ggtitle(tit) +
theme(plot.title = element_text(lineheight=.7, face="bold"))+
theme(legend.title=element_blank())+
scale_fill_discrete(breaks=c("TRUE", "FALSE"),
labels=c("stop word", "normal word"))
return(bar_plot)
}
# 13 - bar plots
bar<-function(freq,tit){
arrange_all<-arrange(freq,freq)
freq$token<-factor(freq$token,levels=arrange_all$token)
freq<-arrange(freq,desc(freq))
tit<-gsub("_"," ",deparse(substitute(tit)))
bar_plot<-ggplot(freq,aes(x=voc,y=freq))+
geom_bar(stat="identity")+
ylab("frecuency")+xlab("Token")+
ggtitle(tit) +
theme(plot.title = element_text(lineheight=.7, face="bold"))
return(bar_plot)
}
# 14 - coverage line plot
line_cover<-function(freq,tit){
line_plot<-ggplot(freq,aes(x=voc,y=perc))+geom_line()+
ylab("cum percentage")+xlab("Token type number")+
ggtitle(deparse(substitute(tit))) +
theme(plot.title = element_text(lineheight=.7, face="bold"))
return(line_plot)
}