The main idea behind text prediction is estimating the next character or word given a string of input history. This can help correct mistyped words and suggest which word should come next.
Over the past decade, there has been a dramatic increase in the use of electronic devices for email, social networking and other activities. Typing errors on such devices are far from uncommon and can have considerable implications for how efficiently these devices can be used for communication.
The objective of this project is to develop a text prediction algorithm derived from large data sets composed of different source materials, such as blog, Twitter and news data.
To start, the main technique used is the n-gram approach, where an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on. These larger sizes are not used in this project.
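For illustration, the RWeka tokenizer used later in this report can extract such n-grams from a short, made-up sentence (the example sentence is arbitrary):
library(RWeka)
sentence <- "thanks for the follow"
NGramTokenizer(sentence, Weka_control(min = 2, max = 2)) # bigrams:  "thanks for" "for the" "the follow"
NGramTokenizer(sentence, Weka_control(min = 3, max = 3)) # trigrams: "thanks for the" "for the follow"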
The data was obtained from HC Corpora (www.corpora.heliohost.org). The chosen language was English.
Obtaining the data:
library(tm)        # text mining framework (corpus, tm_map, term-document matrices)
library(RWekajars) # Java dependencies for RWeka
library(RWeka)     # NGramTokenizer and Weka_control
library(dplyr)     # data manipulation
library(magrittr)  # pipe operator
library(ggplot2)   # plotting
library(stringi)   # fast string operations (stri_count_words)
setwd("~/Projetos Analytics/ESTUDOS/R/CAPSTONE/Coursera-SwiftKey/final/en_US")
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
Now, we do a descriptive analysis to understand what we have.
#Number of entries
length(blogs) #899,288
## [1] 899288
length(news) #77,259
## [1] 77259
length(twitter) #2,360,148
## [1] 2360148
#Summarize the number of characters per entry for each source
summary(nchar(blogs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
summary(nchar(news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 111.0 186.0 202.4 270.0 5760.0
summary(nchar(twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
#Count the number of words per entry and summarize
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
summary(words_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
summary(words_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
summary(words_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
Tasks to accomplish:
Identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
Text Preprocessing - Tokenization and data cleaning
tokenizator <- function(x) {
  corpus <- Corpus(VectorSource(x))                            # make a corpus object
  corpus <- tm_map(corpus, content_transformer(tolower))       # make everything lowercase
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove English stop words
  corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                    # get rid of extra spaces
  corpus <- tm_map(corpus, PlainTextDocument)                  # make sure all data is a PlainTextDocument
  corpus                                                       # return the cleaned corpus
}
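Note that the *_sample objects used below are not created by the code above; here is a minimal sketch of the missing sampling step, assuming a random sample of 10,000 lines per source (consistent with the 10,000-document corpus printed later) and an arbitrary seed:
set.seed(1234)       # arbitrary seed, for reproducibility only
sample_size <- 10000 # assumed sample size per source
blogs_sample   <- sample(blogs, sample_size)
news_sample    <- sample(news, sample_size)
twitter_sample <- sample(twitter, sample_size)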
blog_token <- tokenizator(blogs_sample)
twitter_token <- tokenizator(twitter_sample)
news_token <- tokenizator(news_sample)
tdm <- TermDocumentMatrix(twitter_token) # terms as rows, documents as columns
dtm <- DocumentTermMatrix(twitter_token) # documents as rows, terms as columns
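As a quick sanity check of the matrices, the most frequent single terms in the Twitter sample can be listed; the frequency threshold of 50 below is arbitrary:
findFreqTerms(dtm, lowfreq = 50) # unigrams appearing at least 50 times in the sample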
Now, let's do an exploratory analysis considering groups of three words (n-grams with n = 3, i.e., trigrams). The data used is from Twitter.
ngram = 3
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
tdm_grams <- TermDocumentMatrix(twitter_token, control = list(tokenize = ngramTokenizer))
Showing the popular trigrams (those occurring at least five times in the sample)
popularNgrams <- findFreqTerms(tdm_grams,lowfreq=5)
popularNgrams
## [1] "follow follow follow" "happy mothers day"
## [3] "happy new year" "happy th birthday"
## [5] "let us know" "looking forward seeing"
Plotting the most frequent trigrams
ngramsFrequency <- rowSums(as.matrix(tdm_grams[popularNgrams,]))
print(qplot(names(ngramsFrequency), ngramsFrequency) +
        coord_flip() +
        geom_bar(stat = "identity") +
        ggtitle("N-grams Frequency\nTwitter Sample\n") +
        xlab("N-grams") +   # x holds the trigram labels, shown on the vertical axis after coord_flip()
        ylab("Frequency") + # y holds the counts
        theme(plot.title = element_text(lineheight = .8, face = "bold")))
Removing profanity and other words you do not want to predict.
#Loading a profanity list
profanity <- readLines("swearWords.txt")
profanity
## [1] "anal" "anus" "arse" "ass"
## [5] "ballsack" "balls" "bastard" "bitch"
## [9] "biatch" "bloody" "blowjob" "blow job"
## [13] "bollock" "bollok" "boner" "boob"
## [17] "bugger" "bum" "butt" "buttplug"
## [21] "clitoris" "cock" "coon" "crap"
## [25] "cunt" "damn" "dick" "dildo"
## [29] "dyke" "fag" "feck" "fellate"
## [33] "fellatio" "felching" "fuck" "f u c k"
## [37] "fudgepacker" "fudge packer" "flange" "Goddamn"
## [41] "God damn" "hell" "homo" "jerk"
## [45] "jizz" "knobend" "knob end" "labia"
## [49] "lmao" "lmfao" "muff" "nigger"
## [53] "nigga" "omg" "penis" "piss"
## [57] "poop" "prick" "pube" "pussy"
## [61] "queer" "scrotum" "sex" "shit"
## [65] "s hit" "sh1t" "slut" "smegma"
## [69] "spunk" "tit" "tosser" "turd"
## [73] "twat" "vagina" "wank" "whore"
## [77] "wtf"
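The removal step itself does not appear above; a minimal sketch, assuming the profanity terms are stripped from the tokenized Twitter corpus with removeWords (the name text_token matches the object printed below):
text_token <- tm_map(twitter_token, removeWords, profanity) # drop profane terms from the corpus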
text_token # print the cleaned corpus
## <<VCorpus (documents: 10000, metadata (corpus/indexed): 0/0)>>
After these initial steps, the next phase is the modeling phase: developing the application that predicts the next word, given the previous ones. After some research, the most suitable approach is to use Markov chains to handle this challenge.
In the 1948 landmark paper “A Mathematical Theory of Communication”, Claude Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering.
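As a minimal sketch of this idea (not the final application), a trigram table can be treated as a Markov chain over two-word prefixes: given the last two words typed, the model proposes the most frequent third word. The table below is illustrative, with hypothetical object and column names; in practice it would be built from the n-gram matrices above.
# Illustrative trigram counts; names and numbers are hypothetical
trigrams <- data.frame(
  prefix = c("happy new", "happy mothers", "let us"),
  word   = c("year", "day", "know"),
  freq   = c(12, 9, 7),
  stringsAsFactors = FALSE
)
# Given the last two words typed, return the most frequent continuation
predict_next <- function(last_two_words, trigram_table) {
  candidates <- trigram_table[trigram_table$prefix == tolower(last_two_words), ]
  if (nrow(candidates) == 0) return(NA_character_) # unseen prefix: would back off to bigrams/unigrams
  candidates$word[which.max(candidates$freq)]
}
predict_next("happy new", trigrams)
## [1] "year"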