The main idea behind text prediction is estimating the next character or word given a string of input history. This can help correct mistyped words and suggest which word should come next.
Over the past decade, there has been a dramatic increase in the use of electronic devices for email, social networking and other activities. Typing errors on such devices are far from uncommon and can have considerable implications for how efficiently these devices can be used for communication.
The objective of this project is to develop a text prediction algorithm derived from large data sets composed of different source materials, such as blog, Twitter and news data.
To start, the main technique used is the n-gram approach, where an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on. These larger sizes are not used in this project.
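For illustration, the RWeka tokenizer used later in this report can extract such n-grams from a short, made-up sentence (the example sentence is arbitrary):
library(RWeka)
sentence <- "thanks for the follow"
NGramTokenizer(sentence, Weka_control(min = 2, max = 2)) # bigrams:  "thanks for" "for the" "the follow"
NGramTokenizer(sentence, Weka_control(min = 3, max = 3)) # trigrams: "thanks for the" "for the follow"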
The data was obtained from HC Corpora (www.corpora.heliohost.org). The chosen language was English.
Obtaining the data:
library(tm)        # text mining framework (corpus, tm_map, term-document matrices)
library(RWekajars) # Java dependencies for RWeka
library(RWeka)     # NGramTokenizer and Weka_control
library(dplyr)     # data manipulation
library(magrittr)  # pipe operator
library(ggplot2)   # plotting
library(stringi)   # fast string operations (stri_count_words)
setwd("~/Projetos Analytics/ESTUDOS/R/CAPSTONE/Coursera-SwiftKey/final/en_US")
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
Now, we do a descriptive analysis to understand what we have.
#Number of entries
length(blogs) #899,288
## [1] 899288
length(news) #77,259
## [1] 77259
length(twitter) #2,360,148
## [1] 2360148
#Summarize the number of characters per entry for each source
summary(nchar(blogs))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 47 156 230 329 40830
summary(nchar(news))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 111.0 186.0 202.4 270.0 5760.0
summary(nchar(twitter))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 37.00 64.00 68.68 100.00 140.00
#Count the number of words per entry and summarize
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
summary(words_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
summary(words_news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
summary(words_twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
Tasks to accomplish:
Identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
Text Preprocessing - Tokenization and data cleaning
tokenizator <- function(x) {
  corpus <- Corpus(VectorSource(x))                            # make a corpus object
  corpus <- tm_map(corpus, content_transformer(tolower))       # make everything lowercase
  corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove English stop words
  corpus <- tm_map(corpus, removePunctuation)                  # remove punctuation
  corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
  corpus <- tm_map(corpus, stripWhitespace)                    # get rid of extra spaces
  corpus <- tm_map(corpus, PlainTextDocument)                  # make sure all data is a PlainTextDocument
  corpus                                                       # return the cleaned corpus
}
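Note that the *_sample objects used below are not created by the code above; here is a minimal sketch of the missing sampling step, assuming a random sample of 10,000 lines per source (consistent with the 10,000-document corpus printed later) and an arbitrary seed:
set.seed(1234)       # arbitrary seed, for reproducibility only
sample_size <- 10000 # assumed sample size per source
blogs_sample   <- sample(blogs, sample_size)
news_sample    <- sample(news, sample_size)
twitter_sample <- sample(twitter, sample_size)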
blog_token <- tokenizator(blogs_sample)
twitter_token <- tokenizator(twitter_sample)
news_token <- tokenizator(news_sample)
tdm <- TermDocumentMatrix(twitter_token) # terms as rows, documents as columns
dtm <- DocumentTermMatrix(twitter_token) # documents as rows, terms as columns
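As a quick sanity check of the matrices, the most frequent single terms in the Twitter sample can be listed; the frequency threshold of 50 below is arbitrary:
findFreqTerms(dtm, lowfreq = 50) # unigrams appearing at least 50 times in the sample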
Now, let's do an exploratory analysis considering groups of three words (n-grams with n = 3, i.e., trigrams). The data used is from Twitter.
ngram = 3
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ngram, max = ngram))
tdm_grams <- TermDocumentMatrix(twitter_token, control = list(tokenize = ngramTokenizer))
Showing the popular trigrams (those occurring at least five times in the sample)
popularNgrams <- findFreqTerms(tdm_grams,lowfreq=5)
popularNgrams
## [1] "follow follow follow" "happy mothers day"
## [3] "happy new year" "happy th birthday"
## [5] "let us know" "looking forward seeing"
Plotting the most frequent trigrams
ngramsFrequency <- rowSums(as.matrix(tdm_grams[popularNgrams,]))
print(qplot(names(ngramsFrequency), ngramsFrequency) +
        coord_flip() +
        geom_bar(stat = "identity") +
        ggtitle("N-grams Frequency\nTwitter Sample\n") +
        xlab("N-grams") +   # x holds the trigram labels, shown on the vertical axis after coord_flip()
        ylab("Frequency") + # y holds the counts
        theme(plot.title = element_text(lineheight = .8, face = "bold")))
Removing profanity and other words you do not want to predict.
#Loading a profanity list
profanity <- readLines("swearWords.txt")
profanity
## [1] "anal" "anus" "arse" "ass"
## [5] "ballsack" "balls" "bastard" "bitch"
## [9] "biatch" "bloody" "blowjob" "blow job"
## [13] "bollock" "bollok" "boner" "boob"
## [17] "bugger" "bum" "butt" "buttplug"
## [21] "clitoris" "cock" "coon" "crap"
## [25] "cunt" "damn" "dick" "dildo"
## [29] "dyke" "fag" "feck" "fellate"
## [33] "fellatio" "felching" "fuck" "f u c k"
## [37] "fudgepacker" "fudge packer" "flange" "Goddamn"
## [41] "God damn" "hell" "homo" "jerk"
## [45] "jizz" "knobend" "knob end" "labia"
## [49] "lmao" "lmfao" "muff" "nigger"
## [53] "nigga" "omg" "penis" "piss"
## [57] "poop" "prick" "pube" "pussy"
## [61] "queer" "scrotum" "sex" "shit"
## [65] "s hit" "sh1t" "slut" "smegma"
## [69] "spunk" "tit" "tosser" "turd"
## [73] "twat" "vagina" "wank" "whore"
## [77] "wtf"
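The removal step itself does not appear above; a minimal sketch, assuming the profanity terms are stripped from the tokenized Twitter corpus with removeWords (the name text_token matches the object printed below):
text_token <- tm_map(twitter_token, removeWords, profanity) # drop profane terms from the corpus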
text_token # print the cleaned corpus
## <<VCorpus (documents: 10000, metadata (corpus/indexed): 0/0)>>
After these initial steps, the next phase is the modeling phase: developing the application that predicts the next word, given the previous ones. After some research, the most suitable approach is to use Markov chains to handle this challenge.
In the 1948 landmark paper “A Mathematical Theory of Communication”, Claude Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering.
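As a minimal sketch of this idea (not the final application), a trigram table can be treated as a Markov chain over two-word prefixes: given the last two words typed, the model proposes the most frequent third word. The table below is illustrative, with hypothetical object and column names; in practice it would be built from the n-gram matrices above.
# Illustrative trigram counts; names and numbers are hypothetical
trigrams <- data.frame(
  prefix = c("happy new", "happy mothers", "let us"),
  word   = c("year", "day", "know"),
  freq   = c(12, 9, 7),
  stringsAsFactors = FALSE
)
# Given the last two words typed, return the most frequent continuation
predict_next <- function(last_two_words, trigram_table) {
  candidates <- trigram_table[trigram_table$prefix == tolower(last_two_words), ]
  if (nrow(candidates) == 0) return(NA_character_) # unseen prefix: would back off to bigrams/unigrams
  candidates$word[which.max(candidates$freq)]
}
predict_next("happy new", trigrams)
## [1] "year"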