In this capstone project, we will build a predictive text model based on natural language processing algorithms. Given a word in a text, our model will predict the next word. For our analysis and models, we use data from the HC Corpora collection (http://www.corpora.heliohost.org/). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the available corpora.
This report describes an exploratory analysis of the dataset and summarizes our planned work to build the predictive text model.
First we download and extract the zip file provided at Capstone Data. The dataset contains three kinds of text documents: news articles, blog posts, and Twitter posts. They are provided in four languages: German, English, Finnish, and Russian. In this capstone project we are only concerned with the English dataset.
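The download-and-extract step can be scripted as in the minimal sketch below; the dataset URL and local paths are assumptions and may need to be adjusted for your setup.
# Sketch: download and extract the capstone dataset (URL and paths are assumptions)
url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
zipfile <- "~/Downloads/Data/Capstone/Coursera-SwiftKey.zip"
if (!file.exists(zipfile)) {
  download.file(url, destfile = zipfile, mode = "wb")
  unzip(zipfile, exdir = "~/Downloads/Data/Capstone")  # creates the final/en_US folder used below
}
Now we load the data into RStudio.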
setwd("~/Downloads/Data/Capstone/final/en_US")
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Now we make a basic summary of the three text files: number of lines, number of words, and file size in MB.
library(stringi)
# Number of words
blogs.wd <- stri_count_words(blogs)
news.wd <- stri_count_words(news)
twitter.wd <- stri_count_words(twitter)
# File size in megabytes
blogs.sz <- file.info("en_US.blogs.txt")$size / 2^20
news.sz <- file.info("en_US.news.txt")$size / 2^20
twitter.sz <- file.info("en_US.twitter.txt")$size / 2^20
# summarize the data
data.frame(source = c("blogs", "news", "twitter"),
num.lines = c(length(blogs),length(news), length(twitter)),
num.words = c(sum(blogs.wd), sum(news.wd), sum(twitter.wd)),
file.size.MB = c(blogs.sz, news.sz, twitter.sz)
)
## source num.lines num.words file.size.MB
## 1 blogs 899288 37546246 200.4242
## 2 news 1010242 34762395 196.2775
## 3 twitter 2360148 30093410 159.3641
Since the dataset is large, we sample it before starting the exploratory analysis. We take 1% of each file.
options(mc.cores=1)
library(tm)
set.seed(1234)
Sample <- c(sample(blogs, length(blogs)*.01),
sample(news, length(news)*.01),
sample(twitter, length(twitter)*.01))
# create myCorpus and preprocessing
corpus <- VCorpus(VectorSource(Sample))
corpus <- tm_map(corpus, PlainTextDocument)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Removing URLs
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
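# Removing Twitter handles (@mentions)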
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- VCorpus(VectorSource(corpus))
Now we can start our exploratory analysis. We will find the most frequent unigrams, bigrams, and trigrams in our corpus.
library(RWeka)  # kept for the commented-out RWeka tokenizers below; the tokenizers actually used rely on NLP's ngrams()
library(ggplot2)
get_freq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
# unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
bigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
trigramTokenizer <-
function(x)
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
makePlot <- function(data, label) {
ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
labs(x = label, y = "Frequency") +
theme(axis.text.x = element_text(angle = 70, size = 15, hjust = 1)) +
geom_bar(stat = "identity", fill = I("blue1"))
}
unigram_freq <- get_freq(removeSparseTerms(
TermDocumentMatrix(corpus), 0.999))
bigram_freq <- get_freq(removeSparseTerms(
TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer)), 0.999))
trigram_freq <- get_freq(removeSparseTerms(
TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer)), 0.999))
Now we can look at the head of the unigram frequency table and make a bar plot of the 25 most common unigrams.
head(unigram_freq, 10)
## word freq
## the the 47923
## and and 24307
## for for 10812
## that that 10410
## you you 9515
## with with 7197
## was was 6407
## this this 5456
## have have 5439
## but but 4897
makePlot(unigram_freq, "25 Most Common Unigrams")
We can make similar plots for the bigrams and trigrams.
head(bigram_freq, 10)
## word freq
## of the of the 4261
## in the in the 4147
## to the to the 2238
## for the for the 1989
## on the on the 1920
## to be to be 1592
## at the at the 1496
## and the and the 1237
## in a in a 1185
## with the with the 1060
makePlot(bigram_freq, "25 Most Common Bigrams")
head(trigram_freq, 10)
## word freq
## one of the one of the 358
## a lot of a lot of 313
## thanks for the thanks for the 242
## to be a to be a 192
## going to be going to be 178
## it was a it was a 150
## out of the out of the 146
## as well as as well as 145
## some of the some of the 144
## i want to i want to 138
makePlot(trigram_freq, "25 Most Common Trigrams")
Our future work will be to implement a Markov chain (n-gram) model to create the predictive text model. After building the model we will deploy it as a Shiny app, where a user can type a phrase in English and the model will suggest the next two or three words.
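As a rough preview of that plan, the sketch below shows how the frequency tables computed above could drive a simple back-off lookup: try trigrams whose first two words match the end of the phrase, fall back to bigrams, and finally to the most common unigrams. The function name predict_next and the scoring (plain frequency ranking, no smoothing) are illustrative assumptions, not the final model.
# Sketch: next-word lookup by backing off from trigrams to bigrams (illustrative only)
predict_next <- function(phrase, n = 3) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  last2  <- paste(tail(tokens, 2), collapse = " ")
  last1  <- tail(tokens, 1)
  # trigrams whose first two words match the end of the phrase
  hits <- trigram_freq[grepl(paste0("^", last2, " "), trigram_freq$word), ]
  if (nrow(hits) == 0) {
    # back off to bigrams whose first word matches the last word typed
    hits <- bigram_freq[grepl(paste0("^", last1, " "), bigram_freq$word), ]
  }
  if (nrow(hits) == 0) return(as.character(head(unigram_freq$word, n)))
  # the frequency tables are already sorted, so take the final word of the top-n matches
  sapply(strsplit(as.character(head(hits$word, n)), " "), tail, 1)
}
predict_next("one of")  # likely suggests "the" first, given the sampled corpus above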