Introduction to the Milestone Report

The goal of the capstone project is to build a predictive text model, using a large corpus of documents collected from Twitter, news sites, and blogs as the training data. Natural language processing techniques will be used to analyze the corpus and build the predictive model.

This report describes the major features of the training data based on exploratory data analysis and summarizes the plan for creating the predictive model.

Getting the data

The data sets consist of text from 3 different sources: 1) News, 2) Blogs and 3) Twitter feeds. The text data are provided in 4 different languages: 1) German, 2) English - United States, 3) Finnish and 4) Russian. In this project, we will only focus on the English - United States data sets.

# Set a seed so the random sample drawn later is reproducible
set.seed(123)

# Read the English (en_US) data sets
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Then we examine the data sets and summarize our findings (file sizes, line counts, word counts, and mean words per line) below:

library(stringi)

# Get file sizes
blogs.size <- file.info("en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("en_US.twitter.txt")$size / 1024 ^ 2

# Get words in files
blogs.words <- stri_count_words(blogs)
news.words <- stri_count_words(news)
twitter.words <- stri_count_words(twitter)

# Summary of the data sets
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blogs.size, news.size, twitter.size),
           num.lines = c(length(blogs), length(news), length(twitter)),
           num.words = c(sum(blogs.words), sum(news.words), sum(twitter.words)),
           mean.num.words = c(mean(blogs.words), mean(news.words), mean(twitter.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093410       12.75065

Cleaning the data

Now we clean the data before performing exploratory data analysis. In this step, we remove URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and convert the text to lower case. Because the full data set is large, we randomly sample a small portion of it to demonstrate the data cleaning and exploratory analysis.

library(tm)

# Sample 0.1% of each source for the cleaning and exploratory steps
sample_size <- 0.001
data.sample <- c(sample(blogs, length(blogs) * sample_size),
                 sample(news, length(news) * sample_size),
                 sample(twitter, length(twitter) * sample_size))

# Create corpus and clean the data
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Replace URLs and @mentions with spaces
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^\\s]+")
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")
# Lower-case the text and remove stopwords, punctuation, numbers and extra whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
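
As a quick spot check on the cleaning steps, one of the documents in the sample corpus can be printed (this inspection is only illustrative and not part of the modeling pipeline):

# Print the first cleaned document to verify the transformations above
writeLines(as.character(corpus[[1]]))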

Exploratory Analysis

Now we perform exploratory analysis on the sampled data, with the aim of finding the most frequently occurring words. Below, we list the most common unigrams, bigrams, and trigrams.

library(RWeka)
library(ggplot2)

# Avoid parallel-processing issues between tm and the RWeka tokenizers
options(mc.cores=1)

# Build a frequency table (word, freq) from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

# Tokenizers for extracting bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Bar plot of the 30 most frequent n-grams in a frequency table
makePlot <- function(data, label) {
  ggplot(head(data, 30), aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("grey50"))
}

# Get frequencies of most common n-grams in data sample
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.9999))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.9999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))

Bar plot of the 30 most common unigrams in the data sample

makePlot(freq1, "30 Most Common Unigrams")

Bar plot of the 30 most common bigrams in the data sample

makePlot(freq2, "30 Most Common Bigrams")

Bar plot of the 30 most common trigrams in the data sample

makePlot(freq3, "30 Most Common Trigrams")

Next Steps: Prediction Algorithm and Shiny App

This concludes our exploratory analysis. The next steps of the capstone project are to finalize the predictive algorithm and deploy it as a Shiny app.

The algorithm will use an n-gram model with a frequency lookup similar to the one used in the exploratory analysis above. One possible strategy is to use the trigram model to predict the next word; if no matching trigram is found, the algorithm backs off to the bigram model, and then to the unigram model if required.
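
To illustrate the idea, a simple frequency-based backoff lookup over the n-gram tables computed above could look like the sketch below. This is only a sketch, not the final implementation: predictNextWord is a placeholder name, and the code assumes freq1, freq2, and freq3 are the unigram, bigram, and trigram frequency tables produced by getFreq().

# Illustrative backoff sketch (not the final model): look up the last words of
# the input phrase in the trigram table, back off to bigrams, then to unigrams.
predictNextWord <- function(phrase, n = 3) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)

  # Trigrams whose first two words match the end of the phrase
  if (length(words) == 2) {
    hits <- freq3[grepl(paste0("^", words[1], " ", words[2], " "),
                        as.character(freq3$word)), ]
    if (nrow(hits) > 0)
      return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 3), n))
  }

  # Back off to bigrams whose first word matches the last word of the phrase
  hits <- freq2[grepl(paste0("^", tail(words, 1), " "),
                      as.character(freq2$word)), ]
  if (nrow(hits) > 0)
    return(head(sapply(strsplit(as.character(hits$word), " "), `[`, 2), n))

  # Fall back to the most frequent unigrams
  head(as.character(freq1$word), n)
}

predictNextWord("happy new")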

The user interface of the Shiny app will consist of a text input box that allows the user to enter a phrase. The app will then use the algorithm to suggest the most likely next word, and the user will be able to configure how many suggestions the app should display.
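
A minimal sketch of such an interface is shown below; all names and layout choices are placeholders, and it assumes the hypothetical predictNextWord() helper sketched in the previous section.

# Minimal Shiny sketch: a text input, a control for the number of suggestions,
# and an output area showing the predicted next words.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  sliderInput("n", "Number of suggestions:", min = 1, max = 5, value = 3),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predictNextWord(input$phrase, n = input$n)
  })
}

shinyApp(ui, server)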