The goal of the Capstone Project is to apply the techniques learned throughout the Data Science Specialization to build a brand new application based on text analysis and natural language processing.
This work involves several tasks: (i) Understanding the problem; (ii) Data acquisition and cleaning; (iii) Exploratory analysis; (iv) Statistical modeling; (v) Predictive modeling; (vi) Creative exploration; (vii) Creating a data product; and (viii) Creating a short slide deck pitching the data product.
This Milestone Report explains the major features of the data and summarizes the plans for creating the prediction algorithm and the Shiny app.
This represents the training data that will be the basis for most of the capstone. The Coursera-SwiftKey.zip file is downloaded from the Coursera site. It contains a folder "final" with raw text data from three sources (news headlines, blog entries, and tweets) for four locales, en_US, de_DE, ru_RU and fi_FI, corresponding to the English, German, Russian and Finnish language databases.
Loading Packages/Reading Data
First, the R environment is cleared and the necessary libraries are loaded.
rm(list=ls())
library(NLP); library(tm); library(R.utils); library(stringi); library(wordcloud)
library(RWeka); library(ggplot2)
The Coursera-SwiftKey.zip file is downloaded and unzipped:
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip", method = "curl", mode = "wb")
unzip("./data/Coursera-SwiftKey.zip", exdir = "./data")
The English (en_US) database is read:
data_twitter.raw <- readLines("data/final/en_US/en_US.twitter.txt")
data_news.raw <- readLines("data/final/en_US/en_US.news.txt")
data_blogs.raw <- readLines("data/final/en_US/en_US.blogs.txt")
data.raw <- c(data_twitter.raw, data_news.raw, data_blogs.raw)
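Note: on some systems (notably Windows), readLines() can stop early on en_US.news.txt because of an embedded control character. If the line counts below look too low, a more defensive variant of that read, opening the file in binary mode and skipping embedded nuls, is:
con <- file("data/final/en_US/en_US.news.txt", open = "rb")  # binary mode avoids early truncation
data_news.raw <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)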
The first step is to understand the distribution and relationship between the words, tokens, and phrases in the text.
The number of lines in each source is determined:
nlines_twitter <- length(data_twitter.raw)
nlines_news <- length(data_news.raw)
nlines_blogs <- length(data_blogs.raw)
nlines_twitter; nlines_news; nlines_blogs
## [1] 2360148
## [1] 1010242
## [1] 899288
The number of words per line in each source is computed, along with its summary statistics:
nwords_twitter <- stri_count_words(data_twitter.raw)
nwords_news <- stri_count_words(data_news.raw)
nwords_blogs <- stri_count_words(data_blogs.raw)
summary(nwords_twitter); summary(nwords_news); summary(nwords_blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.41 46.00 1796.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
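For completeness, the total number of words in each source can be obtained from the same per-line counts (totals not shown here):
sum(nwords_twitter); sum(nwords_news); sum(nwords_blogs)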
For further analysis, a random sample of 5,000 lines is drawn from each source:
set.seed(1234)
sample_twitter <- data_twitter.raw[sample(1:length(data_twitter.raw), 5000)]
sample_news <- data_news.raw[sample(1:length(data_news.raw), 5000)]
sample_blogs <- data_blogs.raw[sample(1:length(data_blogs.raw), 5000)]
sample_data <- c(sample_twitter, sample_news, sample_blogs)
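Optionally, the sample can be written to disk so later runs do not need to re-read the full files (the path below is arbitrary):
writeLines(sample_data, "data/sample_data.txt")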
For easier examination, a corpus is built from the sampled data (twitter, news and blogs) and cleaned:
myCorpus <- Corpus(VectorSource(sample_data))
to_Space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))  # replace matches with a space
myCorpus <- tm_map(myCorpus, to_Space, "\"|/|@|\\|")
myCorpus <- tm_map(myCorpus, content_transformer(stringi::stri_trans_tolower))
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
A word cloud is built to show the top words of the cleaned corpus, with word size varying by frequency.
wordcloud(myCorpus, max.words =100, min.freq = 1, random.order = TRUE, rot.per = .20,
colors = colorRampPalette(brewer.pal(9,"Blues"))(32), scale = c(3, .3))
A helper function based on RWeka's NGramTokenizer is created and used to extract the most frequent n-grams:
myCorpus_data_frame <- data.frame(text = unlist(sapply(myCorpus, identity)),
                                  stringsAsFactors = FALSE)
findNGrams <- function(corp, grams, top) {
  # Tokenize the text into n-grams of the requested order
  ngram <- NGramTokenizer(corp, Weka_control(min = grams, max = grams,
                                             delimiters = " \\r\\n\\t.,;:\"()?!"))
  # Count each n-gram and keep the 'top' most frequent ones
  ngram <- data.frame(table(ngram))
  ngram <- ngram[order(ngram$Freq, decreasing = TRUE), ][1:top, ]
  colnames(ngram) <- c("Words", "Count")
  ngram
}
monoGrams <- findNGrams(myCorpus_data_frame, 1, 5000)
biGrams <- findNGrams(myCorpus_data_frame, 2, 5000)
triGrams <- findNGrams(myCorpus_data_frame, 3, 5000)
quadriGrams <- findNGrams(myCorpus_data_frame, 4, 5000)
Mono-grams: plot the top 20 most frequent 1-grams:
ggplot(monoGrams[1:20,], aes(x = reorder(Words, Count), y = Count, fill = Count)) +
geom_bar(stat = "identity") + coord_flip() +
labs(x = "Words", y = "Count", title = "Figure: Most common 1-grams in text sample")
The figure above shows the same word ordering already seen in the word cloud: said, will, one, just, etc.
Bi-grams: plot the top 20 most frequent 2-grams:
ggplot(biGrams[1:20,], aes(x = reorder(Words, Count), y = Count, fill = Count)) +
geom_bar(stat = "identity") + coord_flip() +
labs(x = "Words", y = "Count", title = "Figure: Most common 2-grams in text sample")
Tri-grams: plot the top 20 most frequent 3-grams:
ggplot(triGrams[1:20,], aes(x = reorder(Words, Count), y = Count, fill = Count)) +
geom_bar(stat = "identity") + coord_flip() +
labs(x = "Words", y = "Count", title = "Figure: Most common 3-grams in text sample")
Quadri-grams: plot the top 20 most frequent 4-grams:
ggplot(quadriGrams[1:20,], aes(x = reorder(Words, Count), y = Count, fill = Count)) +
geom_bar(stat = "identity") + coord_flip() +
labs(x = "Words", y = "Count", title = "Figure: Most common 4-grams in text sample")
There are three strategies to follow:
Start with the mono-gram model, and then use the bi-gram, tri-gram and quadri-gram models to predict the next word.
Start with the quadri-gram model to find the most likely next word, then back off to the tri-gram, bi-gram and mono-gram models (a sketch of this back-off idea is given below).
Use the text2vec package and the GloVe algorithm in R (https://cran.r-project.org/web/packages/text2vec/text2vec.pdf). To complete a sentence with the next word, the cosine similarity between pairs of words is measured; before measuring cosine similarity, the words are converted into vectors with the GloVe algorithm (a minimal sketch is also given below).
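As a rough illustration of the first two strategies, the sketch below looks up the last one to three words of a phrase in the quadri-, tri- and bi-gram tables built above and backs off to shorter n-grams when no match is found. predict_next_word() is a hypothetical helper written only for this report, not the final algorithm.
# Hypothetical back-off predictor built on the n-gram tables above
predict_next_word <- function(phrase) {
  # Normalise the input roughly like the corpus: lower case, letters only
  words <- unlist(strsplit(gsub("[^a-z' ]", " ", tolower(phrase)), "\\s+"))
  words <- words[words != ""]
  tables  <- list(quadriGrams, triGrams, biGrams)  # longest n-grams first
  context <- c(3, 2, 1)                            # words of context each table needs
  for (i in seq_along(tables)) {
    if (length(words) < context[i]) next
    prefix <- paste(tail(words, context[i]), collapse = " ")
    tab    <- tables[[i]]
    hits   <- tab[!is.na(tab$Words) & grepl(paste0("^", prefix, " "), tab$Words), ]
    if (nrow(hits) > 0) {
      # Tables are already sorted by Count, so the first hit is the best match;
      # the predicted word is the last token of that n-gram
      return(tail(unlist(strsplit(as.character(hits$Words[1]), " ")), 1))
    }
  }
  as.character(monoGrams$Words[1])  # fall back to the most frequent single word
}
predict_next_word("thanks for the")
Since stopwords were removed from the sampled corpus, these tables are only suitable for exploration; for the final predictor the n-gram counts would be rebuilt from text with stopwords kept.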
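For the third strategy, a minimal text2vec sketch could look like the following. The hyper-parameters (vector size 50, window of 5, 10 iterations) are placeholders, the constructor argument may be called word_vectors_size instead of rank in older text2vec versions, and "time" is just an example word assumed to be in the pruned vocabulary.
library(text2vec)
# Token iterator and vocabulary over the sampled data
tokens <- word_tokenizer(tolower(sample_data))
it     <- itoken(tokens, progressbar = FALSE)
vocab  <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
# Term co-occurrence matrix and GloVe fit (parameter values are placeholders)
vectorizer <- vocab_vectorizer(vocab)
tcm   <- create_tcm(it, vectorizer, skip_grams_window = 5)
glove <- GlobalVectors$new(rank = 50, x_max = 10)
wv_main      <- glove$fit_transform(tcm, n_iter = 10)
word_vectors <- wv_main + t(glove$components)
# Cosine similarity between one word and all others
sims <- sim2(word_vectors, word_vectors["time", , drop = FALSE], method = "cosine")
head(sort(sims[, 1], decreasing = TRUE), 10)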
A benchmark can also be run using the GloVe algorithm together with Recurrent Neural Networks (RNN) in an Anaconda environment with the TensorFlow backend.
The Shiny app will take a phrase (multiple words) as input. After the user clicks Submit, the next word is predicted. It would be interesting to show the predicted next word obtained from each of the three strategies mentioned above.
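A minimal skeleton of such an app, assuming a prediction function like the hypothetical predict_next_word() sketched above, could look like this:
library(shiny)
ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  actionButton("submit", "Submit"),
  h4("Predicted next word:"),
  textOutput("prediction")
)
server <- function(input, output) {
  # Predict only when the user clicks Submit
  prediction <- eventReactive(input$submit, {
    predict_next_word(input$phrase)  # hypothetical predictor from the sketch above
  })
  output$prediction <- renderText(prediction())
}
shinyApp(ui = ui, server = server)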