The final capstone project for the Coursera Data Science Specialization is to create an application that predicts the next word in a phrase or sentence. To accomplish this, we will use natural language processing techniques, namely N-grams/Markov models, to estimate the probability of the next word given the word(s) that precede it. The data will be several large bodies of text documents (each commonly referred to as a corpus, plural corpora). This report details the corpora we have available and briefly discusses possibilities for modeling the data to accomplish that goal.
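
As a rough illustration of the idea (with made-up counts, not the project data), a Markov-style model estimates the probability of each candidate next word from N-gram counts and picks the most likely one:

#Illustration only: estimate P(next word | "i") from invented bigram/unigram counts
bigram_counts   <- c("i am" = 120, "i will" = 300, "i can" = 95)
unigram_count_i <- 515                               #count of the word "i"
p_next <- bigram_counts / unigram_count_i            #conditional probability estimates
sub("^i ", "", names(which.max(p_next)))             #most likely next word: "will"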

Available Data

For this project, we were provided a zip file of corpus files (downloadable here). These files contain a set of Twitter messages, news articles, and blog entries in 4 different languages. Keep in mind that many other corpora are available online (e.g., Reuters-21578), but for our purposes we will focus on the provided US English set of documents.

Data Analysis

Below is a summary of the 3 corpus files.

File       Lines        Longest Line (chars)   Words        File Size (MB)
Blogs        899,288          40,833           37,334,147       200.42
News       1,010,242          11,384           34,372,530       196.28
Twitter    2,360,148             140           30,373,603       159.36

The blogs file, our largest, has the most words (over 37 million) but the fewest lines. Its longest line (over 40 thousand characters) indicates that it comprises longer sentences and lengthier expressions of ideas. The news file, our second largest, also ranks second in number of lines (over 1 million), number of words (over 34 million), and longest line (over 11 thousand characters). I would consider the news file a good middle ground for sentence length and variety of word usage and ideas. Lastly, the Twitter file comprises short messages (at most 140 characters); it consequently has the most lines (over 2.3 million) and the fewest words (just over 30 million).
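
The full code is in the appendix; as a condensed sketch, the per-file figures in the table can be computed with base R (shown here for the blogs file):

#Condensed sketch: per-file summary statistics (blogs file shown; see appendix for full code)
blogs_txt  <- "final/en_US/en_US.blogs.txt"
blogs_data <- readLines(blogs_txt, skipNul = TRUE)
blogs_lines     <- length(blogs_data)                                  #number of lines
blogs_max_chars <- max(nchar(blogs_data))                              #longest line (chars)
blogs_words     <- sum(sapply(gregexpr("\\S+", blogs_data), length))   #whitespace-based word count
blogs_size      <- round(file.size(blogs_txt)/1048576, 2)              #file size in MB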

Sampling

Due to the large size of these files, processing the entire contents on a local computer proved very time consuming, so I chose to sample the data in order to move forward as quickly as possible. For the per-corpus exploratory analysis, I randomly sampled 40% (configurable) of each document. After that analysis, I lowered the sampling percentage to 10% (configurable) for each file and combined the samples into a single data set. The code below shows how this was done for the combined sample:

set.seed(892)
sample_pct <- .1
blogs_data_sample <- blogs_data[sample(1:blogs_lines, blogs_lines*sample_pct)]
news_data_sample <- news_data[sample(1:news_lines, news_lines*sample_pct)]
twitter_data_sample <- twitter_data[sample(1:twitter_lines, twitter_lines*sample_pct)]
sample_data <- list(blogs_data_sample, news_data_sample, twitter_data_sample)
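
If the combined sample needs to be reused across sessions, it could also be written out to disk; a minimal sketch, where the output path is an assumption and not taken from the original code, would be:

#Sketch only: write the combined sample out as one plain-text file (path is assumed)
writeLines(unlist(sample_data), "data/en_US.sample.txt")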

Common Words

Words were analyzed independently in each of the corpus files (at the higher sample rate of 40%). Subsequently, all files were combined and analyzed together (at the lower sample rate of 10%). The most frequent words varied by corpus; however, the words “will”, “just”, and “can” made the top 10 in all 3 documents. The barplot and word cloud graphics below show the details.
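
These figures are drawn from the sorted term-frequency vectors built in the appendix (blogs_frequency, news_frequency, twitter_frequency, and frequency). A minimal sketch of the plotting calls follows; the barplot styling is my assumption, while the word cloud call is the one used in the appendix:

#Sketch: top-10 barplot and word cloud from a sorted term-frequency vector
barplot(head(frequency, 10), las = 2, main = "Most Frequent Words")
wordcloud(names(frequency), frequency, min.freq = 25,
          random.order = FALSE, colors = brewer.pal(8, "Spectral"))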

Blogs Data File

Below is a summary of the most frequent words based on the blogs sampled data:

News Data File

Below is a summary of the most frequent words based on the news sampled data:

Twitter Data File

Below is a summary of the most frequent words based on the Twitter sampled data:

Full Data

Below is a summary of the most frequent words based on the full sampled data:

Future Plans

We will use the sampled data analyzed above to create a statistical model (based on N-grams/Markov models) to predict the next word in a phrase or sentence. The model will then be tested and tweaked to strike a good balance between accuracy and speed. This will be an iterative process and may require additional techniques (such as Katz’s back-off model) and/or new data sources.
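
As a rough sketch of the planned prediction step (illustrative only; the flat table layout with columns one, two, three, and freq is an assumption, not the final implementation), a simple back-off lookup might first match the last two input words against a trigram table and fall back to a bigram table:

#Sketch of a back-off style next-word lookup (table columns are assumed: one, two, three, freq)
#Assumes the input phrase contains at least two words
predictNextWord <- function(phrase, trigrams, bigrams) {
    w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)   #last two words of the input
    #try trigrams whose first two words match the input
    hit <- trigrams[trigrams$one == w[1] & trigrams$two == w[2], ]
    if (nrow(hit) > 0) return(as.character(hit$three[which.max(hit$freq)]))
    #back off to bigrams keyed on the last word only
    hit <- bigrams[bigrams$one == w[2], ]
    if (nrow(hit) > 0) return(as.character(hit$two[which.max(hit$freq)]))
    NA_character_                                           #no match found
}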

Appendix

Below is the full R processing code used in part to generate this report:

library(tm)
library(RWekajars)
library(RWeka)
library(wordcloud)
require(openNLP)
require(reshape)
set.seed(892)
sample_pct <- .4
setwd("~/ACADEMICS/datascience/Final Capstone")

#Corpus files downloaded from:
# http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
#cached locally for performance

#Get file locations
us_txt_dir <- "final/en_US/"
blogs_txt <- paste(us_txt_dir, "en_US.blogs.txt", sep = "")
news_txt <- paste(us_txt_dir, "en_US.news.txt", sep = "")
twitter_txt <- paste(us_txt_dir, "en_US.twitter.txt", sep = "")

#Load text into R. Lowercase to normalize future operations
blogs_data <- tolower(readLines(blogs_txt, skipNul = T))
news_data <- tolower(readLines(news_txt, skipNul = T))
twitter_data <- tolower(readLines(twitter_txt, skipNul = T))

#Get file sizes in MB
blogs_size <- round(file.size(blogs_txt)/1048576, 2)
news_size <- round(file.size(news_txt)/1048576, 2)
twitter_size <- round(file.size(twitter_txt)/1048576, 2)

#Get line counts
blogs_lines <- length(blogs_data)
news_lines <- length(news_data)
twitter_lines <- length(twitter_data)

#Get max line length (in characters); nchar() is vectorized over the lines
blogs_max_chars   <- max(nchar(blogs_data))
news_max_chars    <- max(nchar(news_data))
twitter_max_chars <- max(nchar(twitter_data))

#Get word counts (based on spaces)
blogs_words <- sum( sapply(gregexpr("\\S+", blogs_data), length ) )
news_words <- sum( sapply(gregexpr("\\S+", news_data), length ) )
twitter_words <- sum( sapply(gregexpr("\\S+", twitter_data), length ) )

#Summary of corpus stats
corpus_stats <- data.frame( "Files" = c("Blogs", "News", "Twitter"),
                            "Lines" = c(blog_lines, news_lines, twitter_lines),
                            "Longest_Line" = c(blogs_max_chars, news_max_chars, twitter_max_chars),
                            "Words" = c(blogs_words, news_words, twitter_words),
                            "File_Size_Mb" = c(blogs_size, news_size, twitter_size))
############################ Exploratory Analysis ################################################

saveRDS(corpus_stats, "data/corpus_stats.Rda")
##################################################################################################

#Search for specific word ratios
twitter_num_love <- sum(grepl("love", twitter_data))
twitter_num_hate <- sum(grepl("hate", twitter_data))
twitter_love_hate_ratio <- twitter_num_love / twitter_num_hate

#Search for specific words/phrases in the tweets
bs_line <- twitter_data[grepl("biostats", twitter_data)]
chess_line_cnt <- sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing", twitter_data, ignore.case = T))

####################### Analyze Blogs Data
blogs_data_sample <- blogs_data[sample(1:blogs_lines, blogs_lines*sample_pct)]
blogs_cp <- Corpus(VectorSource(list(blogs_data_sample)))

#Clean up corpus
blogs_cp <- tm_map(blogs_cp, removeNumbers)
blogs_cp <- tm_map(blogs_cp, removePunctuation)
blogs_cp <- tm_map(blogs_cp, stripWhitespace)

#Create doc term matrix
blogs_dtm <- DocumentTermMatrix(blogs_cp, control = list(stopwords = TRUE))

#Find frequent words
blogs_dtm_mtrx <- as.matrix(blogs_dtm)
blogs_frequency <- colSums(blogs_dtm_mtrx)
blogs_frequency <- sort(blogs_frequency, decreasing = TRUE)
saveRDS(blogs_frequency, "data/blogs_frequency.Rda")

######################## Analyze News Data
news_data_sample <- news_data[sample(1:news_lines, news_lines*sample_pct)]
news_cp <- Corpus(VectorSource(list(news_data_sample)))

#Clean up corpus
news_cp <- tm_map(news_cp, removeNumbers)
news_cp <- tm_map(news_cp, removePunctuation)
news_cp <- tm_map(news_cp, removeWords, stopwords('english'))
news_cp <- tm_map(news_cp, stripWhitespace)

#Create doc term matrix
news_dtm <- DocumentTermMatrix(news_cp)

#Find frequent words
news_dtm_mtrx <- as.matrix(news_dtm)
news_frequency <- colSums(news_dtm_mtrx)
news_frequency <- sort(news_frequency, decreasing = TRUE)
saveRDS(news_frequency, "data/news_frequency.Rda")

######################## Analyze Twitter Data
twitter_data_sample <- twitter_data[sample(1:twitter_lines, twitter_lines*sample_pct)]
twitter_cp <- Corpus(VectorSource(list(twitter_data_sample)))

#Clean up corpus
twitter_cp <- tm_map(twitter_cp, removeNumbers)
twitter_cp <- tm_map(twitter_cp, removePunctuation)
twitter_cp <- tm_map(twitter_cp, removeWords, stopwords('english'))
twitter_cp <- tm_map(twitter_cp, stripWhitespace)

#Create doc term matrix
twitter_dtm <- DocumentTermMatrix(twitter_cp)

#Find frequent words
twitter_dtm_mtrx <- as.matrix(twitter_dtm)
twitter_frequency <- colSums(twitter_dtm_mtrx)
twitter_frequency <- sort(twitter_frequency, decreasing = TRUE)
saveRDS(twitter_frequency, "data/twitter_frequency.Rda")


######################## Analyze Full Data
#Create smaller samples for further processing of combined corpus
sample_pct <- .1
blogs_data_sample <- blogs_data[sample(1:blogs_lines, blogs_lines*sample_pct)]
news_data_sample <- news_data[sample(1:news_lines, news_lines*sample_pct)]
twitter_data_sample <- twitter_data[sample(1:twitter_lines, twitter_lines*sample_pct)]
sample_data <- list(blogs_data_sample, news_data_sample, twitter_data_sample)

#Create Corpus
cp <- Corpus(VectorSource(sample_data))

#Clean up corpus
cp <- tm_map(cp, removeNumbers)
cp <- tm_map(cp, removePunctuation)
cp <- tm_map(cp, stripWhitespace)

#Create doc term matrix
dtm <- DocumentTermMatrix(cp, control = list(stopwords = TRUE))

#Find frequent words
dtm_mtrx <- as.matrix(dtm)
frequency <- colSums(dtm_mtrx)
frequency <- sort(frequency, decreasing = TRUE)
wordcloud(names(frequency), frequency, min.freq = 25, random.order = FALSE, colors = brewer.pal(8, "Spectral"))

saveRDS(frequency, "data/sample_frequency.Rda")

#N-gram tokenizer; ngrams() and words() come from the NLP package (loaded with tm)

ngramTokenizer <- function(l) {
    function(x) unlist(lapply(ngrams(words(x), l), paste, collapse = " "), use.names = FALSE)
}

#Generate an n-gram frequency data set; each term is split into one word per column
generateNgramData <- function(n) {
    ng_tdm <- TermDocumentMatrix(cp, control = list(tokenize = ngramTokenizer(n)))
    ng_matrix <- as.matrix(ng_tdm)
    ng_matrix <- rowSums(ng_matrix)
    ng_matrix <- sort(ng_matrix, decreasing = TRUE)
    final_ngram <- data.frame(terms = names(ng_matrix), freq = ng_matrix)
    
    if(n == 2) columns <- c('one', 'two')
    if(n == 3) columns <- c('one', 'two', 'three')
    if(n == 4) columns <- c('one', 'two', 'three', 'four')
    
    final_ngram <- transform(final_ngram, terms = colsplit(terms, split = " ", names = columns ))
    final_ngram
}
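#Note: despite the names, generateNgramData(2) returns 2-gram (word pair) counts,
#generateNgramData(3) returns 3-gram counts, and generateNgramData(4) returns 4-gram
#counts; presumably each table pairs the preceding word(s) with the word to predict.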
final_unigram <- generateNgramData(2)
final_bigram <- generateNgramData(3)
final_trigram <- generateNgramData(4)

#Save final output for fast performance of the Shiny app
saveRDS(final_unigram, file = "data/final_unigram.Rda")
saveRDS(final_bigram, file = "data/final_bigram.Rda")
saveRDS(final_trigram, file = "data/final_trigram.Rda")