Introduction and Problem/Goal

This report presents preliminary results and analysis from my work toward a natural language processing (NLP) system for word prediction. The system is based on word patterns derived from a sample of text data drawn from blogs, news articles, and Twitter feeds. The report covers my initial exploration of the data and a discussion of next steps.

In this final capstone project, the goal is to develop a system that predicts the next word a user will enter, given a prior sequence of one or more words (an nGram). The prediction is based on a statistical analysis of N-word sequences drawn from a large body of real-world natural text: given the prior N-word sequence, the system primarily returns the most frequently occurring (N+1)th word. The maximum value of N will be limited by memory and processing constraints, and accuracy gains diminish for longer nGrams.
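
As a rough illustration of this lookup idea (a sketch only, not the final implementation), the code below assumes a hypothetical frequency table with prefix, next_word, and freq columns; the actual nGram tables are built later in this report.

# Minimal sketch of frequency-based next-word lookup. The data frame layout
# (prefix, next_word, freq) is assumed for illustration only.
predict_next_word <- function(ngram_freq, prefix) {
  matches <- ngram_freq[ngram_freq$prefix == prefix, ]
  if (nrow(matches) == 0) return(NA_character_)   # unseen prefix: no prediction yet
  as.character(matches$next_word[which.max(matches$freq)])
}

# Toy example with made-up counts (not from the corpus):
toy <- data.frame(prefix    = c("i think", "i think", "i feel"),
                  next_word = c("i", "that", "like"),
                  freq      = c(3, 2, 1),
                  stringsAsFactors = FALSE)
predict_next_word(toy, "i think")   # returns "i"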

Data Gathering and Libraries

# First load the necessary libraries
library(tm)
library(ggplot2)
library(dplyr)
library(RWeka)
library(stringi)
library(profr)   # for profiling.
library(parallel)
set.seed(2972)

#  ------------ PREPROCESSING ---------------

# 1. Open connections to the data files. A 2% sample of each is taken later,
#    rather than processing the entire files.

con_blog <- file("en_US.blogs.txt", "r")
con_news <- file("en_US.news.txt", "r")
con_twitter <- file("en_US.twitter.txt", "r")

con_badwords <- file("bad-words.txt", "r")   # Bad Word List


# read in the lines. Not using a data frame since this is text mining.
lines_blog <- readLines(con_blog,skipNul = TRUE)
lines_news <- readLines(con_news, skipNul = TRUE) # skipNul skips embedded nuls; the incomplete-final-line warning below is harmless.
## Warning in readLines(con_news, skipNul = TRUE): incomplete final line found
## on 'en_US.news.txt'
lines_twitter <- readLines(con_twitter,skipNul = TRUE)
bad_words <- readLines(con_badwords, skipNul = TRUE) 


# Use a sample instead of the entire file.
blog_samp <- sample(lines_blog, round(length(lines_blog) * 0.02))       # Take a 2% sample.
news_samp <- sample(lines_news, round(length(lines_news) * 0.02))
twitter_samp <- sample(lines_twitter, round(length(lines_twitter) * 0.02))

# Close the handles
close(con_blog)
close(con_news)
close(con_twitter)
close(con_badwords)

In the next step, I build the corpus and cleanse the data.

Build the Corpus and Cleanse the Data

# Combine the three samples into a single data set.
data <- c(blog_samp, news_samp, twitter_samp)

# Summarize the data.

# First, file size.
fsize_blogs <- file.info("en_US.blogs.txt")$size / 1024 ^ 2
fsize_news <- file.info("en_US.news.txt")$size / 1024 ^ 2
fsize_twitter <- file.info("en_US.twitter.txt")$size / 1024 ^2

n_words_blogs <- stri_count_words(lines_blog)
n_words_news <- stri_count_words(lines_news)
n_words_twitter <- stri_count_words(lines_twitter)

data.frame(source = c("blogs", "news", "twitter"),
           filesize = c(fsize_blogs, fsize_news, fsize_twitter),
           nlines = c(length(lines_blog), length(lines_news), length(lines_twitter)),
           nwords = c(sum(n_words_blogs), sum(n_words_news), sum(n_words_twitter)),
           mean.num.words = c(mean(n_words_blogs), mean(n_words_news), mean(n_words_twitter)))
##    source filesize  nlines   nwords mean.num.words
## 1   blogs 200.4242  899288 38154238       42.42716
## 2    news 196.2775   77259  2693898       34.86840
## 3 twitter 159.3641 2360148 30218166       12.80350
# Next, create the corpus and cleanse the data in it.
corp <- VCorpus(VectorSource(data))
xform_spaces <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, xform_spaces, "(f|ht)tp(s?)://(.*)[.][a-z]+")
corp <- tm_map(corp, xform_spaces, "@[^\\s]+")
corp <- tm_map(corp, removeWords, stopwords("en"))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, PlainTextDocument)

corp <- tm_map(corp, content_transformer(tolower))

corp <- tm_map(corp, removeWords, bad_words)  # remove profanity using the bad-word list

Tokenize the data.

In this section, I tokenize the cleansed corpus into nGrams. I define tokenizer functions for unigrams through quadgrams (punctuation and numbers were already stripped during cleansing), along with a helper function that takes a term-document matrix and returns a data frame of terms sorted by frequency.

# Helper: convert a term-document matrix into a data frame of terms sorted by frequency.
freq <- function(t) {
  f <- sort(rowSums(as.matrix(t)), decreasing = TRUE)
  return(data.frame(word = names(f), freq = f))
}

The next few lines define the nGram tokenizer functions; the actual tokenization of the corpus happens when the term-document matrices are built below.

uni_gram <- function(z) NGramTokenizer( z, Weka_control(min = 1, max = 1))
bi_gram <- function(z) NGramTokenizer( z, Weka_control(min = 2, max = 2))
tri_gram <- function(z) NGramTokenizer( z, Weka_control(min = 3, max = 3))
four_gram <- function(z) NGramTokenizer( z, Weka_control(min = 4, max = 4))

In this next section, I determine the frequencies of the unigrams, bigrams, trigrams, and quadgrams.

uni <- freq(removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = uni_gram)), 0.9999))
bi <- freq(removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = bi_gram)), 0.9999))
tri <- freq(removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = tri_gram)), 0.9999))
four <- freq(removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = four_gram)), 0.9999))
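
For the eventual prediction step, each nGram string will need to be split into its leading words (the lookup key) and its final word (the candidate prediction). A minimal sketch of that split is below; the prefix and next_word column names are my own illustrative additions, not part of the tables built above.

# Sketch: split an nGram frequency table (columns word, freq) into a lookup
# prefix (all words but the last) and the final word to be predicted.
split_ngram <- function(ngram_df) {
  parts <- strsplit(as.character(ngram_df$word), " ", fixed = TRUE)
  ngram_df$prefix    <- vapply(parts, function(w) paste(head(w, -1), collapse = " "), character(1))
  ngram_df$next_word <- vapply(parts, function(w) tail(w, 1), character(1))
  ngram_df
}

# e.g., tri_split <- split_ngram(tri)   # trigram table built above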

Plot

In this section, I develop several plots of the data, examining the 30 most frequent nGrams of each length, starting with unigrams (single words), then moving to bigrams, trigrams, and quadgrams.

gimme_Plot <- function(data, label) {
  ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
    labs(x = label, y = "Count/Frequency") +
    theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +
    geom_bar(stat = "identity", fill = I("purple")) 
}

# Unigram.

gimme_Plot(uni, "Unigrams")

gimme_Plot(bi, "Bigrams")

gimme_Plot(tri, "Trigrams")

gimme_Plot(four, "Quadgrams")
## Warning: Removed 20 rows containing missing values (position_stack).
(The warning above simply indicates that only 10 distinct quadgrams remain in the frequency table, so 20 of the 30 requested bars are empty.)

Frequency Summary for nGrams

The top word is: the. Its frequency was 5921. (Although English stop words were removed, capitalized occurrences such as "The" survived that step because lowercasing was applied after stop-word removal.)

For bigrams, the top bigram is: i think. Its frequency was 973.

For trigrams, the top trigram is: i think i. Its frequency was 134.

For quadgrams, the top quadgram is: i feel like i. Its frequency was 22.
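
Tying this back to the prediction goal: because "i think i" is the most frequent trigram overall, it is also the most frequent trigram beginning with "i think", so a frequency lookup on that prefix would return "i". A hedged sketch of such a lookup against the trigram table built earlier (the helper name is my own):

# Sketch: most frequent continuation of a two-word prefix in the trigram table.
top_continuation <- function(tri_df, prefix) {
  hits <- tri_df[startsWith(as.character(tri_df$word), paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  best <- as.character(hits$word[which.max(hits$freq)])
  tail(strsplit(best, " ")[[1]], 1)     # last word of the winning trigram
}

# top_continuation(tri, "i think")   # expected to return "i" given the counts above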

Conclusion and Next Steps

The most common words or phrases (nGrams) are: "the" (unigram, 5921 occurrences in the sample), "i think" (bigram, 973), "i think i" (trigram, 134), and "i feel like i" (quadgram, 22).

The next steps to consider include: building the prediction model that, given a prior word sequence, returns the most frequent next word from these nGram frequency tables; deciding how large N can be within the memory and processing limits noted earlier; and revisiting the 2% sample size to see whether a larger sample improves the frequency estimates without exceeding those limits.

This concludes the DS Capstone Milestone report.

Dr. Rich Huebner