The Coursera capstone project will use Natural Language Processing (NLP) to create a model for predicting the next word or words as someone is typing. This report focuses on cleaning and exploring three types of social media text files: blogs, news articles, and Twitter feeds. This is an early stage in creating the predictive model, so the analysis will prepare the text files as corpora, remove offensive words, and then combine the words into 3-word n-grams (trigrams). Methods of analysis include word frequencies by media type.
The quanteda library was selected for processing the text files and analyzing the text into word frequencies and n-grams.
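For reference, the code in this report assumes the following libraries are loaded (a sketch; the exact set used in the original report may differ):

library(readtext)   # importing the raw text files
library(quanteda)   # tokens, document-feature matrices, n-grams
library(ggplot2)    # frequency bar charts
library(tibble)     # enframe() for converting named vectors to data frames
library(gridExtra)  # grid.arrange() for side-by-side plots
# Note: in quanteda version 3 and later, textstat_frequency() and
# textplot_wordcloud() live in the companion packages quanteda.textstats
# and quanteda.textplots.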
Three text files were provided by Coursera for the capstone project and have been split into sample files using operating system tools that randomly select records, so the data can be processed and trained on efficiently. To make the results reproducible, the sample files will be packaged with the code so the data remains the same during analysis. See the appendix for how the files were generated.
The sample files were generated by randomly selecting 5% of each original file.
Below is a summary of the original and sample media files. See the appendix for how the sizes and line counts were determined.
| Media Type | File Size (MB) | # of Lines | Sample Size (MB) | Sample # of Lines |
|---|---|---|---|---|
| Blogs | 210.16 | 899,288 | 10.58 | 44,964 |
| News | 205.81 | 77,259 | 10.29 | 77,259 |
| Twitter | 167.11 | 2,360,148 | 6.39 | 50,512 |
The readtext library was selected for importing the three media text files because of its ability to easily import different file formats as well as read the document-level metadata associated with the texts.
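The object mediaSamples is not defined in this report; a minimal sketch of one way to define it, assuming the sample files are laid out as in the appendix:

# Assumption: the three sample files live under ../final/en_US/ as in the appendix
mediaSamples <- "../final/en_US/*.sample.txt"

readtext() accepts wildcard patterns, so this single pattern pulls in all three sample files.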
media <- readtext(mediaSamples, cache = FALSE)
Each media text file will be stored as word tokens, which allows for easier transformation of the data, frequency counts, and creation of text corpora or document matrices used for word analysis and natural language processing.
Below, the media files are first converted to lower-case word tokens and then transformed to remove inappropriate words, punctuation, numbers, separators, symbols, and URLs. Note that lower-casing improves word frequency counts. Additional cleansing removes stop words and any words shorter than three characters. The raw token dataset keeps all words except inappropriate words, so it can be used to compare the clean dataset against the raw dataset. I anticipate the raw dataset will be needed in order to predict the next possible word or words as someone is typing.
# Lower-cased raw tokens, removing only the inappropriate words
# (cursewords.list is a character vector of words to exclude, defined elsewhere)
rawTokens <- tokens_tolower(quanteda::tokens(media$text))
rawTokens <- tokens_select(rawTokens, cursewords.list, selection = "remove", case_insensitive = TRUE)
# Clean tokens: drop numbers, punctuation, separators, symbols, URLs and Twitter characters
mediaTokens <- quanteda::tokens(rawTokens,
    remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE,
    remove_symbols = TRUE, remove_url = TRUE, remove_twitter = TRUE)
# Remove English stop words and any words shorter than three characters
mediaTokens <- tokens_select(mediaTokens, stopwords("english"), selection = "remove", case_insensitive = TRUE)
mediaTokens <- tokens_select(mediaTokens, selection = "keep", min_nchar = 3, case_insensitive = TRUE)
Below is a summary of the word tokens in both datasets (a sketch of how the counts can be computed follows the table).
Word Counts:
| Media Type | Raw Word Count | Word Count (after cleaning) |
|---|---|---|
| Blogs | 2,230,205 | 949,719 |
| News | 2,021,414 | 965,683 |
| Twitter | 1,406,283 | 609,702 |
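A minimal sketch of how these counts can be computed, using quanteda's ntoken() on the two token sets:

# Word counts per media document, before and after cleaning
ntoken(rawTokens)
ntoken(mediaTokens)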
According to Wikipedia, a term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix (DTM), rows correspond to documents in the collection and columns correspond to terms. The cleaned media tokens will be stored in a document-feature matrix (DFM) from the quanteda library, which makes transforming the data and processing the words easier.
Stemming removes suffixes to reduce inflectional forms and derivationally related forms of a word to a common base form. For example, toy, toys, and toy's all reduce to the root word toy.
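As a small illustration (separate from the analysis itself), quanteda exposes the stemmer directly:

library(quanteda)
# Each word is reduced to its stem, so inflected forms collapse to a common base
char_wordstem(c("toy", "toys", "working", "works"))

The per-media document-feature matrices below apply the same stemming when the DFM is created.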
# Build a stemmed document-feature matrix (DFM) for each media type
blogs <- mediaTokens$blogs
blogs.dfm <- quanteda::dfm(blogs, stem = TRUE)
news <- mediaTokens$news
news.dfm <- quanteda::dfm(news, stem = TRUE)
twitter <- mediaTokens$twitter
twitter.dfm <- quanteda::dfm(twitter, stem = TRUE)
The plot below shows the top 20 words found in each of the media datasets.
# Top 20 stemmed words per media type, converted to data frames and plotted as horizontal bar charts
blogs.top <- quanteda::topfeatures(blogs.dfm, n=20, scheme = "count")
blogs.df <- enframe(blogs.top)
names(blogs.df) <- c("words","frequency")
blogs.df$words <- reorder(blogs.df$words, blogs.df$frequency)
g.blogs.top <- ggplot(blogs.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Blog Words")
news.top <- quanteda::topfeatures(news.dfm, n=20, scheme = "count")
news.df <- enframe(news.top)
names(news.df) <- c("words","frequency")
news.df$words <- reorder(news.df$words, news.df$frequency)
g.news.top <- ggplot(news.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "News Words")
twitter.top <- quanteda::topfeatures(twitter.dfm, n=20, scheme = "count")
twitter.df <- enframe(twitter.top)
names(twitter.df) <- c("words","frequency")
twitter.df$words <- reorder(twitter.df$words, twitter.df$frequency)
g.twitter.top <- ggplot(twitter.df, aes(x = words, y = frequency)) +
geom_bar(stat = "identity") + coord_flip() +
labs(title = "Twitter Words")
grid.arrange(g.blogs.top, g.news.top, g.twitter.top, nrow=1)
For visualization, a word cloud is used to plot the most frequent words (features) in the combined media dataset, with each word's size proportional to its frequency.
combined.dfm <- quanteda::dfm(mediaTokens, stem = TRUE)
quanteda::textplot_wordcloud(combined.dfm, min.freq = 60, random.order = FALSE,
    rot.per = .10, colors = RColorBrewer::brewer.pal(8, 'Dark2'),
    comparison = FALSE, max_words = 200)
textstat_frequency(combined.dfm, n=20)
## feature frequency rank docfreq group
## 1 said 14925 1 3 all
## 2 one 14160 2 3 all
## 3 just 13615 3 3 all
## 4 get 13287 4 3 all
## 5 like 13237 5 3 all
## 6 time 11804 6 3 all
## 7 can 11197 7 3 all
## 8 year 10000 8 3 all
## 9 day 9836 9 3 all
## 10 make 9380 10 3 all
## 11 new 8837 11 3 all
## 12 love 8583 12 3 all
## 13 know 8293 13 3 all
## 14 work 8111 14 3 all
## 15 good 8072 15 3 all
## 16 now 7982 16 3 all
## 17 say 7445 17 3 all
## 18 want 7326 18 3 all
## 19 peopl 7314 19 3 all
## 20 see 7056 20 3 all
The outcome of this project is to create a Shiny application that will predict the next word or set of words a user may type while composing a message or search term. Today, Google displays suggested words or completed phrases while you type an email message. How will the application perform this feat and make the right suggestions at least 80% of the time?
N-grams are sets of co-occurring words within a given window; when computing n-grams, you typically move forward one word at a time. N-grams are used to develop models for a variety of tasks such as spelling correction, word breaking, word prediction, and text summarization. We will explore different n-gram sizes such as unigrams (N=1), bigrams (N=2), trigrams (N=3), and n-grams greater than 3.
Example of n-grams for the sentence "Life is trying things to see if they work" (a code sketch follows the table):
| bigrams | trigrams | n-grams (N=4) |
|---|---|---|
| Life is | Life is trying | Life is trying things |
| is trying | is trying things | is trying things to |
| trying things | trying things to | trying things to see |
| things to | things to see | things to see if |
| to see | to see if | to see if they |
| see if | see if they | see if they work |
| if they | if they work | |
| they work | | |
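A minimal sketch of generating these n-grams with quanteda (the space concatenator is used here purely for readability; the analysis itself keeps the default "_"):

library(quanteda)
sentence <- tokens("Life is trying things to see if they work")
tokens_ngrams(sentence, n = 2, concatenator = " ")  # bigrams
tokens_ngrams(sentence, n = 3, concatenator = " ")  # trigrams
tokens_ngrams(sentence, n = 4, concatenator = " ")  # n-grams (N=4)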
This is where the fun happens: using different n-grams to predict the next word or set of words. At this point I am not sure whether a bigram, trigram, or larger n-gram is the best way to predict the next set of words. In this phase, analysis will have to be done across all potential n-gram sizes to determine which works best, or whether combining multiple n-grams produces the best outcome.
Relative frequency, which estimates the n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix (the preceding words), will be used to help predict the next word based on the different n-grams. But a word that is frequent in one media document may not be as frequent in another; for example, words used when tweeting may not be frequent or used at all in blogs. Therefore, the relative frequency will have to be based on n-grams per media type.
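A minimal sketch of this relative-frequency (maximum likelihood) estimate, built here from trigram and bigram counts using quanteda::tokens_ngrams on the cleaned tokens; this is illustrative only, not the final model:

library(quanteda)
# Count trigrams and bigrams in the cleaned tokens
tri.counts <- table(unlist(as.list(tokens_ngrams(mediaTokens, n = 3))))
bi.counts  <- table(unlist(as.list(tokens_ngrams(mediaTokens, n = 2))))
# For a trigram "w1_w2_w3" the prefix is the bigram "w1_w2";
# dividing the counts estimates P(w3 | w1, w2)
prefix   <- sub("_[^_]+$", "", names(tri.counts))
rel.freq <- setNames(as.numeric(tri.counts) / as.numeric(bi.counts[prefix]),
                     names(tri.counts))

To base this on media type, the same calculation would simply be repeated per media document rather than on the combined tokens.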
This is just a start at how to predict the next word. Further reading of other people's research will be used to help formulate the correct predictive model.
A Shiny application will be created which will allow a user to enter text while it predicts the next word. Below is a list of key features anticipated for the final product:
Because the text files were large and I was unsure of the processing power available to anyone pulling the code with the data, which could overwhelm their machine, I felt it was better to create sample files at the operating system level and provide them with the project. I may also create training files, but that will be determined at a later date.
Here is how the files were created, using Windows 10 and Unix/Linux-like commands to randomly generate sample files containing 5% of the data:
sampleSize=$(awk 'END{print int((NR==0)?0:(NR*0.05))}' en_US.twitter.txt)
shuf -n $sampleSize en_US.twitter.txt > somefile_
Below is the code used to generate the file sizes and number of lines. The file size was converted to MB for human readability.
# File sizes in MB
blogSize <- round(file.info(blogFile)["size"][,1] / 1000 / 1000, 2)
twitterSize <- round(file.info(twitterFile)["size"][,1] / 1000 / 1000, 2)
newsSize <- round(file.info(newsFile)["size"][,1] / 1000 / 1000, 2)
# Number of lines per original file
con <- file(blogFile, "r")
blogNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(twitterFile, "r")
twitterNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(newsFile, "r")
newsNbrOfLines <- NROW(readLines(con))
close(con)
# Samples
blogSampleFile <- "../final/en_US/blogs.sample.txt"
twitterSampleFile <- "../final/en_US/twitter.sample.txt"
newsSampleFile <- "../final/en_US/news.sample.txt"
blogSampleSize <- round(file.info(blogSampleFile)["size"][,1] / 1000 / 1000, 2)
twitterSampleSize <- round(file.info(twitterSampleFile)["size"][,1] / 1000 / 1000, 2)
newsSampleSize <- round(file.info(newsSampleFile)["size"][,1] / 1000 / 1000, 2)
# Number of lines per sample file
con <- file(blogSampleFile, "r")
blogSampleNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(twitterSampleFile, "r")
twitterSampleNbrOfLines <- NROW(readLines(con))
close(con)
con <- file(newsSampleFile, "r")
newsSampleNbrOfLines <- NROW(readLines(con))
close(con)
The quanteda library will be used to create the n-grams. Below is an example of generating 3-grams (trigrams) from the cleaned tokens.
# Create trigrams from the cleaned tokens
# (newer quanteda versions use tokens_ngrams(mediaTokens, n = 3) instead)
mediaNgram3 <- tokens(mediaTokens, ngrams = 3)
head(mediaNgram3$blogs, 10)
## [1] "use_make_hard-boiled" "make_hard-boiled_eggs"
## [3] "hard-boiled_eggs_think" "eggs_think_pretty"
## [5] "think_pretty_genius" "pretty_genius_hehe"
## [7] "genius_hehe_donâ" "hehe_donâ_kick"
## [9] "donâ_kick_can" "kick_can_blond"