The objective of this phase of the project is to understand, through an exploratory analysis, the distribution of and relationships between the words, tokens, and phrases in the provided text. It also aims to characterize the variation in, and frequencies of, words and word pairs in the data.
Ultimately, the project will develop a predictive model for text. The model will be trained on a collection of English text (a corpus) compiled from three sources: news, blogs, and tweets.
The first step I performed was to read the data and get basic stats on each of the files. Below is a table of the number of lines, characters, words, and minimum, mean, and maximum words per line.
## Set working directory
setwd("~/R/Coursera/Data Science/Capstone")
## Load needed packages
library(stringi)
## Read blogs
blogs <- readLines("Data/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
## Read news
news <- readLines("Data/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
## Read twitter
twitter <- readLines("Data/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
## Calculate words per line for all 3 files
words_per_line <- sapply(list(blogs,news,twitter), function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(words_per_line) <- c('WPL_Min','WPL_Mean','WPL_Max')
## Create dataset with the basic file stats
stats <- data.frame(FileName=c("USblogs","USnews","UStwitter"),
t(rbind(sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
words_per_line))
)
## Display basic file statistics
head(stats)
## FileName Lines Chars Words WPL_Min WPL_Mean WPL_Max
## 1 USblogs 899288 206824382 37570839 0 41.75108 6726
## 2 USnews 77259 15639408 2651432 1 34.61779 1123
## 3 UStwitter 2360148 162096031 30451128 1 12.75063 47
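To make the words-per-line columns concrete, here is a minimal base-R sketch on three made-up lines. It splits on whitespace only, so it is an approximation: `stri_count_words` applies ICU word-boundary rules and can give different counts when punctuation or hyphens are involved. The helper name `count_words` and the sample lines are my own and are not part of the analysis above.

```r
# Approximate words per line by splitting on runs of whitespace.
# Assumption: whitespace splitting stands in for stri_count_words here.
count_words <- function(lines) {
  lengths(strsplit(trimws(lines), "[[:space:]]+"))
}

toy_lines <- c("a short line", "", "one two  three four")  # made-up input
wpl <- count_words(toy_lines)
wpl
# 3 0 4

# Same three summaries as the WPL_* columns in the table
c(WPL_Min = min(wpl), WPL_Mean = mean(wpl), WPL_Max = max(wpl))
```

An empty line counts as zero words, which is one way a minimum of 0 can appear in such a table.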
To prepare the data for modeling, the raw files must first be cleaned. Several kinds of characters are unwanted in the data: non-conforming (non-ASCII) characters, punctuation, numbers, and extra white space. The non-conforming characters are removed with the iconv function, and the punctuation, numbers, and white space are cleaned up with the tm_map function of the tm package. The text is also converted to lowercase.
Additionally, to speed up processing, a 5% sample of each source file is taken before the samples are combined into a single corpus.
## Load needed packages
library(tm)
## Remove non-conforming characters
blogs <- iconv(blogs, "UTF-8", "ASCII", sub="")
news <- iconv(news, "UTF-8", "ASCII", sub="")
twitter <- iconv(twitter, "UTF-8", "ASCII", sub="")
## Sample data to speed up processing
set.seed(122669)
sample_blogs <- blogs[sample(1:length(blogs), 0.05*length(blogs), replace=FALSE)]
sample_news <- news[sample(1:length(news), 0.05*length(news), replace=FALSE)]
sample_twitter <- twitter[sample(1:length(twitter), 0.05*length(twitter), replace=FALSE)]
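## Combine 3 files into 1
## VCorpus is used so that the custom tokenizers further down are honored; in recent
## tm versions Corpus() may return a SimpleCorpus, which ignores custom tokenizers
corpus <- VCorpus(VectorSource(c(sample_blogs, sample_news, sample_twitter)), readerControl = list(reader=readPlain,language="en"))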
## Remove punctuation, numbers, white space, convert to lowercase
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
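To show what these cleaning steps do to a single line, the sketch below applies roughly equivalent base-R substitutions to a made-up string. It is an approximation for illustration, not the tm implementation; the function name `clean_text` and the sample input are my own.

```r
# Base-R approximation of the cleaning pipeline (assumption: these regular
# expressions mirror removePunctuation, removeNumbers and stripWhitespace).
clean_text <- function(x) {
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")  # drop non-ASCII characters
  x <- gsub("[[:punct:]]+", "", x)                        # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)                        # remove digits
  x <- gsub("[[:space:]]+", " ", x)                       # collapse runs of white space
  tolower(x)                                              # convert to lowercase
}

clean_text("Hello,   World! 2 cats.")
# "hello world cats"
```

One side effect worth keeping in mind: stripping punctuation fuses contractions, so "it's" becomes "its", which may matter later when predicting words.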
To continue the exploratory analysis and prepare for modeling, the next steps are tokenizing the corpus and creating N-grams. Tokenization is the process of breaking up the corpus into words or other meaningful elements. N-grams are sequences of co-occurring words, essentially bits of phrases that appear together, and they are helpful in predicting the next word. I take a look at the most commonly occurring unigrams (1 word), bigrams (2 words), trigrams (3 words), and quadgrams (4 words).
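As a small illustration of what an N-gram is, the following base-R sketch slides a window of n words over a made-up sentence. It is only meant to show the idea; the analysis itself uses the RWeka tokenizers, and the helper name `make_ngrams` is hypothetical.

```r
# Build all n-grams of order n from a character vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))  # not enough words for one n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_tokens <- strsplit("the cat sat on the mat", " ", fixed = TRUE)[[1]]
make_ngrams(toy_tokens, 2)
# "the cat" "cat sat" "sat on" "on the" "the mat"
make_ngrams(toy_tokens, 3)
# "the cat sat" "cat sat on" "sat on the" "on the mat"
```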
## Load needed packages
library(RWeka)
library(tm)
## Create tokenizers
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
## Create n-grams
Unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = UnigramTokenizer))
Bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
Trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
Quadgrams <- TermDocumentMatrix(corpus, control = list(tokenize = QuadgramTokenizer))
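The plotting code refers to data frames named UnigramsDenseSorted, BigramsDenseSorted, TrigramsDenseSorted, and QuadgramsDenseSorted, each with `word` and `freq` columns, but the step that builds them is not shown in this report. The sketch below is one plausible way to produce such a table from a named vector of counts. The helper name `sorted_freq_table` and the toy counts are my own, and the use of `slam::row_sums` to total a term-document matrix is an assumption about how the original tables were derived.

```r
# Turn a named vector of counts into a data frame sorted by decreasing frequency.
sorted_freq_table <- function(counts) {
  df <- data.frame(word = names(counts),
                   freq = as.numeric(counts),
                   stringsAsFactors = FALSE)
  df <- df[order(-df$freq), ]  # most frequent terms first
  rownames(df) <- NULL
  df
}

toy_counts <- c(the = 5, cat = 2, of = 3)  # made-up counts for illustration
sorted_freq_table(toy_counts)

# Hypothetical use with the matrices above (assumes the slam package):
# UnigramsDenseSorted <- sorted_freq_table(slam::row_sums(Unigrams))
```

Summing rows on the sparse matrix, rather than converting it with as.matrix first, should keep memory use manageable for the larger N-gram orders.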
Now that the N-grams have been created, let’s take a look at them in some plots.
## Load needed packages
library(ggplot2)
## Plot unigrams
uni <- ggplot(data = UnigramsDenseSorted[1:40,], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat="identity")
uni <- uni + labs(x = "N-gram", y = "Frequency", title = "Frequencies of the 40 Most Frequent Unigrams")
uni <- uni + theme(axis.text.x=element_text(angle=90))
uni
## Plot bigrams
bi <- ggplot(data = BigramsDenseSorted[1:40,], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat="identity")
bi <- bi + labs(x = "N-gram", y = "Frequency", title = "Frequencies of the 40 Most Frequent Bigrams")
bi <- bi + theme(axis.text.x=element_text(angle=90))
bi
## Plot trigrams
tri <- ggplot(data = TrigramsDenseSorted[1:40,], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat="identity")
tri <- tri + labs(x = "N-gram", y = "Frequency", title = "Frequencies of the 40 Most Frequent Trigrams")
tri <- tri + theme(axis.text.x=element_text(angle=90))
tri
## Plot quadgrams
quad <- ggplot(data = QuadgramsDenseSorted[1:40,], aes(x = reorder(word, -freq), y = freq)) + geom_bar(stat="identity")
quad <- quad + labs(x = "N-gram", y = "Frequency", title = "Frequencies of the 40 Most Frequent Quadgrams")
quad <- quad + theme(axis.text.x=element_text(angle=90))
quad
The analysis above is a brief introduction to the data. The next step is to use the cleaned data to start working out a prediction model. A Markov-chain model seems the most likely candidate for text prediction. It is based on the idea that the next word the user will type can be predicted from the current word or words. If the prediction is based solely on the previous word, it is a first-order Markov model and uses bigrams from the data. If the prediction is based on the previous two words, it is a second-order Markov model and uses trigrams. The N-grams created for this analysis are therefore a good starting point. I did have some difficulty creating the N-grams and had to decrease the sample size, so this is something that I'll have to investigate further.
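As a rough illustration of the first-order case, the base-R sketch below counts bigrams in a made-up sentence and predicts the most frequent follower of a given word. It is a toy under stated assumptions, not the model I will eventually build: the function names `train_bigram_model` and `predict_next` are hypothetical, and there is no smoothing or back-off for unseen words.

```r
# Count how often each word is followed by each other word.
train_bigram_model <- function(tokens) {
  prev <- head(tokens, -1)  # every word except the last
  nxt  <- tail(tokens, -1)  # the word following each of those
  table(prev, nxt)          # rows: current word, columns: following word
}

# Return the most frequent follower of `word`, or NA if the word was never seen.
predict_next <- function(model, word) {
  if (!word %in% rownames(model)) return(NA_character_)
  counts <- model[word, ]
  names(counts)[which.max(counts)]
}

toy_tokens <- strsplit("the cat sat on the mat and the cat slept", " ", fixed = TRUE)[[1]]
model <- train_bigram_model(toy_tokens)
predict_next(model, "the")  # "cat"
predict_next(model, "dog")  # NA
```

A second-order version would key the table on the previous two words instead of one, which is where the trigram counts come in.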