Introduction

This is the milestone report for Week 2 of the Data Science Specialization Capstone Project. The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand these basic relationships in the data and to prepare for building my first NLP model. The user will provide a word or a phrase, and the application will try to predict the next word. The model will be trained on a corpus (a collection of English text) compiled from three sources: news, blogs, and tweets.

In the following report, I load and clean the data, and use NLP (Natural Language Processing) packages in R (tm and RWeka) to tokenize n-grams as a first step toward building a predictive model.

## Preparing packages
knitr::opts_chunk$set(echo = TRUE)
set.seed(100)
library(knitr)
library(stringi)
library(tm)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
library(dplyr)
library(ggplot2)
library(SnowballC)

Load data

I have chosen to use only the English (en_US) dataset for this project. First, I assume that the data is located in ../final/en_US/. If the files are not yet available locally, they can be downloaded as sketched below.
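As a convenience, the following is a minimal sketch (not part of the original analysis) for obtaining the data; the URL is the commonly documented Coursera SwiftKey dataset location, and the destination paths are assumptions chosen to match the readLines calls below.

## Download and unpack the dataset if it is not already present locally (assumed URL and paths)
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "../Coursera-SwiftKey.zip"
if (!dir.exists("../final/en_US")) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file, exdir = "..")
}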

blogs<-readLines(con = "../final/en_US/en_US.blogs.txt",encoding = "UTF-8")
twitter<-readLines(con = "../final/en_US/en_US.twitter.txt",encoding = "UTF-8")
## Warning in readLines(con = "../final/en_US/en_US.twitter.txt", encoding =
## "UTF-8"): line 167155 appears to contain an embedded nul
## Warning in readLines(con = "../final/en_US/en_US.twitter.txt", encoding =
## "UTF-8"): line 268547 appears to contain an embedded nul
## Warning in readLines(con = "../final/en_US/en_US.twitter.txt", encoding =
## "UTF-8"): line 1274086 appears to contain an embedded nul
## Warning in readLines(con = "../final/en_US/en_US.twitter.txt", encoding =
## "UTF-8"): line 1759032 appears to contain an embedded nul
news<-readLines(con = "../final/en_US/en_US.news.txt",encoding = "UTF-8")

Data Statistics Summary

Next, I calculated summary statistics for the data: the number of lines, characters, and words in each of the three datasets (blogs, news, and twitter), as well as the number of words per line (min, mean, and max).

## Words per line (min, mean, max) for each dataset
WPL=sapply(list(blogs,news,twitter),function(x)summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(WPL)=c('WPL_Min','WPL_Mean','WPL_Max')
## Combine line counts, character counts, word counts, and words-per-line stats into one table
stats=data.frame(
  Dataset=c("blogs","news","twitter"),      
  t(rbind(
  sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
  Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
  WPL)
))
head(stats)
##   Dataset   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
## 1   blogs  899288 206824382 37570839       0    41.75    6726
## 2    news 1010242 203223154 34494539       1    34.41    1796
## 3 twitter 2360148 162096031 30451128       1    12.75      47

Data sampling and cleaning

The summary statistics show that the full datasets are too large to process efficiently, so I sampled 1% of each file. Before sampling, uncommon (non-ASCII) characters were removed from the blogs, news, and twitter data. The cleaned samples were then combined into one new dataset called sample_total.

## Clean the data
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")

## Sampling the data
sample_blogs <- sample(blogs,size =1/100*length(blogs)) 
sample_news<- sample(news,size =1/100*length(news)) 
sample_twitter <- sample(twitter,size =1/100*length(twitter)) 

## Combine all the subsample into one sample
sample_total <- c(sample_blogs,sample_news,sample_twitter)
summary(sample_total)
##    Length     Class      Mode 
##     42695 character character

Corpus Creation

Next, a corpus was built using the tm package, which is also used to clean the corpus before it is analyzed. The pre-processing steps applied to the corpus were:

  • Conversion to lowercase
  • Removal of punctuation and numbers
  • White space stripping
  • Removal of English stop words
  • Stemming
  • Plain text conversion

corpus <- VCorpus(VectorSource(sample_total))
corpus <- tm_map(corpus, tolower)                 # convert to lowercase (documents become plain character)
corpus <- tm_map(corpus, removePunctuation)       # remove punctuation
corpus <- tm_map(corpus, removeNumbers)           # remove numbers
corpus <- tm_map(corpus, stripWhitespace)         # collapse extra white space
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove English stop words
corpus <- tm_map(corpus, stemDocument)            # stem words to their roots
corpus <- tm_map(corpus, PlainTextDocument)       # re-wrap as PlainTextDocuments for the TDM step

N-Grams Analysis

The cleaned corpus needs to be converted into a format that is useful for Natural Language Processing (NLP) tasks. The format is based on Term Document Matrices (TDMs) of n-grams. An n-gram is a representation of text as an n-tuple of consecutive words. Examples of n-grams, illustrated in the sketch below, are:

  • Unigrams, based on one word
  • 2-grams (bigrams), based on a pair of words
  • 3-grams (trigrams), based on a 3-tuple of words
  • n-grams in general, based on an n-tuple of words, where n can be any number
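As a quick illustration (a minimal sketch using a made-up sentence, not drawn from the corpus), RWeka's NGramTokenizer can be applied directly to a character string:

## Toy sentence used only to illustrate n-gram tokenization
s <- "i love data science"
NGramTokenizer(s, Weka_control(min = 1, max = 1))  # unigrams, e.g. "i" "love" "data" "science"
NGramTokenizer(s, Weka_control(min = 2, max = 2))  # bigrams,  e.g. "i love" "love data" "data science"
NGramTokenizer(s, Weka_control(min = 3, max = 3))  # trigrams, e.g. "i love data" "love data science"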

The TDMs store the frequencies of the n-grams. To build the n-gram TDMs, I used tokenizers based on RWeka, an R interface to Weka, a collection of machine learning algorithms for data mining.

## RWeka tokenizers for uni-, bi-, and trigrams
unigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigrams <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tokenizer))
bigrams <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
trigrams <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))

unigrams
## <<TermDocumentMatrix (terms: 40708, documents: 42695)>>
## Non-/sparse entries: 502215/1737525845
## Sparsity           : 100%
## Maximal term length: 116
## Weighting          : term frequency (tf)
bigrams
## <<TermDocumentMatrix (terms: 403416, documents: 42695)>>
## Non-/sparse entries: 513650/17223332470
## Sparsity           : 100%
## Maximal term length: 122
## Weighting          : term frequency (tf)
trigrams
## <<TermDocumentMatrix (terms: 469517, documents: 42695)>>
## Non-/sparse entries: 475706/20045552609
## Sparsity           : 100%
## Maximal term length: 126
## Weighting          : term frequency (tf)

N-grams Frequency

With all the n-grams tokenized, the frequent terms of each n-gram type were identified using the findFreqTerms function from the tm package (terms occurring at least 50 times for unigrams and bigrams, and at least 8 times for trigrams). The frequency of each of these terms was then computed and stored in a separate data frame for plotting later.

## Keep only n-grams above a minimum frequency threshold
unigrams_freqTerm <- findFreqTerms(unigrams,lowfreq = 50)
bigrams_freqTerm <- findFreqTerms(bigrams,lowfreq=50)
trigrams_freqTerm <- findFreqTerms(trigrams,lowfreq=8)

## Unigram frequency data frame
unigrams_freq <- rowSums(as.matrix(unigrams[unigrams_freqTerm,]))
unigrams_freq <- data.frame(word=names(unigrams_freq), frequency=unigrams_freq)
head(unigrams_freq)
##            word frequency
## abil       abil       115
## abl         abl       293
## absolut absolut       116
## abus       abus        69
## accept   accept       149
## access   access       120
## Bigram frequency data frame
bigrams_freq <- rowSums(as.matrix(bigrams[bigrams_freqTerm,]))
bigrams_freq <- data.frame(word=names(bigrams_freq), frequency=bigrams_freq)
head(bigrams_freq)
##                word frequency
## can get     can get       124
## can help   can help        54
## can make   can make        66
## can see     can see        76
## cant wait cant wait       182
## come back come back        98
## Trigram frequency data frame
trigrams_freq <- rowSums(as.matrix(trigrams[trigrams_freqTerm,]))
trigrams_freq <- data.frame(word=names(trigrams_freq), frequency=trigrams_freq)
head(trigrams_freq)
##                                    word frequency
## blah blah blah           blah blah blah        11
## cant get enough         cant get enough         8
## cant wait get             cant wait get        12
## cant wait see             cant wait see        36
## chief financi offic chief financi offic         8
## cinco de mayo             cinco de mayo        25
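As a side note (a sketch only, not part of the original analysis): converting a TDM subset with as.matrix can be memory-hungry for larger samples. The slam package, which tm builds on, can compute the same row sums directly on the sparse representation:

## Same frequencies as rowSums(as.matrix(...)), computed without densifying the matrix
library(slam)
unigrams_freq_sparse <- slam::row_sums(unigrams[unigrams_freqTerm, ])
head(sort(unigrams_freq_sparse, decreasing = TRUE))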

Top n-grams Frequency Visualization

The frequency distribution of each n-gram category was visualized in a separate bar plot for ease of analysis. A word cloud was also created alongside each plot.

## Function for visualization of the top n-grams
plot_n_grams <- function(df_gram, title, num, barC) {
  # Keep the `num` most frequent terms, sorted by decreasing frequency
  df_sort <- df_gram[order(-df_gram$frequency), ][1:num, ]
  ggplot(data = df_sort, aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = barC, colour = "black") +
    coord_cartesian(xlim = c(0, num + 1)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90))
}

Unigram Plot and Wordcloud

plot_n_grams(unigrams_freq,"Top 30 Unigrams",30,"red")

wordcloud(unigrams_freq$word, unigrams_freq$frequency, scale = c(2,1), max.words=30, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))

Bigram Plot and Wordcloud

plot_n_grams(bigrams_freq,"Top 30 Bigrams",30,"green")

wordcloud(bigrams_freq$word, bigrams_freq$frequency, scale = c(2,1), max.words=30, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))

Trigram Plot and Wordcloud

plot_n_grams(trigrams_freq,"Top 30 Trigrams",30,"blue")

wordcloud(trigrams_freq$word, trigrams_freq$frequency, scale = c(1.5,1), max.words=30, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))

Findings

  • The longer the n-gram, the lower its frequency in the corpus.
  • The highest unigram frequency is 3329.
  • The highest bigram frequency is 253.
  • The highest trigram frequency is 44 (these maxima can be read directly from the frequency tables, as sketched below).
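For reference, a minimal sketch of how these maxima are obtained from the frequency data frames built above (the exact values depend on the random 1% sample):

## Highest observed frequency in each n-gram table
max(unigrams_freq$frequency)
max(bigrams_freq$frequency)
max(trigrams_freq$frequency)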

Future plans

This concludes the initial exploration of the datasets provided. A corpus of pre-processed terms has been created. The next step is to build a predictive algorithm that uses an n-gram model with a frequency lookup, similar to the analysis above. The algorithm will then be deployed in a Shiny app that suggests the most likely next word after a phrase is typed.
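As an illustration of the planned approach, here is a rough sketch of a naive frequency lookup (not the final model) that assumes the bigrams_freq and trigrams_freq data frames built above; in practice the input phrase would need the same cleaning and stemming as the corpus, and the final model will add smoothing and a fuller backoff strategy.

## Naive next-word lookup: given the last one or two words of a phrase, return the
## last word of the most frequent matching n-gram; back off from trigrams to bigrams.
predict_next_word <- function(phrase) {
  words <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams_freq[startsWith(as.character(trigrams_freq$word), paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$word[which.max(hits$frequency)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  prefix <- words[n]
  hits <- bigrams_freq[startsWith(as.character(bigrams_freq$word), paste0(prefix, " ")), ]
  if (nrow(hits) > 0) {
    best <- as.character(hits$word[which.max(hits$frequency)])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no match found in the sampled n-grams
}

predict_next_word("cant wait")  # e.g. "see", based on the trigram table above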