This report summarizes the progress I have made on the Data Science Capstone so far, along with my plans for creating the final data product.
After downloading the text data from Corpora, we need to load it for analysis. The text mining package we will use is tm. For now we only intend to develop an algorithm for English. The corpus has three sources: blogs, news, and twitter. Each text file is about 200 MB, which takes too long to load and process. As a demonstration, I sampled 1% of the text for the following discussion.
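For reference, here is a minimal sketch of how such a 1% sample can be drawn; the raw file names and the sampled100 output layout are assumptions matching the directory listing below.
set.seed(123)  # reproducible sampling
infiles <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")  # assumed raw file names
outfiles <- file.path("sampled100", c("blogs100.txt", "news100.txt", "twitter100.txt"))
dir.create("sampled100", showWarnings = FALSE)
for (i in seq_along(infiles)) {
    lines <- readLines(infiles[i], encoding = "UTF-8", skipNul = TRUE)
    keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1  # keep each line with probability 1%
    writeLines(lines[keep], outfiles[i])
}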
library(tm)
library(SnowballC)
getSources()
## [1] "DataframeSource" "DirSource" "ReutersSource" "URISource"
## [5] "VectorSource"
getReaders()
## [1] "readDOC" "readPDF"
## [3] "readReut21578XML" "readReut21578XMLasPlain"
## [5] "readPlain" "readRCV1"
## [7] "readRCV1asPlain" "readTabular"
## [9] "readXML"
cname <- file.path(".", "sampled100")  # directory holding the 1% samples
dir(cname)
## [1] "blogs100.txt" "news100.txt" "twitter100.txt"
docs <- Corpus(DirSource(cname))
Before the corpus is ready for use, the raw text data needs to be preprocessed. As a preliminary step, only basic transformations are performed, such as removing numbers and converting the whole text to lower case. As a further step, we can develop customized text-cleaning commands, such as removing emoticons from the twitter text (see the sketch after the code below). The built-in transformations can be listed with getTransformations():
getTransformations()
## [1] "as.PlainTextDocument" "removeNumbers" "removePunctuation"
## [4] "removeWords" "stemDocument" "stripWhitespace"
docs <- tm_map(docs, tolower, mc.cores = 1)  # mc.cores = 1 avoids a known parallel-processing issue in tm_map
docs <- tm_map(docs, removeNumbers, mc.cores = 1)
docs <- tm_map(docs, stripWhitespace, mc.cores = 1)
docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores = 1)
docs <- tm_map(docs, removePunctuation, mc.cores = 1)
docs <- tm_map(docs, stemDocument, mc.cores = 1)  # reduce words to their stems (via SnowballC)
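As an example of the kind of customized cleaning command mentioned above, the following sketch strips non-ASCII characters, which removes most emoticons from the twitter text. It is not applied in the pipeline above, and with newer versions of tm the function would need to be wrapped in content_transformer().
# Sketch: drop non-ASCII characters (covers most emoticons in tweets)
removeNonASCII <- function(x) gsub("[^\x01-\x7F]", "", x)
# docs <- tm_map(docs, removeNonASCII, mc.cores = 1)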
Now we form the document-term matrix: a matrix with documents as rows, terms as columns, and word-frequency counts as cells. It gives us a rough picture of each document. For example, we can list the highest-frequency words by sorting the terms by frequency.
dtm <- DocumentTermMatrix(docs)
dim(dtm) #print out the dimension of dtm
## [1] 3 39999
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordDF <- data.frame(word=names(freq), freq = freq)
str(wordDF)
## 'data.frame': 39999 obs. of 2 variables:
## $ word: Factor w/ 39999 levels ""| __truncated__,"\u26bd\u26bd\u26bd\u26bd",..: 38713 30055 24733 13578 19793 18060 35309 5095 8369 39544 ...
## $ freq: num 3217 3172 3148 3070 3040 ...
library(ggplot2)
ggplot(subset(wordDF, freq > 1500), aes(word, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(wordcloud)
## Loading required package: Rcpp
## Loading required package: RColorBrewer
set.seed(123)
wordcloud(names(freq), freq, min.freq=1000, colors=brewer.pal(6, "Dark2"))
To be able to predict the next possible word, we need to split the text into n-grams. Here we only study the cases n = 2 and n = 3.
library(RWeka)
options(mc.cores = 1)  # force serial execution; RWeka tokenizers can fail under parallel tm_map
# Tokenizer that splits text into space-delimited 2-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " "))
tdmBitoken <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer, tolower = FALSE, removePunctuation = FALSE))
freqBitoken <- rowSums(as.matrix(tdmBitoken))
freqBitoken <- sort(freqBitoken[freqBitoken > 1], decreasing = TRUE)
bigram <- data.frame(token = names(freqBitoken),freq = freqBitoken)
ggplot(subset(bigram, freq > 100), aes(token, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Tokenizer that splits text into space-delimited 3-grams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " "))
tdmTritoken <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer, tolower = FALSE))
freqTritoken <- rowSums(as.matrix(tdmTritoken))
freqTritoken <- sort(freqTritoken[freqTritoken > 1], decreasing = TRUE)
trigram <- data.frame(token = names(freqTritoken), freq = freqTritoken)
ggplot(subset(trigram, freq > 10), aes(token, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
With the 2-grams and 3-grams, we can develop a simple ranking algorithm: given the words typed so far, sort the associated n-grams by frequency and suggest the most frequent continuations. However, there are still many problems that need to be addressed.
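As a first cut, that lookup could look like the sketch below. It assumes the bigram and trigram data frames built above (already sorted by decreasing frequency) and the space-separated token strings produced by NGramTokenizer; predictNext is a hypothetical helper, not part of the pipeline yet.
predictNext <- function(input, n = 3) {
    words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)  # last two typed words
    lookup <- function(df, prefix, pos) {
        toks <- as.character(df$token)                       # df is already sorted by freq
        hits <- toks[substr(toks, 1, nchar(prefix)) == prefix]
        head(sapply(strsplit(hits, " "), `[`, pos), n)       # extract the following word
    }
    if (length(words) == 2) {                                # try trigrams first
        out <- lookup(trigram, paste0(words[1], " ", words[2], " "), 3)
        if (length(out) > 0) return(out)
    }
    lookup(bigram, paste0(tail(words, 1), " "), 2)           # back off to bigrams
}
predictNext("thanks much")  # e.g. the 3 most frequent words following the typed text
In the final product the typed text would need the same preprocessing as the corpus (lower-casing, stemming, stopword handling) before the lookup, and a proper backoff or smoothing scheme should replace this simple frequency ranking.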