This report summarizes the progress I have made on the Data Science Capstone so far, along with my plans for creating the final data product.
After downloading the text data from Corpora, we need to load it for analysis. The text mining package we will use is tm. For now we only intend to develop an algorithm for English. The corpus has three sources: blogs, news, and twitter. Each text file is about 200 MB, which takes too long to load and process. As a demonstration, I sampled 1% of the text for the following discussion.
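For reference, here is a minimal sketch of how such a 1% sample can be drawn; the raw file names and the sampled100 output layout are assumptions matching the directory listing below.
set.seed(123)  # reproducible sampling
infiles <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")  # assumed raw file names
outfiles <- file.path("sampled100", c("blogs100.txt", "news100.txt", "twitter100.txt"))
dir.create("sampled100", showWarnings = FALSE)
for (i in seq_along(infiles)) {
    lines <- readLines(infiles[i], encoding = "UTF-8", skipNul = TRUE)
    keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1  # keep each line with probability 1%
    writeLines(lines[keep], outfiles[i])
}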
library(tm)
library(SnowballC)
getSources()
## [1] "DataframeSource" "DirSource" "ReutersSource" "URISource"
## [5] "VectorSource"
getReaders()
## [1] "readDOC" "readPDF"
## [3] "readReut21578XML" "readReut21578XMLasPlain"
## [5] "readPlain" "readRCV1"
## [7] "readRCV1asPlain" "readTabular"
## [9] "readXML"
cname <- file.path(".", "sampled100")  # directory holding the 1% samples
dir(cname)
## [1] "blogs100.txt" "news100.txt" "twitter100.txt"
docs <- Corpus(DirSource(cname))
Before the corpus is ready for use, the raw text data needs to be preprocessed. As a preliminary step, only basic transformations are performed, such as removing numbers and converting the whole text to lower case. As a further step, we can develop customized text-cleaning commands, such as removing emoticons from the twitter text (see the sketch after the code below). The built-in transformations can be listed with getTransformations():
getTransformations()
## [1] "as.PlainTextDocument" "removeNumbers" "removePunctuation"
## [4] "removeWords" "stemDocument" "stripWhitespace"
docs <- tm_map(docs, tolower, mc.cores = 1)  # mc.cores = 1 avoids a known parallel-processing issue in tm_map
docs <- tm_map(docs, removeNumbers, mc.cores = 1)
docs <- tm_map(docs, stripWhitespace, mc.cores = 1)
docs <- tm_map(docs, removeWords, stopwords("english"), mc.cores = 1)
docs <- tm_map(docs, removePunctuation, mc.cores = 1)
docs <- tm_map(docs, stemDocument, mc.cores = 1)  # reduce words to their stems (via SnowballC)
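As an example of the kind of customized cleaning command mentioned above, the following sketch strips non-ASCII characters, which removes most emoticons from the twitter text. It is not applied in the pipeline above, and with newer versions of tm the function would need to be wrapped in content_transformer().
# Sketch: drop non-ASCII characters (covers most emoticons in tweets)
removeNonASCII <- function(x) gsub("[^\x01-\x7F]", "", x)
# docs <- tm_map(docs, removeNonASCII, mc.cores = 1)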
Now we form the document-term matrix: a matrix with documents as rows, terms as columns, and word-frequency counts as cells. It gives us a rough picture of each document. For example, we can list the highest-frequency words by sorting the terms by frequency.
dtm <- DocumentTermMatrix(docs)
dim(dtm) #print out the dimension of dtm
## [1] 3 39999
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordDF <- data.frame(word=names(freq), freq = freq)
str(wordDF)
## 'data.frame': 39999 obs. of 2 variables:
## $ word: Factor w/ 39999 levels ""| __truncated__,"\u26bd\u26bd\u26bd\u26bd",..: 38713 30055 24733 13578 19793 18060 35309 5095 8369 39544 ...
## $ freq: num 3217 3172 3148 3070 3040 ...
library(ggplot2)
ggplot(subset(wordDF, freq > 1500), aes(word, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(wordcloud)
## Loading required package: Rcpp
## Loading required package: RColorBrewer
set.seed(123)
wordcloud(names(freq), freq, min.freq=1000, colors=brewer.pal(6, "Dark2"))
To be able to predict the next possible word, we need to split the text into n-grams. Here we only study the cases n = 2 and n = 3.
library(RWeka)
options(mc.cores = 1)  # force serial execution; RWeka tokenizers can fail under parallel tm_map
# Tokenizer that splits text into space-delimited 2-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " "))
tdmBitoken <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer, tolower = FALSE, removePunctuation = FALSE))
freqBitoken <- rowSums(as.matrix(tdmBitoken))
freqBitoken <- sort(freqBitoken[freqBitoken > 1], decreasing = TRUE)
bigram <- data.frame(token = names(freqBitoken),freq = freqBitoken)
ggplot(subset(bigram, freq > 100), aes(token, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Tokenizer that splits text into space-delimited 3-grams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " "))
tdmTritoken <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer, tolower = FALSE))
freqTritoken <- rowSums(as.matrix(tdmTritoken))
freqTritoken <- sort(freqTritoken[freqTritoken > 1], decreasing = TRUE)
trigram <- data.frame(token = names(freqTritoken), freq = freqTritoken)
ggplot(subset(trigram, freq > 10), aes(token, freq)) +
    geom_bar(stat = "identity", fill = "blue") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
With the 2-grams and 3-grams, we can develop a simple ranking algorithm: given the words typed so far, sort the associated n-grams by frequency and suggest the most frequent continuations. However, there are still many problems that need to be addressed.
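As a first cut, that lookup could look like the sketch below. It assumes the bigram and trigram data frames built above (already sorted by decreasing frequency) and the space-separated token strings produced by NGramTokenizer; predictNext is a hypothetical helper, not part of the pipeline yet.
predictNext <- function(input, n = 3) {
    words <- tail(strsplit(tolower(input), "\\s+")[[1]], 2)  # last two typed words
    lookup <- function(df, prefix, pos) {
        toks <- as.character(df$token)                       # df is already sorted by freq
        hits <- toks[substr(toks, 1, nchar(prefix)) == prefix]
        head(sapply(strsplit(hits, " "), `[`, pos), n)       # extract the following word
    }
    if (length(words) == 2) {                                # try trigrams first
        out <- lookup(trigram, paste0(words[1], " ", words[2], " "), 3)
        if (length(out) > 0) return(out)
    }
    lookup(bigram, paste0(tail(words, 1), " "), 2)           # back off to bigrams
}
predictNext("thanks much")  # e.g. the 3 most frequent words following the typed text
In the final product the typed text would need the same preprocessing as the corpus (lower-casing, stemming, stopword handling) before the lookup, and a proper backoff or smoothing scheme should replace this simple frequency ranking.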