The goal of this milestone report for the Coursera Capstone Project is to explain the exploratory analysis of the dataset and to briefly describe the plans for creating the prediction algorithm and the Shiny app. The predictive model will be trained on the HC Corpora, a corpus (a collection of written texts) that has been filtered by language.
The data set is from the HC Corpora (http://www.corpora.heliohost.org). It consists of a collection of corpora for various languages that is freely available to download. The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language. The data sources are newspapers, magazines, blogs (personal and professional) and Twitter updates.
Download the dataset and unzip the files
datadir <- "C:/temp"
The data consists of text files in four different languages: English, Russian, German and Finnish. For each language, three text files exist: blogs, news and twitter. We will only consider the English language files for this report.
The following code computes the file size, line and word count for the English language blog, news and twitter files:
filestatstics <- function(filename) {
filesize <- file.info(filename)$size / 1024^2
con <- file(filename, open="rb", blocking=FALSE)
contents <- readLines(con, warn=FALSE)
close(con)
linecount <- length(contents)
words <- strsplit(contents, " ")
wordcount <- sum(sapply(words, FUN=length, simplify=TRUE))
rm(contents, words)
return(list(filesize=filesize, linecount=linecount, wordcount=wordcount))
}
blogs.stats <- filestatstics(paste0(datadir, "/final/en_US/en_US.blogs.txt"))
news.stats <- filestatstics(paste0(datadir, "/final/en_US/en_US.news.txt"))
twitter.stats <- filestatstics(paste0(datadir, "/final/en_US/en_US.twitter.txt"))
statsDF <- data.frame(filename = c("en_US.blogs", "en_US.news", "en_US.twitter"),
filesize = c(blogs.stats$filesize, news.stats$filesize, twitter.stats$filesize),
linecount = c(blogs.stats$linecount, news.stats$linecount, twitter.stats$linecount),
wordcount = c(blogs.stats$wordcount, news.stats$wordcount, twitter.stats$wordcount) )
library(knitr)
kable(x=statsDF, col.names=c("File Name", "File Size (MB)", "Line Count", "Word Count"), digits=1)
| File Name | File Size (MB) | Line Count | Word Count |
|---|---|---|---|
| en_US.blogs | 200.4 | 899288 | 37334131 |
| en_US.news | 196.3 | 1010242 | 34372530 |
| en_US.twitter | 159.4 | 2360148 | 30373543 |
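As a quick reading aid for the table above, the average number of words per line can be derived from the reported counts. This is a minimal sketch that simply re-enters the figures from the table; "words" here means space-separated tokens, as in the counting function, and the data frame name `stats` is only illustrative.

```r
# Figures copied from the table above
stats <- data.frame(
  file      = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  linecount = c(899288, 1010242, 2360148),
  wordcount = c(37334131, 34372530, 30373543)
)

# Average number of space-separated tokens per line, rounded to one decimal
stats$words_per_line <- round(stats$wordcount / stats$linecount, 1)
print(stats)
```

Blog and news lines come out at roughly 42 and 34 words, while tweets average about 13, which is worth keeping in mind when comparing N-gram counts across the three sources.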
The data files are quite large, so exploratory data analysis on the full data set would be too time-consuming. To speed up the exploratory analysis, we will create a random sample of the English language blogs, news and twitter files. Specifically, we will randomly sample 1% of the lines in each file.
set.seed(98765)
sampleDocument <- function(inFile, outFile, samplePercentage) {
con <- file(inFile, open="rb", blocking=FALSE)
doc <- readLines(con, warn=FALSE)
close(con)
N <- length(doc)
doc_sample <- sample(doc, N*samplePercentage)
doc_sample_final <- iconv(doc_sample, "UTF-8", "UTF-8", sub='')
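# Make sure the output directory exists before writing
dir.create(dirname(outFile), recursive=TRUE, showWarnings=FALSE)
write(doc_sample_final, file=outFile, append=FALSE)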
}
dataIn <- paste0(datadir, "/final/en_US")
dataOut <- paste0(datadir, "/final_Sample/en_US")
samplePercentage = 0.01
sampleDocument(paste0(dataIn, "/en_US.blogs.txt"), paste0(dataOut, "/en_US.blogs.txt"), samplePercentage)
sampleDocument(paste0(dataIn, "/en_US.news.txt"), paste0(dataOut, "/en_US.news.txt"), samplePercentage)
sampleDocument(paste0(dataIn, "/en_US.twitter.txt"), paste0(dataOut, "/en_US.twitter.txt"), samplePercentage)
Size statistics for the randomly sampled files generated above are given in the table below.
blogs.stats <- filestatstics(paste0(dataOut, "/en_US.blogs.txt"))
news.stats <- filestatstics(paste0(dataOut, "/en_US.news.txt"))
twitter.stats <- filestatstics(paste0(dataOut, "/en_US.twitter.txt"))
statsDF <- data.frame(fileName = c("en_US.blogs", "en_US.news", "en_US.twitter"),
filesize = c(blogs.stats$filesize, news.stats$filesize, twitter.stats$filesize),
linecount = c(blogs.stats$linecount, news.stats$linecount, twitter.stats$linecount),
wordcount = c(blogs.stats$wordcount, news.stats$wordcount, twitter.stats$wordcount) )
kable(x=statsDF, col.names=c("File Name", "File Size (MB)", "Line Count", "Word Count"), digits=1)
| File Name | File Size (MB) | Line Count | Word Count |
|---|---|---|---|
| en_US.blogs | 2.0 | 8992 | 370277 |
| en_US.news | 2.0 | 10102 | 345487 |
| en_US.twitter | 1.6 | 23601 | 303564 |
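The line counts in this table can be checked against the full-file counts reported earlier. The sketch below assumes that the fractional sample size `N*samplePercentage` is truncated to a whole number of lines, which is consistent with the figures above; the variable names are illustrative.

```r
# Line counts of the full files, copied from the first table
linecount <- c(blogs = 899288, news = 1010242, twitter = 2360148)
samplePercentage <- 0.01

# Expected number of sampled lines if the fractional size is truncated
expected <- floor(linecount * samplePercentage)
print(expected)
```

The result matches the sampled line counts of 8992, 10102 and 23601.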
We will create a corpus from the randomly sampled English language files for blogs, news and twitter. Raw text can cause problems when one tries to analyze the data quantitatively. For example, it contains characters, symbols and words that do not provide helpful information about the structure of the language. Hence, it is often beneficial to remove numbers, punctuation, extraneous spacing, special characters, URLs, etc. In addition, we will remove English language “stopwords” such as “the”, “it’s”, etc. These words occur very frequently in text, but do not provide useful information for our task at hand, namely predicting the next word in a phrase.
To accomplish the cleaning of the raw text files and create a corpus, the following code is used:
library(tm)
# Create corpus. Reads blogs, news, twitter
corpus <- Corpus(DirSource(dataOut), readerControl = list(language = "en_US"))
# Content transformer function: replaces every match of a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Remove URLs (match up to the next whitespace so the rest of the line is kept)
corpus <- tm_map(corpus, toSpace, "(f|ht)tps?://[^[:space:]]+")
# Remove RTs and vias (mostly from tweets)
corpus <- tm_map(corpus, toSpace, "RT |via ")
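# Replace twitter accounts (@XXXXXX) by space
# ([^[:space:]] is used because \\s inside brackets is not portable in R's default regex engine)
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")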
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Convert all to lower case (wrapped in content_transformer so the documents keep their tm structure)
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove whitespace
corpus <- tm_map(corpus, stripWhitespace)
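To see what these cleaning steps do to a single line of text, here is a minimal base-R sketch that mirrors them without the tm package. The function `clean_line`, its tiny stopword list and the example sentence are all hypothetical and only serve as an illustration; tm's own `stopwords("english")` list is much longer.

```r
# Hypothetical helper: applies the same kind of cleaning to one string
clean_line <- function(x, stopwords = c("the", "is", "a", "it's")) {
  x <- gsub("(f|ht)tps?://[^[:space:]]+", " ", x)  # URLs -> space
  x <- gsub("@[^[:space:]]+", " ", x)              # twitter handles -> space
  x <- gsub("[[:digit:]]+", "", x)                 # numbers
  x <- tolower(x)                                  # lower case
  # Stopwords, matched as whole words only
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)                 # punctuation
  x <- gsub("[[:space:]]+", " ", x)                # collapse whitespace
  trimws(x)
}

cleaned <- clean_line("The cat @felix saw 2 dogs at http://example.com/a.b today!")
print(cleaned)  # "cat saw dogs at today"
```

Stopwords are removed before punctuation so that entries containing an apostrophe, such as "it's", can still be matched.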
Now that we have cleaned the data, we can conduct exploratory analysis. We will do three things to illustrate the type of exploratory analysis that can be done on a text-based data set. First, we will determine the most frequent words that occur in each file. Second, we will look at the most frequent two-word phrases (bi-grams). Finally, we will use a word-cloud to illustrate the most frequent tri-grams (three-word phrases).
It is of interest to get an idea of the most frequently occurring words in the documents. The following code computes word frequencies for each document and orders them from largest to smallest. We report the top 20 most frequent words in each of the three English language files.
TDM <- TermDocumentMatrix(corpus)
mTDM <- as.matrix(TDM)
wordFreq <- data.frame(Term=rownames(TDM), blogs=mTDM[,1], news=mTDM[,2], twitter=mTDM[,3], row.names=NULL)
wordFreq.blogs <- wordFreq[order(-wordFreq[,2]), c(1,2)]
wordFreq.news <- wordFreq[order(-wordFreq[,3]), c(1,3)]
wordFreq.twitter <- wordFreq[order(-wordFreq[,4]), c(1,4)]
library(reshape2)
wordFreq.blogs <- melt(wordFreq.blogs, id.vars=c("Term"))[1:20, ]
wordFreq.news <- melt(wordFreq.news, id.vars=c("Term"))[1:20, ]
wordFreq.twitter <- melt(wordFreq.twitter, id.vars=c("Term"))[1:20, ]
wordFreq2 <- rbind(wordFreq.blogs, wordFreq.news, wordFreq.twitter)
colnames(wordFreq2) <- c("Term", "Type", "Freq")
library(ggplot2)
qplot(x=Freq, y=reorder(Term, Freq), data=wordFreq2, xlab="Freq", ylab="Term") + facet_grid(Type~., scales="free_y")
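For readers unfamiliar with term-document matrices, the following toy sketch builds one by hand in base R: rows are terms, columns are documents, and each cell holds a count. The three one-line "documents" are made up for illustration and the object names are hypothetical; the real analysis uses `TermDocumentMatrix` from tm as shown above.

```r
# Toy documents standing in for the three cleaned files
docs <- c(blogs = "good day good food",
          news = "city news day",
          twitter = "good morning")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)
names(tokens) <- names(docs)

# Vocabulary = every distinct term across all documents
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document (terms x documents)
tdm <- vapply(tokens,
              function(w) as.integer(table(factor(w, levels = vocab))),
              integer(length(vocab)))
rownames(tdm) <- vocab
print(tdm)

# Most frequent term in the toy blogs document
top_blogs <- sort(tdm[, "blogs"], decreasing = TRUE)[1]
print(top_blogs)
```

Sorting one column of this matrix in decreasing order is the same operation the report applies to each file to obtain its top 20 terms.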
One can also determine the most frequent N-grams, i.e. the most frequent sequences of N consecutive words. First, we consider bi-grams, or two-word phrases. The following code finds, orders and displays the most frequent bi-grams.
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TDM_2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
mTDM_2 <- as.matrix(TDM_2)
biGramFreq <- data.frame(Ngram=rownames(TDM_2), blogs=mTDM_2[,1], news=mTDM_2[,2], twitter=mTDM_2[,3])
biGramFreq.blogs <- biGramFreq[order(-biGramFreq[,2]), c(1,2)]
biGramFreq.news <- biGramFreq[order(-biGramFreq[,3]), c(1,3)]
biGramFreq.twitter <- biGramFreq[order(-biGramFreq[,4]), c(1,4)]
library(reshape2)
biGramFreq.blogs <- melt(biGramFreq.blogs, id.vars=c("Ngram"))[1:20, ]
biGramFreq.news <- melt(biGramFreq.news, id.vars=c("Ngram"))[1:20, ]
biGramFreq.twitter <- melt(biGramFreq.twitter, id.vars=c("Ngram"))[1:20, ]
biGramFreq2 <- rbind(biGramFreq.blogs, biGramFreq.news, biGramFreq.twitter)
colnames(biGramFreq2) <- c("Ngram", "Type", "Freq")
qplot(x=Freq, y=reorder(Ngram, Freq), data=biGramFreq2, xlab="Freq", ylab="Ngram") + facet_grid(Type~., scales="free_y")
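As an illustration of what an N-gram tokenizer produces, here is a small base-R sketch that slides a window of N words over a token vector. The helper `make_ngrams` is hypothetical and is not part of RWeka; it only shows the idea behind `NGramTokenizer`.

```r
# Hypothetical helper: all runs of n consecutive tokens, joined by spaces
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("thanks", "for", "the", "follow")
bigrams <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
print(bigrams)   # "thanks for" "for the" "the follow"
print(trigrams)  # "thanks for the" "for the follow"
```

A sentence of k words yields k - N + 1 N-grams, so short lines such as tweets contribute few higher-order N-grams.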
We can use a “word-cloud” to graphically display the most frequently occurring tri-grams over all three files.
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
TDM_3 <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))
mTDM_3 <- as.matrix(TDM_3)
triGramFreq <- sort(rowSums(mTDM_3), decreasing=TRUE)
triGramFreq <- data.frame(Ngram=attr(triGramFreq, "names")[1:20], Freq=triGramFreq[1:20], row.names=NULL)
library(wordcloud)
library(RColorBrewer)
pal <- brewer.pal(8, "Dark2")
wordcloud(words = triGramFreq$Ngram, freq = triGramFreq$Freq,
scale = c(3, .3), random.order = FALSE, random.color = FALSE, colors = pal)
Prediction of the next word in a sentence will depend on the previous N-grams in that sentence or phrase. I am considering a prediction algorithm based on 2-, 3-, 4- and 5-grams. For example, given a phrase for which we want to predict the next word, I would create separate predictions based on the previous 2-gram, 3-gram, etc. The weight given to each prediction would be proportional to the number of words in the N-gram used to make it. The final prediction would be the word with the largest total weight. The Shiny app will allow a user to enter a phrase and will return the top 3 or 4 words predicted to occur next. I hope to attach a “score” to each predicted word, i.e. a confidence measure for that prediction.
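The weighting idea described above could look roughly like the following base-R sketch. Several details are assumptions, since the plan leaves them open: an N-gram model of order n is taken to use the last n - 1 words as context, each candidate word receives n times its relative frequency among the matching N-grams, and the "score" is the candidate's share of the total weight. The functions `make_ngrams`, `train_counts` and `predict_next` and the toy sentences are hypothetical, and the sentences are left uncleaned for readability.

```r
# Hypothetical helper: all runs of n consecutive tokens
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count N-grams of each order; returns a list of named numeric vectors
train_counts <- function(sentences, orders = 2:5) {
  token_list <- strsplit(sentences, " ", fixed = TRUE)
  counts <- list()
  for (n in orders) {
    grams <- unlist(lapply(token_list, make_ngrams, n = n))
    tab <- table(grams)
    counts[[as.character(n)]] <- setNames(as.numeric(tab), names(tab))
  }
  counts
}

# Combine the votes of all N-gram orders (weighting scheme is an assumption)
predict_next <- function(phrase, counts, top = 3) {
  words <- strsplit(tolower(phrase), " ", fixed = TRUE)[[1]]
  scores <- numeric(0)
  for (n_chr in names(counts)) {
    n <- as.integer(n_chr)
    tab <- counts[[n_chr]]
    if (length(tab) == 0 || length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) == 0) next
    for (g in names(hits)) {
      word <- sub(".* ", "", g)  # last word of the matching N-gram
      prev <- if (word %in% names(scores)) scores[[word]] else 0
      # Weight proportional to the N-gram order n
      scores[word] <- prev + n * hits[[g]] / sum(hits)
    }
  }
  if (length(scores) == 0) {
    return(data.frame(word = character(0), score = numeric(0)))
  }
  scores <- sort(scores / sum(scores), decreasing = TRUE)
  head(data.frame(word = names(scores), score = as.numeric(scores),
                  row.names = NULL), top)
}

# Toy training data (made up for illustration)
sentences <- c("thanks for the follow",
               "thanks for the support",
               "thanks for the follow back",
               "looking forward to the weekend")
counts <- train_counts(sentences)
result <- predict_next("thanks for the", counts)
print(result)
```

On this toy data the top suggestion is "follow", followed by "support" and "weekend". A production version would need a fallback for unseen contexts and a more compact storage format than named vectors, both of which are outside the scope of this sketch.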