This document describes the current progress of the Coursera Capstone project. The goal of the project is to build a model that predicts the next word given some input (one or two words). The report presents a basic exploratory analysis of the provided input documents: it covers basic statistical information about their content, such as the size and structure of the samples and the frequency distribution of words, and it performs some basic preprocessing of the input data so that it can be used further in the analysis pipeline.
This section describes the steps taken to obtain the data and prepare them for further analysis.
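The report uses the knitr and tm packages, which are loaded up front:
library(knitr)
library(tm)
knitr::opts_chunk$set(echo = TRUE)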
The data set is provided as a downloadable archive; its URL is stored in constants.R as projectDataUrl. The first step is to download the data and unpack its contents. In order to keep some structure, we use separate directories for raw and processed data:
source("constants.R");
if(!dir.exists("data")){
dir.create(dataDirName)
}
if(!dir.exists(rawDataDirPath)){
dir.create(rawDataDirPath);
}
if(!dir.exists(processedDataDirPath)){
dir.create(processedDataDirPath)
}
if(!file.exists(projectDataFilePath)){
print("Downloading project data set ...")
download.file(projectDataUrl,paste(rawDataDirPath,projectDataFileName,sep="/"),method = "curl")
unzip(projectDataFilePath, overwrite = FALSE,exdir = rawDataDirPath)
}
The data is contained in a .zip file, which is unpacked after download. It contains text from blogs, news sources and Twitter in several languages; for this project, the English-language files are used. Basic statistics for the downloaded data are as follows:
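blogsConn <- file("data/raw/final/en_US/en_US.blogs.txt")
blogsLines <- readLines(blogsConn)
sprintf('Blogs file size: %.2f MB', file.size("data/raw/final/en_US/en_US.blogs.txt")/1000000)
sprintf('Number of lines in blogs data: %d', length(blogsLines))
close(blogsConn)
newsConn <- file("data/raw/final/en_US/en_US.news.txt")
newsLines <- readLines(newsConn)
sprintf('News file size: %.2f MB', file.size("data/raw/final/en_US/en_US.news.txt")/1000000)
sprintf('Number of lines in news data: %d', length(newsLines))
close(newsConn)
twitterConn <- file("data/raw/final/en_US/en_US.twitter.txt")
twLines <- readLines(twitterConn, encoding = "UTF-8", skipNul = TRUE)
sprintf('Twitter file size: %.2f MB', file.size("data/raw/final/en_US/en_US.twitter.txt")/1000000)
sprintf('Number of lines in Twitter data: %d', length(twLines))
close(twitterConn)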
## [1] "Blogs file size: 210.16 MB"
## [1] "Number of lines in blogs data: 899288"
## [1] "News file size: 205.81 MB"
## [1] "Number of lines in news data: 1010242"
## [1] "Twitter file size: 167.11 MB"
## [1] "Number of lines in Twitter data: 2360148"
Looking at the output, it is noticeable that this data set is quite large. In order to make the analysis easier, we work with a random sample instead: 5,000 lines are drawn from each file. Because lines are included with probability 0.5 until 5,000 have been collected, the sample effectively comes from the beginning of each file (roughly the first 10,000 lines). Each sampled line is converted to lowercase, profanity is removed and non-ASCII characters are stripped.
source("constants.R")
#regex to used for profanity filtering
profanityRegex <- '\\b(?:fuck|whore|bitch|shit|crap|slut|ass)\\b'
doSample <- function(path, sampleSize=5000){
conn = file(path, encoding = "UTF-8");
# read lines from file, one by one. We want random sample of 1000 lines from each file
count <- 1; # number of lines currently in sample
sample <- character(sampleSize)
linesin <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
for(i in 1:length(linesin)){
val <- rbinom(c(1),1,0.5); # random value to decide whether toinclude line in sample or not
if(val == 1){
# add to sample, converting to lowercase and removing profanity words
# also remove any non-english character
sample[count] <- iconv(gsub(profanityRegex, '', tolower(linesin[i])), from = "UTF-8", to = "ASCII", sub = "")
count <- count + 1;
}
if(count > sampleSize){
break;
}
}
close(conn);
rm(linesin)
sample
}
blogsPath <- paste(rawDataDirPath, "final/en_US/en_US.blogs.txt",sep = "/")
blogSample <- doSample(blogsPath)
newsPath <- paste(rawDataDirPath, "final/en_US/en_US.news.txt",sep = "/")
newsSample <- doSample(newsPath)
twitterPath <- paste(rawDataDirPath, "final/en_US/en_US.twitter.txt",sep = "/")
twitterSample <- doSample(twitterPath)
sprintf('Blogs data sample size: %i', length(blogSample))
## [1] "Blogs data sample size: 5000"
sprintf('News data sample size: %i', length(newsSample))
## [1] "News data sample size: 5000"
sprintf('Twitter data sample size: %i', length(twitterSample))
## [1] "Twitter data sample size: 5000"
The first step in the analysis is to create a corpus from the sampled text data. We use tm library functions to create the corpus and perform some cleanup tasks, such as removing numbers, punctuation, stop words and extra white space:
# tokenizers producing 2-grams and 3-grams, used when building the term-document matrices below
bigramTokenizer <- function(x){
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
trigramTokenizer <- function(x){
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
srcVector <- VectorSource(c(blogSample, newsSample, twitterSample))
# use a VCorpus so that the custom n-gram tokenizers are honoured
# (TermDocumentMatrix ignores custom tokenizer functions for a SimpleCorpus)
corpus <- VCorpus(srcVector)
# process text data in corpus
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
The two custom tokenizer functions defined above enable the creation of term-document matrices based on 2-grams and 3-grams. We will use these n-grams to inspect combinations of two and three words.
# find n-grams in corpus
# 1-gram matrix
tdm1G <- TermDocumentMatrix(corpus)
# 2-gram matrix
tdm2G <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# 3-gram matrix
tdm3G <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
With the term-document matrices in place, we can examine the n-gram frequency distribution for each matrix. The plots below show, for a range of frequency thresholds, how many terms occur at least that often. The number of terms decreases rapidly as the frequency increases.
freqDist <- function(matrix, limit, step){
  # for `limit` frequency thresholds spaced `step` apart (1, 1 + step, 1 + 2*step, ...),
  # count how many terms occur at least that often
  out <- numeric(0)
  out[1] <- length(findFreqTerms(matrix, 1))
  count <- step + 1
  for(i in 2:limit){
    current <- length(findFreqTerms(matrix, count))
    if(current == 0){
      break
    } else {
      out[i] <- current
    }
    count <- count + step
  }
  out
}
barplot(freqDist(tdm1G,20,5),names.arg = seq(1,100,5),ylab = "Count (log)",xlab = "Frequency",
log = "y",main = "Frequency count for 1-gram terms")
barplot(freqDist(tdm2G,20,5),names.arg = seq(1,100,5),ylab = "Count (log)",xlab = "Frequency",
log = "y",main = "Frequency count for 2-gram terms")
barplot(freqDist(tdm3G,10,1),names.arg = seq(1,10,1),ylab = "Count (log)",xlab = "Frequency",
log = "y",main = "Frequency count for 3-gram terms")
In order to visualize the most common terms, we create a bar chart of the top terms in each n-gram category. For 1-grams, the distribution of tokens with frequency greater than 500 is shown.
terms <- findFreqTerms(tdm1G, 500)
# as.matrix() gives a dense matrix for rowSums without printing it the way inspect() does
barplot(rowSums(as.matrix(tdm1G[terms, ])), las=2,
ylab = "Frequency", xlab = "Words", main = "1-grams frequency distribution")
For 2-grams, the frequencies are much lower than for 1-grams. Here we can see the distribution of tokens with frequency over 50:
#adjust bottom margin so labels can fit
par(mar=c(7,4,4,2))
terms <- findFreqTerms(tdm2G, 50)
barplot(rowSums(as.matrix(tdm2G[terms, ])), las=2,
ylab = "Frequency", xlab = "Words", main = "2-grams frequency distribution")
Finally, for 3-grams, the frequencies are again much lower than for 2-grams. Here we can see the frequency distribution of tokens with frequency greater than 5:
# adjust bottom margin so even longer labels can fit
par(mar=c(10,4,4,2))
terms <- findFreqTerms(tdm3G, 5)
barplot(rowSums(as.matrix(tdm3G[terms, ])), las=2,
ylab = "Frequency", xlab = "Words", main = "3-grams frequency distribution")
As we can see from the plots, most terms in our corpus have a very low frequency. This is particularly true for 2-gram and 3-gram tokens: the number of tokens occurring only once is above 200,000, and the count falls rapidly as the frequency increases. This suggests that predicting the next word directly from these tokens will be challenging, since storing all observed n-grams requires a lot of space. Some optimizations (for example pruning rare n-grams) will be needed to make the model run smoothly.
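To illustrate the direction the prediction model will take, the sketch below shows one way the n-gram counts collected above could drive a naive, frequency-based lookup that backs off from 3-grams to 2-grams. This is only an illustration on the sampled data, not the final model: the predictNext helper is hypothetical, the corpus has stop words removed (so they can never be predicted here), and the freq2G/freq3G tables it builds are exactly the kind of structures whose size will need to be optimized. It uses the slam package, which tm already depends on.
# A minimal sketch (not the final model): sum the term frequencies over all
# documents and predict the next word from the most frequent matching n-gram.
# predictNext is a hypothetical helper, used for illustration only.
freq2G <- sort(slam::row_sums(tdm2G), decreasing = TRUE)
freq3G <- sort(slam::row_sums(tdm3G), decreasing = TRUE)

predictNext <- function(input){
  input <- tolower(trimws(input))
  # look for 3-grams starting with the (two-word) input first ...
  hits <- names(freq3G)[startsWith(names(freq3G), paste0(input, " "))]
  if(length(hits) == 0){
    # ... then back off to 2-grams starting with the last input word
    lastWord <- tail(strsplit(input, " ")[[1]], 1)
    hits <- names(freq2G)[startsWith(names(freq2G), paste0(lastWord, " "))]
  }
  if(length(hits) == 0){
    return(NA_character_)
  }
  # the frequency tables are sorted, so the first hit is the most frequent n-gram;
  # its last word is the prediction
  tail(strsplit(hits[1], " ")[[1]], 1)
}

# e.g. predictNext("happy new") might suggest "year" if that 3-gram occurs in the sample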