Introduction

This milestone report is for the Coursera Capstone Project, which aims to build a text predictor application. The report describes the current progress of the project: the training data set has been downloaded and described, then sampled and cleansed, and an initial exploratory analysis of the data has been carried out.

Characteristics of the Dataset

The dataset used for the analysis was downloaded from this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}

The unzipped file has four language groups: English, Russian, German and Finnish. Since this app will focus only on English, we consider only that group. The English folder has three text sources: blogs, news, and twitter. The entire set has about 4 million lines and roughly 100 million words.

library(knitr)
library(stringi)

blogData <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
newsData <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitterData <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

bloglength <- length(blogData)
newslength <- length(newsData)
twitterlength <- length(twitterData)

blogsize <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
newssize <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twittersize <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2     

blogwords <- stri_count_words(blogData)
newswords <- stri_count_words(newsData)
twitterwords <- stri_count_words(twitterData)
tablesource<- data.frame(Dataset = c("blogs", "news", "twitter"),
           File_size_MB = c(blogsize, newssize, twittersize),
           Num_lines = c(bloglength, newslength, twitterlength),
           Num_words = c(sum(blogwords), sum(newswords), sum(twitterwords)),
           Mean_words = c(mean(blogwords), mean(newswords), mean(twitterwords)))


kable(tablesource)
Dataset    File_size_MB   Num_lines   Num_words   Mean_words
blogs          200.4242      899288    37546246     41.75108
news           196.2775     1010242    34762395     34.40997
twitter        159.3641     2360148    30093410     12.75065
print(paste("The total number of words in the set is", sum(blogwords) + sum(newswords) + sum(twitterwords)))

[1] "The total number of words in the set is 102402051"

Sampling and Cleansing

The entire data set is cumbersome to process, so we sample only 1% of each source to make it more manageable to work with.

The sample set is cleansed by removing numbers, extra whitespace, and punctuation, and by converting all words to lowercase.

library(tm)
library(stringi)
library(stringr)
library(ggplot2)
library(dplyr)
library(ngram)

set.seed(1202)
# Sample 1% of the lines from each source
blogsSample <- sample(blogData, length(blogData) * 0.01)
newsSample <- sample(newsData, length(newsData) * 0.01)
twitterSample <- sample(twitterData, length(twitterData) * 0.01)
# Strip non-ASCII characters from the twitter sample
twitterSample <- sapply(twitterSample, function(row) iconv(row, "latin1", "ASCII", sub = ""))
text_sample  <- c(blogsSample,newsSample,twitterSample)

write(text_sample,"textsample.txt")
file1a <- file("textsample.txt", "rb")
text_sample <- readLines(file1a, encoding="UTF-8")
close(file1a)

toSpace <- content_transformer(function(x, pattern){gsub(pattern, " ", x)})

preprocessCorpus <- function(corpus){
  # Helper function to preprocess corpus
  corpus <- tm_map(corpus, toSpace, "/|@|\\|")
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

trialdata <- VCorpus(VectorSource(text_sample))
trialdata2 <- preprocessCorpus(trialdata)

The number of lines in the text sample is 42,695, which is about 1% of the original set.
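
For reference, the line and word counts of the sample can be checked directly from the combined text_sample object (a quick illustrative check, not part of the original processing steps):

length(text_sample)                  # number of lines in the sample
sum(stri_count_words(text_sample))   # approximate number of words in the sample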

Creating Word Groupings and Frequencies

We look at the frequency of word groupings appearing in the sample: individual words (unigrams), bigrams (groups of 2), trigrams (groups of 3) and quadgrams (groups of 4). For unigrams we check the word frequencies excluding stop words (commonly used words such as 'the', 'a', and 'in'). For phrase search we include the stop words. Challenges were encountered with the RWeka package, so the ngram package was used as an alternative. Below is a sample of the code used to tokenize the corpus and graph the resulting frequencies.

freq_frame <- function(tdm){
  # Function to convert a term document matrix into words and associated frequencies
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}

# Tokenizer that splits the text into unigrams
NgramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)}

tdm <- TermDocumentMatrix(trialdata2, control = list(tokenize = NgramTokenizer))
tdm1 <- removeSparseTerms(tdm, 0.9999)
freq1_frame <- freq_frame(tdm1)

top20 <- freq1_frame[1:20, ]

ggplot(top20, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common words in text sample")

Exploratory Analysis

The top 20 most common unigrams (excluding stop words)

The top 20 most common bigrams (including stop words)

The top 20 most common trigrams (including stop words)

The top 20 most common 4-grams (including stop words)

Next Steps

The next step of this project is to store the word-group frequencies in a data set. The frequency data tells us the probability of a word group appearing. For example, if the user inputs "the end of", there is a very high probability that the next word is "the", since "the end of the" is the highest-ranking quadgram. The count of the most frequent quadgram has dropped to around 70 (versus 4,000+ for the most frequent bigram), so the database might only store n-grams up to quadgrams. The balance between accuracy and speed will be evaluated once the Shiny application is built.
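
As an illustration of how the stored frequencies could drive the prediction, below is a minimal sketch of a frequency-based lookup. It assumes a hypothetical data frame quadgram_freq with columns phrase (the first three words of a quadgram), nextword (the fourth word) and freq (its count); the real application would also need a backoff to lower-order n-grams when no quadgram matches.

# Minimal sketch of a next-word lookup from quadgram frequencies
# quadgram_freq is a hypothetical data frame with columns phrase, nextword and freq
predict_next <- function(input, quadgram_freq) {
  tokens <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)   # last three words of the input
  key <- paste(tokens, collapse = " ")
  matches <- quadgram_freq[quadgram_freq$phrase == key, ]
  if (nrow(matches) == 0) return(NA_character_)                 # would back off to trigrams here
  matches$nextword[which.max(matches$freq)]
}

# e.g. with "the end of the" as the top quadgram,
# predict_next("the end of", quadgram_freq) would return "the"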

End of report.