Introduction

This milestone report is for the Coursera Capstone Project, which aims to build a text predictor application. The report describes the current progress of the project: the training data set has been downloaded and described, then sampled and cleansed, and an initial exploratory analysis of the data has been carried out.

Characteristics of the Dataset

The dataset used for the analysis was downloaded from this link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")
}

The unzipped file has four language groups: English, Russian, German and Finnish. Since this app will focus only on English, we consider only that group. The English folder has three text sources: blogs, news, and twitter. The entire set has about 4 million lines and roughly 100 million words.

library(knitr)
library(stringi)

blogData <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
newsData <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitterData <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

bloglength <- length(blogData)
newslength <- length(newsData)
twitterlength <- length(twitterData)

blogsize <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
newssize <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twittersize <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2     

blogwords <- stri_count_words(blogData)
newswords <- stri_count_words(newsData)
twitterwords <- stri_count_words(twitterData)
tablesource<- data.frame(Dataset = c("blogs", "news", "twitter"),
           File_size_MB = c(blogsize, newssize, twittersize),
           Num_lines = c(bloglength, newslength, twitterlength),
           Num_words = c(sum(blogwords), sum(newswords), sum(twitterwords)),
           Mean_words = c(mean(blogwords), mean(newswords), mean(twitterwords)))


kable(tablesource)
Dataset    File_size_MB   Num_lines   Num_words   Mean_words
blogs          200.4242      899288    37546246     41.75108
news           196.2775     1010242    34762395     34.40997
twitter        159.3641     2360148    30093410     12.75065
print(paste("The total number of words in the set is", sum(blogwords) + sum(newswords) + sum(twitterwords)))

[1] "The total number of words in the set is 102402051"

Sampling and Cleansing

The entire data set is cumbersome to process, so we sample only 1% of each source to make it more manageable to work with.

The sample set is cleansed by removing numbers, extra whitespace, and punctuation, and by converting all words to lowercase.

library(tm)
library(stringi)
library(stringr)
library(ggplot2)
library(dplyr)
library(ngram)

set.seed(1202)
# Sample 1% of the lines from each source
blogsSample <- sample(blogData, length(blogData) * 0.01)
newsSample <- sample(newsData, length(newsData) * 0.01)
twitterSample <- sample(twitterData, length(twitterData) * 0.01)
# Strip non-ASCII characters from the twitter sample
twitterSample <- sapply(twitterSample, function(row) iconv(row, "latin1", "ASCII", sub = ""))
text_sample  <- c(blogsSample,newsSample,twitterSample)

write(text_sample,"textsample.txt")
file1a <- file("textsample.txt", "rb")
text_sample <- readLines(file1a, encoding="UTF-8")
close(file1a)

toSpace <- content_transformer(function(x, pattern){gsub(pattern, " ", x)})

preprocessCorpus <- function(corpus){
  # Helper function to preprocess corpus
  corpus <- tm_map(corpus, toSpace, "/|@|\\|")
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

trialdata <- VCorpus(VectorSource(text_sample))
trialdata2 <- preprocessCorpus(trialdata)

The number of lines in the text sample is 42,695, which is about 1% of the original set.
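
For reference, the line and word counts of the sample can be checked directly from the combined text_sample object (a quick illustrative check, not part of the original processing steps):

length(text_sample)                  # number of lines in the sample
sum(stri_count_words(text_sample))   # approximate number of words in the sample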

Creating Word Groupings and Frequencies

We look at the frequency of word groupings appearing in the sample: individual words (unigrams), bigrams (groups of 2), trigrams (groups of 3) and quadgrams (groups of 4). For unigrams we check the word frequencies excluding stop words (commonly used words such as 'the', 'a', and 'in'). For phrase search we include the stop words. Challenges were encountered with the RWeka package, so the ngram package was used as an alternative. Below is a sample of the code used to tokenize the corpus and graph the resulting frequencies.

freq_frame <- function(tdm){
  # Function to convert a term document matrix into words and associated frequencies
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}

# Tokenizer that splits the text into unigrams
NgramTokenizer <- function(x) {unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)}

tdm <- TermDocumentMatrix(trialdata2, control = list(tokenize = NgramTokenizer))
tdm1 <- removeSparseTerms(tdm, 0.9999)
freq1_frame <- freq_frame(tdm1)

top20 <- freq1_frame[1:20, ]

ggplot(top20, aes(x=reorder(word,freq), y=freq, fill=freq)) +
  geom_bar(stat="identity") +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Frequency", title="Most common words in text sample")

Exploratory Analysis

The top 20 most common unigrams (excluding stop words)

The top 20 most common bigrams (including stop words)

The top 20 most common trigrams (including stop words)

The top 20 most common 4-grams (including stop words)

Next Steps

The next step of this project is to store the word-group frequencies in a data set. The frequency data tells us the probability of a word group appearing. For example, if the user inputs "the end of", there is a very high probability that the next word is "the", since "the end of the" is the highest-ranking quadgram. The count of the most frequent quadgram has dropped to around 70 (versus 4,000+ for the most frequent bigram), so the database might only store n-grams up to quadgrams. The balance between accuracy and speed will be evaluated once the Shiny application is built.
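
As an illustration of how the stored frequencies could drive the prediction, below is a minimal sketch of a frequency-based lookup. It assumes a hypothetical data frame quadgram_freq with columns phrase (the first three words of a quadgram), nextword (the fourth word) and freq (its count); the real application would also need a backoff to lower-order n-grams when no quadgram matches.

# Minimal sketch of a next-word lookup from quadgram frequencies
# quadgram_freq is a hypothetical data frame with columns phrase, nextword and freq
predict_next <- function(input, quadgram_freq) {
  tokens <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)   # last three words of the input
  key <- paste(tokens, collapse = " ")
  matches <- quadgram_freq[quadgram_freq$phrase == key, ]
  if (nrow(matches) == 0) return(NA_character_)                 # would back off to trigrams here
  matches$nextword[which.max(matches$freq)]
}

# e.g. with "the end of the" as the top quadgram,
# predict_next("the end of", quadgram_freq) would return "the"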

End of report.