1. Executive Summary

The objective of the Coursera Data Science Capstone project is to build a prediction model that suggests the next word as the user types on a keyboard. The work involves Natural Language Processing and Text Mining.

This milestone report gives a basic summary of the datasets used to build the prediction model.

2. Set Up Runtime Environment

All the necessary R libraries are loaded.
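The exact package list is not shown here; judging from the functions used in the later sections, a plausible set (an assumption, not an exhaustive list) is:

## load packages assumed from the functions used later in this report
library(caTools)       ## sample.split
library(tm)            ## removeWords, stopwords, removePunctuation, removeNumbers, stripWhitespace, VCorpus, TermDocumentMatrix
library(RWeka)         ## NGramTokenizer, Weka_control (also called via RWeka:: below)
library(slam)          ## row_sums (also called via slam:: below)
library(dplyr)         ## arrange, desc
library(ggplot2)       ## histogram of word frequencies
library(wordcloud)     ## word cloud
library(RColorBrewer)  ## brewer.pal colour palette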

3. Download Data

The data comes from a corpus called HC Corpora. The datasets are hosted on the Coursera site as the Capstone Dataset.
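The download step itself is not shown; a minimal sketch, assuming the Coursera-SwiftKey zip archive linked from the course page (URL and paths are assumptions, not taken from this report):

## download and unpack the Coursera-SwiftKey archive (URL assumed from the course page)
data.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(data.url, destfile="Coursera-SwiftKey.zip", mode="wb")
    unzip("Coursera-SwiftKey.zip")  ## extracts the final/<locale>/ folders, including final/en_US/
}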

4. Load Data & Basic Data Exploration

The downloaded data are loaded into memory.
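The loading code is not shown; a sketch, assuming readLines() with UTF-8 encoding (skipNul=TRUE avoids problems with embedded null characters):

## read the three English source files into character vectors (paths assume the unzipped final/en_US folder)
blogs.raw   <- readLines("final/en_US/en_US.blogs.txt",   encoding="UTF-8", skipNul=TRUE)
news.raw    <- readLines("final/en_US/en_US.news.txt",    encoding="UTF-8", skipNul=TRUE)
twitter.raw <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul=TRUE)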

The sources of the three datasets used to build the prediction model are:

  1. blogs (en_US.blogs.txt)
  2. news (en_US.news.txt)
  3. twitter (en_US.twitter.txt)
source               file size (MB)   # of records   word count
en_US.blogs.txt      200.42              899,288     38,154,238
en_US.news.txt       196.28               77,259      2,693,898
en_US.twitter.txt    159.36            2,360,148     30,218,125
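The summary figures above could be reproduced along these lines; stringi::stri_count_words is one plausible way to count words (the original counting method is not shown):

## summarise file size, record count and word count for each source
library(stringi)
src.files <- file.path("final/en_US",
                       c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
data.summary <- data.frame(
    source  = basename(src.files),
    size.MB = file.size(src.files) / 1024^2,
    records = c(length(blogs.raw), length(news.raw), length(twitter.raw)),
    words   = c(sum(stri_count_words(blogs.raw)),
                sum(stri_count_words(news.raw)),
                sum(stri_count_words(twitter.raw))))
data.summary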

5. Data Cleaning

The total size of the datasets is large, so a random sample of 10% of the records is used to train the prediction model. The datasets also contain tokens that are not useful for building the model, such as stop words, numbers and punctuation, so cleaning is required.

## random selection of 10% of records as training data set
blogs.trainidx <- sample.split(blogs.raw, SplitRatio=0.10)
news.trainidx <- sample.split(news.raw, SplitRatio=0.10)
twitter.trainidx <- sample.split(twitter.raw, SplitRatio=0.10)

## create training data set
blogs.trainset <- blogs.raw[blogs.trainidx]; rm(blogs.raw); rm(blogs.trainidx)
news.trainset <- news.raw[news.trainidx]; rm(news.raw); rm(news.trainidx)
twitter.trainset <- twitter.raw[twitter.trainidx]; rm(twitter.raw); rm(twitter.trainidx)

## activate garbage collection
gc()
##           used (Mb) gc trigger (Mb) max used  (Mb)
## Ncells  860142 46.0    4193053  224  4322609 230.9
## Vcells 8715147 66.5   57931309  442 72135181 550.4
## change to lowercase (done before stop word removal so capitalised stop words are also caught)
blogs.trainset <- tolower(blogs.trainset)
news.trainset <- tolower(news.trainset)
twitter.trainset <- tolower(twitter.trainset)

## remove hashtags (done before punctuation removal, which would otherwise strip the leading '#')
blogs.trainset <- gsub("(^| )#\\S*", "", blogs.trainset)
news.trainset <- gsub("(^| )#\\S*", "", news.trainset)
twitter.trainset <- gsub("(^| )#\\S*", "", twitter.trainset)

## remove stop words
blogs.trainset <- removeWords(blogs.trainset, stopwords("english"))
news.trainset <- removeWords(news.trainset, stopwords("english"))
twitter.trainset <- removeWords(twitter.trainset, stopwords("english"))

## remove punctuation
blogs.trainset <- removePunctuation(blogs.trainset)
news.trainset <- removePunctuation(news.trainset)
twitter.trainset <- removePunctuation(twitter.trainset)

## remove numbers
blogs.trainset <- removeNumbers(blogs.trainset)
news.trainset <- removeNumbers(news.trainset)
twitter.trainset <- removeNumbers(twitter.trainset)

## remove any remaining non-alphanumeric characters
blogs.trainset <- gsub("[^0-9a-zA-Z/' ]", "", blogs.trainset)
news.trainset <- gsub("[^0-9a-zA-Z/' ]", "", news.trainset)
twitter.trainset <- gsub("[^0-9a-zA-Z/' ]", "", twitter.trainset)

## collapse extra whitespace left over from the removals above
blogs.trainset <- stripWhitespace(blogs.trainset)
news.trainset <- stripWhitespace(news.trainset)
twitter.trainset <- stripWhitespace(twitter.trainset)

6. Text Mining

The three training sets are converted into corpora of documents and then combined into a single corpus.

## Convert character vectors into to Corpus
blogs.corpus <- VCorpus(VectorSource(blogs.trainset)); rm(blogs.trainset)
news.corpus <- VCorpus(VectorSource(news.trainset)); rm(news.trainset)
twitter.corpus <- VCorpus(VectorSource(twitter.trainset)); rm(twitter.trainset)

## combine the 3 corpora into one corpus
combine.corpus <- c(blogs.corpus, news.corpus, twitter.corpus); rm(blogs.corpus); rm(news.corpus); rm(twitter.corpus)

## activate garbage collection
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 12106699 646.6   17841328 952.9 16108235 860.3
## Vcells 30781764 234.9   47761510 364.4 72135181 550.4

A TermDocumentMatrix is built to record the number of times each word (or sequence of words) occurs in the corpus. A unigram matrix is built first, and the 20 most frequent words are shown in the histogram below.

## unigram tokenizer
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}

unidm <- TermDocumentMatrix(combine.corpus, control = list(tokenize = UnigramTokenizer))
uni_df_words <- as.data.frame(slam::row_sums(unidm, na.rm=T))
colnames(uni_df_words)<- "freq"
uni_df_words <- cbind(word = rownames(uni_df_words),uni_df_words)
rownames(uni_df_words) <- NULL
uni_df_words_sorted <- arrange(uni_df_words,desc(freq))

## plot the 20 most frequent unigrams, with bars ordered by descending frequency
ggplot(uni_df_words_sorted[1:20,], aes(x=reorder(word, -freq), y=freq)) +
    labs(x="Top 20 Words", y="Frequency") +
    ggtitle("Top 20 Words") +
    theme(axis.text.x=element_text(angle=60, size=18, vjust=0.5)) +
    geom_bar(stat="identity") + geom_text(aes(label=freq), vjust=-0.4)

A word cloud of the top 60 words is displayed below. The more frequently a word occurs, the larger it appears in the word cloud.

wordcloud(uni_df_words_sorted$word, uni_df_words_sorted$freq,
          scale=c(5,0.5), max.words=60, random.order=FALSE, rot.per=0.35,
          use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))

7. Other Analysis Tasks

Some other analysis tasks still to be done are:

  1. Investigate whether stemming will help the prediction modeling
  2. Remove profanity words
  3. Build bigram and trigram TermDocumentMatrix objects (a first sketch is shown below)
  4. Investigate other ways to increase the size of the training dataset
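
As a starting point for item 3, the bigram and trigram matrices could reuse the same RWeka tokenizer pattern as the unigram tokenizer above (a sketch under that assumption, not a final implementation):

## bigram and trigram tokenizers, mirroring the unigram tokenizer in section 6
BigramTokenizer  <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}

bidm  <- TermDocumentMatrix(combine.corpus, control = list(tokenize = BigramTokenizer))
tridm <- TermDocumentMatrix(combine.corpus, control = list(tokenize = TrigramTokenizer))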