Introduction

This milestone report presents the results of the exploratory data analysis performed on the corpus provided for the Coursera Data Science Capstone Project. The goal of the project is to build a predictive text model, combined with a Shiny app UI, that predicts the next word as the user types a sentence, similar to the way smartphone keyboards built on SwiftKey technology work today.

The model will be trained using a corpus (a collection of English text) that is compiled from 3 sources - news, blogs, and tweets.

In this report, I load and clean the data and use the R natural language processing (NLP) packages tm and RWeka to tokenize n-grams as a first step toward building a predictive model.

Getting the Data

The data sets for this project are extracted from the zip file at the following URL:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The archive contains text in 4 languages (German, American English, Finnish and Russian), each drawn from 3 distinct sources:

  1. Blogs
  2. News
  3. Twitter feeds

This analysis focuses only on the American English data sets.
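
As a minimal sketch of this step, the archive could be fetched and unpacked from R as follows. The helper name fetch_corpus and the default directory "data" are hypothetical and not part of the original analysis; the download is skipped when the zip file is already on disk.

```r
# Hypothetical helper: download the capstone archive once and unpack it.
fetch_corpus <- function(url, dest_dir = "data") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  zip_path <- file.path(dest_dir, basename(url))
  if (!file.exists(zip_path)) {
    # mode = "wb" keeps the binary zip intact on Windows
    download.file(url, destfile = zip_path, mode = "wb")
  }
  unzip(zip_path, exdir = dest_dir)  # returns the extracted file paths
}

# Example call (commented out because it downloads a large file):
# files <- fetch_corpus("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip")
# grep("en_US", files, value = TRUE)  # keep only the American English files
```

Filtering the returned paths on "en_US" avoids hard-coding the folder layout inside the archive.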

Summary of the data

# Load the required R libraries
library(tm)
library(ggplot2)
library(RWeka)
library(R.utils)
library(dplyr)
library(knitr)
library(doParallel)
library(SnowballC)
library(stringi)
library(wordcloud)

p1<- "C:/CapstoneProject/en_US/en_US.blogs.txt"
p2<- "C:/CapstoneProject/en_US/en_US.news.txt"
p3<- "C:/CapstoneProject/en_US/en_US.twitter.txt"

# Read blogs data in binary mode
conn <- file(p1, open="rb")
blogs <- readLines(conn, encoding="UTF-8"); close(conn)
# Read news data in binary mode
conn <- file(p2, open="rb")
news <- readLines(conn, encoding="UTF-8"); close(conn)
# Read twitter data in binary mode
conn <- file(p3, open="rb")
twitter <- readLines(conn, encoding="UTF-8"); close(conn)
# Remove temporary variable
rm(conn)

# Compute line, character and word totals, plus min/mean/max words per line, for each source
Info<- sapply(list(blogs,news,twitter),function(x) summary(stri_count_words(x))[c('Min.','Mean','Max.')])
rownames(Info) <- c('Min','Mean','Max')
stats <- data.frame(
  FileName=c("en_US.blogs","en_US.news","en_US.twitter"),      
  t(rbind(
    sapply(list(blogs,news,twitter),stri_stats_general)[c('Lines','Chars'),],
    Words=sapply(list(blogs,news,twitter),stri_stats_latex)['Words',],
    Info)
  ))
head(stats)
##        FileName   Lines     Chars    Words Min     Mean  Max
## 1   en_US.blogs  899288 206824382 37570839   0 41.75108 6726
## 2    en_US.news 1010242 203223154 34494539   1 34.40997 1796
## 3 en_US.twitter 2360148 162096031 30451128   1 12.75063   47
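
To make the last three columns concrete: Min, Mean and Max summarise the number of words on each line of a file. The sketch below reproduces that idea on a toy vector using base R only; splitting on whitespace is an assumption that approximates, but does not exactly match, the word-boundary rules of stri_count_words.

```r
# Toy stand-in for the lines of a text file (not real corpus data)
toy_lines <- c("the quick brown fox", "hello world", "one")

# Count whitespace-separated tokens on each line
words_per_line <- lengths(strsplit(toy_lines, "\\s+"))

# Same three summary statistics as in the table above
toy_stats <- c(Min = min(words_per_line),
               Mean = mean(words_per_line),
               Max = max(words_per_line))
print(toy_stats)
```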

Sampling the Dataset

A 1% random sample of each source is used as the sample data. As part of cleaning, non-ASCII characters are removed from the sample. Finally, the sampled blogs, news and tweets are combined into sampleData.

# Set random seed for reproducibility and sample the data
set.seed(1234)
sampleBlogs <- blogs[sample(1:length(blogs), 0.01*length(blogs), replace=FALSE)]
sampleNews <- news[sample(1:length(news), 0.01*length(news), replace=FALSE)]
sampleTwitter <- twitter[sample(1:length(twitter), 0.01*length(twitter), replace=FALSE)]

# Drop non-ASCII characters (e.g. accented letters, emoji) from the sampled blogs, news and tweets
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub="")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub="")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub="")
sampleData <- c(sampleBlogs,sampleNews,sampleTwitter)

# Remove temporary variables
rm(blogs, news, twitter, p1, p2, p3)
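
The sampling step can be wrapped in a small helper, sketched here with base R. The name sample_lines and its frac argument are hypothetical; floor() makes the sample size an explicit whole number instead of relying on sample() to truncate it.

```r
# Hypothetical helper: draw a random fraction of the elements of x without replacement
sample_lines <- function(x, frac = 0.01) {
  n <- floor(frac * length(x))
  x[sample.int(length(x), n)]
}

# Toy input standing in for the lines of one source file
toy <- paste("line", 1:1000)

set.seed(1234)
first_draw <- sample_lines(toy, 0.01)

set.seed(1234)
second_draw <- sample_lines(toy, 0.01)

length(first_draw)                 # 10 lines, i.e. 1% of 1000
identical(first_draw, second_draw) # TRUE: the same seed gives the same sample
```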

Building the Corpus and Cleaning the Sample Dataset

In this step, a function named build_corpus is defined that builds a corpus and applies the following cleaning steps:

  1. Convert all words to lowercase
  2. Eliminate punctuation
  3. Eliminate numbers
  4. Strip whitespace
  5. Eliminate banned words
  6. Eliminate English stop words
  7. Stemming (using Porter’s stemming algorithm)
  8. Create plain text format

build_corpus <- function (x = sampleData) {
  sample_c <- VCorpus(VectorSource(x)) # Create corpus dataset
  # tolower is a base R function, not a tm transformation, so it is wrapped in
  # content_transformer() to keep each document a valid text document
  sample_c <- tm_map(sample_c, content_transformer(tolower)) # All lowercase
  sample_c <- tm_map(sample_c, removePunctuation) # Eliminate punctuation
  sample_c <- tm_map(sample_c, removeNumbers) # Eliminate numbers
  sample_c <- tm_map(sample_c, stripWhitespace) # Strip whitespace
  
  # Read and process a file of banned words
  bw <- read.csv(file = 'Terms-to-Block.csv', stringsAsFactors = FALSE, skip = 3)
  bannedWords <- gsub(",", "", tolower(bw[,2]))
  sample_c <- tm_map(sample_c, removeWords, bannedWords) # Eliminate banned words
  sample_c <- tm_map(sample_c, removeWords, stopwords("english")) # Eliminate English stop words
  sample_c <- tm_map(sample_c, stemDocument) # Stem the document
  sample_c <- tm_map(sample_c, PlainTextDocument) # Create plain text format
  sample_c # Return the cleaned corpus
}
corpusData <- build_corpus(sampleData)
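
To show what these transformations do to a single line, the sketch below applies rough base R equivalents to a toy sentence. It is not the tm pipeline itself: the regular expressions only approximate removePunctuation, removeNumbers and stripWhitespace, the three-word stop list is a made-up stand-in for stopwords("english"), and stemming is omitted because it needs SnowballC.

```r
# Toy input (not taken from the corpus)
raw <- "The 3 Quick, brown foxes!  Jumped over a fence."

clean <- tolower(raw)                                # lowercase
clean <- gsub("[[:punct:]]", "", clean)              # drop punctuation
clean <- gsub("[[:digit:]]", "", clean)              # drop numbers
toy_stop <- c("the", "a", "over")                    # hypothetical stop list
pattern <- paste0("\\b(", paste(toy_stop, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)       # drop stop words
clean <- trimws(gsub("\\s+", " ", clean))            # collapse and trim whitespace

print(clean)  # "quick brown foxes jumped fence"
```

Whitespace is collapsed last in this sketch so that the gaps left by removed words are cleaned up as well.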

Tokenization and Identifying the frequencies on N-Grams

In this section, the RWeka package is first used to define a tokenizer that produces unigrams, bigrams and trigrams. The getTermTable() function then builds a term-document matrix (TDM) for the corpus and returns the n-grams that occur at least lowfreq times, sorted by descending frequency.

library(RWeka)
getTermTable <- function(corpusData, ngrams = 1, lowfreq = 50) {
  #create term-document matrix tokenized on n-grams
  tokenizer <- function(x) { NGramTokenizer(x, Weka_control(min = ngrams, max = ngrams)) }
  tdm <- TermDocumentMatrix(corpusData, control = list(tokenize = tokenizer))
  #find the top term grams with a minimum of occurrence in the corpus
  top_terms <- findFreqTerms(tdm,lowfreq)
  top_terms_freq <- rowSums(as.matrix(tdm[top_terms,]))
  top_terms_freq <- data.frame(word = names(top_terms_freq), frequency = top_terms_freq)
  arrange(top_terms_freq, desc(frequency)) # Return the table sorted by descending frequency
}
    
tt.Data <- vector("list", 3) # Pre-allocate one slot each for unigrams, bigrams and trigrams
for (i in 1:3) {
  tt.Data[[i]] <- getTermTable(corpusData, ngrams = i, lowfreq = 10)
}
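
For intuition about what the tokenizer and the frequency table contain, here is a small base R sketch that builds n-grams from one toy sentence and counts them. The helper make_ngrams is hypothetical and only mimics, in simplified form, what NGramTokenizer does when min and max are equal.

```r
# Hypothetical helper: all runs of n consecutive words, joined by spaces
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit("to be or not to be", " ", fixed = TRUE)[[1]]
bigrams <- make_ngrams(words, 2)

# Count each bigram and sort by descending frequency
counts <- sort(table(bigrams), decreasing = TRUE)
bigram_table <- data.frame(word = names(counts),
                           frequency = as.integer(counts),
                           stringsAsFactors = FALSE)
print(bigram_table)
```

The bigram "to be" occurs twice in the toy sentence, so it ends up in the first row, just as the most frequent n-grams lead each table in tt.Data.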

Word Cloud

Here are word clouds for the unigrams, bigrams and trigrams, each limited to the 100 most frequent terms.

library(wordcloud)
library(RColorBrewer)

# Set random seed for reproducibility
set.seed(1234)
# Set Plotting in 1 row 3 columns
par(mfrow=c(1, 3))
for (i in 1:3) {
  wordcloud(tt.Data[[i]]$word, tt.Data[[i]]$frequency, scale = c(3,1), max.words=100, random.order=FALSE, rot.per=0, fixed.asp = TRUE, use.r.layout = FALSE, colors=brewer.pal(8, "Dark2"))
}

Frequency Plot

Here’s a plot depicting the 20 most frequent unigrams, bigrams and trigrams.

library(gridExtra)
plot.Grams <- function (x = tt.Data, N=10) {
  g1 <- ggplot(data = head(x[[1]],N), aes(x = reorder(word, -frequency), y = frequency)) + 
        geom_bar(stat = "identity", fill = "green") + 
        ggtitle(paste("Unigrams")) + 
        xlab("Unigrams") + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
  g2 <- ggplot(data = head(x[[2]],N), aes(x = reorder(word, -frequency), y = frequency)) + 
        geom_bar(stat = "identity", fill = "blue") + 
        ggtitle(paste("Bigrams")) + 
        xlab("Bigrams") + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
  g3 <- ggplot(data = head(x[[3]],N), aes(x = reorder(word, -frequency), y = frequency)) + 
        geom_bar(stat = "identity", fill = "darkgreen") + 
        ggtitle(paste("Trigrams")) + 
        xlab("Trigrams") + ylab("Frequency") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1))
  # Put three plots into 1 row 3 columns
  gridExtra::grid.arrange(g1, g2, g3, ncol = 3)
}
library(ggplot2); library(gridExtra)
plot.Grams(x = tt.Data, N = 20)

Next Steps

The next steps of this project are to build the prediction model and to fine-tune its performance so that its memory usage and response time are at a level users will accept.
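
One possible starting point for the prediction model, offered only as an illustration and not as the design this project will adopt, is a lookup over the bigram frequency table: given the previous word, return the most frequent word that followed it. The function predict_next and the toy table below are hypothetical; the table only imitates the word/frequency layout returned by getTermTable().

```r
# Toy bigram table in the same word/frequency layout (made-up counts)
toy_bigrams <- data.frame(
  word = c("thank you", "thank god", "you are", "you can"),
  frequency = c(30L, 5L, 12L, 9L),
  stringsAsFactors = FALSE
)

# Hypothetical predictor: most frequent second word following `prev`
predict_next <- function(prev, bigrams) {
  parts <- strsplit(bigrams$word, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- which(first == tolower(prev))
  if (length(hits) == 0) return(NA_character_)  # unseen word: no suggestion
  second[hits[which.max(bigrams$frequency[hits])]]
}

predict_next("thank", toy_bigrams)  # "you"
predict_next("zebra", toy_bigrams)  # NA
```

A real model would likely need longer context, such as trigrams, and a fallback for unseen words, and its lookup tables would have to stay small to meet the memory and response-time goals.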