Task 0: Problem Understanding

In this capstone project, we will be applying data science in the area of natural language processing. The dataset comes from HC corpora (http://www.corpora.heliohost.org/), a collection of corpora for various languages. The files have been filtered by language but may still contain some foreign text. The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

First, a large corpus of text documents will be analysed to discover the data structure. This will be followed by a series of data processing steps (e.g. removing punctuation, numbers, stop words and profanities). Next, the datasets will be sampled to build a predictive text model. Eventually, a predictive text product will be built.

Task 1: Data Acquisition and Cleaning

The data are provided in several languages (English, German, Russian, Finnish) and come from different sources (blogs, news, Twitter). Only the English content will be analyzed. The list of profanities can be downloaded from http://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-banned-by-google-txt-file.zip.

## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
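The chunk that read the three raw text files is not shown in the output above. A minimal sketch of how they could have been loaded, assuming hypothetical path variables file.en.blogs, file.en.news and file.en.twitter and the same read.table pattern used for the profanity list further below; the embedded nul(s) warning above is typical of reading the raw files this way:

# Read each English corpus line by line (sep = "\n" keeps one entry per row,
# quote = "" prevents apostrophes from being treated as quoting characters)
data.en.blogs <- read.table(file.en.blogs, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)
data.en.news <- read.table(file.en.news, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)
data.en.twitter <- read.table(file.en.twitter, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)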


# Number of lines read from each source
nrow(data.en.blogs)
## [1] 898384
nrow(data.en.news)
## [1] 77258
nrow(data.en.twitter)
## [1] 2302307
head(data.en.blogs, n=3)
##                                                                                              V1
## 1 In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## 2                                                                        We love you Mr. Brown.
## 3 Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
head(data.en.news, n=3)
##                                                                                                                                                                                  V1
## 1                                                                                                                                                 He wasn't home alone, apparently.
## 2                         The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
## 3 WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
# Read profanity list
data.profanity <- read.table(file.profanity, header = FALSE, sep="\n", quote = "", stringsAsFactors = FALSE)

# Number of profanity words
nrow(data.profanity)
## [1] 550

Task 2: Exploratory Analysis

The datasets are analyzed to obtain general statistics (word and line counts). Histograms of the word counts per line are plotted for each data source. Blog entries show the widest spread in word counts per line, while tweets have a much narrower spread.

library(stringi)
## Warning: package 'stringi' was built under R version 3.1.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
# Get general text statistics
stri_stats_general(data.en.blogs$V1)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      898384      898384   207646896   171336847
stri_stats_general(data.en.news$V1)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77258       77258    15679057    13113058
stri_stats_general(data.en.twitter$V1)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2302307     2302307   150063800   123870039
# Get summary of word counts
counts.en.blogs <- unlist(stri_count_words(data.en.blogs$V1))
counts.en.news <- unlist(stri_count_words(data.en.news$V1))
counts.en.twitter <- unlist(stri_count_words(data.en.twitter$V1))

summary(counts.en.blogs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   29.00   42.32   61.00 6726.00
summary(counts.en.news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   19.00   32.00   34.86   46.00 1123.00
summary(counts.en.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00   11.00   12.25   18.00   60.00
# Plot histograms of word counts
ggplot(as.data.frame(counts.en.blogs), aes(x=counts.en.blogs)) + 
  geom_histogram(binwidth=200, fill="steelblue", color="black") + 
  labs(x="Word counts per entry", y="Frequency (log10)", title="Histogram of word counts for blogs") + 
  scale_y_log10()
## Warning: Stacking not well defined when ymin != 0

ggplot(as.data.frame(counts.en.news), aes(x=counts.en.news)) + 
  geom_histogram(binwidth=40, fill="steelblue", color="black") + 
  labs(x="Word counts per entry", y="Frequency", title="Histogram of word counts for news")

ggplot(as.data.frame(counts.en.twitter), aes(x=counts.en.twitter)) + 
  geom_histogram(binwidth=2, fill="steelblue", color="black") + 
  labs(x="Word counts per entry", y="Frequency", title="Histogram of word counts for twitter")

Preprocess the Data

To expedite data processing, each dataset will be sampled (10%). The following language processing steps are then applied, in the order implemented by getCorpus below:

- converting to lower case
- stripping white space
- removing punctuation
- removing numbers
- stemming words
- removing stop words and profanities

library(data.table)
library(NLP)
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.1.3
getCorpus <- function (data, rm_words)
{
  #build corpus
  corpus <- Corpus(VectorSource(data))
  
  # convert words to lower case
  corpus  <- tm_map(corpus, content_transformer(tolower))
  
  # remove white spaces
  corpus <- tm_map(corpus, stripWhitespace)
  
  # remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  
  # remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  
  # stem words
  corpus <- tm_map(corpus, stemDocument)
  
  # remove stopwords
  corpus <- tm_map(corpus, removeWords, rm_words)
  
  return (corpus)  
}

# Set seed
set.seed(323)

sample.pct <- 0.1
sample.data.en.blogs <- as.data.frame(data.en.blogs[sample(nrow(data.en.blogs), sample.pct * nrow(data.en.blogs)), ])
sample.data.en.news <- as.data.frame(data.en.news[sample(nrow(data.en.news), sample.pct * nrow(data.en.news)), ])
sample.data.en.twitter <- as.data.frame(data.en.twitter[sample(nrow(data.en.twitter), sample.pct * nrow(data.en.twitter)), ])

nrow(sample.data.en.blogs)
## [1] 89838
corpus.en.blogs <- getCorpus(sample.data.en.blogs, c(stopwords('english'), data.profanity$V1))
corpus.en.news <- getCorpus(sample.data.en.news, c(stopwords('english'), data.profanity$V1))
corpus.en.twitter <- getCorpus(sample.data.en.twitter, c(stopwords('english'), data.profanity$V1))

Tokenization

Unigrams, bigrams and trigrams are generated for each data source, and sparse terms are removed. The top 20 trigrams for the blog and Twitter corpora are plotted below; the top bigrams can be illustrated with word clouds (a sketch follows the trigram plots).

library(rJava)
## Warning: package 'rJava' was built under R version 3.1.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.1.3
getNGrams <- function(corpus, size, pct)
{
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = size, max = size))
  
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  
  # remove sparse terms
  tdm <- removeSparseTerms(tdm, pct)
  
  # aggregate term frequencies
  tf <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  
  return(data.frame(term=names(tf), frequency=tf))
}

# Get unigrams
unigram.en.blogs <- getNGrams(corpus.en.blogs, 1, 0.99)
unigram.en.news <- getNGrams(corpus.en.news, 1, 0.99)
unigram.en.twitter <- getNGrams(corpus.en.twitter, 1, 0.99)

# Get bigrams
bigram.en.blogs <- getNGrams(corpus.en.blogs, 2, 0.999)
bigram.en.news <- getNGrams(corpus.en.news, 2, 0.999)
bigram.en.twitter <- getNGrams(corpus.en.twitter, 2, 0.999)

# Get trigrams
trigram.en.blogs <- getNGrams(corpus.en.blogs, 3, 0.9999)
trigram.en.news <- getNGrams(corpus.en.news, 3, 0.9999)
trigram.en.twitter <- getNGrams(corpus.en.twitter, 3, 0.9999)

# Plot top trigrams
ggplot(trigram.en.blogs[1:20,], aes(x=reorder(term, frequency), y=frequency, fill=frequency)) + geom_bar(stat="identity") +
  labs(x="Term", y="Frequency", title="Top 20 trigrams for blogs") + coord_flip()

ggplot(trigram.en.twitter[1:20,], aes(x=reorder(term, frequency), y=frequency, fill=frequency)) + geom_bar(stat="identity") + 
  labs(x="Term", y="Frequency", title="Top 20 trigrams for twitter") + coord_flip()

Future Work

Prediction Strategies - Tentative Ideas

- Aggregate the data sources into a common dataset
- Increase the sample size of the datasets
- Maximize coverage with an optimal corpus size for efficiency (see the coverage sketch below)
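
A rough way to gauge the coverage idea above, sketched over the retained blog unigrams (only an approximation, since sparse terms were already removed; the frequencies are sorted in decreasing order by getNGrams):

# Cumulative fraction of word instances covered as terms are added in frequency order
coverage <- cumsum(unigram.en.blogs$frequency) / sum(unigram.en.blogs$frequency)

# Number of unique terms needed to cover 50% and 90% of the retained instances
which(coverage >= 0.5)[1]
which(coverage >= 0.9)[1]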

Shiny App - Tentative Ideas

- Provide a text input field for capturing the user's input
- Provide a list of suggestions for the next word
- Use different n-gram models to provide suggestions
- Allow incremental learning through the inclusion of the user's input
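
A minimal sketch of how the n-gram tables above could drive next-word suggestions, assuming a simple frequency-based backoff from trigrams to bigrams (predictNextWord is a hypothetical helper, not part of the report's code):

# Hypothetical helper: suggest the next word from the trigram table,
# backing off to the bigram table when no trigram matches.
predictNextWord <- function(phrase, trigrams, bigrams, n = 3)
{
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)

  # trigrams whose first two words match the end of the phrase
  hits <- trigrams[grepl(paste0("^", last2, " "), trigrams$term), ]
  if (nrow(hits) == 0) {
    # back off to bigrams whose first word matches the last word of the phrase
    hits <- bigrams[grepl(paste0("^", last1, " "), bigrams$term), ]
  }

  # tables are already sorted by frequency, so take the final word of the top matches
  head(sapply(strsplit(as.character(hits$term), " "), tail, 1), n)
}

# Example usage with the blog n-gram tables built above:
# predictNextWord("at the end of", trigram.en.blogs, bigram.en.blogs)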