In this capstone project, we apply data science in the area of natural language processing. The data come from HC corpora (http://www.corpora.heliohost.org/), a collection of corpora for various languages. The files have been filtered by language but may still contain some foreign text. The dataset can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
First, a large corpus of text documents will be analyzed to discover the structure of the data. This will be followed by a series of data processing steps (e.g. removing punctuation, numbers, stop words, and profanities). Next, the datasets will be sampled to build a predictive text model. Eventually, a predictive text product will be built.
The data are provided in several languages (English, French, German, Finnish) and come from different sources (blogs, news, Twitter). Only the English content will be analyzed. The profanity list can be downloaded from http://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-banned-by-google-txt-file.zip.
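The chunk that reads the raw text files is omitted from this report. A minimal sketch is given below, assuming the zip has been extracted into a local final/en_US/ directory and the profanity list saved in the working directory (the file names and paths are assumptions); reading with read.table(), which calls scan() internally, accounts for the embedded nul warning that follows.
# Assumed local paths after unzipping (adjust to the actual locations)
file.en.blogs <- "final/en_US/en_US.blogs.txt"
file.en.news <- "final/en_US/en_US.news.txt"
file.en.twitter <- "final/en_US/en_US.twitter.txt"
file.profanity <- "full-list-of-bad-words-banned-by-google.txt"
# Read one line of text per row, keeping quotes verbatim
data.en.blogs <- read.table(file.en.blogs, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)
data.en.news <- read.table(file.en.news, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)
data.en.twitter <- read.table(file.en.twitter, header = FALSE, sep = "\n", quote = "", stringsAsFactors = FALSE)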
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
## dec, : embedded nul(s) found in input
# Number of lines in each dataset
nrow(data.en.blogs)
## [1] 898384
nrow(data.en.news)
## [1] 77258
nrow(data.en.twitter)
## [1] 2302307
head(data.en.blogs, n=3)
## V1
## 1 In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## 2 We love you Mr. Brown.
## 3 Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
head(data.en.news, n=3)
## V1
## 1 He wasn't home alone, apparently.
## 2 The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
## 3 WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
# Read profanity list
data.profanity <- read.table(file.profanity, header = FALSE, sep="\n", quote = "", stringsAsFactors = FALSE)
# Number of words
nrow(data.profanity)
## [1] 550
The datasets are analyzed to obtain general statistics (line counts, character counts, word counts). Histograms of the word count per line are plotted for each data source. Blog entries show by far the widest spread in words per line, while tweets show a much smaller spread, consistent with Twitter's character limit.
library(stringi)
## Warning: package 'stringi' was built under R version 3.1.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
# Get general text statistics
stri_stats_general(data.en.blogs$V1)
## Lines LinesNEmpty Chars CharsNWhite
## 898384 898384 207646896 171336847
stri_stats_general(data.en.news$V1)
## Lines LinesNEmpty Chars CharsNWhite
## 77258 77258 15679057 13113058
stri_stats_general(data.en.twitter$V1)
## Lines LinesNEmpty Chars CharsNWhite
## 2302307 2302307 150063800 123870039
# Get summary of word counts
counts.en.blogs <- unlist(stri_count_words(data.en.blogs$V1))
counts.en.news <- unlist(stri_count_words(data.en.news$V1))
counts.en.twitter <- unlist(stri_count_words(data.en.twitter$V1))
summary(counts.en.blogs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 29.00 42.32 61.00 6726.00
summary(counts.en.news)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 19.00 32.00 34.86 46.00 1123.00
summary(counts.en.twitter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 11.00 12.25 18.00 60.00
# Plot histograms of word counts
ggplot(as.data.frame(counts.en.blogs), aes(x=counts.en.blogs)) +
geom_histogram(binwidth=200, fill="steelblue", color="black") +
labs(x="Word counts per entry", y="Frequency (log10)", title="Histogram of word counts for blogs") +
scale_y_log10()
## Warning: Stacking not well defined when ymin != 0
ggplot(as.data.frame(counts.en.news), aes(x=counts.en.news)) +
geom_histogram(binwidth=40, fill="steelblue", color="black") +
labs(x="Word counts per entry", y="Frequency", title="Histogram of word counts for news")
ggplot(as.data.frame(counts.en.twitter), aes(x=counts.en.twitter)) +
geom_histogram(binwidth=2, fill="steelblue", color="black") +
labs(x="Word counts per entry", y="Frequency", title="Histogram of word counts for twitter")
To expedite data processing, a 10% sample is drawn from each dataset. The following language processing steps are then performed (see getCorpus below):
- converting to lower case
- stripping white space
- removing punctuation
- removing numbers
- stemming words
- removing stop words and profanities
library(data.table)
library(NLP)
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tm)
library(SnowballC)
## Warning: package 'SnowballC' was built under R version 3.1.3
getCorpus <- function(data, rm_words)
{
  # build corpus
  corpus <- Corpus(VectorSource(data))
  # convert words to lower case
  corpus <- tm_map(corpus, content_transformer(tolower))
  # strip white space
  corpus <- tm_map(corpus, stripWhitespace)
  # remove punctuation
  corpus <- tm_map(corpus, removePunctuation)
  # remove numbers
  corpus <- tm_map(corpus, removeNumbers)
  # stem words
  corpus <- tm_map(corpus, stemDocument)
  # remove stop words and profanities
  corpus <- tm_map(corpus, removeWords, rm_words)
  return(corpus)
}
# Set seed
set.seed(323)
sample.pct <- 0.1
sample.data.en.blogs <- as.data.frame(data.en.blogs[sample(nrow(data.en.blogs), sample.pct * nrow(data.en.blogs)), ])
sample.data.en.news <- as.data.frame(data.en.news[sample(nrow(data.en.news), sample.pct * nrow(data.en.news)), ])
sample.data.en.twitter <- as.data.frame(data.en.twitter[sample(nrow(data.en.twitter), sample.pct * nrow(data.en.twitter)), ])
nrow(sample.data.en.blogs)
## [1] 89838
corpus.en.blogs <- getCorpus(sample.data.en.blogs, c(stopwords('english'), data.profanity$V1))
corpus.en.news <- getCorpus(sample.data.en.news, c(stopwords('english'), data.profanity$V1))
corpus.en.twitter <- getCorpus(sample.data.en.twitter, c(stopwords('english'), data.profanity$V1))
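To sanity-check the cleaning, a few documents of the resulting corpus can be inspected; a quick usage example with tm's inspect() (output not shown):
# Peek at the first three cleaned blog documents
inspect(corpus.en.blogs[1:3])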
Unigrams, bigrams and trigrams are generated for each data source, and sparse terms are removed from the term-document matrices. The top 20 trigrams for the blog and Twitter corpora are plotted below, and the top bigrams can be illustrated with word clouds (see the sketch after the plots).
library(rJava)
## Warning: package 'rJava' was built under R version 3.1.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.1.3
getNGrams <- function(corpus, size, pct)
{
  # tokenize into n-grams of the requested size
  tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = size, max = size))
  tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
  # remove sparse terms
  tdm <- removeSparseTerms(tdm, pct)
  # aggregate term frequencies across documents
  tf <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(term = names(tf), frequency = tf))
}
# Get unigrams
unigram.en.blogs <- getNGrams(corpus.en.blogs, 1, 0.99)
unigram.en.news <- getNGrams(corpus.en.news, 1, 0.99)
unigram.en.twitter <- getNGrams(corpus.en.twitter, 1, 0.99)
# Get bigrams
bigram.en.blogs <- getNGrams(corpus.en.blogs, 2, 0.999)
bigram.en.news <- getNGrams(corpus.en.news, 2, 0.999)
bigram.en.twitter <- getNGrams(corpus.en.twitter, 2, 0.999)
# Get trigrams
trigram.en.blogs <- getNGrams(corpus.en.blogs, 3, 0.9999)
trigram.en.news <- getNGrams(corpus.en.news, 3, 0.9999)
trigram.en.twitter <- getNGrams(corpus.en.twitter, 3, 0.9999)
# Plot top trigrams
ggplot(trigram.en.blogs[1:20,], aes(x=reorder(term, frequency), y=frequency, fill=frequency)) + geom_bar(stat="identity") +
labs(x="Term", y="Frequency", title="Top 20 trigrams for blogs") + coord_flip()
ggplot(trigram.en.twitter[1:20,], aes(x=reorder(term, frequency), y=frequency, fill=frequency)) + geom_bar(stat="identity") +
labs(x="Term", y="Frequency", title="Top 20 trigrams for twitter") + coord_flip()
Prediction Strategies - Tentative Ideas

- Aggregate the data sources into a common dataset
- Increase the sample size of the datasets
- Maximize coverage with an optimal corpus size for efficiency

Shiny App - Tentative Ideas

- Provide a text input for capturing the user's typing
- Provide a list of suggestions for the next word
- Use different n-gram models to provide suggestions (see the lookup sketch below)
- Allow incremental learning through the inclusion of users' inputs
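As a rough illustration of how the n-gram tables above could drive next-word suggestions, a minimal backoff lookup is sketched below; the function name and logic are assumptions, not the final model.
predictNext <- function(prefix, trigrams, bigrams, n = 3)
{
  words <- unlist(strsplit(tolower(prefix), "\\s+"))
  # try trigrams keyed on the last two words of the prefix
  if (length(words) >= 2) {
    key <- paste(tail(words, 2), collapse = " ")
    hits <- trigrams[grepl(paste0("^", key, " "), trigrams$term), ]
    if (nrow(hits) > 0) return(head(sub(paste0("^", key, " "), "", hits$term), n))
  }
  # back off to bigrams keyed on the last word only
  key <- tail(words, 1)
  hits <- bigrams[grepl(paste0("^", key, " "), bigrams$term), ]
  head(sub(paste0("^", key, " "), "", hits$term), n)
}
# Example call: predictNext("happy new", trigram.en.blogs, bigram.en.blogs)
In practice, the user input would need the same preprocessing (lower-casing, stemming) as the corpus before the lookup.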