The goal of this report is to present a basic exploratory analysis of the data for the NLP next-word prediction algorithm built as the Capstone project of the Data Science Specialization.
We are given three different datasets in English: blog, Twitter and news data.
Here we present some important statistics about the data and build n-gram models after tokenization.
We simply read in the data files for the English language, assuming they are in the current working directory.
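A quick way to verify that assumption before reading the files (a minimal optional sketch):
# Optional sanity check: stop early if any of the three data files is missing
stopifnot(file.exists(c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")))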
#Read in files
blogs <-readLines("en_US.blogs.txt", skipNul=TRUE, encoding="UTF-8")
twitter <-readLines("en_US.twitter.txt", skipNul=TRUE, encoding="UTF-8")
news <-readLines("en_US.news.txt", skipNul=TRUE, encoding="UTF-8")
Here we compute some basic statistics about our dataset files. The library chosen to do word counts and sentence counts easily is stringi.
I selected the following characteristics to report: FILE SIZE, WORD COUNT, LINE COUNT and SENTENCE COUNT.
library(stringi)
## File sizes (in Bytes)
blogs_fs <- file.info("en_US.blogs.txt")$size
twitter_fs <- file.info("en_US.twitter.txt")$size
news_fs <- file.info("en_US.news.txt")$size
## Word counts
blogs_wc <- sum(stri_count_words(blogs))
twitter_wc <- sum(stri_count_words(twitter))
news_wc <- sum(stri_count_words(news))
## Line Counts (Notice this is just the length after readLines())
blogs_lc <- length(blogs)
twitter_lc <- length(twitter)
news_lc <- length(news)
## Sentence count. Note this is different from the line count because
## a line may contain multiple sentences separated by punctuation.
blogs_sc <- sum(stri_count_boundaries(blogs, type="sentence"))
twitter_sc <- sum(stri_count_boundaries(twitter, type="sentence"))
news_sc <- sum(stri_count_boundaries(news, type="sentence"))
Now we can combine our statistics into a single table and report it.
## Combine stats
stats <- cbind(c("Blogs", "Tweets", "News"),
               c(blogs_fs, twitter_fs, news_fs),
               c(blogs_wc, twitter_wc, news_wc),
               c(blogs_lc, twitter_lc, news_lc),
               c(blogs_sc, twitter_sc, news_sc))
stats <- as.data.frame(stats)
names(stats) <- c(" DATASET"," FILE SIZE", " WORD COUNT", " LINE COUNT", " SENTENCE COUNT")
print(stats, row.names = FALSE)
## DATASET FILE SIZE WORD COUNT LINE COUNT SENTENCE COUNT
## Blogs 210160014 37541795 899288 2380481
## Tweets 167105338 30092907 2360148 3780376
## News 205811889 34762303 1010242 2025776
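The file sizes above are in bytes; for readability they could also be expressed in megabytes (1 MB = 1024^2 bytes), for example:
# Optional: report file sizes in megabytes instead of bytes
round(c(Blogs = blogs_fs, Tweets = twitter_fs, News = news_fs) / 1024^2, 1)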
Due to the large size of the data, for practical purposes we sample a smaller portion. Honestly, my current system cannot handle even 5 percent of the data without running out of memory,
so for now we settle for 1 percent of the data.
# Sample 1% of each dataset based on line counts. Larger samples give Out of Memory errors.
set.seed(1234)  # fix the random seed so the sample is reproducible
blogs_sample <- sample(blogs, round(length(blogs) * 0.01))
twitter_sample <- sample(twitter, round(length(twitter) * 0.01))
news_sample <- sample(news, round(length(news) * 0.01))
Now we can create the corpus using the tm library.
Note that we need to do some standard clean-up, i.e. removing punctuation and numbers, stripping extra whitespace and ignoring case.
At this point profanity is not handled. Stopword removal is also not applied, because I am not yet sure that removing stopwords from the corpus is a good idea for a “predict the next word” application (a sketch of what it would look like is shown after the clean-up code below).
That is, how will the app predict that the next word is most likely a stopword if we remove stopwords completely from the corpus?
I have read that removing stopwords is standard for search algorithms or topic discovery, but I am not sure it fits here.
library(tm)
CombinedSamples <- c(blogs_sample, twitter_sample, news_sample)
# Create the Corpus
corpus <- Corpus(VectorSource(CombinedSamples))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
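For reference, this is roughly how stopword removal (and, later, profanity filtering) would slot into the clean-up pipeline as additional tm_map() steps. It is deliberately not evaluated in this report, and the profanity word-list file name is only a placeholder:
# Not run in this report -- sketch of the optional clean-up steps discussed above
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
# profanity <- readLines("profanity_list.txt")  # hypothetical word list
# corpus <- tm_map(corpus, removeWords, profanity)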
We use the RWeka library for tokenization and n-gram model building. Note that RWeka behaves better when punctuation has already been removed while building the corpus.
We will build unigrams, bigrams and trigrams. For the final app I think 4-grams might be necessary.
library(RWeka)
# RWeka's tokenizer expects plain character text, so flatten the corpus first
corpus_text <- sapply(corpus, as.character)
unigrams <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 1, max = 1))))
bigrams <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 2, max = 2))))
trigrams <- data.frame(table(NGramTokenizer(corpus_text, Weka_control(min = 3, max = 3))))
To get an idea of the most common n-grams, I order each table by frequency and draw a basic bar plot of the top 40.
# Top 40 Unigrams
top_unigrams <- head(unigrams[order(unigrams$Freq, decreasing = TRUE), ], 40)
barplot(top_unigrams$Freq, names.arg = as.character(top_unigrams$Var1), border = NA,
        las = 2, main = "Top Unigrams", cex.main = 1, cex.names = 0.75, col = "tomato")
# Top 40 Bigrams
top_bigrams <- head(bigrams[order(bigrams$Freq, decreasing = TRUE), ], 40)
barplot(top_bigrams$Freq, names.arg = as.character(top_bigrams$Var1), border = NA,
        las = 2, main = "Top Bigrams", cex.main = 1, cex.names = 0.75, col = "tomato")
# Top 40 Trigrams
top_trigrams <- head(trigrams[order(trigrams$Freq, decreasing = TRUE), ], 40)
barplot(top_trigrams$Freq, names.arg = as.character(top_trigrams$Var1), border = NA,
        las = 2, main = "Top Trigrams", cex.main = 1, cex.names = 0.75, col = "tomato")
For the future I plan to implement profanity filtering and to find out whether removing stopwords is a good idea.
The size of the corpus is also an issue for computation, so I believe there must be a more clever way of handling it to reduce the computational cost.
I also plan to do smarter sampling, e.g. using different sample sizes for the different datasets, because I believe the Twitter data is a more relevant signal than formatted blog or news content for an app like SwiftKey.
As for the performance of my current prediction approach, if the exact n-gram exists in the corpus the prediction is quite good,
but performance must be improved for the cases where the n-gram does not exist in the corpus (a minimal sketch of the basic lookup idea follows below).
I believe smoothing and topical analysis are the techniques to handle this case, which I plan to work on for the final app.
Finally, if possible, I plan to include 4-grams too.
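To make the last point concrete, here is a minimal sketch of a frequency-based lookup with back-off over the n-gram tables built above. predict_next() is a hypothetical helper, not the final algorithm: it performs no smoothing, which is exactly the gap described above for unseen n-grams.
# Minimal back-off sketch: look up the most frequent trigram continuation,
# fall back to bigrams, and finally to the single most frequent unigram.
predict_next <- function(phrase) {
  # Clean the input roughly the same way the corpus was cleaned
  words <- tolower(gsub("[[:punct:][:digit:]]", "", phrase))
  words <- unlist(strsplit(trimws(words), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[grepl(paste0("^", prefix, " "), trigrams$Var1), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$Var1[which.max(hits$Freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  if (n >= 1) {
    hits <- bigrams[grepl(paste0("^", words[n], " "), bigrams$Var1), ]
    if (nrow(hits) > 0) {
      best <- as.character(hits$Var1[which.max(hits$Freq)])
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # No match at all: fall back to the overall most frequent word
  as.character(unigrams$Var1[which.max(unigrams$Freq)])
}
# Example usage
predict_next("thanks for the")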