The Data Science Capstone Project consists of designing and implementing a language model capable of predicting the next word in a sentence. For this we must put into practice all the skills acquired during the previous 9 courses, plus some additional skills that we have to learn on our own along the way, mainly NLP techniques.
This problem was introduced before and is also known as a variant of "The Shannon Game", where we calculate the probability of a word given a previous sequence of words.
In this report we show an initial exploratory analysis of the data provided to train and test our model, as well as an initial predictive model based on the Markov assumption and n-gram counts.
Our dataset consists of 3 files with samples of text from Twitter, news websites and blogs.
Let’s take a look at these files and summarize their content.
| Dataset | Twitter | News | Blogs |
|---|---|---|---|
| Size | 159.4 MB | 196.3 MB | 200.4 MB |
| Size in Memory | 301.4 MB | 19.2 MB | 248.5 MB |
| Lines | 2360148 | 77259 | 899288 |
| Word Count | 30373543 | 2643969 | 37334131 |
I obtained "Size", "Size in Memory" and "Lines" from the RStudio interface. To perform the word counts I used:
twitter.nwords <- sum(sapply(gregexpr("\\S+", twitter), length))
news.nwords <- sum(sapply(gregexpr("\\S+", news), length))
blogs.nwords <- sum(sapply(gregexpr("\\S+", blogs), length))
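The twitter, news and blogs objects referenced above hold the raw lines of each file. A minimal sketch of how they can be loaded is shown below (the en_US.* file names and working-directory paths are assumptions; adjust them to your setup):
# Read the raw text files line by line (paths/names assumed, adjust as needed)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)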
For this part of the project I’ll be using the “tm”, “RWeka” and “dplyr” libraries to process and manipulate the data, so the initialization step sets the seed (to make the results reproducible) and then loads them.
Also, as you can see in the table above, these are huge files, which would make our processing very slow, so I’m going to use only a sample of them.
The coding is done on a modular basis in order to keep it easy to maintain and upgrade.
initialize <- function(complete=TRUE){
  # Set the seed for reproducibility and load the required libraries
  set.seed(1)
  library(tm)
  library(RWeka)
  library(dplyr)
}
create.corpus <- function(sampleSize){
  # For this report only the twitter data is sampled (see the note below)
  s.twitter <- sample(twitter, sampleSize)
  #s.news <- sample(news, sampleSize)
  #s.blogs <- sample(blogs, sampleSize)
  #raw_corpus <- c(s.twitter, s.news, s.blogs)
  s.twitter
}
initialize()
# Take a 50K sample from the twitter data to build the raw corpus
corp <- create.corpus(50000)
As you can see, for this report I have only chosen data from the Twitter dataset. The motivation for this is that prediction algorithms like this one are mostly used on mobile devices for short text messages, like tweets, Facebook statuses or SMS. News and blogs are more complex compositions and are normally written on laptops with full keyboards, so I want to train my model on this kind of short text.
The next step is to “clean” the corpus we have created. For this task we will use the “tm” package, which provides several functions for it. Follow the code comments for further description.
clean.corpus <- function(corp){
  # Create tm corpus
  corpus <- Corpus(VectorSource(corp))
  # Remove unknown characters
  corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
  # Reduce all to lowercase
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Remove URLs with custom content transformers
  removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
  removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
  corpus <- tm_map(corpus, content_transformer(removeURL))
  corpus <- tm_map(corpus, content_transformer(removeWWW))
  # Remove punctuation, numbers and additional white space
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}
corp <- clean.corpus(corp)
corp.no.stop <- tm_map(corp, removeWords, stopwords("english"))
As you can see, I have generated two corpora, one with and the other without stopwords. This is because we need the stopwords to generate representative bigrams and trigrams, but we also want to know which are the most frequent “core” words.
Following the Markov assumption, we will estimate the probability of the next word in a sentence from the probability of that word given the last two words, i.e. P(w3 | w1, w2) ≈ count(w1 w2 w3) / count(w1 w2), and then use a backoff model for unknown bigrams.
In order to implement this technique, we need to count all unigrams, bigrams and trigrams in our corpus. To do this we will use the “NGramTokenizer” function from the “RWeka” package to generate the n-grams, then convert them to a data.frame, sort them, and filter out the n-grams that do not reach a minimum count, which can be set in the function call.
tokenize <- function(cor, uni.min.freq=0, bi.min.freq=0, tri.min.freq=0){
  # Convert tm corpus to text as RWeka requires
  text <- data.frame(text=unlist(sapply(cor, `[`, "content")), stringsAsFactors = FALSE)
  # Create, convert, sort and filter Unigrams
  Unigrams.list <- NGramTokenizer(text, Weka_control(min = 1, max = 1))
  Unigrams <- data.frame(table(Unigrams.list))
  Unigrams <- Unigrams %>% filter(Freq > uni.min.freq) %>% arrange(desc(Freq))
  # Create, convert, sort and filter Bigrams
  Bigrams.list <- NGramTokenizer(text, Weka_control(min = 2, max = 2))
  Bigrams <- data.frame(table(Bigrams.list))
  Bigrams <- Bigrams %>% filter(Freq > bi.min.freq) %>% arrange(desc(Freq))
  # Create, convert, sort and filter Trigrams
  Trigrams.list <- NGramTokenizer(text, Weka_control(min = 3, max = 3))
  Trigrams <- data.frame(table(Trigrams.list))
  Trigrams <- Trigrams %>% filter(Freq > tri.min.freq) %>% arrange(desc(Freq))
  # Return the three tables as a list
  ngrams <- list(Unigrams, Bigrams, Trigrams)
  ngrams
}
ngrams <- tokenize(corp,5,1,1)
unigrams <- ngrams[[1]]
bigrams <- ngrams[[2]]
trigrams <- ngrams[[3]]
rm(ngrams)
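To illustrate how these tables feed the Markov/backoff idea described above, here is a minimal, hypothetical sketch of a next-word lookup with simple frequency-based backoff (the predict.next.word helper and its example call are assumptions for illustration, not the final model; it expects plain lowercase words without regex metacharacters):
predict.next.word <- function(w1, w2) {
  # Look for trigrams starting with "w1 w2" and keep the most frequent completion
  tri.match <- trigrams[grepl(paste0("^", w1, " ", w2, " "), trigrams$Trigrams.list), ]
  if (nrow(tri.match) > 0) {
    best <- as.character(tri.match$Trigrams.list[which.max(tri.match$Freq)])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  # Back off to bigrams starting with w2
  bi.match <- bigrams[grepl(paste0("^", w2, " "), bigrams$Bigrams.list), ]
  if (nrow(bi.match) > 0) {
    best <- as.character(bi.match$Bigrams.list[which.max(bi.match$Freq)])
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  # Last resort: the most frequent unigram overall
  as.character(unigrams$Unigrams.list[which.max(unigrams$Freq)])
}
# Example call (illustrative only)
predict.next.word("thanks", "for")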
Now let’s take a look at the most common n-grams in barplots.
par(mfrow=c(1,3))
barplot(unigrams$Freq[1:20],
        xlab='Unigrams',
        ylab='Count',
        main='Unigrams Frequency',
        names.arg= unigrams$Unigrams.list[1:20], las=2)
barplot(bigrams$Freq[1:20],
        xlab='Bigrams',
        ylab='Count',
        main='Bigrams Frequency',
        names.arg= bigrams$Bigrams.list[1:20], las=2)
barplot(trigrams$Freq[1:20],
        xlab='Trigrams',
        ylab='Count',
        main='Trigrams Frequency',
        names.arg= trigrams$Trigrams.list[1:20], las=2)
(Figure: barplots of the 20 most frequent unigrams, bigrams and trigrams.)
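The stopword-free corpus can be passed through the same tokenize function to inspect the most frequent “core” words mentioned earlier; a quick illustrative example (the core.ngrams name is only used in this sketch):
core.ngrams <- tokenize(corp.no.stop, 5, 1, 1)
# The 20 most frequent unigrams once stopwords are removed
head(core.ngrams[[1]], 20)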
In order to improve accuracy, other techniques such as smoothing and backoff must be used to complement the n-gram implementation.
I also want to use semantic indicators such as POS tagging to help with the prediction.
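As a first idea of what the smoothing step could look like, here is a minimal sketch of an add-one (Laplace) estimate built on the counts above (the laplace.prob helper is purely illustrative and not part of the current model):
# Add-one (Laplace) smoothed estimate of P(w3 | w1 w2):
# tri.count = count(w1 w2 w3), bi.count = count(w1 w2),
# V = vocabulary size (approximated here by the number of unigrams kept)
laplace.prob <- function(tri.count, bi.count, V = nrow(unigrams)) {
  (tri.count + 1) / (bi.count + V)
}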