The Data Science Capstone project consists of designing and implementing a language model capable of predicting the next word in a sentence. To do this we put into practice the skills acquired during the previous 9 courses, plus some additional skills we have to learn along the way, mainly NLP techniques.
This problem was introduced earlier in the specialization and is also known as a variant of the Shannon Game, where we calculate the probability of a word given the sequence of words that precedes it.
In this report we show an initial exploratory analysis of the data provided to train and test our model, as well as an initial predictive model based on the Markov assumption and n-gram counts.
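Under the Markov assumption, the probability of the next word depends only on the last few words, and those conditional probabilities can be estimated directly from n-gram counts. As an illustrative example for a trigram model (notation mine, added for clarity):

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-2}, w_{i-1})}
$$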
Note: This is the second time I take this course, so there is a very similar Milestone Report here, which was also written by me. I mention it in case you check for plagiarism.
The first step in the process is to load the datasets into the R environment. The files were downloaded from the URL provided in the assignment, unzipped and placed in the R project working directory.
# Reading text files from working directory
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
Our dataset consists of 3 files with samples of text from Twitter, news websites and blogs. Let's take a look at these files and summarize their content.
| Source | Twitter | Blogs | News |
|---|---|---|---|
| File Size [Mb] | 159.4 | 200.4 | 196.3 |
| Size in Memory [Mb] | 301.4 | 248.5 | 249.6 |
| Lines | 2360148 | 899288 | 1010242 |
| Word Count | 30373543 | 37334131 | 34372530 |
The file size, size in memory and number of lines of the source files were obtained from the RStudio interface, while the word counts were computed with the following commands.
# Count whitespace-separated tokens in each source
twitter.nwords <- sum(sapply(gregexpr("\\S+", twitter), length))
blogs.nwords <- sum(sapply(gregexpr("\\S+", blogs), length))
news.nwords <- sum(sapply(gregexpr("\\S+", news), length))
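For reference, similar figures can also be computed directly in R instead of reading them off the RStudio interface (a quick sketch using base R on the files and objects loaded above; the numbers may differ slightly from the table):

# File size on disk (Mb), object size in memory and number of lines, e.g. for the Twitter sample
file.size("en_US.twitter.txt") / 1024^2
format(object.size(twitter), units = "Mb")
length(twitter)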
To ensure reproducibility I use the set.seed() function so that the random sampling is the same in every run. In this Milestone Report I also use the RWeka and tm libraries for corpus manipulation and tokenization, so I load them in the project.
As we saw in the previous table, the files are quite large and it would be very time consuming to perform the analysis over the whole dataset, so I sample 20K lines from each source in order to test the algorithm faster.
# Load the required libraries and fix the random seed for reproducibility
initialize <- function(){
set.seed(1)
library(tm)
library(RWeka)
}
# Sample the same number of lines from each source and combine them into a raw corpus
create.corpus <- function(sampleSize){
s.twitter <- sample(twitter, sampleSize)
s.news <- sample(news, sampleSize)
s.blogs <- sample(blogs, sampleSize)
c(s.twitter, s.news, s.blogs)
}
initialize()
corp <- create.corpus(20000)
Now that we have our "raw corpus" we need to clean it. This means removing all characters or strings that are not meaningful words. For this we follow a fairly standard process: create a tm corpus and apply a series of cleaning transformations to it. The code comments describe each step.
clean.corpus <- function(corp){
# Create tm formatted corpus
corpus <- Corpus(VectorSource(corp))
# Remove unknown characters
corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# Reduce all to lowercase (wrapped in content_transformer to keep documents valid)
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs with custom content transformers
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
corpus <- tm_map(corpus, content_transformer(removeWWW))
# Remove punctuation, numbers and additional white spaces.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus
}
corp <- clean.corpus(corp)
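Before tokenizing, it is worth spot-checking a cleaned document to confirm the transformations behaved as expected (a quick check; content() comes with the tm/NLP packages loaded above):

# Peek at the content of the first cleaned document
content(corp[[1]])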
The next step in the process is to tokenize our corpus, meaning we split the text into n-grams and compute the frequency of each of them. For this we create a Term Document Matrix with the tm package, then drop very sparse (low frequency) terms and sort the result by decreasing frequency.
# Creating Bigram and Trigram Tokenizer functions (using ngrams() and words() from the NLP package, loaded with tm)
BigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
TrigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
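As a quick illustration of what these tokenizers return, applying one of them to a small PlainTextDocument should yield the overlapping word pairs; the tiny example below is mine and the output shown is only indicative:

# Hypothetical example of the bigram tokenizer on a tiny document
BigramTokenizer(PlainTextDocument("the quick brown fox jumps"))
## e.g. "the quick" "quick brown" "brown fox" "fox jumps"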
# Compute n-gram frequencies from a TDM and return them as a data frame sorted by decreasing frequency
freq_df <- function(tdm){
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freq_df <- data.frame(word=names(freq), freq=freq)
return(freq_df)
}
# Creating TDM and NGrams frequency dataframe
unigram <- removeSparseTerms(TermDocumentMatrix(corp), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = TrigramTokenizer)), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
trigram_freq <- freq_df(trigram)
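Each of these frequency data frames simply maps an n-gram to the number of times it appears in the sampled corpus, sorted in decreasing order; this is the raw material for the count-based predictor. For example, to peek at the most frequent trigrams:

# Top trigrams by frequency in the sample
head(trigram_freq)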
Now that we have the n-grams counted and sorted, let's plot the 20 most frequent of each to visualize their distribution.
barplot(unigram_freq$freq[1:20],
#xlab='Unigrams',
ylab='Count',
main='Unigrams Frequency',
names.arg= unigram_freq$word[1:20], las=2)
barplot(bigram_freq$freq[1:20],
#xlab='Bigrams',
ylab='Count',
main='Bigrams Frequency',
names.arg= bigram_freq$word[1:20], las=2)
barplot(trigram_freq$freq[1:20],
#xlab='Trigrams',
ylab='Count',
main='Trigrams Frequency',
names.arg= trigram_freq$word[1:20], las=2)
In order to improve accuracy, other techniques must be used to complement the n-gram implementation, such as smoothing and backoff. I also want to use semantic indicators such as POS tagging to help with the prediction.
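As a preview of that next step, the sketch below shows one very simple backoff-style lookup built on the frequency data frames created above: try the trigram table first, back off to the bigram table, and finally fall back to the most frequent unigram. This is only an illustrative sketch (no smoothing or backoff weights yet), and the helper name predict.backoff is mine, not part of the final model.

# Illustrative sketch: crude backoff prediction using the *_freq data frames.
# NOTE: predict.backoff is a hypothetical helper, not the final model.
predict.backoff <- function(w1, w2) {
  # 1. Most frequent trigram starting with "w1 w2"
  hits <- trigram_freq[grepl(paste0("^", w1, " ", w2, " "), trigram_freq$word), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  # 2. Back off to the most frequent bigram starting with "w2"
  hits <- bigram_freq[grepl(paste0("^", w2, " "), bigram_freq$word), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  # 3. Fall back to the most frequent unigram
  as.character(unigram_freq$word[1])
}

# Example call: predict the word most likely to follow "one of"
predict.backoff("one", "of")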