The Data Science Capstone project consists of designing and implementing a language model capable of predicting the next word in a sentence. To do this we put into practice the skills acquired during the previous 9 courses, plus some additional skills we have to learn along the way, mainly NLP techniques.
This problem was introduced earlier in the specialization and is also known as a variant of the Shannon Game, where we calculate the probability of a word given the sequence of words that precedes it.
In this report we show an initial exploratory analysis of the data provided to train and test our model, as well as an initial predictive model based on the Markov assumption and n-gram counts.
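Under the Markov assumption, the probability of the next word depends only on the last few words, and those conditional probabilities can be estimated directly from n-gram counts. As an illustrative example for a trigram model (notation mine, added for clarity):

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{count}(w_{i-2}, w_{i-1}, w_i)}{\text{count}(w_{i-2}, w_{i-1})}
$$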
Note: This is the second time I take this course, so there is a very similar Milestone Report here, which was also written by me. I mention it in case you check for plagiarism.
The first step in the process is to load the datasets into the R environment. The files were downloaded from the URL provided in the assignment, unzipped and placed in the R project working directory.
# Reading text files from working directory
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
Our dataset consists of 3 files with samples of text from Twitter, news websites and blogs. Let's take a look at these files and summarize their content.
| Source | Twitter | Blogs | News |
|---|---|---|---|
| File Size [Mb] | 159.4 | 200.4 | 196.3 |
| Size in Memory [Mb] | 301.4 | 248.5 | 249.6 |
| Lines | 2360148 | 899288 | 1010242 |
| Word Count | 30373543 | 37334131 | 34372530 |
The file size, size in memory and number of lines of the source files were obtained from the RStudio interface, while the word counts were computed with the following commands.
# Count whitespace-separated tokens in each source
twitter.nwords <- sum(sapply(gregexpr("\\S+", twitter), length))
blogs.nwords <- sum(sapply(gregexpr("\\S+", blogs), length))
news.nwords <- sum(sapply(gregexpr("\\S+", news), length))
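For reference, similar figures can also be computed directly in R instead of reading them off the RStudio interface (a quick sketch using base R on the files and objects loaded above; the numbers may differ slightly from the table):

# File size on disk (Mb), object size in memory and number of lines, e.g. for the Twitter sample
file.size("en_US.twitter.txt") / 1024^2
format(object.size(twitter), units = "Mb")
length(twitter)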
To ensure reproducibility I use the set.seed() function so that the random sampling is the same in every run. In this Milestone Report I also use the RWeka and tm libraries for corpus manipulation and tokenization, so I load them in the project.
As we saw in the previous table, the files are quite large and it would be very time consuming to perform the analysis over the whole dataset, so I sample 20K lines from each source in order to test the algorithm faster.
# Load the required libraries and fix the random seed for reproducibility
initialize <- function(){
set.seed(1)
library(tm)
library(RWeka)
}
# Sample the same number of lines from each source and combine them into a raw corpus
create.corpus <- function(sampleSize){
s.twitter <- sample(twitter, sampleSize)
s.news <- sample(news, sampleSize)
s.blogs <- sample(blogs, sampleSize)
c(s.twitter, s.news, s.blogs)
}
initialize()
corp <- create.corpus(20000)
Now that we have our "raw corpus" we need to clean it. This means removing all characters or strings that are not meaningful words. For this we follow a fairly standard process: create a tm corpus and apply a series of cleaning transformations to it. The code comments describe each step.
clean.corpus <- function(corp){
# Create tm formatted corpus
corpus <- Corpus(VectorSource(corp))
# Remove unknown characters
corpus <- tm_map(corpus, content_transformer(function(x) iconv(enc2utf8(x), sub = "byte")))
# Reduce all to lowercase (wrapped in content_transformer to keep documents valid)
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs with custom content transformers
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
removeWWW <- function(x) gsub("www[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
corpus <- tm_map(corpus, content_transformer(removeWWW))
# Remove punctuation, numbers and additional white spaces.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus
}
corp <- clean.corpus(corp)
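Before tokenizing, it is worth spot-checking a cleaned document to confirm the transformations behaved as expected (a quick check; content() comes with the tm/NLP packages loaded above):

# Peek at the content of the first cleaned document
content(corp[[1]])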
The next step in the process is to tokenize our corpus, meaning we split the text into n-grams and compute the frequency of each of them. For this we create a Term Document Matrix with the tm package, then drop very sparse (low frequency) terms and sort the result by decreasing frequency.
# Creating Bigram and Trigram Tokenizer functions (using ngrams() and words() from the NLP package, loaded with tm)
BigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
TrigramTokenizer <- function(x){
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
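As a quick illustration of what these tokenizers return, applying one of them to a small PlainTextDocument should yield the overlapping word pairs; the tiny example below is mine and the output shown is only indicative:

# Hypothetical example of the bigram tokenizer on a tiny document
BigramTokenizer(PlainTextDocument("the quick brown fox jumps"))
## e.g. "the quick" "quick brown" "brown fox" "fox jumps"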
# Compute n-gram frequencies from a TDM and return them as a data frame sorted by decreasing frequency
freq_df <- function(tdm){
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freq_df <- data.frame(word=names(freq), freq=freq)
return(freq_df)
}
# Creating TDM and NGrams frequency dataframe
unigram <- removeSparseTerms(TermDocumentMatrix(corp), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
unigram_freq <- freq_df(unigram)
bigram <- removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = BigramTokenizer)), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
bigram_freq <- freq_df(bigram)
trigram <- removeSparseTerms(TermDocumentMatrix(corp, control = list(tokenize = TrigramTokenizer)), 0.9999)
## Warning in nr * nc: NAs produced by integer overflow
trigram_freq <- freq_df(trigram)
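Each of these frequency data frames simply maps an n-gram to the number of times it appears in the sampled corpus, sorted in decreasing order; this is the raw material for the count-based predictor. For example, to peek at the most frequent trigrams:

# Top trigrams by frequency in the sample
head(trigram_freq)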
Now that we have the n-grams counted and sorted, let's plot the 20 most frequent of each to visualize their distribution.
barplot(unigram_freq$freq[1:20],
#xlab='Unigrams',
ylab='Count',
main='Unigrams Frequency',
names.arg= unigram_freq$word[1:20], las=2)
barplot(bigram_freq$freq[1:20],
#xlab='Bigrams',
ylab='Count',
main='Bigrams Frequency',
names.arg= bigram_freq$word[1:20], las=2)
barplot(trigram_freq$freq[1:20],
#xlab='Trigrams',
ylab='Count',
main='Trigrams Frequency',
names.arg= trigram_freq$word[1:20], las=2)
In order to improve accuracy, other techniques must be used to complement the n-gram implementation, such as smoothing and backoff. I also want to use semantic indicators such as POS tagging to help with the prediction.
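As a preview of that next step, the sketch below shows one very simple backoff-style lookup built on the frequency data frames created above: try the trigram table first, back off to the bigram table, and finally fall back to the most frequent unigram. This is only an illustrative sketch (no smoothing or backoff weights yet), and the helper name predict.backoff is mine, not part of the final model.

# Illustrative sketch: crude backoff prediction using the *_freq data frames.
# NOTE: predict.backoff is a hypothetical helper, not the final model.
predict.backoff <- function(w1, w2) {
  # 1. Most frequent trigram starting with "w1 w2"
  hits <- trigram_freq[grepl(paste0("^", w1, " ", w2, " "), trigram_freq$word), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  # 2. Back off to the most frequent bigram starting with "w2"
  hits <- bigram_freq[grepl(paste0("^", w2, " "), bigram_freq$word), ]
  if (nrow(hits) > 0)
    return(tail(strsplit(as.character(hits$word[1]), " ")[[1]], 1))
  # 3. Fall back to the most frequent unigram
  as.character(unigram_freq$word[1])
}

# Example call: predict the word most likely to follow "one of"
predict.backoff("one", "of")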