Summary

This is a quick report on my experiments so far; any criticism is very welcome!

The goal of the capstone project for the JHU Data Science Specialization is to create a prediction model built from different writing sources (blogs, news and Twitter). The model should be able to predict the next word of a sentence based on the last one or two words typed.

Data Loading/Exploration

require(stringr); require(quanteda); require(dplyr); require(ggplot2); require(reshape2)

Let’s start by loading the contents. For convenience, and to test the approach quickly, we will only use 20% of the total corpus. Let’s do this and construct a first dfm (document-feature matrix) composed of bigrams.

Bigrams (and, later on, trigrams) are sequences of two (or three) words that appear next to each other in the text.

For example, “Hi, my name is John” is composed of the bigrams “hi_my”, “my_name”, “name_is” and “is_john”. Counting bigrams greatly helps in predicting the next word: the word “my” might, for instance, be the most frequent word to follow “hi”. We can therefore use these counts to start building the model.
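To make this concrete, here is a minimal illustration (the object name example_toks is only for this example) that turns that sentence into bigrams with quanteda’s tokens() and tokens_ngrams():

# illustrative only: tokenize the example sentence, then form the bigrams
example_toks <- tokens_tolower(tokens("Hi, my name is John", remove_punct = TRUE))
tokens_ngrams(example_toks, n = 2)
# expected bigrams: "hi_my" "my_name" "name_is" "is_john"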

We will also print the top features that will (at this point) be mostly stopwords (like “the”, “of”, etc…).

con1 <- file("final/en_US/en_US.blogs.txt")
con2 <- file("final/en_US/en_US.news.txt")
con3 <- file("final/en_US/en_US.twitter.txt")

txt1 <- readLines(con1)
txt2 <- readLines(con2)
txt3 <- readLines(con3)

close(con1); close(con2); close(con3)

# word counts: split each line on spaces and count the resulting tokens
wd_cnt1 <- length(unlist(str_split(txt1, " ")))
wd_cnt2 <- length(unlist(str_split(txt2, " ")))
wd_cnt3 <- length(unlist(str_split(txt3, " ")))


# Quick table for size of the file
sizedf <- data.frame(file = c("blogs", "news", "twitter"),
                     lines = c(length(txt1), length(txt2), length(txt3)),
                     words = c(wd_cnt1, wd_cnt2, wd_cnt3)
)
sizedf
##      file   lines    words
## 1   blogs  899288 37334131
## 2    news 1010242 34372530
## 3 twitter 2360148 30373545
### We notice that despite large differences in file size and number of lines, the three files hold
### roughly the same number of words
# calculating 20% of the total
txt <- c(txt1, txt2, txt3)
sub_20 <- ceiling(length(txt) * .2)
smol_txt <- txt[sample(length(txt), sub_20, replace = FALSE)]
smol_txt <- tolower(smol_txt); rm(txt, txt1, txt2, txt3)


smol_bigram <- tokens(smol_txt, what = "word", remove_numbers = T, ngrams = 2,
                    remove_punct = T, remove_url = T, remove_hyphens = T, remove_twitter = T)

smol_dfm <- dfm(smol_bigram) ; rm(smol_bigram)

topfeatures(smol_dfm)
##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
##    86597    81984    42769    40199    39350    32107    28625    25187 
##     in_a with_the 
##    23888    21184
prettycols <- colorRampPalette(c("steelblue", "red3"))

# compute the feature counts once and sort them, instead of repeating colSums() four times
ft_counts <- colSums(smol_dfm)
ft_order <- order(ft_counts, decreasing = TRUE)
top_ft <- data.frame(values = head(ft_counts[ft_order], 10),
                     features = head(colnames(smol_dfm)[ft_order], 10))
flop_ft <- data.frame(values = tail(ft_counts[ft_order], 10),
                      features = tail(colnames(smol_dfm)[ft_order], 10))

ggplot(top_ft, aes(x = features, y = values)) + geom_bar(stat = "identity", fill = prettycols(10)) +
        ylab("Occurrences") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Top 10 features from the corpus in terms of occurrences")

ggplot(flop_ft, aes(x = features, y = values)) + geom_bar(stat = "identity") +
        ylab("Occurrences") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ggtitle("Bottom 10 features from the corpus in terms of occurrences")

The next steps will use these bigrams (and trigrams) to estimate the probability of the next word. In its simplest form, the probability of a word following a given word is estimated as the number of occurrences of that bigram divided by the total count of bigrams starting with the given word.

As a quick example:

# trying to predict the word following "will" and "dog"
quickpred <- function(str){
        # match every bigram whose name starts with the input word
        # (note: this is a prefix match, so "will" also catches e.g. "willing_to";
        # anchoring with paste0("^", str, "_") would restrict it to exact first words)
        input <- paste0("^", str)
        selector <- grepl(input, colnames(smol_dfm))
        df <- data.frame(prediction = colnames(smol_dfm)[selector],
                         occurences = colSums(smol_dfm)[selector]
        )
        # relative frequency among the matched bigrams
        df$prob <- df$occurences / sum(df$occurences)
        pred <- head(df[order(df$prob, decreasing = TRUE), ])
        return(pred)
}
quickpred("will")
##            prediction occurences       prob
## will_be       will_be      16233 0.24095295
## will_have   will_have       2252 0.03342734
## will_not     will_not       1776 0.02636188
## willing_to willing_to       1139 0.01690664
## will_get     will_get       1022 0.01516996
## will_never will_never        995 0.01476919
quickpred("dog")
##          prediction occurences       prob
## dogs_and   dogs_and        163 0.03642458
## dog_and     dog_and        142 0.03173184
## dog_is       dog_is        110 0.02458101
## dog_in       dog_in         78 0.01743017
## dogs_are   dogs_are         76 0.01698324
## dogs_in     dogs_in         49 0.01094972

Room for improvement

The model is already far too heavy with only 20% of the corpus. I will have to store it in a much more compact and accessible format to make it faster (or even usable) on a lighter setup.
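As a rough sketch of one direction I might take (the names bigram_counts and lookup below are just illustrative, and the cutoff of 5 completions per word is arbitrary), the bigram counts could be collapsed into a small prefix/completion lookup table:

# illustrative sketch: collapse the dfm into a compact prefix/completion table
bigram_counts <- colSums(smol_dfm)
lookup <- data.frame(prefix     = sub("_.*$", "", names(bigram_counts)),
                     completion = sub("^[^_]*_", "", names(bigram_counts)),
                     count      = as.numeric(bigram_counts),
                     stringsAsFactors = FALSE)

# keep only the 5 most frequent completions per prefix
lookup <- lookup %>%
        arrange(prefix, desc(count)) %>%
        group_by(prefix) %>%
        filter(row_number() <= 5) %>%
        ungroup()

# prediction then becomes a cheap subset instead of a scan over the whole dfm
filter(lookup, prefix == "dog")

Keeping only the prefix, completion and count of the most frequent completions should make the stored object much smaller than the full dfm.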

Here are some of the ideas I have for future improvements: