This is a quick report on the experiments so far; any criticism is very welcome!
The goal of the capstone project for the JHU Data Science Specialization is to create a prediction model based on different writing sources. The model should be able to predict the next word of a sentence based on the last one or two words typed.
require(stringr); require(quanteda); require(dplyr); require(ggplot2); require(reshape2)
Let’s start by loading the contents. For convenience, and to test out the model quickly, we will only use 20% of the total corpus. Let’s do this and construct a first dfm (document-feature matrix) composed of bigrams.
Bigrams (and later on trigrams) represent occurrences of words that appear next to each other in the text.
For example, “Hi, my name is John” is composed of the bigrams “hi_my”, “my_name”, “name_is” and “is_john”. This greatly helps in predicting the next word: for instance, “my” might be the most frequent word after “hi”. We can therefore use this to start building the model.
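As a quick illustration, here is a minimal sketch (assuming quanteda is already loaded) of how that example sentence breaks down into bigrams; it uses tokens_ngrams(), which should produce the same features as the ngrams argument used in the main code below, depending on your quanteda version.
# minimal sketch of the bigram idea on the example sentence
toks <- tokens(tolower("Hi, my name is John"), remove_punct = TRUE)
tokens_ngrams(toks, n = 2)
# should yield the bigrams "hi_my", "my_name", "name_is", "is_john"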
We will also print the top features, which will (at this point) be mostly stopwords (like “the”, “of”, etc…).
# open a connection to each source file and read it line by line
con1 <- file("final/en_US/en_US.blogs.txt")
con2 <- file("final/en_US/en_US.news.txt")
con3 <- file("final/en_US/en_US.twitter.txt")
txt1 <- readLines(con1)
txt2 <- readLines(con2)
txt3 <- readLines(con3)
close(con1); close(con2); close(con3)
# rough word counts: split each line on spaces and count the resulting tokens
wd_cnt1 <- length(unlist(str_split(txt1, " ")))
wd_cnt2 <- length(unlist(str_split(txt2, " ")))
wd_cnt3 <- length(unlist(str_split(txt3, " ")))
# Quick summary table: lines and word counts per file
sizedf <- data.frame(file = c("blogs", "news", "twitter"),
                     lines = c(length(txt1), length(txt2), length(txt3)),
                     words = c(wd_cnt1, wd_cnt2, wd_cnt3))
sizedf
##      file   lines    words
## 1   blogs  899288 37334131
## 2    news 1010242 34372530
## 3 twitter 2360148 30373545
### We notice that despite large differences in file size and number of lines, the three files
### hold roughly the same number of words
# sample 20% of the combined corpus (note: no seed is set, so the subsample changes between runs)
txt <- c(txt1, txt2, txt3)
sub_20 <- ceiling(length(txt) * .2)
smol_txt <- txt[sample(seq_along(txt), sub_20, replace = FALSE)]
smol_txt <- tolower(smol_txt); rm(txt, txt1, txt2, txt3)
# tokenise into bigrams, stripping numbers, punctuation, URLs, hyphens and Twitter symbols
smol_bigram <- tokens(smol_txt, what = "word", remove_numbers = TRUE, ngrams = 2,
                      remove_punct = TRUE, remove_url = TRUE, remove_hyphens = TRUE,
                      remove_twitter = TRUE)
# build the document-feature matrix from the bigram tokens
smol_dfm <- dfm(smol_bigram); rm(smol_bigram)
topfeatures(smol_dfm)
##   of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the
##    86597    81984    42769    40199    39350    32107    28625    25187
##     in_a with_the
##    23888    21184
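As expected, the most frequent bigrams are pairs of stopwords. Just as an illustration (this is not part of the pipeline, and the filtering logic below is my own sketch), the stopword-only bigrams could be dropped with quanteda’s stopwords() list to surface more content-bearing pairs:
# sketch only: drop bigrams made of two stopwords and look at what remains on top
sw <- stopwords("en")
parts <- strsplit(colnames(smol_dfm), "_", fixed = TRUE)
keep <- !vapply(parts, function(p) all(p %in% sw), logical(1))
topfeatures(smol_dfm[, keep])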
prettycols <- colorRampPalette(c("steelblue", "red3"))
# sort the bigram counts once, then keep the 10 most and 10 least frequent features
ft_counts <- sort(colSums(smol_dfm), decreasing = TRUE)
top_ft <- data.frame(values = head(ft_counts, 10), features = names(head(ft_counts, 10)))
flop_ft <- data.frame(values = tail(ft_counts, 10), features = names(tail(ft_counts, 10)))
ggplot(top_ft, aes(x = features, y = values)) + geom_bar(stat = "identity", fill = prettycols(10)) +
    ylab("Occurrences") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Top 10 features of the corpus in terms of occurrences")
ggplot(flop_ft, aes(x = features, y = values)) + geom_bar(stat = "identity") +
    ylab("Occurrences") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ggtitle("Bottom 10 features of the corpus in terms of occurrences")
The next steps will be based on using these bi- and trigrams to estimate the probability of what the next word could be.
As a quick example:
# trying to predict the words that follow “will” and “dog”
quickpred <- function(str){
    # keep every bigram whose first token starts with the input string
    # (note that "^will" also matches e.g. "willing_to", not only "will_...")
    input <- paste0("^", str)
    selector <- grepl(input, colnames(smol_dfm))
    df <- data.frame(prediction = colnames(smol_dfm)[selector],
                     occurrences = colSums(smol_dfm)[selector])
    # relative frequency of each candidate among the matching bigrams
    df$prob <- df$occurrences / sum(colSums(smol_dfm)[selector])
    pred <- head(df[order(df$prob, decreasing = TRUE), ])
    return(pred)
}
quickpred("will")
##            prediction occurrences       prob
## will_be       will_be       16233 0.24095295
## will_have   will_have        2252 0.03342734
## will_not     will_not        1776 0.02636188
## willing_to willing_to        1139 0.01690664
## will_get     will_get        1022 0.01516996
## will_never will_never         995 0.01476919
quickpred("dog")
##          prediction occurrences       prob
## dogs_and   dogs_and         163 0.03642458
## dog_and     dog_and         142 0.03173184
## dog_is       dog_is         110 0.02458101
## dog_in       dog_in          78 0.01743017
## dogs_are   dogs_are          76 0.01698324
## dogs_in     dogs_in          49 0.01094972
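One quirk visible in these outputs is that the pattern “^will” also picks up bigrams such as “willing_to” (and “dogs_and” for “dog”), because the match is on the start of the feature name rather than on the exact first word. Anchoring the pattern on the underscore separator would restrict it to the exact word; the quickpred_exact() variant below is only a sketch of that idea, not part of the current model:
# sketch: anchor on the "_" separator so that only bigrams whose first word
# is exactly the input string are kept (e.g. "will_be" but not "willing_to")
quickpred_exact <- function(str){
    selector <- grepl(paste0("^", str, "_"), colnames(smol_dfm))
    counts <- colSums(smol_dfm)[selector]
    df <- data.frame(prediction = names(counts), occurrences = counts)
    df$prob <- df$occurrences / sum(df$occurrences)
    head(df[order(df$prob, decreasing = TRUE), ])
}
quickpred_exact("will")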
The model is already way too heavy with only 20% of the corpus. I will have to store it in a much more compact and accessible way in order to make it faster (or even usable at all) on a lighter setup.
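As a first idea in that direction (a sketch only; the dfm_trim() threshold and the two-column lookup layout are my own assumptions, and argument names vary slightly across quanteda versions), the rarely seen bigrams could be dropped and the remaining counts stored as a plain lookup table of first word, second word and count, which is much lighter than the full dfm:
# sketch: trim rare bigrams and keep only a compact lookup table
# (the min_termfreq threshold of 2 is an arbitrary choice for illustration)
slim_dfm <- dfm_trim(smol_dfm, min_termfreq = 2)
counts <- colSums(slim_dfm)
lookup <- data.frame(first = sub("_.*$", "", names(counts)),
                     second = sub("^[^_]*_", "", names(counts)),
                     count = as.integer(counts), stringsAsFactors = FALSE)
# prediction then reduces to a subset and a sort on this small data frame
cand <- lookup[lookup$first == "will", ]
head(cand[order(cand$count, decreasing = TRUE), ])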
Here are some of the things I have thought of for future improvements: