Summary

This is my final report for the Coursera Data Science Specialization Capstone project, March-May 2016. We were challenged to tackle text analytics and Natural Language Processing (NLP) head on. It was a new topic for most of us, and we had to apply almost everything we had learned in the specialization. We were given three text files, from blogs, news media and Twitter, to analyze. I used a sample size of 1% to explore the data, as can be seen in my interim report. To train the algorithm I used a gross sample size of 10% (blogs), 20% (news) and 7.5% (Twitter). An enormous amount of time went into trying different approaches to increase the accuracy of the prediction. The resulting accuracy of around 15% true predictions is disappointing on its own, but the work has really sparked my interest in this domain. I find it especially interesting to consider how the same problem could be solved with machine learning.

R Code

This assignment did not explicitly allow code sharing, so I provide no link to the code here. Reviewing peers will be instructed on how to find the necessary elements, which will remain online during the reviewing period, in accordance with the Coursera Code of Honor.

Process description

Understanding the problem, Data acquisition and cleaning, and Exploratory analysis were covered in the interim report. Since then I have refined the regex cleaning quite a bit and refactored everything into functions. I also ended up splitting the text into sentences using the openNLP R package.
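
As a minimal sketch of the sentence splitting step (the surrounding cleaning is omitted, and the wrapper name is mine, not from the actual code):

# Sketch of sentence splitting with the NLP/openNLP packages.
# splitIntoSentences is an illustrative wrapper name.
library(NLP)
library(openNLP)

splitIntoSentences <- function(txt) {
    s <- as.String(paste(txt, collapse = " "))
    sent_annotator <- Maxent_Sent_Token_Annotator()
    boundaries <- annotate(s, sent_annotator)
    s[boundaries]   # character vector, one element per sentence
}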

Other changes since the interim report: I removed all sentences containing numbers, except phrases like “1st”, “1970s”, “B2B” and so on, because otherwise the stopwords surrounding the numbers formed illegal bi-, tri- and fourgrams. This is also why I upweighted the news source in the gross sampling, as that source type contains the most numbers. In addition, I implemented a soft form of lemmatization by removing the plural s where considered safe: only when the word is longer than 4 characters and the form without the s matches a word with a higher frequency. No “’s” endings are changed. Note: in some instances the plural s is added instead, when that form is more common.
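
As a rough sketch of the plural-s rule only (the reverse case of adding an s is omitted), assuming `freq` is a named vector of unigram counts; the helper name softLemma and its details are illustrative, not the actual cleaning code:

# Sketch of the soft "remove plural s" rule. `freq` is assumed to be a named
# integer vector of unigram counts; words missing from it count as 0.
softLemma <- function(words, freq) {
    stripped <- sub("s$", "", words)
    f_word   <- freq[words];    f_word[is.na(f_word)]   <- 0
    f_strip  <- freq[stripped]; f_strip[is.na(f_strip)] <- 0
    eligible <- nchar(words) > 4 &
                grepl("s$", words) &
                !grepl("'s$", words) &   # possessives are never touched
                f_strip > f_word         # singular form is more frequent
    ifelse(eligible, stripped, words)
}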

The tm R package was initially used to create the unified text corpus, but midway I switched to the quanteda package. At the same time, I finally started using the data.table package, after having been content with data frames during my first 10 months of R coding. Quanteda and data.table scale well, which was important for the many iterations I had to do.

I then split my 10% (blogs), 20% (news) and 7.5% (Twitter) sample into training and test sets (80/20), and split the test portion 50/50 into devtest and test. On the training data I created n-grams of order 1 through 4. The n-grams were split into their single words, which were then converted to integers mapped to a full unigram dictionary built on the full dataset. This saved space and allowed a little more training data to be used.
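
Condensed to a sketch, that step looks roughly like the following; `train` (the cleaned training sentences), `dict` (the unigram dictionary as a character vector) and the column names are assumptions of mine, using current quanteda function names:

# Sketch: build 4-grams with quanteda and map the words to integer ids
# against a unigram dictionary. Object and column names are illustrative.
library(quanteda)
library(data.table)

toks   <- tokens(train, what = "word")
grams4 <- unlist(as.list(tokens_ngrams(toks, n = 4, concatenator = " ")))

dt <- data.table(do.call(rbind, strsplit(grams4, " ", fixed = TRUE)))
setnames(dt, c("w3", "w2", "w1", "w"))          # w = the word to predict
dt <- dt[, .(Freq = .N), by = .(w3, w2, w1, w)] # collapse to counts

# Replace words with integer ids from the unigram dictionary `dict`
for (col in c("w3", "w2", "w1", "w")) {
    set(dt, j = col, value = match(dt[[col]], dict))
}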

I also experimented with bigrams excluding stopwords, predicting from either the last or the second-to-last user input word. Another experiment was to create 4-skip-2-grams: bigrams of all combinations skipping 1 to 4 words. Neither experiment is included in the final product, as they primarily contributed to slowing down the app. Instead I pruned the regular n-grams by removing “singletons” (terms occurring once) and focused more on alternatives to the simple prediction algorithm I had made so far. One outcome of that is the downvoting of stopwords that is the default in the app.
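
The singleton pruning itself is a one-liner in data.table, assuming the counts live in a column named Freq as in the KNprep excerpt further down:

# Prune singletons: keep only n-grams observed more than once
dt <- dt[Freq > 1]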

From the initial simple “Stupid Back-off” method I went into a very deep valley of absorbing NLP theory. I aimed to implement one of the more advanced forms of next-word probability smoothing, which proved a true challenge. The model I implemented is based on the following theory.

Kneser-Ney smoothing

The original Kneser-Ney model [7][14]… \[ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{max(c(w_{i-n+1}^{i}) - \delta,0)}{\sum_{w_i} c(w_{i-n+1}^{i})} + \frac{\delta}{\sum_{w_i} c(w_{i-n+1}^{i})} N_{1+}(w_{i-n+1}^{i-1}\bullet)\, p_{KN}(w_i|w_{i-n+2}^{i-1}) \]
…has a static discounting factor \({\delta}\), where \(N_{1+}(w_{i-n+1}^{i-1}\bullet)\) denotes the number of distinct words observed after the context. Chen & Goodman [10] present evidence that it is better to smooth with different discounting factors for n-grams occurring once, twice, or three or more times.

\[ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{max(c(w_{i-n+1}^{i}) - D(c(w_{i-n+1}^{i})),0)}{\sum_{w_i} c(w_{i-n+1}^{i})} + \lambda(w_{i-n+1}^{i-1})p_{KN}(w_i|w_{i-n+2}^{i-1}) \]

The notation differs somewhat between almost all the sources I have found. I ended up basing my implementation on the general recursive formulation by Jurafsky and Martin [15]:

\[ p_{KN}(w_i|w_{i-n+1}^{i-1}) = \frac{max(c_{KN}(w_{i-n+1}^{i}) - D, 0)}{\sum_{w_i} c_{KN}(w_{i-n+1}^{i})}+\lambda(w_{i-n+1}^{i-1})p_{KN}(w_i|w_{i-n+2}^{i-1}) \]

where \(c_{KN}\) is defined as count(\(\bullet\)) for the highest-order n-gram we are using, and as continuation count(\(\bullet\)) for the lower-order n-grams. The “continuation count” of a term \(\bullet\) is the number of unique single-word prefixes we observe for it.
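
In data.table terms, such continuation counts can be computed roughly like this; `bigrams`, its column names and `ContCount` are illustrative, not the actual code:

# Sketch: continuation count of each word w, i.e. the number of distinct
# preceding words w1 it has been observed after.
library(data.table)

contcounts <- bigrams[, .(ContCount = uniqueN(w1)), by = w]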

Another important point is that, at the end of the recursion, the unigrams are interpolated with the uniform distribution like this (\(V\) is the vocabulary, so \(|V|\) is the number of unique word types):

\[ p_{KN}(w) = \frac{max(c_{KN}(w) - D, 0)}{\sum_{w'} c_{KN}(w')} + \frac{\lambda(\epsilon)}{|V|} \]

The \(\frac{\lambda(\epsilon)}{|V|}\) element here is the probability reserved for unknown words.

The discount factor \(D\) is differentiated: \[ D(c) = \begin{cases} 0 & \text{if $c$ = 0} \\ D_1 = 1-2Y\frac{N_2}{N_1} & \text{if $c$ = 1} \\ D_2 = 2-3Y\frac{N_3}{N_2} & \text{if $c$ = 2} \\ D_3 = 3-4Y\frac{N_4}{N_3} & \text{if $c$ $\ge$ 3} \end{cases} \]

Where the \(Y\) component is: \[Y = \frac{N_1}{N_1+2N_2}\]

The \(N_i\) notation means the number of ngrams with \(i\) occurrences, in other words the frequency of ngram frequency \(i\).
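
As a purely illustrative example with made-up counts (not taken from my data), \(N_1 = 1000\), \(N_2 = 500\), \(N_3 = 300\) and \(N_4 = 200\) would give:

\[ Y = \frac{1000}{1000 + 2 \cdot 500} = 0.5, \qquad D_1 = 1 - 2 \cdot 0.5 \cdot \frac{500}{1000} = 0.5, \qquad D_2 = 2 - 3 \cdot 0.5 \cdot \frac{300}{500} = 1.1, \qquad D_3 = 3 - 4 \cdot 0.5 \cdot \frac{200}{300} \approx 1.67 \]

So the absolute discount grows somewhat with the frequency class.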

Finally, the lambda \(\lambda\) normalizing constant is used to distribute the probability mass we have discounted with the \(D\)’s. It is defined like this:
\[ \lambda(w_{i-n+1}^{i-1})=\frac{D_1N_1(w_{i-n+1}^{i-1}\bullet)+D_2N_2(w_{i-n+1}^{i-1}\bullet)+D_3N_3(w_{i-n+1}^{i-1}\bullet)}{\sum_{w_i} c(w_{i-n+1}^{i})} \]

The strength of this model is that it sets a unigram’s probability proportional to the number of different contexts the word appears in, rather than to the number of occurrences of the word. It is, however, hard to get right. My function KNprep precalculates as much as I was able to based on the modified recursive Kneser-Ney formula above. Excerpts from the relevant function:

########## Kneser-Ney prepping data sets ##########

KNprep <- function(dt, n) {
    # Takes a data.table coming from ngramprep() and adds (Mod.) Kneser-Ney-discounting of the freqs
    #
    # Args: 
    #    dt: a data.table as ngramprep() serves it, requires Terms, Nextword and Freq as minimum cols
    #    n:  the ngram level (one dt per ngram level 1-4) 
    # Returns:
    #    dt: The prepared data table now also suitable for Modified Kneser-Ney prediction
    #        as described by Chen & Goodman (1999)
    
    col <- colnames(dt)
    col <- col[1:n]
    

    # Y = N_c / (N_c + 2N_(c+1)), where N_c is the count of ngrams with count==c (for Kneser-Ney)
    #     Note: This is a simplified calculation of Y using only the two lowest kept freqs.
    c1 <- min(dt$Freq) # The lowest available Freq
    c2 <- min(subset(dt, Freq>c1)$Freq) # And the second lowest
    Y <- nrow(dt[Freq == c1]) / (nrow(dt[Freq == c1]) + 2 * nrow(dt[Freq == c2])) # 1:0.50 2:0.56 3+:0.62  
    
    # D = Discounting parameter, different for freq==1, freq==2 and freq>=3
    #     Ref Chen & Goodman (1999)
    dt[, D := 0]
    dt[Freq == 1, D := 1 - 2 * Y * (nrow(dt[Freq == 2]) / nrow(dt[Freq == 1]))]
    dt[Freq == 2, D := 2 - 3 * Y * (nrow(dt[Freq == 3]) / nrow(dt[Freq == 2]))]
    dt[Freq > 2,  D := 3 - 4 * Y * (nrow(dt[Freq == 4]) / nrow(dt[Freq == 3]))]

    # Nom = Numerator in the P_KN formula ( max{c(w_i-1, w_i) - D, 0} )
    dt <- dt[, Nom := pmax(Freq-D, 0)]
    
    # Denom = Denominator is the count of the preceding word(s) 
    if(n==1) {
        dt <- dt[, Denom := sum(Freq)]
    } else if (n==2) {
        dt <- dt[, .(w, Freq, D, Nom, Denom = sum(Freq)), by = w1]
    } else if (n==3) {
        dt <- dt[, .(w, Freq, D, Nom, Denom = sum(Freq)), by = list(w2, w1)]
    } else if (n==4) {
        dt <- dt[, .(w, Freq, D, Nom, Denom = sum(Freq)), by = list(w3, w2, w1)]
    }
    
    
    # NN = number of word types that follow w_i-1 in the training data
    if(n==1) {
        dt <- dt[, .(w, Freq, D, Nom, Denom, NN = length(w))]
    } else if (n==2) {
        dt <- dt[, .(w, Freq, D, Nom, Denom, NN = length(w)), by=w1]
    } else if (n==3) {
        dt <- dt[, .(w, Freq, D, Nom, Denom, NN = length(w)), by=list(w2, w1)]
    } else if (n==4) {
        dt <- dt[, .(w, Freq, D, Nom, Denom, NN = length(w)), by=list(w3, w2, w1)]
    }
    
    
    # L  = Lambda, normalizing constant, the probability mass we've discounted
    dt[, L := (D / Freq) * NN]
    
    # N = The number of different ngrams this nextword completes in the training set
    #     (c(w_(i-1)) for Kneser-Ney). Used in P_continuation (PC)
    if(n==1) {
        dt <- dt[, .(w, Freq, D, Nom, Denom, NN, L, .N)]
    } else if (n==2) {
        dt <- dt[, .(w1, Freq, D, Nom, Denom, NN, L, .N), by=w]
    } else if (n==3) {
        dt <- dt[, .(w2, w1, Freq, D, Nom, Denom, NN, L, .N), by=w]
    } else if (n==4) {
        dt <- dt[, .(w3, w2, w1, Freq, D, Nom, Denom, NN, L, .N), by=w]
    }
    

    # PC = P_continuation
    dt[, PC := N / nrow(dt)] # Count of this novel continuation div. by number of unique grams
    
    # Prob_KN - Estimated KN probability
    dt[, P_KN := (Nom/Denom) + ((D/Denom) * NN) * PC]
    # ... (rest of the function omitted)
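
A hypothetical call of the function, assuming one data.table per n-gram level as returned by ngramprep() (the object names here are illustrative):

# Illustrative usage; unigrams/bigrams/trigrams/quadgrams are assumed names
unigrams  <- KNprep(unigrams,  1)
bigrams   <- KNprep(bigrams,   2)
trigrams  <- KNprep(trigrams,  3)
quadgrams <- KNprep(quadgrams, 4)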

The application

The application [13], written in R Shiny, is simple. To save space and time it has no visualizations or other extras; it merely renders a DT::datatable containing the most promising predictions based on the user input, with some JavaScript to liven it up, but not much. The user just types or pastes the unfinished sentence and waits for the suggestions.
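
Reduced to a sketch, the core of such an app looks roughly like this; predictNext() is a hypothetical stand-in for the actual prediction function, and the layout details are mine:

# Minimal Shiny sketch of the app's core, not the actual implementation.
library(shiny)
library(DT)

ui <- fluidPage(
    textInput("phrase", "Type your unfinished sentence:"),
    checkboxInput("downvote", "Downvote stopwords", value = TRUE),
    DT::dataTableOutput("predictions")
)

server <- function(input, output) {
    output$predictions <- DT::renderDataTable({
        req(input$phrase)
        predictNext(input$phrase, downvote = input$downvote)  # hypothetical
    })
}

shinyApp(ui, server)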

Nextword app screen shot

The “Downvote stopwords” checkbox is checked by default. It does not change the probabilities of such words during the back-off procedure, but simply prefers non-stopwords in the final suggestion table. If no or very few non-stopwords are found, it will however add the stopwords the algorithm did find.
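
The filtering can be sketched like this, assuming `preds` is a data.table of candidate words already ordered by probability; the function name and the cut-off of three suggestions are illustrative:

# Sketch of the stopword downvoting; `preds` with a `word` column is assumed.
library(quanteda)
library(data.table)

downvoteStopwords <- function(preds, n = 3) {
    sw      <- stopwords("en")
    nonstop <- preds[!(word %in% sw)]
    if (nrow(nonstop) >= n) return(head(nonstop, n))
    # Too few non-stopwords: fill up with the stopword candidates after all
    head(rbind(nonstop, preds[word %in% sw]), n)
}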

Evaluation

The model was evaluated by its ability to predict correctly on the held-out test data. Running the automatic final tests on the true test set gave an accuracy of around 12% if only the first predicted word is accepted, and around 18% for the top three predictions. That is my estimate of the app’s precision, although it will of course depend on the source and on where the sentences are cut.
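
The automatic test amounted to something like the following sketch; testInputs, testTargets and predictNext() are illustrative names, not the actual test harness:

# Sketch of the top-1 / top-3 accuracy test over held-out sentence fragments.
top1 <- 0
top3 <- 0
for (i in seq_along(testInputs)) {
    preds <- predictNext(testInputs[i])$word      # candidates ordered by probability
    if (length(preds) > 0 && preds[1] == testTargets[i]) top1 <- top1 + 1
    if (testTargets[i] %in% head(preds, 3))               top3 <- top3 + 1
}
cat("Top-1 accuracy:", round(top1 / length(testInputs), 3),
    "- Top-3 accuracy:", round(top3 / length(testInputs), 3), "\n")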

Perplexity \(PP(W)=\sqrt[N]{\frac{1}{P(w_1w_2...w_N)}}\) has not been measured on the final product.

Interesting opportunities

The prediction model could be improved in many ways:

References