Overview

This final report describes the ideas and development behind a useful data product, and highlights the changes made since the milestone report was published. While the milestone report had substantial exploratory value, building on that foundation presented several challenges. First and foremost was the data itself: the milestone report used only a small sample of the available data, and attempting to scale up the models built during that phase produced constant memory errors. This bottleneck required a more efficient approach to building the models that underpin the prediction algorithm.

library(dplyr)
library(readr)
library(stringr)
library(textclean)
library(tidyr)
library(tidytext)
library(tokenizers)

The data used in this project can be downloaded from the Data Science Capstone website. The dataset contains three sources of text (blogs, news and tweets) in four languages: English, Finnish, German and Russian. For this project the English texts were selected and split into training, validation and test datasets.

blogsenUS <- ".//splitData/en_US.blogs.training.txt"
newsenUS <- ".//splitData/en_US.news.training.txt"
twitterenUS <- ".//splitData/en_US.twitter.training.txt"

blogsLines <- read_lines(blogsenUS, skip_empty_rows = TRUE)
newsLines <- read_lines(newsenUS, skip_empty_rows = TRUE)
twitterLines <- read_lines(twitterenUS, skip_empty_rows = TRUE)

Transforming the raw data

As in the preliminary phase of the project, the raw data used for the final product required considerable cleaning and processing; however, several notable changes have been made since the milestone report. The first change was splitting the text into sentences, to prevent the algorithm from predicting words across sentence boundaries. The second major change was writing an efficient cleaning function that no longer relies on the tm package, which proved too memory intensive to clean the data in a timely manner or to be used in the Shiny app. Further steps were also needed to handle special constructions such as contractions and internet slang.
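
The sentence-splitting step itself is not reproduced in the report. A minimal sketch of how it could be done with the tokenizers package loaded above is shown below; the object names blogsSentences, newsSentences and twitterSentences are illustrative assumptions only.

# Illustrative sketch: split each raw line into sentences before cleaning
blogsSentences   <- unlist(tokenize_sentences(blogsLines))
newsSentences    <- unlist(tokenize_sentences(newsLines))
twitterSentences <- unlist(tokenize_sentences(twitterLines))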

cleanData <- function(text) {
  profanityList <- read_lines("googleprofanitylist.txt")
  # Drop any word that appears in the profanity list, element by element
  removeWords <- function(str, stopwords) {
    vapply(strsplit(str, " "),
           function(x) paste(x[!x %in% stopwords], collapse = " "),
           character(1))
  }
  text <- str_replace_all(text, "(ht|f)tp(s?)://\\S+", "")  # remove URLs
  text <- str_replace_all(text, "@[^\\s]+", "")             # remove @mentions
  text <- str_replace_all(text, "[,()\":;”…]", " ")          # remove selected punctuation
  text <- replace_internet_slang(text)                       # expand internet slang
  text <- tolower(text)
  text <- removeWords(text, profanityList)                   # filter profanity
  text <- replace_contraction(text)                          # expand contractions
  text <- str_remove_all(text, "[[:punct:]]")                # remove remaining punctuation
  text <- str_remove_all(text, "[0-9]")                      # remove numbers
  text <- str_replace_all(text, "\\s+", " ")                 # collapse whitespace
  unlist(text)
}
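
The cleaned corpora used in the next section are then obtained by applying cleanData() to each source. The exact call is not shown in the report; the following sketch assumes the sentence vectors introduced above.

# Sketch: clean each source separately (assumes the sentence vectors above)
cleanBlogs   <- cleanData(blogsSentences)
cleanNews    <- cleanData(newsSentences)
cleanTwitter <- cleanData(twitterSentences)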

Building N-gram models

Having transformed the data into a structured format, the next step was to build N-gram models and their corresponding frequency tables. While the milestone report explored three N-gram models, the final product is built on a series of four. Each model relies on the Markov assumption that the next word depends only on the previous N − 1 words of a given string of text. Built from 60 percent of the available data, the four language models contained a very large collection of N-grams, many of them sparse terms. Given the memory limitations of an average computing device, pruning the models therefore became a crucial intermediate step in creating the prediction algorithm.

cleanBlogs <- tibble(text = cleanBlogs)
cleanNews <- tibble(text = cleanNews)
cleanTwitter <- tibble(text = cleanTwitter)

createNgrams <- function(data, textInput, ngram) {
  # Tokenize into N-grams of the requested order and count their frequencies
  unnest_tokens(data, output, {{ textInput }}, token = "ngrams", n = ngram) %>%
    count(output) %>%
    arrange(desc(n))
}
pruneGramsb <- function(data) {
  # Keep only the most frequent N-grams that together cover 50% of all observations
  data %>%
    mutate(nCumPercent = cumsum(n) / sum(n)) %>%
    filter(nCumPercent <= 0.50) %>%
    select(-nCumPercent)
}
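
The pruned per-source tables that are merged below (e.g. bigramsBlogs.pruned) are not constructed explicitly in the report. A minimal sketch of that step for the bigram case, assuming the column name text from the tibbles above, could look as follows; the unigram, trigram and quadgram tables are produced analogously with n = 1, 3 and 4.

# Sketch: build and prune the per-source bigram tables
bigramsBlogs.pruned   <- createNgrams(cleanBlogs, text, 2)   %>% pruneGramsb()
bigramsNews.pruned    <- createNgrams(cleanNews, text, 2)    %>% pruneGramsb()
bigramsTwitter.pruned <- createNgrams(cleanTwitter, text, 2) %>% pruneGramsb()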

Each model was first pruned separately by data source and then merged into a new, final dataset.

mergeGrams <- function(data1, data2, data3) {
  # Stack the three source tables and sum the counts of identical N-grams
  nGrams <- bind_rows(data1, data2, data3)
  nGrams.merged <- nGrams %>%
    group_by(output) %>%
    summarize(n = sum(n)) %>%
    arrange(desc(n))
  return(nGrams.merged)
}

uniGrams <- mergeGrams(unigramsBlogs.pruned, unigramsNews.pruned, unigramsTwitter.pruned)
biGrams <- mergeGrams(bigramsBlogs.pruned, bigramsNews.pruned, bigramsTwitter.pruned)
triGrams <- mergeGrams(trigramsBlogs.pruned, trigramsNews.pruned, trigramsTwitter.pruned)
quadGrams <- mergeGrams(quadgramsBlogs.pruned, quadgramsNews.pruned, quadgramsTwitter.pruned)

The final step in the model-building process was to split each N-gram into two parts: the first part containing all but the last word of the N-gram, and the second containing only the last word. This separation makes it possible to approximate the probability of an unobserved N-gram in the next phase of the project.

combinedUniGrams <- rename(uniGrams, "firstGram" = "output")
combinedBiGrams <- extract(biGrams, output, into = c("firstGram", "lastGram"), '(.*)\\s+([^ ]+)$')
combinedTriGrams <- extract(triGrams, output, into = c("firstGram", "lastGram"), '(.*)\\s+([^ ]+)$')
combinedQuadGrams <- extract(quadGrams, output, into = c("firstGram", "lastGram"), '(.*)\\s+([^ ]+)$')

Developing the prediction algorithm

Transitioning into the penultimate phase of the project, the data were now ready to be used in developing the prediction algorithm. Given the dictionary of N-grams, probabilities for possible next words were calculated through maximum-likelihood estimation combined with Good-Turing smoothing. In Simple Good-Turing smoothing, each observed count r is replaced by an adjusted count r* = (r + 1) · S(r + 1) / S(r), where S is a log-linear fit to the count-of-counts table, and the resulting probabilities are renormalised so that some probability mass remains reserved for unseen N-grams. The function below implements this procedure.

simpleGoodTuring <- function(data) {
    # Build the count-of-counts table: r = observed count, n = number of
    # N-grams seen exactly r times
    ngramData <- data.frame(X = table(data$n))
    names(ngramData) <- c("r", "n")
    ngramData$r <- as.numeric(as.character(ngramData$r))
    nSize <- nrow(ngramData)
    N <- sum(ngramData$r * ngramData$n)
    # Probability mass reserved for unseen N-grams (N1 / N)
    pZero <- ngramData$n[1] / N
    ngramData$Z <- 0

    # Smooth the count-of-counts by averaging each n over the gap between
    # its neighbouring r values (the Z transform)
    for (c in 1:nSize) {
        i <- if (c == 1) 0 else ngramData$r[c - 1]
        k <- if (c == nSize) ngramData$r[c] else ngramData$r[c + 1]
        ngramData$Z[c] <- 2 * ngramData$n[c] / (k - i)
    }

    # Fit a log-linear model to the smoothed counts
    ngramData$logR <- log(ngramData$r)
    ngramData$logZ <- log(ngramData$Z)
    ngramData$rStar <- 0
    model1 <- glm(logZ ~ logR, data = ngramData)
    c0 <- model1$coefficients[1]
    c1 <- model1$coefficients[2]
    ycheck <- FALSE

    # Compute the adjusted counts r*, switching from the Turing estimate to the
    # smoothed estimate once the two no longer differ significantly
    for (c in 1:(nSize - 1)) {
        rPlusOne <- ngramData$r[c] + 1
        sRPlusOne <- exp(c0 + (c1 * ngramData$logR[c + 1]))
        sR <- exp(c0 + (c1 * ngramData$logR[c]))
        y <- rPlusOne * sRPlusOne / sR

        if (ycheck) {
            ngramData$rStar[c] <- y
        } else {
            nRPlusOne <- ngramData$n[ngramData$r == rPlusOne]
            nR <- ngramData$n[c]
            x <- rPlusOne * nRPlusOne / nR
            if (abs(x - y) > 1.96 * sqrt((rPlusOne^2) * (nRPlusOne / nR^2) * (1 + nRPlusOne / nR))) {
                ngramData$rStar[c] <- x
            } else {
                ngramData$rStar[c] <- y
                ycheck <- TRUE
            }
        }
        if (c == (nSize - 1)) {
            ngramData$rStar[c + 1] <- y
        }
    }

    # Renormalise so that the observed N-grams share (1 - pZero) of the mass
    nOne <- sum(ngramData$n * ngramData$rStar)
    ngramData$p <- (1 - pZero) * ngramData$rStar / nOne
    return(ngramData)
}
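
The probability column p referenced by the prediction function below is not attached to the N-gram tables anywhere in the report. One plausible bridging step, sketched here with a hypothetical helper attachProbabilities(), joins the smoothed probability for each raw count back onto the combined tables and reuses the table names expected by the prediction function.

# Hypothetical helper: map each raw count n to its Good-Turing probability p
attachProbabilities <- function(grams) {
  gt <- simpleGoodTuring(grams)
  left_join(grams, select(gt, r, p), by = c("n" = "r"))
}
uniGrams  <- attachProbabilities(combinedUniGrams)
biGrams   <- attachProbabilities(combinedBiGrams)
triGrams  <- attachProbabilities(combinedTriGrams)
quadGrams <- attachProbabilities(combinedQuadGrams)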

The prediction algorithm

Depending on the length of the text input, the prediction algorithm selects candidates from the appropriate N-gram model. If a given tier does not contain enough candidates, the algorithm backs off to the next lower-order N-gram model and merges duplicate candidates to arrive at a fitting prediction. This mechanism is known as “Stupid Backoff”, a highly efficient and inexpensive algorithm proposed by Brants et al. (2007).

predictFunction <- function(inputText) {

    # Clean the input and split it into individual words
    inputText <- cleanData(inputText)
    inputWords <- unlist(strsplit(inputText, "\\s+"))
    inputWords <- inputWords[inputWords != ""]

    if (length(inputWords) >= 3) {
        firstText <- paste(tail(inputWords, 3), collapse = " ")
        prediction <- quadGrams %>%
            filter(firstGram %in% firstText) %>%
            select(lastGram, n, p)
        if (nrow(prediction) < 5) {
            oldprediction <- prediction
            firstText <- paste(tail(inputWords, 2), collapse = " ")
            prediction <- triGrams %>%
                filter(firstGram %in% firstText) %>%
                select(lastGram, n, p)
            prediction <- rbind(oldprediction, prediction)
            if (nrow(prediction) < 5) {
                oldprediction <- prediction
                firstText <- tail(inputWords, 1)
                prediction <- biGrams %>%
                    filter(firstGram %in% firstText) %>%
                    select(lastGram, n, p)
                prediction <- rbind(oldprediction, prediction)
                if (nrow(prediction) < 5) {
                    oldprediction <- prediction
                    prediction <- uniGrams %>%
                        select(lastGram = firstGram, n, p)
                    prediction <- rbind(oldprediction, prediction)
                }
            }
        }
        return(prediction %>% group_by(lastGram) %>% summarise_all(sum) %>% ungroup() %>% arrange(desc(p)))
    }
    else if (length(inputWords) == 2) {
        firstText <- paste(tail(inputWords, 2), collapse = " ")
        prediction <- triGrams %>%
            filter(firstGram %in% firstText) %>%
            select(lastGram, n, p)
        if (nrow(prediction) < 5) {
            oldprediction <- prediction
            firstText <- tail(inputWords, 1)
            prediction <- biGrams %>%
                filter(firstGram %in% firstText) %>%
                select(lastGram, n, p)
            prediction <- rbind(oldprediction, prediction)
            if (nrow(prediction) < 5) {
                oldprediction <- prediction
                prediction <- uniGrams %>%
                    select(lastGram = firstGram, n, p)
                prediction <- rbind(oldprediction, prediction)
            }
        }
        return(prediction %>% group_by(lastGram) %>% summarise_all(sum) %>% ungroup() %>% arrange(desc(p)))
    }
    else if (length(inputWords) == 1) {
        firstText <- tail(inputWords, 1)
        prediction <- biGrams %>%
            filter(firstGram %in% firstText) %>%
            select(lastGram, n, p)
        if (nrow(prediction) < 5) {
            prediction <- uniGrams %>%
                select(lastGram = firstGram, n, p)
        }
        return(prediction %>% group_by(lastGram) %>% summarise_all(sum) %>% ungroup() %>% arrange(desc(p)))
    }
}
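
As a simple illustration of how the function is called (the input phrase is arbitrary), the top candidates can be inspected as follows.

# Example call; returns a table of candidate next words ordered by p
head(predictFunction("thanks for the"), 5)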

Data product

The final goal of the project was to develop a data product demonstrating how the prediction algorithm works; a Shiny application serves as this final product. The app offers an intuitive, simple graphical user interface that can easily be adapted for educational and commercial uses. The dashboard presents the user with a text input box; once a word or short phrase has been entered, it returns a top prediction together with a dynamic table of suggested next words and a word cloud visualising the predictions.
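
The app's UI and server code are not reproduced in this report. The following is only a minimal sketch of how the interface described above could be wired together, assuming predictFunction() and the wordcloud package are available; none of the identifiers below are taken from the actual app.

# Minimal sketch of a Shiny interface around predictFunction()
library(shiny)
library(wordcloud)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or short phrase:"),
  h4("Top prediction"),
  textOutput("topWord"),
  h4("Other suggestions"),
  tableOutput("candidates"),
  plotOutput("cloud")
)

server <- function(input, output) {
  candidates <- reactive({
    req(input$phrase)
    predictFunction(input$phrase)
  })
  output$topWord    <- renderText(as.character(candidates()$lastGram[1]))
  output$candidates <- renderTable(head(candidates(), 10))
  output$cloud      <- renderPlot(
    wordcloud(candidates()$lastGram, candidates()$p, max.words = 50)
  )
}

shinyApp(ui = ui, server = server)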

Figure 1: Screenshot of the Shiny App

As the app does not contain sensitive data, hosting it publicly on the Shiny web server was a natural choice. The application is available at the following link: Shiny App.

References

Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858–867.