title: “Predict Next Word - Language Modeling Capstone Project”

author: “Michelle Tan”

date: “3/10/2018”

output: html_document

Introduction

This presentation is a high-level description of the language modeling Capstone Project of the Coursera Data Science Specialization.

The purpose of this project is to build a natural language model that suggests an appropriate next word for a user-specified sequence of words. Three types of text data (Twitter, news and blogs) were used to train the model. The data was cleaned and sub-sampled to produce the final training set. Word combinations (n-grams) were then built from the cleaned data, and a predictive algorithm based on Katz Back-off was applied to predict the next word. Low-frequency n-grams were filtered out so that the final model is small enough to run responsively as a Shiny application.

[Shiny App URL](https://michelletan.shinyapps.io/Nextwordprediction/)

[Project Software](https://github.com/Michelletan78/Nextword-Prediction)

Data Handling and Cleaning

Before the word prediction algorithm was built, the following steps were carried out to handle and clean the very large Twitter, news and blogs files:

- A subset of the original data was randomly selected from the three sources and merged into one.
- Because of limited computing resources, the data was divided into small chunks before cleaning.
- Data cleaning involved converting to lower case and removing punctuation, numbers and non-printable characters.
- Four sets of word combinations (n-grams) with 4, 3, 2 and 1 words were then created.
- After their cumulative frequencies were calculated, the four n-gram sets were sorted and saved.
- Low-frequency n-grams were filtered out to reduce size and improve performance.
- Finally, the four n-gram objects were saved as compressed R files (.RData files).

Word Prediction Model

The next word prediction model is based on the Katz Back-off algorithm. As the code in this document shows, it uses the back-off order only: the most frequent matching n-gram wins, and no discounting of counts is applied. These are the steps involved in predicting the next word of the user-specified sentence:

1. Load the four compressed data sets containing sorted n-grams with cumulative frequencies.
2. Clean the user-specified sequence of words with the same techniques used to clean the training data.
3. Depending on the number of words entered, extract the last three, two or one word(s).
4. First try the 4-grams: look for one whose first three words are the last three words of the sentence.
5. If no 4-gram matches, back off to the 3-grams and match their first two words against the last two words of the sentence.
6. If no 3-gram matches, back off to the 2-grams and match their first word against the last word of the sentence.
7. Finally, if no 2-gram matches, use the most frequent word from the 1-grams as the next word.

Shiny Application

A Shiny application was developed based on the next word prediction model described above. Here are the key features of the app:

- The user enters a sequence of words in the text box, then presses the “Next Word” button.
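A minimal UI sketch that would pair with the server code in this document. The input and output IDs (`inputString`, `action`, `prediction`, `text1`, `text2`) and the “Next Word” button label come from the source; the layout, the title and the text box label are assumptions made for illustration and may differ from the deployed app.

```r
library(shiny)

# Assumed layout: only the IDs and the button label come from the server code.
ui <- fluidPage(
  titlePanel("Predict Next Word"),
  sidebarLayout(
    sidebarPanel(
      # Text box read by the server as input$inputString
      textInput("inputString", "Enter a sequence of words:", value = ""),
      # Button read by the server as input$action
      actionButton("action", "Next Word")
    ),
    mainPanel(
      textOutput("text1"),               # echoes the input sentence
      verbatimTextOutput("prediction"),  # predicted word plus a note
      textOutput("text2")
    )
  )
)
```

Saved as `ui.R` next to the server script, this would give the server the inputs and outputs it refers to.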

library(tm)
library(stringr)
library(shiny)

# Load the One-Gram, Two-Gram, Three-Gram and Four-Gram data frame files.
# This data is already cleaned, with n-grams sorted by frequency in descending order.
# The text was converted to lower case; punctuation, numbers, extra white space
# and non-printable characters were removed.

fDF1 <- readRDS("fDF1.ext");
fDF2 <- readRDS("fDF2.ext");
fDF3 <- readRDS("fDF3.ext");
fDF4 <- readRDS("fDF4.ext");


mesg <- as.character(NULL);

#-------------------------------------------------
# This function cleans up the user input string
# before it is used to predict the next term
#-------------------------------------------------
CleanInputString <- function(inStr)
{
  # First remove the non-alphabetic characters
  inStr <- iconv(inStr, "latin1", "ASCII", sub=" ");
  inStr <- gsub("[^[:alpha:][:space:][:punct:]]", "", inStr);
  
  # Then convert to a Corpus
  inStrCrps <- VCorpus(VectorSource(inStr))
  
  # Convert the input sentence to lower case, then remove
  # punctuation, numbers and extra white space
  inStrCrps <- tm_map(inStrCrps, content_transformer(tolower))
  inStrCrps <- tm_map(inStrCrps, removePunctuation)
  inStrCrps <- tm_map(inStrCrps, removeNumbers)
  inStrCrps <- tm_map(inStrCrps, stripWhitespace)
  inStr <- as.character(inStrCrps[[1]])
  inStr <- gsub("(^[[:space:]]+|[[:space:]]+$)", "", inStr)
  
  # Return the cleaned sentence
  # (an empty string if nothing is left after cleaning)
  if (nchar(inStr) > 0) {
    return(inStr); 
  } else {
    return("");
  }
}
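
As a quick check of what the cleaning step produces, here is a base-R sketch that mirrors the same transformations without the tm package. The name `clean_text_base` is hypothetical and not part of the project, and it is only assumed (not verified) to agree with `CleanInputString` on plain ASCII input.

```r
# Hypothetical base-R mirror of the cleaning steps (not the project's function).
clean_text_base <- function(x) {
  x <- iconv(x, "latin1", "ASCII", sub = " ")          # replace non-ASCII bytes with spaces
  x <- gsub("[^[:alpha:][:space:][:punct:]]", "", x)   # drop digits and other symbols
  x <- tolower(x)                                      # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)                     # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)                    # squeeze runs of white space
  gsub("^ +| +$", "", x)                               # trim both ends
}

clean_text_base("This is. the; -  .   use's 12")
```

For this sentence the sketch returns `"this is the uses"`: the digits and punctuation disappear, and because the apostrophe is removed, "use's" becomes "uses".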

#---------------------------------------
# Description of the Back Off Algorithm
#---------------------------------------
# To predict the next term of the user-specified sentence:
# 1. First we use the FourGram: we look for entries whose first three words are the last three words
#    of the user-provided sentence. The FourGram is already sorted from highest to lowest frequency.
# 2. If no FourGram is found, we back off to the ThreeGram (first two words of the ThreeGram = last two words of the sentence)
# 3. If no ThreeGram is found, we back off to the TwoGram (first word of the TwoGram = last word of the sentence)
# 4. If no TwoGram is found, we back off to the OneGram (the single most frequent word)
#
PredNextTerm <- function(inStr)
{
  assign("mesg", "in PredNextTerm", envir = .GlobalEnv)
  
  # Clean up the input string and extract only the words with no leading and trailing white spaces
  inStr <- CleanInputString(inStr);
  
  # Split the input string across white spaces and then extract the length
  inStr <- unlist(strsplit(inStr, split=" "));
  inStrLen <- length(inStr);
  
  nxtTermFound <- FALSE;
  predNxtTerm <- as.character(NULL);
  #mesg <<- as.character(NULL);
  # 1. First test the Four Gram using the four gram data frame
  if (inStrLen >= 3 && !nxtTermFound)
  {
    # Assemble the last three terms of the input string, separated by single spaces
    inStr1 <- paste(inStr[(inStrLen-2):inStrLen], collapse=" ");
    
    # Subset the Four Gram data frame; the trailing space makes the last
    # word match whole words only (e.g. "the" must not match "there")
    searchStr <- paste0("^", inStr1, " ");
    DF4Temp <- fDF4[grep(searchStr, fDF4$terms), , drop = FALSE];
    
    # Check whether at least one matching record was returned
    if ( nrow(DF4Temp) > 0 )
    {
      predNxtTerm <- as.character(DF4Temp$terms[1]);
      nxtTermFound <- TRUE;
      mesg <<- "Next word is predicted using 4-gram."
    }
    DF4Temp <- NULL;
  }
  
  # 2. Next test the Three Gram using the three gram data frame
  if (inStrLen >= 2 && !nxtTermFound)
  {
    # Assemble the last two terms of the input string, separated by a single space
    inStr1 <- paste(inStr[(inStrLen-1):inStrLen], collapse=" ");
    
    # Subset the Three Gram data frame; the trailing space makes the last
    # word match whole words only
    searchStr <- paste0("^", inStr1, " ");
    DF3Temp <- fDF3[grep(searchStr, fDF3$terms), , drop = FALSE];
    
    # Check whether at least one matching record was returned
    if ( nrow(DF3Temp) > 0 )
    {
      predNxtTerm <- as.character(DF3Temp$terms[1]);
      nxtTermFound <- TRUE;
      mesg <<- "Next word is predicted using 3-gram."
    }
    DF3Temp <- NULL;
  }
  
  # 3. Next test the Two Gram using the two gram data frame
  if (inStrLen >= 1 && !nxtTermFound)
  {
    # Take the last term of the input string
    inStr1 <- inStr[inStrLen];
    
    # Subset the Two Gram data frame; the trailing space makes the
    # word match whole words only
    searchStr <- paste0("^", inStr1, " ");
    DF2Temp <- fDF2[grep(searchStr, fDF2$terms), , drop = FALSE];
    
    # Check whether at least one matching record was returned
    if ( nrow(DF2Temp) > 0 )
    {
      predNxtTerm <- as.character(DF2Temp$terms[1]);
      nxtTermFound <- TRUE;
      mesg <<- "Next word is predicted using 2-gram.";
    }
    DF2Temp <- NULL;
  }
  
  # 4. If no next term found in Four, Three and Two Grams return the most
  #    frequently used term from the One Gram using the one gram data frame
  if (!nxtTermFound && inStrLen > 0)
  {
    predNxtTerm <- fDF1$terms[1];
    mesg <- "No next word found, the most frequent word is selected as next word."
  }
  
  nextTerm <- word(predNxtTerm, -1);
  
  if (inStrLen > 0){
    dfTemp1 <- data.frame(nextTerm, mesg);
    return(dfTemp1);
  } else {
    nextTerm <- "";
    mesg <-"";
    dfTemp1 <- data.frame(nextTerm, mesg);
    return(dfTemp1);
  }
}
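
The back-off order used by `PredNextTerm` can be tried out on toy data without the saved n-gram files. The sketch below is an illustration only: `backoff_predict` and the `toy` tables are hypothetical names with made-up contents, and the lookup uses `startsWith` on a prefix with a trailing space instead of `grep`. Unlike `PredNextTerm`, it returns the top 1-gram even for empty input.

```r
# Toy n-gram tables, each sorted from most to least frequent (made-up data).
toy <- list(
  data.frame(terms = c("the", "of"), stringsAsFactors = FALSE),                       # 1-grams
  data.frame(terms = c("of the", "thanks for", "the end"), stringsAsFactors = FALSE), # 2-grams
  data.frame(terms = c("one of the", "thanks for the"), stringsAsFactors = FALSE),    # 3-grams
  data.frame(terms = c("thanks for the follow"), stringsAsFactors = FALSE)            # 4-grams
)

# Hypothetical helper: try the longest usable context first, then back off.
backoff_predict <- function(words, tables) {
  max_context <- min(length(words), length(tables) - 1)
  if (max_context >= 1) {
    for (k in max_context:1) {
      # Last k words plus a trailing space, so "the" cannot match "there ..."
      prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
      terms <- tables[[k + 1]]$terms
      hits <- terms[startsWith(terms, prefix)]
      if (length(hits) > 0) {
        # Tables are sorted by frequency, so the first hit is the most frequent;
        # keep only its last word
        return(sub(".* ", "", hits[1]))
      }
    }
  }
  # No context matched: fall back to the most frequent single word
  tables[[1]]$terms[1]
}

backoff_predict(c("thanks", "for", "the"), toy)  # matched in the 4-grams
backoff_predict(c("because", "of"), toy)         # backs off to the 2-grams
backoff_predict(c("zebra"), toy)                 # falls back to the 1-grams
```

With these toy tables the three calls return `"follow"`, `"the"` and `"the"` respectively.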

msg <- ""
shinyServer(function(input, output) {
  output$prediction <- renderPrint({
    # Reading input$action makes this block re-run when the button is pressed
    input$action;
    # PredNextTerm() cleans its input itself, so the raw text is passed in
    strDF <- PredNextTerm(input$inputString);
    msg <<- as.character(strDF[1,2]);
    cat("", as.character(strDF[1,1]))
    cat("\n\t");
    cat("\n\t");
    cat("Note: ", as.character(strDF[1,2]));
  })
  
  output$text1 <- renderText({
    paste("Input Sentence: ", input$inputString)});
  
  output$text2 <- renderText({
    input$action;
    #paste("Note: ", msg);
  })
}
)
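
For reference, frequency-sorted tables shaped like `fDF1` to `fDF4` could be produced along the following lines. This is a base-R sketch under stated assumptions, not the project's actual build script: `build_ngram_table`, the `freq` column name and the sample lines are all hypothetical; only the `terms` column name is taken from the server code.

```r
# Hypothetical helper: count all n-word sequences in already-cleaned lines
# and return them sorted from most to least frequent.
build_ngram_table <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " +"), function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    # Slide a window of width n across the words of one line
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(terms = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Made-up sample of cleaned text
cleaned <- c("thanks for the follow", "thanks for the help", "thanks for the follow")
fDF3_toy <- build_ngram_table(cleaned, 3)
fDF3_toy

# Filtering out low-frequency n-grams, as described under Data Handling and Cleaning
fDF3_small <- fDF3_toy[fDF3_toy$freq > 1, , drop = FALSE]
```

A table built this way could then be written with `saveRDS()` so that the `readRDS()` calls at the top of the server script can load it.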