Course Project: Data Science Capstone

Eladio Rego

June 25, 2017

Overview

The objective of this application is to build a model that predicts the most likely next word from a given sequence of text.

Description

Models that assign probabilities to sequences of words are called language models, or LMs.

This project uses the simplest LM that assigns probabilities to sentences and sequences of words: the N-gram model.

An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word sequence like “good morning”, “coffee milk”, or “herbal tea”, and a 3-gram (or trigram) is a three-word sequence like “please can you” or “do your homework”. We’ll use N-gram models to estimate the probability of the last word of an N-gram given the previous words, and also to assign probabilities to entire sequences.
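
As a toy illustration (not part of the project code), the simplest way to estimate such a probability is the maximum-likelihood ratio of counts, P(w | w_prev) = count(w_prev w) / count(w_prev). A minimal R sketch, with the function name and counts made up for the example:

    # Maximum-likelihood bigram estimate; the counts below are invented
    # purely to illustrate the calculation.
    bigramProb <- function(bigramCounts, unigramCounts, wPrev, w) {
        key <- paste(wPrev, w)
        if (!key %in% names(bigramCounts) || !wPrev %in% names(unigramCounts)) {
            return(0)
        }
        bigramCounts[[key]] / unigramCounts[[wPrev]]
    }

    unigramCounts <- c(good = 12, morning = 7)
    bigramCounts  <- c("good morning" = 5)
    bigramProb(bigramCounts, unigramCounts, "good", "morning")  # 5/12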

The N-gram model is one of the most important tools in speech and language processing.

Loading and Preparing Data

We will build the dataset from a corpus of text documents composed of blogs, news articles and tweets, provided by SwiftKey for the Data Science Capstone project.

We will use the tm package, reading the documents into a Corpus object.
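
A minimal loading sketch, assuming the three English SwiftKey files have been unzipped into a local folder named final/en_US (the folder path is an assumption, not from the original project):

    library(tm)

    # Read every .txt file in the (assumed) folder into a corpus
    docs <- DirSource("final/en_US", pattern = "\\.txt$")
    corpusWord <- VCorpus(docs, readerControl = list(language = "en"))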

We will clean and normalize the data: removing stop words, profanity, numbers, punctuation and extra whitespace, lower-casing and stemming the text.

    # Replace literal "NA" strings with the empty string
    tr_NA <- content_transformer(function(x, pattern) gsub(pattern, "", x))

    corpusWord <- tm_map(corpusWord, tr_NA, "NA")
    corpusWord <- tm_map(corpusWord, content_transformer(tolower))
    corpusWord <- tm_map(corpusWord, removeWords, stopwords("english"))
    corpusWord <- tm_map(corpusWord, removeNumbers)
    corpusWord <- tm_map(corpusWord, removePunctuation)
    corpusWord <- tm_map(corpusWord, stripWhitespace)
    corpusWord <- tm_map(corpusWord, stemDocument)
    # badwordsvector is a character vector of profanity terms loaded beforehand
    corpusWord <- tm_map(corpusWord, removeWords, badwordsvector)

Generating n-grams

To generate the n-grams for our prediction model, we will use the RWeka and tm packages.

Using TermDocumentMatrix from tm and NGramTokenizer from RWeka, we will build n-gram frequency tables from 1-grams up to 4-grams.

We will also include a discount field, which will be useful later when we apply Katz’s back-off model to the n-grams.

For example, to create the bigrams (2-grams) we follow these steps:

    library(RWeka)

    # Tokenizer that splits the text into two-word sequences
    BigramTokenizer <- function(x)
        NGramTokenizer(x, Weka_control(min = 2, max = 2,
                                       delimiters = " \\r\\n\\t.,;:\"()?!"))

    # Term-document matrix whose terms are the bigrams of the corpus
    dtm_Bigram <- TermDocumentMatrix(corpusWord, control = list(tokenize = BigramTokenizer))

    # Helper (defined elsewhere in the project) returning the bigrams
    # and their frequencies as a data frame
    freqBi.df <- getGramAsDataFrame(dtm_Bigram)

    # Discount column, initialised to 1, to be filled by Good-Turing later
    freqBi.df[, "discount"] <- 1
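
The helper getGramAsDataFrame is not shown in this report; below is a minimal, hypothetical sketch of what it could look like, assuming it simply collapses the term-document matrix into term/frequency pairs:

    # Hypothetical sketch of the helper used above: sum each n-gram's
    # counts across documents and return a frequency-sorted data frame.
    getGramAsDataFrame <- function(tdm) {
        freq <- slam::row_sums(tdm)
        df <- data.frame(term = names(freq), freq = as.integer(freq),
                         stringsAsFactors = FALSE, row.names = NULL)
        df[order(df$freq, decreasing = TRUE), ]
    }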

Katz’s back-off model

Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram.

It accomplishes this estimation by “backing off” to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is the one used to produce the estimate, which gives better results.
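
In the bigram case the back-off estimate takes the standard form below (notation follows the reference that comes next): $C$ is a count, $d$ is the discount applied to observed counts, and $\alpha(w_{i-1})$ is the back-off weight that redistributes the reserved probability mass over unseen words:

$$
P_{bo}(w_i \mid w_{i-1}) =
\begin{cases}
d \cdot \dfrac{C(w_{i-1} w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0 \\[6pt]
\alpha(w_{i-1}) \cdot P(w_i) & \text{otherwise}
\end{cases}
$$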

https://en.wikipedia.org/wiki/Katz%27s_back-off_model

We will need to calculate the Good-Turing discount.

The code for the Good-Turing discount is below:

    # r          : observed frequency of an n-gram
    # nfreq      : number of distinct n-grams seen exactly r times (N_r)
    # nfreqPlus1 : number of distinct n-grams seen exactly r + 1 times (N_{r+1})
    getGoodTuringDiscount <- function(r, nfreq, nfreqPlus1)
    {
        # For high frequencies the counts are reliable, so no discount is applied
        if(r > 50)
        {
            return(1)
        }
        
        # Variant of the Good-Turing ratio (r+1)*N_{r+1} / (r*N_r), with the
        # frequency-of-frequency counts taken on a log10 scale; the 0.01
        # offset avoids taking the log of zero
        discount <- ((r + 1) / r) * (log10(nfreqPlus1 + 0.01) / log10(nfreq + 0.01))
        
        # The discount can never exceed 1
        if(discount > 1)
        {
            discount <- 1
        }
        
        return(discount)
    }
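
A hypothetical usage sketch (not from the original code) showing how the discount column created earlier could be filled for every observed bigram frequency:

    # For each observed frequency r, count how many bigrams occur r times (N_r)
    # and r + 1 times (N_{r+1}), then store the resulting discount.
    freqOfFreq <- table(freqBi.df$freq)
    for (r in as.integer(names(freqOfFreq))) {
        nr  <- as.numeric(freqOfFreq[as.character(r)])
        nr1 <- as.numeric(freqOfFreq[as.character(r + 1)])
        if (is.na(nr1)) nr1 <- 0
        freqBi.df$discount[freqBi.df$freq == r] <- getGoodTuringDiscount(r, nr, nr1)
    }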

Shiny App

Go to https://erego.shinyapps.io/predictNextWord/