The goal of this project is to predict the next word based on user input, similar to the SwiftKey keyboard used on iOS and Android mobile devices. Natural Language Processing (NLP) is used to clean the data and build n-gram models, and a Stupid Back-off algorithm is then used to look for the most likely next words.
Based on an n-gram model (using unigrams, bigrams, trigrams and quadgrams), the Stupid Back-off method determines the next likely word by consulting the quadgram, trigram, bigram and unigram tables in that order. Specific weights are assigned to the different n-grams, so that words determined by a higher-order n-gram (say the quadgram) have a higher likelihood of appearing as the outcome than those from lower-order n-grams such as the trigram or bigram. If no next word can be determined from the higher-order n-grams, the unigram with the highest frequency is returned.
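As a rough sketch of the scoring idea (the helper below is illustrative only; the actual weights and implementation appear in the predictnextword function later in this report):
# Illustrative sketch of the Stupid Back-off scoring; backoff_score() is a
# hypothetical helper, but the weights mirror the ones used later (1, 0.6, 0.4, 0.16)
backoff_score <- function(freq, total_freq, weight) {
  # relative frequency of the candidate within its n-gram table,
  # scaled by the weight of that n-gram level
  freq / total_freq * weight
}
# A candidate found in the quadgram table (weight 1) outranks one found
# only in the bigram table (weight 0.4) at the same relative frequency
backoff_score(freq = 10, total_freq = 50, weight = 1)    # 0.20
backoff_score(freq = 10, total_freq = 50, weight = 0.4)  # 0.08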
This project takes samples of historical user text from blog postings, news articles and Twitter feeds and applies natural language processing so that a prediction of the next word can be provided given a user input of one or more words. Using the blog, news and Twitter datasets provided by HC Corpora (www.corpora.heliohost.org), a smaller sample is taken to create a model using quadgrams (4 words), trigrams (3 words), bigrams (2 words) and unigrams (1 word) along with their respective frequencies of occurrence.
The n-gram models are developed after cleansing the data as follows: 1. converting the text to lower case 2. removing punctuation 3. removing numbers 4. removing English stop words 5. removing profane words 6. removing extra whitespace
The prediction works as follows: if more than three words are entered, the last three words are taken for prediction; if three or fewer words are entered, all of them are retained for determining the next word. From that input, the last word is used with the bigram table to determine the next word; similarly, the last two words are used with the trigram table and the last three words with the quadgram table. If no input is entered, or the input cannot be matched to any of the n-grams, the unigrams with the highest frequencies are returned.
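The sketch below illustrates how the last three words of an input might be extracted (a hypothetical helper, not the exact project code; the report's textcleanse function handles the real cleansing and reversal):
# Illustrative only: keep at most the last three words of the input
# before querying the n-gram tables
lastwords <- function(input, n = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  tail(words, n)
}
lastwords("thanks for the follow")  # "for" "the" "follow"
lastwords("thank you")              # "thank" "you"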
This is a simple model that is only as good as its training data. Hence it is really based on the text taken and is contextualized to blogs, news and Twitter. If a test case is text taken from, say, a William Shakespeare play, this predictor may not work well.
This predictor is also limited by the model, which is optimised using the most frequently occurring two-, three- and four-word combinations. Hence, for a five- or six-word phrase it may not measure up to expectation, since only n-grams up to quadgrams were used. However, with some effort, 5-word and 6-word models can be built to extend this predictor.
The data is downloaded from the SwiftKey URL (http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) automatically if the “data” subdirectory is not present in the working directory. Otherwise, the script assumes that the data is already downloaded and the files are present in the directory.
If the data directory is not present in the working directory, the data is downloaded and unzipped by the script; otherwise this step is skipped. This ensures that the data is not downloaded repeatedly, saving time and bandwidth.
# To avoid downloading the huge dataset again and again, it is downloaded and
# unzipped only when the data directory is missing.
# Get current working dir
maindir <- getwd()
subdir <- "data"
# Download file and unzip only when the data directory is not present
chkdir <- dir.exists(file.path(maindir, subdir))
if (chkdir!=TRUE) {
# Create data directory and set that as the data directory
dir.create(file.path(maindir, subdir), showWarnings = FALSE)
setwd(file.path(maindir, subdir))
# specify the source and destination of the download
dest.file <- "Coursera-SwiftKey.zip"
src.file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
download.file(src.file, dest.file)
# extract the files from the zip file
unzip(dest.file)
# Return to the main working directory so later relative paths still work
setwd(maindir)
}
The key component of Natural Language Processing (NLP) used to predict the next word is the n-gram. According to Wikipedia, in the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
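For a concrete example, the bigrams of a short sentence can be listed with a couple of lines of base R (an illustrative hand-rolled sketch; the project itself uses RWeka::NGramTokenizer):
# Illustrative example of word n-grams from a short sentence
words <- c("to", "be", "or", "not", "to", "be")
# bigrams: all consecutive pairs of words
paste(head(words, -1), tail(words, -1))
# "to be" "be or" "or not" "not to" "to be"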
There are various conditions built into the script to save time. If the n-grams are already saved in the data directory, they are loaded directly for processing; otherwise, the n-grams are created. There are a few steps in the creation of the n-grams. First, the data is sampled from the large collection of corpora: Twitter, news and blog. Then the data is further split into smaller training and test datasets for training and testing the model. The raw data (Twitter, news and blog) are saved in the .RData format for faster loading and processing. Similarly, the corpus is also saved in .RData format.
Finally, after the n-grams are generated from the training data, they are saved in a .RData file for faster loading. Eventually, when the function is run, it looks for the saved .RData files. It loads them if found; if not, it creates and saves the 3 RData files, namely rawdata.RData, corpus.RData and ngrams.RData. The .RData files need to be deleted if the intention is to regenerate them, say after parameters such as the seed or sample size are changed.
#====================================
# n-gram extraction function
#====================================
ngramextract <- function(){
#-----------------------------
# Initial setup
#-----------------------------
# Load libraries
suppressWarnings(suppressMessages(library(stringi)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(slam)))
suppressWarnings(suppressMessages(library(RWeka)))
# Set working directory
setwd("~/RProject/DataScienceCoursera/10_Capstone Project/coursera-datascience-capstone")
# Number of records in the smaller dataset
samplerecords <- 20000
# Number of records in training dataset
nlinestr <- samplerecords * 0.9
# Number of records in testing dataset
nlinests <- samplerecords * 0.1
#-----------------------------
# NGRAMS CREATION
#-----------------------------
# Load data. If that does not exist then continue with the following steps
# Load the ngrams data if they exist
checkn <- file.exists(file.path("data/ngrams.RData"))
if (checkn==F){
#-----------------------------
# RAW DATA
#-----------------------------
# Load the rawdata if exists, else create it
checkr <- file.exists(file.path('data/rawdata.RData'))
if (checkr==F){
# read in the data
twit <- readLines('data/final/en_US/en_US.twitter.txt', encoding = 'UTF-8')
news <- readLines('data/final/en_US/en_US.news.txt' , encoding = 'UTF-8')
blog <- readLines('data/final/en_US/en_US.blogs.txt' , encoding = 'UTF-8')
# Save rawdata
save(twit, news, blog, file='data/rawdata.RData')
} else {load('data/rawdata.RData')}
#-----------------------------
# CORPUS DATA
#-----------------------------
# Load the corpus data if exists, else create it
checkc <- file.exists(file.path('data/corpus.RData'))
if (checkc==F){
# Set Seed for all sampling
set.seed(1000)
# Smaller dataset
twit <- twit[sample(1:length(twit), samplerecords)]
news <- news[sample(1:length(news), samplerecords)]
blog <- blog[sample(1:length(blog), samplerecords)]
# Sampling of twitter into training and testing
index <- sample(1:length(twit), nlinestr)
sampletwittr <- twit[index] # Train
sampletwitts <- twit[-index] # Test
# Sampling of news into training and testing
index <- sample(1:length(news), nlinestr)
samplenewstr <- news[index]
samplenewsts <- news[-index]
# Sampling of blog into training and testing
index <- sample(1:length(blog), nlinestr)
sampleblogtr <- blog[index]
sampleblogts <- blog[-index]
# Subsetting (shuffling) the test datasets; each test set has nlinests records
indexts <- sample(1:nlinests, nlinests)
sampletwitts <- sampletwitts[indexts]
samplenewsts <- samplenewsts[indexts]
sampleblogts <- sampleblogts[indexts]
#-----------------------------
# clean the data
#-----------------------------
## clean twitter data
asciitwittr <- stri_enc_toascii(sampletwittr)
asciitwittr <- stri_replace_all_regex(asciitwittr,'\032','')
asciitwitts <- stri_enc_toascii(sampletwitts)
asciitwitts <- stri_replace_all_regex(asciitwitts,'\032','')
## clean news data
asciinewstr <- stri_enc_toascii(samplenewstr)
asciinewstr <- stri_replace_all_regex(asciinewstr,'\032','')
asciinewsts <- stri_enc_toascii(samplenewsts)
asciinewsts <- stri_replace_all_regex(asciinewsts,'\032','')
## clean blogdata
asciiblogtr <- stri_enc_toascii(sampleblogtr)
asciiblogtr <- stri_replace_all_regex(asciiblogtr,'\032','')
asciiblogts <- stri_enc_toascii(sampleblogts)
asciiblogts <- stri_replace_all_regex(asciiblogts,'\032','')
# Train and Test data
train <- VectorSource(c(asciitwittr, asciinewstr, asciiblogtr))
test <- VectorSource(c(asciitwitts, asciinewsts, asciiblogts))
# Assign to corpus of training and testing
corpustr <- Corpus(train)
corpusts <- Corpus(test)
# Clean memory
rm(blog, news, twit, sampleblogtr, samplenewstr)
rm(sampletwittr, sampleblogts, samplenewsts, sampletwitts)
# Save the corpus in .RData format
save(corpustr, corpusts, file='data/corpus.RData')
} else {load('data/corpus.RData')}
# Load profane words
profanewords <- readLines("data/profanityfilter.txt", encoding = "UTF-8")
# Cleanse the data
corpustr <- tm_map(corpustr, content_transformer(tolower))
corpustr <- tm_map(corpustr, removePunctuation)
corpustr <- tm_map(corpustr, removeNumbers)
corpustr <- tm_map(corpustr, removeWords, stopwords("english"))
corpustr <- tm_map(corpustr, removeWords, profanewords)
corpustr <- tm_map(corpustr, stripWhitespace)
#-----------------------------
# Generate ngram frequencies
#-----------------------------
# sentence delimiters; prevent clustering across sentence boundaries
# sentence delimiters; prevent n-grams from spanning sentence boundaries
delimiters <- " \\t\\r\\n.!?,;\"()"
# Tokenizers (the delimiters are passed to the Weka tokenizer)
library(RWeka)
unigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = delimiters))
bigramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = delimiters))
trigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = delimiters))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = delimiters))
# Controls for the ngrams
uni.control <- list(tokenize = unigramTokenizer , bounds = list(global = c(6,Inf)))
bi.control <- list(tokenize = bigramTokenizer , bounds = list(global = c(3,Inf)))
tri.control <- list(tokenize = trigramTokenizer , bounds = list(global = c(2,Inf)))
quad.control <- list(tokenize = quadgramTokenizer, bounds = list(global = c(2,Inf)))
# Since the NGramTokenizer hangs on the parallel::mclapply call,
# changing the number of threads sorts the issue
# Sets the default number of threads to use
options(mc.cores=1)
# Generate the Term document matrix
library(tm)
unigram.tdm <- TermDocumentMatrix(corpustr, control = uni.control)
bigram.tdm <- TermDocumentMatrix(corpustr, control = bi.control)
trigram.tdm <- TermDocumentMatrix(corpustr, control = tri.control)
quadgram.tdm <- TermDocumentMatrix(corpustr, control = quad.control)
## Generate frequencies of ngrams
library(slam)
freq.uni <- rowapply_simple_triplet_matrix(unigram.tdm, sum)
freq.bi <- rowapply_simple_triplet_matrix(bigram.tdm, sum)
freq.tri <- rowapply_simple_triplet_matrix(trigram.tdm, sum)
freq.quad <- rowapply_simple_triplet_matrix(quadgram.tdm, sum)
# Save the n-gram frequencies to an RData file
save(freq.uni, freq.bi, freq.tri, freq.quad, file = 'data/ngramsfreq.RData')
#-----------------------------
# Generate the ngram dataframes
#-----------------------------
# Split the strings into individual words
firstname.bi <- sapply(strsplit(names(freq.bi), " "), function(a) a[1])
secname.bi <- sapply(strsplit(names(freq.bi), " "), function(a) a[2])
firstname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[1])
secname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[2])
thirname.tri <- sapply(strsplit(names(freq.tri), " "), function(a) a[3])
firstname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[1])
secname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[2])
thirname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[3])
fourname.quad <- sapply(strsplit(names(freq.quad), " "), function(a) a[4])
# get the final n-gram dataframe (Data table)
library(data.table)
## Unigram data table
unigramDT <- data.table(unigram = names(freq.uni),
freq = freq.uni)
## Bigram data table
bigramDT <- data.table(bigram = names(freq.bi),
freq = freq.bi,
unigram = firstname.bi,
nextword = secname.bi)
## Trigram data table
trigramDT <- data.table(trigram = names(freq.tri),
freq = freq.tri,
unigram = paste(firstname.tri, secname.tri),
nextword = thirname.tri)
## Quadgram data table
quadgramDT <- data.table(quadgram = names(freq.quad),
freq = freq.quad,
trigram = paste(firstname.quad, secname.quad, thirname.quad),
nextword = fourname.quad)
# Sort by frequency
unigramDT <- unigramDT[order(freq, decreasing=TRUE),]
bigramDT <- bigramDT[order(freq, decreasing=TRUE),]
trigramDT <- trigramDT[order(freq, decreasing=TRUE),]
quadgramDT <- quadgramDT[order(freq, decreasing=TRUE),]
# Save the n-gram data tables to an RData file
save(unigramDT, bigramDT, trigramDT, quadgramDT, file = 'data/ngrams.RData')
} else {load('data/ngrams.RData')}
# Output data
ngrams <- list(unigramDT, bigramDT, trigramDT, quadgramDT)
# Return output data
return(ngrams)
}
ngrams <- ngramextract()
The generated n-grams are saved in the data folder and can be loaded quickly when the ngramextract function is called. If the ngrams.RData file is not found in the data folder, the function goes on to create the n-grams, which can be time-consuming and resource-intensive.
# Load the ngrams dataframes
load('data/ngrams.RData')
# Sort the ngrams
unigramDF <- unigramDT[order(unigramDT$freq, decreasing=T),]
bigramDF <- bigramDT[order(bigramDT$freq, decreasing=T),]
trigramDF <- trigramDT[order(trigramDT$freq, decreasing=T),]
quadgramDF <- quadgramDT[order(quadgramDT$freq, decreasing=T),]
# Number of nGrams to plot
n <- 30
# Get the highest 30 numbers of nGrams
unigram.n30 <- unigramDF[1:n, ]
bigram.n30 <- bigramDF[1:n, ]
trigram.n30 <- trigramDF[1:n, ]
quadgram.n30 <- quadgramDF[1:n, ]
The histograms below indicate the frequencies of the top 30 n-grams. This gives a rough idea of the kinds of words and n-grams stored in the corpus.
# Load needed libraries
library(ggplot2)
# Unigram histogram
ggplot(unigram.n30, aes(x = unigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Unigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bigram histogram
ggplot(bigram.n30, aes(x = bigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Bigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Trigram histogram
ggplot(trigram.n30, aes(x = trigram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Trigrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Quadgram histogram
ggplot(quadgram.n30, aes(x = quadgram, y = freq)) +
geom_bar(stat = "identity") +
ggtitle("Top 30 frequencies of Quadgrams") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Load library
library(wordcloud)
# Load ngram frequency data
load("Data/ngramsfreq.RData")
# Combine multiple wordclouds into one graph
par(mfrow=c(1,4))
# Max words for the wordcloud
maxword <- 30; xlim <- 0.5; ylim <- 1.2
# Unigram word cloud
wc1 <- wordcloud(names(freq.uni), freq.uni, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Unigrams Word Cloud")
# Bigram word cloud
wc2 <- wordcloud(names(freq.bi), freq.bi, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Bigrams Word Cloud")
# Trigram word cloud
wc3 <- wordcloud(names(freq.tri), freq.tri, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Trigrams Word Cloud")
# Quadgram word cloud
wc4 <- wordcloud(names(freq.quad), freq.quad, max.words = maxword, colors = brewer.pal(6, "Dark2"))
text(x=xlim, y=ylim, "Top Quadgrams Word Cloud")
This is the most important step in the process, whereby the prediction model is developed. The prediction model identifies candidate next words from the appropriate n-grams, assigns them a weighted score and predicts the required number of next words based on the user's specification. To favour longer contexts, weights are assigned in decreasing order from quadgram to trigram, bigram and unigram, which means that a next word found in the quadgram table carries a higher weight than one found only in the remaining three n-grams. The weights assigned are largely arbitrary, and further research is needed to determine the relative weights that yield higher accuracy.
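As a hypothetical worked example of the weighting (the counts are made up, but the weights match those set in the prediction function below):
# A candidate accounting for 20% of the matching quadgram continuations beats
# one accounting for 40% of the matching bigram continuations, because the
# quadgram weight (1.0) is larger than the bigram weight (0.4)
0.20 * 1.0   # quadgram candidate score = 0.20
0.40 * 0.4   # bigram candidate score   = 0.16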
The input text is first cleansed before the next word is predicted. The cleansing includes removing punctuation, converting the characters to lower case, removing numbers, removing the English stop words, etc. The cleansed text is returned as either a character vector or a string, depending on where the function is used. The string output is used in the “Shiny App”, where the cleansed input text is shown; the character vector is used within the prediction function.
#====================================
# Cleanse Input function
#====================================
textcleanse <- function(input, type){
# type of output, vector or string
# v = character vector, s = string
if (type=="s") {
input <- tolower(input)
input <- removePunctuation(input)
input <- removeNumbers(input)
input <- unlist(strsplit(input, " "))
input <- setdiff(input, stopwords(kind = "en"))
input <- input[grepl('[[:alpha:]]', input)]
input <- paste(input, sep=" ", collapse=" ")
} else {
input <- tolower(input)
input <- removePunctuation(input)
input <- removeNumbers(input)
input <- rev(unlist(strsplit(input, " ")))
input <- setdiff(input, stopwords(kind = "en"))
input <- input[grepl('[[:alpha:]]', input)]
}
return(input)
}
#====================================
# Predict next word function
#====================================
predictnextword <- function( input, maxwords ) {
#-----------------------------
# Constants
#-----------------------------
# ngram weights
quadgramwt <- 1
trigramwt <- 0.6
bigramwt <- 0.4
unigramwt <- 0.16
#-----------------------------
# Load ngrams data
#-----------------------------
# Load ngrams
load("data/ngrams.RData")
# Cleanse the input text
input <- textcleanse(input, "v")
# Inputs for searching the quadgram, trigram and bigram tables
input.tri <- paste(input[3], input[2], input[1], sep = ' ')
input.bi <- paste(input[2], input[1], sep = ' ')
input.uni <- input[1]
#-----------------------------
# Searching next word in ngrams
#-----------------------------
quadgramsearch <- grepl(paste0("^", input.tri, "$"), quadgramDT$trigram)
quadgramsubset <- quadgramDT[quadgramsearch,]
if (sum(quadgramsearch) == 0) {
quadgramsubset <- quadgramDT[1,]
quadgramsubset$nextword <- "n.a."}
trigramsearch <- grepl(paste0("^", input.bi, "$"), trigramDT$bigram)
trigramsubset <- trigramDT[trigramsearch,]
if (sum(trigramsearch) == 0) {
trigramsubset <- trigramDT[1,]
trigramsubset$nextword <- "n.a."}
bigramsearch <- grepl(paste0("^",input.uni,"$"), bigramDT$unigram)
bigramsubset <- bigramDT[bigramsearch,]
if (sum(bigramsearch) == 0) {
bigramsubset <- bigramDT[1,]
bigramsubset$nextword <- "n.a."}
unigramDT <- unigramDT[order(unigramDT$freq, decreasing = T),]
unigramsubset <- unigramDT[1:maxwords,]
# Compute probability scores for the ngrams
unigramsubset$score <- unigramsubset$freq / sum(unigramDT$freq) * unigramwt
bigramsubset$score <- bigramsubset$freq / sum(bigramsubset$freq) * bigramwt
trigramsubset$score <- trigramsubset$freq / sum(trigramsubset$freq) * trigramwt
quadgramsubset$score <- quadgramsubset$freq / sum(quadgramsubset$freq) * quadgramwt
#-----------------------------
# Predict the word based on ngram results
#-----------------------------
# Determine the predicted words and their score
# predicted word for the dataframe
nextword <- c(quadgramsubset$nextword, trigramsubset$nextword,
bigramsubset$nextword, unigramsubset$unigram)
# scores for the predicted words
score <- c(quadgramsubset$score, trigramsubset$score,
bigramsubset$score, unigramsubset$score)
# Build the predicted word dataframe
predictedword <- data.frame(next_word = nextword, score = score, stringsAsFactors = F)
# Remove n.a.
predictedword <- subset(predictedword, next_word != "n.a.")
#-----------------------------
# Final processing of predicted words
#-----------------------------
# Summarize the dataframe
suppressWarnings(suppressMessages(library(dplyr)))
predictedword <- predictedword %>%
group_by(next_word) %>%
summarize(score=sum(score))
# Sort the dataframe using the score
predictedword <- predictedword[order(predictedword$score, decreasing = T),]
# Remove duplicates
predictedword <- unique(predictedword$next_word)
# Retain only the max results
predictedword <- predictedword[1:maxwords]
# Remove spaces
predictedword <- predictedword[grepl('[[:alpha:]]',predictedword)]
output <- paste(predictedword, sep=",", collapse=", ")
#-----------------------------
# Return final outcome
#-----------------------------
return(output)
}
# Testing 1
inputtext <- "mg cholesterol mg sodium"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: g, said, will
# Testing 2
inputtext <- "president barack obama"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: said, administration, will
# Testing 3
inputtext <- "two years"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: ago, old, later
# Testing 4
inputtext <- "even"
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: though, know, better
# Testing 5
inputtext <- ""
nextwords <- predictnextword(inputtext, 3)
The predicted next word(s) are: said, will, one
Based on the next word prediction, a Shiny app is deployed. The app takes an input string and the number of words to be predicted, then predicts the next word(s) using the back-off algorithm over the n-grams. The algorithm attaches higher weights to predictions from higher-order n-grams to improve accuracy; however, the weights attached to the n-grams will need further testing to determine values that produce higher accuracy.
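A minimal sketch of how such a Shiny app might wire the prediction function to the user interface is shown below (illustrative only; the deployed app's actual layout, input names and code may differ):
# Minimal illustrative Shiny sketch (not the deployed app's exact code)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("usertext", "Enter a phrase:", value = ""),
  numericInput("nwords", "Number of words to predict:", value = 3, min = 1, max = 10),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # predictnextword() is the prediction function defined earlier in this report
    predictnextword(input$usertext, input$nwords)
  })
}

shinyApp(ui = ui, server = server)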