September 11, 2018

Objective

The objective of the data science capstone is to build a simple n-gram prediction model that predicts the next word from the input a user has provided.

What is an n-gram

An n-gram is a contiguous sequence of n items from a given sample of text or speech. N-grams are typically collected from a text or speech corpus.
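To make the definition concrete, here is a minimal base R sketch that lists the word-level n-grams of a short string. The helper name make_ngrams is hypothetical and is not part of the original project, which used quanteda for this step.

```r
# Hypothetical helper (not from the original slides): return all word-level
# n-grams of a string, lowercased and joined with a single space.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]              # drop empty pieces from leading spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1) # start position of every n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams  <- make_ngrams("Hello This is me", 2)
trigrams <- make_ngrams("Hello This is me", 3)
print(bigrams)
print(trigrams)
```

A four-word sentence yields three bigrams and two trigrams, since each n-gram is a window of n adjacent words.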

Idea Behind the Algorithm

  • Consider the sentence "Hello This is me". We want to predict the word that follows it. Conditioning on the whole sentence would be tedious, so instead we predict the next word from only the last few words of the input, in accordance with the Markov assumption. The predicted word is the one that follows those words in the highest proportion of cases in the training data

  • We combine the Markov assumption with a backoff procedure. For example, if the trigram model does not hold enough information, we fall back to the bigram model to predict the next word, and if that also fails, to the unigram model
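The backoff idea described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the count tables hold made-up numbers, and the helper names best_next and predict_next are hypothetical rather than the project's actual code. For a fixed prefix, the continuation with the highest count is also the one with the highest proportion.

```r
# Toy count tables (made-up numbers for illustration only).
# Names are n-grams; values are how often each was seen.
trigram_counts <- c("this is me" = 3, "this is a" = 5, "is a test" = 2)
bigram_counts  <- c("is me" = 3, "is a" = 5, "me too" = 4, "a test" = 2)
unigram_counts <- c("the" = 10, "a" = 7, "me" = 4)

# Hypothetical helper: most frequent word following `prefix` in `counts`,
# or NA if the prefix was never seen.
best_next <- function(prefix, counts) {
  key  <- paste0(prefix, " ")
  hits <- counts[startsWith(names(counts), key)]
  if (length(hits) == 0) return(NA_character_)
  top <- names(hits)[which.max(hits)]
  substring(top, nchar(key) + 1)           # strip the prefix, keep the next word
}

# Hypothetical predictor: trigram first, then bigram, then unigram.
predict_next <- function(input) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[words != ""]
  n <- length(words)
  if (n >= 2) {                            # last two words -> trigram table
    guess <- best_next(paste(words[(n - 1):n], collapse = " "), trigram_counts)
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {                            # back off: last word -> bigram table
    guess <- best_next(words[n], bigram_counts)
    if (!is.na(guess)) return(guess)
  }
  names(unigram_counts)[which.max(unigram_counts)]  # final fallback: top unigram
}

print(predict_next("Hello This is"))
print(predict_next("Hello This is me"))
```

In this toy setup the second call finds no trigram starting with "is me", so it backs off to the bigram table and uses only the last word.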

Data

  • To prepare the data I used the quanteda package. It was used to build the n-grams and to remove stopwords, punctuation, and so on; the result was finally converted to a document-term matrix (quanteda's dfm)

  • A 25 percent random sample of the data was used, which was taken to be representative of the whole population
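A 25 percent random sample could be drawn along these lines. This is a sketch, not the original sampling code: the all_lines data is a stand-in, the seed is arbitrary, and naming the result Vector is an assumption made so that it matches the input used in the quanteda calls.

```r
set.seed(1234)  # arbitrary seed so the sample is reproducible

# Stand-in for the lines read from the corpus (hypothetical data)
all_lines <- paste("line", 1:1000)

# Keep 25 percent of the lines, chosen uniformly at random without replacement
sample_size <- floor(0.25 * length(all_lines))
Vector <- all_lines[sample(length(all_lines), sample_size)]

print(length(Vector))
```

Sampling without replacement keeps every selected line distinct, so no line is counted twice when n-gram frequencies are computed.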

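library(quanteda)  # arguments below follow the quanteda 1.x API

# Tokenize the sampled text (Vector) into space-joined bigrams,
# dropping numbers, punctuation, symbols, separators, and URLs
Vector1 <- tokens(Vector, remove_numbers = TRUE, remove_punct = TRUE,
                  remove_symbols = TRUE, remove_separators = TRUE,
                  remove_twitter = TRUE, remove_hyphens = TRUE,
                  remove_url = TRUE, ngrams = 2, concatenator = " ")
# Build the document-feature matrix; dfm() lowercases features itself,
# so base tolower() must not be applied to the dfm afterwards
Vector1 <- dfm(Vector1, tolower = TRUE, remove = stopwords("english"))
# Number of documents in which each bigram occurs
Vector1 <- docfreq(Vector1)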

More Information