finalshiny

GusEsq
9 October 2016

Prediction Algorithm

I tried to keep my algorithm simple:

  • 2-gram word prediction (see the toy example below)
  • returns the 3 next possible words
  • input needs to follow some rules (lowercase, English)
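
As a toy illustration (assumed mini-corpus, not the app's data), 2-gram prediction just ranks how often each bigram starting with the input word appears:

#Toy example: count bigrams in a tiny assumed corpus and keep the top 3
toybigrams <- c("rock star", "rock star", "rock i", "rock chair", "rock star")
head(sort(table(toybigrams), decreasing = TRUE), 3)
# rock star rock chair     rock i
#         3          1          1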

Explanation and Example

  • Step 1: Gather all the data together and sample just 1 percent
  • Step 2: Make the corpus
  • Step 3: Clean the corpus: stems, punctuation, lowercase, etc.
  • Step 4: Build the 2-gram matrix, ordered by frequency (see the sketch below)
  • Step 5: Select the top 3 second words for the input (shown on the next slide)

e.g. Let's try the word "rock": nextwordpred("rock") returns "star", "i", "chair".
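
A minimal sketch of steps 1 through 4, assuming the tm package (the file name, seed, and exact cleaning steps are my assumptions, not the app's exact code):

#Sketch of steps 1-4 with the tm package (file name and seed are assumptions)
library(tm)
#Step 1: read the raw text and keep a 1 percent sample
lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
set.seed(1234)
sampled <- sample(lines, round(length(lines) * 0.01))
#Step 2: make the corpus
corpus <- VCorpus(VectorSource(sampled))
#Step 3: clean the corpus (the slide also mentions stems; tm's stemDocument does that)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
#Step 4: 2-gram matrix ordered by frequency (ngrams() comes from NLP, loaded with tm)
bigram <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                             use.names = FALSE)
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
freqtwoGramw <- data.frame(term = names(freq), freq = freq, row.names = NULL)
write.csv(freqtwoGramw, "database.csv", row.names = FALSE)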

Algorithm 1/2

#I upload to the Shiny server a csv file with the corpus 2-grams ordered by frequency
#This is the main reason the algorithm is quick
freqtwoGramw <- read.csv("database.csv")
#function for running the prediction
nextwordpred <- function(word) {
#split off the last word of the input phrase
  part1 <- strsplit(word, " ") 
  part2 <- length(part1[[1]])
  lookfor <- part1[[1]][part2]
#anchor the pattern; the trailing space keeps grep from matching prefixes like "rocket"
  comilla <- "^"
  check <- paste(comilla, lookfor, " ", sep = "")

Algorithm 2/2

#Show the 3 most frequent words 
  options <- grep(check, freqtwoGramw$term, ignore.case = FALSE)
  option1 <- options[1]
  option2 <- options[2]
  option3 <- options[3]
  result1 <- as.character(freqtwoGramw$term[option1])
  result2 <- as.character(freqtwoGramw$term[option2])
  result3 <- as.character(freqtwoGramw$term[option3])
#keep only the second word of each matching bigram
  word1 <- strsplit(result1, split = " ")[[1]][2]
  word2 <- strsplit(result2, split = " ")[[1]][2]
  word3 <- strsplit(result3, split = " ")[[1]][2]
  final <- rbind(word1, word2, word3)
  print(final) 
}
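
A quick usage sketch; the printed output below is illustrative, based on the "rock" example earlier (actual results depend on the sampled corpus):

nextwordpred("rock")
#      [,1]
#word1 "star"
#word2 "i"
#word3 "chair"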

Conclusions

I always follow the rule that understanding the problem is the most important step when working out a solution. The 2-gram frequency term lookup gives a basic but powerful approach to predicting the next word.

This course was really challenging. The coding was the hardest part for me, since I am more used to statistical work, but this kind of development ends in a product for an end user, and that is where the value is added.