finalshiny

GusEsq
9 October 2016

Prediction Algorithm

I tried to keep my algorithm simple:

  • 2-gram word prediction (see the toy example below)
  • returns the 3 next possible words
  • input needs to follow some rules (lowercase, English)
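
As a toy illustration (assumed mini-corpus, not the app's data), 2-gram prediction just ranks how often each bigram starting with the input word appears:

#Toy example: count bigrams in a tiny assumed corpus and keep the top 3
toybigrams <- c("rock star", "rock star", "rock i", "rock chair", "rock star")
head(sort(table(toybigrams), decreasing = TRUE), 3)
# rock star rock chair     rock i
#         3          1          1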

Explanation and Example

  • Step 1: Gather all the data together and sample just 1 percent
  • Step 2: Make the corpus
  • Step 3: Clean the corpus: stems, punctuation, lowercase, etc.
  • Step 4: Build the 2-gram matrix, ordered by frequency (see the sketch below)
  • Step 5: Select the top 3 second words for the input (shown on the next slide)

e.g. Let's try the word "rock": nextwordpred("rock") returns "star", "i", "chair".
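
A minimal sketch of steps 1 through 4, assuming the tm package (the file name, seed, and exact cleaning steps are my assumptions, not the app's exact code):

#Sketch of steps 1-4 with the tm package (file name and seed are assumptions)
library(tm)
#Step 1: read the raw text and keep a 1 percent sample
lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
set.seed(1234)
sampled <- sample(lines, round(length(lines) * 0.01))
#Step 2: make the corpus
corpus <- VCorpus(VectorSource(sampled))
#Step 3: clean the corpus (the slide also mentions stems; tm's stemDocument does that)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
#Step 4: 2-gram matrix ordered by frequency (ngrams() comes from NLP, loaded with tm)
bigram <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                             use.names = FALSE)
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
freqtwoGramw <- data.frame(term = names(freq), freq = freq, row.names = NULL)
write.csv(freqtwoGramw, "database.csv", row.names = FALSE)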

Algorithm 1/2

#I upload to the Shiny server a csv file with the corpus 2-grams ordered by frequency
#This is the main reason the algorithm is quick
freqtwoGramw <- read.csv("database.csv")
#function for running the prediction
nextwordpred <- function(word) {
#split off the last word of the input phrase
  part1 <- strsplit(word, " ") 
  part2 <- length(part1[[1]])
  lookfor <- part1[[1]][part2]
#anchor the pattern; the trailing space keeps grep from matching prefixes like "rocket"
  comilla <- "^"
  check <- paste(comilla, lookfor, " ", sep = "")

Algorithm 2/2

#Show the 3 most frequent words 
  options <- grep(check, freqtwoGramw$term, ignore.case = FALSE)
  option1 <- options[1]
  option2 <- options[2]
  option3 <- options[3]
  result1 <- as.character(freqtwoGramw$term[option1])
  result2 <- as.character(freqtwoGramw$term[option2])
  result3 <- as.character(freqtwoGramw$term[option3])
#keep only the second word of each matching bigram
  word1 <- strsplit(result1, split = " ")[[1]][2]
  word2 <- strsplit(result2, split = " ")[[1]][2]
  word3 <- strsplit(result3, split = " ")[[1]][2]
  final <- rbind(word1, word2, word3)
  print(final) 
}
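
A quick usage sketch; the printed output below is illustrative, based on the "rock" example earlier (actual results depend on the sampled corpus):

nextwordpred("rock")
#      [,1]
#word1 "star"
#word2 "i"
#word3 "chair"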

Conclusions

I always follow the rule that understanding the problem is the most important step when working out a solution. The 2-gram frequency term lookup gives a basic but powerful approach to predicting the next word.

This course was really challenging. The coding was the hardest part for me, since I am more used to statistical work, but this kind of development ends in a product for an end user, and that is where the value is added.