Data Science Capstone: Predictive text model

Bruno González
Nov 22nd, 2018

Quick description

  1. Load the data from the English texts.
  2. Tokenize the data and create the n-grams.
  3. Calculate the relative frequency of each n-gram among those that begin with the same (n-1)-gram.
  4. Keep only the most frequent n-gram for each (n-1)-gram.
  5. The app loads the resulting matrix and looks up the input value in it.

N-grams function

  • To create the n-grams, we use a function that depends on the text entered and on n. The function is defined as follows:
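ftoken <- function(dat, n = 1) {
  # Remove punctuation and digits, split on whitespace, and lowercase
  dat_tok <- gsub("[[:punct:]]", "", dat)
  dat_tok <- gsub("[[:digit:]]", "", dat_tok)
  dat_tok <- tolower(unlist(strsplit(dat_tok, "\\s+")))
  dat_tok <- dat_tok[dat_tok != ""]  # drop empty tokens left by leading whitespace
  if (n > 1) {
    dat_tok2 <- character(length(dat_tok))
    for (i in seq_along(dat_tok)) {
      # Join the n words starting at position i. Past the end of the text
      # the missing words become "NA", so the output has one entry per word
      # for every n and the n-grams stay aligned with the (n+1)-grams.
      dat_tok2[i] <- paste(dat_tok[i:(i + n - 1)], collapse = " ")
    }
  } else {
    dat_tok2 <- dat_tok
  }
  dat_tok2
}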
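As an illustration of what the n-gram step produces, here is a minimal base-R sketch. The helper name `ngrams_sketch` is hypothetical and is not part of the project code. Unlike `ftoken`, which returns one entry per word, it returns only the complete n-grams.

```r
# Hypothetical helper: build the complete n-grams from a vector of words
ngrams_sketch <- function(words, n = 2) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Join the n consecutive words that start at each position
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- c("the", "quick", "brown", "fox")
ngrams_sketch(words, 2)
# "the quick" "quick brown" "brown fox"
```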

List of most frequent n-grams

  • First, a list is built of all the n-grams and, for each one, the relative frequency of every (n+1)-gram that extends it (the (n+1)-gram will be the prediction).
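library(dplyr)

fmatpred <- function(tokn, tokn1) {
  data.frame(tokn, tokn1) %>%
    group_by(tokn) %>% mutate(freqt = n()) %>%            # count of each n-gram
    ungroup() %>%
    group_by(tokn1) %>% mutate(freq = n() / freqt) %>%    # share of the n-gram continued by this (n+1)-gram
    summarize(tokn = nth(tokn, 1), freq = max(freq)) %>%  # one row per (n+1)-gram
    filter(freq > 0.01)                                   # keep only continuations above 1%
}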
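To make the frequency concrete, the sketch below works it out on a toy corpus, using only base R so that it runs without dplyr. The corpus and the variable names `ctx_count`, `pair_count` and `freq_tab` are made up for illustration. The quantity computed is the count of an (n+1)-gram divided by the count of the n-gram it starts with, shown here for n = 1.

```r
# Toy corpus, already tokenized (hypothetical data for illustration)
words <- c("i", "like", "tea", "i", "like", "coffee", "i", "am", "here")

# Pair each word (the 1-gram) with the 2-gram that starts at it;
# the last word has no successor, so it is left out
tokn  <- head(words, -1)
tokn1 <- paste(head(words, -1), tail(words, -1))

ctx_count  <- table(tokn)    # how often each 1-gram occurs
pair_count <- table(tokn1)   # how often each 2-gram occurs

freq_tab <- data.frame(
  tokn1 = names(pair_count),
  tokn  = sub(" .*$", "", names(pair_count)),  # first word of the 2-gram
  stringsAsFactors = FALSE
)
# Relative frequency: count of the 2-gram divided by count of its 1-gram
freq_tab$freq <- as.numeric(pair_count) / as.numeric(ctx_count[freq_tab$tokn])

freq_tab[freq_tab$tokn == "i", ]
# "i am" has freq 1/3 and "i like" has freq 2/3
```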
  • Then only the most likely prediction for each n-gram is kept, based on its frequency.
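ffunction <- function(l) {
  # Keep one prediction per n-gram: the continuation with the highest
  # frequency. row_number() breaks ties by order of appearance, whereas
  # rank() averages tied ranks and could drop every row in a three-way tie.
  l %>% group_by(tokn) %>% mutate(rank = row_number(desc(freq))) %>%
    filter(rank < 2) %>% ungroup()
}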
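The same selection can be sketched in base R on a small hand-made table. The data and the names `ordered` and `best` are hypothetical. Rows are sorted by n-gram and by decreasing frequency, and the first row of each n-gram is kept, so a tie goes to whichever row appeared first.

```r
# Hypothetical frequency table with two candidate predictions per n-gram
freq_tab <- data.frame(
  tokn  = c("i", "i", "like", "like"),
  tokn1 = c("i am", "i like", "like tea", "like coffee"),
  freq  = c(1/3, 2/3, 1/2, 1/2),
  stringsAsFactors = FALSE
)

# Sort by n-gram, then by decreasing frequency
ordered <- freq_tab[order(freq_tab$tokn, -freq_tab$freq), ]
# Keep the first (most frequent) row of each n-gram
best <- ordered[!duplicated(ordered$tokn), ]

best$tokn1
# "i like" "like tea"
```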

Shiny app

  • Finally, a Shiny app was built that searches the list for the phrase entered and predicts the next word.
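A minimal sketch of the kind of lookup such an app could perform is given below in base R. The function `predict_next`, the table `pred_tab` and its contents are hypothetical and are not taken from the app itself. The sketch assumes a table with one best (n+1)-gram per n-gram and that the input is cleaned the same way as the training text.

```r
# Hypothetical lookup table: one best (n+1)-gram per n-gram
pred_tab <- data.frame(
  tokn  = c("i like", "like tea", "thank you"),
  tokn1 = c("i like tea", "like tea with", "thank you very"),
  stringsAsFactors = FALSE
)

predict_next <- function(phrase, tab, n = 2) {
  # Clean the input: lowercase, strip punctuation and digits, split on spaces
  words <- gsub("[[:punct:][:digit:]]", "", tolower(phrase))
  words <- unlist(strsplit(words, "\\s+"))
  words <- words[words != ""]
  if (length(words) < n) return(NA_character_)
  # Look up the last n words of the phrase
  key <- paste(tail(words, n), collapse = " ")
  hit <- tab$tokn1[match(key, tab$tokn)]
  if (is.na(hit)) return(NA_character_)
  # The prediction is the last word of the matching (n+1)-gram
  sub("^.* ", "", hit)
}

predict_next("Well, I like", pred_tab)   # "tea"
predict_next("Thank you", pred_tab)      # "very"
predict_next("no match here", pred_tab)  # NA
```

In a Shiny app, a function like this would typically be called from the server function on the text input, with the table loaded once at start-up.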