2024-07-14

Introduction

In the age of information overload, efficiency reigns supreme. Text prediction has emerged as a powerful tool, anticipating the words we want to type and streamlining the writing process. This presentation dives into the world of text prediction, exploring its inner workings, applications, and the impact it has on the way we interact with language.

The application

[Image: the application]

How it works

The algorithm builds 2-grams (bigrams) and 3-grams (trigrams) from the text. These are then used to predict the next word from the preceding one or two words.

To generate more than one word, the algorithm calls itself again after each newly generated word.

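library(stringr)  # provides str_split()

# Generate n words that follow `starter`.
#   ile: size of the candidate window passed to predict_next_word()
#   s:   if TRUE, the output starts with `starter`
give_me <- function(starter, n, ile, s = FALSE) {
  # Pick one word at random from the candidate next words.
  nowe <- sample(predict_next_word(starter, ile), 1, replace = FALSE)
  if (n == 1) {
    return(nowe)
  }
  # The new context is the last word of the starter plus the new word.
  words <- str_split(starter, " ")[[1]]
  m <- length(words)
  starter2 <- paste(words[m], nowe)
  # Recurse to generate the remaining n - 1 words.
  nowe2 <- give_me(starter2, n - 1, ile)
  if (s == FALSE) {
    out <- paste(nowe, nowe2)
  } else {
    out <- paste(starter, nowe, nowe2)
  }
  return(toupper(out))
}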
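The same idea can be tried on a toy corpus without the app's data. The sketch below uses base R only; the sentence and the names toy_text, toy_counts, toy_next and toy_generate are made up for illustration and are not part of the app. It always takes the most frequent continuation, which corresponds to a window of size 1.

```r
# Hypothetical toy corpus (not the app's data).
toy_text <- "the cat sat on the cat sat down"
words <- strsplit(toy_text, " ")[[1]]

# All adjacent word pairs, then their counts (a tiny bigram table).
pairs <- data.frame(word1 = head(words, -1), word2 = tail(words, -1),
                    n = 1, stringsAsFactors = FALSE)
toy_counts <- aggregate(n ~ word1 + word2, data = pairs, FUN = sum)

# Most frequent continuation of `word`.
toy_next <- function(word) {
  hits <- toy_counts[toy_counts$word1 == word, ]
  as.character(hits$word2[which.max(hits$n)])
}

# Generate n words recursively, feeding each new word back in as context.
toy_generate <- function(word, n) {
  nxt <- toy_next(word)
  if (n == 1) return(nxt)
  paste(nxt, toy_generate(nxt, n - 1))
}

toy_generate("the", 2)  # "cat sat"
```

Here "the" is followed by "cat" twice and "cat" by "sat" twice, so both steps have a single most frequent continuation.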

Efficiency

The key point is that the n-grams are not generated on the fly. They are computed offline and simply loaded when the app starts. The code that generates the n-grams is the following:

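library(dplyr)     # %>%, count()
library(tidyr)     # separate()
library(tidytext)  # unnest_tokens()

# d is a data frame with the raw text in column txt.
# Split the text into 2-word sequences and count each pair.
bigrams <- d %>%
  unnest_tokens(output = bigram, input = txt, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)
# Same for 3-word sequences.
trigrams <- d %>%
  unnest_tokens(output = trigram, input = txt, token = "ngrams", n = 3) %>%
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3, sort = TRUE)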

After that, the n-grams are loaded as R objects:

load("bigrams.RData")
load("trigrams.RData")
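The offline saving step is not shown here. A minimal sketch of the round trip, assuming the files were written with base R save(); the object bigrams_demo and the temporary file path are hypothetical stand-ins:

```r
# Hypothetical toy table standing in for the real bigram counts.
bigrams_demo <- data.frame(word1 = "the", word2 = "cat", n = 2)

# Offline step: write the object to disk.
path <- file.path(tempdir(), "bigrams_demo.RData")
save(bigrams_demo, file = path)

# Simulate a fresh app session, then restore the object.
rm(bigrams_demo)
loaded <- load(path)  # load() recreates the object under its saved name

loaded  # "bigrams_demo"
```

Because load() restores objects under the names they were saved with, the app code can refer to bigrams and trigrams directly after loading.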

Nondeterminism

The algorithm can be made non-deterministic. If we increase the randomness value on the left side, the next word is chosen at random from the top X candidate words. The ile argument sets the size of that window.

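# Return the top `ile` candidate next words (more if counts tie),
# given the previous words.
predict_next_word <- function(previous_words, ile) {
  words <- str_split(previous_words, " ")[[1]]
  words <- tail(words, 2)  # only the last two words are used as context
  n_words <- length(words)
  if (n_words == 1) {
    # One word of context: most frequent bigram continuations.
    # `n` below is the count column created by count().
    prediction <- bigrams %>%
      filter(word1 == words[1]) %>%
      top_n(ile, wt = n) %>%
      pull(word2)
  } else {
    # Two words of context: most frequent trigram continuations.
    prediction <- trigrams %>%
      filter(word1 == words[1], word2 == words[2]) %>%
      top_n(ile, wt = n) %>%
      pull(word3)
  }
  return(prediction)
}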
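The effect of the window size can be seen in isolation. The sketch below is base R only; the candidate list and the helper pick_from_top are hypothetical and not part of the app. Unlike top_n(), head() does not keep extra rows when counts tie.

```r
# Hypothetical candidates for one context, most frequent first.
candidates <- c("cat", "dog", "bird", "fish")

# Draw one word at random from the top `ile` candidates.
pick_from_top <- function(candidates, ile) {
  window <- head(candidates, ile)
  # Sample an index so a window of length 1 is handled safely.
  window[sample.int(length(window), 1)]
}

pick_from_top(candidates, 1)  # always "cat": deterministic
pick_from_top(candidates, 3)  # one of "cat", "dog", "bird"
```

With a window of 1 the output is fully deterministic. A larger window trades accuracy of the single best guess for more varied text.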