Word Prediction using N-grams

2023-02-04

Predictive Text App

Hello! My predictive text app is a simple R Shiny app: the user types in some text, and the predicted next word appears below the input. The app is reactive, so the prediction updates as you type.

Below is an example of what the server function looks like.

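server <- function(input, output) {
  # Load ngrams() and the n-gram lookup functions
  source("predict.R")
  # Re-run the prediction whenever the input text changes
  dataInput <- reactive({
    ngrams(input$caption)
  })
  # Render the predicted word as text
  output$value <- renderText({
    dataInput()
  })
}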

The UI

[Image: screenshot of the app. Source: text predictor]

This is what the app looks like. The user types their text into the input box, and the next predicted word appears below it. In the example, the input is “I love you so” and the predicted word is “much”!
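A layout like this could be sketched roughly as follows. Only the IDs `caption` and `value` come from the server function; the title, the label text, and the choice of `textInput` and `textOutput` are assumptions for illustration, not the app's actual UI code.

```r
library(shiny)

# Hypothetical UI: a text box with ID "caption" and an output slot with ID "value",
# matching input$caption and output$value in the server function.
ui <- fluidPage(
  titlePanel("Text Predictor"),                       # assumed title
  textInput("caption", "Enter your text here:", ""),  # assumed label
  h4("Next predicted word:"),
  textOutput("value")
)
```

Pairing this with the server function via `shinyApp(ui, server)` would give a runnable app, provided `predict.R` and the n-gram files are available.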

How it works

After cleaning and sampling the given datasets, I created files of bigrams, trigrams, and quadgrams. The prediction code reads from these files, so the whole cleaning process doesn’t have to run every time the app is opened.
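As a rough illustration of the "build once, load later" idea, here is a minimal base R sketch. The toy corpus, the helper `make_ngrams`, the column names, and the file name are all assumptions; the real files were built from the cleaned, sampled datasets and may be stored differently.

```r
# Toy corpus standing in for the cleaned, sampled text (assumption).
corpus <- c("how are you", "how are you today", "how are things", "i love you so much")

# Build every run of n consecutive words from one line of text.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "[[:space:]]+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Count how often each trigram occurs and sort by frequency.
trigram_list <- unlist(lapply(corpus, make_ngrams, n = 3))
counts <- table(trigram_list)
trigrams <- data.frame(
  ngram = names(counts),
  count = as.integer(counts),
  stringsAsFactors = FALSE
)
trigrams <- trigrams[order(-trigrams$count), ]

# Save once; the prediction code can then load the table instead of rebuilding it.
path <- file.path(tempdir(), "trigrams.rds")  # hypothetical file name
saveRDS(trigrams, path)
trigrams_loaded <- readRDS(path)
```

The same helper with `n = 2` or `n = 4` would produce the bigram and quadgram tables.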

To predict the next word, the user input is passed to a function that counts how many words it contains. Depending on the word count, the algorithm searches the bigram, trigram, or quadgram file and returns the match with the highest probability. For example, if the input is “how are”, the algorithm looks at the trigrams that start with “how are” and returns the third word of the most probable one. If you try this in the app, it will produce “you”!
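The lookup step for a two-word input might look something like the sketch below. The table layout (`prefix`, `third`, `prob`), the probability values, and the function name `predict_trigram` are hypothetical; the app's own `trigram()` function is not shown in these slides and may work differently.

```r
# Hypothetical trigram table: the two-word prefix, the third word, and a probability.
trigrams <- data.frame(
  prefix = c("how are", "how are", "how are", "i love"),
  third  = c("you", "things", "we", "you"),
  prob   = c(0.6, 0.3, 0.1, 0.9),
  stringsAsFactors = FALSE
)

# Return the most probable third word for a two-word input, or "?" if nothing matches.
predict_trigram <- function(input_words, ngram_table = trigrams) {
  key <- paste(input_words, collapse = " ")
  matches <- ngram_table[ngram_table$prefix == key, ]
  if (nrow(matches) == 0) return("?")
  matches$third[which.max(matches$prob)]
}

predict_trigram(c("how", "are"))           # "you"
predict_trigram(c("purple", "xylophone"))  # "?"
```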

predict.R

If the app does not recognize the input, it returns a question mark (?). If the input contains non-alphabetic characters such as digits or punctuation, the algorithm filters most of them out. It also converts the input to lowercase before matching.
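To make the cleaning step concrete, here is a small sketch that uses base R's `gsub` with the same character classes as the app's cleaning pattern. The helper name `clean_input` and the sample sentence are made up for illustration; the app itself uses stringr for this step.

```r
# Keep letters and whitespace only, then lowercase.
clean_input <- function(text) {
  text <- gsub("[^[:alpha:][:space:]]", "", text)  # drop digits, punctuation, symbols
  tolower(text)                                     # lowercase before matching
}

cleaned <- clean_input("I LOVE you 2 much!!!")
words <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
words          # "i" "love" "you" "much"
length(words)  # 4
```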

The next slide shows the ngrams function, which takes the user input and decides which n-gram function (bigram, trigram, or quadgram) to pass it to.

Ngram function

library(dplyr)    # tibble(), mutate(), %>%
library(stringr)  # str_replace_all(), str_count(), str_split(), boundary()

ngrams <- function(input) {
  # Put the input in a one-column data frame
  input <- tibble(text = input)
  # Clean the input: drop everything that is not a letter or whitespace
  replace_reg <- "[^[:alpha:][:space:]]+"
  input <- input %>%
    mutate(text = str_replace_all(text, replace_reg, ""))
  # Count the words, split them into a character vector, and lowercase them
  input_count <- str_count(input$text, boundary("word"))
  input_words <- unlist(str_split(input$text, boundary("word")))
  input_words <- tolower(input_words)
  # Call the matching function:
  # one word -> bigram, two words -> trigram, anything else -> quadgram
  out <- if (input_count == 1) {
    bigram(input_words)
  } else if (input_count == 2) {
    trigram(input_words)
  } else {
    quadgram(input_words)
  }
  # Output
  return(out)
}

Thank you

Thank you for viewing my presentation!