NLP - Last Word Prediction Project

July 2023

Last Word Prediction Project

The Natural Language Processing (NLP) Last Word Prediction Project is a first attempt at predicting the last word of an input sentence.

It starts by analysing a body of text, usually called the corpus, and looking at the variety of words, symbols, punctuation, and digits it contains.

Many R packages already provide ready-made functions and models for natural language processing (NLP). An overview of the available packages can be found in the CRAN Task View: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Key Points

Text analysis involves cleaning the text, tokenization, and calculating word frequencies in order to identify the text's vocabulary. The NGRAM technique is used to capture the context of the next word: for word prediction, each sentence is broken into groups of 1, 2, 3, ..., n consecutive words.

Cleaning text

- punctuation
- symbols
- digits
- stop-words
- bad-words
- language terms
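
As a rough illustration of the cleaning steps listed above, here is a minimal base-R sketch. The helper name `clean_text` and the tiny default stop-word list are assumptions made for this example, not the project's own code; a bad-word or language-term list could be passed in the same way as the stop-words.

```r
# Hypothetical helper: clean one or more text strings with base R only.
clean_text <- function(text, stop_words = c("the", "a", "an", "of", "and")) {
  text <- tolower(text)
  text <- gsub("[[:punct:]]+", " ", text)  # punctuation and most symbols
  text <- gsub("[[:digit:]]+", " ", text)  # digits
  text <- gsub("[^a-z ]+", " ", text)      # anything left that is not a-z or a space
  # Remove the listed words only when they appear as whole words
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  text <- gsub(pattern, " ", text, perl = TRUE)
  # Collapse repeated whitespace and trim both ends
  trimws(gsub("\\s+", " ", text))
}

clean_text("The 3 cats, and a dog!")  # "cats dog"
```

Note that the `[^a-z ]` step also strips accented letters, so this sketch only suits plain English text.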


Tokenization

- split the text strings into words
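
A minimal sketch of this step, assuming whitespace-separated words and a hypothetical helper name `tokenize`:

```r
# Hypothetical helper: split text into words on runs of whitespace.
tokenize <- function(text) {
  words <- unlist(strsplit(trimws(text), "\\s+"))
  words[nzchar(words)]  # drop any empty tokens
}

tokenize("  predict the last   word ")
# "predict" "the" "last" "word"
```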

Word Frequency

- counting words
- calculating the percent frequency
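
The two steps above can be sketched with `table()`. The word vector and the data frame name `word_freq` are made up for illustration.

```r
# Toy token vector standing in for a tokenized corpus
words <- c("the", "cat", "sat", "on", "the", "mat")

# Count each distinct word, most frequent first
counts <- sort(table(words), decreasing = TRUE)

# Share of all tokens, as a percentage
percent <- round(100 * counts / sum(counts), 1)

word_freq <- data.frame(
  word = names(counts),
  count = as.integer(counts),
  percent = as.numeric(percent)
)
word_freq
```

Here "the" appears 2 times out of 6 tokens, so its percent frequency is 33.3.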


NGRAMS

- combination of words in a sentence, such as:

    - bigrams
    - trigrams
    - ...
    - ngrams
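
For a concrete picture, bigrams and trigrams of a short sentence can be built by pasting shifted copies of the word vector together. This is only an illustrative sketch and assumes the sentence has at least three words.

```r
words <- c("predict", "the", "last", "word")
k <- length(words)

# Bigrams: each word paired with the word that follows it
bigrams <- paste(words[-k], words[-1])

# Trigrams: three consecutive words
trigrams <- paste(words[1:(k - 2)], words[2:(k - 1)], words[3:k])

bigrams   # "predict the" "the last" "last word"
trigrams  # "predict the last" "the last word"
```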

Functions

Two functions are enough to predict the last word of an input sentence. The main constraints are the computation time and the size of the frequency matrix.

Generate ngrams

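generate_ngrams <- function(sample, n) {
  # Split on whitespace; trimws() avoids an empty first token
  words <- unlist(strsplit(trimws(sample), "\\s+"))
  n_grams <- length(words) - n + 1
  # Fewer than n words: no ngrams (1:0 would otherwise count backwards)
  if (n_grams < 1) return(list())
  ngrams <- vector("list", n_grams)

  for (i in seq_len(n_grams)) {
    # Join the n consecutive words starting at position i
    ngrams[[i]] <- paste(words[i:(i + n - 1)], collapse = " ")
  }

  return(ngrams)
}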

Predict last word

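predict_last_word <- function(input) {
  input_words <- unlist(strsplit(trimws(input), "\\s+"))
  # The lookup key is the last two words, so at least two are needed
  if (length(input_words) < 2) return(NA_character_)
  context_plus <- paste(tail(input_words, 2), collapse = " ")

  # freq_mat_norm is a global frequency matrix: rows are two-word
  # contexts, columns are candidate words
  if (!context_plus %in% rownames(freq_mat_norm)) return(NA_character_)
  context_freq <- freq_mat_norm[context_plus, ]
  predicted_word <- names(which.max(context_freq))

  return(predicted_word)
}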
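
The function above relies on a `freq_mat_norm` object whose construction is not shown here. One plausible way to build it, assuming rows are two-word contexts, columns are the words that follow them, and each row is normalised to sum to 1, is sketched below. The toy corpus and the intermediate names (`contexts`, `nexts`, `freq_mat`) are assumptions for this example only.

```r
# Toy corpus standing in for the cleaned text sample
corpus <- c("the cat sat on the mat",
            "the cat sat on the sofa",
            "the cat ran away")

contexts <- character(0)  # two-word contexts
nexts <- character(0)     # the word following each context

for (s in corpus) {
  w <- unlist(strsplit(s, "\\s+"))
  if (length(w) < 3) next  # no trigram can be formed
  idx <- seq_len(length(w) - 2)
  contexts <- c(contexts, paste(w[idx], w[idx + 1]))
  nexts <- c(nexts, w[idx + 2])
}

# Count how often each word follows each context
freq_mat <- unclass(table(contexts, nexts))

# Normalise each row so it holds relative frequencies
freq_mat_norm <- freq_mat / rowSums(freq_mat)

names(which.max(freq_mat_norm["the cat", ]))  # "sat"
```

In this toy corpus "the cat" is followed by "sat" twice and "ran" once, so "sat" gets the highest relative frequency. A dense matrix like this grows quickly with vocabulary size, which is consistent with the size constraint mentioned earlier.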

The Shiny App

To make the prediction available to users, a Shiny app was built. The app lets the user enter a sentence and click a button to get the predicted word.

Disclaimer: the app uses only a 1% random sample of the text from one specific source.

In the ui - the user interface of the shiny app:

  textInput("text_input", "Enter a sentence:"),
  actionButton("predict_button", "Predict Next Word")

In the server - the engine of the shiny app:

    observeEvent(input$predict_button, {
      input_string <- input$text_input
      predicted_word <- predict_last_word(input_string) 
      output$prediction_output <- renderText(predicted_word)
    })
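
Putting the two fragments together, a minimal self-contained app might look like the sketch below. Several parts are assumptions: the `textOutput("prediction_output")` widget is inferred from the output name used in the server code, the `predict_last_word` defined here is a toy stand-in so the sketch runs without the frequency matrix, and `eventReactive` is used as an alternative to defining the output inside `observeEvent`.

```r
library(shiny)

# Toy stand-in for the project's predictor so this sketch runs on its own;
# the real function looks the context up in freq_mat_norm.
predict_last_word <- function(input) {
  words <- unlist(strsplit(trimws(input), "\\s+"))
  if (length(words) == 0) return(NA_character_)
  tail(words, 1)  # placeholder: simply echoes the last word typed
}

ui <- fluidPage(
  textInput("text_input", "Enter a sentence:"),
  actionButton("predict_button", "Predict Next Word"),
  textOutput("prediction_output")  # assumed output widget
)

server <- function(input, output, session) {
  # Recompute the prediction only when the button is clicked
  prediction <- eventReactive(input$predict_button, {
    predict_last_word(input$text_input)
  })
  output$prediction_output <- renderText(prediction())
}

app <- shinyApp(ui, server)  # launch with runApp(app)
```

Keeping the render call outside the event handler means the output is registered once, and the button click only triggers a new value.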
  

https://bit.ly/NLP-Word-Prediction