July 2023
It starts with analysing some text, usually called a corpus, and looking at the variety of words, symbols, punctuation, and digits found inside it.
In the literature, many R packages already provide pre-made functions and models to perform natural language processing (NLP). An overview of the available packages can be found here: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html.
Text analysis involves cleaning the text, tokenization, and calculation of word frequencies in order to identify the text's vocabulary. The n-gram technique is then used to capture the context of a word: for word prediction, it builds combinations of 1, 2, 3, ..., n consecutive words out of a sentence. A minimal sketch of these steps in R follows the lists below.

Cleaning means removing:
- punctuation
- symbols
- digits
- stop-words
- bad-words
- language terms

Tokenization and frequency calculation mean:
- splitting the text strings into words
- counting the words
- calculating their percent frequency

Context is captured with combinations of words in a sentence, such as:
- bigrams
- trigrams
- ...
- n-grams
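As an illustration, here is a minimal sketch of these steps in base R. The sample_text object and the stop-word list are placeholders, and a real project would more likely rely on dedicated packages such as tm, quanteda, or tidytext.

# a tiny placeholder corpus; in practice the text would be read from files
sample_text <- c("The quick brown fox jumps over the lazy dog.",
                 "The quick brown fox likes the lazy dog!")

# cleaning: lower-case, then drop punctuation, symbols and digits
clean <- tolower(sample_text)
clean <- gsub("[[:punct:][:digit:]]+", " ", clean)

# tokenization: split the strings into words and drop a few stop-words
tokens <- unlist(strsplit(clean, "\\s+"))
stop_words <- c("a", "an")   # placeholder list; a real stop-word list is much longer
tokens <- tokens[!(tokens %in% stop_words) & tokens != ""]

# word frequency and percent frequency over all tokens
word_freq <- sort(table(tokens), decreasing = TRUE)
word_pct  <- 100 * word_freq / sum(word_freq)

# n-grams: groups of n consecutive words (here bigrams and trigrams;
# for simplicity this sketch lets n-grams cross sentence boundaries)
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
bigrams  <- table(make_ngrams(tokens, 2))
trigrams <- table(make_ngrams(tokens, 3))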
With two functions, the prediction of the last word of an input sentence is easily done. The only constraints are the computation time and the size of the frequency matrix. The code indexes a normalized frequency matrix, freq_mat_norm, whose rows are two-word contexts and whose columns are candidate next words.
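That matrix is not shown here; the following is one possible sketch of how it could be built from the trigram counts of the example above, with each row normalized by its total so the values are relative frequencies.

# assumption: build freq_mat_norm from the toy trigram counts sketched earlier
trigram_counts <- as.data.frame(trigrams, stringsAsFactors = FALSE)
names(trigram_counts) <- c("trigram", "n")

parts <- strsplit(trigram_counts$trigram, " ")
trigram_counts$context   <- sapply(parts, function(w) paste(w[1], w[2]))
trigram_counts$next_word <- sapply(parts, function(w) w[3])

# context-by-next-word count matrix, then each row normalized to sum to 1
freq_mat <- xtabs(n ~ context + next_word, data = trigram_counts)
freq_mat_norm <- prop.table(freq_mat, margin = 1)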
predict_last_word <- function(input) {
  # split the input sentence into words
  input_words <- unlist(strsplit(input, "\\s+"))
  # take the last two words as the prediction context
  context <- tail(input_words, 2)
  context_plus <- paste(context[1], context[2])
  # guard: return NA for a context that is not in the frequency matrix
  if (!(context_plus %in% rownames(freq_mat_norm))) return(NA_character_)
  # select the row of freq_mat_norm matching the context
  context_freq <- freq_mat_norm[context_plus, ]
  # the predicted word is the column with the highest normalized frequency
  predicted_word <- names(which.max(context_freq))
  return(predicted_word)
}
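For illustration only: with the toy freq_mat_norm sketched above, a call like the one below would return "fox"; with a real corpus the result depends entirely on the frequency matrix.

predict_last_word("The quick brown")   # "fox" for the toy matrix above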
To make the prediction of the last word available to the user, a Shiny app is built. The app allows the user to input a sentence.

In the ui - the user interface of the Shiny app - a text input and an action button are defined:
textInput("text_input", "Enter a sentence:"),
actionButton("predict_button", "Predict Next Word")
In the server - the engine of the Shiny app - the button click triggers the prediction:
observeEvent(input$predict_button, {
  # when the button is clicked, run the prediction and display the result
  input_string <- input$text_input
  predicted_word <- predict_last_word(input_string)
  output$prediction_output <- renderText(predicted_word)
})
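Putting the pieces together, a minimal self-contained version of the app could look like the sketch below. The input and output ids are taken from the snippets above; the page layout and titles are assumptions, not the published app.

library(shiny)

ui <- fluidPage(
  titlePanel("NLP Word Prediction"),                    # assumed title
  textInput("text_input", "Enter a sentence:"),
  actionButton("predict_button", "Predict Next Word"),
  textOutput("prediction_output")                       # shows the predicted word
)

server <- function(input, output) {
  observeEvent(input$predict_button, {
    input_string <- input$text_input
    predicted_word <- predict_last_word(input_string)
    output$prediction_output <- renderText(predicted_word)
  })
}

shinyApp(ui = ui, server = server)

Because the heavy work of building freq_mat_norm happens before the app starts, the observer only performs a fast matrix lookup when the button is clicked.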
https://bit.ly/NLP-Word-Prediction