Building a next word model

2023-09-14

About the model

The objective is to predict the next word of an unfinished sentence. Text data from Twitter, blogs and news were cleaned and used to build an n-gram language model, which relies on how frequently words appear together.

From the 3 txt files, I created “sampled files”, and with an n-gram model I built a data frame with 3 columns, X1/X2/Y (the most frequent 3-word associations), with Y being the word predicted from the 2 others. Then I fitted a naive Bayes classifier (naiveBayes), which was a good way to build a “light” model.
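The code that produced the X1/X2/Y data frame is not shown, so here is a minimal base R sketch of the idea, not the original implementation. The toy corpus and the names `make_trigrams` and `trigram_counts` are made up for illustration.

```r
# Toy corpus standing in for the sampled text (hypothetical data)
sample_text <- c("i love new york", "i love new shoes", "we love new york")

# Split each line into lowercase words
tokens <- strsplit(tolower(sample_text), "[^a-z']+")

# Collect every run of three consecutive words within a line
make_trigrams <- function(words) {
  words <- words[words != ""]
  n <- length(words)
  if (n < 3) return(NULL)
  data.frame(X1 = words[1:(n - 2)],
             X2 = words[2:(n - 1)],
             Y  = words[3:n],
             stringsAsFactors = FALSE)
}
trigrams <- do.call(rbind, lapply(tokens, make_trigrams))

# Count identical trigrams and sort from most to least frequent
trigrams$n <- 1
trigram_counts <- aggregate(n ~ X1 + X2 + Y, data = trigrams, FUN = sum)
trigram_counts <- trigram_counts[order(-trigram_counts$n), ]
```

Keeping only the top rows of such a table is one way to obtain the “most frequent 3-word associations” used for training.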

Some code from the model

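library(e1071)     # naiveBayes()
library(tm)        # VCorpus(), TermDocumentMatrix()
library(magrittr)  # %>%

# Fit the naive Bayes model: predict Y from the two preceding words
tri_naiveBayes <- 
  naiveBayes( Y ~ X1 + X2 ,
              df_trigram2)

# Build a corpus from the sampled text (DataframeSource needs doc_id and text)
sample_data_df <- data.frame(doc_id = 1:500, text = sample_data)
vc <- sample_data_df %>%
  DataframeSource() %>%
  VCorpus %>%
  tm_map( stripWhitespace )

# Term-document matrix of single words
tdm_unigram <- vc %>%
  TermDocumentMatrix( control = list( removePunctuation = TRUE,
                                      removeNumbers = TRUE,
                                      wordLengths = c( 1, Inf) )
  )

# Save the vocabulary so the app can encode its input with the same levels
unigram_levels <- unique(tdm_unigram$dimnames$Terms)
save(unigram_levels, file="unigram_level.RData")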

Challenges encountered

The main challenge was dealing with the size of the data. My computer only has 3 GB of RAM, so working with the complete data set was impossible. I therefore had to reduce the size of the training data, which reduced the performance of the model.
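Reducing the training data can be done by random sampling. The sketch below is only an illustration under assumptions: `all_lines` is a hypothetical stand-in for the lines of one text file (in practice they would come from something like `readLines()`), and the sample size of 500 simply mirrors the `doc_id = 1:500` seen in the model code.

```r
set.seed(123)  # fixed seed so the sample is reproducible

# Hypothetical stand-in for the lines of one of the large text files
all_lines <- paste("line", 1:10000)

# Keep a fixed number of randomly chosen lines (without replacement)
sample_size <- 500
sample_data <- all_lines[sample(length(all_lines), sample_size)]
```

A smaller sample lowers memory use at the cost of fewer observed word combinations, which is the trade-off described above.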

Moreover, I didn’t use anything longer than 3-grams. So even if the user enters a long sentence, only its tail (the last two words) is analyzed by the model.
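To make the “only the tail is analyzed” point concrete, here is a small base R sketch that extracts the last two words of a phrase. The helper name `last_two_words` is hypothetical and not part of the app. It lowercases the input on the assumption that the training vocabulary was lowercased, and it pads with NA when fewer than two words are available.

```r
# Return the last two words of a phrase, lowercased; NA where a word is missing
last_two_words <- function(phrase) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[words != ""]
  # Pad on the left so that tail() always has two elements to return
  padded <- c(NA_character_, NA_character_, words)
  tail(padded, 2)
}

last_two_words("I would like to go to the")
```

For this input the helper returns "to" and "the", so everything before those two words has no influence on the prediction.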

Finally, I also had to reduce the size of the training data to deploy the app on Shiny.

How does the app work?

Using the app is very simple. There is one input (a character string) and one output with a possible next word (as with SwiftKey). The user validates the input with the “predict” button. The input and output are event-reactive, so the variable passed to the final model function in the app is updated when the input is validated.

The app uses as little code as possible: tri_naiveBayes and unigram_levels (both produced by the model code) are loaded, and the input is passed to tri_naiveBayes:

output$next_word <- renderText({
    # Split the validated input into words
    test_string <- as.character(input_phrase())
    test_split <- strsplit(test_string, split = " " )
    # Encode the words with the same levels as the training data
    test_factor <- factor(unlist(test_split), levels=unigram_levels)
    # Keep only the last two words as predictors X1 and X2
    test_df <- data.frame(X1 = test_factor[length(test_factor)-1], X2 = test_factor[length(test_factor)])
    predicted<-predict(tri_naiveBayes,test_df)
    predicted<- as.character(predicted)
    predicted
})