Claudio Torres Casanelli
15/09/2022
This work is a specially designed application that predicts the next word as you write in English. It shows you which words most commonly come after the text you are writing. The present application works for the English language, but it could work for any language you like; I have tested it in Spanish with wonderful results.
In order to build this app, I had to work through three main stages: sampling the raw data, building the n-gram models, and writing the prediction algorithm.
In order to keep the application light and performant on shinyapps.io, I had to take a sample of the data, consisting of 2% of the total available. In local tests I have been able to work with 25% of the data, but the initial calculation takes longer.
# Draw a 0/1 inclusion flag for each line, keeping roughly 2% of each source
temp_blogs_rbinom <- rbinom(length(temp_blogs), 1, 0.02)
temp_news_rbinom <- rbinom(length(temp_news), 1, 0.02)
temp_twitter_rbinom <- rbinom(length(temp_twitter), 1, 0.02)
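The flags are then used to subset each source. The original subsetting code is not shown; a minimal sketch, assuming temp_blogs, temp_news, and temp_twitter hold the raw lines read from the corpus files, might look like this:

# Hypothetical subsetting step: keep only the flagged lines, in data frames
# matching the df_*_train names used below
df_blogs_train <- data.frame(temp_blogs = temp_blogs[temp_blogs_rbinom == 1],
                             stringsAsFactors = FALSE)
df_news_train <- data.frame(temp_news = temp_news[temp_news_rbinom == 1],
                            stringsAsFactors = FALSE)
df_twitter_train <- data.frame(temp_twitter = temp_twitter[temp_twitter_rbinom == 1],
                               stringsAsFactors = FALSE)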
The sampled data is then stored in a single file so the next step can be performed easily. It is important to notice that these calculations have to be executed only ONCE: once you have your sample data, it is not necessary to run them again. Even so, for the application's purposes all of these calculations were kept light.
# Write the selected sample lines to a single training file; the connection is
# opened explicitly so the three writeLines() calls append instead of each
# reopening and truncating the file
file_train <- file("Coursera-SwiftKey/final/en_US/en_US_train.txt", open = "wt")
writeLines(df_blogs_train$temp_blogs, file_train)
writeLines(df_news_train$temp_news, file_train)
writeLines(df_twitter_train$temp_twitter, file_train)
close(file_train)
After some study and testing, I decided to use quanteda, the R library that helped me compute the tokens from the sample data. Its tokens() function generates the tokens, and then a user-defined function produces separate files containing the detailed 1-gram to 5-gram data.
library(readtext)
library(quanteda)

# Read the sample file back and tokenize it, dropping numbers, punctuation,
# symbols, separators, and URLs
temp_train <- readtext("Coursera-SwiftKey/final/en_US/en_US_train.txt", encoding = "UTF-8")
corpus <- corpus(temp_train)
token <- tokens(corpus, remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE,
                remove_separators = TRUE, remove_url = TRUE, include_docvars = TRUE, verbose = TRUE)

# Build the 1-gram to 5-gram models
ng1_model <- calc_model(token, 1)
ng2_model <- calc_model(token, 2)
ng3_model <- calc_model(token, 3)
ng4_model <- calc_model(token, 4)
ng5_model <- calc_model(token, 5)
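The calc_model function is the user-defined part mentioned above and is not reproduced here. A minimal sketch of what it could look like, assuming each model table holds the n-gram text together with its count and relative frequency, is:

# Hypothetical version of calc_model: count n-grams of order n and return a
# frequency table sorted by probability (the real function also writes each
# table to its own file)
calc_model <- function(token, n) {
  ngrams <- tokens_ngrams(token, n = n, concatenator = " ")
  freq <- sort(colSums(dfm(ngrams)), decreasing = TRUE)
  data.frame(ngram = names(freq),
             count = as.integer(freq),
             prob = as.numeric(freq / sum(freq)))
}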
The prediction algorithm was developed directly in the Shiny app and was designed to perform as quickly as possible, with a very practical approach.
The algorithm starts with the 5-gram and searches a table called ng5_model that contains all the 5-grams. If the 5-gram is present in the sample, it is returned as a result. If there is NO matching 5-gram in the sample, the algorithm backs off to the 4-gram.
The algorithm repeats the same step on the 4-gram, then the 3-gram, 2-gram, and 1-gram until it finds a result.
It is all encapsulated in a function called “predict”:
predict <- function(ng1_model, ng2_model, ng3_model, ng4_model, ng5_model, input)
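The function body is not shown above. A minimal sketch of the backoff just described, assuming each ngX_model table has the ngram and prob columns from the calc_model sketch, could be:

# Hypothetical backoff implementation: try the highest-order model first,
# then fall back to shorter contexts until something matches
predict <- function(ng1_model, ng2_model, ng3_model, ng4_model, ng5_model, input) {
  words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  models <- list(ng5_model, ng4_model, ng3_model, ng2_model)
  orders <- c(5, 4, 3, 2)
  for (i in seq_along(models)) {
    n <- orders[i]
    if (length(words) < n - 1) next  # not enough context for this order
    context <- paste(tail(words, n - 1), collapse = " ")
    hits <- models[[i]][startsWith(models[[i]]$ngram, paste0(context, " ")), ]
    if (nrow(hits) > 0) return(head(hits[order(-hits$prob), ], 3))
  }
  head(ng1_model[order(-ng1_model$prob), ], 3)  # no match: most frequent unigrams
}

Within each order, the candidates are ranked by their probability, which is what the application displays next to each prediction.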
You can see the application working here: https://claudiotorres.shinyapps.io/NWPrediction/
It is very easy to use. Just wait for the graphics showing the most frequent n-grams to finish loading, then start writing in the text box. The application keeps updating the results table as you type.
The results show the predicted next word along with its probability.