Word Prediction App

Michael Garcia

User Interface

The word prediction app summarizes the entire text data set and shows the top n-grams that match the user's input.

user interface

Application Features

Input and Prediction

prediction

A text field accepts input and predicts the next word or n-gram. The output updates dynamically and lists the top predictions. The higher the value, the more probable the outcome.

Visualization and Stats

The application initially displays the top n-grams by n-gram size, a histogram, and a word cloud for the entire data set.

stats

histogram

wordcloud

Word Prediction Application Workflow

The application runs several methods to prepare the data or to load previously saved class objects.
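The "prepare or load" step can be illustrated with a small caching helper. This is a minimal Python sketch, not the app's own code: the function name `load_or_build`, the `build` callback, and the use of `pickle` as the storage format are all assumptions made for illustration.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the saved object if the cache file exists; otherwise build and save it.

    `cache_path` and `build` are hypothetical names: `build` stands in for
    whatever expensive preparation step produces the object.
    """
    path = Path(cache_path)
    if path.exists():
        # A previous run already saved the object, so skip the preparation.
        with path.open("rb") as fh:
            return pickle.load(fh)
    obj = build()
    with path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj
```

With this pattern, the expensive preparation runs only on the first start; later starts read the saved object from disk.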

workflow

Word Prediction Algorithm

  • The genCorpus method takes an argument that lists the text files to use
  • It uses the Text Mining (tm) VCorpus and tm_map methods to prepare the corpus text
  • The ngramGenerator method accepts the corpus as an argument and creates a matrix using the TermDocumentMatrix method
    • The matrix is generated from the corpus object, and the n-grams are created recursively along with the tokens
    • The n-grams are generated using the ngrams method from the NLP library
    • The ngrams function takes 2 arguments: the vector of words and the integer n that sets the n-gram length
    • The result is a list of n-gram values, which the TermDocumentMatrix function uses with the tokens to create a table
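The steps above can be sketched in a few lines of Python. This is an illustration of the idea, not the app's R code: the function names `clean`, `ngrams`, and `ngram_table` are hypothetical, and the cleaning rules (lowercasing, removing everything except letters and apostrophes) are assumptions, since the exact tm_map transformations are not listed here.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text, replace non-letter characters with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)  # assumed cleaning rule
    return text.split()


def ngrams(words, n):
    """Return every contiguous run of n words as a tuple (two arguments: words and n)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_table(documents, max_n):
    """Count all n-grams of size 1..max_n across the documents."""
    table = {n: Counter() for n in range(1, max_n + 1)}
    for doc in documents:
        words = clean(doc)
        for n in range(1, max_n + 1):
            table[n].update(ngrams(words, n))
    return table
```

For example, `ngram_table(["The cat sat.", "The cat ran!"], 2)` counts the bigram `("the", "cat")` twice and `("cat", "sat")` once.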

Ngrams

N-grams work as a probability model:

eq

An n-gram model uses the combination of the previous N words to help predict the terms that follow, from N+1 onward. The simplest form is a probabilistic model based on maximum likelihood estimation.

word_prob
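In standard notation, and keeping the convention that the previous N words predict term N+1, the maximum likelihood estimate can be written as follows. The symbols are assumptions introduced here: w_i is the i-th word and C(·) is the number of times a word sequence occurs in the training text.

```latex
\hat{P}(w_{N+1} \mid w_1 \dots w_N) = \frac{C(w_1 \dots w_N \, w_{N+1})}{C(w_1 \dots w_N)}
```

As a hypothetical example, if "on the" occurs 10 times in the data and "on the mat" occurs 4 times, the estimated probability of "mat" after "on the" is 4/10 = 0.4.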

Depending on the extent of the n-grams computed, tokens are created for word histories of various lengths, looking backward through the text, to predict the word that would occur next.

So we are looking at the model's probability estimate, which is based on how frequently words occur in combination with other words. The tokens are the combinations of a few (N) words, together with the frequency of their occurrence.

It is important to note that the algorithm is only as good as the data used for training or estimation.
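A frequency-based prediction of this kind might look like the Python sketch below. It is a sketch under stated assumptions, not the app's implementation: `build_model` and `predict` are hypothetical names, and the rule of trying the longest history first and then falling back to shorter ones is one simple way to use "various backward text"; the app may combine n-gram sizes differently.

```python
from collections import Counter, defaultdict


def build_model(words, max_n):
    """Map each history of 1..max_n-1 words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            history = tuple(words[i:i + n - 1])
            model[history][words[i + n - 1]] += 1
    return model


def predict(model, text, top=3, max_n=3):
    """Return up to `top` (word, probability) pairs for the word that follows `text`."""
    words = text.lower().split()
    # Try the longest usable history first, then fall back to shorter ones (assumed rule).
    for size in range(min(max_n - 1, len(words)), 0, -1):
        followers = model.get(tuple(words[-size:]))
        if followers:
            total = sum(followers.values())
            # Relative frequency: count of (history + word) / count of history followers.
            return [(w, c / total) for w, c in followers.most_common(top)]
    return []
```

Trained on the words of "the cat sat on the mat the cat ran", `predict(model, "the")` ranks "cat" first with probability 2/3, because "cat" follows "the" in two of its three occurrences.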

Citations and Reference

[Fellows (2018); Gagolewski (2022); Wickham (2016); Feinerer and Hornik (2023); Neuwirth (2022); Wickham et al. (2023); Hornik, Meyer, and Buchta (2022); Dowle and Srinivasan (2023); Calin.Uioreanu:https://calin.shinyapps.io/predict_next_word; Hornik (2020)]

Dowle, Matt, and Arun Srinivasan. 2023. “Data.table: Extension of ‘Data.frame’.” https://CRAN.R-project.org/package=data.table.
Feinerer, Ingo, and Kurt Hornik. 2023. “Tm: Text Mining Package.” https://CRAN.R-project.org/package=tm.
Fellows, Ian. 2018. “Wordcloud: Word Clouds.” https://CRAN.R-project.org/package=wordcloud.
Gagolewski, Marek. 2022. “Stringi: Fast and Portable Character String Processing in R.” Journal of Statistical Software 103. https://doi.org/10.18637/jss.v103.i02.
Hornik, Kurt. 2020. “NLP: Natural Language Processing Infrastructure.” https://CRAN.R-project.org/package=NLP.
Hornik, Kurt, David Meyer, and Christian Buchta. 2022. “Slam: Sparse Lightweight Arrays and Matrices.” https://CRAN.R-project.org/package=slam.
Neuwirth, Erich. 2022. “RColorBrewer: ColorBrewer Palettes.” https://CRAN.R-project.org/package=RColorBrewer.
Wickham, Hadley. 2016. “Ggplot2: Elegant Graphics for Data Analysis.” https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. “Dplyr: A Grammar of Data Manipulation.” https://CRAN.R-project.org/package=dplyr.