Text Prediction Using R

Rahul Garg
July 25, 2020

Algorithm Used

The algorithm consists of the following steps -

  1. Preprocess Data - This consists of a) downloading the data, b) loading it into R, and c) removing punctuation, numbers and extra whitespace, and converting all letters to lower case.
  2. Generate N-grams - The tidytext package is used for generating n-grams. The n-grams (1-grams, 2-grams and 3-grams) are stored in separate data frames for easier processing.
  3. Prediction Model - Katz's back-off model is used for predicting the next word from the n-grams.
  4. Shiny App - A Shiny app is developed and deployed for easy access, along with the requisite documentation, plots and examples for a better user experience.
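Steps 1 and 2 above can be sketched in a few lines. The app itself is written in R with the tidytext package; the Python below is only an illustrative sketch of the same idea, and the function names (preprocess, ngram_counts), the cleaning rule and the toy corpus are assumptions made for this example, not the author's code.

```python
import re
from collections import Counter

def preprocess(text):
    """Lower-case the text, replace anything that is not a letter with a
    space, and split on whitespace (which also collapses repeated spaces)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drops punctuation and digits
    return text.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "The cat sat on the mat. The cat ate 2 fish!"  # made-up toy corpus
tokens = preprocess(corpus)

# One table per n-gram order, mirroring the separate data frames in the app.
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

Keeping one count table per order makes the later look-up simple: the prediction step only needs to ask which words followed a given one-word or two-word prefix.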

Note - While 4-grams and 5-grams may provide better accuracy, the present application is limited to 3-grams because of constraints on computing power.
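The back-off idea in step 3, capped at 3-grams as noted, can be illustrated as follows. This is a simplified Katz-style sketch in Python, not the app's R code: it uses a fixed absolute discount of 0.5 in place of the Good-Turing discounts of Katz's original formulation, and the names build_model and katz_probs, the discount value and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_model(tokens, max_n=3):
    """counts[n][context] is a Counter of the words that followed `context`,
    where `context` is a tuple of n - 1 words (empty for unigrams)."""
    counts = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            counts[n][tuple(context)][word] += 1
    return counts

def katz_probs(counts, context, discount=0.5):
    """Return {word: probability} for the word following `context`."""
    context = tuple(context)
    n = len(context) + 1
    seen = counts[n].get(context, Counter())
    total = sum(seen.values())
    if n == 1:
        # Unigram level: plain relative frequencies.
        return {w: c / total for w, c in seen.items()}
    lower = katz_probs(counts, context[1:], discount)
    if not seen:
        # Context never observed: back off entirely to the shorter context.
        return lower
    unseen = {w: p for w, p in lower.items() if w not in seen}
    if not unseen:
        # Nothing to back off to, so no mass needs to be held back.
        return {w: c / total for w, c in seen.items()}
    # Discount observed n-grams and hand the freed mass to unseen words,
    # in proportion to their lower-order probabilities.
    probs = {w: (c - discount) / total for w, c in seen.items()}
    leftover = discount * len(seen) / total
    unseen_mass = sum(unseen.values())
    for w, p in unseen.items():
        probs[w] = leftover * p / unseen_mass
    return probs

tokens = "the cat sat on the mat the cat ate fish".split()  # toy corpus
counts = build_model(tokens)
probs = katz_probs(counts, ("the", "cat"))
ranked = sorted(probs.items(), key=lambda wp: wp[1], reverse=True)
```

Words seen after the full context keep most of their probability, while the discounted mass lets words known only from shorter contexts still appear further down the ranking.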

The "Ultimate" App

Input - The app takes a text input of any length.
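One plausible way to handle input of any length with a 3-gram model is to keep only the last two words as the look-up context. The helper below is a hypothetical Python sketch of that idea (the name context_for and its max_n parameter are assumptions), not the app's actual R code.

```python
def context_for(text, max_n=3):
    """Return the last (max_n - 1) words of `text`, lower-cased, as a tuple.
    Shorter inputs simply yield a shorter (possibly empty) context."""
    words = text.lower().split()
    keep = max_n - 1
    return tuple(words[-keep:]) if keep > 0 else ()
```

An empty context would fall through to the most frequent single words, and a one-word context to the 2-gram table.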

Output - The application produces two outputs -

  1. The Next Word - The top ten predictions are displayed in decreasing order of probability. Two sets of predictions, one with stopwords and one with stopwords removed, are presented in tabular form.
  2. Probability Plots - The probabilities of the ten predicted words are shown in a bar chart, with their values displayed above the bars.
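The two outputs can be mimicked with a small Python sketch. It assumes the stopword-free list is obtained by filtering candidates after prediction; the app may instead build separate n-gram tables from stopword-free text. The stopword list, the probabilities and the names top_predictions and bar_lines are made up for illustration.

```python
STOPWORDS = {"the", "a", "of", "to", "and", "in"}  # tiny illustrative list

def top_predictions(probs, k=10, drop_stopwords=False):
    """Return up to k (word, probability) pairs, most probable first."""
    items = [(w, p) for w, p in probs.items()
             if not (drop_stopwords and w in STOPWORDS)]
    return sorted(items, key=lambda wp: wp[1], reverse=True)[:k]

def bar_lines(rows, width=40):
    """Render each (word, probability) pair as a text bar with its value."""
    return [f"{w:<6}{'#' * round(p * width)} {p:.2f}" for w, p in rows]

# Made-up probabilities standing in for the model's output.
probs = {"the": 0.30, "cat": 0.25, "of": 0.20, "mat": 0.15, "fish": 0.10}

with_stop = top_predictions(probs)
without_stop = top_predictions(probs, drop_stopwords=True)

for line in bar_lines(without_stop):
    print(line)
```

Probabilities are not renormalized after filtering here, so the two tables stay directly comparable.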

Process and Examples - To help users understand the app, “Process” and “Examples” tabs are provided.

Sample Output

[Figure: sample output plot]

Analysis and Future Prospects

Thus, the application offers great opportunities for faster and easier text mining.

  • The final size of the app is only 23.4 MB, as reported by the Shiny website upload. Even this size is due to the plots used; for a pure text prediction application, it can be reduced further.
  • Examples added to the app interface show that accuracy varies with and without stopwords, and each approach has pros and cons in certain cases. Striking a balance between the two is a subject of potential interest.
  • With additional computational power, the use of 4-grams and 5-grams could considerably raise the accuracy of predictions.

Acknowledgements: Thanks to the faculty, the Coursera team and peers, who together made this development a learning experience.