Predicto: A Shiny App for Text Prediction

Abstract

The value of text-based information continues to increase with the growth of social media, and it is becoming increasingly difficult to analyze large corpora of text to discover structure within them. This project uses a corpus of text to build a predictive model that displays up to three predictions based on user input. The project is a Shiny app named Predicto.

Keywords: natural language processing, Shiny app, predictive model, text mining, n-gram, back-off

Introduction

Predicto is a Shiny app that uses an n-gram algorithm to predict the next word based on text entered by a user. An n-gram is a contiguous sequence of n items from a given sample of text or speech.
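
For illustration, n-grams can be extracted with the quanteda package (which this project also uses); the phrase below is only an example:

    # Minimal illustration: extracting trigrams from a toy phrase with quanteda
    library(quanteda)
    toks <- tokens("wish you a happy new year")
    tokens_ngrams(toks, n = 3, concatenator = " ")
    # "wish you a"  "you a happy"  "a happy new"  "happy new year"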

The prediction model was built by extracting n-grams from SwiftKey's 'Tweets Blogs News' dataset available on Kaggle. Various natural language processing (NLP) and text-mining techniques were explored to build an accurate and optimized model. The model was built from a sample of about 800,000 lines extracted from the corpus of blog, news and Twitter data.

The predicted word is chosen using the longest, most frequent matching n-gram. Predictions are shown once the app detects that the user has finished typing one or more words; the app may take a few seconds to load the results. A slider can be used to select up to three predictions, with the most likely prediction shown first, followed by the second and third most likely.

Proposed System

The proposed system is a Shiny app. Shiny is an R package that makes it easy to build interactive web apps straight from R. The interface of the app consists of two tabs: the first is the page for input and output, and the second provides basic information about the app. The methodology and code are discussed below.

Methodology

The dataset consists of 'Tweets Blogs News' files in 4 different languages; this project uses the English one. The dataset was first downloaded and unzipped. It consisted of 3 files, namely tweets, blogs and news, which were explored to get an idea of the quality and quantity of the data. Basic exploration, such as counting the number of lines per file and in total, was performed. Because the dataset consisted of almost 800,000 lines, it was sampled for ease of use.

All non-English characters were removed and the data was converted to a uniform encoding, i.e., ASCII. Outliers were removed by keeping only articles whose length fell within the interquartile range, and the three files were then merged into a single file. A file containing profane words was downloaded from the web and loaded into memory. The data was converted to lowercase for easy manipulation and then stripped of all URLs, email addresses, Twitter handles, hashtags, ordinal numbers, profanity, punctuation marks and extra whitespace. This was done using the tm package and the gsub function. The cleaned data was then written to disk and the corpus was built.

Finally, the data was tokenized, i.e., split into n-gram tokens, and the n-gram frequencies were counted. Start-word predictions, quadgrams, trigrams and bigrams were stored in separate files. As the user enters text, the algorithm iterates from the longest n-gram (4-gram) to the shortest (2-gram) to detect a match.
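
The following is a condensed sketch of these steps, assuming the three en_US files and a bad-words.txt profanity list sit in the working directory; file names, the sampling fraction and the regular expressions are illustrative, not the exact code used:

    # Sketch of the cleaning and tokenization pipeline described above.
    # File names, sample size and regular expressions are assumptions.
    library(tm)
    library(quanteda)

    set.seed(1234)
    lines <- c(readLines("en_US.twitter.txt", skipNul = TRUE),
               readLines("en_US.blogs.txt",   skipNul = TRUE),
               readLines("en_US.news.txt",    skipNul = TRUE))
    lines <- sample(lines, floor(length(lines) * 0.1))             # sample for ease of use
    lines <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")  # uniform ASCII encoding
    lines <- tolower(lines)
    lines <- gsub("http\\S+|www\\.\\S+", "", lines)                # URLs
    lines <- gsub("\\S+@\\S+", "", lines)                          # email addresses
    lines <- gsub("[@#]\\S+", "", lines)                           # handles and hashtags
    lines <- gsub("[0-9]+(st|nd|rd|th)?", "", lines)               # numbers and ordinals
    lines <- removePunctuation(lines)                              # tm
    lines <- stripWhitespace(lines)                                # tm

    corp <- corpus(lines)                                          # build the corpus
    toks <- tokens(corp)
    toks <- tokens_remove(toks, readLines("bad-words.txt"))        # profanity filter

    # n-gram frequency table (bigrams shown; tri- and quadgrams are analogous)
    bigram_freq <- sort(colSums(dfm(tokens_ngrams(toks, 2, concatenator = " "))),
                        decreasing = TRUE)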

Codes

The project requires three code files: a server.R file, a ui.R file and a file for building the predictive model, here called build-ngram-frequencies.R.

build-ngram-frequencies.R: This file performs the steps discussed in the methodology section above. The packages used in this file are:

  • tm: The text-mining package for R
  • dplyr: Used for data manipulation; provides functions such as select, mutate, rename and filter
  • stringr: Makes working with strings easier
  • stringi: Used for string processing; provides functions not available in stringr
  • quanteda: Used for quantitative analysis of textual data, such as content analysis or sentiment analysis

Functions that may require an explanation are as follows (a short usage sketch is given after the list):

  • paste0: Simply paste with an empty ("") separator, i.e., it concatenates strings without inserting spaces
  • sample: Takes a random sample of elements from a dataset or vector
  • iconv: Converts a character vector between encodings
  • gsub: Used for pattern substitution or replacement
  • corpus: Creates a corpus
  • duplicated: Determines which elements of a vector/data frame are duplicates
  • dfm: Creates a document-feature matrix, which is more memory efficient
  • mutate: Adds new variables computed from existing columns
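
The toy calls below illustrate these functions; all values are examples only:

    # Toy illustrations of the functions listed above (values are examples only)
    library(quanteda)
    library(dplyr)

    paste0("text", "mining")                     # "textmining" (no separator)
    sample(c("a", "b", "c", "d"), 2)             # two random elements
    iconv("café", from = "UTF-8", to = "ASCII", sub = "")   # "caf" (non-ASCII dropped)
    gsub("[[:punct:]]", "", "hello, world!")     # "hello world"
    corp <- corpus(c("the cat sat", "the cat ran"))          # a two-document corpus
    duplicated(c("cat", "dog", "cat"))           # FALSE FALSE TRUE
    dfm(tokens(corp))                            # document-feature matrix of term counts
    data.frame(freq = c(3, 1)) %>%
      mutate(rel = freq / sum(freq))             # mutate: new column from existing one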

A frequently occurring term in the context of this project is back-off, which means falling back to the (n-1)-gram level to estimate probabilities when the current n-gram level yields no match (i.e., a probability of zero).
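
As a sketch of this idea, assume the n-gram frequency tables are data frames with columns prefix, word and freq (these names are assumptions, not the exact structure used):

    # Illustrative back-off lookup over the quad-, tri- and bigram tables.
    # Table and column names (prefix, word, freq) are assumptions.
    library(dplyr)

    backoff_lookup <- function(input_words, quadgrams, trigrams, bigrams, n = 3) {
      tables       <- list(quadgrams, trigrams, bigrams)
      prefix_sizes <- c(3, 2, 1)                       # prefix length for 4-, 3-, 2-grams
      for (i in seq_along(tables)) {
        k <- prefix_sizes[i]
        if (length(input_words) < k) next              # not enough context at this level
        key  <- paste(tail(input_words, k), collapse = " ")
        hits <- tables[[i]] %>%
          filter(prefix == key) %>%
          arrange(desc(freq)) %>%
          head(n)
        if (nrow(hits) > 0) return(hits$word)          # match found, stop backing off
      }
      character(0)                                     # no match at any n-gram level
    }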

server.R: This file serves as the backend for the project. First, the n-gram frequency files generated by build-ngram-frequencies.R and the bad-words file are loaded. The functions used in this file are (a minimal server sketch follows the list):

  • predictionMatch(userInput, ngrams): Uses back-off to find the top three predictions, if available, for the user input. A unigram fallback has not been implemented, in order to enhance performance by reducing computation
  • cleanInput(input): Removes profanity, URLs, email addresses, Twitter handles, hashtags, punctuation, ordinals and whitespace
  • predictNextWord(input, word = 0): Uses predictionMatch to display predictions
  • shinyServer(input, output): Serves the Shiny app
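
A minimal sketch of how server.R might wire these functions together, assuming a text box with id userInput, a slider with id numPredictions and an output with id predictions (all three ids are assumptions):

    # Minimal server sketch; input/output ids and helper behavior follow the
    # assumptions stated above.
    library(shiny)

    shinyServer(function(input, output) {
      output$predictions <- renderText({
        cleaned <- cleanInput(input$userInput)        # strip URLs, profanity, etc.
        words   <- predictNextWord(cleaned)           # back-off over 4-, 3-, 2-grams
        paste(head(words, input$numPredictions), collapse = " | ")
      })
    })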

ui.R: This file uses markdown to render the page. The packages tm, dplyr, shiny and markdown were first loaded, and the interface was then designed within shinyUI().
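
A minimal two-tab layout along these lines could look as follows (tab titles, widget ids, the slider range and the about.md file are assumptions):

    # Minimal ui.R sketch; tab titles, ids and the About page file are assumptions.
    library(shiny)
    library(markdown)

    shinyUI(navbarPage("Predicto",
      tabPanel("Predict",
        textInput("userInput", "Enter your text:"),
        sliderInput("numPredictions", "Number of predictions:", min = 1, max = 3, value = 3),
        h4("Predicted next word(s):"),
        textOutput("predictions")
      ),
      tabPanel("About",
        includeMarkdown("about.md")                   # basic information about the app
      )
    ))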

Data Set/ Data Collection

The dataset is available on Kaggle. As discussed above, the dataset consists of files in 4 different languages.

This project uses the files from the "en_US" folder, i.e., the blogs, news and Twitter text files.

Sample snapshots of the dataset are as follows:

A closer look at the data shows that the corpus contains profane words, which led to the need for a separate procedure to remove them. It also contains many non-ASCII characters, leading and trailing whitespace, and ordinal numbers, all of which had to be removed. The dataset was also so large (as is evident from the image) that it had to be sampled; Notepad could not handle such a volume of data, so it is advised to use Sublime Text or similar software to view the raw files.

Snapshots of the Project

Start word prediction:

About page:

Example using “Wish you a happy…” string:

Conclusion

The predictive model performs well, frequently suggesting the word the user intends to type next. The app is easy to use and the interface is clear and concise. The "About" tab gives sufficient information about the app, and the response time is acceptable (2-3 seconds). Finally, the app is also flexible in the number of predictions it displays.

Future scope

The predicted words could be made clickable (as with mobile keyboard suggestions) to reduce the number of keystrokes required from the user. The app could also be enhanced to provide spelling suggestions. It could be personalized with per-user keystroke saving, which would in turn raise privacy concerns that would have to be addressed. Another improvement would be to consider more than four input words for prediction, so as to better maintain the context of the prediction with respect to the input.