2023-11-18

Perspective

The present project was the final part of a 10-course Data Science track by Johns Hopkins University (JHU) on Coursera, run as an industry partnership with SwiftKey. The task was to clean and analyze a large corpus of unstructured text, build a word prediction model, and deploy it in a web application.

The Word Prediction application is available at the following address:

https://czuewj-raamses-d0az.shinyapps.io/TheTextPrediction/

The goal of this task is to create a product that highlights the prediction algorithm I built previously and provides an interface that others can access. For this project I must submit:

  1. A Shiny app. It takes a phrase (multiple words) as input in a text box and outputs a prediction of the next word (a minimal skeleton is sketched after this list).

  2. A slide deck. It consists of no more than 5 slides created with RStudio Presenter.
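For illustration only, a minimal Shiny skeleton for such an app could look like the sketch below. Here predict_next_word is a placeholder standing in for the real prediction function; it is not the code of the submitted app.

    library(shiny)

    # Placeholder for the real model; assumed to return a single word.
    predict_next_word <- function(phrase) {
      "the"  # dummy prediction
    }

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        predict_next_word(input$phrase)
      })
    }

    shinyApp(ui = ui, server = server)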

Corpus Data

The data come from a corpus called HC Corpora. It consists of text files collected from publicly available sources by a web crawler. I used the English-language files, which were gathered from Twitter and from various blogs and news sources.

The files together contain over 4 million lines of text.

A random sample of the raw data was used to build the final model.
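As an illustration, such a sample can be drawn with base R roughly as follows; the file names and the 10% sampling rate are illustrative assumptions, not the exact values used for the final model.

    set.seed(42)  # for reproducibility

    files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")

    sample_lines <- unlist(lapply(files, function(f) {
      lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      # keep a random 10% of the lines of each file
      sample(lines, size = ceiling(0.1 * length(lines)))
    }))

    writeLines(sample_lines, "corpus_sample.txt")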

Text Handling

  1. Word Stemming

    Reducing an inflected or derived word to its stem (e.g. connection, connected, and connecting would all become connect).

  2. All text to lower case

    Removes the problem of sentence-initial words being treated as “different” from the others.

  3. Remove punctuation

    With a simple n-gram model, punctuation creates too many distinct sequences.

  4. Remove numbers

    Remove tokens that consist only of numbers, but not words that start with digits, e.g. 2day.

  5. Remove separators and special characters

    Spaces and variants of spaces, plus tabs, newlines, and anything else in the Unicode “separator” category (of no use for prediction). The special characters removed here are @ and #. All five transformations are sketched in code right after this list.
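A minimal sketch of these transformations in base R plus the SnowballC stemmer is given below; the exact packages and regular expressions used in the real pipeline may differ.

    library(SnowballC)  # provides wordStem()

    clean_text <- function(x) {
      x <- tolower(x)                                    # 2. all text to lower case
      x <- gsub("[@#]", " ", x)                          # 5. special characters @ and #
      x <- gsub("[[:punct:]]", " ", x)                   # 3. punctuation
      x <- gsub("[\\p{Z}\t\r\n]+", " ", x, perl = TRUE)  # 5. separators -> single space
      tokens <- unlist(strsplit(trimws(x), " "))
      tokens <- tokens[tokens != ""]
      tokens <- tokens[!grepl("^[0-9]+$", tokens)]       # 4. numbers-only tokens (keeps "2day")
      wordStem(tokens, language = "english")             # 1. word stemming
    }

    clean_text("Connecting with friends @home, 2day at 10!")
    # e.g. "connect" "with" "friend" "home" "2day" "at"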

Prediction Model

For simplicity, I only considered so-called Markov models, a class of probabilistic models that assume we can predict the probability of a future unit without looking too far into the past. The model itself is based on the Stupid Backoff algorithm, which performs quite well given very large amounts of data.
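For reference, the Stupid Backoff score of a candidate word w given the k-1 preceding words can be written as follows (it is a score rather than a normalized probability; 0.4 is the back-off factor recommended by the method's authors):

    S(w | w_1 ... w_{k-1}) = count(w_1 ... w_{k-1} w) / count(w_1 ... w_{k-1})   if count(w_1 ... w_{k-1} w) > 0
                           = 0.4 * S(w | w_2 ... w_{k-1})                        otherwise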

The Stupid Backoff algorithm centers on n-grams, i.e. contiguous word sequences of length n. The user is free to choose n-gram lengths from one up to twenty, which means that predictions can be based on at most 19 previous words.
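A minimal sketch of how n-gram frequency tables can be built from the cleaned tokens, using only base R (real implementations typically use data.table or quanteda for speed):

    # tokens: character vector of cleaned words, e.g. from clean_text() above
    build_ngram_table <- function(tokens, n) {
      if (length(tokens) < n) return(table(character(0)))
      # slide a window of length n over the token stream
      ngrams <- sapply(seq_len(length(tokens) - n + 1), function(i) {
        paste(tokens[i:(i + n - 1)], collapse = " ")
      })
      sort(table(ngrams), decreasing = TRUE)  # counts of each n-gram
    }

    tokens <- clean_text("the cat sat on the mat and the cat slept")
    build_ngram_table(tokens, 2)  # bigram counts, e.g. "the cat" appears twice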

The algorithm works as follows (taking n = 5 as an example):

First, apply to the input phrase the same text transformations used for the training data and keep its last four words. Then search for those four words as the first four words of the 5-grams in the training data; if there is a match, predict the fifth word. If there is no match, back off to the next step: search for the last three input words among the first three words of the 4-grams and, if matched, predict the fourth word. Continue backing off to shorter and shorter n-grams in the same way; if even the bigrams give no match, fall back to the most frequent single word.
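Putting the pieces together, a hedged sketch of this back-off loop in base R is shown below; clean_text and build_ngram_table are the hypothetical helpers from the sketches above, not the exact functions used in the app.

    # ngram_tables: a list where ngram_tables[[n]] is the frequency table of n-grams,
    # e.g. ngram_tables <- lapply(1:5, function(n) build_ngram_table(tokens, n))
    predict_next_word <- function(phrase, ngram_tables, n = 5) {
      words <- clean_text(phrase)
      for (k in n:2) {
        context <- tail(words, k - 1)
        if (length(context) < k - 1) next   # not enough input words for this order
        prefix <- paste(context, collapse = " ")
        tab <- ngram_tables[[k]]
        # n-grams whose first k-1 words match the context
        hits <- tab[startsWith(names(tab), paste0(prefix, " "))]
        if (length(hits) > 0) {
          best <- names(hits)[which.max(hits)]
          return(tail(strsplit(best, " ")[[1]], 1))  # predict the last word of the best match
        }
      }
      # fall back to the most frequent single word
      names(ngram_tables[[1]])[1]
    }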