Final project

2024-11-13

Overview

This project was the final part of a 10-course Data Science track by Johns Hopkins University on Coursera. It was done as an industry partnership with SwiftKey. The task was to clean and analyze a large corpus of unstructured text, build a word prediction model, and use it in a web application.

Word Predictor Application: https://simrankharbanda.shinyapps.io/final/

Continued

The goal of this exercise is to create a product that highlights the prediction algorithm I have built and to provide an interface that others can access. The project requires two deliverables:

A Shiny app that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word.
A slide deck of no more than 5 slides created with RStudio Presenter.

Data

The data come from a corpus called HC Corpora. It consists of text files collected from publicly available sources by a web crawler. I used the English-language files, which were gathered from Twitter, blogs, and news sources. This combination should give a reasonably good mix of the general language used today. The data are large text files with over 4 million lines combined; the Unix word count gives 102,081,616 individual words. The lines are not in sequential order: e.g. the lines in the “Blogs” file are not complete posts, and a post does not continue on the next line.

Note: I used a random sample from the raw data to build the final model.
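
The slides do not say how the sample was drawn or how large it was. As a minimal sketch in Python (not the author's R code), one common approach keeps each line independently with a fixed probability. The function name `sample_lines`, the 5% fraction, and the seed are all assumptions made for illustration.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Hypothetical stand-in for the raw corpus lines.
corpus = [f"line {i}" for i in range(1000)]
sample = sample_lines(corpus, fraction=0.05)
print(len(sample), "of", len(corpus), "lines kept")
```

Because lines in the corpus are not in sequential order, sampling individual lines loses no cross-line context.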

Text Transformations

The process of transforming raw text into useful units for text analysis is called tokenization. Choosing which transformations to apply is an important decision. In the case of word prediction, it is probably the most important step.

List of transformations used:

Word Stemming
- In many NLP tasks you stem the words, which means reducing an inflected or derived word to its base form (e.g. connection, connected and connecting would all become connect)
All text to lower case
- Avoids words at the beginning of a sentence being treated as “different” from the same words elsewhere.
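
To make these two transformations concrete, here is a deliberately naive Python sketch. It is not a real stemming algorithm such as Porter's and not the code used in the project: the suffix list and the minimum stem length of three characters are assumptions chosen only so that the example words above behave as described.

```python
# Toy suffix list (an assumption for illustration, not a real stemmer).
SUFFIXES = ("ing", "ion", "ed")

def toy_stem(word):
    """Lower-case a word and strip one known suffix, if the stem stays >= 3 chars."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Connection", "connected", "connecting"]:
    print(w, "->", toy_stem(w))  # each prints "... -> connect"
```

A production stemmer handles many more suffixes and exceptions; this only shows the idea of mapping several surface forms to one unit.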

Continued

Remove numbers
- Remove tokens that consist only of numbers, but not words that start with digits (e.g. 2day)
- Numbers are hard to predict or use in word prediction
Remove punctuation
- With a simple n-gram model, punctuation produces too many distinct sequences
- Could be useful with advanced algorithms combined with other features
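
A small Python sketch of these two rules, under stated assumptions: punctuation is taken to mean the ASCII set in `string.punctuation` (which also happens to include @ and #), and punctuation is stripped before the number check so that tokens such as 3.5 are dropped as well. The function name is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def drop_numbers_and_punctuation(tokens):
    """Strip punctuation, then drop empty and digits-only tokens."""
    cleaned = []
    for token in tokens:
        token = token.translate(PUNCT_TABLE)
        # "2day" survives because it is not made up of digits only.
        if token and not token.isdigit():
            cleaned.append(token)
    return cleaned

tokens = ["see", "you", "2day", "at", "10", "o'clock", "!", "3.5"]
print(drop_numbers_and_punctuation(tokens))
# ['see', 'you', '2day', 'at', 'oclock']
```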

Continued

Remove separators
- Spaces and variations of spaces, plus tab, newlines, and anything else in the Unicode “separator” category
- No use for prediction
Remove Twitter characters
- i.e. @ and #
- Better to capture only the words
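
The separator and Twitter rules can be sketched in Python as follows. This is an illustration under assumptions, not the project's code: a separator is taken to be any character in the Unicode Z categories (Zs, Zl, Zp) plus tab and newline-style control characters, and @ and # are only stripped from the start of a token.

```python
import unicodedata

def is_separator(ch):
    """True for Unicode separators (Zs, Zl, Zp) and tab/newline controls."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\r\f\v"

def split_and_strip_twitter(text):
    """Split text on separators and strip leading @ and # from each token."""
    normalized = "".join(" " if is_separator(ch) else ch for ch in text)
    tokens = [tok.lstrip("@#") for tok in normalized.split(" ")]
    return [tok for tok in tokens if tok]  # drop empty tokens

text = "thanks\u00a0@anna\tfor the\n#great  day"
print(split_and_strip_twitter(text))
# ['thanks', 'anna', 'for', 'the', 'great', 'day']
```

Stripping only leading characters keeps the word behind a mention or hashtag while leaving strings such as e-mail addresses untouched; removing every @ and # would be an equally valid reading of the rule.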

Prediction Model

To keep the scope of the project manageable, I only considered so-called Markov models. They are a class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. I based my model on the stupid backoff algorithm. Despite its name, it performs quite well given very large data, almost as well as some more complex models.
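
For reference, the scoring rule of stupid backoff as originally published by Brants et al. (2007) can be written as below, where f is the n-gram count in the training data, N is the total number of tokens, and the authors suggest a backoff factor of α = 0.4. Whether this project applies the discount factor or simply falls back to the shorter n-gram is not stated in these slides.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

The scores are relative frequencies rather than normalized probabilities, which is what makes the method cheap to compute.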

The stupid backoff algorithm centers around n-grams, which are contiguous word sequences of length n. The choice of n depends on the genre of text you are trying to predict. Higher-order n-grams are not always preferable for prediction, and the computing and storage needs grow exponentially with n. I chose to let the user pick n-gram lengths from one up to twenty. This means that the predictions can be based on at most 19 previous words.
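
As a brief Python illustration of what n-grams look like in practice (the function names are hypothetical and this is not the project's R code), the sketch below extracts and counts every n-gram up to a chosen maximum length from a list of tokens.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(tokens, max_n):
    """Count n-grams for every n from 1 to max_n."""
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens, max_n=3)
print(ngrams(tokens, 3)[0])   # ('the', 'cat', 'sat')
print(counts[1][("the",)])    # 2
```

With a vocabulary of V distinct words there are up to V^n possible n-grams, which is why the tables grow so quickly as n increases.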

Continued

Basic idea of the algorithm used:

For example, if n = 3:

Apply the same text transformations to the input as to the training data, and keep the last two words.
Search for these two words as the first two words of the 3-grams in the training data. If there is a match, predict the third word. If not, go to the next step.
Search for only the last input word as the first word of the 2-grams in the training data. If there is a match, predict the second word. If not, go to the next step.
Predict the most common words in the 1-gram data.
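
The steps above can be sketched in Python as follows. This is a simplified illustration, not the app's implementation: it assumes the input has already been transformed into tokens, it picks the most frequent continuation at each level without any discount factor, and `build_tables`, `predict`, and the toy training sentences are all hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=3):
    """For each n, map a context of n-1 words to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def predict(tables, words, max_n=3, k=1):
    """Try the longest context first and back off to shorter ones."""
    for n in range(max_n, 0, -1):
        # The unigram level (n = 1) uses the empty context.
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        candidates = tables[n].get(context)
        if candidates:
            return [word for word, _ in candidates.most_common(k)]
    return []

# Toy training data, already lower-cased and tokenized.
sentences = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["new", "york", "is", "big"],
    ["i", "love", "you"],
    ["i", "am", "here"],
]
tables = build_tables(sentences)
print(predict(tables, ["i", "love"]))            # ['new']  (3-gram match)
print(predict(tables, ["really", "new"]))        # ['york'] (backs off to 2-grams)
print(predict(tables, ["something", "unseen"]))  # ['i']    (backs off to 1-grams)
```

Storing next-word counts per context makes each lookup a single dictionary access, so prediction time does not depend on the size of the training data.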

Thanks