2025-04-17

Overview

  • This project is the final part of the 10-course Data Science track offered by Johns Hopkins University on Coursera.
  • It was done as an industry partnership with SwiftKey.
  • The task was to clean and analyze a large corpus of unstructured text, build a word prediction model, and deploy it in a web application.

Project Goals

The goal of this exercise is to create a product that showcases the prediction algorithm I built and to provide an interface that others can access. The project requires two deliverables:

  1. A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
  2. A slide deck of no more than 5 slides created with RStudio Presenter.

Data File

The data come from a corpus called HC Corpora. It consists of text files collected from publicly available sources by a web crawler. I used the English-language files, which were gathered from Twitter and from various blogs and news sources. This combination should give a reasonably good mix of the general language in use today. The data are large text files with over 4 million lines combined; the Unix word count gives a total of 102,081,616 words. The lines are not in sequential order: for example, a line in the “Blogs” file is not a complete post, and the same post does not continue on the next line.

Note: I used a random sample from the raw data to build the final model.
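
As a rough illustration of the sampling step, the sketch below keeps each line with a fixed probability. It is written in Python for brevity and is not the project's R code; the function name `sample_lines`, the 5% rate, and the seed are all assumptions, since the report does not state how the sample was drawn.

```python
import random

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`.

    The fraction and seed are illustrative assumptions, not values
    taken from the report.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the raw corpus lines (hypothetical data).
lines = [f"line {i}" for i in range(1000)]
sample = sample_lines(lines, fraction=0.05)
```

Fixing the seed makes the sample reproducible, which helps when the model has to be rebuilt from the same subset.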

Data Cleaning

Data cleaning involves transforming the raw text in the corpus into a format more suitable for automated manipulation. The tm package provides numerous functions for such transformations (see Feinerer et al., 2008, p. 9). For this package, the texts were converted to lower case and stripped of extra whitespace, and common stopwords (i.e., words so common that they carry little information; see Feinerer et al., 2008, pp. 25-26) were removed. From the cleaned English corpus, a term-document matrix (TDM) was created: a matrix that records how often each word or phrase occurs in the corpus.
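
The cleaning steps and the resulting matrix can be sketched as follows. This is a Python illustration of the idea, not the tm-based R code used in the project; the stopword list is a tiny hypothetical stand-in, and the names `clean` and `term_document_matrix` are invented for the example.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; tm ships its own list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def clean(text):
    """Lowercase, collapse whitespace, and drop stopwords."""
    words = text.lower().split()  # split() also discards extra whitespace
    return [w for w in words if w not in STOPWORDS]

def term_document_matrix(docs):
    """Return {term: [count in doc 0, count in doc 1, ...]}."""
    counts = [Counter(clean(d)) for d in docs]
    terms = sorted(set().union(*counts))
    return {t: [c[t] for c in counts] for t in terms}

docs = ["The  cat sat in the hat", "A cat and a dog"]
tdm = term_document_matrix(docs)
```

Each row of `tdm` is a term and each column a document, so `tdm["cat"]` holds the count of "cat" in each of the two example documents.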

Prediction Model

According to Wikipedia (N-gram, n.d.), “an n-gram is a contiguous sequence of n items from a given sequence of text or speech.” This package takes a key word or phrase, compares it with the first n-1 words of each term in a TDM of n-word terms, selects the most frequent matching term, and returns the nth word of that term.
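
To make the lookup concrete, here is a minimal Python sketch of counting n-grams and returning the last word of the most frequent n-gram that starts with a given key. It is not the package's R implementation; `ngrams`, `predict_from_ngrams`, and the toy sentence are assumptions made for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def predict_from_ngrams(counts, prefix):
    """Return the last word of the most frequent n-gram whose first
    n-1 words equal `prefix`, or None when nothing matches."""
    prefix = tuple(prefix)
    matches = {g: c for g, c in counts.items() if g[:-1] == prefix}
    if not matches:
        return None
    best = max(matches, key=matches.get)
    return best[-1]

# Toy corpus (hypothetical); the real model is built from the sampled data.
tokens = "i love new york and i love new shoes and i love new york".split()
trigram_counts = Counter(ngrams(tokens, 3))
prediction = predict_from_ngrams(trigram_counts, ["love", "new"])
```

In the toy corpus "love new york" occurs twice and "love new shoes" once, so the key "love new" yields "york".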

Of course, not all possible words or phrases exist in the corpus from which the TDM was derived. For this reason, a simplified Katz’s back-off model is used, which backs off to smaller n-grams when a key is not found in the larger n-gram. The largest n-gram handled is a trigram. The word returned is the match from the largest n-gram in which the key is found. When the key is not found at any level, the most common word in the corpus, “will”, is returned. This function is demonstrated in a Shiny app hosted on shinyapps.io.
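
The back-off logic described above can be sketched in a few lines of Python. This is an illustration only, not the package's “katz_backoff_model” function: the names `build_tables` and `backoff_predict` are hypothetical, the toy corpus is made up, and the sketch uses raw counts with no discounting, which is an assumption about what "simplified" means here.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=3):
    """Map n -> {prefix of n-1 words -> Counter of next words}."""
    tables = {}
    for n in range(2, max_n + 1):
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
        tables[n] = table
    return tables

def backoff_predict(phrase, tables, fallback, max_n=3):
    """Try the longest prefix first, then back off to shorter ones."""
    words = phrase.lower().split()
    for n in range(max_n, 1, -1):
        prefix = tuple(words[-(n - 1):])
        # Skip this level if the phrase is too short or the prefix is unseen.
        if len(prefix) == n - 1 and prefix in tables[n]:
            return tables[n][prefix].most_common(1)[0][0]
    return fallback  # no match at any level

# Toy corpus (hypothetical); "will" is the fallback word named in the report.
tokens = "i love new york and i love new shoes and i love new york".split()
tables = build_tables(tokens)
```

With these tables, `backoff_predict("I love new", tables, "will")` matches at the trigram level, `backoff_predict("brand new", tables, "will")` backs off to the bigram level, and a phrase with no match at any level falls through to the fallback word.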

Conclusion

This report has shown features of the R package “wordprediction”. It was designed using samples of 1,000 words each from a corpus of collections of English words. As shown in the demonstration, every word or phrase submitted to the function “katz_backoff_model” returns a prediction in the form of a single word.

Thanks