Capstone Project for the Johns Hopkins Data Science Specialization - Next Word Prediction

IrinaS

2023-03-19

Summary

The goal of this task: build a smart application that presents options for what the next word might be.

Source data: a large, unstructured corpus of English text in txt format, provided by SwiftKey.

Raw data processing: RMarkdown

Source code in Git: Git Repository

Shiny application: ShinyApp

Work Stages

Cleaning and analyzing text data

The raw data contains corpora (collected from publicly available sources by a web crawler) in 4 different languages; for this project, only the en_US locale files were used.

Since the full data set (a vector of about 5.55 GB) is too large for my computer to run exploratory data analysis on, I randomly sampled 1% of the data. The sampled data is still sufficient for statistical analysis.
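As an illustration, here is a minimal R sketch of this sampling step; the file paths follow the SwiftKey download layout and are assumptions, and the seed is arbitrary:

```r
# Randomly keep ~1% of the lines from each en_US source file
set.seed(123)  # arbitrary seed, for reproducibility
sample_lines <- function(path, rate = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs   <- sample_lines("final/en_US/en_US.blogs.txt")
news    <- sample_lines("final/en_US/en_US.news.txt")
twitter <- sample_lines("final/en_US/en_US.twitter.txt")
corpus_sample <- c(blogs, news, twitter)
```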

To perform the data analysis, the sampled text was normalized.
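The exact normalization steps are not listed in this deck; a typical cleaning pass in base R might look like the following (the specific steps shown are assumptions, not the project's actual code):

```r
# Typical text normalization pass (the exact steps used are an assumption)
normalize_text <- function(x) {
  x <- tolower(x)                  # lower-case everything
  x <- gsub("[^a-z' ]", " ", x)    # drop numbers, punctuation and other symbols
  x <- gsub("\\s+", " ", x)        # collapse repeated whitespace
  trimws(x)
}

clean_sample <- normalize_text(corpus_sample)
```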

Building and sampling from a predictive text model

  1. If the user enters several words:

    a. The app takes the last 2 words and looks for a Trigram that starts with them. The prediction is the last word of the matching Trigram. If no such Trigram exists, the app falls back to the last word only (see 1b);

    b. The app looks for a Bigram that starts with this word; the prediction is the last word of the matching Bigram. If none exists, the app looks for a Trigram that starts with this word; the prediction is its second word. If that also fails, it looks for a Trigram whose second word matches; the prediction is the last word of that Trigram. If none of these cases exist, the app returns the most popular words from the Unigram table.

  2. If the user enters 1 word, see step 1b.

  3. If the user enters no words, the app returns the most popular words from the Unigram table (a minimal sketch of this backoff logic follows the list).
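The sketch below illustrates the backoff described above. The n-gram tables (`unigrams`, `bigrams`, `trigrams`) and their columns (`w1`, `w2`, `w3`, sorted by descending frequency) are assumptions for illustration, not the app's actual data structures:

```r
# Backoff prediction: trigram -> bigram -> trigram fallbacks -> unigram.
# Assumes data frames unigrams, bigrams, trigrams with word columns
# w1, w2, w3, already sorted by frequency in descending order.
predict_next <- function(phrase, n = 3) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  len   <- length(words)

  if (len >= 2) {                                    # step 1a: last two words
    hit <- trigrams[trigrams$w1 == words[len - 1] &
                    trigrams$w2 == words[len], ]
    if (nrow(hit) > 0) return(head(hit$w3, n))
  }
  if (len >= 1) {                                    # step 1b: last word only
    last <- words[len]
    hit <- bigrams[bigrams$w1 == last, ]
    if (nrow(hit) > 0) return(head(hit$w2, n))
    hit <- trigrams[trigrams$w1 == last, ]
    if (nrow(hit) > 0) return(head(hit$w2, n))
    hit <- trigrams[trigrams$w2 == last, ]
    if (nrow(hit) > 0) return(head(hit$w3, n))
  }
  head(unigrams$w1, n)                               # step 3: most frequent words
}
```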

Shiny App

Link to the ShinyApp

The application has the following options:

  1. The user can enter a word or sentence for next-word prediction
  2. The user can choose the number of most probable words to predict (I limited this number to 10, since there is no business case for a larger number within this assignment)
  3. The user can choose the prediction model - Unigram, Bigram, Trigram or Cumulative (the optimal choice among the N-grams). The logic of how this works is described on the slide above.

On the mainPanel, after entering all necessary parameters, you can see:

  1. Which model you have chosen
  2. A table with the most probable predicted words (with the frequency and probability of each combination in the source data); the number of predicted words is the one specified by the user
  3. A wordcloud visualization of the most probable next words (a minimal UI sketch follows this list).
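For illustration, here is a minimal sketch of how these controls and outputs could be wired up in Shiny; the widget IDs, labels and default values are assumptions, not the app's actual source code:

```r
library(shiny)

# Minimal UI sketch of the controls and outputs described above;
# input IDs and labels are assumptions for illustration only.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a word or sentence:"),
      numericInput("n_words", "Number of predicted words:",
                   value = 3, min = 1, max = 10),
      selectInput("model", "Prediction model:",
                  choices = c("Unigram", "Bigram", "Trigram", "Cumulative"))
    ),
    mainPanel(
      textOutput("chosen_model"),   # which model the user selected
      tableOutput("predictions"),   # predicted words with frequency and probability
      plotOutput("wordcloud")       # wordcloud of the most probable next words
    )
  )
)
```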

Thank you for your attention!