03/10/2020

Introduction

This five-page R Markdown presentation describes a Natural Language Processing project. The objective is to deploy a Shiny web app that takes a phrase (multiple words) typed into a text box and outputs a prediction of the ‘next word’. For example, when someone types:

I went to the …

the web app predicts what the next word might be, such as gym, store or restaurant.

How to achieve the objective:

  • Data: Blogs, News & Twitter from a corpus called HC Corpora
  • Tools: RStudio (tm, quanteda, wordcloud, tidyr, stringi/stringr, tidytext)
  • Modelling: Markov chain & Katz Backoff
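
As a reminder of what the modelling bullet refers to, the standard textbook form of an n-gram Markov model is sketched here. It is not taken from the project report, and the symbols are assumptions: w_i is the i-th word and C(·) is a count in the training data. The model assumes the next word depends only on the previous n-1 words and estimates that probability as a ratio of counts:

$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \approx \frac{C(w_{i-n+1} \dots w_i)}{C(w_{i-n+1} \dots w_{i-1})}
$$

Katz back-off additionally discounts these counts and, when an n-gram was never observed, falls back to the (n-1)-gram estimate scaled by a back-off weight.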

Data Visualisation

The best way to explore text data is to look at the data visually!

Top 100 words from the raw data.

[Figure: Top 100 Words]

Next Word Prediction Methodology

  1. Download the raw text files, check their size and perform Exploratory Data Analysis
  2. Clean the data, split it into 2-, 3-, 4- and 5-word n-grams, and save these as data tables with the columns: Features | Count | Frequency | Coverage
  3. Improve performance by reducing the data tables to only include words with up to 50% coverage. The reasons are given in the report
  4. Sort the n-gram data tables by count, frequency and coverage
  5. Predict Word function: uses a Markov chain and a “back-off” type prediction model
  6. The model uses the last 4, 3, 2 or 1 words of the input to predict the best 5th, 4th, 3rd or 2nd word from the data tables
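
The steps above can be sketched in a few lines. The project itself is written in R; the Python below is only a language-neutral illustration, not the author's code. The function names (`tokenize`, `ngram_table`, `prune`, `predict_next`) and the toy sentences are hypothetical. The slides do not define the Frequency and Coverage columns, so the sketch assumes frequency = count / total and coverage = the running sum of frequency down the sorted table. The back-off here is a plain "longest match wins" rule without Katz discounting.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real cleaning would do far more."""
    return text.lower().split()

def ngram_table(sentences, n):
    """Return rows (feature, count, frequency, coverage), sorted by count.

    Assumption: frequency = count / total, coverage = cumulative frequency.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    rows, running = [], 0.0
    # Descending count; ties broken alphabetically so the order is deterministic.
    for feature, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += count / total
        rows.append((feature, count, count / total, running))
    return rows

def prune(rows, max_coverage=0.5):
    """Keep rows whose coverage is at most max_coverage (at least one row)."""
    return [row for row in rows if row[3] <= max_coverage] or rows[:1]

def predict_next(phrase, tables, max_order=5):
    """Try the longest context first, then back off to shorter contexts.

    tables maps n (2..max_order) to a table built by ngram_table.
    """
    tokens = tokenize(phrase)
    for n in range(max_order, 1, -1):
        context = tuple(tokens[-(n - 1):])
        if len(context) < n - 1:
            continue  # phrase is too short for this n-gram order
        for feature, *_ in tables.get(n, []):
            if feature[:-1] == context:
                return feature[-1]  # table is sorted, so first hit has the top count
    return None  # no n-gram matched at any order

# Toy corpus (hypothetical data, for illustration only).
sentences = [
    "i went to the gym",
    "i went to the store",
    "i went to the gym today",
    "she went to the store",
    "he went to the gym",
]
tables = {n: ngram_table(sentences, n) for n in range(2, 6)}
print(predict_next("I went to the", tables))  # gym
```

On larger tables the linear scan in `predict_next` would be slow; a lookup keyed on the context words is one way to get the fast response the app aims for.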

The Shiny Web App

The app provides simple instructions on the main page

Key Features:

  1. Text box for user input: one or more words
  2. Predicted next word outputs dynamically below user input
  3. Multiple header tabs: Main | Data Visualisation | How it works | About

Key Benefits:

  1. Fast response
  2. Simple user interface

Please click the link to check out my Shiny Web App