Natural Language Processing Application

KFollmer
Oct. 15th, 2016

Introduction

Request: Build a text prediction web application

Requirements

  • Timely (< 1 minute response time)
  • Predicts the next word of a user-entered phrase
  • Hosted as a Shiny App

Sources Used

R Packages

library(shiny)
library(quanteda)
library(stringr)
library(data.table)
library(dplyr)

Data Source Used: Capstone Dataset, a large sample of language data extracted from Twitter, blog entries, and news sources

Basic Steps of the Algorithm

  1. Create a 4-gram dictionary from the sample data source referenced on the previous slide.
  2. Store the last 2 words of the user's input to use for predicting the next word.
  3. Use the dictionary in conjunction with the kwic() function of the quanteda package by Ken Benoit to search for occurrences of the bigram stored in step 2.
  4. Create a frequency table of all the possible next words found in the dictionary.
  5. Select the most frequently occurring 'next word' as the answer.
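
The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the app's actual code: the toy corpus, the helper names build_fourgrams() and predict_next(), and the choice to match the bigram in positions 2 and 3 of each 4-gram are all hypothetical, and a plain data frame lookup stands in for the quanteda kwic() search.

```r
# Hypothetical toy corpus standing in for the Capstone sample
corpus <- c(
  "thanks for the follow and the retweet",
  "thanks for the follow back",
  "thanks for the follow today",
  "looking forward to the weekend",
  "thanks for the help"
)

# Split a string into lowercase word tokens
tokenize <- function(txt) {
  words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Step 1: build the 4-gram dictionary, one row per 4-gram occurrence
build_fourgrams <- function(texts) {
  rows <- lapply(texts, function(txt) {
    words <- tokenize(txt)
    if (length(words) < 4) return(NULL)
    idx <- seq_len(length(words) - 3)
    data.frame(w1 = words[idx], w2 = words[idx + 1],
               w3 = words[idx + 2], w4 = words[idx + 3],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

predict_next <- function(phrase, dict) {
  # Step 2: keep the last two words of the input
  words <- tokenize(phrase)
  if (length(words) < 2) return(NA_character_)
  last_two <- tail(words, 2)

  # Step 3: find 4-grams containing that bigram
  # (assumption: matched in positions 2 and 3, so w4 is the next word)
  hits <- dict$w4[dict$w2 == last_two[1] & dict$w3 == last_two[2]]
  if (length(hits) == 0) return(NA_character_)

  # Steps 4 and 5: frequency table of candidates, then take the top one
  freq <- sort(table(hits), decreasing = TRUE)
  names(freq)[1]
}

dict <- build_fourgrams(corpus)
predict_next("Thanks for the", dict)  # "follow"
```

Returning NA when the bigram is not found keeps the sketch short; a real app would likely fall back to a shorter context or a default word.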

Performance Notes

There is a tradeoff between performance and accuracy with this tool. Performance was prioritized over formatting.

Improvements for Next Version

  • More complex logic
  • A more robust 'lookup' dictionary
  • More use of regex to correctly match words regardless of common spelling errors
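
One possible shape for the regex idea in the last bullet is sketched below: the input is normalized before lookup so that common misspellings map to the dictionary spelling. The pattern table and the helper normalize_spelling() are hypothetical examples, not part of the current app.

```r
# Hypothetical lookup: regex pattern -> canonical spelling
spelling_fixes <- c(
  "\\brec(ei|ie)ve\\b"  = "receive",
  "\\bdefin[ia]tely\\b" = "definitely",
  "\\btomm?orr?ow\\b"   = "tomorrow"
)

# Lowercase the text and rewrite each known misspelling
normalize_spelling <- function(text, fixes = spelling_fixes) {
  text <- tolower(text)
  for (pattern in names(fixes)) {
    text <- gsub(pattern, fixes[[pattern]], text, perl = TRUE)
  }
  text
}

normalize_spelling("I will definately recieve it tommorow")
# "i will definitely receive it tomorrow"
```

A hand-written table like this only covers the misspellings it lists, so it would complement a larger dictionary, not replace it.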