Presentation for Capstone Project

Arif A. Arshad
February 10, 2018

The General Problems

The problems I originally set out to solve were general ones, chosen to introduce myself to text mining through cleaning, processing, and analyzing textual data, along with developing an algorithm and building an app. In sequence, they were: how to process large text files within R's memory constraints; how to clean dirty, scraped text so that it is usable for analysis; how to create a dictionary of ngrams with associated probabilities; how to develop an algorithm that predicts the next word of a text; and finally how to build an app around that algorithm. The process was immersive and gave me deep hands-on experience of what text mining and app development involve.

The Algorithm

The algorithm behind the app is a simple matching algorithm. It takes the input text and searches for matches in a memory-efficient 28 MB ngram dictionary containing 632,457 different ngrams drawn from source texts including Twitter, blogs, AP news, Newsweek, the Economist, and Time. The dictionary also stores each ngram's associated probability, computed using methods described in Jurafsky and Martin's Speech and Language Processing. After finding a set of matches (or possibly none), the algorithm chooses the entry with the greatest associated probability and outputs the next word in the matched ngram.
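The matching step described above can be illustrated with a minimal sketch. This is not the app's actual code: it is written in Python for illustration, the function names build_ngram_dictionary and predict_next_word are hypothetical, the probabilities are plain maximum-likelihood estimates (an assumption; the exact method taken from Jurafsky and Martin is not stated here), and breaking ties in favor of the longer context is also an assumption.

```python
from collections import Counter, defaultdict


def build_ngram_dictionary(sentences, max_n=3):
    """Map each context (tuple of preceding words) to {next_word: probability}.

    Assumption: probabilities are maximum-likelihood estimates,
    count(context + word) / count(context).
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                counts[context][words[i + n - 1]] += 1
    dictionary = {}
    for context, next_words in counts.items():
        total = sum(next_words.values())
        dictionary[context] = {w: c / total for w, c in next_words.items()}
    return dictionary


def predict_next_word(text, dictionary, max_n=3):
    """Return the next word of the highest-probability match, or None."""
    words = text.lower().split()
    best_word, best_prob = None, 0.0
    # Try the longest context first so that ties favor longer matches.
    for context_len in range(max_n - 1, 0, -1):
        if len(words) < context_len:
            continue
        context = tuple(words[-context_len:])
        for word, prob in dictionary.get(context, {}).items():
            if prob > best_prob:
                best_word, best_prob = word, prob
    return best_word


toy_corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
toy_dictionary = build_ngram_dictionary(toy_corpus)
print(predict_next_word("the cat", toy_dictionary))
print(predict_next_word("zebra", toy_dictionary))
```

On this toy corpus, "the cat" is followed by "sat" in two of three cases, so "sat" is the predicted word; an input with no match in the dictionary yields no prediction (None in this sketch).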

Algorithm's Performance and How to Use the App

In various test sets, the algorithm correctly predicted at most 18% of the next words. No doubt the algorithm should perform better, but how? The answer lies in a more targeted ngram list, and hence an app aimed at a more specific market. To use the app, just type in or paste some text and click 'Predict'. The app will display the predicted next word.
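A figure such as the 18% above is a next-word accuracy: the share of positions in held-out text where the predicted word equals the word that actually follows. The sketch below shows one way such a number might be computed. The actual test procedure is not described here, so the function next_word_accuracy and the choice to score every word position are assumptions made for illustration, written in Python.

```python
def next_word_accuracy(predict, test_sentences):
    """Fraction of next words that `predict` gets right.

    `predict` is any function that takes the text so far and returns a
    predicted next word (or None). Every position after the first word of
    each sentence counts as one test case (an assumption of this sketch).
    """
    hits = 0
    total = 0
    for sentence in test_sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            prefix = " ".join(words[:i])
            total += 1
            if predict(prefix) == words[i]:
                hits += 1
    return hits / total if total else 0.0


def always_the(prefix):
    """Toy stand-in for a real predictor: always guesses 'the'."""
    return "the"


score = next_word_accuracy(always_the, ["go to the park"])
print(round(score, 3))
```

Running the same measurement on test text from a narrow domain, against a dictionary built from that domain, would be a direct way to check whether a more targeted ngram list improves the score.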

Language, Discourse, and Prediction

Language usage is vast, with an infinite set of possibilities. However, the more localized the setting, the more predictable the patterns become. The social roles people play involve different ways of speaking. As a result, we can make use of the simple algorithm and a memory-efficient ngram dictionary by targeting the source text more narrowly. This will allow us to develop next-word prediction models and apps for a variety of cultural and practical contexts; in other words, new and different markets.

Helping Elementary School Kids Read

A promising application of the algorithm is helping young children learn to read by predicting the next word. This may be especially promising for children with autism or for English language learners. We can use text from various children's books to build their vocabulary and help them develop a sense for syntax and common academic usage. The app could be part of a project to produce learning games, or it could be sold to an existing learning-game maker.