Capstone Coursera Data Science Final Project

Language Modelling - Text Predictor

Avinash Singh Pundhir
Analyst

The application is the capstone project for the Coursera Data Science specialization from Johns Hopkins University in collaboration with SwiftKey.
The main goal of this project is to build an app that predicts the next word likely to follow a given sentence.
The app takes an English sentence as input and predicts the most probable words that can follow.
Natural language processing concepts such as tokenization, N-grams and language modelling are evaluated and applied in the application logic.
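As an illustration of tokenization and N-gram extraction, here is a minimal Python sketch. It is not the app's actual code: the function names `tokenize` and `ngrams` are hypothetical, and the simple regular-expression tokenizer is an assumption standing in for whatever cleaning the app performs.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Count the bigrams in a toy sentence.
tokens = tokenize("The quick brown fox jumps over the lazy dog")
bigram_counts = Counter(ngrams(tokens, 2))
```

The resulting counts are the raw material for a frequency table: one table per N-gram order, keyed by the token tuple.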

The application uses a data corpus provided specifically for the project purposes.
The corpus contains text gathered from news, blogs and Twitter feeds.
The input data has been preprocessed and cleaned, and N-gram tokenization is used to generate unigrams, bigrams, trigrams and higher-order N-grams.
These N-grams are the input to the text prediction methodology, which is based on the Katz backoff and Good-Turing methods.
A combination of these two approaches is used to generate a relevance score. A higher relevance score signifies a better match.
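The backoff idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the app's implementation: it covers only the bigram-to-unigram case, and it uses a fixed discount (the hypothetical constant `DISCOUNT`) in place of Good-Turing discounts. The names `train`, `score` and `predict` are also hypothetical.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.5  # assumed fixed discount; stands in for Good-Turing discounting

def train(tokens):
    """Build unigram counts and, for each word, counts of the words that follow it."""
    unigrams = Counter(tokens)
    followers = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1
    return unigrams, followers

def score(prev, word, unigrams, followers):
    """Relevance score of `word` following `prev`, backing off to unigrams when unseen."""
    total = sum(unigrams.values())
    seen = followers.get(prev, {})
    if not seen:
        # Unknown context: fall back to the plain unigram estimate.
        return unigrams[word] / total
    context_count = sum(seen.values())
    if word in seen:
        # Seen bigram: discounted relative frequency.
        return (seen[word] - DISCOUNT) / context_count
    # Unseen bigram: share the discounted mass among unseen words by unigram weight.
    leftover = DISCOUNT * len(seen) / context_count
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    return leftover * (unigrams[word] / total) / unseen_mass

def predict(prev, unigrams, followers, k=3):
    """Return the k highest-scoring candidate next words."""
    ranked = sorted(unigrams, key=lambda w: score(prev, w, unigrams, followers), reverse=True)
    return ranked[:k]

unigrams, followers = train("i like tea i like coffee i drink tea".split())
suggestions = predict("i", unigrams, followers)
```

In this sketch the scores for a known context sum to one over the vocabulary, so a higher score can be read directly as a better match. A fuller version would back off from trigrams to bigrams to unigrams in the same way.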


The app is relatively fast, and returns a default list of 3 suggested matches in just a few seconds.
The app does not "learn" from user input. Future improvements could include building n-gram tables based on a user's frequent input.
We could also enhance the app to suggest a word pair rather than just a single word.
Accuracy could be increased by adding context-sensitive logic based on parts of speech and antonyms.