Presentation Swift Key Capstone

Filipe Rigueiro
October 7, 2018

COURSERA SWIFT KEY PRESENTATION

Introduction

Coursera and SwiftKey partnered on this project, which applies data science to natural language processing.
The project uses a large text corpus to predict the next word from the preceding input. The data is extracted from files, cleaned, and then used by the Shiny application.

The ultimate purpose of this project is to build a Shiny app that suggests likely next words as the user types a sentence.

Details can be found in these links:

Shiny app
GitHub

Steps completed

Exploratory data analysis: the data overview and cleaning are presented in the Milestone report;
Prepare unigram, bigram, trigram and quadgram tables from the data;
Use a back-off model to suggest the top words most likely to appear next;
Build the Shiny app;
Create the presentation.
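
The n-gram preparation step can be sketched as follows. This is a minimal illustration in Python, not the app's own code: the function names `tokenize` and `build_ngram_counts` and the simple regex-based cleaning are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes.

    This is a simplified stand-in for the cleaning step (assumption).
    """
    return re.findall(r"[a-z']+", text.lower())

def build_ngram_counts(lines, max_n=4):
    """Return {n: Counter of n-word tuples} for n = 1..max_n.

    n = 1, 2, 3, 4 correspond to unigrams, bigrams, trigrams and quadgrams.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_n + 1):
            # Slide a window of n words across the line.
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

# Hypothetical two-line corpus, used only to show the shape of the result.
counts = build_ngram_counts(["the cat sat on the mat", "the cat ran"])
print(counts[2].most_common(1))
```

Counting n-grams line by line keeps them from spanning two unrelated documents, which seems a reasonable choice for a corpus of short texts.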

Back-off model

The algorithm follows these steps:

Load and clean the input;
Keep only the last words of the input (up to 3 words);
Match the trimmed words against the quadgram, trigram, bigram and unigram data.

If 3 words are selected, the quadgram data is used. If 2 words are selected, the trigram data is used. If 1 word is selected, the bigram data is used. If no words are selected, the unigram data is used.
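
The lookup described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's actual implementation: the names `clean`, `build_tables` and `suggest` are hypothetical, candidates are ranked by raw counts, and the sketch falls through to a shorter context whenever the longer one has no match.

```python
import re
from collections import Counter, defaultdict

def clean(text):
    """Lowercase and split into words (simplified cleaning, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def build_tables(lines, max_context=3):
    """tables[k] maps a k-word context tuple to a Counter of following words.

    k = 3 plays the role of the quadgram data, k = 2 trigram,
    k = 1 bigram, and k = 0 unigram (empty context).
    """
    tables = {k: defaultdict(Counter) for k in range(max_context + 1)}
    for line in lines:
        words = clean(line)
        for i, word in enumerate(words):
            for k in range(max_context + 1):
                if i - k >= 0:
                    tables[k][tuple(words[i - k:i])][word] += 1
    return tables

def suggest(text, tables, top=3, max_context=3):
    """Return up to `top` next-word suggestions, most frequent first."""
    words = clean(text)
    context = words[-max_context:] if max_context else []
    # Try the longest available context first, then back off to shorter ones.
    for k in range(len(context), -1, -1):
        key = tuple(context[len(context) - k:])
        candidates = tables[k].get(key)
        if candidates:
            return [w for w, _ in candidates.most_common(top)]
    return []

# Hypothetical toy corpus, used only for illustration.
corpus = ["the cat sat on the mat", "the cat sat on the sofa", "the cat ran away"]
tables = build_tables(corpus)
print(suggest("the cat", tables))
print(suggest("zebra cat", tables))  # unseen two-word context, backs off to "cat"
```

Ranking by raw counts keeps the sketch short; a weighted scheme that penalises shorter contexts would be a natural refinement.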

How the app works

Type text in the input box;
Suggestions appear below (the leftmost one is the most probable);
When a suggestion is clicked, the word is copied to the input box.

Results from Shiny App and Limitations

The training data is limited to 6000 because of memory failures.
Various tests were made to increase the amount of training data, but they were unsuccessful.

A larger training set (in the hundreds of thousands) would greatly improve the model's accuracy.