Filipe Rigueiro
October 7, 2018
COURSERA SWIFT KEY PRESENTATION
Coursera and SwiftKey are partnering on this project, which applies data science to natural language processing.
The project uses a large text corpus to predict the next word from the preceding input.
The data is extracted from the source files, cleaned, and then used by the Shiny application.
The ultimate purpose of this project is to build a Shiny app that suggests possible next words as users type arbitrary sentences.
Details can be found in these links:
The algorithm follows these steps:
If 3 words are entered, the quadgram data is used.
If 2 words are entered, the trigram data is used.
If 1 word is entered, the bigram data is used.
If no words are entered, the unigram data is used.
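The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the function names build_ngram_tables and predict_next are hypothetical, the tables are plain in-memory counters, and falling back to a shorter context when the longer one has no match is an assumption that the steps above do not state.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_order=4):
    """Count next-word frequencies keyed by the preceding 0..max_order-1 words.

    tables[0] is the unigram data, tables[1] the bigram data,
    tables[2] the trigram data, tables[3] the quadgram data.
    """
    tables = {n: defaultdict(Counter) for n in range(max_order)}
    for i, word in enumerate(tokens):
        for n in range(max_order):
            if i - n < 0:
                break  # not enough preceding words for this context length
            context = tuple(tokens[i - n:i])
            tables[n][context][word] += 1
    return tables


def predict_next(tables, text, top_k=3):
    """Suggest up to top_k next words for the typed text."""
    words = text.lower().split()
    # Use as many trailing words as the largest table supports (3 for quadgrams).
    n = min(len(words), max(tables))
    while n >= 0:
        context = tuple(words[len(words) - n:]) if n else ()
        candidates = tables[n].get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(top_k)]
        n -= 1  # assumption: back off to a shorter context on a miss
    return []


corpus = "the cat sat on the mat the cat sat on the sofa the cat ran".split()
tables = build_ngram_tables(corpus)
print(predict_next(tables, "the cat sat"))  # three words: quadgram data
print(predict_next(tables, "the cat"))      # two words: trigram data
print(predict_next(tables, ""))             # no words: unigram data
```

Keying each table by the number of preceding words keeps the lookup a single dictionary access, so the choice between quadgram, trigram, bigram, and unigram data reduces to counting the words typed.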
Training data is limited to 6000 due to memory failures.
Various tests were made to increase the amount of training data, but they were unsuccessful.
A larger training set (in the hundreds of thousands) would likely improve the model's accuracy considerably.