Coursera Data Science Capstone Project

Mislav F.
23.09.2020

The objective of this capstone was to develop a web app that predicts the next word a user is likely to type, similar to the mobile keyboard applications developed by SwiftKey.
The project involved several tasks: (1) understanding the problem; (2) cleaning and exploring the text data used for model learning; (3) preprocessing the text data and extracting features; (4) training the NLP model; (5) developing a web application and writing the slides.
The data for model learning came from three files (Blogs, News and Twitter), collected by crawling blogs, news sites and tweets on the Internet. The data was cleaned, processed and tokenized, and n-gram features were created.
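
As a rough illustration of the cleaning and tokenization step, the sketch below lower-cases a line, drops digits and punctuation, and collapses whitespace. It is written in Python for illustration only; the function names `clean_text` and `tokenize` are hypothetical and are not taken from the project's code.

```python
import re
import string
from typing import List

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(line: str) -> str:
    """Lower-case a line and strip digits, punctuation and extra whitespace."""
    line = line.lower()                       # normalization: lower case
    line = re.sub(r"\d+", " ", line)          # remove numbers
    line = line.translate(_PUNCT_TO_SPACE)    # remove punctuation
    return re.sub(r"\s+", " ", line).strip()  # collapse whitespace


def tokenize(line: str) -> List[str]:
    """Split a cleaned line into word tokens (hypothetical helper)."""
    return clean_text(line).split()


if __name__ == "__main__":
    print(tokenize("Hello, World! 42 times."))
```

Replacing punctuation with a space (rather than deleting it) keeps words such as "well-known" from fusing together, at the cost of splitting contractions; either choice is an assumption here, not something the slides specify.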

First, the provided datasets were explored, cleaned and normalized. Cleaning removed punctuation, extra whitespace, numbers, etc., while normalization converted all words to lower case.
Second, n-gram features (i.e. bigrams and trigrams) were created to construct the predictive models.
Finally, a word frequency table was built from the n-gram features to predict the most likely next words in a text sequence, given the previous one or two words.
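
The idea of predicting from n-gram frequencies can be sketched as follows, again in Python for illustration. The sketch counts which word follows each one-word (bigram) or two-word (trigram) context and returns the most frequent followers. Falling back from the trigram table to the bigram table when a two-word context is unseen is an assumption of this sketch, and the names `build_ngram_table` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_table(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            table[context][tokens[i + n - 1]] += 1
    return table


def predict_next(words, trigrams, bigrams, top_k=3):
    """Return up to top_k likely next words given the previous one or two words."""
    # Try the two-word context first (trigram table).
    if len(words) >= 2:
        followers = trigrams.get(tuple(words[-2:]))
        if followers:
            return [w for w, _ in followers.most_common(top_k)]
    # Assumed fallback: use only the last word (bigram table).
    if words:
        followers = bigrams.get(tuple(words[-1:]))
        if followers:
            return [w for w, _ in followers.most_common(top_k)]
    return []


# Tiny made-up corpus of already tokenized sentences, for demonstration only.
sentences = [
    ["i", "love", "data", "science"],
    ["i", "love", "data", "analysis"],
    ["i", "love", "data", "science"],
    ["we", "love", "coffee"],
]
trigrams = build_ngram_table(sentences, 3)
bigrams = build_ngram_table(sentences, 2)

if __name__ == "__main__":
    print(predict_next(["love", "data"], trigrams, bigrams))
    print(predict_next(["they", "love"], trigrams, bigrams))
```

In this toy corpus the context "love data" is answered from the trigram table, while "they love" has no trigram match and is answered from the bigram counts for "love".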

Screenshot of the user interface, with a box for entering text and a list of the top three suggested words.