Coursera Data Science Capstone Project

Nuno R
4/16/2020

The project uses a corpus of text from blogs, news articles, and Twitter tweets to predict the next word after the user enters a few letters or words at the prompt.
This kind of interface is well suited to mobile applications, where input options are limited.

The analysis uses only the English-language files from the full corpora.
These are the US English blogs (en_US.blogs.txt), US English news (en_US.news.txt), and US English tweets from Twitter (en_US.twitter.txt).

The application predicts the next likely word in a sentence based on the user's input.
The user enters text in an input box, and the application returns the word most likely to follow.
The algorithm looks up the word in n-gram data frames, where “n” is the number of words in the gram. The user's input is matched against the frequencies of 2-, 3-, and 4-word sequences.
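
The lookup described above can be illustrated with a minimal sketch. It is written in Python for illustration and is not the project's own code; the function names, the toy corpus, and the choice to try the longest context first and then fall back to shorter ones are all assumptions.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_n=4):
    """Count, for every context of 1 to max_n-1 words, the words that follow it."""
    tables = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):              # 2-, 3-, and 4-word sequences
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[context][words[i + n - 1]] += 1
    return tables

def predict_next(tables, text, max_n=4):
    """Return the most frequent follower of the longest matching context, or None."""
    words = text.lower().split()
    # Assumption: try the longest context first, then back off to shorter ones.
    for k in range(min(max_n - 1, len(words)), 0, -1):
        followers = tables.get(tuple(words[-k:]))
        if followers:
            return followers.most_common(1)[0][0]
    return None

# Toy corpus, invented for illustration only.
toy_corpus = [
    "thanks for the follow",
    "thanks for the help",
    "thanks for the follow back",
    "looking forward to the weekend",
]
tables = build_ngram_tables(toy_corpus)
print(predict_next(tables, "thanks for the"))
print(predict_next(tables, "forward to the"))
```

In this sketch a 4-word sequence supplies a 3-word context, so longer matches are preferred when they exist, and an input with no known context returns nothing rather than a guess.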

The main server code renders the UI and returns the analysis based on the user's options.
Simply put, the user is asked to select an option, and the server then calculates, on the fly, a model that identifies the next word.

Processing, cleaning, and researching n-grams is very time-consuming.
Testing, debugging, and re-running each file takes a long time, even with a sample of 1,000 lines.
Removing words during cleaning takes a lot of tweaking, and there is a whole world of techniques for fine-tuning NLP algorithms.
The accuracy of the final algorithm depends on this first step of collecting, cleaning, and processing the data for analysis.
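
A rough sketch of the sampling and cleaning step is shown here in Python, not as the project's actual pipeline. The word list in REMOVE_WORDS, the fixed random seed, and the specific cleaning rules (dropping URLs, @mentions, digits, and punctuation) are assumptions chosen for illustration.

```python
import random
import re

# Hypothetical list of words to remove; the real list needs tuning.
REMOVE_WORDS = {"rt", "lol"}

def sample_lines(lines, n=1000, seed=42):
    """Draw up to n lines at random; the fixed seed keeps re-runs comparable."""
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))

def clean_line(line, remove_words=REMOVE_WORDS):
    """Lowercase a line, strip noise, and return the remaining words."""
    line = line.lower()
    line = re.sub(r"https?://\S+", " ", line)   # drop URLs
    line = re.sub(r"@\w+", " ", line)           # drop @mentions
    line = re.sub(r"[^a-z' ]+", " ", line)      # keep letters, apostrophes, spaces
    words = [w.strip("'") for w in line.split()]
    return [w for w in words if w and w not in remove_words]

# Invented example lines, not taken from the corpus.
raw = [
    "RT @user: Check this out http://example.com LOL!!",
    "It's 5pm, can't wait for the weekend :)",
]
for line in sample_lines(raw, n=2):
    print(clean_line(line))
```

Each rule here changes which n-grams end up in the frequency tables, which is one reason the cleaning step takes so much tuning.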