Coursera Data Science Capstone Project

Nuno R
4/16/2020

Introduction (1/5)

  • The project uses a corpus of text from blogs, news, and Twitter tweets to predict the next word after the user enters a few letters or words at the prompt.

  • This kind of interface is well suited to mobile applications, where input options are limited.

Data (2/5)

  • The analysis uses only the English-language files from the full corpus.

  • These files are English US blogs (en_US.blogs.txt), English US news (en_US.news.txt), and English US tweets from Twitter (en_US.twitter.txt)
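
As an illustration of how a manageable sample could be drawn from files this large, here is a minimal Python sketch. It is not the project's code: the function name `sample_lines`, the use of reservoir sampling, the seed, and the assumption that the three files sit in the working directory are all hypothetical. Only the file names come from the slide above.

```python
import os
import random


def sample_lines(path, k=1000, seed=42):
    """Return up to k randomly chosen lines from a text file (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for index, line in enumerate(handle):
            line = line.strip()
            if index < k:
                # Fill the reservoir with the first k lines.
                sample.append(line)
            else:
                # Replace an existing entry with probability k / (index + 1).
                j = rng.randint(0, index)
                if j < k:
                    sample[j] = line
    return sample


# File names as listed on the slide; their location is an assumption.
CORPUS_FILES = ["en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"]

for name in CORPUS_FILES:
    if os.path.exists(name):  # skip files that are not in the working directory
        print(name, len(sample_lines(name)))
```

Reservoir sampling reads each file once and never holds more than `k` lines in memory, which matters when the source files are too large to load comfortably.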

Approach (3/5)

  • The application predicts the next possible word in a sentence based on the user input

  • The user enters text in an input box, and the application returns the most likely word to be used

  • The algorithm looks up the word in n-gram data frames, where “n” is the number of words in the gram. The user input is compared against the frequencies of 2-, 3-, and 4-word sequences
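
The lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the application's actual code (which is built on R data frames): the function names, the toy corpus, and the choice to fall back from the longest n-gram to shorter ones are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_ngram_tables(sentences, max_n=4):
    """Count, for each n in 2..max_n, which word follows each (n-1)-word prefix."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                prefix = tuple(words[i:i + n - 1])
                tables[n][prefix][words[i + n - 1]] += 1
    return tables


def predict_next_word(text, tables, max_n=4):
    """Return the most frequent follower of the longest matching prefix, or None."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):  # try 4-grams first, then 3-grams, then 2-grams
        if len(words) < n - 1:
            continue  # not enough input words for this n-gram size
        prefix = tuple(words[-(n - 1):])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Toy corpus standing in for the sampled blog, news, and Twitter lines.
toy_corpus = [
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the update",
    "looking for the best pizza",
]
tables = build_ngram_tables(toy_corpus)
print(predict_next_word("Thanks for the", tables))
print(predict_next_word("pay for the", tables))
```

On this toy corpus both calls return "follow". The first finds a match in the 4-gram table; the second has no 4-gram match for "pay for the" and falls back to the 3-gram prefix "for the".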

Shiny Application (4/5)

  • The main server code renders the UI and returns the analysis based on the options the user selects

  • Simply put, the user selects an option, and the server then calculates on the fly a model that identifies the next word.

Conclusion and Next Steps (5/5)

  • Processing and cleaning the data and researching n-grams are very time-consuming tasks

  • Testing, debugging, and re-running each file is also very time-consuming, even with a sample of 1,000 lines

  • Removing words during cleaning is a process that needs a lot of tweaking and there's a whole world of techniques out there to fine-tune NLP algorithms

  • The final algorithm might be more or less accurate depending on how well this first step of collecting, cleaning, and processing the data is done
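
To give a sense of the kind of cleaning step discussed above, here is a minimal Python sketch. The rules it applies (lowercasing, stripping URLs, digits, and punctuation) and the word list `WORDS_TO_REMOVE` are hypothetical; the slides do not say which words or patterns the project actually removed.

```python
import re

# Hypothetical list of words to drop; the source does not say which words were removed.
WORDS_TO_REMOVE = {"rt", "lol"}


def clean_line(line, words_to_remove=WORDS_TO_REMOVE):
    """Lowercase a line, strip URLs, digits, and punctuation, and drop unwanted words."""
    line = line.lower()
    line = re.sub(r"https?://\S+", " ", line)  # remove URLs
    line = re.sub(r"[^a-z' ]+", " ", line)     # keep only letters, apostrophes, spaces
    words = [w.strip("'") for w in line.split()]
    return [w for w in words if w and w not in words_to_remove]


print(clean_line("RT @user: Can't wait for 2020!! http://example.com lol"))
```

Each rule here changes which n-grams end up in the frequency tables, which is why small adjustments to the cleaning step can shift the predictions noticeably.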