Coursera Data Science - Capstone Project

Gustavo Seifer
08.August.2021

Capstone Project Details

Background and rationale

  • Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities.

  • The main objective of this project was to develop a text prediction model using Natural Language Processing (NLP).

APP

Through a simple user interface, the App predicts the next word.

Main tasks

  • Understanding the problem

  • Data acquisition and cleaning

  • Exploratory analysis

  • Statistical modeling & Predictive modeling

  • Creating a data product (Shiny App)

  • Creating a short slide deck pitching the product

Brief Description of the Methodology for developing the App

The text from different sources (News, Blogs, Twitter) was analyzed, cleaned and transformed into a tidy format through tokenization (one word per row, each word being a token).
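As an illustration of the tidy, one-token-per-row format, here is a minimal Python sketch. It is not the App's actual code: the function names, the simple regular-expression tokenizer and the two sample lines are assumptions made for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def to_tidy_rows(documents):
    """Turn (source, text) pairs into one (source, token) row per word."""
    return [(source, token) for source, text in documents for token in tokenize(text)]

# Hypothetical sample lines standing in for the News, Blogs and Twitter files.
documents = [
    ("news", "The market opened higher today."),
    ("twitter", "Can't wait for the weekend!"),
]
rows = to_tidy_rows(documents)
for row in rows:
    print(row)
```

Each printed row pairs a source with a single token, which is the shape the later counting and n-gram steps work on.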

The words were randomly sampled to reduce computation time.
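Random sampling of this kind could look like the sketch below. The 10% fraction, the fixed seed and the helper name `sample_fraction` are assumptions chosen for illustration; the sampling rate actually used for the App is not stated here.

```python
import random

def sample_fraction(items, fraction, seed=42):
    """Return a random subset containing roughly `fraction` of the items."""
    if not items:
        return []
    rng = random.Random(seed)                 # fixed seed keeps the sample reproducible
    k = max(1, int(len(items) * fraction))    # always keep at least one item
    return rng.sample(items, k)

# Hypothetical token list standing in for the full corpus.
tokens = [f"word{i}" for i in range(1000)]
subset = sample_fraction(tokens, 0.10)
print(len(subset), "of", len(tokens), "tokens kept")
```

A fixed seed is a design choice worth keeping in a project like this, since it makes the exploratory analysis and the model repeatable.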

The words were filtered to eliminate stop words and words without meaning.
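A filtering step along these lines can be sketched as follows. The short stop-word set and the rule for "words without meaning" (single characters and tokens containing anything other than letters) are assumptions for this example, not the App's actual criteria.

```python
# Hypothetical, deliberately short stop-word list; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}

def is_meaningful(token):
    """Assumed rule: keep only purely alphabetic tokens longer than one character."""
    return token.isalpha() and len(token) > 1

def filter_tokens(tokens):
    """Drop stop words and tokens that fail the meaningfulness rule."""
    return [t for t in tokens if t not in STOP_WORDS and is_meaningful(t)]

tokens = ["the", "market", "opened", "4", "x", "higher", "in", "2021"]
print(filter_tokens(tokens))
```

Note that this simple rule also drops contractions such as "can't", so a real pipeline would need a more careful definition.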

The words were counted globally and by source.
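Counting globally and by source can be sketched with Python's standard `collections.Counter`. The rows below are made-up (source, token) pairs used only to show the two kinds of count.

```python
from collections import Counter

# Hypothetical tidy rows: one (source, token) pair per word.
rows = [
    ("news", "market"),
    ("news", "rates"),
    ("blogs", "market"),
    ("twitter", "weekend"),
    ("twitter", "market"),
]

# Global count: how often each token appears across all sources.
global_counts = Counter(token for _, token in rows)

# Per-source count: how often each token appears within each source.
by_source = Counter(rows)

print(global_counts.most_common(3))
print(by_source[("twitter", "market")])
```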

n-grams (bi-grams and tri-grams) were generated. Based on the n-grams, a predictive model was developed.
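To make the n-gram step concrete, the sketch below builds bi-gram and tri-gram frequency tables from a token list. The helper name `ngrams` and the sample sentence are assumptions for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text, already cleaned and tokenized.
tokens = "thanks for the follow and thanks for the support".split()

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```

The frequency tables are the raw material for prediction: the more often a word follows a given context, the more likely it is to be suggested.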

The next word is predicted quickly after the user enters text.
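One common way to turn n-gram counts into a prediction is a simple back-off lookup: try the last two words against the tri-grams, and fall back to the last word against the bi-grams. The sketch below assumes that strategy; it is an illustration, not necessarily how the App chooses its prediction, and all names and sample text are hypothetical.

```python
from collections import Counter, defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(text, trigram_model, bigram_model):
    """Return the most frequent next word, backing off from tri-grams to bi-grams."""
    words = text.lower().split()
    for model, size in ((trigram_model, 2), (bigram_model, 1)):
        context = tuple(words[-size:])
        if len(context) == size and context in model:
            return model[context].most_common(1)[0][0]
    return None  # no matching context in either model

# Hypothetical training text, already cleaned and tokenized.
corpus = "thanks for the follow and thanks for the support and thanks for the follow".split()
trigram_model = build_model(corpus, 3)
bigram_model = build_model(corpus, 2)

print(predict_next("Thanks for the", trigram_model, bigram_model))
print(predict_next("hello world", trigram_model, bigram_model))
```

Because prediction is a dictionary lookup on precomputed counts, the response time stays low even as the tables grow, which fits the App's goal of a fast suggestion.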

Advantages, disadvantages and next steps

This is a first mockup that can be improved and fed with more sources to increase its predictive power.

Main advantages

  1. Fast
  2. Intuitive and easy to use
  3. Text prediction is a growing field

Main disadvantages

  1. The App only covers the English language
  2. Depending on the machine, the text database could be burdensome to manipulate

Next steps: extend the App to other languages and to other operating systems.

App and References

ShinyApp

https://gus079.shinyapps.io/shiny_app/

R for Data Science

https://r4ds.had.co.nz/

Text Mining with R

https://www.tidytextmining.com/index.html

Supervised Machine Learning for Text Analysis in R

https://smltar.com/