10/5/2020

Introduction

This is the final course of the Data Science Specialization in R. It combines the knowledge and skills learned throughout the specialization: understanding data science, installing R and RStudio, loading, subsetting, wrangling, and exploring data, applying statistical inference, preprocessing data, and training and testing models with an appropriate algorithm.

The capstone is a partnership between Johns Hopkins University and SwiftKey. I used the SwiftKey keyboard in 2013-2015 and was amazed by its innovation in digital keyboards: you slide your finger across the keyboard without lifting it, and it predicts the word with high accuracy. This course provides a basic blueprint for building word prediction technology. The goal is to learn the basics and build upon that knowledge.

Factors in EDA and NLP Research

  • Local PC vs. Cloud - What are the benefits?

  • R NLP packages and Python NLP?

    • tm
    • spacyr/spaCy
    • quanteda
    • readtext
    • openNLP
    • RWeka
    • NLP
    • coreNLP
    • tidytext
    • wordcloud2
    • SnowballC
    • Many more were tried
  • Modeling - Speed vs. Accuracy

Modeling

word                                  | predict    | freq
let us know if you have               | any        | 5
dc in next dc dc in                   | next       | 5
in next dc dc in next                 | dc         | 5
a square foot home with pool          | originally | 5
square foot home with pool originally | built      | 5
foot home with pool originally built  | in         | 5
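
The table above pairs a word context with a predicted next word and a frequency count. The slides do not say how the table is built or queried, so the following is only a minimal sketch of one common approach: count n-grams, then look up the longest matching context and back off to shorter ones. It is written in Python rather than R, and the function names `build_ngram_table` and `predict_next`, the toy corpus, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_ngram_table(sentences, n):
    """Map each (n-1)-word context to a Counter of the words that follow it.

    Hypothetical helper: assumes n >= 2 and simple whitespace tokenization.
    """
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            context = " ".join(tokens[i:i + n - 1])
            next_word = tokens[i + n - 1]
            table[context][next_word] += 1
    return table


def predict_next(tables, text):
    """Return the most frequent next word, trying the longest context first.

    `tables` maps n -> n-gram table. Returns None when no context matches.
    """
    tokens = text.lower().split()
    for n in sorted(tables, reverse=True):
        if len(tokens) < n - 1:
            continue  # not enough typed words for this context length
        context = " ".join(tokens[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None


# Toy corpus (made up for this sketch, not the capstone data).
corpus = [
    "let us know if you have any questions",
    "let us know if you have any feedback",
    "let us know if you need help",
]
tables = {n: build_ngram_table(corpus, n) for n in (2, 3, 4)}

print(predict_next(tables, "if you have"))    # matched with the 4-gram table
print(predict_next(tables, "please let us"))  # backs off to the 3-gram table
```

Longer contexts tend to give more specific predictions but make the tables larger and slower to search, which is one concrete form of the speed vs. accuracy trade-off listed earlier.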

Conclusion

  • A great case study for starting to build NLP knowledge, and a foundation to build on.
  • I am now more aware of news and technology in the NLP space, such as BERT and GPT-3.
  • It leaves open questions about how to optimize modeling for better accuracy and speed.
  • NLP has plenty of use cases, such as sentiment analysis for stock analysis and product feature analysis based on product reviews.
  • There is a huge opportunity in the NLP space alone.
  • Next Word Seer