Coursera Data Science Capstone

Mahesh Divakaran

2022-08-22

1. Data Cleansing

library(tokenizers)
library(stopwords)
tokenize_words(<text>, stopwords=stopwords::stopwords("en"))

2. N-gram Dictionary

Extract 2-grams and 3-grams (stopwords are kept).

tokenize_ngrams(<text>, n_min=2, n=3)

To reduce the size of the N-gram dictionary, first calculate the frequency of each N-gram, then discard the least frequent ones (the long tail): for example, those that together cover only 10% of occurrences, or those that appear only once in the text corpus.

For example, the total count of 1-grams is around 540,000, but only about 6,000 words are needed to cover 90% of the occurrences.
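The pruning idea can be sketched in base R as follows. This is only an illustration on a toy token vector, not the code used for the capstone corpus; the variable names (`tokens`, `target`, `kept`) and the 50% threshold are assumptions chosen to keep the example small.

```r
# Toy tokens standing in for the real corpus (illustrative data only).
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran", "the")

# Count each distinct token and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)
words <- names(freq)
counts <- as.numeric(freq)

# Cumulative share of all occurrences covered by the top-k tokens.
coverage <- cumsum(counts) / sum(counts)

# Smallest k whose top-k tokens reach the target coverage
# (0.5 is a hypothetical threshold for this toy data; the text uses 90%).
target <- 0.5
k <- which(coverage >= target)[1]
kept <- words[seq_len(k)]

# Alternative rule from the text: drop entries that appear only once.
kept_non_singletons <- words[counts > 1]

print(kept)
print(kept_non_singletons)
```

The same steps apply to 2-grams and 3-grams once they are flattened into a single character vector, for example with `unlist()` on the tokenizer output.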

3. Exploratory Analysis

Using the Twitter text as an example:

knitr::include_graphics("download.png")

4. Shiny UI

The Shiny app uses a 3-gram dictionary (omitting 3-grams that appear only once in the text corpus). It matches the last two words of the input against the first two words of the dictionary entries to predict the third word. If no entry is found, it matches only the last word of the input instead. If there is still no match, it returns the most frequent 3-grams as the result.
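A minimal sketch of this back-off lookup in base R is shown below. It is not the app's actual code: the `trigrams` data frame, its column names, and the `predict_next()` function are hypothetical. The text does not say which dictionary word the single last word is matched against, so the sketch assumes it is the first word of a 3-gram and returns the word that follows.

```r
# Hypothetical 3-gram dictionary with frequencies (illustrative data only).
trigrams <- data.frame(
  w1   = c("i", "i", "thank", "want", "one"),
  w2   = c("want", "want", "you", "to", "of"),
  w3   = c("to", "a", "for", "go", "the"),
  freq = c(50, 20, 40, 30, 60),
  stringsAsFactors = FALSE
)

predict_next <- function(input, dict, top_n = 3) {
  # Lowercase and split on whitespace; a real app would reuse its tokenizer.
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  # Sort once so every lookup returns the most frequent matches first.
  dict <- dict[order(-dict$freq), ]

  # Step 1: last two input words against the first two dictionary words.
  if (n >= 2) {
    hits <- dict[dict$w1 == words[n - 1] & dict$w2 == words[n], ]
    if (nrow(hits) > 0) return(head(unique(hits$w3), top_n))
  }

  # Step 2 (assumption): last input word against the first dictionary word.
  if (n >= 1) {
    hits <- dict[dict$w1 == words[n], ]
    if (nrow(hits) > 0) return(head(unique(hits$w2), top_n))
  }

  # Step 3: nothing matched, so return the most frequent 3-grams themselves.
  top <- head(dict, top_n)
  paste(top$w1, top$w2, top$w3)
}

print(predict_next("I want", trigrams))
print(predict_next("please thank", trigrams))
print(predict_next("hello world", trigrams))
```

Sorting the dictionary by frequency up front keeps each step a plain filter, at the cost of re-sorting on every call; a production app would more likely store the dictionary pre-sorted.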

You can launch the app:

online at https://datascience9.shinyapps.io/capstone/, or locally by running the following code in RStudio:

library(shiny)
runGitHub("Text_Input_Prediction", "AUG22")

5. Statistics

Thank you!