Data Science Capstone Final Assignment

Jose Javier Saravia Mata
29/06/2020

Author LinkedIn: https://www.linkedin.com/in/jose-javier-saravia-mata-01435a41/

Model creation process

Take a sample of less than 1% of all .txt files (to avoid RAM problems)
Transform to a corpus, then remove punctuation and numbers
Remove profanity words
Create a tokenizer function
Create n-gram token datasets for n = 3:6
Bind all n-gram datasets and export to RDS
Preprocess the data with a custom script

readRDS("salida.rds")[1:3,]

            oracion          key2 freq2 pred2
1 happy mothers day happy mothers     6   day
2     new years eve     new years     5   eve
3   green cap kitty     green cap     4 kitty
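
The n-gram steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the script used for the app: the helper names (clean_text, make_ngrams, build_table) and the toy sentences are hypothetical, sampling and profanity filtering are left out, and the meaning of the columns (key2 as the n-gram without its last word, pred2 as that last word) is inferred from the sample rows shown above.

```r
# Hypothetical helper: lower-case the text and delete punctuation and digits.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)   # delete everything except letters and spaces
  x <- gsub(" +", " ", x)       # collapse repeated spaces
  trimws(x)
}

# Hypothetical helper: all n-grams of size n from one cleaned line.
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count the n-grams and split each one into key (all but last word)
# and prediction (last word), mirroring the columns of salida.rds.
build_table <- function(lines, sizes = 3:6) {
  lines <- clean_text(lines)
  grams <- unlist(lapply(sizes, function(n) {
    unlist(lapply(lines, make_ngrams, n = n))
  }))
  counts <- table(grams)
  oracion <- names(counts)
  out <- data.frame(
    oracion = oracion,
    key2    = sub(" [^ ]+$", "", oracion),  # drop the last word
    freq2   = as.integer(counts),
    pred2   = sub("^.* ", "", oracion),     # keep only the last word
    stringsAsFactors = FALSE
  )
  out <- out[order(-out$freq2), ]
  rownames(out) <- NULL
  out
}

# Toy input (invented for illustration), trigrams only.
toy <- c("Happy Mother's Day!", "happy mothers day to you", "New Year's Eve 2020")
model <- build_table(toy, sizes = 3)
print(model)
# saveRDS(model, "salida.rds")  # export step, commented out in this sketch
```

On a real sample, counting with table() over every n-gram can be slow, and a dedicated text-mining package would likely be a better fit.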

Prediction function explanation

Clean Input
Detect rows that contain the input
Filter rows with endsWith(input, row) == TRUE
Aggregate sum of freq by keys of each row
Calculate relative frequency for each result
Show up to 5 rows (ordered by relative frequency, descending)
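
The prediction steps above could look roughly like this in base R. It is a minimal sketch under assumptions, not the app's actual function: predict_next and the rel_freq column are hypothetical names, the "row" in the endsWith step is assumed to be the key2 column, and matching is done on whole words. The toy model reuses the three sample rows shown earlier plus one invented row ("happy mothers week") so that the relative frequencies are not trivial.

```r
# Toy model: three rows from the sample output plus one invented row.
model <- data.frame(
  oracion = c("happy mothers day", "new years eve",
              "green cap kitty", "happy mothers week"),
  key2    = c("happy mothers", "new years", "green cap", "happy mothers"),
  freq2   = c(6, 5, 4, 2),
  pred2   = c("day", "eve", "kitty", "week"),
  stringsAsFactors = FALSE
)

predict_next <- function(input, model, top = 5) {
  # Clean the input: lower-case, delete punctuation and digits, trim spaces.
  input <- tolower(input)
  input <- gsub("[^a-z ]", "", input)
  input <- trimws(gsub(" +", " ", input))

  empty <- data.frame(pred2 = character(0), freq2 = numeric(0),
                      rel_freq = numeric(0), stringsAsFactors = FALSE)
  if (!nzchar(input)) return(empty)

  # Keep rows whose key is a whole-word suffix of the input.
  # The leading space prevents "unhappy mothers" from matching "happy mothers".
  keep <- endsWith(paste0(" ", input), paste0(" ", model$key2))
  hits <- model[keep, , drop = FALSE]
  if (nrow(hits) == 0) return(empty)

  # Sum frequencies per predicted word, then compute relative frequency.
  agg <- aggregate(freq2 ~ pred2, data = hits, FUN = sum)
  agg$rel_freq <- agg$freq2 / sum(agg$freq2)

  # Order by relative frequency, descending, and keep at most `top` rows.
  agg <- agg[order(-agg$rel_freq), ]
  rownames(agg) <- NULL
  head(agg, top)
}

res <- predict_next("Happy Mother's", model)
print(res)
```

With the toy model, the input "Happy Mother's" is cleaned to "happy mothers", matches two rows, and yields "day" (6 of 8) ahead of "week" (2 of 8). An input with no matching key returns an empty data frame.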

App Explanation

Enter a sentence to analyze, then click “Go”
Remember that the model is trained on 1% of the .txt files, so it may not recognize some input sentences
The app shows up to 5 answers, ordered by relative frequency (descending)
To explore the model's RDS dataset, use the slider to select the rows to print

App available at: https://licjaviersaravia.shinyapps.io/CourseraCapstone/

Resources

Shiny app: https://licjaviersaravia.shinyapps.io/CourseraCapstone/
Files: https://drive.google.com/drive/folders/1QxrrW6l-Ndwj2Ok-h6C_v7HoSGV9RjEG?usp=sharing