Data Science Capstone Final Assignment

Jose Javier Saravia Mata
29/06/2020

Model creation process

  • Take a sample of less than 1% of all .txt files (to avoid RAM problems)
  • Convert the sample to a corpus, then remove punctuation and numbers
  • Remove profanity
  • Create a tokenizer function
  • Create n-gram token datasets for n = 3 to 6
  • Bind all n-gram datasets and export the result to RDS
  • Preprocess the data with a custom script
readRDS("salida.rds")[1:3,]
            oracion          key2 freq2 pred2
1 happy mothers day happy mothers     6   day
2     new years eve     new years     5   eve
3   green cap kitty     green cap     4 kitty
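
The steps above can be sketched in base R as follows. This is only an illustration, not the script used for the model: the function names `clean_text`, `ngrams`, and `build_table` are hypothetical, the corpus and profanity-filter steps are left out, and the split of each n-gram into key (all words but the last) and prediction (the last word) is an assumption inferred from the columns shown above.

```r
# Hypothetical helpers; base R only, no corpus package assumed.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", "", x)   # drop punctuation, digits, and other non-letters
  x <- gsub(" +", " ", x)       # collapse repeated spaces
  trimws(x)
}

# Return every run of n consecutive words in one sentence.
ngrams <- function(sentence, n) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count all 3- to 6-grams and split each into key (prefix) and prediction (last word).
build_table <- function(lines, sizes = 3:6) {
  lines <- clean_text(lines)
  grams <- as.character(unlist(lapply(sizes, function(n) {
    unlist(lapply(lines, ngrams, n = n))
  })))
  counts <- table(grams)
  oracion <- names(counts)
  data.frame(
    oracion = oracion,
    key2    = sub(" [^ ]+$", "", oracion),  # everything except the last word
    freq2   = as.integer(counts),
    pred2   = sub("^.* ", "", oracion),     # the last word only
    stringsAsFactors = FALSE
  )
}

# Made-up input lines, for illustration only.
tab <- build_table(c("Happy Mother's Day!", "happy mothers day 2020"))
print(tab)
# saveRDS(tab, "salida.rds")  # export step, commented out in this sketch
```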

Prediction function explanation

  • Clean the input
  • Detect rows that contain the input
  • Keep rows where endsWith(input, row) == TRUE
  • Aggregate the sum of freq by the keys of each row
  • Calculate the relative frequency of each result
  • Show up to 5 rows (ordered by relative frequency, descending)
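
A minimal sketch of these steps in base R, under stated assumptions: `predict_next` is a hypothetical name, the model is a data frame with the `key2`, `freq2`, and `pred2` columns shown earlier, and matching is done by checking that the cleaned input ends with a row's key on a word boundary. The actual function in the app may differ.

```r
# Hypothetical prediction function over a table with key2, freq2, pred2 columns.
predict_next <- function(input, model, top = 5) {
  # Clean the input the same way the training text was cleaned (assumed).
  input <- tolower(input)
  input <- gsub("[^a-z ]", "", input)
  input <- trimws(gsub(" +", " ", input))

  # Keep rows whose key is a suffix of the input; the leading space
  # forces the match to start at a word boundary.
  hit <- endsWith(paste0(" ", input), paste0(" ", model$key2))
  if (!any(hit)) {
    return(data.frame(pred2 = character(0), freq2 = numeric(0),
                      rel_freq = numeric(0)))
  }

  # Sum frequencies per predicted word, then convert to relative frequency.
  agg <- aggregate(freq2 ~ pred2, data = model[hit, ], FUN = sum)
  agg$rel_freq <- agg$freq2 / sum(agg$freq2)

  # Highest relative frequency first, at most `top` rows.
  agg <- agg[order(-agg$rel_freq), ]
  head(agg, top)
}

# Small model built from the three rows printed earlier.
model <- data.frame(
  oracion = c("happy mothers day", "new years eve", "green cap kitty"),
  key2    = c("happy mothers", "new years", "green cap"),
  freq2   = c(6, 5, 4),
  pred2   = c("day", "eve", "kitty"),
  stringsAsFactors = FALSE
)
res <- predict_next("I wish you a Happy Mothers", model)
print(res)
```

Padding both strings with a leading space keeps a key such as "new years" from matching an input ending in "renew years".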

App Explanation

  • Enter a sentence to analyze, then click “Go”
  • Remember that the model is trained on 1% of the .txt files, so it may not recognize some input sentences
  • The app shows up to 5 answers, ordered by relative frequency (descending)
  • To explore the model's RDS dataset, use the slider to select which rows to print

App available at: https://licjaviersaravia.shinyapps.io/CourseraCapstone/

Resources