Data Science Capstone Presentation

Md Ahmed
12th February, 2018

This R-based presentation manifests next-word-prediction model with Shiny-application. The slidified visuals illustrate all the app relevant available resources, model algorithm steps and sequential functional processes with a data-flow-diagram.

Assignment Resources

  • Data was sourced from publicly available HC Corpora(English text from news, tweets and blogs totaled about 2.4 million records).
  • The prime objective of this 'app-project' is to allow any user to input a text or word into the application box, then app-algorithm will preidct the next most-likely user input choices.
  • A practical usable context for this app could be text messaging or searching on mobile devices.

Online sources to find Source-code and presentation

Prediction Algorithm progression

  • Combining all three data files and reduce it to 10% of total size as a randomly selective sample Corpus.
  • Cleansing the Corpus with 'tm' package as such removing punctuation,numbers, whitspace and more to select text only tokens.
  • Extracting out conditional frequencies of N-grams(Unigrams, Bigrams, Trigrams, Fourgrams and Fivegrams) from the cleansed Sample Corpus.
  • Computing the frequencies of each N-gram strings using maximum occurances count.
  • Start receiving user 'input-text' to begin the algorithmic process.
  • Using 'Markov model' assumpution with “katz's_backoff_model” as prediction strategy to display top 5-matching-predicted words.

Data flow algorithm visual

This diagram will display detail underlying linear steps of the 'app' progression. alt text

Learning and Constraints

The complexity of this project is rather challenging. I had to navigate through, critically analyze available online-sources(github codes) to figure out a solution. alt text

  • Sheer size of the combined dataset requires huge computational resources.
  • My test-data set needs more validation approach to evaluate it's true representation.
  • “missing value where TRUE/FALSE needed” ( needs code rewriting correction ).