Data Science Capstone Project

The goal of this project is to develop an application that predicts the next word, using the words the user has already typed as input.

Course Partners

Data Source

  • "Coursera-Swiftkey.zip": The data used for this project. It consists of three files containing unstructured data from blogs, news articles and tweets provided in 4 languages.

  • Because the data is large (the English files total 563 MB), I used only a small sample from each file and combined the samples to generate the n-gram models.

  • High-level workflow: [workflow diagram]
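
The sampling and n-gram generation step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the app: the function names `sample_lines` and `ngram_counts`, the sampling fraction, the seed, and the simple regex tokenizer are all assumptions made for the example, and the toy corpus stands in for the blog, news and tweet files.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Keep each line with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]


def ngram_counts(lines, n):
    """Count every run of n consecutive tokens across all lines."""
    counts = Counter()
    for line in lines:
        # Assumed tokenizer: lowercase, keep letters and apostrophes only.
        tokens = re.findall(r"[a-z']+", line.lower())
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy stand-in for the three source files (made-up text).
blogs = ["I love data science", "Data science is fun"]
news = ["I love data"]
tweets = ["love data science"]

# Sample each source separately, then combine the samples.
# A fraction of 1.0 keeps every line; a real run would use a small fraction.
combined = []
for source in (blogs, news, tweets):
    combined.extend(sample_lines(source, fraction=1.0))

bigrams = ngram_counts(combined, 2)
trigrams = ngram_counts(combined, 3)
quadgrams = ngram_counts(combined, 4)
```

Keeping the counts keyed by token tuples makes it straightforward to look up all n-grams that share a given context later on.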

How to access and use the Shiny App

  • Shiny app link: https://pradeepgurav.shinyapps.io/shinyapp/
  • The app first loads the n-gram files (bigram, trigram and quadgram) and then prompts the user to type input words.
  • Enter one or more words, and the app shows a list of recommendations for the next word.

Screenshot

Algorithm

  • When a single word is entered, the app looks it up in the bigram model and suggests the top n words with the highest probability (frequency). When two, three or more words are entered, the app also looks in the trigram and quadgram models. I combine the suggestions from all models into one list, placing the words recommended by higher-order models at the top, followed by the recommendations from lower-order models.
  • The predictions can be made more accurate and realistic by:
    1. Using larger samples to generate the n-grams
    2. Using higher-order n-grams as well
    3. Using domain-specific input files to generate the n-grams
    4. Saving the user's most frequently used or preferred phrases in local cookies
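
The lookup-and-combine logic described in the first bullet above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's actual implementation: the function name `suggest`, the `models` layout (n-gram order mapped to a frequency table), the per-model cutoff `top_n`, and the toy counts are all hypothetical.

```python
from collections import Counter


def suggest(words, models, top_n=3):
    """Return next-word suggestions, highest-order model first.

    `models` maps an n-gram order (2, 3, 4, ...) to a Counter of
    token tuples -> frequency. Assumed layout, for illustration only.
    """
    suggestions = []
    for n in sorted(models, reverse=True):  # quadgram, then trigram, then bigram
        context_len = n - 1
        if len(words) < context_len:
            continue  # not enough typed words to query this model
        context = tuple(words[len(words) - context_len:])
        # Collect every n-gram whose leading tokens match the typed context.
        candidates = Counter({
            gram[-1]: freq
            for gram, freq in models[n].items()
            if gram[:-1] == context
        })
        # Most frequent candidates first; skip words already suggested
        # by a higher-order model.
        for word, _ in candidates.most_common(top_n):
            if word not in suggestions:
                suggestions.append(word)
    return suggestions


# Toy frequency tables (made-up counts).
models = {
    2: Counter({("data", "science"): 5, ("data", "set"): 3, ("data", "is"): 2}),
    3: Counter({("love", "data", "analysis"): 2, ("love", "data", "science"): 1}),
    4: Counter({("i", "love", "data", "mining"): 1}),
}

one_word = suggest(["data"], models)
three_words = suggest(["i", "love", "data"], models)
```

With a single typed word only the bigram table is consulted. With three typed words the quadgram match comes first, then the trigram matches, then any bigram matches not already listed.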





    Thank you
    pradeep.gurav@gmail.com