Presentation Coursera Capstone

Suman
10/7/2017

  • This is the assignment for the Coursera Data Science Capstone project: predicting the next word a user will type.

  • The Capstone is a collaboration between Coursera and SwiftKey.

How it works

  • Solution approach: the solution is built in three parts (a to c) and used as described in d and e

    • a) Create n-gram dictionaries (n = 1 to 5) of words from Twitter, news and blogs, taking 50k from each source. Stopwords are removed.
    • b) Create a prediction function: when the user enters n-1 words, it searches the n-gram dictionary; if no match is found, it backs off to the (n-1)-gram dictionary, then the (n-2)-gram dictionary, and so on. The result is shown in the app.
    • c) Put the code and dictionaries on a Shiny server so that users can access the app from a website.
    • d) Once the user enters some text, stopwords are removed from it, the prediction function is called and the result is returned.
    • e) In the input box the user types part of a phrase and selects the number of predictions to be made (1 to 10); the output is shown below the input.
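Steps a, b and d can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not the app's actual code: the names `tokenize`, `build_dictionary` and `predict_next`, the tiny stopword list and the single combined count table (instead of one dictionary per n) are all hypothetical choices made for the example.

```r
# Hypothetical sketch: none of these names come from the app itself.

# A tiny stand-in stopword list; a real app would use a fuller one.
stopwords_small <- c("a", "an", "the", "is", "of", "to", "in", "and")

# Lowercase, replace non-letters with spaces, split on whitespace, drop stopwords.
tokenize <- function(text) {
  text <- gsub("[^a-z' ]", " ", tolower(text))
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stopwords_small]
}

# All n-grams of one token vector, stored as (prefix of n-1 words, next word).
ngrams_for <- function(w, n) {
  if (length(w) < n) return(NULL)
  starts <- seq_len(length(w) - n + 1)
  data.frame(
    prefix = vapply(starts, function(i) {
      paste(w[seq_len(n - 1) + i - 1], collapse = " ")
    }, character(1)),
    nextword = w[starts + n - 1],
    stringsAsFactors = FALSE
  )
}

# Step a: count every 1- to max_n-gram in the sampled lines.
build_dictionary <- function(lines, max_n = 5) {
  pieces <- list()
  for (line in lines) {
    w <- tokenize(line)
    for (n in seq_len(max_n)) {
      pieces[[length(pieces) + 1]] <- ngrams_for(w, n)
    }
  }
  grams <- do.call(rbind, pieces)
  grams$count <- 1L
  aggregate(count ~ prefix + nextword, data = grams, FUN = sum)
}

# Steps b and d: clean the input, try the longest context first, then back off.
predict_next <- function(dict, input, k = 3, max_n = 5) {
  w <- tokenize(input)
  for (m in min(length(w), max_n - 1):0) {
    prefix <- paste(tail(w, m), collapse = " ")
    hits <- dict[dict$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count, hits$nextword), ]
      return(head(hits$nextword, k))
    }
  }
  character(0)
}

# Toy corpus standing in for the 50k-per-source samples.
lines <- c("I love data science", "I love data analysis",
           "We love data science", "data science is fun")
dict <- build_dictionary(lines)

predict_next(dict, "love data", 2)  # "science" "analysis"
predict_next(dict, "big data", 1)   # backs off to the prefix "data": "science"
```

Keeping the counts in one table keyed by prefix keeps the sketch short; for 150k input entries a hashed or indexed structure would likely be needed for acceptable lookup speed.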

App and Code

setwd("C:\\Users\\suman\\Desktop\\datasciencecoursera\\capstone\\shiny")
# nextword <- predictnextword("hello good", 2)

Things to consider

  • Only a sample of about 150k raw entries (50k per source) is used, so many phrases will not be covered

  • The sample size was chosen so the data can be held and processed within 8 GB of RAM, balancing speed against dictionary size

  • The application may be a bit slow, so please be patient

  • Stopwords are removed, so users should not expect articles, prepositions or auxiliary verbs as predicted words


Improvement

  • Word stemming might be used to help ignore non-English words and very short tokens

  • The app could be integrated with Spark to process much larger data sets

  • News and blog data are probably more grammatically correct, so a larger share of the sample should be taken from those sources
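The last point could be tried with a per-source sampling step like the one below. This is a hedged sketch: `sample_sources`, the toy corpora and the sample sizes are hypothetical and only illustrate drawing more lines from news and blogs than from Twitter.

```r
# Hypothetical helper: draw a reproducible random sample from each source,
# with a separate sample size per source.
sample_sources <- function(sources, sizes, seed = 1) {
  set.seed(seed)
  out <- lapply(names(sizes), function(src) {
    lines <- sources[[src]]
    n <- min(sizes[[src]], length(lines))  # never ask for more than exists
    sample(lines, n)
  })
  names(out) <- names(sizes)
  out
}

# Toy stand-ins for the three corpora; the real files are far larger.
sources <- list(
  twitter = paste("tweet", 1:10),
  news    = paste("news line", 1:10),
  blogs   = paste("blog line", 1:10)
)

# Illustrative weights only: fewer tweets, more news and blog lines.
sizes <- c(twitter = 2, news = 5, blogs = 5)
picked <- sample_sources(sources, sizes)
lengths(picked)
```

Whether such a weighting actually improves predictions would have to be checked on held-out text.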