Presentation Coursera Capstone

Suman
10/7/2017

This is the assignment for the Coursera Data Science Capstone project: predicting the next word from the words a user has typed.
The Capstone is a cooperation between Coursera and the company SwiftKey.

How it works

Solution approach: the solution has the following parts
- a) Create n-gram dictionaries (n = 1 to 5) of words from Twitter, news and blogs, taking a 50k sample from each. Stopwords are removed.
- b) Create a prediction function that, when the user enters n-1 words, searches the n-gram dictionary; if no match is found there, it searches the (n-1)-gram dictionary, and so on. The result is shown in the app.
- c) Put the code and dictionaries on a Shiny server so that users can access the app from a website.
- d) Once the user enters some text, stopwords are removed from it and the prediction function is called to produce the reply.
- e) In the input box, the user types part of a phrase and selects the number of predictions to be made (1 to 10); the output is shown below the input.
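
The dictionary and back-off steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code from the repository: the function names `tokenize`, `count_ngrams` and `predict_next`, the tiny stopword list and the four-line corpus are all hypothetical, and the sketch builds n-grams only up to order 3 instead of 5 to stay short.

```r
# Hypothetical, very small stopword list (the real app presumably uses a fuller one).
stopwords_small <- c("a", "an", "the", "of", "to", "in", "is", "and")

# Lowercase the text, keep letters only, split on spaces, drop stopwords.
tokenize <- function(text, stopwords = stopwords_small) {
  text <- gsub("[^a-z ]", " ", tolower(text))
  words <- strsplit(text, " +")[[1]]
  words <- words[nzchar(words)]
  words[!(words %in% stopwords)]
}

# Count all n-grams of order n; returns a named integer vector,
# most frequent n-gram first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) return(setNames(integer(0), character(0)))
  tab <- table(grams)
  sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)
}

# Back-off prediction: try the highest usable order first, then lower orders.
# dicts[[n]] must hold the counts of the n-grams of order n.
predict_next <- function(input, dicts, k = 1) {
  w <- tokenize(input)
  for (n in seq(min(length(dicts), length(w) + 1), 1)) {
    counts <- dicts[[n]]
    if (n == 1) {
      hits <- counts  # no context left: fall back to the most frequent words
    } else {
      prefix <- paste0(paste(tail(w, n - 1), collapse = " "), " ")
      hits <- counts[startsWith(names(counts), prefix)]
    }
    if (length(hits) > 0) {
      top <- names(hits)[seq_len(min(k, length(hits)))]
      return(sub(".* ", "", top))  # keep only the last word of each n-gram
    }
  }
  character(0)
}

# Made-up corpus, for illustration only.
corpus <- c("data science is fun",
            "data science is hard",
            "data science is fun and useful",
            "machine learning is fun")
dicts <- lapply(1:3, function(n) count_ngrams(corpus, n))

predict_next("data science", dicts, 2)
predict_next("deep learning", dicts, 1)
```

In the second call no trigram starts with "deep learning", so the function backs off to the bigram dictionary and matches on "learning" alone.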

App and Code

Code: The code is available at https://github.com/suman12345678/datasciencecapstone
App: The application can be accessed at https://suman123456.shinyapps.io/shiny/
Presentation: http://rpubs.com/suman12345678/capstoneproj

setwd("C:\\Users\\suman\\Desktop\\datasciencecoursera\\capstone\\shiny")
#nextword<-predictnextword("hello good",2)

Things to consider

The dictionary is built from a sample of only 0.15 million raw entries, so many phrases may be missing.
The sample size is limited so that 8 GB of RAM can hold and process the data, balancing speed against dictionary size.
The application might be a bit slow, so please be patient.
Stopwords are removed, so users should not expect articles, prepositions or auxiliary verbs as predicted words.

Improvement

Word stemming could perhaps be used to ignore non-English words and very short words.
The solution could be integrated with Spark to process much larger data.
News and blogs data are probably more grammatically correct, so a larger sample should be taken from those sources.
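
The last point, taking more lines from news and blogs than from Twitter, might look like the sketch below. The per-source sizes, the helper `draw` and the stand-in data are assumptions made for illustration; the real corpora would be read from the three source files.

```r
# Stand-in data in place of the real Twitter, news and blogs files.
sources <- list(
  twitter = paste("tweet", 1:1000),
  news    = paste("news line", 1:1000),
  blogs   = paste("blog line", 1:1000)
)

# Hypothetical sample sizes that favour news and blogs over Twitter.
sizes <- c(twitter = 100, news = 300, blogs = 300)

# Draw `size` lines without replacement (or all lines if fewer are available).
draw <- function(lines, size) {
  lines[sample.int(length(lines), min(size, length(lines)))]
}

set.seed(2017)  # fixed seed so the sample is reproducible
training <- unlist(Map(draw, sources[names(sizes)], sizes), use.names = FALSE)
length(training)
```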