Data Science Capstone, JHU

Himank Jain
October 5, 2019


Johns Hopkins University

SwiftKey

Coursera

About The Project

  • This project applies knowledge gained from the 9 courses of the Data Science Specialization offered by Johns Hopkins University.
  • The final application is built as a Shiny app, using RStudio's Shiny framework.
  • It predicts the next word in a sentence typed by the user.

The Specialization Course can be found at JHU Data Science

The Word Prediction App can be found at Word Prediction App

The Source Code and Project files can be found on GitHub

Data:

  • The data is provided by SwiftKey.
  • It comes from the HC Corpora corpus.
  • It can be downloaded from here
  • The Data comes in 4 different languages.
  • The three sources of data used in the analysis are blogs, news, and twitter.

  • Because of the large file sizes and limited computing resources, only a 20% sample from each source was used for modelling.
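
The 20% sampling step can be sketched as follows. This is a minimal illustration in Python, not the project's own code: the function name `sample_lines`, the fixed seed, and the in-memory stand-in lists are all assumptions made for the example. A real run would read the blogs, news, and twitter files instead.

```python
import random


def sample_lines(lines, fraction=0.2, seed=42):
    """Return a reproducible random sample holding `fraction` of `lines`."""
    rng = random.Random(seed)          # fixed seed so the sample can be recreated
    k = round(len(lines) * fraction)   # number of lines to keep
    return rng.sample(lines, k)


# Hypothetical stand-ins for the three source files.
sources = {
    "blogs": [f"blog line {i}" for i in range(1000)],
    "news": [f"news line {i}" for i in range(500)],
    "twitter": [f"tweet {i}" for i in range(2000)],
}

# Sample each source separately so every source contributes 20% of its lines.
samples = {name: sample_lines(lines) for name, lines in sources.items()}

for name, sample in samples.items():
    print(name, len(sample), "of", len(sources[name]))
```

Sampling each source on its own keeps the blogs, news, and twitter proportions of the training data the same as in the full corpus.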

About The Model

  • This is a prototype text-processing application that predicts the next word in a sentence using Katz's n-gram back-off model.
  • It is part of the Johns Hopkins University Data Science capstone project on Coursera, in collaboration with SwiftKey.
  • Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. When a longer n-gram has not been observed often enough, the model backs off to a shorter history and assigns it a discounted share of the probability.
  • Find out more about Katz's back-off model here
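
The back-off idea can be illustrated with a small bigram-to-unigram sketch in Python. It is a simplified illustration under stated assumptions, not the application's implementation: it uses a fixed absolute discount (`DISCOUNT = 0.5`, an assumed value) in place of the Good-Turing discounting of Katz's original formulation, and the names `train`, `backoff_prob`, and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

DISCOUNT = 0.5  # assumed fixed discount; Katz's original method uses Good-Turing


def train(tokens):
    """Count unigrams and, for each word, the words that follow it."""
    unigrams = Counter(tokens)
    followers = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        followers[prev][word] += 1
    return unigrams, followers


def backoff_prob(word, prev, unigrams, followers, discount=DISCOUNT):
    """Estimate P(word | prev), backing off to unigrams for unseen bigrams."""
    total = sum(unigrams.values())
    seen = followers.get(prev)
    if not seen:
        # The context never appeared as a history: use the unigram estimate.
        return unigrams[word] / total
    context_count = sum(seen.values())
    if word in seen:
        # Seen bigram: discounted relative frequency.
        return (seen[word] - discount) / context_count
    # Probability mass freed by discounting the seen bigrams.
    leftover = discount * len(seen) / context_count
    # Unigram mass of the words never seen after `prev`.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    # Share the leftover mass in proportion to unigram probability.
    return leftover * (unigrams[word] / total) / unseen_mass


def predict_next(prev, unigrams, followers, k=3):
    """Return the k most probable next words after `prev`."""
    ranked = sorted(
        unigrams,
        key=lambda w: (-backoff_prob(w, prev, unigrams, followers), w),
    )
    return ranked[:k]


tokens = "the cat sat on the mat the cat ran".split()
unigrams, followers = train(tokens)
print(predict_next("the", unigrams, followers))
```

In this toy corpus "cat" follows "the" twice out of three times, so it ranks first. Words never seen after "the" still receive a small probability from the unigram level, so the probabilities for any given history sum to one.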

Application Overview

  • The Application can be found here
  • It uses an n-gram model to predict the next word in a sentence entered by the user. The predictions are based on the SwiftKey HC Corpora datasets.
  • Instructions for using the application are on the application page here
  • To minimize the size and runtime of the model, some accuracy had to be sacrificed.