Coursera: JHU Data Science Specialization - Capstone Project

Swift Keyboard

Shovit Bhari
2020-08-14

Introduction

  • This word prediction app is the final assignment of the tenth course (Data Science Capstone) in the Coursera Data Science Specialization.
  • The course focuses on analyzing a large corpus of text documents to discover structure in the data and how words are put together, with the goal of building a predictive model.
  • An n-gram language model was used to build a smart keyboard that predicts the next word from the words already typed.

Roadmap to the Model

  • Getting and cleaning the data:
    – All the provided corpora were combined into one
    – 25% of the corpus was selected for training a model
  • Exploratory Data Analysis:
    – Frequencies of individual words and of word pairs were calculated
  • Modeling:
    – The quanteda package was used to tokenize the corpus
    – 1-gram through 7-gram models were built for word prediction
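
The sampling, tokenizing, and n-gram counting steps above can be sketched roughly as follows. This is a minimal Python illustration, not the author's R/quanteda code: the function names, the regex tokenizer, the random seed, and the toy corpus are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.25, seed=42):
    """Return a random subset of lines (the slides mention 25% for training)."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)


def tokenize(text):
    """Lowercase and keep runs of letters/apostrophes.

    A simple stand-in for a real tokenizer such as quanteda's.
    """
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(lines, max_n=7):
    """Map each order n (1..max_n) to a Counter of n-gram tuples."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        toks = tokenize(line)
        for n in range(1, max_n + 1):
            # Slide a window of width n across the tokens of this line.
            for i in range(len(toks) - n + 1):
                counts[n][tuple(toks[i:i + n])] += 1
    return counts


# Toy corpus (hypothetical data, only to exercise the functions).
corpus = [
    "The cat sat on the mat.",
    "The cat ate the fish.",
    "A dog sat on the rug.",
    "The dog ate.",
]
train = sample_lines(corpus, fraction=0.5)
counts = count_ngrams(corpus, max_n=3)
print(counts[2].most_common(3))
```

Counting within each line, rather than across line boundaries, keeps n-grams from spanning unrelated documents. Whether the original app did the same is not stated in the slides.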

Algorithm and Prediction

  • To improve efficiency, word pairs that appear fewer than 5 times in the corpus were removed
  • Katz's back-off model was used to predict the next word
  • The model iterates from the 7-gram down to the 1-gram, matching the last (n-1) words of the input against each n-gram table
  • It starts with the 7-gram and backs off to the 6-gram if there is no prediction
  • It continues backing off until it reaches the 1-gram
  • When the user input is empty, the most frequent word, 'the', is returned (the default number of predictions is 1)
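
The pruning and back-off steps above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the app's actual code: it ranks candidates by raw counts and omits the discounting and back-off weights of a full Katz model. The names `prune`, `build_index`, and `predict`, and the toy counts, are hypothetical.

```python
from collections import Counter, defaultdict


def prune(counts, min_count=5):
    """Drop n-grams seen fewer than min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


def build_index(counts):
    """Map each context (all words but the last) to a Counter of next words.

    Unigrams are stored under the empty context ().
    """
    index = defaultdict(Counter)
    for gram, c in counts.items():
        index[gram[:-1]][gram[-1]] = c
    return index


def predict(index, words, max_n=7, k=1, default="the"):
    """Back off from the longest usable context to shorter ones."""
    words = [w.lower() for w in words]
    # An n-gram lookup uses the last (n-1) words as its context.
    for n in range(min(max_n, len(words) + 1), 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        if context in index:
            return [w for w, _ in index[context].most_common(k)]
    return [default]  # only reached if the index has no unigrams at all


# Toy counts (hypothetical data); n-grams of several orders share one Counter.
raw = Counter({
    ("the",): 9, ("of",): 6, ("cat",): 5,
    ("of", "the"): 7, ("the", "cat"): 5, ("the", "dog"): 2,
    ("of", "the", "cat"): 6, ("of", "the", "dog"): 5,
})
pruned = prune(raw, min_count=5)
index = build_index(pruned)

print(predict(index, ["one", "of", "the"], max_n=3))
print(predict(index, ["saw", "the"], max_n=3))
print(predict(index, []))
```

With an empty input only the unigram table is consulted, so the most frequent word is returned. In the toy data that word is 'the', matching the behavior described above.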

The Shiny App

Here is a link to the application, which provides all the necessary instructions: Shiny

GitHub

References