June 18, 2017

Introduction

This is the Johns Hopkins Data Science Specialization Capstone Project presentation.

  • A word prediction application created using the SwiftKey datasets of Twitter, news, and blog natural language text.

  • Because the datasets are huge and memory and time are limited, the application uses a sample of the datasets for prediction.

  • The application uses an N-gram model from natural language processing to take the user's input and "predict" the next relevant word in the phrase.

  • The final prediction model uses the Stupid Backoff model.
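
As a rough illustration of how N-gram counts and a Stupid Backoff score can combine to pick the next word, here is a minimal base R sketch. It is not the application's actual code: the toy corpus, the helper names count_ngrams and predict_next, the maximum order of 3, and the backoff factor alpha = 0.4 are all assumptions made for the example.

```r
# Toy corpus standing in for the sampled SwiftKey text (illustrative only).
corpus <- c("i love new york", "i love new shoes",
            "i love you", "new york is big")

# Count all n-grams of order n; returns a named integer vector.
# (count_ngrams is a hypothetical helper, not part of the original app.)
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# counts[[1]] = unigrams, counts[[2]] = bigrams, counts[[3]] = trigrams.
counts <- lapply(1:3, function(n) count_ngrams(corpus, n))

# Stupid Backoff sketch: use the relative frequency at the highest order
# that matches the context; multiply by alpha each time we back off.
predict_next <- function(phrase, counts, alpha = 0.4) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  penalty <- 1
  for (n in seq(min(length(counts), length(words) + 1), 1)) {
    if (n == 1) {
      # No context matched: fall back to the most frequent single word.
      uni <- counts[[1]]
      return(list(word = names(uni)[which.max(uni)],
                  score = penalty * max(uni) / sum(uni)))
    }
    context <- paste(tail(words, n - 1), collapse = " ")
    tab <- counts[[n]]
    hits <- tab[startsWith(names(tab), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(list(word = sub(".* ", "", best),  # keep only the last word
                  score = penalty * max(hits) / counts[[n - 1]][[context]]))
    }
    penalty <- penalty * alpha
  }
}

predict_next("i love", counts)$word
predict_next("unseen love", counts)$score
```

The first call matches a trigram context directly. The second has no matching trigram, so it backs off to the bigram context "love" and its score is discounted once by alpha.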

The Application

The application is simple to use. The user enters text and clicks the "Go" button. The predicted next word appears on the right side of the screen, along with a word cloud of the words most frequently used in the dataset with the entered phrase (at most 50 words). The slider input sets the minimum frequency of the words shown in the word cloud. A snapshot is displayed below.

knitr::include_graphics("Capture.JPG")

Tackling memory and time issues

  • A small sample of the data, around 2%, was taken to solve the memory problem.
  • Because tokenizing the data is considerably slow even with a small sample, I created a module that prepares the data and saves it in RDS files.
  • The application reads from the RDS files to show the result to the user in a reasonable time.
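
The sample-then-cache approach described above could look roughly like the following base R sketch. The helper names sample_lines and load_or_build, the seed, the cache file name, and the generated stand-in lines are assumptions for illustration; the real module presumably reads the corpus files and builds N-gram tables instead.

```r
set.seed(2017)  # hypothetical seed, only to make the sample reproducible

# Stand-in for lines read with readLines() from one of the corpus files.
all_lines <- sprintf("example line number %d", 1:1000)

# Keep roughly 2% of the lines (fraction is an assumed parameter name).
sample_lines <- function(lines, fraction = 0.02) {
  size <- max(1, round(length(lines) * fraction))
  lines[sample.int(length(lines), size = size)]
}

# Build an expensive object once and cache it as an RDS file;
# later calls read the cached file instead of rebuilding.
load_or_build <- function(path, build) {
  if (file.exists(path)) return(readRDS(path))
  result <- build()
  saveRDS(result, path)
  result
}

# Hypothetical cache location; the app would use its own file names.
cache_file <- file.path(tempdir(), "unigram_counts.rds")

unigrams <- load_or_build(cache_file, function() {
  tokens <- unlist(strsplit(tolower(sample_lines(all_lines)), "\\s+"))
  sort(table(tokens), decreasing = TRUE)
})

head(unigrams, 3)
```

With this split, the slow tokenizing step runs only in the preparation module, and the application itself only pays the cost of readRDS at startup.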

References