Data Science Capstone Presentation

Jose Ignacio Gavara
31 January 2021

Introduction

This presentation is part of the Final Project of the Coursera Data Science Capstone course. The final project consists of two parts:

This presentation
An app hosted on shinyapps.io

The purpose of the app is to predict the next word of a text that the user has partly entered. The app works with data from the files contained in Swiftkey.zip, which is larger than 0.5 GB.

Methodology used

A 1% sample of the files has been taken, because the full files are very large and would slow down computation considerably. The sample is combined into a single file (the corpus), which is then divided into three files. These three files make up the database from which the app predicts the next word after the words entered in the input box.
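
As an illustration only, the sampling step could look like the Python sketch below. It assumes the sample is drawn line by line with a fixed random seed. The function names (sample_lines, build_corpus) and the seed value are hypothetical and are not taken from the app itself.

```python
import random


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    Assumption: sampling is done per line; the seed only makes runs repeatable.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def build_corpus(sources, fraction=0.01, seed=42):
    """Sample every source (an iterable of lines) and merge the results into one list."""
    corpus = []
    for offset, lines in enumerate(sources):
        # A different seed per source avoids reusing the same random sequence.
        corpus.extend(sample_lines(lines, fraction, seed + offset))
    return corpus
```

In practice each source would be an open text file, and the resulting corpus would be written to disk before the n-gram files are generated.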

The app uses an algorithm based on n-grams. An n-gram is a sequence of n items, in this case words. From the corpus, three files have been generated, one for each n-gram size (two, three and four words), to make the prediction more efficient. Depending on the number of words entered by the user, the algorithm starts searching in one file or another and, if it finds no result there, continues through the remaining files in descending order of n-gram size.
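
The search in descending order of n-gram size can be sketched roughly as follows. This is a minimal Python illustration under assumptions, not the app's actual code: the n-gram files are represented here as in-memory tables, the most frequent continuation is returned, and the names build_ngram_table and backoff_lookup are hypothetical.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table


def backoff_lookup(words, tables):
    """Search the largest usable n-gram table first, then smaller ones.

    `tables` maps n (2, 3, 4 in this sketch) to the table for that n.
    Returns the most frequent continuation, or None if nothing matches.
    """
    for n in sorted(tables, reverse=True):
        prefix_len = n - 1
        if len(words) < prefix_len:
            continue  # too few input words for this n-gram size
        prefix = tuple(words[-prefix_len:])
        followers = tables[n].get(prefix)
        if followers:
            return followers.most_common(1)[0][0]
    return None


# Usage with a toy corpus (invented for illustration)
tokens = "i like green tea and i like green apples and i like green tea".split()
tables = {n: build_ngram_table(tokens, n) for n in (2, 3, 4)}
result = backoff_lookup(["i", "like", "green"], tables)
```

With three input words the search starts in the four-word table; with one word it goes straight to the two-word table.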

How to use the app

The application is available at this link: https://jigavara.shinyapps.io/wordpredictor/
The user enters one or more words.
If the user does not enter any words, the app returns the prediction “NULL”.
If the app does not find any combination of words in the files that matches the input, it returns “the”, as this is the most common word.
In all other cases, the app returns the word that follows the entered combination of words in the corpus.
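
The three cases above can be sketched as a small wrapper around any lookup function. This is an illustrative Python sketch under assumptions: the tokenization rule, the string "NULL" for empty input, and the names tokenize, predict and toy_lookup are hypothetical and are not taken from the app's source.

```python
import re

FALLBACK_WORD = "the"   # returned when no n-gram matches the input
EMPTY_RESULT = "NULL"   # returned when the user enters no words


def tokenize(text):
    """Lowercase the input and keep runs of letters and apostrophes (an assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def predict(text, lookup):
    """Apply the three cases: empty input, no match, match.

    `lookup` is any function that takes a list of words and returns
    the predicted next word, or None when nothing matches.
    """
    words = tokenize(text)
    if not words:
        return EMPTY_RESULT
    guess = lookup(words)
    return guess if guess is not None else FALLBACK_WORD


def toy_lookup(words):
    """Stand-in for the n-gram search, using a made-up one-entry table."""
    return {"good": "morning"}.get(words[-1])
```

Keeping the lookup as a parameter separates the input handling from the n-gram search, so either part can be changed without touching the other.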

Possibilities of the methodology

The app is a “proof of concept” that can serve as the basis for more sophisticated applications using larger databases. Such applications would be very useful for users of Internet search engines on both PCs and smartphones, as well as for users of instant messaging or e-mail applications.