Data Science Capstone Slide Deck

Adrian Trujillo
19/08/2020

This deck describes the capstone project for the Coursera Data Science Specialization, taught by professors at Johns Hopkins University in cooperation with SwiftKey.

You can run it at this Shiny app link

First steps:

The goal of this project is to create an application that predicts the next word of an entered phrase. The application was developed in RStudio.

  • The training data were texts from Twitter, blogs and news, downloaded from the link in the course assignment. Together they form our corpora.

  • This is too much data to process on a desktop or laptop computer, so we sampled 2% of each source.
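
The 2% sampling step can be illustrated with a minimal sketch. It is written in Python for illustration only; the application itself was developed in RStudio and its code is not shown in this deck. The function name `sample_lines`, the seed value and the stand-in data are assumptions.

```python
import random

def sample_lines(lines, fraction=0.02, seed=1234):
    """Return a reproducible random sample of about `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed (assumed value) so the sample can be reproduced
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)

# Hypothetical stand-in for one source (e.g. blogs) already read into memory
blog_lines = [f"line {i}" for i in range(1000)]
sample = sample_lines(blog_lines)
print(len(sample))  # 20
```

Sampling each source separately, as the slide describes, keeps Twitter, blogs and news represented in the same proportion in the reduced corpora.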

First steps:

  • A cleaning step is necessary to remove URLs, tweet mentions, punctuation marks, extra whitespace, end-of-line marks, numbers and metacharacters; to convert the text to lower case; to extract plain text from rich formatting; and to remove bad words using a dictionary obtained from this link to github / shutterstock, which is much appreciated.

  • The next step is to perform an exploratory data analysis that serves as the basis for creating a predictive model.
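
The cleaning step described above might look like the following sketch. It uses Python for illustration and is not the author's R code. The regular expressions, the function name `clean_text` and the one-word `BAD_WORDS` set are assumptions; the real bad-word dictionary is the one linked in the slide. Note that this simplified version also strips apostrophes.

```python
import re

# Hypothetical placeholder; the deck uses a published bad-words dictionary
BAD_WORDS = {"badword"}

def clean_text(text, bad_words=BAD_WORDS):
    """Lower-case the text and strip URLs, mentions, digits, punctuation and bad words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"@\w+", " ", text)                   # tweet mentions
    text = re.sub(r"[^a-z\s]", " ", text)               # numbers, punctuation, metacharacters
    # split() also collapses repeated spaces and end-of-line marks
    words = [w for w in text.split() if w not in bad_words]
    return " ".join(words)

print(clean_text("Check https://example.com @user!! I have 2 BADWORD cats.\n"))
# check i have cats
```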

Getting the frequency of words

  • The cleaned corpora were then tokenized into so-called n-grams.

  • Tokenization here is the process of splitting text into units called tokens (in this case, words), which are then grouped into sequences that feed the n-gram models.

  • An n-gram model is based on the Markov assumption that the next word depends only on the previous n - 1 words. It is built by counting how often each sequence of n consecutive words occurs in the text.

  • The frequency matrices for sequences of two, three, four, five and six words (2-grams through 6-grams) were then converted into frequency dictionaries.
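
As a rough illustration of the counting described above, the sketch below builds an n-gram frequency dictionary in Python. It is not the author's R implementation; the function name `ngram_counts`, whitespace tokenization and the toy sentences are assumptions.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every sequence of n consecutive words in the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # assumes the text has already been cleaned
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, for illustration only
sentences = ["i love data science", "i love data", "i love r"]
bigrams = ngram_counts(sentences, 2)
print(bigrams[("i", "love")])     # 3
print(bigrams[("love", "data")])  # 2
```

Calling the same function with n from 2 to 6 would give one frequency dictionary per n-gram order.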

Predictive model:

  • The model tries to use as much context as possible, depending on the length of the sentence.

  • If the sentence contains 5 or more words, the last 5 words are looked up in the 6-gram model, and the most frequent continuation is taken as the next word.

  • If no match is found, the context is shortened to the last 4 words and the 5-gram model is tried, and so on until the 2-gram model is reached.

  • If it finds the next word, the message is: “The next predicted word is: …”

  • If it can't find the next word, the message is: “Sorry I can't predict it”
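
The back-off procedure in the bullets above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the application's actual R code. The names `build_models` and `predict_next`, the nested-dictionary layout and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

def build_models(sentences, max_n=6):
    """Map n -> {context of n-1 words -> Counter of the words that follow it}."""
    models = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                models[n][context][tokens[i + n - 1]] += 1
    return models

def predict_next(phrase, models, max_n=6):
    """Try the longest usable context first, then back off one word at a time."""
    tokens = phrase.lower().split()
    for n in range(min(max_n, len(tokens) + 1), 1, -1):
        context = tuple(tokens[-(n - 1):])
        candidates = models[n].get(context)
        if candidates:
            word = candidates.most_common(1)[0][0]  # highest-frequency continuation
            return f"The next predicted word is: {word}"
    return "Sorry I can't predict it"

# Toy corpus, for illustration only
corpus = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat sat on the mat today",
]
models = build_models(corpus)
print(predict_next("a dog lay on the", models))  # The next predicted word is: mat
print(predict_next("hello zebra", models))       # Sorry I can't predict it
```

In the first call no 6-gram, 5-gram or 4-gram context matches, so the sketch backs off to the 3-gram context "on the", where "mat" is the most frequent continuation.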

User interface:

  • The user enters a phrase or words in the input box; the back end then runs the predictive model and returns the best prediction for the next word, if one is found.

  • The interface will also display the text entered by the user.

Here is the screenshot of the application

Acknowledgments and thanks:

  • To Johns Hopkins University and Coursera for the effort and initiative in offering this course.

  • To the Ph.D. instructors, in alphabetical order by first name: Brian Caffo, Jeff Leek, and Roger D. Peng, for their dedication in passing on their valuable knowledge and experience.

  • To my classmates for their dedication in correcting my assignments and participating in the forums.

  • To my family for joining me in this effort.