Marcos Medeiros
02/20/2022
This presentation is part of the Data Science Capstone Project in the Data Science Specialization offered by The Johns Hopkins University on Coursera. The aim of this project is to develop a Shiny application that uses NLP (Natural Language Processing).
The Milestone Report “Predicting the Next Word” is available in my RPubs account: https://rpubs.com/msrcos3s/milestone
There you can read about text mining and the methods and strategies used for word prediction.
The source data is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Access the Shiny app: https://msrcos3s.shinyapps.io/Capstone_Project
The purpose of the app is to predict the most likely next word for a word or phrase entered by the user.
The most likely word is obtained by an algorithm that looks up the n-gram matrices and compares the frequencies of sequences of four, three or two words, or of single words.
If an input word or phrase is not found in the sequence matrices, the word 'it' is returned.
The main panel indicates which type of n-gram was used in the prediction.
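The lookup described above can be sketched as a simple back-off: try the longest context first, then shorter ones, and fall back to 'it'. This is only an illustration in Python, not the app's actual R code. The `tables` layout, the `predict_next` name and the returned order `0` for the default are assumptions, and the sketch leaves out the single-word frequency step because the text does not say how it is applied.

```python
from collections import Counter

DEFAULT_WORD = "it"  # fallback word named in the text


def predict_next(phrase, tables):
    """Return (predicted_word, ngram_order) using a simple back-off.

    `tables` is a hypothetical structure: {n: {context_tuple: Counter}},
    where a context for an n-gram table holds the n - 1 preceding words.
    """
    words = phrase.lower().split()
    for n in (4, 3, 2):  # longest n-gram first
        context_len = n - 1
        if len(words) < context_len:
            continue
        context = tuple(words[-context_len:])
        candidates = tables.get(n, {}).get(context)
        if candidates:
            word, _count = candidates.most_common(1)[0]
            return word, n
    return DEFAULT_WORD, 0  # 0 marks "no n-gram matched" (an assumption)


# Toy tables, invented for the example.
tables = {
    2: {("thank",): Counter({"you": 5, "god": 1})},
    3: {("thank", "you"): Counter({"for": 4, "so": 2})},
}
print(predict_next("I want to thank you", tables))
```

The second element of the result plays the role of the main panel's indicator: it tells which n-gram order produced the prediction.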
JHU provided a corpus of 102.5 million words drawn from three sources: blogs, news and Twitter. We drew a random 1% sample to compose the training set.
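A 1% random sample of lines could be drawn roughly as follows. This is a Python sketch under assumptions: the function name, the fixed seed and sampling by line (rather than by word or document) are illustrative choices, not details given in the report.

```python
import random


def sample_lines(lines, fraction=0.01, seed=2022):
    """Return a reproducible random sample of `lines`.

    `fraction` and `seed` are illustrative defaults; at least one line
    is returned whenever the input is non-empty.
    """
    if not lines:
        return []
    rng = random.Random(seed)  # local generator keeps the sample reproducible
    k = max(1, round(len(lines) * fraction))
    return rng.sample(lines, k)


corpus = [f"line {i}" for i in range(1000)]  # stand-in for the real text files
training = sample_lines(corpus)
print(len(training))
```

Fixing the seed makes the training set reproducible, so the n-gram tables can be rebuilt identically later.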
The first step was cleaning the data: converting to lower case and removing extra whitespace, symbols, punctuation, numbers and stopwords.
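The cleaning step can be illustrated with a few lines of Python. The stopword set below is a tiny made-up list for the example, not the list used in the project, and treating every non-letter character as a separator is an assumption.

```python
import re

# Tiny illustrative stopword list (hypothetical, not the project's list).
STOPWORDS = {"the", "and", "a", "of", "to", "in"}


def clean_text(text, stopwords=STOPWORDS):
    """Lower-case, strip digits/punctuation/symbols, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    tokens = text.split()  # split() also collapses repeated whitespace
    return [t for t in tokens if t not in stopwords]


print(clean_text("The 3 quick   foxes, and the DOG!"))
```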
Next, we tokenized the text into a document-term matrix (DTM) and removed sparse terms.
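As a rough illustration of what removing sparse terms means, the Python sketch below builds per-document term counts and keeps only terms that appear in a minimum share of documents. The `min_doc_fraction` threshold and the function name are assumptions; the report does not state the sparsity cutoff that was used.

```python
from collections import Counter


def remove_sparse_terms(docs, min_doc_fraction=0.5):
    """Build a simple document-term matrix and drop rarely used terms.

    `docs` is a list of token lists. A term is kept when it occurs in at
    least `min_doc_fraction` of the documents (hypothetical threshold).
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    threshold = min_doc_fraction * len(docs)
    keep = {term for term, df in doc_freq.items() if df >= threshold}
    dtm = [Counter(t for t in doc if t in keep) for doc in docs]
    return dtm, sorted(keep)


docs = [["good", "morning", "world"], ["good", "night"], ["good", "morning"]]
dtm, vocabulary = remove_sparse_terms(docs)
print(vocabulary)
```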
With the n-gram tables, we can predict the next word based on the text entered by the user. The algorithm returns the most likely word using the most frequent combination of 2, 3 or 4 words, or the frequency of a single word.
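One way such n-gram tables could be built from a token stream is sketched below in Python. The nested-dictionary layout and the `build_ngram_tables` name are assumptions made for illustration; the project itself stores the frequencies in matrices.

```python
from collections import Counter, defaultdict


def build_ngram_tables(tokens, max_n=4):
    """Count, for each n in 2..max_n, which word follows each (n-1)-word context.

    Returns (tables, unigrams) where tables[n][context][next_word] is a count
    and unigrams holds single-word frequencies.
    """
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, next_word = tokens[i:i + n]
            tables[n][tuple(context)][next_word] += 1
    return tables, Counter(tokens)


tokens = "thank you for reading thank you for viewing thank you so much".split()
tables, unigrams = build_ngram_tables(tokens)
print(tables[3][("thank", "you")].most_common(1))
```

Looking up the user's last one to three words in these tables, longest context first, gives the most frequent continuation.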
Thank you for viewing!