JHU Data Science Capstone (Coursera)

Bently Bartee
04/27/2021

My name is Bently Bartee. I am a Data Analyst for Nordex USA. I am new to data analysis, having transferred over from Reliability Engineering. I currently develop statistical models and predictive analytics for our wind turbines in R and Python.

The goal of this exercise is to create a product that highlights the prediction algorithm I have built and provides an interface that others can access: a Shiny app that takes a phrase (multiple words) in a text box and outputs a prediction of the next word. You can use the slider to select up to three word predictions. Predictions are shown in order of closest match to the phrase you typed.

The source code can be found on GitHub.

My LinkedIn Profile

Before we start, we create a basic summary of the three datasets provided.

For each source file, we will review the:

file sizes
number of lines
number of characters
number of words

We will also include basic statistics on words per line (min, mean, and max).

Blogs have the most words per line on average, followed by news, with Twitter having the fewest. Twitter's lower word count per line is likely related to its per-tweet character limit (140 characters for most of the platform's history, raised to 280 in 2017).

   source file.size.MB line.count word.count mean.words.per.line
1   blogs       200.42     899288   37546239               41.75
2    news       196.28      77259    2674536               34.62
3 twitter       159.36    2360148   30093413               12.75
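
As a rough illustration of how the figures in this summary can be derived, here is a minimal sketch in Python. It is not the code used for the report: the function name `summarize_lines`, the whitespace-based word splitting, and the three-line sample are all assumptions made for the example.

```python
from statistics import mean

def summarize_lines(lines):
    """Return line, character, and word counts plus words-per-line statistics."""
    # Assumption: words are separated by whitespace; a real tokenizer may count differently.
    words_per_line = [len(line.split()) for line in lines]
    return {
        "line_count": len(lines),
        "char_count": sum(len(line) for line in lines),
        "word_count": sum(words_per_line),
        "min_words_per_line": min(words_per_line),
        "mean_words_per_line": round(mean(words_per_line), 2),
        "max_words_per_line": max(words_per_line),
    }

# Hypothetical sample standing in for one of the source files.
sample = ["the quick brown fox", "hello world", "a b c d e f"]
summary = summarize_lines(sample)
print(summary)
```

The mean words per line is simply the word count divided by the line count, which is how the last column of the table relates to the two columns before it.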

Exploratory Analysis

We will use several techniques to develop an understanding of the data.

We will look at:

most frequently used words
tokenizing
n-gram generation
unigrams
bigrams
trigrams

To understand the variety in our data, I took a statistical approach to the data set, one that can later be reused for the prediction model (the Shiny application).

Unigrams

The top 10 most common unigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the word frequencies.

[Figures: bar chart and word cloud of the most common unigrams]

Bigrams

The top 10 most common bigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the bigram frequencies.

[Figures: bar chart and word cloud of the most common bigrams]

Trigrams

The top 10 most common trigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the trigram frequencies.

[Figures: bar chart and word cloud of the most common trigrams]

Tokenizing and N-Gram Generation

In the Shiny app I used a predictive model that handles unigrams, bigrams, and trigrams. The packages that worked best for me were quanteda and tm. I used them to build internal functions that analyze the sample data and construct matrices of unigrams, bigrams, and trigrams.
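
To make the tokenizing and n-gram counting step concrete, here is a small sketch in plain Python. It does not use quanteda or tm and is not the app's implementation: the `tokenize` and `ngram_counts` helpers, the regular expression, and the sample sentences are assumptions chosen for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count n-grams line by line, so no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample; the report samples from the blogs, news, and twitter files.
sample = ["I love data science", "I love wind turbines", "data science is fun"]
unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

# The most frequent entries are what a "top 10" bar chart would display.
top_bigrams = bigrams.most_common(2)
print(top_bigrams)
```

Counting within each line is a deliberate choice here: it avoids creating n-grams that join the end of one tweet or post to the start of the next.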

In conclusion, the final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface.

Possible models:

The Shiny app's algorithm builds on the exploratory analysis above:

Use the trigram model to predict the next word.
If there is no matching trigram, back off to the bigram model.
Finally, fall back to the unigram model if needed.
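
The back-off order described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the app's code: `build_model` and `predict_next` are hypothetical names, words are split on whitespace, and when the longest matching context yields fewer than `k` candidates, the remaining slots are filled from the shorter contexts (one possible reading of "up to three word predictions").

```python
from collections import Counter, defaultdict

def build_model(lines, max_n=3):
    """Map each context (tuple of preceding words) to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict_next(model, phrase, k=3):
    """Return up to k candidate next words, longest matching context first."""
    tokens = phrase.lower().split()
    predictions = []
    # Two words of context = trigram model, one = bigram, none = unigram.
    for size in (2, 1, 0):
        if len(tokens) < size:
            continue
        context = tuple(tokens[-size:]) if size else ()
        for word, _ in model.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions

# Hypothetical training lines standing in for the sampled corpus.
corpus = [
    "i love data science",
    "i love data analysis",
    "i love wind turbines",
    "we love data science",
]
model = build_model(corpus)
print(predict_next(model, "love data", k=1))
print(predict_next(model, "unseen words", k=1))
```

In this sketch the `k` parameter plays the role of the slider in the app: it caps how many candidate words are returned.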

The user interface of the Shiny app will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word after a short delay.

The final strategy will be whichever approach offers the best balance of efficiency and accuracy.