Bently Bartee
04/27/2021
My name is Bently Bartee, and I am a Data Analyst for Nordex USA. I am new to data analysis, having transferred over from Reliability Engineering. I currently develop statistical models and predictive analytics for our wind turbines in R and Python.
The goal of this exercise is to create a product that highlights the prediction algorithm I have built and to provide an interface that others can access. The product is a Shiny app that takes a phrase (multiple words) in a text box as input and outputs a prediction of the next word. You can use the slider to select up to three word predictions. Predictions are shown in order of closest match to the phrase you typed.
The source code can be found on GitHub.
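Before we start, we create a basic summary of the three datasets provided (blogs, news, and Twitter).
We will review the file size, line count, word count, and mean words per line of each file.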
Blogs have the most words per line on average, followed by news, with Twitter having the fewest. Twitter's lower word count per line is probably related to its per-tweet character limit (140 characters at the time this corpus was collected).
  source   file.size.MB  line.count  word.count  mean.words.per.line
1 blogs          200.42      899288    37546239                41.75
2 news           196.28       77259     2674536                34.62
3 twitter        159.36     2360148    30093413                12.75
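As a rough illustration, summary figures like the ones in the table above could be computed along the following lines. This is a minimal sketch in plain Python, not the R code used for this report; the function name summarize_text_file and the dictionary keys are hypothetical, and words are assumed to be whitespace-delimited.

```python
import os
import tempfile


def summarize_text_file(path, encoding="utf-8"):
    """Return size in MB, line count, word count, and mean words per line."""
    line_count = 0
    word_count = 0
    # Stream the file line by line so large corpora need not fit in memory.
    with open(path, "r", encoding=encoding, errors="ignore") as handle:
        for line in handle:
            line_count += 1
            word_count += len(line.split())  # whitespace-delimited words
    return {
        "file_size_mb": os.path.getsize(path) / (1024 ** 2),
        "line_count": line_count,
        "word_count": word_count,
        "mean_words_per_line": word_count / line_count if line_count else 0.0,
    }


# Usage on a tiny throwaway file (it stands in for the real corpus files).
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("one two three\nfour five\n")
    demo_summary = summarize_text_file(demo_path)
    print(demo_summary)
```

Dividing the word count by the line count gives the mean words per line, which is the figure used to compare the three sources.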
Exploratory Analysis
We will use several techniques to develop an understanding of the data.
We will look at the most frequent unigrams, bigrams, and trigrams.
To develop an understanding of the variety in our data, I took a statistical approach to the data set, and the results can later be used for the prediction model (the Shiny application).
Unigrams
The top 10 most common unigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the unigram frequencies.
Bigrams
The top 10 most common bigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the bigram frequencies.
Trigrams
The top 10 most common trigrams in the data sample. A bar chart and a word cloud were constructed to illustrate the trigram frequencies.
Tokenizing and N-Gram Generation
In the Shiny app I used a predictive model that handles unigrams, bigrams, and trigrams.
The packages that worked best for me were quanteda and tm.
I used them to build internal functions that analyze the sample data and construct matrices of unigrams, bigrams, and trigrams.
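To make the tokenizing and counting steps concrete, here is a minimal sketch in plain Python. It is not the quanteda/tm code behind this report: the helper names tokenize, ngrams, and top_ngrams are hypothetical, the tokenizer is a deliberately simple assumption (lowercase letters and apostrophes only), and the sample lines are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=10):
    """Count n-grams line by line and return the k most frequent."""
    counts = Counter()
    for line in lines:
        # N-grams are built within a line, so they never span two lines.
        counts.update(ngrams(tokenize(line), n))
    return counts.most_common(k)


# Made-up sample lines, standing in for a sample of the corpus.
sample = [
    "Thanks for the follow",
    "Thanks for the great day",
    "I can't wait for the weekend",
]
for n in (1, 2, 3):
    print(n, top_ngrams(sample, n, k=3))
```

Calling top_ngrams with n set to 1, 2, or 3 and k set to 10 would give the kind of top-10 frequency lists that the bar charts and word clouds summarize.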
In conclusion, the final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app, which provides the user interface.
Possible models:
The n-gram algorithm developed from our exploratory analysis above, deployed in the Shiny app.
The user interface of the Shiny app will consist of a text input box that will allow a user to enter a phrase. Then the app will use our algorithm to suggest the most likely next word after a short delay.
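One way the suggestion step could work is sketched below in plain Python. The backoff strategy (try the last two words, then the last word, then overall word frequency) is an assumption made for illustration, not necessarily the method used in the app; build_model and predict_next are hypothetical names, and the parameter k plays the role of the slider that selects up to three predictions.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_n=3):
    """Map each context (a tuple of 0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for context_len in range(max_n):
                if i - context_len < 0:
                    break  # not enough preceding words for this context
                context = tuple(tokens[i - context_len:i])
                model[context][word] += 1
    return model


def predict_next(model, phrase, k=3, max_n=3):
    """Return up to k candidate next words, longest matching context first."""
    tokens = phrase.lower().split()
    predictions = []
    # Assumed backoff order: two-word context, one-word context, no context.
    for context_len in range(max_n - 1, -1, -1):
        if context_len > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - context_len:])
        for word, _ in model.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions


# Made-up training sentences, standing in for the sampled corpus.
corpus = [
    "thanks for the follow",
    "thanks for the follow back",
    "thanks for the great day",
    "happy mothers day",
]
model = build_model(corpus)
print(predict_next(model, "thanks for the", k=3))
```

Because longer contexts are consulted first, candidates backed by a trigram are listed before those backed only by a bigram or by raw word frequency.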
The final strategy will be whichever model offers the best balance of efficiency and accuracy.
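A comparison of candidate models on accuracy and efficiency could be sketched as follows, again in plain Python. The metrics chosen here (top-k accuracy on held-out sentences and mean time per prediction) are assumptions, since the report does not define them; evaluate and baseline_predict are hypothetical names, and the baseline is only a stand-in for a real model.

```python
import time


def evaluate(predict, sentences, k=3):
    """Return (top-k accuracy, mean seconds per prediction) for a predictor."""
    hits = 0
    total = 0
    elapsed = 0.0
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Predict each word from the words before it in the same sentence.
        for i in range(1, len(tokens)):
            phrase = " ".join(tokens[:i])
            start = time.perf_counter()
            guesses = predict(phrase, k)
            elapsed += time.perf_counter() - start
            hits += tokens[i] in guesses
            total += 1
    if total == 0:
        return 0.0, 0.0
    return hits / total, elapsed / total


def baseline_predict(phrase, k):
    """Hypothetical stand-in model: always guess the same common words."""
    return ["the", "to", "and"][:k]


# Made-up held-out sentences; a real test would use text unseen in training.
held_out = ["thanks for the follow", "going to the store"]
accuracy, seconds = evaluate(baseline_predict, held_out, k=3)
print(f"top-3 accuracy: {accuracy:.2f}, mean latency: {seconds:.6f}s")
```

Any model exposed as a function of a phrase and k can be passed to evaluate, so several candidates can be scored on the same held-out text.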