This project uses files named LOCALE.blogs.txt, where LOCALE is one of four locales, each with its own language: English (en_US dataset), German (de_DE dataset), Russian (ru_RU dataset) and Finnish (fi_FI dataset). In this capstone, we apply data science to build a predictive model in the area of natural language processing.
In this first report, we present an overview of the data through Exploratory Data Analysis (EDA) and our plan to build a Shiny app for data interaction. The report has three parts:
Part I: A plan for the predictive model
Part II: Exploratory Data Analysis. An overview of the data is presented in statistical tables and data visualizations.
Part III: Shiny app projection
For the predictive model in Part I, we will focus on putting the following machine learning algorithms into practice:
1. Naive Bayes: One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods.
2. Support vector machines (SVM): An SVM model can be used for either regression or classification, and linear SVMs often work well with text data.
3. Regularized linear models (lasso regularized model): Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.
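To make the first algorithm concrete, here is a minimal Python sketch of a multinomial naive Bayes classifier working on word counts. It is not the model used in this project: the function names, the toy training sentences and the labels are all invented for illustration, and add-one (Laplace) smoothing is an assumed choice.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect document and word counts per label from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing so unseen (word, label) pairs get non-zero mass
            smoothed = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, not taken from the capstone corpus
toy_docs = [
    ("what a lovely happy day".split(), "positive"),
    ("happy to see a great game".split(), "positive"),
    ("this is a sad and awful day".split(), "negative"),
    ("awful traffic and terrible news".split(), "negative"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "a happy great day".split()))
```

Each distinct word is one feature, which is why the approach scales naturally to the large vocabularies produced by word count methods.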
The exploratory analysis in Part II proceeds in two steps:
Step 01: To do the sentiment analysis, we use the NRC Word-Emotion Association Lexicon to look at the overall positive vs. negative sentiment in the text before looking at more specific emotions.
Step 02: We capitalize the names of the emotions for plotting and reorder the factor so that the more positive emotions are grouped together in the plot, as are the more negative ones.
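The two steps can be sketched in a few lines of Python. This is an illustration only: `TOY_LEXICON` is a hypothetical five-word stand-in and does not reproduce real NRC entries, `score_text` is an invented helper name, and the emotion order is simply the column order that appears in this report's tables.

```python
from collections import Counter

# Hypothetical toy lexicon (word -> labels); NOT the real NRC lexicon
TOY_LEXICON = {
    "happy": ["joy", "positive"],
    "friend": ["trust", "positive"],
    "afraid": ["fear", "negative"],
    "angry": ["anger", "negative"],
    "lost": ["sadness", "negative"],
}

# Plot order: more positive emotions first, more negative emotions last
EMOTION_ORDER = ["Joy", "Trust", "Anticipation", "Surprise",
                 "Anger", "Fear", "Disgust", "Sadness"]


def score_text(text):
    """Count lexicon hits, returning overall sentiment and ordered emotions."""
    counts = Counter()
    for word in text.lower().split():
        for label in TOY_LEXICON.get(word, []):
            counts[label.capitalize()] += 1  # Step 02: capitalize for plotting
    # Step 01: overall positive vs. negative first
    overall = {k: counts[k] for k in ("Positive", "Negative")}
    # Then the specific emotions, in the fixed plotting order
    emotions = [(e, counts[e]) for e in EMOTION_ORDER]
    return overall, emotions


overall, emotions = score_text("Happy friend was afraid and angry and lost")
print(overall)
print(emotions)
```

Keeping the order in one explicit list plays the same role as reordering factor levels before plotting.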
Let’s go through each of the datasets in general.
Observation: Positive emotions are stronger than negative ones in US_blogs, US_news and US_twitter.
Negative emotions are stronger than positive ones in DE_blogs, DE_news and DE_twitter.
Observation: Negative emotions are stronger than positive ones in FI_blogs, FI_news and FI_twitter.
Positive emotions are stronger than negative ones in RU_blogs, RU_news and RU_twitter.
We used ‘nrc’ and ‘bing’, two general-purpose lexicons, to evaluate the opinion or emotion in the text. For the statistical analysis, we used the en_US dataset.
Share of each emotion's word count by en_US source (each column sums to 1):
##                  Joy      Trust Anticipation   Surprise      Anger       Fear
## Blogs   0.0006672548 0.51903184   0.47921886 0.46602208 0.49393563 0.54482454
## News    0.0465548248 0.04382258   0.03360237 0.02899782 0.03982738 0.04436636
## Twitter 0.9527779204 0.43714558   0.48717877 0.50498009 0.46623699 0.41080910
##            Disgust    Sadness   Positive   Negative
## Blogs   0.48832082 0.52663153 0.41301598 0.51710369
## News    0.02953343 0.03944057 0.03867961 0.03924194
## Twitter 0.48214575 0.43392790 0.54830441 0.44365437
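As a worked check, the Joy column of the proportion table can be reproduced by dividing each source's count by the column total. The three counts below are taken from the Min., Median and Max. of Joy in the summary table; assigning them to Blogs, News and Twitter in that order is an assumption, chosen because it matches the reported proportions. The sketch is in Python for illustration.

```python
# Joy word counts per source (assumed mapping, see the note above)
joy_counts = {"Blogs": 689, "News": 48072, "Twitter": 983828}

# Each proportion is the source's count divided by the column total
total = sum(joy_counts.values())
proportions = {source: n / total for source, n in joy_counts.items()}

for source, p in proportions.items():
    print(f"{source:<8}{p:.10f}")
```

Because every column is normalized by its own total, the table compares sources within one emotion and does not compare emotions with each other.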
Summary statistics of the word counts per emotion across the three en_US sources (blogs, news and Twitter):
##       Joy            Trust          Anticipation       Surprise
##  Min.   :   689   Min.   :  98460   Min.   :  69827   Min.   : 27364
##  1st Qu.: 24381   1st Qu.: 540317   1st Qu.: 532831   1st Qu.:233565
##  Median : 48072   Median : 982173   Median : 995835   Median :439765
##  Mean   :344196   Mean   : 748929   Mean   : 692679   Mean   :314552
##  3rd Qu.:515950   3rd Qu.:1074164   3rd Qu.:1004106   3rd Qu.:458147
##  Max.   :983828   Max.   :1166154   Max.   :1012376   Max.   :476528
##      Anger             Fear           Disgust          Sadness
##  Min.   : 35983   Min.   : 48271   Min.   : 19169   Min.   : 39131
##  1st Qu.:228608   1st Qu.:247618   1st Qu.:166056   1st Qu.:234827
##  Median :421233   Median :446964   Median :312942   Median :430522
##  Mean   :301158   Mean   :362670   Mean   :216354   Mean   :330717
##  3rd Qu.:433746   3rd Qu.:519869   3rd Qu.:314946   3rd Qu.:476510
##  Max.   :446258   Max.   :592774   Max.   :316950   Max.   :522498
##     Positive          Negative
##  Min.   : 243723   Min.   : 142554
##  1st Qu.:1423083   1st Qu.: 877108
##  Median :2602443   Median :1611661
##  Mean   :2100357   Mean   :1210898
##  3rd Qu.:3028674   3rd Qu.:1745071
##  Max.   :3454905   Max.   :1878480
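Each column of the summary describes only three values, one per source, so the quartiles are interpolated between neighbouring counts. The Python sketch below recomputes the Joy column from its Min., Median and Max. It assumes linear interpolation between order statistics, which Python's `statistics.quantiles` provides with `method="inclusive"`.

```python
import statistics

# The three Joy counts: Min., Median and Max. from the summary table
joy = [689, 48072, 983828]

# Quartiles by linear interpolation between order statistics
q1, median, q3 = statistics.quantiles(joy, n=4, method="inclusive")

summary = {
    "Min.": min(joy),
    "1st Qu.": q1,
    "Median": median,
    "Mean": statistics.mean(joy),
    "3rd Qu.": q3,
    "Max.": max(joy),
}
for name, value in summary.items():
    print(f"{name:<8}{value:>12.1f}")
```

With three values, the first quartile is the midpoint of the two smallest counts (24380.5, shown rounded in the table) and the third quartile is the midpoint of the two largest (515950).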
Most common positive and negative words in US_Blogs.
Most common positive and negative words in US_News.
Most common positive and negative words in US_Twitter.
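Rankings like these come from tagging each word with a binary sentiment and counting occurrences per group. A minimal Python sketch follows, under stated assumptions: `TOY_BING` is a hypothetical five-word stand-in for a positive/negative lexicon such as ‘bing’, and `top_sentiment_words` is an invented helper name.

```python
from collections import Counter

# Hypothetical stand-in for a binary sentiment lexicon; not the real word list
TOY_BING = {
    "good": "positive",
    "love": "positive",
    "great": "positive",
    "bad": "negative",
    "hate": "negative",
}


def top_sentiment_words(tokens, n=2):
    """Return the n most frequent lexicon words for each sentiment."""
    counts = {"positive": Counter(), "negative": Counter()}
    for tok in tokens:
        sentiment = TOY_BING.get(tok)
        if sentiment is not None:  # words outside the lexicon are ignored
            counts[sentiment][tok] += 1
    return {s: c.most_common(n) for s, c in counts.items()}


tokens = "good good good love love great bad bad hate the a".split()
top_words = top_sentiment_words(tokens)
print(top_words)
```

Words missing from the lexicon contribute nothing, so the rankings depend as much on lexicon coverage as on the text itself.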
Our Shiny app will give you access to the data and show how well our chosen algorithms work on the dataset. In upcoming reports, the features of the Shiny app will be specified in more detail, with the aim of giving the audience the best possible visualizations.