Capstone Project_Predictive Model For Text

Synopsis

This project uses files named after the pattern LOCALE.blogs.txt, where LOCALE is one of four locales, each corresponding to a language: English (en_US dataset), German (de_DE dataset), Russian (ru_RU dataset) and Finnish (fi_FI dataset). In this capstone, we apply data science to build a predictive model in the area of natural language processing.

What do the data look like in general?

The data come as chunks of text, made up of letters in lower and upper case and numbers.
Sentence boundaries are usually marked by punctuation such as periods and exclamation marks, sometimes followed by a closing brace or parenthesis.
The text contains abbreviations (e.g. p.142) and currency signs ($).
It contains spelling and grammatical errors (e.g. Youu).
It also contains emoji and other symbols (e.g. 😖😰💔💔).
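The characteristics listed above can be checked on a raw string with a few counters. The following is a minimal Python sketch for illustration only; it is not the code used for this report. The function name `profile_text` and the sample sentence are assumptions, and the sentence-boundary pattern is a deliberately simple heuristic.

```python
import re
import unicodedata

def profile_text(text):
    """Count a few surface features of a raw text chunk (illustrative helper)."""
    return {
        "uppercase": sum(ch.isupper() for ch in text),
        "lowercase": sum(ch.islower() for ch in text),
        "digits": sum(ch.isdigit() for ch in text),
        # Sentence-ending punctuation, optionally followed by a closing
        # bracket, and then whitespace or the end of the string.
        "sentence_ends": len(re.findall(r"[.!?]+[)\]]?(?=\s|$)", text)),
        # Unicode category "Sc" is "Symbol, currency" (e.g. $).
        "currency": sum(unicodedata.category(ch) == "Sc" for ch in text),
        # Unicode category "So" ("Symbol, other") covers most emoji.
        "symbols": sum(unicodedata.category(ch) == "So" for ch in text),
    }

sample = "Youu owe me $5 (see p.142)! 😖💔"
profile = profile_text(sample)
print(profile)
```

Note that the period inside "p.142" is not counted as a sentence boundary because it is followed by a digit rather than whitespace, which is why abbreviations need care when splitting sentences.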

Where do the data come from?

The data is from a corpus called HC Corpora.

In this first report, we present an overview of the data through Exploratory Data Analysis (EDA) and our plan to build a Shiny App for interacting with the data. The report has three parts:

Part I: A plan for building the predictive model. We will focus on putting into practice machine learning algorithms such as:

Naive Bayes
Support vector machines (SVM) (Boser, Guyon, and Vapnik 1992)
Regularized linear models (Friedman, Hastie, and Tibshirani 2010), specifically the lasso regularized model
Convolutional neural network (CNN) architecture (Kim 2014)

Part II: Exploratory Data Analysis. An overview of the data will be presented through statistical tables and data visualization.

Part III: Shiny App Projection

Part I: Planning the predictive model

Naive Bayes: One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods.
Support vector machines (SVM): An SVM model can be used for either regression or classification, and linear SVMs often work well with text data.

Regularized linear models (lasso regularized model): Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.

Convolutional neural network (CNN): CNNs can be well suited to modeling text data because text often contains quite a lot of local structure. Unlike an LSTM (Long Short-Term Memory network), a CNN does not learn long-range structure within a sequence; instead it detects local patterns. A CNN takes data (such as text) as input and produces output that represents specific structures in the data.
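To make the first item in the list above concrete, here is a minimal sketch of a multinomial naive Bayes classifier over word counts, with add-one (Laplace) smoothing. It is written in Python for illustration and is not the model built for this project. The function names, the toy documents and the labels "blogs" and "news" are all assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (tokens, label) pairs; returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, made up for this sketch.
toy_docs = [
    (["i", "love", "my", "cat"], "blogs"),
    (["i", "had", "a", "great", "day"], "blogs"),
    (["police", "said", "the", "suspect"], "news"),
    (["the", "mayor", "said", "today"], "news"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, ["the", "police", "said"]))
print(predict(model, ["i", "love"]))
```

Each word contributes one independent term to the score, which is why the method scales to the very large feature sets produced by word counts.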

Part II: Exploratory Data Analysis

Step 01: To do the sentiment analysis, we use the NRC Word-Emotion Association Lexicon to look at the overall positive vs. negative sentiment in the text before looking at more specific emotions.

Step 02: We capitalize the names of the emotions for plotting and reorder the factor so that the more positive emotions are grouped together in the plot, followed by the more negative ones.
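The two steps above can be sketched as follows. This is a hedged Python illustration, not the report's own code. `TOY_LEXICON` is a hypothetical stand-in whose entries are made up and are not taken from the real NRC lexicon, and `PLOT_ORDER` is an assumption that mirrors the column order of the proportion table in the corpus summary.

```python
from collections import Counter

# Hypothetical toy lexicon in the NRC style: word -> set of categories.
# These entries are illustrative only, NOT real NRC lexicon entries.
TOY_LEXICON = {
    "happy": {"joy", "positive"},
    "friend": {"trust", "positive"},
    "afraid": {"fear", "negative"},
    "lost": {"sadness", "negative"},
}

# Assumed plot order: more positive emotions first, more negative ones last.
PLOT_ORDER = ["Joy", "Trust", "Anticipation", "Surprise",
              "Anger", "Fear", "Disgust", "Sadness"]

def tally(tokens, lexicon):
    """Step 01: count how many tokens fall into each lexicon category."""
    counts = Counter()
    for tok in tokens:
        for category in lexicon.get(tok.lower(), ()):
            counts[category.capitalize()] += 1  # capitalize names for plotting
    return counts

def ordered_emotions(counts, order=PLOT_ORDER):
    """Step 02: return (emotion, count) pairs in the fixed plotting order."""
    return [(name, counts.get(name, 0)) for name in order]

tokens = "I was afraid I lost my happy friend".split()
counts = tally(tokens, TOY_LEXICON)
print(counts["Positive"], counts["Negative"])
print(ordered_emotions(counts))
```

The overall Positive and Negative counts are read first, and the eight specific emotions are then listed in a fixed order so that every plot shares the same layout.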

Let’s go through each of the datasets in turn.

en_US Dataset

The analysis loads tidyverse 1.3.0 (ggplot2 3.3.3, tibble 3.1.0, tidyr 1.1.2, readr 1.3.1, purrr 0.3.4, dplyr 1.0.2, stringr 1.4.0, forcats 0.5.1), reshape2 and viridisLite.

Observation: The positive emotions are stronger than the negative ones in US_blogs, US_news and US_twitter.

de_DE Dataset

Observation: The negative emotions are stronger than the positive ones in DE_blogs, DE_news and DE_twitter.

fi_FI Dataset

Observation: The negative emotions are stronger than the positive ones in FI_blogs, FI_news and FI_twitter.

ru_RU Dataset

Observation: The positive emotions are stronger than the negative ones in RU_blogs, RU_news and RU_twitter.

Part II: Corpus Summary

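We used two general-purpose lexicons, ‘nrc’ and ‘bing’, to evaluate the opinion or emotion in the text. For the statistical analysis, we used the en_US dataset. The first table gives, for each sentiment category, each source's share of that category's total count, so every column sums to 1. The second table summarizes the raw counts for each category across the three sources (Blogs, News and Twitter).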

##                  Joy      Trust Anticipation   Surprise      Anger       Fear
## Blogs   0.0006672548 0.51903184   0.47921886 0.46602208 0.49393563 0.54482454
## News    0.0465548248 0.04382258   0.03360237 0.02899782 0.03982738 0.04436636
## Twitter 0.9527779204 0.43714558   0.48717877 0.50498009 0.46623699 0.41080910
##            Disgust    Sadness   Positive   Negative
## Blogs   0.48832082 0.52663153 0.41301598 0.51710369
## News    0.02953343 0.03944057 0.03867961 0.03924194
## Twitter 0.48214575 0.43392790 0.54830441 0.44365437

##       Joy             Trust          Anticipation        Surprise     
##  Min.   :   689   Min.   :  98460   Min.   :  69827   Min.   : 27364  
##  1st Qu.: 24381   1st Qu.: 540317   1st Qu.: 532831   1st Qu.:233565  
##  Median : 48072   Median : 982173   Median : 995835   Median :439765  
##  Mean   :344196   Mean   : 748929   Mean   : 692679   Mean   :314552  
##  3rd Qu.:515950   3rd Qu.:1074164   3rd Qu.:1004106   3rd Qu.:458147  
##  Max.   :983828   Max.   :1166154   Max.   :1012376   Max.   :476528  
##      Anger             Fear           Disgust          Sadness      
##  Min.   : 35983   Min.   : 48271   Min.   : 19169   Min.   : 39131  
##  1st Qu.:228608   1st Qu.:247618   1st Qu.:166056   1st Qu.:234827  
##  Median :421233   Median :446964   Median :312942   Median :430522  
##  Mean   :301158   Mean   :362670   Mean   :216354   Mean   :330717  
##  3rd Qu.:433746   3rd Qu.:519869   3rd Qu.:314946   3rd Qu.:476510  
##  Max.   :446258   Max.   :592774   Max.   :316950   Max.   :522498  
##     Positive          Negative      
##  Min.   : 243723   Min.   : 142554  
##  1st Qu.:1423083   1st Qu.: 877108  
##  Median :2602443   Median :1611661  
##  Mean   :2100357   Mean   :1210898  
##  3rd Qu.:3028674   3rd Qu.:1745071  
##  Max.   :3454905   Max.   :1878480
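The two tables above appear to be linked: the Min., Median and Max. of the Joy column match the Blogs, News and Twitter shares respectively. The Python sketch below, given for illustration only, recomputes the Joy shares and summary from those three counts. The mapping of counts to sources is inferred from the tables rather than stated in the report, and the "inclusive" quantile method is chosen because it reproduces the quartiles shown.

```python
import statistics

# Joy counts per source, inferred from the summary table (Min., Median, Max.)
# and matched to sources through the proportion table. This mapping is an inference.
joy_counts = {"Blogs": 689, "News": 48072, "Twitter": 983828}

def shares(counts):
    """Each source's share of the category total; the shares sum to 1."""
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}

joy_shares = shares(joy_counts)

values = sorted(joy_counts.values())
# method="inclusive" interpolates between the observed minimum and maximum.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
summary = {
    "min": values[0],
    "q1": q1,
    "median": median,
    "mean": statistics.mean(values),
    "q3": q3,
    "max": values[-1],
}

print(joy_shares)
print(summary)
```

With only three sources per category, the quartiles are simply midpoints between neighbouring counts, so the summary table carries little information beyond the three raw counts themselves.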

Most common positive and negative words in the en_US dataset.
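Finding the most common positive and negative words amounts to joining the tokens against a polarity lexicon and counting. The following Python sketch is illustrative only. `TOY_POLARITY` is a hypothetical stand-in in the style of the ‘bing’ lexicon (its entries are made up), and the function name and sample sentence are assumptions.

```python
from collections import Counter

# Hypothetical polarity lexicon in the 'bing' style: word -> polarity.
# Entries are illustrative only, NOT taken from the real lexicon.
TOY_POLARITY = {
    "good": "positive",
    "love": "positive",
    "great": "positive",
    "bad": "negative",
    "hate": "negative",
}

def top_sentiment_words(tokens, lexicon, k=2):
    """Return the k most frequent positive and negative words among the tokens."""
    counts = {"positive": Counter(), "negative": Counter()}
    for tok in tokens:
        word = tok.lower()
        polarity = lexicon.get(word)
        if polarity is not None:  # skip words the lexicon does not cover
            counts[polarity][word] += 1
    return {polarity: c.most_common(k) for polarity, c in counts.items()}

tokens = "Love love this great day but hate the bad bad bad traffic".split()
top_words = top_sentiment_words(tokens, TOY_POLARITY)
print(top_words)
```

Words outside the lexicon are ignored, so the result depends on how well the lexicon covers the vocabulary of each source.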

2.1. US Blogs

Most common positive and negative words in US_Blogs.

2.2. US News

Most common positive and negative words in US_News.

2.3. US Twitter

Most common positive and negative words in US_Twitter.

Part III: Shiny App

Our Shiny app will give users access to the data and show how well our chosen algorithms work on the dataset. In the next reports, the features of the Shiny app will be specified in detail, with the aim of giving the audience the clearest possible visual presentation of the results.

Capstone Project_Predictive Model For Text_Report 01

Anna Huynh

3/22/2021