Sentiment Labelled Sentences

Load the Libraries + Functions

Understanding whether a written review is positive or negative can be tricky, because the context of what is being reviewed and other factors can affect its sentiment. In this project, I investigate whether the words used in a review can predict its sentiment. The datasets come from a Kaggle project with labelled sentences, which you can check out here [https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set].

For this project, I have chosen the Amazon reviews. The first column in each dataset is a measure of the sentiment of the review (0 = negative, 1 = positive) and the second is the number of tokens (or words) in the review. The rest of the columns are words that were either used (coded as 1) or not used (coded as 0) in the review. The sentiment of the review should be used as your outcome in your binary logistic regression. For your predictors, choose 5-10 words to test whether use of those words predicts the sentiment of the review.

#If using the VM, you may need to install the packages 'gridExtra', 'htmlwidgets', and 'readxl' first
library(readr)
library(rms)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.0.3

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following objects are masked from 'package:rms':
## 
##     Predict, vif

Choosing Predictors

I will use RawTokenCount, quality, recommend, problem, easy, great, price, fit, work, and best as predictors. I expect these words to give a clear picture of positive sentiment; personally, I use them often when I write a review.

Running a Binary Logistic Regression

χ²(10) = 160.58, p < 0.001. Yes, the model is significant, as the p value is less than 0.001. Compared with a model with no predictors, this model has less error, so the chosen predictors can predict the sentiment.
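As a quick check of the reported p value, the likelihood-ratio statistic can be compared against a chi-square distribution with 10 degrees of freedom. This is a minimal sketch that hard-codes the two values from the model output; the names `lr_stat`, `lr_df`, and `p_value` are introduced here only for illustration.

```r
# Likelihood-ratio chi-square and degrees of freedom, copied from model$stats
lr_stat <- 160.582582398933
lr_df   <- 10

# Upper-tail probability of a statistic this large under the no-predictor model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# TRUE when the model is significant at the 0.001 level
p_value < 0.001
```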

R2 is an effect size for how well the model fits the data. Here R2 = .213 (the Nagelkerke R2 reported by lrm), which is loosely read as the model explaining about 21.3% of the variation in sentiment.

I was able to predict the sentiment acceptably, as the C statistic is 0.70. The C statistic is the proportion of all pairs of one positive and one negative review in which the model assigns the higher predicted probability to the positive review. It equals the area under the ROC curve and measures how well the model discriminates between the two outcomes.
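The pairwise definition of the C statistic can be sketched by hand. The example below uses a small, made-up set of outcomes and predicted probabilities (`y` and `phat` are hypothetical, not taken from the Amazon data), and then checks the relationship Dxy = 2 * (C - 0.5) against the values printed by `model$stats`.

```r
# Hypothetical toy data: observed outcomes (0/1) and predicted probabilities
y    <- c(0, 0, 1, 1, 0, 1)
phat <- c(0.2, 0.6, 0.7, 0.4, 0.3, 0.9)

# Compare every positive case with every negative case
pos   <- phat[y == 1]
neg   <- phat[y == 0]
diffs <- outer(pos, neg, "-")

# Concordant pairs count 1, tied pairs count 0.5
c_stat <- (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
c_stat

# Somers' Dxy implied by the C statistic reported for the review model
dxy_from_c <- 2 * (0.702195799992 - 0.5)
dxy_from_c
```

In the toy data, 8 of the 9 positive/negative pairs are ordered correctly, so the C statistic is 8/9. The second calculation reproduces the Dxy value shown in the model output.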

options(scipen = 99)
#loading the data
amazon_reviews <- read_csv("C:/Users/Gautam/OneDrive/HU/3rdsem/ANLY540/Assignments540/amazon reviews.csv")

## Parsed with column specification:
## cols(
##   .default = col_double()
## )

## See spec(...) for full column specifications.

amazon <- amazon_reviews
head(amazon)

## # A tibble: 6 x 30
##   Sentiment RawTokenCount  case excellent great money sound quality  time
##       <dbl>         <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
## 1         1             6     1         1     0     0     0       0     0
## 2         0            18     0         0     0     0     0       0     0
## 3         1            11     0         0     0     0     0       0     0
## 4         0            23     0         0     0     0     0       0     0
## 5         0             9     0         0     0     1     0       0     0
## 6         1             7     0         0     1     0     1       1     0
## # ... with 21 more variables: battery <dbl>, ear <dbl>, charge <dbl>,
## #   problem <dbl>, recommend <dbl>, phone <dbl>, love <dbl>, best <dbl>,
## #   headset <dbl>, work <dbl>, nice <dbl>, product <dbl>, price <dbl>,
## #   buy <dbl>, call <dbl>, look <dbl>, purchase <dbl>, item <dbl>, fit <dbl>,
## #   service <dbl>, easy <dbl>

#converting the variables to factor
amazon$Sentiment <- as.factor(amazon$Sentiment)
amazon$quality <- as.factor(amazon$quality)
amazon$recommend <- as.factor(amazon$recommend)
amazon$problem <- as.factor(amazon$problem)
amazon$easy <- as.factor(amazon$easy)
amazon$great <- as.factor(amazon$great)
amazon$price <- as.factor(amazon$price)
amazon$fit <- as.factor(amazon$fit)
amazon$work <- as.factor(amazon$work)
amazon$best <- as.factor(amazon$best)

#logistic regression
model <- lrm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work + best, data = amazon)
model$stats

##              Obs        Max Deriv       Model L.R.             d.f. 
## 923.000000000000   0.000001750302 160.582582398933  10.000000000000 
##                P                C              Dxy            Gamma 
##   0.000000000000   0.702195799992   0.404391599985   0.413637411620 
##            Tau-a               R2            Brier                g 
##   0.202386352153   0.212923980756   0.211754023073   1.030001032041 
##               gr               gp 
##   2.801068725514   0.199269582278

exp(model$coefficients)

##     Intercept RawTokenCount     quality=1   recommend=1     problem=1 
##     0.8317641     0.9745465     2.9885216     4.5650686     0.8547398 
##        easy=1       great=1       price=1         fit=1        work=1 
##     6.7551719    13.9257322    10.2290961     3.6745332     1.8518662 
##        best=1 
##    15.2625034

#coded level: 1 = positive sentiment
#reference (comparison) level: 0 = negative sentiment

Coefficients

‘RawTokenCount’ (b = -0.03, p < 0.05) is a significant negative predictor of sentiment: the more words a review contains, the less likely it is to be positive. The odds ratio is 0.97, so each additional token multiplies the odds of positive sentiment by 0.97 (about a 2.5% decrease), controlling for other variables.

‘Quality’ (b = 1.09, p < 0.05) is a significant positive predictor: a review that includes ‘quality’ is more likely to be positive. The odds of positive sentiment for a review with the word ‘quality’ are 2.99 times the odds for a review without it, controlling for other variables.

‘Recommend’ (b = 1.52, p < 0.05) is a significant positive predictor: a review that includes ‘recommend’ is more likely to be positive. The odds of positive sentiment for a review with the word ‘recommend’ are 4.57 times the odds for a review without it, controlling for other variables.

‘Easy’ (b = 1.91, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘easy’ are 6.76 times the odds for a review without it, controlling for other variables.

‘Problem’ (b = -0.16, p > 0.05) has a negative coefficient but is not significant, so there is no evidence that using ‘problem’ predicts sentiment once the other variables are controlled for.

‘Great’ (b = 2.63, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘great’ are 13.93 times the odds for a review without it, controlling for other variables.

‘Price’ (b = 2.33, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘price’ are 10.23 times the odds for a review without it, controlling for other variables.

‘Fit’ (b = 1.30, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘fit’ are 3.67 times the odds for a review without it, controlling for other variables.

‘Work’ (b = 0.62, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘work’ are 1.85 times the odds for a review without it, controlling for other variables.

‘Best’ (b = 2.73, p < 0.05) is a significant positive predictor. The odds of positive sentiment for a review with the word ‘best’ are 15.26 times the odds for a review without it, controlling for other variables.
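The link between the coefficients (b) and the odds ratios can be checked directly, since each b is the natural log of its odds ratio. The sketch below hard-codes the odds ratios printed by `exp(model$coefficients)`; the names `odds_ratio`, `b`, `pct_change`, and `or_10_tokens` are introduced here for illustration only.

```r
# Odds ratios copied from exp(model$coefficients) above (intercept omitted)
odds_ratio <- c(RawTokenCount = 0.9745465, quality = 2.9885216,
                recommend = 4.5650686, problem = 0.8547398,
                easy = 6.7551719, great = 13.9257322,
                price = 10.2290961, fit = 3.6745332,
                work = 1.8518662, best = 15.2625034)

# Log-odds coefficients (b) are the natural log of the odds ratios
b <- round(log(odds_ratio), 2)

# Percent change in the odds of a positive review per one-unit increase
pct_change <- round((odds_ratio - 1) * 100, 1)

# Ten additional tokens: the per-token odds ratio compounds multiplicatively
or_10_tokens <- odds_ratio[["RawTokenCount"]]^10

data.frame(b, odds_ratio = round(odds_ratio, 2), pct_change)
or_10_tokens
```

For a count predictor such as RawTokenCount, the per-unit odds ratio looks small, but it compounds: ten extra tokens multiply the odds of a positive review by roughly 0.77.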

Variable Selection

Use either a backwards stepwise approach or the drop 1 approach to determine which predictors you would keep in the model.
Fit the model with glm function in order to test either approach
** Which predictors would you retain?

I will retain all the predictors except ‘problem’ (RawTokenCount, quality, recommend, easy, great, price, fit, work, and best), because ‘problem’ is not significant and removing it lowers the AIC from 1140.84 to 1138.96, the AIC of the best model.

model2 <- glm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work + best, family = 'binomial', data = amazon)
model2.bw <- step(model2, direction = 'backward')

## Start:  AIC=1140.84
## Sentiment ~ RawTokenCount + quality + recommend + problem + easy + 
##     great + price + fit + work + best
## 
##                 Df Deviance    AIC
## - problem        1   1119.0 1139.0
## <none>               1118.8 1140.8
## - RawTokenCount  1   1124.8 1144.8
## - work           1   1125.1 1145.1
## - fit            1   1127.1 1147.1
## - quality        1   1129.7 1149.7
## - recommend      1   1130.7 1150.7
## - easy           1   1133.7 1153.7
## - price          1   1140.7 1160.7
## - best           1   1142.9 1162.9
## - great          1   1182.1 1202.1
## 
## Step:  AIC=1138.96
## Sentiment ~ RawTokenCount + quality + recommend + easy + great + 
##     price + fit + work + best
## 
##                 Df Deviance    AIC
## <none>               1119.0 1139.0
## - RawTokenCount  1   1125.1 1143.1
## - work           1   1125.3 1143.3
## - fit            1   1127.3 1145.3
## - quality        1   1130.0 1148.0
## - recommend      1   1130.9 1148.9
## - easy           1   1133.9 1151.9
## - price          1   1141.0 1159.0
## - best           1   1143.2 1161.2
## - great          1   1182.1 1200.1

#The best model (stored in model2.bw) is model2 without 'problem', keeping the other 9 variables
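The AIC values in the stepwise output can be reproduced by hand, which makes it clearer why ‘problem’ is the only term worth dropping. This is a sketch under the assumption that the response is ungrouped binary data, where the residual deviance equals -2 times the log-likelihood; all variable names below are illustrative, and the deviances are the rounded values from the output.

```r
# Residual deviances copied from the step() output above
dev_full    <- 1118.8   # all 10 predictors
dev_reduced <- 1119.0   # 'problem' removed

# Number of estimated parameters, including the intercept
k_full    <- 11
k_reduced <- 10

# AIC = deviance + 2 * number of parameters (ungrouped binary data)
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

# Likelihood-ratio test for dropping 'problem' (1 degree of freedom)
lr_drop <- dev_reduced - dev_full
p_drop  <- pchisq(lr_drop, df = 1, lower.tail = FALSE)

c(aic_full = aic_full, aic_reduced = aic_reduced, p_drop = p_drop)
```

Dropping ‘problem’ raises the deviance by only about 0.2 while saving one parameter, so the AIC falls, and the likelihood-ratio test gives no evidence that the term is needed.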

Outliers

Use the model with glm function in order to test assumptions
Use the car library and the influencePlot() to create a picture of the outliers.
- ** Are there major outliers for this data?
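Yes, there are potential outliers: influencePlot() flags 5 observations (117, 304, 707, 777, and 883). Observations 117, 304, and 883 have studentized residuals beyond ±2, while 707 and 777 are flagged for high leverage (hat values). All Cook's distances are below 0.05, so none of these points strongly influences the fit.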

influencePlot(model2)

##        StudRes        Hat       CookD
## 117 -2.3943856 0.01281401 0.017871400
## 304 -2.2574130 0.04970667 0.045432453
## 707  0.5950919 0.07784123 0.001534950
## 777  0.7878636 0.06347970 0.002289307
## 883 -2.1927820 0.05626939 0.044674378
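One way to read this table is to compare each column against a rule-of-thumb cutoff. The sketch below copies the printed values into a data frame and applies three commonly used thresholds; the cutoffs (|StudRes| > 2, Hat > 2p/n, CookD > 1) are assumptions chosen for illustration, not rules taken from the assignment, and other references use different values.

```r
# Values copied from the influencePlot() output above
infl <- data.frame(
  row     = c(117, 304, 707, 777, 883),
  StudRes = c(-2.3943856, -2.2574130, 0.5950919, 0.7878636, -2.1927820),
  Hat     = c(0.01281401, 0.04970667, 0.07784123, 0.06347970, 0.05626939),
  CookD   = c(0.017871400, 0.045432453, 0.001534950, 0.002289307, 0.044674378)
)

n <- 923   # observations in the model
p <- 11    # estimated parameters, including the intercept

# Rule-of-thumb cutoffs (assumptions, not hard rules)
infl$large_resid   <- abs(infl$StudRes) > 2
infl$high_leverage <- infl$Hat > 2 * p / n
infl$influential   <- infl$CookD > 1

infl
```

Under these cutoffs, rows 117, 304, and 883 have large residuals, four of the five rows have high leverage, and no row exceeds the Cook's distance cutoff.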

Assumptions

Explore the vif values of the original model (not the interaction model) and determine if you meet the assumption of additivity (meaning no multicollinearity).
** Are there issues with multicollinearity?

All VIF values are close to 1 (the largest is 1.04), well below the common cutoff of 5, so there are no issues with multicollinearity.

rms::vif(model2)

## RawTokenCount      quality1    recommend1      problem1         easy1 
##      1.042163      1.031112      1.005041      1.010456      1.006067 
##        great1        price1          fit1         work1         best1 
##      1.011522      1.002408      1.017167      1.029214      1.003880
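To see what a VIF near 1 means, it can be converted back to the share of a predictor's variance that the other predictors account for. The sketch below uses the linear-model definition VIF = 1 / (1 - R2); for a logistic model this reading is only approximate, and the names `vif_values` and `r2_shared` are introduced here for illustration.

```r
# VIF values copied from the output above
vif_values <- c(RawTokenCount = 1.042163, quality = 1.031112,
                recommend = 1.005041, problem = 1.010456,
                easy = 1.006067, great = 1.011522,
                price = 1.002408, fit = 1.017167,
                work = 1.029214, best = 1.003880)

# VIF_j = 1 / (1 - R2_j), so R2_j = 1 - 1 / VIF_j
r2_shared <- 1 - 1 / vif_values

round(r2_shared, 3)

# TRUE would indicate a predictor above the common cutoff of 5
any(vif_values > 5)
```

Even the largest VIF (RawTokenCount) corresponds to only about 4% shared variance with the other predictors.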

Test for Overfitting

Use the validate function to test for overfitting.
** Is there evidence of overfitting?

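The optimism values are small (0.017 for Dxy and 0.034 for R2), and there is not a big difference between the training R2 (0.230) and the test R2 (0.196). The corrected calibration slope of 0.85 points to only mild shrinkage, so there is no strong evidence of overfitting.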

m.boot <- lrm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work + best, data = amazon, x = TRUE, y = TRUE)
validate(m.boot, B = 100)

##           index.orig training    test optimism index.corrected   n
## Dxy           0.4044   0.4125  0.3958   0.0167          0.3877 100
## R2            0.2129   0.2300  0.1956   0.0344          0.1786 100
## Intercept     0.0000   0.0000 -0.0173   0.0173         -0.0173 100
## Slope         1.0000   1.0000  0.8548   0.1452          0.8548 100
## Emax          0.0000   0.0000  0.0378   0.0378          0.0378 100
## D             0.1729   0.1884  0.1577   0.0307          0.1422 100
## U            -0.0022  -0.0022  0.0041  -0.0063          0.0041 100
## Q             0.1751   0.1906  0.1536   0.0370          0.1381 100
## B             0.2118   0.2086  0.2139  -0.0054          0.2171 100
## g             1.0300   1.1816  0.9987   0.1829          0.8471 100
## gp            0.1993   0.2035  0.1850   0.0184          0.1808 100
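The columns of this table are related by simple arithmetic: optimism is the training value minus the test value, and the corrected index is the original value minus the optimism. The sketch below recomputes three rows from the rounded numbers printed above, so results can differ from the table in the last decimal place; the data frame `val` and the name `c_corrected` are illustrative.

```r
# Rows copied from the validate() output above
val <- data.frame(
  index      = c("Dxy", "R2", "Slope"),
  index.orig = c(0.4044, 0.2129, 1.0000),
  training   = c(0.4125, 0.2300, 1.0000),
  test       = c(0.3958, 0.1956, 0.8548)
)

# Optimism = average training performance minus average test performance
val$optimism <- val$training - val$test

# Optimism-corrected index = original index minus optimism
val$index.corrected <- val$index.orig - val$optimism

# Corrected Dxy converted back to a corrected C statistic: C = Dxy / 2 + 0.5
c_corrected <- val$index.corrected[val$index == "Dxy"] / 2 + 0.5

val
c_corrected
```

The corrected C statistic stays close to the original 0.70, which is consistent with only a small amount of overfitting.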