Understanding whether a written review is positive or negative can be tricky, because the context of what is being reviewed and other factors can affect the sentiment of a review. In this project, I investigate whether the words used in a review can predict its sentiment. The datasets come from a Kaggle project with labelled sentences, which you can check out here [https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set].
For this project, I have chosen the Amazon reviews. The first column in the dataset is the sentiment of the review (0 = negative, 1 = positive) and the second is the number of tokens (or words) in the review. The remaining columns are words that were either used (coded as 1) or not used (coded as 0) in the review. The sentiment of the review is the outcome in the binary logistic regression. For the predictors, 5-10 words are chosen to test whether use of those words predicts the sentiment of the review.
# When using a VM, you may need to install the packages 'gridExtra', 'htmlwidgets', and 'readxl' first
library(readr)
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following objects are masked from 'package:rms':
##
## Predict, vif
I will use RawTokenCount, quality, recommend, problem, easy, great, price, fit, work and best as predictors, because they should give a clear picture of positive sentiment. Personally, these are the words I use most when I am writing a review.
χ²(10) = 160.58, p < 0.001. Yes, the model is significant, as the p value is less than 0.001. Compared with a model with no predictors, this model has less error, so the chosen predictors can predict the sentiment.
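As a quick check, the p value can be recomputed from the likelihood-ratio statistic and its degrees of freedom with base R's pchisq(). This is a minimal sketch: the two numbers are copied from the model$stats output further down, and the variable names are illustrative only.

```r
# Likelihood-ratio chi-square and degrees of freedom, copied from model$stats
lr_chisq <- 160.582582398933
lr_df <- 10

# Upper-tail probability that a chi-square(10) variable exceeds the statistic
p_value <- pchisq(lr_chisq, df = lr_df, lower.tail = FALSE)

# TRUE when the model beats the no-predictor model at the 0.001 level
p_value < 0.001
```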
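R² is the effect size: it describes how well the model fits the data. Here R² = .213, so the model explains about 21.3% of the variance. Because lrm() reports the Nagelkerke pseudo-R², this is an approximate index rather than a literal proportion of variance.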
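The reported R² can be reproduced by hand from numbers printed elsewhere in this report. The sketch below assumes the residual deviance can be backed out of the full-model AIC shown in the step() output (for a 0/1 outcome, AIC = deviance + 2 × number of parameters); the variable names are my own.

```r
n <- 923                        # Obs from model$stats
lr_chisq <- 160.582582398933    # Model L.R. from model$stats

# Assumption: residual deviance = AIC - 2 * (10 slopes + 1 intercept)
resid_dev <- 1140.84 - 2 * 11
null_dev <- resid_dev + lr_chisq

# Cox-Snell R2 and its maximum attainable value
cox_snell <- 1 - exp(-lr_chisq / n)
max_cox_snell <- 1 - exp(-null_dev / n)

# Nagelkerke R2 rescales Cox-Snell so that its maximum is 1
nagelkerke_r2 <- cox_snell / max_cox_snell
round(nagelkerke_r2, 3)
```

The result agrees with the R2 entry of model$stats to three decimals.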
I was able to predict the sentiment in an acceptable manner, as the C statistic is 0.70. The C statistic is the proportion of positive/negative review pairs in which the positive review receives the higher predicted probability. It equals the area under the ROC curve and measures how well the model discriminates between the two outcomes (0.5 is chance, 1 is perfect).
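To make the definition concrete, here is a small sketch that computes the C statistic from predicted probabilities and true labels using the rank-sum identity. The function name c_statistic and the toy vectors are hypothetical and are not taken from the Amazon data; only the final line uses a number from model$stats.

```r
# Concordance (C) statistic via the rank-sum identity; tied pairs count as 1/2
c_statistic <- function(prob, outcome) {
  n_pos <- sum(outcome == 1)
  n_neg <- sum(outcome == 0)
  rank_sum <- sum(rank(prob)[outcome == 1])
  (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities and true labels
toy_prob <- c(0.90, 0.80, 0.35, 0.60, 0.30, 0.10)
toy_outcome <- c(1, 1, 1, 0, 0, 0)

# 8 of the 9 positive/negative pairs are ordered correctly
c_toy <- c_statistic(toy_prob, toy_outcome)
c_toy

# Somers' Dxy reported by lrm is a rescaling of C: Dxy = 2 * (C - 0.5)
dxy_from_c <- 2 * (0.702195799992 - 0.5)
dxy_from_c
```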
options(scipen = 99)
#loading the data
amazon_reviews <- read_csv("C:/Users/Gautam/OneDrive/HU/3rdsem/ANLY540/Assignments540/amazon reviews.csv")
## Parsed with column specification:
## cols(
## .default = col_double()
## )
## See spec(...) for full column specifications.
amazon <- amazon_reviews
head(amazon)
## # A tibble: 6 x 30
## Sentiment RawTokenCount case excellent great money sound quality time
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 6 1 1 0 0 0 0 0
## 2 0 18 0 0 0 0 0 0 0
## 3 1 11 0 0 0 0 0 0 0
## 4 0 23 0 0 0 0 0 0 0
## 5 0 9 0 0 0 1 0 0 0
## 6 1 7 0 0 1 0 1 1 0
## # ... with 21 more variables: battery <dbl>, ear <dbl>, charge <dbl>,
## # problem <dbl>, recommend <dbl>, phone <dbl>, love <dbl>, best <dbl>,
## # headset <dbl>, work <dbl>, nice <dbl>, product <dbl>, price <dbl>,
## # buy <dbl>, call <dbl>, look <dbl>, purchase <dbl>, item <dbl>, fit <dbl>,
## # service <dbl>, easy <dbl>
#converting the variables to factor
amazon$Sentiment <- as.factor(amazon$Sentiment)
amazon$quality <- as.factor(amazon$quality)
amazon$recommend <- as.factor(amazon$recommend)
amazon$problem <- as.factor(amazon$problem)
amazon$easy <- as.factor(amazon$easy)
amazon$great <- as.factor(amazon$great)
amazon$price <- as.factor(amazon$price)
amazon$fit <- as.factor(amazon$fit)
amazon$work <- as.factor(amazon$work)
amazon$best <- as.factor(amazon$best)
#logistic regression
model = lrm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work+best,data = amazon)
model$stats
## Obs Max Deriv Model L.R. d.f.
## 923.000000000000 0.000001750302 160.582582398933 10.000000000000
## P C Dxy Gamma
## 0.000000000000 0.702195799992 0.404391599985 0.413637411620
## Tau-a R2 Brier g
## 0.202386352153 0.212923980756 0.211754023073 1.030001032041
## gr gp
## 2.801068725514 0.199269582278
exp(model$coefficients)
## Intercept RawTokenCount quality=1 recommend=1 problem=1
## 0.8317641 0.9745465 2.9885216 4.5650686 0.8547398
## easy=1 great=1 price=1 fit=1 work=1
## 6.7551719 13.9257322 10.2290961 3.6745332 1.8518662
## best=1
## 15.2625034
#coded - 1 positive sentiment
#comparison (reference)- 0 - negative sentiment
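The exponentiated coefficients above are odds ratios; taking logs returns the b values on the log-odds scale. The following is a minimal sketch with the odds ratios typed in from the output (the vector names are my own), which also expresses each odds ratio as a percent change in the odds.

```r
# Odds ratios copied from the exp(model$coefficients) output
odds_ratio <- c(RawTokenCount = 0.9745465, quality = 2.9885216,
                recommend = 4.5650686, problem = 0.8547398,
                easy = 6.7551719, great = 13.9257322,
                price = 10.2290961, fit = 3.6745332,
                work = 1.8518662, best = 15.2625034)

# Taking logs recovers the coefficients b on the log-odds scale
b <- log(odds_ratio)
round(b, 2)

# Percent change in the odds of a positive review per unit increase
pct_change <- 100 * (odds_ratio - 1)
round(pct_change, 1)
```

An odds ratio below 1 (negative b) lowers the odds of a positive review, and one above 1 (positive b) raises them.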
‘RawTokenCount’ (b = -0.03, p < 0.05) is a significant negative predictor of sentiment: the more words a review contains, the less likely it is to be positive. Each additional token multiplies the odds of positive sentiment by 0.97 (about a 2.5% decrease), controlling for other variables.
‘Quality’ (b = 1.09, p < 0.05) is a significant positive predictor of sentiment: a review that includes the word ‘quality’ is more likely to be positive. The odds of positive sentiment are 2.99 times higher for a review containing ‘quality’ than for one without it, controlling for other variables.
‘Recommend’ (b = 1.52, p < 0.05) is a significant positive predictor of sentiment: a review that includes the word ‘recommend’ is more likely to be positive. The odds of positive sentiment are 4.57 times higher for a review containing ‘recommend’ than for one without it, controlling for other variables.
‘Easy’ (b = 1.91, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 6.76 times higher for a review containing ‘easy’ than for one without it, controlling for other variables.
‘Problem’ (b = -0.16, p > 0.05) is not a significant predictor. Its coefficient is negative, but there is no reliable evidence that the word ‘problem’ changes the odds of positive sentiment.
‘Great’ (b = 2.63, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 13.93 times higher for a review containing ‘great’ than for one without it, controlling for other variables.
‘Price’ (b = 2.33, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 10.23 times higher for a review containing ‘price’ than for one without it, controlling for other variables.
‘Fit’ (b = 1.30, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 3.67 times higher for a review containing ‘fit’ than for one without it, controlling for other variables.
‘Work’ (b = 0.62, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 1.85 times higher for a review containing ‘work’ than for one without it, controlling for other variables.
‘Best’ (b = 2.73, p < 0.05) is a significant positive predictor of sentiment. The odds of positive sentiment are 15.26 times higher for a review containing ‘best’ than for one without it, controlling for other variables.
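Odds ratios can be hard to picture, so it may help to turn the coefficients into predicted probabilities for a concrete case. The sketch below uses a hypothetical 10-token review and the odds ratios printed by exp(model$coefficients); all other word indicators are assumed to be 0, and the variable names are illustrative.

```r
# Coefficients on the log-odds scale, recovered from the printed odds ratios
b_intercept <- log(0.8317641)
b_tokens <- log(0.9745465)
b_great <- log(13.9257322)

# Hypothetical 10-token review: without any chosen word, then with 'great'
logit_without <- b_intercept + 10 * b_tokens
logit_with_great <- logit_without + b_great

# plogis() is the inverse logit: it maps log-odds to a probability
p_without <- plogis(logit_without)
p_with_great <- plogis(logit_with_great)
round(c(without = p_without, with_great = p_with_great), 2)
```

Under these assumptions, adding the single word ‘great’ moves the predicted probability of a positive review from below one half to well above it.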
I will retain all the predictors (RawTokenCount, quality, recommend, easy, great, price, fit, work and best) except ‘problem’, because it is not significant. The AIC for the best model is 1138.96.
model2 = glm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work+best,family = 'binomial',data = amazon)
model2.bw = step(model2, direction = 'backward')
## Start: AIC=1140.84
## Sentiment ~ RawTokenCount + quality + recommend + problem + easy +
## great + price + fit + work + best
##
## Df Deviance AIC
## - problem 1 1119.0 1139.0
## <none> 1118.8 1140.8
## - RawTokenCount 1 1124.8 1144.8
## - work 1 1125.1 1145.1
## - fit 1 1127.1 1147.1
## - quality 1 1129.7 1149.7
## - recommend 1 1130.7 1150.7
## - easy 1 1133.7 1153.7
## - price 1 1140.7 1160.7
## - best 1 1142.9 1162.9
## - great 1 1182.1 1202.1
##
## Step: AIC=1138.96
## Sentiment ~ RawTokenCount + quality + recommend + easy + great +
## price + fit + work + best
##
## Df Deviance AIC
## <none> 1119.0 1139.0
## - RawTokenCount 1 1125.1 1143.1
## - work 1 1125.3 1143.3
## - fit 1 1127.3 1145.3
## - quality 1 1130.0 1148.0
## - recommend 1 1130.9 1148.9
## - easy 1 1133.9 1151.9
## - price 1 1141.0 1159.0
## - best 1 1143.2 1161.2
## - great 1 1182.1 1200.1
#The best model is model2 without 'problem', keeping the other 9 variables
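The AIC column in the step() output can be checked by hand, since for this model AIC = deviance + 2 × number of estimated parameters. A short sketch with the deviances read from the table above (variable names are my own):

```r
# Residual deviances read from the step() output
dev_full <- 1118.8      # all 10 predictors
dev_reduced <- 1119.0   # 'problem' removed

# Number of estimated parameters: slopes plus the intercept
k_full <- 10 + 1
k_reduced <- 9 + 1

aic_full <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced
c(full = aic_full, reduced = aic_reduced)

# Dropping 'problem' lowers AIC because the deviance rises by less than 2
aic_reduced < aic_full
```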
I use the car library and influencePlot() to create a picture of the outliers.
influencePlot(model2)
## StudRes Hat CookD
## 117 -2.3943856 0.01281401 0.017871400
## 304 -2.2574130 0.04970667 0.045432453
## 707 0.5950919 0.07784123 0.001534950
## 777 0.7878636 0.06347970 0.002289307
## 883 -2.1927820 0.05626939 0.044674378
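One way to read this table is to compare each flagged row against common rule-of-thumb cutoffs. The cutoffs below (|studentized residual| > 2, hat value > 2k/n, Cook's distance > 4/n) are conventions that I am assuming for illustration; they are not part of the original analysis, and other texts use different thresholds.

```r
# Values copied from the influencePlot() output
flagged <- data.frame(
  row = c(117, 304, 707, 777, 883),
  StudRes = c(-2.3943856, -2.2574130, 0.5950919, 0.7878636, -2.1927820),
  Hat = c(0.01281401, 0.04970667, 0.07784123, 0.06347970, 0.05626939),
  CookD = c(0.017871400, 0.045432453, 0.001534950, 0.002289307, 0.044674378)
)

n <- 923   # observations
k <- 11    # estimated parameters (10 slopes + intercept)

# Rule-of-thumb cutoffs (assumed conventions, not from the original analysis)
flagged$large_resid <- abs(flagged$StudRes) > 2
flagged$high_leverage <- flagged$Hat > 2 * k / n
flagged$high_cook <- flagged$CookD > 4 / n
flagged
```

All Cook's distances in the table are far below 1, another commonly quoted threshold, so whether any row deserves removal is a judgment call.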
The VIF values of the original model (not the interaction model) are used to check the assumption of additivity (meaning no multicollinearity). As every VIF is less than 5, there are no issues with multicollinearity.
rms::vif(model2)
## RawTokenCount quality1 recommend1 problem1 easy1
## 1.042163 1.031112 1.005041 1.010456 1.006067
## great1 price1 fit1 work1 best1
## 1.011522 1.002408 1.017167 1.029214 1.003880
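For intuition about what a VIF measures, the textbook definition is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below applies that definition to simulated data; the helper vif_by_hand and the toy variables are hypothetical, and this is not necessarily how rms::vif computes its values for a logistic model.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                  # unrelated to x1
x3 <- x1 + rnorm(n, sd = 0.3)   # nearly a copy of x1
toy <- data.frame(x1, x2, x3)

# VIF of one predictor = 1 / (1 - R^2) from regressing it on the others
vif_by_hand <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vif_toy <- sapply(names(toy), function(v) vif_by_hand(toy, v))
round(vif_toy, 2)
```

In the toy data, x1 and x3 get large VIFs because each is almost determined by the other, while x2 stays near 1, which is the pattern seen for every predictor in the review model.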
The optimism values are not very high, and there is not a big difference between the R2 of the training and test samples (0.230 vs 0.196). So there is no evidence of overfitting.
m.boot = lrm(Sentiment ~ RawTokenCount + quality + recommend + problem + easy + great + price + fit + work+best,data = amazon, x = T, y = T)
validate(m.boot, B = 100)
## index.orig training test optimism index.corrected n
## Dxy 0.4044 0.4125 0.3958 0.0167 0.3877 100
## R2 0.2129 0.2300 0.1956 0.0344 0.1786 100
## Intercept 0.0000 0.0000 -0.0173 0.0173 -0.0173 100
## Slope 1.0000 1.0000 0.8548 0.1452 0.8548 100
## Emax 0.0000 0.0000 0.0378 0.0378 0.0378 100
## D 0.1729 0.1884 0.1577 0.0307 0.1422 100
## U -0.0022 -0.0022 0.0041 -0.0063 0.0041 100
## Q 0.1751 0.1906 0.1536 0.0370 0.1381 100
## B 0.2118 0.2086 0.2139 -0.0054 0.2171 100
## g 1.0300 1.1816 0.9987 0.1829 0.8471 100
## gp 0.1993 0.2035 0.1850 0.0184 0.1808 100
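The columns of this table are linked by simple arithmetic: optimism is training minus test, and the corrected index is the original index minus the optimism. A short sketch using the Dxy and R2 rows copied from the output (the data frame and variable names are my own):

```r
# Dxy and R2 rows copied from the validate() output
boot <- data.frame(
  index = c("Dxy", "R2"),
  orig = c(0.4044, 0.2129),
  training = c(0.4125, 0.2300),
  test = c(0.3958, 0.1956)
)

# Optimism = average training performance minus average test performance
boot$optimism <- boot$training - boot$test

# Corrected index = apparent (original) index minus optimism
boot$corrected <- boot$orig - boot$optimism
boot

# Optimism-corrected C statistic, using C = 0.5 + Dxy / 2
c_corrected <- 0.5 + boot$corrected[boot$index == "Dxy"] / 2
c_corrected
```

Small differences in the last decimal against the printed table come from rounding in the displayed values.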