Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

Understanding whether a written review is positive or negative can be tricky, because the context of what is being reviewed and other factors can affect its sentiment. In this assignment, you will investigate whether the words used in a review can predict its sentiment. The datasets come from a Kaggle project with labelled sentences, which you can check out here [https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set].

For this assignment, you can pick one of three datasets to analyze: Amazon Reviews, Yelp Reviews, or Movie Reviews. The first column in each dataset is a measure of the sentiment of the review (0 = negative, 1 = positive) and the second is the number of tokens (or words) in the review. The rest of the columns are words that were either used (coded as 1) or not used (coded as 0) in the review. The sentiment of the review should be used as the outcome in your binary logistic regression. For your predictors, choose 10-20 words to test whether use of those words predicts the sentiment of the review.

# When using a VM, you may need to install the packages 'gridExtra', 'htmlwidgets', and 'readxl' first

install.packages("gridExtra")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("htmlwidgets")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("readxl")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(gridExtra)
library(htmlwidgets)
library(readxl)
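
Calling install.packages() on every run reinstalls packages that are already present. A minimal sketch of one way to avoid that is shown below; missing_packages is a hypothetical helper written for this example, not part of the assignment, and the install call is left commented out so the chunk has no side effects.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("gridExtra", "htmlwidgets", "readxl")
to_install <- missing_packages(needed)

# Report what is missing; nothing happens when everything is present.
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```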

# Read the data from amazon reviews 

amazon_reviews <- read.csv("amazon reviews.csv")

Choosing Predictors
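
The report does not show how the predictor words were picked. One possible approach, sketched here on a small made-up data frame, is to count how many reviews use each word and keep the most frequent ones. The data frame toy_reviews and the column name n_tokens are assumptions for illustration; only the layout (sentiment first, token count second, word indicators after) comes from the assignment description.

```r
# Toy stand-in for amazon_reviews (values are made up for illustration).
toy_reviews <- data.frame(
  Sentiment = c(1, 1, 0, 0, 1, 0),
  n_tokens  = c(5, 7, 4, 6, 3, 8),  # hypothetical column name
  great     = c(1, 1, 0, 0, 1, 0),
  price     = c(0, 1, 1, 0, 0, 0),
  broken    = c(0, 0, 1, 0, 0, 0)
)

# Word indicators start in the third column.
word_cols <- toy_reviews[, -(1:2)]

# Number of reviews that use each word, most frequent first.
word_counts <- sort(colSums(word_cols), decreasing = TRUE)
word_counts

# Keep the most frequent words as candidates
# (top 2 here; the assignment asks for 10-20).
candidates <- names(word_counts)[1:2]
candidates
```

Frequency alone says nothing about direction of effect, so the final list would still need a judgment call about which words plausibly carry sentiment.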

Running a Binary Logistic Regression

library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
model = lrm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, data = amazon_reviews)

model
## Logistic Regression Model
##  
##  lrm(formula = Sentiment ~ excellent + great + quality + fit + 
##      problem + recommend + love + price + nice + buy + look + 
##      best + item + easy, data = amazon_reviews)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs           923    LR chi2     238.21    R2       0.303    C       0.722    
##   0            467    d.f.            14    g        1.571    Dxy     0.443    
##   1            456    Pr(> chi2) <0.0001    gr       4.814    gamma   0.710    
##  max |deriv| 0.003                          gp       0.223    tau-a   0.222    
##                                             Brier    0.195                     
##  
##            Coef    S.E.    Wald Z Pr(>|Z|)
##  Intercept -0.5640  0.0853 -6.62  <0.0001 
##  excellent  3.3521  1.0314  3.25  0.0012  
##  great      2.8666  0.4429  6.47  <0.0001 
##  quality    0.7855  0.3558  2.21  0.0273  
##  fit        1.1947  0.4746  2.52  0.0118  
##  problem   -0.0821  0.4536 -0.18  0.8564  
##  recommend  1.6626  0.5005  3.32  0.0009  
##  love       9.5223 19.7050  0.48  0.6289  
##  price      2.2113  0.6369  3.47  0.0005  
##  nice       3.0777  1.0411  2.96  0.0031  
##  buy       -0.8546  0.5222 -1.64  0.1017  
##  look      -0.2942  0.5190 -0.57  0.5708  
##  best       2.7627  0.7488  3.69  0.0002  
##  item      -0.7176  0.5583 -1.29  0.1987  
##  easy       1.8904  0.5686  3.32  0.0009  
## 
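
The coefficients above are on the log-odds scale, which is hard to read directly. A short sketch of converting them to odds ratios with exp() follows; the values are copied from the lrm output, and love is left out here because its standard error (19.7050) suggests the estimate is unstable.

```r
# Coefficients copied from the lrm output above (log-odds scale).
coefs <- c(excellent = 3.3521, great = 2.8666, quality = 0.7855,
           fit = 1.1947, problem = -0.0821, recommend = 1.6626,
           price = 2.2113, nice = 3.0777, buy = -0.8546,
           look = -0.2942, best = 2.7627, item = -0.7176, easy = 1.8904)

# exp() turns a log-odds coefficient into an odds ratio:
# values above 1 raise the odds of a positive review, values below 1 lower them.
odds_ratios <- exp(coefs)
round(sort(odds_ratios, decreasing = TRUE), 2)
```
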
model_glm = glm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, family='binomial', data = amazon_reviews)

model_glm
## 
## Call:  glm(formula = Sentiment ~ excellent + great + quality + fit + 
##     problem + recommend + love + price + nice + buy + look + 
##     best + item + easy, family = "binomial", data = amazon_reviews)
## 
## Coefficients:
## (Intercept)    excellent        great      quality          fit      problem  
##     -0.5640       3.3521       2.8666       0.7855       1.1947      -0.0821  
##   recommend         love        price         nice          buy         look  
##      1.6626      16.8285       2.2113       3.0777      -0.8546      -0.2942  
##        best         item         easy  
##      2.7627      -0.7176       1.8904  
## 
## Degrees of Freedom: 922 Total (i.e. Null);  908 Residual
## Null Deviance:       1279 
## Residual Deviance: 1041  AIC: 1071
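
The two fits agree on every coefficient except love (9.5223 from lrm, 16.8285 from glm). One possible explanation, which the output shown here does not confirm, is that love appears only in reviews of one sentiment, so the estimate has no finite maximum. A quick way to check is a cross-tabulation; the sketch below uses made-up data in place of amazon_reviews.

```r
# Toy data (made up): the word "love" appears only in positive reviews.
toy <- data.frame(
  Sentiment = c(1, 1, 1, 0, 0, 0, 1, 0),
  love      = c(1, 1, 0, 0, 0, 0, 0, 0)
)

# Cross-tabulate word use against sentiment.
tab <- table(love = toy$love, Sentiment = toy$Sentiment)
tab

# A zero count in the love = 1 row means one class never uses the word.
has_empty_cell <- any(tab["1", ] == 0)
has_empty_cell
```
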
model$stats
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
## 9.230000e+02 2.575801e-03 2.382063e+02 1.400000e+01 0.000000e+00 7.216673e-01 
##          Dxy        Gamma        Tau-a           R2        Brier            g 
## 4.433346e-01 7.104029e-01 2.218762e-01 3.033014e-01 1.946227e-01 1.571483e+00 
##           gr           gp 
## 4.813782e+00 2.233455e-01
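
The Model L.R. statistic reported by lrm can be cross-checked against the glm output, since it is the drop from the null deviance to the residual deviance. The sketch below uses the rounded deviances printed above, so it only approximates the reported 238.21.

```r
null_dev  <- 1279       # Null Deviance from the glm output
resid_dev <- 1041       # Residual Deviance from the glm output
df_model  <- 922 - 908  # null d.f. minus residual d.f. = number of predictors

# Likelihood ratio chi-square and its p-value.
lr_chi2 <- null_dev - resid_dev
p_value <- pchisq(lr_chi2, df = df_model, lower.tail = FALSE)

lr_chi2
p_value
```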

Coefficients

m = glm(Sentiment ~ excellent + great + quality + problem + recommend + love  + price + nice +  buy + best + look + item + fit + easy, family = 'binomial', data = amazon_reviews)
m.bw = step(m, direction = 'backward')
## Start:  AIC=1071.21
## Sentiment ~ excellent + great + quality + problem + recommend + 
##     love + price + nice + buy + best + look + item + fit + easy
## 
##             Df Deviance    AIC
## - problem    1   1041.2 1069.2
## - look       1   1041.5 1069.5
## - item       1   1042.9 1070.9
## <none>           1041.2 1071.2
## - buy        1   1044.3 1072.3
## - quality    1   1046.1 1074.1
## - fit        1   1048.0 1076.0
## - recommend  1   1054.1 1082.1
## - easy       1   1055.4 1083.4
## - price      1   1058.9 1086.9
## - nice       1   1060.5 1088.5
## - best       1   1066.2 1094.2
## - excellent  1   1067.8 1095.8
## - love       1   1081.0 1109.0
## - great      1   1117.1 1145.1
## 
## Step:  AIC=1069.24
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + look + item + fit + easy
## 
##             Df Deviance    AIC
## - look       1   1041.6 1067.6
## - item       1   1043.0 1069.0
## <none>           1041.2 1069.2
## - buy        1   1044.3 1070.3
## - quality    1   1046.2 1072.2
## - fit        1   1048.1 1074.1
## - recommend  1   1054.2 1080.2
## - easy       1   1055.5 1081.5
## - price      1   1059.0 1085.0
## - nice       1   1060.6 1086.6
## - best       1   1066.3 1092.3
## - excellent  1   1067.9 1093.9
## - love       1   1081.1 1107.1
## - great      1   1117.2 1143.2
## 
## Step:  AIC=1067.57
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + item + fit + easy
## 
##             Df Deviance    AIC
## - item       1   1043.3 1067.3
## <none>           1041.6 1067.6
## - buy        1   1044.6 1068.6
## - quality    1   1046.5 1070.5
## - fit        1   1048.5 1072.5
## - recommend  1   1054.6 1078.6
## - easy       1   1055.9 1079.9
## - price      1   1059.5 1083.5
## - nice       1   1060.6 1084.6
## - best       1   1066.8 1090.8
## - excellent  1   1068.4 1092.4
## - love       1   1081.2 1105.2
## - great      1   1117.5 1141.5
## 
## Step:  AIC=1067.31
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + fit + easy
## 
##             Df Deviance    AIC
## <none>           1043.3 1067.3
## - buy        1   1046.2 1068.2
## - quality    1   1048.4 1070.4
## - fit        1   1050.4 1072.4
## - recommend  1   1055.2 1077.2
## - easy       1   1057.9 1079.9
## - price      1   1060.9 1082.9
## - nice       1   1062.7 1084.7
## - best       1   1068.8 1090.8
## - excellent  1   1070.4 1092.4
## - love       1   1083.3 1105.3
## - great      1   1117.6 1139.6
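
The AIC values in the step() trace can be reproduced from the deviances. For a binary outcome with one row per review, the deviance equals minus twice the log-likelihood, so AIC is the deviance plus two times the number of estimated parameters. The function name aic_from_deviance is made up for this sketch.

```r
# Hypothetical helper: AIC for an ungrouped binary logistic model.
aic_from_deviance <- function(deviance, n_predictors) {
  k <- n_predictors + 1  # add one for the intercept
  deviance + 2 * k
}

aic_full  <- aic_from_deviance(1041.2, 14)  # starting model, 14 words
aic_final <- aic_from_deviance(1043.3, 11)  # final model, 11 words

aic_full
aic_final
```

These should land close to the 1071.21 and 1067.31 printed in the trace, with small differences due to rounding of the deviances.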

Outliers

library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following objects are masked from 'package:rms':
## 
##     Predict, vif
# Refit the model with the predictors retained by backward selection
model_eff = glm(Sentiment ~ excellent + great + quality + recommend + love + 
    price + nice + buy + best + fit + easy, family = 'binomial', data = amazon_reviews)

influencePlot(model_eff)

##        StudRes        Hat       CookD
## 302 -2.5867628 0.05782218 0.088158881
## 316 -1.3057125 0.11749738 0.014705495
## 363 -2.4738405 0.07646019 0.089534269
## 429 -0.9563137 0.08564766 0.004612385
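
To read the influencePlot() table, the four flagged rows can be compared against common rules of thumb. The cutoffs used below (absolute studentized residual above 2, hat value above 2p/n, Cook's distance above 1) are conventions assumed for this sketch, not thresholds set by the assignment.

```r
# Values copied from the influencePlot() output above.
flagged <- data.frame(
  row     = c(302, 316, 363, 429),
  StudRes = c(-2.5867628, -1.3057125, -2.4738405, -0.9563137),
  Hat     = c(0.05782218, 0.11749738, 0.07646019, 0.08564766),
  CookD   = c(0.088158881, 0.014705495, 0.089534269, 0.004612385)
)

n <- 923  # observations in the model
p <- 12   # 11 predictors + intercept in model_eff

# Apply each rule of thumb to the flagged rows.
flagged$large_resid   <- abs(flagged$StudRes) > 2
flagged$high_leverage <- flagged$Hat > 2 * p / n
flagged$cook_above_1  <- flagged$CookD > 1
flagged
```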

Assumptions

rms::vif(m)
## excellent     great   quality   problem recommend      love     price      nice 
##  1.001591  1.026305  1.007351  1.015389  1.039028  1.000000  1.005348  1.003189 
##       buy      best      look      item       fit      easy 
##  1.005758  1.002253  1.006201  1.068249  1.006492  1.004792
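
As an illustration of what a variance inflation factor measures, the sketch below computes the textbook version, 1 / (1 - R^2), by regressing each predictor on the others. It runs on simulated indicators, not on amazon_reviews, and rms::vif derives its values from the covariance matrix of the fitted coefficients, so the two need not match exactly.

```r
set.seed(1)
n  <- 200
x1 <- rbinom(n, 1, 0.3)
x2 <- rbinom(n, 1, 0.3)
x3 <- ifelse(runif(n) < 0.8, x1, 1 - x1)  # mostly copies x1, so it is collinear

predictors <- data.frame(x1, x2, x3)

# Regress each predictor on the others and convert R^2 to a VIF.
vif_by_hand <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
round(vif_by_hand, 2)
```

Values near 1, like those reported above, indicate that a predictor is nearly uncorrelated with the rest.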

Test for Overfitting

model.boot = lrm(Sentiment ~ excellent + great + quality + recommend + love + 
price + nice + buy + best + fit + easy, data = amazon_reviews, x = TRUE, y = TRUE)

validate(model.boot, B=100)
##           index.orig training    test optimism index.corrected   n
## Dxy           0.4349   0.4401  0.4302   0.0099          0.4250 100
## R2            0.3009   0.3172  0.2779   0.0393          0.2617 100
## Intercept     0.0000   0.0000 -0.0465   0.0465         -0.0465 100
## Slope         1.0000   1.0000  0.8329   0.1671          0.8329 100
## Emax          0.0000   0.0000  0.0486   0.0486          0.0486 100
## D             0.2547   0.2709  0.2328   0.0381          0.2166 100
## U            -0.0022  -0.0022  0.0062  -0.0084          0.0062 100
## Q             0.2569   0.2731  0.2265   0.0465          0.2104 100
## B             0.1951   0.1923  0.1970  -0.0046          0.1997 100
## g             1.5440   1.9360  1.5859   0.3501          1.1939 100
## gp            0.2189   0.2215  0.2053   0.0162          0.2027 100
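
The columns of the validate() table are related by simple arithmetic: optimism is the training value minus the test value, and the corrected index is the original value minus the optimism. The sketch below redoes that arithmetic for three rows copied from the table, then converts the corrected Dxy to a concordance index using C = 0.5 + Dxy / 2.

```r
# Rows copied from the validate() output above.
idx <- data.frame(
  index      = c("Dxy", "R2", "g"),
  index.orig = c(0.4349, 0.3009, 1.5440),
  training   = c(0.4401, 0.3172, 1.9360),
  test       = c(0.4302, 0.2779, 1.5859)
)

# Optimism and bias-corrected estimates.
idx$optimism  <- idx$training - idx$test
idx$corrected <- idx$index.orig - idx$optimism
idx

# Bias-corrected concordance index from the corrected Dxy.
c_corrected <- 0.5 + idx$corrected[idx$index == "Dxy"] / 2
c_corrected
```

Small differences from the printed table (for example in R2) come from rounding in the displayed values.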

Discussion Question