Logistic Regression

Load the Libraries + Functions

Load all the libraries and functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

Understanding whether a written review is positive or negative can be tricky, as the context of what is being reviewed and other factors can affect the sentiment of a review. In this assignment, you will investigate whether the words used in reviews can predict their sentiment. The datasets come from a Kaggle project with labelled sentences, which you can check out here [https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set].

For this assignment, you can pick one of three datasets to analyze: Amazon Reviews, Yelp Reviews, or Movie Reviews. The first column in each dataset is a measure of the sentiment of the review (0 = negative, 1 = positive) and the second is the number of tokens (or words) in the review. The rest of the columns are words that were either used (coded as 1) or not used (coded as 0) in the review. The sentiment of the review should be used as the outcome in your binary logistic regression. For your predictors, choose 10-20 words to test whether the use of those words predicts the sentiment of the review.

# When using a VM, you may need to install the packages 'gridExtra', 'htmlwidgets', and 'readxl' first

install.packages("gridExtra")

## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

install.packages("htmlwidgets")

## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

install.packages("readxl")

## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

library(gridExtra)
library(htmlwidgets)
library(readxl)

# Read the data from amazon reviews 

amazon_reviews <- read.csv("amazon reviews.csv")

Choosing Predictors

To create a simple, easy-to-interpret model, pick 10-20 words from one of the datasets which capture the features most important to a good product (Amazon reviews), restaurant (Yelp reviews) or movie (Movie Reviews). Think about which words are strongly associated with either positive or negative reviews, as well as words that may have different meanings in a positive versus a negative review.
Which words did you choose as predictors and why?
- ANSWER: I chose excellent, love, great, recommend, nice, easy, quality, fit, item, best, price, look, buy, and problem.
  I think these words are mostly associated with positive reviews. I ignored other words that I think do not contribute to either positive or negative reviews. However, the significance of each predictor also depends on other factors, such as sample size. A backward stepwise selection procedure is used later to decide which predictors to keep.
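
One way to support a word choice like the one above is to compare how often each word is used in positive versus negative reviews. The sketch below uses a small toy data frame with made-up values (`toy`, and the three word columns in it, are hypothetical and not read from the Amazon file); with the real data the same call could be applied to `amazon_reviews` restricted to the sentiment column and the candidate words.

```r
# Toy data (hypothetical values, not taken from the Amazon file):
# one row per review, Sentiment plus 0/1 word indicators.
toy <- data.frame(
  Sentiment = c(1, 1, 1, 0, 0, 0),
  great     = c(1, 1, 0, 0, 0, 0),
  problem   = c(0, 0, 0, 1, 1, 0),
  item      = c(1, 0, 0, 1, 0, 0)
)

# Share of reviews that use each word, split by sentiment.
usage <- aggregate(. ~ Sentiment, data = toy, FUN = mean)

# Usage rate in positive reviews minus usage rate in negative reviews.
rate_diff <- unlist(usage[usage$Sentiment == 1, -1]) -
  unlist(usage[usage$Sentiment == 0, -1])

print(round(rate_diff, 2))
```

Words with a difference far from zero are natural candidates; words with a difference near zero (like `item` in the toy data) are unlikely to help.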

Running a Binary Logistic Regression

Run the logistic regression using the rms package. (Answer all questions below after ANSWER: using complete sentences)
- Use the \(\chi^2\) test - is the overall model predictive of sentiment? Is it significant?
- What is Nagelkerke’s pseudo-\(R^2\)? What does it tell you about goodness of fit?
- What is the C statistic? How well are you predicting sentiment?
- ANSWER:
1. The model is statistically significant, \(\chi^2\)(14) = 238.21, p < 0.0001, so the chosen words taken together predict sentiment better than an intercept-only model.
2. Nagelkerke’s pseudo-\(R^2\) is 0.303. It ranges between 0 and 1 and measures how much the model improves on the intercept-only model; it is not a correlation and not literally a proportion of variance explained. A value of 0.303 indicates a modest fit.
3. The C statistic is 0.72. It measures discrimination, that is, the probability that the model assigns a higher predicted probability to a randomly chosen positive review than to a randomly chosen negative one (0.5 is chance, 1 is perfect). A value of 0.72 indicates acceptable discrimination.

library(rms)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

model = lrm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, data = amazon_reviews)

model

## Logistic Regression Model
##  
##  lrm(formula = Sentiment ~ excellent + great + quality + fit + 
##      problem + recommend + love + price + nice + buy + look + 
##      best + item + easy, data = amazon_reviews)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs           923    LR chi2     238.21    R2       0.303    C       0.722    
##   0            467    d.f.            14    g        1.571    Dxy     0.443    
##   1            456    Pr(> chi2) <0.0001    gr       4.814    gamma   0.710    
##  max |deriv| 0.003                          gp       0.223    tau-a   0.222    
##                                             Brier    0.195                     
##  
##            Coef    S.E.    Wald Z Pr(>|Z|)
##  Intercept -0.5640  0.0853 -6.62  <0.0001 
##  excellent  3.3521  1.0314  3.25  0.0012  
##  great      2.8666  0.4429  6.47  <0.0001 
##  quality    0.7855  0.3558  2.21  0.0273  
##  fit        1.1947  0.4746  2.52  0.0118  
##  problem   -0.0821  0.4536 -0.18  0.8564  
##  recommend  1.6626  0.5005  3.32  0.0009  
##  love       9.5223 19.7050  0.48  0.6289  
##  price      2.2113  0.6369  3.47  0.0005  
##  nice       3.0777  1.0411  2.96  0.0031  
##  buy       -0.8546  0.5222 -1.64  0.1017  
##  look      -0.2942  0.5190 -0.57  0.5708  
##  best       2.7627  0.7488  3.69  0.0002  
##  item      -0.7176  0.5583 -1.29  0.1987  
##  easy       1.8904  0.5686  3.32  0.0009  
##

model_glm = glm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, family='binomial', data = amazon_reviews)

model_glm

## 
## Call:  glm(formula = Sentiment ~ excellent + great + quality + fit + 
##     problem + recommend + love + price + nice + buy + look + 
##     best + item + easy, family = "binomial", data = amazon_reviews)
## 
## Coefficients:
## (Intercept)    excellent        great      quality          fit      problem  
##     -0.5640       3.3521       2.8666       0.7855       1.1947      -0.0821  
##   recommend         love        price         nice          buy         look  
##      1.6626      16.8285       2.2113       3.0777      -0.8546      -0.2942  
##        best         item         easy  
##      2.7627      -0.7176       1.8904  
## 
## Degrees of Freedom: 922 Total (i.e. Null);  908 Residual
## Null Deviance:       1279 
## Residual Deviance: 1041  AIC: 1071

model$stats

##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
## 9.230000e+02 2.575801e-03 2.382063e+02 1.400000e+01 0.000000e+00 7.216673e-01 
##          Dxy        Gamma        Tau-a           R2        Brier            g 
## 4.433346e-01 7.104029e-01 2.218762e-01 3.033014e-01 1.946227e-01 1.571483e+00 
##           gr           gp 
## 4.813782e+00 2.233455e-01
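
The headline statistics can be checked by hand from the numbers printed above. The following is a rough sketch, assuming the usual definitions (Nagelkerke's \(R^2\) as the Cox-Snell \(R^2\) rescaled by its maximum, and C = 0.5 + Dxy / 2); all variable names are illustrative.

```r
# Values copied from the lrm output above.
n0 <- 467            # negative reviews
n1 <- 456            # positive reviews
n  <- n0 + n1        # 923 observations
lr_chi2 <- 238.2063  # model likelihood ratio chi-square
df <- 14             # one degree of freedom per word

# Deviance of the intercept-only model, computed from the class counts.
null_dev <- -2 * (n0 * log(n0 / n) + n1 * log(n1 / n))

# p-value of the likelihood ratio test.
p_value <- pchisq(lr_chi2, df = df, lower.tail = FALSE)

# Cox-Snell R2, then rescaled to a 0-1 range (Nagelkerke).
cox_snell  <- 1 - exp(-lr_chi2 / n)
nagelkerke <- cox_snell / (1 - exp(-null_dev / n))

# C statistic recovered from Somers' Dxy.
dxy <- 0.4433346
c_stat <- 0.5 + dxy / 2

print(round(c(null_dev = null_dev, nagelkerke = nagelkerke, c_stat = c_stat), 3))
```

The recomputed null deviance agrees with the value of 1279 reported by glm, and the recomputed \(R^2\) and C agree with the lrm output to three decimals.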

Coefficients

Explain each coefficient (i.e. each word you chose) - are they significant? What do they imply if they are significant (i.e., which sentiment does it predict)? (Should be at least a paragraph.)
- ANSWER:
  - excellent: p = 0.0012, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (3.35) means that using the word predicts positive sentiment.
  - great: p < 0.0001, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (2.87) means that using the word predicts positive sentiment.
  - quality: p = 0.0273, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (0.79) means that using the word predicts positive sentiment.
  - fit: p = 0.0118, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (1.19) means that using the word predicts positive sentiment.
  - problem: p = 0.8564, which is greater than 0.05, so the coefficient is not statistically significant. The coefficient is negative (-0.08), but it cannot be distinguished from zero, so the word does not reliably predict either sentiment.
  - recommend: p = 0.0009, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (1.66) means that using the word predicts positive sentiment.
  - love: p = 0.6289, which is greater than 0.05, so the Wald test is not significant even though the coefficient is large and positive. The standard error is very large (19.71), and lrm and glm report different estimates (9.52 versus 16.83), which usually indicates that the word occurs almost only in positive reviews (quasi-complete separation); in that situation the Wald test is unreliable. In the stepwise output, dropping love raises the deviance from 1041.2 to 1081.0, which suggests that it is in fact a strong predictor of positive sentiment.
  - price: p = 0.0005, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (2.21) means that using the word predicts positive sentiment.
  - nice: p = 0.0031, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (3.08) means that using the word predicts positive sentiment.
  - best: p = 0.0002, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (2.76) means that using the word predicts positive sentiment.
  - buy: p = 0.1017, which is greater than 0.05, so the coefficient is not statistically significant. The coefficient is negative (-0.85), but it cannot be distinguished from zero.
  - look: p = 0.5708, which is greater than 0.05, so the coefficient is not statistically significant. The coefficient is negative (-0.29), but it cannot be distinguished from zero.
  - item: p = 0.1987, which is greater than 0.05, so the coefficient is not statistically significant. The coefficient is negative (-0.72), but it cannot be distinguished from zero.
  - easy: p = 0.0009, which is less than 0.05, so the coefficient is statistically significant. The positive coefficient (1.89) means that using the word predicts positive sentiment.
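
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret. The sketch below copies the coefficients and standard errors from the lrm output and adds approximate Wald 95% intervals; love is left out because its estimate is unstable, and the data frame `est` and its column names are illustrative only.

```r
# Coefficients (log-odds) and standard errors copied from the lrm output.
est <- data.frame(
  word = c("excellent", "great", "quality", "fit", "problem", "recommend",
           "price", "nice", "buy", "look", "best", "item", "easy"),
  coef = c(3.3521, 2.8666, 0.7855, 1.1947, -0.0821, 1.6626,
           2.2113, 3.0777, -0.8546, -0.2942, 2.7627, -0.7176, 1.8904),
  se   = c(1.0314, 0.4429, 0.3558, 0.4746, 0.4536, 0.5005,
           0.6369, 1.0411, 0.5222, 0.5190, 0.7488, 0.5583, 0.5686)
)

# Odds ratio: multiplicative change in the odds of a positive review
# when the word is used.
z <- qnorm(0.975)
est$odds_ratio <- exp(est$coef)
est$lower <- exp(est$coef - z * est$se)
est$upper <- exp(est$coef + z * est$se)

# A word is significant at the 5% level if its interval excludes 1.
est$significant <- est$lower > 1 | est$upper < 1

print(est[order(-est$odds_ratio), c("word", "odds_ratio", "lower", "upper", "significant")])
```

For example, the odds ratio for great is exp(2.8666), so reviews containing the word have roughly 17 to 18 times the odds of being positive, holding the other words fixed.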
Use either a backwards stepwise approach or the drop 1 approach to determine which predictors you would keep in the model.
Fit the model with glm function in order to test either approach (but NOT both)
Which predictors would you retain?
- ANSWER: I used the backward stepwise approach. I would retain buy, quality, fit, recommend, easy, price, nice, best, excellent, love, and great. The procedure dropped problem, look, and item, because removing each of them lowered the AIC.

m = glm(Sentiment ~ excellent + great + quality + problem + recommend + love  + price + nice +  buy + best + look + item + fit + easy, family = 'binomial', data = amazon_reviews)
m.bw = step(m, direction = 'backward')

## Start:  AIC=1071.21
## Sentiment ~ excellent + great + quality + problem + recommend + 
##     love + price + nice + buy + best + look + item + fit + easy
## 
##             Df Deviance    AIC
## - problem    1   1041.2 1069.2
## - look       1   1041.5 1069.5
## - item       1   1042.9 1070.9
## <none>           1041.2 1071.2
## - buy        1   1044.3 1072.3
## - quality    1   1046.1 1074.1
## - fit        1   1048.0 1076.0
## - recommend  1   1054.1 1082.1
## - easy       1   1055.4 1083.4
## - price      1   1058.9 1086.9
## - nice       1   1060.5 1088.5
## - best       1   1066.2 1094.2
## - excellent  1   1067.8 1095.8
## - love       1   1081.0 1109.0
## - great      1   1117.1 1145.1
## 
## Step:  AIC=1069.24
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + look + item + fit + easy
## 
##             Df Deviance    AIC
## - look       1   1041.6 1067.6
## - item       1   1043.0 1069.0
## <none>           1041.2 1069.2
## - buy        1   1044.3 1070.3
## - quality    1   1046.2 1072.2
## - fit        1   1048.1 1074.1
## - recommend  1   1054.2 1080.2
## - easy       1   1055.5 1081.5
## - price      1   1059.0 1085.0
## - nice       1   1060.6 1086.6
## - best       1   1066.3 1092.3
## - excellent  1   1067.9 1093.9
## - love       1   1081.1 1107.1
## - great      1   1117.2 1143.2
## 
## Step:  AIC=1067.57
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + item + fit + easy
## 
##             Df Deviance    AIC
## - item       1   1043.3 1067.3
## <none>           1041.6 1067.6
## - buy        1   1044.6 1068.6
## - quality    1   1046.5 1070.5
## - fit        1   1048.5 1072.5
## - recommend  1   1054.6 1078.6
## - easy       1   1055.9 1079.9
## - price      1   1059.5 1083.5
## - nice       1   1060.6 1084.6
## - best       1   1066.8 1090.8
## - excellent  1   1068.4 1092.4
## - love       1   1081.2 1105.2
## - great      1   1117.5 1141.5
## 
## Step:  AIC=1067.31
## Sentiment ~ excellent + great + quality + recommend + love + 
##     price + nice + buy + best + fit + easy
## 
##             Df Deviance    AIC
## <none>           1043.3 1067.3
## - buy        1   1046.2 1068.2
## - quality    1   1048.4 1070.4
## - fit        1   1050.4 1072.4
## - recommend  1   1055.2 1077.2
## - easy       1   1057.9 1079.9
## - price      1   1060.9 1082.9
## - nice       1   1062.7 1084.7
## - best       1   1068.8 1090.8
## - excellent  1   1070.4 1092.4
## - love       1   1083.3 1105.3
## - great      1   1117.6 1139.6
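
The AIC values in the trace above can be reproduced from the deviances, since for a binary outcome AIC = deviance + 2 × (number of estimated parameters). The sketch below also runs a likelihood ratio test of the reduced model against the full model, as an extra check that is not part of the original analysis; the variable names are illustrative.

```r
# Residual deviances from the step() trace and parameter counts.
dev_full <- 1041.2; k_full <- 15   # 14 words + intercept
dev_red  <- 1043.3; k_red  <- 12   # 11 words + intercept

# AIC for a 0/1 outcome: deviance plus twice the number of parameters.
aic_full <- dev_full + 2 * k_full
aic_red  <- dev_red  + 2 * k_red

# Likelihood ratio test: does dropping problem, look, and item
# significantly worsen the fit?
lr_stat <- dev_red - dev_full
lr_df   <- k_full - k_red
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(aic_full = aic_full, aic_red = aic_red, lr_stat = lr_stat, lr_p = lr_p))
```

The reduced model has the lower AIC, and the deviance increase of about 2.1 on 3 degrees of freedom is not significant, which is consistent with dropping the three words.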

Outliers

Use the model with glm function in order to test assumptions
Use the car library and the influencePlot() to create a picture of the outliers.
- Are there major outliers for this data?
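  - ANSWER: influencePlot() flags four observations (rows 302, 316, 363, and 429). Rows 302 and 363 have studentized residuals beyond ±2, but every Cook's distance is below 0.09, far under the common cutoff of 1, so there are a few outliers and none of them appears to be highly influential.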

library("car")

## Loading required package: carData

## 
## Attaching package: 'car'

## The following objects are masked from 'package:rms':
## 
##     Predict, vif

# building a model with the values retained
model_eff = glm(Sentiment ~ excellent + great + quality + recommend + love + 
    price + nice + buy + best + fit + easy, family = 'binomial', data = amazon_reviews)

influencePlot(model_eff)

##        StudRes        Hat       CookD
## 302 -2.5867628 0.05782218 0.088158881
## 316 -1.3057125 0.11749738 0.014705495
## 363 -2.4738405 0.07646019 0.089534269
## 429 -0.9563137 0.08564766 0.004612385
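
To make the reading of this table explicit, the flagged rows can be compared against common rules of thumb. The cutoffs used below (|studentized residual| > 2, hat value > 2(k + 1)/n, Cook's distance > 1) are conventions assumed for this sketch rather than part of the assignment, and other texts use different thresholds.

```r
# Values copied from the influencePlot() output above.
infl <- data.frame(
  row     = c(302, 316, 363, 429),
  StudRes = c(-2.5867628, -1.3057125, -2.4738405, -0.9563137),
  Hat     = c(0.05782218, 0.11749738, 0.07646019, 0.08564766),
  CookD   = c(0.088158881, 0.014705495, 0.089534269, 0.004612385)
)

n <- 923  # observations
k <- 11   # predictors in the reduced model

# Rule-of-thumb flags (assumed thresholds, see the text above).
infl$large_resid   <- abs(infl$StudRes) > 2
infl$high_leverage <- infl$Hat > 2 * (k + 1) / n
infl$cook_gt_1     <- infl$CookD > 1

print(infl)
```

Under these assumed cutoffs, two rows have large residuals, all four have above-average leverage, and none exceeds a Cook's distance of 1.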

Assumptions

Explore the vif values of the model and determine if you meet the assumption of additivity (meaning no multicollinearity).
Are there issues with multicollinearity?
- ANSWER: The variance inflation factor (VIF) is close to 1 for all predictors, which means that each predictor is nearly uncorrelated with the others, so there is no sign of multicollinearity. As a rule of thumb, a VIF above 5 calls for further investigation and a VIF above 10 indicates serious multicollinearity that requires correction.

rms::vif(m)

## excellent     great   quality   problem recommend      love     price      nice 
##  1.001591  1.026305  1.007351  1.015389  1.039028  1.000000  1.005348  1.003189 
##       buy      best      look      item       fit      easy 
##  1.005758  1.002253  1.006201  1.068249  1.006492  1.004792
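
For intuition about what a VIF measures, the sketch below computes the textbook linear-model version, 1 / (1 - R²), where R² comes from regressing one predictor on all the others. This is an illustration on simulated data only: the variables `w1` to `w4` are hypothetical, and rms::vif derives its values from the fitted model's covariance matrix, so its numbers for a logistic model need not match this formula exactly.

```r
set.seed(1)
n <- 1000

# Simulated 0/1 word indicators (hypothetical, independent by construction).
x <- data.frame(
  w1 = rbinom(n, 1, 0.10),
  w2 = rbinom(n, 1, 0.05),
  w3 = rbinom(n, 1, 0.08)
)

# A fourth indicator that agrees with w1 in about 90% of the reviews,
# so w1 and w4 are correlated.
x$w4 <- ifelse(runif(n) < 0.9, x$w1, 1 - x$w1)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2).
vif_manual <- sapply(names(x), function(v) {
  others <- setdiff(names(x), v)
  r2 <- summary(lm(reformulate(others, response = v), data = x))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 3))
```

The independent indicators stay close to 1, as in the output above, while the correlated pair shows inflated values.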

Test for Overfitting

Use the validate function to test for overfitting.
Is there evidence of overfitting?
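- ANSWER: The \(R^2\) values for the training and test samples are close (0.317 versus 0.278, optimism 0.039), and the optimism-corrected \(R^2\) is 0.262 compared with 0.301 originally. The calibration slope of 0.83 points to mild overfitting, but there is no evidence of severe overfitting.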

model.boot = lrm(Sentiment ~ excellent + great + quality + recommend + love +
    price + nice + buy + best + fit + easy, data = amazon_reviews, x = TRUE, y = TRUE)

validate(model.boot, B=100)

##           index.orig training    test optimism index.corrected   n
## Dxy           0.4349   0.4401  0.4302   0.0099          0.4250 100
## R2            0.3009   0.3172  0.2779   0.0393          0.2617 100
## Intercept     0.0000   0.0000 -0.0465   0.0465         -0.0465 100
## Slope         1.0000   1.0000  0.8329   0.1671          0.8329 100
## Emax          0.0000   0.0000  0.0486   0.0486          0.0486 100
## D             0.2547   0.2709  0.2328   0.0381          0.2166 100
## U            -0.0022  -0.0022  0.0062  -0.0084          0.0062 100
## Q             0.2569   0.2731  0.2265   0.0465          0.2104 100
## B             0.1951   0.1923  0.1970  -0.0046          0.1997 100
## g             1.5440   1.9360  1.5859   0.3501          1.1939 100
## gp            0.2189   0.2215  0.2053   0.0162          0.2027 100
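
The columns of this table are related by simple arithmetic: optimism is the training value minus the test value (averaged over the bootstrap samples), and the corrected index is the original value minus the optimism. The sketch below redoes that arithmetic for three rows copied from the table and converts the corrected Dxy to a corrected C statistic using C = 0.5 + Dxy / 2; the data frame `val` is illustrative.

```r
# Selected rows copied from the validate() output above.
val <- data.frame(
  index    = c("Dxy", "R2", "Slope"),
  orig     = c(0.4349, 0.3009, 1.0000),
  training = c(0.4401, 0.3172, 1.0000),
  test     = c(0.4302, 0.2779, 0.8329)
)

# Optimism: how much better the model looks on the data it was fit to.
val$optimism <- val$training - val$test

# Optimism-corrected estimate of each index.
val$corrected <- val$orig - val$optimism

# Corrected C statistic from the corrected Dxy.
c_corrected <- 0.5 + val$corrected[val$index == "Dxy"] / 2

print(val)
print(c_corrected)
```

The corrected C is only slightly below the apparent value of 0.72, which fits the reading that any overfitting is mild.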

Discussion Question

Describe a set of texts and research question that interests you that could be explored using this method. Basically, what is a potential application of this method to another area of research?
- ANSWER:
- Most companies use a chat interface on their website to communicate with customers. If all the conversations between customer care representatives and customers are stored, we can run a sentiment analysis to estimate how many existing customers are happy and how many are not. I am not sure whether CRM software such as Salesforce or Zoho One already has this feature, but it is something I would like to explore.
- We can run a sentiment analysis on tweets about a politician to estimate the approval rating of a particular leader. This would help a political party choose the most accepted person when multiple candidates from the same party compete to be the presidential nominee. This is the second thing I would like to explore.