Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
Understanding whether a written review is positive or negative can be tricky, because the context of what is being reviewed and other factors can affect its sentiment. In this assignment, you will investigate whether the words used in reviews can predict their sentiment. The datasets come from a Kaggle project with labelled sentences, which you can check out here [https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set].
For this assignment, you can pick one of three datasets to analyze: Amazon Reviews, Yelp Reviews, or Movie Reviews. The first column in each dataset is a measure of the sentiment of the review (0 = negative, 1 = positive) and the second is the number of tokens (or words) in the review. The rest of the columns are words that were either used (coded as 1) or not used (coded as 0) in the review. The sentiment of the review should be used as the outcome in your binary logistic regression. For your predictors, choose 10-20 words to test whether use of those words predicts the sentiment of the review.
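As a reminder of what the binary logistic regression described above estimates, the model can be written as follows. The symbols are generic notation rather than anything defined in the assignment: \(x_j\) is the 0/1 indicator for the \(j\)-th chosen word and \(\beta_j\) is its coefficient on the log-odds scale.

```latex
\[
\log\frac{P(\text{Sentiment}=1)}{1-P(\text{Sentiment}=1)}
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad x_j \in \{0, 1\}
\]
```

A positive \(\beta_j\) means that reviews using word \(j\) have higher odds of being positive, holding the other words fixed.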
# When using a VM, you may need to install the packages 'gridExtra', 'htmlwidgets', and 'readxl' first
install.packages("gridExtra")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("htmlwidgets")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
install.packages("readxl")
## Installing package into '/home/yashwanth_suruneni/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
library(gridExtra)
library(htmlwidgets)
library(readxl)
# Read the data from amazon reviews
amazon_reviews <- read.csv("amazon reviews.csv")
To create a simple, easy-to-interpret model, pick 10-20 words from one of the datasets that capture the features most important to a good product (Amazon Reviews), restaurant (Yelp Reviews), or movie (Movie Reviews). Think about which words are strongly associated with either positive or negative reviews, as well as words that may have different meanings in a positive versus a negative review.
Which words did you choose as predictors and why?
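One possible way to screen candidate words before choosing predictors is to compare how often each word appears in positive versus negative reviews. The sketch below uses a small made-up data frame (`toy_reviews` and its values are hypothetical stand-ins, not the Kaggle file, and it omits the token-count column) with the same 0/1 coding described above.

```r
# Toy stand-in for the review data (hypothetical values, not the Kaggle file):
# Sentiment is 0/1 and each word column is a 0/1 usage indicator.
toy_reviews <- data.frame(
  Sentiment = c(1, 1, 1, 1, 0, 0, 0, 0),
  great     = c(1, 1, 0, 1, 0, 0, 0, 0),
  problem   = c(0, 0, 0, 1, 1, 1, 0, 1),
  buy       = c(1, 0, 1, 0, 1, 0, 1, 0)
)

word_cols <- setdiff(names(toy_reviews), "Sentiment")

# Share of negative and of positive reviews that use each word
usage_by_sentiment <- sapply(word_cols, function(w) {
  tapply(toy_reviews[[w]], toy_reviews$Sentiment, mean)
})
rownames(usage_by_sentiment) <- c("negative", "positive")

# Total number of reviews using each word (very rare words give unstable estimates)
word_counts <- colSums(toy_reviews[word_cols])

print(round(usage_by_sentiment, 2))
print(word_counts)
```

Words with a large gap between the two rows are natural candidates, while words used in only a handful of reviews may produce very large standard errors once they enter the model.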
Run a binary logistic regression using the rms package. (Answer all questions below after ANSWER: using complete sentences)
Use the \(\chi^2\) test - is the overall model predictive of sentiment? Is it significant?
What is Nagelkerke’s pseudo-\(R^2\)? What does it tell you about goodness of fit?
What is the C statistic? How well are you predicting sentiment?
ANSWER:
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
model = lrm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, data = amazon_reviews)
model
## Logistic Regression Model
##
## lrm(formula = Sentiment ~ excellent + great + quality + fit +
## problem + recommend + love + price + nice + buy + look +
## best + item + easy, data = amazon_reviews)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 923 LR chi2 238.21 R2 0.303 C 0.722
## 0 467 d.f. 14 g 1.571 Dxy 0.443
## 1 456 Pr(> chi2) <0.0001 gr 4.814 gamma 0.710
## max |deriv| 0.003 gp 0.223 tau-a 0.222
## Brier 0.195
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -0.5640 0.0853 -6.62 <0.0001
## excellent 3.3521 1.0314 3.25 0.0012
## great 2.8666 0.4429 6.47 <0.0001
## quality 0.7855 0.3558 2.21 0.0273
## fit 1.1947 0.4746 2.52 0.0118
## problem -0.0821 0.4536 -0.18 0.8564
## recommend 1.6626 0.5005 3.32 0.0009
## love 9.5223 19.7050 0.48 0.6289
## price 2.2113 0.6369 3.47 0.0005
## nice 3.0777 1.0411 2.96 0.0031
## buy -0.8546 0.5222 -1.64 0.1017
## look -0.2942 0.5190 -0.57 0.5708
## best 2.7627 0.7488 3.69 0.0002
## item -0.7176 0.5583 -1.29 0.1987
## easy 1.8904 0.5686 3.32 0.0009
##
model_glm = glm(Sentiment ~ excellent + great + quality + fit + problem + recommend + love + price + nice + buy + look + best + item + easy, family='binomial', data = amazon_reviews)
model_glm
##
## Call: glm(formula = Sentiment ~ excellent + great + quality + fit +
## problem + recommend + love + price + nice + buy + look +
## best + item + easy, family = "binomial", data = amazon_reviews)
##
## Coefficients:
## (Intercept) excellent great quality fit problem
## -0.5640 3.3521 2.8666 0.7855 1.1947 -0.0821
## recommend love price nice buy look
## 1.6626 16.8285 2.2113 3.0777 -0.8546 -0.2942
## best item easy
## 2.7627 -0.7176 1.8904
##
## Degrees of Freedom: 922 Total (i.e. Null); 908 Residual
## Null Deviance: 1279
## Residual Deviance: 1041 AIC: 1071
model$stats
## Obs Max Deriv Model L.R. d.f. P C
## 9.230000e+02 2.575801e-03 2.382063e+02 1.400000e+01 0.000000e+00 7.216673e-01
## Dxy Gamma Tau-a R2 Brier g
## 4.433346e-01 7.104029e-01 2.218762e-01 3.033014e-01 1.946227e-01 1.571483e+00
## gr gp
## 4.813782e+00 2.233455e-01
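The three quantities the questions ask about can also be reproduced by hand from a `glm` fit, which may help when interpreting them. The following is a minimal sketch on simulated data (the data frame `sim` and its effect sizes are invented for illustration); it is not how `lrm` computes its output internally, only the textbook definitions.

```r
set.seed(1)
n <- 400
# Simulated 0/1 word indicators and a sentiment outcome (hypothetical data)
sim <- data.frame(great = rbinom(n, 1, 0.3), problem = rbinom(n, 1, 0.3))
sim$Sentiment <- rbinom(n, 1, plogis(-0.5 + 2 * sim$great - 1 * sim$problem))

fit <- glm(Sentiment ~ great + problem, family = binomial, data = sim)

# Likelihood-ratio chi-square: null deviance minus residual deviance
lr_chi2 <- fit$null.deviance - fit$deviance
lr_df   <- fit$df.null - fit$df.residual
lr_p    <- pchisq(lr_chi2, df = lr_df, lower.tail = FALSE)

# Nagelkerke's pseudo-R^2: Cox-Snell R^2 rescaled by its maximum possible value
cox_snell  <- 1 - exp(-lr_chi2 / n)
nagelkerke <- cox_snell / (1 - exp(-fit$null.deviance / n))

# C statistic: probability that a random positive review gets a higher
# predicted probability than a random negative one (ties count one half)
p_hat <- fitted(fit)
pos <- p_hat[sim$Sentiment == 1]
neg <- p_hat[sim$Sentiment == 0]
c_stat <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(LR_chi2 = lr_chi2, df = lr_df, p = lr_p,
        Nagelkerke = nagelkerke, C = c_stat))
```

Applying the same formulas to the values reported above (null deviance 1279, residual deviance 1041, 923 observations) gives a likelihood-ratio chi-square of about 238 and a Nagelkerke value of about 0.30, in line with the `lrm` output. A C statistic of 0.5 corresponds to chance-level discrimination and 1 to perfect discrimination.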
Explain each coefficient (i.e. each word you chose) - are they significant? What do they imply if they are significant (i.e., which sentiment does it predict)? (Should be at least a paragraph.)
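When explaining coefficients, it can help to move from the log-odds scale to odds ratios. The sketch below copies four coefficients and standard errors from the `lrm` output above and exponentiates them; the Wald interval is a standard approximation, and the choice of which four words to show is arbitrary.

```r
# Coefficients and standard errors copied from the lrm output above
coefs <- c(excellent = 3.3521, great = 2.8666, problem = -0.0821, buy = -0.8546)
ses   <- c(excellent = 1.0314, great = 0.4429, problem = 0.4536, buy = 0.5222)

# Odds ratio: multiplicative change in the odds of a positive review
# when the word is used, holding the other words fixed
odds_ratio <- exp(coefs)

# Wald 95% interval on the log-odds scale, then exponentiated
ci_lower <- exp(coefs - qnorm(0.975) * ses)
ci_upper <- exp(coefs + qnorm(0.975) * ses)

print(round(cbind(odds_ratio, ci_lower, ci_upper), 2))
```

An interval that excludes 1 corresponds to a word whose Wald test is significant at the 5% level. Note also that `love` has a very large coefficient and standard error in the output above, and its estimate differs between `lrm` and `glm`; that pattern often indicates a word that appears almost exclusively in one sentiment class, so its Wald test should be read with caution.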
Use either a backwards stepwise approach or the drop 1 approach to determine which predictors you would keep in the model.
Fit the model with glm function in order to test either approach (but NOT both)
Which predictors would you retain?
m = glm(Sentiment ~ excellent + great + quality + problem + recommend + love + price + nice + buy + best + look + item + fit + easy, family = 'binomial', data = amazon_reviews)
m.bw = step(m, direction = 'backward')
## Start: AIC=1071.21
## Sentiment ~ excellent + great + quality + problem + recommend +
## love + price + nice + buy + best + look + item + fit + easy
##
## Df Deviance AIC
## - problem 1 1041.2 1069.2
## - look 1 1041.5 1069.5
## - item 1 1042.9 1070.9
## <none> 1041.2 1071.2
## - buy 1 1044.3 1072.3
## - quality 1 1046.1 1074.1
## - fit 1 1048.0 1076.0
## - recommend 1 1054.1 1082.1
## - easy 1 1055.4 1083.4
## - price 1 1058.9 1086.9
## - nice 1 1060.5 1088.5
## - best 1 1066.2 1094.2
## - excellent 1 1067.8 1095.8
## - love 1 1081.0 1109.0
## - great 1 1117.1 1145.1
##
## Step: AIC=1069.24
## Sentiment ~ excellent + great + quality + recommend + love +
## price + nice + buy + best + look + item + fit + easy
##
## Df Deviance AIC
## - look 1 1041.6 1067.6
## - item 1 1043.0 1069.0
## <none> 1041.2 1069.2
## - buy 1 1044.3 1070.3
## - quality 1 1046.2 1072.2
## - fit 1 1048.1 1074.1
## - recommend 1 1054.2 1080.2
## - easy 1 1055.5 1081.5
## - price 1 1059.0 1085.0
## - nice 1 1060.6 1086.6
## - best 1 1066.3 1092.3
## - excellent 1 1067.9 1093.9
## - love 1 1081.1 1107.1
## - great 1 1117.2 1143.2
##
## Step: AIC=1067.57
## Sentiment ~ excellent + great + quality + recommend + love +
## price + nice + buy + best + item + fit + easy
##
## Df Deviance AIC
## - item 1 1043.3 1067.3
## <none> 1041.6 1067.6
## - buy 1 1044.6 1068.6
## - quality 1 1046.5 1070.5
## - fit 1 1048.5 1072.5
## - recommend 1 1054.6 1078.6
## - easy 1 1055.9 1079.9
## - price 1 1059.5 1083.5
## - nice 1 1060.6 1084.6
## - best 1 1066.8 1090.8
## - excellent 1 1068.4 1092.4
## - love 1 1081.2 1105.2
## - great 1 1117.5 1141.5
##
## Step: AIC=1067.31
## Sentiment ~ excellent + great + quality + recommend + love +
## price + nice + buy + best + fit + easy
##
## Df Deviance AIC
## <none> 1043.3 1067.3
## - buy 1 1046.2 1068.2
## - quality 1 1048.4 1070.4
## - fit 1 1050.4 1072.4
## - recommend 1 1055.2 1077.2
## - easy 1 1057.9 1079.9
## - price 1 1060.9 1082.9
## - nice 1 1062.7 1084.7
## - best 1 1068.8 1090.8
## - excellent 1 1070.4 1092.4
## - love 1 1083.3 1105.3
## - great 1 1117.6 1139.6
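For comparison, the drop 1 approach mentioned in the prompt can be sketched as below. This uses simulated data (the data frame `sim` is hypothetical, with `look` generated to have no effect) and is shown only to illustrate the alternative; the analysis above uses the backward stepwise approach.

```r
set.seed(2)
n <- 500
sim <- data.frame(
  great = rbinom(n, 1, 0.3),
  best  = rbinom(n, 1, 0.2),
  look  = rbinom(n, 1, 0.2)   # simulated to have no effect on sentiment
)
sim$Sentiment <- rbinom(n, 1, plogis(-0.5 + 2 * sim$great + 1.5 * sim$best))

full <- glm(Sentiment ~ great + best + look, family = binomial, data = sim)

# Likelihood-ratio test for removing each predictor, one at a time
drop_tab <- drop1(full, test = "Chisq")
print(drop_tab)

# For a binomial glm on 0/1 data, AIC = deviance + 2 * (number of coefficients)
aic_by_hand <- deviance(full) + 2 * length(coef(full))
print(c(by_hand = aic_by_hand, from_R = AIC(full)))
```

The same AIC arithmetic explains the stepwise table above: the full model has a deviance of 1041.2 and 15 coefficients, giving 1041.2 + 30 = 1071.2, and each row shows the AIC that would result from removing that word. The procedure stops once `<none>` has the lowest AIC.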
Use the car library and the influencePlot() function to create a picture of the outliers.
library("car")
## Loading required package: carData
##
## Attaching package: 'car'
## The following objects are masked from 'package:rms':
##
## Predict, vif
# building a model with the values retained
model_eff = glm(Sentiment ~ excellent + great + quality + recommend + love +
price + nice + buy + best + fit + easy, family = 'binomial', data = amazon_reviews)
influencePlot(model_eff)
## StudRes Hat CookD
## 302 -2.5867628 0.05782218 0.088158881
## 316 -1.3057125 0.11749738 0.014705495
## 363 -2.4738405 0.07646019 0.089534269
## 429 -0.9563137 0.08564766 0.004612385
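The three columns reported by `influencePlot()` (studentized residuals, hat values, and Cook's distance) are also available from base R, which makes it possible to inspect them for every observation. A minimal sketch on simulated data follows; the data frame `sim` and the cutoffs used for flagging are illustrative assumptions, not fixed rules.

```r
set.seed(3)
n <- 300
sim <- data.frame(great = rbinom(n, 1, 0.3), price = rbinom(n, 1, 0.2))
sim$Sentiment <- rbinom(n, 1, plogis(-0.5 + 2 * sim$great + 1 * sim$price))
fit <- glm(Sentiment ~ great + price, family = binomial, data = sim)

# Same three diagnostics that influencePlot() tabulates
diagnostics <- data.frame(
  StudRes = rstudent(fit),
  Hat     = hatvalues(fit),
  CookD   = cooks.distance(fit)
)

# Average leverage is (number of coefficients) / n; as a rough rule of thumb,
# flag points with large residuals or leverage well above the average
avg_hat <- length(coef(fit)) / n
flagged <- diagnostics[abs(diagnostics$StudRes) > 2 |
                         diagnostics$Hat > 3 * avg_hat, ]
print(head(flagged[order(-flagged$CookD), ]))
```

In the output above, all four labelled observations have Cook's distances below 0.1, which is well under the commonly cited threshold of 1.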
Calculate the vif values of the model and determine if you meet the assumption of additivity (meaning no multicollinearity).
rms::vif(m)
## excellent great quality problem recommend love price nice
## 1.001591 1.026305 1.007351 1.015389 1.039028 1.000000 1.005348 1.003189
## buy best look item fit easy
## 1.005758 1.002253 1.006201 1.068249 1.006492 1.004792
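To make the meaning of a variance inflation factor concrete, the sketch below computes it from its classical linear-model definition, 1 / (1 - R^2), where R^2 comes from regressing each predictor on the others. This is an illustration on simulated, independent word indicators (the data frame `X` is hypothetical) and is not necessarily the exact computation that `rms::vif` performs on a fitted logistic model.

```r
set.seed(4)
n <- 500
# Simulated, mutually independent 0/1 word indicators (hypothetical data)
X <- data.frame(
  great     = rbinom(n, 1, 0.3),
  recommend = rbinom(n, 1, 0.2),
  item      = rbinom(n, 1, 0.2)
)

# Regress each predictor on the remaining ones and convert R^2 to a VIF
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 3))
```

Values close to 1, like those in the output above, indicate that each word carries information largely independent of the others. Commonly used warning thresholds are in the range of 5 to 10.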
model.boot = lrm(Sentiment ~ excellent + great + quality + recommend + love +
                   price + nice + buy + best + fit + easy,
                 data = amazon_reviews, x = TRUE, y = TRUE)
validate(model.boot, B=100)
## index.orig training test optimism index.corrected n
## Dxy 0.4349 0.4401 0.4302 0.0099 0.4250 100
## R2 0.3009 0.3172 0.2779 0.0393 0.2617 100
## Intercept 0.0000 0.0000 -0.0465 0.0465 -0.0465 100
## Slope 1.0000 1.0000 0.8329 0.1671 0.8329 100
## Emax 0.0000 0.0000 0.0486 0.0486 0.0486 100
## D 0.2547 0.2709 0.2328 0.0381 0.2166 100
## U -0.0022 -0.0022 0.0062 -0.0084 0.0062 100
## Q 0.2569 0.2731 0.2265 0.0465 0.2104 100
## B 0.1951 0.1923 0.1970 -0.0046 0.1997 100
## g 1.5440 1.9360 1.5859 0.3501 1.1939 100
## gp 0.2189 0.2215 0.2053 0.0162 0.2027 100
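The `optimism` and `index.corrected` columns follow the usual optimism-bootstrap logic: refit the model on each bootstrap sample, compare its performance on that sample with its performance on the original data, and subtract the average gap from the apparent performance. The sketch below illustrates that idea for Somers' Dxy in base R on simulated data. The data frame `sim` and the helper `dxy()` are hypothetical, and this is a simplified illustration rather than a reimplementation of `validate()`.

```r
set.seed(5)
n <- 300
sim <- data.frame(great = rbinom(n, 1, 0.3), best = rbinom(n, 1, 0.2))
sim$Sentiment <- rbinom(n, 1, plogis(-0.5 + 2 * sim$great + 1.5 * sim$best))

# Somers' Dxy = 2 * (C - 0.5), with C computed from all positive/negative pairs
dxy <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  c_stat <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
  2 * (c_stat - 0.5)
}

form <- Sentiment ~ great + best
fit <- glm(form, family = binomial, data = sim)
apparent <- dxy(fitted(fit), sim$Sentiment)

B <- 100
optimism <- replicate(B, {
  boot <- sim[sample(n, replace = TRUE), ]
  boot_fit <- glm(form, family = binomial, data = boot)
  # Performance on the bootstrap sample the model was fitted to
  train_dxy <- dxy(fitted(boot_fit), boot$Sentiment)
  # Performance of the same model on the original data
  test_dxy <- dxy(predict(boot_fit, newdata = sim, type = "response"),
                  sim$Sentiment)
  train_dxy - test_dxy
})

corrected <- apparent - mean(optimism)
print(c(apparent = apparent, optimism = mean(optimism), corrected = corrected))
```

The same subtraction appears in the table above: for Dxy, 0.4349 - 0.0099 = 0.4250. Because Dxy = 2 * (C - 0.5), the corrected value corresponds to a C statistic of about 0.71.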
Describe a set of texts and research question that interests you that could be explored using this method. Basically, what is a potential application of this method to another area of research?