Logistic Regression

Load the Libraries + Functions

Choosing Predictors

Which words did you choose as predictors and why?
character, plot, time, story, actor, and great, because they describe essential properties of a movie or TV show. The model fitted below uses character, plot, story, great, and actor.

Running a Binary Logistic Regression

Use the \(\chi^2\) test - is the overall model predictive of verb choice? Is it significant?
The LR \(\chi^2\) is 46.06 on 5 degrees of freedom, with p < 0.0001. The model as a whole is significantly better than an intercept-only model.
What is Nagelkerke’s pseudo-\(R^2\)? What does it tell you about goodness of fit?
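Nagelkerke’s pseudo-\(R^2\) is 0.061. It is not a proportion of variance explained in the way \(R^2\) is for linear regression, so it can mislead if read that way. A value this close to 0 still indicates that the predictors improve only slightly on the intercept-only model.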
What is the C statistic? How well are we predicting?
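The C statistic is 0.562. C is the concordance probability (equivalent to the area under the ROC curve): the chance that a randomly chosen observation with Sentiment = 1 gets a higher predicted probability than a randomly chosen observation with Sentiment = 0. It is not classification accuracy. A value of 0.5 is chance and 1 is perfect discrimination, so 0.562 means the model discriminates only slightly better than chance.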

blr_model = lrm(Sentiment ~ character + plot + story + great + actor, data = df_movie)
blr_model

## Logistic Regression Model
##  
##  lrm(formula = Sentiment ~ character + plot + story + great + 
##      actor, data = df_movie)
##  
##                         Model Likelihood    Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes          Indexes    
##  Obs           977    LR chi2      46.06    R2       0.061    C       0.562    
##   0            487    d.f.             5    g        0.318    Dxy     0.123    
##   1            490    Pr(> chi2) <0.0001    gr       1.375    gamma   0.418    
##  max |deriv| 1e-10                          gp       0.063    tau-a   0.062    
##                                             Brier    0.239                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept -0.0431 0.0695 -0.62  0.5348  
##  character  0.3753 0.2961  1.27  0.2050  
##  plot      -1.6941 0.5485 -3.09  0.0020  
##  story     -0.2720 0.3724 -0.73  0.4651  
##  great      2.1885 0.5325  4.11  <0.0001 
##  actor      0.3298 0.4189  0.79  0.4311  
##
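As a quick check on the summary statistics above, the following sketch recomputes the p-value of the likelihood-ratio test from the printed \(\chi^2\) and degrees of freedom, and shows how a C statistic is obtained on a small made-up example. The helper `c_statistic` and the toy vectors are illustrative assumptions, not part of the original analysis. Only base R is used.

```r
# p-value of the likelihood-ratio test, from the values printed by lrm()
lr_chi2 <- 46.06
lr_df   <- 5
lr_p    <- pchisq(lr_chi2, df = lr_df, lower.tail = FALSE)

# C statistic (concordance): among all pairs made of one y = 1 case and one
# y = 0 case, the share in which the y = 1 case has the higher predicted
# probability. Ties count as half.
# c_statistic is a hypothetical helper written for this illustration.
c_statistic <- function(pred, y) {
  pos <- pred[y == 1]
  neg <- pred[y == 0]
  diffs <- outer(pos, neg, "-")   # every positive-minus-negative difference
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

# Toy data (made up): four cases with predicted probabilities
toy_y    <- c(0, 0, 1, 1)
toy_pred <- c(0.10, 0.40, 0.35, 0.80)
toy_c    <- c_statistic(toy_pred, toy_y)   # 3 of the 4 pairs are concordant
toy_dxy  <- 2 * (toy_c - 0.5)              # Somers' Dxy is a rescaling of C

print(lr_p)
print(c(C = toy_c, Dxy = toy_dxy))
```

A C of 0.5 corresponds to chance ranking and 1 to perfect ranking, which is why the reported 0.562 sits close to the chance end of the scale.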

Coefficients

Explain each coefficient (i.e. each word you chose) - are they significant? What do they imply if they are significant (i.e., which sentiment does it predict)?
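Character: coefficient 0.3753, p = 0.2050. Not significant, so there is no evidence that this word predicts sentiment.
Plot: coefficient -1.6941, p = 0.0020. Significant. The negative sign means that more use of "plot" lowers the odds of Sentiment = 1.
Story: coefficient -0.2720, p = 0.4651. Not significant.
Great: coefficient 2.1885, p < 0.0001. Significant. The positive sign means that more use of "great" raises the odds of Sentiment = 1.
Actor: coefficient 0.3298, p = 0.4311. Not significant.
If 1 codes positive sentiment, "great" predicts positive reviews and "plot" predicts negative reviews.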
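To make the coefficient table easier to read, the sketch below recomputes the Wald z statistics and two-sided p-values from the printed coefficients and standard errors, and converts each coefficient to an odds ratio. The numbers are copied from the `lrm()` output above; the variable names are illustrative.

```r
# Coefficients and standard errors copied from the lrm() output
coefs <- c(Intercept = -0.0431, character = 0.3753, plot = -1.6941,
           story = -0.2720, great = 2.1885, actor = 0.3298)
ses   <- c(Intercept = 0.0695, character = 0.2961, plot = 0.5485,
           story = 0.3724, great = 0.5325, actor = 0.4189)

wald_z     <- coefs / ses               # Wald z = estimate / standard error
p_value    <- 2 * pnorm(-abs(wald_z))   # two-sided normal p-value
odds_ratio <- exp(coefs)                # multiplicative change in the odds

coef_table <- data.frame(coef = coefs, se = ses,
                         z = round(wald_z, 2),
                         p = round(p_value, 4),
                         odds_ratio = round(odds_ratio, 3))
print(coef_table)
```

The odds ratio for great is about 8.9 and for plot about 0.18, per one-unit increase in the predictor, in whatever unit the word variables are measured.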

Variable Selection

Which predictors would you retain?
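Plot and great. Backward selection by AIC drops story, actor, and character in turn, lowering the AIC from 1320.34 to 1317.11. Removing either plot or great from the final model would raise the AIC (to 1328.1 and 1344.2, respectively).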

glm_model = glm(Sentiment ~ character + plot + story + great + actor, data = df_movie, family = 'binomial')
glm_model_bw = step(glm_model, direction = 'backward')

## Start:  AIC=1320.34
## Sentiment ~ character + plot + story + great + actor
## 
##             Df Deviance    AIC
## - story      1   1308.9 1318.9
## - actor      1   1309.0 1319.0
## - character  1   1310.0 1320.0
## <none>           1308.3 1320.3
## - plot       1   1321.5 1331.5
## - great      1   1336.1 1346.1
## 
## Step:  AIC=1318.88
## Sentiment ~ character + plot + great + actor
## 
##             Df Deviance    AIC
## - actor      1   1309.5 1317.5
## - character  1   1310.3 1318.3
## <none>           1308.9 1318.9
## - plot       1   1321.8 1329.8
## - great      1   1336.5 1344.5
## 
## Step:  AIC=1317.5
## Sentiment ~ character + plot + great
## 
##             Df Deviance    AIC
## - character  1   1311.1 1317.1
## <none>           1309.5 1317.5
## - plot       1   1322.6 1328.6
## - great      1   1338.1 1344.1
## 
## Step:  AIC=1317.11
## Sentiment ~ plot + great
## 
##         Df Deviance    AIC
## <none>       1311.1 1317.1
## - plot   1   1324.1 1328.1
## - great  1   1340.2 1344.2
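The AIC values in the trace can be reproduced from the deviances. For a 0/1 outcome the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated coefficients. The sketch below checks this for the full and final models, using values copied from the `step()` output; the variable names are illustrative.

```r
# Residual deviances copied from the step() trace
deviance_full  <- 1308.3   # Sentiment ~ character + plot + story + great + actor
deviance_final <- 1311.1   # Sentiment ~ plot + great

# Number of estimated coefficients, intercept included
k_full  <- 6
k_final <- 3

# AIC = deviance + 2 * k for ungrouped binary data
aic_full  <- deviance_full  + 2 * k_full
aic_final <- deviance_final + 2 * k_final

print(c(full = aic_full, final = aic_final, difference = aic_full - aic_final))
```

Dropping three predictors raises the deviance by only 2.8 while saving 6 in the penalty term, which is why the smaller model has the lower AIC.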

Outliers

Are there major outliers for this data?
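Yes, there are mild outliers. Observations 134 and 465 have studentized residuals beyond ±2 (-2.18 and -2.35), and they also have the largest circles, whose area reflects Cook's distance. Their Cook's distances (0.039 and 0.054) are still well below 1. Observations 425 and 614 are labelled for their relatively high hat values (0.060 and 0.075), not for large residuals.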

influencePlot(glm_model)

##        StudRes        Hat       CookD
## 134 -2.1777847 0.02629098 0.039491797
## 425 -0.6873315 0.05967553 0.002881715
## 465 -2.3476167 0.02472222 0.053876380
## 614 -1.2178683 0.07494388 0.014811833
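The flagged rows can be screened against a few common rules of thumb. The sketch below uses the four rows printed above; the cutoffs (|studentized residual| > 2, hat value above twice the average leverage, Cook's distance > 1) are assumptions chosen for illustration, not thresholds taken from the original analysis.

```r
# Rows copied from the influencePlot() output
infl <- data.frame(
  StudRes = c(-2.1777847, -0.6873315, -2.3476167, -1.2178683),
  Hat     = c(0.02629098, 0.05967553, 0.02472222, 0.07494388),
  CookD   = c(0.039491797, 0.002881715, 0.053876380, 0.014811833),
  row.names = c("134", "425", "465", "614")
)

n_obs  <- 977   # observations in the model
n_coef <- 6     # intercept + five predictors in glm_model

mean_hat <- n_coef / n_obs   # average leverage is k / n

infl$large_resid   <- abs(infl$StudRes) > 2     # rule of thumb (assumption)
infl$high_leverage <- infl$Hat > 2 * mean_hat   # rule of thumb (assumption)
infl$cookd_over_1  <- infl$CookD > 1            # rule of thumb (assumption)

print(infl)
```

Under these cutoffs only rows 134 and 465 have large residuals, all four rows have above-average leverage, and none has a Cook's distance above 1.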

Assumptions

Are there issues with multicollinearity?
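No. All VIFs are close to 1 (at most 1.015 in the full model and 1.0002 in the reduced model), well below the common cutoff of 5, so there is no sign of multicollinearity.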

rms::vif(blr_model)

## character      plot     story     great     actor 
##  1.015098  1.001805  1.011054  1.002779  1.007283

rms::vif(glm_model_bw)

##     plot    great 
## 1.000228 1.000228
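To show where such numbers come from, here is a minimal base R sketch on simulated data (not the movie data). It assumes the covariance-based definition described in the rms documentation: rescale the covariance matrix of the coefficient estimates (without the intercept) to a correlation matrix, invert it, and read the VIFs off the diagonal. The helper `vif_from_vcov` and the simulated variables are hypothetical.

```r
set.seed(1)

# Simulated stand-in data: two independent predictors and a binary outcome
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1 - 0.8 * x2))
sim_fit <- glm(y ~ x1 + x2, family = binomial)

# Hypothetical helper: VIFs from the coefficient covariance matrix
vif_from_vcov <- function(fit) {
  v <- vcov(fit)
  v <- v[-1, -1, drop = FALSE]   # assumes the intercept is the first term
  diag(solve(cov2cor(v)))        # diagonal of the inverse correlation matrix
}

sim_vif <- vif_from_vcov(sim_fit)
print(sim_vif)
```

Because the simulated predictors are independent, the values should land close to 1, the same pattern seen for the word predictors above.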

Test for Overfitting

Is there evidence of overfitting?
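There is only slight evidence of overfitting. \(R^2\) drops from 0.0578 to an optimism-corrected 0.0528 (optimism 0.0050), and the corrected calibration slope is 0.9228, a little below 1. Dxy is essentially unchanged (0.0990 originally, 0.1004 corrected).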

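# x = TRUE, y = TRUE store the design matrix and response, which validate() needs for resampling
model = lrm(Sentiment ~ plot + great, data = df_movie, x = TRUE, y = TRUE)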
validate(model)

##           index.orig training    test optimism index.corrected  n
## Dxy           0.0990   0.0976  0.0990  -0.0014          0.1004 40
## R2            0.0578   0.0606  0.0556   0.0050          0.0528 40
## Intercept     0.0000   0.0000 -0.0054   0.0054         -0.0054 40
## Slope         1.0000   1.0000  0.9228   0.0772          0.9228 40
## Emax          0.0000   0.0000  0.0188   0.0188          0.0188 40
## D             0.0433   0.0455  0.0415   0.0040          0.0393 40
## U            -0.0020  -0.0020  0.0020  -0.0041          0.0020 40
## Q             0.0453   0.0476  0.0395   0.0080          0.0373 40
## B             0.2402   0.2397  0.2407  -0.0010          0.2412 40
## g             0.2618   0.2917  0.2546   0.0371          0.2247 40
## gp            0.0496   0.0488  0.0477   0.0011          0.0484 40

Anvesh Raavi

2020-10-04
