Assignment 5: Probabilistic Grammar

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. Here you will be distinguishing between the uses of let, allow, and permit. You should pick two of these verbs to examine and subset the dataset to only the rows for those verbs. You can use the droplevels() function to drop the empty level after you subset.

library(Rling)
data(let)
head(let)

##   Year  Reg   Verb Neg Permitter Imper
## 1 2003  MAG  allow  No    Inanim    No
## 2 2005 SPOK  allow  No    Inanim    No
## 3 1990 SPOK    let  No      Anim    No
## 4 2007 SPOK  allow  No    Inanim    No
## 5 1997  MAG permit  No      Anim   Yes
## 6 1996  MAG  allow  No      Anim    No

library(rms)

## Loading required package: Hmisc

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

library(visreg)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following objects are masked from 'package:rms':
## 
##     Predict, vif

data(let)
# Keep only the rows for let and allow, then drop the now-empty "permit" level
let <- droplevels(subset(let, Verb != "permit"))

Description of the Data

The data come from the Corpus of Contemporary American English (COCA) and record the choice among the verbs let, allow, and permit. These are permissive constructions that are often paired with the word to. Predict the verb choice in the Verb column with the following independent variables:

Reg: SPOK for spoken conversations, MAG for magazine articles.
Permitter: semantic class of the clause subject, Anim for animate, Inanim for inanimate, and Undef for undefined.
Imper: Yes for the imperative, No otherwise.
Note: Year is in the dataset, which would be awesome but gave me trouble. You can try adding it if you want.

Sample Size Requirements

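Is the sample size large enough for four predictors? The sample size is 354, and the less frequent verb (allow) has 167 observations. The model estimates four coefficients, so there are about 42 observations of the rarer outcome per coefficient, which is comfortably large enough.
Is the split between your chosen verbs even enough to think you could predict them? The split is about 53:47 (187 let, 167 allow), which is balanced enough for prediction.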

table(droplevels(let)$Verb)

## 
##   let allow 
##   187   167
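
The two checks above can be reproduced with a few lines of base R. This is a minimal sketch that copies the counts from the table above; the threshold of roughly 10 observations of the rarer outcome per coefficient is a common rule of thumb assumed here, not something the assignment specifies, and the variable names are illustrative.

```r
# Verb counts copied from the table above
counts <- c(let = 187, allow = 167)

# Share of each verb in the subset
shares <- round(prop.table(counts), 3)
print(shares)

# Rule of thumb (an assumption, not part of the assignment):
# the rarer outcome should supply at least ~10 observations per coefficient
n_coef <- 4  # Reg=SPOK, Permitter=Inanim, Permitter=Undef, Imper=Yes
obs_per_coef <- min(counts) / n_coef
print(obs_per_coef)
```

With these counts the rarer verb supplies well over 10 observations per coefficient, so the rule of thumb is met.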

Running a Binary Logistic Regression

Run the logistic regression using the rms package.
- Use the \(\chi^2\) test - is the overall model predictive of verb choice? Is it significant? The likelihood-ratio \(\chi^2\) is 201.17 on 4 degrees of freedom with p < 0.0001, so the model as a whole predicts verb choice significantly better than an intercept-only model.
- What is Nagelkerke’s pseudo-\(R^2\)? What does it tell you about goodness of fit? The pseudo-\(R^2\) is 0.579. It is not a literal share of explained variance, but on its 0-to-1 scale it indicates that the predictors improve substantially on the intercept-only model.
- What is the C statistic? How well are we predicting? C is 0.876. Values between 0.8 and 0.9 are conventionally read as excellent discrimination, so the model separates let from allow well.

# Binary logistic regression; the modeled outcome is the second level of Verb (allow)
model = lrm(Verb ~ Reg + Permitter + Imper,
            data = let)
model

## Logistic Regression Model
##  
##  lrm(formula = Verb ~ Reg + Permitter + Imper, data = let)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           354    LR chi2     201.17    R2       0.579    C       0.876    
##   let          187    d.f.             4    g        2.417    Dxy     0.753    
##   allow        167    Pr(> chi2) <0.0001    gr      11.214    gamma   0.825    
##  max |deriv| 6e-10                          gp       0.377    tau-a   0.376    
##                                             Brier    0.132                     
##  
##                   Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept        -0.1935 0.2280 -0.85  0.3961  
##  Reg=SPOK          0.0530 0.3072  0.17  0.8629  
##  Permitter=Inanim  2.6069 0.4106  6.35  <0.0001 
##  Permitter=Undef   1.2888 0.6166  2.09  0.0366  
##  Imper=Yes        -3.1051 0.5464 -5.68  <0.0001 
##
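
The headline numbers in this printout can be cross-checked with base R. The sketch below copies the values from the output above (the variable names are illustrative) and recomputes the p-value of the likelihood-ratio test and Somers' Dxy, which is a rescaling of C.

```r
# Values copied from the lrm() printout above
lr_chi2 <- 201.17
dof     <- 4
c_stat  <- 0.876

# p-value of the likelihood-ratio test against the intercept-only model
p_value <- pchisq(lr_chi2, df = dof, lower.tail = FALSE)
print(p_value)

# Somers' Dxy is a rescaling of C: Dxy = 2 * (C - 0.5)
dxy <- 2 * (c_stat - 0.5)
print(dxy)
```

The recomputed Dxy differs from the printed 0.753 only in the third decimal because C is itself rounded in the printout.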

Coefficients

Explain each coefficient - are they significant? What do they imply if they are significant (i.e., which verb does it predict)?
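- Reg: this coefficient compares spoken conversations with magazine articles. It is small (0.05) and not significant (p = 0.86), so register on its own does not predict the verb.
- Permitter: both coefficients are positive and significant, so relative to an animate permitter, an inanimate permitter (2.61, p < 0.0001) and an undefined permitter (1.29, p = 0.04) both push the prediction toward ‘allow’. An animate permitter therefore favors ‘let’.
- Imper: the coefficient is negative and significant (-3.11, p < 0.0001), so imperatives strongly predict ‘let’, while non-imperatives are more likely to take ‘allow’.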
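
Log-odds coefficients are easier to read as odds ratios. The sketch below copies the estimates and standard errors from the lrm() printout and exponentiates them; the Wald intervals are an approximation added here for illustration, not something the report computed.

```r
# Coefficients and standard errors copied from the lrm() printout
coefs <- c("Reg=SPOK" = 0.0530, "Permitter=Inanim" = 2.6069,
           "Permitter=Undef" = 1.2888, "Imper=Yes" = -3.1051)
ses   <- c(0.3072, 0.4106, 0.6166, 0.5464)

# Odds ratios for 'allow' (the second level of Verb) versus 'let'
odds_ratios <- exp(coefs)

# Approximate 95% Wald intervals on the odds-ratio scale
z  <- qnorm(0.975)
ci <- cbind(lower = exp(coefs - z * ses), upper = exp(coefs + z * ses))

print(round(cbind(OR = odds_ratios, ci), 3))
```

An odds ratio above 1 favors ‘allow’ and one below 1 favors ‘let’; an interval that contains 1, as for Reg=SPOK, matches a non-significant coefficient.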

Interactions

Add the interaction between Imper and Reg by doing Imper*Reg, but remember you will need to do a glm model.
Use the anova function to answer if the addition of the interaction was significant.
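- Is the interaction useful? Yes, by the deviance test: adding it lowers the residual deviance by 13.18 on 1 degree of freedom (p = 0.0003).
Use the visreg library and function to visualize the interaction.
- How would you explain that interaction? The effect of the imperative depends on register. In magazine text, imperatives mostly take ‘let’ (22 of 26), while in spoken text they always do (82 of 82). Because that last cell contains no ‘allow’ tokens, the interaction coefficient (-17.23, standard error 720.31) cannot be estimated reliably, so its Wald p-value of 0.98 should not be read as evidence against the interaction.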

model1 = glm(Verb ~ Reg + Permitter + Imper,
             family = binomial,
             data = let)
model2 = glm(Verb ~ Permitter + Imper*Reg,
             family = binomial,
             data = let)
anova(model1, model2, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: Verb ~ Reg + Permitter + Imper
## Model 2: Verb ~ Permitter + Imper * Reg
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       349     288.45                          
## 2       348     275.27  1   13.176 0.0002835 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
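
The test in this table is a difference in residual deviance referred to a \(\chi^2\) distribution. As a rough check, the sketch below redoes that arithmetic from the rounded deviances printed above; the variable names are illustrative.

```r
# Residual deviances copied from the Analysis of Deviance table above
dev_main        <- 288.45  # Verb ~ Reg + Permitter + Imper
dev_interaction <- 275.27  # Verb ~ Permitter + Imper * Reg

# The test statistic is the drop in deviance; the interaction adds 1 parameter
lr_stat <- dev_main - dev_interaction
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(lr_stat)
print(p_value)
```

The result differs slightly from the printed 13.176 because the deviances are rounded to two decimals in the table.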

summary(model2)

## 
## Call:
## glm(formula = Verb ~ Permitter + Imper * Reg, family = binomial, 
##     data = let)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.19626  -0.57802  -0.00013   0.43343   1.93484  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.3432     0.2343  -1.465   0.1430    
## PermitterInanim    2.6611     0.4142   6.425 1.32e-10 ***
## PermitterUndef     1.4209     0.6195   2.294   0.0218 *  
## ImperYes          -1.3615     0.5919  -2.300   0.0214 *  
## RegSPOK            0.3667     0.3216   1.140   0.2543    
## ImperYes:RegSPOK -17.2280   720.3052  -0.024   0.9809    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 489.62  on 353  degrees of freedom
## Residual deviance: 275.27  on 348  degrees of freedom
## AIC: 287.27
## 
## Number of Fisher Scoring iterations: 17

visreg(model2, "Imper", by = "Reg")

table(droplevels(let)$Verb, let$Reg, let$Imper)

## , ,  = No
## 
##        
##         MAG SPOK
##   let    50   33
##   allow  99   64
## 
## , ,  = Yes
## 
##        
##         MAG SPOK
##   let    22   82
##   allow   4    0
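
The empty cell in this table is what makes the interaction coefficient so unstable. A minimal sketch, using only the cell counts copied from the table above (the data frame layout is illustrative), computes the observed share of ‘allow’ in each combination of Imper and Reg:

```r
# Cell counts copied from the three-way table above
cells <- data.frame(
  Imper = c("No", "No", "Yes", "Yes"),
  Reg   = c("MAG", "SPOK", "MAG", "SPOK"),
  let   = c(50, 33, 22, 82),
  allow = c(99, 64, 4, 0)
)

# Observed proportion of 'allow' in each Imper x Reg cell
cells$p_allow <- round(cells$allow / (cells$let + cells$allow), 3)
print(cells)

# A cell with zero 'allow' tokens has observed log-odds of -Inf, which is
# why the fitted interaction coefficient is huge with an enormous standard error
empty <- cells[cells$allow == 0, c("Imper", "Reg")]
print(empty)
```

Only the spoken imperative cell is empty, which is consistent with the 17 Fisher scoring iterations reported in the model summary.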

Outliers

Use the car library and the influencePlot() to create a picture of the outliers.
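- Are there major outliers for this data? No major ones. The plot flags a handful of observations, but the largest studentized residuals are only about ±2.2 and the largest Cook's distance is 0.038, far below the conventional cutoff of 1.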

influencePlot(model2)

##        StudRes        Hat       CookD
## 13   0.7803198 0.06328249 0.004091446
## 15   0.6635054 0.06591552 0.002970197
## 45  -2.2236205 0.01176981 0.020395281
## 67   1.9908803 0.03846154 0.038133333
## 111  1.9908803 0.03846154 0.038133333
## 132 -2.2236205 0.01176981 0.020395281
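
How many of these points count as influential depends on the cutoff. The sketch below copies the diagnostics from the output above and compares Cook's distance against two commonly quoted conventions, 1 and 4/n; both are rules of thumb assumed here for illustration, not thresholds set by the assignment or by influencePlot().

```r
# Diagnostics copied from the influencePlot() output above
infl <- data.frame(
  obs     = c(13, 15, 45, 67, 111, 132),
  StudRes = c(0.7803198, 0.6635054, -2.2236205,
              1.9908803, 1.9908803, -2.2236205),
  CookD   = c(0.004091446, 0.002970197, 0.020395281,
              0.038133333, 0.038133333, 0.020395281)
)
n <- 354  # number of observations in the model

# Two commonly quoted cutoffs for Cook's distance (conventions, not hard rules)
infl$above_1        <- infl$CookD > 1
infl$above_4_over_n <- infl$CookD > 4 / n

print(infl)
```

None of the points exceeds 1, while the stricter 4/n cutoff flags observations 45, 67, 111, and 132, so those rows are the ones worth inspecting by hand.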

Assumptions

Explore the vif values of the original model (not the interaction model) and determine if you meet the assumption of additivity (meaning no multicollinearity). All four VIF values fall between 1.04 and 1.09, far below the usual warning levels of 5 or 10, so there is no sign of multicollinearity and the assumption is met.

rms::vif(model)

##         Reg=SPOK Permitter=Inanim  Permitter=Undef        Imper=Yes 
##         1.090869         1.044777         1.067404         1.055026