ANLY540 - Analysis of Human Language - Assignment 5: Probabilistic Grammar

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

The data for this project has already been loaded. Here you will be distinguishing between the uses of let, allow, and permit. You should pick two of these verbs to examine and subset the dataset to only the rows for those verbs. You can use the droplevels() function to drop the empty level after you subset.

library(Rling)
library(rms)
library(visreg)
library(car)

data(let)
# Keep only let and allow, then drop the now-empty "permit" level
let <- droplevels(subset(let, Verb != "permit"))

Description of the Data

The data come from the Corpus of Contemporary American English (COCA) and cover the choice among the verbs let, allow, and permit. These are permissive constructions that are often paired with the word to. Predict the verb choice in the Verb column with the following independent variables:

Reg: SPOK for spoken conversations, MAG for magazine articles.
Permitter: semantic class of the clause subject, Anim for animate, Inanim for inanimate and Undef for undefined.
Imper: Yes for the imperative, No for not.
Note: Year is in the dataset, which would be awesome but gave me trouble. You can try adding it if you want.

Sample Size Requirements

Is the sample size large enough for four predictors?
Is the split between your chosen verbs even enough to think you could predict them?

table(droplevels(let)$Verb)

## 
##   let allow 
##   187   167

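The sample size of 354 is large enough. The model uses three predictors (‘Reg’, ‘Permitter’, and ‘Imper’), which take four coefficients, and the less frequent verb (‘allow’) has 167 observations, well above the common rule of thumb of about 10 observations of the rarer outcome per coefficient. I won’t be using the fourth predictor, ‘Year’.
The split between the chosen verbs, ‘let’ and ‘allow’, is 53:47, which is balanced enough to distinguish between them.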
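
The two checks above can be reproduced from the counts in the table. This is a minimal sketch that hard-codes those counts; the cutoff of roughly 10 observations of the rarer outcome per coefficient is an assumed convention, not something fixed by the assignment, and the variable names are illustrative.

```r
# Counts copied from the table() output above
verb_counts <- c(let = 187, allow = 167)

# Coefficients in the planned model, excluding the intercept:
# Reg (1) + Permitter (2) + Imper (1)
n_coefs <- 4

# Assumed rule of thumb: about 10 observations of the rarer outcome per coefficient
min_needed <- 10 * n_coefs
enough <- min(verb_counts) >= min_needed

# Share of each verb, as rounded percentages
verb_share <- round(100 * prop.table(verb_counts))

enough      # TRUE
verb_share  # let 53, allow 47
```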

Running a Binary Logistic Regression

Run the logistic regression using the rms package.

Use the \(\chi^2\) test - is the overall model predictive of verb choice? Is it significant?
What is Nagelkerke’s pseudo-\(R^2\)? What does it tell you about goodness of fit?
What is the C statistic? How well are we predicting?

model = lrm(Verb ~ Reg + Permitter + Imper,
            data = let)
model

## Logistic Regression Model
##  
##  lrm(formula = Verb ~ Reg + Permitter + Imper, data = let)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           354    LR chi2     201.17    R2       0.579    C       0.876    
##   let          187    d.f.             4    g        2.417    Dxy     0.753    
##   allow        167    Pr(> chi2) <0.0001    gr      11.214    gamma   0.825    
##  max |deriv| 6e-10                          gp       0.377    tau-a   0.376    
##                                             Brier    0.132                     
##  
##                   Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept        -0.1935 0.2280 -0.85  0.3961  
##  Reg=SPOK          0.0530 0.3072  0.17  0.8629  
##  Permitter=Inanim  2.6069 0.4106  6.35  <0.0001 
##  Permitter=Undef   1.2888 0.6166  2.09  0.0366  
##  Imper=Yes        -3.1051 0.5464 -5.68  <0.0001 
##

The likelihood ratio \(\chi^2\) test gives \(\chi^2\) = 201.17 on 4 degrees of freedom with p < 0.0001, so the overall model is significantly predictive of verb choice. It fits better than an intercept-only model.
Nagelkerke’s pseudo-\(R^2\) is 0.579. It cannot be read as the percentage of variance in y explained by the model, as \(R^2\) is in linear regression, because y is categorical here. It is still useful for comparing the fit of competing models on the same data. The C statistic is easier to interpret as a measure of how well the model separates the two verbs.
The C statistic is the concordance index, a measure of the model’s ability to discriminate between the two outcomes, where 0.5 is chance and 1 is perfect. Here C = 0.876, which falls in the range (0.8 to 0.9) usually described as excellent discrimination.
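
As a rough check on the figures quoted above, the p-value of the likelihood ratio test and the link between C and Somers’ Dxy can be recomputed from the printed output. The sketch below copies the numbers from the lrm printout and relies on the standard identity Dxy = 2(C - 0.5); the variable names are illustrative.

```r
# Values copied from the lrm() printout above
lr_chi2 <- 201.17
lr_df   <- 4
c_stat  <- 0.876
dxy_reported <- 0.753

# Upper-tail probability of the chi-squared statistic
p_value <- pchisq(lr_chi2, df = lr_df, lower.tail = FALSE)

# Somers' Dxy rescales C so that 0 means chance and 1 means perfect
dxy_from_c <- 2 * (c_stat - 0.5)

p_value < 0.0001                       # TRUE, matching Pr(> chi2) <0.0001
abs(dxy_from_c - dxy_reported) < 0.002 # TRUE, equal up to rounding of C
```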

Coefficients

Explain each coefficient - are they significant? What do they imply if they are significant (i.e., which verb does it predict)?

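We have 3 categories for ‘Permitter’, the semantic class of the clause subject, and 2 categories each for ‘Reg’ and ‘Imper’. The model predicts ‘allow’ (the second level of Verb), so positive coefficients favor ‘allow’ and negative coefficients favor ‘let’.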

Permitter:

table(droplevels(let)$Verb, let$Permitter)

##        
##         Anim Inanim Undef
##   let    175      8     4
##   allow   64     91    12

Relative to an animate ‘Permitter’ (the reference level), an inanimate ‘Permitter’ significantly raises the odds of ‘allow’ (coefficient 2.6069, p < 0.0001). The table agrees: ‘let’ dominates with animate permitters (175 vs. 64), while ‘allow’ dominates with inanimate ones (91 vs. 8).
An undefined ‘Permitter’ also favors ‘allow’ relative to an animate one (coefficient 1.2888, p = 0.0366). This is significant at the 0.05 level, although it rests on only 16 observations.

Reg:

table(droplevels(let)$Verb, let$Reg)

##        
##         MAG SPOK
##   let    72  115
##   allow 103   64

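The coefficient for ‘Reg’, which compares spoken conversations with magazine articles, is not significant (coefficient 0.0530, p = 0.8629). ‘let’ is more frequent in speech in the raw table, but that difference does not hold up once ‘Permitter’ and ‘Imper’ are in the model.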

Imper:

table(droplevels(let)$Verb, let$Imper)

##        
##          No Yes
##   let    83 104
##   allow 163   4

An imperative significantly lowers the odds of ‘allow’ (coefficient -3.1051, p < 0.0001), so the model favors ‘let’ in imperatives and leans towards ‘allow’ otherwise. In the table, 104 of the 108 imperatives use ‘let’.
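
The coefficients are on the log-odds scale, so exponentiating them gives odds ratios for ‘allow’, and the inverse logit turns a sum of coefficients into a predicted probability. The sketch below hard-codes the estimates from the lrm printout; the variable names and the three example cases are chosen here for illustration only.

```r
# Estimates copied from the lrm() coefficient table
coefs <- c(Intercept       = -0.1935,
           RegSPOK         =  0.0530,
           PermitterInanim =  2.6069,
           PermitterUndef  =  1.2888,
           ImperYes        = -3.1051)

# Odds ratios for 'allow' relative to each reference level (intercept dropped)
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 2)

# Predicted probability of 'allow' in magazine, non-imperative clauses
p_anim   <- plogis(coefs[["Intercept"]])
p_inanim <- plogis(coefs[["Intercept"]] + coefs[["PermitterInanim"]])

# Same register, animate permitter, but imperative
p_anim_imper <- plogis(coefs[["Intercept"]] + coefs[["ImperYes"]])

round(c(animate = p_anim, inanimate = p_inanim, imperative = p_anim_imper), 3)
```

An odds ratio above 1 favors ‘allow’ and one below 1 favors ‘let’, which is the same reading as the sign of the coefficient.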

Interactions

Add the interaction between Imper and Reg by doing Imper*Reg, but remember you will need to do a glm model.
Use the anova function to answer if the addition of the interaction was significant.
- Is the interaction useful?
Use the visreg library and function to visualize the interaction.
- How would you explain that interaction?

model1 = glm(Verb ~ Reg + Permitter + Imper,
             family = binomial,
             data = let)
model2 = glm(Verb ~ Permitter + Imper*Reg,
             family = binomial,
             data = let)
anova(model1, model2, test = "Chisq")

summary(model2)

## 
## Call:
## glm(formula = Verb ~ Permitter + Imper * Reg, family = binomial, 
##     data = let)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.19626  -0.57802  -0.00013   0.43343   1.93484  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.3432     0.2343  -1.465   0.1430    
## PermitterInanim    2.6611     0.4142   6.425 1.32e-10 ***
## PermitterUndef     1.4209     0.6195   2.294   0.0218 *  
## ImperYes          -1.3615     0.5919  -2.300   0.0214 *  
## RegSPOK            0.3667     0.3216   1.140   0.2543    
## ImperYes:RegSPOK -17.2280   720.3052  -0.024   0.9809    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 489.62  on 353  degrees of freedom
## Residual deviance: 275.27  on 348  degrees of freedom
## AIC: 287.27
## 
## Number of Fisher Scoring iterations: 17

visreg(model2, "Imper", by = "Reg")

The addition of the interaction is significant, as seen from the p-value of the anova test. The Wald p-value of the interaction term in the model summary (0.9809) suggests otherwise, but it is not reliable here: the estimate of -17.2280 with a standard error of 720.3052, together with the 17 Fisher scoring iterations, points to quasi-complete separation, where one combination of predictors contains only one of the two verbs.
It is difficult to interpret the interaction from the visualization alone. However, the table below shows it directly. Spoken imperatives always use ‘let’ (82 vs. 0), whereas spoken non-imperatives mostly use ‘allow’ (64 vs. 33). Magazine imperatives also favor ‘let’ (22 vs. 4), but not categorically.
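
The anova comparison can be approximated from figures already printed in this report. This is a sketch under one assumption: that the residual deviance of the main-effects model equals the null deviance minus the LR chi2 reported by lrm for the same formula and data. The variable names are illustrative.

```r
# Figures copied from the summary(model2) and lrm() printouts
null_dev      <- 489.62   # null deviance
lr_chi2_main  <- 201.17   # LR chi2 of the main-effects model
resid_dev_int <- 275.27   # residual deviance of the interaction model

# Assumed: residual deviance of the main-effects model
resid_dev_main <- null_dev - lr_chi2_main

# Drop in deviance from adding the single interaction coefficient (1 d.f.)
dev_drop <- resid_dev_main - resid_dev_int
p_interaction <- pchisq(dev_drop, df = 1, lower.tail = FALSE)

dev_drop               # 13.18
p_interaction < 0.001  # TRUE
```

Unlike the Wald test in the model summary, this comparison of deviances does not depend on the inflated standard error of the interaction term.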

table(droplevels(let)$Verb, let$Reg, let$Imper)

## , ,  = No
## 
##        
##         MAG SPOK
##   let    50   33
##   allow  99   64
## 
## , ,  = Yes
## 
##        
##         MAG SPOK
##   let    22   82
##   allow   4    0
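
The empty cell in the table above is what makes the interaction coefficient so unstable. The sketch below rebuilds the three-way table from the printed counts and computes the share of ‘allow’ in each combination of register and mood; the object names are illustrative, and the counts are typed in by hand rather than taken from the data frame.

```r
# Counts copied from the three-way table above; array() fills Verb fastest
counts <- array(
  c(50, 99, 33, 64,   # Imper = No:  let/allow in MAG, then let/allow in SPOK
    22,  4, 82,  0),  # Imper = Yes: let/allow in MAG, then let/allow in SPOK
  dim = c(2, 2, 2),
  dimnames = list(Verb  = c("let", "allow"),
                  Reg   = c("MAG", "SPOK"),
                  Imper = c("No", "Yes"))
)

# Proportion of 'allow' within each Reg x Imper combination
cell_totals <- apply(counts, c(2, 3), sum)
prop_allow  <- counts["allow", , ] / cell_totals
round(prop_allow, 2)

# Positions of empty cells, which signal (quasi-)complete separation
zero_cells <- which(counts == 0, arr.ind = TRUE)
nrow(zero_cells)  # 1: allow, SPOK, Yes
```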

Outliers

Use the car library and the influencePlot() to create a picture of the outliers.
- Are there major outliers for this data?

influencePlot(model2)

There seem to be some outliers in the data, as seen from the plot: a few points have larger hat-values and studentized residuals above +2 or below -2.
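
The same criteria can be applied numerically rather than read off the plot. The sketch below is only an illustration on a stand-in model fitted to the built-in mtcars data, not to the let data, and the hat-value cutoff of 2(k + 1)/n is an assumed rule of thumb. Replacing fit with model2 would apply the same flags to the interaction model.

```r
# Stand-in logistic model on a built-in dataset (NOT the let data)
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Studentized residuals and leverage (hat) values for every observation
stud_res <- rstudent(fit)
hat_vals <- hatvalues(fit)

# Assumed rule of thumb for high leverage: 2 * (k + 1) / n
k <- length(coef(fit)) - 1
n <- nobs(fit)
hat_cutoff <- 2 * (k + 1) / n

# Observations with large residuals or high leverage
is_flagged <- abs(stud_res) > 2 | hat_vals > hat_cutoff
flagged <- data.frame(stud_res, hat_vals)[is_flagged, ]
flagged
```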

Assumptions

Explore the vif values of the original model (not the interaction model) and determine if you meet the assumption of additivity (meaning no multicollinearity).

rms::vif(model)

##         Reg=SPOK Permitter=Inanim  Permitter=Undef        Imper=Yes 
##         1.090869         1.044777         1.067404         1.055026

There doesn’t seem to be multicollinearity: all VIF values are close to 1 (between 1.04 and 1.09), far below the common cutoff of 5. The model therefore meets the assumption of additivity.
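
The check against the cutoff can be written out explicitly. The sketch below hard-codes the VIF values printed above; the cutoff of 5 is the convention used in this report, and the implied \(R^2\) uses the textbook linear-model definition VIF = 1/(1 - \(R^2\)), which should be read only as an approximation for a logistic model.

```r
# VIF values copied from the rms::vif(model) output above
vifs <- c("Reg=SPOK"         = 1.090869,
          "Permitter=Inanim" = 1.044777,
          "Permitter=Undef"  = 1.067404,
          "Imper=Yes"        = 1.055026)

# Conventional cutoff used in this report
vif_cutoff <- 5
all_ok <- all(vifs < vif_cutoff)

# Approximate share of each predictor explained by the others,
# assuming VIF = 1 / (1 - R^2)
implied_r2 <- round(1 - 1 / vifs, 3)

all_ok      # TRUE
implied_r2  # all below 0.1
```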