Load the languageR package and type names(dative) to get a list of the variables available in the dative dataset - a subset of the data used in Bresnan et al.'s analysis of the dative alternation. The choice of dative construction is coded as RealizationOfRecipient, which has two levels: 'NP' and 'PP'. (You can verify this by typing levels(dative$RealizationOfRecipient).) Build a logit model that tries to predict dative type, but has just an intercept, using glm(..., family=binomial, data=dative). What is the real-world interpretation of the single coefficient? What is its p-value? What does this tell you? Don't forget to undo the logit transformation.
set.seed(11231)
library(languageR)
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
levels(dative$RealizationOfRecipient)
## [1] "NP" "PP"
intercept.m <- glm(RealizationOfRecipient ~ 1 , family=binomial, data=dative)
summary(intercept.m)
##
## Call:
## glm(formula = RealizationOfRecipient ~ 1, family = binomial,
## data = dative)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7763 -0.7763 -0.7763 1.6409 1.6409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0450 0.0399 -26.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3741.1 on 3262 degrees of freedom
## Residual deviance: 3741.1 on 3262 degrees of freedom
## AIC: 3743.1
##
## Number of Fisher Scoring iterations: 4
The single coefficient is the estimate of \(logit(p(Y=1))\), i.e., the logit of the probability that \(Y=1\); the estimate of that probability is simply the relative frequency of \(Y=1\) in the data. Since NP precedes PP alphabetically, NP is the reference level and PP is coded as 1. Hence the coefficient is \(logit(relfreq(PP))\). We can verify this from the results below.
1 / (1 + exp(-coef(intercept.m)))
## (Intercept)
## 0.26019
table(dative$RealizationOfRecipient) / length(dative$RealizationOfRecipient)
##
## NP PP
## 0.73981 0.26019
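As an arithmetic check, plugging the relative frequency of PP into the logit function recovers the fitted intercept: \(logit(0.26019) = \log(0.26019/0.73981) \approx -1.045\), which matches the Estimate in the summary above.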
The p-value is \(<2e-16\), which means the intercept is significantly different from 0. A zero intercept corresponds to \(p(PP) = logit^{-1}(0) = 0.5\), so the p-value tells us that the two choices (NP and PP) are not equally likely.
Use the predict() function to generate predicted probabilities for all of the response data that your model was fit to - i.e., the whole RealizationOfRecipient vector. That is, if your model is called intercept.m, type predictions = predict(intercept.m, type='response') to get a vector of probabilities, each of which is the predicted probability that the corresponding element of the RealizationOfRecipient vector will be the level that R chose as non-baseline. Then, use ifelse() to threshold the probabilities at .5, mapping each element to the model's "best guess" about its value.
predictions <- predict(intercept.m, type='response')
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(predictions > .5, non.baseline.level, baseline.level)
Calculate the proportion of predictions which are equal to the observed values. What is the answer? Thinking about the model’s structure, what does this correspond to?
mean(thresholded == dative$RealizationOfRecipient)
## [1] 0.73981
We can see that the proportion of correct predictions is 0.73981. Note that the model has only an intercept, so the predicted probability is a constant; as shown above, this constant is the relative frequency of PP, 0.26019. The model therefore assigns the same probability, which is below the threshold, to every data point, and so it predicts NP for all of the data. As a result, the proportion of correct predictions is exactly the relative frequency of NP in the observed values, which can be read off the relative frequency table above.
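As a quick check (a minimal sketch using the thresholded vector defined above), we can tabulate the thresholded predictions and compare the accuracy with the relative frequency of NP:
table(thresholded) # every prediction is "NP"
mean(dative$RealizationOfRecipient == "NP") # 0.73981, the accuracy reported above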
From the graph on p. 66, we can see that removing Pron Rec or Pron Theme has the largest negative effect on model fit, followed by LogRThmDiff, then Def of Theme, then Animacy of Rec, and then Previous and Num of Theme. Def of Rec comes last and does not have a significantly negative effect.
The graph shows, for each predictor, how much the −2 log likelihood of the test data increases when the model is trained without that predictor. (An increase in −2 log likelihood means the test data are less likely under the model, indicating a poorer fit.)
The magnitude of the coefficients on slides 68-69 gives the order Animacy of Rec \(>\) Def of Theme \(>\) Pron Rec, Pron Theme, LogRThmDiff \(>\) Def of Rec \(>\) Num of Theme. This order differs from the previous one. For instance, Pron Rec, Pron Theme, and LogRThmDiff, the three most important predictors according to the previous ranking, have only the third-largest coefficients in absolute value. In contrast, Animacy of Rec, whose removal does not hurt model fit nearly as much, nevertheless has the largest coefficient in the model. So the two metrics lead to different conclusions.
This discrepancy arises because the predictors in the model are correlated. If a predictor is highly correlated with other predictors, removing it costs little in terms of model fit. Nevertheless, such a predictor may still have a large coefficient in the model, because when predictors are highly correlated the coefficient estimates tend to have greater variance.
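Because the predictors here are factors, one way to see this correlation (a minimal sketch, using two of the predictors as an example) is to cross-tabulate them and test for association:
# a strongly skewed table and a tiny p-value indicate that the two predictors are associated
table(dative$PronomOfRec, dative$AnimacyOfRec)
chisq.test(table(dative$PronomOfRec, dative$AnimacyOfRec))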
Fit a sequence of models with glm(), starting from the top and adding one predictor at a time, using the 6 predictors on Bresnan's slide 66 that are included in dative - everything except "Previous" and "Num of Theme". (Note that you'll have to create the third-ranked variable, LogRThmDiff, by hand, by taking the difference between the log-scaled length of the recipient and the log-scaled length of the theme.) For each, use the procedure you used for the intercept-only model to generate predictions and evaluate them against the training set. What proportion of the training examples match each model's thresholded predictions? Return your results in the form of a vector.
dative$LogRThmDiff <- log(dative$LengthOfRecipient) - log(dative$LengthOfTheme)
m.1p <- glm(RealizationOfRecipient ~ PronomOfRec ,
family=binomial, data=dative)
m.2p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme,
family=binomial, data=dative)
m.3p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff,
family=binomial, data=dative)
m.4p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme,
family=binomial, data=dative)
m.5p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme + AnimacyOfRec,
family=binomial, data=dative)
m.6p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme + AnimacyOfRec + DefinOfRec,
family=binomial, data=dative)
evl <- function(m){
predictions <- predict(m, type='response')
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(predictions > .5, non.baseline.level, baseline.level)
return(mean(thresholded == dative$RealizationOfRecipient))
}
sapply(list(m.1p, m.2p, m.3p, m.4p, m.5p, m.6p), evl)
## [1] 0.7486975 0.7704566 0.8467668 0.8623966 0.8611707 0.8660742
We can see that in general the proportion of correct predictions increases as more predictors are used.
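Since the slides rank predictors by their effect on −2 log likelihood, the nested models above can also be compared by their deviances; a brief sketch:
# each row tests whether adding the next predictor significantly reduces the residual deviance
anova(m.1p, m.2p, m.3p, m.4p, m.5p, m.6p, test="Chisq")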
cross.validate.m1 = sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
training.indices = (1:N)[-test.indices] # all indices not in test set
training.set = dative[training.indices,]
m = glm(RealizationOfRecipient ~ PronomOfRec, data = training.set, family=binomial)
coefs = coef(m)
f.hat = function(i) {
# factor encoding starts from 1, but in logistic regression it should start from 0
decoded.predictors <- ifelse(is.factor(test.set[i, "PronomOfRec"]),
unclass(test.set[i, "PronomOfRec"]) - 1,
test.set[i, "PronomOfRec"])
score <- sum(coefs * c(1, decoded.predictors)) # function on all predictors in model
return(1 / (1 + exp(-score))) # don't forget to undo logit transformation
}
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) test.predictions[i] = f.hat(i)
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(test.predictions > .5, non.baseline.level, baseline.level)
proportion.correct = mean(thresholded == test.set$RealizationOfRecipient)
return(proportion.correct)
})
mean(cross.validate.m1)
## [1] 0.7483742
For this (inefficient but hopefully clarifying) method you'll need to make a separate (similar but not identical) function for each model, filling in the ellipses. Then compute the predictions for each model and show your results in the form of a vector giving the average proportion correct for each.
Instead of making six copies, I will add one more layer of functional abstraction and pass the formula in as an argument. I will also use predict(model, newdata=...) instead of f.hat, because the explicit for-loop is slow.
cross.validate <- function(formula) {
cross.validate.m = sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
training.indices = (1:N)[-test.indices] # all indices not in test set
training.set = dative[training.indices,]
m = glm(formula, data = training.set, family=binomial)
test.predictions = predict(m, newdata=test.set, type="response")
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(test.predictions > .5, non.baseline.level, baseline.level)
proportion.correct = mean(thresholded == test.set$RealizationOfRecipient)
return(proportion.correct)
})
return(mean(cross.validate.m))
}
c(m1 = cross.validate(RealizationOfRecipient ~ PronomOfRec),
m2 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme),
m3 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff),
m4 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme),
m5 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme
+ AnimacyOfRec),
m6 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme
+ AnimacyOfRec + DefinOfRec))
## m1 m2 m3 m4 m5 m6
## 0.7464110 0.7662577 0.8482209 0.8601840 0.8622086 0.8664724
How does the models’ performance compare to how they did when the whole data set was used for both model fitting and evaluation in (4)? Does this evaluation method clarify whether, and to what extent, each predictor is useful in predicting the choice between the prepositional dative and the double object construction?
(Note: if adding predictors appears to make your model worse, rather than better, consider whether you’ve correctly identified the baseline and non-baseline variables in the ifelse(...) statement.)
The performance is quite similar, although the first two models come out slightly worse when the data are split into training and test sets. This is not surprising, because evaluating a model on its own training data generally overestimates model fit. We can also see that the first few predictors bring the largest gains in accuracy. I am not sure how much this says about which predictors are more useful, though. Again, since some of the predictors are highly correlated, predictors added later generally will not bring as much gain in accuracy as the earlier ones.
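To make the comparison with (4) concrete, one could put the two accuracy vectors side by side (here simply re-entered from the outputs above) and look at their difference:
train.acc <- c(0.7486975, 0.7704566, 0.8467668, 0.8623966, 0.8611707, 0.8660742) # training-set evaluation, from (4)
cv.acc <- c(0.7464110, 0.7662577, 0.8482209, 0.8601840, 0.8622086, 0.8664724) # cross-validated averages, from above
round(train.acc - cv.acc, 4) # positive values reflect the optimism of evaluating on the training data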