Load the languageR package and type names(dative) to get a list of the variables available in the dative dataset - a subset of the data used in Bresnan et al.'s analysis of the dative alternation. The choice of dative construction is coded as RealizationOfRecipient, which has two levels: 'NP' and 'PP'. (You can verify this by typing levels(dative$RealizationOfRecipient).) Build a logit model that tries to predict dative type, but has just an intercept, using glm(..., family=binomial, data=dative). What is the real-world interpretation of the single coefficient? What is its p-value? What does this tell you? Don't forget to undo the logit transformation.
set.seed(11231)
library(languageR)
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
levels(dative$RealizationOfRecipient)
## [1] "NP" "PP"
intercept.m <- glm(RealizationOfRecipient ~ 1 , family=binomial, data=dative)
summary(intercept.m)
##
## Call:
## glm(formula = RealizationOfRecipient ~ 1, family = binomial,
## data = dative)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7763 -0.7763 -0.7763 1.6409 1.6409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0450 0.0399 -26.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3741.1 on 3262 degrees of freedom
## Residual deviance: 3741.1 on 3262 degrees of freedom
## AIC: 3743.1
##
## Number of Fisher Scoring iterations: 4
The single coefficient is the estimate of \(logit(p(Y=1))\), i.e., the logit of the probability that \(Y=1\); the estimate of that probability is simply the relative frequency of \(Y=1\) in the data. Since NP precedes PP alphabetically, NP is the reference level and PP is coded as 1. Hence the coefficient is \(logit(relfreq(PP))\). We can verify this from the results below.
1 / (1 + exp(-coef(intercept.m)))
## (Intercept)
## 0.26019
table(dative$RealizationOfRecipient) / length(dative$RealizationOfRecipient)
##
## NP PP
## 0.73981 0.26019
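As an arithmetic check, plugging the relative frequency of PP into the logit function recovers the fitted intercept: \(logit(0.26019) = \log(0.26019/0.73981) \approx -1.045\), which matches the Estimate in the summary above.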
The p-value is \(<2e-16\), which means the intercept is significantly different from 0. A zero intercept corresponds to \(p(PP) = logit^{-1}(0) = 0.5\), so the p-value tells us that the two choices (NP and PP) are not equally likely.
Use the predict() function to generate predicted probabilities for all of the response data that your model was fit to - i.e., the whole RealizationOfRecipient vector. That is, if your model is called intercept.m, type predictions = predict(intercept.m, type='response') to get a vector of probabilities, each of which is the predicted probability that the corresponding element of the RealizationOfRecipient vector will be the level that R chose as non-baseline. Then, use ifelse() to threshold the probabilities at .5, mapping each element to the model's "best guess" about its value.
predictions <- predict(intercept.m, type='response')
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(predictions > .5, non.baseline.level, baseline.level)
Calculate the proportion of predictions which are equal to the observed values. What is the answer? Thinking about the model’s structure, what does this correspond to?
mean(thresholded == dative$RealizationOfRecipient)
## [1] 0.73981
We can see that the proportion of correct predictions is 0.73981. Note that the model has only an intercept, so the predicted probability is a constant; as shown above, this constant is the relative frequency of PP, 0.26019. The model therefore assigns the same probability, which is below the threshold, to every data point, and so it predicts NP for all of the data. As a result, the proportion of correct predictions is exactly the relative frequency of NP in the observed values, which can be read off the relative frequency table above.
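As a quick check (a minimal sketch using the thresholded vector defined above), we can tabulate the thresholded predictions and compare the accuracy with the relative frequency of NP:
table(thresholded) # every prediction is "NP"
mean(dative$RealizationOfRecipient == "NP") # 0.73981, the accuracy reported above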
From the graph on p. 66, we can see that removing Pron Rec or Pron Theme has the largest negative effect on model fit, followed by LogRThmDiff, then Def of Theme, then Animacy of Rec, and then Previous and Num of Theme. Def of Rec comes last and does not have a significantly negative effect.
The graph shows, for each predictor, how much the −2 log likelihood of the test data increases when the model is trained without that predictor. (An increase in −2 log likelihood means the test data are less likely under the model, indicating a poorer fit.)
The magnitude of the coefficients on slides 68-69 gives the order Animacy of Rec \(>\) Def of Theme \(>\) Pron Rec, Pron Theme, LogRThmDiff \(>\) Def of Rec \(>\) Num of Theme. This order differs from the previous one. For instance, Pron Rec, Pron Theme, and LogRThmDiff, the three most important predictors according to the previous ranking, have only the third-largest coefficients in absolute value. In contrast, Animacy of Rec, whose removal does not hurt model fit nearly as much, nevertheless has the largest coefficient in the model. So the two metrics lead to different conclusions.
This discrepancy arises because the predictors in the model are correlated. If a predictor is highly correlated with other predictors, removing it costs little in terms of model fit. Nevertheless, such a predictor may still have a large coefficient in the model, because when predictors are highly correlated the coefficient estimates tend to have greater variance.
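Because the predictors here are factors, one way to see this correlation (a minimal sketch, using two of the predictors as an example) is to cross-tabulate them and test for association:
# a strongly skewed table and a tiny p-value indicate that the two predictors are associated
table(dative$PronomOfRec, dative$AnimacyOfRec)
chisq.test(table(dative$PronomOfRec, dative$AnimacyOfRec))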
Fit a sequence of models with glm(), starting from the top and adding one predictor at a time, using the 6 predictors on Bresnan's slide 66 that are included in dative - everything except "Previous" and "Num of Theme". (Note that you'll have to create the third-ranked variable, LogRThmDiff, by hand, by taking the difference between the log-scaled length of the recipient and the log-scaled length of the theme.) For each, use the procedure you used for the intercept-only model to generate predictions and evaluate them against the training set. What proportion of the training examples match each model's thresholded predictions? Return your results in the form of a vector.
dative$LogRThmDiff <- log(dative$LengthOfRecipient) - log(dative$LengthOfTheme)
m.1p <- glm(RealizationOfRecipient ~ PronomOfRec ,
family=binomial, data=dative)
m.2p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme,
family=binomial, data=dative)
m.3p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff,
family=binomial, data=dative)
m.4p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme,
family=binomial, data=dative)
m.5p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme + AnimacyOfRec,
family=binomial, data=dative)
m.6p <- glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme + LogRThmDiff
+ DefinOfTheme + AnimacyOfRec + DefinOfRec,
family=binomial, data=dative)
evl <- function(m){
predictions <- predict(m, type='response')
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(predictions > .5, non.baseline.level, baseline.level)
return(mean(thresholded == dative$RealizationOfRecipient))
}
sapply(list(m.1p, m.2p, m.3p, m.4p, m.5p, m.6p), evl)
## [1] 0.7486975 0.7704566 0.8467668 0.8623966 0.8611707 0.8660742
We can see that in general the proportion of correct predictions increases as more predictors are used.
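Since the slides rank predictors by their effect on −2 log likelihood, the nested models above can also be compared by their deviances; a brief sketch:
# each row tests whether adding the next predictor significantly reduces the residual deviance
anova(m.1p, m.2p, m.3p, m.4p, m.5p, m.6p, test="Chisq")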
cross.validate.m1 = sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
training.indices = (1:N)[-test.indices] # all indices not in test set
training.set = dative[training.indices,]
m = glm(RealizationOfRecipient ~ PronomOfRec, data = training.set, family=binomial)
coefs = coef(m)
f.hat = function(i) {
# factor encoding starts from 1, but in logistic regression it should start from 0
decoded.predictors <- ifelse(is.factor(test.set[i, "PronomOfRec"]),
unclass(test.set[i, "PronomOfRec"]) - 1,
test.set[i, "PronomOfRec"])
score <- sum(coefs * c(1, decoded.predictors)) # function on all predictors in model
return(1 / (1 + exp(-score))) # don't forget to undo logit transformation
}
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) test.predictions[i] = f.hat(i)
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(test.predictions > .5, non.baseline.level, baseline.level)
proportion.correct = mean(thresholded == test.set$RealizationOfRecipient)
return(proportion.correct)
})
mean(cross.validate.m1)
## [1] 0.7483742
For this (inefficient but hopefully clarifying) method you'll need to make a separate (similar but not identical) function for each model, filling in the ellipses. Then compute the predictions for each model and show your results in the form of a vector giving the average proportion correct for each.
Instead of making six copies, I will add one more layer of functional abstraction and pass the formula in as an argument. I will also use predict(model, newdata=...) instead of f.hat, because the explicit for-loop is slow.
cross.validate <- function(formula) {
cross.validate.m = sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
training.indices = (1:N)[-test.indices] # all indices not in test set
training.set = dative[training.indices,]
m = glm(formula, data = training.set, family=binomial)
test.predictions = predict(m, newdata=test.set, type="response")
baseline.level = levels(dative$RealizationOfRecipient)[1]
non.baseline.level = levels(dative$RealizationOfRecipient)[2]
thresholded = ifelse(test.predictions > .5, non.baseline.level, baseline.level)
proportion.correct = mean(thresholded == test.set$RealizationOfRecipient)
return(proportion.correct)
})
return(mean(cross.validate.m))
}
c(m1 = cross.validate(RealizationOfRecipient ~ PronomOfRec),
m2 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme),
m3 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff),
m4 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme),
m5 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme
+ AnimacyOfRec),
m6 = cross.validate(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme
+ LogRThmDiff + DefinOfTheme
+ AnimacyOfRec + DefinOfRec))
## m1 m2 m3 m4 m5 m6
## 0.7464110 0.7662577 0.8482209 0.8601840 0.8622086 0.8664724
How does the models’ performance compare to how they did when the whole data set was used for both model fitting and evaluation in (4)? Does this evaluation method clarify whether, and to what extent, each predictor is useful in predicting the choice between the prepositional dative and the double object construction?
(Note: if adding predictors appears to make your model worse, rather than better, consider whether you’ve correctly identified the baseline and non-baseline variables in the ifelse(...) statement.)
The performance is quite similar, although the first two models come out slightly worse when the data are split into training and test sets. This is not surprising, because evaluating a model on its own training data generally overestimates model fit. We can also see that the first few predictors bring the largest gains in accuracy. I am not sure how much this says about which predictors are more useful, though. Again, since some of the predictors are highly correlated, predictors added later generally will not bring as much gain in accuracy as the earlier ones.
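To make the comparison with (4) concrete, one could put the two accuracy vectors side by side (here simply re-entered from the outputs above) and look at their difference:
train.acc <- c(0.7486975, 0.7704566, 0.8467668, 0.8623966, 0.8611707, 0.8660742) # training-set evaluation, from (4)
cv.acc <- c(0.7464110, 0.7662577, 0.8482209, 0.8601840, 0.8622086, 0.8664724) # cross-validated averages, from above
round(train.acc - cv.acc, 4) # positive values reflect the optimism of evaluating on the training data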