Q1) Logit model to predict dative type with only intercept
library(languageR)
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
levels(dative$RealizationOfRecipient)
## [1] "NP" "PP"
xtabs(~ PronomOfRec + RealizationOfRecipient,
dative)
## RealizationOfRecipient
## PronomOfRec NP PP
## nonpronominal 600 629
## pronominal 1814 220
#Set our response variable to 1 for a PP recipient (R has a default
#factor coding, but I wanted to make the coding explicit)
DV = ifelse(dative$RealizationOfRecipient=="PP", 1,0)
#Intercept only model
m.intercept=glm(DV ~ 1, family=binomial, data=dative)
#The model's single parameter; its inverse logit equals the proportion of
#PP recipients in the RealizationOfRecipient vector
intercept=m.intercept$coefficients[1]
summary(m.intercept)
##
## Call:
## glm(formula = DV ~ 1, family = binomial, data = dative)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7763 -0.7763 -0.7763 1.6409 1.6409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0450 0.0399 -26.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3741.1 on 3262 degrees of freedom
## Residual deviance: 3741.1 on 3262 degrees of freedom
## AIC: 3743.1
##
## Number of Fisher Scoring iterations: 4
#Proportion of PP recipients in the RealizationOfRecipient vector
proportion.pp=(sum(DV)/
length(dative$RealizationOfRecipient))
proportion.pp
## [1] 0.26019
#Inverse logit transformation of the intercept
e=exp(1) #e
invlogtr = function(m) { (e^m)/(1+e^m) }
f_hat = invlogtr(intercept)
f_hat
## (Intercept)
## 0.26019
#Comparison function to account for rounding errors (iff
#abs difference is less than .0000000001)
equality.compare = function(n1, n2) {
return(abs(n1 - n2) < 10^-10)
}
equality.compare(proportion.pp, f_hat)
## (Intercept)
## TRUE
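For what it's worth, base R's plogis() is the same inverse-logit transformation, so it should give an identical answer (just a side check, using only the objects defined above):
#plogis() is the logistic CDF, i.e. the inverse logit
plogis(intercept)
equality.compare(f_hat, plogis(intercept))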
This single coefficient is on the log-odds scale: applying the inverse logit to it gives 0.26019, the model's predicted probability of a PP realization for every case, which matches the observed proportion of PP recipients exactly. The p-value is extremely small, indicating that an estimate this far from zero would be very unlikely if H0 were true (an intercept of 0, i.e. a predicted probability of 0.5).
Q2) Prediction accuracy of the intercept-only model
predictions = predict(m.intercept, type='response')
#Looking at our predictions vector - all the elements are
#equal to the proportion of PP realizations
baseline.level=0
non.baseline.level=1
threshold= ifelse(predictions > .5, non.baseline.level, baseline.level)
proportion.correct = sum(threshold == DV)/length(DV)
proportion.correct
## [1] 0.73981
Since our model predicts a flat probability of about .26 for every case, every prediction falls below the .5 threshold and the model always guesses NP. The proportion correct is therefore simply the proportion of NP realizations, 1 - 0.26019 = 0.73981.
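A quick check of that claim, using only the quantities computed above:
#Every prediction is below .5, so every case is labelled NP (0)
all(threshold == 0)
#Accuracy therefore equals the proportion of NP realizations
1 - proportion.pp
mean(DV == 0)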
Q3) Incorporating Bresnan et al. (2007)
Given the graph on p. 66, which plots the increase in -2 log likelihood (i.e. a decrease in the model's goodness of fit) when a predictor is removed, we would conclude that pronominality of recipient and pronominality of theme have the two largest effects on the model, followed by LogRThmDiff. More simply, removing these regressors would reduce the model's predictive power more substantially than removing any of the other regressors.
If we compare this assessment to the magnitude of the coefficients on pp. 68-69, we see that while pronominality of recipient and theme do tend to have somewhat larger coefficients, (1) they are not the largest, and (2) the differences are far smaller than the graph on p. 66 would suggest. In fact, if we judged predictive importance by coefficient magnitude alone, animacy of recipient and definiteness of theme would appear to be the largest driving forces.
I'm not completely sure why we see this discrepancy. My intuition is that there is a difference between how consistently a regressor covaries with the outcome and something like the "salience" of a particular level of that regressor. For example, pronominality of recipient and of theme might track the DV most reliably across the whole data set, which would give them the largest contributions to model fit. Animacy of recipient might correlate more weakly with the DV overall, yet if, in the (perhaps few) cases where it takes one particular value, the DV were nearly always PP (just as an example), the coefficient for that level would still be large. So there is a tension between how consistently a binary regressor varies with the outcome and how strongly a given level of it shifts the outcome, and that could explain why the deviance plot and the coefficient magnitudes rank the predictors differently.
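One way to ground this intuition in our own data is a drop-in-deviance comparison: drop1() reports how much the residual deviance (-2 log likelihood) rises when each term is removed, which is closer to what the p. 66 plot measures than raw coefficient size. The model below is only a rough stand-in for Bresnan et al.'s full model (no random effects, fewer predictors), and m.sketch and log.ratio are just names for this illustration, so the numbers are illustrative rather than a reproduction:
#Sketch: deviance increase when each predictor is dropped from a simple model
log.ratio = log(dative$LengthOfRecipient) - log(dative$LengthOfTheme)
m.sketch = glm(DV ~ PronomOfRec + PronomOfTheme + log.ratio +
               DefinOfTheme + AnimacyOfRec + DefinOfRec,
               family=binomial, data=dative)
drop1(m.sketch, test="Chisq")  #bigger deviance change = bigger loss of fit when dropped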
Q4) Building a series of models
#Review our predictors
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
###:-------Sequence of logistic models-----------:###
m1 = glm(DV ~ PronomOfRec, family=binomial, data=dative)
predictions1 = predict(m1, type='response')
threshold1 = ifelse(predictions1 > .5, 1, 0)
prop1=sum(threshold1 == DV)/length(DV)
prop1
## [1] 0.7486975
m2 = glm(DV ~ PronomOfRec + PronomOfTheme, family=binomial, data=dative)
predictions2 = predict(m2, type='response')
threshold2 = ifelse(predictions2 > .5, 1, 0)
prop2=sum(threshold2 == DV)/length(DV)
prop2
## [1] 0.7704566
#Function to compute log(recipient length) - log(theme length), the LogRThmDiff
#predictor, for any set of row indices (so we can reuse it below)
logRthmDiff.fun = function(vec.indices) {
log.len.rec=log(dative[vec.indices, ]$LengthOfRecipient)
log.len.theme=log(dative[vec.indices, ]$LengthOfTheme)
return(log.len.rec - log.len.theme)
}
LogRThmDiff = logRthmDiff.fun(c(1:nrow(dative)))
m3 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff,
family=binomial, data=dative)
predictions3 = predict(m3, type='response')
threshold3 = ifelse(predictions3 > .5, 1, 0)
prop3=sum(threshold3 == DV)/length(DV)
prop3
## [1] 0.8467668
m4 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme,
family=binomial, data=dative)
predictions4 = predict(m4, type='response')
threshold4 = ifelse(predictions4 > .5, 1, 0)
prop4=sum(threshold4 == DV)/length(DV)
prop4
## [1] 0.8623966
m5 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme +
AnimacyOfRec, family=binomial, data=dative)
predictions5 = predict(m5, type='response')
threshold5 = ifelse(predictions5 > .5, 1, 0)
prop5=sum(threshold5 == DV)/length(DV)
prop5
## [1] 0.8611707
m6 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme +
AnimacyOfRec + DefinOfRec, family=binomial, data=dative)
predictions6 = predict(m6, type='response')
threshold6 = ifelse(predictions6 > .5, 1, 0)
prop6=sum(threshold6 == DV)/length(DV)
prop6
## [1] 0.8660742
#Vector for each model accuracy
m.proportions=c(prop1=prop1, prop2=prop2, prop3=prop3,
prop4=prop4, prop5=prop5, prop6=prop6)
m.proportions
## prop1 prop2 prop3 prop4 prop5 prop6
## 0.7486975 0.7704566 0.8467668 0.8623966 0.8611707 0.8660742
#plot the proportions here
barplot(m.proportions, ylim=c(0,1),
names.arg=c("m1.prop", "m2.prop", "m3.prop",
"m4.prop", "m5.prop", "6.prop"),
main="Proportion matched\nModel predictions VS Observations",
ylab="proportion matched",
col=c(1,2,3,4,5,6))
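As an aside, the six fit-predict-score steps above differ only in their formula, so they could be collapsed into a small helper; fit.and.score below is just a sketch of that idea (the name is mine, not part of the assignment):
#Sketch of a reusable fit/predict/score helper
fit.and.score = function(formula) {
  m = glm(formula, family=binomial, data=dative)
  predicted = ifelse(predict(m, type='response') > .5, 1, 0)
  sum(predicted == DV) / length(DV)
}
fit.and.score(DV ~ PronomOfRec + PronomOfTheme)  #should reproduce prop2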
Q5) Model performance assessed with cross-validation
#####:---------model with a single regressor IV-----------:###
cross.validate.m1=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m1.mean = mean(cross.validate.m1)
#####:---------model with 2 regressors-----------:###
cross.validate.m2=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m2.mean = mean(cross.validate.m2)
#####:---------model with 3 regressors-----------:###
cross.validate.m3=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]]  #use the test item's log length difference
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m3.mean = mean(cross.validate.m3)
#####:---------model with 4 regressors-----------:###
cross.validate.m4=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m4.mean = mean(cross.validate.m4)
#####:---------model with 5 regressors-----------:###
cross.validate.m5=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme + AnimacyOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0) +
coefs[6] * ifelse(test.set$AnimacyOfRec[x] == "inanimate", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m5.mean = mean(cross.validate.m5)
#####:---------model with 6 regressors-----------:###
cross.validate.m6=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme + AnimacyOfRec + DefinOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0) +
coefs[6] * ifelse(test.set$AnimacyOfRec[x] == "inanimate", 1, 0) +
coefs[7] * ifelse(test.set$DefinOfRec[x] == "indefinite", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m6.mean = mean(cross.validate.m6)
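The six cross-validation blocks above likewise differ only in their formula. A more compact version might look like the sketch below, where predict() with newdata computes the test-set probabilities directly from the fitted model, so the coefficients never have to be multiplied out by hand (cv.once and dat are names I introduce here; dat simply bundles DV and LogRThmDiff with dative):
#Sketch: one random 90/10 split, returning test-set accuracy for a given formula
dat = dative
dat$DV = DV
dat$LogRThmDiff = LogRThmDiff
cv.once = function(formula) {
  N = nrow(dat)
  test.indices = sample(1:N, size=floor(N/10))
  m = glm(formula, family=binomial, data=dat[-test.indices, ])
  probs = predict(m, newdata=dat[test.indices, ], type="response")
  mean(ifelse(probs > .5, 1, 0) == dat$DV[test.indices])
}
mean(sapply(1:100, function(i) cv.once(DV ~ PronomOfRec + PronomOfTheme)))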
#Looking at our cross-validation results for the models
m.means=c(mean1=m1.mean, mean2=m2.mean, mean3=m3.mean,
mean4=m4.mean, mean5=m5.mean, mean6=m6.mean)
m.means
## mean1 mean2 mean3 mean4 mean5 mean6
## 0.7444785 0.7673620 0.7654601 0.7802761 0.7838957 0.7900920
#Plot the mean cross-validated accuracies
barplot(m.means, ylim=c(0,1),
names.arg=c("m1.mean", "m2.mean", "m3.mean",
"m4.mean", "m5.mean", "m6.mean"),
main="Proportion matched\nModel predictions VS Observations",
ylab="proportion matched",
col=c(1,2,3,4,5,6))
#How did our cross-validation accuracies compare to using the full data set?
m.means - m.proportions
## mean1 mean2 mean3 mean4 mean5
## -0.004218990 -0.003094672 -0.081306656 -0.082120494 -0.077274996
## mean6
## -0.075982140
It appears that cross-validation underperformed slightly compared to using the full data set.
Just for fun, let's look at the most accurate runs under cross-validation, which are consistently above 80%. Given that cross-validation fits each model on a random 90% of the data and scores it on the held-out 10%, we expect some noise to detract from accuracy. It is interesting that the average discrepancy between the two methods is roughly -0.054, which is close in size to the 10% of the data we held out as a test set, though that may just be a coincidence. Is there a method for calculating the expected error of cross-validation given the sample size and the size of the held-out portion?
max(cross.validate.m1)
## [1] 0.803681
max(cross.validate.m2)
## [1] 0.8128834
max(cross.validate.m3)
## [1] 0.8067485
max(cross.validate.m4)
## [1] 0.8374233
max(cross.validate.m5)
## [1] 0.8343558
max(cross.validate.m6)
## [1] 0.8466258
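As a partial, empirical answer to the question above: the spread of the 100 per-run accuracies shows how much a single random 90/10 split can swing the result, and dividing the standard deviation by sqrt(100) gives a rough sense of the uncertainty in each reported mean (only rough, since the 100 runs reuse the same data and are not independent):
#Run-to-run variability of the cross-validated accuracies
sapply(list(m1=cross.validate.m1, m2=cross.validate.m2,
            m3=cross.validate.m3, m4=cross.validate.m4,
            m5=cross.validate.m5, m6=cross.validate.m6), sd)
#Rough standard error of the reported mean for, e.g., the six-regressor model
sd(cross.validate.m6) / sqrt(length(cross.validate.m6))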