Q1) Logit model to predict dative type with only intercept
library(languageR)
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
levels(dative$RealizationOfRecipient)
## [1] "NP" "PP"
xtabs(~ PronomOfRec + RealizationOfRecipient,
dative)
## RealizationOfRecipient
## PronomOfRec NP PP
## nonpronominal 600 629
## pronominal 1814 220
#Set our response variable to 1 for a PP recipient (R has a default
#factor coding, but I wanted to make the coding explicit)
DV = ifelse(dative$RealizationOfRecipient=="PP", 1,0)
#Intercept only model
m.intercept=glm(DV ~ 1, family=binomial, data=dative)
#The model's single parameter; its inverse logit equals the proportion of
#PP recipients in the RealizationOfRecipient vector
intercept=m.intercept$coefficients[1]
summary(m.intercept)
##
## Call:
## glm(formula = DV ~ 1, family = binomial, data = dative)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7763 -0.7763 -0.7763 1.6409 1.6409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0450 0.0399 -26.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3741.1 on 3262 degrees of freedom
## Residual deviance: 3741.1 on 3262 degrees of freedom
## AIC: 3743.1
##
## Number of Fisher Scoring iterations: 4
#Proportion of PP recipients in the RealizationOfRecipient vector
proportion.pp=(sum(DV)/
length(dative$RealizationOfRecipient))
proportion.pp
## [1] 0.26019
#Inverse logit transformation of the intercept
e=exp(1) #e
invlogtr = function(m) { (e^m)/(1+e^m) }
f_hat = invlogtr(intercept)
f_hat
## (Intercept)
## 0.26019
#Comparison function to account for rounding errors (iff
#abs difference is less than .0000000001)
equality.compare = function(n1, n2) {
return(abs(n1 - n2) < 10^-10)
}
equality.compare(proportion.pp, f_hat)
## (Intercept)
## TRUE
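For what it's worth, base R's plogis() is the same inverse-logit transformation, so it should give an identical answer (just a side check, using only the objects defined above):
#plogis() is the logistic CDF, i.e. the inverse logit
plogis(intercept)
equality.compare(f_hat, plogis(intercept))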
This single coefficient is on the log-odds scale: applying the inverse logit to it gives 0.26019, the model's predicted probability of a PP realization for every case, which matches the observed proportion of PP recipients exactly. The p-value is extremely small, indicating that an estimate this far from zero would be very unlikely if H0 were true (an intercept of 0, i.e. a predicted probability of 0.5).
Q2) Prediction accuracy of the intercept-only model
predictions = predict(m.intercept, type='response')
#Looking at our predictions vector - all the elements are
#equal to the proportion of PP realizations
baseline.level=0
non.baseline.level=1
threshold= ifelse(predictions > .5, non.baseline.level, baseline.level)
proportion.correct = sum(threshold == DV)/length(DV)
proportion.correct
## [1] 0.73981
Since our model predicts a flat probability of about .26 for every case, every prediction falls below the .5 threshold and the model always guesses NP. The proportion correct is therefore simply the proportion of NP realizations, 1 - 0.26019 = 0.73981.
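A quick check of that claim, using only the quantities computed above:
#Every prediction is below .5, so every case is labelled NP (0)
all(threshold == 0)
#Accuracy therefore equals the proportion of NP realizations
1 - proportion.pp
mean(DV == 0)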
Q3) Incorporating Bresnan et al. (2007)
Given the graph on p. 66, which plots the increase in -2 log likelihood (i.e. a decrease in the model's goodness of fit) when a predictor is removed, we would conclude that pronominality of recipient and pronominality of theme have the two largest effects on the model, followed by LogRThmDiff. More simply, removing these regressors would reduce the model's predictive power more substantially than removing any of the other regressors.
If we compare this assessment to the magnitude of the coefficients on pp. 68-69, we see that while pronominality of recipient and theme do tend to have somewhat larger coefficients, (1) they are not the largest, and (2) the differences are far smaller than the graph on p. 66 would suggest. In fact, if we judged predictive importance by coefficient magnitude alone, animacy of recipient and definiteness of theme would appear to be the largest driving forces.
I'm not completely sure why we see this discrepancy. My intuition is that there is a difference between how consistently a regressor covaries with the outcome and something like the "salience" of a particular level of that regressor. For example, pronominality of recipient and of theme might track the DV most reliably across the whole data set, which would give them the largest contributions to model fit. Animacy of recipient might correlate more weakly with the DV overall, yet if, in the (perhaps few) cases where it takes one particular value, the DV were nearly always PP (just as an example), the coefficient for that level would still be large. So there is a tension between how consistently a binary regressor varies with the outcome and how strongly a given level of it shifts the outcome, and that could explain why the deviance plot and the coefficient magnitudes rank the predictors differently.
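One way to ground this intuition in our own data is a drop-in-deviance comparison: drop1() reports how much the residual deviance (-2 log likelihood) rises when each term is removed, which is closer to what the p. 66 plot measures than raw coefficient size. The model below is only a rough stand-in for Bresnan et al.'s full model (no random effects, fewer predictors), and m.sketch and log.ratio are just names for this illustration, so the numbers are illustrative rather than a reproduction:
#Sketch: deviance increase when each predictor is dropped from a simple model
log.ratio = log(dative$LengthOfRecipient) - log(dative$LengthOfTheme)
m.sketch = glm(DV ~ PronomOfRec + PronomOfTheme + log.ratio +
               DefinOfTheme + AnimacyOfRec + DefinOfRec,
               family=binomial, data=dative)
drop1(m.sketch, test="Chisq")  #bigger deviance change = bigger loss of fit when dropped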
Q4) Building a series of models
#Review our predictors
names(dative)
## [1] "Speaker" "Modality"
## [3] "Verb" "SemanticClass"
## [5] "LengthOfRecipient" "AnimacyOfRec"
## [7] "DefinOfRec" "PronomOfRec"
## [9] "LengthOfTheme" "AnimacyOfTheme"
## [11] "DefinOfTheme" "PronomOfTheme"
## [13] "RealizationOfRecipient" "AccessOfRec"
## [15] "AccessOfTheme"
###:-------Sequence of logistic models-----------:###
m1 = glm(DV ~ PronomOfRec, family=binomial, data=dative)
predictions1 = predict(m1, type='response')
threshold1 = ifelse(predictions1 > .5, 1, 0)
prop1=sum(threshold1 == DV)/length(DV)
prop1
## [1] 0.7486975
m2 = glm(DV ~ PronomOfRec + PronomOfTheme, family=binomial, data=dative)
predictions2 = predict(m2, type='response')
threshold2 = ifelse(predictions2 > .5, 1, 0)
prop2=sum(threshold2 == DV)/length(DV)
prop2
## [1] 0.7704566
#Function to compute log(recipient length) - log(theme length), the LogRThmDiff
#predictor, for any set of row indices (so we can reuse it below)
logRthmDiff.fun = function(vec.indices) {
log.len.rec=log(dative[vec.indices, ]$LengthOfRecipient)
log.len.theme=log(dative[vec.indices, ]$LengthOfTheme)
return(log.len.rec - log.len.theme)
}
LogRThmDiff = logRthmDiff.fun(c(1:nrow(dative)))
m3 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff,
family=binomial, data=dative)
predictions3 = predict(m3, type='response')
threshold3 = ifelse(predictions3 > .5, 1, 0)
prop3=sum(threshold3 == DV)/length(DV)
prop3
## [1] 0.8467668
m4 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme,
family=binomial, data=dative)
predictions4 = predict(m4, type='response')
threshold4 = ifelse(predictions4 > .5, 1, 0)
prop4=sum(threshold4 == DV)/length(DV)
prop4
## [1] 0.8623966
m5 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme +
AnimacyOfRec, family=binomial, data=dative)
predictions5 = predict(m5, type='response')
threshold5 = ifelse(predictions5 > .5, 1, 0)
prop5=sum(threshold5 == DV)/length(DV)
prop5
## [1] 0.8611707
m6 = glm(DV ~ PronomOfRec + PronomOfTheme + LogRThmDiff + DefinOfTheme +
AnimacyOfRec + DefinOfRec, family=binomial, data=dative)
predictions6 = predict(m6, type='response')
threshold6 = ifelse(predictions6 > .5, 1, 0)
prop6=sum(threshold6 == DV)/length(DV)
prop6
## [1] 0.8660742
#Vector for each model accuracy
m.proportions=c(prop1=prop1, prop2=prop2, prop3=prop3,
prop4=prop4, prop5=prop5, prop6=prop6)
m.proportions
## prop1 prop2 prop3 prop4 prop5 prop6
## 0.7486975 0.7704566 0.8467668 0.8623966 0.8611707 0.8660742
#plot the proportions here
barplot(m.proportions, ylim=c(0,1),
names.arg=c("m1.prop", "m2.prop", "m3.prop",
"m4.prop", "m5.prop", "6.prop"),
main="Proportion matched\nModel predictions VS Observations",
ylab="proportion matched",
col=c(1,2,3,4,5,6))
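As an aside, the six fit-predict-score steps above differ only in their formula, so they could be collapsed into a small helper; fit.and.score below is just a sketch of that idea (the name is mine, not part of the assignment):
#Sketch of a reusable fit/predict/score helper
fit.and.score = function(formula) {
  m = glm(formula, family=binomial, data=dative)
  predicted = ifelse(predict(m, type='response') > .5, 1, 0)
  sum(predicted == DV) / length(DV)
}
fit.and.score(DV ~ PronomOfRec + PronomOfTheme)  #should reproduce prop2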
Q5) Model performance assessed with cross-validation
#####:---------model with a single regressor IV-----------:###
cross.validate.m1=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m1.mean = mean(cross.validate.m1)
#####:---------model with 2 regressors-----------:###
cross.validate.m2=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m2.mean = mean(cross.validate.m2)
#####:---------model with 3 regressors-----------:###
cross.validate.m3=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]]  #use the test item's log length difference
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m3.mean = mean(cross.validate.m3)
#####:---------model with 4 regressors-----------:###
cross.validate.m4=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m4.mean = mean(cross.validate.m4)
#####:---------model with 5 regressors-----------:###
cross.validate.m5=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme + AnimacyOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0) +
coefs[6] * ifelse(test.set$AnimacyOfRec[x] == "inanimate", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m5.mean = mean(cross.validate.m5)
#####:---------model with 6 regressors-----------:###
cross.validate.m6=sapply(1:100, function(i) {
N = length(dative$RealizationOfRecipient)
#Test set
test.set.size = floor(N/10)
test.indices = sample(1:N, size=test.set.size)
test.set = dative[test.indices,]
#Training set
training.indices = (1:N)[-test.indices] #all indices not in test set
training.set = dative[training.indices,]
#Recalculate log differences in length
LogRthmDiff = logRthmDiff.fun(training.indices)
#logit model on training data
m = glm(RealizationOfRecipient ~ PronomOfRec + PronomOfTheme +
LogRthmDiff + DefinOfTheme + AnimacyOfRec + DefinOfRec,
family=binomial, data=training.set)
#Coefficients
coefs = coef(m)
#Our inverse logit function
invlogit = function(x) { exp(x)/(1 + exp(x)) }
#Predicted probability for test item x (inverse logit of the linear predictor)
f.hat = function(x) {
return(invlogit(
coefs[1] +
coefs[2] * ifelse(test.set$PronomOfRec[x] == "pronominal", 1, 0) +
coefs[3] * ifelse(test.set$PronomOfTheme[x] == "pronominal", 1, 0) +
coefs[4] * LogRThmDiff[test.indices[x]] +
coefs[5] * ifelse(test.set$DefinOfTheme[x] == "indefinite", 1, 0) +
coefs[6] * ifelse(test.set$AnimacyOfRec[x] == "inanimate", 1, 0) +
coefs[7] * ifelse(test.set$DefinOfRec[x] == "indefinite", 1, 0)
))
}
true.DV = ifelse(test.set$RealizationOfRecipient == "PP", 1, 0)
test.predictions = rep(NA, test.set.size)
for (i in 1:test.set.size) {
test.predictions[i] = f.hat(i)
}
test.thresholded = ifelse(test.predictions > .5, 1, 0)
proportion.correct = sum(test.thresholded == true.DV) / test.set.size
return(proportion.correct)
})
m6.mean = mean(cross.validate.m6)
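The six cross-validation blocks above likewise differ only in their formula. A more compact version might look like the sketch below, where predict() with newdata computes the test-set probabilities directly from the fitted model, so the coefficients never have to be multiplied out by hand (cv.once and dat are names I introduce here; dat simply bundles DV and LogRThmDiff with dative):
#Sketch: one random 90/10 split, returning test-set accuracy for a given formula
dat = dative
dat$DV = DV
dat$LogRThmDiff = LogRThmDiff
cv.once = function(formula) {
  N = nrow(dat)
  test.indices = sample(1:N, size=floor(N/10))
  m = glm(formula, family=binomial, data=dat[-test.indices, ])
  probs = predict(m, newdata=dat[test.indices, ], type="response")
  mean(ifelse(probs > .5, 1, 0) == dat$DV[test.indices])
}
mean(sapply(1:100, function(i) cv.once(DV ~ PronomOfRec + PronomOfTheme)))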
#Looking at our cross-validation results for the models
m.means=c(mean1=m1.mean, mean2=m2.mean, mean3=m3.mean,
mean4=m4.mean, mean5=m5.mean, mean6=m6.mean)
m.means
## mean1 mean2 mean3 mean4 mean5 mean6
## 0.7444785 0.7673620 0.7654601 0.7802761 0.7838957 0.7900920
#Plot the mean cross-validated accuracies
barplot(m.means, ylim=c(0,1),
names.arg=c("m1.mean", "m2.mean", "m3.mean",
"m4.mean", "m5.mean", "m6.mean"),
main="Proportion matched\nModel predictions VS Observations",
ylab="proportion matched",
col=c(1,2,3,4,5,6))
#How did our cross-validation accuracies compare to using the full data set?
m.means - m.proportions
## mean1 mean2 mean3 mean4 mean5
## -0.004218990 -0.003094672 -0.081306656 -0.082120494 -0.077274996
## mean6
## -0.075982140
It appears that cross-validation underperformed slightly compared to using the full data set.
Just for fun, let's look at the most accurate runs under cross-validation, which are consistently above 80%. Given that cross-validation fits each model on a random 90% of the data and scores it on the held-out 10%, we expect some noise to detract from accuracy. It is interesting that the average discrepancy between the two methods is roughly -0.054, which is close in size to the 10% of the data we held out as a test set, though that may just be a coincidence. Is there a method for calculating the expected error of cross-validation given the sample size and the size of the held-out portion?
max(cross.validate.m1)
## [1] 0.803681
max(cross.validate.m2)
## [1] 0.8128834
max(cross.validate.m3)
## [1] 0.8067485
max(cross.validate.m4)
## [1] 0.8374233
max(cross.validate.m5)
## [1] 0.8343558
max(cross.validate.m6)
## [1] 0.8466258
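As a partial, empirical answer to the question above: the spread of the 100 per-run accuracies shows how much a single random 90/10 split can swing the result, and dividing the standard deviation by sqrt(100) gives a rough sense of the uncertainty in each reported mean (only rough, since the 100 runs reuse the same data and are not independent):
#Run-to-run variability of the cross-validated accuracies
sapply(list(m1=cross.validate.m1, m2=cross.validate.m2,
            m3=cross.validate.m3, m4=cross.validate.m4,
            m5=cross.validate.m5, m6=cross.validate.m6), sd)
#Rough standard error of the reported mean for, e.g., the six-regressor model
sd(cross.validate.m6) / sqrt(length(cross.validate.m6))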