• See website for how to submit your answers and how feedback is organized
• Get experience with the mathematical structure of the logit model • Get experience with the interpretation of parameters of the logit model
Consider again the application in lecture 5.5, where we have analyzed response to a direct mailing using the following logit specification Pr[respi = 1] = exp(β0 + β1malei + β2activei + β3agei + β4(agei/10)2)1 + exp(β0 + β1malei + β2activei + β3agei + β4(agei/10)2) for i = 1, . . . , 925.
The maximum likelihood estimates of the parameters are given by
Variable Coefficient Std. Error t-value p-value Intercept -2.488 0.890 -2.796 0.005 Male 0.954 0.158 6.029 0.000 Active 0.914 0.185 4.945 0.000 Age 0.070 0.036 1.964 0.050 (Age/10)2 -0.069 0.034 -2.015 0.044
So loading the data and fiiting the model next:
library(readxl)
library(ggplot2)
library(dplyr)
df <- read_excel("Data Lecture 5-5.xls", col_names = TRUE)
df <- df%>% mutate(age2 = (age/10)^2)
head(df)
## # A tibble: 6 x 6
## `repond id` response male activity age age2
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0 0 58 33.6
## 2 2 1 1 0 50 25
## 3 3 1 1 0 40 16
## 4 4 1 1 0 36 13.0
## 5 5 1 1 0 28 7.84
## 6 6 1 1 0 70 49
So fitting the model:
logit <- glm(df$response ~ df$male + df$activity + df$age +df$age2,
family = binomial(link = "logit"))
summary(logit)
##
## Call:
## glm(formula = df$response ~ df$male + df$activity + df$age +
## df$age2, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6926 -1.2156 0.7389 1.0982 1.8473
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.48836 0.88999 -2.796 0.00517 **
## df$male 0.95369 0.15818 6.029 1.65e-09 ***
## df$activity 0.91375 0.18478 4.945 7.61e-07 ***
## df$age 0.06995 0.03561 1.964 0.04948 *
## df$age2 -0.06869 0.03410 -2.015 0.04394 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1282.1 on 924 degrees of freedom
## Residual deviance: 1203.7 on 920 degrees of freedom
## AIC: 1213.7
##
## Number of Fisher Scoring iterations: 4
library(pscl)
## Warning: package 'pscl' was built under R version 4.0.2
LogLikelihood <- pR2(logit)['llh']
## fitting null model for pseudo-r2
Likelihood <- exp(pR2(logit)['llh'])
## fitting null model for pseudo-r2
McFaddenR2 <- pR2(logit)['McFadden']
## fitting null model for pseudo-r2
NagelkerkeR2 <- pR2(logit)['r2CU']
## fitting null model for pseudo-r2
library(fmsb)
NagelkerkeR2 <- NagelkerkeR2(logit)$R2
NagelkerkeR2
## [1] 0.1083013
So showing that :
# logit odd ratios
exp(logit$coefficients)
## (Intercept) df$male df$activity df$age df$age2
## 0.08304624 2.59527990 2.49365102 1.07244947 0.93361419
#regression marginal effects:
ols <- lm(df$response ~ df$male + df$activity + df$age +df$age2)
ols$coefficients
## (Intercept) df$male df$activity df$age df$age2
## -0.06088779 0.22400196 0.20826836 0.01549425 -0.01520940
#logit average marginal effects:
log_scal <- mean(dlogis(predict(logit, type = "link")))
log_scal*coef(logit)
## (Intercept) df$male df$activity df$age df$age2
## -0.57116043 0.21890442 0.20973538 0.01605475 -0.01576709
# Predicting probabilities logit
polsreg <- predict(logit, type = "response")
summary(polsreg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1404 0.4265 0.5403 0.5081 0.5595 0.7613
#matrix accuracy
logit_matrix <- table(True = df$response, predi = round(fitted(logit)))
round(prop.table(logit_matrix),2)
## predi
## True 0 1
## 0 0.21 0.28
## 1 0.10 0.40
# R2 McFadden
logi0 <- update(logit, formula. = df$response ~ 1)
R2mc <-1- as.vector(logLik(logit)/logLik(logi0))
R2mc
## [1] 0.06111504
With that information we can say:
# RPTA.
# ∂Pr[respi=1]/∂agei+∂Pr[respi=0]/∂agei=0
#∂Pr[respi=1]∂agei+∂Pr[respi=0]∂agei = ∂Pr[respi=1]∂agei + ∂(1−Pr[respi=1])∂agei
# = ∂Pr[respi=1]∂agei + −∂(Pr[respi=1])∂agei
# = ∂Pr[respi=1]∂agei − ∂(Pr[respi=1])∂agei
# = 0
The odd ratios was:
# logit odd ratios
exp(logit$coefficients)
## (Intercept) df$male df$activity df$age df$age2
## 0.08304624 2.59527990 2.49365102 1.07244947 0.93361419
So the condiction:
#The condition respinew=−respi+1 means that the odds ratio of each new response is inversed, or:
# Pr[respinew=1]/Pr[respinew=0] = 1 /( Pr[respi=1]/Pr[respi=0] ) = Pr[respi=0]/Pr[respi=1]
# ⇒ Pr[respinew=0]/Pr[respinew=1] = Pr[respi=1]Pr[respi=0]
And the odds ratio formula is given as:
#Pr[respi=1]/Pr[respi=0] = exp(β0+∑j=2kβj∗xji)
# = exp(β0+β1∗malei+β2∗activei+β3∗agei+β4∗(agei/10))2
#⇒Pr[respi=0]Pr[respi=1] = 1/ exp(β0+∑j=2kβj∗xji)
# =1 / exp(β0+β1∗malei+β2∗activei+β3∗agei+β4∗(agei/10)2)
So finally:
# Pr[respinew=1]/Pr[respinew=0] = Pr[respi=0]/Pr[respi=1]
# = 1 / exp(β0+∑j=2kβj∗xji)
# =exp(−β0−∑j=2kβj∗xji))
# =exp(−β0−β1∗malei−β2∗activei−β3∗agei−β4∗(agei/10)2)
Where the transformation indeed implies that the sign of all parameters change.
During lecture 5.5 you have seen that this odds ratio obtains its maximum value for age equal to 50 years for males as well as females. Suppose now that you want to extend the logit model and allow that this age valueis possibly different for males than for females. Discuss how you can extend the logit specification.
Recording that the marginal effects are:
#regression marginal effects:
ols <- lm(df$response ~ df$male + df$activity + df$age +df$age2)
ols$coefficients
## (Intercept) df$male df$activity df$age df$age2
## -0.06088779 0.22400196 0.20826836 0.01549425 -0.01520940
#logit average marginal effects:
log_scal <- mean(dlogis(predict(logit, type = "link")))
log_scal*coef(logit)
## (Intercept) df$male df$activity df$age df$age2
## -0.57116043 0.21890442 0.20973538 0.01605475 -0.01576709
# RPTA
#One way is to split the sample in two (male−only and female−only) groups, and estimate the logit model seperately for each group.
#In this case, the remaining activity variable parameter β2 could also differ for each one of the groups and should be estimated seperately.
# Another way is to modify the logit model in order to use combinations of male and age variables (and related parameters) and leave the sample united as is.
# For example, we could add the variables malei∗agei and malei∗(agei/10)2 to the logit specification, thus extending it by 2 more variables with parameters β5 and β6.
# Then, the model odds ratio would be:
# Pr[respi=1]/Pr[respi=0] = exp(β0+β1∗malei+β2∗activei+β3∗agei+β4∗(agei/10)2+β5∗malei∗agei+β6∗malei∗(agei/10)2)
#Or, we could replace the age related variables and parameters
# β3∗agei+β4∗(agei/10)2
#with their male/female related combinations:
# β3∗malei∗agei+β4∗malei∗(agei/10)2+β5∗(1−malei)∗agei+β6∗(1−malei)∗(agei/10)2
Then, the model odds ratio would be:
# Pr[respi=1]/ Pr[respi=0]
# =exp(β0+β1∗malei+β2∗activei+β3∗malei∗agei+β4∗malei∗(agei/10)2
# +β5∗(1−malei)∗agei+β6∗(1−malei)∗(agei/10)2).
These second way variations would allow for similar flexibility to study on the age effect throught the different variables and parameters they incorporate.
# Predicting the errors:
#matrix accuracy
logit_matrix <- table(True = df$response, predi = round(fitted(logit)))
round(prop.table(logit_matrix),2)
## predi
## True 0 1
## 0 0.21 0.28
## 1 0.10 0.40
plot(residuals.glm(logit))