Levi Waldron
dplyr
and ggplot2
Systematic component:
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
Systematic plus random component:
\(y_i = E[y|x] + \epsilon_i\)
\(y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon_i\)
Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
The model: \(y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)
Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
The model: \[ Logit(P(x)) = log \left( \frac{P(x)}{1-P(x)} \right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Random component: \(y_i\) follows a Binomial distribution (outcome is a binary variable)
Systematic component: linear predictor \[ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Link function: logit (log of the odds that the event occurs)
\[ g(P(x)) = logit(P(x)) = log\left( \frac{P(x)}{1-P(x)} \right) \]
\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
logit <- function(P) log(P/(1-P))
plot(logit, xlab="Probability", ylab="Log-odds",
cex.lab=1.5, cex.axis=1.5)
invLogit <- function(x) 1/(1+exp(-x))
From http://data.princeton.edu/wws509/datasets/#cuse
## age education wantsMore notUsing using
## <25 :4 high:8 no :8 Min. : 8.00 Min. : 4.00
## 25-29:4 low :8 yes:8 1st Qu.: 31.00 1st Qu.: 9.50
## 30-39:4 Median : 56.50 Median :29.00
## 40-49:4 Mean : 68.75 Mean :31.69
## 3rd Qu.: 85.75 3rd Qu.:49.00
## Max. :212.00 Max. :80.00
No interactions:
fit1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
data=cuse, family=binomial("logit"))
summary(fit1)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
## family = binomial("logit"), data = cuse)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5148 -0.9376 0.2408 0.9822 1.7333
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8082 0.1590 -5.083 3.71e-07 ***
## age25-29 0.3894 0.1759 2.214 0.02681 *
## age30-39 0.9086 0.1646 5.519 3.40e-08 ***
## age40-49 1.1892 0.2144 5.546 2.92e-08 ***
## educationlow -0.3250 0.1240 -2.620 0.00879 **
## wantsMoreyes -0.8330 0.1175 -7.091 1.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 29.917 on 10 degrees of freedom
## AIC: 113.43
##
## Number of Fisher Scoring iterations: 4
Take the difference between observed and fitted values (on probability scale 0-1), and divide by the standard deviation of the observed value.
\[ r_i = \frac{y_i - \hat y_i}{ \sqrt{ Var(\hat y_i) }} \]
Summing the squared Pearson residuals produces the Pearson Chi-squared statistic:
\[ \chi ^2 = \sum_i r_i^2 \]
\[ d_i = s_i \sqrt{ -2 ( y_i \log \hat y_i + (1-y_i) \log (1 - \hat y_i) ) } \]
Where \(s_i = 1\) if \(y_i = 1\) and \(s_i = -1\) if \(y_i = 0\).
\(\Delta (\textrm{D}) = -2 * \Delta (\textrm{log likelihood})\)
fit0 <- glm(cbind(using, notUsing) ~ -1, data=cuse,
family=binomial("logit"))
anova(fit0, fit1, test="LRT")
## Analysis of Deviance Table
##
## Model 1: cbind(using, notUsing) ~ -1
## Model 2: cbind(using, notUsing) ~ age + education + wantsMore
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 16 389.85
## 2 10 29.92 6 359.94 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Wald test confidence intervals on coefficients can provide poor coverage in some cases, even with relatively large samples
mutate
, group_by
, and summarize
fit1
, write on paper the model for expected probability of using birth control. Don’t forget the logit function.fit1
, what is the expected probability of an individual 25-29 years old, with high education, who wants more children, using birth control? Calculate it manually, and using predict(fit1)
fit1
: Relative to women under 25 who want to have children, what is the predicted increase in odds that a woman 40-49 years old who does not want to have children will be taking birth control?fit1
(no interactions)?