Using the data from the 2006 GSS (gss.dta), develop a logit model with abort as the outcome variable. The variables in the dataset are as follows:
## Variables in Dataset ##
# abort == binary, supports abortion if woman wants for any reason
# attend == categorical, religious attendance: 0 (never) to 8 (more than once a week)
# premarsex == binary, approves of premarital sex
# income == 25 category measure of family income
# bible == 3-category, word of god (1), inspired word (2), book of fables (3)
# libcon == categorical, 7 categories from extremely liberal (1) to extremely conservative (7)
# socbar == socialize in bar daily / several times a week
# educ == years of education
Start from a simple single-predictor model and incrementally add complexity until you have a model with 4-6 independent variables. Make sure at least one independent variable is a dummy variable and at least one is continuous or ordinal. As you build up the model, comment briefly on how the addition of each parameter affects model fit. You are more than welcome to try including a predictor and then decide not to keep it in; just justify why you made that decision.
## Load Data and Libraries ##
library(foreign)
library(arm)
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lattice
## Loading required package: lme4
##
## arm (Version 1.6-09, built: 2013-9-23)
##
## Working directory is /Users/squishy/Dropbox/PSCI 7108/Assignment 6
gss <- read.dta("/Users/squishy/Dropbox/PSCI 7108/Assignment 6/gss.dta")
str(gss)
## 'data.frame': 769 obs. of 9 variables:
## $ abort : num 0 1 1 1 0 0 1 1 0 1 ...
## $ attend : num 1 4 0 6 2 7 1 5 0 7 ...
## $ bible : num 1 2 3 3 1 2 2 2 1 1 ...
## $ libcon : int 4 5 2 2 6 3 4 3 2 2 ...
## $ premarsex: num 0 1 1 1 1 1 1 1 1 0 ...
## $ sexeduc : num 0 1 1 1 0 1 1 1 1 1 ...
## $ socbar : int 1 1 1 2 1 1 1 1 1 1 ...
## $ income : int 4 12 11 22 15 8 19 22 10 8 ...
## $ educ : int 9 12 11 16 12 12 13 18 16 16 ...
## - attr(*, "datalabel")= chr "Written by R. "
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 100 108 100 100 108 108 108
## - attr(*, "val.labels")= chr "" "" "" "" ...
## - attr(*, "var.labels")= chr "abort" "attend" "bible" "libcon" ...
## - attr(*, "version")= int 7
sum(!complete.cases(gss)) # Double check observations are complete
## [1] 0
## Model Building ##
# Create matrix for AIC and BIC comparison between models
compare <- matrix(data = NA, nrow = 2, ncol = 4)
colnames(compare) <- c("m1", "m2", "m3", "m4")
rownames(compare) <- c("AIC", "BIC")
# m1 - premarsex. Prediction: those who approve of premarital sex will also
# support a woman's choice to abort
m1 <- glm(abort ~ premarsex, family = binomial(link = "logit"), data = gss) # For the z-test of the b-parameter: the null hypothesis is that b = 0, meaning there is no significant relationship between support for abortion and approval of premarital sex
summary(m1) # In the single-predictor model, this predictor appears significant: the z-value is more than 2 standard errors from zero and the p-value is well below 0.05, so we reject the null and conclude the relationship is unlikely to be due to chance
##
## Call:
## glm(formula = abort ~ premarsex, family = binomial(link = "logit"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.259 -0.787 -0.787 1.098 1.627
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.013 0.113 -8.94 < 2e-16 ***
## premarsex 1.203 0.154 7.81 5.9e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1036.22 on 768 degrees of freedom
## Residual deviance: 972.35 on 767 degrees of freedom
## AIC: 976.4
##
## Number of Fisher Scoring iterations: 4
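As a quick substantive check (a sketch using arm's invlogit; numbers are rounded from the summary above), the m1 coefficients translate into predicted probabilities and an odds ratio:
invlogit(-1.013) # ~0.27: predicted probability of support when premarsex = 0
invlogit(-1.013 + 1.203) # ~0.55: predicted probability of support when premarsex = 1
exp(1.203) # ~3.33: approving of premarital sex multiplies the odds of supporting abortion choice by about 3.3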
compare[1, 1] <- AIC(m1) # Cool, matches the AIC calculated by summary()
compare[2, 1] <- BIC(m1)
# m2 - premarsex + educ. Prediction: people with more education will support
# abortion choice
m2 <- glm(abort ~ premarsex + educ, family = binomial(link = "logit"), data = gss) # For the z-test of the b-parameter: the null hypothesis is that b = 0, meaning there is no significant relationship between the predictor and the outcome variable
summary(m2) # p-values for all predictors are statistically significant, so we reject the null for each and conclude the estimated relationships are unlikely to be due to chance
##
## Call:
## glm(formula = abort ~ premarsex + educ, family = binomial(link = "logit"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.642 -0.898 -0.711 1.083 2.426
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.8877 0.3992 -7.23 4.7e-13 ***
## premarsex 1.1991 0.1570 7.64 2.2e-14 ***
## educ 0.1368 0.0273 5.00 5.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1036.2 on 768 degrees of freedom
## Residual deviance: 944.9 on 766 degrees of freedom
## AIC: 950.9
##
## Number of Fisher Scoring iterations: 4
compare[1, 2] <- AIC(m2)
compare[2, 2] <- BIC(m2)
compare # AIC and BIC for m2 are lower than m1, so this model (and additional predictor) improves model fit
## m1 m2 m3 m4
## AIC 976.4 950.9 NA NA
## BIC 985.6 964.8 NA NA
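As a sanity check on these numbers (a hedged sketch; k = 3 counts the intercept plus two slopes, and n = 769), both criteria can be reproduced from the residual deviance:
deviance(m2) + 2 * 3 # AIC = deviance + 2k = 944.9 + 6 = 950.9
deviance(m2) + 3 * log(769) # BIC = deviance + k * log(n) = 944.9 + 19.9 = 964.8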
# m3 - premarsex + educ + attend. Prediction: people who attend religious
# events are less likely to support abortion choice
m3 <- glm(abort ~ premarsex + educ + attend, family = binomial(link = "logit"),
data = gss) # For the z-test of the b-parameter: the null hypothesis is that b = 0, meaning there is no significant relationship between the predictor and the outcome variable
summary(m3) # p-values for all predictors are statistically significant, so we reject the null for each and conclude the estimated relationships are unlikely to be due to chance
##
## Call:
## glm(formula = abort ~ premarsex + educ + attend, family = binomial(link = "logit"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.948 -0.912 -0.585 1.055 2.465
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3995 0.4205 -5.71 1.2e-08 ***
## premarsex 0.8590 0.1674 5.13 2.9e-07 ***
## educ 0.1638 0.0293 5.58 2.3e-08 ***
## attend -0.1967 0.0316 -6.22 5.0e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1036.22 on 768 degrees of freedom
## Residual deviance: 904.39 on 765 degrees of freedom
## AIC: 912.4
##
## Number of Fisher Scoring iterations: 4
compare[1, 3] <- AIC(m3)
compare[2, 3] <- BIC(m3)
compare # AIC and BIC for m3 are lower than m2, so this model (and additional predictor) improves model fit
## m1 m2 m3 m4
## AIC 976.4 950.9 912.4 NA
## BIC 985.6 964.8 931.0 NA
# m4 - premarsex + educ + attend + income. Income can be iffy, but given
# that income is often tied to education, prediction: people who have higher
# incomes are more likely to support abortion choice
m4 <- glm(abort ~ premarsex + educ + attend + income, family = binomial(link = "logit"),
data = gss) # For the z-test of the b-parameter: the null hypothesis is that b = 0, meaning there is no significant relationship between the predictor and the outcome variable
summary(m4) # Hmm, the z- and p-values indicate that income is not statistically significant at the 0.05 level, so we fail to reject the null and conclude there is no clear relationship between income and support for abortion choice
##
## Call:
## glm(formula = abort ~ premarsex + educ + attend + income, family = binomial(link = "logit"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.963 -0.918 -0.575 1.037 2.464
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6412 0.4429 -5.96 2.5e-09 ***
## premarsex 0.8080 0.1699 4.75 2.0e-06 ***
## educ 0.1465 0.0307 4.78 1.8e-06 ***
## attend -0.2028 0.0320 -6.34 2.3e-10 ***
## income 0.0293 0.0163 1.80 0.072 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1036.22 on 768 degrees of freedom
## Residual deviance: 901.12 on 764 degrees of freedom
## AIC: 911.1
##
## Number of Fisher Scoring iterations: 4
compare[1, 4] <- AIC(m4)
compare[2, 4] <- BIC(m4)
compare # AIC is lower, but BIC (more stringent) is higher; this split between AIC and BIC makes sense since the p-value for the income parameter is ALMOST significant at 0.07; since m3 and m4 are nested, I can also run the LR test
## m1 m2 m3 m4
## AIC 976.4 950.9 912.4 911.1
## BIC 985.6 964.8 931.0 934.3
lr <- as.numeric(-2 * (logLik(m3) - logLik(m4))) # LR statistic: -2 times the difference between the log-likelihoods of the restricted and unrestricted models
lr
## [1] 3.277
p.val <- 1 - pchisq(lr, df = 1) # The null hypothesis is that the restricted model (m3) fits as well as m4; df = 1 because m4 adds one parameter
p.val # The p-value is not statistically significant, so we fail to reject the null and conclude the restricted model, m3, is adequate
## [1] 0.07025
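The same test is available in one line via anova(); this is an equivalent cross-check of the hand computation above, not a different test:
anova(m3, m4, test = "Chisq") # Reports the same deviance difference (3.277 on 1 df) and p-value (0.070)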
# Based on the above tests, everything except AIC says that m3 provides a
# better model fit than m4, so it appears income is not a significant
# predictor of abort; drop income
# m5 - premarsex + educ + attend + libcon. libcon runs from extremely liberal
# (1) to extremely conservative (7); tentative prediction: more conservative
# respondents are less likely to support abortion choice
m5 <- glm(abort ~ premarsex + educ + attend + libcon, family = binomial(link = "logit"),
data = gss) # For the z-test of the b-parameter: the null hypothesis is that b = 0, meaning there is no significant relationship between the predictor and the outcome variable
summary(m5) # p-values for all predictors are statistically significant, so we reject the null for each and conclude the estimated relationships are unlikely to be due to chance
##
## Call:
## glm(formula = abort ~ premarsex + educ + attend + libcon, family = binomial(link = "logit"),
## data = gss)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.845 -0.912 -0.562 1.003 2.506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1872 0.4913 -2.42 0.016 *
## premarsex 0.7317 0.1712 4.27 1.9e-05 ***
## educ 0.1559 0.0295 5.29 1.2e-07 ***
## attend -0.1821 0.0322 -5.66 1.5e-08 ***
## libcon -0.2692 0.0599 -4.49 7.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1036.22 on 768 degrees of freedom
## Residual deviance: 883.56 on 764 degrees of freedom
## AIC: 893.6
##
## Number of Fisher Scoring iterations: 4
compare <- cbind(compare, NA) # Add one more column for m5
colnames(compare) <- c("m1", "m2", "m3", "m4", "m5")
compare[1, 5] <- AIC(m5)
compare[2, 5] <- BIC(m5)
compare # AIC and BIC for m5 are lower than for m3, so adding libcon improves model fit; m5 is the final model
## m1 m2 m3 m4 m5
## AIC 976.4 950.9 912.4 911.1 893.6
## BIC 985.6 964.8 931.0 934.3 916.8
1. Produce and interpret a binned residual plot for the fitted values and binned residual plots for at least two independent variables (i.e., three graphs in total). Are there any problems? If so, how can they be fixed? If it seems that doing a log transform or including a quadratic term would help, do so and re-do the plot to examine whether it helped or not.
## Binned Residuals Function ##
# Bin x into roughly equal-sized bins and summarize the residuals y within each bin
binned.resids <- function(x, y, nclass = sqrt(length(x))) {
    breaks.index <- floor(length(x) * (1:(nclass - 1))/nclass)
    breaks <- unique(c(-Inf, sort(x)[breaks.index], Inf))
    output <- NULL
    x.binned <- as.numeric(cut(x, breaks))
    for (i in 1:nclass) {
        items <- (1:length(x))[x.binned == i]  # Observations falling in bin i
        x.range <- range(x[items])
        xbar <- mean(x[items])  # Bin mean of x
        ybar <- mean(y[items])  # Bin mean residual
        n <- length(items)
        sdev <- sd(y[items])
        output <- rbind(output, c(xbar, ybar, n, x.range, 1.96 * sdev/sqrt(n)))
    }
    colnames(output) <- c("xbar", "ybar", "n", "x.lo", "x.hi", "2se")
    return(na.omit(output))  # Empty bins produce NA rows; drop them
}
## Binned Residuals for Fitted Values ##
fv.m5 <- m5$fitted.values # Fitted values
y.m5 <- m5$y # The dependent variable
# Binned residuals
br.m5.fv <- binned.resids(fv.m5, y.m5 - fv.m5, nclass = 25) # Fitted values, residuals, # of bins
# Plot fitted values vs. average residual
plot(br.m5.fv[, 1], br.m5.fv[, 2], ylim = c(-0.2, 0.2), pch = 19, xlab = "Fitted Values",
ylab = "Average Residual", main = "Binned Residual Plot for Fitted Values") # Each dot represents one bin, lowest dot is 1st line of br.m5.fv, what we're looking for is randomness
lines(br.m5.fv[, 1], br.m5.fv[, 6], lwd = 2, col = "gray50") # 2SE upper-bound CI; expect about 5% of dots out of CI; if a bunch of dots are outside of the CI lines, then we have really large residuals
lines(br.m5.fv[, 1], -br.m5.fv[, 6], lwd = 2, col = "gray50") # Lower-bound CI
abline(h = 0, lwd = 2, lty = 3)
The binned residuals for the fitted values look good: most fall within the \( \pm1.96 \) standard-error bounds and are randomly scattered around zero.
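That impression can be quantified with a quick check (a minimal sketch using the br.m5.fv matrix computed above): count the share of bins whose average residual falls outside its \( \pm \)2SE bound.
outside <- abs(br.m5.fv[, "ybar"]) > br.m5.fv[, "2se"] # TRUE for bins beyond the bound
mean(outside) # Should be near 0.05 if the model is well calibrated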
## Binned Residuals for Education ##
options(warn = 1) # Print warnings as they occur (warn = 2 would turn them into errors and halt)
br.m5.educ <- binned.resids(m5$model$educ, y.m5 - fv.m5, nclass = 20) # Education levels, residuals, # of bins; ties in educ leave 10 of the 20 bins empty, so they warn and are dropped
## Warning: no non-missing arguments to min; returning Inf
## Warning: no non-missing arguments to max; returning -Inf
## (the warning pair above repeats once for each of the 10 empty bins)
plot(br.m5.educ[, 1], br.m5.educ[, 2], ylim = c(-0.2, 0.2), pch = 19, xlab = "Education Levels (years)",
ylab = "Average Residual", main = "Binned Residual Plot for Education Levels")
lines(br.m5.educ[, 1], br.m5.educ[, 6], lwd = 2, col = "gray50")
lines(br.m5.educ[, 1], -br.m5.educ[, 6], lwd = 2, col = "gray50")
abline(h = 0, lwd = 2, lty = 3)
The binned residuals for educ (education levels) look pretty good. All the binned residuals fall within the \( \pm1.96 \) standard-error bounds and appear largely randomly distributed around the x-axis.
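If this plot had instead shown a U-shape or other systematic curvature, the natural fix would be a quadratic term in educ. A hedged sketch of that refit (not needed here, since the plot looks fine):
m5.quad <- glm(abort ~ premarsex + educ + I(educ^2) + attend + libcon,
    family = binomial(link = "logit"), data = gss) # Hypothetical refit with a squared education term
# Recompute and re-plot the binned residuals against educ to see whether the curvature flattens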
## Binned Residuals for Religious Attendance ##
br.m5.attend <- binned.resids(m5$model$attend, y.m5 - fv.m5, nclass = 20) # Religious attendance, residuals, # of bins; attend has only 9 distinct values (0-8), so several bins are empty and dropped
## Warning: no non-missing arguments to min; returning Inf
## Warning: no non-missing arguments to max; returning -Inf
## (the warning pair above repeats once for each of the 11 empty bins)
plot(br.m5.attend[, 1], br.m5.attend[, 2], ylim = c(-0.15, 0.15), pch = 19,
    xlab = "Religious Attendance (0 = never to 8 = more than weekly)", ylab = "Average Residual",
    main = "Binned Residual Plot for Religious Attendance")
lines(br.m5.attend[, 1], br.m5.attend[, 6], lwd = 2, col = "gray50")
lines(br.m5.attend[, 1], -br.m5.attend[, 6], lwd = 2, col = "gray50")
abline(h = 0, lwd = 2, lty = 3)
The binned residuals for attend (religious attendance) look pretty good; quite a few of the binned residuals sit right on the x-axis. Most bins fall within the \( \pm1.96 \) standard-error bounds and appear largely randomly distributed around the x-axis.
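Since attend takes only nine distinct values (0-8), requesting 20 bins guarantees empty bins. A sketch of a cleaner call, requesting roughly one bin per category (this reduces, though may not fully eliminate, the empty-bin warnings):
br.m5.attend9 <- binned.resids(m5$model$attend, y.m5 - fv.m5, nclass = 9) # About one bin per attendance category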
2. Compute the error rates for the null model and for the final model specified. Interpret the findings.
# Look at a few observations: observed outcome vs. fitted probability
cbind(y.m5, fv.m5)[c(1:20, 769), ] # Note some sharp mismatches, e.g., rows 9 and 19
## y.m5 fv.m5
## 1 0 0.2606
## 2 1 0.3409
## 3 1 0.6728
## 4 1 0.6006
## 5 0 0.3626
## 6 0 0.3392
## 7 1 0.5774
## 8 1 0.6531
## 9 0 0.8177
## 10 1 0.3762
## 11 1 0.7989
## 12 0 0.1587
## 13 1 0.6452
## 14 1 0.5455
## 15 1 0.5838
## 16 1 0.4029
## 17 0 0.2441
## 18 0 0.1980
## 19 1 0.1636
## 20 0 0.8177
## 769 1 0.2148
# Compute the error rates
er.null <- 1 - mean(y.m5) # Error rate of the null model: with a binary outcome, mean(y) is the overall probability of success (Pr(y = 1)), so 1 - mean(y) is how often we would guess wrong if we always guessed success
er.null # Higher than 0.5, so the null model should guess failure (0) instead; that error rate is mean(y)
## [1] 0.5982
er.null <- mean(y.m5) # Error rate when we always guess y = 0
er.null
## [1] 0.4018
er.m5 <- mean((fv.m5 >= 0.5 & y.m5 == 0) | (fv.m5 < 0.5 & y.m5 == 1)) # Error rate of m5: find all values where prediction was wrong
er.m5 # 0.2951886 is lower than the null error rate 0.4018205
## [1] 0.2952
The null model (with no predictors) is wrong 40% of the time when it issues a point prediction of "does not support abortion choice" for every person. The final model, m5 (with 4 predictors), has an error rate of about 30%; put another way, m5 correctly predicts the responses of about 70% of respondents. So m5 improves on the null model's error rate by about 10 percentage points.
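To see where those errors come from, a confusion table is a useful sketch (using the same 0.5 cutoff as above):
table(predicted = as.numeric(fv.m5 >= 0.5), observed = y.m5) # Off-diagonal cells are the ~30% of misclassified respondents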
3. Produce a table of results for your final model in \( \LaTeX \).
library(xtable)
##
## Attaching package: 'xtable'
##
## The following object is masked from 'package:arm':
##
## display
xtable(m5)
## % latex table generated in R 3.0.2 by xtable 1.7-1 package
## % Wed Oct 23 22:54:37 2013
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
## \hline
## & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
## \hline
## (Intercept) & -1.1872 & 0.4913 & -2.42 & 0.0157 \\
## premarsex & 0.7317 & 0.1712 & 4.27 & 0.0000 \\
## educ & 0.1559 & 0.0295 & 5.29 & 0.0000 \\
## attend & -0.1821 & 0.0322 & -5.66 & 0.0000 \\
## libcon & -0.2692 & 0.0599 & -4.49 & 0.0000 \\
## \hline
## \end{tabular}
## \end{table}
See \( \LaTeX \) document for results table.
4. Compute the change in predicted probability for the two categories of (one of) the independent variable(s) that is a dummy variable. Do this once by holding all the other variables constant at something and then again using an average predictive comparison.
## Average Predictive Comparisons ##
# Change in predicted probability for a change in the dummy variable premarsex,
# holding everything else constant: set educ, attend, and libcon to their means
x.gss1 <- data.frame(intercept = 1, premarsex = 0, educ = mean(gss$educ), attend = mean(gss$attend),
libcon = mean(gss$libcon))
x.gss2 <- data.frame(intercept = 1, premarsex = 1, educ = mean(gss$educ), attend = mean(gss$attend),
libcon = mean(gss$libcon))
delta1 <- invlogit(as.matrix(x.gss1) %*% m5$coef) - invlogit(as.matrix(x.gss2) %*%
m5$coef)
delta1 # -0.1713: moving from approving of premarital sex (1) to not approving (0) lowers the predicted probability of support by about 17 percentage points
## [,1]
## [1,] -0.1713
# Same comparison, except the other variables keep their observed values
x.gss3 <- data.frame(intercept = 1, premarsex = 0, educ = gss$educ, attend = gss$attend,
libcon = gss$libcon)
# Take the entire dataset and set everyone's premarsex value to 0
x.gss4 <- data.frame(intercept = 1, premarsex = 1, educ = gss$educ, attend = gss$attend,
libcon = gss$libcon)
# Take the entire dataset and set everyone's premarsex value to 1
delta2 <- invlogit(as.matrix(x.gss3) %*% m5$coef) - invlogit(as.matrix(x.gss4) %*%
m5$coef)
mean(delta2) # -0.1511, close to the at-the-means estimate of -0.1713; averaging over the observed data shrinks the effect somewhat, but both approaches say premarital-sex approval shifts predicted support by roughly 15-17 percentage points
## [1] -0.1511
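The average predictive comparison can also be given simulation-based uncertainty; a sketch, assuming the x.gss3/x.gss4 profiles above and 1,000 draws from the approximate sampling distribution of the coefficients (mvrnorm comes from MASS, already loaded with arm):
sim.b <- mvrnorm(1000, coef(m5), vcov(m5)) # Draws from the coefficient sampling distribution
apc.sims <- apply(sim.b, 1, function(b) mean(invlogit(as.matrix(x.gss3) %*% b) -
    invlogit(as.matrix(x.gss4) %*% b))) # APC under each simulated coefficient vector
quantile(apc.sims, c(0.025, 0.975)) # Approximate 95% interval around the -0.151 point estimate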
5. With a continuous or ordinal independent variable, produce, using simulation, a predicted probability curve for it with 95% confidence bounds. Be sure to comment on what the values of other independent variables are held at for the curve. Give a brief interpretation of the plot.
## Simulated Coefficients ##
# Set the seed and the number of simulations
set.seed(39582)
m <- 1000
# Simulate coefficients from a multivariate normal
betas <- m5$coef # Collecting betas
vcv <- vcov(m5) # Collecting variance-covariance matrix
sim.betas <- mvrnorm(m, betas, vcv) # Taking 1000 draws from multivariate normal, the mean is betas, vcv estimated from m5 (simulating from sampling distribution of the model)
# Check to see if the simulated coefficients look like the real results
round(m5$coef, digits = 2)
## (Intercept) premarsex educ attend libcon
## -1.19 0.73 0.16 -0.18 -0.27
round(head(sim.betas, 10), digits = 2)
## (Intercept) premarsex educ attend libcon
## [1,] -1.18 0.88 0.15 -0.18 -0.30
## [2,] -1.11 0.76 0.13 -0.15 -0.26
## [3,] -1.68 0.66 0.21 -0.17 -0.31
## [4,] -0.41 0.84 0.13 -0.17 -0.36
## [5,] -0.72 1.10 0.11 -0.15 -0.30
## [6,] -0.30 0.65 0.10 -0.21 -0.25
## [7,] -1.80 0.94 0.17 -0.19 -0.21
## [8,] -0.65 0.79 0.14 -0.13 -0.35
## [9,] -1.33 0.62 0.18 -0.25 -0.29
## [10,] -1.11 0.56 0.16 -0.20 -0.25
data.frame(sim.means = apply(sim.betas, 2, mean), betas = betas, sim.sd = apply(sim.betas,
2, sd), se = sqrt(diag(vcv))) # mean of simulated betas, actual betas, sd of simulation, actual standard error; Looks pretty close overall
## sim.means betas sim.sd se
## (Intercept) -1.1821 -1.1872 0.49725 0.49132
## premarsex 0.7372 0.7317 0.16990 0.17118
## educ 0.1556 0.1559 0.02948 0.02949
## attend -0.1822 -0.1821 0.03298 0.03218
## libcon -0.2703 -0.2692 0.06018 0.05994
## Predicted Probability Plot ##
# Create a sequence of values spanning the range of education
education <- seq(min(gss$educ), max(gss$educ), length = 50)
# Create a hypothetical independent-variable profile for educ, the variable of
# interest; the intercept column must be 1 so m5's intercept enters the linear predictor
x.gss <- data.frame(intercept = 1, premarsex = 1, educ = education, attend = mean(gss$attend),
    libcon = mean(gss$libcon)) # Hold all variables except educ constant at their means (not always the best choice)
head(x.gss)
##   intercept premarsex   educ attend libcon
## 1         1         1 0.0000  3.663  4.104
## 2         1         1 0.4082  3.663  4.104
## 3         1         1 0.8163  3.663  4.104
## 4         1         1 1.2245  3.663  4.104
## 5         1         1 1.6327  3.663  4.104
## 6         1         1 2.0408  3.663  4.104
## (remaining 44 rows omitted: educ increases in equal steps up to 20 while the other columns stay constant)
# Compute the predicted probabilities using the MODEL coefficients
pp.betas <- invlogit(as.matrix(x.gss) %*% betas) # This is the mean function operating on the linear predictor to give the predicted probabilities
pp.betas # 50 predicted probabilities, one per value in the education sequence
## (output omitted: the predicted probability of support rises monotonically with years of education)
# Compute the predicted probabilities and confidence intervals using the
# SIMULATED coefficients
pp.sim <- matrix(NA, nrow = m, ncol = length(education)) # Compute the predicted probabilities 1000 times at each of the 50 education levels
for (i in 1:m) {
pp.sim[i, ] <- invlogit(as.matrix(x.gss) %*% sim.betas[i, ])
}
pe <- apply(pp.sim, 2, mean)
lo <- apply(pp.sim, 2, quantile, prob = 0.025) # 2.5th percentile (lower bound of the 95% CI)
hi <- apply(pp.sim, 2, quantile, prob = 0.975) # 97.5th percentile (upper bound)
# pe, lo, and hi are vectors of length 50
# Make the plot
par(mar = c(4, 5.5, 0.1, 0.1))
plot(education, pe, type = "n", ylim = c(-0.025, 1), xlab = "", ylab = "", axes = F)
abline(v = seq(min(education), max(education), 2), col = "gray75", lty = 3) # Vertical grid lines every 2 years
abline(h = seq(0, 1, 0.1), col = "gray75", lty = 3) # Horizontal grid lines
lines(education, pe, lwd = 3, lty = 1)
lines(education, lo, lwd = 2, lty = 2)
lines(education, hi, lwd = 2, lty = 2)
title(ylab = expression("Predicted Probability of Abortion Choice Support"),
line = 3, cex.lab = 1.25)
title(xlab = expression("Education Level"), line = 3, cex.lab = 1.25)
axis(1, at = c(min(gss$educ), mean(gss$educ), quantile(gss$educ, 0.95), max(gss$educ)),
labels = F) # Adds tick marks to x-axis
labels <- paste(c(expression("No"), expression("Mean"), expression("95%"), expression("Maximum")),
sep = " ")
text(c(min(gss$educ), mean(gss$educ), quantile(gss$educ, 0.95), max(gss$educ)),
-0.09, srt = 45, adj = 1, labels = labels, xpd = T) # Locations for labels of tick marks
axis(2, at = seq(0, 1, 0.1), las = 2, cex.axis = 1.1)
box()
rug(jitter(gss$educ), ticksize = 0.015) # The rug along the bottom shows the distribution of educ across its range
legend("topleft", bty = "n", c(expression("Point Estimate"), expression("95% Conf. Interval")),
lty = c(1, 2), lwd = c(3, 2), cex = 1)
As the number of years of education increases, the predicted probability of supporting a woman's choice to abort increases as well. Most of the data are concentrated at and above the mean education level. On average, the confidence band is about \( \pm0.15 \) wide and widens somewhat as education increases.