X Wang - Assignment 5

PSCI 7108: Advanced Data Analysis III

1: The Inverse Logistic Link

The inverse logistic distribution is defined as follows:

\( logit^{-1}(\eta)=\large{\frac{e^{\eta}}{1+e^{\eta}}} \)
Program an inverse logit function in R. Plot the following logistic regression lines on the same plot. Distinguish the lines using \( \verb!lty! \) and/or \( \verb!col! \). Also, use \( \verb!legend()! \) to put a legend on the plot.

\( Pr(y = 1) = logit^{-1}(x) \)

\( Pr(y = 1) = logit^{-1}(2 + x) \)

\( Pr(y = 1) = logit^{-1}(2x) \)

\( Pr(y = 1) = logit^{-1}(2 + 2x) \)

\( Pr(y = 1) = logit^{-1}(-2x) \)

## Inverse Logit Function ##
invlogit <- function(eta) {
    exp(eta)/(1 + exp(eta))
}
## Set Sequence for Value x ##
x <- seq(from = -10, to = 10, by = 0.001)
prob.1 <- invlogit(x)  # Evaluates the inverse logit distribution for first value for eta

## Plot Different Lines ##
par(mar = c(5, 5, 4, 2) + 0.1)  # Sets plot margins for bottom, left, top, right, respectively;
plot(x, prob.1, type = "l", col = "black", lwd = 3, xlab = "", ylab = "", main = "")  # Prob 1
title(main = expression(paste("Inverse Logistic Distributions for Different Values of ", 
    eta, sep = "")), font.main = 2)
title(xlab = "x", cex.lab = 1.25)
title(ylab = expression(paste(logit^-1, (eta), sep = "")), cex.lab = 1.25)
curve(invlogit(2 + x), add = T, col = "red", lwd = 3)  # Prob 2; curve() evaluates a function
curve(invlogit(2 * x), add = T, col = "green3", lwd = 3)  # Prob 3
curve(invlogit(2 + 2 * x), add = T, col = "yellow", lwd = 3)  # Prob 4
curve(invlogit(-2 * x), add = T, col = "blue", lwd = 3)  # Prob 5
legend("right", inset = c(0.02), title = expression(paste(eta, " Values", sep = "")), 
    c("x", "2 + x", "2x", "2 + 2x", "-2x"), col = c("black", "red", "green3", 
        "yellow", "blue"), lwd = 2)

plot of chunk unnamed-chunk-1

2: A Logit Model Example

The file \( \verb!RealNES.RData! \) contains the survey data of presidential preference and income for the 1992 election which we analyzed in class, along with other variables including sex, ethnicity, education, party identification, and political ideology.

2.1: Model Estimation

Fit two logistic regressions predicting support for Bush (vote), one with all of the other variables as independent variables and the other with some subset of them. Report the names of the variables in each model.

## Load, examine, and clean dataset for 1992 ##
nes <- subset(read.table("/Users/squishy/Dropbox/PSCI 7108/Assignment 5/RealNES.RData"), 
    year == 1992)
str(nes)  # There are observations w/ 'NA'--this is not good! Different models (even if not nested) need to be run off of the same dataset for comparability. When a regression model is run, observations with NAs are automatically dropped, so if different models are run, different observations will be dropped depending on which variables are included

## 'data.frame':    1724 obs. of  9 variables:
##  $ year        : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ vote        : int  1 1 NA 0 1 0 0 NA 1 1 ...
##  $ female      : int  1 1 1 1 0 1 1 1 0 1 ...
##  $ black       : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ age.discrete: int  2 4 2 4 1 4 3 2 3 3 ...
##  $ educ1       : int  2 3 2 2 4 2 3 3 3 4 ...
##  $ income      : int  4 2 3 1 2 3 4 3 2 4 ...
##  $ pid         : int  3 3 3 1 3 2 1 1 3 3 ...
##  $ ideo        : int  5 5 5 1 1 1 1 1 3 5 ...

ok <- complete.cases(nes)  # Check which observations are complete
sum(!ok)  # Number of observations with NAs

## [1] 591

nes2 <- nes[ok, ]  # Save new, clean dataframe
str(nes2)

## 'data.frame':    1133 obs. of  9 variables:
##  $ year        : int  1992 1992 1992 1992 1992 1992 1992 1992 1992 1992 ...
##  $ vote        : int  1 1 0 1 0 0 1 1 0 1 ...
##  $ female      : int  1 1 1 0 1 1 0 1 0 1 ...
##  $ black       : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ age.discrete: int  2 4 4 1 4 3 3 3 1 2 ...
##  $ educ1       : int  2 3 2 4 2 3 3 4 3 4 ...
##  $ income      : int  4 2 1 2 3 4 2 4 1 4 ...
##  $ pid         : int  3 3 1 3 2 1 3 3 2 3 ...
##  $ ideo        : int  5 5 1 1 1 1 3 5 1 5 ...


## Fit Models ##
m1 <- glm(vote ~ female + black + age.discrete + educ1 + income + pid + ideo, 
    family = binomial(link = "logit"), data = nes2)

Model 1 contains all independent variables: \( \verb!female, black, age.discrete, educ1, income, pid,! \) and \( \verb!ideo! \). These respectively represent gender, ethnicity, age, education, income, party identification, and political ideology.

summary(m1)

## 
## Call:
## glm(formula = vote ~ female + black + age.discrete + educ1 + 
##     income + pid + ideo, family = binomial(link = "logit"), data = nes2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.445  -0.343  -0.217   0.405   3.163  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.3389     0.6330  -10.01  < 2e-16 ***
## female         0.4789     0.2137    2.24    0.025 *  
## black         -1.9770     0.4540   -4.35  1.3e-05 ***
## age.discrete   0.0216     0.1085    0.20    0.842    
## educ1          0.2144     0.1248    1.72    0.086 .  
## income        -0.0725     0.1047   -0.69    0.489    
## pid            1.9757     0.1153   17.14  < 2e-16 ***
## ideo           0.4477     0.0603    7.43  1.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1534.10  on 1132  degrees of freedom
## Residual deviance:  664.05  on 1125  degrees of freedom
## AIC: 680.1
## 
## Number of Fisher Scoring iterations: 6

m2 <- glm(vote ~ female + black + educ1 + income + pid + ideo, family = binomial(link = "logit"), 
    data = nes2)

Model 2 contains the following 6 independent variables: \( \verb!female, black, educ1, income, pid,! \) and \( \verb!ideo! \).

summary(m2)

## 
## Call:
## glm(formula = vote ~ female + black + educ1 + income + pid + 
##     ideo, family = binomial(link = "logit"), data = nes2)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.440  -0.340  -0.219   0.408   3.169  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -6.2688     0.5247  -11.95  < 2e-16 ***
## female        0.4802     0.2136    2.25    0.025 *  
## black        -1.9824     0.4533   -4.37  1.2e-05 ***
## educ1         0.2097     0.1225    1.71    0.087 .  
## income       -0.0739     0.1045   -0.71    0.479    
## pid           1.9741     0.1150   17.17  < 2e-16 ***
## ideo          0.4493     0.0597    7.52  5.3e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1534.10  on 1132  degrees of freedom
## Residual deviance:  664.09  on 1126  degrees of freedom
## AIC: 678.1
## 
## Number of Fisher Scoring iterations: 6

2.2: Model Fit

Use the \( \verb!logLik()! \) function to get the log-likelihoods of the two models.

Compute the deviance for each model; did the addition of more predictors improve model fit?

Perform and interpret a likelihood ratio test.

Compute the AIC and BIC for the two models (show your code for the computation of each). Considering all tests of model fit performed here, which model seems to fit the data best?

1. Log-Likelihood of Models

## Log-likelihood ##
ll.m1 <- logLik(m1)
ll.m2 <- logLik(m2)
ll.m1

## 'log Lik.' -332 (df=8)

ll.m1 <- -332.0258
ll.m2

## 'log Lik.' -332 (df=7)

ll.m2 <- -332.0456

2. Deviance of Models
The deviance is \( D = -2(\ln{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y})) \).

## Deviance ##
d.m1 <- -2 * ll.m1
d.m2 <- -2 * ll.m2
d.m1  # 664.0516

## [1] 664.1

d.m2  # 664.0912

## [1] 664.1

Since deviance is a meausure of error between the model of interest and a saturated model, the lower the deviance, the better the model fit. In this case, Model 1 has one more predictor than Model 2 and has a lower deviance (664.0516). According to the deviance (which does not include penalties for additional predictors), the additional variable in Model 1 improves model fit.

3. Likelihood Ratio Test Comparing Models
The likelihood ratio test is \( LR = -2[\ln(\hat{L}_{R}) - \ln(\hat{L}_{U})] \), where \( R \) is the restricted model and \( U \) is the unrestricted.

## Likelihood Ratio Test ##
lr <- -2 * (ll.m2 - ll.m1)
lr

## [1] 0.0396

p.value <- 1 - pchisq(lr, df = 1)  # df = 1 between the restricted and unrestricted models; the null hypothesis is the restricted model is better
p.value  # The p-value is not statistically significant, so we accept the null hypothesis and interprect the restricted model, m2, is better

## [1] 0.8423

4. AIC and BIC of Models
The AIC is: \( AIC = -2(\ln{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y})) + 2p = D(\boldsymbol{\hat{\theta}}) + 2p, \) where \( p \) is the number of estimated parameters.
The BIC is: \( BIC = - 2(\ln{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y})) + plog(n) = D(\boldsymbol{\hat{\theta}}) + pln(n), \) where \( p \) is the number of estimated parameters and \( n \) is the number of observations.

## AIC and BIC ##
aic1 <- d.m1 + 2 * (ncol(nes2) - 2)  # All variables except for vote and year
aic2 <- d.m2 + 2 * (ncol(nes2) - 3)  # All except vote, year, and age
bic1 <- d.m1 + (ncol(nes2) - 2) * log(nrow(nes2))
bic2 <- d.m2 + (ncol(nes2) - 3) * log(nrow(nes2))
compare <- matrix(data = cbind(aic1, aic2, bic1, bic2), nrow = 2, ncol = 2, 
    byrow = T)
colnames(compare) <- c("m1 (U)", "m2 (R)")
rownames(compare) <- c("AIC", "BIC")
compare  # m2 has lower AIC and BIC values, reinforcing LR test above

##     m1 (U) m2 (R)
## AIC  678.1  676.1
## BIC  713.3  706.3

Considering all tests performed above, m2, the restricted model that doesn't include age, seems to fit the data the best. The deviance value for m1 was lower than m2 by 0.04, but deviance doesn't account for parsimony. The LR test, AIC, and BIC show lower values for m2.

2.3 Presenting Results

Report the results from both models with the \( \verb!apsrtable()! \) command in the package \( \verb!apsrtable! \).

Using the \( \verb!coefplot()! \) command in the \( \verb!arm! \) package, create a coefficient plot of the results (one plot for each model). Optional: Create the plots manually (without \( \verb!coefplot()! \)).

Compute and discuss the effect of unit changes in one independent variable in the better-fitting model on the probability of support for Bush. You can choose the way you compute this “marginal effect” (use inverse logit function, derivative, divide by 4, or odds ratio).

## Install and Load Libraries ##
install.packages("apsrtable")

## Error: trying to use CRAN without setting a mirror

install.package("arm")

## Error: could not find function "install.package"

library(apsrtable)
library(arm)

## Loading required package: MASS Loading required package: Matrix Loading
## required package: lattice Loading required package: lme4
## 
## arm (Version 1.6-09, built: 2013-9-23)
## 
## Working directory is /Users/squishy/Dropbox/PSCI 7108/Assignment 5
## 
## Attaching package: 'arm'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
## invlogit

library(xtable)

## Attaching package: 'xtable'
## 
## The following object is masked from 'package:arm':
## 
## display


## Report Results ##

# Generate LaTeX results tables
apsrtable(m1)

## \begin{table}[!ht]
## \caption{}
## \label{} 
## \begin{tabular}{ l D{.}{.}{2} } 
## \hline 
##   & \multicolumn{ 1 }{ c }{ Model 1 } \\ \hline
##  %            & Model 1 \\ 
## (Intercept)  & -6.34 ^*\\ 
##              & (0.63)  \\ 
## female       & 0.48 ^* \\ 
##              & (0.21)  \\ 
## black        & -1.98 ^*\\ 
##              & (0.45)  \\ 
## age.discrete & 0.02    \\ 
##              & (0.11)  \\ 
## educ1        & 0.21    \\ 
##              & (0.12)  \\ 
## income       & -0.07   \\ 
##              & (0.10)  \\ 
## pid          & 1.98 ^* \\ 
##              & (0.12)  \\ 
## ideo         & 0.45 ^* \\ 
##              & (0.06)   \\
##  $N$          & 1133    \\ 
## AIC          & 680.05  \\ 
## BIC          & 841.10  \\ 
## $\log L$    & -308.03  \\ \hline
##  \multicolumn{2}{l}{\footnotesize{Standard errors in parentheses}}\\
## \multicolumn{2}{l}{\footnotesize{$^*$ indicates significance at $p< 0.05 $}} 
## \end{tabular} 
##  \end{table}

apsrtable(m2)

## \begin{table}[!ht]
## \caption{}
## \label{} 
## \begin{tabular}{ l D{.}{.}{2} } 
## \hline 
##   & \multicolumn{ 1 }{ c }{ Model 1 } \\ \hline
##  %           & Model 1 \\ 
## (Intercept) & -6.27 ^*\\ 
##             & (0.52)  \\ 
## female      & 0.48 ^* \\ 
##             & (0.21)  \\ 
## black       & -1.98 ^*\\ 
##             & (0.45)  \\ 
## educ1       & 0.21    \\ 
##             & (0.12)  \\ 
## income      & -0.07   \\ 
##             & (0.10)  \\ 
## pid         & 1.97 ^* \\ 
##             & (0.11)  \\ 
## ideo        & 0.45 ^* \\ 
##             & (0.06)   \\
##  $N$         & 1133    \\ 
## AIC         & 678.09  \\ 
## BIC         & 819.00  \\ 
## $\log L$   & -311.05  \\ \hline
##  \multicolumn{2}{l}{\footnotesize{Standard errors in parentheses}}\\
## \multicolumn{2}{l}{\footnotesize{$^*$ indicates significance at $p< 0.05 $}} 
## \end{tabular} 
##  \end{table}

# Unfortunately, this LaTeX is not generating properly, so using xtable()
xtable(m1)  # Major difference between apstrtable() and xtable() is that the latter does not include the number of observations, AIC, BIC, and log-likelihood values

## % latex table generated in R 3.0.1 by xtable 1.7-1 package
## % Wed Oct 16 14:55:40 2013
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
##   \hline
##  & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ 
##   \hline
## (Intercept) & -6.3389 & 0.6330 & -10.01 & 0.0000 \\ 
##   female & 0.4789 & 0.2137 & 2.24 & 0.0250 \\ 
##   black & -1.9770 & 0.4540 & -4.35 & 0.0000 \\ 
##   age.discrete & 0.0216 & 0.1085 & 0.20 & 0.8422 \\ 
##   educ1 & 0.2144 & 0.1248 & 1.72 & 0.0859 \\ 
##   income & -0.0725 & 0.1047 & -0.69 & 0.4889 \\ 
##   pid & 1.9757 & 0.1153 & 17.14 & 0.0000 \\ 
##   ideo & 0.4477 & 0.0603 & 7.43 & 0.0000 \\ 
##    \hline
## \end{tabular}
## \end{table}

xtable(m2)

## % latex table generated in R 3.0.1 by xtable 1.7-1 package
## % Wed Oct 16 14:55:40 2013
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrr}
##   \hline
##  & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\ 
##   \hline
## (Intercept) & -6.2688 & 0.5247 & -11.95 & 0.0000 \\ 
##   female & 0.4802 & 0.2136 & 2.25 & 0.0245 \\ 
##   black & -1.9824 & 0.4533 & -4.37 & 0.0000 \\ 
##   educ1 & 0.2097 & 0.1225 & 1.71 & 0.0870 \\ 
##   income & -0.0739 & 0.1045 & -0.71 & 0.4791 \\ 
##   pid & 1.9741 & 0.1150 & 17.17 & 0.0000 \\ 
##   ideo & 0.4493 & 0.0597 & 7.52 & 0.0000 \\ 
##    \hline
## \end{tabular}
## \end{table}


# Generate coefficient plot of results
par(mar = c(3, 7, 3, 0.5))  # b,l,t,r
longnames.m1 <- c("(Intercept)", "Female", "Race (Black)", "Age", "Education", 
    "Income", "Party Identification", "Political Ideology")
coefplot(m1, longnames.m1, CI = 1, xlab = "", ylab = "", main = "Logit Estimates for m1")

plot of chunk unnamed-chunk-9

par(mar = c(3, 7, 3, 0.5))
longnames.m2 <- c("(Intercept)", "Female", "Race (Black)", "Education", "Income", 
    "Party Identification", "Political Ideology")
coefplot(m2, longnames.m2, CI = 1, xlab = "", ylab = "", main = "Logit Estimates for m2")

plot of chunk unnamed-chunk-9


## Investigate Effect of Unit Change in Political Identification Variable ##
intercept.m2 <- coef(m2)[1]  # Intercept of m2
intercept.m2

## (Intercept) 
##      -6.269

intercept.m2 <- -6.268842
pid.m2 <- coef(m2)[6]  # pid coefficient of m2
pid.m2

##   pid 
## 1.974

pid.m2 <- 1.974135
mean.pid <- mean(nes2$pid)  # Mean of pid in dataset
mean.pid

## [1] 1.838

pr.central <- invlogit(intercept.m2 + pid.m2 * mean.pid)  # Probability of voting for Bush at mean of pid
pr.central  # At the pid mean of 1.838, the probability of supporting Bush is = 0.067, quite low

## [1] 0.06654

unique(nes$pid)  # pid only has unique values of 1, 2, 3

## [1]  3  1  2 NA

pr.3_2 <- invlogit(intercept.m2 + pid.m2 * 3) - invlogit(intercept.m2 + pid.m2 * 
    2)
pr.2_1 <- invlogit(intercept.m2 + pid.m2 * 2) - invlogit(intercept.m2 + pid.m2 * 
    1)
pr.3_2  # A difference of 1 in the pid between 3 and 2 corresponds to a positive difference of 0.3248 in the probability of supporting Bush

## [1] 0.3248

pr.2_1  # A difference of 1 in the pid between 2 and 1 corresponds to a positive difference of 0.07598

## [1] 0.07598

The \( \verb!pid! \) (political identification) variable is on a 3-point scale. At the mean of this variable (1.84), the probability of support for Bush is 6.7%. Investigating the effect of a unit change in this variable, we find that a unit change from 1 to 2 corresponds to a positive difference of 7.6 percentage points in the probability of supporting Bush. A unit change from 2 to 3 corresponds to a positive difference of 32.5 percentages points in the probability of Bush support.