#Task 1 Using the States data, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.
load("States.RData")
#Examine data
head(States[, c("State", "ClintonVote", "College", "HouseholdIncome", "NonWhite")])
## State ClintonVote College HouseholdIncome NonWhite
## 1 Alabama 34.36 26.0 46.472 31.6
## 2 Alaska 36.55 26.5 76.114 34.7
## 3 Arizona 45.13 27.4 53.510 22.5
## 4 Arkansas 33.65 24.7 43.813 22.7
## 5 California 61.73 34.5 67.169 39.4
## 6 Colorado 48.16 39.6 65.458 15.8
# Fit the multiple linear regression model
model <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data=States)
# Summarize the model
# This will display the coefficients, standard errors, t-values, and p-values for each predictor
summary(model)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite,
## data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.697 -3.605 -1.036 4.495 13.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99082 6.01413 0.165 0.870
## College 1.12498 0.22514 4.997 8.88e-06 ***
## HouseholdIncome -0.07554 0.15345 -0.492 0.625
## NonWhite 0.43123 0.08246 5.230 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.5904
## F-statistic: 24.55 on 3 and 46 DF, p-value: 1.237e-09
# Create regression tables using sjPlot
tab_model(model, show.ci = FALSE, show.se = TRUE, dv.labels = c("ClintonVote"))
| Â | ClintonVote | ||
|---|---|---|---|
| Predictors | Estimates | std. Error | p |
| (Intercept) | 0.99 | 6.01 | 0.870 |
| College | 1.12 | 0.23 | <0.001 |
| HouseholdIncome | -0.08 | 0.15 | 0.625 |
| NonWhite | 0.43 | 0.08 | <0.001 |
| Observations | 50 | ||
| R2 / R2 adjusted | 0.616 / 0.590 | ||
# Interpret the coefficients and assess model fit
coefficients <- summary(model)$coefficients
r_squared <- summary(model)$r.squared
adjusted_r_squared <- summary(model)$adj.r.squared
# Print the coefficients
cat("Coefficients:\n")
## Coefficients:
print(coefficients)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99082189 6.0141277 0.1647491 8.698633e-01
## College 1.12498459 0.2251381 4.9968638 8.883340e-06
## HouseholdIncome -0.07554429 0.1534528 -0.4922966 6.248512e-01
## NonWhite 0.43122573 0.0824595 5.2295460 4.053178e-06
# Print the R-squared and Adjusted R-squared values
cat("\nR-squared: ", r_squared)
##
## R-squared: 0.615523
cat("\nAdjusted R-squared: ", adjusted_r_squared)
##
## Adjusted R-squared: 0.5904484
#Interpretation of Coefficients
#Task 2 Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.
# Fit the original model
model_original <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
# Fit the extended model with the 'Region' variable
model_extended <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
# Compare models using AIC and BIC
aic_original <- AIC(model_original)
aic_extended <- AIC(model_extended)
bic_original <- BIC(model_original)
bic_extended <- BIC(model_extended)
cat("AIC for original model: ", aic_original, "\n")
## AIC for original model: 336.7667
cat("AIC for extended model: ", aic_extended, "\n")
## AIC for extended model: 326.4857
cat("BIC for original model: ", bic_original, "\n")
## BIC for original model: 346.3268
cat("BIC for extended model: ", bic_extended, "\n")
## BIC for extended model: 341.7819
# Create regression tables using modelsummary
models <- list(Original = model_original, Extended = model_extended)
modelsummary(models)
| Original | Extended | |
|---|---|---|
| (Intercept) | 0.991 | 14.806 |
| (6.014) | (7.159) | |
| College | 1.125 | 1.189 |
| (0.225) | (0.257) | |
| HouseholdIncome | -0.076 | -0.429 |
| (0.153) | (0.171) | |
| NonWhite | 0.431 | 0.525 |
| (0.082) | (0.082) | |
| RegionNE | 8.031 | |
| (2.721) | ||
| RegionS | -3.404 | |
| (2.742) | ||
| RegionW | 5.658 | |
| (2.772) | ||
| Num.Obs. | 50 | 50 |
| R2 | 0.616 | 0.722 |
| R2 Adj. | 0.590 | 0.684 |
| AIC | 336.8 | 326.5 |
| BIC | 346.3 | 341.8 |
| Log.Lik. | -163.383 | -155.243 |
| F | 24.548 | 18.648 |
| RMSE | 6.35 | 5.40 |
#Interpretation
Model Comparison:
The extended model has a lower AIC (326.4857) compared to the original model (336.7667). Lower AIC indicates that the extended model with the Region variable provides a better balance between goodness-of-fit and model complexity. It suggests that adding Region improves the model’s fit more effectively.
The extended model also has a lower BIC (341.7819) compared to the original model (346.3268).Lower BIC supports the same conclusion as AIC. It shows that the extended model is preferable, considering the additional penalty for complexity in the BIC criterion.
I therefore prefer the extended model, which includes the Region variable. This suggests that the inclusion of the Region variable improves the model’s fit without excessively increasing its complexity.
#Task 3 Using the States data, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?
# Fit the model including the interaction term between College and NonWhite
model_interaction <- lm(ClintonVote ~ College * NonWhite, data = States)
# Summarize the model
summary(model_interaction)
##
## Call:
## lm(formula = ClintonVote ~ College * NonWhite, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.446 -3.543 -0.295 3.936 13.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22574 13.26349 -1.148 0.256927
## College 1.50058 0.39913 3.760 0.000479 ***
## NonWhite 0.99725 0.47769 2.088 0.042397 *
## College:NonWhite -0.01802 0.01457 -1.236 0.222664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared: 0.6259, Adjusted R-squared: 0.6015
## F-statistic: 25.66 on 3 and 46 DF, p-value: 6.636e-10
# Calculate effects for visualization
interaction_effects <- allEffects(model_interaction)
# Plot the interaction effects
plot(interaction_effects, main = "Interaction Effects of College and NonWhite on ClintonVote")
#Interpretation Each additional percentage point of college graduates is associated with a 1.50 percentage point increase in the likelihood of voting for Clinton. However, the interaction term is not statistically significant (p = 0.223), indicating that the effect of college education on voting for Clinton does not significantly vary with the percentage of the non-white population.
#Task 4 Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.
# Create the binary outcome variable
States$VotedForClinton <- ifelse(States$ClintonVote > 0.5, 1, 0)
# Fit the logistic regression model
model_logit <- glm(VotedForClinton ~ College + HouseholdIncome + Region, data = States, family = binomial)
# Summarize the model
summary(model_logit)
##
## Call:
## glm(formula = VotedForClinton ~ College + HouseholdIncome + Region,
## family = binomial, data = States)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 4.336e+05 0 1
## College -6.055e-08 1.509e+04 0 1
## HouseholdIncome -8.145e-08 9.068e+03 0 1
## RegionNE -3.017e-06 1.665e+05 0 1
## RegionS -4.680e-06 1.525e+05 0 1
## RegionW -2.223e-06 1.696e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 49 degrees of freedom
## Residual deviance: 2.9008e-10 on 44 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
# Create regression tables using modelsummary
modelsummary(model_logit)
| (1) | |
|---|---|
| (Intercept) | 26.566 |
| (433587.131) | |
| College | 0.000 |
| (15090.634) | |
| HouseholdIncome | 0.000 |
| (9067.724) | |
| RegionNE | 0.000 |
| (166455.565) | |
| RegionS | 0.000 |
| (152517.241) | |
| RegionW | 0.000 |
| (169636.511) | |
| Num.Obs. | 50 |
| AIC | 12.0 |
| BIC | 23.5 |
| Log.Lik. | 0.000 |
| F | 0.000 |
| RMSE | 0.00 |
# Create regression tables using sjPlot
tab_model(model_logit)
| Â | VotedForClinton | ||
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 344746337574.15 | 0.00 – Inf | 1.000 |
| College | 1.00 | 0.00 – Inf | 1.000 |
| HouseholdIncome | 1.00 | 0.00 – Inf | 1.000 |
| Region [NE] | 1.00 | 0.00 – Inf | 1.000 |
| Region [S] | 1.00 | 0.00 – Inf | 1.000 |
| Region [W] | 1.00 | 0.00 – Inf | 1.000 |
| Observations | 50 | ||
# Convert coefficients to odds ratios
odds_ratios <- exp(coef(model_logit))
conf_int <- confint.default(model_logit)
conf_intervals <- exp(conf_int)
cat("Odds Ratios:\n")
## Odds Ratios:
print(knitr::kable(data.frame(OddsRatio = odds_ratios), col.names = "Odds Ratio", digits = 4))
##
##
## | | Odds Ratio|
## |:---------------|------------:|
## |(Intercept) | 344746337574|
## |College | 1|
## |HouseholdIncome | 1|
## |RegionNE | 1|
## |RegionS | 1|
## |RegionW | 1|
cat("\nConfidence Intervals for Odds Ratios:\n")
##
## Confidence Intervals for Odds Ratios:
print(knitr::kable(data.frame(ConfidenceIntervals = conf_intervals), digits = 4))
##
##
## | | ConfidenceIntervals.2.5..| ConfidenceIntervals.97.5..|
## |:---------------|-------------------------:|--------------------------:|
## |(Intercept) | 0| Inf|
## |College | 0| Inf|
## |HouseholdIncome | 0| Inf|
## |RegionNE | 0| Inf|
## |RegionS | 0| Inf|
## |RegionW | 0| Inf|
# Assess model fit using AIC and BIC
aic_logit <- AIC(model_logit)
bic_logit <- BIC(model_logit)
cat("\nAIC for logistic regression model: ", aic_logit, "\n")
##
## AIC for logistic regression model: 12
cat("BIC for logistic regression model: ", bic_logit, "\n")
## BIC for logistic regression model: 23.47214
#Interpretation
The odds ratio of 1for College means that there is no change in the odds of voting for Clinton with an increase in the percentage of college graduates. This result suggests that, in this model, College does not significantly affect the odds of voting for Clinton when controlling for other factors.
The odds ratio of 1 for HouseholdIncome indicates that changes in household income do not affect the odds of voting for Clinton in this model. This suggests that, after accounting for College and Region, HouseholdIncome does not have a significant effect on voting behavior. Region Variables (NE, S, W):
The odds ratios for the Region variables (NE, S, W) being 1means that these regional indicators do not affect the odds of voting for Clinton compared to the reference region. This suggests that there are no significant regional differences in the odds of voting for Clinton in this model.
#Model Fit:
AIC (12): A lower AIC value indicates a better model fit relative to other models. This value suggests that the model fits the data reasonably well. BIC (23.47214): Similar to AIC, a lower BIC value indicates a better model fit. BIC also penalizes model complexity, so a lower BIC suggests that this model provides a good balance between fit and complexity.
#Task 5 Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.
# Fit the logistic regression model
m.logit <- glm(VotedForClinton ~ College + HouseholdIncome + Region, data = States, family = binomial(link = "logit"))
summary(m.logit)
##
## Call:
## glm(formula = VotedForClinton ~ College + HouseholdIncome + Region,
## family = binomial(link = "logit"), data = States)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.657e+01 4.336e+05 0 1
## College -6.055e-08 1.509e+04 0 1
## HouseholdIncome -8.145e-08 9.068e+03 0 1
## RegionNE -3.017e-06 1.665e+05 0 1
## RegionS -4.680e-06 1.525e+05 0 1
## RegionW -2.223e-06 1.696e+05 0 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 49 degrees of freedom
## Residual deviance: 2.9008e-10 on 44 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
# Visualize odds ratios using sjPlot
plot_model(model_logit,
type = "est",
show.values = TRUE,
show.p = TRUE,
title = "Odds Ratios for Predictors of Voting for Clinton")
## Warning in scale_y_continuous(trans = "log10", limits = axis.scaling$axis.lim,
## : log-10 transformation introduced infinite values.
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text()`).
# Plot predicted probabilities across the range of College values, separated by Region
eff <- allEffects(model_logit)
plot(eff,
type = "response",
main = "Predicted Probabilities of Voting for Clinton Across College Values by Region")
eff <- Effect("Region", m.logit)
plot(eff, type = "response", ylim=c(0,1))
#END