Task 1 Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.
library(Lock5Data)  # assumed source of the USStates data
library(sjPlot)     # provides tab_model()
data(USStates)
States <- USStates
model <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
tab_model(model)
| | ClintonVote | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 0.99 | -11.11 – 13.10 | 0.870 |
| College | 1.12 | 0.67 – 1.58 | <0.001 |
| HouseholdIncome | -0.08 | -0.38 – 0.23 | 0.625 |
| NonWhite | 0.43 | 0.27 – 0.60 | <0.001 |
| Observations | 50 | | |
| R² / R² adjusted | 0.616 / 0.590 | | |
Based on the regression table, the intercept is 0.99 (p = 0.870), so the predicted ClintonVote when all three predictors equal zero is not significantly different from zero. That scenario lies far outside the observed data, so the intercept has little substantive meaning. College has an estimate of 1.12 (p < 0.001), which is statistically significant: each one-percentage-point increase in the share of college-educated individuals is associated with a 1.12-point increase in ClintonVote, holding HouseholdIncome and NonWhite constant.
HouseholdIncome has an estimate of -0.08 (p = 0.625), and its confidence interval of -0.38 to 0.23 includes zero, so it is not statistically significant. Once College and NonWhite are controlled for, household income shows no detectable linear relationship with ClintonVote in this model.
NonWhite has an estimate of 0.43 (p < 0.001), which is statistically significant: each one-percentage-point increase in the share of non-white individuals is associated with a 0.43-point increase in ClintonVote, holding the other predictors constant.
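As a rough check on the table, the confidence intervals can be rebuilt from the estimates and their standard errors. The sketch below assumes the Model 1 standard errors that modelsummary reports in Task 2 (rounded to three decimals) and 46 residual degrees of freedom (50 states minus 4 coefficients). It works from rounded values and is an illustration, not output from the fitted model.

```r
# Estimates and standard errors copied (rounded) from the model1 column
# of the modelsummary table in Task 2
est <- c("(Intercept)" = 0.991, College = 1.125,
         HouseholdIncome = -0.076, NonWhite = 0.431)
se  <- c("(Intercept)" = 6.014, College = 0.225,
         HouseholdIncome = 0.153, NonWhite = 0.082)

df_resid <- 50 - length(est)      # 50 observations minus 4 coefficients
t_crit   <- qt(0.975, df_resid)   # two-sided 95% critical value

# 95% interval: estimate plus or minus t_crit standard errors
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 2))
```

Up to rounding, the bounds should agree with the CI column above. An interval that contains zero corresponds to a p-value above 0.05.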
The model's R² is 0.616, so the three predictors together explain about 61.6% of the variance in ClintonVote across the 50 states, a moderate level of explanatory power. The adjusted R² of 0.590 applies a penalty for the number of predictors; the small gap between the two values shows that this penalty is minor for a model with three predictors and 50 observations.
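The relationship between the two fit statistics can be sketched with the standard adjusted R² formula. The inputs below are the rounded values from the table, so the result may differ from the reported 0.590 in the third decimal.

```r
r2 <- 0.616   # R-squared from the table (rounded)
n  <- 50      # number of observations
k  <- 3       # number of predictors, excluding the intercept

# Adjusted R-squared shrinks R-squared by the ratio of total to residual df
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))
```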
Task 2 Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.
model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
print(paste("AIC with model without Region is ", AIC(model)))
## [1] "AIC with model without Region is 336.766654146048"
print(paste("BIC with model without Region is ", BIC(model)))
## [1] "BIC with model without Region is 346.326769173189"
print(paste("AIC with model with Region is ", AIC(model2)))
## [1] "AIC with model with Region is 326.485680890501"
print(paste("BIC with model with Region is ", BIC(model2)))
## [1] "BIC with model with Region is 341.781864933926"
model_list <- list(model1 = model, model2 = model2)
modelsummary(model_list)
| | model1 | model2 |
|---|---|---|
| (Intercept) | 0.991 | 14.806 |
| | (6.014) | (7.159) |
| College | 1.125 | 1.189 |
| | (0.225) | (0.257) |
| HouseholdIncome | -0.076 | -0.429 |
| | (0.153) | (0.171) |
| NonWhite | 0.431 | 0.525 |
| | (0.082) | (0.082) |
| RegionNE | | 8.031 |
| | | (2.721) |
| RegionS | | -3.404 |
| | | (2.742) |
| RegionW | | 5.658 |
| | | (2.772) |
| Num.Obs. | 50 | 50 |
| R2 | 0.616 | 0.722 |
| R2 Adj. | 0.590 | 0.684 |
| AIC | 336.8 | 326.5 |
| BIC | 346.3 | 341.8 |
| Log.Lik. | -163.383 | -155.243 |
| RMSE | 6.35 | 5.40 |
In Model 2 the intercept is 14.806 with a standard error of 7.159, giving t ≈ 2.07 on 43 residual degrees of freedom (p about 0.045), so it is marginally significant at the 5% level. It is the predicted ClintonVote for a state in the reference region (the omitted Region level, MW) with all numeric predictors at zero. That combination lies far outside the data, so the intercept has little substantive meaning.
The College variable has an estimate of 1.189 and is statistically significant (p < 0.001): each one-percentage-point increase in the share of college-educated individuals is associated with an increase of approximately 1.189 percentage points in ClintonVote, holding all other variables constant.
HouseholdIncome has an estimate of -0.429 with a standard error of 0.171 (t ≈ -2.51, p about 0.02). Unlike in Model 1, it is statistically significant at the 5% level once Region is controlled for: each one-unit increase in household income is associated with a decrease of about 0.43 percentage points in ClintonVote, holding all other variables constant. The NonWhite variable has an estimate of 0.525 and is statistically significant (p < 0.001): each one-percentage-point increase in the share of non-white individuals is associated with an increase of approximately 0.525 percentage points in ClintonVote, holding all other variables constant.
The Region variable adds further information. Being in the Northeast (RegionNE) has a significant positive effect on ClintonVote, with an estimate of 8.031 (t ≈ 2.95, p < 0.01): ClintonVote is about 8 percentage points higher in the Northeast than in the reference region, holding the other predictors constant. The South (RegionS) has an estimate of -3.404 (t ≈ -1.24), which is not statistically significant, so there is no detectable difference from the reference region. The West (RegionW) has an estimate of 5.658 with a standard error of 2.772 (t ≈ 2.04), which is borderline: the p-value implied by the rounded table values is just under 0.05, so the West is at best marginally different from the reference region.
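The modelsummary table reports standard errors rather than p-values, so significance in Model 2 can be checked approximately by dividing each estimate by its standard error and comparing the result to a t distribution. The sketch below uses the rounded values from the model2 column and assumes 43 residual degrees of freedom (50 states minus 7 coefficients). p-values close to 0.05 should be read cautiously because of the rounding.

```r
# Estimates and standard errors copied (rounded) from the model2 column
est <- c("(Intercept)" = 14.806, College = 1.189, HouseholdIncome = -0.429,
         NonWhite = 0.525, RegionNE = 8.031, RegionS = -3.404, RegionW = 5.658)
se  <- c("(Intercept)" = 7.159, College = 0.257, HouseholdIncome = 0.171,
         NonWhite = 0.082, RegionNE = 2.721, RegionS = 2.742, RegionW = 2.772)

df_resid <- 50 - length(est)                 # 50 observations minus 7 coefficients
t_stat   <- est / se                         # t statistic for each coefficient
p_value  <- 2 * pt(-abs(t_stat), df_resid)   # two-sided p-value

print(round(cbind(t = t_stat, p = p_value), 3))
```

With the fitted object available, `summary(model2)` gives the same quantities from unrounded values and is the better source for any borderline case.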
Model 2 has an R² of 0.722 and an adjusted R² of 0.684, compared with 0.616 and 0.590 for Model 1, so it explains a larger share of the variance in ClintonVote even after the penalty for its three additional Region coefficients. Model 2 also has lower AIC (326.5 versus 336.8) and BIC (341.8 versus 346.3), suggesting a better balance between model fit and complexity.
Based on this comparison, Model 2 (with the Region variable) is preferred over Model 1: it has higher explanatory power and better fit on every criterion reported, including the complexity-penalized adjusted R², AIC, and BIC. Including Region helps explain variation in ClintonVote that the three numeric predictors leave unexplained.
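To make the comparison concrete, AIC and BIC can be rebuilt from the log-likelihoods in the table. The sketch below assumes the usual parameter count for `lm` objects (regression coefficients plus one residual variance), which gives 5 parameters for Model 1 and 8 for Model 2.

```r
n      <- 50
loglik <- c(model1 = -163.383, model2 = -155.243)   # Log.Lik. row of the table
k      <- c(model1 = 5, model2 = 8)                 # coefficients + residual variance

aic <- -2 * loglik + 2 * k        # penalty of 2 per parameter
bic <- -2 * loglik + log(n) * k   # penalty of log(n) per parameter

print(round(rbind(AIC = aic, BIC = bic), 1))
print(round(c(AIC_diff = unname(aic["model1"] - aic["model2"]),
              BIC_diff = unname(bic["model1"] - bic["model2"])), 1))
```

BIC penalizes each extra parameter more heavily than AIC when n = 50 (log(50) is about 3.9 versus 2), which is why the BIC gap between the models is smaller than the AIC gap.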
Task 3 Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?
model_interaction <- lm(ClintonVote ~ College + NonWhite + College:NonWhite, data = States)
plot(allEffects(model_interaction), multiline = TRUE)
The lines in the effect plot are not parallel, which suggests an interaction effect. In other words, the effect of College on ClintonVote changes with the percentage of NonWhite individuals.
Non-parallel lines alone do not establish statistical significance, so the College:NonWhite coefficient and its p-value in summary(model_interaction) should be checked before calling the interaction significant. Substantively, the plot indicates that the relationship between education and voting patterns depends on the percentage of non-white individuals in the population. Specifically, higher levels of education have a stronger positive association with ClintonVote in states with lower percentages of non-white individuals, and this association weakens as the percentage of non-white individuals increases.
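The idea that one predictor's slope depends on another can be sketched with simulated data. Everything below is hypothetical: `x`, `z`, and `y` are generated variables, not the States data, and the coefficients used to generate them are arbitrary. The point is only to show how the conditional slope and the interaction test are read off a fitted model.

```r
set.seed(1)  # simulated data for illustration only, not the States data
n <- 200
x <- runif(n, 20, 45)   # hypothetical focal predictor
z <- runif(n, 5, 60)    # hypothetical moderator
y <- 5 + 1.5 * x + 0.8 * z - 0.02 * x * z + rnorm(n, sd = 3)

fit <- lm(y ~ x * z)    # x * z expands to x + z + x:z
b   <- coef(fit)

# Slope of x at chosen moderator values: b_x + b_xz * z
z_values <- c(10, 30, 50)
slope_x  <- b[["x"]] + b[["x:z"]] * z_values
names(slope_x) <- paste0("z=", z_values)
print(round(slope_x, 2))

# The test of the interaction term itself, not the plot, establishes significance
print(summary(fit)$coefficients["x:z", ])
```

Applied to the model in this task, the same arithmetic would use the College and College:NonWhite coefficients from `coef(model_interaction)`.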
Task 4 Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.
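# Assumes Elect2016 is a factor with levels D and R, so as.numeric() gives 1 and 2;
# abs(x - 2) then recodes these to 1 (state voted for Clinton) and 0 (it did not)
States$Clinton <- abs(as.numeric(States$Elect2016) - 2)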
m.logit <- glm(Clinton ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))
summary(m.logit)
##
## Call:
## glm(formula = Clinton ~ College + HouseholdIncome + Region, family = binomial(link = "logit"),
## data = States)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.88304 4.57843 -3.032 0.00243 **
## College 0.25390 0.12980 1.956 0.05046 .
## HouseholdIncome 0.05599 0.06642 0.843 0.39922
## RegionNE 3.02242 1.41280 2.139 0.03241 *
## RegionS 0.31259 1.58896 0.197 0.84404
## RegionW 3.02137 1.43108 2.111 0.03475 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 32.014 on 44 degrees of freedom
## AIC: 44.014
##
## Number of Fisher Scoring iterations: 6
tab_model(m.logit)
| | Clinton | | |
|---|---|---|---|
| Predictors | Odds Ratios | CI | p |
| (Intercept) | 0.00 | 0.00 – 0.00 | 0.002 |
| College | 1.29 | 1.01 – 1.71 | 0.050 |
| HouseholdIncome | 1.06 | 0.93 – 1.22 | 0.399 |
| Region [NE] | 20.54 | 1.70 – 599.19 | 0.032 |
| Region [S] | 1.37 | 0.04 – 32.70 | 0.844 |
| Region [W] | 20.52 | 1.55 – 500.93 | 0.035 |
| Observations | 50 | | |
| R² Tjur | 0.588 | | |
odds_ratios <- exp(coef(m.logit))
knitr::kable(odds_ratios, col.names = "Odds Ratio", digits = 4)
| | Odds Ratio |
|---|---|
| (Intercept) | 0.0000 |
| College | 1.2890 |
| HouseholdIncome | 1.0576 |
| RegionNE | 20.5410 |
| RegionS | 1.3670 |
| RegionW | 20.5195 |
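The odds ratio for the intercept is effectively 0 (exp(-13.88) is roughly one in a million). It represents the baseline odds that a state voted for Clinton when College and HouseholdIncome are zero and Region is at its reference level. Since this scenario isn’t realistic (e.g., the percentage of college graduates can’t be zero), the intercept itself isn’t practically interpretable but serves as a baseline for the model.
The odds ratio for College is 1.2890. This means that for each 1 percentage point increase in the percentage of college graduates, the odds that a state voted for Clinton increase by 28.9%, holding all other variables constant. With p = 0.050 in the table (0.0505 in the summary output), this effect sits right at the conventional 5% threshold and is best described as borderline significant.
The odds ratio for HouseholdIncome is 1.0576. This indicates that for each 1 unit increase in household income (presumably in thousands of dollars), the odds that a state voted for Clinton increase by 5.76%, holding all other variables constant. This effect is not statistically significant (p = 0.399), and its confidence interval of 0.93 to 1.22 includes 1.
RegionNE: The odds ratio is 20.5410 (p = 0.032), indicating that the odds of voting for Clinton are about 20.5 times as high (1954% higher) for a state in the Northeast as for a state in the baseline region, holding the other predictors constant. The confidence interval of 1.70 to 599.19 is extremely wide, so the size of this effect is very imprecisely estimated from 50 states.
RegionS: The odds ratio is 1.3670, nominally a 36.7% increase in the odds, but this predictor is not statistically significant (p = 0.844).
RegionW: The odds ratio is 20.5195 (p = 0.035), indicating that the odds of voting for Clinton are about 20.5 times as high (1952% higher) for a state in the West as for a state in the baseline region. Its confidence interval of 1.55 to 500.93 is similarly wide.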
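The percentages quoted above follow directly from the log-odds coefficients: exponentiating a coefficient gives the odds ratio, and subtracting 1 gives the proportional change in the odds. A minimal sketch, using the estimates copied from the `summary(m.logit)` output:

```r
# Log-odds coefficients copied from the summary(m.logit) output
b <- c(College = 0.25390, HouseholdIncome = 0.05599,
       RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

odds_ratio <- exp(b)                   # multiplicative change in the odds
pct_change <- (odds_ratio - 1) * 100   # same change expressed as a percentage

print(round(cbind(odds_ratio, pct_change), 2))
```

For large odds ratios such as those for Region, "about 20 times the odds" is usually easier to read than a percentage increase in the thousands.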
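The table reports Tjur's R² of 0.588. Unlike R² in linear regression, this is not a share of variance explained: it is the difference between the mean predicted probability for states that voted for Clinton and the mean predicted probability for states that did not. A value of 0.588 indicates that the model separates the two groups of states well.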
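Model fit can also be assessed from the deviances in the `summary(m.logit)` output. The sketch below runs a likelihood-ratio test of the fitted model against the intercept-only model and computes McFadden's pseudo-R² as one common alternative to Tjur's measure. It uses only the printed deviance values, so it is a rough check rather than a substitute for the fitted object.

```r
# Deviances and degrees of freedom copied from the summary(m.logit) output
null_dev  <- 67.301; null_df  <- 49
resid_dev <- 32.014; resid_df <- 44

lr_stat <- null_dev - resid_dev   # likelihood-ratio chi-square statistic
lr_df   <- null_df - resid_df     # number of predictors added to the null model
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R-squared: 1 - residual deviance / null deviance
# (valid here because the outcome is binary, so deviance equals -2 * log-likelihood)
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(LR = lr_stat, df = lr_df, p = p_value, McFadden = mcfadden_r2))
```

A small p-value here means the five coefficients jointly improve on the intercept-only model. It says nothing about which individual predictors matter.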
Task 5 Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.
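# Predicted probabilities across the range of College, one line per Region
eff <- Effect(c("College", "Region"), m.logit)
plot(eff, type = "response", ylim = c(0, 1), multiline = TRUE)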
plot_model(m.logit)
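A predicted probability from this model is the inverse logit of the linear predictor. The sketch below shows the arithmetic for one hypothetical profile per region, using the coefficients copied from the `summary(m.logit)` output. The College and HouseholdIncome values are chosen for illustration only and are not taken from the data, and the label MW for the reference region is an assumption based on the omitted Region level.

```r
# Coefficients copied from the summary(m.logit) output
b <- c("(Intercept)" = -13.88304, College = 0.25390, HouseholdIncome = 0.05599,
       RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

# Hypothetical profile: illustrative values, not taken from the data
college <- 35
income  <- 60

# Shift in log-odds for each region; the reference region (assumed MW) has shift 0
region_shift <- c(MW = 0, NE = b[["RegionNE"]], S = b[["RegionS"]], W = b[["RegionW"]])

# Linear predictor on the log-odds scale, then inverse logit
eta  <- b[["(Intercept)"]] + b[["College"]] * college +
        b[["HouseholdIncome"]] * income + region_shift
prob <- plogis(eta)

print(round(prob, 3))
```

With the fitted object available, `predict(m.logit, newdata = ..., type = "response")` performs the same calculation from unrounded coefficients.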