First, let’s install and load the necessary packages. This step ensures that we have all the tools we need for our analysis.
Due: July 24, 2024 by lecture time
Submit both the .Rmd (R Markdown) file and the published RPubs version (with the link provided). Copy and paste each task as text, then provide the relevant code and a written answer where prompted.
Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.
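library(Lock5Data)  # USStates ships with the Lock5Data package
library(sjPlot)     # tab_model() and plot_model() used below
data(USStates)
States <- USStates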
model1 <- lm(ClintonVote ~College + HouseholdIncome + NonWhite, data=States)
summary(model1)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite,
## data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.697 -3.605 -1.036 4.495 13.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99082 6.01413 0.165 0.870
## College 1.12498 0.22514 4.997 8.88e-06 ***
## HouseholdIncome -0.07554 0.15345 -0.492 0.625
## NonWhite 0.43123 0.08246 5.230 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.5904
## F-statistic: 24.55 on 3 and 46 DF, p-value: 1.237e-09
Interpretation:
1. Intercept: The intercept equals 0.99, which is the predicted percentage of votes for Clinton when all independent variables are zero. This is not a realistic scenario, but it serves as a baseline.
2. College: The slope (coefficient) equals 1.1250, meaning the Clinton vote share rises by about 1.12 percentage points for every one-percentage-point increase in College, holding all other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, and 0.001 < 0.05.
3. Household income: The slope (coefficient) equals -0.0755, meaning the Clinton vote share falls by about 0.08 percentage points for every one-unit increase in HouseholdIncome, holding all other variables constant. This effect is not statistically significant at the 5% significance level because the p-value is 0.625, and 0.625 > 0.05.
4. Non-white: The slope (coefficient) equals 0.4312, meaning the Clinton vote share rises by about 0.43 percentage points for every one-percentage-point increase in NonWhite, holding all other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, and 0.001 < 0.05.
5. The R-squared equals 0.616, meaning that 61.6% of the variation in the Clinton vote share is explained by this model.
6. The adjusted R-squared is 0.590, meaning that about 59% of the variation in the Clinton vote share is explained after adjusting for the number of independent variables in the model.
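The significance statements above can be checked by hand. The following is a minimal sketch in base R that recomputes the t statistics and two-sided p-values from the estimates and standard errors printed by `summary(model1)`; the vector names (`est`, `se`, `alpha`) are chosen here for illustration only.

```r
# Estimates and standard errors copied from the summary(model1) output
est <- c(College = 1.12498, HouseholdIncome = -0.07554, NonWhite = 0.43123)
se  <- c(College = 0.22514, HouseholdIncome = 0.15345, NonWhite = 0.08246)
df_resid <- 46  # residual degrees of freedom reported by summary(model1)

# t statistic = estimate / standard error
t_value <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Compare each p-value with the 5% significance level (0.05, not 0.5)
alpha <- 0.05
data.frame(t_value = round(t_value, 3),
           p_value = signif(p_value, 3),
           significant = p_value < alpha)
```

Only College and NonWhite fall below the 0.05 threshold, which agrees with the stars in the regression output.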
tab_model(model1, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
Dependent variable: ClintonVote

| Predictors | Estimates | Std. Error | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 0.99 | 6.01 | 0.16 | 0.870 |
| College | 1.12 | 0.23 | 5.00 | <0.001 |
| HouseholdIncome | -0.08 | 0.15 | -0.49 | 0.625 |
| NonWhite | 0.43 | 0.08 | 5.23 | <0.001 |
| Observations | 50 | | | |
| R² / R² adjusted | 0.616 / 0.590 | | | |
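The adjusted R-squared in the table follows from the ordinary R-squared, the sample size, and the number of predictors. Here is a short sketch of that arithmetic in base R, using the values reported for model 1; the variable names are illustrative.

```r
# Values taken from the model 1 output
r2 <- 0.6155  # Multiple R-squared
n  <- 50      # observations
k  <- 3       # predictors: College, HouseholdIncome, NonWhite

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
```

The result agrees with the reported adjusted R-squared up to rounding of the inputs.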
Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.
model2 <- lm(ClintonVote ~College + HouseholdIncome + NonWhite + Region, data=States)
summary(model2)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite +
## Region, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3217 -4.1695 0.0435 4.2880 11.2477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.80589 7.15877 2.068 0.0447 *
## College 1.18862 0.25668 4.631 3.36e-05 ***
## HouseholdIncome -0.42863 0.17107 -2.506 0.0161 *
## NonWhite 0.52461 0.08175 6.417 9.03e-08 ***
## RegionNE 8.03052 2.72062 2.952 0.0051 **
## RegionS -3.40381 2.74204 -1.241 0.2212
## RegionW 5.65798 2.77249 2.041 0.0474 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.82 on 43 degrees of freedom
## Multiple R-squared: 0.7224, Adjusted R-squared: 0.6836
## F-statistic: 18.65 on 6 and 43 DF, p-value: 1.545e-10
AIC(model1, model2)
## df AIC
## model1 5 336.7667
## model2 8 326.4857
BIC(model1, model2)
## df BIC
## model1 5 346.3268
## model2 8 341.7819
Interpretation (model 2):
1. Intercept: The intercept equals 14.81, which is the predicted percentage of votes for Clinton in a Midwest state when all other variables equal zero. This is not realistic, but it serves as a baseline.
2. College: The slope (coefficient) equals 1.1886, meaning the Clinton vote share rises by about 1.19 percentage points for every one-percentage-point increase in College, holding all other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, and 0.001 < 0.05.
3. Household income: The slope (coefficient) equals -0.4286, meaning the Clinton vote share falls by about 0.43 percentage points for every one-unit increase in HouseholdIncome, holding all other variables constant. Unlike in model 1, this effect is statistically significant at the 5% significance level because the p-value is 0.016, and 0.016 < 0.05.
4. Non-white: The slope (coefficient) equals 0.5246, meaning the Clinton vote share rises by about 0.52 percentage points for every one-percentage-point increase in NonWhite, holding all other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, and 0.001 < 0.05.
5. Region NE: The coefficient equals 8.0305, meaning the average Clinton vote share in Northeastern states is 8.03 percentage points higher than in Midwest states, holding all other variables constant (p = 0.005, significant at the 5% level).
6. Region South: The coefficient equals -3.4038, meaning the average Clinton vote share in Southern states is 3.40 percentage points lower than in Midwest states, holding all other variables constant. This difference is not statistically significant (p = 0.221).
7. Region West: The coefficient equals 5.6580, meaning the average Clinton vote share in Western states is 5.66 percentage points higher than in Midwest states, holding all other variables constant (p = 0.047, significant at the 5% level).
8. The R-squared equals 0.7224, meaning that 72.24% of the variation in the Clinton vote share is explained by this model.
9. The adjusted R-squared is 0.6836, meaning that 68.36% of the variation in the Clinton vote share is explained after adjusting for the number of independent variables in the model.
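The region coefficients are read relative to the Midwest because R turns a factor into dummy columns and drops the first level as the reference. The sketch below illustrates this on a toy data frame; the level label `"MW"` is an assumption (the regression output only shows NE, S, and W), and `toy` is a made-up object, not the States data.

```r
# Toy data: one row per region. "MW" is an assumed label for the Midwest.
toy <- data.frame(Region = factor(c("MW", "NE", "S", "W")))

# The first level (alphabetical by default) becomes the reference category
levels(toy$Region)[1]

# Design matrix that lm() would build for a Region term
X <- model.matrix(~ Region, data = toy)
X
```

The Midwest row has zeros in every dummy column, so its predicted value comes from the intercept alone, and each Region coefficient is a difference from that baseline.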
Because both the AIC and the BIC are lower for model 2 (AIC 326.49 vs. 336.77; BIC 341.78 vs. 346.33), model 2 is preferred over model 1.
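AIC and BIC share the same log-likelihood term and differ only in the penalty per parameter (2 for AIC, log(n) for BIC). As a rough check under that definition, the BIC values can be rebuilt from the reported AIC values in base R; the object names are illustrative.

```r
n   <- 50                                       # observations
aic <- c(model1 = 336.7667, model2 = 326.4857)  # from AIC(model1, model2)
k   <- c(model1 = 5, model2 = 8)                # df column: coefficients + residual variance

# AIC = -2*logLik + 2*k, so recover -2*logLik first
neg2loglik <- aic - 2 * k

# BIC = -2*logLik + k*log(n)
bic <- neg2loglik + k * log(n)
round(bic, 4)

# Differences between the models (positive values favor model 2)
aic["model1"] - aic["model2"]
bic["model1"] - bic["model2"]
```

BIC penalizes the three extra region parameters more heavily than AIC, which is why the gap between the models is smaller on BIC, yet model 2 still comes out ahead on both.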
Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?
model_interact <- lm(ClintonVote ~ College + NonWhite + NonWhite:College, data=States)
summary(model_interact)
##
## Call:
## lm(formula = ClintonVote ~ College + NonWhite + NonWhite:College,
## data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.446 -3.543 -0.295 3.936 13.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22574 13.26349 -1.148 0.256927
## College 1.50058 0.39913 3.760 0.000479 ***
## NonWhite 0.99725 0.47769 2.088 0.042397 *
## College:NonWhite -0.01802 0.01457 -1.236 0.222664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared: 0.6259, Adjusted R-squared: 0.6015
## F-statistic: 25.66 on 3 and 46 DF, p-value: 6.636e-10
Interpretation:
1. Intercept: The intercept equals -15.2257, which is the predicted percentage of votes for Clinton when College and NonWhite both equal zero. This is not realistic, but it serves as a baseline.
2. College: The coefficient equals 1.5006. Because the model includes an interaction, this is the change in the Clinton vote share per one-percentage-point increase in College when NonWhite equals zero (p < 0.001).
3. Non-white: The coefficient equals 0.9973. This is the change in the Clinton vote share per one-percentage-point increase in NonWhite when College equals zero (p = 0.042).
4. College:NonWhite: The coefficient equals -0.0180, meaning the effect of College on the Clinton vote share shrinks by about 0.018 percentage points for each one-percentage-point increase in NonWhite (and vice versa). This interaction is not statistically significant at the 5% level (p = 0.223), so there is no strong evidence that the relationship between education and the Clinton vote differs by the share of non-white residents.
5. The R-squared equals 0.6259, indicating that 62.59% of the variation in the Clinton vote share is explained by this model.
6. The adjusted R-squared equals 0.6015, indicating that 60.15% of the variation in the Clinton vote share is explained after adjusting for the number of independent variables.
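With an interaction in the model, the slope for College is no longer a single number; it depends on NonWhite. The sketch below computes that conditional slope from the printed coefficients. The function name `college_slope` and the three NonWhite levels are hypothetical choices made for illustration.

```r
# Coefficients copied from summary(model_interact)
b_college     <- 1.50058
b_interaction <- -0.01802

# Slope of College at a given NonWhite percentage:
# d(ClintonVote)/d(College) = b_college + b_interaction * NonWhite
college_slope <- function(nonwhite) b_college + b_interaction * nonwhite

# Hypothetical NonWhite percentages chosen for illustration
nonwhite_levels <- c(10, 25, 40)
data.frame(NonWhite = nonwhite_levels,
           college_slope = college_slope(nonwhite_levels))
```

The estimated College slope gets flatter as NonWhite rises, which is the pattern the effects plot draws as non-parallel lines. Since the interaction term is not significant, this flattening should be read with caution.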
library(effects)
plot(allEffects(model_interact), multiline = TRUE)
Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.
States$Clinton <- abs(as.numeric(States$Elect2016) - 2)
table(States$Clinton) # Clinton win = 1, Trump win = 0
##
## 0 1
## 30 20
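The recode above relies on how R stores factors as integer codes. Here is a small illustration on a toy vector, assuming the winner is stored as a factor with levels `"D"` and `"R"` (an assumption about the labels; `elect` is made-up data).

```r
# Toy factor standing in for States$Elect2016 (labels "D"/"R" are assumed)
elect <- factor(c("D", "R", "R", "D", "R"))

# Internal integer codes: D = 1, R = 2 (levels sort alphabetically)
as.numeric(elect)

# Same arithmetic as in the assignment: D -> |1 - 2| = 1, R -> |2 - 2| = 0
clinton_a <- abs(as.numeric(elect) - 2)

# An equivalent, more explicit recode
clinton_b <- as.integer(elect == "D")

cbind(clinton_a, clinton_b)
```

Both versions give the same 0/1 variable; the explicit comparison has the advantage of not depending on the order of the factor levels.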
model_logit <- glm(Clinton ~ College+HouseholdIncome+Region, data=States, family=binomial(link="logit"))
summary(model_logit)
##
## Call:
## glm(formula = Clinton ~ College + HouseholdIncome + Region, family = binomial(link = "logit"),
## data = States)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.88304 4.57843 -3.032 0.00243 **
## College 0.25390 0.12980 1.956 0.05046 .
## HouseholdIncome 0.05599 0.06642 0.843 0.39922
## RegionNE 3.02242 1.41280 2.139 0.03241 *
## RegionS 0.31259 1.58896 0.197 0.84404
## RegionW 3.02137 1.43108 2.111 0.03475 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67.301 on 49 degrees of freedom
## Residual deviance: 32.014 on 44 degrees of freedom
## AIC: 44.014
##
## Number of Fisher Scoring iterations: 6
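The deviance lines in this output can also be used to assess fit. As a sketch in base R, the AIC can be rebuilt from the residual deviance, and the drop from the null to the residual deviance gives a likelihood-ratio test of the model against an intercept-only model; the object names are illustrative.

```r
# Values copied from summary(model_logit)
null_dev  <- 67.301  # intercept-only model, 49 df
resid_dev <- 32.014  # fitted model, 44 df
k <- 6               # estimated coefficients, including the intercept

# For a 0/1 outcome the residual deviance equals -2*logLik, so AIC = deviance + 2k
aic <- resid_dev + 2 * k
aic

# Likelihood-ratio test: does the model improve on the intercept-only model?
lr_stat <- null_dev - resid_dev
lr_df   <- 49 - 44
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p)
```

A small p-value here indicates that the predictors, taken together, fit the data better than the intercept alone.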
tab_model(model_logit, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)
Dependent variable: Clinton

| Predictors | Odds Ratios | Std. Error | Statistic | p |
|---|---|---|---|---|
| (Intercept) | 0.00 | 0.00 | -3.03 | 0.002 |
| College | 1.29 | 0.17 | 1.96 | 0.050 |
| HouseholdIncome | 1.06 | 0.07 | 0.84 | 0.399 |
| Region [NE] | 20.54 | 29.02 | 2.14 | 0.032 |
| Region [S] | 1.37 | 2.17 | 0.20 | 0.844 |
| Region [W] | 20.52 | 29.37 | 2.11 | 0.035 |
| Observations | 50 | | | |
| R² Tjur | 0.588 | | | |
exp(coef(model_logit))
## (Intercept) College HouseholdIncome RegionNE RegionS
## 9.346977e-07 1.289038e+00 1.057591e+00 2.054104e+01 1.366961e+00
## RegionW
## 2.051945e+01
Interpretation:
1. Intercept: The exponentiated intercept is 9.35e-07 (about 0.0000009), which is the odds of a Clinton win for a Midwest state when College and HouseholdIncome are zero. This is not realistic, but it serves as a baseline.
2. College: The odds ratio is 1.289, meaning the odds of a Clinton win are multiplied by 1.289 (a 28.9% increase) for every one-percentage-point increase in College, holding all other variables constant. The p-value is 0.0505, just above 0.05, so the effect narrowly misses significance at the 5% level.
3. Household income: The odds ratio is 1.058, meaning the odds of a Clinton win increase by about 5.8% for every one-unit increase in HouseholdIncome, holding all other variables constant. This effect is not statistically significant (p = 0.399).
4. Region NE: The odds ratio is 20.54 (R prints it as 2.054104e+01), meaning the odds of a Clinton win in Northeastern states are about 20.5 times the odds in Midwest states, holding all other variables constant (p = 0.032).
5. Region S: The odds ratio is 1.367, meaning the odds of a Clinton win in Southern states are about 36.7% higher than in Midwest states, holding all other variables constant. This difference is not statistically significant (p = 0.844).
6. Region W: The odds ratio is 20.52 (printed as 2.051945e+01), meaning the odds of a Clinton win in Western states are about 20.5 times the odds in Midwest states, holding all other variables constant (p = 0.035).
7. Tjur's R-squared is 0.588. It is the difference between the average predicted probability of a Clinton win in states Clinton won and in states she lost, so the model separates the two groups well. Unlike the R-squared of a linear model, it is not a share of explained variance.
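The odds ratios above are simply the exponentiated logit coefficients, and the percentage change in the odds is (odds ratio - 1) x 100. The following base R sketch redoes that conversion from the printed coefficients; the object names are illustrative.

```r
# Logit coefficients copied from summary(model_logit)
b <- c(College = 0.25390, HouseholdIncome = 0.05599,
       RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

# Odds ratio = exp(coefficient)
odds_ratio <- exp(b)

# Percentage change in the odds for a one-unit increase (or vs. the Midwest)
pct_change <- (odds_ratio - 1) * 100

round(data.frame(odds_ratio, pct_change), 2)

# Scientific notation such as 2.054104e+01 means 2.054104 * 10^1
format(2.054104e+01, scientific = FALSE)
```

Reading the exponent correctly matters here: the Northeast and West odds ratios are about 20.5, not 2.05.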
Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.
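plot(allEffects(model_logit), type="response")
eff <- Effect("Region", model_logit)
plot(eff, type = "response", ylim=c(0,1))
# Odds ratios for all predictors (sjPlot exponentiates logit coefficients by default)
plot_model(model_logit, show.values = TRUE)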
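The predicted probabilities in these plots come from pushing the linear predictor through the inverse logit. As a sketch, the calculation for a single state can be done by hand in base R; the College and HouseholdIncome values below are hypothetical, chosen only to illustrate the arithmetic.

```r
# Coefficients copied from summary(model_logit)
b <- c(intercept = -13.88304, College = 0.25390,
       HouseholdIncome = 0.05599, RegionNE = 3.02242)

# Hypothetical state profile (not taken from the data)
college <- 35
income  <- 60

# Linear predictor (log-odds) for a Midwest state, the reference region
lp_mw <- unname(b["intercept"] + b["College"] * college +
                b["HouseholdIncome"] * income)

# Same profile in the Northeast: add the RegionNE coefficient
lp_ne <- lp_mw + unname(b["RegionNE"])

# Inverse logit turns log-odds into probabilities
p_mw <- plogis(lp_mw)
p_ne <- plogis(lp_ne)
round(c(Midwest = p_mw, Northeast = p_ne), 3)
```

For the same College and HouseholdIncome values, the Northeast state has a much higher predicted probability of a Clinton win, which is the vertical gap between regions in the effects plot.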