Problemset4

Load Data

library(sjPlot); library(modelsummary)  # tab_model()/plot_model() and modelsummary() used below
load("/Users/Gurleen/Downloads/States.RData")

Task 1

m1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)

tab_model(m1, show.ci = FALSE, show.se = TRUE, show.stat=TRUE)

	ClintonVote
Predictors	Estimates	std. Error	Statistic	p
(Intercept)	0.99	6.01	0.16	0.870
College	1.12	0.23	5.00	<0.001
HouseholdIncome	-0.08	0.15	-0.49	0.625
NonWhite	0.43	0.08	5.23	<0.001
Observations	50
R² / R² adjusted	0.616 / 0.590

Interpretation

The intercept is 0.99: the expected value of ClintonVote when College, HouseholdIncome, and NonWhite are all zero. No state has zero values on these predictors, so the intercept has no substantive meaning here, and its high p-value (0.870) shows that it is not statistically distinguishable from zero.

The coefficient for College is 1.12: each additional percentage point of college graduates is associated with a 1.12 percentage point increase in ClintonVote, holding the other variables constant. The p-value is below 0.001, so this predictor is statistically significant.

The coefficient for HouseholdIncome is -0.08: each one-unit increase in household income (measured in thousands of dollars) is associated with a 0.08 percentage point decrease in ClintonVote, holding the other variables constant. With a p-value of 0.625, this predictor is not statistically significant, so the data give no evidence of an income effect once College and NonWhite are controlled for.

The coefficient for NonWhite is 0.43: each additional percentage point of non-white population is associated with a 0.43 percentage point increase in ClintonVote, holding the other variables constant. The p-value is below 0.001, so this predictor is statistically significant.

Model Fit

The R-squared value of 0.616 indicates that approximately 61.6% of the variance in ClintonVote is explained by College, HouseholdIncome, and NonWhite together, which is a reasonably good fit for state-level data. The adjusted R-squared of 0.590 penalises the model for its number of predictors; it is only slightly lower than the unadjusted value, so the fit is not being inflated by superfluous terms.

The tab_model output does not report the F-statistic, so overall significance cannot be read directly from the table. The highly significant coefficients for College and NonWhite, together with the R-squared of 0.616, nevertheless indicate that the model as a whole is informative about the state-level factors associated with ClintonVote.
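Although the table omits the overall F-statistic, it can be approximated from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses the rounded values from the Task 1 table, so its results are approximate; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the Task 1 table (rounded, so results are approximate)
r2 <- 0.616   # R-squared
n  <- 50      # observations
k  <- 3       # predictors: College, HouseholdIncome, NonWhite

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic implied by R-squared, with k and n - k - 1 degrees of freedom
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, F = f_stat), 3)
p_val
```

With the fitted object available, summary(m1)$fstatistic returns the exact value rather than this approximation.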

Task 2

m2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)

Compare

AIC(m1, m2)

##    df      AIC
## m1  5 336.7667
## m2  8 326.4857

BIC(m1, m2)

##    df      BIC
## m1  5 346.3268
## m2  8 341.7819

aic_m1 <- AIC(m1)
aic_m2 <- AIC(m2)
bic_m1 <- BIC(m1)
bic_m2 <- BIC(m2)

model_list <- list("Model 1" = m1, "Model 2" = m2)
modelsummary(model_list, statistic = "std.error")

	Model 1	Model 2
(Intercept)	0.991	14.806
	(6.014)	(7.159)
College	1.125	1.189
	(0.225)	(0.257)
HouseholdIncome	-0.076	-0.429
	(0.153)	(0.171)
NonWhite	0.431	0.525
	(0.082)	(0.082)
RegionNE		8.031
		(2.721)
RegionS		-3.404
		(2.742)
RegionW		5.658
		(2.772)
Num.Obs.	50	50
R2	0.616	0.722
R2 Adj.	0.590	0.684
AIC	336.8	326.5
BIC	346.3	341.8
Log.Lik.	-163.383	-155.243
RMSE	6.35	5.40

Interpretation and Model choice

AIC Interpretation: The AIC for Model 2 (326.5) is lower than the AIC for Model 1 (336.8), a difference of about 10.3. AIC balances goodness of fit against model complexity, and lower values are better, so Model 2 is preferred over Model 1 based on AIC.

BIC Interpretation: The BIC for Model 2 (341.8) is also lower than the BIC for Model 1 (346.3), a difference of about 4.5. BIC penalises additional parameters more heavily than AIC at this sample size and therefore favours simpler models. Model 2 still has the lower BIC despite its three extra Region parameters, so it is also preferred based on BIC.

Overall Model Preference: Model 2 is preferred over Model 1. Both criteria are lower for Model 2, indicating that including Region improves the fit by more than enough to justify the added complexity. The other fit statistics agree: adjusted R² rises from 0.590 to 0.684 and RMSE falls from 6.35 to 5.40.
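To make the penalty terms concrete, the AIC and BIC values can be rebuilt from the log-likelihoods and parameter counts in the output above. This is a minimal sketch using only the reported numbers; the data frame `fits` is an illustrative name, and `k` is the `df` column from AIC(m1, m2), which counts the regression coefficients plus the residual variance.

```r
n <- 50  # number of states

# Log-likelihoods and parameter counts copied from the model summary output
fits <- data.frame(
  model  = c("m1", "m2"),
  logLik = c(-163.383, -155.243),
  k      = c(5, 8)
)

# AIC charges 2 per parameter; BIC charges log(n) per parameter
fits$AIC <- 2 * fits$k - 2 * fits$logLik
fits$BIC <- log(n) * fits$k - 2 * fits$logLik

fits
c(AIC_diff = diff(fits$AIC), BIC_diff = diff(fits$BIC))  # m2 minus m1
```

Because log(50) is about 3.9, each extra parameter costs nearly twice as much under BIC as under AIC, which is why the BIC gap between the models is smaller than the AIC gap.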

Task 3

m3_interact <- lm(ClintonVote ~ College + NonWhite + College:NonWhite, data = States)
summary(m3_interact)

## 
## Call:
## lm(formula = ClintonVote ~ College + NonWhite + College:NonWhite, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.446  -3.543  -0.295   3.936  13.831 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -15.22574   13.26349  -1.148 0.256927    
## College            1.50058    0.39913   3.760 0.000479 ***
## NonWhite           0.99725    0.47769   2.088 0.042397 *  
## College:NonWhite  -0.01802    0.01457  -1.236 0.222664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared:  0.6259, Adjusted R-squared:  0.6015 
## F-statistic: 25.66 on 3 and 46 DF,  p-value: 6.636e-10

library(effects)
interaction_plot <- effect("College*NonWhite", m3_interact, xlevels=list(NonWhite=seq(0, 90, 10)))
plot(interaction_plot, main="Interaction Effect of College and NonWhite on ClintonVote", 
     xlab="% College", ylab="% Clinton Vote")

plot(allEffects(m3_interact), multiline = TRUE)

Interpretation

The fitted lines in the interaction plot are not parallel, which reflects the College:NonWhite term in the model: the slope of College is allowed to change with the level of NonWhite. The estimated interaction coefficient is negative (-0.018), so the positive association between College and ClintonVote weakens as the non-white share of the population rises. For example, the implied College slope is about 1.32 where NonWhite is 10% (1.50 - 0.018 × 10) and about 0.60 where NonWhite is 50%.

However, the interaction term is not statistically significant (p = 0.223). The non-parallel lines therefore show the direction of the estimated interaction, but with only 50 states the data do not provide strong evidence that the effect of education on voting for Clinton truly depends on the racial composition of the state. In practical terms, the pattern is consistent with demographic context moderating the impact of education on voting behaviour, but the result should be treated as suggestive rather than established.
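In a model with a College:NonWhite term, the slope of College at a given level of NonWhite is the College coefficient plus the interaction coefficient times NonWhite. The sketch below evaluates that expression from the coefficients printed in the summary output; the object names are illustrative, and the NonWhite grid simply mirrors the xlevels used in the effect() call.

```r
# Coefficients copied from summary(m3_interact)
b_college     <- 1.50058
b_interaction <- -0.01802

# Same grid of NonWhite values as in the effect() call
nonwhite <- seq(0, 90, by = 10)

# Conditional slope of College at each level of NonWhite
slope <- b_college + b_interaction * nonwhite

data.frame(NonWhite = nonwhite, college_slope = round(slope, 3))
```

These are point estimates only; since the interaction term has p = 0.223, the differences between the slopes are not estimated precisely.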

Task 4

States$ClintonVote <- abs(as.numeric(States$Elect2016) - 2)
table(States$ClintonVote)

## 
##  0  1 
## 30 20

m.logit <- glm(ClintonVote ~ College + HouseholdIncome + Region, data=States, family=binomial(link="logit"))
summary(m.logit)

## 
## Call:
## glm(formula = ClintonVote ~ College + HouseholdIncome + Region, 
##     family = binomial(link = "logit"), data = States)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -13.88304    4.57843  -3.032  0.00243 **
## College           0.25390    0.12980   1.956  0.05046 . 
## HouseholdIncome   0.05599    0.06642   0.843  0.39922   
## RegionNE          3.02242    1.41280   2.139  0.03241 * 
## RegionS           0.31259    1.58896   0.197  0.84404   
## RegionW           3.02137    1.43108   2.111  0.03475 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 32.014  on 44  degrees of freedom
## AIC: 44.014
## 
## Number of Fisher Scoring iterations: 6

coef(m.logit)

##     (Intercept)         College HouseholdIncome        RegionNE         RegionS 
##    -13.88304271      0.25389586      0.05599371      3.02242479      0.31259020 
##         RegionW 
##      3.02137322

exp(coef(m.logit))

##     (Intercept)         College HouseholdIncome        RegionNE         RegionS 
##    9.346977e-07    1.289038e+00    1.057591e+00    2.054104e+01    1.366961e+00 
##         RegionW 
##    2.051945e+01

plot(allEffects(m.logit), type="response")

Regression tables

modelsummary(m.logit, exponentiate = TRUE, stars = TRUE)

	(1)
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
(Intercept)	0.000**
	(0.000)
College	1.289+
	(0.167)
HouseholdIncome	1.058
	(0.070)
RegionNE	20.541*
	(29.020)
RegionS	1.367
	(2.172)
RegionW	20.519*
	(29.365)
Num.Obs.	50
AIC	44.0
BIC	55.5
Log.Lik.	-16.007
RMSE	0.32

tab_model(m.logit, show.ci = FALSE, show.se = TRUE, p.style = "numeric")

	ClintonVote
Predictors	Odds Ratios	std. Error	p
(Intercept)	0.00	0.00	0.002
College	1.29	0.17	0.050
HouseholdIncome	1.06	0.07	0.399
Region [NE]	20.54	29.02	0.032
Region [S]	1.37	2.17	0.844
Region [W]	20.52	29.37	0.035
Observations	50
R² Tjur	0.588

Interpretation

Intercept: The odds ratio of $9.35 \times 10^{-7}$ gives the odds of a Clinton win for a state in the reference region with College and HouseholdIncome both equal to zero. No state has such values, so this figure only anchors the model and has no substantive interpretation.

College: For each 1 percentage point increase in college graduates, the odds of a Clinton win are multiplied by 1.289, an increase of about 28.9%, holding household income and region constant. The estimate sits on the borderline of conventional significance (p = 0.050), so it suggests, but does not firmly establish, that states with more college graduates had higher odds of voting for Clinton.

Household Income: For each 1 unit increase in household income (in thousands of dollars), the odds of a Clinton win are multiplied by 1.058, an increase of about 5.8%, holding the percentage of college graduates and region constant. This estimate is not statistically significant (p = 0.399), so the model gives no clear evidence of an income effect.

Region (Northeast): The odds ratio of 20.54 means that the odds of a Clinton win in Northeastern states are about 20 times the odds in the reference region, holding the other predictors constant. The reference region is the category omitted from the output; because NE, S, and W each have their own coefficient, it is presumably the Midwest. The effect is statistically significant (p = 0.032), but the very large standard error (29.02) shows that its size is estimated imprecisely.

Region (South): The odds ratio of 1.37 indicates odds about 37% higher than in the reference region, but the estimate is far from significant (p = 0.844), so there is no evidence that the South differs from the reference region once College and HouseholdIncome are controlled for.

Region (West): The odds ratio of 20.52 means that the odds of a Clinton win in Western states are about 20 times the odds in the reference region. As with the Northeast, the effect is statistically significant (p = 0.035) but imprecisely estimated (standard error 29.37).

Note that these are ratios of odds, not of probabilities, so "20 times the odds" should not be read as "20 times more likely".
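The odds ratios and their standard errors in the regression tables can be traced back to the logit coefficients. The following sketch exponentiates the estimates from summary(m.logit), applies the delta-method approximation (standard error of the odds ratio ≈ odds ratio × standard error of the coefficient), and adds approximate 95% Wald intervals. The intervals do not appear in the original output and are shown here only as an illustration of how imprecise the regional estimates are.

```r
# Estimates and standard errors copied from summary(m.logit)
est <- c(College = 0.25390, HouseholdIncome = 0.05599,
         RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)
se  <- c(College = 0.12980, HouseholdIncome = 0.06642,
         RegionNE = 1.41280, RegionS = 1.58896, RegionW = 1.43108)

or    <- exp(est)          # odds ratios
pct   <- 100 * (or - 1)    # percentage change in the odds per unit increase
se_or <- or * se           # delta-method SE on the odds-ratio scale

# Approximate 95% Wald interval, computed on the log-odds scale and then exponentiated
z       <- qnorm(0.975)
ci_low  <- exp(est - z * se)
ci_high <- exp(est + z * se)

round(cbind(OR = or, pct_change = pct, SE = se_or, ci_low, ci_high), 2)
```

The intervals for RegionNE and RegionW span several orders of magnitude, which is why their odds-ratio standard errors exceed the estimates themselves.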
Model Fit

AIC: AIC compares the fit of different models estimated on the same outcome, with lower values indicating a better trade-off between fit and complexity. The AIC of 44.0 has no meaning on its own; it becomes informative only when set against the AIC of an alternative specification.

BIC: BIC works in the same way but penalises additional parameters more heavily. The BIC of 55.5 likewise needs a comparison model before any judgment can be made.

The output does give some absolute indications of fit: the residual deviance (32.014 on 44 degrees of freedom) is well below the null deviance (67.301 on 49 degrees of freedom), and Tjur's R² is 0.588, so the predictors separate Clinton and non-Clinton states considerably better than an intercept-only model.
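One comparison is available from the summary output itself: the fitted model against the intercept-only (null) model. The sketch below derives a likelihood-ratio test, McFadden's pseudo-R², and the AIC and BIC from the reported deviances. It assumes an ungrouped binary outcome, for which the residual deviance equals -2 times the log-likelihood; the object names are illustrative.

```r
# Deviances and degrees of freedom copied from summary(m.logit)
null_dev  <- 67.301   # on 49 df
resid_dev <- 32.014   # on 44 df
n <- 50
k <- 6                # intercept + 5 slope coefficients

# For 0/1 outcomes the residual deviance is -2 * logLik
aic <- resid_dev + 2 * k
bic <- resid_dev + log(n) * k

# Likelihood-ratio test of the fitted model against the null model
lr_stat <- null_dev - resid_dev
lr_df   <- 49 - 44
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R-squared
mcfadden <- 1 - resid_dev / null_dev

round(c(AIC = aic, BIC = bic, LR = lr_stat, McFadden = mcfadden), 3)
lr_p
```

With the fitted object available, the same test can be obtained from anova(m.logit, test = "Chisq"), which reports the deviance reduction term by term.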

Task 5

# Calculate effects
effects_model <- allEffects(m.logit)

effect_plot <- effect("College*Region", m.logit)

## NOTE: College:Region does not appear in the model

plot(effect_plot, 
     type = "response", # Plot on the probability scale rather than the default rescaled logit axis
     main = "Predicted Probabilities of Voting for Clinton by College and Region",
     xlab = "Percentage of College Graduates",
     ylab = "Predicted Probability of Voting for Clinton",
     ylim = c(0, 1),   # Full probability range (interpreted as probabilities because of type = "response")
     multiline = TRUE) # Display lines for each region

plot_model(m.logit, type = "est", show.values = TRUE, value.offset = .3,
           title = "Odds Ratios of Predictors in the Logistic Regression Model",
           axis.labels = c("West", "South", "Northeast", "Household Income", "College"),
           colors = "Set2")
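The predicted probabilities shown in the effect plot can also be computed by hand from the logit coefficients, which makes the link between odds ratios and probabilities explicit. The sketch below uses a hypothetical state profile (30% college graduates, household income of 55 in the model's units); these values are assumptions chosen for illustration and are not taken from the States data.

```r
# Coefficients copied from coef(m.logit), rounded to five decimals
b <- c(intercept = -13.88304, College = 0.25390, HouseholdIncome = 0.05599,
       RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

# Hypothetical profile (illustrative values, NOT from the data)
college <- 30
income  <- 55

# Linear predictor (log-odds) for the reference region
eta_ref <- unname(b["intercept"] + b["College"] * college +
                  b["HouseholdIncome"] * income)

# Add each regional shift, then map log-odds to probabilities
eta <- c(reference = eta_ref,
         NE = eta_ref + unname(b["RegionNE"]),
         S  = eta_ref + unname(b["RegionS"]),
         W  = eta_ref + unname(b["RegionW"]))
prob <- plogis(eta)

round(prob, 3)
```

With the fitted object available, predict(m.logit, newdata = ..., type = "response") gives the same quantities without copying coefficients by hand.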

Gurleen Bassy

2024-07-24

END