Problem Set 4

#Task 1 Using the States data, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.

load("States.RData")

#Examine data
head(States[, c("State", "ClintonVote", "College", "HouseholdIncome", "NonWhite")])

##        State ClintonVote College HouseholdIncome NonWhite
## 1    Alabama       34.36    26.0          46.472     31.6
## 2     Alaska       36.55    26.5          76.114     34.7
## 3    Arizona       45.13    27.4          53.510     22.5
## 4   Arkansas       33.65    24.7          43.813     22.7
## 5 California       61.73    34.5          67.169     39.4
## 6   Colorado       48.16    39.6          65.458     15.8

# Fit the multiple linear regression model
model <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data=States)

# Summarize the model
# This will display the coefficients, standard errors, t-values, and p-values for each predictor
summary(model)

## 
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.697  -3.605  -1.036   4.495  13.686 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.99082    6.01413   0.165    0.870    
## College          1.12498    0.22514   4.997 8.88e-06 ***
## HouseholdIncome -0.07554    0.15345  -0.492    0.625    
## NonWhite         0.43123    0.08246   5.230 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared:  0.6155, Adjusted R-squared:  0.5904 
## F-statistic: 24.55 on 3 and 46 DF,  p-value: 1.237e-09

# Create regression tables using sjPlot
tab_model(model, show.ci = FALSE, show.se = TRUE, dv.labels = c("ClintonVote"))

	ClintonVote
Predictors	Estimates	std. Error	p
(Intercept)	0.99	6.01	0.870
College	1.12	0.23	<0.001
HouseholdIncome	-0.08	0.15	0.625
NonWhite	0.43	0.08	<0.001
Observations	50
R² / R² adjusted	0.616 / 0.590

# Interpret the coefficients and assess model fit
coefficients <- summary(model)$coefficients
r_squared <- summary(model)$r.squared
adjusted_r_squared <- summary(model)$adj.r.squared

# Print the coefficients
cat("Coefficients:\n")

## Coefficients:

print(coefficients)

##                    Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)      0.99082189  6.0141277  0.1647491 8.698633e-01
## College          1.12498459  0.2251381  4.9968638 8.883340e-06
## HouseholdIncome -0.07554429  0.1534528 -0.4922966 6.248512e-01
## NonWhite         0.43122573  0.0824595  5.2295460 4.053178e-06

# Print the R-squared and Adjusted R-squared values
cat("\nR-squared: ", r_squared)

## 
## R-squared:  0.615523

cat("\nAdjusted R-squared: ", adjusted_r_squared)

## 
## Adjusted R-squared:  0.5904484

#Interpretation of Coefficients

For each 1 percentage point increase in college graduates, the percentage of votes for Clinton increases by about 1.13 percentage points, showing a strong positive relationship.
Household income has a small and statistically insignificant effect on the percentage of votes for Clinton.
For each 1 percentage point increase in the non-white population, the percentage of votes for Clinton increases by about 0.43 percentage points, indicating a positive relationship. 4> The model explains approximately 61.55% of the variability in the percentage of votes for Clinton, indicating a reasonably good fit

#Task 2 Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.

# Fit the original model
model_original <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)

# Fit the extended model with the 'Region' variable
model_extended <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)

# Compare models using AIC and BIC
aic_original <- AIC(model_original)
aic_extended <- AIC(model_extended)

bic_original <- BIC(model_original)
bic_extended <- BIC(model_extended)

cat("AIC for original model: ", aic_original, "\n")

## AIC for original model:  336.7667

cat("AIC for extended model: ", aic_extended, "\n")

## AIC for extended model:  326.4857

cat("BIC for original model: ", bic_original, "\n")

## BIC for original model:  346.3268

cat("BIC for extended model: ", bic_extended, "\n")

## BIC for extended model:  341.7819

# Create regression tables using modelsummary
models <- list(Original = model_original, Extended = model_extended)
modelsummary(models)

tinytable_6xpob7q5qk1vhi41x399

	Original	Extended
(Intercept)	0.991	14.806
	(6.014)	(7.159)
College	1.125	1.189
	(0.225)	(0.257)
HouseholdIncome	-0.076	-0.429
	(0.153)	(0.171)
NonWhite	0.431	0.525
	(0.082)	(0.082)
RegionNE		8.031
		(2.721)
RegionS		-3.404
		(2.742)
RegionW		5.658
		(2.772)
Num.Obs.	50	50
R2	0.616	0.722
R2 Adj.	0.590	0.684
AIC	336.8	326.5
BIC	346.3	341.8
Log.Lik.	-163.383	-155.243
F	24.548	18.648
RMSE	6.35	5.40

#Interpretation Model Comparison: The extended model has a lower AIC (326.4857) compared to the original model (336.7667). Lower AIC indicates that the extended model with the Region variable provides a better balance between goodness-of-fit and model complexity. It suggests that adding Region improves the model’s fit more effectively. The extended model also has a lower BIC (341.7819) compared to the original model (346.3268).Lower BIC supports the same conclusion as AIC. It shows that the extended model is preferable, considering the additional penalty for complexity in the BIC criterion. I therefore prefer the extended model, which includes the Region variable. This suggests that the inclusion of the Region variable improves the model’s fit without excessively increasing its complexity. #Task 3 Using the States data, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?

# Fit the model including the interaction term between College and NonWhite
model_interaction <- lm(ClintonVote ~ College * NonWhite, data = States)

# Summarize the model
summary(model_interaction)

## 
## Call:
## lm(formula = ClintonVote ~ College * NonWhite, data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.446  -3.543  -0.295   3.936  13.831 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -15.22574   13.26349  -1.148 0.256927    
## College            1.50058    0.39913   3.760 0.000479 ***
## NonWhite           0.99725    0.47769   2.088 0.042397 *  
## College:NonWhite  -0.01802    0.01457  -1.236 0.222664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared:  0.6259, Adjusted R-squared:  0.6015 
## F-statistic: 25.66 on 3 and 46 DF,  p-value: 6.636e-10

# Calculate effects for visualization
interaction_effects <- allEffects(model_interaction)

# Plot the interaction effects
plot(interaction_effects, main = "Interaction Effects of College and NonWhite on ClintonVote")

#Interpretation Each additional percentage point of college graduates is associated with a 1.50 percentage point increase in the likelihood of voting for Clinton. However, the interaction term is not statistically significant (p = 0.223), indicating that the effect of college education on voting for Clinton does not significantly vary with the percentage of the non-white population. #Task 4 Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.

# Create the binary outcome variable
States$VotedForClinton <- ifelse(States$ClintonVote > 0.5, 1, 0)

# Fit the logistic regression model
model_logit <- glm(VotedForClinton ~ College + HouseholdIncome + Region, data = States, family = binomial)

# Summarize the model
summary(model_logit)

## 
## Call:
## glm(formula = VotedForClinton ~ College + HouseholdIncome + Region, 
##     family = binomial, data = States)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      2.657e+01  4.336e+05       0        1
## College         -6.055e-08  1.509e+04       0        1
## HouseholdIncome -8.145e-08  9.068e+03       0        1
## RegionNE        -3.017e-06  1.665e+05       0        1
## RegionS         -4.680e-06  1.525e+05       0        1
## RegionW         -2.223e-06  1.696e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 49  degrees of freedom
## Residual deviance: 2.9008e-10  on 44  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

# Create regression tables using modelsummary
modelsummary(model_logit)

tinytable_fyonpafzssgr59w5hkk7

	(1)
(Intercept)	26.566
	(433587.131)
College	0.000
	(15090.634)
HouseholdIncome	0.000
	(9067.724)
RegionNE	0.000
	(166455.565)
RegionS	0.000
	(152517.241)
RegionW	0.000
	(169636.511)
Num.Obs.	50
AIC	12.0
BIC	23.5
Log.Lik.	0.000
F	0.000
RMSE	0.00

# Create regression tables using sjPlot
tab_model(model_logit)

	VotedForClinton
Predictors	Odds Ratios	CI	p
(Intercept)	344746337574.15	0.00 – Inf	1.000
College	1.00	0.00 – Inf	1.000
HouseholdIncome	1.00	0.00 – Inf	1.000
Region [NE]	1.00	0.00 – Inf	1.000
Region [S]	1.00	0.00 – Inf	1.000
Region [W]	1.00	0.00 – Inf	1.000
Observations	50

# Convert coefficients to odds ratios
odds_ratios <- exp(coef(model_logit))

conf_int <- confint.default(model_logit)  
conf_intervals <- exp(conf_int)

cat("Odds Ratios:\n")

## Odds Ratios:

print(knitr::kable(data.frame(OddsRatio = odds_ratios), col.names = "Odds Ratio", digits = 4))

## 
## 
## |                |   Odds Ratio|
## |:---------------|------------:|
## |(Intercept)     | 344746337574|
## |College         |            1|
## |HouseholdIncome |            1|
## |RegionNE        |            1|
## |RegionS         |            1|
## |RegionW         |            1|

cat("\nConfidence Intervals for Odds Ratios:\n")

## 
## Confidence Intervals for Odds Ratios:

print(knitr::kable(data.frame(ConfidenceIntervals = conf_intervals), digits = 4))

## 
## 
## |                | ConfidenceIntervals.2.5..| ConfidenceIntervals.97.5..|
## |:---------------|-------------------------:|--------------------------:|
## |(Intercept)     |                         0|                        Inf|
## |College         |                         0|                        Inf|
## |HouseholdIncome |                         0|                        Inf|
## |RegionNE        |                         0|                        Inf|
## |RegionS         |                         0|                        Inf|
## |RegionW         |                         0|                        Inf|

# Assess model fit using AIC and BIC
aic_logit <- AIC(model_logit)
bic_logit <- BIC(model_logit)

cat("\nAIC for logistic regression model: ", aic_logit, "\n")

## 
## AIC for logistic regression model:  12

cat("BIC for logistic regression model: ", bic_logit, "\n")

## BIC for logistic regression model:  23.47214

#Interpretation

3.4475e11 (344746337574) for the intercept is extremely high. This suggests that the odds of voting for Clinton when all predictors (College, HouseholdIncome, and Region) are at their baseline values are very high. However, this value is not practically interpretable and primarily serves as a baseline for comparison. College:

The odds ratio of 1for College means that there is no change in the odds of voting for Clinton with an increase in the percentage of college graduates. This result suggests that, in this model, College does not significantly affect the odds of voting for Clinton when controlling for other factors.

The odds ratio of 1 for HouseholdIncome indicates that changes in household income do not affect the odds of voting for Clinton in this model. This suggests that, after accounting for College and Region, HouseholdIncome does not have a significant effect on voting behavior. Region Variables (NE, S, W):

The odds ratios for the Region variables (NE, S, W) being 1means that these regional indicators do not affect the odds of voting for Clinton compared to the reference region. This suggests that there are no significant regional differences in the odds of voting for Clinton in this model.

#Model Fit:

AIC (12): A lower AIC value indicates a better model fit relative to other models. This value suggests that the model fits the data reasonably well. BIC (23.47214): Similar to AIC, a lower BIC value indicates a better model fit. BIC also penalizes model complexity, so a lower BIC suggests that this model provides a good balance between fit and complexity.

#Task 5 Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.

# Fit the logistic regression model
m.logit <- glm(VotedForClinton ~ College + HouseholdIncome + Region, data = States, family = binomial(link = "logit"))
summary(m.logit)

## 
## Call:
## glm(formula = VotedForClinton ~ College + HouseholdIncome + Region, 
##     family = binomial(link = "logit"), data = States)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      2.657e+01  4.336e+05       0        1
## College         -6.055e-08  1.509e+04       0        1
## HouseholdIncome -8.145e-08  9.068e+03       0        1
## RegionNE        -3.017e-06  1.665e+05       0        1
## RegionS         -4.680e-06  1.525e+05       0        1
## RegionW         -2.223e-06  1.696e+05       0        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 49  degrees of freedom
## Residual deviance: 2.9008e-10  on 44  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

# Visualize odds ratios using sjPlot
plot_model(model_logit, 
           type = "est",
           show.values = TRUE, 
           show.p = TRUE, 
           title = "Odds Ratios for Predictors of Voting for Clinton")

## Warning in scale_y_continuous(trans = "log10", limits = axis.scaling$axis.lim,
## : log-10 transformation introduced infinite values.

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text()`).

# Plot predicted probabilities across the range of College values, separated by Region
eff <- allEffects(model_logit)
plot(eff, 
     type = "response", 
     main = "Predicted Probabilities of Voting for Clinton Across College Values by Region")

eff <- Effect("Region", m.logit)
plot(eff, type = "response", ylim=c(0,1))

#END

Problem Set 4

ziyi su

2024-07-24