PS4 Modeling III: GLM & MNR

Problem Set 4

Due: July 24, 2024 by lecture time

Submit both the .Rmd (R Markdown) file and the published RPubs version (with the link provided). Copy/paste all the tasks as text and provide the relevant code and written answer when prompted.

Setting Up R for Analysis

First, let’s install and load the necessary packages. This step ensures that we have all the tools we need for our analysis.
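The setup code itself does not appear in this rendering. A minimal sketch of what it might look like is given here. The package list is an assumption inferred from the functions called in the tasks (tab_model, modelsummary, allEffects), and Lock5Data is assumed to be the source of the USStates data; neither is confirmed by the original file.

```r
# Packages assumed from the calls used in the tasks (not taken from the original file)
pkgs <- c("Lock5Data", "sjPlot", "modelsummary", "effects")

# Identify packages that are not installed yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Uncomment to install anything that is missing
# install.packages(missing_pkgs)

# Attach every package that is available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```

Keeping the install step commented out avoids reinstalling packages every time the document is knitted.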

Task 1

Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.

# Load the States data
data(USStates)
States <- USStates
model1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
summary(model1)

## 
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.697  -3.605  -1.036   4.495  13.686 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.99082    6.01413   0.165    0.870    
## College          1.12498    0.22514   4.997 8.88e-06 ***
## HouseholdIncome -0.07554    0.15345  -0.492    0.625    
## NonWhite         0.43123    0.08246   5.230 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared:  0.6155, Adjusted R-squared:  0.5904 
## F-statistic: 24.55 on 3 and 46 DF,  p-value: 1.237e-09

tab_model(model1, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)

	Clinton Vote
Predictors	Estimates	std. Error	Statistic	p
(Intercept)	0.99	6.01	0.16	0.870
College	1.12	0.23	5.00	<0.001
HouseholdIncome	-0.08	0.15	-0.49	0.625
NonWhite	0.43	0.08	5.23	<0.001
Observations	50
R² / R² adjusted	0.616 / 0.590

Intercept: The intercept is 0.99. The model predicts a Clinton vote share of 0.99 percent when all the independent variables equal zero. No state has such values, so this is an extrapolation with little substantive meaning, and the estimate is not statistically significant (p = 0.870).

College: The coefficient for College is 1.12498. A 1 percentage point increase in the share of college graduates is associated with a 1.12498 percentage point increase in Clinton's vote share, holding other factors constant. The effect is statistically significant at the 5% level because the p-value (8.88e-06) is well below 0.05.

HouseholdIncome: The coefficient on HouseholdIncome is -0.07554. A one-unit increase in HouseholdIncome is associated with a 0.07554 percentage point decrease in Clinton's vote share, holding other factors constant. The effect is not statistically significant at the 5% level because the p-value (0.625) is greater than 0.05.

NonWhite: The coefficient on NonWhite is 0.43123. A 1 percentage point increase in the non-white share of the population is associated with a 0.43123 percentage point increase in Clinton's vote share, holding other factors constant. The effect is statistically significant at the 5% level because the p-value (4.05e-06) is well below 0.05.

Model fit: R-squared is 0.6155, so the model explains 61.55% of the variation in ClintonVote. Adjusted R-squared is 0.5904, which is the share explained after penalizing for the number of independent variables. The F-statistic of 24.55 (p = 1.237e-09) indicates that the predictors are jointly significant.
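As a check on the fit statistics and significance claims, the adjusted R-squared and the College t-test can be reproduced by hand from the printed output. This is a sketch in base R using only the rounded numbers shown in summary(model1), so the results match the output only up to rounding.

```r
# Values copied from summary(model1)
r2 <- 0.6155   # multiple R-squared
n  <- 50       # observations
k  <- 3        # predictors: College, HouseholdIncome, NonWhite

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# t statistic and two-sided p-value for the College coefficient
est_college <- 1.12498
se_college  <- 0.22514
t_college   <- est_college / se_college
p_college   <- 2 * pt(-abs(t_college), df = n - k - 1)

round(c(adj_r2 = adj_r2, t_college = t_college), 4)
p_college
```

The residual degrees of freedom (n - k - 1 = 46) match the value reported in the summary output.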

Task 2

Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.

model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
summary(model2)

## 
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite + 
##     Region, data = States)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3217  -4.1695   0.0435   4.2880  11.2477 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     14.80589    7.15877   2.068   0.0447 *  
## College          1.18862    0.25668   4.631 3.36e-05 ***
## HouseholdIncome -0.42863    0.17107  -2.506   0.0161 *  
## NonWhite         0.52461    0.08175   6.417 9.03e-08 ***
## RegionNE         8.03052    2.72062   2.952   0.0051 ** 
## RegionS         -3.40381    2.74204  -1.241   0.2212    
## RegionW          5.65798    2.77249   2.041   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.82 on 43 degrees of freedom
## Multiple R-squared:  0.7224, Adjusted R-squared:  0.6836 
## F-statistic: 18.65 on 6 and 43 DF,  p-value: 1.545e-10

# Compare models
AIC(model1, model2)

##        df      AIC
## model1  5 336.7667
## model2  8 326.4857

BIC(model1, model2)

##        df      BIC
## model1  5 346.3268
## model2  8 341.7819

The results of model 1 were explained above. The interpretation of model 2 follows. The reference category for Region is the Midwest (MW).

Intercept: The intercept is 14.806. The model predicts a Clinton vote share of 14.806 percent for a Midwestern state when College, HouseholdIncome, and NonWhite all equal zero. This is an extrapolation outside the observed data.

College: The coefficient for College is 1.189. A 1 percentage point increase in the share of college graduates is associated with a 1.189 percentage point increase in Clinton's vote share, holding other factors constant. The effect is statistically significant at the 5% level (p = 3.36e-05).

HouseholdIncome: The coefficient on HouseholdIncome is -0.429. A one-unit increase in HouseholdIncome is associated with a 0.429 percentage point decrease in Clinton's vote share, holding other factors constant. Unlike in model 1, the effect is statistically significant at the 5% level, because the p-value (0.0161) is less than 0.05.

NonWhite: The coefficient on NonWhite is 0.525. A 1 percentage point increase in the non-white share of the population is associated with a 0.525 percentage point increase in Clinton's vote share, holding other factors constant. The effect is statistically significant at the 5% level (p = 9.03e-08).

NE (Northeast): The coefficient for NE is 8.031. On average, Clinton's vote share in the Northeast is 8.031 percentage points higher than in the Midwest, holding all other variables constant. The difference is statistically significant (p = 0.0051).

S (South): The coefficient for S is -3.404. On average, Clinton's vote share in the South is 3.404 percentage points lower than in the Midwest, holding all other variables constant. The difference is not statistically significant (p = 0.2212).

W (West): The coefficient for W is 5.658. On average, Clinton's vote share in the West is 5.658 percentage points higher than in the Midwest, holding all other variables constant. The difference is statistically significant at the 5% level (p = 0.0474).

Model fit: R-squared is 0.7224, so the model explains 72.24% of the variation in ClintonVote. Adjusted R-squared is 0.6836, which is the share explained after penalizing for the number of independent variables.

Model comparison: Model 2 has the lower AIC (326.49 vs. 336.77) and the lower BIC (341.78 vs. 346.33), so it fits better even after the penalty for the three additional Region parameters. Its adjusted R-squared is also higher (0.6836 vs. 0.5904). Model 2 is preferred.
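The two criteria differ only in how heavily they penalize extra parameters. The sketch below, in base R, starts from the AIC values printed above, recovers the shared -2 log-likelihood term, and rebuilds BIC with its log(n) penalty. It uses only numbers from the output, with n = 50 taken from the number of states.

```r
# Values copied from AIC(model1, model2)
aic <- c(model1 = 336.7667, model2 = 326.4857)
df  <- c(model1 = 5, model2 = 8)   # estimated parameters, including the error variance
n   <- 50                          # number of states

# AIC = -2*logLik + 2*df, so the shared likelihood term can be recovered
neg2loglik <- aic - 2 * df

# BIC replaces the penalty of 2 per parameter with log(n) per parameter
bic <- neg2loglik + log(n) * df

# Positive differences favor model 2
delta_aic <- unname(aic["model1"] - aic["model2"])
delta_bic <- unname(bic["model1"] - bic["model2"])

round(bic, 4)
round(c(delta_aic = delta_aic, delta_bic = delta_bic), 3)
```

Because log(50) is larger than 2, BIC charges more for the three Region parameters, which is why the BIC gap is smaller than the AIC gap while still favoring model 2.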

Task 3

Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?

model_interact <- lm(ClintonVote ~ College + NonWhite + NonWhite:College, data=States)
summary(model_interact)

## 
## Call:
## lm(formula = ClintonVote ~ College + NonWhite + NonWhite:College, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.446  -3.543  -0.295   3.936  13.831 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -15.22574   13.26349  -1.148 0.256927    
## College            1.50058    0.39913   3.760 0.000479 ***
## NonWhite           0.99725    0.47769   2.088 0.042397 *  
## College:NonWhite  -0.01802    0.01457  -1.236 0.222664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared:  0.6259, Adjusted R-squared:  0.6015 
## F-statistic: 25.66 on 3 and 46 DF,  p-value: 6.636e-10

library(effects)
plot(allEffects(model_interact), multiline = TRUE)

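Intercept: The intercept is -15.22574, the predicted Clinton vote share when College and NonWhite are both zero. A negative vote share is impossible, which shows that this is an extrapolation far outside the observed data.

College: The coefficient on College is 1.50058, which is the effect of College when NonWhite equals zero. In general, the marginal effect of College on ClintonVote is 1.50058 - 0.01802 × NonWhite, so it depends on the value of NonWhite.

NonWhite: The coefficient on NonWhite is 0.99725, which is the effect of NonWhite when College equals zero. In general, the marginal effect of NonWhite on ClintonVote is 0.99725 - 0.01802 × College, so it depends on the value of College.

Interaction: The interaction coefficient is -0.01802, which suggests that the positive association between College and ClintonVote weakens slightly as NonWhite increases. However, the interaction is not statistically significant (p = 0.223), so the data do not provide clear evidence that the effect of education on Clinton's vote share differs with the racial composition of a state.

Model fit: R-squared is 0.6259, so the model explains 62.59% of the variation in ClintonVote. Adjusted R-squared is 0.6015, which is the share explained after penalizing for the number of independent variables.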
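To make the conditional effect concrete, the slope of College can be evaluated at a few values of NonWhite. This is a sketch in base R using the rounded coefficients from summary(model_interact); the NonWhite values of 10, 20, and 40 percent are hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(model_interact)
b_college     <- 1.50058
b_interaction <- -0.01802

# Hypothetical NonWhite percentages, chosen for illustration only
nonwhite <- c(10, 20, 40)

# Marginal effect of College at each NonWhite value
slope_college <- b_college + b_interaction * nonwhite

data.frame(NonWhite = nonwhite, slope_college = round(slope_college, 4))
```

The slope stays positive across these values and shrinks as NonWhite rises, which is the pattern an effects plot with multiline = TRUE displays as lines with different steepness.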

Task 4

Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.

States$Clinton <- abs(as.numeric(States$Elect2016) - 2)
table(States$Clinton) # Clinton win is coded as 1, Trump win is coded as 0

## 
##  0  1 
## 30 20

model_logit <- glm(Clinton ~ College + HouseholdIncome + Region, data = States, family = binomial(link = "logit"))
summary(model_logit)

## 
## Call:
## glm(formula = Clinton ~ College + HouseholdIncome + Region, family = binomial(link = "logit"), 
##     data = States)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0632  -0.3987  -0.1653   0.3878   2.2223  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -13.88304    4.57843  -3.032  0.00243 **
## College           0.25390    0.12980   1.956  0.05046 . 
## HouseholdIncome   0.05599    0.06642   0.843  0.39922   
## RegionNE          3.02242    1.41280   2.139  0.03241 * 
## RegionS           0.31259    1.58896   0.197  0.84404   
## RegionW           3.02137    1.43108   2.111  0.03475 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 32.014  on 44  degrees of freedom
## AIC: 44.014
## 
## Number of Fisher Scoring iterations: 6
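The deviances in this output can also be used to assess overall fit. The sketch below, in base R, computes a likelihood-ratio test against the intercept-only model and McFadden's pseudo R-squared from the printed values. McFadden's measure is not part of the original analysis and is offered here only as one common summary.

```r
# Values copied from summary(model_logit)
null_dev  <- 67.301; null_df  <- 49
resid_dev <- 32.014; resid_df <- 44

# Likelihood-ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (for 0/1 outcomes, deviance = -2 * log-likelihood)
mcfadden_r2 <- 1 - resid_dev / null_dev

# AIC check: residual deviance plus 2 per estimated coefficient (6 coefficients)
aic_check <- resid_dev + 2 * 6

round(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = mcfadden_r2, aic = aic_check), 4)
lr_p
```

The reconstructed AIC agrees with the 44.014 reported by glm, which confirms that the deviance equals -2 times the log-likelihood for this binary outcome.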

tab_model(model_logit, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)

	Clinton
Predictors	Odds Ratios	std. Error	Statistic	p
(Intercept)	0.00	0.00	-3.03	0.002
College	1.29	0.17	1.96	0.050
HouseholdIncome	1.06	0.07	0.84	0.399
Region [NE]	20.54	29.02	2.14	0.032
Region [S]	1.37	2.17	0.20	0.844
Region [W]	20.52	29.37	2.11	0.035
Observations	50
R² Tjur	0.588

exp(coef(model_logit))

##     (Intercept)         College HouseholdIncome        RegionNE         RegionS 
##    9.346977e-07    1.289038e+00    1.057591e+00    2.054104e+01    1.366961e+00 
##         RegionW 
##    2.051945e+01

Intercept: The exponentiated intercept, 0.0000009346977, is the odds (not an odds ratio) that a Midwestern state voted for Clinton when College and HouseholdIncome both equal zero. This is an extrapolation outside the observed data.

College: For each 1 percentage point increase in college graduates, the odds that a state voted for Clinton are multiplied by 1.289, an increase of about 28.9%, holding other factors constant. The relationship is positive, but the p-value (0.0505) is just above 0.05, so it is significant at the 10% level but not at the 5% level.

HouseholdIncome: For each 1 unit increase in household income (in thousands of dollars), the odds that a state voted for Clinton are multiplied by 1.0576, an increase of about 5.76%, holding other factors constant. The effect is not statistically significant (p = 0.399).

NE (Northeast): The coefficient on NE is 3.02242, so the log odds of voting for Clinton are 3.02242 higher in the Northeast than in the Midwest (MW), holding other factors constant. The odds are about 20.54 times those of a Midwestern state, and the difference is statistically significant (p = 0.032).

S (South): The coefficient on S is 0.31259, so the log odds of voting for Clinton are 0.31259 higher in the South than in the Midwest (MW), holding other factors constant. The odds ratio is 1.37, and the difference is not statistically significant (p = 0.844).

W (West): The coefficient on W is 3.02137, so the log odds of voting for Clinton are 3.02137 higher in the West than in the Midwest (MW), holding other factors constant. The odds are about 20.52 times those of a Midwestern state, and the difference is statistically significant (p = 0.035).

Model fit: Tjur's R-squared is 0.588. It is the difference between the average predicted probability for states that Clinton won and for states that she lost, not a share of explained variance. A value of 0.588 indicates that the model separates the two groups of states well.
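The link between log odds, odds ratios, and probabilities can be traced by hand. The sketch below, in base R, uses the rounded coefficients from summary(model_logit). The example state (Northeast, College = 35, HouseholdIncome = 60) is hypothetical and not taken from the data.

```r
# Coefficients (log odds) copied from summary(model_logit)
coefs <- c("(Intercept)"   = -13.88304,
           College         = 0.25390,
           HouseholdIncome = 0.05599,
           RegionNE        = 3.02242,
           RegionS         = 0.31259,
           RegionW         = 3.02137)

# Exponentiating a log-odds coefficient gives the odds ratio
odds_ratios <- exp(coefs)

# Percent change in the odds for a one-unit increase in each predictor
pct_change <- 100 * (odds_ratios - 1)

# Linear predictor for a hypothetical Northeastern state
eta <- unname(coefs["(Intercept)"] +
              coefs["College"] * 35 +
              coefs["HouseholdIncome"] * 60 +
              coefs["RegionNE"])

# The inverse logit converts log odds to a probability
prob <- plogis(eta)

round(odds_ratios[-1], 3)
round(c(eta = eta, prob = prob), 3)
```

Odds ratios describe multiplicative changes in the odds, so converting to a predicted probability for a specific profile is often easier to communicate.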

Task 5

Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.

# Predicted probabilities
plot(allEffects(model_logit), type="response")

eff <- Effect("Region", model_logit)
plot(eff, type = "response", ylim=c(0,1))
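The task also asks for predicted probabilities across College separated by Region, and for an odds-ratio plot with plot_model. One possible way to produce both is sketched here. It assumes the USStates data come from the Lock5Data package and refits the Task 4 model so that the block runs on its own. The object names have_pkgs and eff_college_region are introduced only for this illustration.

```r
# Run only if the assumed packages are installed
have_pkgs <- all(vapply(c("Lock5Data", "effects", "sjPlot"),
                        requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs) {
  # Rebuild the data and the Task 4 model so this block is self-contained
  data("USStates", package = "Lock5Data")
  States <- USStates
  States$Clinton <- abs(as.numeric(States$Elect2016) - 2)
  model_logit <- glm(Clinton ~ College + HouseholdIncome + Region,
                     data = States, family = binomial(link = "logit"))

  # Predicted probability across College, one line per Region
  eff_college_region <- effects::Effect(c("College", "Region"), model_logit)
  print(plot(eff_college_region, type = "response", multiline = TRUE))

  # Odds ratios for all predictors (plot_model exponentiates logit coefficients)
  print(sjPlot::plot_model(model_logit, show.values = TRUE))
}
```

Because the model has no College by Region interaction, the regional curves differ only by a shift on the log-odds scale.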

### End ###