Setting Up R for Analysis

First, let’s install and load the necessary packages. This step ensures that we have all the tools we need for our analysis.
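The setup step might look something like the sketch below. It assumes the packages used later in this problem set are sjPlot, modelsummary, effects, and Lock5Data; the last one is an assumption about where the USStates data comes from, since the text does not name its source. The helper name `missing_packages` is hypothetical.

```r
# Packages this problem set appears to rely on. Lock5Data is an assumption:
# it is a likely source of the USStates data, but the text does not name it.
needed <- c("sjPlot", "modelsummary", "effects", "Lock5Data")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Install only what is missing, then load everything.
# (Uncomment to run; installation needs an internet connection.)
# if (length(to_install) > 0) install.packages(to_install)
# invisible(lapply(needed, library, character.only = TRUE))
```

Checking first and installing only what is missing avoids reinstalling packages every time the document is knitted.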

Problem Set 4

Due: July 24, 2024 by lecture time

Submit both the .Rmd (R Markdown) file and the published RPubs version (with the link provided). Copy and paste each task as text, and give the relevant code and written answer where prompted.

Task 1

Using the States dataset, conduct a multiple linear regression with ClintonVote as the dependent variable and College, HouseholdIncome, and NonWhite as independent variables. Create regression tables using the sjPlot package. Interpret the coefficients and assess the model fit.

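library(Lock5Data)  # assumption: USStates is taken from Lock5Data; the original does not show which package provides it
data(USStates)
States <- USStates
model1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)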
summary(model1)

## 
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.697  -3.605  -1.036   4.495  13.686 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.99082    6.01413   0.165    0.870    
## College          1.12498    0.22514   4.997 8.88e-06 ***
## HouseholdIncome -0.07554    0.15345  -0.492    0.625    
## NonWhite         0.43123    0.08246   5.230 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared:  0.6155, Adjusted R-squared:  0.5904 
## F-statistic: 24.55 on 3 and 46 DF,  p-value: 1.237e-09

Interpretation:
1. Intercept: The intercept equals 0.99, which is the predicted percentage of votes for Clinton when all independent variables equal zero. No state has such values, so this is not realistic, but it serves as a baseline.
2. College: The coefficient equals 1.125, meaning the predicted Clinton vote share rises by about 1.12 percentage points for every one-point increase in College, holding the other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, which is less than 0.05.
3. Household income: The coefficient equals -0.0755, meaning the predicted Clinton vote share falls by about 0.08 percentage points for every one-unit increase in HouseholdIncome, holding the other variables constant. This effect is not statistically significant at the 5% level because the p-value is 0.625, which is greater than 0.05.
4. Non-white: The coefficient equals 0.4312, meaning the predicted Clinton vote share rises by about 0.43 percentage points for every one-point increase in NonWhite, holding the other variables constant. This effect is statistically significant at the 5% level because the p-value is below 0.001, which is less than 0.05.
5. R squared: R squared equals 0.616, meaning that 61.6% of the variation in Clinton's vote share across states is explained by this model. The overall F-test (F = 24.55 on 3 and 46 degrees of freedom, p < 0.001) indicates that the predictors are jointly significant.
6. Adjusted R squared: The adjusted R squared is 0.590. It penalizes R squared for the number of predictors, so about 59% of the variation is explained once model size is taken into account.
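The link between R squared and adjusted R squared can be checked by hand. The sketch below uses only numbers copied from the summary(model1) output; the variable names are illustrative, and the formula is the standard adjusted R squared formula rather than anything specific to this assignment.

```r
# Values copied from the summary(model1) output.
r2 <- 0.6155   # Multiple R-squared
n  <- 50       # observations (46 residual df + 4 estimated coefficients)
p  <- 3        # predictors: College, HouseholdIncome, NonWhite

# Adjusted R-squared shrinks R-squared according to the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result agrees with the 0.5904 reported by summary(), up to rounding of the inputs.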

library(sjPlot)  # provides tab_model()
tab_model(model1, show.ci = FALSE, show.se = TRUE, show.stat = TRUE)

	ClintonVote
Predictors	Estimates	std. Error	Statistic	p
(Intercept)	0.99	6.01	0.16	0.870
College	1.12	0.23	5.00	<0.001
HouseholdIncome	-0.08	0.15	-0.49	0.625
NonWhite	0.43	0.08	5.23	<0.001
Observations	50
R² / R² adjusted	0.616 / 0.590

Task 2

Extend the model from Task 1 by adding the Region variable. Compare this new model to the previous one using AIC and BIC. Create a regression table with all models with modelsummary. Interpret the results and discuss which model you prefer and why.

model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
summary(model2)

## 
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite + 
##     Region, data = States)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3217  -4.1695   0.0435   4.2880  11.2477 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     14.80589    7.15877   2.068   0.0447 *  
## College          1.18862    0.25668   4.631 3.36e-05 ***
## HouseholdIncome -0.42863    0.17107  -2.506   0.0161 *  
## NonWhite         0.52461    0.08175   6.417 9.03e-08 ***
## RegionNE         8.03052    2.72062   2.952   0.0051 ** 
## RegionS         -3.40381    2.74204  -1.241   0.2212    
## RegionW          5.65798    2.77249   2.041   0.0474 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.82 on 43 degrees of freedom
## Multiple R-squared:  0.7224, Adjusted R-squared:  0.6836 
## F-statistic: 18.65 on 6 and 43 DF,  p-value: 1.545e-10

AIC(model1, model2)

##        df      AIC
## model1  5 336.7667
## model2  8 326.4857

BIC(model1, model2)

##        df      BIC
## model1  5 346.3268
## model2  8 341.7819

Interpretation (model 2):
1. Intercept: The intercept equals 14.81, which is the predicted percentage of votes for Clinton in a Midwestern state (the reference region) when all other variables equal zero. This is not realistic, but it serves as a baseline.
2. College: The coefficient equals 1.1886, meaning the predicted Clinton vote share rises by about 1.19 percentage points for every one-point increase in College, holding the other variables constant. This effect is statistically significant at the 5% significance level because the p-value is below 0.001, which is less than 0.05.
3. Household income: The coefficient equals -0.4286, meaning the predicted Clinton vote share falls by about 0.43 percentage points for every one-unit increase in HouseholdIncome, holding the other variables constant. Unlike in model 1, this effect is statistically significant at the 5% level because the p-value is 0.016, which is less than 0.05.
4. Non-white: The coefficient equals 0.5246, meaning the predicted Clinton vote share rises by about 0.52 percentage points for every one-point increase in NonWhite, holding the other variables constant. This effect is statistically significant at the 5% level because the p-value is below 0.001.
5. Region NE: The coefficient equals 8.0305, meaning that Clinton's vote share in Northeastern states is on average 8.03 percentage points higher than in Midwestern states, holding the other variables constant. This difference is statistically significant (p = 0.005).
6. Region S: The coefficient equals -3.4038, meaning that Clinton's vote share in Southern states is on average 3.40 percentage points lower than in Midwestern states, holding the other variables constant. This difference is not statistically significant (p = 0.221).
7. Region W: The coefficient equals 5.6580, meaning that Clinton's vote share in Western states is on average 5.66 percentage points higher than in Midwestern states, holding the other variables constant. This difference is statistically significant at the 5% level (p = 0.047).
8. R squared: R squared equals 0.7224, meaning that 72.24% of the variation in Clinton's vote share is explained by this model.
9. Adjusted R squared: The adjusted R squared is 0.6836, so about 68.36% of the variation is explained after penalizing for the number of predictors. This is higher than the 0.5904 of model 1.

Both the AIC (326.49 vs. 336.77) and the BIC (341.78 vs. 346.33) are lower for model 2, so model 2 is preferred over model 1 despite its three additional parameters.
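To see how large the improvement is, the criteria can be compared directly. This is a small sketch using only the AIC and BIC values printed in the output; the variable names are illustrative. It also checks that the gap between BIC and AIC matches the difference in their penalties, k * (log(n) - 2), which is why BIC favors model 2 by a smaller margin than AIC does.

```r
# Values copied from the AIC() and BIC() output.
aic <- c(model1 = 336.7667, model2 = 326.4857)
bic <- c(model1 = 346.3268, model2 = 341.7819)
k   <- c(model1 = 5, model2 = 8)   # the "df" column: estimated parameters
n   <- 50                          # number of states

# Positive differences mean model 2 has the lower (better) value.
delta_aic <- unname(aic["model1"] - aic["model2"])
delta_bic <- unname(bic["model1"] - bic["model2"])

# Which model does each criterion pick?
best_by_aic <- names(which.min(aic))
best_by_bic <- names(which.min(bic))

# BIC charges log(n) per parameter, AIC charges 2, so BIC - AIC = k * (log(n) - 2).
penalty_gap <- k * (log(n) - 2)
round(rbind(observed = bic - aic, expected = penalty_gap), 3)
```

Both criteria pick model 2. The smaller BIC margin reflects its heavier penalty on the three extra Region parameters.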

Task 3

Using the States dataset, fit a model predicting ClintonVote with College, NonWhite, and their interaction. Visualize the interaction effect using the effects package and interpret the results. How does the interaction change our understanding of the relationship between education and voting patterns?

model_interact <- lm(ClintonVote ~ College + NonWhite + NonWhite:College, data=States)
summary(model_interact)

## 
## Call:
## lm(formula = ClintonVote ~ College + NonWhite + NonWhite:College, 
##     data = States)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.446  -3.543  -0.295   3.936  13.831 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -15.22574   13.26349  -1.148 0.256927    
## College            1.50058    0.39913   3.760 0.000479 ***
## NonWhite           0.99725    0.47769   2.088 0.042397 *  
## College:NonWhite  -0.01802    0.01457  -1.236 0.222664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared:  0.6259, Adjusted R-squared:  0.6015 
## F-statistic: 25.66 on 3 and 46 DF,  p-value: 6.636e-10

Interpretation:
1. Intercept: The intercept equals -15.2257, which is the predicted percentage of votes for Clinton when College and NonWhite both equal zero. A negative vote share is not realistic, but the intercept serves as a baseline.
2. College: The coefficient equals 1.5006. Because the model includes an interaction, this is the change in Clinton's predicted vote share per one-point increase in College when NonWhite equals zero, not an overall effect.
3. Non-white: The coefficient equals 0.9973. Likewise, this is the change in Clinton's predicted vote share per one-point increase in NonWhite when College equals zero.
4. College:NonWhite: The coefficient equals -0.0180, meaning the slope of College falls by 0.018 for every one-point increase in NonWhite (and the slope of NonWhite falls by the same amount for every one-point increase in College). The negative sign suggests that the association between education and Clinton's vote share is weaker in states with a larger non-white share, but the interaction is not statistically significant (p = 0.223), so there is little evidence that the effect of College actually depends on NonWhite.
5. R squared: R squared equals 0.6259, indicating that 62.59% of the variation in Clinton's vote share is explained by this model.
6. Adjusted R squared: The adjusted R squared equals 0.6015, so about 60.15% of the variation is explained after penalizing for the number of predictors.
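With an interaction in the model, the slope of College is no longer a single number. The sketch below computes it at a few NonWhite values using the coefficients printed above; the chosen NonWhite values (10, 25, 40) and the function name `college_slope` are hypothetical and serve only as illustration.

```r
# Coefficients copied from the summary(model_interact) output.
b_college  <- 1.50058    # College slope when NonWhite = 0
b_interact <- -0.01802   # change in the College slope per one-point increase in NonWhite

# Conditional slope of College at a given NonWhite value.
college_slope <- function(nonwhite) b_college + b_interact * nonwhite

nonwhite_values <- c(10, 25, 40)   # hypothetical illustrative values
slopes <- college_slope(nonwhite_values)
data.frame(NonWhite = nonwhite_values, CollegeSlope = round(slopes, 3))
```

The slope shrinks as NonWhite rises, which is what the multiline effects plot shows as converging lines. Since the interaction term is not significant, these differences should be read cautiously.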

library(effects)
plot(allEffects(model_interact), multiline = TRUE)

Task 4

Using the States dataset, create a binary outcome variable indicating whether a state voted for Clinton (1) or not (0). Perform a logistic regression with this new variable as the dependent variable and College, HouseholdIncome, and Region as predictors. Create regression tables using both modelsummary and sjPlot. Interpret the odds ratios and assess model fit.

States$Clinton <- abs(as.numeric(States$Elect2016) - 2)
table(States$Clinton) # Clinton win = 1, Trump win = 0

## 
##  0  1 
## 30 20

model_logit <- glm(Clinton ~ College + HouseholdIncome + Region, data = States, family = binomial(link = "logit"))
summary(model_logit)

## 
## Call:
## glm(formula = Clinton ~ College + HouseholdIncome + Region, family = binomial(link = "logit"), 
##     data = States)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -13.88304    4.57843  -3.032  0.00243 **
## College           0.25390    0.12980   1.956  0.05046 . 
## HouseholdIncome   0.05599    0.06642   0.843  0.39922   
## RegionNE          3.02242    1.41280   2.139  0.03241 * 
## RegionS           0.31259    1.58896   0.197  0.84404   
## RegionW           3.02137    1.43108   2.111  0.03475 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67.301  on 49  degrees of freedom
## Residual deviance: 32.014  on 44  degrees of freedom
## AIC: 44.014
## 
## Number of Fisher Scoring iterations: 6

tab_model(model_logit, show.ci=FALSE, show.se=TRUE, show.stat=TRUE)

	Clinton
Predictors	Odds Ratios	std. Error	Statistic	p
(Intercept)	0.00	0.00	-3.03	0.002
College	1.29	0.17	1.96	0.050
HouseholdIncome	1.06	0.07	0.84	0.399
Region [NE]	20.54	29.02	2.14	0.032
Region [S]	1.37	2.17	0.20	0.844
Region [W]	20.52	29.37	2.11	0.035
Observations	50
R² Tjur	0.588

exp(coef(model_logit))

##     (Intercept)         College HouseholdIncome        RegionNE         RegionS 
##    9.346977e-07    1.289038e+00    1.057591e+00    2.054104e+01    1.366961e+00 
##         RegionW 
##    2.051945e+01

Interpretation:
1. Intercept: The exponentiated intercept is 0.00000093, which is the odds that a Midwestern state was won by Clinton when all other independent variables equal zero. This is not realistic, but it serves as a baseline.
2. College: The odds ratio is 1.289, meaning the odds of a Clinton win are multiplied by 1.289 (an increase of 28.9%) for every one-point increase in College, holding the other variables constant. The p-value is 0.050 (0.0505 before rounding), so the effect falls just short of significance at the 5% level.
3. Household income: The odds ratio is 1.058, meaning the odds of a Clinton win increase by about 5.8% for every one-unit increase in HouseholdIncome, holding the other variables constant. This effect is not statistically significant (p = 0.399).
4. Region NE: The odds ratio is 20.54, meaning the odds of a Clinton win in Northeastern states are about 20.5 times the odds in Midwestern states, holding the other variables constant. This difference is statistically significant at the 5% level (p = 0.032), although the large standard error shows the estimate is imprecise.
5. Region S: The odds ratio is 1.367, meaning the odds of a Clinton win in Southern states are about 36.7% higher than in Midwestern states, holding the other variables constant. This difference is not statistically significant (p = 0.844).
6. Region W: The odds ratio is 20.52, meaning the odds of a Clinton win in Western states are about 20.5 times the odds in Midwestern states, holding the other variables constant. This difference is statistically significant at the 5% level (p = 0.035).
7. Model fit: Tjur's R squared is 0.588. It is the difference between the average predicted probability for states Clinton won and for states she lost, so it measures how well the model separates the two groups rather than a share of variance explained. The residual deviance (32.01) is also well below the null deviance (67.30), which indicates that the predictors improve on an intercept-only model.
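The odds ratios and fit measures can be reproduced from the printed output alone. The following sketch uses the logit coefficients and deviances from summary(model_logit). The variable names are illustrative, and McFadden's pseudo R squared and the likelihood-ratio test are added here as common fit summaries that the original analysis does not report.

```r
# Logit coefficients copied from the summary(model_logit) output.
coefs <- c(College = 0.25390, HouseholdIncome = 0.05599,
           RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

# Exponentiating a logit coefficient gives an odds ratio.
odds_ratios <- exp(coefs)

# Percent change in the odds for a one-unit increase (or vs. the Midwest for regions).
pct_change <- (odds_ratios - 1) * 100
round(cbind(odds_ratio = odds_ratios, pct_change = pct_change), 2)

# Deviances copied from the same output.
null_dev  <- 67.301   # on 49 degrees of freedom
resid_dev <- 32.014   # on 44 degrees of freedom

# McFadden's pseudo R-squared (not reported in the original output).
mcfadden_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test of the model against the intercept-only model.
lr_stat <- null_dev - resid_dev
lr_df   <- 49 - 44
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
```

Note that the regional odds ratios are about 20.5, not 2.05. The `e+01` in the exp(coef(model_logit)) output means the value is multiplied by 10.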

Task 5

Leveraging the same logistic regression model from Task 4, use the effects package to plot the predicted probabilities of voting for Clinton across the range of College values, separated by Region. Then, create a plot using sjPlot’s plot_model function to visualize the odds ratios of all predictors in the model.

plot(allEffects(model_logit), type="response")

eff <- Effect("Region", model_logit)
plot(eff, type = "response", ylim=c(0,1))
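The predicted probabilities across College values, separated by Region, can also be sketched by hand from the printed coefficients. This is an illustration rather than a replacement for the effects plots: the College grid (20 to 50) and the fixed HouseholdIncome value of 55 are hypothetical choices, and the variable names are illustrative.

```r
# Logit coefficients copied from the summary(model_logit) output.
b <- c(intercept = -13.88304, College = 0.25390, HouseholdIncome = 0.05599,
       RegionNE = 3.02242, RegionS = 0.31259, RegionW = 3.02137)

# Shift in the log-odds for each region; the Midwest (MW) is the reference.
region_shift <- c(MW = 0, NE = b[["RegionNE"]], S = b[["RegionS"]], W = b[["RegionW"]])

income_fixed <- 55   # hypothetical value at which HouseholdIncome is held constant

# Every combination of College value and region (College varies fastest).
grid <- expand.grid(College = seq(20, 50, by = 5),
                    Region = names(region_shift),
                    stringsAsFactors = FALSE)

# Linear predictor (log-odds), then the inverse logit to get probabilities.
eta <- b[["intercept"]] +
  b[["College"]] * grid$College +
  b[["HouseholdIncome"]] * income_fixed +
  unname(region_shift[grid$Region])
grid$prob <- plogis(eta)

head(grid)
```

Within each region the predicted probability rises with College, and at any given College value the Northeast and West sit well above the Midwest and South, matching the pattern of the odds ratios.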

Problem Set 4 (SOC252)

Leili Ayoubian