TASK 1
# Inspect the dataset
str(States)
## 'data.frame': 50 obs. of 22 variables:
## $ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ HouseholdIncome : num 46.5 76.1 53.5 43.8 67.2 ...
## $ Region : Factor w/ 4 levels "MW","NE","S",..: 3 4 4 3 4 4 2 2 3 3 ...
## $ Population : num 4.88 0.74 7.02 3 39.54 ...
## $ EighthGradeMath : num 269 274 280 274 276 ...
## $ HighSchool : num 87.1 92.8 87.1 89.1 87.4 91.5 92.4 89.2 89.4 87.5 ...
## $ College : num 26 26.5 27.4 24.7 34.5 39.6 42.7 33.4 28.8 30.9 ...
## $ IQ : num 95.7 99 97.4 97.5 95.5 ...
## $ GSP : num 40.3 70.9 43.1 38.5 67.7 ...
## $ Vegetables : num 80.7 81 79.2 80.7 78.6 82.6 83.1 82.8 80.6 81.9 ...
## $ Fruit : num 55.1 63.1 62.8 55.3 67.5 67 68.5 64.6 65.6 61.4 ...
## $ Smokers : num 20.9 21 15.6 22.3 11.3 14.6 12.7 17 16.1 17.5 ...
## $ PhysicalActivity: num 42.8 58.3 52.7 45.4 57.5 58.7 51.9 46.3 49.5 46.1 ...
## $ Obese : num 36.2 29.5 29.5 37.1 25.8 23 27.4 33.5 30.7 32.5 ...
## $ NonWhite : num 31.6 34.7 22.5 22.7 39.4 15.8 23.3 30.9 24.3 40.6 ...
## $ HeavyDrinkers : num 5.45 7.33 5.57 5.32 5.95 7.3 6.45 6.4 7.61 5.82 ...
## $ Electoral : int 9 3 11 6 55 9 7 3 29 16 ...
## $ ClintonVote : num 34.4 36.5 45.1 33.6 61.7 ...
## $ Elect2016 : Factor w/ 2 levels "D","R": 2 2 2 2 1 1 1 1 2 2 ...
## $ TwoParents : num 60.9 71.5 62.7 63.3 66.8 71.9 66.5 62.6 61 62.1 ...
## $ StudentSpending : num 9.24 17.51 7.61 9.85 11.49 ...
## $ Insured : num 83.7 80.2 83.4 84.1 85.2 87.2 91.1 90.7 78.2 79.3 ...
summary(States)
## State HouseholdIncome Region Population EighthGradeMath
## Alabama : 1 Min. :42.01 MW:13 Min. : 0.579 Min. :268.7
## Alaska : 1 1st Qu.:50.91 NE:11 1st Qu.: 1.842 1st Qu.:276.6
## Arizona : 1 Median :56.34 S :13 Median : 4.569 Median :281.2
## Arkansas : 1 Mean :57.85 W :13 Mean : 6.501 Mean :281.2
## California: 1 3rd Qu.:64.75 3rd Qu.: 7.309 3rd Qu.:285.7
## Colorado : 1 Max. :78.92 Max. :39.537 Max. :294.5
## (Other) :44
## HighSchool College IQ GSP
## Min. :86.40 Min. :22.50 Min. : 94.20 Min. :34.03
## 1st Qu.:89.03 1st Qu.:27.98 1st Qu.: 98.47 1st Qu.:46.26
## Median :91.35 Median :32.65 Median :100.85 Median :51.23
## Mean :90.76 Mean :33.08 Mean :100.34 Mean :52.89
## 3rd Qu.:92.47 3rd Qu.:37.65 3rd Qu.:102.70 3rd Qu.:59.06
## Max. :95.00 Max. :50.90 Max. :104.30 Max. :73.53
##
## Vegetables Fruit Smokers PhysicalActivity
## Min. :76.10 Min. :53.70 Min. : 8.90 Min. :41.90
## 1st Qu.:80.62 1st Qu.:61.62 1st Qu.:15.03 1st Qu.:47.10
## Median :81.95 Median :63.30 Median :17.15 Median :50.40
## Mean :81.87 Mean :63.32 Mean :17.33 Mean :50.58
## 3rd Qu.:83.17 3rd Qu.:66.65 3rd Qu.:19.30 3rd Qu.:53.98
## Max. :87.60 Max. :70.30 Max. :26.00 Max. :59.70
##
## Obese NonWhite HeavyDrinkers Electoral
## Min. :23.00 Min. : 5.40 Min. :3.750 Min. : 3.00
## 1st Qu.:28.77 1st Qu.:14.35 1st Qu.:5.827 1st Qu.: 5.00
## Median :30.90 Median :21.75 Median :6.440 Median : 8.00
## Mean :31.43 Mean :23.05 Mean :6.483 Mean :10.70
## 3rd Qu.:34.38 3rd Qu.:31.45 3rd Qu.:7.322 3rd Qu.:11.75
## Max. :39.50 Max. :74.90 Max. :8.770 Max. :55.00
##
## ClintonVote Elect2016 TwoParents StudentSpending Insured
## Min. :21.63 D:20 Min. :54.10 Min. : 6.953 Min. :75.20
## 1st Qu.:35.83 R:30 1st Qu.:63.30 1st Qu.: 9.604 1st Qu.:83.42
## Median :45.91 Median :66.10 Median :11.319 Median :86.20
## Mean :43.78 Mean :66.53 Mean :11.950 Mean :86.24
## 3rd Qu.:49.98 3rd Qu.:70.42 3rd Qu.:14.072 3rd Qu.:89.67
## Max. :62.22 Max. :80.60 Max. :22.366 Max. :95.80
##
# Conduct multiple linear regression
model <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
# Create regression tables using sjPlot
tab_model(model, show.std = TRUE, show.ci = FALSE, show.se = TRUE, show.p = TRUE)
| Â | ClintonVote | ||||
|---|---|---|---|---|---|
| Predictors | Estimates | std. Error | std. Beta | standardized std. Error | p |
| (Intercept) | 0.99 | 6.01 | 0.00 | 0.09 | 0.870 |
| College | 1.12 | 0.23 | 0.69 | 0.14 | <0.001 |
| HouseholdIncome | -0.08 | 0.15 | -0.07 | 0.14 | 0.625 |
| NonWhite | 0.43 | 0.08 | 0.53 | 0.10 | <0.001 |
| Observations | 50 | ||||
| R2 / R2 adjusted | 0.616 / 0.590 | ||||
# Summary of the model
summary(model)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite,
## data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.697 -3.605 -1.036 4.495 13.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99082 6.01413 0.165 0.870
## College 1.12498 0.22514 4.997 8.88e-06 ***
## HouseholdIncome -0.07554 0.15345 -0.492 0.625
## NonWhite 0.43123 0.08246 5.230 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.5904
## F-statistic: 24.55 on 3 and 46 DF, p-value: 1.237e-09
The results show that ‘College’ and ‘NonWhite’ are significant predictors of ‘ClintonVote’, but ‘HouseholdIncome’ is not. Specifically, for each percentage point increase in the college-educated population, the percentage of votes for Clinton is expected to increase by 1.12 percentage points (p < 0.001). Similarly, for each percentage point increase in the non-white population, the percentage of votes for Clinton is expected to rise by 0.43 percentage points (p < 0.001). The standardized coefficients (0.69 for ‘College’ and 0.53 for ‘NonWhite’) show that these variables have a significant positive effect on Clinton’s vote share. On the other hand, ‘HouseholdIncome’ has a negligible negative effect on ‘ClintonVote’ (-0.08), which is not statistically significant (p = 0.625). The model explains approximately 61.6% of the variance in ‘ClintonVote’ (R-squared = 0.616), with an adjusted R-squared value of 0.590 indicating the model’s reliability. In conclusion, higher educational attainment and a larger non-white population are associated with higher percentages of votes for Clinton, whereas household income has no significant effect on voting results. The model fits the data well and explains a significant portion of the variation in voting behavior.
TASK 2
#install packages
install.packages("stargazer")
##
## The downloaded binary packages are in
## /var/folders/t_/pz759vz53jz772p82qtw5p8r0000gn/T//RtmpGN4ooR/downloaded_packages
#Load packages
library("stargazer")
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Initial model (without Region)
model1 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite, data = States)
# Extended model (with Region)
model2 <- lm(ClintonVote ~ College + HouseholdIncome + NonWhite + Region, data = States)
# Compare models using AIC and BIC
aic1 <- AIC(model1)
aic2 <- AIC(model2)
bic1 <- BIC(model1)
bic2 <- BIC(model2)
# Print AIC and BIC values
cat("AIC for model without Region: ", aic1, "\n")
## AIC for model without Region: 336.7667
cat("AIC for model with Region: ", aic2, "\n")
## AIC for model with Region: 326.4857
cat("BIC for model without Region: ", bic1, "\n")
## BIC for model without Region: 346.3268
cat("BIC for model with Region: ", bic2, "\n")
## BIC for model with Region: 341.7819
# Create regression tables using stargazer
stargazer(model1, model2, type = "text",
title = "Regression Results",
column.labels = c("Without Region", "With Region"),
covariate.labels = c("College", "Household Income", "Non-White", "Region"),
dep.var.labels = "Clinton Vote",
star.cutoffs = c(0.05, 0.01, 0.001))
##
## Regression Results
## =================================================================
## Dependent variable:
## ---------------------------------------------
## Clinton Vote
## Without Region With Region
## (1) (2)
## -----------------------------------------------------------------
## College 1.125*** 1.189***
## (0.225) (0.257)
##
## Household Income -0.076 -0.429*
## (0.153) (0.171)
##
## Non-White 0.431*** 0.525***
## (0.082) (0.082)
##
## Region 8.031**
## (2.721)
##
## RegionS -3.404
## (2.742)
##
## RegionW 5.658*
## (2.772)
##
## Constant 0.991 14.806*
## (6.014) (7.159)
##
## -----------------------------------------------------------------
## Observations 50 50
## R2 0.616 0.722
## Adjusted R2 0.590 0.684
## Residual Std. Error 6.622 (df = 46) 5.820 (df = 43)
## F Statistic 24.548*** (df = 3; 46) 18.648*** (df = 6; 43)
## =================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
# Interpret the results
summary(model1)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite,
## data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.697 -3.605 -1.036 4.495 13.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99082 6.01413 0.165 0.870
## College 1.12498 0.22514 4.997 8.88e-06 ***
## HouseholdIncome -0.07554 0.15345 -0.492 0.625
## NonWhite 0.43123 0.08246 5.230 4.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.622 on 46 degrees of freedom
## Multiple R-squared: 0.6155, Adjusted R-squared: 0.5904
## F-statistic: 24.55 on 3 and 46 DF, p-value: 1.237e-09
summary(model2)
##
## Call:
## lm(formula = ClintonVote ~ College + HouseholdIncome + NonWhite +
## Region, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3217 -4.1695 0.0435 4.2880 11.2477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.80589 7.15877 2.068 0.0447 *
## College 1.18862 0.25668 4.631 3.36e-05 ***
## HouseholdIncome -0.42863 0.17107 -2.506 0.0161 *
## NonWhite 0.52461 0.08175 6.417 9.03e-08 ***
## RegionNE 8.03052 2.72062 2.952 0.0051 **
## RegionS -3.40381 2.74204 -1.241 0.2212
## RegionW 5.65798 2.77249 2.041 0.0474 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.82 on 43 degrees of freedom
## Multiple R-squared: 0.7224, Adjusted R-squared: 0.6836
## F-statistic: 18.65 on 6 and 43 DF, p-value: 1.545e-10
In this model that includes ‘Region’, ‘College’ and ‘NonWhite’ are still significant positive predictors of Clinton’s vote share. Each percentage point increase in the college-educated population corresponds with a 1.19 percentage point increase in votes for Clinton (p < 0.001). Similarly, each percentage point increase in the non-white population is associated with a 0.52 percentage point increase in votes for Clinton. Interestingly, household income becomes a significant predictor in this model. Each unit increase in household income is associated with a 0.43 percentage point decrease in votes for Clinton (p = 0.016).
The inclusion of the ‘Region’ variable reveals significant regional differences in voting patterns. Being in the Northeast region correlates with an 8.03 percentage point increase in Clinton votes with regard to the reference region (p = 0.005). The West region also has a positive effect, with Clinton’s votes increasing by 5.66 percentage points (p = 0.047). However, the South region has no significant impact on Clinton’s vote share (p = 0.221). The extended model explains a larger proportion of the variance in Clinton’s vote share (R-squared = 0.722), indicating a better fit.
I prefer the extened model with the ‘Region’ variable because it provides a more detailed and accurate understanding of the variation in Clinton’s vote share. It has a better fit and reveals more information.
TASK 3
# Load necessary libraries
library(effects)
## Loading required package: carData
##
## Attaching package: 'carData'
## The following object is masked _by_ '.GlobalEnv':
##
## States
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:sjlabelled':
##
## as_label
# Fit the linear model with interaction
interaction_model <- lm(ClintonVote ~ College * NonWhite, data = States)
# Summary of the model
summary(interaction_model)
##
## Call:
## lm(formula = ClintonVote ~ College * NonWhite, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.446 -3.543 -0.295 3.936 13.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22574 13.26349 -1.148 0.256927
## College 1.50058 0.39913 3.760 0.000479 ***
## NonWhite 0.99725 0.47769 2.088 0.042397 *
## College:NonWhite -0.01802 0.01457 -1.236 0.222664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared: 0.6259, Adjusted R-squared: 0.6015
## F-statistic: 25.66 on 3 and 46 DF, p-value: 6.636e-10
# Plot the interaction effect using the effects package
interaction_effect <- effect("College*NonWhite", interaction_model)
plot(interaction_effect, main="Interaction Effect of College and NonWhite on ClintonVote", xlab="College (%)", ylab="Predicted ClintonVote (%)")
# Detailed model summary
summary(interaction_model)
##
## Call:
## lm(formula = ClintonVote ~ College * NonWhite, data = States)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.446 -3.543 -0.295 3.936 13.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.22574 13.26349 -1.148 0.256927
## College 1.50058 0.39913 3.760 0.000479 ***
## NonWhite 0.99725 0.47769 2.088 0.042397 *
## College:NonWhite -0.01802 0.01457 -1.236 0.222664
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.532 on 46 degrees of freedom
## Multiple R-squared: 0.6259, Adjusted R-squared: 0.6015
## F-statistic: 25.66 on 3 and 46 DF, p-value: 6.636e-10
Including the interaction term between ‘College’ and ‘NonWhite’ in the model provides a more nuanced view of the relationship between education and voting patterns. The negative but non-significant interaction term implies that the positive effect of college education on Clinton’s vote percentage could decrease slightly as the percentage of the non-white population grows. However, because the interaction is not statistically significant, the moderating effect is not sufficient enough to draw a firm conclusion.
TASK 4
# Create the binary outcome variable
States$ClintonWin <- ifelse(States$ClintonVote > 50, 1, 0)
# Inspect the new variable
table(States$ClintonWin)
##
## 0 1
## 37 13
# Fit the logistic regression model
logistic_model <- glm(ClintonWin ~ College + HouseholdIncome + Region, data = States, family = binomial)
# Summary of the logistic regression model
summary(logistic_model)
##
## Call:
## glm(formula = ClintonWin ~ College + HouseholdIncome + Region,
## family = binomial, data = States)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.16240 6.31603 -2.717 0.00658 **
## College 0.28494 0.14633 1.947 0.05150 .
## HouseholdIncome 0.07628 0.07376 1.034 0.30110
## RegionNE 2.17592 1.40578 1.548 0.12166
## RegionS -16.31016 2420.64651 -0.007 0.99462
## RegionW 2.58895 1.62430 1.594 0.11096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 57.306 on 49 degrees of freedom
## Residual deviance: 25.780 on 44 degrees of freedom
## AIC: 37.78
##
## Number of Fisher Scoring iterations: 18
library(modelsummary)
## `modelsummary` 2.0.0 now uses `tinytable` as its default table-drawing
## backend. Learn more at: https://vincentarelbundock.github.io/tinytable/
##
## Revert to `kableExtra` for one session:
##
## options(modelsummary_factory_default = 'kableExtra')
## options(modelsummary_factory_latex = 'kableExtra')
## options(modelsummary_factory_html = 'kableExtra')
##
## Silence this message forever:
##
## config_modelsummary(startup_message = FALSE)
# Create regression tables using modelsummary
model_list <- list("Logistic Regression" = logistic_model)
modelsummary(model_list, statistic = c('std.error', 'p.value'), stars = TRUE)
| Logistic Regression | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | -17.162** |
| (6.316) | |
| (0.007) | |
| College | 0.285+ |
| (0.146) | |
| (0.052) | |
| HouseholdIncome | 0.076 |
| (0.074) | |
| (0.301) | |
| RegionNE | 2.176 |
| (1.406) | |
| (0.122) | |
| RegionS | -16.310 |
| (2420.647) | |
| (0.995) | |
| RegionW | 2.589 |
| (1.624) | |
| (0.111) | |
| Num.Obs. | 50 |
| AIC | 37.8 |
| BIC | 49.3 |
| Log.Lik. | -12.890 |
| RMSE | 0.29 |
# Create regression tables using sjPlot
tab_model(logistic_model, show.std = TRUE, show.ci = TRUE, show.se = TRUE, show.p = TRUE)
| Â | ClintonWin | ||||||
|---|---|---|---|---|---|---|---|
| Predictors | Odds Ratios | std. Error | std. Beta | standardized std. Error | CI | standardized CI | p |
| (Intercept) | 0.00 | 0.00 | 0.04 | 0.05 | 0.00 – Inf | 0.00 – Inf | 0.007 |
| College | 1.33 | 0.19 | 6.05 | 5.59 | 0.00 – Inf | 0.00 – Inf | 0.052 |
| HouseholdIncome | 1.08 | 0.08 | 2.07 | 1.46 | 0.00 – Inf | 0.00 – Inf | 0.301 |
| Region [NE] | 8.81 | 12.39 | 8.81 | 12.39 | 0.00 – Inf | 0.00 – Inf | 0.122 |
| Region [S] | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 – Inf | 0.00 – Inf | 0.995 |
| Region [W] | 13.32 | 21.63 | 13.32 | 21.63 | 0.00 – Inf | 0.00 – Inf | 0.111 |
| Observations | 50 | ||||||
| R2 Tjur | 0.564 | ||||||
The odds ratio of 1.33 indicates that for every percentage point increase in the college-educated population, the chances of voting for Clinton rise by 33%. This effect is marginally significant (p-value = 0.052), indicating that education may have a positive influence on voting behavior, though the data does not firmly support this conclusion. Household income, on the other hand, does not seem to have a significant impact on the likelihood of voting for Clinton. The odds ratio of 1.08 indicates that people with higher household incomes are slightly more likely to vote for Clinton, but this effect is not statistically significant (p = 0.301). This suggests that household income does not have a significant impact on voting outcomes in this model. The regional effects provide interesting insights, though they are not statistically significant. States in the Northeast have an odds ratio of 8.81, indicating that living in the Northeast increases the likelihood of voting for Clinton by about 8.81 times when compared to the reference region. However, this effect is statistically insignificant (p = 0.122). Similarly, the odds ratio for states in the West is 13.32, which indicates a strong positive influence on voting for Clinton, but again, this result is not statistically significant (p = 0.111). In contrast, the South has an odds ratio of 0.00, indicating a negative effect on voting for Clinton, though once again, this is not statistically significant (p = 0.995).
The model fit, shown by Tjur’s R-squared value of 0.564, suggests that the predictors in the model explain about 56.4% of the variation in the binary outcome (whether or not a state voted for Clinton). This indicates a reasonably good fit, emphasizing the importance of including educational and regional factors in voting behavior analysis.
TASK 5
library(effects)
# Compute the effects for College separated by Region
college_effect <- Effect(c("College", "Region"), logistic_model)
# Print the computed effects to verify
print(college_effect)
##
## College*Region effect
## Region
## College MW NE S W
## 22 0.001529445 0.01331582 1.264117e-10 0.01998916
## 30 0.014747399 0.11650937 1.235255e-09 0.16618842
## 37 0.099098735 0.49216147 9.077778e-09 0.59427511
## 44 0.447017978 0.87687838 6.671179e-08 0.91499584
## 51 0.855922149 0.98125211 4.902589e-07 0.98751633
# Extract the predicted probabilities from the effects object
predicted_data <- as.data.frame(college_effect)
# Inspect the predicted data
print(head(predicted_data))
## College Region fit se lower upper
## 1 22 MW 0.001529445 0.003839379 1.109538e-05 0.1745571
## 2 30 MW 0.014747399 0.022696798 7.002030e-04 0.2422794
## 3 37 MW 0.099098735 0.101499274 1.170994e-02 0.5052449
## 4 44 MW 0.447017978 0.370236482 4.115875e-02 0.9383607
## 5 51 MW 0.855922149 0.283699493 6.139181e-02 0.9981501
## 6 22 NE 0.013315818 0.032057321 1.130447e-04 0.6169967
# Create a ggplot2 plot
ggplot(predicted_data, aes(x = College, y = fit, color = Region)) +
geom_line(linewidth = 1) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = Region), alpha = 0.2) +
labs(title = "Predicted Probability of Voting for Clinton by College and Region",
x = "Percentage of College Educated", y = "Predicted Probability of Voting for Clinton") +
theme_minimal()