library(readr)
library(tibble)
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(broom)
library(Metrics)
Verbal Explanation: The dataset covers three election years (2006, 2008, and 2010) and includes 42 different states.
# Create a variable called "turnout"
turnout <- as_tibble(read_csv("https://bit.ly/3RFo38k")); glimpse(turnout)
## Rows: 1237 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (5): year, district, black_turnout, black_share, black_candidate
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 1,237
## Columns: 6
## $ year <dbl> 2008, 2010, 2010, 2008, 2006, 2010, 2008, 2010, 2006, …
## $ state <chr> "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL", "AL", …
## $ district <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, …
## $ black_turnout <dbl> 0.7097267, 0.4484968, 0.4484968, 0.7097267, 0.4394295,…
## $ black_share <dbl> 0.03502312, 0.03234046, 0.03234046, 0.03502312, 0.0317…
## $ black_candidate <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# To indicate which years are included in the dataset
unique(turnout$year)
## [1] 2008 2010 2006
# To indicate how many different states are included in the dataset
length(unique(turnout$state))
## [1] 42
Verbal Explanation: The boxplot below compares Black voter turnout in
elections with and without a co-ethnic candidate, showing both the median
turnout and the spread of the data in each group. Median Black voter
turnout is higher in elections with a co-ethnic candidate than in those
without. The interquartile range (IQR) is also wider in elections with
Black candidates, which indicates greater variability in turnout levels.
These findings suggest that the presence of co-ethnic candidates may
positively influence Black voter participation.
# Plot a boxplot
turnout <- turnout %>%
  mutate(is_black_candidate = ifelse(black_candidate == 1, "Yes", "No"))
turnout_box <- ggplot(turnout, aes(x = is_black_candidate,
                                   y = black_turnout)) +
  # a constant fill color is set on the geom rather than mapped inside aes()
  geom_boxplot(fill = "indianred1") +
  labs(y = "Voter turnout (Black voters)",
       x = "One or more Black candidates") +
  theme_minimal()
# To see the boxplot
turnout_box
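The medians and IQRs that a boxplot displays can also be computed directly, which makes it easier to check the verbal explanation against numbers. The sketch below uses a small made-up data frame (`toy`, with invented values, not the turnout data) that borrows the column names used above; swapping `turnout` in for `toy` should give the figures behind the plot.

```r
# Hypothetical toy data: values are invented for illustration only
toy <- data.frame(
  black_candidate = c(0, 0, 0, 0, 1, 1, 1, 1),
  black_turnout   = c(0.30, 0.35, 0.40, 0.45, 0.20, 0.40, 0.60, 0.80)
)

# Median turnout within each group (the thick line inside each box)
group_median <- tapply(toy$black_turnout, toy$black_candidate, median)

# Interquartile range within each group (the height of each box)
group_iqr <- tapply(toy$black_turnout, toy$black_candidate, IQR)

print(group_median)
print(group_iqr)
```

In this toy example the group with a candidate has both the higher median and the wider IQR, which is the same pattern the explanation describes for the real data.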
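Interpret both coefficients:
The intercept (0.39386) is the predicted Black voter turnout when black_candidate equals 0, that is, about 39.4% in elections without a co-ethnic candidate. The coefficient on black_candidate (0.06164) is the difference in predicted turnout when a co-ethnic candidate is running: about 6.2 percentage points higher, for a predicted turnout of roughly 45.6%.
Conclusions:
The prediction is that Black voters turn out at higher rates when a co-ethnic candidate is running, and the positive coefficient on black_candidate (0.06164) supports it: there is a positive association between the presence of co-ethnic candidates and Black voter turnout. However, the magnitude of the coefficient is modest, which suggests that the effect of co-ethnic candidates on Black voter turnout may be relatively small.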
Verbal Explanation for R2:
The R-squared value of 0.01351812 indicates that approximately 1.35%
of the variability in Black voter turnout is explained by the candidate
co-ethnicity model. Although the presence of co-ethnic candidates is
positively associated with Black voter turnout, it accounts for very
little of the variation, so other factors not included in the model must
explain most of the differences in turnout.
# Run a linear regression
lm_1 <- lm(data = turnout, black_turnout ~ black_candidate); lm_1
##
## Call:
## lm(formula = black_turnout ~ black_candidate, data = turnout)
##
## Coefficients:
## (Intercept) black_candidate
## 0.39386 0.06164
# Report the coefficients
lm_1 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)
| term | estimate |
|---|---|
| (Intercept) | 0.39 |
| black_candidate | 0.06 |
# Compute R2 of the candidate co-ethnicity model
cat("The R2 of the candidate co-ethnicity model is", glance(lm_1)$r.squared)
## The R2 of the candidate co-ethnicity model is 0.01351812
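With a single 0/1 predictor, the regression coefficients are just group means in disguise, which is what makes the interpretation above possible. A minimal sketch on invented data (the `toy` data frame is hypothetical and only reuses the column names from this report):

```r
# Hypothetical toy data: values are invented for illustration only
toy <- data.frame(
  black_candidate = c(0, 0, 0, 1, 1, 1),
  black_turnout   = c(0.30, 0.40, 0.50, 0.35, 0.45, 0.70)
)

fit <- lm(black_turnout ~ black_candidate, data = toy)

# Group means computed by hand
mean_without <- mean(toy$black_turnout[toy$black_candidate == 0])
mean_with    <- mean(toy$black_turnout[toy$black_candidate == 1])

# The intercept matches the mean without a candidate, and the slope
# matches the difference between the two group means
print(coef(fit))
print(c(mean_without, mean_with - mean_without))
```

The same logic applies to lm_1: its intercept is the average turnout in elections without a Black candidate, and its slope is the gap between the two averages.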
What does this graph imply about the relationship between Black voting-age population and Black turnout?
What does it inform us about the relationship between Black voting-age population and the presence of a Black candidate?
# Store a scatter plot named "turnout_scatter"
turnout <- turnout %>%
  mutate(is_black_candidate = ifelse(black_candidate == 1,
                                     "Coethnic candidate",
                                     "No coethnic candidate"))
turnout_scatter <- ggplot(turnout, aes(x = black_share,
                                       y = black_turnout,
                                       color = is_black_candidate)) +
  geom_point(alpha = 0.7) +
  labs(x = "Share of Black voting-age population",
       y = "Black voter turnout",
       color = "Candidate Type") +
  theme_minimal()
turnout_scatter
Interpret for coefficients:
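The intercept (0.3759) is the predicted Black voter turnout in a district where black_share is 0. The slope (0.1957) is the change in predicted turnout for a one-unit change in black_share, that is, moving from a 0% to a 100% Black voting-age population; more practically, a 10 percentage point increase in the Black share of the voting-age population is associated with an increase in Black voter turnout of about 2 percentage points. Based on these results, there is a positive relationship between the share of the Black voting-age population and Black voter turnout.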
Interpret for R2:
The R2 value of 0.028437 indicates that approximately 2.84% of the variability in Black voter turnout is explained by the share of the Black voting-age population. The share of the Black voting-age population therefore has some explanatory power, but most of the variation in Black voter turnout is due to factors not included in the model.
Conclusions for coefficients and R2:
Taken together, the positive coefficient and the low R2 say that Black voter turnout tends to be higher where the Black share of the voting-age population is larger, but that this share alone is a weak predictor of turnout in any given district.
Interpreting RMSEs:
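The reported RMSE is 0.3418771 for the “district demographics model” and 0.4485989 for the “candidate co-ethnicity model”. A lower RMSE means that predictions sit closer to the observed values, so on these numbers the district demographics model fits the data better and predicts more precisely, while the higher value for the candidate co-ethnicity model indicates more error. One caveat applies: both values were computed by passing the predictor columns (black_share and black_candidate) to rmse() directly, so they measure the distance between turnout and the raw predictor values rather than each model's predictions. An RMSE based on predict(lm_2) and predict(lm_1) would be the proper basis for comparing the two models.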
# Run a linear regression
lm_2 <- lm(black_turnout ~ black_share, data = turnout); lm_2
##
## Call:
## lm(formula = black_turnout ~ black_share, data = turnout)
##
## Coefficients:
## (Intercept) black_share
## 0.3759 0.1957
# Report the coefficients of lm_2
lm_2 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)
| term | estimate |
|---|---|
| (Intercept) | 0.38 |
| black_share | 0.20 |
# Compute R2
cat("The R2 is", glance(lm_2)$r.squared)
## The R2 is 0.028437
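To make the size of the black_share coefficient concrete, the reported estimates for lm_2 can be plugged into the regression line by hand. The sketch below uses the coefficients printed above; the three district shares are hypothetical values chosen only for illustration.

```r
# Coefficients as reported for lm_2 above
intercept <- 0.3759
slope     <- 0.1957

# Hypothetical district shares chosen for illustration
share <- c(0.10, 0.20, 0.50)

# Predicted Black voter turnout at each share
predicted <- intercept + slope * share
print(predicted)

# Change in predicted turnout for a 10 percentage point rise in share
print(slope * 0.10)
```

A 10 percentage point difference in black_share moves the prediction by just under 0.02, or about 2 percentage points of turnout.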
# Calculate Root Mean-Squared Error (RMSE) for "district demographics model"
# Caution: the second argument is the raw predictor, not the model's
# predictions; a model RMSE would use predict(lm_2) in its place
cat("RMSE for 'district demographics model' is", rmse(turnout$black_turnout, turnout$black_share))
## RMSE for 'district demographics model' is 0.3418771
# Calculate Root Mean-Squared Error (RMSE) for "candidate co-ethnicity model"
# Caution: as above, a model RMSE would use predict(lm_1) as the second argument
cat("RMSE for 'candidate co-ethnicity model' is", rmse(turnout$black_turnout, turnout$black_candidate))
## RMSE for 'candidate co-ethnicity model' is 0.4485989
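A model's RMSE is the square root of the mean squared residual, so it has to be computed from the model's fitted values. The sketch below shows that calculation in base R on invented data (`toy` is hypothetical, not the turnout data) and contrasts it with the distance between the outcome and the raw predictor, which is a different quantity.

```r
# Hypothetical toy data: values are invented for illustration only
toy <- data.frame(
  black_share   = c(0.05, 0.10, 0.20, 0.40, 0.60),
  black_turnout = c(0.35, 0.42, 0.38, 0.50, 0.47)
)

fit <- lm(black_turnout ~ black_share, data = toy)

# RMSE from the model's fitted values: sqrt of the mean squared residual
rmse_model <- sqrt(mean((toy$black_turnout - predict(fit))^2))

# Distance between the outcome and the raw predictor, for contrast;
# this number does not depend on the fitted model at all
rmse_raw <- sqrt(mean((toy$black_turnout - toy$black_share)^2))

print(c(model = rmse_model, raw_predictor = rmse_raw))
```

With the Metrics package loaded, the first quantity should correspond to a call of the form rmse(actual, predicted) with predict(fit) as the second argument.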
Interpret the coefficients on the two predictors:
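The coefficients show how each predictor (black_candidate and black_share) is associated with Black voter turnout when the other predictor is held constant. The coefficient on black_candidate (-0.007364) is negative but very close to zero: with black_share held fixed, the presence of a Black candidate is associated with about 0.7 percentage points lower turnout. The coefficient on black_share (0.207392) is positive and larger: a 10 percentage point increase in the Black share of the voting-age population is associated with about 2.1 percentage points higher turnout, holding black_candidate fixed.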
Evaluate both the R2 and the Adjusted R2:
R2 and adjusted R2 both measure how well the independent variables in the model explain the variation in the dependent variable (Black voter turnout). Adjusted R2 penalizes the model for each additional predictor, so it is always less than or equal to R2. The results below follow this pattern: R2 is 0.02852765 and adjusted R2 is 0.02695314. Both values are low, meaning the model explains less than 3% of the variation in turnout.
Does the relationship between black_share and black_turnout change from the previous regression?
To see whether the relationship between black_share and black_turnout has changed from the previous regression, we can compare the coefficients and the measures of model fit between the two models. The coefficient on black_share rises slightly, from 0.1957 in the previous regression to 0.207392 once black_candidate is added, and R2 barely moves (0.028437 versus 0.02852765).
Therefore, adding black_candidate only slightly alters the effect size of black_share on Black voter turnout, and both models have similar explanatory power, as their comparable R2 values show. The relationship between black_share and black_turnout remains essentially consistent across the two models.
# Run a linear regression
lm_3 <- lm(black_turnout ~ black_candidate + black_share, data = turnout); lm_3
##
## Call:
## lm(formula = black_turnout ~ black_candidate + black_share, data = turnout)
##
## Coefficients:
## (Intercept) black_candidate black_share
## 0.375275 -0.007364 0.207392
# Report the coefficients of lm_3
lm_3 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)
| term | estimate |
|---|---|
| (Intercept) | 0.38 |
| black_candidate | -0.01 |
| black_share | 0.21 |
# Compute R2
cat("The R2 is", glance(lm_3)$r.squared)
## The R2 is 0.02852765
# Compute the adjusted R2
cat("The adjusted R2 is", glance(lm_3)$adj.r.squared)
## The adjusted R2 is 0.02695314
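The adjusted R2 can be reproduced from the ordinary R2 with the usual formula 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below plugs in the values reported above and assumes that all 1,237 rows entered the regression (that is, no rows were dropped for missing values).

```r
r2 <- 0.02852765  # R2 reported for lm_3 above
n  <- 1237        # rows in the dataset (assumed to all be used in the fit)
p  <- 2           # predictors: black_candidate and black_share

# Adjusted R2 shrinks R2 according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)
```

The result should agree with the adjusted R2 that glance() reports, up to rounding of the inputs.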
Interpret the intercept from the regression model with two predictors:
The intercept of the model with two predictors is 0.375275, which is the expected value of the response variable when all predictors equal zero. It represents the estimated Black voter turnout (about 37.5%) in a district with no Black candidate (black_candidate = 0) and a Black voting-age population share of zero (black_share = 0).
Is this intercept a substantively important or interesting quantity? Why or why not?
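This intercept is not a substantively important quantity. It describes Black voter turnout in a district whose Black voting-age population share is zero, and a district with no Black voting-age residents has no Black voters whose turnout could be measured, so the value is an extrapolation that mainly serves to anchor the regression line. On a related point, in multiple linear regression the independent variables may be correlated with one another, so it is worth checking how strongly black_candidate and black_share are correlated (a common rule of thumb treats a correlation above roughly 0.6 as high). The R2 of about 0.03 reported for this model measures model fit, not the correlation between the two predictors, so it cannot be used for that check.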
What do you conclude about the relationship between co-ethnic candidates and Black voter turnout?
In the candidate co-ethnicity model, the coefficient on black_candidate is positive (0.06164): elections with a co-ethnic candidate have higher average Black voter turnout. In the multiple regression model, which also includes black_share, the coefficient is slightly negative (-0.007364) and close to zero. Once the Black share of the voting-age population is taken into account, the presence of a co-ethnic candidate has almost no remaining association with Black voter turnout.
Do they come to similar or different conclusions about this relationship?
They come to different conclusions. The candidate co-ethnicity model has a positive coefficient on black_candidate, while the multiple regression model has a small negative one, so the two models disagree about the direction of the relationship.
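One way a coefficient can be positive on its own and vanish once a second predictor is added is when the two predictors are correlated and only one of them drives the outcome. The sketch below illustrates that mechanism on constructed data; every value is hypothetical, and the setup is an assumption made for illustration, not a claim about what produced the estimates in this report.

```r
# Hypothetical toy data, constructed so that turnout depends only on share
share     <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30,
               0.35, 0.40, 0.45, 0.50, 0.55, 0.60)
candidate <- c(0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1)  # more common at high share
turnout_y <- 0.35 + 0.20 * share                    # no candidate effect by construction

toy <- data.frame(share, candidate, turnout_y)

bivariate <- lm(turnout_y ~ candidate, data = toy)
multiple  <- lm(turnout_y ~ candidate + share, data = toy)

# Alone, candidate picks up the effect of share and looks positive
print(coef(bivariate)[["candidate"]])

# With share in the model, the candidate coefficient is essentially zero
print(coef(multiple)[["candidate"]])
```

Here the bivariate coefficient is positive only because candidates appear more often in high-share districts.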
Which model do you prefer and why? (Please ignore issues of statistical significance for this question and focus on discussing model fit.)
I prefer the multiple regression model. The candidate co-ethnicity model describes the relationship between a single predictor and the target variable, whereas the multiple regression model also accounts for district demographics. Its fit is also better: R2 is 0.02852765, compared with 0.01351812 for the candidate co-ethnicity model. Even though its coefficient on black_candidate is negative, it probably offers a more accurate representation of the real-world scenario because it accounts for another relevant variable.
What does R do with these categorical variables?
“state” is a categorical variable, so R cannot enter it into the regression as a single number. When R encounters a categorical variable like “state”, it automatically converts it into dummy variables, representing each category with its own binary variable except for one reference category, against which the other categories are compared. The dataset has 1,237 observations spread across 42 states, so the model contains an intercept and 41 state dummies. R normally picks the first level alphabetically as the “reference” or “base” category, which in this case is “AK”. Calling table() on the state column lists all 42 states, and comparing that list with the coefficient names confirms that “AK” is the omitted state.
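How do we interpret the coefficients?
The intercept (0.55118) is the average Black voter turnout across the observations in the reference state, “AK”. Each state coefficient is the difference between that state's average turnout and the average for “AK”. For example, the coefficient for stateAL (-0.11883) means that average Black voter turnout in “AL” is about 11.9 percentage points lower than in “AK”.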
# Create the regression model named "lm_states"
lm_states <- lm(black_turnout ~ state, data = turnout); lm_states
##
## Call:
## lm(formula = black_turnout ~ state, data = turnout)
##
## Coefficients:
## (Intercept) stateAL stateAR stateAZ stateCA stateCO
## 0.55118 -0.11883 -0.18848 -0.19542 -0.15493 -0.04704
## stateCT stateDE stateFL stateGA stateIA stateIL
## -0.12102 -0.01851 -0.12429 -0.13935 -0.03011 -0.18291
## stateIN stateKS stateKY stateLA stateMA stateMD
## -0.20289 -0.16516 -0.11449 -0.09044 -0.19336 -0.06229
## stateME stateMI stateMN stateMO stateMS stateNC
## 0.35365 -0.03181 -0.08637 -0.16215 -0.14942 -0.10195
## stateNE stateNH stateNJ stateNM stateNV stateNY
## -0.16040 0.04948 -0.15214 -0.12861 -0.16100 -0.19762
## stateOH stateOK stateOR statePA stateRI stateSC
## -0.10503 -0.03266 0.13679 -0.21613 -0.12049 -0.11335
## stateTN stateTX stateUT stateWA stateWI stateWV
## -0.14519 -0.26037 -0.16401 -0.20475 -0.17116 -0.17120
# Report the coefficients
lm_states |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)
| term | estimate |
|---|---|
| (Intercept) | 0.55 |
| stateAL | -0.12 |
| stateAR | -0.19 |
| stateAZ | -0.20 |
| stateCA | -0.15 |
| stateCO | -0.05 |
| stateCT | -0.12 |
| stateDE | -0.02 |
| stateFL | -0.12 |
| stateGA | -0.14 |
| stateIA | -0.03 |
| stateIL | -0.18 |
| stateIN | -0.20 |
| stateKS | -0.17 |
| stateKY | -0.11 |
| stateLA | -0.09 |
| stateMA | -0.19 |
| stateMD | -0.06 |
| stateME | 0.35 |
| stateMI | -0.03 |
| stateMN | -0.09 |
| stateMO | -0.16 |
| stateMS | -0.15 |
| stateNC | -0.10 |
| stateNE | -0.16 |
| stateNH | 0.05 |
| stateNJ | -0.15 |
| stateNM | -0.13 |
| stateNV | -0.16 |
| stateNY | -0.20 |
| stateOH | -0.11 |
| stateOK | -0.03 |
| stateOR | 0.14 |
| statePA | -0.22 |
| stateRI | -0.12 |
| stateSC | -0.11 |
| stateTN | -0.15 |
| stateTX | -0.26 |
| stateUT | -0.16 |
| stateWA | -0.20 |
| stateWI | -0.17 |
| stateWV | -0.17 |
# To find which state is omitted
table(turnout$state) # or using unique(turnout$state)
##
## AK AL AR AZ CA CO CT DE FL GA IA IL IN KS KY LA MA MD ME MI
## 5 23 13 26 161 17 11 6 69 41 11 59 29 14 20 13 32 26 1 46
## MN MO MS NC NE NH NJ NM NV NY OH OK OR PA RI SC TN TX UT WA
## 19 29 13 41 8 2 41 11 11 88 56 16 7 59 8 20 29 93 5 29
## WI WV
## 17 12
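The dummy variables R creates can be inspected directly with model.matrix(), which returns the design matrix for a formula. The sketch below uses an invented three-state data frame (`toy` is hypothetical) to show that every state except the first alphabetically gets its own 0/1 column, and that the omitted state can be found by comparing the state list with the column names.

```r
# Hypothetical toy data with three states; values are invented
toy <- data.frame(
  state         = c("AK", "AK", "AL", "AL", "AR", "AR"),
  black_turnout = c(0.50, 0.60, 0.40, 0.44, 0.30, 0.42)
)

# The design matrix R builds for the formula: an intercept column plus
# one 0/1 column per state, except the first level alphabetically
X <- model.matrix(black_turnout ~ state, data = toy)
print(colnames(X))

# The state without its own column is the reference category
all_states <- sort(unique(toy$state))
omitted    <- setdiff(paste0("state", all_states), colnames(X))
print(omitted)
```

Applied to the full data, the same comparison should single out the one state that has no coefficient in the regression output.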
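How do we interpret these coefficients?
Without an intercept there is no reference category, so every state gets its own coefficient, and each coefficient is simply that state's average Black voter turnout. For example, stateAK (0.5512) equals the intercept of the previous model, and stateAL (0.4323) equals the previous intercept plus the previous stateAL coefficient (0.55118 - 0.11883).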
# Again, run the same regression model without an intercept
lm_states <- lm(black_turnout ~ 0 + state, data = turnout); lm_states
##
## Call:
## lm(formula = black_turnout ~ 0 + state, data = turnout)
##
## Coefficients:
## stateAK stateAL stateAR stateAZ stateCA stateCO stateCT stateDE
## 0.5512 0.4323 0.3627 0.3558 0.3962 0.5041 0.4302 0.5327
## stateFL stateGA stateIA stateIL stateIN stateKS stateKY stateLA
## 0.4269 0.4118 0.5211 0.3683 0.3483 0.3860 0.4367 0.4607
## stateMA stateMD stateME stateMI stateMN stateMO stateMS stateNC
## 0.3578 0.4889 0.9048 0.5194 0.4648 0.3890 0.4018 0.4492
## stateNE stateNH stateNJ stateNM stateNV stateNY stateOH stateOK
## 0.3908 0.6007 0.3990 0.4226 0.3902 0.3536 0.4461 0.5185
## stateOR statePA stateRI stateSC stateTN stateTX stateUT stateWA
## 0.6880 0.3350 0.4307 0.4378 0.4060 0.2908 0.3872 0.3464
## stateWI stateWV
## 0.3800 0.3800
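The link between the two versions of the state regression can be checked on a small example: without an intercept the coefficients are the group means, and with an intercept each group mean is recovered as intercept plus the state's coefficient. The data below are invented (`toy` is hypothetical) and only reuse the column names from this report.

```r
# Hypothetical toy data with three states; values are invented
toy <- data.frame(
  state         = c("AK", "AK", "AL", "AL", "AR", "AR"),
  black_turnout = c(0.50, 0.60, 0.40, 0.44, 0.30, 0.42)
)

with_intercept <- lm(black_turnout ~ state, data = toy)
no_intercept   <- lm(black_turnout ~ 0 + state, data = toy)

# Average turnout per state, computed directly
state_means <- tapply(toy$black_turnout, toy$state, mean)

# Without an intercept, each coefficient is a state mean
print(coef(no_intercept))
print(state_means)

# With an intercept, reference mean + difference recovers a state's mean
al_mean <- coef(with_intercept)[["(Intercept)"]] +
  coef(with_intercept)[["stateAL"]]
print(al_mean)
```

Both specifications describe the same fitted values; they differ only in whether the state averages are reported directly or as differences from the reference state.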
Note: On the last assignment, the TA mentioned that my answers were too short, and I lost points for that. This time I have made my answers as thorough and detailed as I can, and I hope I will not lose points again.