Load the necessary packages

library(readr)
library(tibble)
library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(broom)
library(Metrics)

Question 1

Focus on the data for Black voters: use “read_csv()” to read the data, store it as “turnout”, and explain what it contains

Verbal Explanation: The dataset covers three election years (2006, 2008, and 2010) and 42 different states.

# Create a variable called "turnout"
turnout <- as_tibble(read_csv("https://bit.ly/3RFo38k")); glimpse(turnout)

## Rows: 1237 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (5): year, district, black_turnout, black_share, black_candidate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## Rows: 1,237
## Columns: 6
## $ year            <dbl> 2008, 2010, 2010, 2008, 2006, 2010, 2008, 2010, 2006, …
## $ state           <chr> "AK", "AK", "AK", "AK", "AK", "AL", "AL", "AL", "AL", …
## $ district        <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, …
## $ black_turnout   <dbl> 0.7097267, 0.4484968, 0.4484968, 0.7097267, 0.4394295,…
## $ black_share     <dbl> 0.03502312, 0.03234046, 0.03234046, 0.03502312, 0.0317…
## $ black_candidate <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

# To indicate which years are included in the dataset 
unique(turnout$year)

## [1] 2008 2010 2006

# To indicate how many different states are included in the dataset
length(unique(turnout$state))

## [1] 42

Question 2

Construct a boxplot and run the code following the description

Verbal Explanation: The boxplot below compares Black voter turnout in elections with and without co-ethnic candidates, showing both the median turnout and the spread of the data in each group. Median Black voter turnout is higher in elections with co-ethnic candidates than in those without. The interquartile range (IQR) is also wider in elections with Black candidates, indicating greater variability in turnout. These findings suggest that the presence of co-ethnic candidates may be positively associated with Black voter participation.

# Plot a boxplot
turnout <- turnout %>%
  mutate(is_black_candidate = ifelse(black_candidate == 1, "Yes", "No")) 

turnout_box <- ggplot(turnout, aes(x = is_black_candidate,
                                   y = black_turnout,
                                   fill = "indianred1")) +
  geom_boxplot() +
  labs(y = "Voter turnout (Black voters)",
       x = "One or more Black candidates") +
  theme_minimal() +
  scale_fill_identity()

# To see the boxplot
turnout_box
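
The median and IQR comparison read off the boxplot can also be checked numerically. The sketch below is only an illustration: `toy` is a small made-up data frame (not the assignment data) that reuses the column names, and applying the same `tapply()` calls to `turnout` is an assumption about how one might verify the plot.

```r
# Toy data standing in for `turnout` (values are made up for illustration)
toy <- data.frame(
  black_candidate = c(0, 0, 0, 0, 1, 1, 1, 1),
  black_turnout   = c(0.30, 0.35, 0.40, 0.45, 0.20, 0.40, 0.60, 0.80)
)

# Median turnout within each candidate group (names "0" and "1")
group_median <- tapply(toy$black_turnout, toy$black_candidate, median)

# Interquartile range of turnout within each candidate group
group_iqr <- tapply(toy$black_turnout, toy$black_candidate, IQR)

group_median
group_iqr
```

A larger median for group "1" and a larger IQR for group "1" would correspond to the two features described in the verbal explanation.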

Question 3

Run a linear regression and report the coefficient of the candidate co-ethnicity model

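Interpret both coefficients:

The intercept of 0.39386 is the estimated Black voter turnout when no co-ethnic candidate is running (black_candidate = 0), that is, a turnout of about 39.4% when turnout is read as a proportion.
The coefficient of 0.06164 is the estimated difference in turnout between elections with and without a co-ethnic candidate. Because black_candidate is a 0/1 indicator, it does not describe an effect per additional candidate: turnout is estimated to be about 0.06164 (roughly 6.2 percentage points) higher when a co-ethnic candidate is running, for an estimated turnout of about 0.4555.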

Conclusions:

Thus, the positive coefficient for black_candidate (0.06164) supports the prediction that Black voters turn out at higher rates when a co-ethnic candidate is running. It indicates a positive association between the presence of co-ethnic candidates and Black voter turnout. However, the small magnitude of the coefficient suggests that the effect of co-ethnic candidates on Black voter turnout may be relatively small.

Verbal Explanation for R2:

The R-squared value of 0.01351812 indicates that approximately 1.35% of the variability in Black voter turnout is explained by the candidate co-ethnicity model. This suggests that while the presence of co-ethnic candidates is positively associated with Black voter turnout, most of the variation is driven by factors not included in the model. No significance test is reported here, so the R-squared alone does not show whether the association is statistically significant.

# Run a linear regression
lm_1 <- lm(data = turnout, black_turnout ~ black_candidate); lm_1

## 
## Call:
## lm(formula = black_turnout ~ black_candidate, data = turnout)
## 
## Coefficients:
##     (Intercept)  black_candidate  
##         0.39386          0.06164

# Report the coefficients
lm_1 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)

term	estimate
(Intercept)	0.39
black_candidate	0.06

# Compute R2 of the candidate co-ethnicity model
cat("The R2 of the candidate co-ethnicity model is", glance(lm_1)$r.squared)

## The R2 of the candidate co-ethnicity model is 0.01351812
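
With a single 0/1 predictor, the intercept and slope of a linear regression are just group means in disguise. The sketch below illustrates this on a made-up data frame (`toy` is hypothetical, not the assignment data): the intercept matches the mean of the group coded 0, and the slope matches the difference between the two group means.

```r
# Toy data with a binary predictor (values are made up for illustration)
toy <- data.frame(
  black_candidate = c(0, 0, 0, 1, 1, 1),
  black_turnout   = c(0.35, 0.40, 0.45, 0.40, 0.50, 0.60)
)

fit <- lm(black_turnout ~ black_candidate, data = toy)

# Mean turnout in each group, for comparison with the coefficients
group_means <- tapply(toy$black_turnout, toy$black_candidate, mean)

intercept <- coef(fit)[["(Intercept)"]]      # mean of the group coded 0
slope     <- coef(fit)[["black_candidate"]]  # mean(group 1) - mean(group 0)

c(intercept = intercept, slope = slope)
group_means
```

This is why the coefficient on black_candidate is read as a difference between elections with and without a co-ethnic candidate rather than as a change per additional candidate.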

Question 4

Construct a scatter plot following the instructions

What does this graph imply about the relationship between Black voting-age population and Black turnout?

The Relationship between Black voting-age population and Black turnout:
- The plot shows the distribution of Black voter turnout across different levels of the Black voting-age population share, which indicates whether areas with a higher Black share tend to have higher Black voter turnout.
- According to the scatter plot below, there is no clear relationship: the points are widely scattered without a discernible trend, indicating that the share of the Black voting-age population does not strongly predict Black voter turnout.

What does it inform us about the relationship between Black voting-age population and the presence of a Black candidate?

The Relationship between Black voting-age population and the presence of a Black candidate:
- Coloring the points by the presence or absence of a Black (co-ethnic) candidate lets us see where along the Black voting-age population axis such candidates appear, and whether turnout differs between the two groups at similar population shares.
- In the plot below, the presence of a Black candidate does not appear to be associated with a clear difference in Black voter turnout, which suggests that other factors may play a more influential role in driving Black voter participation.

# Store a scatter plot named "turnout_scatter"
turnout <- turnout %>%
  mutate(is_black_candidate = ifelse(black_candidate == 1, "Coethnic candidate",
                                     "No coethnic candidate"))

turnout_scatter <- ggplot(turnout, aes(x = black_share,
                                       y = black_turnout,
                                       color = is_black_candidate)) +
  geom_point(alpha = 0.7) +
  labs( x = "Share of Black voting-age population",
        y = "Black voter turnout",
        color = "Candidate Type")  +
  theme_minimal()

turnout_scatter
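
The question about Black voting-age population and the presence of a Black candidate can also be approached numerically, by comparing the average population share between the two candidate groups. The sketch below uses a made-up data frame (`toy` is hypothetical, not the assignment data); running the same two lines on `turnout` is a suggestion, not something the assignment output shows.

```r
# Toy data: Black voting-age share and a 0/1 candidate indicator (made up)
toy <- data.frame(
  black_share     = c(0.05, 0.10, 0.15, 0.20, 0.40, 0.50, 0.60, 0.70),
  black_candidate = c(0, 0, 0, 0, 1, 0, 1, 1)
)

# Average Black voting-age share in districts without (0) and with (1)
# a Black candidate
share_by_group <- tapply(toy$black_share, toy$black_candidate, mean)

# Correlation between the share and the 0/1 indicator
share_cor <- cor(toy$black_share, toy$black_candidate)

share_by_group
share_cor
```

A higher mean share in group "1" and a positive correlation would indicate that Black candidates tend to run where the Black voting-age population is larger.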

Question 5

Run a linear regression of the district demographics model

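Interpret the coefficients:

The intercept of 0.3759 represents the estimated Black voter turnout when the share of the Black voting-age population is zero.
The coefficient of 0.1957 is the estimated change in Black voter turnout for a one-unit increase in black_share. Because black_share is a proportion, one unit is the full range from 0 to 1, so a one-percentage-point (0.01) increase in the Black voting-age share is associated with an increase in turnout of about 0.001957, or roughly 0.2 percentage points.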

Based on these results, we can interpret that there is a positive relationship between the share of the Black voting-age population and Black voter turnout. As the share of the Black voting-age population increases, Black voter turnout is estimated to increase as well.

Interpret for R2:

The R2 value of 0.028437 indicates that approximately 2.84% of the variability in Black voter turnout can be explained by the share of the Black voting-age population. This suggests that while the share of the Black voting-age population provides some explanatory power, other factors not included in the model also influence Black voter turnout.

Conclusions for coefficients and R2:

The coefficient indicates a positive relationship between the share of the Black voting-age population and Black voter turnout. However, the relatively low R2 value suggests that while the share of the Black voting-age population explains some variation in Black voter turnout, other factors not included in the model also influence turnout.

Interpreting RMSEs:

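The two RMSE values reported below are 0.3418771 for the “district demographics model” and 0.4485989 for the “candidate co-ethnicity model”. A lower RMSE indicates that predictions are closer to the observed values, so on these numbers the district demographics model fits better and the candidate co-ethnicity model has more error. One caveat: as computed below, each value compares observed turnout with the raw predictor (black_share or black_candidate) rather than with the model's fitted values, so these figures measure how far the predictor itself is from turnout, not the prediction error of lm_2 or lm_1. An RMSE for each model would compare black_turnout with predict(lm_2) and predict(lm_1).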

# Run a linear regression
lm_2 <- lm(black_turnout ~ black_share, data = turnout); lm_2

## 
## Call:
## lm(formula = black_turnout ~ black_share, data = turnout)
## 
## Coefficients:
## (Intercept)  black_share  
##      0.3759       0.1957

# Report the coefficients of lm_2
lm_2 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)

term	estimate
(Intercept)	0.38
black_share	0.20

# Compute R2
cat("The R2 is", glance(lm_2)$r.squared)

## The R2 is 0.028437

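# Calculate Root Mean-Squared Error (RMSE) for "district demographics model"
# Caution: this call compares observed turnout with the raw predictor black_share,
# not with the fitted values of lm_2; rmse(turnout$black_turnout, predict(lm_2))
# would give the model's RMSE.
cat("RMSE for 'district demographics model' is", rmse(turnout$black_turnout, turnout$black_share))

## RMSE for 'district demographics model' is 0.3418771

# Calculate Root Mean-Squared Error (RMSE) for "candidate co-ethnicity model"
# Caution: as above, this compares turnout with the raw 0/1 indicator, not with
# the fitted values of lm_1; rmse(turnout$black_turnout, predict(lm_1)) would
# give the model's RMSE.
cat("RMSE for 'candidate co-ethnicity model' is", rmse(turnout$black_turnout, turnout$black_candidate))

## RMSE for 'candidate co-ethnicity model' is 0.4485989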
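
An RMSE for a fitted model compares the observed outcome with the model's predictions. The sketch below shows the calculation in base R on a made-up data frame (`toy` is hypothetical, not the assignment data), and contrasts it with the distance between the outcome and the raw predictor. With the Metrics package loaded, `rmse(toy$black_turnout, predict(fit))` would be expected to return the same value as `rmse_fitted`.

```r
# Toy data (values are made up for illustration)
toy <- data.frame(
  black_share   = c(0.1, 0.2, 0.3, 0.4, 0.5),
  black_turnout = c(0.38, 0.43, 0.41, 0.47, 0.46)
)

fit <- lm(black_turnout ~ black_share, data = toy)

# RMSE of the model: observed values versus fitted values
rmse_fitted <- sqrt(mean((toy$black_turnout - predict(fit))^2))

# The same quantity computed from the residuals
rmse_resid <- sqrt(mean(residuals(fit)^2))

# For contrast: distance between the outcome and the raw predictor,
# which is not a measure of the model's prediction error
rmse_raw <- sqrt(mean((toy$black_turnout - toy$black_share)^2))

c(fitted = rmse_fitted, residuals = rmse_resid, raw_predictor = rmse_raw)
```

Because least squares with an intercept minimizes the squared error over all straight lines, `rmse_fitted` can never exceed `rmse_raw`.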

Question 6

Run a multiple regression and report the coefficients

Interpret the coefficients on the two predictors:

Coefficient for black_candidate (-0.007364):
- Holding the share of the Black voting-age population constant, elections with a co-ethnic (Black) candidate have an estimated Black voter turnout about 0.007364 (roughly 0.7 percentage points) lower than elections without one. Because black_candidate is a 0/1 indicator, the coefficient is a difference between the two groups rather than an effect per additional candidate.
Coefficient for black_share (0.207392):
- Holding candidate co-ethnicity constant, a one-unit increase in black_share (from 0 to 1) is associated with an increase of 0.207392 in estimated Black voter turnout. Equivalently, a one-percentage-point (0.01) increase in the Black voting-age share is associated with an increase of about 0.002074, or roughly 0.2 percentage points.

So, coefficients provide information on how each predictor variable (black_candidate and black_share) is associated with Black voter turnout. While the presence of Black candidates appears to have a minimal negative association with Black voter turnout, the share of the Black voting-age population has a stronger positive association with turnout.

Evaluate both the R2 and the Adjusted R2:

Both R2 and adjusted R2 measure how well the independent variables in the model explain the variation in the dependent variable (Black voter turnout). Adjusted R2 additionally penalizes the number of predictors, so it is always less than or equal to R2. That holds here: R2 is 0.02852765 and adjusted R2 is 0.02695314.

R2: The model explains approximately 2.85% of the variability in Black voter turnout. A higher R2 indicates a better fit, and this value is low, so the predictors included in the model do not explain a large portion of the variability in Black voter turnout.
Adjusted R2: The value is 0.02695314. Higher values indicate a better fit, and a value at or below 0 would indicate a model with no predictive value. Here adjusted R2 is positive but small and only slightly below R2, which is expected with just two predictors and 1,237 observations: the penalty for model size is minor, and the overall fit remains weak.

Does the relationship between black_share and black_turnout change from the previous regression?

To see whether the relationship between black_share and black_turnout has changed from the previous regression, we can compare the coefficients and the measures of model fit between the two models:

Model Fit (R-squared):
- The values for both models are very similar: the previous R2 is 0.028437 and the current one is 0.02852765, indicating that both explain a comparable proportion of the variability in Black voter turnout.
Effect of black_share:
- The previous coefficient for black_share was 0.1957 and the current one is 0.207392, which is slightly higher than in the previous regression.

Therefore, adding black_candidate as a predictor slightly alters the estimated coefficient on black_share, but both models have similar explanatory power, as indicated by their comparable R2 values. The relationship between black_share and black_turnout remains relatively consistent across the two models.

# Run a linear regression
lm_3 <- lm(black_turnout ~ black_candidate + black_share, data = turnout); lm_3

## 
## Call:
## lm(formula = black_turnout ~ black_candidate + black_share, data = turnout)
## 
## Coefficients:
##     (Intercept)  black_candidate      black_share  
##        0.375275        -0.007364         0.207392

# Report the coefficients of lm_3
lm_3 |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)

term	estimate
(Intercept)	0.38
black_candidate	-0.01
black_share	0.21

# Compute R2
cat("The R2 is", glance(lm_3)$r.squared)

## The R2 is 0.02852765

# Compute the adjusted R2
cat("The adjusted R2 is", glance(lm_3)$adj.r.squared)

## The adjusted R2 is 0.02695314
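
The link between R2 and adjusted R2 can be made explicit with the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper `adj_r2()` below is a hypothetical name introduced for illustration, and n = 1237 assumes that every row shown by glimpse() was used in the fit (no missing values).

```r
# Adjusted R2 from R2, sample size n, and number of predictors p
# (adj_r2 is an illustrative helper, not a function from any package)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple regression lm_3: two predictors, R2 as printed above
adj_lm3 <- adj_r2(0.02852765, n = 1237, p = 2)

# District demographics model lm_2: one predictor, R2 as printed in Question 5
adj_lm2 <- adj_r2(0.028437, n = 1237, p = 1)

c(lm_3 = adj_lm3, lm_2 = adj_lm2)
```

Under that assumption the first value reproduces the adjusted R2 reported for lm_3 up to rounding, and the second gives a like-for-like figure for comparing lm_2 with lm_3.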

Question 7

Interpret the intercept from the regression model with two predictors:

The intercept of the two-predictor model is 0.375275, which is the expected value of the response variable when all predictors equal zero. It represents the estimated Black voter turnout in an election with no co-ethnic candidate (black_candidate = 0) and a Black voting-age population share of zero (black_share = 0).

Is this intercept a substantively important or interesting quantity? Why or why not?

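This intercept is not a particularly important or interesting quantity on its own. It describes a district where the share of the Black voting-age population is exactly zero, and in such a district there would be no Black voters whose turnout could be measured, so the value is most likely an extrapolation beyond the observed data rather than a realistic scenario. The intercept mainly serves to anchor the regression line. Separately, in a multiple regression it is worth checking whether the independent variables are strongly correlated with one another, but that concerns the stability of the slope estimates, and the model's R2 of about 0.03 measures overall fit rather than the correlation between the predictors.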

Question 8

Comparing the candidate co-ethnicity model and the multiple regression model

What do you conclude about the relationship between co-ethnic candidates and Black voter turnout?

The candidate co-ethnicity model in Question 3 is a linear regression with Black voter turnout (black_turnout) as the outcome variable and candidate co-ethnicity (black_candidate) as the predictor. It indicates a positive association between the presence of co-ethnic candidates and Black voter turnout, since the coefficient is 0.06164.
The multiple regression model has Black turnout (black_turnout) as the outcome variable and both candidate co-ethnicity (black_candidate) and the co-ethnic voting-age population share (black_share) as predictors. In this model the coefficient on black_candidate becomes slightly negative, -0.007364.

Thus, the positive coefficient in the candidate co-ethnicity model indicates that turnout tends to be higher when a co-ethnic candidate is running, while the negative coefficient in the multiple regression model indicates that, once black_share is held constant, turnout tends to be very slightly lower when a co-ethnic candidate is running.

Do they come to similar or different conclusions about this relationship?

They come to different conclusions. As mentioned, the candidate co-ethnicity model has a positive coefficient on black_candidate, while the multiple regression model has a small negative one, so the estimated relationship changes sign once black_share is included.

Which model do you prefer and why? (Please ignore issues of statistical significance for this question and focus on discussing model fit.)

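In my opinion, the multiple regression model is preferable. The simple linear regression describes only the relationship between one predictor and the outcome, while the multiple regression model accounts for an additional relevant variable. It also fits the data somewhat better: its R2 is 0.02852765 and its adjusted R2 is 0.02695314, compared with an R2 of 0.01351812 for the candidate co-ethnicity model. Even though the multiple regression model shows a negative relationship for black_candidate, it probably offers a more accurate representation of the real-world scenario, although both models explain only a small share of the variation in turnout.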
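
A coefficient can change sign between a simple and a multiple regression when the added predictor is correlated with the original one. The sketch below constructs made-up data (all vectors are hypothetical, with no noise, so the pattern is exact) in which candidates appear only where the population share is high and the direct association with turnout is slightly negative. It illustrates the mechanism only and is not a claim about the assignment data.

```r
# Constructed data: share runs from 0.05 to 0.80; a candidate is present
# only when the share exceeds 0.40 (all values are made up)
black_share     <- (1:16) / 20
black_candidate <- as.numeric(black_share > 0.40)

# Turnout rises with share and is slightly LOWER when a candidate is present
black_turnout <- 0.30 + 0.20 * black_share - 0.01 * black_candidate

toy <- data.frame(black_turnout, black_candidate, black_share)

simple_fit   <- lm(black_turnout ~ black_candidate, data = toy)
multiple_fit <- lm(black_turnout ~ black_candidate + black_share, data = toy)

# Without black_share, the indicator absorbs the effect of the higher share
simple_coef <- coef(simple_fit)[["black_candidate"]]

# With black_share held constant, the constructed negative value is recovered
multiple_coef <- coef(multiple_fit)[["black_candidate"]]

c(simple = simple_coef, multiple = multiple_coef)
```

In this construction the simple-regression coefficient is positive while the multiple-regression coefficient is negative, which is the same kind of reversal seen between Question 3 and Question 6.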

Question 9

Run another regression, changing the predictor, and print the estimated coefficients

What does R do with these categorical variables?

In this regression the predictor is “state”, which is a categorical variable. When R encounters a categorical variable like “state”, it automatically converts it into dummy variables, representing each category with its own binary (0/1) variable except for one reference category, against which the other categories are compared. The dataset has 1,237 observations from 42 states, so the model contains an intercept and 41 state dummies; one state is omitted from the coefficient list. By default R picks the first level alphabetically as the reference, which in this case is “AK”. Comparing the coefficient names with the output of table() confirms that “AK” is the omitted state.

How do we interpret the coefficients?

In the coefficient table below, the intercept (0.55118) is the estimated Black turnout in the reference state, AK, and each state coefficient represents the difference in Black turnout between that state and the reference state.
For example, the coefficient for state “AL” is -0.11883, so Black turnout in AL is approximately 0.11883 lower than in the reference state. Similarly, the coefficient for state “NY” is -0.19762, so Black turnout in NY is approximately 0.19762 lower than in the reference state.
So, positive coefficients indicate higher Black turnout than in the reference state, while negative coefficients indicate lower Black turnout.

# Create the regression model name "lm_states"
lm_states <- lm(black_turnout ~ state, data = turnout); lm_states

## 
## Call:
## lm(formula = black_turnout ~ state, data = turnout)
## 
## Coefficients:
## (Intercept)      stateAL      stateAR      stateAZ      stateCA      stateCO  
##     0.55118     -0.11883     -0.18848     -0.19542     -0.15493     -0.04704  
##     stateCT      stateDE      stateFL      stateGA      stateIA      stateIL  
##    -0.12102     -0.01851     -0.12429     -0.13935     -0.03011     -0.18291  
##     stateIN      stateKS      stateKY      stateLA      stateMA      stateMD  
##    -0.20289     -0.16516     -0.11449     -0.09044     -0.19336     -0.06229  
##     stateME      stateMI      stateMN      stateMO      stateMS      stateNC  
##     0.35365     -0.03181     -0.08637     -0.16215     -0.14942     -0.10195  
##     stateNE      stateNH      stateNJ      stateNM      stateNV      stateNY  
##    -0.16040      0.04948     -0.15214     -0.12861     -0.16100     -0.19762  
##     stateOH      stateOK      stateOR      statePA      stateRI      stateSC  
##    -0.10503     -0.03266      0.13679     -0.21613     -0.12049     -0.11335  
##     stateTN      stateTX      stateUT      stateWA      stateWI      stateWV  
##    -0.14519     -0.26037     -0.16401     -0.20475     -0.17116     -0.17120

# Report the estimated coefficients
lm_states |>
  broom::tidy() |>
  select(term, estimate) |>
  knitr::kable(digits = 2)

term	estimate
(Intercept)	0.55
stateAL	-0.12
stateAR	-0.19
stateAZ	-0.20
stateCA	-0.15
stateCO	-0.05
stateCT	-0.12
stateDE	-0.02
stateFL	-0.12
stateGA	-0.14
stateIA	-0.03
stateIL	-0.18
stateIN	-0.20
stateKS	-0.17
stateKY	-0.11
stateLA	-0.09
stateMA	-0.19
stateMD	-0.06
stateME	0.35
stateMI	-0.03
stateMN	-0.09
stateMO	-0.16
stateMS	-0.15
stateNC	-0.10
stateNE	-0.16
stateNH	0.05
stateNJ	-0.15
stateNM	-0.13
stateNV	-0.16
stateNY	-0.20
stateOH	-0.11
stateOK	-0.03
stateOR	0.14
statePA	-0.22
stateRI	-0.12
stateSC	-0.11
stateTN	-0.15
stateTX	-0.26
stateUT	-0.16
stateWA	-0.20
stateWI	-0.17
stateWV	-0.17

# To find which state is omitted
table(turnout$state) # or using unique(turnout$state)

## 
##  AK  AL  AR  AZ  CA  CO  CT  DE  FL  GA  IA  IL  IN  KS  KY  LA  MA  MD  ME  MI 
##   5  23  13  26 161  17  11   6  69  41  11  59  29  14  20  13  32  26   1  46 
##  MN  MO  MS  NC  NE  NH  NJ  NM  NV  NY  OH  OK  OR  PA  RI  SC  TN  TX  UT  WA 
##  19  29  13  41   8   2  41  11  11  88  56  16   7  59   8  20  29  93   5  29 
##  WI  WV 
##  17  12
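
The dummy coding that R applies to a categorical predictor can be inspected directly with `model.matrix()`, and the reference category can be changed with `relevel()`. The sketch below uses a made-up three-state data frame (`toy` and the column `state_ny_ref` are hypothetical names introduced for illustration).

```r
# Toy data with a categorical predictor (values are made up)
toy <- data.frame(
  state         = c("AK", "AK", "AL", "AL", "NY", "NY"),
  black_turnout = c(0.50, 0.60, 0.40, 0.46, 0.30, 0.40)
)

# Design matrix R builds for the formula: an intercept plus one 0/1 column
# per state, except the reference state
X <- model.matrix(black_turnout ~ state, data = toy)
dummy_names <- colnames(X)

# The reference level is the first factor level (alphabetical by default)
reference <- levels(factor(toy$state))[1]

# Choosing a different reference state changes which dummy is dropped
toy$state_ny_ref <- relevel(factor(toy$state), ref = "NY")
dummy_names_ny <- colnames(model.matrix(black_turnout ~ state_ny_ref, data = toy))

dummy_names
reference
dummy_names_ny
```

The state missing from `dummy_names` is the reference, which mirrors how "AK" is absent from the coefficient table above.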

Run the same regression without the intercept

How do we interpret these coefficients?

When the regression is run without the intercept, no reference state is dropped, and each coefficient is the estimated (average) Black turnout for that state rather than a difference from a baseline.
As a result, every coefficient is positive, unlike in the original regression. For example, the coefficient for AK (0.5512) equals the intercept of the original model, and the coefficient for AL (0.4323) equals, up to rounding, that intercept plus the original AL coefficient (0.55118 - 0.11883).
The sign of a coefficient therefore no longer indicates whether a state is above or below a baseline. To compare states, compare their coefficients directly: for example, ME (0.9048) has the highest estimated Black turnout and TX (0.2908) the lowest.

# Again, run the same regression model without intercept
lm_states <- lm(black_turnout ~ 0 + state, data = turnout); lm_states

## 
## Call:
## lm(formula = black_turnout ~ 0 + state, data = turnout)
## 
## Coefficients:
## stateAK  stateAL  stateAR  stateAZ  stateCA  stateCO  stateCT  stateDE  
##  0.5512   0.4323   0.3627   0.3558   0.3962   0.5041   0.4302   0.5327  
## stateFL  stateGA  stateIA  stateIL  stateIN  stateKS  stateKY  stateLA  
##  0.4269   0.4118   0.5211   0.3683   0.3483   0.3860   0.4367   0.4607  
## stateMA  stateMD  stateME  stateMI  stateMN  stateMO  stateMS  stateNC  
##  0.3578   0.4889   0.9048   0.5194   0.4648   0.3890   0.4018   0.4492  
## stateNE  stateNH  stateNJ  stateNM  stateNV  stateNY  stateOH  stateOK  
##  0.3908   0.6007   0.3990   0.4226   0.3902   0.3536   0.4461   0.5185  
## stateOR  statePA  stateRI  stateSC  stateTN  stateTX  stateUT  stateWA  
##  0.6880   0.3350   0.4307   0.4378   0.4060   0.2908   0.3872   0.3464  
## stateWI  stateWV  
##  0.3800   0.3800
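
The correspondence between the two parameterizations can be checked on a small example. The sketch below uses a made-up three-state data frame (`toy` is hypothetical, not the assignment data): the no-intercept coefficients match the per-state means, and the intercept plus a state dummy from the default model gives the same number.

```r
# Toy data with a categorical predictor (values are made up)
toy <- data.frame(
  state         = c("AK", "AK", "AL", "AL", "NY", "NY"),
  black_turnout = c(0.50, 0.60, 0.40, 0.46, 0.30, 0.40)
)

with_int <- lm(black_turnout ~ state, data = toy)      # reference-state coding
no_int   <- lm(black_turnout ~ 0 + state, data = toy)  # one coefficient per state

# Per-state average turnout, computed directly
state_means <- tapply(toy$black_turnout, toy$state, mean)

# No-intercept coefficients, stripped of names for comparison
no_int_coefs <- as.numeric(coef(no_int))

# Rebuilding the AL mean from the default model: intercept + AL dummy
al_from_default <- coef(with_int)[["(Intercept)"]] + coef(with_int)[["stateAL"]]

no_int_coefs
as.numeric(state_means)
al_from_default
```

Both models produce the same fitted values; only the way the state averages are expressed differs.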

Note: In the last assignment, the TA mentioned that my answers were too short, and I lost points for that. This time I have therefore made my answers as detailed and thorough as I can. I hope I will not lose points again.

PBA 2024 Assignment 3

112077440

Thanawan Tachasomboonsuk (Yong)