In Project 3, I use a multiple linear regression model to examine how the voting-eligible population and the total ballots counted relate to voter turnout percentage across the U.S. The dataset includes the percent of total ballots counted, the voting-eligible population, and the total ballots counted.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Documents/School/Data 101/Data project/Project 3")
dataset <- read.csv("voter_count.csv")
colSums(is.na(dataset))
##                          year                        region
##                             0                             0
##    voting_eligible_population         total_ballots_counted
##                             0                           223
##                highest_office percent_total_ballots_counted
##                             1                           223
##        percent_highest_office
##                             1
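The NA counts above matter because `lm()` drops incomplete rows on its own when the model is fit. A minimal sketch of making that step explicit with base R's `complete.cases()`, using a small made-up data frame (`toy`, `model_cols`, and `toy_complete` are hypothetical names, and the values are invented, not rows from voter_count.csv):

```r
# Toy stand-in for the real data; values are invented for illustration.
toy <- data.frame(
  voting_eligible_population    = c(1000, 2000, 3000, 4000),
  total_ballots_counted         = c(400, NA, 1500, 2200),
  percent_total_ballots_counted = c(0.40, NA, 0.50, 0.55)
)

# Columns the regression model uses.
model_cols <- c("voting_eligible_population",
                "total_ballots_counted",
                "percent_total_ballots_counted")

# Keep only rows with no missing values in those columns.
toy_complete <- toy[complete.cases(toy[, model_cols]), ]

nrow(toy)           # rows before
nrow(toy_complete)  # rows after dropping incomplete cases
```

Applying the same idea to `dataset` before fitting would make the number of rows used by the model visible up front.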
str(dataset)
## 'data.frame': 936 obs. of 7 variables:
## $ year : int 2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
## $ region : chr "United States" "Alabama" "Alaska" "Arizona" ...
## $ voting_eligible_population : int 227157964 3588783 520562 4510186 2117881 24440416 3800664 2577311 681526 495899 ...
## $ total_ballots_counted : int 83262122 1191274 285431 1537671 852642 7513972 2080071 1096556 238110 177176 ...
## $ highest_office : int 81687059 1180413 282382 1506416 848592 7317581 2041020 1089880 234038 175071 ...
## $ percent_total_ballots_counted: num 0.367 0.332 0.548 0.341 0.403 ...
## $ percent_highest_office : num 0.36 0.329 0.542 0.334 0.401 ...
summary(dataset)
##       year         region          voting_eligible_population
##  Min.   :1980   Length:936         Min.   :   270122
##  1st Qu.:1988   Class :character   1st Qu.:   999644
##  Median :1997   Mode  :character   Median :  2662524
##  Mean   :1997                      Mean   :  7277622
##  3rd Qu.:2006                      3rd Qu.:  4569632
##  Max.   :2014                      Max.   :227157964
##
##  total_ballots_counted highest_office      percent_total_ballots_counted
##  Min.   :   122356     Min.   :   117623   Min.   :0.2507
##  1st Qu.:   422851     1st Qu.:   488820   1st Qu.:0.4338
##  Median :  1170867     Median :  1236230   Median :0.5234
##  Mean   :  3074280     Mean   :  3509231   Mean   :0.5183
##  3rd Qu.:  2395791     3rd Qu.:  2336586   3rd Qu.:0.6046
##  Max.   :132609063     Max.   :131304731   Max.   :0.7877
##  NA's   :223           NA's   :1           NA's   :223
##  percent_highest_office
##  Min.   :0.2020
##  1st Qu.:0.4141
##  Median :0.5010
##  Mean   :0.4993
##  3rd Qu.:0.5839
##  Max.   :0.7837
##  NA's   :1
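The maximum voting-eligible population (227,157,964) matches the "United States" row in the `str()` output, so the data appears to mix national totals with state-level rows. If a state-level model is the goal, one option is to filter the national rows out before fitting. A sketch with a made-up data frame (`toy` and `states_only` are hypothetical names; the region names mirror the `str()` output, but the numbers are invented):

```r
# Toy stand-in; the numbers are invented for illustration.
toy <- data.frame(
  region = c("United States", "Alabama", "Alaska"),
  voting_eligible_population = c(9000, 6000, 3000)
)

# Drop the national aggregate so each remaining row is one state.
states_only <- subset(toy, region != "United States")
states_only
```

Whether to exclude those rows depends on the question being asked; this only shows how the filter could be written.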
multiple <- lm(percent_total_ballots_counted ~ voting_eligible_population + total_ballots_counted, data = dataset)
summary(multiple)
##
## Call:
## lm(formula = percent_total_ballots_counted ~ voting_eligible_population +
## total_ballots_counted, data = dataset)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.256095 -0.080239  0.004202  0.083034  0.257083
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                 5.188e-01  4.038e-03 128.498  < 2e-16 ***
## voting_eligible_population -5.836e-09  8.073e-10  -7.229 1.26e-12 ***
## total_ballots_counted       1.156e-08  1.630e-09   7.092 3.21e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1041 on 710 degrees of freedom
## (223 observations deleted due to missingness)
## Multiple R-squared: 0.06857, Adjusted R-squared: 0.06595
## F-statistic: 26.14 on 2 and 710 DF, p-value: 1.116e-11
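The multiple regression output shows how the voting-eligible population and the total ballots counted relate to the voter turnout percentage. Both predictors are statistically significant (p < 0.001). Holding total ballots counted constant, turnout decreases as the voting-eligible population increases (coefficient -5.836e-09). Holding the voting-eligible population constant, turnout increases as total ballots counted increases (coefficient 1.156e-08). The model explains only about 6.9% of the variation in turnout (multiple R-squared 0.06857).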
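Because both predictors are raw counts, the coefficients are tiny and hard to read. A small sketch that rescales them to "per one million people or ballots", using the estimates copied from the `summary(multiple)` output (the variable names here are hypothetical):

```r
# Coefficients copied from the summary(multiple) output above.
b_vep     <- -5.836e-09  # voting_eligible_population
b_ballots <-  1.156e-08  # total_ballots_counted

# Change in predicted turnout (as a proportion) per one million units.
per_million_vep     <- b_vep * 1e6
per_million_ballots <- b_ballots * 1e6

# Convert to percentage points for readability.
round(c(vep = per_million_vep, ballots = per_million_ballots) * 100, 3)
```

Each value is the change in predicted turnout, in percentage points, for one million more eligible voters or ballots with the other predictor held fixed.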
To check whether the model's assumptions hold, the steps below examine the residuals for patterns, for changes in spread, and for signs that observations are related to each other. Diagnostic plots help with this.
plot(resid(multiple), type="b", main = "Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)
This plot shows a pattern in the residuals rather than random scatter. The pattern
repeats, which suggests that the observations might not be independent.
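A numeric companion to the residuals-versus-order plot is the Durbin-Watson statistic, which ranges from 0 to 4 and sits near 2 when successive residuals are uncorrelated. The sketch below computes it by hand in base R on simulated data (`x`, `y`, `fit`, and `dw` are hypothetical names; nothing here uses voter_count.csv). The result depends on row order, so it is only meaningful when the rows have a natural sequence.

```r
set.seed(1)  # simulated data, for illustration only

# Simulate a response with independent errors and fit a simple model.
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)

# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the residual sum of squares.
e  <- resid(fit)
dw <- sum(diff(e)^2) / sum(e^2)
dw
```

The same two lines could be applied to `resid(multiple)`; a value well below 2 would point to positive correlation between neighboring rows.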
par(mfrow=c(2,2)); plot(multiple); par(mfrow=c(1,1))
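The diagnostic plots show how well the regression fits, including small deviations. There are slight curves in the plots, so the relationship is not perfectly linear. The residuals roughly follow the expected pattern but are a bit off in places. The spread of the residuals becomes larger as the predicted values increase instead of staying constant, which suggests non-constant variance (heteroscedasticity).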
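The growing spread can also be checked with a test rather than by eye. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand: regress the squared residuals on the predictor and multiply the auxiliary R-squared by the number of observations. It uses simulated data in which the spread grows with `x` by construction (all names are hypothetical, and this is a swapped-in technique, not a step from the original analysis).

```r
set.seed(2)  # simulated data, for illustration only

# Simulate errors whose spread grows with x (heteroscedastic by construction).
x <- runif(200, 1, 10)
y <- 1 + 2 * x + rnorm(200, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-square distribution (df = number of predictors).
bp_stat <- length(x) * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value is evidence against constant variance. For the two-predictor model in this project, the auxiliary regression would include both predictors and use `df = 2`.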
residuals_simple <- resid(multiple)
rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 0.1039158
The RMSE of about 0.104 means the model's predicted turnout is typically off from the actual turnout by about 10 percentage points.
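An RMSE is easier to judge next to a baseline. Since R-squared equals 1 - SSE/SST, the RMSE of an intercept-only model on the same rows can be recovered from the two numbers already reported. A short worked calculation (the values are copied from the output above; the variable names are hypothetical):

```r
# Values copied from the output above.
rmse <- 0.1039158   # rmse_simple
r2   <- 0.06857     # Multiple R-squared

# RMSE of an intercept-only model on the same rows, using
# R^2 = 1 - SSE/SST, so sqrt(SST/n) = sqrt(SSE/n) / sqrt(1 - R^2).
rmse_baseline <- rmse / sqrt(1 - r2)

# Relative reduction in RMSE from adding the two predictors.
reduction <- 1 - rmse / rmse_baseline
c(baseline = rmse_baseline, reduction = reduction)
```

A small reduction is consistent with the low R-squared: the two predictors improve on simply predicting the mean turnout, but only slightly.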
Sources for the dataset, the multiple linear regression project, and the formats used for the rest: “Linear and Multiple Regression Usi.RMD” and “logisitc regression-semi clean version.Rmd”