R Markdown for Project 3

In Project 3, I use a multiple linear regression model to show how the voting-eligible population and the total ballots counted can influence voter turnout (%) across the U.S. The dataset includes the percent of total ballots counted, the voting-eligible population, and the total ballots counted.

Dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Documents/School/Data 101/Data project/Project 3")

dataset <- read.csv("voter_count.csv")

Checking for missing values

colSums(is.na(dataset)) 
##                          year                        region 
##                             0                             0 
##    voting_eligible_population         total_ballots_counted 
##                             0                           223 
##                highest_office percent_total_ballots_counted 
##                             1                           223 
##        percent_highest_office 
##                             1
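
The missing values in total_ballots_counted and percent_total_ballots_counted matter for the model, because lm() drops incomplete rows by default. Below is a minimal sketch of how to count the rows that would be kept. It uses a small made-up data frame: `toy`, its values, and `model_vars` are hypothetical and are not taken from voter_count.csv.

```r
# Hypothetical toy data standing in for the real dataset
toy <- data.frame(
  voting_eligible_population    = c(100, 200, 300, 400),
  total_ballots_counted         = c(50, NA, 120, NA),
  percent_total_ballots_counted = c(0.50, NA, 0.40, NA)
)

# Columns the regression model uses
model_vars <- c("percent_total_ballots_counted",
                "voting_eligible_population",
                "total_ballots_counted")

# Rows with no missing value in any model column
n_complete <- sum(complete.cases(toy[, model_vars]))
n_dropped  <- nrow(toy) - n_complete

n_complete
n_dropped
```

Running the same count on `dataset` instead of `toy` should give the number of rows the model can actually use.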

Checking dataset

str(dataset)
## 'data.frame':    936 obs. of  7 variables:
##  $ year                         : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ region                       : chr  "United States" "Alabama" "Alaska" "Arizona" ...
##  $ voting_eligible_population   : int  227157964 3588783 520562 4510186 2117881 24440416 3800664 2577311 681526 495899 ...
##  $ total_ballots_counted        : int  83262122 1191274 285431 1537671 852642 7513972 2080071 1096556 238110 177176 ...
##  $ highest_office               : int  81687059 1180413 282382 1506416 848592 7317581 2041020 1089880 234038 175071 ...
##  $ percent_total_ballots_counted: num  0.367 0.332 0.548 0.341 0.403 ...
##  $ percent_highest_office       : num  0.36 0.329 0.542 0.334 0.401 ...
summary(dataset)
##       year         region          voting_eligible_population
##  Min.   :1980   Length:936         Min.   :   270122         
##  1st Qu.:1988   Class :character   1st Qu.:   999644         
##  Median :1997   Mode  :character   Median :  2662524         
##  Mean   :1997                      Mean   :  7277622         
##  3rd Qu.:2006                      3rd Qu.:  4569632         
##  Max.   :2014                      Max.   :227157964         
##                                                              
##  total_ballots_counted highest_office      percent_total_ballots_counted
##  Min.   :   122356     Min.   :   117623   Min.   :0.2507               
##  1st Qu.:   422851     1st Qu.:   488820   1st Qu.:0.4338               
##  Median :  1170867     Median :  1236230   Median :0.5234               
##  Mean   :  3074280     Mean   :  3509231   Mean   :0.5183               
##  3rd Qu.:  2395791     3rd Qu.:  2336586   3rd Qu.:0.6046               
##  Max.   :132609063     Max.   :131304731   Max.   :0.7877               
##  NA's   :223           NA's   :1           NA's   :223                  
##  percent_highest_office
##  Min.   :0.2020        
##  1st Qu.:0.4141        
##  Median :0.5010        
##  Mean   :0.4993        
##  3rd Qu.:0.5839        
##  Max.   :0.7837        
##  NA's   :1

Fitting a Multiple Linear Regression Model

multiple <- lm(percent_total_ballots_counted ~ voting_eligible_population + total_ballots_counted, data = dataset)

summary(multiple)
## 
## Call:
## lm(formula = percent_total_ballots_counted ~ voting_eligible_population + 
##     total_ballots_counted, data = dataset)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.256095 -0.080239  0.004202  0.083034  0.257083 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 5.188e-01  4.038e-03 128.498  < 2e-16 ***
## voting_eligible_population -5.836e-09  8.073e-10  -7.229 1.26e-12 ***
## total_ballots_counted       1.156e-08  1.630e-09   7.092 3.21e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1041 on 710 degrees of freedom
##   (223 observations deleted due to missingness)
## Multiple R-squared:  0.06857,    Adjusted R-squared:  0.06595 
## F-statistic: 26.14 on 2 and 710 DF,  p-value: 1.116e-11

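The model summary shows how the voting-eligible population and the total ballots counted relate to the voter turnout percentage. Both predictors are statistically significant (p < 0.001). Holding total ballots counted constant, turnout decreases as the voting-eligible population increases (coefficient -5.836e-09). Holding the voting-eligible population constant, turnout increases with total ballots counted (coefficient 1.156e-08). The multiple R-squared of 0.069 means the model explains only about 7% of the variation in turnout.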
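
To make the coefficients concrete, here is a small worked prediction. It plugs the rounded estimates from the summary output into the regression equation for the Alabama 2014 row shown in the str(dataset) output. The variable names (`b0`, `b_vep`, `b_ballots`, and so on) are only for illustration, and because the coefficients are rounded the result is approximate.

```r
# Coefficients copied (rounded) from the summary(multiple) output
b0        <- 5.188e-01
b_vep     <- -5.836e-09
b_ballots <- 1.156e-08

# Alabama 2014 values, taken from the str(dataset) output
vep     <- 3588783
ballots <- 1191274
actual  <- 0.332

# Predicted turnout proportion from the fitted equation
predicted <- b0 + b_vep * vep + b_ballots * ballots

predicted            # roughly 0.51
actual - predicted   # residual, roughly -0.18
```

The gap between the predicted and actual turnout for this one row is in line with the low R-squared: the model stays close to the overall average turnout for most states.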

To make sure the model is reliable, the steps below check its assumptions: whether the residuals show patterns, whether their spread stays the same size, and whether the predictors are too similar to each other. Graphs help assess each of these.
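
One way to check whether two predictors are too similar is their correlation and the variance inflation factor (VIF). The sketch below uses simulated stand-in data, not the voter dataset: `vep` and `ballots` are hypothetical vectors built so that ballots are roughly a fraction of the population.

```r
set.seed(1)

# Simulated stand-in data (hypothetical): ballots are a fraction of population
vep     <- runif(200, 3e5, 2e7)
ballots <- vep * runif(200, 0.3, 0.7)

# Pairwise correlation between the two predictors
r <- cor(vep, ballots)

# VIF from regressing one predictor on the other;
# with only two predictors this equals 1 / (1 - r^2)
r2  <- summary(lm(vep ~ ballots))$r.squared
vif <- 1 / (1 - r2)

r
vif
```

A correlation close to 1 (and a large VIF) would mean the two predictors carry nearly the same information, which makes the individual coefficients harder to interpret. On the real data, the same calculation could be run on the two predictor columns after removing rows with missing values.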

Independence check

plot(resid(multiple), type="b", main = "Residuals vs Order", ylab="Residuals")
abline(h=0, lty=2)

This plot shows patterns rather than random scatter. The repetitive pattern suggests that the residuals might not be independent.
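
A numeric companion to the residuals-vs-order plot is the Durbin-Watson statistic. The sketch below computes it by hand on two hypothetical residual series; `dw_stat`, `e_indep`, and `e_auto` are names made up for this illustration, and the simulated series are not from the voter model.

```r
# Durbin-Watson statistic: values near 2 suggest no first-order
# autocorrelation; values well below 2 suggest positive autocorrelation
dw_stat <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)

# Hypothetical residual series for illustration
e_indep <- rnorm(500)                                      # independent noise
e_auto  <- as.numeric(arima.sim(list(ar = 0.8), n = 500))  # autocorrelated

dw_indep <- dw_stat(e_indep)
dw_auto  <- dw_stat(e_auto)

dw_indep
dw_auto
```

For the fitted model, the same function could be applied as dw_stat(resid(multiple)). The result depends on the row order of the data, so it is only meaningful if the rows are in a sensible order, such as by year.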

Core Diagnostics

par(mfrow=c(2,2)); plot(multiple); par(mfrow=c(1,1))

The diagnostic plots show that the model follows the overall pattern in the data, but not perfectly. There are slight curves in the plots, which suggests the relationship is not completely linear. The spread of the residuals also becomes bigger as the predicted values increase, instead of staying the same, which suggests the variance is not constant.
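
The growing spread can also be checked with a test instead of by eye. The sketch below is a hand-computed version of the studentized (Koenker) Breusch-Pagan test, run on simulated stand-in data in which the noise is built to grow with the predictor. The names `x`, `y`, `fit`, and `aux` are hypothetical and the data are not from the voter dataset.

```r
set.seed(1)

# Simulated stand-in data (hypothetical) whose spread grows with x
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Test statistic: n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution (df = number of predictors in aux, here 1)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_p
```

A small p-value is evidence against constant variance. For the voter model, the auxiliary regression would use both predictors and df = 2.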

Diagnose Model Fit

residuals_simple <- resid(multiple)

rmse_simple <- sqrt(mean(residuals_simple^2))
rmse_simple
## [1] 0.1039158

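The RMSE of about 0.104 means the model's predicted turnout is typically off by about 10 percentage points, since turnout is measured as a proportion between 0 and 1.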
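
The RMSE is slightly smaller than the residual standard error in the model summary because the two divide the sum of squared residuals by different numbers. The sketch below reproduces the RMSE from the values printed in the summary output; the variable names are only for illustration, and the inputs are rounded, so the match is approximate.

```r
# Values copied from the summary(multiple) output
rse      <- 0.1041        # residual standard error (rounded)
df_resid <- 710           # residual degrees of freedom
n        <- df_resid + 3  # rows used: 2 predictors + intercept

# RMSE divides the sum of squared residuals by n,
# while the residual standard error divides by df_resid
rmse_from_rse <- rse * sqrt(df_resid / n)

n
rmse_from_rse
```

The result agrees with the RMSE computed directly from the residuals to about four decimal places.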

References:

Dataset, multiple linear regression approach, and formatting adapted from: “Linear and Multiple Regression Usi.RMD” and “logisitc regression-semi clean version.Rmd”