Vote-By-Mail Rejection Likelihood: An Analysis of Time Predictors

This project evaluates Vote-By-Mail rejection rates and rejection likelihood at the county and state level in Florida. The primary objective is to study time predictors that lead to rejection, but the larger goal is to visualize and understand the Florida voter whose Vote-By-Mail ballot is rejected.

Introduction

Vote-By-Mail (VBM), previously referred to as absentee ballot voting, has been growing in popularity nationwide. More individuals are beginning to see the appeal of having a ballot sent to their home and skipping the Election Day lines. However, with this convenience comes added hassle and risk. Between the unpredictable postal system and the chance of mismatched signatures, the risk that a ballot is rejected is higher when voting from home.

In this project, I study how time affects this rejection likelihood. If a voter requests their ballot and returns it in a timely manner, is their ballot less likely to be rejected than the person who waits until a few days before it’s due? Does the additional time allow voters to pay more attention to details like signatures and other important fields?

Using Florida voter data from the November 2018 election, we study these likelihoods and where these rejections are occurring most.

In the fall of 2018, I worked on a smaller-scale project that studied these time predictors in the context of voting by mail in Florida’s Duval County. Now, with expanded political and statistical knowledge, I revisit this project to produce state-level analysis and visualizations.

Getting Started

Below are the packages I used to complete this analysis, followed by the relevant code needed to replicate this study.

library(dplyr)
library(ggplot2)
library(readr)
library(sf)
library(sp)
library(scales)
library(gridExtra)

Data Used

These two data sets were obtained from the Secretary of State’s Division of Elections by Public Records Requests. They were then merged by VoterIDs to combine the Vote-By-Mail information and the voter’s demographic info.

A special thanks is due to Anna Baringer, one of my peers in research at the University of Florida, who provided the shell of code that enabled me to read in the 67 county files more efficiently.

Voter History: Vote By Mail

The VBM data set is part of the voter history data kept by each county. Here the files are loaded county by county and then combined with rbind(). They detail the election, the VBM status, all dates associated with the VBM ballot, and mailing information. For the purpose of this study, we are primarily interested in the dates and VBM status.

setwd("F:/POS.EDS project/20190813_VoteByMail")

#create a list of the files from your target directory
file_list <- list.files(path="F:/POS.EDS project/20190813_VoteByMail")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to here 
FL.VBM.OG <- data.frame()

# all the column headers 
detail.cols <- c("RecordType", "CountyID", "VoterID", "FVRS.Election", "Election.Date", 
                 "Election.Name", "Record.Change.Date", "Abs.Request.Status", "Abs.Request.Date", 
                 "Abs.Delivery.Date", "Abs.Return.Date", "Abs.Request.Canceled.Date", "Abs.Military", 
                 "Abs.Overseas.Flag", "Abs.Military.Dependent", "Precinct", "Precinct.Split", 
                 "Congressional.District", "Senate.District", "House.District", 
                 "County.Commission.District", "School.Board.District", "Other.Districts", "Abs.Party",
                 "Name", "Abs.Mail.Add.1", "Abs.Mail.Add.2", "Abs.Mail.Add.3", "Abs.Mail.Add.City", 
                 "Abs.Mail.Add.State", "Abs.Mail.Add.Zip", "Abs.Mail.Add.Country", 
                 "Abs.Mail.Add.E-mail", "Fax")

# for loop loads in each county's data and binds them into one data set 
for (i in seq_along(file_list)){
  temp_data <- read.table(file_list[i], sep = "\t",
                          col.names = detail.cols, fill = TRUE)

  #for each iteration, bind the new data to the building dataset
  FL.VBM.OG <- rbind(FL.VBM.OG, temp_data) 
}
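The loop above grows the data frame with rbind() on every pass, which copies the accumulated rows each time. A common alternative is to read every file into a list and bind once at the end. The following is a minimal sketch of that pattern, not the code used for this project: the helper name read_county_files, the toy directory, and the two made-up files are hypothetical, and quote = "" is an assumption that the county files contain no quoted fields.

```r
# Hypothetical helper: read every tab-delimited file in a directory and
# stack the results with a single rbind() call.
read_county_files <- function(dir, col.names) {
  paths <- list.files(path = dir, full.names = TRUE)  # full paths, so no setwd() is needed
  pieces <- lapply(paths, function(p) {
    # quote = "" is an assumption: apostrophes in names are treated as ordinary text
    read.table(p, sep = "\t", col.names = col.names, fill = TRUE,
               quote = "", stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# Toy demonstration with two made-up county files in a temporary directory
toy.dir <- file.path(tempdir(), "toy_vbm")
dir.create(toy.dir, showWarnings = FALSE)
writeLines(c("ALA\t100\tV", "ALA\t101\tN"), file.path(toy.dir, "ALA.txt"))
writeLines(c("DUV\t200\tV"), file.path(toy.dir, "DUV.txt"))

toy.vbm <- read_county_files(toy.dir, c("CountyID", "VoterID", "Status"))
print(toy.vbm)
```

With 67 county files the difference is mostly one of speed and memory; the resulting data frame should have the same rows as the loop version.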

Voter Details

The Voter Details data set is created similarly to the VBM data set: each county is loaded separately in a for loop and then combined with rbind(). Since this data set contains the universe of Florida voters, it is much larger, so we select only the necessary columns while reading the files in (via the colClasses argument of read.table()).

This set includes important demographic information like Gender, Age, Race, and Party Affiliation. These will be important control variables in our modeling.

setwd("F:/POS.EDS project/20190813_VoterDetail")

#create a list of the files from your target directory
file_list <- list.files(path="F:/POS.EDS project/20190813_VoterDetail")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to here 
FL.VBM.counties <- data.frame()

detail.cols <- c("CountyCode", "VoterID", "NameLast", "NameSuffix", "NameFirst", 
                 "NameMiddle","RequestedPublicRecordsExemption", "ResidenceAddressLine1", 
                 "ResidenceAddressLine2", "ResidenceCity", "ResidenceState", "ResidenceZipcode",
                 "MailingAddressLine1", "MailingAddressLine2", "MailingAddressLine3", "MailingCity",
                 "MailingState", "MailingZipcode", "MailingCountry","Gender", "Race", 
                 "BirthDate","RegistrationDate", "PartyAffiliation","Precinct", "PrecinctGroup", 
                 "PrecinctSplit", "PrecinctSuffix","VoterStatus","CongressionalDistrict", "HouseDistrict",
                 "SenateDistrict", "CountyCommissionDistrict", "SchoolBoardDistrict","DaytimeAreaCode",
                 "DaytimePhoneNumber", "DaytimePhoneExtension", "EmailAddress")

# for loop loads in each county's data (only the necessary columns) and binds them into one data set 
# "NULL" in colClasses skips a column entirely; NA would still read it in and guess its type
for (i in seq_along(file_list)){
  temp_data <- read.table(file_list[i], sep = "\t",
                          col.names = detail.cols, fill = TRUE, 
                          colClasses = c("character", "character", rep("NULL", 17),
                                         "character", "character", "character", "character", 
                                         "character", rep("NULL", 4), "character", rep("NULL", 9))) 
  #for each iteration, bind the new data to the building dataset
  FL.VBM.counties <- rbind(FL.VBM.counties, temp_data) 
}
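How colClasses treats each entry matters for the column selection described above. As a small illustration on made-up text (the toy values and object names are hypothetical), an NA entry tells read.table() to read the column and guess its type, while the string "NULL" tells it to skip the column altogether:

```r
# Toy tab-delimited text with three columns
toy.lines <- c("ALA\t100\tSmith", "DUV\t200\tJones")
toy.cols  <- c("CountyCode", "VoterID", "NameLast")

# NA: the third column is read and its type is guessed
read.na <- read.table(text = toy.lines, sep = "\t", col.names = toy.cols,
                      colClasses = c("character", "character", NA))

# "NULL": the third column is skipped and never enters the data frame
read.null <- read.table(text = toy.lines, sep = "\t", col.names = toy.cols,
                        colClasses = c("character", "character", "NULL"))

names(read.na)    # all three columns are present
names(read.null)  # only CountyCode and VoterID remain
```

Note that col.names still lists every column in the file, including the skipped ones.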

Brief Subsetting

Here I subset the data a little before continuing with variable creation. We only want the ballots that were actually returned, not all the ballots that were requested, so we keep only the statuses marked V, N, and E:

V = Voted

N = No Signature

E = Voter Error

Both N and E result in ballot rejection.

Further, we use birthdate to calculate the voter’s age at the time of the election. Then we remove all implausible ages: pre-registered voters (younger than 18) and voters aged 105 or older.

Finally, the merge between the Voter Details and the VBM data occurs, linking the two by the unique VoterID key.

#subset the data to just include V, N, and E 
FL.VBM.OG$accepted <- factor(FL.VBM.OG$Abs.Request.Status)
FL.VBM <- subset(FL.VBM.OG, subset = accepted == "V"| accepted == "N" | accepted =="E")


#subset to remove implausible ages 
FL.VBM.counties$AgeInDays = -(as.Date(FL.VBM.counties$BirthDate, format = "%m/%d/%Y") - as.Date("11/06/2018", format="%m/%d/%Y"))
FL.VBM.counties$Age = floor((FL.VBM.counties$AgeInDays)/365)
# keep a voter only when BOTH bounds hold; joining them with | would keep every age
FL.VBM.counties <- subset(FL.VBM.counties, subset = Age >= 18 & Age < 105)

#merge two data files
merged.data <- merge(FL.VBM, FL.VBM.counties, by= 'VoterID')
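The text describes removing voters younger than 18 and voters aged 105 or older, which means a record is kept only when both bounds hold. The sketch below, on made-up birth dates, illustrates why the two conditions need to be joined with & rather than |: an "or" of the two bounds is true for every possible age, so it filters nothing.

```r
election.day <- as.Date("2018-11-06")

# Made-up birth dates: a pre-registered 17-year-old, two plausible voters,
# and one implausible record
toy <- data.frame(
  VoterID   = 1:4,
  BirthDate = c("03/15/2001", "07/04/1990", "01/01/1950", "01/01/1900"),
  stringsAsFactors = FALSE
)

# Age in whole years, using the same days/365 approximation as the analysis
toy$Age <- floor(as.numeric(election.day -
                            as.Date(toy$BirthDate, format = "%m/%d/%Y")) / 365)

keep.or  <- toy$Age < 105 | toy$Age > 18    # true for every age
keep.and <- toy$Age >= 18 & toy$Age < 105   # true only inside the range

toy$Age
sum(keep.or)   # all four toy records pass
sum(keep.and)  # only the two plausible voters pass
```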

Create Variables for Modeling

Demographic Variables

Since we are primarily using categorical variables in the model, a series of dummy variables was created for each predictor of interest, and one category of each predictor was left out of the model to serve as the reference group. Most of these variables were created using the ifelse() function and then examined using the table() function. The examination is used mostly to check that the ifelse() statement was successful and the variables are correctly coded.

Accepted: The Accepted variable is the dependent variable. It equals 1 when the ballot was accepted and 0 when the ballot was rejected.

Age: The age category takes the age calculation we did earlier during the subsetting process and produces four age bins to be used in the model. The youngest bin is left out of the model for comparison.

Gender: Male, Female, and Unidentified are the three categories available in the Florida voter file. Each was coded into its own dummy variable, and males were left out of the model for comparison.

Party Identification: For simplicity, I only coded Non-Party Affiliates, Democrats, and Republicans individually and then sorted the remaining smaller parties into a general “Other” bin. Republicans were left out of the model for comparison.

Race: Similarly to Party ID, four race bins were created: White, Black, Hispanic, and Other. White voters were left out of the model for comparison.

New Voter: The new voter variable was created to attempt to control for new voter error. These are people who registered in either 2017 or 2018 and are likely less experienced with voting in Florida and, specifically, with the VBM process.

County Type: Lastly, a variable for county type was created based on population density. The 67 counties were ordered by population density and then divided as evenly as possible into 3 categories: Urban, Rural, and In Between. Urban was left out of the model for comparison.

#dummy variable for ballot accepted or rejected 
merged.data$accepted1 <- ifelse(merged.data$accepted == 'V', 1, 0)
table(merged.data$accepted1)

## 
##       0       1 
##   12530 1129894

#dummy variables for age bins 
merged.data$Age18.to.29 <- ifelse(((merged.data$Age > 17) & (merged.data$Age < 30)), 1, 0)
merged.data$Age30.to.44 <- ifelse(((merged.data$Age > 29) & (merged.data$Age < 45)), 1, 0)
merged.data$Age45.to.64 <- ifelse(((merged.data$Age > 44) & (merged.data$Age < 65)), 1, 0)
merged.data$Age65.to.104 <- ifelse(((merged.data$Age > 64) & (merged.data$Age < 105)), 1, 0)


#dummy variables for gender
merged.data$genderM <- ifelse(merged.data$"Gender" == 'M', 1, 0)
merged.data$genderF <- ifelse(merged.data$"Gender" == 'F', 1, 0)
merged.data$genderU <- ifelse(merged.data$"Gender" == 'U', 1, 0)
table(merged.data$Gender)

## 
##             1      3      4      5      6      7      9      F      M 
##    700      1     74      3     15      4      1      2 637970 483338 
##      U 
##  20316

#dummy variables for party identification
merged.data$partyidNPA <- ifelse(merged.data$Abs.Party == "NPA", 1, 0)
merged.data$partyidDEM <- ifelse(merged.data$Abs.Party == 'DEM', 1, 0)
merged.data$partyidREP <- ifelse(merged.data$Abs.Party == 'REP', 1, 0)
merged.data$partyidOther <- with(merged.data, ifelse(merged.data$Abs.Party == 'CPF', 1,
                                              ifelse(merged.data$Abs.Party == 'ECO', 1,
                                              ifelse(merged.data$Abs.Party == 'GRE', 1,
                                              ifelse(merged.data$Abs.Party == 'IND', 1,
                                              ifelse(merged.data$Abs.Party == 'LPF', 1, 
                                              ifelse(merged.data$Abs.Party == 'NRS', 1,
                                              ifelse(merged.data$Abs.Party == 'REF', 1,0))))))))
table(merged.data$Abs.Party)

## 
##    AIP    CPF    DEM    ECO    GRE    IND    LPF    NPA    PSL    REF 
##      2    101 446311     32    449   4213   1937 219612     35     86 
##    REP    UNK 
## 469645      1

# dummy variables for race 
merged.data$Black <- ifelse(merged.data$"Race" == '3', 1, 0)
merged.data$Hispanic <- ifelse(merged.data$"Race" == '4', 1, 0)
merged.data$White <- ifelse(merged.data$"Race" == '5', 1, 0)
merged.data$RaceOther <- with(merged.data, ifelse(merged.data$"Race" == '1', 1,
                                           ifelse(merged.data$"Race" == '2', 1,
                                           ifelse(merged.data$"Race" == '6', 1,
                                           ifelse(merged.data$"Race" == '7', 1,
                                           ifelse(merged.data$"Race" == '9', 1, 0))))))


#dummy variable for whether they are a NEW voter 
# registration dates are month/day/year, like BirthDate; reading them as day/month/year
# returns NA for every date whose day is above 12
merged.data$Reg.Year <- format(as.Date(merged.data$RegistrationDate, format="%m/%d/%Y"),"%Y")
merged.data$NewVoter <- ifelse(merged.data$Reg.Year == '2018', 1,
                        ifelse(merged.data$Reg.Year == '2017', 1, 0))
table(merged.data$NewVoter)

## 
##      0      1 
## 461959  24599

# County Type (Urban and Rural and in between)
# population density data found at http://worldpopulationreview.com/states/florida-population/#densityByCounty  

popDensity <- read.csv("F:/POS.EDS project/popDensity.csv")
merged.data <- left_join(merged.data, popDensity, by="CountyID")
#create Urban, Rural, and Middle variables 
merged.data$Urban <- ifelse(merged.data$CountyType == "Urban", 1, 0)
merged.data$Rural <- ifelse(merged.data$CountyType == 'Rural', 1, 0)
merged.data$Between <- ifelse(merged.data$CountyType == 'Middle', 1, 0)
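The contents of popDensity.csv are not shown, so exactly how CountyType was assigned is not known. As one possible approach, the sketch below ranks made-up counties by density and cuts the ranks into three equal-sized groups. The column name Density, the county IDs, and the density values are all hypothetical.

```r
# Made-up densities for nine hypothetical counties (people per square mile)
toy.density <- data.frame(
  CountyID = paste0("C", 1:9),
  Density  = c(12, 35, 48, 90, 150, 310, 800, 1400, 3300)
)

# Rank counties by density, then cut the ranks into three equal-sized groups
n.counties <- nrow(toy.density)
toy.density$CountyType <- cut(rank(toy.density$Density, ties.method = "first"),
                              breaks = c(0, n.counties / 3, 2 * n.counties / 3, n.counties),
                              labels = c("Rural", "Middle", "Urban"))

table(toy.density$CountyType)
```

With 67 counties the groups cannot be exactly equal, so a split of this kind would give two groups of 22 and one of 23.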

Creating the Time Variables

These time variables are calculated in number of days.

Request is the number of days before the election that each voter requested a ballot. This number can be quite large for some voters because they have opted to receive a VBM ballot automatically for every election, so only the date of that standing request remains in the data. I decided not to remove these values because I think they illustrate a proactiveness among these voters that lowers the likelihood of ballot rejection.

Return is the number of days before the election that each voter returned a ballot to their Supervisor of Elections (SOE) office. There were a fair number of negative values here because many voters mailed their ballots too late, leading to automatic rejection. These negative values don’t make sense in the context of this project: a ballot returned after the deadline is guaranteed to be rejected. Thus, we remove them to test whether the number of days by which a ballot beats the deadline has an additional effect on rejection likelihood.

Our last time variable, atHome, measures the length of time in days between the ballot delivery to the voter and the ballot delivery back to the SOE. This measure is used to test the hypothesis that the longer the ballot spends at home, the less likely it is to be rejected.

There is one main issue with this variable - it does not account for the length of time the ballot spends in the mail back to the SOE. Thus, it is not a completely accurate measure of time spent “at home”.
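The three definitions above can be sketched on a couple of made-up ballots. The dates are invented for illustration and the helper to.date is hypothetical. The sketch also shows the arithmetic link between atHome and return: time at home equals the number of days before the election that the ballot was delivered, minus return.

```r
election.day <- as.Date("2018-11-06")

# Made-up ballots; dates use the same month/day/year format as the VBM file
toy <- data.frame(
  Abs.Request.Date  = c("09/01/2018", "10/20/2018"),
  Abs.Delivery.Date = c("10/02/2018", "10/24/2018"),
  Abs.Return.Date   = c("10/16/2018", "11/05/2018"),
  stringsAsFactors = FALSE
)
to.date <- function(x) as.Date(x, format = "%m/%d/%Y")  # hypothetical helper

# as.numeric() turns the difftime result into a plain number of days
toy$request <- as.numeric(election.day - to.date(toy$Abs.Request.Date))
toy$return  <- as.numeric(election.day - to.date(toy$Abs.Return.Date))
toy$atHome  <- as.numeric(to.date(toy$Abs.Return.Date) - to.date(toy$Abs.Delivery.Date))

# atHome can be rewritten in terms of return, which is why the two are related
delivered.before <- as.numeric(election.day - to.date(toy$Abs.Delivery.Date))
all(toy$atHome == delivered.before - toy$return)

toy[, c("request", "return", "atHome")]
```

Storing the values as plain numbers rather than difftime objects is a design choice that also keeps ggplot2 from having to guess a scale for them.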

## request = how many days before Election Day did voter request their Vote By Mail Ballot 
merged.data$request = -(as.Date(merged.data$Abs.Request.Date, format = "%m/%d/%Y") - as.Date(as.character("11/06/2018"), format="%m/%d/%Y"))
ggplot(merged.data,  aes(x=request)) +
  geom_bar(stat = "count") +
  theme_bw() +
  ggtitle("Increase in Requests as Election Draws Nearer") +
  xlab("Days until Election") +
  ylab("Number of Ballots Requested")+
  scale_y_continuous(labels = comma)

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

# eliminate negative values likely due to error in the data 
merged.data <- merged.data[merged.data$request >= 0, ]

## return = how many days before Election Day did voter return their Vote By Mail Ballot 
merged.data$return = -(as.Date(merged.data$Abs.Return.Date, format = "%m/%d/%Y") - as.Date(as.character("11/06/2018"), format="%m/%d/%Y"))
ggplot(merged.data,  aes(x=return)) + 
  geom_bar(stat = "count") +
  theme_bw() +
  ggtitle("Increase in Returns as Election Draws Nearer") +
  xlab("Days until Election") +
  ylab("Number of Ballots Returned") +
  scale_y_continuous(labels = comma)

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

merged.data <- merged.data[merged.data$return >= 0, ]

merged.data$atHome = -(as.Date(merged.data$Abs.Delivery.Date, format = "%m/%d/%Y") - as.Date(merged.data$Abs.Return.Date, format="%m/%d/%Y"))
merged.data <- merged.data[merged.data$atHome >= 0, ]
ggplot(merged.data,  aes(x=atHome)) + 
  geom_bar(stat = "count") +
  theme_bw() +
  ggtitle("How long do Voters spend with their ballots?") +
  xlab("Days at Home") +
  ylab("Number of People") +
  scale_y_continuous(labels = comma)

## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

Logistic Regression Models

Two models were produced to evaluate the significance of the time predictors and the demographic variables. The summary output is provided to view significance levels for each variable.

The two models were fit separately to avoid multicollinearity, which could be an issue because the atHome variable is a function of the return variable.

Testing Request and Return

mylogit.RR <- glm(accepted1 ~ request + return + 
                  Age30.to.44  +
                  Age45.to.64  +
                  Age65.to.104 +
                  genderF +
                  genderU +
                  partyidNPA +
                  partyidDEM +
                  partyidOther +
                  Black  +
                  Hispanic  +
                  RaceOther   +
                  NewVoter +
                  Rural +
                  Between, data=merged.data, family = "binomial")
summary(mylogit.RR)

## 
## Call:
## glm(formula = accepted1 ~ request + return + Age30.to.44 + Age45.to.64 + 
##     Age65.to.104 + genderF + genderU + partyidNPA + partyidDEM + 
##     partyidOther + Black + Hispanic + RaceOther + NewVoter + 
##     Rural + Between, family = "binomial", data = merged.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7357   0.0653   0.0810   0.1063   0.2496  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.687e+00  7.883e-02  59.456  < 2e-16 ***
## request       1.483e-04  6.338e-05   2.340   0.0193 *  
## return        3.984e-02  3.006e-03  13.255  < 2e-16 ***
## Age30.to.44   3.998e-01  7.434e-02   5.377 7.56e-08 ***
## Age45.to.64   5.355e-01  6.625e-02   8.083 6.31e-16 ***
## Age65.to.104  7.662e-01  6.919e-02  11.073  < 2e-16 ***
## genderF       1.792e-01  4.352e-02   4.117 3.85e-05 ***
## genderU       5.553e-02  1.353e-01   0.410   0.6815    
## partyidNPA   -2.798e-01  5.888e-02  -4.752 2.01e-06 ***
## partyidDEM   -9.292e-02  5.478e-02  -1.696   0.0898 .  
## partyidOther -1.178e-01  2.719e-01  -0.433   0.6649    
## Black        -9.283e-01  6.522e-02 -14.234  < 2e-16 ***
## Hispanic     -9.533e-01  5.328e-02 -17.894  < 2e-16 ***
## RaceOther    -6.073e-01  8.374e-02  -7.251 4.13e-13 ***
## NewVoter      6.366e-02  9.281e-02   0.686   0.4927    
## Rural         1.822e-01  1.885e-01   0.967   0.3336    
## Between       2.464e-03  5.894e-02   0.042   0.9667    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28415  on 481981  degrees of freedom
## Residual deviance: 27265  on 481965  degrees of freedom
##   (651640 observations deleted due to missingness)
## AIC: 27299
## 
## Number of Fisher Scoring iterations: 8

Final Request/Return Model

mylogit.RR.final <- glm(accepted1 ~ request + return + 
                  Age30.to.44  +
                  Age45.to.64  +
                  Age65.to.104 +
                  genderF +
                  partyidNPA +
                  Black  +
                  Hispanic  +
                  RaceOther, data=merged.data, family = "binomial")
summary(mylogit.RR.final)

## 
## Call:
## glm(formula = accepted1 ~ request + return + Age30.to.44 + Age45.to.64 + 
##     Age65.to.104 + genderF + partyidNPA + Black + Hispanic + 
##     RaceOther, family = "binomial", data = merged.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7367   0.0662   0.0817   0.1073   0.2499  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   4.665e+00  4.544e-02 102.669  < 2e-16 ***
## request       1.545e-04  4.058e-05   3.806 0.000141 ***
## return        3.850e-02  1.935e-03  19.900  < 2e-16 ***
## Age30.to.44   3.642e-01  4.718e-02   7.721 1.16e-14 ***
## Age45.to.64   5.337e-01  4.204e-02  12.694  < 2e-16 ***
## Age65.to.104  7.328e-01  4.375e-02  16.750  < 2e-16 ***
## genderF       1.810e-01  2.761e-02   6.556 5.51e-11 ***
## partyidNPA   -2.094e-01  3.227e-02  -6.488 8.72e-11 ***
## Black        -1.005e+00  3.862e-02 -26.025  < 2e-16 ***
## Hispanic     -9.727e-01  3.375e-02 -28.818  < 2e-16 ***
## RaceOther    -5.529e-01  5.373e-02 -10.291  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67974  on 1133615  degrees of freedom
## Residual deviance: 65251  on 1133605  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 65273
## 
## Number of Fisher Scoring iterations: 8

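Note that after the nonsignificant variables were removed, the z value for return rose from 13.3 to 19.9. Much of that change likely reflects sample size rather than the variables themselves: the full model dropped 651,640 observations due to missingness, while the final model dropped only 6. The NewVoter table above has only 486,558 non-missing values, which suggests that most of the missingness entered through that variable.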
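Logit coefficients are easier to read once converted to odds ratios. As a worked example using estimates copied from the final request/return output above, exponentiating a coefficient gives the multiplicative change in the odds of acceptance for a one-unit increase in that variable. The 14-day comparison is an illustrative choice, not a figure from the analysis.

```r
# Coefficients copied from the final request/return model output
coefs <- c(return = 3.850e-02, request = 1.545e-04,
           Black = -1.005, Hispanic = -9.727e-01)

# exp() of a logit coefficient is the multiplicative change in the odds of acceptance
odds.ratios <- exp(coefs)
round(odds.ratios, 3)

# Change in odds from returning a ballot 14 days earlier, other variables held fixed
exp(14 * coefs["return"])

# Acceptance probability implied by the intercept alone, i.e. for the reference
# group with request and return both equal to zero
plogis(4.665)
```

Each additional day between the return date and the election multiplies the odds of acceptance by about 1.04, while the coefficient for Black voters corresponds to odds of roughly 0.37 times those of the White reference group.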

Testing Time At Home

mylogit.AtHome <- glm(accepted1 ~ atHome + 
                  Age30.to.44  +
                  Age45.to.64  +
                  Age65.to.104 +
                  genderF +
                  genderU +
                  partyidNPA +
                  partyidDEM +
                  partyidOther +
                  Black  +
                  Hispanic  +
                  RaceOther   +
                  NewVoter +
                  Rural +
                  Between, data=merged.data, family = "binomial")
summary(mylogit.AtHome)

## 
## Call:
## glm(formula = accepted1 ~ atHome + Age30.to.44 + Age45.to.64 + 
##     Age65.to.104 + genderF + genderU + partyidNPA + partyidDEM + 
##     partyidOther + Black + Hispanic + RaceOther + NewVoter + 
##     Rural + Between, family = "binomial", data = merged.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5728   0.0681   0.0792   0.1083   0.2250  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   5.154639   0.087051  59.214  < 2e-16 ***
## atHome       -0.003849   0.002491  -1.545 0.122353    
## Age30.to.44   0.443681   0.074276   5.973 2.32e-09 ***
## Age45.to.64   0.650990   0.065803   9.893  < 2e-16 ***
## Age65.to.104  0.977494   0.067901  14.396  < 2e-16 ***
## genderF       0.155144   0.043495   3.567 0.000361 ***
## genderU       0.019346   0.135199   0.143 0.886220    
## partyidNPA   -0.302298   0.058838  -5.138 2.78e-07 ***
## partyidDEM   -0.078538   0.054768  -1.434 0.151572    
## partyidOther -0.135923   0.271813  -0.500 0.617033    
## Black        -1.001747   0.065057 -15.398  < 2e-16 ***
## Hispanic     -1.019586   0.052994 -19.240  < 2e-16 ***
## RaceOther    -0.650910   0.083637  -7.783 7.11e-15 ***
## NewVoter     -0.027744   0.092317  -0.301 0.763769    
## Rural         0.108716   0.188252   0.578 0.563599    
## Between       0.024643   0.058860   0.419 0.675455    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28415  on 481981  degrees of freedom
## Residual deviance: 27465  on 481966  degrees of freedom
##   (651640 observations deleted due to missingness)
## AIC: 27497
## 
## Number of Fisher Scoring iterations: 8

Final atHome Model - Still Not Significant

mylogit.AtHome.final <- glm(accepted1 ~ atHome + 
                  Age30.to.44  +
                  Age45.to.64  +
                  Age65.to.104 +
                  genderF +
                  partyidNPA +
                  Black  +
                  Hispanic  +
                  RaceOther, data=merged.data, family = "binomial")
summary(mylogit.AtHome.final)

## 
## Call:
## glm(formula = accepted1 ~ atHome + Age30.to.44 + Age45.to.64 + 
##     Age65.to.104 + genderF + partyidNPA + Black + Hispanic + 
##     RaceOther, family = "binomial", data = merged.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5167   0.0700   0.0811   0.1102   0.2198  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   5.0654839  0.0513504  98.645  < 2e-16 ***
## atHome       -0.0009235  0.0016086  -0.574    0.566    
## Age30.to.44   0.4155835  0.0470934   8.825  < 2e-16 ***
## Age45.to.64   0.6585039  0.0416681  15.804  < 2e-16 ***
## Age65.to.104  0.9558839  0.0427487  22.361  < 2e-16 ***
## genderF       0.1600554  0.0275931   5.801 6.61e-09 ***
## partyidNPA   -0.2405456  0.0322082  -7.468 8.11e-14 ***
## Black        -1.0712058  0.0385319 -27.800  < 2e-16 ***
## Hispanic     -1.0320396  0.0336400 -30.679  < 2e-16 ***
## RaceOther    -0.6000331  0.0536688 -11.180  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67974  on 1133615  degrees of freedom
## Residual deviance: 65712  on 1133606  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 65732
## 
## Number of Fisher Scoring iterations: 8

Discussion of Findings

The request and return variables were significant predictors in their model as well as some of the demographic control variables; however, the atHome variable did not show any significance.

In both models, we eliminated the variables that did not show significance in hopes of improving the model's predictive power. Doing so strengthened the significance of the return predictor but did not help the atHome model. Surprisingly, the time the ballot spent at home was not a significant predictor of ballot rejection.

For the demographic variables, all age groups older than 29 showed a higher likelihood of ballot acceptance when compared to the youngest age group (18-29). Black voters, Hispanic voters, and those in the “other” race category were less likely to have an accepted ballot. These two findings are consistent with those of Dr. Daniel Smith at the University of Florida, which were published recently by the ACLU (report can be found here).

In addition, females were more likely to have an accepted ballot than their male and unidentified counterparts, and NPA voters were less likely to see their ballots accepted.

Among the not significant variables, there were a few surprises. Based on the visualizations below I expected there to be a difference in rejection likelihood between Democrats and Republicans but one was not present.

In addition, county type (determined by population density) was not significant in any model, which I found interesting. I predicted that Rural counties would have higher likelihoods of ballot rejection because their populations are spread over a wide area, but that does not seem to be the case. Perhaps if I had not removed the voters whose ballots were received by the SOEs after the election deadline, an effect would have been present. This may be explored in a later project.

Visualizing Vote-By-Mail

Who Votes By Mail vs. Who Gets Rejected?

First, we have to create a subset of those whose ballots were rejected:

justrejected <- merged.data %>% 
  filter(accepted1 == 0)
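The plots in this section rely on two label columns, RaceName and simpleParty, whose creation is not shown in the code above. The following is a minimal sketch of one way such labels could be derived, assuming the race codes 3, 4, and 5 mean Black, Hispanic, and White as in the dummy coding, and pooling everything else into "Other". The toy rows and the lookup vector race.labels are hypothetical.

```r
# Made-up rows using codes that appear in the tables above
toy <- data.frame(
  Race      = c("3", "4", "5", "2", NA),
  Abs.Party = c("DEM", "REP", "NPA", "LPF", "GRE"),
  stringsAsFactors = FALSE
)

# Hypothetical race label: look up the three coded groups, pool the rest as "Other"
race.labels <- c("3" = "Black", "4" = "Hispanic", "5" = "White")
toy$RaceName <- unname(race.labels[toy$Race])
toy$RaceName[!is.na(toy$Race) & is.na(toy$RaceName)] <- "Other"  # missing race stays NA

# Hypothetical party label: keep the three large groups, pool the rest as "Other"
toy$simpleParty <- ifelse(toy$Abs.Party %in% c("DEM", "REP", "NPA"),
                          toy$Abs.Party, "Other")
toy
```

In the real data the same steps would be applied to merged.data before the rejected subset is taken, so that both data frames carry the labels.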

By Race

race.all <- ggplot(data = subset(merged.data, !is.na(RaceName)),  aes(x=RaceName)) + 
  geom_bar(stat = "count") +
  theme_bw() +
  ggtitle("Race - All VBM") +
  xlab("Race") +
  scale_y_continuous(labels = comma)
race.rej <- ggplot(justrejected, aes(x=RaceName)) + 
  geom_bar(stat = "count") + 
  theme_bw() +
  ggtitle("Race - Rejected ") +
  xlab("Race")
grid.arrange(race.all, race.rej, ncol=2)

By Age

age.all <- ggplot(merged.data, aes(x=Age)) +
  geom_histogram(stat = "bin", binwidth = 4) +
  theme_bw() +
  ggtitle("Age - All VBM") +
  xlab("Age") +
  scale_y_continuous(labels = comma) +
  xlim(18, 100)
age.rej <- ggplot(justrejected, aes(x=Age)) +
  geom_histogram(stat = "bin", binwidth = 4) +
  theme_bw() +
  ggtitle("Age - Rejected") +
  xlab("Age") +
  scale_y_continuous(labels = comma) +
  xlim(18, 100)
grid.arrange(age.all, age.rej, ncol=2)

By Party

party.all <- ggplot(data = subset(merged.data, !is.na(simpleParty)), aes(x=simpleParty)) +
  geom_histogram(stat="count") +
  theme_bw() +
  ggtitle("Party - All VBM") +
  xlab("Party") +
  scale_y_continuous(labels = comma) 
party.rej <- ggplot(justrejected, aes(x=simpleParty))+
  geom_histogram(stat="count")  +
  theme_bw() +
  ggtitle("Party - Rejected") +
  xlab("Party") +
  scale_y_continuous(labels = comma) 
grid.arrange(party.all, party.rej, ncol=2)

Mapping Rejections

FL.VBM.OG <- left_join(FL.VBM.OG, popDensity, by="CountyID")

# count all VBM records and rejected ballots (status N or E) per county,
# then calculate rejection rates 
Rej.Rates <- FL.VBM.OG %>% 
  group_by(CountyID, CountyType) %>% 
  summarise(n = n(),
            all_Rej = sum(Abs.Request.Status %in% c("N", "E"))) %>% 
  ungroup() %>% 
  mutate(Rej.Rate = (all_Rej/n)*100)

FL.VBM.OG$Abs.Request.Status <- as.factor(FL.VBM.OG$Abs.Request.Status)

Load in the Shape File:

The shapefile I used was taken from this website.

fl.boundary <- st_read('F:/Florida_Counties')

## Reading layer `Florida_Counties' from data source `F:\Florida_Counties' using driver `ESRI Shapefile'
## Simple feature collection with 67 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -87.62601 ymin: 24.54522 xmax: -80.03095 ymax: 30.99702
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

for.plotting <- merge(fl.boundary, Rej.Rates, by = "CountyID")

Urban, Rural, and in Between Counties in Florida

ggplot() + 
  geom_sf(data = for.plotting, aes(fill = CountyType)) +
  ggtitle("Urban and Rural Counties") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.ticks = element_blank(), axis.text = element_blank())

How many Vote By Mail Ballots are Cast per County?

ggplot() + 
  geom_sf(data = for.plotting, aes(fill = n)) + 
  scale_y_continuous(labels = comma) +
  scale_fill_gradient(labels = comma) +
  ggtitle("Total VBM Ballots Cast") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.ticks = element_blank(), axis.text = element_blank())

How do Rejection Rates differ throughout the state?

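It’s worth noting here that these rejection rates were produced using the original VBM file (FL.VBM.OG) and not the subsetted data. The denominator therefore counts every VBM record in the file, including requested ballots that were never returned, and ballots received after the election have not been removed.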
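The choice of denominator changes the rate. As a small illustration on made-up records, the sketch below computes a county rejection rate two ways: per record in the file, and per returned ballot. The status code "R" is a hypothetical stand-in for a requested-but-unreturned ballot, and the counts are invented.

```r
# Made-up ballot records: V = voted, N = no signature, E = voter error,
# and "R" as a hypothetical stand-in for a requested-but-unreturned ballot
toy <- data.frame(
  CountyID = c("ALA", "ALA", "ALA", "ALA", "DUV", "DUV", "DUV", "DUV", "DUV"),
  Status   = c("V",   "V",   "N",   "R",   "V",   "V",   "V",   "E",   "R"),
  stringsAsFactors = FALSE
)
toy$rejected <- toy$Status %in% c("N", "E")
toy$returned <- toy$Status %in% c("V", "N", "E")

# tapply() returns one value per county, ordered alphabetically by CountyID
rates <- data.frame(
  CountyID    = sort(unique(toy$CountyID)),
  all.records = as.vector(tapply(toy$rejected, toy$CountyID, length)),
  returned    = as.vector(tapply(toy$returned, toy$CountyID, sum)),
  rejected    = as.vector(tapply(toy$rejected, toy$CountyID, sum))
)
rates$rate.all      <- 100 * rates$rejected / rates$all.records
rates$rate.returned <- 100 * rates$rejected / rates$returned
rates
```

Rates computed per returned ballot are always at least as high as rates computed per record, so the two should not be compared directly.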

ggplot() + 
  geom_sf(data = for.plotting, aes(fill = Rej.Rate)) + 
  scale_y_continuous(labels = comma) +
  ggtitle("VBM Rejection Rates") +
  theme_bw() +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), axis.ticks = element_blank(), axis.text = element_blank())

Conclusions

Modeling and visualizing VBM rejections helps to demonstrate that the voting by mail process benefits some voters and disadvantages others.

For example, young people and people of color are more likely to have their ballots rejected than older voters and white voters, respectively.

Ballot rejection rates vary from county to county, from 0% to about 2%, with no clear difference between Urban and Rural counties. Though these numbers seem low, they still represent votes that are excluded from our democracy and people who cared enough about voting to request and return a ballot. In many cases, these individuals aren’t even aware that their ballots weren’t accepted in the election.

Voter education, SOE outreach, and voter proactiveness among VBM users are of utmost importance to prevent ballot rejections and improve our democracy’s inclusiveness.

Future Research

In the future, other avenues of research that could be explored are VBM rejections in other states, as well as modeling county rejection rates as a function of a county’s geographical properties.

A project similar to this one could be completed with Colorado’s data and would likely be more accurate. There, almost all voters vote by mail and they have ballot tracking data that would allow for more precise dates for ballot delivery and transit.

Another state’s data could be used, like Washington’s, since their elections are run exclusively through the mail. With an exclusively VBM state’s data you would have a more comprehensive view of who is getting rejected, without the added bias of those who choose to seek out VBM voting.

Additionally, a linear regression could be fit to predict the rejection rate of each county in Florida or in another state. The variables in this model could include population density, the number of postal workers in the county, and the overall area of the county, possibly among others. Instead of modeling rejection likelihood (0 or 1), we would use these to model rejection rates (percentages). This separate model might be more accurate than including the Urban/Rural dummy variables in the logistic regression, because the unit of analysis would be the county instead of the voter.