Stats II Midterm - Mariana Peinado

For this project, we are working with data from Su, Min and Christian Buerger, (2025) “Playing Politics with Traffic Fines: Sheriff Elections and Political Cycles in Traffic Fines Revenue”

Abstract: The political budget cycle theory has extensively documented how politicians manipulate policies during election years to gain an electoral advantage. This paper focuses on county sheriffs, crucial but often neglected local officials, and investigates their opportunistic political behavior during elections. Using a panel data set covering 57 California county governments over four election cycles, we find compelling evidence of traffic enforcement policy manipulation by county sheriffs during election years. Specifically, a county’s per capita traffic fines revenue is 9% lower in the election than in nonelection years. The magnitude of the political cycle intensifies when an election is competitive. Our findings contribute to the political budget cycle theory and provide timely insights into the ongoing debate surrounding law enforcement reform and local governments’ increasing reliance on fines and fees revenue.

Hypotheses

The primary hypothesis is that a county’s per capita traffic fines revenue is lower in a sheriff election year than in nonelection years.

The secondary hypothesis is that the relationship between election years and per capita traffic fines revenue is conditional on the percentage of young drivers in a county, as counties with a high proportion of young drivers will be less likely to see lower traffic fines revenue in an election year.
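One way to write the two hypotheses formally is as an interaction model. The symbols below are notation introduced here for illustration, not taken from the paper: Y is per capita traffic fines revenue, E is the 0/1 election indicator, D is the share of young drivers, and X collects the controls.

```latex
Y_{it} = \beta_0 + \beta_1 E_{it} + \beta_2 D_{it} + \beta_3 \,(E_{it} \times D_{it}) + \mathbf{X}_{it}'\boldsymbol{\gamma} + \varepsilon_{it}

\mathrm{E}[Y_{it} \mid E_{it} = 1] - \mathrm{E}[Y_{it} \mid E_{it} = 0] = \beta_1 + \beta_3 D_{it}
```

Under this notation, the primary hypothesis says the election-year difference is negative, and the secondary hypothesis says \beta_3 > 0, so the difference shrinks toward zero as the share of young drivers rises.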

Data and Methods

The dependent variable is per capita inflation-adjusted net revenues from fines for moving and parking violations collected in a county.

The key independent variable is elec_dummy, which is a binary indicator coded as 1 if a sheriff election is held in the year and 0 otherwise. The model also includes young_drivers, which is the % share of people between 15 and 24 in a county’s total population.

Controls are included for the share of Democratic voters (dem_share), the share of independent/third party voters (otherparty_share), the share of Asians in the county (ASIAN_share), the share of African Americans in the county (BLACK_share), the share of Hispanics in the county (HISPANIC_share), the share of all other non-white races in the county (OTHER_share), the number of people per square mile in the county (density), median household income in the county (MedInc), county unemployment rates (Unemp), the share of a county’s own-source revenue in total county revenue (OwnSourceShare), the annual average of monthly employment levels in goods producing industries (emp_goods), the annual average of monthly employment levels in service producing industries (emp_service), the average annual pay in goods producing industries (pay_goods_i), the average annual pay in service producing industries (pay_service_i), the percentage of state arterial road mileage in the county (arte_share), the percentage of state collector road mileage in the county (collect_share), the number of sworn law enforcement personnel per 1000 residents (CNTY_LE_SWORN_1000p), the number of felony crimes per 1000 residents (felony_tot_1000p), and the number of misdemeanors per 1000 residents (misdemeanor_tot_1000p).

The model also includes county-specific linear time trend variables, i_trend_1 through i_trend_56, with i_trend_57 as the excluded reference category. These control for the possibility that fine revenue follows a different linear trajectory over time in each county.
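A minimal sketch of how county-specific linear trend variables of this kind could be built, assuming a toy panel with hypothetical columns county and year. The dataset already contains the i_trend_* columns, and the authors' exact construction is not shown here, so this only illustrates what such variables plausibly encode.

```r
# Toy panel: 3 hypothetical counties observed over 4 years
panel <- expand.grid(county = 1:3, year = 2001:2004)
panel <- panel[order(panel$county, panel$year), ]

# Time index that starts at 1 in the first year of the panel
panel$t <- panel$year - min(panel$year) + 1

# One trend column per county: equal to t in that county's rows, 0 elsewhere
for (k in sort(unique(panel$county))) {
  panel[[paste0("i_trend_", k)]] <- ifelse(panel$county == k, panel$t, 0)
}

head(panel)
```

Each column lets the outcome follow its own straight-line path over time in one county, which is what the trend controls are described as doing.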

More information about the variables in the dataset can be found in the Codebook_26may2024.pdf file.

You download the Dataverse files, which include the following code to replicate the paper's results:

data <- read.csv("sheriff_elections.csv")

model1 <- lm(VehicleCodeFines_i_p ~ elec_dummy + young_drivers + elec_dummy * young_drivers +
    dem_share + otherparty_share + ASIAN_share + BLACK_share + HISPANIC_share + OTHER_share +
    density + lnMedInc + Unemp + OwnSourceShare + emp_goods + emp_service + pay_goods_i +
    pay_service_i + arte_share + collect_share + CNTY_LE_SWORN_1000p + felony_tot_1000p +
    misdemeanor_tot_1000p + i_trend_1 + i_trend_2 + i_trend_3 + i_trend_4 + i_trend_5 +
    i_trend_6 + i_trend_7 + i_trend_8 + i_trend_9 + i_trend_10 + i_trend_11 + i_trend_12 +
    i_trend_13 + i_trend_14 + i_trend_15 + i_trend_16 + i_trend_17 + i_trend_18 +
    i_trend_19 + i_trend_20 + i_trend_21 + i_trend_22 + i_trend_23 + i_trend_24 +
    i_trend_25 + i_trend_26 + i_trend_27 + i_trend_28 + i_trend_29 + i_trend_30 +
    i_trend_31 + i_trend_32 + i_trend_33 + i_trend_34 + i_trend_35 + i_trend_36 +
    i_trend_37 + i_trend_38 + i_trend_39 + i_trend_40 + i_trend_41 + i_trend_42 +
    i_trend_43 + i_trend_44 + i_trend_45 + i_trend_46 + i_trend_47 + i_trend_48 +
    i_trend_49 + i_trend_50 + i_trend_51 + i_trend_52 + i_trend_53 + i_trend_54 +
    i_trend_55 + i_trend_56, data = data)
summary(model1)
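In R's formula syntax, a * b already expands to a + b + a:b, so listing both main effects and then elec_dummy * young_drivers specifies the same model as elec_dummy * young_drivers alone. The sketch below checks this on simulated data (the toy data frame and its coefficients are made up for illustration) and shows how the long list of trend terms could be generated with paste0() and reformulate() rather than typed by hand.

```r
set.seed(1)
toy <- data.frame(
  elec_dummy    = rep(c(0, 1), 50),
  young_drivers = runif(100, 10, 20)
)
toy$y <- 5 - 0.5 * toy$elec_dummy + 0.1 * toy$young_drivers + rnorm(100)

long_form  <- lm(y ~ elec_dummy + young_drivers + elec_dummy * young_drivers, data = toy)
short_form <- lm(y ~ elec_dummy * young_drivers, data = toy)

# Both specifications yield the same coefficients
all.equal(coef(long_form), coef(short_form))

# Build the trend terms programmatically instead of typing them out
trend_terms <- paste0("i_trend_", 1:56)
rhs <- c("elec_dummy * young_drivers", trend_terms)
f <- reformulate(rhs, response = "VehicleCodeFines_i_p")
f
```

The remaining controls would be added to rhs in the same way before passing f to lm().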

Your task is to make the model better. There are a few outright errors, some generally bad practices, and many subjective areas of improvement. You need to identify and fix/improve 5.

Some tips and guidelines:

Don’t forget the basics. Not everything is hard.
Some of the modifications I made to the original model made the main finding disappear. Bringing it back does not inherently make the model better, nor are you required to do so.
There is no requirement to run a certain set of diagnostic tests. You get points for a) identifying issues, b) fixing them, and c) producing a final model that is less bad.
The model is set up so that there are at least 5 issues that relate to things we have explicitly covered in class. You can identify issues/implement fixes that go beyond class material if you want, but doing so isn’t necessary.
Dropping two variables does not count as two fixes unless you are dropping them to address different issues.
Your written explanations should be clear and concise, generally no more than 1-2 sentences.
A lot of this is subjective and is less about making the right decision and more about whether you can reasonably justify the choices you make.

# R Diagnostics
summary(data)

summary(data$elec_dummy)

summary(data$VehicleCodeFines_i_p)

# R Diagnostics 1
range(data$elec_dummy, na.rm = TRUE)

# R Diagnostics 2
range(data$VehicleCodeFines_i_p, na.rm = TRUE)

# R Diagnostics 3
hist(data$VehicleCodeFines_i_p)

# R Diagnostics 4
library(ggplot2)
ggplot(data, aes(x = elec_dummy, y = VehicleCodeFines_i_p)) + geom_point() + labs(x = "sheriff election",
    y = "per capita inflation-adjusted net revenues", title = "sheriff election vs per capita inflation-adjusted net revenues")

fits <- fitted(model1)
y <- model1$model[[1]]

plot(y, fits, xlab = "Observed y", ylab = "Predicted y (y-hat)", main = "Predicted vs Observed",
    pch = 19)

abline(0, 1, col = "red", lty = 2)

res <- residuals(model1)
plot(fits, res, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")

abline(h = 0, lty = 2, col = "red")

library(lmtest)

library(sandwich)

coeftest(model1, vcov = vcovHC(model1, type = "HC1"))

hist(data$logVehicleCodeFines_i_p)
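For reference, the HC1 estimator requested in the coeftest() call is the White sandwich estimator scaled by n / (n - k). The sketch below computes it by hand in base R on simulated heteroskedastic data. The toy variables x and y are invented for illustration, and in practice vcovHC() does this work.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
# Error variance grows with x, so classical standard errors are unreliable
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + 0.5 * x)
fit <- lm(y ~ x)

X <- model.matrix(fit)   # n x k design matrix
e <- residuals(fit)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^-1
meat  <- crossprod(X * e)      # X' diag(e^2) X
hc1   <- (n / (n - k)) * bread %*% meat %*% bread

robust_se    <- sqrt(diag(hc1))
classical_se <- sqrt(diag(vcov(fit)))
cbind(classical_se, robust_se)
```

Only the standard errors change under this correction; the coefficient estimates are the same as in summary().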

Issues Identified

Be specific about what the issue is and why it is a problem.

1. The key independent variable (elec_dummy) should be a dummy variable coded 0 and 1, but its observed values are -1 and 1.

2. The dependent variable (VehicleCodeFines_i_p) is skewed, according to the histogram.

3. The relationship between the independent and dependent variables is non-linear.

4. The model predictions do not fit the observed data well (from the predicted vs. observed plot): the points are not evenly dispersed around the 45° line.

5. The model errors do not look like random noise (from the residuals vs. fitted plot): the residuals are not evenly dispersed around 0 across all values of Y hat.

Fixes Implemented

What are you doing to address each of the issues you have identified and why?

I will re-code the independent variable (elec_dummy) so that it takes values of 0 and 1 instead of -1 and 1. Dummy variables should be coded as 0 and 1 so that the coefficient can be interpreted as the difference between election years and non-election years.

# Fix 1
fixed_data_1 <- data
fixed_data_1$elec_dummy[fixed_data_1$elec_dummy == -1] <- 0
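A quick check that the recode leaves only 0 and 1 can guard against leftover values. The sketch below uses a made-up vector standing in for elec_dummy, and wraps the comparison in which() so that any missing values are skipped rather than used as an index. The same checks could be run on fixed_data_1$elec_dummy.

```r
# Hypothetical stand-in for the miscoded election indicator
elec <- c(-1, 1, -1, -1, 1, NA, 1)

# Recode -1 to 0; values of 1 and NA are left untouched
elec[which(elec == -1)] <- 0

# Inspect the resulting categories, including any missing values
table(elec, useNA = "ifany")

# Fail loudly if anything other than 0, 1, or NA remains
stopifnot(all(elec %in% c(0, 1, NA)))
```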

I will use the logarithm of VehicleCodeFines_i_p as the dependent variable (which is already in the dataset) to make the linear model more appropriate for analyzing the relationship among the variables. The original variable is highly skewed, and taking the logarithm reduces skewness and improves the suitability of the linear regression model.
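To illustrate why logging helps, the sketch below simulates a right-skewed, revenue-like variable and compares a simple moment-based skewness measure before and after the transformation. The simulated values and the helper skew() are assumptions made for illustration, not the county data. Note that log() requires strictly positive values.

```r
set.seed(123)
# Simulated right-skewed variable (log-normal), standing in for fines revenue
fines <- rlnorm(500, meanlog = 2, sdlog = 0.8)

# Hypothetical helper: third central moment divided by the cubed standard deviation
skew <- function(v) {
  v <- v[!is.na(v)]
  mean((v - mean(v))^3) / sd(v)^3
}

skew(fines)        # clearly positive: long right tail
skew(log(fines))   # close to zero after the log transformation
```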

# Fix 2 - This is implemented in my model 2 below

# Fix3

# Fix4

# Fix5

Revised model

Remember, all models are wrong. Some models are useful.

# Revised model
model2 <- lm(logVehicleCodeFines_i_p ~ elec_dummy + young_drivers + elec_dummy *
    young_drivers + dem_share + otherparty_share + ASIAN_share + BLACK_share + HISPANIC_share +
    OTHER_share + density + lnMedInc + Unemp + OwnSourceShare + emp_goods + emp_service +
    pay_goods_i + pay_service_i + arte_share + collect_share + CNTY_LE_SWORN_1000p +
    felony_tot_1000p + misdemeanor_tot_1000p + i_trend_1 + i_trend_2 + i_trend_3 +
    i_trend_4 + i_trend_5 + i_trend_6 + i_trend_7 + i_trend_8 + i_trend_9 + i_trend_10 +
    i_trend_11 + i_trend_12 + i_trend_13 + i_trend_14 + i_trend_15 + i_trend_16 +
    i_trend_17 + i_trend_18 + i_trend_19 + i_trend_20 + i_trend_21 + i_trend_22 +
    i_trend_23 + i_trend_24 + i_trend_25 + i_trend_26 + i_trend_27 + i_trend_28 +
    i_trend_29 + i_trend_30 + i_trend_31 + i_trend_32 + i_trend_33 + i_trend_34 +
    i_trend_35 + i_trend_36 + i_trend_37 + i_trend_38 + i_trend_39 + i_trend_40 +
    i_trend_41 + i_trend_42 + i_trend_43 + i_trend_44 + i_trend_45 + i_trend_46 +
    i_trend_47 + i_trend_48 + i_trend_49 + i_trend_50 + i_trend_51 + i_trend_52 +
    i_trend_53 + i_trend_54 + i_trend_55 + i_trend_56, data = fixed_data_1)
summary(model2)

fits <- fitted(model2)
y <- model2$model[[1]]

plot(y, fits, xlab = "Observed y", ylab = "Predicted y (y-hat)", main = "Predicted vs Observed",
    pch = 19)

abline(0, 1, col = "red", lty = 2)

res <- residuals(model2)

plot(fits, res, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")

abline(h = 0, lty = 2, col = "red")

coeftest(model2, vcov = vcovHC(model2, type = "HC1"))

In model 2, after using the log of the dependent variable (logVehicleCodeFines_i_p) and correctly re-coding the independent variable (elec_dummy), I observe that the model predictions fit the observed data better (comparing the predicted vs. observed plots). The logarithm therefore improves the specification. The residuals vs. fitted plot for model 2 also shows errors that are randomly dispersed around zero, suggesting a better model fit. The robust standard errors decrease between model 1 and model 2 (from 1.75 to 0.37), though part of this decrease reflects the log scale of the new dependent variable rather than greater precision alone. However, the coefficient for the independent variable (elec_dummy) remains statistically insignificant, suggesting that sheriff elections do not have a detectable effect on fine revenues in model 2.
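Because model 2 has a logged dependent variable, the elec_dummy coefficient is read as a proportional change in fines revenue rather than a change in dollars, and, given the interaction term, it applies when young_drivers equals 0. A small helper for the exact conversion is sketched below. The name pct_change() and the example coefficients are hypothetical and are not estimates from model 2.

```r
# Convert a coefficient on a 0/1 dummy in a log-outcome model to a percent change
pct_change <- function(b) 100 * (exp(b) - 1)

pct_change(-0.05)       # roughly -4.9
pct_change(log(0.91))   # approximately -9, i.e. a 9% decrease
```

For small coefficients, 100 * b is a close approximation, but the exact formula is safer as the coefficient grows in magnitude.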