This data set, sourced from the United States Census Bureau, details the payroll information for both full-time and part-time employees of each sector of local government for each U.S. state (as well as the collective information for the U.S. entirely). For full-time, part-time, full-time equivalent, and both full and part time, the dataset lists the total number of employees within each category, the total value of the payroll (in whole US dollars), and the coefficient of variation. I plan to generally explore how the payroll distribution varies among different departments and different states, likely focusing on total payroll for the sake of simplicity.
Rows: 1976 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, government_function
dbl (7): fte_variation_coefficient, ftp_variation_coefficient, pte_variation...
num (7): full_time_employment, full_time_payroll, part_time_employment, part...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Getting rid of repetitive rows and making a new column that lists the total payroll in millions of dollarspayroll2 <- payrolls |>filter(government_function !="Total - All Government Employment Functions"& government_function !="Police Protection - Persons with Power of Arrest"& government_function !="Police Protection - Other"& government_function !="Fire Protection - Firefighters"& government_function !="Fire Protection - Other"& government_function !="Education - Elementary and Secondary Total"& government_function !="Education - Elementary and Secondary Other"& government_function !="Education - Elementary and Secondary Instructional"& government_function !="Education - Higher Education Total"& government_function !="Education - Higher Education Instructional"& government_function !="Education - Higher Education Other"& government_function !="Education - Other") |>mutate(payroll_millions = total_payroll/100000)
# Creating an additional variable to determine the value assigned per employee payroll3 <- payroll2 |>mutate(peremployee = total_payroll/total_full_time_and_part_time)
# Creating two filtered versions of the dataset, one with only countrywide information and one with only statewide informationcountryroll <- payroll2 |>filter(state =="US")stateroll <- payroll2 |>filter(state !="US")
# Creating a linear regression comparing employment levels to payroll to evaluate the relationship between the size of each department and the funding being allocated to its employeesmodel1 <-lm(full_time_payroll ~ full_time_employment, data = stateroll)summary(model1)
Call:
lm(formula = full_time_payroll ~ full_time_employment, data = stateroll)
Residuals:
Min 1Q Median 3Q Max
-835005515 -3134760 -2292355 -1123410 1124212765
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.292e+06 1.521e+06 1.507 0.132
full_time_employment 5.647e+03 3.923e+01 143.941 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 54050000 on 1324 degrees of freedom
Multiple R-squared: 0.9399, Adjusted R-squared: 0.9399
F-statistic: 2.072e+04 on 1 and 1324 DF, p-value: < 2.2e-16
res <-resid(model1)plot(fitted(model1), res)
This model yields the following equation: full time payroll = 5.647e+03(full time employment) + 2.292e+06. The p value is very low (close to 0) and the adjusted R-squared value (0.939) is high, suggesting a fairly significant level of correlation between the two variables. This implies a pretty consistent relationship between the amount of employees in a department and the total value allocated for payroll, which makes sense. However, given that it’s not completely linear, the size of the department is not entirely determinant of its payroll; this makes me curious as to which departments are better favored when size is excluded as a factor. The residuals plot raises some concern, however; there seems to be a heavy concentration of residuals to the left of the plot, implying that this model may not be an appropriate one to represent the relationship between the variables. Any interpretations from this model should likely be taken with a grain of salt.
# Determining the highest-funded departments overall to make data easier to analyzecountryroll2 <- countryroll |>filter(government_function !="All other and unallocable") |>group_by(government_function) |>arrange(desc(total_payroll)) |>head(11)countryroll2
# A tibble: 11 × 17
# Groups: government_function [11]
state government_function full_time_employment fte_variation_coefficient
<chr> <chr> <dbl> <dbl>
1 US Education Total 6477793 0.1
2 US Police Protection Total 810674 0.08
3 US Hospitals 536074 0.08
4 US Fire Protection Total 369935 0.04
5 US Health 249248 0.15
6 US Financial Administration 243004 0.03
7 US Highways 267854 0.06
8 US Corrections 244243 0.13
9 US Judicial and Legal 228306 0.05
10 US Transit 208162 0.1
11 US Public Welfare 250280 0.08
# ℹ 13 more variables: full_time_payroll <dbl>,
# ftp_variation_coefficient <dbl>, part_time_employment <dbl>,
# pte_variation_coefficient <dbl>, part_time_payroll <dbl>,
# ptp_variation_coefficient <dbl>, fulltime_equivalent_employment <dbl>,
# ft_ee_coefficient <dbl>, total_full_time_and_part_time <dbl>,
# total_full_and_part_employment_coefficient <dbl>, total_payroll <dbl>,
# total_payroll_coefficient <dbl>, payroll_millions <dbl>
# Filtering so each state only includes the top 10 most highly-paid departments overall and we're only focusing on the DMVstateroll2 <- stateroll |>filter(government_function =="Education Total"| government_function =="Police Protection Total"| government_function =="Hospitals"| government_function =="Fire Protection Total"| government_function =="Health"| government_function =="Financial Administration"| government_function =="Highways"| government_function =="Corrections"| government_function =="Judicial and Legal"| government_function =="Transit") |>filter(state =="MD"| state =="DC"| state =="VA")
plot1 <- stateroll2 |>ggplot(aes(x = government_function, y = payroll_millions, fill = state))+geom_bar(stat ="identity", width =0.75, position ="stack")+ylim(-3000, 11000)+coord_polar()+theme_minimal()+labs(y ="Total Payroll (Millions)", title ="Local Government Payroll Distributions in the DMV", fill ="State", caption ="Source: US Census Bureau")+theme(axis.text.x =element_text(size =6, angle =45, hjust =0.1, vjust =0.1), axis.title =element_blank(), axis.text.y =element_text(size =7), legend.position =c(1.2, 0.7))+scale_fill_brewer(palette ="Set3")plot1
Most of my data cleaning involved filtering out only the most pertinent elements of the data set; I was aware that it wouldn’t be feasible to attempt to display all possible states and departments, so I had to be selective about which departments to focus on. Additionally, there were several redundant departments included that I decided to get rid of. I also made the choice to convert the total payroll to millions instead of graphing it as its actual value in order to scale it down slightly and make the disparities between different departments more comparable. Regarding disparities, I was quite surprised at how significant the amount of funding given to the education system is compared to every other department. In the future, I’d be interested in looking deeper into the divisions between K-12 and higher education, as the inclusion of state university payrolls may be skewing the general perception of how much money is actually going towards funding education. I was a little displeased with the scaling of my plot and the placement of the labels; despite it being a relatively small detail, I struggled more than I’d anticipated to get the plot reasonably sized. The magnitude of education compared to the other departments was great enough that I wondered if my plot might be benefited by removing it, but I considered it an interesting enough detail that I wanted to refrain from removing it. Overall, although I do like elements about this plot, I think that the placement of the axis text and the height scaling of the bars make it less readable than it would ideally be.