data110project1

Author

Emma Poch

This data set, sourced from the United States Census Bureau, details the payroll information for both full-time and part-time employees of each sector of local government for each U.S. state (as well as the collective information for the U.S. entirely). For full-time, part-time, full-time equivalent, and both full and part time, the dataset lists the total number of employees within each category, the total value of the payroll (in whole US dollars), and the coefficient of variation. I plan to generally explore how the payroll distribution varies among different departments and different states, likely focusing on total payroll for the sake of simplicity.

Preliminary setup code

setwd("C:/Users/emmap/Downloads/DATA110")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(RColorBrewer)
payrolls <- read_csv("2023_local.csv")

Rows: 1976 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, government_function
dbl (7): fte_variation_coefficient, ftp_variation_coefficient, pte_variation...
num (7): full_time_employment, full_time_payroll, part_time_employment, part...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Getting rid of repetitive rows and making a new column that lists the total payroll in millions of dollars
payroll2 <- payrolls |>
filter(government_function != "Total - All Government Employment Functions" & government_function != "Police Protection - Persons with Power of Arrest" & government_function != "Police Protection - Other" & government_function != "Fire Protection - Firefighters" & government_function != "Fire Protection - Other" & government_function != "Education - Elementary and Secondary Total" & government_function != "Education - Elementary and Secondary Other" & government_function != "Education - Elementary and Secondary Instructional" & government_function != "Education - Higher Education Total" & government_function != "Education - Higher Education Instructional" & government_function != "Education - Higher Education Other" & government_function != "Education - Other") |>
  mutate(payroll_millions = total_payroll/100000)

# Creating an additional variable to determine the value assigned per employee 
payroll3 <- payroll2 |>
  mutate(peremployee = total_payroll/total_full_time_and_part_time)

# Creating two filtered versions of the dataset, one with only countrywide information and one with only statewide information
countryroll <- payroll2 |>
  filter(state == "US")
stateroll <- payroll2 |>
  filter(state != "US")

# Creating a linear regression comparing employment levels to payroll to evaluate the relationship between the size of each department and the funding being allocated to its employees
model1 <- lm(full_time_payroll ~ full_time_employment, data = stateroll)
summary(model1)


Call:
lm(formula = full_time_payroll ~ full_time_employment, data = stateroll)

Residuals:
       Min         1Q     Median         3Q        Max 
-835005515   -3134760   -2292355   -1123410 1124212765 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.292e+06  1.521e+06   1.507    0.132    
full_time_employment 5.647e+03  3.923e+01 143.941   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 54050000 on 1324 degrees of freedom
Multiple R-squared:  0.9399,    Adjusted R-squared:  0.9399 
F-statistic: 2.072e+04 on 1 and 1324 DF,  p-value: < 2.2e-16

res <- resid(model1)
plot(fitted(model1), res)

This model yields the following equation: full time payroll = 5.647e+03(full time employment) + 2.292e+06. The p value is very low (close to 0) and the adjusted R-squared value (0.939) is high, suggesting a fairly significant level of correlation between the two variables. This implies a pretty consistent relationship between the amount of employees in a department and the total value allocated for payroll, which makes sense. However, given that it’s not completely linear, the size of the department is not entirely determinant of its payroll; this makes me curious as to which departments are better favored when size is excluded as a factor. The residuals plot raises some concern, however; there seems to be a heavy concentration of residuals to the left of the plot, implying that this model may not be an appropriate one to represent the relationship between the variables. Any interpretations from this model should likely be taken with a grain of salt.

# Determining the highest-funded departments overall to make data easier to analyze
countryroll2 <- countryroll |>
  filter(government_function != "All other and unallocable") |>
  group_by(government_function) |>
  arrange(desc(total_payroll)) |>
  head(11)
countryroll2

# A tibble: 11 × 17
# Groups:   government_function [11]
   state government_function      full_time_employment fte_variation_coefficient
   <chr> <chr>                                   <dbl>                     <dbl>
 1 US    Education Total                       6477793                      0.1 
 2 US    Police Protection Total                810674                      0.08
 3 US    Hospitals                              536074                      0.08
 4 US    Fire Protection Total                  369935                      0.04
 5 US    Health                                 249248                      0.15
 6 US    Financial Administration               243004                      0.03
 7 US    Highways                               267854                      0.06
 8 US    Corrections                            244243                      0.13
 9 US    Judicial and Legal                     228306                      0.05
10 US    Transit                                208162                      0.1 
11 US    Public Welfare                         250280                      0.08
# ℹ 13 more variables: full_time_payroll <dbl>,
#   ftp_variation_coefficient <dbl>, part_time_employment <dbl>,
#   pte_variation_coefficient <dbl>, part_time_payroll <dbl>,
#   ptp_variation_coefficient <dbl>, fulltime_equivalent_employment <dbl>,
#   ft_ee_coefficient <dbl>, total_full_time_and_part_time <dbl>,
#   total_full_and_part_employment_coefficient <dbl>, total_payroll <dbl>,
#   total_payroll_coefficient <dbl>, payroll_millions <dbl>

# Filtering so each state only includes the top 10 most highly-paid departments overall and we're only focusing on the DMV
stateroll2 <- stateroll |>
  filter(government_function == "Education Total" | government_function == "Police Protection Total" | government_function == "Hospitals" | government_function == "Fire Protection Total" | government_function == "Health" | government_function == "Financial Administration" | government_function == "Highways" | government_function == "Corrections" | government_function == "Judicial and Legal" | government_function == "Transit") |>
   filter(state == "MD" | state == "DC" | state == "VA")

plot1 <- stateroll2 |>
  ggplot(aes(x = government_function, y = payroll_millions, fill = state))+
  geom_bar(stat = "identity", width = 0.75, position = "stack")+
  ylim(-3000, 11000)+
  coord_polar()+
  theme_minimal()+
  labs(y = "Total Payroll (Millions)", title = "Local Government Payroll Distributions in the DMV", fill = "State", caption = "Source: US Census Bureau")+
  theme(axis.text.x = element_text(size = 6, angle = 45, hjust = 0.1, vjust = 0.1), axis.title = element_blank(), axis.text.y = element_text(size = 7), legend.position = c(1.2, 0.7))+
  scale_fill_brewer(palette = "Set3")
plot1

Warning: Removed 2 rows containing missing values (`geom_bar()`).

Reflection

Most of my data cleaning involved filtering out only the most pertinent elements of the data set; I was aware that it wouldn’t be feasible to attempt to display all possible states and departments, so I had to be selective about which departments to focus on. Additionally, there were several redundant departments included that I decided to get rid of. I also made the choice to convert the total payroll to millions instead of graphing it as its actual value in order to scale it down slightly and make the disparities between different departments more comparable. Regarding disparities, I was quite surprised at how significant the amount of funding given to the education system is compared to every other department. In the future, I’d be interested in looking deeper into the divisions between K-12 and higher education, as the inclusion of state university payrolls may be skewing the general perception of how much money is actually going towards funding education. I was a little displeased with the scaling of my plot and the placement of the labels; despite it being a relatively small detail, I struggled more than I’d anticipated to get the plot reasonably sized. The magnitude of education compared to the other departments was great enough that I wondered if my plot might be benefited by removing it, but I considered it an interesting enough detail that I wanted to refrain from removing it. Overall, although I do like elements about this plot, I think that the placement of the axis text and the height scaling of the bars make it less readable than it would ideally be.