Start Date: 11 Sept. 2024

Report Date: 25 September 2024

Source: City of Boston

Perform EDA with OB data team.

Payroll Categories

payroll_categories %>% 
  DT::datatable()
downloadthis::download_this(payroll_categories,output_name = "Payroll_Categories",output_extension = ".xlsx")

Clean and Shape

dataframe1=clean_names(dataframe,allow_dupes = F)

skimr::skim(dataframe1)
Data summary
Name dataframe1
Number of rows 25812
Number of columns 12
_______________________
Column type frequency:
character 4
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 6 42 0 25740 0
department_name 0 1 4 30 0 234 0
title 0 1 3 30 0 1550 0
postal 63 1 5 10 0 657 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
regular 2245 0.91 74330.47 45473.35 -1699.38 35415.09 77440.67 114228.86 359942.30 ▇▇▁▁▁
retro 11210 0.57 3469.63 2825.20 -1146.38 1485.31 3267.67 4997.70 22450.56 ▇▆▁▁▁
other 8414 0.67 4729.68 9543.21 -2268.08 800.00 2096.08 5000.00 230572.23 ▇▁▁▁▁
overtime 18502 0.28 20960.39 24669.36 -1414.04 2449.42 12543.50 30941.36 221579.41 ▇▁▁▁▁
injured 24652 0.04 35704.23 38548.61 0.78 6796.16 19765.44 51690.58 182954.01 ▇▂▂▁▁
detail 23653 0.08 16594.87 20494.22 108.00 2280.00 8512.00 23348.88 200160.00 ▇▁▁▁▁
quinn_education 24493 0.05 20914.16 8773.46 0.01 17993.92 19887.42 26161.00 52416.68 ▃▇▆▂▁
total_gross 0 1.00 83013.64 61208.30 0.16 27834.72 79499.05 129116.38 426425.10 ▇▆▁▁▁

Four character and eight numeric columns for a total of 12.

Regular has 2245 missing entries - are some folks not getting base pay?

Retro refers to wages employers owe their workers and have not yet paid. So, 25812 - 11210 = 14602. Are 14,602 people experiencing errors and not getting paid properly?

total_gross has 0 missing entries - everyone is getting paid something in place of base salary.

Descriptive Stats

Who’s making $.20 per year for total_gross?

summary(dataframe1$total_gross)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.2  27834.7  79499.0  83013.6 129116.4 426425.1
sd(dataframe1$total_gross)
## [1] 61208.3
dataframe1 %>% 
  filter(total_gross <= 1) %>% 
  select(name,total_gross,everything())

Bettencourt,Katarina was retro paid $.16 as a substitute teacher.

Department size by count

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(headcount=n())%>%
  mutate(percent=headcount/sum(headcount)) %>% 
  arrange(desc(percent)) %>% 
  mutate(percent = scales::percent(percent,2)) %>% 
  ungroup() %>% 
  add_row(headcount=sum(.$headcount))

There are folks with more than one job title in each department. The Police Dept. shows 3,007 unique names (out of the 3,011 with duplicates), meaning four people “wear more than one hat.”

Police dept. duplicate by name

dataframe1 %>% 
  filter(department_name =="Boston Police Department") %>% 
  group_by(name) %>% 
  arrange(name) %>% 
  filter(n()> 1)

Total gross and injured pay by dept. and title

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct)) 
dataframe1 %>% 
  group_by(department_name,title) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct))

Descriptive stats by Total Gross by dept.

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(mean_pay = mean(total_gross,na.rm = T),
            median_pay = median(total_gross, na.rm=T),
            min_pay = min(total_gross,na.rm = T),
            max_pay = max(total_gross,na.rm = T)) %>% 
  arrange(desc(median_pay))

Isolate only fire and police depts.

test_DF = dataframe1 %>% 
  filter(department_name %in% c("Boston Fire Department","Boston Police Department"))

test_DF %>% 
  group_by(department_name) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct))

Who gets injured pay more? A test.

fire_injured = test_DF %>% 
  filter(department_name=="Boston Fire Department",
         !is.na(injured))

police_injured= test_DF %>% 
  filter(department_name=="Boston Police Department",
         !is.na(injured))

hist(fire_injured$injured)

shapiro.test(police_injured$injured)
## 
##  Shapiro-Wilk normality test
## 
## data:  police_injured$injured
## W = 0.89395, p-value = 3.641e-16

Is there a difference in median injured pay at all?

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "two.sided")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000006202
## alternative hypothesis: true location shift is not equal to 0

Yes. Does the police get paid more in injured pay than the fire dept.?

Given the p-value (0.00000006202), we can reject the null hypothesis that there is no difference in the injury data between the police department and the fire department.

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "greater")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000003101
## alternative hypothesis: true location shift is greater than 0

It’s statistically significant that the police dept gets paid more than the fire dept.

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "less")
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 1
## alternative hypothesis: true location shift is less than 0

It’s not statistically significant that the fire dept gets paid more than the police dept.