OB Boston Employee Earnings Report 2023

Start Date: 11 Sept. 2024

Report Date: 25 September 2024

Perform EDA with OB data team.

Payroll Categories

payroll_categories %>% 
  DT::datatable()

downloadthis::download_this(payroll_categories,output_name = "Payroll_Categories",output_extension = ".xlsx")

Clean and Shape

dataframe1=clean_names(dataframe,allow_dupes = F)

skimr::skim(dataframe1)

Data summary
Name	dataframe1
Number of rows	25812
Number of columns	12
_______________________
Column type frequency:
character	4
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1	6	42	25740
department_name	0	1	4	30	234
title	0	1	3	30	1550
postal	63	1	5	10	657

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
regular	2245	0.91	74330.47	45473.35	-1699.38	35415.09	77440.67	114228.86	359942.30	▇▇▁▁▁
retro	11210	0.57	3469.63	2825.20	-1146.38	1485.31	3267.67	4997.70	22450.56	▇▆▁▁▁
other	8414	0.67	4729.68	9543.21	-2268.08	800.00	2096.08	5000.00	230572.23	▇▁▁▁▁
overtime	18502	0.28	20960.39	24669.36	-1414.04	2449.42	12543.50	30941.36	221579.41	▇▁▁▁▁
injured	24652	0.04	35704.23	38548.61	0.78	6796.16	19765.44	51690.58	182954.01	▇▂▂▁▁
detail	23653	0.08	16594.87	20494.22	108.00	2280.00	8512.00	23348.88	200160.00	▇▁▁▁▁
quinn_education	24493	0.05	20914.16	8773.46	0.01	17993.92	19887.42	26161.00	52416.68	▃▇▆▂▁
total_gross	0	1.00	83013.64	61208.30	0.16	27834.72	79499.05	129116.38	426425.10	▇▆▁▁▁

Four character and eight numeric columns for a total of 12.

Regular has 2245 missing entries - are some folks not getting base pay?

Retro refers to wages employers owe their workers and have not yet paid. So, 25812 - 11210 = 14602. Are 14,602 people experiencing errors and not getting paid properly?

total_gross has 0 missing entries - everyone is getting paid something in place of base salary.

Descriptive Stats

Who’s making $.20 per year for total_gross?

summary(dataframe1$total_gross)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.2  27834.7  79499.0  83013.6 129116.4 426425.1

sd(dataframe1$total_gross)

## [1] 61208.3

dataframe1 %>% 
  filter(total_gross <= 1) %>% 
  select(name,total_gross,everything())

Bettencourt,Katarina was retro paid $.16 as a substitute teacher.

Department size by count

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(headcount=n())%>%
  mutate(percent=headcount/sum(headcount)) %>% 
  arrange(desc(percent)) %>% 
  mutate(percent = scales::percent(percent,2)) %>% 
  ungroup() %>% 
  add_row(headcount=sum(.$headcount))

There are folks with more than one job title in each department. The Police Dept. shows 3,007 unique names (out of the 3,011 with duplicates), meaning four people “wear more than one hat.”

Police dept. duplicate by `name`

dataframe1 %>% 
  filter(department_name =="Boston Police Department") %>% 
  group_by(name) %>% 
  arrange(name) %>% 
  filter(n()> 1)

Total gross and injured pay by dept. and title

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct))

dataframe1 %>% 
  group_by(department_name,title) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct))

Descriptive stats by `Total Gross` by dept.

dataframe1 %>% 
  group_by(department_name) %>% 
  summarise(mean_pay = mean(total_gross,na.rm = T),
            median_pay = median(total_gross, na.rm=T),
            min_pay = min(total_gross,na.rm = T),
            max_pay = max(total_gross,na.rm = T)) %>% 
  arrange(desc(median_pay))

Isolate only fire and police depts.

test_DF = dataframe1 %>% 
  filter(department_name %in% c("Boston Fire Department","Boston Police Department"))

test_DF %>% 
  group_by(department_name) %>% 
  summarise(injured=sum(injured,na.rm = T),
            total_gross=sum(total_gross,na.rm = T)) %>% 
  mutate(injured_pct = round((injured/total_gross)*100,1)) %>% 
  arrange(desc(injured_pct))

Who gets injured pay more? A test.

fire_injured = test_DF %>% 
  filter(department_name=="Boston Fire Department",
         !is.na(injured))

police_injured= test_DF %>% 
  filter(department_name=="Boston Police Department",
         !is.na(injured))

hist(fire_injured$injured)

shapiro.test(police_injured$injured)

## 
##  Shapiro-Wilk normality test
## 
## data:  police_injured$injured
## W = 0.89395, p-value = 3.641e-16

Is there a difference in median injured pay at all?

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000006202
## alternative hypothesis: true location shift is not equal to 0

Yes. Does the police get paid more in injured pay than the fire dept.?

Given the p-value (0.00000006202), we can reject the null hypothesis that there is no difference in the injury data between the police department and the fire department.

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "greater")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000003101
## alternative hypothesis: true location shift is greater than 0

It’s statistically significant that the police dept gets paid more than the fire dept.

wilcox.test(police_injured$injured,fire_injured$injured,alternative = "less")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  police_injured$injured and fire_injured$injured
## W = 113736, p-value = 1
## alternative hypothesis: true location shift is less than 0

It’s not statistically significant that the fire dept gets paid more than the police dept.