Start Date: 11 Sept. 2024
Report Date: 25 September 2024
Perform EDA with OB data team.
payroll_categories %>%
DT::datatable()
downloadthis::download_this(payroll_categories,output_name = "Payroll_Categories",output_extension = ".xlsx")
dataframe1=clean_names(dataframe,allow_dupes = F)
skimr::skim(dataframe1)
Name | dataframe1 |
Number of rows | 25812 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 4 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1 | 6 | 42 | 0 | 25740 | 0 |
department_name | 0 | 1 | 4 | 30 | 0 | 234 | 0 |
title | 0 | 1 | 3 | 30 | 0 | 1550 | 0 |
postal | 63 | 1 | 5 | 10 | 0 | 657 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
regular | 2245 | 0.91 | 74330.47 | 45473.35 | -1699.38 | 35415.09 | 77440.67 | 114228.86 | 359942.30 | ▇▇▁▁▁ |
retro | 11210 | 0.57 | 3469.63 | 2825.20 | -1146.38 | 1485.31 | 3267.67 | 4997.70 | 22450.56 | ▇▆▁▁▁ |
other | 8414 | 0.67 | 4729.68 | 9543.21 | -2268.08 | 800.00 | 2096.08 | 5000.00 | 230572.23 | ▇▁▁▁▁ |
overtime | 18502 | 0.28 | 20960.39 | 24669.36 | -1414.04 | 2449.42 | 12543.50 | 30941.36 | 221579.41 | ▇▁▁▁▁ |
injured | 24652 | 0.04 | 35704.23 | 38548.61 | 0.78 | 6796.16 | 19765.44 | 51690.58 | 182954.01 | ▇▂▂▁▁ |
detail | 23653 | 0.08 | 16594.87 | 20494.22 | 108.00 | 2280.00 | 8512.00 | 23348.88 | 200160.00 | ▇▁▁▁▁ |
quinn_education | 24493 | 0.05 | 20914.16 | 8773.46 | 0.01 | 17993.92 | 19887.42 | 26161.00 | 52416.68 | ▃▇▆▂▁ |
total_gross | 0 | 1.00 | 83013.64 | 61208.30 | 0.16 | 27834.72 | 79499.05 | 129116.38 | 426425.10 | ▇▆▁▁▁ |
Four character and eight numeric columns for a total of 12.
Regular
has 2245 missing entries - are some folks not
getting base pay?
Retro
refers to wages employers owe their workers and
have not yet paid. So, 25812 - 11210 = 14602. Are
14,602 people experiencing errors and not getting paid properly?
total_gross
has 0 missing entries - everyone is getting
paid something in place of base salary.
Who’s making $.20 per year for total_gross
?
summary(dataframe1$total_gross)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2 27834.7 79499.0 83013.6 129116.4 426425.1
sd(dataframe1$total_gross)
## [1] 61208.3
dataframe1 %>%
filter(total_gross <= 1) %>%
select(name,total_gross,everything())
Bettencourt,Katarina was retro
paid $.16 as a substitute
teacher.
dataframe1 %>%
group_by(department_name) %>%
summarise(headcount=n())%>%
mutate(percent=headcount/sum(headcount)) %>%
arrange(desc(percent)) %>%
mutate(percent = scales::percent(percent,2)) %>%
ungroup() %>%
add_row(headcount=sum(.$headcount))
There are folks with more than one job title in each department. The Police Dept. shows 3,007 unique names (out of the 3,011 with duplicates), meaning four people “wear more than one hat.”
name
dataframe1 %>%
filter(department_name =="Boston Police Department") %>%
group_by(name) %>%
arrange(name) %>%
filter(n()> 1)
dataframe1 %>%
group_by(department_name) %>%
summarise(injured=sum(injured,na.rm = T),
total_gross=sum(total_gross,na.rm = T)) %>%
mutate(injured_pct = round((injured/total_gross)*100,1)) %>%
arrange(desc(injured_pct))
dataframe1 %>%
group_by(department_name,title) %>%
summarise(injured=sum(injured,na.rm = T),
total_gross=sum(total_gross,na.rm = T)) %>%
mutate(injured_pct = round((injured/total_gross)*100,1)) %>%
arrange(desc(injured_pct))
Total Gross
by dept.dataframe1 %>%
group_by(department_name) %>%
summarise(mean_pay = mean(total_gross,na.rm = T),
median_pay = median(total_gross, na.rm=T),
min_pay = min(total_gross,na.rm = T),
max_pay = max(total_gross,na.rm = T)) %>%
arrange(desc(median_pay))
Isolate only fire and police depts.
test_DF = dataframe1 %>%
filter(department_name %in% c("Boston Fire Department","Boston Police Department"))
test_DF %>%
group_by(department_name) %>%
summarise(injured=sum(injured,na.rm = T),
total_gross=sum(total_gross,na.rm = T)) %>%
mutate(injured_pct = round((injured/total_gross)*100,1)) %>%
arrange(desc(injured_pct))
fire_injured = test_DF %>%
filter(department_name=="Boston Fire Department",
!is.na(injured))
police_injured= test_DF %>%
filter(department_name=="Boston Police Department",
!is.na(injured))
hist(fire_injured$injured)
shapiro.test(police_injured$injured)
##
## Shapiro-Wilk normality test
##
## data: police_injured$injured
## W = 0.89395, p-value = 3.641e-16
Is there a difference in median injured pay at all?
wilcox.test(police_injured$injured,fire_injured$injured,alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000006202
## alternative hypothesis: true location shift is not equal to 0
Yes. Does the police get paid more in injured pay than the fire dept.?
Given the p-value (0.00000006202), we can reject the null hypothesis that there is no difference in the injury data between the police department and the fire department.
wilcox.test(police_injured$injured,fire_injured$injured,alternative = "greater")
##
## Wilcoxon rank sum test with continuity correction
##
## data: police_injured$injured and fire_injured$injured
## W = 113736, p-value = 0.00000003101
## alternative hypothesis: true location shift is greater than 0
It’s statistically significant that the police dept gets paid more than the fire dept.
wilcox.test(police_injured$injured,fire_injured$injured,alternative = "less")
##
## Wilcoxon rank sum test with continuity correction
##
## data: police_injured$injured and fire_injured$injured
## W = 113736, p-value = 1
## alternative hypothesis: true location shift is less than 0
It’s not statistically significant that the fire dept gets paid more than the police dept.