In this project I will be exploring salary differences in the Montgomery County Fire & Rescue department. I will examine pay differences between gender, different stations, and different employee grades. To do this, I will use a dataset called Employee Salaries, which was gathered by the Office of Human Resources (data owned by Kathy Luh). The data has 9,958 observations on 8 variables. I will be using Department, Division, Gender, base salary, and grade.
Variable
Description
Gender
The gender of the employee
Department
Department code for the department to which the employee is assigned.
Division
Name of the division within the County department to which the employee is assigned.
Base Salary
Annual base salary for the employee at the end of the calendar year. NOTE: This is the projected salary based on employee status and hours the employee is scheduled to work.
Grade
Assigned grade for the employee at the end of the calendar year. (For Fire & Rescue, it is the employees “rank” at the station. It ranges from F1 to B2
With these variables I will explore the following questions:
Is there a difference in base salary between men and women in Montgomery County Fire & Rescue?
Which variables contribute the most to a higher salary for the Montgomery County Fire & Rescue Department?
NOTE: No information was provided on how the data was collected, there is no ReadMe or similar file with that information available.
I chose this topic because it is important to understand what factors influence the salary of our first responders. If there is any discrimination occurring in the workplace then people should be aware of it. Additionally, my brother is currently serving as an EMT (Emergency Medical Technician) at Montgomery County Fire & Rescue Stations. As he has talked about his experiences, I have been interested in exploring data about the stations further.
Loading the Libraries and Data
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 9958 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Department, Department Name, Division, Gender, Base Salary, 2020 Ov...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(salaries)
# A tibble: 6 × 8
Department `Department Name` Division Gender `Base Salary` `2020 Overtime Pay`
<chr> <chr> <chr> <chr> <chr> <chr>
1 ABS Alcohol Beverage… Wholesa… F $78,902 $199.17
2 ABS Alcohol Beverage… Adminis… F $35,926 $0
3 ABS Alcohol Beverage… Adminis… M $167,345 $0
4 ABS Alcohol Beverage… Wholesa… F $90,848 $0
5 ABS Alcohol Beverage… Adminis… F $78,902 $205.16
6 ABS Alcohol Beverage… Marketi… F $109,761 $0
# ℹ 2 more variables: `2020 Longevity Pay` <chr>, Grade <chr>
Data Cleaning & Exploration
I need to conduct some basic cleaning before I can do my statistical analysis.
First, I will perform some simple cleaning on the variable names, changing them to all lowercase and replacing spaces with underscores.
# A tibble: 6 × 8
department department_name division gender base_salary overtime_pay
<chr> <chr> <chr> <chr> <chr> <chr>
1 ABS Alcohol Beverage Services Wholesal… F $78,902 $199.17
2 ABS Alcohol Beverage Services Administ… F $35,926 $0
3 ABS Alcohol Beverage Services Administ… M $167,345 $0
4 ABS Alcohol Beverage Services Wholesal… F $90,848 $0
5 ABS Alcohol Beverage Services Administ… F $78,902 $205.16
6 ABS Alcohol Beverage Services Marketing F $109,761 $0
# ℹ 2 more variables: longevity_pay <chr>, grade <chr>
Next, I need to filter the Department to only include data from Fire & Rescue. We also need to filter division to only include stations and not other positions within the department
# A tibble: 6 × 8
department department_name division gender base_salary overtime_pay
<chr> <chr> <chr> <chr> <chr> <chr>
1 FRS Fire and Rescue Services Station 35 F 117435 38350.14
2 FRS Fire and Rescue Services Station 11 M 94668 6787.14
3 FRS Fire and Rescue Services Station 17 M 86065 5565.72
4 FRS Fire and Rescue Services Station 6 M 86065 38210.76
5 FRS Fire and Rescue Services Station 8 M 86065 0
6 FRS Fire and Rescue Services Station 4 M 117435 1851.9
# ℹ 2 more variables: longevity_pay <chr>, grade <chr>
Finally, all the salaries are currently categorcial variables. I need to coerce them into numeric variables to create my visualization and perform a multiple linear regression.
Now that the data has been cleaned, I can begin exploration.
Exploring the Relationship Between Gender and Salary
First, I will examine how many males and females are in the data, as a disproportionate ratio of males to females may skew the data falsely.
fire_salaries |>group_by(gender) |>count(gender)
# A tibble: 2 × 2
# Groups: gender [2]
gender n
<chr> <int>
1 F 48
2 M 943
Okay, so there are only 48 females in this dataset, compared to 943 males. That is a huge differences in proportions of males and females, which may influence any statistical findings.
# First I will examine the differences in salary distribution between males and femalesgender_plot <- fire_salaries |>ggplot(aes(x = gender, y = base_salary, fill = gender)) +geom_boxplot() +theme_grey(base_family ="serif") +scale_fill_manual(values =c("lightpink", "lightblue")) +labs(title ="Salary Distribution by Gender", x ="Gender", y ="Base Salary (USD)", caption ="Data provided by Kathy Luh")gender_plot
This boxplot shows that males in Fire and Rescue have a higher median salary than females. However, this could be due to an unrepresentative sample, because there are so few female observations, but that is not necessarily the case. In almost 1000 observations, only 48 of them are female, likely because mostly males enter Fire & Rescue Services. The Anova test will determine if this difference in Median base salary is significant
Now I will perform an ANOVA test to determine if there is a significant association between salary and gender.
model <-aov(base_salary ~ gender, data = fire_salaries)summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 4.671e+08 467076221 1.193 0.275
Residuals 989 3.874e+11 391662720
The p-value is 0.275, which fortunately shows that the difference in median salaries between males and females is not significant. The difference in proportions observed earlier is likely due to a siple population difference (more males enter fire and rescue department than females). Next I will explore which factors do have an influence on salary using a multiple linear regression.
Statistical Analysis
I want to determine which variables have the greatest impact on an employees salary. The variables I can study include division (which fire station the employee works at), gender, overtime pay, longevity pay, and grade (employee’s grade. To do this, I will perform a multiple linear regression with backwards elimination. Before I do that, however, I need to alter some of the variables.
First I need to examine the meaning behind employee grade values.
# Viewing all the levels of the grade variableunique(fire_salaries$grade)
[1] "B2" "F4" "F3" "B1" "F2" "F1"
I needed to conduct some background research as to what these employee grades meant for Fire & Rescue specifically. The Montgomery County Fire and Rescue Service explains that F1 stands for Fire Fighter Rescuer 1, an entry level Fire and Rescue position. It is the lowest rank in the department and has the lowest average salary. From F1, there is then F2 (Fire Fighter II), F3 (Fire Fighter III), and F4 (Fire Fighter Rescuer IV). Above each of those ranks is B1 (Fire and Rescue Lieutenant) and B2 (Fire and Rescue Captain). So, certain grades like B2 are likely to have a much higher salary (Salary & Benefits)
Now that I understand each of those values in grade, I alter the variable.
In order to perform a multiple linear regression, I need these categorical variables (gender and grade) to become numerical variables. To do this, I will make gender binary and assign a ranking system to grade. I will label them 1-6, F1 (the lowest rank) will be assigned the number 1 and B2 (the highest rank) will be assigned number 6.
fire1 <- fire_salaries |>mutate(gender_numeric =ifelse(gender =="F", "0", "1")) |>mutate(rank = dplyr::recode(grade, # Had to specify the package because of an error "F1"="1","F2"="2","F3"="3","F4"="4","B1"="5","B2"="6"))head(fire1)
# A tibble: 6 × 10
department department_name division gender base_salary overtime_pay
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 FRS Fire and Rescue Services Station 35 F 117435 38350.
2 FRS Fire and Rescue Services Station 11 M 94668 6787.
3 FRS Fire and Rescue Services Station 17 M 86065 5566.
4 FRS Fire and Rescue Services Station 6 M 86065 38211.
5 FRS Fire and Rescue Services Station 8 M 86065 0
6 FRS Fire and Rescue Services Station 4 M 117435 1852.
# ℹ 4 more variables: longevity_pay <dbl>, grade <chr>, gender_numeric <chr>,
# rank <chr>
Now I need to convert these two new character variables into numeric ones for the multiple linear regression
Now that the variables are ready, I can perform and interpret a multiple linear regression.
Multiple Linear Regression
I will use backwards elimination, beginning with a full model and eliminating variables based on adjusted r-squared values and p-values. I will continue eliminating variables until the adjusted r-squared and p-values are maximized.
# Fit multiple linear regression: multiple_model <-lm(base_salary ~ gender_numeric + rank + overtime_pay +longevity_pay + division, data = fire1)# View the model summarysummary(multiple_model)
Our first model (the full model) has a adjusted r-squared of 0.8486, which means that about 84.86% of the variation in the data can be explained by this model. To improve it, I will remove overtime-pay, which has an extremely high p-value of 0.985, which indicates that it is not significant to the model.
After removing overtime_pay, the adjusted r-squared improved by 0.0001, which is not very much, meaning overtime_pay was slightly hindering our model. Removing it improved the model and simplified it. I will now remove gender from the model, as its p-value is not significant at all and doing so could improve the model further.
multiple_model2 <-lm(base_salary ~ rank +longevity_pay + division, data = fire1)summary(multiple_model2)
The model improved by another 0.0002%, which is good because, even if the r-squared has not improved much, our model has become simpler and I am now aware of which variables are extremely significant to determining salary. These variables include rank, station, and longevity pay, and they account for 86.13% of the variation in the data. I attempted removing the stations that were not significant to the model, but that actually decreased the adjusted r-squared. Thus, this model is the best given current data.
Model Equation: base_salary = 26500.7051 + 14222.8673 (rank) + 0.7003 (longevity pay) + 5741.7213 (station 10) + (7733.3997) Station 11 + Station 12 (2922.5481) + Station 13 (8250.7226) + Station 14(11246.1853) + Station 15 (1265.3915) + Station 16 (3932.1141) + Station 17 (4311.8904) +Station 18 (3635.2417) + Station 19 (4709.4807) + Station 2 (1199.7598) + Station 20 (5562.3446) + Station 21 (3227.9557) + Station 22 (2331.0992)+ Station 23 (802.5981) + Station 24 (3431.7568) + Station 25 (-38.3556) + Station 26 (3157.7971) + Station 28 (8241.6440) + Station 29 (2283.429) + Station 3 (3125.4239) + Station 30 (5670.4357) + Station 31 (2426.0320) + Station 32 (-1276.2324) + Station 33 (5466.1450) + Station 34 (6836.5189) + Station 35 (7429.1947) + Station 4 (3502.7468) + Station 40 (2586.9364) + Station 5 (-71.5390) + Station 6 (6102.1558) + Station 7 (4734.7976) +Station 8 (3395.2269) + Station 9 (5266.7597)
Adjusted R-Squared: 0.8615
This model suggest that 86% of the variation in base salary among Montgomery County Fire & Rescue department employees can be explained by rank (also known as the grade of the employee), longevity pay, and station.
Tableau Visualizations
#exporting the altered csv for tableau visualizationswrite.csv(fire1, "fire_salaries_data.csv", row.names =FALSE)
NOTE: Please enter full screen when examining the Tableau Visualization.
This visualization allows you to view different Fire Stations throughout Montgomery County and the salary differences among each one. It shows the station average salary, average overtime pay, average longevity pay, and gender differences in average salary. It also shows the ratio of males to females at the station (because there are so few women at the station, that may affect average salary differences), and it shows the distribution of employees at each of the ranks.
The treemap in the top right sorts the stations based on average salary. Station 9 had the highest average salary, around 92,000, while station 32 had the lowest average salary of 67,000. These differences could result from the distribution of employees as well as the gender proportions at the station.
Conclusion
Thus, to answer my research questions: There is no significant association between base salary and gender in Montgomery County Fire & Rescue. Base salary is impacted by grade, longevity pay, and the station the employee works. I was very surprised at the how large the difference in proportions there was for males and females in the department, as shown in the visualization. I was also surprised at the large range of salaries among each of the stations, seeing as pay in fire and rescue is not commissions based (how busy the station is). I would’ve loved to explore what other factors play a role in base salary, but thankfully I am aware of the majority of them, thanks to my multiple linear regression.