Data 110 - Final Project

Author

Kalina Peterson

Salary Differences in Montgomery County Fire & Rescue

^{Image retrieved from https://www.fireapparatusmagazine.com/the-fire-station/the-station-articles/montgomery-county-md-replaces-two-bay-firehouse-with-four-bay-one-story-fire-station/}

Introduction

In this project I will be exploring salary differences in the Montgomery County Fire & Rescue department. I will examine pay differences between gender, different stations, and different employee grades. To do this, I will use a dataset called Employee Salaries, which was gathered by the Office of Human Resources (data owned by Kathy Luh). The data has 9,958 observations on 8 variables. I will be using Department, Division, Gender, base salary, and grade.

Variable	Description
Gender	The gender of the employee
Department	Department code for the department to which the employee is assigned.
Division	Name of the division within the County department to which the employee is assigned.
Base Salary	Annual base salary for the employee at the end of the calendar year. NOTE: This is the projected salary based on employee status and hours the employee is scheduled to work.
Grade	Assigned grade for the employee at the end of the calendar year. (For Fire & Rescue, it is the employees “rank” at the station. It ranges from F1 to B2

With these variables I will explore the following questions:

Is there a difference in base salary between men and women in Montgomery County Fire & Rescue?
Which variables contribute the most to a higher salary for the Montgomery County Fire & Rescue Department?

NOTE: No information was provided on how the data was collected, there is no ReadMe or similar file with that information available.

I chose this topic because it is important to understand what factors influence the salary of our first responders. If there is any discrimination occurring in the workplace then people should be aware of it. Additionally, my brother is currently serving as an EMT (Emergency Medical Technician) at Montgomery County Fire & Rescue Stations. As he has talked about his experiences, I have been interested in exploring data about the stations further.

Loading the Libraries and Data

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(readr)

setwd("C:/Users/kpeter81/OneDrive - montgomerycollege.edu/Datasets")
salaries <- read_csv("Employee_Salaries_-_2020_20260505.csv")

Rows: 9958 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Department, Department Name, Division, Gender, Base Salary, 2020 Ov...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(salaries)

# A tibble: 6 × 8
  Department `Department Name` Division Gender `Base Salary` `2020 Overtime Pay`
  <chr>      <chr>             <chr>    <chr>  <chr>         <chr>              
1 ABS        Alcohol Beverage… Wholesa… F      $78,902       $199.17            
2 ABS        Alcohol Beverage… Adminis… F      $35,926       $0                 
3 ABS        Alcohol Beverage… Adminis… M      $167,345      $0                 
4 ABS        Alcohol Beverage… Wholesa… F      $90,848       $0                 
5 ABS        Alcohol Beverage… Adminis… F      $78,902       $205.16            
6 ABS        Alcohol Beverage… Marketi… F      $109,761      $0                 
# ℹ 2 more variables: `2020 Longevity Pay` <chr>, Grade <chr>

Data Cleaning & Exploration

I need to conduct some basic cleaning before I can do my statistical analysis.

First, I will perform some simple cleaning on the variable names, changing them to all lowercase and replacing spaces with underscores.

colnames(salaries) <- tolower(colnames(salaries))
colnames(salaries) <- gsub("2020 ", "", colnames(salaries), fixed = TRUE) # removing 2020 from overtime pay and longevity pay
colnames(salaries) <- gsub(" ", "_", colnames(salaries)) 
head(salaries)

# A tibble: 6 × 8
  department department_name           division  gender base_salary overtime_pay
  <chr>      <chr>                     <chr>     <chr>  <chr>       <chr>       
1 ABS        Alcohol Beverage Services Wholesal… F      $78,902     $199.17     
2 ABS        Alcohol Beverage Services Administ… F      $35,926     $0          
3 ABS        Alcohol Beverage Services Administ… M      $167,345    $0          
4 ABS        Alcohol Beverage Services Wholesal… F      $90,848     $0          
5 ABS        Alcohol Beverage Services Administ… F      $78,902     $205.16     
6 ABS        Alcohol Beverage Services Marketing F      $109,761    $0          
# ℹ 2 more variables: longevity_pay <chr>, grade <chr>

Next, I need to filter the Department to only include data from Fire & Rescue. We also need to filter division to only include stations and not other positions within the department

fire_salaries <- salaries |>
  filter(department_name == "Fire and Rescue Services") |>
  filter(str_detect(division, regex("Station"))) |>
  mutate(across(everything(), ~ str_replace_all(., fixed("$"), ""))) |>
  mutate(across(where(is.character), ~ gsub(",", "", .))) 
head(fire_salaries)

# A tibble: 6 × 8
  department department_name          division   gender base_salary overtime_pay
  <chr>      <chr>                    <chr>      <chr>  <chr>       <chr>       
1 FRS        Fire and Rescue Services Station 35 F      117435      38350.14    
2 FRS        Fire and Rescue Services Station 11 M      94668       6787.14     
3 FRS        Fire and Rescue Services Station 17 M      86065       5565.72     
4 FRS        Fire and Rescue Services Station 6  M      86065       38210.76    
5 FRS        Fire and Rescue Services Station 8  M      86065       0           
6 FRS        Fire and Rescue Services Station 4  M      117435      1851.9      
# ℹ 2 more variables: longevity_pay <chr>, grade <chr>

Finally, all the salaries are currently categorcial variables. I need to coerce them into numeric variables to create my visualization and perform a multiple linear regression.

fire_salaries$base_salary <- as.numeric(fire_salaries$base_salary)
fire_salaries$overtime_pay <- as.numeric(fire_salaries$overtime_pay)
fire_salaries$longevity_pay <- as.numeric(fire_salaries$longevity_pay)
str(fire_salaries)

tibble [991 × 8] (S3: tbl_df/tbl/data.frame)
 $ department     : chr [1:991] "FRS" "FRS" "FRS" "FRS" ...
 $ department_name: chr [1:991] "Fire and Rescue Services" "Fire and Rescue Services" "Fire and Rescue Services" "Fire and Rescue Services" ...
 $ division       : chr [1:991] "Station 35" "Station 11" "Station 17" "Station 6" ...
 $ gender         : chr [1:991] "F" "M" "M" "M" ...
 $ base_salary    : num [1:991] 117435 94668 86065 86065 86065 ...
 $ overtime_pay   : num [1:991] 38350 6787 5566 38211 0 ...
 $ longevity_pay  : num [1:991] 0 0 2665 0 1415 ...
 $ grade          : chr [1:991] "B2" "F4" "F3" "F3" ...

Now that the data has been cleaned, I can begin exploration.

Exploring the Relationship Between Gender and Salary

First, I will examine how many males and females are in the data, as a disproportionate ratio of males to females may skew the data falsely.

fire_salaries |>
  group_by(gender) |>
  count(gender)

# A tibble: 2 × 2
# Groups:   gender [2]
  gender     n
  <chr>  <int>
1 F         48
2 M        943

Okay, so there are only 48 females in this dataset, compared to 943 males. That is a huge differences in proportions of males and females, which may influence any statistical findings.

# First I will examine the differences in salary distribution between males and females
gender_plot <- fire_salaries |>
  ggplot(aes(x = gender, y = base_salary, fill = gender)) +
  geom_boxplot() +
  theme_grey(base_family = "serif") +
  scale_fill_manual(values = c("lightpink", "lightblue")) +
  labs(title = "Salary Distribution by Gender", x = "Gender", y = "Base Salary (USD)", caption = "Data provided by Kathy Luh")
gender_plot

This boxplot shows that males in Fire and Rescue have a higher median salary than females. However, this could be due to an unrepresentative sample, because there are so few female observations, but that is not necessarily the case. In almost 1000 observations, only 48 of them are female, likely because mostly males enter Fire & Rescue Services. The Anova test will determine if this difference in Median base salary is significant

Now I will perform an ANOVA test to determine if there is a significant association between salary and gender.

model <- aov(base_salary ~ gender, data = fire_salaries)
summary(model)

             Df    Sum Sq   Mean Sq F value Pr(>F)
gender        1 4.671e+08 467076221   1.193  0.275
Residuals   989 3.874e+11 391662720

The p-value is 0.275, which fortunately shows that the difference in median salaries between males and females is not significant. The difference in proportions observed earlier is likely due to a siple population difference (more males enter fire and rescue department than females). Next I will explore which factors do have an influence on salary using a multiple linear regression.

Statistical Analysis

I want to determine which variables have the greatest impact on an employees salary. The variables I can study include division (which fire station the employee works at), gender, overtime pay, longevity pay, and grade (employee’s grade. To do this, I will perform a multiple linear regression with backwards elimination. Before I do that, however, I need to alter some of the variables.

First I need to examine the meaning behind employee grade values.

# Viewing all the levels of the grade variable
unique(fire_salaries$grade)

[1] "B2" "F4" "F3" "B1" "F2" "F1"

I needed to conduct some background research as to what these employee grades meant for Fire & Rescue specifically. The Montgomery County Fire and Rescue Service explains that F1 stands for Fire Fighter Rescuer 1, an entry level Fire and Rescue position. It is the lowest rank in the department and has the lowest average salary. From F1, there is then F2 (Fire Fighter II), F3 (Fire Fighter III), and F4 (Fire Fighter Rescuer IV). Above each of those ranks is B1 (Fire and Rescue Lieutenant) and B2 (Fire and Rescue Captain). So, certain grades like B2 are likely to have a much higher salary (Salary & Benefits)

Now that I understand each of those values in grade, I alter the variable.

In order to perform a multiple linear regression, I need these categorical variables (gender and grade) to become numerical variables. To do this, I will make gender binary and assign a ranking system to grade. I will label them 1-6, F1 (the lowest rank) will be assigned the number 1 and B2 (the highest rank) will be assigned number 6.

fire1 <- fire_salaries |>
  mutate(gender_numeric = ifelse(gender == "F", "0", "1")) |>
  mutate(rank = dplyr::recode(grade, # Had to specify the package because of an error 
                              "F1" = "1",
                              "F2" = "2",
                              "F3" = "3",
                              "F4" = "4",
                              "B1" = "5",
                              "B2" = "6"))
head(fire1)

# A tibble: 6 × 10
  department department_name          division   gender base_salary overtime_pay
  <chr>      <chr>                    <chr>      <chr>        <dbl>        <dbl>
1 FRS        Fire and Rescue Services Station 35 F           117435       38350.
2 FRS        Fire and Rescue Services Station 11 M            94668        6787.
3 FRS        Fire and Rescue Services Station 17 M            86065        5566.
4 FRS        Fire and Rescue Services Station 6  M            86065       38211.
5 FRS        Fire and Rescue Services Station 8  M            86065           0 
6 FRS        Fire and Rescue Services Station 4  M           117435        1852.
# ℹ 4 more variables: longevity_pay <dbl>, grade <chr>, gender_numeric <chr>,
#   rank <chr>

Now I need to convert these two new character variables into numeric ones for the multiple linear regression

fire1$gender_numeric <- as.numeric(fire1$gender_numeric)
fire1$rank <- as.numeric(fire1$rank)
str(fire1)

tibble [991 × 10] (S3: tbl_df/tbl/data.frame)
 $ department     : chr [1:991] "FRS" "FRS" "FRS" "FRS" ...
 $ department_name: chr [1:991] "Fire and Rescue Services" "Fire and Rescue Services" "Fire and Rescue Services" "Fire and Rescue Services" ...
 $ division       : chr [1:991] "Station 35" "Station 11" "Station 17" "Station 6" ...
 $ gender         : chr [1:991] "F" "M" "M" "M" ...
 $ base_salary    : num [1:991] 117435 94668 86065 86065 86065 ...
 $ overtime_pay   : num [1:991] 38350 6787 5566 38211 0 ...
 $ longevity_pay  : num [1:991] 0 0 2665 0 1415 ...
 $ grade          : chr [1:991] "B2" "F4" "F3" "F3" ...
 $ gender_numeric : num [1:991] 0 1 1 1 1 1 1 1 1 1 ...
 $ rank           : num [1:991] 6 4 3 3 3 6 6 4 6 4 ...

Now that the variables are ready, I can perform and interpret a multiple linear regression.

Multiple Linear Regression

I will use backwards elimination, beginning with a full model and eliminating variables based on adjusted r-squared values and p-values. I will continue eliminating variables until the adjusted r-squared and p-values are maximized.

# Fit multiple linear regression: 
multiple_model <- lm(base_salary ~ gender_numeric + rank + overtime_pay +longevity_pay + division, data = fire1)

# View the model summary
summary(multiple_model)


Call:
lm(formula = base_salary ~ gender_numeric + rank + overtime_pay + 
    longevity_pay + division, data = fire1)

Residuals:
     Min       1Q   Median       3Q      Max 
-19632.5  -5458.8   -423.3   5253.5  27031.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.661e+04  1.856e+03  14.336  < 2e-16 ***
gender_numeric     -1.070e+02  1.117e+03  -0.096 0.923661    
rank                1.422e+04  2.543e+02  55.895  < 2e-16 ***
overtime_pay        1.244e-03  1.444e-02   0.086 0.931377    
longevity_pay       7.006e-01  1.217e-01   5.758 1.15e-08 ***
divisionStation 10  5.734e+03  1.942e+03   2.952 0.003235 ** 
divisionStation 11  7.741e+03  2.117e+03   3.657 0.000269 ***
divisionStation 12  2.918e+03  1.892e+03   1.542 0.123339    
divisionStation 13  8.245e+03  2.039e+03   4.043 5.70e-05 ***
divisionStation 14  1.122e+04  2.062e+03   5.441 6.73e-08 ***
divisionStation 15  1.266e+03  1.797e+03   0.705 0.481230    
divisionStation 16  3.943e+03  1.836e+03   2.147 0.032025 *  
divisionStation 17  4.304e+03  1.960e+03   2.196 0.028359 *  
divisionStation 18  3.634e+03  1.970e+03   1.845 0.065390 .  
divisionStation 19  4.706e+03  1.998e+03   2.355 0.018736 *  
divisionStation 2   1.204e+03  2.061e+03   0.584 0.559189    
divisionStation 20  5.554e+03  2.356e+03   2.358 0.018593 *  
divisionStation 21  3.222e+03  2.060e+03   1.564 0.118213    
divisionStation 22  2.325e+03  2.037e+03   1.142 0.253881    
divisionStation 23  8.012e+02  1.751e+03   0.457 0.647420    
divisionStation 24  3.441e+03  1.855e+03   1.855 0.063934 .  
divisionStation 25 -4.795e+01  1.759e+03  -0.027 0.978251    
divisionStation 26  3.164e+03  2.350e+03   1.346 0.178533    
divisionStation 28  8.234e+03  2.087e+03   3.945 8.57e-05 ***
divisionStation 29  2.280e+03  1.859e+03   1.226 0.220448    
divisionStation 3   3.117e+03  1.873e+03   1.665 0.096329 .  
divisionStation 30  5.674e+03  2.038e+03   2.784 0.005480 ** 
divisionStation 31  2.423e+03  1.858e+03   1.304 0.192544    
divisionStation 32 -1.283e+03  2.016e+03  -0.636 0.524698    
divisionStation 33  5.471e+03  2.037e+03   2.686 0.007352 ** 
divisionStation 34  6.831e+03  1.839e+03   3.715 0.000215 ***
divisionStation 35  7.428e+03  1.824e+03   4.073 5.03e-05 ***
divisionStation 4   3.503e+03  2.037e+03   1.719 0.085886 .  
divisionStation 40  2.582e+03  2.038e+03   1.267 0.205560    
divisionStation 5  -6.878e+01  2.949e+03  -0.023 0.981397    
divisionStation 6   6.100e+03  1.926e+03   3.167 0.001590 ** 
divisionStation 7   4.717e+03  2.315e+03   2.037 0.041877 *  
divisionStation 8   3.391e+03  1.689e+03   2.008 0.044970 *  
divisionStation 9   5.268e+03  2.423e+03   2.174 0.029939 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7374 on 952 degrees of freedom
Multiple R-squared:  0.8665,    Adjusted R-squared:  0.8612 
F-statistic: 162.6 on 38 and 952 DF,  p-value: < 2.2e-16

Our first model (the full model) has a adjusted r-squared of 0.8486, which means that about 84.86% of the variation in the data can be explained by this model. To improve it, I will remove overtime-pay, which has an extremely high p-value of 0.985, which indicates that it is not significant to the model.

multiple_model1 <- lm(base_salary ~ gender_numeric + rank +longevity_pay +division, data = fire1)
summary(multiple_model1)


Call:
lm(formula = base_salary ~ gender_numeric + rank + longevity_pay + 
    division, data = fire1)

Residuals:
     Min       1Q   Median       3Q      Max 
-19638.9  -5458.4   -422.2   5280.0  27023.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        26598.2360  1850.0582  14.377  < 2e-16 ***
gender_numeric      -103.2317  1115.2263  -0.093 0.926268    
rank               14223.6033   235.9310  60.287  < 2e-16 ***
longevity_pay          0.7003     0.1216   5.761 1.13e-08 ***
divisionStation 10  5741.2477  1939.4162   2.960 0.003150 ** 
divisionStation 11  7731.7835  2112.9349   3.659 0.000267 ***
divisionStation 12  2919.5402  1890.7275   1.544 0.122887    
divisionStation 13  8245.3535  2038.3868   4.045 5.65e-05 ***
divisionStation 14 11227.7815  2060.1886   5.450 6.42e-08 ***
divisionStation 15  1268.4053  1796.3597   0.706 0.480300    
divisionStation 16  3935.2057  1833.0896   2.147 0.032064 *  
divisionStation 17  4307.6178  1958.8582   2.199 0.028114 *  
divisionStation 18  3638.1296  1968.4431   1.848 0.064879 .  
divisionStation 19  4708.4513  1997.1187   2.358 0.018594 *  
divisionStation 2   1203.1896  2059.5907   0.584 0.559232    
divisionStation 20  5565.2200  2351.2798   2.367 0.018137 *  
divisionStation 21  3222.3572  2059.2530   1.565 0.117957    
divisionStation 22  2325.8425  2035.4566   1.143 0.253465    
divisionStation 23   799.1213  1750.1571   0.457 0.648063    
divisionStation 24  3434.9918  1852.9347   1.854 0.064075 .  
divisionStation 25   -41.8743  1756.1986  -0.024 0.980982    
divisionStation 26  3160.8249  2348.5535   1.346 0.178668    
divisionStation 28  8235.6127  2086.1574   3.948 8.47e-05 ***
divisionStation 29  2286.4388  1856.7625   1.231 0.218473    
divisionStation 3   3116.3736  1871.5783   1.665 0.096221 .  
divisionStation 30  5669.2918  2036.4847   2.784 0.005478 ** 
divisionStation 31  2429.0481  1856.2969   1.309 0.191004    
divisionStation 32 -1285.0978  2014.2500  -0.638 0.523625    
divisionStation 33  5469.6080  2035.4926   2.687 0.007333 ** 
divisionStation 34  6834.0308  1837.5737   3.719 0.000212 ***
divisionStation 35  7424.1336  1822.3468   4.074 5.01e-05 ***
divisionStation 4   3501.6314  2036.2316   1.720 0.085818 .  
divisionStation 40  2581.4965  2037.2556   1.267 0.205413    
divisionStation 5    -68.4641  2947.2899  -0.023 0.981472    
divisionStation 6   6098.3061  1925.1388   3.168 0.001585 ** 
divisionStation 7   4737.7372  2301.4702   2.059 0.039807 *  
divisionStation 8   3394.7985  1687.9265   2.011 0.044583 *  
divisionStation 9   5261.9911  2420.6872   2.174 0.029969 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7370 on 953 degrees of freedom
Multiple R-squared:  0.8665,    Adjusted R-squared:  0.8613 
F-statistic: 167.2 on 37 and 953 DF,  p-value: < 2.2e-16

After removing overtime_pay, the adjusted r-squared improved by 0.0001, which is not very much, meaning overtime_pay was slightly hindering our model. Removing it improved the model and simplified it. I will now remove gender from the model, as its p-value is not significant at all and doing so could improve the model further.

multiple_model2 <- lm(base_salary ~  rank +longevity_pay + division, data = fire1)
summary(multiple_model2)


Call:
lm(formula = base_salary ~ rank + longevity_pay + division, data = fire1)

Residuals:
     Min       1Q   Median       3Q      Max 
-19647.5  -5462.2   -422.7   5280.2  27019.6 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        26500.7051  1519.9193  17.436  < 2e-16 ***
rank               14222.8673   235.6744  60.350  < 2e-16 ***
longevity_pay          0.7003     0.1215   5.764 1.11e-08 ***
divisionStation 10  5741.7213  1938.4014   2.962 0.003131 ** 
divisionStation 11  7733.3997  2111.7646   3.662 0.000264 ***
divisionStation 12  2922.5481  1889.4656   1.547 0.122253    
divisionStation 13  8250.7226  2036.5023   4.051 5.50e-05 ***
divisionStation 14 11246.1853  2049.5068   5.487 5.23e-08 ***
divisionStation 15  1265.3915  1795.1311   0.705 0.481043    
divisionStation 16  3932.1141  1831.8328   2.147 0.032081 *  
divisionStation 17  4311.8904  1957.2965   2.203 0.027834 *  
divisionStation 18  3635.2417  1967.1729   1.848 0.064919 .  
divisionStation 19  4709.4807  1996.0497   2.359 0.018505 *  
divisionStation 2   1199.7598  2058.1870   0.583 0.560084    
divisionStation 20  5562.3446  2349.8526   2.367 0.018127 *  
divisionStation 21  3227.9557  2057.2948   1.569 0.116973    
divisionStation 22  2331.0992  2033.6068   1.146 0.251963    
divisionStation 23   802.5981  1748.8446   0.459 0.646389    
divisionStation 24  3431.7568  1851.6422   1.853 0.064140 .  
divisionStation 25   -38.3556  1754.8745  -0.022 0.982567    
divisionStation 26  3157.7971  2347.1052   1.345 0.178816    
divisionStation 28  8241.6440  2084.0559   3.955 8.23e-05 ***
divisionStation 29  2283.4294  1855.5130   1.231 0.218769    
divisionStation 3   3125.4239  1868.0513   1.673 0.094637 .  
divisionStation 30  5670.4357  2035.3887   2.786 0.005443 ** 
divisionStation 31  2426.0320  1855.0463   1.308 0.191256    
divisionStation 32 -1276.2324  2010.9260  -0.635 0.525809    
divisionStation 33  5466.1450  2034.0910   2.687 0.007329 ** 
divisionStation 34  6836.5189  1836.4222   3.723 0.000209 ***
divisionStation 35  7429.1947  1820.5797   4.081 4.87e-05 ***
divisionStation 4   3502.7468  2035.1377   1.721 0.085551 .  
divisionStation 40  2586.9364  2035.3494   1.271 0.204037    
divisionStation 5    -71.5390  2945.5709  -0.024 0.980629    
divisionStation 6   6102.1558  1923.6892   3.172 0.001562 ** 
divisionStation 7   4734.7976  2300.0550   2.059 0.039808 *  
divisionStation 8   3395.2269  1687.0429   2.013 0.044445 *  
divisionStation 9   5266.7597  2418.8811   2.177 0.029699 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7366 on 954 degrees of freedom
Multiple R-squared:  0.8665,    Adjusted R-squared:  0.8615 
F-statistic:   172 on 36 and 954 DF,  p-value: < 2.2e-16

The model improved by another 0.0002%, which is good because, even if the r-squared has not improved much, our model has become simpler and I am now aware of which variables are extremely significant to determining salary. These variables include rank, station, and longevity pay, and they account for 86.13% of the variation in the data. I attempted removing the stations that were not significant to the model, but that actually decreased the adjusted r-squared. Thus, this model is the best given current data.

Model Equation: base_salary = 26500.7051 + 14222.8673 (rank) + 0.7003 (longevity pay) + 5741.7213 (station 10) + (7733.3997) Station 11 + Station 12 (2922.5481) + Station 13 (8250.7226) + Station 14(11246.1853) + Station 15 (1265.3915) + Station 16 (3932.1141) + Station 17 (4311.8904) +Station 18 (3635.2417) + Station 19 (4709.4807) + Station 2 (1199.7598) + Station 20 (5562.3446) + Station 21 (3227.9557) + Station 22 (2331.0992)+ Station 23 (802.5981) + Station 24 (3431.7568) + Station 25 (-38.3556) + Station 26 (3157.7971) + Station 28 (8241.6440) + Station 29 (2283.429) + Station 3 (3125.4239) + Station 30 (5670.4357) + Station 31 (2426.0320) + Station 32 (-1276.2324) + Station 33 (5466.1450) + Station 34 (6836.5189) + Station 35 (7429.1947) + Station 4 (3502.7468) + Station 40 (2586.9364) + Station 5 (-71.5390) + Station 6 (6102.1558) + Station 7 (4734.7976) +Station 8 (3395.2269) + Station 9 (5266.7597)

Adjusted R-Squared: 0.8615

This model suggest that 86% of the variation in base salary among Montgomery County Fire & Rescue department employees can be explained by rank (also known as the grade of the employee), longevity pay, and station.

Tableau Visualizations

#exporting the altered csv for tableau visualizations
write.csv(fire1, "fire_salaries_data.csv", row.names = FALSE)

Link to interactive visualizations: https://public.tableau.com/shared/6M32D3N36?:display_count=n&:origin=viz_share_link

NOTE: Please enter full screen when examining the Tableau Visualization.

This visualization allows you to view different Fire Stations throughout Montgomery County and the salary differences among each one. It shows the station average salary, average overtime pay, average longevity pay, and gender differences in average salary. It also shows the ratio of males to females at the station (because there are so few women at the station, that may affect average salary differences), and it shows the distribution of employees at each of the ranks.

The treemap in the top right sorts the stations based on average salary. Station 9 had the highest average salary, around 92,000, while station 32 had the lowest average salary of 67,000. These differences could result from the distribution of employees as well as the gender proportions at the station.

Conclusion

Thus, to answer my research questions: There is no significant association between base salary and gender in Montgomery County Fire & Rescue. Base salary is impacted by grade, longevity pay, and the station the employee works. I was very surprised at the how large the difference in proportions there was for males and females in the department, as shown in the visualization. I was also surprised at the large range of salaries among each of the stations, seeing as pay in fire and rescue is not commissions based (how busy the station is). I would’ve loved to explore what other factors play a role in base salary, but thankfully I am aware of the majority of them, thanks to my multiple linear regression.

Citations

Salary & Benefits. (n.d.). https://www.montgomerycountymd.gov/montgomery-county-fire-rescue-service/join-us/career-recruiting/salary-benefits
Data gathered by Office of Human Resources, owned by Kathy Luh, retrieved from Montgomery County Open Data.