::include_graphics("C:/Users/Lenovo/Downloads/PortpholioProjects/imagesMontgomery.PNG") knitr
Montgomery County Salary Analysis Project
Montgomery County Salary Analysis for 2021
This project looks at salary data for Montgomery County employees in 2021, obtained from the dataMontgomery portal. The dataset includes information about employee compensation, such as base salary, overtime pay, and longevity pay. It contains 8 variables that provide details about different aspects of employee compensation and job roles. These variables are:
1.Department: Department code for the department to which the employee is assigned. 2.Department Name:Name of the County Department to which the employee is assigned. 3.Division: A smaller, specific part within the department. 4.Gender: Whether the employee is male or female. 5.Base Salary: The employee’s regular yearly pay. 6.Overtime Pay: Extra pay for working more hours than usual. 7.Longevity Pay: Bonus pay for employees who have worked with the county for a long time. 8.Grade: A number showing the employee’s job rank or level.
For this analysis, I want to explore three main questions:
1. Is there a difference in salary between men and women doing the same job? If so, how big is the difference in percentage terms? 2. Which department and division offer the highest longevity pay? This will show where long-term employees are rewarded the most. 3 Does the employee’s grade correlate with the amount of overtime they work? I want to see if higher-ranking employees work more overtime.
I will use charts and basic statistics to answer these questions and better understand how Montgomery County salaries are structured.
Loading the data set
# part of the tidyverse package that helps to read files like csv format
library(readr)
# Load the dataset
<- read_csv("C:/Users/Lenovo/Downloads/PortpholioProjects/Employee_Salaries_-_2021_20241016.csv") dataset
Rows: 9907 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Department, Department Name, Division, Gender, Grade
dbl (3): Base Salary, 2021 Overtime Pay, 2021 Longevity Pay
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# dplyr simplifies the process of transforming and analyzing data in a clear, readable way
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# to build complex plots layer by layer
library(ggplot2)
# View the structure of the dataset
str(dataset)
spc_tbl_ [9,907 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ Department : chr [1:9907] "ABS" "ABS" "ABS" "ABS" ...
$ Department Name : chr [1:9907] "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" "Alcohol Beverage Services" ...
$ Division : chr [1:9907] "Beer Loading" "Liquor and Wine Delivery Operations" "Beer Delivery Operations" "Beer Delivery Operations" ...
$ Gender : chr [1:9907] "M" "M" "M" "M" ...
$ Base Salary : num [1:9907] 87969 80086 80086 70814 76419 ...
$ 2021 Overtime Pay : num [1:9907] 32953 32656 31369 29838 21379 ...
$ 2021 Longevity Pay: num [1:9907] NA 1105 1591 NA NA ...
$ Grade : chr [1:9907] "20" "18" "18" "18" ...
- attr(*, "spec")=
.. cols(
.. Department = col_character(),
.. `Department Name` = col_character(),
.. Division = col_character(),
.. Gender = col_character(),
.. `Base Salary` = col_double(),
.. `2021 Overtime Pay` = col_double(),
.. `2021 Longevity Pay` = col_double(),
.. Grade = col_character()
.. )
- attr(*, "problems")=<externalptr>
# Check for missing values in each column
colSums(is.na(dataset))
Department Department Name Division Gender
0 0 0 0
Base Salary 2021 Overtime Pay 2021 Longevity Pay Grade
0 4312 7494 0
# Check the column names
colnames(dataset)
[1] "Department" "Department Name" "Division"
[4] "Gender" "Base Salary" "2021 Overtime Pay"
[7] "2021 Longevity Pay" "Grade"
# Replace spaces with underscores in column names
colnames(dataset) <- gsub(" ", "_", colnames(dataset))
# Check the updated column names
colnames(dataset)
[1] "Department" "Department_Name" "Division"
[4] "Gender" "Base_Salary" "2021_Overtime_Pay"
[7] "2021_Longevity_Pay" "Grade"
# Impute missing overtime pay and longevity pay with 0
<- dataset %>%
dataset_imputed mutate(`2021_Overtime_Pay` = ifelse(is.na(`2021_Overtime_Pay`), 0, `2021_Overtime_Pay`),
`2021_Longevity_Pay` = ifelse(is.na(`2021_Longevity_Pay`), 0, `2021_Longevity_Pay`))
# Check if the missing values are handled
colSums(is.na(dataset_imputed))
Department Department_Name Division Gender
0 0 0 0
Base_Salary 2021_Overtime_Pay 2021_Longevity_Pay Grade
0 0 0 0
# Summary statistics of Base Salary by Gender
<- dataset_imputed %>%
salary_by_gender group_by(Gender) %>%
summarise(Mean_Salary = mean(Base_Salary, na.rm = TRUE),
Median_Salary = median(Base_Salary, na.rm = TRUE),
Count = n())
# Print the summary
print(salary_by_gender)
# A tibble: 2 × 4
Gender Mean_Salary Median_Salary Count
<chr> <dbl> <dbl> <int>
1 F 79710. 76810. 4101
2 M 82491. 80086 5806
# Box plot for Base Salary by Gender
ggplot(dataset_imputed, aes(x = Gender, y = Base_Salary, fill = Gender)) +
geom_boxplot() +
labs(title = "Base Salary Distribution by Gender",
x = "Gender",
y = "Base Salary") +
scale_fill_manual(values = c("F" = "lightpink", "M" = "lightblue")) + # Custom colors for Gender
theme_minimal()
# T-test for salary difference by gender
<- t.test(Base_Salary ~ Gender, data = dataset_imputed)
t_test_result print(t_test_result)
Welch Two Sample t-test
data: Base_Salary by Gender
t = -4.4046, df = 8462.4, p-value = 1.073e-05
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
-4018.949 -1543.453
sample estimates:
mean in group F mean in group M
79709.61 82490.81
Interpretation:
-The p-value is very small, which means the difference in base salaries between males and females is statistically significant. -The negative confidence interval suggests that males, on average, earn between $1,543 and $4,018 more than females, with 95% confidence. This confirms that there is a significant salary gap between male and female employees. # Question 2: Which Department and Division Pays More for Longevity? Here, we’ll focus on analyzing 2021 Longevity Pay across different departments and divisions to see where employees are receiving the most longevity pay. Steps: 1.Group the Data by Department and Division: We will group the data by department and division to calculate the total and average longevity pay. 2.Calculate Summary Statistics: Calculate the mean (average) longevity pay for each department and division. 3.Visualize the Results: Create a bar chart to show which departments and divisions offer the highest longevity pay. ###Step 1: Group Data by Department and Division
# Calculate the average longevity pay by department and division
<- dataset_imputed %>%
longevity_by_department group_by(Department, Division) %>%
summarise(Mean_Longevity_Pay = mean(`2021_Longevity_Pay`, na.rm = TRUE),
Total_Longevity_Pay = sum(`2021_Longevity_Pay`, na.rm = TRUE),
Count = n()) %>%
ungroup()
`summarise()` has grouped output by 'Department'. You can override using the
`.groups` argument.
Step 2: Find the Top Departments and Divisions
# Sort the data to find departments/divisions with the highest longevity pay
<- longevity_by_department %>%
top_longevity_departments arrange(desc(Mean_Longevity_Pay))
Step 3: Visualize Longevity Pay by Department and Division
# Calculate the average longevity pay by division
<- dataset_imputed %>%
longevity_by_division group_by(Division, Department) %>%
summarise(Average_Longevity_Pay = mean(`2021_Longevity_Pay`, na.rm = TRUE)) %>%
arrange(desc(Average_Longevity_Pay))
`summarise()` has grouped output by 'Division'. You can override using the
`.groups` argument.
# Select the top 10 divisions
<- head(longevity_by_division, 10)
top10_divisions
# View the top 10 divisions (optional)
print(top10_divisions)
# A tibble: 10 × 3
# Groups: Division [10]
Division Department Average_Longevity_Pay
<chr> <chr> <dbl>
1 Duty Operations Chief's Office FRS 10698.
2 FRS 45 First Battalion - Administration FRS 9233.
3 FRS 45 Station 9 FRS 8430.
4 MSB Personnel Division Polygraph Section POL 7766.
5 FRS 45 Station 7 FRS 7714.
6 FRS 45 Fire and Explosive Investigations FRS 7457.
7 POL 47 FSB Special Operations Division Cani… POL 7390.
8 FRS 45 Station 23 FRS 7286.
9 POL 47 MSB Management and Budget Division V… POL 6713.
10 Special Operations FRS 6683.
# Top 10 divisions by longevity pay with meaningful labels and colors
ggplot(top10_divisions, aes(x = reorder(Division, Average_Longevity_Pay),
y = Average_Longevity_Pay / 1000, # Convert to thousands
fill = Department)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip for better readability
labs(title = "Top 10 Divisions by Average Longevity Pay",
x = "Division",
y = "Average Longevity Pay (Thousands of Dollars Per Year)",
fill = "Department",
caption = "Data Source: dataMontgomery") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange",
"pink", "yellow", "cyan", "brown", "gray")) + # Add more colors
theme_minimal() +
scale_y_continuous(labels = scales::dollar_format(scale = 1, suffix = "K")) # Format y-axis labels
Interpretation of the Chart: The chart shows the top 10 divisions by average longevity pay, categorized by department. The following abbreviations are used in the chart:
FRS: Fire and Rescue Services POL: Police Department MSB: Montgomery Safety Board FSB: Fire and Safety Board OHR: Office of Human Resources
Interpretation: The Duty Operations Chief’s Office and several divisions within the FRS (Fire and Rescue Services) department stand out for offering the highest average longevity pay. POL (Police Department) divisions, like the Special Operations Division Canine Section, also offer relatively high longevity pay compared to other divisions. # Question 3: Does Employee Grade Correlate with Overtime Worked? Calculate the Correlation Between Employee Grade and Overtime Pay
# Check the data type of Grade and 2021 Overtime Pay
str(dataset_imputed$Grade)
chr [1:9907] "20" "18" "18" "18" "17" "20" "17" "20" "12" "17" "12" "15" ...
str(dataset_imputed$`2021_Overtime_Pay`)
num [1:9907] 32953 32656 31369 29838 21379 ...
# Remove rows where Grade is NA
<- dataset_imputed %>%
dataset_imputed filter(!is.na(Grade))
# Convert Grade to numeric again (just to be sure)
$Grade <- as.numeric(dataset_imputed$Grade) dataset_imputed
Warning: NAs introduced by coercion
# Convert Grade to numeric
$Grade <- as.numeric(dataset_imputed$Grade) dataset_imputed
# Find non-numeric values in Grade
<- dataset_imputed$Grade[!grepl("^[0-9]+$", dataset_imputed$Grade)]
non_numeric_grades unique(non_numeric_grades)
[1] NA
ggplot(dataset_imputed, aes(x = Grade, y = `2021_Overtime_Pay`, color = Gender)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", col = "blue") +
labs(title = "Relationship Between Employee Grade and Overtime Pay",
x = "Employee Grade",
y = "2021 Overtime Pay",
caption = "Data source: dataMontgomery") +
scale_color_manual(values = c("red", "green", "blue")) +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 3391 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3391 rows containing missing values or values outside the scale range
(`geom_point()`).
Interpretation: From the graph, we can see that workers with the highest grades are generally not working a lot of overtime. There seems to be some correlation between medium-grade workers and overtime, but overall, the connection between grade and overtime is weak.
Dataset Cleaning and Visualization Insights
To prepare the data for analysis, the first step involved checking for missing values in key columns like overtime pay and longevity pay. Missing values in these columns were imputed with zeros since no overtime or longevity pay likely indicated that an employee did not receive these forms of compensation. Additionally, categorical variables such as grade were converted into numeric format for correlation analysis, ensuring all necessary columns were ready for visualization.
The visualizations aimed to explore salary distribution by gender, longevity pay across divisions, and the correlation between grade and overtime pay. A key insight revealed a statistically significant difference in base salaries between males and females, with males generally earning more. Another finding was that certain divisions, particularly in the Fire & Rescue Services and Police departments, offered the highest longevity pay, which was displayed using bar charts. There was also a slight negative correlation between employee grade and overtime pay, meaning higher-graded employees tend to work fewer overtime hours.
Some aspects of the analysis could not be fully explored, such as identifying specific jobs with the largest gender pay gaps, which would require a more detailed breakdown by job title. Overall, the data cleaning and visualization process provided valuable insights, and the visualizations were enhanced with appropriate labels, colors, and themes for clarity and readability.
Data Source: Montgomery County Employee Salaries - 2021 Picture Source