# Loading the necessary libraries
library(tidyverse)
library(plotly)
library(reshape2)
library(corrplot)
library(viridis)
library(hrbrthemes)
library(treemap)
library(ggstream)
# Converting all column names to lower case and replacing spaces with underscores
names(Data) <- tolower(names(Data))
names(Data) <- gsub(" ", "_", names(Data))
# Setting NA values to zero so that arithmetic can be performed on every column
Data[is.na(Data)] <- 0
About Data set
This data is provided by the United States Department of Justice (DOJ). The dataset aims to provide insights into salary trends for various data professions within the Department of Justice. It includes each employee's job title, age, gender, date of hire, past experience, education, salary, and performance rating, all of which could influence salary.
This data was collected through a combination of internal records and employee surveys within the Department of Justice. The records were extracted directly from existing digital records maintained by the HR department.
This data set could be used for several analyses, including identifying salary trends and disparities across job titles and locations, analyzing the impact of education and experience on salaries, investigating gender pay gaps within data professions, and forecasting salary growth based on age and years of experience.
Furthermore, it is crucial to understand which factors are primarily considered during candidate hiring, such as whether age truly affects employee salary. For example, should age be a determining factor, and if so, to what extent?
Age and salary
One factor that employers consider when hiring is the candidate’s age. This factor can affect a person’s salary differently across various units. Many articles claim that “there is no clear linear relationship between either age or years of experience and salary” (Gao). To gain a better understanding of the relationship between age and salary, the bar chart below has been created. This chart represents the age groups for each unit and their corresponding income, demonstrating the relationship between the age of DOJ employees and their salaries. From the chart, we can observe that certain age groups within specific units tend to have higher salaries, indicating that age, in conjunction with unit assignment, plays a significant role in determining salary levels.
# Creating a bar chart that shows the relationship between age group and salary for each unit
ggplot(filtered_data, aes(x = age_group, y = avrgsalary, fill = unit)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  labs(
    title = "Salary Distribution by Age Group",
    x = "Age Group",
    y = "Salary",
    caption = "Source: https://www.justice.gov/"
  ) +
  scale_fill_brewer(palette = "Spectral") + # Spectral color palette from RColorBrewer
  theme_minimal()
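The code that builds `filtered_data` is not shown in this report. As a minimal sketch, assuming it holds the average salary for each age group and unit, it could be derived along the following lines. The toy rows, the age break points, and the names `example_data` and `filtered_example` are illustrative assumptions, not values or objects from the DOJ data.

```r
# Toy stand-in for the DOJ data (hypothetical values, for illustration only)
example_data <- data.frame(
  age    = c(23, 27, 34, 38, 45, 52),
  unit   = c("IT", "IT", "Finance", "IT", "Finance", "Finance"),
  salary = c(45000, 50000, 60000, 70000, 80000, 90000)
)

# Bin ages into groups; these break points are an assumption
example_data$age_group <- cut(
  example_data$age,
  breaks = c(20, 30, 40, 50, 60),
  labels = c("21-30", "31-40", "41-50", "51-60")
)

# Average salary for each combination of age group and unit
filtered_example <- aggregate(salary ~ age_group + unit, data = example_data, FUN = mean)
names(filtered_example)[names(filtered_example) == "salary"] <- "avrgsalary"

filtered_example
```

A table of this shape, with one row per age group and unit, is what the bar chart code above expects in its `age_group`, `unit`, and `avrgsalary` columns.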
To simplify the analysis, we can examine the stream graph below. This visualization uses different colors to represent each unit and illustrates how values increase with age groups.
# Creating a stream graph using geom_stream()
# Note: width, height, offset, interpolate, interactive, scale, and the margin
# arguments appear to be streamgraph::streamgraph() options; geom_stream() is
# not documented to use them and will likely ignore them.
p4 <- filtered_data |>
  ggplot(aes(x = age_group, y = avrgsalary, fill = unit)) +
  geom_stream(
    width = NULL, height = NULL,
    offset = "silhouette", interpolate = "cardinal", interactive = TRUE,
    scale = "date", top = 20, right = 40, bottom = 50, left = 200
  ) +
  labs(title = "Rate of Salary by Age Group", x = "Age Group", y = "Salary", fill = "Unit") +
  annotate("text", x = Inf, y = Inf, label = "Source: https://www.justice.gov/", hjust = 1, vjust = 1, color = "gray50") +
  theme_minimal() +
  scale_fill_brewer(palette = "Spectral") # adding palette
# Fitting a multiple linear regression model with lm()
data2 <- lm(salary ~ age + past.exp + designation + sex + unit + ratings, data = Data)
summary(data2) # summary of the fitted model to see the relationships
# Fitting a simple linear regression of salary on past experience
lm_model <- lm(salary ~ past.exp, data = Data)
summary(lm_model)
Call:
lm(formula = salary ~ past.exp, data = Data)
Residuals:
Min 1Q Median 3Q Max
-73368 -9754 1962 7440 178776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40049.9 430.7 92.98 <2e-16 ***
past.exp 11543.2 136.9 84.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19190 on 2637 degrees of freedom
Multiple R-squared: 0.7294, Adjusted R-squared: 0.7293
F-statistic: 7108 on 1 and 2637 DF, p-value: < 2.2e-16
predictions <- predict(lm_model, newdata = Data)
# Identifying which columns of Data are numeric
numerical_columns <- sapply(Data, is.numeric)
Past Experience
To better understand this relationship, let’s look at the chart below, a scatter plot with a fitted regression line that shows the relationship between one independent variable, past experience, and the dependent variable, salary.
Equation: Y = 40049.9 + 11543.2X
where X = past experience (in years) and Y = salary.
Intercept (40049.9): the estimated base salary when past experience (past.exp) is zero.
Slope (11543.2): for each additional year of past experience, salary is expected to increase by approximately 11,543.2 salary units.
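As a small worked example, the fitted equation can be evaluated by hand for a few experience levels. The sketch below only reuses the two coefficients reported in the model summary above; the helper name `predict_salary` is hypothetical and is not part of the original analysis.

```r
# Coefficients reported in the model summary above
intercept <- 40049.9 # estimated salary at zero years of past experience
slope     <- 11543.2 # estimated increase per additional year of experience

# Hypothetical helper: predicted salary for a given number of years
predict_salary <- function(years) {
  intercept + slope * years
}

# Predicted salaries for 0, 2, and 5 years of past experience
predict_salary(c(0, 2, 5))
# 40049.9 63136.3 97765.9
```

On the real data, `predict(lm_model, newdata = data.frame(past.exp = c(0, 2, 5)))` should give the same values up to rounding of the printed coefficients.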
ggplot(Data, aes(x = past.exp, y = salary)) +
  geom_point(shape = 1, size = 3, color = "#69b3a2") + # Actual data points
  geom_smooth(aes(y = predictedsalary), method = "lm", se = FALSE, color = "#66a2de") + # Regression line
  labs(
    title = "Relationship between Years of Experience and Salary",
    x = "Years of Experience",
    y = "Salary Value",
    caption = "Source: https://www.justice.gov/"
  ) +
  scale_shape_manual(values = c(16, 18), labels = c("Actual", "Predicted")) +
  theme_minimal()
From the chart above, it is evident that the relationship between past experience and salary is generally positive: more experienced employees tend to earn higher salaries, plausibly because of their accumulated skills and expertise. In the fitted model, past experience alone explains about 73% of the variation in salary (R-squared of 0.7294). Because people typically gain more experience as they age, age is likely related to salary through experience. This suggests that the claim of no clear relationship between age and salary does not hold in this dataset, although the model above measures experience rather than age directly.
The chart below summarizes the positive relationship between years of experience and average income for each role. It shows that employees with more years of experience tend to have higher salaries across different roles, highlighting the significant impact of experience on income.
# Picking a color palette
color_palette <- viridis_pal()(8)
# Creating a plot with plot_ly
p1 <- plot_ly(
  Avg_data,
  x = ~past.exp,          # x variable: past experience
  y = ~avg_salary,        # y variable: the average salary
  type = 'scatter', mode = 'markers',
  size = ~avg_salary,     # size increases with salary
  color = ~designation,   # giving each job one color
  colors = color_palette, # viridis color palette
  text = ~paste('<br>Average salary: $', avg_salary,
                '<br>Years of Experience:', past.exp,
                '<br>Job:', designation),
  hoverinfo = 'text'      # providing hover text
)
p1
Warning: `line.width` does not currently support multiple values.
# Updating layout settings
p2 <- p1 %>%
  layout(
    title = 'Salary by Years of Experience',
    xaxis = list(title = 'Years of Experience'),
    yaxis = list(title = 'Salary')
  )
p2
Source: https://www.justice.gov/
The Importance of Data
These data and visualizations, created through various analyses, are highly beneficial for companies and employers looking to determine appropriate salaries for their employees. They are also valuable for employees seeking to understand their market value and ascertain the salary they deserve.
Having information about how factors such as age, past experience, and your skills affect your salary could empower you to negotiate effectively if an initial salary offer does not reflect your true worth. According to CNBC, “Most employers will offer a lower salary to start, leaving room for negotiations. So, by not negotiating, you could be leaving money on the table!” This is especially crucial for employees negotiating salary for the first time.
With a better understanding of how to predict salaries, let’s examine how DOJ employees are compensated by comparing their accepted salaries with the predicted salaries calculated from relevant factors such as sex, age, performance rating, past experience, and unit.
# Creating a bar chart of accepted vs predicted salaries by job title
ggplot(Data, aes(x = designation)) +
  geom_bar(aes(y = salary, fill = "Accepted"), stat = "identity", position = "dodge", width = 0.5) +
  geom_bar(aes(y = predictedsalary, fill = "Predicted"), stat = "identity", position = "dodge", width = 0.5) +
  labs(
    title = "Accepted vs Predicted Salaries by Job Title",
    x = "Job Title",
    y = "Salary",
    caption = "Source: https://www.justice.gov/"
  ) +
  scale_fill_brewer(palette = "Set2") + # using palette Set2
  theme_minimal()
Why does the DOJ pay more?
As shown above, for each role the Department of Justice (DOJ) typically pays more than the predicted salary, which could be due to several factors. First, government positions, including those at the DOJ, often offer stability, comprehensive benefits packages, and opportunities for career advancement, and higher pay can accompany these benefits as a reward for dedication to public service. Second, the DOJ follows government pay scales, which are designed to offer competitive salaries that attract and retain highly skilled professionals in law enforcement, legal, and administrative roles, which typically command high salaries.
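The `predictedsalary` column used in the chart code above is not created anywhere in the code shown in this report. A minimal sketch of one way it might be produced, assuming it comes from a fitted linear model: fit the model, store its predictions next to the actual salaries, and then average both per job title. The toy values and the names `toy`, `toy_model`, and `by_title` are hypothetical.

```r
# Toy data (hypothetical values, not DOJ records)
toy <- data.frame(
  designation = c("Analyst", "Analyst", "Analyst", "Manager", "Manager", "Manager"),
  past.exp    = c(0, 1, 2, 4, 6, 8),
  salary      = c(42000, 50000, 61000, 90000, 110000, 135000)
)

# Fit a simple model and store its predictions as the predicted salary
toy_model <- lm(salary ~ past.exp, data = toy)
toy$predictedsalary <- predict(toy_model, newdata = toy)

# Average accepted and predicted salary for each job title
by_title <- aggregate(cbind(salary, predictedsalary) ~ designation, data = toy, FUN = mean)
by_title
```

Summarizing to one row per job title before plotting is a design choice rather than a requirement: it gives each bar a single, well-defined height instead of drawing many employee rows on top of each other.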
In the Government pay scale, positions are categorized into job classifications, each designated with a grade level (e.g., GS-5, GS-7, GS-9, etc.). These grade levels are structured to reflect the varying degrees of difficulty and responsibility associated with each role. Generally, higher grade levels correlate with increased salaries, reflecting the heightened expertise and responsibilities required.
This structured scale helps maintain transparency and equity in salary determination within the government sector, ensuring that pay is based on standardized criteria rather than negotiation.
Location
A key factor that the DOJ data set is missing is the location of each employee. Salaries may vary based on the cost of living and demand for specific skills in different regions. Therefore, when calculating the predicted salary, it is important to consider location. Research indicates that, overall, pay in metropolitan areas was higher than pay in nonmetropolitan areas. For the DOJ, location is an important factor. In many government agencies, there are locality pay adjustments to account for differences in the cost of living across geographic areas. These adjustments aim to ensure that federal employees receive salaries that are competitive with private sector jobs in the same region.
In datasets such as DOJ salary predictions, understanding the factors that influence individual salaries benefits both employees and employers. Age plays a crucial role in salary determination because it is often positively correlated with accumulated experience, which significantly impacts earnings. Reviewing datasets or using salary calculators can be valuable when evaluating job offers. In government agencies, however, negotiation for higher pay is typically limited by structured pay scales, which offer salary progression based on merit and promotion rather than negotiation.
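To illustrate how a locality adjustment could enter a salary prediction, the sketch below applies a percentage on top of a base salary. The rates, the location names, and the helper `adjust_for_locality` are all hypothetical; they are not actual federal locality rates and are not part of the DOJ data set.

```r
# Hypothetical locality rates (illustrative only, not actual federal rates)
locality_rate <- c(metro_a = 0.30, metro_b = 0.20, rest = 0.15)

# Adjusted salary = base salary * (1 + locality rate)
adjust_for_locality <- function(base_salary, location) {
  unname(base_salary * (1 + locality_rate[location]))
}

# The same base salary in two hypothetical locations
adjust_for_locality(60000, c("metro_a", "rest"))
```

If a location column were available, the same idea could be expressed in the regression itself by adding location as a predictor.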
Sources:
Entry-Level (Honors Program) and Experienced Attorneys - Attorney Salaries, Promotions, and Benefits. 21 Feb. 2024, www.justice.gov/legal-careers/attorney-salaries-promotions-and-benefits.
Gao, Han. “Research on the Effect of Age and Experience on Salary Based on Linear Regression and Tree-Based Train.” Highlights in Business, Economics and Management, vol. 24, Jan. 2024, pp. 1401–07. https://doi.org/10.54097/2xjwsv42.
Heinzerling, Kelly. “How to Negotiate the Salary for Your First Job Offer.” CNBC, 30 Oct. 2021, www.cnbc.com/2021/10/30/how-to-negotiate-the-salary-for-your-first-job-offer.html.