Salary is now one of the most significant components of employee remuneration and a key element of organizational strategy. When designing pay scales, companies benchmark by gathering data on the wages other organizations pay for similar tech roles. Market wage information is used to compare pay scales against competitors and to prevent underpayment. Likewise, knowledge of pay ranges for prospective careers is useful to job seekers: workers can develop their talents most effectively by focusing on high-paying industries, and job satisfaction is now widely emphasized as a driver of organizational commitment (Matbouli & Alghamdi, 2022; Khan et al., 2010). However, there is little research on applying machine learning models to predict the salary and job satisfaction of tech jobs for job seekers while accounting for the uncertainty in the main features that contribute to data science salaries. This project therefore develops models that predict salary and job satisfaction in tech jobs and identifies the main features associated with earning a better wage in a data scientist role.
The salary prediction dataset used for this research has 742 entries and was obtained from Kaggle. It is a compilation of tech job postings scraped from glassdoor.com in 2017 and can be used to predict salary from several features, for instance age and company size. One column, "Job.Title", records the tech job position; the remaining 31 features capture information associated with each posting. Only 12 of these features are selected as the focus of this project.
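The analysis below relies on several R packages; a minimal setup sketch, assuming they are installed, is:
library(dplyr)         # data manipulation
library(stringr)       # string cleaning (str_replace_all)
library(ggplot2)       # plotting
library(mice)          # imputation of missing Size/Revenue values
library(caret)         # createDataPartition, confusionMatrix, MAE/RMSE/R2
library(randomForest)  # random forest models
library(gbm)           # gradient boosting machine
library(xgboost)       # extreme gradient boosting
library(e1071)         # naiveBayes
library(naivebayes)    # gaussian_naive_bayes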
df <- read.csv("edaData.csv")
str(df)
## 'data.frame': 742 obs. of 33 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Job.Title : chr "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary.Estimate : chr "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job.Description : chr "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company.Name : chr "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
## $ Location : chr "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
## $ Founded : int 1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
## $ Type.of.ownership: chr "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ hourly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ employer_provided: int 0 0 0 0 0 0 0 0 0 0 ...
## $ min_salary : int 53 63 80 56 86 71 54 86 38 120 ...
## $ max_salary : int 91 112 90 97 143 119 93 142 84 160 ...
## $ avg_salary : num 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
## $ job_state : chr "NM" "MD" "FL" "WA" ...
## $ same_state : int 0 0 1 1 1 1 1 0 1 1 ...
## $ age : int 47 36 10 55 22 20 12 15 6 11 ...
## $ python_yn : int 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
## $ job_simp : chr "data scientist" "data scientist" "data scientist" "data scientist" ...
## $ seniority : chr "na" "na" "na" "na" ...
## $ desc_len : int 2536 4783 3461 3883 2728 3747 1786 3804 1538 4574 ...
## $ num_comp : int 0 0 0 3 3 3 0 0 0 2 ...
# keep the 12 study features: Rating, Size, Industry, Revenue, avg_salary, age, python_yn, R_yn, spark, aws, excel, job_simp
df <- df %>%
  dplyr::select(5, 9, 12, 14, 20, 24:30)
summary(df)
## Rating Size Industry Revenue
## Min. :-1.000 Length:742 Length:742 Length:742
## 1st Qu.: 3.300 Class :character Class :character Class :character
## Median : 3.700 Mode :character Mode :character Mode :character
## Mean : 3.619
## 3rd Qu.: 4.000
## Max. : 5.000
## avg_salary age python_yn R_yn
## Min. : 13.5 Min. : -1.00 Min. :0.0000 Min. :0.000000
## 1st Qu.: 73.5 1st Qu.: 11.00 1st Qu.:0.0000 1st Qu.:0.000000
## Median : 97.5 Median : 24.00 Median :1.0000 Median :0.000000
## Mean :100.6 Mean : 46.59 Mean :0.5283 Mean :0.002695
## 3rd Qu.:122.5 3rd Qu.: 59.00 3rd Qu.:1.0000 3rd Qu.:0.000000
## Max. :254.0 Max. :276.00 Max. :1.0000 Max. :1.000000
## spark aws excel job_simp
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Length:742
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :0.0000 Median :0.0000 Median :1.0000 Mode :character
## Mean :0.2251 Mean :0.2372 Mean :0.5229
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
str(df)
## 'data.frame': 742 obs. of 12 variables:
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Size : chr "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ avg_salary: num 72 87.5 85 76.5 114.5 ...
## $ age : int 47 36 10 55 22 20 12 15 6 11 ...
## $ python_yn : int 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : int 0 0 0 0 0 0 0 0 0 0 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
## $ job_simp : chr "data scientist" "data scientist" "data scientist" "data scientist" ...
boxplot(df$Rating, ylab = "Rating Value", horizontal = TRUE)
# remove the 11 instances with non-positive ratings (missing values coded as -1); rows reduce from 742 to 731
df <- df %>%
filter (Rating > 0)
df$Rating <- round(df$Rating)
df<-df %>%
mutate(Rating = cut(Rating, breaks = c(0.0, 1.0, 2.0, 3.0, 4.0, 5.0), labels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied")))
summary(df$Rating)
## Very Dissatisfied Dissatisfied Neutral Satisfied
## 0 21 211 444
## Very Satisfied
## 55
boxplot(df$age, horizontal= T)
min_age <- sum(df$age < 18)
min_age
## [1] 279
max_age <- sum(df$age > 70)
max_age
## [1] 156
# There are 279 instances with age < 18 and 156 instances with age > 70, which are treated as outliers.
# Replace the < 18 outliers with Q1 of the working-age distribution (30) and the > 70 outliers with Q3 (52).
getWorkingAge <- df$age[df$age <66 & df$age > 23 & !is.na(df$age)]
quantile(getWorkingAge)
## 0% 25% 50% 75% 100%
## 24 30 39 52 62
df$age[df$age<18] <- 30
df$age[df$age>70] <- 52
hist(df$age)
# avg_salary is recorded in thousands of USD; convert to absolute USD
df$avg_salary <- df$avg_salary*1000
# Replace the placeholder string "na" in job_simp with "Other tech jobs"
df$job_simp <- str_replace_all(df$job_simp, "\\bna\\b", "Other tech jobs")
df$Size <- ifelse(df$Size %in% c("1 to 50 employees", "51 to 200 employees"), "small",
ifelse(df$Size %in% c("201 to 500 employees", "501 to 1000 employees"), "medium",
ifelse(df$Size %in% c("1001 to 5000 employees", "5001 to 10000 employees", "10000+ employees"), "large", NA)))
df$Revenue<- ifelse(df$Revenue == "Less than $1 million (USD)", "micro-bus",
ifelse(df$Revenue %in% c("$1 to $5 million (USD)", "$5 to $10 million (USD)"), "small-bus",
ifelse(df$Revenue %in% c("$10 to $25 million (USD)","$25 to $50 million (USD)", "$50 to $100 million (USD)"), "medium-bus",
ifelse(df$Revenue %in% c("$100 to $500 million (USD)", "$500 million to $1 billion (USD)", "$1 to $2 billion (USD)", "$2 to $5 billion (USD)", "$5 to $10 billion (USD)", "$10+ billion (USD)"), "large-bus", NA))))
library(dplyr)
df <- df %>%
mutate(
Size = as.factor(Size),
Revenue = as.factor(Revenue)
)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
init = mice(df, maxit = 0)
## Warning: Number of logged events: 2
meth = init$method
predM = init$predictorMatrix
meth[c("Size")] = "polyreg"
meth[c("Revenue")] = "polyreg"
imp_rev <- mice(df, m=5, method= meth, predictorMatrix = predM, maxit = 10, seed = 20)
##
## iter imp variable
## 1 1 Size Revenue
## 1 2 Size Revenue
## 1 3 Size Revenue
## 1 4 Size Revenue
## 1 5 Size Revenue
## 2 1 Size Revenue
## 2 2 Size Revenue
## 2 3 Size Revenue
## 2 4 Size Revenue
## 2 5 Size Revenue
## 3 1 Size Revenue
## 3 2 Size Revenue
## 3 3 Size Revenue
## 3 4 Size Revenue
## 3 5 Size Revenue
## 4 1 Size Revenue
## 4 2 Size Revenue
## 4 3 Size Revenue
## 4 4 Size Revenue
## 4 5 Size Revenue
## 5 1 Size Revenue
## 5 2 Size Revenue
## 5 3 Size Revenue
## 5 4 Size Revenue
## 5 5 Size Revenue
## 6 1 Size Revenue
## 6 2 Size Revenue
## 6 3 Size Revenue
## 6 4 Size Revenue
## 6 5 Size Revenue
## 7 1 Size Revenue
## 7 2 Size Revenue
## 7 3 Size Revenue
## 7 4 Size Revenue
## 7 5 Size Revenue
## 8 1 Size Revenue
## 8 2 Size Revenue
## 8 3 Size Revenue
## 8 4 Size Revenue
## 8 5 Size Revenue
## 9 1 Size Revenue
## 9 2 Size Revenue
## 9 3 Size Revenue
## 9 4 Size Revenue
## 9 5 Size Revenue
## 10 1 Size Revenue
## 10 2 Size Revenue
## 10 3 Size Revenue
## 10 4 Size Revenue
## 10 5 Size Revenue
## Warning: Number of logged events: 100
df <- complete(imp_rev)
df <- subset(df, Industry != "-1")
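The cleaned data are re-read from edaData_cleaned2.csv below, so the data frame is presumably exported at this point; an assumed step (not shown in the original output):
colSums(is.na(df))                      # confirm no missing values remain after imputation
write.csv(df, "edaData_cleaned2.csv")   # default row.names creates the "X" index column dropped later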
cleaned_df<-read.csv("edaData_cleaned2.csv")
View(cleaned_df)
ggplot(data = cleaned_df, mapping = aes(x = age, y = avg_salary ))+labs(title="Age vs AverageSalary") + geom_point()
# The points in the scatter plot appear randomly scattered, indicating no clear relationship between age and average salary.
ggplot(data = cleaned_df, aes(x = Size, y = avg_salary))+labs(title="CompanySize vs AverageSalary") + geom_boxplot()
# Overall, the three groups are distributed in a broadly similar way, with reasonably similar interquartile ranges (box lengths). The median average salary for small companies is higher than for large and medium-sized companies. Outliers are observed for large and medium-sized companies, indicating values far from the rest that warrant further evaluation.
ggplot(data = cleaned_df, aes(x = Rating, y = avg_salary))+labs(title="Rating vs AverageSalary") + geom_boxplot()
# Overall, individuals who rated Satisfied have a higher median average salary than those who rated Neutral, Very Satisfied or Dissatisfied; the Neutral and Very Satisfied groups share the same median. The Dissatisfied box is comparatively short, indicating that dissatisfied individuals have broadly similar average salaries, whereas the Very Satisfied box is relatively long, indicating more varied salaries in that group. Outliers are observed for the Neutral and Satisfied groups, indicating values far from the rest that warrant further evaluation.
ggplot(data = cleaned_df, aes(x = job_simp, y = avg_salary)) +labs(title="Tech Job vs AverageSalary") + geom_boxplot()
# Overall, the director position has a higher median average salary than mle, data scientist, data engineer, other tech jobs, manager and analyst. The boxes for analyst, data engineer, manager and mle are comparatively short, indicating that individuals in these roles earn broadly similar average salaries. The four sections of the manager and mle box plots are uneven in size, indicating that salaries are similar over some parts of the scale but more variable over others. The short upper whisker for mle indicates that salaries in the top quartile are very similar while the lower quartiles are more spread out. Outliers are observed for all positions, indicating values far from the rest that warrant further evaluation.
dat <- data.frame(table(cleaned_df$Size,cleaned_df$Rating))
names(dat) <- c("Size","Rating","Count")
ggplot(data=dat, aes(x=Size, y=Count, fill=Rating)) + geom_bar(stat="identity")
# Overall, most individuals in large, medium and small companies are satisfied with their job. Medium-sized companies show more dissatisfied individuals, and noticeably more individuals in small companies are very satisfied with their job than in medium-sized or large companies.
#compare rating(job satisfaction) and Tech Job
dat2 <- data.frame(table(cleaned_df$job_simp,cleaned_df$Rating))
names(dat2) <- c("job_simp","Rating","Count")
ggplot(data=dat2, aes(x=job_simp, y=Count, fill=Rating)) + geom_bar(stat="identity")
# Overall, most individuals are generally satisfied with their job. No dissatisfied individuals are observed in the data engineer, director, manager and mle positions.
skill_cols <- c("python_yn", "R_yn", "spark", "aws", "excel")
# reshape the skill indicator columns to long format: one row per job posting per skill
df2 <- cleaned_df %>%
  dplyr::select(dplyr::all_of(skill_cols), job_simp) %>%
  tidyr::pivot_longer(cols = dplyr::all_of(skill_cols), names_to = "skill", values_to = "NoYes") %>%
  dplyr::mutate(NoYes = factor(NoYes))
ggplot(df2) + geom_bar(aes(x=skill, fill=NoYes),position = "dodge") + facet_wrap(~job_simp) +theme(axis.text.x=element_text(angle=30,hjust=0.5,vjust=0.5))
#For Analyst role, the skill that is most needed would be spark
#For data engineer role, the skill that is most needed would be excel
#For data scientist role, the skill that is most needed would be python_yn
#For director role, the skill that is most needed would be python_yn
#For manager role, the skill that is most needed would be R_yn
#For mle role, the skill that is most needed would be aws and excel
#For other tech role, the skill that is most needed would be spark
#Thus, the 3 skills that most needed across the tech jobs would be spark, excel and python_yn.
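Skill prevalence by role can also be summarised numerically to complement the plot; a short sketch using dplyr:
cleaned_df %>%
  group_by(job_simp) %>%
  summarise(across(c(python_yn, R_yn, spark, aws, excel), mean))   # share of postings requiring each skill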
# drop the row-index column "X" and Industry
sal_df <- dplyr::select(cleaned_df, -c("X","Industry"))
norm_minmax <- function(x){
(x- min(x)) /(max(x)-min(x))
}
sal_df[sapply(sal_df, is.numeric)] <- lapply(sal_df[sapply(sal_df, is.numeric)],norm_minmax)
sal_df[sapply(sal_df, is.character)] <- lapply(sal_df[sapply(sal_df, is.character)],as.factor)
str(sal_df)
## 'data.frame': 730 obs. of 11 variables:
## $ Rating : Factor w/ 4 levels "Dissatisfied",..: 3 2 4 3 2 2 3 3 2 4 ...
## $ Size : Factor w/ 3 levels "large","medium",..: 2 1 2 1 3 2 2 2 1 3 ...
## $ Revenue : Factor w/ 4 levels "large-bus","medium-bus",..: 2 1 1 1 4 1 1 2 1 1 ...
## $ avg_salary: num 0.243 0.308 0.297 0.262 0.42 ...
## $ age : num 0.5686 0.3529 0.2353 0.7255 0.0784 ...
## $ python_yn : num 1 1 1 1 1 1 0 1 0 1 ...
## $ R_yn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ spark : num 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : num 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : num 1 0 1 0 1 1 1 1 0 0 ...
## $ job_simp : Factor w/ 7 levels "analyst","data engineer",..: 3 3 3 3 3 3 3 3 7 3 ...
set.seed(20)
indice<-createDataPartition(y=sal_df$avg_salary,p=0.8,list=FALSE)
sal_train<-sal_df[indice,]
sal_test<-sal_df[-indice,]
sal_train_x = sal_train[, -4]
sal_train_y = sal_train[,4]
sal_test_x = sal_test[,-4]
sal_test_y = sal_test[,4]
# Create Multiple Regression Model
set.seed(20)
slmModel=lm(avg_salary~.,data=sal_train)
summary(slmModel)
##
## Call:
## lm(formula = avg_salary ~ ., data = sal_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38813 -0.08248 -0.01855 0.06287 0.50527
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.238884 0.037299 6.405 3.18e-10 ***
## RatingNeutral -0.018697 0.034130 -0.548 0.5840
## RatingSatisfied -0.006620 0.033487 -0.198 0.8434
## RatingVery Satisfied -0.022515 0.038427 -0.586 0.5582
## Sizemedium -0.034927 0.015581 -2.242 0.0254 *
## Sizesmall -0.008869 0.025439 -0.349 0.7275
## Revenuemedium-bus 0.009260 0.017177 0.539 0.5900
## Revenuemicro-bus 0.016576 0.054486 0.304 0.7611
## Revenuesmall-bus 0.040113 0.032479 1.235 0.2173
## age -0.043540 0.025482 -1.709 0.0881 .
## python_yn 0.027971 0.012961 2.158 0.0313 *
## R_yn -0.005018 0.093279 -0.054 0.9571
## spark 0.029546 0.014926 1.979 0.0482 *
## aws 0.005995 0.013887 0.432 0.6661
## excel 0.011519 0.011040 1.043 0.2972
## job_simpdata engineer 0.133222 0.021668 6.148 1.48e-09 ***
## job_simpdata scientist 0.183395 0.018803 9.753 < 2e-16 ***
## job_simpdirector 0.419384 0.046065 9.104 < 2e-16 ***
## job_simpmanager 0.046901 0.033508 1.400 0.1622
## job_simpmle 0.240755 0.038304 6.285 6.55e-10 ***
## job_simpOther tech jobs 0.074923 0.018951 3.954 8.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1291 on 564 degrees of freedom
## Multiple R-squared: 0.3344, Adjusted R-squared: 0.3108
## F-statistic: 14.17 on 20 and 564 DF, p-value: < 2.2e-16
# Predict Salary
pred_sal_slm<-predict(slmModel,sal_test_x)
# Evaluate Model
mse <- mean((sal_test_y - pred_sal_slm)^2)
mae <- MAE(sal_test_y, pred_sal_slm)
rmse <-RMSE(sal_test_y,pred_sal_slm)
r2 <- R2(sal_test_y, pred_sal_slm)
model_metrics_lm <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_lm)<-"Linear Regression"
model_metrics_lm
## mse mae rmse r2
## Linear Regression 0.02083315 0.1029561 0.1443369 0.3502413
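For reference, these evaluation metrics can be reproduced directly; a sketch in which R-squared is computed as the squared correlation between observed and predicted values, matching caret's default behaviour:
mse_manual  <- mean((sal_test_y - pred_sal_slm)^2)
mae_manual  <- mean(abs(sal_test_y - pred_sal_slm))
rmse_manual <- sqrt(mse_manual)
r2_manual   <- cor(sal_test_y, pred_sal_slm)^2   # squared correlation, as used by caret::R2 by default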
#Create Random Forest Model
set.seed(20)
RFModel = randomForest(x = sal_train_x,
y = sal_train_y,
ntree = 500)
#Predict Salary
pred_sal_RF<-predict(RFModel,sal_test_x)
#Evaluate Model
mse = mean((sal_test_y - pred_sal_RF)^2)
mae = MAE(sal_test_y , pred_sal_RF)
rmse =RMSE(sal_test_y ,pred_sal_RF)
r2 = R2(sal_test_y, pred_sal_RF)
model_metrics_RF <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_RF)<-"random forest"
model_metrics_RF
## mse mae rmse r2
## random forest 0.01735091 0.09309639 0.1317229 0.4557773
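Since a stated aim of the project is to identify which features contribute most to salary, the fitted forest's importance scores can be inspected; a brief sketch using the RFModel trained above:
importance(RFModel)                 # IncNodePurity for each predictor
varImpPlot(RFModel, main = "Feature importance for salary prediction")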
overall<-rbind(model_metrics_lm,model_metrics_RF)
#Create gbm model
set.seed(20)
gbmModel<-gbm(avg_salary~.,data=sal_train,distribution="gaussian")
#Predict Salary
pred_sal_gbm<-predict(gbmModel,sal_test_x)
## Using 100 trees...
#Evaluate Model
mse = mean((sal_test_y - pred_sal_gbm)^2)
mae = MAE(sal_test_y , pred_sal_gbm)
rmse =RMSE(sal_test_y ,pred_sal_gbm)
r2 = R2(sal_test_y, pred_sal_gbm)
model_metrics_gbm <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_gbm)<-"gbm"
model_metrics_gbm
## mse mae rmse r2
## gbm 0.0204676 0.1027295 0.143065 0.3675479
overall<-rbind(overall,model_metrics_gbm)
#XGboost
set.seed(20)
xgb_train = xgb.DMatrix(data = data.matrix(sal_train_x), label = sal_train_y)
xgb_test = xgb.DMatrix(data = data.matrix(sal_test_x), label = sal_test_y)
xgboostModel = xgboost(data = xgb_train, max.depth = 3, nrounds = 100, verbose = 0)
#Predict Salary
pred_sal_xgb = predict(xgboostModel, xgb_test)
#Evaluate Model
mse = mean((sal_test_y - pred_sal_xgb)^2)
mae = MAE(sal_test_y , pred_sal_xgb)
rmse =RMSE(sal_test_y ,pred_sal_xgb)
r2 = R2(sal_test_y, pred_sal_xgb)
model_metrics_xgb <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_xgb)<-"XGBoost"
model_metrics_xgb
## mse mae rmse r2
## XGBoost 0.01931272 0.09765352 0.1389702 0.4133631
overall<-rbind(overall,model_metrics_xgb)
overall
## mse mae rmse r2
## Linear Regression 0.02083315 0.10295613 0.1443369 0.3502413
## random forest 0.01735091 0.09309639 0.1317229 0.4557773
## gbm 0.02046760 0.10272951 0.1430650 0.3675479
## XGBoost 0.01931272 0.09765352 0.1389702 0.4133631
Based on these results, the R-squared value for Random Forest is the highest and its MAE, MSE and RMSE are the lowest, so Random Forest is the best of the four models for salary prediction. However, none of the four models performs particularly well; the highest R-squared is only about 0.46 (Random Forest). This may be due to several limitations of the dataset:
First, the dataset is small: 730 rows are likely insufficient for accurate salary prediction. For instance, Kablaoui and Salman (2022) predicted salary with an accuracy of 83.2% using a dataset of 20,000 rows; the cross-validation sketch below illustrates one way to make fuller use of the limited rows available here.
Second, the features are insufficient: many relevant factors are not captured in this dataset, and features such as working experience and level of education should be considered. For example, Lothe et al. (2021) developed a salary prediction model with 98% accuracy using degree, major, years of experience, industry and salary.
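With only about 730 rows, a single 80/20 split yields noisy performance estimates. One way to make fuller use of the limited data, sketched here with caret and not part of the original analysis, is k-fold cross-validation:
set.seed(20)
ctrl <- trainControl(method = "cv", number = 10)
rf_cv <- train(avg_salary ~ ., data = sal_df, method = "rf",
               trControl = ctrl, ntree = 500)
rf_cv$results   # RMSE, R-squared and MAE averaged over the 10 folds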
df3 <- read.csv("edaData_cleaned2.csv")
df3 <- df3 %>%
  dplyr::select(2:13)   # drop the row-index column "X"; keep the 12 study features
df3 <- df3 %>%
mutate (Rating = as.factor(Rating),
Size = as.factor(Size),
Revenue = as.factor(Revenue),
job_simp = as.factor(job_simp))
#Function
norm_minmax <- function(x){
(x- min(x)) /(max(x)-min(x))
}
df3[sapply(df3, is.numeric)] <- lapply(df3[sapply(df3, is.numeric)],norm_minmax)
set.seed(20)
indice<-createDataPartition(y=df3$Rating,p=0.8,list=FALSE)
rat_train<-df3[indice,]
rat_test<-df3[-indice,]
rat_train_x = rat_train[, -1]
rat_train_y = rat_train[,1]
rat_test_x = rat_test[,-1]
rat_test_y = rat_test[,1]
#Training Model and Predicting with Random Forest
set.seed(20)
rat_model_rf <-randomForest(as.factor(Rating)~., data=rat_train)
rat_model_rf_pred <- predict(rat_model_rf, rat_test_x)
confusionMatrix(rat_model_rf_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dissatisfied Neutral Satisfied Very Satisfied
## Dissatisfied 3 0 0 0
## Neutral 0 25 5 1
## Satisfied 1 16 82 8
## Very Satisfied 0 1 1 2
##
## Overall Statistics
##
## Accuracy : 0.7724
## 95% CI : (0.6955, 0.8379)
## No Information Rate : 0.6069
## P-Value [Acc > NIR] : 1.776e-05
##
## Kappa : 0.5332
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity 0.75000 0.5952 0.9318
## Specificity 1.00000 0.9417 0.5614
## Pos Pred Value 1.00000 0.8065 0.7664
## Neg Pred Value 0.99296 0.8509 0.8421
## Prevalence 0.02759 0.2897 0.6069
## Detection Rate 0.02069 0.1724 0.5655
## Detection Prevalence 0.02069 0.2138 0.7379
## Balanced Accuracy 0.87500 0.7685 0.7466
## Class: Very Satisfied
## Sensitivity 0.18182
## Specificity 0.98507
## Pos Pred Value 0.50000
## Neg Pred Value 0.93617
## Prevalence 0.07586
## Detection Rate 0.01379
## Detection Prevalence 0.02759
## Balanced Accuracy 0.58345
#Train Model with Extreme Boost
set.seed(20)
rat_xgboost_train <- xgb.DMatrix(data=data.matrix(rat_train_x), label = rat_train_y)
rat_xgboost_test <- xgb.DMatrix(data=data.matrix(rat_test_x), label=rat_test_y)
rat_xgboostModel <- xgboost(rat_xgboost_train, max.depth = 3, nrounds=100, verbose=0)
#Make predictions with Extreme Boost
rat_model_xgb_pred <- predict(rat_xgboostModel, rat_xgboost_test)
rat_model_xgb_pred[(rat_model_xgb_pred > 4)] = 4
rat_model_xgb_pred_asfactor <- as.factor((levels(rat_test_y))[round(rat_model_xgb_pred)])
confusionMatrix(rat_model_xgb_pred_asfactor, rat_test_y)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(rat_model_xgb_pred_asfactor, rat_test_y):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dissatisfied Neutral Satisfied Very Satisfied
## Dissatisfied 0 0 0 0
## Neutral 3 27 15 1
## Satisfied 0 15 72 9
## Very Satisfied 1 0 1 1
##
## Overall Statistics
##
## Accuracy : 0.6897
## 95% CI : (0.6076, 0.7638)
## No Information Rate : 0.6069
## P-Value [Acc > NIR] : 0.02405
##
## Kappa : 0.3851
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity 0.00000 0.6429 0.8182
## Specificity 1.00000 0.8155 0.5789
## Pos Pred Value NaN 0.5870 0.7500
## Neg Pred Value 0.97241 0.8485 0.6735
## Prevalence 0.02759 0.2897 0.6069
## Detection Rate 0.00000 0.1862 0.4966
## Detection Prevalence 0.00000 0.3172 0.6621
## Balanced Accuracy 0.50000 0.7292 0.6986
## Class: Very Satisfied
## Sensitivity 0.090909
## Specificity 0.985075
## Pos Pred Value 0.333333
## Neg Pred Value 0.929577
## Prevalence 0.075862
## Detection Rate 0.006897
## Detection Prevalence 0.020690
## Balanced Accuracy 0.537992
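The XGBoost model above is fitted with the default regression objective and its continuous predictions are rounded back to rating levels, which triggers the level-mismatch warnings. An alternative sketch, using xgboost's built-in multi-class objective and assuming the ratings are recoded to 0-based integer labels, would be:
set.seed(20)
num_class <- length(levels(rat_train_y))
xgb_train_cls <- xgb.DMatrix(data = data.matrix(rat_train_x),
                             label = as.integer(rat_train_y) - 1)   # 0-based class labels
xgb_cls <- xgboost(data = xgb_train_cls, max.depth = 3, nrounds = 100, verbose = 0,
                   objective = "multi:softmax", num_class = num_class)
pred_cls <- factor(levels(rat_train_y)[predict(xgb_cls, data.matrix(rat_test_x)) + 1],
                   levels = levels(rat_test_y))
confusionMatrix(pred_cls, rat_test_y)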
#Train Model and Predicting with Naive Bayes
set.seed(20)
rat_model_nb <- naiveBayes(as.factor(Rating)~., data =rat_train)
rat_model_nb_pred <- predict(rat_model_nb, rat_test_x)
confusionMatrix(rat_model_nb_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dissatisfied Neutral Satisfied Very Satisfied
## Dissatisfied 3 4 8 0
## Neutral 0 24 9 0
## Satisfied 0 7 42 1
## Very Satisfied 1 7 29 10
##
## Overall Statistics
##
## Accuracy : 0.5448
## 95% CI : (0.4601, 0.6277)
## No Information Rate : 0.6069
## P-Value [Acc > NIR] : 0.946
##
## Kappa : 0.3473
##
## Mcnemar's Test P-Value : 2.483e-08
##
## Statistics by Class:
##
## Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity 0.75000 0.5714 0.4773
## Specificity 0.91489 0.9126 0.8596
## Pos Pred Value 0.20000 0.7273 0.8400
## Neg Pred Value 0.99231 0.8393 0.5158
## Prevalence 0.02759 0.2897 0.6069
## Detection Rate 0.02069 0.1655 0.2897
## Detection Prevalence 0.10345 0.2276 0.3448
## Balanced Accuracy 0.83245 0.7420 0.6685
## Class: Very Satisfied
## Sensitivity 0.90909
## Specificity 0.72388
## Pos Pred Value 0.21277
## Neg Pred Value 0.98980
## Prevalence 0.07586
## Detection Rate 0.06897
## Detection Prevalence 0.32414
## Balanced Accuracy 0.81649
#Train Model with Gaussian Naive Bayes
set.seed(20)
rat_model_gnb <- gaussian_naive_bayes(x = data.matrix(rat_train_x), y = rat_train_y)
rat_model_gnb_pred <- predict(rat_model_gnb, data.matrix(rat_test_x))
confusionMatrix(rat_model_gnb_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Dissatisfied Neutral Satisfied Very Satisfied
## Dissatisfied 3 11 16 0
## Neutral 0 10 4 0
## Satisfied 0 14 37 3
## Very Satisfied 1 7 31 8
##
## Overall Statistics
##
## Accuracy : 0.4
## 95% CI : (0.3196, 0.4846)
## No Information Rate : 0.6069
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1617
##
## Mcnemar's Test P-Value : 8.272e-12
##
## Statistics by Class:
##
## Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity 0.75000 0.23810 0.4205
## Specificity 0.80851 0.96117 0.7018
## Pos Pred Value 0.10000 0.71429 0.6852
## Neg Pred Value 0.99130 0.75573 0.4396
## Prevalence 0.02759 0.28966 0.6069
## Detection Rate 0.02069 0.06897 0.2552
## Detection Prevalence 0.20690 0.09655 0.3724
## Balanced Accuracy 0.77926 0.59963 0.5611
## Class: Very Satisfied
## Sensitivity 0.72727
## Specificity 0.70896
## Pos Pred Value 0.17021
## Neg Pred Value 0.96939
## Prevalence 0.07586
## Detection Rate 0.05517
## Detection Prevalence 0.32414
## Balanced Accuracy 0.71811
Based on the performance of all 4 classification models, Random Forest is the best model for job rating classification.
For salary prediction (regression), Random Forest performed best among the four models implemented. Overall, however, none of the four models performed particularly well, which may be attributed to the small dataset size and the limited set of features.
For job satisfaction prediction, Random Forest also performed best among the four models chosen, achieving an accuracy of 0.7724.
Overall, the project can be considered a success. Future work may include a more extensive dataset with more rows and a larger number of features.
Kablaoui, R., & Salman, A. (2022). Machine Learning Models for Salary Prediction Dataset using Python. 2022 International Conference on Electrical and Computing Technologies and Applications (ICECTA). https://doi.org/10.1109/icecta57148.2022.9990316
Khan, M. R., Ziauddin, Z., Jam, F. A., & Ramay, M. I. (2010). The impacts of organizational commitment on employee job performance. European Journal of Social Sciences, 15, 292–298. http://dx.doi.org/10.2139/ssrn.1570544
Lothe, P. M., Tiwari, P., Patil, N., Patil, S., & Patil, V. (2021). Salary Prediction Using Machine Learning. International Journal of Advance Scientific Research and Engineering Trends, 6(5), 199–202. https://doi.org/10.51319/2456-0774.2021.5.0047
Matbouli, Y.T., & Alghamdi, S.M. (2022). Statistical machine learning regression models for salary prediction featuring economy wide activities and occupations. Information, 13, 495. https://doi.org/10.3390/info13100495