Introduction

Salary is one of the most significant components of employee remuneration and a key element of organizational strategy. When designing pay scales, companies benchmark against the wages that other organizations pay for similar tech roles. Market wage information is used to compare pay scales with competitors and to avoid underpaying staff. Knowledge of pay ranges for prospective careers is equally helpful for job seekers: workers can focus their development on high-paying industries, and job satisfaction is now strongly emphasized as a driver of organizational commitment (Matbouli & Alghamdi, 2022; Khan et al., 2010). However, there is a lack of research on applying machine learning models to predict the salary and job satisfaction of tech jobs for job seekers while accounting for the uncertainty in the main features that contribute to data science salaries. This project therefore develops models that predict salary and job satisfaction in tech jobs and identifies the main features associated with earning a better wage in a data scientist role.

Objectives

Project Objectives:
  • To develop and evaluate several machine learning models for predicting job salary in the data science field.
  • To develop and evaluate several machine learning models for predicting job satisfaction in the data science field.

Methodology

  1. Business Understanding: Ask questions related to the problem statement
  2. Data Cleaning: Apply appropriate methods for data cleansing
  3. Exploratory Data Analysis: Visualize the cleansed data to provide insights and answer the objectives
  4. Modeling & Evaluation: Apply machine learning to the regression and classification problems and assess model performance
  5. Deployment: Document the final report on RPubs by the end of the project for future reference

Dataset

The salary prediction dataset used for this research has 742 entries and was obtained from Kaggle. It is a compilation of tech job postings scraped from glassdoor.com in 2017 and can be used to predict salary from several features such as company age and company size. The column labeled "Job.Title" records the tech job position, and the remaining 31 features carry information associated with each posting. Of these, only 12 features are selected as the study focus of this project.

Loading data

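The analysis relies on several R packages. The list below is inferred from the functions used later in this report (dplyr, stringr, ggplot2, mice, caret, randomForest, gbm, xgboost, e1071 and naivebayes) and is assumed to be installed and loaded up front:

library(dplyr)        # data manipulation: select, mutate, filter
library(stringr)      # string cleaning: str_replace_all
library(ggplot2)      # visualization
library(mice)         # multiple imputation of missing values
library(caret)        # data partitioning, confusion matrices, MAE/RMSE/R2
library(randomForest) # random forest models
library(gbm)          # gradient boosting machine
library(xgboost)      # extreme gradient boosting
library(e1071)        # naive Bayes classifier
library(naivebayes)   # Gaussian naive Bayes classifier
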
df <- read.csv("edaData.csv")
str(df)
## 'data.frame':    742 obs. of  33 variables:
##  $ X                : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Job.Title        : chr  "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
##  $ Salary.Estimate  : chr  "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
##  $ Job.Description  : chr  "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
##  $ Rating           : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Company.Name     : chr  "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
##  $ Location         : chr  "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Headquarters     : chr  "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Size             : chr  "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
##  $ Founded          : int  1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
##  $ Type.of.ownership: chr  "Company - Private" "Other Organization" "Company - Private" "Government" ...
##  $ Industry         : chr  "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
##  $ Sector           : chr  "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
##  $ Revenue          : chr  "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
##  $ Competitors      : chr  "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
##  $ hourly           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ employer_provided: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ min_salary       : int  53 63 80 56 86 71 54 86 38 120 ...
##  $ max_salary       : int  91 112 90 97 143 119 93 142 84 160 ...
##  $ avg_salary       : num  72 87.5 85 76.5 114.5 ...
##  $ company_txt      : chr  "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
##  $ job_state        : chr  "NM" "MD" "FL" "WA" ...
##  $ same_state       : int  0 0 1 1 1 1 1 0 1 1 ...
##  $ age              : int  47 36 10 55 22 20 12 15 6 11 ...
##  $ python_yn        : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ spark            : int  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws              : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel            : int  1 0 1 0 1 1 1 1 0 0 ...
##  $ job_simp         : chr  "data scientist" "data scientist" "data scientist" "data scientist" ...
##  $ seniority        : chr  "na" "na" "na" "na" ...
##  $ desc_len         : int  2536 4783 3461 3883 2728 3747 1786 3804 1538 4574 ...
##  $ num_comp         : int  0 0 0 3 3 3 0 0 0 2 ...

Data Cleaning

Select 12 of the 33 variables as the study attributes.

# Keep the 12 study variables: Rating, Size, Industry, Revenue, avg_salary, age,
# python_yn, R_yn, spark, aws, excel and job_simp
df <- df %>% 
  dplyr::select(5,9,12,14,20,24:30)

Summary

summary(df)
##      Rating           Size             Industry           Revenue         
##  Min.   :-1.000   Length:742         Length:742         Length:742        
##  1st Qu.: 3.300   Class :character   Class :character   Class :character  
##  Median : 3.700   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 3.619                                                           
##  3rd Qu.: 4.000                                                           
##  Max.   : 5.000                                                           
##    avg_salary         age           python_yn           R_yn         
##  Min.   : 13.5   Min.   : -1.00   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.: 73.5   1st Qu.: 11.00   1st Qu.:0.0000   1st Qu.:0.000000  
##  Median : 97.5   Median : 24.00   Median :1.0000   Median :0.000000  
##  Mean   :100.6   Mean   : 46.59   Mean   :0.5283   Mean   :0.002695  
##  3rd Qu.:122.5   3rd Qu.: 59.00   3rd Qu.:1.0000   3rd Qu.:0.000000  
##  Max.   :254.0   Max.   :276.00   Max.   :1.0000   Max.   :1.000000  
##      spark             aws             excel          job_simp        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Length:742        
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
##  Median :0.0000   Median :0.0000   Median :1.0000   Mode  :character  
##  Mean   :0.2251   Mean   :0.2372   Mean   :0.5229                     
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000                     
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000
str(df)
## 'data.frame':    742 obs. of  12 variables:
##  $ Rating    : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Size      : chr  "501 to 1000 employees" "10000+ employees" "501 to 1000 employees" "1001 to 5000 employees" ...
##  $ Industry  : chr  "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
##  $ Revenue   : chr  "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
##  $ avg_salary: num  72 87.5 85 76.5 114.5 ...
##  $ age       : int  47 36 10 55 22 20 12 15 6 11 ...
##  $ python_yn : int  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ spark     : int  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws       : int  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel     : int  1 0 1 0 1 1 1 1 0 0 ...
##  $ job_simp  : chr  "data scientist" "data scientist" "data scientist" "data scientist" ...

Data Cleaning - Rating

boxplot(df$Rating, ylab = "Rating Value", horizontal = TRUE)

The boxplot shows that Rating contains outliers (the placeholder value -1 used for missing ratings). Since Rating is a prediction target, these rows are dropped rather than imputed.
# remove 11 outliers; instances reduced from 742 to 731
df <- df %>%  
  filter(Rating > 0)
Bin Ratings to Categorical
  • 0.0 - 0.9 Very Dissatisfied
  • 1.0 - 1.9 Dissatisfied
  • 2.0 - 2.9 Neutral
  • 3.0 - 3.9 Satisfied
  • 4.0 - 5.0 Very Satisfied
df$Rating <- round(df$Rating)
df<-df %>%
  mutate(Rating = cut(Rating, breaks = c(0.0, 1.0, 2.0, 3.0, 4.0, 5.0), labels = c("Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied")))
summary(df$Rating)
## Very Dissatisfied      Dissatisfied           Neutral         Satisfied 
##                 0                21               211               444 
##    Very Satisfied 
##                55

Data Cleaning - Age

Assuming ages 18 to 70 represent the working-age group.
boxplot(df$age, horizontal= T)

min_age <- sum(df$age < 18)
min_age
## [1] 279
max_age <- sum(df$age > 70)
max_age
## [1] 156
# There are 279 instances with age < 18 and 156 instances with age > 70, which are treated as outliers.
# Replace the < 18 outliers with Q1 of the working-age distribution (30) and the > 70 outliers with Q3 (52).
getWorkingAge <- df$age[df$age <66 & df$age > 23 & !is.na(df$age)]
quantile(getWorkingAge)
##   0%  25%  50%  75% 100% 
##   24   30   39   52   62
df$age[df$age<18] <- 30
df$age[df$age>70] <- 52
hist(df$age)

Data Cleaning - Average Salary

# avg_salary is recorded in thousands of USD; multiply by 1000 to convert to USD
df$avg_salary <- df$avg_salary*1000

Data Cleaning - Job Title Simplified

# Replace the placeholder 'na' in job_simp with 'Other tech jobs'
df$job_simp <- str_replace_all(df$job_simp, "\\bna\\b", "Other tech jobs")

Data Cleaning - Number of Employees categorized in company size

  • 1 to 200 employees - 'Small'
  • 201 to 1000 employees - 'Medium'
  • 1001 and above - 'Large'
df$Size <- ifelse(df$Size %in% c("1 to 50 employees", "51 to 200 employees"), "small", 
                       ifelse(df$Size %in% c("201 to 500 employees", "501 to 1000 employees"), "medium", 
                              ifelse(df$Size %in% c("1001 to 5000 employees", "5001 to 10000 employees", "10000+ employees"), "large", NA)))

Data Cleaning - Revenue categorized by business size

  • Less than $1 million (USD) - ‘micro-bus’
  • $1 million to 10 million (USD) - ‘small-bus’
  • $10 million to 50 million (USD) - ‘medium-bus’
  • $50 million and above (USD) - ‘large-bus’
df$Revenue<- ifelse(df$Revenue == "Less than $1 million (USD)", "micro-bus", 
                         ifelse(df$Revenue %in% c("$1 to $5 million (USD)", "$5 to $10 million (USD)"), "small-bus", 
                                ifelse(df$Revenue %in% c("$10 to $25 million (USD)","$25 to $50 million (USD)", "$50 to $100 million (USD)"), "medium-bus", 
                                       ifelse(df$Revenue %in% c("$100 to $500 million (USD)", "$500 million to $1 billion (USD)", "$1 to $2 billion (USD)", "$2 to $5 billion (USD)", "$5 to $10 billion (USD)", "$10+ billion (USD)"), "large-bus", NA))))

Data Cleaning - Impute Missing Values in Revenue (195) and Size (2) by MICE. The polyreg method was chosen because these factors are unordered and have more than two levels.

library(dplyr)
df <- df %>% 
  mutate(
    Size = as.factor(Size),
    Revenue = as.factor(Revenue)
  )

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
init = mice(df, maxit = 0)
## Warning: Number of logged events: 2
meth = init$method
predM = init$predictorMatrix
meth[c("Size")] = "polyreg"
meth[c("Revenue")] = "polyreg"

imp_rev <- mice(df, m=5, method= meth, predictorMatrix = predM, maxit = 10, seed = 20)
## 
##  iter imp variable
##   1   1  Size  Revenue
##   1   2  Size  Revenue
##   1   3  Size  Revenue
##   1   4  Size  Revenue
##   1   5  Size  Revenue
##   2   1  Size  Revenue
##   2   2  Size  Revenue
##   2   3  Size  Revenue
##   2   4  Size  Revenue
##   2   5  Size  Revenue
##   3   1  Size  Revenue
##   3   2  Size  Revenue
##   3   3  Size  Revenue
##   3   4  Size  Revenue
##   3   5  Size  Revenue
##   4   1  Size  Revenue
##   4   2  Size  Revenue
##   4   3  Size  Revenue
##   4   4  Size  Revenue
##   4   5  Size  Revenue
##   5   1  Size  Revenue
##   5   2  Size  Revenue
##   5   3  Size  Revenue
##   5   4  Size  Revenue
##   5   5  Size  Revenue
##   6   1  Size  Revenue
##   6   2  Size  Revenue
##   6   3  Size  Revenue
##   6   4  Size  Revenue
##   6   5  Size  Revenue
##   7   1  Size  Revenue
##   7   2  Size  Revenue
##   7   3  Size  Revenue
##   7   4  Size  Revenue
##   7   5  Size  Revenue
##   8   1  Size  Revenue
##   8   2  Size  Revenue
##   8   3  Size  Revenue
##   8   4  Size  Revenue
##   8   5  Size  Revenue
##   9   1  Size  Revenue
##   9   2  Size  Revenue
##   9   3  Size  Revenue
##   9   4  Size  Revenue
##   9   5  Size  Revenue
##   10   1  Size  Revenue
##   10   2  Size  Revenue
##   10   3  Size  Revenue
##   10   4  Size  Revenue
##   10   5  Size  Revenue
## Warning: Number of logged events: 100
df <- complete(imp_rev)
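As a quick sanity check (a sketch, assuming the completed data frame df from above), the imputation should leave no missing values in Size or Revenue:

# Count remaining missing values per column after imputation
colSums(is.na(df))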

Data Cleaning - Industry

# Drop rows with the placeholder Industry value "-1"
df <- subset(df, Industry != "-1")
# Reload the saved cleaned dataset, which is used for the EDA and modeling below
cleaned_df<-read.csv("edaData_cleaned2.csv")
View(cleaned_df)

Exploratory Data Analysis

Relationship between age and average salary

ggplot(data = cleaned_df, mapping = aes(x = age, y = avg_salary ))+labs(title="Age vs AverageSalary") + geom_point()

#As shown in the scatter plot, the points appear to be scattered randomly, which indicates that there is no clear relationship between age and average salary.

Compare average salary distribution for each company size

ggplot(data = cleaned_df, aes(x = Size, y = avg_salary))+labs(title="CompanySize vs AverageSalary") + geom_boxplot()

#Overall, the three groups look as if they are distributed in a broadly similar way. The interquartile ranges are reasonably similar (as shown by the lengths of the boxes). The median average salary of small companies is greater than that of large and medium companies. Outliers are observed for large and medium companies, indicating data values far from the rest that need further evaluation.

Compare average salary distribution for each rating (job satisfaction)

ggplot(data = cleaned_df, aes(x = Rating, y = avg_salary))+labs(title="Rating vs AverageSalary") + geom_boxplot()

#Overall, individuals who rated satisfied have a higher median average salary than those who rated neutral, very satisfied or dissatisfied. The neutral and very satisfied groups have roughly the same median average salary. The box for the dissatisfied group is comparatively short, indicating that dissatisfied individuals have similar average salaries, while the box for the very satisfied group is relatively long, indicating that average salaries vary considerably within that group. Outliers are observed in the neutral and satisfied groups, indicating data values far from the rest that need further evaluation.

Compare salary distribution for each tech job

ggplot(data = cleaned_df, aes(x = job_simp, y = avg_salary)) +labs(title="Tech Job vs AverageSalary") + geom_boxplot()

#Overall, the director position has a higher median average salary than mle, data scientist, data engineer, other tech jobs, manager and analyst. The box plots for analyst, data engineer, manager and mle are comparatively short, indicating that individuals in these positions have similar average salaries. In addition, the four sections of the box plots for manager and mle are relatively uneven in size, indicating that average salaries are clustered in some parts of the scale and more spread out in others. The short upper whisker in the mle boxplot indicates that average salaries are very similar in the top quartile group. Outliers are observed across all positions, indicating data values far from the rest that need further evaluation.

Compare rating (job satisfaction) and company size

dat <- data.frame(table(cleaned_df$Size,cleaned_df$Rating))
names(dat) <- c("Size","Rating","Count")

ggplot(data=dat, aes(x=Size, y=Count, fill=Rating)) + geom_bar(stat="identity")

#Overall, most individuals working in large, medium and small companies are generally satisfied with their job. Medium-sized companies have more individuals who are dissatisfied with their job. It is also noticeable that more individuals who are very satisfied with their job work in small companies than in medium and large companies.

Compare rating (job satisfaction) and tech job

dat2 <- data.frame(table(cleaned_df$job_simp,cleaned_df$Rating))
names(dat2) <- c("job_simp","Rating","Count")

ggplot(data=dat2, aes(x=job_simp, y=Count, fill=Rating)) + geom_bar(stat="identity")

#Overall, most individuals are generally satisfied with their job. It can also be observed that there are no dissatisfied individuals in the data engineer, director, manager and mle positions.

What are the job skills needed for Tech Job positions?

skills <- c("python_yn", "R_yn", "spark", "aws", "excel")
# Reshape the skill indicator columns to long format: one row per posting and skill,
# with a Yes/No flag recording whether the posting requires that skill
df2 <- cleaned_df %>% 
  dplyr::select(all_of(skills), job_simp) %>% 
  tidyr::pivot_longer(cols = all_of(skills), names_to = "skill", values_to = "required") %>% 
  mutate(required = factor(required, levels = c(0, 1), labels = c("No", "Yes")))

ggplot(df2) + geom_bar(aes(x=skill, fill=required),position = "dodge") + facet_wrap(~job_simp) +theme(axis.text.x=element_text(angle=30,hjust=0.5,vjust=0.5))

#For Analyst role, the skill that is most needed would be spark
#For data engineer role, the skill that is most needed would be excel
#For data scientist role, the skill that is most needed would be python_yn
#For director role, the skill that is most needed would be python_yn
#For manager role, the skill that is most needed would be R_yn
#For mle role, the skill that is most needed would be aws and excel
#For other tech role, the skill that is most needed would be spark

#Thus, the three skills most needed across the tech jobs appear to be spark, excel and python_yn.
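As a cross-check on the faceted plot (a sketch, assuming the skill indicator columns in cleaned_df), the overall frequency of each skill across all postings can be tallied directly:

# Number of postings that list each skill, across all job titles
sort(colSums(cleaned_df[, c("python_yn", "R_yn", "spark", "aws", "excel")]), decreasing = TRUE)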

Salary Prediction

Feature Selection and Normalization

# Drop the row-index column X and the Industry column
sal_df <- cleaned_df %>% 
  dplyr::select(-c("X", "Industry"))
norm_minmax <- function(x){
  (x- min(x)) /(max(x)-min(x))
}
sal_df[sapply(sal_df, is.numeric)] <- lapply(sal_df[sapply(sal_df, is.numeric)],norm_minmax)
sal_df[sapply(sal_df, is.character)] <- lapply(sal_df[sapply(sal_df, is.character)],as.factor)
str(sal_df)
## 'data.frame':    730 obs. of  11 variables:
##  $ Rating    : Factor w/ 4 levels "Dissatisfied",..: 3 2 4 3 2 2 3 3 2 4 ...
##  $ Size      : Factor w/ 3 levels "large","medium",..: 2 1 2 1 3 2 2 2 1 3 ...
##  $ Revenue   : Factor w/ 4 levels "large-bus","medium-bus",..: 2 1 1 1 4 1 1 2 1 1 ...
##  $ avg_salary: num  0.243 0.308 0.297 0.262 0.42 ...
##  $ age       : num  0.5686 0.3529 0.2353 0.7255 0.0784 ...
##  $ python_yn : num  1 1 1 1 1 1 0 1 0 1 ...
##  $ R_yn      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ spark     : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws       : num  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel     : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ job_simp  : Factor w/ 7 levels "analyst","data engineer",..: 3 3 3 3 3 3 3 3 7 3 ...

Splitting the data into training (80%) and test (20%) sets.

set.seed(20)
indice<-createDataPartition(y=sal_df$avg_salary,p=0.8,list=FALSE)
sal_train<-sal_df[indice,]
sal_test<-sal_df[-indice,]

# Column 4 is avg_salary, the regression target
sal_train_x = sal_train[, -4]
sal_train_y = sal_train[,4]
sal_test_x = sal_test[,-4]
sal_test_y = sal_test[,4]

Multiple Linear Regression Model

# Create Multiple Regression Model
set.seed(20)
slmModel=lm(avg_salary~.,data=sal_train)
summary(slmModel)
## 
## Call:
## lm(formula = avg_salary ~ ., data = sal_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38813 -0.08248 -0.01855  0.06287  0.50527 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              0.238884   0.037299   6.405 3.18e-10 ***
## RatingNeutral           -0.018697   0.034130  -0.548   0.5840    
## RatingSatisfied         -0.006620   0.033487  -0.198   0.8434    
## RatingVery Satisfied    -0.022515   0.038427  -0.586   0.5582    
## Sizemedium              -0.034927   0.015581  -2.242   0.0254 *  
## Sizesmall               -0.008869   0.025439  -0.349   0.7275    
## Revenuemedium-bus        0.009260   0.017177   0.539   0.5900    
## Revenuemicro-bus         0.016576   0.054486   0.304   0.7611    
## Revenuesmall-bus         0.040113   0.032479   1.235   0.2173    
## age                     -0.043540   0.025482  -1.709   0.0881 .  
## python_yn                0.027971   0.012961   2.158   0.0313 *  
## R_yn                    -0.005018   0.093279  -0.054   0.9571    
## spark                    0.029546   0.014926   1.979   0.0482 *  
## aws                      0.005995   0.013887   0.432   0.6661    
## excel                    0.011519   0.011040   1.043   0.2972    
## job_simpdata engineer    0.133222   0.021668   6.148 1.48e-09 ***
## job_simpdata scientist   0.183395   0.018803   9.753  < 2e-16 ***
## job_simpdirector         0.419384   0.046065   9.104  < 2e-16 ***
## job_simpmanager          0.046901   0.033508   1.400   0.1622    
## job_simpmle              0.240755   0.038304   6.285 6.55e-10 ***
## job_simpOther tech jobs  0.074923   0.018951   3.954 8.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1291 on 564 degrees of freedom
## Multiple R-squared:  0.3344, Adjusted R-squared:  0.3108 
## F-statistic: 14.17 on 20 and 564 DF,  p-value: < 2.2e-16
# Predict Salary
pred_sal_slm<-predict(slmModel,sal_test_x)

# Evaluate Model
mse <- mean((sal_test_y - pred_sal_slm)^2)
mae <- MAE(sal_test_y, pred_sal_slm)
rmse <-RMSE(sal_test_y,pred_sal_slm)
r2 <- R2(sal_test_y, pred_sal_slm)
model_metrics_lm <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_lm)<-"Linear Regression"
model_metrics_lm
##                          mse       mae      rmse        r2
## Linear Regression 0.02083315 0.1029561 0.1443369 0.3502413

Random Forest

#Create Random Forest Model
set.seed(20)
RFModel = randomForest(x = sal_train_x,
                            y = sal_train_y,
                            ntree = 500)

#Predict Salary
pred_sal_RF<-predict(RFModel,sal_test_x)

#Evaluate Model
mse = mean((sal_test_y - pred_sal_RF)^2)
mae = MAE(sal_test_y , pred_sal_RF)
rmse =RMSE(sal_test_y ,pred_sal_RF)
r2 = R2(sal_test_y, pred_sal_RF)
model_metrics_RF <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_RF)<-"random forest"
model_metrics_RF
##                      mse        mae      rmse        r2
## random forest 0.01735091 0.09309639 0.1317229 0.4557773
overall<-rbind(model_metrics_lm,model_metrics_RF)

Gradient Boost

#Create gbm model
set.seed(20)
gbmModel<-gbm(avg_salary~.,data=sal_train,distribution="gaussian")

#Predict Salary
pred_sal_gbm<-predict(gbmModel,sal_test_x)
## Using 100 trees...
#Evaluate Model
mse = mean((sal_test_y - pred_sal_gbm)^2)
mae = MAE(sal_test_y , pred_sal_gbm)
rmse =RMSE(sal_test_y ,pred_sal_gbm)
r2 = R2(sal_test_y, pred_sal_gbm)
model_metrics_gbm <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_gbm)<-"gbm"
model_metrics_gbm
##           mse       mae     rmse        r2
## gbm 0.0204676 0.1027295 0.143065 0.3675479
overall<-rbind(overall,model_metrics_gbm)

XGBoost

#XGboost
set.seed(20)
xgb_train = xgb.DMatrix(data = data.matrix(sal_train_x), label = sal_train_y)
xgb_test = xgb.DMatrix(data = data.matrix(sal_test_x), label = sal_test_y)
xgboostModel = xgboost(data = xgb_train, max.depth = 3, nrounds = 100, verbose = 0)

#Predict Salary
pred_sal_xgb = predict(xgboostModel, xgb_test)
#Evaluate Model
mse = mean((sal_test_y - pred_sal_xgb)^2)
mae = MAE(sal_test_y , pred_sal_xgb)
rmse =RMSE(sal_test_y ,pred_sal_xgb)
r2 = R2(sal_test_y, pred_sal_xgb)
model_metrics_xgb <- cbind(mse,mae,rmse,r2)
row.names(model_metrics_xgb)<-"XGBoost"
model_metrics_xgb
##                mse        mae      rmse        r2
## XGBoost 0.01931272 0.09765352 0.1389702 0.4133631
overall<-rbind(overall,model_metrics_xgb)

Evaluation for Regression Models

overall
##                          mse        mae      rmse        r2
## Linear Regression 0.02083315 0.10295613 0.1443369 0.3502413
## random forest     0.01735091 0.09309639 0.1317229 0.4557773
## gbm               0.02046760 0.10272951 0.1430650 0.3675479
## XGBoost           0.01931272 0.09765352 0.1389702 0.4133631

Summary for Regression Model

Based on the evaluation results, random forest has the highest R2 and the lowest MAE, MSE and RMSE, so it is the best of the four models for salary prediction. However, none of the four models performs particularly well: the highest R2 is only about 0.46, from random forest. This may be due to several limitations of the dataset, listed below (a short sketch after the list translates the normalized RMSE back into dollars):

  1. The small size of the dataset: a total of 730 rows is insufficient to carry out accurate salary prediction. For comparison, Kablaoui and Salman (2022) successfully predicted salary with an accuracy of 83.2% using a dataset of 20,000 rows.

  2. Insufficient features: many other relevant factors are not captured in this dataset. Features such as working experience and level of education should be considered. For example, a salary prediction model with 98% accuracy was developed using degree, major, years of experience, industry and salary as features (Lothe et al., 2021).
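Because avg_salary was min-max normalized before modeling, the error metrics above are on a 0-1 scale. The sketch below (assuming cleaned_df still holds the un-normalized salaries in USD) translates the random forest RMSE from the table back into dollars; for min-max scaling, an error of e on the normalized scale corresponds to e * (max - min) on the original scale.

# Convert the normalized RMSE of the random forest model back to USD
sal_range <- max(cleaned_df$avg_salary) - min(cleaned_df$avg_salary)
rmse_usd  <- 0.1317229 * sal_range
rmse_usd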

Job Satisfaction Prediction - Classification

df3 <- read.csv("edaData_cleaned2.csv")
# Drop the row-index column X and keep the 12 cleaned variables
df3 <- df3 %>% 
  dplyr::select(2:13)

df3 <- df3 %>% 
  mutate (Rating = as.factor(Rating),
          Size = as.factor(Size),
          Revenue = as.factor(Revenue),
          job_simp = as.factor(job_simp))
#Function 
norm_minmax <- function(x){
  (x- min(x)) /(max(x)-min(x))
}

df3[sapply(df3, is.numeric)] <- lapply(df3[sapply(df3, is.numeric)],norm_minmax)

Splitting the data into training (80%) and test (20%) sets.

set.seed(20)
indice<-createDataPartition(y=df3$Rating,p=0.8,list=FALSE)
rat_train<-df3[indice,]
rat_test<-df3[-indice,]

# Column 1 is Rating, the classification target
rat_train_x = rat_train[, -1]
rat_train_y = rat_train[,1]
rat_test_x = rat_test[,-1]
rat_test_y = rat_test[,1]

Random Forest

#Training Model and Predicting with Random Forest 
set.seed(20)
rat_model_rf <-randomForest(as.factor(Rating)~., data=rat_train)
rat_model_rf_pred <- predict(rat_model_rf, rat_test_x)
confusionMatrix(rat_model_rf_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Dissatisfied Neutral Satisfied Very Satisfied
##   Dissatisfied              3       0         0              0
##   Neutral                   0      25         5              1
##   Satisfied                 1      16        82              8
##   Very Satisfied            0       1         1              2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7724          
##                  95% CI : (0.6955, 0.8379)
##     No Information Rate : 0.6069          
##     P-Value [Acc > NIR] : 1.776e-05       
##                                           
##                   Kappa : 0.5332          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity                      0.75000         0.5952           0.9318
## Specificity                      1.00000         0.9417           0.5614
## Pos Pred Value                   1.00000         0.8065           0.7664
## Neg Pred Value                   0.99296         0.8509           0.8421
## Prevalence                       0.02759         0.2897           0.6069
## Detection Rate                   0.02069         0.1724           0.5655
## Detection Prevalence             0.02069         0.2138           0.7379
## Balanced Accuracy                0.87500         0.7685           0.7466
##                      Class: Very Satisfied
## Sensitivity                        0.18182
## Specificity                        0.98507
## Pos Pred Value                     0.50000
## Neg Pred Value                     0.93617
## Prevalence                         0.07586
## Detection Rate                     0.01379
## Detection Prevalence               0.02759
## Balanced Accuracy                  0.58345

Extreme Gradient Boosting

#Train Model with Extreme Boost
set.seed(20)
# xgb.DMatrix requires numeric labels, so the Rating factor is supplied as the label and treated as its integer codes (1-4)
rat_xgboost_train <- xgb.DMatrix(data=data.matrix(rat_train_x), label = rat_train_y)
rat_xgboost_test <- xgb.DMatrix(data=data.matrix(rat_test_x), label=rat_test_y)
rat_xgboostModel <- xgboost(rat_xgboost_train, max.depth = 3, nrounds=100, verbose=0)
#Make predictions with Extreme Boost
# Predictions are continuous, so they are capped at 4, rounded to the nearest code and mapped back to the Rating levels
rat_model_xgb_pred <- predict(rat_xgboostModel, rat_xgboost_test)
rat_model_xgb_pred[(rat_model_xgb_pred > 4)] = 4
rat_model_xgb_pred_asfactor <- as.factor((levels(rat_test_y))[round(rat_model_xgb_pred)])
confusionMatrix(rat_model_xgb_pred_asfactor, rat_test_y)
## Warning in levels(reference) != levels(data): longer object length is not a
## multiple of shorter object length
## Warning in confusionMatrix.default(rat_model_xgb_pred_asfactor, rat_test_y):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Dissatisfied Neutral Satisfied Very Satisfied
##   Dissatisfied              0       0         0              0
##   Neutral                   3      27        15              1
##   Satisfied                 0      15        72              9
##   Very Satisfied            1       0         1              1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6897          
##                  95% CI : (0.6076, 0.7638)
##     No Information Rate : 0.6069          
##     P-Value [Acc > NIR] : 0.02405         
##                                           
##                   Kappa : 0.3851          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity                      0.00000         0.6429           0.8182
## Specificity                      1.00000         0.8155           0.5789
## Pos Pred Value                       NaN         0.5870           0.7500
## Neg Pred Value                   0.97241         0.8485           0.6735
## Prevalence                       0.02759         0.2897           0.6069
## Detection Rate                   0.00000         0.1862           0.4966
## Detection Prevalence             0.00000         0.3172           0.6621
## Balanced Accuracy                0.50000         0.7292           0.6986
##                      Class: Very Satisfied
## Sensitivity                       0.090909
## Specificity                       0.985075
## Pos Pred Value                    0.333333
## Neg Pred Value                    0.929577
## Prevalence                        0.075862
## Detection Rate                    0.006897
## Detection Prevalence              0.020690
## Balanced Accuracy                 0.537992

Naive Bayes

#Train Model and Predicting with Naive Bayes 
set.seed(20)
rat_model_nb <- naiveBayes(as.factor(Rating)~., data =rat_train)
rat_model_nb_pred <- predict(rat_model_nb, rat_test_x)
confusionMatrix(rat_model_nb_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Dissatisfied Neutral Satisfied Very Satisfied
##   Dissatisfied              3       4         8              0
##   Neutral                   0      24         9              0
##   Satisfied                 0       7        42              1
##   Very Satisfied            1       7        29             10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5448          
##                  95% CI : (0.4601, 0.6277)
##     No Information Rate : 0.6069          
##     P-Value [Acc > NIR] : 0.946           
##                                           
##                   Kappa : 0.3473          
##                                           
##  Mcnemar's Test P-Value : 2.483e-08       
## 
## Statistics by Class:
## 
##                      Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity                      0.75000         0.5714           0.4773
## Specificity                      0.91489         0.9126           0.8596
## Pos Pred Value                   0.20000         0.7273           0.8400
## Neg Pred Value                   0.99231         0.8393           0.5158
## Prevalence                       0.02759         0.2897           0.6069
## Detection Rate                   0.02069         0.1655           0.2897
## Detection Prevalence             0.10345         0.2276           0.3448
## Balanced Accuracy                0.83245         0.7420           0.6685
##                      Class: Very Satisfied
## Sensitivity                        0.90909
## Specificity                        0.72388
## Pos Pred Value                     0.21277
## Neg Pred Value                     0.98980
## Prevalence                         0.07586
## Detection Rate                     0.06897
## Detection Prevalence               0.32414
## Balanced Accuracy                  0.81649

Gaussian Naive Bayes

#Train Model with Gaussian Naive Bayes 
set.seed(20)
rat_model_gnb <- gaussian_naive_bayes(x = data.matrix(rat_train_x), y = rat_train_y)
rat_model_gnb_pred <- predict(rat_model_gnb, data.matrix(rat_test_x))
confusionMatrix(rat_model_gnb_pred, as.factor(rat_test_y))
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Dissatisfied Neutral Satisfied Very Satisfied
##   Dissatisfied              3      11        16              0
##   Neutral                   0      10         4              0
##   Satisfied                 0      14        37              3
##   Very Satisfied            1       7        31              8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4             
##                  95% CI : (0.3196, 0.4846)
##     No Information Rate : 0.6069          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1617          
##                                           
##  Mcnemar's Test P-Value : 8.272e-12       
## 
## Statistics by Class:
## 
##                      Class: Dissatisfied Class: Neutral Class: Satisfied
## Sensitivity                      0.75000        0.23810           0.4205
## Specificity                      0.80851        0.96117           0.7018
## Pos Pred Value                   0.10000        0.71429           0.6852
## Neg Pred Value                   0.99130        0.75573           0.4396
## Prevalence                       0.02759        0.28966           0.6069
## Detection Rate                   0.02069        0.06897           0.2552
## Detection Prevalence             0.20690        0.09655           0.3724
## Balanced Accuracy                0.77926        0.59963           0.5611
##                      Class: Very Satisfied
## Sensitivity                        0.72727
## Specificity                        0.70896
## Pos Pred Value                     0.17021
## Neg Pred Value                     0.96939
## Prevalence                         0.07586
## Detection Rate                     0.05517
## Detection Prevalence               0.32414
## Balanced Accuracy                  0.71811

Summary for Classification Model

Based on the performance of the four classification models, random forest achieves the highest accuracy (0.7724, compared with 0.6897 for XGBoost, 0.5448 for naive Bayes and 0.4000 for Gaussian naive Bayes), so it is the best model for job satisfaction classification.
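To mirror the side-by-side comparison used for the regression models, the overall accuracy and Kappa of each classifier can be collected into a single table (a sketch, assuming the prediction objects created above are still in memory; the XGBoost predictions may trigger the same level-ordering warning as before):

# Gather overall accuracy and Kappa from each model's confusion matrix
acc <- function(pred, truth) confusionMatrix(pred, as.factor(truth))$overall[c("Accuracy", "Kappa")]
class_overall <- rbind(
  "Random Forest"        = acc(rat_model_rf_pred, rat_test_y),
  "XGBoost"              = acc(rat_model_xgb_pred_asfactor, rat_test_y),
  "Naive Bayes"          = acc(rat_model_nb_pred, rat_test_y),
  "Gaussian Naive Bayes" = acc(rat_model_gnb_pred, rat_test_y)
)
class_overall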

Conclusion

For salary prediction (regression), random forest performed the best among the four models implemented. However, none of the four models performed particularly well overall, which may be attributed to the small dataset size as well as insufficient features.

For job satisfaction prediction, random forest also performed the best among the four models chosen, achieving an accuracy of 0.7724.

Overall, the project can be considered a success. Future work may include a more extensive dataset with a larger number of features.

Reference

Kablaoui, R., & Salman, A. (2022). Machine Learning Models for Salary Prediction Dataset using Python. 2022 International Conference on Electrical and Computing Technologies and Applications (ICECTA). https://doi.org/10.1109/icecta57148.2022.9990316

Khan, M. R., Ziauddin, Z., Jam, F. A., & Ramay, M. I. (2010). The impacts of organizational commitment on employee job performance. European Journal of Social Sciences, 15, 292–298. http://dx.doi.org/10.2139/ssrn.1570544

Lothe, P. M., Tiwari, P., Patil, N., Patil, S., & Patil, V. (2021). Salary Prediction Using Machine Learning. International Journal of Advance Scientific Research and Engineering Trends, 6(5), 199–202. https://doi.org/10.51319/2456-0774.2021.5.0047

Matbouli, Y.T., & Alghamdi, S.M. (2022). Statistical machine learning regression models for salary prediction featuring economy wide activities and occupations. Information, 13, 495. https://doi.org/10.3390/info13100495