1.0 Summary

Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. Beyond formal education, each data scientist is expected to carry a toolkit of the skills the market values most, which raises the central question:

What are the most valued data science skills?

The key objective of this project is to answer that question. A collaborative approach was used from the beginning to the end of the project to find the ‘most valued data science skills’. A data science job dataset was used to analyze the skills required across roughly 5,700 job postings. The team also analyzed trends and ran the data through several models before preparing a conclusion. These models are:

  • Classification Tree
  • Random Forest
  • Naive Bayes

2.0 Acceptance Criteria

  • Determine which tool(s) the group will use to collaborate effectively and to share code and project documentation
  • Determine what data to collect, where it can be found, and how to load it
  • Determine which of the collected data should reside in a relational database, as a set of normalized tables
  • Perform any needed tidying, transformations, and exploratory data analysis in R
  • Deliver all code and results, along with documentation of motivation, approach, and findings
  • Apply statistical methods and predictive models

3.0 The DataMiners

The DataMiners team extracts and discovers patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems, to answer the key question: What are the most valued data science skills? The team includes the members listed below, with their defined responsibilities for this project:

  • Ramnivas Singh - Data & Analytics Lead
  • Deepak Sharma - Data Scientist
  • Richard Zheng - Data Scientist
  • Tage Singh - Data Architect
  • Matthew Lucich - Data Modeler & Statistician

4.0 Our Approach

The DataMiners team collaborated at every stage of answering the question at hand:

  • Brainstorming our approach and where we might pull data from
  • Deciding on a dataset and then acquiring it
  • Tidying and transforming the dataset
  • Visualizing and analyzing the dataset
  • Running statistics and data models
  • Ultimately drawing conclusions

Two-pronged approach - with clearly defined tasks and regular stand-ups (Agile Scrum), each team member was onboarded with a given ask, approach, and expected final result. The work was split into two prongs:
1. One sub-team focused on sourcing and cleansing the data and making it available in a relational database
2. The second sub-team focused on data analysis, modeling, trends, and the applicability of statistics

4.1 Collaborative Tools

The team collaborated virtually using the following tools:

 - Slack
 - Google Docs
 - Zoom
 - AWS
 - Outlook
 - GitHub

4.2 Libraries & Capabilities used

Libraries:
 - aws.s3
 - caret
 - data.table
 - dplyr
 - DT
 - e1071
 - fastDummies
 - ggplot2
 - gtrendsR
 - httr
 - lares
 - mgsub
 - plotly
 - randomForest
 - readr
 - rlist
 - RMySQL
 - rpart
 - rpart.plot
 - stats
 - stringr
 - tidyverse
 - wordcloud

Capabilities:
 - AWS RDS MySQL
 - AWS S3
 - GitHub
 - Google Trends
 - RPubs.com
 - RStudio

5.0 Implementation

5.1 Load Libraries

library(mgsub)
library(ggplot2)
library(plotly)
library(tidyverse)
library(dplyr)
library(readr)
library(RMySQL)
library(DT)
library(lares)
library(httr)
library(stringr)
library(data.table)
library(aws.s3)
library(rlist)
library(wordcloud)
library(gtrendsR)
library(stats)
library(rpart)
library(fastDummies)
library(rpart.plot)
library(randomForest)
library(e1071)
library(caret)

5.2 Initialize AWS S3 Object Access

# Credentials are kept outside the repo and retrieved with lares::get_creds()
bucket <- get_creds()$`aws.s3`$bucket
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = get_creds()$`aws.s3`$accessKeyId,
  "AWS_SECRET_ACCESS_KEY" = get_creds()$`aws.s3`$accessKey,
  "AWS_DEFAULT_REGION" = get_creds()$`aws.s3`$region)

5.3 Data Source & Acquisition

To answer the key question, the team used the Indeed Dataset - Data Scientist/Analyst/Engineer (see References).

5.4 Loading data

data<-s3read_using(FUN = read.csv, object = "indeed_job_dataset.csv", bucket = bucket)
dim(data)
## [1] 5715   43

5.5 Data Transformation

# Function that transforms a "Python list" string into an R character vector
clean_string <- function(string){
  list <- str_replace(string, "\\[", "") %>%
    str_replace("\\]", "") %>%
    str_replace_all("\\'", "") %>%
    str_split(",")
  return(trimws(list[[1]]))
}
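
As a quick illustration, assuming the raw Skill strings are Python-style list literals (which is what the cleaning logic implies), the helper turns one into a character vector:

clean_string("['SAP', 'SQL']")
## [1] "SAP" "SQL"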

5.6 Cleaning data

Job Title

# subset job titles
job_title = unique(data[c('Job_Title','Job_Type')])
head(job_title)
##                                     Job_Title       Job_Type
## 1                              Data Scientist data_scientist
## 4   Graduate Studies Program - Data Scientist data_scientist
## 5                            Data Scientist I data_scientist
## 6 Data Scientist - Entry to Experienced Level data_scientist
## 7                   Geospatial Data Scientist data_scientist
## 9               Bioinformatics Data Scientist data_scientist

Company

# subset companies
companies = unique(data[c('Company','No_of_Reviews','No_of_Stars','Company_Revenue','Company_Employees','Company_Industry')])
# drop nulls
companies = drop_na(companies)
head(companies)
##                          Company No_of_Reviews No_of_Stars      Company_Revenue
## 1                Express Scripts          3301         3.3 More than $10B (USD)
## 2                       comScore            62         3.5                     
## 3    Central Intelligence Agency           158         4.3                     
## 4 Federal Reserve Bank of Dallas           495         4.1                     
## 5       National Security Agency           173         4.3                     
## 6                    NYC Careers            30         3.8                     
##   Company_Employees             Company_Industry
## 1           10,000+                  Health Care
## 2                                               
## 3                                     Government
## 4  Less than 10,000 Banks and Financial Services
## 5                                               
## 6                                     Government

Jobs

# subset jobs
jobs = data[c('X','Job_Title','Company','Queried_Salary','Date_Since_Posted','Location','Link')]
head(jobs)
##   X                                   Job_Title                        Company
## 1 0                              Data Scientist                Express Scripts
## 2 1                              Data Scientist  Money Mart Financial Services
## 3 2                              Data Scientist                       comScore
## 4 3   Graduate Studies Program - Data Scientist    Central Intelligence Agency
## 5 4                            Data Scientist I Federal Reserve Bank of Dallas
## 6 5 Data Scientist - Entry to Experienced Level       National Security Agency
##   Queried_Salary Date_Since_Posted Location
## 1         <80000                 1       MO
## 2         <80000                15       TX
## 3         <80000                 1       OR
## 4         <80000                30       DC
## 5         <80000                30       TX
## 6         <80000                30       MD
##                                                                             Link
## 1 https://www.indeed.com/rc/clk?jk=6a105f495c36afe3&fccid=281a27deb89bbe92&vjs=3
## 2 https://www.indeed.com/rc/clk?jk=86afd561ea8c60bc&fccid=b584cf601069b5d0&vjs=3
## 3 https://www.indeed.com/rc/clk?jk=e0aad317e6d45d9b&fccid=b39fb6f124ce8586&vjs=3
## 4 https://www.indeed.com/rc/clk?jk=1cfdd9e391a6328a&fccid=e9870e3159e9c6ac&vjs=3
## 5 https://www.indeed.com/rc/clk?jk=fec647775a21ecc6&fccid=2c6850e24c8a2811&vjs=3
## 6 https://www.indeed.com/rc/clk?jk=835a2db493898f25&fccid=e8f18ca6180ec8da&vjs=3

Skills

data = data[data$No_of_Skills>0,]
x = data$X
strings = data$Skill
count = c()
X = c()
skill_list = c()
# create vector count
for (string in strings){
  count = c(count,length(clean_string(string)))
}
#create skills dataframe
for (i in seq(length(count))){
  X = c(X,rep(x[i],count[i]))
  skill_list = c(skill_list,clean_string(strings[i]))
}
skills = data.frame(X,skill_list)
head(skills)
##   X       skill_list
## 1 0              SAP
## 2 0              SQL
## 3 1 Machine Learning
## 4 1                R
## 5 1              SAS
## 6 1              SQL

5.7 Download tables as csv

write.csv(job_title,"job_title.csv",row.names=FALSE)
write.csv(companies,"Company.csv",row.names=FALSE)
write.csv(jobs,"Job_master.csv",row.names=FALSE)
write.csv(skills,"skills.csv",row.names=FALSE)
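
Since the raw file already lives in S3, the cleaned extracts can be pushed back to the same bucket for sharing; a minimal sketch using aws.s3::s3write_using (object names here are illustrative):

# Write the skills extract back to the shared S3 bucket
s3write_using(skills, FUN = write.csv, row.names = FALSE,
              object = "skills.csv", bucket = bucket)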

These CSVs are then loaded into the database tables described in the next section.

5.8 Data Tables

Listed below are the database tables used:

Company           : Company that posted the job posting
Company_Revenue   : Annual revenue of hiring company
Company_Employees : Employee count of hiring company
Company_Industry  : Industry of hiring company
State             : State the job opening is located in
Job_master        : Job details
Skills            : Skills master list
Salary            : Salary offered
Job_skills        : Skills needed for posted job

5.9 Create data tables

CREATE TABLE `company` (
  `company_id` int NOT NULL,
  `company` text,
  PRIMARY KEY (`company_id`)
)

CREATE TABLE `company_employee` (
  `comp_emp_id` int NOT NULL,
  `no_of_emp` text,
  PRIMARY KEY (`comp_emp_id`)
) 
CREATE TABLE `company_industry` (
  `id` int NOT NULL,
  `industry` text,
  PRIMARY KEY (`id`)
) 

CREATE TABLE `company_revenue` (
  `revenue_id` int NOT NULL,
  `revenue_range` text,
  PRIMARY KEY (`revenue_id`)
) 

CREATE TABLE `job_master` (
  `job_id` int NOT NULL,
  `job_title` text,
  `salary_id` int DEFAULT NULL,
  `comp_id` int DEFAULT NULL,
  `state_id` int DEFAULT NULL,
  `comp_rev_id` int DEFAULT NULL,
  `comp_emp_id` int DEFAULT NULL,
  `comp_ind_id` int DEFAULT NULL,
  PRIMARY KEY (`job_id`),
  KEY `FK_JOB_STATE_idx` (`state_id`),
  KEY `FK_COMPANY_REV_idx` (`comp_rev_id`),
  KEY `FK_COMPANY_EMPLOYEE_idx` (`comp_emp_id`),
  KEY `FK_COMPANY_INDUSTRY_idx` (`comp_ind_id`),
  KEY `FK_SALARY_idx` (`salary_id`),
  KEY `FK_COMPANY_idx` (`comp_id`),
  CONSTRAINT `FK_COMPANY` FOREIGN KEY (`comp_id`) REFERENCES `company` (`company_id`),
  CONSTRAINT `FK_COMPANY_EMPLOYEE` FOREIGN KEY (`comp_emp_id`) REFERENCES `company_employee` (`comp_emp_id`),
  CONSTRAINT `FK_COMPANY_INDUSTRY` FOREIGN KEY (`comp_ind_id`) REFERENCES `company_industry` (`id`),
  CONSTRAINT `FK_COMPANY_REV` FOREIGN KEY (`comp_rev_id`) REFERENCES `company_revenue` (`revenue_id`),
  CONSTRAINT `FK_JOB_STATE` FOREIGN KEY (`state_id`) REFERENCES `state` (`state_id`),
  CONSTRAINT `FK_SALARY` FOREIGN KEY (`salary_id`) REFERENCES `salary` (`salary_id`)
) 

CREATE TABLE `job_skills` (
  `job_id` int DEFAULT NULL,
  `job_title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
  `skill_id` int DEFAULT NULL,
  KEY `FK_JOB_idx` (`skill_id`),
  KEY `FK_JOB_ID_idx` (`job_id`),
  CONSTRAINT `FK_JOB_ID` FOREIGN KEY (`job_id`) REFERENCES `job_master` (`job_id`),
  CONSTRAINT `FK_JOB_SKILLS` FOREIGN KEY (`skill_id`) REFERENCES `skills` (`skill_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `salary` (
  `salary_id` int NOT NULL,
  `salary_range` text,
  PRIMARY KEY (`salary_id`)
) 

CREATE TABLE `skills` (
  `skill_id` int NOT NULL,
  `skill_desc` text,
  PRIMARY KEY (`skill_id`)
) 

CREATE TABLE `state` (
  `state_id` int NOT NULL,
  `state_district` text,
  `postal_code` text,
  PRIMARY KEY (`state_id`)
) 

Load RDS credentials from secret management

user     <- get_creds()$`aws.rds`$user
password <- get_creds()$`aws.rds`$creds
host     <- get_creds()$`aws.rds`$host
dbname   <- get_creds()$`aws.rds`$schema

Initialize RDS Database connection

connection = dbConnect(MySQL(), user = user, password = password, dbname = dbname, host = host)
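
With the connection open, the CSV extracts from section 5.7 can be appended into their matching tables; a minimal sketch using DBI's dbWriteTable (this assumes the data frame columns have already been aligned to the table schema):

# Append the skills extract into the `skills` table; column names must match
dbWriteTable(connection, "skills", skills, append = TRUE, row.names = FALSE)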

Get data function

This common helper loads data from a given table:

get_data <- function(table_name) {
  dataset <- tbl(connection, dbplyr::in_schema(dbname, table_name))
  return(dataset)
}

Load all relevant records into data frames:

d607_p3_all_recs<-as.data.frame(get_data('view_d607_p3_all_recs'))
tblcompany<-as.data.frame(get_data('company'))
tblcemployee<-as.data.frame(get_data('company_employee'))
tblcindustry<-as.data.frame(get_data('company_industry'))
tblcrevenue<-as.data.frame(get_data('company_revenue'))
tbljobmaster<-as.data.frame(get_data('job_master'))
tbljobskills<-as.data.frame(get_data('job_skills'))
tblsalary<-as.data.frame(get_data('salary'))
tblskills<-as.data.frame(get_data('skills'))
tblstate<-as.data.frame(get_data('state'))
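
The view view_d607_p3_all_recs denormalizes the schema for analysis. Its definition lives in MySQL and is not shown here, but based on the tables above and the columns used downstream, it presumably joins job_master to its dimension tables and to job_skills/skills, roughly equivalent to this sketch:

# Hedged reconstruction of the view's join path (the actual definition is in MySQL)
d607_p3_equivalent <- dbGetQuery(connection, "
  SELECT jm.job_id, jm.job_title, c.company, sa.salary_range, st.postal_code,
         cr.revenue_range, ce.no_of_emp, ci.industry, sk.skill_id, sk.skill_desc
  FROM job_master jm
  JOIN company c           ON jm.comp_id     = c.company_id
  JOIN salary sa           ON jm.salary_id   = sa.salary_id
  JOIN state st            ON jm.state_id    = st.state_id
  JOIN company_revenue cr  ON jm.comp_rev_id = cr.revenue_id
  JOIN company_employee ce ON jm.comp_emp_id = ce.comp_emp_id
  JOIN company_industry ci ON jm.comp_ind_id = ci.id
  JOIN job_skills js       ON js.job_id      = jm.job_id
  JOIN skills sk           ON js.skill_id    = sk.skill_id")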

6.0 Exploratory Data Analysis

6.2 Wordcloud sized by skill frequency

skill_count = skills %>%
  group_by(skill_list)%>%
  summarise(frequency = n())%>%
  arrange(desc(frequency))
wordcloud(skill_count$skill_list,skill_count$frequency,min.freq = 5,
          max.words=150, random.order=FALSE, rot.per=0.1,
          colors=brewer.pal(8, "Dark2"))
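
The same counts can also be inspected as a table to complement the word cloud:

# Top 10 skills by frequency across all postings
head(skill_count, 10)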


Findings:

  • The wordcloud shows that Python, Machine Learning, R, and SQL are the most valuable skills
  • Hadoop, Spark, Java, and AWS are other skills frequently required for data science jobs
  • C/C++ is still mentioned in job postings and appears to remain a required skill

6.3 Just looking at jobs in NY

ny = jobs[jobs["Location"] == "NY",]
ny_skills = merge(ny, skills, by = 'X')
num_jobs_ny = dim(ny)[1]

ny_skills %>% 
  group_by(skill_list)%>%
  summarise(percent = n()/num_jobs_ny)%>%
  arrange(desc(percent))%>%
  head(10)%>%
  ggplot(aes(reorder(skill_list,percent),percent))+
  geom_bar(stat = 'identity', fill = '#C4A9C4')+
  coord_flip()+labs(x = 'Skills',title = "Most Desired Data Science Skills in NY")


Findings:

  • For jobs in NY, SQL and Python are desired for most roles (over 50%)
  • R and Machine Learning are necessary for nearly 40% of roles
  • These results are very similar to our results for the entire United States

6.4 Salary and job skill analysis

derive_salary <- function(s) {
  contains_range <- str_detect(s, '-')
  if (contains_range == TRUE) {
    # A range like "80000-99999" maps to its midpoint
    text_range <- unlist(str_extract_all(s, '[0-9,]+'))
    numeric_range <- as.numeric(str_remove_all(text_range, ','))
    return(mean(c(numeric_range[1], numeric_range[2])))
  } else {
    # A one-sided bound like "<80000" maps to its single number
    as_text <- str_extract_all(s, '[0-9,]+')
    as_numeric <- as.numeric(str_remove_all(as_text, ','))
    return(as_numeric)
  }
}
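# Quick sanity check (illustrative): a range string maps to its midpoint,
# while a one-sided bound maps to the single number it contains
derive_salary("80000-99999")   # 89999.5
derive_salary("<80000")        # 80000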
tblsalary$salary<-as.numeric(lapply(tblsalary$salary_range, derive_salary))
jobmaster_salary = merge(x=tbljobmaster,y=tblsalary,by="salary_id")
ggplot(jobmaster_salary,aes(x=as.factor(salary)))+
geom_bar()

quantile(jobmaster_salary$salary, c(0, .1, .25, .5, .75, .9, .95, .99))
##       0%      10%      25%      50%      75%      90%      95%      99% 
##  80000.0  80000.0  89999.5 109999.5 129999.5 149999.5 160000.0 160000.0
head(jobmaster_salary)
##   salary_id job_id                                   job_title comp_id state_id
## 1         1      1                              Data Scientist     761       26
## 2         1      2                              Data Scientist    1336       44
## 3         1      3                              Data Scientist     523       38
## 4         1      4   Graduate Studies Program - Data Scientist     418        9
## 5         1      5                            Data Scientist I     782       44
## 6         1      6 Data Scientist - Entry to Experienced Level    1374       21
##   comp_rev_id comp_emp_id comp_ind_id salary_range salary
## 1           4           1          14       <80000  80000
## 2           5           3          34       <80000  80000
## 3           5           3          34       <80000  80000
## 4           5           3          13       <80000  80000
## 5           5           2           4       <80000  80000
## 6           5           3          34       <80000  80000


Findings:

  • Based on the quantiles above, about 50% of postings have a salary midpoint at or below roughly $110K
  • Roughly 40% of postings fall between about $80K and $110K

6.5 Count job postings by state

# count job postings by state
jobmaster_state <- merge(x=tbljobmaster,y=tblstate,by="state_id")
df_jobs <- jobmaster_state %>% group_by(state_district,postal_code) %>% dplyr::summarize(n = n())
df_jobs$hover <- with(df_jobs, paste(state_district, '<br>', "jobs:", n))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)
# plot the map
plot_geo(df_jobs, locationmode = 'USA-states') %>%
  add_trace(
    z = ~df_jobs$n,
    locations = ~df_jobs$postal_code
  ) %>%colorbar(title = "Data scientist Job Postings") %>%
  layout(title = 'Data scientist Jobs by State',geo = g)


Findings:

  • California and New York are the top states, with the most data science job postings
  • Arkansas is at the bottom of the list, with very few data science job postings

8.0 Data Models: Classification Tree, Random Forest & Naive Bayes

One use case of analyzing the most sought-after data science skills is predicting a salary range from one’s current skill set, or from the skill set one intends to cultivate. This is a classification task: we feed in an individual’s skills and location and return an estimated salary range. The task required extensive data preparation – dummy variables for over a hundred attributes – in order to fit the models of interest. Those models were: classification tree, Random Forest, and Naive Bayes.

8.1 Data preparation for modeling

# Count observations per skill; top_skills is assumed here to mirror
# top_locations below, as its definition is not shown in this section
top_skills <- d607_p3_all_recs %>% select(skill_id, skill_desc) %>% 
                                   group_by(skill_desc) %>% 
                                   summarise(n = n()) %>%
                                   arrange(desc(n))

# Include skills with at least 100 observations
skills_hundred_obs <- top_skills %>% filter(n >= 100)
df_jobs_skills_hundred_obs <- d607_p3_all_recs %>% filter(skill_desc %in% skills_hundred_obs$skill_desc)

# Create dummy variables for skills with at least 100 observations
df_jobs_skills_hundred_obs_dummies <- dummy_cols(df_jobs_skills_hundred_obs, 
                                      select_columns = c("skill_desc"), remove_first_dummy = TRUE)

# Remove unneeded columns
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>% 
                                      select(-(c("job_title", "company", "no_of_emp", 
                                         "revenue_range", "industry", 
                                         "postal_code", "skill_id", "skill_desc")))

# Group by job_id and for each column use max value (0 or 1) as aggregate value
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>% 
                                                      group_by(job_id) %>% 
                                                      summarise(across(everything(), max))

# Find top locations by observation count
top_locations <- d607_p3_all_recs %>% select(skill_id, postal_code) %>% 
                                      group_by(postal_code) %>% 
                                      summarise(n = n()) %>%
                                      arrange(desc(n))

# Include locations with at least 100 observations
locations_hundred_obs <- top_locations %>% filter(n >= 100)
df_jobs_locs_hundred_obs <- d607_p3_all_recs %>% 
                                 filter(postal_code %in% locations_hundred_obs$postal_code)

# Create dummy variables for locations with at least 100 observations
df_jobs_locs_hundred_obs_dummies <- dummy_cols(df_jobs_locs_hundred_obs, 
                                      select_columns = c("postal_code"), remove_first_dummy = TRUE)

# Remove unneeded columns
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>% 
                                    select(-(c("job_title", "company", "no_of_emp", 
                                         "salary_range", "revenue_range", "industry", 
                                         "postal_code", "skill_id", "skill_desc")))

# Group by job_id and for each column use max value of one of the rows as aggregate value
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>% 
                                                      group_by(job_id) %>% 
                                                      summarise(across(everything(), max))

# Join skills and locations on job_id
df_model <- df_jobs_skills_hundred_obs_dummies %>% inner_join(df_jobs_locs_hundred_obs_dummies, by = "job_id")

# Rename problematic column names
df_model <- df_model %>% rename(skill_desc_C = "skill_desc_C/C++",
                                        skill_desc_C_Sharp = "skill_desc_C#",
                                        skill_desc_CICD = "skill_desc_CI/CD",
                                        skill_desc_TSSCIClearance = "skill_desc_TS/SCIClearance")

8.2 Create train test split

# Create train test split (seed set so the sampled split is reproducible)
set.seed(607)
train <- df_model %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(df_model, train, by = 'job_id')

8.3 Reorder target variable levels

# Reorder target variable levels
positions <- c("<80000", "80000-99999", "100000-119999", "120000-139999", "140000-159999", ">160000")
df_model$salary_range <- factor(df_model$salary_range, levels=positions)
# To get an idea of minimum accuracy,
# review the distribution / class sizes of the target variable by percentage
ggplot(df_model, aes(x = salary_range)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill="turquoise3") + 
  ggtitle("Salary Range by Percentage") + xlab("Salary Range") + ylab("Percentage")

8.4 Fit model training data

# Fit model training data
model_ct <- rpart(salary_range ~ . - job_id, method="class", data=train)

# Plot classification tree
rpart.plot(model_ct, main="Classification Tree for Salary Range")

# Make predictions on test data
preds_ct <- predict(model_ct, as.data.frame(test), type = "class")

# Review Confusion Matrix
confusionMatrix(preds_ct, as.factor(test$salary_range))
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      <80000 >160000 100000-119999 120000-139999 140000-159999
##   <80000           148      22            84            40            13
##   >160000            0       0             0             0             0
##   100000-119999     27      25           124           117            50
##   120000-139999      3      22            95           130            60
##   140000-159999      3      39            35            69            73
##   80000-99999        0       0             0             0             0
##                Reference
## Prediction      80000-99999
##   <80000                109
##   >160000                 0
##   100000-119999          75
##   120000-139999          10
##   140000-159999          23
##   80000-99999             0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3403          
##                  95% CI : (0.3154, 0.3658)
##     No Information Rate : 0.255           
##     P-Value [Acc > NIR] : 8.572e-13       
##                                           
##                   Kappa : 0.1815          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity                 0.8177        0.00000              0.36686
## Specificity                 0.7794        1.00000              0.72212
## Pos Pred Value              0.3558            NaN              0.29665
## Neg Pred Value              0.9663        0.92264              0.78119
## Prevalence                  0.1297        0.07736              0.24212
## Detection Rate              0.1060        0.00000              0.08883
## Detection Prevalence        0.2980        0.00000              0.29943
## Balanced Accuracy           0.7986        0.50000              0.54449
##                      Class: 120000-139999 Class: 140000-159999
## Sensitivity                       0.36517              0.37245
## Specificity                       0.81731              0.85917
## Pos Pred Value                    0.40625              0.30165
## Neg Pred Value                    0.78996              0.89341
## Prevalence                        0.25501              0.14040
## Detection Rate                    0.09312              0.05229
## Detection Prevalence              0.22923              0.17335
## Balanced Accuracy                 0.59124              0.61581
##                      Class: 80000-99999
## Sensitivity                      0.0000
## Specificity                      1.0000
## Pos Pred Value                      NaN
## Neg Pred Value                   0.8446
## Prevalence                       0.1554
## Detection Rate                   0.0000
## Detection Prevalence             0.0000
## Balanced Accuracy                0.5000

8.5 Fit Top Skills and Locations Random Forest Model

# Fit Top Skills and Locations Random Forest model (classification is
# inferred from the factor response, so no method argument is needed)
model_rf <- randomForest(as.factor(salary_range) ~ . - job_id, data=train)

# Make predictions on test data
preds_rf <- predict(model_rf, test)

# Review Confusion Matrix
confusionMatrix(preds_rf, as.factor(test$salary_range))
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      <80000 >160000 100000-119999 120000-139999 140000-159999
##   <80000           139      12            38            15             7
##   >160000            0      22             3             4            14
##   100000-119999     16      11           169            99            24
##   120000-139999      1      27            87           179            54
##   140000-159999      1      32            15            44            92
##   80000-99999       24       4            26            15             5
##                Reference
## Prediction      80000-99999
##   <80000                 63
##   >160000                 1
##   100000-119999          65
##   120000-139999          16
##   140000-159999           5
##   80000-99999            67
## 
## Overall Statistics
##                                          
##                Accuracy : 0.4785         
##                  95% CI : (0.452, 0.5051)
##     No Information Rate : 0.255          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3516         
##                                          
##  Mcnemar's Test P-Value : 8.272e-16      
## 
## Statistics by Class:
## 
##                      Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity                0.76796        0.20370               0.5000
## Specificity                0.88889        0.98292               0.7968
## Pos Pred Value             0.50730        0.50000               0.4401
## Neg Pred Value             0.96257        0.93639               0.8330
## Prevalence                 0.12966        0.07736               0.2421
## Detection Rate             0.09957        0.01576               0.1211
## Detection Prevalence       0.19628        0.03152               0.2751
## Balanced Accuracy          0.82842        0.59331               0.6484
##                      Class: 120000-139999 Class: 140000-159999
## Sensitivity                        0.5028               0.4694
## Specificity                        0.8221               0.9192
## Pos Pred Value                     0.4918               0.4868
## Neg Pred Value                     0.8285               0.9138
## Prevalence                         0.2550               0.1404
## Detection Rate                     0.1282               0.0659
## Detection Prevalence               0.2607               0.1354
## Balanced Accuracy                  0.6625               0.6943
##                      Class: 80000-99999
## Sensitivity                     0.30876
## Specificity                     0.93723
## Pos Pred Value                  0.47518
## Neg Pred Value                  0.88048
## Prevalence                      0.15544
## Detection Rate                  0.04799
## Detection Prevalence            0.10100
## Balanced Accuracy               0.62300

8.6 Fit Naive Bayes model

# Fit Naive Bayes model
model_nb <- naiveBayes(as.factor(salary_range)  ~ . - job_id, data = train)

# Make predictions on test data
preds_nb <- predict(model_nb, as.data.frame(test))

# Review Confusion Matrix
confusionMatrix(preds_nb, as.factor(test$salary_range))
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      <80000 >160000 100000-119999 120000-139999 140000-159999
##   <80000           141      17            64            22            11
##   >160000           33      81           183           219           149
##   100000-119999      0       0             9             4             0
##   120000-139999      0       0             2             7             3
##   140000-159999      4       9            74            96            32
##   80000-99999        3       1             6             8             1
##                Reference
## Prediction      80000-99999
##   <80000                108
##   >160000                82
##   100000-119999           3
##   120000-139999           2
##   140000-159999          17
##   80000-99999             5
## 
## Overall Statistics
##                                           
##                Accuracy : 0.197           
##                  95% CI : (0.1764, 0.2188)
##     No Information Rate : 0.255           
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1013          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity                 0.7790        0.75000             0.026627
## Specificity                 0.8173        0.48292             0.993384
## Pos Pred Value              0.3884        0.10843             0.562500
## Neg Pred Value              0.9613        0.95840             0.761594
## Prevalence                  0.1297        0.07736             0.242120
## Detection Rate              0.1010        0.05802             0.006447
## Detection Prevalence        0.2600        0.53510             0.011461
## Balanced Accuracy           0.7981        0.61646             0.510005
##                      Class: 120000-139999 Class: 140000-159999
## Sensitivity                      0.019663              0.16327
## Specificity                      0.993269              0.83333
## Pos Pred Value                   0.500000              0.13793
## Neg Pred Value                   0.747467              0.85911
## Prevalence                       0.255014              0.14040
## Detection Rate                   0.005014              0.02292
## Detection Prevalence             0.010029              0.16619
## Balanced Accuracy                0.506466              0.49830
##                      Class: 80000-99999
## Sensitivity                    0.023041
## Specificity                    0.983885
## Pos Pred Value                 0.208333
## Neg Pred Value                 0.845481
## Prevalence                     0.155444
## Detection Rate                 0.003582
## Detection Prevalence           0.017192
## Balanced Accuracy              0.503463


Findings:

Random Forest provides the best results across several metrics: sensitivity, specificity, and balanced accuracy by class, as well as overall kappa and accuracy. Both sensitivity and specificity are useful since there is no strong preference for either false positives or false negatives. Kappa and balanced accuracy are more informative than raw accuracy here because the classes are imbalanced. The Classification Tree model’s strongest point is naturally the easy-to-comprehend tree visual, which shows how select factors can lead to a particular salary range. Naive Bayes is the only model whose accuracy fails to beat the no-information rate (the largest class, approximately 25.5% of observations). It is, however, modestly useful for predicting the low and high extremes (e.g. <80000 and >160000), which may be the most important cases when using classifiers to predict a salary range.
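
To compare the models side by side, the relevant metrics can be pulled directly from caret’s confusionMatrix object rather than read off the printed output; a minimal sketch for the Random Forest predictions:

cm_rf <- confusionMatrix(preds_rf, as.factor(test$salary_range))
cm_rf$overall[c("Accuracy", "Kappa")]                                  # overall metrics
cm_rf$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")]  # per-class metrics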


9.0 Conclusion

  • Based on the dataset, Python, Machine Learning, R, and SQL are the most valuable skills
  • The average salary midpoint is between $100K and $110K, and most of the jobs are in California and New York
  • These findings and conclusions match closely with the models we ran
  • Trends - the key takeaways are the increasing interest in Python and machine learning over the last 11 years, as well as the stable interest in R and NoSQL
  • Models - our modeling section showed Random Forest to be the most reliable model for this classification task, based on several evaluation metrics
  • Kappa and balanced accuracy were our leading evaluators, though sensitivity and specificity were also useful since neither false positives nor false negatives are preferred over the other
  • Our analysis focused on software and technical skills; soft skills were not directly available from the job posts



References

Philippe Massicotte and Dirk Eddelbuettel (2021). gtrendsR: Perform and Display Google Trends Queries. R package version 1.4.8.
Terry Therneau, Beth Atkinson and Brian Ripley (2019). rpart: Recursive Partitioning for Classification, Regression and Survival Trees. R package version 4.1-15.
Leo Breiman and Adele Cutler (2018). randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.6-14.
Michal Majka (2020). naivebayes: High Performance Implementation of the Naive Bayes Algorithm. R package version 0.9.7.
Max Kuhn, Jed Wing, Steve Weston and Andre Williams (2020). caret: Misc Functions for Training and Plotting Classification and Regression Models. R package version 6.0-86.
Kenneth Tay (2020). “What is balanced accuracy?”
Indeed Dataset - Data Scientist/Analyst/Engineer (2018). Job postings dataset.
Agile & Scrum. “What is Scrum?”