Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. Beyond formal education, every data scientist is expected to carry the most valuable skills in their toolkit, which raises the question:
What are the most valued data science skills?
The key objective of this project is to answer this question. A collaborative approach was used from the beginning to the end of the project to identify the most valued data science skills. A data science job dataset was used to analyze the skills required across several thousand job postings. The team also analyzed trends over time and ran the data through several models to reach a conclusion: a classification tree, a random forest, and a naive Bayes classifier.
The DataMiners team set out to extract and discover patterns in a large data set, using methods at the intersection of machine learning, statistics, and database systems, to answer the key question: what are the most valued data science skills? The team members listed below, with their defined responsibilities, carried out this project.
The DataMiners team first collaborated to understand the question at hand.
The team then used a two-pronged approach. With clearly defined tasks and regular stand-ups (Agile Scrum), each team member was onboarded with a clear ask, an approach, and an expected result. Under this two-pronged approach:
1. One sub-team focused on sourcing and cleansing the data and making it available in a relational database
2. The second sub-team focused on data analysis, modeling, trends, and the applicability of statistical methods
The team collaborated virtually using the tools below:
- Slack
- Google Docs
- Zoom
- AWS
- Outlook
- GitHub
Libraries:
- aws.s3
- caret
- data.table
- dplyr
- DT
- e1071
- fastDummies
- ggplot2
- gtrendsR
- httr
- lares
- mgsub
- plotly
- randomForest
- readr
- rlist
- RMySQL
- rpart
- rpart.plot
- stats
- stringr
- tidyverse
- wordcloud
Capabilities:
- AWS RDS MySQL
- AWS S3
- GitHub
- Google Trends
- RPubs.com
- RStudio
library(mgsub)
library(ggplot2)
library(plotly)
library(tidyverse)
library(dplyr)
library(readr)
library(RMySQL)
library(DT)
library(lares)
library(httr)
library(stringr)
library(data.table)
library(aws.s3)
library(rlist)
library(wordcloud)
library(gtrendsR)
library(stats)
library(rpart)
library(fastDummies)
library(rpart.plot)
library(randomForest)
library(e1071)
library(caret)
# pull AWS S3 credentials from the local credentials store and expose them as environment variables
bucket <- get_creds()$`aws.s3`$bucket
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = get_creds()$`aws.s3`$accessKeyId,
  "AWS_SECRET_ACCESS_KEY" = get_creds()$`aws.s3`$accessKey,
  "AWS_DEFAULT_REGION" = get_creds()$`aws.s3`$region)
To answer the key question, the team used the Indeed Dataset - Data Scientist/Analyst/Engineer.
data<-s3read_using(FUN = read.csv, object = "indeed_job_dataset.csv", bucket = bucket)
dim(data)
## [1] 5715   43
# function that transforms a "python list" string into an R character vector
clean_string = function(string){
  list = str_replace(string,"\\[","") %>%
    str_replace("\\]","") %>%
    str_replace_all("\\'","") %>%
    str_split(",")
  return (trimws(list[[1]]))
}
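As a quick illustration (not part of the original chunk), the helper converts one of the dataset’s python-style skill strings into a plain character vector:
clean_string("['Machine Learning', 'R', 'SQL']")
## [1] "Machine Learning" "R" "SQL"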
# subset job titles
job_title = unique(data[c('Job_Title','Job_Type')])
head(job_title)
## Job_Title Job_Type
## 1 Data Scientist data_scientist
## 4 Graduate Studies Program - Data Scientist data_scientist
## 5 Data Scientist I data_scientist
## 6 Data Scientist - Entry to Experienced Level data_scientist
## 7 Geospatial Data Scientist data_scientist
## 9 Bioinformatics Data Scientist data_scientist
# subset companies
companies = unique(data[c('Company','No_of_Reviews','No_of_Stars','Company_Revenue','Company_Employees','Company_Industry')])
# drop nulls
companies = drop_na(companies)
head(companies)
## Company No_of_Reviews No_of_Stars Company_Revenue
## 1 Express Scripts 3301 3.3 More than $10B (USD)
## 2 comScore 62 3.5
## 3 Central Intelligence Agency 158 4.3
## 4 Federal Reserve Bank of Dallas 495 4.1
## 5 National Security Agency 173 4.3
## 6 NYC Careers 30 3.8
## Company_Employees Company_Industry
## 1 10,000+ Health Care
## 2
## 3 Government
## 4 Less than 10,000 Banks and Financial Services
## 5
## 6 Government
# subset jobs
jobs = data[c('X','Job_Title','Company','Queried_Salary','Date_Since_Posted','Location','Link')]
head(jobs)
## X Job_Title Company
## 1 0 Data Scientist Express Scripts
## 2 1 Data Scientist Money Mart Financial Services
## 3 2 Data Scientist comScore
## 4 3 Graduate Studies Program - Data Scientist Central Intelligence Agency
## 5 4 Data Scientist I Federal Reserve Bank of Dallas
## 6 5 Data Scientist - Entry to Experienced Level National Security Agency
## Queried_Salary Date_Since_Posted Location
## 1 <80000 1 MO
## 2 <80000 15 TX
## 3 <80000 1 OR
## 4 <80000 30 DC
## 5 <80000 30 TX
## 6 <80000 30 MD
## Link
## 1 https://www.indeed.com/rc/clk?jk=6a105f495c36afe3&fccid=281a27deb89bbe92&vjs=3
## 2 https://www.indeed.com/rc/clk?jk=86afd561ea8c60bc&fccid=b584cf601069b5d0&vjs=3
## 3 https://www.indeed.com/rc/clk?jk=e0aad317e6d45d9b&fccid=b39fb6f124ce8586&vjs=3
## 4 https://www.indeed.com/rc/clk?jk=1cfdd9e391a6328a&fccid=e9870e3159e9c6ac&vjs=3
## 5 https://www.indeed.com/rc/clk?jk=fec647775a21ecc6&fccid=2c6850e24c8a2811&vjs=3
## 6 https://www.indeed.com/rc/clk?jk=835a2db493898f25&fccid=e8f18ca6180ec8da&vjs=3
data = data[data$No_of_Skills>0,]
x = data$X
strings = data$Skill
count = c()
X = c()
skill_list = c()
# create vector count
for (string in strings){
count = c(count,length(clean_string(string)))
}
#create skills dataframe
for (i in seq(length(count))){
X = c(X,rep(x[i],count[i]))
skill_list = c(skill_list,clean_string(strings[i]))
}
skills = data.frame(X,skill_list)
head(skills)
## X skill_list
## 1 0 SAP
## 2 0 SQL
## 3 1 Machine Learning
## 4 1 R
## 5 1 SAS
## 6 1 SQL
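For reference, the same long format could be produced without explicit loops by using tidyr’s separate_rows (an alternative sketch, not the code that generated the extracts above):
# alternative reshape: strip the python-style brackets/quotes, then split rows on commas
skills_alt <- data %>%
  select(X, Skill) %>%
  mutate(Skill = str_remove_all(Skill, "\\[|\\]|'")) %>%
  separate_rows(Skill, sep = ",") %>%
  transmute(X, skill_list = trimws(Skill))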
write.csv(job_title,"job_title.csv",row.names=FALSE)
write.csv(companies,"Company.csv",row.names=FALSE)
write.csv(jobs,"Job_master.csv",row.names=FALSE)
write.csv(skills,"skills.csv",row.names=FALSE)
This data is loaded into the database tables described in the next section.
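The load step itself is not shown here; one minimal way to stage an extract into MySQL would be something like the sketch below, assuming a live DBI/RMySQL connection such as the one created in a later chunk and a hypothetical staging table name:
# sketch only: push the skills extract into a staging table before populating the normalized schema
skills_csv <- read.csv("skills.csv")
dbWriteTable(connection, name = "skills_staging", value = skills_csv,
             row.names = FALSE, overwrite = TRUE)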
Listed below are the database tables used:
Company : Company that posted the job posting
Company_Revenue : Annual revenue of hiring company
Company_Employees : Employee count of hiring company
Company_Industry : Industry of hiring company
State : State the job opening is located in
Job_master : Job details
Skills : Skills master list
Salary : Salary offered
Job_skills : Skills needed for posted job
CREATE TABLE `company` (
`company_id` int NOT NULL,
`company` text,
PRIMARY KEY (`company_id`)
)
CREATE TABLE `company_employee` (
`comp_emp_id` int NOT NULL,
`no_of_emp` text,
PRIMARY KEY (`comp_emp_id`)
)
CREATE TABLE `company_industry` (
`id` int NOT NULL,
`industry` text,
PRIMARY KEY (`id`)
)
CREATE TABLE `company_revenue` (
`revenue_id` int NOT NULL,
`revenue_range` text,
PRIMARY KEY (`revenue_id`)
)
CREATE TABLE `job_master` (
`job_id` int NOT NULL,
`job_title` text,
`salary_id` int DEFAULT NULL,
`comp_id` int DEFAULT NULL,
`state_id` int DEFAULT NULL,
`comp_rev_id` int DEFAULT NULL,
`comp_emp_id` int DEFAULT NULL,
`comp_ind_id` int DEFAULT NULL,
PRIMARY KEY (`job_id`),
KEY `FK_JOB_STATE_idx` (`state_id`),
KEY `FK_COMPANY_REV_idx` (`comp_rev_id`),
KEY `FK_COMPANY_EMPLOYEE_idx` (`comp_emp_id`),
KEY `FK_COMPANY_INDUSTRY_idx` (`comp_ind_id`),
KEY `FK_SALARY_idx` (`salary_id`),
KEY `FK_COMPANY_idx` (`comp_id`),
CONSTRAINT `FK_COMPANY` FOREIGN KEY (`comp_id`) REFERENCES `company` (`company_id`),
CONSTRAINT `FK_COMPANY_EMPLOYEE` FOREIGN KEY (`comp_emp_id`) REFERENCES `company_employee` (`comp_emp_id`),
CONSTRAINT `FK_COMPANY_INDUSTRY` FOREIGN KEY (`comp_ind_id`) REFERENCES `company_industry` (`id`),
CONSTRAINT `FK_COMPANY_REV` FOREIGN KEY (`comp_rev_id`) REFERENCES `company_revenue` (`revenue_id`),
CONSTRAINT `FK_JOB_STATE` FOREIGN KEY (`state_id`) REFERENCES `state` (`state_id`),
CONSTRAINT `FK_SALARY` FOREIGN KEY (`salary_id`) REFERENCES `salary` (`salary_id`)
)
CREATE TABLE `job_skills` (
`job_id` int DEFAULT NULL,
`job_title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
`skill_id` int DEFAULT NULL,
KEY `FK_JOB_idx` (`skill_id`),
KEY `FK_JOB_ID_idx` (`job_id`),
CONSTRAINT `FK_JOB_ID` FOREIGN KEY (`job_id`) REFERENCES `job_master` (`job_id`),
CONSTRAINT `FK_JOB_SKILLS` FOREIGN KEY (`skill_id`) REFERENCES `skills` (`skill_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `salary` (
`salary_id` int NOT NULL,
`salary_range` text,
PRIMARY KEY (`salary_id`)
)
CREATE TABLE `skills` (
`skill_id` int NOT NULL,
`skill_desc` text,
PRIMARY KEY (`skill_id`)
)
CREATE TABLE `state` (
`state_id` int NOT NULL,
`state_district` text,
`postal_code` text,
PRIMARY KEY (`state_id`)
)
csvurl<-get_creds()$`aws.rds`$schema
user <-get_creds()$`aws.rds`$user
password <-get_creds()$`aws.rds`$creds
host<-get_creds()$`aws.rds`$host
dbname<-get_creds()$`aws.rds`$schema
connection = dbConnect(MySQL(), user = user, password = password, dbname = dbname, host = host)
The common function below loads the contents of a given table:
get_data<- function(table_name) {
dataset <- tbl(connection,dbplyr::in_schema(dbname, table_name))
return(dataset)
}
d607_p3_all_recs<-as.data.frame(get_data('view_d607_p3_all_recs'))
tblcompany<-as.data.frame(get_data('company'))
tblcemployee<-as.data.frame(get_data('company_employee'))
tblcindustry<-as.data.frame(get_data('company_industry'))
tblcrevenue<-as.data.frame(get_data('company_revenue'))
tbljobmaster<-as.data.frame(get_data('job_master'))
tbljobskills<-as.data.frame(get_data('job_skills'))
tblsalary<-as.data.frame(get_data('salary'))
tblskills<-as.data.frame(get_data('skills'))
tblstate<-as.data.frame(get_data('state'))
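Beyond pulling whole tables, the same connection can run joined roll-ups directly in SQL. A hypothetical query (not part of the project code) counting how often each skill appears in each salary band could look like this:
# hypothetical roll-up across the normalized schema
skill_by_salary <- dbGetQuery(connection, "
  SELECT s.skill_desc, sal.salary_range, COUNT(*) AS n_postings
  FROM job_skills js
  JOIN skills s ON s.skill_id = js.skill_id
  JOIN job_master j ON j.job_id = js.job_id
  JOIN salary sal ON sal.salary_id = j.salary_id
  GROUP BY s.skill_desc, sal.salary_range
  ORDER BY n_postings DESC")
head(skill_by_salary)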
# most popular skills
num_jobs = dim(jobs)[1]
skills %>%
group_by_at('skill_list')%>%
summarise(percent = n()/num_jobs)%>%
arrange(desc(percent))%>%
head(10)%>%
ggplot(aes(reorder(skill_list,percent),percent))+
geom_bar(stat = 'identity',fill = '#A1B8A1')+coord_flip()+
labs(x = 'Skills',title = "Most Desired Data Science Skills in the U.S.")
Findings:
skill_count = skills %>%
group_by(skill_list)%>%
summarise(frequency = n())%>%
arrange(desc(frequency))
wordcloud(skill_count$skill_list,skill_count$frequency,min.freq = 5,
max.words=150, random.order=FALSE, rot.per=0.1,
colors=brewer.pal(8, "Dark2"))
Findings:
ny = jobs[jobs["Location"] == "NY",]
ny_skills = merge(ny, skills, by = 'X')
num_jobs_ny = dim(ny)[1]
ny_skills %>%
group_by(skill_list)%>%
summarise(percent = n()/num_jobs_ny)%>%
arrange(desc(percent))%>%
head(10)%>%
ggplot(aes(reorder(skill_list,percent),percent))+
geom_bar(stat = 'identity', fill = '#C4A9C4')+
coord_flip()+labs(x = 'Skills',title = "Most Desired Data Science Skills in NY")
Findings:
derive_salary <- function(s) {
contains_range <- str_detect(s, '-')
if (contains_range == TRUE) {
text_range <- unlist(str_extract_all(s, '[0-9,]+'))
numeric_range <- as.numeric(str_remove_all(text_range, ','))
return(mean(c(numeric_range[1], numeric_range[2])))
} else {
as_text <- str_extract_all(s, '[0-9,]+')
as_numeric <- as.numeric(str_remove_all(as_text, ','))
return(as_numeric)
}
}
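A quick check (not in the original output) of how the helper behaves on the two kinds of ranges present in the data:
derive_salary("80000-99999")   # bounded range: returns the midpoint, 89999.5
derive_salary("<80000")        # open-ended range: returns its single bound, 80000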
tblsalary$salary<-as.numeric(lapply(tblsalary$salary_range, derive_salary))
jobmaster_salary = merge(x=tbljobmaster,y=tblsalary,by="salary_id")
ggplot(jobmaster_salary,aes(x=as.factor(salary)))+
  geom_bar()
quantile(jobmaster_salary$salary, c(0, .1, .25, .5, .75, .9, .95, .99))
## 0% 10% 25% 50% 75% 90% 95% 99%
## 80000.0 80000.0 89999.5 109999.5 129999.5 149999.5 160000.0 160000.0
head(jobmaster_salary)
## salary_id job_id job_title comp_id state_id
## 1 1 1 Data Scientist 761 26
## 2 1 2 Data Scientist 1336 44
## 3 1 3 Data Scientist 523 38
## 4 1 4 Graduate Studies Program - Data Scientist 418 9
## 5 1 5 Data Scientist I 782 44
## 6 1 6 Data Scientist - Entry to Experienced Level 1374 21
## comp_rev_id comp_emp_id comp_ind_id salary_range salary
## 1 4 1 14 <80000 80000
## 2 5 3 34 <80000 80000
## 3 5 3 34 <80000 80000
## 4 5 3 13 <80000 80000
## 5 5 2 4 <80000 80000
## 6 5 3 34 <80000 80000
Findings:
# count job post of different state
jobmaster_state <- merge(x=tbljobmaster,y=tblstate,by="state_id")
df_jobs <- jobmaster_state %>% group_by(state_district,postal_code) %>% dplyr::summarize (n = n())
df_jobs$hover <- with(df_jobs, paste(state_district, '<br>', "jobs:", n))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
# plot the map
plot_geo(df_jobs, locationmode = 'USA-states') %>%
add_trace(
z = ~df_jobs$n,
locations = ~df_jobs$postal_code
  ) %>%
  colorbar(title = "Data Scientist Job Postings") %>%
  layout(title = 'Data Scientist Jobs by State', geo = g)
Findings:
In addition to knowing the most sought-after data science skills today, knowing which skills are rising, declining, or remaining stable provides much-needed context for planning ahead. Google Trends is a tool that reports how popular a search term is relative to its own peak search volume; no conclusions about search volume relative to other terms can be drawn from this data. For example, SQL has shown a consistent decrease in popularity and currently sits at roughly 35% of its own peak, while Spark sits at roughly 65% of its peak, yet the absolute search volume for SQL is likely still far higher than for Spark.
Because Google Trends does not have a public API, a third-party R package, gtrendsR, is used to source the data. Careful consideration went into which skills to track over time and how to group them. We ultimately grouped skills by rising interest, declining interest, stable interest, and a category we call “peaked,” which covers stark increases followed by decreases in popularity since 2010. Top skills were determined by how frequently they appeared in job posts.
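As a side note (not part of the analysis below), gtrendsR can also be passed several keywords in a single call, in which case Google scales the terms against each other rather than against each term’s own peak:
# hypothetical comparison call: up to five keywords share one 0-100 scale
sql_vs_spark <- gtrends(keyword = c("SQL", "Spark"), geo = "US", time = "2010-01-01 2021-03-20")
head(sql_vs_spark$interest_over_time[, c("date", "hits", "keyword")])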
d607_p3_all_recs<-as.data.frame(get_data('view_d607_p3_all_recs'))
top_skills <- d607_p3_all_recs %>%
select(skill_id, skill_desc) %>%
group_by(skill_desc) %>%
summarise(n = n()) %>%
  arrange(desc(n))
hl <- "en-US"
time <- "2010-01-01 2021-03-20"
geo <- "US"
# Google Trends for terms
python_trend <- gtrends(keyword = "Python", geo = geo, time = time, hl = hl)
sql_trend <- gtrends(keyword = "SQL", geo = geo, time = time, hl = hl)
ml_trend <- gtrends(keyword = "Machine Learning", geo = geo, time = time, hl = hl)
r_trend <- gtrends(keyword = "R", geo = geo, time = time, hl = hl)
hadoop_trend <- gtrends(keyword = "Hadoop", geo = geo, time = time, hl = hl)
spark_trend <- gtrends(keyword = "Spark", geo = geo, time = time, hl = hl)
java_trend <- gtrends(keyword = "Java", geo = geo, time = time, hl = hl)
tableau_trend <- gtrends(keyword = "Tableau", geo = geo, time = time, hl = hl)
datamining_trend <- gtrends(keyword = "Data Mining", geo = geo, time = time, hl = hl)
hive_trend <- gtrends(keyword = "Hive", geo = geo, time = time, hl = hl)
Sys.sleep(10)
sas_trend <- gtrends(keyword = "SAS", geo = geo, time = time, hl = hl)
bigtrend_trend <- gtrends(keyword = "Big Data", geo = geo, time = time, hl = hl)
aws_trend <- gtrends(keyword = "AWS", geo = geo, time = time, hl = hl)
scala_trend <- gtrends(keyword = "Scala", geo = geo, time = time, hl = hl)
nosql_trend <- gtrends(keyword = "NoSQL", geo = geo, time = time, hl = hl)
c_trend <- gtrends(keyword = "C/C++", geo = geo, time = time, hl = hl)
nlp_trend <- gtrends(keyword = "Natural Language Processing", geo = geo, time = time, hl = hl)
oracle_trend <- gtrends(keyword = "Oracle", geo = geo, time = time, hl = hl)
datawarehouse_trend <- gtrends(keyword = "Data Warehouse", geo = geo, time = time, hl = hl)
linux_trend <- gtrends(keyword = "Linux", geo = geo, time = time, hl = hl)
Sys.sleep(10)
ai_trend <- gtrends(keyword = "AI", geo = geo, time = time, hl = hl)
micrsql_trend <- gtrends(keyword = "Microsoft SQL Server", geo = geo, time = time, hl = hl)
tensorflow_trend <- gtrends(keyword = "TensorFlow", geo = geo, time = time, hl = hl)
kafka_trend <- gtrends(keyword = "Kafka", geo = geo, time = time, hl = hl)
azure_trend <- gtrends(keyword = "Azure", geo = geo, time = time, hl = hl)
# Join trend data for terms rising in interest
python_trend$interest_over_time$ml_hits = ml_trend$interest_over_time$hits
python_trend$interest_over_time$spark_hits = spark_trend$interest_over_time$hits
python_trend$interest_over_time$tableau_hits = tableau_trend$interest_over_time$hits
python_trend$interest_over_time$kafka_hits = kafka_trend$interest_over_time$hits
python_trend$interest_over_time$azure_hits = azure_trend$interest_over_time$hits
# Join trend data for terms declining in interest
sql_trend$interest_over_time$java_hits = java_trend$interest_over_time$hits
sql_trend$interest_over_time$hadoop_hits = hadoop_trend$interest_over_time$hits
sql_trend$interest_over_time$datamining_hits = datamining_trend$interest_over_time$hits
sql_trend$interest_over_time$sas_hits = sas_trend$interest_over_time$hits
sql_trend$interest_over_time$oracle_hits = oracle_trend$interest_over_time$hits
# Join trend data for terms stable in interest
r_trend$interest_over_time$scala_hits = scala_trend$interest_over_time$hits
r_trend$interest_over_time$nosql_hits = nosql_trend$interest_over_time$hits
r_trend$interest_over_time$c_hits = c_trend$interest_over_time$hits
r_trend$interest_over_time$datawarehouse_hits = datawarehouse_trend$interest_over_time$hits
# Join trend data for terms that peaked in interest
bigtrend_trend$interest_over_time$tensorflow_hits = tensorflow_trend$interest_over_time$hits
bigtrend_trend$interest_over_time$hadoop_hits = hadoop_trend$interest_over_time$hits
# Plot rising interest over time
ggplot() +
geom_line(data=python_trend$interest_over_time, aes(y= hits, x= date, colour="a"), size=1 ) +
geom_line(data=python_trend$interest_over_time, aes(y= ml_hits, x= date, colour="b"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= spark_hits, x= date, colour="c"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= tableau_hits, x= date, colour="d"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= azure_hits, x= date, colour="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Increasing Interest Over Time",
labels = c("Python", "Machine Learning", "Spark", "Tableau", "Azure"))# Plot declining interest over time
ggplot() +
geom_line(data=sql_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=java_hits, x=date, color="b"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=sas_hits, x=date, color="d"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=oracle_hits, x=date, color="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Decreasing Interest Over Time",
labels = c("SQL", "Java", "SAS", "Oracle"))# Plot stable interest over time
ggplot() +
geom_line(data=r_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=scala_hits, x=date, color="b"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=nosql_hits, x=date, color="c"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=c_hits, x=date, color="d"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=datawarehouse_hits, x=date, color="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Stable Interest Over Time",
labels = c("R", "Scala", "NoSQL", "C", "Data Warehouse"))# Plot peaked interest over time
ggplot() +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=as.numeric(tensorflow_hits),
x=date, color="b"), size=1) +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=hadoop_hits, x=date, color="c"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Peak Popularity", labels = c("Big Data", "Tensorflow", "Hadoop"))
Findings:
Terms with increasing popularity provide the least surprising findings, though the sharp increase in Azure was somewhat unforeseen. Among terms decreasing in popularity, the steady drop in SQL was unexpected; after discussing it within the team, we think some SQL searches may have been replaced by terms such as “BigQuery” and “PostgreSQL”, which are both rising in search volume. Determining which skills are stable in interest is as important as knowing which are rising, since you can make stronger assumptions that those skills will remain relevant. It is encouraging that R’s popularity has remained fairly stable over the last 11 years, as the CUNY S.P.S. Data Science program is tightly coupled with the language.
One use case for analyzing the most sought-after data science skills is predicting a salary range from a person’s current skill set, or from the skill set they plan to cultivate. This challenge takes the form of a classification problem: we feed in an individual’s skills and location and return an estimated salary range. The task required extensive data preparation – dummy variables for over a hundred attributes – in order to fit the models of interest: a classification tree, a random forest, and naive Bayes.
# Include skills with at least 100 observations
skills_hundred_obs <- top_skills %>% filter(n >= 100)
df_jobs_skills_hundred_obs <- d607_p3_all_recs %>% filter(skill_desc %in% skills_hundred_obs$skill_desc)
# Create dummy variables for skills with at least 100 observations
df_jobs_skills_hundred_obs_dummies <- dummy_cols(df_jobs_skills_hundred_obs,
select_columns = c("skill_desc"), remove_first_dummy = TRUE)
# Remove unneeded columns
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
select(-(c("job_title", "company", "no_of_emp",
"revenue_range", "industry",
"postal_code", "skill_id", "skill_desc")))
# Group by job_id and for each column use max value (0 or 1) as aggregate value
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
group_by(job_id) %>%
summarise_each(list(max))
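Note that summarise_each() is deprecated in current releases of dplyr; the same aggregation (here and in the locations chunk below) can be written with across() (an equivalent alternative, not the code that produced the results shown):
# equivalent aggregation using the non-deprecated dplyr verbs
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
  group_by(job_id) %>%
  summarise(across(everything(), max))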
# Find top locations by observation count
top_locations <- d607_p3_all_recs %>% select(skill_id, postal_code) %>%
group_by(postal_code) %>%
summarise(n = n()) %>%
arrange(desc(n))
# Include locations with at least 100 observations
locations_hundred_obs <- top_locations %>% filter(n >= 100)
df_jobs_locs_hundred_obs <- d607_p3_all_recs %>%
filter(postal_code %in% locations_hundred_obs$postal_code)
# Create dummy variables for locations with at least 100 observations
df_jobs_locs_hundred_obs_dummies <- dummy_cols(df_jobs_locs_hundred_obs,
select_columns = c("postal_code"), remove_first_dummy = TRUE)
# Remove unneeded columns
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>%
select(-(c("job_title", "company", "no_of_emp",
"salary_range", "revenue_range", "industry",
"postal_code", "skill_id", "skill_desc")))
# Group by job_id and for each column use max value of one of the rows as aggregate value
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>%
group_by(job_id) %>%
summarise_each(list(max))
# Join skills and locations
df_model <- df_jobs_skills_hundred_obs_dummies %>% inner_join(df_jobs_locs_hundred_obs_dummies)
# Rename problematic column names
df_model <- df_model %>% rename(skill_desc_C = "skill_desc_C/C++",
skill_desc_C_Sharp = "skill_desc_C#",
skill_desc_CICD = "skill_desc_CI/CD",
                              skill_desc_TSSCIClearance = "skill_desc_TS/SCIClearance")
# Create train test split
train <- df_model %>% dplyr::sample_frac(.75)
test <- dplyr::anti_join(df_model, train, by = 'job_id')
# Reorder target variable levels
positions <- c("<80000", "80000-99999", "100000-119999", "120000-139999", "140000-159999", ">160000")
df_model$salary_range <- factor(df_model$salary_range, levels=positions)
# To get an idea of minimum accuracy
# Review the distribution / class sizes of the target variable by percentage
ggplot(df_model, aes(x = salary_range)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill="turquoise3") +
ggtitle("Salary Range by Percentage") + xlab("Percentage") + ylab("Salary Range")# Fit model training data
model_ct <- rpart(salary_range ~ . - job_id, method="class", data=train)
# Plot classification tree
rpart.plot(model_ct, main="Classification Tree for Salary Range")
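The fitted tree’s complexity table could also be inspected (not shown in the original output) to see how much each additional split improves the cross-validated error:
printcp(model_ct)   # complexity parameter table with cross-validated error at each split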
# Make predictions on test data
preds_ct <- predict(model_ct, as.data.frame(test), type = "class")
# Review Confusion Matrix
confusionMatrix(preds_ct, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 148 22 84 40 13
## >160000 0 0 0 0 0
## 100000-119999 27 25 124 117 50
## 120000-139999 3 22 95 130 60
## 140000-159999 3 39 35 69 73
## 80000-99999 0 0 0 0 0
## Reference
## Prediction 80000-99999
## <80000 109
## >160000 0
## 100000-119999 75
## 120000-139999 10
## 140000-159999 23
## 80000-99999 0
##
## Overall Statistics
##
## Accuracy : 0.3403
## 95% CI : (0.3154, 0.3658)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : 8.572e-13
##
## Kappa : 0.1815
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.8177 0.00000 0.36686
## Specificity 0.7794 1.00000 0.72212
## Pos Pred Value 0.3558 NaN 0.29665
## Neg Pred Value 0.9663 0.92264 0.78119
## Prevalence 0.1297 0.07736 0.24212
## Detection Rate 0.1060 0.00000 0.08883
## Detection Prevalence 0.2980 0.00000 0.29943
## Balanced Accuracy 0.7986 0.50000 0.54449
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.36517 0.37245
## Specificity 0.81731 0.85917
## Pos Pred Value 0.40625 0.30165
## Neg Pred Value 0.78996 0.89341
## Prevalence 0.25501 0.14040
## Detection Rate 0.09312 0.05229
## Detection Prevalence 0.22923 0.17335
## Balanced Accuracy 0.59124 0.61581
## Class: 80000-99999
## Sensitivity 0.0000
## Specificity 1.0000
## Pos Pred Value NaN
## Neg Pred Value 0.8446
## Prevalence 0.1554
## Detection Rate 0.0000
## Detection Prevalence 0.0000
## Balanced Accuracy 0.5000
# Fit Top Skills and Locations Random Forest Model
model_rf <- randomForest(as.factor(salary_range) ~ . - job_id, data=train)
# Make predictions on test data
preds_rf <- predict(model_rf, test)
# Review Confusion Matrix
confusionMatrix(preds_rf, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 139 12 38 15 7
## >160000 0 22 3 4 14
## 100000-119999 16 11 169 99 24
## 120000-139999 1 27 87 179 54
## 140000-159999 1 32 15 44 92
## 80000-99999 24 4 26 15 5
## Reference
## Prediction 80000-99999
## <80000 63
## >160000 1
## 100000-119999 65
## 120000-139999 16
## 140000-159999 5
## 80000-99999 67
##
## Overall Statistics
##
## Accuracy : 0.4785
## 95% CI : (0.452, 0.5051)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3516
##
## Mcnemar's Test P-Value : 8.272e-16
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.76796 0.20370 0.5000
## Specificity 0.88889 0.98292 0.7968
## Pos Pred Value 0.50730 0.50000 0.4401
## Neg Pred Value 0.96257 0.93639 0.8330
## Prevalence 0.12966 0.07736 0.2421
## Detection Rate 0.09957 0.01576 0.1211
## Detection Prevalence 0.19628 0.03152 0.2751
## Balanced Accuracy 0.82842 0.59331 0.6484
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.5028 0.4694
## Specificity 0.8221 0.9192
## Pos Pred Value 0.4918 0.4868
## Neg Pred Value 0.8285 0.9138
## Prevalence 0.2550 0.1404
## Detection Rate 0.1282 0.0659
## Detection Prevalence 0.2607 0.1354
## Balanced Accuracy 0.6625 0.6943
## Class: 80000-99999
## Sensitivity 0.30876
## Specificity 0.93723
## Pos Pred Value 0.47518
## Neg Pred Value 0.88048
## Prevalence 0.15544
## Detection Rate 0.04799
## Detection Prevalence 0.10100
## Balanced Accuracy 0.62300
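Although not part of the original output, the random forest’s variable importance could be plotted to see which skill and location dummies drive the salary predictions:
# mean decrease in Gini impurity for the most influential predictors
varImpPlot(model_rf, n.var = 15, main = "Top Predictors of Salary Range")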
# Fit Naive Bayes model
model_nb <- naiveBayes(as.factor(salary_range) ~ . - job_id, data = train)
# Make predictions on test data
preds_nb <- predict(model_nb, as.data.frame(test))
# Review Confusion Matrix
confusionMatrix(preds_nb, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 141 17 64 22 11
## >160000 33 81 183 219 149
## 100000-119999 0 0 9 4 0
## 120000-139999 0 0 2 7 3
## 140000-159999 4 9 74 96 32
## 80000-99999 3 1 6 8 1
## Reference
## Prediction 80000-99999
## <80000 108
## >160000 82
## 100000-119999 3
## 120000-139999 2
## 140000-159999 17
## 80000-99999 5
##
## Overall Statistics
##
## Accuracy : 0.197
## 95% CI : (0.1764, 0.2188)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1013
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.7790 0.75000 0.026627
## Specificity 0.8173 0.48292 0.993384
## Pos Pred Value 0.3884 0.10843 0.562500
## Neg Pred Value 0.9613 0.95840 0.761594
## Prevalence 0.1297 0.07736 0.242120
## Detection Rate 0.1010 0.05802 0.006447
## Detection Prevalence 0.2600 0.53510 0.011461
## Balanced Accuracy 0.7981 0.61646 0.510005
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.019663 0.16327
## Specificity 0.993269 0.83333
## Pos Pred Value 0.500000 0.13793
## Neg Pred Value 0.747467 0.85911
## Prevalence 0.255014 0.14040
## Detection Rate 0.005014 0.02292
## Detection Prevalence 0.010029 0.16619
## Balanced Accuracy 0.506466 0.49830
## Class: 80000-99999
## Sensitivity 0.023041
## Specificity 0.983885
## Pos Pred Value 0.208333
## Neg Pred Value 0.845481
## Prevalence 0.155444
## Detection Rate 0.003582
## Detection Prevalence 0.017192
## Balanced Accuracy 0.503463
Findings:
Random Forest provides the best results across several metrics: per-class Sensitivity, Specificity, and Balanced Accuracy, as well as overall accuracy and Kappa. Both Sensitivity and Specificity are useful here because there is no strong preference for avoiding either false positives or false negatives. Kappa and Balanced Accuracy are more informative than raw accuracy because the classes are imbalanced. The classification tree’s strongest point is naturally its easy-to-read tree visual, which shows how selected factors lead to a particular salary range. Naive Bayes is the only model whose accuracy fails to beat the no-information rate of roughly 25% set by the largest class. It is, however, modestly useful at identifying the lowest and highest salary ranges (<80000 and >160000), which may be the most valuable use of a salary-range classifier.
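For reference, the Balanced Accuracy reported by caret is simply the mean of a class’s sensitivity and specificity, which is why it is more informative than raw accuracy when class sizes differ. Reproducing the random forest’s figure for the <80000 class from the statistics above:
sens <- 0.76796    # sensitivity for the <80000 class (random forest)
spec <- 0.88889    # specificity for the <80000 class (random forest)
(sens + spec) / 2  # 0.828425, matching the reported Balanced Accuracy of 0.82842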
Philippe Massicotte, Dirk Eddelbuettel (2021). gtrendsR: Perform and Display Google Trends Queries. R package version 1.4.8.
Terry Therneau, Beth Atkinson, Brian Ripley (2019). rpart: Recursive Partitioning for Classification, Regression and Survival Trees. R package version 4.1.15.
Leo Breiman and Adele Cutler (2018). randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.6.14.
Michal Majka (2020). naivebayes: High Performance Implementation of the Naive Bayes Algorithm. R package version 0.9.7.
Max Kuhn, Jed Wing, Steve Weston, Andre Williams (2020). caret: Misc Functions for Training and Plotting Classification and Regression Models. R package version 6.0.86.
Kenneth Tay (2020). “What is balanced accuracy?”
Job dataset (2018). Indeed Dataset - Data Scientist/Analyst/Engineer.
Agile & Scrum. What is Scrum?