Data scientists are highly educated – 88% have at least a Master’s degree and 46% have PhDs – and while there are notable exceptions, a very strong educational background is usually required to develop the depth of knowledge necessary to be a data scientist. Beyond formal education, every data scientist is expected to carry the most valuable skills in their toolkit, which raises the question:
What are the most valued data science skills?
The key objective of this project is to answer this question. A collaborative approach was used from the beginning to the end of the project to identify the most valued data science skills. A data science job dataset was used to analyze the skills required across several thousand job postings. The team also analyzed trends over time and ran the data through several models to reach a conclusion: a classification tree, a random forest, and a naive Bayes classifier.
The DataMiners team set out to extract and discover patterns in a large data set, using methods at the intersection of machine learning, statistics, and database systems, to answer the key question: what are the most valued data science skills? The team members listed below, with their defined responsibilities, carried out this project.
The DataMiners team first collaborated to understand the question at hand.
The team then used a two-pronged approach. With clearly defined tasks and regular stand-ups (Agile Scrum), each team member was onboarded with a clear ask, an approach, and an expected result. Under this two-pronged approach:
1. One sub-team focused on sourcing and cleansing the data and making it available in a relational database
2. The second sub-team focused on data analysis, modeling, trends, and the applicability of statistical methods
The team collaborated virtually using the tools below:
- Slack
- Google Docs
- Zoom
- AWS
- Outlook
- GitHub
Libraries:
- aws.s3
- caret
- data.table
- dplyr
- DT
- e1071
- fastDummies
- ggplot2
- gtrendsR
- httr
- lares
- mgsub
- plotly
- randomForest
- readr
- rlist
- RMySQL
- rpart
- rpart.plot
- stats
- stringr
- tidyverse
- wordcloud
Capabilities:
- AWS RDS MySQL
- AWS S3
- GitHub
- Google Trends
- RPubs.com
- RStudio
library(mgsub)
library(ggplot2)
library(plotly)
library(tidyverse)
library(dplyr)
library(readr)
library(RMySQL)
library(DT)
library(lares)
library(httr)
library(stringr)
library(data.table)
library(aws.s3)
library(rlist)
library(wordcloud)
library(gtrendsR)
library(stats)
library(rpart)
library(fastDummies)
library(rpart.plot)
library(randomForest)
library(e1071)
library(caret)
# pull AWS S3 credentials from the local credentials store and expose them as environment variables
bucket <- get_creds()$`aws.s3`$bucket
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = get_creds()$`aws.s3`$accessKeyId,
  "AWS_SECRET_ACCESS_KEY" = get_creds()$`aws.s3`$accessKey,
  "AWS_DEFAULT_REGION" = get_creds()$`aws.s3`$region)
To answer the key question, the team used the Indeed Dataset - Data Scientist/Analyst/Engineer.
data<-s3read_using(FUN = read.csv, object = "indeed_job_dataset.csv", bucket = bucket)
dim(data)
## [1] 5715   43
# function that transforms a "python list" string into an R character vector
clean_string = function(string){
  list = str_replace(string,"\\[","") %>%
    str_replace("\\]","") %>%
    str_replace_all("\\'","") %>%
    str_split(",")
  return (trimws(list[[1]]))
}
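As a quick illustration (not part of the original chunk), the helper converts one of the dataset’s python-style skill strings into a plain character vector:
clean_string("['Machine Learning', 'R', 'SQL']")
## [1] "Machine Learning" "R" "SQL"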
# subset job titles
job_title = unique(data[c('Job_Title','Job_Type')])
head(job_title)
## Job_Title Job_Type
## 1 Data Scientist data_scientist
## 4 Graduate Studies Program - Data Scientist data_scientist
## 5 Data Scientist I data_scientist
## 6 Data Scientist - Entry to Experienced Level data_scientist
## 7 Geospatial Data Scientist data_scientist
## 9 Bioinformatics Data Scientist data_scientist
# subset companies
companies = unique(data[c('Company','No_of_Reviews','No_of_Stars','Company_Revenue','Company_Employees','Company_Industry')])
# drop nulls
companies = drop_na(companies)
head(companies)
## Company No_of_Reviews No_of_Stars Company_Revenue
## 1 Express Scripts 3301 3.3 More than $10B (USD)
## 2 comScore 62 3.5
## 3 Central Intelligence Agency 158 4.3
## 4 Federal Reserve Bank of Dallas 495 4.1
## 5 National Security Agency 173 4.3
## 6 NYC Careers 30 3.8
## Company_Employees Company_Industry
## 1 10,000+ Health Care
## 2
## 3 Government
## 4 Less than 10,000 Banks and Financial Services
## 5
## 6 Government
# subset jobs
jobs = data[c('X','Job_Title','Company','Queried_Salary','Date_Since_Posted','Location','Link')]
head(jobs)
## X Job_Title Company
## 1 0 Data Scientist Express Scripts
## 2 1 Data Scientist Money Mart Financial Services
## 3 2 Data Scientist comScore
## 4 3 Graduate Studies Program - Data Scientist Central Intelligence Agency
## 5 4 Data Scientist I Federal Reserve Bank of Dallas
## 6 5 Data Scientist - Entry to Experienced Level National Security Agency
## Queried_Salary Date_Since_Posted Location
## 1 <80000 1 MO
## 2 <80000 15 TX
## 3 <80000 1 OR
## 4 <80000 30 DC
## 5 <80000 30 TX
## 6 <80000 30 MD
## Link
## 1 https://www.indeed.com/rc/clk?jk=6a105f495c36afe3&fccid=281a27deb89bbe92&vjs=3
## 2 https://www.indeed.com/rc/clk?jk=86afd561ea8c60bc&fccid=b584cf601069b5d0&vjs=3
## 3 https://www.indeed.com/rc/clk?jk=e0aad317e6d45d9b&fccid=b39fb6f124ce8586&vjs=3
## 4 https://www.indeed.com/rc/clk?jk=1cfdd9e391a6328a&fccid=e9870e3159e9c6ac&vjs=3
## 5 https://www.indeed.com/rc/clk?jk=fec647775a21ecc6&fccid=2c6850e24c8a2811&vjs=3
## 6 https://www.indeed.com/rc/clk?jk=835a2db493898f25&fccid=e8f18ca6180ec8da&vjs=3
data = data[data$No_of_Skills>0,]
x = data$X
strings = data$Skill
count = c()
X = c()
skill_list = c()
# create vector count
for (string in strings){
count = c(count,length(clean_string(string)))
}
#create skills dataframe
for (i in seq(length(count))){
X = c(X,rep(x[i],count[i]))
skill_list = c(skill_list,clean_string(strings[i]))
}
skills = data.frame(X,skill_list)
head(skills)
## X skill_list
## 1 0 SAP
## 2 0 SQL
## 3 1 Machine Learning
## 4 1 R
## 5 1 SAS
## 6 1 SQL
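For reference, the same long format could be produced without explicit loops by using tidyr’s separate_rows (an alternative sketch, not the code that generated the extracts above):
# alternative reshape: strip the python-style brackets/quotes, then split rows on commas
skills_alt <- data %>%
  select(X, Skill) %>%
  mutate(Skill = str_remove_all(Skill, "\\[|\\]|'")) %>%
  separate_rows(Skill, sep = ",") %>%
  transmute(X, skill_list = trimws(Skill))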
write.csv(job_title,"job_title.csv",row.names=FALSE)
write.csv(companies,"Company.csv",row.names=FALSE)
write.csv(jobs,"Job_master.csv",row.names=FALSE)
write.csv(skills,"skills.csv",row.names=FALSE)
This data is loaded into the database tables described in the next section.
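The load step itself is not shown here; one minimal way to stage an extract into MySQL would be something like the sketch below, assuming a live DBI/RMySQL connection such as the one created in a later chunk and a hypothetical staging table name:
# sketch only: push the skills extract into a staging table before populating the normalized schema
skills_csv <- read.csv("skills.csv")
dbWriteTable(connection, name = "skills_staging", value = skills_csv,
             row.names = FALSE, overwrite = TRUE)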
Listed below are the database tables used:
Company : Company that posted the job posting
Company_Revenue : Annual revenue of hiring company
Company_Employees : Employee count of hiring company
Company_Industry : Industry of hiring company
State : State the job opening is located in
Job_master : Job details
Skills : Skills master list
Salary : Salary offered
Job_skills : Skills needed for posted job
CREATE TABLE `company` (
`company_id` int NOT NULL,
`company` text,
PRIMARY KEY (`company_id`)
)
CREATE TABLE `company_employee` (
`comp_emp_id` int NOT NULL,
`no_of_emp` text,
PRIMARY KEY (`comp_emp_id`)
)
CREATE TABLE `company_industry` (
`id` int NOT NULL,
`industry` text,
PRIMARY KEY (`id`)
)
CREATE TABLE `company_revenue` (
`revenue_id` int NOT NULL,
`revenue_range` text,
PRIMARY KEY (`revenue_id`)
)
CREATE TABLE `job_master` (
`job_id` int NOT NULL,
`job_title` text,
`salary_id` int DEFAULT NULL,
`comp_id` int DEFAULT NULL,
`state_id` int DEFAULT NULL,
`comp_rev_id` int DEFAULT NULL,
`comp_emp_id` int DEFAULT NULL,
`comp_ind_id` int DEFAULT NULL,
PRIMARY KEY (`job_id`),
KEY `FK_JOB_STATE_idx` (`state_id`),
KEY `FK_COMPANY_REV_idx` (`comp_rev_id`),
KEY `FK_COMPANY_EMPLOYEE_idx` (`comp_emp_id`),
KEY `FK_COMPANY_INDUSTRY_idx` (`comp_ind_id`),
KEY `FK_SALARY_idx` (`salary_id`),
KEY `FK_COMPANY_idx` (`comp_id`),
CONSTRAINT `FK_COMPANY` FOREIGN KEY (`comp_id`) REFERENCES `company` (`company_id`),
CONSTRAINT `FK_COMPANY_EMPLOYEE` FOREIGN KEY (`comp_emp_id`) REFERENCES `company_employee` (`comp_emp_id`),
CONSTRAINT `FK_COMPANY_INDUSTRY` FOREIGN KEY (`comp_ind_id`) REFERENCES `company_industry` (`id`),
CONSTRAINT `FK_COMPANY_REV` FOREIGN KEY (`comp_rev_id`) REFERENCES `company_revenue` (`revenue_id`),
CONSTRAINT `FK_JOB_STATE` FOREIGN KEY (`state_id`) REFERENCES `state` (`state_id`),
CONSTRAINT `FK_SALARY` FOREIGN KEY (`salary_id`) REFERENCES `salary` (`salary_id`)
)
CREATE TABLE `job_skills` (
`job_id` int DEFAULT NULL,
`job_title` text CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci,
`skill_id` int DEFAULT NULL,
KEY `FK_JOB_idx` (`skill_id`),
KEY `FK_JOB_ID_idx` (`job_id`),
CONSTRAINT `FK_JOB_ID` FOREIGN KEY (`job_id`) REFERENCES `job_master` (`job_id`),
CONSTRAINT `FK_JOB_SKILLS` FOREIGN KEY (`skill_id`) REFERENCES `skills` (`skill_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `salary` (
`salary_id` int NOT NULL,
`salary_range` text,
PRIMARY KEY (`salary_id`)
)
CREATE TABLE `skills` (
`skill_id` int NOT NULL,
`skill_desc` text,
PRIMARY KEY (`skill_id`)
)
CREATE TABLE `state` (
`state_id` int NOT NULL,
`state_district` text,
`postal_code` text,
PRIMARY KEY (`state_id`)
)
csvurl<-get_creds()$`aws.rds`$schema
user <-get_creds()$`aws.rds`$user
password <-get_creds()$`aws.rds`$creds
host<-get_creds()$`aws.rds`$host
dbname<-get_creds()$`aws.rds`$schema
connection = dbConnect(MySQL(), user = user, password = password, dbname = dbname, host = host)
The common function below loads the contents of a given table:
get_data<- function(table_name) {
dataset <- tbl(connection,dbplyr::in_schema(dbname, table_name))
return(dataset)
}
d607_p3_all_recs<-as.data.frame(get_data('view_d607_p3_all_recs'))
tblcompany<-as.data.frame(get_data('company'))
tblcemployee<-as.data.frame(get_data('company_employee'))
tblcindustry<-as.data.frame(get_data('company_industry'))
tblcrevenue<-as.data.frame(get_data('company_revenue'))
tbljobmaster<-as.data.frame(get_data('job_master'))
tbljobskills<-as.data.frame(get_data('job_skills'))
tblsalary<-as.data.frame(get_data('salary'))
tblskills<-as.data.frame(get_data('skills'))
tblstate<-as.data.frame(get_data('state'))
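Beyond pulling whole tables, the same connection can run joined roll-ups directly in SQL. A hypothetical query (not part of the project code) counting how often each skill appears in each salary band could look like this:
# hypothetical roll-up across the normalized schema
skill_by_salary <- dbGetQuery(connection, "
  SELECT s.skill_desc, sal.salary_range, COUNT(*) AS n_postings
  FROM job_skills js
  JOIN skills s ON s.skill_id = js.skill_id
  JOIN job_master j ON j.job_id = js.job_id
  JOIN salary sal ON sal.salary_id = j.salary_id
  GROUP BY s.skill_desc, sal.salary_range
  ORDER BY n_postings DESC")
head(skill_by_salary)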
# most popular skills
num_jobs = dim(jobs)[1]
skills %>%
group_by_at('skill_list')%>%
summarise(percent = n()/num_jobs)%>%
arrange(desc(percent))%>%
head(10)%>%
ggplot(aes(reorder(skill_list,percent),percent))+
geom_bar(stat = 'identity',fill = '#A1B8A1')+coord_flip()+
labs(x = 'Skills',title = "Most Desired Data Science Skills in the U.S.")
Findings:
skill_count = skills %>%
group_by(skill_list)%>%
summarise(frequency = n())%>%
arrange(desc(frequency))
wordcloud(skill_count$skill_list,skill_count$frequency,min.freq = 5,
max.words=150, random.order=FALSE, rot.per=0.1,
colors=brewer.pal(8, "Dark2"))
Findings:
ny = jobs[jobs["Location"] == "NY",]
ny_skills = merge(ny, skills, by = 'X')
num_jobs_ny = dim(ny)[1]
ny_skills %>%
group_by(skill_list)%>%
summarise(percent = n()/num_jobs_ny)%>%
arrange(desc(percent))%>%
head(10)%>%
ggplot(aes(reorder(skill_list,percent),percent))+
geom_bar(stat = 'identity', fill = '#C4A9C4')+
coord_flip()+labs(x = 'Skills',title = "Most Desired Data Science Skills in NY")
Findings:
derive_salary <- function(s) {
contains_range <- str_detect(s, '-')
if (contains_range == TRUE) {
text_range <- unlist(str_extract_all(s, '[0-9,]+'))
numeric_range <- as.numeric(str_remove_all(text_range, ','))
return(mean(c(numeric_range[1], numeric_range[2])))
} else {
as_text <- str_extract_all(s, '[0-9,]+')
as_numeric <- as.numeric(str_remove_all(as_text, ','))
return(as_numeric)
}
}
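A quick check (not in the original output) of how the helper behaves on the two kinds of ranges present in the data:
derive_salary("80000-99999")   # bounded range: returns the midpoint, 89999.5
derive_salary("<80000")        # open-ended range: returns its single bound, 80000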
tblsalary$salary<-as.numeric(lapply(tblsalary$salary_range, derive_salary))
jobmaster_salary = merge(x=tbljobmaster,y=tblsalary,by="salary_id")
ggplot(jobmaster_salary,aes(x=as.factor(salary)))+
  geom_bar()
quantile(jobmaster_salary$salary, c(0, .1, .25, .5, .75, .9, .95, .99))
## 0% 10% 25% 50% 75% 90% 95% 99%
## 80000.0 80000.0 89999.5 109999.5 129999.5 149999.5 160000.0 160000.0
head(jobmaster_salary)
## salary_id job_id job_title comp_id state_id
## 1 1 1 Data Scientist 761 26
## 2 1 2 Data Scientist 1336 44
## 3 1 3 Data Scientist 523 38
## 4 1 4 Graduate Studies Program - Data Scientist 418 9
## 5 1 5 Data Scientist I 782 44
## 6 1 6 Data Scientist - Entry to Experienced Level 1374 21
## comp_rev_id comp_emp_id comp_ind_id salary_range salary
## 1 4 1 14 <80000 80000
## 2 5 3 34 <80000 80000
## 3 5 3 34 <80000 80000
## 4 5 3 13 <80000 80000
## 5 5 2 4 <80000 80000
## 6 5 3 34 <80000 80000
Findings:
# count job post of different state
jobmaster_state <- merge(x=tbljobmaster,y=tblstate,by="state_id")
df_jobs <- jobmaster_state %>% group_by(state_district,postal_code) %>% dplyr::summarize (n = n())
df_jobs$hover <- with(df_jobs, paste(state_district, '<br>', "jobs:", n))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map options
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
# plot the map
plot_geo(df_jobs, locationmode = 'USA-states') %>%
add_trace(
z = ~df_jobs$n,
locations = ~df_jobs$postal_code
  ) %>%
  colorbar(title = "Data Scientist Job Postings") %>%
  layout(title = 'Data Scientist Jobs by State', geo = g)
Findings:
In addition to knowing the most sought-after data science skills today, knowing which skills are rising, declining, or remaining stable provides much-needed context for planning ahead. Google Trends is a tool that reports how popular a search term is relative to its own peak search volume; no conclusions about search volume relative to other terms can be drawn from this data. For example, SQL has shown a consistent decrease in popularity and currently sits at roughly 35% of its own peak, while Spark sits at roughly 65% of its peak, yet the absolute search volume for SQL is likely still far higher than for Spark.
Because Google Trends does not have a public API, a third-party R package, gtrendsR, is used to source the data. Careful consideration went into which skills to track over time and how to group them. We ultimately grouped skills by rising interest, declining interest, stable interest, and a category we call “peaked,” which covers stark increases followed by decreases in popularity since 2010. Top skills were determined by how frequently they appeared in job posts.
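As a side note (not part of the analysis below), gtrendsR can also be passed several keywords in a single call, in which case Google scales the terms against each other rather than against each term’s own peak:
# hypothetical comparison call: up to five keywords share one 0-100 scale
sql_vs_spark <- gtrends(keyword = c("SQL", "Spark"), geo = "US", time = "2010-01-01 2021-03-20")
head(sql_vs_spark$interest_over_time[, c("date", "hits", "keyword")])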
d607_p3_all_recs<-as.data.frame(get_data('view_d607_p3_all_recs'))
top_skills <- d607_p3_all_recs %>%
select(skill_id, skill_desc) %>%
group_by(skill_desc) %>%
summarise(n = n()) %>%
  arrange(desc(n))
hl <- "en-US"
time <- "2010-01-01 2021-03-20"
geo <- "US"
# Google Trends for terms
python_trend <- gtrends(keyword = "Python", geo = geo, time = time, hl = hl)
sql_trend <- gtrends(keyword = "SQL", geo = geo, time = time, hl = hl)
ml_trend <- gtrends(keyword = "Machine Learning", geo = geo, time = time, hl = hl)
r_trend <- gtrends(keyword = "R", geo = geo, time = time, hl = hl)
hadoop_trend <- gtrends(keyword = "Hadoop", geo = geo, time = time, hl = hl)
spark_trend <- gtrends(keyword = "Spark", geo = geo, time = time, hl = hl)
java_trend <- gtrends(keyword = "Java", geo = geo, time = time, hl = hl)
tableau_trend <- gtrends(keyword = "Tableau", geo = geo, time = time, hl = hl)
datamining_trend <- gtrends(keyword = "Data Mining", geo = geo, time = time, hl = hl)
hive_trend <- gtrends(keyword = "Hive", geo = geo, time = time, hl = hl)
Sys.sleep(10)
sas_trend <- gtrends(keyword = "SAS", geo = geo, time = time, hl = hl)
bigtrend_trend <- gtrends(keyword = "Big Data", geo = geo, time = time, hl = hl)
aws_trend <- gtrends(keyword = "AWS", geo = geo, time = time, hl = hl)
scala_trend <- gtrends(keyword = "Scala", geo = geo, time = time, hl = hl)
nosql_trend <- gtrends(keyword = "NoSQL", geo = geo, time = time, hl = hl)
c_trend <- gtrends(keyword = "C/C++", geo = geo, time = time, hl = hl)
nlp_trend <- gtrends(keyword = "Natural Language Processing", geo = geo, time = time, hl = hl)
oracle_trend <- gtrends(keyword = "Oracle", geo = geo, time = time, hl = hl)
datawarehouse_trend <- gtrends(keyword = "Data Warehouse", geo = geo, time = time, hl = hl)
linux_trend <- gtrends(keyword = "Linux", geo = geo, time = time, hl = hl)
Sys.sleep(10)
ai_trend <- gtrends(keyword = "AI", geo = geo, time = time, hl = hl)
micrsql_trend <- gtrends(keyword = "Microsoft SQL Server", geo = geo, time = time, hl = hl)
tensorflow_trend <- gtrends(keyword = "TensorFlow", geo = geo, time = time, hl = hl)
kafka_trend <- gtrends(keyword = "Kafka", geo = geo, time = time, hl = hl)
azure_trend <- gtrends(keyword = "Azure", geo = geo, time = time, hl = hl)
# Join trend data for terms rising in interest
python_trend$interest_over_time$ml_hits = ml_trend$interest_over_time$hits
python_trend$interest_over_time$spark_hits = spark_trend$interest_over_time$hits
python_trend$interest_over_time$tableau_hits = tableau_trend$interest_over_time$hits
python_trend$interest_over_time$kafka_hits = kafka_trend$interest_over_time$hits
python_trend$interest_over_time$azure_hits = azure_trend$interest_over_time$hits
# Join trend data for terms declining in interest
sql_trend$interest_over_time$java_hits = java_trend$interest_over_time$hits
sql_trend$interest_over_time$hadoop_hits = hadoop_trend$interest_over_time$hits
sql_trend$interest_over_time$datamining_hits = datamining_trend$interest_over_time$hits
sql_trend$interest_over_time$sas_hits = sas_trend$interest_over_time$hits
sql_trend$interest_over_time$oracle_hits = oracle_trend$interest_over_time$hits
# Join trend data for terms stable in interest
r_trend$interest_over_time$scala_hits = scala_trend$interest_over_time$hits
r_trend$interest_over_time$nosql_hits = nosql_trend$interest_over_time$hits
r_trend$interest_over_time$c_hits = c_trend$interest_over_time$hits
r_trend$interest_over_time$datawarehouse_hits = datawarehouse_trend$interest_over_time$hits
# Join trend data for terms that peaked in interest
bigtrend_trend$interest_over_time$tensorflow_hits = tensorflow_trend$interest_over_time$hits
bigtrend_trend$interest_over_time$hadoop_hits = hadoop_trend$interest_over_time$hits
# Plot rising interest over time
ggplot() +
geom_line(data=python_trend$interest_over_time, aes(y= hits, x= date, colour="a"), size=1 ) +
geom_line(data=python_trend$interest_over_time, aes(y= ml_hits, x= date, colour="b"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= spark_hits, x= date, colour="c"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= tableau_hits, x= date, colour="d"), size=1) +
geom_line(data=python_trend$interest_over_time, aes(y= azure_hits, x= date, colour="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Increasing Interest Over Time",
labels = c("Python", "Machine Learning", "Spark", "Tableau", "Azure"))# Plot declining interest over time
ggplot() +
geom_line(data=sql_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=java_hits, x=date, color="b"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=sas_hits, x=date, color="d"), size=1) +
geom_line(data=sql_trend$interest_over_time, aes(y=oracle_hits, x=date, color="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Decreasing Interest Over Time",
labels = c("SQL", "Java", "SAS", "Oracle"))# Plot stable interest over time
ggplot() +
geom_line(data=r_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=scala_hits, x=date, color="b"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=nosql_hits, x=date, color="c"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=c_hits, x=date, color="d"), size=1) +
geom_line(data=r_trend$interest_over_time, aes(y=datawarehouse_hits, x=date, color="e"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Stable Interest Over Time",
labels = c("R", "Scala", "NoSQL", "C", "Data Warehouse"))# Plot peaked interest over time
ggplot() +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=hits, x=date, color="a"), size=1) +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=as.numeric(tensorflow_hits),
x=date, color="b"), size=1) +
geom_line(data=bigtrend_trend$interest_over_time, aes(y=hadoop_hits, x=date, color="c"), size=1) +
xlab("Date") + ylab("Interest over time") +
scale_color_discrete(name = "Peak Popularity", labels = c("Big Data", "Tensorflow", "Hadoop"))
Findings:
Terms with increasing popularity provide the least surprising findings, though the sharp increase in Azure was somewhat unforeseen. Among terms decreasing in popularity, the steady drop in SQL was unexpected; after discussing it within the team, we think some SQL searches may have been replaced by terms such as “BigQuery” and “PostgreSQL”, which are both rising in search volume. Determining which skills are stable in interest is as important as knowing which are rising, since you can make stronger assumptions that those skills will remain relevant. It is encouraging that R’s popularity has remained fairly stable over the last 11 years, as the CUNY S.P.S. Data Science program is tightly coupled with the language.
One use case for analyzing the most sought-after data science skills is predicting a salary range from a person’s current skill set, or from the skill set they plan to cultivate. This challenge takes the form of a classification problem: we feed in an individual’s skills and location and return an estimated salary range. The task required extensive data preparation – dummy variables for over a hundred attributes – in order to fit the models of interest: a classification tree, a random forest, and naive Bayes.
# Include skills with at least 100 observations
skills_hundred_obs <- top_skills %>% filter(n >= 100)
df_jobs_skills_hundred_obs <- d607_p3_all_recs %>% filter(skill_desc %in% skills_hundred_obs$skill_desc)
# Create dummy variables for skills with at least 100 observations
df_jobs_skills_hundred_obs_dummies <- dummy_cols(df_jobs_skills_hundred_obs,
select_columns = c("skill_desc"), remove_first_dummy = TRUE)
# Remove unneeded columns
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
select(-(c("job_title", "company", "no_of_emp",
"revenue_range", "industry",
"postal_code", "skill_id", "skill_desc")))
# Group by job_id and for each column use max value (0 or 1) as aggregate value
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
group_by(job_id) %>%
summarise_each(list(max))
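Note that summarise_each() is deprecated in current releases of dplyr; the same aggregation (here and in the locations chunk below) can be written with across() (an equivalent alternative, not the code that produced the results shown):
# equivalent aggregation using the non-deprecated dplyr verbs
df_jobs_skills_hundred_obs_dummies <- df_jobs_skills_hundred_obs_dummies %>%
  group_by(job_id) %>%
  summarise(across(everything(), max))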
# Find top locations by observation count
top_locations <- d607_p3_all_recs %>% select(skill_id, postal_code) %>%
group_by(postal_code) %>%
summarise(n = n()) %>%
arrange(desc(n))
# Include locations with at least 100 observations
locations_hundred_obs <- top_locations %>% filter(n >= 100)
df_jobs_locs_hundred_obs <- d607_p3_all_recs %>%
filter(postal_code %in% locations_hundred_obs$postal_code)
# Create dummy variables for locations with at least 100 observations
df_jobs_locs_hundred_obs_dummies <- dummy_cols(df_jobs_locs_hundred_obs,
select_columns = c("postal_code"), remove_first_dummy = TRUE)
# Remove unneeded columns
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>%
select(-(c("job_title", "company", "no_of_emp",
"salary_range", "revenue_range", "industry",
"postal_code", "skill_id", "skill_desc")))
# Group by job_id and for each column use max value of one of the rows as aggregate value
df_jobs_locs_hundred_obs_dummies <- df_jobs_locs_hundred_obs_dummies %>%
group_by(job_id) %>%
summarise_each(list(max))
# Join skills and locations
df_model <- df_jobs_skills_hundred_obs_dummies %>% inner_join(df_jobs_locs_hundred_obs_dummies)
# Rename problematic column names
df_model <- df_model %>% rename(skill_desc_C = "skill_desc_C/C++",
skill_desc_C_Sharp = "skill_desc_C#",
skill_desc_CICD = "skill_desc_CI/CD",
                              skill_desc_TSSCIClearance = "skill_desc_TS/SCIClearance")
# Create train test split
train <- df_model %>% dplyr::sample_frac(.75)
test <- dplyr::anti_join(df_model, train, by = 'job_id')
# Reorder target variable levels
positions <- c("<80000", "80000-99999", "100000-119999", "120000-139999", "140000-159999", ">160000")
df_model$salary_range <- factor(df_model$salary_range, levels=positions)
# To get an idea of minimum accuracy
# Review the distribution / class sizes of the target variable by percentage
ggplot(df_model, aes(x = salary_range)) +
geom_bar(aes(y = (..count..)/sum(..count..)), fill="turquoise3") +
ggtitle("Salary Range by Percentage") + xlab("Percentage") + ylab("Salary Range")# Fit model training data
model_ct <- rpart(salary_range ~ . - job_id, method="class", data=train)
# Plot classification tree
rpart.plot(model_ct, main="Classification Tree for Salary Range")
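The fitted tree’s complexity table could also be inspected (not shown in the original output) to see how much each additional split improves the cross-validated error:
printcp(model_ct)   # complexity parameter table with cross-validated error at each split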
# Make predictions on test data
preds_ct <- predict(model_ct, as.data.frame(test), type = "class")
# Review Confusion Matrix
confusionMatrix(preds_ct, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 148 22 84 40 13
## >160000 0 0 0 0 0
## 100000-119999 27 25 124 117 50
## 120000-139999 3 22 95 130 60
## 140000-159999 3 39 35 69 73
## 80000-99999 0 0 0 0 0
## Reference
## Prediction 80000-99999
## <80000 109
## >160000 0
## 100000-119999 75
## 120000-139999 10
## 140000-159999 23
## 80000-99999 0
##
## Overall Statistics
##
## Accuracy : 0.3403
## 95% CI : (0.3154, 0.3658)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : 8.572e-13
##
## Kappa : 0.1815
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.8177 0.00000 0.36686
## Specificity 0.7794 1.00000 0.72212
## Pos Pred Value 0.3558 NaN 0.29665
## Neg Pred Value 0.9663 0.92264 0.78119
## Prevalence 0.1297 0.07736 0.24212
## Detection Rate 0.1060 0.00000 0.08883
## Detection Prevalence 0.2980 0.00000 0.29943
## Balanced Accuracy 0.7986 0.50000 0.54449
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.36517 0.37245
## Specificity 0.81731 0.85917
## Pos Pred Value 0.40625 0.30165
## Neg Pred Value 0.78996 0.89341
## Prevalence 0.25501 0.14040
## Detection Rate 0.09312 0.05229
## Detection Prevalence 0.22923 0.17335
## Balanced Accuracy 0.59124 0.61581
## Class: 80000-99999
## Sensitivity 0.0000
## Specificity 1.0000
## Pos Pred Value NaN
## Neg Pred Value 0.8446
## Prevalence 0.1554
## Detection Rate 0.0000
## Detection Prevalence 0.0000
## Balanced Accuracy 0.5000
# Fit Top Skills and Locations Random Forest Model
model_rf <- randomForest(as.factor(salary_range) ~ . - job_id, data=train)
# Make predictions on test data
preds_rf <- predict(model_rf, test)
# Review Confusion Matrix
confusionMatrix(preds_rf, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 139 12 38 15 7
## >160000 0 22 3 4 14
## 100000-119999 16 11 169 99 24
## 120000-139999 1 27 87 179 54
## 140000-159999 1 32 15 44 92
## 80000-99999 24 4 26 15 5
## Reference
## Prediction 80000-99999
## <80000 63
## >160000 1
## 100000-119999 65
## 120000-139999 16
## 140000-159999 5
## 80000-99999 67
##
## Overall Statistics
##
## Accuracy : 0.4785
## 95% CI : (0.452, 0.5051)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3516
##
## Mcnemar's Test P-Value : 8.272e-16
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.76796 0.20370 0.5000
## Specificity 0.88889 0.98292 0.7968
## Pos Pred Value 0.50730 0.50000 0.4401
## Neg Pred Value 0.96257 0.93639 0.8330
## Prevalence 0.12966 0.07736 0.2421
## Detection Rate 0.09957 0.01576 0.1211
## Detection Prevalence 0.19628 0.03152 0.2751
## Balanced Accuracy 0.82842 0.59331 0.6484
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.5028 0.4694
## Specificity 0.8221 0.9192
## Pos Pred Value 0.4918 0.4868
## Neg Pred Value 0.8285 0.9138
## Prevalence 0.2550 0.1404
## Detection Rate 0.1282 0.0659
## Detection Prevalence 0.2607 0.1354
## Balanced Accuracy 0.6625 0.6943
## Class: 80000-99999
## Sensitivity 0.30876
## Specificity 0.93723
## Pos Pred Value 0.47518
## Neg Pred Value 0.88048
## Prevalence 0.15544
## Detection Rate 0.04799
## Detection Prevalence 0.10100
## Balanced Accuracy 0.62300
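Although not part of the original output, the random forest’s variable importance could be plotted to see which skill and location dummies drive the salary predictions:
# mean decrease in Gini impurity for the most influential predictors
varImpPlot(model_rf, n.var = 15, main = "Top Predictors of Salary Range")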
# Fit Naive Bayes model
model_nb <- naiveBayes(as.factor(salary_range) ~ . - job_id, data = train)
# Make predictions on test data
preds_nb <- predict(model_nb, as.data.frame(test))
# Review Confusion Matrix
confusionMatrix(preds_nb, as.factor(test$salary_range))
## Confusion Matrix and Statistics
##
## Reference
## Prediction <80000 >160000 100000-119999 120000-139999 140000-159999
## <80000 141 17 64 22 11
## >160000 33 81 183 219 149
## 100000-119999 0 0 9 4 0
## 120000-139999 0 0 2 7 3
## 140000-159999 4 9 74 96 32
## 80000-99999 3 1 6 8 1
## Reference
## Prediction 80000-99999
## <80000 108
## >160000 82
## 100000-119999 3
## 120000-139999 2
## 140000-159999 17
## 80000-99999 5
##
## Overall Statistics
##
## Accuracy : 0.197
## 95% CI : (0.1764, 0.2188)
## No Information Rate : 0.255
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1013
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: <80000 Class: >160000 Class: 100000-119999
## Sensitivity 0.7790 0.75000 0.026627
## Specificity 0.8173 0.48292 0.993384
## Pos Pred Value 0.3884 0.10843 0.562500
## Neg Pred Value 0.9613 0.95840 0.761594
## Prevalence 0.1297 0.07736 0.242120
## Detection Rate 0.1010 0.05802 0.006447
## Detection Prevalence 0.2600 0.53510 0.011461
## Balanced Accuracy 0.7981 0.61646 0.510005
## Class: 120000-139999 Class: 140000-159999
## Sensitivity 0.019663 0.16327
## Specificity 0.993269 0.83333
## Pos Pred Value 0.500000 0.13793
## Neg Pred Value 0.747467 0.85911
## Prevalence 0.255014 0.14040
## Detection Rate 0.005014 0.02292
## Detection Prevalence 0.010029 0.16619
## Balanced Accuracy 0.506466 0.49830
## Class: 80000-99999
## Sensitivity 0.023041
## Specificity 0.983885
## Pos Pred Value 0.208333
## Neg Pred Value 0.845481
## Prevalence 0.155444
## Detection Rate 0.003582
## Detection Prevalence 0.017192
## Balanced Accuracy 0.503463
Findings:
Random Forest provides the best results across several metrics: per-class Sensitivity, Specificity, and Balanced Accuracy, as well as overall accuracy and Kappa. Both Sensitivity and Specificity are useful here because there is no strong preference for avoiding either false positives or false negatives. Kappa and Balanced Accuracy are more informative than raw accuracy because the classes are imbalanced. The classification tree’s strongest point is naturally its easy-to-read tree visual, which shows how selected factors lead to a particular salary range. Naive Bayes is the only model whose accuracy fails to beat the no-information rate of roughly 25% set by the largest class. It is, however, modestly useful at identifying the lowest and highest salary ranges (<80000 and >160000), which may be the most valuable use of a salary-range classifier.
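For reference, the Balanced Accuracy reported by caret is simply the mean of a class’s sensitivity and specificity, which is why it is more informative than raw accuracy when class sizes differ. Reproducing the random forest’s figure for the <80000 class from the statistics above:
sens <- 0.76796    # sensitivity for the <80000 class (random forest)
spec <- 0.88889    # specificity for the <80000 class (random forest)
(sens + spec) / 2  # 0.828425, matching the reported Balanced Accuracy of 0.82842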
Philippe Massicotte, Dirk Eddelbuettel (2021). gtrendsR: Perform and Display Google Trends Queries. R package version 1.4.8.
Terry Therneau, Beth Atkinson, Brian Ripley (2019). rpart: Recursive Partitioning for Classification, Regression and Survival Trees. R package version 4.1.15.
Leo Breiman and Adele Cutler (2018). randomForest: Breiman and Cutler’s Random Forests for Classification and Regression. R package version 4.6.14.
Michal Majka (2020). naivebayes: High Performance Implementation of the Naive Bayes Algorithm. R package version 0.9.7.
Max Kuhn, Jed Wing, Steve Weston, Andre Williams (2020). caret: Misc Functions for Training and Plotting Classification and Regression Models. R package version 6.0.86.
Kenneth Tay (2020). “What is balanced accuracy?”
Job dataset (2018). Indeed Dataset - Data Scientist/Analyst/Engineer.
Agile & Scrum. What is Scrum?