Many MS Business Analytics students strive to secure Data Scientist jobs upon completion of the MS BANA program. They often wonder what tools are needed, what opportunities and job titles are available, and the salaries associated with the roles. This analysis is an effort to help out fellow students in search for a Data Scientist job.
We want to understand what skills would be best for students to pursue and would be relevant in their “Data Scientist” job search. We also incorporated the cost-of-living consideration along with the average salary provided in the dataset.
Below is the list of packages used in the project.
# Load libraries
library(ellipsis)
library(rlang)
library(tidyverse)
library(readxl)
library(GGally)
library(reshape2)
library(knitr)
library(pROC)
library(rpart)
library(rpart.plot)
library(dplyr)
library(caret)
library(ROCR)
library(forecast)
library(zoo)
library(fpp3)
library(ggvis)
library(MLmetrics)
library(precrec)
library(class)
library(fpc)
library(factoextra)
library(plyr)
library(psych)
For this case study, we are using a Kaggle dataset that has salary information from 2021. The dataset (secured from Kaggle) is scraped from the Glassdoor website using Selenium scrapper. After scraping, the raw dataset was cleaned and made usable for performing data analysis and modelling. Supplementary data sources were used to incorporate the cost-of-living consideration to arrive at an Indexed salary for specific job locations.
The following diagram presents the “Three Finger Summary” (Noonan, 2018):
“Three Finger Summary” (Noonan, 2018)
DS <- read_excel("MS_BANA_Case_Competition_Data_Spring_2022.xlsx", sheet = "data_cleaned_2021")
# structure of data
str(DS)
## tibble [742 x 42] (S3: tbl_df/tbl/data.frame)
## $ index : num [1:742] 0 1 2 3 4 5 6 7 8 9 ...
## $ Job Title : chr [1:742] "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary Estimate : chr [1:742] "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job Description : chr [1:742] "Data Scientist\r\nLocation: Albuquerque, NM\r\nEducation Required: Bachelorâ\200\231s degree required, preferab"| __truncated__ "What You Will Do:\r\n\r\nI. General Summary\r\n\r\nThe Healthcare Data Scientist position will join our Advance"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\r\nJob ID: 310709\r\n\r\nDirectorate: Earth & Biological Sciences\r\n\r\nDivision: B"| __truncated__ ...
## $ Rating : num [1:742] 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company Name : chr [1:742] "Tecolote Research\r\n3.8" "University of Maryland Medical System\r\n3.4" "KnowBe4\r\n4.8" "PNNL\r\n3.8" ...
## $ Location : chr [1:742] "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr [1:742] "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr [1:742] "501 - 1000" "10000+" "501 - 1000" "1001 - 5000" ...
## $ Founded : num [1:742] 1973 1984 2010 1965 1998 ...
## $ Type of ownership : chr [1:742] "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr [1:742] "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr [1:742] "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr [1:742] "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr [1:742] "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ Hourly : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ Employer provided : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ Lower Salary : num [1:742] 53 63 80 56 86 71 54 86 38 120 ...
## $ Upper Salary : num [1:742] 91 112 90 97 143 119 93 142 84 160 ...
## $ Avg Salary(K) : num [1:742] 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr [1:742] "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
## $ Job Location : chr [1:742] "NM" "MD" "FL" "WA" ...
## $ Age : num [1:742] 48 37 11 56 23 21 13 16 7 12 ...
## $ Python : num [1:742] 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : num [1:742] 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : num [1:742] 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : num [1:742] 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : num [1:742] 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : num [1:742] 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : num [1:742] 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : num [1:742] 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : num [1:742] 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : num [1:742] 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : num [1:742] 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : chr [1:742] "data scientist" "data scientist" "data scientist" "data scientist" ...
## $ seniority_by_title: chr [1:742] "na" "na" "na" "na" ...
## $ Degree : chr [1:742] "M" "M" "M" "na" ...
cat('The dimention is: ', dim(DS),'\n')
## The dimention is: 742 42
cat('Is there any NA data: ', any(is.na(DS)), '\n')
## Is there any NA data: FALSE
#Check the length and see which variables can be converted to factors
apply(DS,2, function(x) length(unique(x)))
## index Job Title Salary Estimate Job Description
## 742 264 416 463
## Rating Company Name Location Headquarters
## 31 343 200 198
## Size Founded Type of ownership Industry
## 8 102 9 60
## Sector Revenue Competitors Hourly
## 25 13 128 2
## Employer provided Lower Salary Upper Salary Avg Salary(K)
## 2 113 162 219
## company_txt Job Location Age Python
## 343 37 102 2
## spark aws excel sql
## 2 2 2 2
## sas keras pytorch scikit
## 2 2 2 2
## tensor hadoop tableau bi
## 2 2 2 2
## flink mongo google_an job_title_sim
## 2 2 2 10
## seniority_by_title Degree
## 3 3
DS <- subset (DS, select = -index)
#Checking for duplicated rows
cat('No. of duplicated rows to be removed:', DS %>%
select(everything()) %>%
duplicated() %>%
sum())
## No. of duplicated rows to be removed: 275
#Removing duplicated rows
index <-
DS %>%
select(everything()) %>%
duplicated() %>%
which()
DS <- DS[-index,]
#DS[DS$company_txt == 'KnowBe4', ]
Steps we followed for Data Preparation:
Value labeling for target Variables (job_title_sim) for predicting the Probability of a “Data Scientist” job posting (Where 1= “Data Scientist” job posting and 0= Other job posting).
New Variable creation for ‘Seniority_by_title’ & ‘Degree’ which will help us analyze these predictor variables as a Binary Variable.
Replacing the -1 values in ‘Rating’ column with the mean of the distribution.
DS$job_title_sim <- ifelse(DS$job_title_sim == "data scientist",1,0)
table(DS$job_title_sim)
##
## 0 1
## 256 211
DS <- select(DS, -c(1:3))
DS <- select(DS, -c(2:15))
DS <- select(DS, -c(3:5))
DS1 <- dummyVars(" ~ .", data = DS)
DS_transformed <- data.frame(predict(DS1, newdata = DS))
str(DS_transformed)
## 'data.frame': 467 obs. of 25 variables:
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ X.Avg.Salary.K.. : num 72 87.5 85 76.5 114.5 ...
## $ Python : num 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : num 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : num 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : num 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : num 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : num 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : num 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : num 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : num 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : num 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : num 1 1 1 1 1 1 1 1 0 1 ...
## $ seniority_by_titlejr: num 0 0 0 0 0 0 0 0 0 0 ...
## $ seniority_by_titlena: num 1 1 1 1 1 1 1 1 1 1 ...
## $ seniority_by_titlesr: num 0 0 0 0 0 0 0 0 0 0 ...
## $ DegreeM : num 1 1 1 0 0 0 0 1 0 0 ...
## $ Degreena : num 0 0 0 1 1 1 1 0 0 1 ...
## $ DegreeP : num 0 0 0 0 0 0 0 0 1 0 ...
# rename
DS_transformed <- dplyr::rename(DS_transformed, Avg.Salary= X.Avg.Salary.K..)
DS_transformed <- dplyr::rename(DS_transformed, job_title= job_title_sim)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Jr= seniority_by_titlejr)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Md= seniority_by_titlena)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Sr= seniority_by_titlesr)
DS_transformed <- dplyr::rename(DS_transformed, Degree.Masters = DegreeM)
DS_transformed <- dplyr::rename(DS_transformed, Degree.NA = Degreena)
DS_transformed <- dplyr::rename(DS_transformed, Degree.PhD = DegreeP)
# convert categorical data to factor
DS_transformed$Python <- as.factor(DS_transformed$Python)
DS_transformed$spark <- as.factor(DS_transformed$spark)
DS_transformed$aws <- as.factor(DS_transformed$aws)
DS_transformed$excel <- as.factor(DS_transformed$excel)
DS_transformed$sql <- as.factor(DS_transformed$sql)
DS_transformed$sas <- as.factor(DS_transformed$sas)
DS_transformed$keras <- as.factor(DS_transformed$keras)
DS_transformed$pytorch <- as.factor(DS_transformed$pytorch)
DS_transformed$scikit <- as.factor(DS_transformed$scikit)
DS_transformed$tensor <- as.factor(DS_transformed$tensor)
DS_transformed$hadoop <- as.factor(DS_transformed$hadoop)
DS_transformed$tableau <- as.factor(DS_transformed$tableau)
DS_transformed$bi <- as.factor(DS_transformed$bi)
DS_transformed$flink <- as.factor(DS_transformed$flink)
DS_transformed$mongo <- as.factor(DS_transformed$mongo)
DS_transformed$google_an <- as.factor(DS_transformed$google_an)
DS_transformed$job_title <- as.factor(DS_transformed$job_title)
DS_transformed$seniority.Jr <- as.factor(DS_transformed$seniority.Jr)
DS_transformed$seniority.Md <- as.factor(DS_transformed$seniority.Md)
DS_transformed$seniority.Sr <- as.factor(DS_transformed$seniority.Sr)
DS_transformed$Degree.Masters <- as.factor(DS_transformed$Degree.Masters)
DS_transformed$Degree.NA <- as.factor(DS_transformed$Degree.NA)
DS_transformed$Degree.PhD <- as.factor(DS_transformed$Degree.PhD)
str(DS_transformed)
## 'data.frame': 467 obs. of 25 variables:
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Avg.Salary : num 72 87.5 85 76.5 114.5 ...
## $ Python : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 1 2 ...
## $ spark : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 2 ...
## $ aws : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
## $ excel : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 1 1 ...
## $ sql : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 2 1 1 ...
## $ sas : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 1 1 1 ...
## $ keras : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ pytorch : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ scikit : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ tensor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ hadoop : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ tableau : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
## $ bi : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
## $ flink : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mongo : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ google_an : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ job_title : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
## $ seniority.Jr : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seniority.Md : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ seniority.Sr : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Degree.Masters: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 1 ...
## $ Degree.NA : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 2 ...
## $ Degree.PhD : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
Visualization work is still in progress.
Next we start with creation of a Training and testing dataset (0.8 as training & 0.2 as testing ) for Model validation and testing.
#Training and testing dataset
index <- sample(nrow(DS_transformed),nrow(DS_transformed)*0.80)
DS_transformed_train = DS_transformed[index,]
DS_transformed_test = DS_transformed[-index,]
Define functions for Prediction performance metrics.
#prediction performance metrics
predict_perform = function(DS_transformed_test,pD){
cat("Accuracy is ", Accuracy(y_pred = pD, y_true = DS_transformed_test$job_title),'\n')
cat("Recall is ", Recall(y_pred = pD, y_true = DS_transformed_test$job_title,positive = '1'),'\n' )
cat("Precision is", Precision(y_pred = pD, y_true = DS_transformed_test$job_title,positive = '1'),'\n')
}
#prediction performance plots
predict_perform_p = function(DS_transformed_test,pD){
precrec_obj <- evalmod(labels = DS_transformed_test$job_title, scores = as.numeric(pD))
autoplot(precrec_obj)
}
#plot variable importance
var_imp =function(model){
v=varImp(model)
print(v)
v$var = rownames(v)
v %>% ggvis(~var, ~Overall) %>% layer_bars()
}
# Decision Tree: a full model
DS_transformed_rpart0 <- rpart(formula = job_title ~ ., data = DS_transformed_train, method = "class")
# As the cost are symmetric, we can ignore the parms argument.
DS_transformed_rpart1 <- rpart(formula = job_title ~ . , data = DS_transformed_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
pred0 <- predict(DS_transformed_rpart0, type="class")
cat('Training Confusion Matrix:\n')
## Training Confusion Matrix:
table(DS_transformed_train$job_title, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 150 52
## 1 26 145
cat('\n')
pred1 <- predict(DS_transformed_rpart1, type="class")
cat('Training Confusion Matrix (with Cost Matrix):\n')
## Training Confusion Matrix (with Cost Matrix):
table(DS_transformed_train$job_title, pred1, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 99 103
## 1 6 165
The first tree model shows prominent impact of the following skill set: Python, Scikit. SAS,Keras .
#Printing the Decision Tree
DS_transformed_rpart0
## n= 373
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 373 171 0 (0.54155496 0.45844504)
## 2) Python=0 165 40 0 (0.75757576 0.24242424)
## 4) Avg.Salary< 107 117 17 0 (0.85470085 0.14529915) *
## 5) Avg.Salary>=107 48 23 0 (0.52083333 0.47916667)
## 10) Degree.PhD=1 12 1 0 (0.91666667 0.08333333) *
## 11) Degree.PhD=0 36 14 1 (0.38888889 0.61111111) *
## 3) Python=1 208 77 1 (0.37019231 0.62980769)
## 6) Avg.Salary< 95.25 72 30 0 (0.58333333 0.41666667)
## 12) sas=0 63 22 0 (0.65079365 0.34920635)
## 24) Rating< 3.45 17 2 0 (0.88235294 0.11764706) *
## 25) Rating>=3.45 46 20 0 (0.56521739 0.43478261)
## 50) Avg.Salary< 67.25 10 1 0 (0.90000000 0.10000000) *
## 51) Avg.Salary>=67.25 36 17 1 (0.47222222 0.52777778)
## 102) tableau=0 28 12 0 (0.57142857 0.42857143)
## 204) Avg.Salary>=87.75 7 0 0 (1.00000000 0.00000000) *
## 205) Avg.Salary< 87.75 21 9 1 (0.42857143 0.57142857)
## 410) Degree.NA=1 13 5 0 (0.61538462 0.38461538) *
## 411) Degree.NA=0 8 1 1 (0.12500000 0.87500000) *
## 103) tableau=1 8 1 1 (0.12500000 0.87500000) *
## 13) sas=1 9 1 1 (0.11111111 0.88888889) *
## 7) Avg.Salary>=95.25 136 35 1 (0.25735294 0.74264706) *
#Plotting the Decision Tree
rpart.plot(DS_transformed_rpart0)
prp(DS_transformed_rpart0, digits = 4, extra = 1)
#view the variable importance based on Decision Tree
var_imp(DS_transformed_rpart0)
## Overall
## Avg.Salary 58.509617
## aws 7.892619
## Degree.Masters 11.863044
## Degree.NA 16.111935
## Degree.PhD 5.995491
## excel 3.648244
## hadoop 2.673350
## keras 11.093486
## Python 27.615351
## Rating 9.578145
## sas 10.884887
## scikit 12.793258
## seniority.Md 5.042705
## seniority.Sr 4.612943
## tableau 3.196698
## tensor 24.282284
## spark 0.000000
## sql 0.000000
## pytorch 0.000000
## bi 0.000000
## flink 0.000000
## mongo 0.000000
## google_an 0.000000
## seniority.Jr 0.000000
#Print accuracy,recall,precision for Training dataset
predict_perform(DS_transformed_train,pred0)
## Accuracy is 0.7908847
## Recall is 0.8479532
## Precision is 0.7360406
#Prediction Performance
predict_perform_p(DS_transformed_train,pred0)
#DS_transformed_rpart1
#rpart.plot(DS_transformed_rpart1)
#prp(DS_transformed_rpart1, extra = 1)
For this binary classification problem, there are 2 types of predictions. One is the predicted class of response (0 or 1), and the second type is the probability of response being 1. We also look at ROC Curve which gives us the trade-off between hit rate (1 - false positive) and false negative, and area under the curve (AUC) as a measure of how good the binary classification model performs.
#In-sample prediction
DS_transformed_train.pred.tree1<- predict(DS_transformed_rpart0, DS_transformed_train, type="class")
table(DS_transformed_train$job_title, DS_transformed_train.pred.tree1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 150 52
## 1 26 145
#Out-of-sample prediction
DS_transformed_test.pred.tree1<- predict(DS_transformed_rpart0, DS_transformed_test, type="class")
table(DS_transformed_test$job_title, DS_transformed_test.pred.tree1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 37 17
## 1 10 30
#Print accuracy,recall,precision for Testing data-set:
predict_perform(DS_transformed_test,DS_transformed_test.pred.tree1)
## Accuracy is 0.712766
## Recall is 0.75
## Precision is 0.6382979
#Prediction performance for Testing data-set:
predict_perform_p(DS_transformed_test,DS_transformed_test.pred.tree1)
#Probability of getting 1
DS_transformed_test_prob_rpart = predict(DS_transformed_rpart0, DS_transformed_test, type="prob")
pred3 = prediction(DS_transformed_test_prob_rpart[,2], DS_transformed_test$job_title)
perf3 = performance(pred3, "tpr", "fpr")
plot(perf3, colorize=TRUE)
#Area under the curve is given by:
slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.7497685
#For a given cut-off probability, the 0/1 prediction result can be calculated similar to what we do in logistic regression:
DS_transformed_test_pred_rpart = as.numeric(DS_transformed_test_prob_rpart[,2] > 1/(5+1))
table(DS_transformed_test$job_title, DS_transformed_test_pred_rpart, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 35 19
## 1 8 32
DS_transformed_train_prob_rpart = predict(DS_transformed_rpart0, DS_transformed_train, type="prob")
pred4 = prediction(DS_transformed_test_pred_rpart, DS_transformed_test$job_title)
perf4 = performance(pred4, "tpr", "fpr")
plot(perf4, colorize=TRUE)
Next, We will compare classification tree model’s out-of-sample performance with the logistic regression model with all variables in it.
#comparing this model’s out-of-sample performance with the logistic regression model with all variables in it.
#Fit logistic regression model
DS_transformed_glm0 <- glm(job_title~.,data = DS_transformed_train,family=binomial)
DS_transformed_glm1 <- glm(job_title~Python,data = DS_transformed_train,family=binomial)
DS_transformed_glm2 <- glm(job_title~Python + Avg.Salary,data = DS_transformed_train,family=binomial)
print(DS_transformed_glm0)
##
## Call: glm(formula = job_title ~ ., family = binomial, data = DS_transformed_train)
##
## Coefficients:
## (Intercept) Rating Avg.Salary Python1
## -5.98550 0.30346 0.02604 1.49306
## spark1 aws1 excel1 sql1
## -0.68070 -0.80033 -0.23530 -0.62052
## sas1 keras1 pytorch1 scikit1
## 1.63346 15.36501 0.73914 0.61141
## tensor1 hadoop1 tableau1 bi1
## 1.71327 0.51257 0.77570 -0.53581
## flink1 mongo1 google_an1 seniority.Jr1
## -0.72009 -0.02517 -0.83780 -14.42356
## seniority.Md1 seniority.Sr1 Degree.Masters1 Degree.NA1
## 0.71139 NA 1.47797 0.67457
## Degree.PhD1
## NA
##
## Degrees of Freedom: 372 Total (i.e. Null); 350 Residual
## Null Deviance: 514.5
## Residual Deviance: 339.1 AIC: 385.1
print(DS_transformed_glm1)
##
## Call: glm(formula = job_title ~ Python, family = binomial, data = DS_transformed_train)
##
## Coefficients:
## (Intercept) Python1
## -1.139 1.671
##
## Degrees of Freedom: 372 Total (i.e. Null); 371 Residual
## Null Deviance: 514.5
## Residual Deviance: 456.9 AIC: 460.9
print(DS_transformed_glm2)
##
## Call: glm(formula = job_title ~ Python + Avg.Salary, family = binomial,
## data = DS_transformed_train)
##
## Coefficients:
## (Intercept) Python1 Avg.Salary
## -3.08101 1.44199 0.02004
##
## Degrees of Freedom: 372 Total (i.e. Null); 370 Residual
## Null Deviance: 514.5
## Residual Deviance: 423.2 AIC: 429.2
DS_transformed_train_pred_glm0 <- predict(DS_transformed_glm0,DS_transformed_train, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_train_predict0 = ifelse(predict(DS_transformed_glm0, DS_transformed_train,type = "response")>=0.5,1,0)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_train_pred_glm1 <- predict(DS_transformed_glm1,DS_transformed_train, type="response")
DS_transformed_train_predict1 = ifelse(predict(DS_transformed_glm1, DS_transformed_train,type = "response")>=0.5,1,0)
DS_transformed_train_pred_glm2 <- predict(DS_transformed_glm2,DS_transformed_train, type="response")
DS_transformed_train_predict2 = ifelse(predict(DS_transformed_glm2, DS_transformed_train,type = "response")>=0.5,1,0)
#Get binary prediction
DS_transformed_test_pred_glm0 <- predict(DS_transformed_glm0, DS_transformed_test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_test_predict0 = ifelse(predict(DS_transformed_glm0, DS_transformed_test,type = "response")>=0.5,1,0)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
predict_perform(DS_transformed_test,DS_transformed_test_predict0)
## Accuracy is 0.6914894
## Recall is 0.65
## Precision is 0.6341463
#Calculate cost using test set
#cost(DS_transformed_test$default, credit_test_pred_glm)
#view the variable importance
var_imp(DS_transformed_glm0)
## Overall
## Rating 1.580754821
## Avg.Salary 5.468337828
## Python1 4.459373526
## spark1 1.577719413
## aws1 2.319269291
## excel1 0.856116605
## sql1 1.853099092
## sas1 2.831559090
## keras1 0.017730767
## pytorch1 0.611201466
## scikit1 0.882307310
## tensor1 1.877450940
## hadoop1 1.162010556
## tableau1 1.919218741
## bi1 0.919593979
## flink1 0.640973137
## mongo1 0.033498605
## google_an1 0.658063147
## seniority.Jr1 0.005270204
## seniority.Md1 2.105163805
## Degree.Masters1 3.043678594
## Degree.NA1 1.431529890
#Confusion matrix
table(DS_transformed_test$job_title, as.numeric(DS_transformed_test_pred_glm0>1/6), dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 23 31
## 1 4 36
#Compare log regression models:
predict_perform(DS_transformed_train,DS_transformed_train_predict0)
## Accuracy is 0.7882038
## Recall is 0.7309942
## Precision is 0.7911392
predict_perform(DS_transformed_train,DS_transformed_train_predict1)
## Accuracy is 0.6863271
## Recall is 0.7660819
## Precision is 0.6298077
predict_perform(DS_transformed_train,DS_transformed_train_predict2)
## Accuracy is 0.7211796
## Recall is 0.7192982
## Precision is 0.6871508
# Data Mining Approaches
DS_Cluster <- read_excel("MS_BANA_Case_Competition_Data_Spring_2022.xlsx", sheet = "data")
DS_Cluster$job_title_sim <- ifelse(DS_Cluster$job_title_sim == "data scientist",1,0)
table(DS_Cluster$job_title_sim)
##
## 0 1
## 256 211
DS_Cluster <- select(DS_Cluster, -c(1:3))
DS_Cluster <- select(DS_Cluster, -c(2:15))
DS_Cluster <- select(DS_Cluster, -c(3:5))
table (DS_Cluster$Degree)
##
## M na P
## 159 245 63
table (DS_Cluster$seniority_by_title)
##
## jr na sr
## 3 340 124
DS0 <- dummyVars(" ~ .", data = DS_Cluster)
DS_Cluster_transformed <- data.frame(predict(DS0, newdata = DS_Cluster))
str(DS_Cluster_transformed)
## 'data.frame': 467 obs. of 26 variables:
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ X.Avg.Salary.K.. : num 72 87.5 85 76.5 114.5 ...
## $ Python : num 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : num 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : num 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : num 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : num 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : num 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : num 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : num 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : num 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : num 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : num 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : num 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : num 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : num 1 1 1 1 1 1 1 1 0 1 ...
## $ seniority_by_titlejr: num 0 0 0 0 0 0 0 0 0 0 ...
## $ seniority_by_titlena: num 1 1 1 1 1 1 1 1 1 1 ...
## $ seniority_by_titlesr: num 0 0 0 0 0 0 0 0 0 0 ...
## $ DegreeM : num 1 1 1 0 0 0 0 1 0 0 ...
## $ Degreena : num 0 0 0 1 1 1 1 0 0 1 ...
## $ DegreeP : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Cluster : num 1 1 1 1 1 1 1 2 1 1 ...
#Supervised Learning: Classification Analysis
Data_Scientist <- rbind(DS_Cluster_transformed[DS_Cluster_transformed$job_title_sim =="1",])
Others <- rbind(DS_Cluster_transformed[DS_Cluster_transformed$job_title_sim =="0",])
ind <- sample(nrow(DS_Cluster_transformed),nrow(DS_Cluster_transformed)*0.80)
DS_Cluster_transformed_train <- rbind(Data_Scientist[ind,], Others[ind,])
DS_Cluster_transformed_test <- rbind(Data_Scientist[-ind,], Others[-ind,])
DS_Cluster_cl = DS_Cluster_transformed_train[,19]
DS_Cluster_knn <- knn(train = na.omit(select(DS_Cluster_transformed_train,-19)), test = na.omit(select(DS_Cluster_transformed_test,-19)), cl=na.omit(DS_Cluster_transformed_train[,19]), k=2)
table(DS_Cluster_transformed_test[,19], DS_Cluster_knn, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 38 14
## 1 22 27
plotcluster(na.omit(select(DS_Cluster_transformed_test,-19)), DS_Cluster_knn)
# Unsupervised Learning: Clustering Analysis
index0 <- sample(nrow(DS_Cluster_transformed),nrow(DS_Cluster_transformed)*0.80)
DS_Cluster_train = DS_Cluster_transformed[index0,]
DS_Cluster_test = DS_Cluster_transformed[-index0,]
fit <- kmeans(na.omit(DS_Cluster_transformed), 2)
plotcluster(na.omit(DS_Cluster_transformed), fit$cluster)
# Display number of observations in each cluster
table(fit$cluster)
##
## 1 2
## 155 312
sapply(DS_Cluster_transformed, is.numeric)
## Rating X.Avg.Salary.K.. Python
## TRUE TRUE TRUE
## spark aws excel
## TRUE TRUE TRUE
## sql sas keras
## TRUE TRUE TRUE
## pytorch scikit tensor
## TRUE TRUE TRUE
## hadoop tableau bi
## TRUE TRUE TRUE
## flink mongo google_an
## TRUE TRUE TRUE
## job_title_sim seniority_by_titlejr seniority_by_titlena
## TRUE TRUE TRUE
## seniority_by_titlesr DegreeM Degreena
## TRUE TRUE TRUE
## DegreeP Cluster
## TRUE TRUE
sort(colSums(is.na(DS_Cluster_transformed)))
## Rating X.Avg.Salary.K.. Python
## 0 0 0
## spark aws excel
## 0 0 0
## sql sas keras
## 0 0 0
## pytorch scikit tensor
## 0 0 0
## hadoop tableau bi
## 0 0 0
## flink mongo google_an
## 0 0 0
## job_title_sim seniority_by_titlejr seniority_by_titlena
## 0 0 0
## seniority_by_titlesr DegreeM Degreena
## 0 0 0
## DegreeP Cluster
## 0 0
DS_Cluster_transformed <- scale(DS_Cluster_transformed,center = TRUE, scale = TRUE)
fit <- kmeans(na.omit(DS_Cluster_transformed), 2)
plotcluster(na.omit(DS_Cluster_transformed), fit$cluster)
fviz_cluster(fit, data = DS_Cluster_transformed)