What should I learn in 2022 to become a Data Scientist ?

Problem Definition Summary

Many MS Business Analytics students strive to secure Data Scientist jobs upon completion of the MS BANA program. They often wonder what tools are needed, what opportunities and job titles are available, and the salaries associated with the roles. This analysis is an effort to help out fellow students in search for a Data Scientist job.

We want to understand what skills would be best for students to pursue and would be relevant in their “Data Scientist” job search. We also incorporated the cost-of-living consideration along with the average salary provided in the dataset.

Packages Required

Below is the list of packages used in the project.

# Load libraries
library(ellipsis)
library(rlang)
library(tidyverse)
library(readxl)
library(GGally)
library(reshape2)
library(knitr)
library(pROC)
library(rpart)
library(rpart.plot)
library(dplyr)
library(caret)
library(ROCR)
library(forecast)
library(zoo)
library(fpp3)
library(ggvis)
library(MLmetrics)
library(precrec)
library(class)
library(fpc)
library(factoextra)
library(plyr)
library(psych)

Dataset Overview

For this case study, we are using a Kaggle dataset that has salary information from 2021. The dataset (secured from Kaggle) is scraped from the Glassdoor website using Selenium scrapper. After scraping, the raw dataset was cleaned and made usable for performing data analysis and modelling. Supplementary data sources were used to incorporate the cost-of-living consideration to arrive at an Indexed salary for specific job locations.

The following diagram presents the “Three Finger Summary” (Noonan, 2018):

“Three Finger Summary” (Noonan, 2018)

DS <- read_excel("MS_BANA_Case_Competition_Data_Spring_2022.xlsx", sheet = "data_cleaned_2021")

# structure of data
str(DS)
## tibble [742 x 42] (S3: tbl_df/tbl/data.frame)
##  $ index             : num [1:742] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Job Title         : chr [1:742] "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
##  $ Salary Estimate   : chr [1:742] "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
##  $ Job Description   : chr [1:742] "Data Scientist\r\nLocation: Albuquerque, NM\r\nEducation Required: Bachelorâ\200\231s degree required, preferab"| __truncated__ "What You Will Do:\r\n\r\nI. General Summary\r\n\r\nThe Healthcare Data Scientist position will join our Advance"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\r\nJob ID: 310709\r\n\r\nDirectorate: Earth & Biological Sciences\r\n\r\nDivision: B"| __truncated__ ...
##  $ Rating            : num [1:742] 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Company Name      : chr [1:742] "Tecolote Research\r\n3.8" "University of Maryland Medical System\r\n3.4" "KnowBe4\r\n4.8" "PNNL\r\n3.8" ...
##  $ Location          : chr [1:742] "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Headquarters      : chr [1:742] "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
##  $ Size              : chr [1:742] "501 - 1000" "10000+" "501 - 1000" "1001 - 5000" ...
##  $ Founded           : num [1:742] 1973 1984 2010 1965 1998 ...
##  $ Type of ownership : chr [1:742] "Company - Private" "Other Organization" "Company - Private" "Government" ...
##  $ Industry          : chr [1:742] "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
##  $ Sector            : chr [1:742] "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
##  $ Revenue           : chr [1:742] "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
##  $ Competitors       : chr [1:742] "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
##  $ Hourly            : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Employer provided : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Lower Salary      : num [1:742] 53 63 80 56 86 71 54 86 38 120 ...
##  $ Upper Salary      : num [1:742] 91 112 90 97 143 119 93 142 84 160 ...
##  $ Avg Salary(K)     : num [1:742] 72 87.5 85 76.5 114.5 ...
##  $ company_txt       : chr [1:742] "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
##  $ Job Location      : chr [1:742] "NM" "MD" "FL" "WA" ...
##  $ Age               : num [1:742] 48 37 11 56 23 21 13 16 7 12 ...
##  $ Python            : num [1:742] 1 1 1 1 1 1 0 1 0 1 ...
##  $ spark             : num [1:742] 0 0 1 0 0 0 0 1 0 1 ...
##  $ aws               : num [1:742] 0 0 0 0 0 1 0 1 0 0 ...
##  $ excel             : num [1:742] 1 0 1 0 1 1 1 1 0 0 ...
##  $ sql               : num [1:742] 0 0 1 0 1 1 0 1 0 0 ...
##  $ sas               : num [1:742] 1 0 1 0 1 0 0 0 0 0 ...
##  $ keras             : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ pytorch           : num [1:742] 0 0 0 0 0 0 0 1 0 0 ...
##  $ scikit            : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ tensor            : num [1:742] 0 0 0 0 0 0 0 1 0 0 ...
##  $ hadoop            : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ tableau           : num [1:742] 1 0 0 0 0 0 0 0 0 0 ...
##  $ bi                : num [1:742] 1 0 0 0 0 1 0 0 0 0 ...
##  $ flink             : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ mongo             : num [1:742] 0 0 0 0 0 1 0 0 0 0 ...
##  $ google_an         : num [1:742] 0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title_sim     : chr [1:742] "data scientist" "data scientist" "data scientist" "data scientist" ...
##  $ seniority_by_title: chr [1:742] "na" "na" "na" "na" ...
##  $ Degree            : chr [1:742] "M" "M" "M" "na" ...
cat('The dimention is: ', dim(DS),'\n')
## The dimention is:  742 42
cat('Is there any NA data: ', any(is.na(DS)), '\n')
## Is there any NA data:  FALSE
#Check the length and see which variables can be converted to factors
apply(DS,2, function(x) length(unique(x)))
##              index          Job Title    Salary Estimate    Job Description 
##                742                264                416                463 
##             Rating       Company Name           Location       Headquarters 
##                 31                343                200                198 
##               Size            Founded  Type of ownership           Industry 
##                  8                102                  9                 60 
##             Sector            Revenue        Competitors             Hourly 
##                 25                 13                128                  2 
##  Employer provided       Lower Salary       Upper Salary      Avg Salary(K) 
##                  2                113                162                219 
##        company_txt       Job Location                Age             Python 
##                343                 37                102                  2 
##              spark                aws              excel                sql 
##                  2                  2                  2                  2 
##                sas              keras            pytorch             scikit 
##                  2                  2                  2                  2 
##             tensor             hadoop            tableau                 bi 
##                  2                  2                  2                  2 
##              flink              mongo          google_an      job_title_sim 
##                  2                  2                  2                 10 
## seniority_by_title             Degree 
##                  3                  3
DS <- subset (DS, select = -index)

#Checking for duplicated rows
cat('No. of duplicated rows to be removed:', DS %>%
  select(everything()) %>%
  duplicated() %>% 
  sum())
## No. of duplicated rows to be removed: 275
#Removing duplicated rows
index <- 
  DS %>% 
  select(everything()) %>% 
  duplicated() %>% 
  which()

DS <- DS[-index,]

#DS[DS$company_txt == 'KnowBe4', ]

Data Preparation

Steps we followed for Data Preparation:

  • Value labeling for target Variables (job_title_sim) for predicting the Probability of a “Data Scientist” job posting (Where 1= “Data Scientist” job posting and 0= Other job posting).

  • New Variable creation for ‘Seniority_by_title’ & ‘Degree’ which will help us analyze these predictor variables as a Binary Variable.

  • Replacing the -1 values in ‘Rating’ column with the mean of the distribution.

DS$job_title_sim <- ifelse(DS$job_title_sim == "data scientist",1,0)

table(DS$job_title_sim)
## 
##   0   1 
## 256 211
DS <- select(DS, -c(1:3))
DS <- select(DS, -c(2:15))
DS <- select(DS, -c(3:5))

DS1 <- dummyVars(" ~ .", data = DS)
DS_transformed <- data.frame(predict(DS1, newdata = DS))
str(DS_transformed)
## 'data.frame':    467 obs. of  25 variables:
##  $ Rating              : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ X.Avg.Salary.K..    : num  72 87.5 85 76.5 114.5 ...
##  $ Python              : num  1 1 1 1 1 1 0 1 0 1 ...
##  $ spark               : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws                 : num  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel               : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ sql                 : num  0 0 1 0 1 1 0 1 0 0 ...
##  $ sas                 : num  1 0 1 0 1 0 0 0 0 0 ...
##  $ keras               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pytorch             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ scikit              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tensor              : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hadoop              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tableau             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ bi                  : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ flink               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mongo               : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ google_an           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title_sim       : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ seniority_by_titlejr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ seniority_by_titlena: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ seniority_by_titlesr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DegreeM             : num  1 1 1 0 0 0 0 1 0 0 ...
##  $ Degreena            : num  0 0 0 1 1 1 1 0 0 1 ...
##  $ DegreeP             : num  0 0 0 0 0 0 0 0 1 0 ...
# rename
DS_transformed <- dplyr::rename(DS_transformed, Avg.Salary= X.Avg.Salary.K..)
DS_transformed <- dplyr::rename(DS_transformed, job_title= job_title_sim)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Jr= seniority_by_titlejr)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Md= seniority_by_titlena)
DS_transformed <- dplyr::rename(DS_transformed, seniority.Sr= seniority_by_titlesr)
DS_transformed <- dplyr::rename(DS_transformed, Degree.Masters = DegreeM)
DS_transformed <- dplyr::rename(DS_transformed, Degree.NA = Degreena)
DS_transformed <- dplyr::rename(DS_transformed, Degree.PhD = DegreeP)

# convert categorical data to factor
DS_transformed$Python <- as.factor(DS_transformed$Python)
DS_transformed$spark <- as.factor(DS_transformed$spark)
DS_transformed$aws <- as.factor(DS_transformed$aws)
DS_transformed$excel <- as.factor(DS_transformed$excel)
DS_transformed$sql <- as.factor(DS_transformed$sql)
DS_transformed$sas <- as.factor(DS_transformed$sas)
DS_transformed$keras <- as.factor(DS_transformed$keras)
DS_transformed$pytorch <- as.factor(DS_transformed$pytorch)
DS_transformed$scikit <- as.factor(DS_transformed$scikit)
DS_transformed$tensor <- as.factor(DS_transformed$tensor)
DS_transformed$hadoop <- as.factor(DS_transformed$hadoop)
DS_transformed$tableau <- as.factor(DS_transformed$tableau)
DS_transformed$bi <- as.factor(DS_transformed$bi)
DS_transformed$flink <- as.factor(DS_transformed$flink)
DS_transformed$mongo <- as.factor(DS_transformed$mongo)
DS_transformed$google_an <- as.factor(DS_transformed$google_an)
DS_transformed$job_title <- as.factor(DS_transformed$job_title)
DS_transformed$seniority.Jr <- as.factor(DS_transformed$seniority.Jr)
DS_transformed$seniority.Md <- as.factor(DS_transformed$seniority.Md)
DS_transformed$seniority.Sr <- as.factor(DS_transformed$seniority.Sr)
DS_transformed$Degree.Masters <- as.factor(DS_transformed$Degree.Masters)
DS_transformed$Degree.NA <- as.factor(DS_transformed$Degree.NA)
DS_transformed$Degree.PhD <- as.factor(DS_transformed$Degree.PhD)

str(DS_transformed)
## 'data.frame':    467 obs. of  25 variables:
##  $ Rating        : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ Avg.Salary    : num  72 87.5 85 76.5 114.5 ...
##  $ Python        : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 2 1 2 ...
##  $ spark         : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 2 1 2 ...
##  $ aws           : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
##  $ excel         : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 2 2 1 1 ...
##  $ sql           : Factor w/ 2 levels "0","1": 1 1 2 1 2 2 1 2 1 1 ...
##  $ sas           : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 1 1 1 ...
##  $ keras         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pytorch       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ scikit        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tensor        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ hadoop        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tableau       : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
##  $ bi            : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 1 1 ...
##  $ flink         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mongo         : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ google_an     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ job_title     : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 2 ...
##  $ seniority.Jr  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seniority.Md  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ seniority.Sr  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Degree.Masters: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 2 1 1 ...
##  $ Degree.NA     : Factor w/ 2 levels "0","1": 1 1 1 2 2 2 2 1 1 2 ...
##  $ Degree.PhD    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...

Including Plots

Visualization work is still in progress.

Validation and Training

Next we start with creation of a Training and testing dataset (0.8 as training & 0.2 as testing ) for Model validation and testing.

#Training and testing dataset

index <- sample(nrow(DS_transformed),nrow(DS_transformed)*0.80)
DS_transformed_train = DS_transformed[index,]
DS_transformed_test = DS_transformed[-index,]

Define functions for Prediction performance metrics.

#prediction performance metrics  
predict_perform = function(DS_transformed_test,pD){
  cat("Accuracy is ", Accuracy(y_pred = pD, y_true = DS_transformed_test$job_title),'\n') 
  cat("Recall is ", Recall(y_pred = pD, y_true = DS_transformed_test$job_title,positive = '1'),'\n' )
  cat("Precision is", Precision(y_pred = pD, y_true = DS_transformed_test$job_title,positive = '1'),'\n')
}

#prediction performance plots
predict_perform_p = function(DS_transformed_test,pD){
  precrec_obj <- evalmod(labels = DS_transformed_test$job_title, scores = as.numeric(pD))
  autoplot(precrec_obj)
}

#plot variable importance 
var_imp =function(model){
  v=varImp(model)
  print(v)
  v$var = rownames(v)
  v %>% ggvis(~var, ~Overall) %>% layer_bars()
}

Data Modeling

# Decision Tree: a full model
DS_transformed_rpart0 <- rpart(formula = job_title ~ ., data = DS_transformed_train, method = "class")

# As the cost are symmetric, we can ignore the parms argument.
DS_transformed_rpart1 <- rpart(formula = job_title ~ . , data = DS_transformed_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))

pred0 <- predict(DS_transformed_rpart0, type="class")
cat('Training Confusion Matrix:\n')
## Training Confusion Matrix:
table(DS_transformed_train$job_title, pred0, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0 150  52
##    1  26 145
cat('\n')
pred1 <- predict(DS_transformed_rpart1, type="class")
cat('Training Confusion Matrix (with Cost Matrix):\n')
## Training Confusion Matrix (with Cost Matrix):
table(DS_transformed_train$job_title, pred1, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0  99 103
##    1   6 165

Decision Tree

Printing and ploting the tree

The first tree model shows prominent impact of the following skill set: Python, Scikit. SAS,Keras .

#Printing the Decision Tree
DS_transformed_rpart0
## n= 373 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 373 171 0 (0.54155496 0.45844504)  
##     2) Python=0 165  40 0 (0.75757576 0.24242424)  
##       4) Avg.Salary< 107 117  17 0 (0.85470085 0.14529915) *
##       5) Avg.Salary>=107 48  23 0 (0.52083333 0.47916667)  
##        10) Degree.PhD=1 12   1 0 (0.91666667 0.08333333) *
##        11) Degree.PhD=0 36  14 1 (0.38888889 0.61111111) *
##     3) Python=1 208  77 1 (0.37019231 0.62980769)  
##       6) Avg.Salary< 95.25 72  30 0 (0.58333333 0.41666667)  
##        12) sas=0 63  22 0 (0.65079365 0.34920635)  
##          24) Rating< 3.45 17   2 0 (0.88235294 0.11764706) *
##          25) Rating>=3.45 46  20 0 (0.56521739 0.43478261)  
##            50) Avg.Salary< 67.25 10   1 0 (0.90000000 0.10000000) *
##            51) Avg.Salary>=67.25 36  17 1 (0.47222222 0.52777778)  
##             102) tableau=0 28  12 0 (0.57142857 0.42857143)  
##               204) Avg.Salary>=87.75 7   0 0 (1.00000000 0.00000000) *
##               205) Avg.Salary< 87.75 21   9 1 (0.42857143 0.57142857)  
##                 410) Degree.NA=1 13   5 0 (0.61538462 0.38461538) *
##                 411) Degree.NA=0 8   1 1 (0.12500000 0.87500000) *
##             103) tableau=1 8   1 1 (0.12500000 0.87500000) *
##        13) sas=1 9   1 1 (0.11111111 0.88888889) *
##       7) Avg.Salary>=95.25 136  35 1 (0.25735294 0.74264706) *
#Plotting the Decision Tree
rpart.plot(DS_transformed_rpart0)

prp(DS_transformed_rpart0, digits = 4, extra = 1)

#view the variable importance based on Decision Tree
var_imp(DS_transformed_rpart0)
##                  Overall
## Avg.Salary     58.509617
## aws             7.892619
## Degree.Masters 11.863044
## Degree.NA      16.111935
## Degree.PhD      5.995491
## excel           3.648244
## hadoop          2.673350
## keras          11.093486
## Python         27.615351
## Rating          9.578145
## sas            10.884887
## scikit         12.793258
## seniority.Md    5.042705
## seniority.Sr    4.612943
## tableau         3.196698
## tensor         24.282284
## spark           0.000000
## sql             0.000000
## pytorch         0.000000
## bi              0.000000
## flink           0.000000
## mongo           0.000000
## google_an       0.000000
## seniority.Jr    0.000000
#Print accuracy,recall,precision for Training dataset
predict_perform(DS_transformed_train,pred0)
## Accuracy is  0.7908847 
## Recall is  0.8479532 
## Precision is 0.7360406
#Prediction Performance
predict_perform_p(DS_transformed_train,pred0)

#DS_transformed_rpart1
#rpart.plot(DS_transformed_rpart1)


#prp(DS_transformed_rpart1, extra = 1)

Prediction using classification trees

For this binary classification problem, there are 2 types of predictions. One is the predicted class of response (0 or 1), and the second type is the probability of response being 1. We also look at ROC Curve which gives us the trade-off between hit rate (1 - false positive) and false negative, and area under the curve (AUC) as a measure of how good the binary classification model performs.

#In-sample prediction

DS_transformed_train.pred.tree1<- predict(DS_transformed_rpart0, DS_transformed_train, type="class")
table(DS_transformed_train$job_title, DS_transformed_train.pred.tree1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 150  52
##     1  26 145
#Out-of-sample prediction

DS_transformed_test.pred.tree1<- predict(DS_transformed_rpart0, DS_transformed_test, type="class")
table(DS_transformed_test$job_title, DS_transformed_test.pred.tree1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth  0  1
##     0 37 17
##     1 10 30
#Print accuracy,recall,precision for Testing data-set:

predict_perform(DS_transformed_test,DS_transformed_test.pred.tree1)
## Accuracy is  0.712766 
## Recall is  0.75 
## Precision is 0.6382979
#Prediction performance for Testing data-set:

predict_perform_p(DS_transformed_test,DS_transformed_test.pred.tree1)

#Probability of getting 1

DS_transformed_test_prob_rpart = predict(DS_transformed_rpart0, DS_transformed_test, type="prob")

pred3 = prediction(DS_transformed_test_prob_rpart[,2], DS_transformed_test$job_title)
perf3 = performance(pred3, "tpr", "fpr")
plot(perf3, colorize=TRUE)

#Area under the curve is given by:

slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.7497685
#For a given cut-off probability, the 0/1 prediction result can be calculated similar to what we do in logistic regression:

DS_transformed_test_pred_rpart = as.numeric(DS_transformed_test_prob_rpart[,2] > 1/(5+1))
table(DS_transformed_test$job_title, DS_transformed_test_pred_rpart, dnn=c("Truth","Predicted"))
##      Predicted
## Truth  0  1
##     0 35 19
##     1  8 32
DS_transformed_train_prob_rpart = predict(DS_transformed_rpart0, DS_transformed_train, type="prob")

pred4 = prediction(DS_transformed_test_pred_rpart, DS_transformed_test$job_title)
perf4 = performance(pred4, "tpr", "fpr")
plot(perf4, colorize=TRUE)

Comparing classification tree with logistic regression

Next, We will compare classification tree model’s out-of-sample performance with the logistic regression model with all variables in it.

#comparing this model’s out-of-sample performance with the logistic regression model with all variables in it.

#Fit logistic regression model
DS_transformed_glm0 <- glm(job_title~.,data = DS_transformed_train,family=binomial)
DS_transformed_glm1 <- glm(job_title~Python,data = DS_transformed_train,family=binomial)
DS_transformed_glm2 <- glm(job_title~Python + Avg.Salary,data = DS_transformed_train,family=binomial)

print(DS_transformed_glm0)
## 
## Call:  glm(formula = job_title ~ ., family = binomial, data = DS_transformed_train)
## 
## Coefficients:
##     (Intercept)           Rating       Avg.Salary          Python1  
##        -5.98550          0.30346          0.02604          1.49306  
##          spark1             aws1           excel1             sql1  
##        -0.68070         -0.80033         -0.23530         -0.62052  
##            sas1           keras1         pytorch1          scikit1  
##         1.63346         15.36501          0.73914          0.61141  
##         tensor1          hadoop1         tableau1              bi1  
##         1.71327          0.51257          0.77570         -0.53581  
##          flink1           mongo1       google_an1    seniority.Jr1  
##        -0.72009         -0.02517         -0.83780        -14.42356  
##   seniority.Md1    seniority.Sr1  Degree.Masters1       Degree.NA1  
##         0.71139               NA          1.47797          0.67457  
##     Degree.PhD1  
##              NA  
## 
## Degrees of Freedom: 372 Total (i.e. Null);  350 Residual
## Null Deviance:       514.5 
## Residual Deviance: 339.1     AIC: 385.1
print(DS_transformed_glm1)
## 
## Call:  glm(formula = job_title ~ Python, family = binomial, data = DS_transformed_train)
## 
## Coefficients:
## (Intercept)      Python1  
##      -1.139        1.671  
## 
## Degrees of Freedom: 372 Total (i.e. Null);  371 Residual
## Null Deviance:       514.5 
## Residual Deviance: 456.9     AIC: 460.9
print(DS_transformed_glm2)
## 
## Call:  glm(formula = job_title ~ Python + Avg.Salary, family = binomial, 
##     data = DS_transformed_train)
## 
## Coefficients:
## (Intercept)      Python1   Avg.Salary  
##    -3.08101      1.44199      0.02004  
## 
## Degrees of Freedom: 372 Total (i.e. Null);  370 Residual
## Null Deviance:       514.5 
## Residual Deviance: 423.2     AIC: 429.2
DS_transformed_train_pred_glm0 <- predict(DS_transformed_glm0,DS_transformed_train, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_train_predict0 = ifelse(predict(DS_transformed_glm0, DS_transformed_train,type = "response")>=0.5,1,0)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_train_pred_glm1 <- predict(DS_transformed_glm1,DS_transformed_train, type="response")
DS_transformed_train_predict1 = ifelse(predict(DS_transformed_glm1, DS_transformed_train,type = "response")>=0.5,1,0)

DS_transformed_train_pred_glm2 <- predict(DS_transformed_glm2,DS_transformed_train, type="response")
DS_transformed_train_predict2 = ifelse(predict(DS_transformed_glm2, DS_transformed_train,type = "response")>=0.5,1,0)

#Get binary prediction
DS_transformed_test_pred_glm0 <- predict(DS_transformed_glm0, DS_transformed_test, type="response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
DS_transformed_test_predict0 = ifelse(predict(DS_transformed_glm0, DS_transformed_test,type = "response")>=0.5,1,0)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
predict_perform(DS_transformed_test,DS_transformed_test_predict0)
## Accuracy is  0.6914894 
## Recall is  0.65 
## Precision is 0.6341463
#Calculate cost using test set
#cost(DS_transformed_test$default, credit_test_pred_glm)

#view the variable importance
var_imp(DS_transformed_glm0)
##                     Overall
## Rating          1.580754821
## Avg.Salary      5.468337828
## Python1         4.459373526
## spark1          1.577719413
## aws1            2.319269291
## excel1          0.856116605
## sql1            1.853099092
## sas1            2.831559090
## keras1          0.017730767
## pytorch1        0.611201466
## scikit1         0.882307310
## tensor1         1.877450940
## hadoop1         1.162010556
## tableau1        1.919218741
## bi1             0.919593979
## flink1          0.640973137
## mongo1          0.033498605
## google_an1      0.658063147
## seniority.Jr1   0.005270204
## seniority.Md1   2.105163805
## Degree.Masters1 3.043678594
## Degree.NA1      1.431529890
#Confusion matrix
table(DS_transformed_test$job_title, as.numeric(DS_transformed_test_pred_glm0>1/6), dnn=c("Truth","Predicted"))
##      Predicted
## Truth  0  1
##     0 23 31
##     1  4 36
#Compare log regression models: 
predict_perform(DS_transformed_train,DS_transformed_train_predict0)
## Accuracy is  0.7882038 
## Recall is  0.7309942 
## Precision is 0.7911392
predict_perform(DS_transformed_train,DS_transformed_train_predict1)
## Accuracy is  0.6863271 
## Recall is  0.7660819 
## Precision is 0.6298077
predict_perform(DS_transformed_train,DS_transformed_train_predict2)
## Accuracy is  0.7211796 
## Recall is  0.7192982 
## Precision is 0.6871508
# Data Mining Approaches

DS_Cluster <- read_excel("MS_BANA_Case_Competition_Data_Spring_2022.xlsx", sheet = "data")
DS_Cluster$job_title_sim <- ifelse(DS_Cluster$job_title_sim == "data scientist",1,0)

table(DS_Cluster$job_title_sim)
## 
##   0   1 
## 256 211
DS_Cluster <- select(DS_Cluster, -c(1:3))
DS_Cluster <- select(DS_Cluster, -c(2:15))
DS_Cluster <- select(DS_Cluster, -c(3:5))

table (DS_Cluster$Degree)
## 
##   M  na   P 
## 159 245  63
table (DS_Cluster$seniority_by_title)
## 
##  jr  na  sr 
##   3 340 124
DS0 <- dummyVars(" ~ .", data = DS_Cluster)
DS_Cluster_transformed <- data.frame(predict(DS0, newdata = DS_Cluster))
str(DS_Cluster_transformed)
## 'data.frame':    467 obs. of  26 variables:
##  $ Rating              : num  3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
##  $ X.Avg.Salary.K..    : num  72 87.5 85 76.5 114.5 ...
##  $ Python              : num  1 1 1 1 1 1 0 1 0 1 ...
##  $ spark               : num  0 0 1 0 0 0 0 1 0 1 ...
##  $ aws                 : num  0 0 0 0 0 1 0 1 0 0 ...
##  $ excel               : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ sql                 : num  0 0 1 0 1 1 0 1 0 0 ...
##  $ sas                 : num  1 0 1 0 1 0 0 0 0 0 ...
##  $ keras               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pytorch             : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ scikit              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tensor              : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ hadoop              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ tableau             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ bi                  : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ flink               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ mongo               : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ google_an           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ job_title_sim       : num  1 1 1 1 1 1 1 1 0 1 ...
##  $ seniority_by_titlejr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ seniority_by_titlena: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ seniority_by_titlesr: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DegreeM             : num  1 1 1 0 0 0 0 1 0 0 ...
##  $ Degreena            : num  0 0 0 1 1 1 1 0 0 1 ...
##  $ DegreeP             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Cluster             : num  1 1 1 1 1 1 1 2 1 1 ...
#Supervised Learning: Classification Analysis

Data_Scientist <- rbind(DS_Cluster_transformed[DS_Cluster_transformed$job_title_sim =="1",])
Others <- rbind(DS_Cluster_transformed[DS_Cluster_transformed$job_title_sim =="0",])

ind <- sample(nrow(DS_Cluster_transformed),nrow(DS_Cluster_transformed)*0.80)
DS_Cluster_transformed_train <- rbind(Data_Scientist[ind,], Others[ind,])
DS_Cluster_transformed_test <- rbind(Data_Scientist[-ind,], Others[-ind,])
DS_Cluster_cl = DS_Cluster_transformed_train[,19]

DS_Cluster_knn <- knn(train = na.omit(select(DS_Cluster_transformed_train,-19)), test = na.omit(select(DS_Cluster_transformed_test,-19)), cl=na.omit(DS_Cluster_transformed_train[,19]), k=2)

table(DS_Cluster_transformed_test[,19], DS_Cluster_knn, dnn = c("True", "Predicted"))
##     Predicted
## True  0  1
##    0 38 14
##    1 22 27
plotcluster(na.omit(select(DS_Cluster_transformed_test,-19)), DS_Cluster_knn)

# Unsupervised Learning: Clustering Analysis

index0 <- sample(nrow(DS_Cluster_transformed),nrow(DS_Cluster_transformed)*0.80)
DS_Cluster_train = DS_Cluster_transformed[index0,]
DS_Cluster_test = DS_Cluster_transformed[-index0,]

fit <- kmeans(na.omit(DS_Cluster_transformed), 2)

plotcluster(na.omit(DS_Cluster_transformed), fit$cluster)

# Display number of observations in each cluster
table(fit$cluster)
## 
##   1   2 
## 155 312
sapply(DS_Cluster_transformed, is.numeric)
##               Rating     X.Avg.Salary.K..               Python 
##                 TRUE                 TRUE                 TRUE 
##                spark                  aws                excel 
##                 TRUE                 TRUE                 TRUE 
##                  sql                  sas                keras 
##                 TRUE                 TRUE                 TRUE 
##              pytorch               scikit               tensor 
##                 TRUE                 TRUE                 TRUE 
##               hadoop              tableau                   bi 
##                 TRUE                 TRUE                 TRUE 
##                flink                mongo            google_an 
##                 TRUE                 TRUE                 TRUE 
##        job_title_sim seniority_by_titlejr seniority_by_titlena 
##                 TRUE                 TRUE                 TRUE 
## seniority_by_titlesr              DegreeM             Degreena 
##                 TRUE                 TRUE                 TRUE 
##              DegreeP              Cluster 
##                 TRUE                 TRUE
sort(colSums(is.na(DS_Cluster_transformed)))
##               Rating     X.Avg.Salary.K..               Python 
##                    0                    0                    0 
##                spark                  aws                excel 
##                    0                    0                    0 
##                  sql                  sas                keras 
##                    0                    0                    0 
##              pytorch               scikit               tensor 
##                    0                    0                    0 
##               hadoop              tableau                   bi 
##                    0                    0                    0 
##                flink                mongo            google_an 
##                    0                    0                    0 
##        job_title_sim seniority_by_titlejr seniority_by_titlena 
##                    0                    0                    0 
## seniority_by_titlesr              DegreeM             Degreena 
##                    0                    0                    0 
##              DegreeP              Cluster 
##                    0                    0
DS_Cluster_transformed <- scale(DS_Cluster_transformed,center = TRUE, scale = TRUE)
fit <- kmeans(na.omit(DS_Cluster_transformed), 2)
plotcluster(na.omit(DS_Cluster_transformed), fit$cluster)

fviz_cluster(fit, data = DS_Cluster_transformed)