1 Introduction

This document investigates the reasons why employees quit their current job role, taking into consideration possible underlying factors that lead to this decision. The next sections provide detailed information on the methodological approaches followed, the inferential assumptions made, and the code implementation.

2 Dataset profiling

The data were obtained from a Kaggle repository and consist of 14,999 employees who either quit or stayed in their current role. Each row of the dataset corresponds to a distinct employee record. The following feature variables are available:
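
  • satisfaction_level (numeric)
  • last_evaluation (numeric)
  • number_project (integer)
  • average_montly_hours (integer)
  • time_spend_company (integer)
  • Work_accident (integer, binary)
  • promotion_last_5years (integer, binary)
  • sales (categorical: employee department)
  • salary (categorical: salary band)

The binary variable left (0 = stayed, 1 = quit) serves as the classification variable.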

3 Data quality control

For the purposes of this analysis, the data were checked for missing values, outliers, and duplicated records:

# Source script that loads data - all variables were pre-defined as character to avoid losing information
# ------------------------------------------------------------------------------------------------#
knitr::opts_chunk$set(echo = TRUE)
setwd("/Users/mammask/Documents/Data Challenge/")          # Set working directory

currLibraryPath <- c("/Library/Frameworks/R.framework/Versions/3.3/Resources/library")  # Path of library
dataPath        <- "/Users/mammask/Documents/Data Challenge/"   # Path of data  
currDir         <- "/Users/mammask/Documents/Data Challenge/"   # Path of current directory
source("./loadScript.R")                                   # Load data

# Define type of feature variables
numericVals  <- c("satisfaction_level", "last_evaluation") # Define numeric values
integerVals  <- c("number_project","average_montly_hours","time_spend_company",
                  "Work_accident","promotion_last_5years") # Define integer variables
respVar      <- "left"                                     # Define name of classification variable
# Assign formats in dataset
# ------------------------------------------------------------------------------------------------#
allDat[, c(numericVals):= lapply(.SD, as.numeric), .SDcols = numericVals]       # Convert numeric values
allDat[, c(integerVals):= lapply(.SD, as.integer), .SDcols = integerVals]       # Convert integer values

3.1 Analysis of missing values

The following script indicates that no missing values were found in the dataset:

# Missing values (NAs) in features and classification variable
# ------------------------------------------------------------------------------------------------#
nMiss <- colSums(allDat[, lapply(.SD, is.na), .SDcols = names(allDat)])
print(nMiss)
##    satisfaction_level       last_evaluation        number_project 
##                     0                     0                     0 
##  average_montly_hours    time_spend_company         Work_accident 
##                     0                     0                     0 
##                  left promotion_last_5years                 sales 
##                     0                     0                     0 
##                salary 
##                     0

3.2 Detection of outliers

A box plot is computed for each of the continuous variables in order to identify outliers or extreme values that might affect prediction:

# Produce box plots of continuous features grouped by the class variable
# ------------------------------------------------------------------------------------------------#
allDat[,ID:=1]
AllDatLongFormat <- data.table::melt(allDat[,c("ID","satisfaction_level",
                                               "last_evaluation",
                                               "number_project",
                                               "time_spend_company"),with = F],
                                     id.vars = "ID", value.name = "Metric")
## Warning in melt.data.table(allDat[, c("ID", "satisfaction_level",
## "last_evaluation", : 'measure.vars' [satisfaction_level, last_evaluation,
## number_project, time_spend_company] are not all of the same type. By order
## of hierarchy, the molten data value column will be of type 'double'. All
## measure variables not of type 'double' will be coerced to. Check DETAILS
## in ?melt.data.table for more on coercion.
AllDatLongFormat[, variable := str_to_title(gsub("_"," ",variable))]
htmltools::tagList(hc_size(
  hcboxplot(x = AllDatLongFormat[["Metric"]], var = AllDatLongFormat[["variable"]]),
  width = NULL, height = 300) %>% 
    hc_title(text = "Distribution of feature variables"))
rm(AllDatLongFormat)

Box plots indicate the presence of outliers in the variable time spent in company (31 employees have spent more than 5 years in the company, beyond the upper whisker of the box plot). Further analysis during the model-fitting process will quantify the effect of these values on the overall prediction. A box plot of the average monthly hours was also created, but no outliers were detected.
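
The flagged records can be counted directly; a minimal sketch (the 5-year cut-off is read off the box plot above):

# Count employees whose time in the company lies beyond the box-plot whisker
allDat[time_spend_company > 5, .N]                      # Returns 31, matching the count above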

3.3 Duplicated records

The available dataset was checked for possible duplicated records. The following script reveals 3,008 duplicated cases, which were discarded from the analysis.

nrow(allDat[duplicated(allDat)])                        # Number of duplicated rows
## [1] 3008
allDat <- copy(allDat[!duplicated(allDat)])             # Keep unique records

4 Descriptive analysis

4.1 Descriptive tables

The dataset consists of 11,991 records; 16.6% of employees quit their role while 83.4% stayed. This is a strong indication of an imbalanced classification variable, as there is an unequal distribution between its classes; this point is discussed thoroughly in the next sections. The following table presents descriptive information for each feature variable, grouped by the classification variable (stayed/left):
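
The class distribution can be verified directly; a minimal sketch:

# Class distribution of the classification variable
allDat[, .(Count = .N, Percent = round(100 * .N / nrow(allDat), 1)), by = left]   # 83.4% stayed, 16.6% left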

# Generate statistics of (mean, median or sd) 
# ------------------------------------------------------------------------------------------------#
resultsMetrics <- StatMetrics(metricData = allDat,                                                # input dataset
                              mMetric    = c("mean", "median"),                                 # metric
                              grouper    = "left",                                              # response variable
                              featuresVec= c(numericVals, integerVals))                         # vector of numeric variables

setnames(resultsMetrics, c("Feature", "Metric","0","1"), c("Feature", "Metric",
                                                           "Stayed","Left"))                    # renaming columns

datatable(resultsMetrics,rownames = FALSE, class = 'cell-border stripe', options = list(
  dom = 'T<"clear">lfrtip',
  tableTools = list(
    className = 'dt-center', targets = 0:(ncol(resultsMetrics)-1)),
  columnDefs = list(list(className = 'dt-center', targets = 0:(ncol(resultsMetrics)-1))),
  pageLength = 15
))                                                                                             # generate interactive table

Descriptive results suggest that employees who quit their role tend to work more, perform slightly better in their evaluation tests, spend more time in the company, and have a significantly lower satisfaction level compared to those who stayed.

For more information about the function StatMetrics, please see Section 7 (Supporting Functions).

4.2 Descriptive charts

(Interactive bar charts of the mean and median of each feature, grouped by employee status, are rendered here; see DemoFuncBokeh in Section 7.)

Further analysis of employees’ salary band and department indicates that almost 27% of employees in the lowest salary band chose to quit their role, while the top 3 departments with the highest quit rates are HR (18.8%), Accounting (17.55%), and Technical (17.38%).

salarStats <- allDat[, .N, by = c("left","salary")]                      # Counts per salary band and status
salarStats <- data.table::dcast(data = salarStats, formula = "salary ~ left", value.var = "N")
setnames(salarStats, c("Salary", "Stayed", "Left"))
salarStats[, `Stayed %`:= round(100*Stayed/(Stayed+Left),2)]             # Share of stayers per band
salarStats[, `Left %`:= round(100*Left/(Stayed+Left),2)]                 # Share of leavers per band

salesStats <- allDat[, .N, by = c("left","sales")]                       # Counts per department and status
salesStats <- data.table::dcast(data = salesStats, formula = "sales ~ left", value.var = "N")
setnames(salesStats, c("Sales", "Stayed", "Left"))
salesStats[, `Stayed %`:= round(100*Stayed/(Stayed+Left),2)]             # Share of stayers per department
salesStats[, `Left %`:= round(100*Left/(Stayed+Left),2)]                 # Share of leavers per department

4.3 Distribution of features grouped by the classification variable

The following histograms present the distribution of each feature variable, grouped by the two levels of the classification variable (stayed/left):

allDat[left == 0, left:= "Stayed"]
allDat[left == 1, left:= "Left"]
allDat[, left:= factor(left, levels = c("Stayed","Left"))]


p1 <- OverlayedHist(mData        = allDat,
                    featureVar   = "satisfaction_level",
                    grouper      = "left",
                    mbinwidth    =  0.02,
                    mTitle       = "Distribution of satisfaction level",
                    mxlab        = "Satisfaction level",
                    mylab        = "Number of employees",
                    mlegendTitle = "Status"
)

p2 <- OverlayedHist(mData        = allDat,
                    featureVar   = "last_evaluation",
                    grouper      = "left",
                    mbinwidth    =  0.02,
                    mTitle       = "Distribution of last evaluation",
                    mxlab        = "Last evaluation",
                    mylab        = "Number of employees",
                    mlegendTitle = "Status"
)

p3 <- OverlayedHist(mData        = allDat,
                    featureVar   = "number_project",
                    grouper      = "left",
                    mbinwidth    =  0.8,
                    mTitle       = "Distribution of number of projects",
                    mxlab        = "Number of projects",
                    mylab        = "Number of employees",
                    mlegendTitle = "Status"
)

p4 <- OverlayedHist(mData        = allDat,
                    featureVar   = "average_montly_hours",
                    grouper      = "left",
                    mbinwidth    =  1.5,
                    mTitle       = "Distribution of number of monthly hours",
                    mxlab        = "Number of monthly hours",
                    mylab        = "Number of employees",
                    mlegendTitle = "Status"
)

(Interactive histograms p1-p4 are rendered here: satisfaction level, last evaluation, number of projects, and average monthly hours; monthly hours appears in the bottom-right panel.)

It seems that employees who quit their role are less satisfied than those who remain loyal to their company. Employees’ last evaluation follows a bimodal pattern: one group of leavers performed really well, with evaluation scores over 0.75, while another group was underperforming, with evaluation scores below 0.55. The same pattern is observed in the distribution of monthly hours (bottom-right plot), where there are two groups of employees who quit: those who put in extra effort and those who worked significantly less than the average.

4.4 Correlation heatmap between continuous features

Each of the following correlation matrices provides insight into the correlation patterns between the available features under three different scenarios:

4.4.1 Correlation - all employees

# Generate correlation heatmap between continuous features
CorrFunction(metricDat = allDat, featureVar= c(integerVals, numericVals))

4.4.2 Correlation - stayed

The following correlation heatmap does not indicate high correlation values between the feature variables.

# Generate correlation heatmap between continuous features
CorrFunction(metricDat = allDat[left == "Stayed"], featureVar= c(integerVals, numericVals))

4.4.3 Correlation - quit

The following correlation heatmap is generated for employees who left the company. Results indicate that employees who work on many projects also tend to work many hours, perform better in their last evaluation tests, and spend more time in the company; however, their satisfaction level is quite poor.

# Generate correlation heatmap between continuous features
CorrFunction(metricDat = allDat[left == "Left"], featureVar= c(integerVals, numericVals))

5 Modelling

The descriptive analysis of the previous section provides insightful information about the possible factors that lead employees to quit their job role. In the current section we develop a model-fitting process that will help us quantify the impact of the available features on the classification variable.

5.1 Further considerations about data

In Section 4.1 the descriptive analysis indicated that the distribution of the classification levels is imbalanced, and this is a very important factor to consider prior to the model-fitting process. The main motivation for preprocessing imbalanced data before feeding them into a classifier is that classifiers are typically more sensitive to the majority class and less sensitive to the minority class. If we do not account for the imbalanced nature of the data, predictions will be biased, as the model will be trained to over-predict the majority class. For this reason we employ the following preprocessing approaches, illustrated in the sketch after this list:

  • Build classifiers on the imbalanced dataset
  • Build classifiers after performing oversampling on the minority class
  • Build classifiers after performing undersampling on the majority class
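
As a quick illustration of what the two sampling techniques do to the class counts (a minimal sketch using caret's downSample/upSample, mirroring the code in Section 5.2; the objects down and up are illustrative):

# Illustration: effect of the sampling techniques on the class counts
library(caret)
table(allDat$left)                                  # Original, imbalanced counts
down <- downSample(x = allDat, y = allDat$left)     # Undersample the majority class
up   <- upSample(x = allDat, y = allDat$left)       # Oversample the minority class
table(down$Class)                                   # Both classes shrunk to the minority size
table(up$Class)                                     # Both classes grown to the majority size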

5.2 Cross validation

In order to measure the predictive performance of the candidate models and address the trade-off between variance and bias we implement the cross validation approach as part of the model fitting process. The following steps are followed:

  1. Randomly split the dataset into training (80%) and testing set (20%).

  2. On the training set, perform oversampling, undersampling, and no action (3 scenarios). As noted in Section 4.1, the classification variable is imbalanced, with an unequal distribution between its classes. This way we prevent the classifiers from being trained to over-predict the majority class.

  3. Perform 10-fold cross-validation on each resulting training set.

  4. Test the model accuracy on the unseen 20% test set.

# Data preparation for modelling
#------------------------------------------------------------------------------------------#
predictorVector <- c(integerVals,numericVals, "salary", "sales")         # Define predictor variables
responseVar     <- "left"                                                # Define response variable
modelFormula <- as.formula(paste0(paste0(responseVar," ~ 1 +"),paste0(predictorVector, collapse  = "+")))
allDat[, left:= as.character(left)]                                      # Convert response to character before relabelling
allDat[left == "Stayed", left:= "0"]
allDat[left == "Left", left:= "1"]
allDat[, left:= factor(left, levels = c(0,1))]

# Create partition of dataset
trainSet <- createDataPartition(allDat[["left"]], p=0.8, list=FALSE)    # Randomly select 80% of rows for training
# Test dataset
testDataSet <- copy(allDat[-trainSet,])                                 # Define test data as the held-out 20%

# Dealing with Imbalanced data. Oversampling and undersampling
#------------------------------------------------------------------------------------------#
underSampleTrainSet <- data.table(downSample(x = allDat[trainSet,],
                                             y = allDat[trainSet,left]))# Perform undersampling in training set

overSampleTrainSet  <- data.table(upSample(x = allDat[trainSet,],
                                           y = allDat[trainSet,left]))  # Perform oversampling in training set

5.3 Model fitting process

For the purposes of this study we fitted the following models:

5.3.1 Logistic regression

In this section, a logistic regression classifier is fitted in order to predict which employees quit their current role and which remained loyal to the company. One of the main advantages of the logistic model is the interpretability of its coefficient estimates, which can help us quantify the effect of the feature variables.

5.3.1.1 10-fold cross validation - Logistic regression

The following 3 cross validation approaches were implemented using the caret package:

# Train model on the train dataset using cross validation and perform stepwise 
# procedure using the AIC criterion
  modelFitNoSample <- train(
    modelFormula,
    data=allDat[trainSet,],
    method="glmStepAIC",
    family="binomial",
    trControl = trainControl(method = "cv", number = 10),
    preProcess = c("center", "scale")
  )
  
  modelFitUnderSample <- train(
    modelFormula,
    data=underSampleTrainSet,
    method="glmStepAIC",
    family="binomial",
    trControl = trainControl(method = "cv", number = 10),
    preProcess = c("center", "scale")
  )
  
  modelFitOverSample <- train(
    modelFormula,
    data=overSampleTrainSet,
    method="glmStepAIC",
    family="binomial",
    trControl = trainControl(method = "cv", number = 10),
    preProcess = c("center", "scale")
  )

5.3.1.2 Modelling results - Logistic regression model

Model results provide insightful information about the features having a significant impact on the probability of quitting the current job role. In the model fitted on the oversampled training set it seems that, among all the features, the presence of low and medium salary levels, higher evaluation scores, and more time spent in the company significantly increase the probability of quitting.

On the other hand, it is reasonable that higher satisfaction scores or a promotion during the last 5 years make quitting less likely.

One very interesting finding is the number of estimated parameters of each model. The model fitted on the undersampled training data has the lowest number of feature estimates, due to the much smaller size of the undersampled training set.
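
The estimates behind these statements can be inspected directly; a minimal sketch (caret stores the final stepwise glm in the finalModel slot of each train object):

# Inspect the model fitted on the oversampled training set
summary(modelFitOverSample$finalModel)            # Coefficient estimates and significance
exp(coef(modelFitOverSample$finalModel))          # Odds ratios implied by the estimates
length(coef(modelFitUnderSample$finalModel))      # Number of estimated parameters (smallest model)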

5.3.1.2.1 Model - training data
5.3.1.2.2 Model - undersampling training data
5.3.1.2.3 Model - oversampling training data

5.3.1.3 Model accuracy - Logistic regression

The accuracy of each model was evaluated on the test dataset (20%):

It seems that the model with the highest accuracy rate is the one fitted on the unbalanced dataset (79.2%). A closer look at the results suggests that the model fitted on the pure training set clearly over-predicts the majority class (did not quit; sensitivity = 92.7%) and performs quite poorly on the minority class (quit; specificity = 35.8%). It is clear that the oversampling and undersampling techniques significantly improved the prediction of the “weak” class, by roughly 45% in the two remaining models.
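
These metrics come from confusion matrices computed on the held-out test set; a minimal sketch for one of the three models (predNoSample is an illustrative name):

# Evaluate a fitted model on the unseen 20% test set
predNoSample <- predict(modelFitNoSample, newdata = testDataSet)        # Predicted classes
confusionMatrix(data = predNoSample, reference = testDataSet[,left])    # Sensitivity refers to the majority class here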

The following plot presents the ROC curves for each one of the fitted models:

A comparative assessment of the ROC curves for the 3 different models suggests that for small cut-off values the model fitted on the training data performs better (an indication of higher predictive performance at low thresholds), while for higher cut-off values the models fitted using the oversampling/undersampling techniques present better results. The results are in agreement with the business question under consideration, which is the correct prediction of employees who are likely to quit their current role.
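
A sketch of how these curves can be produced with the plotROC package, mirroring the LassoROC helper in Section 7 (probNo, probUnder, probOver and rocData are illustrative names; caret's predict with type = "prob" returns one probability column per class):

# ROC curves for the three logistic regression models (sketch)
library(plotROC)
probNo    <- predict(modelFitNoSample,    newdata = testDataSet, type = "prob")[, "1"]
probUnder <- predict(modelFitUnderSample, newdata = testDataSet, type = "prob")[, "1"]
probOver  <- predict(modelFitOverSample,  newdata = testDataSet, type = "prob")[, "1"]
rocData <- data.table(actual = as.numeric(as.character(testDataSet[,left])),
                      `Training Data`                = probNo,
                      `Under Sampling Training Data` = probUnder,
                      `Over Sampling Training Data`  = probOver)
rocData <- melt(rocData, id.vars = "actual", variable.name = "Model", value.name = "Predicted Prob")
ggplot(rocData, aes(d = actual, m = `Predicted Prob`, color = Model)) + geom_roc() + style_roc()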

5.3.2 Lasso logistic regression

A Lasso logistic regression classifier was fitted in order to take advantage of its capabilities rooted in the bias-variance trade-off. As the lambda parameter increases, the flexibility of the lasso decreases, leading to decreased variance but increased bias. In this way, the Lasso logistic model performs embedded variable selection, shrinking some coefficients exactly to zero and returning only a subset of the features in the final model.

5.3.2.1 10-fold cross validation - Lasso logistic regression

The following 3 cross validation approaches were implemented using the glmnet package:

# Train model on the training dataset using cross validation for various values of lambda,
# using the mean squared error as the loss
xNoSample <- allDat[trainSet,c(integerVals,numericVals,"salary","sales"), with = F]
xNoSample <- as.matrix(cbind(xNoSample[,c(integerVals,numericVals),with = F], model.matrix( ~ 0 + salary + sales, xNoSample)))
modelFitNoSampleLasso <- cv.glmnet(x = xNoSample, y=allDat[trainSet,left], family = "binomial",
                                   type.measure='mse',nfolds=10,alpha=1)


# Train model on the undersampled dataset using cross validation for various values of lambda,
# using the mean squared error as the loss
xUnderSampling <- underSampleTrainSet[,c(integerVals,numericVals,"salary","sales"), with = F]
xUnderSampling <- as.matrix(cbind(xUnderSampling[,c(integerVals,numericVals),with = F], model.matrix( ~ 0 + salary + sales, xUnderSampling)))
modelFitUnderSampleLasso <- cv.glmnet(x = xUnderSampling, y=underSampleTrainSet[,left], family = "binomial",
                                   type.measure='mse',nfolds=10,alpha=1)


# Train model on the oversampled dataset using cross validation for various values of lambda,
# using the mean squared error as the loss
xOverSampling <- overSampleTrainSet[,c(integerVals,numericVals,"salary","sales"), with = F]
xOverSampling <- as.matrix(cbind(xOverSampling[,c(integerVals,numericVals),with = F], model.matrix( ~ 0 + salary + sales, xOverSampling)))
modelFitOverSampleLasso <- cv.glmnet(x = xOverSampling, y=overSampleTrainSet[,left], family = "binomial",
                                      type.measure='mse',nfolds=10,alpha=1)

The following charts present the cross-validation curve (red dotted line), and upper and lower standard deviation curves along the lambda sequence (error bars). Two selected lambdas are indicated by the vertical dotted lines (see below) for each model:

Top left: model on the training sample; top right: model on the undersampled training set; bottom left: model on the oversampled training set.

Each plot provides two suggested lambdas. The first is the minimum lambda (lambda.min), which yields the lowest cross-validation error. The second (lambda.1se) is the largest lambda whose model is simpler than the best model but whose error is within 1 standard error of the best. The preferred lambda will be selected based on the simplicity of each model and the parameter estimates.
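
Both candidate lambdas are stored on the cv.glmnet objects; a minimal sketch for the model fitted on the pure training set:

# Extract the two candidate lambdas and the coefficients they imply
modelFitNoSampleLasso$lambda.min                 # Lambda with the lowest cross-validation error
modelFitNoSampleLasso$lambda.1se                 # Largest lambda within 1 SE of the minimum
coef(modelFitNoSampleLasso, s = "lambda.1se")    # Sparser coefficient vector (zeros = dropped features)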

5.3.2.2 Modelling results - Lasso logistic regression model

The following tab panels contain coefficient estimates for each of the three modelling approaches. In the first tab, the model fitted on the pure training set provides two candidate lambda parameters: the first returns the lowest mean squared error, while the second returns a model simpler than the best one but with error within 1 standard error of it. This is a very important property of the selected modelling approach when it comes to interpretation problems like the one under consideration.

Results of the model fitted on the training set using the oversampling technique (lambda = 0.00066) suggest that 2 features were dropped from the model (medium salary and the product manager department), while the model has an overall accuracy of 76.2%. The coefficient estimates provide insightful information about the features having a significant impact on the probability of quitting. The last evaluation score, the time spent in the company, the low salary band, and a position in HR seem to make the event of leaving the current role more probable.

Plots of the ROC curves for both lambdas reflect the predictive performance of the models.

5.3.2.2.1 Model - training data
5.3.2.2.2 Model - undersampling training data
5.3.2.2.3 Model - oversampling training data

5.3.2.3 Model overall accuracy - Lasso logistic regression - minimum lambda

The model fitted on the pure training set seems to provide the highest overall accuracy.

Although the model built on the pure training set presents the highest accuracy, the two models fitted using the sampling techniques seem to offer a better trade-off between sensitivity and specificity.
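
The helpers LassoROC and LassoAccuFun (Section 7) operate on a design matrix lassoTestSet that is not shown above; a minimal sketch of its construction, mirroring xNoSample but on the held-out test set:

# Build the test design matrix used by LassoROC and LassoAccuFun
lassoTestSet <- testDataSet[, c(integerVals, numericVals, "salary", "sales"), with = F]
lassoTestSet <- as.matrix(cbind(lassoTestSet[, c(integerVals, numericVals), with = F],
                                model.matrix(~ 0 + salary + sales, lassoTestSet)))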

LassoROC("lambda.min")

5.3.2.4 Model overall accuracy - Lasso logistic regression - lambda within 1 standard error

LassoROC("lambda.1se")

6 Discussion

The purpose of this study was to identify the underlying reasons that play an important role in predicting whether an employee quits their current job. To this end we employed two modelling approaches that are well known for their predictive performance and their ability to quantify the effects of the respective features. Algorithms with better accuracy could have been employed, such as Random Forests and GBMs; however, we chose the aforementioned models for the sake of interpretability.

Both approaches were fitted using oversampling and undersampling techniques on the training set in order to avoid over-predicting the majority class (employees who did not quit). The fitted models performed significantly better after applying the sampling techniques, which can be clearly seen in the ROC charts, where the false positive rate is reduced.

This way we obtained coefficient estimates from models that have a quite high specificity score as well as a high balanced accuracy score. Results indicate that the best logistic regression model correctly predicts 80.5% of those who quit, and this result is aligned with the findings of the lasso logistic model with the minimum MSE.

Overall, it seems that the descriptive analysis is fully aligned with the modelling results. There are two different groups of employees: those who quit and those who remained in their current role. Employees who quit seem to perform well in their last evaluation, are in a low salary band, and work above-average hours.

Also, it is important to mention that working in HR increases the probability of leaving the current role, a finding that is also reflected in the descriptive results.

This is a clear indication that the company has to invest in improved employee programs, and this need is reflected by the fitted models.

7 Supporting Functions

#------------------------------------------------------------------------------------------#
#         scriptName: sFuncV1a                                                             #
#            purpose: Supporting functions                                                 #
#               date: 16/02/2017                                                           #
#             author: Konstantinos Mammas                                                  #
#               mail: mammaskon@gmail.com                                                  #
#------------------------------------------------------------------------------------------#

StatMetrics <- function(metricData,
                        mMetric,
                        grouper,
                        featuresVec){
  
  # function name: StatMetrics
  #       purpose: To compute statistics (mean, median, sd) of (numeric, integer) feature variables against the response
  #       version: 1a
  #          date: 10/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  packList <- c("data.table", "reshape2", "stringr")
  packCheck <- lapply(packList, require, character.only = TRUE)                 # Load required packages
  metricData[, c(featuresVec):= lapply(.SD, as.numeric), .SDcols = featuresVec]
  if (sum(unlist(packCheck)) < 3) {
    stop("Please ensure that you have installed all required packages")         # Enure all packages are loaded
  }
  
  metricList <- list()                                                          # List to obtain metric results
  k <- 0                                                                        # Create counter
  if ("mean" %in% mMetric){ 
    k <- k+1
    metricList[[k]] <- metricData[, c(Metric = "Mean",
                                      lapply(.SD, mean)), .SDcols = featuresVec, by = grouper] # Calculate mean
  }
  if ("median" %in% mMetric){ 
    k <- k+1
    metricList[[k]] <- metricData[, c(Metric = "Median",
                                      lapply(.SD, median)), .SDcols = featuresVec, by = grouper] # Calculate median
  }
  
  if ("sd" %in% mMetric){ 
    k <- k+1
    metricList[[k]] <- metricData[, c(Metric = "SD",
                                      lapply(.SD, sd)), .SDcols = featuresVec, by = grouper]     # Calculate median
  }
  
  metricData <- rbindlist(metricList)                                                            # Wrap results
  
  
  
  metricData <- data.table::melt(data = metricData, id.vars = c(grouper,"Metric"))
  metricData <- data.table::dcast(data = metricData, formula = paste0("variable + Metric ~ ", grouper))
  setnames(metricData, c("variable", "Metric"), c("Feature", "Metric"))
  
  grouperLevels <- colnames(metricData)[!colnames(metricData) %in% c("Feature", "Metric")]
  
  metricData[, c(grouperLevels) := lapply(.SD, as.numeric), .SDcols = grouperLevels]
  metricData[, c(grouperLevels) := lapply(.SD, round, 2), .SDcols = grouperLevels]
  metricData[, Feature := str_to_title(gsub("_", " ", Feature)) ]
  
  return(metricData)
}


OverlayedHist <- function(mData       ,
                          featureVar  ,
                          grouper     ,
                          mbinwidth   ,
                          mTitle      ,
                          mxlab       ,
                          mylab       ,
                          mlegendTitle
){
  
  # function name: OverlayedHist
  #       purpose: To produce overlayed histograms against a grouping variable
  #       version: 1a
  #          date: 10/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  
  p <- ggplot(mData, aes(eval(parse(text = featureVar)), fill = eval(parse(text = grouper)))) +   # Use the mData argument, not the global allDat
    geom_histogram(alpha = 0.7, position = 'identity', binwidth = mbinwidth) + 
    scale_fill_manual(values=c("#377EB8","#E41A1C")) + 
    ggtitle(mTitle) +
    xlab(mxlab) + ylab(mylab) + 
    guides(fill=guide_legend(title=mlegendTitle)) + theme(plot.title = element_text(size=10))

  
  return(ggplotly(p))
  
}


CorrFunction <- function(metricDat,
                         featureVar){
  
  # function name: CorrFunction
  #       purpose: To compute an interactive correlation heatmap between a set of features
  #       version: 1a
  #        inputs:
  #             - metricDat : dataset with feature variables in numeric or integer format
  #             - featureVar: a vector of the feature variables
  #       outputs:
  #             - p         : an interactive correlation matrix of the selected features
  #          date: 10/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  library(data.table)
  library(ggplot2)
  library(plotly)
  
  corrData <- data.table(cor(metricDat[,featureVar, with = F]), keep.rownames = T)
  setnames(corrData, "rn", "Metric A")
  corrData <- data.table::melt(data        = corrData,
                               id.vars     = "Metric A",
                               value.name  = "Correlation", variable.name = "Metric B")
  
  corrData[, categ:= cut(Correlation, breaks=seq(from = -1, to = 1, by = 0.25))]
  corrData[, `Metric A`:= str_to_title(gsub("_", " ", `Metric A`))]
  corrData[, `Metric B`:= str_to_title(gsub("_", " ", `Metric B`))]
  
  properOrder <- corrData[, list(Min = min(Correlation)), by = categ][order(Min)][,categ]
  corrData[, categ := factor(categ, levels = properOrder)]
  corrData[, Correlation:= round(Correlation,2)]
  
  p <- ggplot(corrData, aes(x = `Metric A`, y = `Metric B`)) + geom_tile(aes(fill=categ),colour="black")  + scale_fill_brewer(palette = "RdYlGn",name="Correlation")
  p <- p + xlab("") + ylab("") + theme(axis.text.x = element_text(angle = 0, hjust = 1))
  p <- p + geom_text(aes(label=paste0(format(Correlation, big.mark=",",scientific=FALSE))),size = 3, check_overlap = F)
  p <- p + ggtitle("Correlation of feature variables")
  p <- p + theme(panel.background=element_blank(),
                 legend.text=element_text(size=10),
                 legend.title=element_text(size=9),
                 axis.text.y  = element_text(size=7),
                 axis.text.x  = element_text(size=5),
                 axis.line.x = element_line(color="black"),
                 axis.line.y = element_line(color="black"))
  return(ggplotly(p))
  
}


LassoCoeffs <- function(modelObject){
  
  # function name: LassoCoeffs
  #       purpose: Obtain coefficients from Lasso cross validation object
  #       version: 1a
  #        inputs:
  #             - modelObject : object of lasso cross validation
  #       outputs:
  #             - p           : a table of coefficient estimates for both lambdas (min, 1se)
  #          date: 15/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  
  # Obtain coefficient estimates for lambda.min
  coefMatr <- coef(modelObject, s = "lambda.min")
  minLamMatr <- data.table(Feature  = names(coefMatr[,1]),
                           Estimate = round(as.numeric(coefMatr[,1]),3),
                           `Exp Estimate` = ifelse(exp(as.numeric(coefMatr[,1]))!=1, round(exp(as.numeric(coefMatr[,1])),3), NA)
  )
  
  setnames(minLamMatr, c("Feature", paste0("Estimate-lambda=", round(modelObject$lambda.min,5)),
                         paste0("Exp Estimate-lambda=", round(modelObject$lambda.min,5))))
  
  # Obtain coefficient estimates for lambda.1se
  coefMatr <- coef(modelObject, s = "lambda.1se")
  se1LamMatr <- data.table(Feature  = names(coefMatr[,1]),
                           Estimate = round(as.numeric(coefMatr[,1]),3),
                           `Exp Estimate` = ifelse(exp(as.numeric(coefMatr[,1]))!=1, round(exp(as.numeric(coefMatr[,1])),3), NA)
  )
  setnames(se1LamMatr, c("Feature", paste0("Estimate-lambda=", round(modelObject$lambda.1se,5)),
                         paste0("Exp Estimate-lambda=", round(modelObject$lambda.1se,5))))
  
  totalCoef <- merge(minLamMatr, se1LamMatr, by ="Feature", sort = F)
  totalCoef[, Feature:=str_to_title(gsub("_"," ",Feature))]
  
  return(
    datatable(totalCoef,rownames = FALSE, class = 'cell-border stripe', options = list(
      dom = 'T<"clear">lfrtip',
      tableTools = list(
        className = 'dt-center', targets = 0:(ncol(totalCoef)-1)),
      columnDefs = list(list(className = 'dt-center', targets = 0:(ncol(totalCoef)-1))),
      pageLength = 15
    ))
  )
}


LassoAccuFun <- function(lamdaVal){
  
  # function name: LassoAccuFun
  #       purpose: Obtain accuracy tables for lasso
  #       version: 1a
  #        inputs:
  #             - lamdaVal    : lambda value to use ("lambda.min" or "lambda.1se")
  #       outputs:
  #             - p           : a table of accuracy metrics
  #          date: 15/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  
  predNoSamplingLasso    <- predict(modelFitNoSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "class")
  predUnderSamplingLasso <- predict(modelFitUnderSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "class")
  predOverSamplingLasso  <- predict(modelFitOverSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "class")
  
  # Coerce the class predictions to factors with the same levels as the reference
  conMatrNoSamplLasso    <- confusionMatrix(data = factor(predNoSamplingLasso[,1],
                                                          levels = levels(testDataSet[,left])),
                                            reference = testDataSet[,left])
  conMatrUnderSamplLasso <- confusionMatrix(data = factor(predUnderSamplingLasso[,1],
                                                          levels = levels(testDataSet[,left])),
                                            reference = testDataSet[,left])
  conMatrOverSamplLasso  <- confusionMatrix(data = factor(predOverSamplingLasso[,1],
                                                          levels = levels(testDataSet[,left])),
                                            reference = testDataSet[,left])
  
  lassoOverAccur <- data.table(Model = c("Under sampling on training set", "Over sampling on training set", "No sampling on training set"),
                               do.call(rbind,list(conMatrUnderSamplLasso$overall[1],
                                                  conMatrOverSamplLasso$overall[1],
                                                  conMatrNoSamplLasso$overall[1])))
  
  lassoOverAccur[, Accuracy:= round(Accuracy,3)]
  
  
  uSMLasso <- data.table(Metric = names(conMatrUnderSamplLasso$byClass),
                         `Under Sampling Model` = round(conMatrUnderSamplLasso$byClass,3), keep.rownames = T)
  
  oSMLasso <- data.table(Metric = names(conMatrOverSamplLasso$byClass),
                         `Over Sampling Model` = round(conMatrOverSamplLasso$byClass,3), keep.rownames = T)
  
  nSMLasso <- data.table(Metric = names(conMatrNoSamplLasso$byClass),
                         `No Sampling Model` = round(conMatrNoSamplLasso$byClass,3), keep.rownames = T)
  
  overallLasso <- merge(merge(uSMLasso,oSMLasso, by = "Metric"),nSMLasso, by = "Metric")
  
  return(list(lassoOverAccur, overallLasso))  
}

LassoROC <- function(lamdaVal){
  
  # function name: LassoROC
  #       purpose: Obtain ROC curves for different lambdas
  #       version: 1a
  #        inputs:
  #             - lamdaVal : lambda value to use ("lambda.min" or "lambda.1se")
  #       outputs:
  #             - p        : a plot of ROC curves
  #          date: 15/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  
  predNoSamplingLasso    <- predict(modelFitNoSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "response")
  predUnderSamplingLasso <- predict(modelFitUnderSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "response")
  predOverSamplingLasso  <- predict(modelFitOverSampleLasso, newx = lassoTestSet,s = lamdaVal, type = "response")
  
  rocData <- data.table(actual = ifelse(testDataSet[,left]=="1",1,0),
                        `Training Data` = predNoSamplingLasso[,1],
                        `Under Sampling Training Data` = predUnderSamplingLasso[,1],
                        `Over Sampling Training Data` = predOverSamplingLasso[,1])
  
  rocData <- melt(rocData,id.vars = "actual",variable.name = "Model", value.name = "Predicted Prob")

  return(
  ggplot(rocData , aes(d = actual, m = `Predicted Prob`, color = Model)) + 
    geom_roc() + 
    style_roc()
  )
  
}

DemoFuncBokeh <- function(FeatureVec,
                          mMetric){
  

  # function name: DemoFuncBokeh
  #       purpose: create interactive bar chart
  #       version: 1a
  #        inputs:
  #             - FeatureVec : feature to plot; mMetric : statistic to display (e.g. "Mean")
  #       outputs:
  #             - p           : a bar plot
  #          date: 15/02/2017
  #       contact: mammaskon@gmail.com
  #          name: Konstantinos Mammas
  
  
  figure(     width   = 300,
              height  = 300,
              xlab    = "Status",
              ylab    = "Number of employees",
              title   = paste0(mMetric," ", FeatureVec),
              legend_location = NULL
  )                     %>%
    
    ly_bar(         data    = metricCharts[Feature == FeatureVec & Metric == mMetric],
                    x       = Status,
                    y       = Value,
                    color   = Status,
                    hover   = T,
                    position= "dodge")                          %>%
    
    theme_axis(     axis_label_text_font_style = 'normal')      %>%
    
    theme_legend(   border_line_width = 2)                      %>%
    set_palette(discrete_color = pal_color(c("red", "blue")))
  
}