A small company, ABC Company, employs 32 people. The owners have raised employee satisfaction as a key issue and want an analysis that helps them understand it more thoroughly. We are tasked with predicting job satisfaction from the data provided, so satisfaction is our response variable and the remaining variables are our predictors.


Visualizations and Exploratory Analysis

Box Plot

Fig. 1.1


In this series of box plots, we can see the levels of satisfaction within each department. Some departments have higher levels of satisfaction than others, though with so little data we cannot say whether this would hold in a larger sample. Maintenance and Quality Control have the highest satisfaction levels, but they also have the narrowest ranges, because each has very few employees relative to the company as a whole. The main department to examine is Production, as it makes up most of the company’s employees. That department is about average in all regards, so we have nothing unusual to report about its data.


Histogram

Fig. 1.2


The histogram is roughly symmetrical but slightly skewed left, meaning that there are slightly more satisfied employees than unsatisfied ones.

Fig. 1.3


This histogram shows us that a large share of the employees are new hires. The graph is bimodal and skewed right. Since it is not symmetrical, the median is a better measure of the data’s centre than the mean, which the long right tail pulls upward. The largest concentration of employees is among the new hires.
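As a rough check on the point about skew, the sketch below compares the mean and median of the years of service. It assumes that Fig. 1.3 plots the `years` column listed in the classification results table of this report, and the three-year cut-off for a "new hire" is an illustrative assumption rather than a definition taken from the analysis.

```python
from statistics import mean, median

# Years of service for the 32 employees, copied from the "years" column
# of the classification results table in this report.
years = [16, 2, 14, 17, 15, 1, 3, 3, 16, 15, 13, 3, 6, 1, 3, 2,
         3, 2, 2, 15, 5, 8, 17, 15, 5, 1, 11, 21, 8, 32, 2, 18]

years_mean = mean(years)
years_median = median(years)

# Assumed cut-off: anyone with at most 3 years of service counts as a new hire.
new_hires = sum(1 for y in years if y <= 3)

print(f"mean = {years_mean:.2f}, median = {years_median}, new hires = {new_hires}")
```

A mean that sits above the median is what a right-skewed distribution would be expected to produce.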


Heat Map

Fig. 1.4


The heat map provides a lot of information that can be unpacked.

For instance, we can see that years spent working is negatively correlated with most variables, with just two exceptions: tools and training. This is understandable, as the longer you remain employed, the more trained and better equipped you should be.

Another point of interest is the pairwise correlation between ideas and recognition, which likely stems from the fact that an employee who creates value for the company will be recognised by it. These two variables are also the ones most strongly correlated with satisfaction. This leads me to believe that satisfaction for this set of employees comes most strongly from feeling that they provide value whenever they sit down to work.
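To make the idea of a pairwise correlation concrete, here is a minimal sketch that computes a Pearson coefficient for ideas and recognition. The scores are copied from the classification results table in this report. The choice of Pearson is an assumption, since the heat map may have been built with a different coefficient, and `pearson` is a helper written for this illustration.

```python
from math import sqrt

# "ideas" and "recognition" scores for the 32 employees, in table order.
ideas = [2, 4, 4, 5, 5, 5, 3, 2, 2, 2, 3, 5, 2, 5, 3, 4,
         3, 4, 4, 5, 4, 5, 4, 5, 2, 5, 3, 3, 3, 2, 5, 4]
recognition = [2, 3, 2, 3, 5, 4, 3, 2, 2, 1, 3, 5, 1, 4, 3, 4,
               3, 4, 4, 3, 3, 3, 4, 4, 2, 5, 4, 2, 2, 2, 5, 4]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson(ideas, recognition)
print(f"ideas vs recognition: r = {r:.2f}")
```

A heat map such as Fig. 1.4 is essentially this calculation repeated for every pair of variables.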


Scatter Plot

Fig. 1.5


Taking inspiration from the heat map analysis, we use a scatter plot to look at the relationship between recognition and satisfaction. There is a visible trend in the plot: employees who receive more recognition tend to be more satisfied.

Fig. 1.6


The next point of interest is the pairing of ideas and recognition. We can see a strong trend in the plot: the more highly management appears to rate an employee’s ideas, the more recognition that employee receives.

Fig. 1.7


The final scatter plot displays satisfaction by number of years worked at the company, and it is colour-coded by department. There is not much data to work with, so this analysis is limited by that factor, but generally, we can see that satisfaction is highest within the first four years of working for the company.
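The observation about the first four years can be checked roughly against the data. The sketch below uses the binary Satisfied/Unsatisfied labels from the classification results table, which is an assumption, since Fig. 1.7 may plot a finer-grained satisfaction score. The helper name `share_satisfied` is hypothetical.

```python
# Years of service and satisfaction label (1 = Satisfied, 0 = Unsatisfied)
# for the 32 employees, copied in table order from the classification results.
years = [16, 2, 14, 17, 15, 1, 3, 3, 16, 15, 13, 3, 6, 1, 3, 2,
         3, 2, 2, 15, 5, 8, 17, 15, 5, 1, 11, 21, 8, 32, 2, 18]
satisfied = [0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
             0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

def share_satisfied(in_group):
    """Return (satisfied count, group size) for employees selected by in_group(years)."""
    labels = [s for y, s in zip(years, satisfied) if in_group(y)]
    return sum(labels), len(labels)

early = share_satisfied(lambda y: y <= 4)   # first four years
later = share_satisfied(lambda y: y > 4)    # more than four years

for name, (count, size) in (("<= 4 years", early), ("> 4 years", later)):
    print(f"{name}: {count}/{size} satisfied ({count / size:.0%})")
```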



K-Nearest-Neighbours

## k-Nearest Neighbors 
## 
## 32 samples
##  9 predictor
##  2 classes: 'Unsatisfied', 'Satisfied' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 29, 29, 28, 29, 29, 28, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8972222  0.7566667
##   7  0.7833333  0.4900000
##   9  0.6833333  0.2900000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
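For readers unfamiliar with the method, the following is a minimal sketch of how a k-nearest-neighbours classifier votes with k = 5. It is not the model fitted above: it uses only two of the nine predictors (ideas and recognition, copied from the classification results table), plain Euclidean distance, and no cross-validation. `knn_predict` is a hypothetical helper.

```python
from collections import Counter
from math import dist

# Two predictors and the class label for the 32 employees, in table order.
ideas = [2, 4, 4, 5, 5, 5, 3, 2, 2, 2, 3, 5, 2, 5, 3, 4,
         3, 4, 4, 5, 4, 5, 4, 5, 2, 5, 3, 3, 3, 2, 5, 4]
recognition = [2, 3, 2, 3, 5, 4, 3, 2, 2, 1, 3, 5, 1, 4, 3, 4,
               3, 4, 4, 3, 3, 3, 4, 4, 2, 5, 4, 2, 2, 2, 5, 4]
satisfied = [0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
             0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

labels = ["Satisfied" if s else "Unsatisfied" for s in satisfied]
training = list(zip(zip(ideas, recognition), labels))

def knn_predict(point, k=5):
    """Classify `point` by majority vote among its k nearest training rows."""
    nearest = sorted(training, key=lambda row: dist(point, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two hypothetical employees: one with top scores, one with low scores.
print(knn_predict((5, 5)))
print(knn_predict((2, 1)))
```

An odd k such as 5 avoids tied votes when there are only two classes.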

Classification Results

department years ideas communication recognition training conditions tools balance satisfaction prediction
Administrative 16 2 3 2 2 4 5 2 Unsatisfied Unsatisfied
Administrative 2 4 4 3 4 4 5 3 Satisfied Satisfied
Administrative 14 4 3 2 2 5 5 5 Unsatisfied Satisfied
Maintenance 17 5 4 3 5 5 5 3 Satisfied Satisfied
Maintenance 15 5 5 5 5 5 5 5 Satisfied Satisfied
Management 1 5 4 4 3 5 3 5 Satisfied Satisfied
Management 3 3 4 3 3 4 5 5 Satisfied Satisfied
Management 3 2 2 2 2 3 5 3 Unsatisfied Unsatisfied
Production 16 2 3 2 4 4 4 2 Unsatisfied Unsatisfied
Production 15 2 3 1 4 4 4 2 Unsatisfied Unsatisfied
Production 13 3 3 3 4 4 4 3 Satisfied Satisfied
Production 3 5 5 5 5 5 5 5 Satisfied Satisfied
Production 6 2 2 1 3 3 4 2 Unsatisfied Unsatisfied
Production 1 5 4 4 3 4 5 5 Satisfied Satisfied
Production 3 3 4 3 4 5 5 4 Satisfied Satisfied
Production 2 4 4 4 4 5 5 5 Satisfied Satisfied
Production 3 3 4 3 3 2 4 4 Unsatisfied Unsatisfied
Production 2 4 3 4 3 3 4 4 Unsatisfied Satisfied
Production 2 4 5 4 4 4 4 4 Satisfied Satisfied
Production 15 5 4 3 4 3 5 3 Satisfied Satisfied
Production 5 4 5 3 2 3 5 4 Satisfied Satisfied
Production 8 5 5 3 5 3 5 3 Satisfied Satisfied
Production 17 4 3 4 3 3 5 2 Unsatisfied Unsatisfied
Production 15 5 3 4 5 5 5 5 Satisfied Satisfied
Production 5 2 4 2 2 2 5 3 Unsatisfied Unsatisfied
QC 1 5 5 5 5 5 5 5 Satisfied Satisfied
QC 11 3 4 4 4 5 5 2 Satisfied Satisfied
SR 21 3 2 2 3 2 4 3 Unsatisfied Unsatisfied
SR 8 3 2 2 2 2 4 2 Unsatisfied Unsatisfied
SR 32 2 3 2 4 2 5 3 Unsatisfied Unsatisfied
SR 2 5 5 5 5 5 5 5 Satisfied Satisfied
SR 18 4 4 4 5 5 5 5 Satisfied Satisfied

Confusion Matrix

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Unsatisfied Satisfied
##   Unsatisfied          11         0
##   Satisfied             2        19
##                                           
##                Accuracy : 0.9375          
##                  95% CI : (0.7919, 0.9923)
##     No Information Rate : 0.5938          
##     P-Value [Acc > NIR] : 1.452e-05       
##                                           
##                   Kappa : 0.8672          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8462          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5938          
##          Detection Rate : 0.5938          
##    Detection Prevalence : 0.6562          
##       Balanced Accuracy : 0.9231          
##                                           
##        'Positive' Class : Satisfied       
## 

KNN Model Output

The highest cross-validated accuracy (about 89.7%) was obtained with k = 5.
The confusion matrix displays a few key factors that point to the success of the model’s predictive power. We can first look to the accuracy, which is 93.75% for this output (30 of the 32 employees classified correctly); that is a strong value, as it is close to error-free. This is much higher than the NIR (no-information rate) of 59.38%, the accuracy gained by always predicting the most frequent class.
We can also look to the p-value for the test of accuracy against the NIR; at 1.452e-05 it is far below 5%, meaning that the model’s accuracy is significantly better than what we would get by always predicting the most frequent class.
The sensitivity and negative predictive value are both 100%, and the specificity (84.62%) and positive predictive value (90.48%) are both above 84%. This shows that the model does very well both when seeking to predict a class correctly and when seeking to avoid false classification.
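These figures follow directly from the four counts in the KNN confusion matrix. As a small worked sketch, with "Satisfied" as the positive class as stated in the output, and `classification_metrics` as a helper name chosen for this illustration:

```python
# Counts read from the KNN confusion matrix ("Satisfied" is the positive class).
tp = 19  # predicted Satisfied, actually Satisfied
fp = 2   # predicted Satisfied, actually Unsatisfied
fn = 0   # predicted Unsatisfied, actually Satisfied
tn = 11  # predicted Unsatisfied, actually Unsatisfied

def classification_metrics(tp, fp, fn, tn):
    """Standard two-class metrics derived from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

knn_metrics = classification_metrics(tp, fp, fn, tn)
for name, value in knn_metrics.items():
    print(f"{name}: {value:.4f}")
```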
The 95% CI of (0.7919, 0.9923) is strong. It gives the range in which the model’s true accuracy plausibly lies, so the model has a relatively safe floor of about 79%, and the ceiling is close to perfect as well.
The Kappa value of 0.867 is close to 1, showing that the agreement between the predicted and actual classes is well beyond what chance alone would produce given the class frequencies.
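Kappa compares the observed agreement with the agreement expected by chance from the row and column totals. A minimal sketch of that calculation for the KNN confusion matrix follows; `cohen_kappa` is a helper written for this illustration.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: predicted, columns: reference)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected if predictions were independent of the true classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)

# KNN confusion matrix from this report:
# rows = predicted (Unsatisfied, Satisfied), columns = reference (Unsatisfied, Satisfied).
knn_matrix = [[11, 0],
              [2, 19]]

knn_kappa = cohen_kappa(knn_matrix)
print(f"kappa = {knn_kappa:.4f}")
```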

Naive Bayes

Classification Results

department years ideas communication recognition training conditions tools balance satisfaction Unsatisfied Satisfied pred.class
Administrative Hired within last 4 years 2 3 2 2 4 5 2 Unsatisfied 1.0000000 0.0000000 Unsatisfied
Administrative Hired more than 4 years ago 4 4 3 4 4 5 3 Satisfied 0.0142491 0.9857509 Satisfied
Administrative Hired within last 4 years 4 3 2 2 5 5 5 Unsatisfied 0.9985680 0.0014320 Unsatisfied
Maintenance Hired within last 4 years 5 4 3 5 5 5 3 Satisfied 0.0000000 1.0000000 Satisfied
Maintenance Hired within last 4 years 5 5 5 5 5 5 5 Satisfied 0.0000000 1.0000000 Satisfied
Management Hired more than 4 years ago 5 4 4 3 5 3 5 Satisfied 0.0000000 1.0000000 Satisfied
Management Hired more than 4 years ago 3 4 3 3 4 5 5 Satisfied 0.0021911 0.9978089 Satisfied
Management Hired more than 4 years ago 2 2 2 2 3 5 3 Unsatisfied 1.0000000 0.0000000 Unsatisfied
Production Hired within last 4 years 2 3 2 4 4 4 2 Unsatisfied 1.0000000 0.0000000 Unsatisfied
Production Hired within last 4 years 2 3 1 4 4 4 2 Unsatisfied 0.9999999 0.0000001 Unsatisfied
Production Hired within last 4 years 3 3 3 4 4 4 3 Satisfied 0.7927756 0.2072244 Unsatisfied
Production Hired more than 4 years ago 5 5 5 5 5 5 5 Satisfied 0.0000000 1.0000000 Satisfied
Production Hired within last 4 years 2 2 1 3 3 4 2 Unsatisfied 1.0000000 0.0000000 Unsatisfied
Production Hired more than 4 years ago 5 4 4 3 4 5 5 Satisfied 0.0000122 0.9999878 Satisfied
Production Hired more than 4 years ago 3 4 3 4 5 5 4 Satisfied 0.0007979 0.9992021 Satisfied
Production Hired more than 4 years ago 4 4 4 4 5 5 5 Satisfied 0.0002190 0.9997810 Satisfied
Production Hired more than 4 years ago 3 4 3 3 2 4 4 Unsatisfied 0.9882210 0.0117790 Unsatisfied
Production Hired more than 4 years ago 4 3 4 3 3 4 4 Unsatisfied 0.9315183 0.0684817 Unsatisfied
Production Hired more than 4 years ago 4 5 4 4 4 4 4 Satisfied 0.0008345 0.9991655 Satisfied
Production Hired within last 4 years 5 4 3 4 3 5 3 Satisfied 0.0000902 0.9999098 Satisfied
Production Hired within last 4 years 4 5 3 2 3 5 4 Satisfied 0.0037451 0.9962549 Satisfied
Production Hired within last 4 years 5 5 3 5 3 5 3 Satisfied 0.0000000 1.0000000 Satisfied
Production Hired within last 4 years 4 3 4 3 3 5 2 Unsatisfied 0.9798306 0.0201694 Unsatisfied
Production Hired within last 4 years 5 3 4 5 5 5 5 Satisfied 0.0000001 0.9999999 Satisfied
Production Hired within last 4 years 2 4 2 2 2 5 3 Unsatisfied 1.0000000 0.0000000 Unsatisfied
QC Hired more than 4 years ago 5 5 5 5 5 5 5 Satisfied 0.0000000 1.0000000 Satisfied
QC Hired within last 4 years 3 4 4 4 5 5 2 Satisfied 0.0005646 0.9994354 Satisfied
SR Hired within last 4 years 3 2 2 3 2 4 3 Unsatisfied 1.0000000 0.0000000 Unsatisfied
SR Hired within last 4 years 3 2 2 2 2 4 2 Unsatisfied 1.0000000 0.0000000 Unsatisfied
SR Hired within last 4 years 2 3 2 4 2 5 3 Unsatisfied 1.0000000 0.0000000 Unsatisfied
SR Hired more than 4 years ago 5 5 5 5 5 5 5 Satisfied 0.0000000 1.0000000 Satisfied
SR Hired within last 4 years 4 4 4 5 5 5 5 Satisfied 0.0000066 0.9999934 Satisfied
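The two probability columns in the table are posterior class probabilities. As an illustration of where such numbers come from, here is a minimal sketch of a categorical Naive Bayes classifier. It is not the model fitted in this report: it uses only ideas and recognition (copied from the KNN classification results table), treats each score as a category, and applies add-one (Laplace) smoothing, all of which are assumptions. `naive_bayes_posterior` is a hypothetical helper.

```python
# Two predictors and the class label (1 = Satisfied, 0 = Unsatisfied), in table order.
ideas = [2, 4, 4, 5, 5, 5, 3, 2, 2, 2, 3, 5, 2, 5, 3, 4,
         3, 4, 4, 5, 4, 5, 4, 5, 2, 5, 3, 3, 3, 2, 5, 4]
recognition = [2, 3, 2, 3, 5, 4, 3, 2, 2, 1, 3, 5, 1, 4, 3, 4,
               3, 4, 4, 3, 3, 3, 4, 4, 2, 5, 4, 2, 2, 2, 5, 4]
satisfied = [0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
             0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

def naive_bayes_posterior(idea_score, recognition_score, levels=5):
    """Posterior P(class | scores), assuming the two scores are independent given the class."""
    names = {0: "Unsatisfied", 1: "Satisfied"}
    unnormalised = {}
    for cls, name in names.items():
        rows = [(i, r) for i, r, s in zip(ideas, recognition, satisfied) if s == cls]
        prior = len(rows) / len(satisfied)
        # Add-one smoothing over the five possible score levels (1 to 5).
        p_idea = (sum(1 for i, _ in rows if i == idea_score) + 1) / (len(rows) + levels)
        p_rec = (sum(1 for _, r in rows if r == recognition_score) + 1) / (len(rows) + levels)
        unnormalised[name] = prior * p_idea * p_rec
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}

# Two hypothetical employees: one with top scores, one with low scores.
print(naive_bayes_posterior(5, 5))
print(naive_bayes_posterior(2, 1))
```

The predicted class is simply the one with the larger posterior, which is how the `pred.class` column relates to the two probability columns.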

Confusion Matrix

## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Unsatisfied Satisfied
##   Unsatisfied          13         1
##   Satisfied             0        18
##                                           
##                Accuracy : 0.9688          
##                  95% CI : (0.8378, 0.9992)
##     No Information Rate : 0.5938          
##     P-Value [Acc > NIR] : 1.303e-06       
##                                           
##                   Kappa : 0.936           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9474          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9286          
##              Prevalence : 0.5938          
##          Detection Rate : 0.5625          
##    Detection Prevalence : 0.5625          
##       Balanced Accuracy : 0.9737          
##                                           
##        'Positive' Class : Satisfied       
## 

Naive Bayes Model Output

The confusion matrix displays a few key factors that point to the success of the model’s predictive power. We can first look to the accuracy, which is 96.88% for this output (31 of the 32 employees classified correctly); that is a very strong value, as it is almost error-free. This is also much higher than the NIR of 59.38%.
We can also look to the p-value for the test of accuracy against the NIR; at 1.303e-06 it is far below 5%, meaning that the model’s accuracy is significantly better than what we would get by always predicting the most frequent class.
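The reported p-value is consistent with a one-sided exact binomial test: the probability of getting at least 31 of 32 predictions right if the true accuracy were only the NIR of 19/32. The sketch below reproduces it under that assumption, using a helper named `binomial_upper_tail` written for this illustration.

```python
from math import comb

def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p ** i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )

nir = 19 / 32   # share of the most frequent class (Satisfied)
correct = 31    # correct Naive Bayes predictions out of 32

p_value = binomial_upper_tail(correct, 32, nir)
print(f"P(Acc > NIR) p-value = {p_value:.3e}")
```

The same function with 30 correct predictions gives the corresponding figure for the KNN model.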
The specificity and positive predictive value are both 100%, and the sensitivity (94.74%) and negative predictive value (92.86%) are both above 92%. This shows that the model does very well both when seeking to predict a class correctly and when seeking to avoid false classification.
Interestingly enough, the strengths of the Naive Bayes model are the inverse of those of the KNN model. Whereas the KNN model had perfect scores for its sensitivity and negative predictive value and somewhat worse ones for the rest, the Naive Bayes model had perfect scores for its specificity and positive predictive value and somewhat worse ones for its other performance metrics. Could the two models be used together so that each complements the other’s performance?
The 95% CI of (0.8378, 0.9992) is also strong. It gives the range in which the model’s true accuracy plausibly lies, and even the lower bound is above 83% accuracy; that is a very safe floor, and the ceiling is close to perfect as well.
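The interval is consistent with an exact (Clopper-Pearson) binomial confidence interval for 31 correct predictions out of 32. The sketch below recovers the bounds by bisection under that assumption; `clopper_pearson` and its inner helpers are names chosen for this illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson(successes, trials, level=0.95):
    """Exact two-sided confidence interval for a binomial proportion."""
    alpha = 1 - level

    def solve(g):
        # Bisection for the root of a function that increases in p on [0, 1].
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: (1 - binom_cdf(successes - 1, trials, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == trials else solve(
        lambda p: alpha / 2 - binom_cdf(successes, trials, p))
    return lower, upper

nb_low, nb_high = clopper_pearson(31, 32)
print(f"95% CI: ({nb_low:.4f}, {nb_high:.4f})")
```

With only 32 observations the interval stays fairly wide even at 31 correct predictions, which is why the lower bound sits well below the observed accuracy.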
The Kappa value of 0.936 is close to 1, showing that the agreement between the predicted and actual classes is well beyond what chance alone would produce given the class frequencies. This Kappa value is higher than the already strong value of 0.867 returned in the KNN analysis.
If we focus solely on predictive reliability, this model looks to be slightly more reliable on this dataset than the KNN model. This judgement rests mostly on the overall and balanced accuracy values, the Kappa value, and the remaining performance metrics; both models have a very small p-value, so that is a non-factor in the comparison.