Claudia Schmitt Assignment

About the Data

The dataset includes information on employees from different departments and factors that might affect job satisfaction. It has details like years of experience, communication, recognition, training, working conditions, tools, and work-life balance, along with job satisfaction scores. The data covers departments like Administrative, Maintenance, Management, Production, QC, and SR, making it possible to compare satisfaction across roles. Analyzing this can help find patterns and understand what influences job satisfaction.

Data Preview

Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction
Administrative	16	2	3	2	2	4	5	2	3
Administrative	2	4	4	3	4	4	5	3	9
Administrative	14	4	3	2	2	5	5	5	6
Maintenance	17	5	4	3	5	5	5	3	8
Maintenance	15	5	5	5	5	5	5	5	9

This is a preview of the first five rows of the dataset, which mixes categorical and numerical data types. The Department column contains text values such as “Administrative,” “Maintenance,” and “Production,” and can be treated as a categorical variable. The remaining columns are stored as numbers. Years is a count of years of experience, while Ideas, Communication, Recognition, Training, Conditions, Tools, Balance, and Satisfaction are ratings. Because these ratings represent rankings rather than continuous measurements, they are best regarded as ordinal data even though they are stored as numeric.

Data Summary

Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction
Length:32	Min. : 1.000	Min. :2.000	Min. :2.000	Min. :1.000	Min. :2.000	Min. :2.000	Min. :3.000	Min. :2.000	Min. : 3.000
Class :character	1st Qu.: 2.750	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:2.000	1st Qu.:3.000	1st Qu.:3.000	1st Qu.:4.000	1st Qu.:3.000	1st Qu.: 5.000
Mode :character	Median : 7.000	Median :4.000	Median :4.000	Median :3.000	Median :4.000	Median :4.000	Median :5.000	Median :3.500	Median : 7.000
NA	Mean : 9.219	Mean :3.656	Mean :3.688	Mean :3.156	Mean :3.625	Mean :3.844	Mean :4.656	Mean :3.625	Mean : 6.844
NA	3rd Qu.:15.000	3rd Qu.:5.000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:4.250	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.:5.000	3rd Qu.: 8.250
NA	Max. :32.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :5.000	Max. :10.000

The summary shows employee ratings across work-related factors such as communication, recognition, and training. The Department column is categorical, with 32 entries. Years ranges from 1 to 32, with a mean of about 9.22 years of experience. The work-factor ratings, on a scale of 1 to 5, have means between 3.16 (Recognition) and 4.66 (Tools), so most employees report moderately positive experiences. Satisfaction has a mean of 6.84 and a maximum of 10, indicating generally positive feedback. Tools (mean 4.66) and Conditions (mean 3.84) are the highest-rated factors, suggesting that employees feel well-equipped and are satisfied with their working conditions.
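The Years figures in the summary can be checked against the 32 Years values listed in the KNN dataset table in section B. The following is a minimal Python sketch, not the code that produced this report (the table above appears to be R output). The variable names are illustrative, and the "inclusive" quartile method is an assumption about the interpolation rule behind the printed quartiles.

```python
import statistics

# Years of experience for the 32 employees, copied in row order
# from the KNN dataset table in section B.
years = [16, 2, 14, 17, 15, 1, 3, 3, 16, 15, 13, 3, 6, 1, 3, 2,
         3, 2, 2, 15, 5, 8, 17, 15, 5, 1, 11, 21, 8, 32, 2, 18]

mean_years = statistics.mean(years)

# "inclusive" interpolates between observed values; it is assumed here
# to correspond to the rule used for the quartiles in the summary table.
q1, median, q3 = statistics.quantiles(years, n=4, method="inclusive")

print(f"Min {min(years)}, 1st Qu. {q1}, Median {median}, "
      f"Mean {mean_years:.3f}, 3rd Qu. {q3}, Max {max(years)}")
```

Under these assumptions the sketch gives the same minimum, quartiles, mean, and maximum as the Years column of the summary.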

Quality of Data

The data has no missing values. It includes a good range of predictors, but several factors that would plausibly affect satisfaction are not recorded: health and safety, company values, relationship with managers, pay scale, career development, and challenges.

A) Data Visualisation

1. Histogram of Job Satisfaction

This histogram displays the distribution of job satisfaction scores on a scale from 1 (least satisfied) to 10 (most satisfied); the observed scores run from 3 to 10. The distribution is left-skewed: most employees report relatively high satisfaction, with the majority of scores clustering between 7 and 8. Only a few employees report very low scores, so dissatisfaction is uncommon in this dataset.

2. Boxplot of Job Satisfaction by Department

The box plot of job satisfaction by department shows how satisfaction levels vary across roles. QC and Maintenance have the highest overall job satisfaction scores, indicating that employees in these departments tend to be more satisfied with their work environment. Management has the highest median satisfaction level, suggesting that a typical management employee reports greater satisfaction than a typical employee in other departments. These three departments contain only two or three employees each, so the comparison should be read with caution. This visualization helps identify trends in job satisfaction across roles, which can be useful for targeted improvements.

3. Recognition vs. Satisfaction

As an employee receives more recognition, their satisfaction level goes up, which is expected.

4. Correlation Heatmap

This correlation heatmap illustrates the strength and direction of relationships among the variables. A key takeaway is that there is a positive relationship between satisfaction and recognition, meaning that as recognition increases, so does job satisfaction. On the other hand, satisfaction and tools show little correlation, suggesting that the availability of tools does not have a strong impact on job satisfaction. This visualization helps identify which factors are most closely related to satisfaction.

5. Scatter Plot: Years of Experience vs. Job Satisfaction

This scatter plot shows a negative linear relationship between years of experience and job satisfaction. This finding was surprising, as I initially assumed that gaining experience would lead to higher comfort and satisfaction in the workplace. However, the plot suggests the opposite: as employees gain more experience, their satisfaction tends to decrease. This could indicate that with more experience, employees develop higher expectations and may become dissatisfied with what previously satisfied them. This trend challenges the assumption that experience always leads to increased job satisfaction.

6. Average Satisfaction by Communication Level

The higher the communication level, the more satisfied an employee is, which is expected.

7. Job Satisfaction by Ideas Contribution Level

The higher the contribution level, the more satisfied the employee is, which is expected.

8. Tools vs. Satisfaction

As an employee's rating of their tools increases, so does their satisfaction.

9. Satisfaction by Training Level

The more training an employee has, the more satisfied they are.

B) KNN to predict and classify Satisfaction

This analysis evaluates a KNN model’s performance in predicting job satisfaction levels (Low, Medium, High) based on various workplace factors. The model’s accuracy, precision, and balanced accuracy are assessed to determine its effectiveness, with a focus on identifying potential weaknesses in classifying Low satisfaction cases.

1. Dataset

Job dataset for KNN
Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction
Administrative	16	2	3	2	2	4	5	2	Low
Administrative	2	4	4	3	4	4	5	3	High
Administrative	14	4	3	2	2	5	5	5	Medium
Maintenance	17	5	4	3	5	5	5	3	High
Maintenance	15	5	5	5	5	5	5	5	High
Management	1	5	4	4	3	5	3	5	High
Management	3	3	4	3	3	4	5	5	High
Management	3	2	2	2	2	3	5	3	Low
Production	16	2	3	2	4	4	4	2	Medium
Production	15	2	3	1	4	4	4	2	Medium
Production	13	3	3	3	4	4	4	3	High
Production	3	5	5	5	5	5	5	5	High
Production	6	2	2	1	3	3	4	2	Medium
Production	1	5	4	4	3	4	5	5	High
Production	3	3	4	3	4	5	5	4	Medium
Production	2	4	4	4	4	5	5	5	High
Production	3	3	4	3	3	2	4	4	Medium
Production	2	4	3	4	3	3	4	4	Medium
Production	2	4	5	4	4	4	4	4	High
Production	15	5	4	3	4	3	5	3	Medium
Production	5	4	5	3	2	3	5	4	Medium
Production	8	5	5	3	5	3	5	3	High
Production	17	4	3	4	3	3	5	2	Medium
Production	15	5	3	4	5	5	5	5	High
Production	5	2	4	2	2	2	5	3	Low
QC	1	5	5	5	5	5	5	5	High
QC	11	3	4	4	4	5	5	2	Medium
SR	21	3	2	2	3	2	4	3	Medium
SR	8	3	2	2	2	2	4	2	Medium
SR	32	2	3	2	4	2	5	3	Medium
SR	2	5	5	5	5	5	5	5	High
SR	18	4	4	4	5	5	5	5	High

The Job dataset for KNN consists of data collected from employees across various departments, with the goal of predicting job satisfaction. The dataset includes columns for different features that may impact satisfaction, such as Years of experience, Ideas, Communication, Recognition, Training, Conditions, Tools, and Work-Life Balance. The Satisfaction column is the target variable, with three levels: Low, Medium, and High.

The dataset covers several departments, including Administrative, Maintenance, Management, Production, QC, and SR, providing a diverse set of roles. It contains 32 rows of employee data, with satisfaction levels distributed across the dataset, reflecting various combinations of job-related factors.

This dataset is used to train a KNN model, with the goal of understanding how these factors influence job satisfaction across different departments.
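To make the idea behind KNN concrete, here is a minimal Python sketch of the classification rule: find the k training rows closest to a query and take the majority label. It is an illustration only, not the model fitted in this report. It uses just the first eight rows of the table above, leaves out Department, and assumes plain Euclidean distance on the unscaled numeric columns; the function name knn_predict is hypothetical.

```python
import math
from collections import Counter

# First eight rows of the KNN dataset table:
# (Years, Ideas, Communication, Recognition, Training, Conditions, Tools, Balance)
# paired with the Satisfaction label.
train = [
    ((16, 2, 3, 2, 2, 4, 5, 2), "Low"),
    ((2, 4, 4, 3, 4, 4, 5, 3), "High"),
    ((14, 4, 3, 2, 2, 5, 5, 5), "Medium"),
    ((17, 5, 4, 3, 5, 5, 5, 3), "High"),
    ((15, 5, 5, 5, 5, 5, 5, 5), "High"),
    ((1, 5, 4, 4, 3, 5, 3, 5), "High"),
    ((3, 3, 4, 3, 3, 4, 5, 5), "High"),
    ((3, 2, 2, 2, 2, 3, 5, 3), "Low"),
]

def knn_predict(train, query, k):
    """Return the majority label among the k training rows closest to query."""
    # Sort training rows by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    # most_common breaks ties in favour of the label seen first,
    # which here is the label of the nearer neighbour.
    return votes.most_common(1)[0][0]

# Classify the fifth employee (Maintenance, 15 years) from its three nearest rows.
print(knn_predict(train, (15, 5, 5, 5, 5, 5, 5, 5), k=3))
```

Because Years spans 1 to 32 while the ratings span 1 to 5, Years dominates an unscaled distance like this one, so scaling the predictors first is a common design choice.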

2. Choosing K value

## k-Nearest Neighbors 
## 
## 32 samples
##  9 predictor
##  3 classes: 'Low', 'Medium', 'High' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 29, 29, 28, 29, 28, 30, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7100000  0.4744781
##   7  0.6544444  0.3883502
##   9  0.5516667  0.1687205
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

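Cross-validated accuracy was highest at k = 5 (0.71, compared with 0.65 at k = 7 and 0.55 at k = 9), so the final value used for the model was k = 5.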

3. Plot

The plot shows an elbow at k = 7; the final model nevertheless uses k = 5, which had the highest cross-validated accuracy.

4. Prediction

KNN classifier result
Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction	prediction
Administrative	16	2	3	2	2	4	5	2	Low	Medium
Administrative	2	4	4	3	4	4	5	3	High	High
Administrative	14	4	3	2	2	5	5	5	Medium	Medium
Maintenance	17	5	4	3	5	5	5	3	High	High
Maintenance	15	5	5	5	5	5	5	5	High	High
Management	1	5	4	4	3	5	3	5	High	High
Management	3	3	4	3	3	4	5	5	High	High
Management	3	2	2	2	2	3	5	3	Low	Medium
Production	16	2	3	2	4	4	4	2	Medium	Medium
Production	15	2	3	1	4	4	4	2	Medium	Medium
Production	13	3	3	3	4	4	4	3	High	Medium
Production	3	5	5	5	5	5	5	5	High	High
Production	6	2	2	1	3	3	4	2	Medium	Medium
Production	1	5	4	4	3	4	5	5	High	High
Production	3	3	4	3	4	5	5	4	Medium	High
Production	2	4	4	4	4	5	5	5	High	High
Production	3	3	4	3	3	2	4	4	Medium	Medium
Production	2	4	3	4	3	3	4	4	Medium	High
Production	2	4	5	4	4	4	4	4	High	High
Production	15	5	4	3	4	3	5	3	Medium	High
Production	5	4	5	3	2	3	5	4	Medium	Medium
Production	8	5	5	3	5	3	5	3	High	Medium
Production	17	4	3	4	3	3	5	2	Medium	Medium
Production	15	5	3	4	5	5	5	5	High	High
Production	5	2	4	2	2	2	5	3	Low	Medium
QC	1	5	5	5	5	5	5	5	High	High
QC	11	3	4	4	4	5	5	2	Medium	Medium
SR	21	3	2	2	3	2	4	3	Medium	Medium
SR	8	3	2	2	2	2	4	2	Medium	Medium
SR	32	2	3	2	4	2	5	3	Medium	Medium
SR	2	5	5	5	5	5	5	5	High	High
SR	18	4	4	4	5	5	5	5	High	High

The model never predicts Low: all three Low employees are classified as Medium. It predicts Medium and High values more accurately.

5. Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low      0      0    0
##     Medium   3     11    2
##     High     0      3   13
## 
## Overall Statistics
##                                          
##                Accuracy : 0.75           
##                  95% CI : (0.566, 0.8854)
##     No Information Rate : 0.4688         
##     P-Value [Acc > NIR] : 0.001154       
##                                          
##                   Kappa : 0.5429         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity             0.00000        0.7857      0.8667
## Specificity             1.00000        0.7222      0.8235
## Pos Pred Value              NaN        0.6875      0.8125
## Neg Pred Value          0.90625        0.8125      0.8750
## Prevalence              0.09375        0.4375      0.4688
## Detection Rate          0.00000        0.3438      0.4062
## Detection Prevalence    0.00000        0.5000      0.5000
## Balanced Accuracy       0.50000        0.7540      0.8451

6. Output Discussion

This confusion matrix shows that the model is 75% accurate and predicts most satisfaction levels correctly. The Kappa value of 0.5429 indicates a moderate level of agreement between predicted and actual values.

Sensitivity: The sensitivity for the Low class is 0, which means the model did not correctly identify any of the employees in this category. It correctly identified 78.57% of the Medium satisfaction instances (sensitivity 0.7857) and 86.67% of the High satisfaction instances (sensitivity 0.8667).

Specificity: The model correctly identified all non-Low instances as not Low, but only because it never classified any instance as Low. It correctly identified 72.22% of non-Medium instances as not Medium and 82.35% of non-High instances as not High.

Precision: When the model predicts Medium, it is correct 68.75% of the time; when it predicts High, it is correct 81.25% of the time. It never predicts Low, so its precision for Low is NaN.

Negative Predictive Value (NPV): When the model predicts that a case is not Low, it is correct 90.63% of the time. The corresponding figures are 81.25% for not Medium and 87.5% for not High.

Balanced Accuracy: The model performs best for High (84.51%), followed by Medium (75.4%). Low has a balanced accuracy of 50%, meaning the model struggles to detect Low satisfaction.

Overall, the model performs well for High and Medium satisfaction but completely fails to predict Low satisfaction, which accounts for only 3 of the 32 cases. This indicates a need for better class balance or feature adjustments.
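The per-class statistics discussed above follow directly from the counts in the confusion matrix. The Python sketch below recomputes them using the usual one-vs-rest definitions; it is a worked check under those definitions, not the code used to produce the output, and the function name class_metrics is illustrative.

```python
# KNN confusion matrix from this section:
# rows are predictions, columns are actual (reference) classes.
classes = ["Low", "Medium", "High"]
matrix = [
    [0, 0, 0],    # predicted Low
    [3, 11, 2],   # predicted Medium
    [0, 3, 13],   # predicted High
]
total = sum(sum(row) for row in matrix)

def class_metrics(i):
    """One-vs-rest metrics for the class at index i."""
    tp = matrix[i][i]
    fp = sum(matrix[i]) - tp                   # predicted i, actually another class
    fn = sum(row[i] for row in matrix) - tp    # actually i, predicted another class
    tn = total - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Precision is undefined when the class is never predicted (the NaN for Low).
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    balanced = (sensitivity + specificity) / 2
    return sensitivity, specificity, precision, balanced

accuracy = sum(matrix[i][i] for i in range(3)) / total
print(f"Accuracy: {accuracy:.4f}")
for i, name in enumerate(classes):
    sens, spec, prec, bal = class_metrics(i)
    print(f"{name:<6} sensitivity={sens:.4f} specificity={spec:.4f} "
          f"precision={prec:.4f} balanced={bal:.4f}")
```

For Low, the sketch shows why the balanced accuracy is exactly 0.5: sensitivity is 0 and specificity is 1, so their average carries no information about the model's ability to find Low cases.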

C) Naive Bayes to predict Satisfaction

Naive Bayes is a probabilistic classifier that predicts job satisfaction by calculating the probability of each satisfaction level (Low, Medium, High) based on features like communication, recognition, and training. It assumes that the features are independent given the satisfaction level, making it a simple yet effective method for classification.

1. Dataset

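This output shows the dataset with every variable converted to a categorical one: Years is grouped into ranges, and the ratings and Satisfaction are grouped into Low, Medium, and High.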

Job Dataset for Naive Bayes
Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction
Administrative	16-20	Low	Medium	Low	Low	High	High	Low	Low
Administrative	1-5	Medium	High	Medium	High	High	High	Medium	High
Administrative	11-15	Medium	Medium	Low	Low	High	High	High	Medium
Maintenance	16-20	High	High	Medium	High	High	High	Medium	High
Maintenance	11-15	High	High	High	High	High	High	High	High
Management	1-5	High	High	High	Medium	High	Medium	High	High
Management	1-5	Medium	High	Medium	Medium	High	High	High	High
Management	1-5	Low	Low	Low	Low	Medium	High	Medium	Low
Production	16-20	Low	Medium	Low	High	High	High	Low	Medium
Production	11-15	Low	Medium	Low	High	High	High	Low	Medium
Production	11-15	Medium	Medium	Medium	High	High	High	Medium	High
Production	1-5	High	High	High	High	High	High	High	High
Production	6-10	Low	Low	Low	Medium	Medium	High	Low	Medium
Production	1-5	High	High	High	Medium	High	High	High	High
Production	1-5	Medium	High	Medium	High	High	High	High	Medium
Production	1-5	Medium	High	High	High	High	High	High	High
Production	1-5	Medium	High	Medium	Medium	Low	High	High	Medium
Production	1-5	Medium	Medium	High	Medium	Medium	High	High	Medium
Production	1-5	Medium	High	High	High	High	High	High	High
Production	11-15	High	High	Medium	High	Medium	High	Medium	Medium
Production	1-5	Medium	High	Medium	Low	Medium	High	High	Medium
Production	6-10	High	High	Medium	High	Medium	High	Medium	High
Production	16-20	Medium	Medium	High	Medium	Medium	High	Low	Medium
Production	11-15	High	Medium	High	High	High	High	High	High
Production	1-5	Low	High	Low	Low	Low	High	Medium	Low
QC	1-5	High	High	High	High	High	High	High	High
QC	11-15	Medium	High	High	High	High	High	Low	Medium
SR	21+	Medium	Low	Low	Medium	Low	High	Medium	Medium
SR	6-10	Medium	Low	Low	Low	Low	High	Low	Medium
SR	21+	Low	Medium	Low	High	Low	High	Medium	Medium
SR	1-5	High	High	High	High	High	High	High	High
SR	16-20	Medium	High	High	High	High	High	High	High
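The Years ranges in this table can be produced with a simple binning rule. The sketch below is a Python illustration of that step; the cut points are inferred from the labels and from rows such as 5 years mapping to "1-5" and 6 years to "6-10", they are not stated in the report, and the function name bin_years is hypothetical.

```python
def bin_years(years):
    """Map years of experience to the range labels used in the table above."""
    # Upper bounds are inferred from the data, not taken from the report's code.
    for upper, label in [(5, "1-5"), (10, "6-10"), (15, "11-15"), (20, "16-20")]:
        if years <= upper:
            return label
    return "21+"

# The first three employees have 16, 2 and 14 years of experience.
print([bin_years(y) for y in (16, 2, 14)])
```

The rating columns are grouped in a similar way, but their cut points differ by column (for example, an Ideas rating of 4 appears as Medium while a Communication rating of 4 appears as High), so they are not reproduced here.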

2. Two types of probabilities

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Low  Medium    High 
## 0.09375 0.43750 0.46875 
## 
## Conditional probabilities:
##         Department
## Y        Administrative Maintenance Management Production         QC         SR
##   Low        0.33333333  0.00000000 0.33333333 0.33333333 0.00000000 0.00000000
##   Medium     0.07142857  0.00000000 0.00000000 0.64285714 0.07142857 0.21428571
##   High       0.06666667  0.13333333 0.13333333 0.46666667 0.06666667 0.13333333
## 
##         Years
## Y               1-5       6-10      11-15      16-20        21+
##   Low    0.66666667 0.00000000 0.00000000 0.33333333 0.00000000
##   Medium 0.28571429 0.14285714 0.28571429 0.14285714 0.14285714
##   High   0.60000000 0.06666667 0.20000000 0.13333333 0.00000000
## 
##         Ideas
## Y               Low     Medium       High
##   Low    1.00000000 0.00000000 0.00000000
##   Medium 0.28571429 0.64285714 0.07142857
##   High   0.00000000 0.40000000 0.60000000
## 
##         Communication
## Y              Low    Medium      High
##   Low    0.3333333 0.3333333 0.3333333
##   Medium 0.2142857 0.4285714 0.3571429
##   High   0.0000000 0.1333333 0.8666667
## 
##         Recognition
## Y              Low    Medium      High
##   Low    1.0000000 0.0000000 0.0000000
##   Medium 0.5000000 0.2857143 0.2142857
##   High   0.0000000 0.3333333 0.6666667
## 
##         Training
## Y              Low    Medium      High
##   Low    1.0000000 0.0000000 0.0000000
##   Medium 0.2142857 0.3571429 0.4285714
##   High   0.0000000 0.2000000 0.8000000
## 
##         Conditions
## Y               Low     Medium       High
##   Low    0.33333333 0.33333333 0.33333333
##   Medium 0.28571429 0.35714286 0.35714286
##   High   0.00000000 0.06666667 0.93333333
## 
##         Tools
## Y               Low     Medium       High
##   Low    0.00000000 0.00000000 1.00000000
##   Medium 0.00000000 0.00000000 1.00000000
##   High   0.00000000 0.06666667 0.93333333
## 
##         Balance
## Y              Low    Medium      High
##   Low    0.3333333 0.6666667 0.0000000
##   Medium 0.4285714 0.2142857 0.3571429
##   High   0.0000000 0.2666667 0.7333333
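Both types of probabilities are relative frequencies and can be counted directly from the categorical table. As a minimal Python sketch, the block below recomputes the a-priori probabilities and the Recognition table from the Recognition and Satisfaction columns of the dataset in this section. The single-letter encoding and variable names are my own shorthand, not part of the original analysis.

```python
from collections import Counter

# Recognition and Satisfaction for the 32 rows of the categorical table,
# in row order, abbreviated to L(ow), M(edium), H(igh).
recognition  = "LMLMHHML" "LLMHLHMH" "MHHMMMHH" "LHHLLLHH"
satisfaction = "LHMHHHHL" "MMHHMHMH" "MMHMMHMH" "LHMMMMHH"

n = len(satisfaction)
class_counts = Counter(satisfaction)

# A-priori probability: share of rows in each satisfaction class.
prior = {c: class_counts[c] / n for c in "LMH"}

# Conditional probability: P(Recognition = r | Satisfaction = c).
pair_counts = Counter(zip(satisfaction, recognition))
conditional = {c: {r: pair_counts[(c, r)] / class_counts[c] for r in "LMH"}
               for c in "LMH"}

print(prior)
for c in "LMH":
    print(c, {r: round(p, 4) for r, p in conditional[c].items()})
```

For example, 7 of the 14 Medium-satisfaction employees have Low recognition, which gives the 0.5 in the Medium row of the Recognition table.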

Now we want to use the NB Classifier to classify employees based on their predictors. We will use the whole dataset.

3. NB Classifier

##                Low       Medium         High
##  [1,] 9.772912e-01 2.270882e-02 1.225914e-12
##  [2,] 3.658135e-09 4.553691e-02 9.544631e-01
##  [3,] 1.032852e-07 9.999936e-01 6.334088e-06
##  [4,] 8.623011e-12 5.565793e-05 9.999443e-01
##  [5,] 4.703643e-18 1.686669e-05 9.999831e-01
##  [6,] 1.951173e-14 2.623751e-07 9.999997e-01
##  [7,] 4.178268e-12 6.742242e-04 9.993258e-01
##  [8,] 9.999593e-01 4.066221e-05 4.031836e-13
##  [9,] 2.385173e-03 9.976148e-01 1.675498e-08
## [10,] 3.586300e-06 9.999964e-01 1.259623e-08
## [11,] 6.576018e-12 5.893853e-01 4.106147e-01
## [12,] 9.944699e-14 1.031607e-03 9.989684e-01
## [13,] 1.721401e-05 9.999828e-01 2.699156e-12
## [14,] 3.968327e-13 3.430432e-03 9.965696e-01
## [15,] 2.879444e-13 3.584366e-02 9.641563e-01
## [16,] 1.472714e-13 1.374939e-02 9.862506e-01
## [17,] 1.192117e-11 9.893079e-01 1.069205e-02
## [18,] 8.947844e-12 8.353788e-01 1.646212e-01
## [19,] 1.472714e-13 1.374939e-02 9.862506e-01
## [20,] 1.931076e-11 1.602551e-01 8.397449e-01
## [21,] 1.598988e-08 9.952196e-01 4.780419e-03
## [22,] 5.363465e-11 2.225502e-01 7.774498e-01
## [23,] 2.975015e-09 9.999005e-01 9.951625e-05
## [24,] 2.843126e-15 2.359438e-02 9.764056e-01
## [25,] 9.663192e-01 3.368080e-02 1.772771e-11
## [26,] 2.088866e-15 8.025448e-04 9.991975e-01
## [27,] 1.164327e-13 9.662455e-01 3.375451e-02
## [28,] 3.442861e-10 1.000000e+00 6.169606e-11
## [29,] 1.434525e-07 9.999999e-01 6.426672e-14
## [30,] 3.227681e-07 9.999997e-01 7.712005e-11
## [31,] 1.044014e-15 1.203334e-03 9.987967e-01
## [32,] 3.403249e-15 3.530338e-02 9.646966e-01

##  [1] Low    High   Medium High   High   High   High   Low    Medium Medium
## [11] Medium High   Medium High   High   High   Medium Medium High   High  
## [21] Medium High   Medium High   Low    High   Medium Medium Medium Medium
## [31] High   High  
## Levels: Low Medium High
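Each row of posterior probabilities combines the a-priori probability of a class with the conditional probabilities of the employee's nine predictor values, then normalises across classes. The Python sketch below works this through for the first employee using the values printed in the previous section. It assumes that zero conditional probabilities are replaced by a small threshold of 0.001 before multiplying; that assumption appears to reproduce the printed posteriors but is not stated in the report, and the names THRESHOLD and posterior are illustrative.

```python
import math

# First employee: Administrative, 16-20 years, Ideas Low, Communication Medium,
# Recognition Low, Training Low, Conditions High, Tools High, Balance Low.
prior = {"Low": 0.09375, "Medium": 0.4375, "High": 0.46875}

# Conditional probabilities of those nine values given each class, written as the
# fractions behind the printed decimals (e.g. 1/14 = 0.07142857).
likelihoods = {
    "Low":    [1/3, 1/3, 1.0, 1/3, 1.0, 1.0, 1/3, 1.0, 1/3],
    "Medium": [1/14, 2/14, 4/14, 6/14, 7/14, 3/14, 5/14, 1.0, 6/14],
    "High":   [1/15, 2/15, 0.0, 2/15, 0.0, 0.0, 14/15, 14/15, 0.0],
}

THRESHOLD = 0.001  # assumed replacement value for zero probabilities

def posterior(prior, likelihoods):
    """Normalised prior-times-likelihood score for each class."""
    scores = {}
    for cls, probs in likelihoods.items():
        adjusted = [p if p > 0 else THRESHOLD for p in probs]
        scores[cls] = prior[cls] * math.prod(adjusted)
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

post = posterior(prior, likelihoods)
predicted = max(post, key=post.get)  # class with the largest posterior
print(post, predicted)
```

Under that assumption the result agrees with row [1,] of the posterior table, and the predicted class is the one with the largest posterior, here Low.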

4. Display the result of classification

Classification Result
Department	Years	Ideas	Communication	Recognition	Training	Conditions	Tools	Balance	Satisfaction	Low	Medium	High	pred.class
Administrative	16-20	Low	Medium	Low	Low	High	High	Low	Low	0.9772912	0.0227088	0.0000000	Low
Administrative	1-5	Medium	High	Medium	High	High	High	Medium	High	0.0000000	0.0455369	0.9544631	High
Administrative	11-15	Medium	Medium	Low	Low	High	High	High	Medium	0.0000001	0.9999936	0.0000063	Medium
Maintenance	16-20	High	High	Medium	High	High	High	Medium	High	0.0000000	0.0000557	0.9999443	High
Maintenance	11-15	High	High	High	High	High	High	High	High	0.0000000	0.0000169	0.9999831	High
Management	1-5	High	High	High	Medium	High	Medium	High	High	0.0000000	0.0000003	0.9999997	High
Management	1-5	Medium	High	Medium	Medium	High	High	High	High	0.0000000	0.0006742	0.9993258	High
Management	1-5	Low	Low	Low	Low	Medium	High	Medium	Low	0.9999593	0.0000407	0.0000000	Low
Production	16-20	Low	Medium	Low	High	High	High	Low	Medium	0.0023852	0.9976148	0.0000000	Medium
Production	11-15	Low	Medium	Low	High	High	High	Low	Medium	0.0000036	0.9999964	0.0000000	Medium
Production	11-15	Medium	Medium	Medium	High	High	High	Medium	High	0.0000000	0.5893853	0.4106147	Medium
Production	1-5	High	High	High	High	High	High	High	High	0.0000000	0.0010316	0.9989684	High
Production	6-10	Low	Low	Low	Medium	Medium	High	Low	Medium	0.0000172	0.9999828	0.0000000	Medium
Production	1-5	High	High	High	Medium	High	High	High	High	0.0000000	0.0034304	0.9965696	High
Production	1-5	Medium	High	Medium	High	High	High	High	Medium	0.0000000	0.0358437	0.9641563	High
Production	1-5	Medium	High	High	High	High	High	High	High	0.0000000	0.0137494	0.9862506	High
Production	1-5	Medium	High	Medium	Medium	Low	High	High	Medium	0.0000000	0.9893079	0.0106921	Medium
Production	1-5	Medium	Medium	High	Medium	Medium	High	High	Medium	0.0000000	0.8353788	0.1646212	Medium
Production	1-5	Medium	High	High	High	High	High	High	High	0.0000000	0.0137494	0.9862506	High
Production	11-15	High	High	Medium	High	Medium	High	Medium	Medium	0.0000000	0.1602551	0.8397449	High
Production	1-5	Medium	High	Medium	Low	Medium	High	High	Medium	0.0000000	0.9952196	0.0047804	Medium
Production	6-10	High	High	Medium	High	Medium	High	Medium	High	0.0000000	0.2225502	0.7774498	High
Production	16-20	Medium	Medium	High	Medium	Medium	High	Low	Medium	0.0000000	0.9999005	0.0000995	Medium
Production	11-15	High	Medium	High	High	High	High	High	High	0.0000000	0.0235944	0.9764056	High
Production	1-5	Low	High	Low	Low	Low	High	Medium	Low	0.9663192	0.0336808	0.0000000	Low
QC	1-5	High	High	High	High	High	High	High	High	0.0000000	0.0008025	0.9991975	High
QC	11-15	Medium	High	High	High	High	High	Low	Medium	0.0000000	0.9662455	0.0337545	Medium
SR	21+	Medium	Low	Low	Medium	Low	High	Medium	Medium	0.0000000	1.0000000	0.0000000	Medium
SR	6-10	Medium	Low	Low	Low	Low	High	Low	Medium	0.0000001	0.9999999	0.0000000	Medium
SR	21+	Low	Medium	Low	High	Low	High	Medium	Medium	0.0000003	0.9999997	0.0000000	Medium
SR	1-5	High	High	High	High	High	High	High	High	0.0000000	0.0012033	0.9987967	High
SR	16-20	Medium	High	High	High	High	High	High	High	0.0000000	0.0353034	0.9646966	High

5. Confusion Matrix and Statistics

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low Medium High
##     Low      3      0    0
##     Medium   0     12    1
##     High     0      2   14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9062          
##                  95% CI : (0.7498, 0.9802)
##     No Information Rate : 0.4688          
##     P-Value [Acc > NIR] : 2.331e-07       
##                                           
##                   Kappa : 0.8381          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Low Class: Medium Class: High
## Sensitivity             1.00000        0.8571      0.9333
## Specificity             1.00000        0.9444      0.8824
## Pos Pred Value          1.00000        0.9231      0.8750
## Neg Pred Value          1.00000        0.8947      0.9375
## Prevalence              0.09375        0.4375      0.4688
## Detection Rate          0.09375        0.3750      0.4375
## Detection Prevalence    0.09375        0.4062      0.5000
## Balanced Accuracy       1.00000        0.9008      0.9078

6. Output Discussion

The Naive Bayes model achieved 90.62% accuracy, with a strong Kappa score of 0.8381, indicating high agreement between predictions and actual values. Low satisfaction was perfectly classified (100% sensitivity and precision), while Medium (85.71% sensitivity, 92.31% precision) and High (93.33% sensitivity, 87.50% precision) had minor misclassifications. The balanced accuracy is at least 90% for every class, making this model effective for predicting job satisfaction trends. Because the classifier was trained and evaluated on the whole dataset, these figures describe its fit to the 32 observed employees rather than its performance on new data.
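The accuracy and Kappa values quoted above can be recomputed from the confusion matrix. The following Python sketch uses the standard definition of Cohen's Kappa, which compares observed agreement with the agreement expected by chance from the row and column totals. It is a check on the reported figures, not the original code.

```python
# Naive Bayes confusion matrix from this section:
# rows are predictions, columns are actual (reference) classes.
matrix = [
    [3, 0, 0],    # predicted Low
    [0, 12, 1],   # predicted Medium
    [0, 2, 14],   # predicted High
]
n = sum(sum(row) for row in matrix)

# Observed agreement: share of cases on the diagonal (the accuracy).
observed = sum(matrix[i][i] for i in range(3)) / n

# Chance agreement: P(predicted class) * P(actual class), summed over classes.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(row[j] for row in matrix) for j in range(3)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(f"Accuracy: {observed:.4f}, Kappa: {kappa:.4f}")
```

Applying the same calculation to the KNN matrix in section B gives a Kappa of about 0.54, which is why Naive Bayes shows the stronger agreement of the two models on this data.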

CS

2025-03-19

Assignment 1: MSCI 4230 – BUSINESS ANALYTICS IN PRACTICE
