Group 5 (OCC 2)
Tan Rik Ee (17206574)
Lim Wei En (23086179)
Chow Koo Li (23104111)
Lim Sze Gee (23102772)
Law Yu Xuan (23106248)
11 January 2025
The main objectives are to predict monthly salary and employee attrition:
To identify the factors associated with monthly salary and employee attrition by examining various aspects of employee features.
To develop machine learning models for prediction.
To evaluate the effectiveness of the machine learning models for prediction.
Employee attrition refers to the natural process by which employees leave an organization, whether through resignation or other factors. It is a critical challenge for organizations, as it directly impacts workforce stability, financial costs, and overall productivity. According to the 2023 Salary Increase and Turnover Study, Malaysia’s attrition rate increased to 16.2% in 2023 from 14.9% in 2022. Both voluntary and involuntary attrition can hinder organizational growth and competitiveness. Addressing the root causes of attrition is crucial to enhancing employee retention, which is vital for sustaining organizational stability, productivity, and competitive advantage. In addition to predicting attrition, analyzing and predicting monthly salaries plays an equally important role: it allows organizations to identify patterns in compensation and to ensure fairness and attractiveness in pay structures, which can mitigate potential dissatisfaction and reduce attrition. Leveraging machine learning, a powerful tool in data science, enables organizations to analyze historical data, predict attrition trends, and implement proactive strategies to mitigate its adverse effects, such as skill shortages and diminished performance.
The dataset, titled ‘Fictional dataset on HR Employee attrition and performance’, is obtained from Kaggle at https://www.kaggle.com/datasets/patelprashant/employee-attrition. It is structured as a CSV file containing both categorical and numerical variables. The dataset holds information about employees, focusing on factors that may contribute to attrition, such as demographics, job-related attributes, performance metrics and attrition status.
Education: ‘1-Below College’ ‘2-College’ ‘3-Bachelor’ ‘4-Master’ ‘5-Doctor’
Environment Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’
Job Involvement: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’
Job Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’
Performance Rating: ‘1-Low’ ‘2-Good’ ‘3-Excellent’ ‘4-Outstanding’
Relationship Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’
Work Life Balance: ‘1-Bad’ ‘2-Good’ ‘3-Better’ ‘4-Best’
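These ordinal codes are stored as plain integers in the CSV. For more readable plots and tables, they can optionally be relabelled as ordered factors once the data has been read in below; a minimal sketch (the label vector follows the codebook above, and the EducationLabel column is our own illustrative addition):
# Optional: relabel an ordinal code as an ordered factor (run after df is loaded)
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")
df$EducationLabel <- factor(df$Education, levels = 1:5,
labels = education_labels, ordered = TRUE)
table(df$EducationLabel)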
# Import necessary libraries
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(readr)
library(repr)
library(reshape2)
library(ROSE)
## Loaded ROSE 0.0-4
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(stats)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
csv_url <- "https://drive.google.com/uc?id=1TLe_eD3c-KURISY6KU-ltJNxukC8X2a_&export=download"
df <- read_csv(csv_url)
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 35
## Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 41 Yes Travel_Rarely 1102 Sales 1 2
## 2 49 No Travel_Freque… 279 Research … 8 1
## 3 37 Yes Travel_Rarely 1373 Research … 2 2
## 4 33 No Travel_Freque… 1392 Research … 3 4
## 5 27 No Travel_Rarely 591 Research … 2 1
## 6 32 No Travel_Freque… 1005 Research … 2 2
## # ℹ 28 more variables: EducationField <chr>, EmployeeCount <dbl>,
## # EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>, Gender <chr>,
## # HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>, JobRole <chr>,
## # JobSatisfaction <dbl>, MaritalStatus <chr>, MonthlyIncome <dbl>,
## # MonthlyRate <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>, OverTime <chr>,
## # PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## # RelationshipSatisfaction <dbl>, StandardHours <dbl>, …
# Check if there are any missing values
anyNA(df)
## [1] TRUE
# Perform imputation with mean
# Find columns with missing values
missing_cols <- colnames(df)[colSums(is.na(df)) > 0]
missing_cols
## [1] "MonthlyRate"
# Impute missing values with the mean for each column
for (col in missing_cols) {
if (is.numeric(df[[col]])) {
df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
}
}
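A quick sanity check (our addition) confirms that the imputation left no missing values:
# Verify that no missing values remain after imputation
anyNA(df) # should now return FALSE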
# Remove columns with only one unique value (which carry no information) and the EmployeeNumber identifier
univalue_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]
cols_to_remove <- c("EmployeeNumber", univalue_cols)
# Print out columns to be removed
print(cols_to_remove)
## [1] "EmployeeNumber" "EmployeeCount" "Over18" "StandardHours"
df <- df[, !(names(df) %in% cols_to_remove)]
head(df)
## # A tibble: 6 × 31
## Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 41 Yes Travel_Rarely 1102 Sales 1 2
## 2 49 No Travel_Freque… 279 Research … 8 1
## 3 37 Yes Travel_Rarely 1373 Research … 2 2
## 4 33 No Travel_Freque… 1392 Research … 3 4
## 5 27 No Travel_Rarely 591 Research … 2 1
## 6 32 No Travel_Freque… 1005 Research … 2 2
## # ℹ 24 more variables: EducationField <chr>, EnvironmentSatisfaction <dbl>,
## # Gender <chr>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## # RelationshipSatisfaction <dbl>, StockOptionLevel <dbl>,
## # TotalWorkingYears <dbl>, TrainingTimesLastYear <dbl>, …
str(df)
## tibble [1,470 × 31] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr [1:1470] "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : num [1:1470] 1102 279 1373 1392 591 ...
## $ Department : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr [1:1470] "Female" "Male" "Male" "Female" ...
## $ HourlyRate : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr [1:1470] "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num [1:1470] 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num [1:1470] 19479 24907 2396 23159 14281 ...
## $ NumCompaniesWorked : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...
colnames(df)
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EnvironmentSatisfaction" "Gender"
## [11] "HourlyRate" "JobInvolvement"
## [13] "JobLevel" "JobRole"
## [15] "JobSatisfaction" "MaritalStatus"
## [17] "MonthlyIncome" "MonthlyRate"
## [19] "NumCompaniesWorked" "OverTime"
## [21] "PercentSalaryHike" "PerformanceRating"
## [23] "RelationshipSatisfaction" "StockOptionLevel"
## [25] "TotalWorkingYears" "TrainingTimesLastYear"
## [27] "WorkLifeBalance" "YearsAtCompany"
## [29] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [31] "YearsWithCurrManager"
# List out the numerical columns and categorical columns for preprocessing later
numerical_cols = c("Age", "DailyRate", "DistanceFromHome", "HourlyRate", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany",
"YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
categorical_cols = c("Attrition", "BusinessTravel", "Department", "Education", "EducationField", "EnvironmentSatisfaction", "Gender", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus",
"OverTime", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel", "WorkLifeBalance")
# Convert categorical columns to factors
for (col in categorical_cols) {
df[[col]] <- factor(df[[col]])
}
# Verify structure
str(df)
## tibble [1,470 × 31] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : num [1:1470] 1102 279 1373 1392 591 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : num [1:1470] 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num [1:1470] 19479 24907 2396 23159 14281 ...
## $ NumCompaniesWorked : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
## $ TotalWorkingYears : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...
We generated a summary of the numerical columns to gain an understanding of the data, then used the ggplot2 library to visualise the distribution of the numerical variables. From the plots we can identify whether each variable follows a normal distribution or is skewed to the left or right. For example, variables like MonthlyIncome or YearsAtCompany may show skewness, indicating a small group of employees with significantly higher values than the rest. A numeric skewness check follows the histogram code below.
summary(df[numerical_cols])
## Age DailyRate DistanceFromHome HourlyRate
## Min. :18.00 Min. : 102.0 Min. : 1.000 Min. : 30.00
## 1st Qu.:30.00 1st Qu.: 465.0 1st Qu.: 2.000 1st Qu.: 48.00
## Median :36.00 Median : 802.0 Median : 7.000 Median : 66.00
## Mean :36.92 Mean : 802.5 Mean : 9.193 Mean : 65.89
## 3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.: 83.75
## Max. :60.00 Max. :1499.0 Max. :29.000 Max. :100.00
## MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike
## Min. : 1009 Min. : 2094 Min. :0.000 Min. :11.00
## 1st Qu.: 2911 1st Qu.: 8807 1st Qu.:1.000 1st Qu.:12.00
## Median : 4919 Median :14281 Median :2.000 Median :14.00
## Mean : 6503 Mean :14281 Mean :2.693 Mean :15.21
## 3rd Qu.: 8379 3rd Qu.:19801 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :19999 Max. :26999 Max. :9.000 Max. :25.00
## TotalWorkingYears TrainingTimesLastYear YearsAtCompany YearsInCurrentRole
## Min. : 0.00 Min. :0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :10.00 Median :3.000 Median : 5.000 Median : 3.000
## Mean :11.28 Mean :2.799 Mean : 7.008 Mean : 4.229
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :40.00 Max. :6.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
# Numeric data
df_numeric <- df[sapply(df, is.numeric)]
# Plot histograms for numeric columns
numeric_long <- pivot_longer(as.data.frame(df_numeric), cols = everything(), names_to = "variable", values_to = "value")
ggplot(numeric_long, aes(x = value, fill = variable)) +
geom_histogram(bins = 20, color = "white", show.legend = FALSE) +
facet_wrap(~variable, scales = "free") +
labs(title = "Histograms for Numeric Features", x = "Value", y = "Frequency") +
theme_minimal()
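To complement the visual inspection, skewness can also be quantified numerically. A minimal sketch using the standard moment-based formula in base R (our addition; positive values indicate right skew):
# Moment-based skewness for each numeric column (positive = right-skewed)
skew <- sapply(df_numeric, function(x) {
x <- x[!is.na(x)]
mean((x - mean(x))^3) / sd(x)^3
})
sort(round(skew, 2), decreasing = TRUE)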
Bar plots provide a clear view of the frequency of each category within the categorical variables, helping to identify patterns and imbalances.
summary(df[categorical_cols])
## Attrition BusinessTravel Department Education
## No :1233 Non-Travel : 150 Human Resources : 63 1:170
## Yes: 237 Travel_Frequently: 277 Research & Development:961 2:282
## Travel_Rarely :1043 Sales :446 3:572
## 4:398
## 5: 48
##
##
## EducationField EnvironmentSatisfaction Gender JobInvolvement
## Human Resources : 27 1:284 Female:588 1: 83
## Life Sciences :606 2:287 Male :882 2:375
## Marketing :159 3:453 3:868
## Medical :464 4:446 4:144
## Other : 82
## Technical Degree:132
##
## JobLevel JobRole JobSatisfaction MaritalStatus
## 1:543 Sales Executive :326 1:289 Divorced:327
## 2:534 Research Scientist :292 2:280 Married :673
## 3:218 Laboratory Technician :259 3:442 Single :470
## 4:106 Manufacturing Director :145 4:459
## 5: 69 Healthcare Representative:131
## Manager :102
## (Other) :215
## OverTime PerformanceRating RelationshipSatisfaction StockOptionLevel
## No :1054 3:1244 1:276 0:631
## Yes: 416 4: 226 2:303 1:596
## 3:459 2:158
## 4:432 3: 85
##
##
##
## WorkLifeBalance
## 1: 80
## 2:344
## 3:893
## 4:153
##
##
##
# Factor (categorical) data
df_factor <- df[sapply(df, is.factor)]
# Plot bar plots for categorical columns
factor_long <- pivot_longer(as.data.frame(df_factor), cols = everything(), names_to = "variable", values_to = "value")
ggplot(factor_long, aes(x = value, fill = variable)) +
geom_bar(color = "white", show.legend = FALSE) +
facet_wrap(~variable, scales = "free") +
labs(title = "Bar Plots for Categorical Features", x = "Categories", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We also generated a correlation heatmap to show the relationships between variables. Since the target of the regression problem is monthly income and the target of the classification problem is attrition, we focus on these two variables.
After analysing the correlations between the features and monthly income, we note that job level, total working years and years at company show a relatively high correlation with monthly income, each above +0.5. These variables deserve particular attention, as they have a strong relationship with the target variable.
Meanwhile, the correlations between the features and attrition show that the overtime column has a positive correlation above 0.2 with attrition, suggesting that overtime may be an important factor in employees’ decisions to leave.
# Convert factor columns to numeric
df_numeric <- df
df_numeric[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.numeric)
# Compute correlation matrix
cor_matrix <- cor(df_numeric)
# Reorder rows and columns so that MonthlyIncome and Attrition are the last two
order <- c(setdiff(rownames(cor_matrix), c("MonthlyIncome", "Attrition")), "MonthlyIncome", "Attrition")
cor_matrix <- cor_matrix[order, order]
# Melt the correlation matrix for ggplot
melted_cor <- melt(cor_matrix, varnames = c("Row", "Column"))
# Reverse the order of the rows for the y-axis
melted_cor$Row <- factor(melted_cor$Row, levels = rev(unique(melted_cor$Row)))
# Plot the heatmap with coefficients
ggplot(data = melted_cor, aes(x = Column, y = Row, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
geom_text(aes(label = round(value, 2)), color = "black", size = 3) + # Add coefficients
labs(title = "Correlation Heatmap", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # to make x axis values visible
# Plot correlation matrix of different features with monthly income
cor_monthly_income <- cor_matrix["MonthlyIncome", ]
cor_df <- data.frame(Feature = names(cor_monthly_income), Correlation = cor_monthly_income)
cor_df <- cor_df[cor_df$Feature != "MonthlyIncome", ]
ggplot(cor_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_col() +
coord_flip() +
labs(title = "Correlation of Features with Monthly Income",
x = "Feature",
y = "Correlation")
# Plot correlation matrix of different features with attrition
cor_attrition <- cor_matrix["Attrition", ]
cor_df <- data.frame(Feature = names(cor_attrition), Correlation = cor_attrition)
cor_df <- cor_df[cor_df$Feature != "Attrition", ]
ggplot(cor_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_col() +
coord_flip() +
labs(title = "Correlation of Features with Attrition",
x = "Feature",
y = "Correlation")
More than 50% of employees earn a monthly income between 2,000 and 6,000, indicating that the company’s workforce is skewed towards the lower- to mid-level income groups.
# Histogram for numeric variables
ggplot(df, aes(x = MonthlyIncome)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
scale_x_continuous(n.breaks = 10) +
labs(title = "Distribution of Monthly Income", x = "Monthly Income", y = "Frequency")
This boxplot shows that employees who leave tend to have lower monthly incomes.
# Boxplot for numeric variables
ggplot(df, aes(y = MonthlyIncome, x = Attrition)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "Monthly Income by Attrition", x = "Attrition", y = "Monthly Income")
This density plot shows that younger employees tend to leave, possibly in search of better opportunities.
# Density plot for numeric variables
ggplot(df, aes(x = Age, fill = Attrition)) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of Age by Attrition", x = "Age", y = "Density")
The Sales Department may require attention regarding employee retention strategies, as it shows a higher proportion of attrition relative to its size. R&D also experiences attrition, but not at alarming rates compared to its total workforce.
# Bar plot for categorical variables
ggplot(df, aes(x = Department, fill = Attrition)) +
geom_bar(position = "dodge")+
labs(title = "Employee Attrition by Department", x = "Department", y = "Count")
Employees with lower salaries and shorter tenures are more likely to leave, suggesting the need for early-career engagement strategies.
# Scatter plot for two numeric variables
ggplot(df, aes(x = MonthlyIncome, y = YearsAtCompany, color = Attrition)) +
geom_point(alpha = 0.7) +
labs(title = "Monthly Income vs Years at Company", x = "Monthly Income", y = "Years at Company")
The majority of employees who travel rarely do not leave, as travelling infrequently might positively influence their job satisfaction and work-life balance.
# Bar plot to visualize Attrition by Business Travel
ggplot(df, aes(x = BusinessTravel, fill = Attrition)) +
geom_bar(position = "dodge") +
labs(title = "Attrition by Business Travel",
x = "Business Travel",
y = "Count")
The Attrition column is imbalanced, as the count of ‘No’ is far larger than that of ‘Yes’.
# Plot barchart for count of attrition
ggplot(df, aes(x = factor(Attrition), fill = factor(Attrition))) +
geom_bar() +
labs(title = "Count of Attrition", x = "Attrition", y = "Count")
table(df$Attrition)
##
## No Yes
## 1233 237
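Expressed as proportions (our addition), roughly 84% of employees stayed versus 16% who left, confirming the imbalance that the resampling option in the training function later addresses:
# Class proportions for the Attrition target
round(prop.table(table(df$Attrition)), 3)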
This table presents the results of chi-square tests evaluating the association between each categorical variable and attrition. Features like ‘BusinessTravel’, ‘OverTime’ and ‘JobSatisfaction’ are marked as Dependent, indicating significant associations based on their low p-values (less than 0.05) and chi-square statistics exceeding the critical value. Meanwhile, features like ‘Gender’, ‘Education’ and ‘PerformanceRating’ are marked as Independent, suggesting no significant relationship with attrition. These results highlight the most influential factors for further analysis and modeling.
cols_to_drop <- c(numerical_cols, "Attrition")
a <- df %>% select(-all_of(cols_to_drop)) # Factor columns
b <- df$Attrition # Target column
# Perform Chi-Square test for each factor column
results <- lapply(a, function(column) {
contingency_table <- table(column, b) # Create a contingency table
chi_result <- chisq.test(contingency_table) # Perform Chi-Square test
# Extract results
list(
statistic = chi_result$statistic,
p_value = chi_result$p.value,
df = chi_result$parameter,
critical_value = qchisq(0.95, df = chi_result$parameter),
hypothesis = ifelse(chi_result$p.value < 0.05, "Dependent", "Independent")
)
})
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
# Convert results to a DataFrame
results_df <- data.frame(
Feature = names(a),
Chi_Square_Statistic = sapply(results, `[[`, "statistic"),
P_Value = sapply(results, `[[`, "p_value"),
Degrees_of_Freedom = sapply(results, `[[`, "df"),
Critical_Value_5_Percent = sapply(results, `[[`, "critical_value"),
Hypothesis = sapply(results, `[[`, "hypothesis")
)
# Print the results
results_df
## Feature
## BusinessTravel.X-squared BusinessTravel
## Department.X-squared Department
## Education.X-squared Education
## EducationField.X-squared EducationField
## EnvironmentSatisfaction.X-squared EnvironmentSatisfaction
## Gender.X-squared Gender
## JobInvolvement.X-squared JobInvolvement
## JobLevel.X-squared JobLevel
## JobRole.X-squared JobRole
## JobSatisfaction.X-squared JobSatisfaction
## MaritalStatus.X-squared MaritalStatus
## OverTime.X-squared OverTime
## PerformanceRating.X-squared PerformanceRating
## RelationshipSatisfaction.X-squared RelationshipSatisfaction
## StockOptionLevel.X-squared StockOptionLevel
## WorkLifeBalance.X-squared WorkLifeBalance
## Chi_Square_Statistic P_Value
## BusinessTravel.X-squared 2.418241e+01 5.608614e-06
## Department.X-squared 1.079601e+01 4.525607e-03
## Education.X-squared 3.073961e+00 5.455253e-01
## EducationField.X-squared 1.602467e+01 6.773980e-03
## EnvironmentSatisfaction.X-squared 2.250388e+01 5.123469e-05
## Gender.X-squared 1.116967e+00 2.905724e-01
## JobInvolvement.X-squared 2.849202e+01 2.863181e-06
## JobLevel.X-squared 7.252901e+01 6.634685e-15
## JobRole.X-squared 8.619025e+01 2.752482e-15
## JobSatisfaction.X-squared 1.750508e+01 5.563005e-04
## MaritalStatus.X-squared 4.616368e+01 9.455511e-11
## OverTime.X-squared 8.756429e+01 8.158424e-21
## PerformanceRating.X-squared 1.547544e-04 9.900745e-01
## RelationshipSatisfaction.X-squared 5.241068e+00 1.549724e-01
## StockOptionLevel.X-squared 6.059830e+01 4.379390e-13
## WorkLifeBalance.X-squared 1.632510e+01 9.725699e-04
## Degrees_of_Freedom Critical_Value_5_Percent
## BusinessTravel.X-squared 2 5.991465
## Department.X-squared 2 5.991465
## Education.X-squared 4 9.487729
## EducationField.X-squared 5 11.070498
## EnvironmentSatisfaction.X-squared 3 7.814728
## Gender.X-squared 1 3.841459
## JobInvolvement.X-squared 3 7.814728
## JobLevel.X-squared 4 9.487729
## JobRole.X-squared 8 15.507313
## JobSatisfaction.X-squared 3 7.814728
## MaritalStatus.X-squared 2 5.991465
## OverTime.X-squared 1 3.841459
## PerformanceRating.X-squared 1 3.841459
## RelationshipSatisfaction.X-squared 3 7.814728
## StockOptionLevel.X-squared 3 7.814728
## WorkLifeBalance.X-squared 3 7.814728
## Hypothesis
## BusinessTravel.X-squared Dependent
## Department.X-squared Dependent
## Education.X-squared Independent
## EducationField.X-squared Dependent
## EnvironmentSatisfaction.X-squared Dependent
## Gender.X-squared Independent
## JobInvolvement.X-squared Dependent
## JobLevel.X-squared Dependent
## JobRole.X-squared Dependent
## JobSatisfaction.X-squared Dependent
## MaritalStatus.X-squared Dependent
## OverTime.X-squared Dependent
## PerformanceRating.X-squared Independent
## RelationshipSatisfaction.X-squared Independent
## StockOptionLevel.X-squared Dependent
## WorkLifeBalance.X-squared Dependent
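If desired, the features flagged as Dependent can be pulled directly from the results table for downstream modelling; a short sketch (our addition):
# Extract the categorical features significantly associated with attrition
significant_features <- results_df$Feature[results_df$Hypothesis == "Dependent"]
significant_features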
First, we train regression models with k-fold cross-validation, evaluate their performance, and choose the best model using the R-squared metric. Then, we train classification models with cross-validation, with the option of handling class imbalance through an oversampling technique. After the best model is chosen, a classification report is generated to evaluate its performance.
options(repr.plot.width = 6, repr.plot.height = 6)
model_training_regression <- function(X, y, model, k = 5) {
# Combine predictors (X) and target (y) into a single data frame
data <- cbind(X, target = y)
# K-Fold Cross Validation
set.seed(42)
folds <- createFolds(y, k = k, list = TRUE, returnTrain = TRUE)
# Initialize variables to store results
fold_metrics <- data.frame(Fold = integer(), MAE = numeric(), RMSE = numeric(), R_Squared = numeric())
models <- list()
# Loop through folds
for (fold in seq_along(folds)) {
train_idx <- folds[[fold]]
test_idx <- setdiff(seq_len(nrow(data)), train_idx)
X_train <- X[train_idx, ]
X_test <- X[test_idx, ]
y_train <- y[train_idx]
y_test <- y[test_idx]
# Train the model
model_fit <- train(X_train, y_train, method = model)
models[[fold]] <- model_fit
# Predict and evaluate
y_pred <- predict(model_fit, X_test)
# Compute Metrics
mae <- mean(abs(y_test - y_pred))
rmse <- sqrt(mean((y_test - y_pred)^2))
ss_total <- sum((y_test - mean(y_test))^2)
ss_residual <- sum((y_test - y_pred)^2)
r_squared <- 1 - (ss_residual / ss_total)
# Save metrics
fold_metrics <- rbind(fold_metrics, data.frame(Fold = fold, MAE = mae, RMSE = rmse, R_Squared = r_squared))
cat(sprintf("Fold %d: MSE = %.2f, RMSE = %.2f, R-Squared = %.2f\n", fold, mae, rmse, r_squared))
}
# Calculate and print average metrics
avg_metrics <- colMeans(fold_metrics[, -1]) # Exclude 'Fold' column
cat(sprintf("\nAverage Metrics Across Folds: MAE = %.2f, RMSE = %.2f, R-Squared = %.2f\n",
avg_metrics["MAE"], avg_metrics["RMSE"], avg_metrics["R_Squared"]))
# Select the best model based on R-Squared
best_fold <- which.max(fold_metrics$R_Squared)
best_model <- models[[best_fold]]
cat(sprintf("\nBest Model is from Fold %d with R-Squared = %.2f\n", best_fold, fold_metrics$R_Squared[best_fold]))
# Plot Predictions vs Actual for the Best Model
best_test_idx <- setdiff(seq_len(nrow(data)), folds[[best_fold]])
X_best_test <- X[best_test_idx, ]
y_best_test <- y[best_test_idx]
y_best_pred <- predict(best_model, X_best_test)
print(
ggplot(data = data.frame(Actual = y_best_test, Predicted = y_best_pred), aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Predicted vs Actual (Best Model)", x = "Actual", y = "Predicted") +
theme_minimal()
)
# Return best model and fold metrics
list(BestModel = best_model, FoldMetrics = fold_metrics)
}
model_training <- function(X, y, model, sampler = NULL) {
# Combine predictors (X) and target (y) into a single data frame
data <- cbind(X, target = y)
# Stratified K-Fold Cross Validation
set.seed(42)
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = TRUE)
# Initialize variables to store results
fold_accuracies <- numeric()
models <- list()
confusion_matrices <- list()
predictions <- list()
true_labels <- list()
# Loop through folds
for (fold in seq_along(folds)) {
train_idx <- folds[[fold]]
test_idx <- setdiff(seq_len(nrow(data)), train_idx)
X_train <- X[train_idx, ]
X_test <- X[test_idx, ]
y_train <- y[train_idx]
y_test <- y[test_idx]
# Apply resampling if specified
if (!is.null(sampler)) {
if (sampler == "SMOTE") {
cat("Applying SMOTE...\n")
smote_data <- ROSE(target ~ ., data = data.frame(X_train, target = y_train),
seed = 42)$data
X_train <- smote_data[, -ncol(smote_data)] # Drop target column
y_train <- smote_data$target
} else if (sampler == "RandomOversampling") {
cat("Applying Random Oversampling...\n")
rose_data <- ROSE(target ~ ., data = data.frame(X_train, target = y_train), seed = 42)$data
X_train <- rose_data[, -ncol(rose_data)] # Drop target column
y_train <- rose_data$target
}
}
# Train the model
model_fit <- train(X_train, y_train, method = model)
models[[fold]] <- model_fit
# Predict and evaluate
y_pred <- predict(model_fit, X_test)
accuracy <- Accuracy(y_pred, y_test)
fold_accuracies <- c(fold_accuracies, accuracy)
# Save predictions and true labels
predictions[[fold]] <- y_pred
true_labels[[fold]] <- y_test
# Save confusion matrix (caret::confusionMatrix expects the predictions first,
# so passing y_test first, as here, transposes the matrix and swaps
# sensitivity with positive predictive value in the printed statistics)
confusion_matrices[[fold]] <- confusionMatrix(y_test, y_pred)
cat(sprintf("Fold %d: Accuracy = %.2f\n", fold, accuracy))
}
# Average accuracy across folds
avg_accuracy <- mean(fold_accuracies)
cat(sprintf("Average Accuracy: %.2f\n", avg_accuracy))
# Identify the best model
best_fold <- which.max(fold_accuracies)
best_model <- models[[best_fold]]
cat(sprintf("\nBest Model is from Fold %d with Accuracy = %.2f\n", best_fold, fold_accuracies[best_fold]))
# Display the confusion matrix and classification report for the best model
print(confusion_matrices[[best_fold]])
list(
BestModel = best_model,
BestPredictions = predictions[[best_fold]],
BestTrueLabels = true_labels[[best_fold]]
)
}
generate_classification_report <- function(y_true, y_pred) {
# Ensure inputs are factors
y_true <- as.factor(y_true)
y_pred <- as.factor(y_pred)
# Get the levels of the true labels
classes <- levels(y_true)
# Initialize the report data frame
report <- data.frame(
Class = character(),
Precision = numeric(),
Recall = numeric(),
F1_Score = numeric(),
Support = integer()
)
# Calculate metrics for each class
# (MLmetrics documents Precision/Recall/F1_Score as f(y_true, y_pred, ...);
# with the argument order used below, precision and recall are effectively
# swapped, while the F1 score is unaffected)
for (class in classes) {
precision <- Precision(y_pred, y_true, positive = class)
recall <- Recall(y_pred, y_true, positive = class)
f1_score <- F1_Score(y_pred, y_true, positive = class)
support <- sum(y_true == class)
report <- rbind(report, data.frame(
Class = class,
Precision = precision,
Recall = recall,
F1_Score = f1_score,
Support = support
))
}
# Add averages (macro-average)
macro_precision <- mean(report$Precision, na.rm = TRUE)
macro_recall <- mean(report$Recall, na.rm = TRUE)
macro_f1 <- mean(report$F1_Score, na.rm = TRUE)
report <- rbind(report, data.frame(
Class = "Macro Average",
Precision = macro_precision,
Recall = macro_recall,
F1_Score = macro_f1,
Support = NA
))
# Add weighted averages (weighted by support)
total_support <- sum(report$Support, na.rm = TRUE)
weighted_precision <- sum(report$Precision * report$Support, na.rm = TRUE) / total_support
weighted_recall <- sum(report$Recall * report$Support, na.rm = TRUE) / total_support
weighted_f1 <- sum(report$F1_Score * report$Support, na.rm = TRUE) / total_support
report <- rbind(report, data.frame(
Class = "Weighted Average",
Precision = weighted_precision,
Recall = weighted_recall,
F1_Score = weighted_f1,
Support = total_support
))
return(report)
}
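As a quick self-contained check, the report helper can be exercised on a handful of hand-made labels before being applied to real folds (illustrative only):
# Toy example: classification report on hand-made labels
toy_true <- factor(c("No", "No", "No", "Yes", "Yes"))
toy_pred <- factor(c("No", "Yes", "No", "Yes", "No"))
generate_classification_report(toy_true, toy_pred)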
In this part, we used four different models to predict monthly income from the other variables: linear regression, a generalized linear model (caret’s “glm”), a support vector machine (SVM) and a random forest. The models are evaluated with the R-squared metric to select the best one. Linear regression and the GLM give identical performance with an R-squared of 0.95, which is expected because, with a continuous target, “glm” fits a Gaussian GLM that is mathematically equivalent to linear regression; the random forest follows with 0.94 and the SVM with 0.93. The linear regression model shows that higher job levels are directly related to higher monthly income.
df_final <- as.data.frame(df)
X <- df_final %>% select(-MonthlyIncome)
num_cols <- sapply(X, is.numeric)
# Min-Max scale numerical columns
X[, num_cols] <- lapply(X[, num_cols], rescale, to = c(0, 1))
y <- df_final$MonthlyIncome
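A one-line check (our addition) confirms that the min-max scaling mapped every numeric predictor into [0, 1]:
# All scaled numeric columns should now span the [0, 1] interval
range(sapply(X[, num_cols], range))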
Linear Regression
# Train the model
results <- model_training_regression(X, y, model = "lm")
## Fold 1: MAE = 828.91, RMSE = 1082.25, R-Squared = 0.95
## Fold 2: MAE = 870.33, RMSE = 1116.75, R-Squared = 0.95
## Fold 3: MAE = 845.95, RMSE = 1106.10, R-Squared = 0.94
## Fold 4: MAE = 768.92, RMSE = 1045.57, R-Squared = 0.95
## Fold 5: MAE = 800.64, RMSE = 1056.02, R-Squared = 0.95
##
## Average Metrics Across Folds: MAE = 822.95, RMSE = 1081.34, R-Squared = 0.95
##
## Best Model is from Fold 4 with R-Squared = 0.95
# View the fold metrics
print(results$FoldMetrics)
## Fold MAE RMSE R_Squared
## 1 1 828.9101 1082.247 0.9485850
## 2 2 870.3345 1116.750 0.9491387
## 3 3 845.9539 1106.101 0.9386794
## 4 4 768.9196 1045.574 0.9500628
## 5 5 800.6386 1056.016 0.9482275
# Access the best model
best_model <- results$BestModel
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)
# Add column names for clarity
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")
# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
## Estimate Std.Error t.value
## JobLevel5 11128.553100 305.96735 36.37170178
## JobLevel4 8449.651934 262.28936 32.21500040
## JobLevel3 4877.776167 173.62223 28.09419127
## JobRoleResearch Director 3396.313228 202.71578 16.75406457
## JobRoleManager 3405.927580 247.97128 13.73516961
## JobLevel2 1599.175704 126.17838 12.67392829
## JobRoleLaboratory Technician -1266.673428 160.98849 -7.86809950
## JobRoleResearch Scientist -1216.878099 161.38288 -7.54031729
## (Intercept) 3692.168968 585.47547 6.30627440
## JobRoleSales Representative -1221.448955 333.73051 -3.65998594
## TotalWorkingYears 1300.374016 367.11759 3.54211853
## JobInvolvement3 -363.622583 138.73212 -2.62104099
## JobInvolvement4 -364.940933 168.28534 -2.16858424
## NumCompaniesWorked 267.058359 129.66398 2.05961869
## JobInvolvement2 -295.845662 145.88960 -2.02787360
## Education5 -331.801800 211.86178 -1.56612390
## StockOptionLevel1 169.394274 111.44450 1.51998778
## Education3 -151.800846 107.93985 -1.40634663
## JobSatisfaction2 -141.038264 100.68643 -1.40076739
## Education2 -150.754922 120.54784 -1.25058169
## DepartmentResearch & Development 551.353643 453.50926 1.21574948
## AttritionYes -120.712163 101.21726 -1.19260455
## YearsInCurrentRole 310.864297 264.72191 1.17430512
## JobSatisfaction3 -100.382876 91.38259 -1.09849015
## GenderMale 63.139388 65.55567 0.96314149
## OverTimeYes 68.411639 72.44940 0.94426791
## PerformanceRating4 -128.060348 135.99746 -0.94163780
## JobRoleSales Executive 274.291916 293.62630 0.93415310
## Age -189.363032 207.14438 -0.91415965
## YearsSinceLastPromotion 172.839726 192.77058 0.89660842
## PercentSalaryHike 168.532776 191.21627 0.88137255
## BusinessTravelTravel_Rarely 91.409246 105.30608 0.86803387
## EducationFieldMedical -262.353335 308.23376 -0.85115055
## RelationshipSatisfaction2 85.137088 100.54232 0.84677866
## WorkLifeBalance4 139.557308 167.60848 0.83263871
## YearsWithCurrManager -211.210099 266.40319 -0.79282121
## BusinessTravelTravel_Frequently 92.981542 123.17768 0.75485709
## EducationFieldLife Sciences -223.608795 308.54154 -0.72472833
## EducationFieldMarketing -232.839766 327.56029 -0.71083026
## TrainingTimesLastYear -101.781937 148.29514 -0.68634708
## EducationFieldOther -224.944912 330.68697 -0.68023519
## EducationFieldTechnical Degree -202.511764 319.61746 -0.63360671
## JobRoleHuman Resources -278.771392 474.99356 -0.58689510
## WorkLifeBalance3 83.592331 143.80397 0.58129362
## JobSatisfaction4 -52.738468 91.50354 -0.57635444
## MaritalStatusSingle 76.193547 137.79456 0.55295033
## EnvironmentSatisfaction2 -52.280042 101.98499 -0.51262488
## DepartmentSales 232.652083 461.82531 0.50376642
## WorkLifeBalance2 74.649952 152.98642 0.48795150
## JobRoleManufacturing Director -70.873720 149.73308 -0.47333376
## MaritalStatusMarried 37.796013 84.42156 0.44770568
## DistanceFromHome -46.711766 108.73972 -0.42957411
## EnvironmentSatisfaction4 -38.608441 92.80066 -0.41603625
## DailyRate 45.235325 110.07200 0.41096122
## HourlyRate 44.388598 109.39371 0.40576921
## RelationshipSatisfaction3 15.371131 92.67310 0.16586400
## MonthlyRate -16.768151 116.25081 -0.14424115
## EnvironmentSatisfaction3 -9.737652 92.23749 -0.10557152
## RelationshipSatisfaction4 7.705286 94.47260 0.08156107
## YearsAtCompany 35.760154 475.69156 0.07517509
## StockOptionLevel3 -11.093335 168.41468 -0.06586917
## StockOptionLevel2 8.134095 141.69522 0.05740557
## Education4 -5.505794 115.27068 -0.04776405
## P.Value
## JobLevel5 1.619338e-191
## JobLevel4 2.081430e-161
## JobLevel3 1.064009e-131
## JobRoleResearch Director 2.342074e-56
## JobRoleManager 8.981680e-40
## JobLevel2 1.753125e-34
## JobRoleLaboratory Technician 8.484063e-15
## JobRoleResearch Scientist 9.699518e-14
## (Intercept) 4.114887e-10
## JobRoleSales Representative 2.640573e-04
## TotalWorkingYears 4.133597e-04
## JobInvolvement3 8.885654e-03
## JobInvolvement4 3.032553e-02
## NumCompaniesWorked 3.966716e-02
## JobInvolvement2 4.281099e-02
## Education5 1.176039e-01
## StockOptionLevel1 1.287981e-01
## Education3 1.599003e-01
## JobSatisfaction2 1.615624e-01
## Education2 2.113500e-01
## DepartmentResearch & Development 2.243382e-01
## AttritionYes 2.332785e-01
## YearsInCurrentRole 2.405240e-01
## JobSatisfaction3 2.722281e-01
## GenderMale 3.356857e-01
## OverTimeYes 3.452377e-01
## PerformanceRating4 3.465824e-01
## JobRoleSales Executive 3.504275e-01
## Age 3.608310e-01
## YearsSinceLastPromotion 3.701218e-01
## PercentSalaryHike 3.783065e-01
## BusinessTravelTravel_Rarely 3.855629e-01
## EducationFieldMedical 3.948688e-01
## RelationshipSatisfaction2 3.973005e-01
## WorkLifeBalance4 4.052271e-01
## YearsWithCurrManager 4.280510e-01
## BusinessTravelTravel_Frequently 4.504944e-01
## EducationFieldLife Sciences 4.687710e-01
## EducationFieldMarketing 4.773384e-01
## TrainingTimesLastYear 4.926372e-01
## EducationFieldOther 4.964970e-01
## EducationFieldTechnical Degree 5.264677e-01
## JobRoleHuman Resources 5.573932e-01
## WorkLifeBalance3 5.611603e-01
## JobSatisfaction4 5.644922e-01
## MaritalStatusSingle 5.804085e-01
## EnvironmentSatisfaction2 6.083155e-01
## DepartmentSales 6.145252e-01
## WorkLifeBalance2 6.256803e-01
## JobRoleManufacturing Director 6.360679e-01
## MaritalStatusMarried 6.544527e-01
## DistanceFromHome 6.675886e-01
## EnvironmentSatisfaction4 6.774637e-01
## DailyRate 6.811801e-01
## HourlyRate 6.849902e-01
## RelationshipSatisfaction3 8.682941e-01
## MonthlyRate 8.853362e-01
## EnvironmentSatisfaction3 9.159413e-01
## RelationshipSatisfaction4 9.350104e-01
## YearsAtCompany 9.400889e-01
## StockOptionLevel3 9.474938e-01
## StockOptionLevel2 9.542324e-01
## Education4 9.619129e-01
# Refit a linear regression on the full dataset to inspect the coefficients
model <- lm(y ~ ., data = X)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = y ~ ., data = X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3222.6 -660.7 -54.9 620.1 4483.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3585.679 493.757 7.262 6.29e-13 ***
## Age -128.846 184.795 -0.697 0.48577
## AttritionYes -113.667 89.442 -1.271 0.20400
## BusinessTravelTravel_Frequently 132.955 109.808 1.211 0.22618
## BusinessTravelTravel_Rarely 126.625 94.394 1.341 0.17999
## DailyRate 95.528 97.747 0.977 0.32859
## DepartmentResearch & Development 331.253 387.016 0.856 0.39219
## DepartmentSales 101.324 401.361 0.252 0.80073
## DistanceFromHome -63.680 97.677 -0.652 0.51454
## Education2 -191.251 105.982 -1.805 0.07136 .
## Education3 -134.821 95.213 -1.416 0.15700
## Education4 -31.364 101.967 -0.308 0.75844
## Education5 -262.173 177.985 -1.473 0.14097
## EducationFieldLife Sciences -147.111 277.291 -0.531 0.59583
## EducationFieldMarketing -79.744 295.410 -0.270 0.78724
## EducationFieldMedical -151.789 278.342 -0.545 0.58561
## EducationFieldOther -166.127 298.232 -0.557 0.57759
## EducationFieldTechnical Degree -86.024 289.431 -0.297 0.76635
## EnvironmentSatisfaction2 -35.576 90.513 -0.393 0.69435
## EnvironmentSatisfaction3 -30.856 82.169 -0.376 0.70733
## EnvironmentSatisfaction4 -45.904 82.610 -0.556 0.57852
## GenderMale 82.946 57.425 1.444 0.14884
## HourlyRate 68.097 96.644 0.705 0.48117
## JobInvolvement2 -227.718 131.024 -1.738 0.08243 .
## JobInvolvement3 -348.415 124.408 -2.801 0.00517 **
## JobInvolvement4 -316.115 149.621 -2.113 0.03480 *
## JobLevel2 1628.154 112.570 14.464 < 2e-16 ***
## JobLevel3 4876.694 153.935 31.680 < 2e-16 ***
## JobLevel4 8462.717 232.879 36.340 < 2e-16 ***
## JobLevel5 11031.862 272.062 40.549 < 2e-16 ***
## JobRoleHuman Resources -520.019 405.941 -1.281 0.20040
## JobRoleLaboratory Technician -1236.021 143.365 -8.622 < 2e-16 ***
## JobRoleManager 3474.382 212.172 16.375 < 2e-16 ***
## JobRoleManufacturing Director -68.293 129.234 -0.528 0.59728
## JobRoleResearch Director 3406.925 182.622 18.656 < 2e-16 ***
## JobRoleResearch Scientist -1168.127 143.144 -8.161 7.33e-16 ***
## JobRoleSales Executive 224.696 255.954 0.878 0.38016
## JobRoleSales Representative -1247.632 292.098 -4.271 2.07e-05 ***
## JobSatisfaction2 -78.912 90.060 -0.876 0.38106
## JobSatisfaction3 -94.659 81.527 -1.161 0.24581
## JobSatisfaction4 -49.215 81.189 -0.606 0.54450
## MaritalStatusMarried 64.713 76.061 0.851 0.39502
## MaritalStatusSingle 121.552 123.560 0.984 0.32541
## MonthlyRate -28.088 103.415 -0.272 0.78596
## NumCompaniesWorked 263.853 114.216 2.310 0.02103 *
## OverTimeYes 88.453 65.297 1.355 0.17575
## PercentSalaryHike 204.099 170.029 1.200 0.23020
## PerformanceRating4 -162.435 122.408 -1.327 0.18473
## RelationshipSatisfaction2 68.620 90.085 0.762 0.44635
## RelationshipSatisfaction3 52.051 82.464 0.631 0.52801
## RelationshipSatisfaction4 20.505 83.162 0.247 0.80528
## StockOptionLevel1 163.770 99.127 1.652 0.09873 .
## StockOptionLevel2 1.533 126.409 0.012 0.99033
## StockOptionLevel3 -69.960 149.662 -0.467 0.64025
## TotalWorkingYears 1422.871 322.413 4.413 1.10e-05 ***
## TrainingTimesLastYear -43.775 131.373 -0.333 0.73903
## WorkLifeBalance2 79.179 134.219 0.590 0.55534
## WorkLifeBalance3 54.587 126.428 0.432 0.66598
## WorkLifeBalance4 46.607 148.693 0.313 0.75399
## YearsAtCompany -115.719 397.733 -0.291 0.77114
## YearsInCurrentRole 308.803 232.981 1.325 0.18524
## YearsSinceLastPromotion 143.594 170.292 0.843 0.39925
## YearsWithCurrManager -150.110 224.556 -0.668 0.50394
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1056 on 1407 degrees of freedom
## Multiple R-squared: 0.9518, Adjusted R-squared: 0.9497
## F-statistic: 448.1 on 62 and 1407 DF, p-value: < 2.2e-16
# Test the accuracy of the model
# Predict monthly income; note that the model was fitted on min-max scaled
# predictors, whereas df holds the raw values, so the error below is inflated
predictions <- predict(model, newdata = df)
# Calculate the Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((df$MonthlyIncome - predictions)^2))
# Print the RMSE
cat("RMSE:", rmse, "\n")
## RMSE: 362994.7
# Calculate the R-squared value
r_squared <- summary(model)$r.squared
# Print the R-squared value
cat("R-squared:", r_squared, "\n")
## R-squared: 0.9518001
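The RMSE above is hugely inflated because the model was fitted on min-max scaled predictors while df holds the raw values. A sketch of the intended check, predicting on the same scaled design matrix X (our assumption of what was meant):
# Predict on the scaled predictors the model was actually fitted on
predictions_scaled <- predict(model, newdata = X)
rmse_scaled <- sqrt(mean((df_final$MonthlyIncome - predictions_scaled)^2))
cat("RMSE (scaled predictors):", rmse_scaled, "\n") # close to the residual SE of ~1056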
Generalized Linear Model (GLM)
# Train the model
results <- model_training_regression(X, y, model = "glm")
## Fold 1: MAE = 828.91, RMSE = 1082.25, R-Squared = 0.95
## Fold 2: MAE = 870.33, RMSE = 1116.75, R-Squared = 0.95
## Fold 3: MAE = 845.95, RMSE = 1106.10, R-Squared = 0.94
## Fold 4: MAE = 768.92, RMSE = 1045.57, R-Squared = 0.95
## Fold 5: MAE = 800.64, RMSE = 1056.02, R-Squared = 0.95
##
## Average Metrics Across Folds: MAE = 822.95, RMSE = 1081.34, R-Squared = 0.95
##
## Best Model is from Fold 4 with R-Squared = 0.95
# View the fold metrics
print(results$FoldMetrics)
## Fold MAE RMSE R_Squared
## 1 1 828.9101 1082.247 0.9485850
## 2 2 870.3345 1116.750 0.9491387
## 3 3 845.9539 1106.101 0.9386794
## 4 4 768.9196 1045.574 0.9500628
## 5 5 800.6386 1056.016 0.9482275
# Access the best model
best_model <- results$BestModel
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)
# Add column names for clarity
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")
# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
## Estimate Std.Error t.value
## JobLevel5 11128.553100 305.96735 36.37170178
## JobLevel4 8449.651934 262.28936 32.21500040
## JobLevel3 4877.776167 173.62223 28.09419127
## JobRoleResearch Director 3396.313228 202.71578 16.75406457
## JobRoleManager 3405.927580 247.97128 13.73516961
## JobLevel2 1599.175704 126.17838 12.67392829
## JobRoleLaboratory Technician -1266.673428 160.98849 -7.86809950
## JobRoleResearch Scientist -1216.878099 161.38288 -7.54031729
## (Intercept) 3692.168968 585.47547 6.30627440
## JobRoleSales Representative -1221.448955 333.73051 -3.65998594
## TotalWorkingYears 1300.374016 367.11759 3.54211853
## JobInvolvement3 -363.622583 138.73212 -2.62104099
## JobInvolvement4 -364.940933 168.28534 -2.16858424
## NumCompaniesWorked 267.058359 129.66398 2.05961869
## JobInvolvement2 -295.845662 145.88960 -2.02787360
## Education5 -331.801800 211.86178 -1.56612390
## StockOptionLevel1 169.394274 111.44450 1.51998778
## Education3 -151.800846 107.93985 -1.40634663
## JobSatisfaction2 -141.038264 100.68643 -1.40076739
## Education2 -150.754922 120.54784 -1.25058169
## DepartmentResearch & Development 551.353643 453.50926 1.21574948
## AttritionYes -120.712163 101.21726 -1.19260455
## YearsInCurrentRole 310.864297 264.72191 1.17430512
## JobSatisfaction3 -100.382876 91.38259 -1.09849015
## GenderMale 63.139388 65.55567 0.96314149
## OverTimeYes 68.411639 72.44940 0.94426791
## PerformanceRating4 -128.060348 135.99746 -0.94163780
## JobRoleSales Executive 274.291916 293.62630 0.93415310
## Age -189.363032 207.14438 -0.91415965
## YearsSinceLastPromotion 172.839726 192.77058 0.89660842
## PercentSalaryHike 168.532776 191.21627 0.88137255
## BusinessTravelTravel_Rarely 91.409246 105.30608 0.86803387
## EducationFieldMedical -262.353335 308.23376 -0.85115055
## RelationshipSatisfaction2 85.137088 100.54232 0.84677866
## WorkLifeBalance4 139.557308 167.60848 0.83263871
## YearsWithCurrManager -211.210099 266.40319 -0.79282121
## BusinessTravelTravel_Frequently 92.981542 123.17768 0.75485709
## EducationFieldLife Sciences -223.608795 308.54154 -0.72472833
## EducationFieldMarketing -232.839766 327.56029 -0.71083026
## TrainingTimesLastYear -101.781937 148.29514 -0.68634708
## EducationFieldOther -224.944912 330.68697 -0.68023519
## EducationFieldTechnical Degree -202.511764 319.61746 -0.63360671
## JobRoleHuman Resources -278.771392 474.99356 -0.58689510
## WorkLifeBalance3 83.592331 143.80397 0.58129362
## JobSatisfaction4 -52.738468 91.50354 -0.57635444
## MaritalStatusSingle 76.193547 137.79456 0.55295033
## EnvironmentSatisfaction2 -52.280042 101.98499 -0.51262488
## DepartmentSales 232.652083 461.82531 0.50376642
## WorkLifeBalance2 74.649952 152.98642 0.48795150
## JobRoleManufacturing Director -70.873720 149.73308 -0.47333376
## MaritalStatusMarried 37.796013 84.42156 0.44770568
## DistanceFromHome -46.711766 108.73972 -0.42957411
## EnvironmentSatisfaction4 -38.608441 92.80066 -0.41603625
## DailyRate 45.235325 110.07200 0.41096122
## HourlyRate 44.388598 109.39371 0.40576921
## RelationshipSatisfaction3 15.371131 92.67310 0.16586400
## MonthlyRate -16.768151 116.25081 -0.14424115
## EnvironmentSatisfaction3 -9.737652 92.23749 -0.10557152
## RelationshipSatisfaction4 7.705286 94.47260 0.08156107
## YearsAtCompany 35.760154 475.69156 0.07517509
## StockOptionLevel3 -11.093335 168.41468 -0.06586917
## StockOptionLevel2 8.134095 141.69522 0.05740557
## Education4 -5.505794 115.27068 -0.04776405
## P.Value
## JobLevel5 1.619338e-191
## JobLevel4 2.081430e-161
## JobLevel3 1.064009e-131
## JobRoleResearch Director 2.342074e-56
## JobRoleManager 8.981680e-40
## JobLevel2 1.753125e-34
## JobRoleLaboratory Technician 8.484063e-15
## JobRoleResearch Scientist 9.699518e-14
## (Intercept) 4.114887e-10
## JobRoleSales Representative 2.640573e-04
## TotalWorkingYears 4.133597e-04
## JobInvolvement3 8.885654e-03
## JobInvolvement4 3.032553e-02
## NumCompaniesWorked 3.966716e-02
## JobInvolvement2 4.281099e-02
## Education5 1.176039e-01
## StockOptionLevel1 1.287981e-01
## Education3 1.599003e-01
## JobSatisfaction2 1.615624e-01
## Education2 2.113500e-01
## DepartmentResearch & Development 2.243382e-01
## AttritionYes 2.332785e-01
## YearsInCurrentRole 2.405240e-01
## JobSatisfaction3 2.722281e-01
## GenderMale 3.356857e-01
## OverTimeYes 3.452377e-01
## PerformanceRating4 3.465824e-01
## JobRoleSales Executive 3.504275e-01
## Age 3.608310e-01
## YearsSinceLastPromotion 3.701218e-01
## PercentSalaryHike 3.783065e-01
## BusinessTravelTravel_Rarely 3.855629e-01
## EducationFieldMedical 3.948688e-01
## RelationshipSatisfaction2 3.973005e-01
## WorkLifeBalance4 4.052271e-01
## YearsWithCurrManager 4.280510e-01
## BusinessTravelTravel_Frequently 4.504944e-01
## EducationFieldLife Sciences 4.687710e-01
## EducationFieldMarketing 4.773384e-01
## TrainingTimesLastYear 4.926372e-01
## EducationFieldOther 4.964970e-01
## EducationFieldTechnical Degree 5.264677e-01
## JobRoleHuman Resources 5.573932e-01
## WorkLifeBalance3 5.611603e-01
## JobSatisfaction4 5.644922e-01
## MaritalStatusSingle 5.804085e-01
## EnvironmentSatisfaction2 6.083155e-01
## DepartmentSales 6.145252e-01
## WorkLifeBalance2 6.256803e-01
## JobRoleManufacturing Director 6.360679e-01
## MaritalStatusMarried 6.544527e-01
## DistanceFromHome 6.675886e-01
## EnvironmentSatisfaction4 6.774637e-01
## DailyRate 6.811801e-01
## HourlyRate 6.849902e-01
## RelationshipSatisfaction3 8.682941e-01
## MonthlyRate 8.853362e-01
## EnvironmentSatisfaction3 9.159413e-01
## RelationshipSatisfaction4 9.350104e-01
## YearsAtCompany 9.400889e-01
## StockOptionLevel3 9.474938e-01
## StockOptionLevel2 9.542324e-01
## Education4 9.619129e-01
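The fold metrics are identical to those of the linear regression above. This is expected: with a continuous target, caret’s “glm” fits a Gaussian GLM, which is mathematically the same model as ordinary least squares. A quick standalone demonstration on a built-in dataset (our addition):
# Gaussian GLM and OLS produce identical coefficients on the same data
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
all.equal(coef(fit_lm), coef(fit_glm)) # TRUE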
Support Vector Machine
dummy_model <- dummyVars("~ .", data = X)
X_numeric <- predict(dummy_model, newdata = X)
X_numeric <- as.data.frame(X_numeric)
# Train the model
results <- model_training_regression(X_numeric, y, model = "svmRadial")
## Fold 1: MAE = 959.78, RMSE = 1245.38, R-Squared = 0.93
## Fold 2: MAE = 958.74, RMSE = 1256.37, R-Squared = 0.94
## Fold 3: MAE = 932.48, RMSE = 1204.21, R-Squared = 0.93
## Fold 4: MAE = 903.54, RMSE = 1188.87, R-Squared = 0.94
## Fold 5: MAE = 940.82, RMSE = 1201.95, R-Squared = 0.93
##
## Average Metrics Across Folds: MAE = 939.07, RMSE = 1219.36, R-Squared = 0.93
##
## Best Model is from Fold 2 with R-Squared = 0.94
# View the fold metrics
print(results$FoldMetrics)
## Fold MAE RMSE R_Squared
## 1 1 959.7766 1245.375 0.9319172
## 2 2 958.7353 1256.369 0.9356262
## 3 3 932.4791 1204.212 0.9273188
## 4 4 903.5439 1188.875 0.9354365
## 5 5 940.8169 1201.947 0.9329300
# Access the best model
best_model <- results$BestModel
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## loess r-squared variable importance
##
## only 20 most important variables shown (out of 79)
##
## Overall
## TotalWorkingYears 0.58817
## JobRole.Manager 0.39688
## JobLevel.1 0.36809
## JobLevel.5 0.34497
## JobLevel.4 0.29949
## YearsAtCompany 0.27172
## Age 0.23015
## JobRole.Research Director 0.21773
## YearsInCurrentRole 0.14339
## YearsSinceLastPromotion 0.13441
## JobRole.Research Scientist 0.12625
## YearsWithCurrManager 0.11669
## JobRole.Laboratory Technician 0.09895
## JobLevel.3 0.08358
## JobRole.Sales Representative 0.03905
## Attrition.No 0.02541
## Attrition.Yes 0.02541
## JobLevel.2 0.02248
## NumCompaniesWorked 0.01971
## StockOptionLevel.1 0.01360
# Plot variable importance
plot(importance, top = 20, cex = 0.8)
Random Forest
# Step 1: Create an 80:20 train-test split
set.seed(42)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X_numeric[train_idx, ]
X_test <- X_numeric[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Step 2: Train the model
model <- train(
X_train, y_train,
method = "rf",
preProcess = c("center", "scale"),
trControl = trainControl(method = "none") # Single train-test split
)
# Step 3: Predict on the test set
y_pred <- predict(model, X_test)
# Step 4: Compute metrics
mae <- mean(abs(y_test - y_pred)) # Mean Absolute Error
rmse <- sqrt(mean((y_test - y_pred)^2)) # Root Mean Squared Error
ss_total <- sum((y_test - mean(y_test))^2)
ss_residual <- sum((y_test - y_pred)^2)
r_squared <- 1 - (ss_residual / ss_total) # R-Squared
# Print metrics
cat(sprintf("Metrics on Test Set:\nMAE = %.2f\nRMSE = %.2f\nR-Squared = %.2f\n", mae, rmse, r_squared))
## Metrics on Test Set:
## MAE = 856.51
## RMSE = 1151.24
## R-Squared = 0.94
# Step 5: Plot Predictions vs Actual
ggplot(data = data.frame(Actual = y_test, Predicted = y_pred), aes(x = Actual, y = Predicted)) +
geom_point(color = "blue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(title = "Predicted vs Actual", x = "Actual", y = "Predicted") +
theme_minimal()
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(model, scale = FALSE)
print(importance)
## rf variable importance
##
## only 20 most important variables shown (out of 79)
##
## Overall
## TotalWorkingYears 8.076e+09
## JobLevel.1 3.037e+09
## JobLevel.5 2.873e+09
## JobLevel.3 2.402e+09
## JobRole.Manager 2.181e+09
## JobLevel.4 1.559e+09
## JobLevel.2 1.437e+09
## JobRole.Research Director 1.422e+09
## YearsAtCompany 6.369e+08
## Age 3.372e+08
## JobRole.Sales Executive 1.520e+08
## JobRole.Laboratory Technician 1.273e+08
## YearsInCurrentRole 1.260e+08
## JobRole.Research Scientist 1.132e+08
## DailyRate 1.009e+08
## HourlyRate 8.956e+07
## MonthlyRate 8.865e+07
## YearsWithCurrManager 8.197e+07
## YearsSinceLastPromotion 8.138e+07
## DistanceFromHome 7.257e+07
# Plot variable importance
plot(importance)
Logistic regression, SVM and random forest models are used to predict attrition from the other variables. After comparison, logistic regression is selected as the best model because it has the highest accuracy (0.905) and the fewest false predictions in the confusion matrix. Based on the model summary, working overtime is the primary factor contributing to attrition, followed by monthly income, total working years, number of companies worked and distance from home.
Logistic Regression
X <- df_final %>% select(-Attrition)
num_cols <- sapply(X, is.numeric)
# Min-Max scale numerical columns
X[, num_cols] <- lapply(X[, num_cols], rescale, to = c(0, 1))
y <- df_final$Attrition
suppressWarnings({
results <- model_training(X, y, model = "glm", sampler = NULL)
})
## Fold 1: Accuracy = 0.86
## Fold 2: Accuracy = 0.89
## Fold 3: Accuracy = 0.86
## Fold 4: Accuracy = 0.87
## Fold 5: Accuracy = 0.91
## Average Accuracy: 0.88
##
## Best Model is from Fold 5 with Accuracy = 0.91
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 236 11
## Yes 17 31
##
## Accuracy : 0.9051
## 95% CI : (0.8657, 0.936)
## No Information Rate : 0.8576
## P-Value [Acc > NIR] : 0.009493
##
## Kappa : 0.6332
##
## Mcnemar's Test P-Value : 0.344704
##
## Sensitivity : 0.9328
## Specificity : 0.7381
## Pos Pred Value : 0.9555
## Neg Pred Value : 0.6458
## Prevalence : 0.8576
## Detection Rate : 0.8000
## Detection Prevalence : 0.8373
## Balanced Accuracy : 0.8355
##
## 'Positive' Class : No
##
# Access the best model
best_model <- results$BestModel
# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)
# Print the classification report
classification_report
## Class Precision Recall F1_Score Support
## 1 No 0.9554656 0.9328063 0.9440000 247
## 2 Yes 0.6458333 0.7380952 0.6888889 48
## 3 Macro Average 0.8006495 0.8354508 0.8164444 NA
## 4 Weighted Average 0.9050847 0.9011245 0.9024904 295
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)
# Add column names for clarity
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")
# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
## Estimate Std.Error t.value
## OverTimeYes 2.14728976 0.2351219 9.13266710
## NumCompaniesWorked 1.96199273 0.4044277 4.85128130
## EnvironmentSatisfaction4 -1.39308297 0.3059738 -4.55294788
## DistanceFromHome 1.46393935 0.3529273 4.14799136
## JobInvolvement3 -1.69100215 0.4177843 -4.04754882
## JobInvolvement4 -2.22500703 0.5500317 -4.04523443
## BusinessTravelTravel_Frequently 1.92052570 0.4752718 4.04089992
## JobSatisfaction4 -1.13127134 0.3003542 -3.76645752
## EnvironmentSatisfaction3 -1.11117173 0.2959835 -3.75416828
## EnvironmentSatisfaction2 -1.27776080 0.3426227 -3.72935195
## YearsSinceLastPromotion 2.67637895 0.7667300 3.49064073
## JobInvolvement2 -1.48543339 0.4441896 -3.34414232
## JobLevel2 -1.97161336 0.5944914 -3.31647090
## WorkLifeBalance3 -1.29638574 0.4270061 -3.03598846
## RelationshipSatisfaction2 -1.03790311 0.3514475 -2.95322373
## RelationshipSatisfaction3 -0.86466470 0.3046718 -2.83802015
## YearsWithCurrManager -2.54204443 0.9315524 -2.72882614
## StockOptionLevel1 -0.97576638 0.3639313 -2.68118286
## RelationshipSatisfaction4 -0.79169582 0.2996262 -2.64227841
## JobSatisfaction3 -0.68182842 0.2856354 -2.38705843
## YearsInCurrentRole -2.40180035 1.0313226 -2.32885451
## BusinessTravelTravel_Rarely 0.97619068 0.4337658 2.25050178
## YearsAtCompany 4.14588740 1.8429603 2.24958042
## TotalWorkingYears -3.07892059 1.4055955 -2.19047415
## WorkLifeBalance2 -0.95968886 0.4579243 -2.09573691
## DailyRate -0.73708157 0.3677695 -2.00419420
## TrainingTimesLastYear -0.98934134 0.5042333 -1.96207053
## GenderMale 0.40880073 0.2178506 1.87651862
## MonthlyIncome -3.53008494 2.0091424 -1.75701080
## JobSatisfaction2 -0.55183424 0.3165151 -1.74346896
## WorkLifeBalance4 -0.87450136 0.5026432 -1.73980527
## StockOptionLevel2 -0.84103357 0.5005979 -1.68005822
## JobRoleResearch Director -2.40776989 1.4648478 -1.64369969
## MaritalStatusSingle 0.63403839 0.4618445 1.37283960
## EducationFieldOther -1.46544712 1.0760930 -1.36182198
## JobLevel5 2.29487657 1.8123382 1.26625182
## MonthlyRate 0.48599936 0.3872225 1.25509072
## JobRoleResearch Scientist -0.95575395 0.7871207 -1.21424058
## JobRoleManufacturing Director 0.74116024 0.6576467 1.12698848
## EducationFieldLife Sciences -1.00750838 0.9490896 -1.06155246
## Age -0.66242002 0.6624095 -1.00001593
## EducationFieldMedical -0.88367045 0.9526814 -0.92756136
## MaritalStatusMarried 0.27796145 0.3259368 0.85280778
## JobRoleSales Executive 1.04583115 1.3245031 0.78960267
## HourlyRate 0.26369307 0.3659306 0.72060943
## PercentSalaryHike -0.39141001 0.6512822 -0.60098375
## EducationFieldMarketing -0.59331395 1.0018866 -0.59219672
## JobLevel4 -0.73042113 1.4061718 -0.51943948
## JobRoleLaboratory Technician 0.36195536 0.7542048 0.47991653
## StockOptionLevel3 -0.24117872 0.5444473 -0.44297901
## Education4 0.15545223 0.3715673 0.41836900
## Education2 -0.14023446 0.3936925 -0.35620305
## JobRoleManager 0.35478524 1.2063074 0.29410849
## EducationFieldTechnical Degree -0.26616344 0.9580493 -0.27781811
## JobRoleSales Representative 0.37377966 1.4531736 0.25721611
## JobLevel3 -0.20214672 0.8481719 -0.23833225
## Education3 0.08019492 0.3409004 0.23524444
## PerformanceRating4 0.08096047 0.4811355 0.16826960
## Education5 -0.04431876 0.7003849 -0.06327772
## DepartmentSales 14.86901021 598.7455466 0.02483360
## JobRoleHuman Resources 14.43411793 598.7453717 0.02410727
## DepartmentResearch & Development 14.33604722 598.7450422 0.02394349
## (Intercept) -10.89246893 598.7464656 -0.01819212
## P.Value
## OverTimeYes 6.683212e-20
## NumCompaniesWorked 1.226664e-06
## EnvironmentSatisfaction4 5.289938e-06
## DistanceFromHome 3.354050e-05
## JobInvolvement3 5.175679e-05
## JobInvolvement4 5.227079e-05
## BusinessTravelTravel_Frequently 5.324648e-05
## JobSatisfaction4 1.655802e-04
## EnvironmentSatisfaction3 1.739180e-04
## EnvironmentSatisfaction2 1.919729e-04
## YearsSinceLastPromotion 4.818638e-04
## JobInvolvement2 8.253739e-04
## JobLevel2 9.116208e-04
## WorkLifeBalance3 2.397485e-03
## RelationshipSatisfaction2 3.144740e-03
## RelationshipSatisfaction3 4.539432e-03
## YearsWithCurrManager 6.356020e-03
## StockOptionLevel1 7.336241e-03
## RelationshipSatisfaction4 8.235033e-03
## JobSatisfaction3 1.698379e-02
## YearsInCurrentRole 1.986677e-02
## BusinessTravelTravel_Rarely 2.441711e-02
## YearsAtCompany 2.447559e-02
## TotalWorkingYears 2.848987e-02
## WorkLifeBalance2 3.610554e-02
## DailyRate 4.504926e-02
## TrainingTimesLastYear 4.975427e-02
## GenderMale 6.058410e-02
## MonthlyIncome 7.891597e-02
## JobSatisfaction2 8.125173e-02
## WorkLifeBalance4 8.189322e-02
## StockOptionLevel2 9.294599e-02
## JobRoleResearch Director 1.002382e-01
## MaritalStatusSingle 1.698022e-01
## EducationFieldOther 1.732541e-01
## JobLevel5 2.054229e-01
## MonthlyRate 2.094458e-01
## JobRoleResearch Scientist 2.246559e-01
## JobRoleManufacturing Director 2.597474e-01
## EducationFieldLife Sciences 2.884389e-01
## Age 3.173028e-01
## EducationFieldMedical 3.536351e-01
## MaritalStatusMarried 3.937659e-01
## JobRoleSales Executive 4.297598e-01
## HourlyRate 4.711499e-01
## PercentSalaryHike 5.478508e-01
## EducationFieldMarketing 5.537189e-01
## JobLevel4 6.034543e-01
## JobRoleLaboratory Technician 6.312867e-01
## StockOptionLevel3 6.577809e-01
## Education4 6.756773e-01
## Education2 7.216885e-01
## JobRoleManager 7.686750e-01
## EducationFieldTechnical Degree 7.811520e-01
## JobRoleSales Representative 7.970119e-01
## JobLevel3 8.116234e-01
## Education3 8.140190e-01
## PerformanceRating4 8.663712e-01
## Education5 9.495454e-01
## DepartmentSales 9.801877e-01
## JobRoleHuman Resources 9.807670e-01
## DepartmentResearch & Development 9.808977e-01
## (Intercept) 9.854856e-01
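To focus the discussion on the strongest drivers, the sorted coefficient table can be filtered to predictors significant at the 5% level; a minimal sketch using the sorted_coefficients data frame built above.
# Keep only coefficients significant at the 5% level
subset(sorted_coefficients, P.Value < 0.05)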
Random oversampling worsens the model's performance on the minority class instead of improving it, possibly because the class imbalance is too large for simple duplication of minority examples to help.
suppressWarnings({
results <- model_training(X, y, model = "glm", sampler = "RandomOversampling")
})
## Applying Random Oversampling...
## Fold 1: Accuracy = 0.77
## Applying Random Oversampling...
## Fold 2: Accuracy = 0.78
## Applying Random Oversampling...
## Fold 3: Accuracy = 0.79
## Applying Random Oversampling...
## Fold 4: Accuracy = 0.79
## Applying Random Oversampling...
## Fold 5: Accuracy = 0.79
## Average Accuracy: 0.78
##
## Best Model is from Fold 5 with Accuracy = 0.79
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 194 53
## Yes 10 38
##
## Accuracy : 0.7864
## 95% CI : (0.7352, 0.8318)
## No Information Rate : 0.6915
## P-Value [Acc > NIR] : 0.0001746
##
## Kappa : 0.4241
##
## Mcnemar's Test P-Value : 1.213e-07
##
## Sensitivity : 0.9510
## Specificity : 0.4176
## Pos Pred Value : 0.7854
## Neg Pred Value : 0.7917
## Prevalence : 0.6915
## Detection Rate : 0.6576
## Detection Prevalence : 0.8373
## Balanced Accuracy : 0.6843
##
## 'Positive' Class : No
##
# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)
# Print the classification report
classification_report
## Class Precision Recall F1_Score Support
## 1 No 0.7854251 0.9509804 0.8603104 247
## 2 Yes 0.7916667 0.4175824 0.5467626 48
## 3 Macro Average 0.7885459 0.6842814 0.7035365 NA
## 4 Weighted Average 0.7864407 0.8641902 0.8092925 295
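For reference, the random oversampling applied inside model_training() can be reproduced in spirit with the ROSE package loaded earlier; this is a hedged sketch, since the helper's internals are not shown here, and it assumes df_final contains the Attrition column.
# Illustrative only: duplicate minority-class rows until the classes are
# roughly balanced (ovun.sample with method = "over" targets p = 0.5)
balanced <- ROSE::ovun.sample(Attrition ~ ., data = df_final,
                              method = "over", seed = 42)$data
table(balanced$Attrition)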
# Fit a plain logistic regression on an 80:20 split of the raw data for comparison
# (no set.seed() here, so the split below is not reproducible across runs)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Perform logistic regression
model <- glm(Attrition ~ ., data = train_data, family = binomial)
# Summarize the model
summary(model)
##
## Call:
## glm(formula = Attrition ~ ., family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.131e+00 6.522e+02 -0.014 0.988829
## Age -3.549e-02 1.681e-02 -2.112 0.034687 *
## BusinessTravelTravel_Frequently 2.343e+00 4.972e-01 4.712 2.45e-06 ***
## BusinessTravelTravel_Rarely 1.204e+00 4.490e-01 2.682 0.007323 **
## DailyRate -4.124e-04 2.627e-04 -1.570 0.116473
## DepartmentResearch & Development 1.426e+01 6.522e+02 0.022 0.982550
## DepartmentSales 1.343e+01 6.522e+02 0.021 0.983576
## DistanceFromHome 5.388e-02 1.336e-02 4.031 5.54e-05 ***
## Education2 4.232e-01 3.935e-01 1.076 0.282099
## Education3 3.967e-01 3.517e-01 1.128 0.259301
## Education4 3.749e-01 3.871e-01 0.969 0.332721
## Education5 5.782e-01 6.814e-01 0.848 0.396172
## EducationFieldLife Sciences -1.195e+00 9.703e-01 -1.231 0.218283
## EducationFieldMarketing -5.345e-01 1.029e+00 -0.520 0.603348
## EducationFieldMedical -9.341e-01 9.683e-01 -0.965 0.334740
## EducationFieldOther -1.274e+00 1.069e+00 -1.192 0.233338
## EducationFieldTechnical Degree -3.483e-02 9.954e-01 -0.035 0.972088
## EnvironmentSatisfaction2 -1.206e+00 3.453e-01 -3.491 0.000481 ***
## EnvironmentSatisfaction3 -1.042e+00 3.014e-01 -3.458 0.000545 ***
## EnvironmentSatisfaction4 -1.438e+00 3.096e-01 -4.646 3.39e-06 ***
## GenderMale 4.334e-01 2.210e-01 1.961 0.049836 *
## HourlyRate 4.423e-03 5.448e-03 0.812 0.416928
## JobInvolvement2 -9.876e-01 4.228e-01 -2.336 0.019494 *
## JobInvolvement3 -1.631e+00 4.011e-01 -4.065 4.80e-05 ***
## JobInvolvement4 -2.039e+00 5.475e-01 -3.724 0.000196 ***
## JobLevel2 -2.308e+00 6.204e-01 -3.720 0.000200 ***
## JobLevel3 -5.086e-01 8.808e-01 -0.577 0.563672
## JobLevel4 -1.177e+00 1.500e+00 -0.785 0.432559
## JobLevel5 2.443e+00 2.117e+00 1.154 0.248449
## JobRoleHuman Resources 1.396e+01 6.522e+02 0.021 0.982920
## JobRoleLaboratory Technician -1.881e-01 7.328e-01 -0.257 0.797443
## JobRoleManager -8.795e-01 1.551e+00 -0.567 0.570657
## JobRoleManufacturing Director 6.053e-01 6.058e-01 0.999 0.317675
## JobRoleResearch Director -3.041e+00 1.610e+00 -1.889 0.058912 .
## JobRoleResearch Scientist -1.315e+00 7.584e-01 -1.734 0.082932 .
## JobRoleSales Executive 2.218e+00 1.687e+00 1.315 0.188549
## JobRoleSales Representative 9.756e-01 1.782e+00 0.547 0.584100
## JobSatisfaction2 -7.221e-01 3.250e-01 -2.222 0.026280 *
## JobSatisfaction3 -5.940e-01 2.881e-01 -2.062 0.039230 *
## JobSatisfaction4 -1.286e+00 3.142e-01 -4.093 4.25e-05 ***
## MaritalStatusMarried 4.019e-01 3.290e-01 1.221 0.221907
## MaritalStatusSingle 6.364e-01 4.727e-01 1.346 0.178276
## MonthlyIncome -1.745e-04 1.097e-04 -1.591 0.111685
## MonthlyRate 6.023e-06 1.587e-05 0.380 0.704260
## NumCompaniesWorked 2.001e-01 4.671e-02 4.283 1.84e-05 ***
## OverTimeYes 2.268e+00 2.409e-01 9.412 < 2e-16 ***
## PercentSalaryHike -5.212e-02 4.766e-02 -1.094 0.274171
## PerformanceRating4 2.803e-01 4.925e-01 0.569 0.569282
## RelationshipSatisfaction2 -1.130e+00 3.533e-01 -3.200 0.001376 **
## RelationshipSatisfaction3 -1.113e+00 3.047e-01 -3.654 0.000258 ***
## RelationshipSatisfaction4 -1.003e+00 2.991e-01 -3.354 0.000797 ***
## StockOptionLevel1 -1.263e+00 3.770e-01 -3.350 0.000807 ***
## StockOptionLevel2 -1.169e+00 5.221e-01 -2.238 0.025224 *
## StockOptionLevel3 -3.105e-01 5.417e-01 -0.573 0.566539
## TotalWorkingYears -5.116e-02 3.560e-02 -1.437 0.150760
## TrainingTimesLastYear -2.043e-01 8.521e-02 -2.398 0.016489 *
## WorkLifeBalance2 -1.019e+00 4.446e-01 -2.292 0.021907 *
## WorkLifeBalance3 -1.376e+00 4.176e-01 -3.294 0.000986 ***
## WorkLifeBalance4 -9.547e-01 5.101e-01 -1.872 0.061252 .
## YearsAtCompany 1.065e-01 4.939e-02 2.157 0.031016 *
## YearsInCurrentRole -1.704e-01 5.875e-02 -2.901 0.003720 **
## YearsSinceLastPromotion 1.594e-01 5.195e-02 3.067 0.002159 **
## YearsWithCurrManager -1.179e-01 5.574e-02 -2.115 0.034407 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1046.75 on 1175 degrees of freedom
## Residual deviance: 609.94 on 1113 degrees of freedom
## AIC: 735.94
##
## Number of Fisher Scoring iterations: 15
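Because logistic regression coefficients are log-odds, exponentiating them gives more interpretable odds ratios; a small sketch using the fitted model above. For example, the OverTimeYes coefficient of about 2.27 corresponds to roughly a 9.7-fold increase in the odds of attrition.
# Odds ratio for working overtime: exponentiate the log-odds coefficient
exp(coef(model)["OverTimeYes"])  # ~9.7: overtime multiplies the odds of attrition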
# Predict on the test data
predictions <- predict(model, newdata = test_data, type = "response")
# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
predicted_attrition <- ifelse(predictions > 0.5, 1, 0)
# Evaluate the model
# Create a confusion matrix
conf_matrix <- table(predicted_attrition, test_data$Attrition)
print(conf_matrix)
##
## predicted_attrition No Yes
## 0 238 22
## 1 11 23
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8877551
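Accuracy alone hides the class imbalance, so it is worth deriving per-class recall from the confusion matrix above; a sketch, noting that the table's rows are the 0/1 predictions and its columns the reference labels.
# Per-class recall from the confusion matrix above
recall_no  <- conf_matrix["0", "No"]  / sum(conf_matrix[, "No"])   # 238 / 249
recall_yes <- conf_matrix["1", "Yes"] / sum(conf_matrix[, "Yes"])  # 23 / 45
cat(sprintf("Recall (No) = %.3f, Recall (Yes) = %.3f\n", recall_no, recall_yes))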
SVM
# One-hot encode categorical predictors, since SVMs require numeric inputs
dummy_model <- dummyVars("~ .", data = X)
X_numeric <- predict(dummy_model, newdata = X)
X_numeric <- as.data.frame(X_numeric)
# Train the model
results <- model_training(X_numeric, y, model = "svmLinear")
## Fold 1: Accuracy = 0.87
## Fold 2: Accuracy = 0.89
## Fold 3: Accuracy = 0.88
## Fold 4: Accuracy = 0.87
## Fold 5: Accuracy = 0.89
## Average Accuracy: 0.88
##
## Best Model is from Fold 5 with Accuracy = 0.89
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 236 11
## Yes 21 27
##
## Accuracy : 0.8915
## 95% CI : (0.8503, 0.9246)
## No Information Rate : 0.8712
## P-Value [Acc > NIR] : 0.1699
##
## Kappa : 0.5654
##
## Mcnemar's Test P-Value : 0.1116
##
## Sensitivity : 0.9183
## Specificity : 0.7105
## Pos Pred Value : 0.9555
## Neg Pred Value : 0.5625
## Prevalence : 0.8712
## Detection Rate : 0.8000
## Detection Prevalence : 0.8373
## Balanced Accuracy : 0.8144
##
## 'Positive' Class : No
##
# View the fold metrics
print(results$FoldMetrics)
## NULL
# Access the best model
best_model <- results$BestModel
# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)
# Print the classification report
classification_report
## Class Precision Recall F1_Score Support
## 1 No 0.9554656 0.9182879 0.9365079 247
## 2 Yes 0.5625000 0.7105263 0.6279070 48
## 3 Macro Average 0.7589828 0.8144071 0.7822075 NA
## 4 Weighted Average 0.8915254 0.8844827 0.8862949 295
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 78)
##
## Importance
## MonthlyIncome 0.6695
## TotalWorkingYears 0.6656
## JobLevel.1 0.6512
## YearsAtCompany 0.6492
## OverTime.Yes 0.6376
## OverTime.No 0.6376
## YearsWithCurrManager 0.6339
## Age 0.6327
## YearsInCurrentRole 0.6313
## StockOptionLevel.0 0.6244
## MaritalStatus.Single 0.6098
## StockOptionLevel.1 0.6010
## JobLevel.2 0.5910
## EnvironmentSatisfaction.1 0.5704
## DistanceFromHome 0.5675
## MaritalStatus.Married 0.5653
## BusinessTravel.Travel_Frequently 0.5574
## JobRole.Laboratory Technician 0.5572
## Department.Research & Development 0.5539
## DailyRate 0.5505
# Plot variable importance
plot(importance)
# Train the model
results <- model_training(X_numeric, y, model = "svmRadial")
## Fold 1: Accuracy = 0.86
## Fold 2: Accuracy = 0.88
## Fold 3: Accuracy = 0.86
## Fold 4: Accuracy = 0.88
## Fold 5: Accuracy = 0.89
## Average Accuracy: 0.87
##
## Best Model is from Fold 5 with Accuracy = 0.89
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 246 1
## Yes 32 16
##
## Accuracy : 0.8881
## 95% CI : (0.8465, 0.9217)
## No Information Rate : 0.9424
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.4451
##
## Mcnemar's Test P-Value : 1.767e-07
##
## Sensitivity : 0.8849
## Specificity : 0.9412
## Pos Pred Value : 0.9960
## Neg Pred Value : 0.3333
## Prevalence : 0.9424
## Detection Rate : 0.8339
## Detection Prevalence : 0.8373
## Balanced Accuracy : 0.9130
##
## 'Positive' Class : No
##
# View the fold metrics
print(results$FoldMetrics)
## NULL
# Access the best model
best_model <- results$BestModel
# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)
# Print the classification report
classification_report
## Class Precision Recall F1_Score Support
## 1 No 0.9959514 0.8848921 0.9371429 247
## 2 Yes 0.3333333 0.9411765 0.4923077 48
## 3 Macro Average 0.6646424 0.9130343 0.7147253 NA
## 4 Weighted Average 0.8881356 0.8940502 0.8647629 295
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 78)
##
## Importance
## MonthlyIncome 0.6695
## TotalWorkingYears 0.6656
## JobLevel.1 0.6512
## YearsAtCompany 0.6492
## OverTime.Yes 0.6376
## OverTime.No 0.6376
## YearsWithCurrManager 0.6339
## Age 0.6327
## YearsInCurrentRole 0.6313
## StockOptionLevel.0 0.6244
## MaritalStatus.Single 0.6098
## StockOptionLevel.1 0.6010
## JobLevel.2 0.5910
## EnvironmentSatisfaction.1 0.5704
## DistanceFromHome 0.5675
## MaritalStatus.Married 0.5653
## BusinessTravel.Travel_Frequently 0.5574
## JobRole.Laboratory Technician 0.5572
## Department.Research & Development 0.5539
## DailyRate 0.5505
# Plot variable importance
plot(importance)
Random Forest (RF)
# Cross-validation is too slow for random forest here, so run a single
# train-test split as a sanity check
# Step 1: Create an 80:20 train-test split
set.seed(42)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X_numeric[train_idx, ]
X_test <- X_numeric[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]
# Step 2: Train the model
model <- train(
X_train, y_train,
method = "rf",
preProcess = c("center", "scale"),
trControl = trainControl(method = "none") # Single train-test split
)
# Step 3: Predict on the test set
y_pred <- predict(model, X_test)
# Note: confusionMatrix(data, reference) expects predictions first; the
# arguments below are swapped, so the 'Prediction' and 'Reference' labels
# in the printout are interchanged
confusionMatrix(y_test, y_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 245 1
## Yes 39 8
##
## Accuracy : 0.8635
## 95% CI : (0.8188, 0.9006)
## No Information Rate : 0.9693
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2469
##
## Mcnemar's Test P-Value : 4.909e-09
##
## Sensitivity : 0.8627
## Specificity : 0.8889
## Pos Pred Value : 0.9959
## Neg Pred Value : 0.1702
## Prevalence : 0.9693
## Detection Rate : 0.8362
## Detection Prevalence : 0.8396
## Balanced Accuracy : 0.8758
##
## 'Positive' Class : No
##
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)
# Print the classification report
classification_report
## Class Precision Recall F1_Score Support
## 1 No 0.9959350 0.8626761 0.9245283 246
## 2 Yes 0.1702128 0.8888889 0.2857143 47
## 3 Macro Average 0.5830739 0.8757825 0.6051213 NA
## 4 Weighted Average 0.8634812 0.8668808 0.8220564 293
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(model, scale = FALSE)
print(importance)
## rf variable importance
##
## only 20 most important variables shown (out of 78)
##
## Overall
## MonthlyIncome 19.573
## Age 17.533
## DailyRate 14.628
## TotalWorkingYears 13.921
## MonthlyRate 13.499
## HourlyRate 13.111
## DistanceFromHome 12.189
## YearsAtCompany 12.082
## PercentSalaryHike 9.730
## NumCompaniesWorked 9.315
## OverTime.Yes 9.197
## YearsWithCurrManager 8.994
## OverTime.No 8.752
## TrainingTimesLastYear 7.447
## YearsInCurrentRole 7.322
## YearsSinceLastPromotion 7.089
## EnvironmentSatisfaction.1 5.240
## StockOptionLevel.0 5.150
## BusinessTravel.Travel_Frequently 4.608
## JobLevel.1 4.600
# Plot variable importance
plot(importance)
Important features used by the model to predict monthly income: 1. Job Role, 2. Job Level, 3. Total Working Years.
Interpretation of Job Role’s Impact on Monthly Income:
Job role is a significant factor in determining monthly income, as evidenced by the clear distinctions between roles in the boxplot below, which shows the distribution of monthly income across job roles. A few key insights can be drawn: 1. Manager and Research Director have the highest median incomes, while Sales Representative has the lowest. 2. Outliers are visible for several job roles, such as Laboratory Technician, Research Scientist, Sales Executive, and Sales Representative, indicating individual cases of exceptionally high or low salaries. 3. Manufacturing Director and Research Director show broader income ranges, suggesting pay structures that vary with seniority or performance, whereas Sales Representatives consistently earn the least, with incomes clustered at the lower end.
options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = JobRole, y = MonthlyIncome, fill = JobRole)) +
geom_boxplot() +
labs(
title = "Monthly Income by Job Role",
x = "Job Role",
y = "Monthly Income"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation of Job Level’s Impact on Monthly Income:
After job role, job level is the second most significant factor in determining monthly income. The boxplot below shows a clear positive relationship between job level and monthly income: as job level increases from 1 to 5, the median monthly income rises significantly, indicating that higher job levels are associated with higher earnings. This also suggests: 1. Promotions and career progression result in higher pay. 2. The wider spread at higher job levels (especially Job Level 4) could reflect variation in responsibilities, individual negotiation, or incentives for senior positions. 3. Job Level 2 has a noticeable number of outliers earning above the upper quartile, which may indicate exceptional cases such as high-performing individuals or unique circumstances at this level.
options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = JobLevel, y = MonthlyIncome, fill = JobLevel)) +
geom_boxplot() +
labs(
title = "Monthly Income by Job Level",
x = "Job Level",
y = "Monthly Income"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation of Total Working Years’ Impact on Monthly Income:
After job level, total working years is the third most significant factor influencing monthly income. The fitted trend line shows a positive correlation between working years and income, indicating that more experience generally leads to higher earnings. Key insights include: 1. Experience and skills accumulated over time lead to higher-paying roles and promotions, with more experienced individuals often commanding higher salaries. 2. The chart differentiates data points by job level, highlighting that higher job levels tend to earn more even with the same number of working years. 3. Variability in income at each experience level suggests that other factors, such as industry and location, also influence earnings.
ggplot(df, aes(x = TotalWorkingYears, y = MonthlyIncome, color = JobLevel)) +
geom_point(size = 3, alpha = 0.8) + # Adjust size and transparency
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") + # Trend line
labs(
title = "Monthly Income by Total Working Years",
x = "Total Working Years",
y = "Monthly Income",
color = "Job Level"
) +
theme_minimal() +
theme(
legend.position = "top", # Move legend to the top
axis.text = element_text(size = 12), # Adjust axis text size
plot.title = element_text(hjust = 0.5) # Center the title
)
## `geom_smooth()` using formula = 'y ~ x'
Important features used by the models to predict attrition: 1. OverTime, 2. Monthly Income, 3. Total Working Years, 4. Number of Companies Worked, 5. Distance From Home.
Interpretation of Overtime's Impact on Employee Attrition:
Overtime is the primary factor influencing employee attrition. The chart below indicates a positive association between overtime work and attrition, highlighting the need for further analysis and potential interventions. Key insights include: 1. Employees who do not work overtime have a significantly lower attrition rate than those who do. 2. This suggests that working overtime may contribute to higher employee turnover. 3. Potential underlying reasons include work-life imbalance, job stress, or other factors related to overtime work.
Recommendations: 1. Company should investigate the reasons behind the higher attrition rate among employees who work overtime. 2. Consider strategies to reduce overtime or support employees who work extra hours to improve retention. 3. Strategies such as: (i). Offering fair pay or additional benefits for overtime work shows that the company values employees’ time and effort. (ii). Providing flexible working hours or the option to take time off in lieu of overtime hours worked helps employees manage their work-life balance. (iii). Offering access to health and wellness programs or mental health support can help employees manage the stress and fatigue that may come from working extra hours. (iv). Acknowledging the hard work and commitment of employees who put in extra hours can foster a sense of appreciation and loyalty, motivating them to stay with the organization.
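To quantify the pattern shown in the count-based chart below, the attrition rate within each OverTime group can be computed directly; a minimal sketch assuming df as loaded earlier.
# Attrition rate within each OverTime group (row-wise proportions)
round(prop.table(table(df$OverTime, df$Attrition), margin = 1), 3)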
# Summarize data: Count occurrences of each combination
df_summary <- as.data.frame(table(df$OverTime, df$Attrition))
colnames(df_summary) <- c("OverTime", "Attrition", "Count")
# Create the grouped bar chart
ggplot(df_summary, aes(x = OverTime, y = Count, fill = Attrition)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Attrition Grouped by OverTime",
x = "OverTime",
y = "Count",
fill = "Attrition"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5),
plot.title = element_text(hjust = 0.5)
)
Interpretation of Monthly Income’s Impact on Employee Attrition:
Monthly income is the second most significant factor influencing employee attrition. The boxplot below shows the income distributions of staying and leaving employees, suggesting a potential relationship between income and attrition. Employees who leave tend to have lower incomes, which suggests monthly income is a key factor driving attrition. Key insights include: 1. Lower incomes among those who leave could indicate dissatisfaction with compensation, prompting employees to seek better opportunities elsewhere. 2. Higher incomes among some leavers might point to other factors, such as lack of job satisfaction, insufficient career development, or poor work-life balance; despite earning higher salaries, these employees may feel their potential is not being fully utilized or recognized, prompting them to move on.
Recommendations: 1. Regularly review and adjust compensation to remain competitive within the industry, as salary is a key retention factor. 2. Offer valuable benefits like health insurance, retirement plans, and performance incentives to show care for employees’ long-term well-being.
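The claim that leavers tend to earn less can be checked numerically alongside the boxplot; a small sketch assuming df as loaded earlier.
# Median monthly income by attrition status
aggregate(MonthlyIncome ~ Attrition, data = df, FUN = median)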
options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = MonthlyIncome, fill = Attrition)) +
geom_boxplot() +
labs(
title = "Boxplot of Monthly Income Grouped by Attrition",
x = "Attrition",
y = "Monthly Income"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation of the Impact of Number of Companies Worked on Employee Attrition:
The boxplot below shows that employees who left the company tend to have more variability in the number of companies they have worked for. However, the median number of companies for employees who left is 1, suggesting many may be early in their careers and have worked at only one company. Key insights include: 1. This may reflect lower organizational loyalty or dissatisfaction with career progression. 2. Employees in their first job may leave after gaining initial experience, seeking a better fit once they understand their skills, interests, and the industry.
Recommendations: 1. Provide employees with clear, structured career progression plans. Make sure they understand how they can advance within the company and the steps required to reach higher positions. 2. Offer continuous learning opportunities, including skill-building programs, leadership training, and mentorship. This can help employees feel valued and give them the tools to grow within the organization. 3. Implement regular check-ins and performance reviews to provide feedback, set goals, and recognize achievements. This helps employees feel their progress is acknowledged and motivates them to stay. 4. Ensure that employees’ roles are aligned with their skills, interests, and long-term career goals. Providing meaningful, challenging work can increase job satisfaction and reduce the urge to leave for new opportunities.
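Both the median of 1 among leavers and their wider spread can be verified with per-group summaries; a sketch assuming df as loaded earlier.
# Five-number summary of NumCompaniesWorked for each attrition group
tapply(df$NumCompaniesWorked, df$Attrition, summary)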
options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = NumCompaniesWorked, fill = Attrition)) +
geom_boxplot() +
labs(
title = "Boxplot of Number of Companies Worked Grouped by Attrition",
x = "Attrition",
y = "Number of Companies Worked"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation of the Impact of Distance From Home on Employee Attrition:
The boxplot below indicates a link between longer commutes and higher attrition rates. Comparing the median distances of the two groups shows that employees who left the company tend to live farther from work, suggesting that greater commuting distance correlates with attrition. Key insights include: 1. Long commutes can increase employee dissatisfaction through the time and cost involved, leading to a higher likelihood of leaving the company.
Recommendations: 1. Implement remote or hybrid work models to reduce commuting and improve work-life balance. 2. Offer transportation subsidies or reimbursements to ease commuting costs and stress. 3. Consider satellite offices or coworking spaces closer to employees’ homes to cut commuting time. 4. Allow flexible work hours to help employees avoid peak commute times. 5. Provide relocation assistance for employees willing to move closer to the office.
options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = DistanceFromHome, fill = Attrition)) +
geom_boxplot() +
labs(
title = "Boxplot of Distance from Home Grouped by Attrition",
x = "Attrition",
y = "Distance from Home"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
In this project, we successfully developed machine learning models to predict employee attrition and monthly income with promising performance. Through our analysis, we identified key factors influencing attrition, specifically OverTime, Monthly Income, Total Working Years, Number of Companies Worked, and Distance From Home. These insights provide valuable direction for addressing employee retention by highlighting the most impactful areas for intervention.
However, this project has several limitations. The dataset used was relatively small, and limited computing resources restricted the complexity of models and analysis we could perform. For future work, expanding the dataset, integrating additional features, and retraining the models on more diverse data could improve performance and test the models’ generalizability across different industries.
In conclusion, adopting a holistic approach to enhancing the employee experience—focused on improving work-life balance, supporting career growth, and promoting overall well-being—can help foster a more engaged and loyal workforce. This will ultimately lead to higher retention, increased employee motivation, and better alignment with the company’s long-term goals, driving sustained success.