IBM ATTRITION

INTRODUCTION

This project analyzes the IBM HR Analytics dataset using R. The aim is to understand employee attributes such as age, income, job satisfaction, work-life balance, and attrition.

Basic data preprocessing is performed to clean the dataset. Histograms, bar charts, scatter plots, and box plots are used to study patterns. Statistical methods such as ANOVA, correlation, and linear regression are also applied to find relationships between variables.

The project helps in understanding employee behavior and provides useful insights for better decision-making.

Loading necessary libraries

# Load necessary libraries
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Reading the Dataset

IBMDATA <- read.csv("F:/cap482p/IBM.csv")

Summary of the dataset

str(IBMDATA)

## 'data.frame':    1470 obs. of  13 variables:
##  $ Age                    : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition              : chr  "Yes" "No" "Yes" "No" ...
##  $ Department             : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome       : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education              : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField         : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EnvironmentSatisfaction: int  2 3 4 4 1 4 3 4 4 3 ...
##  $ JobSatisfaction        : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus          : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome          : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ NumCompaniesWorked     : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ WorkLifeBalance        : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany         : int  6 10 0 8 2 7 1 1 9 7 ...

summary(IBMDATA)

##       Age         Attrition          Department        DistanceFromHome
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 1.000  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 2.000  
##  Median :36.00   Mode  :character   Mode  :character   Median : 7.000  
##  Mean   :36.92                                         Mean   : 9.193  
##  3rd Qu.:43.00                                         3rd Qu.:14.000  
##  Max.   :60.00                                         Max.   :29.000  
##    Education     EducationField     EnvironmentSatisfaction JobSatisfaction
##  Min.   :1.000   Length:1470        Min.   :1.000           Min.   :1.000  
##  1st Qu.:2.000   Class :character   1st Qu.:2.000           1st Qu.:2.000  
##  Median :3.000   Mode  :character   Median :3.000           Median :3.000  
##  Mean   :2.913                      Mean   :2.722           Mean   :2.729  
##  3rd Qu.:4.000                      3rd Qu.:4.000           3rd Qu.:4.000  
##  Max.   :5.000                      Max.   :4.000           Max.   :4.000  
##  MaritalStatus      MonthlyIncome   NumCompaniesWorked WorkLifeBalance
##  Length:1470        Min.   : 1009   Min.   :0.000      Min.   :1.000  
##  Class :character   1st Qu.: 2911   1st Qu.:1.000      1st Qu.:2.000  
##  Mode  :character   Median : 4919   Median :2.000      Median :3.000  
##                     Mean   : 6503   Mean   :2.693      Mean   :2.761  
##                     3rd Qu.: 8379   3rd Qu.:4.000      3rd Qu.:3.000  
##                     Max.   :19999   Max.   :9.000      Max.   :4.000  
##  YearsAtCompany  
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 7.008  
##  3rd Qu.: 9.000  
##  Max.   :40.000

colnames(IBMDATA)

##  [1] "Age"                     "Attrition"              
##  [3] "Department"              "DistanceFromHome"       
##  [5] "Education"               "EducationField"         
##  [7] "EnvironmentSatisfaction" "JobSatisfaction"        
##  [9] "MaritalStatus"           "MonthlyIncome"          
## [11] "NumCompaniesWorked"      "WorkLifeBalance"        
## [13] "YearsAtCompany"

Preprocessing the dataset

sum(is.na(IBMDATA))

## [1] 0

#this code checks for any missing values in the dataset

IBMDATA <- na.omit(IBMDATA)
#this code removes any rows with missing values from the dataset
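
A total of zero is reassuring, but when missing values do exist it helps to know which columns hold them before dropping rows. The sketch below uses a small invented data frame (`toy`, not the IBM data) to show a per-column count with `colSums(is.na(...))` and the effect of `na.omit()`.

```r
# Toy data frame standing in for IBMDATA (all values are invented)
toy <- data.frame(
  Age           = c(41, NA, 37, 33),
  MonthlyIncome = c(5993, 5130, NA, NA),
  Department    = c("Sales", "Sales", NA, "HR")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() keeps only rows with no missing value in any column
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

Checking the per-column counts first shows how many rows `na.omit()` is about to discard and why.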

Converting categorical variables to factors

IBMDATA$Attrition <- as.factor(IBMDATA$Attrition)
IBMDATA$Department <- as.factor(IBMDATA$Department)
IBMDATA$EducationField <- as.factor(IBMDATA$EducationField)
IBMDATA$MaritalStatus <- as.factor(IBMDATA$MaritalStatus)
#this code converts the categorical columns to the factor data type
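
When a dataset has many text columns, converting them one by one becomes repetitive. As a possible alternative, the sketch below converts every character column in one step using base R. The data frame `toy` and its values are invented for illustration.

```r
# Toy data frame standing in for IBMDATA (all values are invented)
toy <- data.frame(
  Age           = c(41L, 49L, 37L),
  Attrition     = c("Yes", "No", "Yes"),
  MaritalStatus = c("Single", "Married", "Single"),
  stringsAsFactors = FALSE
)

# Flag the character columns, then convert each flagged column to a factor
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], as.factor)

str(toy)
print(levels(toy$Attrition))
```

Numeric columns such as `Age` are left untouched because only the flagged columns are reassigned.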

Duplicates check and removal

sum(duplicated(IBMDATA))

## [1] 0

#this code checks for any duplicate rows in the dataset
IBMDATA <- IBMDATA[!duplicated(IBMDATA), ]
#this code removes any duplicate rows from the dataset

Histogram graphs

Q1: What is the distribution of employee ages in the dataset? Are there more young or old employees?

#distribution of young vs old age employees
ggplot(IBMDATA, aes(x = Age)) +
  geom_histogram(bins = 20, fill = "blue", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Count")

#this code shows the distribution of younger vs older employees in the dataset

Q2: What is the distribution of monthly income among employees? Are there more low-income or high-income employees?

#distribution of low income vs high income range of employees
ggplot(IBMDATA, aes(x = MonthlyIncome)) +
  geom_histogram(bins = 20, fill = "green", color = "black") +
  labs(title = "Monthly Income Distribution", x = "Monthly Income", y = "Count")

#this code shows the distribution of low-income vs high-income employees in the dataset

Q3: What is the distribution of employees' distance from home? Do more employees live close to work or far from it?

#distribution of employees' distance from home
ggplot(IBMDATA, aes(x = DistanceFromHome)) +
  geom_histogram(bins = 15, fill = "orange", color = "blue") +
  labs(title = "Distance From Home Distribution", x = "Distance From Home", y = "Count")

#this code shows the distribution of employees' distance from home in the dataset

Q4: What is the distribution of employee years at the company? Are there more new employees or experienced employees?

#distribution of employees' years at the company (new vs experienced employees)
ggplot(IBMDATA, aes(x = YearsAtCompany)) +
  geom_histogram(bins = 15, fill = "purple", color = "black") +
  labs(title = "Years at Company Distribution", x = "Years at Company", y = "Count")

#this code shows the distribution of employees' years at the company (new vs experienced employees) in the dataset

Q5: What is the distribution of employees in how many companies they have worked for? Are there more employees who have worked for many companies or just a few?

#distribution of the number of companies employees have worked for
ggplot(IBMDATA, aes(x = NumCompaniesWorked)) +
  geom_histogram(bins = 10, fill = "green", color = "black") +
  labs(title = "Number of Companies Worked", x = "Number of Companies Worked",y = "Count")

#this code shows the distribution of the number of companies employees have worked for in the
#dataset

Bar graphs

Q6: How many employees are there in each Department? Which department has the most employees?

#How many employees are there in each Department?
ggplot(IBMDATA, aes(x = Department)) +
  geom_bar(fill = "grey", color = "black") +
  labs(title = "Number of Employees in Each Department", x = "Department", y = "Count")

#this code shows how many employees there are in each Department in the dataset

Q7: What is the count of employees with Attrition? How many employees have left the company?

#What is the count of employees with Attrition?
ggplot(IBMDATA, aes(x = Attrition)) +
  geom_bar(fill = "red", color = "black") +
  labs(title = "Count of Employees with Attrition", x = "Attrition", y = "Count")

#this code shows the count of employees who left (Attrition = Yes) and who stayed (Attrition = No)
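
A bar chart gives counts, but the attrition rate is often easier to read as a proportion. The sketch below shows one way to get both with `table()` and `prop.table()`. The vector `attrition` is invented and only stands in for `IBMDATA$Attrition`.

```r
# Invented attrition labels standing in for IBMDATA$Attrition
attrition <- factor(c("No", "No", "Yes", "No", "No",
                      "Yes", "No", "No", "No", "No"))

counts <- table(attrition)      # absolute count per category
shares <- prop.table(counts)    # proportions that sum to 1

print(counts)
print(round(100 * shares, 1))   # percentages
```

The same two calls applied to the real column would give the share of employees who left.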

Q8: What is the count of employees in each Education Field? Which education field has the most employees?

#What is the count of employees in each Education Field?
ggplot(IBMDATA, aes(x = EducationField)) +
  geom_bar(fill = "pink", color = "black") +
  labs(title = "Count of Employees in Each Education Field", x = "Education Field",
       y = "Count")

#this code shows the count of employees in each Education Field in the dataset

Q9: What is the count of employees in each Marital Status category? Which marital status category has the most employees?

#What is the count of employees in each Marital Status category?
ggplot(IBMDATA, aes(x = MaritalStatus)) +
  geom_bar(fill = "cyan", color = "black") +
  labs(title = "Count of Employees in Each Marital Status Category", x = "Marital Status",
       y = "Count")

#this code shows the count of employees in each Marital Status category in the dataset

Q10: What is the distribution of Work-Life Balance ratings? Are there more employees with good work-life balance or poor work-life balance?

#What is the distribution of Work-Life Balance ratings?
ggplot(IBMDATA, aes(x = WorkLifeBalance)) +
  geom_bar(fill = "yellow", color = "black") +
  labs(title = "Distribution of Work-Life Balance", x = "Work-Life Balance",y = "Count")

#this code shows the distribution of Work-Life Balance ratings in the dataset

Scatter plot

Q11: Is there a relationship between Age and Monthly Income? Do older employees tend to have higher monthly income?

#Is there a relationship between Age and Monthly Income?
ggplot(IBMDATA, aes(x = Age, y = MonthlyIncome)) +
  geom_point(color = "blue") +
  labs(title = "Relationship between Age and Monthly Income", x = "Age", y ="Monthly Income")

#this code shows the relationship between Age and Monthly Income in the dataset, i.e. whether
#older employees tend to have higher monthly income

Q12: Is there a relationship between Years at Company and Monthly Income? Do employees with more years at the company tend to have higher monthly income?

#Do more years in company increase income?
ggplot(IBMDATA, aes(x = YearsAtCompany, y = MonthlyIncome)) +
  geom_point(color = "green") +
  labs(title = "Relationship between Years at Company and Monthly Income", x = "Years at Company",
       y = "Monthly Income")

#this code shows the relationship between Years at Company and Monthly Income in the dataset, i.e. whether more years at the company go with higher income

Q13: Is there a relationship between Distance from Home and Job Satisfaction? Do employees who live farther from work tend to have lower job satisfaction?

#Does distance from home affect job satisfaction?
ggplot(IBMDATA, aes(x = DistanceFromHome, y = JobSatisfaction)) +
  geom_point(color = "orange") +
  labs(title = "Relationship between Distance from Home and Job Satisfaction", x = "Distance from Home",y = "Job Satisfaction")

#this code shows the relationship between Distance from Home and Job Satisfaction in the dataset, i.e. whether distance from home affects job satisfaction
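
Job Satisfaction takes only four values, so many points in a plain scatter plot sit on top of each other and the density of each combination is hidden. One possible complement is to tabulate the combinations instead. The sketch below uses randomly generated values and hypothetical distance bands (1-10, 11-20, 21-29), so it says nothing about the IBM data itself.

```r
set.seed(1)

# Randomly generated stand-ins for the two IBMDATA columns
toy <- data.frame(
  DistanceFromHome = sample(1:29, 200, replace = TRUE),
  JobSatisfaction  = sample(1:4, 200, replace = TRUE)
)

# Group distances into three bands (cut points are an arbitrary choice)
toy$DistanceBand <- cut(
  toy$DistanceFromHome,
  breaks = c(0, 10, 20, 29),
  labels = c("1-10", "11-20", "21-29")
)

# Number of employees in each band / satisfaction combination
counts <- table(toy$DistanceBand, toy$JobSatisfaction)
print(counts)

# Mean satisfaction rating within each distance band
mean_by_band <- tapply(toy$JobSatisfaction, toy$DistanceBand, mean)
print(round(mean_by_band, 2))
```

If the mean rating barely changes from band to band, distance is unlikely to be a strong driver of satisfaction.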

Q14: Is there a relationship between Number of Companies Worked and Monthly Income? Do employees who have worked for more companies tend to have higher monthly income?

#Does working in more companies affect income?
ggplot(IBMDATA, aes(x = NumCompaniesWorked, y = MonthlyIncome)) +
  geom_point(color = "purple") +
  labs(title = "Relationship between Number of Companies Worked and Monthly Income", x = "Number of Companies Worked", y = "Monthly Income")

#this code shows the relationship between Number of Companies Worked and Monthly Income in the dataset

Q15: Is there a relationship between Work-Life Balance and Job Satisfaction? Do employees with better work-life balance tend to have higher job satisfaction?

#Does better work-life balance increase job satisfaction?
ggplot(IBMDATA, aes(x = WorkLifeBalance, y = JobSatisfaction)) +
  geom_point(color = "cyan") +
  labs(title = "Relationship between Work-Life Balance and Job Satisfaction", x = "Work-Life Balance", y = "Job Satisfaction")

#this code shows the relationship between Work-Life Balance and Job Satisfaction in the dataset

Q16: What is the cumulative distribution of Monthly Income? What share of employees earn at or below a given income level?

#CDF of Monthly Income
plot(
  ecdf(IBMDATA$MonthlyIncome),
  main = "CDF of Monthly Income",
  xlab = "Income",
  ylab = "Cumulative Probability",
  col = "blue"
)

#this code creates a cumulative distribution function (CDF) plot for the Monthly Income variable in the dataset, showing the cumulative probability of different income levels among employees.
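
The object returned by `ecdf()` is itself a function, so specific values can be read from it instead of estimating them from the plot. The sketch below illustrates this with ten invented incomes; `income` is a stand-in for `IBMDATA$MonthlyIncome`.

```r
# Invented monthly incomes standing in for IBMDATA$MonthlyIncome
income <- c(2000, 2500, 3000, 4000, 5000, 6500, 8000, 10000, 15000, 19000)

income_cdf <- ecdf(income)   # ecdf() returns a step function

# Share of employees earning at most 5000
share_up_to_5000 <- income_cdf(5000)
print(share_up_to_5000)

# The function is vectorised, so several thresholds can be checked at once
print(income_cdf(c(3000, 10000)))
```

Here five of the ten values are at or below 5000, so the first call returns 0.5.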

Q17: Is there a relationship between Department and Monthly Income? Do certain departments tend to have higher or lower income levels?

#Box Plot of Income by Department
ggplot(IBMDATA,
       aes(x = Department,
           y = MonthlyIncome,
           fill = Department)) +
  geom_boxplot() +
  labs(
    title = "Income by Department",
    x = "Department",
    y = "Income"
  )

#this code creates a box plot to compare the distribution of Monthly Income across different Departments in the dataset, allowing us to see if certain departments have higher or lower income levels.

Q18: Is there a relationship between Age and Attrition? Do employees who leave tend to be younger or older than those who stay?

#Box Plot of Age by Attrition
ggplot(IBMDATA,
       aes(x = Attrition,
           y = Age,
           fill = Attrition)) +
  geom_boxplot() +
  labs(
    title = "Age vs Attrition",
    x = "Attrition",
    y = "Age"
  )

#this code creates a box plot to compare the distribution of Age between employees who left (Attrition = Yes) and those who stayed (Attrition = No), helping us understand if age is a factor in employee attrition.
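
The medians drawn in a box plot can also be printed as numbers. A minimal sketch with `aggregate()` follows. The eight rows in `toy` are invented and do not come from the IBM data.

```r
# Invented ages and attrition labels standing in for IBMDATA
toy <- data.frame(
  Age       = c(25, 29, 31, 34, 38, 41, 45, 52),
  Attrition = factor(c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No"))
)

# Median age within each attrition group
age_summary <- aggregate(Age ~ Attrition, data = toy, FUN = median)
print(age_summary)
```

Replacing `median` with `mean` or `summary` gives other per-group statistics from the same call.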

Q19: Are there statistically significant differences in Monthly Income across different Departments?

#ANOVA: Income vs Department
anova_dept <- aov(
  MonthlyIncome ~ Department,
  data = IBMDATA
)

summary(anova_dept)

##               Df    Sum Sq  Mean Sq F value Pr(>F)  
## Department     2 1.415e+08 70754962   3.202  0.041 *
## Residuals   1467 3.242e+10 22098613                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#this code performs an ANOVA test to determine if there are statistically significant differences in Monthly Income across different Departments in the dataset. With F = 3.202 and p = 0.041, the differences are significant at the 5% level.
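
A significant ANOVA says that at least one department differs, but not which one. A common follow-up is Tukey's HSD test, available in base R as `TukeyHSD()`. The sketch below runs it on simulated incomes; the department means and spread are invented, so the printed differences are not results for the IBM data.

```r
set.seed(42)

# Simulated incomes for three departments (means and sd are invented)
toy <- data.frame(
  Department    = factor(rep(c("HR", "R&D", "Sales"), each = 30)),
  MonthlyIncome = c(rnorm(30, mean = 6000, sd = 1000),
                    rnorm(30, mean = 6200, sd = 1000),
                    rnorm(30, mean = 7500, sd = 1000))
)

fit <- aov(MonthlyIncome ~ Department, data = toy)

# Pairwise comparisons with adjusted p-values
tukey <- TukeyHSD(fit)
print(tukey)
```

Each row of the output is one pair of departments, with the estimated difference, its confidence interval, and an adjusted p-value.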

Q20: Are there statistically significant differences in Job Satisfaction across different levels of Work-Life Balance?

#ANOVA: Job Satisfaction vs Work-Life Balance
anova_wlb <- aov(
  JobSatisfaction ~ WorkLifeBalance,
  data = IBMDATA
)

summary(anova_wlb)

##                   Df Sum Sq Mean Sq F value Pr(>F)
## WorkLifeBalance    1    0.7  0.6765   0.556  0.456
## Residuals       1468 1786.0  1.2166

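#this code runs aov() of Job Satisfaction on Work-Life Balance. Because WorkLifeBalance is stored as an integer, the model fits a single linear term (Df = 1) instead of comparing the four rating levels; wrapping it in factor() would compare the levels directly. The linear term is not significant (p = 0.456).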
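
When a rating such as `WorkLifeBalance` is stored as an integer, `aov()` treats it as a numeric predictor and reports Df = 1. The sketch below contrasts that with the `factor()` version, which compares group means across levels. The ratings are randomly generated, so only the degrees of freedom matter here, not the p-values.

```r
set.seed(7)

# Randomly generated ratings standing in for the two IBMDATA columns
toy <- data.frame(
  WorkLifeBalance = sample(1:4, 120, replace = TRUE),
  JobSatisfaction = sample(1:4, 120, replace = TRUE)
)

# Integer predictor: one slope, so 1 degree of freedom
fit_numeric <- aov(JobSatisfaction ~ WorkLifeBalance, data = toy)

# Factor predictor: four group means, so 4 - 1 = 3 degrees of freedom
fit_factor <- aov(JobSatisfaction ~ factor(WorkLifeBalance), data = toy)

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
print(c(numeric = df_numeric, factor = df_factor))
```

The factor version is the one that answers a question about differences between rating levels.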

Q21: What is the strength and direction of the relationship between Age and Monthly Income? Do older employees tend to have higher or lower income?

#Correlation: Age & Income
cor(
  IBMDATA$Age,
  IBMDATA$MonthlyIncome,
  use = "complete.obs"
)

## [1] 0.4978546

#this code calculates the correlation coefficient between Age and Monthly Income in the dataset. The value of about 0.50 indicates a moderate positive relationship: older employees tend to have higher income.

Q22: What is the strength and direction of the relationship between Years at Company and Monthly Income? Do employees with more years at the company tend to have higher or lower income?

#Correlation: Years at Company & Income
cor(
  IBMDATA$YearsAtCompany,
  IBMDATA$MonthlyIncome,
  use = "complete.obs"
)

## [1] 0.5142848

#this code calculates the correlation coefficient between Years at Company and Monthly Income in the dataset. The value of about 0.51 indicates a moderate positive relationship: employees with more years at the company tend to have higher income.
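
`cor()` returns only the coefficient. If a confidence interval and p-value are also wanted, `cor.test()` provides them. The sketch below uses simulated data; the numbers 3700, 400, and 4000 are invented and chosen only to produce a clearly positive relationship.

```r
set.seed(123)

# Simulated stand-ins for YearsAtCompany and MonthlyIncome
years  <- runif(200, min = 0, max = 40)
income <- 3700 + 400 * years + rnorm(200, sd = 4000)

# Pearson correlation with a test of the null hypothesis r = 0
result <- cor.test(years, income)

print(result$estimate)   # correlation coefficient
print(result$conf.int)   # 95% confidence interval
print(result$p.value)    # p-value of the test
```

A confidence interval that excludes zero supports the presence of a linear relationship.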

Q23: How are Age, Monthly Income, Years at Company, and Job Satisfaction correlated with one another?

#Correlation Matrix
numeric_data <- IBMDATA %>%
  select(Age, MonthlyIncome, YearsAtCompany, JobSatisfaction)

cor(numeric_data)

##                          Age MonthlyIncome YearsAtCompany JobSatisfaction
## Age              1.000000000   0.497854567    0.311308770    -0.004891877
## MonthlyIncome    0.497854567   1.000000000    0.514284826    -0.007156742
## YearsAtCompany   0.311308770   0.514284826    1.000000000    -0.003802628
## JobSatisfaction -0.004891877  -0.007156742   -0.003802628     1.000000000

#this code creates a correlation matrix for the selected numeric variables (Age, Monthly Income, Years at Company, and Job Satisfaction) in the dataset, showing the pairwise correlations between these variables. Job Satisfaction is essentially uncorrelated with the other three (all coefficients are close to 0).

Q24: How does Monthly Income change with Years at Company? Can a simple linear regression quantify the relationship?

#Regression: Income vs Years at Company
model1 <- lm(
  MonthlyIncome ~ YearsAtCompany,
  data = IBMDATA
)

summary(model1)

## 
## Call:
## lm(formula = MonthlyIncome ~ YearsAtCompany, data = IBMDATA)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9504  -2499  -1188   1393  15484 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3733.3      160.1   23.32   <2e-16 ***
## YearsAtCompany    395.2       17.2   22.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4039 on 1468 degrees of freedom
## Multiple R-squared:  0.2645, Adjusted R-squared:  0.264 
## F-statistic: 527.9 on 1 and 1468 DF,  p-value: < 2.2e-16

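#this code fits a linear regression model to predict Monthly Income based on Years at Company, and then summarizes the results. Each additional year at the company is associated with about 395 more in monthly income, and the model explains about 26% of the variation in income (R-squared = 0.2645).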
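
Individual numbers can be pulled out of a fitted model rather than read from the printed summary. The sketch below shows `coef()`, `confint()`, and the R-squared element of `summary()`, fitted to simulated data. The true intercept (3700), slope (400), and noise level (500) are invented for the simulation.

```r
set.seed(1)

# Simulated data: income rises by 400 per year, plus random noise
toy <- data.frame(YearsAtCompany = rep(0:19, times = 5))
toy$MonthlyIncome <- 3700 + 400 * toy$YearsAtCompany +
  rnorm(nrow(toy), sd = 500)

fit <- lm(MonthlyIncome ~ YearsAtCompany, data = toy)

slope     <- coef(fit)[["YearsAtCompany"]]        # estimated change per year
slope_ci  <- confint(fit)["YearsAtCompany", ]     # 95% confidence interval
r_squared <- summary(fit)$r.squared               # share of variance explained

print(slope)
print(slope_ci)
print(r_squared)
```

The confidence interval gives a range for the per-year change in income instead of a single point estimate.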

Q25: How well does the fitted regression line describe the relationship between Years at Company and Monthly Income?

#Regression Plot
ggplot(IBMDATA,
       aes(x = YearsAtCompany,
           y = MonthlyIncome)) +
  geom_point(color = "blue") +
  geom_smooth(
    method = "lm",
    se = FALSE,
    color = "red"
  ) +
  labs(
    title = "Years vs Income",
    x = "Years",
    y = "Income"
  )

## `geom_smooth()` using formula = 'y ~ x'

#this code creates a scatter plot of Years at Company vs Monthly Income, with a linear regression line added to visualize the relationship between these two variables in the dataset.

Q26: Can we predict Monthly Income based on Age and Years at Company? How well do these variables explain the variation in income?

#Multiple Regression
model2 <- lm(
  MonthlyIncome ~ YearsAtCompany + Age,
  data = IBMDATA
)

summary(model2)

## 
## Call:
## lm(formula = MonthlyIncome ~ YearsAtCompany + Age, data = IBMDATA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10339.9  -2284.3   -505.4   1618.8  13968.6 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2756.47     399.62  -6.898 7.83e-12 ***
## YearsAtCompany   305.73      16.48  18.554  < 2e-16 ***
## Age              192.74      11.05  17.441  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3677 on 1467 degrees of freedom
## Multiple R-squared:  0.3908, Adjusted R-squared:   0.39 
## F-statistic: 470.6 on 2 and 1467 DF,  p-value: < 2.2e-16

#this code fits a multiple linear regression model to predict Monthly Income based on both Years at Company and Age, and then summarizes the results. Together the two predictors explain about 39% of the variation in income (R-squared = 0.3908).

Q27: What Monthly Income values does the multiple regression model predict for each employee?

#Predict Income Values
predicted_income <- predict(model2)

head(predicted_income)

##        1        2        3        4        5        6 
## 6980.354 9745.226 4374.982 6049.887 3059.031 5551.411

#this code uses the fitted multiple regression model (model2) to predict Monthly Income values based on the Years at Company and Age variables, and then displays the first few predicted income values.
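
Called without new data, `predict()` returns fitted values for the rows used to train the model. To predict for employees who are not in the dataset, a data frame can be passed through the `newdata` argument. The sketch below shows this on simulated data; the coefficients used in the simulation and the two hypothetical employees in `new_employees` are invented.

```r
set.seed(2)

# Simulated training data standing in for IBMDATA
toy <- data.frame(
  YearsAtCompany = sample(0:30, 100, replace = TRUE),
  Age            = sample(20:60, 100, replace = TRUE)
)
toy$MonthlyIncome <- -2700 + 300 * toy$YearsAtCompany + 190 * toy$Age +
  rnorm(100, sd = 1000)

fit <- lm(MonthlyIncome ~ YearsAtCompany + Age, data = toy)

# Two hypothetical employees; column names must match the model's predictors
new_employees <- data.frame(YearsAtCompany = c(2, 10), Age = c(25, 45))
predicted <- predict(fit, newdata = new_employees)
print(predicted)

# The same predictions computed by hand from the coefficients
b <- coef(fit)
manual <- b[["(Intercept)"]] +
  b[["YearsAtCompany"]] * new_employees$YearsAtCompany +
  b[["Age"]] * new_employees$Age
print(manual)
```

The manual calculation makes explicit that a prediction is the intercept plus each coefficient times its predictor value.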

Q28: How accurate are the predictions of the multiple regression model? Can we visualize the predicted vs actual income values?

#Actual vs Predicted Income
actual_pred <- data.frame(
  Actual = IBMDATA$MonthlyIncome,
  Predicted = predicted_income
)

ggplot(actual_pred,
       aes(x = Actual,
           y = Predicted)) +
  geom_point(color = "green") +
  labs(
    title = "Actual vs Predicted Income"
  )

#this code creates a scatter plot comparing the actual Monthly Income values to the predicted values from the multiple regression model, allowing us to visually assess the accuracy of the predictions.
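
A plot gives a visual impression of accuracy, and an error metric can put a number on it. The sketch below computes the root mean squared error (RMSE) and mean absolute error (MAE) for five invented pairs of actual and predicted incomes; with the real model, `IBMDATA$MonthlyIncome` and `predicted_income` would take their place.

```r
# Invented actual and predicted incomes (not taken from the IBM data)
actual    <- c(3000, 4500, 5200, 7000, 9800)
predicted <- c(3400, 4100, 5600, 6500, 9000)

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))   # penalises large errors more heavily
mae  <- mean(abs(errors))      # average size of an error

print(rmse)
print(mae)
```

Both metrics are in the same units as income, so they can be compared directly with typical salary values.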

Q29: What does the smoothed density of Monthly Income look like? Is income concentrated at the low end or the high end?

#Density Plot of Income
ggplot(IBMDATA,
       aes(x = MonthlyIncome)) +
  geom_density(fill = "lightblue") +
  labs(
    title = "Density of Income",
    x = "Income",
    y = "Density"
  )

#this code creates a density plot for the Monthly Income variable, showing the distribution of income levels among employees in the dataset.

Q30: Is the distribution of Monthly Income approximately normal? Can we use a QQ plot to assess the normality of the income distribution?

#QQ Plot of Income
qqnorm(IBMDATA$MonthlyIncome)

qqline(IBMDATA$MonthlyIncome, col = "red")

#this code creates a QQ plot for the Monthly Income variable to assess the normality of the income distribution, with a reference line added to help visualize deviations from normality.
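
A QQ plot can be complemented by a formal test such as Shapiro-Wilk (`shapiro.test()`, which accepts between 3 and 5000 observations). The sketch below applies it to simulated right-skewed incomes and to their logarithms. The log-normal parameters are invented, so the results describe the simulation only, not the IBM data.

```r
set.seed(10)

# Simulated right-skewed incomes (log-normal, parameters are invented)
income <- rlnorm(500, meanlog = log(5000), sdlog = 0.7)

raw_test <- shapiro.test(income)        # test on the original scale
log_test <- shapiro.test(log(income))   # test after a log transform

print(raw_test)
print(log_test)
```

A W statistic closer to 1 indicates a better match to a normal distribution, which is why a log transform is often tried on skewed income data.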

THANKS

Gurjeewanjot Singh & Rohit Kumar