INTRODUCTION
This project analyzes the IBM HR Analytics dataset using R. The aim is to understand employee data such as age, income, job satisfaction, work-life balance, and attrition.
The dataset is first cleaned with basic preprocessing: checks for missing values and duplicate rows, and conversion of categorical columns to factors. Histograms, bar charts, scatter plots, and box plots are then used to study patterns, and statistical methods (ANOVA, correlation, and linear regression) are applied to find relationships between variables.
Together, these steps describe the workforce and show which variables are associated with monthly income and attrition.
Loading necessary libraries
# Load necessary libraries
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Reading the Dataset
IBMDATA <- read.csv("F:/cap482p/IBM.csv")
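Since R 4.0, `read.csv()` leaves text columns as character vectors by default. One option is to convert them to factors while reading. The following is a minimal sketch, assuming a small made-up CSV written to a temporary file; the real `IBM.csv` is not used, and `toy_csv` and `toy` are hypothetical names.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Age,Attrition,Department",
  "41,Yes,Sales",
  "49,No,Research & Development",
  "37,Yes,Research & Development"
), toy_csv)

# stringsAsFactors = TRUE converts every character column to a factor on read.
toy <- read.csv(toy_csv, stringsAsFactors = TRUE)
str(toy)
```

Converting on read suits a file where every text column is categorical; with free-text columns, converting selected columns afterwards is usually the safer choice.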
Summary of the dataset
str(IBMDATA)
## 'data.frame': 1470 obs. of 13 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EnvironmentSatisfaction: int 2 3 4 4 1 4 3 4 4 3 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
summary(IBMDATA)
## Age Attrition Department DistanceFromHome
## Min. :18.00 Length:1470 Length:1470 Min. : 1.000
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 2.000
## Median :36.00 Mode :character Mode :character Median : 7.000
## Mean :36.92 Mean : 9.193
## 3rd Qu.:43.00 3rd Qu.:14.000
## Max. :60.00 Max. :29.000
## Education EducationField EnvironmentSatisfaction JobSatisfaction
## Min. :1.000 Length:1470 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 Class :character 1st Qu.:2.000 1st Qu.:2.000
## Median :3.000 Mode :character Median :3.000 Median :3.000
## Mean :2.913 Mean :2.722 Mean :2.729
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :4.000 Max. :4.000
## MaritalStatus MonthlyIncome NumCompaniesWorked WorkLifeBalance
## Length:1470 Min. : 1009 Min. :0.000 Min. :1.000
## Class :character 1st Qu.: 2911 1st Qu.:1.000 1st Qu.:2.000
## Mode :character Median : 4919 Median :2.000 Median :3.000
## Mean : 6503 Mean :2.693 Mean :2.761
## 3rd Qu.: 8379 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :19999 Max. :9.000 Max. :4.000
## YearsAtCompany
## Min. : 0.000
## 1st Qu.: 3.000
## Median : 5.000
## Mean : 7.008
## 3rd Qu.: 9.000
## Max. :40.000
colnames(IBMDATA)
## [1] "Age" "Attrition"
## [3] "Department" "DistanceFromHome"
## [5] "Education" "EducationField"
## [7] "EnvironmentSatisfaction" "JobSatisfaction"
## [9] "MaritalStatus" "MonthlyIncome"
## [11] "NumCompaniesWorked" "WorkLifeBalance"
## [13] "YearsAtCompany"
Preprocessing the dataset
sum(is.na(IBMDATA))
## [1] 0
#this code checks for any missing values in the dataset
IBMDATA <- na.omit(IBMDATA)
#this code removes any rows with missing values from the dataset
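`sum(is.na(...))` gives a single grand total. When that total is not zero, a per-column count shows where the gaps are. The following is a small sketch on made-up data (not the IBM dataset); `toy`, `na_per_column`, and `complete_rows` are hypothetical names.

```r
# Made-up data with two missing values (not the IBM dataset).
toy <- data.frame(
  Age = c(41, NA, 37, 33),
  MonthlyIncome = c(5993, 5130, NA, 2909),
  Attrition = c("Yes", "No", "Yes", "No")
)

# Missing values per column rather than one grand total.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows that na.omit() would keep.
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```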
Converting categorical variables to factors
IBMDATA$Attrition <- as.factor(IBMDATA$Attrition)
IBMDATA$Department <- as.factor(IBMDATA$Department)
IBMDATA$EducationField <- as.factor(IBMDATA$EducationField)
IBMDATA$MaritalStatus <- as.factor(IBMDATA$MaritalStatus)
#this code converts the categorical columns to the factor data type
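The four conversions above repeat the same call. As a possible shorthand, the same step can be applied to a vector of column names. This sketch uses made-up data; `toy` and `factor_cols` are hypothetical names.

```r
# Made-up data (not the IBM dataset) with two categorical text columns.
toy <- data.frame(
  Attrition = c("Yes", "No", "No"),
  Department = c("Sales", "Sales", "Human Resources"),
  Age = c(41L, 49L, 37L),
  stringsAsFactors = FALSE
)

# Convert every listed column to a factor in one step.
factor_cols <- c("Attrition", "Department")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Check the resulting column classes.
sapply(toy, class)
```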
Duplicates check and removal
sum(duplicated(IBMDATA))
## [1] 0
#this code checks for any duplicate rows in the dataset
IBMDATA <- IBMDATA[!duplicated(IBMDATA), ]
#this code removes any duplicate rows from the dataset
Histogram graphs
Q1: What is the distribution of employee ages in the dataset? Are there more young or old employees?
#distribution of young vs old age employees
ggplot(IBMDATA, aes(x = Age)) +
geom_histogram(bins = 20, fill = "blue", color = "black") +
labs(title = "Age Distribution", x = "Age", y = "Count")
#this code shows the age distribution, i.e. whether younger or older employees are more common in the dataset
Q2: What is the distribution of monthly income among employees? Are there more low-income or high-income employees?
#distribution of low income vs high income range of employees
ggplot(IBMDATA, aes(x = MonthlyIncome)) +
geom_histogram(bins = 20, fill = "green", color = "black") +
labs(title = "Monthly Income Distribution", x = "Monthly Income", y = "Count")
#this code shows the monthly income distribution, i.e. whether low or high incomes are more common in the dataset
Q3: What is the distribution of employees' distance from home? Do more employees live close to work or far from it?
#distribution of employees' distance from home
ggplot(IBMDATA, aes(x = DistanceFromHome)) +
geom_histogram(bins = 15, fill = "orange", color = "blue") +
labs(title = "Distance From Home Distribution", x = "Distance From Home", y = "Count")
#this code shows the distribution of employees' distance from home in the dataset
Q4: What is the distribution of employee years at the company? Are there more new employees or experienced employees?
#distribution of employees' years at the company: new vs experienced employees
ggplot(IBMDATA, aes(x = YearsAtCompany)) +
geom_histogram(bins = 15, fill = "purple", color = "black") +
labs(title = "Years at Company Distribution", x = "Years at Company", y = "Count")
#this code shows the distribution of years at the company, i.e. whether new or experienced employees are more common in the dataset
Q5: What is the distribution of employees in how many companies they have worked for? Are there more employees who have worked for many companies or just a few?
#distribution of the number of companies employees have worked for
ggplot(IBMDATA, aes(x = NumCompaniesWorked)) +
geom_histogram(bins = 10, fill = "green", color = "black") +
labs(title = "Number of Companies Worked", x = "Number of Companies Worked",y = "Count")
#this code shows the distribution of the number of companies employees have worked for in the dataset
Bar graphs
Q6: How many employees are there in each Department? Which department has the most employees?
#How many employees are there in each Department?
ggplot(IBMDATA, aes(x = Department)) +
geom_bar(fill = "grey", color = "black") +
labs(title = "Number of Employees in Each Department", x = "Department", y = "Count")
#this code shows how many employees there are in each Department in the dataset
Q7: What is the count of employees with Attrition? How many employees have left the company?
#What is the count of employees with Attrition?
ggplot(IBMDATA, aes(x = Attrition)) +
geom_bar(fill = "red", color = "black") +
labs(title = "Count of Employees with Attrition", x = "Attrition", y = "Count")
#this code shows how many employees have left the company (Attrition = Yes) and how many have stayed (Attrition = No)
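The bar chart shows counts. The same question can also be answered numerically as a rate. The following is a minimal sketch on a made-up vector (the values are illustrative, not taken from the IBM dataset); `attrition`, `counts`, and `shares` are hypothetical names.

```r
# Made-up attrition labels (not the IBM dataset).
attrition <- factor(c("No", "No", "Yes", "No", "Yes", "No", "No", "No"))

# Counts per category, as plotted by geom_bar().
counts <- table(attrition)
print(counts)

# Share of each category, expressed as a percentage.
shares <- prop.table(counts)
print(round(100 * shares, 1))
```

On the real data the equivalent call would be `prop.table(table(IBMDATA$Attrition))`.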
Q8: What is the count of employees in each Education Field? Which education field has the most employees?
#What is the count of employees in each Education Field?
ggplot(IBMDATA, aes(x = EducationField)) +
geom_bar(fill = "pink", color = "black") +
labs(title = "Count of Employees in Each Education Field", x = "Education Field",
y = "Count")
#this code shows the count of employees in each Education Field in the dataset
Q9: What is the count of employees in each Marital Status category? Which marital status category has the most employees?
#What is the count of employees in each Marital Status category?
ggplot(IBMDATA, aes(x = MaritalStatus)) +
geom_bar(fill = "cyan", color = "black") +
labs(title = "Count of Employees in Each Marital Status Category", x = "Marital Status",
y = "Count")
#this code shows the count of employees in each Marital Status category in the dataset
Q10: What is the distribution of Work-Life Balance ratings? Are there more employees with good work-life balance or poor work-life balance?
#What is the distribution of Work-Life Balance ratings?
ggplot(IBMDATA, aes(x = WorkLifeBalance)) +
geom_bar(fill = "yellow", color = "black") +
labs(title = "Distribution of Work-Life Balance", x = "Work-Life Balance",y = "Count")
#this code shows the distribution of Work-Life Balance ratings in the dataset
Scatter plot
Q11: Is there a relationship between Age and Monthly Income? Do older employees tend to have higher monthly income?
#Is there a relationship between Age and Monthly Income?
ggplot(IBMDATA, aes(x = Age, y = MonthlyIncome)) +
geom_point(color = "blue") +
labs(title = "Relationship between Age and Monthly Income", x = "Age", y ="Monthly Income")
#this code shows the relationship between Age and Monthly Income, i.e. whether older
#employees tend to have a higher monthly income
Q12: Is there a relationship between Years at Company and Monthly Income? Do employees with more years at the company tend to have higher monthly income?
#Do more years in company increase income?
ggplot(IBMDATA, aes(x = YearsAtCompany, y = MonthlyIncome)) +
geom_point(color = "green") +
labs(title = "Relationship between Years at Company and Monthly Income", x = "Years at Company",
y = "Monthly Income")
#this code shows the relationship between Years at Company and Monthly Income, i.e. whether income rises with more years at the company
Q13: Is there a relationship between Distance from Home and Job Satisfaction? Do employees who live farther from work tend to have lower job satisfaction?
#Does distance from home affect job satisfaction?
ggplot(IBMDATA, aes(x = DistanceFromHome, y = JobSatisfaction)) +
geom_point(color = "orange") +
labs(title = "Relationship between Distance from Home and Job Satisfaction", x = "Distance from Home",y = "Job Satisfaction")
#this code shows the relationship between Distance from Home and Job Satisfaction, i.e. whether job satisfaction changes with distance from home
Q14: Is there a relationship between Number of Companies Worked and Monthly Income? Do employees who have worked for more companies tend to have higher monthly income?
#Does working in more companies affect income?
ggplot(IBMDATA, aes(x = NumCompaniesWorked, y = MonthlyIncome)) +
geom_point(color = "purple") +
labs(title = "Relationship between Number of Companies Worked and Monthly Income", x = "Number of Companies Worked", y = "Monthly Income")
#this code shows the relationship between Number of Companies Worked and Monthly Income in the dataset
Q15: Is there a relationship between Work-Life Balance and Job Satisfaction? Do employees with better work-life balance tend to have higher job satisfaction?
#Does better work-life balance increase job satisfaction?
ggplot(IBMDATA, aes(x = WorkLifeBalance, y = JobSatisfaction)) +
geom_point(color = "cyan") +
labs(title = "Relationship between Work-Life Balance and Job Satisfaction", x = "Work-Life Balance", y = "Job Satisfaction")
#this code shows the relationship between Work-Life Balance and Job Satisfaction in the dataset
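Both Work-Life Balance and Job Satisfaction take only the values 1 to 4, so a scatter plot can show at most 16 distinct points, with many employees stacked on each one. A cross-tabulation makes those hidden counts visible. This is a sketch on made-up ratings; `toy`, `combo_counts`, and `mean_js` are hypothetical names.

```r
# Made-up ratings on a 1-4 scale (not the IBM dataset).
toy <- data.frame(
  WorkLifeBalance = c(1, 2, 3, 3, 3, 4, 2, 3),
  JobSatisfaction = c(4, 2, 3, 3, 1, 4, 2, 3)
)

# Number of employees at each combination of the two ratings.
combo_counts <- table(toy$WorkLifeBalance, toy$JobSatisfaction,
                      dnn = c("WorkLifeBalance", "JobSatisfaction"))
print(combo_counts)

# Average job satisfaction within each work-life balance level.
mean_js <- tapply(toy$JobSatisfaction, toy$WorkLifeBalance, mean)
print(mean_js)
```

If a plot is still preferred, replacing `geom_point()` with `geom_jitter()` in ggplot2 is a common way to spread the stacked points apart.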
Q16: What is the cumulative distribution of Monthly Income? What proportion of employees earn at or below a given income level?
#CDF of Monthly Income
plot(
ecdf(IBMDATA$MonthlyIncome),
main = "CDF of Monthly Income",
xlab = "Income",
ylab = "Cumulative Probability",
col = "blue"
)
#this code creates a cumulative distribution function (CDF) plot for the Monthly Income variable in the dataset, showing the cumulative probability of different income levels among employees.
Q17: Is there a relationship between Department and Monthly Income? Do certain departments tend to have higher or lower income levels?
#Box Plot of Income by Department
ggplot(IBMDATA,
aes(x = Department,
y = MonthlyIncome,
fill = Department)) +
geom_boxplot() +
labs(
title = "Income by Department",
x = "Department",
y = "Income"
)
#this code creates a box plot to compare the distribution of Monthly Income across different Departments in the dataset, allowing us to see if certain departments have higher or lower income levels.
Q18: Is there a relationship between Age and Attrition? Do employees who leave tend to be younger or older than those who stay?
#Box Plot of Age by Attrition
ggplot(IBMDATA,
aes(x = Attrition,
y = Age,
fill = Attrition)) +
geom_boxplot() +
labs(
title = "Age vs Attrition",
x = "Attrition",
y = "Age"
)
#this code creates a box plot to compare the distribution of Age between employees who left the company (Attrition = Yes) and those who stayed (Attrition = No), helping us understand if age is a factor in employee attrition.
Q19: Are there statistically significant differences in Monthly Income across different Departments?
#ANOVA: Income vs Department
anova_dept <- aov(
MonthlyIncome ~ Department,
data = IBMDATA
)
summary(anova_dept)
## Df Sum Sq Mean Sq F value Pr(>F)
## Department 2 1.415e+08 70754962 3.202 0.041 *
## Residuals 1467 3.242e+10 22098613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
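#this code performs a one-way ANOVA test to determine if there are statistically significant differences in Monthly Income across different Departments in the dataset. With F = 3.202 and p = 0.041, the differences are significant at the 5% level but not at the 1% level.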
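A significant ANOVA result says that at least one department differs, but not which ones. A common follow-up is Tukey's HSD test, which compares every pair of groups. The following is a sketch on simulated data, so the numbers it prints say nothing about the IBM dataset; `toy`, `fit`, and `tukey` are hypothetical names.

```r
# Simulated incomes for three departments (not the IBM dataset).
set.seed(42)
toy <- data.frame(
  Department = factor(rep(c("HR", "R&D", "Sales"), each = 20)),
  MonthlyIncome = c(rnorm(20, 6000, 1500),
                    rnorm(20, 6300, 1500),
                    rnorm(20, 7000, 1500))
)

# One-way ANOVA, as in the analysis above.
fit <- aov(MonthlyIncome ~ Department, data = toy)

# Pairwise comparisons with adjusted p-values and confidence intervals.
tukey <- TukeyHSD(fit)
print(tukey)
```

With three departments there are three pairwise comparisons, each reported with a difference in means, a confidence interval, and an adjusted p-value.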
Q20: Are there statistically significant differences in Job Satisfaction across different levels of Work-Life Balance?
#ANOVA: Job Satisfaction vs Work-Life Balance
anova_wlb <- aov(
JobSatisfaction ~ WorkLifeBalance,
data = IBMDATA
)
summary(anova_wlb)
## Df Sum Sq Mean Sq F value Pr(>F)
## WorkLifeBalance 1 0.7 0.6765 0.556 0.456
## Residuals 1468 1786.0 1.2166
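#this code tests whether Job Satisfaction is related to Work-Life Balance. Because WorkLifeBalance is stored as an integer, aov() treats it as a numeric predictor (Df = 1), so this is a test for a linear trend rather than a comparison of the four levels as separate groups. With p = 0.456, no significant relationship is found.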
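The ANOVA table above reports Df = 1 for WorkLifeBalance, which is what `aov()` produces for a numeric predictor. To compare the rating levels as groups, the predictor can be wrapped in `factor()`. This sketch uses randomly generated ratings, not the IBM dataset; `toy`, `fit_numeric`, and `fit_factor` are hypothetical names.

```r
# Randomly generated 1-4 ratings (not the IBM dataset).
set.seed(1)
toy <- data.frame(
  WorkLifeBalance = sample(1:4, 200, replace = TRUE),
  JobSatisfaction = sample(1:4, 200, replace = TRUE)
)

# Numeric predictor: a single slope, so one degree of freedom.
fit_numeric <- aov(JobSatisfaction ~ WorkLifeBalance, data = toy)

# Factor predictor: one group per level, so (levels - 1) degrees of freedom.
fit_factor <- aov(JobSatisfaction ~ factor(WorkLifeBalance), data = toy)

df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor <- summary(fit_factor)[[1]][["Df"]][1]
print(c(numeric = df_numeric, factor = df_factor))
```

Job Satisfaction is itself an ordinal 1 to 4 rating, so a rank-based alternative such as `kruskal.test()` may also be worth considering.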
Q21: What is the strength and direction of the relationship between Age and Monthly Income? Do older employees tend to have higher or lower income?
#Correlation: Age & Income
cor(
IBMDATA$Age,
IBMDATA$MonthlyIncome,
use = "complete.obs"
)
## [1] 0.4978546
#this code calculates the correlation coefficient between Age and Monthly Income in the dataset. A value of about 0.50 indicates a moderate positive relationship: older employees tend to have higher monthly income.
Q22: What is the strength and direction of the relationship between Years at Company and Monthly Income? Do employees with more years at the company tend to have higher or lower income?
#Correlation: Years at Company & Income
cor(
IBMDATA$YearsAtCompany,
IBMDATA$MonthlyIncome,
use = "complete.obs"
)
## [1] 0.5142848
#this code calculates the correlation coefficient between Years at Company and Monthly Income in the dataset. A value of about 0.51 indicates a moderate positive relationship: employees with more years at the company tend to have higher monthly income.
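`cor()` returns only the coefficient. When a p-value and a confidence interval are also wanted, `cor.test()` reports all three. The following is a sketch on simulated data (the numbers are not from the IBM dataset); `years`, `income`, and `res` are hypothetical names.

```r
# Simulated tenure and income values (not the IBM dataset).
set.seed(7)
years <- rpois(100, 7)
income <- 3700 + 400 * years + rnorm(100, sd = 4000)

# Pearson correlation with a significance test and 95% confidence interval.
res <- cor.test(years, income)

print(res$estimate)  # same coefficient that cor() returns
print(res$conf.int)  # 95% confidence interval for the coefficient
print(res$p.value)   # p-value for the null hypothesis of zero correlation
```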
Q23: How are Age, Monthly Income, Years at Company, and Job Satisfaction correlated with one another? Which pairs show the strongest relationships?
#Correlation Matrix
numeric_data <- IBMDATA %>%
select(Age, MonthlyIncome, YearsAtCompany, JobSatisfaction)
cor(numeric_data)
## Age MonthlyIncome YearsAtCompany JobSatisfaction
## Age 1.000000000 0.497854567 0.311308770 -0.004891877
## MonthlyIncome 0.497854567 1.000000000 0.514284826 -0.007156742
## YearsAtCompany 0.311308770 0.514284826 1.000000000 -0.003802628
## JobSatisfaction -0.004891877 -0.007156742 -0.003802628 1.000000000
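#this code creates a correlation matrix for the selected numeric variables (Age, Monthly Income, Years at Company, and Job Satisfaction) in the dataset, showing the pairwise correlations between these variables. Job Satisfaction is essentially uncorrelated with the other three (all coefficients are below 0.01 in absolute value).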
Q24: How much does Monthly Income change with each additional year at the company? How much of the variation in income does Years at Company explain?
#Regression: Income vs Years at Company
model1 <- lm(
MonthlyIncome ~ YearsAtCompany,
data = IBMDATA
)
summary(model1)
##
## Call:
## lm(formula = MonthlyIncome ~ YearsAtCompany, data = IBMDATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9504 -2499 -1188 1393 15484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3733.3 160.1 23.32 <2e-16 ***
## YearsAtCompany 395.2 17.2 22.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4039 on 1468 degrees of freedom
## Multiple R-squared: 0.2645, Adjusted R-squared: 0.264
## F-statistic: 527.9 on 1 and 1468 DF, p-value: < 2.2e-16
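#this code fits a linear regression model to predict Monthly Income based on Years at Company, and then summarizes the results. Each additional year at the company is associated with about 395 more in monthly income, and the model explains about 26% of the variation in income (R-squared = 0.2645).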
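The fitted line can be checked by hand from the coefficients printed above. This short worked example uses the reported intercept and slope, rounded as shown in the output; the chosen tenures of 0, 5, and 10 years are illustrative. It also confirms that, for a single predictor, R-squared equals the square of the correlation computed in Q22.

```r
# Coefficients as reported in the model1 summary (rounded as printed).
intercept <- 3733.3
slope <- 395.2

# Fitted monthly income at 0, 5, and 10 years at the company.
years <- c(0, 5, 10)
predicted <- intercept + slope * years
print(predicted)

# With one predictor, R-squared is the squared Pearson correlation.
r <- 0.5142848
r_squared <- r^2
print(round(r_squared, 4))
```

With the fitted model object available, `predict(model1, newdata = data.frame(YearsAtCompany = years))` would give the same fitted values without rounding error.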
Q25: How well does the fitted regression line describe the relationship between Years at Company and Monthly Income when drawn over the raw data?
#Regression Plot
ggplot(IBMDATA,
aes(x = YearsAtCompany,
y = MonthlyIncome)) +
geom_point(color = "blue") +
geom_smooth(
method = "lm",
se = FALSE,
color = "red"
) +
labs(
title = "Years vs Income",
x = "Years",
y = "Income"
)
## `geom_smooth()` using formula = 'y ~ x'
#this code creates a scatter plot of Years at Company vs Monthly Income, with a linear regression line added to visualize the relationship between these two variables in the dataset.
Q26: Can we predict Monthly Income based on Age and Years at Company? How well do these variables explain the variation in income?
#Multiple Regression
model2 <- lm(
MonthlyIncome ~ YearsAtCompany + Age,
data = IBMDATA
)
summary(model2)
##
## Call:
## lm(formula = MonthlyIncome ~ YearsAtCompany + Age, data = IBMDATA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10339.9 -2284.3 -505.4 1618.8 13968.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2756.47 399.62 -6.898 7.83e-12 ***
## YearsAtCompany 305.73 16.48 18.554 < 2e-16 ***
## Age 192.74 11.05 17.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3677 on 1467 degrees of freedom
## Multiple R-squared: 0.3908, Adjusted R-squared: 0.39
## F-statistic: 470.6 on 2 and 1467 DF, p-value: < 2.2e-16
#this code fits a multiple linear regression model to predict Monthly Income based on both Years at Company and Age, and then summarizes the results to understand the combined effect of these two variables on income.
Q27: What Monthly Income does the multiple regression model predict for each employee?
#Predict Income Values
predicted_income <- predict(model2)
head(predicted_income)
## 1 2 3 4 5 6
## 6980.354 9745.226 4374.982 6049.887 3059.031 5551.411
#this code uses the fitted multiple regression model (model2) to predict Monthly Income values based on the Years at Company and Age variables, and then displays the first few predicted income values.
Q28: How accurate are the predictions of the multiple regression model? How do the predicted incomes compare with the actual incomes?
#Actual vs Predicted Income
actual_pred <- data.frame(
Actual = IBMDATA$MonthlyIncome,
Predicted = predicted_income
)
ggplot(actual_pred,
aes(x = Actual,
y = Predicted)) +
geom_point(color = "green") +
labs(
title = "Actual vs Predicted Income"
)
#this code creates a scatter plot comparing the actual Monthly Income values to the predicted values from the multiple regression model, allowing us to visually assess the accuracy of the predictions.
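A visual comparison can be backed by a numeric error measure. The following is a sketch that computes the root mean squared error (RMSE) and mean absolute error (MAE) for a model fitted to simulated data. The data, the coefficients used to generate it, and the names `toy`, `fit`, `rmse`, and `mae` are all assumptions for illustration.

```r
# Simulated data loosely shaped like the income model (not the IBM dataset).
set.seed(3)
n <- 50
years <- runif(n, 0, 20)
age <- runif(n, 20, 60)
income <- 300 * years + 200 * age + rnorm(n, sd = 3000)
toy <- data.frame(years, age, income)

# Fit a two-predictor linear model and get in-sample predictions.
fit <- lm(income ~ years + age, data = toy)
predicted <- predict(fit)

# Typical size of the prediction error, in income units.
rmse <- sqrt(mean((toy$income - predicted)^2))
mae <- mean(abs(toy$income - predicted))
print(c(RMSE = rmse, MAE = mae))
```

The in-sample RMSE is slightly smaller than the residual standard error printed by `summary()`, because the latter divides by the residual degrees of freedom instead of the number of rows.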
Q29: What is the distribution of Monthly Income among employees? Are there more low-income or high-income employees?
#Density Plot of Income
ggplot(IBMDATA,
aes(x = MonthlyIncome)) +
geom_density(fill = "lightblue") +
labs(
title = "Density of Income",
x = "Income",
y = "Density"
)
#this code creates a density plot for the Monthly Income variable, showing the distribution of income levels among employees in the dataset.
Q30: Is the distribution of Monthly Income approximately normal? Can we use a QQ plot to assess the normality of the income distribution?
#QQ Plot of Income
qqnorm(IBMDATA$MonthlyIncome)
qqline(IBMDATA$MonthlyIncome, col = "red")
#this code creates a QQ plot for the Monthly Income variable to assess the normality of the income distribution, with a reference line added to help visualize deviations from normality.
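A QQ plot is judged by eye. A formal check such as the Shapiro-Wilk test can complement it. This sketch runs the test on simulated right-skewed values and on their logarithm. The data is generated from a log-normal distribution purely for illustration, and `income`, `sw_raw`, and `sw_log` are hypothetical names.

```r
# Simulated right-skewed incomes (not the IBM dataset).
set.seed(1)
income <- rlnorm(200, meanlog = 8.5, sdlog = 0.7)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw_raw <- shapiro.test(income)
print(sw_raw$p.value)

# The same test after a log transform, which often reduces right skew.
sw_log <- shapiro.test(log(income))
print(sw_log$p.value)
```

`shapiro.test()` accepts between 3 and 5000 values, so it can be applied to the 1470 incomes in this dataset. With samples that large, even small departures from normality tend to give small p-values, so the QQ plot remains useful for judging how large the departure is.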
THANKS