Capstone Project Report

Analysis of the IBM-HR Dataset

About IBM:

IBM (International Business Machines Corporation) is an American multinational technology company headquartered in Armonk, New York, United States, with operations in over 170 countries. IBM has a large and diverse portfolio of products and services. As of 2016, these offerings fall into the categories of cloud computing, cognitive computing, commerce, data and analytics, Internet of Things, IT infrastructure, mobile, and security.

IBM Cloud includes infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS) offered through public, private and hybrid cloud delivery models. For instance, the IBM Bluemix PaaS enables developers to quickly create complex websites on a pay-as-you-go model. IBM SoftLayer is a dedicated server, managed hosting and cloud computing provider, which in 2011 reported hosting more than 81,000 servers for more than 26,000 customers. IBM also provides Cloud Data Encryption Services (ICDES), using cryptographic splitting to secure customer data.

Hardware designed by IBM for these categories includes IBM’s POWER microprocessors, which are employed inside many console gaming systems, including Xbox 360, PlayStation 3, and Nintendo’s Wii U. IBM Secure Blue is encryption hardware that can be built into microprocessors. In 2014, the company revealed it was investing $3 billion over the following five years to design a neural chip that mimics the human brain, with 10 billion neurons and 100 trillion synapses, but that uses just 1 kilowatt of power. In 2016, the company launched all-flash arrays designed for small and midsized companies, which include software for data compression, provisioning, and snapshots across various systems.

IBM headquarters in Armonk, New York

IBM Employees:

IBM has one of the largest workforces in the world, and employees at Big Blue are referred to as “IBMers”. The company was among the first corporations to provide group life insurance, survivor benefits, training for women, paid vacations, and training for disabled people. IBM has several leadership development and recognition programs to recognize employee potential and achievements. For early-career high-potential employees, IBM sponsors leadership development programs by discipline (e.g., general management, human resources, finance). Each year, the company also selects 500 IBMers for the IBM Corporate Service Corps, which has been described as the corporate equivalent of the Peace Corps and gives top employees a month to do humanitarian work abroad. For certain interns, IBM also has a program called Extreme Blue that partners top business and technical students to develop high-value technology and compete to present their business case to the company’s CEO at the internship’s end.

Employees

The company also has various designations for exceptional individual contributors such as Senior Technical Staff Member, Research Staff Member, Distinguished Engineer, and Distinguished Designer. The company’s most prestigious designation is that of IBM Fellow.

Overview of the dataset:

This dataset provides information about the factors that lead to employee attrition and helps us answer questions such as “How does the distance from home affect the job involvement of an employee?”, “What role does the job environment play in determining job satisfaction?”, and “How are the hourly rate and income related?”.

A survey was carried out, and the information is laid out as a dataset consisting of rows and columns (1470 rows and 35 columns).

A description of some relevant columns follows:

1. Age: Age of the employee in years.

2. Attrition: ‘Yes’ if the employee has left the company, ‘No’ if the employee is still with the company.

3. BusinessTravel: How frequently the employee has to travel on business.

4. DailyRate: The employee’s daily rate.

5. Department: The section of the company in which the employee works.

6. DistanceFromHome: How far the employee lives from the workplace.

7. Education: 1 ‘Below College’, 2 ‘College’, 3 ‘Bachelor’, 4 ‘Master’, 5 ‘Doctor’.

8. EducationField: Field of the employee’s qualification.

9. EnvironmentSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’.

10. Gender: Sex of the employee, ‘Male’ or ‘Female’.

11. JobInvolvement: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’.

12. JobSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’.

13. PerformanceRating: 1 ‘Low’, 2 ‘Good’, 3 ‘Excellent’, 4 ‘Outstanding’.

14. RelationshipSatisfaction: 1 ‘Low’, 2 ‘Medium’, 3 ‘High’, 4 ‘Very High’.

15. WorkLifeBalance: 1 ‘Bad’, 2 ‘Good’, 3 ‘Better’, 4 ‘Best’.

16. MaritalStatus: ‘Married’, ‘Single’ or ‘Divorced’.
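The coded columns above store integers, so it can help to attach the labels before tabulating or plotting. The following is a minimal sketch using the Education coding listed above; the vector `education_codes` is a made-up sample for illustration, not data from the survey.

```r
# Hypothetical sample of coded answers (illustrative, not from the dataset).
education_codes <- c(3, 1, 4, 2, 5, 3)

# Map the integer codes to ordered, labelled categories using the
# Education coding described in the column list.
education <- factor(education_codes,
                    levels = 1:5,
                    labels = c("Below College", "College", "Bachelor",
                               "Master", "Doctor"),
                    ordered = TRUE)

# Counts per labelled category.
table(education)
```

The same pattern would apply to the four-level satisfaction columns by swapping in their labels.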

Regression Model:

Regression Model 1

formula --> MonthlyIncome = Age + DistanceFromHome + RelationshipSatisfaction + EnvironmentSatisfaction + JobLevel + JobInvolvement + NumCompaniesWorked + WorkLifeBalance

# Read the data; "UTF-8-BOM" strips the byte-order mark that otherwise
# garbles the first column name (Age) into "ï..Age"
pdata <- read.csv(file="IBM-HR-Employee-Attrition.csv", fileEncoding="UTF-8-BOM")
# Drop the categorical, constant and identifier columns (by position)
MyData <- pdata[-c(2,3,5,8,9,10,12,16,18,22,23)]
attach(MyData)
# Model 1: MonthlyIncome regressed on eight predictors
M1 <- lm(MonthlyIncome~Age
                  +DistanceFromHome
                  +RelationshipSatisfaction
                  +EnvironmentSatisfaction
                  +JobLevel
                  +JobInvolvement
                  +NumCompaniesWorked
                  +WorkLifeBalance,
         data=MyData)
         
summary(M1)
## 
## Call:
## lm(formula = MonthlyIncome ~ Age + DistanceFromHome + RelationshipSatisfaction + 
##     EnvironmentSatisfaction + JobLevel + JobInvolvement + NumCompaniesWorked + 
##     WorkLifeBalance, data = MyData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5276.3  -946.0    98.8   809.2  4013.4 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -1773.325    293.459  -6.043 1.92e-09 ***
## Age                          7.659      5.050   1.517  0.12953    
## DistanceFromHome           -12.742      4.712  -2.704  0.00693 ** 
## RelationshipSatisfaction    20.083     35.399   0.567  0.57057    
## EnvironmentSatisfaction    -34.289     34.933  -0.982  0.32647    
## JobLevel                  4004.090     40.156  99.714  < 2e-16 ***
## JobInvolvement             -27.004     53.716  -0.503  0.61524    
## NumCompaniesWorked          19.109     16.033   1.192  0.23349    
## WorkLifeBalance            -33.512     54.170  -0.619  0.53624    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1463 on 1461 degrees of freedom
## Multiple R-squared:  0.904,  Adjusted R-squared:  0.9035 
## F-statistic:  1720 on 8 and 1461 DF,  p-value: < 2.2e-16

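The regression model above estimates the effect of several factors on monthly income using the simplest specification. We regressed MonthlyIncome on Age, DistanceFromHome, RelationshipSatisfaction, EnvironmentSatisfaction, JobLevel, JobInvolvement, NumCompaniesWorked and WorkLifeBalance, and estimated the model by linear least squares. Only JobLevel (p < 2e-16) and DistanceFromHome (p = 0.00693) are significant at the 5% level, and the model explains about 90% of the variance in monthly income (R-squared 0.904).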
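To see how the estimated coefficients of Model 1 translate into a predicted income, the sketch below plugs them into the linear formula by hand. The coefficients are copied from the summary output above; the employee profile is hypothetical and only serves as an illustration.

```r
# Coefficients copied from the summary(M1) output.
coefs <- c("(Intercept)" = -1773.325, Age = 7.659,
           DistanceFromHome = -12.742, RelationshipSatisfaction = 20.083,
           EnvironmentSatisfaction = -34.289, JobLevel = 4004.090,
           JobInvolvement = -27.004, NumCompaniesWorked = 19.109,
           WorkLifeBalance = -33.512)

# Hypothetical employee profile (illustrative values only).
employee <- c(Age = 35, DistanceFromHome = 5, RelationshipSatisfaction = 3,
              EnvironmentSatisfaction = 3, JobLevel = 2, JobInvolvement = 3,
              NumCompaniesWorked = 2, WorkLifeBalance = 3)

# Linear prediction: intercept plus the sum of coefficient * value.
predict_income <- function(profile) {
  unname(coefs["(Intercept)"] + sum(coefs[names(profile)] * profile))
}

base <- predict_income(employee)

# Same employee one job level higher; the difference is the JobLevel slope.
promoted <- employee
promoted["JobLevel"] <- promoted["JobLevel"] + 1
predict_income(promoted) - base
```

Holding everything else fixed, one additional job level changes the prediction by the JobLevel coefficient, which is why this variable dominates the fit.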

Regression Model 2

formula --> WorkLifeBalance = TotalWorkingYears + MonthlyIncome + MonthlyRate + RelationshipSatisfaction + JobLevel + PerformanceRating + PercentSalaryHike

# Model 2
M2 <- lm(WorkLifeBalance~TotalWorkingYears
                  +MonthlyIncome
                  +MonthlyRate
                  +RelationshipSatisfaction
                  +JobLevel
                  +PerformanceRating
                  +PercentSalaryHike
                ,
         data=MyData)
         
summary(M2)
## 
## Call:
## lm(formula = WorkLifeBalance ~ TotalWorkingYears + MonthlyIncome + 
##     MonthlyRate + RelationshipSatisfaction + JobLevel + PerformanceRating + 
##     PercentSalaryHike, data = MyData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8341 -0.7189  0.2198  0.2690  1.3540 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               2.601e+00  1.942e-01  13.391   <2e-16 ***
## TotalWorkingYears        -6.548e-03  3.853e-03  -1.699   0.0895 .  
## MonthlyIncome            -4.931e-06  1.273e-05  -0.387   0.6986    
## MonthlyRate               6.186e-07  2.593e-06   0.239   0.8115    
## RelationshipSatisfaction  1.274e-02  1.707e-02   0.746   0.4557    
## JobLevel                  7.957e-02  5.518e-02   1.442   0.1495    
## PerformanceRating         3.024e-02  8.073e-02   0.375   0.7080    
## PercentSalaryHike        -2.403e-03  7.961e-03  -0.302   0.7628    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7067 on 1462 degrees of freedom
## Multiple R-squared:  0.004147,   Adjusted R-squared:  -0.0006214 
## F-statistic: 0.8697 on 7 and 1462 DF,  p-value: 0.5298

The regression model above estimates the effect of TotalWorkingYears, MonthlyIncome, MonthlyRate, RelationshipSatisfaction, JobLevel, PerformanceRating and PercentSalaryHike on work life balance. We again estimated the model by linear least squares. None of the predictors is significant at the 5% level.

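Model 2 has a large overall p-value (0.5298) and an R-squared close to zero (0.004), so it explains almost none of the variation in work life balance. Model 1 has a very small p-value (< 2.2e-16) and a much higher R-squared (0.904).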
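The adjusted R-squared values reported for the two models can be checked from R-squared, the number of observations n and the number of predictors p, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). A minimal sketch, with the inputs taken from the two summaries above (the helper name `adjusted_r2` is ours):

```r
# Adjusted R-squared from R-squared, sample size n and predictor count p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 1470  # observations in the dataset

# Model 1: 8 predictors, R-squared 0.904 (as printed, rounded).
adjusted_r2(0.904, n, p = 8)

# Model 2: 7 predictors, R-squared 0.004147.
adjusted_r2(0.004147, n, p = 7)
```

For Model 2 the adjustment pushes the value below zero, which is another sign that the predictors add no explanatory power.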

Summary of Dataset

# Summarize the Data
library(psych)
describe(pdata)
##                          vars    n     mean      sd  median  trimmed
## Age                         1 1470    36.92    9.14    36.0    36.47
## Attrition*                  2 1470     1.16    0.37     1.0     1.08
## BusinessTravel*             3 1470     2.61    0.67     3.0     2.76
## DailyRate                   4 1470   802.49  403.51   802.0   803.83
## Department*                 5 1470     2.26    0.53     2.0     2.25
## DistanceFromHome            6 1470     9.19    8.11     7.0     8.08
## Education                   7 1470     2.91    1.02     3.0     2.98
## EducationField*             8 1470     3.25    1.33     3.0     3.10
## EmployeeCount               9 1470     1.00    0.00     1.0     1.00
## EmployeeNumber             10 1470  1024.87  602.02  1020.5  1023.40
## EnvironmentSatisfaction    11 1470     2.72    1.09     3.0     2.78
## Gender*                    12 1470     1.60    0.49     2.0     1.62
## HourlyRate                 13 1470    65.89   20.33    66.0    66.02
## JobInvolvement             14 1470     2.73    0.71     3.0     2.74
## JobLevel                   15 1470     2.06    1.11     2.0     1.90
## JobRole*                   16 1470     5.46    2.46     6.0     5.61
## JobSatisfaction            17 1470     2.73    1.10     3.0     2.79
## MaritalStatus*             18 1470     2.10    0.73     2.0     2.12
## MonthlyIncome              19 1470  6502.93 4707.96  4919.0  5667.24
## MonthlyRate                20 1470 14313.10 7117.79 14235.5 14286.48
## NumCompaniesWorked         21 1470     2.69    2.50     2.0     2.36
## Over18*                    22 1470     1.00    0.00     1.0     1.00
## OverTime*                  23 1470     1.28    0.45     1.0     1.23
## PercentSalaryHike          24 1470    15.21    3.66    14.0    14.80
## PerformanceRating          25 1470     3.15    0.36     3.0     3.07
## RelationshipSatisfaction   26 1470     2.71    1.08     3.0     2.77
## StandardHours              27 1470    80.00    0.00    80.0    80.00
## StockOptionLevel           28 1470     0.79    0.85     1.0     0.67
## TotalWorkingYears          29 1470    11.28    7.78    10.0    10.37
## TrainingTimesLastYear      30 1470     2.80    1.29     3.0     2.72
## WorkLifeBalance            31 1470     2.76    0.71     3.0     2.77
## YearsAtCompany             32 1470     7.01    6.13     5.0     5.99
## YearsInCurrentRole         33 1470     4.23    3.62     3.0     3.85
## YearsSinceLastPromotion    34 1470     2.19    3.22     1.0     1.48
## YearsWithCurrManager       35 1470     4.12    3.57     3.0     3.77
##                              mad  min   max range  skew kurtosis     se
## Age                         8.90   18    60    42  0.41    -0.41   0.24
## Attrition*                  0.00    1     2     1  1.84     1.39   0.01
## BusinessTravel*             0.00    1     3     2 -1.44     0.69   0.02
## DailyRate                 510.01  102  1499  1397  0.00    -1.21  10.52
## Department*                 0.00    1     3     2  0.17    -0.40   0.01
## DistanceFromHome            7.41    1    29    28  0.96    -0.23   0.21
## Education                   1.48    1     5     4 -0.29    -0.56   0.03
## EducationField*             1.48    1     6     5  0.55    -0.69   0.03
## EmployeeCount               0.00    1     1     0   NaN      NaN   0.00
## EmployeeNumber            790.97    1  2068  2067  0.02    -1.23  15.70
## EnvironmentSatisfaction     1.48    1     4     3 -0.32    -1.20   0.03
## Gender*                     0.00    1     2     1 -0.41    -1.83   0.01
## HourlyRate                 26.69   30   100    70 -0.03    -1.20   0.53
## JobInvolvement              0.00    1     4     3 -0.50     0.26   0.02
## JobLevel                    1.48    1     5     4  1.02     0.39   0.03
## JobRole*                    2.97    1     9     8 -0.36    -1.20   0.06
## JobSatisfaction             1.48    1     4     3 -0.33    -1.22   0.03
## MaritalStatus*              1.48    1     3     2 -0.15    -1.12   0.02
## MonthlyIncome            3260.24 1009 19999 18990  1.37     0.99 122.79
## MonthlyRate              9201.76 2094 26999 24905  0.02    -1.22 185.65
## NumCompaniesWorked          1.48    0     9     9  1.02     0.00   0.07
## Over18*                     0.00    1     1     0   NaN      NaN   0.00
## OverTime*                   0.00    1     2     1  0.96    -1.07   0.01
## PercentSalaryHike           2.97   11    25    14  0.82    -0.31   0.10
## PerformanceRating           0.00    3     4     1  1.92     1.68   0.01
## RelationshipSatisfaction    1.48    1     4     3 -0.30    -1.19   0.03
## StandardHours               0.00   80    80     0   NaN      NaN   0.00
## StockOptionLevel            1.48    0     3     3  0.97     0.35   0.02
## TotalWorkingYears           5.93    0    40    40  1.11     0.91   0.20
## TrainingTimesLastYear       1.48    0     6     6  0.55     0.48   0.03
## WorkLifeBalance             0.00    1     4     3 -0.55     0.41   0.02
## YearsAtCompany              4.45    0    40    40  1.76     3.91   0.16
## YearsInCurrentRole          4.45    0    18    18  0.92     0.47   0.09
## YearsSinceLastPromotion     1.48    0    15    15  1.98     3.59   0.08
## YearsWithCurrManager        4.45    0    17    17  0.83     0.16   0.09

Two-way contingency table based on Gender and Attrition

table1 <- xtabs(~ Gender + Attrition, data = pdata)
table1
##         Attrition
## Gender    No Yes
##   Female 501  87
##   Male   732 150

About 17% of male employees (150 of 882) show attrition, compared with about 15% of female employees (87 of 588).
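The attrition percentages per group can be computed directly from the counts with `prop.table`. The sketch below re-enters the Gender table by hand so that it runs without the CSV file; with the data loaded, `prop.table(table1, margin = 1)` would give the same result.

```r
# Counts copied from the Gender x Attrition table.
gender_attrition <- matrix(c(501,  87,
                             732, 150),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(Gender = c("Female", "Male"),
                                           Attrition = c("No", "Yes")))

# Row-wise proportions: share of "No" and "Yes" within each gender.
rates <- prop.table(gender_attrition, margin = 1)

# Express as percentages with one decimal place.
round(100 * rates, 1)
```

Using `margin = 1` normalises each row, which is what is needed to compare attrition rates between groups of different sizes.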

Two-way contingency table based on Work Life Balance and Attrition

table2 <- xtabs(~  WorkLifeBalance+ Attrition, data = pdata)
table2
##                Attrition
## WorkLifeBalance  No Yes
##               1  55  25
##               2 286  58
##               3 766 127
##               4 126  27

Employees with a work life balance of 3 (‘Better’) have the lowest attrition rate (127 of 893, about 14%), while those with a work life balance of 1 (‘Bad’) have the highest (25 of 80, about 31%).

Two-way contingency table based on Job Satisfaction and Attrition

table3 <- xtabs(~  JobSatisfaction+ Attrition, data = pdata)
table3
##                Attrition
## JobSatisfaction  No Yes
##               1 223  66
##               2 234  46
##               3 369  73
##               4 407  52

Groups with higher job satisfaction have a lower attrition rate: about 23% at level 1 (66 of 289) compared with about 11% at level 4 (52 of 459).
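Whether the differences between the job satisfaction groups go beyond chance can be checked with a chi-squared test of independence. This test is not part of the original analysis; the sketch below is one possible check, with the counts re-entered from the table above.

```r
# Counts copied from the JobSatisfaction x Attrition table.
job_sat <- matrix(c(223, 66,
                    234, 46,
                    369, 73,
                    407, 52),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(JobSatisfaction = 1:4,
                                  Attrition = c("No", "Yes")))

# Pearson chi-squared test of independence between the two variables.
test <- chisq.test(job_sat)

test$statistic  # chi-squared statistic
test$parameter  # degrees of freedom: (4 - 1) * (2 - 1)
test$p.value    # small values indicate the attrition rates differ by level
```

The same call could be applied to the other contingency tables in this section.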

Two-way contingency table based on Relationship Satisfaction and Attrition

table4 <- xtabs(~  RelationshipSatisfaction+ Attrition, data = pdata)
table4
##                         Attrition
## RelationshipSatisfaction  No Yes
##                        1 219  57
##                        2 258  45
##                        3 388  71
##                        4 368  64

Employees with the lowest relationship satisfaction have the highest attrition rate (57 of 276, about 21%), while the rate is about 15% at levels 2, 3 and 4.

Boxplots

boxplot(MonthlyIncome~WorkLifeBalance, data=pdata, horizontal=TRUE,
        xlab="Monthly income", ylab="Work life balance", las=1,
        col=c("red","blue","green","yellow"),
        main="Boxplot of monthly income by work life balance")

boxplot(PercentSalaryHike~JobSatisfaction, data=pdata, horizontal=TRUE,
        xlab="Percent salary Hike", las=1,
        col=c("red","blue","green","yellow"),
        main="boxplot of Percent Salary hike and Job Satisfaction")

Histograms

1) Work Life Balance

hist(pdata$WorkLifeBalance, 
     main="Histogram of Work Life balance",
     col=c("blue"),
     xlab="work life balance" )

The histogram shows that the largest group of employees (893 of 1470) reports a work life balance of 3 (‘Better’).
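Because work life balance takes only four coded values, a bar chart of the counts per level can be easier to read than a histogram. The sketch below is an alternative presentation, not the plot used in this report; the counts are the row totals of the Work Life Balance contingency table shown earlier.

```r
# Row totals of the WorkLifeBalance x Attrition table (No + Yes per level),
# labelled with the coding from the column description.
wlb_counts <- c(Bad = 55 + 25, Good = 286 + 58,
                Better = 766 + 127, Best = 126 + 27)

# One bar per level, so the categories are not binned like a continuous variable.
barplot(wlb_counts,
        main = "Employees by work life balance",
        col = "blue",
        xlab = "Work life balance",
        ylab = "Number of employees")
```

With the data loaded, `barplot(table(pdata$WorkLifeBalance))` would produce the same bars with the numeric codes as labels.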

2) Job Satisfaction

hist(pdata$JobSatisfaction, 
     main="Histogram of Job Satisfaction",
     col=c("yellow"),
     xlab="Job Satisfaction level" )

The histogram shows that the majority of employees report a higher level of job satisfaction (level 3 or 4).

3) Monthly Income

hist(pdata$MonthlyIncome, 
     main="Histogram of Monthly Income",
     col=c("green"),
     xlab="Monthly income in rupees" )

The majority of employees have a monthly income below 10,000 rupees.

Two way factorial ANOVA

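Parameters considered: work life balance, percent salary hike and job satisfaction. PercentSalaryHike is the response and WorkLifeBalance is treated as a factor, while JobSatisfaction enters as a numeric variable (1 degree of freedom in the table below).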

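# Treat work life balance as a categorical factor and pass the data explicitly
MyData$WorkLifeBalance <- factor(MyData$WorkLifeBalance)
fit <- aov(PercentSalaryHike~WorkLifeBalance*JobSatisfaction, data=MyData)
summary(fit)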
##                                   Df Sum Sq Mean Sq F value Pr(>F)
## WorkLifeBalance                    3     44  14.813   1.106  0.346
## JobSatisfaction                    1     10   9.966   0.744  0.389
## WorkLifeBalance:JobSatisfaction    3     34  11.458   0.855  0.464
## Residuals                       1462  19589  13.399
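In the table above, no term is significant at the 5% level (all p-values exceed 0.3). For a fully factorial two-way ANOVA, both variables would be converted to factors, giving 3 degrees of freedom for each main effect and 9 for the interaction. The sketch below illustrates this on simulated data, since it has to run without the CSV file; the data frame `sim` and its values are made up, with the mean and standard deviation chosen loosely from the summary of PercentSalaryHike.

```r
set.seed(1)

# Simulated balanced design: 10 hypothetical employees per combination
# of the four work life balance and four job satisfaction levels.
sim <- expand.grid(WorkLifeBalance = 1:4,
                   JobSatisfaction = 1:4,
                   replicate = 1:10)
sim$PercentSalaryHike <- rnorm(nrow(sim), mean = 15, sd = 3.7)

# Convert both coded variables to factors so each level gets its own effect.
sim$WorkLifeBalance <- factor(sim$WorkLifeBalance)
sim$JobSatisfaction <- factor(sim$JobSatisfaction)

# Two-way factorial ANOVA with interaction.
fit_sim <- aov(PercentSalaryHike ~ WorkLifeBalance * JobSatisfaction,
               data = sim)
anova_table <- summary(fit_sim)[[1]]
anova_table
```

On the real data the equivalent change would be to also wrap JobSatisfaction in `factor()` before calling `aov`.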

Result:

By using boxplots, contingency tables and histograms, we can deduce how variables such as work life balance, job involvement, monthly income and job satisfaction relate to employee attrition.

Conclusion:

From this analysis we can conclude that factors such as Age, Distance From Home, Percent Salary Hike, Environment Satisfaction, Hourly Rate, Job Involvement and Number of Companies Worked have an impact on the work life balance of employees and hence affect attrition. This can be seen from the boxplots and histograms.