This analysis is submitted as the Capstone Project for the internship pursued under the guidance of Prof. Sameer Mathur from Jan 1, 2018 to Jan 27, 2018.

1. Introduction

Robert Heinlein popularized the phrase “There is no such thing as a free lunch” in one of his science-fiction novels. The phrase holds in practice too: income is a prime motive for doing any work or providing any service, and every person expects compensation for their effort. Because everybody expects compensation, an organization needs criteria that judge, both qualitatively and quantitatively, how well a job has been done. Once performance has been assessed, the employee should be rewarded accordingly in order to improve retention and encourage enthusiasm for future projects. What counts as a fair reward is subjective, however, since what is worthwhile for one person can be peanuts for another. Hence there is a need to justify the reward against the effort the employee has put in. This is generally handled during the performance appraisal, in which the employee is rated on several parameters.

But how can these parameters be quantified and the extent of performance judged? One approach is to collect data on all employees and analyze it across the parameters of interest, so as to know who deserves what. Such an analysis can also help in identifying underpaid or overpaid employees and appraising them accordingly, in monetary budgeting, and in offering packages to new hires as per their worth. This project examines how an analysis of past data can be used to study the dependency of monthly income on different parameters and to estimate it statistically.

2. Overview of the Study

This study concerns people working in different companies and their job-related attributes. Its specific objective is to investigate the factors governing the monthly income of an employee: it analyzes how monthly income depends on different parameters and describes the pattern of that dependency. Our goal was to determine whether monthly income is in fact a function of these parameters. The parameters selected for the analysis are described in the next section of the report.

2.1 Data

The analysis uses open-source data available on the IBM Watson Analytics website: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

We have focused on analyzing the extent to which monthly income is governed by different parameters. The detailed description of each parameter is as follows:

  i) Gender: This is a qualitative variable which differentiates employees by gender. In the dataset, the field Gender holds this information, with Female and Male as entries that correspond to their usual meanings.

  1. Department: This is a qualitative variable which differentiates the employee based on the department to which they belong. In the dataset, field Department accounts for this information which has Human Resources, Research & Development and Sales as entries that correspond to their usual meanings.

  2. Business Travel Status: This is a qualitative variable which differentiates the employee based on the extent to which they travel. In the dataset, field BusinessTravel accounts for this information which has Travel_Rarely, Travel_Frequently and Non-Travel as entries that correspond to employees who travel rarely, travel frequently and do not travel respectively.

  3. Marital Status: This is a qualitative variable which differentiates the employee based on their marital status. In the dataset, field MaritalStatus accounts for this information which has Single, Married and Divorced as entries that correspond to their usual meanings.

  4. Job Role: This is a qualitative variable which differentiates the employee based on their role in the company. In the dataset, field JobRole accounts for this information which has Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive and Sales Representative as entries that correspond to their usual meanings.

  5. Age: This is a quantitative variable which records the data corresponding to the age of the employee. In the dataset, field Age accounts for this information which has the respective age of the employees as its entries.

  6. Education Level: This is a quantitative variable which records the education level of the employee. In the dataset, field Education accounts for this information, which is coded on a five-point scale from 1 to 5:

    1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor

  7. Job Level: This is a quantitative variable which records the job level of the employee. In the dataset, field JobLevel accounts for this information, which is coded on a five-point scale from 1 to 5:

    1 = Staff, 2 = Representative, 3 = Manager, 4 = Executive, 5 = Director

  8. Number of companies previously worked: This is a quantitative variable which records the data corresponding to the number of companies previously served by the employee. In the dataset, field NumCompaniesWorked accounts for this information which has the respective value of number of companies an employee has worked for, as its entries.

  9. Percentage of salary hike obtained: This is a quantitative variable which records the data corresponding to the percentage salary hike received by the employee. In the dataset, field PercentSalaryHike accounts for this information which has the respective value of percentage salary hike an employee has received, as its entries.

  10. Total Work Experience: This is a quantitative variable which records the data corresponding to the total work experience possessed by the employee. In the dataset, field TotalWorkingYears accounts for this information which has the respective value of no. of years an employee has experience of working, as its entries.

  11. Years working in the company: This is a quantitative variable which records the data corresponding to the total number of years the employee has served the company. In the dataset, field YearsAtCompany accounts for this information which has the respective value of no. of years an employee has served the company, as its entries.
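The coded scale used for Education Level (item 6) can be made more readable with an ordered factor. The following is a minimal sketch only: the vector `education_codes` is hypothetical and stands in for the Education column, and the report itself keeps Education as a numeric variable.

```r
# Hypothetical Education codes (1-5), standing in for monthinc.df$Education
education_codes <- c(3, 1, 5, 2, 4, 3)

# Labels taken from the scale described in item 6
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")

# An ordered factor keeps the ranking while showing readable labels
education <- factor(education_codes, levels = 1:5,
                    labels = education_labels, ordered = TRUE)

print(table(education))

# Comparisons respect the ordering: Bachelor ranks above Below College
print(education[1] > education[2])
```

Keeping the variable numeric, as the report does, treats each step on the scale as equally large; the ordered factor avoids that assumption at the cost of more coefficients in a regression.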

2.2 Hypothesis

The main intention of the analysis is to find whether the Monthly Income of an employee depends significantly on the aforementioned parameters. As every analysis commences by stating a null hypothesis, the hypotheses in this case are defined as:

H0: The average monthly income of the employee is independent of all the parameters.
H1: The average monthly income of the employee is dependent on at least one of the parameters.

2.3 Model

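A regression model was created for this analysis. Of the 12 candidate parameters described in Section 2.1, five (Gender, Business Travel Status, Job Role, Job Level and Total Work Experience) were retained in the final model, which estimates 13 coefficients in addition to the intercept. The regression equation for the model can be given as: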

Monthly Income = -111.39
    + 89.99 GenderMale
    + 451.61 BusinessTravelTravel_Frequently
    + 276.33 BusinessTravelTravel_Rarely
    - 237.67 JobRoleHuman Resources
    - 900.24 JobRoleLaboratory Technician
    + 4022.03 JobRoleManager
    - 356.71 JobRoleManufacturing Director
    + 3791.69 JobRoleResearch Director
    - 509.37 JobRoleResearch Scientist
    - 252.52 JobRoleSales Executive
    - 775.40 JobRoleSales Representative
    + 2779.29 JobLevel
    + 36.43 TotalWorkingYears

Each categorical term is a 0/1 indicator. The baseline categories, which have no term of their own, are Female for Gender, Non-Travel for BusinessTravel and Healthcare Representative for JobRole.
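To illustrate how the equation is applied, the sketch below plugs one employee profile into it. The coefficients are copied from the regression output in this section; the employee (a male Research Scientist who travels rarely, at Job Level 2 with 10 years of experience) is hypothetical and chosen only for the arithmetic.

```r
# Coefficients copied from the fitted model reported in this section
coefs <- c(
  "(Intercept)"                 = -111.387,
  "GenderMale"                  = 89.988,
  "BusinessTravelTravel_Rarely" = 276.331,
  "JobRoleResearch Scientist"   = -509.365,
  "JobLevel"                    = 2779.291,
  "TotalWorkingYears"           = 36.430
)

# Hypothetical employee: indicator terms are 1, JobLevel = 2, TotalWorkingYears = 10
x <- c(1, 1, 1, 1, 2, 10)

# Predicted monthly income is the sum of coefficient * value
predicted_income <- sum(coefs * x)
print(predicted_income)
```

All indicator terms that do not apply to this employee (for example JobRoleManager) are zero and are therefore left out of the sum.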

# Reading the file Monthly Income Project_data.csv
monthinc.df <- read.csv("Monthly Income Project_data.csv")
View(monthinc.df)

# Regression Analysis
regress3 <- lm(MonthlyIncome ~ Gender + BusinessTravel + JobRole + JobLevel + TotalWorkingYears , data = monthinc.df)

summary(regress3)
## 
## Call:
## lm(formula = MonthlyIncome ~ Gender + BusinessTravel + JobRole + 
##     JobLevel + TotalWorkingYears, data = monthinc.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3644.3  -645.1    53.5   660.8  4079.7 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -111.387    268.895  -0.414  0.67883    
## GenderMale                        89.988     83.764   1.074  0.28307    
## BusinessTravelTravel_Frequently  451.610    167.411   2.698  0.00716 ** 
## BusinessTravelTravel_Rarely      276.331    145.512   1.899  0.05798 .  
## JobRoleHuman Resources          -237.667    295.366  -0.805  0.42130    
## JobRoleLaboratory Technician    -900.236    195.818  -4.597 5.10e-06 ***
## JobRoleManager                  4022.027    248.321  16.197  < 2e-16 ***
## JobRoleManufacturing Director   -356.711    193.798  -1.841  0.06611 .  
## JobRoleResearch Director        3791.686    250.255  15.151  < 2e-16 ***
## JobRoleResearch Scientist       -509.365    195.868  -2.601  0.00951 ** 
## JobRoleSales Executive          -252.517    168.092  -1.502  0.13349    
## JobRoleSales Representative     -775.396    241.189  -3.215  0.00137 ** 
## JobLevel                        2779.291     96.307  28.859  < 2e-16 ***
## TotalWorkingYears                 36.430      8.802   4.139 3.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1080 on 686 degrees of freedom
## Multiple R-squared:  0.9472, Adjusted R-squared:  0.9462 
## F-statistic: 946.6 on 13 and 686 DF,  p-value: < 2.2e-16
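As a consistency check on the output above, the adjusted R-squared can be recomputed from the reported R-squared, the sample size and the number of estimated coefficients. This is a small sketch using only the figures printed in the summary (n = 700 rows, 13 coefficients besides the intercept).

```r
r_squared <- 0.9472   # Multiple R-squared from the summary
n <- 700              # number of observations
p <- 13               # coefficients excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 4))
```

The denominator n - p - 1 equals 686, which matches the residual degrees of freedom in the summary.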

2.4 Results

The analysis was carried out to study how the Monthly Income of an employee depends on different variables. The following insights were drawn from it:

1) Monthly Income depends very significantly on the Job Role and Job Level of the employee.
2) Monthly Income also depends significantly on Total Work Experience and on frequent business travel (relative to no travel). The Gender coefficient is not significant in the regression (p = 0.283).
3) Monthly Income has weak dependency on the Department and Marital Status of the employee.
4) After accounting for the variables in the final model, Monthly Income is almost independent of Age, Education Level, Number of companies previously served, Percent Salary Hike and Years served in the company.
5) Job Level contributes most to Monthly Income: each one-step increase in Job Level is associated with a rise of about 2780, holding the other variables fixed.
6) Relative to the baseline role (Healthcare Representative), Monthly Income is highest for Manager and Research Director and lowest for Laboratory Technician and Sales Representative.
7) Male employees earn only about 90 more than female employees with the same role, level, travel status and experience. This difference is not statistically significant, which is consistent with gender equality in pay within the company.
8) Employees who travel frequently earn about 175 more than those who travel rarely (451.61 - 276.33).
9) Monthly Income rises by about 36 for each additional year of Total Work Experience.
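One way to judge the size and reliability of individual effects is an approximate 95% confidence interval built from each estimate and its standard error. The sketch below uses the values printed in the regression output of Section 2.3 for GenderMale and TotalWorkingYears; it is an illustration of the calculation, not additional output from the original analysis.

```r
# Estimates and standard errors copied from the regression summary
estimates  <- c(GenderMale = 89.988, TotalWorkingYears = 36.430)
std_errors <- c(GenderMale = 83.764, TotalWorkingYears = 8.802)

# Critical t value for a two-sided 95% interval with 686 residual degrees of freedom
t_crit <- qt(0.975, df = 686)

ci <- cbind(lower = estimates - t_crit * std_errors,
            upper = estimates + t_crit * std_errors)
print(round(ci, 2))

# An interval that contains 0 means the effect is not significant at the 5% level
print(ci[, "lower"] < 0 & ci[, "upper"] > 0)
```

The interval for GenderMale spans zero, while the interval for TotalWorkingYears lies entirely above zero, in line with the p-values in the summary.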

3. Conclusion

This analysis was motivated by the need to explore how monthly income depends on different parameters, so that we could improve our understanding of what influences it and how it can be estimated for employees with different attributes and records. A notable finding is that the regression shows no significant dependency between monthly income and the gender of the employee, which is consistent with gender equality in pay. Total work experience is a significant predictor, but its contribution is small: about 36 per additional year. As expected, monthly income rises as Job Level increases, and the analysis confirms this.

4. Appendix

Reading the file Monthly Income Project_data.csv and measuring dimensions

monthinc.df <- read.csv("Monthly Income Project_data.csv")
View(monthinc.df)
dim(monthinc.df)
## [1] 700  14

Generating summary for the dataframe

library(psych)
describe(monthinc.df)
##                    vars   n    mean      sd median trimmed     mad  min
## Employee.ID           1 700 1477.59  280.39 1469.5 1475.78  355.08 1001
## Gender*               2 700    1.59    0.49    2.0    1.61    0.00    1
## Department*           3 700    2.27    0.52    2.0    2.26    0.00    1
## BusinessTravel*       4 700    2.63    0.64    3.0    2.78    0.00    1
## MaritalStatus*        5 700    2.09    0.74    2.0    2.12    1.48    1
## JobRole*              6 700    5.54    2.43    6.0    5.70    2.97    1
## Age                   7 700   36.83    9.32   35.5   36.36    9.64   18
## Education             8 700    2.88    1.03    3.0    2.94    1.48    1
## JobLevel              9 700    2.05    1.10    2.0    1.88    1.48    1
## NumCompaniesWorked   10 700    2.71    2.52    2.0    2.38    1.48    0
## PercentSalaryHike    11 700   15.28    3.72   14.0   14.88    2.97   11
## TotalWorkingYears    12 700   11.20    7.80    9.5   10.26    5.19    0
## YearsAtCompany       13 700    6.98    6.37    5.0    5.84    4.45    0
## MonthlyIncome        14 700 6419.27 4654.16 4887.0 5581.97 3220.95 1009
##                      max range  skew kurtosis     se
## Employee.ID         1976   975  0.05    -1.16  10.60
## Gender*                2     1 -0.37    -1.86   0.02
## Department*            3     2  0.24    -0.50   0.02
## BusinessTravel*        3     2 -1.53     1.01   0.02
## MaritalStatus*         3     2 -0.15    -1.16   0.03
## JobRole*               9     8 -0.39    -1.14   0.09
## Age                   60    42  0.44    -0.43   0.35
## Education              5     4 -0.25    -0.64   0.04
## JobLevel               5     4  1.05     0.50   0.04
## NumCompaniesWorked     9     9  1.03     0.01   0.10
## PercentSalaryHike     25    14  0.77    -0.48   0.14
## TotalWorkingYears     40    40  1.18     1.18   0.29
## YearsAtCompany        40    40  1.90     4.43   0.24
## MonthlyIncome      19999 18990  1.39     1.08 175.91

Creating one-way contingency tables for Gender and displaying the distribution on bar graph

table(monthinc.df$Gender)
## 
## Female   Male 
##    286    414
counts_gen <-table(monthinc.df$Gender)
barplot(counts_gen, width=1, space=1, main = "Gender Distribution", xlab="Gender",col=c( "pink","navy"), 
        ylim=c(0,450), xlim=c(0,10), names.arg=c("Female","Male"))
Axis(side=1, labels=FALSE)

Creating one-way contingency tables for Department and displaying the distribution on bar graph

table(monthinc.df$Department)
## 
##        Human Resources Research & Development                  Sales 
##                     24                    460                    216
counts_dep <-table(monthinc.df$Department)
barplot(counts_dep, width=1, space=1, main = "Department Distribution", xlab="Department",
        col=c("red","gold","purple"), ylim=c(0,500), xlim=c(0,10), names.arg=c("HR","R&D","Sales"))
Axis(side=1, labels=FALSE)

Creating one-way contingency tables for Business Travelling and displaying the distribution on bar graph

table(monthinc.df$BusinessTravel)
## 
##        Non-Travel Travel_Frequently     Travel_Rarely 
##                64               129               507
counts_bustr <-table(monthinc.df$BusinessTravel)
barplot(counts_bustr, width=1, space=1, main = "Business Travelling Status", xlab="Status",
        col=c("wheat","green1","tomato"), ylim=c(0,500), xlim=c(0,10), 
        names.arg=c(" Don't Travel","Frequently","Rarely"))
Axis(side=1, labels=FALSE)

Creating one-way contingency tables for Marital Status and displaying the distribution on bar graph

table(monthinc.df$MaritalStatus)
## 
## Divorced  Married   Single 
##      161      312      227
counts_marstat <-table(monthinc.df$MaritalStatus)
barplot(counts_marstat, width=1, space=1, main = "Marital Status", xlab="Status",
        col=c("white","firebrick","skyblue"), ylim=c(0,500), xlim=c(0,10), 
        names.arg=c("Divorced","Married","Single"))
Axis(side=1, labels=FALSE)

Creating one-way contingency tables for Job Role and displaying the distribution on bar graph

table(monthinc.df$JobRole)
## 
## Healthcare Representative           Human Resources 
##                        58                        19 
##     Laboratory Technician                   Manager 
##                       123                        51 
##    Manufacturing Director         Research Director 
##                        69                        39 
##        Research Scientist           Sales Executive 
##                       142                       155 
##      Sales Representative 
##                        44
counts_jobr <-table(monthinc.df$JobRole)
barplot(counts_jobr, width=1, space=1, main = "Job Role", xlab="Role", ylim=c(0,380), xlim=c(0,20), xaxt="n", col=c("violet","ivory","blue","green","yellow","darkorange","red","cyan","magenta"))
legend("topleft", c("Health Rep.","HR","Lab. Tech.","Manager","Manufac. Director","Research Director", "Research Scientist","Sales Executive","Sales Representative"), fill=c("violet","ivory","blue","green", "yellow","darkorange","red","cyan","magenta"))
Axis(side=1, labels=FALSE)

Creating two-way contingency tables for Gender & Department

two_way_tab <-xtabs(~ Gender + Department, data = monthinc.df)
addmargins(two_way_tab)
##         Department
## Gender   Human Resources Research & Development Sales Sum
##   Female               8                    185    93 286
##   Male                16                    275   123 414
##   Sum                 24                    460   216 700

Creating two-way contingency tables for Job Role & Business Travel Status

two_way_tab <-xtabs(~ JobRole + BusinessTravel, data = monthinc.df)
addmargins(two_way_tab)
##                            BusinessTravel
## JobRole                     Non-Travel Travel_Frequently Travel_Rarely Sum
##   Healthcare Representative          7                12            39  58
##   Human Resources                    0                 2            17  19
##   Laboratory Technician             13                26            84 123
##   Manager                            6                 7            38  51
##   Manufacturing Director             7                12            50  69
##   Research Director                  2                 5            32  39
##   Research Scientist                 7                23           112 142
##   Sales Executive                   21                30           104 155
##   Sales Representative               1                12            31  44
##   Sum                               64               129           507 700

Creating two-way contingency tables for Gender & Marital Status

two_way_tab <-xtabs(~ Gender + MaritalStatus, data = monthinc.df)
addmargins(two_way_tab)
##         MaritalStatus
## Gender   Divorced Married Single Sum
##   Female       61     122    103 286
##   Male        100     190    124 414
##   Sum         161     312    227 700

Creating two-way contingency tables for Department & Business Travel Status

two_way_tab <-xtabs(~ Department + BusinessTravel, data = monthinc.df)
addmargins(two_way_tab)
##                         BusinessTravel
## Department               Non-Travel Travel_Frequently Travel_Rarely Sum
##   Human Resources                 2                 3            19  24
##   Research & Development         39                83           338 460
##   Sales                          23                43           150 216
##   Sum                            64               129           507 700

Creating three-way contingency table for Department, Job Role & Gender

three_way_tab <- xtabs(~ Department + JobRole + Gender, data = monthinc.df)
ftable(three_way_tab)
##                                                  Gender Female Male
## Department             JobRole                                     
## Human Resources        Healthcare Representative             0    0
##                        Human Resources                       6   13
##                        Laboratory Technician                 0    0
##                        Manager                               2    3
##                        Manufacturing Director                0    0
##                        Research Director                     0    0
##                        Research Scientist                    0    0
##                        Sales Executive                       0    0
##                        Sales Representative                  0    0
## Research & Development Healthcare Representative            27   31
##                        Human Resources                       0    0
##                        Laboratory Technician                41   82
##                        Manager                              12   17
##                        Manufacturing Director               32   37
##                        Research Director                    15   24
##                        Research Scientist                   58   84
##                        Sales Executive                       0    0
##                        Sales Representative                  0    0
## Sales                  Healthcare Representative             0    0
##                        Human Resources                       0    0
##                        Laboratory Technician                 0    0
##                        Manager                              11    6
##                        Manufacturing Director                0    0
##                        Research Director                     0    0
##                        Research Scientist                    0    0
##                        Sales Executive                      63   92
##                        Sales Representative                 19   25

Box plot for Age

boxplot(monthinc.df$Age, horizontal = TRUE, main = "Box Plot for Age", xlab = "Age", col = "chocolate")

Box plot for Work Experience

boxplot(monthinc.df$TotalWorkingYears, horizontal = TRUE, main = "Box Plot for Work Experience", xlab = "Years", col = "gold")

Box plot for Monthly Income

boxplot(monthinc.df$MonthlyIncome, horizontal = TRUE, main = "Box Plot for Monthly Income", xlab = "Amount", col = "red")

Histogram for Percentage hike in the Salary

table(monthinc.df$PercentSalaryHike)
## 
##  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25 
## 103  95  94  92  47  39  34  45  37  25  25  29  17  12   6
hist(monthinc.df$PercentSalaryHike, 
     main="Variation in Percentage hike in the Salary",
     xlab="Percentage",
     ylab="Count",
     xlim=c(10,30), ylim=c(0,270), 
     breaks=5,             
     col=c("orange", "darkorange", "orangered", "orangered3", "red", "red3", "firebrick3", "firebrick"))       

Histogram for Years in the company

table(monthinc.df$YearsAtCompany)
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 24 25 
## 21 87 61 63 50 96 36 42 39 39 53 15  5 12  6  7  5  6  7  2 12  6  7  5  3 
## 26 27 29 31 32 33 34 36 37 40 
##  2  2  1  3  1  2  1  1  1  1
hist(monthinc.df$YearsAtCompany, 
     main="No. of years in the company",
     xlab="Years",
     ylab="Count",
     xlim=c(0,40), ylim=c(0,300), 
     breaks=10,             
     col=c("lightblue", "lightblue3", "royalblue", "blue", "royalblue4", "blue3", "dark blue", "navy"))       

Correlation matrix and Corrgram Generation

round(cor(monthinc.df[,7:14]), 3)
##                       Age Education JobLevel NumCompaniesWorked
## Age                 1.000     0.196    0.495              0.299
## Education           0.196     1.000    0.136              0.120
## JobLevel            0.495     0.136    1.000              0.121
## NumCompaniesWorked  0.299     0.120    0.121              1.000
## PercentSalaryHike  -0.054    -0.030   -0.096             -0.004
## TotalWorkingYears   0.675     0.157    0.780              0.220
## YearsAtCompany      0.304     0.086    0.566             -0.137
## MonthlyIncome       0.484     0.134    0.951              0.123
##                    PercentSalaryHike TotalWorkingYears YearsAtCompany
## Age                           -0.054             0.675          0.304
## Education                     -0.030             0.157          0.086
## JobLevel                      -0.096             0.780          0.566
## NumCompaniesWorked            -0.004             0.220         -0.137
## PercentSalaryHike              1.000            -0.096         -0.085
## TotalWorkingYears             -0.096             1.000          0.640
## YearsAtCompany                -0.085             0.640          1.000
## MonthlyIncome                 -0.093             0.763          0.543
##                    MonthlyIncome
## Age                        0.484
## Education                  0.134
## JobLevel                   0.951
## NumCompaniesWorked         0.123
## PercentSalaryHike         -0.093
## TotalWorkingYears          0.763
## YearsAtCompany             0.543
## MonthlyIncome              1.000
library(corrgram)
## Warning: replacing previous import by 'magrittr::%>%' when loading
## 'dendextend'
corrgram(monthinc.df, order=TRUE, lower.panel=panel.shade,upper.panel=panel.pie, text.panel=panel.txt,main="Corrgram  for different variables")
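A correlation is easier to interpret when squared, since the square gives the share of variance in Monthly Income that a single variable accounts for in a simple linear fit. The sketch below applies this to two correlations copied from the matrix above; it is a worked calculation, not part of the original analysis.

```r
# Correlations with MonthlyIncome, copied from the matrix above
r_joblevel   <- 0.951
r_experience <- 0.763

# Squared correlation = proportion of variance explained by one predictor alone
shared_variance <- c(JobLevel = r_joblevel^2,
                     TotalWorkingYears = r_experience^2)
print(round(shared_variance, 3))
```

Because JobLevel and TotalWorkingYears are themselves correlated (0.780), these shares overlap and cannot simply be added together.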

Scatterplot Matrix for different variables

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(~ Gender+ Department + MonthlyIncome + MaritalStatus + Age, data=monthinc.df,
main="Variation of Monthly Income with Gender, Department, 
Marital Status and  Age")

scatterplotMatrix(~ Education + JobLevel + MonthlyIncome + TotalWorkingYears + YearsAtCompany , data=monthinc.df, main="Variation of Monthly Income with Education, Job Level, 
Total Work Experience and Years in the Company")

Defining Null Hypothesis

The two-sample t-tests below compare the mean of Monthly Income with the mean of each quantitative variable, under the following hypotheses:

H0: There is no significant difference between the two means.
H1: There is a significant difference between the two means.

t-Test for dependency of Monthly Income on Age

t.test(monthinc.df$MonthlyIncome, monthinc.df$Age)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$Age
## t = 36.282, df = 699.01, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6037.062 6727.818
## sample estimates:
##  mean of x  mean of y 
## 6419.27429   36.83429

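-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Age differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.484).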

t-Test for dependency of Monthly Income on Education

t.test(monthinc.df$MonthlyIncome, monthinc.df$Education)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$Education
## t = 36.475, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6071.017 6761.771
## sample estimates:
## mean of x mean of y 
##  6419.274     2.880

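-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Education differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.134).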

t-Test for dependency of Monthly Income on Job Level

t.test(monthinc.df$MonthlyIncome, monthinc.df$JobLevel)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$JobLevel
## t = 36.48, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6071.850 6762.604
## sample estimates:
##   mean of x   mean of y 
## 6419.274286    2.047143

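-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Job Level differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.951).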

t-Test for dependency of Monthly Income on Number of companies worked earlier

t.test(monthinc.df$MonthlyIncome, monthinc.df$NumCompaniesWorked)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$NumCompaniesWorked
## t = 36.476, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6071.187 6761.941
## sample estimates:
## mean of x mean of y 
##  6419.274     2.710

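-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Number of companies worked earlier differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.123).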

t-Test for dependency of Monthly Income on Percentage Hike in salary

t.test(monthinc.df$MonthlyIncome, monthinc.df$PercentSalaryHike)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$PercentSalaryHike
## t = 36.405, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6058.619 6749.373
## sample estimates:
##  mean of x  mean of y 
## 6419.27429   15.27857

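-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Percentage Hike in salary differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = -0.093).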

t-Test for dependency of Monthly Income on Total Work Experience

t.test(monthinc.df$MonthlyIncome, monthinc.df$TotalWorkingYears)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$TotalWorkingYears
## t = 36.428, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6062.698 6753.453
## sample estimates:
##  mean of x  mean of y 
## 6419.27429   11.19857

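-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Total Work Experience differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.763).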

t-Test for dependency of Monthly Income on Years in the present company

t.test(monthinc.df$MonthlyIncome, monthinc.df$YearsAtCompany)
## 
##  Welch Two Sample t-test
## 
## data:  monthinc.df$MonthlyIncome and monthinc.df$YearsAtCompany
## t = 36.452, df = 699, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  6066.918 6757.673
## sample estimates:
##   mean of x   mean of y 
## 6419.274286    6.978571

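-> Since p-value < 2.2e-16, we reject the null hypothesis of equal means. This only shows that the means of Monthly Income and Years in the present company differ, which is expected because they are on different scales; the strength of their association is given by the correlation reported above (r = 0.543).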
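A two-sample t-test compares the means of two columns, so it does not test whether one variable depends on the other. A correlation test is one direct alternative. The sketch below uses simulated data, not the project dataset: the variables `experience` and `income` and the numbers that generate them are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated stand-ins for TotalWorkingYears and MonthlyIncome (hypothetical values)
n <- 200
experience <- runif(n, min = 0, max = 40)
income <- 2000 + 150 * experience + rnorm(n, sd = 1500)

# cor.test() tests H0: the true correlation between the two variables is zero
result <- cor.test(income, experience)

print(result$estimate)   # sample correlation
print(result$p.value)    # small p-value -> evidence of a linear association
```

On the project data the equivalent call would be `cor.test(monthinc.df$MonthlyIncome, monthinc.df$TotalWorkingYears)`, and similarly for the other quantitative variables.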

Converting variables into factors

# Converting Gender into factor variable 
monthinc.df$Gender[monthinc.df$Res == 0] <- 'Female'
monthinc.df$Gender[monthinc.df$Res == 1] <- 'Male'
monthinc.df$Gender<- factor(monthinc.df$Gender)

# Converting Department into factor variable 
monthinc.df$Department[monthinc.df$Res == 1] <- 'Human Resources'
monthinc.df$Department[monthinc.df$Res == 2] <- 'Research & Development'
monthinc.df$Department[monthinc.df$Res == 3] <- 'Sales'
monthinc.df$Department<- factor(monthinc.df$Department)

# Converting Business Travel Status into factor variable 
monthinc.df$BusinessTravel[monthinc.df$Res == 0] <- 'Non-Travel'
monthinc.df$BusinessTravel[monthinc.df$Res == 1] <- 'Travel_Frequently'
monthinc.df$BusinessTravel[monthinc.df$Res == 2] <- 'Travel_Rarely'
monthinc.df$BusinessTravel<- factor(monthinc.df$BusinessTravel)

# Converting Marital Status into factor variable 
monthinc.df$MaritalStatus[monthinc.df$Res == 0] <- 'Divorced'
monthinc.df$MaritalStatus[monthinc.df$Res == 1] <- 'Married'
monthinc.df$MaritalStatus[monthinc.df$Res == 2] <- 'Single'
monthinc.df$MaritalStatus<- factor(monthinc.df$MaritalStatus)

# Converting Job Role into factor variable
monthinc.df$JobRole[monthinc.df$JobRole == 1] <- 'Healthcare Representative'
monthinc.df$JobRole[monthinc.df$JobRole == 2] <- 'Human Resources'
monthinc.df$JobRole[monthinc.df$JobRole == 3] <- 'Laboratory Technician'
monthinc.df$JobRole[monthinc.df$JobRole == 4] <- 'Manager'
monthinc.df$JobRole[monthinc.df$JobRole == 5] <- 'Manufacturing Director'
monthinc.df$JobRole[monthinc.df$JobRole == 6] <- 'Research Director'
monthinc.df$JobRole[monthinc.df$JobRole == 7] <- 'Research Scientist'
monthinc.df$JobRole[monthinc.df$JobRole == 8] <- 'Sales Executive'
monthinc.df$JobRole[monthinc.df$JobRole == 9] <- 'Sales Representative'
monthinc.df$JobRole <- factor(monthinc.df$JobRole)
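
When a column really does hold numeric codes, the recode-then-factor steps above can be collapsed into a single `factor()` call with explicit `levels` and `labels`. The following is a minimal sketch on a made-up vector; `gender_code` is a hypothetical name, and the 0/1 coding is assumed from the conversion code above rather than confirmed from the data.

```r
# Hypothetical numeric codes, standing in for a coded Gender column
gender_code <- c(0, 1, 1, 0, 1)

# levels = the codes present in the data, labels = the names to display
gender <- factor(gender_code, levels = c(0, 1), labels = c("Female", "Male"))

levels(gender)  # "Female" "Male"
table(gender)   # counts per label
```

Any value not listed in `levels` becomes `NA`, which makes an unexpected code easy to spot.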

# Adding MonthlyIncome_Range variable
monthinc.df$MonthlyIncome_Range<-cut(monthinc.df$MonthlyIncome, seq(0,20000,4000), right=FALSE, labels=c(1:5))
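
To see how `cut()` assigns the five ranges, the short sketch below applies the same breaks to a few made-up incomes placed on and around the bin edges. The vector `income` is hypothetical and is not taken from the study data.

```r
# Hypothetical incomes chosen to sit on and around the bin edges
income <- c(0, 3999, 4000, 15999, 16000, 19999)

# right = FALSE makes each bin closed on the left: [0,4000), [4000,8000), ...
bins <- cut(income, seq(0, 20000, 4000), right = FALSE, labels = 1:5)

as.integer(bins)  # 1 1 2 4 5 5
```

With `right = FALSE` the last bin is [16000, 20000), so an income of exactly 20000 would fall outside every bin and be returned as `NA` unless `include.lowest = TRUE` is also set.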

Chi-square Test for dependency of Monthly Income on Gender

chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df$Gender)
## 
##  Pearson's Chi-squared test
## 
## data:  monthinc.df$MonthlyIncome_Range and monthinc.df$Gender
## X-squared = 9.9068, df = 4, p-value = 0.04203

-> Since p-value = 0.04203 is below 0.05, we reject the null hypothesis of independence. Gender is a significant contributor in estimating Monthly Income.
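
The X-squared statistic and degrees of freedom reported by `chisq.test()` can be reproduced by hand, which makes clear what the test measures. The sketch below uses a made-up 2 x 3 table of counts; the numbers and the name `tab` are hypothetical and do not come from the study data.

```r
# Hypothetical 2 x 3 table of counts (rows: gender, columns: income range)
tab <- matrix(c(30, 20, 10,
                25, 35, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              Range  = c("1", "2", "3")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic, its degrees of freedom, and the upper-tail p-value
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p  <- pchisq(x2, df, lower.tail = FALSE)

# The built-in test should report the same statistic and p-value
res <- chisq.test(tab)
res
```

The same formula gives df = (5 - 1) x (2 - 1) = 4 for the 5 income ranges and 2 genders tested above.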

Chi-square Test for dependency of Monthly Income on Department

chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df$Department)
## Warning in chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df
## $Department): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  monthinc.df$MonthlyIncome_Range and monthinc.df$Department
## X-squared = 63.027, df = 8, p-value = 1.183e-10

-> Since p-value = 1.183e-10 is far below 0.05, we reject the null hypothesis of independence. Department is a very significant contributor in estimating Monthly Income.
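
The warning "Chi-squared approximation may be incorrect" in the output above is raised when some expected cell counts are small. One way to check this, and to obtain a p-value that does not rely on the approximation, is sketched below on a made-up sparse table; `tab` and its counts are hypothetical, and the Monte Carlo option is offered as a possible cross-check, not as the method used in this report.

```r
# Hypothetical sparse table: some cells have very small counts
tab <- matrix(c(40, 3, 1,
                35, 6, 2,
                20, 9, 4),
              nrow = 3, byrow = TRUE)

# Expected counts; the warning appears when some of them are below 5
# (suppressWarnings only hides the warning while the table is inspected)
exp_counts <- suppressWarnings(chisq.test(tab))$expected
sum(exp_counts < 5)  # number of cells with small expected counts

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)
sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
sim
```

If the simulated p-value leads to the same decision as the asymptotic one, the warning is unlikely to change the conclusion.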

Chi-square Test for dependency of Monthly Income on Business Travel Status

chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df$BusinessTravel)
## Warning in chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df
## $BusinessTravel): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  monthinc.df$MonthlyIncome_Range and monthinc.df$BusinessTravel
## X-squared = 14.648, df = 8, p-value = 0.06636

-> Since p-value = 0.06636 is above 0.05, we fail to reject the null hypothesis of independence. Business Travel Status is not a significant contributor in estimating Monthly Income.

Chi-square Test for dependency of Monthly Income on Marital Status

chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df$MaritalStatus)
## 
##  Pearson's Chi-squared test
## 
## data:  monthinc.df$MonthlyIncome_Range and monthinc.df$MaritalStatus
## X-squared = 8.6358, df = 8, p-value = 0.3739

-> Since p-value = 0.3739 is above 0.05, we fail to reject the null hypothesis of independence. Marital Status is not a significant contributor in estimating Monthly Income.

Chi-square Test for dependency of Monthly Income on Job Role

chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df$JobRole)
## Warning in chisq.test(monthinc.df$MonthlyIncome_Range, monthinc.df
## $JobRole): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  monthinc.df$MonthlyIncome_Range and monthinc.df$JobRole
## X-squared = 1076, df = 32, p-value < 2.2e-16

-> Since p-value < 2.2e-16, we reject the null hypothesis of independence. Job Role is a very significant contributor in estimating Monthly Income.

Regression Analysis

# Model 1
regress1 <- lm(MonthlyIncome ~ Gender + Department + BusinessTravel + MaritalStatus + JobRole + Age + Education + JobLevel + NumCompaniesWorked + PercentSalaryHike + TotalWorkingYears + YearsAtCompany , data = monthinc.df)
summary(regress1)
## 
## Call:
## lm(formula = MonthlyIncome ~ Gender + Department + BusinessTravel + 
##     MaritalStatus + JobRole + Age + Education + JobLevel + NumCompaniesWorked + 
##     PercentSalaryHike + TotalWorkingYears + YearsAtCompany, data = monthinc.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3650.8  -640.1    39.5   668.0  4086.6 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -300.8557   668.5264  -0.450  0.65283    
## GenderMale                         88.5381    84.9566   1.042  0.29771    
## DepartmentResearch & Development   38.5753   529.7776   0.073  0.94198    
## DepartmentSales                  -323.8944   559.7732  -0.579  0.56304    
## BusinessTravelTravel_Frequently   448.5799   169.7570   2.642  0.00842 ** 
## BusinessTravelTravel_Rarely       275.4782   147.7588   1.864  0.06270 .  
## MaritalStatusMarried               58.2552   106.3658   0.548  0.58409    
## MaritalStatusSingle                61.1117   113.6017   0.538  0.59079    
## JobRoleHuman Resources           -196.3777   615.0775  -0.319  0.74962    
## JobRoleLaboratory Technician     -903.0242   197.8531  -4.564 5.96e-06 ***
## JobRoleManager                   4164.8585   286.3968  14.542  < 2e-16 ***
## JobRoleManufacturing Director    -351.4553   195.0757  -1.802  0.07205 .  
## JobRoleResearch Director         3807.5247   252.7972  15.062  < 2e-16 ***
## JobRoleResearch Scientist        -511.6669   198.0583  -2.583  0.00999 ** 
## JobRoleSales Executive            112.4187   380.5108   0.295  0.76775    
## JobRoleSales Representative      -412.8890   414.7542  -0.996  0.31985    
## Age                                 1.0651     6.2762   0.170  0.86529    
## Education                           4.3941    41.2531   0.107  0.91520    
## JobLevel                         2771.3820    98.2868  28.197  < 2e-16 ***
## NumCompaniesWorked                  4.9890    18.3792   0.271  0.78613    
## PercentSalaryHike                   4.0238    11.2243   0.358  0.72009    
## TotalWorkingYears                  35.9719    11.6733   3.082  0.00214 ** 
## YearsAtCompany                     -0.1172     9.3219  -0.013  0.98997    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1085 on 677 degrees of freedom
## Multiple R-squared:  0.9473, Adjusted R-squared:  0.9456 
## F-statistic: 553.5 on 22 and 677 DF,  p-value: < 2.2e-16
# Model 2
regress2 <- lm(MonthlyIncome ~ Gender + BusinessTravel + JobRole + Age + Education + JobLevel + NumCompaniesWorked + PercentSalaryHike + TotalWorkingYears , data = monthinc.df)
summary(regress2)
## 
## Call:
## lm(formula = MonthlyIncome ~ Gender + BusinessTravel + JobRole + 
##     Age + Education + JobLevel + NumCompaniesWorked + PercentSalaryHike + 
##     TotalWorkingYears, data = monthinc.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3637.0  -646.7    53.5   660.9  4100.8 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -217.4699   379.9958  -0.572 0.567310    
## GenderMale                        91.8727    84.4311   1.088 0.276918    
## BusinessTravelTravel_Frequently  450.4878   168.7186   2.670 0.007765 ** 
## BusinessTravelTravel_Rarely      276.2277   146.3652   1.887 0.059551 .  
## JobRoleHuman Resources          -240.6726   296.6887  -0.811 0.417536    
## JobRoleLaboratory Technician    -897.0046   196.6876  -4.561 6.05e-06 ***
## JobRoleManager                  4028.3348   249.4139  16.151  < 2e-16 ***
## JobRoleManufacturing Director   -356.6867   194.4279  -1.835 0.067009 .  
## JobRoleResearch Director        3791.7969   251.3978  15.083  < 2e-16 ***
## JobRoleResearch Scientist       -504.6747   196.6720  -2.566 0.010498 *  
## JobRoleSales Executive          -249.5603   168.6999  -1.479 0.139518    
## JobRoleSales Representative     -767.9924   242.7296  -3.164 0.001625 ** 
## Age                                1.3406     6.1411   0.218 0.827260    
## Education                          0.6416    40.8904   0.016 0.987486    
## JobLevel                        2782.7533    96.8472  28.733  < 2e-16 ***
## NumCompaniesWorked                 5.0694    17.2895   0.293 0.769451    
## PercentSalaryHike                  3.1814    11.1274   0.286 0.775034    
## TotalWorkingYears                 34.7833    10.2763   3.385 0.000753 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083 on 682 degrees of freedom
## Multiple R-squared:  0.9472, Adjusted R-squared:  0.9459 
## F-statistic: 719.9 on 17 and 682 DF,  p-value: < 2.2e-16
# Model 3
regress3 <- lm(MonthlyIncome ~ Gender + BusinessTravel + JobRole + JobLevel + TotalWorkingYears , 
data = monthinc.df)

summary(regress3)
## 
## Call:
## lm(formula = MonthlyIncome ~ Gender + BusinessTravel + JobRole + 
##     JobLevel + TotalWorkingYears, data = monthinc.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3644.3  -645.1    53.5   660.8  4079.7 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -111.387    268.895  -0.414  0.67883    
## GenderMale                        89.988     83.764   1.074  0.28307    
## BusinessTravelTravel_Frequently  451.610    167.411   2.698  0.00716 ** 
## BusinessTravelTravel_Rarely      276.331    145.512   1.899  0.05798 .  
## JobRoleHuman Resources          -237.667    295.366  -0.805  0.42130    
## JobRoleLaboratory Technician    -900.236    195.818  -4.597 5.10e-06 ***
## JobRoleManager                  4022.027    248.321  16.197  < 2e-16 ***
## JobRoleManufacturing Director   -356.711    193.798  -1.841  0.06611 .  
## JobRoleResearch Director        3791.686    250.255  15.151  < 2e-16 ***
## JobRoleResearch Scientist       -509.365    195.868  -2.601  0.00951 ** 
## JobRoleSales Executive          -252.517    168.092  -1.502  0.13349    
## JobRoleSales Representative     -775.396    241.189  -3.215  0.00137 ** 
## JobLevel                        2779.291     96.307  28.859  < 2e-16 ***
## TotalWorkingYears                 36.430      8.802   4.139 3.93e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1080 on 686 degrees of freedom
## Multiple R-squared:  0.9472, Adjusted R-squared:  0.9462 
## F-statistic: 946.6 on 13 and 686 DF,  p-value: < 2.2e-16

-> Model 3 has Multiple R-squared = 0.9472 (Adjusted R-squared = 0.9462) and p-value < 2.2e-16. These values are almost identical to those of Model 1 (0.9473, adjusted 0.9456) and Model 2 (0.9472, adjusted 0.9459), but Model 3 achieves the same accuracy with fewer independent variables and has the lowest residual standard error (1080). Hence, the best model for estimation is Model 3.
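
Because each model here is nested in the previous one, the comparison can also be backed by a partial F-test and an information criterion. The sketch below shows the idea on R's built-in `mtcars` data, used purely as a stand-in; the variables and the model names `full` and `reduced` are illustrative and unrelated to this study.

```r
# Built-in mtcars data used as a stand-in; the variables are illustrative only
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: H0 is that the dropped terms have zero coefficients
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared and AIC both penalise extra terms
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
AIC(reduced, full)
```

For the models in this report the corresponding call would presumably be `anova(regress3, regress2, regress1)`; a large p-value there would support dropping the extra variables.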