Data

Introduction

When an employee leaves a company, it is called attrition or turnover. Predicting employee turnover is a priority for Human Resources (HR) departments across the globe, as companies face massive costs resulting from employee attrition. These costs are both tangible, such as training expenses and time, and intangible, such as lost ideas, customer relationships or leadership.

The goal of this project is to utilise machine learning techniques to predict whether an employee will leave the company. Moreover, the major factors that lead to employee attrition will be identified in order to better answer questions about why people leave companies.

Dataset Description and Source

The dataset was taken from the IBM Watson Analytics website and is a fictional dataset created by IBM data scientists. It contains 1,470 observations, with 34 descriptive features and 1 target feature. There are 1,233 “No” responses to attrition and 237 “Yes” responses, which makes this an imbalanced dataset. There are no missing values in the data.
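The degree of imbalance can be quantified directly from the two counts quoted above. The sketch below hard-codes those counts rather than reading the dataset, and the variable names are illustrative only.

```r
# Class counts as reported in the dataset description (not read from the file)
attrition_counts <- c(No = 1233, Yes = 237)

# Total number of observations
n_obs <- sum(attrition_counts)

# Share of each class, rounded to three decimals
attrition_share <- round(attrition_counts / n_obs, 3)

print(n_obs)
print(attrition_share)
```

Roughly 16% of employees left, so a model that always predicts “No” would already be about 84% accurate. Accuracy alone is therefore a weak metric for this target.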

Source

Short Link: https://ibm.co/2DvoldM

Original Link: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

Target Feature

The response (target) feature, “Attrition”, is a factor with two levels, defined as:

\[\text{Attrition} \in \{\text{Yes}, \text{No}\}\]

Where “Yes” indicates that an employee has left the company, and “No” indicates that an employee has stayed with the company.

Descriptive Features

Age: continuous
BusinessTravel: Travel_Rarely, Travel_Frequently, Non-Travel
DailyRate: continuous
Department: Human Resources, Sales, Research & Development
DistanceFromHome: continuous
Education: 1 = “Below College”, 2 = “College”, 3 = “Bachelor”, 4 = “Master”, 5 = “Doctor”
EducationField: Human Resources, Life Sciences, Marketing, Medical, Other, Technical Degree
EmployeeCount: discrete
EmployeeNumber: discrete
EnvironmentSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
Gender: Male, Female
HourlyRate: continuous
JobInvolvement: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
JobLevel: 1, 2, 3, 4, 5
JobRole: Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative
JobSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
MaritalStatus: Divorced, Married, Single
MonthlyIncome: continuous
MonthlyRate: continuous
NumCompaniesWorked: continuous
Over18: Y
OverTime: Yes, No
PercentSalaryHike: discrete
PerformanceRating: 3,4
RelationshipSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
StandardHours: 80
StockOptionLevel: 0, 1, 2, 3
TotalWorkingYears: continuous
TrainingTimesLastYear: continuous
WorkLifeBalance: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
YearsAtCompany: continuous
YearsInCurrentRole: continuous
YearsSinceLastPromotion: continuous
YearsWithCurrManager: continuous

Steps Taken

Read the raw data into data1 and view it.
Subset data1 into a new data frame, data2, containing three columns (“EmployeeNumber”, “JobInvolvement” and “Department”), and view it.
Subset data1 into a new data frame, data3, dropping two columns (“JobInvolvement” and “Department”), and view it.
Merge data2 and data3 on “EmployeeNumber” into a new data frame named final, then view it.

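# Reading Data
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0)
data1 <- read.csv("Employeechurn.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)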

# View the data
head(data1, n = 2)

# data1: # of obs and # of variables
nrow(data1)

## [1] 1470

ncol(data1)

## [1] 35

# Task 1 - Merging two datasets
data2 <- subset(data1, select=c("EmployeeNumber", "JobInvolvement", "Department")) 
colnames(data2)

## [1] "EmployeeNumber" "JobInvolvement" "Department"

# data2: # of obs and # of variables
nrow(data2)

## [1] 1470

ncol(data2)

## [1] 3

# Removing the two variables from the original dataset
data3 <- subset(data1, select = -c(JobInvolvement, Department))
colnames(data3)

##  [1] "Attrition"                "Age"                     
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "DistanceFromHome"         "Education"               
##  [7] "EducationField"           "EmployeeCount"           
##  [9] "EmployeeNumber"           "EnvironmentSatisfaction" 
## [11] "Gender"                   "HourlyRate"              
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyIncome"            "MonthlyRate"             
## [19] "NumCompaniesWorked"       "Over18"                  
## [21] "OverTime"                 "PercentSalaryHike"       
## [23] "PerformanceRating"        "RelationshipSatisfaction"
## [25] "StandardHours"            "StockOptionLevel"        
## [27] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [29] "WorkLifeBalance"          "YearsAtCompany"          
## [31] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [33] "YearsWithCurrManager"

# data3: # of obs and # of variables
nrow(data3)

## [1] 1470

ncol(data3)

## [1] 33

# Merging the above two datasets on the shared key column
final <- merge(data2, data3, by = "EmployeeNumber")
colnames(final)

##  [1] "EmployeeNumber"           "JobInvolvement"          
##  [3] "Department"               "Attrition"               
##  [5] "Age"                      "BusinessTravel"          
##  [7] "DailyRate"                "DistanceFromHome"        
##  [9] "Education"                "EducationField"          
## [11] "EmployeeCount"            "EnvironmentSatisfaction" 
## [13] "Gender"                   "HourlyRate"              
## [15] "JobLevel"                 "JobRole"                 
## [17] "JobSatisfaction"          "MaritalStatus"           
## [19] "MonthlyIncome"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"

# final: # of obs and # of variables
nrow(final)

## [1] 1470

ncol(final)

## [1] 35

# Comparing the raw data (data1) to the final data set 
# data1: # of obs and # of variables
nrow(data1)

## [1] 1470

ncol(data1)

## [1] 35

# final: # of obs and # of variables
nrow(final)

## [1] 1470

ncol(final)

## [1] 35

Comparing the raw data (data1) with the merged data (final) shows no loss of observations or variables: both have 1,470 rows and 35 columns.
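Matching row and column counts show that nothing was dropped, but they do not prove that the values survived the split and merge. A stricter check, sketched here on a small made-up data frame (`toy`, `left`, `right` and `rebuilt` are hypothetical names, not objects from this report), is to put both frames in the same row and column order and compare them with `all.equal`.

```r
# Toy stand-in for data1; the real data is not used here
toy <- data.frame(
  EmployeeNumber = c(7, 2, 5),
  Department     = c("Sales", "HR", "R&D"),
  Age            = c(41, 29, 35),
  stringsAsFactors = FALSE
)

# Split into key + one column, and everything except that column
left  <- subset(toy, select = c("EmployeeNumber", "Department"))
right <- subset(toy, select = -Department)

# Merge back on the key; merge() sorts rows by the key and reorders columns
rebuilt <- merge(left, right, by = "EmployeeNumber")

# Align row order and column order before comparing
toy_sorted <- toy[order(toy$EmployeeNumber), names(rebuilt)]
rownames(toy_sorted) <- NULL

same <- isTRUE(all.equal(toy_sorted, rebuilt))
print(same)
```

Because `merge` sorts by the key and moves the key column to the front, a direct comparison without the reordering step would report differences even when no values changed.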

Understand

Steps Taken

Using the str function, we found that several variables were stored with unsuitable data types.
We applied the required data type conversions (in this dataset, integer -> factor).
We checked the factor variables and labelled or ordered their levels where necessary.
Using the summarizeColumns and kable functions, we summarized the dataset again after the changes.

# Task 3 and 4 - Observing the dataset structure for data conversions if any and tidying it

str(final)

## 'data.frame':    1470 obs. of  35 variables:
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

# Need to convert some of the variables in integer data type to factor data type
final$JobInvolvement<-as.factor(final$JobInvolvement)
final$Education<-as.factor(final$Education)
final$EnvironmentSatisfaction<-as.factor(final$EnvironmentSatisfaction)
final$JobLevel<-as.factor(final$JobLevel)
final$JobSatisfaction<-as.factor(final$JobSatisfaction)
final$PerformanceRating<-as.factor(final$PerformanceRating)
final$RelationshipSatisfaction<-as.factor(final$RelationshipSatisfaction)
final$StockOptionLevel<-as.factor(final$StockOptionLevel)
final$TrainingTimesLastYear<-as.factor(final$TrainingTimesLastYear)
final$WorkLifeBalance<-as.factor(final$WorkLifeBalance)

# Ordering our levels
final$Education <- factor(final$Education, levels=c(1,2,3,4,5))
levels(final$Education) <- list("Below College"=1,"College"=2,"Bachelor"=3,"Master"=4,"Doctor"=5)

final$EnvironmentSatisfaction <- factor(final$EnvironmentSatisfaction, levels=c(1,2,3,4))
levels(final$EnvironmentSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$JobInvolvement <- factor(final$JobInvolvement, levels=c(1,2,3,4))
levels(final$JobInvolvement) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

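# Convert the existing column; a bare factor(c(1,2,3,4,5)) would overwrite the data with a recycled 1..5 pattern
final$JobLevel <- factor(final$JobLevel, levels=c(1,2,3,4,5))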

final$JobSatisfaction <- factor(final$JobSatisfaction, levels=c(1,2,3,4))
levels(final$JobSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$PerformanceRating <- factor(final$PerformanceRating, levels=c(1,2,3,4))
levels(final$PerformanceRating) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$RelationshipSatisfaction <- factor(final$RelationshipSatisfaction, levels=c(1,2,3,4))
levels(final$RelationshipSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

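# Convert the existing column; StockOptionLevel takes the values 0 to 3
final$StockOptionLevel <- factor(final$StockOptionLevel, levels=c(0,1,2,3))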

final$WorkLifeBalance <- factor(final$WorkLifeBalance, levels=c(1,2,3,4))
levels(final$WorkLifeBalance) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

# Summary of Variables
# summarizeColumns() comes from the mlr package and kable() from knitr
summary <- summarizeColumns(final)
# Kable to output the result in a nice table
kable(summary)

name	type	mean	disp	median	mad	min	max	nlevs
EmployeeNumber	integer	1024.865306	602.0243348	1020.5	790.9671	1	2068	0
JobInvolvement	factor	NA	0.4095238	NA	NA	83	868	4
Department	factor	NA	0.3462585	NA	NA	63	961	3
Attrition	factor	NA	0.1612245	NA	NA	237	1233	2
Age	integer	36.923809	9.1353735	36.0	8.8956	18	60	0
BusinessTravel	factor	NA	0.2904762	NA	NA	150	1043	3
DailyRate	integer	802.485714	403.5090999	802.0	510.0144	102	1499	0
DistanceFromHome	integer	9.192517	8.1068644	7.0	7.4130	1	29	0
Education	factor	NA	0.6108844	NA	NA	48	572	5
EducationField	factor	NA	0.5877551	NA	NA	27	606	6
EmployeeCount	integer	1.000000	0.0000000	1.0	0.0000	1	1	0
EnvironmentSatisfaction	factor	NA	0.6918367	NA	NA	284	453	4
Gender	factor	NA	0.4000000	NA	NA	588	882	2
HourlyRate	integer	65.891156	20.3294276	66.0	26.6868	30	100	0
JobLevel	factor	NA	0.8000000	NA	NA	294	294	5
JobRole	factor	NA	0.7782313	NA	NA	52	326	9
JobSatisfaction	factor	NA	0.6877551	NA	NA	280	459	4
MaritalStatus	factor	NA	0.5421769	NA	NA	327	673	3
MonthlyIncome	integer	6502.931293	4707.9567831	4919.0	3260.2374	1009	19999	0
MonthlyRate	integer	14313.103401	7117.7860441	14235.5	9201.7569	2094	26999	0
NumCompaniesWorked	integer	2.693197	2.4980090	2.0	1.4826	0	9	0
Over18	factor	NA	0.0000000	NA	NA	1470	1470	1
OverTime	factor	NA	0.2829932	NA	NA	416	1054	2
PercentSalaryHike	integer	15.209524	3.6599377	14.0	2.9652	11	25	0
PerformanceRating	factor	NA	0.1537415	NA	NA	0	1244	2
RelationshipSatisfaction	factor	NA	0.6877551	NA	NA	276	459	4
StandardHours	integer	80.000000	0.0000000	80.0	0.0000	80	80	0
StockOptionLevel	factor	NA	0.8000000	NA	NA	294	294	5
TotalWorkingYears	integer	11.279592	7.7807817	10.0	5.9304	0	40	0
TrainingTimesLastYear	factor	NA	0.6278912	NA	NA	54	547	7
WorkLifeBalance	factor	NA	0.3925170	NA	NA	80	893	4
YearsAtCompany	integer	7.008163	6.1265252	5.0	4.4478	0	40	0
YearsInCurrentRole	integer	4.229252	3.6231370	3.0	4.4478	0	18	0
YearsSinceLastPromotion	integer	2.187755	3.2224303	1.0	1.4826	0	15	0
YearsWithCurrManager	integer	4.123129	3.5681361	3.0	4.4478	0	17	0
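The conversions in this section turn rating-style integers into labelled factors. A minimal sketch of one way to do this in a single call, and to record that the levels are ordered, is shown below. It uses a made-up vector (`ratings`) rather than the dataset, and `ordered = TRUE` is an addition that the steps above do not use.

```r
# Made-up satisfaction scores on the 1-4 scale used in the dataset description
ratings <- c(3, 1, 4, 2, 3, 3)

# Convert the existing values; levels fixes the order, labels names each level
satisfaction <- factor(
  ratings,
  levels  = 1:4,
  labels  = c("Low", "Medium", "High", "Very High"),
  ordered = TRUE
)

# Counts per level, in level order
level_counts <- table(satisfaction)
print(level_counts)

# Ordered factors support comparisons between levels
n_at_least_high <- sum(satisfaction >= "High")
print(n_at_least_high)
```

The first argument to `factor` should be the existing column. Passing a bare vector of level values instead would replace the data rather than convert it.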

Tidy & Manipulate Data I

Hadley Wickham’s tidy data principles consist of three rules:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

None of these rules is violated, so the dataset is already in tidy format.

Steps Taken

# Viewing the dataset
head(final, n = 5)

Tidy & Manipulate Data II

Steps Taken

I added a new column, “AnnualIncome”, by multiplying “MonthlyIncome” by 12, as there are 12 months in a year. The result is stored in a new dataset, “final2”, which I then summarized.

# Task 6 - Add a new variable "AnnualIncome"
# mutate() comes from the dplyr package; columns can be referenced by name
final2 <- mutate(final, AnnualIncome = MonthlyIncome * 12)

# Summary of Variables
summary <- summarizeColumns(final2)
#Kable to output the result in a nice table
kable(summary)

name	type	mean	disp	median	mad	min	max	nlevs
EmployeeNumber	integer	1024.865306	6.020243e+02	1020.5	790.9671	1	2068	0
JobInvolvement	factor	NA	4.095238e-01	NA	NA	83	868	4
Department	factor	NA	3.462585e-01	NA	NA	63	961	3
Attrition	factor	NA	1.612245e-01	NA	NA	237	1233	2
Age	integer	36.923809	9.135374e+00	36.0	8.8956	18	60	0
BusinessTravel	factor	NA	2.904762e-01	NA	NA	150	1043	3
DailyRate	integer	802.485714	4.035091e+02	802.0	510.0144	102	1499	0
DistanceFromHome	integer	9.192517	8.106864e+00	7.0	7.4130	1	29	0
Education	factor	NA	6.108844e-01	NA	NA	48	572	5
EducationField	factor	NA	5.877551e-01	NA	NA	27	606	6
EmployeeCount	integer	1.000000	0.000000e+00	1.0	0.0000	1	1	0
EnvironmentSatisfaction	factor	NA	6.918367e-01	NA	NA	284	453	4
Gender	factor	NA	4.000000e-01	NA	NA	588	882	2
HourlyRate	integer	65.891156	2.032943e+01	66.0	26.6868	30	100	0
JobLevel	factor	NA	8.000000e-01	NA	NA	294	294	5
JobRole	factor	NA	7.782313e-01	NA	NA	52	326	9
JobSatisfaction	factor	NA	6.877551e-01	NA	NA	280	459	4
MaritalStatus	factor	NA	5.421769e-01	NA	NA	327	673	3
MonthlyIncome	integer	6502.931293	4.707957e+03	4919.0	3260.2374	1009	19999	0
MonthlyRate	integer	14313.103401	7.117786e+03	14235.5	9201.7569	2094	26999	0
NumCompaniesWorked	integer	2.693197	2.498009e+00	2.0	1.4826	0	9	0
Over18	factor	NA	0.000000e+00	NA	NA	1470	1470	1
OverTime	factor	NA	2.829932e-01	NA	NA	416	1054	2
PercentSalaryHike	integer	15.209524	3.659938e+00	14.0	2.9652	11	25	0
PerformanceRating	factor	NA	1.537415e-01	NA	NA	0	1244	2
RelationshipSatisfaction	factor	NA	6.877551e-01	NA	NA	276	459	4
StandardHours	integer	80.000000	0.000000e+00	80.0	0.0000	80	80	0
StockOptionLevel	factor	NA	8.000000e-01	NA	NA	294	294	5
TotalWorkingYears	integer	11.279592	7.780782e+00	10.0	5.9304	0	40	0
TrainingTimesLastYear	factor	NA	6.278912e-01	NA	NA	54	547	7
WorkLifeBalance	factor	NA	3.925170e-01	NA	NA	80	893	4
YearsAtCompany	integer	7.008163	6.126525e+00	5.0	4.4478	0	40	0
YearsInCurrentRole	integer	4.229252	3.623137e+00	3.0	4.4478	0	18	0
YearsSinceLastPromotion	integer	2.187755	3.222430e+00	1.0	1.4826	0	15	0
YearsWithCurrManager	integer	4.123129	3.568136e+00	3.0	4.4478	0	17	0
AnnualIncome	numeric	78035.175510	5.649548e+04	59028.0	39122.8488	12108	239988	0

print("final2: # of obs and # of variables")

## [1] "final2: # of obs and # of variables"

nrow(final2)

## [1] 1470

ncol(final2)

## [1] 36

Scan I

Steps Taken

We checked for missing values, inconsistencies, and errors.

No missing or non-finite values were found, which completes Task 7.

# Task 7 - Check for Missing Values
Check_missing_values<-is.na(final2)

# All the values are false
table(Check_missing_values)

## Check_missing_values
## FALSE 
## 52920

# No missing values or inconsistencies
sum(is.na(final2))

## [1] 0

# Flag special values: non-finite for numeric columns, NA for all other columns
is.special <- function(x){if (is.numeric(x)) !is.finite(x) else is.na(x)}

# Apply this function to each column; lapply keeps one named logical vector per column
results <- lapply(final2, is.special)

final3 <- as.data.frame(results[["Attrition"]])
unique(final3)

final3 <- as.data.frame(results[["Age"]])
unique(final3)

final3 <- as.data.frame(results[["BusinessTravel"]])
unique(final3)

final3 <- as.data.frame(results[["DailyRate"]])
unique(final3)

final3 <- as.data.frame(results[["Department"]])
unique(final3)

final3 <- as.data.frame(results[["DistanceFromHome"]])
unique(final3)

final3 <- as.data.frame(results[["Education"]])
unique(final3)

final3 <- as.data.frame(results[["EducationField"]])    
unique(final3)

final3 <- as.data.frame(results[["EmployeeCount"]])
unique(final3)

final3 <- as.data.frame(results[["EmployeeNumber"]])
unique(final3)

final3 <- as.data.frame(results[["EnvironmentSatisfaction"]])   
unique(final3)

final3 <- as.data.frame(results[["Gender"]])    
unique(final3)

final3 <- as.data.frame(results[["HourlyRate"]])    
unique(final3)

final3 <- as.data.frame(results[["JobInvolvement"]])    
unique(final3)

final3 <- as.data.frame(results[["JobLevel"]])
unique(final3)

final3 <- as.data.frame(results[["JobRole"]])   
unique(final3)

final3 <- as.data.frame(results[["JobSatisfaction"]])   
unique(final3)

final3 <- as.data.frame(results[["MaritalStatus"]])
unique(final3)

final3 <- as.data.frame(results[["MonthlyIncome"]]) 
unique(final3)

final3 <- as.data.frame(results[["MonthlyRate"]])
unique(final3)

final3 <- as.data.frame(results[["NumCompaniesWorked"]])    
unique(final3)

final3 <- as.data.frame(results[["Over18"]])    
unique(final3)

final3 <- as.data.frame(results[["OverTime"]])
unique(final3)

final3 <- as.data.frame(results[["PercentSalaryHike"]]) 
unique(final3)

final3 <- as.data.frame(results[["PerformanceRating"]])
unique(final3)

final3 <- as.data.frame(results[["RelationshipSatisfaction"]])  
unique(final3)

final3 <- as.data.frame(results[["StandardHours"]])
unique(final3)

final3 <- as.data.frame(results[["StockOptionLevel"]])  
unique(final3)

final3 <- as.data.frame(results[["TotalWorkingYears"]]) 
unique(final3)

final3 <- as.data.frame(results[["TrainingTimesLastYear"]]) 
unique(final3)

final3 <- as.data.frame(results[["WorkLifeBalance"]])   
unique(final3)

final3 <- as.data.frame(results[["YearsAtCompany"]])
unique(final3)

final3 <- as.data.frame(results[["YearsInCurrentRole"]])
unique(final3)

final3 <- as.data.frame(results[["YearsSinceLastPromotion"]])
unique(final3)

final3 <- as.data.frame(results[["YearsWithCurrManager"]])
unique(final3)

final3 <- as.data.frame(results[["AnnualIncome"]])
unique(final3)
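The column-by-column inspection above can be condensed. The sketch below, run on a small made-up data frame (`toy` is hypothetical) that deliberately contains one `NA` and one `Inf`, counts special values per column with a single `sapply` call, so any column with a non-zero count stands out.

```r
# Flag values that are non-finite (numeric) or missing (everything else)
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# Made-up data with two deliberate problems: an NA string and an Inf number
toy <- data.frame(
  Gender        = c("Male", NA, "Female"),
  MonthlyIncome = c(5993, Inf, 2090),
  Age           = c(41, 49, 37),
  stringsAsFactors = FALSE
)

# Number of special values in each column
special_counts <- sapply(toy, function(col) sum(is.special(col)))
print(special_counts)

# Names of the columns that need attention
flagged <- names(special_counts)[special_counts > 0]
print(flagged)
```

On a clean dataset every count would be zero and `flagged` would be empty, which gives the same conclusion as checking each column's unique values by hand.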

Scan II

Steps Taken

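We ran a chi-squared outlier test on each numeric variable. Every test fails to reject the null hypothesis at the 5% level, so the tested extreme values are not treated as outliers and are left unmodified. Note that opposite = TRUE makes each test examine the extreme value on the opposite side from the one furthest from the mean.

# Outliers test (chisq.out.test comes from the outliers package)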
chisq.out.test(final2$MonthlyIncome,variance = var(final2$MonthlyIncome,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$MonthlyIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 1009 is an outlier

chisq.out.test(final2$DailyRate,variance = var(final2$DailyRate,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$DailyRate
## X-squared = 2.9796, p-value = 0.08432
## alternative hypothesis: highest value 1499 is an outlier

chisq.out.test(final2$DistanceFromHome,variance = var(final2$DistanceFromHome,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$DistanceFromHome
## X-squared = 1.0212, p-value = 0.3122
## alternative hypothesis: lowest value 1 is an outlier

chisq.out.test(final2$HourlyRate,variance = var(final2$HourlyRate,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$HourlyRate
## X-squared = 2.815, p-value = 0.09338
## alternative hypothesis: highest value 100 is an outlier

chisq.out.test(final2$MonthlyRate,variance = var(final2$MonthlyRate,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$MonthlyRate
## X-squared = 2.9471, p-value = 0.08603
## alternative hypothesis: lowest value 2094 is an outlier

chisq.out.test(final2$NumCompaniesWorked,variance = var(final2$NumCompaniesWorked,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$NumCompaniesWorked
## X-squared = 1.1624, p-value = 0.281
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$PercentSalaryHike,variance = var(final2$PercentSalaryHike,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$PercentSalaryHike
## X-squared = 1.3229, p-value = 0.2501
## alternative hypothesis: lowest value 11 is an outlier

chisq.out.test(final2$TotalWorkingYears,variance = var(final2$TotalWorkingYears,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$TotalWorkingYears
## X-squared = 2.1016, p-value = 0.1471
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$YearsAtCompany,variance = var(final2$YearsAtCompany,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$YearsAtCompany
## X-squared = 1.3085, p-value = 0.2527
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$YearsInCurrentRole,variance = var(final2$YearsInCurrentRole,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$YearsInCurrentRole
## X-squared = 1.3626, p-value = 0.2431
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$YearsSinceLastPromotion,variance = var(final2$YearsSinceLastPromotion,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$YearsSinceLastPromotion
## X-squared = 0.46093, p-value = 0.4972
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$YearsWithCurrManager,variance = var(final2$YearsWithCurrManager,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$YearsWithCurrManager
## X-squared = 1.3353, p-value = 0.2479
## alternative hypothesis: lowest value 0 is an outlier

chisq.out.test(final2$AnnualIncome,variance = var(final2$AnnualIncome,na.rm=TRUE),opposite = TRUE)

## 
##  chi-squared test for outlier
## 
## data:  final2$AnnualIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 12108 is an outlier

# None of the tested extreme values (lowest or highest, as reported by each test) is flagged as an outlier, as all p-values > 0.05
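As a complementary check to the chi-squared tests above, Tukey's 1.5 × IQR rule (the rule behind boxplot whiskers) can flag values on both tails at once. This is a different technique from the one used in this report. The sketch uses base R's `boxplot.stats` on a made-up vector (`income`) rather than the dataset.

```r
# Made-up monthly incomes with one clearly extreme value
income <- c(2100, 2500, 2700, 2900, 3100, 3400, 3600, 4000, 19999)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges in $out
flagged <- boxplot.stats(income)$out
print(flagged)

# The same rule written out with quantiles, for transparency
# (hinges and quartiles coincide for this vector but can differ slightly in general)
q <- quantile(income, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * (q[[2]] - q[[1]])
fence_high <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
manual <- income[income < fence_low | income > fence_high]
print(manual)
```

Unlike a single-value test, this rule inspects both tails in one pass, which is useful for skewed variables where the suspicious values sit on one side only.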

Transform

Steps Taken

The histogram and QQ plot show that Monthly Income is positively skewed.
We therefore apply a log transformation to Monthly Income, which brings its distribution closer to a normal distribution.

# Task 9
par(mfrow=c(1,1))
hist(final2$MonthlyIncome)

# QQ Plot before transformation
qqnorm(final2$MonthlyIncome, main = "Normal QQ Plot")
qqline(final2$MonthlyIncome, col = "red")

# Taking logs to transform it 
Transform.MonthlyIncome<-log(final2$MonthlyIncome)  

# After log transformation - QQ plot
qqnorm(Transform.MonthlyIncome, main = "Normal QQ Plot")
qqline(Transform.MonthlyIncome, col = "red")
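The visual impression from the histogram and QQ plots can be backed by a number. The sketch below computes the moment-based sample skewness before and after a log transform on simulated log-normal data (not the dataset). Here `skewness` is a small helper defined in the block, not a function from a package, and the simulation parameters are arbitrary assumptions.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Simulated right-skewed "income" values; seed fixed for reproducibility
set.seed(1)
income <- rlnorm(1000, meanlog = 8.5, sdlog = 0.6)

skew_raw <- skewness(income)
skew_log <- skewness(log(income))

print(skew_raw)
print(skew_log)
```

A clearly positive skewness before the transform and a value near zero afterwards would be consistent with the QQ points lying closer to the reference line after taking logs.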

MATH2349 Semester 1, 2018

Assignment 3

Simon Prasad (s3526093)
