Required packages

setwd("/Volumes/GoogleDrive/My Drive/Studies/Data Preprocessing/Actual Assignments/Assignment #3")
knitr::opts_chunk$set(echo = TRUE, fig.align = "center", fig.height = 6, fig.width = 12)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(mlr)
## Loading required package: ParamHelpers
library(knitr)
library(outliers)

Executive Summary

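At a high level, the tasks we performed were:

  • Merging two subsets of the raw data back into a single dataset
  • Inspecting the structure of the data and converting variables to appropriate types
  • Checking that the data is in tidy format
  • Adding a new variable, AnnualIncome
  • Scanning for missing values, special values, and outliers
  • Log-transforming MonthlyIncome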

This preprocessing will aid us when we apply statistical models, which is the second phase after preprocessing.

RPubs Link to this pdf: https://rpubs.com/SimonP/Assignment_3

Data

Introduction

When an employee leaves a company, it is called attrition or turnover. Predicting employee turnover is a priority for Human Resources (HR) departments across the globe, as companies face massive costs resulting from employee attrition. These costs are both tangible, such as training expenses and time, and intangible, such as lost ideas, customer relationships, or leadership.

The goal of this project is to utilise machine learning techniques to predict whether an employee will leave the company. Moreover, the major factors that lead to employee attrition will be identified in order to better answer questions about why people leave companies.

Dataset Description and Source

The dataset was taken from the IBM Watson Analytics website and is a fictional dataset created by IBM data scientists. It contains 1,470 observations with 34 descriptive features and 1 target feature. There are 1,233 “No” responses to attrition and 237 “Yes” responses, making this an imbalanced dataset. Furthermore, there are no missing values within the data.
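
The class imbalance can be quantified directly from the two counts quoted above. The snippet below is a minimal sketch that hard-codes those counts rather than reading the dataset; the object names are illustrative only.

```r
# Class counts quoted in the dataset description
attrition_counts <- c(No = 1233, Yes = 237)

# Share of each class among all observations
attrition_share <- attrition_counts / sum(attrition_counts)
print(round(attrition_share, 3))

# Ratio of the majority class to the minority class
imbalance_ratio <- attrition_counts[["No"]] / attrition_counts[["Yes"]]
print(round(imbalance_ratio, 2))
```

Roughly 16% of employees fall in the “Yes” class, so a model that always predicts “No” would already be right about 84% of the time. This is why accuracy alone can be a misleading measure on this dataset.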

Source

Short Link: https://ibm.co/2DvoldM

Original Link: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

Target Feature

The response (target) feature, “Attrition”, is a factor with 2 levels, defined as:

\[\text{Attrition} \in \{\text{Yes}, \text{No}\}\]

Where “Yes” indicates that an employee has left the company, and “No” indicates that an employee has stayed with the company.

Descriptive Features

  • Age: continuous
  • BusinessTravel: Travel_Rarely, Travel_Frequently, Non-Travel
  • DailyRate: continuous
  • Department: Human Resources, Sales, Research & Development
  • DistanceFromHome: continuous
  • Education: 1 = “Below College”, 2 = “College”, 3 = “Bachelor”, 4 = “Master”, 5 = “Doctor”
  • EducationField: Human Resources, Life Sciences, Marketing, Medical, Other, Technical Degree
  • EmployeeCount: discrete
  • EmployeeNumber: discrete
  • EnvironmentSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
  • Gender: Male, Female
  • HourlyRate: continuous
  • JobInvolvement: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
  • JobLevel: 1, 2, 3, 4, 5
  • JobRole: Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative
  • JobSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
  • MaritalStatus: Divorced, Married, Single
  • MonthlyIncome: continuous
  • MonthlyRate: continuous
  • NumCompaniesWorked: continuous
  • Over18: Y
  • OverTime: Yes, No
  • PercentSalaryHike: discrete
  • PerformanceRating: 3,4
  • RelationshipSatisfaction: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
  • StandardHours: 80
  • StockOptionLevel: 0, 1, 2, 3
  • TotalWorkingYears: continuous
  • TrainingTimesLastYear: continuous
  • WorkLifeBalance: 1 = “Low”, 2 = “Medium”, 3 = “High”, 4 = “Very High”
  • YearsAtCompany: continuous
  • YearsInCurrentRole: continuous
  • YearsSinceLastPromotion: continuous
  • YearsWithCurrManager: continuous

Steps Taken

  1. Reading the raw data into data1 and viewing it.

  2. Subsetting data1 into a new data frame named data2 containing three columns (“EmployeeNumber”, “JobInvolvement” and “Department”) and viewing the result.

  3. Subsetting data1 into a new data frame named data3 by dropping two columns (“JobInvolvement” and “Department”) and viewing the result.

  4. Merging data2 and data3 on “EmployeeNumber” into a new data frame named final, then viewing it.

# Reading Data
data1 <- read.csv("Employeechurn.csv",header = TRUE,sep = ",")

# View the data
head(data1, n = 2)
# data1: # of obs and # of variables
nrow(data1)
## [1] 1470
ncol(data1)
## [1] 35
# Task 1 - Merging two datasets
data2 <- subset(data1, select=c("EmployeeNumber", "JobInvolvement", "Department")) 
colnames(data2)
## [1] "EmployeeNumber" "JobInvolvement" "Department"
# data2: # of obs and # of variables
nrow(data2)
## [1] 1470
ncol(data2)
## [1] 3
# Removing the two variables from the original dataset
data3 <- subset(data1, select = -c(JobInvolvement, Department))
colnames(data3)
##  [1] "Attrition"                "Age"                     
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "DistanceFromHome"         "Education"               
##  [7] "EducationField"           "EmployeeCount"           
##  [9] "EmployeeNumber"           "EnvironmentSatisfaction" 
## [11] "Gender"                   "HourlyRate"              
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyIncome"            "MonthlyRate"             
## [19] "NumCompaniesWorked"       "Over18"                  
## [21] "OverTime"                 "PercentSalaryHike"       
## [23] "PerformanceRating"        "RelationshipSatisfaction"
## [25] "StandardHours"            "StockOptionLevel"        
## [27] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [29] "WorkLifeBalance"          "YearsAtCompany"          
## [31] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [33] "YearsWithCurrManager"
# data3: # of obs and # of variables
nrow(data3)
## [1] 1470
ncol(data3)
## [1] 33
# Merging the above two dataset
final <- merge(data2, data3, by = "EmployeeNumber")
colnames(final)
##  [1] "EmployeeNumber"           "JobInvolvement"          
##  [3] "Department"               "Attrition"               
##  [5] "Age"                      "BusinessTravel"          
##  [7] "DailyRate"                "DistanceFromHome"        
##  [9] "Education"                "EducationField"          
## [11] "EmployeeCount"            "EnvironmentSatisfaction" 
## [13] "Gender"                   "HourlyRate"              
## [15] "JobLevel"                 "JobRole"                 
## [17] "JobSatisfaction"          "MaritalStatus"           
## [19] "MonthlyIncome"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"
# final: # of obs and # of variables
nrow(final)
## [1] 1470
ncol(final)
## [1] 35
# Comparing the raw data (data1) to the final data set 
# data1: # of obs and # of variables
nrow(data1)
## [1] 1470
ncol(data1)
## [1] 35
# final: # of obs and # of variables
nrow(final)
## [1] 1470
ncol(final)
## [1] 35

There is no loss in the number of observations or variables when comparing the raw data (data1) with the final dataset: both have 1,470 rows and 35 columns.
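
Matching row and column counts do not by themselves show that the content survived the split and merge. The following is a minimal sketch of a stricter check on a toy data frame; the values and the names toy, left, right, and merged are hypothetical stand-ins, not the assignment's data.

```r
# Toy stand-in for data1 (hypothetical values, same column roles)
toy <- data.frame(
  EmployeeNumber = c(1, 2, 4, 5),
  JobInvolvement = c(3, 2, 2, 3),
  Department = c("Sales", "R&D", "R&D", "HR"),
  Age = c(41, 49, 37, 33),
  stringsAsFactors = FALSE
)

# Split into the key plus two columns, and everything except those two columns
left <- subset(toy, select = c(EmployeeNumber, JobInvolvement, Department))
right <- subset(toy, select = -c(JobInvolvement, Department))

# Merge back on the shared key
merged <- merge(left, right, by = "EmployeeNumber")

# Checks: same shape, same column names, unique key, same content after reordering columns
same_shape <- identical(dim(merged), dim(toy))
same_columns <- setequal(names(merged), names(toy))
key_is_unique <- !anyDuplicated(toy$EmployeeNumber)
same_content <- isTRUE(all.equal(merged[, names(toy)], toy, check.attributes = FALSE))
print(c(same_shape, same_columns, key_is_unique, same_content))
```

The uniqueness check matters because merge would duplicate rows if the key repeated in either table. The content comparison relies on the toy rows already being sorted by the key, since merge sorts its result by the key column.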

Understand

Steps Taken

  1. Using the str function, we found that several variables were not stored with an appropriate data type (for example, ordinal ratings stored as integers).

  2. We performed the required data type conversions; in this dataset, integer -> factor for the rating and level variables.

  3. We checked which factor variables needed labels or an explicit level order and applied them where necessary.

  4. Using the summarizeColumns and kable functions, we summarized the dataset again after the changes.
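
The conversion and labelling steps above can also be done in a single factor call. This is a minimal sketch on a hypothetical vector of integer-coded ratings, not the report's own code; using ordered = TRUE is an assumption about how one might want to treat the rating scales.

```r
# Hypothetical integer-coded ratings, as they appear after read.csv
satisfaction_raw <- c(3L, 1L, 4L, 2L, 3L)

# Convert, label, and order in one call
satisfaction <- factor(
  satisfaction_raw,
  levels = 1:4,
  labels = c("Low", "Medium", "High", "Very High"),
  ordered = TRUE
)

print(satisfaction)
print(table(satisfaction))

# Ordered factors support comparisons between levels
at_least_high <- satisfaction >= "High"
print(at_least_high)
```

Passing the existing column as the first argument keeps every observation attached to its own row. An ordered factor is only one option; a plain factor with the same labels may be preferable for models that should not assume equal spacing or rank.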

# Task 3 and 4 - Observing the dataset structure for data conversions if any and tidying it

str(final)                                            
## 'data.frame':    1470 obs. of  35 variables:
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
# Need to convert some of the variables in integer data type to factor data type
final$JobInvolvement<-as.factor(final$JobInvolvement)
final$Education<-as.factor(final$Education)
final$EnvironmentSatisfaction<-as.factor(final$EnvironmentSatisfaction)
final$JobLevel<-as.factor(final$JobLevel)
final$JobSatisfaction<-as.factor(final$JobSatisfaction)
final$PerformanceRating<-as.factor(final$PerformanceRating)
final$RelationshipSatisfaction<-as.factor(final$RelationshipSatisfaction)
final$StockOptionLevel<-as.factor(final$StockOptionLevel)
final$TrainingTimesLastYear<-as.factor(final$TrainingTimesLastYear)
final$WorkLifeBalance<-as.factor(final$WorkLifeBalance)

# Ordering our levels
final$Education <- factor(final$Education, levels=c(1,2,3,4,5))
levels(final$Education) <- list("Below College"=1,"College"=2,"Bachelor"=3,"Master"=4,"Doctor"=5)

final$EnvironmentSatisfaction <- factor(final$EnvironmentSatisfaction, levels=c(1,2,3,4))
levels(final$EnvironmentSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$JobInvolvement <- factor(final$JobInvolvement, levels=c(1,2,3,4))
levels(final$JobInvolvement) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

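# Convert the existing column; factor(c(1,2,3,4,5)) on its own would recycle 1-5 down the rows
final$JobLevel <- factor(final$JobLevel, levels=c(1,2,3,4,5))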

final$JobSatisfaction <- factor(final$JobSatisfaction, levels=c(1,2,3,4))
levels(final$JobSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$PerformanceRating <- factor(final$PerformanceRating, levels=c(1,2,3,4))
levels(final$PerformanceRating) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

final$RelationshipSatisfaction <- factor(final$RelationshipSatisfaction, levels=c(1,2,3,4))
levels(final$RelationshipSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

# Convert the existing column using its documented levels 0-3
final$StockOptionLevel <- factor(final$StockOptionLevel, levels=c(0,1,2,3))

final$WorkLifeBalance <- factor(final$WorkLifeBalance, levels=c(1,2,3,4))
levels(final$WorkLifeBalance) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)

# Summary of Variables
summary <- summarizeColumns(final)
#Kable to output the result in a nice table
kable(summary)
name type na mean disp median mad min max nlevs
EmployeeNumber integer 0 1024.865306 602.0243348 1020.5 790.9671 1 2068 0
JobInvolvement factor 0 NA 0.4095238 NA NA 83 868 4
Department factor 0 NA 0.3462585 NA NA 63 961 3
Attrition factor 0 NA 0.1612245 NA NA 237 1233 2
Age integer 0 36.923809 9.1353735 36.0 8.8956 18 60 0
BusinessTravel factor 0 NA 0.2904762 NA NA 150 1043 3
DailyRate integer 0 802.485714 403.5090999 802.0 510.0144 102 1499 0
DistanceFromHome integer 0 9.192517 8.1068644 7.0 7.4130 1 29 0
Education factor 0 NA 0.6108844 NA NA 48 572 5
EducationField factor 0 NA 0.5877551 NA NA 27 606 6
EmployeeCount integer 0 1.000000 0.0000000 1.0 0.0000 1 1 0
EnvironmentSatisfaction factor 0 NA 0.6918367 NA NA 284 453 4
Gender factor 0 NA 0.4000000 NA NA 588 882 2
HourlyRate integer 0 65.891156 20.3294276 66.0 26.6868 30 100 0
JobLevel factor 0 NA 0.8000000 NA NA 294 294 5
JobRole factor 0 NA 0.7782313 NA NA 52 326 9
JobSatisfaction factor 0 NA 0.6877551 NA NA 280 459 4
MaritalStatus factor 0 NA 0.5421769 NA NA 327 673 3
MonthlyIncome integer 0 6502.931293 4707.9567831 4919.0 3260.2374 1009 19999 0
MonthlyRate integer 0 14313.103401 7117.7860441 14235.5 9201.7569 2094 26999 0
NumCompaniesWorked integer 0 2.693197 2.4980090 2.0 1.4826 0 9 0
Over18 factor 0 NA 0.0000000 NA NA 1470 1470 1
OverTime factor 0 NA 0.2829932 NA NA 416 1054 2
PercentSalaryHike integer 0 15.209524 3.6599377 14.0 2.9652 11 25 0
PerformanceRating factor 0 NA 0.1537415 NA NA 0 1244 2
RelationshipSatisfaction factor 0 NA 0.6877551 NA NA 276 459 4
StandardHours integer 0 80.000000 0.0000000 80.0 0.0000 80 80 0
StockOptionLevel factor 0 NA 0.8000000 NA NA 294 294 5
TotalWorkingYears integer 0 11.279592 7.7807817 10.0 5.9304 0 40 0
TrainingTimesLastYear factor 0 NA 0.6278912 NA NA 54 547 7
WorkLifeBalance factor 0 NA 0.3925170 NA NA 80 893 4
YearsAtCompany integer 0 7.008163 6.1265252 5.0 4.4478 0 40 0
YearsInCurrentRole integer 0 4.229252 3.6231370 3.0 4.4478 0 18 0
YearsSinceLastPromotion integer 0 2.187755 3.2224303 1.0 1.4826 0 15 0
YearsWithCurrManager integer 0 4.123129 3.5681361 3.0 4.4478 0 17 0

Tidy & Manipulate Data I

Hadley Wickham’s tidy data principles consist of three rules:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

We can observe that none of Hadley Wickham’s tidy rules are violated, so the data is already in tidy format.

Steps Taken

# Viewing the dataset
head(final, n = 5)
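
For contrast with the tidy rules discussed in this section, the sketch below shows what a violation of the first rule could look like and one way to repair it. The data frame and its Income_2017 and Income_2018 columns are hypothetical and do not exist in the attrition dataset.

```r
# Hypothetical untidy layout: the variable "year" is hidden in the column names
wide <- data.frame(
  EmployeeNumber = c(1, 2),
  Income_2017 = c(5000, 4200),
  Income_2018 = c(5300, 4400)
)

# Tidy layout: one column per variable, one row per employee-year observation
long <- data.frame(
  EmployeeNumber = rep(wide$EmployeeNumber, times = 2),
  Year = rep(c(2017, 2018), each = nrow(wide)),
  Income = c(wide$Income_2017, wide$Income_2018)
)
print(long)
```

The attrition data needs no such reshaping because every column already holds a single variable and every row a single employee.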

Tidy & Manipulate Data II

Steps Taken

I’ve added a new column, “AnnualIncome”, to a new dataset named “final2”. I did this by multiplying “MonthlyIncome” by 12, as there are 12 months in a year. I then summarized “final2”.

# Task 6 - Add a new variable "Annual Income"
final2 <- mutate(final, AnnualIncome = MonthlyIncome * 12)

# Summary of Variables
summary <- summarizeColumns(final2)
#Kable to output the result in a nice table
kable(summary)
name type na mean disp median mad min max nlevs
EmployeeNumber integer 0 1024.865306 6.020243e+02 1020.5 790.9671 1 2068 0
JobInvolvement factor 0 NA 4.095238e-01 NA NA 83 868 4
Department factor 0 NA 3.462585e-01 NA NA 63 961 3
Attrition factor 0 NA 1.612245e-01 NA NA 237 1233 2
Age integer 0 36.923809 9.135374e+00 36.0 8.8956 18 60 0
BusinessTravel factor 0 NA 2.904762e-01 NA NA 150 1043 3
DailyRate integer 0 802.485714 4.035091e+02 802.0 510.0144 102 1499 0
DistanceFromHome integer 0 9.192517 8.106864e+00 7.0 7.4130 1 29 0
Education factor 0 NA 6.108844e-01 NA NA 48 572 5
EducationField factor 0 NA 5.877551e-01 NA NA 27 606 6
EmployeeCount integer 0 1.000000 0.000000e+00 1.0 0.0000 1 1 0
EnvironmentSatisfaction factor 0 NA 6.918367e-01 NA NA 284 453 4
Gender factor 0 NA 4.000000e-01 NA NA 588 882 2
HourlyRate integer 0 65.891156 2.032943e+01 66.0 26.6868 30 100 0
JobLevel factor 0 NA 8.000000e-01 NA NA 294 294 5
JobRole factor 0 NA 7.782313e-01 NA NA 52 326 9
JobSatisfaction factor 0 NA 6.877551e-01 NA NA 280 459 4
MaritalStatus factor 0 NA 5.421769e-01 NA NA 327 673 3
MonthlyIncome integer 0 6502.931293 4.707957e+03 4919.0 3260.2374 1009 19999 0
MonthlyRate integer 0 14313.103401 7.117786e+03 14235.5 9201.7569 2094 26999 0
NumCompaniesWorked integer 0 2.693197 2.498009e+00 2.0 1.4826 0 9 0
Over18 factor 0 NA 0.000000e+00 NA NA 1470 1470 1
OverTime factor 0 NA 2.829932e-01 NA NA 416 1054 2
PercentSalaryHike integer 0 15.209524 3.659938e+00 14.0 2.9652 11 25 0
PerformanceRating factor 0 NA 1.537415e-01 NA NA 0 1244 2
RelationshipSatisfaction factor 0 NA 6.877551e-01 NA NA 276 459 4
StandardHours integer 0 80.000000 0.000000e+00 80.0 0.0000 80 80 0
StockOptionLevel factor 0 NA 8.000000e-01 NA NA 294 294 5
TotalWorkingYears integer 0 11.279592 7.780782e+00 10.0 5.9304 0 40 0
TrainingTimesLastYear factor 0 NA 6.278912e-01 NA NA 54 547 7
WorkLifeBalance factor 0 NA 3.925170e-01 NA NA 80 893 4
YearsAtCompany integer 0 7.008163 6.126525e+00 5.0 4.4478 0 40 0
YearsInCurrentRole integer 0 4.229252 3.623137e+00 3.0 4.4478 0 18 0
YearsSinceLastPromotion integer 0 2.187755 3.222430e+00 1.0 1.4826 0 15 0
YearsWithCurrManager integer 0 4.123129 3.568136e+00 3.0 4.4478 0 17 0
AnnualIncome numeric 0 78035.175510 5.649548e+04 59028.0 39122.8488 12108 239988 0
print("final2: # of obs and # of variables")
## [1] "final2: # of obs and # of variables"
nrow(final2)
## [1] 1470
ncol(final2)
## [1] 36
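
The AnnualIncome row of the summary can be checked by hand against the MonthlyIncome row, since multiplying by a positive constant scales the minimum, median, and maximum by the same factor. A minimal sketch using the three values printed in the table above:

```r
# Minimum, median, and maximum of MonthlyIncome, copied from the summary table
monthly <- c(min = 1009, median = 4919, max = 19999)

# Scale to a yearly figure
annual <- monthly * 12
print(annual)
```

The results agree with the min, median, and max reported for AnnualIncome.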

Scan I

Steps Taken

  1. Checking for missing values, inconsistencies, and errors.

No missing or special values were found, which completes Task 7.

# Task 7 - Check for Missing Values
Check_missing_values<-is.na(final2)

# All the values are false
table(Check_missing_values)
## Check_missing_values
## FALSE 
## 52920
# No missing values or inconsistencies
sum(is.na(final2))                                          
## [1] 0
# Check whether inputs are not finite (numeric columns) or NA (other columns) using a function called is.special
is.special <- function(x){if (is.numeric(x)) !is.finite(x) else is.na(x)}

# Apply this function to each column; lapply always returns a list, so columns can be looked up by name
results <- lapply(final2, is.special)

# Print the distinct flags for every column (a single FALSE means no special values)
for (col in names(results)) {
  final3 <- as.data.frame(results[[col]])
  print(unique(final3))
}
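
Because the real data contains no special values, the check never shows what a positive result looks like. The sketch below plants an NA and an Inf in a toy data frame to illustrate it. The helper is named is_special here only to keep it separate from the report's function, and the data values are hypothetical.

```r
# Flag non-finite entries in numeric columns and NA entries in all other columns
is_special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}

# Toy data with planted problems (hypothetical values)
toy <- data.frame(
  MonthlyIncome = c(5993, NA, 2090, Inf),
  Department = c("Sales", NA, "Sales", "HR"),
  stringsAsFactors = FALSE
)

# One logical vector per column, then a count of flagged entries per column
flags <- lapply(toy, is_special)
special_counts <- sapply(flags, sum)
print(special_counts)
```

A per-column count is a compact alternative to inspecting unique values column by column: any non-zero entry points to the column that needs attention.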

Scan II

Steps Taken

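  1. Perform the chi-squared outlier test (chisq.out.test from the outliers package) on each numeric variable. Every test fails to reject the null hypothesis at the 0.05 level, so the tested extreme values are not treated as outliers and are left unmodified.

# Outliers test (with opposite = TRUE, the extreme on the opposite side from the value furthest from the mean is tested)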
chisq.out.test(final2$MonthlyIncome,variance = var(final2$MonthlyIncome,na.rm=TRUE),opposite = TRUE) 
## 
##  chi-squared test for outlier
## 
## data:  final2$MonthlyIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 1009 is an outlier
chisq.out.test(final2$DailyRate,variance = var(final2$DailyRate,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$DailyRate
## X-squared = 2.9796, p-value = 0.08432
## alternative hypothesis: highest value 1499 is an outlier
chisq.out.test(final2$DistanceFromHome,variance = var(final2$DistanceFromHome,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$DistanceFromHome
## X-squared = 1.0212, p-value = 0.3122
## alternative hypothesis: lowest value 1 is an outlier
chisq.out.test(final2$HourlyRate,variance = var(final2$HourlyRate,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$HourlyRate
## X-squared = 2.815, p-value = 0.09338
## alternative hypothesis: highest value 100 is an outlier
chisq.out.test(final2$MonthlyRate,variance = var(final2$MonthlyRate,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$MonthlyRate
## X-squared = 2.9471, p-value = 0.08603
## alternative hypothesis: lowest value 2094 is an outlier
chisq.out.test(final2$NumCompaniesWorked,variance = var(final2$NumCompaniesWorked,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$NumCompaniesWorked
## X-squared = 1.1624, p-value = 0.281
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$PercentSalaryHike,variance = var(final2$PercentSalaryHike,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$PercentSalaryHike
## X-squared = 1.3229, p-value = 0.2501
## alternative hypothesis: lowest value 11 is an outlier
chisq.out.test(final2$TotalWorkingYears,variance = var(final2$TotalWorkingYears,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$TotalWorkingYears
## X-squared = 2.1016, p-value = 0.1471
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsAtCompany,variance = var(final2$YearsAtCompany,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$YearsAtCompany
## X-squared = 1.3085, p-value = 0.2527
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsInCurrentRole,variance = var(final2$YearsInCurrentRole,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$YearsInCurrentRole
## X-squared = 1.3626, p-value = 0.2431
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsSinceLastPromotion,variance = var(final2$YearsSinceLastPromotion,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$YearsSinceLastPromotion
## X-squared = 0.46093, p-value = 0.4972
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsWithCurrManager,variance = var(final2$YearsWithCurrManager,na.rm=TRUE),opposite = TRUE)
## 
##  chi-squared test for outlier
## 
## data:  final2$YearsWithCurrManager
## X-squared = 1.3353, p-value = 0.2479
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$AnnualIncome,variance = var(final2$AnnualIncome,na.rm=TRUE),opposite = TRUE)  
## 
##  chi-squared test for outlier
## 
## data:  final2$AnnualIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 12108 is an outlier
# None of the tested extreme values is flagged as an outlier, as all p-values > 0.05
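
The statistic reported for MonthlyIncome can be reproduced by hand, which makes clear what is being tested. The sketch below assumes the statistic is the squared distance of the tested value from the mean divided by the variance, compared against a chi-squared distribution with 1 degree of freedom. It uses the mean, standard deviation, minimum, and maximum printed in the summary table, and applies the same formula to the highest value purely for comparison.

```r
# MonthlyIncome summary values copied from the table in this report
mean_income <- 6502.931293
sd_income <- 4707.9567831
lowest <- 1009
highest <- 19999

# Assumed form of the statistic: squared deviation from the mean over the variance
chisq_stat <- function(value) (value - mean_income)^2 / sd_income^2

# Lowest value (the one tested above with opposite = TRUE)
stat_low <- chisq_stat(lowest)
p_low <- 1 - pchisq(stat_low, df = 1)

# Highest value, for comparison
stat_high <- chisq_stat(highest)
p_high <- 1 - pchisq(stat_high, df = 1)

print(round(c(stat_low = stat_low, p_low = p_low), 4))
print(round(c(stat_high = stat_high, p_high = p_high), 4))
```

Under this assumption the low-end figures match the X-squared and p-value shown for MonthlyIncome. The high-end statistic is considerably larger, so it may be worth rerunning the tests with opposite = FALSE before concluding that a right-skewed variable has no outliers.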

Transform

Steps Taken

  1. We look at Monthly Income and can see that it is positively skewed.

  2. We then apply a log transformation to Monthly Income and can see that this improves the distribution of the variable, bringing it closer to a normal distribution.

# Task 9
par(mfrow=c(1,1))
hist(final2$MonthlyIncome) 

# QQ Plot before transformation
qqnorm(final2$MonthlyIncome, main = "Normal QQ Plot")
qqline(final2$MonthlyIncome, col = "red")

# Taking logs to transform it 
Transform.MonthlyIncome<-log(final2$MonthlyIncome)  

# After log transformation - QQ plot
qqnorm(Transform.MonthlyIncome, main = "Normal QQ Plot")
qqline(Transform.MonthlyIncome, col = "red")
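
The effect of the log transformation can also be expressed as a number rather than judged from plots alone. The sketch below uses simulated right-skewed data, not the report's MonthlyIncome column, and a hand-written moment-based skewness helper; the distribution parameters are arbitrary assumptions.

```r
# Simulated right-skewed incomes (log-normal), standing in for MonthlyIncome
set.seed(1)
income <- rlnorm(1000, meanlog = 8.5, sdlog = 0.7)

# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_before <- skewness(income)
skew_after <- skewness(log(income))
print(c(before = skew_before, after = skew_after))
```

A skewness near zero after the transformation is consistent with a roughly symmetric distribution. Applying the same helper to final2$MonthlyIncome and to its log would give the corresponding figures for the real data.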