setwd("/Volumes/GoogleDrive/My Drive/Studies/Data Preprocessing/Actual Assignments/Assignment #3")
knitr::opts_chunk$set(echo = TRUE, fig.align = "center", fig.height = 6, fig.width = 12)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(mlr)
## Loading required package: ParamHelpers
library(knitr)
library(outliers)
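At a high level, the tasks we performed were: merging two datasets, checking the data structure and converting data types, confirming that the data is in tidy format, creating a new variable, scanning for missing values and outliers, and transforming a skewed variable.
This preprocessing will aid us when we apply statistical models, which is the second phase after preprocessing.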
RPubs Link to this pdf: https://rpubs.com/SimonP/Assignment_3
When an employee leaves a company, it is called attrition or turnover. Predicting employee turnover is a priority for Human Resource (HR) departments across the globe, as companies face massive costs resulting from employee attrition. These costs are both tangible, such as training expenses and time, and intangible, such as lost ideas, customer relationships or leadership.
The goal of this project is to utilise machine learning techniques to predict whether an employee will leave the company. Moreover, the major factors that lead to employee attrition will be identified in order to better answer questions about why people leave companies.
The dataset was taken from the IBM Watson Analytics website, and is a fictional dataset created by IBM data scientists. It contains 1,470 observations with 34 descriptive features and 1 target feature. The target has 1,233 “No” responses and 237 “Yes” responses, which makes this an imbalanced dataset. Furthermore, there are no missing values within the data.
Source
Short Link: https://ibm.co/2DvoldM
Original Link: https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
The response (target) feature, “Attrition”, is a factor with 2 levels, defined as:
\[\text{Attrition} \in \{\text{Yes}, \text{No}\}\]
where “Yes” indicates that an employee has left the company, and “No” indicates that an employee has stayed with the company.
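As a quick check of the class imbalance described above, the following sketch rebuilds the target from the reported counts (1,233 “No”, 237 “Yes”) and computes the class proportions. The vector `attrition` is a stand-in constructed from those counts, not the real column.

```r
# Stand-in target built from the counts reported for the dataset (assumption:
# only the counts matter here, not the row order)
attrition <- factor(rep(c("No", "Yes"), times = c(1233, 237)))

# Absolute and relative class frequencies
class_counts <- table(attrition)
class_share <- prop.table(class_counts)
print(round(class_share, 3))
```

Roughly 16% of the observations belong to the “Yes” class, which is worth keeping in mind when choosing evaluation metrics in the modelling phase.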
Read the raw data into data1 and view it.
Subset data1 into a new data frame named data2 containing three columns (“EmployeeNumber”, “JobInvolvement” and “Department”) and view it.
Subset data1 into a new data frame named data3 by dropping two columns (“JobInvolvement” and “Department”) and view it.
Merge data2 and data3 on “EmployeeNumber” into a new data frame named final, then view it.
# Reading Data
data1 <- read.csv("Employeechurn.csv",header = TRUE,sep = ",")
# View the data
head(data1, n = 2)
# data1: # of obs and # of variables
nrow(data1)
## [1] 1470
ncol(data1)
## [1] 35
# Task 1 - Merging two datasets
data2 <- subset(data1, select=c("EmployeeNumber", "JobInvolvement", "Department"))
colnames(data2)
## [1] "EmployeeNumber" "JobInvolvement" "Department"
# data2: # of obs and # of variables
nrow(data2)
## [1] 1470
ncol(data2)
## [1] 3
# Removing the two variables from the original dataset
data3 = subset(data1, select = -c(JobInvolvement,Department) )
colnames(data3)
## [1] "Attrition" "Age"
## [3] "BusinessTravel" "DailyRate"
## [5] "DistanceFromHome" "Education"
## [7] "EducationField" "EmployeeCount"
## [9] "EmployeeNumber" "EnvironmentSatisfaction"
## [11] "Gender" "HourlyRate"
## [13] "JobLevel" "JobRole"
## [15] "JobSatisfaction" "MaritalStatus"
## [17] "MonthlyIncome" "MonthlyRate"
## [19] "NumCompaniesWorked" "Over18"
## [21] "OverTime" "PercentSalaryHike"
## [23] "PerformanceRating" "RelationshipSatisfaction"
## [25] "StandardHours" "StockOptionLevel"
## [27] "TotalWorkingYears" "TrainingTimesLastYear"
## [29] "WorkLifeBalance" "YearsAtCompany"
## [31] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [33] "YearsWithCurrManager"
# data3: # of obs and # of variables
nrow(data3)
## [1] 1470
ncol(data3)
## [1] 33
# Merging the above two datasets on the shared key
final <- merge(data2, data3, by = "EmployeeNumber")
colnames(final)
## [1] "EmployeeNumber" "JobInvolvement"
## [3] "Department" "Attrition"
## [5] "Age" "BusinessTravel"
## [7] "DailyRate" "DistanceFromHome"
## [9] "Education" "EducationField"
## [11] "EmployeeCount" "EnvironmentSatisfaction"
## [13] "Gender" "HourlyRate"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
# final: # of obs and # of variables
nrow(final)
## [1] 1470
ncol(final)
## [1] 35
# Comparing the raw data (data1) to the final data set
# data1: # of obs and # of variables
nrow(data1)
## [1] 1470
ncol(data1)
## [1] 35
# final: # of obs and # of variables
nrow(final)
## [1] 1470
ncol(final)
## [1] 35
No observations or variables were lost: both the raw data (data1) and the merged data (final) have 1,470 rows and 35 columns.
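Matching dimensions do not by themselves show that the values survived the split and merge. A minimal sketch of a stricter check, using a small made-up data frame `toy` (hypothetical values, not taken from the real dataset), compares the merged result with the original after aligning row and column order:

```r
# Toy stand-in for data1 (hypothetical values)
toy <- data.frame(
  EmployeeNumber = c(4, 1, 2),
  Age            = c(37, 41, 49),
  JobInvolvement = c(2, 3, 2),
  Department     = c("R&D", "Sales", "R&D"),
  stringsAsFactors = FALSE
)

# Split into two pieces that share the key, then merge them back
left   <- subset(toy, select = c("EmployeeNumber", "JobInvolvement", "Department"))
right  <- subset(toy, select = -c(JobInvolvement, Department))
merged <- merge(left, right, by = "EmployeeNumber")

# The key must be unique, otherwise merge() would multiply rows
key_is_unique <- anyDuplicated(toy$EmployeeNumber) == 0

# merge() sorts by the key and reorders columns, so align before comparing
expected <- toy[order(toy$EmployeeNumber), names(merged)]
rownames(expected) <- NULL
same <- isTRUE(all.equal(merged, expected))
print(c(key_is_unique = key_is_unique, same = same))
```

The same pattern could be applied to data1 and final, assuming “EmployeeNumber” is unique in the real data.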
Using the str function, we found that several variables were not stored with an appropriate data type.
We applied the required data type conversions (e.g., character -> factor, character -> date, numeric -> factor); in this dataset only integer -> factor conversions were needed.
We reviewed the factor variables and labelled or ordered their levels where necessary.
Using the summarizeColumns and kable functions, we summarized the dataset again after the changes.
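The conversion from integer codes to a labelled factor can be sketched on a small vector. The codes in `education_codes` are made up for illustration; the labels are the ones used for Education in this report, and `ordered = TRUE` is an optional choice that makes comparisons between levels possible.

```r
# Hypothetical integer codes standing in for the Education column
education_codes <- c(2L, 1L, 4L, 3L, 5L, 2L)

# Convert the existing values, fixing level order and labels in one call
education <- factor(
  education_codes,
  levels  = 1:5,
  labels  = c("Below College", "College", "Bachelor", "Master", "Doctor"),
  ordered = TRUE
)

print(table(education))
# With an ordered factor, levels can be compared directly
below_master <- sum(education < "Master")
```

Passing the existing vector as the first argument matters: a call such as `factor(c(1,2,3,4,5))` on its own creates a new five-element factor rather than converting the column.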
# Task 3 and 4 - Observing the dataset structure for data conversions if any and tidying it
str(final)
## 'data.frame': 1470 obs. of 35 variables:
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
# Need to convert some of the variables in integer data type to factor data type
final$JobInvolvement<-as.factor(final$JobInvolvement)
final$Education<-as.factor(final$Education)
final$EnvironmentSatisfaction<-as.factor(final$EnvironmentSatisfaction)
final$JobLevel<-as.factor(final$JobLevel)
final$JobSatisfaction<-as.factor(final$JobSatisfaction)
final$PerformanceRating<-as.factor(final$PerformanceRating)
final$RelationshipSatisfaction<-as.factor(final$RelationshipSatisfaction)
final$StockOptionLevel<-as.factor(final$StockOptionLevel)
final$TrainingTimesLastYear<-as.factor(final$TrainingTimesLastYear)
final$WorkLifeBalance<-as.factor(final$WorkLifeBalance)
# Setting the level order and labels
final$Education <- factor(final$Education, levels=c(1,2,3,4,5))
levels(final$Education) <- list("Below College"=1,"College"=2,"Bachelor"=3,"Master"=4,"Doctor"=5)
final$EnvironmentSatisfaction <- factor(final$EnvironmentSatisfaction, levels=c(1,2,3,4))
levels(final$EnvironmentSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
final$JobInvolvement <- factor(final$JobInvolvement, levels=c(1,2,3,4))
levels(final$JobInvolvement) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
# Pass the column itself so the observed values are kept; factor(c(1,2,3,4,5))
# alone would be recycled down the column and overwrite the data
final$JobLevel <- factor(final$JobLevel, levels=c(1,2,3,4,5))
final$JobSatisfaction <- factor(final$JobSatisfaction, levels=c(1,2,3,4))
levels(final$JobSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
final$PerformanceRating <- factor(final$PerformanceRating, levels=c(1,2,3,4))
levels(final$PerformanceRating) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
final$RelationshipSatisfaction <- factor(final$RelationshipSatisfaction, levels=c(1,2,3,4))
levels(final$RelationshipSatisfaction) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
# Same as JobLevel: convert the existing column rather than a fresh vector
final$StockOptionLevel <- factor(final$StockOptionLevel, levels=c(0,1,2,3,4))
final$WorkLifeBalance <- factor(final$WorkLifeBalance, levels=c(1,2,3,4))
levels(final$WorkLifeBalance) <- list("Low"=1,"Medium"=2,"High"=3,"Very High"=4)
# Summary of variables
summary <- summarizeColumns(final)
#Kable to output the result in a nice table
kable(summary)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| EmployeeNumber | integer | 0 | 1024.865306 | 602.0243348 | 1020.5 | 790.9671 | 1 | 2068 | 0 |
| JobInvolvement | factor | 0 | NA | 0.4095238 | NA | NA | 83 | 868 | 4 |
| Department | factor | 0 | NA | 0.3462585 | NA | NA | 63 | 961 | 3 |
| Attrition | factor | 0 | NA | 0.1612245 | NA | NA | 237 | 1233 | 2 |
| Age | integer | 0 | 36.923809 | 9.1353735 | 36.0 | 8.8956 | 18 | 60 | 0 |
| BusinessTravel | factor | 0 | NA | 0.2904762 | NA | NA | 150 | 1043 | 3 |
| DailyRate | integer | 0 | 802.485714 | 403.5090999 | 802.0 | 510.0144 | 102 | 1499 | 0 |
| DistanceFromHome | integer | 0 | 9.192517 | 8.1068644 | 7.0 | 7.4130 | 1 | 29 | 0 |
| Education | factor | 0 | NA | 0.6108844 | NA | NA | 48 | 572 | 5 |
| EducationField | factor | 0 | NA | 0.5877551 | NA | NA | 27 | 606 | 6 |
| EmployeeCount | integer | 0 | 1.000000 | 0.0000000 | 1.0 | 0.0000 | 1 | 1 | 0 |
| EnvironmentSatisfaction | factor | 0 | NA | 0.6918367 | NA | NA | 284 | 453 | 4 |
| Gender | factor | 0 | NA | 0.4000000 | NA | NA | 588 | 882 | 2 |
| HourlyRate | integer | 0 | 65.891156 | 20.3294276 | 66.0 | 26.6868 | 30 | 100 | 0 |
| JobLevel | factor | 0 | NA | 0.8000000 | NA | NA | 294 | 294 | 5 |
| JobRole | factor | 0 | NA | 0.7782313 | NA | NA | 52 | 326 | 9 |
| JobSatisfaction | factor | 0 | NA | 0.6877551 | NA | NA | 280 | 459 | 4 |
| MaritalStatus | factor | 0 | NA | 0.5421769 | NA | NA | 327 | 673 | 3 |
| MonthlyIncome | integer | 0 | 6502.931293 | 4707.9567831 | 4919.0 | 3260.2374 | 1009 | 19999 | 0 |
| MonthlyRate | integer | 0 | 14313.103401 | 7117.7860441 | 14235.5 | 9201.7569 | 2094 | 26999 | 0 |
| NumCompaniesWorked | integer | 0 | 2.693197 | 2.4980090 | 2.0 | 1.4826 | 0 | 9 | 0 |
| Over18 | factor | 0 | NA | 0.0000000 | NA | NA | 1470 | 1470 | 1 |
| OverTime | factor | 0 | NA | 0.2829932 | NA | NA | 416 | 1054 | 2 |
| PercentSalaryHike | integer | 0 | 15.209524 | 3.6599377 | 14.0 | 2.9652 | 11 | 25 | 0 |
| PerformanceRating | factor | 0 | NA | 0.1537415 | NA | NA | 0 | 1244 | 2 |
| RelationshipSatisfaction | factor | 0 | NA | 0.6877551 | NA | NA | 276 | 459 | 4 |
| StandardHours | integer | 0 | 80.000000 | 0.0000000 | 80.0 | 0.0000 | 80 | 80 | 0 |
| StockOptionLevel | factor | 0 | NA | 0.8000000 | NA | NA | 294 | 294 | 5 |
| TotalWorkingYears | integer | 0 | 11.279592 | 7.7807817 | 10.0 | 5.9304 | 0 | 40 | 0 |
| TrainingTimesLastYear | factor | 0 | NA | 0.6278912 | NA | NA | 54 | 547 | 7 |
| WorkLifeBalance | factor | 0 | NA | 0.3925170 | NA | NA | 80 | 893 | 4 |
| YearsAtCompany | integer | 0 | 7.008163 | 6.1265252 | 5.0 | 4.4478 | 0 | 40 | 0 |
| YearsInCurrentRole | integer | 0 | 4.229252 | 3.6231370 | 3.0 | 4.4478 | 0 | 18 | 0 |
| YearsSinceLastPromotion | integer | 0 | 2.187755 | 3.2224303 | 1.0 | 1.4826 | 0 | 15 | 0 |
| YearsWithCurrManager | integer | 0 | 4.123129 | 3.5681361 | 3.0 | 4.4478 | 0 | 17 | 0 |
Hadley Wickham’s tidy data principles consist of three rules:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
None of these rules is violated, so the dataset is already in tidy format.
### Steps Taken
# Viewing the dataset
head(final, n = 5)
I’ve added a new column, “AnnualIncome”, to a new dataset named “final2”. I computed it by multiplying “MonthlyIncome” by 12, as there are 12 months in a year, and then summarized the new dataset.
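The arithmetic can be checked on a few values. This sketch uses the first three MonthlyIncome values shown in the str output earlier in this report; the data frame `toy` itself is only an illustration.

```r
# First three MonthlyIncome values from the str() output
toy <- data.frame(MonthlyIncome = c(5993L, 5130L, 2090L))

# Annual income is twelve monthly payments
toy$AnnualIncome <- toy$MonthlyIncome * 12
print(toy)
```

The same factor of 12 links the minimum and maximum of the two columns in the summary table (1009 and 19999 per month against 12108 and 239988 per year).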
# Task 6 - Add a new variable "AnnualIncome"
final2 <- mutate(final, AnnualIncome = MonthlyIncome * 12)
# Summary of variables
summary <- summarizeColumns(final2)
#Kable to output the result in a nice table
kable(summary)
| name | type | na | mean | disp | median | mad | min | max | nlevs |
|---|---|---|---|---|---|---|---|---|---|
| EmployeeNumber | integer | 0 | 1024.865306 | 6.020243e+02 | 1020.5 | 790.9671 | 1 | 2068 | 0 |
| JobInvolvement | factor | 0 | NA | 4.095238e-01 | NA | NA | 83 | 868 | 4 |
| Department | factor | 0 | NA | 3.462585e-01 | NA | NA | 63 | 961 | 3 |
| Attrition | factor | 0 | NA | 1.612245e-01 | NA | NA | 237 | 1233 | 2 |
| Age | integer | 0 | 36.923809 | 9.135374e+00 | 36.0 | 8.8956 | 18 | 60 | 0 |
| BusinessTravel | factor | 0 | NA | 2.904762e-01 | NA | NA | 150 | 1043 | 3 |
| DailyRate | integer | 0 | 802.485714 | 4.035091e+02 | 802.0 | 510.0144 | 102 | 1499 | 0 |
| DistanceFromHome | integer | 0 | 9.192517 | 8.106864e+00 | 7.0 | 7.4130 | 1 | 29 | 0 |
| Education | factor | 0 | NA | 6.108844e-01 | NA | NA | 48 | 572 | 5 |
| EducationField | factor | 0 | NA | 5.877551e-01 | NA | NA | 27 | 606 | 6 |
| EmployeeCount | integer | 0 | 1.000000 | 0.000000e+00 | 1.0 | 0.0000 | 1 | 1 | 0 |
| EnvironmentSatisfaction | factor | 0 | NA | 6.918367e-01 | NA | NA | 284 | 453 | 4 |
| Gender | factor | 0 | NA | 4.000000e-01 | NA | NA | 588 | 882 | 2 |
| HourlyRate | integer | 0 | 65.891156 | 2.032943e+01 | 66.0 | 26.6868 | 30 | 100 | 0 |
| JobLevel | factor | 0 | NA | 8.000000e-01 | NA | NA | 294 | 294 | 5 |
| JobRole | factor | 0 | NA | 7.782313e-01 | NA | NA | 52 | 326 | 9 |
| JobSatisfaction | factor | 0 | NA | 6.877551e-01 | NA | NA | 280 | 459 | 4 |
| MaritalStatus | factor | 0 | NA | 5.421769e-01 | NA | NA | 327 | 673 | 3 |
| MonthlyIncome | integer | 0 | 6502.931293 | 4.707957e+03 | 4919.0 | 3260.2374 | 1009 | 19999 | 0 |
| MonthlyRate | integer | 0 | 14313.103401 | 7.117786e+03 | 14235.5 | 9201.7569 | 2094 | 26999 | 0 |
| NumCompaniesWorked | integer | 0 | 2.693197 | 2.498009e+00 | 2.0 | 1.4826 | 0 | 9 | 0 |
| Over18 | factor | 0 | NA | 0.000000e+00 | NA | NA | 1470 | 1470 | 1 |
| OverTime | factor | 0 | NA | 2.829932e-01 | NA | NA | 416 | 1054 | 2 |
| PercentSalaryHike | integer | 0 | 15.209524 | 3.659938e+00 | 14.0 | 2.9652 | 11 | 25 | 0 |
| PerformanceRating | factor | 0 | NA | 1.537415e-01 | NA | NA | 0 | 1244 | 2 |
| RelationshipSatisfaction | factor | 0 | NA | 6.877551e-01 | NA | NA | 276 | 459 | 4 |
| StandardHours | integer | 0 | 80.000000 | 0.000000e+00 | 80.0 | 0.0000 | 80 | 80 | 0 |
| StockOptionLevel | factor | 0 | NA | 8.000000e-01 | NA | NA | 294 | 294 | 5 |
| TotalWorkingYears | integer | 0 | 11.279592 | 7.780782e+00 | 10.0 | 5.9304 | 0 | 40 | 0 |
| TrainingTimesLastYear | factor | 0 | NA | 6.278912e-01 | NA | NA | 54 | 547 | 7 |
| WorkLifeBalance | factor | 0 | NA | 3.925170e-01 | NA | NA | 80 | 893 | 4 |
| YearsAtCompany | integer | 0 | 7.008163 | 6.126525e+00 | 5.0 | 4.4478 | 0 | 40 | 0 |
| YearsInCurrentRole | integer | 0 | 4.229252 | 3.623137e+00 | 3.0 | 4.4478 | 0 | 18 | 0 |
| YearsSinceLastPromotion | integer | 0 | 2.187755 | 3.222430e+00 | 1.0 | 1.4826 | 0 | 15 | 0 |
| YearsWithCurrManager | integer | 0 | 4.123129 | 3.568136e+00 | 3.0 | 4.4478 | 0 | 17 | 0 |
| AnnualIncome | numeric | 0 | 78035.175510 | 5.649548e+04 | 59028.0 | 39122.8488 | 12108 | 239988 | 0 |
print("final2: # of obs and # of variables")
## [1] "final2: # of obs and # of variables"
nrow(final2)
## [1] 1470
ncol(final2)
## [1] 36
No missing or special (NaN, infinite) values were found in the dataset, which completes Task 7.
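Checking with `is.na` alone does not catch infinite values, which is why a separate check for special values is useful. A minimal sketch on a made-up data frame `toy` (hypothetical values) shows the difference; the helper name `is_special` is chosen here for illustration.

```r
# Toy data with one NA, one Inf and one NaN in the numeric column
toy <- data.frame(
  income = c(1000, NA, Inf, NaN),
  dept   = c("HR", NA, "Sales", "R&D"),
  stringsAsFactors = FALSE
)

# is.na() flags NA and NaN but not Inf
na_per_column <- colSums(is.na(toy))

# Numeric columns: anything non-finite; other columns: NA only
is_special <- function(x) if (is.numeric(x)) !is.finite(x) else is.na(x)
special_per_column <- sapply(toy, function(col) sum(is_special(col)))

print(rbind(na_per_column, special_per_column))
```

On the real data every count is expected to be zero, consistent with the result reported above.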
# Task 7 - Check for Missing Values
Check_missing_values<-is.na(final2)
# All the values are false
table(Check_missing_values)
## Check_missing_values
## FALSE
## 52920
# No missing values or inconsistencies
sum(is.na(final2))
## [1] 0
# Flag special values: non-finite (NA, NaN, Inf) in numeric columns, NA otherwise
is.special <- function(x){if (is.numeric(x)) !is.finite(x) else is.na(x)}
# Apply the function to every column and count the flagged values
results <- sapply(final2, function(x) sum(is.special(x)))
# List any columns with at least one special value (none expected)
results[results > 0]
# Outliers test
chisq.out.test(final2$MonthlyIncome,variance = var(final2$MonthlyIncome,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$MonthlyIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 1009 is an outlier
chisq.out.test(final2$DailyRate,variance = var(final2$DailyRate,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$DailyRate
## X-squared = 2.9796, p-value = 0.08432
## alternative hypothesis: highest value 1499 is an outlier
chisq.out.test(final2$DistanceFromHome,variance = var(final2$DistanceFromHome,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$DistanceFromHome
## X-squared = 1.0212, p-value = 0.3122
## alternative hypothesis: lowest value 1 is an outlier
chisq.out.test(final2$HourlyRate,variance = var(final2$HourlyRate,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$HourlyRate
## X-squared = 2.815, p-value = 0.09338
## alternative hypothesis: highest value 100 is an outlier
chisq.out.test(final2$MonthlyRate,variance = var(final2$MonthlyRate,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$MonthlyRate
## X-squared = 2.9471, p-value = 0.08603
## alternative hypothesis: lowest value 2094 is an outlier
chisq.out.test(final2$NumCompaniesWorked,variance = var(final2$NumCompaniesWorked,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$NumCompaniesWorked
## X-squared = 1.1624, p-value = 0.281
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$PercentSalaryHike,variance = var(final2$PercentSalaryHike,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$PercentSalaryHike
## X-squared = 1.3229, p-value = 0.2501
## alternative hypothesis: lowest value 11 is an outlier
chisq.out.test(final2$TotalWorkingYears,variance = var(final2$TotalWorkingYears,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$TotalWorkingYears
## X-squared = 2.1016, p-value = 0.1471
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsAtCompany,variance = var(final2$YearsAtCompany,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$YearsAtCompany
## X-squared = 1.3085, p-value = 0.2527
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsInCurrentRole,variance = var(final2$YearsInCurrentRole,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$YearsInCurrentRole
## X-squared = 1.3626, p-value = 0.2431
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsSinceLastPromotion,variance = var(final2$YearsSinceLastPromotion,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$YearsSinceLastPromotion
## X-squared = 0.46093, p-value = 0.4972
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$YearsWithCurrManager,variance = var(final2$YearsWithCurrManager,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$YearsWithCurrManager
## X-squared = 1.3353, p-value = 0.2479
## alternative hypothesis: lowest value 0 is an outlier
chisq.out.test(final2$AnnualIncome,variance = var(final2$AnnualIncome,na.rm=TRUE),opposite = TRUE)
##
## chi-squared test for outlier
##
## data: final2$AnnualIncome
## X-squared = 1.3618, p-value = 0.2432
## alternative hypothesis: lowest value 12108 is an outlier
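# With opposite = TRUE each test checks the value on the opposite side from the most extreme one
# All p-values are > 0.05, so none of the tested values is flagged as an outlier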
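To make the test output easier to read, the following sketch reproduces the MonthlyIncome result by hand. It assumes the statistic is the squared deviation of the tested value from the mean divided by the variance, compared against a chi-squared distribution with one degree of freedom; this assumption matches the figures printed above. The mean and standard deviation are copied from the summary table in this report.

```r
# Values taken from the summary table (mean, disp) and the test output
mean_income  <- 6502.931293
sd_income    <- 4707.9567831
tested_value <- 1009   # lowest MonthlyIncome

# Squared standardized distance of the tested value from the mean
stat <- (tested_value - mean_income)^2 / sd_income^2

# Upper-tail probability under a chi-squared distribution with 1 df
p_value <- 1 - pchisq(stat, df = 1)

print(round(c(statistic = stat, p_value = p_value), 4))
```

Because the statistic depends only on how many standard deviations the value lies from the mean, AnnualIncome (MonthlyIncome times 12) yields the same statistic and p-value as MonthlyIncome.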
Looking at Monthly Income, we can see that its distribution is positively skewed.
We then take the natural log of Monthly Income, which brings its distribution closer to a normal distribution.
# Task 9
par(mfrow=c(1,1))
hist(final2$MonthlyIncome)
# QQ Plot before transformation
qqnorm(final2$MonthlyIncome, main = "Normal QQ Plot: MonthlyIncome")
qqline(final2$MonthlyIncome, col = "red")
# Taking logs to transform it
Transform.MonthlyIncome<-log(final2$MonthlyIncome)
# After log transformation - QQ plot
qqnorm(Transform.MonthlyIncome, main = "Normal QQ Plot: log(MonthlyIncome)")
qqline(Transform.MonthlyIncome, col = "red")
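The effect of the log transformation can also be expressed as a number. The sketch below uses simulated right-skewed data (a log-normal sample, not the real MonthlyIncome column) and a hand-written moment-based `skewness` helper, both of which are assumptions made for illustration.

```r
# Simulated right-skewed "income" values; parameters are arbitrary
set.seed(1)
income <- rlnorm(1000, meanlog = 8.5, sdlog = 0.7)

# Simple moment-based sample skewness: mean of cubed z-scores
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(income)
skew_log <- skewness(log(income))
print(round(c(raw = skew_raw, logged = skew_log), 2))
```

A skewness close to zero after the transformation is consistent with points lying nearer to the reference line in the second QQ plot.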