Absenteeism is a major issue for any organization and costs a large amount of money. We cannot achieve 0% absenteeism, because an employee can develop health issues at any point during his or her time on the job. Let's explore!
To achieve the above goal, the basic data we require are employee demographics and attendance records, which give clear information about each employee:
- Regularity at work.
- Whether he/she takes all PL in the month.
- Number of holidays acquired by the employee.
- Number of hours absent.
- How early he/she arrives at the office in the morning.
- How early he/she leaves the office.
Data Summary:
## 'data.frame': 8336 obs. of 13 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946
## 1917 503 2152 3451 222 ...
## $ GivenName : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450
## 514 1260 625 760 1305 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180
## 223 192 144 223 ...
## $ JobTitle : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5
## 5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37
## 21 29 35 38 21 35 ...
## $ Division : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6
## 6 6 2 6 6 2 ...
## $ Age : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1
## ...
What are the takeaway points from the data summary? Another method for exploring the data is summary():
## EmployeeNumber Surname GivenName Gender
## Min. : 1 Johnson : 106 James : 182 F:4120
## 1st Qu.:2085 Smith : 86 John : 161 M:4216
## Median :4168 Jones : 71 Robert : 136
## Mean :4168 Williams: 71 Mary : 124
## 3rd Qu.:6252 Brown : 62 William: 121
## Max. :8336 Moore : 47 Michael: 107
## (Other) :7893 (Other):7505
## City JobTitle DepartmentName
## Vancouver :1780 Cashier :1703 Customer Service:1737
## Victoria : 690 Dairy Person :1514 Dairy :1515
## New Westminster: 540 Meat Cutter :1480 Meats :1514
## Burnaby : 339 Baker :1404 Bakery :1449
## Surrey : 275 Produce Clerk:1129 Produce :1163
## Richmond : 228 Shelf Stocker: 712 Processed Foods : 746
## (Other) :4484 (Other) : 394 (Other) : 212
## StoreLocation Division Age
## Vancouver :1836 Executive : 11 Min. : 3.505
## Victoria : 853 FinanceAndAccounting: 73 1st Qu.:35.299
## Nanaimo : 610 HumanResources : 76 Median :42.115
## New Westminster: 525 InfoTech : 10 Mean :42.007
## Kelowna : 418 Legal : 3 3rd Qu.:48.667
## Kamloops : 360 Stores :8163 Max. :77.938
## (Other) :3734
## LengthService AbsentHours BusinessUnit
## Min. : 0.0121 Min. : 0.00 HeadOffice: 173
## 1st Qu.: 3.5759 1st Qu.: 19.13 Stores :8163
## Median : 4.6002 Median : 56.01
## Mean : 4.7829 Mean : 61.28
## 3rd Qu.: 5.6239 3rd Qu.: 94.28
## Max. :43.7352 Max. :272.53
##
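The two listings above look like the output of base R's `str()` and `summary()`. As a minimal sketch, assuming a small made-up data frame `toy` (its names and values are illustrative and not taken from the real data), the two views can be compared side by side:

```r
# Toy data frame; names and values are illustrative assumptions only
toy <- data.frame(
  Gender      = factor(c("F", "M", "M", "F")),
  Age         = c(32.0, 40.3, 48.8, 44.6),
  AbsentHours = c(36.6, 30.2, 83.8, 70.0)
)

# str() prints one line per column: its type and the first few values
str(toy)

# summary() prints quartiles for numeric columns and counts for factors
toy_summary <- summary(toy)
print(toy_summary)
```

`str()` is the quicker check on column types, while `summary()` is where implausible values (such as a minimum age of 3.5) tend to show up.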
Categorical Variables
## Var1 Freq
## 1 Abbotsford 114
## 2 Agassiz 9
## 3 Aiyansh 10
## 4 Aldergrove 94
## 5 Alexis Creek 7
## 6 Alkali Lake 3
## 7 Armstrong 8
## 8 Ashcroft 15
## 9 Atlin 5
## 10 Avola 9
## 11 Balfour 8
## 12 Bamfield 7
## 13 Barriere 9
## 14 Bear Lake 3
## 15 Beaver Valley 10
## 16 Bella Bella 7
## 17 Black Point 9
## 18 Black Pool 3
## 19 Blue River 2
## 20 Blueberry 6
## 21 Bob Quinn Lake 11
## 22 Boston Bar 14
## 23 Bouchie Lake 5
## 24 Bougie Creek 4
## 25 Bowen Island 5
## 26 Brackendale 8
## 27 Bridge Lake 3
## 28 Britannia Beach 7
## 29 Burnaby 339
## 30 Burns Lake 14
## 31 Cache Creek 10
## 32 Campbell River 87
## 33 Canal Flats 9
## 34 Cassiar 11
## 35 Castlegar 20
## 36 Celista 6
## 37 Chase 6
## 38 Chemainus 8
## 39 Chetwynd 23
## 40 Chief Lake 7
## 41 Chilako River 6
## 42 Chilanko Forks 6
## 43 Chilliwack 90
## 44 Christina Lake 8
## 45 Clearwater 7
## 46 Clinton 9
## 47 Cluculz Lake 5
## 48 Cobble Hill 13
## 49 Comox 19
## 50 Coquitlam 36
## 51 Cortes Island 7
## 52 Courtenay 49
## 53 Cranbrook 72
## 54 Crawford Bay 18
## 55 Creston 25
## 56 Cumberland 9
## 57 D'arcy 6
## 58 Dawson Creek 21
## 59 Dease Lake 4
## 60 Decker Lake 10
## 61 Douglas Lake 4
## 62 Dragon Lake 9
## 63 Duncan 72
## 64 Elkford 8
## 65 Elko 5
## 66 Enderby 8
## 67 Fairmont Hot Springs 6
## 68 Fauquier 8
## 69 Fernie 6
## 70 Field 4
## 71 Flatrock 10
## 72 Forest Grove 7
## 73 Fort Fraser 11
## 74 Fort Langley 38
## 75 Fort Nelson 55
## 76 Fort St James 8
## 77 Fort St John 69
## 78 Francois Lake 8
## 79 Fraser Lake 9
## 80 Fruitvale 15
## 81 Fulford Harbour 13
## 82 Gabriola Island 6
## 83 Ganges 14
## 84 Genelle 7
## 85 Gibsons 27
## 86 Giscome 7
## 87 Gold Bridge 9
## 88 Golden 16
## 89 Good Hope Lake 6
## 90 Grand Forks 22
## 91 Granisle 10
## 92 Grasmere 12
## 93 Grassy Plains 5
## 94 Greenwood 11
## 95 Haney 32
## 96 Hansard 5
## 97 Hazelton 5
## 98 Hedley 11
## 99 Hemlock Valley 8
## 100 Hixon 6
## 101 Hope 18
## 102 Horsefly 9
## 103 Houston 16
## 104 Huntingdon 17
## 105 Invermere 8
## 106 Iskut 15
## 107 Jaffray 11
## 108 Kamloops 156
## 109 Kaslo 23
## 110 Kelowna 158
## 111 Keremeos 2
## 112 Kimberley 13
## 113 Kitimat 11
## 114 Kitwanga 7
## 115 Klemtu 7
## 116 Lac La Hache 6
## 117 Ladysmith 9
## 118 Lake Cowichan 9
## 119 Lakelse Lake 9
## 120 Lakeview Heights 12
## 121 Langley 122
## 122 Likely 6
## 123 Lillooet 4
## 124 Little Fort 4
## 125 Logan Lake 13
## 126 Lower Post 5
## 127 Lumby 12
## 128 Lytton 2
## 129 Mackenzie 10
## 130 Manning Park 6
## 131 Mayne Island 6
## 132 Mcbride 6
## 133 Mcleese Lake 8
## 134 Merritt 14
## 135 Mica Creek 9
## 136 Midway 10
## 137 Mission 9
## 138 Montney 14
## 139 Muncho Lake 5
## 140 Nakusp 5
## 141 Nanaimo 176
## 142 Nelson 45
## 143 New Westminister 62
## 144 New Westminster 540
## 145 Nimpo Lake 6
## 146 North Pender Island 7
## 147 North Vancouver 111
## 148 Ocean Falls 12
## 149 Okanagan Falls 7
## 150 Okanagan Mission 3
## 151 Oliver 11
## 152 Osoyoos 8
## 153 Oyama 9
## 154 Oyster River 10
## 155 Parksville 22
## 156 Parson 7
## 157 Peachland 13
## 158 Pemberton 4
## 159 Pender Harbour 9
## 160 Penticton 82
## 161 Pitt Meadows 12
## 162 Port Alberni 54
## 163 Port Alice 11
## 164 Port Coquitlam 89
## 165 Port Edward 7
## 166 Port Hardy 38
## 167 Port Mcneill 12
## 168 Port Mellon 4
## 169 Port Renfrew 5
## 170 Pouce Coupe 5
## 171 Powell River 9
## 172 Prince George 174
## 173 Princeton 16
## 174 Pritchard 8
## 175 Quadra Island 7
## 176 Qualicum Beach 21
## 177 Quesnel 30
## 178 Radium Hot Springs 5
## 179 Revelstoke 10
## 180 Richmond 228
## 181 Riske Creek 6
## 182 Rock Creek 7
## 183 Rosedale 5
## 184 Rossland 11
## 185 Rutland 18
## 186 Salmo 3
## 187 Salmon Arm 35
## 188 Salmon Valley 7
## 189 Sandspit 6
## 190 Sardis 23
## 191 Sayward 5
## 192 Sechelt 17
## 193 Seton Portage 4
## 194 Sicamous 4
## 195 Sidney 27
## 196 Skookumchuck 10
## 197 Slocan 8
## 198 Smithers 14
## 199 Sooke 9
## 200 Sorrento 10
## 201 South Slocan 7
## 202 Sparwood 21
## 203 Spences Bridge 8
## 204 Spillimacheen 5
## 205 Squamish 32
## 206 Summerland 17
## 207 Surrey 275
## 208 Tappen 9
## 209 Tatla Lake 6
## 210 Taylor 8
## 211 Telegraph Creek 4
## 212 Terrace 55
## 213 Toad River 5
## 214 Tofino 19
## 215 Topley 11
## 216 Trail 45
## 217 Tumbler Ridge 7
## 218 Ucluelet 7
## 219 Union Bay 10
## 220 Valemount 7
## 221 Vallican 16
## 222 Vananda 7
## 223 Vancouver 1780
## 224 Vanderhoof 15
## 225 Vavenby 7
## 226 Vernon 90
## 227 Victoria 690
## 228 Wells 3
## 229 West Vancouver 60
## 230 Westbank 15
## 231 Westwold 10
## 232 Whistler 82
## 233 White Rock 38
## 234 Wildwood 8
## 235 Williams Lake 40
## 236 Willow Point 16
## 237 Winfield 12
## 238 Woss 6
## 239 Wynndel 7
## 240 Yahk 6
## 241 Yale 6
## 242 Yarrow 7
## 243 Youbou 6
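The `Var1` and `Freq` column names in the listing above are what `as.data.frame(table(...))` produces in base R, so the listing was probably generated that way; this is an assumption, since the code is not shown. A minimal sketch on a made-up `toy` data frame:

```r
# Made-up city values, including the two spellings seen in the listing above
toy <- data.frame(
  City = factor(c("Vancouver", "Victoria", "Vancouver",
                  "New Westminster", "New Westminister", "Vancouver"))
)

# Count each level and turn the counts into a two-column data frame
city_freq <- as.data.frame(table(toy$City))
print(city_freq)

# Sort by frequency to see the largest cities first
city_freq[order(-city_freq$Freq), ]
```

A table like this also makes data-entry problems easy to spot: rows 143 and 144 of the listing ("New Westminister" and "New Westminster") look like two spellings of the same city.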
Numeric Variables (other than Employee Number)
# Density plot of employee age
d <- density(MFGEmployees$Age)
plot(d, main="Age")
polygon(d, col="red", border="blue")
# Density plot of length of service
d <- density(MFGEmployees$LengthService)
plot(d, main="Length of Service")
polygon(d, col="red", border="blue")
# Density plot of absent hours
d <- density(MFGEmployees$AbsentHours)
plot(d, main="Absent Hours")
polygon(d, col="red", border="blue")
# Keep only employees aged 18 to 65
MFGEmployees<-subset(MFGEmployees,MFGEmployees$Age>=18)
MFGEmployees<-subset(MFGEmployees,MFGEmployees$Age<=65)
summary(MFGEmployees)
## EmployeeNumber Surname GivenName Gender
## Min. : 1 Johnson : 103 James : 180 F:4017
## 1st Qu.:2081 Smith : 85 John : 157 M:4148
## Median :4166 Jones : 70 Robert : 134
## Mean :4165 Williams: 69 William: 120
## 3rd Qu.:6245 Brown : 62 Mary : 118
## Max. :8336 Moore : 46 Michael: 104
## (Other) :7730 (Other):7352
## City JobTitle DepartmentName
## Vancouver :1751 Cashier :1663 Customer Service:1695
## Victoria : 677 Dairy Person :1476 Meats :1495
## New Westminster: 530 Meat Cutter :1461 Dairy :1477
## Burnaby : 332 Baker :1375 Bakery :1420
## Surrey : 261 Produce Clerk:1101 Produce :1133
## Richmond : 223 Shelf Stocker: 701 Processed Foods : 735
## (Other) :4391 (Other) : 388 (Other) : 210
## StoreLocation Division Age
## Vancouver :1807 Executive : 11 Min. :18.20
## Victoria : 837 FinanceAndAccounting: 73 1st Qu.:35.46
## Nanaimo : 601 HumanResources : 75 Median :42.10
## New Westminster: 515 InfoTech : 10 Mean :41.99
## Kelowna : 405 Legal : 3 3rd Qu.:48.51
## Kamloops : 352 Stores :7993 Max. :65.00
## (Other) :3648
## LengthService AbsentHours BusinessUnit
## Min. : 0.05328 Min. : 0.00 HeadOffice: 172
## 1st Qu.: 3.58261 1st Qu.: 20.07 Stores :7993
## Median : 4.59800 Median : 55.86
## Mean : 4.78887 Mean : 60.47
## 3rd Qu.: 5.62358 3rd Qu.: 93.38
## Max. :43.73524 Max. :252.19
##
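The age filter keeps employees aged 18 to 65, which reduces the data from 8336 rows to 8165 (the Gender counts above sum to 8165). A minimal sketch of the same filter written as a single condition, on made-up ages (the `toy` data frame is an assumption for illustration):

```r
# Made-up ages, some outside the plausible working range
toy <- data.frame(EmployeeNumber = 1:6,
                  Age = c(3.5, 17.9, 18.0, 42.1, 65.0, 77.9))

# Keep rows whose Age lies in the closed interval [18, 65]
kept <- subset(toy, Age >= 18 & Age <= 65)

nrow(toy) - nrow(kept)   # number of rows dropped
range(kept$Age)          # smallest and largest remaining age
```

Combining both bounds in one `subset()` call gives the same rows as filtering twice and makes the intended range visible in one place.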
To express absence as a rate, we divide each employee's absent hours by 2,080 scheduled hours per year (40 hours per week * 52 paid weeks) and multiply by 100 to get a percentage.
MFGEmployees$AbsenceRate<-MFGEmployees$AbsentHours/2080*100
str(MFGEmployees,width=80,strict.width ="wrap")
## 'data.frame': 8165 obs. of 14 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946
## 1917 503 2152 3451 222 ...
## $ GivenName : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450
## 514 1260 625 760 1305 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180
## 223 192 144 223 ...
## $ JobTitle : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5
## 5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37
## 21 29 35 38 21 35 ...
## $ Division : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6
## 6 6 2 6 6 2 ...
## $ Age : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1
## ...
## $ AbsenceRate : num 1.76 1.45 4.03 3.37 0 ...
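As a quick check of the formula, the first few `AbsentHours` values shown above can be converted by hand. This sketch uses only those displayed (rounded) values, so it reproduces the displayed `AbsenceRate` values 1.76, 1.45, 4.03, 3.37 and 0 to two decimals:

```r
hours_per_year <- 40 * 52          # 2080 scheduled hours per year
absent_hours   <- c(36.6, 30.2, 83.8, 70, 0)

# Absence rate as a percentage of scheduled hours
absence_rate <- absent_hours / hours_per_year * 100
round(absence_rate, 2)
```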
Let's answer some of the questions we raised while defining the goal.
mean(MFGEmployees$AbsenceRate)
## [1] 2.907265
Do we have excess absenteeism?
As the graph shows, there is excess absenteeism, which appears as black dots.
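The graph referred to here is not reproduced in the text. In a boxplot, points beyond the whiskers are drawn as individual dots, which is presumably what the black dots are; that reading is an assumption. A minimal sketch on made-up absence rates, using base R's `boxplot.stats()` to list the same points a boxplot would flag:

```r
# Made-up absence rates (percent); the last two are deliberately extreme
rate <- c(0, 0.5, 1.2, 1.8, 2.4, 2.9, 3.1, 3.6, 4.0, 4.4, 11.5, 12.1)

# $out holds the values lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(rate)$out
outliers

# The same points appear as separate dots in the plot
boxplot(rate, horizontal = TRUE, main = "Absence Rate")
```

Counting `length(outliers)` gives a rough number of employees with unusually high absence, which is easier to report than a visual impression.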
Diagnostic Analytics
There are two things we are often interested in at the diagnostic stage:
- For numeric variables, whether there is any correlation.
For numeric variables, a strong correlation between an independent variable and the dependent variable may indicate a level of predictability. A strong correlation between independent variables may indicate that they will not serve as good predictors in conjunction with each other.
- For categorical variables, whether there are statistically significant differences on the numeric variables (usually the dependent variable or metric of interest).
For categorical variables, which act as the independent variables with a numeric variable (the metric of interest) as the dependent variable, finding statistically significant differences helps answer ‘why’ things are happening. The ‘why’ is sometimes answered through ‘where’ it is happening.
Numeric Variables
Absence Rate Vs Age
library(RcmdrMisc)
scatterplot(AbsenceRate ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)
cor(MFGEmployees$Age, MFGEmployees$AbsenceRate)
## [1] 0.8246129
The correlation between age and absence rate is very strong, over 0.80, so age appears to be highly predictive of absence rate.
AbsenceRate Vs LengthService
scatterplot(AbsenceRate ~ LengthService, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)
cor(MFGEmployees$LengthService, MFGEmployees$AbsenceRate)
## [1] -0.04669242
The correlation between length of service and absence rate is almost nonexistent at -0.04669242.
LengthService ~ Age
scatterplot(LengthService ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)
cor(MFGEmployees$Age, MFGEmployees$LengthService)
## [1] 0.05623405
The correlation between the two independent variables, length of service and age, is also almost nonexistent at 0.05623405.
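The three pairwise correlations can also be read off a single correlation matrix. A minimal sketch, assuming a made-up data frame `toy` with the same three column names (the values are invented so that absence rate rises with age):

```r
# Made-up values; only the column names mirror the real data
toy <- data.frame(
  Age           = c(22, 28, 35, 41, 47, 53, 60),
  LengthService = c(4.1, 6.3, 2.2, 5.0, 3.4, 6.8, 4.6),
  AbsenceRate   = c(0.4, 1.1, 1.9, 2.8, 3.5, 4.9, 6.2)
)

# Pearson correlations for every pair of columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# A single pair, equivalent to a two-argument cor(x, y) call
cor(toy$Age, toy$AbsenceRate)
```

The off-diagonal entries in the `AbsenceRate` row show predictor-versus-target correlations, while the `Age` and `LengthService` entry shows whether the two predictors overlap.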
Categorical Variables
Absence Rate by Gender
library(ggplot2)
ggplot() + geom_boxplot(aes(y= AbsenceRate, x= Gender), data = MFGEmployees) +
coord_flip()
The boxplot suggests that absence rates are higher for females than for males, but from the visualization alone you wouldn’t know whether the difference is statistically significant.
This is where it can be quite useful to test for differences in means across the levels of these categorical variables. The standard statistical procedure often used for this is Analysis of Variance (ANOVA).
library(RcmdrMisc)
AnovaModel.1 <- (lm(MFGEmployees$AbsenceRate ~ Gender, data=MFGEmployees))
Anova(AnovaModel.1)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## Gender 496 1 97.773 < 2.2e-16 ***
## Residuals 41379 8163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Gender),
mean, na.rm=TRUE)))
## F M
## 3.157624 2.664813
The test shows that the absence rate differs significantly between genders (Pr(>F) < .05).
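For a model with a single factor, the same F test can also be obtained with base R's `anova()`, without the extra package. A minimal sketch on made-up data (the `toy` values are assumptions chosen so the two groups clearly differ):

```r
# Made-up absence rates for two groups of six employees
toy <- data.frame(
  Gender      = factor(rep(c("F", "M"), each = 6)),
  AbsenceRate = c(3.4, 2.9, 3.6, 3.1, 3.3, 3.0,   # F
                  2.5, 2.8, 2.4, 2.9, 2.6, 2.7)   # M
)

# One-way ANOVA: does mean AbsenceRate differ between the Gender levels?
model <- lm(AbsenceRate ~ Gender, data = toy)
anova_table <- anova(model)
anova_table

# p-value of the Gender term; below .05 is read as a significant difference
p_value <- anova_table[["Pr(>F)"]][1]
p_value < 0.05

# Group means, to see the direction of the difference
group_means <- tapply(toy$AbsenceRate, toy$Gender, mean)
group_means
```

The F test only says that the group means differ; the group means themselves are needed to see which group is higher and by how much.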
AnovaModel.2 <- (lm(MFGEmployees$AbsenceRate ~ City, data=MFGEmployees))
Anova(AnovaModel.2)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## City 1085 242 0.8704 0.9254
## Residuals 40790 7922
Absence Rate doesn’t vary significantly by City: Pr(>F) = 0.9254, which is greater than .05.
AnovaModel.3 <- (lm(MFGEmployees$AbsenceRate ~ JobTitle,
data=MFGEmployees))
Anova(AnovaModel.3)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## JobTitle 281 46 1.1928 0.1745
## Residuals 41593 8118
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(JobTitle),
mean, na.rm=TRUE)))
## Accounting Clerk Accounts Payable Clerk
## 1.946229 1.623203
## Accounts Receiveable Clerk Auditor
## 1.419389 2.160575
## Baker Bakery Manager
## 2.923429 2.579606
## Benefits Admin Cashier
## 2.972834 2.996985
## CEO CHief Information Officer
## 1.709012 4.170790
## Compensation Analyst Corporate Lawyer
## 1.790482 2.471724
## Customer Service Manager Dairy Manager
## 2.546558 9.250115
## Dairy Person Director, Accounting
## 2.991750 2.462047
## Director, Accounts Payable Director, Accounts Receivable
## 1.818838 1.647439
## Director, Audit Director, Compensation
## 1.499403 1.218407
## Director, Employee Records Director, HR Technology
## 2.006652 0.000000
## Director, Investments Director, Labor Relations
## 0.000000 0.000000
## Director, Recruitment Director, Training
## 4.177417 1.412981
## Exec Assistant, Finance Exec Assistant, Human Resources
## 0.000000 2.872417
## Exec Assistant, Legal Counsel Exec Assistant, VP Stores
## 3.758468 0.000000
## HRIS Analyst Investment Analyst
## 2.493649 3.190082
## Labor Relations Analyst Legal Counsel
## 2.655482 0.000000
## Meat Cutter Meats Manager
## 2.846953 3.006050
## Processed Foods Manager Produce Clerk
## 2.617214 2.800071
## Produce Manager Recruiter
## 2.684084 3.186081
## Shelf Stocker Store Manager
## 3.002924 2.517101
## Systems Analyst Trainer
## 1.925995 3.084251
## VP Finance VP Human Resources
## 3.173603 3.416529
## VP Stores
## 3.586138
Absence Rate doesn’t vary significantly by Job Title. Pr(>F) is greater than .05.
AnovaModel.4 <- (lm(MFGEmployees$AbsenceRate ~ DepartmentName, data=MFGEmployees))
Anova(AnovaModel.4)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## DepartmentName 172 20 1.675 0.02995 *
## Residuals 41703 8144
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(DepartmentName),
mean, na.rm=TRUE)))
## Accounting Accounts Payable Accounts Receiveable
## 1.974886 1.636245 1.433642
## Audit Bakery Compensation
## 2.116497 2.912533 1.726918
## Customer Service Dairy Employee Records
## 2.988482 2.995988 2.892319
## Executive HR Technology Information Technology
## 2.323580 2.315531 1.925995
## Investment Labor Relations Legal
## 2.835628 2.434192 2.471724
## Meats Processed Foods Produce
## 2.850572 2.985082 2.796795
## Recruitment Store Management Training
## 3.262338 2.517101 2.972833
Absence Rate does vary significantly by Department. Pr(>F) is less than .05. Let’s visualize this:
ggplot() + geom_boxplot(aes(y= AbsenceRate, x= DepartmentName), data = MFGEmployees) +
coord_flip()
AnovaModel.5 <- (lm(MFGEmployees$AbsenceRate ~ StoreLocation, data=MFGEmployees))
Anova(AnovaModel.5)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## StoreLocation 191 39 0.956 0.5483
## Residuals 41683 8125
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(StoreLocation),
mean, na.rm=TRUE)))
## Abbotsford Aldergrove Bella Bella Blue River
## 2.937386 2.901258 2.031680 2.911973
## Burnaby Chilliwack Cortes Island Cranbrook
## 2.944466 3.082989 2.485313 2.952225
## Dawson Creek Dease Lake Fort Nelson Fort St John
## 3.545463 1.456964 2.741195 2.422886
## Grand Forks Haney Kamloops Kelowna
## 3.251701 2.549883 2.869252 2.898045
## Langley Nanaimo Nelson New Westminister
## 2.579410 2.975264 3.454472 2.930529
## New Westminster North Vancouver Ocean Falls Pitt Meadows
## 2.969692 3.016872 3.711442 2.347963
## Port Coquitlam Prince George Princeton Quesnel
## 3.160728 2.919220 2.679282 3.109154
## Richmond Squamish Surrey Terrace
## 3.090128 2.907684 2.984772 3.082298
## Trail Valemount Vancouver Vernon
## 3.115680 3.382016 2.855725 2.898143
## Victoria West Vancouver White Rock Williams Lake
## 2.792531 2.688103 3.237530 2.854828
Absence Rate does not vary significantly by Store Location. Pr(>F) is greater than .05.
AnovaModel.6 <- (lm(MFGEmployees$AbsenceRate ~ Division, data=MFGEmployees))
Anova(AnovaModel.6)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## Division 91 5 3.5617 0.003218 **
## Residuals 41783 8159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Division),
mean, na.rm=TRUE)))
## Executive FinanceAndAccounting HumanResources
## 2.323580 1.921890 2.651743
## InfoTech Legal Stores
## 1.925995 2.471724 2.920856
Absence Rate does vary significantly by Division. Pr(>F) is less than .05.
ggplot() + geom_boxplot(aes(y= AbsenceRate, x= Division), data = MFGEmployees) +
coord_flip()
AnovaModel.7 <- (lm(MFGEmployees$AbsenceRate ~ BusinessUnit, data=MFGEmployees))
Anova(AnovaModel.7)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## BusinessUnit 70 1 13.687 0.0002174 ***
## Residuals 41804 8163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(BusinessUnit),
mean, na.rm=TRUE)))
## HeadOffice Stores
## 2.275658 2.920856
Absence Rate does vary significantly by Business Unit. Pr(>F) is less than .05.
ggplot() + geom_boxplot(aes(y= AbsenceRate, x= BusinessUnit), data = MFGEmployees) +
coord_flip()
AnovaModel.8 <- (lm(MFGEmployees$AbsenceRate ~ Division*Gender, data=MFGEmployees))
Anova(AnovaModel.8)
## Anova Table (Type II tests)
##
## Response: MFGEmployees$AbsenceRate
## Sum Sq Df F value Pr(>F)
## Division 92 5 3.6145 0.002877 **
## Gender 496 1 97.9418 < 2.2e-16 ***
## Division:Gender 5 5 0.1784 0.970783
## Residuals 41283 8153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Division, Gender), mean, na.rm=TRUE)))
## F M
## Executive 2.976419 1.779546
## FinanceAndAccounting 2.172804 1.634077
## HumanResources 3.014491 2.214311
## InfoTech 3.298112 1.773538
## Legal 3.298112 2.058530
## Stores 3.169049 2.680788
# Split the columns into numeric and categorical sets
nums <- sapply(MFGEmployees, is.numeric)
num_df <- MFGEmployees[,c(nums)]
names(num_df)
## [1] "EmployeeNumber" "Age" "LengthService" "AbsentHours"
## [5] "AbsenceRate"
# Named cats rather than cat to avoid masking base::cat()
cats <- sapply(MFGEmployees, is.factor)
cat_df <- MFGEmployees[,c(cats)]
names(cat_df)
## [1] "Surname" "GivenName" "Gender" "City"
## [5] "JobTitle" "DepartmentName" "StoreLocation" "Division"
## [9] "BusinessUnit"
dim(MFGEmployees)
## [1] 8165 14
# Boxplots of the numeric columns, excluding EmployeeNumber
boxplot(num_df[,-c(1)])
# Define the inputs and the target for modelling
MYdataset <-MFGEmployees
MYinput <- c("Gender", "DepartmentName", "StoreLocation", "Division",
"Age", "LengthService", "BusinessUnit")
MYnumeric <- c("Age", "LengthService")
MYcategoric <- c("Gender", "DepartmentName", "StoreLocation", "Division",
"BusinessUnit")
MYtarget <- "AbsenceRate"
MYrisk <- NULL
MYident <- "EmployeeNumber"
MYignore <- c("Surname", "GivenName", "City", "JobTitle", "AbsentHours")
MYweights <- NULL
# Fit a regression tree for AbsenceRate on the selected inputs
library(rpart, quietly=TRUE)
MYrpart <- rpart(AbsenceRate ~ .,
data=MYdataset[, c(MYinput, MYtarget)],
method="anova",
parms=list(split="information"),
control=rpart.control(minsplit=10,
maxdepth=10,
usesurrogate=0,
maxsurrogate=0))
# Plot the fitted tree and print its details
library(rpart.plot)
rpart.plot(MYrpart, type = 3)
summary(MYrpart)
## Call:
## rpart(formula = AbsenceRate ~ ., data = MYdataset[, c(MYinput,
## MYtarget)], method = "anova", parms = list(split = "information"),
## control = rpart.control(minsplit = 10, maxdepth = 10, usesurrogate = 0,
## maxsurrogate = 0))
## n= 8165
##
## CP nsplit rel error xerror xstd
## 1 0.49040074 0 1.0000000 1.0001182 0.014403965
## 2 0.08966098 1 0.5095993 0.5103564 0.008743930
## 3 0.06342449 2 0.4199383 0.4210306 0.007139500
## 4 0.01823037 3 0.3565138 0.3578884 0.006839875
## 5 0.01246625 4 0.3382834 0.3401323 0.006315182
## 6 0.01000000 5 0.3258172 0.3304490 0.006188780
##
## Variable importance
## Age Gender
## 98 2
##
## Node number 1: 8165 observations, complexity param=0.4904007
## mean=2.907265, MSE=5.128515
## left son=2 (4360 obs) right son=3 (3805 obs)
## Primary splits:
## Age < 42.92874 to the left, improve=0.490400700, (0 missing)
## Gender splits as RL, improve=0.011835810, (0 missing)
## LengthService < 9.854487 to the right, improve=0.003112393, (0 missing)
## DepartmentName splits as LLLLRLRRRLLLRLRRRRRRR, improve=0.002538353, (0 missing)
## Division splits as LLRLRR, improve=0.001997797, (0 missing)
##
## Node number 2: 4360 observations, complexity param=0.06342449
## mean=1.425752, MSE=1.861194
## left son=4 (2035 obs) right son=5 (2325 obs)
## Primary splits:
## Age < 35.43927 to the left, improve=0.327285400, (0 missing)
## Gender splits as RL, improve=0.027778790, (0 missing)
## StoreLocation splits as LLLLRRLLLRRLLLRRLLRLRRRRRLLLRLRRRRLLLLLL, improve=0.003846682, (0 missing)
## DepartmentName splits as LLLLLLLLRRLRLRRLLLRLR, improve=0.003252535, (0 missing)
## Division splits as RLRRRL, improve=0.002941291, (0 missing)
##
## Node number 3: 3805 observations, complexity param=0.08966098
## mean=4.604872, MSE=3.47551
## left son=6 (2541 obs) right son=7 (1264 obs)
## Primary splits:
## Age < 51.70168 to the left, improve=0.28390820, (0 missing)
## Gender splits as RL, improve=0.06454366, (0 missing)
## LengthService < 9.854487 to the right, improve=0.03924278, (0 missing)
## DepartmentName splits as LLLLRLRRLLLLRL-RRRRRL, improve=0.03245560, (0 missing)
## Division splits as LLLL-R, improve=0.03068791, (0 missing)
##
## Node number 4: 2035 observations
## mean=0.591517, MSE=0.7469363
##
## Node number 5: 2325 observations
## mean=2.155932, MSE=1.694165
##
## Node number 6: 2541 observations, complexity param=0.01246625
## mean=3.904273, MSE=2.129771
## left son=12 (1355 obs) right son=13 (1186 obs)
## Primary splits:
## Gender splits as RL, improve=0.09645969, (0 missing)
## Age < 46.63122 to the left, improve=0.08911943, (0 missing)
## LengthService < 9.854487 to the right, improve=0.02802483, (0 missing)
## DepartmentName splits as LLLLRLRRLLLLRL-RRRLRL, improve=0.02281884, (0 missing)
## BusinessUnit splits as LR, improve=0.02086535, (0 missing)
##
## Node number 7: 1264 observations, complexity param=0.01823037
## mean=6.013276, MSE=3.210503
## left son=14 (911 obs) right son=15 (353 obs)
## Primary splits:
## Age < 58.02903 to the left, improve=0.1881149, (0 missing)
## LengthService < 9.751852 to the right, improve=0.1413326, (0 missing)
## DepartmentName splits as -LLLRLRRLLLLLL-RRRLRL, improve=0.1245715, (0 missing)
## BusinessUnit splits as LR, improve=0.1245715, (0 missing)
## Division splits as LLLL-R, improve=0.1245715, (0 missing)
##
## Node number 12: 1355 observations
## mean=3.480228, MSE=1.940531
##
## Node number 13: 1186 observations
## mean=4.388743, MSE=1.905829
##
## Node number 14: 911 observations
## mean=5.52952, MSE=2.524452
##
## Node number 15: 353 observations
## mean=7.261722, MSE=2.818458
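The output above describes a regression tree whose splits are almost entirely on `Age` (variable importance 98 for Age, 2 for Gender). A minimal sketch of the same kind of fit on made-up data, assuming the `rpart` package is installed; the `toy` data are invented so that absence rate steps up with age, and they are not the real data:

```r
library(rpart)

# Made-up data: AbsenceRate depends on Age only, in three steps
toy <- data.frame(
  Age    = rep(c(25, 30, 35, 45, 50, 55, 60, 62), each = 5),
  Gender = factor(rep(c("F", "M"), times = 20))
)
toy$AbsenceRate <- ifelse(toy$Age < 40, 1, ifelse(toy$Age < 58, 4, 7))

# Regression tree (method = "anova") with minsplit = 10
tree <- rpart(AbsenceRate ~ Age + Gender, data = toy,
              method = "anova",
              control = rpart.control(minsplit = 10))

# Variable used at the root split
root_var <- as.character(tree$frame$var[1])
root_var

# Predicted absence rate for two hypothetical employees
new_emp <- data.frame(Age = c(30, 61), Gender = factor(c("F", "M")))
tree_pred <- predict(tree, newdata = new_emp)
tree_pred
```

Each leaf predicts the mean absence rate of the training rows that fall into it, which is what the `mean=` values in the node listing above report.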
# Linear Regression Model
RegressionCurrentData <- lm(AbsenceRate~Age+LengthService, data=MFGEmployees)
summary(RegressionCurrentData)
##
## Call:
## lm(formula = AbsenceRate ~ Age + LengthService, data = MFGEmployees)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3578 -0.8468 -0.0230 0.8523 5.1030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.190213 0.068959 -75.27 <2e-16 ***
## Age 0.202593 0.001510 134.16 <2e-16 ***
## LengthService -0.085309 0.005652 -15.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.264 on 8162 degrees of freedom
## Multiple R-squared: 0.6887, Adjusted R-squared: 0.6886
## F-statistic: 9027 on 2 and 8162 DF, p-value: < 2.2e-16
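The Estimate column above can be used directly as a prediction rule. A minimal sketch, where `predict_absence_rate` is a hypothetical helper name and the example employee (age 40, 5 years of service) is made up:

```r
# Coefficients copied from the regression output above
intercept <- -5.190213
b_age     <-  0.202593
b_service <- -0.085309

# Hypothetical helper: predicted absence rate (percent) for given inputs
predict_absence_rate <- function(age, length_service) {
  intercept + b_age * age + b_service * length_service
}

pred <- predict_absence_rate(age = 40, length_service = 5)
pred                      # about 2.49 percent
pred / 100 * 2080         # the same prediction expressed in hours per year
```

With the fitted model object available, `predict(RegressionCurrentData, newdata = data.frame(Age = 40, LengthService = 5))` would give the same value up to rounding of the coefficients.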
This procedure gives us a mathematical equation for predicting absence rate:
\[\text{Absence Rate} = -5.190213 + (0.202593 \times \text{Age}) + (-0.085309 \times \text{Length of Service})\]
From the Pr(>|t|) column, both Age and LengthService are significantly related to absence rate, and the model explains about 69% of the variation in absence rate (Multiple R-squared: 0.6887). Let’s look at a visual of absence rate and age.
library(ggplot2)
ggplot() + geom_point(aes(x= Age,y= AbsenceRate),data=MFGEmployees) +
geom_smooth(aes(x= Age,y= AbsenceRate),data=MFGEmployees,method = "lm")
Now let’s look at a visual of absence rate against both age and length of service.
library(scatterplot3d)
s3d <-scatterplot3d(MFGEmployees$Age,MFGEmployees$LengthService,MFGEmployees$AbsenceRate,
pch=16, highlight.3d=TRUE,
type="h", main="Absence Rate By Age And Length of Service")
fit <- lm(MFGEmployees$AbsenceRate ~ MFGEmployees$Age+MFGEmployees$LengthService)
s3d$plane3d(fit)
To be continued! Wait for it!