1 Introduction

Absenteeism is a major issue for any organization and costs a large amount of money. We cannot achieve 0% absenteeism, because an employee can develop health issues at any point over the course of his/her job. Let's explore!

2 Life Cycle of an HR Analytics Project

2.1 Define Goal (HR Business Problem/Issue)

  • Descriptive
    . What is its rate of absenteeism?
    . Does anyone have excessive absenteeism?
  • Diagnostic
    . Is it the same across the organization?
    . Does it vary by gender?
    . Does it vary by length of service or other factors related to absenteeism rates?
  • Predictive
    . Can it predict next year’s absenteeism?
    . If so, how well can it predict?
  • Prescriptive
    . Can we reduce our absenteeism?

2.2 Data Required to Accomplish Defined Goal

To achieve the above goal, the basic data we require is employee demographic data and attendance records, which give clear information about each employee:
- Regularity of attendance at work.
- Whether he/she takes all PL in the month.
- Number of holidays taken by the employee.
- Number of hours absent.
- How early he/she arrives at the office in the morning.
- How early he/she leaves the office in the evening.

So let's take an example data set available on Kaggle, one of the best data science platforms.

Glimpse of Data

Data Summary:

## 'data.frame':    8336 obs. of  13 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946
##    1917 503 2152 3451 222 ...
## $ GivenName : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450
##    514 1260 625 760 1305 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180
##    223 192 144 223 ...
## $ JobTitle : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5
##    5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37
##    21 29 35 38 21 35 ...
## $ Division : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6
##    6 6 2 6 6 2 ...
## $ Age : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1
##    ...
What are the takeaways from the data summary?
- The data type of each variable (categorical, numeric, integer, or time series).
- The name of each variable in the data set.

Another method for exploring the data is summary():

##  EmployeeNumber     Surname       GivenName    Gender  
##  Min.   :   1   Johnson : 106   James  : 182   F:4120  
##  1st Qu.:2085   Smith   :  86   John   : 161   M:4216  
##  Median :4168   Jones   :  71   Robert : 136           
##  Mean   :4168   Williams:  71   Mary   : 124           
##  3rd Qu.:6252   Brown   :  62   William: 121           
##  Max.   :8336   Moore   :  47   Michael: 107           
##                 (Other) :7893   (Other):7505           
##               City               JobTitle             DepartmentName
##  Vancouver      :1780   Cashier      :1703   Customer Service:1737  
##  Victoria       : 690   Dairy Person :1514   Dairy           :1515  
##  New Westminster: 540   Meat Cutter  :1480   Meats           :1514  
##  Burnaby        : 339   Baker        :1404   Bakery          :1449  
##  Surrey         : 275   Produce Clerk:1129   Produce         :1163  
##  Richmond       : 228   Shelf Stocker: 712   Processed Foods : 746  
##  (Other)        :4484   (Other)      : 394   (Other)         : 212  
##          StoreLocation                  Division         Age        
##  Vancouver      :1836   Executive           :  11   Min.   : 3.505  
##  Victoria       : 853   FinanceAndAccounting:  73   1st Qu.:35.299  
##  Nanaimo        : 610   HumanResources      :  76   Median :42.115  
##  New Westminster: 525   InfoTech            :  10   Mean   :42.007  
##  Kelowna        : 418   Legal               :   3   3rd Qu.:48.667  
##  Kamloops       : 360   Stores              :8163   Max.   :77.938  
##  (Other)        :3734                                               
##  LengthService      AbsentHours         BusinessUnit 
##  Min.   : 0.0121   Min.   :  0.00   HeadOffice: 173  
##  1st Qu.: 3.5759   1st Qu.: 19.13   Stores    :8163  
##  Median : 4.6002   Median : 56.01                    
##  Mean   : 4.7829   Mean   : 61.28                    
##  3rd Qu.: 5.6239   3rd Qu.: 94.28                    
##  Max.   :43.7352   Max.   :272.53                    
## 

Categorical Variables

##                     Var1 Freq
## 1             Abbotsford  114
## 2                Agassiz    9
## 3                Aiyansh   10
## 4             Aldergrove   94
## 5           Alexis Creek    7
## 6            Alkali Lake    3
## 7              Armstrong    8
## 8               Ashcroft   15
## 9                  Atlin    5
## 10                 Avola    9
## 11               Balfour    8
## 12              Bamfield    7
## 13              Barriere    9
## 14             Bear Lake    3
## 15         Beaver Valley   10
## 16           Bella Bella    7
## 17           Black Point    9
## 18            Black Pool    3
## 19            Blue River    2
## 20             Blueberry    6
## 21        Bob Quinn Lake   11
## 22            Boston Bar   14
## 23          Bouchie Lake    5
## 24          Bougie Creek    4
## 25          Bowen Island    5
## 26           Brackendale    8
## 27           Bridge Lake    3
## 28       Britannia Beach    7
## 29               Burnaby  339
## 30            Burns Lake   14
## 31           Cache Creek   10
## 32        Campbell River   87
## 33           Canal Flats    9
## 34               Cassiar   11
## 35             Castlegar   20
## 36               Celista    6
## 37                 Chase    6
## 38             Chemainus    8
## 39              Chetwynd   23
## 40            Chief Lake    7
## 41         Chilako River    6
## 42        Chilanko Forks    6
## 43            Chilliwack   90
## 44        Christina Lake    8
## 45            Clearwater    7
## 46               Clinton    9
## 47          Cluculz Lake    5
## 48           Cobble Hill   13
## 49                 Comox   19
## 50             Coquitlam   36
## 51         Cortes Island    7
## 52             Courtenay   49
## 53             Cranbrook   72
## 54          Crawford Bay   18
## 55               Creston   25
## 56            Cumberland    9
## 57                D'arcy    6
## 58          Dawson Creek   21
## 59            Dease Lake    4
## 60           Decker Lake   10
## 61          Douglas Lake    4
## 62           Dragon Lake    9
## 63                Duncan   72
## 64               Elkford    8
## 65                  Elko    5
## 66               Enderby    8
## 67  Fairmont Hot Springs    6
## 68              Fauquier    8
## 69                Fernie    6
## 70                 Field    4
## 71              Flatrock   10
## 72          Forest Grove    7
## 73           Fort Fraser   11
## 74          Fort Langley   38
## 75           Fort Nelson   55
## 76         Fort St James    8
## 77          Fort St John   69
## 78         Francois Lake    8
## 79           Fraser Lake    9
## 80             Fruitvale   15
## 81       Fulford Harbour   13
## 82       Gabriola Island    6
## 83                Ganges   14
## 84               Genelle    7
## 85               Gibsons   27
## 86               Giscome    7
## 87           Gold Bridge    9
## 88                Golden   16
## 89        Good Hope Lake    6
## 90           Grand Forks   22
## 91              Granisle   10
## 92              Grasmere   12
## 93         Grassy Plains    5
## 94             Greenwood   11
## 95                 Haney   32
## 96               Hansard    5
## 97              Hazelton    5
## 98                Hedley   11
## 99        Hemlock Valley    8
## 100                Hixon    6
## 101                 Hope   18
## 102             Horsefly    9
## 103              Houston   16
## 104           Huntingdon   17
## 105            Invermere    8
## 106                Iskut   15
## 107              Jaffray   11
## 108             Kamloops  156
## 109                Kaslo   23
## 110              Kelowna  158
## 111             Keremeos    2
## 112            Kimberley   13
## 113              Kitimat   11
## 114             Kitwanga    7
## 115               Klemtu    7
## 116         Lac La Hache    6
## 117            Ladysmith    9
## 118        Lake Cowichan    9
## 119         Lakelse Lake    9
## 120     Lakeview Heights   12
## 121              Langley  122
## 122               Likely    6
## 123             Lillooet    4
## 124          Little Fort    4
## 125           Logan Lake   13
## 126           Lower Post    5
## 127                Lumby   12
## 128               Lytton    2
## 129            Mackenzie   10
## 130         Manning Park    6
## 131         Mayne Island    6
## 132              Mcbride    6
## 133         Mcleese Lake    8
## 134              Merritt   14
## 135           Mica Creek    9
## 136               Midway   10
## 137              Mission    9
## 138              Montney   14
## 139          Muncho Lake    5
## 140               Nakusp    5
## 141              Nanaimo  176
## 142               Nelson   45
## 143     New Westminister   62
## 144      New Westminster  540
## 145           Nimpo Lake    6
## 146  North Pender Island    7
## 147      North Vancouver  111
## 148          Ocean Falls   12
## 149       Okanagan Falls    7
## 150     Okanagan Mission    3
## 151               Oliver   11
## 152              Osoyoos    8
## 153                Oyama    9
## 154         Oyster River   10
## 155           Parksville   22
## 156               Parson    7
## 157            Peachland   13
## 158            Pemberton    4
## 159       Pender Harbour    9
## 160            Penticton   82
## 161         Pitt Meadows   12
## 162         Port Alberni   54
## 163           Port Alice   11
## 164       Port Coquitlam   89
## 165          Port Edward    7
## 166           Port Hardy   38
## 167         Port Mcneill   12
## 168          Port Mellon    4
## 169         Port Renfrew    5
## 170          Pouce Coupe    5
## 171         Powell River    9
## 172        Prince George  174
## 173            Princeton   16
## 174            Pritchard    8
## 175        Quadra Island    7
## 176       Qualicum Beach   21
## 177              Quesnel   30
## 178   Radium Hot Springs    5
## 179           Revelstoke   10
## 180             Richmond  228
## 181          Riske Creek    6
## 182           Rock Creek    7
## 183             Rosedale    5
## 184             Rossland   11
## 185              Rutland   18
## 186                Salmo    3
## 187           Salmon Arm   35
## 188        Salmon Valley    7
## 189             Sandspit    6
## 190               Sardis   23
## 191              Sayward    5
## 192              Sechelt   17
## 193        Seton Portage    4
## 194             Sicamous    4
## 195               Sidney   27
## 196         Skookumchuck   10
## 197               Slocan    8
## 198             Smithers   14
## 199                Sooke    9
## 200             Sorrento   10
## 201         South Slocan    7
## 202             Sparwood   21
## 203       Spences Bridge    8
## 204        Spillimacheen    5
## 205             Squamish   32
## 206           Summerland   17
## 207               Surrey  275
## 208               Tappen    9
## 209           Tatla Lake    6
## 210               Taylor    8
## 211      Telegraph Creek    4
## 212              Terrace   55
## 213           Toad River    5
## 214               Tofino   19
## 215               Topley   11
## 216                Trail   45
## 217        Tumbler Ridge    7
## 218             Ucluelet    7
## 219            Union Bay   10
## 220            Valemount    7
## 221             Vallican   16
## 222              Vananda    7
## 223            Vancouver 1780
## 224           Vanderhoof   15
## 225              Vavenby    7
## 226               Vernon   90
## 227             Victoria  690
## 228                Wells    3
## 229       West Vancouver   60
## 230             Westbank   15
## 231             Westwold   10
## 232             Whistler   82
## 233           White Rock   38
## 234             Wildwood    8
## 235        Williams Lake   40
## 236         Willow Point   16
## 237             Winfield   12
## 238                 Woss    6
## 239              Wynndel    7
## 240                 Yahk    6
## 241                 Yale    6
## 242               Yarrow    7
## 243               Youbou    6
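
A frequency table like the one above can be produced with table(), and it is also a convenient place to spot data entry variants: the listing contains both "New Westminister" (62) and "New Westminster" (540). The following is a minimal sketch on a made-up toy data frame (the name toy and its values are assumptions, not part of MFGEmployees); merging the two spellings is a suggestion and is not done in this analysis.

```r
# Hypothetical toy data standing in for MFGEmployees$City
toy <- data.frame(City = factor(c("Vancouver", "Vancouver", "New Westminster",
                                  "New Westminister", "Burnaby",
                                  "New Westminster")))

# Frequency table in the same Var1 / Freq layout as above
city_freq <- as.data.frame(table(toy$City))
print(city_freq)

# Flag pairs of levels that differ by at most two character edits
lv <- levels(toy$City)
d <- adist(lv)
near <- which(d > 0 & d <= 2 & upper.tri(d), arr.ind = TRUE)
near_pairs <- data.frame(a = lv[near[, "row"]], b = lv[near[, "col"]])
print(near_pairs)

# Assigning an existing level name merges the two levels
levels(toy$City)[levels(toy$City) == "New Westminister"] <- "New Westminster"
table(toy$City)
```

Whether two similar names really refer to the same place is a judgment call that should be checked against the source system before merging.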

Numeric Variables (other than Employee Number)

d <- density(MFGEmployees$Age)
plot(d, main="Age")
polygon(d, col="red", border="blue")

d <- density(MFGEmployees$LengthService)
plot(d, main="Length of Service")
polygon(d, col="red", border="blue")

d <- density(MFGEmployees$AbsentHours)
plot(d, main="Absent Hours")
polygon(d, col="red", border="blue")

3 Data Cleansing

The summary above shows ages ranging from 3.5 to 77.9 years. We keep only employees aged 18 to 65.

# Keep only employees aged 18 to 65 (inclusive)
MFGEmployees <- subset(MFGEmployees, Age >= 18 & Age <= 65)
summary(MFGEmployees)
##  EmployeeNumber     Surname       GivenName    Gender  
##  Min.   :   1   Johnson : 103   James  : 180   F:4017  
##  1st Qu.:2081   Smith   :  85   John   : 157   M:4148  
##  Median :4166   Jones   :  70   Robert : 134           
##  Mean   :4165   Williams:  69   William: 120           
##  3rd Qu.:6245   Brown   :  62   Mary   : 118           
##  Max.   :8336   Moore   :  46   Michael: 104           
##                 (Other) :7730   (Other):7352           
##               City               JobTitle             DepartmentName
##  Vancouver      :1751   Cashier      :1663   Customer Service:1695  
##  Victoria       : 677   Dairy Person :1476   Meats           :1495  
##  New Westminster: 530   Meat Cutter  :1461   Dairy           :1477  
##  Burnaby        : 332   Baker        :1375   Bakery          :1420  
##  Surrey         : 261   Produce Clerk:1101   Produce         :1133  
##  Richmond       : 223   Shelf Stocker: 701   Processed Foods : 735  
##  (Other)        :4391   (Other)      : 388   (Other)         : 210  
##          StoreLocation                  Division         Age       
##  Vancouver      :1807   Executive           :  11   Min.   :18.20  
##  Victoria       : 837   FinanceAndAccounting:  73   1st Qu.:35.46  
##  Nanaimo        : 601   HumanResources      :  75   Median :42.10  
##  New Westminster: 515   InfoTech            :  10   Mean   :41.99  
##  Kelowna        : 405   Legal               :   3   3rd Qu.:48.51  
##  Kamloops       : 352   Stores              :7993   Max.   :65.00  
##  (Other)        :3648                                              
##  LengthService       AbsentHours         BusinessUnit 
##  Min.   : 0.05328   Min.   :  0.00   HeadOffice: 172  
##  1st Qu.: 3.58261   1st Qu.: 20.07   Stores    :7993  
##  Median : 4.59800   Median : 55.86                    
##  Mean   : 4.78887   Mean   : 60.47                    
##  3rd Qu.: 5.62358   3rd Qu.: 93.38                    
##  Max.   :43.73524   Max.   :252.19                    
## 

4 Data Transformation

The absence rate is the absent hours divided by the total annual working hours (40 hours per week * 52 paid weeks = 2080 hours), multiplied by 100 to express it as a percentage.
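
As a quick worked example of this formula, the sketch below applies it to the first five AbsentHours values as displayed (rounded) in the data summary earlier. The variable names annual_hours, absent_hours and absence_rate are illustrative only.

```r
# Annual scheduled hours: 40 hours per week * 52 paid weeks
annual_hours <- 40 * 52

# First five AbsentHours values as shown (rounded) in the str() output
absent_hours <- c(36.6, 30.2, 83.8, 70, 0)

# Absence rate as a percentage of annual scheduled hours
absence_rate <- absent_hours / annual_hours * 100
round(absence_rate, 2)
# 1.76 1.45 4.03 3.37 0.00
```

An employee with 70 absent hours therefore has an absence rate of about 3.37%.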

MFGEmployees$AbsenceRate<-MFGEmployees$AbsentHours/2080*100
str(MFGEmployees,width=80,strict.width ="wrap")
## 'data.frame':    8165 obs. of  14 variables:
## $ EmployeeNumber: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Surname : Factor w/ 4051 levels "Aaron","Abadie",..: 1556 1618 941 3415 946
##    1917 503 2152 3451 222 ...
## $ GivenName : Factor w/ 1625 levels "Aaron","Abel",..: 1141 1453 265 688 450
##    514 1260 625 760 1305 ...
## $ Gender : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 2 2 2 ...
## $ City : Factor w/ 243 levels "Abbotsford","Agassiz",..: 29 52 180 227 144 180
##    223 192 144 223 ...
## $ JobTitle : Factor w/ 47 levels "Accounting Clerk",..: 5 5 5 5 5 5 1 5 5 1 ...
## $ DepartmentName: Factor w/ 21 levels "Accounting","Accounts Payable",..: 5 5 5
##    5 5 5 1 5 5 1 ...
## $ StoreLocation : Factor w/ 40 levels "Abbotsford","Aldergrove",..: 5 18 29 37
##    21 29 35 38 21 35 ...
## $ Division : Factor w/ 6 levels "Executive","FinanceAndAccounting",..: 6 6 6 6
##    6 6 2 6 6 2 ...
## $ Age : num 32 40.3 48.8 44.6 35.7 ...
## $ LengthService : num 6.02 5.53 4.39 3.08 3.62 ...
## $ AbsentHours : num 36.6 30.2 83.8 70 0 ...
## $ BusinessUnit : Factor w/ 2 levels "HeadOffice","Stores": 2 2 2 2 2 2 1 2 2 1
##    ...
## $ AbsenceRate : num 1.76 1.45 4.03 3.37 0 ...

Let's answer some of the questions we raised while defining the goal. What is the rate of absenteeism?

mean(MFGEmployees$AbsenceRate)
## [1] 2.907265

Do we have excess absenteeism?

As the graph shows, there is excess absenteeism, which appears as the black dots.
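
If the graph in question is a boxplot, the black dots are the observations that fall beyond the whiskers, which extend at most 1.5 times the interquartile range from the box. A minimal sketch of how to list those observations with boxplot.stats(), using made-up absence rates rather than MFGEmployees (the vector rate is an assumption for illustration):

```r
set.seed(1)
# Hypothetical absence rates (percent): mostly low, plus a few extreme values
rate <- c(rexp(200, rate = 1/3), 15, 18, 20)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses
stats <- boxplot.stats(rate)
upper_whisker <- stats$stats[5]   # end of the upper whisker
outliers <- stats$out             # points that would be drawn as dots

length(outliers)                  # how many observations are flagged
mean(rate > upper_whisker) * 100  # share of observations above the whisker (%)
```

The same two lines on MFGEmployees$AbsenceRate would give the number and share of employees flagged as having excessive absenteeism under this rule.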

5 Build a Model

Diagnostic Analytics
There are two things we are often interested in at the diagnostic stage:
- For numeric variables, whether there is any correlation.
A strong correlation between an independent variable and the dependent variable may indicate a level of predictability. A strong correlation between independent variables may indicate that they will not serve as good predictors in conjunction with each other.
- For categorical variables, whether there are statistically significant differences on the numeric variables (usually the dependent variable or metric of interest).
If statistically significant differences are found for the categorical variables, where these are the independent variables and a numeric variable (the metric of interest) is the dependent variable, it helps answer ‘why’ things are happening. The ‘why’ is sometimes answered through ‘where’ it is happening.

Numeric Variables

Absence Rate vs Age

library(RcmdrMisc)
scatterplot(AbsenceRate ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)

cor(MFGEmployees$Age, MFGEmployees$AbsenceRate)
## [1] 0.8246129

The correlation between age and absence rate is very strong, over 0.80, so age appears to be highly predictive of absence rate.

Absence Rate vs Length of Service

scatterplot(AbsenceRate ~ LengthService, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)

cor(MFGEmployees$LengthService, MFGEmployees$AbsenceRate)
## [1] -0.04669242

The correlation between length of service and absence rate is almost nonexistent at -0.04669242.

Length of Service vs Age

scatterplot(LengthService ~ Age, reg.line = FALSE, smooth = FALSE, spread = FALSE,
boxplots = FALSE, span = 0.5, ellipse = FALSE, levels = c(.5,.9),
data = MFGEmployees)

cor(MFGEmployees$Age, MFGEmployees$LengthService)
## [1] 0.05623405

The correlation between the two independent variables, length of service and age, is also almost nonexistent at 0.05623405.
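
The three pairwise correlations above can also be computed in one call by passing all numeric columns to cor(). The sketch below uses simulated data (the data frame toy and its coefficients are assumptions chosen to mimic a strong age effect and an unrelated length of service), and adds cor.test() for a confidence interval on a single pair.

```r
set.seed(42)
n <- 500
# Hypothetical data: absence rate rises with age, length of service is unrelated
toy <- data.frame(
  Age           = runif(n, 18, 65),
  LengthService = rexp(n, rate = 1/5)
)
toy$AbsenceRate <- pmax(0, 0.12 * (toy$Age - 18) + rnorm(n, sd = 1))

# Pairwise Pearson correlations for all numeric columns at once
cor_mat <- cor(toy)
round(cor_mat, 2)

# cor.test() adds a confidence interval and p-value for one pair
ct <- cor.test(toy$Age, toy$AbsenceRate)
ct$estimate
ct$conf.int
```

Squaring a correlation gives the share of variance one variable explains in the other under a straight-line fit, so a correlation of about 0.82 corresponds to roughly 68%.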

Categorical Variables
Absence Rate by Gender

library(ggplot2)
ggplot() + geom_boxplot(aes(y= AbsenceRate, x= Gender), data = MFGEmployees) +
coord_flip()

You will notice that the boxplot suggests absence rates are higher for females than for males, but from the visualization alone you wouldn’t know whether the difference is statistically significant.

This is where it can be quite useful to test for differences in the means across the levels of these categorical variables. The standard statistical procedure often used for this is Analysis of Variance (ANOVA).

library(RcmdrMisc)
AnovaModel.1 <- (lm(MFGEmployees$AbsenceRate ~ Gender, data=MFGEmployees))
Anova(AnovaModel.1)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##           Sum Sq   Df F value    Pr(>F)    
## Gender       496    1  97.773 < 2.2e-16 ***
## Residuals  41379 8163                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Gender),
mean, na.rm=TRUE)))
##        F        M 
## 3.157624 2.664813

The test shows that the Absence Rate is statistically significantly different between genders (Pr(>F) < .05).
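
To make the logic of the test concrete, here is a minimal sketch of a one-way ANOVA using base R's anova() on a linear model, rather than the Anova() function used above; with a single factor both report the same F test. The data are simulated (the data frame toy and its means are assumptions), so the numbers will not match the output above.

```r
set.seed(7)
# Hypothetical data: two groups with slightly different mean absence rates
toy <- data.frame(
  Gender      = factor(rep(c("F", "M"), each = 300)),
  AbsenceRate = c(rnorm(300, mean = 3.2, sd = 2.2),
                  rnorm(300, mean = 2.7, sd = 2.2))
)

# One-way ANOVA: does mean AbsenceRate differ between the Gender levels?
fit <- lm(AbsenceRate ~ Gender, data = toy)
tab <- anova(fit)
tab

# Group means, to see the direction of any difference
tapply(toy$AbsenceRate, toy$Gender, mean)

# With only two groups, the F statistic equals the squared
# equal-variance t statistic
tt <- t.test(AbsenceRate ~ Gender, data = toy, var.equal = TRUE)
c(F = tab[["F value"]][1], t_squared = unname(tt$statistic)^2)
```

A small Pr(>F) says the group means are unlikely to be equal; it does not say how large the difference is, which is why the group means are printed alongside the table.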

AnovaModel.2 <- (lm(MFGEmployees$AbsenceRate ~ City, data=MFGEmployees))
Anova(AnovaModel.2)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##           Sum Sq   Df F value Pr(>F)
## City        1085  242  0.8704 0.9254
## Residuals  40790 7922

Absence Rate doesn’t vary significantly by City: Pr(>F) = 0.9254, which is greater than .05.

AnovaModel.3 <- (lm(MFGEmployees$AbsenceRate ~ JobTitle,
data=MFGEmployees))
Anova(AnovaModel.3)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##           Sum Sq   Df F value Pr(>F)
## JobTitle     281   46  1.1928 0.1745
## Residuals  41593 8118
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(JobTitle),
mean, na.rm=TRUE)))
##                Accounting Clerk          Accounts Payable Clerk 
##                        1.946229                        1.623203 
##      Accounts Receiveable Clerk                         Auditor 
##                        1.419389                        2.160575 
##                           Baker                  Bakery Manager 
##                        2.923429                        2.579606 
##                  Benefits Admin                         Cashier 
##                        2.972834                        2.996985 
##                             CEO       CHief Information Officer 
##                        1.709012                        4.170790 
##            Compensation Analyst                Corporate Lawyer 
##                        1.790482                        2.471724 
##        Customer Service Manager                   Dairy Manager 
##                        2.546558                        9.250115 
##                    Dairy Person            Director, Accounting 
##                        2.991750                        2.462047 
##      Director, Accounts Payable   Director, Accounts Receivable 
##                        1.818838                        1.647439 
##                 Director, Audit          Director, Compensation 
##                        1.499403                        1.218407 
##      Director, Employee Records         Director, HR Technology 
##                        2.006652                        0.000000 
##           Director, Investments       Director, Labor Relations 
##                        0.000000                        0.000000 
##           Director, Recruitment              Director, Training 
##                        4.177417                        1.412981 
##         Exec Assistant, Finance Exec Assistant, Human Resources 
##                        0.000000                        2.872417 
##   Exec Assistant, Legal Counsel       Exec Assistant, VP Stores 
##                        3.758468                        0.000000 
##                    HRIS Analyst              Investment Analyst 
##                        2.493649                        3.190082 
##         Labor Relations Analyst                   Legal Counsel 
##                        2.655482                        0.000000 
##                     Meat Cutter                   Meats Manager 
##                        2.846953                        3.006050 
##         Processed Foods Manager                   Produce Clerk 
##                        2.617214                        2.800071 
##                 Produce Manager                       Recruiter 
##                        2.684084                        3.186081 
##                   Shelf Stocker                   Store Manager 
##                        3.002924                        2.517101 
##                 Systems Analyst                         Trainer 
##                        1.925995                        3.084251 
##                      VP Finance              VP Human Resources 
##                        3.173603                        3.416529 
##                       VP Stores 
##                        3.586138

Absence Rate doesn’t vary significantly by Job Title. Pr(>F) is greater than .05.

AnovaModel.4 <- (lm(MFGEmployees$AbsenceRate ~ DepartmentName, data=MFGEmployees))
Anova(AnovaModel.4)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##                Sum Sq   Df F value  Pr(>F)  
## DepartmentName    172   20   1.675 0.02995 *
## Residuals       41703 8144                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(DepartmentName),
mean, na.rm=TRUE)))
##             Accounting       Accounts Payable   Accounts Receiveable 
##               1.974886               1.636245               1.433642 
##                  Audit                 Bakery           Compensation 
##               2.116497               2.912533               1.726918 
##       Customer Service                  Dairy       Employee Records 
##               2.988482               2.995988               2.892319 
##              Executive          HR Technology Information Technology 
##               2.323580               2.315531               1.925995 
##             Investment        Labor Relations                  Legal 
##               2.835628               2.434192               2.471724 
##                  Meats        Processed Foods                Produce 
##               2.850572               2.985082               2.796795 
##            Recruitment       Store Management               Training 
##               3.262338               2.517101               2.972833

Absence Rate does vary significantly by Department. Pr(>F) is less than .05. Let’s visualize this:

ggplot() + geom_boxplot(aes(y= AbsenceRate, x= DepartmentName), data = MFGEmployees) +
coord_flip()

AnovaModel.5 <- (lm(MFGEmployees$AbsenceRate ~ StoreLocation, data=MFGEmployees))
Anova(AnovaModel.5)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##               Sum Sq   Df F value Pr(>F)
## StoreLocation    191   39   0.956 0.5483
## Residuals      41683 8125
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(StoreLocation),
mean, na.rm=TRUE)))
##       Abbotsford       Aldergrove      Bella Bella       Blue River 
##         2.937386         2.901258         2.031680         2.911973 
##          Burnaby       Chilliwack    Cortes Island        Cranbrook 
##         2.944466         3.082989         2.485313         2.952225 
##     Dawson Creek       Dease Lake      Fort Nelson     Fort St John 
##         3.545463         1.456964         2.741195         2.422886 
##      Grand Forks            Haney         Kamloops          Kelowna 
##         3.251701         2.549883         2.869252         2.898045 
##          Langley          Nanaimo           Nelson New Westminister 
##         2.579410         2.975264         3.454472         2.930529 
##  New Westminster  North Vancouver      Ocean Falls     Pitt Meadows 
##         2.969692         3.016872         3.711442         2.347963 
##   Port Coquitlam    Prince George        Princeton          Quesnel 
##         3.160728         2.919220         2.679282         3.109154 
##         Richmond         Squamish           Surrey          Terrace 
##         3.090128         2.907684         2.984772         3.082298 
##            Trail        Valemount        Vancouver           Vernon 
##         3.115680         3.382016         2.855725         2.898143 
##         Victoria   West Vancouver       White Rock    Williams Lake 
##         2.792531         2.688103         3.237530         2.854828

Absence Rate does not vary significantly by Store Location. Pr(>F) is greater than .05.

AnovaModel.6 <- (lm(MFGEmployees$AbsenceRate ~ Division, data=MFGEmployees))
Anova(AnovaModel.6)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##           Sum Sq   Df F value   Pr(>F)   
## Division      91    5  3.5617 0.003218 **
## Residuals  41783 8159                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Division),
mean, na.rm=TRUE)))
##            Executive FinanceAndAccounting       HumanResources 
##             2.323580             1.921890             2.651743 
##             InfoTech                Legal               Stores 
##             1.925995             2.471724             2.920856

Absence Rate does vary significantly by Division. Pr(>F) is less than .05.

ggplot() + geom_boxplot(aes(y= AbsenceRate, x= Division), data = MFGEmployees) +
coord_flip()

AnovaModel.7 <- (lm(MFGEmployees$AbsenceRate ~ BusinessUnit, data=MFGEmployees))
Anova(AnovaModel.7)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##              Sum Sq   Df F value    Pr(>F)    
## BusinessUnit     70    1  13.687 0.0002174 ***
## Residuals     41804 8163                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(BusinessUnit),
mean, na.rm=TRUE)))
## HeadOffice     Stores 
##   2.275658   2.920856

Absence Rate does vary significantly by Business Unit. Pr(>F) is less than .05.

ggplot() + geom_boxplot(aes(y= AbsenceRate, x= BusinessUnit), data = MFGEmployees) +
coord_flip()

AnovaModel.8 <- (lm(MFGEmployees$AbsenceRate ~ Division*Gender, data=MFGEmployees))
Anova(AnovaModel.8)
## Anova Table (Type II tests)
## 
## Response: MFGEmployees$AbsenceRate
##                 Sum Sq   Df F value    Pr(>F)    
## Division            92    5  3.6145  0.002877 ** 
## Gender             496    1 97.9418 < 2.2e-16 ***
## Division:Gender      5    5  0.1784  0.970783    
## Residuals        41283 8153                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with(MFGEmployees, (tapply(MFGEmployees$AbsenceRate, list(Division, Gender), mean, na.rm=TRUE)))
##                             F        M
## Executive            2.976419 1.779546
## FinanceAndAccounting 2.172804 1.634077
## HumanResources       3.014491 2.214311
## InfoTech             3.298112 1.773538
## Legal                3.298112 2.058530
## Stores               3.169049 2.680788

Division and Gender each have a statistically significant effect on Absence Rate, but their interaction does not (Pr(>F) = 0.9708), so the gap between females and males does not differ significantly across divisions.

6 Descriptive Analytics

nums <- sapply(MFGEmployees, is.numeric)
num_df <- MFGEmployees[,c(nums)]
names(num_df)
## [1] "EmployeeNumber" "Age"            "LengthService"  "AbsentHours"   
## [5] "AbsenceRate"
cat <- sapply(MFGEmployees, is.factor)
cat_df <- MFGEmployees[,c(cat)]
names(cat_df)
## [1] "Surname"        "GivenName"      "Gender"         "City"          
## [5] "JobTitle"       "DepartmentName" "StoreLocation"  "Division"      
## [9] "BusinessUnit"
dim(MFGEmployees)
## [1] 8165   14
boxplot(num_df[,-c(1)])

7 Predictive Analytics

# Working copy of the cleaned data
MYdataset <- MFGEmployees
# Predictor (input) variables, split by type
MYinput <- c("Gender", "DepartmentName", "StoreLocation", "Division",
             "Age", "LengthService", "BusinessUnit")
MYnumeric <- c("Age", "LengthService")
MYcategoric <- c("Gender", "DepartmentName", "StoreLocation", "Division",
                 "BusinessUnit")
# Variable to predict
MYtarget <- "AbsenceRate"
MYrisk <- NULL
# Identifier, not used as a predictor
MYident <- "EmployeeNumber"
# Variables excluded from modelling
MYignore <- c("Surname", "GivenName", "City", "JobTitle", "AbsentHours")
MYweights <- NULL

7.1 Decision Tree

library(rpart, quietly=TRUE)
MYrpart <- rpart(AbsenceRate ~ .,
                 data = MYdataset[, c(MYinput, MYtarget)],
                 method = "anova",                     # regression tree for a numeric target
                 parms = list(split = "information"),  # has no effect with method = "anova"
                 control = rpart.control(minsplit = 10,
                                         maxdepth = 10,
                                         usesurrogate = 0,
                                         maxsurrogate = 0))
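
When reading a fitted regression tree, the complexity (CP) table is usually the first thing to look at: each row is a candidate tree size, "rel error" is the training error relative to a tree with no splits, and "xerror" is the cross-validated equivalent. The sketch below, on simulated data (the data frame toy and its coefficients are assumptions), shows how to pull these values out of an rpart fit and prune to the size with the lowest cross-validated error. Pruning is shown only as an illustration and is not a step performed in this analysis.

```r
library(rpart)

set.seed(123)
n <- 1000
# Hypothetical data: absence rate driven mainly by age, slightly by gender
toy <- data.frame(
  Age    = runif(n, 18, 65),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
toy$AbsenceRate <- pmax(0, 0.12 * (toy$Age - 18) +
                           0.5 * (toy$Gender == "F") + rnorm(n))

fit <- rpart(AbsenceRate ~ Age + Gender, data = toy, method = "anova",
             control = rpart.control(minsplit = 10, maxdepth = 10))

# Complexity table: one row per candidate tree size
cp_tab <- fit$cptable
cp_tab

# 1 - rel error of the last row is the share of variance the full tree
# explains on the training data
r_squared <- 1 - cp_tab[nrow(cp_tab), "rel error"]
r_squared

# Prune back to the tree size with the lowest cross-validated error
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Relative importance of each predictor in the fitted tree
fit$variable.importance
```

Applied to the summary in section 7.1.2, a rel error of 0.3258 after five splits means the tree explains about 67% of the variance in AbsenceRate on the training data.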

7.1.1 Plotting of Decision Tree

library(rpart.plot)
rpart.plot(MYrpart, type = 3)

7.1.2 Summary of Decision Tree

summary(MYrpart)
## Call:
## rpart(formula = AbsenceRate ~ ., data = MYdataset[, c(MYinput, 
##     MYtarget)], method = "anova", parms = list(split = "information"), 
##     control = rpart.control(minsplit = 10, maxdepth = 10, usesurrogate = 0, 
##         maxsurrogate = 0))
##   n= 8165 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.49040074      0 1.0000000 1.0001182 0.014403965
## 2 0.08966098      1 0.5095993 0.5103564 0.008743930
## 3 0.06342449      2 0.4199383 0.4210306 0.007139500
## 4 0.01823037      3 0.3565138 0.3578884 0.006839875
## 5 0.01246625      4 0.3382834 0.3401323 0.006315182
## 6 0.01000000      5 0.3258172 0.3304490 0.006188780
## 
## Variable importance
##    Age Gender 
##     98      2 
## 
## Node number 1: 8165 observations,    complexity param=0.4904007
##   mean=2.907265, MSE=5.128515 
##   left son=2 (4360 obs) right son=3 (3805 obs)
##   Primary splits:
##       Age            < 42.92874 to the left,  improve=0.490400700, (0 missing)
##       Gender         splits as  RL, improve=0.011835810, (0 missing)
##       LengthService  < 9.854487 to the right, improve=0.003112393, (0 missing)
##       DepartmentName splits as  LLLLRLRRRLLLRLRRRRRRR, improve=0.002538353, (0 missing)
##       Division       splits as  LLRLRR, improve=0.001997797, (0 missing)
## 
## Node number 2: 4360 observations,    complexity param=0.06342449
##   mean=1.425752, MSE=1.861194 
##   left son=4 (2035 obs) right son=5 (2325 obs)
##   Primary splits:
##       Age            < 35.43927 to the left,  improve=0.327285400, (0 missing)
##       Gender         splits as  RL, improve=0.027778790, (0 missing)
##       StoreLocation  splits as  LLLLRRLLLRRLLLRRLLRLRRRRRLLLRLRRRRLLLLLL, improve=0.003846682, (0 missing)
##       DepartmentName splits as  LLLLLLLLRRLRLRRLLLRLR, improve=0.003252535, (0 missing)
##       Division       splits as  RLRRRL, improve=0.002941291, (0 missing)
## 
## Node number 3: 3805 observations,    complexity param=0.08966098
##   mean=4.604872, MSE=3.47551 
##   left son=6 (2541 obs) right son=7 (1264 obs)
##   Primary splits:
##       Age            < 51.70168 to the left,  improve=0.28390820, (0 missing)
##       Gender         splits as  RL, improve=0.06454366, (0 missing)
##       LengthService  < 9.854487 to the right, improve=0.03924278, (0 missing)
##       DepartmentName splits as  LLLLRLRRLLLLRL-RRRRRL, improve=0.03245560, (0 missing)
##       Division       splits as  LLLL-R, improve=0.03068791, (0 missing)
## 
## Node number 4: 2035 observations
##   mean=0.591517, MSE=0.7469363 
## 
## Node number 5: 2325 observations
##   mean=2.155932, MSE=1.694165 
## 
## Node number 6: 2541 observations,    complexity param=0.01246625
##   mean=3.904273, MSE=2.129771 
##   left son=12 (1355 obs) right son=13 (1186 obs)
##   Primary splits:
##       Gender         splits as  RL, improve=0.09645969, (0 missing)
##       Age            < 46.63122 to the left,  improve=0.08911943, (0 missing)
##       LengthService  < 9.854487 to the right, improve=0.02802483, (0 missing)
##       DepartmentName splits as  LLLLRLRRLLLLRL-RRRLRL, improve=0.02281884, (0 missing)
##       BusinessUnit   splits as  LR, improve=0.02086535, (0 missing)
## 
## Node number 7: 1264 observations,    complexity param=0.01823037
##   mean=6.013276, MSE=3.210503 
##   left son=14 (911 obs) right son=15 (353 obs)
##   Primary splits:
##       Age            < 58.02903 to the left,  improve=0.1881149, (0 missing)
##       LengthService  < 9.751852 to the right, improve=0.1413326, (0 missing)
##       DepartmentName splits as  -LLLRLRRLLLLLL-RRRLRL, improve=0.1245715, (0 missing)
##       BusinessUnit   splits as  LR, improve=0.1245715, (0 missing)
##       Division       splits as  LLLL-R, improve=0.1245715, (0 missing)
## 
## Node number 12: 1355 observations
##   mean=3.480228, MSE=1.940531 
## 
## Node number 13: 1186 observations
##   mean=4.388743, MSE=1.905829 
## 
## Node number 14: 911 observations
##   mean=5.52952, MSE=2.524452 
## 
## Node number 15: 353 observations
##   mean=7.261722, MSE=2.818458
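
One way to read the summary above is to write the tree out by hand. The sketch below is only an illustration: predict_tree() is a hypothetical helper, not part of the original analysis, and its thresholds and node means are copied from the printed output. It assumes Gender has the levels "F" and "M" in that order, so "splits as RL" sends "F" to the right branch and "M" to the left. The last line turns the final rel error from the CP table into the share of training variance the tree explains.

```r
# Hand-coded copy of the tree printed by summary(MYrpart).
# predict_tree() is a hypothetical helper written for illustration only.
predict_tree <- function(age, gender) {
  if (age < 42.92874) {                          # node 2
    if (age < 35.43927) 0.591517 else 2.155932   # nodes 4 and 5
  } else if (age < 51.70168) {                   # node 6
    # "Gender splits as RL": "F" goes right (node 13), "M" goes left (node 12)
    if (gender == "M") 3.480228 else 4.388743
  } else {                                       # node 7
    if (age < 58.02903) 5.52952 else 7.261722    # nodes 14 and 15
  }
}

predict_tree(40, "F")
## [1] 2.155932
predict_tree(45, "F")
## [1] 4.388743

# Share of training variance explained: 1 minus the final rel error
tree_r2 <- 1 - 0.3258172
tree_r2
## [1] 0.6741828
```

With the fitted object available, predict(MYrpart, newdata = ...) would return the same leaf means without any hand-coding; the manual version is only meant to show how the node listing maps to predictions.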

7.2 Linear Regression

#Linear Regression Model
RegressionCurrentData <- lm(AbsenceRate~Age+LengthService, data=MFGEmployees)
summary(RegressionCurrentData)
## 
## Call:
## lm(formula = AbsenceRate ~ Age + LengthService, data = MFGEmployees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3578 -0.8468 -0.0230  0.8523  5.1030 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.190213   0.068959  -75.27   <2e-16 ***
## Age            0.202593   0.001510  134.16   <2e-16 ***
## LengthService -0.085309   0.005652  -15.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.264 on 8162 degrees of freedom
## Multiple R-squared:  0.6887, Adjusted R-squared:  0.6886 
## F-statistic:  9027 on 2 and 8162 DF,  p-value: < 2.2e-16

The fitted model gives us an equation for predicting Absence Rate:

\[\text{Absence Rate} = -5.190213 + (0.202593 \times \text{Age}) - (0.085309 \times \text{Length of Service})\]

From the Pr(>|t|) column, both Age and LengthService are significantly related to Absence Rate, and the model explains about 69% of the variation in Absence Rate (Multiple R-squared: 0.6887). Let’s look at a visual of Absence Rate and Age.

library(ggplot2)
ggplot(MFGEmployees, aes(x=Age, y=AbsenceRate)) +
  geom_point() +
  geom_smooth(method="lm")

Now let’s look at Absence Rate against both Age and Length of Service.

library(scatterplot3d)
s3d <- scatterplot3d(MFGEmployees$Age, MFGEmployees$LengthService, MFGEmployees$AbsenceRate,
                     pch=16, highlight.3d=TRUE,
                     type="h", main="Absence Rate By Age And Length of Service")
# Same model as RegressionCurrentData; plane3d() draws its fitted plane
fit <- lm(AbsenceRate ~ Age + LengthService, data=MFGEmployees)
s3d$plane3d(fit)
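
To make the regression concrete, the sketch below plugs one hypothetical employee (age 40 with 5 years of service, values chosen only for illustration) into the coefficients reported by summary(RegressionCurrentData). The helper predict_absence_rate() is an assumption introduced here, not part of the original analysis.

```r
# Coefficients copied from the lm() summary printed earlier in this section
coefs <- c(intercept = -5.190213, age = 0.202593, length_service = -0.085309)

# Hypothetical helper: evaluates the fitted linear equation by hand
predict_absence_rate <- function(age, length_service) {
  coefs[["intercept"]] +
    coefs[["age"]] * age +
    coefs[["length_service"]] * length_service
}

predict_absence_rate(age = 40, length_service = 5)
## [1] 2.486962
```

With the model object in memory, predict(RegressionCurrentData, newdata = data.frame(Age = 40, LengthService = 5)) should give the same value up to the rounding of the printed coefficients.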

8 Evaluate And Critique The Models

To be continued!