DATA622: Homework 2

Based on the latest topics presented, bring a data set of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decision.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?

Format: document with screen captures & analysis.

For this week’s homework, I’m using the ‘Absenteeism at work’ data set from the UCI Machine Learning Repository. It records employee callouts (absences) at a Brazilian firm over a three-year period.
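As a rough sketch of the loading step, the snippet below reads a few rows and converts the coded columns to factors, which is what the summary output suggests was done. The semicolon delimiter, the raw header names with spaces, and the choice of columns are assumptions for illustration; the three sample rows are taken from the first rows printed in this report.

```r
# Tiny inline sample standing in for the real file. The semicolon delimiter
# and the raw header names are assumptions about the downloaded CSV.
sample_csv <- "ID;Reason for absence;Month of absence;Seasons;Age;Absenteeism time in hours
11;26;7;1;33;4
36;0;7;1;50;0
3;23;7;1;38;2"

# check.names = TRUE (the default) turns spaces in headers into dots,
# e.g. "Reason for absence" becomes Reason.for.absence.
absent <- read.csv(text = sample_csv, sep = ";")

# Coded identifiers and categories are labels, not quantities,
# so store them as factors.
factor_cols <- c("ID", "Reason.for.absence", "Month.of.absence", "Seasons")
absent[factor_cols] <- lapply(absent[factor_cols], factor)

nrow(absent)
summary(absent)
```

With the real file, the `text =` argument would be replaced by the path to the downloaded CSV.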

## [1] 740

##        ID      Reason.for.absence Month.of.absence Day.of.the.week Seasons
##  3      :113   23     :149        3      : 87      2:161           1:170  
##  28     : 76   28     :112        2      : 72      3:154           2:192  
##  34     : 55   27     : 69        10     : 71      4:156           3:183  
##  22     : 46   13     : 55        7      : 67      5:125           4:195  
##  20     : 42   0      : 43        5      : 64      6:144                  
##  11     : 40   19     : 40        11     : 63                             
##  (Other):368   (Other):272        (Other):316                             
##  Transportation.expense Distance.from.Residence.to.Work  Service.time  
##  Min.   :118.0          Min.   : 5.00                   Min.   : 1.00  
##  1st Qu.:179.0          1st Qu.:16.00                   1st Qu.: 9.00  
##  Median :225.0          Median :26.00                   Median :13.00  
##  Mean   :221.3          Mean   :29.63                   Mean   :12.55  
##  3rd Qu.:260.0          3rd Qu.:50.00                   3rd Qu.:16.00  
##  Max.   :388.0          Max.   :52.00                   Max.   :29.00  
##                                                                        
##       Age        Work.load.Average.day   Hit.target     Disciplinary.failure
##  Min.   :27.00   Min.   :205.9         Min.   : 81.00   0:700               
##  1st Qu.:31.00   1st Qu.:244.4         1st Qu.: 93.00   1: 40               
##  Median :37.00   Median :264.2         Median : 95.00                       
##  Mean   :36.45   Mean   :271.5         Mean   : 94.59                       
##  3rd Qu.:40.00   3rd Qu.:294.2         3rd Qu.: 97.00                       
##  Max.   :58.00   Max.   :378.9         Max.   :100.00                       
##                                                                             
##  Education      Son        Social.drinker Social.smoker      Pet        
##  1:611     Min.   :0.000   0:320          0:686         Min.   :0.0000  
##  2: 46     1st Qu.:0.000   1:420          1: 54         1st Qu.:0.0000  
##  3: 79     Median :1.000                                Median :0.0000  
##  4:  4     Mean   :1.019                                Mean   :0.7459  
##            3rd Qu.:2.000                                3rd Qu.:1.0000  
##            Max.   :4.000                                Max.   :8.0000  
##                                                                         
##      Weight           Height      Body.mass.index Absenteeism.time.in.hours
##  Min.   : 56.00   Min.   :163.0   Min.   :19.00   Min.   :  0.000          
##  1st Qu.: 69.00   1st Qu.:169.0   1st Qu.:24.00   1st Qu.:  2.000          
##  Median : 83.00   Median :170.0   Median :25.00   Median :  3.000          
##  Mean   : 79.04   Mean   :172.1   Mean   :26.68   Mean   :  6.924          
##  3rd Qu.: 89.00   3rd Qu.:172.0   3rd Qu.:31.00   3rd Qu.:  8.000          
##  Max.   :108.00   Max.   :196.0   Max.   :38.00   Max.   :120.000          
##

##   ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 11                 26                7               3       1
## 2 36                  0                7               3       1
## 3  3                 23                7               4       1
## 4  7                  7                7               5       1
## 5 11                 23                7               5       1
## 6  3                 23                7               6       1
##   Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1                    289                              36           13  33
## 2                    118                              13           18  50
## 3                    179                              51           18  38
## 4                    279                               5           14  39
## 5                    289                              36           13  33
## 6                    179                              51           18  38
##   Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1               239.554         97                    0         1   2
## 2               239.554         97                    1         1   1
## 3               239.554         97                    0         1   0
## 4               239.554         97                    0         1   2
## 5               239.554         97                    0         1   2
## 6               239.554         97                    0         1   0
##   Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1              1             0   1     90    172              30
## 2              1             0   0     98    178              31
## 3              1             0   0     89    170              31
## 4              1             1   0     68    168              24
## 5              1             0   1     90    172              30
## 6              1             0   0     89    170              31
##   Absenteeism.time.in.hours
## 1                         4
## 2                         0
## 3                         2
## 4                         4
## 5                         2
## 6                         2

Including the outcome variable, there are 21 columns and 740 observations.

## [1] 36

##  [1]   0   1   2   3   4   5   7   8  16  24  32  40  48  56  64  80 104 112 120

The data description mentions 28 different reasons for callouts: 21 are medical categories and the last seven are miscellaneous. There are 36 workers in total, with several observations for each worker. Finally, absence time in hours covers both partial days (0-8 hours) and multiple days (for example, 16 hours = 2 days). My interpretation is that outcomes greater than 8 hours occur when a worker called out for several days in a row, so each observation is a single continuous span of callouts. What isn’t immediately clear is what observations with an absenteeism of ‘zero’ describe.
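The partial-day versus multi-day reading can be checked directly against the distinct values printed above. This is a minimal sketch that assumes an 8-hour working day; the names `hours_per_day` and `span` are illustrative only.

```r
# Distinct absence durations (in hours) as printed above.
hours <- c(0, 1, 2, 3, 4, 5, 7, 8, 16, 24, 32, 40, 48, 56, 64, 80, 104, 112, 120)

# Assumption: one working day is 8 hours.
hours_per_day <- 8

span <- data.frame(
  hours = hours,
  days  = hours / hours_per_day,
  # Intervals are right-closed: (-Inf, 0], (0, 8], (8, Inf)
  type  = cut(hours,
              breaks = c(-Inf, 0, hours_per_day, Inf),
              labels = c("zero", "within one day", "multiple days"))
)

# If the interpretation holds, every multi-day value is a whole
# number of 8-hour days.
all(span$hours[span$type == "multiple days"] %% hours_per_day == 0)

table(span$type)
```

Every value above 8 is an exact multiple of 8, which is consistent with whole-day spans, although it does not prove that the days were consecutive.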

##   ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1  4                  0                0               3       1
## 2  8                  0                0               4       2
## 3 35                  0                0               6       3
##   Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1                    118                              14           13  40
## 2                    231                              35           14  39
## 3                    179                              45           14  53
##   Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1               271.219         95                    0         1   1
## 2               271.219         95                    0         1   2
## 3               271.219         95                    0         1   1
##   Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1              1             0   8     98    170              34
## 2              1             0   2    100    170              35
## 3              0             0   1     77    175              25
##   Absenteeism.time.in.hours
## 1                         0
## 2                         0
## 3                         0

##   ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1  8                  0                9               3       1
## 2  4                  0                0               3       1
## 3  8                  0                0               4       2
## 4 35                  0                0               6       3
##   Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1                    231                              35           14  39
## 2                    118                              14           13  40
## 3                    231                              35           14  39
## 4                    179                              45           14  53
##   Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1               294.217         81                    1         1   2
## 2               271.219         95                    0         1   1
## 3               271.219         95                    0         1   2
## 4               271.219         95                    0         1   1
##   Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1              1             0   2    100    170              35
## 2              1             0   8     98    170              34
## 3              1             0   2    100    170              35
## 4              0             0   1     77    175              25
##   Absenteeism.time.in.hours
## 1                         0
## 2                         0
## 3                         0
## 4                         0

## # A tibble: 5 x 2
##   Month.of.absence n_obs
##   <fct>            <int>
## 1 0                    1
## 2 6                   16
## 3 7                   67
## 4 8                   54
## 5 9                   32

##   ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 29                  0                9               2       4
## 2 29                 28                2               6       2
## 3 29                 19                5               4       3
## 4 29                 14                5               5       3
## 5 29                 22                5               6       3
##   Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1                    225                              26            9  28
## 2                    225                              15           15  41
## 3                    225                              15           15  41
## 4                    225                              15           15  41
## 5                    225                              15           15  41
##   Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1               241.476         92                    1         1   1
## 2               264.249         97                    0         4   2
## 3               237.656         99                    0         4   2
## 4               237.656         99                    0         4   2
## 5               237.656         99                    0         4   2
##   Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1              0             0   2     69    169              24
## 2              1             0   2     94    182              28
## 3              1             0   2     94    182              28
## 4              1             0   2     94    182              28
## 5              1             0   2     94    182              28
##   Absenteeism.time.in.hours
## 1                         0
## 2                         2
## 3                         3
## 4                         8
## 5                         8

## # A tibble: 2 x 2
##   ID    n_obs
##   <fct> <int>
## 1 28       76
## 2 29        1

I also noticed that there are three observations with a month of ‘zero’ (IDs 4, 8, and 35), which doesn’t make sense as a descriptor. Since these observations have a season, I’ll impute the most common month of that season onto the zero values. Also, after sorting by ID, it appears that one of the observations with an ID of 29 might be miscoded. In general, many of the demographic variables (age, weight, number of pets) do not change between observations of the same worker. The first ID 29 row instead matches the profile of ID 28 (age 28, weight 69, height 169), which leads me to believe that one of the observations for ID 28 is miscoded as 29.
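One way to surface this kind of miscoding is to count how many distinct demographic profiles each ID has. The sketch below uses a handful of rows copied from the ID 28 and ID 29 output in this report; the choice of age, weight, and height as the "fixed" attributes is an assumption made for illustration.

```r
# Rows copied from the ID 28 / ID 29 output (subset of columns).
emp <- data.frame(
  ID     = c(29, 29, 29, 29, 29, 28, 28),
  Age    = c(28, 41, 41, 41, 41, 28, 28),
  Weight = c(69, 94, 94, 94, 94, 69, 69),
  Height = c(169, 182, 182, 182, 182, 169, 169)
)

# Collapse the supposedly fixed attributes into one profile string per row.
emp$profile <- paste(emp$Age, emp$Weight, emp$Height, sep = "/")

# Count distinct profiles per ID; more than one is suspicious.
n_profiles  <- tapply(emp$profile, emp$ID, function(p) length(unique(p)))
suspect_ids <- names(n_profiles)[n_profiles > 1]
suspect_ids

# Cross-tabulate to see which profile is the minority within each ID.
table(emp$ID, emp$profile)
```

The check only flags ID 29 as inconsistent. Deciding that the odd row belongs to ID 28 is still a judgment call based on the matching profile.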

##     ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 738  4                  0                0               3       1
## 739  8                  0                0               4       2
## 740 35                  0                0               6       3
##     Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 738                    118                              14           13  40
## 739                    231                              35           14  39
## 740                    179                              45           14  53
##     Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 738               271.219         95                    0         1   1
## 739               271.219         95                    0         1   2
## 740               271.219         95                    0         1   1
##     Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 738              1             0   8     98    170              34
## 739              1             0   2    100    170              35
## 740              0             0   1     77    175              25
##     Absenteeism.time.in.hours
## 738                         0
## 739                         0
## 740                         0

## # A tibble: 5 x 2
##   Month.of.absence n_obs
##   <fct>            <int>
## 1 0                    1
## 2 6                   16
## 3 9                   32
## 4 8                   54
## 5 7                   67

## # A tibble: 5 x 2
##   Month.of.absence n_obs
##   <fct>            <int>
## 1 0                    1
## 2 12                   9
## 3 1                   50
## 4 3                   60
## 5 2                   72

## # A tibble: 5 x 2
##   Month.of.absence n_obs
##   <fct>            <int>
## 1 0                    1
## 2 3                   27
## 3 6                   38
## 4 4                   53
## 5 5                   64
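The imputation described earlier can be sketched from the three month-count tables above. The code assumes the tables correspond to seasons 1, 2, and 3 in that order (matching the seasons of the zero-month rows); the helper names are illustrative.

```r
# Month counts per season, copied from the three tables above.
# Month 0 is the invalid value to be replaced.
counts <- data.frame(
  Seasons = rep(c("1", "2", "3"), each = 5),
  Month   = c(0, 6, 7, 8, 9,    0, 12, 1, 3, 2,    0, 3, 6, 4, 5),
  n_obs   = c(1, 16, 67, 54, 32,  1, 9, 50, 60, 72,  1, 27, 38, 53, 64)
)

# Most common valid month within each season.
valid <- counts[counts$Month != 0, ]
mode_month <- sapply(split(valid, valid$Seasons),
                     function(d) d$Month[which.max(d$n_obs)])

# The zero-month rows (IDs 4, 8, 35) and their seasons, as printed earlier.
zero_rows <- data.frame(ID = c(4, 8, 35),
                        Seasons = c("1", "2", "3"),
                        Month = 0)

# Look up each row's season in the named vector of modal months.
zero_rows$Month <- unname(mode_month[as.character(zero_rows$Seasons)])
zero_rows
```

Under these assumptions the zero months become 7, 2, and 5 for seasons 1, 2, and 3 respectively.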

##     ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 19   6                 11                7               5       1
## 52  29                  0                9               2       4
## 53  28                 23                9               3       4
## 57  28                 18                9               4       4
## 68  28                 23               10               6       4
## 70  28                 23               10               4       4
## 74  28                 23               10               4       4
## 77  28                 28               10               3       4
## 82  28                 23               11               4       4
## 87   6                 22               11               2       4
## 89  28                 23               11               4       4
## 91  28                 23               11               4       4
## 93  28                 13               11               6       4
## 96  28                 28               11               3       4
## 103 28                 23               12               5       4
## 107 28                 28               12               5       4
## 109 28                 23               12               3       4
## 113 28                 23               12               6       2
## 114 28                 23                1               4       2
## 118 28                 27                1               4       2
## 120 28                 28                1               5       2
## 121 28                 27                1               6       2
## 123 28                 27                1               3       2
## 136 28                 23                1               3       2
## 142  6                 23                2               5       2
## 147 28                 28                2               2       2
## 148 28                 23                2               3       2
## 151 28                 25                2               5       2
## 154 28                 23                2               4       2
## 155  6                 19                2               5       2
## 172 28                 23                3               6       2
## 175 28                 23                3               4       2
## 176 28                 11                3               2       3
## 178 28                 11                3               3       3
## 179 28                 11                3               4       3
## 182 28                 28                3               6       3
## 184 28                 28                3               2       3
## 186 28                 23                4               4       3
## 190 28                 28                4               6       3
## 209 28                 19                5               3       3
## 236 28                 28                6               5       1
## 340 28                 12               12               2       4
## 359 28                 23                1               4       2
## 372 28                  7                2               4       2
## 381  6                 22                2               2       2
## 436 28                 25                5               4       3
## 445 28                 23                6               5       3
## 454 28                 23                6               4       1
## 462 28                 25                7               5       1
## 472 28                  9                7               3       1
## 504 28                 23                9               4       1
## 516 28                 23               10               3       4
## 519 28                 23               10               3       4
## 525 28                 13               10               5       4
## 527 28                 10               10               6       4
## 528  6                 23               10               2       4
## 531 28                  0               10               2       4
## 532 28                 13               10               3       4
## 540 28                 23               11               6       4
## 544 28                 14               11               5       4
## 548 28                 23               11               3       4
## 553 28                 23               11               6       4
## 558 28                 23               12               4       4
## 560 28                 23               12               6       4
## 561 28                 23               12               4       4
## 567 28                 23               12               3       4
## 568 28                 23               12               5       4
## 569 28                 23               12               2       4
## 587 28                 23                2               3       2
## 590 28                 28                2               5       2
## 597 28                  7                2               3       2
## 600 28                 25                2               5       2
## 612 28                  7                2               2       2
## 615 28                 25                2               3       2
## 625 28                 27                3               5       2
## 627 28                 25                3               5       2
## 638 28                  7                3               2       2
## 655 28                 13                4               2       3
## 659  6                 13                4               5       3
## 668 28                 19                4               2       3
## 672 28                 19                4               5       3
## 697 28                  6                5               4       3
## 706 28                 11                5               2       3
## 731  6                 22                7               3       1
## 734 28                 22                7               4       1
##     Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 19                     189                              29           13  33
## 52                     225                              26            9  28
## 53                     225                              26            9  28
## 57                     225                              26            9  28
## 68                     225                              26            9  28
## 70                     225                              26            9  28
## 74                     225                              26            9  28
## 77                     225                              26            9  28
## 82                     225                              26            9  28
## 87                     189                              29           13  33
## 89                     225                              26            9  28
## 91                     225                              26            9  28
## 93                     225                              26            9  28
## 96                     225                              26            9  28
## 103                    225                              26            9  28
## 107                    225                              26            9  28
## 109                    225                              26            9  28
## 113                    225                              26            9  28
## 114                    225                              26            9  28
## 118                    225                              26            9  28
## 120                    225                              26            9  28
## 121                    225                              26            9  28
## 123                    225                              26            9  28
## 136                    225                              26            9  28
## 142                    189                              29           13  33
## 147                    225                              26            9  28
## 148                    225                              26            9  28
## 151                    225                              26            9  28
## 154                    225                              26            9  28
## 155                    189                              29           13  33
## 172                    225                              26            9  28
## 175                    225                              26            9  28
## 176                    225                              26            9  28
## 178                    225                              26            9  28
## 179                    225                              26            9  28
## 182                    225                              26            9  28
## 184                    225                              26            9  28
## 186                    225                              26            9  28
## 190                    225                              26            9  28
## 209                    225                              26            9  28
## 236                    225                              26            9  28
## 340                    225                              26            9  28
## 359                    225                              26            9  28
## 372                    225                              26            9  28
## 381                    189                              29           13  33
## 436                    225                              26            9  28
## 445                    225                              26            9  28
## 454                    225                              26            9  28
## 462                    225                              26            9  28
## 472                    225                              26            9  28
## 504                    225                              26            9  28
## 516                    225                              26            9  28
## 519                    225                              26            9  28
## 525                    225                              26            9  28
## 527                    225                              26            9  28
## 528                    189                              29           13  33
## 531                    225                              26            9  28
## 532                    225                              26            9  28
## 540                    225                              26            9  28
## 544                    225                              26            9  28
## 548                    225                              26            9  28
## 553                    225                              26            9  28
## 558                    225                              26            9  28
## 560                    225                              26            9  28
## 561                    225                              26            9  28
## 567                    225                              26            9  28
## 568                    225                              26            9  28
## 569                    225                              26            9  28
## 587                    225                              26            9  28
## 590                    225                              26            9  28
## 597                    225                              26            9  28
## 600                    225                              26            9  28
## 612                    225                              26            9  28
## 615                    225                              26            9  28
## 625                    225                              26            9  28
## 627                    225                              26            9  28
## 638                    225                              26            9  28
## 655                    225                              26            9  28
## 659                    189                              29           13  33
## 668                    225                              26            9  28
## 672                    225                              26            9  28
## 697                    225                              26            9  28
## 706                    225                              26            9  28
## 731                    189                              29           13  33
## 734                    225                              26            9  28
##     Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 19                239.554         97                    0         1   2
## 52                241.476         92                    1         1   1
## 53                241.476         92                    0         1   1
## 57                241.476         92                    0         1   1
## 68                253.465         93                    0         1   1
## 70                253.465         93                    0         1   1
## 74                253.465         93                    0         1   1
## 77                253.465         93                    0         1   1
## 82                306.345         93                    0         1   1
## 87                306.345         93                    0         1   2
## 89                306.345         93                    0         1   1
## 91                306.345         93                    0         1   1
## 93                306.345         93                    0         1   1
## 96                306.345         93                    0         1   1
## 103               261.306         97                    0         1   1
## 107               261.306         97                    0         1   1
## 109               261.306         97                    0         1   1
## 113               261.306         97                    0         1   1
## 114               308.593         95                    0         1   1
## 118               308.593         95                    0         1   1
## 120               308.593         95                    0         1   1
## 121               308.593         95                    0         1   1
## 123               308.593         95                    0         1   1
## 136               308.593         95                    0         1   1
## 142               302.585         99                    0         1   2
## 147               302.585         99                    0         1   1
## 148               302.585         99                    0         1   1
## 151               302.585         99                    0         1   1
## 154               302.585         99                    0         1   1
## 155               302.585         99                    0         1   2
## 172               343.253         95                    0         1   1
## 175               343.253         95                    0         1   1
## 176               343.253         95                    0         1   1
## 178               343.253         95                    0         1   1
## 179               343.253         95                    0         1   1
## 182               343.253         95                    0         1   1
## 184               343.253         95                    0         1   1
## 186               326.452         96                    0         1   1
## 190               326.452         96                    0         1   1
## 209               378.884         92                    0         1   1
## 236               377.550         94                    0         1   1
## 340               236.629         93                    0         1   1
## 359               330.061        100                    0         1   1
## 372               251.818         96                    0         1   1
## 381               251.818         96                    0         1   2
## 436               246.074         99                    0         1   1
## 445               253.957         95                    0         1   1
## 454               253.957         95                    0         1   1
## 462               230.290         92                    0         1   1
## 472               230.290         92                    0         1   1
## 504               261.756         87                    0         1   1
## 516               284.853         91                    0         1   1
## 519               284.853         91                    0         1   1
## 525               284.853         91                    0         1   1
## 527               284.853         91                    0         1   1
## 528               284.853         91                    0         1   2
## 531               284.853         91                    1         1   1
## 532               284.853         91                    0         1   1
## 540               268.519         93                    0         1   1
## 544               268.519         93                    0         1   1
## 548               268.519         93                    0         1   1
## 553               268.519         93                    0         1   1
## 558               280.549         98                    0         1   1
## 560               280.549         98                    0         1   1
## 561               280.549         98                    0         1   1
## 567               280.549         98                    0         1   1
## 568               280.549         98                    0         1   1
## 569               280.549         98                    0         1   1
## 587               264.249         97                    0         1   1
## 590               264.249         97                    0         1   1
## 597               264.249         97                    0         1   1
## 600               264.249         97                    0         1   1
## 612               264.249         97                    0         1   1
## 615               264.249         97                    0         1   1
## 625               222.196         99                    0         1   1
## 627               222.196         99                    0         1   1
## 638               222.196         99                    0         1   1
## 655               246.288         91                    0         1   1
## 659               246.288         91                    0         1   2
## 668               246.288         91                    0         1   1
## 672               246.288         91                    0         1   1
## 697               237.656         99                    0         1   1
## 706               237.656         99                    0         1   1
## 731               264.604         93                    0         1   2
## 734               264.604         93                    0         1   1
##     Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 19               0             0   2     69    167              25
## 52               0             0   2     69    169              24
## 53               0             0   2     69    169              24
## 57               0             0   2     69    169              24
## 68               0             0   2     69    169              24
## 70               0             0   2     69    169              24
## 74               0             0   2     69    169              24
## 77               0             0   2     69    169              24
## 82               0             0   2     69    169              24
## 87               0             0   2     69    167              25
## 89               0             0   2     69    169              24
## 91               0             0   2     69    169              24
## 93               0             0   2     69    169              24
## 96               0             0   2     69    169              24
## 103              0             0   2     69    169              24
## 107              0             0   2     69    169              24
## 109              0             0   2     69    169              24
## 113              0             0   2     69    169              24
## 114              0             0   2     69    169              24
## 118              0             0   2     69    169              24
## 120              0             0   2     69    169              24
## 121              0             0   2     69    169              24
## 123              0             0   2     69    169              24
## 136              0             0   2     69    169              24
## 142              0             0   2     69    167              25
## 147              0             0   2     69    169              24
## 148              0             0   2     69    169              24
## 151              0             0   2     69    169              24
## 154              0             0   2     69    169              24
## 155              0             0   2     69    167              25
## 172              0             0   2     69    169              24
## 175              0             0   2     69    169              24
## 176              0             0   2     69    169              24
## 178              0             0   2     69    169              24
## 179              0             0   2     69    169              24
## 182              0             0   2     69    169              24
## 184              0             0   2     69    169              24
## 186              0             0   2     69    169              24
## 190              0             0   2     69    169              24
## 209              0             0   2     69    169              24
## 236              0             0   2     69    169              24
## 340              0             0   2     69    169              24
## 359              0             0   2     69    169              24
## 372              0             0   2     69    169              24
## 381              0             0   2     69    167              25
## 436              0             0   2     69    169              24
## 445              0             0   2     69    169              24
## 454              0             0   2     69    169              24
## 462              0             0   2     69    169              24
## 472              0             0   2     69    169              24
## 504              0             0   2     69    169              24
## 516              0             0   2     69    169              24
## 519              0             0   2     69    169              24
## 525              0             0   2     69    169              24
## 527              0             0   2     69    169              24
## 528              0             0   2     69    167              25
## 531              0             0   2     69    169              24
## 532              0             0   2     69    169              24
## 540              0             0   2     69    169              24
## 544              0             0   2     69    169              24
## 548              0             0   2     69    169              24
## 553              0             0   2     69    169              24
## 558              0             0   2     69    169              24
## 560              0             0   2     69    169              24
## 561              0             0   2     69    169              24
## 567              0             0   2     69    169              24
## 568              0             0   2     69    169              24
## 569              0             0   2     69    169              24
## 587              0             0   2     69    169              24
## 590              0             0   2     69    169              24
## 597              0             0   2     69    169              24
## 600              0             0   2     69    169              24
## 612              0             0   2     69    169              24
## 615              0             0   2     69    169              24
## 625              0             0   2     69    169              24
## 627              0             0   2     69    169              24
## 638              0             0   2     69    169              24
## 655              0             0   2     69    169              24
## 659              0             0   2     69    167              25
## 668              0             0   2     69    169              24
## 672              0             0   2     69    169              24
## 697              0             0   2     69    169              24
## 706              0             0   2     69    169              24
## 731              0             0   2     69    167              25
## 734              0             0   2     69    169              24
##     Absenteeism.time.in.hours
## 19                          8
## 52                          0
## 53                          2
## 57                          3
## 68                          3
## 70                          2
## 74                          3
## 77                          2
## 82                          1
## 87                          8
## 89                          1
## 91                          3
## 93                          3
## 96                          3
## 103                         2
## 107                         3
## 109                         2
## 113                         2
## 114                         1
## 118                         2
## 120                         2
## 121                         1
## 123                         2
## 136                         1
## 142                         8
## 147                         2
## 148                         2
## 151                         3
## 154                         1
## 155                         8
## 172                         1
## 175                         2
## 176                         8
## 178                         8
## 179                        16
## 182                         2
## 184                         1
## 186                         1
## 190                         2
## 209                         8
## 236                         2
## 340                         3
## 359                         5
## 372                         1
## 381                         8
## 436                         3
## 445                         4
## 454                         2
## 462                         4
## 472                       112
## 504                         1
## 516                         2
## 519                         4
## 525                         1
## 527                         3
## 528                         8
## 531                         0
## 532                         3
## 540                         4
## 544                         3
## 548                         2
## 553                         2
## 558                         3
## 560                         3
## 561                         3
## 567                         2
## 568                         3
## 569                         2
## 587                         3
## 590                         3
## 597                         8
## 600                         3
## 612                         2
## 615                         2
## 625                         2
## 627                         2
## 638                         8
## 655                         8
## 659                         8
## 668                         8
## 672                         8
## 697                         3
## 706                         1
## 731                        16
## 734                         8

##  0  1 10 11 12  2  3  4  5  6  7  8  9 
##  0 50 71 63 49 72 88 54 64 54 68 54 53

## # A tibble: 1 x 2
##   ID    n_obs
##   <fct> <int>
## 1 28       77

Next, let’s look at the first target variable, absenteeism in hours. I broke the target variable down into bins based on the structure of the responses. There were enough records with no time off to form their own category, while absences could be split into less than one day off or one day or more. I chose to group taking a full day off with taking more than one day off because of the nature of callouts. Ultimately, this exercise seeks to find patterns in taking days off to distinguish fraudulent from legitimate time off, and it’s unlikely that someone would work only part of a day if they were calling out fraudulently. This is supported by the reasons for absence: taking less than one day off is mostly for miscellaneous reasons, like a dentist or doctor’s appointment, whereas taking one day or more off is more likely to be attributed to an ICD diagnosis.

Finally, there is an extra reason-for-absence value of ‘zero’ that was not defined in the original data set description. This value is only present when the time taken off is also ‘zero’. For the models, however, I’ll be creating regression trees that predict the number of hours, given an absence.
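The binning described above can be sketched roughly as follows. This is only an illustration: the `hours` vector is a hypothetical handful of values copied from the printed rows, and the 8-hour cutoff for a full day is an assumption rather than something stated in the data set description.

```r
# Hypothetical sample of absence durations in hours
hours <- c(8, 0, 2, 3, 16, 112, 1, 4)

# Assumption: a full working day is 8 hours
full_day <- 8

# Three bins: no time off, part of a day, a full day or longer
absence_bin <- factor(
  ifelse(hours == 0, "none",
         ifelse(hours < full_day, "less than one day", "one day or more")),
  levels = c("none", "less than one day", "one day or more")
)

table(absence_bin)
```

Keeping the levels in an explicit order makes the resulting table read from least to most time off.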

Modeling the Data

Absenteeism

For the first variable, I’m using the intended outcome variable of absenteeism in hours. I had to omit the ‘ID’ column because, as a factor, it has too many levels for the ‘tree’ function to handle.
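A quick way to spot columns like ‘ID’ before fitting is to count factor levels. The sketch below uses a hypothetical stand-in data frame (`toy`), assumes ‘ID’ has 36 levels (one per employee), and assumes the limit of 32 levels per factor that the ‘tree’ package documents.

```r
# Hypothetical stand-in for the real data: ID is a factor with 36 levels
toy <- data.frame(
  ID = factor(rep(1:36, each = 2)),
  Seasons = factor(rep(1:4, times = 18)),
  Absenteeism.time.in.hours = rep(c(0, 8), times = 36)
)

# Count the levels of each factor column; non-factor columns get NA
n_levels <- sapply(toy, function(col) {
  if (is.factor(col)) nlevels(col) else NA_integer_
})

# Assumption: 32 is the per-factor level limit for tree::tree()
max_levels <- 32
too_wide <- names(n_levels)[!is.na(n_levels) & n_levels > max_levels]
too_wide
```

Any column named in `too_wide` would need to be dropped from the formula, as done with `. - ID` in the call below, or recoded into fewer groups.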

## 
## Regression tree:
## tree(formula = Absenteeism.time.in.hours ~ . - ID, data = train)
## Variables actually used in tree construction:
## [1] "Reason.for.absence"    "Height"                "Month.of.absence"     
## [4] "Weight"                "Day.of.the.week"       "Social.drinker"       
## [7] "Work.load.Average.day"
## Number of terminal nodes:  14 
## Residual mean deviance:  81.71 = 47230 / 578 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -51.2000  -1.4250  -0.4247   0.0000   0.5753  73.1100

The decision tree model creates a tree with 14 terminal nodes. The most important variables are reason for absence, height, and month of absence.

## [1] 242.9984

The test performance is considerably worse: the test mean squared error is about 243, roughly three times the training residual mean deviance of about 82.
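The two figures can be put on the same footing with a little arithmetic. The sketch below copies the numbers from the output above; the training set size of 592 is inferred from 578 residual degrees of freedom plus 14 terminal nodes, and `mse` is a hypothetical helper, not a function from the ‘tree’ package.

```r
# Figures copied from the tree summary above
train_deviance <- 47230   # residual sum of squares on the training set
n_leaves <- 14            # terminal nodes
n_train <- 578 + n_leaves # inferred training set size (592)

# Residual mean deviance divides by n minus the number of leaves
residual_mean_deviance <- train_deviance / (n_train - n_leaves)

# Plain training MSE divides by n
train_mse <- train_deviance / n_train

# Hypothetical helper for the test-set figure
mse <- function(actual, predicted) mean((actual - predicted)^2)

test_mse <- 242.9984      # reported test error
ratio <- test_mse / train_mse
round(c(residual_mean_deviance, train_mse, ratio), 2)
```

On real data, the test error would come from something like `mse(test$Absenteeism.time.in.hours, predict(fit, test))`, where `fit` and `test` are placeholders for the fitted tree and the held-out rows.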

## 
## Call:
##  randomForest(formula = Absenteeism.time.in.hours ~ . - ID, data = train,      mtry = 3, importance = TRUE, na.action = na.omit) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 144.5725
##                     % Var explained: 14.61

##                                   %IncMSE IncNodePurity
## Reason.for.absence              12.529236     24392.545
## Month.of.absence                 3.319310      8470.681
## Day.of.the.week                  2.296452      4992.006
## Seasons                          4.129389      2823.696
## Transportation.expense           4.894875      3638.832
## Distance.from.Residence.to.Work  3.300605      2624.744

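The random forest model builds 500 trees, trying 3 variables at each split. Its out-of-bag mean of squared residuals (144.57) is higher than the single tree’s training deviance (81.71), although the out-of-bag figure estimates out-of-sample error, so the two are not directly comparable. The most important variable, reason for absence, remains the same, while the timing of absence is more important in this case.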
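The ‘% Var explained’ line in the output above can be tied back to the mean of squared residuals. This is a rough check, assuming the figure is the pseudo R-squared 1 - MSE / Var(y) that the ‘randomForest’ documentation describes for regression.

```r
# Figures copied from the random forest output above
oob_mse <- 144.5725
pct_var_explained <- 14.61

# Rearranging 1 - MSE / Var(y) gives the variance of the training target
implied_var <- oob_mse / (1 - pct_var_explained / 100)
implied_sd <- sqrt(implied_var)

round(c(implied_var = implied_var, implied_sd = implied_sd), 2)
```

Under that assumption, the forest reduces squared error by only about 15% relative to predicting the mean number of hours for every record.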

## [1] 185.727

Test performance is actually better using the random forest model for this outcome: a test mean squared error of about 186, versus about 243 for the single tree.
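To make the comparison easier to read in hours rather than squared hours, the two reported test errors can be converted to root mean squared error. The numbers are copied from the outputs above; the variable names are only illustrative.

```r
# Reported test-set mean squared errors
tree_test_mse <- 242.9984
rf_test_mse <- 185.727

# Root mean squared error is in the same units as the target (hours)
rmse <- sqrt(c(tree = tree_test_mse, forest = rf_test_mse))

# Relative reduction in test MSE from using the forest
reduction <- 1 - rf_test_mse / tree_test_mse

round(rmse, 2)
round(100 * reduction, 1)
```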

Disciplinary Action

The second target, the presence or absence of disciplinary action, is less complicated than the intended target variable.

## 
## Classification tree:
## tree(formula = Disciplinary.failure ~ . - ID, data = train)
## Variables actually used in tree construction:
## [1] "Reason.for.absence" "Month.of.absence"  
## Number of terminal nodes:  3 
## Residual mean deviance:  0.01797 = 10.59 / 589 
## Misclassification error rate: 0.005068 = 3 / 592

Its decision tree is much simpler, with only 3 terminal nodes. The only variables used are reason for absence and month of absence. The training misclassification rate is about 0.5% (3 of 592).

##         
## dt2_pred   0   1
##        0 143   0
##        1   0   5

For the test set, all classes are correctly predicted.
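The test-set table can be turned into a few summary numbers. The sketch below re-enters the counts by hand and assumes rows are predictions and columns are the true classes; it also computes the accuracy of always predicting the majority class, for context.

```r
# Counts copied from the test-set table above
# Assumption: rows are predicted classes, columns are actual classes
conf <- matrix(c(143, 0,
                 0,   5),
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

accuracy <- sum(diag(conf)) / sum(conf)

# Share of actual disciplinary cases that were caught
sensitivity <- conf["1", "1"] / sum(conf[, "1"])

# Accuracy of a model that always predicts the majority class
baseline <- max(colSums(conf)) / sum(conf)

round(c(accuracy = accuracy, sensitivity = sensitivity, baseline = baseline), 3)
```

With only 5 positive cases in the test set, a trivial majority-class model already scores close to 97%, so perfect accuracy here rests on very few examples.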

## 
## Call:
##  randomForest(formula = Disciplinary.failure ~ . - ID, data = train,      mtry = 3, importance = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.51%
## Confusion matrix:
##     0  1 class.error
## 0 554  3 0.005385996
## 1   0 35 0.000000000

##         
## rf2_pred   0   1
##        0 143   0
##        1   0   5

The random forest model performed almost identically to the decision tree for this outcome variable.
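The per-class errors in the out-of-bag confusion matrix can be reproduced from its counts. This sketch assumes the usual ‘randomForest’ layout, with actual classes in rows and predicted classes in columns.

```r
# Counts copied from the out-of-bag confusion matrix above
# Assumption: rows are actual classes, columns are predicted classes
oob <- matrix(c(554, 3,
                0,   35),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Error within each actual class
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall out-of-bag error rate
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 6)
round(100 * oob_error, 2)
```

All three out-of-bag errors are records without disciplinary action that were predicted to have it.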

Analysis

Looking at the more complicated variable, absenteeism, the decision tree generated was more complicated and repeatedly split on the same variables. By contrast, the corresponding random forest performed better on the test set, with a test mean squared error of about 186 versus 243. A single decision tree appears to overfit to the data it’s exposed to. For the second variable, a binary yes/no of whether disciplinary action took place, the two models were nearly identical. Since the appeal of tree-based models is their interpretability, a random forest’s variable importance measures provide a quick look at the important variables in a potential model.

One aspect that was not mentioned is that decision trees suffer when there is class imbalance. For instance, disciplinary action was observed in only 40 of the 740 records, a ratio of roughly 17.5:1 between the classes. As with logistic regression, sampling methods need to compensate for this mismatch if we’re interested in predicting an uncommon event. This may mean that the models for the second variable are not actually as accurate as they appear.
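The class ratio and one simple sampling remedy can be sketched as follows. This assumes the 40 disciplinary cases are counted across all 740 rows; the label vector `y` is synthetic, the seed is arbitrary, and downsampling the majority class is just one of several options, not the method used in the models above.

```r
# Class counts across the full data set
n_total <- 740
n_pos <- 40
n_neg <- n_total - n_pos

# Ratio of majority to minority class
class_ratio <- n_neg / n_pos

# Synthetic label vector with the same class counts
y <- factor(c(rep(0, n_neg), rep(1, n_pos)))

# Downsample the majority class to match the minority class
set.seed(622)  # arbitrary seed for reproducibility
neg_idx <- which(y == "0")
pos_idx <- which(y == "1")
keep <- c(sample(neg_idx, length(pos_idx)), pos_idx)

class_ratio
table(y[keep])
```

Downsampling discards most of the majority class, so with only 40 positive cases the balanced sample is small; upsampling or class weights are alternatives worth comparing.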

DATA622: Homework 2

by Thomas Hill
