Based on the latest topics presented, bring a data set of your choice and create a Decision Tree where you can solve a classification or regression problem and predict the outcome of a particular feature or detail of the data used. Switch variables to generate 2 decision trees and compare the results. Create a random forest for regression and analyze the results. Based on real cases where decision trees went wrong, and ‘the bad & ugly’ aspects of decision trees (https://decision.com/blog/the-good-the-bad-the-ugly-of-using-decision-trees), how can you change this perception when using the decision tree you created to solve a real problem?
Format: document with screen captures & analysis.
For this week’s homework, I’m using the ‘Absenteeism’ data set from the UCI Machine Learning Repository. It records callouts at a Brazilian firm over a three-year period.
## [1] 740
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 3 :113 23 :149 3 : 87 2:161 1:170
## 28 : 76 28 :112 2 : 72 3:154 2:192
## 34 : 55 27 : 69 10 : 71 4:156 3:183
## 22 : 46 13 : 55 7 : 67 5:125 4:195
## 20 : 42 0 : 43 5 : 64 6:144
## 11 : 40 19 : 40 11 : 63
## (Other):368 (Other):272 (Other):316
## Transportation.expense Distance.from.Residence.to.Work Service.time
## Min. :118.0 Min. : 5.00 Min. : 1.00
## 1st Qu.:179.0 1st Qu.:16.00 1st Qu.: 9.00
## Median :225.0 Median :26.00 Median :13.00
## Mean :221.3 Mean :29.63 Mean :12.55
## 3rd Qu.:260.0 3rd Qu.:50.00 3rd Qu.:16.00
## Max. :388.0 Max. :52.00 Max. :29.00
##
## Age Work.load.Average.day Hit.target Disciplinary.failure
## Min. :27.00 Min. :205.9 Min. : 81.00 0:700
## 1st Qu.:31.00 1st Qu.:244.4 1st Qu.: 93.00 1: 40
## Median :37.00 Median :264.2 Median : 95.00
## Mean :36.45 Mean :271.5 Mean : 94.59
## 3rd Qu.:40.00 3rd Qu.:294.2 3rd Qu.: 97.00
## Max. :58.00 Max. :378.9 Max. :100.00
##
## Education Son Social.drinker Social.smoker Pet
## 1:611 Min. :0.000 0:320 0:686 Min. :0.0000
## 2: 46 1st Qu.:0.000 1:420 1: 54 1st Qu.:0.0000
## 3: 79 Median :1.000 Median :0.0000
## 4: 4 Mean :1.019 Mean :0.7459
## 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :4.000 Max. :8.0000
##
## Weight Height Body.mass.index Absenteeism.time.in.hours
## Min. : 56.00 Min. :163.0 Min. :19.00 Min. : 0.000
## 1st Qu.: 69.00 1st Qu.:169.0 1st Qu.:24.00 1st Qu.: 2.000
## Median : 83.00 Median :170.0 Median :25.00 Median : 3.000
## Mean : 79.04 Mean :172.1 Mean :26.68 Mean : 6.924
## 3rd Qu.: 89.00 3rd Qu.:172.0 3rd Qu.:31.00 3rd Qu.: 8.000
## Max. :108.00 Max. :196.0 Max. :38.00 Max. :120.000
##
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 11 26 7 3 1
## 2 36 0 7 3 1
## 3 3 23 7 4 1
## 4 7 7 7 5 1
## 5 11 23 7 5 1
## 6 3 23 7 6 1
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1 289 36 13 33
## 2 118 13 18 50
## 3 179 51 18 38
## 4 279 5 14 39
## 5 289 36 13 33
## 6 179 51 18 38
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1 239.554 97 0 1 2
## 2 239.554 97 1 1 1
## 3 239.554 97 0 1 0
## 4 239.554 97 0 1 2
## 5 239.554 97 0 1 2
## 6 239.554 97 0 1 0
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1 1 0 1 90 172 30
## 2 1 0 0 98 178 31
## 3 1 0 0 89 170 31
## 4 1 1 0 68 168 24
## 5 1 0 1 90 172 30
## 6 1 0 0 89 170 31
## Absenteeism.time.in.hours
## 1 4
## 2 0
## 3 2
## 4 4
## 5 2
## 6 2
Including the outcome variable, there are 21 columns and 740 observations.
## [1] 36
## [1] 0 1 2 3 4 5 7 8 16 24 32 40 48 56 64 80 104 112 120
The data description mentions that there are 28 different reasons for callouts: 21 are medical categories and the last seven are miscellaneous. There are 36 workers in total, with several observations for each worker. Absence time in hours covers both partial days (up to 8 hours) and multiple days (for example, 16 hours = 2 days). My interpretation is that values greater than 8 occur when a worker called out several days in a row, with each observation being a single span of callouts. What isn’t immediately clear is what observations with absenteeism of ‘zero’ describe.
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 4 0 0 3 1
## 2 8 0 0 4 2
## 3 35 0 0 6 3
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1 118 14 13 40
## 2 231 35 14 39
## 3 179 45 14 53
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1 271.219 95 0 1 1
## 2 271.219 95 0 1 2
## 3 271.219 95 0 1 1
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1 1 0 8 98 170 34
## 2 1 0 2 100 170 35
## 3 0 0 1 77 175 25
## Absenteeism.time.in.hours
## 1 0
## 2 0
## 3 0
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 8 0 9 3 1
## 2 4 0 0 3 1
## 3 8 0 0 4 2
## 4 35 0 0 6 3
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1 231 35 14 39
## 2 118 14 13 40
## 3 231 35 14 39
## 4 179 45 14 53
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1 294.217 81 1 1 2
## 2 271.219 95 0 1 1
## 3 271.219 95 0 1 2
## 4 271.219 95 0 1 1
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1 1 0 2 100 170 35
## 2 1 0 8 98 170 34
## 3 1 0 2 100 170 35
## 4 0 0 1 77 175 25
## Absenteeism.time.in.hours
## 1 0
## 2 0
## 3 0
## 4 0
## # A tibble: 5 x 2
## Month.of.absence n_obs
## <fct> <int>
## 1 0 1
## 2 6 16
## 3 7 67
## 4 8 54
## 5 9 32
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 1 29 0 9 2 4
## 2 29 28 2 6 2
## 3 29 19 5 4 3
## 4 29 14 5 5 3
## 5 29 22 5 6 3
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 1 225 26 9 28
## 2 225 15 15 41
## 3 225 15 15 41
## 4 225 15 15 41
## 5 225 15 15 41
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 1 241.476 92 1 1 1
## 2 264.249 97 0 4 2
## 3 237.656 99 0 4 2
## 4 237.656 99 0 4 2
## 5 237.656 99 0 4 2
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 1 0 0 2 69 169 24
## 2 1 0 2 94 182 28
## 3 1 0 2 94 182 28
## 4 1 0 2 94 182 28
## 5 1 0 2 94 182 28
## Absenteeism.time.in.hours
## 1 0
## 2 2
## 3 3
## 4 8
## 5 8
## # A tibble: 2 x 2
## ID n_obs
## <fct> <int>
## 1 28 76
## 2 29 1
I also noticed that there are three observations with a month of ‘zero’, which doesn’t make sense as a descriptor. Since these observations have a season, I’ll impute the most common month of that season onto the zero values. Also, after sorting by ID, it appears that one of the observations with an ID of 29 might be miscoded. In general, many of the demographic variables (age, weight, number of pets) do not change between observations for the same worker. The odd ID 29 record has the age, weight, and commute values of worker 28 rather than those of the other ID 29 records, which leads me to believe that it is an observation for ID 28 miscoded as 29.
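The season-based imputation described here can be sketched in base R as follows. This is only an illustration on toy data, not the code used for the report: the data frame `absent`, the helper `mode_month`, and the numeric column types are assumptions (in the real data `Month.of.absence` is a factor, so it would need converting with `as.character()` first).

```r
# Toy data standing in for the real columns (values are illustrative only).
absent <- data.frame(
  Month.of.absence = c(7, 7, 8, 0, 2, 2, 1, 0),
  Seasons          = c(1, 1, 1, 1, 2, 2, 2, 2)
)

# Most frequent non-zero month within one season.
# Ties go to the first month in sorted order, because which.max()
# returns the first maximum of the table.
mode_month <- function(months) {
  counts <- table(months[months != 0])
  as.numeric(names(counts)[which.max(counts)])
}

# One modal month per season, named by season.
season_mode <- tapply(absent$Month.of.absence, absent$Seasons, mode_month)

# Replace each zero month with the modal month of that row's season.
is_zero <- absent$Month.of.absence == 0
absent$Month.of.absence[is_zero] <-
  season_mode[as.character(absent$Seasons[is_zero])]
```

Looking the mode up by season name (rather than by position) keeps the replacement correct even if the season codes are not consecutive integers.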
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 738 4 0 0 3 1
## 739 8 0 0 4 2
## 740 35 0 0 6 3
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 738 118 14 13 40
## 739 231 35 14 39
## 740 179 45 14 53
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 738 271.219 95 0 1 1
## 739 271.219 95 0 1 2
## 740 271.219 95 0 1 1
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 738 1 0 8 98 170 34
## 739 1 0 2 100 170 35
## 740 0 0 1 77 175 25
## Absenteeism.time.in.hours
## 738 0
## 739 0
## 740 0
## # A tibble: 5 x 2
## Month.of.absence n_obs
## <fct> <int>
## 1 0 1
## 2 6 16
## 3 9 32
## 4 8 54
## 5 7 67
## # A tibble: 5 x 2
## Month.of.absence n_obs
## <fct> <int>
## 1 0 1
## 2 12 9
## 3 1 50
## 4 3 60
## 5 2 72
## # A tibble: 5 x 2
## Month.of.absence n_obs
## <fct> <int>
## 1 0 1
## 2 3 27
## 3 6 38
## 4 4 53
## 5 5 64
## ID Reason.for.absence Month.of.absence Day.of.the.week Seasons
## 19 6 11 7 5 1
## 52 29 0 9 2 4
## 53 28 23 9 3 4
## 57 28 18 9 4 4
## 68 28 23 10 6 4
## 70 28 23 10 4 4
## 74 28 23 10 4 4
## 77 28 28 10 3 4
## 82 28 23 11 4 4
## 87 6 22 11 2 4
## 89 28 23 11 4 4
## 91 28 23 11 4 4
## 93 28 13 11 6 4
## 96 28 28 11 3 4
## 103 28 23 12 5 4
## 107 28 28 12 5 4
## 109 28 23 12 3 4
## 113 28 23 12 6 2
## 114 28 23 1 4 2
## 118 28 27 1 4 2
## 120 28 28 1 5 2
## 121 28 27 1 6 2
## 123 28 27 1 3 2
## 136 28 23 1 3 2
## 142 6 23 2 5 2
## 147 28 28 2 2 2
## 148 28 23 2 3 2
## 151 28 25 2 5 2
## 154 28 23 2 4 2
## 155 6 19 2 5 2
## 172 28 23 3 6 2
## 175 28 23 3 4 2
## 176 28 11 3 2 3
## 178 28 11 3 3 3
## 179 28 11 3 4 3
## 182 28 28 3 6 3
## 184 28 28 3 2 3
## 186 28 23 4 4 3
## 190 28 28 4 6 3
## 209 28 19 5 3 3
## 236 28 28 6 5 1
## 340 28 12 12 2 4
## 359 28 23 1 4 2
## 372 28 7 2 4 2
## 381 6 22 2 2 2
## 436 28 25 5 4 3
## 445 28 23 6 5 3
## 454 28 23 6 4 1
## 462 28 25 7 5 1
## 472 28 9 7 3 1
## 504 28 23 9 4 1
## 516 28 23 10 3 4
## 519 28 23 10 3 4
## 525 28 13 10 5 4
## 527 28 10 10 6 4
## 528 6 23 10 2 4
## 531 28 0 10 2 4
## 532 28 13 10 3 4
## 540 28 23 11 6 4
## 544 28 14 11 5 4
## 548 28 23 11 3 4
## 553 28 23 11 6 4
## 558 28 23 12 4 4
## 560 28 23 12 6 4
## 561 28 23 12 4 4
## 567 28 23 12 3 4
## 568 28 23 12 5 4
## 569 28 23 12 2 4
## 587 28 23 2 3 2
## 590 28 28 2 5 2
## 597 28 7 2 3 2
## 600 28 25 2 5 2
## 612 28 7 2 2 2
## 615 28 25 2 3 2
## 625 28 27 3 5 2
## 627 28 25 3 5 2
## 638 28 7 3 2 2
## 655 28 13 4 2 3
## 659 6 13 4 5 3
## 668 28 19 4 2 3
## 672 28 19 4 5 3
## 697 28 6 5 4 3
## 706 28 11 5 2 3
## 731 6 22 7 3 1
## 734 28 22 7 4 1
## Transportation.expense Distance.from.Residence.to.Work Service.time Age
## 19 189 29 13 33
## 52 225 26 9 28
## 53 225 26 9 28
## 57 225 26 9 28
## 68 225 26 9 28
## 70 225 26 9 28
## 74 225 26 9 28
## 77 225 26 9 28
## 82 225 26 9 28
## 87 189 29 13 33
## 89 225 26 9 28
## 91 225 26 9 28
## 93 225 26 9 28
## 96 225 26 9 28
## 103 225 26 9 28
## 107 225 26 9 28
## 109 225 26 9 28
## 113 225 26 9 28
## 114 225 26 9 28
## 118 225 26 9 28
## 120 225 26 9 28
## 121 225 26 9 28
## 123 225 26 9 28
## 136 225 26 9 28
## 142 189 29 13 33
## 147 225 26 9 28
## 148 225 26 9 28
## 151 225 26 9 28
## 154 225 26 9 28
## 155 189 29 13 33
## 172 225 26 9 28
## 175 225 26 9 28
## 176 225 26 9 28
## 178 225 26 9 28
## 179 225 26 9 28
## 182 225 26 9 28
## 184 225 26 9 28
## 186 225 26 9 28
## 190 225 26 9 28
## 209 225 26 9 28
## 236 225 26 9 28
## 340 225 26 9 28
## 359 225 26 9 28
## 372 225 26 9 28
## 381 189 29 13 33
## 436 225 26 9 28
## 445 225 26 9 28
## 454 225 26 9 28
## 462 225 26 9 28
## 472 225 26 9 28
## 504 225 26 9 28
## 516 225 26 9 28
## 519 225 26 9 28
## 525 225 26 9 28
## 527 225 26 9 28
## 528 189 29 13 33
## 531 225 26 9 28
## 532 225 26 9 28
## 540 225 26 9 28
## 544 225 26 9 28
## 548 225 26 9 28
## 553 225 26 9 28
## 558 225 26 9 28
## 560 225 26 9 28
## 561 225 26 9 28
## 567 225 26 9 28
## 568 225 26 9 28
## 569 225 26 9 28
## 587 225 26 9 28
## 590 225 26 9 28
## 597 225 26 9 28
## 600 225 26 9 28
## 612 225 26 9 28
## 615 225 26 9 28
## 625 225 26 9 28
## 627 225 26 9 28
## 638 225 26 9 28
## 655 225 26 9 28
## 659 189 29 13 33
## 668 225 26 9 28
## 672 225 26 9 28
## 697 225 26 9 28
## 706 225 26 9 28
## 731 189 29 13 33
## 734 225 26 9 28
## Work.load.Average.day Hit.target Disciplinary.failure Education Son
## 19 239.554 97 0 1 2
## 52 241.476 92 1 1 1
## 53 241.476 92 0 1 1
## 57 241.476 92 0 1 1
## 68 253.465 93 0 1 1
## 70 253.465 93 0 1 1
## 74 253.465 93 0 1 1
## 77 253.465 93 0 1 1
## 82 306.345 93 0 1 1
## 87 306.345 93 0 1 2
## 89 306.345 93 0 1 1
## 91 306.345 93 0 1 1
## 93 306.345 93 0 1 1
## 96 306.345 93 0 1 1
## 103 261.306 97 0 1 1
## 107 261.306 97 0 1 1
## 109 261.306 97 0 1 1
## 113 261.306 97 0 1 1
## 114 308.593 95 0 1 1
## 118 308.593 95 0 1 1
## 120 308.593 95 0 1 1
## 121 308.593 95 0 1 1
## 123 308.593 95 0 1 1
## 136 308.593 95 0 1 1
## 142 302.585 99 0 1 2
## 147 302.585 99 0 1 1
## 148 302.585 99 0 1 1
## 151 302.585 99 0 1 1
## 154 302.585 99 0 1 1
## 155 302.585 99 0 1 2
## 172 343.253 95 0 1 1
## 175 343.253 95 0 1 1
## 176 343.253 95 0 1 1
## 178 343.253 95 0 1 1
## 179 343.253 95 0 1 1
## 182 343.253 95 0 1 1
## 184 343.253 95 0 1 1
## 186 326.452 96 0 1 1
## 190 326.452 96 0 1 1
## 209 378.884 92 0 1 1
## 236 377.550 94 0 1 1
## 340 236.629 93 0 1 1
## 359 330.061 100 0 1 1
## 372 251.818 96 0 1 1
## 381 251.818 96 0 1 2
## 436 246.074 99 0 1 1
## 445 253.957 95 0 1 1
## 454 253.957 95 0 1 1
## 462 230.290 92 0 1 1
## 472 230.290 92 0 1 1
## 504 261.756 87 0 1 1
## 516 284.853 91 0 1 1
## 519 284.853 91 0 1 1
## 525 284.853 91 0 1 1
## 527 284.853 91 0 1 1
## 528 284.853 91 0 1 2
## 531 284.853 91 1 1 1
## 532 284.853 91 0 1 1
## 540 268.519 93 0 1 1
## 544 268.519 93 0 1 1
## 548 268.519 93 0 1 1
## 553 268.519 93 0 1 1
## 558 280.549 98 0 1 1
## 560 280.549 98 0 1 1
## 561 280.549 98 0 1 1
## 567 280.549 98 0 1 1
## 568 280.549 98 0 1 1
## 569 280.549 98 0 1 1
## 587 264.249 97 0 1 1
## 590 264.249 97 0 1 1
## 597 264.249 97 0 1 1
## 600 264.249 97 0 1 1
## 612 264.249 97 0 1 1
## 615 264.249 97 0 1 1
## 625 222.196 99 0 1 1
## 627 222.196 99 0 1 1
## 638 222.196 99 0 1 1
## 655 246.288 91 0 1 1
## 659 246.288 91 0 1 2
## 668 246.288 91 0 1 1
## 672 246.288 91 0 1 1
## 697 237.656 99 0 1 1
## 706 237.656 99 0 1 1
## 731 264.604 93 0 1 2
## 734 264.604 93 0 1 1
## Social.drinker Social.smoker Pet Weight Height Body.mass.index
## 19 0 0 2 69 167 25
## 52 0 0 2 69 169 24
## 53 0 0 2 69 169 24
## 57 0 0 2 69 169 24
## 68 0 0 2 69 169 24
## 70 0 0 2 69 169 24
## 74 0 0 2 69 169 24
## 77 0 0 2 69 169 24
## 82 0 0 2 69 169 24
## 87 0 0 2 69 167 25
## 89 0 0 2 69 169 24
## 91 0 0 2 69 169 24
## 93 0 0 2 69 169 24
## 96 0 0 2 69 169 24
## 103 0 0 2 69 169 24
## 107 0 0 2 69 169 24
## 109 0 0 2 69 169 24
## 113 0 0 2 69 169 24
## 114 0 0 2 69 169 24
## 118 0 0 2 69 169 24
## 120 0 0 2 69 169 24
## 121 0 0 2 69 169 24
## 123 0 0 2 69 169 24
## 136 0 0 2 69 169 24
## 142 0 0 2 69 167 25
## 147 0 0 2 69 169 24
## 148 0 0 2 69 169 24
## 151 0 0 2 69 169 24
## 154 0 0 2 69 169 24
## 155 0 0 2 69 167 25
## 172 0 0 2 69 169 24
## 175 0 0 2 69 169 24
## 176 0 0 2 69 169 24
## 178 0 0 2 69 169 24
## 179 0 0 2 69 169 24
## 182 0 0 2 69 169 24
## 184 0 0 2 69 169 24
## 186 0 0 2 69 169 24
## 190 0 0 2 69 169 24
## 209 0 0 2 69 169 24
## 236 0 0 2 69 169 24
## 340 0 0 2 69 169 24
## 359 0 0 2 69 169 24
## 372 0 0 2 69 169 24
## 381 0 0 2 69 167 25
## 436 0 0 2 69 169 24
## 445 0 0 2 69 169 24
## 454 0 0 2 69 169 24
## 462 0 0 2 69 169 24
## 472 0 0 2 69 169 24
## 504 0 0 2 69 169 24
## 516 0 0 2 69 169 24
## 519 0 0 2 69 169 24
## 525 0 0 2 69 169 24
## 527 0 0 2 69 169 24
## 528 0 0 2 69 167 25
## 531 0 0 2 69 169 24
## 532 0 0 2 69 169 24
## 540 0 0 2 69 169 24
## 544 0 0 2 69 169 24
## 548 0 0 2 69 169 24
## 553 0 0 2 69 169 24
## 558 0 0 2 69 169 24
## 560 0 0 2 69 169 24
## 561 0 0 2 69 169 24
## 567 0 0 2 69 169 24
## 568 0 0 2 69 169 24
## 569 0 0 2 69 169 24
## 587 0 0 2 69 169 24
## 590 0 0 2 69 169 24
## 597 0 0 2 69 169 24
## 600 0 0 2 69 169 24
## 612 0 0 2 69 169 24
## 615 0 0 2 69 169 24
## 625 0 0 2 69 169 24
## 627 0 0 2 69 169 24
## 638 0 0 2 69 169 24
## 655 0 0 2 69 169 24
## 659 0 0 2 69 167 25
## 668 0 0 2 69 169 24
## 672 0 0 2 69 169 24
## 697 0 0 2 69 169 24
## 706 0 0 2 69 169 24
## 731 0 0 2 69 167 25
## 734 0 0 2 69 169 24
## Absenteeism.time.in.hours
## 19 8
## 52 0
## 53 2
## 57 3
## 68 3
## 70 2
## 74 3
## 77 2
## 82 1
## 87 8
## 89 1
## 91 3
## 93 3
## 96 3
## 103 2
## 107 3
## 109 2
## 113 2
## 114 1
## 118 2
## 120 2
## 121 1
## 123 2
## 136 1
## 142 8
## 147 2
## 148 2
## 151 3
## 154 1
## 155 8
## 172 1
## 175 2
## 176 8
## 178 8
## 179 16
## 182 2
## 184 1
## 186 1
## 190 2
## 209 8
## 236 2
## 340 3
## 359 5
## 372 1
## 381 8
## 436 3
## 445 4
## 454 2
## 462 4
## 472 112
## 504 1
## 516 2
## 519 4
## 525 1
## 527 3
## 528 8
## 531 0
## 532 3
## 540 4
## 544 3
## 548 2
## 553 2
## 558 3
## 560 3
## 561 3
## 567 2
## 568 3
## 569 2
## 587 3
## 590 3
## 597 8
## 600 3
## 612 2
## 615 2
## 625 2
## 627 2
## 638 8
## 655 8
## 659 8
## 668 8
## 672 8
## 697 3
## 706 1
## 731 16
## 734 8
## 0 1 10 11 12 2 3 4 5 6 7 8 9
## 0 50 71 63 49 72 88 54 64 54 68 54 53
## # A tibble: 1 x 2
## ID n_obs
## <fct> <int>
## 1 28 77
Next, let’s look at the first target variable, absenteeism in hours. I broke the target variable down into bins based on the structure of the responses. There were enough values for no callouts to be their own category, while absences could be split into time off of less than one day and time off of one day or more. I chose to group taking a full day off with taking more than one day off because of the nature of callouts. Ultimately, this exercise seeks to find patterns in taking days off to distinguish fraudulent from legitimate time off, and it’s unlikely that someone calling out fraudulently would still work part of the day. This is supported by the reasons for absence. Taking less than one day off is mostly for miscellaneous reasons, like a dentist or doctor’s appointment. In contrast, taking a day or more off is more likely to be attributed to an ICD diagnosis. Finally, there is an extra reason code of ‘zero’ that was not defined in the original data set description. This code is only present when the time taken off is also ‘zero’. The bins are for exploration only, however: for the models, I’ll be creating regression trees that predict the number of hours given an absence.
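A minimal sketch of the three-way binning, using base R’s `cut()`. The bin edges and labels are my assumptions based on the description above (0 hours, under 8 hours, 8 hours or more), and `hours` is a small sample of the values seen in the data rather than the full column.

```r
# A sample of the absence lengths observed in the data, in hours.
hours <- c(0, 1, 2, 3, 4, 8, 16, 24, 120)

# cut() uses right-closed intervals by default:
# (-Inf, 0] -> none, (0, 7] -> partial day, (7, Inf) -> full day or more.
absence_bin <- cut(
  hours,
  breaks = c(-Inf, 0, 7, Inf),
  labels = c("none", "partial day", "full day or more")
)

# Counts per bin.
table(absence_bin)
```

Placing the edge at 7 rather than 8 puts an 8-hour absence in the same bin as multi-day absences, which matches the grouping described in the text.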
For the first variable, I’m using the intended outcome variable of absenteeism in hours. I had to omit the ‘ID’ column because, as a factor with 36 levels, it exceeds the 32-level limit that the ‘tree’ function places on factor predictors.
##
## Regression tree:
## tree(formula = Absenteeism.time.in.hours ~ . - ID, data = train)
## Variables actually used in tree construction:
## [1] "Reason.for.absence" "Height" "Month.of.absence"
## [4] "Weight" "Day.of.the.week" "Social.drinker"
## [7] "Work.load.Average.day"
## Number of terminal nodes: 14
## Residual mean deviance: 81.71 = 47230 / 578
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -51.2000 -1.4250 -0.4247 0.0000 0.5753 73.1100
The decision tree model creates a tree with 14 terminal nodes. The most important splitting variables appear to be the reason for absence, followed by height and month of absence.
## [1] 242.9984
The test performance is considerably worse: the test error is about 243, roughly three times the training residual mean deviance of about 82.
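To put the two numbers on the same scale, the arithmetic can be reproduced from the summary output. This sketch assumes that the 242.9984 figure is a mean squared error on the test set and that the tree’s deviance is the residual sum of squares, as it is for regression trees in the `tree` package; the variable names are mine.

```r
deviance_train <- 47230     # residual sum of squares from the tree summary
n_leaves       <- 14        # terminal nodes
n_train        <- 578 + 14  # residual degrees of freedom + terminal nodes = 592

# The summary divides by (n - leaves); a plain training MSE divides by n.
residual_mean_deviance <- deviance_train / (n_train - n_leaves)
train_mse              <- deviance_train / n_train

test_mse <- 242.9984        # assumed to be the test-set MSE reported above

# How many times larger the test error is than the training error.
test_to_train <- test_mse / train_mse

round(c(residual_mean_deviance, train_mse, test_to_train), 2)
```

Either way the denominator is chosen, the test error is about three times the training error, which is consistent with a single tree overfitting the training set.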
##
## Call:
## randomForest(formula = Absenteeism.time.in.hours ~ . - ID, data = train, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 144.5725
## % Var explained: 14.61
## %IncMSE IncNodePurity
## Reason.for.absence 12.529236 24392.545
## Month.of.absence 3.319310 8470.681
## Day.of.the.week 2.296452 4992.006
## Seasons 4.129389 2823.696
## Transportation.expense 4.894875 3638.832
## Distance.from.Residence.to.Work 3.300605 2624.744
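The random forest does not produce a single smaller tree: it averages 500 trees and considers 3 randomly chosen variables at each split. Its out-of-bag mean of squared residuals (144.6) is higher than the single tree’s training deviance (81.7), but the out-of-bag figure is estimated on observations each tree did not see, so the two are not directly comparable. The most important variable, reason for absence, remains the same, while month of absence ranks second by node purity among the variables shown.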
## [1] 185.727
Test performance is better with the random forest for this outcome: about 186 versus about 243 for the single tree.
The second outcome, the presence or absence of disciplinary action (Disciplinary.failure), is less complicated than the intended target variable.
##
## Classification tree:
## tree(formula = Disciplinary.failure ~ . - ID, data = train)
## Variables actually used in tree construction:
## [1] "Reason.for.absence" "Month.of.absence"
## Number of terminal nodes: 3
## Residual mean deviance: 0.01797 = 10.59 / 589
## Misclassification error rate: 0.005068 = 3 / 592
This decision tree is much simpler, with only 3 terminal nodes. The only variables used are the reason for absence and its month. The training misclassification rate is about 0.5% (3 of 592).
##
## dt2_pred 0 1
## 0 143 0
## 1 0 5
For the test set, all classes are correctly predicted.
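With only five positive cases in the test set, overall accuracy says little about how well the rare class is detected, so it helps to read the per-class rate off the confusion matrix as well. The sketch below uses hypothetical label vectors (not the report’s predictions) to show the calculation in base R.

```r
# Hypothetical labels; the real vectors would come from predict() on the test set.
actual    <- factor(c(0, 0, 0, 1, 0, 1, 0, 0), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 0, 0, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are true classes, as in the output above.
conf_mat <- table(predicted = predicted, actual = actual)

# Overall misclassification rate: share of off-diagonal cells.
error_rate <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

# Recall for the rare class: correctly flagged positives over all true positives.
recall_pos <- conf_mat["1", "1"] / sum(conf_mat[, "1"])
```

In this toy case the error rate is 12.5% while only half of the positives are caught, which is the kind of gap that a single accuracy figure hides.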
##
## Call:
## randomForest(formula = Disciplinary.failure ~ . - ID, data = train, mtry = 3, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.51%
## Confusion matrix:
## 0 1 class.error
## 0 554 3 0.005385996
## 1 0 35 0.000000000
##
## rf2_pred 0 1
## 0 143 0
## 1 0 5
The random forest model performed almost identically to the decision tree for the same outcome variable: an out-of-bag error rate of 0.51% and no errors on the test set.
Looking at the more complicated variable, absenteeism in hours, the decision tree was more elaborate and split repeatedly on the same variables. By contrast, the corresponding random forest, which averages 500 trees, performed better on the test set. A single decision tree appears to overfit the data it is exposed to. For the second variable, a binary yes/no of whether disciplinary action took place, the two models were nearly identical. Since the appeal of tree-based models is their interpretability, the variable importance measures from a random forest provide a quick look at the important variables in a potential model.
One aspect that was not mentioned is that decision trees suffer when there is class imbalance. For instance, disciplinary action was observed in only 40 of the 740 observations, a ratio of 17.5:1 between the classes. As with logistic regression, sampling methods need to compensate for this mismatch if we’re interested in predicting an uncommon event. This may mean that the models for the second variable are not actually as accurate as advertised.
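One simple sampling method is to downsample the majority class before fitting. The sketch below is an illustration rather than part of the analysis: it rebuilds a label vector with the class counts from the summary (700 zeros, 40 ones), and the seed and variable names are arbitrary choices of mine.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Labels with the same class counts as Disciplinary.failure in the summary.
label <- factor(rep(c(0, 1), times = c(700, 40)))

# Class ratio: 700 / 40 = 17.5 majority rows per minority row.
class_counts <- table(label)
ratio <- class_counts[["0"]] / class_counts[["1"]]

# Keep every minority row and an equal-sized random sample of majority rows.
minority_idx <- which(label == 1)
majority_idx <- sample(which(label == 0), length(minority_idx))
balanced_idx <- sort(c(minority_idx, majority_idx))

# Class counts after downsampling.
table(label[balanced_idx])
```

Downsampling discards most of the majority rows, so with only 40 positives the balanced training set is small; upsampling the minority class or using class weights are alternatives worth comparing.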