Update log:
06/02
line 37~47 => count the records of each OutcomeType
line 50~58 => not yet
line 61~168 => Breed
line 174~202 => Age
line 207~290
06/04
line 205~265 => Color
line 465~617 => Random Forest
The data set has 26,729 records and 9 explanatory variables, covering both dogs and cats.
library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
trainInit<-read.csv("train.csv")
head(trainInit)
## AnimalID Name DateTime OutcomeType OutcomeSubtype
## 1 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner
## 2 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering
## 3 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster
## 4 A683430 2014-07-11 19:09:00 Transfer Partner
## 5 A667013 2013-11-15 12:52:00 Transfer Partner
## 6 A677334 Elsa 2014-04-25 13:04:00 Transfer Partner
## AnimalType SexuponOutcome AgeuponOutcome
## 1 Dog Neutered Male 1 year
## 2 Cat Spayed Female 1 year
## 3 Dog Neutered Male 2 years
## 4 Cat Intact Male 3 weeks
## 5 Dog Neutered Male 2 years
## 6 Dog Intact Female 1 month
## Breed Color
## 1 Shetland Sheepdog Mix Brown/White
## 2 Domestic Shorthair Mix Cream Tabby
## 3 Pit Bull Mix Blue/White
## 4 Domestic Shorthair Mix Blue Cream
## 5 Lhasa Apso/Miniature Poodle Tan
## 6 Cairn Terrier/Chihuahua Shorthair Black/Tan
Drop AnimalID, Name, and OutcomeSubtype (columns 1, 2, and 5).
train<-trainInit[,-c(1,2,5)]
head(train)
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male
## 2 2013-10-13 12:44:00 Euthanasia Cat Spayed Female
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 4 2014-07-11 19:09:00 Transfer Cat Intact Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## AgeuponOutcome Breed Color
## 1 1 year Shetland Sheepdog Mix Brown/White
## 2 1 year Domestic Shorthair Mix Cream Tabby
## 3 2 years Pit Bull Mix Blue/White
## 4 3 weeks Domestic Shorthair Mix Blue Cream
## 5 2 years Lhasa Apso/Miniature Poodle Tan
## 6 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
attach(train)
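Split the data into separate dog and cat sets.
# Index with which() on the positive condition in both cases, so an empty match cannot drop every row.
Dogtrain<-train[which(train$AnimalType=="Dog"),]
Cattrain<-train[which(train$AnimalType!="Dog"),]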
head(Dogtrain)
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female
## AgeuponOutcome Breed Color
## 1 1 year Shetland Sheepdog Mix Brown/White
## 3 2 years Pit Bull Mix Blue/White
## 5 2 years Lhasa Apso/Miniature Poodle Tan
## 6 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
## 9 5 months American Pit Bull Terrier Mix Red/White
## 10 1 year Cairn Terrier White
attach(Dogtrain)
n = 15595, p = 8
First, count how many records fall in each OutcomeType.
## Return to owner = 4286
## Transfer = 3917
## Adoption = 6497
## Died = 50
## Euthanasia = 845
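The code that produced these counts is not shown. A minimal sketch of one way to get them with `table()`, assuming a small made-up data frame (`toy`) in place of `Dogtrain`:

```r
# Toy stand-in for Dogtrain; the values are made up for illustration.
toy <- data.frame(
  OutcomeType = c("Adoption", "Transfer", "Adoption", "Died",
                  "Return_to_owner", "Euthanasia", "Adoption"),
  stringsAsFactors = TRUE
)

# Number of records per outcome, and the same counts as proportions.
counts <- table(toy$OutcomeType)
print(counts)
print(round(prop.table(counts), 3))
```

On the real data the same two calls would be applied to `Dogtrain$OutcomeType`.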
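Because there are so few “Died” records, we decided to drop that category. We also renamed Return_to_owner in OutcomeType to Return, because XGBoost later raised an error saying that Return_to_owner exceeds 64 bits.
Relabel every “Died” record as “Euthanasia”.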
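The relabeling code itself is not shown. A minimal sketch of the two renames (Died to Euthanasia, Return_to_owner to Return), assuming a toy factor `outcome` in place of `Dogtrain$OutcomeType`:

```r
# Toy stand-in for Dogtrain$OutcomeType; the values are made up.
outcome <- factor(c("Adoption", "Died", "Return_to_owner",
                    "Euthanasia", "Transfer", "Died"))

# Work on character values, then rebuild the factor so the unused
# "Died" and "Return_to_owner" levels disappear.
relabeled <- as.character(outcome)
relabeled[relabeled == "Died"] <- "Euthanasia"
relabeled[relabeled == "Return_to_owner"] <- "Return"
relabeled <- factor(relabeled)

print(table(relabeled))
```

Rebuilding the factor matters here: leaving empty levels behind would add all-zero rows and columns to the confusion matrices further down.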
Recode AgeuponOutcome into three categories: “Puppy”, “AdultDog”, and “OldDog”.
## DateTime OutcomeType AnimalType SexuponOutcome Color
## 1 2014-02-12 18:22:00 Return Dog Neutered Male Brown/White
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male Blue/White
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male Tan
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female Black/Tan
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female Red/White
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female White
## breed Age
## 1 Other Breed OldDog
## 3 Pit Bull AdultDog
## 5 Miniature Poodle AdultDog
## 6 Cairn Terrier Puppy
## 9 Other Breed Puppy
## 10 Cairn Terrier OldDog
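The recoding rule is not shown, so the following is only a sketch of one way to do it: parse strings such as "3 weeks" into days, then bin them. The helper names (`age_to_days`, `age_group`) and the cut points (under 1 year, 1 to under 7 years, 7 years and up) are assumptions, and they need not reproduce the `Age` values printed above.

```r
# Convert strings such as "3 weeks" or "2 years" into a number of days.
age_to_days <- function(x) {
  parts <- strsplit(as.character(x), " ", fixed = TRUE)
  num   <- suppressWarnings(as.numeric(sapply(parts, `[`, 1)))
  unit  <- sub("s$", "", sapply(parts, `[`, 2))   # "years" -> "year"
  days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)
  num * unname(days_per_unit[unit])
}

# Hypothetical cut points: [0, 1 year), [1 year, 7 years), [7 years, Inf).
age_group <- function(x) {
  cut(age_to_days(x),
      breaks = c(-Inf, 365, 7 * 365, Inf),
      labels = c("Puppy", "AdultDog", "OldDog"),
      right  = FALSE)
}

ages <- c("1 month", "5 months", "2 years", "10 years", "3 weeks", "")
print(data.frame(AgeuponOutcome = ages, Age = age_group(ages)))
```

Empty age strings come out as `NA`, so missing ages stay visible and are not silently assigned to a group.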
Recode each color into one of six categories: “Simple”, “Double”, “Tricolor”, “Brindle”, “Tick”, and “Merle”.
The “Brindle” category also covers colors that combine Brindle with Tick, or Brindle with Merle.
Some colors do not belong to Brindle, such as “Blue Tiger”, “Blue Cream”, and “Smoke”. For these I consider only the colors themselves and classify them as “Simple”, “Double”, or “Tricolor”.
## DateTime OutcomeType AnimalType SexuponOutcome breed
## 1 February Return Dog Neutered Male Other Breed
## 3 January Adoption Dog Neutered Male Pit Bull
## 5 November Transfer Dog Neutered Male Miniature Poodle
## 6 April Transfer Dog Intact Female Cairn Terrier
## 9 February Adoption Dog Spayed Female Other Breed
## 10 May Adoption Dog Spayed Female Cairn Terrier
## Age ColorFix
## 1 OldDog Double
## 3 AdultDog Double
## 5 AdultDog Light
## 6 Puppy Double
## 9 Puppy Double
## 10 OldDog Light
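A rough sketch of the six-category color rule described above, not the code that produced `ColorFix`. It assumes that pattern names are matched by substring, that Brindle takes precedence over Tick and Merle, and that every other color is classified by the number of "/"-separated parts. The function name `classify_color` is hypothetical.

```r
classify_color <- function(color) {
  color <- as.character(color)

  # Count "/"-separated parts: 1 -> Simple, 2 -> Double, 3 or more -> Tricolor.
  n_parts <- lengths(strsplit(color, "/", fixed = TRUE))
  idx <- pmin(n_parts, 3)
  idx[idx == 0] <- NA                      # empty strings stay NA
  by_count <- c("Simple", "Double", "Tricolor")[idx]

  # Pattern names win over the plain count; Brindle is checked first so that
  # "Brindle + Tick" and "Brindle + Merle" both land in Brindle.
  ifelse(grepl("Brindle", color), "Brindle",
  ifelse(grepl("Tick", color), "Tick",
  ifelse(grepl("Merle", color), "Merle", by_count)))
}

colors <- c("Brown/White", "Tan", "Brown Brindle/Blue Tick",
            "Blue Merle", "Black/Tan/White", "Blue Tick")
print(data.frame(Color = colors, ColorFix = classify_color(colors)))
```

Under these assumptions a name such as "Blue Cream" counts as a single part and becomes "Simple"; splitting such names into their component colors would need an extra lookup step.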
Fit a model with an SVM.
Method 1: balance the OutcomeType counts in the training data. For each OutcomeType, sample a randomly chosen number of records between 500 and 600.
## The number of training data is 2197
## The number of Adoption in the training data is 502
## The number of Return in the training data is 578
## The number of Euthanasia in the training data is 588
## The number of Transfer in the training data is 529
## Time difference of 1.488188 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 198 4 40 57
## Euthanasia 4 271 80 105
## Return 298 263 442 230
## Transfer 2 50 16 137
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2167 3 244 463
## Euthanasia 58 103 524 649
## Return 3711 123 2804 1458
## Transfer 59 27 136 818
## Training Error Rate = 0.5229859
## Testing Error Rate = 0.5585525
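A minimal sketch of method 1 on made-up data: draw a random per-class sample size, build a balanced training set, fit `e1071::svm`, and compute the testing error rate. The toy data frame, its two predictors, and the 40~60 sampling range (scaled down from 500~600) are assumptions for illustration, and the SVM settings of the original run are not known.

```r
library(e1071)

set.seed(1)

# Toy stand-in for the dog data: two categorical predictors, unbalanced outcome.
n <- 2000
toy <- data.frame(
  OutcomeType = factor(sample(c("Adoption", "Euthanasia", "Return", "Transfer"),
                              n, replace = TRUE,
                              prob = c(0.45, 0.05, 0.27, 0.23))),
  Age = factor(sample(c("Puppy", "AdultDog", "OldDog"), n, replace = TRUE)),
  Sex = factor(sample(c("Intact", "Fixed"), n, replace = TRUE))
)

# For each class, draw a random size in 40~60 and sample that many rows
# without replacement (capped by the class count).
train_idx <- unname(unlist(
  lapply(split(seq_len(n), toy$OutcomeType), function(rows) {
    size <- min(length(rows), sample(40:60, 1))
    rows[sample.int(length(rows), size)]
  })
))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

fit <- svm(OutcomeType ~ Age + Sex, data = train_set)

# Rows are predictions and columns are the reference, as in the output above.
conf_test <- table(Prediction = predict(fit, test_set),
                   Reference  = test_set$OutcomeType)
print(conf_test)
cat("Testing Error Rate =", 1 - sum(diag(conf_test)) / sum(conf_test), "\n")
```

Because the toy predictors are random noise, the error rate printed here carries no meaning; only the sampling and evaluation steps are of interest.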
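The results are still not very good. We suspect that balancing the classes left too little training data.
Method 2: oversample the Euthanasia records to balance the classes, then select 10,000 records as training data.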
## The number of Adoption in the training data is 3392
## The number of Return in the training data is 2272
## The number of Euthanasia in the training data is 2220
## The number of Transfer in the training data is 2116
## Time difference of 31.11945 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2990 614 1600 1042
## Euthanasia 289 1447 603 605
## Return 0 0 0 0
## Transfer 113 159 69 469
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2759 583 1425 902
## Euthanasia 240 1227 522 507
## Return 0 0 0 0
## Transfer 106 132 67 392
## Training Error Rate = 0.5094
## Testing Error Rate = 0.5059806
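A sketch of oversampling the minority class by resampling its rows with replacement. Two things here are choices of this sketch and not necessarily what the original code does: the target size (the mean size of the other classes), and splitting off the test set before oversampling so that duplicated rows cannot appear on both sides of the split.

```r
set.seed(1)

# Toy stand-in: the minority class is much smaller than the others.
toy <- data.frame(
  OutcomeType = factor(rep(c("Adoption", "Euthanasia", "Return", "Transfer"),
                           times = c(650, 85, 430, 390))),
  x = rnorm(1555)
)

# Hold out a test set first, so duplicated rows stay on the training side.
test_idx  <- sample(nrow(toy), 300)
test_set  <- toy[test_idx, ]
train_set <- toy[-test_idx, ]

# Target size for the minority class: mean count of the other classes.
counts <- table(train_set$OutcomeType)
target <- round(mean(as.vector(counts[names(counts) != "Euthanasia"])))

# Resample minority rows with replacement to make up the difference.
minor_rows <- which(train_set$OutcomeType == "Euthanasia")
extra      <- sample(minor_rows, target - length(minor_rows), replace = TRUE)
balanced   <- rbind(train_set, train_set[extra, ])

print(table(balanced$OutcomeType))
```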
Method 3: oversample the Euthanasia records to balance the classes, and also balance the training data. NOT FIXED YET!
## [1] 6497
## [1] 4286
## [1] 4161
## [1] 3917
## [1] 2170
## [1] 2412
## [1] 2458
## [1] 2112
## Time difference of 26.18733 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 801 23 166 269
## Euthanasia 179 1505 604 535
## Return 1170 699 1554 799
## Transfer 20 231 88 509
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 1564 9 118 251
## Euthanasia 345 1074 468 434
## Return 2377 467 1224 674
## Transfer 41 153 64 446
## Training Error Rate = 0.522618
## Testing Error Rate = 0.556288
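The error rates reported throughout are one minus the share of records on the diagonal of the confusion matrix. As a worked check, the sketch below re-enters the method 3 training matrix from the output above and recomputes its error rate. The per-class recall at the end is an extra that the original output does not report.

```r
# Training confusion matrix of method 3 (rows: prediction, columns: reference).
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
conf_train <- matrix(
  c( 801,   23,  166, 269,
     179, 1505,  604, 535,
    1170,  699, 1554, 799,
      20,  231,   88, 509),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

correct    <- sum(diag(conf_train))   # correctly classified records
total      <- sum(conf_train)         # all records in the matrix
error_rate <- 1 - correct / total

cat("Training Error Rate =", error_rate, "\n")

# Per-class recall: diagonal divided by the column (reference) totals.
print(round(diag(conf_train) / colSums(conf_train), 3))
```

Here 4369 of 9152 records sit on the diagonal, which gives the 0.522618 shown in the output.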
Method 1: 10,000 training records, 600 trees, training data left unadjusted.
## Time difference of 1.504966 mins
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 3748 149 1220 1073
## Euthanasia 0 111 6 7
## Return 342 174 1391 331
## Transfer 67 113 172 1096
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 1959 76 907 615
## Euthanasia 0 13 8 14
## Return 350 118 440 259
## Transfer 31 90 142 522
## Training Error Rate = 0.3654
## Testing Error Rate = 0.4707792
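The "600 trees" setting reads like a `randomForest` fit with `ntree = 600`; that reading is an assumption, since the fitting code is not shown. A minimal sketch on made-up data, with hypothetical predictors `Age` and `Sex`:

```r
library(randomForest)

set.seed(1)

# Toy stand-in for the dog data; puppies are adopted more often by construction.
n <- 600
toy <- data.frame(
  Age = factor(sample(c("Puppy", "AdultDog", "OldDog"), n, replace = TRUE)),
  Sex = factor(sample(c("Intact", "Fixed"), n, replace = TRUE))
)
toy$OutcomeType <- factor(ifelse(
  toy$Age == "Puppy" & runif(n) < 0.7,
  "Adoption",
  sample(c("Return", "Transfer"), n, replace = TRUE)
))

train_idx <- sample(n, 400)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# ntree = 600 follows the note above; all other settings are package defaults.
fit <- randomForest(OutcomeType ~ Age + Sex, data = train_set, ntree = 600)

conf_test <- table(Prediction = predict(fit, test_set),
                   Reference  = test_set$OutcomeType)
print(conf_test)
cat("Testing Error Rate =", 1 - sum(diag(conf_test)) / sum(conf_test), "\n")
```

Calling `predict(fit)` without new data returns out-of-bag predictions, which give a less optimistic training-side estimate than predicting on the training rows themselves.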
Method 2: balance the OutcomeType counts in the training data by sampling, for each OutcomeType, a randomly chosen number of records between 500 and 600. Use 600 trees.
## Time difference of 39.60153 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 503 19 60 88
## Euthanasia 22 417 45 47
## Return 70 73 445 74
## Transfer 5 25 15 305
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 3274 32 945 829
## Euthanasia 440 157 864 748
## Return 1983 77 1662 856
## Transfer 200 44 250 970
## Training Error Rate = 0.2453683
## Testing Error Rate = 0.5451954
Method 3: 10,000 training records, 600 trees. Randomly select 10,000 training records, then balance the OutcomeType counts within the training data.
## The number of Adoption in the training data is 4159
## The number of Return in the training data is 4212
## The number of Euthanasia in the training data is 4269
## The number of Transfer in the training data is 4135
## Time difference of 1.631365 mins
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2553 106 493 791
## Euthanasia 370 3698 750 578
## Return 1157 312 2771 888
## Transfer 79 153 198 1878
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 1182 15 314 340
## Euthanasia 208 136 324 252
## Return 878 80 766 376
## Transfer 70 59 112 432
## Training Error Rate = 0.3502235
## Testing Error Rate = 0.546176
https://www.kaggle.com/apapiu/visualizing-breeds-and-ages-by-outcome/code/notebook
https://www.kaggle.com/mrisdal/quick-dirty-randomforest/code
https://www.kaggle.com/fsmithus/reduced-model
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://cran.r-project.org/web/packages/xgboost/xgboost.pdf