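Overview of the data
Our dataset contains 26,729 records in 10 columns.
It covers 15,595 dogs and 11,134 cats (the AnimalType column),
plus nine further columns: "AnimalID", "Name", "DateTime", "OutcomeType", "OutcomeSubtype",
"SexuponOutcome", "AgeuponOutcome", "Breed", and "Color".
OutcomeType is our response variable, and OutcomeSubtype is its subcategory.
Levels (and number of levels) of each factor variable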
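As a minimal sketch of how the levels of each variable could be listed, the block below rebuilds three columns from the six rows that head(trainInit) prints in this report and counts their levels. The data frame name toy is hypothetical, and on the full data the counts would of course be larger.

```r
# Toy data: three columns from the six rows printed by head(trainInit).
toy <- data.frame(
  OutcomeType    = c("Return_to_owner", "Euthanasia", "Adoption",
                     "Transfer", "Transfer", "Transfer"),
  AnimalType     = c("Dog", "Cat", "Dog", "Cat", "Dog", "Dog"),
  SexuponOutcome = c("Neutered Male", "Spayed Female", "Neutered Male",
                     "Intact Male", "Neutered Male", "Intact Female"),
  stringsAsFactors = TRUE
)

# Number of distinct levels in each factor column.
n_levels <- sapply(toy, nlevels)
print(n_levels)

# The levels themselves, with how often each one occurs.
level_counts <- lapply(toy, table)
print(level_counts$AnimalType)
```

Applied to trainInit instead of toy, the same two calls would show which columns (Breed and Color in particular) have too many levels to model directly.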
library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
library(gridExtra)
library(data.table)
trainInit<-read.csv("train.csv")
head(trainInit)
##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430         2014-07-11 19:09:00        Transfer        Partner
## 5  A667013         2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan
# Drop AnimalID (column 1), Name (2) and OutcomeSubtype (5)
train <- trainInit[, -c(1, 2, 5)]
head(train)
##              DateTime     OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 2 2013-10-13 12:44:00      Euthanasia        Cat  Spayed Female
## 3 2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 4 2014-07-11 19:09:00        Transfer        Cat    Intact Male
## 5 2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6 2014-04-25 13:04:00        Transfer        Dog  Intact Female
##   AgeuponOutcome                             Breed       Color
## 1         1 year             Shetland Sheepdog Mix Brown/White
## 2         1 year            Domestic Shorthair Mix Cream Tabby
## 3        2 years                      Pit Bull Mix  Blue/White
## 4        3 weeks            Domestic Shorthair Mix  Blue Cream
## 5        2 years       Lhasa Apso/Miniature Poodle         Tan
## 6        1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
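# Split by species. Indexing through train$AnimalType avoids attach(),
# and != "Dog" avoids -which(), which would drop every row if nothing matched.
Dogtrain <- train[which(train$AnimalType == "Dog"), ]
Cattrain <- train[which(train$AnimalType != "Dog"), ]
head(Dogtrain)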
##               DateTime     OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 3  2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00        Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00        Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00        Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White
attach(Dogtrain)
## Return to owner = 4286
## Transfer = 3917
## Adoption = 6497
## Died = 50
## Euthanasia = 845
##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White
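Because the dog Breed variable has too many levels, we first plot how often each breed occurs.
We then keep the 30 most frequent breeds and group all the rest as Other Breed to simplify the analysis.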
##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##    AgeuponOutcome       Color            breed
## 1          1 year Brown/White      Other Breed
## 3         2 years  Blue/White         Pit Bull
## 5         2 years         Tan Miniature Poodle
## 6         1 month   Black/Tan    Cairn Terrier
## 9        5 months   Red/White      Other Breed
## 10         1 year       White    Cairn Terrier
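The breed grouping described above could be sketched roughly as follows. The helper lump_top and the toy vector breed_raw are hypothetical, and stripping the " Mix" suffix is an assumption suggested by the output (for example "Pit Bull Mix" becoming "Pit Bull"). How the report resolves cross-breeds such as "Lhasa Apso/Miniature Poodle" is not shown, so this sketch does not attempt it.

```r
# Keep the n most frequent values of a character vector; relabel the rest.
lump_top <- function(x, n = 30, other = "Other Breed") {
  freq <- sort(table(x), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(n, length(freq)))]
  ifelse(x %in% keep, x, other)
}

# Toy input standing in for Dogtrain$Breed.
breed_raw <- c("Pit Bull Mix", "Pit Bull Mix", "Pit Bull", "Cairn Terrier",
               "Cairn Terrier", "Shetland Sheepdog Mix")

# Assumption: " Mix" is dropped so "Pit Bull Mix" and "Pit Bull" count together.
breed_clean <- sub(" Mix$", "", breed_raw)

# n = 2 only because the toy vector is tiny; the report keeps the top 30.
breed <- lump_top(breed_clean, n = 2)
print(table(breed))
```

Lumping rare levels this way keeps the factor small enough for the classifiers, at the cost of treating all uncommon breeds as one group.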
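Dog ages are spread too widely, from newborn up to 19 years, with many values given in weeks or months, so we group them roughly into:
1. Puppy (under 1 year)
2. AdultDog (1-7 years)
3. OldDog (7 years and older)
to simplify the analysis.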
##               DateTime OutcomeType AnimalType SexuponOutcome       Color
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male Brown/White
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male  Blue/White
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male         Tan
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female   Black/Tan
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female   Red/White
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female       White
##               breed      Age
## 1       Other Breed   OldDog
## 3          Pit Bull AdultDog
## 5  Miniature Poodle AdultDog
## 6     Cairn Terrier    Puppy
## 9       Other Breed    Puppy
## 10    Cairn Terrier   OldDog
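One way to implement the three age groups as stated (under 1 year, 1-7 years, 7 years and older) is to convert every AgeuponOutcome string to years and then bin it. The function age_group is a hypothetical helper, the conversion factors (52 weeks, 12 months per year) are approximations, and placing exactly 7 years in OldDog is an assumption. Under these thresholds an age of exactly "1 year" falls in AdultDog.

```r
# Convert strings such as "3 weeks" or "2 years" to years, then bin them.
age_group <- function(age) {
  parts <- strsplit(as.character(age), " ", fixed = TRUE)
  num   <- as.numeric(vapply(parts, function(p) p[1], character(1)))
  unit  <- sub("s$", "", vapply(parts, function(p) p[2], character(1)))
  per_year <- c(day = 365, week = 52, month = 12, year = 1)
  years <- num / unname(per_year[unit])   # NA for missing or unknown units
  # Intervals are closed on the left: [0, 1), [1, 7), [7, Inf).
  cut(years, breaks = c(-Inf, 1, 7, Inf), right = FALSE,
      labels = c("Puppy", "AdultDog", "OldDog"))
}

ages <- c("1 year", "2 years", "3 weeks", "1 month", "5 months", "10 years")
groups <- age_group(ages)
print(data.frame(AgeuponOutcome = ages, Age = groups))
```

Working from a numeric age rather than matching on the raw strings avoids accidental overlaps between values such as "1 year" and "11 years".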
##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##               breed      Age ColorFix
## 1       Other Breed   OldDog   Double
## 3          Pit Bull AdultDog   Double
## 5  Miniature Poodle AdultDog    Light
## 6     Cairn Terrier    Puppy   Double
## 9       Other Breed    Puppy   Double
## 10    Cairn Terrier   OldDog    Light
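The date range is wide, and neither long spans (years) nor short ones (weeks, days, and so on)
are likely to reveal much about how outcomes at the shelter change over time,
so we reduce DateTime to the 12 calendar months and analyze by month.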
##    DateTime OutcomeType AnimalType SexuponOutcome            breed
## 1  February      Return        Dog  Neutered Male      Other Breed
## 3   January    Adoption        Dog  Neutered Male         Pit Bull
## 5  November    Transfer        Dog  Neutered Male Miniature Poodle
## 6     April    Transfer        Dog  Intact Female    Cairn Terrier
## 9  February    Adoption        Dog  Spayed Female      Other Breed
## 10      May    Adoption        Dog  Spayed Female    Cairn Terrier
##         Age ColorFix
## 1    OldDog   Double
## 3  AdultDog   Double
## 5  AdultDog    Light
## 6     Puppy   Double
## 9     Puppy   Double
## 10   OldDog    Light
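The month recoding shown above can be sketched as below. The helper to_month is hypothetical. It uses the built-in month.name constant rather than months(), so the labels are English regardless of locale, and it assumes timestamps in the "YYYY-MM-DD HH:MM:SS" layout seen in the data.

```r
# Map "YYYY-MM-DD HH:MM:SS" timestamps to an ordered month factor.
to_month <- function(datetime) {
  lt <- as.POSIXlt(as.character(datetime), tz = "UTC",
                   format = "%Y-%m-%d %H:%M:%S")
  # lt$mon counts months from 0 (January) to 11 (December).
  factor(month.name[lt$mon + 1], levels = month.name)
}

# The first four dog timestamps from the report.
dt <- c("2014-02-12 18:22:00", "2015-01-31 12:28:00",
        "2013-11-15 12:52:00", "2014-04-25 13:04:00")
month <- to_month(dt)
print(month)
```

Fixing the factor levels to month.name keeps all 12 months in calendar order even if some month is absent from a subset.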
Randomly select 10,000 records as the training data.
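A random split of this kind could look like the sketch below. The function split_train_test, the seed, and the stand-in data frame are all hypothetical. It assumes the rows that are not drawn become the testing data, which is consistent with the separate training and testing confusion matrices in this report.

```r
# Draw n_train rows at random for training; the remaining rows form the test set.
split_train_test <- function(data, n_train = 10000) {
  stopifnot(n_train < nrow(data))
  idx <- sample(nrow(data), n_train)   # row numbers drawn without replacement
  list(train = data[idx, , drop = FALSE],
       test  = data[-idx, , drop = FALSE])
}

set.seed(2016)                          # hypothetical seed, for reproducibility only
dogs  <- data.frame(id = seq_len(15000))  # stand-in for the prepared dog data
parts <- split_train_test(dogs)
cat(nrow(parts$train), nrow(parts$test), "\n")
```

Setting a seed before sampling makes the split, and therefore the reported error rates, repeatable between runs.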
## Time difference of 27.35156 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       4095        257   2320     1385
##   Euthanasia        0          0      0        0
##   Return           15        145    242      210
##   Transfer         62        149    240      880
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2279        136   1210      823
##   Euthanasia        0          0      0        0
##   Return           14         64    141      136
##   Transfer         32         93    133      483
## Training Error Rate = 0.4783
## Testing Error Rate = 0.4763709
## The number of Adoption in the training data is 4172
## The number of Return to owner in the training data is 2802
## The number of Euthanasia in the training data is 551
## The number of Transfer in the training data is 2475
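The error rates reported above are the share of off-diagonal entries in each confusion matrix. The sketch below recomputes the testing error rate from the testing confusion matrix printed above. The names test_cm and error_rate are hypothetical, and the per-class recall is an extra diagnostic that the report does not print.

```r
# Rows are predictions and columns are reference labels, as in the printed matrix.
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
test_cm <- matrix(c(2279, 136, 1210, 823,
                       0,   0,    0,   0,
                      14,  64,  141, 136,
                      32,  93,  133, 483),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(Prediction = classes, Reference = classes))

# Misclassification rate: 1 minus the share of cases on the diagonal.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)
test_error <- error_rate(test_cm)
print(test_error)

# Per-class recall: share of each reference class that is predicted correctly.
recall <- diag(test_cm) / colSums(test_cm)
print(round(recall, 3))
```

The recall for Euthanasia is 0 because this model never predicts that class, which is the imbalance the next experiment tries to address.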
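Randomly select 10,000 records as the training data,
then make the training data somewhat more balanced.
That is, we add Euthanasia records to the training data
until their count is within 100-200 of the smallest other OutcomeType, without exceeding it.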
## Time difference of 37.08312 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       4122       1109   2303     1405
##   Euthanasia       43        982    367      423
##   Return            0          2      6       24
##   Transfer         39        200    102      630
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2252        139   1227      803
##   Euthanasia       19        136    224      281
##   Return            1          4      3        9
##   Transfer         21         29     54      342
## Training Error Rate = 0.5118
## Testing Error Rate = 0.5070346
## The number of Adoption in the training data is 4204
## The number of Euthanasia in the training data is 2293
## The number of Return to owner in the training data is 2778
## The number of Transfer in the training data is 2482
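The balancing step described above could be sketched as follows. The report does not say how the extra Euthanasia rows were produced, so resampling existing rows with replacement is an assumption, as are the helper name oversample_class and the gap of 150 (chosen to sit inside the stated 100-200 range). The toy data reuses the class counts of the first, unbalanced training set.

```r
# Duplicate minority-class rows until the class sits `gap` rows below
# the smallest of the other classes.
oversample_class <- function(train, minority = "Euthanasia", gap = 150,
                             outcome = "OutcomeType") {
  y <- as.character(train[[outcome]])
  counts  <- table(y)
  target  <- min(counts[names(counts) != minority]) - gap
  n_extra <- target - counts[[minority]]
  if (n_extra <= 0) return(train)        # already close enough
  idx   <- which(y == minority)
  extra <- idx[sample.int(length(idx), n_extra, replace = TRUE)]
  rbind(train, train[extra, , drop = FALSE])
}

set.seed(2016)   # hypothetical seed
toy <- data.frame(OutcomeType = rep(
  c("Adoption", "Return", "Euthanasia", "Transfer"),
  times = c(4172, 2802, 551, 2475)))
balanced <- oversample_class(toy)
print(table(balanced$OutcomeType))
```

Only the training data is resampled here. Because the balanced training set has more than 10,000 rows, a training error rate should be divided by the balanced row count rather than by 10,000.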
Randomly select 10,000 records as the training data.
## Time difference of 1.302124 mins
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3788        161   1241     1118
##   Euthanasia        0        135     10       10
##   Return          337        156   1369      295
##   Transfer         61        112    145     1062
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1944         86    870      612
##   Euthanasia        0         15     28       18
##   Return          334         83    479      264
##   Transfer         33         96    144      538
## Training Error Rate = 0.3646
## Testing Error Rate = 0.4632
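Randomly select 10,000 records as the training data,
then make the training data somewhat more balanced.
That is, we add Euthanasia records to the training data
until their count is within 100-200 of the smallest other OutcomeType, without exceeding it.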
## Time difference of 1.285407 mins
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3804        317   1326     1074
##   Euthanasia      143       1862    371      284
##   Return          199         64    936      203
##   Transfer         50         93    119      955
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1936         93    941      651
##   Euthanasia      113        117    259      191
##   Return          222         35    217      108
##   Transfer         30         63    117      451
## Training Error Rate = 0.3596
## Testing Error Rate = 0.5092
## The number of Adoption in the training data is 4196
## The number of Return to owner in the training data is 2752
## The number of Euthanasia in the training data is 2336
## The number of Transfer in the training data is 2516
https://www.kaggle.com/apapiu/visualizing-breeds-and-ages-by-outcome/code/notebook
https://www.kaggle.com/mrisdal/quick-dirty-randomforest/code
https://www.kaggle.com/fsmithus/reduced-model
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://cran.r-project.org/web/packages/xgboost/xgboost.pdf