Update log:
06/02
line 37~47 => count the records of each OutcomeType
line 50~58 => not yet
line 61~168 => Breed
line 174~202 => Age
line 207~290
06/04
line 205~265 => Color
line 465~617 => Random Forest
Our data set contains 26,729 records and 9 explanatory variables.
library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
library(gridExtra)
trainInit<-read.csv("train.csv")
head(trainInit)
## AnimalID Name DateTime OutcomeType OutcomeSubtype
## 1 A671945 Hambone 2014-02-12 18:22:00 Return_to_owner
## 2 A656520 Emily 2013-10-13 12:44:00 Euthanasia Suffering
## 3 A686464 Pearce 2015-01-31 12:28:00 Adoption Foster
## 4 A683430 2014-07-11 19:09:00 Transfer Partner
## 5 A667013 2013-11-15 12:52:00 Transfer Partner
## 6 A677334 Elsa 2014-04-25 13:04:00 Transfer Partner
## AnimalType SexuponOutcome AgeuponOutcome
## 1 Dog Neutered Male 1 year
## 2 Cat Spayed Female 1 year
## 3 Dog Neutered Male 2 years
## 4 Cat Intact Male 3 weeks
## 5 Dog Neutered Male 2 years
## 6 Dog Intact Female 1 month
## Breed Color
## 1 Shetland Sheepdog Mix Brown/White
## 2 Domestic Shorthair Mix Cream Tabby
## 3 Pit Bull Mix Blue/White
## 4 Domestic Shorthair Mix Blue Cream
## 5 Lhasa Apso/Miniature Poodle Tan
## 6 Cairn Terrier/Chihuahua Shorthair Black/Tan
Remove the three variables "AnimalID", "Name", and "OutcomeSubtype" (columns 1, 2, and 5).
Then separate the dog data from the cat data:
train<-trainInit[,-c(1,2,5)]
head(train)
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male
## 2 2013-10-13 12:44:00 Euthanasia Cat Spayed Female
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 4 2014-07-11 19:09:00 Transfer Cat Intact Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## AgeuponOutcome Breed Color
## 1 1 year Shetland Sheepdog Mix Brown/White
## 2 1 year Domestic Shorthair Mix Cream Tabby
## 3 2 years Pit Bull Mix Blue/White
## 4 3 weeks Domestic Shorthair Mix Blue Cream
## 5 2 years Lhasa Apso/Miniature Poodle Tan
## 6 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
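# Split by species with logical indexing; this avoids attach() and the
# pitfall that train[-which(cond),] returns no rows when nothing matches.
isDog<-train$AnimalType=="Dog"
Dogtrain<-train[isDog,]
Cattrain<-train[!isDog,]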
head(Dogtrain)
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner Dog Neutered Male
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female
## AgeuponOutcome Breed Color
## 1 1 year Shetland Sheepdog Mix Brown/White
## 3 2 years Pit Bull Mix Blue/White
## 5 2 years Lhasa Apso/Miniature Poodle Tan
## 6 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
## 9 5 months American Pit Bull Terrier Mix Red/White
## 10 1 year Cairn Terrier White
attach(Dogtrain)
The dog data contain 15,595 records and 8 explanatory variables.
First, count the records for each OutcomeType.
## Return to owner = 4286
## Transfer = 3917
## Adoption = 6497
## Died = 50
## Euthanasia = 845
Because "Died" has too few records, we decided to drop it.
We also rename the OutcomeType level "Return_to_owner" to "Return",
because the later XGBoost step failed with an error saying that "Return_to_owner" exceeds 64 bits.
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return Dog Neutered Male
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female
## AgeuponOutcome Breed Color
## 1 1 year Shetland Sheepdog Mix Brown/White
## 3 2 years Pit Bull Mix Blue/White
## 5 2 years Lhasa Apso/Miniature Poodle Tan
## 6 1 month Cairn Terrier/Chihuahua Shorthair Black/Tan
## 9 5 months American Pit Bull Terrier Mix Red/White
## 10 1 year Cairn Terrier White
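The code for this step is not shown in the report. A minimal sketch of the counting, dropping, and renaming described above, assuming OutcomeType is stored as a factor; the toy data frame is a hypothetical stand-in for the real Dogtrain:

```r
# Toy stand-in for the dog data (hypothetical rows; the real Dogtrain is much larger).
Dogtrain <- data.frame(
  OutcomeType = c("Return_to_owner", "Adoption", "Transfer",
                  "Died", "Euthanasia", "Adoption"),
  stringsAsFactors = TRUE
)

# Number of records per outcome.
table(Dogtrain$OutcomeType)

# Drop the rare "Died" class.
Dogtrain <- Dogtrain[Dogtrain$OutcomeType != "Died", , drop = FALSE]

# Rename the level, then discard the now-empty "Died" level.
lv <- levels(Dogtrain$OutcomeType)
levels(Dogtrain$OutcomeType)[lv == "Return_to_owner"] <- "Return"
Dogtrain$OutcomeType <- droplevels(Dogtrain$OutcomeType)

levels(Dogtrain$OutcomeType)
```

Calling droplevels() matters here: without it the empty "Died" level would still appear as an all-zero row or column in later tables.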
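Because Breed has too many distinct values, we first plot how often each breed occurs,
then keep the 30 most frequent breeds and group all the others into "Other Breed" to simplify the analysis.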
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return Dog Neutered Male
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female
## AgeuponOutcome Color breed
## 1 1 year Brown/White Other Breed
## 3 2 years Blue/White Pit Bull
## 5 2 years Tan Miniature Poodle
## 6 1 month Black/Tan Cairn Terrier
## 9 5 months Red/White Other Breed
## 10 1 year White Cairn Terrier
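The grouping code is not shown. The output above suggests that the "Mix" suffix is stripped and that crossbreeds are matched to one of their component breeds; that rule is not reproduced here. The sketch below only illustrates the plain "keep the top n, lump the rest" idea on exact breed names. The function name lump_top_n and the toy vector are hypothetical.

```r
# Keep the n most frequent values of x and replace all others with `other`.
lump_top_n <- function(x, n, other = "Other Breed") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(x %in% keep, x, other))
}

# Toy example with n = 2 (the report uses n = 30).
breeds <- c("Pit Bull", "Pit Bull", "Pit Bull",
            "Cairn Terrier", "Cairn Terrier",
            "Shetland Sheepdog", "Lhasa Apso")
breed <- lump_top_n(breeds, n = 2)
table(breed)
```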
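Because dog ages are spread very widely, from newborn up to 19 years, with some recorded in weeks or months,
we group them into three classes for analysis: puppies (under 1 year), adult dogs (1 to 7 years), and old dogs (over 7 years).
Note that the rows with "1 year" in the output below are labeled OldDog, which does not match this rule.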
## DateTime OutcomeType AnimalType SexuponOutcome Color
## 1 2014-02-12 18:22:00 Return Dog Neutered Male Brown/White
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male Blue/White
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male Tan
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female Black/Tan
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female Red/White
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female White
## breed Age
## 1 Other Breed OldDog
## 3 Pit Bull AdultDog
## 5 Miniature Poodle AdultDog
## 6 Cairn Terrier Puppy
## 9 Other Breed Puppy
## 10 Cairn Terrier OldDog
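The conversion code is not shown. One way to implement the stated rule is to turn each "number unit" string into years and then cut at 1 and 7. This is a sketch under assumptions: the helper names are hypothetical, exactly 7 years is counted as OldDog, and the day, week, and month conversions are approximate.

```r
# Convert strings such as "3 weeks" or "1 year" to (approximate) years.
age_in_years <- function(age) {
  parts <- strsplit(as.character(age), " ", fixed = TRUE)
  value <- as.numeric(vapply(parts, function(p) p[1], character(1)))
  unit <- vapply(parts,
                 function(p) if (length(p) >= 2) p[2] else NA_character_,
                 character(1))
  unit <- sub("s$", "", unit)  # "years" -> "year"
  per_year <- c(day = 365, week = 52, month = 12, year = 1)
  value / unname(per_year[unit])
}

# Puppy: [0, 1) years, AdultDog: [1, 7) years, OldDog: 7 years and over.
age_group <- function(age) {
  cut(age_in_years(age), breaks = c(-Inf, 1, 7, Inf), right = FALSE,
      labels = c("Puppy", "AdultDog", "OldDog"))
}

age_group(c("1 year", "2 years", "3 weeks", "5 months", "7 years", "19 years"))
```

With this rule "1 year" falls into AdultDog, and an empty age string yields NA.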
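Because Color has too many distinct values, some of which also describe features such as spots and birthmarks,
we first group them into six broad classes: solid, two-color, three-color, spotted, patched, and birthmarked.
Next, we want to split colors further into dark and light to see whether that helps the analysis.
Again we plot first, and find that colors ranked below the top 20 occur too rarely.
So from the 20 most frequent values we pick out the plain "colors",
divide them into dark and light, and use them to subdivide the solid class defined earlier.
In the end we have eight classes: dark, light, other solid, two-color, three-color, spotted, patched, and birthmarked.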
## DateTime OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return Dog Neutered Male
## 3 2015-01-31 12:28:00 Adoption Dog Neutered Male
## 5 2013-11-15 12:52:00 Transfer Dog Neutered Male
## 6 2014-04-25 13:04:00 Transfer Dog Intact Female
## 9 2014-02-04 17:17:00 Adoption Dog Spayed Female
## 10 2014-05-03 07:48:00 Adoption Dog Spayed Female
## breed Age ColorFix
## 1 Other Breed OldDog Double
## 3 Pit Bull AdultDog Double
## 5 Miniature Poodle AdultDog Light
## 6 Cairn Terrier Puppy Double
## 9 Other Breed Puppy Double
## 10 Cairn Terrier OldDog Light
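The classification code is not shown, so the following is only a rough sketch of part of the scheme. The labels "Double" and "Light" appear in the output above; "Triple", "Dark", and "OtherSolid", the two color lists, and the use of the keyword "Tricolor" are assumptions made for illustration. The spotted, patched, and birthmarked classes are left out because the report does not say which Color values map to them.

```r
# Hypothetical dark/light lists; the report derives its own from the top 20 colors.
dark_colors  <- c("Black", "Brown", "Chocolate")
light_colors <- c("White", "Tan", "Cream")

classify_color <- function(color) {
  color <- as.character(color)
  out <- ifelse(grepl("Tricolor", color, fixed = TRUE), "Triple",
         ifelse(grepl("/", color, fixed = TRUE), "Double",
         ifelse(color %in% dark_colors, "Dark",
         ifelse(color %in% light_colors, "Light", "OtherSolid"))))
  factor(out)
}

classify_color(c("Brown/White", "Tan", "White", "Black", "Tricolor", "Red"))
```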
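Because the date range is large, and neither longer units (years) nor shorter ones (weeks, days, and so on)
seem likely to tell us much about how animals move through the shelter,
we reduce DateTime to the 12 months of the year and analyze those.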
## DateTime OutcomeType AnimalType SexuponOutcome breed
## 1 February Return Dog Neutered Male Other Breed
## 3 January Adoption Dog Neutered Male Pit Bull
## 5 November Transfer Dog Neutered Male Miniature Poodle
## 6 April Transfer Dog Intact Female Cairn Terrier
## 9 February Adoption Dog Spayed Female Other Breed
## 10 May Adoption Dog Spayed Female Cairn Terrier
## Age ColorFix
## 1 OldDog Double
## 3 AdultDog Double
## 5 AdultDog Light
## 6 Puppy Double
## 9 Puppy Double
## 10 OldDog Light
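The conversion code is not shown. A minimal sketch, assuming DateTime is stored as "YYYY-MM-DD HH:MM:SS" text; the function name to_month is hypothetical. It uses month.name rather than months() so that the labels stay in English regardless of the system locale.

```r
# Map a timestamp string to its month name, as an ordered-by-calendar factor.
to_month <- function(datetime) {
  t <- as.POSIXlt(as.character(datetime), tz = "UTC",
                  format = "%Y-%m-%d %H:%M:%S")
  factor(month.name[t$mon + 1], levels = month.name)
}

to_month(c("2014-02-12 18:22:00", "2015-01-31 12:28:00", "2013-11-15 12:52:00"))
```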
Randomly select 10,000 records as the training data.
## Time difference of 28.20861 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 4104 255 2299 1385
## Euthanasia 0 0 0 0
## Return 19 148 250 226
## Transfer 56 162 232 864
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2270 138 1231 823
## Euthanasia 0 0 0 0
## Return 10 61 133 120
## Transfer 38 80 141 499
## Training Error Rate = 0.4782
## Testing Error Rate = 0.4765512
## The number of Adoption in the training data is 4179
## The number of Return to owner in the training data is 2781
## The number of Euthanasia in the training data is 565
## The number of Transfer in the training data is 2475
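The model-fitting code is not shown, but the split and the error rate can be sketched in base R. The split below uses a toy size and an arbitrary seed (the report draws 10,000 rows); the matrix is the first training confusion matrix reported above, and error_rate is a hypothetical helper name.

```r
# Random train/test split by row index (toy size for illustration).
set.seed(1)
n <- 20
train_idx <- sample(n, size = 12)
test_idx <- setdiff(seq_len(n), train_idx)

# Error rate = share of off-diagonal counts in a confusion matrix.
error_rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Training confusion matrix from above (rows = prediction, columns = reference).
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
cm_train <- matrix(
  c(4104, 255, 2299, 1385,
       0,   0,    0,    0,
      19, 148,  250,  226,
      56, 162,  232,  864),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

error_rate(cm_train)  # 0.4782, matching the reported training error rate
colSums(cm_train)     # class sizes in the training data
```

The column sums reproduce the class counts printed above (4179, 565, 2781, 2475), and the all-zero Euthanasia row shows that this model never predicts that class.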
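Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, we top up the Euthanasia records in the training data until their count is within about 100 to 200 of the smallest other OutcomeType class, without exceeding it.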
## Time difference of 38.38219 secs
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 3772 661 1974 1317
## Euthanasia 41 1045 397 500
## Return 289 444 316 150
## Transfer 41 209 81 584
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2140 94 1088 672
## Euthanasia 23 138 216 261
## Return 173 45 155 91
## Transfer 18 29 59 342
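## Training Error Rate = 0.6104
Note: 0.6104 is the 6104 misclassified records divided by 10000, but the balanced training set in the matrix above has 11821 records, so the training error rate is 6104/11821 ≈ 0.5164.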
## Testing Error Rate = 0.4994589
## The number of Adoption in the training data is 4143
## The number of Euthanasia in the training data is 2359
## The number of Return to owner in the training data is 2768
## The number of Transfer in the training data is 2551
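The report does not say how the Euthanasia records are topped up. One common approach, assumed here, is to resample existing minority rows with replacement until the class is a fixed gap below the smallest other class. The function name oversample_class, the gap of 150, and the toy data are all hypothetical.

```r
# Duplicate random rows of `minority` until its count is `gap` below the
# smallest other class. Rows are sampled with replacement.
oversample_class <- function(df, class_col, minority, gap = 150) {
  counts <- table(df[[class_col]])
  stopifnot(minority %in% names(counts))
  others <- counts[names(counts) != minority & counts > 0]
  target <- max(min(others) - gap, counts[[minority]])
  n_extra <- target - counts[[minority]]
  if (n_extra <= 0) return(df)
  pool <- which(df[[class_col]] == minority)
  extra <- pool[sample.int(length(pool), n_extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), extra), , drop = FALSE]
}

# Toy data: Euthanasia is far smaller than the other classes.
set.seed(1)
toy <- data.frame(
  OutcomeType = rep(c("Adoption", "Return", "Transfer", "Euthanasia"),
                    times = c(400, 300, 250, 50))
)
bal <- oversample_class(toy, "OutcomeType", "Euthanasia", gap = 150)
table(bal$OutcomeType)
```

Only the training rows should be resampled; the test data keep their original class proportions.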
Randomly select 10,000 records as the training data.
## Time difference of 1.304458 mins
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 3852 163 1346 1138
## Euthanasia 0 107 7 3
## Return 259 149 1214 269
## Transfer 63 135 177 1118
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 2020 93 985 634
## Euthanasia 2 10 14 8
## Return 259 96 373 195
## Transfer 42 91 170 552
## Training Error Rate = 0.3709
## Testing Error Rate = 0.4669913
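Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, we top up the Euthanasia records in the training data until their count is within about 100 to 200 of the smallest other OutcomeType class, without exceeding it.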
## Time difference of 1.333776 mins
## The confusion Matrix of training data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 3803 315 1286 1077
## Euthanasia 166 1879 354 270
## Return 181 66 982 170
## Transfer 54 70 136 967
## The confusion Matrix of testing data.
## Reference
## Prediction Adoption Euthanasia Return Transfer
## Adoption 1918 78 914 647
## Euthanasia 114 113 238 189
## Return 232 46 262 129
## Transfer 29 53 114 468
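## Training Error Rate = 0.4145
Note: 0.4145 is the 4145 misclassified records divided by 10000, but the balanced training set in the matrix above has 11776 records, so the training error rate is 4145/11776 ≈ 0.3520.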
## Testing Error Rate = 0.5019841
## The number of Adoption in the training data is 4204
## The number of Return to owner in the training data is 2758
## The number of Euthanasia in the training data is 2330
## The number of Transfer in the training data is 2484
https://www.kaggle.com/apapiu/visualizing-breeds-and-ages-by-outcome/code/notebook
https://www.kaggle.com/mrisdal/quick-dirty-randomforest/code
https://www.kaggle.com/fsmithus/reduced-model
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://cran.r-project.org/web/packages/xgboost/xgboost.pdf