\[Problem\]

\[Data Introduction\]

\[Summary of Results\]

SVM

\[ Training Error = 1.360251 \]
\[ Testing Error = 1.471125 \]

Random Forest

\[ Training Error Rate = 0.7314072 \]
\[ Testing Error Rate = 1.908389 \]

XGBoost

\[ Training Error Rate = 0.9740993 \]
\[ Testing Error Rate = 1.037753 \]

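  • Based on the results above, XGBoost performs best on our data: its testing error (1.037753) is the lowest of the three methods.
  • These values can exceed 1, and the XGBoost figures match its mlogloss output, so the reported errors appear to be multiclass log loss rather than misclassification rates.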

\[Data Processing\]

library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
library(data.table)
library(xgboost)
library(Matrix)
library(formattable)
trainInit<-read.csv("train.csv")
head(trainInit)
##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner               
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430         2014-07-11 19:09:00        Transfer        Partner
## 5  A667013         2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan
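# Drop AnimalID, Name and OutcomeSubtype (columns 1, 2 and 5)
train<-trainInit[,-c(1,2,5)]
attach(train)
# Split by species; only the dog records are analysed below
Dogtrain<-train[which(AnimalType=="Dog"),]
Cattrain<-train[-which(AnimalType=="Dog"),]
# AnimalType (column 3) is now constant, so drop it
Dogtrain<-Dogtrain[,-3]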
head(Dogtrain)
##               DateTime     OutcomeType SexuponOutcome AgeuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner  Neutered Male         1 year
## 3  2015-01-31 12:28:00        Adoption  Neutered Male        2 years
## 5  2013-11-15 12:52:00        Transfer  Neutered Male        2 years
## 6  2014-04-25 13:04:00        Transfer  Intact Female        1 month
## 9  2014-02-04 17:17:00        Adoption  Spayed Female       5 months
## 10 2014-05-03 07:48:00        Adoption  Spayed Female         1 year
##                                Breed       Color
## 1              Shetland Sheepdog Mix Brown/White
## 3                       Pit Bull Mix  Blue/White
## 5        Lhasa Apso/Miniature Poodle         Tan
## 6  Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9      American Pit Bull Terrier Mix   Red/White
## 10                     Cairn Terrier       White
attach(Dogtrain)

OutcomeType

  • First, look at how many records each OutcomeType has.
OutcomeType       Frequency
Adoption              6,497
Return_to_owner       4,286
Transfer              3,917
Euthanasia              845
Died                     50
## The following objects are masked from Dogtrain (pos = 3):
## 
##     AgeuponOutcome, Breed, Color, DateTime, OutcomeType,
##     SexuponOutcome
## The following objects are masked from train:
## 
##     AgeuponOutcome, Breed, Color, DateTime, OutcomeType,
##     SexuponOutcome
##               DateTime OutcomeType SexuponOutcome AgeuponOutcome
## 1  2014-02-12 18:22:00      Return  Neutered Male         1 year
## 3  2015-01-31 12:28:00    Adoption  Neutered Male        2 years
## 5  2013-11-15 12:52:00    Transfer  Neutered Male        2 years
## 6  2014-04-25 13:04:00    Transfer  Intact Female        1 month
## 9  2014-02-04 17:17:00    Adoption  Spayed Female       5 months
## 10 2014-05-03 07:48:00    Adoption  Spayed Female         1 year
##                                Breed       Color
## 1              Shetland Sheepdog Mix Brown/White
## 3                       Pit Bull Mix  Blue/White
## 5        Lhasa Apso/Miniature Poodle         Tan
## 6  Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9      American Pit Bull Terrier Mix   Red/White
## 10                     Cairn Terrier       White
  • Because there are so few "Died" records, we decided to drop them.
  • We also renamed "Return_to_owner" in OutcomeType to "Return", because the later analysis raised an error stating that "Return_to_owner" exceeded 64 bits.
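The report does not show the code for these two steps. A minimal sketch on a toy data frame might look as follows; the object name `dogs` and its values are hypothetical.

```r
# Toy stand-in for the dog data (hypothetical values)
dogs <- data.frame(
  OutcomeType = c("Adoption", "Died", "Return_to_owner", "Transfer", "Euthanasia"),
  stringsAsFactors = FALSE
)

# Drop the rare "Died" class
dogs <- dogs[dogs$OutcomeType != "Died", , drop = FALSE]

# Shorten the long label, then rebuild the factor so no unused level remains
dogs$OutcomeType[dogs$OutcomeType == "Return_to_owner"] <- "Return"
dogs$OutcomeType <- factor(dogs$OutcomeType)

levels(dogs$OutcomeType)
```

Rebuilding the factor after filtering matters because an empty "Died" level would otherwise still be passed to the classifiers.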

BREED

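The dog breed variable has too many distinct values, so we first plotted how often each breed occurs,
then kept the 30 most frequent breeds and grouped the rest into Others (shown as "Other Breed" in the output) to simplify the analysis.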
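The grouping step can be sketched as below. This is an illustration only: the helper `lump_top_n` is hypothetical, and it matches breed names exactly, whereas the output in this report suggests the original code also matched breeds inside longer strings such as "Pit Bull Mix".

```r
# Keep the n most frequent values of x and replace everything else by `other`
lump_top_n <- function(x, n, other = "Other Breed") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  ifelse(x %in% keep, x, other)
}

# Toy example with n = 2 (the report uses the top 30)
breed <- c("Pit Bull", "Pit Bull", "Pit Bull", "Chihuahua", "Chihuahua", "Beagle")
lumped <- lump_top_n(breed, n = 2)
table(lumped)
```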

##               DateTime OutcomeType SexuponOutcome AgeuponOutcome
## 1  2014-02-12 18:22:00      Return  Neutered Male         1 year
## 3  2015-01-31 12:28:00    Adoption  Neutered Male        2 years
## 5  2013-11-15 12:52:00    Transfer  Neutered Male        2 years
## 6  2014-04-25 13:04:00    Transfer  Intact Female        1 month
## 9  2014-02-04 17:17:00    Adoption  Spayed Female       5 months
## 10 2014-05-03 07:48:00    Adoption  Spayed Female         1 year
##          Color            breed
## 1  Brown/White      Other Breed
## 3   Blue/White         Pit Bull
## 5          Tan Miniature Poodle
## 6    Black/Tan    Cairn Terrier
## 9    Red/White      Other Breed
## 10       White    Cairn Terrier

Age

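Dog ages span a wide range, from newborns up to 19 years, with some recorded in weeks or months, so we group them roughly into:
1. Puppy (under 1 year)
2. AdultDog (1-7 years)
3. OldDog (over 7 years)
to simplify the analysis. Note that the sample output below labels the two "1 year" records as OldDog, which does not follow this rule; the singular "year" may not have been handled by the grouping code.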
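One way to implement this grouping is to convert each age string to approximate days first. The sketch below is not the authors' code: the helper `age_group`, the 30-day month, the 365-day year, and the choice to count exactly 7 years as AdultDog are all assumptions.

```r
# Map strings such as "3 weeks" or "2 years" to Puppy / AdultDog / OldDog
age_group <- function(age) {
  parts <- strsplit(as.character(age), " ", fixed = TRUE)
  n     <- as.numeric(vapply(parts, function(p) p[1], character(1)))
  # Strip the plural "s" so "year" and "years" are treated alike
  unit  <- sub("s$", "", vapply(parts, function(p) p[2], character(1)))
  days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)
  days <- unname(n * days_per_unit[unit])
  ifelse(days < 365, "Puppy",
         ifelse(days <= 7 * 365, "AdultDog", "OldDog"))
}

groups <- age_group(c("1 year", "2 years", "3 weeks", "5 months", "10 years"))
groups
```

Handling the singular and plural unit names in one place avoids the kind of mismatch where "1 year" falls through to the wrong group.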

##               DateTime OutcomeType SexuponOutcome       Color
## 1  2014-02-12 18:22:00      Return  Neutered Male Brown/White
## 3  2015-01-31 12:28:00    Adoption  Neutered Male  Blue/White
## 5  2013-11-15 12:52:00    Transfer  Neutered Male         Tan
## 6  2014-04-25 13:04:00    Transfer  Intact Female   Black/Tan
## 9  2014-02-04 17:17:00    Adoption  Spayed Female   Red/White
## 10 2014-05-03 07:48:00    Adoption  Spayed Female       White
##               breed      Age
## 1       Other Breed   OldDog
## 3          Pit Bull AdultDog
## 5  Miniature Poodle AdultDog
## 6     Cairn Terrier    Puppy
## 9       Other Breed    Puppy
## 10    Cairn Terrier   OldDog

Color

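  • There are too many distinct colors, and some values also describe features such as spots or markings,
    so we group them roughly into six categories: solid, two-color, tricolor, spotted, patched, and marked.
  • Next, we wanted to split colors further into dark and light to see whether that helps the analysis.
    We again plotted the frequencies and found that colors ranked below the top 20 occur too rarely,
    so from the 20 most frequent colors we picked out the plain colors,
    split them into dark and light, and used this to subdivide the solid category defined earlier.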
##               DateTime OutcomeType SexuponOutcome            breed
## 1  2014-02-12 18:22:00      Return  Neutered Male      Other Breed
## 3  2015-01-31 12:28:00    Adoption  Neutered Male         Pit Bull
## 5  2013-11-15 12:52:00    Transfer  Neutered Male Miniature Poodle
## 6  2014-04-25 13:04:00    Transfer  Intact Female    Cairn Terrier
## 9  2014-02-04 17:17:00    Adoption  Spayed Female      Other Breed
## 10 2014-05-03 07:48:00    Adoption  Spayed Female    Cairn Terrier
##         Age ColorFix
## 1    OldDog   Double
## 3  AdultDog   Double
## 5  AdultDog    Light
## 6     Puppy   Double
## 9     Puppy   Double
## 10   OldDog    Light

We end up with:
1. Dark (Heavy)
2. Light (Light)
3. Other solid colors (Others Simple)
4. Two-color (Double)
5. Tricolor (Tricolor)
6. Spotted (Brindle)
7. Patched (Merle)
8. Marked (Tick)
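A rough sketch of how such a ColorFix variable could be derived is shown below. The helper `color_fix`, the dark and light color lists, and the rule that a pattern keyword takes precedence over a "/" are all assumptions made for illustration; the report does not list which colors were treated as dark or light.

```r
# Assign each color string to one of the eight categories
color_fix <- function(color,
                      dark  = c("Black", "Brown", "Chocolate"),  # hypothetical list
                      light = c("White", "Tan", "Cream")) {      # hypothetical list
  color <- as.character(color)
  out <- rep("Others Simple", length(color))
  out[color %in% dark]  <- "Heavy"
  out[color %in% light] <- "Light"
  # Two colors separated by "/" count as Double
  out[grepl("/", color, fixed = TRUE)] <- "Double"
  # Pattern keywords override the categories assigned so far
  for (pattern in c("Tricolor", "Brindle", "Merle", "Tick")) {
    out[grepl(pattern, color, fixed = TRUE)] <- pattern
  }
  out
}

fixed <- color_fix(c("Brown/White", "Tan", "White", "Black", "Blue Merle", "Red"))
fixed
```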

Time

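The dates cover a wide range, and neither longer units (years) nor shorter ones (weeks, days, and so on)
seem likely to reveal much about how shelter outcomes change,
so we reduce each timestamp to one of the 12 months for the analysis.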
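Extracting the month from the DateTime strings can be done in base R, for example as follows. The variable names are illustrative, and the time zone is fixed to UTC only to make parsing deterministic.

```r
# Timestamps in the same format as the DateTime column
date_time <- c("2014-02-12 18:22:00", "2015-01-31 12:28:00", "2013-11-15 12:52:00")

# Parse, take the month number, then look up the English month name
month_num  <- as.integer(format(as.POSIXct(date_time, tz = "UTC"), "%m"))
month_lab  <- factor(month.name[month_num], levels = month.name)
month_lab
```

Using `month.name` keeps the labels in English regardless of the system locale, and fixing the factor levels keeps the months in calendar order.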

##    DateTime OutcomeType SexuponOutcome            breed      Age ColorFix
## 1  February      Return  Neutered Male      Other Breed   OldDog   Double
## 3   January    Adoption  Neutered Male         Pit Bull AdultDog   Double
## 5  November    Transfer  Neutered Male Miniature Poodle AdultDog    Light
## 6     April    Transfer  Intact Female    Cairn Terrier    Puppy   Double
## 9  February    Adoption  Spayed Female      Other Breed    Puppy   Double
## 10      May    Adoption  Spayed Female    Cairn Terrier   OldDog    Light

\[Data Analysis\]

SVM

Case 1 :

Randomly select 10,000 records as the training data.
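A random split of this kind can be sketched as below. The seed, the toy data frame and its size of 15,000 rows are placeholders, not values taken from the report.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Toy stand-in for the processed dog data (hypothetical size)
dogs <- data.frame(id = seq_len(15000))

# Draw 10,000 row indices without replacement; the rest form the test set
train_idx <- sample(nrow(dogs), 10000)
train_set <- dogs[train_idx, , drop = FALSE]
test_set  <- dogs[-train_idx, , drop = FALSE]

c(train = nrow(train_set), test = nrow(test_set))
```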

model<-svm(OutcomeType~.,data = SVMtrain, cost = 100,
           gamma = 1,probability=TRUE)
## Time difference of 1.923469 mins
## The confusion Matrix of training data.
##             
## pred         Adoption Euthanasia Return Transfer
##   Adoption       3800         91    838      639
##   Euthanasia        0          7      0        0
##   Return          306        214   1710      261
##   Transfer         95        218    191     1630
## The confusion Matrix of testing data.
##             
## pred         Adoption Euthanasia Return Transfer
##   Adoption       1690         95    825      573
##   Euthanasia        2          0      0        0
##   Return          416        114    436      230
##   Transfer        188        105    286      584
## Training Error Rate = 1.360251
## Testing Error Rate = 1.471125
## The number of Adoption in the training data is  4201
## The number of Return in the training data is 2739
## The number of Euthanasia in the training data is 530
## The number of Transfer in the training data is 2530

Case 2 :

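Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, the Euthanasia records in the training data are topped up
until their count is within 100-200 of the smallest of the other OutcomeType classes, without exceeding it.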
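The report does not say how the extra Euthanasia records were produced. One common option is to resample the existing minority rows with replacement, sketched below; the helper `oversample_class`, the default gap of 150 and the toy data are assumptions.

```r
# Duplicate random rows of `minority` until it is `gap` rows short of the
# smallest other class (never more than that class)
oversample_class <- function(df, label_col, minority, gap = 150) {
  counts <- table(df[[label_col]])
  smallest_other <- min(counts[names(counts) != minority])
  target <- smallest_other - gap
  idx <- which(df[[label_col]] == minority)
  extra <- target - length(idx)
  if (length(idx) == 0 || extra <= 0) return(df)
  picked <- idx[sample.int(length(idx), extra, replace = TRUE)]
  rbind(df, df[picked, , drop = FALSE])
}

set.seed(1)
toy <- data.frame(
  OutcomeType = rep(c("Adoption", "Transfer", "Euthanasia"), times = c(1000, 600, 50)),
  stringsAsFactors = FALSE
)
balanced <- oversample_class(toy, "OutcomeType", "Euthanasia")
table(balanced$OutcomeType)
```

Resampling should be applied to the training data only, so that the test set still reflects the original class proportions.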

model<-svm(OutcomeType~.,data = SVMtrain, cost = 100,
           gamma = 5,probability=TRUE)
## Time difference of 2.777853 mins
## The confusion Matrix of training data.
##             
## trainEr      Adoption Euthanasia Return Transfer
##   Adoption       3871        124   1093      666
##   Euthanasia      179       2094    243      179
##   Return           76         72   1190      116
##   Transfer         75         52    213     1569
## The confusion Matrix of testing data.
##             
## testEr       Adoption Euthanasia Return Transfer
##   Adoption       1786        161   1020      837
##   Euthanasia      125         61    163       97
##   Return          221         47    194      126
##   Transfer        164         45    170      327
## Training Error Rate = 1.730448
## Testing Error Rate = 1.65136
## The number of Adoption in the training data is  4201
## The number of Euthanasia in the training data is 2342
## The number of Return in the training data is 2739
## The number of Transfer in the training data is 2530

Random Forest

Case 1 :

Randomly select 10,000 records as the training data.

model<-randomForest(OutcomeType~.,data=RFtrain,
                    ntree=600,
                    mtry=4, 
                    importance = TRUE)
## Time difference of 6.582578 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3782         90    836      617
##   Euthanasia        4        275     52       16
##   Return          284         77   1596      196
##   Transfer        131         88    255     1701
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1596         89    733      508
##   Euthanasia       14         30     51       53
##   Return          465        101    463      260
##   Transfer        221         94    300      566
## Training Error Rate = 0.7314072
## Testing Error Rate = 1.908389

Case 2 :

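Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, the Euthanasia records in the training data are topped up
until their count is within 100-200 of the smallest of the other OutcomeType classes, without exceeding it.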

model<-randomForest(OutcomeType~.,
                    data=RFtrain,
                    mtry=4,
                    ntree=600,
                    importance = TRUE)
## Time difference of 6.824367 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3450        122    667      508
##   Euthanasia      198       2117    281      211
##   Return          375         54   1710      266
##   Transfer        129         25    142     1508
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1526         62    598      465
##   Euthanasia      141         82    221      161
##   Return          504         84    467      275
##   Transfer        174         61    200      523
## Training Error Rate = 0.8590911
## Testing Error Rate = 2.329743
## The number of Adoption in the training data is  4152
## The number of Return in the training data is 2800
## The number of Euthanasia in the training data is 2318
## The number of Transfer in the training data is 2493

XGBoost

Case 1:

The data are used without the preprocessing above; only some of the predictor columns (x) have been removed.
##               DateTime     OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 3  2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00        Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00        Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00        Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White

Set the parameters and run xgb.cv to find the best iteration.

xgb_params=list(    
    objective="multi:softprob",
    eta= 0.1, 
    max_depth= 6, 
    colsample_bytree= 0.7,
    subsample = 0.7,
    num_class = 5)
## [1]  train-mlogloss:1.549853+0.003793    test-mlogloss:1.552224+0.003009 
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
## 
## [11] train-mlogloss:1.225312+0.005037    test-mlogloss:1.247638+0.005736 
## [21] train-mlogloss:1.092062+0.005594    test-mlogloss:1.131738+0.003287 
## [31] train-mlogloss:1.020930+0.004125    test-mlogloss:1.075818+0.004526 
## [41] train-mlogloss:0.978023+0.003041    test-mlogloss:1.048133+0.007007 
## [51] train-mlogloss:0.948035+0.003411    test-mlogloss:1.033613+0.007910 
## [61] train-mlogloss:0.924736+0.002959    test-mlogloss:1.025162+0.008137 
## [71] train-mlogloss:0.905240+0.002458    test-mlogloss:1.020481+0.008152 
## [81] train-mlogloss:0.887836+0.001817    test-mlogloss:1.017102+0.008694 
## [91] train-mlogloss:0.872305+0.001962    test-mlogloss:1.015409+0.009068 
## [101]    train-mlogloss:0.857283+0.001713    test-mlogloss:1.014283+0.009263 
## [111]    train-mlogloss:0.844009+0.002195    test-mlogloss:1.013251+0.009401 
## [121]    train-mlogloss:0.831194+0.002237    test-mlogloss:1.012412+0.009654 
## [131]    train-mlogloss:0.818606+0.002341    test-mlogloss:1.012150+0.010097 
## [141]    train-mlogloss:0.806708+0.002933    test-mlogloss:1.012083+0.010252 
## Stopping. Best iteration:
## [138]    train-mlogloss:0.810012+0.002908    test-mlogloss:1.011778+0.009976
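The report does not show how the inputs to xgb.cv were prepared. The multiclass objectives in xgboost expect integer labels from 0 to num_class - 1 and a numeric feature matrix, so a preparation step along the following lines is needed. This base-R sketch is an assumption about that step, with a toy data frame and a single predictor.

```r
# Toy stand-in for the unprocessed dog data (hypothetical rows)
raw <- data.frame(
  OutcomeType    = c("Adoption", "Transfer", "Died", "Euthanasia", "Return_to_owner"),
  SexuponOutcome = c("Neutered Male", "Intact Female", "Intact Male",
                     "Spayed Female", "Neutered Male"),
  stringsAsFactors = TRUE
)

# Factor levels are sorted alphabetically, so Adoption = 0, Died = 1,
# Euthanasia = 2, Return_to_owner = 3, Transfer = 4
label <- as.integer(raw$OutcomeType) - 1L

# One-hot encode the categorical predictor into a numeric matrix (no intercept)
x <- model.matrix(~ . - 1, data = raw["SexuponOutcome"])

label
dim(x)
```

The resulting `x` and `label` could then be wrapped in an `xgb.DMatrix` and passed to `xgb.cv` together with `xgb_params`.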

Build the model and run it.

## [1]  train-mlogloss:1.548160 test-mlogloss:1.552089 
## [6]  train-mlogloss:1.349085 test-mlogloss:1.370634 
## [11] train-mlogloss:1.225749 test-mlogloss:1.260025 
## [16] train-mlogloss:1.141708 test-mlogloss:1.187124 
## [21] train-mlogloss:1.086716 test-mlogloss:1.141999 
## [26] train-mlogloss:1.043181 test-mlogloss:1.108875 
## [31] train-mlogloss:1.011296 test-mlogloss:1.086143 
## [36] train-mlogloss:0.985441 test-mlogloss:1.069158 
## [41] train-mlogloss:0.964426 test-mlogloss:1.057213 
## [46] train-mlogloss:0.947094 test-mlogloss:1.047990 
## [51] train-mlogloss:0.931634 test-mlogloss:1.041483 
## [56] train-mlogloss:0.917805 test-mlogloss:1.037082 
## [61] train-mlogloss:0.906150 test-mlogloss:1.033031 
## [66] train-mlogloss:0.895792 test-mlogloss:1.030709 
## [71] train-mlogloss:0.885934 test-mlogloss:1.028769 
## [76] train-mlogloss:0.876595 test-mlogloss:1.026630 
## [81] train-mlogloss:0.867169 test-mlogloss:1.025943 
## [86] train-mlogloss:0.858633 test-mlogloss:1.025280 
## [91] train-mlogloss:0.849955 test-mlogloss:1.024257 
## [96] train-mlogloss:0.841065 test-mlogloss:1.024020 
## [101]    train-mlogloss:0.832991 test-mlogloss:1.022957 
## [106]    train-mlogloss:0.824987 test-mlogloss:1.022457 
## [111]    train-mlogloss:0.817332 test-mlogloss:1.022702 
## [116]    train-mlogloss:0.809637 test-mlogloss:1.022025 
## [121]    train-mlogloss:0.802716 test-mlogloss:1.022224 
## [126]    train-mlogloss:0.796048 test-mlogloss:1.022252 
## [131]    train-mlogloss:0.787958 test-mlogloss:1.022125 
## [136]    train-mlogloss:0.781307 test-mlogloss:1.023031 
## [138]    train-mlogloss:0.778695 test-mlogloss:1.023148
##           Feature      Gain      Cover  Frequency
## 1:       DateTime 0.2525362 0.33468214 0.32568625
## 2: SexuponOutcome 0.2331548 0.09992673 0.06740927
## 3: AgeuponOutcome 0.2207229 0.21478670 0.16553950
## 4:          Breed 0.1794696 0.23555338 0.24903859
## 5:          Color 0.1141164 0.11505105 0.19232639

Compute the training log loss.

##          Adoption         Died  Euthanasia Return_to_owner   Transfer
##     1: 0.45878291 0.0017017922 0.053674176      0.36169776 0.12414335
##     2: 0.06730244 0.0028727080 0.061332077      0.29447994 0.57401288
##     3: 0.92318618 0.0008445218 0.003078837      0.02244440 0.05044603
##     4: 0.39830166 0.0022729915 0.026639974      0.39395362 0.17883176
##     5: 0.01725563 0.0022073565 0.157445490      0.19899887 0.62409264
##    ---                                                               
##  9996: 0.43103918 0.0022689765 0.016445016      0.44577870 0.10446814
##  9997: 0.52740359 0.0035767918 0.037226059      0.30193165 0.12986192
##  9998: 0.35443890 0.0010068018 0.063138068      0.37588447 0.20553173
##  9999: 0.25305191 0.0008430329 0.020748764      0.58629858 0.13905773
## 10000: 0.02249851 0.0072929328 0.106823727      0.03305306 0.83033180
##        class
##     1:     0
##     2:     4
##     3:     0
##     4:     0
##     5:     4
##    ---      
##  9996:     3
##  9997:     0
##  9998:     3
##  9999:     3
## 10000:     4
## The confusion Matrix of training data.
##    
## y1     0    1    2    3    4
##   0 3856    0    1  242   69
##   1    8    8    0    6   16
##   2  103    0  183  137  123
##   3  973    0   11 1556  186
##   4  933    0    8  262 1319
## Training Error Rate = 0.7786949
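The value printed as "Training Error Rate" equals the final train-mlogloss of the fitted model, that is, the multiclass log loss. Given a matrix of predicted class probabilities and the 0-based true classes, it can be computed as in this sketch; the helper name `mlogloss` and the clipping constant are illustrative choices.

```r
# Multiclass log loss: mean of -log(probability assigned to the true class)
mlogloss <- function(prob, label, eps = 1e-15) {
  # prob:  n x K matrix of class probabilities, one row per record
  # label: integer vector of true classes coded 0 .. K-1
  p_true <- prob[cbind(seq_along(label), label + 1L)]
  p_true <- pmin(pmax(p_true, eps), 1 - eps)  # avoid log(0)
  -mean(log(p_true))
}

prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.8, 0.1), nrow = 2, byrow = TRUE)
loss <- mlogloss(prob, label = c(0L, 1L))
loss
```

Because the loss is unbounded above, values greater than 1 are possible, unlike a misclassification rate.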

Compute the test log loss.

##         Adoption         Died  Euthanasia Return_to_owner   Transfer class
##    1: 0.63511705 0.0005314900 0.015772911       0.1706489 0.17792968     0
##    2: 0.49277225 0.0015316177 0.074896708       0.2807500 0.15004942     0
##    3: 0.33833238 0.0020514179 0.026702372       0.2123156 0.42059824     4
##    4: 0.52782530 0.0006620213 0.005164368       0.2965512 0.16979709     0
##    5: 0.02366604 0.0041919844 0.092562102       0.3129480 0.56663185     4
##   ---                                                                     
## 5591: 0.02657869 0.0031870250 0.063063629       0.1351611 0.77200961     4
## 5592: 0.22057034 0.0013268294 0.106754571       0.5059654 0.16538292     3
## 5593: 0.47880501 0.0015834671 0.039816048       0.3957697 0.08402585     0
## 5594: 0.37256369 0.0176620018 0.017697802       0.5037267 0.08834974     3
## 5595: 0.51050073 0.0009238016 0.033122234       0.3270785 0.12837476     0
## The confusion Matrix of testing data.
##    
## y2     0    2    3    4
##   0 1929    6  317   77
##   1    1    0    1   10
##   2   72   21   99  107
##   3  797   20  557  186
##   4  571   14  237  573
## Testing Error Rate = 1.023148
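The test confusion matrix above has no column for class 1, which is what `table()` produces when a class never occurs in one of its arguments. Fixing the factor levels keeps the matrix square, and the misclassification rate can then be read off the diagonal. The sketch below uses made-up probabilities and is not the authors' code.

```r
# Toy predicted probabilities for 3 records and 3 classes (hypothetical)
prob <- matrix(c(0.6, 0.1, 0.3,
                 0.2, 0.1, 0.7,
                 0.5, 0.2, 0.3), nrow = 3, byrow = TRUE)
truth   <- c(0L, 2L, 2L)
classes <- 0:2

# Predicted class = column with the highest probability (0-based)
pred <- max.col(prob, ties.method = "first") - 1L

# Fixed levels keep a row and a column for every class, even unused ones
cm <- table(truth = factor(truth, levels = classes),
            pred  = factor(pred,  levels = classes))

# Share of records off the diagonal; always between 0 and 1
error_rate <- 1 - sum(diag(cm)) / sum(cm)
cm
error_rate
```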

Case 2 :

Run xgboost on the data processed earlier.
##               DateTime     OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 3  2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00        Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00        Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00        Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White

Set the parameters here and run xgb.cv to find the best iteration.

xgb_params=list(    
    objective="multi:softprob",
    eta= 0.1, 
    max_depth= 6, 
    colsample_bytree= 0.7,
    subsample = 0.7,
    num_class = 4)
## [1]  train-mlogloss:1.353026+0.001894    test-mlogloss:1.354286+0.002167 
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
## 
## [11] train-mlogloss:1.153811+0.008038    test-mlogloss:1.164682+0.011225 
## [21] train-mlogloss:1.073628+0.001869    test-mlogloss:1.092965+0.005694 
## [31] train-mlogloss:1.034891+0.000647    test-mlogloss:1.062665+0.005294 
## [41] train-mlogloss:1.012826+0.001450    test-mlogloss:1.048729+0.004878 
## [51] train-mlogloss:0.998796+0.001706    test-mlogloss:1.042443+0.005295 
## [61] train-mlogloss:0.988680+0.001155    test-mlogloss:1.039640+0.005719 
## [71] train-mlogloss:0.980706+0.001269    test-mlogloss:1.038768+0.006748 
## [81] train-mlogloss:0.974023+0.000990    test-mlogloss:1.038960+0.006907 
## Stopping. Best iteration:
## [71] train-mlogloss:0.980706+0.001269    test-mlogloss:1.038768+0.006748

Build the model and run it.

## [1]  train-mlogloss:1.361548 test-mlogloss:1.363891 
## [6]  train-mlogloss:1.240742 test-mlogloss:1.249884 
## [11] train-mlogloss:1.157302 test-mlogloss:1.172423 
## [16] train-mlogloss:1.104450 test-mlogloss:1.123224 
## [21] train-mlogloss:1.067908 test-mlogloss:1.090902 
## [26] train-mlogloss:1.043646 test-mlogloss:1.070384 
## [31] train-mlogloss:1.026773 test-mlogloss:1.058267 
## [36] train-mlogloss:1.014848 test-mlogloss:1.050070 
## [41] train-mlogloss:1.006555 test-mlogloss:1.046505 
## [46] train-mlogloss:0.999439 test-mlogloss:1.042665 
## [51] train-mlogloss:0.993587 test-mlogloss:1.040677 
## [56] train-mlogloss:0.987523 test-mlogloss:1.038515 
## [61] train-mlogloss:0.983255 test-mlogloss:1.037505 
## [66] train-mlogloss:0.978379 test-mlogloss:1.037405 
## [71] train-mlogloss:0.974099 test-mlogloss:1.037753
##           Feature      Gain     Cover  Frequency
## 1: SexuponOutcome 0.4402829 0.2199861 0.13977939
## 2:            Age 0.2045508 0.1281322 0.09694933
## 3:          breed 0.1817956 0.3037280 0.30248190
## 4:       DateTime 0.1047721 0.1924553 0.26232334
## 5:       ColorFix 0.0685986 0.1556984 0.19846605

Compute the training log loss.

##          Adoption Euthanasia Return_to_owner  Transfer class
##     1: 0.51531571 0.02636277      0.26090056 0.1974209     0
##     2: 0.07967110 0.02886126      0.03955148 0.8519161     3
##     3: 0.37937066 0.04655133      0.41217107 0.1619070     2
##     4: 0.56485575 0.02584560      0.26619008 0.1431085     0
##     5: 0.03970097 0.29154360      0.27125841 0.3974970     3
##    ---                                                      
##  9996: 0.38061544 0.08665598      0.33801147 0.1947171     0
##  9997: 0.33354595 0.05495799      0.42315775 0.1883383     2
##  9998: 0.05908908 0.12486818      0.19097961 0.6250631     3
##  9999: 0.51241267 0.02518433      0.24549526 0.2169078     0
## 10000: 0.37589514 0.08972079      0.33793738 0.1964467     0
## The confusion Matrix of training data.
##    
## y1     0    1    2    3
##   0 3634    1  488   78
##   1  139   36  184  171
##   2 1532   10  963  234
##   3 1068    9  438 1015
## Training Error Rate = 0.9740993

Compute the test log loss.

##         Adoption  Euthanasia Return_to_owner  Transfer class
##    1: 0.39595342 0.033554636      0.38870567 0.1817863     0
##    2: 0.80586094 0.008442121      0.08419906 0.1014979     0
##    3: 0.31180611 0.090440273      0.37623608 0.2215175     2
##    4: 0.40552264 0.030381208      0.38745058 0.1766456     0
##    5: 0.04399491 0.116008461      0.11627511 0.7237215     3
##   ---                                                       
## 5540: 0.38396648 0.031884395      0.36405474 0.2200943     0
## 5541: 0.79534328 0.011070084      0.07400112 0.1195855     0
## 5542: 0.76695311 0.008629197      0.07759160 0.1468261     0
## 5543: 0.36331820 0.031529091      0.41140535 0.1937474     2
## 5544: 0.44894662 0.021116264      0.36779380 0.1621433     0
## The confusion Matrix of testing data.
##    
## y2     0    1    2    3
##   0 1912    1  347   36
##   1   96    6  129   83
##   2  884    7  473  183
##   3  630    9  216  532
## Testing Error Rate = 1.037753

Case 3 :

Run xgboost on the data processed earlier, except for AgeuponOutcome and DateTime, which are handled separately here.

##    DateTime     OutcomeType AnimalType SexuponOutcome AgeuponOutcome
## 1        02 Return_to_owner        Dog  Neutered Male          10000
## 3        01        Adoption        Dog  Neutered Male          20000
## 5        11        Transfer        Dog  Neutered Male          20000
## 6        04        Transfer        Dog  Intact Female            100
## 9        02        Adoption        Dog  Spayed Female            500
## 10       05        Adoption        Dog  Spayed Female          10000
##               breed ColorFix
## 1       Other Breed   Double
## 3          Pit Bull   Double
## 5  Miniature Poodle    Light
## 6     Cairn Terrier   Double
## 9       Other Breed   Double
## 10    Cairn Terrier    Light

Set the parameters, run xgb.cv, and find the best iteration.

xgb_params=list(    
    objective="multi:softprob",
    eta= 0.1, 
    max_depth= 6, 
    colsample_bytree= 0.7,
    subsample = 0.7,
    num_class = 4)
## [1]  train-mlogloss:1.353026+0.001894    test-mlogloss:1.354286+0.002167 
## Multiple eval metrics are present. Will use test_mlogloss for early stopping.
## Will train until test_mlogloss hasn't improved in 10 rounds.
## 
## [11] train-mlogloss:1.153811+0.008038    test-mlogloss:1.164682+0.011225 
## [21] train-mlogloss:1.073628+0.001869    test-mlogloss:1.092965+0.005694 
## [31] train-mlogloss:1.034891+0.000647    test-mlogloss:1.062665+0.005294 
## [41] train-mlogloss:1.012826+0.001450    test-mlogloss:1.048729+0.004878 
## [51] train-mlogloss:0.998796+0.001706    test-mlogloss:1.042443+0.005295 
## [61] train-mlogloss:0.988680+0.001155    test-mlogloss:1.039640+0.005719 
## [71] train-mlogloss:0.980706+0.001269    test-mlogloss:1.038768+0.006748 
## [81] train-mlogloss:0.974023+0.000990    test-mlogloss:1.038960+0.006907 
## Stopping. Best iteration:
## [71] train-mlogloss:0.980706+0.001269    test-mlogloss:1.038768+0.006748

Build the model and run it.

## [1]  train-mlogloss:1.361548 test-mlogloss:1.363891 
## [6]  train-mlogloss:1.240742 test-mlogloss:1.249884 
## [11] train-mlogloss:1.157302 test-mlogloss:1.172423 
## [16] train-mlogloss:1.104450 test-mlogloss:1.123224 
## [21] train-mlogloss:1.067908 test-mlogloss:1.090902 
## [26] train-mlogloss:1.043646 test-mlogloss:1.070384 
## [31] train-mlogloss:1.026773 test-mlogloss:1.058267 
## [36] train-mlogloss:1.014848 test-mlogloss:1.050070 
## [41] train-mlogloss:1.006555 test-mlogloss:1.046505 
## [46] train-mlogloss:0.999439 test-mlogloss:1.042665 
## [51] train-mlogloss:0.993587 test-mlogloss:1.040677 
## [56] train-mlogloss:0.987523 test-mlogloss:1.038515 
## [61] train-mlogloss:0.983255 test-mlogloss:1.037505 
## [66] train-mlogloss:0.978379 test-mlogloss:1.037405 
## [71] train-mlogloss:0.974099 test-mlogloss:1.037753
##           Feature      Gain     Cover  Frequency
## 1: SexuponOutcome 0.4402829 0.2199861 0.13977939
## 2:            Age 0.2045508 0.1281322 0.09694933
## 3:          breed 0.1817956 0.3037280 0.30248190
## 4:       DateTime 0.1047721 0.1924553 0.26232334
## 5:       ColorFix 0.0685986 0.1556984 0.19846605

Compute the training log loss.

##          Adoption Euthanasia Return_to_owner  Transfer class
##     1: 0.51531571 0.02636277      0.26090056 0.1974209     0
##     2: 0.07967110 0.02886126      0.03955148 0.8519161     3
##     3: 0.37937066 0.04655133      0.41217107 0.1619070     2
##     4: 0.56485575 0.02584560      0.26619008 0.1431085     0
##     5: 0.03970097 0.29154360      0.27125841 0.3974970     3
##    ---                                                      
##  9996: 0.38061544 0.08665598      0.33801147 0.1947171     0
##  9997: 0.33354595 0.05495799      0.42315775 0.1883383     2
##  9998: 0.05908908 0.12486818      0.19097961 0.6250631     3
##  9999: 0.51241267 0.02518433      0.24549526 0.2169078     0
## 10000: 0.37589514 0.08972079      0.33793738 0.1964467     0
## The confusion Matrix of training data.
##    
## y1     0    1    2    3
##   0 3634    1  488   78
##   1  139   36  184  171
##   2 1532   10  963  234
##   3 1068    9  438 1015
## Training Error Rate = 0.9740993

Compute the test log loss.

##         Adoption  Euthanasia Return_to_owner  Transfer class
##    1: 0.39595342 0.033554636      0.38870567 0.1817863     0
##    2: 0.80586094 0.008442121      0.08419906 0.1014979     0
##    3: 0.31180611 0.090440273      0.37623608 0.2215175     2
##    4: 0.40552264 0.030381208      0.38745058 0.1766456     0
##    5: 0.04399491 0.116008461      0.11627511 0.7237215     3
##   ---                                                       
## 5540: 0.38396648 0.031884395      0.36405474 0.2200943     0
## 5541: 0.79534328 0.011070084      0.07400112 0.1195855     0
## 5542: 0.76695311 0.008629197      0.07759160 0.1468261     0
## 5543: 0.36331820 0.031529091      0.41140535 0.1937474     2
## 5544: 0.44894662 0.021116264      0.36779380 0.1621433     0
## The confusion Matrix of testing data.
##    
## y2     0    1    2    3
##   0 1912    1  347   36
##   1   96    6  129   83
##   2  884    7  473  183
##   3  630    9  216  532
## Testing Error Rate = 1.037753

\[Conclusion\]

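  1. This dataset contains too few Euthanasia records, which makes SVM and RF predict poorly. XGBoost does not show this problem, so it handles this aspect better.
  2. On the same sampled data, XGBoost is the best of the three methods.
  3. Judging from kernels written by other people, our data processing was incomplete, so our machine learning results do not predict as well as they could.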

\[Reference\]

https://www.kaggle.com/apapiu/visualizing-breeds-and-ages-by-outcome/code/notebook
https://www.kaggle.com/mrisdal/quick-dirty-randomforest/code
https://www.kaggle.com/fsmithus/reduced-model
https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
https://cran.r-project.org/web/packages/xgboost/xgboost.pdf
https://stackoverflow.com/questions/24197809/functionality-of-probability-true-in-svm-function-of-e1071-package-in-r
https://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf
https://stackoverflow.com/questions/16961921/plot-data-in-descending-order-as-appears-in-data-frame
林子軒 (senior classmate), credited by 潘星丞