Updates:
06/02 line 37~47 => Count the records for each OutcomeType
      line 50~58 => Not yet.
      line 61~168 => Breed
      line 174~202 => Age
      line 207~290

06/04 line 205~265 => Color
      line 465~617 => Random Forest

Data processing

The data set contains 26,729 records and 9 explanatory variables, and covers both dogs and cats.

library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
trainInit<-read.csv("train.csv")
head(trainInit)
##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner               
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430         2014-07-11 19:09:00        Transfer        Partner
## 5  A667013         2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan

Drop AnimalID, Name, and OutcomeSubtype (columns 1, 2, and 5).

train<-trainInit[,-c(1,2,5)]
head(train)
##              DateTime     OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 2 2013-10-13 12:44:00      Euthanasia        Cat  Spayed Female
## 3 2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 4 2014-07-11 19:09:00        Transfer        Cat    Intact Male
## 5 2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6 2014-04-25 13:04:00        Transfer        Dog  Intact Female
##   AgeuponOutcome                             Breed       Color
## 1         1 year             Shetland Sheepdog Mix Brown/White
## 2         1 year            Domestic Shorthair Mix Cream Tabby
## 3        2 years                      Pit Bull Mix  Blue/White
## 4        3 weeks            Domestic Shorthair Mix  Blue Cream
## 5        2 years       Lhasa Apso/Miniature Poodle         Tan
## 6        1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
attach(train)

Split the data into dogs and cats.

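# Index the column explicitly so the split does not depend on attach(train),
# and avoid -which(), which drops every row when nothing matches.
Dogtrain <- train[which(train$AnimalType == "Dog"), ]
Cattrain <- train[which(train$AnimalType != "Dog"), ]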
head(Dogtrain)
##               DateTime     OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 3  2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00        Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00        Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00        Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White
attach(Dogtrain)

Dog data: n = 15595, p = 8

OutcomeType

First, count how many records each OutcomeType has.

## Return to owner = 4286
## Transfer = 3917
## Adoption = 6497
## Died = 50
## Euthanasia = 845
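
Counts like the ones above can be obtained with `table()`. The following is a minimal sketch on a made-up vector that stands in for `Dogtrain$OutcomeType`; the values are illustrative and are not the real data.

```r
# Toy stand-in for Dogtrain$OutcomeType (values are made up for illustration).
outcome <- factor(c("Adoption", "Transfer", "Adoption", "Died",
                    "Return_to_owner", "Euthanasia", "Adoption"))

# table() counts each level; sort() puts the most common outcome first.
counts <- sort(table(outcome), decreasing = TRUE)
print(counts)

# Proportions make very rare classes such as "Died" easy to spot.
print(round(prop.table(counts), 3))
```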

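Because there are so few “Died” records, we decided to remove that category. We also rename the OutcomeType level Return_to_owner to Return, because the later XGBoost step raised an error reporting that Return_to_owner exceeded 64 bits.

All “Died” records are relabeled as “Euthanasia”.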
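
The relabeling described above can be sketched as follows. `recode_outcome` is a hypothetical helper name introduced here for illustration, and the input vector is made up.

```r
# Hypothetical helper: merge "Died" into "Euthanasia" and shorten "Return_to_owner".
recode_outcome <- function(x) {
  x <- as.character(x)
  x[x == "Died"] <- "Euthanasia"
  x[x == "Return_to_owner"] <- "Return"
  factor(x)  # rebuilding the factor drops the now-unused levels
}

outcome <- factor(c("Adoption", "Died", "Return_to_owner", "Euthanasia", "Transfer"))
recoded <- recode_outcome(outcome)
print(levels(recoded))
print(table(recoded))
```

Applied to the counts listed earlier, this merge would give 845 + 50 = 895 Euthanasia records.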

Breed

Age

Recode AgeuponOutcome into three categories: “Puppy”, “AdultDog”, and “OldDog”.

##               DateTime OutcomeType AnimalType SexuponOutcome       Color
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male Brown/White
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male  Blue/White
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male         Tan
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female   Black/Tan
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female   Red/White
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female       White
##               breed      Age
## 1       Other Breed   OldDog
## 3          Pit Bull AdultDog
## 5  Miniature Poodle AdultDog
## 6     Cairn Terrier    Puppy
## 9       Other Breed    Puppy
## 10    Cairn Terrier   OldDog
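
One way to derive such an Age column is to convert the age strings to days and then bin them. The sketch below is an illustration only: the helper names and the cutoffs (under 1 year for Puppy, 7 years and above for OldDog) are assumptions and are not taken from the original code. The table above labels 1-year-old dogs as OldDog, so the rule actually used evidently differs from this one.

```r
# Convert strings such as "2 years" or "5 months" to an approximate age in days.
age_in_days <- function(age) {
  parts <- strsplit(as.character(age), " ", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)   # blank or malformed entry
    unit <- sub("s$", "", p[2])            # "years" -> "year"
    days <- c(day = 1, week = 7, month = 30, year = 365)[unit]
    as.numeric(p[1]) * unname(days)
  }, numeric(1))
}

# ASSUMED cutoffs, for illustration only.
age_group <- function(age, puppy_max = 365, adult_max = 7 * 365) {
  cut(age_in_days(age),
      breaks = c(-Inf, puppy_max, adult_max, Inf),
      labels = c("Puppy", "AdultDog", "OldDog"),
      right = FALSE)                       # intervals are [lower, upper)
}

print(age_group(c("1 month", "5 months", "2 years", "10 years", "")))
```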

Color

Map each color to one of six categories: “Simple”, “Double”, “Tricolor”, “Brindle”, “Tick”, and “Merle”.

The “Brindle” category also covers colors that combine Brindle with Tick, or Brindle with Merle.

Some patterned colors, such as “Blue Tiger”, “Blue Cream”, and “Smoke”, do not belong to Brindle. For these I consider only the base colors and classify them as “Simple”, “Double”, or “Tricolor”.
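
The rules above could be sketched roughly as follows. `classify_color` is a hypothetical helper. It rests on several assumptions: base colors are counted by splitting on “/”, and the pattern keywords are checked in the order Brindle, Tick, Merle. The original code is not shown, so the real rules may differ, for example in how “Blue Cream” is counted.

```r
# Hypothetical classifier; the keyword rules and their priority are assumptions.
classify_color <- function(color) {
  color <- as.character(color)
  # Number of base colors = number of "/"-separated parts.
  n_parts <- lengths(strsplit(color, "/", fixed = TRUE))
  by_count <- ifelse(n_parts >= 3 | grepl("Tricolor", color), "Tricolor",
              ifelse(n_parts == 2, "Double", "Simple"))
  # Pattern keywords override the count; Brindle wins over Tick and Merle.
  ifelse(grepl("Brindle", color), "Brindle",
  ifelse(grepl("Tick", color), "Tick",
  ifelse(grepl("Merle", color), "Merle", by_count)))
}

colors <- c("Tan", "Brown/White", "Blue Tick", "Brown Brindle/Blue Tick",
            "Blue Merle", "Blue Tiger", "Blue Cream", "Tricolor")
print(data.frame(Color = colors, Category = classify_color(colors)))
```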

Time

##    DateTime OutcomeType AnimalType SexuponOutcome            breed
## 1  February      Return        Dog  Neutered Male      Other Breed
## 3   January    Adoption        Dog  Neutered Male         Pit Bull
## 5  November    Transfer        Dog  Neutered Male Miniature Poodle
## 6     April    Transfer        Dog  Intact Female    Cairn Terrier
## 9  February    Adoption        Dog  Spayed Female      Other Breed
## 10      May    Adoption        Dog  Spayed Female    Cairn Terrier
##         Age ColorFix
## 1    OldDog   Double
## 3  AdultDog   Double
## 5  AdultDog    Light
## 6     Puppy   Double
## 9     Puppy   Double
## 10   OldDog    Light
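
The table above shows DateTime reduced to the month name. A minimal sketch of that conversion follows; `to_month` is a hypothetical helper, and the timestamp format is assumed to match the `head()` output shown earlier.

```r
# Parse the timestamp and keep only the month name.
# month.name is a base R constant with English names, so the result
# does not depend on the session locale.
to_month <- function(datetime) {
  t <- as.POSIXlt(as.character(datetime), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  factor(month.name[t$mon + 1], levels = month.name)  # t$mon is 0-based
}

print(to_month(c("2014-02-12 18:22:00", "2015-01-31 12:28:00", "2013-11-15 12:52:00")))
```

Keeping `levels = month.name` preserves calendar order in later tables and plots.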

SVM

Fit the model with an SVM.
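
For reference, a minimal SVM fit with the `e1071` package loaded at the top of this document might look like the sketch below. The data frame, its columns, and the radial kernel are assumptions made for illustration; they are not the settings used to produce the results in this section.

```r
# Runs only if the e1071 package is installed.
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(3)  # arbitrary seed, only for reproducibility
  n <- 400
  toy <- data.frame(
    Age = factor(sample(c("Puppy", "AdultDog", "OldDog"), n, replace = TRUE)),
    Sex = factor(sample(c("Intact", "Fixed"), n, replace = TRUE))
  )
  # Made-up rule so that there is some signal to learn.
  toy$OutcomeType <- factor(ifelse(toy$Sex == "Fixed", "Adoption", "Transfer"))
  train_rows <- sample.int(n, 300)

  start <- Sys.time()
  fit <- e1071::svm(OutcomeType ~ ., data = toy[train_rows, ], kernel = "radial")
  print(Sys.time() - start)            # prints "Time difference of ..."

  pred <- predict(fit, toy[-train_rows, ])
  cm <- table(Prediction = pred, Reference = toy$OutcomeType[-train_rows])
  print(cm)
  print(1 - sum(diag(cm)) / sum(cm))   # testing error rate
}
```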

Method 1: make the OutcomeType classes roughly equal in the training data. For each OutcomeType, draw a random number of records between 500 and 600.
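
The sampling scheme of method 1 can be sketched as follows. `sample_per_class` is a hypothetical helper, the toy class sizes are made up, and the sketch assumes the rows that are not drawn form the test set.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Toy data with imbalanced classes (made-up sizes).
toy <- data.frame(
  OutcomeType = factor(rep(c("Adoption", "Euthanasia", "Return", "Transfer"),
                           times = c(3000, 900, 2000, 1800))),
  x = rnorm(7700)
)

# For each class, draw a random size in [lo, hi] and sample that many row indices.
sample_per_class <- function(y, lo = 500, hi = 600) {
  idx <- lapply(split(seq_along(y), y), function(rows) {
    n <- sample(lo:hi, 1)
    rows[sample.int(length(rows), min(n, length(rows)))]
  })
  unlist(idx, use.names = FALSE)
}

train_idx <- sample_per_class(toy$OutcomeType)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]   # assumption: everything else is test data
print(table(train_set$OutcomeType))
```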

## The number of training data is  2197
## The number of Adoption in the training data is  502
## The number of Return in the training data is 578
## The number of Euthanasia in the training data is 588
## The number of Transfer in the training data is 529
## Time difference of 1.488188 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption        198          4     40       57
##   Euthanasia        4        271     80      105
##   Return          298        263    442      230
##   Transfer          2         50     16      137
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2167          3    244      463
##   Euthanasia       58        103    524      649
##   Return         3711        123   2804     1458
##   Transfer         59         27    136      818
## Training Error Rate = 0.5229859
## Testing Error Rate = 0.5585525
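
The error rates follow directly from the confusion matrices: one minus the share of predictions on the diagonal. As a worked check, the sketch below re-enters the method 1 training matrix by hand and reproduces the reported training error.

```r
# Training confusion matrix of method 1 (rows = prediction, columns = reference).
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
cm <- matrix(c(198,   4,  40,  57,
                 4, 271,  80, 105,
               298, 263, 442, 230,
                 2,  50,  16, 137),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Error rate = 1 - (correct predictions on the diagonal) / (all predictions).
error_rate <- 1 - sum(diag(cm)) / sum(cm)
print(error_rate)   # 0.5229859, matching the reported training error

# Per-class recall: share of each true class that was predicted correctly.
print(round(diag(cm) / colSums(cm), 3))
```

The per-class recall is a useful complement, because an overall error rate can hide a class that the model rarely or never predicts.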

The results are still not good. We suspect that balancing the classes left too little training data.

Method 2: supplement the Euthanasia data so that the classes are balanced, then select 10,000 records as training data.
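
The text does not say how the Euthanasia data is supplemented. The sketch below assumes it is done by resampling existing Euthanasia rows with replacement; `oversample_class`, the target size, and the toy data are all hypothetical.

```r
set.seed(2)  # arbitrary seed, only for reproducibility

# Hypothetical helper: resample one class with replacement up to `target` rows.
oversample_class <- function(df, label_col, class, target) {
  rows <- which(df[[label_col]] == class)
  if (length(rows) == 0 || length(rows) >= target) return(df)
  extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
  rbind(df, df[extra, , drop = FALSE])
}

toy <- data.frame(
  OutcomeType = factor(rep(c("Adoption", "Euthanasia", "Return", "Transfer"),
                           times = c(650, 90, 430, 390))),
  x = rnorm(1560)
)
balanced <- oversample_class(toy, "OutcomeType", "Euthanasia", target = 400)
print(table(balanced$OutcomeType))

# Then draw the training rows (10,000 in the text; 1,000 here for the toy data).
train_idx <- sample.int(nrow(balanced), 1000)
train_set <- balanced[train_idx, ]
```

If duplicates are created before the train/test split, copies of the same record can land in both sets, which tends to make the test error look better than it is. Oversampling only the training rows avoids this.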

## The number of Adoption in the training data is  3392
## The number of Return in the training data is 2272
## The number of Euthanasia in the training data is 2220
## The number of Transfer in the training data is 2116
## Time difference of 31.11945 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2990        614   1600     1042
##   Euthanasia      289       1447    603      605
##   Return            0          0      0        0
##   Transfer        113        159     69      469
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2759        583   1425      902
##   Euthanasia      240       1227    522      507
##   Return            0          0      0        0
##   Transfer        106        132     67      392
## Training Error Rate = 0.5094
## Testing Error Rate = 0.5059806

Method 3: supplement the Euthanasia data so that the classes are balanced, and also balance the training data. (Not fixed yet.)

## [1] 6497
## [1] 4286
## [1] 4161
## [1] 3917
## [1] 2170
## [1] 2412
## [1] 2458
## [1] 2112
## Time difference of 26.18733 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption        801         23    166      269
##   Euthanasia      179       1505    604      535
##   Return         1170        699   1554      799
##   Transfer         20        231     88      509
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1564          9    118      251
##   Euthanasia      345       1074    468      434
##   Return         2377        467   1224      674
##   Transfer         41        153     64      446
## Training Error Rate = 0.522618
## Testing Error Rate = 0.556288

Random Forest

Method 1: training data: 10,000 records; 600 trees; training data not rebalanced.
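
A random forest with 600 trees can be fitted with the `randomForest` package loaded at the top of this document. The sketch below uses a made-up data frame and column names, so it illustrates the call pattern only and does not reproduce the results reported here.

```r
# Runs only if the randomForest package is installed.
if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(4)  # arbitrary seed, only for reproducibility
  n <- 600
  toy <- data.frame(
    Age = factor(sample(c("Puppy", "AdultDog", "OldDog"), n, replace = TRUE)),
    Sex = factor(sample(c("Intact", "Fixed"), n, replace = TRUE))
  )
  # Made-up rule so that there is some signal to learn.
  toy$OutcomeType <- factor(ifelse(toy$Sex == "Fixed", "Adoption", "Transfer"))
  train_rows <- sample.int(n, 400)

  rf <- randomForest::randomForest(OutcomeType ~ ., data = toy[train_rows, ],
                                   ntree = 600)
  pred <- predict(rf, toy[-train_rows, ])
  rf_cm <- table(Prediction = pred, Reference = toy$OutcomeType[-train_rows])
  print(rf_cm)
  print(rf$confusion)   # out-of-bag confusion matrix stored by randomForest
}
```

The out-of-bag matrix in `rf$confusion` gives an error estimate from the training rows without a separate test set.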

## Time difference of 1.504966 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3748        149   1220     1073
##   Euthanasia        0        111      6        7
##   Return          342        174   1391      331
##   Transfer         67        113    172     1096
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1959         76    907      615
##   Euthanasia        0         13      8       14
##   Return          350        118    440      259
##   Transfer         31         90    142      522
## Training Error Rate = 0.3654
## Testing Error Rate = 0.4707792

Method 2: make the OutcomeType classes roughly equal in the training data. For each OutcomeType, draw a random number of records between 500 and 600. Use 600 trees.

## Time difference of 39.60153 secs

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption        503         19     60       88
##   Euthanasia       22        417     45       47
##   Return           70         73    445       74
##   Transfer          5         25     15      305
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3274         32    945      829
##   Euthanasia      440        157    864      748
##   Return         1983         77   1662      856
##   Transfer        200         44    250      970
## Training Error Rate = 0.2453683
## Testing Error Rate = 0.5451954

Method 3: training data: 10,000 records; 600 trees. Randomly select 10,000 records as training data, then balance the OutcomeType classes within the training data.

## The number of Adoption in the training data is  4159
## The number of Return in the training data is 4212
## The number of Euthanasia in the training data is 4269
## The number of Transfer in the training data is 4135
## Time difference of 1.631365 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2553        106    493      791
##   Euthanasia      370       3698    750      578
##   Return         1157        312   2771      888
##   Transfer         79        153    198     1878
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1182         15    314      340
##   Euthanasia      208        136    324      252
##   Return          878         80    766      376
##   Transfer         70         59    112      432
## Training Error Rate = 0.5875
## Testing Error Rate = 0.546176