Renew:
06/02 line 37~47 => Used to see the number of each OutcomeType
      line 50~58 => Not yet.
      line 61~168 => Breed
      line 174~202 => Age
      line 207~290

06/04 line 205~265 => Color
      line 465~617 => Random Forest

Data Processing

Our data set contains 26,729 records and 9 explanatory variables.

library(dplyr)
library(ggplot2)
library(e1071)
library(randomForest)
library(caret)
library(gridExtra)
trainInit<-read.csv("train.csv")
head(trainInit)
##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner               
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430         2014-07-11 19:09:00        Transfer        Partner
## 5  A667013         2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan

Remove the three columns "AnimalID", "Name", and "OutcomeSubtype" (columns 1, 2, and 5),
then split the data into dogs and cats.

train<-trainInit[,-c(1,2,5)]
head(train)
##              DateTime     OutcomeType AnimalType SexuponOutcome
## 1 2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 2 2013-10-13 12:44:00      Euthanasia        Cat  Spayed Female
## 3 2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 4 2014-07-11 19:09:00        Transfer        Cat    Intact Male
## 5 2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6 2014-04-25 13:04:00        Transfer        Dog  Intact Female
##   AgeuponOutcome                             Breed       Color
## 1         1 year             Shetland Sheepdog Mix Brown/White
## 2         1 year            Domestic Shorthair Mix Cream Tabby
## 3        2 years                      Pit Bull Mix  Blue/White
## 4        3 weeks            Domestic Shorthair Mix  Blue Cream
## 5        2 years       Lhasa Apso/Miniature Poodle         Tan
## 6        1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
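# Split by species with logical indexing; this avoids attach() and the
# empty-result pitfall of negative which() indexing.
Dogtrain <- train[train$AnimalType == "Dog", ]
Cattrain <- train[train$AnimalType != "Dog", ]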
head(Dogtrain)
##               DateTime     OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00 Return_to_owner        Dog  Neutered Male
## 3  2015-01-31 12:28:00        Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00        Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00        Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00        Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00        Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White
attach(Dogtrain)

* The dog subset contains 15,595 records in 7 columns: the response OutcomeType and 6 explanatory variables.

OutcomeType

First, count how many records each OutcomeType has.

## Return to owner = 4286
## Transfer = 3917
## Adoption = 6497
## Died = 50
## Euthanasia = 845

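Because there are so few "Died" records, we decided to drop them.
We also rename the OutcomeType value "Return_to_owner" to "Return",
because the later XGBoost step otherwise raises an error saying that "Return_to_owner" exceeds 64 bits.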
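
The filtering and renaming described here could look roughly like the following. This is a minimal sketch, not the authors' code: the toy data frame `dogs` and the helper `clean_outcome` are illustrative names, and the sketch assumes OutcomeType is stored as a factor or character column.

```r
# Toy stand-in for Dogtrain (values chosen for illustration only).
dogs <- data.frame(
  OutcomeType = c("Return_to_owner", "Died", "Adoption", "Transfer", "Euthanasia"),
  stringsAsFactors = TRUE
)

clean_outcome <- function(df) {
  # Drop the rare "Died" rows.
  df <- df[df$OutcomeType != "Died", , drop = FALSE]
  # Work on character values so the unused "Died" level disappears as well.
  outcome <- as.character(df$OutcomeType)
  outcome[outcome == "Return_to_owner"] <- "Return"
  df$OutcomeType <- factor(outcome)
  df
}

dogs_clean <- clean_outcome(dogs)
table(dogs_clean$OutcomeType)
```

Rebuilding the factor from character values matters here: subsetting alone would leave an empty "Died" level behind, which some classifiers treat as a fifth class.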

##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##    AgeuponOutcome                             Breed       Color
## 1          1 year             Shetland Sheepdog Mix Brown/White
## 3         2 years                      Pit Bull Mix  Blue/White
## 5         2 years       Lhasa Apso/Miniature Poodle         Tan
## 6         1 month Cairn Terrier/Chihuahua Shorthair   Black/Tan
## 9        5 months     American Pit Bull Terrier Mix   Red/White
## 10         1 year                     Cairn Terrier       White

Breed

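Because there are too many distinct dog breeds, we first plot how often each breed appears,
then keep the 30 most frequent breeds and group all the others into "Other Breed" to simplify the analysis.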
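
The top-30 grouping can be sketched in base R as below. The helper `lump_top_n` and the toy `breed_raw` vector are hypothetical names used only for illustration. The authors' actual mapping evidently also normalizes breed names (for example "Pit Bull Mix" becomes "Pit Bull"), which this sketch does not attempt.

```r
# Keep the n most frequent values of x and relabel everything else.
lump_top_n <- function(x, n = 30, other = "Other Breed") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n, length(counts)))]
  factor(ifelse(x %in% keep, x, other))
}

# Toy breed column (invented values); n = 2 keeps the example small.
breed_raw <- c("Pit Bull", "Pit Bull", "Pit Bull", "Cairn Terrier",
               "Cairn Terrier", "Shetland Sheepdog", "Beagle")
breed <- lump_top_n(breed_raw, n = 2)
table(breed)
```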

##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##    AgeuponOutcome       Color            breed
## 1          1 year Brown/White      Other Breed
## 3         2 years  Blue/White         Pit Bull
## 5         2 years         Tan Miniature Poodle
## 6         1 month   Black/Tan    Cairn Terrier
## 9        5 months   Red/White      Other Breed
## 10         1 year       White    Cairn Terrier

Age

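Because the dogs' ages are spread very widely, from newborn up to 19 years, and are recorded in weeks, months, or years,
we group them roughly into puppies (under 1 year), adult dogs (1 to 7 years), and old dogs (over 7 years) to simplify the analysis.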
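
One way to implement the stated rule is to convert each age string to years and then bin it. This is a sketch under assumptions: the helpers `age_in_years` and `age_group` are illustrative names, a dog of exactly 7 years is assumed to count as adult, and the labels it produces follow the rule as written, so they are not guaranteed to match the authors' output.

```r
# Convert strings such as "3 weeks" or "2 years" to an age in years.
age_in_years <- function(age) {
  age <- as.character(age)
  value <- suppressWarnings(as.numeric(sub(" .*$", "", age)))
  unit <- sub("s$", "", sub("^[0-9]+ ", "", age))   # "years" -> "year"
  per_year <- c(day = 365, week = 52, month = 12, year = 1)
  unname(value / per_year[unit])                    # NA for blank or unknown units
}

# Assumed boundaries: under 1 year = Puppy, 1 to 7 years = AdultDog, over 7 = OldDog.
age_group <- function(age) {
  years <- age_in_years(age)
  label <- ifelse(years < 1, "Puppy", ifelse(years <= 7, "AdultDog", "OldDog"))
  factor(label, levels = c("Puppy", "AdultDog", "OldDog"))
}

groups <- age_group(c("3 weeks", "1 month", "1 year", "2 years", "7 years", "10 years"))
groups
```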

##               DateTime OutcomeType AnimalType SexuponOutcome       Color
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male Brown/White
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male  Blue/White
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male         Tan
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female   Black/Tan
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female   Red/White
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female       White
##               breed      Age
## 1       Other Breed   OldDog
## 3          Pit Bull AdultDog
## 5  Miniature Poodle AdultDog
## 6     Cairn Terrier    Puppy
## 9       Other Breed    Puppy
## 10    Cairn Terrier   OldDog

Color

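Because there are too many distinct colors, some of which also describe features such as spots or markings,
we first sort them into six broad classes: solid, two-color, tri-color, spotted, patched, and marked.
Next, we wanted to split the colors further into dark and light to see whether that helps the analysis.
We again plotted the frequencies and found that colors ranked below the top 20 appear too rarely,
so from the 20 most frequent colors we picked out the plain color names,
divided them into dark and light, and used them to subdivide the solid class.
In the end we obtain eight classes: dark, light, other solid, two-color, tri-color, spotted, patched, and marked.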
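
A rule-based classifier along these lines might look like the sketch below. Everything specific in it is an assumption: the source does not list which colors count as dark or light, which words signal spots, patches, or markings, or which rule wins when several apply. Only the labels "Double" and "Light" appear in the output; the other label names are made up for illustration.

```r
# Hypothetical keyword lists; the source does not say which colors fall in each class.
dark_colors  <- c("Black", "Brown", "Chocolate", "Blue")
light_colors <- c("White", "Tan", "Cream", "Yellow")
patterns <- c(Spotted = "Tick|Merle", Patched = "Brindle", Marked = "Sable|Smoke")

classify_color <- function(color) {
  color <- as.character(color)
  out <- rep("OtherSolid", length(color))
  out[color %in% dark_colors]  <- "Dark"
  out[color %in% light_colors] <- "Light"
  # Later rules override earlier ones: pattern words first, then color counts.
  for (label in names(patterns)) {
    out[grepl(patterns[[label]], color)] <- label
  }
  out[grepl("/", color, fixed = TRUE)] <- "Double"
  out[grepl("Tricolor", color, fixed = TRUE)] <- "Triple"
  out
}

color_fix <- classify_color(c("Brown/White", "Tan", "White", "Black",
                              "Blue Merle", "Tricolor", "Red"))
color_fix
```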

##               DateTime OutcomeType AnimalType SexuponOutcome
## 1  2014-02-12 18:22:00      Return        Dog  Neutered Male
## 3  2015-01-31 12:28:00    Adoption        Dog  Neutered Male
## 5  2013-11-15 12:52:00    Transfer        Dog  Neutered Male
## 6  2014-04-25 13:04:00    Transfer        Dog  Intact Female
## 9  2014-02-04 17:17:00    Adoption        Dog  Spayed Female
## 10 2014-05-03 07:48:00    Adoption        Dog  Spayed Female
##               breed      Age ColorFix
## 1       Other Breed   OldDog   Double
## 3          Pit Bull AdultDog   Double
## 5  Miniature Poodle AdultDog    Light
## 6     Cairn Terrier    Puppy   Double
## 9       Other Breed    Puppy   Double
## 10    Cairn Terrier   OldDog    Light

Time

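Because the date range is very wide, and neither longer units (years) nor shorter units (weeks, days, and so on)
seem likely to reveal much about how the animals in the shelter change,
we reduce DateTime to one of the 12 months and analyze that instead.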
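
Reducing a timestamp to its month can be done in base R as sketched here. The helper name `to_month` is illustrative, and the sketch assumes timestamps in the "YYYY-MM-DD HH:MM:SS" form shown in the data.

```r
# Map a "YYYY-MM-DD HH:MM:SS" string to an English month name.
# month.name is a base R constant, so the result does not depend on the locale.
to_month <- function(datetime) {
  parsed <- as.POSIXct(as.character(datetime),
                       format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  factor(month.name[as.integer(format(parsed, "%m"))], levels = month.name)
}

months_out <- to_month(c("2014-02-12 18:22:00", "2015-01-31 12:28:00",
                         "2013-11-15 12:52:00"))
months_out
```

Fixing the factor levels to `month.name` keeps the months in calendar order in tables and plots, rather than alphabetical order.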

##    DateTime OutcomeType AnimalType SexuponOutcome            breed
## 1  February      Return        Dog  Neutered Male      Other Breed
## 3   January    Adoption        Dog  Neutered Male         Pit Bull
## 5  November    Transfer        Dog  Neutered Male Miniature Poodle
## 6     April    Transfer        Dog  Intact Female    Cairn Terrier
## 9  February    Adoption        Dog  Spayed Female      Other Breed
## 10      May    Adoption        Dog  Spayed Female    Cairn Terrier
##         Age ColorFix
## 1    OldDog   Double
## 3  AdultDog   Double
## 5  AdultDog    Light
## 6     Puppy   Double
## 9     Puppy   Double
## 10   OldDog    Light

SVM

Case 1 :

Randomly select 10,000 records as the training data.
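
A random split of this kind can be sketched as follows. The seed and the toy data frame `dogs` are assumptions for illustration; the total of 15,544 rows is taken from the confusion matrices in this section (10,000 training plus 5,544 testing records).

```r
set.seed(1)  # arbitrary seed, only so the split is reproducible

# Toy stand-in for the cleaned dog data.
n_total <- 15544
dogs <- data.frame(id = seq_len(n_total))

train_idx <- sample(n_total, 10000)            # 10,000 row indices, no replacement
train_set <- dogs[train_idx, , drop = FALSE]
test_set  <- dogs[-train_idx, , drop = FALSE]  # every remaining row

c(train = nrow(train_set), test = nrow(test_set))
```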

## Time difference of 28.20861 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       4104        255   2299     1385
##   Euthanasia        0          0      0        0
##   Return           19        148    250      226
##   Transfer         56        162    232      864
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2270        138   1231      823
##   Euthanasia        0          0      0        0
##   Return           10         61    133      120
##   Transfer         38         80    141      499
## Training Error Rate = 0.4782
## Testing Error Rate = 0.4765512
## The number of Adoption in the training data is  4179
## The number of Return in the training data is 2781
## The number of Euthanasia in the training data is 565
## The number of Transfer in the training data is 2475
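
The training error rate and the class counts above can be recomputed directly from the training confusion matrix. The short check below also computes per-class recall, which shows that this model never predicts Euthanasia. The variable names are illustrative.

```r
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
# Rows are predictions, columns are reference labels (training data, SVM Case 1).
cm <- matrix(c(4104, 255, 2299, 1385,
                  0,   0,    0,    0,
                 19, 148,  250,  226,
                 56, 162,  232,  864),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

error_rate <- 1 - sum(diag(cm)) / sum(cm)   # share of misclassified records
recall <- diag(cm) / colSums(cm)            # share of each true class predicted correctly

error_rate          # 0.4782
colSums(cm)         # true class counts in the training data
round(recall, 3)
```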

Case 2 :

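Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, top up the Euthanasia records in the training data until their count is within 100 to 200 of the smallest other OutcomeType, without exceeding it.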
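
One possible way to do this topping up is random oversampling, sketched below. This is an assumption: the source does not say whether the extra Euthanasia rows are duplicates of training rows or come from elsewhere. The helper `balance_class`, the seed, and the gap of 150 are illustrative choices.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Oversample one class until it sits `gap` rows below the smallest other class.
balance_class <- function(df, label_col, minority, gap = 150) {
  counts <- table(df[[label_col]])
  target <- min(counts[names(counts) != minority]) - gap
  extra <- target - as.integer(counts[minority])
  if (extra <= 0) return(df)
  minority_rows <- which(df[[label_col]] == minority)
  # Duplicate randomly chosen minority rows (sampling with replacement).
  dup <- minority_rows[sample.int(length(minority_rows), extra, replace = TRUE)]
  rbind(df, df[dup, , drop = FALSE])
}

# Toy training set with the class counts reported for Case 1.
toy <- data.frame(OutcomeType = rep(c("Adoption", "Euthanasia", "Return", "Transfer"),
                                    times = c(4179, 565, 2781, 2475)))
balanced <- balance_class(toy, "OutcomeType", "Euthanasia")
table(balanced$OutcomeType)
```

Only the training set should be balanced this way; the testing data keeps its original class proportions so that the testing error rate stays comparable with Case 1.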

## Time difference of 38.38219 secs
## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3772        661   1974     1317
##   Euthanasia       41       1045    397      500
##   Return          289        444    316      150
##   Transfer         41        209     81      584
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2140         94   1088      672
##   Euthanasia       23        138    216      261
##   Return          173         45    155       91
##   Transfer         18         29     59      342
## Training Error Rate = 0.5164
## Testing Error Rate = 0.4994589
## The number of Adoption in the training data is  4143
## The number of Euthanasia in the training data is 2359
## The number of Return in the training data is 2768
## The number of Transfer in the training data is 2551
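
Balancing grows the training set beyond 10,000 rows, so the training error rate has to be taken over the total of the training confusion matrix above. The check below recomputes it from that matrix; the variable names are illustrative.

```r
classes <- c("Adoption", "Euthanasia", "Return", "Transfer")
# Rows are predictions, columns are reference labels (balanced training data, SVM Case 2).
cm <- matrix(c(3772,  661, 1974, 1317,
                 41, 1045,  397,  500,
                289,  444,  316,  150,
                 41,  209,   81,  584),
             nrow = 4, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n_train <- sum(cm)                    # 11821 rows after balancing
n_wrong <- n_train - sum(diag(cm))    # 6104 misclassified rows

n_wrong / n_train                     # error rate over the balanced training set
n_wrong / 10000                       # 0.6104 if the pre-balancing size is used instead
```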

Random Forest

Case 1 :

Randomly select 10,000 records as the training data.

## Time difference of 1.304458 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3852        163   1346     1138
##   Euthanasia        0        107      7        3
##   Return          259        149   1214      269
##   Transfer         63        135    177     1118
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       2020         93    985      634
##   Euthanasia        2         10     14        8
##   Return          259         96    373      195
##   Transfer         42         91    170      552
## Training Error Rate = 0.3709
## Testing Error Rate = 0.4669913

Case 2 :

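Randomly select 10,000 records as the training data,
then make the training data more balanced.
That is, top up the Euthanasia records in the training data until their count is within 100 to 200 of the smallest other OutcomeType, without exceeding it.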

## Time difference of 1.333776 mins

## The confusion Matrix of training data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       3803        315   1286     1077
##   Euthanasia      166       1879    354      270
##   Return          181         66    982      170
##   Transfer         54         70    136      967
## The confusion Matrix of testing data.
##             Reference
## Prediction   Adoption Euthanasia Return Transfer
##   Adoption       1918         78    914      647
##   Euthanasia      114        113    238      189
##   Return          232         46    262      129
##   Transfer         29         53    114      468
## Training Error Rate = 0.3520
## Testing Error Rate = 0.5019841
## The number of Adoption in the training data is  4204
## The number of Return in the training data is 2758
## The number of Euthanasia in the training data is 2330
## The number of Transfer in the training data is 2484