Classification methods

Métodos Estatísticos em Data Mining

Luigi Ruberto

Data set

1000 clients applying for a mortgage

Original data set:

  • 20 attributes
    • 7 numerical, 13 categorical

Modified data set:

  • 24 attributes
    • all numerical
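
A minimal sketch of how the data could be read and split; the file name, the seed and the sampling scheme are placeholders, and only the 1000 clients and the 200-observation test set implied by the error matrices below come from the analysis itself.

## Sketch only: file name, seed and splitting scheme are assumptions.
## read.table's default column names V1, ..., V25 put the class label in V25,
## matching the model formulas used on the following slides.
credit <- read.table("credit-numeric.dat")

set.seed(1)
test.id <- sample(nrow(credit), 200)   # hold out 200 of the 1000 clients
train   <- credit[-test.id, ]
test    <- credit[ test.id, ]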

Cost matrix (rows = actual class, columns = predicted class):

           NO  YES
    NO      0    5
    YES     1    0

Classifying an actual NO as YES costs 5; the opposite error costs 1.
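
The cost matrix and the "Total cost" figures on the following slides can be written in R as below; the object names cost.matrix and total.cost are illustrative, not taken from the original code.

## Rows = actual class, columns = predicted class (0 = NO, 1 = YES).
cost.matrix <- matrix(c(0, 5,
                        1, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(actual = c("0", "1"),
                                      predicted = c("0", "1")))

## Total cost of a 2x2 error (confusion) matrix with the same layout.
total.cost <- function(error.matrix) sum(error.matrix * cost.matrix)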

Linear discriminant analysis

credit.l <- lda(V25 ~ ., prior = c(1, 1)/2, data = train)
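
A sketch of how the error matrix, probability matrix and total cost below could be obtained from the fitted model, assuming the test set and the total.cost helper sketched earlier; the variable names are illustrative.

pred.l <- predict(credit.l, test)$class   # predicted class for each test client
err.l  <- table(test$V25, pred.l)         # rows = actual, columns = predicted
err.l / nrow(test)                        # probability matrix (proportions of the test set)
total.cost(err.l)                         # 5 per actual-0-predicted-1 error, 1 per actual-1-predicted-0 error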

Error matrix (rows = actual class, columns = predicted class):

         0    1
    0   39   18
    1   33  110

Probability matrix (error matrix as proportions of the 200 test observations):

            0      1
    0   0.195  0.090
    1   0.165  0.550

Total cost:

[1] 123

Quadratic discriminant analysis

credit.q <- qda(V25 ~ ., prior = c(1, 1)/2, data = train)

Error matrix:

         0    1
    0   35   22
    1   42  101

Probability matrix:

            0      1
    0   0.175  0.110
    1   0.210  0.505

Total cost:

[1] 152

Logistic regression

credit.g <- glm(V25 ~ ., family = binomial, data = train)

Probability threshold = 0.5
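
One way the threshold could be applied to the fitted probabilities to produce the FALSE/TRUE tables below (a sketch, assuming the objects defined earlier).

prob.g <- predict(credit.g, test, type = "response")   # estimated P(class = 1)
pred.g <- prob.g > 0.5                                  # TRUE = classified as 1
err.g  <- table(test$V25, pred.g)                       # columns come out as FALSE/TRUE
total.cost(err.g)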

Error matrix (rows = actual class; columns: FALSE = predicted 0, TRUE = predicted 1):

        FALSE  TRUE
    0      24    33
    1      14   129

Probability matrix:

        FALSE   TRUE
    0    0.12  0.165
    1    0.07  0.645

Total cost:

[1] 179

Logistic regression

Probability threshold = 0.8

Since classifying an actual 0 as 1 costs five times as much as the opposite error, raising the threshold for predicting 1 trades the expensive errors for cheaper ones and lowers the total cost.

Error matrix:

        FALSE  TRUE
    0      46    11
    1      52    91

Probability matrix:

        FALSE   TRUE
    0    0.23  0.055
    1    0.26  0.455

Total cost:

[1] 107

Decision tree

credit.r <- rpart(formula = V25 ~ ., data = train, method = "class", cp = 0.001)
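
A sketch of how the tree's predictions could yield the FALSE/TRUE table below and how the tree on the next slide could be drawn; the 0.5 cut-off on the class-1 probability is an assumption made to match the table's layout.

prob.r <- predict(credit.r, test, type = "prob")[, 2]   # P(class = 1) from the tree
pred.r <- prob.r > 0.5
err.r  <- table(test$V25, pred.r)
total.cost(err.r)

plot(credit.r)            # tree structure (next slide)
text(credit.r, cex = 0.6)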

Error matrix:

        FALSE  TRUE
    0      29    28
    1      36   107

Probability matrix:

        FALSE   TRUE
    0   0.145  0.140
    1   0.180  0.535

Total cost:

[1] 176

Decision tree

[Figure: plot of the fitted classification tree]

K-nearest neighbours

Cross-validation to find the K that minimizes the total cost (one possible implementation is sketched below)

[Figure: cross-validated cost for each value of K]

[1] "K = " "2"

K-nearest neighbours

credit.K <- knn(train[, -25], test[, -25], cl, k = K)

Error matrix

         0    1
    0   26   31
    1   32  111

Probability matrix:

           0      1
    0   0.13  0.155
    1   0.16  0.555

Total cost:

[1] 187

Support vector machines

Cross-validation to find the C and gamma that minimize the total cost (one possible implementation is sketched below)

C \ gamma    2^-13   2^-11    2^-9    2^-7    2^-5    2^-3
2^5           94.6    70.6    68.1    65.9    64.9    80.7
2^7           69.5    68.2    65.2    70.6    68.5    80.7
2^9           68.4    66.0    67.2    68.0    72.6    80.7
2^11          65.5    65.2    71.1    69.8    72.6    80.7
2^13          65.1    68.6    68.6    73.2    72.6    80.7
2^15          65.8    71.9    69.2    72.4    72.6    80.7

[1] "C = " "32"
[1] "gamma = " "0.03125"

Support vector machines

credit.S <- svm(formula = V25 ~ ., data = train, type = "C-classification", 
    cost = C., gamma = gamma.)

Error matrix:

         0    1
    0   23   34
    1   12  131

Probability matrix:

            0      1
    0   0.115  0.170
    1   0.060  0.655

Total cost:

[1] 182

Conclusion

[Figure: comparison of the methods on the test set]

Best method

[1] "LR"

Logistic regression (with the 0.8 probability threshold) gives the lowest total cost on the test set: 107.