Classification methods

Métodos Estatísticos em Data Mining

Luigi Ruberto

Data set

1000 clients applying for a mortgage

Original data set:

  • 20 attributes
    • 7 numerical, 13 categorical

Modified data set:

  • 24 attributes
    • all numerical
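
A minimal sketch of how the data could be read and split; the file name, the seed and the sampling scheme are placeholders, and only the 1000 clients and the 200-observation test set implied by the error matrices below come from the analysis itself.

## Sketch only: file name, seed and splitting scheme are assumptions.
## read.table's default column names V1, ..., V25 put the class label in V25,
## matching the model formulas used on the following slides.
credit <- read.table("credit-numeric.dat")

set.seed(1)
test.id <- sample(nrow(credit), 200)   # hold out 200 of the 1000 clients
train   <- credit[-test.id, ]
test    <- credit[ test.id, ]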

Cost matrix (rows = actual class, columns = predicted class):

           NO  YES
    NO      0    5
    YES     1    0

Classifying an actual NO as YES costs 5; the opposite error costs 1.
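
The cost matrix and the "Total cost" figures on the following slides can be written in R as below; the object names cost.matrix and total.cost are illustrative, not taken from the original code.

## Rows = actual class, columns = predicted class (0 = NO, 1 = YES).
cost.matrix <- matrix(c(0, 5,
                        1, 0),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(actual = c("0", "1"),
                                      predicted = c("0", "1")))

## Total cost of a 2x2 error (confusion) matrix with the same layout.
total.cost <- function(error.matrix) sum(error.matrix * cost.matrix)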

Linear discriminant analysis

credit.l <- lda(V25 ~ ., prior = c(1, 1)/2, data = train)
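
A sketch of how the error matrix, probability matrix and total cost below could be obtained from the fitted model, assuming the test set and the total.cost helper sketched earlier; the variable names are illustrative.

pred.l <- predict(credit.l, test)$class   # predicted class for each test client
err.l  <- table(test$V25, pred.l)         # rows = actual, columns = predicted
err.l / nrow(test)                        # probability matrix (proportions of the test set)
total.cost(err.l)                         # 5 per actual-0-predicted-1 error, 1 per actual-1-predicted-0 error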

Error matrix (rows = actual class, columns = predicted class):

         0    1
    0   39   18
    1   33  110

Probability matrix (error matrix as proportions of the 200 test observations):

            0      1
    0   0.195  0.090
    1   0.165  0.550

Total cost:

[1] 123

Quadratic discriminant analysis

credit.q <- qda(V25 ~ ., prior = c(1, 1)/2, data = train)

Error matrix:

         0    1
    0   35   22
    1   42  101

Probability matrix:

            0      1
    0   0.175  0.110
    1   0.210  0.505

Total cost:

[1] 152

Logistic regression

credit.g <- glm(V25 ~ ., family = binomial, data = train)

Probability threshold = 0.5
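
One way the threshold could be applied to the fitted probabilities to produce the FALSE/TRUE tables below (a sketch, assuming the objects defined earlier).

prob.g <- predict(credit.g, test, type = "response")   # estimated P(class = 1)
pred.g <- prob.g > 0.5                                  # TRUE = classified as 1
err.g  <- table(test$V25, pred.g)                       # columns come out as FALSE/TRUE
total.cost(err.g)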

Error matrix (rows = actual class; columns: FALSE = predicted 0, TRUE = predicted 1):

        FALSE  TRUE
    0      24    33
    1      14   129

Probability matrix:

        FALSE   TRUE
    0    0.12  0.165
    1    0.07  0.645

Total cost:

[1] 179

Logistic regression

Probability threshold = 0.8

Since classifying an actual 0 as 1 costs five times as much as the opposite error, raising the threshold for predicting 1 trades the expensive errors for cheaper ones and lowers the total cost.

Error matrix:

        FALSE  TRUE
    0      46    11
    1      52    91

Probability matrix:

        FALSE   TRUE
    0    0.23  0.055
    1    0.26  0.455

Total cost:

[1] 107

Decision tree

credit.r <- rpart(formula = V25 ~ ., data = train, method = "class", cp = 0.001)
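
A sketch of how the tree's predictions could yield the FALSE/TRUE table below and how the tree on the next slide could be drawn; the 0.5 cut-off on the class-1 probability is an assumption made to match the table's layout.

prob.r <- predict(credit.r, test, type = "prob")[, 2]   # P(class = 1) from the tree
pred.r <- prob.r > 0.5
err.r  <- table(test$V25, pred.r)
total.cost(err.r)

plot(credit.r)            # tree structure (next slide)
text(credit.r, cex = 0.6)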

Error matrix:

        FALSE  TRUE
    0      29    28
    1      36   107

Probability matrix:

        FALSE   TRUE
    0   0.145  0.140
    1   0.180  0.535

Total cost:

[1] 176

Decision tree

[Figure: plot of the fitted classification tree]

K-nearest neighbours

Cross-validation to find the K that minimizes the total cost (one possible implementation is sketched below)

[Figure: cross-validated cost for each value of K]

[1] "K = " "2"

K-nearest neighbours

credit.K <- knn(train[, -25], test[, -25], cl, k = K)

Error matrix

         0    1
    0   26   31
    1   32  111

Probability matrix:

           0      1
    0   0.13  0.155
    1   0.16  0.555

Total cost:

[1] 187

Support vector machines

Cross-validation to find the C and gamma that minimize the total cost (one possible implementation is sketched below)

C \ gamma    2^-13   2^-11    2^-9    2^-7    2^-5    2^-3
2^5           94.6    70.6    68.1    65.9    64.9    80.7
2^7           69.5    68.2    65.2    70.6    68.5    80.7
2^9           68.4    66.0    67.2    68.0    72.6    80.7
2^11          65.5    65.2    71.1    69.8    72.6    80.7
2^13          65.1    68.6    68.6    73.2    72.6    80.7
2^15          65.8    71.9    69.2    72.4    72.6    80.7

[1] "C = " "32"
[1] "gamma = " "0.03125"

Support vector machines

credit.S <- svm(formula = V25 ~ ., data = train, type = "C-classification", 
    cost = C., gamma = gamma.)

Error matrix:

         0    1
    0   23   34
    1   12  131

Probability matrix:

            0      1
    0   0.115  0.170
    1   0.060  0.655

Total cost:

[1] 182

Conclusion

[Figure: comparison of the methods on the test set]

Best method

[1] "LR"

Logistic regression (with the 0.8 probability threshold) gives the lowest total cost on the test set: 107.