To create dataset A, we first fix three mean vectors and three variance-covariance matrices. The mean vectors can be chosen arbitrarily; we call them mu1, mu2 and mu3.
mu1 = ( 0,0,0 )
mu2 = ( 1,2,1 )
mu3 = ( -1,-2,-3 )
Next we choose a variance-covariance matrix for each of the three means. One way to do this is to pick three arbitrary full-rank 3 x 3 matrices and multiply each by its own transpose, which gives a symmetric positive definite matrix. We call the results sigma1, sigma2 and sigma3. We choose the variances so that the three populations are neither too well separated nor too heavily overlapping.
sigma1
      [,1]  [,2] [,3]
[1,]  16.5 -13.0 -7.0
[2,] -13.0  35.0 11.5
[3,]  -7.0  11.5 10.5
sigma2
      [,1] [,2] [,3]
[1,]  22.5  -21   -7
[2,] -21.0   25    4
[3,]  -7.0    4    9
sigma3
     [,1] [,2] [,3]
[1,] 16.5 14.5  0.5
[2,] 14.5 37.0 -7.5
[3,]  0.5 -7.5  9.0
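The construction described above can be sketched as follows. The matrix `A` is a hypothetical example chosen for illustration; it is not the matrix that produced sigma1, sigma2 or sigma3, which the report does not show. The second half checks that sigma1 as printed above is a valid covariance matrix.

```r
# Hypothetical full-rank 3 x 3 matrix; the entries are illustrative only.
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 1), nrow = 3, byrow = TRUE)

# A square matrix is full rank exactly when its determinant is non-zero.
stopifnot(abs(det(A)) > 1e-8)

# A %*% t(A) is symmetric and, for full-rank A, positive definite.
sigma <- A %*% t(A)
is_symmetric <- isSymmetric(sigma)
is_positive_definite <- all(eigen(sigma, symmetric = TRUE)$values > 0)

# The same check applied to sigma1 as printed in the report.
sigma1 <- matrix(c(16.5, -13.0, -7.0,
                   -13.0, 35.0, 11.5,
                   -7.0, 11.5, 10.5), nrow = 3, byrow = TRUE)
sigma1_is_valid <- isSymmetric(sigma1) &&
  all(eigen(sigma1, symmetric = TRUE)$values > 0)
```

Checking the eigenvalues is a convenient guard before simulating, since a sampler for the multivariate normal needs a positive definite (or at least positive semi-definite) covariance matrix.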
To simulate from the above three distributions we use the MASS package. We draw 500 samples from each distribution, store them in a data frame, and add a column with labels 1, 2 and 3 indicating which distribution each sample comes from. The first few data points are displayed below.
one two three which
1 -0.8457141 4.646275 -1.12685840 1
2 1.1431884 3.436976 -2.22989579 1
3 1.9361008 -10.498659 -3.15960576 1
4 -1.8734482 -1.561796 -0.02002764 1
5 2.4289202 2.982336 -6.58705589 1
6 6.1737001 -10.069912 -1.92189660 1
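A minimal sketch of this simulation step, assuming `MASS::mvrnorm` is the sampler used (the report only names the package). The seed is hypothetical, so the draws will not match the values printed above; the column names follow the printed output.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)  # hypothetical seed; the original seed is not stated

mu1 <- c(0, 0, 0)
mu2 <- c(1, 2, 1)
mu3 <- c(-1, -2, -3)
# The matrices are symmetric, so column-major filling reproduces them.
sigma1 <- matrix(c(16.5, -13, -7, -13, 35, 11.5, -7, 11.5, 10.5), 3, 3)
sigma2 <- matrix(c(22.5, -21, -7, -21, 25, 4, -7, 4, 9), 3, 3)
sigma3 <- matrix(c(16.5, 14.5, 0.5, 14.5, 37, -7.5, 0.5, -7.5, 9), 3, 3)

n <- 500  # samples per distribution
draws <- rbind(mvrnorm(n, mu1, sigma1),
               mvrnorm(n, mu2, sigma2),
               mvrnorm(n, mu3, sigma3))

# One row per sample; `which` records the source distribution.
dataset_a <- data.frame(one = draws[, 1],
                        two = draws[, 2],
                        three = draws[, 3],
                        which = rep(1:3, each = n))
head(dataset_a)
```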
Next we create (scaled) training and test sets for dataset A.
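One way to build such sets is sketched below. The report does not say how the split or the scaling was done, so the stratified 250/250 split per class (consistent with the confusion matrices) and the reuse of training means and standard deviations for the test set are assumptions. The data frame is a placeholder with the same layout as dataset A.

```r
set.seed(1)  # hypothetical seed

# Stand-in data with the layout of dataset A (values are placeholders).
dataset_a <- data.frame(one = rnorm(1500), two = rnorm(1500),
                        three = rnorm(1500), which = rep(1:3, each = 500))

# Stratified split: half of each class goes to the training set.
rows_by_class <- split(seq_len(nrow(dataset_a)), dataset_a$which)
train_idx <- unlist(lapply(rows_by_class,
                           function(idx) sample(idx, length(idx) / 2)))

features <- c("one", "two", "three")
train_x <- scale(dataset_a[train_idx, features])
# Assumption: the test set is scaled with the training means and SDs.
test_x <- scale(dataset_a[-train_idx, features],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))

train_y <- factor(dataset_a$which[train_idx])
test_y <- factor(dataset_a$which[-train_idx])
```

Scaling the test set with training statistics keeps information from the test set out of the fitted classifiers.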
We plot the simulated dataset A.
We display the confusion matrices and error rates for the Bayes classifier.
Confusion Matrix of predicting Test Data
. 1 2 3
1 62 44 50
2 123 184 29
3 65 22 171
Test Error rate is = 44.4 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 65 37 41
2 121 194 31
3 64 19 178
Training Error rate is = 41.7 %
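The error rates reported with each confusion matrix are the share of observations off the diagonal. The sketch below recomputes the test error from the test confusion matrix printed above. The helper name `error_rate` is hypothetical, and reading rows as predicted classes and columns as true classes is an inference from the column sums of 250.

```r
# Test confusion matrix reported above
# (rows: predicted class, columns: true class, judging by the column sums).
confusion <- matrix(c( 62,  44,  50,
                      123, 184,  29,
                       65,  22, 171),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = as.character(1:3),
                                    actual = as.character(1:3)))

# Error rate in percent: observations off the diagonal over all observations.
error_rate <- function(cm) 100 * (1 - sum(diag(cm)) / sum(cm))

cat("Test Error rate is =", round(error_rate(confusion), 1), "%\n")
```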
(i) LDA
Confusion Matrix of predicting Test Data
. 1 2 3
1 104 60 51
2 88 176 35
3 58 14 164
Test Error rate is = 40.8 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 100 46 46
2 88 190 33
3 62 14 171
Training Error rate is = 38.5 %
(ii) QDA
Confusion Matrix of predicting Test Data
. 1 2 3
1 129 26 32
2 83 215 24
3 38 9 194
Test Error rate is = 28.3 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 137 19 35
2 80 221 25
3 33 10 190
Training Error rate is = 26.9 %
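For reference, LDA and QDA fits of this kind can be sketched with `MASS::lda` and `MASS::qda`. Whether the report used these functions is an assumption. The data are re-simulated from the means and covariance matrices given earlier with a hypothetical seed and without scaling, so the error rates will differ from those printed above.

```r
library(MASS)  # lda(), qda(), mvrnorm()

set.seed(1)  # hypothetical seed

means <- list(c(0, 0, 0), c(1, 2, 1), c(-1, -2, -3))
sigmas <- list(
  matrix(c(16.5, -13, -7, -13, 35, 11.5, -7, 11.5, 10.5), 3, 3),
  matrix(c(22.5, -21, -7, -21, 25, 4, -7, 4, 9), 3, 3),
  matrix(c(16.5, 14.5, 0.5, 14.5, 37, -7.5, 0.5, -7.5, 9), 3, 3))

# Draw n points per class and label them 1, 2, 3.
simulate <- function(n) {
  x <- do.call(rbind, Map(function(m, s) mvrnorm(n, m, s), means, sigmas))
  colnames(x) <- c("one", "two", "three")
  list(x = x, y = factor(rep(1:3, each = n)))
}
train <- simulate(250)
test <- simulate(250)

# Misclassification rate in percent.
error_rate <- function(pred, truth) {
  100 * mean(as.character(pred) != as.character(truth))
}

lda_fit <- lda(train$x, grouping = train$y)  # one pooled covariance matrix
qda_fit <- qda(train$x, grouping = train$y)  # one covariance matrix per class

lda_test_error <- error_rate(predict(lda_fit, test$x)$class, test$y)
qda_test_error <- error_rate(predict(qda_fit, test$x)$class, test$y)
```

Because the three populations have different covariance matrices, QDA's per-class covariance estimates match the generating model more closely than LDA's pooled estimate.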
(iii) SVM (one versus one)
Confusion Matrix of predicting Test Data
. 1 2 3
1 113 17 26
2 93 220 16
3 44 13 208
Test Error rate is = 27.9 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 117 13 26
2 97 222 24
3 36 15 200
Training Error rate is = 28.1 %
(iii) SVM (one versus rest)
Confusion Matrix of predicting Test Data
. 1 2 3
1 95 10 16
2 97 225 18
3 58 15 216
Test Error rate is = 28.5 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 89 8 13
2 112 224 25
3 49 18 212
Training Error rate is = 30 %
(iv) KNN (K = 1)
Confusion Matrix of predicting Test Data
. 1 2 3
1 135 72 44
2 71 147 23
3 44 31 183
Test Error rate is = 38 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 250 0 0
2 0 250 0
3 0 0 250
Training Error rate is = 0 %
(iv) KNN (K = 3)
Confusion Matrix of predicting Test Data
. 1 2 3
1 135 69 34
2 73 154 18
3 42 27 198
Test Error rate is = 35.1 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 171 22 26
2 46 215 18
3 33 13 206
Training Error rate is = 21.1 %
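The zero training error for K = 1 is expected: each training point is its own nearest neighbour, so it always receives its own label when there are no duplicate points. A small sketch with `class::knn` shows this on placeholder data (not dataset A); the use of `class::knn` and the seed are assumptions.

```r
library(class)  # knn()

set.seed(1)  # hypothetical seed

# Two overlapping Gaussian classes in two dimensions (placeholder data).
make_data <- function(n) {
  x <- rbind(matrix(rnorm(2 * n), ncol = 2),
             matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
  list(x = x, y = factor(rep(1:2, each = n)))
}
train <- make_data(100)
test <- make_data(100)

# Misclassification rate in percent.
error_rate <- function(pred, truth) {
  100 * mean(as.character(pred) != as.character(truth))
}

for (k in c(1, 3)) {
  train_error <- error_rate(knn(train$x, train$x, train$y, k = k), train$y)
  test_error <- error_rate(knn(train$x, test$x, train$y, k = k), test$y)
  cat("K =", k, "training error:", train_error,
      "% test error:", test_error, "%\n")
}

train_error_k1 <- error_rate(knn(train$x, train$x, train$y, k = 1), train$y)
```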
(i) 1-NN with multiedit
Compared with plain 1-NN, multiedit improves classification of the second and third classes (211 and 190 test points correct, against 147 and 183) but does much worse on the first class (95 correct, against 135). Perhaps this suggests that multiedit is not suitable for the data at hand.
Training set reduces in size from 750 to 293
Confusion Matrix of predicting Test Data
. 1 2 3
1 95 29 23
2 112 211 37
3 43 10 190
Test Error rate is = 33.9 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 96 22 22
2 118 220 41
3 36 8 187
Training Error rate is = 32.9 %
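The report does not show its multiedit implementation. The sketch below follows one common formulation: randomly partition the data, classify each subset by 1-NN using the next subset as reference, discard misclassified points, and stop after several consecutive passes without a removal. The function name and the parameters `n_subsets` and `max_quiet` are hypothetical.

```r
library(class)  # knn()

multiedit <- function(x, y, n_subsets = 3, max_quiet = 5) {
  quiet <- 0  # consecutive passes that removed nothing
  while (quiet < max_quiet && nrow(x) > n_subsets) {
    # Diffusion: random partition of the current set into n_subsets parts.
    part <- sample(rep_len(seq_len(n_subsets), nrow(x)))
    keep <- rep(TRUE, nrow(x))
    for (i in seq_len(n_subsets)) {
      cur <- part == i
      ref <- part == (i %% n_subsets) + 1  # next subset acts as reference
      pred <- knn(x[ref, , drop = FALSE], x[cur, , drop = FALSE],
                  y[ref], k = 1)
      # Editing: keep only the points the reference subset classifies correctly.
      keep[cur] <- as.character(pred) == as.character(y[cur])
    }
    quiet <- if (all(keep)) quiet + 1 else 0
    x <- x[keep, , drop = FALSE]
    y <- y[keep]
  }
  list(x = x, y = y)
}

# Usage on placeholder data: two overlapping Gaussian classes.
set.seed(1)  # hypothetical seed
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 1.5), ncol = 2))
y <- factor(rep(1:2, each = 100))
edited <- multiedit(x, y)
cat("Training set reduces in size from", nrow(x), "to", nrow(edited$x), "\n")
```

Editing removes points near the class boundaries, which is why heavily overlapping classes can lose a large share of their points.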
(ii) 1-NN with condensation
Condensation works reasonably well here: it reduces the training set from 750 to 447 points, while the test error rises only from 38% (plain 1-NN) to 40.8%.
Training set reduces in size from 750 to 447
Confusion Matrix of predicting Test Data
. 1 2 3
1 127 73 55
2 71 144 22
3 52 33 173
Test Error rate is = 40.8 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 250 0 0
2 0 250 0
3 0 0 250
Training Error rate is = 0 %
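The zero training error is what a condensing rule in the style of Hart's condensed nearest neighbour would produce, since such a rule stops only when the retained subset classifies every training point correctly. The sketch below is one such rule on placeholder data; whether the report used this exact variant is an assumption, and the function name `condense` is hypothetical.

```r
library(class)  # knn()

condense <- function(x, y) {
  store <- 1L  # indices of retained points; start from the first sample
  repeat {
    moved <- FALSE
    for (i in setdiff(seq_len(nrow(x)), store)) {
      pred <- knn(x[store, , drop = FALSE], x[i, , drop = FALSE],
                  y[store], k = 1)
      # A point misclassified by the current store is added to the store.
      if (as.character(pred) != as.character(y[i])) {
        store <- c(store, i)
        moved <- TRUE
      }
    }
    if (!moved) break  # a full pass without transfers: the subset is consistent
  }
  sort(store)
}

# Usage on placeholder data: two overlapping Gaussian classes.
set.seed(1)  # hypothetical seed
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 1.5), ncol = 2))
y <- factor(rep(1:2, each = 100))
kept <- condense(x, y)

# 1-NN on the condensed set reproduces every training label.
train_pred <- knn(x[kept, , drop = FALSE], x, y[kept], k = 1)
training_error <- 100 * mean(as.character(train_pred) != as.character(y))
```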
(iii) 1-NN with multiedit and condense
We see that the training set shrinks to a very small size (4 points), which makes meaningful classification impossible: every point is assigned to class 1. This is perhaps because the combination of multiedit and condensation is not suitable for the dataset at hand.
Training set reduces in size from 750 to 4
Confusion Matrix of predicting Test Data
. 1 2 3
1 250 250 250
2 0 0 0
3 0 0 0
Test Error rate is = 66.7 %
Confusion Matrix of predicting Training Data
. 1 2 3
1 250 250 250
2 0 0 0
3 0 0 0
Training Error rate is = 66.7 %
We present the test data error rates and training data error rates below for comparison.
   classifier                         test error (%)  training error (%)
1  Bayes                                    44.40000            41.73333
2  LDA                                      40.80000            38.53333
3  QDA                                      28.26667            26.93333
4  SVM (one vs one)                         27.86667            28.13333
5  SVM (one vs rest)                        28.53333            30.00000
6  1-NN                                     38.00000             0.00000
7  3-NN                                     35.06667            21.06667
8  1-NN after multiedit                     33.86667            32.93333
9  1-NN after condense                      40.80000             0.00000
10 1-NN after multiedit and condense        66.66667            66.66667
For the leaf data we only have to do Problems 2, 3 and 5.
Dataset B is the Leaf data. We first import the CSV file, rename the columns and inspect the first few rows.
species specimen number eccentricity aspect ratio elongation solidity
1 1 1 0.72694 1.4742 0.32396 0.98535
2 1 2 0.74173 1.5257 0.36116 0.98152
3 1 3 0.76722 1.5725 0.38998 0.97755
4 1 4 0.73797 1.4597 0.35376 0.97566
5 1 5 0.82301 1.7707 0.44462 0.97698
6 1 6 0.72997 1.4892 0.34284 0.98755
stochastic convexity isoperimetric factor maximal indentation depth lobedness
1 1.00000 0.83592 0.0046566 0.0039465
2 0.99825 0.79867 0.0052423 0.0050016
3 1.00000 0.80812 0.0074573 0.0101210
4 1.00000 0.81697 0.0068768 0.0086068
5 1.00000 0.75493 0.0074280 0.0100420
6 1.00000 0.84482 0.0049451 0.0044506
average intensity average contrast smoothness third moment uniformity entropy
1 0.0477900 0.127950 0.0161080 0.00523230 2.7477e-04 1.17560
2 0.0241600 0.090476 0.0081195 0.00270800 7.4846e-05 0.69659
3 0.0118970 0.057445 0.0032891 0.00092068 3.7886e-05 0.44348
4 0.0159500 0.065491 0.0042707 0.00115440 6.6272e-05 0.58785
5 0.0079379 0.045339 0.0020514 0.00055986 2.3504e-05 0.34214
6 0.0104870 0.058528 0.0034138 0.00112480 2.4798e-05 0.34068
We see that the second column is just the specimen number within a species/class, so we will choose the features for dataset B from the remaining 14 columns. We also list the classes in the first column and their frequencies below.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 22 23 24 25 26 27 28 29 30 31 32
12 10 10 8 12 8 10 11 14 13 16 12 13 12 10 12 11 13 9 12 11 12 12 12 11 11
33 34 35 36
11 11 11 10
We see that the dataset contains no data for species 16 to 21, so we choose our classes from the remaining species. We take features from columns 3 to 16 only, since the second column is just the specimen number. The first few rows of the resulting dataset B are displayed below.
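The selection step can be sketched as follows on a toy data frame. The frame `leaf`, its snake_case column names, and the chosen species and features are placeholders for illustration, not the report's actual import or selection.

```r
set.seed(1)  # hypothetical seed

# Toy stand-in for the imported leaf data (values are placeholders).
leaf <- data.frame(species = rep(c(3, 10, 14, 22), each = 3),
                   specimen_number = rep(1:3, times = 4),
                   elongation = runif(12),
                   entropy = runif(12))

# Class frequencies, as in the table above.
table(leaf$species)

keep_species <- c(3, 10, 14)                           # hypothetical choice
keep_columns <- c("species", "elongation", "entropy")  # drops specimen_number

dataset_b <- leaf[leaf$species %in% keep_species, keep_columns]
dataset_b$species <- factor(dataset_b$species)  # class label as a factor
head(dataset_b)
```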
species elongation entropy average contrast aspect ratio
1 3 0.36317 1.53710 0.200880 1.2099
2 3 0.32559 0.53410 0.097143 1.2065
3 3 0.33117 1.10760 0.160600 1.0991
4 3 0.38111 0.72247 0.115020 1.2510
5 3 0.38460 1.13650 0.159310 1.1435
6 3 0.29593 1.05110 0.159800 1.0066
Next we create training and test sets for dataset B.
We display the confusion matrices and the test and training error rates for the Bayes classifier.
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 3 0 0 0 0 0
10 0 6 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 1
24 2 0 0 0 6 0
36 0 0 0 0 0 4
Test Error rate is = 9.1 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 4 1 0 0 0 0
10 1 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Training Error rate is = 6.1 %
(i) LDA
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 5 1 0 0 0 0
10 0 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 5 0 0 0 0 0
10 0 6 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Training Error rate is = 0 %
(ii) QDA
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 3 0 0 0 0 0
10 2 6 0 4 1 0
14 0 0 6 0 0 0
15 0 0 0 1 0 0
24 0 0 0 0 5 0
36 0 0 0 0 0 5
Test Error rate is = 21.2 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 5 0 0 0 0 0
10 0 6 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Training Error rate is = 0 %
(iii) SVM (one versus one)
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 4 0 0 0 0 0
10 0 6 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 1 0 0 0 6 0
36 0 0 0 0 0 5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 4 0 0 0 0 0
10 1 4 0 0 0 0
14 0 0 6 0 0 0
15 0 1 0 4 0 0
24 0 1 0 1 6 0
36 0 0 0 0 0 5
Training Error rate is = 12.1 %
(iii) SVM (one versus rest)
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 4 1 0 0 0 0
10 0 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 1 0 0 0 6 0
36 0 0 0 0 0 5
Test Error rate is = 6.1 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 4 0 0 0 0 0
10 1 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 4 0 0
24 0 1 0 1 6 0
36 0 0 0 0 0 5
Training Error rate is = 9.1 %
(iv) KNN (K = 1)
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 4 1 0 0 0 0
10 0 5 0 0 0 0
14 0 0 6 0 0 0
15 1 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Test Error rate is = 6.1 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 5 0 0 0 0 0
10 0 6 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Training Error rate is = 0 %
(iv) KNN (K = 3)
Confusion Matrix of predicting Test Data
. 3 10 14 15 24 36
3 5 1 0 0 0 0
10 0 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 5 0 0
24 0 0 0 0 6 0
36 0 0 0 0 0 5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
. 3 10 14 15 24 36
3 4 0 0 0 0 0
10 1 5 0 0 0 0
14 0 0 6 0 0 0
15 0 0 0 4 0 0
24 0 1 0 1 6 0
36 0 0 0 0 0 5
Training Error rate is = 9.1 %
We present the test data error rates and training data error rates below for comparison.
  classifier          test error (%)  training error (%)
1 Bayes                     9.090909            6.060606
2 LDA                       3.030303            0.000000
3 QDA                      21.212121            0.000000
4 SVM (one vs one)          3.030303           12.121212
5 SVM (one vs rest)         6.060606            9.090909
6 1-NN                      6.060606            0.000000
7 3-NN                      3.030303            9.090909