Part A: In this part we work with the simulated dataset A

Problem 0: Setting up the data

To create dataset A we first fix three mean vectors and three variance-covariance matrices. The mean vectors can be chosen arbitrarily; we call them mu1, mu2 and mu3.

mu1 = ( 0,0,0 )
mu2 = ( 1,2,1 )
mu3 = ( -1,-2,-3 )

Next we choose a variance-covariance matrix for each of the three means. This can be done by arbitrarily choosing three full-rank 3 x 3 matrices and multiplying each by its transpose, which guarantees a symmetric positive-definite matrix. We call them sigma1, sigma2 and sigma3, and we keep the variances such that the three populations are neither too separated nor too overlapping. A short sketch of this construction is given after the matrices below.

sigma1
      [,1]  [,2] [,3]
[1,]  16.5 -13.0 -7.0
[2,] -13.0  35.0 11.5
[3,]  -7.0  11.5 10.5
sigma2
      [,1] [,2] [,3]
[1,]  22.5  -21   -7
[2,] -21.0   25    4
[3,]  -7.0    4    9
sigma3
     [,1] [,2] [,3]
[1,] 16.5 14.5  0.5
[2,] 14.5 37.0 -7.5
[3,]  0.5 -7.5  9.0
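
One way to carry out this construction in R (the particular random matrix is an arbitrary choice, not the one used for the matrices above):

# choose an arbitrary full-rank 3 x 3 matrix and multiply it by its transpose;
# the product is symmetric and positive definite, so it is a valid covariance matrix
A     <- matrix(rnorm(9), nrow = 3)
sigma <- A %*% t(A)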

To simulate from the above three distributions we use the MASS package: we draw 500 samples from each, store them in a single data frame, add a column recording which distribution each row comes from (labelled 1, 2 or 3), and display a few data points.

         one        two       three which
1 -0.8457141   4.646275 -1.12685840     1
2  1.1431884   3.436976 -2.22989579     1
3  1.9361008 -10.498659 -3.15960576     1
4 -1.8734482  -1.561796 -0.02002764     1
5  2.4289202   2.982336 -6.58705589     1
6  6.1737001 -10.069912 -1.92189660     1
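
A sketch of the simulation step with MASS::mvrnorm (the object names and the seed are assumptions):

library(MASS)

set.seed(1)                                     # any seed; only for reproducibility
x1 <- mvrnorm(500, mu = mu1, Sigma = sigma1)    # 500 draws from each population
x2 <- mvrnorm(500, mu = mu2, Sigma = sigma2)
x3 <- mvrnorm(500, mu = mu3, Sigma = sigma3)

datA        <- data.frame(rbind(x1, x2, x3))
names(datA) <- c("one", "two", "three")
datA$which  <- factor(rep(1:3, each = 500))     # label of the source distribution
head(datA)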

Next we create (scaled) training and test sets for dataset A.
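
The confusion matrices below contain 250 observations per class in both sets, so the split is 50/50 within each class. A sketch of one such split (the random selection and the choice to standardise with the training statistics are assumptions):

set.seed(2)
idx        <- unlist(lapply(split(seq_len(nrow(datA)), datA$which),
                            function(i) sample(i, 250)))    # 250 training rows per class
trainA_raw <- datA[idx, ]                                   # unscaled copies, kept for the Bayes rule
testA_raw  <- datA[-idx, ]

# scaled versions, standardised with the training means and standard deviations
sc            <- scale(trainA_raw[, 1:3])
trainA        <- trainA_raw
trainA[, 1:3] <- sc
testA         <- testA_raw
testA[, 1:3]  <- scale(testA_raw[, 1:3],
                       center = attr(sc, "scaled:center"),
                       scale  = attr(sc, "scaled:scale"))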

Problem 1: Plotting

We plot the simulated dataset A.
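
A sketch of one such plot, a pairwise scatter plot coloured by class (the plot used in the report is not reproduced here, so the choices below are assumptions):

# pairwise scatter plots of the three features, coloured by class label
pairs(trainA[, 1:3], col = as.numeric(trainA$which), pch = 20,
      main = "Simulated dataset A (training set)")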

Problem 2: Bayes Classifier

We display the confusion matrices and error rates for the test and training data. In every confusion matrix that follows, the rows are the predicted classes and the columns are the true classes.

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1  62  44  50
  2 123 184  29
  3  65  22 171
Test Error rate is = 44.4 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1  65  37  41
  2 121 194  31
  3  64  19 178
Training Error rate is = 41.7 %
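
Since the true means and covariance matrices are known and the priors are equal (500 points per class), the Bayes rule assigns each point to the class with the largest true density. A minimal sketch using the mvtnorm package (the package choice and object names are assumptions; it is applied to the unscaled features because the true parameters are on that scale):

library(mvtnorm)

bayes_classify <- function(x) {
  # true density of each class at the rows of x; equal priors cancel out
  d <- cbind(dmvnorm(x, mean = mu1, sigma = sigma1),
             dmvnorm(x, mean = mu2, sigma = sigma2),
             dmvnorm(x, mean = mu3, sigma = sigma3))
  factor(max.col(d), levels = 1:3)              # class with the largest density
}

pred_bayes <- bayes_classify(as.matrix(testA_raw[, 1:3]))
table(Predicted = pred_bayes, True = testA_raw$which)   # test confusion matrix
mean(pred_bayes != testA_raw$which)                     # test error rate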

Problem 3: LDA, QDA, SVM, KNN

(i) LDA

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 104  60  51
  2  88 176  35
  3  58  14 164
Test Error rate is = 40.8 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 100  46  46
  2  88 190  33
  3  62  14 171
Training Error rate is = 38.5 %
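
A sketch of the LDA fit with the MASS package (object names follow the earlier sketches); replacing lda() with qda() gives the QDA fit reported in (ii):

library(MASS)

fit_lda  <- lda(which ~ ., data = trainA)
pred_lda <- predict(fit_lda, newdata = testA)$class

table(Predicted = pred_lda, True = testA$which)   # test confusion matrix
mean(pred_lda != testA$which)                     # test error rate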

(ii) QDA

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 129  26  32
  2  83 215  24
  3  38   9 194
Test Error rate is = 28.3 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 137  19  35
  2  80 221  25
  3  33  10 190
Training Error rate is = 26.9 %

(iii) SVM (one versus one)

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 113  17  26
  2  93 220  16
  3  44  13 208
Test Error rate is = 27.9 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 117  13  26
  2  97 222  24
  3  36  15 200
Training Error rate is = 28.1 %
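
A sketch of the one-versus-one SVM using the e1071 package, whose svm() handles multi-class problems by fitting all pairwise binary SVMs; the radial kernel and cost value are assumptions, since the report does not state them:

library(e1071)

fit_svm  <- svm(which ~ ., data = trainA, kernel = "radial", cost = 1)  # one-vs-one by default
pred_svm <- predict(fit_svm, newdata = testA)

table(Predicted = pred_svm, True = testA$which)
mean(pred_svm != testA$which)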

(iii) SVM (one versus rest)

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1  95  10  16
  2  97 225  18
  3  58  15 216
Test Error rate is = 28.5 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1  89   8  13
  2 112 224  25
  3  49  18 212
Training Error rate is = 30 %
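
e1071 does not offer one-versus-rest directly, so one way to obtain it is to fit a binary SVM of each class against the rest and assign every test point to the class whose model scores it highest. The sketch below uses Platt-scaled class probabilities as the score (this particular construction is an assumption, not necessarily the one used in the report):

library(e1071)

classes <- levels(trainA$which)
scores  <- sapply(classes, function(k) {
  y_bin <- factor(ifelse(trainA$which == k, "yes", "no"))
  fit_k <- svm(x = trainA[, 1:3], y = y_bin, kernel = "radial",
               cost = 1, probability = TRUE)
  pred  <- predict(fit_k, testA[, 1:3], probability = TRUE)
  attr(pred, "probabilities")[, "yes"]          # score for "class k vs the rest"
})
pred_ovr <- factor(classes[max.col(scores)], levels = classes)

table(Predicted = pred_ovr, True = testA$which)
mean(pred_ovr != testA$which)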

(iv) KNN (K = 1)

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 135  72  44
  2  71 147  23
  3  44  31 183
Test Error rate is = 38 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 250   0   0
  2   0 250   0
  3   0   0 250
Training Error rate is = 0 %
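
A sketch of the k-nearest-neighbour classifier with the class package; the K = 3 results below are obtained by changing the k argument:

library(class)

pred_knn <- knn(train = trainA[, 1:3], test = testA[, 1:3],
                cl = trainA$which, k = 1)

table(Predicted = pred_knn, True = testA$which)
mean(pred_knn != testA$which)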

(iv) KNN (K = 3)

Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 135  69  34
  2  73 154  18
  3  42  27 198
Test Error rate is = 35.1 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 171  22  26
  2  46 215  18
  3  33  13 206
Training Error rate is = 21.1 %

Problem 4: 1-NN with Multiedit and Condense

(i) 1-NN with multiedit

We see improved performance in classifying the second and third classes but much poorer performance on the first class. Perhaps this suggests that multiedit is not well suited to the data at hand.

Training set reduces in size from 750 to 293
Confusion Matrix of predicting Test Data
   
.     1   2   3
  1  95  29  23
  2 112 211  37
  3  43  10 190
Test Error rate is = 33.9 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1  96  22  22
  2 118 220  41
  3  36   8 187
Training Error rate is = 32.9 %

(ii) 1-NN with condensation

Condensation works reasonably well: it reduces the training set (from 750 to 447 points) while sacrificing only a little test performance relative to plain 1-NN.

Training set reduces in size from 750 to 447
Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 127  73  55
  2  71 144  22
  3  52  33 173
Test Error rate is = 40.8 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 250   0   0
  2   0 250   0
  3   0   0 250
Training Error rate is = 0 %

(iii) 1-NN with multiedit and condense

We see that the training set shrinks to a very small size (just 4 points), rendering meaningful classification impossible: every point is predicted as class 1. This happens perhaps because the multiedit + condensation approach is not suitable for the dataset at hand.

Training set reduces in size from 750 to 4
Confusion Matrix of predicting Test Data
   
.     1   2   3
  1 250 250 250
  2   0   0   0
  3   0   0   0
Test Error rate is = 66.7 %
Confusion Matrix of predicting Training Data
   
.     1   2   3
  1 250 250 250
  2   0   0   0
  3   0   0   0
Training Error rate is = 66.7 %
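
A sketch of the editing and condensing pipeline used in (i)-(iii), based on multiedit() and condense() from the class package (the tuning parameters shown are the package defaults and are assumptions about what was actually used):

library(class)

train_x <- as.matrix(trainA[, 1:3]); train_y <- trainA$which
test_x  <- as.matrix(testA[, 1:3])

# (i) multiedit: repeatedly drop training points misclassified by k-NN rules
#     built on the other folds, then classify the test set by 1-NN on what remains
keep_me <- multiedit(train_x, train_y, k = 1, V = 3, I = 5, trace = FALSE)
pred_me <- knn(train_x[keep_me, ], test_x, train_y[keep_me], k = 1)

# (ii) condense: keep a subset that still classifies the full training set
#      correctly under 1-NN
keep_cd <- condense(train_x, train_y, trace = FALSE)
pred_cd <- knn(train_x[keep_cd, ], test_x, train_y[keep_cd], k = 1)

# (iii) condense the multiedited set
keep_mc <- keep_me[condense(train_x[keep_me, ], train_y[keep_me], trace = FALSE)]
pred_mc <- knn(train_x[keep_mc, ], test_x, train_y[keep_mc], k = 1)

mean(pred_mc != testA$which)   # test error rate for (iii)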

Problem 5: Results

We present the test data error rates and training data error rates below for comparison.

                          classifier test rates training rates
1                              Bayes   44.40000       41.73333
2                                LDA   40.80000       38.53333
3                                QDA   28.26667       26.93333
4                   SVM (one vs one)   27.86667       28.13333
5                  SVM (one vs rest)   28.53333       30.00000
6                               1-NN   38.00000        0.00000
7                               3-NN   35.06667       21.06667
8               1-NN after multiedit   33.86667       32.93333
9                1-NN after condense   40.80000        0.00000
10 1-NN after multiedit and condense   66.66667       66.66667

Part B: In this part we look at dataset B (leaf data)

For the leaf data we only need to carry out Problems 2, 3 and 5.

Problem 0: Setting up the data

For dataset B, i.e. the leaf data, we first import the CSV file, rename the columns and inspect some of the data.
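
A sketch of the import step (the file name and the absence of a header row in the raw file are assumptions):

leaf <- read.csv("leaf.csv", header = FALSE)
names(leaf) <- c("species", "specimen number", "eccentricity", "aspect ratio",
                 "elongation", "solidity", "stochastic convexity",
                 "isoperimetric factor", "maximal indentation depth", "lobedness",
                 "average intensity", "average contrast", "smoothness",
                 "third moment", "uniformity", "entropy")
head(leaf)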

  species specimen number eccentricity aspect ratio elongation solidity
1       1               1      0.72694       1.4742    0.32396  0.98535
2       1               2      0.74173       1.5257    0.36116  0.98152
3       1               3      0.76722       1.5725    0.38998  0.97755
4       1               4      0.73797       1.4597    0.35376  0.97566
5       1               5      0.82301       1.7707    0.44462  0.97698
6       1               6      0.72997       1.4892    0.34284  0.98755
  stochastic convexity isoperimetric factor maximal indentation depth lobedness
1              1.00000              0.83592                 0.0046566 0.0039465
2              0.99825              0.79867                 0.0052423 0.0050016
3              1.00000              0.80812                 0.0074573 0.0101210
4              1.00000              0.81697                 0.0068768 0.0086068
5              1.00000              0.75493                 0.0074280 0.0100420
6              1.00000              0.84482                 0.0049451 0.0044506
  average intensity average contrast smoothness third moment uniformity entropy
1         0.0477900         0.127950  0.0161080   0.00523230 2.7477e-04 1.17560
2         0.0241600         0.090476  0.0081195   0.00270800 7.4846e-05 0.69659
3         0.0118970         0.057445  0.0032891   0.00092068 3.7886e-05 0.44348
4         0.0159500         0.065491  0.0042707   0.00115440 6.6272e-05 0.58785
5         0.0079379         0.045339  0.0020514   0.00055986 2.3504e-05 0.34214
6         0.0104870         0.058528  0.0034138   0.00112480 2.4798e-05 0.34068

We see that the second column is just the reading number of the specimen within each species/class, so the columns we keep in the final dataset B will be chosen from the remaining 14 columns. We also list the classes appearing in the first column and their frequencies below.


 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 22 23 24 25 26 27 28 29 30 31 32 
12 10 10  8 12  8 10 11 14 13 16 12 13 12 10 12 11 13  9 12 11 12 12 12 11 11 
33 34 35 36 
11 11 11 10 

We see that the dataset contains no observations for species 16 to 21, so we choose the classes for dataset B from the remaining species. We also choose our feature columns from among columns 3 to 16, since the second column is just the reading number. We then display some of the resulting data points.

  species elongation entropy average contrast aspect ratio
1       3    0.36317 1.53710         0.200880       1.2099
2       3    0.32559 0.53410         0.097143       1.2065
3       3    0.33117 1.10760         0.160600       1.0991
4       3    0.38111 0.72247         0.115020       1.2510
5       3    0.38460 1.13650         0.159310       1.1435
6       3    0.29593 1.05110         0.159800       1.0066

Next we create training and test sets for dataset B.
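
The displayed subset and the confusion matrices below use six species (3, 10, 14, 15, 24, 36) and four of the candidate features, so a further sub-selection was made. A sketch of one way to build dataset B and split it (the selection mechanism and the roughly 50/50 within-class split are assumptions):

feat <- c("elongation", "entropy", "average contrast", "aspect ratio")
spec <- c(3, 10, 14, 15, 24, 36)                       # species kept in dataset B
datB <- leaf[leaf$species %in% spec, c("species", feat)]
datB$species <- factor(datB$species)

set.seed(3)
idx    <- unlist(lapply(split(seq_len(nrow(datB)), datB$species),
                        function(i) sample(i, floor(length(i) / 2))))
trainB <- datB[idx, ]
testB  <- datB[-idx, ]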

Problem 2: Bayes Classifier

We display the confusion matrices and the test and training error rates; as before, rows are the predicted classes and columns are the true classes.

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  3  0  0  0  0  0
  10 0  6  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  1
  24 2  0  0  0  6  0
  36 0  0  0  0  0  4
Test Error rate is = 9.1 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  4  1  0  0  0  0
  10 1  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Training Error rate is = 6.1 %

Problem 3: LDA, QDA, SVM, KNN

(i) LDA

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  5  1  0  0  0  0
  10 0  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  5  0  0  0  0  0
  10 0  6  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Training Error rate is = 0 %

(ii) QDA

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  3  0  0  0  0  0
  10 2  6  0  4  1  0
  14 0  0  6  0  0  0
  15 0  0  0  1  0  0
  24 0  0  0  0  5  0
  36 0  0  0  0  0  5
Test Error rate is = 21.2 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  5  0  0  0  0  0
  10 0  6  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Training Error rate is = 0 %

(iii) SVM (one versus one)

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  4  0  0  0  0  0
  10 0  6  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 1  0  0  0  6  0
  36 0  0  0  0  0  5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  4  0  0  0  0  0
  10 1  4  0  0  0  0
  14 0  0  6  0  0  0
  15 0  1  0  4  0  0
  24 0  1  0  1  6  0
  36 0  0  0  0  0  5
Training Error rate is = 12.1 %

(iii) SVM (one versus rest)

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  4  1  0  0  0  0
  10 0  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 1  0  0  0  6  0
  36 0  0  0  0  0  5
Test Error rate is = 6.1 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  4  0  0  0  0  0
  10 1  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  4  0  0
  24 0  1  0  1  6  0
  36 0  0  0  0  0  5
Training Error rate is = 9.1 %

(iv) KNN (K = 1)

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  4  1  0  0  0  0
  10 0  5  0  0  0  0
  14 0  0  6  0  0  0
  15 1  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Test Error rate is = 6.1 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  5  0  0  0  0  0
  10 0  6  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Training Error rate is = 0 %

(iv) KNN (K = 3)

Confusion Matrix of predicting Test Data
    
.    3 10 14 15 24 36
  3  5  1  0  0  0  0
  10 0  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  5  0  0
  24 0  0  0  0  6  0
  36 0  0  0  0  0  5
Test Error rate is = 3 %
Confusion Matrix of predicting Training Data
    
.    3 10 14 15 24 36
  3  4  0  0  0  0  0
  10 1  5  0  0  0  0
  14 0  0  6  0  0  0
  15 0  0  0  4  0  0
  24 0  1  0  1  6  0
  36 0  0  0  0  0  5
Training Error rate is = 9.1 %

Problem 5: Results

We present the test data error rates and training data error rates below for comparison.

         classifier test rates training rates
1             Bayes   9.090909       6.060606
2               LDA   3.030303       0.000000
3               QDA  21.212121       0.000000
4  SVM (one vs one)   3.030303      12.121212
5 SVM (one vs rest)   6.060606       9.090909
6              1-NN   6.060606       0.000000
7              3-NN   3.030303       9.090909