April 21st, 2020

About

I used the Glass Identification Database from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Glass+Identification). The data contains 214 entries with 11 attributes: an entry ID; the refractive index of the sample; its sodium, magnesium, aluminum, silicon, potassium, calcium, barium, and iron content; and the type of glass the sample is associated with.

I started by processing the data, selecting specific attributes for analysis, visualizing the data, and then transforming the values for classification. I first fit a classification tree to a training set, which gave a decision tree with 13 terminal nodes after pruning. As an alternative, I implemented a k-NN algorithm, a black-box model that classifies each data point based on its nearest neighbors. Using K-Fold Cross Validation, I eliminated the other candidate k-values and settled on a single-nearest-neighbor model.

Comparing these two models, I observed that the k-NN model predicted glass types more accurately than the decision tree.

The Data

data.glass <- read.table("glass.txt", sep = ',', header = FALSE)
colnames(data.glass) = c("ID","RI","Na","Mg","Al","Si","K","Ca","Ba","Fe","Type")
dim(data.glass)
## [1] 214  11
summary(data.glass[3:10])
##        Na              Mg              Al              Si       
##  Min.   :10.73   Min.   :0.000   Min.   :0.290   Min.   :69.81  
##  1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190   1st Qu.:72.28  
##  Median :13.30   Median :3.480   Median :1.360   Median :72.79  
##  Mean   :13.41   Mean   :2.685   Mean   :1.445   Mean   :72.65  
##  3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630   3rd Qu.:73.09  
##  Max.   :17.38   Max.   :4.490   Max.   :3.500   Max.   :75.41  
##        K                Ca               Ba              Fe         
##  Min.   :0.0000   Min.   : 5.430   Min.   :0.000   Min.   :0.00000  
##  1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000   1st Qu.:0.00000  
##  Median :0.5550   Median : 8.600   Median :0.000   Median :0.00000  
##  Mean   :0.4971   Mean   : 8.957   Mean   :0.175   Mean   :0.05701  
##  3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000   3rd Qu.:0.10000  
##  Max.   :6.2100   Max.   :16.190   Max.   :3.150   Max.   :0.51000

The Data (cont.)

The attributes that are significant for the classification process are:

  • sodium content
  • magnesium content
  • aluminum content
  • silicon content
  • potassium content
  • calcium content
  • barium content
  • iron content

I excluded the Refractive Index because all 7 types of glass require a similar permeability to light. I also removed the ID numbers because R automatically indexes the entries.
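As a minimal sketch of the column selection described above, the snippet below drops the ID and RI columns by name. The toy data frame and the names glass.toy and glass.sub are made up for illustration; the slides do not show how data.glass1 was actually built.

```r
# Toy stand-in with the same kind of columns as the glass data (values are illustrative)
glass.toy = data.frame(ID = 1:3,
                       RI = c(1.521, 1.518, 1.516),
                       Na = c(13.6, 13.9, 13.5),
                       Mg = c(4.49, 3.60, 3.55),
                       Type = c(1, 1, 2))

# Keep every column except the identifier and the refractive index
keep = setdiff(names(glass.toy), c("ID", "RI"))
glass.sub = glass.toy[, keep]

names(glass.sub)
```

Selecting by name rather than by position keeps the code valid if the column order changes.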

Visualizing the Data

To visualize the data, I decided to use scatterplots. I added spread (jitter) to the points so that repeated occurrences of the same value remain visible. I also noticed that the “Type” of each glass is stored as a numeric value, which could cause R to fit a regression tree rather than a classification tree. I addressed this by creating a new “type” attribute with the labels “one”, “two”, “three”, etc., and removing the original “Type” attribute from the dataset. See the scatter plots on the next slide.
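The relabeling step can be sketched as follows, assuming a lookup vector of number words. The toy data frame and the name type.words are hypothetical; only the idea of replacing the numeric “Type” with a categorical “type” comes from the text above.

```r
# Toy stand-in for the numeric "Type" column (values are illustrative)
glass.toy = data.frame(Na = c(13.6, 13.9, 13.5, 12.8, 14.4, 14.9),
                       Type = c(1, 2, 3, 5, 6, 7))

# Hypothetical lookup: position i holds the word for type code i
type.words = c("one", "two", "three", "four", "five", "six", "seven")

# A factor response tells tree-fitting functions to treat this as classification
glass.toy$type = factor(type.words[glass.toy$Type])

# Drop the original numeric column so it cannot be used as a predictor
glass.toy$Type = NULL

str(glass.toy)
```

Calling factor() on the numeric codes directly would also produce a categorical response; the word labels mainly make the tree plot easier to read.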

Visualizing the Data (cont.)

Further Processing

As part of processing the data, I split it into a training set containing 75% of the original data and a test set containing the remaining 25%.

obs = nrow(data.glass1)  # number of observations (assumed definition; not shown on the slide)
index.train = sample(1:obs, size = floor(obs * 0.75), replace = FALSE)  # 75% Training Set
index.test = setdiff(1:obs, index.train)  # 25% Test Set

train.glass = data.glass1[index.train,]  # obtain the training dataset
test.glass = data.glass1[index.test,]  # obtain the testing dataset

dim(data.glass1)
## [1] 214   9
dim(train.glass)
## [1] 160   9
dim(test.glass)
## [1] 54  9

Model (A): Decision Tree

With data processing complete, I proceeded to construct the decision tree with the training data.

Model (A): Decision Tree (cont.)

In the tree, 3 branches lead to the same classification result. This suggests that the classification tree can be simplified by pruning.

To find the optimal number of terminal nodes for the classification tree, I used the built-in cross-validation function to find the smallest number of nodes that attains the minimum misclassification error.
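A minimal sketch of this kind of size selection is shown below. It assumes the tree package (suggested by the snip.tree() call in the summary output, but not confirmed by the slides) and uses the built-in iris data as a stand-in for the glass training set; the seed and all object names are hypothetical.

```r
library(tree)  # assumed package; not part of base R

set.seed(1)  # hypothetical seed so the cross-validation folds are reproducible

# Fit an unpruned classification tree (iris stands in for the glass training data)
tree.fit = tree(Species ~ ., data = iris)

# Cross-validate the pruning sequence, scoring each size by misclassifications
cv.fit = cv.tree(tree.fit, FUN = prune.misclass)

# Smallest tree size whose cross-validated error equals the minimum
best.size = min(cv.fit$size[cv.fit$dev == min(cv.fit$dev)])

# Prune the full tree back to that many terminal nodes
pruned.fit = prune.misclass(tree.fit, best = best.size)
```

Plotting cv.fit$size against cv.fit$dev gives the kind of error-versus-size graph used to compare candidate tree sizes.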

Model (A): Decision Tree (cont.)

As seen in the graph, trees with 7 and 8, 10 and 11, and 12 and 13 terminal nodes achieve similar misclassification errors. From this, I chose a tree with 13 terminal nodes.

Model (A): Confusion Matrix and Misclassification Error

# TRAINING DATA
prune.pred = predict(prune.glass, train.glass, type = "class")  
train_conf_mat = table(prune.pred, train.glass$type)  # create confusion matrix
misclass_error = 1 - sum(diag(train_conf_mat)) / sum(train_conf_mat)  # misclass. error
print(paste0("Misclassification Error on Training Data: ", toString(misclass_error)))
## [1] "Misclassification Error on Training Data: 0.19375"
# TESTING DATA
prune.test.pred = predict(prune.glass, test.glass, type = "class")  
test_conf_mat = table(prune.test.pred, test.glass$type) # create confusion matrix
misclass_error_test = 1 - sum(diag(test_conf_mat)) / sum(test_conf_mat)  # misclass. error
print(paste0("Misclassification Error on Testing Data: ", toString(misclass_error_test)))
## [1] "Misclassification Error on Testing Data: 0.314814814814815"

Model (A): Summary Statistics of the Tree

summary(prune.glass)
## 
## Classification tree:
## snip.tree(tree = tree.glass, nodes = c(31L, 24L))
## Variables actually used in tree construction:
## [1] "Mg" "Na" "Al" "Fe" "Ba" "K" 
## Number of terminal nodes:  13 
## Residual mean deviance:  1.147 = 168.7 / 147 
## Misclassification error rate: 0.1938 = 31 / 160

Model (B): k Nearest Neighbors

As an alternative, I constructed a k-NN model and calculated the misclassification error for each candidate number of nearest neighbors.
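One way to compute such an error curve is sketched below with knn() from the class package, again using iris as a stand-in dataset. The seed, the range of k, and the object names are assumptions for illustration and are not taken from the original analysis.

```r
library(class)  # provides knn(); a recommended package bundled with most R installations

set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# 75% / 25% split, mirroring the split used for the glass data
n = nrow(iris)
idx.train = sample(1:n, size = floor(n * 0.75))
train.x = iris[idx.train, 1:4]
test.x = iris[-idx.train, 1:4]
train.y = iris$Species[idx.train]
test.y = iris$Species[-idx.train]

# Test-set misclassification error for each candidate k
k.values = 1:15
err.by.k = sapply(k.values, function(k) {
  pred = knn(train = train.x, test = test.x, cl = train.y, k = k)
  mean(pred != test.y)  # share of misclassified test points
})

best.k = k.values[which.min(err.by.k)]  # first k attaining the minimum error
```

Because k-NN relies on distances, predictors on very different scales are often standardized first; this sketch skips that step.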

Model (B): k Nearest Neighbors (cont.)

As shown in the figure, the minimum misclassification error occurs at 1 nearest neighbor. To confirm this value, I also used K-Fold Cross Validation. The data is partitioned into K equal-sized subsets; each subset in turn serves as the test set, while the remaining K - 1 subsets form the training set.

I performed 10-Fold Cross Validation: the data is split into 10 subsets, and the misclassification error is computed 10 times, once with each subset held out as the test set. See the graph on the next slide.
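The procedure can be written out by hand as in the sketch below, which assumes knn() from the class package and uses iris as a stand-in dataset. The fold assignment, seed, and object names are hypothetical and may differ from the original implementation.

```r
library(class)  # provides knn()

set.seed(1)  # hypothetical seed so the fold assignment is reproducible

x = iris[, 1:4]
y = iris$Species
n = nrow(x)

n.folds = 10
fold.id = sample(rep(1:n.folds, length.out = n))  # random, near-equal fold sizes

k.values = 1:10
cv.err = matrix(NA, nrow = n.folds, ncol = length(k.values),
                dimnames = list(NULL, paste0("k=", k.values)))

for (f in 1:n.folds) {
  held.out = fold.id == f  # this fold is the test set; the other 9 are the training set
  for (j in seq_along(k.values)) {
    pred = knn(train = x[!held.out, ], test = x[held.out, ],
               cl = y[!held.out], k = k.values[j])
    cv.err[f, j] = mean(pred != y[held.out])
  }
}

mean.err = colMeans(cv.err)             # average error per k across the 10 folds
cv.best.k = k.values[which.min(mean.err)]
```

Each column of cv.err holds the 10 per-fold errors for one value of k, so boxplot(cv.err) would show their spread side by side.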

Model (B): k Nearest Neighbors Misclassification Error (cont.)


Model (B): k Nearest Neighbors - Boxplot of Misclassification Errors

Based on the minimum indicated in the boxplot, the other values of k could be eliminated.

Conclusions

Ultimately, to evaluate the two models, I used their misclassification rates: a straightforward approach to deciding which model performs better at classification.

round(mean.error[which.min(mean.error)], 5)  # k-fold misclassification rate
## [1] 0.28615
round(min(knn.error), 5)  # 1-NN misclassification rate
## [1] 0.27778
prune.test.pred = predict(prune.glass, test.glass, type = "class")  
test_conf_mat = table(prune.test.pred, test.glass$type) # create confusion matrix
misclass_error_test = 1 - sum(diag(test_conf_mat)) / sum(test_conf_mat)  # misclass. error
round(misclass_error_test, 5)
## [1] 0.31481

Conclusions (cont.)

From the misclassification rates of the two models, we can see that the k-NN model performed better. The 1-NN model had a correct classification rate of 0.7222, and the k-fold cross-validated k-NN estimate was 0.7139. Both are higher than the decision tree's correct classification rate of 0.6852.

While the k-NN model did perform better overall, I still believe the decision tree is the better of the two for readability and understanding. Its tree structure provides a clear view of how classification occurs. To improve upon these results, I would need further insight into the data, possibly more data points, and other classification models to compare against.