Glass Classification

Description

A data frame with 214 observations containing examples of the chemical analysis of 7 different types of glass. The problem is to predict the type of glass on the basis of the chemical analysis. The study of the classification of types of glass was motivated by criminological investigation: at the scene of a crime, the glass left behind can be used as evidence (if it is correctly identified!).

Data Set Information

The data were extracted by B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN. Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation.

Objective

The main objective is to classify the type of glass based on the explanatory variables describing its chemical composition, such as Fe, Na, and Mg.

# read the data set (the path is machine-specific)
glass<-read.csv("C:\\Users\\sai\\Desktop\\R files\\Data sets\\glass.csv",header = TRUE)
head(glass)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1
str(glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: int  1 1 1 1 1 1 1 1 1 1 ...

Step 1: convert the dependent variable from integer to factor

glass$Type<-factor(glass$Type) # Type is a class label, not a quantity

Step 2: separate the predictors by dropping the dependent variable

glass1<-glass[,-10] # drop column 10, the Type label

Step 3: normalize the data

# min-max scaling; note that min(x) and max(x) are computed over the entire
# data frame here, so every column is divided by the same global range
normalize=function(x){
  return((x-min(x))/(max(x)-min(x)))
}
glass2<-normalize(glass1)
head(glass2)
##           RI        Na         Mg         Al        Si            K
## 1 0.02016987 0.1808779 0.05954117 0.01458692 0.9518631 0.0007956504
## 2 0.02012478 0.1841931 0.04773903 0.01803474 0.9644609 0.0063652036
## 3 0.02010582 0.1794192 0.04707598 0.02042169 0.9679088 0.0051717279
## 4 0.02012545 0.1751757 0.04893250 0.01710648 0.9628696 0.0075586792
## 5 0.02012226 0.1759714 0.04800424 0.01644344 0.9691022 0.0072934624
## 6 0.02010290 0.1696062 0.04787164 0.02148256 0.9676435 0.0084869381
##          Ca Ba          Fe
## 1 0.1160324  0 0.000000000
## 2 0.1038324  0 0.000000000
## 3 0.1031693  0 0.000000000
## 4 0.1090041  0 0.000000000
## 5 0.1070150  0 0.000000000
## 6 0.1070150  0 0.003447819
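
Because min() and max() in the function above see the whole data frame at once, every column is divided by the same global range, which is why Si stays near 0.95 instead of spanning 0 to 1. If per-column scaling were intended, a minimal sketch applying the same function column-wise would be (glass2bycol is an illustrative name and is not used below):

glass2bycol<-as.data.frame(lapply(glass1,normalize)) # rescales each column to [0,1] on its own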

Separate training and testing data

* Create a random 75/25 data partition

index1<-sample(nrow(glass2),0.75*nrow(glass2)) # no seed is set, so the split varies between runs


trainglass<-glass2[index1,]
testglass<-glass2[-index1,]

Generate ytrain and ytest: the outcome vectors for the training and test data

ytrain=glass$Type[index1]
ytest=glass$Type[-index1]

A common rule of thumb for choosing k is sqrt(nrow(training data)).
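
For this split the rule gives roughly 13; a quick check (k1 is only an illustrative scratch variable, and the knn() call below passes the unrounded value directly):

k1<-round(sqrt(nrow(trainglass))) # 160 training rows, so sqrt is about 12.6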

Create the KNN model

K-Nearest Neighbors (KNN) is one of the most basic yet essential classification algorithms in machine learning. It belongs to the supervised learning domain and is widely used in pattern recognition, data mining, and intrusion detection. Given labelled training data, KNN assigns a new observation the majority class among its k nearest neighbors in feature space.

library(class)
# knn() returns predicted labels for testglass directly; there is no stored model object
knnmodelglass=knn(trainglass,testglass,k=sqrt(nrow(trainglass)),cl=ytrain)
knnmodelglass
##  [1] 2 1 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 3 2 1 1 1 2 2 2 2 2 2 1 3 1 2 2 2
## [36] 2 2 1 2 2 2 5 1 5 3 2 2 7 7 7 7 2 7 7
## Levels: 1 2 3 5 6 7

Confusion Matrix

It is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known.

The matrix gives insight into the errors made by the classifier and, more importantly, the types of errors being made.
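
Overall accuracy can also be read straight off such a table as the share of predictions on the diagonal; a minimal sketch (tab is an illustrative name):

tab<-table(ytest,knnmodelglass)
sum(diag(tab))/sum(tab) # proportion of correct predictions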

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# note: confusionMatrix() expects (predictions, reference); the arguments here
# are reversed, which transposes the table but leaves overall accuracy unchanged
confusionMatrix(ytest,knnmodelglass)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 18  3  2  0  0  0
##          2  2 11  1  0  0  0
##          3  1  0  0  0  0  0
##          5  0  3  0  1  0  0
##          6  1  0  0  1  0  0
##          7  0  3  1  0  0  6
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.5253, 0.7891)
##     No Information Rate : 0.4074          
##     P-Value [Acc > NIR] : 0.0001062       
##                                           
##                   Kappa : 0.5277          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            0.8182   0.5500  0.00000  0.50000       NA   1.0000
## Specificity            0.8438   0.9118  0.98000  0.94231  0.96296   0.9167
## Pos Pred Value         0.7826   0.7857  0.00000  0.25000       NA   0.6000
## Neg Pred Value         0.8710   0.7750  0.92453  0.98000       NA   1.0000
## Prevalence             0.4074   0.3704  0.07407  0.03704  0.00000   0.1111
## Detection Rate         0.3333   0.2037  0.00000  0.01852  0.00000   0.1111
## Detection Prevalence   0.4259   0.2593  0.01852  0.07407  0.03704   0.1852
## Balanced Accuracy      0.8310   0.7309  0.49000  0.72115       NA   0.9583

Naive Bayes

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms sharing a common principle: every feature is assumed to be conditionally independent of every other feature, given the class.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(glass2)
##    vars   n mean   sd median trimmed  mad  min  max range  skew kurtosis
## RI    1 214 0.02 0.00   0.02    0.02 0.00 0.02 0.02  0.00  1.60     4.72
## Na    2 214 0.18 0.01   0.18    0.18 0.01 0.14 0.23  0.09  0.45     2.90
## Mg    3 214 0.04 0.02   0.05    0.04 0.00 0.00 0.06  0.06 -1.14    -0.45
## Al    4 214 0.02 0.01   0.02    0.02 0.00 0.00 0.05  0.04  0.89     1.94
## Si    5 214 0.96 0.01   0.97    0.96 0.01 0.93 1.00  0.07 -0.72     2.82
## K     6 214 0.01 0.01   0.01    0.01 0.00 0.00 0.08  0.08  6.46    52.87
## Ca    7 214 0.12 0.02   0.11    0.12 0.01 0.07 0.21  0.14  2.02     6.41
## Ba    8 214 0.00 0.01   0.00    0.00 0.00 0.00 0.04  0.04  3.37    12.08
## Fe    9 214 0.00 0.00   0.00    0.00 0.00 0.00 0.01  0.01  1.73     2.52
##    se
## RI  0
## Na  0
## Mg  0
## Al  0
## Si  0
## K   0
## Ca  0
## Ba  0
## Fe  0
colSums(is.na(glass2))
## RI Na Mg Al Si  K Ca Ba Fe 
##  0  0  0  0  0  0  0  0  0
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
library(ggplot2)
library(e1071)

# fresh 75/25 split on the raw (un-normalized) data for naive Bayes
index2<-sample(nrow(glass),0.75*nrow(glass))
trainglassnb<-glass[index2,]
testglassnb<-glass[-index2,]
modelnaivebayes<-naiveBayes(Type~.,data = trainglassnb)

pred<-predict(modelnaivebayes,trainglassnb) # in-sample predictions on the training data
head(pred)
## [1] 3 3 6 5 2 3
## Levels: 1 2 3 5 6 7
table(pred,trainglassnb$Type)
##     
## pred  1  2  3  5  6  7
##    1  2  2  0  0  0  0
##    2  3 12  0  5  0  0
##    3 41 43 13  0  0  1
##    5  0  3  0  2  0  0
##    6  0  4  1  2  5  3
##    7  0  0  0  0  0 18
confusionMatrix(pred,trainglassnb$Type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1  2  2  0  0  0  0
##          2  3 12  0  5  0  0
##          3 41 43 13  0  0  1
##          5  0  3  0  2  0  0
##          6  0  4  1  2  5  3
##          7  0  0  0  0  0 18
## 
## Overall Statistics
##                                           
##                Accuracy : 0.325           
##                  95% CI : (0.2532, 0.4034)
##     No Information Rate : 0.4             
##     P-Value [Acc > NIR] : 0.9792          
##                                           
##                   Kappa : 0.2233          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity           0.04348   0.1875  0.92857  0.22222  1.00000   0.8182
## Specificity           0.98246   0.9167  0.41781  0.98013  0.93548   1.0000
## Pos Pred Value        0.50000   0.6000  0.13265  0.40000  0.33333   1.0000
## Neg Pred Value        0.71795   0.6286  0.98387  0.95484  1.00000   0.9718
## Prevalence            0.28750   0.4000  0.08750  0.05625  0.03125   0.1375
## Detection Rate        0.01250   0.0750  0.08125  0.01250  0.03125   0.1125
## Detection Prevalence  0.02500   0.1250  0.61250  0.03125  0.09375   0.1125
## Balanced Accuracy     0.51297   0.5521  0.67319  0.60118  0.96774   0.9091
pred2<-predict(modelnaivebayes,testglassnb)
head(pred2)
## [1] 3 3 3 3 3 3
## Levels: 1 2 3 5 6 7
confusionMatrix(pred2,testglassnb$Type)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1  0  0  2  0  0  0
##          2  2  1  0  3  0  0
##          3 21  8  1  0  0  0
##          5  0  1  0  0  0  0
##          6  1  2  0  1  4  1
##          7  0  0  0  0  0  6
## 
## Overall Statistics
##                                          
##                Accuracy : 0.2222         
##                  95% CI : (0.1204, 0.356)
##     No Information Rate : 0.4444         
##     P-Value [Acc > NIR] : 0.9998         
##                                          
##                   Kappa : 0.1357         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity           0.00000  0.08333  0.33333  0.00000  1.00000   0.8571
## Specificity           0.93333  0.88095  0.43137  0.98000  0.90000   1.0000
## Pos Pred Value        0.00000  0.16667  0.03333  0.00000  0.44444   1.0000
## Neg Pred Value        0.53846  0.77083  0.91667  0.92453  1.00000   0.9792
## Prevalence            0.44444  0.22222  0.05556  0.07407  0.07407   0.1296
## Detection Rate        0.00000  0.01852  0.01852  0.00000  0.07407   0.1111
## Detection Prevalence  0.03704  0.11111  0.55556  0.01852  0.16667   0.1111
## Balanced Accuracy     0.46667  0.48214  0.38235  0.49000  0.95000   0.9286
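
Accuracy is poor on both the training and test splits. One way to investigate is to look at the per-class posterior probabilities rather than only the hard labels; a minimal sketch using the type = "raw" option of e1071's predict method:

head(predict(modelnaivebayes,testglassnb,type = "raw")) # per-class posterior probabilities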

Decision Tree

The decision tree is one of the most popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.

# re-draw the 75/25 split; this overwrites the partition used for naive Bayes
index2<-sample(nrow(glass),0.75*nrow(glass))
trainglassnb<-glass[index2,]
testglassnb<-glass[-index2,]


library(rpart)
treemodel<-rpart(Type ~.,data = trainglassnb) # Type is a factor, so rpart fits a classification tree

library(rpart.plot)
rpart.plot(treemodel)
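
The plot shows the fitted splits, but the tree is never scored above. A minimal sketch of evaluating it on the held-out data would be (treepred is an illustrative name):

treepred<-predict(treemodel,testglassnb,type = "class")
confusionMatrix(treepred,testglassnb$Type)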

Random Forest

A random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest produces a class prediction, and the class with the most votes becomes the model’s prediction.

library(randomForest)
# ntree=50 trees; mtry defaults to floor(sqrt(9)) = 3 predictors tried at each split
glass_rf<-randomForest(Type~.,data = trainglassnb,ntree=50)
glass_rf
## 
## Call:
##  randomForest(formula = Type ~ ., data = trainglassnb, ntree = 50) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 27.5%
## Confusion matrix:
##    1  2 3 5 6  7 class.error
## 1 44  6 5 0 0  0   0.2000000
## 2  8 43 3 1 2  1   0.2586207
## 3  6  2 6 0 0  0   0.5714286
## 5  0  2 0 7 0  1   0.3000000
## 6  0  3 0 0 4  0   0.4285714
## 7  1  3 0 0 0 12   0.2500000
plot(glass_rf) # OOB error rate as a function of the number of trees
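
Variable importance is often inspected alongside the error plot; a minimal sketch using randomForest's built-in importance plot:

varImpPlot(glass_rf) # mean decrease in Gini for each predictor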

pred2=predict(glass_rf,trainglassnb) # in-sample predictions, so near-perfect accuracy is expected
rf_train=confusionMatrix(trainglassnb$Type,pred2)
rf_train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3  5  6  7
##          1 55  0  0  0  0  0
##          2  0 58  0  0  0  0
##          3  0  0 14  0  0  0
##          5  0  0  0 10  0  0
##          6  0  0  0  0  7  0
##          7  0  0  0  0  0 16
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9772, 1)
##     No Information Rate : 0.3625     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity            1.0000   1.0000   1.0000   1.0000  1.00000      1.0
## Specificity            1.0000   1.0000   1.0000   1.0000  1.00000      1.0
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000  1.00000      1.0
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000  1.00000      1.0
## Prevalence             0.3438   0.3625   0.0875   0.0625  0.04375      0.1
## Detection Rate         0.3438   0.3625   0.0875   0.0625  0.04375      0.1
## Detection Prevalence   0.3438   0.3625   0.0875   0.0625  0.04375      0.1
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000  1.00000      1.0
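
Perfect accuracy on the training data says little about generalization; the OOB estimate above (27.5%) is a more realistic error figure. A minimal sketch of scoring the held-out test set would be (predtest is an illustrative name):

predtest<-predict(glass_rf,testglassnb)
confusionMatrix(predtest,testglassnb$Type)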
