Description
A data frame with 214 observations containing examples of the chemical analysis of 7 different types of glass (six of which appear in this file, as the factor levels further down show). The problem is to predict the type of glass on the basis of the chemical analysis. The study of the classification of glass types was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence (if it is correctly identified!).
Extraction was done by B. German, Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN. Donor: Vina Spiehler, Ph.D., DABFT, Diagnostic Products Corporation.
The main objective is to classify the type of glass from its explanatory variables: the refractive index (RI) and the content of Na, Mg, Al, Si, K, Ca, Ba and Fe.
glass<-read.csv("C:\\Users\\sai\\Desktop\\R files\\Data sets\\glass.csv",header = T)
head(glass)
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
str(glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: int 1 1 1 1 1 1 1 1 1 1 ...
glass$Type<-factor(glass$Type)
glass1<-glass[,-10]
normalize=function(x){
return((x-min(x))/(max(x)-min(x)))
}
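# Note: called on the whole data frame, normalize() uses the global minimum and maximum
# across all columns, so the columns are not each rescaled to [0, 1] (see the output below).
glass2<-normalize(glass1)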
head(glass2)
## RI Na Mg Al Si K
## 1 0.02016987 0.1808779 0.05954117 0.01458692 0.9518631 0.0007956504
## 2 0.02012478 0.1841931 0.04773903 0.01803474 0.9644609 0.0063652036
## 3 0.02010582 0.1794192 0.04707598 0.02042169 0.9679088 0.0051717279
## 4 0.02012545 0.1751757 0.04893250 0.01710648 0.9628696 0.0075586792
## 5 0.02012226 0.1759714 0.04800424 0.01644344 0.9691022 0.0072934624
## 6 0.02010290 0.1696062 0.04787164 0.02148256 0.9676435 0.0084869381
## Ca Ba Fe
## 1 0.1160324 0 0.000000000
## 2 0.1038324 0 0.000000000
## 3 0.1031693 0 0.000000000
## 4 0.1090041 0 0.000000000
## 5 0.1070150 0 0.000000000
## 6 0.1070150 0 0.003447819
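In the output above, Si sits near 0.96 and RI near 0.02, which is what happens when the minimum and maximum are taken over the entire data frame. If the intent was to bring every variable onto its own 0 to 1 range before computing distances, a column-wise version could look like the sketch below. The helper name `normalize_columns` and the toy values are assumptions made for illustration; they are not part of the original analysis.

```r
# Min-max scaling of a single numeric vector to [0, 1].
# (A constant vector would give 0/0 = NaN; that case is not handled here.)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical helper: apply normalize() to each column separately.
normalize_columns <- function(df) {
  as.data.frame(lapply(df, normalize))
}

# Toy data with two columns on very different scales (illustrative values only).
toy <- data.frame(RI = c(1.515, 1.517, 1.521),
                  Si = c(71.8, 72.7, 73.1))

scaled <- normalize_columns(toy)
sapply(scaled, range)  # every column now runs from 0 to 1
```

With the glass data this would be called as `normalize_columns(glass1)`; the numbers printed in the rest of this report come from the original whole-frame version and would change.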
index1<-sample(nrow(glass1),0.75*nrow(glass2))
trainglass<-glass2[index1,]
testglass<-glass2[-index1,]
ytrain=glass$Type[index1]
ytest=glass$Type[-index1]
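Because `sample()` draws a different subset on every run, the split above (and every figure that follows) will vary between runs. A minimal sketch of a reproducible 75/25 split is shown below; the seed value is arbitrary and the variable names are illustrative assumptions.

```r
set.seed(123)                      # arbitrary seed, fixed only for reproducibility
n <- 214                           # number of rows reported by str(glass)

train_idx <- sample(n, floor(0.75 * n))       # 75% of row indices, without replacement
test_idx  <- setdiff(seq_len(n), train_idx)   # the remaining rows

length(train_idx)  # 160
length(test_idx)   # 54
```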
K-Nearest Neighbors (KNN) is one of the simplest classification algorithms in machine learning. It is a supervised method that is widely used in pattern recognition, data mining and intrusion detection. Given labelled training data, it assigns a new observation to the class that is most common among its k closest training points.
library(class)
# Rule-of-thumb k: the square root of the training-set size (about 12.6 here, not a whole number)
knnmodelglass=knn(trainglass,testglass,k=sqrt(nrow(trainglass)),cl=ytrain)
knnmodelglass
## [1] 2 1 1 1 1 3 1 1 1 1 1 1 1 1 1 2 1 1 3 2 1 1 1 2 2 2 2 2 2 1 3 1 2 2 2
## [36] 2 2 1 2 2 2 5 1 5 3 2 2 7 7 7 7 2 7 7
## Levels: 1 2 3 5 6 7
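To make the nearest-neighbour vote concrete, here is a minimal base-R sketch that classifies a single query point. It is an illustration of the idea only, not the implementation inside `class::knn`; the helper `knn_one` and the toy points are assumptions, and ties are resolved by simply taking the first class with the highest count.

```r
# Hypothetical helper: classify one query point by majority vote of its k nearest neighbours.
knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(d)[seq_len(k)]        # indices of the k closest rows
  votes <- table(labels[nearest])        # count the classes among them
  names(votes)[which.max(votes)]         # most frequent class (first one on ties)
}

# Two well-separated toy clusters (illustrative values only).
train <- matrix(c(0, 0,
                  0, 1,
                  1, 0,
                  5, 5,
                  5, 6,
                  6, 5), ncol = 2, byrow = TRUE)
labels <- factor(c("a", "a", "a", "b", "b", "b"))

knn_one(train, labels, query = c(0.5, 0.5), k = 3)  # "a"
knn_one(train, labels, query = c(5.5, 5.5), k = 3)  # "b"
```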
A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data.
The matrix gives insight into the errors made by the classifier and into the types of error that are being made.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
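# Note: caret's confusionMatrix() expects the predictions first and the true labels second.
# Here the arguments are swapped, so the rows labelled "Prediction" below hold the true classes.
confusionMatrix(ytest,knnmodelglass)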
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 18 3 2 0 0 0
## 2 2 11 1 0 0 0
## 3 1 0 0 0 0 0
## 5 0 3 0 1 0 0
## 6 1 0 0 1 0 0
## 7 0 3 1 0 0 6
##
## Overall Statistics
##
## Accuracy : 0.6667
## 95% CI : (0.5253, 0.7891)
## No Information Rate : 0.4074
## P-Value [Acc > NIR] : 0.0001062
##
## Kappa : 0.5277
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.8182 0.5500 0.00000 0.50000 NA 1.0000
## Specificity 0.8438 0.9118 0.98000 0.94231 0.96296 0.9167
## Pos Pred Value 0.7826 0.7857 0.00000 0.25000 NA 0.6000
## Neg Pred Value 0.8710 0.7750 0.92453 0.98000 NA 1.0000
## Prevalence 0.4074 0.3704 0.07407 0.03704 0.00000 0.1111
## Detection Rate 0.3333 0.2037 0.00000 0.01852 0.00000 0.1111
## Detection Prevalence 0.4259 0.2593 0.01852 0.07407 0.03704 0.1852
## Balanced Accuracy 0.8310 0.7309 0.49000 0.72115 NA 0.9583
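The headline accuracy and the Sensitivity row can be recomputed by hand from the counts printed above, which helps when reading the caret summary. The sketch below copies the 6 x 6 table from the output; only the object names are new.

```r
classes <- c("1", "2", "3", "5", "6", "7")

# Counts copied row by row from the printed confusion matrix.
cm <- matrix(c(18,  3, 2, 0, 0, 0,
                2, 11, 1, 0, 0, 0,
                1,  0, 0, 0, 0, 0,
                0,  3, 0, 1, 0, 0,
                1,  0, 0, 1, 0, 0,
                0,  3, 1, 0, 0, 6),
             nrow = 6, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)   # 0.6667

# The printed "Sensitivity" row: diagonal divided by the column totals.
per_class <- diag(cm) / colSums(cm)
round(per_class, 4)  # class 6 is NaN because its column total is zero
```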
Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They are not a single algorithm but share a common assumption: the features are conditionally independent of one another given the class.
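The independence assumption means that the class score is just the log prior plus a sum of per-feature log-likelihoods. The sketch below shows this for numeric features with a Gaussian likelihood per feature and class. It is a simplified illustration, not the code inside `e1071::naiveBayes`; the helper `gnb_predict` and the toy data are assumptions.

```r
# Hypothetical helper: Gaussian naive Bayes prediction for one new observation.
gnb_predict <- function(X, y, newx) {
  classes <- levels(y)
  scores <- sapply(classes, function(cl) {
    Xc <- X[y == cl, , drop = FALSE]          # training rows of this class
    log_prior <- log(nrow(Xc) / nrow(X))      # class frequency as the prior
    # Independence assumption: the joint log-likelihood is a sum over features.
    # (A feature that is constant within a class gives sd = 0 and a degenerate density.)
    log_lik <- sum(dnorm(newx,
                         mean = colMeans(Xc),
                         sd   = apply(Xc, 2, sd),
                         log  = TRUE))
    log_prior + log_lik
  })
  classes[which.max(scores)]                  # class with the highest posterior score
}

# Toy data: two classes, two numeric features (illustrative values only).
X <- matrix(c(1.0, 10.0,
              1.2, 10.5,
              0.8,  9.5,
              3.0, 20.0,
              3.2, 21.0,
              2.8, 19.0), ncol = 2, byrow = TRUE)
y <- factor(c("a", "a", "a", "b", "b", "b"))

gnb_predict(X, y, newx = c(1.1, 10.2))  # "a"
gnb_predict(X, y, newx = c(3.1, 20.5))  # "b"
```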
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(glass2)
## vars n mean sd median trimmed mad min max range skew kurtosis
## RI 1 214 0.02 0.00 0.02 0.02 0.00 0.02 0.02 0.00 1.60 4.72
## Na 2 214 0.18 0.01 0.18 0.18 0.01 0.14 0.23 0.09 0.45 2.90
## Mg 3 214 0.04 0.02 0.05 0.04 0.00 0.00 0.06 0.06 -1.14 -0.45
## Al 4 214 0.02 0.01 0.02 0.02 0.00 0.00 0.05 0.04 0.89 1.94
## Si 5 214 0.96 0.01 0.97 0.96 0.01 0.93 1.00 0.07 -0.72 2.82
## K 6 214 0.01 0.01 0.01 0.01 0.00 0.00 0.08 0.08 6.46 52.87
## Ca 7 214 0.12 0.02 0.11 0.12 0.01 0.07 0.21 0.14 2.02 6.41
## Ba 8 214 0.00 0.01 0.00 0.00 0.00 0.00 0.04 0.04 3.37 12.08
## Fe 9 214 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 1.73 2.52
## se
## RI 0
## Na 0
## Mg 0
## Al 0
## Si 0
## K 0
## Ca 0
## Ba 0
## Fe 0
colSums(is.na(glass2))
## RI Na Mg Al Si K Ca Ba Fe
## 0 0 0 0 0 0 0 0 0
library(mice)
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
library(ggplot2)
library(e1071)
index2<-sample(nrow(glass),0.75*nrow(glass))
trainglassnb<-glass[index2,]
testglassnb<-glass[-index2,]
modelnaivebayes<-naiveBayes(Type~.,data = trainglassnb)
pred<-predict(modelnaivebayes,trainglassnb)
head(pred)
## [1] 3 3 6 5 2 3
## Levels: 1 2 3 5 6 7
table(pred,trainglassnb$Type)
##
## pred 1 2 3 5 6 7
## 1 2 2 0 0 0 0
## 2 3 12 0 5 0 0
## 3 41 43 13 0 0 1
## 5 0 3 0 2 0 0
## 6 0 4 1 2 5 3
## 7 0 0 0 0 0 18
confusionMatrix(pred,trainglassnb$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 2 2 0 0 0 0
## 2 3 12 0 5 0 0
## 3 41 43 13 0 0 1
## 5 0 3 0 2 0 0
## 6 0 4 1 2 5 3
## 7 0 0 0 0 0 18
##
## Overall Statistics
##
## Accuracy : 0.325
## 95% CI : (0.2532, 0.4034)
## No Information Rate : 0.4
## P-Value [Acc > NIR] : 0.9792
##
## Kappa : 0.2233
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.04348 0.1875 0.92857 0.22222 1.00000 0.8182
## Specificity 0.98246 0.9167 0.41781 0.98013 0.93548 1.0000
## Pos Pred Value 0.50000 0.6000 0.13265 0.40000 0.33333 1.0000
## Neg Pred Value 0.71795 0.6286 0.98387 0.95484 1.00000 0.9718
## Prevalence 0.28750 0.4000 0.08750 0.05625 0.03125 0.1375
## Detection Rate 0.01250 0.0750 0.08125 0.01250 0.03125 0.1125
## Detection Prevalence 0.02500 0.1250 0.61250 0.03125 0.09375 0.1125
## Balanced Accuracy 0.51297 0.5521 0.67319 0.60118 0.96774 0.9091
pred2<-predict(modelnaivebayes,testglassnb)
head(pred2)
## [1] 3 3 3 3 3 3
## Levels: 1 2 3 5 6 7
confusionMatrix(pred2,testglassnb$Type)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 0 0 2 0 0 0
## 2 2 1 0 3 0 0
## 3 21 8 1 0 0 0
## 5 0 1 0 0 0 0
## 6 1 2 0 1 4 1
## 7 0 0 0 0 0 6
##
## Overall Statistics
##
## Accuracy : 0.2222
## 95% CI : (0.1204, 0.356)
## No Information Rate : 0.4444
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1357
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 0.00000 0.08333 0.33333 0.00000 1.00000 0.8571
## Specificity 0.93333 0.88095 0.43137 0.98000 0.90000 1.0000
## Pos Pred Value 0.00000 0.16667 0.03333 0.00000 0.44444 1.0000
## Neg Pred Value 0.53846 0.77083 0.91667 0.92453 1.00000 0.9792
## Prevalence 0.44444 0.22222 0.05556 0.07407 0.07407 0.1296
## Detection Rate 0.00000 0.01852 0.01852 0.00000 0.07407 0.1111
## Detection Prevalence 0.03704 0.11111 0.55556 0.01852 0.16667 0.1111
## Balanced Accuracy 0.46667 0.48214 0.38235 0.49000 0.95000 0.9286
Decision trees are among the most widely used tools for classification and prediction. A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.
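The test at each internal node is chosen to make the resulting child nodes as pure as possible. For classification, `rpart` uses the Gini index by default; the sketch below computes the Gini impurity of a node and the weighted impurity after a candidate split. The helper names and the toy data are assumptions for illustration, and the search over all variables and thresholds that a real tree performs is left out.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2).
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Hypothetical helper: weighted impurity of the two children after splitting on x < threshold.
split_impurity <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Toy data: one numeric attribute and two classes (illustrative values only).
x      <- c(0.1, 0.4, 0.5, 2.0, 2.3, 2.9)
labels <- c("a", "a", "a", "b", "b", "b")

gini(labels)                      # 0.5: two equally frequent classes
split_impurity(x, labels, 1.0)    # 0: this threshold separates the classes perfectly
split_impurity(x, labels, 0.45)   # 0.25: one child still mixes both classes
```

A tree-growing algorithm would prefer the threshold 1.0 here, since it gives the largest drop in impurity.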
index2<-sample(nrow(glass),0.75*nrow(glass))
trainglassnb<-glass[index2,]
testglassnb<-glass[-index2,]
library(rpart)
treemodel<-rpart(Type ~.,data = trainglassnb)
library(rpart.plot)
rpart.plot(treemodel)
Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest produces a class prediction, and the class with the most votes becomes the model's prediction.
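The voting step can be sketched in a few lines of base R. The matrix of per-tree predictions below is made up for illustration (three observations, five hypothetical trees); in a real forest each column would come from a tree grown on a bootstrap sample.

```r
# Each row is one observation; each column is the class predicted by one hypothetical tree.
tree_votes <- matrix(c("1", "1", "2", "1", "7",
                       "2", "2", "2", "1", "2",
                       "7", "5", "7", "7", "7"),
                     nrow = 3, byrow = TRUE)

# Return the most frequent class among a set of votes (first one on ties).
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

forest_prediction <- apply(tree_votes, 1, majority_vote)
forest_prediction  # "1" "2" "7"
```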
library(randomForest)
glass_rf<-randomForest(Type~.,data = trainglassnb,ntree=50)
glass_rf
##
## Call:
## randomForest(formula = Type ~ ., data = trainglassnb, ntree = 50)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 27.5%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 44 6 5 0 0 0 0.2000000
## 2 8 43 3 1 2 1 0.2586207
## 3 6 2 6 0 0 0 0.5714286
## 5 0 2 0 7 0 1 0.3000000
## 6 0 3 0 0 4 0 0.4285714
## 7 1 3 0 0 0 12 0.2500000
plot(glass_rf)
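# Predictions on the training data itself, so the figures below are training (resubstitution) results.
pred2=predict(glass_rf,trainglassnb)
# Note: caret's confusionMatrix() expects predictions first and true labels second; the order is swapped here.
rf_train=confusionMatrix(trainglassnb$Type,pred2)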
rf_train
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 5 6 7
## 1 55 0 0 0 0 0
## 2 0 58 0 0 0 0
## 3 0 0 14 0 0 0
## 5 0 0 0 10 0 0
## 6 0 0 0 0 7 0
## 7 0 0 0 0 0 16
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9772, 1)
## No Information Rate : 0.3625
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 5 Class: 6 Class: 7
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.00000 1.0
## Specificity 1.0000 1.0000 1.0000 1.0000 1.00000 1.0
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.00000 1.0
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.00000 1.0
## Prevalence 0.3438 0.3625 0.0875 0.0625 0.04375 0.1
## Detection Rate 0.3438 0.3625 0.0875 0.0625 0.04375 0.1
## Detection Prevalence 0.3438 0.3625 0.0875 0.0625 0.04375 0.1
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.00000 1.0
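An accuracy of 1 on the training data says little about performance on new glass samples, because the forest has already seen these rows. The out-of-bag (OOB) confusion matrix printed by `glass_rf` earlier is a more honest guide. The sketch below recomputes the OOB error and the per-class errors from those printed counts; only the object names are new.

```r
classes <- c("1", "2", "3", "5", "6", "7")

# OOB counts copied from the printed randomForest output (rows: actual class, columns: predicted).
oob <- matrix(c(44,  6, 5, 0, 0,  0,
                 8, 43, 3, 1, 2,  1,
                 6,  2, 6, 0, 0,  0,
                 0,  2, 0, 7, 0,  1,
                 0,  3, 0, 0, 4,  0,
                 1,  3, 0, 0, 0, 12),
              nrow = 6, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Overall OOB error: share of cases off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)
oob_error                 # 0.275, matching the 27.5% reported above

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)
```

The gap between 100% training accuracy and roughly 72.5% OOB accuracy is what one would expect from an ensemble of fully grown trees. Predicting on the held-out `testglassnb` rows would give a further independent check.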