Using Data
We will go through the following steps in our machine learning process:
0. Prepare Environment
1. Import data
2. Data cleansing
3. Train Model
4. Predict
5. Analyse performance
0) Prepare Environment
To do the analysis, we need to load the required packages.
rm(list = ls(all = TRUE)) # Remove any existing variables
require(dplyr)
require(tidyr)
require(caret)
require(nnet)
require(pROC)
require(e1071)
require(rpart)
## For reproducible results we need to start with the same seed
set.seed(1333)
if(.Platform$OS.type =="windows")
{
setwd("D:/gitRepos/PerfAnalysis/")
}else {
setwd("~/gitRepos/PerfAnalysis/")
##library(doMC)
##registerDoMC(cores = 4)
}
source('myMultiClassSummary.R')
1) Import data
We first read the CSV files ML-MUTAG.csv and ML-GV-MUTAG.csv. Here GS is short for GraphletSearch and GV is short for GraphVector.
df.input.GS<-read.csv2(file="./input/ML-MUTAG.csv",
sep=";",
header = TRUE,
dec = ".")
df.input.GV<-read.csv2(file="./input/ML-GV-MUTAG.csv",
sep=";",
header = TRUE,
dec = ".")
Let's see the dimensions of each data frame:
     Rows  Columns
GS    188       47
GV    188       21
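A dimension table like the one above can be built directly from the data frames. The following is a minimal sketch on two small stand-in data frames; `toy.GS` and `toy.GV` are hypothetical and only take the place of `df.input.GS` and `df.input.GV`.

```r
# Hypothetical stand-ins for df.input.GS and df.input.GV
toy.GS <- data.frame(GraphID = 1:4, Class = c(1, -1, 1, 1), d4 = c(2, 0, 1, 3))
toy.GV <- data.frame(GraphID = 1:4, Class = c(1, -1, 1, 1))

# dim() returns c(rows, columns); rbind() stacks the two results into one table
dims <- rbind(GS = dim(toy.GS), GV = dim(toy.GV))
colnames(dims) <- c("Rows", "Columns")
print(dims)
```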
2) Data cleansing
- First we need to correct the column data types so that our classification algorithms work. GraphID and Class are nominal variables (called factors in R).
- Then we find the columns with zero (or near-zero) variance and remove them from the dataset.
df.input.GV$GraphID <- as.factor(df.input.GV$GraphID)
df.input.GV$Class <- as.factor(make.names(df.input.GV$Class))
df.input.GS$GraphID <- as.factor(df.input.GS$GraphID)
df.input.GS$Class <- as.factor(make.names(df.input.GS$Class))
## Find the columns with zero or near-zero variance (returns column indices)
nzv.GS <- nearZeroVar(df.input.GS)
nzv.GV <- nearZeroVar(df.input.GV)
The following columns/features have zero (or near-zero) variance and can be removed. Dimensions before removal:
     Rows  Columns
GS    188       47
GV    188       21
Columns for GS to be removed: Components; d0; d1; d2; d3; d5; d6; d7; d8; d9; d10; d13; d14; d15; d16; d17; d20; d21; d22; d23; d25; d26; d27; d28; d29; d30; d31; d32; d33; d34; d35; d36; d37; d38; X
Columns for GV to be removed: Components; t.0; t.11; t.12; t.13; t.16; X
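The code shown so far only finds the flagged columns; the removal itself is not shown. One way to drop them is sketched below on a hypothetical toy data frame (`toy`, `x` and `const` are made-up names). The guard matters because indexing with an empty negative index would drop every column.

```r
library(caret)

# Hypothetical toy data: 'const' never varies, 'x' does
toy <- data.frame(x = c(1.2, 3.4, 2.2, 5.1, 4.0, 0.7),
                  const = rep(0, 6))

# nearZeroVar() returns the indices of columns with zero or near-zero variance
nzv <- nearZeroVar(toy)

# Drop the flagged columns; if nothing is flagged, keep the data unchanged
toy.clean <- if (length(nzv) > 0) toy[, -nzv, drop = FALSE] else toy
names(toy.clean)
```

The same pattern would apply to `df.input.GS` with `nzv.GS` and to `df.input.GV` with `nzv.GV`.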
Now let's see the shape of our datasets after removing the zero-variance columns:
     Rows  Columns
GS    188       12
GV    188       14
Columns for GS: GraphID; Class; Verticies; Edges; AvgDegree; Density; d4; d11; d12; d18; d19; d24
Columns for GV: GraphID; Class; t.1; t.2; t.3; t.4; t.5; t.6; t.7; t.8; t.9; t.10; t.14; t.15
The dimensions correspond to the following graph types:
- Partition each dataset into training (70%) and test (30%) sets, so that the models can be evaluated on held-out data
## Randomly partition the data into two datasets, training and testing
# GraphletSearch
inTrain.GS<-createDataPartition(y=df.input.GS$Class,p = .7,list=FALSE)
train.GS<-df.input.GS[inTrain.GS,]
test.GS<- df.input.GS[-inTrain.GS,]
#GraphVector
inTrain.GV<-createDataPartition(y=df.input.GV$Class,p = .7,list=FALSE)
train.GV<-df.input.GV[inTrain.GV,]
test.GV<- df.input.GV[-inTrain.GV,]
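createDataPartition() samples within each class, so the class proportions are approximately preserved in both parts. The sketch below illustrates this on hypothetical labels (the vector `y` and its class counts are made up for the example).

```r
library(caret)
set.seed(1333)

# Hypothetical, imbalanced class labels
y <- factor(c(rep("X1", 120), rep("X.1", 60)))

# Row indices of the training part (about 70% of each class)
idx <- createDataPartition(y = y, p = 0.7, list = FALSE)

# Class proportions should be close to each other in both parts
prop.table(table(y[idx]))
prop.table(table(y[-idx]))
```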
3) Train Model
Now we train the following models:
* SVM with linear kernel
* SVM with Polynomial kernel
* SVM with Radial kernel
* Random forest
## Create a data frame of all combinations of the supplied parameters for the radial kernel
grid <- expand.grid(sigma = c(.01, .015, 0.2),
C = c(0.75, 0.9, 1, 1.1, 1.25))
##Create a data frame from all combinations of the supplied parameters for Polynomial kernel
grid.poly <- expand.grid(C = c(0.75, 0.9, 1, 1.1, 1.25),
scale=c(.0001),
degree=1:2)
# We need to check whether this is binary or multiclass classification,
# because each requires a different summary function.
if(length(unique(df.input.GS$Class)) > 2)
{
ctrl <-trainControl(method="repeatedcv",
repeats=10,
classProbs=TRUE,
summaryFunction=myMultiClassSummary)
} else
{
ctrl <- trainControl(method="repeatedcv",
repeats=10,
classProbs=TRUE,
summaryFunction=twoClassSummary)
}
## Resampling settings for the Random forest model.
RF.control <- trainControl(method="repeatedcv", number=10, repeats=3)
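To get a feel for the cost of tuning, the number of candidate parameter combinations can be multiplied by the number of resamples. The sketch below rebuilds the two grids from the text; the figure of 10 folds is an assumption based on caret's default `number` for cross-validation methods, since `ctrl` only sets `repeats`.

```r
# Same tuning grids as in the text
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25))
grid.poly <- expand.grid(C = c(0.75, 0.9, 1, 1.1, 1.25),
                         scale = c(.0001),
                         degree = 1:2)

# Assumption: 10 folds (caret's default for CV methods) x 10 repeats
resamples <- 10 * 10

nrow(grid) * resamples       # model fits needed to tune the radial kernel
nrow(grid.poly) * resamples  # model fits needed to tune the polynomial kernel
```

With 15 and 10 candidate combinations respectively, tuning dominates the running time compared with the single final fit.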
Before training, we center and scale the features, because SVMs are sensitive to differences in feature scale.
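Centering and scaling means subtracting each feature's mean and dividing by its standard deviation. The sketch below does this by hand in base R on hypothetical feature values, to show that the statistics are estimated on the training data and then reused for the test data; in the models that follow, caret handles this through the `preProc` argument.

```r
# Hypothetical feature values
train.x <- data.frame(Edges = c(10, 20, 30, 40), Density = c(0.1, 0.2, 0.4, 0.3))
test.x  <- data.frame(Edges = c(25, 50),         Density = c(0.25, 0.5))

# Estimate centre and spread on the training data only
mu    <- colMeans(train.x)
sigma <- apply(train.x, 2, sd)

# Apply the same transformation to both sets
train.scaled <- scale(train.x, center = mu, scale = sigma)
test.scaled  <- scale(test.x,  center = mu, scale = sigma)

colMeans(train.scaled)       # approximately 0 for every feature
apply(train.scaled, 2, sd)   # 1 for every feature
```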
SVM linear
svmModel.linear.GS <- train(x=train.GS[,-c(1,2)],
y= train.GS[,2],
method = "svmLinear",
preProc = c("center","scale"),
trControl=ctrl)
svmModel.linear.GV <- train(x=train.GV[,-c(1,2)],
y= train.GV[,2],
method = "svmLinear",
preProc = c("center","scale"),
trControl=ctrl)
SVM with polynomial kernel
svmModel.Poly.GS <- train(x=train.GS[,-c(1,2)],
y= train.GS[,2],
method = "svmPoly",
preProc = c("center","scale"),
tuneGrid = grid.poly,
trControl=ctrl)
svmModel.Poly.GV <- train(x=train.GV[,-c(1,2)],
y= train.GV[,2],
method = "svmPoly",
preProc = c("center","scale"),
tuneGrid = grid.poly,
trControl=ctrl)
SVM with radial kernel
svmModel.Radial.GS <- train(x=train.GS[,-c(1,2)],
y= train.GS[,2],
method = "svmRadial",
preProc = c("center","scale"),
tuneGrid = grid,
trControl=ctrl)
svmModel.Radial.GV <- train(x=train.GV[,-c(1,2)],
y= train.GV[,2],
method = "svmRadial",
preProc = c("center","scale"),
tuneGrid = grid,
trControl=ctrl)
Random forest
# The formula interface needs the response as a named column (here y)
# The number of folds is already set in RF.control, so it is not passed to train()
# GraphletSearch dataset
RfTrainGS<-cbind(y=train.GS[,2],train.GS[,-c(1,2)])
RfModelGS <- train(y~.,data=RfTrainGS,method = "rf",
preProcess=c("center","scale"),
trControl=RF.control,
prox=TRUE,
tuneGrid=expand.grid(mtry = 5),
ntree=500)
# GraphVector dataset
RfTrainGV<-cbind(y=train.GV[,2],train.GV[,-c(1,2)])
RfModelGV <- train(y~.,data=RfTrainGV,method = "rf",
preProcess=c("center","scale"),
trControl=RF.control,
prox=TRUE,
tuneGrid=expand.grid(mtry = 5),
ntree=500)
4) Predict
We predict the Class value for each graph in the test dataset, so that we can later compare the predicted class with the actual class and calculate the performance.
## Predicting for different models for GraphletSearch test dataset
## -c(1,2) means removing GraphID,Class
prediction.y.linear.GS<-predict(svmModel.linear.GS,test.GS[,-c(1,2)])
prediction.y.Radial.GS<-predict(svmModel.Radial.GS,test.GS[,-c(1,2)])
prediction.y.Poly.GS <-predict(svmModel.Poly.GS,test.GS[,-c(1,2)])
prediction.y.RF.GS<-predict(RfModelGS,test.GS[,-c(1,2)])
## Predicting for different models for GraphVector test dataset
prediction.y.linear.GV<-predict(svmModel.linear.GV,test.GV[,-c(1,2)])
prediction.y.Radial.GV<-predict(svmModel.Radial.GV,test.GV[,-c(1,2)])
prediction.y.Poly.GV <-predict(svmModel.Poly.GV,test.GV[,-c(1,2)])
prediction.y.RF.GV<-predict(RfModelGV,test.GV[,-c(1,2)])
5) Analyse performance
First we calculate performance for each model, then we will put them in a table for each dataset:
confMatrix.linear.GS<-confusionMatrix(test.GS[,2],prediction.y.linear.GS)
confMatrix.linear.GV<-confusionMatrix(test.GV[,2],prediction.y.linear.GV)
confMatrix.radial.GS<-confusionMatrix(test.GS[,2],prediction.y.Radial.GS)
confMatrix.radial.GV<-confusionMatrix(test.GV[,2],prediction.y.Radial.GV)
confMatrix.poly.GS<-confusionMatrix(test.GS[,2],prediction.y.Poly.GS)
confMatrix.poly.GV<-confusionMatrix(test.GV[,2],prediction.y.Poly.GV)
confMatrix.rf.GS<-confusionMatrix(test.GS[,2],prediction.y.RF.GS)
confMatrix.rf.GV<-confusionMatrix(test.GV[,2],prediction.y.RF.GV)
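caret's confusionMatrix() takes the predicted classes as `data` and the actual classes as `reference`. The sketch below, on hypothetical labels, names both arguments explicitly; accuracy does not depend on the order, but class-specific statistics such as sensitivity do.

```r
library(caret)

# Hypothetical actual and predicted labels
actual    <- factor(c("X1", "X1", "X1", "X.1", "X.1", "X1"),   levels = c("X1", "X.1"))
predicted <- factor(c("X1", "X1", "X.1", "X.1", "X.1", "X.1"), levels = c("X1", "X.1"))

# data = predictions, reference = ground truth
cm <- confusionMatrix(data = predicted, reference = actual)

# Same call with the arguments exchanged, for comparison
cm.swapped <- confusionMatrix(data = actual, reference = predicted)

cm$overall["Accuracy"]             # identical in both versions
cm$byClass["Sensitivity"]          # share of actual X1 graphs predicted as X1
cm.swapped$byClass["Sensitivity"]  # a different quantity
```

Naming the arguments makes the intended roles visible at the call site and avoids mixing up rows and columns in the printed table.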
acc.GS<-cbind(
Statistic='Accuracy',
SVM_Linear_GS=confMatrix.linear.GS$overall['Accuracy'],
SVM_Radial_GS=confMatrix.radial.GS$overall['Accuracy'],
SVM_Poly_GS=confMatrix.poly.GS$overall['Accuracy'],
RandForest_GS=confMatrix.rf.GS$overall['Accuracy'])
kappa.GS<-cbind(
Statistic='Kappa',
SVM_Linear_GS=confMatrix.linear.GS$overall['Kappa'],
SVM_Radial_GS=confMatrix.radial.GS$overall['Kappa'],
SVM_Poly_GS=confMatrix.poly.GS$overall['Kappa'],
RandForest_GS=confMatrix.rf.GS$overall['Kappa'])
acc.GV<-cbind(
Statistic='Accuracy',
SVM_Linear_GV=confMatrix.linear.GV$overall['Accuracy'],
SVM_Radial_GV=confMatrix.radial.GV$overall['Accuracy'],
SVM_Poly_GV=confMatrix.poly.GV$overall['Accuracy'],
RandForest_GV=confMatrix.rf.GV$overall['Accuracy'])
kappa.GV<-cbind(
Statistic='Kappa',
SVM_Linear_GV=confMatrix.linear.GV$overall['Kappa'],
SVM_Radial_GV=confMatrix.radial.GV$overall['Kappa'],
SVM_Poly_GV=confMatrix.poly.GV$overall['Kappa'],
RandForest_GV=confMatrix.rf.GV$overall['Kappa'])
comparison.GS<-as.data.frame(rbind(acc.GS,kappa.GS))
comparison.GV<-as.data.frame(rbind(acc.GV,kappa.GV))
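Because cbind() mixes a character label with the numeric statistics, every value in the tables built above is stored as text. An alternative that keeps the values numeric is sketched below; the vectors `overall.linear` and `overall.rf` are hypothetical stand-ins for `confMatrix.linear.GS$overall` and `confMatrix.rf.GS$overall`, filled with rounded values from the GraphletSearch results reported in this document.

```r
# Stand-ins for confMatrix.*$overall (rounded values from the GraphletSearch results)
overall.linear <- c(Accuracy = 0.8545, Kappa = 0.6697)
overall.rf     <- c(Accuracy = 0.8182, Kappa = 0.6094)

# One column per model, one row per statistic; the values stay numeric
comparison <- data.frame(SVM_Linear = overall.linear[c("Accuracy", "Kappa")],
                         RandForest = overall.rf[c("Accuracy", "Kappa")])

round(comparison, 4)
```

Numeric columns can be rounded, sorted and plotted directly, which is not possible with the character version.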
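The confusion matrix summarises several aspects of a model's performance. As an example, for the GraphletSearch dataset using Random forest:
Note: X1 means Class label “1” and X.1 refers to the Class label “-1” of the graph. confusionMatrix() takes the predictions as its first argument and the actual classes as its second; the calls above pass them the other way round, so the rows labelled Prediction hold the actual classes and the columns labelled Reference hold the predicted ones. Accuracy and Kappa do not depend on the order, but statistics such as Sensitivity, Specificity, Prevalence and the predictive values do.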
Confusion Matrix and Statistics
          Reference
Prediction X1 X.1
       X1  30   7
       X.1  3  15
Accuracy : 0.8182
95% CI : (0.691, 0.9092)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.000462
Kappa : 0.6094
Mcnemar's Test P-Value : 0.342782
Sensitivity : 0.9091
Specificity : 0.6818
Pos Pred Value : 0.8108
Neg Pred Value : 0.8333
Prevalence : 0.6000
Detection Rate : 0.5455
Detection Prevalence : 0.6727
Balanced Accuracy : 0.7955
'Positive' Class : X1
Generally, we can compare the models on each dataset by their accuracy and kappa values. Kappa corrects accuracy for the agreement expected by chance, so among models with similar accuracy, the one with the higher kappa gives the more reliable result.
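To make the relation between the two statistics concrete, kappa can be recomputed by hand from the Random forest confusion matrix shown above. This is a sketch in base R using the standard definition of Cohen's kappa; the variable names are chosen for illustration.

```r
# Counts from the Random forest confusion matrix shown above
cm <- matrix(c(30, 3, 7, 15), nrow = 2,
             dimnames = list(Prediction = c("X1", "X.1"),
                             Reference  = c("X1", "X.1")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa.value <- (po - pe) / (1 - pe)

round(c(Accuracy = po, Kappa = kappa.value), 4)
```

This reproduces the reported Accuracy of 0.8182 and Kappa of 0.6094.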
The performance on the GraphletSearch and GraphVector datasets is shown below:
For GraphletSearch:
          SVM_Linear_GS  SVM_Radial_GS  SVM_Poly_GS  RandForest_GS
Accuracy  0.8545         0.8727         0.8727       0.8182
Kappa     0.6697         0.7150         0.7150       0.6094
For GraphVector:
          SVM_Linear_GV  SVM_Radial_GV  SVM_Poly_GV  RandForest_GV
Accuracy  0.8909         0.8364         0.8364       0.8909
Kappa     0.7591         0.6118         0.6336       0.7523


