In this document we will try to analyze a data set from UCI machine learning repository built by Dr. Henrique da Mota containing values for six biomechanical features used to classify orthopaedic patients into 3 classes (normal, disk hernia or spondilolysthesis) or 2 classes (normal or abnormal).In this example we will use the 2 classes data set.We will try fit a logistic regression model and CART model and compare their performance based on the AUC(area under curve) using R and some built-in libraries that will help us automate some of the procedure.There are several metrics to evaluate your model’s outcome such as accuracy,recall,sensitivity etc but we will stay with the AUC in this example
First we will bring in the data setting the working directory since the file is stored locally in the hard drive.
setwd("/Users/christoschalitsios/Documents/vertebral_column_data")
spine<-read.csv("spine_2c.csv")
str(spine)
## 'data.frame': 310 obs. of 7 variables:
## $ plv_inc : num 63 39.1 68.8 69.3 49.7 ...
## $ plv_tilt : num 22.55 10.06 22.22 24.65 9.65 ...
## $ l_l_angle : num 39.6 25 50.1 44.3 28.3 ...
## $ sacr_slope : num 40.5 29 46.6 44.6 40.1 ...
## $ plv_radius : num 98.7 114.4 106 101.9 108.2 ...
## $ grade_spondy: num -0.25 4.56 -3.53 11.21 7.92 ...
## $ class : Factor w/ 2 levels "AB","NO": 1 1 1 1 1 1 1 1 1 1 ...
head(spine)
## plv_inc plv_tilt l_l_angle sacr_slope plv_radius grade_spondy class
## 1 63.03 22.55 39.61 40.48 98.67 -0.25 AB
## 2 39.06 10.06 25.02 29.00 114.41 4.56 AB
## 3 68.83 22.22 50.09 46.61 105.99 -3.53 AB
## 4 69.30 24.65 44.31 44.64 101.87 11.21 AB
## 5 49.71 9.65 28.32 40.06 108.17 7.92 AB
## 6 40.25 13.92 25.12 26.33 130.33 2.23 AB
The dataset consists of the variables ‘pelvic incidence’,‘pelvic tilt’,‘lumbar lordosis angle’,‘sacral slope’,‘pelvic radius’,‘grade of spondylolisthesis’ and the ‘class’ (factor variable which contains the levels ABNORMAL(AB) and NORMAL(NO)).The first five variables are quantitative and the image below shows how they actually defined and what they measure
Next some preliminary visualizations of the data to explore our features and get an idea about the variables and relations amongst them
library(corrplot)
pairs(spine[,1:6],col=c("red","blue")[spine$class],pch=20,upper.panel=NULL)
cor_mat<-cor(spine[,1:6])
corrplot.mixed(cor_mat,tl.cex=0.6)
Looking at the scatterplot matrix we can see there are some some differences in between the two classes(blue=normal,red=abnormal).The correlation plot shows that the pearson’s correlations range from totally uncorrelated to good correlation.
We will first fit a logistic regression model.Because generalization of the predictions for correct classification are crucial we will divide the dataset into a training and testing set.The model algorithm will train in the training set and then it will be validated in the testing set.The training set will hold the 70% of the original data set and 30% will go to the test set.For model performance evaluation we will use the AUC.
library(caTools)
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
set.seed(123)
spl<-sample.split(spine$class,SplitRatio = 0.7)
train<-subset(spine,spl==TRUE)
test<-subset(spine,spl==FALSE)
dim(train)
## [1] 217 7
dim(test)
## [1] 93 7
After the division of the dataset we can fit the first model(logistic regression)
glm1<-glm(class~.,data=train,family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred_glm1<-predict(glm1,newdata=test,type="response")
roc<-prediction(pred_glm1,test$class)
roc.plot<-performance(roc,"tpr","fpr")
auc<-as.numeric(performance(roc,"auc")@y.values)
auc
## [1] 0.9137566
plot(roc.plot,colorize=TRUE)
We will examine now the performance of CART model(classification tree) using the “rpart” library.
library(rpart)
library(rpart.plot)
rpart1<-rpart(class~.,data=train,method="class")
prp(rpart1)
pred_rpart1<-predict(rpart1,newdata=test,type="prob")
roc_tree<-prediction(pred_rpart1[,2],test$class)
roc.plot_tree<-performance(roc_tree,"tpr","fpr")
auc<-as.numeric(performance(roc_tree,"auc")@y.values)
auc
## [1] 0.8412698
plot(roc.plot_tree,colorize=TRUE)
Examining the two models we can observe that in the current circumstances the logistic regression model can differentiate the 2 classes better with an AUC of about 0.91 in comparison with the CART model but there is a drawback in the logistic model in terms of intepretability since the CART produces easier to understand results.So it is up to the scope of the application which model you will choose if accuracy is your goal then go with the logistic regression model otherwise CART is the best choice.Of course there a lot models you can apply to classification problems (besides logistic regression and CART).These 2 models are pretty basic and prone to tuning to produce even better results.