Binary Classification on Patient Data

Introduction :-

In this report, I classify patients by five-year survival using the given patient data set.

Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.

Structure of given dataset :-

The given dataset has 306 observations and each observation has 4 attributes. The first six rows of the dataset are as follows.

##   age year Freq Survival
## 1  30   64    1      yes
## 2  30   62    3      yes
## 3  30   65    0      yes
## 4  31   59    2      yes
## 5  31   65    4      yes
## 6  33   58   10      yes

Explanation of all the variables :-
1. age : Age of patient at time of operation, numerical.
2. year : Patient’s year of operation (64 = 1964), numerical.
3. Freq : Number of positive axillary nodes (metastasis) detected, numerical.
4. Survival : Whether the patient survived 5 years or longer (yes or no), factor.
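As an illustration only, the four attributes can be written down as a small record type. The sketch below is in Python rather than the R that produced the output in this report; `Patient`, `parse_row` and `HEAD` are hypothetical names, and the six rows are the ones printed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    age: int        # age of patient at time of operation
    year: int       # two-digit year of operation, e.g. 64
    freq: int       # number of positive nodes detected
    survived: bool  # True when Survival is "yes"

    @property
    def operation_year(self) -> int:
        # The report states 64 = 1964, so the century is assumed to be 1900.
        return 1900 + self.year


# First six rows, copied from the printed dataset (age, year, Freq, Survival).
HEAD = """\
30 64 1 yes
30 62 3 yes
30 65 0 yes
31 59 2 yes
31 65 4 yes
33 58 10 yes
"""


def parse_row(line: str) -> Patient:
    """Turn one whitespace-separated row into a Patient record."""
    age, year, freq, survival = line.split()
    return Patient(int(age), int(year), int(freq), survival == "yes")


patients = [parse_row(line) for line in HEAD.splitlines()]
for p in patients:
    print(p.age, p.operation_year, p.freq, p.survived)
```

Keeping Survival as a boolean mirrors the two-level factor in the dataset; the remaining attributes stay integers.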

The detailed structure is as follows.

## 'data.frame':    306 obs. of  4 variables:
##  $ age     : int  30 30 30 31 31 33 33 34 34 34 ...
##  $ year    : int  64 62 65 59 65 58 60 59 66 58 ...
##  $ Freq    : int  1 3 0 2 4 10 0 0 9 30 ...
##  $ Survival: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 2 ...

In the input, the type of each attribute is as follows.

##       age      year      Freq  Survival 
## "integer" "integer" "integer"  "factor"

These types are suitable for EDA, so no conversion is needed. We can proceed further with Exploratory Data Analysis.

Dealing with NULL values :-

The number of null values in each column is as follows.

##      age     year     Freq Survival 
##        0        0        0        0

As there are no null values, we can proceed further.

Summary :-

The overall summary of all the attributes is as follows.

##       age             year            Freq        Survival 
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   no : 81  
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   yes:225  
##  Median :52.00   Median :63.00   Median : 1.000            
##  Mean   :52.46   Mean   :62.85   Mean   : 4.026            
##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000            
##  Max.   :83.00   Max.   :69.00   Max.   :52.000

The distribution of all continuous variables is as follows.

The distribution of all continuous variables within each Survival category is as follows.

Age:-

Year of Operation:-

Freq:-

The correlation between the continuous variables is as follows.

##              age         year         Freq
## age   1.00000000  0.089529446 -0.063176102
## year  0.08952945  1.000000000 -0.003764474
## Freq -0.06317610 -0.003764474  1.000000000
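A minimal Python sketch, using only the values printed above, that picks out the strongest pairwise correlation. It is not part of the original analysis, and the variable names are illustrative.

```python
# Correlation matrix copied from the output above (rows/columns: age, year, Freq).
names = ["age", "year", "Freq"]
corr = [
    [1.00000000, 0.089529446, -0.063176102],
    [0.08952945, 1.000000000, -0.003764474],
    [-0.06317610, -0.003764474, 1.000000000],
]

# Keep each distinct pair once: upper triangle, diagonal excluded.
pairs = {
    (names[i], names[j]): corr[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}

# The pair with the largest absolute correlation.
strongest_pair = max(pairs, key=lambda k: abs(pairs[k]))
print(strongest_pair, pairs[strongest_pair])
```

Even the strongest pair stays below 0.1 in absolute value, so none of the three variables is a useful stand-in for another.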

___

Description of EDA :-

In our given input dataset, we have 306 patient records and each patient is observed on 4 attributes.
There are no null values in the input data.
The Freq attribute has many outliers: its median is 1 and its third quartile is 4, but its maximum is 52.
The age and year attributes are evenly spread, with mean and median close together (age: 52.46 and 52, year: 62.85 and 63).
Within each Survival category, the age and year distributions are similar and their medians are almost the same in both groups.
The Freq distribution differs clearly between the groups: the median for patients who survived is lower than for patients who did not.
There is no strong correlation between any pair of numerical variables; the largest absolute value is about 0.09 (age and year).

Models to predict Survival of Patient:-

Now I build several models to predict the Survival of a patient from the attributes described above.

As a first method, I use a Support Vector Machine ( Classifier ), as it is one of the better models for classification.

Support Vector machine ( Classifiers) :-

Support-vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With a linear kernel they fit a linear decision boundary; with a non-linear kernel such as the radial basis function they can also solve non-linear problems.

Fitting SVM - Classifier :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine classifier with the default hyperparameter values of cost ( C ) and gamma (\(\gamma\)).

The summary of the fitted default SVM_classifier :-

## 
## Call:
## svm(formula = classification_form, data = df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  167
## 
##  ( 89 78 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes

We can observe that,

Number of support vectors (the data points that define the margin) : 167
- 89 Yes Category Patients
- 78 No Category Patients
The group of 89 cannot be the No category, because only 81 patients in the data belong to it.
The SVM type is C-classification, which is the default.
Kernel selected is : Radial basis function (RBF)
Default cost value ( C ) is 1

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   24  57
##   yes   9 216

Correct classification rate:-

The accuracy of this SVM classifier on all input data (training data) is 78.43 %.
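The 78.43 % figure can be re-derived from the confusion matrix above. The Python sketch below is a standalone check, not the code used for the report; it also prints the share of each true category that is predicted correctly. The function names are illustrative.

```python
# Confusion matrix of the default SVM, copied from the output above.
# Outer key = true class, inner key = predicted class.
confusion = {
    "no": {"no": 24, "yes": 57},
    "yes": {"no": 9, "yes": 216},
}


def accuracy(cm):
    """Share of all patients whose predicted class equals the true class."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def recall(cm, label):
    """Share of the patients truly in `label` that are predicted as `label`."""
    return cm[label][label] / sum(cm[label].values())


print(f"accuracy : {accuracy(confusion):.2%}")
print(f"no  found: {recall(confusion, 'no'):.2%}")
print(f"yes found: {recall(confusion, 'yes'):.2%}")
```

The overall rate is driven mostly by the yes category: 216 of 225 yes patients are classified correctly, against 24 of 81 no patients.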

Parameters Tuning ( grid search ):-

As we can see, the accuracy of the SVM model with default parameters is modest (78.43 % on the training data). We can tune the parameters C and gamma (\(\gamma\)) to change the smoothness of the fitted decision boundary and try to classify the data points more accurately than before.

The tuning summary using 10-fold cross validation is as follows.

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##      0    4
## 
## - best performance: 0.2653763

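We can observe that the best parameters recommended by 10-fold cross validation are

\(\gamma\) : 0
C : 4

The best performance value of 0.2653763 is the cross-validated misclassification rate, which corresponds to an accuracy of about 73.5 %.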
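To put the tuning result in context, the Python sketch below converts the reported value into an accuracy and compares it with a rule that predicts yes for every patient. It assumes the "best performance" value is the cross-validated misclassification rate, the usual meaning for a classification task. The class counts come from the summary table, and the variable names are illustrative.

```python
cv_error = 0.2653763     # "best performance" from the tuning output
n_no, n_yes = 81, 225    # class counts from the summary of Survival

# Accuracy implied by the cross-validated error.
cv_accuracy = 1 - cv_error

# Error of a rule that predicts "yes" for every patient:
# it is wrong exactly on the "no" patients.
majority_error = n_no / (n_no + n_yes)

print(f"cross-validated accuracy : {cv_accuracy:.2%}")
print(f"always-yes error rate    : {majority_error:.4f}")
print(f"difference in error rate : {cv_error - majority_error:.4f}")
```

The two error rates differ by less than 0.001, so under this reading the selected parameters do essentially no better in cross validation than always predicting yes.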
The tuning plot is as follows :-

Fitting SVM_classifier with new parameters :-

The summary of the fitted tuned SVM classifier is as follows.

## 
## Call:
## svm(formula = classification_form, data = df, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4 
## 
## Number of Support Vectors:  162
## 
##  ( 81 81 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes

We can observe that,

Number of support vectors : 162
- 81 No Category Patients
- 81 Yes Category Patients
The SVM type is C-classification, which is the default.
Kernel selected is : Radial basis function (RBF)
Cost value ( C ) is 4 ( controlled by us )
Gamma value ( \(\gamma\) ) is 0 ( controlled by us )
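One way to see what a gamma of 0 means is sketched below in Python. It assumes the RBF kernel has its usual form \(\exp(-\gamma \lVert x - y \rVert^2)\) and that gamma is exactly 0 rather than a small value rounded in the printout. The two points are raw rows from the dataset; any scaling applied by the fitting software is left out, and `rbf_kernel` is an illustrative name.

```python
import math


def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


a = (30, 64, 1)    # first row of the dataset (age, year, Freq)
b = (33, 58, 10)   # sixth row of the dataset

# Compare gamma = 0 with two non-zero values chosen only for illustration.
for gamma in (0.0, 0.001, 1 / 3):
    print(gamma, rbf_kernel(a, b, gamma))
```

With gamma equal to 0 the kernel value is 1 for every pair of patients, so all patients look identical to the classifier. That is consistent with the confusion matrix below, where every patient receives the same prediction.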

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no    0  81
##   yes   0 225

Correct classification rate:-

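The accuracy of this tuned SVM classifier on all input data is 73.53 %. The confusion matrix shows that it predicts yes for every patient, so this figure equals the share of yes patients in the data (225 of 306).

As a second model, I choose Linear discriminant analysis, as it is one of the better algorithms for classification.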

Linear discriminant analysis (LDA):-

LDA is used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.

Fitting LDA:-

The summary of the fitted LDA model is as follows.

## Call:
## lda(classification_form, data = df)
## 
## Prior probabilities of groups:
##        no       yes 
## 0.2647059 0.7352941 
## 
## Group means:
##          age     year     Freq
## no  53.67901 62.82716 7.456790
## yes 52.01778 62.86222 2.791111
## 
## Coefficients of linear discriminants:
##              LD1
## age  -0.02826395
## year  0.01235497
## Freq -0.14194367
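The coefficients above define a single discriminant score, a weighted sum of the three attributes. The Python sketch below is illustrative and not the original fitting code; it evaluates that sum at the two group means printed above. It leaves out any constant centering offset the fitting software may apply, which would shift both scores equally and leave their difference unchanged.

```python
# Values copied from the LDA output above.
coef = {"age": -0.02826395, "year": 0.01235497, "Freq": -0.14194367}
group_means = {
    "no": {"age": 53.67901, "year": 62.82716, "Freq": 7.456790},
    "yes": {"age": 52.01778, "year": 62.86222, "Freq": 2.791111},
}


def ld1(x):
    """Uncentered projection of one observation onto the discriminant LD1."""
    return sum(coef[name] * x[name] for name in coef)


score_no = ld1(group_means["no"])
score_yes = ld1(group_means["yes"])
gap = score_yes - score_no

# Part of the gap between the group means that is due to Freq alone.
freq_part = coef["Freq"] * (group_means["yes"]["Freq"] - group_means["no"]["Freq"])

print(f"LD1 at mean of no  : {score_no:.4f}")
print(f"LD1 at mean of yes : {score_yes:.4f}")
print(f"share of gap from Freq: {freq_part / gap:.1%}")
```

The yes group sits at the higher score, and most of the separation between the two group means comes from Freq, in line with the EDA finding that age and year are distributed similarly in both groups.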

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   14  67
##   yes  10 215

Correct classification rate:-

The accuracy of this LDA classifier on Training Data is 74.84 %.

Classification Trees:-

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning.

Fitting Classification Tree on training data :-

The summary of the fitted Decision tree is as follows.

## 
##   Conditional inference tree with 2 terminal nodes
## 
## Response:  Survival 
## Inputs:  age, year, Freq 
## Number of observations:  306 
## 
## 1) Freq <= 4; criterion = 1, statistic = 25.082
##   2)*  weights = 230 
## 1) Freq > 4
##   3)*  weights = 76

The same summary can be visualized as follows.

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   39  42
##   yes  37 188

Correct classification rate:-

The accuracy of this Decision tree classifier on Training Data is 74.18 %.
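The fitted tree has a single split, so it can be written as a one-line rule. The Python sketch below is a consistency check, not the original code. The printed summary does not state which class each terminal node predicts, so the labels here are an assumption inferred from matching the node weights to the column totals of the confusion matrix.

```python
def tree_node(freq):
    """Terminal node reached by a patient, following the printed split on Freq."""
    return 2 if freq <= 4 else 3


# Assumption: predicted class per node, inferred from the counts below.
NODE_LABEL = {2: "yes", 3: "no"}

# Copied from the output above.
node_weights = {2: 230, 3: 76}
confusion = {
    "no": {"no": 39, "yes": 42},     # true no
    "yes": {"no": 37, "yes": 188},   # true yes
}

# Column totals = number of patients predicted as each class.
predicted_counts = {
    label: sum(confusion[true][label] for true in confusion)
    for label in ("no", "yes")
}

print(predicted_counts)
print(NODE_LABEL[tree_node(1)], NODE_LABEL[tree_node(10)])
```

The 76 patients predicted as no match the weight of node 3, and the 230 predicted as yes match node 2. The tree therefore predicts no exactly when Freq is greater than 4.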

Conclusion on Classification Models:-

The summary of all the fitted models and their performance on the training data is as follows.

CLASSIFICATION MODELS SUMMARY

S No	Model Name	ACC. on Given Data
1.	LDA Model	74.84 %
2.	SVM Classifier	78.43 %
3.	Tuned SVM Classifier	73.53 %
4.	Decision Tree ( Classifier )	74.18 %

As the SVM Classifier with default parameters has the highest accuracy on the given data (78.43 %), I conclude that it is the best of these models for predicting the survival of a patient.
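The summary table can be rebuilt from the four confusion matrices reported earlier. The Python sketch below is a cross-check with illustrative names, not the code behind the table. Each matrix is written as (true no row, true yes row), with predicted no first in each row.

```python
# Confusion matrices copied from the sections above.
models = {
    "LDA Model": ((14, 67), (10, 215)),
    "SVM Classifier": ((24, 57), (9, 216)),
    "Tuned SVM Classifier": ((0, 81), (0, 225)),
    "Decision Tree": ((39, 42), (37, 188)),
}


def accuracy_pct(cm):
    """Correctly classified patients as a percentage, rounded to 2 decimals."""
    (no_as_no, no_as_yes), (yes_as_no, yes_as_yes) = cm
    total = no_as_no + no_as_yes + yes_as_no + yes_as_yes
    return round(100 * (no_as_no + yes_as_yes) / total, 2)


table = {name: accuracy_pct(cm) for name, cm in models.items()}
best_model = max(table, key=table.get)

for name, acc in table.items():
    print(f"{name:<22} {acc:.2f} %")
print("highest accuracy on the given data:", best_model)
```

All four percentages agree with the table. They are measured on the same 306 records used for fitting.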

Predicting new Unseen Data:-

If we have information on a new patient, our models predict Survival (or not) as follows.

PREDICTION OF NEW DATA

MODEL	AGE	YEAR	FREQ	PREDICTION_CLASS
SVM_Classifier	25	60	15	SURVIVED
Tuned_SVM_Classifier	25	60	15	SURVIVED
LDA_Model	25	60	15	SURVIVED

PREDICTION OF NEW DATA

MODEL	AGE	YEAR	FREQ	PREDICTION_CLASS
SVM_Classifier	30	58	1	SURVIVED
Tuned_SVM_Classifier	30	58	1	SURVIVED
LDA_Model	30	58	1	SURVIVED

PREDICTION OF NEW DATA

MODEL	AGE	YEAR	FREQ	PREDICTION_CLASS
SVM_Classifier	60	67	47	SURVIVED
Tuned_SVM_Classifier	60	67	47	SURVIVED
LDA_Model	60	67	47	NOT SURVIVED

————————————- THANK YOU ————————————-

Anil Kumar Kanasani - 11013622

2021-02-15
