Introduction :-

In this report, I attempt to classify patients based on the given patient data set.


Exploratory Data Analysis :-

Exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods.

The given dataset has 306 observations, and each observation has 4 attributes. The head of the dataset is as follows.

##   age year Freq Survival
## 1  30   64    1      yes
## 2  30   62    3      yes
## 3  30   65    0      yes
## 4  31   59    2      yes
## 5  31   65    4      yes
## 6  33   58   10      yes

The detailed structure is as follows.

## 'data.frame':    306 obs. of  4 variables:
##  $ age     : int  30 30 30 31 31 33 33 34 34 34 ...
##  $ year    : int  64 62 65 59 65 58 60 59 66 58 ...
##  $ Freq    : int  1 3 0 2 4 10 0 0 9 30 ...
##  $ Survival: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 2 ...

In the input data, the type of each attribute is as follows.

##       age      year      Freq  Survival 
## "integer" "integer" "integer"  "factor"

For EDA, the type of each attribute is suitable, so we can proceed further with Exploratory Data Analysis.
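
The previews above could be produced along these lines (a minimal sketch; the data frame name `df` is assumed from the model calls printed later in this report):

head(df)           # first six rows of the dataset
str(df)            # structure: 306 obs. of 4 variables
sapply(df, class)  # type of each attribute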


The number of null values in each column is as follows.

##      age     year     Freq Survival 
##        0        0        0        0

As there are no null values, we can proceed further.
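
A sketch of the missing-value check (assuming the same data frame `df`):

colSums(is.na(df))  # number of NA values in each column
summary(df)         # overall summary of all attributes, shown below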


The overall summary of all the attributes is as follows.

##       age             year            Freq        Survival 
##  Min.   :30.00   Min.   :58.00   Min.   : 0.000   no : 81  
##  1st Qu.:44.00   1st Qu.:60.00   1st Qu.: 0.000   yes:225  
##  Median :52.00   Median :63.00   Median : 1.000            
##  Mean   :52.46   Mean   :62.85   Mean   : 4.026            
##  3rd Qu.:60.75   3rd Qu.:65.75   3rd Qu.: 4.000            
##  Max.   :83.00   Max.   :69.00   Max.   :52.000

The distribution of all continuous variables is as follows.


The distribution of all continuous variables within each Survival category is as follows (a plotting sketch appears after the list).

  1. Age:-

  2. Year of Operation:-

  3. Freq:-
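
The plots for these distributions could be produced roughly as follows (a sketch only; `ggplot2` is an assumption here, the original figures may have used base graphics):

library(ggplot2)

# Overall distribution of each continuous variable
ggplot(df, aes(x = age))  + geom_histogram(bins = 20)
ggplot(df, aes(x = year)) + geom_histogram(bins = 12)
ggplot(df, aes(x = Freq)) + geom_histogram(bins = 20)

# Distribution of each continuous variable within each Survival category
ggplot(df, aes(x = Survival, y = age))  + geom_boxplot()
ggplot(df, aes(x = Survival, y = year)) + geom_boxplot()
ggplot(df, aes(x = Survival, y = Freq)) + geom_boxplot()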



The correlation between the continuous variables is as follows.

##              age         year         Freq
## age   1.00000000  0.089529446 -0.063176102
## year  0.08952945  1.000000000 -0.003764474
## Freq -0.06317610 -0.003764474  1.000000000
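
The matrix above can be reproduced with a call along these lines (assuming `df` as before):

cor(df[, c("age", "year", "Freq")])  # Pearson correlation between the numeric attributes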

___

Description of EDA :-

In our data set,

  • The given input dataset has 306 patient records, and each patient is observed on 4 different attributes.

  • There are no null values in the input data.

  • There are a lot of outliers in the Freq attribute.

  • The distributions of the age and year attributes look reasonable.

  • Within each Survival category, the age and year distributions are similar, and the medians are almost the same in both groups.

  • The Freq distribution is clearly different between the two groups: the median Freq of the survived category is lower than that of the not-survived category.

  • There is no strong correlation between any of the numerical variables.


Models to Predict the Survival of a Patient :-

Now I plan to build various models to predict the Survival of a patient, given all of the above attributes for that patient.

As a first method, I use a Support Vector Machine (classifier), as it is one of the better models for classification.

Support Vector Machine ( Classifier ) :-

Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With the use of kernel functions, they can solve both linear and non-linear problems.


Fitting SVM - Classifier :-

With Default Parameters :-

As a first step, I fit a Support Vector Machine classifier with the default hyperparameter values for cost ( C ) and gamma (\(\gamma\)).
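
A minimal sketch of this fit, assuming the `e1071` package and a formula object named `classification_form` as in the call printed below:

library(e1071)

classification_form <- Survival ~ age + year + Freq   # assumed definition of the formula
svm_fit <- svm(classification_form, data = df)        # defaults: C-classification, radial kernel, cost = 1
summary(svm_fit)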

The summary of the fitted default SVM classifier is as follows.

## 
## Call:
## svm(formula = classification_form, data = df)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  167
## 
##  ( 89 78 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes

We can observe that,

  • The number of support vectors (data points forming the margin) is 167:
    • 89 "no" category patients
    • 78 "yes" category patients
  • The algorithm uses the default C-classification type.
  • The selected kernel is the radial basis function (RBF).
  • The default cost value ( C ) is 1.

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   24  57
##   yes   9 216

Correct classification rate:-

The accuracy of this SVM classifier on the full input data (training data) is 78.43 %.
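
The confusion matrix and accuracy above could be computed as in this sketch (the object name `svm_fit` is assumed from the earlier sketch):

svm_pred <- predict(svm_fit, df)                       # predictions on the training data
conf_mat <- table(true = df$Survival, predicted = svm_pred)
conf_mat
round(100 * sum(diag(conf_mat)) / sum(conf_mat), 2)    # correct classification rate in %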

Parameter Tuning ( grid search ) :-

As we can see, the accuracy of the SVM model with default parameters is not very good. We can tune the parameters C and gamma (\(\gamma\)), which control the misclassification penalty and the smoothness of the fitted decision boundary, so that the data points are classified more accurately than before.
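
A sketch of the grid search with `e1071::tune.svm` (the exact parameter grid is an assumption; the summary below reports 10-fold cross-validation):

set.seed(1)  # cross-validation uses random folds
svm_tune <- tune.svm(classification_form, data = df,
                     gamma = c(0, 0.5, 1, 2),   # candidate gamma values
                     cost  = 2^(0:4))           # candidate cost values
summary(svm_tune)
plot(svm_tune)   # tuning plot over the (gamma, cost) grid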

  • The tuning summary using 10-fold cross-validation is as follows.
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##      0    4
## 
## - best performance: 0.2653763

We can observe that the recommended best parameters from the 10-fold cross-validation are:

  • \(\gamma\) : 0

  • C : 4

  • The tuning Plot is as follows :-

Fitting the SVM classifier with the new parameters :-

The summary of the SVM classifier refitted with the tuned parameters is as follows.
## 
## Call:
## svm(formula = classification_form, data = df, gamma = gam, cost = cos)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  4 
## 
## Number of Support Vectors:  162
## 
##  ( 81 81 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes

We can observe that,

  • The number of support vectors (data points forming the margin) is 162:
    • 81 "no" category patients
    • 81 "yes" category patients
  • The algorithm uses the default C-classification type.
  • The selected kernel is the radial basis function (RBF).
  • The cost value ( C ) is 4 (set by us).
  • The gamma value ( \(\gamma\) ) is 0 (set by us).

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no    0  81
##   yes   0 225

Correct classification rate:-

The accuracy of this tuned SVM classifier on the full input data is 73.53 %. Note that, as the confusion matrix shows, the tuned model predicts every patient as "yes", so its accuracy is simply the proportion of surviving patients (225/306).


As a second model, I choose Linear Discriminant Analysis (LDA), as it is one of the better algorithms for classification.

Linear discriminant analysis (LDA):-

LDA is used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.
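
A minimal sketch of the LDA fit shown in the next subsection, assuming the `MASS` package and the same `classification_form` and `df` as before:

library(MASS)

lda_fit  <- lda(classification_form, data = df)   # linear discriminant analysis
lda_fit                                           # priors, group means, LD coefficients
lda_pred <- predict(lda_fit, df)$class            # predicted classes on the training data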

Fitting LDA:-

The summary of the fitted LDA model is as follows.

## Call:
## lda(classification_form, data = df)
## 
## Prior probabilities of groups:
##        no       yes 
## 0.2647059 0.7352941 
## 
## Group means:
##          age     year     Freq
## no  53.67901 62.82716 7.456790
## yes 52.01778 62.86222 2.791111
## 
## Coefficients of linear discriminants:
##              LD1
## age  -0.02826395
## year  0.01235497
## Freq -0.14194367

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   14  67
##   yes  10 215

Correct classification rate:-

The accuracy of this LDA classifier on Training Data is 74.84 %.

Classification Trees:-

Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning. Here I fit a conditional inference tree, a variant that uses significance tests to select the splits.
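
A sketch of such a fit, assuming the `party` package (whose `ctree` prints summaries in the format shown below):

library(party)

tree_fit <- ctree(classification_form, data = df)  # conditional inference tree
print(tree_fit)                                    # text summary of the splits
plot(tree_fit)                                     # visual form of the same tree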


Fitting Classification Tree on training data :-

The summary of the fitted Decision tree is as follows.

## 
##   Conditional inference tree with 2 terminal nodes
## 
## Response:  Survival 
## Inputs:  age, year, Freq 
## Number of observations:  306 
## 
## 1) Freq <= 4; criterion = 1, statistic = 25.082
##   2)*  weights = 230 
## 1) Freq > 4
##   3)*  weights = 76

The same summary can be visualized as follows.

Confusion Matrix on training data :-

##      predicted
## true   no yes
##   no   39  42
##   yes  37 188

Correct classification rate:-

The accuracy of this Decision tree classifier on Training Data is 74.18 %.

Conclusion on Classification Models:-

The summary of all the fitted models and their performance on the given (training) data is as follows.

CLASSIFICATION MODELS SUMMARY

  S No   Model Name                     Acc. on Given Data
  1.     LDA Model                      74.84 %
  2.     SVM Classifier                 78.43 %
  3.     Tuned SVM Classifier           73.53 %
  4.     Decision Tree ( Classifier )   74.18 %

As the SVM classifier (with default parameters) has the highest accuracy, I conclude that it is the better model for predicting the survival of a patient.

Predicting new Unseen Data:-

If we have the information for a new patient, the Survival predictions from our models are as follows.
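
A sketch of how such predictions could be generated (the patient values match the first table below; the fitted-object names `svm_fit`, `svm_tuned_fit` and `lda_fit` are assumptions carried over from the earlier sketches):

new_patient <- data.frame(age = 25, year = 60, Freq = 15)

predict(svm_fit, new_patient)          # default SVM classifier
predict(svm_tuned_fit, new_patient)    # tuned SVM (gamma = 0, cost = 4); assumed object name
predict(lda_fit, new_patient)$class    # LDA model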

PREDICTION OF NEW DATA

  MODEL                   AGE   YEAR   FREQ   PREDICTION_CLASS
  SVM_Classifier          25    60     15     SURVIVED
  Tuned_SVM_Classifier    25    60     15     SURVIVED
  LDA_Model               25    60     15     SURVIVED

PREDICTION OF NEW DATA

  MODEL                   AGE   YEAR   FREQ   PREDICTION_CLASS
  SVM_Classifier          30    58     1      SURVIVED
  Tuned_SVM_Classifier    30    58     1      SURVIVED
  LDA_Model               30    58     1      SURVIVED

PREDICTION OF NEW DATA

  MODEL                   AGE   YEAR   FREQ   PREDICTION_CLASS
  SVM_Classifier          60    67     47     SURVIVED
  Tuned_SVM_Classifier    60    67     47     SURVIVED
  LDA_Model               60    67     47     NOT SURVIVED

————————————- THANK YOU ————————————-