Introduction :-
In this report, I attempt to classify patients by five-year survival, based on a patient data set.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
- Structure of given dataset :-
The given dataset has 306 observations, and each observation has 4 attributes. The first six rows of the dataset are as follows.
## age year Freq Survival
## 1 30 64 1 yes
## 2 30 62 3 yes
## 3 30 65 0 yes
## 4 31 59 2 yes
## 5 31 65 4 yes
## 6 33 58 10 yes
Explanation of all the variables :-
age : Age of patient at time of operation, numerical.
year : Patient’s year of operation (64 = 1964), numerical.
Freq : Number of positive axillary nodes (metastasis) detected, numerical.
Survival : Whether the patient survived 5 years or longer (yes or no), factor.
The detailed structure is as follows.
## 'data.frame': 306 obs. of 4 variables:
## $ age : int 30 30 30 31 31 33 33 34 34 34 ...
## $ year : int 64 62 65 59 65 58 60 59 66 58 ...
## $ Freq : int 1 3 0 2 4 10 0 0 9 30 ...
## $ Survival: Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 1 1 2 ...
The type of each attribute in the input is as follows.
## age year Freq Survival
## "integer" "integer" "integer" "factor"
These types are suitable for EDA, so we can proceed with the Exploratory Data Analysis.
- Dealing with NULL values :-
The number of null (missing) values in each column is as follows.
## age year Freq Survival
## 0 0 0 0
As there are no null values, we can proceed further.
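The type and missing-value checks above can be sketched in base R as follows. This is a minimal sketch, not the report's original code: the data frame name `df_head` is hypothetical, and it holds only the first ten rows visible in the structure output, so the results describe that excerpt rather than all 306 records.

```r
# Hypothetical excerpt: the first ten rows shown in the structure output above.
df_head <- data.frame(
  age      = c(30L, 30L, 30L, 31L, 31L, 33L, 33L, 34L, 34L, 34L),
  year     = c(64L, 62L, 65L, 59L, 65L, 58L, 60L, 59L, 66L, 58L),
  Freq     = c(1L, 3L, 0L, 2L, 4L, 10L, 0L, 0L, 9L, 30L),
  Survival = factor(c("yes", "yes", "yes", "yes", "yes", "yes", "yes",
                      "no", "no", "yes"), levels = c("no", "yes"))
)

# Class of each column, as in the type listing.
col_types <- sapply(df_head, class)
print(col_types)

# Missing values per column (R marks them as NA).
na_counts <- colSums(is.na(df_head))
print(na_counts)
```

On the full data set, the same two calls would be expected to reproduce the type listing and the zero counts reported above.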
- Summary :-
The overall summary of all the attributes is as follows.
## age year Freq Survival
## Min. :30.00 Min. :58.00 Min. : 0.000 no : 81
## 1st Qu.:44.00 1st Qu.:60.00 1st Qu.: 0.000 yes:225
## Median :52.00 Median :63.00 Median : 1.000
## Mean :52.46 Mean :62.85 Mean : 4.026
## 3rd Qu.:60.75 3rd Qu.:65.75 3rd Qu.: 4.000
## Max. :83.00 Max. :69.00 Max. :52.000
The distribution of all continuous variables is as follows.
The distribution of all continuous variables in each survival category is as follows.
- Age:-
- Year of Operation:-
- Freq:-
The correlation between the continuous variables is as follows.
## age year Freq
## age 1.00000000 0.089529446 -0.063176102
## year 0.08952945 1.000000000 -0.003764474
## Freq -0.06317610 -0.003764474 1.000000000
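A correlation matrix of this shape can be produced with base R's `cor()`. The sketch below assumes the default Pearson method, which the report does not state, and uses a hypothetical excerpt `num_head` built from the first ten rows shown earlier, so its values differ from the matrix above, which is based on all 306 records.

```r
# Hypothetical excerpt: numeric columns of the first ten rows shown earlier.
num_head <- data.frame(
  age  = c(30, 30, 30, 31, 31, 33, 33, 34, 34, 34),
  year = c(64, 62, 65, 59, 65, 58, 60, 59, 66, 58),
  Freq = c(1, 3, 0, 2, 4, 10, 0, 0, 9, 30)
)

# Pearson correlation matrix (the default method of cor()).
cor_mat <- cor(num_head)
print(round(cor_mat, 3))

# Largest absolute off-diagonal entry, a quick check for strong linear association.
max_abs_r <- max(abs(cor_mat[upper.tri(cor_mat)]))
print(max_abs_r)
```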
___
Description of EDA :-
In the given input dataset, we have 306 patient records, and each patient is observed on 4 different attributes.
There are no null values in the input data.
The Freq attribute has many high outliers: its median is 1 and its third quartile is 4, but its maximum is 52.
The age and year attributes are roughly symmetric, with means close to their medians (52.46 vs 52 for age, 62.85 vs 63 for year).
Within each survival category, the age and year distributions are similar, and the medians are almost the same in both groups.
The Freq distribution differs between the groups: the median for patients who survived is lower than for those who did not.
There is no strong correlation between any pair of numerical variables (all absolute correlations are below 0.09).
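The remark about outliers in Freq can be checked against the summary table. The sketch below assumes the usual box-plot convention (Tukey's 1.5 × IQR rule); the report does not say which rule was used, and the variable names are illustrative.

```r
# Quartiles and maximum of Freq as printed in the summary table.
q1 <- 0
q3 <- 4
freq_max <- 52

# Tukey's rule: points beyond 1.5 * IQR from the quartiles count as outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

print(c(lower = lower_fence, upper = upper_fence))

# TRUE when the largest Freq value lies above the upper fence.
print(freq_max > upper_fence)
```

Under this rule, any patient with more than 10 positive nodes would be drawn as an outlier, which fits the long right tail of Freq.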
Models to predict Survival of Patient:-
I now build several models to predict the survival of a patient from the attributes described above.
As a first method, I used a support vector machine (classifier), as it is one of the better models for classification.
Support Vector machine ( Classifiers) :-
Support-vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. With a linear kernel they fit a linear decision boundary, and with a non-linear kernel such as the radial basis function they can also solve non-linear problems.
Fitting SVM - Classifier :-
With Default Parameters :-
As a first step, I fit a support vector machine classifier with the default values of the hyperparameters cost ( C ) and gamma (\(\gamma\)).
The summary of the fitted default SVM_classifier :-
##
## Call:
## svm(formula = classification_form, data = df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 167
##
## ( 89 78 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
We can observe that,
- Number of support vectors (the data points that define the margin) : 167
- 89 No Category Patients
- 78 Yes Category Patients
- The SVM type is the default, C-classification.
- Kernel selected : Radial basis function (RBF)
- Default cost value ( C ) is 1
Confusion Matrix on training data :-
## predicted
## true no yes
## no 24 57
## yes 9 216
Correct classification rate:-
The accuracy of this SVM classifier on all input data (the training data) is 78.43 %.
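The correct classification rate follows directly from the confusion matrix. The sketch below recomputes it in base R from the four counts printed above; the object name `cm` is illustrative.

```r
# Confusion matrix of the default SVM, copied from the output above
# (rows: true class, columns: predicted class).
cm <- matrix(c(24, 57,
               9, 216),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("no", "yes"), predicted = c("no", "yes")))

# Accuracy: correctly classified patients (the diagonal) over all patients.
accuracy <- sum(diag(cm)) / sum(cm)
print(round(100 * accuracy, 2))
```

The diagonal holds 24 + 216 = 240 correct predictions out of 306, which gives the reported 78.43 %.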
Parameters Tuning ( grid search ):-
The accuracy of the SVM model with default parameters leaves room for improvement. We can tune the parameters C and gamma (\(\gamma\)), which changes the smoothness of the fitted decision boundary, to try to classify the data points more accurately than before.
- The tuning summary using 10-fold cross validation is as follows.
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0 4
##
## - best performance: 0.2653763
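We can observe that the best parameters recommended by 10-fold cross validation are
\(\gamma\) : 0
C : 4
The best performance value, 0.2653763, is the cross-validated error rate, which corresponds to an accuracy of about 73.46 %.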
The tuning Plot is as follows :-
Fitting SVM_classifier with new parameters :-
The summary of the SVM classifier fitted with the tuned parameters :-
##
## Call:
## svm(formula = classification_form, data = df, gamma = gam, cost = cos)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 4
##
## Number of Support Vectors: 162
##
## ( 81 81 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
We can observe that,
- Number of support vectors (the data points that define the margin) : 162
- 81 No Category Patients
- 81 Yes Category Patients
- The SVM type is the default, C-classification.
- Kernel selected : Radial basis function (RBF)
- Cost value ( C ) is 4 (set by us)
- Gamma value ( \(\gamma\) ) is 0 (set by us)
Confusion Matrix on training data :-
## predicted
## true no yes
## no 0 81
## yes 0 225
Correct classification rate:-
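The accuracy of this tuned SVM classifier on all input data is 73.53 %. The confusion matrix shows that it predicts yes for every patient, so this accuracy equals the share of the yes class (225 of 306). This is consistent with \(\gamma\) = 0: the RBF kernel \(\exp(-\gamma \lVert x - y \rVert^2)\) is then constant, so the attributes cannot influence the prediction.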
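The tuned model's confusion matrix shows every patient predicted as yes. One possible explanation is sketched below, assuming that \(\gamma\) is exactly 0 and that the kernel has the standard RBF form \(\exp(-\gamma \lVert x - y \rVert^2)\). The function `rbf_kernel` is an illustrative helper, not part of any package.

```r
# RBF kernel between two feature vectors: exp(-gamma * squared distance).
rbf_kernel <- function(x, y, gamma) {
  exp(-gamma * sum((x - y)^2))
}

# Two patients (age, year, Freq): rows 1 and 6 of the data listing.
p1 <- c(30, 64, 1)
p2 <- c(33, 58, 10)

# With gamma = 0 the kernel is 1 for any pair, so the inputs carry no information.
k_zero <- rbf_kernel(p1, p2, gamma = 0)
# With a small positive gamma the kernel falls as the points move apart.
k_small <- rbf_kernel(p1, p2, gamma = 0.001)
print(c(gamma_0 = k_zero, gamma_0.001 = k_small))

# Accuracy of always predicting the majority class "yes" (225 of 306 patients).
baseline <- 225 / 306
print(round(100 * baseline, 2))
```

If this explanation holds, the 73.53 % is a majority-class baseline rather than an improvement from tuning, and a grid restricted to positive \(\gamma\) values would be worth trying.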
As a second model, I chose linear discriminant analysis, as it is one of the better algorithms for classification.
Linear discriminant analysis (LDA):-
LDA is used in statistics and other fields to find a linear combination of features that characterizes or separates two or more classes.
Fitting LDA:-
The summary of the fitted LDA model is as follows.
## Call:
## lda(classification_form, data = df)
##
## Prior probabilities of groups:
## no yes
## 0.2647059 0.7352941
##
## Group means:
## age year Freq
## no 53.67901 62.82716 7.456790
## yes 52.01778 62.86222 2.791111
##
## Coefficients of linear discriminants:
## LD1
## age -0.02826395
## year 0.01235497
## Freq -0.14194367
Confusion Matrix on training data :-
## predicted
## true no yes
## no 14 67
## yes 10 215
Correct classification rate:-
The accuracy of this LDA classifier on Training Data is 74.84 %.
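To illustrate how the printed coefficients turn into a class prediction, the sketch below scores a patient on LD1 and applies a two-class decision rule. It assumes that LD1 scores have unit within-group variance and that the printed priors are used; the report does not show its prediction code, and the helper names `ld1` and `classify` are hypothetical.

```r
# Values copied from the fitted LDA output above.
coef_ld1 <- c(age = -0.02826395, year = 0.01235497, Freq = -0.14194367)
mean_no  <- c(age = 53.67901, year = 62.82716, Freq = 7.456790)
mean_yes <- c(age = 52.01778, year = 62.86222, Freq = 2.791111)
prior    <- c(no = 0.2647059, yes = 0.7352941)

# LD1 score of a patient: linear combination of the three attributes.
ld1 <- function(x) sum(coef_ld1 * x)

# Assumed rule (unit within-group variance on LD1):
# choose "yes" when the log posterior ratio is positive.
classify <- function(x) {
  s_no  <- ld1(mean_no)
  s_yes <- ld1(mean_yes)
  log_ratio <- (s_yes - s_no) * (ld1(x) - (s_yes + s_no) / 2) +
    log(prior[["yes"]] / prior[["no"]])
  if (log_ratio > 0) "yes" else "no"
}

# Hypothetical patients given as (age, year, Freq).
print(classify(c(30, 58, 1)))
print(classify(c(60, 67, 47)))
```

The negative coefficient on Freq means that more positive nodes lower the score and push the prediction toward no, while the larger prior for yes shifts the threshold in favour of yes.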
Classification Trees:-
Decision tree learning is one of the predictive modelling approaches used in statistics, data mining and machine learning.
Fitting Classification Tree on training data :-
The summary of the fitted Decision tree is as follows.
##
## Conditional inference tree with 2 terminal nodes
##
## Response: Survival
## Inputs: age, year, Freq
## Number of observations: 306
##
## 1) Freq <= 4; criterion = 1, statistic = 25.082
## 2)* weights = 230
## 1) Freq > 4
## 3)* weights = 76
The same summary can be visualized as follows.
Confusion Matrix on training data :-
## predicted
## true no yes
## no 39 42
## yes 37 188
Correct classification rate:-
The accuracy of this Decision tree classifier on Training Data is 74.18 %.
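The fitted tree has a single split, so its rule can be written out by hand. In the sketch below, the class assigned to each leaf is inferred from the confusion matrix, not read from the tree output: 76 patients fall in the Freq > 4 leaf and 76 patients are predicted as no. The function name `predict_stump` is hypothetical.

```r
# Split reported by the conditional inference tree: Freq <= 4 versus Freq > 4.
# Leaf classes are an inference from the confusion matrix (76 predicted "no").
predict_stump <- function(freq) {
  ifelse(freq <= 4, "yes", "no")
}

# Freq values of the first six patients in the data listing.
freq_head <- c(1, 3, 0, 2, 4, 10)
print(predict_stump(freq_head))

# Training accuracy recomputed from the confusion matrix above.
cm_tree <- matrix(c(39, 42,
                    37, 188),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(true = c("no", "yes"), predicted = c("no", "yes")))
print(round(100 * sum(diag(cm_tree)) / sum(cm_tree), 2))
```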
Conclusion on Classification Models:-
The summary of all the fitted models and their performance on the training data is as follows.
CLASSIFICATION MODELS SUMMARY
| S No | Model Name | Accuracy on Given (Training) Data |
|---|---|---|
| 1. | LDA Model | 74.84 % |
| 2. | SVM Classifier | 78.43 % |
| 3. | Tuned SVM Classifier | 73.53 % |
| 4. | Decision Tree ( Classifier ) | 74.18 % |
As the default SVM classifier has the highest accuracy on the training data (78.43 %), I conclude that it is the best of these models for predicting patient survival.
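The accuracies in the table can be recomputed from the four confusion matrices. The sketch below also reports, as an addition that the report does not include, the share of each true class that is classified correctly. The helper `make_cm` and the model labels are illustrative.

```r
# Confusion matrices copied from the report (rows: true class, columns: predicted).
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(true = c("no", "yes"), predicted = c("no", "yes")))
}
models <- list(
  "LDA"           = make_cm(c(14, 67, 10, 215)),
  "SVM (default)" = make_cm(c(24, 57, 9, 216)),
  "SVM (tuned)"   = make_cm(c(0, 81, 0, 225)),
  "Decision tree" = make_cm(c(39, 42, 37, 188))
)

# Accuracy plus the share of each true class that is classified correctly.
metrics <- t(sapply(models, function(cm) {
  c(accuracy   = sum(diag(cm)) / sum(cm),
    recall_no  = cm["no", "no"] / sum(cm["no", ]),
    recall_yes = cm["yes", "yes"] / sum(cm["yes", ]))
}))
print(round(100 * metrics, 2))
```

Reading accuracy next to the per-class shares shows how each model treats the smaller no class, which accuracy alone does not reveal.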
Predicting new Unseen Data:-
If we have information on a new patient, the survival predictions of our models are as follows.
PREDICTION OF NEW DATA
| MODEL | AGE | YEAR | FREQ | PREDICTION_CLASS |
|---|---|---|---|---|
| SVM_Classifier | 25 | 60 | 15 | SURVIVED |
| Tuned_SVM_Classifier | 25 | 60 | 15 | SURVIVED |
| LDA_Model | 25 | 60 | 15 | SURVIVED |
PREDICTION OF NEW DATA
| MODEL | AGE | YEAR | FREQ | PREDICTION_CLASS |
|---|---|---|---|---|
| SVM_Classifier | 30 | 58 | 1 | SURVIVED |
| Tuned_SVM_Classifier | 30 | 58 | 1 | SURVIVED |
| LDA_Model | 30 | 58 | 1 | SURVIVED |
PREDICTION OF NEW DATA
| MODEL | AGE | YEAR | FREQ | PREDICTION_CLASS |
|---|---|---|---|---|
| SVM_Classifier | 60 | 67 | 47 | SURVIVED |
| Tuned_SVM_Classifier | 60 | 67 | 47 | SURVIVED |
| LDA_Model | 60 | 67 | 47 | NOT SURVIVED |