Random Forest
set.seed(1234)
dim(heart)
## [1] 1025 14
Split Data into Training and Test Sets, Random Forest 80%
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.80)
heart_train <- heart[index,] #take 80%
heart_test <- heart[-index,] #take 20%
Model Classifier_RF
# Fitting Random Forest to the train dataset
set.seed(120) # Setting seed
rf.heart = randomForest(x = heart_train[-14],
y = heart_train$target,
ntree = 500)
rf.heart
##
## Call:
## randomForest(x = heart_train[-14], y = heart_train$target, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.24%
## Confusion matrix:
## No Heart Disease Heart Disease class.error
## No Heart Disease 385 1 0.002590674
## Heart Disease 1 433 0.002304147
The random forest model, built with 500 trees and trying 3 variables
at each split, achieved an out-of-bag (OOB) error rate of 0.24%. The
OOB error is computed for each observation using only the trees whose
bootstrap sample did not include that observation, so it acts as a
built-in validation estimate. It is a good guide to performance on
unseen data only when the rows are independent of one another;
duplicated or near-duplicated rows would make it optimistic.
The confusion matrix printed with the model is based on these OOB
predictions for the training data. The “class.error” column gives the
proportion of misclassified instances in each class: approximately 0.26%
for the “No Heart Disease” class (1 of 386) and approximately 0.23% for
the “Heart Disease” class (1 of 434). Both classes are therefore
classified almost without error.
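To make the out-of-bag idea concrete, the following base R sketch draws one bootstrap sample of row numbers and counts how many rows it leaves out. The sample size matches the 820 training rows of the 80% split; the seed and the variable names are illustrative assumptions, and no model is fitted.

```r
# Illustrative only: how many rows does one bootstrap sample leave out?
set.seed(1)                      # assumed seed, so the sketch is reproducible
n <- 820                         # number of training rows in the 80% split

# Each tree is grown on n rows drawn with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" for that tree
oob_rows <- setdiff(seq_len(n), in_bag)
oob_fraction <- length(oob_rows) / n

# Theoretical share of rows left out of a bootstrap sample: (1 - 1/n)^n
expected_fraction <- (1 - 1 / n)^n

print(round(c(observed = oob_fraction, expected = expected_fraction), 3))
```

Each tree therefore has a little over a third of the training rows available as its own hold-out set, and the OOB error aggregates the predictions made on those rows.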
Confusion Matrix
y_pred <- predict(rf.heart, newdata = heart_test[, -14])
# Confusion Matrix
confusion_mtx <- table(heart_test[, 14], y_pred)
confusion_mtx
## y_pred
## No Heart Disease Heart Disease
## No Heart Disease 113 0
## Heart Disease 0 92
In the provided example:
The model correctly predicted 113 instances of “No Heart Disease”
when the actual class was also “No Heart Disease.” It correctly
predicted 92 instances of “Heart Disease” when the actual class was also
“Heart Disease.” There were no instances where the model incorrectly
classified “No Heart Disease” as “Heart Disease” (0 false positives).
Similarly, there were no instances where the model incorrectly
classified “Heart Disease” as “No Heart Disease” (0 false negatives).
Overall, the model appears to have made accurate predictions on the test
dataset based on the provided confusion matrix.
# Confusion matrix values
TP <- 92
TN <- 113
FP <- 0
FN <- 0
# Total instances
total <- TP + TN + FP + FN
# Accuracy
accuracy <- (TP + TN) / total
# Precision
precision <- TP / (TP + FP)
# Recall
recall <- TP / (TP + FN)
# F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 1"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 1"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 1"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 1"
The predictive accuracy on the test set is 100%.
The confusion matrix and all performance metrics show that the model
achieved perfect accuracy on the test dataset. All instances were
correctly classified, with no false positives or false negatives, so
the model performed very well on this particular test dataset.
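The metrics above were computed from counts typed in by hand. As a sketch of an alternative, the counts can be derived directly from the vectors of actual and predicted labels. `classification_metrics` is a hypothetical helper written for this illustration, not a function from any package, and the toy labels are made up.

```r
# Sketch: derive the metrics from label vectors instead of hand-typed counts.
# `classification_metrics` is a hypothetical helper, not part of any package.
classification_metrics <- function(actual, predicted, positive) {
  tp <- sum(predicted == positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / (tp + tn + fp + fn),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

# Toy labels (made up for illustration, not taken from the heart data)
actual    <- c("Heart Disease", "Heart Disease", "No Heart Disease",
               "No Heart Disease", "Heart Disease", "No Heart Disease")
predicted <- c("Heart Disease", "No Heart Disease", "No Heart Disease",
               "Heart Disease", "Heart Disease", "No Heart Disease")

metrics <- classification_metrics(actual, predicted, positive = "Heart Disease")
print(round(metrics, 4))
```

With the objects from this section, the corresponding call would be classification_metrics(heart_test$target, y_pred, positive = "Heart Disease"). Naming the positive class explicitly makes it harder to mix up false positives and false negatives.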
Given that the Random Forest model with 500 trees achieved perfect
scores on the test set (using an 80-20 split), it is worthwhile to try
a 70-30 split to see whether the model’s performance remains perfect or
deteriorates slightly, which could indicate overfitting.
Split Data into Training and Test Sets, Random Forest 70%
set.seed(417)
index <- sample(nrow(heart), size = nrow(heart)*0.70)
heart_train70 <- heart[index,] #take 70%
heart_test70 <- heart[-index,] #take 30%
Model Classifier_RF 70%
# Fitting Random Forest to the train dataset
set.seed(120) # Setting seed
rf.heart70 = randomForest(x = heart_train70[-14],
y = heart_train70$target,
ntree = 500)
rf.heart70
##
## Call:
## randomForest(x = heart_train70[-14], y = heart_train70$target, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.98%
## Confusion matrix:
## No Heart Disease Heart Disease class.error
## No Heart Disease 334 2 0.005952381
## Heart Disease 5 376 0.013123360
With the 70% training set, the OOB error rate rises from 0.24% to
0.98%, which is still very low.
Even though the performance with the 80% training data is impressive,
it’s possible that the model is overfitting to some extent. Perfect
scores are rare in practice and can be a sign of overfitting, or of
information shared between the training and test rows, especially if
they are not confirmed through rigorous validation.
The 0.98% error rate suggests strong performance, but on its own it
does not rule out overfitting.
Further evaluation on an independent test set, through
cross-validation, or on a separate validation set can provide
additional insight into the model’s robustness and generalizability.
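As a sketch of the cross-validation mentioned above, the base R example below runs 5-fold cross-validation by hand. It rests on loud assumptions: the data are simulated, and a logistic regression fitted with glm() stands in for randomForest so that the example has no package dependencies. Only the resampling logic is the point.

```r
# Sketch of 5-fold cross-validation in base R.
# Assumptions: simulated data and a logistic regression (glm) stand in for
# the heart data and randomForest so that the example has no dependencies.
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- as.integer(sim$x1 + 0.5 * sim$x2 + rnorm(n) > 0)   # noisy binary target

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

cv_accuracy <- sapply(seq_len(k), function(fold) {
  train_rows <- sim[fold_id != fold, ]
  test_rows  <- sim[fold_id == fold, ]
  fit  <- glm(y ~ x1 + x2, data = train_rows, family = binomial)
  prob <- predict(fit, newdata = test_rows, type = "response")
  mean(as.integer(prob > 0.5) == test_rows$y)        # accuracy on the held-out fold
})

print(round(cv_accuracy, 3))
print(round(mean(cv_accuracy), 3))
```

Every row is held out exactly once, so the spread of the five fold accuracies gives a rough sense of how much a single train-test split can vary.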
Evaluation on an Independent Test/Validation Set
Split your dataset into three subsets: training set, validation set,
and test set. Typically, you might use 60-70% of the data for training,
10-20% for validation, and the remaining 10-20% for testing.
# Split the data into training, validation, and test sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(heart), size = 0.7 * nrow(heart)) # 70% for training
# Remaining indices for validation and testing
remaining_indices <- setdiff(1:nrow(heart), train_indices)
# Draw half of the remaining indices for validation and testing (the other half is not used)
validation_test_indices <- sample(remaining_indices, size = 0.5 * length(remaining_indices))
validation_indices <- validation_test_indices[1:floor(length(validation_test_indices) / 2)] # First half of the drawn indices for validation (77 rows)
test_indices <- validation_test_indices[(floor(length(validation_test_indices) / 2) + 1):length(validation_test_indices)] # Second half of the drawn indices for testing (77 rows)
# Create data subsets based on the indices
train_data_indices <- heart[train_indices, ]
validation_data <- heart[validation_indices, ]
test_data_indices <- heart[test_indices, ]
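In the code above, only half of the rows left over after training are drawn before being divided, so the validation and test sets each hold 77 rows and the rest are unused. A sketch of an alternative that uses every row exactly once is shown below. It assumes a 70/15/15 split and an arbitrary seed, uses `n_rows` as a stand-in for nrow(heart), and is not the code that produced the results in this section.

```r
# Sketch of a 70/15/15 split in which every row is used exactly once.
# `n_rows` stands in for nrow(heart); the seed is an arbitrary assumption.
set.seed(123)
n_rows <- 1025

shuffled <- sample(n_rows)                       # random permutation of row numbers
n_train  <- floor(0.70 * n_rows)
n_valid  <- floor(0.15 * n_rows)

train_idx <- shuffled[seq_len(n_train)]
valid_idx <- shuffled[n_train + seq_len(n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n_rows]

print(c(train = length(train_idx),
        validation = length(valid_idx),
        test = length(test_idx)))
```

The three index vectors can then be used as heart[train_idx, ], heart[valid_idx, ], and heart[test_idx, ].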
Train the Model
rf_model500indices <- randomForest(target ~ ., data = train_data_indices, ntree = 500)
Evaluate on Validation Set
validation_predictions <- predict(rf_model500indices, newdata = validation_data[-14]) # Predictions on validation data
validation_accuracy <- mean(validation_predictions == validation_data$target) # Calculate accuracy
validation_accuracy
## [1] 1
Final Evaluation on Test Set
test_predictions <- predict(rf_model500indices, newdata = test_data_indices[-14]) # Predictions on test data
test_accuracy <- mean(test_predictions == test_data_indices$target) # Calculate accuracy
# Print predictions for inspection
print(test_predictions)
## 496 22 636 930
## Heart Disease Heart Disease Heart Disease No Heart Disease
## 725 697 836 635
## Heart Disease Heart Disease No Heart Disease No Heart Disease
## 3 432 975 434
## No Heart Disease No Heart Disease Heart Disease Heart Disease
## 996 9 689 353
## Heart Disease No Heart Disease No Heart Disease No Heart Disease
## 863 505 454 491
## No Heart Disease No Heart Disease Heart Disease Heart Disease
## 360 919 471 97
## Heart Disease No Heart Disease Heart Disease Heart Disease
## 86 819 133 21
## Heart Disease Heart Disease Heart Disease No Heart Disease
## 1023 495 901 472
## No Heart Disease Heart Disease No Heart Disease Heart Disease
## 202 795 743 66
## Heart Disease No Heart Disease No Heart Disease No Heart Disease
## 453 959 109 848
## No Heart Disease Heart Disease No Heart Disease No Heart Disease
## 802 889 582 489
## Heart Disease No Heart Disease Heart Disease Heart Disease
## 815 213 1018 995
## Heart Disease No Heart Disease No Heart Disease No Heart Disease
## 439 107 740 641
## Heart Disease No Heart Disease No Heart Disease Heart Disease
## 305 546 527 532
## Heart Disease No Heart Disease No Heart Disease Heart Disease
## 832 941 507 805
## Heart Disease No Heart Disease No Heart Disease Heart Disease
## 510 948 913 492
## No Heart Disease Heart Disease No Heart Disease Heart Disease
## 595 329 886 321
## No Heart Disease No Heart Disease No Heart Disease Heart Disease
## 677 156 60 640
## No Heart Disease Heart Disease No Heart Disease Heart Disease
## 1011 594 988 788
## No Heart Disease No Heart Disease No Heart Disease No Heart Disease
## 63
## No Heart Disease
## Levels: No Heart Disease Heart Disease
# Calculate accuracy
test_accuracy <- mean(test_predictions == test_data_indices$target)
print(test_accuracy) # Print calculated accuracy
## [1] 1
Achieving 100% accuracy consistently across different train-test
splits is a remarkable result. It suggests that the model performs
exceptionally well on this dataset. Since accuracy is consistently high
on the training, validation, and test data, the model does not appear
to be overfitting, although results this perfect deserve a second look
before concluding that it will generalize to new, unseen data.
Any advice will be greatly appreciated.
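One check that may be worth running when test accuracy is perfect is whether identical rows occur in both the training and the test set, since a forest can then simply recall them. Whether the heart data contains such rows is not established here. The sketch below uses a made-up data frame called `toy`, and `row_key` is a hypothetical helper; with the real data, `toy` would be replaced by `heart`.

```r
# Sketch: count rows that appear in both a training and a test set.
# The toy data frame below is made up; with the real data, `toy` would be `heart`.
toy <- data.frame(
  age    = c(52, 53, 52, 61, 53, 70),
  chol   = c(212, 203, 212, 203, 203, 174),
  target = c(0, 0, 0, 1, 0, 1)
)

# Rows that are exact copies of an earlier row
n_duplicates <- sum(duplicated(toy))

# Split, then look for test rows whose values also occur in the training set
train_toy <- toy[1:4, ]
test_toy  <- toy[5:6, ]
row_key <- function(df) do.call(paste, c(df, sep = "|"))   # one string per row
n_shared <- sum(row_key(test_toy) %in% row_key(train_toy))

print(c(duplicates = n_duplicates, shared_with_training = n_shared))
```

If the count of shared rows is large on the real data, removing duplicates before splitting (for example with unique()) would give a stricter test of generalization.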
Plot of the Original Model rf.heart (80%)
plot(rf.heart)
# Legend order follows the columns of rf.heart$err.rate: OOB first, then one column per class
legend("topright", legend = c("OOB error rate", "No Heart Disease error rate", "Heart Disease error rate"),
col = 1:3, lty = 1:3, cex = 0.8)

Important Features
set.seed(417)
importance_df <- importance(rf.heart)
importance_sorted <- data.frame(
Variable = rownames(importance_df),
MeanDecreaseGini = importance_df[, 1]
)
importance_sorted <- importance_sorted[order(importance_sorted$MeanDecreaseGini, decreasing = TRUE), , drop = FALSE]
rownames(importance_sorted) <- NULL # Reset row names
print(importance_sorted)
## Variable MeanDecreaseGini
## 1 cp 51.772016
## 2 ca 50.089382
## 3 thalach 46.092970
## 4 oldpeak 44.038777
## 5 thal 43.284840
## 6 age 37.447212
## 7 chol 33.334145
## 8 trestbps 28.795034
## 9 exang 25.272639
## 10 slope 18.226297
## 11 sex 13.497369
## 12 restecg 7.753257
## 13 fbs 3.807515
This result shows the importance of predictor variables in a random
forest model, measured by Mean Decrease in Gini impurity.
cp (Chest Pain Type): This variable has the highest importance, with
a Mean Decrease Gini of 51.772016. It suggests that splitting based on
this variable leads to the most significant reduction in impurity across
all trees in the forest.
ca (Number of major vessels colored by fluoroscopy): Following cp, ca
is the next most important predictor, with a Mean Decrease Gini of
50.089382.
thalach (Maximum Heart Rate achieved): thalach is the third most
important predictor, with a Mean Decrease Gini of 46.092970.
oldpeak (ST Depression induced by exercise relative to rest): oldpeak
ranks fourth in importance, with a Mean Decrease Gini of 44.038777.
thal (Thalassemia): Thal comes next in importance, with a Mean
Decrease Gini of 43.284840.
The rest of the variables follow in descending order of importance,
each contributing to the model’s predictive power based on their
respective Mean Decrease Gini values.
In summary, feature importance scores provide insights into which
features are most influential in the model’s decision-making process.
Higher importance scores indicate more significant contributions to the
model’s predictive performance.
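To show what a decrease in Gini impurity means, the base R sketch below computes the impurity of a small node before and after one split. The labels are made up for illustration. randomForest accumulates decreases of this kind over every split on a variable and averages them over the trees; its internal scaling differs from this toy calculation.

```r
# Sketch: Gini impurity and the decrease achieved by one split.
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions in the node
  1 - sum(p^2)
}

# Toy node with 10 observations (made up for illustration)
parent <- c(rep("Heart Disease", 5), rep("No Heart Disease", 5))
left   <- c(rep("Heart Disease", 4), rep("No Heart Disease", 1))
right  <- c(rep("Heart Disease", 1), rep("No Heart Disease", 4))

# Weighted impurity of the children, then the decrease relative to the parent
child_impurity <- (length(left) * gini(left) + length(right) * gini(right)) / length(parent)
gini_decrease  <- gini(parent) - child_impurity

print(round(c(parent = gini(parent),
              children = child_impurity,
              decrease = gini_decrease), 3))
```

A variable that often produces splits with a large decrease, such as cp in the table above, ends up with a high importance score.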
Plot of Important Feature
# Variable importance plot
varImpPlot(rf.heart)

Optimization by Tree Numbers
Increasing the number of trees in a random forest from 500 to 1000
can potentially improve the model in several ways:
More Stable Generalization: Each tree is grown on a different
bootstrap sample, and averaging over more trees reduces the variance of
the ensemble. Adding trees does not make a random forest overfit more;
the generalization error levels off as trees are added.
Better Accuracy: Predictions are aggregated by majority vote over a
larger number of trees, so the errors made by individual trees are more
likely to be outvoted.
Increased Stability: With a larger ensemble, the predictions depend
less on the random draws used to grow the forest and are more
consistent from one run to the next.
Improved Robustness: Averaging predictions over more trees reduces the
impact of individual noisy observations or outliers on the overall
predictions.
However, it’s essential to consider the trade-offs of increasing the
number of trees. Training and predicting with a larger ensemble is
computationally more expensive and requires more memory and processing
power. Additionally, beyond a certain point, adding trees yields
diminishing returns in model performance.
In summary, increasing the number of trees can improve the stability,
accuracy, and robustness of a random forest, but these benefits must be
balanced against computational constraints and the law of diminishing
returns.
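The diminishing returns described above can be illustrated with an idealized calculation. The sketch assumes that every tree is correct with the same probability `p` and that trees vote independently; both are simplifying assumptions, since real trees in a forest are correlated, and the value 0.6 is arbitrary.

```r
# Idealized sketch: accuracy of a majority vote over independent trees.
# Assumption: every tree is correct with the same probability p, independently
# of the others. Real trees in a forest are correlated, so gains are smaller.
p <- 0.6
n_trees <- c(1, 11, 101, 501, 1001)              # odd sizes avoid tied votes

# The vote is correct when more than half of the trees are correct
vote_accuracy <- sapply(n_trees, function(n) {
  1 - pbinom(floor(n / 2), size = n, prob = p)
})

print(data.frame(n_trees = n_trees, vote_accuracy = round(vote_accuracy, 4)))
```

Under these assumptions most of the gain comes from the first few hundred trees, and going from 501 to 1001 trees changes the vote accuracy very little.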
Model classifier_RF with Optimization
set.seed(417)
heart$target = as.factor(heart$target)
train = sample(1:nrow(heart), 500)
rf.heart1000 = randomForest(target ~ ., data = heart, subset = train, ntree = 1000)
rf.heart1000
##
## Call:
## randomForest(formula = target ~ ., data = heart, ntree = 1000, subset = train)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.8%
## Confusion matrix:
## No Heart Disease Heart Disease class.error
## No Heart Disease 213 12 0.05333333
## Heart Disease 7 268 0.02545455
The out-of-bag (OOB) estimate of the error rate is the classification
error of the random forest calculated on the out-of-bag samples. In
this case, it is 3.8%.
The class.error column in the confusion matrix provides the error
rate for each class: approximately 5.33% for “No Heart Disease”
instances and approximately 2.55% for “Heart Disease” instances.
Overall, the model performs well, with low error rates for both
classes, but its OOB error is higher than the 0.24% of rf.heart. Note
that rf.heart1000 was trained on a random subset of 500 rows rather
than on the 820 rows of heart_train, so the two models differ in
training-set size as well as in the number of trees, and the difference
cannot be attributed to ntree alone. It is still important to evaluate
the model with additional metrics such as accuracy, precision, recall,
and F1 score, and to validate it on unseen data to assess its
generalization ability.
Confusion Matrix with Optimization
set.seed(417)
y_pred <- predict(rf.heart1000, newdata = heart_test[, -14])
# Confusion Matrix
confusion_mtx1000 <- table(heart_test[, 14], y_pred)
confusion_mtx1000
## y_pred
## No Heart Disease Heart Disease
## No Heart Disease 111 2
## Heart Disease 2 90
# Confusion matrix values
TP <- 90
TN <- 111
FP <- 2
FN <- 2
# Total instances
total <- TP + TN + FP + FN
# Accuracy
accuracy <- (TP + TN) / total
# Precision
precision <- TP / (TP + FP)
# Recall
recall <- TP / (TP + FN)
# F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.9805"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 0.9783"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 0.9783"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.9783"
The results indicate strong performance across all metrics:
Accuracy: An accuracy of 0.9805 means that approximately 98.05% of
instances in the test set were correctly classified by the model.
Precision: With a precision of 0.9783, around 97.83% of instances
classified as positive (Heart Disease) were indeed positive.
Recall: The recall score of 0.9783 indicates that approximately
97.83% of actual positive instances (Heart Disease) were correctly
identified by the model.
F1 Score: The F1 Score, being the harmonic mean of precision and
recall, is also 0.9783 because precision and recall are equal here.
This indicates strong and balanced overall performance.
Plot of Model with Optimization
set.seed(417)
plot(rf.heart1000)
# Legend order follows the columns of rf.heart1000$err.rate: OOB first, then one column per class
legend("topright", legend = c("OOB error rate", "No Heart Disease error rate", "Heart Disease error rate"),
col = 1:3, lty = 1:3, cex = 0.8)

Important Features with Optimization
set.seed(417)
importance_df1000 <- importance(rf.heart1000)
importance_sorted1000 <- data.frame(
Variable = rownames(importance_df1000),
MeanDecreaseGini = importance_df1000[, 1]
)
importance_sorted1000 <- importance_sorted1000[order(importance_sorted1000$MeanDecreaseGini, decreasing = TRUE), , drop = FALSE]
rownames(importance_sorted1000) <- NULL # Reset row names
print(importance_sorted1000)
## Variable MeanDecreaseGini
## 1 cp 35.944511
## 2 oldpeak 34.051488
## 3 thalach 24.564869
## 4 ca 23.766305
## 5 thal 22.338299
## 6 age 19.262623
## 7 chol 19.171335
## 8 trestbps 18.021224
## 9 exang 17.917996
## 10 slope 12.724322
## 11 sex 9.642634
## 12 restecg 4.489032
## 13 fbs 1.944607
The result shows the importance of each predictor variable in the
random forest model, measured by Mean Decrease in Gini impurity.
cp (Chest Pain Type): This variable has the highest importance, with
a Mean Decrease Gini of 35.944511. It indicates that splitting based on
this variable provides the most significant reduction in impurity across
all trees in the forest.
oldpeak (ST Depression induced by exercise relative to rest):
Following cp, oldpeak is the next most important predictor, with a Mean
Decrease Gini of 34.051488.
thalach (Maximum Heart Rate achieved): thalach is the third most
important predictor, with a Mean Decrease Gini of 24.564869.
ca (Number of major vessels colored by fluoroscopy): ca ranks fourth
in importance, with a Mean Decrease Gini of 23.766305.
thal (Thalassemia): Thal comes next in importance, with a Mean
Decrease Gini of 22.338299.
The rest of the variables follow in descending order of importance,
each contributing to the model’s predictive power based on their
respective Mean Decrease Gini values.
In summary, feature importance scores provide insights into which
features are most influential in the model’s decision-making process.
Higher importance scores indicate more significant contributions to the
model’s predictive performance.
Plot of Important Feature with Optimization
# Variable importance plot
varImpPlot(rf.heart1000)

Support Vector Machine
SVM (Support Vector Machine) is a supervised machine learning
algorithm that is mainly used to classify data into different classes.
It separates the classes with a hyperplane, which acts as the decision
boundary, and chooses the hyperplane that leaves the widest margin
between the classes.
With more than two classes, several separating hyperplanes can be
combined so that the data is divided into regions, each ideally
containing only one class.
Before we train our model, we’ll first call the trainControl()
function. It defines how resampling is carried out when we use the
train() function provided by the caret package. The train() function
can fit many different algorithms through a common interface.
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
The trainControl() call here takes 3 arguments. The “method” argument
defines the resampling method; in this demo we use repeatedcv, the
repeated cross-validation method. The “number” argument holds the
number of folds. The “repeats” argument holds the number of times the
complete cross-validation is repeated. We are setting number = 10 and
repeats = 3.
The trainControl() function returns a list, which we pass to our
train() call.
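As a sketch of what method = "repeatedcv" with number = 10 and repeats = 3 amounts to, the base R example below builds the 30 resamples by hand. It only mimics the idea: caret constructs its folds differently in detail (for example, it balances the classes across folds), and the seed is an arbitrary assumption.

```r
# Sketch: what method = "repeatedcv", number = 10, repeats = 3 amounts to.
# This mimics the idea in base R; caret's own fold construction differs in
# detail (for example, it balances the classes across folds).
set.seed(7)                                      # arbitrary seed for the sketch
n_rows  <- 820                                   # rows in heart_train
number  <- 10
repeats <- 3

resamples <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(number), length.out = n_rows))  # new shuffle per repeat
  for (f in seq_len(number)) {
    name <- sprintf("Fold%02d.Rep%d", f, r)
    resamples[[name]] <- which(fold_id != f)     # training rows for this resample
  }
}

print(length(resamples))                         # number of model fits per candidate
print(range(lengths(resamples)))                 # training-set size in each resample
```

Each candidate model is therefore fitted 30 times, each time on about nine tenths of the training rows, and the reported accuracy is the average over the held-out tenths.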
svm_Linear <- train(target ~., data = heart_train, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)
You can check the result of our train() method. We are saving its
results in the svm_Linear variable.
svm_Linear
## Support Vector Machines with Linear Kernel
##
## 820 samples
## 13 predictor
## 2 classes: 'No Heart Disease', 'Heart Disease'
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 738, 738, 739, 738, 737, 738, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8369953 0.6703673
##
## Tuning parameter 'C' was held constant at a value of 1
Now our model is trained with the C value held at 1, and we are ready
to predict classes for our test set with the predict() function.
We pass 2 arguments: the first is our trained model and the second,
“newdata”, holds our testing data frame. For a classification model,
predict() returns a factor of predicted classes, which we save in the
test_pred variable.
test_pred <- predict(svm_Linear, newdata = heart_test)
test_pred
## [1] No Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [5] Heart Disease Heart Disease No Heart Disease Heart Disease
## [9] Heart Disease Heart Disease No Heart Disease Heart Disease
## [13] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [17] No Heart Disease Heart Disease Heart Disease Heart Disease
## [21] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [25] No Heart Disease Heart Disease Heart Disease Heart Disease
## [29] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [33] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [37] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [41] Heart Disease Heart Disease Heart Disease Heart Disease
## [45] Heart Disease Heart Disease No Heart Disease Heart Disease
## [49] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [53] No Heart Disease Heart Disease Heart Disease No Heart Disease
## [57] Heart Disease No Heart Disease Heart Disease Heart Disease
## [61] Heart Disease No Heart Disease Heart Disease Heart Disease
## [65] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [69] Heart Disease Heart Disease No Heart Disease Heart Disease
## [73] Heart Disease Heart Disease No Heart Disease Heart Disease
## [77] Heart Disease Heart Disease No Heart Disease Heart Disease
## [81] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [85] Heart Disease Heart Disease No Heart Disease No Heart Disease
## [89] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [93] Heart Disease Heart Disease Heart Disease Heart Disease
## [97] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [101] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [105] Heart Disease Heart Disease Heart Disease Heart Disease
## [109] No Heart Disease No Heart Disease Heart Disease No Heart Disease
## [113] Heart Disease No Heart Disease Heart Disease Heart Disease
## [117] Heart Disease Heart Disease Heart Disease No Heart Disease
## [121] No Heart Disease No Heart Disease Heart Disease No Heart Disease
## [125] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [129] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [133] Heart Disease Heart Disease Heart Disease No Heart Disease
## [137] No Heart Disease Heart Disease Heart Disease No Heart Disease
## [141] Heart Disease Heart Disease No Heart Disease Heart Disease
## [145] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [149] No Heart Disease Heart Disease Heart Disease Heart Disease
## [153] No Heart Disease Heart Disease Heart Disease Heart Disease
## [157] Heart Disease Heart Disease No Heart Disease Heart Disease
## [161] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [165] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [169] Heart Disease Heart Disease Heart Disease No Heart Disease
## [173] No Heart Disease Heart Disease Heart Disease No Heart Disease
## [177] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [181] No Heart Disease Heart Disease Heart Disease Heart Disease
## [185] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [189] Heart Disease No Heart Disease Heart Disease No Heart Disease
## [193] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [197] No Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [201] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [205] Heart Disease
## Levels: No Heart Disease Heart Disease
Let’s check the accuracy of our model. We’re going to use the
confusion matrix to assess the accuracy:
confusionMatrix(table(test_pred, heart_test$target))
## Confusion Matrix and Statistics
##
##
## test_pred No Heart Disease Heart Disease
## No Heart Disease 89 5
## Heart Disease 24 87
##
## Accuracy : 0.8585
## 95% CI : (0.8032, 0.9032)
## No Information Rate : 0.5512
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7195
##
## Mcnemar's Test P-Value : 0.0008302
##
## Sensitivity : 0.7876
## Specificity : 0.9457
## Pos Pred Value : 0.9468
## Neg Pred Value : 0.7838
## Prevalence : 0.5512
## Detection Rate : 0.4341
## Detection Prevalence : 0.4585
## Balanced Accuracy : 0.8666
##
## 'Positive' Class : No Heart Disease
##
The output shows that our model accuracy for test set is 85.85%.
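The statistics in the output above are reported with “No Heart Disease” as the positive class. As a sketch, the same counts can be read with either class as positive. The matrix is copied from the confusionMatrix() output; `recall_for` and `precision_for` are hypothetical helpers written for this illustration.

```r
# Sketch: the same confusion matrix read with either class as "positive".
# Counts are copied from the confusionMatrix() output above
# (rows = predicted class, columns = actual class).
cm <- matrix(c(89, 24, 5, 87), nrow = 2,
             dimnames = list(predicted = c("No Heart Disease", "Heart Disease"),
                             actual    = c("No Heart Disease", "Heart Disease")))

# Share of actual members of `class` that were predicted as `class`
recall_for <- function(cm, class) cm[class, class] / sum(cm[, class])
# Share of predictions of `class` that were correct
precision_for <- function(cm, class) cm[class, class] / sum(cm[class, ])

print(round(c(recall_no_hd    = recall_for(cm, "No Heart Disease"),
              recall_hd       = recall_for(cm, "Heart Disease"),
              precision_no_hd = precision_for(cm, "No Heart Disease"),
              precision_hd    = precision_for(cm, "Heart Disease")), 4))
```

The four values match the Sensitivity, Specificity, Pos Pred Value, and Neg Pred Value lines of the caret output. Passing positive = "Heart Disease" to confusionMatrix() would report the statistics for the disease class instead.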
# Confusion matrix values (positive class: Heart Disease; rows are predictions, columns are actual classes)
TP <- 87
TN <- 89
FP <- 24
FN <- 5
# Total instances
total <- TP + TN + FP + FN
# Accuracy
accuracy <- (TP + TN) / total
# Precision
precision <- TP / (TP + FP)
# Recall
recall <- TP / (TP + FN)
# F1 Score
f1_score <- 2 * (precision * recall) / (precision + recall)
# Print the results
print(paste("Accuracy:", round(accuracy, 4)))
## [1] "Accuracy: 0.8585"
print(paste("Precision:", round(precision, 4)))
## [1] "Precision: 0.7838"
print(paste("Recall:", round(recall, 4)))
## [1] "Recall: 0.9457"
print(paste("F1 Score:", round(f1_score, 4)))
## [1] "F1 Score: 0.8571"
The results indicate good performance, although clearly below that of
the random forest:
Accuracy: An accuracy of 0.8585 means that approximately 85.85% of
instances in the test set were correctly classified by the model.
Precision: With a precision of 0.7838, around 78.38% of instances
classified as positive (Heart Disease) were indeed positive; 24 of the
111 predicted cases were false positives.
Recall: The recall score of 0.9457 indicates that approximately
94.57% of actual positive instances (Heart Disease) were correctly
identified by the model; only 5 of the 92 actual cases were missed.
F1 Score: The F1 Score, the harmonic mean of precision and recall, is
0.8571. The model finds most heart disease cases, at the cost of a
noticeable number of false alarms.
By following the above procedure (checking the accuracy of our
model), we have built and evaluated our svmLinear classifier.
We can also customize the selection of the C value (Cost) in the
linear classifier by supplying candidate values for a grid search.
We put some values of C into the “grid” data frame using
expand.grid(). This data frame is then passed to train() through the
tuneGrid parameter, so that the classifier is evaluated at each of
these C values. C must be strictly positive, so the value 0 in the grid
below cannot be fitted; it produces the “No Support Vectors found”
warnings and the NaN row in the results. When tuneGrid is supplied,
tuneLength is ignored.
grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))
svm_Linear_Grid <- train(target ~., data = heart_train, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid,
tuneLength = 10)
## Warning: model fit failed for Fold01.Rep1: C=0.00 Error in .local(x, ...) :
## No Support Vectors found. You may want to change your parameters
## (the same warning is repeated for Fold02.Rep1 through Fold10.Rep3, 30 resamples in total, all at C=0.00)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
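All of the "No Support Vectors found" warnings above come from a single grid point, C = 0.00. With a cost of zero the SVM cannot place any weight on any training point, so no support vectors can be found and every fold fails for that value. A minimal sketch of a grid that avoids this is shown below; the name grid_no_zero is hypothetical, and the values are simply the ones from the results table with zero left out.

```r
# Hypothetical tuning grid: the same C values as in the results table, minus C = 0
grid_no_zero <- expand.grid(C = c(0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
                                  1, 1.25, 1.5, 1.75, 2, 5))

# Every candidate cost is strictly positive, so each fold can be fitted
stopifnot(all(grid_no_zero$C > 0))
nrow(grid_no_zero)
```

A data frame like this could be passed as the tuneGrid argument of caret's train() in place of the original grid, which should remove the warnings and the NaN row without changing the other results.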
svm_Linear_Grid
## Support Vector Machines with Linear Kernel
##
## 820 samples
## 13 predictor
## 2 classes: 'No Heart Disease', 'Heart Disease'
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 737, 738, 737, 739, 738, 739, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.00 NaN NaN
## 0.01 0.8332475 0.6625618
## 0.05 0.8316562 0.6594721
## 0.10 0.8328860 0.6619765
## 0.25 0.8357069 0.6675991
## 0.50 0.8385624 0.6733923
## 0.75 0.8389740 0.6742324
## 1.00 0.8393805 0.6750621
## 1.25 0.8389689 0.6742009
## 1.50 0.8393805 0.6750385
## 1.75 0.8397870 0.6758504
## 2.00 0.8397870 0.6758504
## 5.00 0.8401985 0.6766958
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 5.
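The selection reported above can be checked by hand. The sketch below copies the accuracy column from the printed table into a data frame (cv_results is a hypothetical name) and picks the row with the largest value. With the fitted object available, the same information should be in svm_Linear_Grid$results and svm_Linear_Grid$bestTune.

```r
# Resampling results copied from the printed table
# (C = 0 is omitted because it produced NaN)
cv_results <- data.frame(
  C = c(0.01, 0.05, 0.10, 0.25, 0.50, 0.75,
        1.00, 1.25, 1.50, 1.75, 2.00, 5.00),
  Accuracy = c(0.8332475, 0.8316562, 0.8328860, 0.8357069,
               0.8385624, 0.8389740, 0.8393805, 0.8389689,
               0.8393805, 0.8397870, 0.8397870, 0.8401985)
)

# Row with the highest cross-validated accuracy
best <- cv_results[which.max(cv_results$Accuracy), ]
best$C

# Spread between the worst and best candidate
diff(range(cv_results$Accuracy))
```

The best row is C = 5, and the spread across all candidates is under one percentage point, so the choice of C matters very little for this linear kernel.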
plot(svm_Linear_Grid)

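The plot above shows that our classifier gives its best cross-validated
accuracy (about 0.840) at C = 5, the value selected for the final model.
The gain over smaller values of C is marginal (0.833 at C = 0.01). Let’s
try to make predictions using this model for our test set.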
test_pred_grid <- predict(svm_Linear_Grid, newdata = heart_test)
test_pred_grid
## [1] No Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [5] Heart Disease Heart Disease No Heart Disease Heart Disease
## [9] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [13] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [17] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [21] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [25] No Heart Disease Heart Disease Heart Disease Heart Disease
## [29] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [33] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [37] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [41] Heart Disease Heart Disease Heart Disease Heart Disease
## [45] Heart Disease Heart Disease No Heart Disease Heart Disease
## [49] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [53] No Heart Disease No Heart Disease Heart Disease No Heart Disease
## [57] Heart Disease No Heart Disease Heart Disease Heart Disease
## [61] Heart Disease No Heart Disease Heart Disease Heart Disease
## [65] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [69] Heart Disease Heart Disease No Heart Disease Heart Disease
## [73] Heart Disease Heart Disease No Heart Disease Heart Disease
## [77] Heart Disease Heart Disease No Heart Disease Heart Disease
## [81] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [85] Heart Disease Heart Disease No Heart Disease No Heart Disease
## [89] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [93] Heart Disease Heart Disease Heart Disease No Heart Disease
## [97] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [101] No Heart Disease Heart Disease No Heart Disease No Heart Disease
## [105] Heart Disease Heart Disease Heart Disease Heart Disease
## [109] No Heart Disease No Heart Disease Heart Disease No Heart Disease
## [113] Heart Disease No Heart Disease Heart Disease Heart Disease
## [117] Heart Disease Heart Disease Heart Disease No Heart Disease
## [121] No Heart Disease No Heart Disease Heart Disease No Heart Disease
## [125] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [129] Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [133] Heart Disease Heart Disease Heart Disease No Heart Disease
## [137] No Heart Disease Heart Disease Heart Disease No Heart Disease
## [141] Heart Disease Heart Disease No Heart Disease Heart Disease
## [145] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [149] No Heart Disease Heart Disease Heart Disease Heart Disease
## [153] No Heart Disease Heart Disease Heart Disease Heart Disease
## [157] Heart Disease Heart Disease No Heart Disease Heart Disease
## [161] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [165] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [169] Heart Disease Heart Disease Heart Disease No Heart Disease
## [173] No Heart Disease Heart Disease Heart Disease No Heart Disease
## [177] No Heart Disease No Heart Disease No Heart Disease Heart Disease
## [181] No Heart Disease Heart Disease Heart Disease Heart Disease
## [185] No Heart Disease Heart Disease No Heart Disease Heart Disease
## [189] Heart Disease No Heart Disease Heart Disease No Heart Disease
## [193] No Heart Disease No Heart Disease Heart Disease Heart Disease
## [197] No Heart Disease No Heart Disease No Heart Disease No Heart Disease
## [201] Heart Disease No Heart Disease No Heart Disease Heart Disease
## [205] Heart Disease
## Levels: No Heart Disease Heart Disease
Let’s check its accuracy using a confusion matrix.
confusionMatrix(table(test_pred_grid, heart_test$target))
## Confusion Matrix and Statistics
##
##
## test_pred_grid No Heart Disease Heart Disease
## No Heart Disease 91 7
## Heart Disease 22 85
##
## Accuracy : 0.8585
## 95% CI : (0.8032, 0.9032)
## No Information Rate : 0.5512
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7183
##
## Mcnemar's Test P-Value : 0.00933
##
## Sensitivity : 0.8053
## Specificity : 0.9239
## Pos Pred Value : 0.9286
## Neg Pred Value : 0.7944
## Prevalence : 0.5512
## Detection Rate : 0.4439
## Detection Prevalence : 0.4780
## Balanced Accuracy : 0.8646
##
## 'Positive' Class : No Heart Disease
##
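The results of the confusion matrix show that this time the accuracy
on the test set is 85.85%, which is essentially the same as our previous
result. Note that the 'Positive' class here is “No Heart Disease”, so the
sensitivity of 80.5% is the share of patients without heart disease that
were classified correctly, while the specificity of 92.4% is the share of
heart disease cases that were detected (85 of 92, with 7 missed).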
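The headline statistics can be reproduced from the four counts in the confusion matrix. The sketch below rebuilds the table by hand (cm and the other variable names are hypothetical) and recomputes accuracy, sensitivity and specificity with “No Heart Disease” as the positive class, as in the output above.

```r
# Counts copied from the printed confusion matrix
# (rows = predicted class, columns = actual class)
cm <- matrix(c(91, 22, 7, 85), nrow = 2,
             dimnames = list(
               Predicted = c("No Heart Disease", "Heart Disease"),
               Actual    = c("No Heart Disease", "Heart Disease")))

# Share of all 205 test cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Positive class = "No Heart Disease" (first column)
sensitivity <- cm[1, 1] / sum(cm[, 1])   # 91 / 113
specificity <- cm[2, 2] / sum(cm[, 2])   # 85 / 92

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
# accuracy 0.8585, sensitivity 0.8053, specificity 0.9239
```

If “Heart Disease” were treated as the positive class instead, the two rates would simply swap: sensitivity would be 85/92 and specificity 91/113.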