Notes: The following packages were used.
library(datasets) # For working with preloaded data sets
library(caret) # For machine learning
## Loading required package: ggplot2
## Loading required package: lattice
Notes: We load the Iris data set, which ships with the datasets package, using the data() function.
# IMPORT DATA
data(iris)
Notes: Before building the model, we check if there are any missing values in the Iris data set using the is.na() function and summing the results.
sum(is.na(iris))
## [1] 0
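Notes: A single total does not show where missing values sit when there are any. As a small sketch in base R, the same check can be broken down per column; the variable name missing_per_column is only illustrative.

```r
data(iris)

# Count missing values in each column separately
missing_per_column <- colSums(is.na(iris))
print(missing_per_column)

# Rows that contain at least one missing value (none are expected for iris)
incomplete_rows <- which(!complete.cases(iris))
print(length(incomplete_rows))
```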
Notes: To ensure reproducibility, we set a random seed value using set.seed(). Then, we perform a stratified random split of the data set into training and testing subsets using the createDataPartition() function from the caret package.
set.seed(100)
TrainingIndex <- createDataPartition(iris$Species, p=0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex,] # Training Set
TestingSet <- iris[-TrainingIndex,] # Test Set
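Notes: To illustrate what "stratified" means here, the sketch below draws 80% of the rows within each species using base R only. It is an approximation of the idea, not caret's implementation, and the names rows_by_class, train_sketch and test_sketch are hypothetical.

```r
data(iris)
set.seed(100)

# Row indices grouped by class label
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 80% of the rows inside each class so class proportions are preserved
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE)

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

# Class counts in each subset
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```

The same table() check can be run on TrainingSet and TestingSet to confirm that the split is balanced across species.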
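Notes: We proceed to build an SVM model with a polynomial kernel using the train() function from the caret package. Because trainControl(method = "none") is used, no resampling takes place: the model is fit once on the whole training set with the fixed parameters given in tuneGrid (degree = 1, scale = 1, C = 1). The predictors are centered and scaled before fitting.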
# Build Training model
Model <- train(Species ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess = c("scale", "center"),
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(degree = 1, scale = 1, C = 1)
)
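Notes: To see what the degree and scale parameters control, here is a minimal base R sketch of a polynomial kernel of the form (scale * <x, y> + offset)^degree. The function poly_kernel is hypothetical, and the offset of 1 is an assumption about the default used by the underlying SVM implementation. With degree = 1 the kernel is linear in the inner product.

```r
# Polynomial kernel between two numeric vectors (illustrative helper, not a caret function)
poly_kernel <- function(x, y, degree = 1, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)

# degree = 1: a shifted inner product, i.e. a linear kernel
k_linear <- poly_kernel(x, y, degree = 1)

# degree = 2: the same quantity squared, which allows curved decision boundaries
k_quadratic <- poly_kernel(x, y, degree = 2)

print(c(linear = k_linear, quadratic = k_quadratic))
```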
Notes: Next, we build a cross-validation (CV) model using the same SVM algorithm and parameters as the training model, this time with 10-fold cross-validation.
# Build CV model
Model.cv <- train(Species ~ ., data = TrainingSet,
                  method = "svmPoly",
                  na.action = na.omit,
                  preProcess = c("scale", "center"),
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = data.frame(degree = 1, scale = 1, C = 1)
)
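Notes: We apply the trained model to make predictions on both the training and testing data sets. We also generate predictions on the training data from the cross-validated model. Note that predict() uses the final model, which train() refits on the full training set after resampling, so these are not held-out fold predictions.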
# Apply model for prediction
Model.training <- predict(Model, TrainingSet) # Apply model to make predictions on Training set
Model.testing <- predict(Model, TestingSet) # Apply model to make predictions on Testing set
Model.cv <- predict(Model.cv, TrainingSet) # Apply the final CV-trained model to the Training set (overwrites the fitted Model.cv object)
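Notes: Calling predict() on a cross-validated train object uses the final model refit on all training rows, so the result matches ordinary training-set predictions. One way to obtain held-out predictions instead is sketched below, assuming caret and its svmPoly dependency kernlab are installed. It uses the savePredictions argument of trainControl() and the pred element of the fitted object; the names ctrl, cv_fit and cv_confusion are illustrative.

```r
library(caret)

data(iris)
set.seed(100)
TrainingIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex, ]

# Keep the held-out predictions made during resampling for the final tuning values
ctrl <- trainControl(method = "cv", number = 10, savePredictions = "final")

cv_fit <- train(Species ~ ., data = TrainingSet,
                method = "svmPoly",
                preProcess = c("scale", "center"),
                trControl = ctrl,
                tuneGrid = data.frame(degree = 1, scale = 1, C = 1))

# Resampled performance averaged over the 10 folds
print(cv_fit$results[, c("Accuracy", "Kappa")])

# Each training row is predicted once, by a model fit on the other nine folds
cv_confusion <- confusionMatrix(cv_fit$pred$pred, cv_fit$pred$obs)
print(cv_confusion$table)
```

The fitted object is stored under a separate name here so that it is not overwritten by its own predictions.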
Notes: To evaluate the model's performance, we calculate the confusion matrix and associated statistics for the training, testing, and cross-validation predictions.
# Model performance (Displays confusion matrix and statistics)
Model.training.confusion <- confusionMatrix(Model.training, TrainingSet$Species)
Model.testing.confusion <- confusionMatrix(Model.testing, TestingSet$Species)
Model.cv.confusion <- confusionMatrix(Model.cv, TrainingSet$Species)
print(Model.training.confusion)
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         40         1
##   virginica       0          0        39
##
## Overall Statistics
##
##                Accuracy : 0.9917
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9875
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9750
## Specificity                 1.0000            0.9875           1.0000
## Pos Pred Value              1.0000            0.9756           1.0000
## Neg Pred Value              1.0000            1.0000           0.9877
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3250
## Detection Prevalence        0.3333            0.3417           0.3250
## Balanced Accuracy           1.0000            0.9938           0.9875
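Notes: The headline statistics can be recomputed by hand from the counts in the training confusion matrix. The base R sketch below re-enters those counts and derives accuracy, per-class sensitivity, positive predictive value and Cohen's kappa; the variable names are illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the training confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(40,  0,  0,
                0, 40,  1,
                0,  0, 39),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm)

# Accuracy: share of observations on the diagonal
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions divided by the reference (column) totals
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions divided by the prediction (row) totals
pos_pred_value <- diag(cm) / rowSums(cm)

# Cohen's kappa: accuracy corrected for the agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_stat <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Kappa = kappa_stat), 4))
print(round(rbind(Sensitivity = sensitivity, PosPredValue = pos_pred_value), 4))
```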
print(Model.testing.confusion)
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
##
## Overall Statistics
##
##                Accuracy : 0.9667
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : 2.963e-13
##
##                   Kappa : 0.95
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
print(Model.cv.confusion)
## Confusion Matrix and Statistics
##
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         40         1
##   virginica       0          0        39
##
## Overall Statistics
##
##                Accuracy : 0.9917
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9875
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9750
## Specificity                 1.0000            0.9875           1.0000
## Pos Pred Value              1.0000            0.9756           1.0000
## Neg Pred Value              1.0000            1.0000           0.9877
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3250
## Detection Prevalence        0.3333            0.3417           0.3250
## Balanced Accuracy           1.0000            0.9938           0.9875
Notes: Lastly, we examine the importance of the input features using the varImp() function and visualize the result using the plot() function.
# Feature importance
Importance <- varImp(Model)
plot(Importance)