Setting up environment

Notes: The following packages were used.

library(datasets) # for working with preloaded data sets
library(caret) # for machine learning
## Loading required package: ggplot2
## Loading required package: lattice
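
The "svmPoly" method used later is backed by the kernlab package, which caret loads on demand. A minimal sketch for checking that the needed packages are installed before running the analysis (the variable names are illustrative and not part of the original workflow):

```r
# Packages this walkthrough needs; kernlab backs caret's "svmPoly" method
required <- c("caret", "kernlab")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!is_installed]

if (length(not_installed) > 0) {
  message("Install before continuing: ", paste(not_installed, collapse = ", "))
}
```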

IMPORT THE DATA (PREPARE)

Notes: We load the iris data set, which ships with the datasets package, using the data() function.

# IMPORT DATA
data(iris)
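
A quick structural check after loading can confirm that the data set has the expected shape. The sketch below uses only base R:

```r
data(iris)

# 150 observations: four numeric measurements plus the Species factor
dim(iris)
str(iris)

# Class balance: 50 flowers per species
table(iris$Species)
```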

DATA PROCESSING

Notes: Before building the model, we check if there are any missing values in the Iris data set using the is.na() function and summing the results.

sum(is.na(iris))
## [1] 0
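
sum(is.na(iris)) gives only a grand total. If it were non-zero, a per-column count would show where the gaps are. A small base-R sketch (the object names are illustrative):

```r
data(iris)

# Number of missing values in each column
na_per_column <- colSums(is.na(iris))
print(na_per_column)

# Rows containing at least one missing value (none for iris)
incomplete_rows <- which(!complete.cases(iris))
length(incomplete_rows)
```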

DATA SPLITTING

Notes: To ensure reproducibility, we set a random seed value using set.seed(). Then, we perform a stratified random split of the data set into training and testing subsets using the createDataPartition() function from the caret package.

set.seed(100)
TrainingIndex <- createDataPartition(iris$Species, p=0.8, list = FALSE)
TrainingSet <- iris[TrainingIndex,] # Training Set
TestingSet <- iris[-TrainingIndex,] # Test Set
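
To illustrate what "stratified" means here, the following base-R sketch draws 80% of the rows within each species separately. It is a hand-rolled illustration, not caret's implementation, so the selected rows will differ from those chosen by createDataPartition(). The names train_idx, train_set, and test_set are hypothetical.

```r
data(iris)
set.seed(100)

# Sample 80% of the row indices within each species separately
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_set <- iris[train_idx, ]
test_set <- iris[-train_idx, ]

# Each species keeps the same share in both subsets
table(train_set$Species)
table(test_set$Species)
```

Because the sampling happens per class, neither subset can end up short of one species, which matters more for small or imbalanced data sets than it does for iris.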

Support Vector Machine (SVM) Model with Polynomial Kernel

Notes: We proceed to build an SVM model with a polynomial kernel using the train() function from the caret package.

# Build Training model
Model <- train(Species ~ ., data = TrainingSet,
               method = "svmPoly",
               na.action = na.omit,
               preProcess = c("scale","center"),
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(degree = 1, scale = 1, C = 1)
)
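
For reference, the polynomial kernel provided by kernlab (which backs "svmPoly") has the form (scale * <x, y> + offset)^degree. The sketch below writes that formula out in base R. The function name poly_kernel is hypothetical, and the offset of 1 is an assumption about the default caret passes along. With degree = 1 and scale = 1, as in the tuneGrid above, the kernel reduces to a linear kernel plus a constant.

```r
# Polynomial kernel: (scale * <x, y> + offset)^degree
# offset = 1 is an assumed default, see the note above
poly_kernel <- function(x, y, degree = 1, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

x <- c(1, 2)
y <- c(3, 4)

poly_kernel(x, y)              # degree 1: inner product plus offset
poly_kernel(x, y, degree = 2)  # degree 2: the same quantity squared
```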

Cross-Validation Model

Notes: Next, we build a cross-validation (CV) model using the same SVM algorithm and parameters as the training model, this time with 10-fold cross-validation as the resampling method.

# Build CV model
Model.cv <- train(Species ~ ., data = TrainingSet,
                  method = "svmPoly",
                  na.action = na.omit,
                  preProcess = c("scale","center"),
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = data.frame(degree = 1, scale = 1, C = 1)
)
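
As an illustration of what 10-fold cross-validation does with the 120 training rows, the base-R sketch below assigns each row to one of 10 folds so that every row is held out exactly once. It only mimics the partitioning idea; caret builds its own folds internally, so these are not the folds train() uses. All names are hypothetical.

```r
set.seed(100)
n_rows <- 120   # size of the training set in this walkthrough
k <- 10

# Shuffle fold labels 1..k so that each fold receives n_rows / k rows
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)    # rows used for validation in this round
  fitted_on <- which(fold_id != fold)   # rows used for fitting in this round
  # The two index sets never overlap
  stopifnot(length(intersect(held_out, fitted_on)) == 0)
}

table(fold_id)  # 12 rows per fold
```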

Model Prediction and Evaluation

Notes: We apply the trained model to make predictions on both the training and testing data sets. We also generate predictions from the CV-trained model on the training data. Because train() refits the final model on the full training set after resampling, these are resubstitution predictions rather than held-out fold predictions, which is why their confusion matrix matches the training one. The cross-validated accuracy estimate itself is stored in Model.cv$results.

# Apply model for prediction
Model.training <- predict(Model, TrainingSet) # Predictions on the Training set
Model.testing <- predict(Model, TestingSet) # Predictions on the Testing set
Model.cv.training <- predict(Model.cv, TrainingSet) # Predictions from the CV-trained final model on the Training set

To evaluate the model’s performance, we calculate the confusion matrix and associated statistics for the training set, the testing set, and the CV-trained model's predictions on the training set.

# Model performance (Displays confusion matrix and statistics)
Model.training.confusion <- confusionMatrix(Model.training, TrainingSet$Species)
Model.testing.confusion <- confusionMatrix(Model.testing, TestingSet$Species)
Model.cv.confusion <- confusionMatrix(Model.cv.training, TrainingSet$Species)

print(Model.training.confusion)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         40         1
##   virginica       0          0        39
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9875          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9750
## Specificity                 1.0000            0.9875           1.0000
## Pos Pred Value              1.0000            0.9756           1.0000
## Neg Pred Value              1.0000            1.0000           0.9877
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3250
## Detection Prevalence        0.3333            0.3417           0.3250
## Balanced Accuracy           1.0000            0.9938           0.9875
print(Model.testing.confusion)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
print(Model.cv.confusion)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         40         1
##   virginica       0          0        39
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9875          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9750
## Specificity                 1.0000            0.9875           1.0000
## Pos Pred Value              1.0000            0.9756           1.0000
## Neg Pred Value              1.0000            1.0000           0.9877
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3250
## Detection Prevalence        0.3333            0.3417           0.3250
## Balanced Accuracy           1.0000            0.9938           0.9875
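
The summary statistics can be recomputed by hand from the tables. The sketch below re-enters the testing-set confusion matrix shown above and derives overall accuracy and per-class sensitivity in base R; the object names are illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Testing-set confusion matrix: rows = predictions, columns = reference
cm <- matrix(c(10, 0,  0,
                0, 9,  0,
                0, 1, 10),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class: correct predictions over reference totals (column sums)
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
```

The single misclassified versicolor flower accounts for both the 0.9667 accuracy (29 of 30) and the 0.9 sensitivity for that class.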

Feature Importance

Notes: Lastly, we examine the importance of the input features using the varImp() function and visualize it using the plot() function.

# Feature importance
Importance <- varImp(Model)
plot(Importance)
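
"svmPoly" has no built-in importance measure, so varImp() is understood to fall back on a model-free filter based on the area under the ROC curve (AUC) of each predictor. As a rough illustration of that idea, and not a reproduction of caret's calculation, the base-R sketch below computes the AUC of single predictors for separating two species using the rank-sum formulation. The helper auc_single is hypothetical.

```r
data(iris)

# AUC of one numeric predictor for separating two groups (rank-sum form);
# `pos` holds the values of the group expected to score higher
auc_single <- function(pos, neg) {
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

setosa <- iris[iris$Species == "setosa", ]
versicolor <- iris[iris$Species == "versicolor", ]

# Petal length separates the two species perfectly, so its AUC is 1
auc_petal <- auc_single(versicolor$Petal.Length, setosa$Petal.Length)
auc_petal

# Sepal width overlaps more between the two species
auc_single(setosa$Sepal.Width, versicolor$Sepal.Width)
```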