R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# PREPARE THE DATASET
################################################################################
# OUR PROBLEM: Binary classification: determine whether a patient will have an
#              onset of diabetes within the next five years.
# INPUT ATTRIBUTES: Numeric; they describe medical details for female patients.
################################################################################

# Load the packages
library(mlbench) 
## Warning: package 'mlbench' was built under R version 4.4.3
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: ggplot2
## Loading required package: lattice
# Load the dataset
data("PimaIndiansDiabetes")

# TRAIN THE MODELS
################################################################################
# Cross-validation will be used to compare the models/algorithms.
# Resampling methods: data split (sketched below for reference) and k-fold
# cross-validation.
# CART - Classification and Regression Trees
# LDA  - Linear Discriminant Analysis
# SVM  - Support Vector Machine with Radial Basis Function
# KNN  - k-Nearest Neighbors
# RF   - Random Forest
################################################################################

# Prepare training scheme
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
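# HEDGED ALTERNATIVE (not part of the original script): the "data split"
# resampling mentioned above could be set up with caret's createDataPartition().
# The names trainIndex, trainSet and testSet are illustrative; this split is not
# used by the repeated k-fold comparisons below.
set.seed(7)
trainIndex <- createDataPartition(PimaIndiansDiabetes$diabetes, p = 0.80, list = FALSE)
trainSet <- PimaIndiansDiabetes[trainIndex, ]
testSet <- PimaIndiansDiabetes[-trainIndex, ]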
# CART
set.seed(7)
fit.cart <- train(diabetes~., data = PimaIndiansDiabetes, method="rpart", trControl=trainControl)
# LDA
set.seed(7)
fit.lda <- train(diabetes~., data=PimaIndiansDiabetes, method="lda", trControl=trainControl)
# SVM
set.seed(7)
fit.svm <- train(diabetes~., data=PimaIndiansDiabetes, method="svmRadial", trControl=trainControl)
# KNN
set.seed(7)
fit.knn <- train(diabetes~., data=PimaIndiansDiabetes, method="knn", trControl=trainControl)
# Random Forest
set.seed(7)
fit.rf <- train(diabetes~., data=PimaIndiansDiabetes, method="rf", trControl=trainControl)
# Collect resamples
results <- resamples(list(CART=fit.cart, LDA=fit.lda, SVM=fit.svm, KNN=fit.knn, RF=fit.rf))

# COMPARE THE MODELS

# Summarize differences between models
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: CART, LDA, SVM, KNN, RF 
## Number of resamples: 30 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.6753247 0.7272727 0.7532468 0.7469697 0.7662338 0.7922078    0
## LDA  0.7142857 0.7508117 0.7662338 0.7791069 0.8000256 0.9078947    0
## SVM  0.7236842 0.7508117 0.7631579 0.7712919 0.7915243 0.8947368    0
## KNN  0.6753247 0.7036056 0.7272727 0.7369503 0.7662338 0.8311688    0
## RF   0.6842105 0.7305195 0.7597403 0.7638528 0.8019481 0.8421053    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## CART 0.2762566 0.3620724 0.4241878 0.4151867 0.4861107 0.5250000    0
## LDA  0.3011551 0.4192537 0.4662541 0.4862025 0.5308596 0.7812500    0
## SVM  0.3391908 0.3997116 0.4460612 0.4621585 0.5234605 0.7475083    0
## KNN  0.2553191 0.3406000 0.3841761 0.3984995 0.4539789 0.6195363    0
## RF   0.2951613 0.3778304 0.4640696 0.4630809 0.5447483 0.6426332    0
# Box and Whisker Plots - look at the spread of estimated accuracies
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(results, scales=scales)

# Density Plots - show the distribution of model accuracy as density plots
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
densityplot(results, scales=scales, pch ="|")

# Dot Plots - show the mean estimated accuracy together with a 95% confidence interval
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
dotplot(results, scales=scales)

# Pairwise scatter plots of predictions to compare models
splom(results)

# Calculate and summarize statistical significance
# Differences in resampled performance between pairs of models
diffs <- diff(results)

# Summarize p-values for pairwise comparisons
summary(diffs)
## 
## Call:
## summary.diff.resamples(object = diffs)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##      CART      LDA       SVM       KNN       RF       
## CART           -0.032137 -0.024322  0.010019 -0.016883
## LDA  0.0011862            0.007815  0.042157  0.015254
## SVM  0.0116401 0.9156892            0.034342  0.007439
## KNN  1.0000000 6.68e-05  0.0002941           -0.026902
## RF   0.2727542 0.4490617 1.0000000 0.0183793          
## 
## Kappa 
##      CART      LDA        SVM        KNN        RF        
## CART           -0.0710158 -0.0469717  0.0166872 -0.0478942
## LDA  0.0008086             0.0240440  0.0877029  0.0231215
## SVM  0.0258079 0.3562734              0.0636589 -0.0009225
## KNN  1.0000000 0.0003858  0.0040823             -0.0645814
## RF   0.0211763 1.0000000  1.0000000  0.0158974
# The lower diagonal of the table shows p-values for the null hypothesis
# (the distributions are the same); smaller is better.
# The upper diagonal shows the estimated difference between the distributions.
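# HEDGED FOLLOW-UP (not part of the original script): a single pairwise
# comparison can also be run by hand. resamples() stores the per-fold scores in
# results$values, with columns named "Model~Metric", so a paired t-test between
# two models' accuracies looks like this (LDA vs KNN). Note that the table above
# applies a Bonferroni adjustment, so the raw p-value here will differ.
t.test(results$values$`LDA~Accuracy`,
       results$values$`KNN~Accuracy`,
       paired = TRUE)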

# Algorithm Test Harness

# Run algorithms using 10-fold cross-validation
trainControl <- trainControl(method="cv", number=10)
metric <- "Accuracy"

# The test harness involves three elements (a hedged example follows this list):
#   * The resampling method used to split up the dataset
#   * The machine learning algorithm to evaluate
#   * The performance metric by which to evaluate predictions
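# HEDGED EXAMPLE (not part of the original harness): each element can be swapped
# independently, e.g. bootstrap resampling and the Kappa metric instead of
# 10-fold cross-validation and Accuracy. These objects are illustrative only and
# are not used below.
trainControlBoot <- trainControl(method = "boot", number = 25)
metricKappa <- "Kappa"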


# Build Models

# Linear Discriminant Analysis (LDA)
# Classification and Regression Trees (CART)
# k-Nearest Neighbors (KNN)
# Support Vector Machines (SVM) with a radial kernel (for this example)
# Random Forest (RF)

data("iris")
# LDA
set.seed(7)
fit.lda <- train(Species~., data=iris, method="lda", metric=metric, trControl=trainControl)
# CART
set.seed(7)
fit.cart <-  train(Species~., data=iris, method="rpart", metric=metric, trControl=trainControl)
# KNN
set.seed(7)
fit.knn <-  train(Species~., data=iris, method="knn", metric=metric, trControl=trainControl)
# SVM
set.seed(7)
fit.svm <-  train(Species~., data=iris, method="svmRadial", metric=metric, trControl=trainControl)
# Random Forest
set.seed(7)
fit.rf <-  train(Species~., data=iris, method="rf", metric=metric, trControl=trainControl)

# Summarize the accuracy of the models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## lda  0.9333333 0.9500000 1.0000000 0.9800000 1.0000000    1    0
## cart 0.8666667 0.9333333 0.9333333 0.9400000 0.9833333    1    0
## knn  0.8666667 0.9333333 1.0000000 0.9666667 1.0000000    1    0
## svm  0.8000000 0.9333333 0.9666667 0.9466667 1.0000000    1    0
## rf   0.8666667 0.9333333 0.9666667 0.9600000 1.0000000    1    0
## 
## Kappa 
##      Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda   0.9   0.925   1.00 0.97   1.000    1    0
## cart  0.8   0.900   0.90 0.91   0.975    1    0
## knn   0.8   0.900   1.00 0.95   1.000    1    0
## svm   0.7   0.900   0.95 0.92   1.000    1    0
## rf    0.8   0.900   0.95 0.94   1.000    1    0
# Compare accuracy of models
dotplot(results)
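# HEDGED HELPER (not part of the original script): the models can also be ranked
# programmatically from the resampling summary; the column names below match the
# Accuracy table printed above.
acc <- summary(results)$statistics$Accuracy
acc[order(acc[, "Mean"], decreasing = TRUE), c("Mean", "Median")]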

# Summarize the best model
print(fit.lda)
## Linear Discriminant Analysis 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.98      0.97
# MAKE PREDICTIONS
# Estimate the skill of LDA by predicting on the iris data (here the same data
# used for training; a hedged hold-out variant is sketched after the confusion
# matrix output below)
predictions <- predict(fit.lda, iris)
confusionMatrix(predictions, iris$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         1
##   virginica       0          2        49
## 
## Overall Statistics
##                                           
##                Accuracy : 0.98            
##                  95% CI : (0.9427, 0.9959)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.97            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9600           0.9800
## Specificity                 1.0000            0.9900           0.9800
## Pos Pred Value              1.0000            0.9796           0.9608
## Neg Pred Value              1.0000            0.9802           0.9899
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3200           0.3267
## Detection Prevalence        0.3333            0.3267           0.3400
## Balanced Accuracy           1.0000            0.9750           0.9800
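# HEDGED VARIANT (assumption, not shown in the original): the predictions above
# are made on the same iris data used for training. A genuine hold-out
# evaluation could look like this; the names validationIndex, training,
# validation and fit.lda.holdout are illustrative.
set.seed(7)
validationIndex <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
training <- iris[validationIndex, ]
validation <- iris[-validationIndex, ]
fit.lda.holdout <- train(Species~., data = training, method = "lda",
                         metric = metric, trControl = trainControl)
confusionMatrix(predict(fit.lda.holdout, validation), validation$Species)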

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.