Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
The process of a machine learning project may not be linear, but there are a number of well-known steps: define the problem, prepare the data, evaluate algorithms, improve results, and present results.
For more information on the steps in a machine learning project, see this checklist and more on the process.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end, covering the key steps: loading the data, summarizing it, evaluating algorithms, and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, later, once you have more confidence.
The classic small project to start with on a new tool is the classification of iris flowers (the iris dataset), because it is so well understood. In this tutorial we follow the same end-to-end template, but on our own data.
Let’s get started with your hello world machine learning project in R.
We are going to use our EEG dataset, which was collected by our EEG specialist.
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
# EEG Mod classification analysis
# Load the data, drop incomplete rows, and make the class a factor
setwd("~/")
data <- read.csv("EEG PUPR MOD.csv", sep=",", header=TRUE)
data <- na.omit(data)
data$Mod <- as.factor(data$Mod)
str(data)
## 'data.frame': 387 obs. of 41 variables:
## $ X1 : num 0.767 -0.129 1.209 -0.204 -1.134 ...
## $ X2 : num 0.1637 1.0462 0.556 0.4089 0.0166 ...
## $ X3 : num -1.04 1.375 -0.968 -0.418 2.427 ...
## $ X4 : num -1.3482 -0.0664 -1.3606 -0.9538 1.5605 ...
## $ X5 : num 0.241 -0.543 -0.361 0.786 -0.456 ...
## $ X6 : num 0.734 -0.131 1.348 -0.125 -0.655 ...
## $ X7 : num 0.189 1.009 0.611 0.537 0.139 ...
## $ X8 : num -1.13 1.61 -1.11 -0.49 2.11 ...
## $ X9 : num -1.356 -0.137 -1.513 -1.018 0.973 ...
## $ X10: num 0.359 -0.593 -0.342 0.77 -0.616 ...
## $ X11: num 1.044 -0.474 0.954 -0.288 -1.081 ...
## $ X12: num 0.3945 0.749 -0.3145 0.1333 -0.0906 ...
## $ X13: num -0.936 1.804 -1.102 -0.313 2.779 ...
## $ X14: num -1.184 0.268 -1.144 -0.643 1.45 ...
## $ X15: num -0.295 -0.469 0.1 0.816 -0.549 ...
## $ X16: num 0.509 -0.467 0.709 1.909 -1.024 ...
## $ X17: num 0.0608 0.7258 0.2063 0.4972 -0.1054 ...
## $ X18: num -0.947 1.914 -0.861 -0.818 2.516 ...
## $ X19: num -1 0.137 -0.95 -1.715 1.375 ...
## $ X20: num 0.4109 -0.4333 0.0501 -0.9528 -0.5704 ...
## $ X21: num 1.465 -0.354 1.113 1.852 -0.976 ...
## $ X22: num 0.326 0.536 0.402 0.726 -0.475 ...
## $ X23: num -1.02 0.93 -0.921 -1.059 2.88 ...
## $ X24: num -1.399 0.0904 -1.1301 -1.6472 1.1867 ...
## $ X25: num -0.536 -0.159 -0.368 -0.94 -0.476 ...
## $ X26: num 1.039 -0.182 1.258 0.28 -0.881 ...
## $ X27: num -0.189 0.852 0.518 1.088 -0.248 ...
## $ X28: num -1.102 1.217 -0.911 0.101 1.182 ...
## $ X29: num -1.416 0.191 -1.217 -1.009 0.906 ...
## $ X30: num 0.2913 -0.6662 -0.3796 -0.0278 -0.0083 ...
## $ X31: num 0.758 -0.274 1.26 1.612 -1.033 ...
## $ X32: num -0.0652 0.9181 0.1851 0.5426 -0.387 ...
## $ X33: num -0.99 1.366 -0.952 -0.669 2.931 ...
## $ X34: num -1.135 0.132 -1.126 -1.482 1.137 ...
## $ X35: num 0.262 -0.576 -0.423 -0.807 -0.442 ...
## $ X36: num 0.56 -0.336 1.093 1.392 -1.099 ...
## $ X37: num -0.0269 0.4386 0.652 1.389 -0.3178 ...
## $ X38: num -0.878 0.715 -0.728 -0.765 2.888 ...
## $ X39: num -0.9726 -0.0156 -1.0409 -1.4511 1.3418 ...
## $ X40: num 0.28 0.222 -0.547 -0.931 -0.449 ...
## $ Mod: Factor w/ 3 levels "1","2","3": 1 2 2 3 3 2 2 1 2 2 ...
We need to know whether the models we create are any good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data$Mod, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data[-validation_index,]
# use the remaining 80% of the data for training and testing the models
data <- data[validation_index,]
You now have training data in the data variable and a validation set we will use later in the validation variable.
Note that we replaced our data variable with the 80% sample of the dataset. This keeps the rest of the code simpler and more readable.
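As an optional sanity check (a small sketch using the objects just created), you can confirm the split sizes and that createDataPartition kept the class proportions similar in both subsets:
# confirm the 80/20 split and the stratification of the class
dim(data)        # roughly 80% of the 387 rows
dim(validation)  # the held-back 20%
prop.table(table(data$Mod))        # class proportions in the training data
prop.table(table(validation$Mod))  # should be similar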
Now it is time to take a look at the data.
In this step we are going to take a look at the data in a few different ways: the dimensions of the dataset, the types of the attributes, a peek at the data itself, the levels and distribution of the class attribute, and a statistical summary of all attributes.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
# dimensions of dataset
dim(data)
## [1] 310 41
You should see 310 instances and 41 attributes.
It is a good idea to get an idea of the types of the attributes. They could be doubles, integers, strings, factors and other types.
Knowing the types is important, as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to apply to prepare the data before you model it (a small example appears after the type listing below).
# list types for each attribute
sapply(data, class)
## X1 X2 X3 X4 X5 X6 X7
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X8 X9 X10 X11 X12 X13 X14
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X15 X16 X17 X18 X19 X20 X21
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X22 X23 X24 X25 X26 X27 X28
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X29 X30 X31 X32 X33 X34 X35
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X36 X37 X38 X39 X40 Mod
## "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
You should see that all of the inputs are numeric (doubles) and that the class value, Mod, is a factor.
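If the inputs had been on very different scales, this is where a transform would come in. A minimal sketch (using only the caret package loaded above; not used in the rest of this tutorial) that standardizes the numeric inputs:
# centre and scale the 40 numeric input columns
preproc <- preProcess(data[, 1:40], method=c("center", "scale"))
scaled_inputs <- predict(preproc, data[, 1:40])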
It is also always a good idea to actually eyeball your data. You should see the first 6 rows of the data:
# take a peek at the first 6 rows of the data
head(data)
## X1 X2 X3 X4 X5 X6
## 1 0.7673868 0.1637270 -1.0395744 -1.34824985 0.2412754 0.73405244
## 2 -0.1290103 1.0462421 1.3750280 -0.06642437 -0.5428035 -0.13148475
## 3 1.2087945 0.5559559 -0.9678536 -1.36057509 -0.3613038 1.34830465
## 4 -0.2037101 0.4088700 -0.4179936 -0.95384201 0.7857747 -0.12450461
## 5 -1.1340617 0.0166411 2.4269340 1.56050796 -0.4556836 -0.65499516
## 6 -0.1493830 0.9481848 0.4426568 -0.22665256 -0.1217241 -0.09658406
## X7 X8 X9 X10 X11 X12
## 1 0.1890194 -1.1344082 -1.3561874 0.35889924 1.0439487 0.39448998
## 2 1.0088692 1.6097344 -0.1371429 -0.59281174 -0.4739625 0.74898621
## 3 0.6113663 -1.1105461 -1.5130941 -0.34156004 0.9539984 -0.31450248
## 4 0.5368345 -0.4901312 -1.0182345 0.77003838 -0.2884401 0.13328223
## 5 0.1393315 2.1108387 0.9732739 -0.61565280 -1.0811270 -0.09061013
## 6 1.0088692 0.1302836 -0.3543983 -0.02939884 -0.4064998 2.22294422
## X13 X14 X15 X16 X17 X18
## 1 -0.9360329 -1.1837148 -0.2948808 0.5089845 0.0608205 -0.9469114
## 2 1.8037381 0.2682155 -0.4689169 -0.4666821 0.7257509 1.9139039
## 3 -1.1020796 -1.1436616 0.1000471 0.7089658 0.2062740 -0.8608718
## 4 -0.3133576 -0.6429960 0.8162724 1.9088539 0.4971811 -0.8178521
## 5 2.7792626 1.4497863 -0.5492412 -1.0242059 -0.1054121 2.5161808
## 6 0.2678059 -0.2725034 -0.2212502 -0.2000403 3.3439143 -0.1940653
## X19 X20 X21 X22 X23 X24
## 1 -1.0004456 0.41091876 1.4648076 0.3260878 -1.0195392 -1.39897092
## 2 0.1370406 -0.43330283 -0.3542093 0.5357951 0.9301182 0.09039257
## 3 -0.9501144 0.05014031 1.1127398 0.4023450 -0.9210717 -1.13005807
## 4 -1.7151493 -0.95282380 1.8520821 0.7264382 -1.0589262 -1.64719817
## 5 1.3751893 -0.57039864 -0.9761957 -0.4746131 2.8797757 1.18672958
## 6 -1.0306444 -0.12303335 -0.2661924 2.5184830 -0.2908794 -0.76806000
## X25 X26 X27 X28 X29 X30
## 1 -0.53622102 1.03918252 -0.1894085 -1.1024462 -1.4156314 0.291335156
## 2 -0.15912303 -0.18215632 0.8519556 1.2165038 0.1913446 -0.666210057
## 3 -0.36787370 1.25792977 0.5179332 -0.9106534 -1.2172393 -0.379597885
## 4 -0.94025458 0.27964344 1.0877362 0.1006181 -1.0089276 -0.027846582
## 5 -0.47561599 -0.88093227 -0.2483536 1.1816324 0.9055561 -0.008304843
## 6 0.01595818 -0.02417219 2.0308583 -0.1260462 -0.4335905 -0.314458754
## X31 X32 X33 X34 X35 X36
## 1 0.7580038 -0.06520403 -0.9895163 -1.135140 0.26232704 0.5597571
## 2 -0.2743233 0.91807624 1.3662893 0.132103 -0.57633753 -0.3361657
## 3 1.2602170 0.18508549 -0.9518234 -1.125753 -0.42268906 1.0930445
## 4 1.6117663 0.54264196 -0.6691267 -1.482459 -0.80681024 1.3916855
## 5 -1.0332232 -0.38700485 2.9305441 1.136511 -0.44189512 -1.0987667
## 6 -0.1292395 2.24103516 -0.2733514 -0.684565 -0.09618606 -0.3095013
## X37 X38 X39 X40 Mod
## 1 -0.02685066 -0.8778608 -0.9725822 0.2802739 1
## 2 0.43862368 0.7148718 -0.0155864 0.2216551 2
## 3 0.65196608 -0.7279565 -1.0409390 -0.5469034 2
## 4 1.38896711 -0.7654326 -1.4510801 -0.9311827 3
## 5 -0.31777212 2.8884834 1.3417851 -0.4492053 3
## 6 2.88236393 -0.2969818 -0.8358685 0.0783645 2
The class variable is a factor. A factor is a class that has multiple class labels or levels. Let’s look at the levels:
# list the levels for the class
levels(data$Mod)
## [1] "1" "2" "3"
Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 3 different labels.
If there were two levels, it would be a binary classification problem; with three levels, this is a multi-class classification problem.
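We can confirm this programmatically too:
# number of levels in the class: two would mean a binary classification problem
nlevels(data$Mod)
## [1] 3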
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.
# summarize the class distribution
percentage <- prop.table(table(data$Mod)) * 100
cbind(freq=table(data$Mod), percentage=percentage)
##   freq percentage
## 1   52   16.77419
## 2  197   63.54839
## 3   61   19.67742
Now finally, we can take a look at a summary of each attribute.
This includes the mean, the min and max values, as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute). A short check of these percentiles appears after the summary output below.
# summarize attribute distributions
summary(data)
## X1 X2 X3
## Min. :-2.071204 Min. :-2.94959 Min. :-2.04367
## 1st Qu.:-0.543254 1st Qu.:-0.62073 1st Qu.:-0.62718
## Median : 0.003656 Median :-0.03239 Median :-0.23869
## Mean : 0.031587 Mean : 0.02529 Mean :-0.01006
## 3rd Qu.: 0.699478 3rd Qu.: 0.60498 3rd Qu.: 0.48449
## Max. : 2.437945 Max. : 5.70396 Max. : 5.00889
## X4 X5 X6
## Min. :-2.17404 Min. :-1.18894 Min. :-2.03706
## 1st Qu.:-0.61490 1st Qu.:-0.63050 1st Qu.:-0.54157
## Median :-0.11572 Median :-0.33589 Median : 0.03953
## Mean : 0.03188 Mean :-0.05818 Mean : 0.06197
## 3rd Qu.: 0.59606 3rd Qu.: 0.29935 3rd Qu.: 0.67647
## Max. : 3.35999 Max. : 4.98931 Max. : 2.58379
## X7 X8 X9
## Min. :-2.69288 Min. :-1.92186 Min. :-2.189000
## 1st Qu.:-0.63083 1st Qu.:-0.70489 1st Qu.:-0.608422
## Median :-0.06757 Median :-0.25151 Median :-0.081274
## Mean : 0.01630 Mean :-0.04548 Mean :-0.007595
## 3rd Qu.: 0.61137 3rd Qu.: 0.49650 3rd Qu.: 0.611181
## Max. : 7.86580 Max. : 4.64022 Max. : 3.230317
## X10 X11 X12
## Min. :-1.2476 Min. :-1.75013 Min. :-2.47879
## 1st Qu.:-0.6233 1st Qu.:-0.71430 1st Qu.:-0.58970
## Median :-0.3149 Median :-0.09857 Median :-0.09061
## Mean :-0.0540 Mean : 0.03544 Mean : 0.03052
## 3rd Qu.: 0.3437 3rd Qu.: 0.92870 3rd Qu.: 0.48778
## Max. : 4.9043 Max. : 2.59202 Max. : 7.14858
## X13 X14 X15
## Min. :-2.160627 Min. :-1.93471 Min. :-1.26174
## 1st Qu.:-0.723285 1st Qu.:-0.76868 1st Qu.:-0.70320
## Median :-0.219956 Median :-0.05722 Median :-0.31161
## Mean : 0.000597 Mean : 0.01539 Mean :-0.06462
## 3rd Qu.: 0.407908 3rd Qu.: 0.77889 3rd Qu.: 0.37449
## Max. : 4.937870 Max. : 2.37101 Max. : 4.88604
## X16 X17 X18
## Min. :-1.84837 Min. :-2.515785 Min. :-2.40098
## 1st Qu.:-0.72575 1st Qu.:-0.604110 1st Qu.:-0.71030
## Median :-0.07581 Median :-0.048461 Median :-0.21557
## Mean : 0.04280 Mean : 0.001166 Mean :-0.01214
## 3rd Qu.: 0.82714 3rd Qu.: 0.453502 3rd Qu.: 0.40821
## Max. : 2.43608 Max. : 6.855578 Max. : 3.76375
## X19 X20 X21
## Min. :-2.21142 Min. :-1.2847 Min. :-1.76835
## 1st Qu.:-0.76892 1st Qu.:-0.6717 1st Qu.:-0.65640
## Median : 0.03794 Median :-0.2998 Median :-0.18111
## Mean : 0.02299 Mean :-0.0679 Mean : 0.01092
## 3rd Qu.: 0.78128 3rd Qu.: 0.3045 3rd Qu.: 0.86629
## Max. : 2.94552 Max. : 4.8774 Max. : 2.34498
## X22 X23 X24
## Min. :-2.571687 Min. :-2.22872 Min. :-2.09194
## 1st Qu.:-0.598531 1st Qu.:-0.70444 1st Qu.:-0.68949
## Median :-0.169584 Median :-0.25149 Median : 0.06143
## Mean :-0.002683 Mean :-0.01418 Mean : 0.05580
## 3rd Qu.: 0.535795 3rd Qu.: 0.40471 3rd Qu.: 0.77043
## Max. : 7.322688 Max. : 3.43119 Max. : 2.30375
## X25 X26 X27
## Min. :-1.27695 Min. :-1.71946 Min. :-2.66510
## 1st Qu.:-0.65183 1st Qu.:-0.68297 1st Qu.:-0.54308
## Median :-0.26013 Median :-0.06367 Median :-0.11081
## Mean :-0.04422 Mean : 0.04235 Mean : 0.01543
## 3rd Qu.: 0.27725 3rd Qu.: 0.75360 3rd Qu.: 0.48846
## Max. : 5.02597 Max. : 2.67371 Max. : 6.45174
## X28 X29 X30
## Min. :-2.061411 Min. :-2.12984 Min. :-1.24595
## 1st Qu.:-0.666553 1st Qu.:-0.70742 1st Qu.:-0.62550
## Median :-0.248096 Median : 0.08223 Median :-0.29554
## Mean :-0.006322 Mean : 0.03340 Mean :-0.07168
## 3rd Qu.: 0.514720 3rd Qu.: 0.79148 3rd Qu.: 0.25225
## Max. : 3.518018 Max. : 2.64149 Max. : 4.86410
## X31 X32 X33
## Min. :-1.714001 Min. :-2.478710 Min. :-2.10146
## 1st Qu.:-0.751426 1st Qu.:-0.597069 1st Qu.:-0.65028
## Median :-0.132029 Median :-0.184709 Median :-0.23566
## Mean :-0.003986 Mean : 0.001085 Mean : 0.00381
## 3rd Qu.: 0.776139 3rd Qu.: 0.501314 3rd Qu.: 0.40512
## Max. : 2.030277 Max. : 6.585346 Max. : 4.24979
## X34 X35 X36
## Min. :-1.93303 Min. :-1.21654 Min. :-1.70138
## 1st Qu.:-0.68222 1st Qu.:-0.61315 1st Qu.:-0.76013
## Median : 0.08001 Median :-0.21782 Median :-0.20284
## Mean : 0.06604 Mean :-0.04227 Mean :-0.03121
## 3rd Qu.: 0.87367 3rd Qu.: 0.20044 3rd Qu.: 0.70241
## Max. : 3.27674 Max. : 4.88459 Max. : 2.08496
## X37 X38 X39
## Min. :-2.64514 Min. :-2.02088 Min. :-1.90028
## 1st Qu.:-0.60869 1st Qu.:-0.64832 1st Qu.:-0.70892
## Median :-0.08503 Median :-0.20329 Median : 0.04422
## Mean : 0.02019 Mean : 0.00424 Mean : 0.08639
## 3rd Qu.: 0.55499 3rd Qu.: 0.43107 3rd Qu.: 0.84376
## Max. : 7.11042 Max. : 4.31257 Max. : 2.69916
## X40 Mod
## Min. :-1.20474 1: 52
## 1st Qu.:-0.65111 2:197
## Median :-0.22776 3: 61
## Mean :-0.02732
## 3rd Qu.: 0.29004
## Max. : 4.91768
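As promised, here is where those percentile columns come from; the same values can be computed directly for a single attribute with the quantile function:
# 25th, 50th (median) and 75th percentiles of X1
# these should match the 1st Qu., Median and 3rd Qu. that summary() reports for X1
quantile(data$X1, probs=c(0.25, 0.50, 0.75))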
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
Set up the test harness to use 10-fold cross-validation. Build 5 different models to predict Mod from the EEG measurements. Select the best model.
We will use 10-fold cross-validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. Repeating the whole procedure several times with different splits would give an even more stable estimate; the sketch after the next code chunk shows how, although the code below uses a single pass.
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
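If you wanted the repeated variant mentioned above, caret supports it directly. A minimal sketch (not used in the rest of this tutorial):
# repeat 10-fold cross-validation 3 times with different splits
control_repeated <- trainControl(method="repeatedcv", number=10, repeats=3)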
We are using the metric of “Accuracy” to evaluate models. This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model next.
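To make that definition concrete, here is a tiny sketch with hypothetical truth and prediction vectors:
# accuracy = correctly predicted instances / total instances * 100
truth <- factor(c("1", "2", "2", "3"))
pred  <- factor(c("1", "2", "3", "3"))
mean(pred == truth) * 100  # 3 of 4 correct, i.e. 75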
We don’t know in advance which algorithms will be good on this problem or what configurations to use, so we will try a mixture of methods and let the test harness tell us.
Let’s evaluate 5 different algorithms:
This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and more complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits, which makes the results directly comparable.
Let’s build our five models:
#Model
# e1071 is required by several of the caret methods below (e.g. the SVM)
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Mod~., data=data, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Mod~., data=data, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Mod~., data=data, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Mod~., data=data, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Mod~., data=data, method="rf", metric=metric, trControl=control)
Caret does support configuring and tuning each model, but we are not going to cover that in detail in this tutorial.
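For a taste of what tuning looks like (a hedged sketch, not part of this tutorial's results), the kNN model could search over more values of k via caret's tuneLength argument, or an explicit grid via tuneGrid:
set.seed(7)
fit.knn.tuned <- train(Mod~., data=data, method="knn", metric=metric,
                       trControl=control, tuneLength=10)
# or supply the candidate k values directly:
# trControl=control, tuneGrid=expand.grid(k=seq(3, 21, by=2))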
We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
We can see the accuracy of each classifier and also other metrics like Kappa:
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.4838710 0.5519153 0.5806452 0.5678629 0.5806452 0.6774194 0
## cart 0.4516129 0.5734375 0.6290323 0.6031384 0.6451613 0.6774194 0
## knn 0.4516129 0.5161290 0.5903226 0.5680645 0.6129032 0.6451613 0
## svm 0.6129032 0.6270833 0.6451613 0.6355108 0.6451613 0.6451613 0
## rf 0.5483871 0.5806452 0.5967742 0.6030376 0.6401210 0.6451613 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu.
## lda -0.13501144 -0.06220033 -0.01003141 0.006588438 0.05536641
## cart -0.09550562 -0.01480263 0.00000000 0.019184614 0.04090909
## knn -0.17111111 -0.15626386 -0.04459459 -0.057369709 0.01829268
## svm 0.00000000 0.00000000 0.00000000 0.000000000 0.00000000
## rf -0.09550562 -0.07981103 -0.02337662 -0.032321107 0.00000000
## Max. NA's
## lda 0.16216216 0
## cart 0.21914358 0
## knn 0.09708738 0
## svm 0.00000000 0
## rf 0.04952830 0
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
We can see that the most accurate model in this case was the SVM, with a mean accuracy of about 63.6%:
# compare accuracy of models
dotplot(results)
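The mean accuracies behind the plot can also be extracted programmatically, a small sketch to double-check the ranking:
# mean cross-validated accuracy for each model
summary(results)$statistics$Accuracy[, "Mean"]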
The results for just the SVM model can be summarized.
# summarize Best Model
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 310 samples
## 40 predictor
## 3 classes: '1', '2', '3'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 278, 279, 280, 279, 279, 279, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.6355108 0
## 0.50 0.6355108 0
## 1.00 0.6355108 0
##
## Tuning parameter 'sigma' was held constant at a value of 0.0269754
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.0269754 and C = 0.25.
This gives a nice summary of what was used to train the model and the accuracy achieved, specifically a mean accuracy of about 63.6%. Note that the Kappa of 0 across all values of C is a warning sign: it suggests the model is doing no better than always predicting the majority class.
The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip along the way, such as overfitting to the training set or a data leak. Both would result in an overly optimistic result.
We can run the SVM model directly on the validation set and summarize the results in a confusion matrix.
# estimate skill of SVM on the validation dataset
predictions <- predict(fit.svm, validation)
confusionMatrix(predictions, validation$Mod)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3
## 1 0 0 0
## 2 13 49 15
## 3 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6364
## 95% CI : (0.5188, 0.743)
## No Information Rate : 0.6364
## P-Value [Acc > NIR] : 0.5513
##
## Kappa : 0
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 0.0000 1.0000 0.0000
## Specificity 1.0000 0.0000 1.0000
## Pos Pred Value NaN 0.6364 NaN
## Neg Pred Value 0.8312 NaN 0.8052
## Prevalence 0.1688 0.6364 0.1948
## Detection Rate 0.0000 0.6364 0.0000
## Detection Prevalence 0.0000 1.0000 0.0000
## Balanced Accuracy 0.5000 0.5000 0.5000
We can see that the accuracy on the validation set is 63.64%. It was a small validation dataset (20% of the data). Note, though, that the SVM predicted class 2 for every instance, so this accuracy exactly matches the No Information Rate and the Kappa is 0: the model has learned little beyond the majority class.
In this post you discovered, step-by-step, how to complete an EEG machine learning project in R.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.