Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
The process of a machine learning project may not be linear, but there are a number of well-known steps.
For more information on the steps in a machine learning project see this checklist and more on the process.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing the data, evaluating algorithms and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and tasks to improve results, later, once you have more confidence.
The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).
This is a good project because it is so well understood.
Let’s get started with your hello world machine learning project in R.
Instead of the iris flowers, we are going to use our EEG dataset, which was collected by our EEG specialist.
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
# EEG Analysis
#Load Data
setwd("~/")
data<-read.csv("PUPR Neuro EEG clean.csv", sep=",", header=TRUE)
data<-na.omit(data) # drop rows with missing values
data$Dom<-as.factor(data$Dom) # convert the class attribute to a factor
str(data)
## 'data.frame': 387 obs. of 43 variables:
## $ No : Factor w/ 387 levels "X1","X10","X100",..: 1 112 223 322 333 344 355 366 377 2 ...
## $ X1 : num 0.767 -0.129 1.209 -0.204 -1.134 ...
## $ X2 : num 0.1637 1.0462 0.556 0.4089 0.0166 ...
## $ X3 : num -1.04 1.375 -0.968 -0.418 2.427 ...
## $ X4 : num -1.3482 -0.0664 -1.3606 -0.9538 1.5605 ...
## $ X5 : num 0.241 -0.543 -0.361 0.786 -0.456 ...
## $ X6 : num 0.734 -0.131 1.348 -0.125 -0.655 ...
## $ X7 : num 0.189 1.009 0.611 0.537 0.139 ...
## $ X8 : num -1.13 1.61 -1.11 -0.49 2.11 ...
## $ X9 : num -1.356 -0.137 -1.513 -1.018 0.973 ...
## $ X10: num 0.359 -0.593 -0.342 0.77 -0.616 ...
## $ X11: num 1.044 -0.474 0.954 -0.288 -1.081 ...
## $ X12: num 0.3945 0.749 -0.3145 0.1333 -0.0906 ...
## $ X13: num -0.936 1.804 -1.102 -0.313 2.779 ...
## $ X14: num -1.184 0.268 -1.144 -0.643 1.45 ...
## $ X15: num -0.295 -0.469 0.1 0.816 -0.549 ...
## $ X16: num 0.509 -0.467 0.709 1.909 -1.024 ...
## $ X17: num 0.0608 0.7258 0.2063 0.4972 -0.1054 ...
## $ X18: num -0.947 1.914 -0.861 -0.818 2.516 ...
## $ X19: num -1 0.137 -0.95 -1.715 1.375 ...
## $ X20: num 0.4109 -0.4333 0.0501 -0.9528 -0.5704 ...
## $ X21: num 1.465 -0.354 1.113 1.852 -0.976 ...
## $ X22: num 0.326 0.536 0.402 0.726 -0.475 ...
## $ X23: num -1.02 0.93 -0.921 -1.059 2.88 ...
## $ X24: num -1.399 0.0904 -1.1301 -1.6472 1.1867 ...
## $ X25: num -0.536 -0.159 -0.368 -0.94 -0.476 ...
## $ X26: num 1.039 -0.182 1.258 0.28 -0.881 ...
## $ X27: num -0.189 0.852 0.518 1.088 -0.248 ...
## $ X28: num -1.102 1.217 -0.911 0.101 1.182 ...
## $ X29: num -1.416 0.191 -1.217 -1.009 0.906 ...
## $ X30: num 0.2913 -0.6662 -0.3796 -0.0278 -0.0083 ...
## $ X31: num 0.758 -0.274 1.26 1.612 -1.033 ...
## $ X32: num -0.0652 0.9181 0.1851 0.5426 -0.387 ...
## $ X33: num -0.99 1.366 -0.952 -0.669 2.931 ...
## $ X34: num -1.135 0.132 -1.126 -1.482 1.137 ...
## $ X35: num 0.262 -0.576 -0.423 -0.807 -0.442 ...
## $ X36: num 0.56 -0.336 1.093 1.392 -1.099 ...
## $ X37: num -0.0269 0.4386 0.652 1.389 -0.3178 ...
## $ X38: num -0.878 0.715 -0.728 -0.765 2.888 ...
## $ X39: num -0.9726 -0.0156 -1.0409 -1.4511 1.3418 ...
## $ X40: num 0.28 0.222 -0.547 -0.931 -0.449 ...
## $ Dom: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Mod: Factor w/ 4 levels "1","2","3","FALSE": 1 2 2 3 3 2 2 1 2 2 ...
We need to know whether the models we create are any good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data$Dom, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data[-validation_index,]
# use the remaining 80% of data to training and testing the models
data <- data[validation_index,]
You now have training data in the data variable and a validation set we will use later in the validation variable.
Note that we replaced our data variable with the 80% sample of the dataset. This was an attempt to keep the rest of the code simpler and more readable.
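If you would rather not overwrite data, a hypothetical alternative (not used in the rest of this tutorial) is to keep the training rows in a separately named variable:
# equivalent split that leaves the full data object intact (hypothetical variable name)
training <- data[validation_index,]
validation <- data[-validation_index,]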
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways: the dimensions of the dataset, the types of the attributes, a peek at the data itself, the levels and distribution of the class attribute, and a statistical summary of each attribute.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
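For reference, here are the one-line commands this step uses, assuming your training data is in the data variable:
dim(data) # dimensions of the dataset
sapply(data, class) # type of each attribute
head(data) # peek at the first rows
levels(data$Dom) # levels of the class attribute
prop.table(table(data$Dom)) # class distribution as proportions
summary(data) # statistical summary of each attribute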
Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
# dimensions of dataset
dim(data)
## [1] 310 43
You should see 310 instances and 43 attributes.
It is a good idea to know the types of the attributes. They could be doubles, integers, strings, factors and other types.
Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.
# list types for each attribute
sapply(data, class)
## No X1 X2 X3 X4 X5 X6
## "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X7 X8 X9 X10 X11 X12 X13
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X14 X15 X16 X17 X18 X19 X20
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X21 X22 X23 X24 X25 X26 X27
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X28 X29 X30 X31 X32 X33 X34
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X35 X36 X37 X38 X39 X40 Dom
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
## Mod
## "factor"
You should see that all of the EEG measurements (X1 to X40) are numeric and that the identifier No, the class Dom and the Mod attribute are factors.
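If the attributes were on very different scales, caret's preProcess function could be used to standardize them before modeling. A minimal sketch, not needed here since the EEG features already appear to be standardized:
# center and scale the numeric predictors (columns 2 to 41 are X1-X40 at this point)
preproc <- preProcess(data[,2:41], method=c("center", "scale"))
scaled <- predict(preproc, data[,2:41])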
It is also always a good idea to actually eyeball your data. You should see the first 6 rows of the data:
# take a peek at the first 6 rows of the data
head(data)
## No X1 X2 X3 X4 X5 X6
## 1 X1 0.7673868 0.1637270 -1.0395744 -1.34824985 0.2412754 0.73405244
## 2 X2 -0.1290103 1.0462421 1.3750280 -0.06642437 -0.5428035 -0.13148475
## 3 X3 1.2087945 0.5559559 -0.9678536 -1.36057509 -0.3613038 1.34830465
## 4 X4 -0.2037101 0.4088700 -0.4179936 -0.95384201 0.7857747 -0.12450461
## 5 X5 -1.1340617 0.0166411 2.4269340 1.56050796 -0.4556836 -0.65499516
## 6 X6 -0.1493830 0.9481848 0.4426568 -0.22665256 -0.1217241 -0.09658406
## X7 X8 X9 X10 X11 X12
## 1 0.1890194 -1.1344082 -1.3561874 0.35889924 1.0439487 0.39448998
## 2 1.0088692 1.6097344 -0.1371429 -0.59281174 -0.4739625 0.74898621
## 3 0.6113663 -1.1105461 -1.5130941 -0.34156004 0.9539984 -0.31450248
## 4 0.5368345 -0.4901312 -1.0182345 0.77003838 -0.2884401 0.13328223
## 5 0.1393315 2.1108387 0.9732739 -0.61565280 -1.0811270 -0.09061013
## 6 1.0088692 0.1302836 -0.3543983 -0.02939884 -0.4064998 2.22294422
## X13 X14 X15 X16 X17 X18
## 1 -0.9360329 -1.1837148 -0.2948808 0.5089845 0.0608205 -0.9469114
## 2 1.8037381 0.2682155 -0.4689169 -0.4666821 0.7257509 1.9139039
## 3 -1.1020796 -1.1436616 0.1000471 0.7089658 0.2062740 -0.8608718
## 4 -0.3133576 -0.6429960 0.8162724 1.9088539 0.4971811 -0.8178521
## 5 2.7792626 1.4497863 -0.5492412 -1.0242059 -0.1054121 2.5161808
## 6 0.2678059 -0.2725034 -0.2212502 -0.2000403 3.3439143 -0.1940653
## X19 X20 X21 X22 X23 X24
## 1 -1.0004456 0.41091876 1.4648076 0.3260878 -1.0195392 -1.39897092
## 2 0.1370406 -0.43330283 -0.3542093 0.5357951 0.9301182 0.09039257
## 3 -0.9501144 0.05014031 1.1127398 0.4023450 -0.9210717 -1.13005807
## 4 -1.7151493 -0.95282380 1.8520821 0.7264382 -1.0589262 -1.64719817
## 5 1.3751893 -0.57039864 -0.9761957 -0.4746131 2.8797757 1.18672958
## 6 -1.0306444 -0.12303335 -0.2661924 2.5184830 -0.2908794 -0.76806000
## X25 X26 X27 X28 X29 X30
## 1 -0.53622102 1.03918252 -0.1894085 -1.1024462 -1.4156314 0.291335156
## 2 -0.15912303 -0.18215632 0.8519556 1.2165038 0.1913446 -0.666210057
## 3 -0.36787370 1.25792977 0.5179332 -0.9106534 -1.2172393 -0.379597885
## 4 -0.94025458 0.27964344 1.0877362 0.1006181 -1.0089276 -0.027846582
## 5 -0.47561599 -0.88093227 -0.2483536 1.1816324 0.9055561 -0.008304843
## 6 0.01595818 -0.02417219 2.0308583 -0.1260462 -0.4335905 -0.314458754
## X31 X32 X33 X34 X35 X36
## 1 0.7580038 -0.06520403 -0.9895163 -1.135140 0.26232704 0.5597571
## 2 -0.2743233 0.91807624 1.3662893 0.132103 -0.57633753 -0.3361657
## 3 1.2602170 0.18508549 -0.9518234 -1.125753 -0.42268906 1.0930445
## 4 1.6117663 0.54264196 -0.6691267 -1.482459 -0.80681024 1.3916855
## 5 -1.0332232 -0.38700485 2.9305441 1.136511 -0.44189512 -1.0987667
## 6 -0.1292395 2.24103516 -0.2733514 -0.684565 -0.09618606 -0.3095013
## X37 X38 X39 X40 Dom Mod
## 1 -0.02685066 -0.8778608 -0.9725822 0.2802739 1 1
## 2 0.43862368 0.7148718 -0.0155864 0.2216551 1 2
## 3 0.65196608 -0.7279565 -1.0409390 -0.5469034 1 2
## 4 1.38896711 -0.7654326 -1.4510801 -0.9311827 1 3
## 5 -0.31777212 2.8884834 1.3417851 -0.4492053 1 3
## 6 2.88236393 -0.2969818 -0.8358685 0.0783645 1 2
The class variable is a factor. A factor is a variable that has a fixed set of class labels or levels. Let’s look at the levels:
# list the levels for the class
data$Dom<-as.factor(data$Dom)
levels(data$Dom)
## [1] "0" "1"
Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 2 different labels.
Because there are two levels, this is a binary classification problem.
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.
# summarize the class distribution
percentage <- prop.table(table(data$Dom)) * 100
cbind(freq=table(data$Dom), percentage=percentage)
## freq percentage
## 0 58 18.70968
## 1 252 81.29032
Now finally, we can take a look at a summary of each attribute.
This includes the mean, the min and max values as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).
# summarize attribute distributions
data<-data[,2:42] # drop the ID column (No) and the Mod column, keeping X1-X40 and Dom
summary(data)
## X1 X2 X3
## Min. :-2.07800 Min. :-3.194733 Min. :-2.52181
## 1st Qu.:-0.59758 1st Qu.:-0.620731 1st Qu.:-0.56144
## Median :-0.02375 Median :-0.032388 Median :-0.17892
## Mean :-0.02878 Mean :-0.001973 Mean : 0.04358
## 3rd Qu.: 0.56026 3rd Qu.: 0.580470 3rd Qu.: 0.53829
## Max. : 2.43795 Max. : 2.885645 Max. : 5.00889
## X4 X5 X6
## Min. :-2.51915 Min. :-1.24702 Min. :-2.058003
## 1st Qu.:-0.54403 1st Qu.:-0.63050 1st Qu.:-0.600899
## Median :-0.02945 Median :-0.29233 Median : 0.001138
## Mean : 0.04827 Mean :-0.01119 Mean : 0.023104
## 3rd Qu.: 0.57141 3rd Qu.: 0.42640 3rd Qu.: 0.646801
## Max. : 3.35999 Max. : 5.41765 Max. : 2.583789
## X7 X8 X9
## Min. :-3.23944 Min. :-2.613859 Min. :-2.58730
## 1st Qu.:-0.63083 1st Qu.:-0.657166 1st Qu.:-0.63200
## Median :-0.08426 Median :-0.227648 Median :-0.07679
## Mean :-0.03481 Mean :-0.002021 Mean : 0.00353
## 3rd Qu.: 0.55547 3rd Qu.: 0.533010 3rd Qu.: 0.61118
## Max. : 2.47466 Max. : 5.093602 Max. : 3.23032
## X10 X11 X12
## Min. :-1.25520 Min. :-1.75013 Min. :-2.32953
## 1st Qu.:-0.62898 1st Qu.:-0.75927 1st Qu.:-0.61303
## Median :-0.29968 Median :-0.17038 Median :-0.12793
## Mean :-0.01642 Mean :-0.01339 Mean :-0.01827
## 3rd Qu.: 0.37793 3rd Qu.: 0.83453 3rd Qu.: 0.42943
## Max. : 5.70371 Max. : 2.59202 Max. : 3.82751
## X13 X14 X15
## Min. :-1.66249 Min. :-1.80725 Min. :-1.26174
## 1st Qu.:-0.66621 1st Qu.:-0.74814 1st Qu.:-0.66136
## Median :-0.20958 Median : 0.02289 Median :-0.25807
## Mean : 0.05188 Mean : 0.05035 Mean :-0.02915
## 3rd Qu.: 0.45365 3rd Qu.: 0.80813 3rd Qu.: 0.43473
## Max. : 4.93787 Max. : 2.56126 Max. : 4.42417
## X16 X17 X18
## Min. :-1.84231 Min. :-2.45345 Min. :-2.40098
## 1st Qu.:-0.74241 1st Qu.:-0.59891 1st Qu.:-0.68347
## Median :-0.16065 Median :-0.06385 Median :-0.20482
## Mean :-0.01276 Mean :-0.02412 Mean : 0.03067
## 3rd Qu.: 0.80689 3rd Qu.: 0.43484 3rd Qu.: 0.45123
## Max. : 2.15732 Max. : 4.02962 Max. : 4.98982
## X19 X20 X21
## Min. :-2.21142 Min. :-1.26775 Min. :-1.76248
## 1st Qu.:-0.73621 1st Qu.:-0.64255 1st Qu.:-0.68281
## Median : 0.03638 Median :-0.28067 Median :-0.21632
## Mean : 0.04405 Mean :-0.02204 Mean :-0.01175
## 3rd Qu.: 0.79135 3rd Qu.: 0.36221 3rd Qu.: 0.80615
## Max. : 2.94552 Max. : 5.06496 Max. : 2.34498
## X22 X23 X24
## Min. :-2.49543 Min. :-2.22872 Min. :-1.84371
## 1st Qu.:-0.56993 1st Qu.:-0.68547 1st Qu.:-0.68532
## Median :-0.13146 Median :-0.22195 Median : 0.07488
## Mean :-0.01394 Mean : 0.04008 Mean : 0.06433
## 3rd Qu.: 0.49290 3rd Qu.: 0.47224 3rd Qu.: 0.76762
## Max. : 3.96737 Max. : 3.49027 Max. : 2.66575
## X25 X26 X27
## Min. :-1.26348 Min. :-1.71946 Min. :-2.62581
## 1st Qu.:-0.65070 1st Qu.:-0.72599 1st Qu.:-0.56273
## Median :-0.25676 Median :-0.17000 Median :-0.14029
## Mean :-0.03811 Mean :-0.02959 Mean :-0.04705
## 3rd Qu.: 0.29874 3rd Qu.: 0.66839 3rd Qu.: 0.38039
## Max. : 4.54787 Max. : 2.64941 Max. : 3.68132
## X28 X29 X30
## Min. :-2.05792 Min. :-2.09016 Min. :-1.23943
## 1st Qu.:-0.61356 1st Qu.:-0.63198 1st Qu.:-0.62061
## Median :-0.18707 Median : 0.12191 Median :-0.25583
## Mean : 0.05086 Mean : 0.08796 Mean :-0.03085
## 3rd Qu.: 0.57138 3rd Qu.: 0.84108 3rd Qu.: 0.33693
## Max. : 3.51802 Max. : 2.87956 Max. : 4.80548
## X31 X32 X33
## Min. :-1.714001 Min. :-2.4787 Min. :-2.10146
## 1st Qu.:-0.764544 1st Qu.:-0.5479 1st Qu.:-0.68326
## Median :-0.143190 Median :-0.1903 Median :-0.23219
## Mean :-0.006353 Mean :-0.0253 Mean : 0.02755
## 3rd Qu.: 0.811015 3rd Qu.: 0.4443 3rd Qu.: 0.47108
## Max. : 2.337185 Max. : 3.4031 Max. : 4.24979
## X34 X35 X36
## Min. :-1.93303 Min. :-1.21654 Min. :-1.70138
## 1st Qu.:-0.69707 1st Qu.:-0.61315 1st Qu.:-0.80679
## Median : 0.07079 Median :-0.27544 Median :-0.18685
## Mean : 0.06197 Mean :-0.03549 Mean :-0.01292
## 3rd Qu.: 0.87133 3rd Qu.: 0.22872 3rd Qu.: 0.84107
## Max. : 3.27674 Max. : 4.88459 Max. : 1.95164
## X37 X38 X39
## Min. :-2.54817 Min. :-1.77729 Min. :-1.67568
## 1st Qu.:-0.60869 1st Qu.:-0.70565 1st Qu.:-0.82610
## Median :-0.06689 Median :-0.20329 Median : 0.03324
## Mean : 0.01912 Mean : 0.05438 Mean : 0.06798
## 3rd Qu.: 0.55499 3rd Qu.: 0.49001 3rd Qu.: 0.84132
## Max. : 4.00726 Max. : 4.31257 Max. : 2.69916
## X40 Dom
## Min. :-1.19822 0: 58
## 1st Qu.:-0.68368 1:252
## Median :-0.25707
## Mean :-0.05663
## 3rd Qu.: 0.27376
## Max. : 4.59202
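The percentiles reported by summary can also be computed directly with the quantile function, for example for the first attribute:
# 25th, 50th (median) and 75th percentiles of X1
quantile(data$X1, probs=c(0.25, 0.5, 0.75))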
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
1. Set up the test harness to use 10-fold cross-validation.
2. Build 6 different models to predict Dom from the EEG measurements.
3. Select the best model.
We will use 10-fold cross-validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
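If you wanted a more robust estimate, caret can also repeat the cross-validation several times with different splits of the data. A minimal sketch, not used in the rest of this tutorial:
# optional: repeat 10-fold cross validation 3 times (slower)
control_repeated <- trainControl(method="repeatedcv", number=10, repeats=3)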
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model next.
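As an illustration of what the metric computes, accuracy can be calculated by hand from a vector of predictions and a vector of true labels (hypothetical values shown):
# accuracy is the fraction of correct predictions, expressed as a percentage
predicted <- factor(c("1", "1", "0", "1")) # hypothetical predictions
actual <- factor(c("1", "0", "0", "1")) # hypothetical true labels
mean(predicted == actual) * 100 # 75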
We don’t know ahead of time which algorithms will be good on this problem or what configurations to use, so we will try a mixture of methods and compare their accuracy.
Let’s evaluate 6 different algorithms: Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (kNN), Support Vector Machines (SVM) with a radial kernel, Random Forest (RF) and Logistic Regression (LR).
This is a good mixture of simple linear methods (LDA, LR), nonlinear methods (CART, kNN) and more complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
Let’s build our six models:
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
#Model
# a) linear algorithms
set.seed(7)
fit.lda <- train(Dom~., data=data, method="lda", metric=metric, trControl=control)
set.seed(7)
library(caret)
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Dom~., data=data, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Dom~., data=data, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Dom~., data=data, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Dom~., data=data, method="rf", metric=metric, trControl=control)
#Logistic
set.seed(7)
fit.log<- train(Dom~., data=data, method="glm", family="binomial",metric=metric, trControl=control)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Caret does support configuring and tuning each model, but we are not going to cover that in this tutorial.
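As a hedged illustration of what tuning would look like, the radial SVM's two tuning parameters, sigma and C, could be searched over an explicit grid via the tuneGrid argument (hypothetical values, not run here):
# hypothetical tuning grid for svmRadial
svm_grid <- expand.grid(sigma=c(0.01, 0.025, 0.05), C=c(0.25, 0.5, 1, 2))
set.seed(7)
fit.svm.tuned <- train(Dom~., data=data, method="svmRadial", metric=metric, trControl=control, tuneGrid=svm_grid)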
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
We can see the accuracy of each classifier and also other metrics like Kappa:
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf, log=fit.log))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf, log
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6774194 0.7759577 0.7906250 0.7772177 0.8064516 0.8387097 0
## cart 0.5937500 0.7024194 0.7419355 0.7295363 0.7681452 0.8387097 0
## knn 0.7419355 0.7741935 0.7906250 0.7907594 0.8064516 0.8666667 0
## svm 0.8064516 0.8064516 0.8064516 0.8130376 0.8125000 0.8333333 0
## rf 0.7419355 0.8064516 0.8064516 0.8065860 0.8125000 0.8333333 0
## log 0.6562500 0.7419355 0.7580645 0.7517070 0.7953125 0.8064516 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lda -0.1832061 -0.05805338 0.00000000 0.03156875 0.13492063 0.2900763
## cart -0.2530120 -0.10714286 -0.08620690 -0.00387239 0.08024691 0.3621399
## knn -0.1071429 -0.04390244 0.10905350 0.07252398 0.16294643 0.2941176
## svm 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000
## rf -0.1071429 0.00000000 0.00000000 -0.01071429 0.00000000 0.0000000
## log -0.2000000 -0.09541738 -0.05757018 -0.02237319 0.04007634 0.1696429
## NA's
## lda 0
## cart 0
## knn 0
## svm 0
## rf 0
## log 0
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).
We can see that the most accurate model in this case was the SVM:
# compare accuracy of models
dotplot(results)
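The resamples object also works with other lattice plots. A box-and-whisker comparison, for example, shows the spread of each model's accuracy more directly:
# compare the spread of accuracy with a box-and-whisker plot
bwplot(results)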
The results for just the SVM model can be summarized.
# summarize Best Model
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 310 samples
## 40 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 279, 279, 279, 279, 280, 279, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8130376 0
## 0.50 0.8130376 0
## 1.00 0.8130376 0
##
## Tuning parameter 'sigma' was held constant at a value of 0.02482246
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.02482246 and C = 0.25.
This gives a nice summary of what was used to train the model and the mean accuracy achieved across the cross-validation folds, specifically about 81.3%.
The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
We can run the SVM model directly on the validation set and summarize the results in a confusion matrix.
# estimate skill of SVM on the validation dataset
predictions <- predict(fit.svm, validation)
confusionMatrix(predictions, validation$Dom)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 14 63
##
## Accuracy : 0.8182
## 95% CI : (0.7138, 0.8969)
## No Information Rate : 0.8182
## P-Value [Acc > NIR] : 0.570793
##
## Kappa : 0
## Mcnemar's Test P-Value : 0.000512
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8182
## Prevalence : 0.1818
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
We can see that the accuracy is 81.82%. It was a small validation dataset (20%), but this result is in line with our cross-validation estimate of about 81.3%, suggesting we may have an accurate and reliable model.
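If you want to pull these numbers out programmatically instead of reading the printed table, the object returned by confusionMatrix stores them in its overall component:
# store the confusion matrix and extract the headline statistics
cm <- confusionMatrix(predictions, validation$Dom)
cm$overall["Accuracy"] # overall accuracy on the validation set
cm$overall[c("AccuracyLower", "AccuracyUpper")] # bounds of the 95% confidence interval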
In this post you discovered, step by step, how to complete an EEG data machine learning project in R.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.