My EEG Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

The process of a machine learning project may not be linear, but there are a number of well-known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

For more information on the steps in a machine learning project see this checklist and more on the process.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing the data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and improving results, later, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

Let’s get started with your hello world machine learning project in R.

Load Library and Data

We are going to use our EEG dataset, which was collected by our EEG specialist.

library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
# EEG analysis
# Load data
setwd("~/")
data<-read.csv("EEG PUPR MOD.csv", sep=",", header=TRUE)
data<-na.omit(data)
data$Mod<-as.factor(data$Mod)
str(data)
## 'data.frame':    387 obs. of  41 variables:
##  $ X1 : num  0.767 -0.129 1.209 -0.204 -1.134 ...
##  $ X2 : num  0.1637 1.0462 0.556 0.4089 0.0166 ...
##  $ X3 : num  -1.04 1.375 -0.968 -0.418 2.427 ...
##  $ X4 : num  -1.3482 -0.0664 -1.3606 -0.9538 1.5605 ...
##  $ X5 : num  0.241 -0.543 -0.361 0.786 -0.456 ...
##  $ X6 : num  0.734 -0.131 1.348 -0.125 -0.655 ...
##  $ X7 : num  0.189 1.009 0.611 0.537 0.139 ...
##  $ X8 : num  -1.13 1.61 -1.11 -0.49 2.11 ...
##  $ X9 : num  -1.356 -0.137 -1.513 -1.018 0.973 ...
##  $ X10: num  0.359 -0.593 -0.342 0.77 -0.616 ...
##  $ X11: num  1.044 -0.474 0.954 -0.288 -1.081 ...
##  $ X12: num  0.3945 0.749 -0.3145 0.1333 -0.0906 ...
##  $ X13: num  -0.936 1.804 -1.102 -0.313 2.779 ...
##  $ X14: num  -1.184 0.268 -1.144 -0.643 1.45 ...
##  $ X15: num  -0.295 -0.469 0.1 0.816 -0.549 ...
##  $ X16: num  0.509 -0.467 0.709 1.909 -1.024 ...
##  $ X17: num  0.0608 0.7258 0.2063 0.4972 -0.1054 ...
##  $ X18: num  -0.947 1.914 -0.861 -0.818 2.516 ...
##  $ X19: num  -1 0.137 -0.95 -1.715 1.375 ...
##  $ X20: num  0.4109 -0.4333 0.0501 -0.9528 -0.5704 ...
##  $ X21: num  1.465 -0.354 1.113 1.852 -0.976 ...
##  $ X22: num  0.326 0.536 0.402 0.726 -0.475 ...
##  $ X23: num  -1.02 0.93 -0.921 -1.059 2.88 ...
##  $ X24: num  -1.399 0.0904 -1.1301 -1.6472 1.1867 ...
##  $ X25: num  -0.536 -0.159 -0.368 -0.94 -0.476 ...
##  $ X26: num  1.039 -0.182 1.258 0.28 -0.881 ...
##  $ X27: num  -0.189 0.852 0.518 1.088 -0.248 ...
##  $ X28: num  -1.102 1.217 -0.911 0.101 1.182 ...
##  $ X29: num  -1.416 0.191 -1.217 -1.009 0.906 ...
##  $ X30: num  0.2913 -0.6662 -0.3796 -0.0278 -0.0083 ...
##  $ X31: num  0.758 -0.274 1.26 1.612 -1.033 ...
##  $ X32: num  -0.0652 0.9181 0.1851 0.5426 -0.387 ...
##  $ X33: num  -0.99 1.366 -0.952 -0.669 2.931 ...
##  $ X34: num  -1.135 0.132 -1.126 -1.482 1.137 ...
##  $ X35: num  0.262 -0.576 -0.423 -0.807 -0.442 ...
##  $ X36: num  0.56 -0.336 1.093 1.392 -1.099 ...
##  $ X37: num  -0.0269 0.4386 0.652 1.389 -0.3178 ...
##  $ X38: num  -0.878 0.715 -0.728 -0.765 2.888 ...
##  $ X39: num  -0.9726 -0.0156 -1.0409 -1.4511 1.3418 ...
##  $ X40: num  0.28 0.222 -0.547 -0.931 -0.449 ...
##  $ Mod: Factor w/ 3 levels "1","2","3": 1 2 2 3 3 2 2 1 2 2 ...

Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.

# create a list of 80% of the rows in the original dataset we can use for training
# (despite its name, validation_index holds the training rows)
validation_index <- createDataPartition(data$Mod, p=0.80, list=FALSE)
# select the remaining 20% of the data for validation
validation <- data[-validation_index,]
# use the 80% of the data for training and testing the models
data <- data[validation_index,]

You now have training data in the data variable and a validation set we will use later in the validation variable.

Note that we replaced our data variable with the 80% sample of the dataset. This keeps the rest of the code simpler and more readable.
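
To make the idea of a class-balanced 80/20 split concrete, here is a rough base R sketch of the same kind of split on a toy data frame. It is not caret's implementation of createDataPartition; the toy data, the variable names and the rounding rule are assumptions made for illustration.

```r
# Toy stand-in for the EEG data: 100 rows, 3 imbalanced classes (hypothetical values).
set.seed(1)
toy <- data.frame(
  X1  = rnorm(100),
  Mod = factor(rep(c("1", "2", "3"), times = c(20, 60, 20)))
)

# Group the row numbers by class, then sample 80% of the rows within each class.
rows_by_class <- split(seq_len(nrow(toy)), toy$Mod)
train_rows <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train      <- toy[train_rows, ]
toy_validation <- toy[-train_rows, ]

# Both parts keep the class proportions of the full toy data.
print(prop.table(table(toy_train$Mod)))
print(prop.table(table(toy_validation$Mod)))
```

Sampling within each class keeps rare classes represented in both the training and the validation part, which matters when the classes are imbalanced.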

Summarize Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Types of the attributes.
  3. Peek at the data itself.
  4. Levels of the class attribute.
  5. Breakdown of the instances in each class.
  6. Statistical summary of all attributes.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.

# dimensions of dataset
dim(data)
## [1] 310  41

You should see 310 instances and 41 attributes (40 numeric inputs plus the class attribute Mod).

Types of Attributes

It is a good idea to get an idea of the types of the attributes. They could be doubles, integers, strings, factors and other types.

Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.

# list types for each attribute
sapply(data, class)
##        X1        X2        X3        X4        X5        X6        X7 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##        X8        X9       X10       X11       X12       X13       X14 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##       X15       X16       X17       X18       X19       X20       X21 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##       X22       X23       X24       X25       X26       X27       X28 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##       X29       X30       X31       X32       X33       X34       X35 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##       X36       X37       X38       X39       X40       Mod 
## "numeric" "numeric" "numeric" "numeric" "numeric"  "factor"

You should see that all of the inputs are numeric (doubles) and that the class value is a factor.

Peek at the Data

It is also always a good idea to actually eyeball your data. You should see the first 6 rows of the data:

# take a peek at the first 6 rows of the data
head(data)
##           X1        X2         X3          X4         X5          X6
## 1  0.7673868 0.1637270 -1.0395744 -1.34824985  0.2412754  0.73405244
## 2 -0.1290103 1.0462421  1.3750280 -0.06642437 -0.5428035 -0.13148475
## 3  1.2087945 0.5559559 -0.9678536 -1.36057509 -0.3613038  1.34830465
## 4 -0.2037101 0.4088700 -0.4179936 -0.95384201  0.7857747 -0.12450461
## 5 -1.1340617 0.0166411  2.4269340  1.56050796 -0.4556836 -0.65499516
## 6 -0.1493830 0.9481848  0.4426568 -0.22665256 -0.1217241 -0.09658406
##          X7         X8         X9         X10        X11         X12
## 1 0.1890194 -1.1344082 -1.3561874  0.35889924  1.0439487  0.39448998
## 2 1.0088692  1.6097344 -0.1371429 -0.59281174 -0.4739625  0.74898621
## 3 0.6113663 -1.1105461 -1.5130941 -0.34156004  0.9539984 -0.31450248
## 4 0.5368345 -0.4901312 -1.0182345  0.77003838 -0.2884401  0.13328223
## 5 0.1393315  2.1108387  0.9732739 -0.61565280 -1.0811270 -0.09061013
## 6 1.0088692  0.1302836 -0.3543983 -0.02939884 -0.4064998  2.22294422
##          X13        X14        X15        X16        X17        X18
## 1 -0.9360329 -1.1837148 -0.2948808  0.5089845  0.0608205 -0.9469114
## 2  1.8037381  0.2682155 -0.4689169 -0.4666821  0.7257509  1.9139039
## 3 -1.1020796 -1.1436616  0.1000471  0.7089658  0.2062740 -0.8608718
## 4 -0.3133576 -0.6429960  0.8162724  1.9088539  0.4971811 -0.8178521
## 5  2.7792626  1.4497863 -0.5492412 -1.0242059 -0.1054121  2.5161808
## 6  0.2678059 -0.2725034 -0.2212502 -0.2000403  3.3439143 -0.1940653
##          X19         X20        X21        X22        X23         X24
## 1 -1.0004456  0.41091876  1.4648076  0.3260878 -1.0195392 -1.39897092
## 2  0.1370406 -0.43330283 -0.3542093  0.5357951  0.9301182  0.09039257
## 3 -0.9501144  0.05014031  1.1127398  0.4023450 -0.9210717 -1.13005807
## 4 -1.7151493 -0.95282380  1.8520821  0.7264382 -1.0589262 -1.64719817
## 5  1.3751893 -0.57039864 -0.9761957 -0.4746131  2.8797757  1.18672958
## 6 -1.0306444 -0.12303335 -0.2661924  2.5184830 -0.2908794 -0.76806000
##           X25         X26        X27        X28        X29          X30
## 1 -0.53622102  1.03918252 -0.1894085 -1.1024462 -1.4156314  0.291335156
## 2 -0.15912303 -0.18215632  0.8519556  1.2165038  0.1913446 -0.666210057
## 3 -0.36787370  1.25792977  0.5179332 -0.9106534 -1.2172393 -0.379597885
## 4 -0.94025458  0.27964344  1.0877362  0.1006181 -1.0089276 -0.027846582
## 5 -0.47561599 -0.88093227 -0.2483536  1.1816324  0.9055561 -0.008304843
## 6  0.01595818 -0.02417219  2.0308583 -0.1260462 -0.4335905 -0.314458754
##          X31         X32        X33       X34         X35        X36
## 1  0.7580038 -0.06520403 -0.9895163 -1.135140  0.26232704  0.5597571
## 2 -0.2743233  0.91807624  1.3662893  0.132103 -0.57633753 -0.3361657
## 3  1.2602170  0.18508549 -0.9518234 -1.125753 -0.42268906  1.0930445
## 4  1.6117663  0.54264196 -0.6691267 -1.482459 -0.80681024  1.3916855
## 5 -1.0332232 -0.38700485  2.9305441  1.136511 -0.44189512 -1.0987667
## 6 -0.1292395  2.24103516 -0.2733514 -0.684565 -0.09618606 -0.3095013
##           X37        X38        X39        X40 Mod
## 1 -0.02685066 -0.8778608 -0.9725822  0.2802739   1
## 2  0.43862368  0.7148718 -0.0155864  0.2216551   2
## 3  0.65196608 -0.7279565 -1.0409390 -0.5469034   2
## 4  1.38896711 -0.7654326 -1.4510801 -0.9311827   3
## 5 -0.31777212  2.8884834  1.3417851 -0.4492053   3
## 6  2.88236393 -0.2969818 -0.8358685  0.0783645   2

Levels of the Class

The class variable is a factor. A factor is a class that has multiple class labels or levels. Let’s look at the levels:

# list the levels for the class
data$Mod<-as.factor(data$Mod)
levels(data$Mod)
## [1] "1" "2" "3"

Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 3 different labels.

With three levels, this is a multi-class (multinomial) classification problem. If there were two levels, it would be a binary classification problem.

Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.

# summarize the class distribution
percentage <- prop.table(table(data$Mod)) * 100
cbind(freq=table(data$Mod), percentage=percentage)
##   freq percentage
## 1   52   16.77419
## 2  197   63.54839
## 3   61   19.67742
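
An imbalanced class distribution like this sets a baseline that any model has to beat. The short sketch below computes the accuracy of always predicting the most frequent class, using the class counts reported for Mod in the Statistical Summary output of this tutorial (52, 197 and 61). The variable names are illustrative.

```r
# Class counts of the training data, as reported for Mod by summary(data).
class_counts <- c("1" = 52, "2" = 197, "3" = 61)

# Accuracy of a "model" that always predicts the most frequent class.
majority_class    <- names(which.max(class_counts))
baseline_accuracy <- max(class_counts) / sum(class_counts)

cat("Majority class:", majority_class, "\n")
cat("Baseline accuracy:", round(baseline_accuracy, 4), "\n")
```

A classifier whose accuracy is close to this baseline has learned little beyond the class frequencies.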

Statistical Summary

Now finally, we can take a look at a summary of each attribute.

This includes the mean, the min and max values as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).

# summarize attribute distributions
summary(data)
##        X1                  X2                 X3          
##  Min.   :-2.071204   Min.   :-2.94959   Min.   :-2.04367  
##  1st Qu.:-0.543254   1st Qu.:-0.62073   1st Qu.:-0.62718  
##  Median : 0.003656   Median :-0.03239   Median :-0.23869  
##  Mean   : 0.031587   Mean   : 0.02529   Mean   :-0.01006  
##  3rd Qu.: 0.699478   3rd Qu.: 0.60498   3rd Qu.: 0.48449  
##  Max.   : 2.437945   Max.   : 5.70396   Max.   : 5.00889  
##        X4                 X5                 X6          
##  Min.   :-2.17404   Min.   :-1.18894   Min.   :-2.03706  
##  1st Qu.:-0.61490   1st Qu.:-0.63050   1st Qu.:-0.54157  
##  Median :-0.11572   Median :-0.33589   Median : 0.03953  
##  Mean   : 0.03188   Mean   :-0.05818   Mean   : 0.06197  
##  3rd Qu.: 0.59606   3rd Qu.: 0.29935   3rd Qu.: 0.67647  
##  Max.   : 3.35999   Max.   : 4.98931   Max.   : 2.58379  
##        X7                 X8                 X9           
##  Min.   :-2.69288   Min.   :-1.92186   Min.   :-2.189000  
##  1st Qu.:-0.63083   1st Qu.:-0.70489   1st Qu.:-0.608422  
##  Median :-0.06757   Median :-0.25151   Median :-0.081274  
##  Mean   : 0.01630   Mean   :-0.04548   Mean   :-0.007595  
##  3rd Qu.: 0.61137   3rd Qu.: 0.49650   3rd Qu.: 0.611181  
##  Max.   : 7.86580   Max.   : 4.64022   Max.   : 3.230317  
##       X10               X11                X12          
##  Min.   :-1.2476   Min.   :-1.75013   Min.   :-2.47879  
##  1st Qu.:-0.6233   1st Qu.:-0.71430   1st Qu.:-0.58970  
##  Median :-0.3149   Median :-0.09857   Median :-0.09061  
##  Mean   :-0.0540   Mean   : 0.03544   Mean   : 0.03052  
##  3rd Qu.: 0.3437   3rd Qu.: 0.92870   3rd Qu.: 0.48778  
##  Max.   : 4.9043   Max.   : 2.59202   Max.   : 7.14858  
##       X13                 X14                X15          
##  Min.   :-2.160627   Min.   :-1.93471   Min.   :-1.26174  
##  1st Qu.:-0.723285   1st Qu.:-0.76868   1st Qu.:-0.70320  
##  Median :-0.219956   Median :-0.05722   Median :-0.31161  
##  Mean   : 0.000597   Mean   : 0.01539   Mean   :-0.06462  
##  3rd Qu.: 0.407908   3rd Qu.: 0.77889   3rd Qu.: 0.37449  
##  Max.   : 4.937870   Max.   : 2.37101   Max.   : 4.88604  
##       X16                X17                 X18          
##  Min.   :-1.84837   Min.   :-2.515785   Min.   :-2.40098  
##  1st Qu.:-0.72575   1st Qu.:-0.604110   1st Qu.:-0.71030  
##  Median :-0.07581   Median :-0.048461   Median :-0.21557  
##  Mean   : 0.04280   Mean   : 0.001166   Mean   :-0.01214  
##  3rd Qu.: 0.82714   3rd Qu.: 0.453502   3rd Qu.: 0.40821  
##  Max.   : 2.43608   Max.   : 6.855578   Max.   : 3.76375  
##       X19                X20               X21          
##  Min.   :-2.21142   Min.   :-1.2847   Min.   :-1.76835  
##  1st Qu.:-0.76892   1st Qu.:-0.6717   1st Qu.:-0.65640  
##  Median : 0.03794   Median :-0.2998   Median :-0.18111  
##  Mean   : 0.02299   Mean   :-0.0679   Mean   : 0.01092  
##  3rd Qu.: 0.78128   3rd Qu.: 0.3045   3rd Qu.: 0.86629  
##  Max.   : 2.94552   Max.   : 4.8774   Max.   : 2.34498  
##       X22                 X23                X24          
##  Min.   :-2.571687   Min.   :-2.22872   Min.   :-2.09194  
##  1st Qu.:-0.598531   1st Qu.:-0.70444   1st Qu.:-0.68949  
##  Median :-0.169584   Median :-0.25149   Median : 0.06143  
##  Mean   :-0.002683   Mean   :-0.01418   Mean   : 0.05580  
##  3rd Qu.: 0.535795   3rd Qu.: 0.40471   3rd Qu.: 0.77043  
##  Max.   : 7.322688   Max.   : 3.43119   Max.   : 2.30375  
##       X25                X26                X27          
##  Min.   :-1.27695   Min.   :-1.71946   Min.   :-2.66510  
##  1st Qu.:-0.65183   1st Qu.:-0.68297   1st Qu.:-0.54308  
##  Median :-0.26013   Median :-0.06367   Median :-0.11081  
##  Mean   :-0.04422   Mean   : 0.04235   Mean   : 0.01543  
##  3rd Qu.: 0.27725   3rd Qu.: 0.75360   3rd Qu.: 0.48846  
##  Max.   : 5.02597   Max.   : 2.67371   Max.   : 6.45174  
##       X28                 X29                X30          
##  Min.   :-2.061411   Min.   :-2.12984   Min.   :-1.24595  
##  1st Qu.:-0.666553   1st Qu.:-0.70742   1st Qu.:-0.62550  
##  Median :-0.248096   Median : 0.08223   Median :-0.29554  
##  Mean   :-0.006322   Mean   : 0.03340   Mean   :-0.07168  
##  3rd Qu.: 0.514720   3rd Qu.: 0.79148   3rd Qu.: 0.25225  
##  Max.   : 3.518018   Max.   : 2.64149   Max.   : 4.86410  
##       X31                 X32                 X33          
##  Min.   :-1.714001   Min.   :-2.478710   Min.   :-2.10146  
##  1st Qu.:-0.751426   1st Qu.:-0.597069   1st Qu.:-0.65028  
##  Median :-0.132029   Median :-0.184709   Median :-0.23566  
##  Mean   :-0.003986   Mean   : 0.001085   Mean   : 0.00381  
##  3rd Qu.: 0.776139   3rd Qu.: 0.501314   3rd Qu.: 0.40512  
##  Max.   : 2.030277   Max.   : 6.585346   Max.   : 4.24979  
##       X34                X35                X36          
##  Min.   :-1.93303   Min.   :-1.21654   Min.   :-1.70138  
##  1st Qu.:-0.68222   1st Qu.:-0.61315   1st Qu.:-0.76013  
##  Median : 0.08001   Median :-0.21782   Median :-0.20284  
##  Mean   : 0.06604   Mean   :-0.04227   Mean   :-0.03121  
##  3rd Qu.: 0.87367   3rd Qu.: 0.20044   3rd Qu.: 0.70241  
##  Max.   : 3.27674   Max.   : 4.88459   Max.   : 2.08496  
##       X37                X38                X39          
##  Min.   :-2.64514   Min.   :-2.02088   Min.   :-1.90028  
##  1st Qu.:-0.60869   1st Qu.:-0.64832   1st Qu.:-0.70892  
##  Median :-0.08503   Median :-0.20329   Median : 0.04422  
##  Mean   : 0.02019   Mean   : 0.00424   Mean   : 0.08639  
##  3rd Qu.: 0.55499   3rd Qu.: 0.43107   3rd Qu.: 0.84376  
##  Max.   : 7.11042   Max.   : 4.31257   Max.   : 2.69916  
##       X40           Mod    
##  Min.   :-1.20474   1: 52  
##  1st Qu.:-0.65111   2:197  
##  Median :-0.22776   3: 61  
##  Mean   :-0.02732          
##  3rd Qu.: 0.29004          
##  Max.   : 4.91768

Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Set up the test harness to use 10-fold cross validation.
  2. Build 5 different models to predict the class (Mod) from the EEG features.
  3. Select the best model.

Test Harness

We will use 10-fold cross-validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. Note that the control object below runs this process once; repeating it several times (for example 3 times) with different splits of the data into 10 groups would give a more stable estimate, but is not done here.

# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
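
As a rough illustration of what the splitting involves, the base R sketch below assigns 310 rows to 10 folds and redraws that assignment 3 times. It is a simplified stand-in, not caret's code: caret additionally balances the classes across folds, and in caret the repetition would be requested with method="repeatedcv" and repeats=3 in trainControl. The variable names are assumptions.

```r
n       <- 310  # number of training rows in this tutorial
k       <- 10   # number of folds
repeats <- 3    # number of times the split is redrawn

set.seed(7)
# One column per repeat; each column assigns every row to one of the k folds.
fold_ids <- sapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# In repeat 1, fold 1 is the test set and the other nine folds form the training set.
test_rows  <- which(fold_ids[, 1] == 1)
train_rows <- which(fold_ids[, 1] != 1)

cat("Test rows:", length(test_rows), " Training rows:", length(train_rows), "\n")
print(table(fold_ids[, 1]))  # fold sizes within the first repeat
```

Each row lands in exactly one test fold per repeat, so every observation is used for testing once and for training nine times.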

We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset. caret reports it as a proportion between 0 and 1 (e.g. 0.95 means 95% accurate). We will use the metric variable when we build and evaluate each model next.

Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We have not plotted the data in this walkthrough, so we have no prior indication of how separable the classes are and will let the results speak for themselves.

Let’s evaluate 5 different algorithms:

  • Linear Discriminant Analysis (LDA)
  • Classification and Regression Trees (CART).
  • k-Nearest Neighbors (kNN).
  • Support Vector Machines (SVM) with a radial basis function (RBF) kernel.
  • Random Forest (RF)

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
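
The effect of resetting the seed can be checked in isolation. This small base R sketch, with illustrative variable names, shows that two random draws made right after the same set.seed call are identical, which is what keeps the resampling splits the same from one model to the next.

```r
# Two draws made right after the same seed give the same permutation.
set.seed(7)
split_a <- sample(10)
set.seed(7)
split_b <- sample(10)
print(identical(split_a, split_b))

# A further draw without resetting the seed continues the random stream
# and will generally give a different permutation.
split_c <- sample(10)
print(split_c)
```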

Let’s build our five models:

control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
# Models
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Mod~., data=data, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Mod~., data=data, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Mod~., data=data, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Mod~., data=data, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Mod~., data=data, method="rf", metric=metric, trControl=control)

Caret does support the configuration and tuning of the configuration of each model, but we are not going to cover that in this tutorial.
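
For readers who want a starting point anyway, tuning in caret is usually driven by a grid of candidate parameter values. The sketch below only builds such a grid for the two svmRadial parameters that appear in the model printout in this tutorial (sigma and C). The candidate values are arbitrary illustrations, and the train call is left commented out because it needs the objects created in this tutorial.

```r
# Candidate values for the two svmRadial tuning parameters (illustrative choices only).
svm_grid <- expand.grid(sigma = c(0.01, 0.03, 0.1),
                        C     = c(0.25, 1, 4, 16))
print(svm_grid)

# Not run here: requires caret, kernlab and the data, metric and control
# objects defined earlier in this tutorial.
# set.seed(7)
# fit.svm.tuned <- train(Mod~., data=data, method="svmRadial", metric=metric,
#                        trControl=control, tuneGrid=svm_grid)
```

Every row of the grid is one parameter combination that would be evaluated with the same cross-validation scheme.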

Select Best Model

We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

We can report on the accuracy of each model by first creating a list of the created models and using the summary function.

We can see the accuracy of each classifier and also other metrics like Kappa:

results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, cart, knn, svm, rf 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda  0.4838710 0.5519153 0.5806452 0.5678629 0.5806452 0.6774194    0
## cart 0.4516129 0.5734375 0.6290323 0.6031384 0.6451613 0.6774194    0
## knn  0.4516129 0.5161290 0.5903226 0.5680645 0.6129032 0.6451613    0
## svm  0.6129032 0.6270833 0.6451613 0.6355108 0.6451613 0.6451613    0
## rf   0.5483871 0.5806452 0.5967742 0.6030376 0.6401210 0.6451613    0
## 
## Kappa 
##             Min.     1st Qu.      Median         Mean    3rd Qu.
## lda  -0.13501144 -0.06220033 -0.01003141  0.006588438 0.05536641
## cart -0.09550562 -0.01480263  0.00000000  0.019184614 0.04090909
## knn  -0.17111111 -0.15626386 -0.04459459 -0.057369709 0.01829268
## svm   0.00000000  0.00000000  0.00000000  0.000000000 0.00000000
## rf   -0.09550562 -0.07981103 -0.02337662 -0.032321107 0.00000000
##            Max. NA's
## lda  0.16216216    0
## cart 0.21914358    0
## knn  0.09708738    0
## svm  0.00000000    0
## rf   0.04952830    0

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).

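We can see that the model with the highest mean accuracy in this case was the SVM, although its Kappa of 0 in every fold means its predictions agree with the true classes no more than chance would: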

# compare accuracy of models
dotplot(results)
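
The same comparison can be made numerically. The sketch below copies the mean accuracy and mean Kappa values from the summary(results) output above into a small data frame and ranks the models by mean accuracy. The data frame and its column names are illustrative.

```r
# Mean cross-validated accuracy and Kappa, copied from the summary(results) output.
cv_means <- data.frame(
  model    = c("lda", "cart", "knn", "svm", "rf"),
  accuracy = c(0.5678629, 0.6031384, 0.5680645, 0.6355108, 0.6030376),
  kappa    = c(0.006588438, 0.019184614, -0.057369709, 0.000000000, -0.032321107)
)

# Rank the models from highest to lowest mean accuracy.
ranked <- cv_means[order(-cv_means$accuracy), ]
print(ranked, row.names = FALSE)
```

All mean Kappa values sit close to zero, which suggests that none of the five models does much better than chance agreement on this data.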

The results for just the SVM model can be summarized.

# summarize Best Model
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 310 samples
##  40 predictor
##   3 classes: '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 278, 279, 280, 279, 279, 279, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa
##   0.25  0.6355108  0    
##   0.50  0.6355108  0    
##   1.00  0.6355108  0    
## 
## Tuning parameter 'sigma' was held constant at a value of 0.0269754
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.0269754 and C = 0.25.

This gives a nice summary of what was used to train the model and the mean accuracy achieved for each value of C, specifically 63.6% accuracy with a Kappa of 0 for all three values.

Make Predictions

The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the SVM model directly on the validation set and summarize the results in a confusion matrix.

# estimate skill of SVM on the validation dataset
predictions <- predict(fit.svm, validation)
confusionMatrix(predictions, validation$Mod)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  1  2  3
##          1  0  0  0
##          2 13 49 15
##          3  0  0  0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6364         
##                  95% CI : (0.5188, 0.743)
##     No Information Rate : 0.6364         
##     P-Value [Acc > NIR] : 0.5513         
##                                          
##                   Kappa : 0              
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.0000   1.0000   0.0000
## Specificity            1.0000   0.0000   1.0000
## Pos Pred Value            NaN   0.6364      NaN
## Neg Pred Value         0.8312      NaN   0.8052
## Prevalence             0.1688   0.6364   0.1948
## Detection Rate         0.0000   0.6364   0.0000
## Detection Prevalence   0.0000   1.0000   0.0000
## Balanced Accuracy      0.5000   0.5000   0.5000

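We can see that the accuracy is 63.64%. However, the model predicted class 2 for all 77 validation rows, so this accuracy equals the No Information Rate (the proportion of the most frequent class) and the Kappa is 0. It was also a small validation dataset (20%).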
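
To see where these statistics come from, the base R sketch below recomputes accuracy, the No Information Rate, Cohen's Kappa and the per-class sensitivity directly from the confusion matrix printed above. The variable names are illustrative, and the formulas are the standard textbook definitions rather than caret's source code.

```r
# Confusion matrix from the output above: rows are predictions, columns are reference.
cm <- matrix(c( 0,  0,  0,
               13, 49, 15,
                0,  0,  0),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("1", "2", "3"),
                             Reference  = c("1", "2", "3")))

n        <- sum(cm)                     # number of validation rows
accuracy <- sum(diag(cm)) / n           # share of correct predictions
nir      <- max(colSums(cm)) / n        # share of the most frequent true class

# Agreement expected by chance, from the row and column totals.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions divided by true instances of that class.
sensitivity <- diag(cm) / colSums(cm)

cat("Accuracy:", round(accuracy, 4), " NIR:", round(nir, 4),
    " Kappa:", round(cohen_kappa, 4), "\n")
print(sensitivity)
```

Because every prediction falls in the row for class 2, the observed agreement equals the chance agreement, which is why Kappa comes out as 0.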

Summary

In this post you discovered step-by-step how to complete an EEG machine learning project in R.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.