Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
The process of a machine learning project may not be linear, but there are a number of well-known steps.
For more information on the steps in a machine learning project see this checklist and more on the process.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing the data, evaluating algorithms and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and tasks to improve results, later, once you have more confidence.
The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).
This is a good project because it is so well understood.
Let’s get started with your hello world machine learning project in R.
Instead of the iris flowers, we are going to use our EEG dataset, which was collected by our EEG specialist.
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
# EEG Analysis
#Load Data
setwd("~/")
data<-read.csv("PUPR Neuro EEG clean.csv", sep=",", header=TRUE)
data<-na.omit(data) # drop rows with missing values
data$Dom<-as.factor(data$Dom) # convert the class attribute to a factor
str(data)
## 'data.frame': 387 obs. of 43 variables:
## $ No : Factor w/ 387 levels "X1","X10","X100",..: 1 112 223 322 333 344 355 366 377 2 ...
## $ X1 : num 0.767 -0.129 1.209 -0.204 -1.134 ...
## $ X2 : num 0.1637 1.0462 0.556 0.4089 0.0166 ...
## $ X3 : num -1.04 1.375 -0.968 -0.418 2.427 ...
## $ X4 : num -1.3482 -0.0664 -1.3606 -0.9538 1.5605 ...
## $ X5 : num 0.241 -0.543 -0.361 0.786 -0.456 ...
## $ X6 : num 0.734 -0.131 1.348 -0.125 -0.655 ...
## $ X7 : num 0.189 1.009 0.611 0.537 0.139 ...
## $ X8 : num -1.13 1.61 -1.11 -0.49 2.11 ...
## $ X9 : num -1.356 -0.137 -1.513 -1.018 0.973 ...
## $ X10: num 0.359 -0.593 -0.342 0.77 -0.616 ...
## $ X11: num 1.044 -0.474 0.954 -0.288 -1.081 ...
## $ X12: num 0.3945 0.749 -0.3145 0.1333 -0.0906 ...
## $ X13: num -0.936 1.804 -1.102 -0.313 2.779 ...
## $ X14: num -1.184 0.268 -1.144 -0.643 1.45 ...
## $ X15: num -0.295 -0.469 0.1 0.816 -0.549 ...
## $ X16: num 0.509 -0.467 0.709 1.909 -1.024 ...
## $ X17: num 0.0608 0.7258 0.2063 0.4972 -0.1054 ...
## $ X18: num -0.947 1.914 -0.861 -0.818 2.516 ...
## $ X19: num -1 0.137 -0.95 -1.715 1.375 ...
## $ X20: num 0.4109 -0.4333 0.0501 -0.9528 -0.5704 ...
## $ X21: num 1.465 -0.354 1.113 1.852 -0.976 ...
## $ X22: num 0.326 0.536 0.402 0.726 -0.475 ...
## $ X23: num -1.02 0.93 -0.921 -1.059 2.88 ...
## $ X24: num -1.399 0.0904 -1.1301 -1.6472 1.1867 ...
## $ X25: num -0.536 -0.159 -0.368 -0.94 -0.476 ...
## $ X26: num 1.039 -0.182 1.258 0.28 -0.881 ...
## $ X27: num -0.189 0.852 0.518 1.088 -0.248 ...
## $ X28: num -1.102 1.217 -0.911 0.101 1.182 ...
## $ X29: num -1.416 0.191 -1.217 -1.009 0.906 ...
## $ X30: num 0.2913 -0.6662 -0.3796 -0.0278 -0.0083 ...
## $ X31: num 0.758 -0.274 1.26 1.612 -1.033 ...
## $ X32: num -0.0652 0.9181 0.1851 0.5426 -0.387 ...
## $ X33: num -0.99 1.366 -0.952 -0.669 2.931 ...
## $ X34: num -1.135 0.132 -1.126 -1.482 1.137 ...
## $ X35: num 0.262 -0.576 -0.423 -0.807 -0.442 ...
## $ X36: num 0.56 -0.336 1.093 1.392 -1.099 ...
## $ X37: num -0.0269 0.4386 0.652 1.389 -0.3178 ...
## $ X38: num -0.878 0.715 -0.728 -0.765 2.888 ...
## $ X39: num -0.9726 -0.0156 -1.0409 -1.4511 1.3418 ...
## $ X40: num 0.28 0.222 -0.547 -0.931 -0.449 ...
## $ Dom: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Mod: Factor w/ 4 levels "1","2","3","FALSE": 1 2 2 3 3 2 2 1 2 2 ...
We need to know whether the models we create are any good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data. That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(data$Dom, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data[-validation_index,]
# use the remaining 80% of data to training and testing the models
data <- data[validation_index,]
You now have training data in the data variable and a validation set we will use later in the validation variable.
Note that we replaced our data variable with the 80% sample of the dataset. This was an attempt to keep the rest of the code simpler and more readable.
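If you would rather not overwrite data, a hypothetical alternative (not used in the rest of this tutorial) is to keep the training rows in a separately named variable:
# equivalent split that leaves the full data object intact (hypothetical variable name)
training <- data[validation_index,]
validation <- data[-validation_index,]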
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways: the dimensions of the dataset, the types of the attributes, a peek at the data itself, the levels and distribution of the class attribute, and a statistical summary of each attribute.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
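For reference, here are the one-line commands this step uses, assuming your training data is in the data variable:
dim(data) # dimensions of the dataset
sapply(data, class) # type of each attribute
head(data) # peek at the first rows
levels(data$Dom) # levels of the class attribute
prop.table(table(data$Dom)) # class distribution as proportions
summary(data) # statistical summary of each attribute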
Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the dim function.
# dimensions of dataset
dim(data)
## [1] 310 43
You should see 310 instances and 43 attributes.
It is a good idea to know the types of the attributes. They could be doubles, integers, strings, factors and other types.
Knowing the types is important as it will give you an idea of how to better summarize the data you have and the types of transforms you might need to use to prepare the data before you model it.
# list types for each attribute
sapply(data, class)
## No X1 X2 X3 X4 X5 X6
## "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X7 X8 X9 X10 X11 X12 X13
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X14 X15 X16 X17 X18 X19 X20
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X21 X22 X23 X24 X25 X26 X27
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X28 X29 X30 X31 X32 X33 X34
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## X35 X36 X37 X38 X39 X40 Dom
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "factor"
## Mod
## "factor"
You should see that all of the EEG measurements (X1 to X40) are numeric and that the identifier No, the class Dom and the Mod attribute are factors.
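If the attributes were on very different scales, caret's preProcess function could be used to standardize them before modeling. A minimal sketch, not needed here since the EEG features already appear to be standardized:
# center and scale the numeric predictors (columns 2 to 41 are X1-X40 at this point)
preproc <- preProcess(data[,2:41], method=c("center", "scale"))
scaled <- predict(preproc, data[,2:41])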
It is also always a good idea to actually eyeball your data. You should see the first 6 rows of the data:
# take a peek at the first 6 rows of the data
head(data)
## No X1 X2 X3 X4 X5 X6
## 1 X1 0.7673868 0.1637270 -1.0395744 -1.34824985 0.2412754 0.73405244
## 2 X2 -0.1290103 1.0462421 1.3750280 -0.06642437 -0.5428035 -0.13148475
## 3 X3 1.2087945 0.5559559 -0.9678536 -1.36057509 -0.3613038 1.34830465
## 4 X4 -0.2037101 0.4088700 -0.4179936 -0.95384201 0.7857747 -0.12450461
## 5 X5 -1.1340617 0.0166411 2.4269340 1.56050796 -0.4556836 -0.65499516
## 6 X6 -0.1493830 0.9481848 0.4426568 -0.22665256 -0.1217241 -0.09658406
## X7 X8 X9 X10 X11 X12
## 1 0.1890194 -1.1344082 -1.3561874 0.35889924 1.0439487 0.39448998
## 2 1.0088692 1.6097344 -0.1371429 -0.59281174 -0.4739625 0.74898621
## 3 0.6113663 -1.1105461 -1.5130941 -0.34156004 0.9539984 -0.31450248
## 4 0.5368345 -0.4901312 -1.0182345 0.77003838 -0.2884401 0.13328223
## 5 0.1393315 2.1108387 0.9732739 -0.61565280 -1.0811270 -0.09061013
## 6 1.0088692 0.1302836 -0.3543983 -0.02939884 -0.4064998 2.22294422
## X13 X14 X15 X16 X17 X18
## 1 -0.9360329 -1.1837148 -0.2948808 0.5089845 0.0608205 -0.9469114
## 2 1.8037381 0.2682155 -0.4689169 -0.4666821 0.7257509 1.9139039
## 3 -1.1020796 -1.1436616 0.1000471 0.7089658 0.2062740 -0.8608718
## 4 -0.3133576 -0.6429960 0.8162724 1.9088539 0.4971811 -0.8178521
## 5 2.7792626 1.4497863 -0.5492412 -1.0242059 -0.1054121 2.5161808
## 6 0.2678059 -0.2725034 -0.2212502 -0.2000403 3.3439143 -0.1940653
## X19 X20 X21 X22 X23 X24
## 1 -1.0004456 0.41091876 1.4648076 0.3260878 -1.0195392 -1.39897092
## 2 0.1370406 -0.43330283 -0.3542093 0.5357951 0.9301182 0.09039257
## 3 -0.9501144 0.05014031 1.1127398 0.4023450 -0.9210717 -1.13005807
## 4 -1.7151493 -0.95282380 1.8520821 0.7264382 -1.0589262 -1.64719817
## 5 1.3751893 -0.57039864 -0.9761957 -0.4746131 2.8797757 1.18672958
## 6 -1.0306444 -0.12303335 -0.2661924 2.5184830 -0.2908794 -0.76806000
## X25 X26 X27 X28 X29 X30
## 1 -0.53622102 1.03918252 -0.1894085 -1.1024462 -1.4156314 0.291335156
## 2 -0.15912303 -0.18215632 0.8519556 1.2165038 0.1913446 -0.666210057
## 3 -0.36787370 1.25792977 0.5179332 -0.9106534 -1.2172393 -0.379597885
## 4 -0.94025458 0.27964344 1.0877362 0.1006181 -1.0089276 -0.027846582
## 5 -0.47561599 -0.88093227 -0.2483536 1.1816324 0.9055561 -0.008304843
## 6 0.01595818 -0.02417219 2.0308583 -0.1260462 -0.4335905 -0.314458754
## X31 X32 X33 X34 X35 X36
## 1 0.7580038 -0.06520403 -0.9895163 -1.135140 0.26232704 0.5597571
## 2 -0.2743233 0.91807624 1.3662893 0.132103 -0.57633753 -0.3361657
## 3 1.2602170 0.18508549 -0.9518234 -1.125753 -0.42268906 1.0930445
## 4 1.6117663 0.54264196 -0.6691267 -1.482459 -0.80681024 1.3916855
## 5 -1.0332232 -0.38700485 2.9305441 1.136511 -0.44189512 -1.0987667
## 6 -0.1292395 2.24103516 -0.2733514 -0.684565 -0.09618606 -0.3095013
## X37 X38 X39 X40 Dom Mod
## 1 -0.02685066 -0.8778608 -0.9725822 0.2802739 1 1
## 2 0.43862368 0.7148718 -0.0155864 0.2216551 1 2
## 3 0.65196608 -0.7279565 -1.0409390 -0.5469034 1 2
## 4 1.38896711 -0.7654326 -1.4510801 -0.9311827 1 3
## 5 -0.31777212 2.8884834 1.3417851 -0.4492053 1 3
## 6 2.88236393 -0.2969818 -0.8358685 0.0783645 1 2
The class variable is a factor. A factor is a variable that has a fixed set of class labels or levels. Let’s look at the levels:
# list the levels for the class
data$Dom<-as.factor(data$Dom)
levels(data$Dom)
## [1] "0" "1"
Notice above how we can refer to an attribute by name as a property of the dataset. In the results we can see that the class has 2 different labels.
Because there are two levels, this is a binary classification problem.
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count and as a percentage.
# summarize the class distribution
percentage <- prop.table(table(data$Dom)) * 100
cbind(freq=table(data$Dom), percentage=percentage)
## freq percentage
## 0 58 18.70968
## 1 252 81.29032
Now finally, we can take a look at a summary of each attribute.
This includes the mean, the min and max values as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).
# summarize attribute distributions
data<-data[,2:42] # drop the ID column (No) and the Mod column, keeping X1-X40 and Dom
summary(data)
## X1 X2 X3
## Min. :-2.07800 Min. :-3.194733 Min. :-2.52181
## 1st Qu.:-0.59758 1st Qu.:-0.620731 1st Qu.:-0.56144
## Median :-0.02375 Median :-0.032388 Median :-0.17892
## Mean :-0.02878 Mean :-0.001973 Mean : 0.04358
## 3rd Qu.: 0.56026 3rd Qu.: 0.580470 3rd Qu.: 0.53829
## Max. : 2.43795 Max. : 2.885645 Max. : 5.00889
## X4 X5 X6
## Min. :-2.51915 Min. :-1.24702 Min. :-2.058003
## 1st Qu.:-0.54403 1st Qu.:-0.63050 1st Qu.:-0.600899
## Median :-0.02945 Median :-0.29233 Median : 0.001138
## Mean : 0.04827 Mean :-0.01119 Mean : 0.023104
## 3rd Qu.: 0.57141 3rd Qu.: 0.42640 3rd Qu.: 0.646801
## Max. : 3.35999 Max. : 5.41765 Max. : 2.583789
## X7 X8 X9
## Min. :-3.23944 Min. :-2.613859 Min. :-2.58730
## 1st Qu.:-0.63083 1st Qu.:-0.657166 1st Qu.:-0.63200
## Median :-0.08426 Median :-0.227648 Median :-0.07679
## Mean :-0.03481 Mean :-0.002021 Mean : 0.00353
## 3rd Qu.: 0.55547 3rd Qu.: 0.533010 3rd Qu.: 0.61118
## Max. : 2.47466 Max. : 5.093602 Max. : 3.23032
## X10 X11 X12
## Min. :-1.25520 Min. :-1.75013 Min. :-2.32953
## 1st Qu.:-0.62898 1st Qu.:-0.75927 1st Qu.:-0.61303
## Median :-0.29968 Median :-0.17038 Median :-0.12793
## Mean :-0.01642 Mean :-0.01339 Mean :-0.01827
## 3rd Qu.: 0.37793 3rd Qu.: 0.83453 3rd Qu.: 0.42943
## Max. : 5.70371 Max. : 2.59202 Max. : 3.82751
## X13 X14 X15
## Min. :-1.66249 Min. :-1.80725 Min. :-1.26174
## 1st Qu.:-0.66621 1st Qu.:-0.74814 1st Qu.:-0.66136
## Median :-0.20958 Median : 0.02289 Median :-0.25807
## Mean : 0.05188 Mean : 0.05035 Mean :-0.02915
## 3rd Qu.: 0.45365 3rd Qu.: 0.80813 3rd Qu.: 0.43473
## Max. : 4.93787 Max. : 2.56126 Max. : 4.42417
## X16 X17 X18
## Min. :-1.84231 Min. :-2.45345 Min. :-2.40098
## 1st Qu.:-0.74241 1st Qu.:-0.59891 1st Qu.:-0.68347
## Median :-0.16065 Median :-0.06385 Median :-0.20482
## Mean :-0.01276 Mean :-0.02412 Mean : 0.03067
## 3rd Qu.: 0.80689 3rd Qu.: 0.43484 3rd Qu.: 0.45123
## Max. : 2.15732 Max. : 4.02962 Max. : 4.98982
## X19 X20 X21
## Min. :-2.21142 Min. :-1.26775 Min. :-1.76248
## 1st Qu.:-0.73621 1st Qu.:-0.64255 1st Qu.:-0.68281
## Median : 0.03638 Median :-0.28067 Median :-0.21632
## Mean : 0.04405 Mean :-0.02204 Mean :-0.01175
## 3rd Qu.: 0.79135 3rd Qu.: 0.36221 3rd Qu.: 0.80615
## Max. : 2.94552 Max. : 5.06496 Max. : 2.34498
## X22 X23 X24
## Min. :-2.49543 Min. :-2.22872 Min. :-1.84371
## 1st Qu.:-0.56993 1st Qu.:-0.68547 1st Qu.:-0.68532
## Median :-0.13146 Median :-0.22195 Median : 0.07488
## Mean :-0.01394 Mean : 0.04008 Mean : 0.06433
## 3rd Qu.: 0.49290 3rd Qu.: 0.47224 3rd Qu.: 0.76762
## Max. : 3.96737 Max. : 3.49027 Max. : 2.66575
## X25 X26 X27
## Min. :-1.26348 Min. :-1.71946 Min. :-2.62581
## 1st Qu.:-0.65070 1st Qu.:-0.72599 1st Qu.:-0.56273
## Median :-0.25676 Median :-0.17000 Median :-0.14029
## Mean :-0.03811 Mean :-0.02959 Mean :-0.04705
## 3rd Qu.: 0.29874 3rd Qu.: 0.66839 3rd Qu.: 0.38039
## Max. : 4.54787 Max. : 2.64941 Max. : 3.68132
## X28 X29 X30
## Min. :-2.05792 Min. :-2.09016 Min. :-1.23943
## 1st Qu.:-0.61356 1st Qu.:-0.63198 1st Qu.:-0.62061
## Median :-0.18707 Median : 0.12191 Median :-0.25583
## Mean : 0.05086 Mean : 0.08796 Mean :-0.03085
## 3rd Qu.: 0.57138 3rd Qu.: 0.84108 3rd Qu.: 0.33693
## Max. : 3.51802 Max. : 2.87956 Max. : 4.80548
## X31 X32 X33
## Min. :-1.714001 Min. :-2.4787 Min. :-2.10146
## 1st Qu.:-0.764544 1st Qu.:-0.5479 1st Qu.:-0.68326
## Median :-0.143190 Median :-0.1903 Median :-0.23219
## Mean :-0.006353 Mean :-0.0253 Mean : 0.02755
## 3rd Qu.: 0.811015 3rd Qu.: 0.4443 3rd Qu.: 0.47108
## Max. : 2.337185 Max. : 3.4031 Max. : 4.24979
## X34 X35 X36
## Min. :-1.93303 Min. :-1.21654 Min. :-1.70138
## 1st Qu.:-0.69707 1st Qu.:-0.61315 1st Qu.:-0.80679
## Median : 0.07079 Median :-0.27544 Median :-0.18685
## Mean : 0.06197 Mean :-0.03549 Mean :-0.01292
## 3rd Qu.: 0.87133 3rd Qu.: 0.22872 3rd Qu.: 0.84107
## Max. : 3.27674 Max. : 4.88459 Max. : 1.95164
## X37 X38 X39
## Min. :-2.54817 Min. :-1.77729 Min. :-1.67568
## 1st Qu.:-0.60869 1st Qu.:-0.70565 1st Qu.:-0.82610
## Median :-0.06689 Median :-0.20329 Median : 0.03324
## Mean : 0.01912 Mean : 0.05438 Mean : 0.06798
## 3rd Qu.: 0.55499 3rd Qu.: 0.49001 3rd Qu.: 0.84132
## Max. : 4.00726 Max. : 4.31257 Max. : 2.69916
## X40 Dom
## Min. :-1.19822 0: 58
## 1st Qu.:-0.68368 1:252
## Median :-0.25707
## Mean :-0.05663
## 3rd Qu.: 0.27376
## Max. : 4.59202
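The percentiles reported by summary can also be computed directly with the quantile function, for example for the first attribute:
# 25th, 50th (median) and 75th percentiles of X1
quantile(data$X1, probs=c(0.25, 0.5, 0.75))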
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
1. Set up the test harness to use 10-fold cross-validation.
2. Build 6 different models to predict Dom from the EEG measurements.
3. Select the best model.
We will use 10-fold cross-validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
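If you wanted a more robust estimate, caret can also repeat the cross-validation several times with different splits of the data. A minimal sketch, not used in the rest of this tutorial:
# optional: repeat 10-fold cross validation 3 times (slower)
control_repeated <- trainControl(method="repeatedcv", number=10, repeats=3)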
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model next.
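As an illustration of what the metric computes, accuracy can be calculated by hand from a vector of predictions and a vector of true labels (hypothetical values shown):
# accuracy is the fraction of correct predictions, expressed as a percentage
predicted <- factor(c("1", "1", "0", "1")) # hypothetical predictions
actual <- factor(c("1", "0", "0", "1")) # hypothetical true labels
mean(predicted == actual) * 100 # 75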
We don’t know ahead of time which algorithms will be good on this problem or what configurations to use, so we will try a mixture of methods and compare their accuracy.
Let’s evaluate 6 different algorithms: Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), k-Nearest Neighbors (kNN), Support Vector Machines (SVM) with a radial kernel, Random Forest (RF) and Logistic Regression (LR).
This is a good mixture of simple linear methods (LDA, LR), nonlinear methods (CART, kNN) and more complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
Let’s build our six models:
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
#Model
# a) linear algorithms
set.seed(7)
fit.lda <- train(Dom~., data=data, method="lda", metric=metric, trControl=control)
set.seed(7)
library(caret)
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Dom~., data=data, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Dom~., data=data, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Dom~., data=data, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Dom~., data=data, method="rf", metric=metric, trControl=control)
#Logistic
set.seed(7)
fit.log<- train(Dom~., data=data, method="glm", family="binomial",metric=metric, trControl=control)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Caret does support configuring and tuning each model, but we are not going to cover that in this tutorial.
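As a hedged illustration of what tuning would look like, the radial SVM's two tuning parameters, sigma and C, could be searched over an explicit grid via the tuneGrid argument (hypothetical values, not run here):
# hypothetical tuning grid for svmRadial
svm_grid <- expand.grid(sigma=c(0.01, 0.025, 0.05), C=c(0.25, 0.5, 1, 2))
set.seed(7)
fit.svm.tuned <- train(Dom~., data=data, method="svmRadial", metric=metric, trControl=control, tuneGrid=svm_grid)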
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
We can see the accuracy of each classifier and also other metrics like Kappa:
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf, log=fit.log))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, cart, knn, svm, rf, log
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6774194 0.7759577 0.7906250 0.7772177 0.8064516 0.8387097 0
## cart 0.5937500 0.7024194 0.7419355 0.7295363 0.7681452 0.8387097 0
## knn 0.7419355 0.7741935 0.7906250 0.7907594 0.8064516 0.8666667 0
## svm 0.8064516 0.8064516 0.8064516 0.8130376 0.8125000 0.8333333 0
## rf 0.7419355 0.8064516 0.8064516 0.8065860 0.8125000 0.8333333 0
## log 0.6562500 0.7419355 0.7580645 0.7517070 0.7953125 0.8064516 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lda -0.1832061 -0.05805338 0.00000000 0.03156875 0.13492063 0.2900763
## cart -0.2530120 -0.10714286 -0.08620690 -0.00387239 0.08024691 0.3621399
## knn -0.1071429 -0.04390244 0.10905350 0.07252398 0.16294643 0.2941176
## svm 0.0000000 0.00000000 0.00000000 0.00000000 0.00000000 0.0000000
## rf -0.1071429 0.00000000 0.00000000 -0.01071429 0.00000000 0.0000000
## log -0.2000000 -0.09541738 -0.05757018 -0.02237319 0.04007634 0.1696429
## NA's
## lda 0
## cart 0
## knn 0
## svm 0
## rf 0
## log 0
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10-fold cross-validation).
We can see that the most accurate model in this case was the SVM:
# compare accuracy of models
dotplot(results)
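The resamples object also works with other lattice plots. A box-and-whisker comparison, for example, shows the spread of each model's accuracy more directly:
# compare the spread of accuracy with a box-and-whisker plot
bwplot(results)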
The results for just the SVM model can be summarized.
# summarize Best Model
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel
##
## 310 samples
## 40 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 279, 279, 279, 279, 280, 279, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.8130376 0
## 0.50 0.8130376 0
## 1.00 0.8130376 0
##
## Tuning parameter 'sigma' was held constant at a value of 0.02482246
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.02482246 and C = 0.25.
This gives a nice summary of what was used to train the model and the mean accuracy achieved across the cross-validation folds, specifically about 81.3%.
The SVM was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
We can run the SVM model directly on the validation set and summarize the results in a confusion matrix.
# estimate skill of SVM on the validation dataset
predictions <- predict(fit.svm, validation)
confusionMatrix(predictions, validation$Dom)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 0
## 1 14 63
##
## Accuracy : 0.8182
## 95% CI : (0.7138, 0.8969)
## No Information Rate : 0.8182
## P-Value [Acc > NIR] : 0.570793
##
## Kappa : 0
## Mcnemar's Test P-Value : 0.000512
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.8182
## Prevalence : 0.1818
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
We can see that the accuracy is 81.82%. It was a small validation dataset (20%), but this result is in line with our cross-validation estimate of about 81.3%, suggesting we may have an accurate and reliable model.
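If you want to pull these numbers out programmatically instead of reading the printed table, the object returned by confusionMatrix stores them in its overall component:
# store the confusion matrix and extract the headline statistics
cm <- confusionMatrix(predictions, validation$Dom)
cm$overall["Accuracy"] # overall accuracy on the validation set
cm$overall[c("AccuracyLower", "AccuracyUpper")] # bounds of the 95% confidence interval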
In this post you discovered, step by step, how to complete an EEG data machine learning project in R.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.