1. The libraries required for data visualisation and for the methods below are loaded.

rm(list = ls())
library(plyr)        # loaded before dplyr so plyr does not mask dplyr's verbs
library(dplyr)
library(tidyr)
library(stringr)
library(mlr)
library(tidyverse)
library(caret)
library(gmodels)
library(ggplot2)
library(e1071)
library(caTools)
library(class)
library(GGally)
library(parallelMap)
library(parallel)
library(rpart.plot)
library(ISLR)
library(tree)
library(corrplot)
library(factoextra)
library(umap)
library(Rtsne)

BANDS data

This data set describes rotogravure printing runs. Each observation records the materials and machine settings of one print run (press, cylinder, ink and paper characteristics), and the target variable indicates whether the run suffered from cylinder banding, a process fault in which grooves appear on the printing cylinder. The classification task is to predict, from these attributes, whether a run results in banding ("band") or not ("noband").

Categorical variables:

  1. Cylinder number
  2. Job number
  3. Customer
  4. Grain screened
  5. Ink color
  6. Proof on ctd ink
  7. Blade mfg
  8. Cylinder division
  9. Paper type
  10. Ink type
  11. Direct steam
  12. Solvent type
  13. Type on cylinder
  14. Press type
  15. Press
  16. Cylinder type
  17. Paper mill location
  18. Band Type

Quantitative variables:

  1. Proof cut
  2. Viscosity
  3. Caliper
  4. Ink temperature
  5. Humidity
  6. Roughness
  7. Blade pressure
  8. Varnish pct
  9. Press speed
  10. Ink pct
  11. Solvent pct
  12. ESA Voltage
  13. ESA Amperage
  14. Wax
  15. Hardener
  16. Roller durometer
  17. Current density
  18. Anode space ratio
  19. Chrome content

getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
# semicolon-separated file with "." as the decimal mark
bands <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/duomenys1.csv", header = TRUE, sep = ";", dec = ".")

Histograms are drawn for the data.

bands %>%
  keep(is.numeric) %>%              # numeric columns only
  gather() %>%                      # long format: one row per (variable, value) pair
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 20)

The following variables are removed:

  1. Chrome content
  2. ESA.Amperage
  3. Timestamp
  4. Wax

They are dropped because they carry no useful information: their values are concentrated on only a few distinct levels.
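
A quick automated check of this (a sketch; using the near-zero-variance criterion from the already-loaded caret package is our assumption, not part of the original analysis):

nzv <- caret::nearZeroVar(bands, saveMetrics = TRUE)   # flag columns concentrated on few values
nzv[nzv$nzv, ]                                         # candidate columns for removal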

A bar chart is drawn for the categorical variable.

ggplot(bands, aes(x = as.factor(bandtype))) +
  geom_bar(color = "red", fill = rgb(0.7, 0.4, 0.5, 0.6)) +
  ggtitle("Band type") +
  xlab("Class") + ylab("Count")

The chart shows that the observations are not concentrated in a single level of the binary variable, i.e. the two classes are reasonably balanced.
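
The balance can also be checked numerically (a one-line sketch, not part of the original output):

prop.table(table(bands$bandtype))   # share of each class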

The cleaned data are read in.

getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
bands <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/duomenys22.csv", header = TRUE, sep = ";", dec = ".")

Missing values are removed.

anyNA(bands)
## [1] TRUE
bands[bands == "?"] <- NA   # "?" marks missing values in this file
bands <- na.omit(bands)     # drop rows with missing values
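
As a sanity check that the removal worked (a sketch, not part of the original output):

colSums(is.na(bands))   # should now be all zero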

The variables are cast to their intended types: the target becomes a factor and the listed measurements become integers.

bands$bandtype <- as.factor(bands$bandtype)   # target as a factor
int_cols <- c("unitnumber", "press", "platingtank", "proofcut", "viscosity",
              "caliper", "roughness", "bladepressure", "speed")
bands[int_cols] <- lapply(bands[int_cols], as.integer)
str(bands)
## 'data.frame':    205 obs. of  20 variables:
##  $ press        : int  824 827 827 827 815 815 815 827 827 816 ...
##  $ unitnumber   : int  2 9 9 2 2 9 2 2 9 9 ...
##  $ platingtank  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ proofcut     : int  30 40 50 50 37 37 35 60 52 40 ...
##  $ viscosity    : int  43 42 45 45 44 44 44 43 43 46 ...
##  $ caliper      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ temperature  : num  16.3 14.5 15 15.2 16 16.5 16 15.8 16.6 15.9 ...
##  $ humifity     : int  70 74 76 72 90 91 80 57 58 78 ...
##  $ roughness    : int  0 1 1 1 0 0 1 0 0 0 ...
##  $ bladepressure: int  25 25 30 25 28 30 32 24 20 34 ...
##  $ narnish      : num  1.2 8 8 5.9 20 21.7 7 1.2 7.9 2.4 ...
##  $ speed        : int  2200 2100 2150 2150 2050 2050 1400 1480 1480 2000 ...
##  $ ink          : num  58.1 57.5 57.5 58.8 45 43.5 58.1 58.1 56.2 61 ...
##  $ solvent      : num  40.7 34.5 34.5 35.3 35 34.8 34.9 40.7 36 36.6 ...
##  $ Voltage      : num  4 0 0 0 0 0 0 0 0 0 ...
##  $ hardener     : num  1 0.7 1 1 0.8 0.8 1 1.7 0.7 1.3 ...
##  $ durometer    : int  30 35 35 35 35 35 28 33 33 35 ...
##  $ density      : int  40 40 40 40 40 40 40 40 40 40 ...
##  $ spaceratio   : num  96.9 107.4 107.4 107.4 107.4 ...
##  $ bandtype     : Factor w/ 2 levels "band","noband": 2 1 2 2 2 2 2 2 2 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:8] 56 77 161 182 210 211 212 213
##   ..- attr(*, "names")= chr [1:8] "56" "77" "161" "182" ...

Training and test samples are created.

smp_size <- floor(0.9 * nrow(bands))   # 90 % of rows go to training
set.seed(10)
train_ind <- sample(seq_len(nrow(bands)), size = smp_size)

bands_train <- bands[train_ind, ]
bands_test  <- bands[-train_ind, ]
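
Note that sample() ignores the class labels; if a split preserving the band/noband ratio were wanted, caret offers a stratified alternative (a sketch under that assumption, with hypothetical object names, not used below):

set.seed(10)
idx <- caret::createDataPartition(bands$bandtype, p = 0.9, list = FALSE)
bands_train_s <- bands[idx, ]    # stratified training set
bands_test_s  <- bands[-idx, ]   # stratified test set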

LDA model

An LDA classifier is created and trained on the classification task defined above.

bands <- bands[, -6]   # drop column 6 (caliper): after the integer cast it holds a single value
bandsTask <- makeClassifTask(data = bands, target = "bandtype")
bands <- as_tibble(bands)
lda <- makeLearner("classif.lda")
holdout <- makeResampleDesc(method = "Holdout", split = 0.5, stratify = TRUE)
set.seed(123)
holdoutCV_lda <- resample(learner = lda, task = bandsTask, resampling = holdout, measures = list(mmce, acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.1747573 0.8252427
## 
## Aggregated Result: mmce.test.mean=0.1747573,acc.test.mean=0.8252427
## 

The LDA accuracy and error estimates show that about 83 percent of the observations are classified correctly.
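
The same estimates can be recomputed directly from the stored predictions (a small sketch using mlr's performance(), not part of the original output):

performance(holdoutCV_lda$pred, measures = list(mmce, acc))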

calculateConfusionMatrix(holdoutCV_lda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.50/0.61 0.50/0.13 0.50     
##   noband 0.09/0.39 0.91/0.87 0.09     
##   -err.-      0.39      0.13 0.17     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band     11     11     11
##   noband    7     74      7
##   -err.-    7     11     18

Of the 103 test observations, 18 are misclassified.

k-fold validation

bandsTask <- makeClassifTask(data = bands, target = "bandtype")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
kFold <- makeResampleDesc(method = "RepCV", folds = 10, stratify = TRUE)   # 10-fold CV repeated 10 times (100 iterations)
set.seed(10)
kfold_ldaCV <- resample(learner = lda, task = bandsTask, resampling = kFold, measures = list(mlr::mmce, mlr::acc))
## Resampling: repeated cross-validation
## Measures:             mmce      acc
## [Resample] iter 1:    0.0952381 0.9047619
## [Resample] iter 2:    0.1000000 0.9000000
## [Resample] iter 3:    0.1904762 0.8095238
## [Resample] iter 4:    0.0500000 0.9500000
## [Resample] iter 5:    0.1000000 0.9000000
## [Resample] iter 6:    0.2000000 0.8000000
## [Resample] iter 7:    0.2000000 0.8000000
## [Resample] iter 8:    0.1500000 0.8500000
## [Resample] iter 9:    0.1904762 0.8095238
## [Resample] iter 10:   0.1363636 0.8636364
## [Resample] iter 11:   0.1500000 0.8500000
## [Resample] iter 12:   0.1428571 0.8571429
## [Resample] iter 13:   0.1500000 0.8500000
## [Resample] iter 14:   0.0952381 0.9047619
## [Resample] iter 15:   0.1818182 0.8181818
## [Resample] iter 16:   0.0500000 0.9500000
## [Resample] iter 17:   0.2500000 0.7500000
## [Resample] iter 18:   0.2000000 0.8000000
## [Resample] iter 19:   0.1000000 0.9000000
## [Resample] iter 20:   0.1428571 0.8571429
## [Resample] iter 21:   0.0952381 0.9047619
## [Resample] iter 22:   0.1000000 0.9000000
## [Resample] iter 23:   0.0952381 0.9047619
## [Resample] iter 24:   0.1000000 0.9000000
## [Resample] iter 25:   0.2000000 0.8000000
## [Resample] iter 26:   0.2380952 0.7619048
## [Resample] iter 27:   0.0476190 0.9523810
## [Resample] iter 28:   0.1500000 0.8500000
## [Resample] iter 29:   0.1000000 0.9000000
## [Resample] iter 30:   0.2380952 0.7619048
## [Resample] iter 31:   0.2727273 0.7272727
## [Resample] iter 32:   0.0500000 0.9500000
## [Resample] iter 33:   0.2000000 0.8000000
## [Resample] iter 34:   0.1428571 0.8571429
## [Resample] iter 35:   0.1500000 0.8500000
## [Resample] iter 36:   0.2000000 0.8000000
## [Resample] iter 37:   0.1500000 0.8500000
## [Resample] iter 38:   0.1000000 0.9000000
## [Resample] iter 39:   0.1904762 0.8095238
## [Resample] iter 40:   0.1904762 0.8095238
## [Resample] iter 41:   0.1000000 0.9000000
## [Resample] iter 42:   0.1500000 0.8500000
## [Resample] iter 43:   0.1000000 0.9000000
## [Resample] iter 44:   0.1500000 0.8500000
## [Resample] iter 45:   0.1428571 0.8571429
## [Resample] iter 46:   0.0952381 0.9047619
## [Resample] iter 47:   0.1500000 0.8500000
## [Resample] iter 48:   0.2272727 0.7727273
## [Resample] iter 49:   0.0952381 0.9047619
## [Resample] iter 50:   0.1500000 0.8500000
## [Resample] iter 51:   0.2380952 0.7619048
## [Resample] iter 52:   0.1000000 0.9000000
## [Resample] iter 53:   0.0952381 0.9047619
## [Resample] iter 54:   0.1000000 0.9000000
## [Resample] iter 55:   0.0952381 0.9047619
## [Resample] iter 56:   0.1500000 0.8500000
## [Resample] iter 57:   0.1500000 0.8500000
## [Resample] iter 58:   0.1904762 0.8095238
## [Resample] iter 59:   0.2500000 0.7500000
## [Resample] iter 60:   0.1428571 0.8571429
## [Resample] iter 61:   0.1000000 0.9000000
## [Resample] iter 62:   0.1000000 0.9000000
## [Resample] iter 63:   0.1500000 0.8500000
## [Resample] iter 64:   0.1904762 0.8095238
## [Resample] iter 65:   0.1363636 0.8636364
## [Resample] iter 66:   0.1428571 0.8571429
## [Resample] iter 67:   0.1000000 0.9000000
## [Resample] iter 68:   0.2000000 0.8000000
## [Resample] iter 69:   0.2000000 0.8000000
## [Resample] iter 70:   0.0476190 0.9523810
## [Resample] iter 71:   0.1500000 0.8500000
## [Resample] iter 72:   0.0476190 0.9523810
## [Resample] iter 73:   0.1904762 0.8095238
## [Resample] iter 74:   0.1428571 0.8571429
## [Resample] iter 75:   0.1428571 0.8571429
## [Resample] iter 76:   0.1500000 0.8500000
## [Resample] iter 77:   0.0500000 0.9500000
## [Resample] iter 78:   0.2380952 0.7619048
## [Resample] iter 79:   0.1000000 0.9000000
## [Resample] iter 80:   0.1500000 0.8500000
## [Resample] iter 81:   0.2500000 0.7500000
## [Resample] iter 82:   0.2500000 0.7500000
## [Resample] iter 83:   0.1000000 0.9000000
## [Resample] iter 84:   0.0909091 0.9090909
## [Resample] iter 85:   0.1818182 0.8181818
## [Resample] iter 86:   0.1428571 0.8571429
## [Resample] iter 87:   0.2000000 0.8000000
## [Resample] iter 88:   0.1000000 0.9000000
## [Resample] iter 89:   0.1000000 0.9000000
## [Resample] iter 90:   0.2000000 0.8000000
## [Resample] iter 91:   0.2272727 0.7727273
## [Resample] iter 92:   0.1500000 0.8500000
## [Resample] iter 93:   0.1000000 0.9000000
## [Resample] iter 94:   0.1500000 0.8500000
## [Resample] iter 95:   0.1428571 0.8571429
## [Resample] iter 96:   0.0952381 0.9047619
## [Resample] iter 97:   0.0952381 0.9047619
## [Resample] iter 98:   0.1500000 0.8500000
## [Resample] iter 99:   0.2500000 0.7500000
## [Resample] iter 100:  0.0000000 1.0000000
## 
## Aggregated Result: mmce.test.mean=0.1446407,acc.test.mean=0.8553593
## 

After k-fold cross-validation the LDA model classifies about 86 percent of the observations correctly, an improvement of roughly 3 percentage points over the holdout estimate.
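
Since repeated CV produces 100 estimates, their spread is also worth reporting (a sketch using the per-iteration measures stored by mlr, not part of the original output):

sd(kfold_ldaCV$measures.test$acc)   # variability of accuracy across the 100 test folds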

calculateConfusionMatrix(kfold_ldaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.45/0.76 0.55/0.13 0.55     
##   noband 0.04/0.24 0.96/0.87 0.04     
##   -err.-      0.24      0.13 0.14     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band    194    236    236
##   noband   61   1559     61
##   -err.-   61    236    297

LOO validation

LOO <- makeResampleDesc(method = "LOO")   # leave-one-out: one iteration per observation
set.seed(50)
lda <- makeLearner("classif.lda")
lda_LOO <- resample(learner = lda, task = bandsTask, resampling = LOO,
                    measures = list(mmce, acc))
## Resampling: LOO
## Measures:             mmce      acc
## [Resample] iter 1:    0.0000000 1.0000000
## [Resample] iter 2:    1.0000000 0.0000000
## [Resample] iter 3:    0.0000000 1.0000000
## [Resample] iter 4:    0.0000000 1.0000000
## [Resample] iter 5:    0.0000000 1.0000000
## [Resample] iter 6:    0.0000000 1.0000000
## [Resample] iter 7:    1.0000000 0.0000000
## [Resample] iter 8:    0.0000000 1.0000000
## [Resample] iter 9:    0.0000000 1.0000000
## [Resample] iter 10:   0.0000000 1.0000000
## [Resample] iter 11:   0.0000000 1.0000000
## [Resample] iter 12:   0.0000000 1.0000000
## [Resample] iter 13:   0.0000000 1.0000000
## [Resample] iter 14:   0.0000000 1.0000000
## [Resample] iter 15:   0.0000000 1.0000000
## [Resample] iter 16:   0.0000000 1.0000000
## [Resample] iter 17:   0.0000000 1.0000000
## [Resample] iter 18:   0.0000000 1.0000000
## [Resample] iter 19:   0.0000000 1.0000000
## [Resample] iter 20:   0.0000000 1.0000000
## [Resample] iter 21:   0.0000000 1.0000000
## [Resample] iter 22:   0.0000000 1.0000000
## [Resample] iter 23:   0.0000000 1.0000000
## [Resample] iter 24:   0.0000000 1.0000000
## [Resample] iter 25:   0.0000000 1.0000000
## [Resample] iter 26:   0.0000000 1.0000000
## [Resample] iter 27:   0.0000000 1.0000000
## [Resample] iter 28:   0.0000000 1.0000000
## [Resample] iter 29:   0.0000000 1.0000000
## [Resample] iter 30:   0.0000000 1.0000000
## [Resample] iter 31:   0.0000000 1.0000000
## [Resample] iter 32:   0.0000000 1.0000000
## [Resample] iter 33:   0.0000000 1.0000000
## [Resample] iter 34:   0.0000000 1.0000000
## [Resample] iter 35:   0.0000000 1.0000000
## [Resample] iter 36:   0.0000000 1.0000000
## [Resample] iter 37:   0.0000000 1.0000000
## [Resample] iter 38:   0.0000000 1.0000000
## [Resample] iter 39:   1.0000000 0.0000000
## [Resample] iter 40:   1.0000000 0.0000000
## [Resample] iter 41:   0.0000000 1.0000000
## [Resample] iter 42:   1.0000000 0.0000000
## [Resample] iter 43:   0.0000000 1.0000000
## [Resample] iter 44:   0.0000000 1.0000000
## [Resample] iter 45:   0.0000000 1.0000000
## [Resample] iter 46:   0.0000000 1.0000000
## [Resample] iter 47:   0.0000000 1.0000000
## [Resample] iter 48:   0.0000000 1.0000000
## [Resample] iter 49:   0.0000000 1.0000000
## [Resample] iter 50:   0.0000000 1.0000000
## [Resample] iter 51:   0.0000000 1.0000000
## [Resample] iter 52:   0.0000000 1.0000000
## [Resample] iter 53:   0.0000000 1.0000000
## [Resample] iter 54:   1.0000000 0.0000000
## [Resample] iter 55:   0.0000000 1.0000000
## [Resample] iter 56:   1.0000000 0.0000000
## [Resample] iter 57:   1.0000000 0.0000000
## [Resample] iter 58:   1.0000000 0.0000000
## [Resample] iter 59:   0.0000000 1.0000000
## [Resample] iter 60:   0.0000000 1.0000000
## [Resample] iter 61:   0.0000000 1.0000000
## [Resample] iter 62:   0.0000000 1.0000000
## [Resample] iter 63:   0.0000000 1.0000000
## [Resample] iter 64:   0.0000000 1.0000000
## [Resample] iter 65:   0.0000000 1.0000000
## [Resample] iter 66:   0.0000000 1.0000000
## [Resample] iter 67:   0.0000000 1.0000000
## [Resample] iter 68:   1.0000000 0.0000000
## [Resample] iter 69:   0.0000000 1.0000000
## [Resample] iter 70:   0.0000000 1.0000000
## [Resample] iter 71:   0.0000000 1.0000000
## [Resample] iter 72:   0.0000000 1.0000000
## [Resample] iter 73:   1.0000000 0.0000000
## [Resample] iter 74:   1.0000000 0.0000000
## [Resample] iter 75:   0.0000000 1.0000000
## [Resample] iter 76:   1.0000000 0.0000000
## [Resample] iter 77:   0.0000000 1.0000000
## [Resample] iter 78:   1.0000000 0.0000000
## [Resample] iter 79:   0.0000000 1.0000000
## [Resample] iter 80:   0.0000000 1.0000000
## [Resample] iter 81:   0.0000000 1.0000000
## [Resample] iter 82:   0.0000000 1.0000000
## [Resample] iter 83:   0.0000000 1.0000000
## [Resample] iter 84:   0.0000000 1.0000000
## [Resample] iter 85:   0.0000000 1.0000000
## [Resample] iter 86:   0.0000000 1.0000000
## [Resample] iter 87:   0.0000000 1.0000000
## [Resample] iter 88:   0.0000000 1.0000000
## [Resample] iter 89:   0.0000000 1.0000000
## [Resample] iter 90:   0.0000000 1.0000000
## [Resample] iter 91:   0.0000000 1.0000000
## [Resample] iter 92:   0.0000000 1.0000000
## [Resample] iter 93:   0.0000000 1.0000000
## [Resample] iter 94:   0.0000000 1.0000000
## [Resample] iter 95:   0.0000000 1.0000000
## [Resample] iter 96:   0.0000000 1.0000000
## [Resample] iter 97:   0.0000000 1.0000000
## [Resample] iter 98:   0.0000000 1.0000000
## [Resample] iter 99:   0.0000000 1.0000000
## [Resample] iter 100:  0.0000000 1.0000000
## [Resample] iter 101:  0.0000000 1.0000000
## [Resample] iter 102:  0.0000000 1.0000000
## [Resample] iter 103:  0.0000000 1.0000000
## [Resample] iter 104:  1.0000000 0.0000000
## [Resample] iter 105:  1.0000000 0.0000000
## [Resample] iter 106:  0.0000000 1.0000000
## [Resample] iter 107:  0.0000000 1.0000000
## [Resample] iter 108:  0.0000000 1.0000000
## [Resample] iter 109:  0.0000000 1.0000000
## [Resample] iter 110:  1.0000000 0.0000000
## [Resample] iter 111:  0.0000000 1.0000000
## [Resample] iter 112:  0.0000000 1.0000000
## [Resample] iter 113:  0.0000000 1.0000000
## [Resample] iter 114:  0.0000000 1.0000000
## [Resample] iter 115:  0.0000000 1.0000000
## [Resample] iter 116:  0.0000000 1.0000000
## [Resample] iter 117:  0.0000000 1.0000000
## [Resample] iter 118:  0.0000000 1.0000000
## [Resample] iter 119:  0.0000000 1.0000000
## [Resample] iter 120:  0.0000000 1.0000000
## [Resample] iter 121:  0.0000000 1.0000000
## [Resample] iter 122:  0.0000000 1.0000000
## [Resample] iter 123:  0.0000000 1.0000000
## [Resample] iter 124:  0.0000000 1.0000000
## [Resample] iter 125:  0.0000000 1.0000000
## [Resample] iter 126:  0.0000000 1.0000000
## [Resample] iter 127:  0.0000000 1.0000000
## [Resample] iter 128:  0.0000000 1.0000000
## [Resample] iter 129:  0.0000000 1.0000000
## [Resample] iter 130:  0.0000000 1.0000000
## [Resample] iter 131:  0.0000000 1.0000000
## [Resample] iter 132:  0.0000000 1.0000000
## [Resample] iter 133:  0.0000000 1.0000000
## [Resample] iter 134:  0.0000000 1.0000000
## [Resample] iter 135:  0.0000000 1.0000000
## [Resample] iter 136:  0.0000000 1.0000000
## [Resample] iter 137:  0.0000000 1.0000000
## [Resample] iter 138:  0.0000000 1.0000000
## [Resample] iter 139:  0.0000000 1.0000000
## [Resample] iter 140:  0.0000000 1.0000000
## [Resample] iter 141:  0.0000000 1.0000000
## [Resample] iter 142:  1.0000000 0.0000000
## [Resample] iter 143:  1.0000000 0.0000000
## [Resample] iter 144:  0.0000000 1.0000000
## [Resample] iter 145:  1.0000000 0.0000000
## [Resample] iter 146:  0.0000000 1.0000000
## [Resample] iter 147:  0.0000000 1.0000000
## [Resample] iter 148:  0.0000000 1.0000000
## [Resample] iter 149:  0.0000000 1.0000000
## [Resample] iter 150:  0.0000000 1.0000000
## [Resample] iter 151:  0.0000000 1.0000000
## [Resample] iter 152:  0.0000000 1.0000000
## [Resample] iter 153:  0.0000000 1.0000000
## [Resample] iter 154:  0.0000000 1.0000000
## [Resample] iter 155:  0.0000000 1.0000000
## [Resample] iter 156:  0.0000000 1.0000000
## [Resample] iter 157:  1.0000000 0.0000000
## [Resample] iter 158:  0.0000000 1.0000000
## [Resample] iter 159:  1.0000000 0.0000000
## [Resample] iter 160:  1.0000000 0.0000000
## [Resample] iter 161:  1.0000000 0.0000000
## [Resample] iter 162:  0.0000000 1.0000000
## [Resample] iter 163:  0.0000000 1.0000000
## [Resample] iter 164:  0.0000000 1.0000000
## [Resample] iter 165:  0.0000000 1.0000000
## [Resample] iter 166:  0.0000000 1.0000000
## [Resample] iter 167:  0.0000000 1.0000000
## [Resample] iter 168:  0.0000000 1.0000000
## [Resample] iter 169:  0.0000000 1.0000000
## [Resample] iter 170:  0.0000000 1.0000000
## [Resample] iter 171:  1.0000000 0.0000000
## [Resample] iter 172:  0.0000000 1.0000000
## [Resample] iter 173:  0.0000000 1.0000000
## [Resample] iter 174:  0.0000000 1.0000000
## [Resample] iter 175:  0.0000000 1.0000000
## [Resample] iter 176:  1.0000000 0.0000000
## [Resample] iter 177:  1.0000000 0.0000000
## [Resample] iter 178:  0.0000000 1.0000000
## [Resample] iter 179:  1.0000000 0.0000000
## [Resample] iter 180:  0.0000000 1.0000000
## [Resample] iter 181:  1.0000000 0.0000000
## [Resample] iter 182:  0.0000000 1.0000000
## [Resample] iter 183:  0.0000000 1.0000000
## [Resample] iter 184:  0.0000000 1.0000000
## [Resample] iter 185:  0.0000000 1.0000000
## [Resample] iter 186:  0.0000000 1.0000000
## [Resample] iter 187:  0.0000000 1.0000000
## [Resample] iter 188:  0.0000000 1.0000000
## [Resample] iter 189:  0.0000000 1.0000000
## [Resample] iter 190:  0.0000000 1.0000000
## [Resample] iter 191:  0.0000000 1.0000000
## [Resample] iter 192:  0.0000000 1.0000000
## [Resample] iter 193:  0.0000000 1.0000000
## [Resample] iter 194:  0.0000000 1.0000000
## [Resample] iter 195:  0.0000000 1.0000000
## [Resample] iter 196:  0.0000000 1.0000000
## [Resample] iter 197:  0.0000000 1.0000000
## [Resample] iter 198:  0.0000000 1.0000000
## [Resample] iter 199:  0.0000000 1.0000000
## [Resample] iter 200:  0.0000000 1.0000000
## [Resample] iter 201:  0.0000000 1.0000000
## [Resample] iter 202:  0.0000000 1.0000000
## [Resample] iter 203:  0.0000000 1.0000000
## [Resample] iter 204:  0.0000000 1.0000000
## [Resample] iter 205:  0.0000000 1.0000000
## 
## Aggregated Result: mmce.test.mean=0.1414634,acc.test.mean=0.8585366
## 
lda_LOO$aggr
## mmce.test.mean  acc.test.mean 
##      0.1414634      0.8585366

With LOO validation, about 86 percent of the observations are classified correctly.

KNN method

bandsTask <- makeClassifTask(data = bands , target = "bandtype")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))
gridSearch <- makeTuneControlGrid()
set.seed(10)
holdout <- makeResampleDesc(method = "Holdout", split = 3/5, stratify = TRUE)
tunedKCv <- tuneParams("classif.knn", task = bandsTask, resampling = holdout, par.set = knnParamSpace, control = gridSearch)
## [Tune] Started tuning learner classif.knn for parameter set:
##       Type len Def               Constr Req Tunable Trafo
## k discrete   -   - 1,2,3,4,5,6,7,8,9,10   -    TRUE     -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=1
## [Tune-y] 1: mmce.test.mean=0.0722892; time: 0.0 min
## [Tune-x] 2: k=2
## [Tune-y] 2: mmce.test.mean=0.1566265; time: 0.0 min
## [Tune-x] 3: k=3
## [Tune-y] 3: mmce.test.mean=0.1686747; time: 0.0 min
## [Tune-x] 4: k=4
## [Tune-y] 4: mmce.test.mean=0.1566265; time: 0.0 min
## [Tune-x] 5: k=5
## [Tune-y] 5: mmce.test.mean=0.1445783; time: 0.0 min
## [Tune-x] 6: k=6
## [Tune-y] 6: mmce.test.mean=0.1325301; time: 0.0 min
## [Tune-x] 7: k=7
## [Tune-y] 7: mmce.test.mean=0.1566265; time: 0.0 min
## [Tune-x] 8: k=8
## [Tune-y] 8: mmce.test.mean=0.1807229; time: 0.0 min
## [Tune-x] 9: k=9
## [Tune-y] 9: mmce.test.mean=0.1807229; time: 0.0 min
## [Tune-x] 10: k=10
## [Tune-y] 10: mmce.test.mean=0.1807229; time: 0.0 min
## [Tune] Result: k=1 : mmce.test.mean=0.0722892
knnTuningData <- generateHyperParsEffectData(tunedKCv)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean", plot.type = "line") + theme_bw()

The plot and tuning log show that the lowest misclassification error (i.e. the highest accuracy) is reached at k = 1; the error then grows and levels off near 0.18 for k ≥ 8. Because k = 1 tends to overfit, a larger neighbourhood of k = 10 is used for the model below.

tunedKCv
## Tune result:
## Op. pars: k=1
## mmce.test.mean=0.0722892

With the tuned k = 1 the nearest-neighbour classifier misclassifies about 7 percent of the holdout data, i.e. roughly 93 percent of the observations are classified correctly.
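
The selected value can also be read programmatically from the tuning result (sketch):

tunedKCv$x   # the parameter setting with the lowest mmce (here k = 1)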

knn <- makeLearner("classif.knn", par.vals = list("k" = 10))   # k = 10, as motivated above
holdoutNoStrat <- makeResampleDesc(method = "Holdout", split = 0.5, stratify = FALSE)
set.seed(10)
knnHoldoutCV <- resample(learner = knn, task = bandsTask, resampling = holdoutNoStrat, measures = list(mmce, acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.2038835 0.7961165
## 
## Aggregated Result: mmce.test.mean=0.2038835,acc.test.mean=0.7961165
## 

With k = 10 and an unstratified 50/50 holdout split, about 80 percent of the observations are classified correctly.

calculateConfusionMatrix(knnHoldoutCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.30/0.58 0.70/0.18 0.70     
##   noband 0.06/0.42 0.94/0.82 0.06     
##   -err.-      0.42      0.18 0.20     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band      7     16     16
##   noband    5     75      5
##   -err.-    5     16     21

Validation of the KNN classifier

Holdout validation.

set.seed(10)
holdoutCV <- resample(learner = knn, task = bandsTask, resampling = holdout, measures = list(mlr::mmce, mlr::acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.1807229 0.8192771
## 
## Aggregated Result: mmce.test.mean=0.1807229,acc.test.mean=0.8192771
## 

With a stratified holdout split, the KNN accuracy improves by about 2 percentage points, to roughly 82 percent.

holdoutCV$aggr
## mmce.test.mean  acc.test.mean 
##      0.1807229      0.8192771
calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.28/0.71 0.72/0.17 0.72     
##   noband 0.03/0.29 0.97/0.83 0.03     
##   -err.-      0.29      0.17 0.18     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band      5     13     13
##   noband    2     63      2
##   -err.-    2     13     15

K-fold validation.

kfold <- makeResampleDesc(method = "RepCV", folds = 10)
set.seed(10)
kfoldCV <- resample(learner = knn, task = bandsTask, resampling = kfold , measures = list(mlr::mmce, mlr::acc))
kfoldCV$aggr 
## mmce.test.mean  acc.test.mean 
##      0.1545238      0.8454762

With repeated k-fold cross-validation the KNN accuracy rises to about 85 percent, roughly 5 percentage points better than the unstratified holdout estimate.

calculateConfusionMatrix(kfoldCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.50/0.68 0.50/0.12 0.50     
##   noband 0.06/0.32 0.94/0.88 0.06     
##   -err.-      0.32      0.12 0.15     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band    214    216    216
##   noband  101   1519    101
##   -err.-  101    216    317

SVM model

An SVM with polynomial, radial and sigmoid kernels is tuned and evaluated with nested cross-validation: hyperparameters are tuned on an inner holdout split, and the chosen model is assessed with a 3-fold outer cross-validation.

bandsT <- as_tibble(bands)
bandsTask <- makeClassifTask(data = bandsT, target = "bandtype")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
cvForTuning <- makeResampleDesc("Holdout", split = 0.9)   # inner resampling for tuning
kernels <- c("polynomial", "radial", "sigmoid")
svmParamSpace <- makeParamSet(
  makeDiscreteParam("kernel", values = kernels),
  makeIntegerParam("degree", lower = 1, upper = 3),
  makeNumericParam("cost", lower = 0.1, upper = 10),
  makeNumericParam("gamma", lower = 0.1, upper = 10))


randSearch <- makeTuneControlRandom(maxit = 10)
outer <- makeResampleDesc("CV", iters = 3)   # outer loop of the nested cross-validation
svmWrapper <- makeTuneWrapper("classif.svm", resampling = cvForTuning,
                              par.set = svmParamSpace, control = randSearch)
cvWithTuning <- resample(learner = svmWrapper, task = bandsTask, resampling = outer, measures = list(mmce, acc))
## Resampling: cross-validation
## Measures:             mmce      acc
## [Tune] Started tuning learner classif.svm for parameter set:
##            Type len Def                    Constr Req Tunable Trafo
## kernel discrete   -   - polynomial,radial,sigmoid   -    TRUE     -
## degree  integer   -   -                    1 to 3   -    TRUE     -
## cost    numeric   -   -                 0.1 to 10   -    TRUE     -
## gamma   numeric   -   -                 0.1 to 10   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## [Tune-x] 1: kernel=polynomial; degree=1; cost=7.42; gamma=0.767
## [Tune-y] 1: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 2: kernel=sigmoid; degree=2; cost=4.89; gamma=5.05
## [Tune-y] 2: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 3: kernel=sigmoid; degree=2; cost=4.77; gamma=4.6
## [Tune-y] 3: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 4: kernel=sigmoid; degree=3; cost=0.117; gamma=5.8
## [Tune-y] 4: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 5: kernel=sigmoid; degree=1; cost=4.81; gamma=2.58
## [Tune-y] 5: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 6: kernel=sigmoid; degree=3; cost=5.73; gamma=1.29
## [Tune-y] 6: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 7: kernel=sigmoid; degree=2; cost=6.13; gamma=3.05
## [Tune-y] 7: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 8: kernel=polynomial; degree=3; cost=6.89; gamma=3.3
## [Tune-y] 8: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 9: kernel=polynomial; degree=2; cost=0.641; gamma=9.73
## [Tune-y] 9: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 10: kernel=radial; degree=1; cost=8.98; gamma=4.04
## [Tune-y] 10: mmce.test.mean=0.0714286; time: 0.0 min
## [Tune] Result: kernel=polynomial; degree=3; cost=6.89; gamma=3.3 : mmce.test.mean=0.0000000
## [Resample] iter 1:    0.0294118 0.9705882
## [Tune] Started tuning learner classif.svm for parameter set:
##            Type len Def                    Constr Req Tunable Trafo
## kernel discrete   -   - polynomial,radial,sigmoid   -    TRUE     -
## degree  integer   -   -                    1 to 3   -    TRUE     -
## cost    numeric   -   -                 0.1 to 10   -    TRUE     -
## gamma   numeric   -   -                 0.1 to 10   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## [Tune-x] 1: kernel=polynomial; degree=1; cost=9.56; gamma=9.87
## [Tune-y] 1: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 2: kernel=radial; degree=2; cost=3.85; gamma=7.61
## [Tune-y] 2: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 3: kernel=polynomial; degree=1; cost=0.598; gamma=7.97
## [Tune-y] 3: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 4: kernel=radial; degree=3; cost=4.8; gamma=3.3
## [Tune-y] 4: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 5: kernel=radial; degree=1; cost=3.92; gamma=3.1
## [Tune-y] 5: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 6: kernel=polynomial; degree=3; cost=2.11; gamma=8.43
## [Tune-y] 6: mmce.test.mean=0.0714286; time: 0.0 min
## [Tune-x] 7: kernel=sigmoid; degree=3; cost=0.268; gamma=6.14
## [Tune-y] 7: mmce.test.mean=0.0714286; time: 0.0 min
## [Tune-x] 8: kernel=sigmoid; degree=2; cost=6.01; gamma=0.416
## [Tune-y] 8: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 9: kernel=polynomial; degree=2; cost=0.347; gamma=5.02
## [Tune-y] 9: mmce.test.mean=0.0714286; time: 0.0 min
## [Tune-x] 10: kernel=radial; degree=2; cost=8.84; gamma=9.06
## [Tune-y] 10: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune] Result: kernel=radial; degree=1; cost=3.92; gamma=3.1 : mmce.test.mean=0.0000000
## [Resample] iter 2:    0.1014493 0.8985507
## [Tune] Started tuning learner classif.svm for parameter set:
##            Type len Def                    Constr Req Tunable Trafo
## kernel discrete   -   - polynomial,radial,sigmoid   -    TRUE     -
## degree  integer   -   -                    1 to 3   -    TRUE     -
## cost    numeric   -   -                 0.1 to 10   -    TRUE     -
## gamma   numeric   -   -                 0.1 to 10   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## [Tune-x] 1: kernel=radial; degree=1; cost=4.09; gamma=8.01
## [Tune-y] 1: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 2: kernel=polynomial; degree=1; cost=2.56; gamma=9.82
## [Tune-y] 2: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 3: kernel=polynomial; degree=1; cost=9.98; gamma=0.313
## [Tune-y] 3: mmce.test.mean=0.1428571; time: 0.0 min
## [Tune-x] 4: kernel=radial; degree=1; cost=4.49; gamma=7.17
## [Tune-y] 4: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 5: kernel=sigmoid; degree=3; cost=8.08; gamma=1.66
## [Tune-y] 5: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 6: kernel=radial; degree=3; cost=1.43; gamma=7.86
## [Tune-y] 6: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 7: kernel=radial; degree=3; cost=1.77; gamma=1.33
## [Tune-y] 7: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 8: kernel=sigmoid; degree=3; cost=4.48; gamma=7.06
## [Tune-y] 8: mmce.test.mean=0.2857143; time: 0.0 min
## [Tune-x] 9: kernel=polynomial; degree=3; cost=9.37; gamma=2.66
## [Tune-y] 9: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune-x] 10: kernel=radial; degree=1; cost=9.75; gamma=0.479
## [Tune-y] 10: mmce.test.mean=0.0000000; time: 0.0 min
## [Tune] Result: kernel=polynomial; degree=3; cost=9.37; gamma=2.66 : mmce.test.mean=0.0000000
## [Resample] iter 3:    0.1176471 0.8823529
## 
## Aggregated Result: mmce.test.mean=0.0828360,acc.test.mean=0.9171640
## 
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.70/0.88 0.30/0.08 0.30     
##   noband 0.02/0.12 0.98/0.92 0.02     
##   -err.-      0.12      0.08 0.08     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band     30     13     13
##   noband    4    158      4
##   -err.-    4     13     17

Out of the 205 observations in the sample, the SVM model classifies 188 correctly (about 92 percent).

Tuning of the SVM classifier

bandsTask <- makeClassifTask(data = bands, target = "bandtype")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
kernels <- c("polynomial", "radial", "sigmoid")
svmParamSpace <- makeParamSet(
  makeDiscreteParam("kernel", values = kernels),
  makeIntegerParam("degree", lower = 1, upper = 3),
  makeNumericParam("cost", lower = 0.1, upper = 10),
  makeNumericParam("gamma", lower = 0.1, 10))

set.seed(10)
randSearch <- makeTuneControlRandom(maxit = 10)
cvForTuning <- makeResampleDesc("Holdout", split = 1/3)


parallelStartSocket(cpus = detectCores())
## Starting parallelization in mode=socket with cpus=8.
set.seed(10)
tunedSvmPars <- tuneParams("classif.svm", task = bandsTask,
                           resampling = cvForTuning,
                           par.set = svmParamSpace,
                           control = randSearch)
## [Tune] Started tuning learner classif.svm for parameter set:
##            Type len Def                    Constr Req Tunable Trafo
## kernel discrete   -   - polynomial,radial,sigmoid   -    TRUE     -
## degree  integer   -   -                    1 to 3   -    TRUE     -
## cost    numeric   -   -                 0.1 to 10   -    TRUE     -
## gamma   numeric   -   -                 0.1 to 10   -    TRUE     -
## With control class: TuneControlRandom
## Imputation value: 1
## Exporting objects to slaves for mode socket: .mlr.slave.options
## Mapping in parallel: mode = socket; level = mlr.tuneParams; cpus = 8; elements = 10.
## [Tune] Result: kernel=polynomial; degree=3; cost=9.58; gamma=9.81 : mmce.test.mean=0.1313869
parallelStop()
## Stopped parallelization. All cleaned up.
tunedSvmPars
## Tune result:
## Op. pars: kernel=polynomial; degree=3; cost=9.58; gamma=9.81
## mmce.test.mean=0.1313869
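
The tuned setting could then be used to train a final learner on the full task (a sketch, not part of the original analysis):

tunedSvm <- setHyperPars(makeLearner("classif.svm"), par.vals = tunedSvmPars$x)
tunedSvmModel <- train(tunedSvm, bandsTask)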

Logistic regression

logReg <- makeLearner("classif.logreg", predict.type = "prob")   # probabilistic learner, used in the AUC sketch below

logRegWrapper <- makeImputeWrapper("classif.logreg")   # logistic regression wrapped with imputation
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, bandsTask,
                             resampling = holdout,
                             measures = list(acc, fpr, fnr))
## Resampling: holdout
## Measures:             acc       fpr       fnr
## [Resample] iter 1:    0.8571429 0.0606061 0.4444444
## 
## Aggregated Result: acc.test.mean=0.8571429,fpr.test.mean=0.0606061,fnr.test.mean=0.4444444
## 
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.56/0.71 0.44/0.11 0.44     
##   noband 0.06/0.29 0.94/0.89 0.06     
##   -err.-      0.29      0.11 0.14     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band      5      4      4
##   noband    2     31      2
##   -err.-    2      4      6

The logistic regression model classifies about 86 percent of the observations correctly.
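
Because logReg was created with predict.type = "prob", a probability-based measure such as AUC could be estimated as well (a sketch, not part of the original output):

set.seed(123)
logRegProb <- resample(logReg, bandsTask, resampling = holdout,
                       measures = list(mlr::auc, mlr::acc))
logRegProb$aggr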

10-fold cross-validation

kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, bandsTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))
## Resampling: cross-validation
## Measures:             acc       fpr       fnr
## [Resample] iter 1:    0.9523810 0.0625000 0.0000000
## [Resample] iter 2:    0.9523810 0.0000000 0.2500000
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 3:    0.8500000 0.0625000 0.5000000
## [Resample] iter 4:    0.9000000 0.0714286 0.1666667
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 5:    0.8000000 0.1764706 0.3333333
## [Resample] iter 6:    0.9523810 0.0000000 0.2500000
## [Resample] iter 7:    0.9000000 0.0000000 0.3333333
## [Resample] iter 8:    0.9047619 0.1111111 0.0000000
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 9:    0.8571429 0.0000000 0.5000000
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 10:   0.7000000 0.2222222 1.0000000
## 
## Aggregated Result: acc.test.mean=0.8769048,fpr.test.mean=0.0706232,fnr.test.mean=0.3333333
## 
logRegwithImpute$aggr
## acc.test.mean fpr.test.mean fnr.test.mean 
##    0.87690476    0.07062325    0.33333333
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     band      noband    -err.-   
##   band   0.70/0.71 0.30/0.08 0.30     
##   noband 0.07/0.29 0.93/0.92 0.07     
##   -err.-      0.29      0.08 0.12     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     band noband -err.-
##   band     30     13     13
##   noband   12    150     12
##   -err.-   12     13     25

After 10-fold cross-validation the logistic regression accuracy improves by about 2 percentage points, to roughly 88 percent.

Conclusions