1. Data characterization. Automobile data.

This work uses a data set that collects information about automobiles.

This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky, -3 that it is probably pretty safe.

The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/speciality, etc…), and represents the average loss per car per year.

Categorical variables:

  1. Symboling
  2. Make
  3. Fuel type
  4. Aspiration
  5. Number of doors
  6. Body style
  7. Drive wheels
  8. Engine type
  9. Number of cylinders
  10. Engine size
  11. Fuel system

Quantitative discrete variables:

  1. Number
  2. Normalized losses
  3. Length
  4. Width
  5. Height
  6. Curb Weight
  7. Bore
  8. Stroke
  9. Compression ratio
  10. Peak rpm
  11. City mpg
  12. Highway mpg
  13. Price

rm(list = ls())
library(plyr)       # loaded before dplyr so that plyr does not mask dplyr verbs
library(dplyr)
library(tidyr)
library(stringr)
library(mlr)
library(tidyverse)
library(caret)
library(gmodels)
library(ggplot2)
library(e1071)
library(caTools)
library(class)
library(GGally)
library(parallelMap)
library(parallel)
library(rpart.plot)
library(ISLR)
library(tree)
library(corrplot)
library(factoextra)
library(umap)
library(Rtsne)

Reading the data

getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
duomne <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/automobiliai.csv", header = TRUE, sep = ";", dec = ".")

Data set analysis

Plotting histograms

duomne %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=20) 

ggplot(duomne, aes(x=as.factor(num_of_doors) )) +
  geom_bar(color="red", fill=rgb(0.7,0.4,0.5,0.6) )+ ggtitle("Number of doors") +
  xlab("Class") + ylab("Value")

getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
duom <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/automobiliaiA2.csv", header = TRUE, sep = ";", dec = ".")

The quantitative data are normalized with min-max scaling.

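# Min-max scaling: maps x linearly onto [0, 1]
min.max.norm <- function(x, x.max, x.min)
{
  return((x - x.min) / (x.max - x.min))
}
# Normalize columns 1, 9, 10 and 11 one whole column at a time
for (i in c(1, 9, 10, 11))
{
  col.max <- max(duom[, i])   # avoid shadowing base::max / base::min
  col.min <- min(duom[, i])
  duom[, i] <- min.max.norm(duom[, i], col.max, col.min)
}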
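As a minimal sketch of the same min-max formula, the snippet below scales a small made-up vector and also shows what happens to a constant column: its range is zero, so every value becomes 0/0 = NaN. The helper name `min_max` and the toy vectors are assumptions for illustration only; a constant column would be one plausible reason for an all-NaN column in a later `str()` output.

```r
# Hypothetical helper: vectorized min-max scaling of one numeric vector
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy values, not taken from the data set
scaled <- min_max(c(2, 4, 6, 10))
print(scaled)

# A constant column has zero range, so each element is 0/0 = NaN
constant <- min_max(c(5, 5, 5))
print(constant)
```

Columns with zero range therefore need to be removed or skipped before scaling.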

The categorical variable is converted to a factor.

duom$num_of_doors <- as.factor(duom$num_of_doors)
str(duom)
## 'data.frame':    159 obs. of  27 variables:
##  $ number           : num  0.00985 0.01478 0.02463 0.03448 0 ...
##  $ symboling        : int  1 1 1 1 1 0 0 0 1 1 ...
##  $ normalized_losses: int  14 14 18 18 1 1 188 188 11 18 ...
##  $ make             : int  0 0 0 0 1 1 1 1 1 1 ...
##  $ fuel_type        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ aspiration       : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ num_of_doors     : Factor w/ 2 levels "four","one": 2 2 2 2 1 2 1 2 1 1 ...
##  $ body_style       : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ drive_wheels     : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ engine_location  : num  NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
##  $ wheel_base       : num  0.00915 0.00458 0.00915 0.00915 0.00229 ...
##  $ length           : num  16.6 16.6 1.7 1.7 16.8 16.8 16.8 16.8 11.1 15.1 ...
##  $ width            : num  66.2 66.4 71.4 71.4 64.8 64.8 64.8 64.8 60.3 63.6 ...
##  $ height           : num  54.3 54.3 55.7 55.1 54.3 54.3 54.3 54.3 53.2 52 ...
##  $ curb_weight      : int  2337 2824 2844 3086 231 231 271 2765 188 1874 ...
##  $ engine_type      : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ num_of_cylinders : int  0 1 1 1 0 0 1 1 1 0 ...
##  $ engine_size      : int  1 1 1 11 18 18 14 14 61 1 ...
##  $ fuel_system      : int  0 0 0 0 0 0 0 0 1 1 ...
##  $ bore             : num  3.1 3.1 3.1 3.1 3.5 3.5 3.31 3.31 2.1 3.03 ...
##  $ stroke           : num  3.4 3.4 3.4 3.4 2.8 2.8 3.1 3.1 3.03 3.1 ...
##  $ compression_ratio: num  1 8 8.5 8.3 8.8 8.8 1 1 1.5 1.6 ...
##  $ horsepower       : int  1 1 1 10 1 1 11 11 48 70 ...
##  $ peak_rpm         : int  5500 5500 5500 5500 5800 5800 4250 4250 510 5400 ...
##  $ city_mpg         : int  24 18 1 1 23 23 21 21 47 38 ...
##  $ highway_mpg      : int  30 22 25 20 21 21 28 28 53 43 ...
##  $ price            : int  110 1450 171 23875 1430 11 2010 21 511 621 ...
summary(duom)
##      number          symboling      normalized_losses      make       
##  Min.   :0.00000   Min.   :0.0000   Min.   :  1.00    Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:  1.00    1st Qu.:1.0000  
##  Median :0.06404   Median :1.0000   Median : 11.00    Median :1.0000  
##  Mean   :0.17433   Mean   :0.6981   Mean   : 24.66    Mean   :0.9748  
##  3rd Qu.:0.23399   3rd Qu.:1.0000   3rd Qu.: 18.00    3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :256.00    Max.   :1.0000  
##                                                                       
##    fuel_type         aspiration     num_of_doors   body_style    
##  Min.   :0.00000   Min.   :0.0000   four:64      Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:1.0000   one :95      1st Qu.:0.0000  
##  Median :0.00000   Median :1.0000                Median :1.0000  
##  Mean   :0.09434   Mean   :0.8302                Mean   :0.5031  
##  3rd Qu.:0.00000   3rd Qu.:1.0000                3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000                Max.   :1.0000  
##                                                                  
##   drive_wheels    engine_location   wheel_base           length     
##  Min.   :0.0000   Min.   : NA     Min.   :0.000000   Min.   :  1.1  
##  1st Qu.:1.0000   1st Qu.: NA     1st Qu.:0.002288   1st Qu.: 11.7  
##  Median :1.0000   Median : NA     Median :0.005721   Median : 15.3  
##  Mean   :0.9497   Mean   :NaN     Mean   :0.039679   Mean   : 48.3  
##  3rd Qu.:1.0000   3rd Qu.: NA     3rd Qu.:0.008009   3rd Qu.: 18.7  
##  Max.   :1.0000   Max.   : NA     Max.   :1.000000   Max.   :202.6  
##                   NA's   :159                                       
##      width           height       curb_weight    engine_type    
##  Min.   :60.30   Min.   :41.40   Min.   :   1   Min.   :0.0000  
##  1st Qu.:64.00   1st Qu.:52.00   1st Qu.: 211   1st Qu.:0.0000  
##  Median :65.40   Median :54.10   Median :2004   Median :0.0000  
##  Mean   :65.51   Mean   :53.35   Mean   :1429   Mean   :0.2264  
##  3rd Qu.:66.50   3rd Qu.:55.50   3rd Qu.:2412   3rd Qu.:0.0000  
##  Max.   :71.70   Max.   :58.70   Max.   :4066   Max.   :1.0000  
##                                                                 
##  num_of_cylinders  engine_size      fuel_system          bore      
##  Min.   :0.0000   Min.   :  1.00   Min.   :0.0000   Min.   :2.100  
##  1st Qu.:0.0000   1st Qu.:  1.00   1st Qu.:0.0000   1st Qu.:3.050  
##  Median :0.0000   Median : 10.00   Median :1.0000   Median :3.270  
##  Mean   :0.1447   Mean   : 22.05   Mean   :0.5975   Mean   :3.166  
##  3rd Qu.:0.0000   3rd Qu.: 16.00   3rd Qu.:1.0000   3rd Qu.:3.540  
##  Max.   :1.0000   Max.   :258.00   Max.   :1.0000   Max.   :3.780  
##                                                                    
##      stroke      compression_ratio   horsepower        peak_rpm   
##  Min.   :2.070   Min.   : 1.000    Min.   :  1.00   Min.   : 410  
##  1st Qu.:3.100   1st Qu.: 1.000    1st Qu.:  1.00   1st Qu.:4800  
##  Median :3.230   Median : 1.400    Median : 52.00   Median :5200  
##  Mean   :3.215   Mean   : 5.156    Mean   : 39.41   Mean   :4928  
##  3rd Qu.:3.410   3rd Qu.: 8.400    3rd Qu.: 69.00   3rd Qu.:5500  
##  Max.   :4.100   Max.   :23.000    Max.   :200.00   Max.   :6600  
##                                                                   
##     city_mpg      highway_mpg        price        
##  Min.   : 1.00   Min.   : 1.00   Min.   :    1.0  
##  1st Qu.:22.00   1st Qu.:27.00   1st Qu.:  112.5  
##  Median :26.00   Median :32.00   Median :  781.0  
##  Mean   :23.84   Mean   :31.57   Mean   : 3879.9  
##  3rd Qu.:31.00   3rd Qu.:37.00   3rd Qu.: 6479.5  
##  Max.   :47.00   Max.   :54.00   Max.   :35056.0  
## 

LDA model

An LDA classifier is created and trained on the constructed classification task.

duom <- duom[,-10]  # drop column 10 (engine_location), which is all NaN after normalization
duom <- as_tibble(duom)
autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
lda <- makeLearner("classif.lda")
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_lda <- resample(learner = lda, task = autoTask, resampling = holdout, measures = list(mmce, acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.2187500 0.7812500
## 
## Aggregated Result: mmce.test.mean=0.2187500,acc.test.mean=0.7812500
## 

For the holdout validation of the LDA model, the data are split into five parts: four fifths form the training set and one fifth the test set (split = 4/5, stratified by class).

The LDA model reaches 78 percent accuracy.
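A minimal base-R sketch of a stratified 4/5 holdout split follows. The helper `stratified_holdout` is hypothetical and is not how mlr implements `makeResampleDesc`; the class sizes (64 and 95) are taken from the `summary(duom)` output above.

```r
# Hypothetical helper (not part of mlr): sample the training index per class
stratified_holdout <- function(y, ratio = 4/5) {
  by_class <- split(seq_along(y), y)
  train <- lapply(by_class, function(idx) {
    idx[sample.int(length(idx), size = round(ratio * length(idx)))]
  })
  sort(unlist(train, use.names = FALSE))
}

# Class sizes from summary(duom): 64 "four" and 95 "one"
y <- factor(rep(c("four", "one"), times = c(64, 95)))

set.seed(123)
train_idx <- stratified_holdout(y)
test_idx  <- setdiff(seq_along(y), train_idx)

print(table(y[train_idx]))  # class counts in the training part
print(table(y[test_idx]))   # class counts in the test part
```

Under these assumptions the test part holds 32 observations, which agrees with the row totals of the absolute confusion matrix reported for the holdout run.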

calculateConfusionMatrix(holdoutCV_lda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.85/0.69 0.15/0.12 0.15     
##   one    0.26/0.31 0.74/0.88 0.26     
##   -err.-      0.31      0.12 0.22     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four     11   2      2
##   one       5  14      5
##   -err.-    5   2      7

The accuracy of the LDA model is \(\sim 78\%\). Of the 32 test observations, 7 are misclassified.
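The accuracy can be recomputed directly from the absolute confusion matrix printed above. The sketch below simply re-enters those four counts by hand; the object names are illustrative.

```r
# Absolute confusion matrix of the LDA holdout run (rows = true, columns = predicted)
cm <- matrix(c(11,  2,
                5, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("four", "one"),
                             predicted = c("four", "one")))

n_total   <- sum(cm)         # all test observations
n_correct <- sum(diag(cm))   # correctly classified observations
accuracy  <- n_correct / n_total

print(c(total = n_total, correct = n_correct, wrong = n_total - n_correct))
print(accuracy)
```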

duom <- as_tibble(duom)

#qda <- makeLearner("classif.qda")
#qdaModel <- train(qda, autoTask)


#holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
#set.seed(10)
#holdout_qdaCV <- resample(learner = qda, task = autoTask, resampling = holdout, measures = list(mlr::mmce, mlr::acc))


#Error in qda.default(x, grouping, ...) : rank deficiency in group four

The QDA model fails because of a collinearity problem (rank deficiency in group four). Even after a correlation analysis and the removal of mutually correlated variables, the QDA model still could not be fitted to the data.
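QDA estimates a separate covariance matrix for every class, so it needs the feature matrix to have full column rank within each class. The sketch below shows one way such a per-group rank check could look in base R. The toy data frame, the group labels and the variable names are all made up for illustration and are not the automobile data.

```r
# Toy data (hypothetical): column c is an exact linear combination of a and b
set.seed(1)
x <- data.frame(a = rnorm(10), b = rnorm(10))
x$c <- x$a + x$b
group <- rep(c("g1", "g2"), each = 5)

# Rank of the centred feature matrix inside each group
group_rank <- sapply(split(x, group), function(g) {
  qr(scale(as.matrix(g), center = TRUE, scale = FALSE))$rank
})

print(group_rank)
print(group_rank < ncol(x))  # TRUE marks a rank-deficient group
```

A group flagged here would be expected to trigger the same kind of error; binary or near-constant columns within one class are a common cause.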

K-fold validation

autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
kFold <- makeResampleDesc(method = "RepCV", folds = 10,  stratify = TRUE)
set.seed(10)
kfold_ldaCV <- resample(learner = lda, task = autoTask, resampling = kFold, measures = list(mlr::mmce, mlr::acc))

With repeated 10-fold cross-validation the LDA model reaches an accuracy of \(79\) percent.

kFold <- makeResampleDesc(method = "RepCV", folds = 10,  stratify = TRUE)
set.seed(10)
kfold_ldaCV <- resample(learner = lda, task = autoTask, resampling = kFold, measures = list(mlr::mmce, mlr::acc))
## Resampling: repeated cross-validation
## Measures:             mmce      acc
## [Resample] iter 1:    0.1333333 0.8666667
## [Resample] iter 2:    0.1764706 0.8235294
## [Resample] iter 3:    0.1875000 0.8125000
## [Resample] iter 4:    0.1764706 0.8235294
## [Resample] iter 5:    0.1176471 0.8823529
## [Resample] iter 6:    0.3125000 0.6875000
## [Resample] iter 7:    0.2000000 0.8000000
## [Resample] iter 8:    0.2500000 0.7500000
## [Resample] iter 9:    0.2000000 0.8000000
## [Resample] iter 10:   0.2000000 0.8000000
## [Resample] iter 11:   0.1250000 0.8750000
## [Resample] iter 12:   0.2352941 0.7647059
## [Resample] iter 13:   0.1875000 0.8125000
## [Resample] iter 14:   0.3750000 0.6250000
## [Resample] iter 15:   0.2941176 0.7058824
## [Resample] iter 16:   0.0666667 0.9333333
## [Resample] iter 17:   0.1333333 0.8666667
## [Resample] iter 18:   0.3333333 0.6666667
## [Resample] iter 19:   0.2500000 0.7500000
## [Resample] iter 20:   0.1875000 0.8125000
## [Resample] iter 21:   0.3125000 0.6875000
## [Resample] iter 22:   0.2941176 0.7058824
## [Resample] iter 23:   0.3125000 0.6875000
## [Resample] iter 24:   0.2000000 0.8000000
## [Resample] iter 25:   0.0625000 0.9375000
## [Resample] iter 26:   0.2000000 0.8000000
## [Resample] iter 27:   0.2941176 0.7058824
## [Resample] iter 28:   0.2666667 0.7333333
## [Resample] iter 29:   0.1764706 0.8235294
## [Resample] iter 30:   0.1333333 0.8666667
## [Resample] iter 31:   0.3529412 0.6470588
## [Resample] iter 32:   0.0625000 0.9375000
## [Resample] iter 33:   0.2000000 0.8000000
## [Resample] iter 34:   0.2500000 0.7500000
## [Resample] iter 35:   0.2941176 0.7058824
## [Resample] iter 36:   0.2500000 0.7500000
## [Resample] iter 37:   0.2000000 0.8000000
## [Resample] iter 38:   0.2000000 0.8000000
## [Resample] iter 39:   0.1875000 0.8125000
## [Resample] iter 40:   0.1250000 0.8750000
## [Resample] iter 41:   0.0625000 0.9375000
## [Resample] iter 42:   0.2666667 0.7333333
## [Resample] iter 43:   0.2500000 0.7500000
## [Resample] iter 44:   0.2000000 0.8000000
## [Resample] iter 45:   0.1764706 0.8235294
## [Resample] iter 46:   0.2500000 0.7500000
## [Resample] iter 47:   0.1764706 0.8235294
## [Resample] iter 48:   0.1250000 0.8750000
## [Resample] iter 49:   0.2666667 0.7333333
## [Resample] iter 50:   0.1250000 0.8750000
## [Resample] iter 51:   0.2352941 0.7647059
## [Resample] iter 52:   0.4666667 0.5333333
## [Resample] iter 53:   0.2941176 0.7058824
## [Resample] iter 54:   0.2500000 0.7500000
## [Resample] iter 55:   0.0666667 0.9333333
## [Resample] iter 56:   0.3125000 0.6875000
## [Resample] iter 57:   0.1250000 0.8750000
## [Resample] iter 58:   0.2500000 0.7500000
## [Resample] iter 59:   0.0666667 0.9333333
## [Resample] iter 60:   0.0625000 0.9375000
## [Resample] iter 61:   0.1250000 0.8750000
## [Resample] iter 62:   0.0588235 0.9411765
## [Resample] iter 63:   0.3529412 0.6470588
## [Resample] iter 64:   0.2000000 0.8000000
## [Resample] iter 65:   0.2666667 0.7333333
## [Resample] iter 66:   0.1875000 0.8125000
## [Resample] iter 67:   0.1875000 0.8125000
## [Resample] iter 68:   0.0000000 1.0000000
## [Resample] iter 69:   0.2666667 0.7333333
## [Resample] iter 70:   0.2352941 0.7647059
## [Resample] iter 71:   0.4666667 0.5333333
## [Resample] iter 72:   0.1176471 0.8823529
## [Resample] iter 73:   0.0666667 0.9333333
## [Resample] iter 74:   0.0625000 0.9375000
## [Resample] iter 75:   0.3125000 0.6875000
## [Resample] iter 76:   0.2500000 0.7500000
## [Resample] iter 77:   0.1250000 0.8750000
## [Resample] iter 78:   0.3750000 0.6250000
## [Resample] iter 79:   0.0000000 1.0000000
## [Resample] iter 80:   0.3750000 0.6250000
## [Resample] iter 81:   0.2000000 0.8000000
## [Resample] iter 82:   0.1333333 0.8666667
## [Resample] iter 83:   0.5000000 0.5000000
## [Resample] iter 84:   0.2666667 0.7333333
## [Resample] iter 85:   0.1176471 0.8823529
## [Resample] iter 86:   0.0666667 0.9333333
## [Resample] iter 87:   0.2352941 0.7647059
## [Resample] iter 88:   0.2352941 0.7647059
## [Resample] iter 89:   0.1875000 0.8125000
## [Resample] iter 90:   0.1875000 0.8125000
## [Resample] iter 91:   0.2666667 0.7333333
## [Resample] iter 92:   0.1764706 0.8235294
## [Resample] iter 93:   0.1250000 0.8750000
## [Resample] iter 94:   0.1875000 0.8125000
## [Resample] iter 95:   0.1875000 0.8125000
## [Resample] iter 96:   0.2666667 0.7333333
## [Resample] iter 97:   0.3750000 0.6250000
## [Resample] iter 98:   0.2500000 0.7500000
## [Resample] iter 99:   0.1250000 0.8750000
## [Resample] iter 100:  0.1875000 0.8125000
## 
## Aggregated Result: mmce.test.mean=0.2085270,acc.test.mean=0.7914730
## 
calculateConfusionMatrix(kfold_ldaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.75/0.74 0.25/0.17 0.25     
##   one    0.18/0.26 0.82/0.83 0.18     
##   -err.-      0.26      0.17 0.21     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four    480 160    160
##   one     172 778    172
##   -err.-  172 160    332

Of the 1590 test predictions (159 observations in each of 10 repetitions), the model classifies 1258 correctly.
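The 100 iterations and 1590 predictions follow from repeating a 10-fold split ten times, as the log above shows. A minimal, unstratified base-R sketch of that bookkeeping is given below; the fold assignment is an illustrative assumption and not mlr's internal procedure.

```r
n     <- 159   # observations in the data set
folds <- 10
reps  <- 10    # number of repetitions, matching the 100 iterations in the log

set.seed(10)
test_sizes <- unlist(lapply(seq_len(reps), function(r) {
  # Randomly assign every observation to one of the folds
  fold_id <- sample(rep(seq_len(folds), length.out = n))
  as.vector(table(fold_id))   # size of each test fold in this repetition
}))

print(length(test_sizes))  # number of resampling iterations
print(sum(test_sizes))     # total number of test predictions
```

Every observation is predicted exactly once per repetition, which is why the absolute confusion matrix sums to ten times the sample size.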

LOO method

LOO <- makeResampleDesc(method = "LOO")
set.seed(50)
lda <- makeLearner("classif.lda")
lda_LOO <- resample(learner = lda, task = autoTask, resampling = LOO,
                    measures = list(mmce, acc))
## Resampling: LOO
## Measures:             mmce      acc
## [Resample] iter 1:    0.0000000 1.0000000
## [Resample] iter 2:    0.0000000 1.0000000
## [Resample] iter 3:    0.0000000 1.0000000
## [Resample] iter 4:    0.0000000 1.0000000
## [Resample] iter 5:    1.0000000 0.0000000
## [Resample] iter 6:    0.0000000 1.0000000
## [Resample] iter 7:    1.0000000 0.0000000
## [Resample] iter 8:    0.0000000 1.0000000
## [Resample] iter 9:    0.0000000 1.0000000
## [Resample] iter 10:   0.0000000 1.0000000
## [Resample] iter 11:   0.0000000 1.0000000
## [Resample] iter 12:   0.0000000 1.0000000
## [Resample] iter 13:   0.0000000 1.0000000
## [Resample] iter 14:   0.0000000 1.0000000
## [Resample] iter 15:   1.0000000 0.0000000
## [Resample] iter 16:   0.0000000 1.0000000
## [Resample] iter 17:   0.0000000 1.0000000
## [Resample] iter 18:   1.0000000 0.0000000
## [Resample] iter 19:   0.0000000 1.0000000
## [Resample] iter 20:   0.0000000 1.0000000
## [Resample] iter 21:   0.0000000 1.0000000
## [Resample] iter 22:   0.0000000 1.0000000
## [Resample] iter 23:   0.0000000 1.0000000
## [Resample] iter 24:   0.0000000 1.0000000
## [Resample] iter 25:   0.0000000 1.0000000
## [Resample] iter 26:   0.0000000 1.0000000
## [Resample] iter 27:   1.0000000 0.0000000
## [Resample] iter 28:   1.0000000 0.0000000
## [Resample] iter 29:   0.0000000 1.0000000
## [Resample] iter 30:   0.0000000 1.0000000
## [Resample] iter 31:   0.0000000 1.0000000
## [Resample] iter 32:   1.0000000 0.0000000
## [Resample] iter 33:   0.0000000 1.0000000
## [Resample] iter 34:   0.0000000 1.0000000
## [Resample] iter 35:   0.0000000 1.0000000
## [Resample] iter 36:   0.0000000 1.0000000
## [Resample] iter 37:   0.0000000 1.0000000
## [Resample] iter 38:   0.0000000 1.0000000
## [Resample] iter 39:   0.0000000 1.0000000
## [Resample] iter 40:   0.0000000 1.0000000
## [Resample] iter 41:   0.0000000 1.0000000
## [Resample] iter 42:   0.0000000 1.0000000
## [Resample] iter 43:   0.0000000 1.0000000
## [Resample] iter 44:   0.0000000 1.0000000
## [Resample] iter 45:   0.0000000 1.0000000
## [Resample] iter 46:   1.0000000 0.0000000
## [Resample] iter 47:   1.0000000 0.0000000
## [Resample] iter 48:   0.0000000 1.0000000
## [Resample] iter 49:   1.0000000 0.0000000
## [Resample] iter 50:   0.0000000 1.0000000
## [Resample] iter 51:   0.0000000 1.0000000
## [Resample] iter 52:   0.0000000 1.0000000
## [Resample] iter 53:   0.0000000 1.0000000
## [Resample] iter 54:   0.0000000 1.0000000
## [Resample] iter 55:   0.0000000 1.0000000
## [Resample] iter 56:   0.0000000 1.0000000
## [Resample] iter 57:   0.0000000 1.0000000
## [Resample] iter 58:   0.0000000 1.0000000
## [Resample] iter 59:   0.0000000 1.0000000
## [Resample] iter 60:   1.0000000 0.0000000
## [Resample] iter 61:   1.0000000 0.0000000
## [Resample] iter 62:   1.0000000 0.0000000
## [Resample] iter 63:   0.0000000 1.0000000
## [Resample] iter 64:   1.0000000 0.0000000
## [Resample] iter 65:   1.0000000 0.0000000
## [Resample] iter 66:   0.0000000 1.0000000
## [Resample] iter 67:   0.0000000 1.0000000
## [Resample] iter 68:   1.0000000 0.0000000
## [Resample] iter 69:   0.0000000 1.0000000
## [Resample] iter 70:   0.0000000 1.0000000
## [Resample] iter 71:   0.0000000 1.0000000
## [Resample] iter 72:   0.0000000 1.0000000
## [Resample] iter 73:   0.0000000 1.0000000
## [Resample] iter 74:   0.0000000 1.0000000
## [Resample] iter 75:   0.0000000 1.0000000
## [Resample] iter 76:   0.0000000 1.0000000
## [Resample] iter 77:   0.0000000 1.0000000
## [Resample] iter 78:   0.0000000 1.0000000
## [Resample] iter 79:   0.0000000 1.0000000
## [Resample] iter 80:   0.0000000 1.0000000
## [Resample] iter 81:   0.0000000 1.0000000
## [Resample] iter 82:   0.0000000 1.0000000
## [Resample] iter 83:   0.0000000 1.0000000
## [Resample] iter 84:   0.0000000 1.0000000
## [Resample] iter 85:   0.0000000 1.0000000
## [Resample] iter 86:   0.0000000 1.0000000
## [Resample] iter 87:   1.0000000 0.0000000
## [Resample] iter 88:   0.0000000 1.0000000
## [Resample] iter 89:   0.0000000 1.0000000
## [Resample] iter 90:   1.0000000 0.0000000
## [Resample] iter 91:   0.0000000 1.0000000
## [Resample] iter 92:   0.0000000 1.0000000
## [Resample] iter 93:   0.0000000 1.0000000
## [Resample] iter 94:   0.0000000 1.0000000
## [Resample] iter 95:   0.0000000 1.0000000
## [Resample] iter 96:   0.0000000 1.0000000
## [Resample] iter 97:   0.0000000 1.0000000
## [Resample] iter 98:   0.0000000 1.0000000
## [Resample] iter 99:   0.0000000 1.0000000
## [Resample] iter 100:  1.0000000 0.0000000
## [Resample] iter 101:  0.0000000 1.0000000
## [Resample] iter 102:  0.0000000 1.0000000
## [Resample] iter 103:  0.0000000 1.0000000
## [Resample] iter 104:  0.0000000 1.0000000
## [Resample] iter 105:  0.0000000 1.0000000
## [Resample] iter 106:  1.0000000 0.0000000
## [Resample] iter 107:  1.0000000 0.0000000
## [Resample] iter 108:  0.0000000 1.0000000
## [Resample] iter 109:  0.0000000 1.0000000
## [Resample] iter 110:  0.0000000 1.0000000
## [Resample] iter 111:  0.0000000 1.0000000
## [Resample] iter 112:  1.0000000 0.0000000
## [Resample] iter 113:  0.0000000 1.0000000
## [Resample] iter 114:  0.0000000 1.0000000
## [Resample] iter 115:  0.0000000 1.0000000
## [Resample] iter 116:  0.0000000 1.0000000
## [Resample] iter 117:  0.0000000 1.0000000
## [Resample] iter 118:  0.0000000 1.0000000
## [Resample] iter 119:  1.0000000 0.0000000
## [Resample] iter 120:  0.0000000 1.0000000
## [Resample] iter 121:  0.0000000 1.0000000
## [Resample] iter 122:  0.0000000 1.0000000
## [Resample] iter 123:  1.0000000 0.0000000
## [Resample] iter 124:  0.0000000 1.0000000
## [Resample] iter 125:  0.0000000 1.0000000
## [Resample] iter 126:  0.0000000 1.0000000
## [Resample] iter 127:  0.0000000 1.0000000
## [Resample] iter 128:  0.0000000 1.0000000
## [Resample] iter 129:  0.0000000 1.0000000
## [Resample] iter 130:  0.0000000 1.0000000
## [Resample] iter 131:  0.0000000 1.0000000
## [Resample] iter 132:  0.0000000 1.0000000
## [Resample] iter 133:  0.0000000 1.0000000
## [Resample] iter 134:  1.0000000 0.0000000
## [Resample] iter 135:  1.0000000 0.0000000
## [Resample] iter 136:  0.0000000 1.0000000
## [Resample] iter 137:  1.0000000 0.0000000
## [Resample] iter 138:  0.0000000 1.0000000
## [Resample] iter 139:  0.0000000 1.0000000
## [Resample] iter 140:  0.0000000 1.0000000
## [Resample] iter 141:  1.0000000 0.0000000
## [Resample] iter 142:  1.0000000 0.0000000
## [Resample] iter 143:  0.0000000 1.0000000
## [Resample] iter 144:  0.0000000 1.0000000
## [Resample] iter 145:  0.0000000 1.0000000
## [Resample] iter 146:  0.0000000 1.0000000
## [Resample] iter 147:  0.0000000 1.0000000
## [Resample] iter 148:  0.0000000 1.0000000
## [Resample] iter 149:  0.0000000 1.0000000
## [Resample] iter 150:  1.0000000 0.0000000
## [Resample] iter 151:  0.0000000 1.0000000
## [Resample] iter 152:  1.0000000 0.0000000
## [Resample] iter 153:  0.0000000 1.0000000
## [Resample] iter 154:  0.0000000 1.0000000
## [Resample] iter 155:  0.0000000 1.0000000
## [Resample] iter 156:  0.0000000 1.0000000
## [Resample] iter 157:  0.0000000 1.0000000
## [Resample] iter 158:  0.0000000 1.0000000
## [Resample] iter 159:  0.0000000 1.0000000
## 
## Aggregated Result: mmce.test.mean=0.1949686,acc.test.mean=0.8050314
## 
lda_LOO$aggr
## mmce.test.mean  acc.test.mean 
##      0.1949686      0.8050314

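With the LOO method the LDA model classifies with \(\sim 81\%\) accuracy (acc.test.mean = 0.805), only marginally higher than the holdout and k-fold estimates. Applying this method did not substantially improve the classification.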
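In leave-one-out validation each iteration tests a single observation, so its mmce is either 0 or 1 and the aggregated value is just the share of misclassified observations. The sketch below reproduces the aggregate from the counts; the error count of 31 is not printed in the log but is inferred here from 0.1949686 × 159.

```r
n        <- 159   # one iteration per observation
n_errors <- 31    # assumption: inferred from the aggregated mmce, 0.1949686 * 159

# One 0/1 outcome per LOO iteration (1 = misclassified)
outcomes <- c(rep(1, n_errors), rep(0, n - n_errors))

mmce <- mean(outcomes)
acc  <- 1 - mmce
print(round(c(mmce = mmce, acc = acc), 7))
```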

KNN (k-nearest neighbours) method

autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:20))
gridSearch <- makeTuneControlGrid()
set.seed(10)
holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
tunedKCv <- tuneParams("classif.knn", task = autoTask, resampling = holdout, par.set = knnParamSpace, control = gridSearch)
## [Tune] Started tuning learner classif.knn for parameter set:
##       Type len Def                                   Constr Req Tunable Trafo
## k discrete   -   - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1...   -    TRUE     -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=1
## [Tune-y] 1: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 2: k=2
## [Tune-y] 2: mmce.test.mean=0.6296296; time: 0.0 min
## [Tune-x] 3: k=3
## [Tune-y] 3: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 4: k=4
## [Tune-y] 4: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 5: k=5
## [Tune-y] 5: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 6: k=6
## [Tune-y] 6: mmce.test.mean=0.4814815; time: 0.0 min
## [Tune-x] 7: k=7
## [Tune-y] 7: mmce.test.mean=0.5185185; time: 0.0 min
## [Tune-x] 8: k=8
## [Tune-y] 8: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 9: k=9
## [Tune-y] 9: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 10: k=10
## [Tune-y] 10: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 11: k=11
## [Tune-y] 11: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 12: k=12
## [Tune-y] 12: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 13: k=13
## [Tune-y] 13: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 14: k=14
## [Tune-y] 14: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 15: k=15
## [Tune-y] 15: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 16: k=16
## [Tune-y] 16: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 17: k=17
## [Tune-y] 17: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 18: k=18
## [Tune-y] 18: mmce.test.mean=0.3703704; time: 0.0 min
## [Tune-x] 19: k=19
## [Tune-y] 19: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 20: k=20
## [Tune-y] 20: mmce.test.mean=0.4259259; time: 0.0 min
## [Tune] Result: k=18 : mmce.test.mean=0.3703704
knnTuningData <- generateHyperParsEffectData(tunedKCv)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean", plot.type = "line") + theme_bw()

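The plot shows the misclassification error (mmce) for each value of k. According to the tuning log the lowest error is reached at k = 18 (mmce = 0.37); the KNN model below is nevertheless fitted with k = 20 (mmce = 0.43 in the tuning run).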
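The tuning log above lists the holdout mmce for every candidate k. As a small sketch, the values can be re-entered by hand and the minimizer picked with `which.min`; the vector below is copied from that log and the object names are illustrative.

```r
# Holdout mmce for k = 1..20, copied from the tuning log
mmce_by_k <- c(0.4444444, 0.6296296, 0.4444444, 0.4629630, 0.4444444,
               0.4814815, 0.5185185, 0.4629630, 0.4444444, 0.4444444,
               0.4629630, 0.4074074, 0.4074074, 0.4629630, 0.4074074,
               0.4444444, 0.4074074, 0.3703704, 0.4444444, 0.4259259)

best_k <- which.min(mmce_by_k)   # index of the smallest error equals k here
print(best_k)
print(mmce_by_k[c(best_k, 20)])  # error at the best k and at k = 20
```

Because the curve comes from a single holdout split of a small sample, differences of a few hundredths between neighbouring k values should be read with caution.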

knn <- makeLearner("classif.knn", par.vals = list("k" = 20))
holdoutNoStrat <- makeResampleDesc(method = "Holdout", split = 0.5, stratify = FALSE)
set.seed(10)
kFoldCV <- resample(learner = knn, task = autoTask, resampling = holdoutNoStrat, measures = list(mmce, acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.4250000 0.5750000
## 
## Aggregated Result: mmce.test.mean=0.4250000,acc.test.mean=0.5750000
## 

The nearest-neighbours model with k = 20 correctly classifies about 58 percent of the data (acc.test.mean = 0.575).

calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.14/0.56 0.86/0.42 0.86     
##   one    0.09/0.44 0.91/0.58 0.09     
##   -err.-      0.44      0.42 0.42     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four      5  30     30
##   one       4  41      4
##   -err.-    4  30     34

Of the 80 test observations, the model classifies 46 correctly.
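Overall accuracy hides how unevenly the two classes are treated. The sketch below recomputes the accuracy and the per-class share of correct predictions from the absolute confusion matrix above, with the counts re-entered by hand.

```r
# Absolute confusion matrix of the KNN holdout run (rows = true, columns = predicted)
cm <- matrix(c(5, 30,
               4, 41),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("four", "one"),
                             predicted = c("four", "one")))

accuracy <- sum(diag(cm)) / sum(cm)
recall   <- diag(cm) / rowSums(cm)   # share of each true class predicted correctly

print(accuracy)
print(round(recall, 2))
```

Most true "four" observations are predicted as "one", so the model's accuracy comes mainly from the larger class.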

Validation of the KNN classifier

Holdout validation.

ATask <- makeClassifTask(data = duom , target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
set.seed(10)
holdoutKNN <- resample(learner = knn, task = ATask, resampling = holdout, measures = list(mlr::mmce, mlr::acc))
## Resampling: holdout
## Measures:             mmce      acc
## [Resample] iter 1:    0.3888889 0.6111111
## 
## Aggregated Result: mmce.test.mean=0.3888889,acc.test.mean=0.6111111
## 
holdoutKNN$aggr
## mmce.test.mean  acc.test.mean 
##      0.3888889      0.6111111

With stratified holdout validation (split = 2/3) the KNN model correctly classifies 61 percent of the data. The accuracy is roughly 3 percentage points higher than with the unstratified 50/50 split.

calculateConfusionMatrix(holdoutKNN$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.27/0.55 0.73/0.37 0.73     
##   one    0.16/0.45 0.84/0.63 0.16     
##   -err.-      0.45      0.37 0.39     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four      6  16     16
##   one       5  27      5
##   -err.-    5  16     21

Of the 54 test observations, 33 are classified correctly.

K-fold validation.

kfold <- makeResampleDesc(method = "RepCV", folds = 10)
set.seed(10)
kfoldAUTO <- resample(learner = knn, task = ATask, resampling = kfold , measures = list(mlr::mmce, mlr::acc))
kfoldAUTO$aggr 
## mmce.test.mean  acc.test.mean 
##       0.421125       0.578875

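With repeated 10-fold cross-validation the share of correctly classified observations for the KNN model drops to about 58 percent (acc.test.mean = 0.579), roughly 3 percentage points below the holdout estimate.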

calculateConfusionMatrix(kfoldAUTO$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.14/0.43 0.86/0.40 0.86     
##   one    0.13/0.57 0.87/0.60 0.13     
##   -err.-      0.57      0.40 0.42     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four     90 550    550
##   one     119 831    119
##   -err.-  119 550    669

Of the 1590 test predictions, 921 are classified correctly.

SVM

automobiliai <- as_tibble(duom)
autoTask <- makeClassifTask(data = automobiliai, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
cvForTuning <- makeResampleDesc("Holdout", split = 0.8)
kernels <- c("polynomial", "radial", "sigmoid")
svmParam <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
                         makeIntegerParam("degree", lower = 1, upper = 3),
                         makeNumericParam("cost", lower = 0.1, upper = 10),
                         makeNumericParam("gamma", lower = 0.1, upper = 10))


randSearch <- makeTuneControlRandom(maxit = 10)
outer <- makeResampleDesc("CV", iters = 3)
svmWrapper <- makeTuneWrapper("classif.svm", resampling = cvForTuning,
                              par.set = svmParam, control = randSearch)
cvWithTuning <- resample(learner = svmWrapper, task = autoTask, resampling = outer, measures = list(mmce, acc))
cvWithTuning
## Resample Result
## Task: automobiliai
## Learner: classif.svm.tuned
## Aggr perf: mmce.test.mean=0.3081761,acc.test.mean=0.6918239
## Runtime: 0.87598

The tuned SVM model correctly classifies 69 percent of the data (acc.test.mean = 0.692).

calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.69/0.60 0.31/0.23 0.31     
##   one    0.31/0.40 0.69/0.77 0.31     
##   -err.-      0.40      0.23 0.31     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four     44  20     20
##   one      29  66     29
##   -err.-   29  20     49

Logistic regression

autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
logReg <- makeLearner("classif.logreg", predict.type = "prob")

logRegWrapper <- makeImputeWrapper("classif.logreg")
holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, autoTask,
                             resampling = holdout,
                             measures = list(acc, fpr, fnr))
## Resampling: holdout
## Measures:             acc       fpr       fnr
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 1:    0.6851852 0.2812500 0.3636364
## 
## Aggregated Result: acc.test.mean=0.6851852,fpr.test.mean=0.2812500,fnr.test.mean=0.3636364
## 
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.64/0.61 0.36/0.26 0.36     
##   one    0.28/0.39 0.72/0.74 0.28     
##   -err.-      0.39      0.26 0.31     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four     14   8      8
##   one       9  23      9
##   -err.-    9   8     17

The logistic regression model correctly classifies 69 percent of the data.

10-fold cross-validation

kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, autoTask,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))
## Resampling: cross-validation
## Measures:             acc       fpr       fnr
## [Resample] iter 1:    0.7500000 0.2000000 0.3333333
## [Resample] iter 2:    0.8125000 0.2000000 0.1666667
## [Resample] iter 3:    0.7500000 0.1666667 0.5000000
## [Resample] iter 4:    0.8750000 0.1428571 0.1111111
## [Resample] iter 5:    0.6875000 0.3333333 0.2857143
## [Resample] iter 6:    0.9375000 0.0000000 0.1428571
## [Resample] iter 7:    0.7333333 0.2500000 0.2857143
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 8:    0.7500000 0.1538462 0.6666667
## [Resample] iter 9:    0.8125000 0.1428571 0.2222222
## [Resample] iter 10:   0.8125000 0.1000000 0.3333333
## 
## Aggregated Result: acc.test.mean=0.7920833,fpr.test.mean=0.1689560,fnr.test.mean=0.3047619
## 
logRegwithImpute$aggr
## acc.test.mean fpr.test.mean fnr.test.mean 
##     0.7920833     0.1689560     0.3047619
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     four      one       -err.-   
##   four   0.73/0.75 0.27/0.18 0.27     
##   one    0.17/0.25 0.83/0.82 0.17     
##   -err.-      0.25      0.18 0.21     
## 
## 
## Absolute confusion matrix:
##         predicted
## true     four one -err.-
##   four     47  17     17
##   one      16  79     16
##   -err.-   16  17     33

With 10-fold cross-validation the accuracy of the logistic regression model increases by about 11 percentage points, from 69 to 79 percent.
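To compare the models side by side, the aggregated accuracies printed in the resampling outputs above can be collected into one named vector. The sketch below only re-enters those reported values; the labels are illustrative, and the LDA holdout value is written as 25/32, the exact fraction behind 0.78125.

```r
# acc.test.mean values copied from the resampling outputs above
acc <- c("LDA holdout"      = 0.7812500,
         "LDA RepCV"        = 0.7914730,
         "LDA LOO"          = 0.8050314,
         "KNN holdout 1/2"  = 0.5750000,
         "KNN holdout 2/3"  = 0.6111111,
         "KNN RepCV"        = 0.5788750,
         "SVM nested CV"    = 0.6918239,
         "LogReg holdout"   = 0.6851852,
         "LogReg 10-fold"   = 0.7920833)

ranking <- sort(acc, decreasing = TRUE)
print(round(ranking, 3))

best <- names(which.max(acc))
print(best)
```

The estimates come from different resampling schemes and random seeds, so small gaps between them are not necessarily meaningful.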

Conclusions