This work uses a data set containing accumulated data about automobiles.
This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price; if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process “symboling”. A value of +3 indicates that the auto is risky; -3 that it is probably quite safe.
The third entity is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagons, sports/specialty, etc.) and represents the average loss per car per year.
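To make the idea of normalization within a size classification concrete, the sketch below computes a within-group ratio on invented numbers; the column names (size_class, losses) are hypothetical and not taken from this data set.
library(dplyr)
# Hypothetical illustration: express each car's losses relative to the
# mean losses of its own size class, so values are comparable across classes.
losses_demo <- data.frame(
  size_class = c("small", "small", "wagon", "wagon"),
  losses     = c(100, 150, 90, 120)
)
losses_demo %>%
  group_by(size_class) %>%
  mutate(normalized_losses = losses / mean(losses)) %>%
  ungroup()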
Categorical variables:
Quantitative discrete variables:
rm(list = ls())
library(dplyr)
library(tidyr)
library(stringr)
library(mlr)
library(tidyverse)
library(plyr)
library(caret)
library(gmodels)
library(ggplot2)
library(e1071)
library(caTools)
library(class)
library(GGally)
library(parallelMap)
library(parallel)
library(rpart.plot)
require(ISLR)
require(tree)
library(corrplot)
library(factoextra)
library(umap)
library(Rtsne)
getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
duomne <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/automobiliai.csv", header = TRUE, sep = ";", dec = ".")
Analysis of the data set
duomne %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=20)
ggplot(duomne, aes(x=as.factor(num_of_doors) )) +
geom_bar(color="red", fill=rgb(0.7,0.4,0.5,0.6) )+ ggtitle("Number of doors") +
xlab("Class") + ylab("Value")
getwd()
## [1] "C:/Users/skirmantas/OneDrive/Desktop"
setwd("C:/Users/skirmantas/OneDrive/Desktop")
duom <- read.csv2("C:/Users/skirmantas/OneDrive/Desktop/Duomenys/automobiliaiA2.csv", header = TRUE, sep = ";", dec = ".")
# Min-max normalization of x to the [0, 1] range
min.max.norm <- function(x, x.max, x.min)
{
  return((x - x.min) / (x.max - x.min))
}
# Normalize the selected numeric columns (vectorized over whole columns)
for(i in c(1, 9, 10, 11))
{
  duom[, i] <- min.max.norm(duom[, i], max(duom[, i]), min(duom[, i]))
}
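As a quick sanity check (a small sketch, not part of the original pipeline), the normalized columns should now lie in [0, 1]; engine_location returns NaN because the column is constant (its maximum equals its minimum), which is why it is dropped further below.
sapply(duom[, c(1, 9, 10, 11)], range)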
duom$num_of_doors <- as.factor(duom$num_of_doors)
str(duom)
## 'data.frame': 159 obs. of 27 variables:
## $ number : num 0.00985 0.01478 0.02463 0.03448 0 ...
## $ symboling : int 1 1 1 1 1 0 0 0 1 1 ...
## $ normalized_losses: int 14 14 18 18 1 1 188 188 11 18 ...
## $ make : int 0 0 0 0 1 1 1 1 1 1 ...
## $ fuel_type : int 0 0 0 0 0 0 0 0 0 0 ...
## $ aspiration : int 1 1 1 0 1 1 1 1 1 1 ...
## $ num_of_doors : Factor w/ 2 levels "four","one": 2 2 2 2 1 2 1 2 1 1 ...
## $ body_style : int 0 0 0 0 0 0 0 0 1 1 ...
## $ drive_wheels : num 1 0 1 1 1 1 1 1 1 1 ...
## $ engine_location : num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
## $ wheel_base : num 0.00915 0.00458 0.00915 0.00915 0.00229 ...
## $ length : num 16.6 16.6 1.7 1.7 16.8 16.8 16.8 16.8 11.1 15.1 ...
## $ width : num 66.2 66.4 71.4 71.4 64.8 64.8 64.8 64.8 60.3 63.6 ...
## $ height : num 54.3 54.3 55.7 55.1 54.3 54.3 54.3 54.3 53.2 52 ...
## $ curb_weight : int 2337 2824 2844 3086 231 231 271 2765 188 1874 ...
## $ engine_type : int 0 0 0 0 0 0 0 0 1 0 ...
## $ num_of_cylinders : int 0 1 1 1 0 0 1 1 1 0 ...
## $ engine_size : int 1 1 1 11 18 18 14 14 61 1 ...
## $ fuel_system : int 0 0 0 0 0 0 0 0 1 1 ...
## $ bore : num 3.1 3.1 3.1 3.1 3.5 3.5 3.31 3.31 2.1 3.03 ...
## $ stroke : num 3.4 3.4 3.4 3.4 2.8 2.8 3.1 3.1 3.03 3.1 ...
## $ compression_ratio: num 1 8 8.5 8.3 8.8 8.8 1 1 1.5 1.6 ...
## $ horsepower : int 1 1 1 10 1 1 11 11 48 70 ...
## $ peak_rpm : int 5500 5500 5500 5500 5800 5800 4250 4250 510 5400 ...
## $ city_mpg : int 24 18 1 1 23 23 21 21 47 38 ...
## $ highway_mpg : int 30 22 25 20 21 21 28 28 53 43 ...
## $ price : int 110 1450 171 23875 1430 11 2010 21 511 621 ...
summary(duom)
## number symboling normalized_losses make
## Min. :0.00000 Min. :0.0000 Min. : 1.00 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 1.00 1st Qu.:1.0000
## Median :0.06404 Median :1.0000 Median : 11.00 Median :1.0000
## Mean :0.17433 Mean :0.6981 Mean : 24.66 Mean :0.9748
## 3rd Qu.:0.23399 3rd Qu.:1.0000 3rd Qu.: 18.00 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :256.00 Max. :1.0000
##
## fuel_type aspiration num_of_doors body_style
## Min. :0.00000 Min. :0.0000 four:64 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:1.0000 one :95 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :1.0000
## Mean :0.09434 Mean :0.8302 Mean :0.5031
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
##
## drive_wheels engine_location wheel_base length
## Min. :0.0000 Min. : NA Min. :0.000000 Min. : 1.1
## 1st Qu.:1.0000 1st Qu.: NA 1st Qu.:0.002288 1st Qu.: 11.7
## Median :1.0000 Median : NA Median :0.005721 Median : 15.3
## Mean :0.9497 Mean :NaN Mean :0.039679 Mean : 48.3
## 3rd Qu.:1.0000 3rd Qu.: NA 3rd Qu.:0.008009 3rd Qu.: 18.7
## Max. :1.0000 Max. : NA Max. :1.000000 Max. :202.6
## NA's :159
## width height curb_weight engine_type
## Min. :60.30 Min. :41.40 Min. : 1 Min. :0.0000
## 1st Qu.:64.00 1st Qu.:52.00 1st Qu.: 211 1st Qu.:0.0000
## Median :65.40 Median :54.10 Median :2004 Median :0.0000
## Mean :65.51 Mean :53.35 Mean :1429 Mean :0.2264
## 3rd Qu.:66.50 3rd Qu.:55.50 3rd Qu.:2412 3rd Qu.:0.0000
## Max. :71.70 Max. :58.70 Max. :4066 Max. :1.0000
##
## num_of_cylinders engine_size fuel_system bore
## Min. :0.0000 Min. : 1.00 Min. :0.0000 Min. :2.100
## 1st Qu.:0.0000 1st Qu.: 1.00 1st Qu.:0.0000 1st Qu.:3.050
## Median :0.0000 Median : 10.00 Median :1.0000 Median :3.270
## Mean :0.1447 Mean : 22.05 Mean :0.5975 Mean :3.166
## 3rd Qu.:0.0000 3rd Qu.: 16.00 3rd Qu.:1.0000 3rd Qu.:3.540
## Max. :1.0000 Max. :258.00 Max. :1.0000 Max. :3.780
##
## stroke compression_ratio horsepower peak_rpm
## Min. :2.070 Min. : 1.000 Min. : 1.00 Min. : 410
## 1st Qu.:3.100 1st Qu.: 1.000 1st Qu.: 1.00 1st Qu.:4800
## Median :3.230 Median : 1.400 Median : 52.00 Median :5200
## Mean :3.215 Mean : 5.156 Mean : 39.41 Mean :4928
## 3rd Qu.:3.410 3rd Qu.: 8.400 3rd Qu.: 69.00 3rd Qu.:5500
## Max. :4.100 Max. :23.000 Max. :200.00 Max. :6600
##
## city_mpg highway_mpg price
## Min. : 1.00 Min. : 1.00 Min. : 1.0
## 1st Qu.:22.00 1st Qu.:27.00 1st Qu.: 112.5
## Median :26.00 Median :32.00 Median : 781.0
## Mean :23.84 Mean :31.57 Mean : 3879.9
## 3rd Qu.:31.00 3rd Qu.:37.00 3rd Qu.: 6479.5
## Max. :47.00 Max. :54.00 Max. :35056.0
##
duom <- duom[,-10] # drop engine_location (constant column, all NaN after normalization)
duom <- as_tibble(duom)
autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
lda <- makeLearner("classif.lda")
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_lda <- resample(learner = lda, task = autoTask, resampling = holdout, measures = list(mmce, acc))
## Resampling: holdout
## Measures: mmce acc
## [Resample] iter 1: 0.2187500 0.7812500
##
## Aggregated Result: mmce.test.mean=0.2187500,acc.test.mean=0.7812500
##
For the LDA model with the holdout method, the sample is split into five parts: four fifths for training and one fifth for testing.
The LDA model achieves about 78% accuracy.
calculateConfusionMatrix(holdoutCV_lda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.85/0.69 0.15/0.12 0.15
## one 0.26/0.31 0.74/0.88 0.26
## -err.- 0.31 0.12 0.22
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 11 2 2
## one 5 14 5
## -err.- 5 2 7
We obtained an LDA model accuracy of \(\sim 78\%\): of the 32 test observations, 7 are classified incorrectly.
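For intuition, the holdout evaluation that mlr runs internally can be reproduced directly with MASS::lda; this is a sketch under the assumption that a plain random 4/5 split is acceptable (mlr additionally stratifies by class, so the numbers will differ slightly).
library(MASS)  # provides lda(); note that MASS masks dplyr::select
set.seed(123)
train_idx <- sample(nrow(duom), size = round(4/5 * nrow(duom)))
lda_fit <- lda(num_of_doors ~ ., data = duom[train_idx, ])
pred <- predict(lda_fit, newdata = duom[-train_idx, ])$class
mean(pred == duom$num_of_doors[-train_idx])  # test-set accuracy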
duom <- as_tibble(duom)
#qda <- makeLearner("classif.qda")
#qdaModel <- mlr::train(qda, autoTask)
#holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
#set.seed(10)
#holdout_qdaCV <- resample(learner = qda, task = autoTask, resampling = holdout,
#                          measures = list(mlr::mmce, mlr::acc))
#Error in qda.default(x, grouping, ...) : rank deficiency in group four
The QDA model cannot be fitted because of a collinearity problem (rank deficiency in the group "four"). Even after performing a correlation analysis and removing mutually correlated variables, the QDA model still did not suit these data.
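The correlation screening mentioned above could look like the sketch below; the 0.9 cutoff is an illustrative assumption, not necessarily the value used here.
num_cols <- sapply(duom, is.numeric)
cor_mat <- cor(duom[, num_cols])
high_corr <- caret::findCorrelation(cor_mat, cutoff = 0.9)  # column indices to drop
names(duom[, num_cols])[high_corr]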
autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
kFold <- makeResampleDesc(method = "RepCV", folds = 10, stratify = TRUE)
set.seed(10)
kfold_ldaCV <- resample(learner = lda, task = autoTask, resampling = kFold, measures = list(mlr::mmce, mlr::acc))
## Resampling: repeated cross-validation
## Measures: mmce acc
## [Resample] iter 1: 0.1333333 0.8666667
## [Resample] iter 2: 0.1764706 0.8235294
## [Resample] iter 3: 0.1875000 0.8125000
## [Resample] iter 4: 0.1764706 0.8235294
## [Resample] iter 5: 0.1176471 0.8823529
## [Resample] iter 6: 0.3125000 0.6875000
## [Resample] iter 7: 0.2000000 0.8000000
## [Resample] iter 8: 0.2500000 0.7500000
## [Resample] iter 9: 0.2000000 0.8000000
## [Resample] iter 10: 0.2000000 0.8000000
## [Resample] iter 11: 0.1250000 0.8750000
## [Resample] iter 12: 0.2352941 0.7647059
## [Resample] iter 13: 0.1875000 0.8125000
## [Resample] iter 14: 0.3750000 0.6250000
## [Resample] iter 15: 0.2941176 0.7058824
## [Resample] iter 16: 0.0666667 0.9333333
## [Resample] iter 17: 0.1333333 0.8666667
## [Resample] iter 18: 0.3333333 0.6666667
## [Resample] iter 19: 0.2500000 0.7500000
## [Resample] iter 20: 0.1875000 0.8125000
## [Resample] iter 21: 0.3125000 0.6875000
## [Resample] iter 22: 0.2941176 0.7058824
## [Resample] iter 23: 0.3125000 0.6875000
## [Resample] iter 24: 0.2000000 0.8000000
## [Resample] iter 25: 0.0625000 0.9375000
## [Resample] iter 26: 0.2000000 0.8000000
## [Resample] iter 27: 0.2941176 0.7058824
## [Resample] iter 28: 0.2666667 0.7333333
## [Resample] iter 29: 0.1764706 0.8235294
## [Resample] iter 30: 0.1333333 0.8666667
## [Resample] iter 31: 0.3529412 0.6470588
## [Resample] iter 32: 0.0625000 0.9375000
## [Resample] iter 33: 0.2000000 0.8000000
## [Resample] iter 34: 0.2500000 0.7500000
## [Resample] iter 35: 0.2941176 0.7058824
## [Resample] iter 36: 0.2500000 0.7500000
## [Resample] iter 37: 0.2000000 0.8000000
## [Resample] iter 38: 0.2000000 0.8000000
## [Resample] iter 39: 0.1875000 0.8125000
## [Resample] iter 40: 0.1250000 0.8750000
## [Resample] iter 41: 0.0625000 0.9375000
## [Resample] iter 42: 0.2666667 0.7333333
## [Resample] iter 43: 0.2500000 0.7500000
## [Resample] iter 44: 0.2000000 0.8000000
## [Resample] iter 45: 0.1764706 0.8235294
## [Resample] iter 46: 0.2500000 0.7500000
## [Resample] iter 47: 0.1764706 0.8235294
## [Resample] iter 48: 0.1250000 0.8750000
## [Resample] iter 49: 0.2666667 0.7333333
## [Resample] iter 50: 0.1250000 0.8750000
## [Resample] iter 51: 0.2352941 0.7647059
## [Resample] iter 52: 0.4666667 0.5333333
## [Resample] iter 53: 0.2941176 0.7058824
## [Resample] iter 54: 0.2500000 0.7500000
## [Resample] iter 55: 0.0666667 0.9333333
## [Resample] iter 56: 0.3125000 0.6875000
## [Resample] iter 57: 0.1250000 0.8750000
## [Resample] iter 58: 0.2500000 0.7500000
## [Resample] iter 59: 0.0666667 0.9333333
## [Resample] iter 60: 0.0625000 0.9375000
## [Resample] iter 61: 0.1250000 0.8750000
## [Resample] iter 62: 0.0588235 0.9411765
## [Resample] iter 63: 0.3529412 0.6470588
## [Resample] iter 64: 0.2000000 0.8000000
## [Resample] iter 65: 0.2666667 0.7333333
## [Resample] iter 66: 0.1875000 0.8125000
## [Resample] iter 67: 0.1875000 0.8125000
## [Resample] iter 68: 0.0000000 1.0000000
## [Resample] iter 69: 0.2666667 0.7333333
## [Resample] iter 70: 0.2352941 0.7647059
## [Resample] iter 71: 0.4666667 0.5333333
## [Resample] iter 72: 0.1176471 0.8823529
## [Resample] iter 73: 0.0666667 0.9333333
## [Resample] iter 74: 0.0625000 0.9375000
## [Resample] iter 75: 0.3125000 0.6875000
## [Resample] iter 76: 0.2500000 0.7500000
## [Resample] iter 77: 0.1250000 0.8750000
## [Resample] iter 78: 0.3750000 0.6250000
## [Resample] iter 79: 0.0000000 1.0000000
## [Resample] iter 80: 0.3750000 0.6250000
## [Resample] iter 81: 0.2000000 0.8000000
## [Resample] iter 82: 0.1333333 0.8666667
## [Resample] iter 83: 0.5000000 0.5000000
## [Resample] iter 84: 0.2666667 0.7333333
## [Resample] iter 85: 0.1176471 0.8823529
## [Resample] iter 86: 0.0666667 0.9333333
## [Resample] iter 87: 0.2352941 0.7647059
## [Resample] iter 88: 0.2352941 0.7647059
## [Resample] iter 89: 0.1875000 0.8125000
## [Resample] iter 90: 0.1875000 0.8125000
## [Resample] iter 91: 0.2666667 0.7333333
## [Resample] iter 92: 0.1764706 0.8235294
## [Resample] iter 93: 0.1250000 0.8750000
## [Resample] iter 94: 0.1875000 0.8125000
## [Resample] iter 95: 0.1875000 0.8125000
## [Resample] iter 96: 0.2666667 0.7333333
## [Resample] iter 97: 0.3750000 0.6250000
## [Resample] iter 98: 0.2500000 0.7500000
## [Resample] iter 99: 0.1250000 0.8750000
## [Resample] iter 100: 0.1875000 0.8125000
##
## Aggregated Result: mmce.test.mean=0.2085270,acc.test.mean=0.7914730
##
calculateConfusionMatrix(kfold_ldaCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.75/0.74 0.25/0.17 0.25
## one 0.18/0.26 0.82/0.83 0.18
## -err.- 0.26 0.17 0.21
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 480 160 160
## one 172 778 172
## -err.- 172 160 332
With repeated 10-fold cross-validation, the LDA model achieves a mean accuracy of about 79%: of the 1590 predictions, 1258 are classified correctly.
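The 100 per-fold accuracies behind this aggregate are kept in the resample result; a minimal sketch to inspect their spread, assuming mlr's standard measures.test slot:
summary(kfold_ldaCV$measures.test$acc)  # distribution of the 100 fold accuracies
sd(kfold_ldaCV$measures.test$acc)       # fold-to-fold variability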
# LOO (leave-one-out cross-validation) method
LOO <- makeResampleDesc(method = "LOO")
set.seed(50)
lda <- makeLearner("classif.lda")
lda_LOO <- resample(learner = lda, task = autoTask, resampling = LOO,
measures = list(mmce, acc))
## Resampling: LOO
## Measures: mmce acc
## [Resample] iter 1: 0.0000000 1.0000000
## [Resample] iter 2: 0.0000000 1.0000000
## [Resample] iter 3: 0.0000000 1.0000000
## [Resample] iter 4: 0.0000000 1.0000000
## [Resample] iter 5: 1.0000000 0.0000000
## [Resample] iter 6: 0.0000000 1.0000000
## [Resample] iter 7: 1.0000000 0.0000000
## [Resample] iter 8: 0.0000000 1.0000000
## [Resample] iter 9: 0.0000000 1.0000000
## [Resample] iter 10: 0.0000000 1.0000000
## [Resample] iter 11: 0.0000000 1.0000000
## [Resample] iter 12: 0.0000000 1.0000000
## [Resample] iter 13: 0.0000000 1.0000000
## [Resample] iter 14: 0.0000000 1.0000000
## [Resample] iter 15: 1.0000000 0.0000000
## [Resample] iter 16: 0.0000000 1.0000000
## [Resample] iter 17: 0.0000000 1.0000000
## [Resample] iter 18: 1.0000000 0.0000000
## [Resample] iter 19: 0.0000000 1.0000000
## [Resample] iter 20: 0.0000000 1.0000000
## [Resample] iter 21: 0.0000000 1.0000000
## [Resample] iter 22: 0.0000000 1.0000000
## [Resample] iter 23: 0.0000000 1.0000000
## [Resample] iter 24: 0.0000000 1.0000000
## [Resample] iter 25: 0.0000000 1.0000000
## [Resample] iter 26: 0.0000000 1.0000000
## [Resample] iter 27: 1.0000000 0.0000000
## [Resample] iter 28: 1.0000000 0.0000000
## [Resample] iter 29: 0.0000000 1.0000000
## [Resample] iter 30: 0.0000000 1.0000000
## [Resample] iter 31: 0.0000000 1.0000000
## [Resample] iter 32: 1.0000000 0.0000000
## [Resample] iter 33: 0.0000000 1.0000000
## [Resample] iter 34: 0.0000000 1.0000000
## [Resample] iter 35: 0.0000000 1.0000000
## [Resample] iter 36: 0.0000000 1.0000000
## [Resample] iter 37: 0.0000000 1.0000000
## [Resample] iter 38: 0.0000000 1.0000000
## [Resample] iter 39: 0.0000000 1.0000000
## [Resample] iter 40: 0.0000000 1.0000000
## [Resample] iter 41: 0.0000000 1.0000000
## [Resample] iter 42: 0.0000000 1.0000000
## [Resample] iter 43: 0.0000000 1.0000000
## [Resample] iter 44: 0.0000000 1.0000000
## [Resample] iter 45: 0.0000000 1.0000000
## [Resample] iter 46: 1.0000000 0.0000000
## [Resample] iter 47: 1.0000000 0.0000000
## [Resample] iter 48: 0.0000000 1.0000000
## [Resample] iter 49: 1.0000000 0.0000000
## [Resample] iter 50: 0.0000000 1.0000000
## [Resample] iter 51: 0.0000000 1.0000000
## [Resample] iter 52: 0.0000000 1.0000000
## [Resample] iter 53: 0.0000000 1.0000000
## [Resample] iter 54: 0.0000000 1.0000000
## [Resample] iter 55: 0.0000000 1.0000000
## [Resample] iter 56: 0.0000000 1.0000000
## [Resample] iter 57: 0.0000000 1.0000000
## [Resample] iter 58: 0.0000000 1.0000000
## [Resample] iter 59: 0.0000000 1.0000000
## [Resample] iter 60: 1.0000000 0.0000000
## [Resample] iter 61: 1.0000000 0.0000000
## [Resample] iter 62: 1.0000000 0.0000000
## [Resample] iter 63: 0.0000000 1.0000000
## [Resample] iter 64: 1.0000000 0.0000000
## [Resample] iter 65: 1.0000000 0.0000000
## [Resample] iter 66: 0.0000000 1.0000000
## [Resample] iter 67: 0.0000000 1.0000000
## [Resample] iter 68: 1.0000000 0.0000000
## [Resample] iter 69: 0.0000000 1.0000000
## [Resample] iter 70: 0.0000000 1.0000000
## [Resample] iter 71: 0.0000000 1.0000000
## [Resample] iter 72: 0.0000000 1.0000000
## [Resample] iter 73: 0.0000000 1.0000000
## [Resample] iter 74: 0.0000000 1.0000000
## [Resample] iter 75: 0.0000000 1.0000000
## [Resample] iter 76: 0.0000000 1.0000000
## [Resample] iter 77: 0.0000000 1.0000000
## [Resample] iter 78: 0.0000000 1.0000000
## [Resample] iter 79: 0.0000000 1.0000000
## [Resample] iter 80: 0.0000000 1.0000000
## [Resample] iter 81: 0.0000000 1.0000000
## [Resample] iter 82: 0.0000000 1.0000000
## [Resample] iter 83: 0.0000000 1.0000000
## [Resample] iter 84: 0.0000000 1.0000000
## [Resample] iter 85: 0.0000000 1.0000000
## [Resample] iter 86: 0.0000000 1.0000000
## [Resample] iter 87: 1.0000000 0.0000000
## [Resample] iter 88: 0.0000000 1.0000000
## [Resample] iter 89: 0.0000000 1.0000000
## [Resample] iter 90: 1.0000000 0.0000000
## [Resample] iter 91: 0.0000000 1.0000000
## [Resample] iter 92: 0.0000000 1.0000000
## [Resample] iter 93: 0.0000000 1.0000000
## [Resample] iter 94: 0.0000000 1.0000000
## [Resample] iter 95: 0.0000000 1.0000000
## [Resample] iter 96: 0.0000000 1.0000000
## [Resample] iter 97: 0.0000000 1.0000000
## [Resample] iter 98: 0.0000000 1.0000000
## [Resample] iter 99: 0.0000000 1.0000000
## [Resample] iter 100: 1.0000000 0.0000000
## [Resample] iter 101: 0.0000000 1.0000000
## [Resample] iter 102: 0.0000000 1.0000000
## [Resample] iter 103: 0.0000000 1.0000000
## [Resample] iter 104: 0.0000000 1.0000000
## [Resample] iter 105: 0.0000000 1.0000000
## [Resample] iter 106: 1.0000000 0.0000000
## [Resample] iter 107: 1.0000000 0.0000000
## [Resample] iter 108: 0.0000000 1.0000000
## [Resample] iter 109: 0.0000000 1.0000000
## [Resample] iter 110: 0.0000000 1.0000000
## [Resample] iter 111: 0.0000000 1.0000000
## [Resample] iter 112: 1.0000000 0.0000000
## [Resample] iter 113: 0.0000000 1.0000000
## [Resample] iter 114: 0.0000000 1.0000000
## [Resample] iter 115: 0.0000000 1.0000000
## [Resample] iter 116: 0.0000000 1.0000000
## [Resample] iter 117: 0.0000000 1.0000000
## [Resample] iter 118: 0.0000000 1.0000000
## [Resample] iter 119: 1.0000000 0.0000000
## [Resample] iter 120: 0.0000000 1.0000000
## [Resample] iter 121: 0.0000000 1.0000000
## [Resample] iter 122: 0.0000000 1.0000000
## [Resample] iter 123: 1.0000000 0.0000000
## [Resample] iter 124: 0.0000000 1.0000000
## [Resample] iter 125: 0.0000000 1.0000000
## [Resample] iter 126: 0.0000000 1.0000000
## [Resample] iter 127: 0.0000000 1.0000000
## [Resample] iter 128: 0.0000000 1.0000000
## [Resample] iter 129: 0.0000000 1.0000000
## [Resample] iter 130: 0.0000000 1.0000000
## [Resample] iter 131: 0.0000000 1.0000000
## [Resample] iter 132: 0.0000000 1.0000000
## [Resample] iter 133: 0.0000000 1.0000000
## [Resample] iter 134: 1.0000000 0.0000000
## [Resample] iter 135: 1.0000000 0.0000000
## [Resample] iter 136: 0.0000000 1.0000000
## [Resample] iter 137: 1.0000000 0.0000000
## [Resample] iter 138: 0.0000000 1.0000000
## [Resample] iter 139: 0.0000000 1.0000000
## [Resample] iter 140: 0.0000000 1.0000000
## [Resample] iter 141: 1.0000000 0.0000000
## [Resample] iter 142: 1.0000000 0.0000000
## [Resample] iter 143: 0.0000000 1.0000000
## [Resample] iter 144: 0.0000000 1.0000000
## [Resample] iter 145: 0.0000000 1.0000000
## [Resample] iter 146: 0.0000000 1.0000000
## [Resample] iter 147: 0.0000000 1.0000000
## [Resample] iter 148: 0.0000000 1.0000000
## [Resample] iter 149: 0.0000000 1.0000000
## [Resample] iter 150: 1.0000000 0.0000000
## [Resample] iter 151: 0.0000000 1.0000000
## [Resample] iter 152: 1.0000000 0.0000000
## [Resample] iter 153: 0.0000000 1.0000000
## [Resample] iter 154: 0.0000000 1.0000000
## [Resample] iter 155: 0.0000000 1.0000000
## [Resample] iter 156: 0.0000000 1.0000000
## [Resample] iter 157: 0.0000000 1.0000000
## [Resample] iter 158: 0.0000000 1.0000000
## [Resample] iter 159: 0.0000000 1.0000000
##
## Aggregated Result: mmce.test.mean=0.1949686,acc.test.mean=0.8050314
##
lda_LOO$aggr
## mmce.test.mean acc.test.mean
## 0.1949686 0.8050314
With the LOO method, the LDA model classifies with \(\sim 81\%\) accuracy, a marginal improvement over the holdout estimate.
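Since LOO runs one iteration per observation, the aggregate is simply the mean of 159 zero/one outcomes; a one-line check:
mean(lda_LOO$measures.test$acc)  # 128/159 correct, i.e. acc.test.mean = 0.805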
autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:20))
gridSearch <- makeTuneControlGrid()
set.seed(10)
holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
tunedKCv <- tuneParams("classif.knn", task = autoTask, resampling = holdout, par.set = knnParamSpace, control = gridSearch)
## [Tune] Started tuning learner classif.knn for parameter set:
## Type len Def Constr Req Tunable Trafo
## k discrete - - 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,1... - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=1
## [Tune-y] 1: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 2: k=2
## [Tune-y] 2: mmce.test.mean=0.6296296; time: 0.0 min
## [Tune-x] 3: k=3
## [Tune-y] 3: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 4: k=4
## [Tune-y] 4: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 5: k=5
## [Tune-y] 5: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 6: k=6
## [Tune-y] 6: mmce.test.mean=0.4814815; time: 0.0 min
## [Tune-x] 7: k=7
## [Tune-y] 7: mmce.test.mean=0.5185185; time: 0.0 min
## [Tune-x] 8: k=8
## [Tune-y] 8: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 9: k=9
## [Tune-y] 9: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 10: k=10
## [Tune-y] 10: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 11: k=11
## [Tune-y] 11: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 12: k=12
## [Tune-y] 12: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 13: k=13
## [Tune-y] 13: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 14: k=14
## [Tune-y] 14: mmce.test.mean=0.4629630; time: 0.0 min
## [Tune-x] 15: k=15
## [Tune-y] 15: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 16: k=16
## [Tune-y] 16: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 17: k=17
## [Tune-y] 17: mmce.test.mean=0.4074074; time: 0.0 min
## [Tune-x] 18: k=18
## [Tune-y] 18: mmce.test.mean=0.3703704; time: 0.0 min
## [Tune-x] 19: k=19
## [Tune-y] 19: mmce.test.mean=0.4444444; time: 0.0 min
## [Tune-x] 20: k=20
## [Tune-y] 20: mmce.test.mean=0.4259259; time: 0.0 min
## [Tune] Result: k=18 : mmce.test.mean=0.3703704
knnTuningData <- generateHyperParsEffectData(tunedKCv)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean", plot.type = "line") + theme_bw()
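The selected parameter can also be read directly from the tuning result rather than off the plot; mlr stores it in the x slot of the TuneResult object:
tunedKCv$x  # best parameter setting found by the grid search (k = 18)
tunedKCv$y  # its mmce on the tuning holdout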
knn <- makeLearner("classif.knn", par.vals = list("k" = 20)) # k chosen from the tuning plot (the grid search minimum was at k = 18)
holdoutNoStrat <- makeResampleDesc(method = "Holdout", split = 0.5, stratify = FALSE)
set.seed(10)
kFoldCV <- resample(learner = knn, task = autoTask, resampling = holdoutNoStrat, measures = list(mmce, acc))
## Resampling: holdout
## Measures: mmce acc
## [Resample] iter 1: 0.4250000 0.5750000
##
## Aggregated Result: mmce.test.mean=0.4250000,acc.test.mean=0.5750000
##
The k-nearest-neighbours model, with the parameter k chosen from the plot, correctly classifies about 58% of the data.
calculateConfusionMatrix(kFoldCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.14/0.56 0.86/0.42 0.86
## one 0.09/0.44 0.91/0.58 0.09
## -err.- 0.44 0.42 0.42
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 5 30 30
## one 4 41 4
## -err.- 4 30 34
Of the 80 test observations, the model classifies 46 correctly.
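For comparison, the same classifier can be run directly with class::knn (loaded above); a sketch mirroring, but not exactly reproducing, the unstratified half split used here.
set.seed(10)
idx <- sample(nrow(duom), size = nrow(duom) %/% 2)
# class::knn expects purely numeric predictors, so the factor target is dropped
X <- as.data.frame(duom[, names(duom) != "num_of_doors"])
pred <- class::knn(train = X[idx, ], test = X[-idx, ],
                   cl = duom$num_of_doors[idx], k = 20)
mean(pred == duom$num_of_doors[-idx])  # holdout accuracy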
ATask <- makeClassifTask(data = duom , target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
set.seed(10)
holdoutKNN <- resample(learner = knn, task = ATask, resampling = holdout, measures = list(mlr::mmce, mlr::acc))
## Resampling: holdout
## Measures: mmce acc
## [Resample] iter 1: 0.3888889 0.6111111
##
## Aggregated Result: mmce.test.mean=0.3888889,acc.test.mean=0.6111111
##
holdoutKNN$aggr
## mmce.test.mean acc.test.mean
## 0.3888889 0.6111111
After holdout validation of the KNN model, 61% of the data are classified correctly; the model's accuracy improves by about 3 percentage points.
calculateConfusionMatrix(holdoutKNN$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.27/0.55 0.73/0.37 0.73
## one 0.16/0.45 0.84/0.63 0.16
## -err.- 0.45 0.37 0.39
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 6 16 16
## one 5 27 5
## -err.- 5 16 21
Of the 54 test observations, 33 are classified correctly.
kfold <- makeResampleDesc(method = "RepCV", folds = 10)
set.seed(10)
kfoldAUTO <- resample(learner = knn, task = ATask, resampling = kfold , measures = list(mlr::mmce, mlr::acc))
kfoldAUTO$aggr
## mmce.test.mean acc.test.mean
## 0.421125 0.578875
After applying the k-fold method to the KNN model, the share of correctly classified data falls by about 3 percentage points, to roughly 58%.
calculateConfusionMatrix(kfoldAUTO$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.14/0.43 0.86/0.40 0.86
## one 0.13/0.57 0.87/0.60 0.13
## -err.- 0.57 0.40 0.42
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 90 550 550
## one 119 831 119
## -err.- 119 550 669
Of the 1590 cross-validated predictions, 921 are classified correctly.
automobiliai <- as_tibble(duom)
autoTask <- makeClassifTask(data = automobiliai, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
cvForTuning <- makeResampleDesc("Holdout", split = 0.8)
kernels <- c("polynomial", "radial", "sigmoid")
svmParam <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
makeIntegerParam("degree", lower = 1, upper = 3),
makeNumericParam("cost", lower = 0.1, upper = 10),
makeNumericParam("gamma", lower = 0.1, 10))
randSearch <- makeTuneControlRandom(maxit = 10)
outer <- makeResampleDesc("CV", iters = 3)
svmWrapper <- makeTuneWrapper("classif.svm", resampling = cvForTuning,
par.set = svmParam, control = randSearch)
cvWithTuning <- resample(learner = svmWrapper, task = autoTask, resampling = outer, measures = list(mmce, acc))
cvWithTuning
## Resample Result
## Task: automobiliai
## Learner: classif.svm.tuned
## Aggr perf: mmce.test.mean=0.3081761,acc.test.mean=0.6918239
## Runtime: 0.87598
The tuned SVM model correctly classifies about 69% of the data.
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.69/0.60 0.31/0.23 0.31
## one 0.31/0.40 0.69/0.77 0.31
## -err.- 0.40 0.23 0.31
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 44 20 20
## one 29 66 29
## -err.- 29 20 49
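The hyperparameters chosen in each of the three outer folds can be inspected after the fact; a sketch assuming mlr's nested-resampling extractor:
getNestedTuneResultsX(cvWithTuning)  # kernel, degree, cost and gamma per outer fold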
autoTask <- makeClassifTask(data = duom, target = "num_of_doors")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
logReg <- makeLearner("classif.logreg", predict.type = "prob")
logRegWrapper <- makeImputeWrapper("classif.logreg")
holdout <- makeResampleDesc(method = "Holdout", split = 2/3, stratify = TRUE)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, autoTask,
resampling = holdout,
measures = list(acc, fpr, fnr))
## Resampling: holdout
## Measures: acc fpr fnr
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 1: 0.6851852 0.2812500 0.3636364
##
## Aggregated Result: acc.test.mean=0.6851852,fpr.test.mean=0.2812500,fnr.test.mean=0.3636364
##
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.64/0.61 0.36/0.26 0.36
## one 0.28/0.39 0.72/0.74 0.28
## -err.- 0.39 0.26 0.31
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 14 8 8
## one 9 23 9
## -err.- 9 8 17
The logistic regression model correctly classifies about 69% of the data.
kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, autoTask,
resampling = kFold,
measures = list(acc, fpr, fnr))
## Resampling: cross-validation
## Measures: acc fpr fnr
## [Resample] iter 1: 0.7500000 0.2000000 0.3333333
## [Resample] iter 2: 0.8125000 0.2000000 0.1666667
## [Resample] iter 3: 0.7500000 0.1666667 0.5000000
## [Resample] iter 4: 0.8750000 0.1428571 0.1111111
## [Resample] iter 5: 0.6875000 0.3333333 0.2857143
## [Resample] iter 6: 0.9375000 0.0000000 0.1428571
## [Resample] iter 7: 0.7333333 0.2500000 0.2857143
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [Resample] iter 8: 0.7500000 0.1538462 0.6666667
## [Resample] iter 9: 0.8125000 0.1428571 0.2222222
## [Resample] iter 10: 0.8125000 0.1000000 0.3333333
##
## Aggregated Result: acc.test.mean=0.7920833,fpr.test.mean=0.1689560,fnr.test.mean=0.3047619
##
logRegwithImpute$aggr
## acc.test.mean fpr.test.mean fnr.test.mean
## 0.7920833 0.1689560 0.3047619
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true four one -err.-
## four 0.73/0.75 0.27/0.18 0.27
## one 0.17/0.25 0.83/0.82 0.17
## -err.- 0.25 0.18 0.21
##
##
## Absolute confusion matrix:
## predicted
## true four one -err.-
## four 47 17 17
## one 16 79 16
## -err.- 16 17 33
After 10-fold cross-validation of the logistic regression model, its accuracy improves by about 11 percentage points, to \(\sim 79\%\).
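For reference, the same logistic regression can be fitted directly with glm to inspect the coefficients; a minimal sketch without imputation and without resampling, so the fit is not directly comparable to the cross-validated estimates above.
log_fit <- glm(num_of_doors ~ ., data = duom, family = binomial)
head(summary(log_fit)$coefficients)  # estimates and p-values; exp() of an estimate gives an odds ratio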