https://archive.ics.uci.edu/ml/datasets/Rice+%28Cammeo+and+Osmancik%29#
Data Set Information:
Among the certified rice varieties grown in Turkey, two were selected for this study: the Osmancik species, which has been planted over a large area since 1997, and the Cammeo species, grown since 2014. Osmancik grains are generally wide, long, glassy and dull in appearance; Cammeo grains are likewise wide and long, glassy and dull. A total of 3810 rice grain images were taken for the two species and processed, and feature inferences were made: 7 morphological features were obtained for each grain of rice.
Attribute Information:
1. Area: the number of pixels within the boundaries of the rice grain.
2. Perimeter: the circumference, computed from the distances between pixels around the boundaries of the rice grain.
3. Major Axis Length: the longest line that can be drawn on the rice grain, i.e. the main axis distance.
4. Minor Axis Length: the shortest line that can be drawn on the rice grain, i.e. the small axis distance.
5. Eccentricity: how round the ellipse that has the same moments as the rice grain is.
6. Convex Area: the pixel count of the smallest convex hull of the region formed by the rice grain.
7. Extent: the ratio of the region formed by the rice grain to the bounding box pixels.
8. Class: Cammeo or Osmancik.
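Eccentricity can be cross-checked against the two axis lengths: for an ellipse with semi-major axis a and semi-minor axis b, e = sqrt(1 - (b/a)^2). A minimal sketch (runnable once ricedata is loaded below) verifying this on the first grain:
# Eccentricity of an ellipse from its axis lengths: e = sqrt(1 - (b/a)^2)
ecc_from_axes <- function(major, minor) sqrt(1 - (minor / major)^2)
# Row 1 below has Major = 229.7499, Minor = 85.09379, Eccentricity = 0.9288820
ecc_from_axes(229.7499, 85.09379)  # ~0.928882, matching the Eccentricity column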
knitr::opts_chunk$set(echo = TRUE)
library(MASS)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
library(tibble)
library(ggplot2)
library(devtools)
## Loading required package: usethis
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# for revalue()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
# for missing values visualisation
library(naniar)
## Warning: package 'naniar' was built under R version 4.1.3
library(corrplot)
## corrplot 0.92 loaded
# for KNN
library(class)
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:mlr':
##
## impute
library(tree)
## Warning: package 'tree' was built under R version 4.1.3
# PCA
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(umap)
## Warning: package 'umap' was built under R version 4.1.3
# Load the data (local Windows path; adjust as needed)
ricedata <- read.csv("C:\\duomenys\\rice.csv")
Data preparation:
# Column names
names(ricedata)
## [1] "AREA" "Perimeter" "Major_Axis_Length"
## [4] "Minor_Axis_Length" "Eccentricity" "Convex_Area"
## [7] "Extent" "Class"
# Sample of the data:
head(ricedata, 6)
## AREA Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
## 1 15231 525.579 229.7499 85.09379 0.9288820 15617
## 2 14656 494.311 206.0201 91.73097 0.8954050 15072
## 3 14634 501.122 214.1068 87.76829 0.9121181 14954
## 4 13176 458.343 193.3374 87.44839 0.8918609 13368
## 5 14688 507.167 211.7434 89.31245 0.9066909 15262
## 6 13479 477.016 200.0531 86.65029 0.9013283 13786
## Extent Class
## 1 0.5728955 Cammeo
## 2 0.6154363 Cammeo
## 3 0.6932588 Cammeo
## 4 0.6406690 Cammeo
## 5 0.6460239 Cammeo
## 6 0.6578973 Cammeo
# Number of observations
nrow(ricedata)
## [1] 3810
# Check for missing values
ricedata[!complete.cases(ricedata),]
## [1] AREA Perimeter Major_Axis_Length Minor_Axis_Length
## [5] Eccentricity Convex_Area Extent Class
## <0 rows> (or 0-length row.names)
# Visualise missing values
vis_miss(ricedata)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# Remove rows with missing values (none here, so the data are unchanged)
ricedata <- na.omit(ricedata)
# Check once more for missing values
ricedata[!complete.cases(ricedata),]
## [1] AREA Perimeter Major_Axis_Length Minor_Axis_Length
## [5] Eccentricity Convex_Area Extent Class
## <0 rows> (or 0-length row.names)
# Number of observations we work with
nrow(ricedata)
## [1] 3810
# min/max values
unique(ricedata$Class)
## [1] "Cammeo" "Osmancik"
sapply(ricedata, min)
## AREA Perimeter Major_Axis_Length Minor_Axis_Length
## "7551" "359.1000061" "145.2644653" "59.53240585"
## Eccentricity Convex_Area Extent Class
## "0.777232587" "7723" "0.49741286" "Cammeo"
sapply(ricedata, max)
## AREA Perimeter Major_Axis_Length Minor_Axis_Length
## "18913" "548.4459839" "239.010498" "107.54245"
## Eccentricity Convex_Area Extent Class
## "0.948006928" "19099" "0.861049533" "Osmancik"
str(ricedata)
## 'data.frame': 3810 obs. of 8 variables:
## $ AREA : int 15231 14656 14634 13176 14688 13479 15757 16405 14534 13485 ...
## $ Perimeter : num 526 494 501 458 507 ...
## $ Major_Axis_Length: num 230 206 214 193 212 ...
## $ Minor_Axis_Length: num 85.1 91.7 87.8 87.4 89.3 ...
## $ Eccentricity : num 0.929 0.895 0.912 0.892 0.907 ...
## $ Convex_Area : int 15617 15072 14954 13368 15262 13786 16150 16837 14932 13734 ...
## $ Extent : num 0.573 0.615 0.693 0.641 0.646 ...
## $ Class : chr "Cammeo" "Cammeo" "Cammeo" "Cammeo" ...
summary(ricedata)
## AREA Perimeter Major_Axis_Length Minor_Axis_Length
## Min. : 7551 Min. :359.1 Min. :145.3 Min. : 59.53
## 1st Qu.:11370 1st Qu.:426.1 1st Qu.:174.4 1st Qu.: 82.73
## Median :12422 Median :448.9 Median :185.8 Median : 86.43
## Mean :12668 Mean :454.2 Mean :188.8 Mean : 86.31
## 3rd Qu.:13950 3rd Qu.:483.7 3rd Qu.:203.6 3rd Qu.: 90.14
## Max. :18913 Max. :548.4 Max. :239.0 Max. :107.54
## Eccentricity Convex_Area Extent Class
## Min. :0.7772 Min. : 7723 Min. :0.4974 Length:3810
## 1st Qu.:0.8724 1st Qu.:11626 1st Qu.:0.5989 Class :character
## Median :0.8891 Median :12706 Median :0.6454 Mode :character
## Mean :0.8869 Mean :12952 Mean :0.6619
## 3rd Qu.:0.9026 3rd Qu.:14284 3rd Qu.:0.7266
## Max. :0.9480 Max. :19099 Max. :0.8610
# Visualisation of the variable distributions
par(mfrow = c(2,2))
hist(ricedata$AREA, main = "AREA")
hist(ricedata$Perimeter, main = "Perimeter")
hist(ricedata$Major_Axis_Length, main = "Major_Axis_Length")
hist(ricedata$Minor_Axis_Length, main = "Minor_Axis_Length")
par(mfrow = c(1,3))
hist(ricedata$Eccentricity, main = "Eccentricity")
hist(ricedata$Convex_Area, main = "Convex_Area")
hist(ricedata$Extent, main = "Extent")
par(mfrow = c(1,1))
ggplot(data = ricedata, aes(x = Class, fill = Class)) + geom_bar()
ggpairs(ricedata, mapping = ggplot2::aes(colour = Class))
# Recode Class to numeric: Cammeo = 1, Osmancik = 0
ricedata_num <- ricedata
ricedata_num$Class <- revalue(ricedata_num$Class, c("Cammeo" = 1))
ricedata_num$Class <- revalue(ricedata_num$Class, c("Osmancik" = 0))
ricedata_num$Class <- as.integer(ricedata_num$Class)
corrplot(cor(ricedata_num), method = "number", type = "upper")
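An equivalent recoding without plyr is a single comparison coerced to integer (a compact alternative sketch, giving the same 1 = Cammeo, 0 = Osmancik coding):
# ricedata_num$Class <- as.integer(ricedata$Class == "Cammeo")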
# Prepare the data for training
ricedata_numTib <- as_tibble(ricedata_num)
ricedata_numTib
## # A tibble: 3,810 x 8
## AREA Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 15231 526. 230. 85.1 0.929 15617
## 2 14656 494. 206. 91.7 0.895 15072
## 3 14634 501. 214. 87.8 0.912 14954
## 4 13176 458. 193. 87.4 0.892 13368
## 5 14688 507. 212. 89.3 0.907 15262
## 6 13479 477. 200. 86.7 0.901 13786
## 7 15757 509. 207. 98.3 0.880 16150
## 8 16405 527. 222. 95.4 0.903 16837
## 9 14534 484. 197. 95.1 0.875 14932
## 10 13485 472. 198. 87.7 0.897 13734
## # ... with 3,800 more rows, and 2 more variables: Extent <dbl>, Class <int>
names(ricedata_num)
## [1] "AREA" "Perimeter" "Major_Axis_Length"
## [4] "Minor_Axis_Length" "Eccentricity" "Convex_Area"
## [7] "Extent" "Class"
Linear Discriminant Analysis (LDA) is used to find a linear boundary between classes, while Quadratic Discriminant Analysis (QDA) is used to find a non-linear boundary between classes. LDA and QDA work best when the response classes are separable and the distribution of X = x within each class is normal: the more separable the classes and the closer the distributions are to normal, the better the LDA and QDA classification results.
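For reference, the decision rules being compared are the standard textbook discriminant functions (not computed explicitly in this report): LDA assumes a covariance matrix $\Sigma$ shared by all classes, while QDA estimates a separate $\Sigma_k$ per class, which is what bends its boundary:

$$\delta_k^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k$$

$$\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log \pi_k$$

An observation $x$ is assigned to the class $k$ with the largest $\delta_k(x)$.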
# LDA
ricedataTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
lda <- makeLearner("classif.lda")
ldaModel <- train(lda, ricedataTask)
ldaModelData <- getLearnerModel(ldaModel)
ldaPreds <- predict(ldaModelData)$x
head(ldaPreds)
## LD1
## 1 4.5443551
## 2 2.0229339
## 3 2.6352128
## 4 0.2475078
## 5 3.0359498
## 6 1.0594607
# With only two classes, LDA yields a single discriminant function (LD1),
# so the LD1 vs LD2 scatter plot below is not applicable here:
# ricedata_numTib %>%
#   mutate(LD1 = ldaPreds[, 1],
#          LD2 = ldaPreds[, 2]) %>%
#   ggplot(aes(LD1, LD2, col = Class)) +
#   geom_point() +
#   stat_ellipse() +
#   theme_bw()
# K-fold cross-validation
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50, stratify = TRUE)
ldaCVKF <- resample(learner = lda, task = ricedataTask, resampling = kFold, measures = list(mmce, acc))
# K-fold cross-validation results
calculateConfusionMatrix(ldaCVKF$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.95/0.93 0.05/0.06 0.05
## 1 0.10/0.07 0.90/0.94 0.10
## -err.- 0.07 0.06 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 104047 4953 4953
## 1 8310 73190 8310
## -err.- 8310 4953 13263
ldaCVKF$aggr
## mmce.test.mean acc.test.mean
## 0.06962205 0.93037795
# LOO cross-validation results
LOO <- makeResampleDesc(method = "LOO")
ldaCVLOO <- resample(learner = lda, task = ricedataTask, resampling = LOO, measures = list(mmce, acc))
ldaCVLOO$aggr
## mmce.test.mean acc.test.mean
## 0.06955381 0.93044619
# Holdout cross-validation results
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_lda <- resample(learner = lda, task = ricedataTask, resampling = holdout, measures = list(mmce, acc))
calculateConfusionMatrix(holdoutCV_lda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.96/0.93 0.04/0.06 0.04
## 1 0.10/0.07 0.90/0.94 0.10
## -err.- 0.07 0.06 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 417 19 19
## 1 31 295 31
## -err.- 31 19 50
holdoutCV_lda$aggr
## mmce.test.mean acc.test.mean
## 0.0656168 0.9343832
# QDA
qda <- makeLearner("classif.qda")
qdaModel <- train(qda, ricedataTask)
# K-fold cross-validation
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50, stratify = TRUE)
qdaCVKF <- resample(learner = qda, task = ricedataTask, resampling = kFold, measures = list(mmce, acc))
# K-fold cross-validation results
calculateConfusionMatrix(qdaCVKF$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.92/0.95 0.08/0.10 0.08
## 1 0.07/0.05 0.93/0.90 0.07
## -err.- 0.05 0.10 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 100478 8522 8522
## 1 5480 76020 5480
## -err.- 5480 8522 14002
qdaCVKF$aggr
## mmce.test.mean acc.test.mean
## 0.07350131 0.92649869
# LOO cross-validation results
LOO <- makeResampleDesc(method = "LOO")
qdaCVLOO <- resample(learner = qda, task = ricedataTask, resampling = LOO, measures = list(mmce, acc))
qdaCVLOO$aggr
## mmce.test.mean acc.test.mean
## 0.07322835 0.92677165
# Holdout cross-validation results
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_qda <- resample(learner = qda, task = ricedataTask, resampling = holdout, measures = list(mmce, acc))
calculateConfusionMatrix(holdoutCV_qda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.92/0.95 0.08/0.10 0.08
## 1 0.06/0.05 0.94/0.90 0.06
## -err.- 0.05 0.10 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 402 34 34
## 1 19 307 19
## -err.- 19 34 53
holdoutCV_qda$aggr
## mmce.test.mean acc.test.mean
## 0.06955381 0.93044619
# With two classes there is only LD1, so plot its distribution by class
ricedata_numTib %>%
  mutate(LD1 = ldaPreds[, 1]) %>%
  ggplot(aes(LD1, fill = factor(Class))) +
  geom_density(alpha = 0.5) +
  theme_bw()
In statistics, the k-nearest neighbours algorithm (k-NN) is a non-parametric supervised learning method used for both classification and regression. In both cases the input consists of the k closest training examples in the data set, and the output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a plurality vote of its neighbours, being assigned to the class most common among its k nearest neighbours (if k = 1, the object is simply assigned to the class of its single nearest neighbour). In k-NN regression, the output is the average of the values of the k nearest neighbours. The algorithm identifies the k labelled cases (neighbours) closest to each unlabelled case, where k is an integer we specify; the goal is to find the k with which observations are assigned to classes most accurately.
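The class package loaded above offers a direct interface to this algorithm; a minimal sketch of what the mlr pipeline below does under the hood (the 80/20 split and k = 2 are illustrative, and the split indices are random):
set.seed(123)
train_idx <- sample(nrow(ricedata_num), size = round(0.8 * nrow(ricedata_num)))
knn_pred <- class::knn(
  train = ricedata_num[train_idx, -8],   # the 7 numeric features (column 8 is Class)
  test  = ricedata_num[-train_idx, -8],
  cl    = ricedata_num$Class[train_idx], # labels of the training grains
  k     = 2
)
mean(knn_pred == ricedata_num$Class[-train_idx])  # holdout accuracy
# Note: the features are on very different scales (e.g. AREA vs Extent),
# so scaling them before knn() is usually advisable.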
ricedataTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:20))
gridSearch <- makeTuneControlGrid()
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)
tunedK <- tuneParams("classif.knn", task = ricedataTask,
resampling = cvForTuning,
par.set = knnParamSpace, control = gridSearch)
knnTuningData <- generateHyperParsEffectData(tunedK)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
plot.type = "line") +
theme_bw()
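Rather than hardcoding the neighbourhood size read off the plot, the learner can take it straight from the tuning result (a sketch; tunedK$x holds the best k found above):
knn <- setHyperPars(makeLearner("classif.knn"), par.vals = tunedK$x)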
### Holdout
knn <- makeLearner("classif.knn", par.vals = list("k" = 2))  # k chosen from the tuning plot above
# Holdout cross-validation
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV <- resample(learner = knn, task = ricedataTask,
resampling = holdout, measures = list(mmce, acc))
calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.87/0.90 0.13/0.16 0.13
## 1 0.13/0.10 0.87/0.84 0.13
## -err.- 0.10 0.16 0.13
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 381 55 55
## 1 41 285 41
## -err.- 41 55 96
holdoutCV$aggr
## mmce.test.mean acc.test.mean
## 0.1259843 0.8740157
kFold <- makeResampleDesc(method = "CV", iters = 10,
stratify = TRUE)
set.seed(123)
kFoldCV <- resample(learner = knn, task = ricedataTask,
resampling = kFold, measures = list(mmce, acc))
kFoldCV$aggr
## mmce.test.mean acc.test.mean
## 0.1301837 0.8698163
LOO <- makeResampleDesc(method = "LOO")
set.seed(123)
kLOO <- resample(learner = knn, task = ricedataTask,
resampling = LOO, measures = list(mmce, acc))
kLOO$aggr
## mmce.test.mean acc.test.mean
## 0.1404199 0.8595801
SVM (Support Vector Machine) is a supervised machine-learning algorithm for classifying already-labelled data. It can be used for both classification and regression tasks, but is most often used for classification. The goal of an SVM is to find the separating line (in the two-dimensional case), separating plane (three-dimensional case) or separating hyperplane (n-dimensional case, n > 3) with the maximum margin, i.e. the greatest distance to the nearest training observations.
set.seed(123)
# NB: Class is numeric in ricedata_num, so svm() fits eps-regression here
# rather than a classifier (see SVM-Type in the output below)
svmfit <- svm(Class ~ ., data = ricedata_num, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
##
## Call:
## svm(formula = Class ~ ., data = ricedata_num, kernel = "linear",
## cost = 10, scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: linear
## cost: 10
## gamma: 0.1428571
## epsilon: 0.1
##
##
## Number of Support Vectors: 3805
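For an actual linear SVM classifier the target must be a factor; a minimal corrected sketch (with scaling enabled, which helps when features span very different ranges):
svmfit_cls <- svm(factor(Class) ~ ., data = ricedata_num,
                  kernel = "linear", cost = 10, scale = TRUE)
summary(svmfit_cls)  # SVM-Type should now read: C-classification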
ricedata_numTib <- as_tibble(ricedata_num)
ricedata_numTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
cvForTuning <- makeResampleDesc("Holdout", split = 0.9)
kernels <- c("polynomial", "radial", "sigmoid")
svmParamSpace <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
                              makeIntegerParam("degree", lower = 1, upper = 3),
                              makeNumericParam("cost", lower = 0.1, upper = 10),
                              makeNumericParam("gamma", lower = 0.1, upper = 10))
randSearch <- makeTuneControlRandom(maxit = 10)
outer <- makeResampleDesc("CV", iters = 3)
svmWrapper <- makeTuneWrapper("classif.svm", resampling = cvForTuning,
par.set = svmParamSpace, control = randSearch)
cvWithTuning <- resample(learner = svmWrapper, task = ricedata_numTask, resampling = outer, measures = list(mmce, acc))
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.94/0.94 0.06/0.08 0.06
## 1 0.09/0.06 0.91/0.92 0.09
## -err.- 0.06 0.08 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 2055 125 125
## 1 140 1490 140
## -err.- 140 125 265
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.svm.tuned
## Aggr perf: mmce.test.mean=0.0695538,acc.test.mean=0.9304462
## Runtime: 156.018
set.seed(123)
# Again eps-regression rather than classification, since Class is numeric
svmfit_nl <- svm(Class ~ ., data = ricedata_num, kernel = "radial", cost = 5, scale = FALSE)
print(svmfit_nl)
##
## Call:
## svm(formula = Class ~ ., data = ricedata_num, kernel = "radial",
## cost = 5, scale = FALSE)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 5
## gamma: 0.1428571
## epsilon: 0.1
##
##
## Number of Support Vectors: 3810
Decision trees are a supervised machine-learning method in which the data are repeatedly split according to some parameter. A tree consists of decision nodes and leaves: the leaves are the decisions or final outcomes, and the decision nodes are the points where the data are split.
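For a two-class node with class proportions $p$ and $1-p$, the impurity measure that rpart minimises when choosing splits is, by the standard definition, the Gini index:

$$G = 1 - p^2 - (1-p)^2 = 2p(1-p)$$

A split is chosen so that the weighted impurity of the two child nodes decreases as much as possible relative to the parent node.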
tree.rice <- tree(Class ~ ., data = ricedata_num)  # a regression tree, since Class is numeric
summary(tree.rice)
##
## Regression tree:
## tree(formula = Class ~ ., data = ricedata_num)
## Variables actually used in tree construction:
## [1] "Major_Axis_Length" "Perimeter"
## Number of terminal nodes: 4
## Residual mean deviance: 0.05584 = 212.5 / 3806
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.965700 -0.009416 -0.009416 0.000000 0.034260 0.990600
plot(tree.rice)
text(tree.rice, pretty = 0)
Comment: the fitted tree has four terminal nodes and uses only Major_Axis_Length and Perimeter to separate the two classes.
tree <- makeLearner("classif.rpart")
treeParamSpace <- makeParamSet(makeIntegerParam("minsplit", lower = 5, upper = 20), makeIntegerParam("minbucket", lower = 3, upper = 10), makeNumericParam("cp", lower = 0.01, upper = 0.1), makeIntegerParam("maxdepth", lower = 3, upper = 10))
randSearch <- makeTuneControlRandom(maxit = 200)
cvForTuning <- makeResampleDesc("CV", iters = 5)
# Optional parallelisation of the tuning:
# library(parallelMap)
# parallelStartSocket(cpus = detectCores())
tunedTreePars <- tuneParams(tree, task = ricedataTask, resampling = cvForTuning, par.set = treeParamSpace, control = randSearch)
# parallelStop()
tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)
tunedTreeModel <- train(tunedTree, ricedataTask)
library(rpart.plot)
treeModelData <- getLearnerModel(tunedTreeModel)
par(mfrow = c(1,1))
rpart.plot(treeModelData, roundint = FALSE, box.palette = "BuBn", type = 5)
### Holdout confusion matrix
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = holdout,
par.set = treeParamSpace,
control = randSearch)
set.seed(123)
cvWithTuning <- resample(treeWrapper, ricedataTask, resampling = holdout)
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.95/0.93 0.05/0.07 0.05
## 1 0.09/0.07 0.91/0.93 0.09
## -err.- 0.07 0.07 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 414 22 22
## 1 30 296 30
## -err.- 30 22 52
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.rpart.tuned
## Aggr perf: mmce.test.mean=0.0682415
## Runtime: 4.5708
kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = kFold,
par.set = treeParamSpace,
control = randSearch)
set.seed(123)
cvWithTuning <- resample(treeWrapper, ricedataTask, resampling = kFold)
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.94/0.93 0.06/0.09 0.06
## 1 0.09/0.07 0.91/0.91 0.09
## -err.- 0.07 0.09 0.08
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 2040 140 140
## 1 150 1480 150
## -err.- 150 140 290
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.rpart.tuned
## Aggr perf: mmce.test.mean=0.0761155
## Runtime: 252.001
In statistics, the logistic model is a statistical model that models the probability of an event by having the log-odds of the event be a linear combination of one or more independent variables. In regression analysis, logistic regression is the estimation of the parameters of a logistic model.
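In this two-class setting the model estimates the probability $p(x)$ that a grain is Cammeo via log-odds that are linear in the seven features:

$$\log\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_7 x_7$$

The mlr learner below wraps stats::glm; a sketch of the direct equivalent:
glm_fit <- glm(Class ~ ., data = ricedata_num, family = binomial)  # Class is already 0/1
summary(glm_fit)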
ricedata_Task <- makeClassifTask(data = ricedata_num, target = "Class")
logReg <- makeLearner("classif.logreg", predict.type = "prob")
logRegModel <- train(logReg, ricedata_Task)
logRegWrapper <- makeImputeWrapper("classif.logreg")
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
resampling = holdout,
measures = list(acc, fpr, fnr))
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.94/0.95 0.06/0.08 0.06
## 1 0.07/0.05 0.93/0.92 0.07
## -err.- 0.05 0.08 0.06
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 411 25 25
## 1 22 304 22
## -err.- 22 25 47
logRegwithImpute
## Resample Result
## Task: ricedata_num
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.9383202,fpr.test.mean=0.0674847,fnr.test.mean=0.0573394
## Runtime: 0.0276251
kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
resampling = kFold,
measures = list(acc, fpr, fnr))
logRegwithImpute
## Resample Result
## Task: ricedata_num
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.9299213,fpr.test.mean=0.0857996,fnr.test.mean=0.0579707
## Runtime: 0.305983
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
## predicted
## true 0 1 -err.-
## 0 0.94/0.94 0.06/0.08 0.06
## 1 0.09/0.06 0.91/0.92 0.09
## -err.- 0.06 0.08 0.07
##
##
## Absolute confusion matrix:
## predicted
## true 0 1 -err.-
## 0 2053 127 127
## 1 140 1490 140
## -err.- 140 127 267
logRegwithImpute$aggr
## acc.test.mean fpr.test.mean fnr.test.mean
## 0.92992126 0.08579965 0.05797067
After 10-fold CV, the overall classification accuracy of the logistic regression model is ~93%.
### LOO cross-validation
LOO <- makeResampleDesc(method = "LOO")
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
resampling = LOO,
measures = list(acc))
logRegwithImpute$aggr
## acc.test.mean
## 0.9304462
Comment: LOO cross-validation gives ~93.0% accuracy, consistent with the k-fold and holdout results above.
pca <- select(ricedata_numTib, -Class) %>%
prcomp(center = TRUE, scale = TRUE)
pca
## Standard deviations (1, .., p=7):
## [1] 2.13985496 1.22464634 0.94910769 0.10842422 0.07884577 0.04530634 0.02077348
##
## Rotation (n x k) = (7 x 7):
## PC1 PC2 PC3 PC4 PC5
## AREA -0.46125179 -0.12437713 0.01261109 0.33469497 -0.37202436
## Perimeter -0.46440793 0.05575087 -0.02837202 -0.68885689 0.47535595
## Major_Axis_Length -0.44707633 0.21345616 -0.12155490 -0.24054740 -0.52572729
## Minor_Axis_Length -0.32175200 -0.56710548 0.21271588 0.32792777 0.42723064
## Eccentricity -0.22732878 0.67315219 -0.29798463 0.48220471 0.40928605
## Convex_Area -0.46169356 -0.12253499 0.01687902 0.12294785 -0.09489032
## Extent 0.05771599 -0.38223203 -0.92191790 -0.02045092 0.01414705
## PC6 PC7
## AREA 0.19428799 0.695303821
## Perimeter 0.12575766 0.253199628
## Major_Axis_Length -0.58169723 -0.259151923
## Minor_Axis_Length -0.47572402 -0.115078218
## Eccentricity -0.07804218 -0.016305971
## Convex_Area 0.61290856 -0.609722960
## Extent 0.00216608 0.003443235
fviz_pca_var(pca)
fviz_screeplot(pca, addlabels = TRUE, choice = "eigenvalue")
fviz_screeplot(pca, addlabels = TRUE, choice = "variance")
TDPca <- ricedata_numTib %>%
mutate(PCA1 = pca$x[, 1], PCA2 = pca$x[, 2])
ggplot(TDPca, aes(PCA1, PCA2, col = Class)) +
geom_point() +
theme_bw()
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality-reduction algorithm which, much like t-SNE, can be used to visualise high-dimensional data as well as for general non-linear dimension reduction.
ricedataUMAP <- dplyr::select(ricedata_numTib, -Class) %>% as.matrix() %>% umap(n_neighbors = 10, min_dist = 0.1, metric = "manhattan", n_epochs = 200, verbose = TRUE)
# Attach the two UMAP coordinates to the data (one row per grain)
ricedataTibmap <- ricedata_numTib %>%
  mutate(UMAP1 = ricedataUMAP$layout[, 1],
         UMAP2 = ricedataUMAP$layout[, 2])
ggplot(ricedataTibmap, aes(UMAP1, UMAP2, col = Class)) + geom_point() + theme_bw()
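Unlike t-SNE, a fitted umap object can embed new observations with predict() (a sketch; applied to the first five grains of the same data purely for illustration):
new_grains <- as.matrix(dplyr::select(ricedata_numTib, -Class))[1:5, ]
predict(ricedataUMAP, new_grains)  # a 5 x 2 matrix of UMAP coordinates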
Summary of classification accuracy:
| Classifier | Cross-validation | Accuracy |
|---|---|---|
| LDA | K-fold | 93.0% |
| LDA | LOO | 93.0% |
| LDA | Holdout | 93.4% |
| QDA | K-fold | 92.6% |
| QDA | LOO | 92.7% |
| QDA | Holdout | 93.0% |
| KNN | K-fold | 87.0% |
| KNN | LOO | 86.0% |
| KNN | Holdout | 87.4% |
| Logistic reg. | K-fold | 93.0% |
| Logistic reg. | LOO | 93.0% |
| Logistic reg. | Holdout | 93.8% |
| Linear SVM | K-fold | 93.0% |
| Linear SVM | LOO | - |
| Linear SVM | Holdout | - |
| Non-linear SVM | K-fold | - |
| Non-linear SVM | LOO | - |
| Non-linear SVM | Holdout | - |
| Tree | K-fold | 92.4% |
| Tree | LOO | - |
| Tree | Holdout | 93.2% |
The accuracy estimates of the different classifiers show that logistic regression is the most suitable classification algorithm here: it achieved the highest accuracy, 93.8%, under holdout validation. The second most accurate was the LDA classifier, with 93.4% accuracy under holdout cross-validation.
Overall, all classifiers performed reasonably accurately on data of this type.
Among the dimensionality-reduction methods applied, principal component analysis proved the most suitable: plotting the data on the first two principal components shows the separation of the classes. The methodology of this analysis could be applied to data of a similar nature, i.e. with a small number of classes to predict and a similar amount of numeric data and number of dimensions.
Gediminas Kazėnas 2022-05-19