https://archive.ics.uci.edu/ml/datasets/Rice+%28Cammeo+and+Osmancik%29#

Data Set Information:

Among the certified rice varieties grown in Turkey, the Osmancik species, which has had a large planting area since 1997, and the Cammeo species, grown since 2014, were selected for this study. Both species have a wide, long, glassy and dull appearance. A total of 3810 images of rice grains were taken for the two species and processed, and feature inference was performed; 7 morphological features were obtained for each grain.

Attribute Information:

1.) Area: the number of pixels within the boundaries of the rice grain.
2.) Perimeter: the circumference, calculated as the distance between pixels around the boundaries of the rice grain.
3.) Major Axis Length: the length of the longest line that can be drawn on the rice grain, i.e. the main axis distance.
4.) Minor Axis Length: the length of the shortest line that can be drawn on the rice grain, i.e. the small axis distance.
5.) Eccentricity: measures how round the ellipse that has the same moments as the rice grain is.
6.) Convex Area: the pixel count of the smallest convex hull of the region formed by the rice grain.
7.) Extent: the ratio of the region formed by the rice grain to the bounding box pixels.
8.) Class: Cammeo or Osmancik.
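
A quick sanity check (a sketch added for illustration, not part of the original analysis): for an ellipse, eccentricity satisfies e = sqrt(1 - (minor/major)^2), so the tabulated Eccentricity can be reproduced from the two axis lengths of the first grain in the data preview further below:

# Illustrative check (assumes the standard ellipse eccentricity formula)
major <- 229.7499    # Major_Axis_Length of grain 1
minor <- 85.09379    # Minor_Axis_Length of grain 1
sqrt(1 - (minor / major)^2)   # ~0.92888, matching that grain's Eccentricity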

Classification task: analysing the 7 geometric parameters, determine which variety a rice grain belongs to: Cammeo or Osmancik.

knitr::opts_chunk$set(echo = TRUE)
library(MASS)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
library(tibble)
library(ggplot2)
library(devtools)
## Loading required package: usethis
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
library(mlr)
## Loading required package: ParamHelpers
## Warning message: 'mlr' is in 'maintenance-only' mode since July 2019.
## Future development will only happen in 'mlr3'
## (<https://mlr3.mlr-org.com>). Due to the focus on 'mlr3' there might be
## uncaught bugs meanwhile in {mlr} - please consider switching.
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# for revalue()
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
# for missing values visualisation
library(naniar)
## Warning: package 'naniar' was built under R version 4.1.3
library(corrplot)
## corrplot 0.92 loaded
# for KNN
library(class)
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:mlr':
## 
##     impute
library(tree)
## Warning: package 'tree' was built under R version 4.1.3
# PCA
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(umap)
## Warning: package 'umap' was built under R version 4.1.3

Data preparation

# Read in the data
ricedata <- read.csv("C:\\duomenys\\rice.csv")

Data overview:

# Column names
names(ricedata)
## [1] "AREA"              "Perimeter"         "Major_Axis_Length"
## [4] "Minor_Axis_Length" "Eccentricity"      "Convex_Area"      
## [7] "Extent"            "Class"
# Data sample:
head(ricedata, 6)
##    AREA Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
## 1 15231   525.579          229.7499          85.09379    0.9288820       15617
## 2 14656   494.311          206.0201          91.73097    0.8954050       15072
## 3 14634   501.122          214.1068          87.76829    0.9121181       14954
## 4 13176   458.343          193.3374          87.44839    0.8918609       13368
## 5 14688   507.167          211.7434          89.31245    0.9066909       15262
## 6 13479   477.016          200.0531          86.65029    0.9013283       13786
##      Extent  Class
## 1 0.5728955 Cammeo
## 2 0.6154363 Cammeo
## 3 0.6932588 Cammeo
## 4 0.6406690 Cammeo
## 5 0.6460239 Cammeo
## 6 0.6578973 Cammeo
# Number of observations
nrow(ricedata)
## [1] 3810
# Check whether there are any missing values
ricedata[!complete.cases(ricedata),]
## [1] AREA              Perimeter         Major_Axis_Length Minor_Axis_Length
## [5] Eccentricity      Convex_Area       Extent            Class            
## <0 rows> (or 0-length row.names)
# Visualise missing data
vis_miss(ricedata)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

# Remove rows with missing data
ricedata<- na.omit(ricedata)
# Check once more for missing data
ricedata[!complete.cases(ricedata),]
## [1] AREA              Perimeter         Major_Axis_Length Minor_Axis_Length
## [5] Eccentricity      Convex_Area       Extent            Class            
## <0 rows> (or 0-length row.names)
# This many observations remain for the analysis
nrow(ricedata)
## [1] 3810

Descriptive statistics

# min/max values
unique(ricedata$Class)
## [1] "Cammeo"   "Osmancik"
sapply(ricedata, min)
##              AREA         Perimeter Major_Axis_Length Minor_Axis_Length 
##            "7551"     "359.1000061"     "145.2644653"     "59.53240585" 
##      Eccentricity       Convex_Area            Extent             Class 
##     "0.777232587"            "7723"      "0.49741286"          "Cammeo"
sapply(ricedata, max)
##              AREA         Perimeter Major_Axis_Length Minor_Axis_Length 
##           "18913"     "548.4459839"      "239.010498"       "107.54245" 
##      Eccentricity       Convex_Area            Extent             Class 
##     "0.948006928"           "19099"     "0.861049533"        "Osmancik"
str(ricedata)
## 'data.frame':    3810 obs. of  8 variables:
##  $ AREA             : int  15231 14656 14634 13176 14688 13479 15757 16405 14534 13485 ...
##  $ Perimeter        : num  526 494 501 458 507 ...
##  $ Major_Axis_Length: num  230 206 214 193 212 ...
##  $ Minor_Axis_Length: num  85.1 91.7 87.8 87.4 89.3 ...
##  $ Eccentricity     : num  0.929 0.895 0.912 0.892 0.907 ...
##  $ Convex_Area      : int  15617 15072 14954 13368 15262 13786 16150 16837 14932 13734 ...
##  $ Extent           : num  0.573 0.615 0.693 0.641 0.646 ...
##  $ Class            : chr  "Cammeo" "Cammeo" "Cammeo" "Cammeo" ...
summary(ricedata)
##       AREA         Perimeter     Major_Axis_Length Minor_Axis_Length
##  Min.   : 7551   Min.   :359.1   Min.   :145.3     Min.   : 59.53   
##  1st Qu.:11370   1st Qu.:426.1   1st Qu.:174.4     1st Qu.: 82.73   
##  Median :12422   Median :448.9   Median :185.8     Median : 86.43   
##  Mean   :12668   Mean   :454.2   Mean   :188.8     Mean   : 86.31   
##  3rd Qu.:13950   3rd Qu.:483.7   3rd Qu.:203.6     3rd Qu.: 90.14   
##  Max.   :18913   Max.   :548.4   Max.   :239.0     Max.   :107.54   
##   Eccentricity     Convex_Area        Extent          Class          
##  Min.   :0.7772   Min.   : 7723   Min.   :0.4974   Length:3810       
##  1st Qu.:0.8724   1st Qu.:11626   1st Qu.:0.5989   Class :character  
##  Median :0.8891   Median :12706   Median :0.6454   Mode  :character  
##  Mean   :0.8869   Mean   :12952   Mean   :0.6619                     
##  3rd Qu.:0.9026   3rd Qu.:14284   3rd Qu.:0.7266                     
##  Max.   :0.9480   Max.   :19099   Max.   :0.8610

Basic visualisations

# Histograms of the numeric features
par(mfrow = c(2,2))
hist(ricedata$AREA, main = "AREA")
hist(ricedata$Perimeter, main = "Perimeter")
hist(ricedata$Major_Axis_Length, main = "Major_Axis_Length")
hist(ricedata$Minor_Axis_Length, main = "Minor_Axis_Length")

par(mfrow = c(1,3))
hist(ricedata$Eccentricity, main = "Eccentricity")
hist(ricedata$Convex_Area, main = "Convex_Area")
hist(ricedata$Extent, main = "Extent")

par(mfrow = c(1,1))

ggplot(data = ricedata,  aes(x=Class, fill=Class)) +geom_bar()

ggpairs(ricedata, mapping = ggplot2::aes(colour=Class) )

# Numeric copy of the data for the correlation plot:
# recode Class as Cammeo = 1, Osmancik = 0
ricedata_num <- ricedata
ricedata_num$Class <- revalue(ricedata_num$Class, c("Cammeo" = 1))
ricedata_num$Class <- revalue(ricedata_num$Class, c("Osmancik" = 0))
ricedata_num$Class <- as.integer(ricedata_num$Class)


corrplot(cor(ricedata_num), method = "number", type = "upper")

# Prepare the data for training


ricedata_numTib <- as_tibble(ricedata_num)
ricedata_numTib 
## # A tibble: 3,810 x 8
##     AREA Perimeter Major_Axis_Length Minor_Axis_Length Eccentricity Convex_Area
##    <int>     <dbl>             <dbl>             <dbl>        <dbl>       <int>
##  1 15231      526.              230.              85.1        0.929       15617
##  2 14656      494.              206.              91.7        0.895       15072
##  3 14634      501.              214.              87.8        0.912       14954
##  4 13176      458.              193.              87.4        0.892       13368
##  5 14688      507.              212.              89.3        0.907       15262
##  6 13479      477.              200.              86.7        0.901       13786
##  7 15757      509.              207.              98.3        0.880       16150
##  8 16405      527.              222.              95.4        0.903       16837
##  9 14534      484.              197.              95.1        0.875       14932
## 10 13485      472.              198.              87.7        0.897       13734
## # ... with 3,800 more rows, and 2 more variables: Extent <dbl>, Class <int>
names(ricedata_num)
## [1] "AREA"              "Perimeter"         "Major_Axis_Length"
## [4] "Minor_Axis_Length" "Eccentricity"      "Convex_Area"      
## [7] "Extent"            "Class"

LDA

Linear Discriminant Analysis (LDA) is used to find a linear boundary between classes, while Quadratic Discriminant Analysis (QDA) is used to find a non-linear (quadratic) boundary. LDA and QDA work best when the response classes are separable and the distribution of X = x within each class is normal. The more separable the classes and the closer the within-class distributions are to normal, the better the LDA and QDA classification results will be.
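
Before the mlr-based fit below, here is a minimal sketch (not part of the original pipeline) of fitting both models directly with MASS, using the numeric copy ricedata_num created above; lda()/qda() coerce the response to a factor:

# Sketch: direct LDA and QDA fits with MASS
lda_direct <- MASS::lda(Class ~ ., data = ricedata_num)
qda_direct <- MASS::qda(Class ~ ., data = ricedata_num)
head(predict(lda_direct)$posterior)   # per-class posterior probabilities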

# LDA
ricedataTask  <- makeClassifTask(data = ricedata_numTib, target = "Class")
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
lda <- makeLearner("classif.lda")
ldaModel <- train(lda, ricedataTask)
ldaModelData <- getLearnerModel(ldaModel)
ldaPreds <- predict(ldaModelData)$x
head(ldaPreds)
##         LD1
## 1 4.5443551
## 2 2.0229339
## 3 2.6352128
## 4 0.2475078
## 5 3.0359498
## 6 1.0594607
# With two classes, LDA yields a single discriminant function (LD1 only),
# so the usual LD1-vs-LD2 scatter plot is not applicable here;
# a density plot of LD1 by class is drawn after the QDA results below.

K-fold LDA

# Repeated k-fold cross-validation
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50, stratify = TRUE)
ldaCVKF <- resample(learner = lda, task = ricedataTask, resampling = kFold, measures = list(mmce, acc))
# K-fold cross-validation results

calculateConfusionMatrix(ldaCVKF$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.95/0.93 0.05/0.06 0.05     
##   1      0.10/0.07 0.90/0.94 0.10     
##   -err.-      0.07      0.06 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true          0     1 -err.-
##   0      104047  4953   4953
##   1        8310 73190   8310
##   -err.-   8310  4953  13263
ldaCVKF$aggr
## mmce.test.mean  acc.test.mean 
##     0.06962205     0.93037795

LOO LDA

# LOO cross-validation results
LOO <- makeResampleDesc(method = "LOO")
ricedataTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
ldaCVLOO <- resample(learner = lda, task = ricedataTask, resampling = LOO, measures = list(mmce, acc))

ldaCVLOO$aggr
## mmce.test.mean  acc.test.mean 
##     0.06955381     0.93044619

Holdout LDA

# Holdout cross-validation results
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_lda <- resample(learner = lda, task = ricedataTask, resampling = holdout, measures = list(mmce, acc))

calculateConfusionMatrix(holdoutCV_lda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.96/0.93 0.04/0.06 0.04     
##   1      0.10/0.07 0.90/0.94 0.10     
##   -err.-      0.07      0.06 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true       0   1 -err.-
##   0      417  19     19
##   1       31 295     31
##   -err.-  31  19     50
holdoutCV_lda$aggr
## mmce.test.mean  acc.test.mean 
##      0.0656168      0.9343832

QDA

# QDA
qda <- makeLearner("classif.qda")
qdaModel <- train(qda, ricedataTask)

K-fold QDA

# Repeated k-fold cross-validation
kFold <- makeResampleDesc(method = "RepCV", folds = 10, reps = 50, stratify = TRUE)
qdaCVKF <- resample(learner = qda, task = ricedataTask, resampling = kFold, measures = list(mmce, acc))
# K-fold cross-validation results

calculateConfusionMatrix(qdaCVKF$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.92/0.95 0.08/0.10 0.08     
##   1      0.07/0.05 0.93/0.90 0.07     
##   -err.-      0.05      0.10 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true          0     1 -err.-
##   0      100478  8522   8522
##   1        5480 76020   5480
##   -err.-   5480  8522  14002
qdaCVKF$aggr
## mmce.test.mean  acc.test.mean 
##     0.07350131     0.92649869

LOO QDA

# LOO cross-validation results
LOO <- makeResampleDesc(method = "LOO")
qdaCVLOO <- resample(learner = qda, task = ricedataTask, resampling = LOO, measures = list(mmce, acc))

qdaCVLOO$aggr
## mmce.test.mean  acc.test.mean 
##     0.07322835     0.92677165

Holdout QDA

# Holdout cross-validation results
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV_qda <- resample(learner = qda, task = ricedataTask, resampling = holdout, measures = list(mmce, acc))

calculateConfusionMatrix(holdoutCV_qda$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.92/0.95 0.08/0.10 0.08     
##   1      0.06/0.05 0.94/0.90 0.06     
##   -err.-      0.05      0.10 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true       0   1 -err.-
##   0      402  34     34
##   1       19 307     19
##   -err.-  19  34     53
holdoutCV_qda$aggr
## mmce.test.mean  acc.test.mean 
##     0.06955381     0.93044619
# With only one discriminant for two classes, we plot the distribution of LD1 by class
ricedata_numTib %>%
  mutate(LD1 = ldaPreds[, 1]) %>%
  ggplot(aes(LD1, fill = factor(Class))) +
  geom_density(alpha = 0.5) +
  theme_bw()

KNN

In statistics, the k-nearest neighbours algorithm (k-NN) is a non-parametric supervised learning method used for classification and regression. In both cases the input consists of the k closest training examples in the data set, and the output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a plurality vote of its neighbours, being assigned to the class most common among its k nearest neighbours; if k = 1, the object is simply assigned to the class of its single nearest neighbour. In k-NN regression, the output is a property value, namely the average of the values of the k nearest neighbours. The algorithm identifies the k labelled cases (neighbours) closest to each unlabelled case, where k is an integer we specify; below we search for the k with which grains are classified most accurately.
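
For illustration, here is a minimal direct k-NN run with class::knn() (a sketch with an ad-hoc 3000-grain split and an assumed k = 5, separate from the mlr tuning below); since k-NN is distance-based, the features are standardised first:

# Sketch: direct k-NN on standardised features with a simple random split
set.seed(123)
X <- scale(ricedata_num[, -8])            # the 7 numeric features, standardised
idx <- sample(nrow(X), 3000)              # ~80 % of the 3810 grains for training
knnPred <- class::knn(train = X[idx, ], test = X[-idx, ],
                      cl = ricedata_num$Class[idx], k = 5)
mean(as.character(knnPred) == as.character(ricedata_num$Class[-idx]))  # holdout accuracy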

ricedataTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:20))
gridSearch <- makeTuneControlGrid()
cvForTuning <- makeResampleDesc("RepCV", folds = 10, reps = 20)

tunedK <- tuneParams("classif.knn", task = ricedataTask,
                     resampling = cvForTuning,
                     par.set = knnParamSpace, control = gridSearch)

knnTuningData <- generateHyperParsEffectData(tunedK)
plotHyperParsEffect(knnTuningData, x = "k", y = "mmce.test.mean",
                    plot.type = "line") +
    theme_bw()

Holdout

knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
# Holdout cross-validation
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
holdoutCV <- resample(learner = knn, task = ricedataTask,
                      resampling = holdout, measures = list(mmce, acc))
calculateConfusionMatrix(holdoutCV$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.87/0.90 0.13/0.16 0.13     
##   1      0.13/0.10 0.87/0.84 0.13     
##   -err.-      0.10      0.16 0.13     
## 
## 
## Absolute confusion matrix:
##         predicted
## true       0   1 -err.-
##   0      381  55     55
##   1       41 285     41
##   -err.-  41  55     96
holdoutCV$aggr
## mmce.test.mean  acc.test.mean 
##      0.1259843      0.8740157

k-Fold

kFold <- makeResampleDesc(method = "CV", iters = 10,
                          stratify = TRUE)
set.seed(123)
kFoldCV <- resample(learner = knn, task = ricedataTask,
                    resampling = kFold, measures = list(mmce, acc))
kFoldCV$aggr
## mmce.test.mean  acc.test.mean 
##      0.1301837      0.8698163

LOO

LOO <- makeResampleDesc(method = "LOO")
set.seed(123)
kLOO <- resample(learner = knn, task = ricedataTask,
                    resampling = LOO, measures = list(mmce, acc))
kLOO$aggr
## mmce.test.mean  acc.test.mean 
##      0.1404199      0.8595801

Linear SVM

A Support Vector Machine (SVM) is a supervised machine-learning algorithm for classifying already-labelled data; it can be used for both classification and regression problems, although it is most often applied to classification. The goal of an SVM is to find the separating line (in the two-dimensional case), separating plane (three-dimensional case) or separating hyperplane (n-dimensional case, n > 3) with the largest margin, i.e. the greatest distance to the nearest training data points.
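
A caveat before the code below: Class in ricedata_num is numeric, so svm() silently fits eps-regression rather than a classifier (visible in the SVM-Type line of the printed output). A sketch of the classification variant, with the response converted to a factor:

# Sketch: C-classification requires a factor response
svm_clf <- svm(factor(Class) ~ ., data = ricedata_num,
               kernel = "linear", cost = 10, scale = TRUE)
table(predicted = fitted(svm_clf), actual = ricedata_num$Class)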

set.seed(123)
svmfit <- svm(Class~., data = ricedata_num, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
## 
## Call:
## svm(formula = Class ~ ., data = ricedata_num, kernel = "linear", 
##     cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.1428571 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  3805
ricedata_numTib <- as_tibble(ricedata_num)
ricedata_numTask <- makeClassifTask(data = ricedata_numTib, target = "Class")
cvForTuning <- makeResampleDesc("Holdout", split = 0.9)
kernels <- c("polynomial", "radial", "sigmoid")
svmParamSpace <- makeParamSet(makeDiscreteParam("kernel", values = kernels),
                              makeIntegerParam("degree", lower = 1, upper = 3),
                              makeNumericParam("cost", lower = 0.1, upper = 10),
                              makeNumericParam("gamma", lower = 0.1, 10))


randSearch <- makeTuneControlRandom(maxit = 10)
outer <- makeResampleDesc("CV", iters = 3)
svmWrapper <- makeTuneWrapper("classif.svm", resampling = cvForTuning,
                              par.set = svmParamSpace, control = randSearch)
cvWithTuning <- resample(learner = svmWrapper, task = ricedata_numTask, resampling = outer, measures = list(mmce, acc))

calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.94/0.94 0.06/0.08 0.06     
##   1      0.09/0.06 0.91/0.92 0.09     
##   -err.-      0.06      0.08 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true        0    1 -err.-
##   0      2055  125    125
##   1       140 1490    140
##   -err.-  140  125    265
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.svm.tuned
## Aggr perf: mmce.test.mean=0.0695538,acc.test.mean=0.9304462
## Runtime: 156.018

Non-linear SVM

set.seed(123)
# As with the linear fit above, the numeric Class again gives an eps-regression fit
svmfit_nl <- svm(Class~., data = ricedata_num, kernel = "radial", cost = 5, scale = FALSE)
print(svmfit_nl)
## 
## Call:
## svm(formula = Class ~ ., data = ricedata_num, kernel = "radial", 
##     cost = 5, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  5 
##       gamma:  0.1428571 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  3810

Decision tree

Decision trees are a supervised machine-learning method in which the data are repeatedly split according to some parameter. A tree consists of decision nodes and leaves: the leaves are the decisions or final outcomes, and the decision nodes are where the data are split.
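
Note that the tree() call below also uses the numeric 0/1 Class, so it grows a regression tree (the summary output says "Regression tree"). A sketch of the classification variant, assuming a factor response:

# Sketch: a classification tree needs a factor response
tree_clf <- tree(factor(Class) ~ ., data = ricedata_num)
summary(tree_clf)   # reports a misclassification error rate for classification trees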

tree.hcv <- tree(Class~., data = ricedata_num)
summary(tree.hcv)
## 
## Regression tree:
## tree(formula = Class ~ ., data = ricedata_num)
## Variables actually used in tree construction:
## [1] "Major_Axis_Length" "Perimeter"        
## Number of terminal nodes:  4 
## Residual mean deviance:  0.05584 = 212.5 / 3806 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.965700 -0.009416 -0.009416  0.000000  0.034260  0.990600
plot(tree.hcv)
text(tree.hcv, pretty = 0)

Decision tree tuning

tree <- makeLearner("classif.rpart")
treeParamSpace <- makeParamSet(makeIntegerParam("minsplit", lower = 5, upper = 20),
                               makeIntegerParam("minbucket", lower = 3, upper = 10),
                               makeNumericParam("cp", lower = 0.01, upper = 0.1),
                               makeIntegerParam("maxdepth", lower = 3, upper = 10))

randSearch <- makeTuneControlRandom(maxit = 200)
cvForTuning <- makeResampleDesc("CV", iters = 5)
# library(parallelMap)
# parallelStartSocket(cpus = detectCores())
# parallelStartSocket(cpus=3L)
# parallelMap(function(x) {x+1}, x = 1:10)

tunedTreePars <- tuneParams(tree, task = ricedataTask, resampling = cvForTuning, par.set = treeParamSpace, control = randSearch)
# parallelStop()

tunedTree <- setHyperPars(tree, par.vals = tunedTreePars$x)
tunedTreeModel <- train(tunedTree, ricedataTask)
library(rpart.plot)
treeModelData <- getLearnerModel(tunedTreeModel)
par(mfrow = c(1,1))
rpart.plot(treeModelData, roundint = FALSE, box.palette = "BuBn", type = 5)

Holdout confusion matrix

holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = holdout,
                               par.set = treeParamSpace,
                               control = randSearch)

set.seed(123)
cvWithTuning <- resample(treeWrapper, ricedataTask, resampling = holdout)
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.95/0.93 0.05/0.07 0.05     
##   1      0.09/0.07 0.91/0.93 0.09     
##   -err.-      0.07      0.07 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true       0   1 -err.-
##   0      414  22     22
##   1       30 296     30
##   -err.-  30  22     52
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.rpart.tuned
## Aggr perf: mmce.test.mean=0.0682415
## Runtime: 4.5708

k-fold

kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)
treeWrapper <- makeTuneWrapper("classif.rpart", resampling = kFold,
                               par.set = treeParamSpace,
                               control = randSearch)

set.seed(123)
cvWithTuning <- resample(treeWrapper, ricedataTask, resampling = kFold)
calculateConfusionMatrix(cvWithTuning$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.94/0.93 0.06/0.09 0.06     
##   1      0.09/0.07 0.91/0.91 0.09     
##   -err.-      0.07      0.09 0.08     
## 
## 
## Absolute confusion matrix:
##         predicted
## true        0    1 -err.-
##   0      2040  140    140
##   1       150 1480    150
##   -err.-  150  140    290
cvWithTuning
## Resample Result
## Task: ricedata_numTib
## Learner: classif.rpart.tuned
## Aggr perf: mmce.test.mean=0.0761155
## Runtime: 252.001

Logistic regression

In statistics, the logistic model is a statistical model that models the probability of an event by letting the log-odds of the event be a linear combination of one or more independent variables. In regression analysis, logistic regression means estimating the parameters of a logistic model.
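
As a minimal sketch outside the mlr pipeline used below, the same model can be fit with base R's glm(); the coefficients are on the log-odds scale, so exponentiating them gives odds ratios:

# Sketch: logistic regression via glm(); Class is already coded 0/1 in ricedata_num
glm_fit <- glm(Class ~ ., data = ricedata_num, family = binomial)
exp(coef(glm_fit))                         # odds ratios per unit change in each feature
head(predict(glm_fit, type = "response"))  # fitted P(Class = 1 | x)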

Holdout

ricedata_Task <- makeClassifTask(data = ricedata_num, target = "Class")
logReg <- makeLearner("classif.logreg", predict.type = "prob")
logRegModel <- train(logReg, ricedata_Task)

logRegWrapper <- makeImputeWrapper("classif.logreg")
holdout <- makeResampleDesc(method = "Holdout", split = 4/5, stratify = TRUE)
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
                             resampling = holdout,
                             measures = list(acc, fpr, fnr))

calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.94/0.95 0.06/0.08 0.06     
##   1      0.07/0.05 0.93/0.92 0.07     
##   -err.-      0.05      0.08 0.06     
## 
## 
## Absolute confusion matrix:
##         predicted
## true       0   1 -err.-
##   0      411  25     25
##   1       22 304     22
##   -err.-  22  25     47
logRegwithImpute
## Resample Result
## Task: ricedata_num
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.9383202,fpr.test.mean=0.0674847,fnr.test.mean=0.0573394
## Runtime: 0.0276251

10-fold cross-validation

kFold <- makeResampleDesc(method = "CV", iters = 10)
set.seed(123)

logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
                             resampling = kFold,
                             measures = list(acc, fpr, fnr))
logRegwithImpute
## Resample Result
## Task: ricedata_num
## Learner: classif.logreg.imputed
## Aggr perf: acc.test.mean=0.9299213,fpr.test.mean=0.0857996,fnr.test.mean=0.0579707
## Runtime: 0.305983
calculateConfusionMatrix(logRegwithImpute$pred, relative = TRUE)
## Relative confusion matrix (normalized by row/column):
##         predicted
## true     0         1         -err.-   
##   0      0.94/0.94 0.06/0.08 0.06     
##   1      0.09/0.06 0.91/0.92 0.09     
##   -err.-      0.06      0.08 0.07     
## 
## 
## Absolute confusion matrix:
##         predicted
## true        0    1 -err.-
##   0      2053  127    127
##   1       140 1490    140
##   -err.-  140  127    267
logRegwithImpute$aggr
## acc.test.mean fpr.test.mean fnr.test.mean 
##    0.92992126    0.08579965    0.05797067

After 10-fold CV, the overall classification accuracy of the logistic regression model is about 93%.

LOO cross-validation

LOO <- makeResampleDesc(method = "LOO")
set.seed(123)
logRegwithImpute <- resample(logRegWrapper, ricedata_Task,
                             resampling = LOO,
                             measures = list(acc))
logRegwithImpute$aggr
## acc.test.mean 
##     0.9304462


PCA

pca <- select(ricedata_numTib, -Class) %>%
  prcomp(center = TRUE, scale = TRUE)

pca
## Standard deviations (1, .., p=7):
## [1] 2.13985496 1.22464634 0.94910769 0.10842422 0.07884577 0.04530634 0.02077348
## 
## Rotation (n x k) = (7 x 7):
##                           PC1         PC2         PC3         PC4         PC5
## AREA              -0.46125179 -0.12437713  0.01261109  0.33469497 -0.37202436
## Perimeter         -0.46440793  0.05575087 -0.02837202 -0.68885689  0.47535595
## Major_Axis_Length -0.44707633  0.21345616 -0.12155490 -0.24054740 -0.52572729
## Minor_Axis_Length -0.32175200 -0.56710548  0.21271588  0.32792777  0.42723064
## Eccentricity      -0.22732878  0.67315219 -0.29798463  0.48220471  0.40928605
## Convex_Area       -0.46169356 -0.12253499  0.01687902  0.12294785 -0.09489032
## Extent             0.05771599 -0.38223203 -0.92191790 -0.02045092  0.01414705
##                           PC6          PC7
## AREA               0.19428799  0.695303821
## Perimeter          0.12575766  0.253199628
## Major_Axis_Length -0.58169723 -0.259151923
## Minor_Axis_Length -0.47572402 -0.115078218
## Eccentricity      -0.07804218 -0.016305971
## Convex_Area        0.61290856 -0.609722960
## Extent             0.00216608  0.003443235
fviz_pca_var(pca)

fviz_screeplot(pca, addlabels = TRUE, choice = "eigenvalue")

fviz_screeplot(pca, addlabels = TRUE, choice = "variance")

TDPca <- ricedata_numTib %>%
  mutate(PCA1 = pca$x[, 1], PCA2 = pca$x[, 2])

# factor(Class) so the two classes get discrete colours rather than a gradient
ggplot(TDPca, aes(PCA1, PCA2, col = factor(Class))) +
  geom_point() +
  theme_bw()

UMAP

UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality-reduction algorithm which, like t-SNE, can be used for visualising high-dimensional data as well as for general non-linear dimension reduction.
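
The call below passes the hyperparameters directly; equivalently they can be collected in a config object (a sketch of the package's umap.defaults interface, with the same settings):

# Sketch: the same settings via a copy of umap.defaults
umapConfig <- umap.defaults
umapConfig$n_neighbors <- 10
umapConfig$min_dist <- 0.1
umapConfig$metric <- "manhattan"
umapConfig$n_epochs <- 200
# umap(m, config = umapConfig) is then equivalent to the direct call below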

ricedataUMAP <- dplyr::select(ricedata_numTib, -Class) %>%
  as.matrix() %>%
  umap(n_neighbors = 10, min_dist = 0.1, metric = "manhattan", n_epochs = 200, verbose = TRUE)

ricedataTibmap <- ricedata_numTib %>%
  mutate_if(.funs = scale, .predicate = is.numeric, scale = FALSE) %>%
  mutate(UMAP1 = ricedataUMAP$layout[, 1], UMAP2 = ricedataUMAP$layout[, 2]) %>%
  gather(key = "Variable", value = "Value", c(-UMAP1, -UMAP2, -Class))

ggplot(ricedataTibmap, aes(UMAP1, UMAP2, col = Class)) + geom_point() + theme_bw()

Results

Summary table:

Algorithm        Cross-validation   Accuracy
LDA              k-fold             93.0 %
LDA              LOO                93.0 %
LDA              Holdout            93.4 %
QDA              k-fold             92.6 %
QDA              LOO                92.7 %
QDA              Holdout            93.0 %
KNN              k-fold             87.0 %
KNN              LOO                86.0 %
KNN              Holdout            87.4 %
Logistic reg.    k-fold             93.0 %
Logistic reg.    LOO                93.0 %
Logistic reg.    Holdout            93.8 %
Linear SVM       k-fold             93.0 %
Linear SVM       LOO                -
Linear SVM       Holdout            -
Non-linear SVM   k-fold             -
Non-linear SVM   LOO                -
Non-linear SVM   Holdout            -
Tree             k-fold             92.4 %
Tree             LOO                -
Tree             Holdout            93.2 %

Conclusions

The accuracy estimates of the different classifiers show that logistic regression is the most suitable classification algorithm: it achieved the highest accuracy, 93.8%, under holdout validation. The second most accurate classifier was LDA, with 93.4% accuracy under holdout cross-validation.
Overall, all of the classifiers performed quite accurately on this type of data.
Applying the dimensionality-reduction methods showed that principal component analysis is the most suitable: plotting the data on the first two principal components reveals a separation between the classes. The methodology of this analysis could be applied to data of a similar nature, i.e. with a small number of target classes and a similar amount of numeric data and dimensions.

Gediminas Kazėnas 2022-05-19