Picture is taken from Kaggle.

Intro

1.1 Greetings

Hi Everyone :)

Welcome to my Rmd.

This is my HTML document, which contains a rice type classification analysis.

Hope you enjoy it!

1.2 What We Will Do

We will learn to use a logistic regression model and a KNN model on the rice type dataset. We want to understand the relationships among the variables, and we also want to classify the type of new rice grains (test data) based on the data the models were trained on.

Data Source: https://www.kaggle.com/datasets/mssmartypants/rice-type-classification

1.3 About Dataset

Context

This is a set of data created for rice classification. I recommend using this dataset for educational purposes, for practice, and to acquire the necessary knowledge. It is a modified dataset from this resource: link. The classes are encoded as Jasmine = 1 and Gonen = 0.

Content

What's inside is more than just rows and columns: each column name describes a measured property of a rice grain.

Description

All attributes are numeric variables; they are listed below:

- id
- Area
- MajorAxisLength
- MinorAxisLength
- Eccentricity
- ConvexArea
- EquivDiameter
- Extent
- Perimeter
- Roundness
- AspectRation
- Class

1.4 Business Goal

We want to know:

- The comparison of accuracy between logistic regression and KNN.
- How to classify the type of new rice based on the data we have trained on.

Library and Setup

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

Logistic Regression

Data Import

We use the rice type dataset from Kaggle, which has several variables/features.

rice <- read.csv("data_input/rice.csv")
str(rice)
## 'data.frame':    18185 obs. of  12 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Area           : int  4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
##  $ MajorAxisLength: num  92.2 74.7 76.3 77 85.1 ...
##  $ MinorAxisLength: num  64 51.4 52 51.9 56.4 ...
##  $ Eccentricity   : num  0.72 0.726 0.731 0.739 0.749 ...
##  $ ConvexArea     : int  4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
##  $ EquivDiameter  : num  76 60.5 62.3 62.6 68.6 ...
##  $ Extent         : num  0.658 0.713 0.759 0.784 0.769 ...
##  $ Perimeter      : num  273 208 210 211 230 ...
##  $ Roundness      : num  0.765 0.832 0.868 0.87 0.875 ...
##  $ AspectRation   : num  1.44 1.45 1.47 1.48 1.51 ...
##  $ Class          : int  1 1 1 1 1 1 1 1 1 1 ...
head(rice)
##   id Area MajorAxisLength MinorAxisLength Eccentricity ConvexArea EquivDiameter
## 1  1 4537        92.22932        64.01277    0.7199162       4677      76.00452
## 2  2 2872        74.69188        51.40045    0.7255527       3015      60.47102
## 3  3 3048        76.29316        52.04349    0.7312109       3132      62.29634
## 4  4 3073        77.03363        51.92849    0.7386387       3157      62.55130
## 5  5 3693        85.12478        56.37402    0.7492816       3802      68.57167
## 6  6 2990        77.41707        50.95434    0.7528609       3080      61.70078
##      Extent Perimeter Roundness AspectRation Class
## 1 0.6575362   273.085 0.7645096     1.440796     1
## 2 0.7130089   208.317 0.8316582     1.453137     1
## 3 0.7591532   210.012 0.8684336     1.465950     1
## 4 0.7835288   210.657 0.8702031     1.483456     1
## 5 0.7693750   230.332 0.8747433     1.510000     1
## 6 0.5848983   216.930 0.7984391     1.519342     1

Data Cleaning

There is a variable we don't need, id, so we remove it by subsetting.

rice <- rice[,-c(0:1)]
str(rice)
## 'data.frame':    18185 obs. of  11 variables:
##  $ Area           : int  4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
##  $ MajorAxisLength: num  92.2 74.7 76.3 77 85.1 ...
##  $ MinorAxisLength: num  64 51.4 52 51.9 56.4 ...
##  $ Eccentricity   : num  0.72 0.726 0.731 0.739 0.749 ...
##  $ ConvexArea     : int  4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
##  $ EquivDiameter  : num  76 60.5 62.3 62.6 68.6 ...
##  $ Extent         : num  0.658 0.713 0.759 0.784 0.769 ...
##  $ Perimeter      : num  273 208 210 211 230 ...
##  $ Roundness      : num  0.765 0.832 0.868 0.87 0.875 ...
##  $ AspectRation   : num  1.44 1.45 1.47 1.48 1.51 ...
##  $ Class          : int  1 1 1 1 1 1 1 1 1 1 ...

It is very important to check for missing values in the dataset.

anyNA(rice)
## [1] FALSE

Great! The data is complete and ready to be processed.

Data Manipulation

The Class variable has an incorrect type (integer), so we need to convert it to a factor with meaningful labels.

rice$Class <- factor(rice$Class)
rice <- rice %>% 
  mutate( Class = factor(Class, levels = c(0,1), 
                        labels = c("Gonen", "Jasmine")))
glimpse(rice)
## Rows: 18,185
## Columns: 11
## $ Area            <int> 4537, 2872, 3048, 3073, 3693, 2990, 3556, 3788, 2629, ~
## $ MajorAxisLength <dbl> 92.22932, 74.69188, 76.29316, 77.03363, 85.12478, 77.4~
## $ MinorAxisLength <dbl> 64.01277, 51.40045, 52.04349, 51.92849, 56.37402, 50.9~
## $ Eccentricity    <dbl> 0.7199162, 0.7255527, 0.7312109, 0.7386387, 0.7492816,~
## $ ConvexArea      <int> 4677, 3015, 3132, 3157, 3802, 3080, 3636, 3866, 2790, ~
## $ EquivDiameter   <dbl> 76.00452, 60.47102, 62.29634, 62.55130, 68.57167, 61.7~
## $ Extent          <dbl> 0.6575362, 0.7130089, 0.7591532, 0.7835288, 0.7693750,~
## $ Perimeter       <dbl> 273.085, 208.317, 210.012, 210.657, 230.332, 216.930, ~
## $ Roundness       <dbl> 0.7645096, 0.8316582, 0.8684336, 0.8702031, 0.8747433,~
## $ AspectRation    <dbl> 1.440796, 1.453137, 1.465950, 1.483456, 1.510000, 1.51~
## $ Class           <fct> Jasmine, Jasmine, Jasmine, Jasmine, Jasmine, Jasmine, ~

Then check for missing values in each column.

colSums(is.na(rice))
##            Area MajorAxisLength MinorAxisLength    Eccentricity      ConvexArea 
##               0               0               0               0               0 
##   EquivDiameter          Extent       Perimeter       Roundness    AspectRation 
##               0               0               0               0               0 
##           Class 
##               0

Pre-processing Data

We should check the class proportions in the Class column.

prop.table(table(rice$Class))
## 
##     Gonen   Jasmine 
## 0.4509211 0.5490789
table(rice$Class)
## 
##   Gonen Jasmine 
##    8200    9985

Looking at the proportions, the classes are already reasonably balanced (about 45% Gonen and 55% Jasmine).

Splitting Train-Test

The next step is splitting the data into train and test sets. The training data will be used to build the model, and the test data will be held out as unseen data, so we can measure how well the model generalizes.

set.seed(303)
intrain <- sample(nrow(rice), nrow(rice)*0.7)
rice_train <- rice[intrain,]
rice_test <- rice[-intrain,]
rice$Class %>% 
  levels()
## [1] "Gonen"   "Jasmine"

Modelling

In logistic regression, we build the model with the glm() function, using Class as the response variable and the other variables as predictors.
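For context, the model estimates the log-odds of a grain being Jasmine as a linear combination of the predictors:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\mathrm{Area} + \beta_2\,\mathrm{MajorAxisLength} + \dots$$

where p is the probability that the grain is Jasmine (the second factor level).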

model_all <- glm(formula = Class~., family = "binomial", data = rice_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = rice_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9420  -0.0016   0.0120   0.0422   4.2262  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.199e+02  4.969e+01   8.450  < 2e-16 ***
## Area             2.095e-02  5.755e-03   3.641 0.000272 ***
## MajorAxisLength  2.182e+00  1.915e-01  11.392  < 2e-16 ***
## MinorAxisLength -9.537e-01  4.172e-01  -2.286 0.022268 *  
## Eccentricity    -1.423e+02  2.283e+01  -6.231 4.65e-10 ***
## ConvexArea      -1.517e-02  2.676e-03  -5.669 1.44e-08 ***
## EquivDiameter   -3.192e+00  7.892e-01  -4.045 5.24e-05 ***
## Extent           1.275e+00  1.145e+00   1.114 0.265449    
## Perimeter       -2.468e-01  5.216e-02  -4.731 2.23e-06 ***
## Roundness       -1.012e+02  1.523e+01  -6.648 2.97e-11 ***
## AspectRation    -5.578e+01  5.950e+00  -9.375  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17521.41  on 12728  degrees of freedom
## Residual deviance:   733.37  on 12718  degrees of freedom
## AIC: 755.37
## 
## Number of Fisher Scoring iterations: 11
model <- glm(formula = Class~Area+MajorAxisLength+Eccentricity+ConvexArea+EquivDiameter+Perimeter+Roundness+AspectRation, family = "binomial", data = rice_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + Eccentricity + 
##     ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation, 
##     family = "binomial", data = rice_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0644  -0.0027   0.0072   0.0353   4.3126  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.632e+02  4.089e+01   8.882  < 2e-16 ***
## Area             2.435e-02  5.587e-03   4.358 1.31e-05 ***
## MajorAxisLength  2.107e+00  2.225e-01   9.472  < 2e-16 ***
## Eccentricity    -9.428e+01  1.007e+01  -9.362  < 2e-16 ***
## ConvexArea      -1.690e-02  2.594e-03  -6.515 7.29e-11 ***
## EquivDiameter   -3.883e+00  7.093e-01  -5.475 4.38e-08 ***
## Perimeter       -2.533e-01  5.404e-02  -4.687 2.77e-06 ***
## Roundness       -1.018e+02  1.600e+01  -6.362 1.99e-10 ***
## AspectRation    -4.621e+01  6.026e+00  -7.669 1.74e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17521.41  on 12728  degrees of freedom
## Residual deviance:   739.01  on 12720  degrees of freedom
## AIC: 757.01
## 
## Number of Fisher Scoring iterations: 11
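The coefficients above are on the log-odds scale. As a small illustration (this is what predict() with type = "response" does internally), the linear predictor for a single test row can be converted to a probability with the logistic function plogis():

eta <- predict(model, newdata = rice_test[1, ], type = "link")  # log-odds for one grain
plogis(eta)                                                     # probability of being Jasmine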

Model Fitting

We can use stepwise selection to refine the model, because some variables are not significant predictors of Class.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
model2_all <- stepAIC(model_all, direction = "backward")
## Start:  AIC=755.37
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity + 
##     ConvexArea + EquivDiameter + Extent + Perimeter + Roundness + 
##     AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                   Df Deviance    AIC
## - Extent           1   734.63 754.63
## <none>                 733.37 755.37
## - MinorAxisLength  1   738.00 758.00
## - Perimeter        1   748.36 768.36
## - Area             1   749.97 769.97
## - EquivDiameter    1   754.58 774.58
## - ConvexArea       1   759.94 779.94
## - Eccentricity     1   760.92 780.92
## - Roundness        1   762.46 782.46
## - AspectRation     1   776.67 796.67
## - MajorAxisLength  1   829.52 849.52
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Step:  AIC=754.63
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity + 
##     ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                   Df Deviance    AIC
## <none>                 734.63 754.63
## - MinorAxisLength  1   739.01 757.01
## - Perimeter        1   749.78 767.78
## - Area             1   751.06 769.06
## - EquivDiameter    1   755.83 773.83
## - ConvexArea       1   761.31 779.31
## - Eccentricity     1   761.63 779.63
## - Roundness        1   763.63 781.63
## - AspectRation     1   777.62 795.62
## - MajorAxisLength  1   830.55 848.55

Using the backward method in stepwise selection, we get the following model:

summary(model2_all)
## 
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + MinorAxisLength + 
##     Eccentricity + ConvexArea + EquivDiameter + Perimeter + Roundness + 
##     AspectRation, family = "binomial", data = rice_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9717  -0.0016   0.0118   0.0419   4.2436  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      4.179e+02  4.973e+01   8.404  < 2e-16 ***
## Area             2.098e-02  5.785e-03   3.626 0.000288 ***
## MajorAxisLength  2.183e+00  1.914e-01  11.405  < 2e-16 ***
## MinorAxisLength -9.239e-01  4.154e-01  -2.224 0.026133 *  
## Eccentricity    -1.405e+02  2.271e+01  -6.184 6.24e-10 ***
## ConvexArea      -1.526e-02  2.680e-03  -5.695 1.23e-08 ***
## EquivDiameter   -3.204e+00  7.932e-01  -4.039 5.36e-05 ***
## Perimeter       -2.468e-01  5.201e-02  -4.744 2.09e-06 ***
## Roundness       -1.007e+02  1.523e+01  -6.613 3.76e-11 ***
## AspectRation    -5.564e+01  5.944e+00  -9.362  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17521.41  on 12728  degrees of freedom
## Residual deviance:   734.63  on 12719  degrees of freedom
## AIC: 754.63
## 
## Number of Fisher Scoring iterations: 11
model2 <- stepAIC(model, direction = "backward")
## Start:  AIC=757.01
## Class ~ Area + MajorAxisLength + Eccentricity + ConvexArea + 
##     EquivDiameter + Perimeter + Roundness + AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##                   Df Deviance    AIC
## <none>                 739.01 757.01
## - Perimeter        1   752.35 768.35
## - Area             1   760.63 776.63
## - Roundness        1   764.89 780.89
## - Eccentricity     1   772.97 788.97
## - ConvexArea       1   774.29 790.29
## - EquivDiameter    1   785.81 801.81
## - AspectRation     1   794.97 810.97
## - MajorAxisLength  1   833.10 849.10
summary(model2)
## 
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + Eccentricity + 
##     ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation, 
##     family = "binomial", data = rice_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0644  -0.0027   0.0072   0.0353   4.3126  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.632e+02  4.089e+01   8.882  < 2e-16 ***
## Area             2.435e-02  5.587e-03   4.358 1.31e-05 ***
## MajorAxisLength  2.107e+00  2.225e-01   9.472  < 2e-16 ***
## Eccentricity    -9.428e+01  1.007e+01  -9.362  < 2e-16 ***
## ConvexArea      -1.690e-02  2.594e-03  -6.515 7.29e-11 ***
## EquivDiameter   -3.883e+00  7.093e-01  -5.475 4.38e-08 ***
## Perimeter       -2.533e-01  5.404e-02  -4.687 2.77e-06 ***
## Roundness       -1.018e+02  1.600e+01  -6.362 1.99e-10 ***
## AspectRation    -4.621e+01  6.026e+00  -7.669 1.74e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 17521.41  on 12728  degrees of freedom
## Residual deviance:   739.01  on 12720  degrees of freedom
## AIC: 757.01
## 
## Number of Fisher Scoring iterations: 11

Prediction

Using the second model, model2 (the result of stepwise selection), we predict on the test data.

rice_test$prob_rice <- predict(model2, type = "response", newdata = rice_test)

The distribution of the predicted probabilities:

ggplot(rice_test, aes(x=prob_rice)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

rice_test$pred_rice <- factor(ifelse(rice_test$prob_rice > 0.5, "Jasmine","Gonen"))
rice_test[1:10, c("pred_rice", "Class")]
##    pred_rice   Class
## 6    Jasmine Jasmine
## 11   Jasmine Jasmine
## 12   Jasmine Jasmine
## 16   Jasmine Jasmine
## 17   Jasmine Jasmine
## 18   Jasmine Jasmine
## 21   Jasmine Jasmine
## 22   Jasmine Jasmine
## 26   Jasmine Jasmine
## 30   Jasmine Jasmine

As shown above, if the predicted probability for a test observation is greater than 0.5, it is classified as Jasmine.

Model Evaluation

To evaluate the model we built, we will use a confusion matrix.

library(caret)
## Loading required package: lattice
log_conf <- confusionMatrix(rice_test$pred_rice, rice_test$Class, positive = "Jasmine")
log_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Gonen Jasmine
##    Gonen    2423      25
##    Jasmine    42    2966
##                                           
##                Accuracy : 0.9877          
##                  95% CI : (0.9844, 0.9905)
##     No Information Rate : 0.5482          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9752          
##                                           
##  Mcnemar's Test P-Value : 0.05062         
##                                           
##             Sensitivity : 0.9916          
##             Specificity : 0.9830          
##          Pos Pred Value : 0.9860          
##          Neg Pred Value : 0.9898          
##              Prevalence : 0.5482          
##          Detection Rate : 0.5436          
##    Detection Prevalence : 0.5513          
##       Balanced Accuracy : 0.9873          
##                                           
##        'Positive' Class : Jasmine         
## 

Model evaluation is done with the confusion matrix, a table that splits predictions into four categories: true positives, true negatives, false positives, and false negatives. The metrics below are computed from these counts.

Recall <- round((2966)/(2966+25),2)
Specificity <- round((2423)/(2423+42),2)
Accuracy <- round((2966+2423)/(nrow(rice_test)),2)
Precision <- round((2966)/(2966+42),2)

performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance
##   Accuracy Recall Precision Specificity
## 1     0.99   0.99      0.99        0.98

The result shows that our logistic regression model has an accuracy of 99% on the test dataset, meaning that 99% of the test data is correctly classified. The sensitivity and specificity are 99% and 98%, which indicates that both the positive and the negative outcomes are almost always correctly classified. The precision (positive predictive value) is 99%, meaning that 99% of our positive predictions are correct.
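Equivalently, here is a small sketch (assuming the same positive class, Jasmine) that pulls these counts from the confusion matrix object instead of hard-coding them:

cm <- log_conf$table                     # rows = prediction, columns = reference
TP <- cm["Jasmine", "Jasmine"]; TN <- cm["Gonen", "Gonen"]
FP <- cm["Jasmine", "Gonen"];   FN <- cm["Gonen", "Jasmine"]
c(Accuracy    = (TP + TN) / sum(cm),
  Recall      = TP / (TP + FN),
  Precision   = TP / (TP + FP),
  Specificity = TN / (TN + FP))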

Tuning Cutoff

To find a suitable classification threshold, we evaluate the model's performance across a range of cutoffs.

# tuning cutoff
performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = rice_test$prob_rice, 
                     ref = rice_test$Class, 
                     postarget = "Jasmine", 
                     negtarget = "Gonen")
}

data_frame("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff model perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
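From the result matrix above, a simple follow-up is to pick the cutoff that maximizes a chosen metric, for example accuracy (stored in column 2 of result); other criteria such as balanced accuracy could be used instead:

best_idx <- which.max(result[, 2])   # column 2 of result holds accuracy
co[best_idx]                         # cutoff with the highest test accuracy
result[best_idx, ]                   # recall, accuracy, precision, specificity at that cutoff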

Model Interpretation

# Odds ratio all coefficients
exp(model$coefficients) %>% 
  data.frame() 
##                             .
## (Intercept)     5.489014e+157
## Area             1.024646e+00
## MajorAxisLength  8.225424e+00
## Eccentricity     1.132953e-41
## ConvexArea       9.832458e-01
## EquivDiameter    2.057901e-02
## Perimeter        7.762342e-01
## Roundness        6.164935e-45
## AspectRation     8.533667e-21
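As a worked reading of the table above: the odds ratio for Area is about 1.0246, so, holding the other predictors fixed, each one-unit increase in Area multiplies the odds of a grain being Jasmine by roughly 1.025. Predictors with odds ratios below 1, such as Perimeter (about 0.776), decrease those odds instead.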

K-Nearest Neighbour

Pre-processing Data

We create dummy variables so that all the columns used for classification are numeric.

dmy <- dummyVars(" ~Class+Area+MajorAxisLength+Eccentricity+ConvexArea+EquivDiameter+Perimeter+Roundness+AspectRation", data = rice)
dmy <- data.frame(predict(dmy, newdata = rice))
str(dmy)
## 'data.frame':    18185 obs. of  10 variables:
##  $ Class.Gonen    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Class.Jasmine  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Area           : num  4537 2872 3048 3073 3693 ...
##  $ MajorAxisLength: num  92.2 74.7 76.3 77 85.1 ...
##  $ Eccentricity   : num  0.72 0.726 0.731 0.739 0.749 ...
##  $ ConvexArea     : num  4677 3015 3132 3157 3802 ...
##  $ EquivDiameter  : num  76 60.5 62.3 62.6 68.6 ...
##  $ Perimeter      : num  273 208 210 211 230 ...
##  $ Roundness      : num  0.765 0.832 0.868 0.87 0.875 ...
##  $ AspectRation   : num  1.44 1.45 1.47 1.48 1.51 ...

Since Class has only two categories, one of the two dummy columns is redundant, so we delete Class.Gonen.

dmy$Class.Gonen <- NULL

Check the names of the dummy columns.

names(dmy)
## [1] "Class.Jasmine"   "Area"            "MajorAxisLength" "Eccentricity"   
## [5] "ConvexArea"      "EquivDiameter"   "Perimeter"       "Roundness"      
## [9] "AspectRation"

Create train and test data from the dummy data frame, reusing the same row index as before.

set.seed(300)
dmy_train <- dmy[intrain,2:9]
dmy_test <- dmy[-intrain,2:9]

dmy_train_label <- dmy[intrain,1]
dmy_test_label <- dmy[-intrain,1]
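One optional step that we do not apply here, but that is common for distance-based methods such as KNN, is feature scaling: predictors like Area (in the thousands) and Roundness (near 1) are on very different scales, so the larger ones dominate the Euclidean distance. A minimal sketch, scaling the test set with the training-set statistics:

dmy_train_scaled <- scale(dmy_train)   # z-score scaling on the training predictors
dmy_test_scaled  <- scale(dmy_test,
                          center = attr(dmy_train_scaled, "scaled:center"),
                          scale  = attr(dmy_train_scaled, "scaled:scale"))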

Then predict using the KNN method, choosing k as roughly the square root of the number of training rows.

round(sqrt(nrow(dmy_train)))
## [1] 113
pred_knn <- class::knn(train = dmy_train,
                       test = dmy_test, 
                       cl = dmy_train_label, 
                       k = 113)
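The rule-of-thumb k is only a starting point. As a rough sketch, a few odd values of k can be compared by test-set accuracy (odd values avoid tied votes in a two-class problem):

for (k in c(5, 21, 51, 113)) {
  pk  <- class::knn(train = dmy_train, test = dmy_test, cl = dmy_train_label, k = k)
  acc <- mean(as.character(pk) == as.character(dmy_test_label))  # share of correct predictions
  cat("k =", k, " accuracy =", round(acc, 4), "\n")
}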

Build a confusion matrix to evaluate the KNN predictions.

pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label),"1")
pred_knn_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2160   38
##          1  305 2953
##                                           
##                Accuracy : 0.9371          
##                  95% CI : (0.9304, 0.9434)
##     No Information Rate : 0.5482          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8719          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9873          
##             Specificity : 0.8763          
##          Pos Pred Value : 0.9064          
##          Neg Pred Value : 0.9827          
##              Prevalence : 0.5482          
##          Detection Rate : 0.5412          
##    Detection Prevalence : 0.5971          
##       Balanced Accuracy : 0.9318          
##                                           
##        'Positive' Class : 1               
## 

Model Evaluation: Logistic Regression and K-NN

eval_logit <- data_frame(Accuracy = log_conf$overall[1],
           Recall = log_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = log_conf$byClass[3])

eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
           Recall = pred_knn_conf$byClass[1],
           Specificity = pred_knn_conf$byClass[2],
           Precision = pred_knn_conf$byClass[3])
# Model Evaluation Logit
eval_logit
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.988  0.992       0.983     0.986
# Model Evaluation K-NN
eval_knn
## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.937  0.987       0.876     0.906

Conclusion

Both models have high accuracy (above 90%), but the logistic regression model has higher accuracy than the KNN model.

Even though both models stay above 90% accuracy, there are clear differences. The K-NN model classifies observations as the positive class (Jasmine) more often than logistic regression, so while its recall is nearly as high, its precision and specificity are noticeably lower. Overall, logistic regression is the better model for this dataset, with higher accuracy, precision, and specificity than K-NN.