Picture taken from Kaggle.
Hi everyone :)
Welcome to my Rmd.
This is my HTML document on rice type classification.
Hope you enjoy it!
We will learn to use logistic regression and a KNN model on the rice type dataset. We want to understand the relationships among the variables, and we also want to classify the type of a new rice grain (test data) based on the model trained on the data we already have.
Data Source: https://www.kaggle.com/datasets/mssmartypants/rice-type-classification
This is a set of data created for rice classification. I recommend using this dataset for educational purposes, for practice, and to acquire the necessary knowledge. It is a modified dataset from the original resource (link), with Jasmine coded as 1 and Gonen as 0.
What is inside is more than just rows and columns: you can see the rice grain measurements listed as column names.
All attributes are numeric variables, and they are listed below:
-. id
-. Area
-. MajorAxisLength
-. MinorAxisLength
-. Eccentricity
-. ConvexArea
-. EquivDiameter
-. Extent
-. Perimeter
-. Roundness
-. AspectRation
-. Class
We want to know:
-. The accuracy of Logistic Regression compared with KNN.
-. How to classify the type of a new rice grain using the model trained on our data.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
We use the rice type dataset from Kaggle, which has a number of variables/features.
rice <- read.csv("data_input/rice.csv")
str(rice)
## 'data.frame': 18185 obs. of 12 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Area : int 4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
## $ MajorAxisLength: num 92.2 74.7 76.3 77 85.1 ...
## $ MinorAxisLength: num 64 51.4 52 51.9 56.4 ...
## $ Eccentricity : num 0.72 0.726 0.731 0.739 0.749 ...
## $ ConvexArea : int 4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
## $ EquivDiameter : num 76 60.5 62.3 62.6 68.6 ...
## $ Extent : num 0.658 0.713 0.759 0.784 0.769 ...
## $ Perimeter : num 273 208 210 211 230 ...
## $ Roundness : num 0.765 0.832 0.868 0.87 0.875 ...
## $ AspectRation : num 1.44 1.45 1.47 1.48 1.51 ...
## $ Class : int 1 1 1 1 1 1 1 1 1 1 ...
head(rice)
## id Area MajorAxisLength MinorAxisLength Eccentricity ConvexArea EquivDiameter
## 1 1 4537 92.22932 64.01277 0.7199162 4677 76.00452
## 2 2 2872 74.69188 51.40045 0.7255527 3015 60.47102
## 3 3 3048 76.29316 52.04349 0.7312109 3132 62.29634
## 4 4 3073 77.03363 51.92849 0.7386387 3157 62.55130
## 5 5 3693 85.12478 56.37402 0.7492816 3802 68.57167
## 6 6 2990 77.41707 50.95434 0.7528609 3080 61.70078
## Extent Perimeter Roundness AspectRation Class
## 1 0.6575362 273.085 0.7645096 1.440796 1
## 2 0.7130089 208.317 0.8316582 1.453137 1
## 3 0.7591532 210.012 0.8684336 1.465950 1
## 4 0.7835288 210.657 0.8702031 1.483456 1
## 5 0.7693750 230.332 0.8747433 1.510000 1
## 6 0.5848983 216.930 0.7984391 1.519342 1
There is a variable that we don't need, id, so we can drop it by subsetting.
rice <- rice[, -1]  # drop the id column
str(rice)
## 'data.frame': 18185 obs. of 11 variables:
## $ Area : int 4537 2872 3048 3073 3693 2990 3556 3788 2629 5719 ...
## $ MajorAxisLength: num 92.2 74.7 76.3 77 85.1 ...
## $ MinorAxisLength: num 64 51.4 52 51.9 56.4 ...
## $ Eccentricity : num 0.72 0.726 0.731 0.739 0.749 ...
## $ ConvexArea : int 4677 3015 3132 3157 3802 3080 3636 3866 2790 5819 ...
## $ EquivDiameter : num 76 60.5 62.3 62.6 68.6 ...
## $ Extent : num 0.658 0.713 0.759 0.784 0.769 ...
## $ Perimeter : num 273 208 210 211 230 ...
## $ Roundness : num 0.765 0.832 0.868 0.87 0.875 ...
## $ AspectRation : num 1.44 1.45 1.47 1.48 1.51 ...
## $ Class : int 1 1 1 1 1 1 1 1 1 1 ...
It is very important to check for missing values in the dataset.
anyNA(rice)
## [1] FALSE
Great! The data is complete and ready to be processed.
The Class variable has an incorrect type (integer), so we need to convert it to a factor with proper labels.
rice$Class <- factor(rice$Class)
rice <- rice %>%
  mutate(Class = factor(Class, levels = c(0, 1),
                        labels = c("Gonen", "Jasmine")))
glimpse(rice)
## Rows: 18,185
## Columns: 11
## $ Area <int> 4537, 2872, 3048, 3073, 3693, 2990, 3556, 3788, 2629, ~
## $ MajorAxisLength <dbl> 92.22932, 74.69188, 76.29316, 77.03363, 85.12478, 77.4~
## $ MinorAxisLength <dbl> 64.01277, 51.40045, 52.04349, 51.92849, 56.37402, 50.9~
## $ Eccentricity <dbl> 0.7199162, 0.7255527, 0.7312109, 0.7386387, 0.7492816,~
## $ ConvexArea <int> 4677, 3015, 3132, 3157, 3802, 3080, 3636, 3866, 2790, ~
## $ EquivDiameter <dbl> 76.00452, 60.47102, 62.29634, 62.55130, 68.57167, 61.7~
## $ Extent <dbl> 0.6575362, 0.7130089, 0.7591532, 0.7835288, 0.7693750,~
## $ Perimeter <dbl> 273.085, 208.317, 210.012, 210.657, 230.332, 216.930, ~
## $ Roundness <dbl> 0.7645096, 0.8316582, 0.8684336, 0.8702031, 0.8747433,~
## $ AspectRation <dbl> 1.440796, 1.453137, 1.465950, 1.483456, 1.510000, 1.51~
## $ Class <fct> Jasmine, Jasmine, Jasmine, Jasmine, Jasmine, Jasmine, ~
Then check for missing values in each column.
colSums(is.na(rice))
## Area MajorAxisLength MinorAxisLength Eccentricity ConvexArea
## 0 0 0 0 0
## EquivDiameter Extent Perimeter Roundness AspectRation
## 0 0 0 0 0
## Class
## 0
We should also check the proportion of each class in the Class column.
prop.table(table(rice$Class))
##
## Gonen Jasmine
## 0.4509211 0.5490789
table(rice$Class)
##
## Gonen Jasmine
## 8200 9985
Looking at the class proportions, the data is already reasonably balanced.
Splitting Train-Test
The next step is splitting the data into training and test sets. The training data will be used to build the model, and the test data will be used to evaluate it on unseen observations, so we can judge how well the model generalizes.
set.seed(303)
intrain <- sample(nrow(rice), nrow(rice)*0.7)
rice_train <- rice[intrain,]
rice_test <- rice[-intrain,]
rice$Class %>%
levels()
## [1] "Gonen" "Jasmine"
In logistic regression, we can build a model using the glm() function, with Class as the response variable and the remaining variables as predictors.
model_all <- glm(formula = Class~., family = "binomial", data = rice_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = rice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9420 -0.0016 0.0120 0.0422 4.2262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.199e+02 4.969e+01 8.450 < 2e-16 ***
## Area 2.095e-02 5.755e-03 3.641 0.000272 ***
## MajorAxisLength 2.182e+00 1.915e-01 11.392 < 2e-16 ***
## MinorAxisLength -9.537e-01 4.172e-01 -2.286 0.022268 *
## Eccentricity -1.423e+02 2.283e+01 -6.231 4.65e-10 ***
## ConvexArea -1.517e-02 2.676e-03 -5.669 1.44e-08 ***
## EquivDiameter -3.192e+00 7.892e-01 -4.045 5.24e-05 ***
## Extent 1.275e+00 1.145e+00 1.114 0.265449
## Perimeter -2.468e-01 5.216e-02 -4.731 2.23e-06 ***
## Roundness -1.012e+02 1.523e+01 -6.648 2.97e-11 ***
## AspectRation -5.578e+01 5.950e+00 -9.375 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17521.41 on 12728 degrees of freedom
## Residual deviance: 733.37 on 12718 degrees of freedom
## AIC: 755.37
##
## Number of Fisher Scoring iterations: 11
model <- glm(formula = Class~Area+MajorAxisLength+Eccentricity+ConvexArea+EquivDiameter+Perimeter+Roundness+AspectRation, family = "binomial", data = rice_train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + Eccentricity +
## ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation,
## family = "binomial", data = rice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0644 -0.0027 0.0072 0.0353 4.3126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.632e+02 4.089e+01 8.882 < 2e-16 ***
## Area 2.435e-02 5.587e-03 4.358 1.31e-05 ***
## MajorAxisLength 2.107e+00 2.225e-01 9.472 < 2e-16 ***
## Eccentricity -9.428e+01 1.007e+01 -9.362 < 2e-16 ***
## ConvexArea -1.690e-02 2.594e-03 -6.515 7.29e-11 ***
## EquivDiameter -3.883e+00 7.093e-01 -5.475 4.38e-08 ***
## Perimeter -2.533e-01 5.404e-02 -4.687 2.77e-06 ***
## Roundness -1.018e+02 1.600e+01 -6.362 1.99e-10 ***
## AspectRation -4.621e+01 6.026e+00 -7.669 1.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17521.41 on 12728 degrees of freedom
## Residual deviance: 739.01 on 12720 degrees of freedom
## AIC: 757.01
##
## Number of Fisher Scoring iterations: 11
Model Fitting
We can use stepwise selection to refine the model, because some variables are not significant for predicting Class.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
model2_all <- stepAIC(model_all, direction = "backward")
## Start: AIC=755.37
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity +
## ConvexArea + EquivDiameter + Extent + Perimeter + Roundness +
## AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## - Extent 1 734.63 754.63
## <none> 733.37 755.37
## - MinorAxisLength 1 738.00 758.00
## - Perimeter 1 748.36 768.36
## - Area 1 749.97 769.97
## - EquivDiameter 1 754.58 774.58
## - ConvexArea 1 759.94 779.94
## - Eccentricity 1 760.92 780.92
## - Roundness 1 762.46 782.46
## - AspectRation 1 776.67 796.67
## - MajorAxisLength 1 829.52 849.52
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=754.63
## Class ~ Area + MajorAxisLength + MinorAxisLength + Eccentricity +
## ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 734.63 754.63
## - MinorAxisLength 1 739.01 757.01
## - Perimeter 1 749.78 767.78
## - Area 1 751.06 769.06
## - EquivDiameter 1 755.83 773.83
## - ConvexArea 1 761.31 779.31
## - Eccentricity 1 761.63 779.63
## - Roundness 1 763.63 781.63
## - AspectRation 1 777.62 795.62
## - MajorAxisLength 1 830.55 848.55
Using the backward method in stepwise selection, we get the following model:
summary(model2_all)
##
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + MinorAxisLength +
## Eccentricity + ConvexArea + EquivDiameter + Perimeter + Roundness +
## AspectRation, family = "binomial", data = rice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9717 -0.0016 0.0118 0.0419 4.2436
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.179e+02 4.973e+01 8.404 < 2e-16 ***
## Area 2.098e-02 5.785e-03 3.626 0.000288 ***
## MajorAxisLength 2.183e+00 1.914e-01 11.405 < 2e-16 ***
## MinorAxisLength -9.239e-01 4.154e-01 -2.224 0.026133 *
## Eccentricity -1.405e+02 2.271e+01 -6.184 6.24e-10 ***
## ConvexArea -1.526e-02 2.680e-03 -5.695 1.23e-08 ***
## EquivDiameter -3.204e+00 7.932e-01 -4.039 5.36e-05 ***
## Perimeter -2.468e-01 5.201e-02 -4.744 2.09e-06 ***
## Roundness -1.007e+02 1.523e+01 -6.613 3.76e-11 ***
## AspectRation -5.564e+01 5.944e+00 -9.362 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17521.41 on 12728 degrees of freedom
## Residual deviance: 734.63 on 12719 degrees of freedom
## AIC: 754.63
##
## Number of Fisher Scoring iterations: 11
model2 <- stepAIC(model, direction = "backward")
## Start: AIC=757.01
## Class ~ Area + MajorAxisLength + Eccentricity + ConvexArea +
## EquivDiameter + Perimeter + Roundness + AspectRation
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 739.01 757.01
## - Perimeter 1 752.35 768.35
## - Area 1 760.63 776.63
## - Roundness 1 764.89 780.89
## - Eccentricity 1 772.97 788.97
## - ConvexArea 1 774.29 790.29
## - EquivDiameter 1 785.81 801.81
## - AspectRation 1 794.97 810.97
## - MajorAxisLength 1 833.10 849.10
summary(model2)
##
## Call:
## glm(formula = Class ~ Area + MajorAxisLength + Eccentricity +
## ConvexArea + EquivDiameter + Perimeter + Roundness + AspectRation,
## family = "binomial", data = rice_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0644 -0.0027 0.0072 0.0353 4.3126
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.632e+02 4.089e+01 8.882 < 2e-16 ***
## Area 2.435e-02 5.587e-03 4.358 1.31e-05 ***
## MajorAxisLength 2.107e+00 2.225e-01 9.472 < 2e-16 ***
## Eccentricity -9.428e+01 1.007e+01 -9.362 < 2e-16 ***
## ConvexArea -1.690e-02 2.594e-03 -6.515 7.29e-11 ***
## EquivDiameter -3.883e+00 7.093e-01 -5.475 4.38e-08 ***
## Perimeter -2.533e-01 5.404e-02 -4.687 2.77e-06 ***
## Roundness -1.018e+02 1.600e+01 -6.362 1.99e-10 ***
## AspectRation -4.621e+01 6.026e+00 -7.669 1.74e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 17521.41 on 12728 degrees of freedom
## Residual deviance: 739.01 on 12720 degrees of freedom
## AIC: 757.01
##
## Number of Fisher Scoring iterations: 11
Using the second model, model2, obtained from stepwise selection, we try to predict on the test data.
rice_test$prob_rice <- predict(model2, type = "response", newdata = rice_test)
The distribution of the predicted probabilities on the test data:
ggplot(rice_test, aes(x = prob_rice)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()
rice_test$pred_rice <- factor(ifelse(rice_test$prob_rice > 0.5, "Jasmine","Gonen"))
rice_test[1:10, c("pred_rice", "Class")]
## pred_rice Class
## 6 Jasmine Jasmine
## 11 Jasmine Jasmine
## 12 Jasmine Jasmine
## 16 Jasmine Jasmine
## 17 Jasmine Jasmine
## 18 Jasmine Jasmine
## 21 Jasmine Jasmine
## 22 Jasmine Jasmine
## 26 Jasmine Jasmine
## 30 Jasmine Jasmine
As shown above, if the predicted probability for a test observation is greater than 0.5, it is classified as the Jasmine type.
To evaluate the model we built, we will use a confusion matrix.
library(caret)
## Loading required package: lattice
log_conf <- confusionMatrix(rice_test$pred_rice, rice_test$Class, positive = "Jasmine")
log_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction Gonen Jasmine
## Gonen 2423 25
## Jasmine 42 2966
##
## Accuracy : 0.9877
## 95% CI : (0.9844, 0.9905)
## No Information Rate : 0.5482
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9752
##
## Mcnemar's Test P-Value : 0.05062
##
## Sensitivity : 0.9916
## Specificity : 0.9830
## Pos Pred Value : 0.9860
## Neg Pred Value : 0.9898
## Prevalence : 0.5482
## Detection Rate : 0.5436
## Detection Prevalence : 0.5513
## Balanced Accuracy : 0.9873
##
## 'Positive' Class : Jasmine
##
Model evaluation is done with a confusion matrix, a table that shows four categories: True Positives, True Negatives, False Positives, and False Negatives.
Recall <- round((2966)/(2966+25), 2)                   # TP / (TP + FN)
Specificity <- round((2423)/(2423+42), 2)              # TN / (TN + FP)
Accuracy <- round((2966+2423)/(nrow(rice_test)), 2)    # (TP + TN) / total
Precision <- round((2966)/(2966+42), 2)                # TP / (TP + FP)
performance <- cbind.data.frame(Accuracy, Recall, Precision, Specificity)
performance
## Accuracy Recall Precision Specificity
## 1 0.99 0.99 0.99 0.98
The results show that our logistic regression model has an accuracy of 99% on the test dataset, meaning that 99% of the test data is correctly classified. The sensitivity and specificity are 99% and 98%, which indicates that most positive outcomes and most negative outcomes are correctly classified. The precision (positive predictive value) is 99%, meaning that 99% of our positive predictions are correct.
Tuning Cutoff
To find the best threshold, we evaluate the model's performance over a range of cutoff values.
# tuning cutoff
performa <- function(cutoff, prob, ref, postarget, negtarget) {
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict, ref, positive = postarget)
  acc  <- conf$overall[1]   # accuracy
  rec  <- conf$byClass[1]   # sensitivity / recall
  prec <- conf$byClass[3]   # positive predictive value / precision
  spec <- conf$byClass[2]   # specificity
  mat <- t(as.matrix(c(rec, acc, prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}
co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
  result[i,] = performa(cutoff = co[i],
                        prob = rice_test$prob_rice,
                        ref = rice_test$Class,
                        postarget = "Jasmine",
                        negtarget = "Gonen")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff model perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
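From the grid above we can also pull out the cutoff that maximizes a chosen metric. A minimal sketch, assuming the result matrix and co vector from the previous chunk (column 2 of result holds accuracy):
# sketch (not part of the original analysis): cutoff with the highest test accuracy
best_cutoff <- co[which.max(result[, 2])]
best_cutoff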
Model Interpretation
# Odds ratio all coefficients
exp(model$coefficients) %>%
data.frame()
## .
## (Intercept) 5.489014e+157
## Area 1.024646e+00
## MajorAxisLength 8.225424e+00
## Eccentricity 1.132953e-41
## ConvexArea 9.832458e-01
## EquivDiameter 2.057901e-02
## Perimeter 7.762342e-01
## Roundness 6.164935e-45
## AspectRation 8.533667e-21
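To read these odds ratios: a value above 1 means the odds of being Jasmine increase as that predictor increases, while a value below 1 means they decrease. A small illustrative sketch for a single coefficient, using the model object above:
# illustration: odds ratio for a one-unit increase in MajorAxisLength
exp(coef(model)["MajorAxisLength"])
# roughly 8.2, so each extra unit of MajorAxisLength multiplies the odds of Jasmine by about 8.2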
For KNN, all predictors must be numeric, so we create dummy variables for classification.
dmy <- dummyVars(" ~Class+Area+MajorAxisLength+Eccentricity+ConvexArea+EquivDiameter+Perimeter+Roundness+AspectRation", data = rice)
dmy <- data.frame(predict(dmy, newdata = rice))
str(dmy)
## 'data.frame': 18185 obs. of 10 variables:
## $ Class.Gonen : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Class.Jasmine : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Area : num 4537 2872 3048 3073 3693 ...
## $ MajorAxisLength: num 92.2 74.7 76.3 77 85.1 ...
## $ Eccentricity : num 0.72 0.726 0.731 0.739 0.749 ...
## $ ConvexArea : num 4677 3015 3132 3157 3802 ...
## $ EquivDiameter : num 76 60.5 62.3 62.6 68.6 ...
## $ Perimeter : num 273 208 210 211 230 ...
## $ Roundness : num 0.765 0.832 0.868 0.87 0.875 ...
## $ AspectRation : num 1.44 1.45 1.47 1.48 1.51 ...
Since Class has only two categories, one of the dummy columns is redundant, so we delete Class.Gonen.
dmy$Class.Gonen <- NULL
Check the names of the dummy columns.
names(dmy)
## [1] "Class.Jasmine" "Area" "MajorAxisLength" "Eccentricity"
## [5] "ConvexArea" "EquivDiameter" "Perimeter" "Roundness"
## [9] "AspectRation"
Create train and test data from the dummy data frame, using the same intrain index as before.
set.seed(300)
dmy_train <- dmy[intrain,2:9]
dmy_test <- dmy[-intrain,2:9]
dmy_train_label <- dmy[intrain,1]
dmy_test_label <- dmy[-intrain,1]
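Note: KNN is distance-based, and these predictors sit on very different scales (Area is in the thousands while Eccentricity is below 1), so scaling the features usually helps. This is only an optional sketch and is not applied in the results below:
# optional sketch: scale the training data, then apply the same centers/scales to the test data
dmy_train_sc <- scale(dmy_train)
dmy_test_sc <- scale(dmy_test,
                     center = attr(dmy_train_sc, "scaled:center"),
                     scale = attr(dmy_train_sc, "scaled:scale"))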
Then predict using the KNN method. A common rule of thumb is to set k to the square root of the number of training observations.
round(sqrt(nrow(dmy_train)))
## [1] 113
pred_knn <- class::knn(train = dmy_train,
test = dmy_test,
cl = dmy_train_label,
k = 113)
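The square-root rule gives k = 113, but it is only a heuristic. A hedged sketch for spot-checking a few alternative k values (the candidate values here are illustrative, not tuned):
# sketch: compare test accuracy for a few candidate k values
for (k in c(21, 51, 113)) {
  pk <- class::knn(train = dmy_train, test = dmy_test, cl = dmy_train_label, k = k)
  cat("k =", k, "accuracy =", round(mean(as.character(pk) == as.character(dmy_test_label)), 4), "\n")
}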
Make a confusion matrix to evaluate the KNN predictions.
pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(dmy_test_label),"1")
pred_knn_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2160 38
## 1 305 2953
##
## Accuracy : 0.9371
## 95% CI : (0.9304, 0.9434)
## No Information Rate : 0.5482
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8719
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9873
## Specificity : 0.8763
## Pos Pred Value : 0.9064
## Neg Pred Value : 0.9827
## Prevalence : 0.5482
## Detection Rate : 0.5412
## Detection Prevalence : 0.5971
## Balanced Accuracy : 0.9318
##
## 'Positive' Class : 1
##
eval_logit <- tibble(Accuracy = log_conf$overall[1],
                     Recall = log_conf$byClass[1],
                     Specificity = log_conf$byClass[2],
                     Precision = log_conf$byClass[3])
eval_knn <- tibble(Accuracy = pred_knn_conf$overall[1],
                   Recall = pred_knn_conf$byClass[1],
                   Specificity = pred_knn_conf$byClass[2],
                   Precision = pred_knn_conf$byClass[3])
# Model Evaluation Logit
eval_logit
## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.988 0.992 0.983 0.986
# Model Evaluation K-NN
eval_knn
## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1    0.937  0.987       0.876     0.906
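For an easier side-by-side view, a small sketch that binds the two evaluation tables above into one (using dplyr::bind_rows, already available via library(dplyr)):
# sketch: combine the two evaluation tables
bind_rows(Logistic = eval_logit, KNN = eval_knn, .id = "Model")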
Both models have high accuracy (above 90%), but the logistic regression model is more accurate than the KNN model.
Although the difference in accuracy is not dramatic, the K-NN model classifies observations as the positive class more often than logistic regression (detection prevalence 0.597 vs 0.551). As a result, the specificity (0.876) and precision (0.906) of K-NN are lower than those of logistic regression (0.983 and 0.986), while the recall of the two models is similar. Overall, logistic regression performs better than K-NN on this dataset.