Many people cannot tell an orange from a grapefruit because the two fruits look so similar. Here we want to classify the two fruits by their diameter, weight, and color (red, green, blue channels) using Logistic Regression and K-Nearest Neighbor, based on the Oranges vs. Grapefruit dataset downloaded from Kaggle.
Some relevant columns in the data:
- name : Label of the fruit. This should be either ‘orange’ or ‘grapefruit’.
- diameter : Diameter of the fruit in centimeters.
- weight : Weight of the fruit in grams.
- red : Average red reading from an RGB scan. Values should be from 0 to 255.
- green : Average green reading from an RGB scan. Values should be from 0 to 255.
- blue : Average blue reading from an RGB scan. Values should be from 0 to 255.
library(car)
## Loading required package: carData
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2
library(class)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
citrus <- read.csv("data/citrus.csv")
head(citrus)
##     name diameter weight red green blue
## 1 orange 2.96 86.76 172 85 2
## 2 orange 3.91 88.05 166 78 3
## 3 orange 4.42 95.17 156 81 2
## 4 orange 4.47 95.60 163 81 4
## 5 orange 4.48 95.76 161 72 9
## 6 orange 4.59 95.86 142 100 2
tail(citrus)
##            name diameter weight red green blue
## 9995 grapefruit 15.16 253.64 136 76 20
## 9996 grapefruit 15.35 253.89 149 77 20
## 9997 grapefruit 15.41 254.67 148 68 7
## 9998 grapefruit 15.59 256.50 168 82 20
## 9999 grapefruit 15.92 260.14 142 72 11
## 10000 grapefruit 16.45 261.51 152 74 2
First, we check whether there are any missing values in the data.
anyNA(citrus)
## [1] FALSE
colSums(is.na(citrus))
##     name diameter   weight      red    green     blue
## 0 0 0 0 0 0
There are no missing values in this data.
Next, we check the data type of each column.
glimpse(citrus)
## Rows: 10,000
## Columns: 6
## $ name <chr> "orange", "orange", "orange", "orange", "orange", "orange", "~
## $ diameter <dbl> 2.96, 3.91, 4.42, 4.47, 4.48, 4.59, 4.64, 4.65, 4.68, 4.69, 4~
## $ weight <dbl> 86.76, 88.05, 95.17, 95.60, 95.76, 95.86, 97.94, 98.50, 100.2~
## $ red <int> 172, 166, 156, 163, 161, 142, 156, 142, 159, 161, 148, 166, 1~
## $ green <int> 85, 78, 81, 81, 72, 100, 85, 74, 90, 76, 88, 69, 98, 86, 82, ~
## $ blue <int> 2, 3, 2, 4, 9, 2, 2, 2, 16, 6, 2, 2, 13, 6, 2, 12, 5, 2, 22, ~
citrus <- citrus %>%
  mutate(name = as.factor(name))
Before we use the data, we need to check the class proportions of oranges and grapefruits.
citrus$name %>%
  table() %>%
  prop.table()
## .
## grapefruit orange
## 0.5 0.5
The class proportions are balanced, so we can use the data as is.
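Had the classes been imbalanced, we could have rebalanced them before modeling. A minimal sketch using caret::downSample, shown purely for illustration since our data does not need it:
# Only needed if the classes were imbalanced -- not the case here.
# downSample() randomly drops rows of the majority class until the
# classes are even; the rebalanced label ends up in column "Class".
citrus_balanced <- downSample(x = citrus %>% select(-name),
                              y = citrus$name)
prop.table(table(citrus_balanced$Class))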
Next we split the data into a training set and a test set. The training set is used to build the models, and the test set is used to evaluate how well those models perform on unseen data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
# index sampling: draw 80% of the row indices at random
index <- sample(nrow(citrus),
                nrow(citrus) * 0.8)
# splitting
citrus_train <- citrus[index, ]
citrus_test <- citrus[-index, ]
We check the class proportions again after the split.
citrus_train$name %>%
  table() %>%
  prop.table()
## .
## grapefruit orange
## 0.497625 0.502375
citrus_test$name %>%
  table() %>%
  prop.table()
## .
## grapefruit orange
## 0.5095 0.4905
Since both the training data and the test data are balanced, we can proceed to the next step.
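As an aside, sampling row indices with sample() happened to give a near-balanced split here, but it does not guarantee one. A stratified split preserves the class proportions by construction; a minimal sketch using caret::createDataPartition, offered as an alternative to (not part of) the workflow above:
# createDataPartition() samples within each level of the target,
# so both splits keep roughly the original class proportions.
set.seed(100)
strat_index <- createDataPartition(citrus$name, p = 0.8, list = FALSE)
strat_train <- citrus[strat_index, ]
strat_test  <- citrus[-strat_index, ]
prop.table(table(strat_train$name))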
The first model we build uses Logistic Regression.
model_one <- glm(formula = name ~ .,
                 data = citrus_train,
                 family = "binomial")
summary(model_one)
##
## Call:
## glm(formula = name ~ ., family = "binomial", data = citrus_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6153 -0.1561 0.0001 0.0624 4.0595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -31.150574 2.082544 -14.958 < 2e-16 ***
## diameter -27.112488 1.190296 -22.778 < 2e-16 ***
## weight 1.643456 0.077055 21.328 < 2e-16 ***
## red 0.049751 0.006102 8.154 3.53e-16 ***
## green 0.114186 0.006652 17.165 < 2e-16 ***
## blue -0.125940 0.008500 -14.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11090.2 on 7999 degrees of freedom
## Residual deviance: 1883.1 on 7994 degrees of freedom
## AIC: 1895.1
##
## Number of Fisher Scoring iterations: 8
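All five predictors are highly significant. The coefficients are on the log-odds scale, with ‘orange’ (the second factor level) as the modeled class, so exponentiating them gives odds ratios; a quick sketch:
# Odds ratios: the multiplicative change in the odds of 'orange'
# for a one-unit increase in each predictor, holding the rest fixed.
exp(coef(model_one))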
After building the model, we use it to predict the test data so we can assess its accuracy.
citrus_test$pred_prob <- predict(object = model_one,
                                 newdata = citrus_test,
                                 type = "response")
We save the predicted probabilities in a new column of the test data, then convert them into class labels using a 0.5 cutoff.
citrus_test$pred_label <- ifelse(citrus_test$pred_prob > 0.5, "orange", "grapefruit")
citrus_test <- citrus_test %>%
  mutate(pred_label = as.factor(pred_label))
citrus_test %>%
  select(name, pred_prob, pred_label) %>%
  head(10)
##      name pred_prob pred_label
## 1 orange 1.0000000 orange
## 7 orange 1.0000000 orange
## 8 orange 1.0000000 orange
## 19 orange 1.0000000 orange
## 25 orange 1.0000000 orange
## 26 orange 1.0000000 orange
## 32 orange 1.0000000 orange
## 39 orange 1.0000000 orange
## 40 orange 1.0000000 orange
## 50 orange 0.9999997 orange
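A note on the 0.5 cutoff: glm() with family = "binomial" models the probability of the second factor level, and since factor levels are ordered alphabetically here, that level is ‘orange’. This is why probabilities above 0.5 map to ‘orange’, which we can verify directly:
# glm() predicts P(name == second level); check the level order.
levels(citrus_train$name)
## [1] "grapefruit" "orange"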
We evaluate the model using a confusion matrix.
confusionMatrix(data = citrus_test$pred_label,
                reference = citrus_test$name,
                positive = "orange")
## Confusion Matrix and Statistics
##
## Reference
## Prediction grapefruit orange
## grapefruit 1000 46
## orange 19 935
##
## Accuracy : 0.9675
## 95% CI : (0.9588, 0.9748)
## No Information Rate : 0.5095
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9349
##
## Mcnemar's Test P-Value : 0.00126
##
## Sensitivity : 0.9531
## Specificity : 0.9814
## Pos Pred Value : 0.9801
## Neg Pred Value : 0.9560
## Prevalence : 0.4905
## Detection Rate : 0.4675
## Detection Prevalence : 0.4770
## Balanced Accuracy : 0.9672
##
## 'Positive' Class : orange
##
The confusion matrix above shows that our logistic regression model reaches an accuracy of 96.75% on the test data, meaning that 96.75% of the observations are classified correctly. The model has a sensitivity of 95.31% and a specificity of 98.14%, so it identifies both the positive class (orange) and the negative class (grapefruit) very well. The precision (positive predictive value) is 98.01%, meaning that 98.01% of the positive predictions are correct.
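These figures follow directly from the counts in the matrix above; as a sanity check, here is a short sketch recomputing them by hand:
# Counts from the confusion matrix, with 'orange' as the positive class.
TP <- 935   # orange predicted as orange
TN <- 1000  # grapefruit predicted as grapefruit
FP <- 19    # grapefruit predicted as orange
FN <- 46    # orange predicted as grapefruit
(TP + TN) / (TP + TN + FP + FN)  # accuracy:    0.9675
TP / (TP + FN)                   # sensitivity: 0.9531
TN / (TN + FP)                   # specificity: 0.9814
TP / (TP + FP)                   # precision:   0.9801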
Next we build a model using K-Nearest Neighbor. To use K-Nearest Neighbor we need to separate the label from the predictors.
# predictors (drop the label; for the test set also drop the two
# prediction columns added earlier)
citrus_train_x <- citrus_train[, -1]
citrus_test_x <- citrus_test[, c(-1, -7, -8)]
# target
citrus_train_y <- citrus_train[, 1]
citrus_test_y <- citrus_test[, 1]
We need to scale the predictors so that features measured on different scales contribute proportionately to the distance calculation.
citrus_train_xs <- scale(x = citrus_train_x)
citrus_test_xs <- scale(x = citrus_test_x,
                        center = attr(citrus_train_xs, "scaled:center"),
                        scale = attr(citrus_train_xs, "scaled:scale"))
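Note that the test predictors are centered and scaled with the training set's statistics, so no information leaks from the test data into preprocessing. For a single column this is just the z-score computed with training parameters; a small sketch for diameter, for illustration only:
# z-score of the test diameters using the *training* mean and sd;
# this matches the first column of citrus_test_xs.
z_diameter <- (citrus_test_x$diameter - mean(citrus_train_x$diameter)) /
  sd(citrus_train_x$diameter)
head(z_diameter)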
Next, we choose k using the square root of the number of training observations.
citrus_train %>%
  nrow() %>%
  sqrt() %>%
  round()
## [1] 89
model_knn <- knn(train = citrus_train_xs,
                 test = citrus_test_xs,
                 cl = citrus_train_y,
                 k = 89)
head(model_knn)
## [1] orange orange orange orange orange orange
## Levels: grapefruit orange
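The square-root rule is only a heuristic (k = 89 is also conveniently odd, which avoids ties in the two-class majority vote). A more thorough approach would scan a range of k values, ideally against a separate validation set or with cross-validation rather than the test set; a minimal sketch of such a scan:
# Try several odd values of k and record the accuracy for each.
k_grid <- seq(1, 149, by = 2)
acc <- sapply(k_grid, function(k) {
  pred <- knn(train = citrus_train_xs,
              test = citrus_test_xs,
              cl = citrus_train_y,
              k = k)
  mean(pred == citrus_test_y)
})
k_grid[which.max(acc)]  # k with the highest accuracy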
We evaluate this model using a confusion matrix as well.
confusionMatrix(data = model_knn,
                reference = citrus_test_y,
                positive = "orange")
## Confusion Matrix and Statistics
##
## Reference
## Prediction grapefruit orange
## grapefruit 927 63
## orange 92 918
##
## Accuracy : 0.9225
## 95% CI : (0.9099, 0.9338)
## No Information Rate : 0.5095
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.845
##
## Mcnemar's Test P-Value : 0.02451
##
## Sensitivity : 0.9358
## Specificity : 0.9097
## Pos Pred Value : 0.9089
## Neg Pred Value : 0.9364
## Prevalence : 0.4905
## Detection Rate : 0.4590
## Detection Prevalence : 0.5050
## Balanced Accuracy : 0.9227
##
## 'Positive' Class : orange
##
The confusion matrix above shows that our K-Nearest Neighbor model reaches an accuracy of 92.25% on the test data, meaning that 92.25% of the observations are classified correctly. The model has a sensitivity of 93.58% and a specificity of 90.97%, so it identifies both classes well, though not as well as the logistic regression model. The precision (positive predictive value) is 90.89%, meaning that 90.89% of the positive predictions are correct. Finally, we summarize the performance of both models side by side.
data.frame("Model" = c("Logistic Regression", "K-Nearest Neighbor"),
"Accuracy" = c(96.75, 92.25),
"Sensitivity" = c(95.31, 93.58),
"Specificity" = c(98.14, 90.97),
"Pos Pred Value" = c(98.01, 90.89))## Model Accuracy Sensitivity Specificity Pos.Pred.Value
## 1 Logistic Regression 96.75 95.31 98.14 98.01
## 2 K-Nearest Neighbor 92.25 93.58 90.97 90.89
Both models give similar, strong results, partly because the two classes are balanced. However, the Logistic Regression model outperforms the K-Nearest Neighbor model on every metric. Since the aim of this report is to differentiate between an orange and a grapefruit, we choose the Logistic Regression model based on its higher accuracy.