1 Explanation

There are many people who can’t differentiate between an orange and a grapefruit because of their similarity. Here we want to classify the two of them by their diameter, weight, and color (red, green, blue) using Logistic Regression and K-Nearest Neighbor, with data about Oranges vs. Grapefruit downloaded from Kaggle.

Some relevant columns in the data:
- name: Label of the fruit. This should be either ‘orange’ or ‘grapefruit’.
- diameter: Diameter of the fruit in centimeters.
- weight: Weight of the fruit in grams.
- red: Average red reading from an RGB scan. Values should be from 0 to 255.
- green: Average green reading from an RGB scan. Values should be from 0 to 255.
- blue: Average blue reading from an RGB scan. Values should be from 0 to 255.

2 Data Preparation

2.1 Attaching Packages

library(car)
## Loading required package: carData
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2
library(class)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

2.2 Input Data

citrus <- read.csv("data/citrus.csv")
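
Since the column descriptions above state fixed ranges for the RGB readings, a quick sanity check can be run right after loading the file. A minimal sketch (it stops with an error if any value falls outside 0 to 255):

# check that the RGB readings stay within the stated 0-255 range
stopifnot(all(citrus$red >= 0 & citrus$red <= 255),
          all(citrus$green >= 0 & citrus$green <= 255),
          all(citrus$blue >= 0 & citrus$blue <= 255))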

2.3 Data Inspection

head(citrus)
##     name diameter weight red green blue
## 1 orange     2.96  86.76 172    85    2
## 2 orange     3.91  88.05 166    78    3
## 3 orange     4.42  95.17 156    81    2
## 4 orange     4.47  95.60 163    81    4
## 5 orange     4.48  95.76 161    72    9
## 6 orange     4.59  95.86 142   100    2
tail(citrus)
##             name diameter weight red green blue
## 9995  grapefruit    15.16 253.64 136    76   20
## 9996  grapefruit    15.35 253.89 149    77   20
## 9997  grapefruit    15.41 254.67 148    68    7
## 9998  grapefruit    15.59 256.50 168    82   20
## 9999  grapefruit    15.92 260.14 142    72   11
## 10000 grapefruit    16.45 261.51 152    74    2
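
head() and tail() only show the first and last rows. Base R’s summary() gives the range and quartiles of every column as well; a minimal sketch (output omitted):

# per-column minimum, quartiles, mean, and maximum
summary(citrus)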

2.4 Data Cleansing

Check whether there are any missing values in the data.

anyNA(citrus)
## [1] FALSE
colSums(is.na(citrus))
##     name diameter   weight      red    green     blue 
##        0        0        0        0        0        0

There are no missing values in this data.
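
Had there been missing values, one simple option is to drop the incomplete rows before modeling. A minimal sketch, hypothetical here since the data is complete:

# drop any row that contains an NA (not needed for this data)
citrus_complete <- na.omit(citrus)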

Check the data type of each column.

glimpse(citrus)
## Rows: 10,000
## Columns: 6
## $ name     <chr> "orange", "orange", "orange", "orange", "orange", "orange", "~
## $ diameter <dbl> 2.96, 3.91, 4.42, 4.47, 4.48, 4.59, 4.64, 4.65, 4.68, 4.69, 4~
## $ weight   <dbl> 86.76, 88.05, 95.17, 95.60, 95.76, 95.86, 97.94, 98.50, 100.2~
## $ red      <int> 172, 166, 156, 163, 161, 142, 156, 142, 159, 161, 148, 166, 1~
## $ green    <int> 85, 78, 81, 81, 72, 100, 85, 74, 90, 76, 88, 69, 98, 86, 82, ~
## $ blue     <int> 2, 3, 2, 4, 9, 2, 2, 2, 16, 6, 2, 2, 13, 6, 2, 12, 5, 2, 22, ~
The name column is stored as character, so we convert it to a factor, which is what the modeling functions expect.

citrus <- citrus %>% 
   mutate(name = as.factor(name))

2.5 Exploratory Data Analysis

Before we use the data, we need to check the class proportions of orange and grapefruit.

citrus$name %>% 
   table() %>% 
   prop.table()
## .
## grapefruit     orange 
##        0.5        0.5

The two classes are perfectly balanced, so we can use the data as is.
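
The balance can also be checked visually with a bar chart, using the already-attached ggplot2. A minimal sketch:

# bar chart of class counts; both bars should have equal height
ggplot(citrus, aes(x = name, fill = name)) +
   geom_bar() +
   labs(title = "Class Balance", x = NULL, y = "Count")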

2.6 Cross Validation

Here we split the data into a train set and a test set. The train set is used to build the model, and the test set is used to evaluate the model built on the train set.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
#index sampling
index <- sample(nrow(citrus), 
                nrow(citrus)*0.8) 
  
#splitting
citrus_train <- citrus[index, ]
citrus_test <- citrus[-index, ]

We check the class proportions again in both sets.

citrus_train$name %>% 
   table() %>% 
   prop.table()
## .
## grapefruit     orange 
##   0.497625   0.502375
citrus_test$name %>% 
   table() %>% 
   prop.table()
## .
## grapefruit     orange 
##     0.5095     0.4905

Since both the train set and the test set remain balanced, we can proceed to the next step.
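
As a side note, the already-attached caret package provides createDataPartition(), which performs a stratified split that preserves the class proportions exactly. A minimal sketch of the same 80/20 split:

set.seed(100)
# stratified sampling on the class label
index_strat <- createDataPartition(citrus$name, p = 0.8, list = FALSE)
citrus_train_strat <- citrus[index_strat, ]
citrus_test_strat <- citrus[-index_strat, ]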

3 Logistic Regression

The first model we build uses Logistic Regression.

model_one <- glm(formula = name ~ .,
                 data = citrus_train,
                 family = "binomial")

summary(model_one)
## 
## Call:
## glm(formula = name ~ ., family = "binomial", data = citrus_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6153  -0.1561   0.0001   0.0624   4.0595  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -31.150574   2.082544 -14.958  < 2e-16 ***
## diameter    -27.112488   1.190296 -22.778  < 2e-16 ***
## weight        1.643456   0.077055  21.328  < 2e-16 ***
## red           0.049751   0.006102   8.154 3.53e-16 ***
## green         0.114186   0.006652  17.165  < 2e-16 ***
## blue         -0.125940   0.008500 -14.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11090.2  on 7999  degrees of freedom
## Residual deviance:  1883.1  on 7994  degrees of freedom
## AIC: 1895.1
## 
## Number of Fisher Scoring iterations: 8
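
The coefficients above are on the log-odds scale for the class orange (the second factor level). Exponentiating them gives odds ratios, which are often easier to interpret. A minimal sketch:

# multiplicative change in the odds of 'orange' per one-unit
# increase in each predictor
exp(coef(model_one))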

After building the model, we use it to predict the test data in order to check the model’s accuracy.

citrus_test$pred_prob <- predict(object = model_one,
        newdata = citrus_test,
        type = "response")

We convert the predicted probabilities into class labels, using 0.5 as the threshold, and save them in another column of the test data.

citrus_test$pred_label <- ifelse(citrus_test$pred_prob > 0.5, "orange", "grapefruit")

citrus_test <- citrus_test %>% 
   mutate(pred_label = as.factor(pred_label))
citrus_test %>% 
   select(name, pred_prob, pred_label) %>% 
   head(10)
##      name pred_prob pred_label
## 1  orange 1.0000000     orange
## 7  orange 1.0000000     orange
## 8  orange 1.0000000     orange
## 19 orange 1.0000000     orange
## 25 orange 1.0000000     orange
## 26 orange 1.0000000     orange
## 32 orange 1.0000000     orange
## 39 orange 1.0000000     orange
## 40 orange 1.0000000     orange
## 50 orange 0.9999997     orange

We will evaluate the model using a confusion matrix.

confusionMatrix(data = citrus_test$pred_label,
                reference = citrus_test$name,
                positive = "orange")
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   grapefruit orange
##   grapefruit       1000     46
##   orange             19    935
##                                           
##                Accuracy : 0.9675          
##                  95% CI : (0.9588, 0.9748)
##     No Information Rate : 0.5095          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9349          
##                                           
##  Mcnemar's Test P-Value : 0.00126         
##                                           
##             Sensitivity : 0.9531          
##             Specificity : 0.9814          
##          Pos Pred Value : 0.9801          
##          Neg Pred Value : 0.9560          
##              Prevalence : 0.4905          
##          Detection Rate : 0.4675          
##    Detection Prevalence : 0.4770          
##       Balanced Accuracy : 0.9672          
##                                           
##        'Positive' Class : orange          
## 

The confusion matrix above shows that our logistic regression model has an accuracy of 96.75 % on the test data, meaning that 96.75 % of the observations are correctly classified. The model has a sensitivity of 95.31 % and a specificity of 98.14 %, so it identifies both the positive class (orange) and the negative class (grapefruit) well. The precision/positive predictive value is 98.01 %, meaning that 98.01 % of our positive predictions are correct.
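
Because the model outputs probabilities, its performance can also be summarized independently of the 0.5 threshold with an ROC curve and its AUC. A minimal sketch, assuming the pROC package is installed (it is not attached above):

library(pROC)

# ROC curve for the predicted probability of 'orange';
# an AUC close to 1 means the classes are almost perfectly separable
roc_logit <- roc(response = citrus_test$name,
                 predictor = citrus_test$pred_prob)
auc(roc_logit)
plot(roc_logit)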

4 K-Nearest Neighbor

Next we build a model using K-Nearest Neighbor. To use K-Nearest Neighbor we need to separate the label from the predictors; for the test set we also drop the two prediction columns added in the previous section.

# predictor
citrus_train_x <- citrus_train[,-1]
citrus_test_x <- citrus_test[,c(-1, -7, -8)]

# target
citrus_train_y <- citrus_train[,1]
citrus_test_y <- citrus_test[,1]

We scale the predictors so that every feature contributes proportionally to the distance calculation. The test set is scaled with the center and scale of the train set, so that no information from the test set leaks into the preprocessing.

citrus_train_xs <- scale(x = citrus_train_x)
citrus_test_xs <- scale(x = citrus_test_x,
                        center = attr(citrus_train_xs, "scaled:center"),
                        scale = attr(citrus_train_xs, "scaled:scale"))
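
A quick sanity check that the scaling worked: the train predictors should now have means close to 0 and standard deviations close to 1.

# verify the standardization of the train predictors
round(colMeans(citrus_train_xs), 3)
round(apply(citrus_train_xs, 2, sd), 3)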

A common rule of thumb for the optimum k is the square root of the number of rows in the train data.

citrus_train %>% 
   nrow() %>% 
   sqrt() %>% 
   round()
## [1] 89
model_knn <- knn(train = citrus_train_xs,
                 test = citrus_test_xs,
                 cl = citrus_train_y,
                 k = 89)

head(model_knn)
## [1] orange orange orange orange orange orange
## Levels: grapefruit orange
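
The rule-of-thumb k is only a starting point. A small sweep over candidate values shows how sensitive the test accuracy is to the choice of k; a minimal sketch (the candidate values are arbitrary):

# compare test accuracy across a few candidate k values
for (k in c(5, 21, 45, 89)) {
   pred_k <- knn(train = citrus_train_xs,
                 test = citrus_test_xs,
                 cl = citrus_train_y,
                 k = k)
   cat("k =", k, "accuracy =", round(mean(pred_k == citrus_test_y), 4), "\n")
}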

We will evaluate the model using a confusion matrix.

confusionMatrix(data = model_knn,
                reference = citrus_test_y,
                positive = "orange")
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   grapefruit orange
##   grapefruit        927     63
##   orange             92    918
##                                           
##                Accuracy : 0.9225          
##                  95% CI : (0.9099, 0.9338)
##     No Information Rate : 0.5095          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.845           
##                                           
##  Mcnemar's Test P-Value : 0.02451         
##                                           
##             Sensitivity : 0.9358          
##             Specificity : 0.9097          
##          Pos Pred Value : 0.9089          
##          Neg Pred Value : 0.9364          
##              Prevalence : 0.4905          
##          Detection Rate : 0.4590          
##    Detection Prevalence : 0.5050          
##       Balanced Accuracy : 0.9227          
##                                           
##        'Positive' Class : orange          
## 

The confusion matrix above shows that our K-Nearest Neighbor model has an accuracy of 92.25 % on the test data, meaning that 92.25 % of the observations are correctly classified. The model has a sensitivity of 93.58 % and a specificity of 90.97 %, so it identifies both classes well, though less cleanly than the logistic regression model. The precision/positive predictive value is 90.89 %, meaning that 90.89 % of our positive predictions are correct.

5 Conclusion

data.frame("Model" = c("Logistic Regression", "K-Nearest Neighbor"),
           "Accuracy" = c(96.75, 92.25),
           "Sensitivity" = c(95.31, 93.58),
           "Specificity" = c(98.14, 90.97),
           "Pos Pred Value" = c(98.01, 90.89))
##                 Model Accuracy Sensitivity Specificity Pos.Pred.Value
## 1 Logistic Regression    96.75       95.31       98.14          98.01
## 2  K-Nearest Neighbor    92.25       93.58       90.97          90.89

Both models perform well, partly because the data is balanced between the two classes. However, the Logistic Regression model outperforms the K-Nearest Neighbor model on every metric. Since the aim of this report is to differentiate between an orange and a grapefruit, we will use the Logistic Regression model based on its higher accuracy.