1 Introduction

1.1 Introduction

This report will show you process to make a classification model using logistic regression and k-nearest neighbor supervised learning to predict car acceptability.

This Car Acceptability Classification Database was derived from a simple hierarchical decision model which comes with CC-BY-NC-SA 4.0 license for non-commercial usage and it is provided by UCI Machine Learning Repository.

you can access this data from : https://www.kaggle.com/datasets/subhajeetdas/car-acceptability-classification-dataset

1.2 Package

To make the model we will use the following library :

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)

## Loading required package: lattice

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:gtools':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

2 Data Preparation & Explanation

2.1 Data input & Explanation

car <- read.csv("car.csv")
head(car)

The description of each variable are :

Buying_Price - Categorical Data [vhigh, high, med, low]

Maintenance_Price - Categorical Data [vhigh, high, med, low]

No_of_Doors - Categorical Data

Person_Capacity - Categorical Data [2, 4, more]

Size_of_Luggage - Categorical Data [small, med, big]

Safety - Categorical Data

Car_Acceptability - Categorical Data

2.2 Data Structure

glimpse(car)

## Rows: 1,728
## Columns: 7
## $ Buying_Price      <chr> "vhigh", "vhigh", "vhigh", "vhigh", "vhigh", "vhigh"…
## $ Maintenance_Price <chr> "vhigh", "vhigh", "vhigh", "vhigh", "vhigh", "vhigh"…
## $ No_of_Doors       <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2…
## $ Person_Capacity   <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "4", "4…
## $ Size_of_Luggage   <chr> "small", "small", "small", "med", "med", "med", "big…
## $ Safety            <chr> "low", "med", "high", "low", "med", "high", "low", "…
## $ Car_Acceptability <chr> "unacc", "unacc", "unacc", "unacc", "unacc", "unacc"…

The data has 1,728 rows and 7 columns. The column Car_Acceptability which indicates the accept status of a car of the company will be use as a target variable with the other column will be use as the predictor.

unique(car$Car_Acceptability)

## [1] "unacc" "acc"   "vgood" "good"

As we observe above, we have four level of unique value. we can assume that the value : “good” and “vgood” is classified as accepted car or “acc”, therefore we will convert those value to “acc” :

car$Car_Acceptability[car$Car_Acceptability %in% c("good", "vgood")] <- "acc"
  unique(car$Car_Acceptability)

## [1] "unacc" "acc"

The description of each variable above has tell us that each variable consist of categorical data, therefore we need to convert all variable to the desired data types :

car <- data.frame(lapply(car, as.factor))

2.3 Missing Value

colSums(is.na(car))

##      Buying_Price Maintenance_Price       No_of_Doors   Person_Capacity 
##                 0                 0                 0                 0 
##   Size_of_Luggage            Safety Car_Acceptability 
##                 0                 0                 0

sum(colSums(is.na(car)))

## [1] 0

The data set has no missing value, therefore we can proceed to the next step.

3 Model Building : Logistic Regression

3.1 Data Pre-processing

3.1.1 Data Proportion

Before we build the model, we need to examine the proportion of the target variable we have in the target column.

prop.table(table(car$Car_Acceptability))

## 
##       acc     unacc 
## 0.2997685 0.7002315

We can see that the proportion of positive and negative value of the target is unbalanced, this can affect the performance of the model, We have thousands data rows, therefore we will use downsampling method to make the proportion is balanced, however many of data information will be missing :

car_down <- downSample(
  x = car %>% select(-Car_Acceptability),
  y = car$Car_Acceptability,
  yname = "Car_Acceptability"
)
nrow(car_down)

## [1] 1036

prop.table(table(car_down$Car_Acceptability))

## 
##   acc unacc 
##   0.5   0.5

now we have balanced proportion of target variable with 1,036 data information left from the original data.

3.1.2 Data Splitting

In this section, we will split the dataset to data train and data test. The data train will be used to train the model and the data test will be used to evaluate the performance of the model. 80% of the dataset will be used for data train and the rest is data test.

set.seed(123)
samplesize <- round(0.8 * nrow(car_down), 0)
index <- sample(seq_len(nrow(car_down)), size = samplesize)

data_train <- car_down[index, ]
data_test <- car_down[-index, ]

3.2 Model Building

In this section we will make two model :

Logistic regression model using all available predictor variable
Logistic regression model with filtered predictor variable with step-wise feature selection method

3.2.1 Model ALL

model_all <- glm(formula = Car_Acceptability ~ . ,
                   data = data_train,
                 family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model_all)

## 
## Call:
## glm(formula = Car_Acceptability ~ ., family = "binomial", data = data_train)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              26.8737  1308.1443   0.021 0.983610    
## Buying_Pricelow          -4.3545     0.7167  -6.076 1.23e-09 ***
## Buying_Pricemed          -4.7254     0.7818  -6.044 1.50e-09 ***
## Buying_Pricevhigh         1.6174     0.4958   3.262 0.001105 ** 
## Maintenance_Pricelow     -2.5918     0.6354  -4.079 4.52e-05 ***
## Maintenance_Pricemed     -2.2397     0.5837  -3.837 0.000125 ***
## Maintenance_Pricevhigh    3.7578     0.6533   5.752 8.80e-09 ***
## No_of_Doors3             -2.0104     0.5463  -3.680 0.000233 ***
## No_of_Doors4             -2.7445     0.5747  -4.775 1.79e-06 ***
## No_of_Doors5more         -2.6781     0.5785  -4.629 3.67e-06 ***
## Person_Capacity4        -29.1335  1308.1444  -0.022 0.982232    
## Person_Capacitymore     -28.5812  1308.1443  -0.022 0.982569    
## Size_of_Luggagemed        1.3564     0.4907   2.764 0.005705 ** 
## Size_of_Luggagesmall      3.8438     0.6257   6.143 8.11e-10 ***
## Safetylow                29.1927  1347.0910   0.022 0.982710    
## Safetymed                 2.7712     0.4962   5.584 2.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1149.14  on 828  degrees of freedom
## Residual deviance:  187.37  on 813  degrees of freedom
## AIC: 219.37
## 
## Number of Fisher Scoring iterations: 20

Using this model, we can make interpretation :

The intercept in this model represents the estimated log-odds of the response variable when all the predictor variables are set to zero or their reference levels (for categorical variables). In the provided coefficients table, the intercept is represented by (Intercept) and has an estimated value of 27.6048. This indicates the baseline log-odds of the response variable in the absence of any predictors or when the predictor variables are at their reference levels.
breakdown of the information for the first few predictor variables:

Buying_Pricelow: This predictor variable has an estimated coefficient of -5.3785, a standard error of 0.8802, a z-value of -6.110, and a p-value of 9.94e-10.

Buying_Pricemed: This predictor variable has an estimated coefficient of -3.8459, a standard error of 0.7713, a z-value of -4.986, and a p-value of 6.15e-07.

Buying_Pricevhigh: This predictor variable has an estimated coefficient of 2.6234, a standard error of 0.6476, a z-value of 4.051, and a p-value of 5.10e-05.

The remaining predictor variables follow a similar pattern. The coefficients table provides information about the statistical significance and impact of each predictor variable on the response variable in the logistic regression model.

3.2.2 Model Step-wise

model_step <- step(model_all, direction = "backward", trace = F)

summary(model_step)

## 
## Call:
## glm(formula = Car_Acceptability ~ Buying_Price + Maintenance_Price + 
##     No_of_Doors + Person_Capacity + Size_of_Luggage + Safety, 
##     family = "binomial", data = data_train)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              26.8737  1308.1443   0.021 0.983610    
## Buying_Pricelow          -4.3545     0.7167  -6.076 1.23e-09 ***
## Buying_Pricemed          -4.7254     0.7818  -6.044 1.50e-09 ***
## Buying_Pricevhigh         1.6174     0.4958   3.262 0.001105 ** 
## Maintenance_Pricelow     -2.5918     0.6354  -4.079 4.52e-05 ***
## Maintenance_Pricemed     -2.2397     0.5837  -3.837 0.000125 ***
## Maintenance_Pricevhigh    3.7578     0.6533   5.752 8.80e-09 ***
## No_of_Doors3             -2.0104     0.5463  -3.680 0.000233 ***
## No_of_Doors4             -2.7445     0.5747  -4.775 1.79e-06 ***
## No_of_Doors5more         -2.6781     0.5785  -4.629 3.67e-06 ***
## Person_Capacity4        -29.1335  1308.1444  -0.022 0.982232    
## Person_Capacitymore     -28.5812  1308.1443  -0.022 0.982569    
## Size_of_Luggagemed        1.3564     0.4907   2.764 0.005705 ** 
## Size_of_Luggagesmall      3.8438     0.6257   6.143 8.11e-10 ***
## Safetylow                29.1927  1347.0910   0.022 0.982710    
## Safetymed                 2.7712     0.4962   5.584 2.35e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1149.14  on 828  degrees of freedom
## Residual deviance:  187.37  on 813  degrees of freedom
## AIC: 219.37
## 
## Number of Fisher Scoring iterations: 20

AIC (Akaike Information Criterion): It estimates the amount of missing information in a model. The smaller the AIC value, the better the model. Null deviance: It represents the error or discrepancy when no predictors are used in the model. Residual deviance: It represents the error or discrepancy when the model is built using the predictors. With these three parameter we can conclude that step-wise model is the better model.

However in this case, we have the predictor variables from model_step have the same variable from model_all, therefore the model model performance should be same.

3.3 Predict

We have our model with linear logistic, therefore we will use this model to make predict :

data_test$prob_acc<- predict(model_step, type = "response", newdata = data_test)

data_test$pred_acc <- factor(ifelse(data_test$prob_acc > 0.5, "unacc","acc"))
data_test[1:10, c("pred_acc", "Car_Acceptability")]

log_conf <- confusionMatrix(data_test$pred_acc, data_test$Car_Acceptability, positive = "acc")
log_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acc unacc
##      acc   105     3
##      unacc   3    96
##                                          
##                Accuracy : 0.971          
##                  95% CI : (0.938, 0.9893)
##     No Information Rate : 0.5217         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9419         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9722         
##             Specificity : 0.9697         
##          Pos Pred Value : 0.9722         
##          Neg Pred Value : 0.9697         
##              Prevalence : 0.5217         
##          Detection Rate : 0.5072         
##    Detection Prevalence : 0.5217         
##       Balanced Accuracy : 0.9710         
##                                          
##        'Positive' Class : acc            
##

Re-call/Sensitivity: It measures how effective the model can correctly predict the proportion of actual positive data.

Specificity: measures how effective the model can correctly predict the proportion of actual negative data.

Accuracy: measures how effective the model can correctly predict the target variable Y.

Precision: measures how effective the model can correctly predict the positive class among all predicted results.

In this business case, we will use the sensitivity/recall metric as model performance parameter. The sensitivity choosen as metric parameter because it measure the proportion of actual positive data to minimize the false positive. We surely don’t want the unaccepted car being used in the street that increase risk of accident.

The model_all or model_step have Sensitivity = 0.9703 which is safe enough for us to use those models.

4 K-Nearest Neightbor

4.1 Pre-processing Data

We can’t use categorical predictor in K-nearest Neighbor, therefore we will create dummy variables from categorical data to be used in classification.

car_dummy <- dummyVars(" ~.", data = car_down)
car_dummy <- data.frame(predict(car_dummy, newdata = car_down))
str(car_dummy)

## 'data.frame':    1036 obs. of  23 variables:
##  $ Buying_Price.high      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Buying_Price.low       : num  0 1 0 0 1 1 1 1 0 0 ...
##  $ Buying_Price.med       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Buying_Price.vhigh     : num  1 0 0 1 0 0 0 0 1 1 ...
##  $ Maintenance_Price.high : num  0 0 0 0 1 1 1 0 0 0 ...
##  $ Maintenance_Price.low  : num  0 0 0 1 0 0 0 1 0 0 ...
##  $ Maintenance_Price.med  : num  1 1 1 0 0 0 0 0 1 1 ...
##  $ Maintenance_Price.vhigh: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ No_of_Doors.2          : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ No_of_Doors.3          : num  0 0 1 0 1 0 0 1 0 0 ...
##  $ No_of_Doors.4          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ No_of_Doors.5more      : num  0 1 0 1 0 1 0 0 0 1 ...
##  $ Person_Capacity.2      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Person_Capacity.4      : num  0 1 0 1 0 0 0 0 1 1 ...
##  $ Person_Capacity.more   : num  1 0 1 0 1 1 1 1 0 0 ...
##  $ Size_of_Luggage.big    : num  1 1 0 0 0 1 1 0 1 1 ...
##  $ Size_of_Luggage.med    : num  0 0 1 1 1 0 0 0 0 0 ...
##  $ Size_of_Luggage.small  : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Safety.high            : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ Safety.low             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Safety.med             : num  1 0 1 0 1 0 1 1 0 1 ...
##  $ Car_Acceptability.acc  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Car_Acceptability.unacc: num  0 0 0 0 0 0 0 0 0 0 ...

Remove the targe dummy variables where there were only 2 categories.

car_dummy$Car_Acceptability.unacc <- NULL

4.2 Data Splitting

set.seed(123)
samplesize1 <- round(0.8 * nrow(car_dummy), 0)
index1 <- sample(seq_len(nrow(car_dummy)), size = samplesize1)

kn_train <- car_dummy[index1, 0:22 ]
kn_test <- car_dummy[-index1, 0:22]

kn_train_label <- car_dummy[index1,1]
kn_test_label <- car_dummy[-index1,1]

4.3 Predict

pred_knn <- knn(train = kn_train,
                       test = kn_test, 
                       cl = kn_train_label, 
                       k = 28)
sqrt(nrow(kn_train))

## [1] 28.79236

pred_knn

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
##  [38] 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
##  [75] 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0
## [112] 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0
## Levels: 0 1

In the code above, we use the optimum k value that is the root of the sum of our data.

pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(kn_test_label),"1")
pred_knn_conf

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 160   4
##          1   0  43
##                                           
##                Accuracy : 0.9807          
##                  95% CI : (0.9513, 0.9947)
##     No Information Rate : 0.7729          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9432          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 0.9149          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9756          
##              Prevalence : 0.2271          
##          Detection Rate : 0.2077          
##    Detection Prevalence : 0.2077          
##       Balanced Accuracy : 0.9574          
##                                           
##        'Positive' Class : 1               
##

The KNN model has the Sensitivity = 0.9608 which is safe enough for us to use those models.

5 Model Evaluation & Conclusion

As stated from above, we will use the sensitivity/recall metric as model performance parameter. The sensitivity choosen as metric parameter because it measure the proportion of actual positive data to minimize the false positive. We surely don’t want the unaccepted car being used in the street that will increase risk of accident. On the contrary, if there are accepted car(actual data) that have unaccepted status(predict data), the developer of the car can submit an appeal for further inspection and won’t cause any risk of accident.

KNN model Sensitivity = 0.9608
Logistic Regression model Sensitivity = 0.9703

Therefore we can conclude that the Logistic Regression have the better model to predict the Acceptability. A sensitivity value of 0.9703 indicates that the logistic regression model is able to accurately identify approximately 97.03% of the true positive instances from all the actual positive instances in the actual data. In other words, it has a high ability to correctly predict positive cases in the actual data.

Classification - Boston House Pricing

Rizky Fadilah

29 June 2023