This report will show you process to make a classification model using logistic regression and k-nearest neighbor supervised learning to predict car acceptability.
This Car Acceptability Classification Database was derived from a simple hierarchical decision model which comes with CC-BY-NC-SA 4.0 license for non-commercial usage and it is provided by UCI Machine Learning Repository.
you can access this data from : https://www.kaggle.com/datasets/subhajeetdas/car-acceptability-classification-dataset
To make the model we will use the following library :
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)## Loading required package: lattice
library(car)## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:gtools':
##
## logit
## The following object is masked from 'package:dplyr':
##
## recode
car <- read.csv("car.csv")
head(car)The description of each variable are :
Buying_Price - Categorical Data [vhigh, high, med, low]
Maintenance_Price - Categorical Data [vhigh, high, med, low]
No_of_Doors - Categorical Data
Person_Capacity - Categorical Data [2, 4, more]
Size_of_Luggage - Categorical Data [small, med, big]
Safety - Categorical Data
Car_Acceptability - Categorical Data
glimpse(car)## Rows: 1,728
## Columns: 7
## $ Buying_Price <chr> "vhigh", "vhigh", "vhigh", "vhigh", "vhigh", "vhigh"…
## $ Maintenance_Price <chr> "vhigh", "vhigh", "vhigh", "vhigh", "vhigh", "vhigh"…
## $ No_of_Doors <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2…
## $ Person_Capacity <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "4", "4…
## $ Size_of_Luggage <chr> "small", "small", "small", "med", "med", "med", "big…
## $ Safety <chr> "low", "med", "high", "low", "med", "high", "low", "…
## $ Car_Acceptability <chr> "unacc", "unacc", "unacc", "unacc", "unacc", "unacc"…
The data has 1,728 rows and 7 columns. The column Car_Acceptability which indicates the accept status of a car of the company will be use as a target variable with the other column will be use as the predictor.
unique(car$Car_Acceptability)## [1] "unacc" "acc" "vgood" "good"
As we observe above, we have four level of unique value. we can assume that the value : “good” and “vgood” is classified as accepted car or “acc”, therefore we will convert those value to “acc” :
car$Car_Acceptability[car$Car_Acceptability %in% c("good", "vgood")] <- "acc"
unique(car$Car_Acceptability)## [1] "unacc" "acc"
The description of each variable above has tell us that each variable consist of categorical data, therefore we need to convert all variable to the desired data types :
car <- data.frame(lapply(car, as.factor))colSums(is.na(car))## Buying_Price Maintenance_Price No_of_Doors Person_Capacity
## 0 0 0 0
## Size_of_Luggage Safety Car_Acceptability
## 0 0 0
sum(colSums(is.na(car)))## [1] 0
The data set has no missing value, therefore we can proceed to the next step.
Before we build the model, we need to examine the proportion of the target variable we have in the target column.
prop.table(table(car$Car_Acceptability))##
## acc unacc
## 0.2997685 0.7002315
We can see that the proportion of positive and negative value of the target is unbalanced, this can affect the performance of the model, We have thousands data rows, therefore we will use downsampling method to make the proportion is balanced, however many of data information will be missing :
car_down <- downSample(
x = car %>% select(-Car_Acceptability),
y = car$Car_Acceptability,
yname = "Car_Acceptability"
)
nrow(car_down)## [1] 1036
prop.table(table(car_down$Car_Acceptability))##
## acc unacc
## 0.5 0.5
now we have balanced proportion of target variable with 1,036 data information left from the original data.
In this section, we will split the dataset to data train and data test. The data train will be used to train the model and the data test will be used to evaluate the performance of the model. 80% of the dataset will be used for data train and the rest is data test.
set.seed(123)
samplesize <- round(0.8 * nrow(car_down), 0)
index <- sample(seq_len(nrow(car_down)), size = samplesize)
data_train <- car_down[index, ]
data_test <- car_down[-index, ]In this section we will make two model :
Logistic regression model using all available predictor variable
Logistic regression model with filtered predictor variable with step-wise feature selection method
model_all <- glm(formula = Car_Acceptability ~ . ,
data = data_train,
family = "binomial")## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model_all)##
## Call:
## glm(formula = Car_Acceptability ~ ., family = "binomial", data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.8737 1308.1443 0.021 0.983610
## Buying_Pricelow -4.3545 0.7167 -6.076 1.23e-09 ***
## Buying_Pricemed -4.7254 0.7818 -6.044 1.50e-09 ***
## Buying_Pricevhigh 1.6174 0.4958 3.262 0.001105 **
## Maintenance_Pricelow -2.5918 0.6354 -4.079 4.52e-05 ***
## Maintenance_Pricemed -2.2397 0.5837 -3.837 0.000125 ***
## Maintenance_Pricevhigh 3.7578 0.6533 5.752 8.80e-09 ***
## No_of_Doors3 -2.0104 0.5463 -3.680 0.000233 ***
## No_of_Doors4 -2.7445 0.5747 -4.775 1.79e-06 ***
## No_of_Doors5more -2.6781 0.5785 -4.629 3.67e-06 ***
## Person_Capacity4 -29.1335 1308.1444 -0.022 0.982232
## Person_Capacitymore -28.5812 1308.1443 -0.022 0.982569
## Size_of_Luggagemed 1.3564 0.4907 2.764 0.005705 **
## Size_of_Luggagesmall 3.8438 0.6257 6.143 8.11e-10 ***
## Safetylow 29.1927 1347.0910 0.022 0.982710
## Safetymed 2.7712 0.4962 5.584 2.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1149.14 on 828 degrees of freedom
## Residual deviance: 187.37 on 813 degrees of freedom
## AIC: 219.37
##
## Number of Fisher Scoring iterations: 20
Using this model, we can make interpretation :
The intercept in this model represents the estimated log-odds of the response variable when all the predictor variables are set to zero or their reference levels (for categorical variables). In the provided coefficients table, the intercept is represented by (Intercept) and has an estimated value of 27.6048. This indicates the baseline log-odds of the response variable in the absence of any predictors or when the predictor variables are at their reference levels.
breakdown of the information for the first few predictor variables:
Buying_Pricelow: This predictor variable has an estimated coefficient of -5.3785, a standard error of 0.8802, a z-value of -6.110, and a p-value of 9.94e-10.
Buying_Pricemed: This predictor variable has an estimated coefficient of -3.8459, a standard error of 0.7713, a z-value of -4.986, and a p-value of 6.15e-07.
Buying_Pricevhigh: This predictor variable has an estimated coefficient of 2.6234, a standard error of 0.6476, a z-value of 4.051, and a p-value of 5.10e-05.
The remaining predictor variables follow a similar pattern. The coefficients table provides information about the statistical significance and impact of each predictor variable on the response variable in the logistic regression model.
model_step <- step(model_all, direction = "backward", trace = F)summary(model_step)##
## Call:
## glm(formula = Car_Acceptability ~ Buying_Price + Maintenance_Price +
## No_of_Doors + Person_Capacity + Size_of_Luggage + Safety,
## family = "binomial", data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.8737 1308.1443 0.021 0.983610
## Buying_Pricelow -4.3545 0.7167 -6.076 1.23e-09 ***
## Buying_Pricemed -4.7254 0.7818 -6.044 1.50e-09 ***
## Buying_Pricevhigh 1.6174 0.4958 3.262 0.001105 **
## Maintenance_Pricelow -2.5918 0.6354 -4.079 4.52e-05 ***
## Maintenance_Pricemed -2.2397 0.5837 -3.837 0.000125 ***
## Maintenance_Pricevhigh 3.7578 0.6533 5.752 8.80e-09 ***
## No_of_Doors3 -2.0104 0.5463 -3.680 0.000233 ***
## No_of_Doors4 -2.7445 0.5747 -4.775 1.79e-06 ***
## No_of_Doors5more -2.6781 0.5785 -4.629 3.67e-06 ***
## Person_Capacity4 -29.1335 1308.1444 -0.022 0.982232
## Person_Capacitymore -28.5812 1308.1443 -0.022 0.982569
## Size_of_Luggagemed 1.3564 0.4907 2.764 0.005705 **
## Size_of_Luggagesmall 3.8438 0.6257 6.143 8.11e-10 ***
## Safetylow 29.1927 1347.0910 0.022 0.982710
## Safetymed 2.7712 0.4962 5.584 2.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1149.14 on 828 degrees of freedom
## Residual deviance: 187.37 on 813 degrees of freedom
## AIC: 219.37
##
## Number of Fisher Scoring iterations: 20
AIC (Akaike Information Criterion): It estimates the amount of missing information in a model. The smaller the AIC value, the better the model. Null deviance: It represents the error or discrepancy when no predictors are used in the model. Residual deviance: It represents the error or discrepancy when the model is built using the predictors. With these three parameter we can conclude that step-wise model is the better model.
However in this case, we have the predictor variables from model_step have the same variable from model_all, therefore the model model performance should be same.
We have our model with linear logistic, therefore we will use this model to make predict :
data_test$prob_acc<- predict(model_step, type = "response", newdata = data_test)
data_test$pred_acc <- factor(ifelse(data_test$prob_acc > 0.5, "unacc","acc"))
data_test[1:10, c("pred_acc", "Car_Acceptability")]log_conf <- confusionMatrix(data_test$pred_acc, data_test$Car_Acceptability, positive = "acc")
log_conf## Confusion Matrix and Statistics
##
## Reference
## Prediction acc unacc
## acc 105 3
## unacc 3 96
##
## Accuracy : 0.971
## 95% CI : (0.938, 0.9893)
## No Information Rate : 0.5217
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9419
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9722
## Specificity : 0.9697
## Pos Pred Value : 0.9722
## Neg Pred Value : 0.9697
## Prevalence : 0.5217
## Detection Rate : 0.5072
## Detection Prevalence : 0.5217
## Balanced Accuracy : 0.9710
##
## 'Positive' Class : acc
##
Re-call/Sensitivity: It measures how effective the model can correctly predict the proportion of actual positive data.
Specificity: measures how effective the model can correctly predict the proportion of actual negative data.
Accuracy: measures how effective the model can correctly predict the target variable Y.
Precision: measures how effective the model can correctly predict the positive class among all predicted results.
In this business case, we will use the sensitivity/recall metric as model performance parameter. The sensitivity choosen as metric parameter because it measure the proportion of actual positive data to minimize the false positive. We surely don’t want the unaccepted car being used in the street that increase risk of accident.
The model_all or model_step have Sensitivity = 0.9703 which is safe enough for us to use those models.
We can’t use categorical predictor in K-nearest Neighbor, therefore we will create dummy variables from categorical data to be used in classification.
car_dummy <- dummyVars(" ~.", data = car_down)
car_dummy <- data.frame(predict(car_dummy, newdata = car_down))
str(car_dummy)## 'data.frame': 1036 obs. of 23 variables:
## $ Buying_Price.high : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Buying_Price.low : num 0 1 0 0 1 1 1 1 0 0 ...
## $ Buying_Price.med : num 0 0 1 0 0 0 0 0 0 0 ...
## $ Buying_Price.vhigh : num 1 0 0 1 0 0 0 0 1 1 ...
## $ Maintenance_Price.high : num 0 0 0 0 1 1 1 0 0 0 ...
## $ Maintenance_Price.low : num 0 0 0 1 0 0 0 1 0 0 ...
## $ Maintenance_Price.med : num 1 1 1 0 0 0 0 0 1 1 ...
## $ Maintenance_Price.vhigh: num 0 0 0 0 0 0 0 0 0 0 ...
## $ No_of_Doors.2 : num 1 0 0 0 0 0 0 0 1 0 ...
## $ No_of_Doors.3 : num 0 0 1 0 1 0 0 1 0 0 ...
## $ No_of_Doors.4 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ No_of_Doors.5more : num 0 1 0 1 0 1 0 0 0 1 ...
## $ Person_Capacity.2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Person_Capacity.4 : num 0 1 0 1 0 0 0 0 1 1 ...
## $ Person_Capacity.more : num 1 0 1 0 1 1 1 1 0 0 ...
## $ Size_of_Luggage.big : num 1 1 0 0 0 1 1 0 1 1 ...
## $ Size_of_Luggage.med : num 0 0 1 1 1 0 0 0 0 0 ...
## $ Size_of_Luggage.small : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Safety.high : num 0 1 0 1 0 1 0 0 1 0 ...
## $ Safety.low : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Safety.med : num 1 0 1 0 1 0 1 1 0 1 ...
## $ Car_Acceptability.acc : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Car_Acceptability.unacc: num 0 0 0 0 0 0 0 0 0 0 ...
Remove the targe dummy variables where there were only 2 categories.
car_dummy$Car_Acceptability.unacc <- NULLset.seed(123)
samplesize1 <- round(0.8 * nrow(car_dummy), 0)
index1 <- sample(seq_len(nrow(car_dummy)), size = samplesize1)
kn_train <- car_dummy[index1, 0:22 ]
kn_test <- car_dummy[-index1, 0:22]
kn_train_label <- car_dummy[index1,1]
kn_test_label <- car_dummy[-index1,1]pred_knn <- knn(train = kn_train,
test = kn_test,
cl = kn_train_label,
k = 28)
sqrt(nrow(kn_train))## [1] 28.79236
pred_knn## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [38] 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## [75] 0 0 1 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0
## [112] 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0
## [149] 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0
## Levels: 0 1
In the code above, we use the optimum k value that is the root of the sum of our data.
pred_knn_conf <- confusionMatrix(as.factor(pred_knn), as.factor(kn_test_label),"1")
pred_knn_conf## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 160 4
## 1 0 43
##
## Accuracy : 0.9807
## 95% CI : (0.9513, 0.9947)
## No Information Rate : 0.7729
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9432
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 0.9149
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9756
## Prevalence : 0.2271
## Detection Rate : 0.2077
## Detection Prevalence : 0.2077
## Balanced Accuracy : 0.9574
##
## 'Positive' Class : 1
##
The KNN model has the Sensitivity = 0.9608 which is safe enough for us to use those models.
As stated from above, we will use the sensitivity/recall metric as model performance parameter. The sensitivity choosen as metric parameter because it measure the proportion of actual positive data to minimize the false positive. We surely don’t want the unaccepted car being used in the street that will increase risk of accident. On the contrary, if there are accepted car(actual data) that have unaccepted status(predict data), the developer of the car can submit an appeal for further inspection and won’t cause any risk of accident.
KNN model Sensitivity = 0.9608
Logistic Regression model Sensitivity = 0.9703
Therefore we can conclude that the Logistic Regression have the better model to predict the Acceptability. A sensitivity value of 0.9703 indicates that the logistic regression model is able to accurately identify approximately 97.03% of the true positive instances from all the actual positive instances in the actual data. In other words, it has a high ability to correctly predict positive cases in the actual data.