Introduction

In this post, we walk through a classification tree analysis of a car showroom dataset. The goal is to predict a binary response, whether a car is ‘Sold’ or ‘Not Sold’, from several explanatory variables. Let’s start by loading the necessary libraries and generating some synthetic data for demonstration purposes.

Data Generation

Let’s create a synthetic car showroom dataset with explanatory variables like ‘Price’, ‘Mileage’, ‘Brand’, and ‘Condition’.

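# Load the packages used below
library(ggplot2)     # plots
library(rpart)       # classification trees
library(rpart.plot)  # tree diagrams

set.seed(123)  # make the random draws reproducible
n_samples <- 500

# Generate synthetic data; every column is drawn independently of the others
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),  # uniform between 10,000 and 50,000
  Mileage = rpois(n_samples, 30000),       # Poisson counts, tightly clustered near 30,000
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE)  # not linked to the predictors
)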

Exploratory Data Analysis

Before fitting a model, let’s summarize the dataset and create some visualizations.

Summary Statistics

summary(car_data)
##      Price          Mileage         Brand            Condition        
##  Min.   :10019   Min.   :29472   Length:500         Length:500        
##  1st Qu.:19840   1st Qu.:29874   Class :character   Class :character  
##  Median :29062   Median :30004   Mode  :character   Mode  :character  
##  Mean   :29811   Mean   :29991                                        
##  3rd Qu.:39316   3rd Qu.:30118                                        
##  Max.   :49976   Max.   :30572                                        
##    Response        
##  Length:500        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
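
The summary above reports only a length and class for Brand, Condition, and Response because they are stored as character vectors. As a minimal sketch, converting those columns to factors makes summary() report a count per level instead. The name car_factors is introduced here for illustration only, and the data generation is repeated so the block runs on its own.

```r
# Recreate the dataset so this block is self-contained
set.seed(123)
n_samples <- 500
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),
  Mileage = rpois(n_samples, 30000),
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE),
  stringsAsFactors = FALSE
)

# Copy the data and convert every character column to a factor
car_factors <- car_data
is_char <- vapply(car_factors, is.character, logical(1))
car_factors[is_char] <- lapply(car_factors[is_char], factor)

# Factor columns are now summarized as per-level counts
summary(car_factors)
```

The original car_data is left untouched, so the rest of the analysis is unaffected.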

Bar Plot - Response Variable

ggplot(car_data, aes(x = Response, fill = Response)) +
  geom_bar() +
  labs(title = "Distribution of Response Variable", x = "Response", y = "Count")

Scatter Plot - Price vs. Mileage

ggplot(car_data, aes(x = Price, y = Mileage, color = Response)) +
  geom_point() +
  labs(title = "Scatter Plot - Price vs. Mileage", x = "Price", y = "Mileage")

Box Plot - Price by Brand

ggplot(car_data, aes(x = Brand, y = Price, fill = Brand)) +
  geom_boxplot() +
  labs(title = "Box Plot - Price by Brand", x = "Brand", y = "Price")

Bar Plot - Condition by Response

ggplot(car_data, aes(x = Condition, fill = Response)) +
  geom_bar() +
  labs(title = "Bar Plot - Condition by Response", x = "Condition", y = "Count")
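
The stacked bars show counts, which can be hard to compare when the two conditions differ in size. A numeric companion, sketched here in base R, is a cross-tabulation with row proportions. The names cond_by_resp and cond_props are illustrative, and the data generation is repeated so the block runs on its own.

```r
# Recreate the dataset so this block is self-contained
set.seed(123)
n_samples <- 500
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),
  Mileage = rpois(n_samples, 30000),
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE)
)

# Counts of each Response within each Condition
cond_by_resp <- table(Condition = car_data$Condition, Response = car_data$Response)

# Share of each Response within a Condition (each row sums to 1)
cond_props <- prop.table(cond_by_resp, margin = 1)

print(cond_by_resp)
print(round(cond_props, 3))
```

Since Condition and Response are sampled independently here, the two rows should show similar proportions, differing only by sampling noise.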

Pie Chart - Distribution of Brands

ggplot(car_data, aes(x = "", fill = Brand)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart - Distribution of Brands", fill = "Brand")

Histogram - Mileage Distribution

ggplot(car_data, aes(x = Mileage)) +
  geom_histogram(binwidth = 100, fill = "blue", color = "black") +  # Mileage spans only about 29,500 to 30,600
  labs(title = "Histogram - Mileage Distribution", x = "Mileage", y = "Count")

Classification Tree Analysis

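Now, let’s fit a classification tree to the generated dataset. Because Response was sampled independently of the other columns, any splits the tree finds reflect chance patterns rather than real structure; the example demonstrates the workflow, not a meaningful model.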

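# Identify explanatory variables and response variable
explanatory_vars <- c("Price", "Mileage", "Brand", "Condition")
response_var <- "Response"

# Build the formula Response ~ Price + Mileage + Brand + Condition
tree_formula <- reformulate(explanatory_vars, response = response_var)

# Fit a classification tree (method = "class" for a categorical response)
tree_model <- rpart(tree_formula, data = car_data, method = "class")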

# Plot the decision tree
rpart.plot(tree_model, main = "Classification Tree")
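
Plotting the tree does not show how well it predicts. One common check, sketched here as an addition rather than part of the analysis above, is to hold out part of the data, fit the tree on the rest, and compare predictions with the held-out labels. The 70/30 split and the names train_idx, train_data, test_data, and holdout_model are illustrative assumptions; the data generation is repeated so the block runs on its own.

```r
library(rpart)

# Recreate the dataset so this block is self-contained
set.seed(123)
n_samples <- 500
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),
  Mileage = rpois(n_samples, 30000),
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE),
  stringsAsFactors = TRUE  # factors keep levels consistent across train and test
)

# Hold out 30% of the rows for testing (assumed split ratio)
train_idx <- sample(seq_len(n_samples), size = round(0.7 * n_samples))
train_data <- car_data[train_idx, ]
test_data <- car_data[-train_idx, ]

# Fit on the training rows only
holdout_model <- rpart(Response ~ ., data = train_data, method = "class")

# Predict class labels for the held-out rows
predicted <- predict(holdout_model, newdata = test_data, type = "class")

# Confusion matrix and overall accuracy on the held-out rows
conf_matrix <- table(Predicted = predicted, Actual = test_data$Response)
accuracy <- mean(as.character(predicted) == as.character(test_data$Response))

print(conf_matrix)
print(accuracy)
```

With a randomly assigned Response, an accuracy close to one half would be unsurprising; on real showroom data the same steps would give a first estimate of out-of-sample performance.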