In this post, we will walk through a classification tree analysis using a synthetic car showroom dataset. The goal is to predict a binary response, whether a car is ‘Sold’ or ‘Not Sold’, from a handful of explanatory variables. Let’s start by loading the necessary libraries and generating some synthetic data for demonstration purposes.
The dataset has 500 rows, four explanatory variables (‘Price’, ‘Mileage’, ‘Brand’, and ‘Condition’), and the response variable ‘Response’.
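# Load the packages used below
library(ggplot2)     # plotting
library(rpart)       # classification trees
library(rpart.plot)  # tree diagrams
set.seed(123)  # make the random draws reproducible
n_samples <- 500
# Generate synthetic data
# Note: Response is drawn independently of the other columns
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),
  Mileage = rpois(n_samples, 30000),
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE)
)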
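The data above draws ‘Response’ at random, so it carries no relationship to the other columns. If you would like the tree to have a pattern to find, one option is to draw the response from a logistic model of the predictors. The sketch below is only an illustration: the coefficients, and the names `log_odds`, `p_sold`, and `response_with_signal`, are assumptions made for this example and are not part of the dataset used in this post.

```r
set.seed(123)
n_samples <- 500

price <- runif(n_samples, 10000, 50000)
condition <- sample(c("New", "Used"), n_samples, replace = TRUE)

# Hypothetical rule (illustrative coefficients): cheaper cars and new cars
# are more likely to sell.
log_odds <- 2 - 0.0001 * price + 1 * (condition == "New")
p_sold <- plogis(log_odds)  # convert log-odds to probabilities in (0, 1)

# Draw one Bernoulli outcome per car and label it
response_with_signal <- ifelse(rbinom(n_samples, 1, p_sold) == 1, "Sold", "Not Sold")

# Share of cars sold within each condition
print(tapply(response_with_signal == "Sold", condition, mean))
```

Swapping a column like this in for ‘Response’ would give the tree real structure to split on; the rest of the post keeps the purely random response.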
Next, let’s explore the dataset with a numeric summary and a few visualizations.
summary(car_data)
##      Price          Mileage         Brand            Condition
##  Min.   :10019   Min.   :29472   Length:500         Length:500
##  1st Qu.:19840   1st Qu.:29874   Class :character   Class :character
##  Median :29062   Median :30004   Mode  :character   Mode  :character
##  Mean   :29811   Mean   :29991
##  3rd Qu.:39316   3rd Qu.:30118
##  Max.   :49976   Max.   :30572
##    Response
##  Length:500
##  Class :character
##  Mode  :character
##
##
##
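For the character columns, `summary()` reports only length and class, which says nothing about how the categories are distributed. A small sketch of one way to get level counts is to convert those columns to factors first. The data frame `demo` below is a stand-in built so the snippet runs on its own (an assumption for illustration); in the post you would apply the same steps to `car_data`.

```r
# Stand-in data with the same categorical columns as car_data
set.seed(123)
demo <- data.frame(
  Brand = sample(c("Toyota", "Honda", "Ford"), 500, replace = TRUE),
  Condition = sample(c("New", "Used"), 500, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), 500, replace = TRUE)
)

# Convert every character column to a factor so summary() shows level counts
is_char <- vapply(demo, is.character, logical(1))
demo[is_char] <- lapply(demo[is_char], factor)
print(summary(demo))

# Share of each Response level within each Condition (each row sums to 1)
cond_share <- prop.table(table(demo$Condition, demo$Response), margin = 1)
print(round(cond_share, 3))
```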
# Bar plot of the response variable
ggplot(car_data, aes(x = Response, fill = Response)) +
  geom_bar() +
  labs(title = "Distribution of Response Variable", x = "Response", y = "Count")
# Scatter plot of price against mileage, colored by response
ggplot(car_data, aes(x = Price, y = Mileage, color = Response)) +
  geom_point() +
  labs(title = "Scatter Plot - Price vs. Mileage", x = "Price", y = "Mileage")
# Box plot of price for each brand
ggplot(car_data, aes(x = Brand, y = Price, fill = Brand)) +
  geom_boxplot() +
  labs(title = "Box Plot - Price by Brand", x = "Brand", y = "Price")
# Stacked bar plot of response within each condition
ggplot(car_data, aes(x = Condition, fill = Response)) +
  geom_bar() +
  labs(title = "Bar Plot - Condition by Response", x = "Condition", y = "Count")
# Pie chart of brand counts (a single stacked bar in polar coordinates)
ggplot(car_data, aes(x = "", fill = Brand)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart - Distribution of Brands", fill = "Brand")
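# Mileage only spans about 29,500 to 30,600 (see the summary above),
# so a bin width of 2000 would collapse the data into one or two bars
ggplot(car_data, aes(x = Mileage)) +
  geom_histogram(binwidth = 100, fill = "blue", color = "black") +
  labs(title = "Histogram - Mileage Distribution", x = "Mileage", y = "Count")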
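Now, let’s perform a classification tree analysis using the generated dataset. Keep in mind that ‘Response’ was sampled independently of the explanatory variables, so any splits the tree finds reflect noise rather than a real pattern; the example illustrates the workflow, not a meaningful model.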
# Identify explanatory variables and response variable
explanatory_vars <- c("Price", "Mileage", "Brand", "Condition")
response_var <- "Response"
# Build the model formula: Response ~ Price + Mileage + Brand + Condition
tree_formula <- reformulate(explanatory_vars, response = response_var)
# Create a decision tree model
tree_model <- rpart(tree_formula, data = car_data, method = "class")
# Plot the decision tree
rpart.plot(tree_model, main = "Classification Tree")
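A plot alone does not show how well the tree predicts. One common check is to hold out part of the data, fit the tree on the rest, and compare predictions with the held-out labels. The sketch below is a minimal version of that idea, not part of the analysis above: the 70/30 split, the use of `stringsAsFactors = TRUE`, and the names `train_set`, `test_set`, and `holdout_fit` are all assumptions made for illustration. The data is regenerated so the snippet runs on its own.

```r
library(rpart)

set.seed(123)
n_samples <- 500
car_data <- data.frame(
  Price = runif(n_samples, 10000, 50000),
  Mileage = rpois(n_samples, 30000),
  Brand = sample(c("Toyota", "Honda", "Ford"), n_samples, replace = TRUE),
  Condition = sample(c("New", "Used"), n_samples, replace = TRUE),
  Response = sample(c("Sold", "Not Sold"), n_samples, replace = TRUE),
  stringsAsFactors = TRUE  # store categorical columns as factors
)

# Hold out 30% of the rows for testing (illustrative split ratio)
train_idx <- sample(seq_len(n_samples), size = round(0.7 * n_samples))
train_set <- car_data[train_idx, ]
test_set <- car_data[-train_idx, ]

# Fit on the training rows only, then predict classes for the held-out rows
holdout_fit <- rpart(Response ~ ., data = train_set, method = "class")
predicted <- predict(holdout_fit, newdata = test_set, type = "class")

# Confusion matrix and overall accuracy on the held-out rows
confusion <- table(Predicted = predicted, Actual = test_set$Response)
accuracy <- mean(predicted == test_set$Response)
print(confusion)
print(accuracy)
```

Because ‘Response’ is random in this dataset, held-out accuracy near chance level is the expected outcome; a noticeably higher value on training data alone would be a sign of overfitting.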