Predicting Used Vehicle Prices

Authors

Sarp Ata Kanca, Yagmur Beren Sengezken

Presentation: https://rpubs.com/SarpAta/1438550

Last Report: https://rpubs.com/SarpAta/1438972

ECON 465 Stage 3


Economic Question

Research Question

To what extent do vehicle age, mileage, condition, and engine size predict the market price of a used vehicle?

Why Does It Matter?

  • Used vehicle markets are economically important.
  • Consumers need accurate price expectations.
  • Sellers need efficient pricing strategies.
  • Vehicle depreciation affects household wealth.

Dataset

Source

Kaggle: Craigslist Cars & Trucks Dataset

Sample

  • Original observations: 426,880
  • Cleaned sample: 170,439 observations

Variables

  • Price: vehicle listing price (USD)
  • Year: manufacturing year
  • Odometer: miles driven
  • Condition: seller-reported vehicle condition
  • Cylinders: number of engine cylinders (proxy for engine size)


Price Distribution

library(tidyverse)

cars_raw <- read_csv("vehicles.csv")

cars_clean <- cars_raw %>%
  filter(
    price > 500,
    price < 100000,
    !is.na(year),
    !is.na(odometer),
    !is.na(condition),
    !is.na(cylinders)
  ) %>%
  mutate(log_price = log(price))

ggplot(cars_clean, aes(price)) +
  geom_histogram(bins = 50) +
  labs(
    title = "Distribution of Vehicle Prices",
    x = "Price (USD)",
    y = "Frequency"
  )

Observation

  • Strong right-skewed distribution.
  • Extreme high-value listings present.
  • A log transformation reduces the skew, so the models use log(price) as the outcome.
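
The skew observation can be checked numerically. The sketch below is illustrative only: it uses simulated log-normal prices rather than the Craigslist data, and the distribution parameters and the `skewness` helper are assumptions introduced here. It compares sample skewness before and after taking logs.

```r
# Illustrative only: simulated prices, not the Craigslist listings.
set.seed(465)
sim_price <- rlnorm(10000, meanlog = 9.4, sdlog = 0.8)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)       # strongly positive for right-skewed data
skew_log <- skewness(log(sim_price))  # close to zero after the log transform

cat("Skewness of price:     ", round(skew_raw, 2), "\n")
cat("Skewness of log(price):", round(skew_log, 2), "\n")
```

A skewness near zero after the transformation is what a log-normal price distribution would produce.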

Model A

Baseline Regression

\[ \log(\text{price}) = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{odometer}) \]

Economic Intuition

  • Newer vehicles should be worth more.
  • Higher mileage should reduce value.

Model B

Extended Regression

\[ \log(\text{price}) = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{odometer}) + \beta_3(\text{condition}) + \beta_4(\text{cylinders}) \]

Why Add These Variables?

  • Vehicle quality matters.
  • Engine size affects consumer demand.
  • More realistic representation of market valuation.

Model Comparison

| Metric (test set) | Model A | Model B |
|---|---|---|
| RMSE | 0.8344 | 0.7392 |
| R² | 0.1383 | 0.3237 |

Result

Model B clearly performs better.

  • Lower prediction error
  • Higher explanatory power
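
For reference, both metrics are computed on the test set from the log-price predictions. In the notation below (introduced here for illustration), n is the number of test observations, y_i is the log-price outcome, \hat{y}_i is the model prediction, and \bar{y} is the test-set mean of the outcome:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \]

Because the outcome is in logs, an RMSE of 0.7392 corresponds to a typical multiplicative prediction error of roughly exp(0.7392) ≈ 2.09.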

Cross-Validation Results

5-Fold Cross Validation

| Metric | Mean |
|---|---|
| RMSE | 0.729 |
| R² | 0.339 |

Interpretation

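  • Model performance is stable across folds (standard errors of about 0.002 for RMSE and 0.003 for R²).
  • The cross-validated RMSE (0.729) is close to the test-set RMSE (0.7392), so the results generalize to unseen observations.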
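
The mechanics of 5-fold cross-validation can be sketched in base R. This is a minimal illustration on simulated data, not the tidymodels workflow used for the reported numbers; the data-generating values and variable names are assumptions chosen only to resemble the baseline model.

```r
# Illustrative only: simulated data standing in for the cleaned listings.
set.seed(465)
n <- 1000
sim <- data.frame(
  year     = sample(1995:2021, n, replace = TRUE),
  odometer = runif(n, 0, 250000)
)
sim$log_price <- -50 + 0.03 * sim$year - 6e-07 * sim$odometer + rnorm(n, sd = 0.7)

# Assign each row to one of 5 folds at random.
k <- 5
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other 4 folds, score on the held-out fold.
fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(log_price ~ year + odometer, data = sim[fold_id != i, ])
  pred <- predict(fit, newdata = sim[fold_id == i, ])
  sqrt(mean((sim$log_price[fold_id == i] - pred)^2))
})

cv_mean <- mean(fold_rmse)
cv_se   <- sd(fold_rmse) / sqrt(k)
cat("Fold RMSEs:", round(fold_rmse, 3), "\n")
cat("Mean RMSE: ", round(cv_mean, 3), " SE:", round(cv_se, 3), "\n")
```

A small standard error across folds is the sense in which performance is described as stable.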

Key Findings & Economic Interpretation

Vehicle Age

  • Effect: Positive coefficient on manufacturing year (about 0.029 in Model B, roughly 2.9% higher price per model year).

    Interpretation: Newer vehicles command a statistically significant premium, consistent with a lower expected risk of mechanical failure.

Odometer

  • Effect: Negative coefficient (about -6.2e-07 per mile in Model B, roughly 0.6% lower price per 10,000 miles).

    Interpretation: Higher mileage lowers market value, reflecting physical depreciation of the asset through use.

Condition

  • Effect: Large premiums and discounts across condition categories.

    Interpretation: Relative to the omitted baseline category, “Like New” listings carry a premium of about 15% (coefficient 0.136), whereas “Salvage” listings sell at a discount of about 70% (coefficient -1.222), plausibly reflecting title restrictions and structural damage.
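
Because the outcome is in logs, each coefficient can be read as an approximate percentage effect on price. The sketch below copies four Model B coefficients from the regression output in this report and converts them; the vector names are labels introduced here, and the conversion treats the outcome as log(price), which is a close approximation to the log(price + 1) used in the code.

```r
# Model B coefficients copied from the regression output in this report.
coefs <- c(
  year               = 2.895e-02,
  odometer_per_10k   = -6.210e-07 * 10000,  # rescaled to a 10,000-mile change
  condition_like_new = 1.359e-01,
  condition_salvage  = -1.222e+00
)

# In a log-linear model, a coefficient b implies a 100 * (exp(b) - 1) percent
# change in price for a one-unit change in the regressor.
pct_effect <- 100 * (exp(coefs) - 1)
print(round(pct_effect, 1))
```

The condition effects are measured relative to the omitted baseline condition category.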


Economic Interpretation

Depreciation

Vehicle prices decrease as usage increases.

Consumer Valuation

Buyers pay premiums for:

  • Newer vehicles
  • Better condition
  • Larger engines

Market Implications

Observable quality characteristics strongly influence market outcomes.


Limitations

Limitation 1

Omitted Variable Bias (OVB): Important vehicle characteristics were omitted from our regressions, notably manufacturer/brand, model line, fuel efficiency (MPG), and regional market location. The raw file does contain manufacturer, model, and region columns, but our specifications did not use them, so the year and odometer coefficients may partly absorb their effects.
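
The direction of this bias follows from the textbook omitted-variable formula. As a simplified sketch with one included regressor x (for example, odometer) and one omitted variable z (for example, a brand-quality indicator), using notation introduced here for illustration:

\[ y = \beta_0 + \beta_1 x + \beta_2 z + u, \qquad z = \delta_0 + \delta_1 x + v \quad\Longrightarrow\quad \operatorname{plim}\,\tilde{\beta}_1 = \beta_1 + \beta_2\,\delta_1 \]

Here \tilde{\beta}_1 is the slope from the short regression of y on x alone. The bias term \beta_2\delta_1 vanishes only if the omitted variable has no effect on price (\beta_2 = 0) or is uncorrelated with the included regressor (\delta_1 = 0).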

Limitation 2

Asking vs. Transaction Prices: The dataset records sellers' listing prices, not final transaction prices. Because used-vehicle sales typically involve bargaining, our models capture sellers' asking-price expectations rather than market-clearing values.

Limitation 3

Self-Reporting Bias: The data is entirely self-reported by sellers, which introduces subjectivity into categorical variables such as vehicle condition.


Future Research

Potential Improvement

Include:

  • Brand effects
  • Fuel efficiency
  • Geographic market differences

New Economic Question

Do vehicle brands create a measurable price premium after controlling for age, mileage, condition, and engine size?


Conclusion

Main Takeaways

Vehicle age and mileage are significant predictors.

Condition substantially improves prediction accuracy.

Model B outperforms Model A.

Results are consistent with economic theory of depreciation.


Reproducibility Protocols

To keep the analysis reproducible across machines, we followed these practices:

  • Relative Pathing: Data are loaded with a relative path (read_csv("vehicles.csv")), so the script does not depend on any machine-specific directory.

  • Stochastic Isolation: The random seed is fixed with set.seed(465) before each data split and cross-validation resampling step.

  • Standardized Coding Environment: The code relies on the tidyverse and tidymodels packages, whose versions appear in the session output, to limit the risk of breaking changes.
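
As a minimal illustration of the seeding practice (a toy index vector, not the project data), drawing the same 80/20 split twice with the same seed yields identical partitions:

```r
# Draw the same 80/20 split twice with the same seed (toy index vector).
n <- 100

set.seed(465)
train_idx_1 <- sample(n, size = 80)

set.seed(465)
train_idx_2 <- sample(n, size = 80)

# Same seed, same partition.
same_split <- identical(train_idx_1, train_idx_2)
print(same_split)
```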

AI Use Log

  • Assisting AI System: Gemini (Advanced Architecture Engine)

  • Applied Interaction Strategy: Layout syntax automation, code consolidation, and structural Markdown optimization.

  • Raw User Prompt Input: “Show me to how to solve this problem in quarto slides document in fastest way possible”

  • Applied Output Implementation: The AI response provided clean styling mechanics (such as the custom {.smaller} header tags and column containment blocks) to fix layout overflow issues.

  • Verification and Modification: The styling tips were verified via local engine compilation (quarto render). These slide design components were then manually converted into a formal, continuous narrative report structure. This was done by replacing presentation slide breaks (---) with descriptive paragraphs to satisfy the final essay requirement.

Final Reflections

Strategic Path Improvements

Given a more extended research timeline or access to deeper computational power, our primary structural improvement would involve implementing high-cardinality fixed-effects modeling for vehicle brands and models. Controlling for manufacturer identity would effectively insulate our continuous mileage and age coefficients from brand-equity bias (such as the slow structural depreciation rates of reliable commuter brands compared to high-end luxury lines).
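
The idea can be sketched on simulated data. Everything below is an assumption made for illustration: two hypothetical brand tiers instead of the many manufacturers in the real data, and invented effect sizes. The sketch compares the mileage slope with and without a brand fixed effect.

```r
# Illustrative only: simulated listings with a hypothetical brand effect.
set.seed(465)
n <- 5000
brand <- sample(c("economy", "luxury"), n, replace = TRUE)

# Assumption for the simulation: luxury cars are driven less and priced higher.
odometer_10k <- ifelse(brand == "luxury", rnorm(n, 6, 2), rnorm(n, 12, 2))
log_price <- 9 - 0.05 * odometer_10k + 0.8 * (brand == "luxury") + rnorm(n, sd = 0.3)
sim <- data.frame(log_price, odometer_10k, brand = factor(brand))

# Short regression omits brand; long regression adds a brand fixed effect.
fit_short <- lm(log_price ~ odometer_10k, data = sim)
fit_fe    <- lm(log_price ~ odometer_10k + brand, data = sim)

b_short <- coef(fit_short)[["odometer_10k"]]
b_fe    <- coef(fit_fe)[["odometer_10k"]]
cat("Mileage slope without brand:", round(b_short, 3), "\n")
cat("Mileage slope with brand FE:", round(b_fe, 3), "\n")
```

In this simulation the true mileage slope is -0.05. The regression without brand overstates the mileage penalty because mileage is correlated with brand tier, while the fixed-effect specification recovers a slope close to the true value.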

Future Economic Research Questions

The insights developed across this econometric evaluation inspire a compelling new research question:

“Do secondary market vehicle brands display asymmetric depreciation elasticities across varying regional economic environments during inflationary contraction cycles?”

Answering this question would reveal whether affordable economy vehicle choices behave as Giffen or defensive assets when aggregate consumer purchasing power contracts.

Conclusion

Main Takeaways

  • Vehicle age and mileage are significant, robust predictors of asset depreciation.
  • Incorporating qualitative condition metrics drastically improves model prediction accuracy and reduces error.
  • Extended Specification Model B decisively outperforms the baseline model across all metrics.
  • Empirical results match classical economic depreciation theories, showing how asset usage and features dictate consumer valuation.

## Stage 1

# PART A: Dataset 1 (Regression) - Predicting Vehicle Prices

## 1. Dataset and Source

Source: [Kaggle - Craigslist Cars/Trucks Data](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data)

This dataset contains 426,880 individual vehicle listings, providing a high-volume cross-section of the used car market.

Variable List

| Variable | Description | Type |
|---|---|---|
| Price | Listing price in USD (target variable) | Continuous |
| Year | Year the vehicle was manufactured | Numeric |
| Odometer | Total miles driven | Continuous |
| Condition | Reported condition of the vehicle | Categorical |
| Cylinders | Engine size/type | Discrete |

## 2. Economic Question

"To what extent do vehicle age and usage (odometer) predict the market price of a used vehicle in the coming year's resale market?"

## 3. Data Importing and Cleaning

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the data via a relative path
cars_data <- read_csv("vehicles.csv") |>
  select(price, year, odometer, condition, cylinders) |>
  drop_na() |>
  filter(price > 500 & price < 100000)
Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl   (6): id, price, year, odometer, lat, long
lgl   (1): county
dttm  (1): posting_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 4. Probability Analysis (Regression)
summary(cars_data$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    501    6495   12450   16820   25000   99999 
ggplot(cars_data, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vehicle Prices", x = "Price ($)", y = "Frequency") +
  theme_minimal()

cars_data <- cars_data |>
  mutate(log_price = log(price + 1))

ggplot(cars_data, aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(title = "Log-Transformed Price Distribution", x = "Log(Price)", y = "Frequency") +
  theme_minimal()

The regression model examines how vehicle age and usage affect resale prices in the used car market.

The coefficient for `year` is expected to be positive, indicating that newer vehicles tend to have higher market values. The coefficient for `odometer` is expected to be negative, suggesting that heavily used vehicles lose value as mileage increases.

## 5. Theoretical Distribution Proposal

The vehicle price data is heavily right-skewed: most listings are affordable used models, while a few luxury vehicles create a long tail. The log transformation makes the distribution far more symmetric, which suggests a log-normal distribution.

## B1. Dataset and Source

Source: Kaggle - Company Bankruptcy Prediction

Data link: <https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction>

## B2. Economic Question

Can foundational financial ratios, such as net value per share, accurately classify and predict whether a firm will go bankrupt in the coming fiscal period?

## B3. Data Importing and Cleaning


bankruptcy_raw <- read_csv("bankruptcy.csv", locale = locale(encoding = "UTF-8"))
Rows: 6819 Columns: 96
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (96): Bankrupt?, ROA(C) before interest and depreciation before interest...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mydata_class <- bankruptcy_raw |>
  select(1, 17, 22, 23) |>
  rename(
    bankrupt                   = 1,
    net_value_per_share        = 2,
    revenue_per_share          = 3,
    operating_profit_per_share = 4
  ) |>
  filter(!is.na(bankrupt) & !is.na(net_value_per_share) &
         !is.na(revenue_per_share) & !is.na(operating_profit_per_share)) |>
  mutate(bankrupt = factor(bankrupt, levels = c(0, 1),
                           labels = c("Healthy", "Bankrupt")))

B4. Probability Analysis (Classification)

table(mydata_class$bankrupt)

 Healthy Bankrupt 
    6599      220 
ggplot(mydata_class, aes(x = bankrupt, fill = bankrupt)) +
  geom_bar() +
  labs(title = "Class Distribution: Firm Health Status", 
       x = "Status", y = "Count") +
  theme_minimal()

summary(mydata_class)
     bankrupt    net_value_per_share revenue_per_share  
 Healthy :6599   Min.   :0.0000      Min.   :0.000e+00  
 Bankrupt: 220   1st Qu.:0.1736      1st Qu.:0.000e+00  
                 Median :0.1844      Median :0.000e+00  
                 Mean   :0.1907      Mean   :1.329e+06  
                 3rd Qu.:0.1996      3rd Qu.:0.000e+00  
                 Max.   :1.0000      Max.   :3.020e+09  
 operating_profit_per_share
 Min.   :0.00000           
 1st Qu.:0.09608           
 Median :0.10423           
 Mean   :0.10909           
 3rd Qu.:0.11616           
 Max.   :1.00000           
### Interpretation of Class Distribution

The dataset is highly imbalanced: only 220 of the 6,819 firms (about 3.2%) are bankrupt. This is common in bankruptcy prediction studies because actual bankruptcies are relatively rare events in real-world financial markets.

From an economic perspective, the imbalance reflects the fact that most firms remain financially stable, while only a small proportion experience severe financial distress.

The imbalance matters for predictive modeling because classification algorithms may become biased toward predicting the majority class.
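
To make the majority-class concern concrete, the sketch below uses only the class counts reported above and computes the accuracy of a trivial rule that labels every firm "Healthy". The variable names are introduced here for illustration.

```r
# Class counts copied from table(mydata_class$bankrupt) in this report.
counts <- c(Healthy = 6599, Bankrupt = 220)

# Share of bankrupt firms in the sample.
bankrupt_share <- counts[["Bankrupt"]] / sum(counts)

# Accuracy of a trivial rule that labels every firm "Healthy".
baseline_accuracy <- counts[["Healthy"]] / sum(counts)

cat("Bankrupt share:   ", round(bankrupt_share, 4), "\n")
cat("Baseline accuracy:", round(baseline_accuracy, 4), "\n")
```

Any classifier should therefore be judged against a baseline accuracy of roughly 97%, which is why accuracy alone says little about how well bankrupt firms are identified.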

B5. Theoretical Distribution Proposal

The target variable is binary (Bankrupt vs. Healthy). In this cross-sectional sample of firms, the outcome for each firm follows a Bernoulli distribution. This allows us to predict the probability of failure based on foundational corporate indicators.
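
Written out, with Y_i = 1 if firm i is bankrupt and x_{i1}, ..., x_{ik} denoting the financial ratios (notation introduced here for illustration), the Bernoulli assumption combined with a logistic link gives:

\[ Y_i \sim \mathrm{Bernoulli}(p_i), \qquad \log\frac{p_i}{1 - p_i} = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \]

This is the form that a logistic regression estimates. Without any predictors, the sample estimate of the bankruptcy probability is simply the observed share, 220 / 6819 ≈ 0.032.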

### Financial Ratio Interpretation

The selected financial indicators measure different dimensions of firm performance:

  • Net value per share reflects shareholder equity strength.
  • Revenue per share captures operational scale and market activity.
  • Operating profit per share measures operational efficiency.

Economically, firms with weaker profitability and lower shareholder value are expected to face higher bankruptcy risk.

## Stage 2

## Setup

library(tidyverse)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

## Data Loading and Cleaning

cars_data <- read_csv("vehicles.csv") |>
  select(price, year, odometer, condition, cylinders) |>
  filter(!is.na(price) & !is.na(year) & !is.na(odometer) &
         !is.na(condition) & !is.na(cylinders)) |>
  filter(price > 500 & price < 100000) |>
  mutate(
    condition = as.factor(condition),
    cylinders = as.factor(cylinders),
    log_price = log(price + 1)
  )
Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl   (6): id, price, year, odometer, lat, long
lgl   (1): county
dttm  (1): posting_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(465)
cars_split <- initial_split(cars_data, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

cat("Training set rows:", nrow(cars_train), "\n")
Training set rows: 136351 
cat("Test set rows:    ", nrow(cars_test),  "\n")
Test set rows:     34088 

Two Regression Models

Model A:

model_A <- lm(log_price ~ year + odometer, data = cars_train)
summary(model_A)

Call:
lm(formula = log_price ~ year + odometer, data = cars_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5146 -0.5544  0.0251  0.6220 10.7066 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.663e+01  4.282e-01 -108.92   <2e-16 ***
year         2.792e-02  2.130e-04  131.07   <2e-16 ***
odometer    -7.279e-07  1.109e-08  -65.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8257 on 136348 degrees of freedom
Multiple R-squared:  0.1515,    Adjusted R-squared:  0.1515 
F-statistic: 1.218e+04 on 2 and 136348 DF,  p-value: < 2.2e-16

Model B:

model_B <- lm(log_price ~ year + odometer + condition + cylinders, data = cars_train)
summary(model_B)

Call:
lm(formula = log_price ~ year + odometer + condition + cylinders, 
    data = cars_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4367 -0.4379  0.0637  0.4779 10.0329 

Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)    
(Intercept)           -4.828e+01  3.933e-01 -122.773  < 2e-16 ***
year                   2.895e-02  1.953e-04  148.237  < 2e-16 ***
odometer              -6.210e-07  9.818e-09  -63.249  < 2e-16 ***
conditionfair         -1.047e+00  1.161e-02  -90.241  < 2e-16 ***
conditiongood          4.595e-03  4.308e-03    1.067    0.286    
conditionlike new      1.359e-01  6.974e-03   19.492  < 2e-16 ***
conditionnew           4.056e-01  3.023e-02   13.420  < 2e-16 ***
conditionsalvage      -1.222e+00  3.982e-02  -30.680  < 2e-16 ***
cylinders12 cylinders  3.067e-01  7.602e-02    4.035 5.46e-05 ***
cylinders3 cylinders  -7.856e-01  4.893e-02  -16.058  < 2e-16 ***
cylinders4 cylinders  -8.560e-01  2.718e-02  -31.500  < 2e-16 ***
cylinders5 cylinders  -9.796e-01  3.522e-02  -27.813  < 2e-16 ***
cylinders6 cylinders  -3.957e-01  2.710e-02  -14.602  < 2e-16 ***
cylinders8 cylinders   7.166e-03  2.716e-02    0.264    0.792    
cylindersother        -3.386e-01  4.181e-02   -8.098 5.63e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7284 on 136336 degrees of freedom
Multiple R-squared:  0.3398,    Adjusted R-squared:  0.3397 
F-statistic:  5012 on 14 and 136336 DF,  p-value: < 2.2e-16

Model Comparison and Selection

##Model A
cars_test <- cars_test |>
  mutate(pred_A = predict(model_A, newdata = cars_test))

rmse_A <- cars_test |>
  summarise(rmse = sqrt(mean((log_price - pred_A)^2)))

rsq_A <- cars_test |>
  summarise(rsq = 1 - sum((log_price - pred_A)^2) /
                      sum((log_price - mean(log_price))^2))

cat("Model A RMSE:", round(rmse_A$rmse, 4), "\n")
Model A RMSE: 0.8344 
cat("Model A R2:  ", round(rsq_A$rsq,  4), "\n")
Model A R2:   0.1383 
## Model B Evaluations
cars_test <- cars_test |>
  mutate(pred_B = predict(model_B, newdata = cars_test))

rmse_B <- cars_test |>
  summarise(rmse = sqrt(mean((log_price - pred_B)^2)))

rsq_B <- cars_test |>
  summarise(rsq = 1 - sum((log_price - pred_B)^2) /
                      sum((log_price - mean(log_price))^2))

cat("Model B RMSE:", round(rmse_B$rmse, 4), "\n")
Model B RMSE: 0.7392 
cat("Model B R2:  ", round(rsq_B$rsq,  4), "\n")
Model B R2:   0.3237 

Model B clearly outperforms Model A on both test-set metrics, with a lower RMSE (0.7392 vs. 0.8344) and a higher R² (0.3237 vs. 0.1383). Adding vehicle condition and cylinder count gives a more accurate representation of market value.


Cross-Validation

cat("--- Regression Model Comparison ---\n")
--- Regression Model Comparison ---
cat("Model A (year + odometer)\n")
Model A (year + odometer)
cat("  RMSE:", round(rmse_A$rmse, 4), "\n")
  RMSE: 0.8344 
cat("  R2:  ", round(rsq_A$rsq,  4), "\n")
  R2:   0.1383 
cat("Model B (year + odometer + condition + cylinders)\n")
Model B (year + odometer + condition + cylinders)
cat("  RMSE:", round(rmse_B$rmse, 4), "\n")
  RMSE: 0.7392 
cat("  R2:  ", round(rsq_B$rsq,  4), "\n")
  R2:   0.3237 
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

set.seed(465)
folds_cars <- vfold_cv(cars_train, v = 5)

cv_reg <- fit_resamples(
  lm_spec,
  log_price ~ year + odometer + condition + cylinders,
  resamples = folds_cars,
  metrics   = metric_set(rmse, rsq)
)

collect_metrics(cv_reg)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   0.729     5 0.00198 pre0_mod0_post0
2 rsq     standard   0.339     5 0.00328 pre0_mod0_post0
cat("Test set RMSE:", round(rmse_B$rmse, 4), "\n")
Test set RMSE: 0.7392 
cat("Test set R2:  ", round(rsq_B$rsq,  4), "\n")
Test set R2:   0.3237 

PART B: Classification - Bankruptcy Prediction

bankruptcy_raw <- read_csv("bankruptcy.csv",
                           locale = locale(encoding = "UTF-8"))
Rows: 6819 Columns: 96
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (96): Bankrupt?, ROA(C) before interest and depreciation before interest...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mydata_class <- bankruptcy_raw |>
  select(1, 17, 22, 23) |>
  rename(
    bankrupt                   = 1,
    net_value_per_share        = 2,
    revenue_per_share          = 3,
    operating_profit_per_share = 4
  ) |>
  filter(!is.na(bankrupt) & !is.na(net_value_per_share) &
         !is.na(revenue_per_share) & !is.na(operating_profit_per_share)) |>
  mutate(bankrupt = factor(bankrupt, levels = c(0, 1),
                           labels = c("Healthy", "Bankrupt")))

set.seed(465)
cls_split <- initial_split(mydata_class, prop = 0.8)
cls_train <- training(cls_split)
cls_test  <- testing(cls_split)

cat("Training set rows:", nrow(cls_train), "\n")
Training set rows: 5455 
cat("Test set rows:    ", nrow(cls_test),  "\n")
Test set rows:     1364 
table(cls_train$bankrupt)

 Healthy Bankrupt 
    5283      172 

Model 1: net_value_per_share only

Model 1 analyzes baseline bankruptcy risk using a single fundamental ratio: shareholder equity strength relative to outstanding stock.

logistic_model1 <- glm(
  bankrupt ~ net_value_per_share,
  data   = cls_train,
  family = binomial
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model1)

Call:
glm(formula = bankrupt ~ net_value_per_share, family = binomial, 
    data = cls_train)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)           8.8669     0.8035   11.04   <2e-16 ***
net_value_per_share -70.2914     4.8342  -14.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1527.7  on 5454  degrees of freedom
Residual deviance: 1246.9  on 5453  degrees of freedom
AIC: 1250.9

Number of Fisher Scoring iterations: 7

Model 2: All three ratios

Model 2 expands the classification framework by adding an operational efficiency measure and a revenue scale indicator.

logistic_model2 <- glm(
  bankrupt ~ net_value_per_share + revenue_per_share + operating_profit_per_share,
  data   = cls_train,
  family = binomial
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model2)

Call:
glm(formula = bankrupt ~ net_value_per_share + revenue_per_share + 
    operating_profit_per_share, family = binomial, data = cls_train)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 8.779e+00  7.980e-01  11.002  < 2e-16 ***
net_value_per_share        -5.312e+01  5.781e+00  -9.189  < 2e-16 ***
revenue_per_share          -7.001e-09  2.203e-07  -0.032    0.975    
operating_profit_per_share -2.963e+01  6.225e+00  -4.759 1.94e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1527.7  on 5454  degrees of freedom
Residual deviance: 1225.0  on 5451  degrees of freedom
AIC: 1233

Number of Fisher Scoring iterations: 13

Predictions and Metrics:

Model 1 Predictions & Programmatic Handout-Safe Metrics

probs_1 <- predict(logistic_model1, cls_test, type = "response")
pred_1  <- ifelse(probs_1 > 0.5, "Bankrupt", "Healthy")
pred_1  <- factor(pred_1, levels = c("Healthy", "Bankrupt"))

confusion_1 <- table(Predicted = pred_1, Actual = cls_test$bankrupt)
confusion_1
          Actual
Predicted  Healthy Bankrupt
  Healthy     1316       46
  Bankrupt       0        2
TN_1 <- confusion_1["Healthy",  "Healthy"]
FP_1 <- confusion_1["Bankrupt", "Healthy"]
FN_1 <- confusion_1["Healthy",  "Bankrupt"]
TP_1 <- confusion_1["Bankrupt", "Bankrupt"]

accuracy_1  <- (TP_1 + TN_1) / (TP_1 + TN_1 + FP_1 + FN_1)
precision_1 <- ifelse(TP_1 + FP_1 > 0, TP_1 / (TP_1 + FP_1), 0)
recall_1    <- TP_1 / (TP_1 + FN_1)

cat("Model 1 Accuracy: ", round(accuracy_1,  3), "\n")
Model 1 Accuracy:  0.966 
cat("Model 1 Precision:", round(precision_1, 3), "\n")
Model 1 Precision: 1 
cat("Model 1 Recall:   ", round(recall_1,    3), "\n")
Model 1 Recall:    0.042 

Model 2 Predictions & Programmatic Handout-Safe Metrics:

probs_2 <- predict(logistic_model2, cls_test, type = "response")
pred_2  <- ifelse(probs_2 > 0.5, "Bankrupt", "Healthy")
pred_2  <- factor(pred_2, levels = c("Healthy", "Bankrupt"))

confusion_2 <- table(Predicted = pred_2, Actual = cls_test$bankrupt)
confusion_2
          Actual
Predicted  Healthy Bankrupt
  Healthy     1316       42
  Bankrupt       0        6
TN_2 <- confusion_2["Healthy",  "Healthy"]
FP_2 <- confusion_2["Bankrupt", "Healthy"]
FN_2 <- confusion_2["Healthy",  "Bankrupt"]
TP_2 <- confusion_2["Bankrupt", "Bankrupt"]

accuracy_2  <- (TP_2 + TN_2) / (TP_2 + TN_2 + FP_2 + FN_2)
precision_2 <- ifelse(TP_2 + FP_2 > 0, TP_2 / (TP_2 + FP_2), 0)
recall_2    <- TP_2 / (TP_2 + FN_2)

cat("Model 2 Accuracy: ", round(accuracy_2,  3), "\n")
Model 2 Accuracy:  0.969 
cat("Model 2 Precision:", round(precision_2, 3), "\n")
Model 2 Precision: 1 
cat("Model 2 Recall:   ", round(recall_2,    3), "\n")
Model 2 Recall:    0.125 

Cross-Validation: Model 2

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

set.seed(465)
folds_cls <- vfold_cv(cls_train, v = 5)

cv_cls <- fit_resamples(
  logistic_spec,
  bankrupt ~ net_value_per_share + revenue_per_share + operating_profit_per_share,
  resamples = folds_cls,
  metrics   = metric_set(accuracy, precision, recall)
)
→ A | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
There were issues with some computations   A: x1
There were issues with some computations   A: x5
collect_metrics(cv_cls)
# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config        
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy  binary     0.968     5 0.000850 pre0_mod0_post0
2 precision binary     0.969     5 0.00174  pre0_mod0_post0
3 recall    binary     0.998     5 0.00124  pre0_mod0_post0
cat("Test set Accuracy: ", round(accuracy_2,  3), "\n")
Test set Accuracy:  0.969 
cat("Test set Precision:", round(precision_2, 3), "\n")
Test set Precision: 1 
cat("Test set Recall:   ", round(recall_2,    3), "\n")
Test set Recall:    0.125 
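
The cross-validated precision (0.969) and recall (0.998) look very different from the test-set values (1 and 0.125). A likely explanation, stated here as an assumption rather than something verified by re-running the analysis, is the choice of positive class: yardstick treats the first factor level ("Healthy") as the event by default, whereas the manual test-set formulas treat "Bankrupt" as the positive class. The sketch below recomputes precision and recall from the Model 2 test-set confusion matrix under both conventions; `prec_recall` is a hypothetical helper written for this illustration.

```r
# Test-set confusion matrix for Model 2, copied from the output above.
confusion_2 <- matrix(
  c(1316, 0, 42, 6), nrow = 2,
  dimnames = list(Predicted = c("Healthy", "Bankrupt"),
                  Actual    = c("Healthy", "Bankrupt"))
)

# Hypothetical helper: precision and recall with `event` as the positive class.
prec_recall <- function(cm, event) {
  tp <- cm[event, event]
  c(precision = tp / sum(cm[event, ]),   # row = everything predicted as `event`
    recall    = tp / sum(cm[, event]))   # column = everything actually `event`
}

pr_bankrupt <- prec_recall(confusion_2, "Bankrupt")
pr_healthy  <- prec_recall(confusion_2, "Healthy")
print(round(pr_bankrupt, 3))
print(round(pr_healthy, 3))
```

With "Healthy" as the event, the test-set precision is about 0.969 and recall is 1, which is close to the cross-validated figures. If this is indeed the cause, passing `control = control_resamples(event_level = "second")` to `fit_resamples()` should make the cross-validated metrics refer to the "Bankrupt" class; the numbers above have not been re-estimated with that setting.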
#### Conclusion

**Regression:** Model B is the better model. It achieves a lower RMSE and a higher R² on the test set, and cross-validation confirms that this performance is stable.

**Classification:** Model 2 is the better model. In an imbalanced dataset where about 97% of firms are healthy, recall is the key metric. Using all three financial ratios together identifies more bankrupt firms than net value per share alone (6 vs. 2 of the 48 bankrupt firms in the test set).

Prompt we used for AI integration (Claude and Gemini): "We encountiring errors specifically cross-validation part what we should specificly do?How do we perform 5-fold cross-validation using tidymodels for a logistic regression model predicting bankruptcy?"

Response: The AI explained the vfold_cv(), fit_resamples(), and collect_metrics() workflow and showed how to use metric_set(accuracy, precision, recall) for classification problems.

Our reflection: The AI was helpful for understanding the workflow structure. However, we always cross-checked against the course handouts to ensure we used only functions taught in class. AI tools are most useful as a starting point, not a final answer; verification is essential.