Predicting Used Vehicle Prices

Authors

Sarp Ata Kanca, Yagmur Beren Sengezken

Presentation: https://rpubs.com/SarpAta/1438550
Last Report: https://rpubs.com/SarpAta/1438972

ECON 465 Stage 3

Economic Question

Research Question

To what extent do vehicle age, mileage, condition, and engine size predict the market price of a used vehicle?

Why Does It Matter?

Used vehicle markets are economically important.
Consumers need accurate price expectations.
Sellers need efficient pricing strategies.
Vehicle depreciation affects household wealth.

Dataset

Source

Kaggle: Craigslist Cars & Trucks Dataset

Sample

Original observations: 426,880
Cleaned sample: 170,439 observations

Variables

Price: Vehicle listing price
Year: Manufacturing year
Odometer: Miles driven
Condition: Vehicle condition
Cylinders: Engine size

Price Distribution

library(tidyverse)

cars_raw <- read_csv("vehicles.csv")

cars_clean <- cars_raw %>%
  filter(
    price > 500,
    price < 100000,
    !is.na(year),
    !is.na(odometer),
    !is.na(condition),
    !is.na(cylinders)
  ) %>%
  mutate(log_price = log(price))

ggplot(cars_clean, aes(price)) +
  geom_histogram(bins = 50) +
  labs(
    title = "Distribution of Vehicle Prices",
    x = "Price (USD)",
    y = "Frequency"
  )

Observation

Strong right-skewed distribution.
Extreme high-value listings present.
A log transformation reduces the skew, so the models use log price as the outcome.

Model A

Baseline Regression

\[ \log(\text{price}) = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{odometer}) \]

Economic Intuition

Newer vehicles should be worth more.
Higher mileage should reduce value.

Model B

Extended Regression

\[ \log(\text{price}) = \beta_0 + \beta_1(\text{year}) + \beta_2(\text{odometer}) + \beta_3(\text{condition}) + \beta_4(\text{cylinders}) \]

Why Add These Variables?

Vehicle quality matters.
Engine size affects consumer demand.
More realistic representation of market valuation.

Model Comparison

Metric	Model A	Model B
RMSE	0.8344	0.7392
R²	0.1383	0.3237

Result

Model B clearly performs better.

Lower prediction error
Higher explanatory power

Cross-Validation Results

5-Fold Cross Validation

Metric	Mean
RMSE	0.729
R²	0.339

Interpretation

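Model performance is stable across folds (standard error of the RMSE: about 0.002).
The cross-validated RMSE (0.729) is close to the test-set RMSE (0.7392), so the results generalize reasonably well to unseen observations.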
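
The 5-fold procedure behind these numbers can be sketched in base R. This is a minimal illustration on synthetic data, not the tidymodels code used in the report; the data frame `toy`, its coefficients, and the name `fold_rmse` are made up for the example.

```r
# Minimal 5-fold cross-validation sketch on synthetic data (hypothetical values).
set.seed(465)
n <- 500
toy <- data.frame(
  year     = sample(2000:2020, n, replace = TRUE),
  odometer = runif(n, 10000, 200000)
)
# Assumed data-generating process, for illustration only
toy$log_price <- -50 + 0.03 * toy$year - 6e-07 * toy$odometer + rnorm(n, sd = 0.7)

k <- 5
fold_id <- sample(rep(1:k, length.out = n))  # random fold assignment

# Fit on k-1 folds, then compute RMSE on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- toy[fold_id != i, ]
  test  <- toy[fold_id == i, ]
  fit   <- lm(log_price ~ year + odometer, data = train)
  sqrt(mean((test$log_price - predict(fit, newdata = test))^2))
})

cv_mean <- mean(fold_rmse)
cv_se   <- sd(fold_rmse) / sqrt(k)  # standard error across folds
print(round(c(mean = cv_mean, std_err = cv_se), 4))
```

A small standard error relative to the mean is what "stable across folds" refers to: each held-out fold gives roughly the same error.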

Key Findings & Economic Interpretation

Vehicle Age

Effect: Positive coefficient on manufacturing year.

Interpretation: Newer vehicles command a statistically significant premium, plausibly because they carry a lower risk of mechanical failure.

Odometer

Effect: Negative coefficient.

Interpretation: Each additional mile lowers the predicted price, consistent with physical depreciation of the asset through use.

Condition

Effect: Large premiums and discounts across condition categories.

Interpretation: “Like New” listings capture substantial valuation premiums, whereas “Salvage” classifications carry heavy discounts, plausibly due to title restrictions and structural damage.
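
Because the outcome is on a log scale, a coefficient β translates into a percentage price change of 100·(exp(β·Δ) − 1) for a change Δ in the regressor. The sketch below applies this to four Model B coefficients copied from the regression output in this report. The helper `pct_change` is a hypothetical name, and because the fitted outcome is log(price + 1), the percentages are approximations.

```r
# Model B coefficients copied from the summary output reported in this document
coefs <- c(
  year     = 2.895e-02,
  odometer = -6.210e-07,
  like_new = 1.359e-01,
  salvage  = -1.222e+00
)

# Percentage change in price implied by a change `delta` in the regressor
pct_change <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

effects <- c(
  one_year_newer  = pct_change(coefs[["year"]]),
  per_10000_miles = pct_change(coefs[["odometer"]], delta = 10000),
  like_new        = pct_change(coefs[["like_new"]]),
  salvage         = pct_change(coefs[["salvage"]])
)
print(round(effects, 2))
```

Read this way, one additional model year is worth roughly 3% and 10,000 extra miles cost well under 1%. The condition effects are relative to the omitted baseline category, which appears to be "excellent" since it is the level missing from the coefficient table.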

Economic Interpretation

Depreciation

Vehicle prices decrease as usage increases.

Consumer Valuation

Buyers pay premiums for:

Newer vehicles
Better condition
Larger engines

Market Implications

Observable quality characteristics strongly influence market outcomes.

Limitations

Limitation 1

Omitted Variable Bias (OVB): Important vehicle characteristics were omitted from our regressions, notably manufacturer/brand reputation, model line, fuel efficiency (MPG), and regional market location. Manufacturer, model, and region are present in the raw data but were not used in either model.

Limitation 2

Asking vs. Transaction Prices: The dataset relies entirely on initial public seller listing prices rather than finalized contract clearing transactions. Because secondary vehicle transactions typically involve localized bargaining, our models capture nominal consumer pricing expectations rather than pure market equilibrium values.

Limitation 3

Self-Reporting Bias: The data is entirely self-reported by private sellers, introducing subjective flaws into categorical metrics like vehicle condition.

Future Research

Potential Improvement

Include:

Brand effects
Fuel efficiency
Geographic market differences

New Economic Question

Do vehicle brands create a measurable price premium after controlling for age, mileage, condition, and engine size?

Conclusion

Main Takeaways

Vehicle age and mileage are significant predictors.

Condition substantially improves prediction accuracy.

Model B outperforms Model A.

Results are consistent with economic theory of depreciation.

Reproducibility Protocols

To make the analysis reproducible on other machines, we followed these rules:

Relative paths: Data is read with a relative path (read_csv("vehicles.csv")), so the script does not depend on any machine-specific directory.
Fixed random seed: set.seed(465) is called before any data partitioning or cross-validation sampling, so the same partitions are drawn on every run.
Standard packages: The code relies on stable, widely used packages from the tidyverse framework, which reduces the risk of breaking changes.
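
The effect of the fixed seed can be checked directly. The sketch below uses a hypothetical row count `n` and base R `sample()` rather than the `initial_split()` call used in the report.

```r
# Re-seeding before sampling reproduces the same train/test partition.
n <- 100  # hypothetical number of rows

set.seed(465)
train_idx_run1 <- sample(n, size = 0.8 * n)

set.seed(465)
train_idx_run2 <- sample(n, size = 0.8 * n)

# Without re-seeding, the next draw continues the random stream
train_idx_run3 <- sample(n, size = 0.8 * n)

print(identical(train_idx_run1, train_idx_run2))  # same seed, same split
print(identical(train_idx_run1, train_idx_run3))
```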

AI Use Log

Assisting AI System: Gemini (Advanced Architecture Engine)
Applied Interaction Strategy: Layout syntax automation, code consolidation, and structural Markdown optimization.
Raw User Prompt Input: “Show me to how to solve this problem in quarto slides document in fastest way possible”
Applied Output Implementation: The AI response provided clean styling mechanics (such as the custom {.smaller} header tags and column containment blocks) to fix layout overflow issues.
Verification and Modification: The styling tips were verified via local engine compilation (quarto render). These slide design components were then manually converted into a formal, continuous narrative report structure. This was done by replacing presentation slide breaks (---) with descriptive paragraphs to satisfy the final essay requirement.

Final Reflections

Strategic Path Improvements

Given a more extended research timeline or access to deeper computational power, our primary structural improvement would involve implementing high-cardinality fixed-effects modeling for vehicle brands and models. Controlling for manufacturer identity would effectively insulate our continuous mileage and age coefficients from brand-equity bias (such as the slow structural depreciation rates of reliable commuter brands compared to high-end luxury lines).
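
The brand-equity bias described here can be illustrated on synthetic data. Everything in the sketch is an assumption made for the example: the `luxury` indicator, the mileage gap between brands, and the true mileage effect `true_beta`. It is not an estimate from the Craigslist data.

```r
# Synthetic illustration of omitted-variable bias (all values hypothetical).
set.seed(465)
n <- 2000
luxury   <- rbinom(n, 1, 0.3)  # 1 = luxury brand
# Assume luxury cars are driven less on average
odometer <- rnorm(n, mean = 100000 - 30000 * luxury, sd = 20000)
true_beta <- -5e-06            # assumed mileage effect on log price
log_price <- 10 + true_beta * odometer + 0.5 * luxury + rnorm(n, sd = 0.2)

fit_without_brand <- lm(log_price ~ odometer)
fit_with_brand    <- lm(log_price ~ odometer + factor(luxury))

beta_without <- coef(fit_without_brand)[["odometer"]]
beta_with    <- coef(fit_with_brand)[["odometer"]]
print(c(true = true_beta, without_brand = beta_without, with_brand = beta_with))
```

In this setup the regression without the brand indicator overstates the mileage penalty, because low mileage and the brand premium move together. Adding the brand factor brings the estimate back toward the assumed value.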

Future Economic Research Questions

The insights developed across this econometric evaluation inspire a compelling new research question:

“Do secondary market vehicle brands display asymmetric depreciation elasticities across varying regional economic environments during inflationary contraction cycles?”

Answering this question would reveal whether affordable economy vehicle choices behave as Giffen or defensive assets when aggregate consumer purchasing power contracts.

Conclusion

Main Takeaways

Vehicle age and mileage are significant, robust predictors of asset depreciation.
Incorporating qualitative condition metrics substantially improves prediction accuracy and reduces error.
The extended specification, Model B, outperforms the baseline model on both test-set metrics (RMSE and R²).
Empirical results match classical economic depreciation theories, showing how asset usage and features dictate consumer valuation.

Stage 1

PART A: Dataset 1 (Regression) - Predicting Vehicle Prices

1. Dataset and Source

Source: Kaggle - Craigslist Cars/Trucks Data (https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data)

This dataset contains 426,880 individual vehicle listings, providing a high-volume cross-section of the used car market.

Variable List

Variable	Description	Type
Price	Listing price in USD - Target Variable	Continuous
Year	Year the vehicle was manufactured	Numeric
Odometer	Total miles driven	Continuous
Condition	Reported condition of the vehicle	Categorical
Cylinders	Engine size/type	Discrete


2. Economic Question

To what extent do vehicle age and usage (odometer) predict the market price of a used vehicle in the coming year's resale market?

3. Data Importing and Cleaning

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the data using a relative path and keep only the modelling variables
cars_data <- read_csv("vehicles.csv") |>
  select(price, year, odometer, condition, cylinders) |>
  drop_na() |>
  filter(price > 500 & price < 100000)

Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl   (6): id, price, year, odometer, lat, long
lgl   (1): county
dttm  (1): posting_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4. Probability Analysis (Regression)

summary(cars_data$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    501    6495   12450   16820   25000   99999

ggplot(cars_data, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vehicle Prices", x = "Price ($)", y = "Frequency") +
  theme_minimal()

cars_data <- cars_data |>
  mutate(log_price = log(price + 1))

ggplot(cars_data, aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(title = "Log-Transformed Price Distribution", x = "Log(Price)", y = "Frequency") +
  theme_minimal()

The regression model examines how vehicle age and usage affect resale prices in the used car market. The coefficient for `year` is expected to be positive, indicating that newer vehicles tend to have higher market values. The coefficient for `odometer` is expected to be negative, suggesting that heavily used vehicles lose value as mileage increases.

5. Theoretical Distribution Proposal

The vehicle price data is heavily right-skewed, as most cars are affordable used models while a few luxury vehicles create a long tail. The log transformation normalizes this effectively, suggesting a Log-Normal distribution.
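
One way to check this kind of claim is to compare skewness before and after the log transform. The sketch below does this on synthetic log-normal draws. The parameters of `rlnorm()` are hypothetical and the helper `skewness` is defined here, so the numbers say nothing about the actual listings.

```r
# Skewness before and after a log transform, on synthetic log-normal "prices".
set.seed(465)
price <- rlnorm(10000, meanlog = 9.4, sdlog = 0.8)  # hypothetical parameters

# Sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
print(round(c(raw = skew_raw, log = skew_log), 3))
```

For log-normal data the raw skewness is clearly positive and the skewness of the logged values is close to zero. Applying the same function to `cars_data$price` and `cars_data$log_price` would show how closely the real data follow that pattern.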

B1. Dataset and Source

Source: Kaggle - Company Bankruptcy Prediction
Data Link: https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction

B2. Economic Question

Can foundational financial ratios, such as net value per share, accurately classify and predict whether a firm will go bankrupt in the coming fiscal period?

B3. Data Importing and Cleaning


bankruptcy_raw <- read_csv("bankruptcy.csv", locale = locale(encoding = "UTF-8"))

Rows: 6819 Columns: 96
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (96): Bankrupt?, ROA(C) before interest and depreciation before interest...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mydata_class <- bankruptcy_raw |>
  select(1, 17, 22, 23) |>
  rename(
    bankrupt                   = 1,
    net_value_per_share        = 2,
    revenue_per_share          = 3,
    operating_profit_per_share = 4
  ) |>
  filter(!is.na(bankrupt) & !is.na(net_value_per_share) &
         !is.na(revenue_per_share) & !is.na(operating_profit_per_share)) |>
  mutate(bankrupt = factor(bankrupt, levels = c(0, 1),
                           labels = c("Healthy", "Bankrupt")))

B4. Probability Analysis (Classification)

table(mydata_class$bankrupt)


 Healthy Bankrupt 
    6599      220

ggplot(mydata_class, aes(x = bankrupt, fill = bankrupt)) +
  geom_bar() +
  labs(title = "Class Distribution: Firm Health Status", 
       x = "Status", y = "Count") +
  theme_minimal()

summary(mydata_class)

     bankrupt    net_value_per_share revenue_per_share  
 Healthy :6599   Min.   :0.0000      Min.   :0.000e+00  
 Bankrupt: 220   1st Qu.:0.1736      1st Qu.:0.000e+00  
                 Median :0.1844      Median :0.000e+00  
                 Mean   :0.1907      Mean   :1.329e+06  
                 3rd Qu.:0.1996      3rd Qu.:0.000e+00  
                 Max.   :1.0000      Max.   :3.020e+09  
 operating_profit_per_share
 Min.   :0.00000           
 1st Qu.:0.09608           
 Median :0.10423           
 Mean   :0.10909           
 3rd Qu.:0.11616           
 Max.   :1.00000

Interpretation of Class Distribution

The dataset is highly imbalanced, with healthy firms representing the majority of observations. This is common in bankruptcy prediction studies because actual bankruptcies are relatively rare events in real-world financial markets. From an economic perspective, this imbalance reflects the fact that most firms remain financially stable, while only a small proportion experience severe financial distress. Such imbalance is important in predictive modeling because classification algorithms may become biased toward predicting the majority class.
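
The practical consequence can be made concrete with the class counts from the table above. A classifier that always predicts "Healthy" already reaches a high accuracy while detecting no bankruptcies at all. The variable names in this sketch are illustrative.

```r
# Class counts reported in the table above
counts <- c(Healthy = 6599, Bankrupt = 220)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)

# Share of bankrupt firms such a classifier would detect (it flags none)
baseline_recall_bankrupt <- 0 / counts[["Bankrupt"]]

print(round(c(accuracy = baseline_accuracy, recall = baseline_recall_bankrupt), 4))
```

Any model for this dataset should therefore be judged against an accuracy of about 0.968 rather than against 0.5, and recall on the bankrupt class carries more information than accuracy.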

B5. Theoretical Distribution Proposal

The target variable is binary (Bankrupt vs Healthy). In this cross-sectional sample of firms, each firm's outcome can be modeled as a Bernoulli trial, which allows us to predict the probability of failure from foundational corporate indicators.

Financial Ratio Interpretation

The selected financial indicators measure different dimensions of firm performance:

Net value per share reflects shareholder equity strength.
Revenue per share captures operational scale and market activity.
Operating profit per share measures operational efficiency.

Economically, firms with weaker profitability and lower shareholder value are expected to face higher bankruptcy risk.

Stage 2

Setup

library(tidyverse)
library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──

✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Data Loading and Cleaning

cars_data <- read_csv("vehicles.csv") |>
  select(price, year, odometer, condition, cylinders) |>
  filter(!is.na(price) & !is.na(year) & !is.na(odometer) &
         !is.na(condition) & !is.na(cylinders)) |>
  filter(price > 500 & price < 100000) |>
  mutate(
    condition = as.factor(condition),
    cylinders = as.factor(cylinders),
    log_price = log(price + 1)
  )

set.seed(465)
cars_split <- initial_split(cars_data, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

cat("Training set rows:", nrow(cars_train), "\n")

Training set rows: 136351

cat("Test set rows:    ", nrow(cars_test),  "\n")

Test set rows:     34088

Two Regression Models

Model A:

model_A <- lm(log_price ~ year + odometer, data = cars_train)
summary(model_A)


Call:
lm(formula = log_price ~ year + odometer, data = cars_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5146 -0.5544  0.0251  0.6220 10.7066 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.663e+01  4.282e-01 -108.92   <2e-16 ***
year         2.792e-02  2.130e-04  131.07   <2e-16 ***
odometer    -7.279e-07  1.109e-08  -65.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8257 on 136348 degrees of freedom
Multiple R-squared:  0.1515,    Adjusted R-squared:  0.1515 
F-statistic: 1.218e+04 on 2 and 136348 DF,  p-value: < 2.2e-16

Model B:

model_B <- lm(log_price ~ year + odometer + condition + cylinders, data = cars_train)
summary(model_B)


Call:
lm(formula = log_price ~ year + odometer + condition + cylinders, 
    data = cars_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4367 -0.4379  0.0637  0.4779 10.0329 

Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)    
(Intercept)           -4.828e+01  3.933e-01 -122.773  < 2e-16 ***
year                   2.895e-02  1.953e-04  148.237  < 2e-16 ***
odometer              -6.210e-07  9.818e-09  -63.249  < 2e-16 ***
conditionfair         -1.047e+00  1.161e-02  -90.241  < 2e-16 ***
conditiongood          4.595e-03  4.308e-03    1.067    0.286    
conditionlike new      1.359e-01  6.974e-03   19.492  < 2e-16 ***
conditionnew           4.056e-01  3.023e-02   13.420  < 2e-16 ***
conditionsalvage      -1.222e+00  3.982e-02  -30.680  < 2e-16 ***
cylinders12 cylinders  3.067e-01  7.602e-02    4.035 5.46e-05 ***
cylinders3 cylinders  -7.856e-01  4.893e-02  -16.058  < 2e-16 ***
cylinders4 cylinders  -8.560e-01  2.718e-02  -31.500  < 2e-16 ***
cylinders5 cylinders  -9.796e-01  3.522e-02  -27.813  < 2e-16 ***
cylinders6 cylinders  -3.957e-01  2.710e-02  -14.602  < 2e-16 ***
cylinders8 cylinders   7.166e-03  2.716e-02    0.264    0.792    
cylindersother        -3.386e-01  4.181e-02   -8.098 5.63e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7284 on 136336 degrees of freedom
Multiple R-squared:  0.3398,    Adjusted R-squared:  0.3397 
F-statistic:  5012 on 14 and 136336 DF,  p-value: < 2.2e-16

Model Comparison and Selection

##Model A
cars_test <- cars_test |>
  mutate(pred_A = predict(model_A, newdata = cars_test))

rmse_A <- cars_test |>
  summarise(rmse = sqrt(mean((log_price - pred_A)^2)))

rsq_A <- cars_test |>
  summarise(rsq = 1 - sum((log_price - pred_A)^2) /
                      sum((log_price - mean(log_price))^2))

cat("Model A RMSE:", round(rmse_A$rmse, 4), "\n")

Model A RMSE: 0.8344

cat("Model A R2:  ", round(rsq_A$rsq,  4), "\n")

Model A R2:   0.1383

## Model B Evaluations
cars_test <- cars_test |>
  mutate(pred_B = predict(model_B, newdata = cars_test))

rmse_B <- cars_test |>
  summarise(rmse = sqrt(mean((log_price - pred_B)^2)))

rsq_B <- cars_test |>
  summarise(rsq = 1 - sum((log_price - pred_B)^2) /
                      sum((log_price - mean(log_price))^2))

cat("Model B RMSE:", round(rmse_B$rmse, 4), "\n")

Model B RMSE: 0.7392

cat("Model B R2:  ", round(rsq_B$rsq,  4), "\n")

Model B R2:   0.3237

Model B clearly outperforms Model A on both metrics, yielding a lower RMSE and a higher R². Adding vehicle condition and engine size gives a more accurate representation of market value.
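
Since both RMSE values are measured on the log-price scale, it can help to convert them into a multiplicative price error. The sketch below uses the two test-set RMSE values reported above. The reading of exp(RMSE) as a "typical" error factor is a rough rule of thumb, not an exact statistical statement.

```r
# Test-set RMSE values on the log scale, as reported above
rmse_log <- c(model_A = 0.8344, model_B = 0.7392)

# A log-scale error of r corresponds to a price factor of exp(r)
typical_factor <- exp(rmse_log)
print(round(typical_factor, 2))
```

Even for Model B the factor is around 2, meaning a typical prediction can be off by about a factor of two in either direction. This is consistent with an R² of roughly one third.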

Cross-Validation

cat("--- Regression Model Comparison ---\n")

--- Regression Model Comparison ---

cat("Model A (year + odometer)\n")

Model A (year + odometer)

cat("  RMSE:", round(rmse_A$rmse, 4), "\n")

  RMSE: 0.8344

cat("  R2:  ", round(rsq_A$rsq,  4), "\n")

  R2:   0.1383

cat("Model B (year + odometer + condition + cylinders)\n")

Model B (year + odometer + condition + cylinders)

cat("  RMSE:", round(rmse_B$rmse, 4), "\n")

  RMSE: 0.7392

cat("  R2:  ", round(rsq_B$rsq,  4), "\n")

  R2:   0.3237

lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

set.seed(465)
folds_cars <- vfold_cv(cars_train, v = 5)

cv_reg <- fit_resamples(
  lm_spec,
  log_price ~ year + odometer + condition + cylinders,
  resamples = folds_cars,
  metrics   = metric_set(rmse, rsq)
)

collect_metrics(cv_reg)

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   0.729     5 0.00198 pre0_mod0_post0
2 rsq     standard   0.339     5 0.00328 pre0_mod0_post0

cat("Test set RMSE:", round(rmse_B$rmse, 4), "\n")

Test set RMSE: 0.7392

cat("Test set R2:  ", round(rsq_B$rsq,  4), "\n")

Test set R2:   0.3237

PART B: Classification - Bankruptcy Prediction

bankruptcy_raw <- read_csv("bankruptcy.csv",
                           locale = locale(encoding = "UTF-8"))

mydata_class <- bankruptcy_raw |>
  select(1, 17, 22, 23) |>
  rename(
    bankrupt                   = 1,
    net_value_per_share        = 2,
    revenue_per_share          = 3,
    operating_profit_per_share = 4
  ) |>
  filter(!is.na(bankrupt) & !is.na(net_value_per_share) &
         !is.na(revenue_per_share) & !is.na(operating_profit_per_share)) |>
  mutate(bankrupt = factor(bankrupt, levels = c(0, 1),
                           labels = c("Healthy", "Bankrupt")))

set.seed(465)
cls_split <- initial_split(mydata_class, prop = 0.8)
cls_train <- training(cls_split)
cls_test  <- testing(cls_split)

cat("Training set rows:", nrow(cls_train), "\n")

Training set rows: 5455

cat("Test set rows:    ", nrow(cls_test),  "\n")

Test set rows:     1364

table(cls_train$bankrupt)


 Healthy Bankrupt 
    5283      172

Model 1: net_value_per_share only

Model 1 analyzes baseline bankruptcy risk using a single fundamental ratio: shareholder equity strength relative to outstanding stock.

logistic_model1 <- glm(
  bankrupt ~ net_value_per_share,
  data   = cls_train,
  family = binomial
)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model1)


Call:
glm(formula = bankrupt ~ net_value_per_share, family = binomial, 
    data = cls_train)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)           8.8669     0.8035   11.04   <2e-16 ***
net_value_per_share -70.2914     4.8342  -14.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1527.7  on 5454  degrees of freedom
Residual deviance: 1246.9  on 5453  degrees of freedom
AIC: 1250.9

Number of Fisher Scoring iterations: 7
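
To see what these coefficients imply, the fitted logit can be evaluated by hand. The sketch below copies the two Model 1 estimates from the output above and uses the full-sample median of net_value_per_share (0.1844) from the Stage 1 summary as an example input. The names `p_median` and `cutoff` are illustrative.

```r
# Model 1 coefficients copied from the summary output above
b0 <- 8.8669
b1 <- -70.2914

# Predicted bankruptcy probability at the median net value per share (0.1844)
p_median <- plogis(b0 + b1 * 0.1844)

# Value of net_value_per_share at which the predicted probability equals 0.5
cutoff <- -b0 / b1

print(round(c(p_median = p_median, cutoff = cutoff), 4))
```

The cutoff falls below the first quartile of net_value_per_share (0.1736), so with a 0.5 threshold the model flags only firms with unusually low net value per share. This suggests it will label very few firms as bankrupt.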

Model 2: All three ratios

Model 2 expands the classification framework by adding an operational efficiency measure (operating profit per share) and a revenue scale indicator (revenue per share).

logistic_model2 <- glm(
  bankrupt ~ net_value_per_share + revenue_per_share + operating_profit_per_share,
  data   = cls_train,
  family = binomial
)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model2)


Call:
glm(formula = bankrupt ~ net_value_per_share + revenue_per_share + 
    operating_profit_per_share, family = binomial, data = cls_train)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                 8.779e+00  7.980e-01  11.002  < 2e-16 ***
net_value_per_share        -5.312e+01  5.781e+00  -9.189  < 2e-16 ***
revenue_per_share          -7.001e-09  2.203e-07  -0.032    0.975    
operating_profit_per_share -2.963e+01  6.225e+00  -4.759 1.94e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1527.7  on 5454  degrees of freedom
Residual deviance: 1225.0  on 5451  degrees of freedom
AIC: 1233

Number of Fisher Scoring iterations: 13

Predictions and Metrics:

Model 1 Predictions & Programmatic Handout-Safe Metrics

probs_1 <- predict(logistic_model1, cls_test, type = "response")
pred_1  <- ifelse(probs_1 > 0.5, "Bankrupt", "Healthy")
pred_1  <- factor(pred_1, levels = c("Healthy", "Bankrupt"))

confusion_1 <- table(Predicted = pred_1, Actual = cls_test$bankrupt)
confusion_1

          Actual
Predicted  Healthy Bankrupt
  Healthy     1316       46
  Bankrupt       0        2

TN_1 <- confusion_1["Healthy",  "Healthy"]
FP_1 <- confusion_1["Bankrupt", "Healthy"]
FN_1 <- confusion_1["Healthy",  "Bankrupt"]
TP_1 <- confusion_1["Bankrupt", "Bankrupt"]

accuracy_1  <- (TP_1 + TN_1) / (TP_1 + TN_1 + FP_1 + FN_1)
precision_1 <- ifelse(TP_1 + FP_1 > 0, TP_1 / (TP_1 + FP_1), 0)
recall_1    <- TP_1 / (TP_1 + FN_1)

cat("Model 1 Accuracy: ", round(accuracy_1,  3), "\n")

Model 1 Accuracy:  0.966

cat("Model 1 Precision:", round(precision_1, 3), "\n")

Model 1 Precision: 1

cat("Model 1 Recall:   ", round(recall_1,    3), "\n")

Model 1 Recall:    0.042

Model 2 Predictions & Programmatic Handout-Safe Metrics:

probs_2 <- predict(logistic_model2, cls_test, type = "response")
pred_2  <- ifelse(probs_2 > 0.5, "Bankrupt", "Healthy")
pred_2  <- factor(pred_2, levels = c("Healthy", "Bankrupt"))

confusion_2 <- table(Predicted = pred_2, Actual = cls_test$bankrupt)
confusion_2

          Actual
Predicted  Healthy Bankrupt
  Healthy     1316       42
  Bankrupt       0        6

TN_2 <- confusion_2["Healthy",  "Healthy"]
FP_2 <- confusion_2["Bankrupt", "Healthy"]
FN_2 <- confusion_2["Healthy",  "Bankrupt"]
TP_2 <- confusion_2["Bankrupt", "Bankrupt"]

accuracy_2  <- (TP_2 + TN_2) / (TP_2 + TN_2 + FP_2 + FN_2)
precision_2 <- ifelse(TP_2 + FP_2 > 0, TP_2 / (TP_2 + FP_2), 0)
recall_2    <- TP_2 / (TP_2 + FN_2)

cat("Model 2 Accuracy: ", round(accuracy_2,  3), "\n")

Model 2 Accuracy:  0.969

cat("Model 2 Precision:", round(precision_2, 3), "\n")

Model 2 Precision: 1

cat("Model 2 Recall:   ", round(recall_2,    3), "\n")

Model 2 Recall:    0.125

Cross-Validation: Model 2

logistic_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

set.seed(465)
folds_cls <- vfold_cv(cls_train, v = 5)

cv_cls <- fit_resamples(
  logistic_spec,
  bankrupt ~ net_value_per_share + revenue_per_share + operating_profit_per_share,
  resamples = folds_cls,
  metrics   = metric_set(accuracy, precision, recall)
)

→ A | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

There were issues with some computations   A: x1

There were issues with some computations   A: x5

collect_metrics(cv_cls)

# A tibble: 3 × 6
  .metric   .estimator  mean     n  std_err .config        
  <chr>     <chr>      <dbl> <int>    <dbl> <chr>          
1 accuracy  binary     0.968     5 0.000850 pre0_mod0_post0
2 precision binary     0.969     5 0.00174  pre0_mod0_post0
3 recall    binary     0.998     5 0.00124  pre0_mod0_post0

cat("Test set Accuracy: ", round(accuracy_2,  3), "\n")

Test set Accuracy:  0.969

cat("Test set Precision:", round(precision_2, 3), "\n")

Test set Precision: 1

cat("Test set Recall:   ", round(recall_2,    3), "\n")

Test set Recall:    0.125
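
The cross-validated recall (0.998) and the test-set recall (0.125) differ so sharply most likely because they describe different classes. yardstick treats the first factor level as the event by default, which is "Healthy" here, whereas the manual calculation treats "Bankrupt" as positive. The sketch below recomputes both views from the Model 2 test-set confusion matrix. The helper `class_metrics` is a hypothetical name.

```r
# Model 2 test-set confusion matrix, copied from the output above
confusion <- matrix(
  c(1316, 0, 42, 6),
  nrow = 2,
  dimnames = list(Predicted = c("Healthy", "Bankrupt"),
                  Actual    = c("Healthy", "Bankrupt"))
)

# Precision and recall when `event` is treated as the positive class
class_metrics <- function(cm, event) {
  tp <- cm[event, event]
  c(precision = tp / sum(cm[event, ]),  # row: predicted as event
    recall    = tp / sum(cm[, event]))  # column: actually event
}

bankrupt_as_event <- class_metrics(confusion, "Bankrupt")
healthy_as_event  <- class_metrics(confusion, "Healthy")
print(round(rbind(bankrupt_as_event, healthy_as_event), 3))
```

With "Healthy" as the event, the test-set figures land close to the cross-validated ones. To the best of our knowledge, passing `control = control_resamples(event_level = "second")` to `fit_resamples()` is one way to make the resampled metrics refer to the bankrupt class instead.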

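Conclusion

Regression: Model B is the better model. It achieves lower RMSE and higher R² on the test set, and the cross-validated RMSE (0.729) is close to the test-set value (0.7392).

Classification: Model 2 is the better model. In an imbalanced dataset where about 97% of firms are healthy, recall on the bankrupt class is the key metric. Using all three financial ratios together identifies more bankrupt firms than net value per share alone (6 versus 2 of the 48 bankrupt firms in the test set). The cross-validated precision and recall (0.969 and 0.998) are not directly comparable to the test-set values: yardstick treats the first factor level, Healthy, as the event by default, whereas the test-set calculation treats Bankrupt as the positive class.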

Prompt we used for AI integration (Claude and Gemini): We are encountering errors, specifically in the cross-validation part. What should we do? How do we perform 5-fold cross-validation using tidymodels for a logistic regression model predicting bankruptcy?

Response: The AI explained the vfold_cv(), fit_resamples(), and collect_metrics() workflow and showed how to use metric_set(accuracy, precision, recall) for classification problems.

Our reflection: The AI was helpful for understanding the workflow structure. However, we always cross-checked against the course handouts to ensure we used only functions taught in class. AI tools are most useful as a starting point, not a final answer; verification is essential.