More realistic representation of market valuation.
Model Comparison

| Metric | Model A | Model B |
|--------|---------|---------|
| RMSE   | 0.8344  | 0.7392  |
| R²     | 0.1383  | 0.3237  |

Result
Model B clearly performs better.
Lower prediction error
Higher explanatory power
Cross-Validation Results
5-Fold Cross Validation

| Metric | Mean  |
|--------|-------|
| RMSE   | 0.729 |
| R²     | 0.339 |

Interpretation
Model performance is stable across samples.
Results generalize well to unseen observations.
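
The fold-averaged figures above come from 5-fold cross-validation. As a rough illustration of what that procedure does, here is a minimal base R sketch on simulated data. The data frame `toy`, its coefficients, and the helper `cv_rmse()` are hypothetical stand-ins, not the report's actual code or data.

```r
# Minimal 5-fold cross-validation sketch in base R (simulated data, hypothetical names).
set.seed(465)

n <- 500
toy <- data.frame(
  year     = sample(2000:2020, n, replace = TRUE),
  odometer = runif(n, 0, 200000)
)
# Assumed data-generating process, for illustration only
toy$log_price <- -46 + 0.028 * toy$year - 7e-07 * toy$odometer + rnorm(n, sd = 0.8)

cv_rmse <- function(data, formula, k = 5) {
  # Randomly assign each row to one of k folds of (nearly) equal size
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    fit  <- lm(formula, data = data[fold_id != i, ])       # train on k - 1 folds
    pred <- predict(fit, newdata = data[fold_id == i, ])   # predict the held-out fold
    sqrt(mean((data$log_price[fold_id == i] - pred)^2))    # RMSE on the held-out fold
  })
}

fold_rmse <- cv_rmse(toy, log_price ~ year + odometer)
mean(fold_rmse)  # fold-averaged RMSE
sd(fold_rmse)    # spread across folds; a small value suggests stable performance
```

Reporting the spread across folds alongside the mean is what supports a stability claim; the mean alone does not.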
Key Findings & Economic Interpretation
Vehicle Age
Effect: Positive coefficient on `year` (a newer model year, i.e. a lower vehicle age, raises the predicted price).
Interpretation: Newer manufacturing cohorts command statistically significant premiums, as they represent a lower baseline risk of mechanical failures.
Odometer
Effect: Negative coefficient.
Interpretation: Higher mileage steadily lowers market value, directly reflecting the physical depreciation of the asset through operational usage.
Condition
Effect: Strong non-linear premiums and discounts.
Interpretation: “Like New” listings capture substantial valuation premiums, whereas “Salvage” classifications trigger heavy immediate market discounts due to title restrictions and structural damage.
Limitation 1
Omitted Variable Bias (OVB): Important vehicle characteristics were omitted from our regressions due to dataset scope limitations, notably manufacturer/brand reputation, exact model line tiers, fuel efficiency (MPG), and regional market location variables.
Limitation 2
Asking vs. Transaction Prices: The dataset relies entirely on initial public seller listing prices rather than finalized contract clearing transactions. Because secondary vehicle transactions typically involve localized bargaining, our models capture nominal consumer pricing expectations rather than pure market equilibrium values.
Limitation 3
Self-Reporting Bias: The data is entirely self-reported by private sellers, introducing subjective flaws into categorical metrics like vehicle condition.
Future Research
Potential Improvement
Include:
Brand effects
Fuel efficiency
Geographic market differences
New Economic Question
Do vehicle brands create a measurable price premium after controlling for age, mileage, condition, and engine size?
Conclusion
Main Takeaways
Vehicle age and mileage are significant predictors.
Results are consistent with economic theory of depreciation.
Reproducibility Protocols
To guarantee full structural reproducibility across diverse operational platforms, the following design parameters were enforced:
Relative Pathing: Data ingress relies exclusively on relative path commands (read_csv("vehicles.csv")), rendering the execution script independent of hardcoded local drive environments.
Stochastic Isolation: Global pseudo-random distribution states are pinned via set.seed(465) at the setup phase before any data partitioning or cross-validation sampling occurs.
Standardized Coding Environment: Scripting utilizes standard, stable container packages within the tidyverse framework to prevent breaking changes during runtime execution.
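
To make the seeding point concrete, the sketch below shows that fixing the seed before sampling yields the same partition on every run. The helper `make_split()` is a hypothetical illustration in base R and is not taken from the project script.

```r
# Sketch: the same seed yields the same random partition on every run.
make_split <- function(n_rows, prop = 0.8, seed = 465) {
  set.seed(seed)                               # pin the RNG state before sampling
  sample(n_rows, size = floor(prop * n_rows))  # row indices for the training set
}

idx_first  <- make_split(1000)
idx_second <- make_split(1000)
identical(idx_first, idx_second)  # TRUE: the partition is reproducible
```

One caveat: the default sampling algorithm changed in R 3.6.0, so identical partitions are only expected between machines running R with the same RNG settings.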
AI Use Log
Assisting AI System: Gemini (Advanced Architecture Engine)
Raw User Prompt Input: “Show me to how to solve this problem in quarto slides document in fastest way possible”
Applied Output Implementation: The AI response provided clean styling mechanics (such as the custom {.smaller} header tags and column containment blocks) to fix layout overflow issues.
Verification and Modification: The styling tips were verified via local engine compilation (quarto render). These slide design components were then manually converted into a formal, continuous narrative report structure. This was done by replacing presentation slide breaks (---) with descriptive paragraphs to satisfy the final essay requirement.
Final Reflections
Strategic Path Improvements
Given a more extended research timeline or access to deeper computational power, our primary structural improvement would involve implementing high-cardinality fixed-effects modeling for vehicle brands and models. Controlling for manufacturer identity would effectively insulate our continuous mileage and age coefficients from brand-equity bias (such as the slow structural depreciation rates of reliable commuter brands compared to high-end luxury lines).
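
A small simulation can illustrate why brand fixed effects matter for the mileage coefficient. Everything in this sketch is assumed for illustration: the three brand tiers, the premium sizes, and the link between brand and mileage are invented and are not estimates from the Craigslist data.

```r
# Sketch: omitting brand biases the mileage coefficient; brand fixed effects remove the bias.
# All numbers below are simulated assumptions, not estimates from the Craigslist data.
set.seed(465)

n <- 2000
brand_id <- sample(1:3, n, replace = TRUE)
brand <- factor(c("economy", "midrange", "luxury")[brand_id],
                levels = c("economy", "midrange", "luxury"))

# Assumption: pricier brands are driven less (odometer in units of 10,000 miles)
odometer_10k <- 14 - 3 * brand_id + rnorm(n)

true_slope <- -0.05
log_price <- 9 + 1 * (brand_id - 1) + true_slope * odometer_10k + rnorm(n, sd = 0.2)

naive <- lm(log_price ~ odometer_10k)           # brand omitted
fe    <- lm(log_price ~ odometer_10k + brand)   # brand dummies (fixed effects)

coef(naive)["odometer_10k"]  # far from -0.05: the slope absorbs the brand premium
coef(fe)["odometer_10k"]     # close to -0.05 once brand is controlled for
```

With hundreds of brands and models, the same idea applies, but a dedicated fixed-effects estimator may be more practical than `lm()` with a large factor.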
Future Economic Research Questions
The insights developed across this econometric evaluation inspire a compelling new research question:
“Do secondary market vehicle brands display asymmetric depreciation elasticities across varying regional economic environments during inflationary contraction cycles?”
Answering this question would reveal whether affordable economy vehicle choices behave as Giffen or defensive assets when aggregate consumer purchasing power contracts.
Conclusion
Main Takeaways
Vehicle age and mileage are significant, robust predictors of asset depreciation.
Incorporating qualitative condition metrics drastically improves model prediction accuracy and reduces error.
Extended Specification Model B decisively outperforms the baseline model across all metrics.
Empirical results match classical economic depreciation theories, showing how asset usage and features dictate consumer valuation.
## Stage 1

# PART A: Dataset 1 (Regression) - Predicting Vehicle Prices

## 1. Dataset and Source

Source: [Kaggle - Craigslist Cars/Trucks Data](https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data)

This dataset captures hundreds of thousands of individual vehicle listings (426,880 rows in the raw file), providing a high-volume cross-section of the used car market.

Variable List

| Variable  | Description                            | Type        |
|-----------|----------------------------------------|-------------|
| Price     | Listing price in USD (target variable) | Continuous  |
| Year      | Year the vehicle was manufactured      | Numeric     |
| Odometer  | Total miles driven                     | Continuous  |
| Condition | Reported condition of the vehicle      | Categorical |
| Cylinders | Engine size/type                       | Discrete    |

## 2. Economic Question

"To what extent do vehicle age and usage (odometer) predict the market price of a used vehicle in the coming year's resale market?"

## 3. Data Importing and Cleaning

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.3 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
cars_data <- read_csv("vehicles.csv") |>  # load the data via a relative path
  select(price, year, odometer, condition, cylinders) |>
  drop_na() |>
  filter(price > 500 & price < 100000)  # drop implausibly low or high listing prices
Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl (6): id, price, year, odometer, lat, long
lgl (1): county
dttm (1): posting_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## 4. Probability Analysis (Regression)

summary(cars_data$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
501 6495 12450 16820 25000 99999
ggplot(cars_data, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Vehicle Prices", x = "Price ($)", y = "Frequency") +
  theme_minimal()

cars_data <- cars_data |>
  mutate(log_price = log(price + 1))

ggplot(cars_data, aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "darkgreen", color = "white") +
  labs(title = "Log-Transformed Price Distribution", x = "Log(Price)", y = "Frequency") +
  theme_minimal()

The regression model examines how vehicle age and usage affect resale prices in the used car market. The coefficient for `year` is expected to be positive, indicating that newer vehicles tend to have higher market values. The coefficient for `odometer` is expected to be negative, suggesting that heavily used vehicles lose value as mileage increases.

## 5. Theoretical Distribution Proposal

**The vehicle price data is heavily right-skewed, as most cars are affordable used models while a few luxury vehicles create a long tail. The log transformation makes the distribution approximately symmetric, which suggests a log-normal distribution for price.**
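
The log-normal proposal can be illustrated with simulated data: a log-normal variable has a mean above its median (as the price summary above does, with a mean of 16820 against a median of 12450), and its logarithm is roughly symmetric. The parameters below are hypothetical and were picked only to give values on a used-car-like scale; `skewness()` is a simple helper defined here, not a base R function.

```r
# Sketch: a log-normal variable is right-skewed, and its log is roughly symmetric.
set.seed(465)

# Hypothetical parameters, chosen only to give prices on a used-car-like scale
price <- rlnorm(10000, meanlog = 9.4, sdlog = 0.8)

skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3  # simple moment-based skewness

skewness(price)              # clearly positive: long right tail
skewness(log(price))         # near zero: approximately normal on the log scale
mean(price) > median(price)  # TRUE for right-skewed data like this
```

Applying the same `skewness()` check to `cars_data$price` and `cars_data$log_price` would be one way to back the claim with a number rather than a histogram alone.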
## B1. Dataset and Source

Source: Kaggle - Company Bankruptcy Prediction

Data Link: <https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction>

## B2. Economic Question

Can foundational financial ratios, such as net value per share, accurately classify and predict whether a firm will go bankrupt in the coming fiscal period?

## B3. Data Importing and Cleaning
::: {.cell}
```{.r .cell-code}
bankruptcy_raw <- read_csv("bankruptcy.csv", locale = locale(encoding = "UTF-8"))
```
:::
Rows: 6819 Columns: 96
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (96): Bankrupt?, ROA(C) before interest and depreciation before interest...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(mydata_class, aes(x = bankrupt, fill = bankrupt)) +
  geom_bar() +
  labs(title = "Class Distribution: Firm Health Status", x = "Status", y = "Count") +
  theme_minimal()
summary(mydata_class)
bankrupt net_value_per_share revenue_per_share
Healthy :6599 Min. :0.0000 Min. :0.000e+00
Bankrupt: 220 1st Qu.:0.1736 1st Qu.:0.000e+00
Median :0.1844 Median :0.000e+00
Mean :0.1907 Mean :1.329e+06
3rd Qu.:0.1996 3rd Qu.:0.000e+00
Max. :1.0000 Max. :3.020e+09
operating_profit_per_share
Min. :0.00000
1st Qu.:0.09608
Median :0.10423
Mean :0.10909
3rd Qu.:0.11616
Max. :1.00000
### Interpretation of Class Distribution

The dataset is highly imbalanced, with healthy firms representing the majority of observations. This is common in bankruptcy prediction studies because actual bankruptcies are relatively rare events in real-world financial markets.

From an economic perspective, this imbalance reflects the fact that most firms remain financially stable, while only a small proportion experience severe financial distress.

Such imbalance is important in predictive modeling because classification algorithms may become biased toward predicting the majority class.
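
To put a number on the imbalance, the short calculation below uses the class counts printed by `summary(mydata_class)` and works out what a trivial "always predict Healthy" rule would score on the full dataset. The variable names are illustrative.

```r
# Class counts taken from summary(mydata_class)
counts <- c(Healthy = 6599, Bankrupt = 220)

prop_bankrupt <- counts[["Bankrupt"]] / sum(counts)
baseline_acc  <- counts[["Healthy"]] / sum(counts)  # accuracy of always predicting "Healthy"

round(prop_bankrupt, 4)  # 0.0323
round(baseline_acc, 4)   # 0.9677
```

Because this do-nothing baseline already reaches about 97% accuracy, accuracy alone says little about a classifier here; recall on the bankrupt class is the more informative measure.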
## B5. Theoretical Distribution Proposal

**The target variable is binary (Bankrupt vs. Healthy). In this cross-sectional sample of firms, each firm's outcome can be treated as a Bernoulli trial, so the target follows a Bernoulli distribution. This allows us to model the probability of failure as a function of foundational corporate indicators.**

### Financial Ratio Interpretation

The selected financial indicators measure different dimensions of firm performance:

- Net value per share reflects shareholder equity strength.
- Revenue per share captures operational scale and market activity.
- Operating profit per share measures operational efficiency.

Economically, firms with weaker profitability and lower shareholder value are expected to face higher bankruptcy risk.

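library(tidymodels)  # provides initial_split(), training(), testing() via rsample

set.seed(465)  # fix the seed so the 80/20 split is reproducible
cars_split <- initial_split(cars_data, prop = 0.8)
cars_train <- training(cars_split)
cars_test <- testing(cars_split)
cat("Training set rows:", nrow(cars_train), "\n")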
Training set rows: 136351
cat("Test set rows: ", nrow(cars_test), "\n")
Test set rows: 34088
Two Regression Models
Model A:
model_A <- lm(log_price ~ year + odometer, data = cars_train)
summary(model_A)
Call:
lm(formula = log_price ~ year + odometer, data = cars_train)
Residuals:
Min 1Q Median 3Q Max
-3.5146 -0.5544 0.0251 0.6220 10.7066
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.663e+01 4.282e-01 -108.92 <2e-16 ***
year 2.792e-02 2.130e-04 131.07 <2e-16 ***
odometer -7.279e-07 1.109e-08 -65.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8257 on 136348 degrees of freedom
Multiple R-squared: 0.1515, Adjusted R-squared: 0.1515
F-statistic: 1.218e+04 on 2 and 136348 DF, p-value: < 2.2e-16
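
Because the outcome is on a log scale, the Model A coefficients read as approximate percentage effects. The calculation below converts the two printed estimates; the variable names are illustrative, and since the target is `log(price + 1)` rather than `log(price)`, the percentages are approximations that should be very close for prices above $500.

```r
# Coefficients copied from the Model A summary
b_year     <- 2.792e-02
b_odometer <- -7.279e-07

pct_per_year      <- 100 * (exp(b_year) - 1)              # one model year newer
pct_per_10k_miles <- 100 * (exp(b_odometer * 10000) - 1)  # 10,000 additional miles

round(pct_per_year, 2)       # 2.83
round(pct_per_10k_miles, 2)  # -0.73
```

In words: holding mileage fixed, each newer model year is associated with roughly a 2.8% higher asking price, and holding year fixed, each additional 10,000 miles with roughly a 0.7% lower one.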
Model B:
model_B <- lm(log_price ~ year + odometer + condition + cylinders, data = cars_train)
summary(model_B)

## Model B Evaluations

cars_test <- cars_test |>
  mutate(pred_B = predict(model_B, newdata = cars_test))
rmse_B <- cars_test |>
  summarise(rmse = sqrt(mean((log_price - pred_B)^2)))
rsq_B <- cars_test |>
  summarise(rsq = 1 - sum((log_price - pred_B)^2) / sum((log_price - mean(log_price))^2))
cat("Model B RMSE:", round(rmse_B$rmse, 4), "\n")
Model B RMSE: 0.7392
cat("Model B R2: ", round(rsq_B$rsq, 4), "\n")
Model B R2: 0.3237
Model B clearly outperforms Model A on both test-set metrics, with a lower RMSE (0.7392 vs. 0.8344) and a higher R² (0.3237 vs. 0.1383). Adding vehicle condition and engine size gives a much more accurate representation of market value.
Model 1: net_value_per_share only
Model 1 analyzes baseline bankruptcy risk using a single fundamental ratio: shareholder equity strength relative to outstanding stock.
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model1)
Call:
glm(formula = bankrupt ~ net_value_per_share, family = binomial,
data = cls_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.8669 0.8035 11.04 <2e-16 ***
net_value_per_share -70.2914 4.8342 -14.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1527.7 on 5454 degrees of freedom
Residual deviance: 1246.9 on 5453 degrees of freedom
AIC: 1250.9
Number of Fisher Scoring iterations: 7
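
The logistic coefficients are easier to read once converted to probabilities and odds ratios. This sketch uses only the two printed estimates and the median ratio from `summary(mydata_class)`. It assumes the model predicts the probability of the "Bankrupt" level, which is what `glm()` does when "Healthy" is the first factor level, as the summary ordering suggests.

```r
# Coefficients copied from summary(logistic_model1)
b0 <- 8.8669
b1 <- -70.2914

# Predicted bankruptcy probability at the median ratio (0.1844)
p_median <- plogis(b0 + b1 * 0.1844)

# Odds ratio for a 0.01 increase in net value per share
or_001 <- exp(b1 * 0.01)

round(p_median, 3)  # 0.016
round(or_001, 3)    # 0.495
```

Since the ratio is scaled to the interval from 0 to 1, a 0.01 step is a more meaningful unit than a full unit change: it roughly halves the odds of bankruptcy.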
Model 2: All three ratios
Model 2 expands the classification framework by adding an operating efficiency measure (operating profit per share) and a revenue scale indicator (revenue per share).
cat("Test set Accuracy: ", round(accuracy_2, 3), "\n")
Test set Accuracy: 0.969
cat("Test set Precision:", round(precision_2, 3), "\n")
Test set Precision: 1
cat("Test set Recall: ", round(recall_2, 3), "\n")
Test set Recall: 0.125
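
The three metrics are linked through the confusion matrix. The counts in this sketch are hypothetical: they were picked to reproduce the reported values under the assumption of a 1,364-row test set, and they are not the model's actual confusion matrix.

```r
# Hypothetical confusion-matrix counts, chosen to be consistent with the reported metrics
tp <- 6     # bankrupt firms correctly flagged
fn <- 42    # bankrupt firms missed
fp <- 0     # healthy firms wrongly flagged
tn <- 1316  # healthy firms correctly passed

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Under these assumed counts, a precision of 1 together with a recall of 0.125 means the model flags very few firms: every flag is correct, but seven out of eight bankruptcies go undetected.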
#### Conclusion

**Regression:** Model B is the better model. It achieves a lower RMSE and a higher R² on the test set, and cross-validation confirms that this performance is stable.

**Classification:** Model 2 is the better model. In an imbalanced dataset where about 97% of firms are healthy, recall is the key metric. Using all three financial ratios together identifies more bankrupt firms than net value per share alone, although a test recall of 0.125 means most bankrupt firms are still missed.

Prompts we used for AI integration (Claude and Gemini): "We are encountering errors, specifically in the cross-validation part. What should we do?" and "How do we perform 5-fold cross-validation using tidymodels for a logistic regression model predicting bankruptcy?"
Response: The AI walked through the vfold_cv(), fit_resamples(), and collect_metrics() workflow and showed how to use metric_set(accuracy, precision, recall) for classification problems.
Our reflections: The AI was helpful for understanding the workflow structure. However, we always cross-checked against the course handouts to ensure we used only functions taught in class. AI tools are most useful as a starting point, not a final answer; verification is essential.