project_econ465

Authors

İsmet Erdal Tunç

Ozan Tekin

Stage 1: Data Preparation and Exploratory Analysis

Regression

Used Car Price Prediction

This project investigates the factors that predict used car prices.

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.3     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Dataset Description and Source


Dataset Source:
https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho

Economic Question

What factors predict used car prices?

Economic Logic for Predictor Selection

Several variables in the dataset are expected to influence used car prices. The variable year is likely to have a strong positive effect because newer cars usually experience less depreciation. In contrast, km is expected to negatively affect selling price because heavily used cars are generally less valuable.

Fuel type, transmission, seller type, and ownership status may also influence prices through consumer preferences, operating costs, and perceived vehicle quality. There may also be some correlation between year and km driven because older cars are likely to have higher mileage.

The variables name and present price were excluded from the analysis. Name was removed because it mainly acts as an identifier, while present price was removed to simplify the model and avoid relying too heavily on a variable that is directly related to selling price.

Data Import

cars <- read_csv2("car_data.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 301 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (6): name, selling_price, present_price, fuel, seller, transmission
dbl (3): year, km, owner

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cars <- cars %>%
  mutate(selling_price = as.numeric(selling_price))

Dataset Overview

glimpse(cars)
Rows: 301
Columns: 9
$ name          <chr> "ritz", "sx4", "ciaz", "wagon r", "swift", "vitara brezz…
$ year          <dbl> 2014, 2013, 2017, 2011, 2014, 2018, 2015, 2015, 2016, 20…
$ selling_price <dbl> 3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 6.50, 8.75, 7.…
$ present_price <chr> "5.59", "9.54", "9.85", "4.15", "6.87", "9.83", "8.12", …
$ km            <dbl> 27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 202…
$ fuel          <chr> "Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Diese…
$ seller        <chr> "Dealer", "Dealer", "Dealer", "Dealer", "Dealer", "Deale…
$ transmission  <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua…
$ owner         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
summary(cars)
     name                year      selling_price    present_price     
 Length:301         Min.   :2003   Min.   : 0.100   Length:301        
 Class :character   1st Qu.:2012   1st Qu.: 0.900   Class :character  
 Mode  :character   Median :2014   Median : 3.600   Mode  :character  
                    Mean   :2014   Mean   : 4.661                     
                    3rd Qu.:2016   3rd Qu.: 6.000                     
                    Max.   :2018   Max.   :35.000                     
       km             fuel              seller          transmission      
 Min.   :   500   Length:301         Length:301         Length:301        
 1st Qu.: 15000   Class :character   Class :character   Class :character  
 Median : 32000   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 36947                                                           
 3rd Qu.: 48767                                                           
 Max.   :500000                                                           
     owner        
 Min.   :0.00000  
 1st Qu.:0.00000  
 Median :0.00000  
 Mean   :0.04319  
 3rd Qu.:0.00000  
 Max.   :3.00000  

Data Cleaning

cars <- cars %>%
  drop_na()
cars <- cars %>%
  select(-name, -present_price)

Selling Price Summary Statistics

cars %>%
  summarise(
    mean = mean(selling_price),
    median = median(selling_price),
    sd = sd(selling_price),
    q1 = quantile(selling_price, 0.25),
    q3 = quantile(selling_price, 0.75),
    min = min(selling_price),
    max = max(selling_price)
  )
# A tibble: 1 × 7
   mean median    sd    q1    q3   min   max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  4.66    3.6  5.08   0.9     6   0.1    35

Distribution of Selling Price

ggplot(cars, aes(x = selling_price)) +
  geom_histogram(bins = 30) +
  theme_minimal() +
  labs(
    title = "Distribution of Selling Price",
    x = "Selling Price",
    y = "Count"
  )

The distribution of selling prices appears to be right-skewed. Most cars are concentrated at lower price levels, while a smaller number of cars have relatively high selling prices.

Distribution of Log Selling Price

cars <- cars %>%
  mutate(log_selling_price = log(selling_price))

ggplot(cars, aes(x = log_selling_price)) +
  geom_histogram(bins = 30) +
  theme_minimal() +
  labs(
    title = "Distribution of Log Selling Price",
    x = "Log Selling Price",
    y = "Count"
  )

After the log transformation, the distribution becomes more symmetric and closer to a normal distribution. This suggests that the original selling price variable may follow a log-normal distribution.

Classification

Bank Term Deposit Subscription Prediction

Dataset Description and Source

This project investigates whether a client subscribes to a bank term deposit following a telemarketing call. The outcome variable is binary, indicating whether the client subscribed (“yes”) or not (“no”).

The dataset includes demographic, financial, and campaign-related characteristics that may help predict subscription decisions.

Source: Bank Marketing Dataset, originally collected by a Portuguese retail bank (May 2008 – November 2010). Mirror of the UCI Machine Learning Repository version, downloaded from Kaggle.

https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset

Economic Question

Can client demographic, financial, and prior-campaign characteristics predict whether a client will subscribe to a bank term deposit?

Economic Logic for Predictor Selection

This analysis focuses on variables that are expected to have stronger predictive power for term deposit subscription decisions. Financial variables such as balance, housing loan status, and personal loan status may reflect the client’s financial condition and saving behavior.

Campaign-related variables such as duration, previous, and poutcome are also expected to be important because they may indicate client interest and responsiveness. In particular, campaign duration is expected to be one of the strongest predictors.

Some variables, such as contact type, day, and month, were excluded to simplify the analysis and reduce model complexity.

Data Import

bank <- read_csv2("bank.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 11162 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (10): job, marital, education, default, housing, loan, contact, month, p...
dbl  (7): age, balance, day, duration, campaign, pdays, previous

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bank)
Rows: 11,162
Columns: 17
$ age       <dbl> 59, 56, 41, 55, 54, 42, 56, 60, 37, 28, 38, 30, 29, 46, 31, …
$ job       <chr> "admin.", "admin.", "technician", "services", "admin.", "man…
$ marital   <chr> "married", "married", "married", "married", "married", "sing…
$ education <chr> "secondary", "secondary", "secondary", "secondary", "tertiar…
$ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ balance   <dbl> 2343, 45, 1270, 2476, 184, 0, 830, 545, 1, 5090, 100, 309, 1…
$ housing   <chr> "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes",…
$ loan      <chr> "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"…
$ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ day       <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, …
$ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
$ duration  <dbl> 1042, 1467, 1389, 579, 673, 562, 1201, 1030, 608, 1297, 786,…
$ campaign  <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 2, 4, 2, 2, 1, 3, 1, 2, 1, …
$ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
$ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ deposit   <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…

Data Cleaning

bank <- bank |>
  drop_na() |>
  mutate(
    deposit = factor(deposit, levels = c("no", "yes")),
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    default = factor(default, levels = c("no", "yes")),
    housing = factor(housing, levels = c("no", "yes")),
    loan = factor(loan, levels = c("no", "yes")),
    contact = factor(contact),
    month = factor(month),
    poutcome = factor(poutcome)
  )

Summary Statistics by Deposit Status

bank |>
  group_by(deposit) |>
  summarize(
    avg_age = mean(age),
    avg_balance = mean(balance),
    avg_duration = mean(duration),
    avg_campaign = mean(campaign)
  )
# A tibble: 2 × 5
  deposit avg_age avg_balance avg_duration avg_campaign
  <fct>     <dbl>       <dbl>        <dbl>        <dbl>
1 no         40.8       1280.         223.         2.84
2 yes        41.7       1804.         537.         2.14

Clients who subscribed to the term deposit tend to have higher average balances and much longer call durations. This suggests that financial capacity and customer engagement may play an important role in subscription decisions.

Distribution of Deposit Subscription

ggplot(bank, aes(x = deposit)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Distribution of Deposit Subscription",
    x = "Deposit Subscription",
    y = "Count"
  )

The outcome variable is binary, with clients either subscribing (“yes”) or not subscribing (“no”) to the term deposit. This type of outcome can be modeled using a Bernoulli distribution.

The distribution appears relatively balanced, although there are slightly more non-subscribers than subscribers in the dataset.

Stage 2: Predictive Modeling

Regression Model

Data Splitting

set.seed(465)

cars_split <- initial_split(cars, prop = 0.8)

cars_train <- training(cars_split)
cars_test <- testing(cars_split)

nrow(cars_train)
[1] 240
nrow(cars_test)
[1] 61

The dataset was split into training (80%) and test (20%) sets using `initial_split()`. The training set contains 240 observations, while the test set contains 61 observations.

Model 1

Model 1 uses year and fuel type as predictors of selling price. Year is included because newer cars are generally expected to have higher values, while fuel type may influence prices through operating costs and consumer preferences.

model1 <- linear_reg() |>
  set_engine("lm")

model1_fit <- model1 |>
  fit(
    selling_price ~ year + fuel,
    data = cars_train
  )

model1_predictions <- predict(model1_fit, cars_test) |>
  bind_cols(cars_test)

metrics(
  model1_predictions,
  truth = selling_price,
  estimate = .pred
)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       3.13 
2 rsq     standard       0.321
3 mae     standard       2.59 

Model 1 explains approximately 32% of the variation in selling prices based on the test set R² value. The RMSE value suggests that prediction errors are still relatively large, indicating that year and fuel type alone may not fully capture the factors affecting used car prices.

Model 2

Model 2 uses km, fuel type, transmission, seller, and ownership status as predictors of selling price. This model includes more vehicle characteristics in order to capture additional factors that may influence used car prices.

model2 <- linear_reg() |>
  set_engine("lm")

model2_fit <- model2 |>
  fit(
    selling_price ~ km + fuel + transmission + seller + owner,
    data = cars_train
  )

model2_predictions <- predict(model2_fit, cars_test) |>
  bind_cols(cars_test)

metrics(
  model2_predictions,
  truth = selling_price,
  estimate = .pred
)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       2.84 
2 rsq     standard       0.449
3 mae     standard       2.07 

Model 2 performs better than Model 1 based on both RMSE and R² values. The lower RMSE indicates smaller prediction errors, while the higher R² suggests that the model explains a larger share of the variation in selling prices.

This improvement suggests that variables such as mileage, transmission, seller type, and ownership status provide additional predictive information beyond year and fuel type alone.

Model Comparison

comparison_table <- tibble(
  Model = c("Model 1", "Model 2"),
  RMSE = c(3.1315285, 2.8380812),
  R_Squared = c(0.3207824, 0.4491195)
)

comparison_table
# A tibble: 2 × 3
  Model    RMSE R_Squared
  <chr>   <dbl>     <dbl>
1 Model 1  3.13     0.321
2 Model 2  2.84     0.449

Model 2 performs better than Model 1 because it has a lower RMSE and a higher R² value. This indicates that Model 2 produces more accurate predictions and explains a larger portion of the variation in selling prices.

Although Model 2 is more complex, the additional variables improve predictive performance and provide more information about factors affecting used car prices.

Cross-Validation

set.seed(465)

cars_folds <- vfold_cv(cars_train, v = 5)

cv_results <- fit_resamples(
  model2,
  selling_price ~ km + fuel + transmission + seller + owner,
  resamples = cars_folds,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(cv_results)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   3.57      5  0.516  pre0_mod0_post0
2 rsq     standard   0.568     5  0.0621 pre0_mod0_post0

The 5-fold cross-validation results show an average RMSE of approximately 3.57 and an average R² of 0.57 across the folds. Compared to the test set performance, the model shows relatively similar predictive performance, suggesting that the model is reasonably stable.

The slightly higher cross-validated RMSE may indicate some variation across folds, but there is no strong evidence of severe overfitting. Overall, the model appears to generalize reasonably well to unseen data.

AI Interaction Log

During Stage 2, I used ChatGPT to better understand how to perform 5-fold cross-validation using tidymodels in R. The AI suggested using `vfold_cv()` and `fit_resamples()` to evaluate model performance across folds using RMSE and R² metrics.

I used the overall structure of the suggested code but modified the predictors and model specification for my own dataset. I also verified the results by comparing the cross-validation performance with the test set performance.

This interaction was helpful for understanding model evaluation, cross-validation, and overfitting in predictive modeling.

Regression Conclusion

For the regression dataset, Model 2 performed better than Model 1 based on both RMSE and R² values. Including additional vehicle characteristics improved predictive performance and helped explain a larger share of the variation in used car prices.

The cross-validation results also suggested that the model was relatively stable and did not show strong signs of overfitting.

Classification Model

Data Splitting

set.seed(465)

bank_split <- initial_split(bank, prop = 0.8)

bank_train <- training(bank_split)
bank_test <- testing(bank_split)

nrow(bank_train)
[1] 8929
nrow(bank_test)
[1] 2233

The dataset was split into training (80%) and test (20%) sets using `initial_split()`. The training set contains 8,929 observations, while the test set contains 2,233 observations.

Model 1

Model 1 uses balance, housing loan status, and personal loan status as predictors of term deposit subscription. These variables may reflect the client’s financial condition and saving behavior.

model1_class <- logistic_reg() |>
  set_engine("glm")

model1_class_fit <- model1_class |>
  fit(
    deposit ~ balance + housing + loan,
    data = bank_train
  )

model1_class_predictions <- predict(
  model1_class_fit,
  bank_test,
  type = "class"
) |>
  bind_cols(bank_test)

metric_set(accuracy, precision, recall)(
  model1_class_predictions,
  truth = deposit,
  estimate = .pred_class
)
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.603
2 precision binary         0.627
3 recall    binary         0.626

Model 1 achieves moderate predictive performance. The accuracy, precision, and recall values suggest that financial variables alone may not fully explain subscription behavior.

Model 2

Model 2 includes campaign-related variables such as duration, previous campaign outcomes, and previous contacts in addition to financial variables. These variables are expected to improve predictive performance because they may better capture client interest and responsiveness.

model2_class <- logistic_reg() |>
  set_engine("glm")

model2_class_fit <- model2_class |>
  fit(
    deposit ~ balance + housing + loan + duration + previous + poutcome,
    data = bank_train
  )

model2_class_predictions <- predict(
  model2_class_fit,
  bank_test,
  type = "class"
) |>
  bind_cols(bank_test)

metric_set(accuracy, precision, recall)(
  model2_class_predictions,
  truth = deposit,
  estimate = .pred_class
)
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.789
2 precision binary         0.770
3 recall    binary         0.861

Model 2 performs substantially better than Model 1. Higher accuracy, precision, and recall values suggest that campaign-related variables such as duration, previous contacts, and previous campaign outcomes provide important predictive information about deposit subscription decisions.

In particular, the high recall value indicates that the model is successful at identifying clients who are likely to subscribe to the term deposit.

Model Comparison

comparison_table_class <- tibble(
  Model = c("Model 1", "Model 2"),
  Accuracy = c(0.6032244, 0.7890730),
  Precision = c(0.6270042, 0.7695783),
  Recall = c(0.6259478, 0.8609941)
)

comparison_table_class
# A tibble: 2 × 4
  Model   Accuracy Precision Recall
  <chr>      <dbl>     <dbl>  <dbl>
1 Model 1    0.603     0.627  0.626
2 Model 2    0.789     0.770  0.861

Model 2 performs better overall because it achieves higher accuracy, precision, and recall values, indicating stronger predictive performance.

Although Model 2 is more complex, the additional campaign-related variables substantially improve the model’s ability to predict deposit subscription decisions.

Cross Validation

set.seed(465)

bank_folds <- vfold_cv(bank_train, v = 5)

cv_results_class <- fit_resamples(
  model2_class,
  deposit ~ balance + housing + loan + duration + previous + poutcome,
  resamples = bank_folds,
  metrics = metric_set(accuracy)
)

collect_metrics(cv_results_class)
# A tibble: 1 × 6
  .metric  .estimator  mean     n std_err .config        
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy binary     0.786     5 0.00318 pre0_mod0_post0

The 5-fold cross-validation results show an average accuracy of approximately 79%, which is very close to the test set accuracy. This suggests that the model is relatively stable and generalizes well to unseen data.

Because the cross-validation performance is similar to the test set performance, there is no strong evidence of severe overfitting.

AI Interaction Log

During Stage 2, I used ChatGPT to better understand how to implement logistic regression and cross-validation using tidymodels in R. The AI suggested using functions such as `logistic_reg()`, `vfold_cv()`, and `fit_resamples()` for model evaluation.

I modified the suggested code to fit my own dataset, selected predictors, and evaluation approach. I also verified the outputs by comparing cross-validation performance with test set performance.

This interaction was helpful for understanding classification metrics, model stability, and predictive modeling in R.

Classification Conclusion

For the classification dataset, Model 2 achieved substantially higher accuracy and Kappa values than Model 1. Campaign-related variables such as duration and previous campaign outcomes provided important predictive information about deposit subscription decisions.

The cross-validation results were very similar to the test set performance, suggesting that the model generalizes well and is relatively stable.