Importing

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(boot)
## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:car':
## 
##     logit
options(scipen = 6)
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
#converting non-american stuff to american stuff
df <- df |> mutate(years_since = year(now()) - Year) |> mutate(PriceUSD = Price * 0.012) |> mutate(Miles = Kilometer * 0.621371) |> mutate(LengthInch = Length * 0.0393701) |> mutate(WidthInch = Width * 0.0393701) |> mutate(HeightInch = Height * 0.0393701) |> mutate(FuelGallons = Fuel.Tank.Capacity * 0.264172) |> mutate(Volume = LengthInch * WidthInch * HeightInch)
#Cleaning up Owner attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'
df$Engine <- sub(x = df$Engine, pattern = " cc", replacement = "")
df$Engine <- strtoi(df$Engine)

Making the linear model

To select the variables for my linear function, I will use step-wise selection on a subset of my dataset that only has numeric variables and without the Price, Kilometer, and Length/Width/Height variables, since those were already converted to miles, inches, and US Dollars. The Year variable was also converted to years_since so the number plays nicer with models. I excluded the Volume variable since it’s reliant on each of the converted length/width/height variables.

df_clean <- df[ , c('PriceUSD', 'years_since', 'Miles', 'Engine', 'FuelGallons', 'Seating.Capacity', 'LengthInch', 'WidthInch', 'HeightInch')]
df_clean <- na.omit(df_clean)
lm_b <- lm(PriceUSD ~ ., df_clean)
step(lm_b, direction = "backward")
## Start:  AIC=38189.53
## PriceUSD ~ years_since + Miles + Engine + FuelGallons + Seating.Capacity + 
##     LengthInch + WidthInch + HeightInch
## 
##                    Df    Sum of Sq          RSS   AIC
## - LengthInch        1    195448335 700944892757 38188
## <none>                             700749444422 38190
## - WidthInch         1   1407750727 702157195149 38191
## - HeightInch        1   2533447999 703282892421 38195
## - Miles             1   8628850916 709378295338 38211
## - FuelGallons       1  21644495176 722393939598 38246
## - Seating.Capacity  1  34471284918 735220729340 38281
## - years_since       1  97256226960 798005671382 38439
## - Engine            1 104453337193 805202781615 38457
## 
## Step:  AIC=38188.07
## PriceUSD ~ years_since + Miles + Engine + FuelGallons + Seating.Capacity + 
##     WidthInch + HeightInch
## 
##                    Df    Sum of Sq          RSS   AIC
## <none>                             700944892757 38188
## - WidthInch         1   1215859162 702160751919 38189
## - HeightInch        1   2471297540 703416190296 38193
## - Miles             1   8761275426 709706168183 38210
## - FuelGallons       1  23462516130 724407408887 38250
## - Seating.Capacity  1  42161459988 743106352745 38299
## - years_since       1  97803380673 798748273430 38439
## - Engine            1 118277988720 819222881477 38488
## 
## Call:
## lm(formula = PriceUSD ~ years_since + Miles + Engine + FuelGallons + 
##     Seating.Capacity + WidthInch + HeightInch, data = df_clean)
## 
## Coefficients:
##      (Intercept)       years_since             Miles            Engine  
##      24002.46665       -2442.55531          -0.06195          21.85850  
##      FuelGallons  Seating.Capacity         WidthInch        HeightInch  
##       1755.46972       -8195.64119         267.63460        -311.64465

Only LengthInch was excised, which is probably a good sign.

df_clean <- subset(df_clean, select = -LengthInch)
lm_clean <- lm(PriceUSD ~ ., df_clean)
summary(lm_clean)
## 
## Call:
## lm(formula = PriceUSD ~ ., data = df_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -97518  -8748  -2618   5309 300261 
## 
## Coefficients:
##                     Estimate  Std. Error t value Pr(>|t|)    
## (Intercept)      24002.46665 10290.78390   2.332  0.01978 *  
## years_since      -2442.55531   148.88238 -16.406  < 2e-16 ***
## Miles               -0.06195     0.01262  -4.910 9.86e-07 ***
## Engine              21.85850     1.21156  18.042  < 2e-16 ***
## FuelGallons       1755.46972   218.46500   8.035 1.61e-15 ***
## Seating.Capacity -8195.64119   760.85332 -10.772  < 2e-16 ***
## WidthInch          267.63460   146.31082   1.829  0.06752 .  
## HeightInch        -311.64465   119.50140  -2.608  0.00918 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19060 on 1929 degrees of freedom
## Multiple R-squared:  0.5631, Adjusted R-squared:  0.5615 
## F-statistic: 355.2 on 7 and 1929 DF,  p-value: < 2.2e-16
df_clean <- subset(df_clean, select = -c(WidthInch, HeightInch))
lm_clean <- lm(PriceUSD ~ ., df_clean)
summary(lm_clean)
## 
## Call:
## lm(formula = PriceUSD ~ ., data = df_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -96756  -8531  -2282   5274 299791 
## 
## Coefficients:
##                     Estimate  Std. Error t value    Pr(>|t|)    
## (Intercept)      27900.34691  3168.38783   8.806     < 2e-16 ***
## years_since      -2465.68447   140.98012 -17.490     < 2e-16 ***
## Miles               -0.06288     0.01262  -4.983 0.000000681 ***
## Engine              22.66866     1.16737  19.419     < 2e-16 ***
## FuelGallons       1852.46373   183.38678  10.101     < 2e-16 ***
## Seating.Capacity -9574.75278   574.35572 -16.670     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19100 on 1931 degrees of freedom
## Multiple R-squared:  0.5608, Adjusted R-squared:  0.5597 
## F-statistic: 493.1 on 5 and 1931 DF,  p-value: < 2.2e-16

I excised the WidthInch and HeightInch variables since the significance values looked dodgy. Even though the Multiple R-squared error was lowered, I kept it anyways for the sake of simplicity.

Bootstrapped Confidence Interval

To calculate the standard error and confidence interval, I will use bootstrapping.

bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample
    x_sample <- sample(x, size = length(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}

price_means <- bootstrap(df_clean$PriceUSD)
price_se_boot <- sd(price_means)
print(paste0("bootstrapped standard error: ",price_se_boot))
## [1] "bootstrapped standard error: 656.554499835394"

Our bootstrapped standard error is about 652 US dollars, which isn’t too bad, since we’re dealing with products in the tens of thousands of dollars range. Next, I’ll use another bootstrapping function to calculate a bootstrapped confidence interval for the mean and median price in the clean dataset.

#library(boot)
boot_ci <- function (v, func = median, conf = 0.80, n_iter = 1000) {
  # the `boot` library needs the function in this format
  boot_func <- \(x, i) func(x[i])
  
  b <- boot(v, boot_func, R = n_iter)
  
  boot.ci(b, conf = conf, type = "perc")
}
boot_ci(df_clean$PriceUSD, mean, 0.80)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 80%   (19306, 21050 )  
## Calculations and Intervals on Original Scale
boot_ci(df_clean$PriceUSD, median, 0.80)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 80%   ( 9420, 10200 )  
## Calculations and Intervals on Original Scale

The lower value for the median’s confidence interval compared to the mean’s confidence interval suggests a right-skewed distribution for the PriceUSD variable.

Diagnosis

gg_resfitted(lm_clean) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

There’s an odd upward curve, though there seems to be a minimum at around 25,000 fitted values, which is just below the 0 residuals line. the region between approximately 0 and 50,000 fitted values has negative residuals (underestimation), while the area outside of that range has positive residuals (overestimation). The “fanning” from left to right means there’s a higher variance in errors, though the “sweet spot” looks like it’s around the 50,000 region since it’s right on the residuals line.

plots <- gg_resX(lm_clean, plot.all = FALSE)

plots$years_since

plots$Miles

plots$Engine

plots$FuelGallons

plots$Seating.Capacity

All the plots look kind of funky since there are a few extreme values for every attribute (particularly the plot for Residuals vs. Miles), but all of them have a “fanning” that seems to violate the assumption of constant variance.

gg_reshist(lm_clean)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There’s a very strong left skew and a noticeable extreme with a lot of residuals. A linear model might not be enough to handle this, and we’ll need a model that’s more amenable to extreme values and outliers.

gg_qqplot(lm_clean)

The QQ plot is known for being highly sensitive, but together with the histogram, this indicates a high level of deviance from a normal distribution. Normalization of the data may be required.

gg_cooksd(lm_clean, threshold = 'matlab')

According to this Cook’s D analysis plot, There area LOT of outliers in this dataset. It would be impractical to analyze each and every one of these outliers, by plotting them along with non-outliers against price, since they’re too crowded to tell apart, and it would be time-consuming to input the numbers anyways.