Importing Stuff

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# convert price to USD and metric measurements to US customary units
df <- df |>
  mutate(
    years_since = year(now()) - Year,
    PriceUSD    = Price * 0.012,
    Miles       = Kilometer * 0.621371,           # km to miles
    LengthInch  = Length * 0.0393701,             # mm to inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,  # liters to gallons
    Volume      = LengthInch * WidthInch * HeightInch  # cubic inches
  )
#Cleaning up Owner attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'

Making the New Model

My simple linear regression had a single independent variable, Volume. I’ll introduce the Transmission variable since a car’s transmission seems to have a major effect on its price. The fuel capacity (FuelGallons) looks important too, but I will need to create a dataframe with the rows that have NA in that attribute removed.

model1 <- lm(PriceUSD ~ Volume, df)
#df$IsManual <- ifelse(df$Transmission == "Manual", 1, 0)
model2 <- lm(PriceUSD ~ Volume + Transmission, df)
df_gallon <- df |> drop_na(FuelGallons)
model3 <- lm(PriceUSD ~ Volume + Transmission + FuelGallons, df_gallon)
summary(model1)
## 
## Call:
## lm(formula = PriceUSD ~ Volume, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48134  -9487  -3024   2770 407402 
## 
## Coefficients:
##                  Estimate    Std. Error t value Pr(>|t|)    
## (Intercept) -51414.585827   2707.632375  -18.99   <2e-16 ***
## Volume           0.096984      0.003568   27.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25020 on 1993 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.2705, Adjusted R-squared:  0.2701 
## F-statistic:   739 on 1 and 1993 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = PriceUSD ~ Volume + Transmission, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -40039 -12002   -578   4786 395162 
## 
## Coefficients:
##                         Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)        -22618.347181   3149.050166  -7.183 9.62e-13 ***
## Volume                  0.071900      0.003729  19.279  < 2e-16 ***
## TransmissionManual -18428.958889   1177.428195 -15.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23620 on 1992 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.3504, Adjusted R-squared:  0.3497 
## F-statistic: 537.2 on 2 and 1992 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## lm(formula = PriceUSD ~ Volume + Transmission + FuelGallons, 
##     data = df_gallon)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47222  -9332   -523   5459 376260 
## 
## Coefficients:
##                         Estimate    Std. Error t value Pr(>|t|)    
## (Intercept)        -23482.100050   3028.318932  -7.754 1.42e-14 ***
## Volume                  0.023226      0.006302   3.686 0.000234 ***
## TransmissionManual -14460.562789   1191.557225 -12.136  < 2e-16 ***
## FuelGallons          2518.411926    259.149534   9.718  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22430 on 1942 degrees of freedom
## Multiple R-squared:  0.3914, Adjusted R-squared:  0.3905 
## F-statistic: 416.3 on 3 and 1942 DF,  p-value: < 2.2e-16

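P-values are well below 0.001 for every coefficient in the three-predictor model, so each variable appears to carry real signal, and the adjusted R-squared climbs from 0.2701 to 0.3497 to 0.3905 as variables are added. One caveat: model3 is fit on 1,946 rows rather than the 1,995 used by the other two, so the comparison isn’t perfectly like-for-like. I will proceed with the three-predictor model.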
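Since the three models were not all fit on the same rows, here is a minimal sketch of how the comparison could be made on a common set of rows. It uses the built-in mtcars data as a stand-in for my car listings (an assumption, along with the fake missing values and the names cars, complete, m1, m2, m3), so the numbers mean nothing for this dataset; only the pattern carries over.

```r
# Stand-in data: mtcars replaces the car-listing data frame (assumption).
cars <- mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$qsec[c(3, 10, 17)] <- NA  # fake some missing values in the third predictor

# Keep only rows that are complete for every variable in the largest model
complete <- cars[complete.cases(cars[, c("mpg", "wt", "am", "qsec")]), ]

m1 <- lm(mpg ~ wt, data = complete)
m2 <- lm(mpg ~ wt + am, data = complete)
m3 <- lm(mpg ~ wt + am + qsec, data = complete)

# Adjusted R-squared for each model, all computed on the same rows
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 4))

# F-tests for the nested sequence; only valid because the rows are identical
print(anova(m1, m2, m3))
```

For my data, the equivalent would be fitting all three models on df_gallon instead of fitting the first two on df.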

Residuals vs. Fitted Values

gg_resfitted(model3) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

I’m noticing a bit of “fanning”: the residuals spread out more at higher fitted values, though it’s not too bad. Fanning on its own doesn’t push the predictions in one direction. It means the price predictions are less precise at the higher end of fitted values, and the standard errors and p-values from the summary are less trustworthy than they look.
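One rough way to put numbers on the fanning is to split the residuals at the median fitted value and compare their spread and their mean in each half. The sketch below runs on simulated data where the noise grows with x (an assumption made purely so the example is self-contained; x, y, and fit are made-up names), not on my model.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)  # noise grows with x, so residuals fan out
fit <- lm(y ~ x)

res <- resid(fit)
fv  <- fitted(fit)
upper <- fv > median(fv)  # TRUE for the half with the larger fitted values

# Spread of the residuals in each half: a bigger SD in the upper half means fanning
sd_lower <- sd(res[!upper])
sd_upper <- sd(res[upper])

# Mean residual in each half: values far from zero would point to systematic
# over- or under-prediction, which is a separate issue from fanning
mean_lower <- mean(res[!upper])
mean_upper <- mean(res[upper])

print(c(sd_lower = sd_lower, sd_upper = sd_upper))
print(c(mean_lower = mean_lower, mean_upper = mean_upper))
```

Swapping fit for model3 would give the same two comparisons for the real data.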

Residuals vs. X Values

plots <- gg_resX(model3, plot.all = FALSE)

# for each variable of interest ...
plots$Volume

plots$Transmission

plots$FuelGallons

I’m noticing larger residuals at higher values of FuelGallons and Volume, as well as for Automatic transmission. This violates the constant-variance assumption. The data itself has a large number of points with extreme values, so it’ll be difficult to create a model that accommodates them.

Residual Histogram

gg_reshist(model3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is an extremely long tail to the right of the distribution’s curve. From the graph’s appearance, a handful of rows have extremely large positive residuals.
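To back up what the histogram shows, the rows in each tail can be counted with standardized residuals. This is a sketch on simulated right-skewed data (an assumption, with made-up names x, y, fit), and the cutoff of 3 is just a common rule of thumb rather than anything special.

```r
set.seed(2)
n <- 1000
x <- runif(n, 0, 10)
y <- 5 + 2 * x + rexp(n, rate = 0.2)  # right-skewed noise, like a long right tail
fit <- lm(y ~ x)

# Standardized residuals put every residual on roughly the same unit scale
std_res <- rstandard(fit)

n_high <- sum(std_res > 3)   # far out in the right tail
n_low  <- sum(std_res < -3)  # far out in the left tail

print(c(n_high = n_high, n_low = n_low))
```

A lopsided count (many high, few or no low) is the numeric version of the long right tail in the histogram.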

QQ-Plots

gg_qqplot(model3)

Normally QQ-plots are very sensitive and prone to flagging very slight deviations that don’t matter for the overall model, but the other diagnostics look pretty wonky, so I’m comfortable concluding that the residuals deviate substantially from normality.

Cook’s Distance by Observation

gg_cooksd(model3, threshold = 'matlab')

This graph indicates that a very large number of rows have an outsized influence on the overall model. Normally after this, I would investigate these rows by plotting various explanatory variables against the response variable and coloring the points that correspond to them, but with this many flagged rows that would be impractical: typing them all in by hand would take forever, and so many points would end up colored that no pattern would stand out.
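The flagged rows don’t actually have to be typed in by hand, since they can be pulled straight from the Cook’s distances. Below is a minimal sketch using base R on the built-in mtcars data as a stand-in (an assumption). The cutoff of three times the mean Cook’s distance is my understanding of what the 'matlab' threshold in lindia does, so treat that as an assumption too.

```r
# Stand-in data and model (assumption): mtcars instead of the car listings
cars <- mtcars
fit <- lm(mpg ~ wt + hp, data = cars)

cd <- cooks.distance(fit)  # one value per row used in the fit
cutoff <- 3 * mean(cd)     # assumed equivalent of threshold = 'matlab'

# Logical flag column, so the rows can be colored without listing them by hand
cars$influential <- cd > cutoff
flagged <- rownames(cars)[cars$influential]
print(flagged)

# Color the flagged points in a plot of one predictor against the response
plot(cars$wt, cars$mpg,
     col = ifelse(cars$influential, "red", "grey40"), pch = 19,
     xlab = "wt", ylab = "mpg")
```

This works by position because mtcars has no missing values. If rows were dropped during fitting, the Cook’s distances would need to be matched to the data by names(cd) instead.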

Conclusion

Overall, the model with three explanatory variables seems to be the best of the three, even though its residuals are pretty far from normal. Then again, my dataset itself has several data points with extreme values that can make creating models difficult. If I ever learn how to correct for these outliers, I will try my best to implement the solution.