library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# convert non-US units and currency to US units
df <- df |>
  mutate(
    years_since = year(now()) - Year,
    PriceUSD    = Price * 0.012,                 # fixed conversion rate to USD
    Miles       = Kilometer * 0.621371,          # km -> miles
    LengthInch  = Length * 0.0393701,            # mm -> inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172, # litres -> US gallons
    Volume      = LengthInch * WidthInch * HeightInch  # cubic inches
  )
#Cleaning up Owner attribute
df$Owner[df$Owner == "Fourth"] <- "4 or More"
My simple linear regression had a single independent variable, Volume. I’ll add the Transmission variable, since a car’s transmission seems to have a major effect on its price. The fuel capacity (FuelGallons) looks important too, but it contains NA entries, so I will need to create a dataframe with those rows removed before fitting a model that uses it.
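Since Transmission is a categorical variable, it may help to see how `lm()` turns it into a number. The sketch below uses a small made-up data frame (the values are hypothetical, not rows from the car dataset) and `model.matrix()` to show the 0/1 indicator column that R builds automatically, which is the same thing the commented-out `IsManual` line would have done by hand.

```r
# Toy data frame with hypothetical values, only to show the encoding
toy <- data.frame(
  PriceUSD     = c(12000, 8000, 15000, 7000),
  Volume       = c(700000, 650000, 800000, 600000),
  Transmission = c("Automatic", "Manual", "Automatic", "Manual")
)

# model.matrix() returns the design matrix lm() would use for this formula.
# The first level alphabetically ("Automatic") becomes the baseline, and
# "Manual" gets its own 0/1 indicator column.
X <- model.matrix(PriceUSD ~ Volume + Transmission, data = toy)
print(colnames(X))
print(X[, "TransmissionManual"])
```

The coefficient on `TransmissionManual` is then read as the expected price difference between a manual and an automatic car of the same Volume.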
model1 <- lm(PriceUSD ~ Volume, df)
#df$IsManual <- ifelse(df$Transmission == "Manual", 1, 0)
model2 <- lm(PriceUSD ~ Volume + Transmission, df)
df_gallon <- df |> drop_na(FuelGallons)
model3 <- lm(PriceUSD ~ Volume + Transmission + FuelGallons, df_gallon)
summary(model1)
##
## Call:
## lm(formula = PriceUSD ~ Volume, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48134 -9487 -3024 2770 407402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -51414.585827 2707.632375 -18.99 <2e-16 ***
## Volume 0.096984 0.003568 27.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25020 on 1993 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.2705, Adjusted R-squared: 0.2701
## F-statistic: 739 on 1 and 1993 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = PriceUSD ~ Volume + Transmission, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40039 -12002 -578 4786 395162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22618.347181 3149.050166 -7.183 9.62e-13 ***
## Volume 0.071900 0.003729 19.279 < 2e-16 ***
## TransmissionManual -18428.958889 1177.428195 -15.652 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23620 on 1992 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.3504, Adjusted R-squared: 0.3497
## F-statistic: 537.2 on 2 and 1992 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## lm(formula = PriceUSD ~ Volume + Transmission + FuelGallons,
## data = df_gallon)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47222 -9332 -523 5459 376260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23482.100050 3028.318932 -7.754 1.42e-14 ***
## Volume 0.023226 0.006302 3.686 0.000234 ***
## TransmissionManual -14460.562789 1191.557225 -12.136 < 2e-16 ***
## FuelGallons 2518.411926 259.149534 9.718 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22430 on 1942 degrees of freedom
## Multiple R-squared: 0.3914, Adjusted R-squared: 0.3905
## F-statistic: 416.3 on 3 and 1942 DF, p-value: < 2.2e-16
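The p-values are below 0.001 for every coefficient in the model with three explanatory variables, so each variable appears to carry real signal, and the adjusted R-squared rises from 0.2701 to 0.3497 to 0.3905 as variables are added. Low p-values alone don’t guarantee good predictions, and the third model was fit on 1,946 rows rather than the 1,995 used by the first two, so the comparison is not exactly like-for-like. Still, it is the strongest of the three so far, so I will proceed with that model.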
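One way to put the three models on equal footing is to fit all of them on the same complete rows and then compare them with a nested F-test and adjusted R-squared. The sketch below does this on simulated data; the data frame `sim`, its coefficients, and the amount of missingness are all made up for illustration and are not taken from the car dataset.

```r
set.seed(1)
n <- 300

# Simulated stand-in for the car data (all values hypothetical)
sim <- data.frame(
  Volume       = runif(n, 500000, 1000000),
  Transmission = sample(c("Automatic", "Manual"), n, replace = TRUE),
  FuelGallons  = runif(n, 8, 25)
)
sim$PriceUSD <- -20000 + 0.03 * sim$Volume -
  10000 * (sim$Transmission == "Manual") +
  2000 * sim$FuelGallons + rnorm(n, sd = 8000)

# Blank out some fuel capacities to mimic the NA entries
sim$FuelGallons[sample(n, 20)] <- NA

# Keep only rows that are complete for every variable used by any model
complete <- sim[complete.cases(sim), ]

m1 <- lm(PriceUSD ~ Volume, data = complete)
m2 <- lm(PriceUSD ~ Volume + Transmission, data = complete)
m3 <- lm(PriceUSD ~ Volume + Transmission + FuelGallons, data = complete)

# Nested F-tests: does each added variable reduce the residual sum of squares?
comparison <- anova(m1, m2, m3)
print(comparison)

# Adjusted R-squared for each model, now computed on identical rows
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(adj_r2)
```

`anova()` refuses to compare models fit to different numbers of rows, which is why the complete-case filter comes first.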
gg_resfitted(model3) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Noticing a bit of “fanning”, with the residuals spreading out towards higher fitted values, but not too bad. Fanning by itself doesn’t push the predictions up or down; it means the standard errors and prediction intervals are less trustworthy, and that predictions at the higher end of fitted values are less precise than those at the lower end.
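The fan shape can also be checked with a number rather than by eye. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R on simulated data where the error spread grows with the predictor; the variables `x` and `y` are made up for illustration. The `lmtest` package offers `bptest()` for the same purpose, but it is not used here so that the example needs nothing beyond base R.

```r
set.seed(42)
n <- 500
x <- runif(n, 1, 10)

# Error standard deviation grows with x, which produces a fan shape
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan test: regress the squared residuals on the
# predictors. Under constant variance, n * R^2 of this auxiliary regression
# is approximately chi-squared with df = number of predictors (here 1).
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value is evidence against constant variance. For a model with several predictors, the auxiliary regression would include all of them and `df` would equal their count.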
plots <- gg_resX(model3, plot.all = FALSE)
# for each variable of interest ...
plots$Volume
plots$Transmission
plots$FuelGallons
I’m noticing larger residuals at higher values of FuelGallons and Volume, as well as for Automatic transmission. This doesn’t match up with the constant variance assumption. The data itself has a large number of points with extreme values, so it’ll be difficult to create a model that accommodates them.
gg_reshist(model3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is an extremely long tail to the right of the distribution’s curve. From the graph’s appearance, there are a handful of rows that have extremely large positive residuals.
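A long right tail like this is often what you get when the response is a price, since prices tend to vary multiplicatively. One common thing to try, offered here only as a possibility and not something fit on the car data, is modelling the log of the response. The sketch below uses simulated data (`x` and `price` are made up) and compares the skewness of the residuals before and after the transformation.

```r
set.seed(7)
n <- 500
x <- runif(n, 0, 3)

# Price-like response with multiplicative errors, so it is right-skewed
price <- exp(8 + 0.6 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(price ~ x)       # response on the original scale
fit_log <- lm(log(price) ~ x)  # response on the log scale

# Simple sample skewness: 0 for a symmetric distribution,
# positive when the right tail is longer
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3

skew_raw <- skewness(resid(fit_raw))
skew_log <- skewness(resid(fit_log))
print(c(raw = skew_raw, log = skew_log))
```

The trade-off is interpretation: on the log scale a coefficient describes an approximate percentage change in price rather than a dollar amount, and the response must be strictly positive.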
gg_qqplot(model3)
Normally QQ-plots are very sensitive and prone to detecting very slight deviations that don’t affect the overall model, but the other diagnostics look pretty wonky so I’m comfortable with assuming a large deviation from normality.
gg_cooksd(model3, threshold = 'matlab')
This graph flags a very large number of rows as having an outsized influence on the overall model. Normally after this, I would investigate these rows by plotting various explanatory variables against the response variable and coloring the points that correspond to them, but the large number of flagged rows makes that impractical: it would take forever to type them all in, and so many points would end up colored that no pattern would stand out.
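The flagged rows don’t have to be typed in by hand, since `cooks.distance()` from base R returns one value per row and the cutoff can be applied in code. The sketch below assumes the `'matlab'` threshold corresponds to three times the mean Cook’s distance, as the lindia documentation describes; the data frame `toy` and its planted outliers are made up for illustration.

```r
set.seed(3)
n <- 200

# Simulated data with five planted outliers (hypothetical values)
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 1 + 2 * toy$x + rnorm(n)
toy$y[1:5] <- toy$y[1:5] + 40

fit <- lm(y ~ x, data = toy)

# One Cook's distance per row used in the fit
cd <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
toy$influential <- cd > 3 * mean(cd)

# Row numbers of the flagged points, and how many there are
flagged_rows <- which(toy$influential)
print(flagged_rows)
print(table(toy$influential))
```

The logical `influential` column can then be mapped to color in a scatterplot, for example with `aes(color = influential)` in ggplot2. If the model dropped rows because of NA values, the vector from `cooks.distance()` is shorter than the original data frame, so it should be attached to the data actually used in the fit (such as `df_gallon` for the three-variable model).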
Overall, the model with three explanatory variables seems to be the best of the three, even though its residuals are pretty far from normal and their variance isn’t constant. Then again, my dataset itself has several data points with extreme values that make fitting a model difficult. If I ever learn how to correct for these outliers, I will try my best to implement the solution.