library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(boot)
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:car':
##
## logit
options(scipen = 6)
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
#converting non-american stuff to american stuff
df <- df |> mutate(years_since = year(now()) - Year) |> mutate(PriceUSD = Price * 0.012) |> mutate(Miles = Kilometer * 0.621371) |> mutate(LengthInch = Length * 0.0393701) |> mutate(WidthInch = Width * 0.0393701) |> mutate(HeightInch = Height * 0.0393701) |> mutate(FuelGallons = Fuel.Tank.Capacity * 0.264172) |> mutate(Volume = LengthInch * WidthInch * HeightInch)
#Cleaning up Owner attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'
df$Engine <- sub(x = df$Engine, pattern = " cc", replacement = "")
df$Engine <- strtoi(df$Engine)
To select the variables for my linear function, I will use step-wise selection on a subset of my dataset that only has numeric variables and without the Price, Kilometer, and Length/Width/Height variables, since those were already converted to miles, inches, and US Dollars. The Year variable was also converted to years_since so the number plays nicer with models. I excluded the Volume variable since it’s reliant on each of the converted length/width/height variables.
df_clean <- df[ , c('PriceUSD', 'years_since', 'Miles', 'Engine', 'FuelGallons', 'Seating.Capacity', 'LengthInch', 'WidthInch', 'HeightInch')]
df_clean <- na.omit(df_clean)
lm_b <- lm(PriceUSD ~ ., df_clean)
step(lm_b, direction = "backward")
## Start: AIC=38189.53
## PriceUSD ~ years_since + Miles + Engine + FuelGallons + Seating.Capacity +
## LengthInch + WidthInch + HeightInch
##
## Df Sum of Sq RSS AIC
## - LengthInch 1 195448335 700944892757 38188
## <none> 700749444422 38190
## - WidthInch 1 1407750727 702157195149 38191
## - HeightInch 1 2533447999 703282892421 38195
## - Miles 1 8628850916 709378295338 38211
## - FuelGallons 1 21644495176 722393939598 38246
## - Seating.Capacity 1 34471284918 735220729340 38281
## - years_since 1 97256226960 798005671382 38439
## - Engine 1 104453337193 805202781615 38457
##
## Step: AIC=38188.07
## PriceUSD ~ years_since + Miles + Engine + FuelGallons + Seating.Capacity +
## WidthInch + HeightInch
##
## Df Sum of Sq RSS AIC
## <none> 700944892757 38188
## - WidthInch 1 1215859162 702160751919 38189
## - HeightInch 1 2471297540 703416190296 38193
## - Miles 1 8761275426 709706168183 38210
## - FuelGallons 1 23462516130 724407408887 38250
## - Seating.Capacity 1 42161459988 743106352745 38299
## - years_since 1 97803380673 798748273430 38439
## - Engine 1 118277988720 819222881477 38488
##
## Call:
## lm(formula = PriceUSD ~ years_since + Miles + Engine + FuelGallons +
## Seating.Capacity + WidthInch + HeightInch, data = df_clean)
##
## Coefficients:
## (Intercept) years_since Miles Engine
## 24002.46665 -2442.55531 -0.06195 21.85850
## FuelGallons Seating.Capacity WidthInch HeightInch
## 1755.46972 -8195.64119 267.63460 -311.64465
Only LengthInch was excised, which is probably a good sign.
df_clean <- subset(df_clean, select = -LengthInch)
lm_clean <- lm(PriceUSD ~ ., df_clean)
summary(lm_clean)
##
## Call:
## lm(formula = PriceUSD ~ ., data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -97518 -8748 -2618 5309 300261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24002.46665 10290.78390 2.332 0.01978 *
## years_since -2442.55531 148.88238 -16.406 < 2e-16 ***
## Miles -0.06195 0.01262 -4.910 9.86e-07 ***
## Engine 21.85850 1.21156 18.042 < 2e-16 ***
## FuelGallons 1755.46972 218.46500 8.035 1.61e-15 ***
## Seating.Capacity -8195.64119 760.85332 -10.772 < 2e-16 ***
## WidthInch 267.63460 146.31082 1.829 0.06752 .
## HeightInch -311.64465 119.50140 -2.608 0.00918 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19060 on 1929 degrees of freedom
## Multiple R-squared: 0.5631, Adjusted R-squared: 0.5615
## F-statistic: 355.2 on 7 and 1929 DF, p-value: < 2.2e-16
df_clean <- subset(df_clean, select = -c(WidthInch, HeightInch))
lm_clean <- lm(PriceUSD ~ ., df_clean)
summary(lm_clean)
##
## Call:
## lm(formula = PriceUSD ~ ., data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96756 -8531 -2282 5274 299791
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27900.34691 3168.38783 8.806 < 2e-16 ***
## years_since -2465.68447 140.98012 -17.490 < 2e-16 ***
## Miles -0.06288 0.01262 -4.983 0.000000681 ***
## Engine 22.66866 1.16737 19.419 < 2e-16 ***
## FuelGallons 1852.46373 183.38678 10.101 < 2e-16 ***
## Seating.Capacity -9574.75278 574.35572 -16.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19100 on 1931 degrees of freedom
## Multiple R-squared: 0.5608, Adjusted R-squared: 0.5597
## F-statistic: 493.1 on 5 and 1931 DF, p-value: < 2.2e-16
I excised the WidthInch and HeightInch variables since the significance values looked dodgy. Even though the Multiple R-squared error was lowered, I kept it anyways for the sake of simplicity.
To calculate the standard error and confidence interval, I will use bootstrapping.
bootstrap <- function (x, func=mean, n_iter=10^4) {
# empty vector to be filled with values from each iteration
func_values <- c(NULL)
# we simulate sampling `n_iter` times
for (i in 1:n_iter) {
# pull the sample
x_sample <- sample(x, size = length(x), replace = TRUE)
# add on this iteration's value to the collection
func_values <- c(func_values, func(x_sample))
}
return(func_values)
}
price_means <- bootstrap(df_clean$PriceUSD)
price_se_boot <- sd(price_means)
print(paste0("bootstrapped standard error: ",price_se_boot))
## [1] "bootstrapped standard error: 656.554499835394"
Our bootstrapped standard error is about 652 US dollars, which isn’t too bad, since we’re dealing with products in the tens of thousands of dollars range. Next, I’ll use another bootstrapping function to calculate a bootstrapped confidence interval for the mean and median price in the clean dataset.
#library(boot)
boot_ci <- function (v, func = median, conf = 0.80, n_iter = 1000) {
# the `boot` library needs the function in this format
boot_func <- \(x, i) func(x[i])
b <- boot(v, boot_func, R = n_iter)
boot.ci(b, conf = conf, type = "perc")
}
boot_ci(df_clean$PriceUSD, mean, 0.80)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 80% (19306, 21050 )
## Calculations and Intervals on Original Scale
boot_ci(df_clean$PriceUSD, median, 0.80)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 80% ( 9420, 10200 )
## Calculations and Intervals on Original Scale
The lower value for the median’s confidence interval compared to the mean’s confidence interval suggests a right-skewed distribution for the PriceUSD variable.
gg_resfitted(lm_clean) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
There’s an odd upward curve, though there seems to be a minimum at around 25,000 fitted values, which is just below the 0 residuals line. the region between approximately 0 and 50,000 fitted values has negative residuals (underestimation), while the area outside of that range has positive residuals (overestimation). The “fanning” from left to right means there’s a higher variance in errors, though the “sweet spot” looks like it’s around the 50,000 region since it’s right on the residuals line.
plots <- gg_resX(lm_clean, plot.all = FALSE)
plots$years_since
plots$Miles
plots$Engine
plots$FuelGallons
plots$Seating.Capacity
All the plots look kind of funky since there are a few extreme values for every attribute (particularly the plot for Residuals vs. Miles), but all of them have a “fanning” that seems to violate the assumption of constant variance.
gg_reshist(lm_clean)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There’s a very strong left skew and a noticeable extreme with a lot of residuals. A linear model might not be enough to handle this, and we’ll need a model that’s more amenable to extreme values and outliers.
gg_qqplot(lm_clean)
The QQ plot is known for being highly sensitive, but together with the histogram, this indicates a high level of deviance from a normal distribution. Normalization of the data may be required.
gg_cooksd(lm_clean, threshold = 'matlab')
According to this Cook’s D analysis plot, There area LOT of outliers in this dataset. It would be impractical to analyze each and every one of these outliers, by plotting them along with non-outliers against price, since they’re too crowded to tell apart, and it would be time-consuming to input the numbers anyways.