library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
options(scipen = 6)
theme_set(theme_minimal())
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# Convert the price to USD and metric measurements to US units (miles, inches, gallons)
df <- df |>
  mutate(
    years_since = year(now()) - Year,
    PriceUSD    = Price * 0.012,
    Mileage     = Kilometer * 0.621371,
    LengthInch  = Length * 0.0393701,
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,
    Volume      = LengthInch * WidthInch * HeightInch  # cubic inches
  )
# Clean up the Owner attribute: relabel 'Fourth' as '4 or More'
df$Owner[df$Owner == 'Fourth'] <- '4 or More'
We need a model to predict the PriceUSD attribute, since price is the attribute of most interest to buyers and sellers.
The explanatory variable for this first test needs to be categorical, so I will use the vehicle’s drivetrain. Rows where the Drivetrain attribute is blank are removed first.
df_dt <- subset(df, Drivetrain != "")
df_dt |>
ggplot() +
geom_boxplot(mapping = aes(y = PriceUSD, x = Drivetrain)) +  # plot the converted USD price to match the axis label
scale_y_log10(labels = \(x) paste0('$', x / 1000, 'K')) +
annotation_logticks(sides = 'l') +
labs(x = "Drivetrain Type",
y = "Price (USD)")
This seems like a fairly good explanatory variable, so I’ll proceed with this one.
\[ H_0 : \text{average selling price is equal across all types of drivetrain} \]
First I will plot the F distribution that the test statistic follows under the null hypothesis, with k - 1 and n - k degrees of freedom (k drivetrain types, n vehicles), then run an ANOVA test.
n <- nrow(df_dt)
k <- n_distinct(df_dt$Drivetrain)
ggplot() +
geom_function(xlim = c(0, 10),
fun = \(x) stats::df(x, k - 1, n - k)) +  # stats::df is the F density; the data frame is also named df
geom_vline(xintercept = 1, color = 'orange') +
labs(title = 'F Distribution for Drivetrain Types',
x = "F Values",
y = "Probability Density") +
theme_hc()
m <- aov(PriceUSD ~ Drivetrain, data = df_dt)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## Drivetrain 2 482288153288 241144076644 383.1 <2e-16 ***
## Residuals 1920 1208397120029 629373500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
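The F statistic is 383.1 on 2 and 1,920 degrees of freedom, with a p-value below 2e-16. A result this extreme would be very unlikely if the mean prices were equal across drivetrain types.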
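As a sanity check, the F statistic and its p-value can be recomputed from the sums of squares printed in the ANOVA table. This is only a sketch that copies the printed values by hand; the variable names are my own, and the 5% level used for the critical value is an assumption rather than something fixed earlier in the analysis.

```r
# Values copied by hand from the ANOVA table above
ss_between <- 482288153288   # Sum Sq for Drivetrain
ss_within  <- 1208397120029  # Sum Sq for Residuals
df_between <- 2              # k - 1
df_within  <- 1920           # n - k

# F is the ratio of the between-group to the within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# The upper-tail probability of the F distribution is the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Critical value for a test at the 5% level (assumed level)
f_crit <- qf(0.95, df_between, df_within)

print(c(F = f_stat, critical = f_crit))
print(p_value)
```

The observed F sits far to the right of the critical value, which is why the printed p-value is reported only as `<2e-16`.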
pairwise.t.test(df_dt$PriceUSD, df_dt$Drivetrain, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: df_dt$PriceUSD and df_dt$Drivetrain
##
## AWD FWD
## FWD < 2e-16 -
## RWD 0.00000000014 < 2e-16
##
## P value adjustment method: bonferroni
All three pairings differ significantly after the Bonferroni adjustment. The AWD and RWD pair has the largest adjusted p-value (1.4e-10), several orders of magnitude above the 2e-16 reporting floor reached by the other two pairs, so those two drivetrains are somewhat closer in mean price, but the difference is still far too large to attribute to chance. Given the extremely small ANOVA p-value, we reject the null hypothesis and conclude that mean selling price differs by drivetrain type.
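For reference, the Bonferroni adjustment requested in `pairwise.t.test()` multiplies each raw p-value by the number of comparisons and caps the result at 1. The raw p-values below are made up purely for illustration and do not come from this dataset.

```r
# Hypothetical raw p-values for three pairwise comparisons (not from the data)
raw_p <- c(AWD_FWD = 0.01, AWD_RWD = 0.02, FWD_RWD = 0.5)

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
manual <- pmin(1, raw_p * length(raw_p))

# The same adjustment using base R
adjusted <- p.adjust(raw_p, method = "bonferroni")

print(rbind(raw_p, manual, adjusted))
```

With only three comparisons the adjustment is mild, so p-values as small as the ones reported above stay far below any usual threshold.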
df |>
ggplot(mapping = aes(x = Volume, y = PriceUSD)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 64 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 64 rows containing missing values or values outside the scale range
## (`geom_point()`).
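Here the explanatory variable is continuous: the vehicle’s volume in cubic inches. The relationship with price looks fairly loose in the scatterplot, but I’ll fit a linear model to quantify it.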
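One way to put a number on how loose the relationship is would be the correlation coefficient together with the model’s R-squared; with a single predictor, R-squared equals the squared correlation. The sketch below uses simulated data so that it runs on its own. The `sim` data frame and every number in it are invented, with only the column names borrowed from the analysis.

```r
set.seed(1)

# Simulated stand-in for the real data (all values are invented)
sim <- data.frame(Volume = runif(200, 500000, 1000000))
sim$PriceUSD <- -50000 + 0.1 * sim$Volume + rnorm(200, sd = 20000)

fit <- lm(PriceUSD ~ Volume, data = sim)

r  <- cor(sim$Volume, sim$PriceUSD)  # Pearson correlation
r2 <- summary(fit)$r.squared         # share of price variance explained

print(c(correlation = r, r_squared = r2))
```

On the real data, `cor()` would likely need `use = "complete.obs"`, since the plot warning reports 64 rows with missing or non-finite values.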
model <- lm(PriceUSD ~ Volume, df)
model$coefficients
## (Intercept) Volume
## -51414.58582665 0.09698433
In this regression model, the coefficient for the Volume attribute is the expected change in selling price per additional cubic inch of volume, about $0.10. Because the intercept is negative, the model predicts a negative price for any volume below roughly 530,000 cubic inches, so it is of no use in that range, but it might have some value in predicting the cost of an ordinary vehicle of a given volume. Of course, predicting a vehicle’s price from its volume alone oversimplifies things, and a multiple linear regression model with several explanatory variables would serve prediction better.
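The volume at which the fitted line crosses zero, and a sample prediction, can be worked out directly from the printed coefficients. This is a rough sketch: the coefficients are copied by hand from the output above, and the vehicle dimensions are hypothetical.

```r
# Coefficients copied by hand from the model output above
intercept <- -51414.58582665
slope     <- 0.09698433        # USD per cubic inch

# Volume below which the fitted line predicts a negative price
break_even <- -intercept / slope

# Hypothetical vehicle measuring 180 x 70 x 58 inches (made-up dimensions)
volume    <- 180 * 70 * 58
predicted <- intercept + slope * volume

print(c(break_even_volume = break_even, predicted_price = predicted))
```

With the fitted `model` object available, `predict(model, newdata = data.frame(Volume = volume))` should give the same prediction without copying coefficients.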