Importing Packages and Dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)

options(scipen = 6)
theme_set(theme_minimal())

df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
# Convert prices and metric measurements to US units
df <- df |>
  mutate(
    years_since = year(now()) - Year,
    PriceUSD    = Price * 0.012,                   # price converted to USD
    Mileage     = Kilometer * 0.621371,            # kilometers to miles
    LengthInch  = Length * 0.0393701,              # millimeters to inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,   # liters to US gallons
    Volume      = LengthInch * WidthInch * HeightInch  # cubic inches
  )
#Cleaning up Owner attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'

We will need to construct a model to predict the PriceUSD attribute, since price is of the most interest to buyers and sellers.

Selecting the Explanatory Variable

Next, our explanatory variable needs to be a categorical variable. I will select the vehicle’s drivetrain to be this categorical variable. I needed to remove the rows where the Drivetrain attribute was blank for this.

df_dt <- subset(df, Drivetrain != "")
df_dt |>
  ggplot() +
  geom_boxplot(mapping = aes(y = PriceUSD, x = Drivetrain)) +
  scale_y_log10(labels = \(x) paste0('$', x / 1000, 'K')) +
  annotation_logticks(sides = 'l') +
  labs(x = "Drivetrain Type",
       y = "Price (USD)")

This seems like a fairly good explanatory variable, so I’ll proceed with this one.

\[ H_0 : \text{average selling price is equal across all types of drivetrain} \]

ANOVA Test

First I will plot the F distribution that the test statistic follows under the null hypothesis, with k - 1 and n - k degrees of freedom (k drivetrain types, n vehicles), then run an ANOVA test.

n <- nrow(df_dt)
k <- n_distinct(df_dt$Drivetrain)

ggplot() +
  geom_function(xlim = c(0, 10), 
                fun = \(x) stats::df(x, k - 1, n - k)) +  # stats::df is the F density; `df` also names our data frame
  geom_vline(xintercept = 1, color = 'orange') +
  labs(title = 'F Distribution for Drivetrain Types',
       x = "F Values",
       y = "Probability Density") +
  theme_hc()
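
To read the density plot, it can help to know where the rejection region begins. The sketch below computes the 5% critical value with qf(). The degrees of freedom 2 and 1920 are taken from the ANOVA table in this section, and the 0.05 significance level and the names df1, df2, and f_crit are assumptions made for illustration.

```r
# Degrees of freedom from the ANOVA table: k - 1 = 2, n - k = 1920
df1 <- 2
df2 <- 1920

# Assumed significance level of 0.05: F values above this cutoff
# fall in the rejection region of the null distribution
f_crit <- qf(0.95, df1, df2)
f_crit
```

Any observed F statistic far to the right of this cutoff sits in a region where the density curve is essentially flat at zero.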

m <- aov(PriceUSD ~ Drivetrain, data = df_dt)
summary(m)
##               Df        Sum Sq      Mean Sq F value Pr(>F)    
## Drivetrain     2  482288153288 241144076644   383.1 <2e-16 ***
## Residuals   1920 1208397120029    629373500                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value here is below 2e-16, so data like these would be extremely unlikely if the mean price were the same for every drivetrain type.
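
The F value in the table is the ratio of the two mean squares: 241144076644 / 629373500 is about 383.1. As a minimal sketch of where those numbers come from, the code below rebuilds the F statistic by hand on simulated toy data and checks it against aov(). The data frame toy, its groups, and every variable name here are hypothetical and are not part of the vehicle dataset.

```r
# Toy data (hypothetical values, not the vehicle dataset)
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 20),
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 15))
)

n_toy <- nrow(toy)
k_toy <- length(unique(toy$group))

grand_mean  <- mean(toy$y)
group_means <- tapply(toy$y, toy$group, mean)
group_sizes <- tapply(toy$y, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((toy$y - group_means[as.character(toy$group)])^2)

# F is the ratio of the two mean squares
f_manual <- (ss_between / (k_toy - 1)) / (ss_within / (n_toy - k_toy))
p_manual <- pf(f_manual, k_toy - 1, n_toy - k_toy, lower.tail = FALSE)

# Compare with the value reported by aov()
f_aov <- summary(aov(y ~ group, data = toy))[[1]][["F value"]][1]
all.equal(f_manual, f_aov)
```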

pairwise.t.test(df_dt$PriceUSD, df_dt$Drivetrain, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  df_dt$PriceUSD and df_dt$Drivetrain 
## 
##     AWD           FWD    
## FWD < 2e-16       -      
## RWD 0.00000000014 < 2e-16
## 
## P value adjustment method: bonferroni

All three pairwise comparisons are significant after the Bonferroni adjustment. The AWD-FWD and FWD-RWD p-values are below 2e-16, while the AWD-RWD p-value (about 1.4e-10) is several orders of magnitude larger, which suggests that AWD and RWD prices are somewhat closer to each other than either is to FWD, though still clearly different. Because the ANOVA p-value is extremely small, we can reject the null hypothesis and conclude that the mean selling price differs significantly across drivetrain types.
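
For context on the adjustment used above, the Bonferroni method multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below shows this with p.adjust() on made-up p-values. The vector raw_p is hypothetical and does not come from the vehicle data.

```r
# Hypothetical unadjusted p-values for three pairwise comparisons
raw_p <- c(0.001, 0.02, 0.4)

# Built-in adjustment
adj_p <- p.adjust(raw_p, method = "bonferroni")

# The same adjustment by hand: multiply by the number of tests, cap at 1
manual_p <- pmin(1, raw_p * length(raw_p))

adj_p
all.equal(adj_p, manual_p)
```

With three drivetrain types there are three pairwise comparisons, so each reported p-value has already been multiplied by 3.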

Selecting a Continuous Variable

df |>
  ggplot(mapping = aes(x = Volume, y = PriceUSD)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 64 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 64 rows containing missing values or values outside the scale range
## (`geom_point()`).

The correlation seems pretty loose, but I’ll continue so I can tell for sure.

model <- lm(PriceUSD ~ Volume, df)
model$coefficients
##     (Intercept)          Volume 
## -51414.58582665      0.09698433

In this regression model, the coefficient for the Volume attribute is the estimated change in selling price for each additional cubic inch of volume, about $0.097. While this linear regression model won’t be any good for volumes small enough to produce a negative predicted price, it might have some value in predicting the cost of an ordinary vehicle of a given volume. Of course, predicting a vehicle’s price from its volume alone would be oversimplifying things, and a multiple linear regression model that includes several variables would serve prediction better.
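
To make the negative-price limitation concrete, the sketch below uses the fitted coefficients printed above to find the volume at which the predicted price crosses zero, which works out to roughly 530,000 cubic inches. The helper predict_price() and the example volume of 700,000 cubic inches are hypothetical and are only there for illustration.

```r
# Coefficients copied from the model output above
intercept <- -51414.58582665
slope     <- 0.09698433

# Volume (cubic inches) where the fitted line predicts a price of zero;
# any smaller volume gives a negative predicted price
zero_volume <- -intercept / slope
zero_volume

# Hypothetical helper: predicted price in USD for a given volume
predict_price <- function(volume) intercept + slope * volume

# Example with an assumed volume of 700,000 cubic inches
predict_price(700000)
```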