For my final project, I wanted to analyze the relationship between the number of previous owners a used vehicle has and its sales price. There are lots of factors that influence a used vehicle’s sales price, but I wanted to focus on one factor for this project. I used this vehicle dataset from Kaggle (https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/) for my research; the dataset is from India specifically, but I converted metric units and Indian rupees to imperial units and US Dollars. Also, while the context of the dataset isn’t American, I believe with some adjustment some concepts may be able to used for datasets of used vehicles in the USA.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(boot)
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:car':
##
## logit
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
#remove scientific notation
options(scipen = 6)
df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
After importing the appropriate libraries and loading in the dataset, I converted metric units and Indian rupees to imperial units and US Dollars for ease of understanding. I also used the converted dimensions of the vehicle to calculate its volume. After the conversions, I removed the Unregistered vehicles from the data set because they’re illegal to sell in India, and cleaned up the Owner and Engine attributes.
#converting non-american stuff to american stuff
df <- df |> mutate(years_since = year(now()) - Year) |> mutate(PriceUSD = Price * 0.012) |> mutate(Miles = Kilometer * 0.621371) |> mutate(LengthInch = Length * 0.0393701) |> mutate(WidthInch = Width * 0.0393701) |> mutate(HeightInch = Height * 0.0393701) |> mutate(FuelGallons = Fuel.Tank.Capacity * 0.264172) |> mutate(Volume = LengthInch * WidthInch * HeightInch)
#Cleaning up Owner and Engine attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'
df <- subset(df, Owner != 'UnRegistered Car')
df$Engine <- sub(x = df$Engine, pattern = " cc", replacement = "")
df$Engine <- strtoi(df$Engine)
df$Owner <- factor(df$Owner, levels = c("First", "Second", "Third", "4 or More"))
In this box plot, the mean decreases more for each successive previous owner. I used a logarithmic scale to make differences easier to notice. I noticed for vehicles with 4 or more previous owners, the range of values is a lot more narrow than vehicles with less owners. This may indicate a significant difference, but I won’t make any judgements until I do some hypothesis testing.
df |>
ggplot() +
geom_boxplot(mapping = aes(y = PriceUSD, x = Owner)) +
scale_y_log10(labels = \(x) paste('$', x / 1000, 'K')) +
annotation_logticks(sides = 'l') +
labs(x = "# of Previous Owners",
y = "Sales Price (USD)")
Since we are interested in the relationship between the sales price and number of previous owners, our null hypothesis and alternative hypothesis are as follows:
\(H_0\): A used vehicle’s mean sales price is constant, regardless of the number of previous owners.
\(H_1\): At least one number of previous owners has a different mean sales price value than the others.
Running an ANOVA test with the logarithm of the sales price (following the example of the box plot) and number of owners gives us a small p-value, meaning at least one group has a meaningfully different mean sales price from the others. Pairwise t-tests suggest that the second and third groups have some amount of similarity, while the 4 or more group has the least amount of similarity with the others.
owner_aov <- aov(log(PriceUSD) ~ Owner, data = df)
summary(owner_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Owner 3 30.7 10.25 11.02 0.000000355 ***
## Residuals 2034 1891.7 0.93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pairwise.t.test(log(df$PriceUSD), df$Owner, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: log(df$PriceUSD) and df$Owner
##
## First Second Third
## Second 0.02632 - -
## Third 0.00639 0.19452 -
## 4 or More 0.00039 0.00155 0.02646
##
## P value adjustment method: bonferroni
I used bootstrapping to calculate the 95% confidence interval for each class’ mean sales price.
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
# the function we want to run on each iteration i
boot_func <- \(x, i) func(x[i])
# the boot instance, running the function on each iteration
b <- boot(v, boot_func, R = n_iter)
b <- boot.ci(b, conf = conf, type = "perc")
# return just the lower and upper values
return(c("lower" = b$percent[4],
"upper" = b$percent[5]))
}
df_ci <- df |>
group_by(Owner) |>
summarise(ci_lower = boot_ci(PriceUSD, mean)['lower'],
mean_price = mean(PriceUSD),
ci_upper = boot_ci(PriceUSD, mean)['upper'])
df_ci
## # A tibble: 4 × 4
## Owner ci_lower mean_price ci_upper
## <fct> <dbl> <dbl> <dbl>
## 1 First 18922. 20570. 22237.
## 2 Second 15008. 17257. 19743.
## 3 Third 9077. 23783. 36886.
## 4 4 or More 1440 1830 2250
The confidence intervals for 1 and 2 previous owners are narrow while the confidence interval for 3 previous owners is wide and overlaps with both 1 and 2; the upper bound is possibly inflated by a few very expensive vehicles. The confidence interval for 4 or more previous owners is narrow and much lower than the rest; this suggests a mostly-constant sales price for 1 to 3 previous owners before it drops off at 4 and beyond.
df_ci |>
ggplot() +
geom_errorbarh(mapping = aes(y = Owner,
xmin=ci_lower, xmax=ci_upper,
color = '95% C.I.'), height = 0.5) +
geom_point(mapping = aes(x = mean_price, y = Owner,
color = 'Group Mean'),
shape = '|',
size = 5) +
scale_color_manual(values=c('black', 'orange')) +
labs(title = "Sales Price by No. of Owners",
x = "Sales Price (USD)",
y = "No. of Previous Owners",
color = '')
Since vehicles with 1 to 3 previous owners have a largish, mostly-constant sales price, we should focus on marketing those towards more-affluent consumers that would otherwise consider buying a new car. The sustainability of buying a used car over a new one should be emphasized. For cars with 4 or more previous owners, market towards less-affluent consumers; emphasize affordability.