Importing Libraries & Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(boot)

## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:car':
## 
##     logit

library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

#remove scientific notation
options(scipen = 6)

df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")

After importing the appropriate libraries and loading in the dataset, I converted metric units and Indian rupees to imperial units and US Dollars for ease of understanding. I also used the converted dimensions of the vehicle to calculate its volume. After the conversions, I removed the Unregistered vehicles from the data set because they’re illegal to sell in India, and cleaned up the Owner and Engine attributes.

#converting non-american stuff to american stuff
df <- df |> mutate(years_since = year(now()) - Year) |> mutate(PriceUSD = Price * 0.012) |> mutate(Miles = Kilometer * 0.621371) |> mutate(LengthInch = Length * 0.0393701) |> mutate(WidthInch = Width * 0.0393701) |> mutate(HeightInch = Height * 0.0393701) |> mutate(FuelGallons = Fuel.Tank.Capacity * 0.264172) |> mutate(Volume = LengthInch * WidthInch * HeightInch)
#Cleaning up Owner and Engine attribute
df['Owner'][df['Owner'] == 'Fourth'] <- '4 or More'
df <- subset(df, Owner != 'UnRegistered Car')
df$Engine <- sub(x = df$Engine, pattern = " cc", replacement = "")
df$Engine <- strtoi(df$Engine)
df$Owner <- factor(df$Owner, levels = c("First", "Second", "Third", "4 or More"))

Preliminary Analysis

In this box plot, the mean decreases more for each successive previous owner. I used a logarithmic scale to make differences easier to notice. I noticed for vehicles with 4 or more previous owners, the range of values is a lot more narrow than vehicles with less owners. This may indicate a significant difference, but I won’t make any judgements until I do some hypothesis testing.

df |>
  ggplot() +
  geom_boxplot(mapping = aes(y = PriceUSD, x = Owner)) +
  scale_y_log10(labels = \(x) paste('$', x / 1000, 'K')) +
  annotation_logticks(sides = 'l') +
  labs(x = "# of Previous Owners",
       y = "Sales Price (USD)")

Since we are interested in the relationship between the sales price and number of previous owners, our null hypothesis and alternative hypothesis are as follows:

\(H_0\): A used vehicle’s mean sales price is constant, regardless of the number of previous owners.

\(H_1\): At least one number of previous owners has a different mean sales price value than the others.

Running an ANOVA test with the logarithm of the sales price (following the example of the box plot) and number of owners gives us a small p-value, meaning at least one group has a meaningfully different mean sales price from the others. Pairwise t-tests suggest that the second and third groups have some amount of similarity, while the 4 or more group has the least amount of similarity with the others.

Model Testing

owner_aov <- aov(log(PriceUSD) ~ Owner, data = df)
summary(owner_aov)

##               Df Sum Sq Mean Sq F value      Pr(>F)    
## Owner          3   30.7   10.25   11.02 0.000000355 ***
## Residuals   2034 1891.7    0.93                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pairwise.t.test(log(df$PriceUSD), df$Owner, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  log(df$PriceUSD) and df$Owner 
## 
##           First   Second  Third  
## Second    0.02632 -       -      
## Third     0.00639 0.19452 -      
## 4 or More 0.00039 0.00155 0.02646
## 
## P value adjustment method: bonferroni

I used bootstrapping to calculate the 95% confidence interval for each class’ mean sales price.

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  # the function we want to run on each iteration i
  boot_func <- \(x, i) func(x[i])
  
  # the boot instance, running the function on each iteration
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  # return just the lower and upper values
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}

df_ci <- df |>
  group_by(Owner) |>
  summarise(ci_lower = boot_ci(PriceUSD, mean)['lower'],
            mean_price = mean(PriceUSD),
            ci_upper = boot_ci(PriceUSD, mean)['upper'])

df_ci

## # A tibble: 4 × 4
##   Owner     ci_lower mean_price ci_upper
##   <fct>        <dbl>      <dbl>    <dbl>
## 1 First       18922.     20570.   22237.
## 2 Second      15008.     17257.   19743.
## 3 Third        9077.     23783.   36886.
## 4 4 or More    1440       1830     2250

The confidence intervals for 1 and 2 previous owners are narrow while the confidence interval for 3 previous owners is wide and overlaps with both 1 and 2; the upper bound is possibly inflated by a few very expensive vehicles. The confidence interval for 4 or more previous owners is narrow and much lower than the rest; this suggests a mostly-constant sales price for 1 to 3 previous owners before it drops off at 4 and beyond.

df_ci |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = Owner, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_price, y = Owner,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'orange')) +
  labs(title = "Sales Price by No. of Owners",
       x = "Sales Price (USD)",
       y = "No. of Previous Owners",
       color = '')

Conclusion

Since vehicles with 1 to 3 previous owners have a largish, mostly-constant sales price, we should focus on marketing those towards more-affluent consumers that would otherwise consider buying a new car. The sustainability of buying a used car over a new one should be emphasized. For cars with 4 or more previous owners, market towards less-affluent consumers; emphasize affordability.

Introduction

Importing Libraries & Data

Preliminary Analysis

Model Testing

Conclusion