Summary

https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/

This dataset contains information about used cars collected by web scraping. All entries come from India, so measurements are metric and prices are in Indian rupees. The documentation and the download are available at the link above. One of the sources is CarDekho, an Indian website for finding used cars, though other, undocumented websites may also have been used. For this project, I will use the “car details v4” table because it appears to be the most recent.

The primary objective of my project is to use various attributes of used cars to predict their price.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)

# Discourage scientific notation in printed numbers and axis labels
options(scipen = 6)

df <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")

Interesting Findings

# Relabel "Fourth" so the legend reads "4 or More"
df$Owner[df$Owner == "Fourth"] <- "4 or More"
# Count vehicles for each combination of manufacture year and owner category
year_own <- df |> count(Year, Owner, sort = TRUE)
#ggplot(year_own, aes(Year, n, group = Owner, color = Owner)) + labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (1988-2022)", x = "Year", y = "# of vehicles") +  geom_line()
# Keep only vehicles manufactured after 2004, so the plot starts at 2005
new_year_own <- subset(year_own, Year > 2004)
ggplot(new_year_own, aes(Year, n, group = Owner, color = Owner)) +
  geom_line() +
  labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (2005-2022)", x = "Year", y = "# of vehicles")

The shapes of the lines in this graph are fairly interesting. The line for vehicles with 1 previous owner peaks the highest, followed by the lines for 2, 3, and 4 or more owners. Each peak also sits at an earlier year of manufacture than the one before it, presumably because older cars have had more time to change hands. What’s also very intriguing is that there are a few unregistered vehicles in the dataset, which I’m pretty sure are illegal to sell in India.

# Average price per location, plus the number of listings per location
uc_grp_loc <- df |> group_by(Location) |> summarise(average_price = mean(Price))
loc_count <- df |> count(Location, sort = TRUE)
ggplot(uc_grp_loc, aes(x = reorder(Location, -average_price), y = average_price)) +
  geom_col(width = 0.75) +
  labs(title = "Average selling prices of vehicles by Location", x = "Location", y = "Average Selling Price (INR)") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 5))

There is an absolutely ridiculous number of localities in this data: 77 unique values in total. In my previous data dives, I focused on the top and bottom 5 entities, but here I may need to map each locality to the state or territory it belongs to (I recently learned that India is also divided into states and territories) to reduce the number of unique values to something more manageable.
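A minimal sketch of how that locality-to-state mapping could work, using a named lookup vector in base R. The `toy` data frame, its prices, and the handful of cities in `state_lookup` are illustrative assumptions, not values taken from the dataset; the real lookup would need an entry for each of the 77 localities.

```r
# Illustrative city-to-state lookup (assumed spellings; extend to cover all localities).
state_lookup <- c(
  Mumbai    = "Maharashtra",
  Pune      = "Maharashtra",
  Bangalore = "Karnataka",
  Chennai   = "Tamil Nadu",
  Delhi     = "Delhi"
)

# Toy stand-in for df with made-up prices.
toy <- data.frame(
  Location = c("Mumbai", "Pune", "Bangalore", "Delhi", "Somewhere Else"),
  Price    = c(500000, 300000, 700000, 400000, 250000)
)

# Look up each location by name; locations missing from the lookup become NA.
toy$State <- unname(state_lookup[as.character(toy$Location)])

# List unmapped localities so they can be added to the lookup.
unmapped <- unique(toy$Location[is.na(toy$State)])

# Average price by state; rows with an NA State are dropped by aggregate().
state_avg <- aggregate(Price ~ State, data = toy, FUN = mean)
state_avg
```

Checking `unmapped` after each pass makes it easy to see which locality names still need a state before the grouped averages can be trusted.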

My Plan Moving Forward

I want to build statistical models that predict a vehicle’s price from attributes such as its number of previous owners, distance traveled, fuel tank capacity, seating capacity, and engine specifications, among other factors.
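As a rough idea of what such a model might look like, here is a sketch of a linear regression fitted with `lm()`. The data are simulated, the coefficients used to generate them are arbitrary, and modeling `log(Price)` is my assumption (used-car prices are often right-skewed), so this is only a starting point and not the final model.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the real data; column names mirror the ones used in this report.
toy <- data.frame(
  years_since = sample(1:15, n, replace = TRUE),
  Kilometer   = runif(n, 5000, 150000),
  Owner       = factor(sample(c("First", "Second", "Third"), n, replace = TRUE))
)

# Simulated price that falls with age and distance, plus multiplicative noise.
toy$Price <- exp(14 - 0.08 * toy$years_since - 0.000003 * toy$Kilometer +
                   rnorm(n, sd = 0.2))

# Fit a linear model on the log of price.
fit <- lm(log(Price) ~ years_since + Kilometer + Owner, data = toy)
summary(fit)

# Predict the price of a hypothetical car, back-transforming from the log scale.
new_car <- data.frame(
  years_since = 5,
  Kilometer   = 40000,
  Owner       = factor("First", levels = levels(toy$Owner))
)
predicted_price <- exp(predict(fit, newdata = new_car))
predicted_price
```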

Initial Findings

# Convert metric units and rupees to US customary units and dollars
df <- df |>
  mutate(
    years_since = year(now()) - Year,                  # vehicle age in years
    PriceUSD    = Price * 0.012,                       # INR to USD at a fixed, approximate rate
    Miles       = Kilometer * 0.621371,                # kilometers to miles
    LengthInch  = Length * 0.0393701,                  # millimeters to inches
    WidthInch   = Width * 0.0393701,
    HeightInch  = Height * 0.0393701,
    FuelGallons = Fuel.Tank.Capacity * 0.264172,       # liters to US gallons
    Volume      = LengthInch * WidthInch * HeightInch  # bounding-box volume in cubic inches
  ) |>
  drop_na(Volume)

Hypothesis 1: There is a positive correlation between a vehicle’s volume and its price.

options(scipen=10000)
ggplot(df, aes(x = Volume, y = PriceUSD)) + geom_point() + labs(x = "Volume (cubic inches)", y = "Price (USD)", title = "Comparison of Volume to Price")

round(cor(df$Volume, df$PriceUSD), 2)
## [1] 0.52
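A single correlation coefficient does not say how precisely it is estimated. One way to check, sketched here on simulated data, is `cor.test()`, which reports a confidence interval and a p-value alongside the estimate. The `volume` and `price` vectors below are made up for illustration; on the real data the call would be `cor.test(df$Volume, df$PriceUSD)`.

```r
set.seed(42)
n <- 300

# Simulated stand-ins for Volume (cubic inches) and PriceUSD.
volume <- rnorm(n, mean = 700000, sd = 100000)
price  <- 0.05 * volume + rnorm(n, sd = 8000)

# Pearson correlation test: estimate, 95% confidence interval, and p-value.
result <- cor.test(volume, price, method = "pearson")
result$estimate
result$conf.int
result$p.value
```

If the confidence interval excludes zero, the data are consistent with a positive relationship, although the scatterplot is still worth checking for outliers and nonlinearity.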

Hypothesis 2: There is a negative correlation between a vehicle’s distance driven and its price.

ggplot(df, aes(x = Miles, y = PriceUSD)) + geom_point() + labs(x = "Miles Traveled", y = "Price (USD)", title = "Comparison of Miles Traveled to Price")

round(cor(df$Miles, df$PriceUSD), 2)
## [1] -0.15
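A weak Pearson correlation can understate a relationship that is monotonic but not linear, or that is distorted by a few extreme values. A rank-based (Spearman) correlation is one way to check for that. The sketch below uses simulated, skewed data, not the dataset itself; on the real data the equivalent call would be `cor(df$Miles, df$PriceUSD, method = "spearman")`.

```r
set.seed(7)
n <- 300

# Simulated, right-skewed stand-ins for Miles and PriceUSD.
miles <- rlnorm(n, meanlog = 10.3, sdlog = 0.6)
price <- 40000 * exp(-miles / 40000) + rlnorm(n, meanlog = 8, sdlog = 1)

# Pearson measures linear association; Spearman measures monotonic association on ranks.
pearson  <- cor(miles, price, method = "pearson")
spearman <- cor(miles, price, method = "spearman")
c(pearson = pearson, spearman = spearman)
```

A Spearman value noticeably stronger than the Pearson value would suggest that a transformation, such as taking the log of price, is worth trying before modeling.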