library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
used_cars <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")
For this notebook, I will be using the “group by” function to analyze the vehicle dataset, as well as use principles of probability. The primary attribute I’m interested in is the selling price of each vehicle.
library(ggthemes)
uc_grp_year <- used_cars |> group_by(Year) |> summarise(average_price = mean(Price))
uc_grp_year$count <- used_cars |> count(Year)
year_count <- used_cars |> count(Year, sort = TRUE)
ggplot(data = uc_grp_year, aes(x = Year, y = average_price, group = 1)) +
labs(title = "Average selling prices of vehicles by Year (1988-2022)", x = "Year", y = "Average Selling Price (INR)") +
geom_line(color = "blue") +
geom_point()
uc_grp_year
## # A tibble: 22 × 3
## Year average_price count$Year $n
## <int> <dbl> <int> <int>
## 1 1988 1300000 1988 1
## 2 1996 130000 1996 1
## 3 2000 1200000 2000 1
## 4 2002 100000 2002 1
## 5 2004 71001 2004 1
## 6 2006 220000 2006 2
## 7 2007 738833. 2007 6
## 8 2008 214000 2008 13
## 9 2009 313030. 2009 33
## 10 2010 508037. 2010 27
## # ℹ 12 more rows
year_count
## Year n
## 1 2018 268
## 2 2017 262
## 3 2019 218
## 4 2014 192
## 5 2016 187
## 6 2015 178
## 7 2021 156
## 8 2020 132
## 9 2013 128
## 10 2012 92
## 11 2022 81
## 12 2011 79
## 13 2009 33
## 14 2010 27
## 15 2008 13
## 16 2007 6
## 17 2006 2
## 18 1988 1
## 19 1996 1
## 20 2000 1
## 21 2002 1
## 22 2004 1
This graph illustrates the average selling price of vehicles in India from 1988 to 2022. There was a fall after 1988 before a spike happened in 2000 before descending again, and after that the average selling prices of vehicles per year increased at a roughly exponential rate. To get the full story, we will need to look at how many many different vehicles there are.
ggplot(uc_grp_year, aes(x = Year, y = count$n, group = 1)) +
geom_line(stat="identity") + labs(title = "Number of Unique Car Models by Year", x = "Year", y = "# of Unique Models") + geom_point()
From the bar graph, there were very very few unique models produced
before the year 2006, and afterwards a gradual increase leading to a
sharp increase until the year 2018, and then a decrease in unique models
afterwards. This could possibly indicate a lack of interest in selling
older vehicles at used car outlets. Combined with the previous data, I
can deduce that older vehicles are seen as less desirable compared to
newer vehicles, and are thus sold less often and for cheaper at used car
dealerships.
uc_grp_loc <- used_cars |> group_by(Location) |> summarise(average_price = mean(Price))
#uc_grp_loc$count <- used_cars |> count(Location)
loc_count <- used_cars |> count(Location, sort = TRUE)
ggplot(uc_grp_loc, aes(x = reorder(Location, -average_price), y = average_price, group = 1)) +
geom_bar(stat="identity", width = 0.75) + labs(title = "Average selling prices of vehicles by Location", x = "Location", y = "Average Selling Price (INR)") + coord_flip() + theme(axis.text.y=element_text(size=5))
uc_grp_loc
## # A tibble: 77 × 2
## Location average_price
## <chr> <dbl>
## 1 Agra 788895.
## 2 Ahmedabad 1507257.
## 3 Allahabad 525000
## 4 Ambala Cantt 418333.
## 5 Amritsar 483750
## 6 Aurangabad 843750
## 7 Bangalore 2028720.
## 8 Bhopal 575714.
## 9 Bhubaneswar 827500
## 10 Bulandshahar 317000
## # ℹ 67 more rows
loc_count
## Location n
## 1 Mumbai 342
## 2 Delhi 307
## 3 Pune 144
## 4 Bangalore 132
## 5 Hyderabad 116
## 6 Lucknow 78
## 7 Ahmedabad 70
## 8 Chennai 63
## 9 Kolkata 60
## 10 Kanpur 52
## 11 Ludhiana 47
## 12 Patna 47
## 13 Mohali 40
## 14 Gurgaon 30
## 15 Coimbatore 29
## 16 Jaipur 29
## 17 Chandigarh 28
## 18 Dehradun 28
## 19 Ranchi 28
## 20 Faridabad 27
## 21 Varanasi 24
## 22 Jalandhar 22
## 23 Agra 19
## 24 Navi Mumbai 19
## 25 Raipur 19
## 26 Thane 19
## 27 Noida 15
## 28 Zirakpur 15
## 29 Karnal 13
## 30 Ambala Cantt 12
## 31 Jamshedpur 12
## 32 Nashik 11
## 33 Surat 10
## 34 Udupi 9
## 35 Meerut 8
## 36 Mysore 8
## 37 Bhopal 7
## 38 Guwahati 7
## 39 Indore 7
## 40 Yamunanagar 6
## 41 Mangalore 5
## 42 Panchkula 5
## 43 Salem 5
## 44 Vadodara 5
## 45 Allahabad 4
## 46 Amritsar 4
## 47 Aurangabad 4
## 48 Bhubaneswar 4
## 49 Goa 4
## 50 Mirzapur 4
## 51 Dharwad 3
## 52 Ghaziabad 3
## 53 Nagpur 3
## 54 Panvel 3
## 55 Roorkee 3
## 56 Rudrapur 3
## 57 Warangal 3
## 58 Bulandshahar 2
## 59 Dak. Kannada 2
## 60 Deoghar 2
## 61 Ernakulam 2
## 62 Gorakhpur 2
## 63 Haldwani 2
## 64 Kharar 2
## 65 Kheda 2
## 66 Kollam 2
## 67 Kota 2
## 68 Muzaffurpur 2
## 69 Purnea 2
## 70 Ranga Reddy 2
## 71 Samastipur 2
## 72 Unnao 2
## 73 Faizabad 1
## 74 Pimpri-Chinchwad 1
## 75 Rohtak 1
## 76 Siliguri 1
## 77 Valsad 1
As this graph illustrates, there are a LOT of data points in the location region. This data would be more useful if we concentrate it into a subset, but I do notice that Ghaziabad has the highest average selling price and Unnao has the lowest average selling price.
From the above table, we can see that the top 5 most frequently occuring locations in the dataset (in descending order) are Mumbai, Delhi, Pune, Bangalore, and Hyderabad, and the bottom 5 (all tied with only 1 vehicle) are Faizabad, Pimpri-Chinchwad, Rohtak, Siliguri, and Valsad. I noticed that the more frequent locations tend to be big, heavily-urban cities, while the least frequent locations tend to be either historic cities or heavily-forested areas. That seems to about match up with the relative utility of having a used car dealership in those locations.
uc_grp_own <- used_cars |> group_by(Owner) |> summarise(average_price = mean(Price))
#uc_grp_own$count <- used_cars |> count(Owner)
own_count <- used_cars |> count(Owner, sort = TRUE)
ggplot(uc_grp_own, aes(x = Owner, y = average_price, group = 1)) +
geom_bar(stat="identity") + labs(title = "Average selling prices of vehicles by Number of Previous Owners", x = "Location", y = "# of Previous Owners")
uc_grp_own
## # A tibble: 6 × 2
## Owner average_price
## <chr> <dbl>
## 1 4 or More 120000
## 2 First 1714159.
## 3 Fourth 163333.
## 4 Second 1438051.
## 5 Third 1981929.
## 6 UnRegistered Car 5285333.
own_count
## Owner n
## 1 First 1619
## 2 Second 373
## 3 Third 42
## 4 UnRegistered Car 21
## 5 Fourth 3
## 6 4 or More 1
The mean price of unregistered cars in the dataset is much higher than all the other entries at over 5 million rupees, with cars that had 1, 2, or 3 owners having only over 1 million rupees. Since driving an unregistered vehicle is illegal in India, I’m questioning the legality of some of these used vehicle outlets… The vast majority of cars had one previous owner, with the number decreasing at a roughly logarithmic rate until it’s only 1 digit for four and four or more previous owners. There are only 21 unregistered vehicles in the dataset, which suggests that the few out there are very much expensive. Back to the mean price of each amount of owners, the price stays relatively stable with 1, 2, and 3 previous owners before dropping after 4 previous owners. This would seem to correlate with used car owners spending less on more “well-used” vehicles.
Oddly, there is both a “Fourth” value and a “4 or more” value. Since the latter would logically include the former, the former should be considered the same value as the latter.
Of each attribute compared to the mean price of the vehicles associated with them, the least frequent are very, very infrequent. Of each vehicle’s model year, there is only 1 each of the years 1988, 1996, 2000, 2002, and 2004. There is only 1 used car sold in each of Faizabad, Pimpri-Chinchwad, Rohtak, Siliguri, and Valsad. Lastly, only 4 car in the dataset has had four or more owners (as mentioned before, “fourth” and “4 or more” have been combined). There are 2059 vehicles total in the dataset, so if you picked one at random, the chance of having one with any of these values is very small.
lowprob_year <- used_cars |> filter(Year %in% c(1988, 1996, 2000, 2002, 2004))
lowprob_loc <- used_cars |> filter(Location %in% c("Faizabad", "Pimpri-Chinchwad", "Rohtak", "Siliguri", "Valsad"))
lowprob_own <- used_cars |> filter(Owner == "Fourth" | Owner == "4 or More")
lowprob_year
## Make Model Price Year Kilometer Fuel.Type
## 1 Maruti Suzuki Zen LXi BS-II 71001 2004 65082 Petrol
## 2 Honda City 1.5 EXi 100000 2002 65494 Petrol
## 3 Mercedes-Benz E-Class E 200 Avantgarde 1300000 1988 71000 Petrol
## 4 Mercedes-Benz S-Class 280 AT 1200000 2000 65000 Petrol
## 5 Maruti Suzuki Zen LX 130000 1996 100000 Petrol + LPG
## Transmission Location Color Owner Seller.Type Engine Max.Power
## 1 Manual Kolkata Yellow First Individual
## 2 Manual Ranchi Beige First Individual
## 3 Automatic Dehradun Red Third Individual 1991 cc 181 bhp @ 5500 rpm
## 4 Automatic Dehradun Yellow Second Individual
## 5 Manual Coimbatore White Second Individual
## Max.Torque Drivetrain Length Width Height Seating.Capacity
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 300 Nm @ 1200 rpm RWD 5063 1860 1494 5
## 4 NA NA NA NA
## 5 NA NA NA NA
## Fuel.Tank.Capacity
## 1 NA
## 2 NA
## 3 80
## 4 NA
## 5 NA
lowprob_loc
## Make Model Price Year Kilometer
## 1 Tata Nexon XMA Petrol 580000 2018 81000
## 2 Volkswagen Vento TSI 650000 2017 51000
## 3 Hyundai Eon 1.0 Kappa Magna + [2014-2016] 240000 2013 64000
## 4 Maruti Suzuki Alto LXi CNG [2014-2018] 330000 2017 63200
## 5 Maruti Suzuki Ritz VXI BS-IV 204999 2011 42152
## Fuel.Type Transmission Location Color Owner Seller.Type Engine
## 1 Petrol Automatic Valsad Red First Individual 1198 cc
## 2 Petrol Automatic Rohtak White First Individual 1197 cc
## 3 Petrol Manual Faizabad White First Individual 998 cc
## 4 CNG Manual Pimpri-Chinchwad Grey First Individual 998 cc
## 5 Petrol Manual Siliguri Brown First Corporate
## Max.Power Max.Torque Drivetrain Length Width Height
## 1 108 bhp @ 5000 rpm 170 Nm @ 1750 rpm FWD 3994 1811 1607
## 2 103 bhp @ 5000 rpm 175 Nm @ 1500 rpm FWD 4384 1699 1466
## 3 68 bhp @ 6200 rpm 94 Nm @ 3500 rpm FWD 3515 1500 1510
## 4 58 bhp @ 6000 rpm 78 Nm @ 3500 rpm FWD 3545 1490 1475
## 5 NA NA NA
## Seating.Capacity Fuel.Tank.Capacity
## 1 5 44
## 2 5 55
## 3 5 32
## 4 4 60
## 5 NA NA
lowprob_own
## Make Model Price Year Kilometer Fuel.Type Transmission
## 1 Hyundai Santro GL LPG 120000 2009 48500 LPG Manual
## 2 Mahindra Xylo D2 BS-IV 210000 2012 85000 Diesel Manual
## 3 Maruti Suzuki Ritz Vdi BS-IV 120000 2009 80000 Diesel Manual
## 4 Toyota Innova 2.5 E 160000 2008 90000 Diesel Manual
## Location Color Owner Seller.Type Engine Max.Power Max.Torque Drivetrain
## 1 Lucknow Grey Fourth Individual 1086 cc 63@5500 89@3000
## 2 Kanpur White Fourth Individual 2489 cc 95@3600 220@1400
## 3 Kanpur Silver 4 or More Individual 1248 cc 75@4000 190@2000
## 4 Varanasi White Fourth Individual 2494 cc 102@5600 200@3400
## Length Width Height Seating.Capacity Fuel.Tank.Capacity
## 1 3565 1525 1590 5 35
## 2 4520 1850 1895 7 55
## 3 3715 1680 1620 5 43
## 4 4555 1770 1755 8 55
All the cars with four or more owners have relatively low prices of 120 thousand to 210 thousand rupees (1,430.40 to 2,503.20 USD). As mentioned before, vehicle prices appear to drop off sharply after they have four or more owners. I therefore hypothesize that a used vehicle’s value stays mostly constant after one to three owners before dropping by an order of magnitude (division by 10) after it acquires four owners. There are more factors that go into how “desirable” used vehicles are including maintenance and features, but this is outside the scope of the dataset.
A vehicle’s location where it was sold and its model year seem fairly intuitive to combine.
year_own <- used_cars |> group_by(Year, Owner) |> count(sort = TRUE)
#ggplot(year_own, aes(Year, n, group = Owner, color = Owner)) + labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (1988-2022)", x = "Year", y = "# of vehicles") + geom_line()
new_year_own <- subset(year_own, Year > 2004)
ggplot(new_year_own, aes(Year, n, group = Owner, color = Owner)) + labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (2004-2022)", x = "Year", y = "# of vehicles") + geom_line()
year_own
## # A tibble: 60 × 3
## # Groups: Year, Owner [60]
## Year Owner n
## <int> <chr> <int>
## 1 2018 First 230
## 2 2017 First 217
## 3 2019 First 203
## 4 2016 First 150
## 5 2021 First 147
## 6 2015 First 128
## 7 2014 First 124
## 8 2020 First 124
## 9 2013 First 86
## 10 2022 First 70
## # ℹ 50 more rows
new_year_own
## # A tibble: 55 × 3
## # Groups: Year, Owner [55]
## Year Owner n
## <int> <chr> <int>
## 1 2018 First 230
## 2 2017 First 217
## 3 2019 First 203
## 4 2016 First 150
## 5 2021 First 147
## 6 2015 First 128
## 7 2014 First 124
## 8 2020 First 124
## 9 2013 First 86
## 10 2022 First 70
## # ℹ 45 more rows
unregistered_cars <- subset(used_cars, Owner == "UnRegistered Car")
unregistered_cars
## Make Model Price Year Kilometer
## 140 Audi Q5 45 TFSI Premium Plus 5651000 2022 1
## 195 Audi Q3 2.0 TDI quattro Premium 1350000 2012 72000
## 223 Audi A6 2.0 TFSi Technology Pack 7200000 2021 20000
## 363 Honda City ZX CVT Petrol 1700000 2022 8530
## 390 Toyota Innova 2.0 V 575000 2013 82000
## 480 Mercedes-Benz GLC 220d 4MATIC Progressive 6400000 2022 5500
## 757 Mercedes-Benz C-Class 220 CDI AT 5400000 2022 4600
## 816 Renault Kwid 1.0 RXT [2016-2019] 420000 2017 43000
## 840 Mercedes-Benz A-Class Sedan 200d 4100000 2022 3826
## 862 Audi A4 Premium Plus 40 TFSI 4200000 2021 12321
## 1534 Mercedes-Benz E-Class 200 7050000 2022 1102
## 1625 Mercedes-Benz A-Class Sedan 200d 4200000 2021 5600
## 1747 Audi A4 Premium Plus 40 TFSI 4151000 2022 1
## 1767 MINI Cooper JCW Hatchback 5200000 2022 0
## 1783 Renault Kwid CLIMBER 1.0 [2017-2019] 375000 2017 45000
## 1795 Mercedes-Benz C-Class 220 CDI AT 5300000 2022 12500
## 1804 Porsche Cayenne Base 14900000 2021 7534
## 1841 Audi A6 Technology 45 TFSI 5800000 2021 7935
## 1913 Mercedes-Benz S-Class S 350D [2018-2020] 18500000 2021 5000
## 1978 Mercedes-Benz GLC 220d 4MATIC Progressive 8100000 2022 5500
## 1981 Maruti Suzuki Wagon R VXi 1.0 [2019-2019] 420000 2018 50000
## Fuel.Type Transmission Location Color Owner Seller.Type
## 140 Petrol Automatic Delhi Blue UnRegistered Car Individual
## 195 Diesel Automatic Hyderabad White UnRegistered Car Individual
## 223 Petrol Automatic Bangalore White UnRegistered Car Individual
## 363 Petrol Automatic Pune White UnRegistered Car Individual
## 390 Petrol Manual Varanasi Gold UnRegistered Car Individual
## 480 Diesel Automatic Hyderabad White UnRegistered Car Individual
## 757 Diesel Automatic Bangalore White UnRegistered Car Individual
## 816 Petrol Manual Hyderabad Grey UnRegistered Car Individual
## 840 Diesel Automatic Ahmedabad White UnRegistered Car Individual
## 862 Petrol Automatic Delhi Black UnRegistered Car Individual
## 1534 Petrol Manual Bangalore Black UnRegistered Car Individual
## 1625 Diesel Automatic Delhi Grey UnRegistered Car Individual
## 1747 Petrol Automatic Delhi Black UnRegistered Car Individual
## 1767 Petrol Automatic Ahmedabad Yellow UnRegistered Car Individual
## 1783 Petrol Manual Bhubaneswar Grey UnRegistered Car Individual
## 1795 Diesel Automatic Hyderabad White UnRegistered Car Individual
## 1804 Petrol Automatic Mumbai Blue UnRegistered Car Individual
## 1841 Petrol Automatic Surat Brown UnRegistered Car Individual
## 1913 Diesel Automatic Pune White UnRegistered Car Individual
## 1978 Diesel Automatic Hyderabad White UnRegistered Car Individual
## 1981 Petrol Manual Bhubaneswar White UnRegistered Car Individual
## Engine Max.Power Max.Torque Drivetrain Length Width
## 140 1984 cc 248 bhp @ 5000 rpm 370 Nm @ 1600 rpm AWD 4663 1898
## 195 1968 cc 174 bhp @ 4200 rpm 380 Nm @ 1750 rpm AWD 4385 1831
## 223 1984 cc 180 bhp @ 4000 rpm 320 Nm @ 1500 rpm FWD 4915 1874
## 363 1498 cc 119 bhp @ 6600 rpm 145 Nm @ 4300 rpm FWD 4549 1748
## 390 1998 cc 136@5600 182@4000 4555 1770
## 480 1950 cc 192 bhp @ 3800 rpm 400 Nm @ 1600 rpm AWD 4658 1890
## 757 2148 cc 170@3700 410@2000 RWD 4596 1770
## 816 999 cc 67 bhp @ 5500 rpm 91 Nm @ 4250 rpm FWD 3679 1579
## 840 1950 cc 147 bhp @ 3400 rpm 320 Nm @ 1400 rpm FWD 4549 1796
## 862 1984 cc 188 bhp @ 4200 rpm 320 Nm @ 1450 rpm FWD 4762 1847
## 1534 NA NA
## 1625 1950 cc 147 bhp @ 3400 rpm 320 Nm @ 1400 rpm FWD 4549 1796
## 1747 1984 cc 188 bhp @ 4200 rpm 320 Nm @ 1450 rpm FWD 4762 1847
## 1767 1998 cc 228 bhp @ 5200 rpm 320 Nm @ 1450 rpm FWD 3850 1727
## 1783 999 cc 67 bhp @ 5500 rpm 91 Nm @ 4250 rpm FWD 3679 1579
## 1795 2148 cc 170@3700 410@2000 RWD 4596 1770
## 1804 2995 cc 335 bhp @ 5300 rpm 450 Nm @ 1340 rpm AWD 4918 1983
## 1841 1984 cc 241 bhp @ 5000 rpm 370 Nm @ 1600 rpm FWD 4939 1886
## 1913 2925 cc 282 bhp @ 3600 rpm 600 Nm @ 1600 rpm RWD 5246 1899
## 1978 1950 cc 192 bhp @ 3800 rpm 400 Nm @ 1600 rpm AWD 4658 1890
## 1981 3655 1620
## Height Seating.Capacity Fuel.Tank.Capacity
## 140 1659 5 70
## 195 1608 5 64
## 223 1455 5 75
## 363 1489 5 40
## 390 1755 8 55
## 480 1644 5 66
## 757 1447 5 66
## 816 1478 5 28
## 840 1446 5 43
## 862 1433 5 54
## 1534 NA NA NA
## 1625 1446 5 43
## 1747 1433 5 54
## 1767 1414 4 44
## 1783 1478 5 28
## 1795 1447 5 66
## 1804 1696 5 75
## 1841 1457 5 73
## 1913 1494 5 70
## 1978 1644 5 66
## 1981 1675 5 32
I first tried taking a graph of the whole timeline before excluding the handful of entries from 2004 and before to improve readability. Vehicles with only one previous owner are more common than ones with second owners, with further numbers becoming less frequent. The earliest model year of the unregistered cars is 2012, with very slightly more occurrences in more recent years. This would suggest that the sale of unregistered vehicles became a problem starting roughly last decade.