Importing

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
used_cars <- read.csv("C:/Users/toyha/Downloads/vehicle/car details v4.csv")

Group By & Probability Analysis

For this notebook, I will be using the “group by” function to analyze the vehicle dataset, as well as use principles of probability. The primary attribute I’m interested in is the selling price of each vehicle.

library(ggthemes)
uc_grp_year <- used_cars |> group_by(Year) |> summarise(average_price = mean(Price))
uc_grp_year$count <- used_cars |> count(Year)
year_count <- used_cars |> count(Year, sort = TRUE)
ggplot(data = uc_grp_year, aes(x = Year, y = average_price, group = 1)) +
  labs(title = "Average selling prices of vehicles by Year (1988-2022)", x = "Year", y =  "Average Selling Price (INR)") +
  geom_line(color = "blue") +
  geom_point()

uc_grp_year
## # A tibble: 22 × 3
##     Year average_price count$Year    $n
##    <int>         <dbl>      <int> <int>
##  1  1988      1300000        1988     1
##  2  1996       130000        1996     1
##  3  2000      1200000        2000     1
##  4  2002       100000        2002     1
##  5  2004        71001        2004     1
##  6  2006       220000        2006     2
##  7  2007       738833.       2007     6
##  8  2008       214000        2008    13
##  9  2009       313030.       2009    33
## 10  2010       508037.       2010    27
## # ℹ 12 more rows
year_count 
##    Year   n
## 1  2018 268
## 2  2017 262
## 3  2019 218
## 4  2014 192
## 5  2016 187
## 6  2015 178
## 7  2021 156
## 8  2020 132
## 9  2013 128
## 10 2012  92
## 11 2022  81
## 12 2011  79
## 13 2009  33
## 14 2010  27
## 15 2008  13
## 16 2007   6
## 17 2006   2
## 18 1988   1
## 19 1996   1
## 20 2000   1
## 21 2002   1
## 22 2004   1

This graph illustrates the average selling price of vehicles in India from 1988 to 2022. There was a fall after 1988 before a spike happened in 2000 before descending again, and after that the average selling prices of vehicles per year increased at a roughly exponential rate. To get the full story, we will need to look at how many many different vehicles there are.

ggplot(uc_grp_year, aes(x = Year, y = count$n, group = 1)) +
  geom_line(stat="identity") + labs(title = "Number of Unique Car Models by Year", x = "Year", y =  "# of Unique Models") + geom_point()

From the bar graph, there were very very few unique models produced before the year 2006, and afterwards a gradual increase leading to a sharp increase until the year 2018, and then a decrease in unique models afterwards. This could possibly indicate a lack of interest in selling older vehicles at used car outlets. Combined with the previous data, I can deduce that older vehicles are seen as less desirable compared to newer vehicles, and are thus sold less often and for cheaper at used car dealerships.

uc_grp_loc <- used_cars |> group_by(Location) |> summarise(average_price = mean(Price))
#uc_grp_loc$count <- used_cars |> count(Location)
loc_count <- used_cars |> count(Location, sort = TRUE)
ggplot(uc_grp_loc, aes(x = reorder(Location, -average_price), y = average_price, group = 1)) +
  geom_bar(stat="identity", width = 0.75) + labs(title = "Average selling prices of vehicles by Location", x = "Location", y =  "Average Selling Price (INR)") + coord_flip() + theme(axis.text.y=element_text(size=5))

uc_grp_loc
## # A tibble: 77 × 2
##    Location     average_price
##    <chr>                <dbl>
##  1 Agra               788895.
##  2 Ahmedabad         1507257.
##  3 Allahabad          525000 
##  4 Ambala Cantt       418333.
##  5 Amritsar           483750 
##  6 Aurangabad         843750 
##  7 Bangalore         2028720.
##  8 Bhopal             575714.
##  9 Bhubaneswar        827500 
## 10 Bulandshahar       317000 
## # ℹ 67 more rows
loc_count
##            Location   n
## 1            Mumbai 342
## 2             Delhi 307
## 3              Pune 144
## 4         Bangalore 132
## 5         Hyderabad 116
## 6           Lucknow  78
## 7         Ahmedabad  70
## 8           Chennai  63
## 9           Kolkata  60
## 10           Kanpur  52
## 11         Ludhiana  47
## 12            Patna  47
## 13           Mohali  40
## 14          Gurgaon  30
## 15       Coimbatore  29
## 16           Jaipur  29
## 17       Chandigarh  28
## 18         Dehradun  28
## 19           Ranchi  28
## 20        Faridabad  27
## 21         Varanasi  24
## 22        Jalandhar  22
## 23             Agra  19
## 24      Navi Mumbai  19
## 25           Raipur  19
## 26            Thane  19
## 27            Noida  15
## 28         Zirakpur  15
## 29           Karnal  13
## 30     Ambala Cantt  12
## 31       Jamshedpur  12
## 32           Nashik  11
## 33            Surat  10
## 34            Udupi   9
## 35           Meerut   8
## 36           Mysore   8
## 37           Bhopal   7
## 38         Guwahati   7
## 39           Indore   7
## 40      Yamunanagar   6
## 41        Mangalore   5
## 42        Panchkula   5
## 43            Salem   5
## 44         Vadodara   5
## 45        Allahabad   4
## 46         Amritsar   4
## 47       Aurangabad   4
## 48      Bhubaneswar   4
## 49              Goa   4
## 50         Mirzapur   4
## 51          Dharwad   3
## 52        Ghaziabad   3
## 53           Nagpur   3
## 54           Panvel   3
## 55          Roorkee   3
## 56         Rudrapur   3
## 57         Warangal   3
## 58     Bulandshahar   2
## 59     Dak. Kannada   2
## 60          Deoghar   2
## 61        Ernakulam   2
## 62        Gorakhpur   2
## 63         Haldwani   2
## 64           Kharar   2
## 65            Kheda   2
## 66           Kollam   2
## 67             Kota   2
## 68      Muzaffurpur   2
## 69           Purnea   2
## 70      Ranga Reddy   2
## 71       Samastipur   2
## 72            Unnao   2
## 73         Faizabad   1
## 74 Pimpri-Chinchwad   1
## 75           Rohtak   1
## 76         Siliguri   1
## 77           Valsad   1

As this graph illustrates, there are a LOT of data points in the location region. This data would be more useful if we concentrate it into a subset, but I do notice that Ghaziabad has the highest average selling price and Unnao has the lowest average selling price.

From the above table, we can see that the top 5 most frequently occuring locations in the dataset (in descending order) are Mumbai, Delhi, Pune, Bangalore, and Hyderabad, and the bottom 5 (all tied with only 1 vehicle) are Faizabad, Pimpri-Chinchwad, Rohtak, Siliguri, and Valsad. I noticed that the more frequent locations tend to be big, heavily-urban cities, while the least frequent locations tend to be either historic cities or heavily-forested areas. That seems to about match up with the relative utility of having a used car dealership in those locations.

uc_grp_own <- used_cars |> group_by(Owner) |> summarise(average_price = mean(Price))
#uc_grp_own$count <- used_cars |> count(Owner)
own_count <- used_cars |> count(Owner, sort = TRUE)
ggplot(uc_grp_own, aes(x = Owner, y = average_price, group = 1)) +
  geom_bar(stat="identity") + labs(title = "Average selling prices of vehicles by Number of Previous Owners", x = "Location", y =  "# of Previous Owners")

uc_grp_own
## # A tibble: 6 × 2
##   Owner            average_price
##   <chr>                    <dbl>
## 1 4 or More              120000 
## 2 First                 1714159.
## 3 Fourth                 163333.
## 4 Second                1438051.
## 5 Third                 1981929.
## 6 UnRegistered Car      5285333.
own_count
##              Owner    n
## 1            First 1619
## 2           Second  373
## 3            Third   42
## 4 UnRegistered Car   21
## 5           Fourth    3
## 6        4 or More    1

The mean price of unregistered cars in the dataset is much higher than all the other entries at over 5 million rupees, with cars that had 1, 2, or 3 owners having only over 1 million rupees. Since driving an unregistered vehicle is illegal in India, I’m questioning the legality of some of these used vehicle outlets… The vast majority of cars had one previous owner, with the number decreasing at a roughly logarithmic rate until it’s only 1 digit for four and four or more previous owners. There are only 21 unregistered vehicles in the dataset, which suggests that the few out there are very much expensive. Back to the mean price of each amount of owners, the price stays relatively stable with 1, 2, and 3 previous owners before dropping after 4 previous owners. This would seem to correlate with used car owners spending less on more “well-used” vehicles.

Oddly, there is both a “Fourth” value and a “4 or more” value. Since the latter would logically include the former, the former should be considered the same value as the latter.

Lowest Probability

Of each attribute compared to the mean price of the vehicles associated with them, the least frequent are very, very infrequent. Of each vehicle’s model year, there is only 1 each of the years 1988, 1996, 2000, 2002, and 2004. There is only 1 used car sold in each of Faizabad, Pimpri-Chinchwad, Rohtak, Siliguri, and Valsad. Lastly, only 4 car in the dataset has had four or more owners (as mentioned before, “fourth” and “4 or more” have been combined). There are 2059 vehicles total in the dataset, so if you picked one at random, the chance of having one with any of these values is very small.

lowprob_year <- used_cars |> filter(Year %in% c(1988, 1996, 2000, 2002, 2004))
lowprob_loc <- used_cars |> filter(Location %in% c("Faizabad", "Pimpri-Chinchwad", "Rohtak", "Siliguri", "Valsad"))
lowprob_own <- used_cars |> filter(Owner == "Fourth" | Owner == "4 or More")
lowprob_year
##            Make                    Model   Price Year Kilometer    Fuel.Type
## 1 Maruti Suzuki            Zen LXi BS-II   71001 2004     65082       Petrol
## 2         Honda             City 1.5 EXi  100000 2002     65494       Petrol
## 3 Mercedes-Benz E-Class E 200 Avantgarde 1300000 1988     71000       Petrol
## 4 Mercedes-Benz           S-Class 280 AT 1200000 2000     65000       Petrol
## 5 Maruti Suzuki                   Zen LX  130000 1996    100000 Petrol + LPG
##   Transmission   Location  Color  Owner Seller.Type  Engine          Max.Power
## 1       Manual    Kolkata Yellow  First  Individual                           
## 2       Manual     Ranchi  Beige  First  Individual                           
## 3    Automatic   Dehradun    Red  Third  Individual 1991 cc 181 bhp @ 5500 rpm
## 4    Automatic   Dehradun Yellow Second  Individual                           
## 5       Manual Coimbatore  White Second  Individual                           
##          Max.Torque Drivetrain Length Width Height Seating.Capacity
## 1                                  NA    NA     NA               NA
## 2                                  NA    NA     NA               NA
## 3 300 Nm @ 1200 rpm        RWD   5063  1860   1494                5
## 4                                  NA    NA     NA               NA
## 5                                  NA    NA     NA               NA
##   Fuel.Tank.Capacity
## 1                 NA
## 2                 NA
## 3                 80
## 4                 NA
## 5                 NA
lowprob_loc
##            Make                             Model  Price Year Kilometer
## 1          Tata                  Nexon XMA Petrol 580000 2018     81000
## 2    Volkswagen                         Vento TSI 650000 2017     51000
## 3       Hyundai Eon 1.0 Kappa Magna + [2014-2016] 240000 2013     64000
## 4 Maruti Suzuki          Alto LXi CNG [2014-2018] 330000 2017     63200
## 5 Maruti Suzuki                    Ritz VXI BS-IV 204999 2011     42152
##   Fuel.Type Transmission         Location Color Owner Seller.Type  Engine
## 1    Petrol    Automatic           Valsad   Red First  Individual 1198 cc
## 2    Petrol    Automatic           Rohtak White First  Individual 1197 cc
## 3    Petrol       Manual         Faizabad White First  Individual  998 cc
## 4       CNG       Manual Pimpri-Chinchwad  Grey First  Individual  998 cc
## 5    Petrol       Manual         Siliguri Brown First   Corporate        
##            Max.Power        Max.Torque Drivetrain Length Width Height
## 1 108 bhp @ 5000 rpm 170 Nm @ 1750 rpm        FWD   3994  1811   1607
## 2 103 bhp @ 5000 rpm 175 Nm @ 1500 rpm        FWD   4384  1699   1466
## 3  68 bhp @ 6200 rpm  94 Nm @ 3500 rpm        FWD   3515  1500   1510
## 4  58 bhp @ 6000 rpm  78 Nm @ 3500 rpm        FWD   3545  1490   1475
## 5                                                     NA    NA     NA
##   Seating.Capacity Fuel.Tank.Capacity
## 1                5                 44
## 2                5                 55
## 3                5                 32
## 4                4                 60
## 5               NA                 NA
lowprob_own
##            Make          Model  Price Year Kilometer Fuel.Type Transmission
## 1       Hyundai  Santro GL LPG 120000 2009     48500       LPG       Manual
## 2      Mahindra  Xylo D2 BS-IV 210000 2012     85000    Diesel       Manual
## 3 Maruti Suzuki Ritz Vdi BS-IV 120000 2009     80000    Diesel       Manual
## 4        Toyota   Innova 2.5 E 160000 2008     90000    Diesel       Manual
##   Location  Color     Owner Seller.Type  Engine Max.Power Max.Torque Drivetrain
## 1  Lucknow   Grey    Fourth  Individual 1086 cc   63@5500    89@3000           
## 2   Kanpur  White    Fourth  Individual 2489 cc   95@3600   220@1400           
## 3   Kanpur Silver 4 or More  Individual 1248 cc   75@4000   190@2000           
## 4 Varanasi  White    Fourth  Individual 2494 cc  102@5600   200@3400           
##   Length Width Height Seating.Capacity Fuel.Tank.Capacity
## 1   3565  1525   1590                5                 35
## 2   4520  1850   1895                7                 55
## 3   3715  1680   1620                5                 43
## 4   4555  1770   1755                8                 55

All the cars with four or more owners have relatively low prices of 120 thousand to 210 thousand rupees (1,430.40 to 2,503.20 USD). As mentioned before, vehicle prices appear to drop off sharply after they have four or more owners. I therefore hypothesize that a used vehicle’s value stays mostly constant after one to three owners before dropping by an order of magnitude (division by 10) after it acquires four owners. There are more factors that go into how “desirable” used vehicles are including maintenance and features, but this is outside the scope of the dataset.

Combining Categories

A vehicle’s location where it was sold and its model year seem fairly intuitive to combine.

year_own <- used_cars |> group_by(Year, Owner) |> count(sort = TRUE)
#ggplot(year_own, aes(Year, n, group = Owner, color = Owner)) + labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (1988-2022)", x = "Year", y = "# of vehicles") +  geom_line()
new_year_own <- subset(year_own, Year > 2004)
ggplot(new_year_own, aes(Year, n, group = Owner, color = Owner)) + labs(title = "# of Vehicles by Year of Manufacture and # of Previous Owners (2004-2022)", x = "Year", y = "# of vehicles") +  geom_line()

year_own
## # A tibble: 60 × 3
## # Groups:   Year, Owner [60]
##     Year Owner     n
##    <int> <chr> <int>
##  1  2018 First   230
##  2  2017 First   217
##  3  2019 First   203
##  4  2016 First   150
##  5  2021 First   147
##  6  2015 First   128
##  7  2014 First   124
##  8  2020 First   124
##  9  2013 First    86
## 10  2022 First    70
## # ℹ 50 more rows
new_year_own
## # A tibble: 55 × 3
## # Groups:   Year, Owner [55]
##     Year Owner     n
##    <int> <chr> <int>
##  1  2018 First   230
##  2  2017 First   217
##  3  2019 First   203
##  4  2016 First   150
##  5  2021 First   147
##  6  2015 First   128
##  7  2014 First   124
##  8  2020 First   124
##  9  2013 First    86
## 10  2022 First    70
## # ℹ 45 more rows
unregistered_cars <- subset(used_cars, Owner == "UnRegistered Car")
unregistered_cars
##               Make                        Model    Price Year Kilometer
## 140           Audi      Q5 45 TFSI Premium Plus  5651000 2022         1
## 195           Audi   Q3 2.0 TDI quattro Premium  1350000 2012     72000
## 223           Audi  A6 2.0 TFSi Technology Pack  7200000 2021     20000
## 363          Honda           City ZX CVT Petrol  1700000 2022      8530
## 390         Toyota                 Innova 2.0 V   575000 2013     82000
## 480  Mercedes-Benz  GLC 220d 4MATIC Progressive  6400000 2022      5500
## 757  Mercedes-Benz           C-Class 220 CDI AT  5400000 2022      4600
## 816        Renault     Kwid 1.0 RXT [2016-2019]   420000 2017     43000
## 840  Mercedes-Benz           A-Class Sedan 200d  4100000 2022      3826
## 862           Audi      A4 Premium Plus 40 TFSI  4200000 2021     12321
## 1534 Mercedes-Benz                  E-Class 200  7050000 2022      1102
## 1625 Mercedes-Benz           A-Class Sedan 200d  4200000 2021      5600
## 1747          Audi      A4 Premium Plus 40 TFSI  4151000 2022         1
## 1767          MINI         Cooper JCW Hatchback  5200000 2022         0
## 1783       Renault Kwid CLIMBER 1.0 [2017-2019]   375000 2017     45000
## 1795 Mercedes-Benz           C-Class 220 CDI AT  5300000 2022     12500
## 1804       Porsche                 Cayenne Base 14900000 2021      7534
## 1841          Audi        A6 Technology 45 TFSI  5800000 2021      7935
## 1913 Mercedes-Benz   S-Class S 350D [2018-2020] 18500000 2021      5000
## 1978 Mercedes-Benz  GLC 220d 4MATIC Progressive  8100000 2022      5500
## 1981 Maruti Suzuki  Wagon R VXi 1.0 [2019-2019]   420000 2018     50000
##      Fuel.Type Transmission    Location  Color            Owner Seller.Type
## 140     Petrol    Automatic       Delhi   Blue UnRegistered Car  Individual
## 195     Diesel    Automatic   Hyderabad  White UnRegistered Car  Individual
## 223     Petrol    Automatic   Bangalore  White UnRegistered Car  Individual
## 363     Petrol    Automatic        Pune  White UnRegistered Car  Individual
## 390     Petrol       Manual    Varanasi   Gold UnRegistered Car  Individual
## 480     Diesel    Automatic   Hyderabad  White UnRegistered Car  Individual
## 757     Diesel    Automatic   Bangalore  White UnRegistered Car  Individual
## 816     Petrol       Manual   Hyderabad   Grey UnRegistered Car  Individual
## 840     Diesel    Automatic   Ahmedabad  White UnRegistered Car  Individual
## 862     Petrol    Automatic       Delhi  Black UnRegistered Car  Individual
## 1534    Petrol       Manual   Bangalore  Black UnRegistered Car  Individual
## 1625    Diesel    Automatic       Delhi   Grey UnRegistered Car  Individual
## 1747    Petrol    Automatic       Delhi  Black UnRegistered Car  Individual
## 1767    Petrol    Automatic   Ahmedabad Yellow UnRegistered Car  Individual
## 1783    Petrol       Manual Bhubaneswar   Grey UnRegistered Car  Individual
## 1795    Diesel    Automatic   Hyderabad  White UnRegistered Car  Individual
## 1804    Petrol    Automatic      Mumbai   Blue UnRegistered Car  Individual
## 1841    Petrol    Automatic       Surat  Brown UnRegistered Car  Individual
## 1913    Diesel    Automatic        Pune  White UnRegistered Car  Individual
## 1978    Diesel    Automatic   Hyderabad  White UnRegistered Car  Individual
## 1981    Petrol       Manual Bhubaneswar  White UnRegistered Car  Individual
##       Engine          Max.Power        Max.Torque Drivetrain Length Width
## 140  1984 cc 248 bhp @ 5000 rpm 370 Nm @ 1600 rpm        AWD   4663  1898
## 195  1968 cc 174 bhp @ 4200 rpm 380 Nm @ 1750 rpm        AWD   4385  1831
## 223  1984 cc 180 bhp @ 4000 rpm 320 Nm @ 1500 rpm        FWD   4915  1874
## 363  1498 cc 119 bhp @ 6600 rpm 145 Nm @ 4300 rpm        FWD   4549  1748
## 390  1998 cc           136@5600          182@4000              4555  1770
## 480  1950 cc 192 bhp @ 3800 rpm 400 Nm @ 1600 rpm        AWD   4658  1890
## 757  2148 cc           170@3700          410@2000        RWD   4596  1770
## 816   999 cc  67 bhp @ 5500 rpm  91 Nm @ 4250 rpm        FWD   3679  1579
## 840  1950 cc 147 bhp @ 3400 rpm 320 Nm @ 1400 rpm        FWD   4549  1796
## 862  1984 cc 188 bhp @ 4200 rpm 320 Nm @ 1450 rpm        FWD   4762  1847
## 1534                                                             NA    NA
## 1625 1950 cc 147 bhp @ 3400 rpm 320 Nm @ 1400 rpm        FWD   4549  1796
## 1747 1984 cc 188 bhp @ 4200 rpm 320 Nm @ 1450 rpm        FWD   4762  1847
## 1767 1998 cc 228 bhp @ 5200 rpm 320 Nm @ 1450 rpm        FWD   3850  1727
## 1783  999 cc  67 bhp @ 5500 rpm  91 Nm @ 4250 rpm        FWD   3679  1579
## 1795 2148 cc           170@3700          410@2000        RWD   4596  1770
## 1804 2995 cc 335 bhp @ 5300 rpm 450 Nm @ 1340 rpm        AWD   4918  1983
## 1841 1984 cc 241 bhp @ 5000 rpm 370 Nm @ 1600 rpm        FWD   4939  1886
## 1913 2925 cc 282 bhp @ 3600 rpm 600 Nm @ 1600 rpm        RWD   5246  1899
## 1978 1950 cc 192 bhp @ 3800 rpm 400 Nm @ 1600 rpm        AWD   4658  1890
## 1981                                                           3655  1620
##      Height Seating.Capacity Fuel.Tank.Capacity
## 140    1659                5                 70
## 195    1608                5                 64
## 223    1455                5                 75
## 363    1489                5                 40
## 390    1755                8                 55
## 480    1644                5                 66
## 757    1447                5                 66
## 816    1478                5                 28
## 840    1446                5                 43
## 862    1433                5                 54
## 1534     NA               NA                 NA
## 1625   1446                5                 43
## 1747   1433                5                 54
## 1767   1414                4                 44
## 1783   1478                5                 28
## 1795   1447                5                 66
## 1804   1696                5                 75
## 1841   1457                5                 73
## 1913   1494                5                 70
## 1978   1644                5                 66
## 1981   1675                5                 32

I first tried taking a graph of the whole timeline before excluding the handful of entries from 2004 and before to improve readability. Vehicles with only one previous owner are more common than ones with second owners, with further numbers becoming less frequent. The earliest model year of the unregistered cars is 2012, with very slightly more occurrences in more recent years. This would suggest that the sale of unregistered vehicles became a problem starting roughly last decade.