According to the National Automobile Dealers Association (NADA) Car Price Guide, in the United States, “the number of used car transactions is approximately three times as large as the number of new car transactions. Furthermore, the dispersion of used car prices is approximately five times as large as the dispersion of new car prices, suggesting that secondary market play an important role in broadening the spectrum of goods available to consumers.” (Gavazza, 2012) Given the huge amount and wide-spreaded influence of used car market, I would like to figure out these questions: What factors are close related to the prices of used cars? How significant these factors will influence the used car prices? Can we find out a formula to predict the prices of certain used cars? By answering these questions, I hope to provide guidance for those who would like to buy a used car for their daily life.
The main data source for this project is from Kaggle – Used Car database.
Key variables include but not limited to: name (name of the car), seller (private or dealer), price (the price of the ad to sell the car), vehicle type, year or registration (at which year the car is first registered), gearbox, power PS (power of the car in the PS), model, kilometer (how many kilometers the car has driven), month of registration (at which month the car was first registered), fuel type, brand, not repaired damage (if the car has a damage which is not repaired yet), nrOfPictures (number of pictures in the ad), postal code, dateCrawled (when the ad was first crawled), and lastSeenOnline (when the crawler saw the ad last online). The dataset consists of 21 variables with more than 370,000 used car items.
The dataset consists of 21 variables with more than 370,000 used car items.
To find out the answer to the research questions, I use different ways to explore the trend and correlation in the U.S. used cars market, including mapping, line plots, linear regression, etc. But before this, the first step is cleaning the data. After the data visualization, I do the linear regression to get the correlation among variables.
## Classes 'data.table' and 'data.frame': 371824 obs. of 20 variables:
## $ dateCrawled : Factor w/ 34 levels "3/10/2016","3/11/2016",..: 15 15 5 8 22 31 28 12 31 8 ...
## $ name : Factor w/ 233698 levels "!!!!!!!!!!!!!!!!!!!!!!!!__GOLF_3_CABRIO___!!!!!!!!!!!!!!!!!!!!!!!",..: 79297 4148 90664 76682 170655 27397 145438 188524 64280 190572 ...
## $ seller : Factor w/ 2 levels "commercial","private": 2 2 2 2 2 2 2 2 2 2 ...
## $ offerType : Factor w/ 2 levels "offer","petition": 1 1 1 1 1 1 1 1 1 1 ...
## $ price : int 480 18300 9800 1500 3600 650 2200 0 14500 999 ...
## $ abtest : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
## $ vehicleType : Factor w/ 9 levels "","bus","cabrio",..: 1 4 9 8 8 6 3 6 2 8 ...
## $ yearOfRegistration : int 1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
## $ gearbox : Factor w/ 3 levels "","automatic",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ powerPS : int 0 190 163 75 69 102 109 50 125 101 ...
## $ model : Factor w/ 252 levels "","100","145",..: 120 1 121 120 105 13 10 42 63 120 ...
## $ kilometer : int 150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
## $ monthOfRegistration: int 0 5 8 6 7 10 8 7 8 0 ...
## $ fuelType : Factor w/ 8 levels "","cng","diesel",..: 8 3 3 8 3 8 8 8 8 1 ...
## $ brand : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
## $ notRepairedDamage : Factor w/ 3 levels "","No","Yes": 1 3 1 2 2 3 2 2 1 1 ...
## $ dateCreated : Factor w/ 114 levels "1/10/2016","1/13/2016",..: 87 87 75 78 95 105 102 84 105 78 ...
## $ nrOfPictures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ postalCode : int 70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
## $ lastSeen : Factor w/ 34 levels "3/10/2016","3/11/2016",..: 34 34 32 8 33 33 32 16 31 22 ...
## - attr(*, ".internal.selfref")=<externalptr>
autos$offerType <- NULL
autos$name <- NULL
autos$seller <- NULL
autos$nrOfPictures <- NULL
autos$abtest <- NULL
autos$dateCrawled <- ymd_hms(autos$dateCrawled)
autos$dateCreated <- ymd_hms(autos$dateCreated)
autos$lastSeen <- ymd_hms(autos$lastSeen)
I observed that in “Price” variable, it contents $0, $99999999, $12345678, $11111111, $1000000, which might be some random numbers key in by the seller and not an authentic used car price. I consider those rows as noise, and decide to remove those from the dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.150e+03 2.950e+03 1.729e+04 7.200e+03 2.147e+09
## 95%
## 19788.5
## 99%
## 36000
## 1%
## 0
## 5%
## 200
From the density chart and boxchart above, the price variable become more normalize after removing the top/bottom 5%. So I decide to use only the 5% - 95% range of the price prevent the noise.
auto <- auto[(auto$price > quantile(auto$price,0.05)) & (auto$price < quantile(auto$price,0.95)),]
The vehicle Type is a very important indicator for used car price, so I decide to remove data with no vehicle type.
summary(auto$vehicleType)
## bus cabrio coupe kombi limousine other
## 30857 28054 19567 15214 62749 87923 2981
## small car suv
## 74835 11582
ggplot(aes(x=auto$vehicleType, y=auto$price), data = auto) +
geom_boxplot()
Remove vehicleType is missing.
summary(auto$vehicleType)
## bus cabrio coupe kombi limousine other
## 30857 28054 19567 15214 62749 87923 2981
## small car suv
## 74835 11582
auto <- auto[auto$vehicleType != "",]
By viewing the boxplot and density chart, I decide to use only 1975-2016 as the range of yearOfRegistration.
summary(auto$yearOfRegistration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1910 1999 2003 2002 2007 2018
p1 <- ggplot(aes(x="yearOfRegistration", y=yearOfRegistration), data = auto) +
geom_boxplot()
p2 <- ggplot(aes(x="yearOfRegistration 1975-2016", y=yearOfRegistration), data = auto) +
geom_boxplot() +
ylim(1975, 2016)
grid.arrange(p1, p2, ncol = 2)
d1 <- ggplot(auto, aes(x=auto$yearOfRegistration)) + geom_density()
d2 <- ggplot(auto, aes(x=auto$yearOfRegistration)) + geom_density() +
xlim(1975, 2016)
grid.arrange(d1, d2, ncol = 2)
I removed yearOfRegistration before 1975 and after 2016.
auto <- auto[which((auto$yearOfRegistration >= 1975) &
(auto$yearOfRegistration <= 2016)),]
I decide to replace missing data with “NA” within below variables. By replace with NA but not removing those missing data, can help us remain sufficeient data for future analsis.
summary(auto)
## dateCrawled price vehicleType yearOfRegistration
## Min. :NA Min. : 202 limousine:87235 Min. :1975
## 1st Qu.:NA 1st Qu.: 1350 small car:74625 1st Qu.:1999
## Median :NA Median : 3100 kombi :62684 Median :2003
## Mean :NA Mean : 4762 bus :27991 Mean :2003
## 3rd Qu.:NA 3rd Qu.: 6900 cabrio :19334 3rd Qu.:2007
## Max. :NA Max. :19780 coupe :14886 Max. :2016
## NA's :301108 (Other) :14353
## gearbox powerPS model kilometer
## : 8490 Min. : 0.0 golf : 24158 Min. : 5000
## automatic: 59736 1st Qu.: 75.0 andere : 20712 1st Qu.:125000
## manually :232882 Median : 109.0 3er : 17779 Median :150000
## Mean : 115.9 : 10933 Mean :128657
## 3rd Qu.: 145.0 polo : 10664 3rd Qu.:150000
## Max. :20000.0 corsa : 10102 Max. :150000
## (Other):206760
## monthOfRegistration fuelType brand
## Min. : 0.000 petrol :190518 volkswagen :64060
## 1st Qu.: 3.000 diesel : 91474 bmw :32825
## Median : 6.000 : 13382 opel :32634
## Mean : 6.005 lpg : 4830 mercedes_benz:28466
## 3rd Qu.: 9.000 cng : 503 audi :25988
## Max. :12.000 hybrid : 213 ford :20841
## (Other): 188 (Other) :96294
## notRepairedDamage dateCreated postalCode lastSeen
## : 45225 Min. :NA Min. : 1067 Min. :NA
## No :227677 1st Qu.:NA 1st Qu.:30827 1st Qu.:NA
## Yes: 28206 Median :NA Median :49824 Median :NA
## Mean :NA Mean :50998 Mean :NA
## 3rd Qu.:NA 3rd Qu.:71579 3rd Qu.:NA
## Max. :NA Max. :99998 Max. :NA
## NA's :301108 NA's :301108
auto$gearbox[auto$gearbox == ""] <- NA
auto$model[auto$model == ""] <- NA
auto$monthOfRegistration[auto$monthOfRegistration == "0"] <- NA
auto$fuelType[auto$fuelType == ""] <- NA
auto$brand[auto$brand == ""] <- NA
auto$notRepairedDamage[auto$notRepairedDamage == ""] <- NA
summary(auto)
## dateCrawled price vehicleType yearOfRegistration
## Min. :NA Min. : 202 limousine:87235 Min. :1975
## 1st Qu.:NA 1st Qu.: 1350 small car:74625 1st Qu.:1999
## Median :NA Median : 3100 kombi :62684 Median :2003
## Mean :NA Mean : 4762 bus :27991 Mean :2003
## 3rd Qu.:NA 3rd Qu.: 6900 cabrio :19334 3rd Qu.:2007
## Max. :NA Max. :19780 coupe :14886 Max. :2016
## NA's :301108 (Other) :14353
## gearbox powerPS model kilometer
## : 0 Min. : 0.0 golf : 24158 Min. : 5000
## automatic: 59736 1st Qu.: 75.0 andere : 20712 1st Qu.:125000
## manually :232882 Median : 109.0 3er : 17779 Median :150000
## NA's : 8490 Mean : 115.9 polo : 10664 Mean :128657
## 3rd Qu.: 145.0 corsa : 10102 3rd Qu.:150000
## Max. :20000.0 (Other):206760 Max. :150000
## NA's : 10933
## monthOfRegistration fuelType brand
## Min. : 1.0 petrol :190518 volkswagen :64060
## 1st Qu.: 4.0 diesel : 91474 bmw :32825
## Median : 6.0 lpg : 4830 opel :32634
## Mean : 6.4 cng : 503 mercedes_benz:28466
## 3rd Qu.: 9.0 hybrid : 213 audi :25988
## Max. :12.0 (Other): 188 ford :20841
## NA's :18590 NA's : 13382 (Other) :96294
## notRepairedDamage dateCreated postalCode lastSeen
## : 0 Min. :NA Min. : 1067 Min. :NA
## No :227677 1st Qu.:NA 1st Qu.:30827 1st Qu.:NA
## Yes : 28206 Median :NA Median :49824 Median :NA
## NA's: 45225 Mean :NA Mean :50998 Mean :NA
## 3rd Qu.:NA 3rd Qu.:71579 3rd Qu.:NA
## Max. :NA Max. :99998 Max. :NA
## NA's :301108 NA's :301108
I can tell from the graph that Limousine, small car and kombi are the most popular type in the dataset.
ggplot(auto, aes(x=vehicleType)) +
geom_bar(fill= I('#6edeea'), color='black') +
labs(x= 'Vehicle Type', y= 'number of cars',title='Vehicle Type Frequency Distribuation')
I can tell from the chart that the Kilometer is concentrated at 150000. This value might be the upper limit of the used car dataset. So all value above will be consider as 150000.
ggplot(auto, aes(auto$kilometer)) +
geom_histogram(color='black', fill= I('#a67cb7'),binwidth=20000) +
labs(x= 'Kilometer', y='Number of Cars', title='Kilometer Histogram')
Most of used car are using petrol or diesel in this dataset.
ggplot(aes(x=fuelType),data=auto) +
geom_bar(aes(fill=fuelType), color='black') +
labs(x= 'Fuel Type', y='Number of Cars', title= 'Fuel Type Frequency Chart')
There are too many Brand in the dataset. But just the top 10 brands contribute 80% to the dataset. In order to see meaningful chart on the Brand, our team deicde only shown the top 10 brands frequency distribution.
top10 <- auto %>%
count(brand) %>%
top_n(10) %>%
arrange(n, brand) %>%
mutate(brand = factor(brand, levels = unique(brand)))
sumtop10 <-sum(top10$n)
top10
## # A tibble: 10 x 2
## brand n
## <fctr> <int>
## 1 seat 5805
## 2 fiat 7747
## 3 peugeot 9691
## 4 renault 14626
## 5 ford 20841
## 6 audi 25988
## 7 mercedes_benz 28466
## 8 opel 32634
## 9 bmw 32825
## 10 volkswagen 64060
Most of used car is volkswagen in the dataset.
auto_subbrand <- subset (auto, brand %in% c("seat", "fiat","peugeot" ,"renault", "ford", "audi", "mercedes_benz" , "bmw", "opel", "volkswagen"))
ggplot(auto_subbrand, aes(x = reorder(brand, -table(brand)[brand]))) +
geom_bar(fill='darkblue', color='black') +
labs(x= 'Top 10 Brands', y= 'Number of Cars', title= 'Top 10 Brands Frequency Diagram')
I find that the most expensive vehicle type is SUV and the cheapest is kleinwagen. However, kleinwagen have many outliers.
There is a linear correlation between PowerPS and price.
I add a new variable here: age, which means the years between cars first registered and the years they sold. Price keeps decreasing between 0-20 years (age) but starts to increase after 20 years (between 20-30 years).
Fuel type seems do not to have significant influence to the correlation between price and gearbox.
I have divided the dataset into 2 parts: training data and validation data. I would like to run regression model on the training data, and use the model results to predict validation data afterwards.
## [1] "dateCrawled" "price" "vehicleType"
## [4] "yearOfRegistration" "gearbox" "powerPS"
## [7] "model" "kilometer" "monthOfRegistration"
## [10] "fuelType" "brand" "notRepairedDamage"
## [13] "dateCreated" "postalCode" "lastSeen"
## [16] "age" "bus" "cabrio"
## [19] "coupe" "kleinwagen" "kombi"
## [22] "limousine" "suv" "automatik"
## [25] "manuell" "benzin" "diesel"
## [28] "lpg"
## num [1:210775] 15 14 15 15 5 15 13 17 6 3 ...
##
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio +
## coupe + kleinwagen + kombi + limousine + suv + automatik +
## manuell + benzin + diesel + lpg, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44514 -1877 -516 1181 23151
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.150e+04 2.872e+01 400.326 < 2e-16 ***
## kilometer -3.095e-02 2.187e-04 -141.519 < 2e-16 ***
## age_num -3.667e+02 1.421e+00 -258.147 < 2e-16 ***
## powerPS 2.221e+00 3.989e-02 55.681 < 2e-16 ***
## bus 1.697e+03 2.809e+01 60.404 < 2e-16 ***
## cabrio 4.245e+03 3.075e+01 138.039 < 2e-16 ***
## coupe 4.139e+03 3.445e+01 120.138 < 2e-16 ***
## kleinwagen NA NA NA NA
## kombi 1.712e+03 2.223e+01 77.043 < 2e-16 ***
## limousine 2.252e+03 1.949e+01 115.583 < 2e-16 ***
## suv 4.731e+03 3.892e+01 121.547 < 2e-16 ***
## automatik NA NA NA NA
## manuell NA NA NA NA
## benzin NA NA NA NA
## diesel 1.534e+03 1.781e+01 86.114 < 2e-16 ***
## lpg 1.483e+02 5.443e+01 2.725 0.00644 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3089 on 196588 degrees of freedom
## (14175 observations deleted due to missingness)
## Multiple R-squared: 0.5291, Adjusted R-squared: 0.5291
## F-statistic: 2.008e+04 on 11 and 196588 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ kilometer + age_num + powerPS + bus + cabrio +
## coupe + +kombi + limousine + suv + diesel + lpg, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46583 -1872 -518 1171 23118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.143e+04 2.837e+01 402.886 < 2e-16 ***
## kilometer -3.081e-02 2.151e-04 -143.264 < 2e-16 ***
## age_num -3.643e+02 1.399e+00 -260.415 < 2e-16 ***
## powerPS 2.326e+00 3.968e-02 58.625 < 2e-16 ***
## bus 1.721e+03 2.770e+01 62.126 < 2e-16 ***
## cabrio 4.212e+03 3.032e+01 138.892 < 2e-16 ***
## coupe 4.107e+03 3.408e+01 120.528 < 2e-16 ***
## kombi 1.701e+03 2.189e+01 77.714 < 2e-16 ***
## limousine 2.228e+03 1.920e+01 116.026 < 2e-16 ***
## suv 4.721e+03 3.843e+01 122.833 < 2e-16 ***
## diesel 1.487e+03 1.754e+01 84.754 < 2e-16 ***
## lpg 1.456e+02 5.390e+01 2.701 0.00692 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3085 on 201464 degrees of freedom
## (9299 observations deleted due to missingness)
## Multiple R-squared: 0.5258, Adjusted R-squared: 0.5258
## F-statistic: 2.031e+04 on 11 and 201464 DF, p-value: < 2.2e-16
## Response variable: price
## Total response variance: 20073465
## Analysis based on 201476 observations
##
## 11 Regressors:
## kilometer age_num powerPS bus cabrio coupe kombi limousine suv diesel lpg
## Proportion of variance explained by model: 52.58%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## kilometer 0.1903063625
## age_num 0.4940801484
## powerPS 0.0337493254
## bus 0.0079067324
## cabrio 0.0504084411
## coupe 0.0301465268
## kombi 0.0102715681
## limousine 0.0201188747
## suv 0.0594323878
## diesel 0.1031269834
## lpg 0.0004526496
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs
## kilometer -0.04664638 -0.04501501 -0.04336278 -0.04170257
## age_num -465.95991639 -458.29559983 -449.83679638 -440.63000609
## powerPS 4.13883719 4.00587169 3.86855812 3.72597454
## bus 818.57117134 732.47004908 665.11708790 620.80212931
## cabrio 2192.44574709 2311.96801943 2438.58949334 2574.77654106
## coupe 1333.48267538 1541.86570739 1753.54173354 1970.72389326
## kombi 317.53163609 305.33249256 311.07933650 338.11959426
## limousine 19.08155833 171.51638365 328.76234629 492.63634391
## suv 4437.00303211 4289.75350565 4161.25871715 4055.74725392
## diesel 2721.41732736 2645.94710162 2568.87490528 2488.51718860
## lpg -616.85645476 -457.20375020 -307.16862102 -168.39438942
## 5Xs 6Xs 7Xs 8Xs
## kilometer -0.04004788 -0.03841292 -0.03681191 -0.03525693
## age_num -430.72884023 -420.20113733 -409.14199177 -397.69712795
## powerPS 3.57675095 3.41884960 3.24924076 3.06344710
## bus 605.30371435 626.25076064 693.50287722 819.53578316
## cabrio 2723.58846520 2888.95001751 3075.98247040 3291.42295870
## coupe 2196.48167206 2435.05292201 2692.22106960 2975.78924981
## kombi 390.88058307 475.13091089 598.26728700 769.61977883
## limousine 665.83804639 852.17765739 1056.83484933 1286.65475383
## suv 3978.65475071 3937.18402506 3940.94053438 4002.60921085
## diesel 2402.52902784 2307.68964445 2199.62518223 2072.46621363
## lpg -42.98651731 66.19297946 155.09591529 217.88311141
## 9Xs 10Xs 11Xs
## kilometer -0.03375289 -0.03228762 -0.03081419
## age_num -386.10354086 -374.75741913 -364.32289198
## powerPS 2.85489051 2.61391759 2.32627561
## bus 1019.79424167 1312.95095473 1720.98898808
## cabrio 3544.15045970 3845.83187361 4211.70874102
## coupe 3296.18035689 3667.20135284 4107.03815583
## kombi 1000.74934283 1305.69193828 1701.08699894
## limousine 1550.47853416 1859.50046504 2227.64372298
## suv 4138.60512797 4369.60051767 4720.80995921
## diesel 1918.45096837 1727.50208107 1486.82568697
## lpg 246.10483770 227.65618536 145.56399356
First, I identify from the data graph that Limousine, small car and kombi are the most popular car type in the dataset. And the most expensive car type is SUV, the most cheapest car type is small car. Second, the most common car brands in the dataset are Volkswagen, Opel, BMW, Mercedes, Audi, Ford, Renault, Peugeot, Fiat and Seat. The top 10 brands contribute to 80% of the data. Also, according to the regression model, and car type, lpg and the age are the most important factors to predict used car price. In the mean time, I find that most of kilometer in the data set is 150,000. This might be the upper limit kilometer of data source platform. The kilometer above 150,000 will be considered 150,000. With the help of visualization, I read a story that most of the second-hand car transactions are made within 35 days, with a high selling ration for the first 10 days including the same day sale for day 0. It implies that Ebay-Kleinanzeigen does a very good job at marketing, especially searching and satisfying the requirement of customers. Also, which is beyond our preconception is that second-hand car market is very fluid. Specially, With the age of the cars increasing, the price continues to drop down and stops at 20 year, however, what is considered a little bit weird is that the price would increase between 20-30. The possible explanation for this maybe that antique emotional attachment could be found in second-hand car aged more than 20 years.
Gavazza, A., Lizzeri, A., & Roketskiy, N. (2012). A Quantitative Analysis of the Used-Car Market. SSRN Electronic Journal. doi:10.2139/ssrn.2046363
Dallegro, J. A. (2014, September 03). Just What Factors Into The Value Of Your Used Car? Retrieved from https://www.investopedia.com/articles/investing/090314/just-what-factors-value-your-used-car.asp
Abigail, B. (n.d.). The Three Key Things that Affect the Price of a Used Car. Retrieved from https://instamotor.com/sell-car/car-value/3-key-things-that-affect-the-price-of-a-used-car