https://www.kaggle.com/datasets/vishakhdapat/price-of-used-toyota-corolla-cars
Price of used Toyota Corolla Cars. Data is from a Toyota dealership. It’s not a proper data for generalizing it to a population as the source says that these are the cars sold “for all previous time” in one of many dealerships, so it’s a measurement during a long time. And it’s a dealership data from which can’t be generalized, for instance, to other dealerships. At the same time the age is indicated as of August 2004, so it’s unclear.
If it’s allowed, I would assume that it’s a simple random sample of Toyota Corolla used cars sold within one month in Europe.
Unit: Used Toyota Corolla car
Price: numeric, in Euro (it’s not mentioned in the datadet description)
Age: Years, numeric
KM: Mileage in kilometers.
Automatic: binary. Transmission: 0 — Manual, 1 — Automatic.
I will use these variables in the regression model. Horse power could be also used, but I have choosen transmission to have a dummy variable in regression model.
Research question: Can the price of a used Toyota Corolla car in Europe be explained by its age, mileage and type of transmission?
tdata <- read.table("./ToyotaCorolla.csv", header = TRUE, sep = ",", dec = ".", quote = "\"")
head(tdata)
## Id Model Price Age_08_04 Mfg_Month
## 1 1 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13500 23 10
## 2 2 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13750 23 10
## 3 3 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13950 24 9
## 4 4 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 14950 26 7
## 5 5 TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 13750 30 3
## 6 6 TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 12950 32 1
## Mfg_Year KM Fuel_Type HP Met_Color Color Automatic CC Doors Cylinders
## 1 2002 46986 Diesel 90 1 Blue 0 2000 3 4
## 2 2002 72937 Diesel 90 1 Silver 0 2000 3 4
## 3 2002 41711 Diesel 90 1 Blue 0 2000 3 4
## 4 2002 48000 Diesel 90 0 Black 0 2000 3 4
## 5 2002 38500 Diesel 90 0 Black 0 2000 3 4
## 6 2002 61000 Diesel 90 0 White 0 2000 3 4
## Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
## 1 5 210 1165 0 1 3 1
## 2 5 210 1165 0 1 3 1
## 3 5 210 1165 1 1 3 1
## 4 5 210 1165 1 1 3 1
## 5 5 210 1170 1 1 3 1
## 6 5 210 1170 0 1 3 1
## Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player Central_Lock
## 1 1 1 0 0 1 0 1
## 2 1 1 1 0 1 1 1
## 3 1 1 0 0 1 0 0
## 4 1 1 0 0 1 0 0
## 5 1 1 1 0 1 0 1
## 6 1 1 1 0 1 0 1
## Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## 1 1 1 0 0 0 1
## 2 0 1 0 0 0 1
## 3 0 1 0 0 0 1
## 4 0 1 0 0 0 1
## 5 1 1 0 1 0 1
## 6 1 1 0 1 0 1
## Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
names(tdata)[4] <- "Age"
names(tdata)[7] <- "Mileage"
library(pastecs)
round(stat.desc(tdata[c(-1,-2)]), 0)
## Price Age Mfg_Month Mfg_Year Mileage Fuel_Type HP
## nbr.val 1436 1436 1436 1436 1436 NA 1436
## nbr.null 0 0 0 0 0 NA 0
## nbr.na 0 0 0 0 0 NA 0
## min 4350 1 1 1998 1 NA 69
## max 32500 80 12 2004 243000 NA 192
## range 28150 79 11 6 242999 NA 123
## sum 15409464 80340 7968 2871462 98413761 NA 145757
## median 9900 61 5 1999 63390 NA 110
## mean 10731 56 6 2000 68533 NA 102
## SE.mean 96 0 0 0 990 NA 0
## CI.mean.0.95 188 1 0 0 1942 NA 1
## var 13154872 346 11 2 1406733707 NA 224
## std.dev 3627 19 3 2 37506 NA 15
## coef.var 0 0 1 0 1 NA 0
## Met_Color Color Automatic CC Doors Cylinders Gears
## nbr.val 1436 NA 1436 1436 1436 1436 1436
## nbr.null 467 NA 1356 0 0 0 0
## nbr.na 0 NA 0 0 0 0 0
## min 0 NA 0 1300 2 4 3
## max 1 NA 1 16000 5 4 6
## range 1 NA 1 14700 3 0 3
## sum 969 NA 80 2264365 5792 5744 7218
## median 1 NA 0 1600 4 4 5
## mean 1 NA 0 1577 4 4 5
## SE.mean 0 NA 0 11 0 0 0
## CI.mean.0.95 0 NA 0 22 0 0 0
## var 0 NA 0 180104 1 0 0
## std.dev 0 NA 0 424 1 0 0
## coef.var 1 NA 4 0 0 0 0
## Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee
## nbr.val 1436 1436 1436 1436
## nbr.null 0 0 848 150
## nbr.na 0 0 0 0
## min 19 1000 0 0
## max 283 1615 1 1
## range 264 615 1 1
## sum 125108 1540052 588 1286
## median 85 1070 0 1
## mean 87 1072 0 1
## SE.mean 1 1 0 0
## CI.mean.0.95 2 3 0 0
## var 1692 2771 0 0
## std.dev 41 53 0 0
## coef.var 0 0 1 0
## Guarantee_Period ABS Airbag_1 Airbag_2 Airco Automatic_airco
## nbr.val 1436 1436 1436 1436 1436 1436
## nbr.null 0 268 42 398 706 1355
## nbr.na 0 0 0 0 0 0
## min 3 0 0 0 0 0
## max 36 1 1 1 1 1
## range 33 1 1 1 1 1
## sum 5479 1168 1394 1038 730 81
## median 3 1 1 1 1 0
## mean 4 1 1 1 1 0
## SE.mean 0 0 0 0 0 0
## CI.mean.0.95 0 0 0 0 0 0
## var 9 0 0 0 0 0
## std.dev 3 0 0 0 1 0
## coef.var 1 0 0 1 1 4
## Boardcomputer CD_Player Central_Lock Powered_Windows
## nbr.val 1436 1436 1436 1436
## nbr.null 1013 1122 603 629
## nbr.na 0 0 0 0
## min 0 0 0 0
## max 1 1 1 1
## range 1 1 1 1
## sum 423 314 833 807
## median 0 0 1 1
## mean 0 0 1 1
## SE.mean 0 0 0 0
## CI.mean.0.95 0 0 0 0
## var 0 0 0 0
## std.dev 0 0 0 0
## coef.var 2 2 1 1
## Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## nbr.val 1436 1436 1436 1436 1436
## nbr.null 32 1226 1067 1005 330
## nbr.na 0 0 0 0 0
## min 0 0 0 0 0
## max 1 1 1 1 1
## range 1 1 1 1 1
## sum 1404 210 369 431 1106
## median 1 0 0 0 1
## mean 1 0 0 0 1
## SE.mean 0 0 0 0 0
## CI.mean.0.95 0 0 0 0 0
## var 0 0 0 0 0
## std.dev 0 0 0 0 0
## coef.var 0 2 2 2 1
## Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## nbr.val 1436 1436 1436 1436
## nbr.null 1142 1227 1432 1037
## nbr.na 0 0 0 0
## min 0 0 0 0
## max 1 1 1 1
## range 1 1 1 1
## sum 294 209 4 399
## median 0 0 0 0
## mean 0 0 0 0
## SE.mean 0 0 0 0
## CI.mean.0.95 0 0 0 0
## var 0 0 0 0
## std.dev 0 0 0 0
## coef.var 2 2 19 2
We have 1436 units which costs from 1350 to 32500 Euro. Median price is 9900 pounds, arithmetic mean is 10731. Standard error of price arithmetic mean is 96. We can say that there’s 95% probability that true arithmetic mean of price of population (all used cars for last month in Europe) is between 10543 and 10919 Euro.
library(psych)
## Warning: package 'psych' was built under R version 4.3.2
describeBy(tdata$Price, tdata$Automatic)
##
## Descriptive statistics by group
## group: 0
## vars n mean sd median trimmed mad min max range skew
## X1 1 1356 10701.69 3625.62 9900 10135.65 2446.29 4350 32500 28150 1.72
## kurtosis se
## X1 3.88 98.46
## ------------------------------------------------------------
## group: 1
## vars n mean sd median trimmed mad min max range skew
## X1 1 80 11224.62 3636.91 9950 10590.16 1994.1 6500 20950 14450 1.4
## kurtosis se
## X1 1.02 406.62
tdata$TransmissionF <- factor(tdata$Automatic, levels = c(0,1), labels=c("A", "M"))
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(tdata[,c(3,4,7)], smooth = FALSE)
On these plots I see that all variables correlate.
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.3.2
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(tdata[,c(3,4,7)]))
## Price Age Mileage
## Price 1.00 -0.88 -0.57
## Age -0.88 1.00 0.51
## Mileage -0.57 0.51 1.00
##
## n= 1436
##
##
## P
## Price Age Mileage
## Price 0 0
## Age 0 0
## Mileage 0 0
There’s a negative correlation between Age and Price and positive between Age and Mileage.
fit <- lm(Price ~ Age + Mileage + TransmissionF, data = tdata)
summary(fit)
##
## Call:
## lm(formula = Price ~ Age + Mileage + TransmissionF, data = tdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6796.3 -965.9 -75.2 836.4 12667.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.045e+04 1.397e+02 146.427 < 2e-16 ***
## Age -1.550e+02 2.734e+00 -56.692 < 2e-16 ***
## Mileage -1.589e-02 1.360e-03 -11.685 < 2e-16 ***
## TransmissionFM 7.089e+02 1.919e+02 3.694 0.000229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1656 on 1432 degrees of freedom
## Multiple R-squared: 0.792, Adjusted R-squared: 0.7915
## F-statistic: 1817 on 3 and 1432 DF, p-value: < 2.2e-16
tdata$StdResid <- round(rstandard(fit), 3)
tdata$CooksD <- round(cooks.distance(fit), 3)
hist(tdata$StdResid, xlab = "Std Residuals", ylab = "Freq")
Units with |r|>3 should be excluded
shapiro.test(tdata$StdResid)
##
## Shapiro-Wilk normality test
##
## data: tdata$StdResid
## W = 0.92187, p-value < 2.2e-16
The normality is not met, but it’s not important as we have large sample >30 units.
tdata <- tdata[tdata$StdResid >= -3,]
tdata <- tdata[tdata$StdResid <= 3,]
hist(tdata$CooksD[tdata$CooksD > 0.005], xlab = "Cooks dist", ylab = "Freq") #Starts from 0.005 to see the gaps
head(tdata[order(-tdata$CooksD),c(1,3,41,42)], 20)
## Id Price StdResid CooksD
## 53 53 20500 2.592 0.026
## 187 188 6950 -1.820 0.021
## 992 996 7950 -2.150 0.015
## 331 332 11000 -1.995 0.013
## 1059 1063 10500 2.866 0.013
## 213 214 11790 -1.735 0.012
## 147 148 20500 1.678 0.011
## 172 173 23750 2.853 0.011
## 190 191 7750 -1.940 0.011
## 380 382 6400 -1.822 0.011
## 139 140 23000 2.814 0.010
## 1183 1189 9900 1.584 0.009
## 150 151 20950 1.366 0.008
## 379 381 6500 -1.403 0.008
## 601 604 11250 -1.545 0.008
## 115 116 22950 2.266 0.007
## 126 127 21750 2.632 0.007
## 1055 1059 8750 1.972 0.007
## 11 11 20950 2.946 0.006
## 59 59 18950 1.208 0.006
tdata <- tdata[tdata$CooksD < 0.020,]
fit1 <- lm(Price ~ Age + Mileage + TransmissionF, data = tdata)
tdata$StdResid <- round(rstandard(fit1), 3)
tdata$CooksD <- round(cooks.distance(fit1), 3)
hist(tdata$StdResid, xlab = "Std Residuals", ylab = "Freq")
In new regression model there are still units with |r|>3 but I would
leave them to keep more data and they shouldn’t affect the model
strongly and I suppose it’s not as strict as Cook’s distances
requirement.
hist(tdata$CooksD[tdata$CooksD > 0.005], xlab = "Cooks dist", ylab = "Freq")
Now it’s appropriate.
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age + Mileage + TransmissionF, data = tdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4798.0 -857.0 -40.5 788.0 5358.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.971e+04 1.188e+02 165.89 <2e-16 ***
## Age -1.455e+02 2.308e+00 -63.06 <2e-16 ***
## Mileage -1.415e-02 1.139e-03 -12.43 <2e-16 ***
## TransmissionFM 7.422e+02 1.583e+02 4.69 3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1357 on 1406 degrees of freedom
## Multiple R-squared: 0.8256, Adjusted R-squared: 0.8252
## F-statistic: 2218 on 3 and 1406 DF, p-value: < 2.2e-16
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit1)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -------------------------------
## DF = 1
## Chi2 = 183.4447
## Prob > Chi2 = 8.577228e-42
We reject the H0 (that the variance of standardized residuals is constant) at p-value<0.01
tdata$StdFitted <- scale(fit1$fitted.values)
scatterplot(y = tdata$StdResid, x = tdata$StdFitted, xlab = "Std fitted values", ylab = "Std resid", boxplots = FALSE, regLine = FALSE, smooth = FALSE)
I don’t see here big visible problems with homoscedasticity. Nevertheless, the Breusch-Pagan test shows that they are. So then we will use robust errors. It’s not linear. Also, it can bee seen that there are two groups. They can represent different engines, for example, or generations.
I’ll go with further regression (assuming there’s no issue with the linearity) and then discuss the issue.
vif(fit)
## Age Mileage TransmissionF
## 1.353336 1.361094 1.014076
All VIF are close enough to 1 so there is no strong multicolinearity.
library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
fit_robust <- lm_robust(Price ~ Age + Mileage + TransmissionF, se_type = "HC1", data = tdata)
summary(fit_robust)
##
## Call:
## lm_robust(formula = Price ~ Age + Mileage + TransmissionF, data = tdata,
## se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper
## (Intercept) 1.971e+04 1.657e+02 118.959 0.000e+00 1.939e+04 2.004e+04
## Age -1.455e+02 2.867e+00 -50.765 3.176e-320 -1.512e+02 -1.399e+02
## Mileage -1.415e-02 1.204e-03 -11.753 1.736e-30 -1.651e-02 -1.179e-02
## TransmissionFM 7.422e+02 1.464e+02 5.069 4.536e-07 4.550e+02 1.029e+03
## DF
## (Intercept) 1406
## Age 1406
## Mileage 1406
## TransmissionFM 1406
##
## Multiple R-squared: 0.8256 , Adjusted R-squared: 0.8252
## F-statistic: 1289 on 3 and 1406 DF, p-value: < 2.2e-16
In this model we see that it explains the price better than initial fit model. All the partial regression coefficients are statistically significant (p<0.046). The variables explains ~82% of variability of price. Population coefficient of determination ro² is 0.82. We reject the hypothesis that the ro²=0 (the model explains nothing) at p-value<0.001. And ro² is higher than initial model “fit” and the values of partial regression coefficients are more statistically significant than for model “fit1” without robust errors.
Explanation of partial regression coefficients:
— New Toyota Corolla car with 0 mileage and manual transmission would costin Europe 19710 Euro in average. p<0.002
— Each additional year for used Toyota Corolla car with manual transmission would lower its price in Europe by 145.5 Euro on average assuming that Mileage stays the same. p<0.001
— Each additional kilometer travelled by used Toyota Corolla car with manual transmission would lower its price in Europe by 1.415 Euro cents on average assuming the Age in years stays the same. p<0.001
— Used Toyota Corolla car with automatic transmission costs in Europe 742.2 Euro in average more than with manual transmission assuming all other variables stays the same. p<0.046
Nevertheless, linearity is not met. I would better find another model, include other variables which will eliminate this issue. For example, engine type, its volume and/or generation.
The answer to the research question: The price of used Toyota Corolla in Europe can’t be explained by its age, mileage and type of transmission.