https://www.kaggle.com/datasets/vishakhdapat/price-of-used-toyota-corolla-cars

Price of used Toyota Corolla Cars. Data is from a Toyota dealership. It’s not a proper data for generalizing it to a population as the source says that these are the cars sold “for all previous time” in one of many dealerships, so it’s a measurement during a long time. And it’s a dealership data from which can’t be generalized, for instance, to other dealerships. At the same time the age is indicated as of August 2004, so it’s unclear.

If it’s allowed, I would assume that it’s a simple random sample of Toyota Corolla used cars sold within one month in Europe.

Unit: Used Toyota Corolla car

Price: numeric, in Euro (it’s not mentioned in the datadet description)

Age: Years, numeric

KM: Mileage in kilometers.

Automatic: binary. Transmission: 0 — Manual, 1 — Automatic.

I will use these variables in the regression model. Horse power could be also used, but I have choosen transmission to have a dummy variable in regression model.

Research question: Can the price of a used Toyota Corolla car in Europe be explained by its age, mileage and type of transmission?

tdata <- read.table("./ToyotaCorolla.csv", header = TRUE, sep = ",", dec = ".", quote = "\"")
head(tdata)
##   Id                                         Model Price Age_08_04 Mfg_Month
## 1  1 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13500        23        10
## 2  2 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13750        23        10
## 3  3 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 13950        24         9
## 4  4 TOYOTA Corolla 2.0 D4D HATCHB TERRA 2/3-Doors 14950        26         7
## 5  5   TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 13750        30         3
## 6  6   TOYOTA Corolla 2.0 D4D HATCHB SOL 2/3-Doors 12950        32         1
##   Mfg_Year    KM Fuel_Type HP Met_Color  Color Automatic   CC Doors Cylinders
## 1     2002 46986    Diesel 90         1   Blue         0 2000     3         4
## 2     2002 72937    Diesel 90         1 Silver         0 2000     3         4
## 3     2002 41711    Diesel 90         1   Blue         0 2000     3         4
## 4     2002 48000    Diesel 90         0  Black         0 2000     3         4
## 5     2002 38500    Diesel 90         0  Black         0 2000     3         4
## 6     2002 61000    Diesel 90         0  White         0 2000     3         4
##   Gears Quarterly_Tax Weight Mfr_Guarantee BOVAG_Guarantee Guarantee_Period ABS
## 1     5           210   1165             0               1                3   1
## 2     5           210   1165             0               1                3   1
## 3     5           210   1165             1               1                3   1
## 4     5           210   1165             1               1                3   1
## 5     5           210   1170             1               1                3   1
## 6     5           210   1170             0               1                3   1
##   Airbag_1 Airbag_2 Airco Automatic_airco Boardcomputer CD_Player Central_Lock
## 1        1        1     0               0             1         0            1
## 2        1        1     1               0             1         1            1
## 3        1        1     0               0             1         0            0
## 4        1        1     0               0             1         0            0
## 5        1        1     1               0             1         0            1
## 6        1        1     1               0             1         0            1
##   Powered_Windows Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## 1               1              1     0         0           0                1
## 2               0              1     0         0           0                1
## 3               0              1     0         0           0                1
## 4               0              1     0         0           0                1
## 5               1              1     0         1           0                1
## 6               1              1     0         1           0                1
##   Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## 1            0              0                 0       0
## 2            0              0                 0       0
## 3            0              0                 0       0
## 4            0              0                 0       0
## 5            0              0                 0       0
## 6            0              0                 0       0
names(tdata)[4] <- "Age"
names(tdata)[7] <- "Mileage"
library(pastecs)
round(stat.desc(tdata[c(-1,-2)]), 0)
##                 Price   Age Mfg_Month Mfg_Year    Mileage Fuel_Type     HP
## nbr.val          1436  1436      1436     1436       1436        NA   1436
## nbr.null            0     0         0        0          0        NA      0
## nbr.na              0     0         0        0          0        NA      0
## min              4350     1         1     1998          1        NA     69
## max             32500    80        12     2004     243000        NA    192
## range           28150    79        11        6     242999        NA    123
## sum          15409464 80340      7968  2871462   98413761        NA 145757
## median           9900    61         5     1999      63390        NA    110
## mean            10731    56         6     2000      68533        NA    102
## SE.mean            96     0         0        0        990        NA      0
## CI.mean.0.95      188     1         0        0       1942        NA      1
## var          13154872   346        11        2 1406733707        NA    224
## std.dev          3627    19         3        2      37506        NA     15
## coef.var            0     0         1        0          1        NA      0
##              Met_Color Color Automatic      CC Doors Cylinders Gears
## nbr.val           1436    NA      1436    1436  1436      1436  1436
## nbr.null           467    NA      1356       0     0         0     0
## nbr.na               0    NA         0       0     0         0     0
## min                  0    NA         0    1300     2         4     3
## max                  1    NA         1   16000     5         4     6
## range                1    NA         1   14700     3         0     3
## sum                969    NA        80 2264365  5792      5744  7218
## median               1    NA         0    1600     4         4     5
## mean                 1    NA         0    1577     4         4     5
## SE.mean              0    NA         0      11     0         0     0
## CI.mean.0.95         0    NA         0      22     0         0     0
## var                  0    NA         0  180104     1         0     0
## std.dev              0    NA         0     424     1         0     0
## coef.var             1    NA         4       0     0         0     0
##              Quarterly_Tax  Weight Mfr_Guarantee BOVAG_Guarantee
## nbr.val               1436    1436          1436            1436
## nbr.null                 0       0           848             150
## nbr.na                   0       0             0               0
## min                     19    1000             0               0
## max                    283    1615             1               1
## range                  264     615             1               1
## sum                 125108 1540052           588            1286
## median                  85    1070             0               1
## mean                    87    1072             0               1
## SE.mean                  1       1             0               0
## CI.mean.0.95             2       3             0               0
## var                   1692    2771             0               0
## std.dev                 41      53             0               0
## coef.var                 0       0             1               0
##              Guarantee_Period  ABS Airbag_1 Airbag_2 Airco Automatic_airco
## nbr.val                  1436 1436     1436     1436  1436            1436
## nbr.null                    0  268       42      398   706            1355
## nbr.na                      0    0        0        0     0               0
## min                         3    0        0        0     0               0
## max                        36    1        1        1     1               1
## range                      33    1        1        1     1               1
## sum                      5479 1168     1394     1038   730              81
## median                      3    1        1        1     1               0
## mean                        4    1        1        1     1               0
## SE.mean                     0    0        0        0     0               0
## CI.mean.0.95                0    0        0        0     0               0
## var                         9    0        0        0     0               0
## std.dev                     3    0        0        0     1               0
## coef.var                    1    0        0        1     1               4
##              Boardcomputer CD_Player Central_Lock Powered_Windows
## nbr.val               1436      1436         1436            1436
## nbr.null              1013      1122          603             629
## nbr.na                   0         0            0               0
## min                      0         0            0               0
## max                      1         1            1               1
## range                    1         1            1               1
## sum                    423       314          833             807
## median                   0         0            1               1
## mean                     0         0            1               1
## SE.mean                  0         0            0               0
## CI.mean.0.95             0         0            0               0
## var                      0         0            0               0
## std.dev                  0         0            0               0
## coef.var                 2         2            1               1
##              Power_Steering Radio Mistlamps Sport_Model Backseat_Divider
## nbr.val                1436  1436      1436        1436             1436
## nbr.null                 32  1226      1067        1005              330
## nbr.na                    0     0         0           0                0
## min                       0     0         0           0                0
## max                       1     1         1           1                1
## range                     1     1         1           1                1
## sum                    1404   210       369         431             1106
## median                    1     0         0           0                1
## mean                      1     0         0           0                1
## SE.mean                   0     0         0           0                0
## CI.mean.0.95              0     0         0           0                0
## var                       0     0         0           0                0
## std.dev                   0     0         0           0                0
## coef.var                  0     2         2           2                1
##              Metallic_Rim Radio_cassette Parking_Assistant Tow_Bar
## nbr.val              1436           1436              1436    1436
## nbr.null             1142           1227              1432    1037
## nbr.na                  0              0                 0       0
## min                     0              0                 0       0
## max                     1              1                 1       1
## range                   1              1                 1       1
## sum                   294            209                 4     399
## median                  0              0                 0       0
## mean                    0              0                 0       0
## SE.mean                 0              0                 0       0
## CI.mean.0.95            0              0                 0       0
## var                     0              0                 0       0
## std.dev                 0              0                 0       0
## coef.var                2              2                19       2

We have 1436 units which costs from 1350 to 32500 Euro. Median price is 9900 pounds, arithmetic mean is 10731. Standard error of price arithmetic mean is 96. We can say that there’s 95% probability that true arithmetic mean of price of population (all used cars for last month in Europe) is between 10543 and 10919 Euro.

library(psych)
## Warning: package 'psych' was built under R version 4.3.2
describeBy(tdata$Price, tdata$Automatic)
## 
##  Descriptive statistics by group 
## group: 0
##    vars    n     mean      sd median  trimmed     mad  min   max range skew
## X1    1 1356 10701.69 3625.62   9900 10135.65 2446.29 4350 32500 28150 1.72
##    kurtosis    se
## X1     3.88 98.46
## ------------------------------------------------------------ 
## group: 1
##    vars  n     mean      sd median  trimmed    mad  min   max range skew
## X1    1 80 11224.62 3636.91   9950 10590.16 1994.1 6500 20950 14450  1.4
##    kurtosis     se
## X1     1.02 406.62
tdata$TransmissionF <- factor(tdata$Automatic, levels = c(0,1), labels=c("A", "M"))
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(tdata[,c(3,4,7)], smooth = FALSE)

On these plots I see that all variables correlate.

library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.3.2
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(tdata[,c(3,4,7)]))
##         Price   Age Mileage
## Price    1.00 -0.88   -0.57
## Age     -0.88  1.00    0.51
## Mileage -0.57  0.51    1.00
## 
## n= 1436 
## 
## 
## P
##         Price Age Mileage
## Price          0   0     
## Age      0         0     
## Mileage  0     0

There’s a negative correlation between Age and Price and positive between Age and Mileage.

fit <- lm(Price ~ Age + Mileage + TransmissionF, data = tdata)
summary(fit)
## 
## Call:
## lm(formula = Price ~ Age + Mileage + TransmissionF, data = tdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6796.3  -965.9   -75.2   836.4 12667.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.045e+04  1.397e+02 146.427  < 2e-16 ***
## Age            -1.550e+02  2.734e+00 -56.692  < 2e-16 ***
## Mileage        -1.589e-02  1.360e-03 -11.685  < 2e-16 ***
## TransmissionFM  7.089e+02  1.919e+02   3.694 0.000229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1656 on 1432 degrees of freedom
## Multiple R-squared:  0.792,  Adjusted R-squared:  0.7915 
## F-statistic:  1817 on 3 and 1432 DF,  p-value: < 2.2e-16
tdata$StdResid <- round(rstandard(fit), 3)
tdata$CooksD <- round(cooks.distance(fit), 3)
hist(tdata$StdResid, xlab = "Std Residuals", ylab = "Freq")

Units with |r|>3 should be excluded

shapiro.test(tdata$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  tdata$StdResid
## W = 0.92187, p-value < 2.2e-16

The normality is not met, but it’s not important as we have large sample >30 units.

tdata <- tdata[tdata$StdResid >= -3,]
tdata <- tdata[tdata$StdResid <= 3,]
hist(tdata$CooksD[tdata$CooksD > 0.005], xlab = "Cooks dist", ylab = "Freq") #Starts from 0.005 to see the gaps

head(tdata[order(-tdata$CooksD),c(1,3,41,42)], 20)
##        Id Price StdResid CooksD
## 53     53 20500    2.592  0.026
## 187   188  6950   -1.820  0.021
## 992   996  7950   -2.150  0.015
## 331   332 11000   -1.995  0.013
## 1059 1063 10500    2.866  0.013
## 213   214 11790   -1.735  0.012
## 147   148 20500    1.678  0.011
## 172   173 23750    2.853  0.011
## 190   191  7750   -1.940  0.011
## 380   382  6400   -1.822  0.011
## 139   140 23000    2.814  0.010
## 1183 1189  9900    1.584  0.009
## 150   151 20950    1.366  0.008
## 379   381  6500   -1.403  0.008
## 601   604 11250   -1.545  0.008
## 115   116 22950    2.266  0.007
## 126   127 21750    2.632  0.007
## 1055 1059  8750    1.972  0.007
## 11     11 20950    2.946  0.006
## 59     59 18950    1.208  0.006
tdata <- tdata[tdata$CooksD < 0.020,]
fit1 <- lm(Price ~ Age + Mileage + TransmissionF, data = tdata)
tdata$StdResid <- round(rstandard(fit1), 3)
tdata$CooksD <- round(cooks.distance(fit1), 3)
hist(tdata$StdResid, xlab = "Std Residuals", ylab = "Freq")

In new regression model there are still units with |r|>3 but I would leave them to keep more data and they shouldn’t affect the model strongly and I suppose it’s not as strict as Cook’s distances requirement.

hist(tdata$CooksD[tdata$CooksD > 0.005], xlab = "Cooks dist", ylab = "Freq")

Now it’s appropriate.

summary(fit1)
## 
## Call:
## lm(formula = Price ~ Age + Mileage + TransmissionF, data = tdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4798.0  -857.0   -40.5   788.0  5358.8 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.971e+04  1.188e+02  165.89   <2e-16 ***
## Age            -1.455e+02  2.308e+00  -63.06   <2e-16 ***
## Mileage        -1.415e-02  1.139e-03  -12.43   <2e-16 ***
## TransmissionFM  7.422e+02  1.583e+02    4.69    3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1357 on 1406 degrees of freedom
## Multiple R-squared:  0.8256, Adjusted R-squared:  0.8252 
## F-statistic:  2218 on 3 and 1406 DF,  p-value: < 2.2e-16
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.3.2
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
ols_test_breusch_pagan(fit1)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##          Test Summary           
##  -------------------------------
##  DF            =    1 
##  Chi2          =    183.4447 
##  Prob > Chi2   =    8.577228e-42

We reject the H0 (that the variance of standardized residuals is constant) at p-value<0.01

tdata$StdFitted <- scale(fit1$fitted.values)
scatterplot(y = tdata$StdResid, x = tdata$StdFitted, xlab = "Std fitted values",  ylab = "Std resid", boxplots = FALSE, regLine = FALSE, smooth = FALSE)

I don’t see here big visible problems with homoscedasticity. Nevertheless, the Breusch-Pagan test shows that they are. So then we will use robust errors. It’s not linear. Also, it can bee seen that there are two groups. They can represent different engines, for example, or generations.

I’ll go with further regression (assuming there’s no issue with the linearity) and then discuss the issue.

vif(fit)
##           Age       Mileage TransmissionF 
##      1.353336      1.361094      1.014076

All VIF are close enough to 1 so there is no strong multicolinearity.

library(estimatr)
## Warning: package 'estimatr' was built under R version 4.3.2
fit_robust <- lm_robust(Price ~ Age + Mileage + TransmissionF, se_type = "HC1", data = tdata)
summary(fit_robust)
## 
## Call:
## lm_robust(formula = Price ~ Age + Mileage + TransmissionF, data = tdata, 
##     se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                  Estimate Std. Error t value   Pr(>|t|)   CI Lower   CI Upper
## (Intercept)     1.971e+04  1.657e+02 118.959  0.000e+00  1.939e+04  2.004e+04
## Age            -1.455e+02  2.867e+00 -50.765 3.176e-320 -1.512e+02 -1.399e+02
## Mileage        -1.415e-02  1.204e-03 -11.753  1.736e-30 -1.651e-02 -1.179e-02
## TransmissionFM  7.422e+02  1.464e+02   5.069  4.536e-07  4.550e+02  1.029e+03
##                  DF
## (Intercept)    1406
## Age            1406
## Mileage        1406
## TransmissionFM 1406
## 
## Multiple R-squared:  0.8256 ,    Adjusted R-squared:  0.8252 
## F-statistic:  1289 on 3 and 1406 DF,  p-value: < 2.2e-16

In this model we see that it explains the price better than initial fit model. All the partial regression coefficients are statistically significant (p<0.046). The variables explains ~82% of variability of price. Population coefficient of determination ro² is 0.82. We reject the hypothesis that the ro²=0 (the model explains nothing) at p-value<0.001. And ro² is higher than initial model “fit” and the values of partial regression coefficients are more statistically significant than for model “fit1” without robust errors.

Explanation of partial regression coefficients:

— New Toyota Corolla car with 0 mileage and manual transmission would costin Europe 19710 Euro in average. p<0.002

— Each additional year for used Toyota Corolla car with manual transmission would lower its price in Europe by 145.5 Euro on average assuming that Mileage stays the same. p<0.001

— Each additional kilometer travelled by used Toyota Corolla car with manual transmission would lower its price in Europe by 1.415 Euro cents on average assuming the Age in years stays the same. p<0.001

— Used Toyota Corolla car with automatic transmission costs in Europe 742.2 Euro in average more than with manual transmission assuming all other variables stays the same. p<0.046

Nevertheless, linearity is not met. I would better find another model, include other variables which will eliminate this issue. For example, engine type, its volume and/or generation.

The answer to the research question: The price of used Toyota Corolla in Europe can’t be explained by its age, mileage and type of transmission.