#Packages used throughout the analysis
library(dplyr)     #data wrangling: filter, sample_n, select
library(car)       #scatterplotMatrix, scatterplot, vif
library(Hmisc)     #rcorr
library(olsrr)     #ols_test_breusch_pagan
library(estimatr)  #lm_robust

set.seed(1)
mydata <- read.table("car_price.csv", header = TRUE, sep = ",", dec = ".")
mydata <- mydata %>% filter(grepl(" Lakh", car_prices_in_rupee)) %>% sample_n(500)

head(mydata)
##      X                               car_name car_prices_in_rupee   kms_driven fuel_type transmission ownership
## 1 1039             Land Rover Freelander 2 SE          13.28 Lakh 1,58,108 kms    Diesel    Automatic 1st Owner
## 2 4936              Maruti Ciaz VDi Plus SHVS           5.20 Lakh 1,00,000 kms    Diesel       Manual 1st Owner
## 3 2233              Honda Amaze S Petrol BSIV           6.80 Lakh   59,858 kms    Petrol       Manual 1st Owner
## 4 5194                   Maruti Swift 1.3 VXi           1.42 Lakh 1,93,593 kms    Petrol       Manual 1st Owner
## 5 1573 Skoda New Laura Ambiente 2.0 TDI CR AT           2.53 Lakh 1,09,403 kms    Diesel    Automatic 3rd Owner
## 6 4725                    Maruti Alto 800 LXI           1.46 Lakh 2,69,930 kms    Petrol       Manual 1st Owner
##   manufacture  engine   Seats
## 1        2014  999 cc 5 Seats
## 2        2016 1493 cc 4 Seats
## 3        2018 2179 cc 5 Seats
## 4        2009 1896 cc 5 Seats
## 5        2012 1993 cc 5 Seats
## 6        2014  814 cc 5 Seats
#Modifying kilometers column
mydata$kms <- gsub("kms","",as.character(mydata$kms_driven))
mydata$kms <- as.numeric(gsub(",","",as.character(mydata$kms)))
mydata <- mydata %>% select(-kms_driven)

#Modifying prices column
mydata$car_prices_in_rupee <- gsub(" ","",as.character(mydata$car_prices_in_rupee))
mydata$car_prices_in_rupee <- gsub(",","",as.character(mydata$car_prices_in_rupee))

mydata$car_prices_in_rupee <- gsub("Lakh", "", as.character(mydata$car_prices_in_rupee))

#Later in the analysis the coefficients become absurdly large if prices are expressed in rupees, so I will rather keep them in lakh.

options(scipen = 0)
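
For scale: 1 lakh equals 100,000 rupees, so refitting on rupee-denominated prices would simply multiply every coefficient by 1e5. A minimal illustration of the rescaling (my addition, using a value from the sample above):

#1 lakh = 100,000 INR; rupee-scale prices inflate all coefficients by a factor of 1e5
13.28 * 1e5 #a 13.28 lakh car costs 1,328,000 rupees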

#Filtering by fuel type, as there are only a few electric-fuelled and a few CNG-fuelled cars

mydata <- mydata %>% filter(fuel_type != "electric" & fuel_type != "Cng") 
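
A quick sanity check (my addition) that only diesel and petrol cars remain after the filter:

#Tabulate the remaining fuel types; only Diesel and Petrol should be left
table(mydata$fuel_type)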

#Modifying the ownership column

mydata$ownership <- as.numeric(substr(mydata$ownership, 1, 1))

#Modifying engine volume column

mydata$engine <- as.numeric(gsub(" cc", "", as.character(mydata$engine)))

#Modifying the column with number of seats

mydata$Seats <- as.numeric(gsub(" Seats", "", as.character(mydata$Seats)))

colnames(mydata)[3] <- "Price"
mydata$Price <- as.numeric(mydata$Price)

final_data <- mydata %>% select(Price, ownership, transmission, manufacture, engine, kms)

head(final_data)
##   Price ownership transmission manufacture engine    kms
## 1 13.28         1    Automatic        2014    999 158108
## 2  5.20         1       Manual        2016   1493 100000
## 3  6.80         1       Manual        2018   2179  59858
## 4  1.42         1       Manual        2009   1896 193593
## 5  2.53         3    Automatic        2012   1993 109403
## 6  1.46         1       Manual        2014    814 269930

Description of variables

I will perform further analysis on the “final_data” table, which consists of the following variables:

- Price: price of the car in lakh rupees
- ownership: number of previous owners
- transmission: type of transmission (Automatic or Manual)
- manufacture: year of manufacture
- engine: engine capacity in cc
- kms: kilometers driven

Research question

How do the number of previous owners, the type of transmission, the year of manufacture, the engine capacity and the kilometers driven affect the average price of a used car in India?

Regression model description

In the regression model I will analyse how the factors mentioned above affect the price of a used car on the Indian market. I deduced that these are the most common factors in determining used car prices from the following source:

“The Determinants of Used Car Prices: An Empirical Study” by Jia Wei and Liping Fang, published in the Journal of Industrial Economics and Management

scatterplotMatrix(final_data[,c(1,2,4,5,6)], smooth = FALSE)

The first row shows the relationship between the price of used cars and each independent variable. Price is negatively correlated with the number of previous owners and with the kilometers driven. It is positively correlated with the engine size and the year of manufacture, which is all quite straightforward. There is no visible multicollinearity between the independent/explanatory variables.

rcorr(as.matrix(final_data[,c(1,2,4,5,6)]), 
      type = "pearson")
##             Price ownership manufacture engine   kms
## Price        1.00     -0.18        0.42   0.26 -0.28
## ownership   -0.18      1.00       -0.45   0.01  0.31
## manufacture  0.42     -0.45        1.00  -0.06 -0.49
## engine       0.26      0.01       -0.06   1.00 -0.05
## kms         -0.28      0.31       -0.49  -0.05  1.00
## 
## n= 492 
## 
## 
## P
##             Price  ownership manufacture engine kms   
## Price              0.0000    0.0000      0.0000 0.0000
## ownership   0.0000           0.0000      0.8178 0.0000
## manufacture 0.0000 0.0000                0.1906 0.0000
## engine      0.0000 0.8178    0.1906             0.2484
## kms         0.0000 0.0000    0.0000      0.2484

From the correlation matrix we see that the correlation between Price and the independent variables is not strong (if we draw the line for strong correlation at 0.5, as most sources suggest).
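
The 0.5 cut-off can also be checked programmatically; a small sketch (my addition) that flags any off-diagonal pair crossing it:

#Flag variable pairs with |r| > 0.5, excluding the diagonal
cors <- cor(final_data[, c(1, 2, 4, 5, 6)])
which(abs(cors) > 0.5 & abs(cors) < 1, arr.ind = TRUE)
#no pair qualifies; the largest off-diagonal value is -0.49 (manufacture vs. kms)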

The Pearson correlation matrix does show moderate correlations among some explanatory variables (manufacture with ownership at -0.45 and with kms at -0.49), so I will perform a VIF test below, after estimating the regression.

regression <- lm(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms)
regression
## 
## Call:
## lm(formula = Price ~ ownership + transmission + manufacture + 
##     engine + kms, data = final_data)
## 
## Coefficients:
##        (Intercept)           ownership  transmissionManual         manufacture              engine                 kms  
##         -2.051e+03          -8.263e-01          -1.703e+01           1.026e+00           4.315e-03           1.943e-06
vif(regression)
##    ownership transmission  manufacture       engine          kms 
##     1.288090     1.161079     1.617137     1.066443     1.373437
mean(vif(regression))
## [1] 1.301237
sqrt(5)
## [1] 2.236068

I use the variance inflation factor (VIF) test to identify whether the regression has a multicollinearity problem. All of the values are smaller than the square root of 5, meaning there is no problem with multicollinearity among the variables in the performed regression.

The mean VIF value is 1.30, which is a strong indicator of no multicollinearity. Not all VIF values are below 1.5, which we would ultimately want, but we can live with these values and claim no multicollinearity with adequate certainty.
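
Both rules can be checked in one short snippet (my addition, reusing the vif() function from the car package as above):

#Programmatic check of the sqrt(5) rule and the mean-VIF guideline
vifs <- vif(regression)
any(vifs > sqrt(5)) #FALSE: no variable crosses the threshold
mean(vifs)          #1.30, close to the ideal value of 1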

final_data$StdResid <- round(rstandard(regression), 3) 
final_data$CooksD <- round(cooks.distance(regression), 3)
hist(final_data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(final_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  final_data$StdResid
## W = 0.84011, p-value < 2.2e-16
hist(final_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

The histogram of standardized residuals and the histogram of Cook's distances show some outliers, which I will remove later in the analysis.

From the graph alone I can assume that the distribution of residuals is not normal, as it has a tail on the right; nevertheless, I will perform the Shapiro-Wilk test to check for normality.

H0: the standardized residuals are normally distributed

H1: the standardized residuals are not normally distributed

Based on the Shapiro-Wilk test I reject the null hypothesis: the standardized residuals are not normally distributed, as the p-value is lower than 0.001.
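
As a complementary visual check (my addition, base R only), a Q-Q plot makes the right-tail departure from normality visible:

#Q-Q plot of standardized residuals; points peeling away from the line
#in the upper right correspond to the heavy right tail in the histogram
qqnorm(final_data$StdResid, main = "Q-Q plot of standardized residuals")
qqline(final_data$StdResid)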

head(final_data[order(-final_data$StdResid),], 10) 
##     Price ownership transmission manufacture engine   kms StdResid CooksD
## 138  85.0         1    Automatic        2020   1969 38000    5.686  0.051
## 348  83.0         1    Automatic        2020   1950 25000    5.491  0.047
## 393  74.0         1    Automatic        2021   1968  9900    4.460  0.037
## 243  71.5         1    Automatic        2021   1493 24900    4.408  0.033
## 335  72.5         1    Automatic        2018   2967 15000    4.197  0.060
## 62   67.0         1    Automatic        2021   1950  8900    3.748  0.026
## 222  59.5         2    Automatic        2016   2499 45000    3.344  0.027
## 369  58.0         1    Automatic        2020   1197 17000    3.261  0.019
## 251  56.5         1    Automatic        2020    998 24000    3.196  0.021
## 21   56.0         1    Automatic        2019   1248 45000    3.130  0.016
head(final_data[order(-final_data$CooksD),], 20) 
##     Price ownership transmission manufacture engine    kms StdResid CooksD
## 335 72.50         1    Automatic        2018   2967  15000    4.197  0.060
## 138 85.00         1    Automatic        2020   1969  38000    5.686  0.051
## 492 61.00         1    Automatic        2021   4367  15000    2.115  0.051
## 348 83.00         1    Automatic        2020   1950  25000    5.491  0.047
## 393 74.00         1    Automatic        2021   1968   9900    4.460  0.037
## 243 71.50         1    Automatic        2021   1493  24900    4.408  0.033
## 222 59.50         2    Automatic        2016   2499  45000    3.344  0.027
## 62  67.00         1    Automatic        2021   1950   8900    3.748  0.026
## 251 56.50         1    Automatic        2020    998  24000    3.196  0.021
## 369 58.00         1    Automatic        2020   1197  17000    3.261  0.019
## 46  61.00         1    Automatic        2021   1968  20000    3.120  0.017
## 21  56.00         1    Automatic        2019   1248  45000    3.130  0.016
## 208 51.75         1    Automatic        2019    998  46000    2.807  0.016
## 229 58.00         1    Automatic        2018   1995  70000    3.106  0.016
## 331 60.00         1    Automatic        2021   1997  19000    3.005  0.016
## 405 60.00         1    Automatic        2021   1968  19000    3.018  0.016
## 81  51.00         1    Automatic        2015   1968  40000    2.721  0.013
## 431  4.70         1    Automatic        2019      0  30000   -1.600  0.013
## 5    2.53         3    Automatic        2012   1993 109403   -1.810  0.012
## 186  1.20         1       Manual        2010   3996  15166   -1.041  0.012
final_data <- final_data[abs(final_data$StdResid) <= 3, ]

final_data <- final_data[final_data$CooksD <= 0.02, ]

I drew the line for Cook's distance at 0.02, as the decline looked fairly steady from there on. A common rule of thumb (mentioned by most sources) suggests cutting at 4/n, which here would give about 0.008; in the lectures, however, we identified the high-impact units from the histogram.
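
For comparison, the rule-of-thumb threshold computed on this sample (n = 492 before trimming):

#Rule-of-thumb Cook's distance cut-off, 4/n
4 / 492 #0.00813, stricter than the 0.02 used above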

I removed the observations that were problematic with respect to the standardized residuals, as well as the high-impact units.

shapiro.test(final_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  final_data$StdResid
## W = 0.9342, p-value = 1.173e-13

Again, I reject the null hypothesis of the Shapiro-Wilk test and conclude that the standardized residuals are not normally distributed (p < 0.001). However, this has little consequence: with more than 100 observations in the sample, the violation of normality can be dismissed. Now I will re-estimate the regression model on the data with the outliers removed.
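
A quick check of the sample size that the argument above leans on (my addition):

#Observations remaining after trimming, well above the ~100 guideline
nrow(final_data) #477 (the robust fit below reports 471 residual DF + 6 coefficients)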

regression2 <- lm(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms)
final_data$StdResid <- round(rstandard(regression2), 3) 
final_data$CooksD <- round(cooks.distance(regression2), 3) 

hist(final_data$StdResid, 
     xlab = "Standardized residuals", 
     ylab = "Frequency", 
     main = "Histogram of standardized residuals")

shapiro.test(final_data$StdResid)
## 
##  Shapiro-Wilk normality test
## 
## data:  final_data$StdResid
## W = 0.84934, p-value < 2.2e-16
hist(final_data$CooksD, 
     xlab = "Cooks distance", 
     ylab = "Frequency", 
     main = "Histogram of Cooks distances")

Some standardized residuals are still above 3; however, as we said in class, if the distribution is not normal we should not keep eliminating observations outside the [-3, 3] interval, as we could repeat this too many times and run out of data.

final_data$StdFittedValues <- scale(regression2$fitted.values)

scatterplot(y = final_data$StdResid, x = final_data$StdFittedValues,
            ylab = "Standardized residuals",
            xlab = "Standardized fitted values",
            boxplots = FALSE,
            regLine = FALSE,
            smooth = FALSE)

The plot suggests a problem with heteroskedasticity, so I will perform the Breusch-Pagan test.

H0: Variance of standardized residuals is constant

H1: Variance of standardized residuals is not constant

ols_test_breusch_pagan(regression2)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##               Data                
##  ---------------------------------
##  Response : Price 
##  Variables: fitted values of Price 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    482.1215 
##  Prob > Chi2   =    7.380934e-107

Based on the formal test for heteroskedasticity we reject the null hypothesis and conclude that the variance of standardized residuals is not constant. To overcome this issue we will use robust standard errors.

head(final_data[order(-final_data$StdResid),], 10) 
##     Price ownership transmission manufacture engine   kms StdResid CooksD StdFittedValues
## 208 51.75         1    Automatic        2019    998 46000    4.529  0.044        1.564739
## 81  51.00         1    Automatic        2015   1968 40000    4.463  0.037        1.520969
## 416 49.50         1    Automatic        2018   1995 65000    3.886  0.027        1.860826
## 236 47.00         1    Automatic        2020   1248  7500    3.674  0.028        1.722088
## 324 48.00         1    Automatic        2019   1999 29000    3.598  0.022        1.927362
## 389 48.10         1    Automatic        2019   2179 29000    3.542  0.024        1.995661
## 466 46.00         1    Automatic        2017   1995 65000    3.487  0.021        1.759386
## 234 44.75         2    Automatic        2020   1197 27000    3.484  0.034        1.602298
## 132 46.00         1    Automatic        2017   2179 65000    3.415  0.023        1.829204
## 409 44.00         1    Automatic        2017   1498 49000    3.413  0.018        1.554616
head(final_data[order(-final_data$CooksD),], 10) 
##     Price ownership transmission manufacture engine   kms StdResid CooksD StdFittedValues
## 208 51.75         1    Automatic        2019    998 46000    4.529  0.044        1.564739
## 81  51.00         1    Automatic        2015   1968 40000    4.463  0.037        1.520969
## 234 44.75         2    Automatic        2020   1197 27000    3.484  0.034        1.602298
## 489 47.00         1    Automatic        2020   2755  7500    3.082  0.034        2.293908
## 109 45.00         1    Automatic        2018   2755 25500    2.984  0.028        2.109240
## 236 47.00         1    Automatic        2020   1248  7500    3.674  0.028        1.722088
## 416 49.50         1    Automatic        2018   1995 65000    3.886  0.027        1.860826
## 210 39.75         1    Automatic        2013   1498 95004    3.189  0.026        1.195399
## 304 39.50         2    Automatic        2016   2967 51000    2.423  0.024        1.892434
## 389 48.10         1    Automatic        2019   2179 29000    3.542  0.024        1.995661

#Reestimating the regression model with robust standard errors to fix heteroskedasticity

regression3 <- lm_robust(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms,
                         se_type = "HC1")

summary(regression3)
## 
## Call:
## lm_robust(formula = Price ~ ownership + transmission + manufacture + 
##     engine + kms, data = final_data, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##                      Estimate Std. Error t value  Pr(>|t|)   CI Lower   CI Upper  DF
## (Intercept)        -1.483e+03  1.679e+02  -8.834 2.014e-17 -1.813e+03 -1.153e+03 471
## ownership          -8.809e-01  4.299e-01  -2.049 4.103e-02 -1.726e+00 -3.606e-02 471
## transmissionManual -1.353e+01  1.197e+00 -11.305 2.200e-26 -1.588e+01 -1.118e+01 471
## manufacture         7.436e-01  8.308e-02   8.951 8.169e-18  5.804e-01  9.069e-01 471
## engine              2.782e-03  6.731e-04   4.132 4.251e-05  1.459e-03  4.104e-03 471
## kms                 7.417e-06  5.342e-06   1.388 1.657e-01 -3.080e-06  1.791e-05 471
## 
## Multiple R-squared:  0.5288 ,    Adjusted R-squared:  0.5238 
## F-statistic: 46.98 on 5 and 471 DF,  p-value: < 2.2e-16
lin_relationship <- sqrt(summary(regression3)$r.squared)
lin_relationship
## [1] 0.7271657
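
As a cross-check outside the estimatr workflow (my addition, not part of the original analysis), the same HC1 standard errors can be reproduced from the plain lm fit with the sandwich and lmtest packages:

#Reproduce the HC1 robust standard errors from regression2
library(sandwich)
library(lmtest)
coeftest(regression2, vcov = vcovHC(regression2, type = "HC1"))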

Conclusion

Multiple R-squared has a value of 0.529, indicating that 52.9% of the variability of used car prices is explained by the variability of the 5 explanatory variables in the model.

F-statistic:

H0: ρ² = 0

H1: ρ² > 0

I can reject H0 and conclude that there is a relationship between the dependent variable and at least one explanatory variable.

The linear relationship between the prices of used cars and the 5 explanatory variables is strong, as the multiple R value is 0.73.