set.seed(1)
mydata <- read.table("car_price.csv", header=TRUE, sep=",", dec=".")
mydata <- mydata %>% filter(grepl(" Lakh", car_prices_in_rupee ))%>% sample_n(500)
head(mydata)
## X car_name car_prices_in_rupee kms_driven fuel_type transmission ownership
## 1 1039 Land Rover Freelander 2 SE 13.28 Lakh 1,58,108 kms Diesel Automatic 1st Owner
## 2 4936 Maruti Ciaz VDi Plus SHVS 5.20 Lakh 1,00,000 kms Diesel Manual 1st Owner
## 3 2233 Honda Amaze S Petrol BSIV 6.80 Lakh 59,858 kms Petrol Manual 1st Owner
## 4 5194 Maruti Swift 1.3 VXi 1.42 Lakh 1,93,593 kms Petrol Manual 1st Owner
## 5 1573 Skoda New Laura Ambiente 2.0 TDI CR AT 2.53 Lakh 1,09,403 kms Diesel Automatic 3rd Owner
## 6 4725 Maruti Alto 800 LXI 1.46 Lakh 2,69,930 kms Petrol Manual 1st Owner
## manufacture engine Seats
## 1 2014 999 cc 5 Seats
## 2 2016 1493 cc 4 Seats
## 3 2018 2179 cc 5 Seats
## 4 2009 1896 cc 5 Seats
## 5 2012 1993 cc 5 Seats
## 6 2014 814 cc 5 Seats
#Modifying kilometers column
mydata$kms <- gsub("kms","",as.character(mydata$kms_driven))
mydata$kms <- as.numeric(gsub(",","",as.character(mydata$kms)))
mydata <- mydata %>% select(-kms_driven)
#Modifying prices column
mydata$car_prices_in_rupee <- gsub(" ","",as.character(mydata$car_prices_in_rupee))
mydata$car_prices_in_rupee <- gsub(",","",as.character(mydata$car_prices_in_rupee))
mydata$car_prices_in_rupee <- (gsub("Lakh","",as.character(mydata$car_prices_in_rupee)))
#Further in the analysis that coefficients are absolutely crazy high if I use prices in rupees, I will rather use them in lakh.
options(scipen = 0)
#Modifying by fuel type. As there is only electric fueled cars and a few has fueled
mydata <- mydata %>% filter(fuel_type != "electric" & fuel_type != "Cng")
#Modifying the ownership column
mydata$ownership <- as.numeric(substr(mydata$ownership, 1, 1))
#Modifying engine volume column
mydata$engine <- as.numeric(gsub(" cc", "", as.character(mydata$engine)))
#Modifying the column with number of seats
mydata$Seats <- as.numeric(gsub(" Seats", "", as.character(mydata$Seats)))
colnames(mydata)[3] <- "Price"
mydata$Price <- as.numeric(mydata$Price)
final_data <- mydata %>% select(Price, ownership, transmission, manufacture, engine, kms)
head(final_data)
## Price ownership transmission manufacture engine kms
## 1 13.28 1 Automatic 2014 999 158108
## 2 5.20 1 Manual 2016 1493 100000
## 3 6.80 1 Manual 2018 2179 59858
## 4 1.42 1 Manual 2009 1896 193593
## 5 2.53 3 Automatic 2012 1993 109403
## 6 1.46 1 Manual 2014 814 269930
I will perform further analysis on the “final_data” table which consists of the following variables:
How do number of previous owners, type of transmission, year of manufacture, engine capacity and kilometers driven affect the average price of the car in India?
In the regression model I will analyse how the factors mentioned above affect the price of the car on Indian market. I deducted that these are the most common factors when determening the used car price from the following source:
“The Determinants of Used Car Prices: An Empirical Study” by Jia Wei and Liping Fang, published in the Journal of Industrial Economics and Management
scatterplotMatrix(final_data[,c(1,2,4,5,6)], smooth = FALSE)
The first row represents the correlation of price of used cars to the independent variables. Price is negatively correlated with the number of previous owners and kilometers driven by the car. It is positively correlated with the engine size and the year of manufacture, which is all quite straightforward. There is no visible multicolinearity between the independent/explanatory variables.
rcorr(as.matrix(final_data[,c(1,2,4,5,6)]),
type = "pearson")
## Price ownership manufacture engine kms
## Price 1.00 -0.18 0.42 0.26 -0.28
## ownership -0.18 1.00 -0.45 0.01 0.31
## manufacture 0.42 -0.45 1.00 -0.06 -0.49
## engine 0.26 0.01 -0.06 1.00 -0.05
## kms -0.28 0.31 -0.49 -0.05 1.00
##
## n= 492
##
##
## P
## Price ownership manufacture engine kms
## Price 0.0000 0.0000 0.0000 0.0000
## ownership 0.0000 0.0000 0.8178 0.0000
## manufacture 0.0000 0.0000 0.1906 0.0000
## engine 0.0000 0.8178 0.1906 0.2484
## kms 0.0000 0.0000 0.0000 0.2484
From correlation matrix we see that the correlation between Price and other independent variables is not strong (if we draw the line strong correlation at 0.5 as most of sources suggest).
The Pearson correlation matrix shows strong correlation of engine with other variables so I will perform a vif test bellow after making the regression estimation.
regression <- lm(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms)
regression
##
## Call:
## lm(formula = Price ~ ownership + transmission + manufacture +
## engine + kms, data = final_data)
##
## Coefficients:
## (Intercept) ownership transmissionManual manufacture engine kms
## -2.051e+03 -8.263e-01 -1.703e+01 1.026e+00 4.315e-03 1.943e-06
vif(regression)
## ownership transmission manufacture engine kms
## 1.288090 1.161079 1.617137 1.066443 1.373437
mean(vif(regression))
## [1] 1.301237
sqrt(5)
## [1] 2.236068
Using the variance inflation factor (VIF) test to identify if the regression has a problem of multicolineartiy. I see that all of the values are smaller than a square root of 5, meaning there is no problem with multicolinearity of the variables in the performed regression.
The mean VIF value is 1.30 and this is a strong indicator of no multicolinearity, however all VIF values are not bellow 1.5, which we would ultimatley want, but we can live with these values and claim no multicolinearity with adequate certainty.
final_data$StdResid <- round(rstandard(regression), 3)
final_data$CooksD <- round(cooks.distance(regression), 3)
hist(final_data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(final_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: final_data$StdResid
## W = 0.84011, p-value < 2.2e-16
hist(final_data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
The histogram of standardized residuals and the histogram of Cooks distances shows some outliers, which I will remove later in the analysis.
I can assume that the distribution of residuals is not normal from the graph alone as it has a tail on the right, however I will perform the Shapiro_Wilk test to check the normal distribution.
Ho: the standardized residuals is normally distributed
H1: the standardized residuals is not normally distributed
From Shapiro-Wilk test I reject the null hypothesis, meaning standardized residuals are not normally distributed as p-value is lower than 0.001.
head(final_data[order(-final_data$StdResid),], 10)
## Price ownership transmission manufacture engine kms StdResid CooksD
## 138 85.0 1 Automatic 2020 1969 38000 5.686 0.051
## 348 83.0 1 Automatic 2020 1950 25000 5.491 0.047
## 393 74.0 1 Automatic 2021 1968 9900 4.460 0.037
## 243 71.5 1 Automatic 2021 1493 24900 4.408 0.033
## 335 72.5 1 Automatic 2018 2967 15000 4.197 0.060
## 62 67.0 1 Automatic 2021 1950 8900 3.748 0.026
## 222 59.5 2 Automatic 2016 2499 45000 3.344 0.027
## 369 58.0 1 Automatic 2020 1197 17000 3.261 0.019
## 251 56.5 1 Automatic 2020 998 24000 3.196 0.021
## 21 56.0 1 Automatic 2019 1248 45000 3.130 0.016
head(final_data[order(-final_data$CooksD),], 20)
## Price ownership transmission manufacture engine kms StdResid CooksD
## 335 72.50 1 Automatic 2018 2967 15000 4.197 0.060
## 138 85.00 1 Automatic 2020 1969 38000 5.686 0.051
## 492 61.00 1 Automatic 2021 4367 15000 2.115 0.051
## 348 83.00 1 Automatic 2020 1950 25000 5.491 0.047
## 393 74.00 1 Automatic 2021 1968 9900 4.460 0.037
## 243 71.50 1 Automatic 2021 1493 24900 4.408 0.033
## 222 59.50 2 Automatic 2016 2499 45000 3.344 0.027
## 62 67.00 1 Automatic 2021 1950 8900 3.748 0.026
## 251 56.50 1 Automatic 2020 998 24000 3.196 0.021
## 369 58.00 1 Automatic 2020 1197 17000 3.261 0.019
## 46 61.00 1 Automatic 2021 1968 20000 3.120 0.017
## 21 56.00 1 Automatic 2019 1248 45000 3.130 0.016
## 208 51.75 1 Automatic 2019 998 46000 2.807 0.016
## 229 58.00 1 Automatic 2018 1995 70000 3.106 0.016
## 331 60.00 1 Automatic 2021 1997 19000 3.005 0.016
## 405 60.00 1 Automatic 2021 1968 19000 3.018 0.016
## 81 51.00 1 Automatic 2015 1968 40000 2.721 0.013
## 431 4.70 1 Automatic 2019 0 30000 -1.600 0.013
## 5 2.53 3 Automatic 2012 1993 109403 -1.810 0.012
## 186 1.20 1 Manual 2010 3996 15166 -1.041 0.012
final_data <- final_data[abs(final_data$StdResid) <= 3, ]
final_data <- final_data[final_data$CooksD <= 0.02, ]
I drew the line at Cooks distances at value of 0.02 as the decline looked pretty steady from there on. The rule of thumb that suggest that we should cut at 4/nrows (most sources mention that) which would give me a number of 0.008 however on lectures we read the units with high impact from the histogram.
I removed the problematic units of observations regarding the standardized residuals and units with high impact.
shapiro.test(final_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: final_data$StdResid
## W = 0.9342, p-value = 1.173e-13
Again, from the Shapiro-Wilk test I rejejct the null hypothesis and I can conclude that standardised residuals are not normally distributed (p < 0.001), however this has no influence as we have more than 100 observations in the sample and the violation of normality can be dismissed. Now i will reestimate the regression model due to removed outliers.
regression2 <- lm(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms)
final_data$StdResid <- round(rstandard(regression2), 3)
final_data$CooksD <- round(cooks.distance(regression2), 3)
hist(final_data$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")
shapiro.test(final_data$StdResid)
##
## Shapiro-Wilk normality test
##
## data: final_data$StdResid
## W = 0.84934, p-value < 2.2e-16
hist(final_data$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
Some standard residuals are still aabove 3, however we said in class if the distribution is not normal we should not eliminate those which are not on the [-3, 3] interval further on, as we could do this too many times and would run out of data.
final_data$StdFittedValues <- scale(regression2$fitted.values)
scatterplot(y = final_data$StdResid, x = final_data$StdFittedValues,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
There seems to be the problem with homoskedasticity, so I will perform the Breusch-Pagan test.
H0: Variance of standardized residuals is constant
H1: Variance of standardized residuals is not constant
ols_test_breusch_pagan(regression2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 482.1215
## Prob > Chi2 = 7.380934e-107
Based on the formal test for heteroskedasticity we reject the null hypothesis and conclude that the variance of standardized residuals is not constant. To overcome this issue we will use robust standard errors.
head(final_data[order(-final_data$StdResid),], 10)
## Price ownership transmission manufacture engine kms StdResid CooksD StdFittedValues
## 208 51.75 1 Automatic 2019 998 46000 4.529 0.044 1.564739
## 81 51.00 1 Automatic 2015 1968 40000 4.463 0.037 1.520969
## 416 49.50 1 Automatic 2018 1995 65000 3.886 0.027 1.860826
## 236 47.00 1 Automatic 2020 1248 7500 3.674 0.028 1.722088
## 324 48.00 1 Automatic 2019 1999 29000 3.598 0.022 1.927362
## 389 48.10 1 Automatic 2019 2179 29000 3.542 0.024 1.995661
## 466 46.00 1 Automatic 2017 1995 65000 3.487 0.021 1.759386
## 234 44.75 2 Automatic 2020 1197 27000 3.484 0.034 1.602298
## 132 46.00 1 Automatic 2017 2179 65000 3.415 0.023 1.829204
## 409 44.00 1 Automatic 2017 1498 49000 3.413 0.018 1.554616
head(final_data[order(-final_data$CooksD),], 10)
## Price ownership transmission manufacture engine kms StdResid CooksD StdFittedValues
## 208 51.75 1 Automatic 2019 998 46000 4.529 0.044 1.564739
## 81 51.00 1 Automatic 2015 1968 40000 4.463 0.037 1.520969
## 234 44.75 2 Automatic 2020 1197 27000 3.484 0.034 1.602298
## 489 47.00 1 Automatic 2020 2755 7500 3.082 0.034 2.293908
## 109 45.00 1 Automatic 2018 2755 25500 2.984 0.028 2.109240
## 236 47.00 1 Automatic 2020 1248 7500 3.674 0.028 1.722088
## 416 49.50 1 Automatic 2018 1995 65000 3.886 0.027 1.860826
## 210 39.75 1 Automatic 2013 1498 95004 3.189 0.026 1.195399
## 304 39.50 2 Automatic 2016 2967 51000 2.423 0.024 1.892434
## 389 48.10 1 Automatic 2019 2179 29000 3.542 0.024 1.995661
##Reestimating the regression model to fix heteroskedasticity with robust standard errors
regression3 <- lm_robust(data = final_data, Price ~ ownership + transmission + manufacture + engine + kms,
se_type = "HC1")
summary(regression3)
##
## Call:
## lm_robust(formula = Price ~ ownership + transmission + manufacture +
## engine + kms, data = final_data, se_type = "HC1")
##
## Standard error type: HC1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -1.483e+03 1.679e+02 -8.834 2.014e-17 -1.813e+03 -1.153e+03 471
## ownership -8.809e-01 4.299e-01 -2.049 4.103e-02 -1.726e+00 -3.606e-02 471
## transmissionManual -1.353e+01 1.197e+00 -11.305 2.200e-26 -1.588e+01 -1.118e+01 471
## manufacture 7.436e-01 8.308e-02 8.951 8.169e-18 5.804e-01 9.069e-01 471
## engine 2.782e-03 6.731e-04 4.132 4.251e-05 1.459e-03 4.104e-03 471
## kms 7.417e-06 5.342e-06 1.388 1.657e-01 -3.080e-06 1.791e-05 471
##
## Multiple R-squared: 0.5288 , Adjusted R-squared: 0.5238
## F-statistic: 46.98 on 5 and 471 DF, p-value: < 2.2e-16
lin_relationship <- sqrt(summary(regression3)$r.squared)
lin_relationship
## [1] 0.7271657
Ownership: If the number previous owners increases by 1, then the price of a used car on average decreases by 0.88 Lakh (p < 0.05), assuming all other variables remain unchanged.
transmissonManual: Given the values of all other variables, used cars with manual transmission have on average 1.35 Lakh lower price than used cars with automatic transmission (p < 0.001)
manufacture: If the year that the car was produced increases by 1 year (meaning it is newer), then the price of a used car on average increases by 0.74 Lakh (p < 0.001), assuming all other variables remain unchanged.
engine: If the engines capacity increases by 1cc, then the price of a used car on average decreases by 0.0028 Lakh (p < 0.001), assuming all other variables remain unchanged.
kms: I can not say that kilometers of a car affects the price of used car (p = 0.17)
Multiple R-squared has the value of 0.529, indicating 52.9% of variability of used car prices is explained by the variability of 5 explanatory variables in the model.
F-statistic:
H0: ^^2 = 0
H1: ^2 > 0
I can reject H0 and conclude that there is some relationship between the dependent and at least one explanatory variable.
Linear relationship between prices of used cars and other 5 explanatory variables is strong as multiple R value is 0.73