library(dplyr)
library(ggplot2)
library(ggcorrplot)
library(plotly)
df <- read.csv('data.csv')
df <- na.omit(df)
summary(df)
## maker_key model_key mileage engine_power
## Length:4843 Length:4843 Min. : -64 Min. : 0
## Class :character Class :character 1st Qu.: 102914 1st Qu.:100
## Mode :character Mode :character Median : 141080 Median :120
## Mean : 140963 Mean :129
## 3rd Qu.: 175196 3rd Qu.:135
## Max. :1000376 Max. :423
## registration_date fuel paint_color car_type
## Length:4843 Length:4843 Length:4843 Length:4843
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## feature_1 feature_2 feature_3 feature_4
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:2181 FALSE:1004 FALSE:3865 FALSE:3881
## TRUE :2662 TRUE :3839 TRUE :978 TRUE :962
##
##
##
## feature_5 feature_6 feature_7 feature_8
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:2613 FALSE:3674 FALSE:329 FALSE:2223
## TRUE :2230 TRUE :1169 TRUE :4514 TRUE :2620
##
##
##
## price sold_at
## Min. : 100 Length:4843
## 1st Qu.: 10800 Class :character
## Median : 14200 Mode :character
## Mean : 15828
## 3rd Qu.: 18600
## Max. :178500
df <- df[!(df$mileage < 0 | df$engine_power == 0),]
df$feature_1 <- as.integer(df$feature_1)
df$feature_2 <- as.integer(df$feature_2)
df$feature_3 <- as.integer(df$feature_3)
df$feature_4 <- as.integer(df$feature_4)
df$feature_5 <- as.integer(df$feature_5)
df$feature_6 <- as.integer(df$feature_6)
df$feature_7 <- as.integer(df$feature_7)
df$feature_8 <- as.integer(df$feature_8)
df$sold_at <- as.Date(df$sold_at)
df$registration_date <- as.Date(df$registration_date)
df$age <- df$sold_at - df$registration_date
df$age <- as.numeric(df$age)
df <- df[, !(names(df) %in% c('registration_date', 'sold_at'))]
df <- df[!(df$fuel == 'hybrid_petrol' | df$fuel == 'electro'),]
Price and Age
res <- cor.test(df$price, df$age, method='pearson')
res$estimate
## cor
## -0.4454861
Price and age have a moderate negative correlation (r ≈ -0.45), i.e. older cars tend to sell for less.
Price and Mileage
res <- cor.test(df$price, df$mileage, method='pearson')
res$estimate
## cor
## -0.407236
Price and mileage have a moderate negative correlation (r ≈ -0.41), i.e. cars with higher mileage tend to sell for less.
Price and Engine Power
res <- cor.test(df$price, df$engine_power, method='pearson')
res$estimate
## cor
## 0.6451558
Price and engine power have a fairly strong positive correlation (r ≈ 0.65), i.e. cars with more engine power tend to sell for more.
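To make the reading of a Pearson coefficient more concrete, here is a minimal sketch on simulated data (the variables `x` and `y` are hypothetical and are not columns of the car dataset). It shows that `cor.test` also returns a p-value and a confidence interval, and that squaring r gives the share of variance a single straight-line fit would explain.

```r
# Simulated data only: x and y are hypothetical, not the car dataset.
set.seed(6)
x <- rnorm(200)
y <- -0.5 * x + rnorm(200)   # built to have a moderate negative correlation

res <- cor.test(x, y, method = "pearson")
r   <- unname(res$estimate)  # the correlation coefficient

# r^2 is the share of variance in y explained by a straight line in x
cat("r =", round(r, 3), " r^2 =", round(r^2, 3),
    " p =", signif(res$p.value, 3), "\n")
cat("95% CI for r:", round(res$conf.int, 3), "\n")
```

A coefficient around -0.45 therefore corresponds to roughly 20% of the variance explained, which is why "moderate" is a fair label for it.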
factors <- df[,c('mileage', 'engine_power', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'price', 'age')]
corr <- cor(factors)
p.mat <- cor_pmat(factors)
corr.plot <- ggcorrplot(corr, hc.order = FALSE, type = "lower", outline.col = "white", p.mat = p.mat)
ggplotly(corr.plot)
As can be seen, there seems to be no evidence of multicollinearity, as no pair of independent variables is strongly correlated. We can also see that the correlation coefficient between any independent variable and the dependent variable price is no higher than 0.65, so no single independent variable has a particularly strong linear relationship with price.
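A pairwise correlation plot can miss collinearity that involves several predictors at once. One common complement is the variance inflation factor (VIF). The sketch below computes it by hand in base R on simulated data; the predictors `x1`, `x2`, `x3` are hypothetical, and this check was not part of the original analysis.

```r
# Simulated data only: x1, x2, x3 are hypothetical predictors.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately collinear with x1
x3 <- rnorm(n)                        # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF for predictor j is 1 / (1 - R^2), where R^2 comes from
# regressing predictor j on all the other predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

Values close to 1 suggest little collinearity. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign, although that threshold is a convention rather than a strict test.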
one.way <- aov(price ~ fuel, data = df)
summary(one.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## fuel 1 3.782e+08 378192469 4.578 0.0324 *
## Residuals 4828 3.989e+11 82614428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
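The F-statistic for fuel is 4.578 with a p-value of 0.0324. Since this is below 0.05, the effect of fuel on price is statistically significant at the 5% level and is unlikely to be due to chance alone. However, the F-statistic is small and fuel explains very little of the variation in price (sum of squares 3.782e+08 against a residual sum of squares of 3.989e+11).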
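The decision from an ANOVA table depends on the p-value, not on the size of the F-statistic alone. As a minimal sketch on simulated data (the data frame `toy` and its columns are hypothetical), the values can be pulled out of the `aov` summary and compared with alpha directly:

```r
# Simulated data only: 'toy' is hypothetical, not the car dataset.
set.seed(2)
toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 50)),
  y     = c(rnorm(50, mean = 10), rnorm(50, mean = 12))
)

fit <- aov(y ~ group, data = toy)
tab <- summary(fit)[[1]]          # the ANOVA table as a data frame

f_value <- tab[["F value"]][1]    # row 1 is the 'group' term
p_value <- tab[["Pr(>F)"]][1]
alpha   <- 0.05

cat("F =", round(f_value, 2), " p =", signif(p_value, 3), "\n")
cat("Reject H0 at alpha = 0.05:", p_value <= alpha, "\n")
```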
two.way <- aov(price ~ mileage + age, data = df)
summary(two.way)
## Df Sum Sq Mean Sq F value Pr(>F)
## mileage 1 6.621e+10 6.621e+10 1057.4 <2e-16 ***
## age 1 3.078e+10 3.078e+10 491.5 <2e-16 ***
## Residuals 4827 3.023e+11 6.262e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the F-statistics for mileage and age are high and both p-values are below 2e-16. Hence, it is likely that the variation in price associated with these variables is real and not due to chance.
df1 <- sample_n(df, 500)
fit <- lm(price ~ engine_power, data = df1)
fig <- df1 %>%
plot_ly(x = ~engine_power, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha=0.75, name='Scatter points') %>%
layout(title = "Engine Power vs Price",
xaxis = list(title = 'Engine Power', dtick=20, range= c(60, 350)),
yaxis = list(title = 'Price', dtick=10000, range = c(0, 80000))) %>%
add_trace(x = ~engine_power, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name='Regression Line')
fig
fit <- lm(price ~ mileage, data = df1)
fig <- df1 %>%
plot_ly(x = ~mileage, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha=0.75, name='Scatter points') %>%
layout(title = "Mileage vs Price",
xaxis = list(title = 'Mileage', dtick=30000, range= c(0, 400000)),
yaxis = list(title = 'Price', dtick=10000, range = c(0, 80000))) %>%
add_trace(x = ~mileage, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name='Regression Line')
fig
fit <- lm(price ~ age, data = df1)
fig <- df1 %>%
plot_ly(x = ~age, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha=0.75, name='Scatter points') %>%
layout(title = "Age vs Price",
xaxis = list(title = 'Age (days)', dtick=1000, range= c(0, 6000)),
yaxis = list(title = 'Price', dtick=10000, range = c(0, 80000))) %>%
add_trace(x = ~age, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name='Regression Line')
fig
The independent variables age, mileage and engine power are somewhat related to the dependent variable price. However, there seem to be many outliers.
1. Set up the hypotheses and select the alpha level
H0: β Predictors = 0 (none of the independent variables is a predictor of price)
H1: β Predictors ≠ 0 (at least one of the slope coefficients is different from 0 and is a predictor of price)
α = 0.05
2. Select the appropriate test statistic (we are going to use an F-test statistic).
3. State the decision rule
Reject H0 if p-value <= α. Otherwise, do not reject H0
4. Compute the test statistic and the associated p-value.
factors <- df[,c('mileage', 'fuel', 'model_key', 'engine_power', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'price', 'age')]
m <- lm(formula = price ~ mileage + engine_power + age , data = factors)
summary(m)
##
## Call:
## lm(formula = price ~ mileage + engine_power + age, data = factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28437 -2425 -235 1996 159363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.727e+03 3.665e+02 21.08 <2e-16 ***
## mileage -3.614e-02 1.572e-03 -23.00 <2e-16 ***
## engine_power 1.431e+02 2.111e+00 67.82 <2e-16 ***
## age -2.674e+00 1.021e-01 -26.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5663 on 4826 degrees of freedom
## Multiple R-squared: 0.6124, Adjusted R-squared: 0.6121
## F-statistic: 2541 on 3 and 4826 DF, p-value: < 2.2e-16
p-value < 2.2e-16
R-squared value = 0.61
5. Conclusion.
Reject H0, since p <= 0.05. We have sufficient evidence at the α = 0.05 significance level that there is a linear association between price and at least one of the independent variables.
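The overall F-test used in the five steps above can also be reproduced programmatically instead of being read off the printed summary. This is a minimal sketch on simulated data (the data frame `toy` and its columns are hypothetical); it relies on the `fstatistic` element that `summary.lm` returns.

```r
# Simulated data only: 'toy' is hypothetical, not the car dataset.
set.seed(3)
n   <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 0.5 * toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# summary()$fstatistic is a named vector: value, numdf, dendf
fs <- summary(fit)$fstatistic
p_overall <- pf(fs[["value"]], fs[["numdf"]], fs[["dendf"]],
                lower.tail = FALSE)

alpha <- 0.05
cat("F =", round(fs[["value"]], 1), "on", fs[["numdf"]], "and",
    fs[["dendf"]], "DF, p =", signif(p_overall, 3), "\n")
cat("Reject H0:", p_overall <= alpha, "\n")
```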
From the above test we can conclude that some variables are predictors of price. We now create a new model to check whether all variables are predictors of price and to see the effect on the R-squared value.
m <- lm(formula = price ~ ., data = factors)
summary(m)
##
## Call:
## lm(formula = price ~ ., data = factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22347 -1484 27 1560 159074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.315e+04 1.307e+03 10.065 < 2e-16 ***
## mileage -3.329e-02 1.393e-03 -23.905 < 2e-16 ***
## fuelpetrol -2.823e+02 4.193e+02 -0.673 0.500834
## model_key116 2.729e+01 1.236e+03 0.022 0.982378
## model_key118 2.331e+02 1.286e+03 0.181 0.856135
## model_key120 5.488e+02 1.477e+03 0.372 0.710182
## model_key123 -2.819e+03 3.536e+03 -0.797 0.425325
## model_key125 -2.739e+03 2.447e+03 -1.119 0.263020
## model_key135 2.246e+03 2.741e+03 0.819 0.412588
## model_key214 Gran Tourer 3.665e+03 4.800e+03 0.764 0.445156
## model_key216 -9.391e+03 4.810e+03 -1.952 0.050955 .
## model_key216 Active Tourer 6.790e+02 2.942e+03 0.231 0.817473
## model_key216 Gran Tourer 1.751e+03 2.618e+03 0.669 0.503630
## model_key218 2.119e+03 1.774e+03 1.195 0.232312
## model_key218 Active Tourer 1.228e+03 1.744e+03 0.704 0.481575
## model_key218 Gran Tourer -3.669e+02 1.974e+03 -0.186 0.852586
## model_key220 4.589e+03 2.957e+03 1.552 0.120753
## model_key220 Active Tourer -3.147e+03 4.814e+03 -0.654 0.513341
## model_key225 -8.056e+02 4.822e+03 -0.167 0.867309
## model_key225 Active Tourer -3.703e+02 4.841e+03 -0.076 0.939031
## model_key316 1.431e+03 1.253e+03 1.142 0.253435
## model_key318 1.461e+03 1.244e+03 1.174 0.240376
## model_key318 Gran Turismo 3.428e+03 1.315e+03 2.607 0.009159 **
## model_key320 1.131e+03 1.264e+03 0.895 0.371078
## model_key320 Gran Turismo 3.074e+03 1.371e+03 2.243 0.024929 *
## model_key325 7.006e+02 1.681e+03 0.417 0.676867
## model_key325 Gran Turismo 6.076e+03 2.660e+03 2.284 0.022414 *
## model_key328 3.781e+02 2.473e+03 0.153 0.878479
## model_key330 3.901e+02 1.583e+03 0.246 0.805335
## model_key330 Gran Turismo 3.691e+03 3.008e+03 1.227 0.219870
## model_key335 3.858e+03 2.268e+03 1.701 0.088971 .
## model_key335 Gran Turismo 3.852e+03 3.055e+03 1.261 0.207481
## model_key418 Gran Coupé 6.044e+03 2.057e+03 2.939 0.003310 **
## model_key420 7.730e+03 1.435e+03 5.387 7.51e-08 ***
## model_key420 Gran Coupé 7.026e+03 1.496e+03 4.698 2.70e-06 ***
## model_key425 1.264e+04 3.533e+03 3.577 0.000351 ***
## model_key430 9.579e+03 3.555e+03 2.694 0.007075 **
## model_key430 Gran Coupé 6.743e+03 2.694e+03 2.503 0.012344 *
## model_key435 7.027e+03 2.739e+03 2.565 0.010343 *
## model_key435 Gran Coupé 1.024e+04 2.379e+03 4.305 1.70e-05 ***
## model_key518 4.742e+03 1.365e+03 3.475 0.000516 ***
## model_key520 4.094e+03 1.276e+03 3.209 0.001342 **
## model_key520 Gran Turismo 6.549e+03 1.596e+03 4.104 4.14e-05 ***
## model_key523 5.088e+03 2.671e+03 1.905 0.056820 .
## model_key525 3.629e+03 1.348e+03 2.692 0.007135 **
## model_key528 4.366e+03 2.330e+03 1.873 0.061061 .
## model_key530 4.321e+03 1.416e+03 3.051 0.002295 **
## model_key530 Gran Turismo 5.841e+03 1.752e+03 3.334 0.000863 ***
## model_key535 4.752e+03 1.619e+03 2.935 0.003355 **
## model_key535 Gran Turismo 3.845e+03 3.587e+03 1.072 0.283795
## model_key630 -8.538e+02 4.851e+03 -0.176 0.860308
## model_key635 5.667e+03 4.865e+03 1.165 0.244195
## model_key640 1.446e+04 2.211e+03 6.540 6.79e-11 ***
## model_key640 Gran Coupé 1.602e+04 1.832e+03 8.749 < 2e-16 ***
## model_key650 -9.151e+03 3.668e+03 -2.495 0.012624 *
## model_key730 9.533e+03 1.585e+03 6.015 1.93e-09 ***
## model_key735 4.788e+02 4.876e+03 0.098 0.921789
## model_key740 1.738e+04 1.834e+03 9.472 < 2e-16 ***
## model_key750 1.173e+04 3.662e+03 3.203 0.001370 **
## model_keyM135 4.482e+03 4.877e+03 0.919 0.358097
## model_keyM235 4.846e+03 3.060e+03 1.584 0.113284
## model_keyM3 1.673e+04 2.449e+03 6.832 9.40e-12 ***
## model_keyM4 3.199e+04 3.714e+03 8.612 < 2e-16 ***
## model_keyM5 1.398e+04 5.102e+03 2.740 0.006166 **
## model_keyM550 1.084e+04 2.101e+03 5.159 2.59e-07 ***
## model_keyX1 1.242e+03 1.266e+03 0.981 0.326448
## model_keyX3 4.773e+03 1.290e+03 3.700 0.000218 ***
## model_keyX4 1.517e+04 1.465e+03 10.359 < 2e-16 ***
## model_keyX5 1.328e+04 1.410e+03 9.416 < 2e-16 ***
## model_keyX5 M 1.872e+04 1.780e+03 10.518 < 2e-16 ***
## model_keyX5 M50 2.379e+04 3.134e+03 7.590 3.84e-14 ***
## model_keyX6 1.605e+04 1.596e+03 10.061 < 2e-16 ***
## model_keyX6 M 2.565e+04 2.246e+03 11.419 < 2e-16 ***
## model_keyZ4 2.530e+03 2.293e+03 1.104 0.269863
## engine_power 5.228e+01 5.157e+00 10.139 < 2e-16 ***
## feature_1 6.158e+02 1.614e+02 3.815 0.000138 ***
## feature_2 -7.723e+01 2.031e+02 -0.380 0.703805
## feature_3 7.706e+02 1.828e+02 4.215 2.54e-05 ***
## feature_4 9.978e+02 2.271e+02 4.393 1.14e-05 ***
## feature_5 -1.089e+02 1.617e+02 -0.673 0.500721
## feature_6 1.296e+03 1.710e+02 7.582 4.08e-14 ***
## feature_7 7.715e+02 3.150e+02 2.449 0.014353 *
## feature_8 1.416e+03 1.683e+02 8.413 < 2e-16 ***
## age -2.692e+00 1.006e-01 -26.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4641 on 4746 degrees of freedom
## Multiple R-squared: 0.7439, Adjusted R-squared: 0.7395
## F-statistic: 166.1 on 83 and 4746 DF, p-value: < 2.2e-16
As seen from the model above, the R-squared value increased from 0.61 to 0.74 when we used all the features, and the adjusted R-squared rose from 0.6121 to 0.7395. Hence, the model using all the factors fits the data considerably better.
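Plain R-squared never decreases when predictors are added, so adjusted R-squared or a nested-model F-test gives a fairer comparison. The sketch below illustrates this on simulated data; the data frame `toy` and the models `small` and `full` are hypothetical and only stand in for the three-variable and all-variable models.

```r
# Simulated data only: 'toy', 'small' and 'full' are hypothetical.
set.seed(4)
n   <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 + 1.5 * toy$x1 + 0.8 * toy$x2 + rnorm(n)

small <- lm(y ~ x1, data = toy)
full  <- lm(y ~ x1 + x2 + noise, data = toy)

r2  <- c(small = summary(small)$r.squared,
         full  = summary(full)$r.squared)
adj <- c(small = summary(small)$adj.r.squared,
         full  = summary(full)$adj.r.squared)
print(round(rbind(r2, adj), 3))

# F-test for whether the extra terms improve the fit significantly
cmp <- anova(small, full)
p   <- cmp[["Pr(>F)"]][2]
cat("p-value for the added terms:", signif(p, 3), "\n")
```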
res = resid(m)
fig <- plot_ly(x = fitted(m), y = res, type = 'scatter', mode = 'markers', alpha=0.75) %>%
layout(title = "Residual vs Fitted values",
xaxis = list(title = 'Fitted values', dtick=10000, range= c(-10000, 80000)),
yaxis = list(title = 'Residual', dtick=10000, range = c(-20000, 20000)))
fig
As evident from the plot above, there appears to be no clear pattern between the residuals and the fitted values, so the model seems to be reasonable.
The residuals, even though they are clustered and include some outliers, seem reasonable.
boxplot(df$price)$out
## [1] 69700 33400 36300 31000 31000 34300 30500 38900 36300 32100
## [11] 47000 55200 53600 32000 31500 39900 36900 32100 36500 30700
## [21] 45800 35700 31300 30600 33900 36000 37100 40900 30800 34500
## [31] 32100 40500 30900 35700 40000 36500 32100 32100 39300 33100
## [41] 32800 33300 37200 32700 40700 36700 62500 44600 30600 34000
## [51] 43000 31900 33300 35300 34300 58300 31500 63700 40500 38700
## [61] 51200 34600 34900 40700 39900 65400 30600 31500 36900 52400
## [71] 38000 32600 39200 32100 36200 37200 33100 31300 31400 47300
## [81] 32000 68300 68300 30900 51500 68700 41100 41600 47000 38000
## [91] 30600 42700 35200 31100 38100 43700 64300 33300 42800 33200
## [101] 33300 41600 33800 32500 41200 41200 38800 34500 41900 36400
## [111] 39900 48200 30500 36900 43400 34000 38100 41600 37200 30900
## [121] 31800 37700 44000 37500 37500 30600 34800 43400 33000 35700
## [131] 30600 34200 31500 43300 35100 49900 40000 49700 31400 33300
## [141] 40900 33300 37200 47400 35700 39000 47000 32700 57100 63100
## [151] 31900 42000 35300 47000 34900 34500 37800 55500 55300 30500
## [161] 42200 82400 32500 44300 35100 42800 35000 35900 38000 41300
## [171] 39900 44100 36300 41100 39300 43400 34500 44300 46500 59300
## [181] 35600 30900 35300 33300 32700 39300 42400 33600 34500 43800
## [191] 30500 44700 35500 42700 48800 41300 33500 33300 36100 35200
## [201] 35900 50000 32200 35400 41600 39400 49900 32200 38400 36100
## [211] 35800 48800 34200 37400 32100 49100 41100 34000 42100 33000
## [221] 38000 41200 31100 46100 32800 50000 40500 33900 45300 36900
## [231] 30900 37800 32700 44600 39300 34500 46600 62500 39300 36400
## [241] 31700 34200 39500 40500 50000 38700 34500 36300 36300 33200
## [251] 33300 39300 38400 43400 45500 35300 55700 37200 61200 33900
## [261] 44600 34600 36900 43600 34500 32600 34600 32500 39900 34500
## [271] 46400 45500 50600 35600 35700 43000 142800 32500 36100 35700
## [281] 43400 34800 37400 39100 43400 35700 34200 40000 39900 46300
## [291] 73100 35400 41100 60700 31100 66600 178500 32800 42100 33600
## [301] 33000 45100 42800 34600 42700 44700 37500
fig <- plot_ly(y = df$price, type = 'box', mode = 'markers', alpha=0.75)
fig
Q <- quantile(df$price, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$price)
up <- Q[2] + 1.5 * iqr
low <- Q[1] - 1.5 * iqr
df2 <- subset(factors, price > low & price < up)
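The filter above is the usual 1.5 × IQR rule. As a small worked illustration on a made-up vector (the values in `prices` are hypothetical and not taken from the dataset):

```r
# Made-up values for illustration only.
prices <- c(9, 10, 11, 12, 13, 14, 15, 100)

q     <- quantile(prices, probs = c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])   # 1.5 times the interquartile range
low   <- q[1] - fence
up    <- q[2] + fence

keep <- prices > low & prices < up
cat("Q1 =", q[1], " Q3 =", q[2], " low =", low, " up =", up, "\n")
cat("Dropped as outliers:", prices[!keep], "\n")
```

With R's default quantile method the quartiles here are 10.75 and 14.25, so the fences are 5.5 and 19.5 and only the value 100 is dropped.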
Linear model after elimination of outliers
Test and Train split
df3 = sort(sample(nrow(df2), nrow(df2)*0.7))
train <- df2[df3,]
test<- df2[-df3,]
m <- lm(formula = price ~ ., data = train)
summary(m)
##
## Call:
## lm(formula = price ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19480.0 -1223.5 65.4 1417.4 19584.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.296e+04 9.818e+02 13.198 < 2e-16 ***
## mileage -2.451e-02 9.836e-04 -24.923 < 2e-16 ***
## fuelpetrol -7.844e+02 2.983e+02 -2.629 0.008595 **
## model_key116 3.863e+02 9.191e+02 0.420 0.674339
## model_key118 8.193e+02 9.569e+02 0.856 0.391972
## model_key120 1.509e+03 1.086e+03 1.389 0.164826
## model_key123 -1.872e+03 2.142e+03 -0.874 0.382195
## model_key125 -1.576e+03 1.662e+03 -0.948 0.343168
## model_key135 6.054e+03 1.746e+03 3.466 0.000535 ***
## model_key214 Gran Tourer 4.723e+03 2.832e+03 1.668 0.095471 .
## model_key216 Active Tourer 1.542e+03 1.793e+03 0.860 0.389915
## model_key216 Gran Tourer 3.355e+03 1.618e+03 2.074 0.038169 *
## model_key218 1.144e+03 1.250e+03 0.915 0.360094
## model_key218 Active Tourer 2.740e+03 1.221e+03 2.244 0.024913 *
## model_key218 Gran Tourer 1.873e+03 1.426e+03 1.313 0.189176
## model_key220 7.961e+03 2.124e+03 3.748 0.000181 ***
## model_key225 Active Tourer 3.309e+03 2.874e+03 1.152 0.249584
## model_key316 1.355e+03 9.321e+02 1.453 0.146215
## model_key318 1.641e+03 9.289e+02 1.766 0.077425 .
## model_key318 Gran Turismo 3.734e+03 9.708e+02 3.847 0.000122 ***
## model_key320 1.756e+03 9.477e+02 1.852 0.064049 .
## model_key320 Gran Turismo 4.128e+03 1.016e+03 4.064 4.94e-05 ***
## model_key325 1.987e+03 1.256e+03 1.583 0.113547
## model_key325 Gran Turismo 7.344e+03 1.664e+03 4.414 1.05e-05 ***
## model_key328 2.579e+03 1.683e+03 1.533 0.125462
## model_key330 2.811e+03 1.190e+03 2.362 0.018252 *
## model_key330 Gran Turismo 8.412e+03 2.876e+03 2.925 0.003472 **
## model_key335 4.734e+03 1.634e+03 2.898 0.003785 **
## model_key335 Gran Turismo 6.401e+03 1.916e+03 3.341 0.000846 ***
## model_key418 Gran Coupé 6.154e+03 1.514e+03 4.064 4.94e-05 ***
## model_key420 9.033e+03 1.057e+03 8.548 < 2e-16 ***
## model_key420 Gran Coupé 8.299e+03 1.105e+03 7.514 7.49e-14 ***
## model_key430 1.298e+04 2.879e+03 4.509 6.75e-06 ***
## model_key435 Gran Coupé 7.507e+03 2.199e+03 3.414 0.000649 ***
## model_key518 4.999e+03 1.003e+03 4.984 6.56e-07 ***
## model_key520 4.506e+03 9.560e+02 4.714 2.54e-06 ***
## model_key520 Gran Turismo 7.866e+03 1.207e+03 6.519 8.24e-11 ***
## model_key523 2.276e+03 2.859e+03 0.796 0.426038
## model_key525 4.222e+03 1.012e+03 4.173 3.09e-05 ***
## model_key528 7.002e+03 1.576e+03 4.443 9.18e-06 ***
## model_key530 5.342e+03 1.069e+03 4.995 6.21e-07 ***
## model_key530 Gran Turismo 6.576e+03 1.409e+03 4.666 3.20e-06 ***
## model_key535 6.121e+03 1.245e+03 4.917 9.24e-07 ***
## model_key535 Gran Turismo 7.537e+03 2.200e+03 3.426 0.000620 ***
## model_key630 9.396e+02 2.888e+03 0.325 0.744938
## model_key635 7.560e+03 2.905e+03 2.602 0.009313 **
## model_key640 1.301e+04 2.917e+03 4.460 8.50e-06 ***
## model_key640 Gran Coupé 1.452e+04 2.219e+03 6.542 7.09e-11 ***
## model_key730 8.511e+03 1.210e+03 7.036 2.42e-12 ***
## model_key740 8.987e+03 1.524e+03 5.896 4.13e-09 ***
## model_keyM135 8.961e+03 2.917e+03 3.072 0.002146 **
## model_keyM3 1.468e+04 2.303e+03 6.375 2.11e-10 ***
## model_keyM550 7.343e+03 2.009e+03 3.655 0.000261 ***
## model_keyX1 1.880e+03 9.429e+02 1.994 0.046208 *
## model_keyX3 5.244e+03 9.651e+02 5.434 5.95e-08 ***
## model_keyX4 1.069e+04 1.199e+03 8.910 < 2e-16 ***
## model_keyX5 6.467e+03 1.100e+03 5.881 4.52e-09 ***
## model_keyX5 M 1.077e+04 1.868e+03 5.765 8.99e-09 ***
## model_keyX6 1.045e+04 1.286e+03 8.123 6.48e-16 ***
## model_keyZ4 4.085e+03 1.541e+03 2.651 0.008063 **
## engine_power 2.878e+01 4.302e+00 6.690 2.64e-11 ***
## feature_1 4.408e+02 1.129e+02 3.904 9.65e-05 ***
## feature_2 3.708e+02 1.433e+02 2.589 0.009680 **
## feature_3 3.546e+02 1.340e+02 2.647 0.008167 **
## feature_4 1.310e+03 1.698e+02 7.713 1.64e-14 ***
## feature_5 7.324e+02 1.189e+02 6.162 8.10e-10 ***
## feature_6 1.169e+03 1.242e+02 9.411 < 2e-16 ***
## feature_7 5.617e+02 2.247e+02 2.500 0.012466 *
## feature_8 1.395e+03 1.173e+02 11.894 < 2e-16 ***
## age -2.279e+00 7.203e-02 -31.642 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2681 on 3094 degrees of freedom
## Multiple R-squared: 0.7849, Adjusted R-squared: 0.7801
## F-statistic: 163.6 on 69 and 3094 DF, p-value: < 2.2e-16
result <- data.frame(Actual=train$price, Predicted=predict(m))
head(result)
## Actual Predicted
## 3 10200 10109.380
## 6 17100 18693.983
## 7 12400 11469.520
## 8 6100 8646.210
## 9 6200 7616.176
## 10 17300 12520.876
When we remove the outliers and construct our final linear model with all the features, the R-squared improves further, from 0.74 to 0.78.
Even though an R-squared of about 0.78 can be an acceptable performance, our model performed poorly on lower-priced cars. However, a better sampling/splitting method could result in even better performance.
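The predictions shown above are made on the training rows, so they do not measure how the model behaves on unseen cars. A minimal sketch of scoring a model on the held-out split follows; it uses simulated data, and the data frame `toy` and the metric names are hypothetical. Note that with a categorical predictor such as model_key, `predict` stops with an error if the test split contains a factor level that never appeared in the training split, so rare levels would need to be handled before applying this to the real data.

```r
# Simulated data only: 'toy' is hypothetical, not the car dataset.
set.seed(5)
n   <- 300
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 5 + 2 * toy$x + rnorm(n, sd = 2)

# 70/30 split, mirroring the split used in the report
idx   <- sample(nrow(toy), size = floor(0.7 * nrow(toy)))
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)   # predictions on unseen rows

rmse    <- sqrt(mean((test$y - pred)^2))
r2_test <- 1 - sum((test$y - pred)^2) / sum((test$y - mean(test$y))^2)
cat("Test RMSE =", round(rmse, 2), " Test R^2 =", round(r2_test, 3), "\n")
```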
When sampling the dataset, some categories were too rare to be sampled properly, such as the fuel types electro and hybrid_petrol. Hence, they were removed. Some columns with boolean values needed to be changed to numeric for the correlation analysis and linear regression to work. Some mileage values were negative and some engine_power values were zero, which is not possible, so those rows were removed. When analysing the models, we found that fuel type was not a major predictor of price while model type was a major predictor. We found this by trial and error because model_key is a categorical variable.
In conclusion, my model was better than flipping a coin in predicting the price of a used car.