1) Libraries

library(dplyr)
library(ggplot2)
library(ggcorrplot)
library(plotly)

2) Loading Data

df <- read.csv('data.csv')

# Remove NA values
df <- na.omit(df)

# Basic statistics of the data
summary(df)
##   maker_key          model_key            mileage         engine_power
##  Length:4843        Length:4843        Min.   :    -64   Min.   :  0  
##  Class :character   Class :character   1st Qu.: 102914   1st Qu.:100  
##  Mode  :character   Mode  :character   Median : 141080   Median :120  
##                                        Mean   : 140963   Mean   :129  
##                                        3rd Qu.: 175196   3rd Qu.:135  
##                                        Max.   :1000376   Max.   :423  
##  registration_date      fuel           paint_color          car_type        
##  Length:4843        Length:4843        Length:4843        Length:4843       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  feature_1       feature_2       feature_3       feature_4      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2181      FALSE:1004      FALSE:3865      FALSE:3881     
##  TRUE :2662      TRUE :3839      TRUE :978       TRUE :962      
##                                                                 
##                                                                 
##                                                                 
##  feature_5       feature_6       feature_7       feature_8      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2613      FALSE:3674      FALSE:329       FALSE:2223     
##  TRUE :2230      TRUE :1169      TRUE :4514      TRUE :2620     
##                                                                 
##                                                                 
##                                                                 
##      price          sold_at         
##  Min.   :   100   Length:4843       
##  1st Qu.: 10800   Class :character  
##  Median : 14200   Mode  :character  
##  Mean   : 15828                     
##  3rd Qu.: 18600                     
##  Max.   :178500

3) Preprocessing

# Drop rows with negative mileage or zero engine power, since neither value is plausible
df <- df[!(df$mileage < 0 | df$engine_power == 0),]

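# Change TRUE/FALSE to 1/0 for feature_1 to feature_6
# (feature_7 and feature_8 stay logical, so they appear as feature_7TRUE and feature_8TRUE in the model output)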
df$feature_1 <- as.integer(df$feature_1)
df$feature_2 <- as.integer(df$feature_2)
df$feature_3 <- as.integer(df$feature_3)
df$feature_4 <- as.integer(df$feature_4)
df$feature_5 <- as.integer(df$feature_5)
df$feature_6 <- as.integer(df$feature_6)

# Create a new column to represent the age of the car
# age is the number of days between the registration date and the date it was sold
df$sold_at <- as.Date(df$sold_at)
df$registration_date <- as.Date(df$registration_date)
df$age <- df$sold_at - df$registration_date
df$age <- as.numeric(df$age)
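
Because age is a difference of two Date objects, it is measured in days, not years. The following is a minimal sketch of that arithmetic; the two dates and the names `age_days` and `age_years` are made up for illustration and do not come from the dataset.

```r
# Hypothetical dates, chosen only for illustration
registration_date <- as.Date("2012-01-01")
sold_at           <- as.Date("2018-01-01")

# Subtracting two Date objects gives a difftime measured in days
age_days <- as.numeric(sold_at - registration_date)
age_days                      # 2192

# Approximate conversion to years (365.25 days per year on average)
age_years <- age_days / 365.25
age_years
```

Keeping the unit in mind matters when reading regression coefficients for age later on: a slope of about -2.67 per day corresponds to roughly -977 per year.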

# Removing registration date and selling date columns
df <- df[, !(names(df) %in% c('registration_date', 'sold_at'))]

# Removing fuel types not needed for this analysis
df <- df[!(df$fuel == 'hybrid_petrol' | df$fuel == 'electro'),]

4) Analysis

4.1) Correlation Test

Price and Age

res <- cor.test(df$price, df$age, method = 'pearson')
res$estimate
##        cor 
## -0.4454861

Price and age have a moderate negative correlation (r ≈ -0.45), i.e. older cars tend to sell for less.

Price and Mileage

res <- cor.test(df$price, df$mileage, method = 'pearson')
res$estimate
##       cor 
## -0.407236

Price and mileage have a moderate negative correlation (r ≈ -0.41), i.e. cars with higher mileage tend to sell for less.

Price and Horse Power

res <- cor.test(df$price, df$engine_power, method = 'pearson')
res$estimate
##       cor 
## 0.6451558

Price and engine power have a fairly strong positive correlation (r ≈ 0.65), i.e. more powerful cars tend to sell for more.
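
The tests above print only the estimate, but the object returned by cor.test also carries a confidence interval and a p-value. The sketch below shows how they could be read out; it uses the built-in mtcars data as a stand-in (an assumption, not the car-price data), and the names `res_toy` and `r_squared` are illustrative.

```r
# Stand-in data: mtcars ships with R and is NOT the car-price data used above
res_toy <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

res_toy$estimate      # Pearson correlation coefficient r
res_toy$conf.int      # 95% confidence interval for r
res_toy$p.value       # p-value for H0: true correlation is 0

# r squared: share of variance in one variable explained by a simple linear fit on the other
r_squared <- unname(res_toy$estimate)^2
r_squared
```

Applied to the estimates reported above, squaring gives about 0.42 for engine power, 0.20 for age and 0.17 for mileage, which is a more direct way to compare how much each variable explains on its own.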

4.2) Correlation plot

factors <- df[, c('mileage', 'engine_power', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'price', 'age')]

corr <- cor(factors)
p.mat <- cor_pmat(factors)
corr.plot <- ggcorrplot(corr, hc.order = FALSE, type = "lower", outline.col = "white", p.mat = p.mat)
ggplotly(corr.plot)

As can be seen, there is no evidence of multicollinearity, as no two variables are strongly correlated with each other. We can also see that, except for feature_8, the correlation coefficient between any independent variable and the dependent variable price is not higher than 0.65; hence none of the independent variables has a particularly strong relationship with price.
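
Pairwise correlations can miss collinearity that involves several variables at once. A common complementary check is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The sketch below computes it with base R only; `vif_manual` is a hypothetical helper written for this example, and mtcars is a stand-in for the real data.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# obtained by regressing it on all the other predictors
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in data: mtcars ships with R and is NOT the car-price data used above
vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
vifs
```

Values close to 1 indicate little collinearity; a frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.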

4.3) ANOVA

one.way <- aov(price ~ fuel, data = df)

summary(one.way)
##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## fuel           1 3.782e+08 378192469   4.578 0.0324 *
## Residuals   4828 3.989e+11  82614428                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

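The F-statistic for fuel is 4.578 with a p-value of 0.0324, so the difference in mean price between fuel types is statistically significant at the 0.05 level. The effect is weak, however: fuel accounts for less than 0.1% of the total sum of squares (3.782e+08 out of roughly 3.99e+11).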
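
Since fuel has only two remaining levels (Df = 1), the one-way ANOVA is equivalent to a two-sample t-test with pooled variance, and the F-statistic equals the squared t-statistic. The sketch below illustrates that identity on the built-in mtcars data, used here as a stand-in; the variable names are illustrative.

```r
# Stand-in data: mtcars ships with R; am (0/1) plays the role of a two-level factor
f_value <- anova(lm(mpg ~ factor(am), data = mtcars))[["F value"]][1]

# Two-sample t-test assuming equal variances (the same assumption as ANOVA)
t_value <- unname(t.test(mpg ~ am, data = mtcars, var.equal = TRUE)$statistic)

# With two groups, F equals t squared
all.equal(f_value, t_value^2)
```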

4.4) ANOVA with multiple variables

4.4.1) Mileage and Age

two.way <- aov(price ~ mileage + age, data = df)

summary(two.way)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## mileage        1 6.621e+10 6.621e+10  1057.4 <2e-16 ***
## age            1 3.078e+10 3.078e+10   491.5 <2e-16 ***
## Residuals   4827 3.023e+11 6.262e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the F-statistics for mileage and age are high (1057.4 and 491.5, both with p < 2e-16). Hence, it is very unlikely that the variation in price associated with these variables is due to chance.
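
One caveat when reading this table: aov reports sequential sums of squares, so each term is credited only with what it explains after the terms listed before it. With correlated predictors, the order in the formula changes the individual rows. The sketch below shows the effect on the built-in mtcars data (a stand-in, not the car-price data); all names are illustrative.

```r
# Stand-in data: mtcars ships with R; wt and hp are correlated predictors
fit_a <- lm(mpg ~ wt + hp, data = mtcars)   # wt entered first
fit_b <- lm(mpg ~ hp + wt, data = mtcars)   # wt entered second

# Sum of squares credited to wt depends on its position in the formula
ss_wt_first  <- anova(fit_a)["wt", "Sum Sq"]
ss_wt_second <- anova(fit_b)["wt", "Sum Sq"]
c(first = ss_wt_first, second = ss_wt_second)

# The combined sum of squares of both terms is the same in either order
sum(anova(fit_a)[c("wt", "hp"), "Sum Sq"])
sum(anova(fit_b)[c("wt", "hp"), "Sum Sq"])
```

For the model above, this means the row for age measures what age adds once mileage is already in the model.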

4.5) Scatter plot of features with the highest correlation

4.5.1) Regression plot for price and engine power

df1 <- sample_n(df, 500)
fit <- lm(price ~ engine_power, data = df1)

fig <- df1 %>%
  plot_ly(x = ~engine_power, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha = 0.75, name = 'Scatter points') %>%
  layout(title = "Engine Power vs Price",
         xaxis = list(title = 'Engine Power', dtick = 20, range = c(60, 350)),
         yaxis = list(title = 'Price', dtick = 10000, range = c(0, 80000))) %>%
  add_trace(x = ~engine_power, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name = 'Regression Line')
fig

As seen above, the regression line is a decent fit for the data: price tends to increase with engine power.

4.5.2) Regression plot for price and mileage

fit <- lm(price ~ mileage, data = df1)

fig <- df1 %>%
  plot_ly(x = ~mileage, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha = 0.75, name = 'Scatter points') %>%
  layout(title = "Mileage vs Price",
         xaxis = list(title = 'Mileage', dtick = 30000, range = c(0, 400000)),
         yaxis = list(title = 'Price', dtick = 10000, range = c(0, 80000))) %>%
  add_trace(x = ~mileage, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name = 'Regression Line')
fig

Mileage trends opposite to price: price decreases as mileage increases. This is expected, because a car loses value as it is used.

4.5.3) Regression plot for price and age

fit <- lm(price ~ age, data = df1)

fig <- df1 %>%
  plot_ly(x = ~age, y = ~price, size = ~price, color = ~price, type = 'scatter', mode = 'markers', alpha = 0.75, name = 'Scatter points') %>%
  layout(title = "Age vs Price",
         xaxis = list(title = 'Age (days)', dtick = 1000, range = c(0, 6000)),
         yaxis = list(title = 'Price', dtick = 10000, range = c(0, 80000))) %>%
  add_trace(x = ~age, y = fitted(fit), mode = 'lines', alpha = 1, line = list(color = '#F2CE16', width = 4), name = 'Regression Line')
fig

In conclusion, the independent variables age, mileage and engine power are all moderately related to the dependent variable price. However, there seem to be many outliers.

4.6) Formal test

1.Set up the hypotheses and select the alpha level

H0: β Predictors = 0 (none of the independent variables is a predictor of price)

H1: β Predictors ≠ 0 (at least one of the slope coefficients is different from 0 and is a predictor of price)

α = 0.05

2.Select the appropriate test statistic (I am going to use F-test statistic).

3.State the decision rule

Reject H0 if p-value <= α. Otherwise, do not reject H0

4.Compute the test statistic and the associated p-value.

factors <- df[, c('mileage', 'fuel', 'model_key', 'engine_power', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'price', 'age')]

m <- lm(formula = price ~ mileage + engine_power + age, data = factors)

summary(m)
## 
## Call:
## lm(formula = price ~ mileage + engine_power + age, data = factors)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28437  -2425   -235   1996 159363 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.727e+03  3.665e+02   21.08   <2e-16 ***
## mileage      -3.614e-02  1.572e-03  -23.00   <2e-16 ***
## engine_power  1.431e+02  2.111e+00   67.82   <2e-16 ***
## age          -2.674e+00  1.021e-01  -26.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5663 on 4826 degrees of freedom
## Multiple R-squared:  0.6124, Adjusted R-squared:  0.6121 
## F-statistic:  2541 on 3 and 4826 DF,  p-value: < 2.2e-16

p-value < 2.2e-16, R-squared = 0.61

5.Conclusion.

Reject H0, since p <= 0.05. We have sufficient evidence at the α = 0.05 significance level that there is a linear association between price and at least one of the independent variables.
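
The decision can be reproduced by hand from the F-statistic and its degrees of freedom as printed in the summary output. The sketch below does that with base R's F-distribution functions; the variable names (`f_stat`, `num_df`, `den_df`, `f_crit`) are illustrative.

```r
# Values copied from the summary(m) output above
f_stat <- 2541   # overall F-statistic
num_df <- 3      # numerator degrees of freedom (number of predictors)
den_df <- 4826   # denominator (residual) degrees of freedom
alpha  <- 0.05

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_value <- pf(f_stat, num_df, den_df, lower.tail = FALSE)

# Equivalent rule: reject H0 when the F-statistic exceeds the critical value
f_crit <- qf(1 - alpha, num_df, den_df)

p_value <= alpha   # TRUE means reject H0
f_stat > f_crit    # same decision, stated via the critical value
```

Both comparisons lead to the same decision; the p-value form is simply the one stated in the decision rule.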

From the above test we can conclude that at least some variables are predictors of price. We now fit a new model with all variables as predictors and check the effect on the R-squared value.

m <- lm(formula = price ~ ., data = factors)
summary(m)
## 
## Call:
## lm(formula = price ~ ., data = factors)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -22347  -1484     27   1560 159074 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.315e+04  1.307e+03  10.065  < 2e-16 ***
## mileage                    -3.329e-02  1.393e-03 -23.905  < 2e-16 ***
## fuelpetrol                 -2.823e+02  4.193e+02  -0.673 0.500834    
## model_key116                2.729e+01  1.236e+03   0.022 0.982378    
## model_key118                2.331e+02  1.286e+03   0.181 0.856135    
## model_key120                5.488e+02  1.477e+03   0.372 0.710182    
## model_key123               -2.819e+03  3.536e+03  -0.797 0.425325    
## model_key125               -2.739e+03  2.447e+03  -1.119 0.263020    
## model_key135                2.246e+03  2.741e+03   0.819 0.412588    
## model_key214 Gran Tourer    3.665e+03  4.800e+03   0.764 0.445156    
## model_key216               -9.391e+03  4.810e+03  -1.952 0.050955 .  
## model_key216 Active Tourer  6.790e+02  2.942e+03   0.231 0.817473    
## model_key216 Gran Tourer    1.751e+03  2.618e+03   0.669 0.503630    
## model_key218                2.119e+03  1.774e+03   1.195 0.232312    
## model_key218 Active Tourer  1.228e+03  1.744e+03   0.704 0.481575    
## model_key218 Gran Tourer   -3.669e+02  1.974e+03  -0.186 0.852586    
## model_key220                4.589e+03  2.957e+03   1.552 0.120753    
## model_key220 Active Tourer -3.147e+03  4.814e+03  -0.654 0.513341    
## model_key225               -8.056e+02  4.822e+03  -0.167 0.867309    
## model_key225 Active Tourer -3.703e+02  4.841e+03  -0.076 0.939031    
## model_key316                1.431e+03  1.253e+03   1.142 0.253435    
## model_key318                1.461e+03  1.244e+03   1.174 0.240376    
## model_key318 Gran Turismo   3.428e+03  1.315e+03   2.607 0.009159 ** 
## model_key320                1.131e+03  1.264e+03   0.895 0.371078    
## model_key320 Gran Turismo   3.074e+03  1.371e+03   2.243 0.024929 *  
## model_key325                7.006e+02  1.681e+03   0.417 0.676867    
## model_key325 Gran Turismo   6.076e+03  2.660e+03   2.284 0.022414 *  
## model_key328                3.781e+02  2.473e+03   0.153 0.878479    
## model_key330                3.901e+02  1.583e+03   0.246 0.805335    
## model_key330 Gran Turismo   3.691e+03  3.008e+03   1.227 0.219870    
## model_key335                3.858e+03  2.268e+03   1.701 0.088971 .  
## model_key335 Gran Turismo   3.852e+03  3.055e+03   1.261 0.207481    
## model_key418 Gran Coupé     6.044e+03  2.057e+03   2.939 0.003310 ** 
## model_key420                7.730e+03  1.435e+03   5.387 7.51e-08 ***
## model_key420 Gran Coupé     7.026e+03  1.496e+03   4.698 2.70e-06 ***
## model_key425                1.264e+04  3.533e+03   3.577 0.000351 ***
## model_key430                9.579e+03  3.555e+03   2.694 0.007075 ** 
## model_key430 Gran Coupé     6.743e+03  2.694e+03   2.503 0.012344 *  
## model_key435                7.027e+03  2.739e+03   2.565 0.010343 *  
## model_key435 Gran Coupé     1.024e+04  2.379e+03   4.305 1.70e-05 ***
## model_key518                4.742e+03  1.365e+03   3.475 0.000516 ***
## model_key520                4.094e+03  1.276e+03   3.209 0.001342 ** 
## model_key520 Gran Turismo   6.549e+03  1.596e+03   4.104 4.14e-05 ***
## model_key523                5.088e+03  2.671e+03   1.905 0.056820 .  
## model_key525                3.629e+03  1.348e+03   2.692 0.007135 ** 
## model_key528                4.366e+03  2.330e+03   1.873 0.061061 .  
## model_key530                4.321e+03  1.416e+03   3.051 0.002295 ** 
## model_key530 Gran Turismo   5.841e+03  1.752e+03   3.334 0.000863 ***
## model_key535                4.752e+03  1.619e+03   2.935 0.003355 ** 
## model_key535 Gran Turismo   3.845e+03  3.587e+03   1.072 0.283795    
## model_key630               -8.538e+02  4.851e+03  -0.176 0.860308    
## model_key635                5.667e+03  4.865e+03   1.165 0.244195    
## model_key640                1.446e+04  2.211e+03   6.540 6.79e-11 ***
## model_key640 Gran Coupé     1.602e+04  1.832e+03   8.749  < 2e-16 ***
## model_key650               -9.151e+03  3.668e+03  -2.495 0.012624 *  
## model_key730                9.533e+03  1.585e+03   6.015 1.93e-09 ***
## model_key735                4.788e+02  4.876e+03   0.098 0.921789    
## model_key740                1.738e+04  1.834e+03   9.472  < 2e-16 ***
## model_key750                1.173e+04  3.662e+03   3.203 0.001370 ** 
## model_keyM135               4.482e+03  4.877e+03   0.919 0.358097    
## model_keyM235               4.846e+03  3.060e+03   1.584 0.113284    
## model_keyM3                 1.673e+04  2.449e+03   6.832 9.40e-12 ***
## model_keyM4                 3.199e+04  3.714e+03   8.612  < 2e-16 ***
## model_keyM5                 1.398e+04  5.102e+03   2.740 0.006166 ** 
## model_keyM550               1.084e+04  2.101e+03   5.159 2.59e-07 ***
## model_keyX1                 1.242e+03  1.266e+03   0.981 0.326448    
## model_keyX3                 4.773e+03  1.290e+03   3.700 0.000218 ***
## model_keyX4                 1.517e+04  1.465e+03  10.359  < 2e-16 ***
## model_keyX5                 1.328e+04  1.410e+03   9.416  < 2e-16 ***
## model_keyX5 M               1.872e+04  1.780e+03  10.518  < 2e-16 ***
## model_keyX5 M50             2.379e+04  3.134e+03   7.590 3.84e-14 ***
## model_keyX6                 1.605e+04  1.596e+03  10.061  < 2e-16 ***
## model_keyX6 M               2.565e+04  2.246e+03  11.419  < 2e-16 ***
## model_keyZ4                 2.530e+03  2.293e+03   1.104 0.269863    
## engine_power                5.228e+01  5.157e+00  10.139  < 2e-16 ***
## feature_1                   6.158e+02  1.614e+02   3.815 0.000138 ***
## feature_2                  -7.723e+01  2.031e+02  -0.380 0.703805    
## feature_3                   7.706e+02  1.828e+02   4.215 2.54e-05 ***
## feature_4                   9.978e+02  2.271e+02   4.393 1.14e-05 ***
## feature_5                  -1.089e+02  1.617e+02  -0.673 0.500721    
## feature_6                   1.296e+03  1.710e+02   7.582 4.08e-14 ***
## feature_7TRUE               7.715e+02  3.150e+02   2.449 0.014353 *  
## feature_8TRUE               1.416e+03  1.683e+02   8.413  < 2e-16 ***
## age                        -2.692e+00  1.006e-01 -26.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4641 on 4746 degrees of freedom
## Multiple R-squared:  0.7439, Adjusted R-squared:  0.7395 
## F-statistic: 166.1 on 83 and 4746 DF,  p-value: < 2.2e-16

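As seen from the model above, the R-squared value increased when we used all the features. Because R-squared never decreases when predictors are added, the adjusted R-squared is the fairer comparison: it rose from 0.6121 to 0.7395. Hence, the model is considerably better using all the factors.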
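
Since the three-predictor model is nested in the full model, the improvement can also be checked with a partial F-test. The sketch below reconstructs it from the residual standard errors and degrees of freedom printed in the two summaries; because those printed values are rounded, the result is approximate, and all variable names are illustrative.

```r
# Values copied from the two summary() outputs above (rounded, so results are approximate)
rse_small <- 5663; df_small <- 4826   # price ~ mileage + engine_power + age
rse_full  <- 4641; df_full  <- 4746   # price ~ .

# Residual sum of squares = (residual standard error)^2 * residual degrees of freedom
rss_small <- rse_small^2 * df_small
rss_full  <- rse_full^2  * df_full

# Partial F-statistic for the extra coefficients in the full model
extra_terms <- df_small - df_full
f_partial <- ((rss_small - rss_full) / extra_terms) / (rss_full / df_full)
p_partial <- pf(f_partial, extra_terms, df_full, lower.tail = FALSE)

f_partial
p_partial
```

The statistic comes out at roughly 30 on 80 and 4746 degrees of freedom, so the additional predictors improve the fit by far more than chance would explain. If both fitted models were kept under separate names, passing them to anova() would perform the same comparison directly.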

4.7) Residual Analysis

res <- resid(m)
fig <- plot_ly(x = fitted(m), y = res, type = 'scatter', mode = 'markers', alpha = 0.75) %>%
  layout(title = "Residual vs Fitted values",
         xaxis = list(title = 'Fitted values', dtick = 10000, range = c(-10000, 80000)),
         yaxis = list(title = 'Residual', dtick = 10000, range = c(-20000, 20000)))
fig

As evident from the plot above, there is no apparent pattern between the residuals and the fitted values, and the model seems to be good.

The residuals, even though they are clustered with some outliers (the largest, 159074, lies well outside the plotted range), seem reasonable.
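
Besides inspecting the plot, unusual observations can be flagged numerically through standardized residuals. The sketch below is one way to do that, assuming a cut-off of 2 as a rule of thumb; it uses a stand-in model on the built-in mtcars data, and the names `fit_toy`, `std_res` and `flagged` are illustrative.

```r
# Stand-in model on mtcars (ships with R); NOT the price model fitted above
fit_toy <- lm(mpg ~ wt + hp, data = mtcars)

# Standardized residuals: residuals scaled to roughly unit variance
std_res <- rstandard(fit_toy)

# Flag observations whose standardized residual is unusually large in magnitude
# (the cut-off of 2 is a common rule of thumb, chosen here as an assumption)
flagged <- names(std_res)[abs(std_res) > 2]
flagged
```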

4.7.1) Finding outliers

boxplot(df$price)$out

##   [1]  69700  33400  36300  31000  31000  34300  30500  38900  36300  32100
##  [11]  47000  55200  53600  32000  31500  39900  36900  32100  36500  30700
##  [21]  45800  35700  31300  30600  33900  36000  37100  40900  30800  34500
##  [31]  32100  40500  30900  35700  40000  36500  32100  32100  39300  33100
##  [41]  32800  33300  37200  32700  40700  36700  62500  44600  30600  34000
##  [51]  43000  31900  33300  35300  34300  58300  31500  63700  40500  38700
##  [61]  51200  34600  34900  40700  39900  65400  30600  31500  36900  52400
##  [71]  38000  32600  39200  32100  36200  37200  33100  31300  31400  47300
##  [81]  32000  68300  68300  30900  51500  68700  41100  41600  47000  38000
##  [91]  30600  42700  35200  31100  38100  43700  64300  33300  42800  33200
## [101]  33300  41600  33800  32500  41200  41200  38800  34500  41900  36400
## [111]  39900  48200  30500  36900  43400  34000  38100  41600  37200  30900
## [121]  31800  37700  44000  37500  37500  30600  34800  43400  33000  35700
## [131]  30600  34200  31500  43300  35100  49900  40000  49700  31400  33300
## [141]  40900  33300  37200  47400  35700  39000  47000  32700  57100  63100
## [151]  31900  42000  35300  47000  34900  34500  37800  55500  55300  30500
## [161]  42200  82400  32500  44300  35100  42800  35000  35900  38000  41300
## [171]  39900  44100  36300  41100  39300  43400  34500  44300  46500  59300
## [181]  35600  30900  35300  33300  32700  39300  42400  33600  34500  43800
## [191]  30500  44700  35500  42700  48800  41300  33500  33300  36100  35200
## [201]  35900  50000  32200  35400  41600  39400  49900  32200  38400  36100
## [211]  35800  48800  34200  37400  32100  49100  41100  34000  42100  33000
## [221]  38000  41200  31100  46100  32800  50000  40500  33900  45300  36900
## [231]  30900  37800  32700  44600  39300  34500  46600  62500  39300  36400
## [241]  31700  34200  39500  40500  50000  38700  34500  36300  36300  33200
## [251]  33300  39300  38400  43400  45500  35300  55700  37200  61200  33900
## [261]  44600  34600  36900  43600  34500  32600  34600  32500  39900  34500
## [271]  46400  45500  50600  35600  35700  43000 142800  32500  36100  35700
## [281]  43400  34800  37400  39100  43400  35700  34200  40000  39900  46300
## [291]  73100  35400  41100  60700  31100  66600 178500  32800  42100  33600
## [301]  33000  45100  42800  34600  42700  44700  37500

4.7.2) Visualizing outliers

fig <- plot_ly(y = df$price, type = 'box', alpha = 0.75)
fig

4.7.3) Eliminating outliers

Q <- quantile(df$price, probs = c(.25, .75), na.rm = FALSE)

iqr <- IQR(df$price)

up <- Q[2] + 1.5 * iqr
low <- Q[1] - 1.5 * iqr

# Keep only rows whose price lies inside the 1.5 * IQR fences
df2 <- subset(factors, price > low & price < up)
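
To make the fence rule concrete, here is a minimal worked example on a made-up vector of six values (not taken from the dataset); the names `x`, `q`, `low_toy`, `up_toy` and `kept` are illustrative.

```r
# Toy values, chosen only for illustration; 100 is the obvious outlier
x <- c(10, 12, 13, 14, 15, 100)

q <- quantile(x, probs = c(.25, .75))   # Q1 = 12.25, Q3 = 14.75
iqr_toy <- IQR(x)                        # 14.75 - 12.25 = 2.5

low_toy <- q[1] - 1.5 * iqr_toy          # 8.5
up_toy  <- q[2] + 1.5 * iqr_toy          # 18.5

# Values strictly inside the fences are kept; 100 is dropped
kept <- x[x > low_toy & x < up_toy]
kept
```

Note that the rule is symmetric around the quartiles, so on a right-skewed variable such as price it removes mostly expensive cars.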

Linear model after elimination of outliers

Test and Train split

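# Randomly assign 70% of the rows to the training set; no seed is set, so the split differs between runs
train_idx <- sort(sample(nrow(df2), nrow(df2) * 0.7))
train <- df2[train_idx,]
test <- df2[-train_idx,]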

m <- lm(formula = price ~ ., data = train)
summary(m)
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19829.6  -1225.3     67.9   1385.7  19956.4 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 1.314e+04  9.770e+02  13.451  < 2e-16 ***
## mileage                    -2.494e-02  9.838e-04 -25.353  < 2e-16 ***
## fuelpetrol                 -5.927e+02  3.114e+02  -1.904 0.057056 .  
## model_key116               -3.445e+02  9.157e+02  -0.376 0.706831    
## model_key118               -1.172e+01  9.517e+02  -0.012 0.990174    
## model_key120                4.896e+02  1.071e+03   0.457 0.647705    
## model_key123               -2.894e+03  2.129e+03  -1.360 0.174009    
## model_key125               -2.024e+03  1.650e+03  -1.227 0.220072    
## model_key135                3.999e+03  1.901e+03   2.104 0.035482 *  
## model_key216 Active Tourer  9.787e+02  1.785e+03   0.548 0.583523    
## model_key216 Gran Tourer    2.857e+03  1.786e+03   1.599 0.109816    
## model_key218                1.901e+03  1.420e+03   1.339 0.180828    
## model_key218 Active Tourer  1.541e+03  1.241e+03   1.241 0.214548    
## model_key218 Gran Tourer    4.163e+02  1.361e+03   0.306 0.759798    
## model_key220                5.776e+03  2.838e+03   2.035 0.041903 *  
## model_key225                1.051e+03  2.843e+03   0.370 0.711644    
## model_key225 Active Tourer  2.283e+03  2.860e+03   0.798 0.424824    
## model_key316                9.436e+02  9.255e+02   1.020 0.308041    
## model_key318                9.346e+02  9.227e+02   1.013 0.311187    
## model_key318 Gran Turismo   3.217e+03  9.656e+02   3.332 0.000873 ***
## model_key320                9.982e+02  9.403e+02   1.062 0.288527    
## model_key320 Gran Turismo   3.144e+03  1.002e+03   3.137 0.001725 ** 
## model_key325                1.162e+03  1.222e+03   0.951 0.341780    
## model_key325 Gran Turismo   7.844e+03  1.825e+03   4.298 1.78e-05 ***
## model_key328                7.121e+02  2.148e+03   0.332 0.740251    
## model_key330                1.482e+03  1.146e+03   1.294 0.195736    
## model_key330 Gran Turismo   6.577e+03  2.866e+03   2.295 0.021792 *  
## model_key335                3.453e+03  1.618e+03   2.134 0.032927 *  
## model_key335 Gran Turismo   4.999e+03  2.191e+03   2.282 0.022586 *  
## model_key418 Gran Coupé     5.819e+03  1.425e+03   4.083 4.56e-05 ***
## model_key420                7.974e+03  1.050e+03   7.598 3.96e-14 ***
## model_key420 Gran Coupé     7.736e+03  1.098e+03   7.047 2.24e-12 ***
## model_key425                1.279e+04  2.846e+03   4.495 7.23e-06 ***
## model_key430                1.190e+04  2.862e+03   4.157 3.31e-05 ***
## model_key430 Gran Coupé     5.031e+03  2.863e+03   1.757 0.078943 .  
## model_key435                7.736e+03  2.195e+03   3.524 0.000430 ***
## model_key435 Gran Coupé     6.053e+03  2.187e+03   2.768 0.005675 ** 
## model_key518                4.369e+03  1.004e+03   4.352 1.39e-05 ***
## model_key520                3.829e+03  9.486e+02   4.036 5.57e-05 ***
## model_key520 Gran Turismo   7.198e+03  1.155e+03   6.231 5.25e-10 ***
## model_key523                3.591e+03  1.831e+03   1.962 0.049871 *  
## model_key525                3.376e+03  1.002e+03   3.370 0.000760 ***
## model_key528                4.929e+03  2.150e+03   2.293 0.021943 *  
## model_key530                4.329e+03  1.058e+03   4.091 4.41e-05 ***
## model_key530 Gran Turismo   5.305e+03  1.268e+03   4.184 2.95e-05 ***
## model_key535                4.771e+03  1.222e+03   3.903 9.69e-05 ***
## model_key535 Gran Turismo   6.101e+03  2.185e+03   2.792 0.005275 ** 
## model_key630               -3.569e+02  2.873e+03  -0.124 0.901136    
## model_key640 Gran Coupé     1.248e+04  2.902e+03   4.301 1.75e-05 ***
## model_key650               -6.835e+03  2.953e+03  -2.315 0.020682 *  
## model_key730                8.090e+03  1.202e+03   6.728 2.04e-11 ***
## model_key735                4.168e+02  2.899e+03   0.144 0.885684    
## model_key740                6.449e+03  1.746e+03   3.693 0.000226 ***
## model_keyM135               7.033e+03  2.902e+03   2.423 0.015430 *  
## model_keyM235               8.993e+03  2.199e+03   4.089 4.45e-05 ***
## model_keyM3                 1.284e+04  2.017e+03   6.363 2.28e-10 ***
## model_keyM550               5.819e+03  1.989e+03   2.926 0.003454 ** 
## model_keyX1                 1.108e+03  9.370e+02   1.182 0.237174    
## model_keyX3                 4.362e+03  9.605e+02   4.541 5.81e-06 ***
## model_keyX4                 1.016e+04  1.256e+03   8.085 8.84e-16 ***
## model_keyX5                 5.303e+03  1.101e+03   4.819 1.51e-06 ***
## model_keyX5 M               7.877e+03  1.569e+03   5.020 5.45e-07 ***
## model_keyX6                 9.295e+03  1.256e+03   7.403 1.70e-13 ***
## model_keyZ4                 2.521e+03  1.535e+03   1.642 0.100731    
## engine_power                3.421e+01  4.171e+00   8.202 3.44e-16 ***
## feature_1                   4.203e+02  1.131e+02   3.715 0.000207 ***
## feature_2                   9.061e+01  1.435e+02   0.631 0.527828    
## feature_3                   3.389e+02  1.338e+02   2.533 0.011368 *  
## feature_4                   1.280e+03  1.655e+02   7.734 1.40e-14 ***
## feature_5                   8.598e+02  1.174e+02   7.324 3.05e-13 ***
## feature_6                   1.070e+03  1.219e+02   8.783  < 2e-16 ***
## feature_7TRUE               7.022e+02  2.243e+02   3.131 0.001756 ** 
## feature_8TRUE               1.416e+03  1.176e+02  12.038  < 2e-16 ***
## age                        -2.260e+00  7.117e-02 -31.749  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2669 on 3090 degrees of freedom
## Multiple R-squared:  0.7879, Adjusted R-squared:  0.7829 
## F-statistic: 157.2 on 73 and 3090 DF,  p-value: < 2.2e-16

When we remove the outliers and fit our final linear model on the training set, the R-squared improves further, from 0.7439 to 0.7879, although the two values are computed on different sets of rows.

Even though an R-squared of about 0.79 can be acceptable performance, our model performed poorly on lower-priced cars. However, a better sampling/splitting method could result in even better performance.

4.8) Making predictions

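# Compare actual prices with fitted values
# predict(m) without newdata returns in-sample predictions for the training set
result <- data.frame(Actual = train$price, Predicted = predict(m))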
head(result)
##    Actual Predicted
## 1   11300 11309.372
## 3   10200 10418.203
## 6   17100 18881.567
## 8    6100  8512.448
## 9    6200  7171.472
## 10  17300 12922.033
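
The table above compares the model with the same rows it was trained on. A sketch of how held-out rows could be scored instead is given below. It uses the built-in mtcars data as a stand-in, an arbitrary seed, and illustrative names (`train_toy`, `test_toy`, `fit_toy`, `rmse`); none of these come from the analysis above.

```r
# Stand-in data: mtcars ships with R and is NOT the car-price data used above
set.seed(42)  # arbitrary seed, chosen only to make this sketch repeatable
idx <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train_toy <- mtcars[idx, ]
test_toy  <- mtcars[-idx, ]

fit_toy <- lm(mpg ~ wt + hp, data = train_toy)

# Out-of-sample predictions: pass the held-out rows via newdata
pred <- predict(fit_toy, newdata = test_toy)

# Root mean squared error on the held-out rows, in the units of the response
rmse <- sqrt(mean((test_toy$mpg - pred)^2))

head(data.frame(Actual = test_toy$mpg, Predicted = pred))
rmse
```

For the price model, the analogous call would pass test as newdata. One thing to watch for: predict can fail if the test rows contain a model_key value that never occurred in the training rows, so such rows would have to be handled before scoring.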

4.9) Summary

When sampling the dataset, some categories, such as the fuel types electro and hybrid, were not sampled properly; hence, they were removed. Some columns with boolean values were changed to numeric before fitting the linear regression. Some rows had a negative mileage or an engine power of zero, which is not possible, so they were dropped. When analysing the models, we found that fuel type was not a major predictor of price, while model type was. We found this by trial and error, because model_key is a categorical variable and could not be included in the correlation analysis.

Even though an R-squared of about 0.79 can be acceptable performance, our model performed poorly on lower-priced cars. However, a better sampling/splitting method could result in even better performance.

In conclusion, my model was better than flipping a coin in predicting the price of a used car.