Linear Regression In Used Car Price Prediction
Intro
Linear Regression
Linear regression is an algorithm for modeling the relationship between variables. A regression task involves two kinds of variables: independent (predictor) variables and a dependent (target) variable. The independent variables stand on their own and are not influenced by the other variables; as an independent variable changes, the value of the dependent variable responds. The dependent variable is the quantity under study: it is what the regression model solves for and attempts to predict. In a linear regression task, every observation consists of values for both the dependent variable and the independent variables.
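Formally, with one or more predictors the model estimates the target as a weighted sum of the predictors plus an intercept:
\[ \hat y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
where \(\beta_0\) is the intercept and each \(\beta_i\) is the coefficient the model learns for predictor \(x_i\).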
Dataset
This dataset contains information about used cars. Each feature is described below:
- name: name of the car
- year: year the car was bought
- selling_price: price at which the car is being sold
- km_driven: number of kilometres the car has been driven
- fuel: fuel type of the car (Petrol / Diesel / CNG / LPG / electric)
- seller_type: whether the car is sold by an individual or a dealer
- transmission: gear transmission of the car (Automatic / Manual)
- owner: number of previous owners
- mileage: mileage of the car (kmpl)
- engine: engine capacity of the car (CC)
- max_power: maximum power of the engine (bhp)
- seats: number of seats in the car
This dataset will be used to predict the selling price of used cars, so we set selling_price as the target variable.
Data Preparation
Import Required Package
library(tidyverse)
library(caret)
library(GGally)
library(car)
library(lmtest)
library(data.table)
library(MLmetrics)
library(performance)
Load Dataset
car <- read.csv('Used_Cars.csv')
rmarkdown::paged_table(car)
glimpse(car)
## Rows: 8,128
## Columns: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ name <chr> "Maruti Swift Dzire VDI", "Skoda Rapid 1.5 TDI Ambition"~
## $ year <int> 2014, 2014, 2006, 2010, 2007, 2017, 2007, 2001, 2011, 20~
## $ selling_price <int> 450000, 370000, 158000, 225000, 130000, 440000, 96000, 4~
## $ km_driven <int> 145500, 120000, 140000, 127000, 120000, 45000, 175000, 5~
## $ fuel <chr> "Diesel", "Diesel", "Petrol", "Diesel", "Petrol", "Petro~
## $ seller_type <chr> "Individual", "Individual", "Individual", "Individual", ~
## $ transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua~
## $ owner <chr> "First Owner", "Second Owner", "Third Owner", "First Own~
## $ mileage <chr> "23.4 kmpl", "21.14 kmpl", "17.7 kmpl", "23.0 kmpl", "16~
## $ engine <chr> "1248 CC", "1498 CC", "1497 CC", "1396 CC", "1298 CC", "~
## $ max_power <chr> "74 bhp", "103.52 bhp", "78 bhp", "90 bhp", "88.2 bhp", ~
## $ seats <int> 5, 5, 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, NA, 5, 5, 7, 5, 5~
The data has 8,128 rows and 13 columns. X and name are unique identifiers for each car, so we can drop them; we don't need that information.
Before we go any further, we need to make sure that every column has the correct type. Some features need to be cleaned up and converted. The next steps are:
- change the fuel, seller_type, transmission, and owner data types to factors
- remove " kmpl"/" km/kg" from mileage and convert it to numeric
- remove " CC" from engine and convert it to numeric
- remove " bhp" from max_power and convert it to numeric
- drop the X and name columns
car <- car %>%
  mutate(mileage = as.numeric(str_replace(mileage, " kmpl| km/kg", "")), # strip units, coerce to numeric
         engine = as.numeric(str_replace(engine, " CC", "")),
         max_power = as.numeric(str_replace(max_power, " bhp", "")),
         fuel = as.factor(fuel),                                         # encode categoricals as factors
         seller_type = as.factor(seller_type),
         transmission = as.factor(transmission),
         owner = as.factor(owner)) %>%
  select(-c(X, name))                                                    # drop the identifier columns
str(car)
## 'data.frame': 8128 obs. of 11 variables:
## $ year : int 2014 2014 2006 2010 2007 2017 2007 2001 2011 2013 ...
## $ selling_price: int 450000 370000 158000 225000 130000 440000 96000 45000 350000 200000 ...
## $ km_driven : int 145500 120000 140000 127000 120000 45000 175000 5000 90000 169000 ...
## $ fuel : Factor w/ 4 levels "CNG","Diesel",..: 2 2 4 2 4 4 3 4 2 2 ...
## $ seller_type : Factor w/ 3 levels "Dealer","Individual",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ transmission : Factor w/ 2 levels "Automatic","Manual": 2 2 2 2 2 2 2 2 2 2 ...
## $ owner : Factor w/ 5 levels "First Owner",..: 1 3 5 1 1 1 1 3 1 1 ...
## $ mileage : num 23.4 21.1 17.7 23 16.1 ...
## $ engine : num 1248 1498 1497 1396 1298 ...
## $ max_power : num 74 103.5 78 90 88.2 ...
## $ seats : int 5 5 5 5 5 5 5 4 5 5 ...
Check Missing Values
colSums(is.na(car))
## year selling_price km_driven fuel seller_type
## 0 0 0 0 0
## transmission owner mileage engine max_power
## 0 0 221 221 216
## seats
## 221
There are missing values in the mileage, engine, max_power, and seats columns. Most of them occur in the same rows, so we drop those rows entirely.
car <- drop_na(car)
dim(car)
## [1] 7906 11
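Dropping is the simplest option here. A hedged alternative (not applied in this analysis) would be to impute the numeric gaps with column medians instead of losing the rows; a minimal sketch:
# Sketch: median imputation instead of drop_na() (car_imputed is an illustrative name)
car_imputed <- car %>%
  mutate(across(c(mileage, engine, max_power, seats),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))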
Exploratory Data Analysis
Exploratory data analysis is a phase where we explore the variables and look for patterns that may indicate correlation between them.
First we want to know the distribution of our target variable, selling_price.
hist(car$selling_price, col="darkblue")
The histogram shows that the distribution is right-skewed: most used-car selling prices are below 2,000,000.
Before we build the model, we need to split the data into a train dataset and a test dataset. We will use the train dataset to fit the linear regression model. The test dataset serves as a comparison, to check whether the model overfits and fails to predict new data it has not seen during training. We set aside 80% of the data for training and the rest for testing.
RNGkind(sample.kind = "Rounding")
set.seed(123)
intrain <- sample(x=nrow(car), size = nrow(car)*0.8)
car_train <- car[intrain,]
car_test <- car[-intrain,]
dim(car_train)
## [1] 6324 11
dim(car_test)
## [1] 1582 11
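Since caret is already loaded, an equivalent split could also be drawn with createDataPartition, which additionally stratifies on the target; a sketch (car_train2/car_test2 are illustrative names):
# Stratified 80/20 split on selling_price using caret
set.seed(123)
idx <- createDataPartition(car$selling_price, p = 0.8, list = FALSE)
car_train2 <- car[idx, ]
car_test2 <- car[-idx, ]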
Check the distribution of the target variable in the train dataset:
hist(car_train$selling_price, col= "darkblue")
Check the distribution of the target variable in the test dataset:
hist(car_test$selling_price, col="darkblue")
selling_price has almost the same distribution in both the training dataset and the test dataset.
Find the Pearson correlation between features.
ggcorr(car_train, label = T)
The plot shows that km_driven and mileage correlate negatively with selling_price, while max_power has a strong positive correlation with it (0.8).
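The exact coefficients behind the plot can also be pulled directly with cor() on the numeric columns:
# Pearson correlation of each numeric feature with the target
cor(select(car_train, where(is.numeric)))[, "selling_price"]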
Modeling
The first model uses all variables (except selling_price) as predictors; the formula selling_price ~ . takes every remaining column as a predictor. We also fit an intercept-only model, model0, to serve as the lower bound for the stepwise searches later.
model0 <- lm(selling_price~1, car_train)
model_all <- lm(selling_price~., car_train)
summary(model_all)
##
## Call:
## lm(formula = selling_price ~ ., data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
The summary of model_all shows a lot of information, but for now we focus on the Pr(>|t|) column. This column gives the significance level of each variable in the model: if the value is below 0.05, we can reasonably conclude that the variable has a significant effect on the target.
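For example, the predictors clearing the 0.05 threshold can be listed programmatically:
# Coefficient rows with Pr(>|t|) below 0.05
coefs <- summary(model_all)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]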
Feature Selection
Feature selection is the stage where we choose which variables to use, by evaluating and removing insignificant variables while watching the AIC. The AIC (Akaike Information Criterion) estimates the amount of information lost by the model: the smaller the AIC, the better the model (see the snippet after this list).
There are three stepwise feature selection strategies we can apply:
- backward elimination: starting from all predictors, the model is repeatedly evaluated while removing predictor variables until the model with the smallest AIC is obtained.
- forward selection: starting from a model with no predictors, the model is evaluated while adding predictor variables until the model with the smallest AIC is obtained.
- both: starting from a given model, predictor variables are both added and removed until the model with the smallest AIC is obtained.
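The criterion itself can be inspected directly for any fitted model; step() searches for the specification that minimizes it:
# AIC of the full model, the value the stepwise search tries to reduce
AIC(model_all)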
Backward Elimination
model_back <- step(model_all,direction = "backward", trace = 0)
summary(model_back)
##
## Call:
## lm(formula = selling_price ~ year + km_driven + fuel + seller_type +
## transmission + owner + mileage + engine + max_power + seats,
## data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Forward Selection
model_forward <- step(model0, direction = "forward", scope = list(lower = model0,
upper = model_all),
trace = 0)
summary(model_forward)
##
## Call:
## lm(formula = selling_price ~ max_power + year + transmission +
## seller_type + owner + mileage + km_driven + engine + seats +
## fuel, data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Both
model_both <- step(model0, direction = "both", scope = list(lower = model0,
upper = model_all),
trace = 0)
summary(model_both)
##
## Call:
## lm(formula = selling_price ~ max_power + year + transmission +
## seller_type + owner + mileage + km_driven + engine + seats +
## fuel, data = car_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2707308 -197720 10512 153376 4458792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.724e+07 4.289e+06 -15.676 < 2e-16 ***
## max_power 1.288e+04 2.890e+02 44.574 < 2e-16 ***
## year 3.335e+04 2.143e+03 15.564 < 2e-16 ***
## transmissionManual -4.487e+05 2.206e+04 -20.336 < 2e-16 ***
## seller_typeIndividual -2.481e+05 1.849e+04 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -3.452e+05 3.855e+04 -8.956 < 2e-16 ***
## ownerFourth & Above Owner 4.160e+04 4.300e+04 0.967 0.333392
## ownerSecond Owner -4.278e+04 1.499e+04 -2.853 0.004343 **
## ownerTest Drive Car 2.631e+06 2.307e+05 11.403 < 2e-16 ***
## ownerThird Owner -1.359e+04 2.576e+04 -0.528 0.597849
## mileage 1.483e+04 2.369e+03 6.261 4.07e-10 ***
## km_driven -9.389e-01 1.163e-01 -8.073 8.15e-16 ***
## engine 1.070e+02 2.643e+01 4.050 5.18e-05 ***
## seats -3.219e+04 8.812e+03 -3.653 0.000261 ***
## fuelDiesel -3.289e+04 6.878e+04 -0.478 0.632578
## fuelLPG 1.536e+05 1.134e+05 1.355 0.175540
## fuelPetrol -7.180e+04 6.914e+04 -1.039 0.299063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 459300 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
Model Comparison
compare_performance(model_all,model_back,model_forward,model_both)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC | AIC weights | BIC | BIC weights | R2 | R2 (adj.) | RMSE | Sigma
## ---------------------------------------------------------------------------------------------------------------------
## model_all | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_back | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_forward | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
## model_both | lm | 1.829e+05 | 0.250 | 1.830e+05 | 0.250 | 0.690 | 0.689 | 4.587e+05 | 4.593e+05
After performing feature selection in all three ways, the model does not change: the AIC, adjusted R2, and RMSE values are identical, so no predictor was dropped. We therefore keep model_all.
Linear Regression Assumption
As a statistical model, linear regression has several assumptions that need to be fulfilled so that its interpretation is not biased. These assumptions matter only when the purpose of the model is interpretation, i.e. seeing the effect of each predictor on the value of the target variable. If linear regression is used purely to make predictions, the assumptions are not required to be met.
Linearity
Linearity means that the target variable has a straight-line relationship with its predictors, and that the coefficients combine additively. If linearity is not met, all of the coefficient values we obtain are invalid, because the model assumes the underlying pattern is linear.
Normality of Residual (Residual Normal)
The normality assumption means that the residuals of the linear regression model should be normally distributed, centered around zero.
Homoscedasticity of Residual
Homoscedasticity means that the residuals (errors) have constant variance and do not form a pattern. If the errors do form a pattern, such as a linear or cone shape, we call it heteroscedasticity, and it biases the standard errors of the coefficient estimates (making them too narrow or too wide). Homoscedasticity can be checked visually by plotting the model's fitted values against the residuals and looking for a pattern.
No Multicollinearity
Multicollinearity occurs when the predictor variables in the model are strongly related to each other. A good model is expected to be free of multicollinearity. Its presence can be judged from the VIF (Variance Inflation Factor): a VIF above 10 indicates multicollinearity.
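Besides the combined diagnostic plot below, each assumption can also be tested numerically with the packages already loaded. A minimal sketch, assuming model_all from above (shapiro.test() accepts at most 5,000 values, so the residuals are subsampled):
bptest(model_all)                                # Breusch-Pagan test for heteroscedasticity (lmtest)
set.seed(123)
shapiro.test(sample(residuals(model_all), 5000)) # Shapiro-Wilk normality test on sampled residuals
vif(model_all)                                   # variance inflation factors (car); > 10 flags multicollinearity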
check_model(model_all)
Model Using Scaled Data
The diagnostics above show that the only assumption satisfied is the absence of multicollinearity; the others are violated. Next I will scale the predictor variables and the target variable to try to address the residual normality and heteroscedasticity problems.
# Scale the numeric columns (z-scores) and keep the factor columns unchanged
num_data <- car_train %>%
  select(where(is.numeric)) %>%
  sapply(scale)
fac_data <- car_train %>% select(where(is.factor))
car_scale <- data.frame(num_data, fac_data)
model_scale <- lm(selling_price~., car_scale)
summary(model_scale)
##
## Call:
## lm(formula = selling_price ~ ., data = car_scale)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2874 -0.2401 0.0128 0.1862 5.4142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.805766 0.087509 9.208 < 2e-16 ***
## year 0.155202 0.009972 15.564 < 2e-16 ***
## km_driven -0.067133 0.008316 -8.073 8.15e-16 ***
## mileage 0.073018 0.011662 6.261 4.07e-10 ***
## engine 0.065462 0.016163 4.050 5.18e-05 ***
## max_power 0.560830 0.012582 44.574 < 2e-16 ***
## seats -0.037662 0.010310 -3.653 0.000261 ***
## fuelDiesel -0.039934 0.083522 -0.478 0.632578
## fuelLPG 0.186478 0.137646 1.355 0.175540
## fuelPetrol -0.087191 0.083956 -1.039 0.299063
## seller_typeIndividual -0.301247 0.022450 -13.418 < 2e-16 ***
## seller_typeTrustmark Dealer -0.419179 0.046806 -8.956 < 2e-16 ***
## transmissionManual -0.544820 0.026790 -20.336 < 2e-16 ***
## ownerFourth & Above Owner 0.050515 0.052218 0.967 0.333392
## ownerSecond Owner -0.051941 0.018205 -2.853 0.004343 **
## ownerTest Drive Car 3.194600 0.280164 11.403 < 2e-16 ***
## ownerThird Owner -0.016501 0.031281 -0.528 0.597849
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5577 on 6307 degrees of freedom
## Multiple R-squared: 0.6897, Adjusted R-squared: 0.6889
## F-statistic: 876.3 on 16 and 6307 DF, p-value: < 2.2e-16
check_model(model_scale)
After creating a model with scaled data, the results are not much different from model_all: the adjusted R-squared is the same 0.6889, and again the only assumption satisfied is the absence of multicollinearity. For this reason, other regression models are recommended for this data, such as polynomial regression, random forest regression, etc.
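Before switching model families entirely, one more option worth sketching (not part of the original analysis) is a log transform of the right-skewed target, which often stabilizes the residual variance:
# Hedged sketch: model the log of selling_price instead of the raw price
model_log <- lm(log(selling_price) ~ ., car_train)
# back-transform predictions to the original price scale
pred_log <- exp(predict(model_log, newdata = car_test))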
Prediction
We generate predictions for the test set together with 95% prediction intervals:
pred <- predict(model_all, newdata = car_test, interval = "prediction", level = 0.95)
head(pred)
## fit lwr upr
## 2 693441.39 -207413.9 1594296.7
## 3 18287.87 -883617.2 920193.0
## 13 72350.87 -828574.5 973276.2
## 15 485480.94 -415205.7 1386167.6
## 20 98401.11 -802465.1 999267.3
## 22 518553.09 -382295.0 1419401.2
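Each row contains the point prediction (fit) plus the lower and upper bounds of the 95% prediction interval (lwr, upr). If only point predictions are needed, the interval argument can simply be dropped; a small sketch:
# Point predictions only (a plain numeric vector)
pred_point <- predict(model_all, newdata = car_test)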
Evaluation
After making predictions, we must check whether the model produces predictions with acceptably small error. Several metrics can be used to evaluate a regression model:
- R-Squared and Adjusted R-Squared: how well the model explains the variance of the target variable
- Error value: whether the predictions achieve a small error
The error metrics we will use are MAE (Mean Absolute Error) and MAPE (Mean Absolute Percentage Error). MAE is the average of the absolute errors, while MAPE expresses the average deviation as a fraction of the true value.
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat y_i - y_i| \]
\[ MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat y_i - y_i|}{y_i} \]
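As a cross-check, both metrics can be computed by hand from the point predictions (the fit column of pred); a minimal base-R sketch:
# Base-R equivalents of the metrics above
pred_fit <- pred[, "fit"]             # point predictions
y_true <- car_test$selling_price
mean(abs(pred_fit - y_true))          # MAE
mean(abs(pred_fit - y_true) / y_true) # MAPE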
min <- min(car$selling_price)
max <- max(car$selling_price)
# evaluate on the point predictions only; the original run passed the whole
# fit/lwr/upr matrix to these functions, which inflates the values printed below
mape <- MAPE(y_pred = pred[, "fit"], y_true = car_test$selling_price)
mae <- MAE(y_pred = pred[, "fit"], y_true = car_test$selling_price)
value <- c(min, max, mape, mae)
eval <- as.data.frame(value)
row.names(eval) <- c("min", "max", "mape", "mae")
eval
## value
## min 2.999900e+04
## max 1.000000e+07
## mape 2.263373e+00
## mae 7.127226e+05
Computed on the point predictions, the MAPE is about 0.87: the model's predictions deviate from the actual prices by roughly 87% on average (the 2.26 printed above comes from averaging over the interval bounds as well). Either way the error is far too large, so linear regression is not suitable for predicting the selling price of used cars in this data.
Conclusion
The data is a history of used car sales sourced from kaggle.com. The purpose of this analysis was to create a model that predicts the selling price of a used car from its features. Linear regression models were fitted and tuned via feature selection, but every variant gave the same results. Only one assumption of linear regression could be met, namely the absence of multicollinearity; the assumptions of residual normality, linearity, and homoscedasticity were not satisfied.
From these results I conclude that linear regression is not suitable for predicting the selling price of used cars in this data; other regression models, such as polynomial regression or random forest regression, are recommended instead.