1 Intro

1.1 What We’ll Do

We will build a linear regression model on a car price dataset. We want to understand the relationships among the variables, especially between the car price and the other variables. We also want to predict the price of a car based on historical data.

1.2 Business Goal

A Chinese automobile company, Geely Auto, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts.

They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting the pricing of cars in the American market, since those may be very different from the Chinese market. The company wants to know:

  • Which variables are significant in predicting the price of a car
  • How well those variables describe the price of a car

Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market. We are required to model the price of cars with the available independent variables. The model will be used by management to understand exactly how prices vary with the independent variables, so they can adjust the design of the cars, the business strategy, etc. to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

2 Data Preparation

2.1 Library and Setup

Before we start the analysis, we need to load the required packages.
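
If any of these packages are not yet installed, they can be installed once beforehand (an optional setup step):

#one-time installation of the required packages
install.packages(c("tidyverse", "lmtest", "car", "MLmetrics", "GGally"))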

#for data wrangling
library(tidyverse)

#for checking model assumptions
library(lmtest)
library(car)

#for computing error metrics
library(MLmetrics)

#for correlation visualization
library(GGally)

2.2 Import Data

We load the dataset and take a quick look at its structure.

car_data <- read.csv("data_input/car data.csv")

glimpse(car_data)
## Rows: 301
## Columns: 9
## $ Car_Name      <chr> "ritz", "sx4", "ciaz", "wagon r", "swift", "vitara brezz~
## $ Year          <int> 2014, 2013, 2017, 2011, 2014, 2018, 2015, 2015, 2016, 20~
## $ Selling_Price <dbl> 3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 6.50, 8.75, 7.~
## $ Present_Price <dbl> 5.59, 9.54, 9.85, 4.15, 6.87, 9.83, 8.12, 8.61, 8.89, 8.~
## $ Kms_Driven    <int> 27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 202~
## $ Fuel_Type     <chr> "Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Diese~
## $ Seller_Type   <chr> "Dealer", "Dealer", "Dealer", "Dealer", "Dealer", "Deale~
## $ Transmission  <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua~
## $ Owner         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~

2.3 Data Description

Below is a description of each column:

  • Car_Name: Name of the car
  • Year: Production Year
  • Selling_Price: Selling price (our target variable)
  • Present_Price: Present buying price
  • Kms_Driven: Kilometers driven value
  • Fuel_Type: Fuel type
  • Seller_Type: Seller Type
  • Transmission: Transmission Type
  • Owner: Number of previous owners

3 Exploratory Data Analysis

3.1 Check Data Type

After checking the data type of each column, we found that some columns do not have the required data type. We convert the character columns to factors to ease the analysis process.

df_data <- car_data %>% 
  mutate_if(is.character,as.factor)
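
As a quick sanity check, we can list the column classes to confirm the conversion (a small optional step):

#confirm every character column is now a factor
sapply(df_data, class)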

3.2 Check Missing Value

We have to check whether there are any missing values in our dataset.

colSums(is.na(df_data))
##      Car_Name          Year Selling_Price Present_Price    Kms_Driven 
##             0             0             0             0             0 
##     Fuel_Type   Seller_Type  Transmission         Owner 
##             0             0             0             0

After checking, there are no missing values in the dataset.

3.3 Analysis

Let us check the unique values of each categorical variable.

First, we cross-tabulate Owner and Fuel_Type.

table(df_data$Owner,df_data$Fuel_Type)
##    
##     CNG Diesel Petrol
##   0   2     59    229
##   1   0      1      9
##   3   0      0      1

Looking at the cross-tabulation of Fuel_Type and Owner, we found that fuel type “CNG” and owner “3” are very rare. We will drop the CNG rows and recode Owner 3 to Owner 1.

df_data <- df_data %>% 
  mutate(Owner = ifelse(Owner == 3, yes = 1, no = Owner)) %>% 
  filter(Fuel_Type != "CNG")
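
To verify that the recode and filter took effect, we can re-tabulate (a quick optional check; output omitted):

#Owner should now contain only 0 and 1; the CNG rows are gone,
#although the unused factor level may still appear with zero counts
table(df_data$Owner, df_data$Fuel_Type)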

Now let us cross-tabulate Owner and Seller_Type.

table(df_data$Owner,df_data$Seller_Type)
##    
##     Dealer Individual
##   0    189         99
##   1      4          7

There is no need to drop any rows based on the Owner and Seller_Type combination.

Next, we cross-tabulate Owner and Transmission.

table(df_data$Owner,df_data$Transmission)
##    
##     Automatic Manual
##   0        39    249
##   1         1     10

There is no need to drop any rows based on the Owner and Transmission combination.

After checking the unique values of each category, let us check the correlation between the numerical variables.

ggcorr(df_data, label = T)
## Warning in ggcorr(df_data, label = T): data in column(s) 'Car_Name',
## 'Fuel_Type', 'Seller_Type', 'Transmission' are not numeric and were ignored

Some variables are correlated with the selling price and some are not. Let us examine them more closely, one by one.

Let us check the relation between kilometers driven and selling price.

options(scipen=999)
df_data %>% ggplot(aes(x=Kms_Driven,y=Selling_Price)) + 
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'

There are outliers at Kms_Driven > 150,000 km. Let us check the fit if we drop these outliers.

df_filter <- df_data %>% 
  filter(Kms_Driven<150000)

df_filter %>% ggplot(aes(x=Kms_Driven,y=Selling_Price)) +
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'

The correlation looks stronger when we drop the outliers. We will compare the filtered and unfiltered data later in terms of regression performance.
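
To quantify the change rather than eyeballing the plots, we can compare the correlation coefficients directly (a quick optional check; output not shown):

#correlation between kilometers driven and selling price, before and after filtering
cor(df_data$Kms_Driven, df_data$Selling_Price)
cor(df_filter$Kms_Driven, df_filter$Selling_Price)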

Now let’s check the correlation between selling price and year.

df_filter %>% 
  ggplot(aes(x=Year,y=Selling_Price)) + geom_point() +
  geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'

There is a correlation between Year and Selling_Price: the newer the car, the higher the selling price. There are also outliers at selling prices above 25k USD. Let us check the fit if we drop these outliers.

df_filter <- df_data %>% 
  filter(Selling_Price<25)

df_filter %>% ggplot(aes(x=Year,y=Selling_Price)) +
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'

The correlation between selling price and year looks better after filtering. Note that df_filter is rebuilt here from df_data, so only the selling-price filter is kept for now (the kilometer filter is dropped). We will apply both filters together later if the model needs improvement.

Now let us check the correlation between the numerical variables again, after filtering the dataset.

ggcorr(df_filter, label = T)

Looking at the plot above, there is no correlation between selling price and owner, so we can drop the Owner column from further analysis.
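
We could also confirm this with a direct correlation (a small optional check; Owner is still numeric at this point):

#a value near zero supports dropping Owner
cor(df_filter$Selling_Price, df_filter$Owner)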

4 Model Fitting

We will build two models for the analysis:

  • Filtered data model
  • Original (full) model

Now let us fit each model and check its summary.

#model_filter

model_filter <- lm(df_filter$Selling_Price~.,data=df_filter
                   %>% select(-Car_Name) %>% select(-Owner)) 
summary(model_filter)
## 
## Call:
## lm(formula = df_filter$Selling_Price ~ ., data = df_filter %>% 
##     select(-Car_Name) %>% select(-Owner))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2597 -0.8122 -0.1420  0.6715  6.0421 
## 
## Coefficients:
##                             Estimate     Std. Error t value
## (Intercept)           -788.751799353   76.816124100 -10.268
## Year                     0.393287656    0.038081707  10.327
## Present_Price            0.493856029    0.020603266  23.970
## Kms_Driven              -0.000006036    0.000002900  -2.081
## Fuel_TypePetrol         -1.512327484    0.273072477  -5.538
## Seller_TypeIndividual   -0.773297045    0.249636762  -3.098
## TransmissionManual      -0.665326079    0.316259434  -2.104
##                                   Pr(>|t|)    
## (Intercept)           < 0.0000000000000002 ***
## Year                  < 0.0000000000000002 ***
## Present_Price         < 0.0000000000000002 ***
## Kms_Driven                         0.03830 *  
## Fuel_TypePetrol               0.0000000685 ***
## Seller_TypeIndividual              0.00214 ** 
## TransmissionManual                 0.03626 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 290 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8771 
## F-statistic:   353 on 6 and 290 DF,  p-value: < 0.00000000000000022

Interpretation:

  • Adjusted R-squared: 87.71% => the model can explain 87.71% of the variation in selling price; the rest can be explained by variables not present in the model.
  • If fuel type = petrol, the predicted selling price decreases, due to the negative coefficient/estimate.
  • If seller type = individual, the predicted selling price decreases, due to the negative coefficient/estimate.
  • If transmission = manual, the predicted selling price decreases, due to the negative coefficient/estimate.
  • The higher the kilometers driven, the lower the selling price, due to the negative coefficient/estimate.
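
To see how these coefficients combine into a single prediction, here is a minimal sketch using a hypothetical car (all feature values below are made up for illustration):

#hypothetical example: a 2015 petrol car sold by a dealer, manual transmission
new_car <- data.frame(Year = 2015,
                      Present_Price = 8,
                      Kms_Driven = 30000,
                      Fuel_Type = "Petrol",
                      Seller_Type = "Dealer",
                      Transmission = "Manual")

#predicted selling price = intercept + sum of coefficient * value
predict(model_filter, newdata = new_car)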

#model_full

model_full <- lm(df_data$Selling_Price~.,data=df_data %>% select(-Car_Name))
summary(model_full)
## 
## Call:
## lm(formula = df_data$Selling_Price ~ ., data = df_data %>% select(-Car_Name))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6609 -0.8809 -0.1923  0.6668 11.1439 
## 
## Coefficients:
##                             Estimate     Std. Error t value
## (Intercept)           -805.458570093   87.237926959  -9.233
## Year                     0.402366592    0.043246435   9.304
## Present_Price            0.435643174    0.016050289  27.142
## Kms_Driven              -0.000007020    0.000003253  -2.158
## Fuel_TypePetrol         -1.875330696    0.300598161  -6.239
## Seller_TypeIndividual   -1.169958414    0.256995539  -4.552
## TransmissionManual      -1.430382791    0.329675302  -4.339
## Owner                   -0.078750186    0.555773022  -0.142
##                                   Pr(>|t|)    
## (Intercept)           < 0.0000000000000002 ***
## Year                  < 0.0000000000000002 ***
## Present_Price         < 0.0000000000000002 ***
## Kms_Driven                          0.0317 *  
## Fuel_TypePetrol              0.00000000155 ***
## Seller_TypeIndividual        0.00000780168 ***
## TransmissionManual           0.00001978516 ***
## Owner                               0.8874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.776 on 291 degrees of freedom
## Multiple R-squared:  0.8815, Adjusted R-squared:  0.8786 
## F-statistic: 309.2 on 7 and 291 DF,  p-value: < 0.00000000000000022

Interpretation:

  • Adjusted R-squared: 87.86% => the model can explain 87.86% of the variation in selling price; the rest can be explained by variables not present in the model.
  • The p-value for “Owner” is not significant, so we can assume it has no influence on the selling price.
  • If fuel type = petrol, the predicted selling price decreases, due to the negative coefficient/estimate.
  • If seller type = individual, the predicted selling price decreases, due to the negative coefficient/estimate.
  • If transmission = manual, the predicted selling price decreases, due to the negative coefficient/estimate.
  • The higher the kilometers driven, the lower the selling price, due to the negative coefficient/estimate.

Now let us determine which variables to use. Since the adjusted R-squared of the filtered-data model is nearly the same as that of the original model, we decide to go with the filtered-data model for further analysis, since it uses fewer variables. We then run stepwise selection on it.

model_step <- step(model_filter,direction="both")
## Start:  AIC=278.62
## df_filter$Selling_Price ~ Year + Present_Price + Kms_Driven + 
##     Fuel_Type + Seller_Type + Transmission
## 
##                 Df Sum of Sq     RSS    AIC
## <none>                        723.95 278.62
## - Kms_Driven     1     10.81  734.76 281.03
## - Transmission   1     11.05  734.99 281.12
## - Seller_Type    1     23.95  747.90 286.29
## - Fuel_Type      1     76.57  800.51 306.48
## - Year           1    266.25  990.20 369.64
## - Present_Price  1   1434.29 2158.24 601.04

Since removing any variable would give an AIC no lower than the starting AIC, we keep all the variables (Transmission, Year, Kms_Driven, Seller_Type, Fuel_Type, Present_Price).

summary(model_step)
## 
## Call:
## lm(formula = df_filter$Selling_Price ~ Year + Present_Price + 
##     Kms_Driven + Fuel_Type + Seller_Type + Transmission, data = df_filter %>% 
##     select(-Car_Name) %>% select(-Owner))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2597 -0.8122 -0.1420  0.6715  6.0421 
## 
## Coefficients:
##                             Estimate     Std. Error t value
## (Intercept)           -788.751799353   76.816124100 -10.268
## Year                     0.393287656    0.038081707  10.327
## Present_Price            0.493856029    0.020603266  23.970
## Kms_Driven              -0.000006036    0.000002900  -2.081
## Fuel_TypePetrol         -1.512327484    0.273072477  -5.538
## Seller_TypeIndividual   -0.773297045    0.249636762  -3.098
## TransmissionManual      -0.665326079    0.316259434  -2.104
##                                   Pr(>|t|)    
## (Intercept)           < 0.0000000000000002 ***
## Year                  < 0.0000000000000002 ***
## Present_Price         < 0.0000000000000002 ***
## Kms_Driven                         0.03830 *  
## Fuel_TypePetrol               0.0000000685 ***
## Seller_TypeIndividual              0.00214 ** 
## TransmissionManual                 0.03626 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.58 on 290 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8771 
## F-statistic:   353 on 6 and 290 DF,  p-value: < 0.00000000000000022

5 Model Evaluation

5.1 Model Performance

5.1.1 Performance

We check the performance of our model on a new dataset.

car_new <- read.csv("data_input/car_test.csv")
car_new
pred_test1 <- predict(model_filter,newdata = car_new)

pred_test1
##          1          2          3          4          5          6          7 
##  8.1346436  4.5204164  6.3730776 -0.3372467  7.3616695 18.5983244 14.1823985 
##          8          9         10         11         12         13         14 
##  4.8908479  4.2717796 20.6178014  6.2264447  4.6861025  3.1676793  6.6928351 
##         15         16         17         18         19         20         21 
## 16.9578693 46.3513302  3.9254315 18.5983244  2.4488207  1.7973028  1.3952354 
##         22         23         24         25         26         27         28 
##  0.5120853 -0.9650371  2.0064905  1.1408446  1.4291063 -0.1226444  0.6767159 
##         29         30         31         32         33         34         35 
##  1.4845279 -0.4295634  2.4496912 -0.4769719  2.0941557  1.4307502  1.1004201 
##         36         37         38         39         40         41         42 
## -1.0408416 -0.2773854 -1.8942674 -0.7310323 -2.4078910  4.0942628  6.3112505 
##         43         44         45         46         47         48         49 
## 10.0988590  3.1345472  5.5220464  5.8476838  4.0547249  2.4099883  4.8498838 
##         50         51         52         53         54         55         56 
##  9.9503370  2.6834633  4.3912601  8.4785166  9.1346077  4.3716157  7.7115799 
##         57         58         59         60 
##  3.6131979  8.5886299  4.3680055  9.9629546

5.1.2 Error Check

The performance of our model (how well it predicts the target variable) can be calculated using the root mean squared error \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE is often preferred over MAE (mean absolute error) because RMSE squares the difference between the actual and predicted values, meaning that predictions with higher error are penalized more heavily. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package loaded earlier.
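
To make the formula concrete, here is a minimal sketch of RMSE written by hand; it should agree with MLmetrics::RMSE:

#RMSE computed directly from the formula above
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_pred - y_true)^2))
}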

#Root Mean Squared Error in model

RMSE(model_filter$fitted.values, df_filter$Selling_Price)
## [1] 1.561259

#Root Mean Squared Error in prediction test

RMSE(pred_test1, car_new$Selling_Price)
## [1] 2.237076

Looking at the RMSE results above, the prediction-test error in selling price is approximately ±2.24k USD.
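
For comparison, we could also compute the MAE with the same package (an optional check; output not shown):

#Mean Absolute Error: the average magnitude of the prediction errors
MAE(pred_test1, car_new$Selling_Price)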

5.2 Assumptions

5.2.1 Multicolinearity

Multicollinearity means that there is correlation between the independent variables/predictors. To check for multicollinearity, we can measure the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(model_filter)
##          Year Present_Price    Kms_Driven     Fuel_Type   Seller_Type 
##      1.437801      2.436736      1.519105      1.394173      1.701732 
##  Transmission 
##      1.357395

Since all of the VIF values are lower than 10, there is no problematic collinearity among our predictors.

5.2.2 Errors Normality

The second assumption in linear regression is that the residuals follow a normal distribution. We can easily check this using the Shapiro-Wilk normality test or a density plot.

#density plot

plot(density(model_filter$residuals))

For the Shapiro-Wilk test, the null hypothesis (the residuals are normally distributed) is accepted when the p-value > 0.05.

shapiro.test(model_filter$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_filter$residuals
## W = 0.92677, p-value = 0.00000000006591

The density plot looks roughly bell-shaped, but the Shapiro-Wilk p-value is far below 0.05, so we reject the null hypothesis: the residuals are not normally distributed. We will address this in the model improvement section.
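
Another common visual check, not part of the original analysis, is a normal Q-Q plot of the residuals:

#points should fall close to the line if the residuals are normal
qqnorm(model_filter$residuals)
qqline(model_filter$residuals)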

5.2.3 Heteroscedasticity

Heteroscedasticity means that the variances of the error terms are non-constant. One can identify non-constant variance in the errors from the presence of a funnel shape in the residual plot, the same plot used for the linearity check. We can use bptest to test for heteroscedasticity.

For the Breusch-Pagan test, the null hypothesis (constant error variance) is accepted when the p-value > 0.05.

#bptest

bptest(model_filter)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_filter
## BP = 111.24, df = 6, p-value < 0.00000000000000022

The p-value from bptest above is < 0.05, so the null hypothesis is rejected. The error variances are non-constant.

#model plot check fitted values & residuals

plot(model_filter$fitted.values,
     model_filter$residuals)

If we check the plot above, we can see the presence of a funnel shape. It means that heteroscedasticity is present.

5.2.4 Linearity

#linearity check

data.frame(prediction=model_filter$fitted.values,
     error=model_filter$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()

From the plot above, the residuals form a U-shaped curve, which means the linearity assumption does not hold.

6 Model Improvement

6.1 Model Performance

6.1.1 Tuning

We have already seen that our model does not satisfy some of the assumptions, namely linearity, homoscedasticity, and normality of errors. Now we will try to fix them. To make the model more linear, we can transform some of the variables.

#Transform the target and predictors
#Transform numerical variables to log10
#Drop the outliers in kilometers driven and selling price

df_filter2 <- df_data %>% 
  filter(Kms_Driven<150000) %>% filter(Selling_Price<25) %>% 
  select(-Owner) %>% select(-Car_Name) %>%  
  mutate_if(~is.numeric(.), ~log10(.))

df_filter2

6.1.2 Performance Test

Now let us build the model using our tuned data and check the summary.

model_filter2 <- lm(df_filter2$Selling_Price~.,data=df_filter2)
summary(model_filter2)
## 
## Call:
## lm(formula = df_filter2$Selling_Price ~ ., data = df_filter2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.255218 -0.046566 -0.001306  0.049622  0.203454 
## 
## Coefficients:
##                          Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)           -699.074555   31.293789 -22.339 < 0.0000000000000002 ***
## Year                   211.645730    9.459180  22.375 < 0.0000000000000002 ***
## Present_Price            0.911397    0.020098  45.347 < 0.0000000000000002 ***
## Kms_Driven              -0.066018    0.014726  -4.483           0.00001064 ***
## Fuel_TypePetrol         -0.065190    0.013515  -4.823           0.00000229 ***
## Seller_TypeIndividual   -0.095867    0.020521  -4.672           0.00000460 ***
## TransmissionManual       0.007116    0.014537   0.490                0.625    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0795 on 287 degrees of freedom
## Multiple R-squared:  0.9791, Adjusted R-squared:  0.9786 
## F-statistic:  2238 on 6 and 287 DF,  p-value: < 0.00000000000000022

After we transformed our data with log10, the adjusted R-squared increased to 97.86% from 87.71% before tuning.

Interpretation:

  • Adjusted R-squared: 97.86% => the model can explain 97.86% of the variation in selling price; the rest can be explained by variables not present in the model.
  • If fuel type = petrol, the predicted selling price decreases, due to the negative coefficient/estimate.
  • If seller type = individual, the predicted selling price decreases, due to the negative coefficient/estimate.
  • The higher the kilometers driven, the lower the selling price, due to the negative coefficient/estimate.
  • The Transmission variable is not significant in this model. For now we can keep it; if a later optimization round with more data still leaves it non-significant, we can drop it.
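
Because the model is on a log10 scale, back-transforming a coefficient b with 10^b gives a multiplicative effect on price. For example, 10^(-0.0652) is roughly 0.86, so petrol cars are predicted to sell for about 14% less, all else equal; and since Present_Price is also logged, its coefficient (about 0.91) acts as an elasticity: a 1% higher present price is associated with roughly a 0.91% higher selling price. A small sketch:

#back-transform the coefficients; for dummy variables this gives the
#multiplicative effect on the selling price
10^coef(model_filter2)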

6.1.3 Error Check

Now we check our tuned model on the new dataset, transformed in the same way.

car_new2 <- car_new %>% mutate_if(~is.numeric(.), ~log10(.)) %>% 
  select(-Owner) %>% select(-Car_Name)

pred_test2 <- predict(model_filter2,newdata = car_new2)

pred_test2
##           1           2           3           4           5           6 
##  0.90476133  0.52816824  0.74147346  0.03715839  0.86780893  1.31154187 
##           7           8           9          10          11          12 
##  1.25781573  0.68407605  0.61274658  1.31052491  0.73116207  0.66102334 
##          13          14          15          16          17          18 
##  0.34902130  0.78002841  1.35684452  1.50929855  0.45993983  1.31154187 
##          19          20          21          22          23          24 
##  0.22772248  0.03237757 -0.01958501 -0.12988010 -0.24921166 -0.05587964 
##          25          26          27          28          29          30 
## -0.19169749 -0.20654174 -0.39562346 -0.30892476 -0.16085088 -0.40052792 
##          31          32          33          34          35          36 
## -0.31511515 -0.38491318 -0.27637958 -0.28187366 -0.44791928 -0.65184512 
##          37          38          39          40          41          42 
## -0.54258036 -0.61768574 -0.59792297 -0.78594567  0.48359305  0.88881794 
##          43          44          45          46          47          48 
##  1.02832435  0.48152883  0.73710809  0.69063170  0.56887411  0.36643210 
##          49          50          51          52          53          54 
##  0.67988738  1.04622915  0.41769409  0.62682657  0.98696580  0.97838623 
##          55          56          57          58          59          60 
##  0.63097148  0.96450705  0.55085697  1.01503151  0.62191950  1.09812206

Now we will measure the RMSE. Since our target was transformed with log10, we need to transform the predictions back (with 10^x) to get the original price values and a meaningful comparison with our previous model.

# RMSE of the tuned model on the training data
RMSE(10^(model_filter2$fitted.values),10^(df_filter2$Selling_Price))
## [1] 0.7981968

# RMSE of the prediction test
RMSE(10^(pred_test2),10^(car_new2$Selling_Price))
## [1] 0.8183601

Looking at the RMSE results above, the tuned model has an RMSE of 0.798 on its own data, while the prediction dataset has an RMSE of 0.818. This means our model fits the prediction test well. We can interpret the prediction-test error in selling price as approximately ±0.818k USD. This error is better than that of our model before tuning.

6.2 Assumptions

6.2.1 Multicolinearity

vif(model_filter2)
##          Year Present_Price    Kms_Driven     Fuel_Type   Seller_Type 
##      1.553042      5.127556      1.851066      1.327838      4.477903 
##  Transmission 
##      1.106205

Since all of the VIF values are lower than 10, there is no problematic collinearity among the predictors in our tuned dataset.

6.2.2 Errors Normality

We can check it using a density plot.

#density plot

plot(density(model_filter2$residuals))

From the plot above, the normality of the residuals looks good.

6.2.3 Heteroscedasticity

As before, we check for heteroscedasticity by looking for a funnel shape in the plot of residuals against fitted values.

#model plot check fitted values & residuals

plot(model_filter2$fitted.values,
     model_filter2$residuals)

If we check the plot above, we can see no presence of a funnel shape. It means that heteroscedasticity is not present.

6.2.4 Linearity

#linearity check

data.frame(prediction=model_filter2$fitted.values,
     error=model_filter2$residuals) %>% 
  ggplot(aes(prediction,error)) +
  geom_hline(yintercept=0) +
  geom_point() +
  geom_smooth() +
  theme_bw()

From the plot above, there is little to no discernible pattern in our residual plot, so we can conclude that our model is linear.

Now let us summarize the comparison between model_filter and model_filter2.

Model <- c("model_filter", "model_filter2")
Adj_R_Squared <- c(0.8796,0.9786)
RMSE_Model <- c(1.561259,0.7981968 )
RMSE_Pred <- c(2.237076,0.8183601)

df <- data.frame(Model,Adj_R_Squared,RMSE_Model,RMSE_Pred )

print(df)
##           Model Adj_R_Squared RMSE_Model RMSE_Pred
## 1  model_filter        0.8796  1.5612590 2.2370760
## 2 model_filter2        0.9786  0.7981968 0.8183601

7 Conclusion

The variables useful for describing the variance in car prices are Present_Price, Year, Kms_Driven, Fuel_Type, Seller_Type, and Transmission. Our final model satisfies the classical assumptions. The R-squared of the model is high: the model explains 97.86% of the variance in car price. The accuracy of the model in predicting car prices is measured with RMSE: the model has an RMSE of 0.798 and the prediction data an RMSE of 0.818, suggesting that our model fits the prediction dataset well.

We have now learned how to build a linear regression model and what to pay attention to when building one.