We will build a linear regression model on a car price dataset. We want to understand the relationships among the variables, especially between the car price and the other variables, and to predict the price of a new car based on the historical data.
A Chinese automobile company Geely Auto aspires to enter the US market by setting up their manufacturing unit there and producing cars locally to give competition to their US and European counterparts.
They have contracted an automobile consulting company to understand the factors on which the pricing of cars depends. Specifically, they want to understand the factors affecting car prices in the American market, since those may be very different from the Chinese market. In particular, the company wants to know which variables are significant in predicting the price of a car and how well those variables describe the price.
Based on various market surveys, the consulting firm has gathered a large dataset of different types of cars across the American market. We are required to model the price of cars with the available independent variables. The management will use the model to understand how exactly prices vary with the independent variables, so they can adjust car design, business strategy, and so on to meet certain price levels. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
Before we do the analysis, we need to load the required packages.
# for data wrangling
library(tidyverse)
# for checking model assumptions
library(lmtest)
library(car)
# for calculating error metrics
library(MLmetrics)
# for correlation visualization
library(GGally)
We need the data to do the analysis, so let us load the dataset.
car_data <- read.csv("data_input/car data.csv")
glimpse(car_data)
## Rows: 301
## Columns: 9
## $ Car_Name <chr> "ritz", "sx4", "ciaz", "wagon r", "swift", "vitara brezz~
## $ Year <int> 2014, 2013, 2017, 2011, 2014, 2018, 2015, 2015, 2016, 20~
## $ Selling_Price <dbl> 3.35, 4.75, 7.25, 2.85, 4.60, 9.25, 6.75, 6.50, 8.75, 7.~
## $ Present_Price <dbl> 5.59, 9.54, 9.85, 4.15, 6.87, 9.83, 8.12, 8.61, 8.89, 8.~
## $ Kms_Driven <int> 27000, 43000, 6900, 5200, 42450, 2071, 18796, 33429, 202~
## $ Fuel_Type <chr> "Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Diese~
## $ Seller_Type <chr> "Dealer", "Dealer", "Dealer", "Dealer", "Dealer", "Deale~
## $ Transmission <chr> "Manual", "Manual", "Manual", "Manual", "Manual", "Manua~
## $ Owner <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
Below is the data description for each column:
- Car_Name: name of the car
- Year: production year
- Selling_Price: selling price
- Present_Price: present (current) buying price
- Kms_Driven: kilometers driven
- Fuel_Type: fuel type
- Seller_Type: seller type (dealer or individual)
- Transmission: transmission type
- Owner: number of previous owners

After checking the data type of each column, we found that some columns do not have the appropriate data type. We need to convert these columns to ease the analysis.
df_data <- car_data %>%
  mutate_if(is.character, as.factor)
We also have to check whether there are any missing values in the data set.
colSums(is.na(df_data))
## Car_Name Year Selling_Price Present_Price Kms_Driven
## 0 0 0 0 0
## Fuel_Type Seller_Type Transmission Owner
## 0 0 0 0
As shown above, there are no missing values in the data set.
Let us check the unique values of each categorical variable.
First, we cross-tabulate Owner and Fuel_Type.
table(df_data$Owner,df_data$Fuel_Type)
##
## CNG Diesel Petrol
## 0 2 59 229
## 1 0 1 9
## 3 0 0 1
Checking the cross-tabulation of Owner and Fuel_Type, we found that fuel type "CNG" and Owner value 3 are very rare. We will drop the CNG rows and recode Owner 3 to Owner 1.
df_data <- df_data %>%
  mutate(Owner = ifelse(Owner == 3, yes = 1, no = Owner)) %>%
  filter(Fuel_Type != "CNG")
Now let us cross-tabulate Owner and Seller_Type.
table(df_data$Owner,df_data$Seller_Type)
##
## Dealer Individual
## 0 189 99
## 1 4 7
There is no need to drop any rows or recode values based on Owner and Seller_Type.
Let us check between Owner and Transmission
table(df_data$Owner,df_data$Transmission)
##
## Automatic Manual
## 0 39 249
## 1 1 10
There is no need to drop any rows or recode values based on Owner and Transmission.
After checking the unique values of each categorical variable, let us check the correlation between the numerical variables.
ggcorr(df_data,label=T)
## Warning in ggcorr(df_data, label = T): data in column(s) 'Car_Name',
## 'Fuel_Type', 'Seller_Type', 'Transmission' are not numeric and were ignored
Some variables are correlated with the selling price, while others are not. Let us examine them one by one.
Let us check the relationship between Kms_Driven and Selling_Price.
options(scipen=999)
df_data %>% ggplot(aes(x=Kms_Driven,y=Selling_Price)) +
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'
There are outliers with Kms_Driven above 150,000 km. Let us check the fitted line after dropping these outliers.
df_filter <- df_data %>%
filter(Kms_Driven<150000)
df_filter %>% ggplot(aes(x=Kms_Driven,y=Selling_Price)) +
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'
The correlation looks stronger after dropping the outliers. Later we will compare regression models fitted on the filtered and the unfiltered data.
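To quantify this, a quick check of our own (the values are not printed in this report) is to compare the Pearson correlation before and after removing the high-mileage cars:
# correlation on the full data (output not shown)
cor(df_data$Kms_Driven, df_data$Selling_Price)
# correlation after filtering Kms_Driven < 150,000
cor(df_filter$Kms_Driven, df_filter$Selling_Price)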
Now let’s check correlation between selling price and year
df_filter %>%
ggplot(aes(x=Year,y=Selling_Price)) + geom_point() +
geom_smooth(method="lm") + theme_bw()## `geom_smooth()` using formula 'y ~ x'
There is a correlation between Year and Selling_Price: the newer the car, the higher the selling price. There are also outliers with a selling price above 25 k USD. Let us check the fitted line after dropping these outliers.
df_filter <- df_data %>%
filter(Selling_Price<25)
df_filter %>% ggplot(aes(x=Year,y=Selling_Price)) +
  geom_point() + geom_smooth(method="lm") + theme_bw()
## `geom_smooth()` using formula 'y ~ x'
The correlation between Selling_Price and Year looks better after we filter the data. We keep this filter for now and will see later whether the data needs further treatment.
Now let us check the correlation between the numerical variables again, after filtering the dataset.
ggcorr(df_filter,label=T)
Looking at the plot above, there is almost no correlation between Selling_Price and Owner, so we can drop the Owner column from further analysis.
We will build two models for the analysis:
- a model on the filtered data (model_filter)
- a model on the original data (model_full)

Now let us check the summary of each model.
#model_filter
model_filter <- lm(df_filter$Selling_Price~.,data=df_filter
%>% select(-Car_Name) %>% select(-Owner))
summary(model_filter)
##
## Call:
## lm(formula = df_filter$Selling_Price ~ ., data = df_filter %>%
## select(-Car_Name) %>% select(-Owner))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2597 -0.8122 -0.1420 0.6715 6.0421
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -788.751799353 76.816124100 -10.268
## Year 0.393287656 0.038081707 10.327
## Present_Price 0.493856029 0.020603266 23.970
## Kms_Driven -0.000006036 0.000002900 -2.081
## Fuel_TypePetrol -1.512327484 0.273072477 -5.538
## Seller_TypeIndividual -0.773297045 0.249636762 -3.098
## TransmissionManual -0.665326079 0.316259434 -2.104
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Year < 0.0000000000000002 ***
## Present_Price < 0.0000000000000002 ***
## Kms_Driven 0.03830 *
## Fuel_TypePetrol 0.0000000685 ***
## Seller_TypeIndividual 0.00214 **
## TransmissionManual 0.03626 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.58 on 290 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8771
## F-statistic: 353 on 6 and 290 DF, p-value: < 0.00000000000000022
Interpretation: all predictors are statistically significant at the 5% level. Holding the other variables constant, each additional unit of Present_Price adds about 0.49 to the selling price and each additional year of production adds about 0.39. Petrol cars sell for about 1.51 less than diesel cars, individual sellers obtain about 0.77 less than dealers, and manual cars sell for about 0.67 less than automatics. The adjusted R-squared is 0.877.
#model_full
model_full <- lm(df_data$Selling_Price~.,data=df_data %>% select(-Car_Name))
summary(model_full)
##
## Call:
## lm(formula = df_data$Selling_Price ~ ., data = df_data %>% select(-Car_Name))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6609 -0.8809 -0.1923 0.6668 11.1439
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -805.458570093 87.237926959 -9.233
## Year 0.402366592 0.043246435 9.304
## Present_Price 0.435643174 0.016050289 27.142
## Kms_Driven -0.000007020 0.000003253 -2.158
## Fuel_TypePetrol -1.875330696 0.300598161 -6.239
## Seller_TypeIndividual -1.169958414 0.256995539 -4.552
## TransmissionManual -1.430382791 0.329675302 -4.339
## Owner -0.078750186 0.555773022 -0.142
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Year < 0.0000000000000002 ***
## Present_Price < 0.0000000000000002 ***
## Kms_Driven 0.0317 *
## Fuel_TypePetrol 0.00000000155 ***
## Seller_TypeIndividual 0.00000780168 ***
## TransmissionManual 0.00001978516 ***
## Owner 0.8874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.776 on 291 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.8786
## F-statistic: 309.2 on 7 and 291 DF, p-value: < 0.00000000000000022
Interpretation: all predictors except Owner are statistically significant; the Owner p-value of 0.887 confirms that it adds little information, which is consistent with its near-zero correlation with Selling_Price seen earlier. The adjusted R-squared is 0.879, essentially the same as the filtered model.
Now let us determine which variables we will use. Since the adjusted R-squared of the filtered data model is nearly the same as that of the original model, we proceed with the filtered data model, which uses fewer variables.
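As a quick side-by-side check (a small sketch of ours; the values are not printed in this report), the adjusted R-squared of both models can be extracted directly:
# compare adjusted R-squared of the two candidate models (output not shown)
summary(model_filter)$adj.r.squared
summary(model_full)$adj.r.squared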
model_step <- step(model_filter, direction = "both")
## Start: AIC=278.62
## df_filter$Selling_Price ~ Year + Present_Price + Kms_Driven +
## Fuel_Type + Seller_Type + Transmission
##
## Df Sum of Sq RSS AIC
## <none> 723.95 278.62
## - Kms_Driven 1 10.81 734.76 281.03
## - Transmission 1 11.05 734.99 281.12
## - Seller_Type 1 23.95 747.90 286.29
## - Fuel_Type 1 76.57 800.51 306.48
## - Year 1 266.25 990.20 369.64
## - Present_Price 1 1434.29 2158.24 601.04
Since removing any variable would increase the AIC above the starting value, stepwise selection keeps all the variables (Transmission, Year, Kms_Driven, Seller_Type, Fuel_Type, Present_Price).
summary(model_step)
##
## Call:
## lm(formula = df_filter$Selling_Price ~ Year + Present_Price +
## Kms_Driven + Fuel_Type + Seller_Type + Transmission, data = df_filter %>%
## select(-Car_Name) %>% select(-Owner))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2597 -0.8122 -0.1420 0.6715 6.0421
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -788.751799353 76.816124100 -10.268
## Year 0.393287656 0.038081707 10.327
## Present_Price 0.493856029 0.020603266 23.970
## Kms_Driven -0.000006036 0.000002900 -2.081
## Fuel_TypePetrol -1.512327484 0.273072477 -5.538
## Seller_TypeIndividual -0.773297045 0.249636762 -3.098
## TransmissionManual -0.665326079 0.316259434 -2.104
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## Year < 0.0000000000000002 ***
## Present_Price < 0.0000000000000002 ***
## Kms_Driven 0.03830 *
## Fuel_TypePetrol 0.0000000685 ***
## Seller_TypeIndividual 0.00214 **
## TransmissionManual 0.03626 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.58 on 290 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.8771
## F-statistic: 353 on 6 and 290 DF, p-value: < 0.00000000000000022
We check the performance of our model on a new data set.
car_new <- read.csv("data_input/car_test.csv")
pred_test1 <- predict(model_filter, newdata = car_new)
pred_test1
## 1 2 3 4 5 6 7
## 8.1346436 4.5204164 6.3730776 -0.3372467 7.3616695 18.5983244 14.1823985
## 8 9 10 11 12 13 14
## 4.8908479 4.2717796 20.6178014 6.2264447 4.6861025 3.1676793 6.6928351
## 15 16 17 18 19 20 21
## 16.9578693 46.3513302 3.9254315 18.5983244 2.4488207 1.7973028 1.3952354
## 22 23 24 25 26 27 28
## 0.5120853 -0.9650371 2.0064905 1.1408446 1.4291063 -0.1226444 0.6767159
## 29 30 31 32 33 34 35
## 1.4845279 -0.4295634 2.4496912 -0.4769719 2.0941557 1.4307502 1.1004201
## 36 37 38 39 40 41 42
## -1.0408416 -0.2773854 -1.8942674 -0.7310323 -2.4078910 4.0942628 6.3112505
## 43 44 45 46 47 48 49
## 10.0988590 3.1345472 5.5220464 5.8476838 4.0547249 2.4099883 4.8498838
## 50 51 52 53 54 55 56
## 9.9503370 2.6834633 4.3912601 8.4785166 9.1346077 4.3716157 7.7115799
## 57 58 59 60
## 3.6131979 8.5886299 4.3680055 9.9629546
The performance of our model (how well it predicts the target variable) can be measured with the root mean squared error: \[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \] RMSE penalizes large errors more heavily than MAE (mean absolute error), because the differences between actual and predicted values are squared. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE() function from the MLmetrics package.
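For reference, the same metric can be computed by hand from the formula above; this small sketch should agree with the MLmetrics::RMSE() calls used below:
# manual RMSE, equivalent to MLmetrics::RMSE(y_pred, y_true)
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_pred - y_true)^2))
}
# example: rmse_manual(model_filter$fitted.values, df_filter$Selling_Price)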
#Root Mean Squared Error in model
RMSE(model_filter$fitted.values, df_filter$Selling_Price)
## [1] 1.561259
#Root Mean Squared Error in prediction test
RMSE(pred_test1, car_new$Selling_Price)
## [1] 2.237076
From the RMSE results above, the typical prediction error on the test data is about ±2.24 k USD of selling price.
Multicollinearity means that there is correlation between the independent variables (predictors). To check for multicollinearity, we can compute the variance inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
vif(model_filter)
## Year Present_Price Kms_Driven Fuel_Type Seller_Type
## 1.437801 2.436736 1.519105 1.394173 1.701732
## Transmission
## 1.357395
Since all of the VIF values are below 10, there is no serious multicollinearity among the predictors.
The second assumption in linear regression is that the residuals follow a normal distribution. We can check this using the Shapiro-Wilk normality test or a density plot.
# density plot
plot(density(model_filter$residuals))
The Shapiro-Wilk null hypothesis (residuals are normally distributed) is not rejected when the p-value > 0.05.
shapiro.test(model_filter$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_filter$residuals
## W = 0.92677, p-value = 0.00000000006591
The density plot looks roughly bell-shaped, but the Shapiro-Wilk p-value is far below 0.05, so the normality assumption is formally rejected. With around 300 observations the test is sensitive to even small departures from normality, which is why the formal test and the visual check can disagree.
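An additional visual check (not part of the original analysis) is a normal Q-Q plot of the residuals, which helps judge how serious the departure from normality is; points close to the reference line indicate approximate normality:
# supplementary normality check: Q-Q plot of the residuals
qqnorm(model_filter$residuals)
qqline(model_filter$residuals, col = "red")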
Heteroscedasticity means that the variance of the error terms is non-constant. We can identify non-constant error variance from the presence of a funnel shape in the residual plot (the same plot we use for the linearity check). We can also use the Breusch-Pagan test (bptest) to check for heteroscedasticity.
The null hypothesis of constant error variance is not rejected when the p-value > 0.05.
#bptest
bptest(model_filter)
##
## studentized Breusch-Pagan test
##
## data: model_filter
## BP = 111.24, df = 6, p-value < 0.00000000000000022
The bptest result above gives a p-value < 0.05, so the null hypothesis is rejected: the error variance is non-constant.
# plot fitted values vs. residuals
plot(model_filter$fitted.values,
     model_filter$residuals)
If we check the plot above, we can see a funnel shape, which indicates that heteroscedasticity is present.
#linearity check
data.frame(prediction=model_filter$fitted.values,
error=model_filter$residuals) %>%
ggplot(aes(prediction,error)) +
geom_hline(yintercept=0) +
geom_point() +
geom_smooth() +
  theme_bw()
The plot above shows a U-shaped pattern in the residuals, which indicates that the linearity assumption does not hold.
We have seen that our model does not satisfy some of the assumptions: linearity, homoscedasticity, and normality of the errors. Now we will try to fix them. To make the model more linear, we can transform some of the variables.
# transform the target and the predictors:
# - transform the numerical variables to log10
# - drop the outliers of selling price vs. year
df_filter2 <- df_data %>%
filter(Kms_Driven<150000) %>% filter(Selling_Price<25) %>%
select(-Owner) %>% select(-Car_Name) %>%
mutate_if(~is.numeric(.), ~log10(.))
df_filter2
Now let us build the model on the transformed data and check the summary.
model_filter2 <- lm(df_filter2$Selling_Price~.,data=df_filter2)
summary(model_filter2)
##
## Call:
## lm(formula = df_filter2$Selling_Price ~ ., data = df_filter2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.255218 -0.046566 -0.001306 0.049622 0.203454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -699.074555 31.293789 -22.339 < 0.0000000000000002 ***
## Year 211.645730 9.459180 22.375 < 0.0000000000000002 ***
## Present_Price 0.911397 0.020098 45.347 < 0.0000000000000002 ***
## Kms_Driven -0.066018 0.014726 -4.483 0.00001064 ***
## Fuel_TypePetrol -0.065190 0.013515 -4.823 0.00000229 ***
## Seller_TypeIndividual -0.095867 0.020521 -4.672 0.00000460 ***
## TransmissionManual 0.007116 0.014537 0.490 0.625
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0795 on 287 degrees of freedom
## Multiple R-squared: 0.9791, Adjusted R-squared: 0.9786
## F-statistic: 2238 on 6 and 287 DF, p-value: < 0.00000000000000022
After the log10 transformation, the adjusted R-squared increases to 97.86%, from 87.7% before the data were transformed.
Interpretation: all predictors except TransmissionManual are statistically significant in the transformed model. On the log10 scale, a higher present price and a newer production year are associated with a higher selling price, while more kilometers driven, petrol fuel, and individual sellers are associated with a lower selling price.
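Because both Selling_Price and Present_Price are on the log10 scale, the Present_Price coefficient can be read approximately as an elasticity (our own illustration, not part of the original write-up): holding the other predictors fixed, a 1% increase in Present_Price multiplies the predicted Selling_Price by about \[ 1.01^{0.911} \approx 1.009, \] i.e. roughly a 0.9% increase. Similarly, the Fuel_TypePetrol coefficient of -0.065 implies that a petrol car is predicted to sell for about \(10^{-0.065} \approx 0.86\) times the price of a comparable diesel car, roughly 14% less.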
Now we check our tuned model on the new data set, transformed in the same way.
car_new2 <- car_new %>% mutate_if(~is.numeric(.), ~log10(.)) %>%
select(-Owner) %>% select(-Car_Name)
pred_test2 <- predict(model_filter2,newdata = car_new2)
pred_test2
## 1 2 3 4 5 6
## 0.90476133 0.52816824 0.74147346 0.03715839 0.86780893 1.31154187
## 7 8 9 10 11 12
## 1.25781573 0.68407605 0.61274658 1.31052491 0.73116207 0.66102334
## 13 14 15 16 17 18
## 0.34902130 0.78002841 1.35684452 1.50929855 0.45993983 1.31154187
## 19 20 21 22 23 24
## 0.22772248 0.03237757 -0.01958501 -0.12988010 -0.24921166 -0.05587964
## 25 26 27 28 29 30
## -0.19169749 -0.20654174 -0.39562346 -0.30892476 -0.16085088 -0.40052792
## 31 32 33 34 35 36
## -0.31511515 -0.38491318 -0.27637958 -0.28187366 -0.44791928 -0.65184512
## 37 38 39 40 41 42
## -0.54258036 -0.61768574 -0.59792297 -0.78594567 0.48359305 0.88881794
## 43 44 45 46 47 48
## 1.02832435 0.48152883 0.73710809 0.69063170 0.56887411 0.36643210
## 49 50 51 52 53 54
## 0.67988738 1.04622915 0.41769409 0.62682657 0.98696580 0.97838623
## 55 56 57 58 59 60
## 0.63097148 0.96450705 0.55085697 1.01503151 0.62191950 1.09812206
Now we will measure the RMSE. Since our target is transformed with log10, we need to transform it back to get the original price value and get a meaningful comparison to our previous model.
# RMSE of the tuned model on the training data
RMSE(10^(model_filter2$fitted.values), 10^(df_filter2$Selling_Price))
## [1] 0.7981968
# RMSE of the prediction on the test data
RMSE(10^(pred_test2), 10^(car_new2$Selling_Price))
## [1] 0.8183601
Looking at the RMSE results above, the tuned model has an RMSE of 0.798 on the training data and 0.818 on the prediction data. The two values are close, which suggests the model generalizes well to the test data. The typical prediction error is about ±0.82 k USD of selling price, much better than the model before tuning.
vif(model_filter2)
## Year Present_Price Kms_Driven Fuel_Type Seller_Type
## 1.553042 5.127556 1.851066 1.327838 4.477903
## Transmission
## 1.106205
Since all of the VIF values are below 10, there is no serious multicollinearity among the predictors in the tuned model.
We can check the normality of the residuals of the tuned model using a density plot.
# density plot
plot(density(model_filter2$residuals))
The density plot above shows that the residuals look approximately normal.
Again, we check for heteroscedasticity by looking for a funnel shape in the plot of fitted values versus residuals.
# plot fitted values vs. residuals
plot(model_filter2$fitted.values,
     model_filter2$residuals)
If we check the plot above, there is no funnel shape, which indicates that heteroscedasticity is no longer present.
#linearity check
data.frame(prediction=model_filter2$fitted.values,
error=model_filter2$residuals) %>%
ggplot(aes(prediction,error)) +
geom_hline(yintercept=0) +
geom_point() +
geom_smooth() +
  theme_bw()
There is little to no discernible pattern in the residual plot above, so we can conclude that the linearity assumption now holds.
Now let us compare model_filter and model_filter2 side by side.
Model <- c("model_filter", "model_filter2")
Adj_R_Squared <- c(0.8796,0.9786)
RMSE_Model <- c(1.561259,0.7981968 )
RMSE_Pred <- c(2.237076,0.8183601)
df <- data.frame(Model,Adj_R_Squared,RMSE_Model,RMSE_Pred )
print(df)
## Model Adj_R_Squared RMSE_Model RMSE_Pred
## 1 model_filter 0.8796 1.5612590 2.2370760
## 2 model_filter2 0.9786 0.7981968 0.8183601
The variables that are useful for explaining the variance in car prices are Present_Price, Year, Kms_Driven, Fuel_Type, Seller_Type, and Transmission. Our final model satisfies the classical assumptions. The adjusted R-squared of the model is high: the predictors explain 97.86% of the variance in the car price. The accuracy of the model in predicting the car price is measured with RMSE; the model has an RMSE of 0.798 on the training data and 0.818 on the prediction data, suggesting that the model generalizes well to unseen data.
We have learned how to build a linear regression model and what needs to be considered when building one.