|   | model | year | price | transmission | mileage | fuelType | tax | mpg | engineSize |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 Series | 2014 | 11200 | Automatic | 67068 | Diesel | 125 | 57.6 | 2 |
| 1 | 6 Series | 2018 | 27000 | Automatic | 14827 | Petrol | 145 | 42.8 | 2 |
| 2 | 5 Series | 2016 | 16000 | Automatic | 62794 | Diesel | 160 | 51.4 | 3 |
| 3 | 1 Series | 2017 | 12750 | Automatic | 26676 | Diesel | 145 | 72.4 | 1.5 |
| 4 | 7 Series | 2014 | 14500 | Automatic | 39554 | Diesel | 160 | 50.4 | 3 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 10781 non-null object
1 year 10781 non-null int64
2 price 10781 non-null int64
3 transmission 10781 non-null object
4 mileage 10781 non-null int64
5 fuelType 10781 non-null object
6 tax 10781 non-null int64
7 mpg 10781 non-null float64
8 engineSize 10781 non-null float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB
|   | year | price | mileage | tax | mpg | engineSize |
|---|---|---|---|---|---|---|
| count | 10781 | 10781 | 10781 | 10781 | 10781 | 10781 |
| mean | 2017.08 | 22733.4 | 25497 | 131.702 | 56.399 | 2.16777 |
| std | 2.34904 | 11415.5 | 25143.2 | 61.5108 | 31.337 | 0.552054 |
| min | 1996 | 1200 | 1 | 0 | 5.5 | 0 |
| 25% | 2016 | 14950 | 5529 | 135 | 45.6 | 2 |
| 50% | 2017 | 20462 | 18347 | 145 | 53.3 | 2 |
| 75% | 2019 | 27940 | 38206 | 145 | 62.8 | 2 |
| max | 2020 | 123456 | 214000 | 580 | 470.8 | 6.6 |
Price correlates strongly positively with year and with engine size, and strongly negatively with mileage; year and mileage are also strongly negatively correlated.
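The correlation matrix behind this observation is not shown in the excerpt; a minimal sketch of how it could be visualised (assuming the DataFrame is named bmw, as elsewhere in this analysis):
import seaborn as sns
import matplotlib.pyplot as plt

# Heatmap of pairwise correlations over the numeric columns
# (numeric_only requires pandas >= 1.5; older versions skip non-numeric columns by default)
plt.figure(figsize=(6, 5))
sns.heatmap(bmw.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.show()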
plt.figure(figsize=(6, 5))
sns.countplot(y=bmw["model"], order=bmw["model"].value_counts().index)
plt.show()
The dataset contains 24 distinct BMW models. The top three (3 Series, 1 Series, 2 Series) account for 52.32% of the listings, with all the other models contributing the remaining 47.68%.
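The percentages quoted above can be reproduced from the value counts; a quick sketch:
# Share of the top three models versus the rest
share = bmw['model'].value_counts(normalize=True)
top3 = share.head(3).sum() * 100
print(f"Top 3 models: {top3:.2f}%, others: {100 - top3:.2f}%")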
Grouped by year, 2019 has the highest number of cars, at roughly 3,500 listings.
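The per-year counts presumably come from a countplot like the one above; a minimal sketch, assuming the same figure settings:
plt.figure(figsize=(6, 5))
sns.countplot(x=bmw["year"])
plt.show()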
plt.figure(figsize=(6, 5), facecolor='w')
sns.barplot(x=bmw["year"], y=bmw["price"])
plt.show()
Recently manufactured cars (2018, 2019 and 2020) sell for a higher average price than cars manufactured earlier.
We will convert the object-type columns to category codes so that the linear regression model can include those features as well.
# Model col
bmw['model'] = bmw['model'].astype('category')
bmw['Model'] = bmw['model'].cat.codes
bmw.drop('model', axis=1, inplace = True)
# Year col
bmw['year'] = bmw['year'].astype('category')
bmw['Year'] = bmw['year'].cat.codes
bmw.drop('year', axis = 1, inplace = True)
# Transmission col
bmw['transmission'] = bmw['transmission'].astype('category')
bmw['Transmission'] = bmw['transmission'].cat.codes
bmw.drop('transmission', axis = 1, inplace = True)
# FuelType col
bmw['fuelType'] = bmw['fuelType'].astype('category')
bmw['Fuel'] = bmw['fuelType'].cat.codes
bmw.drop('fuelType', axis=1, inplace=True)
|   | price | mileage | tax | mpg | engineSize | Model | Year | Transmission | Fuel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 11200 | 67068 | 125 | 57.6 | 2 | 4 | 18 | 0 | 0 |
| 1 | 27000 | 14827 | 145 | 42.8 | 2 | 5 | 22 | 0 | 4 |
| 2 | 16000 | 62794 | 160 | 51.4 | 3 | 4 | 20 | 0 | 0 |
| 3 | 12750 | 26676 | 145 | 72.4 | 1.5 | 0 | 21 | 0 | 0 |
| 4 | 14500 | 39554 | 160 | 50.4 | 3 | 6 | 18 | 0 | 0 |
| 5 | 14900 | 35309 | 125 | 60.1 | 2 | 4 | 20 | 0 | 0 |
| 6 | 16000 | 38538 | 125 | 60.1 | 2 | 4 | 21 | 0 | 0 |
| 7 | 16250 | 10401 | 145 | 52.3 | 1.5 | 1 | 22 | 1 | 4 |
| 8 | 14250 | 42668 | 30 | 62.8 | 2 | 3 | 21 | 1 | 0 |
| 9 | 14250 | 36099 | 20 | 68.9 | 2 | 4 | 20 | 0 | 0 |
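As a side note, the four nearly identical blocks above can be written as a single loop that produces the same category codes (same column names assumed):
# Loop form of the categorical encoding above
for col, new in [('model', 'Model'), ('year', 'Year'),
                 ('transmission', 'Transmission'), ('fuelType', 'Fuel')]:
    bmw[new] = bmw[col].astype('category').cat.codes
    bmw.drop(col, axis=1, inplace=True)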
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
var = ['price', 'mileage', 'tax', 'mpg', 'engineSize', 'Model', 'Year', 'Transmission', 'Fuel']
bmw[var] = scaler.fit_transform(bmw[var])
print(bmw.head(10).to_markdown())
|   | price | mileage | tax | mpg | engineSize | Model | Year | Transmission | Fuel |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0817956 | 0.313399 | 0.215517 | 0.111971 | 0.30303 | 0.173913 | 0.75 | 0 | 0 |
| 1 | 0.211033 | 0.0692807 | 0.25 | 0.0801633 | 0.30303 | 0.217391 | 0.916667 | 0 | 1 |
| 2 | 0.121057 | 0.293427 | 0.275862 | 0.098646 | 0.454545 | 0.173913 | 0.833333 | 0 | 0 |
| 3 | 0.0944739 | 0.12465 | 0.25 | 0.143778 | 0.227273 | 0 | 0.875 | 0 | 0 |
| 4 | 0.108788 | 0.184828 | 0.275862 | 0.0964969 | 0.454545 | 0.26087 | 0.75 | 0 | 0 |
| 5 | 0.11206 | 0.164991 | 0.215517 | 0.117344 | 0.30303 | 0.173913 | 0.833333 | 0 | 0 |
| 6 | 0.121057 | 0.18008 | 0.215517 | 0.117344 | 0.30303 | 0.173913 | 0.875 | 0 | 0 |
| 7 | 0.123102 | 0.0485984 | 0.25 | 0.10058 | 0.227273 | 0.0434783 | 0.916667 | 0.5 | 1 |
| 8 | 0.106743 | 0.199379 | 0.0517241 | 0.123146 | 0.30303 | 0.130435 | 0.875 | 0.5 | 0 |
| 9 | 0.106743 | 0.168683 | 0.0344828 | 0.136256 | 0.30303 | 0.173913 | 0.833333 | 0 | 0 |
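The X_train and y_train used below are not defined in this excerpt; a plausible reconstruction (a test_size of 0.3 reproduces the 7,546 training observations reported in the OLS summaries, but the random_state is an assumption):
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# price is the target; everything else is a predictor
X = bmw.drop('price', axis=1)
y = bmw['price']

# 70/30 split -> 7546 of 10781 rows for training, matching the OLS output below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Note that fitting the MinMaxScaler on the full dataset before splitting leaks test-set information into training; fitting the scaler on the training split alone would be the cleaner choice.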
X_tr = sm.add_constant(X_train)
model1 = sm.OLS(endog = y_train, exog = X_tr)
results = model1.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.741
Model: OLS Adj. R-squared: 0.741
Method: Least Squares F-statistic: 2699.
Date: Wed, 19 May 2021 Prob (F-statistic): 0.00
Time: 14:37:14 Log-Likelihood: 12364.
No. Observations: 7546 AIC: -2.471e+04
Df Residuals: 7537 BIC: -2.465e+04
Df Model: 8
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -0.3050 0.009 -32.332 0.000 -0.324 -0.287
mileage -0.2057 0.008 -26.268 0.000 -0.221 -0.190
tax -0.0574 0.006 -8.972 0.000 -0.070 -0.045
mpg 0.0100 0.009 1.097 0.273 -0.008 0.028
engineSize 0.4646 0.008 59.045 0.000 0.449 0.480
Model 0.1145 0.002 52.222 0.000 0.110 0.119
Year 0.3771 0.009 41.715 0.000 0.359 0.395
Transmission 0.0061 0.001 4.685 0.000 0.004 0.009
Fuel 0.0098 0.001 7.621 0.000 0.007 0.012
==============================================================================
Omnibus: 4879.884 Durbin-Watson: 1.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174705.726
Skew: 2.572 Prob(JB): 0.00
Kurtosis: 26.004 Cond. No. 42.4
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The 'mpg' p-value is above 5%, which means it is insignificant, so we will drop 'mpg' from the dataset and rebuild the model.
X_train.drop('mpg', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model2 = sm.OLS(endog = y_train, exog = X_tr)
results = model2.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.741
Model: OLS Adj. R-squared: 0.741
Method: Least Squares F-statistic: 3084.
Date: Wed, 19 May 2021 Prob (F-statistic): 0.00
Time: 14:37:14 Log-Likelihood: 12364.
No. Observations: 7546 AIC: -2.471e+04
Df Residuals: 7538 BIC: -2.466e+04
Df Model: 7
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -0.3024 0.009 -33.134 0.000 -0.320 -0.285
mileage -0.2056 0.008 -26.259 0.000 -0.221 -0.190
tax -0.0588 0.006 -9.363 0.000 -0.071 -0.046
engineSize 0.4618 0.007 62.065 0.000 0.447 0.476
Model 0.1149 0.002 53.037 0.000 0.111 0.119
Year 0.3768 0.009 41.700 0.000 0.359 0.395
Transmission 0.0060 0.001 4.600 0.000 0.003 0.009
Fuel 0.0097 0.001 7.549 0.000 0.007 0.012
==============================================================================
Omnibus: 4879.024 Durbin-Watson: 1.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174641.447
Skew: 2.571 Prob(JB): 0.00
Kurtosis: 26.000 Cond. No. 41.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Every feature is significant.
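This can be confirmed programmatically from the fitted results:
# Any remaining predictors with a p-value above 5%? Expect an empty Series.
print(results.pvalues[results.pvalues > 0.05])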
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|   | Features | VIF |
|---|---|---|
| 2 | engineSize | 20.97 |
| 4 | Year | 16.26 |
| 1 | tax | 8.19 |
| 5 | Transmission | 2.83 |
| 0 | mileage | 2.08 |
| 3 | Model | 1.97 |
| 6 | Fuel | 1.85 |
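For reference, the VIF of feature $i$ is $\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$, where $R_i^2$ is the R-squared from regressing feature $i$ on all the other features, so larger values mean the feature is better explained by the rest.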
A VIF above 10 indicates high collinearity, so we will first drop 'engineSize'.
X_train.drop('engineSize', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model3 = sm.OLS(endog = y_train, exog = X_tr)
results = model3.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.609
Model: OLS Adj. R-squared: 0.609
Method: Least Squares F-statistic: 1956.
Date: Wed, 19 May 2021 Prob (F-statistic): 0.00
Time: 14:37:15 Log-Likelihood: 10806.
No. Observations: 7546 AIC: -2.160e+04
Df Residuals: 7539 BIC: -2.155e+04
Df Model: 6
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -0.2023 0.011 -18.317 0.000 -0.224 -0.181
mileage -0.1801 0.010 -18.745 0.000 -0.199 -0.161
tax 0.1102 0.007 15.844 0.000 0.097 0.124
Model 0.1339 0.003 50.778 0.000 0.129 0.139
Year 0.3865 0.011 34.803 0.000 0.365 0.408
Transmission 0.0091 0.002 5.680 0.000 0.006 0.012
Fuel -0.0040 0.002 -2.564 0.010 -0.007 -0.001
==============================================================================
Omnibus: 4275.612 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 77309.663
Skew: 2.338 Prob(JB): 0.00
Kurtosis: 17.967 Cond. No. 40.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
All the features are significant; now we will look at their VIFs again.
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|    | Features     |   VIF |
|---:|:-------------|------:|
| 3 | Year | 8.37 |
| 1 | tax | 6.31 |
| 4 | Transmission | 2.83 |
| 2 | Model | 1.93 |
| 0 | mileage | 1.85 |
| 5 | Fuel | 1.8 |
All the features have VIF below 10, so model3 will be our final model.
[Raw return value of a 30-bin plt.hist call: 7,546 counts with bin edges spanning roughly -0.21 to 0.91, peaked near zero with a long right tail.]
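Since the 7,546 values match the training-set size, this is most likely a histogram of the training residuals of model3. A minimal sketch of that check, plus an evaluation on the held-out test set (assuming X_test and y_test from the split sketched earlier, with 'mpg' and 'engineSize' dropped to match the final model):
# Histogram of the final model's training residuals
plt.hist(results.resid, bins=30)
plt.xlabel('Residual')
plt.ylabel('Count')
plt.show()

# Score model3 on the held-out test data
from sklearn.metrics import r2_score

X_te = sm.add_constant(X_test.drop(['mpg', 'engineSize'], axis=1))
y_pred = results.predict(X_te)
print('Test R-squared:', r2_score(y_test, y_pred))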