
Importing libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings('ignore')

Data

bmw = pd.read_csv("bmw.csv")
print(bmw.head(5).to_markdown())
|    | model    |   year |   price | transmission   |   mileage | fuelType   |   tax |   mpg |   engineSize |
|---:|:---------|-------:|--------:|:---------------|----------:|:-----------|------:|------:|-------------:|
|  0 | 5 Series |   2014 |   11200 | Automatic      |     67068 | Diesel     |   125 |  57.6 |          2   |
|  1 | 6 Series |   2018 |   27000 | Automatic      |     14827 | Petrol     |   145 |  42.8 |          2   |
|  2 | 5 Series |   2016 |   16000 | Automatic      |     62794 | Diesel     |   160 |  51.4 |          3   |
|  3 | 1 Series |   2017 |   12750 | Automatic      |     26676 | Diesel     |   145 |  72.4 |          1.5 |
|  4 | 7 Series |   2014 |   14500 | Automatic      |     39554 | Diesel     |   160 |  50.4 |          3   |
  • model : Model of the car
  • year : The year the car was made
  • price : Price (Pounds)
  • transmission : Type of gearbox
  • mileage : How many miles the car has been driven (1 mile = 1.609344 km)
  • fuelType : Fuel type
  • tax : Road tax (Pounds)
  • mpg : Miles per gallon (1 gallon = 3.78541178 liters; see the conversion sketch below)
  • engineSize : Engine size (liters)
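
A minimal sketch of the unit conversions quoted above; mileage_km and l_per_100km are purely illustrative names, and the liters-per-100 km formula assumes the US-gallon figure given in the list:

MILE_TO_KM = 1.609344
GALLON_TO_L = 3.78541178

mileage_km = bmw['mileage'] * MILE_TO_KM                      # miles -> kilometers
l_per_100km = 100 * GALLON_TO_L / (bmw['mpg'] * MILE_TO_KM)   # mpg -> liters per 100 km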

Data info

bmw.info()  # info() prints its report directly, so wrapping it in print() is unnecessary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         10781 non-null  object 
 1   year          10781 non-null  int64  
 2   price         10781 non-null  int64  
 3   transmission  10781 non-null  object 
 4   mileage       10781 non-null  int64  
 5   fuelType      10781 non-null  object 
 6   tax           10781 non-null  int64  
 7   mpg           10781 non-null  float64
 8   engineSize    10781 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB

Exploratory data analysis

Descriptive statistics

print(bmw.describe().to_markdown())
|       |      year |   price |   mileage |      tax |     mpg |   engineSize |
|:------|----------:|--------:|----------:|---------:|--------:|-------------:|
| count | 10781     | 10781   |   10781   | 10781    |  10781  |     10781    |
| mean  | 2017.08   | 22733.4 |   25497   | 131.702  |  56.399 |   2.16777    |
| std   | 2.34904   | 11415.5 |   25143.2 | 61.5108  |  31.337 |   0.552054   |
| min   | 1996      | 1200    |   1       | 0        |  5.5    |   0          |
| 25%   | 2016      | 14950   |   5529    | 135      |  45.6   |   2          |
| 50%   | 2017      | 20462   |   18347   | 145      |  53.3   |   2          |
| 75%   | 2019      | 27940   |   38206   | 145      |  62.8   |   2          |
| max   | 2020      | 123456  |   214000  | 580      |  470.8  |   6.6        |

Correlation

plt.figure(figsize = (6,5)) 
# Restrict corr() to the numeric columns so the object columns don't raise an error
sns.heatmap(bmw.select_dtypes(include = 'number').corr(), annot = True, cmap = "coolwarm")
plt.show()

Year and price, and engineSize and price, are strongly positively correlated; year and mileage, and price and mileage, are strongly negatively correlated.
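
To attach numbers to this, a quick check printing each numeric feature's correlation with price:

corr_price = bmw.select_dtypes(include = 'number').corr()['price'].sort_values()
print(corr_price.to_markdown())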

Model

plt.figure(figsize = (6,5)) 
sns.countplot(y = bmw["model"], order = bmw['model'].value_counts().index)
plt.show()

The dataset contains 24 BMW models. The top three (3 Series, 1 Series and 2 Series) make up 52.32% of the cars, with all other models contributing the remaining 47.68%.
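
A minimal sketch of how that share can be computed:

share = bmw['model'].value_counts(normalize = True)
print(share.head(3).sum())   # ~0.5232 for the 3 Series, 1 Series and 2 Series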

Year

plt.figure(figsize = (6,5)) 
sns.countplot(y = bmw["year"])
plt.show()

2019 has by far the most cars, at roughly 3,500.
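
The exact counts behind the plot can be checked directly:

print(bmw['year'].value_counts().head())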

Price

  • Price distribution
plt.figure(figsize = (6,5)) 
# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the modern equivalent
sns.histplot(bmw['price'], kde = True)
plt.show()

  • Price by year
plt.figure(figsize = (6,5), facecolor = 'w')
sns.barplot(x = bmw["year"], y = bmw["price"])
plt.show()

Recently manufactured cars (2018, 2019, 2020) sell for a higher average price than cars manufactured earlier.
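
The averages behind the bar plot, as a quick check:

print(bmw.groupby('year')['price'].mean().round(0).tail())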

Transmission

plt.figure(figsize = (6,5)) 
sns.barplot(x = bmw["transmission"], y = bmw["price"])
plt.show()

Mileage

  • Mileage distribution
plt.figure(figsize = (6,5)) 
# distplot is deprecated; histplot is the modern replacement
sns.histplot(bmw['mileage'], kde = True)
plt.show()

plt.figure(figsize = (12,6)) 
# recent seaborn requires keyword arguments for x, y and hue
sns.scatterplot(x = bmw["mileage"], y = bmw["price"], hue = bmw["year"])
plt.show()

Fuel type

plt.figure(figsize = (6,5)) 
sns.countplot(bmw["fuelType"])
plt.show()

Engine Size

# Count cars per engine size; 'tax' is just an arbitrary column to count on
engine = bmw.groupby('engineSize').count()[['tax']].sort_values(by = 'tax').reset_index()
engine = engine.rename(columns = {'tax': 'count'})
plt.figure(figsize = (10,5))
sns.barplot(x = engine['engineSize'], y = engine['count'], color = 'royalblue')
plt.show()

Pre-processing for modeling

Categorical variables

We convert the object-type columns to the category dtype and replace them with their integer category codes, so that the linear regression model can include these features as well.

# Replace each object column with its integer category codes
for old, new in [('model', 'Model'), ('year', 'Year'),
                 ('transmission', 'Transmission'), ('fuelType', 'Fuel')]:
    bmw[new] = bmw[old].astype('category').cat.codes
    bmw.drop(old, axis = 1, inplace = True)

Cleaned data

print(bmw.head(10).to_markdown())
|    |   price |   mileage |   tax |   mpg |   engineSize |   Model |   Year |   Transmission |   Fuel |
|---:|--------:|----------:|------:|------:|-------------:|--------:|-------:|---------------:|-------:|
|  0 |   11200 |     67068 |   125 |  57.6 |          2   |       4 |     18 |              0 |      0 |
|  1 |   27000 |     14827 |   145 |  42.8 |          2   |       5 |     22 |              0 |      4 |
|  2 |   16000 |     62794 |   160 |  51.4 |          3   |       4 |     20 |              0 |      0 |
|  3 |   12750 |     26676 |   145 |  72.4 |          1.5 |       0 |     21 |              0 |      0 |
|  4 |   14500 |     39554 |   160 |  50.4 |          3   |       6 |     18 |              0 |      0 |
|  5 |   14900 |     35309 |   125 |  60.1 |          2   |       4 |     20 |              0 |      0 |
|  6 |   16000 |     38538 |   125 |  60.1 |          2   |       4 |     21 |              0 |      0 |
|  7 |   16250 |     10401 |   145 |  52.3 |          1.5 |       1 |     22 |              1 |      4 |
|  8 |   14250 |     42668 |    30 |  62.8 |          2   |       3 |     21 |              1 |      0 |
|  9 |   14250 |     36099 |    20 |  68.9 |          2   |       4 |     20 |              0 |      0 |

Scale data

scaler = MinMaxScaler()   # imported above
var = ['price', 'mileage', 'tax', 'mpg', 'engineSize', 'Model', 'Year', 'Transmission', 'Fuel']
bmw[var] = scaler.fit_transform(bmw[var])
print(bmw.head(10).to_markdown())
|    |     price |   mileage |       tax |       mpg |   engineSize |     Model |     Year |   Transmission |   Fuel |
|---:|----------:|----------:|----------:|----------:|-------------:|----------:|---------:|---------------:|-------:|
|  0 | 0.0817956 | 0.313399  | 0.215517  | 0.111971  |     0.30303  | 0.173913  | 0.75     |            0   |      0 |
|  1 | 0.211033  | 0.0692807 | 0.25      | 0.0801633 |     0.30303  | 0.217391  | 0.916667 |            0   |      1 |
|  2 | 0.121057  | 0.293427  | 0.275862  | 0.098646  |     0.454545 | 0.173913  | 0.833333 |            0   |      0 |
|  3 | 0.0944739 | 0.12465   | 0.25      | 0.143778  |     0.227273 | 0         | 0.875    |            0   |      0 |
|  4 | 0.108788  | 0.184828  | 0.275862  | 0.0964969 |     0.454545 | 0.26087   | 0.75     |            0   |      0 |
|  5 | 0.11206   | 0.164991  | 0.215517  | 0.117344  |     0.30303  | 0.173913  | 0.833333 |            0   |      0 |
|  6 | 0.121057  | 0.18008   | 0.215517  | 0.117344  |     0.30303  | 0.173913  | 0.875    |            0   |      0 |
|  7 | 0.123102  | 0.0485984 | 0.25      | 0.10058   |     0.227273 | 0.0434783 | 0.916667 |            0.5 |      1 |
|  8 | 0.106743  | 0.199379  | 0.0517241 | 0.123146  |     0.30303  | 0.130435  | 0.875    |            0.5 |      0 |
|  9 | 0.106743  | 0.168683  | 0.0344828 | 0.136256  |     0.30303  | 0.173913  | 0.833333 |            0   |      0 |
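
As a sanity check, MinMaxScaler maps each value x to (x - min) / (max - min); plugging in the price minimum and maximum from the descriptive statistics above reproduces row 0's scaled price:

price_min, price_max = 1200, 123456   # from the describe() table
print((11200 - price_min) / (price_max - price_min))   # ~0.0818, matching row 0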

Split data

y = bmw.pop("price")   # pop removes the target column, so bmw now holds only the features
X_train, X_test, y_train, y_test = train_test_split(bmw, y, test_size = 0.3, random_state = 100)

Modeling

X_tr = sm.add_constant(X_train)   # add an intercept column
model1 = sm.OLS(endog = y_train, exog = X_tr)
results = model1.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.741
Method:                 Least Squares   F-statistic:                     2699.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:14   Log-Likelihood:                 12364.
No. Observations:                7546   AIC:                        -2.471e+04
Df Residuals:                    7537   BIC:                        -2.465e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.3050      0.009    -32.332      0.000      -0.324      -0.287
mileage         -0.2057      0.008    -26.268      0.000      -0.221      -0.190
tax             -0.0574      0.006     -8.972      0.000      -0.070      -0.045
mpg              0.0100      0.009      1.097      0.273      -0.008       0.028
engineSize       0.4646      0.008     59.045      0.000       0.449       0.480
Model            0.1145      0.002     52.222      0.000       0.110       0.119
Year             0.3771      0.009     41.715      0.000       0.359       0.395
Transmission     0.0061      0.001      4.685      0.000       0.004       0.009
Fuel             0.0098      0.001      7.621      0.000       0.007       0.012
==============================================================================
Omnibus:                     4879.884   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           174705.726
Skew:                           2.572   Prob(JB):                         0.00
Kurtosis:                      26.004   Cond. No.                         42.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The p-value for ‘mpg’ is above 5%, so the feature is statistically insignificant; we drop ‘mpg’ and rebuild the model.
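
A small sketch of how the insignificant features can be picked out programmatically from the fitted results:

pvals = results.pvalues.drop('const')   # p-values per feature, excluding the intercept
print(pvals[pvals > 0.05])              # only 'mpg' exceeds the 5% threshold here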

Rebuilding the model

X_train.drop('mpg', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model2 = sm.OLS(endog = y_train, exog = X_tr)
results = model2.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.741
Method:                 Least Squares   F-statistic:                     3084.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:14   Log-Likelihood:                 12364.
No. Observations:                7546   AIC:                        -2.471e+04
Df Residuals:                    7538   BIC:                        -2.466e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.3024      0.009    -33.134      0.000      -0.320      -0.285
mileage         -0.2056      0.008    -26.259      0.000      -0.221      -0.190
tax             -0.0588      0.006     -9.363      0.000      -0.071      -0.046
engineSize       0.4618      0.007     62.065      0.000       0.447       0.476
Model            0.1149      0.002     53.037      0.000       0.111       0.119
Year             0.3768      0.009     41.700      0.000       0.359       0.395
Transmission     0.0060      0.001      4.600      0.000       0.003       0.009
Fuel             0.0097      0.001      7.549      0.000       0.007       0.012
==============================================================================
Omnibus:                     4879.024   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           174641.447
Skew:                           2.571   Prob(JB):                         0.00
Kurtosis:                      26.000   Cond. No.                         41.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Every feature is significant.

Tests of regression

Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|    | Features     |   VIF |
|---:|:-------------|------:|
|  2 | engineSize   | 20.97 |
|  4 | Year         | 16.26 |
|  1 | tax          |  8.19 |
|  5 | Transmission |  2.83 |
|  0 | mileage      |  2.08 |
|  3 | Model        |  1.97 |
|  6 | Fuel         |  1.85 |

A VIF above 10 indicates high collinearity, so we first drop ‘engineSize’.
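
For intuition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. A sketch reproducing engineSize's value the way variance_inflation_factor computes it (note that, as called above, the design matrix contains no constant):

aux = sm.OLS(X_train['engineSize'], X_train.drop('engineSize', axis = 1)).fit()
print(1 / (1 - aux.rsquared))   # should be close to 20.97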

X_train.drop('engineSize', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model3 = sm.OLS(endog = y_train, exog = X_tr)
results = model3.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.609
Method:                 Least Squares   F-statistic:                     1956.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:15   Log-Likelihood:                 10806.
No. Observations:                7546   AIC:                        -2.160e+04
Df Residuals:                    7539   BIC:                        -2.155e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.2023      0.011    -18.317      0.000      -0.224      -0.181
mileage         -0.1801      0.010    -18.745      0.000      -0.199      -0.161
tax              0.1102      0.007     15.844      0.000       0.097       0.124
Model            0.1339      0.003     50.778      0.000       0.129       0.139
Year             0.3865      0.011     34.803      0.000       0.365       0.408
Transmission     0.0091      0.002      5.680      0.000       0.006       0.012
Fuel            -0.0040      0.002     -2.564      0.010      -0.007      -0.001
==============================================================================
Omnibus:                     4275.612   Durbin-Watson:                   1.985
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            77309.663
Skew:                           2.338   Prob(JB):                         0.00
Kurtosis:                      17.967   Cond. No.                         40.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

All the features are significant; now we will look at their VIFs.

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|    | Features     |   VIF |
|---:|:-------------|------:|
|  3 | Year         |  8.37 |
|  1 | tax          |  6.31 |
|  4 | Transmission |  2.83 |
|  2 | Model        |  1.93 |
|  0 | mileage      |  1.85 |
|  5 | Fuel         |  1.8  |

All the features now have a VIF below 10, so model3 will be our final model.
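
The final model is never scored on the held-out split above; a minimal sketch of doing so, assuming X_test gets the same column drops that were applied to X_train:

from sklearn.metrics import r2_score, mean_squared_error

X_te = sm.add_constant(X_test.drop(['mpg', 'engineSize'], axis = 1))
y_pred = results.predict(X_te)
print('Test R^2 :', r2_score(y_test, y_pred))
print('Test RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)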

Residual analysis

residuals = results.resid
plt.figure(figsize = (6, 5))
plt.hist(residuals, bins = 30)
plt.show()

The residuals are centred near zero but show a long right tail, consistent with the high skew and kurtosis reported in the model summary.
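
To probe the normality of the residuals further, a Q-Q plot is a common companion to the histogram; a minimal sketch using statsmodels:

sm.qqplot(residuals, line = '45')   # compare residual quantiles to the 45-degree line
plt.show()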