
Importing libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

warnings.filterwarnings('ignore')

Data

bmw = pd.read_csv("bmw.csv")
print(bmw.head(5).to_markdown())
|    | model    |   year |   price | transmission   |   mileage | fuelType   |   tax |   mpg |   engineSize |
|---:|:---------|-------:|--------:|:---------------|----------:|:-----------|------:|------:|-------------:|
|  0 | 5 Series |   2014 |   11200 | Automatic      |     67068 | Diesel     |   125 |  57.6 |          2   |
|  1 | 6 Series |   2018 |   27000 | Automatic      |     14827 | Petrol     |   145 |  42.8 |          2   |
|  2 | 5 Series |   2016 |   16000 | Automatic      |     62794 | Diesel     |   160 |  51.4 |          3   |
|  3 | 1 Series |   2017 |   12750 | Automatic      |     26676 | Diesel     |   145 |  72.4 |          1.5 |
|  4 | 7 Series |   2014 |   14500 | Automatic      |     39554 | Diesel     |   160 |  50.4 |          3   |
  • model : Model of the car
  • year : The year the car was made
  • price : Price (Pounds)
  • transmission : Type of gearbox
  • mileage : How many miles the car has been driven (1 mile = 1.609344 km)
  • fuelType : Fuel type
  • tax : Road tax (Pounds)
  • mpg : Miles per gallon (1 gallon = 3.78541178 liters; see the conversion sketch below)
  • engineSize : Engine size (liters)
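
A minimal sketch of the unit conversions quoted above; mileage_km and l_per_100km are purely illustrative names, and the liters-per-100 km formula assumes the US-gallon figure given in the list:

MILE_TO_KM = 1.609344
GALLON_TO_L = 3.78541178

mileage_km = bmw['mileage'] * MILE_TO_KM                      # miles -> kilometers
l_per_100km = 100 * GALLON_TO_L / (bmw['mpg'] * MILE_TO_KM)   # mpg -> liters per 100 km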

Data info

bmw.info()  # info() prints its report directly, so wrapping it in print() is unnecessary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   model         10781 non-null  object 
 1   year          10781 non-null  int64  
 2   price         10781 non-null  int64  
 3   transmission  10781 non-null  object 
 4   mileage       10781 non-null  int64  
 5   fuelType      10781 non-null  object 
 6   tax           10781 non-null  int64  
 7   mpg           10781 non-null  float64
 8   engineSize    10781 non-null  float64
dtypes: float64(2), int64(4), object(3)
memory usage: 758.2+ KB

Exploratory data analysis

Descriptive statistics

print(bmw.describe().to_markdown())
|       |      year |   price |   mileage |      tax |     mpg |   engineSize |
|:------|----------:|--------:|----------:|---------:|--------:|-------------:|
| count | 10781     | 10781   |   10781   | 10781    |  10781  |     10781    |
| mean  | 2017.08   | 22733.4 |   25497   | 131.702  |  56.399 |   2.16777    |
| std   | 2.34904   | 11415.5 |   25143.2 | 61.5108  |  31.337 |   0.552054   |
| min   | 1996      | 1200    |   1       | 0        |  5.5    |   0          |
| 25%   | 2016      | 14950   |   5529    | 135      |  45.6   |   2          |
| 50%   | 2017      | 20462   |   18347   | 145      |  53.3   |   2          |
| 75%   | 2019      | 27940   |   38206   | 145      |  62.8   |   2          |
| max   | 2020      | 123456  |   214000  | 580      |  470.8  |   6.6        |

Correlation

plt.figure(figsize = (6,5)) 
# Restrict corr() to the numeric columns so the object columns don't raise an error
sns.heatmap(bmw.select_dtypes(include = 'number').corr(), annot = True, cmap = "coolwarm")
plt.show()

Year and price, and engineSize and price, are strongly positively correlated; year and mileage, and price and mileage, are strongly negatively correlated.
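
To attach numbers to this, a quick check printing each numeric feature's correlation with price:

corr_price = bmw.select_dtypes(include = 'number').corr()['price'].sort_values()
print(corr_price.to_markdown())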

Model

plt.figure(figsize = (6,5)) 
sns.countplot(y = bmw["model"], order = bmw['model'].value_counts().index)
plt.show()

The dataset contains 24 BMW models. The top three (3 Series, 1 Series and 2 Series) make up 52.32% of the cars, with all other models contributing the remaining 47.68%.
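
A minimal sketch of how that share can be computed:

share = bmw['model'].value_counts(normalize = True)
print(share.head(3).sum())   # ~0.5232 for the 3 Series, 1 Series and 2 Series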

Year

plt.figure(figsize = (6,5)) 
sns.countplot(y = bmw["year"])
plt.show()

2019 has by far the most cars, at roughly 3,500.
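
The exact counts behind the plot can be checked directly:

print(bmw['year'].value_counts().head())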

Price

  • Price distribution
plt.figure(figsize = (6,5)) 
# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the modern equivalent
sns.histplot(bmw['price'], kde = True)
plt.show()

  • Price by year
plt.figure(figsize = (6,5), facecolor = 'w')
sns.barplot(x = bmw["year"], y = bmw["price"])
plt.show()

Recently manufactured cars (2018, 2019, 2020) sell for a higher average price than cars manufactured earlier.
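
The averages behind the bar plot, as a quick check:

print(bmw.groupby('year')['price'].mean().round(0).tail())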

Transmission

plt.figure(figsize = (6,5)) 
sns.barplot(x = bmw["transmission"], y = bmw["price"])
plt.show()

Mileage

  • Mileage distribution
plt.figure(figsize = (6,5)) 
# distplot is deprecated; histplot is the modern replacement
sns.histplot(bmw['mileage'], kde = True)
plt.show()

plt.figure(figsize = (12,6)) 
# recent seaborn requires keyword arguments for x, y and hue
sns.scatterplot(x = bmw["mileage"], y = bmw["price"], hue = bmw["year"])
plt.show()

Fuel type

plt.figure(figsize = (6,5)) 
sns.countplot(bmw["fuelType"])
plt.show()

Engine Size

# Count cars per engine size; 'tax' is just an arbitrary column to count on
engine = bmw.groupby('engineSize').count()[['tax']].sort_values(by = 'tax').reset_index()
engine = engine.rename(columns = {'tax': 'count'})
plt.figure(figsize = (10,5))
sns.barplot(x = engine['engineSize'], y = engine['count'], color = 'royalblue')
plt.show()

Pre-processing for modeling

Categorical variables

We convert the object-type columns to the category dtype and replace them with their integer category codes, so that the linear regression model can include these features as well.

# Replace each object column with its integer category codes
for old, new in [('model', 'Model'), ('year', 'Year'),
                 ('transmission', 'Transmission'), ('fuelType', 'Fuel')]:
    bmw[new] = bmw[old].astype('category').cat.codes
    bmw.drop(old, axis = 1, inplace = True)

Cleaned data

print(bmw.head(10).to_markdown())
|    |   price |   mileage |   tax |   mpg |   engineSize |   Model |   Year |   Transmission |   Fuel |
|---:|--------:|----------:|------:|------:|-------------:|--------:|-------:|---------------:|-------:|
|  0 |   11200 |     67068 |   125 |  57.6 |          2   |       4 |     18 |              0 |      0 |
|  1 |   27000 |     14827 |   145 |  42.8 |          2   |       5 |     22 |              0 |      4 |
|  2 |   16000 |     62794 |   160 |  51.4 |          3   |       4 |     20 |              0 |      0 |
|  3 |   12750 |     26676 |   145 |  72.4 |          1.5 |       0 |     21 |              0 |      0 |
|  4 |   14500 |     39554 |   160 |  50.4 |          3   |       6 |     18 |              0 |      0 |
|  5 |   14900 |     35309 |   125 |  60.1 |          2   |       4 |     20 |              0 |      0 |
|  6 |   16000 |     38538 |   125 |  60.1 |          2   |       4 |     21 |              0 |      0 |
|  7 |   16250 |     10401 |   145 |  52.3 |          1.5 |       1 |     22 |              1 |      4 |
|  8 |   14250 |     42668 |    30 |  62.8 |          2   |       3 |     21 |              1 |      0 |
|  9 |   14250 |     36099 |    20 |  68.9 |          2   |       4 |     20 |              0 |      0 |

Scale data

scaler = MinMaxScaler()   # imported above
var = ['price', 'mileage', 'tax', 'mpg', 'engineSize', 'Model', 'Year', 'Transmission', 'Fuel']
bmw[var] = scaler.fit_transform(bmw[var])
print(bmw.head(10).to_markdown())
|    |     price |   mileage |       tax |       mpg |   engineSize |     Model |     Year |   Transmission |   Fuel |
|---:|----------:|----------:|----------:|----------:|-------------:|----------:|---------:|---------------:|-------:|
|  0 | 0.0817956 | 0.313399  | 0.215517  | 0.111971  |     0.30303  | 0.173913  | 0.75     |            0   |      0 |
|  1 | 0.211033  | 0.0692807 | 0.25      | 0.0801633 |     0.30303  | 0.217391  | 0.916667 |            0   |      1 |
|  2 | 0.121057  | 0.293427  | 0.275862  | 0.098646  |     0.454545 | 0.173913  | 0.833333 |            0   |      0 |
|  3 | 0.0944739 | 0.12465   | 0.25      | 0.143778  |     0.227273 | 0         | 0.875    |            0   |      0 |
|  4 | 0.108788  | 0.184828  | 0.275862  | 0.0964969 |     0.454545 | 0.26087   | 0.75     |            0   |      0 |
|  5 | 0.11206   | 0.164991  | 0.215517  | 0.117344  |     0.30303  | 0.173913  | 0.833333 |            0   |      0 |
|  6 | 0.121057  | 0.18008   | 0.215517  | 0.117344  |     0.30303  | 0.173913  | 0.875    |            0   |      0 |
|  7 | 0.123102  | 0.0485984 | 0.25      | 0.10058   |     0.227273 | 0.0434783 | 0.916667 |            0.5 |      1 |
|  8 | 0.106743  | 0.199379  | 0.0517241 | 0.123146  |     0.30303  | 0.130435  | 0.875    |            0.5 |      0 |
|  9 | 0.106743  | 0.168683  | 0.0344828 | 0.136256  |     0.30303  | 0.173913  | 0.833333 |            0   |      0 |
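
As a sanity check, MinMaxScaler maps each value x to (x - min) / (max - min); plugging in the price minimum and maximum from the descriptive statistics above reproduces row 0's scaled price:

price_min, price_max = 1200, 123456   # from the describe() table
print((11200 - price_min) / (price_max - price_min))   # ~0.0818, matching row 0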

Split data

y = bmw.pop("price")   # pop removes the target column, so bmw now holds only the features
X_train, X_test, y_train, y_test = train_test_split(bmw, y, test_size = 0.3, random_state = 100)

Modeling

X_tr = sm.add_constant(X_train)   # add an intercept column
model1 = sm.OLS(endog = y_train, exog = X_tr)
results = model1.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.741
Method:                 Least Squares   F-statistic:                     2699.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:14   Log-Likelihood:                 12364.
No. Observations:                7546   AIC:                        -2.471e+04
Df Residuals:                    7537   BIC:                        -2.465e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.3050      0.009    -32.332      0.000      -0.324      -0.287
mileage         -0.2057      0.008    -26.268      0.000      -0.221      -0.190
tax             -0.0574      0.006     -8.972      0.000      -0.070      -0.045
mpg              0.0100      0.009      1.097      0.273      -0.008       0.028
engineSize       0.4646      0.008     59.045      0.000       0.449       0.480
Model            0.1145      0.002     52.222      0.000       0.110       0.119
Year             0.3771      0.009     41.715      0.000       0.359       0.395
Transmission     0.0061      0.001      4.685      0.000       0.004       0.009
Fuel             0.0098      0.001      7.621      0.000       0.007       0.012
==============================================================================
Omnibus:                     4879.884   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           174705.726
Skew:                           2.572   Prob(JB):                         0.00
Kurtosis:                      26.004   Cond. No.                         42.4
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The p-value for ‘mpg’ is above 5%, so the feature is statistically insignificant; we drop ‘mpg’ and rebuild the model.
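
A small sketch of how the insignificant features can be picked out programmatically from the fitted results:

pvals = results.pvalues.drop('const')   # p-values per feature, excluding the intercept
print(pvals[pvals > 0.05])              # only 'mpg' exceeds the 5% threshold here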

Rebuilding the model

X_train.drop('mpg', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model2 = sm.OLS(endog = y_train, exog = X_tr)
results = model2.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.741
Model:                            OLS   Adj. R-squared:                  0.741
Method:                 Least Squares   F-statistic:                     3084.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:14   Log-Likelihood:                 12364.
No. Observations:                7546   AIC:                        -2.471e+04
Df Residuals:                    7538   BIC:                        -2.466e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.3024      0.009    -33.134      0.000      -0.320      -0.285
mileage         -0.2056      0.008    -26.259      0.000      -0.221      -0.190
tax             -0.0588      0.006     -9.363      0.000      -0.071      -0.046
engineSize       0.4618      0.007     62.065      0.000       0.447       0.476
Model            0.1149      0.002     53.037      0.000       0.111       0.119
Year             0.3768      0.009     41.700      0.000       0.359       0.395
Transmission     0.0060      0.001      4.600      0.000       0.003       0.009
Fuel             0.0097      0.001      7.549      0.000       0.007       0.012
==============================================================================
Omnibus:                     4879.024   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           174641.447
Skew:                           2.571   Prob(JB):                         0.00
Kurtosis:                      26.000   Cond. No.                         41.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Every feature is significant.

Tests of regression

Multicollinearity

from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|    | Features     |   VIF |
|---:|:-------------|------:|
|  2 | engineSize   | 20.97 |
|  4 | Year         | 16.26 |
|  1 | tax          |  8.19 |
|  5 | Transmission |  2.83 |
|  0 | mileage      |  2.08 |
|  3 | Model        |  1.97 |
|  6 | Fuel         |  1.85 |

A VIF above 10 indicates high collinearity, so we first drop ‘engineSize’.
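
For intuition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. A sketch reproducing engineSize's value the way variance_inflation_factor computes it (note that, as called above, the design matrix contains no constant):

aux = sm.OLS(X_train['engineSize'], X_train.drop('engineSize', axis = 1)).fit()
print(1 / (1 - aux.rsquared))   # should be close to 20.97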

X_train.drop('engineSize', axis = 1, inplace = True)
X_tr = sm.add_constant(X_train)
model3 = sm.OLS(endog = y_train, exog = X_tr)
results = model3.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.609
Method:                 Least Squares   F-statistic:                     1956.
Date:                Wed, 19 May 2021   Prob (F-statistic):               0.00
Time:                        14:37:15   Log-Likelihood:                 10806.
No. Observations:                7546   AIC:                        -2.160e+04
Df Residuals:                    7539   BIC:                        -2.155e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.2023      0.011    -18.317      0.000      -0.224      -0.181
mileage         -0.1801      0.010    -18.745      0.000      -0.199      -0.161
tax              0.1102      0.007     15.844      0.000       0.097       0.124
Model            0.1339      0.003     50.778      0.000       0.129       0.139
Year             0.3865      0.011     34.803      0.000       0.365       0.408
Transmission     0.0091      0.002      5.680      0.000       0.006       0.012
Fuel            -0.0040      0.002     -2.564      0.010      -0.007      -0.001
==============================================================================
Omnibus:                     4275.612   Durbin-Watson:                   1.985
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            77309.663
Skew:                           2.338   Prob(JB):                         0.00
Kurtosis:                      17.967   Cond. No.                         40.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

All the features are significant; now we will look at their VIFs.

vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
print(vif.to_markdown())
|    | Features     |   VIF |
|---:|:-------------|------:|
|  3 | Year         |  8.37 |
|  1 | tax          |  6.31 |
|  4 | Transmission |  2.83 |
|  2 | Model        |  1.93 |
|  0 | mileage      |  1.85 |
|  5 | Fuel         |  1.8  |

All the features now have a VIF below 10, so model3 will be our final model.
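
The final model is never scored on the held-out split above; a minimal sketch of doing so, assuming X_test gets the same column drops that were applied to X_train:

from sklearn.metrics import r2_score, mean_squared_error

X_te = sm.add_constant(X_test.drop(['mpg', 'engineSize'], axis = 1))
y_pred = results.predict(X_te)
print('Test R^2 :', r2_score(y_test, y_pred))
print('Test RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)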

Residual analysis

residuals = results.resid
plt.figure(figsize = (6, 5))
plt.hist(residuals, bins = 30)
plt.show()

The residuals are centred near zero but show a long right tail, consistent with the high skew and kurtosis reported in the model summary.
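
To probe the normality of the residuals further, a Q-Q plot is a common companion to the histogram; a minimal sketch using statsmodels:

sm.qqplot(residuals, line = '45')   # compare residual quantiles to the 45-degree line
plt.show()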