Assignment Description

Select a small data set from the available public data sets (you can find a list of public data sets here http://www.teymourian.de/public-data-sets-for-data-analytic-projects/ ).

Describe a research scenario and specify a research question based on data analytic methods that we learned in our class, for example methods like, one and two sample means, t-test, correlation tests, simple and multiple linear regression, ANOVA and ANCOVA, one and two-Sample Tests for Proportions and logistic regression.

Clean up your data and reduce it to no more than 500 observations if your data set is large.

Response - The Dataset selected for this Project is ‘BMW Pricing Challenge’. it is available on Kaggle.com and has been sampled to select 500 rows from roughly from 5000.

Research Scenario Description (no more than 200 words)

Describe your research scenario in no more than 200 words. This is a general description of the use case. Similar to our class examples, we first describe the overall scenario and then we specify a specific research question based on it.

Response - In 2019 alone, about 41 million used cars were sold in the US. Evaluation of the value of a used car is one of the primary challenges faced by the automotive companies and dealers around the world. Apart from the physical aspects like the color, model, transmission type etc., it is also a derivation of other factors like prestige value and perception of the quality of the maker.

In this project, we try to model the price, using the data analytic method of Multiple Linear Regression. We try to find out which aspects of a BMW car affect its price and which do not and we try to gauge how influential each factor is. In the end, we formulate and test the accuracy of model that can predict the price of the next car based on the relevant factors discovered during analysis.

Describe the data set (no more than 200 words)

Research Question (no more than 100 words)

Describe briefly in one or two sentences the main research question. This is similar to the last sentence of our class examples.

Can we quantify the physical and other features of pre-owned vehicles to generate a model of the price of a car which can we used to predict the price of the other similar natured cars?

Your solution R code

## 'data.frame':    4843 obs. of  18 variables:
##  $ maker_key        : chr  "BMW" "BMW" "BMW" "BMW" ...
##  $ model_key        : chr  "118" "M4" "320" "420" ...
##  $ mileage          : int  140411 13929 183297 128035 97097 152352 205219 115560 123886 139541 ...
##  $ engine_power     : int  100 317 120 135 160 225 145 105 125 135 ...
##  $ registration_date: chr  "2012-02-01" "2016-04-01" "2012-04-01" "2014-07-01" ...
##  $ fuel             : chr  "diesel" "petrol" "diesel" "diesel" ...
##  $ paint_color      : chr  "black" "grey" "white" "red" ...
##  $ car_type         : chr  "convertible" "convertible" "convertible" "convertible" ...
##  $ feature_1        : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ feature_2        : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ feature_3        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ feature_4        : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ feature_5        : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ feature_6        : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
##  $ feature_7        : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ feature_8        : logi  FALSE TRUE FALSE TRUE TRUE TRUE ...
##  $ price            : int  11300 69700 10200 25100 33400 17100 12400 6100 6200 17300 ...
##  $ sold_at          : chr  "2018-01-01" "2018-02-01" "2018-02-01" "2018-02-01" ...
##   maker_key          model_key            mileage         engine_power
##  Length:4843        Length:4843        Min.   :    -64   Min.   :  0  
##  Class :character   Class :character   1st Qu.: 102914   1st Qu.:100  
##  Mode  :character   Mode  :character   Median : 141080   Median :120  
##                                        Mean   : 140963   Mean   :129  
##                                        3rd Qu.: 175196   3rd Qu.:135  
##                                        Max.   :1000376   Max.   :423  
##  registration_date      fuel           paint_color          car_type        
##  Length:4843        Length:4843        Length:4843        Length:4843       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  feature_1       feature_2       feature_3       feature_4      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2181      FALSE:1004      FALSE:3865      FALSE:3881     
##  TRUE :2662      TRUE :3839      TRUE :978       TRUE :962      
##                                                                 
##                                                                 
##                                                                 
##  feature_5       feature_6       feature_7       feature_8      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:2613      FALSE:3674      FALSE:329       FALSE:2223     
##  TRUE :2230      TRUE :1169      TRUE :4514      TRUE :2620     
##                                                                 
##                                                                 
##                                                                 
##      price          sold_at         
##  Min.   :   100   Length:4843       
##  1st Qu.: 10800   Class :character  
##  Median : 14200   Mode  :character  
##  Mean   : 15828                     
##  3rd Qu.: 18600                     
##  Max.   :178500

After loading the dataset and checking the basic structure, we can see that there is one car with negative mileage and another one, which has zero engine power. After close examination, it was concluded that the negative mileage car was probably a new car as it was the newest car in our dataset but must have been bought by a dealer from another dealer and hence part of the used car dataset. The mileage is thus converted to zero.

A new age column is created to gauge the age of the car at the time of sale to the second owner. It is calculated by extracting the number of days between the registration date and the date of sale and is represented in number of days.

The feature columns are also converted from logical to binary for ease of use when adding to the multiple linear regression model (MLR).

Column maker_key, Registration Date and Sold date are dropped as maker is same for all the cars ‘BMW’ and does not provide any new information. The date columns are dropped because we already extracted the age of the car from them and hence do not provide other information required for this analysis.

As mentioned earlier, cars of fuel type hybrid and electro are removed. (11 rows removed out of 4842) as they will not be considered for the model. Some of the column names are hence changed for better aesthetics.

The dataset is then organised by type and fuel (type) which will form the basis of sampling of the dataset. First six rows of the modified dataset (before sampling) is shown above.

## Stratum 1 
## 
## Population total and number of selected units: 28 3 
## Stratum 2 
## 
## Population total and number of selected units: 19 2 
## Stratum 3 
## 
## Population total and number of selected units: 82 9 
## Stratum 4 
## 
## Population total and number of selected units: 21 3 
## Stratum 5 
## 
## Population total and number of selected units: 1574 160 
## Stratum 6 
## 
## Population total and number of selected units: 32 4 
## Stratum 7 
## 
## Population total and number of selected units: 659 67 
## Stratum 8 
## 
## Population total and number of selected units: 35 4 
## Stratum 9 
## 
## Population total and number of selected units: 1105 113 
## Stratum 10 
## 
## Population total and number of selected units: 62 7 
## Stratum 11 
## 
## Population total and number of selected units: 109 12 
## Stratum 12 
## 
## Population total and number of selected units: 7 1 
## Stratum 13 
## 
## Population total and number of selected units: 1041 106 
## Stratum 14 
## 
## Population total and number of selected units: 13 2 
## Stratum 15 
## 
## Population total and number of selected units: 42 5 
## Stratum 16 
## 
## Population total and number of selected units: 2 1 
## Number of strata  16 
## Total number of selected units 499
##   Model_Name           Mileage           Power          Color          
##  Length:499         Min.   :   476   Min.   : 70.0   Length:499        
##  Class :character   1st Qu.:101510   1st Qu.:100.0   Class :character  
##  Mode  :character   Median :138024   Median :120.0   Mode  :character  
##                     Mean   :138302   Mean   :130.2                     
##                     3rd Qu.:174801   3rd Qu.:140.0                     
##                     Max.   :370014   Max.   :317.0                     
##    feature_1        feature_2        feature_3        feature_4     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.5671   Mean   :0.8016   Mean   :0.2305   Mean   :0.2004  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    feature_5       feature_6        feature_7        feature_8     
##  Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  Median :0.000   Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.493   Mean   :0.2866   Mean   :0.9399   Mean   :0.5371  
##  3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Price            Age           Type               Fuel          
##  Min.   :  800   Min.   : 272   Length:499         Length:499        
##  1st Qu.:11000   1st Qu.:1490   Class :character   Class :character  
##  Median :14400   Median :1764   Mode  :character   Mode  :character  
##  Mean   :16146   Mean   :1948                                        
##  3rd Qu.:18350   3rd Qu.:2114                                        
##  Max.   :73100   Max.   :6451

Five-hundred rows are then sampled from the dataset to form the basis of the analysis. The sampled strata details and summary of the sampled dataset is displayed above.

Correlation plot is created to check for correlation between variables and each variables relation with the dependent price variable. As can be seen there seem to be no evidence of multi-collinearity as no variables is correlated to each other significantly. We can also see that correlation coefficient of any independent variable to the dependent price variable is not higher than 0.7 (Power) and hence we can note that none of the independent variables have particularly strong relationship with the price variable.





As was seen from the correlation plot, the independent variables Power, Mileage and Age seem to more closely relate to the dependent variable price. Scatter plots with regression lines depicting the relationship of each variable to the price are presented above. It can be seen that form of the three variables is linear, with the points moving in a straight line. The Age and Mileage variables are negatively correlated (negative direction) to the price variable whereas power is positively correlated (positive direction).However, there does appear to be outliers and points do not seem to be very closely packed hence giving an impression of lower strength of the relationships.


Formal Test to test (using the 5-step procedure) whether the set of these predictors are associated with price at the alpha = 0.05

1. Set up the hypotheses and select the alpha level

\(H_0:\beta\)Predictors \(= 0\) (None of the Independent variables are not Predictors of Price)

\(H_1: H_0:\) \(\beta\)Predictors \(\neq\) 0 (at least one of the slope coefficients is different than 0 and is a predictor of Price)

\(\alpha\) = 0.05

2. Select the appropriate test statistic

we are going to use a F-test statistic.

\[\mathbf{F} = \frac{MS\ Reg}{MS\ Res}\ where\ df = k,n - k - 1\]

3.State the decision rule


Reject \(H_0\) if \(p-value <= \alpha\). Otherwise, do not reject \(H_0\)

4. Compute the test statistic and the associated p-value.

## 
## Call:
## lm(formula = Price ~ ., data = MLR.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18407  -1450      0   1508  17637 
## 
## Coefficients: (1 not defined because of singularities)
##                                  Estimate    Std. Error t value
## (Intercept)                  13676.616016   4063.065027   3.366
## Model_Name116                 -379.560267   2229.276462  -0.170
## Model_Name118                  531.352980   2422.531795   0.219
## Model_Name120                 -395.491108   2950.480959  -0.134
## Model_Name123                -1068.143897   4352.905662  -0.245
## Model_Name125                -6504.031412   4347.085177  -1.496
## Model_Name218               -12113.538373   4953.317595  -2.446
## Model_Name218 Active Tourer  -1195.020942   4187.128295  -0.285
## Model_Name218 Gran Tourer    -3320.852907   4206.887647  -0.789
## Model_Name225 Active Tourer    -96.913365   5280.544036  -0.018
## Model_Name316                 1831.775809   2537.969392   0.722
## Model_Name318                 2116.095544   2531.953979   0.836
## Model_Name318 Gran Turismo    2922.631634   2484.880404   1.176
## Model_Name320                 1686.453728   2598.470699   0.649
## Model_Name320 Gran Turismo    1949.318613   2702.184494   0.721
## Model_Name325                 -884.685405   3333.693333  -0.265
## Model_Name325 Gran Turismo    5001.972148   4336.413513   1.153
## Model_Name330                 1420.635758   3283.247451   0.433
## Model_Name335                  -73.754789   4212.202166  -0.018
## Model_Name420                 5874.055062   3440.204922   1.707
## Model_Name420 Gran Coupé     7166.720041   2911.406278   2.462
## Model_Name425                12108.084749   5100.481103   2.374
## Model_Name435                12551.299567   5324.603997   2.357
## Model_Name435 Gran Coupé    15375.973168   4635.786986   3.317
## Model_Name518                 4329.622606   2888.657707   1.499
## Model_Name520                 3998.306819   2659.453626   1.503
## Model_Name520 Gran Turismo    7478.695245   3015.981514   2.480
## Model_Name525                 3956.657582   2872.259282   1.378
## Model_Name528                 4888.992518   4496.819057   1.087
## Model_Name530                 5764.482747   3103.828658   1.857
## Model_Name530 Gran Turismo    5112.734364   4887.218875   1.046
## Model_Name535                 3432.881447   4192.245172   0.819
## Model_Name630                  621.902293   5088.656228   0.122
## Model_Name640                12770.272193   4523.796025   2.823
## Model_Name640 Gran Coupé    15802.078861   4823.890951   3.276
## Model_Name730                 7911.491925   3330.493936   2.375
## Model_Name740                31426.110262   3952.578911   7.951
## Model_NameM235                6519.471476   5275.934266   1.236
## Model_NameM4                 36608.459248   5857.396835   6.250
## Model_NameM550               15846.823109   5255.191593   3.015
## Model_NameX1                  2251.070155   4373.701368   0.515
## Model_NameX3                  4810.391539   4396.348133   1.094
## Model_NameX4                 10842.518936   4867.135296   2.228
## Model_NameX5                 12600.106877   4643.694958   2.713
## Model_NameX5 M               29475.414959   5122.129532   5.755
## Model_NameX6                 14720.371011   5200.074801   2.831
## Model_NameX6 M               36306.894625   6238.419819   5.820
## Mileage                         -0.034984      0.003729  -9.382
## Power                           52.816340     14.816859   3.565
## Colorblack                     707.422298   1888.266846   0.375
## Colorblue                       86.498705   1907.858782   0.045
## Colorbrown                    -952.105353   1958.594961  -0.486
## Colorgreen                   -3900.705528   3417.949930  -1.141
## Colorgrey                     1395.139998   1907.634961   0.731
## Colorred                      -831.586541   2769.003013  -0.300
## Colorsilver                   1737.519752   1981.997441   0.877
## Colorwhite                    1464.588481   1949.263072   0.751
## feature_1                      441.633037    388.200998   1.138
## feature_2                      267.510039    514.683506   0.520
## feature_3                      393.254775    426.486754   0.922
## feature_4                      892.308269    568.454201   1.570
## feature_5                      282.763118    418.827865   0.675
## feature_6                      508.168520    409.589546   1.241
## feature_7                     2063.617048    845.491628   2.441
## feature_8                     1247.267092    420.546714   2.966
## Age                             -3.046208      0.257066 -11.850
## Typecoupe                    -1174.550498   3251.465753  -0.361
## Typeestate                   -2459.506841   2339.743446  -1.051
## Typehatchback                -1649.577686   2645.902806  -0.623
## Typesedan                    -1213.847079   2346.422686  -0.517
## Typesubcompact               -1816.297268   2801.542256  -0.648
## Typesuv                      -1948.995019   4272.969309  -0.456
## Typevan                                NA            NA      NA
## Fuelpetrol                    -722.939314    943.282219  -0.766
##                                         Pr(>|t|)    
## (Intercept)                             0.000832 ***
## Model_Name116                           0.864885    
## Model_Name118                           0.826492    
## Model_Name120                           0.893432    
## Model_Name123                           0.806275    
## Model_Name125                           0.135347    
## Model_Name218                           0.014867 *  
## Model_Name218 Active Tourer             0.775474    
## Model_Name218 Gran Tourer               0.430326    
## Model_Name225 Active Tourer             0.985366    
## Model_Name316                           0.470845    
## Model_Name318                           0.403761    
## Model_Name318 Gran Turismo              0.240185    
## Model_Name320                           0.516677    
## Model_Name320 Gran Turismo              0.471068    
## Model_Name325                           0.790847    
## Model_Name325 Gran Turismo              0.249360    
## Model_Name330                           0.665457    
## Model_Name335                           0.986038    
## Model_Name420                           0.088463 .  
## Model_Name420 Gran Coupé               0.014227 *  
## Model_Name425                           0.018043 *  
## Model_Name435                           0.018864 *  
## Model_Name435 Gran Coupé               0.000989 ***
## Model_Name518                           0.134657    
## Model_Name520                           0.133469    
## Model_Name520 Gran Turismo              0.013535 *  
## Model_Name525                           0.169068    
## Model_Name528                           0.277558    
## Model_Name530                           0.063970 .  
## Model_Name530 Gran Turismo              0.296088    
## Model_Name535                           0.413321    
## Model_Name630                           0.902788    
## Model_Name640                           0.004982 ** 
## Model_Name640 Gran Coupé               0.001140 ** 
## Model_Name730                           0.017968 *  
## Model_Name740                 0.0000000000000168 ***
## Model_NameM235                          0.217251    
## Model_NameM4                  0.0000000009973957 ***
## Model_NameM550                          0.002719 ** 
## Model_NameX1                            0.607041    
## Model_NameX3                            0.274495    
## Model_NameX4                            0.026422 *  
## Model_NameX5                            0.006930 ** 
## Model_NameX5 M                0.0000000166262998 ***
## Model_NameX6                            0.004863 ** 
## Model_NameX6 M                0.0000000115930715 ***
## Mileage                     < 0.0000000000000002 ***
## Power                                   0.000406 ***
## Colorblack                              0.708114    
## Colorblue                               0.963859    
## Colorbrown                              0.627135    
## Colorgreen                              0.254411    
## Colorgrey                               0.464970    
## Colorred                                0.764080    
## Colorsilver                             0.381170    
## Colorwhite                              0.452854    
## feature_1                               0.255910    
## feature_2                               0.603503    
## feature_3                               0.357009    
## feature_4                               0.117225    
## feature_5                               0.499960    
## feature_6                               0.215408    
## feature_7                               0.015064 *  
## feature_8                               0.003189 ** 
## Age                         < 0.0000000000000002 ***
## Typecoupe                               0.718101    
## Typeestate                              0.293769    
## Typehatchback                           0.533325    
## Typesedan                               0.605203    
## Typesubcompact                          0.517127    
## Typesuv                                 0.648535    
## Typevan                                       NA    
## Fuelpetrol                              0.443858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3439 on 426 degrees of freedom
## Multiple R-squared:  0.8819, Adjusted R-squared:  0.8619 
## F-statistic: 44.18 on 72 and 426 DF,  p-value: < 0.00000000000000022


\(F-Statistic =\) 44.18 on 72 and 426 DF
\(p-value =\) less than < 0.00000000000000022
\(Adjusted\ R^2 =\) 0.86


5.Conclusion


Reject \(H_0\) since \(p <= \alpha\). We have significant evidence at the \(\alpha=0.05\) level that independent variables when taken together are predictive of price. That is, there is evidence of a linear association between Price and all other independent variables.

From the above test we can conclude that some but not all variables are predictors of price. we now create a new model after removing the redundent variables and check the effect on R-squared value. we do this so as to keep the model simple and also avoid the possibility of overfitting.

## 
## Call:
## lm(formula = Price ~ . - Color - Type - feature_1 - feature_2 - 
##     feature_3 - feature_4 - feature_5 - feature_6 - Fuel, data = MLR.df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19311  -1568      0   1594  17856 
## 
## Coefficients:
##                                 Estimate   Std. Error t value
## (Intercept)                 11035.533036  2457.585768   4.490
## Model_Name116                -367.148781  2144.222770  -0.171
## Model_Name118                 365.628899  2343.901188   0.156
## Model_Name120                -121.484247  2838.921068  -0.043
## Model_Name123                -156.627857  4249.557013  -0.037
## Model_Name125               -5964.664909  4295.126228  -1.389
## Model_Name218               -9205.343724  4169.884660  -2.208
## Model_Name218 Active Tourer   -78.448695  3232.448513  -0.024
## Model_Name218 Gran Tourer   -2086.675230  3248.685052  -0.642
## Model_Name225 Active Tourer   339.737970  4320.107547   0.079
## Model_Name316                1673.407342  2170.758919   0.771
## Model_Name318                1885.799624  2147.686961   0.878
## Model_Name318 Gran Turismo   3192.691023  2350.432214   1.358
## Model_Name320                1485.878699  2240.415309   0.663
## Model_Name320 Gran Turismo   3143.603252  2547.820386   1.234
## Model_Name325               -1496.859392  2983.573744  -0.502
## Model_Name325 Gran Turismo   5318.374518  4247.894528   1.252
## Model_Name330                 438.037486  2902.284516   0.151
## Model_Name335                 861.360304  3836.631219   0.225
## Model_Name420                6622.952058  2711.828377   2.442
## Model_Name420 Gran Coupé    7195.868539  2793.662380   2.576
## Model_Name425               14669.951862  4239.324330   3.460
## Model_Name435               12486.467196  4726.634404   2.642
## Model_Name435 Gran Coupé   14204.081694  4554.170257   3.119
## Model_Name518                5221.792856  2564.436953   2.036
## Model_Name520                3890.926161  2268.184091   1.715
## Model_Name520 Gran Turismo   8539.710306  2870.077660   2.975
## Model_Name525                3559.275249  2493.758610   1.427
## Model_Name528                4427.094292  4357.974928   1.016
## Model_Name530                5371.769142  2750.173068   1.953
## Model_Name530 Gran Turismo   5106.691172  4421.419217   1.155
## Model_Name535                3545.947162  3961.757557   0.895
## Model_Name630                 536.703208  4462.547160   0.120
## Model_Name640               12319.849904  3634.666830   3.390
## Model_Name640 Gran Coupé   17055.529137  4598.127713   3.709
## Model_Name730                8672.975605  2933.738990   2.956
## Model_Name740               31513.122136  3612.619228   8.723
## Model_NameM235               6710.995356  4678.554941   1.434
## Model_NameM4                36158.837660  5245.327005   6.894
## Model_NameM550              15086.087913  4985.580422   3.026
## Model_NameX1                 2366.960650  2204.327190   1.074
## Model_NameX3                 5206.299910  2272.938057   2.291
## Model_NameX4                11127.363522  3054.682786   3.643
## Model_NameX5                12825.357418  2696.690526   4.756
## Model_NameX5 M              30362.034646  3397.816503   8.936
## Model_NameX6                13973.470681  3429.537546   4.074
## Model_NameX6 M              35328.977005  4962.621333   7.119
## Mileage                        -0.036270     0.003633  -9.984
## Power                          64.659908    14.029388   4.609
## feature_7                    3104.130639   743.244574   4.176
## feature_8                    1425.665215   408.435591   3.491
## Age                            -3.009185     0.233225 -12.902
##                                         Pr(>|t|)    
## (Intercept)                     0.00000905948056 ***
## Model_Name116                           0.864123    
## Model_Name118                           0.876110    
## Model_Name120                           0.965886    
## Model_Name123                           0.970615    
## Model_Name125                           0.165614    
## Model_Name218                           0.027781 *  
## Model_Name218 Active Tourer             0.980649    
## Model_Name218 Gran Tourer               0.520999    
## Model_Name225 Active Tourer             0.937353    
## Model_Name316                           0.441182    
## Model_Name318                           0.380382    
## Model_Name318 Gran Turismo              0.175040    
## Model_Name320                           0.507534    
## Model_Name320 Gran Turismo              0.217911    
## Model_Name325                           0.616125    
## Model_Name325 Gran Turismo              0.211224    
## Model_Name330                           0.880100    
## Model_Name335                           0.822463    
## Model_Name420                           0.014983 *  
## Model_Name420 Gran Coupé               0.010321 *  
## Model_Name425                           0.000591 ***
## Model_Name435                           0.008538 ** 
## Model_Name435 Gran Coupé               0.001932 ** 
## Model_Name518                           0.042316 *  
## Model_Name520                           0.086958 .  
## Model_Name520 Gran Turismo              0.003084 ** 
## Model_Name525                           0.154199    
## Model_Name528                           0.310245    
## Model_Name530                           0.051414 .  
## Model_Name530 Gran Turismo              0.248712    
## Model_Name535                           0.371245    
## Model_Name630                           0.904325    
## Model_Name640                           0.000762 ***
## Model_Name640 Gran Coupé               0.000234 ***
## Model_Name730                           0.003278 ** 
## Model_Name740               < 0.0000000000000002 ***
## Model_NameM235                          0.152153    
## Model_NameM4                    0.00000000001862 ***
## Model_NameM550                          0.002622 ** 
## Model_NameX1                            0.283501    
## Model_NameX3                            0.022454 *  
## Model_NameX4                            0.000301 ***
## Model_NameX5                    0.00000266899351 ***
## Model_NameX5 M              < 0.0000000000000002 ***
## Model_NameX6                    0.00005457688954 ***
## Model_NameX6 M                  0.00000000000436 ***
## Mileage                     < 0.0000000000000002 ***
## Power                           0.00000528952795 ***
## feature_7                       0.00003561163814 ***
## feature_8                               0.000530 ***
## Age                         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3517 on 447 degrees of freedom
## Multiple R-squared:  0.8704, Adjusted R-squared:  0.8556 
## F-statistic: 58.86 on 51 and 447 DF,  p-value: < 0.00000000000000022

\(F-Statistic =\) 58.86 on 51 and 447 DF
\(p-value =\) less than < 0.00000000000000022
\(Adjusted\ R^2 =\) 0.85

As can be seen that removal of some unimportant features has marginal effect on the adjusted R-squared value but the model is much more concise and less likely to exhibit overfitting.

Residuals



As is evident from the plots, there appears to be no apparent association between data points and the model seems to be reasonable. The residuals, even though display a sort of a cluster with some outliers, the model seems reasonable and they do seem to be normally distributed.


Outliers
Model_Name Mileage Power Color feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 Price Age Type Fuel
425 X6 46041 230 grey 1 1 1 0 1 0 1 1 55500 942 suv diesel
484 X5 M 4530 230 silver 1 1 0 1 0 0 0 1 73100 760 suv diesel
329 740 166286 230 brown 1 1 1 1 0 1 1 1 31300 1764 sedan diesel
407 X5 64729 190 grey 1 1 1 1 0 1 1 1 48200 1248 suv diesel
457 X5 52253 190 white 1 1 1 1 1 0 1 1 46600 851 suv diesel
491 X3 97835 190 brown 1 1 1 0 0 0 0 1 5400 1917 suv diesel
431 X5 123474 190 black 1 1 0 1 1 1 1 1 43400 1126 suv diesel
481 X5 148480 155 silver 1 1 1 1 1 0 1 1 35700 1521 suv diesel


We use the diagnostic plots to identify some of the outliers. The list of such data points is presented above. Interestingly, all of them are SUV’s baring one sedan and all are diesel-fueled cars. Most of them display a similar trend of high price & low mileage the ratio that seems to be not comparable to other cars in the dataset.

## 
## Call:
## lm(formula = Price ~ . - Color - Type - feature_1 - feature_2 - 
##     feature_3 - feature_4 - feature_5 - feature_6 - Fuel, data = MLR.df[-outliers.index, 
##     ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8349  -1430      0   1437   7708 
## 
## Coefficients:
##                                  Estimate    Std. Error t value
## (Intercept)                  10225.418273   1924.367212   5.314
## Model_Name116                 -843.925620   1672.746822  -0.505
## Model_Name118                 -321.469508   1831.777816  -0.175
## Model_Name120                 -993.841835   2222.883343  -0.447
## Model_Name123                -1558.599069   3325.338571  -0.469
## Model_Name125                -6951.303151   3372.196933  -2.061
## Model_Name218               -10157.398055   3259.185805  -3.117
## Model_Name218 Active Tourer   -127.593824   2521.661800  -0.051
## Model_Name218 Gran Tourer    -2206.344451   2535.347870  -0.870
## Model_Name225 Active Tourer   -764.464683   3393.357307  -0.225
## Model_Name316                 1085.419362   1693.471308   0.641
## Model_Name318                 1119.368746   1677.565473   0.667
## Model_Name318 Gran Turismo    2529.113626   1834.976274   1.378
## Model_Name320                  351.457023   1756.777336   0.200
## Model_Name320 Gran Turismo    2188.677023   1994.421534   1.097
## Model_Name325                -2493.264970   2349.604020  -1.061
## Model_Name325 Gran Turismo    4359.108767   3324.103056   1.311
## Model_Name330                -1160.302529   2293.167830  -0.506
## Model_Name335                -1203.458761   3028.938311  -0.397
## Model_Name420                 5953.633711   2124.487920   2.802
## Model_Name420 Gran Coupé     6769.274694   2184.846135   3.098
## Model_Name425                13780.408783   3316.165778   4.156
## Model_Name435                11167.254961   3728.186240   2.995
## Model_Name435 Gran Coupé    13251.860546   3583.226372   3.698
## Model_Name518                 4384.862112   2001.429814   2.191
## Model_Name520                 2766.727673   1778.065564   1.556
## Model_Name520 Gran Turismo    7214.457320   2245.606021   3.213
## Model_Name525                 2007.642009   1963.526512   1.022
## Model_Name528                 2967.369134   3415.874014   0.869
## Model_Name530                 3818.029050   2175.673297   1.755
## Model_Name530 Gran Turismo    2735.921655   3466.749506   0.789
## Model_Name535                 1135.962300   3130.566846   0.363
## Model_Name630                -1639.583114   3505.207019  -0.468
## Model_Name640                10421.315195   2879.577755   3.619
## Model_Name640 Gran Coupé    15660.826000   3618.495416   4.328
## Model_Name730                 6772.912888   2313.352094   2.928
## Model_Name740                40159.132270   3071.876714  13.073
## Model_NameM235                5489.814717   3687.361434   1.489
## Model_NameM4                 34637.430520   4162.489102   8.321
## Model_NameM550               12929.194470   3942.676643   3.279
## Model_NameX1                  1624.392216   1722.977598   0.943
## Model_NameX3                  4534.304236   1778.853103   2.549
## Model_NameX4                 10572.539560   2393.300732   4.418
## Model_NameX5                  8758.307380   2166.313730   4.043
## Model_NameX5 M               20199.100943   2838.896335   7.115
## Model_NameX6                  6712.199473   2838.006750   2.365
## Model_NameX6 M               34134.768913   3925.008449   8.697
## Mileage                         -0.031384      0.002858 -10.980
## Power                           72.226982     11.419255   6.325
## feature_7                     2628.472080    591.784918   4.442
## feature_8                     1582.177719    318.870473   4.962
## Age                             -2.734781      0.183953 -14.867
##                                         Pr(>|t|)    
## (Intercept)                  0.00000017125673619 ***
## Model_Name116                           0.614153    
## Model_Name118                           0.860771    
## Model_Name120                           0.655026    
## Model_Name123                           0.639514    
## Model_Name125                           0.039857 *  
## Model_Name218                           0.001950 ** 
## Model_Name218 Active Tourer             0.959668    
## Model_Name218 Gran Tourer               0.384648    
## Model_Name225 Active Tourer             0.821864    
## Model_Name316                           0.521894    
## Model_Name318                           0.504958    
## Model_Name318 Gran Turismo              0.168818    
## Model_Name320                           0.841528    
## Model_Name320 Gran Turismo              0.273068    
## Model_Name325                           0.289209    
## Model_Name325 Gran Turismo              0.190420    
## Model_Name330                           0.613123    
## Model_Name335                           0.691324    
## Model_Name420                           0.005297 ** 
## Model_Name420 Gran Coupé               0.002071 ** 
## Model_Name425                0.00003902587448898 ***
## Model_Name435                           0.002896 ** 
## Model_Name435 Gran Coupé               0.000245 ***
## Model_Name518                           0.028987 *  
## Model_Name520                           0.120421    
## Model_Name520 Gran Turismo              0.001412 ** 
## Model_Name525                           0.307123    
## Model_Name528                           0.385486    
## Model_Name530                           0.079979 .  
## Model_Name530 Gran Turismo              0.430427    
## Model_Name535                           0.716883    
## Model_Name630                           0.640191    
## Model_Name640                           0.000330 ***
## Model_Name640 Gran Coupé    0.00001865291906992 ***
## Model_Name730                           0.003592 ** 
## Model_Name740               < 0.0000000000000002 ***
## Model_NameM235                          0.137253    
## Model_NameM4                 0.00000000000000111 ***
## Model_NameM550                          0.001124 ** 
## Model_NameX1                            0.346311    
## Model_NameX3                            0.011142 *  
## Model_NameX4                 0.00001258887740148 ***
## Model_NameX5                 0.00006233629747939 ***
## Model_NameX5 M               0.00000000000458098 ***
## Model_NameX6                            0.018459 *  
## Model_NameX6 M              < 0.0000000000000002 ***
## Mileage                     < 0.0000000000000002 ***
## Power                        0.00000000062447781 ***
## feature_7                    0.00001131483790137 ***
## feature_8                    0.00000100062147108 ***
## Age                         < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2742 on 439 degrees of freedom
## Multiple R-squared:  0.9039, Adjusted R-squared:  0.8927 
## F-statistic: 80.96 on 51 and 439 DF,  p-value: < 0.00000000000000022

\(F-Statistic =\) 80.96 on 51 and 449 DF
\(p-value =\) less than < 0.00000000000000022
\(Adjusted\ R^2 =\) 0.89

We remove the outliers and construct our final model with only the relevant features. Immediately it is evident that our adjusted R-score has improved by forty basis points due to the removal of the outliers.

The residual plot of the newly created model is presented above. the residuals seem to follow a perfect normal distribution.

Testing and Conclusion.


##  [1] "Table: Prdicted vs Actual Prices from the Test Data"
##  [2] ""                                                   
##  [3] "|  Predicted  | Actual |"                           
##  [4] "|:-----------:|:------:|"                           
##  [5] "| 9441.32180  | 11300  |"                           
##  [6] "| 10292.15859 | 10200  |"                           
##  [7] "| 22539.52007 | 25100  |"                           
##  [8] "| 7232.01080  | 12400  |"                           
##  [9] "| 7998.01506  |  6100  |"                           
## [10] "| 13837.24816 | 17300  |"

\(\ R^2\ from\ test\ data = 0.7\)

We subtracted the cars, which we used to build the model from the original dataset to come up with the test data set. When we originally sampled our dataset, our sample did not include some of the car models as we sampled the data by type (SUV, Compact, etc.) and fuel (Diesel or Petrol) and not by the model. We identified the models not used in our sample and removed them from our test data to avoid extrapolation. We also removed the cars with fuel type electro and hybrid for the same reason.

Some of the predicted values for the price of the car came as zero, which is not possible. Hence, we converted them to zero. Some of actual vs predicted values are presented above. We finally then calculated our test data R2 squared value which came out to be 0.6621321 or 0.7 rounded.

An important observation in hindsight is that fuel type was not a major predictor of price in BMW cars and model type was a major predictor, either of which was not expected.

Even though 0.7 can be an acceptable performance, our model performed poorly on the test data in comparison with the training data. The main reason of which is the size of the sample. Our train data was relatively much smaller than the test data (4258 rows of test data vs 499 of train). Our train-test split of the data was constrained by the project parameter of five-hundred observations, however, a more robust and holistic sampling/splitting method could result in a better performing model.