This case study practices model building with multiple linear regression and bootstrap sampling. A multiple linear regression model is fit to the cars data set, and bootstrap sampling is then applied to that model.
This data set was found on kaggle.com (https://www.kaggle.com/datasets/CooperUnion/cardataset) using ChatGPT to scan the internet for a data set with valid parameters for the assignment. The data set provides information about cars, such as make, model, year, engine type, and MSRP. The sampling methods used to collect the data were not specified in the description of the data. The data set was uploaded to GitHub for universal access (https://raw.githubusercontent.com/JackRoss10089/STA-321/main/data%203.csv). The variables in the data set are as follows:
The central objective is to identify the relationship between MSRP and relevant explanatory variables in the data set. When buying a car, the MSRP is often one of the largest contributors to the buying decision. This practical question evaluates which factors in this data set are related to the MSRP of a car.
To begin analysis, first it is necessary to evaluate the variables in the data set and choose which variables can be used to build the model.
We are now left with a data set that contains MSRP as the response variable and vehicle size, highway miles per gallon, and engine horsepower as predictor variables.
Next the final data set is fit to a multiple linear regression model.
We start with a linear model that includes all predictor variables from our final data set.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -60351.5014 | 1752.57723 | -34.43586 | 0 |
| Engine.HP | 376.1885 | 3.16146 | 118.99201 | 0 |
| highway.MPG | 576.2932 | 47.30442 | 12.18265 | 0 |
| Vehicle.Size0Large | -19000.2912 | 858.82280 | -22.12365 | 0 |
| Vehicle.Size0Midsize | -11870.2455 | 688.64790 | -17.23703 | 0 |
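As a rough illustration of what fitting a model of this shape involves, the following Python sketch builds a design matrix with an intercept, the two numeric predictors, and dummy variables for vehicle size (Compact as the baseline, matching the coefficient names in the table), then solves the normal equations. The helper names (`solve`, `ols`, `design_row`), the toy rows, and the `true_beta` values are all hypothetical; this is not the code or data behind the table above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


def design_row(hp, mpg, size):
    """Intercept, two numeric predictors, and two dummies (Compact is the baseline)."""
    return [1.0, float(hp), float(mpg),
            1.0 if size == "Large" else 0.0,
            1.0 if size == "Midsize" else 0.0]


# Hypothetical toy rows (hp, mpg, size); not taken from the cars data set.
cars = [(150, 34, "Compact"), (180, 30, "Compact"), (140, 36, "Compact"),
        (200, 28, "Midsize"), (240, 26, "Midsize"), (210, 31, "Midsize"),
        (300, 22, "Large"), (280, 24, "Large"), (320, 21, "Large")]
true_beta = [5.0, 0.1, 0.5, -3.0, -1.0]  # assumed values for the illustration

X = [design_row(*c) for c in cars]
y = [sum(b * v for b, v in zip(true_beta, row)) for row in X]  # noise-free response
beta_hat = ols(X, y)
```

Because the toy response is generated without noise, `beta_hat` reproduces `true_beta` up to rounding error. With real data the same steps give the least-squares estimates, and each dummy coefficient is read as a shift relative to the Compact baseline.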
Next, we conduct residual diagnostic analysis to check the validity of the model before making inferences about the model.
Residual plots of the full model
Based upon the residual plots, it is apparent that there are some violations of the model assumptions:
Because the assumption of constant variance is violated, the Box-Cox procedure is an appropriate way to search for a transformation of the response variable.
From this Box-Cox Transformation, we can now conclude that the appropriate value for lambda is 0.36. We will use this value to transform the response variable in the new model.
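The idea behind the Box-Cox search can be sketched in Python as a grid search over lambda that maximizes the profile log-likelihood of the transformed response. This is a minimal sketch under stated assumptions: it uses a single predictor and simulated data so that it stays self-contained, and the names `sse_simple`, `boxcox`, and `profile_loglik` are hypothetical. It is not the routine that produced the value 0.36 above.

```python
import math
import random


def sse_simple(x, z):
    """Residual sum of squares from a one-predictor least-squares fit of z on x."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / sxx
    intercept = mz - slope * mx
    return sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))


def boxcox(y, lam):
    """Box-Cox transform: log(y) when lam is 0, otherwise (y**lam - 1) / lam."""
    return [math.log(v) if lam == 0 else (v ** lam - 1) / lam for v in y]


def profile_loglik(x, y, lam):
    """Profile log-likelihood of lambda (up to a constant), including the Jacobian term."""
    n = len(y)
    fit_term = -0.5 * n * math.log(sse_simple(x, boxcox(y, lam)) / n)
    jacobian_term = (lam - 1) * sum(math.log(v) for v in y)
    return fit_term + jacobian_term


# Simulated data (an assumption): the square root of y is linear in x,
# so the search should land near lambda = 0.5.
random.seed(1)
x = [float(i) for i in range(1, 51)]
y = [(2 + 0.5 * xi + random.gauss(0, 0.1)) ** 2 for xi in x]

grid = [i / 20 for i in range(-20, 41)]  # lambda from -1 to 2 in steps of 0.05
best_lambda = max(grid, key=lambda lam: profile_loglik(x, y, lam))
```

The Jacobian term is what makes log-likelihoods comparable across values of lambda, since each lambda puts the response on a different scale.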
After using the Box Cox transformation to assess the appropriate value for lambda, we can now transform the response variable MSRP to generate a second candidate model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.4392310 | 0.4508095 | -3.192548 | 0.0014139 |
| Engine.HP | 0.1237154 | 0.0008132 | 152.131854 | 0.0000000 |
| highway.MPG | 0.4783881 | 0.0121680 | 39.315403 | 0.0000000 |
| Vehicle.Size0Large | -3.1078217 | 0.2209121 | -14.068139 | 0.0000000 |
| Vehicle.Size0Midsize | -1.1792477 | 0.1771386 | -6.657205 | 0.0000000 |
Residual plots are given below.
After the transformation of MSRP, the residual variance is more nearly constant than under the previous model. With a sample this large, the central limit theorem supports approximately normal sampling distributions for the coefficient estimates, so inference does not depend heavily on the residuals being normal. The residuals vs. leverage plot shows the same pattern as in the original model, suggesting that there are some observations with much higher leverage than the majority of the other observations.
A final model must now be selected from the two candidates: the original model and the model with the transformed response. Goodness-of-fit measures are computed for both candidate models to support this choice.
| | SSE | R.sq | R.adj | Cp | AIC | SBC | PRESS |
|---|---|---|---|---|---|---|---|
| full.model | 1.209389e+13 | 0.5720122 | 0.5718676 | 5 | 245585.18 | 245622.08 | 1.211704e+13 |
| transformed.model | 8.001985e+05 | 0.6785266 | 0.6784179 | 5 | 49889.95 | 49926.84 | 8.014480e+05 |
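Based on the goodness-of-fit comparison, the transformed model is the better of the two. Its R-squared and adjusted R-squared are higher (about 0.679 versus 0.572), its residuals are closer to constant variance, and it still gives a valid answer to the practical question. SSE, AIC, SBC, and PRESS depend on the scale of the response, so their much smaller values for the transformed model largely reflect the change from MSRP to MSRP^{0.36} and should not be compared directly across the two models. The transformation of the response variable must also be kept in mind when interpreting this model in the context of the practical question.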
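To make the scale dependence of these measures concrete, here is a small Python sketch that computes several of them from observed and fitted values, using common least-squares forms of AIC and SBC. The function name `fit_measures` and the toy numbers are hypothetical, and the exact formulas behind the table above are an assumption.

```python
import math


def fit_measures(y, fitted, p):
    """SSE, R-squared, adjusted R-squared, AIC and SBC for a fit with p coefficients."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_sq = 1 - sse / sst
    return {
        "SSE": sse,
        "R.sq": r_sq,
        "R.adj": 1 - (1 - r_sq) * (n - 1) / (n - p),
        "AIC": n * math.log(sse / n) + 2 * p,
        "SBC": n * math.log(sse / n) + p * math.log(n),
    }


# Hypothetical observed and fitted values.
y = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
fitted = [9.0, 13.0, 16.0, 19.0, 23.0, 30.0]

original = fit_measures(y, fitted, p=2)
# The same fit with the response expressed in units 1000 times smaller.
rescaled = fit_measures([v * 1000 for v in y], [v * 1000 for v in fitted], p=2)
```

Rescaling the response leaves `R.sq` and `R.adj` unchanged but shifts `SSE`, `AIC`, and `SBC`, which is why those three are only comparable between models that share the same response scale.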
Next we define a function to make histograms of the bootstrap regression coefficients.
The following histograms of the bootstrap estimates of regression coefficients represent the sampling distributions of the corresponding estimates in the final model.
Two normal-density curves were placed on each of the histograms.
The histograms above show that the two density curves are close to each other in every panel. We would therefore expect the significance test results and the corresponding bootstrap confidence intervals to be consistent. Next, we find the 95% bootstrap confidence interval of each regression coefficient and combine these intervals with the output of the final model.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btc.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | -1.4392 | 0.4508 | -3.1925 | 0.0014 | [ -64183.0851 , -56620.8234 ] |
| Engine.HP | 0.1237 | 0.0008 | 152.1319 | 0.0000 | [ 360.9307 , 391.2124 ] |
| highway.MPG | 0.4784 | 0.0122 | 39.3154 | 0.0000 | [ 507.4119 , 648.4028 ] |
| Vehicle.Size0Large | -3.1078 | 0.2209 | -14.0681 | 0.0000 | [ -21256.6353 , -16694.7627 ] |
| Vehicle.Size0Midsize | -1.1792 | 0.1771 | -6.6572 | 0.0000 | [ -13128.738 , -10541.7908 ] |
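In the table above, none of the 95% bootstrap confidence intervals contains zero, which is consistent with the significance tests based on the p-values. Note, however, that the intervals in the btc.ci.95 column are on the scale of the untransformed MSRP model (each one brackets the corresponding full-model estimate, for example 376.1885 for Engine.HP), whereas the estimates in the same table come from the transformed model, so the two sets of numbers are not on a common scale.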
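A case (pairs) bootstrap with percentile intervals, the kind of procedure that a column such as btc.ci.95 summarizes, can be sketched in Python as follows. This is a simplified illustration that assumes a single predictor and simulated data; the names `fit_line`, `percentile_ci`, and `case_bootstrap` are hypothetical and do not come from the original analysis.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def percentile_ci(values, level=0.95):
    """Percentile interval read from the sorted bootstrap estimates."""
    s = sorted(values)
    lo = int((1 - level) / 2 * len(s))
    hi = int((1 + level) / 2 * len(s)) - 1
    return s[lo], s[hi]


def case_bootstrap(x, y, n_boot=1000, seed=0):
    """Resample whole (x, y) rows with replacement and refit each time."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(fit_line([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes


# Simulated data (an assumption): true slope 2, noise standard deviation 2.
data_rng = random.Random(42)
x = [float(i) for i in range(1, 41)]
y = [3 + 2 * xi + data_rng.gauss(0, 2) for xi in x]

ci_low, ci_high = percentile_ci(case_bootstrap(x, y))
```

An interval that excludes zero leads to the same conclusion as a small p-value for that coefficient. The intervals must be computed from the same model, on the same response scale, as the estimates they are reported beside.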
Now that we have created bootstrap confidence intervals for the model, we will next assess the bootstrap residuals.
The distribution of the residuals from the transformed.MSRP model is depicted in the following histogram.
Next, bootstrap confidence intervals of regression coefficients are generated. These confidence intervals will be helpful later when validating the significance of the variables in the model.
Next, histograms of the residual bootstrap estimates of the regression coefficients are constructed.
The histograms show the residual bootstrap sampling distribution of each estimated regression coefficient. The normal and LOESS curves are close to each other, which indicates that inference about the significance of the variables based on p-values and on the residual bootstrap will yield the same results.
The 95% residual bootstrap confidence intervals are given in the following table.
| | Estimate | Std. Error | t value | Pr(>\|t\|) | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | -1.4392 | 0.4508 | -3.1925 | 0.0014 | [ -2.3692 , -0.4785 ] |
| Engine.HP | 0.1237 | 0.0008 | 152.1319 | 0.0000 | [ 0.1215 , 0.1261 ] |
| highway.MPG | 0.4784 | 0.0122 | 39.3154 | 0.0000 | [ 0.4549 , 0.5022 ] |
| Vehicle.Size0Large | -3.1078 | 0.2209 | -14.0681 | 0.0000 | [ 0 , 0 ] |
| Vehicle.Size0Midsize | -1.1792 | 0.1771 | -6.6572 | 0.0000 | [ -3.5584 , -2.6556 ] |
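For the intercept, Engine.HP, and highway.MPG, the residual bootstrap confidence intervals exclude zero and so agree with the p-values. This is expected because the sample size is large enough that the sampling distributions of the estimated coefficients are well approximated by normal distributions. The two vehicle size intervals appear to be misaligned in the table: the Vehicle.Size0Large row shows [ 0 , 0 ], and the interval in the Vehicle.Size0Midsize row brackets the Vehicle.Size0Large estimate (-3.1078) rather than the Vehicle.Size0Midsize estimate (-1.1792). These two rows should be rechecked before they are used for inference.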
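For comparison with the case bootstrap, a residual bootstrap keeps the predictor values fixed and resamples only the fitted model's residuals. The Python sketch below illustrates this under simplifying assumptions (one predictor, simulated data); the names `fit_line`, `percentile_ci`, and `residual_bootstrap` are hypothetical and are not the original implementation.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def percentile_ci(values, level=0.95):
    """Percentile interval read from the sorted bootstrap estimates."""
    s = sorted(values)
    lo = int((1 - level) / 2 * len(s))
    hi = int((1 + level) / 2 * len(s)) - 1
    return s[lo], s[hi]


def residual_bootstrap(x, y, n_boot=1000, seed=0):
    """Keep x and the fitted values fixed; rebuild y from resampled residuals."""
    rng = random.Random(seed)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        y_star = [fi + rng.choice(residuals) for fi in fitted]
        slopes.append(fit_line(x, y_star)[1])
    return slopes


# Simulated data (an assumption): true slope 2, noise standard deviation 2.
data_rng = random.Random(42)
x = [float(i) for i in range(1, 41)]
y = [3 + 2 * xi + data_rng.gauss(0, 2) for xi in x]

slope_hat = fit_line(x, y)[1]
ci_low, ci_high = percentile_ci(residual_bootstrap(x, y))
```

Each interval produced this way is centered near the estimate it belongs to, so a quick check on a table of results is that every interval brackets the estimate in its own row.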
Finally, all inferential statistics are combined into a single table so the results can be compared.
| | Estimate | Std. Error | Pr(>\|t\|) | btc.ci.95 | btr.ci.95 |
|---|---|---|---|---|---|
| (Intercept) | -1.4392 | 0.4508 | 0.0014 | [ -64183.0851 , -56620.8234 ] | [ -2.3692 , -0.4785 ] |
| Engine.HP | 0.1237 | 0.0008 | 0.0000 | [ 360.9307 , 391.2124 ] | [ 0.1215 , 0.1261 ] |
| highway.MPG | 0.4784 | 0.0122 | 0.0000 | [ 507.4119 , 648.4028 ] | [ 0.4549 , 0.5022 ] |
| Vehicle.Size0Large | -3.1078 | 0.2209 | 0.0000 | [ -21256.6353 , -16694.7627 ] | [ 0 , 0 ] |
| Vehicle.Size0Midsize | -1.1792 | 0.1771 | 0.0000 | [ -13128.738 , -10541.7908 ] | [ -3.5584 , -2.6556 ] |
We can explicitly write the final model as the following expression:
\[ MSRP^{0.36} = -1.4392 + 0.1237\times Engine.HP + 0.4784\times highway.MPG - 3.1078\times Vehicle.Size0Large - 1.1792\times Vehicle.Size0Midsize \]

Before we interpret the model, it is important to consider the transformation of the response variable. Because MSRP was transformed to MSRP^{0.36}, the output of this model is on the transformed scale and must be interpreted accordingly. A prediction can be returned to the original dollar scale by raising both sides of the equation to the power 1/0.36 (about 2.78). After this back-transformation, however, the coefficients no longer have the simple additive interpretation of the original untransformed model.
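As a worked example of the back-transformation, the Python sketch below plugs a hypothetical car into the fitted equation and raises the result to the power 1/0.36. The coefficients are the estimates reported above; the function name `predict_msrp` and the example car (200 horsepower, 30 highway MPG) are assumptions chosen for illustration.

```python
LAMBDA = 0.36  # Box-Cox power used for the response

# Coefficient estimates from the transformed model reported above.
COEF = {
    "(Intercept)": -1.4392,
    "Engine.HP": 0.1237,
    "highway.MPG": 0.4784,
    "Vehicle.Size0Large": -3.1078,
    "Vehicle.Size0Midsize": -1.1792,
}


def predict_msrp(hp, mpg, size):
    """Predicted MSRP in dollars: linear predictor raised to the power 1/LAMBDA."""
    eta = (COEF["(Intercept)"]
           + COEF["Engine.HP"] * hp
           + COEF["highway.MPG"] * mpg)
    if size == "Large":
        eta += COEF["Vehicle.Size0Large"]
    elif size == "Midsize":
        eta += COEF["Vehicle.Size0Midsize"]
    # eta is on the MSRP**0.36 scale and must be positive to invert the power.
    return eta ** (1 / LAMBDA)


# Hypothetical car: 200 horsepower, 30 highway MPG.
compact_price = predict_msrp(200, 30, "Compact")
large_price = predict_msrp(200, 30, "Large")
```

For this hypothetical car the transformed prediction for the compact version is -1.4392 + 0.1237 × 200 + 0.4784 × 30 = 37.6528, which back-transforms to roughly 23,800 dollars. The large version has a lower prediction because its coefficient is negative. The model is only meaningful for predictor values within the range of the data.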
The model interpretation for the transformed response is as follows:
For a one unit increase in Engine.HP (engine horsepower), with the other predictors held constant, we expect the transformed MSRP (MSRP^{0.36}) to increase by 0.1237 units.
For a one unit increase in highway.MPG (highway miles per gallon), with the other predictors held constant, we expect the transformed MSRP to increase by 0.4784 units.
Relative to the base level of vehicle size, which is “Vehicle.Size0Compact”, the coefficient of Vehicle.Size0Large is -3.1078. Because this difference is negative, we expect the transformed MSRP of a large vehicle to be 3.1078 units lower than that of a compact vehicle with the same engine horsepower and highway miles per gallon.
Relative to the base level of vehicle size, which is “Vehicle.Size0Compact”, the coefficient of Vehicle.Size0Midsize is -1.1792. Because this difference is negative, we expect the transformed MSRP of a midsize vehicle to be 1.1792 units lower than that of a compact vehicle with the same engine horsepower and highway miles per gallon.
From this model, we can deduce a few key findings. Highway miles per gallon and engine horsepower are both positively related to the transformed response variable MSRP. Both factors are often considered in the car buying process, and we would expect higher horsepower or higher highway miles per gallon to go with a higher price, so the model aligns with this intuition. For vehicle size, the coefficients for midsize and large vehicles are negative, so with horsepower and highway miles per gallon held constant the model predicts a lower transformed MSRP for larger vehicles than for compact ones. This does not contradict the everyday observation that larger cars tend to cost more, because that comparison does not hold the other predictors fixed. The coefficients describe the size effect only after adjusting for engine horsepower and highway miles per gallon.
This model requires a transformation of the response variable, which is a drawback for practical interpretation. A data set that aligns more closely with the assumptions of linear regression would yield a more practical interpretation. The model can be applied to cars whose characteristics fall within the ranges observed in the data set. Using it for cars outside those ranges will likely yield misleading or wrong results.