An analyst for the auto industry has asked for your help in modeling data on the prices of new cars. Interest centers on modeling suggested retail price as a function of the cost to the dealer for 234 new cars. The first model fit to the data (model 1) was: \(\text{Suggested Retail Price} = \beta_0 + \beta_1 \, \text{Dealer Cost} + e\)
Based on the output for model 1, the analyst concluded the following. Provide a detailed critique of this conclusion.
##
## Call:
## lm(formula = SuggestedRetailPrice ~ DealerCost, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2941.9  -305.3   -41.1   266.3  4959.4
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.558e+02  7.739e+01  -3.305  0.00103 **
## DealerCost   1.100e+00  2.223e-03 494.939  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 810.6 on 426 degrees of freedom
## Multiple R-squared: 0.9983, Adjusted R-squared: 0.9983
## F-statistic: 2.45e+05 on 1 and 426 DF, p-value: < 2.2e-16
The following can be read from the output above: the R-squared is 0.9983, which means the model explains about 99.8% of the variability in Suggested Retail Price. The p-value for the coefficient of Dealer Cost is very small (< 2e-16), so the null hypothesis that the slope is zero can be rejected.
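# Model 1, as given in the Call line of the output above
fit1 <- lm(SuggestedRetailPrice ~ DealerCost, data = data)
par(mfrow = c(2, 2))  # show the four diagnostic plots in a 2 x 2 grid
plot(fit1)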
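However, the diagnostic plots above tell a different story. Residuals vs Fitted shows that the residuals are not evenly scattered around zero, and Scale-Location shows the same problem: the spread of the residuals is not constant. A high R-squared and a significant slope say little about whether the regression assumptions hold, so a conclusion based on them alone might be misleading.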
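To illustrate why a high R-squared does not by itself validate the model, here is a minimal sketch on simulated data. The data, the variable names (`dealer_cost`, `retail_price`, `sim_fit`) and the error structure are assumptions made for illustration; they are not the data set analysed here.

```r
# Hypothetical data: simulated, not the cars data used in this document.
set.seed(1)
n <- 428
dealer_cost <- runif(n, 10000, 90000)
# The error standard deviation grows with cost, so constant variance fails by design.
retail_price <- -250 + 1.1 * dealer_cost + rnorm(n, sd = 0.02 * dealer_cost)

sim_fit <- lm(retail_price ~ dealer_cost)
r_squared <- summary(sim_fit)$r.squared

# Rough check for non-constant variance:
# do the absolute residuals grow with the fitted values?
spread_trend <- cor(abs(resid(sim_fit)), fitted(sim_fit))

cat("R-squared:", round(r_squared, 4), "\n")
cat("cor(|residual|, fitted):", round(spread_trend, 2), "\n")
```

In this sketch the R-squared is very high even though the constant-variance assumption is violated, which is the pattern the diagnostic plots are meant to catch.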
Carefully describe all the shortcomings evident in model 1. For each shortcoming, describe the steps needed to overcome the shortcoming.
· Although most residuals in Residuals vs Fitted lie reasonably close to zero, there is at least one leverage point that requires further examination. Residuals vs Leverage confirms this, and at least two points have a high Cook’s distance. These points should be investigated and, if justified, removed.
· Many points in Residuals vs Fitted have large residuals; according to the Residuals summary they range from about -2900 to 5000. Ideally the residuals would be evenly distributed around 0 with a constant spread.
· Residuals vs Fitted shows the points following three trends: toward the upper right, toward the lower right, and through the middle. We need to examine whether there are any categorical variables that should be included in a new model.
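The leverage and Cook’s distance checks described above can be sketched numerically as follows. This is a hedged example on simulated data with one deliberately unusual car; the cut-offs 4/n and 4/(n - 2) are common rules of thumb for simple linear regression, assumed here rather than taken from this document.

```r
# Hypothetical data: simulated, with one unusual car planted as the last row.
set.seed(2)
n <- 428
dealer_cost <- c(runif(n - 1, 10000, 50000), 200000)
retail_price <- -250 + 1.1 * dealer_cost + rnorm(n, sd = 800)
retail_price[n] <- retail_price[n] + 5000  # push the planted point off the line

sim_fit <- lm(retail_price ~ dealer_cost)

# Assumed rule-of-thumb cut-offs (not hard rules):
# leverage above 4/n and Cook's distance above 4/(n - 2).
leverage <- hatvalues(sim_fit)
cooks_d <- cooks.distance(sim_fit)
high_leverage <- which(leverage > 4 / n)
influential <- which(cooks_d > 4 / (n - 2))

print(high_leverage)  # row numbers of high-leverage points
print(influential)    # row numbers of influential points
```

Applied to the real model, the same calls would be made on `fit1`; flagged rows are candidates for closer inspection, not automatic removal.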
Is model 2 an improvement over model 1 in terms of predicting Suggested Retail Price? If so, please describe all the ways in which it is an improvement.
YES
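· Model 2’s Residuals vs Fitted shows residuals scattered more evenly around 0. According to the Residuals summary for model 2, they range from about -0.06 to 0.05; these are on the log scale, so the range is not directly comparable with model 1’s residuals in dollars, but the pattern is clearly better. The leverage point is also less obvious in model 2.
· Model 2’s Scale-Location plot is also better than model 1’s: the sqrt(|standardized residuals|) values are smaller and their spread is more constant.
· No point in Residuals vs Leverage stands out clearly as a leverage point, and there are fewer outliers and influential points.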
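Since the question is about predicting Suggested Retail Price, it can help to see how prediction intervals from a log-log model look once they are converted back to dollars. The sketch below uses simulated data; `sim`, `sim_fit2`, `new_cars` and the two example costs are hypothetical choices for illustration.

```r
# Hypothetical data: simulated with multiplicative errors, not the document's data.
set.seed(3)
n <- 428
sim <- data.frame(DealerCost = runif(n, 10000, 90000))
sim$SuggestedRetailPrice <-
  exp(-0.09 + 1.017 * log(sim$DealerCost) + rnorm(n, sd = 0.02))

sim_fit2 <- lm(log(SuggestedRetailPrice) ~ log(DealerCost), data = sim)

new_cars <- data.frame(DealerCost = c(15000, 60000))
# Prediction intervals are computed on the log scale, then exponentiated to dollars.
log_pi <- predict(sim_fit2, newdata = new_cars, interval = "prediction", level = 0.95)
dollar_pi <- exp(log_pi)

result <- data.frame(new_cars, dollar_pi,
                     width = dollar_pi[, "upr"] - dollar_pi[, "lwr"])
print(result)
```

On the dollar scale the interval for the more expensive car is wider, which matches a spread that grows with price; a model fit on the raw scale assumes the same spread for every car.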
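# Model 2, as given in the Call line of its summary output
fit2 <- lm(log(SuggestedRetailPrice) ~ log(DealerCost), data = data)
par(mfrow = c(2, 2))  # show the four diagnostic plots in a 2 x 2 grid
plot(fit2)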
Interpret the estimated coefficient of log(Dealer Cost) in model2.
summary(fit2)
##
## Call:
## lm(formula = log(SuggestedRetailPrice) ~ log(DealerCost), data = data)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.064464 -0.010656 -0.001066  0.011423  0.051499
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     -0.092902   0.021075  -4.408 1.32e-05 ***
## log(DealerCost)  1.017400   0.002067 492.227  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02041 on 426 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.9982
## F-statistic: 2.423e+05 on 1 and 426 DF, p-value: < 2.2e-16
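According to the output, the fitted model 2 is \(\log(\text{Suggested Retail Price}) = -0.093 + 1.017 \log(\text{Dealer Cost})\), where \(\hat\beta_0 = -0.093\) and \(\hat\beta_1 = 1.017\).
This means the predicted log(Suggested Retail Price) is -0.093 plus 1.017 times log(Dealer Cost). Because both variables are on the log scale, the slope is an elasticity: a 1% increase in Dealer Cost is associated with an increase of approximately 1.02% in Suggested Retail Price.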
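Because both variables are logged, the slope can be read as a percentage effect. A short derivation, using the estimate 1.0174 from the output and a 1% increase in Dealer Cost as an illustrative case (SRP and DC are shorthand introduced here for Suggested Retail Price and Dealer Cost):

\[
\log(\widehat{\text{SRP}}_{\text{new}}) - \log(\widehat{\text{SRP}}_{\text{old}})
= 1.0174 \, [\log(1.01 \, \text{DC}) - \log(\text{DC})]
= 1.0174 \, \log(1.01)
\]

\[
\frac{\widehat{\text{SRP}}_{\text{new}}}{\widehat{\text{SRP}}_{\text{old}}}
= 1.01^{1.0174} \approx 1.0102
\]

So a 1% increase in Dealer Cost corresponds to an increase of roughly 1.02% in the predicted Suggested Retail Price. The intercept cancels in the difference and plays no part in this interpretation.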
List any weaknesses apparent in model2.
Although model 2 is better than model 1, there is still at least one leverage point that may affect the fitted model and needs to be examined further.
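One way to examine such a point further is a simple sensitivity check: refit the model without it and compare the coefficients. The sketch below is run on simulated data with a planted high-leverage car; the data and the names `sim`, `full_fit`, `reduced_fit` are hypothetical.

```r
# Hypothetical data: simulated, with one high-leverage car planted as the last row.
set.seed(4)
n <- 428
sim <- data.frame(DealerCost = c(runif(n - 1, 10000, 50000), 200000))
sim$SuggestedRetailPrice <-
  exp(-0.09 + 1.017 * log(sim$DealerCost) + rnorm(n, sd = 0.02))

full_fit <- lm(log(SuggestedRetailPrice) ~ log(DealerCost), data = sim)

# Drop the single point with the largest leverage and refit.
drop_row <- which.max(hatvalues(full_fit))
reduced_fit <- lm(log(SuggestedRetailPrice) ~ log(DealerCost),
                  data = sim[-drop_row, ])

# Side-by-side coefficients; a large change would suggest the point is influential.
comparison <- cbind(full = coef(full_fit), reduced = coef(reduced_fit))
print(comparison)
```

If the coefficients barely move, the point has high leverage but little influence; if they shift noticeably, the point deserves a closer look before the model is used for prediction.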