More than one million motorcycles are sold annually (www.webbikeworld.com). Off-road motorcycles (often called “dirt bikes”) make up a highly specialized market segment (about 18%) that offers great variation in features. This makes it a good segment for studying which features account for the cost (manufacturer’s suggested retail price, MSRP) of a dirt bike. Researchers collected data on a random sample of 2005-model dirt bikes.
Getting to know my data
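The data are assumed to be in a local CSV file; the file name below is hypothetical, so replace it with the actual path before running. Then names(x) lists the variables.
# Load the 2005 dirt bike data (hypothetical file name; adjust to your path)
x <- read.csv('dirt_bikes_2005.csv', stringsAsFactors = FALSE)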
names(x)
## [1] "Year" "Manufacturer" "Model" "MSRP"
## [5] "Displacement" "Engine.Type" "Cooling" "Fuel.System"
## [9] "Ignition" "Starting.System" "Transmission" "Wheel.Base"
## [13] "Seat.Height" "Front.Suspension" "Rear.Suspension" "Front.Brake"
## [17] "Rear.Brake" "Front.Tire" "Rear.Tire" "Fuel.Capacity"
## [21] "Dry.Weight" "Bore" "Stroke" "Ratio"
## [25] "Weight" "Rake" "Trail" "Tank"
## [29] "Air.Cooled" "Engine.cooling"
Finding quantitative variables
Based on the scatterplots shown below, Displacement and Bore have roughly linear relationships with MSRP, so they can be used as predictors. The other two candidates do not: Wheel.Base shows a bend (violating the linearity assumption), and Trail shows a fan shape (violating the equal-variance assumption).
# Scatterplots of MSRP against four candidate predictors
par(mfrow = c(2, 2))
plot(MSRP ~ Wheel.Base, data = x, xlab = 'Wheelbase', ylab = 'MSRP')
plot(MSRP ~ Displacement, data = x, xlab = 'Displacement', ylab = 'MSRP')
plot(MSRP ~ Bore, data = x, xlab = 'Bore', ylab = 'MSRP')
plot(MSRP ~ Trail, data = x, xlab = 'Trail', ylab = 'MSRP')
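To back up the visual impressions with numbers, the pairwise correlations between MSRP and the candidate predictors can be computed (a quick sketch; it assumes these columns are numeric and uses complete cases only):
# Correlation matrix of MSRP and the four candidate predictors
cor(x[, c('MSRP', 'Displacement', 'Bore', 'Wheel.Base', 'Trail')],
    use = 'complete.obs')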
Interpreting values in a regression summary output in R
The fitted model is \(\widehat{MSRP} = -323.299 + 4.382\,Displacement + 82.908\,Bore\).
\(R^2 = 0.7224\); adjusted \(R^2 = 0.7174\).
m <- lm(MSRP ~ Displacement + Bore, data = x)
summary(m)
##
## Call:
## lm(formula = MSRP ~ Displacement + Bore, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2467.5 -1158.6 152.1 907.2 2401.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -323.299 1196.613 -0.270 0.7875
## Displacement 4.382 4.108 1.067 0.2884
## Bore 82.908 30.544 2.714 0.0077 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1299 on 111 degrees of freedom
## Multiple R-squared: 0.7224, Adjusted R-squared: 0.7174
## F-statistic: 144.4 on 2 and 111 DF, p-value: < 2.2e-16
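As an illustration of using the fitted equation, predict() returns an estimated MSRP for a new bike; the displacement (cc) and bore (mm) values below are made up purely for demonstration:
# Predicted MSRP for a hypothetical bike (250 cc displacement, 77 mm bore)
predict(m, newdata = data.frame(Displacement = 250, Bore = 77),
        interval = 'prediction')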
Checking the 4 assumptions required in multiple regression
Linearity Assumption: The residual plot does not show any bend or other nonlinearities. So, the assumption is satisfied.
Independence Assumption: The data are a random sample of dirt bikes, so the motorcycles are independent of each other, and the assumption is satisfied.
Equal Variance Assumption: The residual plot shows equal spread around 0, so the assumption is satisfied.
Normality Assumption: The distribution of residuals is unimodal, slightly right-skewed, and has no outliers. So we don’t need to worry about this assumption, especially with a large sample (n = 114).
# Residuals vs. fitted values, with a reference line at zero
plot(m$fitted.values, m$residuals, xlab = 'Fitted Value', ylab = 'Residual')
abline(0, 0)
# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow = c(1, 2))
hist(m$residuals, main = 'Histogram of Residuals', xlab = 'Residual')
qqnorm(m$residuals)
qqline(m$residuals)
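As an optional complement to the histogram and normal Q-Q plot, a formal normality test of the residuals can be run (a sketch; with n = 114, the graphical checks above are usually sufficient):
# Shapiro-Wilk test (null hypothesis: the residuals are normally distributed)
shapiro.test(m$residuals)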
Conducting an overall regression test to see if the fitted multiple regression model is statistically useful
We test \(H_0: \beta_1 = \beta_2 = 0\) against \(H_a:\) at least one \(\beta_j \neq 0\).
F-test of the linear relationship. By the critical-value method: reject \(H_0\) if the F-statistic is greater than \(F_{crit}\); otherwise do not reject it. By the p-value method: reject \(H_0\) if the p-value is less than \(\alpha\); otherwise do not reject it.
Bigger F values mean smaller p-values; if \(H_0\) is true, F will be near 1.
From the output, the F-statistic is 144.4 and its p-value is < 2.2e-16. We therefore reject \(H_0\) and conclude that at least one predictor in the model is useful.
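To make the critical-value method concrete, the cutoff \(F_{crit}\) at \(\alpha = 0.05\) with 2 and 111 degrees of freedom, and the p-value of the observed statistic, can be computed directly (a quick sketch using values from the output above):
# Critical value at alpha = 0.05 with df1 = 2 and df2 = 111
qf(0.95, df1 = 2, df2 = 111)
# Upper-tail p-value of the observed F-statistic
pf(144.4, df1 = 2, df2 = 111, lower.tail = FALSE)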
To find which predictors make a significant contribution to MSRP, we conduct a t-test on the slope of each predictor (\(H_0: \beta_j = 0\) against \(H_a: \beta_j \neq 0\)). From the output above (summary(m)), the t-test for Displacement has a p-value of 0.2884, so we fail to reject \(H_0\). The t-test for Bore has a p-value of 0.0077, so we reject \(H_0\) at a significance level of 5%.
As a result, Bore makes a significant contribution to MSRP but Displacement does not, when both are in the model. This does not mean Displacement is not a useful predictor by itself; it is just not useful when Bore is already in the model.
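These slope tests can also be pulled out programmatically; a minimal sketch using the coefficient table and confint():
# Coefficient estimates, standard errors, t values, and p-values
summary(m)$coefficients
# 95% confidence intervals for the intercept and slopes
confint(m, level = 0.95)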
Improving \(R^{2}\) by proposing a new multiple regression model
\(R^{2}\) = 0.8676, an increase over our former model. However, the significance of the individual predictors has changed.
Displacement and Wheel.Base are now statistically significant while Bore is no longer significant. This means that Bore contributes little to the model given that Displacement and Wheel.Base are in it.
new_m <- lm(MSRP ~ Displacement + Bore + Wheel.Base, data = x)
summary(new_m)
##
## Call:
## lm(formula = MSRP ~ Displacement + Bore + Wheel.Base, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2399.75 -460.92 49.53 534.95 2166.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6535.435 1004.368 -6.507 2.35e-09 ***
## Displacement 9.616 2.889 3.328 0.00119 **
## Bore -14.462 22.967 -0.630 0.53021
## Wheel.Base 216.388 19.697 10.986 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 900.7 on 110 degrees of freedom
## Multiple R-squared: 0.8676, Adjusted R-squared: 0.864
## F-statistic: 240.3 on 3 and 110 DF, p-value: < 2.2e-16
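As a check on the reported adjusted \(R^2\), it can be recomputed from \(R^2 = 0.8676\) with \(n = 114\) observations and \(p = 3\) predictors via \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\):
# Recompute adjusted R-squared from multiple R-squared (n = 114, p = 3)
n <- 114; p <- 3; r2 <- 0.8676
1 - (1 - r2) * (n - 1) / (n - p - 1)  # about 0.864, matching the output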
# Residuals vs. fitted values for the new model, with a reference line at zero
plot(new_m$fitted.values, new_m$residuals, xlab = 'Fitted', ylab = 'Residuals')
abline(0, 0)
# Histogram and normal Q-Q plot of the new model's residuals, side by side
par(mfrow = c(1, 2))
hist(new_m$residuals, main = 'Histogram of Residuals', xlab = 'Residuals')
qqnorm(new_m$residuals)
qqline(new_m$residuals)
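Finally, to test formally whether adding Wheel.Base improves on the two-predictor model, a partial F-test comparing the two nested models can be run (a sketch; output not shown):
# Partial F-test: does the three-predictor model significantly outperform m?
anova(m, new_m)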