More than one million motorcycles are sold annually (www.webbikeworld.com). Off-road motorcycles (often called “dirt bikes”) make up a highly specialized market segment (about 18%) that offers great variation in features. This makes it a good segment for studying which features account for the cost (manufacturer’s suggested retail price, MSRP) of a dirt bike. Researchers collected data on a random sample of 2005-model dirt bikes.
Getting to know my data
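The data are assumed to be in a local CSV file; the file name below is hypothetical, so replace it with the actual path before running. Then names(x) lists the variables.
# Load the 2005 dirt bike data (hypothetical file name; adjust to your path)
x <- read.csv('dirt_bikes_2005.csv', stringsAsFactors = FALSE)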
names(x)
## [1] "Year" "Manufacturer" "Model" "MSRP"
## [5] "Displacement" "Engine.Type" "Cooling" "Fuel.System"
## [9] "Ignition" "Starting.System" "Transmission" "Wheel.Base"
## [13] "Seat.Height" "Front.Suspension" "Rear.Suspension" "Front.Brake"
## [17] "Rear.Brake" "Front.Tire" "Rear.Tire" "Fuel.Capacity"
## [21] "Dry.Weight" "Bore" "Stroke" "Ratio"
## [25] "Weight" "Rake" "Trail" "Tank"
## [29] "Air.Cooled" "Engine.cooling"
Finding quantitative variables
Based on the scatterplots shown below, Displacement and Bore have roughly linear relationships with MSRP, so they can be used as predictors. The other two candidates do not: Wheel.Base shows a bend (violating the linearity assumption), and Trail shows a fan shape (violating the equal-variance assumption).
# Scatterplots of MSRP against four candidate predictors
par(mfrow = c(2, 2))
plot(MSRP ~ Wheel.Base, data = x, xlab = 'Wheelbase', ylab = 'MSRP')
plot(MSRP ~ Displacement, data = x, xlab = 'Displacement', ylab = 'MSRP')
plot(MSRP ~ Bore, data = x, xlab = 'Bore', ylab = 'MSRP')
plot(MSRP ~ Trail, data = x, xlab = 'Trail', ylab = 'MSRP')
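To back up the visual impressions with numbers, the pairwise correlations between MSRP and the candidate predictors can be computed (a quick sketch; it assumes these columns are numeric and uses complete cases only):
# Correlation matrix of MSRP and the four candidate predictors
cor(x[, c('MSRP', 'Displacement', 'Bore', 'Wheel.Base', 'Trail')],
    use = 'complete.obs')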
Interpreting values in a regression summary output in R
The fitted model is \(\widehat{MSRP} = -323.299 + 4.382\,Displacement + 82.908\,Bore\).
\(R^2 = 0.7224\); adjusted \(R^2 = 0.7174\).
m <- lm(MSRP ~ Displacement + Bore, data = x)
summary(m)
##
## Call:
## lm(formula = MSRP ~ Displacement + Bore, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2467.5 -1158.6 152.1 907.2 2401.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -323.299 1196.613 -0.270 0.7875
## Displacement 4.382 4.108 1.067 0.2884
## Bore 82.908 30.544 2.714 0.0077 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1299 on 111 degrees of freedom
## Multiple R-squared: 0.7224, Adjusted R-squared: 0.7174
## F-statistic: 144.4 on 2 and 111 DF, p-value: < 2.2e-16
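As an illustration of using the fitted equation, predict() returns an estimated MSRP for a new bike; the displacement (cc) and bore (mm) values below are made up purely for demonstration:
# Predicted MSRP for a hypothetical bike (250 cc displacement, 77 mm bore)
predict(m, newdata = data.frame(Displacement = 250, Bore = 77),
        interval = 'prediction')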
Checking the 4 assumptions required in multiple regression
Linearity Assumption: The residual plot does not show any bend or other nonlinearities. So, the assumption is satisfied.
Independence Assumption: The data are a random sample of dirt bikes, so the motorcycles are independent of each other, and the assumption is satisfied.
Equal Variance Assumption: The residual plot shows equal spread around 0, so the assumption is satisfied.
Normality Assumption: The distribution of residuals is unimodal, slightly right-skewed, and has no outliers. So we don’t need to worry about this assumption, especially with a large sample (n = 114).
# Residuals vs. fitted values, with a reference line at zero
plot(m$fitted.values, m$residuals, xlab = 'Fitted Value', ylab = 'Residual')
abline(0, 0)
# Histogram and normal Q-Q plot of the residuals, side by side
par(mfrow = c(1, 2))
hist(m$residuals, main = 'Histogram of Residuals', xlab = 'Residual')
qqnorm(m$residuals)
qqline(m$residuals)
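As an optional complement to the histogram and normal Q-Q plot, a formal normality test of the residuals can be run (a sketch; with n = 114, the graphical checks above are usually sufficient):
# Shapiro-Wilk test (null hypothesis: the residuals are normally distributed)
shapiro.test(m$residuals)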
Conducting an overall regression test to see if the fitted multiple regression model is statistically useful
We test \(H_0: \beta_1 = \beta_2 = 0\) against \(H_a:\) at least one \(\beta_j \neq 0\).
F-test of the linear relationship. By the critical-value method: reject \(H_0\) if the F-statistic is greater than \(F_{crit}\); otherwise do not reject it. By the p-value method: reject \(H_0\) if the p-value is less than \(\alpha\); otherwise do not reject it.
Bigger F values mean smaller p-values; if \(H_0\) is true, F will be near 1.
From the output, the F-statistic is 144.4 and its p-value is < 2.2e-16. We therefore reject \(H_0\) and conclude that at least one predictor in the model is useful.
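To make the critical-value method concrete, the cutoff \(F_{crit}\) at \(\alpha = 0.05\) with 2 and 111 degrees of freedom, and the p-value of the observed statistic, can be computed directly (a quick sketch using values from the output above):
# Critical value at alpha = 0.05 with df1 = 2 and df2 = 111
qf(0.95, df1 = 2, df2 = 111)
# Upper-tail p-value of the observed F-statistic
pf(144.4, df1 = 2, df2 = 111, lower.tail = FALSE)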
To find which predictors make a significant contribution to MSRP, we conduct a t-test on the slope of each predictor (\(H_0: \beta_j = 0\) against \(H_a: \beta_j \neq 0\)). From the output above (summary(m)), the t-test for Displacement has a p-value of 0.2884, so we fail to reject \(H_0\). The t-test for Bore has a p-value of 0.0077, so we reject \(H_0\) at a significance level of 5%.
As a result, Bore makes a significant contribution to MSRP but Displacement does not, when both are in the model. This does not mean Displacement is not a useful predictor by itself; it is just not useful when Bore is already in the model.
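These slope tests can also be pulled out programmatically; a minimal sketch using the coefficient table and confint():
# Coefficient estimates, standard errors, t values, and p-values
summary(m)$coefficients
# 95% confidence intervals for the intercept and slopes
confint(m, level = 0.95)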
Improving \(R^{2}\) by proposing a new multiple regression model
\(R^{2}\) = 0.8676, an increase over our former model. However, the significance of the individual predictors has changed.
Displacement and Wheel.Base are now statistically significant while Bore is no longer significant. This means that Bore contributes little to the model given that Displacement and Wheel.Base are in it.
new_m <- lm(MSRP ~ Displacement + Bore + Wheel.Base, data = x)
summary(new_m)
##
## Call:
## lm(formula = MSRP ~ Displacement + Bore + Wheel.Base, data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2399.75 -460.92 49.53 534.95 2166.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6535.435 1004.368 -6.507 2.35e-09 ***
## Displacement 9.616 2.889 3.328 0.00119 **
## Bore -14.462 22.967 -0.630 0.53021
## Wheel.Base 216.388 19.697 10.986 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 900.7 on 110 degrees of freedom
## Multiple R-squared: 0.8676, Adjusted R-squared: 0.864
## F-statistic: 240.3 on 3 and 110 DF, p-value: < 2.2e-16
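As a check on the reported adjusted \(R^2\), it can be recomputed from \(R^2 = 0.8676\) with \(n = 114\) observations and \(p = 3\) predictors via \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1}\):
# Recompute adjusted R-squared from multiple R-squared (n = 114, p = 3)
n <- 114; p <- 3; r2 <- 0.8676
1 - (1 - r2) * (n - 1) / (n - p - 1)  # about 0.864, matching the output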
# Residuals vs. fitted values for the new model, with a reference line at zero
plot(new_m$fitted.values, new_m$residuals, xlab = 'Fitted', ylab = 'Residuals')
abline(0, 0)
# Histogram and normal Q-Q plot of the new model's residuals, side by side
par(mfrow = c(1, 2))
hist(new_m$residuals, main = 'Histogram of Residuals', xlab = 'Residuals')
qqnorm(new_m$residuals)
qqline(new_m$residuals)
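Finally, to test formally whether adding Wheel.Base improves on the two-predictor model, a partial F-test comparing the two nested models can be run (a sketch; output not shown):
# Partial F-test: does the three-predictor model significantly outperform m?
anova(m, new_m)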