Project 2: Multiple Linear Regression

Dirt Bike

Jane Braun

RPI

March 12th - Initial Draft - Version 1.0

1. Data

Data Selection.

The data was selected from the 100+ datasets, and is part of Carnegie Melon’s data library. The dataset is called “Dirt Bike Data”, and is a summary of all off-road dirt bikes available in the United States. The data provides information about the bike, including engine size, price, frame size, etc.

Reading In and Cleaning Data

x <- read.csv("~/Spring2015/Applied Regression/dirtbike.csv")
x <- x[c("msrp", "displacement", "bore", "stroke", "wheelbase", "seat.height", "fuel.cap", "total.weight")]
x <- x[c("msrp", "bore", "stroke", "wheelbase", "fuel.cap", "total.weight")]
x$bore <- as.numeric(x$bore)
x$msrp <- as.numeric(x$msrp)

head(x)
##   msrp bore stroke wheelbase fuel.cap total.weight
## 1 2999   48     48      49.1     1.35          144
## 2 3099   48     48      49.1     1.35          145
## 3 2249   53     45      49.2     1.50          165
## 4 2999   64     49      52.2     2.20          216
## 5 3499   66     66      54.1     2.20          238
## 6 5899   78     52      58.2     2.00          204

Data Organization

Oringinally in this dataset, there are 104 observations with 30 different columns each. Unfortunately, some of the data was missing from a large portion of these variables. Therefore, the data was cleaned down to include only variables with complete data. This ended up being 6 variables.

All patient days and in hundreds, and all numeric values are in hundreds of dollars.

summary(x)
##       msrp            bore            stroke       wheelbase    
##  Min.   : 1079   Min.   : 39.00   Min.   :38.0   Min.   :33.70  
##  1st Qu.: 3099   1st Qu.: 53.75   1st Qu.:49.0   1st Qu.:49.75  
##  Median : 5248   Median : 70.00   Median :58.0   Median :57.00  
##  Mean   : 4928   Mean   : 70.84   Mean   :57.6   Mean   :53.18  
##  3rd Qu.: 6398   3rd Qu.: 91.25   3rd Qu.:63.0   3rd Qu.:58.25  
##  Max.   :19500   Max.   :100.00   Max.   :83.0   Max.   :58.90  
##                                                  NA's   :1      
##     fuel.cap      total.weight  
##  Min.   :0.000   Min.   : 82.0  
##  1st Qu.:1.350   1st Qu.:147.9  
##  Median :2.100   Median :218.5  
##  Mean   :1.964   Mean   :201.7  
##  3rd Qu.:2.400   3rd Qu.:241.5  
##  Max.   :6.100   Max.   :337.0  
##  NA's   :1       NA's   :1
str(x)
## 'data.frame':    104 obs. of  6 variables:
##  $ msrp        : num  2999 3099 2249 2999 3499 ...
##  $ bore        : num  48 48 53 64 66 78 78 96 96 39 ...
##  $ stroke      : num  48 48 45 49 66 52 52 62 62 41 ...
##  $ wheelbase   : num  49.1 49.1 49.2 52.2 54.1 58.2 58.6 58.7 58.2 36 ...
##  $ fuel.cap    : num  1.35 1.35 1.5 2.2 2.2 2 2.2 1.9 2.27 0.8 ...
##  $ total.weight: num  144 145 165 216 238 ...

Continuous variables

All of the variables that were included in this analysis are continuous variables.

Response variables

The response variable that I am interested in is the MSRP, or the Manufacturer’s Suggested Retail Price, and how variation in some other factors may be able to predict the variation in this price. This variable is measured US dollars.

Initial Plot and Outlier Analysis

The boxplot below shows the values of MSRP for the dataset. As seen below, most of the data appears to be centered around $5,000. However, there also appears to be a single outlier, which has an MSRP much greater than all the others. This outlier will be analyzed further.

bplot <- boxplot(x$msrp, main = "MSRP of Dirt Bikes", ylab = "Dollars ($)")

bplot
## $stats
##        [,1]
## [1,] 1079.0
## [2,] 3099.0
## [3,] 5248.5
## [4,] 6398.0
## [5,] 8998.0
## 
## $n
## [1] 104
## 
## $conf
##         [,1]
## [1,] 4737.38
## [2,] 5759.62
## 
## $out
## [1] 19500
## 
## $group
## [1] 1
## 
## $names
## [1] ""

The boxplot function allows you to view the outliers of the dataset that you have just plotted.

One value has been identified as an outlier:

bplot$out
## [1] 19500

This data point has an MSRP value of $19,500.

While removing outliers is a matter of personal discretion, it is better to have a good model of the majority of the points, rather than an “okay” model of all of the points. For this reason, I will remove this outlier for the remainder of the analysis.

x <-x[x$msrp!=19500,]
bplot2 <- boxplot(x$msrp)

bplot2$out
## numeric(0)

As seen above, the outlier has been removed, and the boxplot function no longer returns a value for an outlier.

2. Multiple Linear Model

Null Hypothesis

My null hypothesis is that the variation in MSRP of the off-road dirt bikes cannot be explained by anything other than randomization.

Scattergram of Variables

plot(x)

The plot above shows each of the variables included in this dataset, and their relationship to each of the other variables. Since we are primarily interested in MSRP as a response variable, the top row of plots are of interest.

As seen in the plot, some of the variable have stronger relationships than others. “Bore” appears to have a fairly strong, positive, linear relationship, whereas the relationship between “total weight” and “msrp” is slightly less clear, though probably slightly positive.

Building the Hierarchical Linear Model

The model will be built hierarchically based on what independent variables appear to be most correlated with the dependent variable, MSRP.

The following correlation matrix shows the correlation between each of the variables and themselves. To narrow down the size of the matrix, only the variables of interest were displayed. As seen below, each variable is obviously perfectly correlated with itself - with a correlation of 1.

x <- x[c("msrp", "bore", "wheelbase", "fuel.cap")]
cor(x, use="complete")
##                msrp      bore wheelbase  fuel.cap
## msrp      1.0000000 0.8410597 0.8389980 0.4726100
## bore      0.8410597 1.0000000 0.7918625 0.6368591
## wheelbase 0.8389980 0.7918625 1.0000000 0.6982879
## fuel.cap  0.4726100 0.6368591 0.6982879 1.0000000

Linear Model with 1 Independent Variable

Based on the correlation matrix, Bore and MSRP are most highly correlated, at .841. We will begin building the model with bore as the primary independent variable.

Bore is a measure of an internal combustion engine in a motorcycle. The value of a bike’s bore can affect the speed and power. The larger the bore, the more powerful the bike. Bore is measured in inches and is the diameter of the chambers in an engine cylinder.

This linear model analyzes if the variation in MSRP can be expained by the variation in bore.

We use the “lm” function to calculate the linear model.

“MSRP ~ bore” reads as: “MSRP is explained by bore”

Model
attach(x)
x <- x[order(x$msrp),]
model1 <- lm(x$msrp ~ x$bore)

summary(model1)
## 
## Call:
## lm(formula = x$msrp ~ x$bore)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2244.57  -906.79   -27.89   763.42  1994.64 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1109.988    392.573  -2.827  0.00566 ** 
## x$bore         83.536      5.355  15.600  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1077 on 101 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.7038 
## F-statistic: 243.4 on 1 and 101 DF,  p-value: < 2.2e-16
Interpretation

For this simple linear model, the estimate of the y-intercept is -1,109, and the slope of the model is 83.536. This means that for a bore of value 0, the MSRP is $-1,109. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0, and that bore must have a fairly significant value for the bike to be worth anything. With a slope of 83.536, this says that for each additional unit of bore (in inches), the MSRP increases by $83.536.

The R^2 value for the model is 0.7067. Generally, a larger R^2 value indicates a stronger relationship between the independent and dependent variables, where an R^2 of 1 would indicate a perfect correlation, and a 0 would indicate zero correlation. With this R^2 value of 0.7067, we can say that variation in the independent variable, bore size, can explain 70.67% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variable. This may, however, increase as we include more variables.

Another indicator of the strength of the model is the p-value, which in this case is 2 x 10^-16. This is a very small value, and it is the probability that the F-statistic (in this case 243.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-value is less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size may help to explain the variation in MSRP.

Scatterplot with Regression
par(mfrow=c(1,1))
plot(msrp ~ bore, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Bore with Regression Line", xlab = "Bore (in inches)", ylab = "MSRP (in dollars)")
abline(model1, lwd=2.5, col='dodgerblue4')

Linear Model with 2 Independent Variables

To continue building this model, the next most highly correlated independent variable based on the correlation matrix is the wheelbase.

Wheelbase is the distance between the front and rear wheel of a motorcycle, and can affect the function of the bike (sport/trick bikes versus leisure bikes). Wheelbase is measured in inches.

Model
model2 <- lm(x$msrp ~ x$bore + x$wheelbase)

summary(model2)
## 
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1916.26  -792.94   -12.21   690.55  2175.73 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5587.318    807.084  -6.923 4.47e-10 ***
## x$bore         47.049      7.532   6.247 1.06e-08 ***
## x$wheelbase   132.839     21.720   6.116 1.92e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 925 on 99 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7876, Adjusted R-squared:  0.7833 
## F-statistic: 183.6 on 2 and 99 DF,  p-value: < 2.2e-16
Interpretation

For this multiple linear model, the estimate of the y-intercept is -5,587, and the slope of bore is 47.049, and the slope of wheelbase is 132.839. This means that for a bore size and wheelbase size, both of value 0, the MSRP is $-5,587. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0 and wheelbase size of 0. With a slope of bore of 47.049, this says that for each additional unit of bore (in inches), the MSRP increases by $47.049, and for every additional inch in wheelbase, the MSRP increases by $132.839.

The adjusted R^2 value for the model is 0.7833. This is an increase from our model with bore as the only independent variable, meaning that it was appropriate to add wheelbase size to the model. With this R^2 value of 0.7833, we can say that variation in the independent variables, bore size and wheelbase size, can explain 78.33% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variables.

Another indicator of the strength of the model are the p-values.In this case, the p-value for bore is 1.06 x 10^-08 and the p-value for wheelbase is 1.92 x 10^-08. These are both very small values, and they indicate the probabilities that the F-statistic (in this case 183.6) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size and wheelbase size may help to explain the variation in MSRP.

Scatterplot with Regression
# install.packages("scatterplot3d") # commented out to avoid errors
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
mdl2 <- scatterplot3d(x$bore, x$wheelbase, x$msrp, pch=21, main = "MSRP vs. Bore and Wheelbase with Regression Plane", bg = 'ivory4', xlab = "Bore (in inches)", ylab = "Wheelbase (in inches)", zlab = "MSRP (in dollars $)", axis=TRUE)
mdl2$plane3d(model2, col = 'dodgerblue4')

Linear Model with 3 Independent Variables

The next independent variable that I was most interested in was fuel capacity of the dirt bikes.

Fuel capacity is the amount of fuel that the motorcycle can hold. This value is measured in gallons.

Model
model3 <- lm(x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)

summary(model3)
## 
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1536.5  -598.6  -146.6   572.5  2032.8 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7167.352    774.357  -9.256 4.99e-15 ***
## x$bore         53.976      6.798   7.940 3.40e-12 ***
## x$wheelbase   179.021     21.114   8.479 2.39e-13 ***
## x$fuel.cap   -695.461    130.998  -5.309 6.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 819.3 on 98 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8351, Adjusted R-squared:   0.83 
## F-statistic: 165.4 on 3 and 98 DF,  p-value: < 2.2e-16
Interpretation

For this final multiple linear model, the estimate of the y-intercept is -7,167, and the slope of bore is 53.98, the slope of wheelbase is 179.02, and the slope of the fuel capacity is -695.46. This means that for a bore size, wheelbase size, and fuel capacity all of value 0, the MSRP is $-7,167. While this is not plausible, it indicates that a motorcycle cannot have a bore, wheelbase size, or fuel capacity of 0. With a slope of bore of 53.98, this says that for each additional unit of bore (in inches), the MSRP increases by $53.98, for every additional inch in wheelbase, the MSRP increases by $179.02, and for every additional gallon of fuel capacity, the MSRP decreases by $695.46.

The adjusted R^2 value for the model is 0.83, which is an increase from the previous 2 IV model. This signifies that it was appropriate to add fuel capacity to the model. With this R^2 value of 0.83, we can say that variation in the independent variables, bore size, wheelbase size, and fuel capacity, can explain 83% of the variation in the dependent variable, MSRP. This is a very high R^2 value, and therefore a large percent of the variation in the dependent variable can be explained by the independent variables.

Another indicator of the strength of the model are the p-values.In this case, the p-values for bore, wheelbase size and fuel capacity are all very small numbers - less than 1 each. These p-values indicate the probabilities that the F-statistic (in this case 165.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size, wheelbase size, and fuel capacity may help to explain the variation in MSRP.

Scatterplot with Regression

Below, I tried to do 2-D interpretations of the regression line for each of the three variables. Unfortunatly, I could not get the line to appear on the plots. I will have to work on this for the final draft.

par(mfrow=c(3,1))
plot(msrp ~ bore, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Bore with Regression Line", xlab = "Bore", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,2], lwd=2.5, col='dodgerblue4')
plot(msrp ~ wheelbase, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Wheelbase with Regression Line", xlab = "Wheelbase", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,3], lwd=2.5, col='dodgerblue4')
plot(msrp ~ fuel.cap, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Fuel Capacity with Regression Line", xlab = "Fuel Capacity", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,4], lwd=2.5, col='dodgerblue4')

Scattergram of Final Model
plot(x, pch=21, cex=1, bg='ivory4', main="MSRP versus Bore, Wheelbase, and Fuel Capacity")

Fitted Values versus Residuals
par(mfrow=c(1,1))
model.res <- resid(model3)
plot(fitted(model3), model.res, pch=21, cex=1, bg='ivory4',main="Plot of Fitted Values vs. Residuals  (Not Standardized)", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')

This plot should appear relatively scattered and random across the dynamic range. Although there are more points in the greater part of the range than the lesser part, this appears to be a fairly good fit.

95% Confidence Intervals of the Regression Line, b0, b1, b2, and b3

model_pred <- predict(model3, interval="confidence")

^ I’m having difficulty/confusion with plotting these confidence intervals in the multiple linear regression scenario.