The data was selected from the 100+ datasets, and is part of Carnegie Melon’s data library. The dataset is called “Dirt Bike Data”, and is a summary of all off-road dirt bikes available in the United States. The data provides information about the bike, including engine size, price, frame size, etc.
x <- read.csv("~/Spring2015/Applied Regression/dirtbike.csv")
x <- x[c("msrp", "displacement", "bore", "stroke", "wheelbase", "seat.height", "fuel.cap", "total.weight")]
x <- x[c("msrp", "bore", "stroke", "wheelbase", "fuel.cap", "total.weight")]
x$bore <- as.numeric(x$bore)
x$msrp <- as.numeric(x$msrp)
head(x)
## msrp bore stroke wheelbase fuel.cap total.weight
## 1 2999 48 48 49.1 1.35 144
## 2 3099 48 48 49.1 1.35 145
## 3 2249 53 45 49.2 1.50 165
## 4 2999 64 49 52.2 2.20 216
## 5 3499 66 66 54.1 2.20 238
## 6 5899 78 52 58.2 2.00 204
Oringinally in this dataset, there are 104 observations with 30 different columns each. Unfortunately, some of the data was missing from a large portion of these variables. Therefore, the data was cleaned down to include only variables with complete data. This ended up being 6 variables.
All patient days and in hundreds, and all numeric values are in hundreds of dollars.
summary(x)
## msrp bore stroke wheelbase
## Min. : 1079 Min. : 39.00 Min. :38.0 Min. :33.70
## 1st Qu.: 3099 1st Qu.: 53.75 1st Qu.:49.0 1st Qu.:49.75
## Median : 5248 Median : 70.00 Median :58.0 Median :57.00
## Mean : 4928 Mean : 70.84 Mean :57.6 Mean :53.18
## 3rd Qu.: 6398 3rd Qu.: 91.25 3rd Qu.:63.0 3rd Qu.:58.25
## Max. :19500 Max. :100.00 Max. :83.0 Max. :58.90
## NA's :1
## fuel.cap total.weight
## Min. :0.000 Min. : 82.0
## 1st Qu.:1.350 1st Qu.:147.9
## Median :2.100 Median :218.5
## Mean :1.964 Mean :201.7
## 3rd Qu.:2.400 3rd Qu.:241.5
## Max. :6.100 Max. :337.0
## NA's :1 NA's :1
str(x)
## 'data.frame': 104 obs. of 6 variables:
## $ msrp : num 2999 3099 2249 2999 3499 ...
## $ bore : num 48 48 53 64 66 78 78 96 96 39 ...
## $ stroke : num 48 48 45 49 66 52 52 62 62 41 ...
## $ wheelbase : num 49.1 49.1 49.2 52.2 54.1 58.2 58.6 58.7 58.2 36 ...
## $ fuel.cap : num 1.35 1.35 1.5 2.2 2.2 2 2.2 1.9 2.27 0.8 ...
## $ total.weight: num 144 145 165 216 238 ...
All of the variables that were included in this analysis are continuous variables.
The response variable that I am interested in is the MSRP, or the Manufacturer’s Suggested Retail Price, and how variation in some other factors may be able to predict the variation in this price. This variable is measured US dollars.
The boxplot below shows the values of MSRP for the dataset. As seen below, most of the data appears to be centered around $5,000. However, there also appears to be a single outlier, which has an MSRP much greater than all the others. This outlier will be analyzed further.
bplot <- boxplot(x$msrp, main = "MSRP of Dirt Bikes", ylab = "Dollars ($)")
bplot
## $stats
## [,1]
## [1,] 1079.0
## [2,] 3099.0
## [3,] 5248.5
## [4,] 6398.0
## [5,] 8998.0
##
## $n
## [1] 104
##
## $conf
## [,1]
## [1,] 4737.38
## [2,] 5759.62
##
## $out
## [1] 19500
##
## $group
## [1] 1
##
## $names
## [1] ""
The boxplot function allows you to view the outliers of the dataset that you have just plotted.
One value has been identified as an outlier:
bplot$out
## [1] 19500
This data point has an MSRP value of $19,500.
While removing outliers is a matter of personal discretion, it is better to have a good model of the majority of the points, rather than an “okay” model of all of the points. For this reason, I will remove this outlier for the remainder of the analysis.
x <-x[x$msrp!=19500,]
bplot2 <- boxplot(x$msrp)
bplot2$out
## numeric(0)
As seen above, the outlier has been removed, and the boxplot function no longer returns a value for an outlier.
My null hypothesis is that the variation in MSRP of the off-road dirt bikes cannot be explained by anything other than randomization.
plot(x)
The plot above shows each of the variables included in this dataset, and their relationship to each of the other variables. Since we are primarily interested in MSRP as a response variable, the top row of plots are of interest.
As seen in the plot, some of the variable have stronger relationships than others. “Bore” appears to have a fairly strong, positive, linear relationship, whereas the relationship between “total weight” and “msrp” is slightly less clear, though probably slightly positive.
The model will be built hierarchically based on what independent variables appear to be most correlated with the dependent variable, MSRP.
The following correlation matrix shows the correlation between each of the variables and themselves. To narrow down the size of the matrix, only the variables of interest were displayed. As seen below, each variable is obviously perfectly correlated with itself - with a correlation of 1.
x <- x[c("msrp", "bore", "wheelbase", "fuel.cap")]
cor(x, use="complete")
## msrp bore wheelbase fuel.cap
## msrp 1.0000000 0.8410597 0.8389980 0.4726100
## bore 0.8410597 1.0000000 0.7918625 0.6368591
## wheelbase 0.8389980 0.7918625 1.0000000 0.6982879
## fuel.cap 0.4726100 0.6368591 0.6982879 1.0000000
Based on the correlation matrix, Bore and MSRP are most highly correlated, at .841. We will begin building the model with bore as the primary independent variable.
Bore is a measure of an internal combustion engine in a motorcycle. The value of a bike’s bore can affect the speed and power. The larger the bore, the more powerful the bike. Bore is measured in inches and is the diameter of the chambers in an engine cylinder.
This linear model analyzes if the variation in MSRP can be expained by the variation in bore.
We use the “lm” function to calculate the linear model.
“MSRP ~ bore” reads as: “MSRP is explained by bore”
attach(x)
x <- x[order(x$msrp),]
model1 <- lm(x$msrp ~ x$bore)
summary(model1)
##
## Call:
## lm(formula = x$msrp ~ x$bore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2244.57 -906.79 -27.89 763.42 1994.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1109.988 392.573 -2.827 0.00566 **
## x$bore 83.536 5.355 15.600 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1077 on 101 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.7038
## F-statistic: 243.4 on 1 and 101 DF, p-value: < 2.2e-16
For this simple linear model, the estimate of the y-intercept is -1,109, and the slope of the model is 83.536. This means that for a bore of value 0, the MSRP is $-1,109. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0, and that bore must have a fairly significant value for the bike to be worth anything. With a slope of 83.536, this says that for each additional unit of bore (in inches), the MSRP increases by $83.536.
The R^2 value for the model is 0.7067. Generally, a larger R^2 value indicates a stronger relationship between the independent and dependent variables, where an R^2 of 1 would indicate a perfect correlation, and a 0 would indicate zero correlation. With this R^2 value of 0.7067, we can say that variation in the independent variable, bore size, can explain 70.67% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variable. This may, however, increase as we include more variables.
Another indicator of the strength of the model is the p-value, which in this case is 2 x 10^-16. This is a very small value, and it is the probability that the F-statistic (in this case 243.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-value is less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size may help to explain the variation in MSRP.
par(mfrow=c(1,1))
plot(msrp ~ bore, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Bore with Regression Line", xlab = "Bore (in inches)", ylab = "MSRP (in dollars)")
abline(model1, lwd=2.5, col='dodgerblue4')
To continue building this model, the next most highly correlated independent variable based on the correlation matrix is the wheelbase.
Wheelbase is the distance between the front and rear wheel of a motorcycle, and can affect the function of the bike (sport/trick bikes versus leisure bikes). Wheelbase is measured in inches.
model2 <- lm(x$msrp ~ x$bore + x$wheelbase)
summary(model2)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1916.26 -792.94 -12.21 690.55 2175.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5587.318 807.084 -6.923 4.47e-10 ***
## x$bore 47.049 7.532 6.247 1.06e-08 ***
## x$wheelbase 132.839 21.720 6.116 1.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 925 on 99 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7876, Adjusted R-squared: 0.7833
## F-statistic: 183.6 on 2 and 99 DF, p-value: < 2.2e-16
For this multiple linear model, the estimate of the y-intercept is -5,587, and the slope of bore is 47.049, and the slope of wheelbase is 132.839. This means that for a bore size and wheelbase size, both of value 0, the MSRP is $-5,587. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0 and wheelbase size of 0. With a slope of bore of 47.049, this says that for each additional unit of bore (in inches), the MSRP increases by $47.049, and for every additional inch in wheelbase, the MSRP increases by $132.839.
The adjusted R^2 value for the model is 0.7833. This is an increase from our model with bore as the only independent variable, meaning that it was appropriate to add wheelbase size to the model. With this R^2 value of 0.7833, we can say that variation in the independent variables, bore size and wheelbase size, can explain 78.33% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variables.
Another indicator of the strength of the model are the p-values.In this case, the p-value for bore is 1.06 x 10^-08 and the p-value for wheelbase is 1.92 x 10^-08. These are both very small values, and they indicate the probabilities that the F-statistic (in this case 183.6) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size and wheelbase size may help to explain the variation in MSRP.
# install.packages("scatterplot3d") # commented out to avoid errors
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
mdl2 <- scatterplot3d(x$bore, x$wheelbase, x$msrp, pch=21, main = "MSRP vs. Bore and Wheelbase with Regression Plane", bg = 'ivory4', xlab = "Bore (in inches)", ylab = "Wheelbase (in inches)", zlab = "MSRP (in dollars $)", axis=TRUE)
mdl2$plane3d(model2, col = 'dodgerblue4')
The next independent variable that I was most interested in was fuel capacity of the dirt bikes.
Fuel capacity is the amount of fuel that the motorcycle can hold. This value is measured in gallons.
model3 <- lm(x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
summary(model3)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536.5 -598.6 -146.6 572.5 2032.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7167.352 774.357 -9.256 4.99e-15 ***
## x$bore 53.976 6.798 7.940 3.40e-12 ***
## x$wheelbase 179.021 21.114 8.479 2.39e-13 ***
## x$fuel.cap -695.461 130.998 -5.309 6.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 819.3 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
For this final multiple linear model, the estimate of the y-intercept is -7,167, and the slope of bore is 53.98, the slope of wheelbase is 179.02, and the slope of the fuel capacity is -695.46. This means that for a bore size, wheelbase size, and fuel capacity all of value 0, the MSRP is $-7,167. While this is not plausible, it indicates that a motorcycle cannot have a bore, wheelbase size, or fuel capacity of 0. With a slope of bore of 53.98, this says that for each additional unit of bore (in inches), the MSRP increases by $53.98, for every additional inch in wheelbase, the MSRP increases by $179.02, and for every additional gallon of fuel capacity, the MSRP decreases by $695.46.
The adjusted R^2 value for the model is 0.83, which is an increase from the previous 2 IV model. This signifies that it was appropriate to add fuel capacity to the model. With this R^2 value of 0.83, we can say that variation in the independent variables, bore size, wheelbase size, and fuel capacity, can explain 83% of the variation in the dependent variable, MSRP. This is a very high R^2 value, and therefore a large percent of the variation in the dependent variable can be explained by the independent variables.
Another indicator of the strength of the model are the p-values.In this case, the p-values for bore, wheelbase size and fuel capacity are all very small numbers - less than 1 each. These p-values indicate the probabilities that the F-statistic (in this case 165.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size, wheelbase size, and fuel capacity may help to explain the variation in MSRP.
Below, I tried to do 2-D interpretations of the regression line for each of the three variables. Unfortunatly, I could not get the line to appear on the plots. I will have to work on this for the final draft.
par(mfrow=c(3,1))
plot(msrp ~ bore, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Bore with Regression Line", xlab = "Bore", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,2], lwd=2.5, col='dodgerblue4')
plot(msrp ~ wheelbase, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Wheelbase with Regression Line", xlab = "Wheelbase", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,3], lwd=2.5, col='dodgerblue4')
plot(msrp ~ fuel.cap, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Fuel Capacity with Regression Line", xlab = "Fuel Capacity", ylab = "MSRP")
abline(a=summary(model3)$coefficients[1,1] ,b=summary(model3)$coefficients[1,4], lwd=2.5, col='dodgerblue4')
plot(x, pch=21, cex=1, bg='ivory4', main="MSRP versus Bore, Wheelbase, and Fuel Capacity")
par(mfrow=c(1,1))
model.res <- resid(model3)
plot(fitted(model3), model.res, pch=21, cex=1, bg='ivory4',main="Plot of Fitted Values vs. Residuals (Not Standardized)", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')
This plot should appear relatively scattered and random across the dynamic range. Although there are more points in the greater part of the range than the lesser part, this appears to be a fairly good fit.
model_pred <- predict(model3, interval="confidence")
^ I’m having difficulty/confusion with plotting these confidence intervals in the multiple linear regression scenario.