The data was selected from the 100+ datasets, and is part of Carnegie Melon’s data library. The dataset is called “Dirt Bike Data”, and is a summary of all off-road dirt bikes available in the United States. The data provides information about the bike, including engine size, price, frame size, etc.
x <- read.csv("~/Spring2015/Applied Regression/dirtbike.csv")
x <- x[c("msrp", "displacement", "bore", "stroke", "wheelbase", "seat.height", "fuel.cap", "total.weight")]
x <- x[c("msrp", "bore", "stroke", "wheelbase", "fuel.cap", "total.weight")]
x$bore <- as.numeric(x$bore)
x$msrp <- as.numeric(x$msrp)
head(x)
## msrp bore stroke wheelbase fuel.cap total.weight
## 1 2999 48 48 49.1 1.35 144
## 2 3099 48 48 49.1 1.35 145
## 3 2249 53 45 49.2 1.50 165
## 4 2999 64 49 52.2 2.20 216
## 5 3499 66 66 54.1 2.20 238
## 6 5899 78 52 58.2 2.00 204
Oringinally in this dataset, there are 104 observations with 30 different columns each. Unfortunately, some of the data was missing from a large portion of these variables. Therefore, the data was cleaned down to include only variables with complete data. This ended up being 6 variables.
All patient days and in hundreds, and all numeric values are in hundreds of dollars.
summary(x)
## msrp bore stroke wheelbase
## Min. : 1079 Min. : 39.00 Min. :38.0 Min. :33.70
## 1st Qu.: 3099 1st Qu.: 53.75 1st Qu.:49.0 1st Qu.:49.75
## Median : 5248 Median : 70.00 Median :58.0 Median :57.00
## Mean : 4928 Mean : 70.84 Mean :57.6 Mean :53.18
## 3rd Qu.: 6398 3rd Qu.: 91.25 3rd Qu.:63.0 3rd Qu.:58.25
## Max. :19500 Max. :100.00 Max. :83.0 Max. :58.90
## NA's :1
## fuel.cap total.weight
## Min. :0.000 Min. : 82.0
## 1st Qu.:1.350 1st Qu.:147.9
## Median :2.100 Median :218.5
## Mean :1.964 Mean :201.7
## 3rd Qu.:2.400 3rd Qu.:241.5
## Max. :6.100 Max. :337.0
## NA's :1 NA's :1
str(x)
## 'data.frame': 104 obs. of 6 variables:
## $ msrp : num 2999 3099 2249 2999 3499 ...
## $ bore : num 48 48 53 64 66 78 78 96 96 39 ...
## $ stroke : num 48 48 45 49 66 52 52 62 62 41 ...
## $ wheelbase : num 49.1 49.1 49.2 52.2 54.1 58.2 58.6 58.7 58.2 36 ...
## $ fuel.cap : num 1.35 1.35 1.5 2.2 2.2 2 2.2 1.9 2.27 0.8 ...
## $ total.weight: num 144 145 165 216 238 ...
All of the variables that were included in this analysis are continuous variables.
The response variable that I am interested in is the MSRP, or the Manufacturer’s Suggested Retail Price, and how variation in some other factors may be able to predict the variation in this price. This variable is measured in US dollars.
The histogram below shows the values of MSRP for the dirt bikes. Histograms are useful for displaying the variable of interest, as well as to show outliers, skewness, and kurtosis. From the histogram below, it is quite obvious that there is an outlier on the right side of the range. If we ignore this outlier, the kurtosis also seems large, because there does not seem to be a very normally distributed peak.
hist(x$msrp, breaks = 15)
The boxplot below shows the values of MSRP for the dataset. As seen below, most of the data appears to be centered around $5,000. However, there also appears to be a single outlier, which has an MSRP much greater than all the others. This outlier will be analyzed further.
bplot <- boxplot(x$msrp, main = "MSRP of Dirt Bikes", ylab = "Dollars ($)")
bplot
## $stats
## [,1]
## [1,] 1079.0
## [2,] 3099.0
## [3,] 5248.5
## [4,] 6398.0
## [5,] 8998.0
##
## $n
## [1] 104
##
## $conf
## [,1]
## [1,] 4737.38
## [2,] 5759.62
##
## $out
## [1] 19500
##
## $group
## [1] 1
##
## $names
## [1] ""
The boxplot function allows you to view the outliers of the dataset that you have just plotted.
One value has been identified as an outlier:
bplot$out
## [1] 19500
This data point has an MSRP value of $19,500.
While removing outliers is a matter of personal discretion, it is better to have a good model of the majority of the points, rather than an “okay” model of all of the points. For this reason, I will remove this outlier for the remainder of the analysis.
x <-x[x$msrp!=19500,]
bplot2 <- boxplot(x$msrp)
bplot2$out
## numeric(0)
As seen above, the outlier has been removed, and the boxplot function no longer returns a value for an outlier.
My null hypothesis is that the variation in MSRP of the off-road dirt bikes cannot be explained by anything other than randomization.
plot(x)
The plot above shows each of the variables included in this dataset, and their relationship to each of the other variables. Since we are primarily interested in MSRP as a response variable, the top row of plots are of interest.
As seen in the plot, some of the variable have stronger relationships than others. “Bore” appears to have a fairly strong, positive, linear relationship, whereas the relationship between “total weight” and “msrp” is slightly less clear, though probably slightly positive.
There will be three different methods used in the following analysis in order to model the dependent variable. Hierarchical, step-wise, and entry-wise models will be used. The summary of each model will be plotted, but the complete analysis and interpretation will occur last, with the step-wise model.
Hierarchical models are built using theory in order to determine which variables should be able to explain the dependent variable.
From my limited knowledge of dirt bikes, I would assume that there are two main deciding factors that would determine how much I would pay (MSRP). Since dirt bikes are used for sport, my primary concern would be the power of the bike, which bore is a measurement of.
x <- x[c("msrp", "bore", "wheelbase", "fuel.cap")]
attach(x)
x <- x[order(x$msrp),]
model1 <- lm(x$msrp ~ x$bore)
summary(model1)
##
## Call:
## lm(formula = x$msrp ~ x$bore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2244.57 -906.79 -27.89 763.42 1994.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1109.988 392.573 -2.827 0.00566 **
## x$bore 83.536 5.355 15.600 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1077 on 101 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.7038
## F-statistic: 243.4 on 1 and 101 DF, p-value: < 2.2e-16
My second concern would be the size of the bike, since different sized bikes have different performance capabilities. This is measured via the wheelbase variable.
attach(x)
## The following objects are masked from x (pos = 3):
##
## bore, fuel.cap, msrp, wheelbase
model2 <- lm(x$msrp ~ x$bore + x$wheelbase)
summary(model2)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1916.26 -792.94 -12.21 690.55 2175.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5587.318 807.084 -6.923 4.47e-10 ***
## x$bore 47.049 7.532 6.247 1.06e-08 ***
## x$wheelbase 132.839 21.720 6.116 1.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 925 on 99 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7876, Adjusted R-squared: 0.7833
## F-statistic: 183.6 on 2 and 99 DF, p-value: < 2.2e-16
Finally, the variable I would be least concerned with would be fuel capacity. Because this is a sport bike, it’s not necessary to hold large amounts of fuel. You will likely be close enough to the “home base” so that you don’t have to worry about storing a lot of fuel.
attach(x)
## The following objects are masked from x (pos = 3):
##
## bore, fuel.cap, msrp, wheelbase
##
## The following objects are masked from x (pos = 4):
##
## bore, fuel.cap, msrp, wheelbase
model3 <- lm(x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
summary(model3)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536.5 -598.6 -146.6 572.5 2032.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7167.352 774.357 -9.256 4.99e-15 ***
## x$bore 53.976 6.798 7.940 3.40e-12 ***
## x$wheelbase 179.021 21.114 8.479 2.39e-13 ***
## x$fuel.cap -695.461 130.998 -5.309 6.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 819.3 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
In order to build an entry-wise linear model, you simply use all of the variables the first time. Instead of building the model starting with 1 IV, then 2 IV, then 3 IV, the entry-wise starts with all of them at once.
model4 <- lm(x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
summary(model4)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536.5 -598.6 -146.6 572.5 2032.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7167.352 774.357 -9.256 4.99e-15 ***
## x$bore 53.976 6.798 7.940 3.40e-12 ***
## x$wheelbase 179.021 21.114 8.479 2.39e-13 ***
## x$fuel.cap -695.461 130.998 -5.309 6.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 819.3 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
The model will be built step-wise based on what independent variables appear to be most correlated with the dependent variable, MSRP.
The following correlation matrix shows the correlation between each of the variables and themselves. To narrow down the size of the matrix, only the variables of interest were displayed. As seen below, each variable is obviously perfectly correlated with itself - with a correlation of 1.
x <- x[c("msrp", "bore", "wheelbase", "fuel.cap")]
cor(x, use="complete")
## msrp bore wheelbase fuel.cap
## msrp 1.0000000 0.8410597 0.8389980 0.4726100
## bore 0.8410597 1.0000000 0.7918625 0.6368591
## wheelbase 0.8389980 0.7918625 1.0000000 0.6982879
## fuel.cap 0.4726100 0.6368591 0.6982879 1.0000000
Another characteristic to look out for is suppression. As seen in the correlation matrix above, there is a high correlation between some of the independent variables:
bore - wheelbase = 0.79
bore - fuel capacity = 0.64
fuel capacity - wheelbase = 0.70
Because of this high correlation, it is possible that there may be suppression when looking at the coefficients of the model. It will cause some coefficients to be inflated.
This is something that we will have to keep an eye on.
Based on the correlation matrix, Bore and MSRP are most highly correlated, at .841. We will begin building the model with bore as the primary independent variable.
Bore is a measure of an internal combustion engine in a motorcycle. The value of a bike’s bore can affect the speed and power. The larger the bore, the more powerful the bike. Bore is measured in inches and is the diameter of the chambers in an engine cylinder.
This linear model analyzes if the variation in MSRP can be expained by the variation in bore.
We use the “lm” function to calculate the linear model.
“MSRP ~ bore” reads as: “MSRP is explained by bore”
attach(x)
## The following objects are masked from x (pos = 3):
##
## bore, fuel.cap, msrp, wheelbase
##
## The following objects are masked from x (pos = 4):
##
## bore, fuel.cap, msrp, wheelbase
##
## The following objects are masked from x (pos = 5):
##
## bore, fuel.cap, msrp, wheelbase
x <- x[order(x$msrp),]
model5 <- lm(x$msrp ~ x$bore)
summary(model5)
##
## Call:
## lm(formula = x$msrp ~ x$bore)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2244.57 -906.79 -27.89 763.42 1994.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1109.988 392.573 -2.827 0.00566 **
## x$bore 83.536 5.355 15.600 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1077 on 101 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.7038
## F-statistic: 243.4 on 1 and 101 DF, p-value: < 2.2e-16
For this simple linear model, the estimate of the y-intercept is -1,109, and the slope of the model is 83.536. This means that for a bore of value 0, the MSRP is $-1,109. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0, and that bore must have a fairly significant value for the bike to be worth anything. With a slope of 83.536, this says that for each additional unit of bore (in inches), the MSRP increases by $83.536.
The R^2 value for the model is 0.7067. Generally, a larger R^2 value indicates a stronger relationship between the independent and dependent variables, where an R^2 of 1 would indicate a perfect correlation, and a 0 would indicate zero correlation. With this R^2 value of 0.7067, we can say that variation in the independent variable, bore size, can explain 70.67% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variable. This may, however, increase as we include more variables.
Another indicator of the strength of the model is the p-value, which in this case is 2 x 10^-16. This is a very small value, and it is the probability that the F-statistic (in this case 243.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-value is less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size may help to explain the variation in MSRP.
par(mfrow=c(1,1))
plot(msrp ~ bore, data = x, pch=21, cex=1, bg='ivory4', main="MSRP vs. Bore with Regression Line", xlab = "Bore (in inches)", ylab = "MSRP (in dollars)")
abline(model5, lwd=2.5, col='dodgerblue4')
To continue building this model, the next most highly correlated independent variable based on the correlation matrix is the wheelbase.
Wheelbase is the distance between the front and rear wheel of a motorcycle, and can affect the function of the bike (sport/trick bikes versus leisure bikes). Wheelbase is measured in inches.
model6 <- lm(x$msrp ~ x$bore + x$wheelbase)
summary(model6)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1916.26 -792.94 -12.21 690.55 2175.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5587.318 807.084 -6.923 4.47e-10 ***
## x$bore 47.049 7.532 6.247 1.06e-08 ***
## x$wheelbase 132.839 21.720 6.116 1.92e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 925 on 99 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7876, Adjusted R-squared: 0.7833
## F-statistic: 183.6 on 2 and 99 DF, p-value: < 2.2e-16
For this multiple linear model, the estimate of the y-intercept is -5,587, and the slope of bore is 47.049, and the slope of wheelbase is 132.839. This means that for a bore size and wheelbase size, both of value 0, the MSRP is $-5,587. While this is not plausible, it indicates that a motorcycle cannot have a bore of 0 and wheelbase size of 0. With a slope of bore of 47.049, this says that for each additional unit of bore (in inches), the MSRP increases by $47.049, and for every additional inch in wheelbase, the MSRP increases by $132.839.
The adjusted R^2 value for the model is 0.7833. This is an increase from our model with bore as the only independent variable, meaning that it was appropriate to add wheelbase size to the model. With this R^2 value of 0.7833, we can say that variation in the independent variables, bore size and wheelbase size, can explain 78.33% of the variation in the dependent variable, MSRP. This is a fairly high R^2 value, and therefore a moderate percent of the variation in the dependent variable can be explained by the independent variables.
Another indicator of the strength of the model are the p-values.In this case, the p-value for bore is 1.06 x 10^-08 and the p-value for wheelbase is 1.92 x 10^-08. These are both very small values, and they indicate the probabilities that the F-statistic (in this case 183.6) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size and wheelbase size may help to explain the variation in MSRP.
# install.packages("scatterplot3d") # commented out to avoid errors
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.1.3
mdl6 <- scatterplot3d(x$bore, x$wheelbase, x$msrp, pch=21, main = "MSRP vs. Bore and Wheelbase with Regression Plane", bg = 'ivory4', xlab = "Bore (in inches)", ylab = "Wheelbase (in inches)", zlab = "MSRP (in dollars $)", axis=TRUE)
mdl6$plane3d(model6, col = 'dodgerblue4')
The next independent variable that I was most interested in was fuel capacity of the dirt bikes.
Fuel capacity is the amount of fuel that the motorcycle can hold. This value is measured in gallons.
model7 <- lm(x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
summary(model7)
##
## Call:
## lm(formula = x$msrp ~ x$bore + x$wheelbase + x$fuel.cap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1536.5 -598.6 -146.6 572.5 2032.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7167.352 774.357 -9.256 4.99e-15 ***
## x$bore 53.976 6.798 7.940 3.40e-12 ***
## x$wheelbase 179.021 21.114 8.479 2.39e-13 ***
## x$fuel.cap -695.461 130.998 -5.309 6.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 819.3 on 98 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8351, Adjusted R-squared: 0.83
## F-statistic: 165.4 on 3 and 98 DF, p-value: < 2.2e-16
For this final multiple linear model, the estimate of the y-intercept is -7,167, and the slope of bore is 53.98, the slope of wheelbase is 179.02, and the slope of the fuel capacity is -695.46. This means that for a bore size, wheelbase size, and fuel capacity all of value 0, the MSRP is $-7,167. While this is not plausible, it indicates that a motorcycle cannot have a bore, wheelbase size, or fuel capacity of 0. With a slope of bore of 53.98, this says that for each additional unit of bore (in inches), the MSRP increases by $53.98, for every additional inch in wheelbase, the MSRP increases by $179.02, and for every additional gallon of fuel capacity, the MSRP decreases by $695.46.
The adjusted R^2 value for the model is 0.83, which is an increase from the previous 2 IV model. This signifies that it was appropriate to add fuel capacity to the model. With this R^2 value of 0.83, we can say that variation in the independent variables, bore size, wheelbase size, and fuel capacity, can explain 83% of the variation in the dependent variable, MSRP. This is a very high R^2 value, and therefore a large percent of the variation in the dependent variable can be explained by the independent variables.
Another indicator of the strength of the model are the p-values.In this case, the p-values for bore, wheelbase size and fuel capacity are all very small numbers - less than 1 each. These p-values indicate the probabilities that the F-statistic (in this case 165.4) is due to randomization alone - showing that this event is very unlikely. If we analyze using an alpha of 0.05, the p-values are less than 0.05. Therefore, we may reject our null hypothesis, which states that the variation in MSRP cannot be explained by anything other than randomization. Instead, we suggest an alternative - that variation in bore size, wheelbase size, and fuel capacity may help to explain the variation in MSRP.
plot(x, pch=21, cex=1, bg='ivory4', main="MSRP versus Bore, Wheelbase, and Fuel Capacity")
There are multiple assumptions that are required for our model to be valid. In the following section, we will check various plots and tests to verify these assumptions.
Histograms are good for initial verification of the skewness and kurtosis of a dataset.
hist(x$msrp, breaks = 15, main = "Histogram of MSRP", xlab = "MSRP (in dollars)")
Based on the histogram above, the kurtosis seems to be very large - there is no distinct peak. While the data may be slightly skew left, it is not a very significant skewness.
Boxplots are another useful tool for examining skewness and outliers. Since outliers were already removed in the outlier analysis near the beginning, they will not appear in this analysis.
boxplot(x$msrp, main="Boxplot of MSRP")
As stated before with the histogram, there appears to be a slight skew. There is a slightly longer tail in the lower range of the data.
par(mfrow=c(1,1))
model.res <- resid(model7)
plot(fitted(model7), model.res, pch=21, cex=1, bg='ivory4',main="Plot of Fitted Values vs. Residuals (Not Sandardized)", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')
This plot should appear relatively scattered and random across the dynamic range. Although there are more points in the greater part of the range than the lesser part, this appears to be a fairly good fit.
The second plot, below, shows the standardized residuals versus the fitted values. Similarly, there are more points in the second half of the dynamic range.
par(mfrow=c(1,1))
model.st.res <- rstandard(model7)
plot(fitted(model7), model.st.res, pch=21, cex=1, bg='ivory4',main="Standardized Plot of Fitted Values vs. Residuals", xlab = "Fitted Values of Model", ylab = "Residuals")
abline(0,0, lwd=2.5, col='dodgerblue4')
If the residuals are normally distributed, the Q-Q plot will appear relatively linear, and will not have any distinct curvature.
The Q-Q plot below seems relatively linear. The extremeties do seem to diverge a bit, but for the most part, the points maintain that linearity.
par(mfrow = c(1,1))
qqnorm(residuals(model7))
qqline(residuals(model7))
Confidence intervals are used to say that in 95% of samples, the true value of the parameter will lie between certian values. As seen below, there are confidence intervals for the intercepts, the coefficient for bore, the coefficient for wheelbase, and the coefficient for fuel capacity.
confint(model7)
## 2.5 % 97.5 %
## (Intercept) -8704.0379 -5630.66677
## x$bore 40.4863 67.46637
## x$wheelbase 137.1213 220.92015
## x$fuel.cap -955.4220 -435.49979