Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
My response to this weeks discussion topic was inspired by a paper by Shounda Kuiper called "Introduction to Multiple Regression: How Much Is Your Car Worth?" The data set I will use for this discussion was collected from Kelly Blue Book comprising of 804 observations of used GM cars. The dataset can be downloaded via the Journal of Statistics Education data archive (the original file name is kuiper.xls).
First we will import the dataset into R and then take a brief look at the data.
# Load the Kelly Blue Book used GM cars dataset into R. 7 of the columns in the data set
# contain only NA values, so we only import the first 12 columns that contain actual data.
cars_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-week-13/KellyBlueBookCarSales.csv')[ ,1:12]
# Attach the dataset so we can easily access the variables.
attach(cars_dataset)
# Look at a sample of the dataset, and count the number of observations.
kable(head(cars_dataset), format = 'markdown')
| Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 17314.10 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 17542.04 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 16218.85 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 16336.91 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 16339.17 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |
| 15709.05 | 22236 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
nrow(cars_dataset)
## [1] 804
As we can see from the above table, the dataset contains 804 observations of used GM cars, and 12 variables: Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, and Leather.
The aim of this exercise is to establish which of the following variables are the best predictors of car price: Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, Leather.
Â
Â
We will start with the mileage variable as this is the most obvious predictor of car price. For this, we will run a simple linear model to see if this is indeed the case.
# Run a model to check if car mileage has a linear relationship with car price.
priceMileageModel <- lm(Price ~ Mileage, cars)
summary(priceMileageModel)
##
## Call:
## lm(formula = Price ~ Mileage, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13905 -7254 -3520 5188 46091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.476e+04 9.044e+02 27.383 < 2e-16 ***
## Mileage -1.725e-01 4.215e-02 -4.093 4.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9789 on 802 degrees of freedom
## Multiple R-squared: 0.02046, Adjusted R-squared: 0.01924
## F-statistic: 16.75 on 1 and 802 DF, p-value: 4.685e-05
The 4.68 p-value reported by the model summary tells us that there is a strong relationship between the mileage of a car and its price. However, the R-squared value is only 0.02 which means mileage only explains 2% of the variance. To explore this further, we can plot the results visually to get a better picture of the variance.
plot(Price ~ Mileage, cars, main = 'Mileage Vs Car Price')
abline(priceMileageModel, col = "red", lwd = 3)
From the above plot, we can see that there is a lot of variation in the data, with a lot of outliers. Despite this, there is still a strong relatioship between mileage and price. In order to explain some of those outliers and increase the R-squared value of our model, let's pull in some more variables.
# Run a new Price ~ Mileage model, but add the Make and Type variables to
# the model to see if our R-squared value increases.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type, cars)
summary(priceMultipleModel)
##
## Call:
## lm(formula = Price ~ Mileage + Make + Type, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9889.1 -3190.3 -332.7 2720.5 20265.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.926e+04 1.006e+03 39.047 < 2e-16 ***
## Mileage -1.791e-01 2.010e-02 -8.909 < 2e-16 ***
## MakeCadillac 1.800e+04 7.423e+02 24.251 < 2e-16 ***
## MakeChevrolet -5.217e+03 6.188e+02 -8.430 < 2e-16 ***
## MakePontiac -3.221e+03 6.682e+02 -4.820 1.72e-06 ***
## MakeSAAB 4.534e+03 7.475e+02 6.065 2.04e-09 ***
## MakeSaturn -7.494e+03 8.104e+02 -9.248 < 2e-16 ***
## TypeCoupe -1.286e+04 8.576e+02 -15.002 < 2e-16 ***
## TypeHatchback -1.646e+04 9.863e+02 -16.684 < 2e-16 ***
## TypeSedan -1.479e+04 7.492e+02 -19.740 < 2e-16 ***
## TypeWagon -1.362e+04 9.063e+02 -15.024 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4653 on 793 degrees of freedom
## Multiple R-squared: 0.7812, Adjusted R-squared: 0.7785
## F-statistic: 283.2 on 10 and 793 DF, p-value: < 2.2e-16
After adding the Make and Type variables to the model, our R-squared values has jumped from 0.02 to 0.78 which is a significant increase. We have gone from explaining 2% of the variance, to explaining 78% of the variance.
Next I want to add a dichotomous variable to the model to see what effect it has on price. We will add the dichotomous variable "Leather" to the model.
# Add the dichotomous term Leather to the model.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type + Leather, cars)
summary(priceMultipleModel)
##
## Call:
## lm(formula = Price ~ Mileage + Make + Type + Leather, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9881.0 -3226.6 -145.7 2650.0 20123.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.889e+04 1.013e+03 38.412 < 2e-16 ***
## Mileage -1.799e-01 2.003e-02 -8.979 < 2e-16 ***
## MakeCadillac 1.743e+04 7.728e+02 22.558 < 2e-16 ***
## MakeChevrolet -5.542e+03 6.299e+02 -8.798 < 2e-16 ***
## MakePontiac -3.394e+03 6.694e+02 -5.071 4.93e-07 ***
## MakeSAAB 4.245e+03 7.536e+02 5.633 2.46e-08 ***
## MakeSaturn -7.507e+03 8.076e+02 -9.295 < 2e-16 ***
## TypeCoupe -1.301e+04 8.567e+02 -15.192 < 2e-16 ***
## TypeHatchback -1.661e+04 9.848e+02 -16.866 < 2e-16 ***
## TypeSedan -1.484e+04 7.469e+02 -19.869 < 2e-16 ***
## TypeWagon -1.371e+04 9.040e+02 -15.168 < 2e-16 ***
## Leather 9.952e+02 3.923e+02 2.537 0.0114 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4637 on 792 degrees of freedom
## Multiple R-squared: 0.783, Adjusted R-squared: 0.78
## F-statistic: 259.8 on 11 and 792 DF, p-value: < 2.2e-16
After adding the dichotomous variable Leather to the model, it appears that our R-squared values was slightly increased, so while it does have significance in terms of price, it appears the significance is marginal.
Finally we will add the quadratic variable Cylinder to the model.
# Add the quadratic term Cylinder to the model.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type + Leather + Cylinder, cars)
summary(priceMultipleModel)
##
## Call:
## lm(formula = Price ~ Mileage + Make + Type + Leather + Cylinder,
## data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9328.3 -1517.6 174.1 1467.3 13826.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.444e+04 9.461e+02 15.263 < 2e-16 ***
## Mileage -1.845e-01 1.258e-02 -14.669 < 2e-16 ***
## MakeCadillac 1.231e+04 5.069e+02 24.285 < 2e-16 ***
## MakeChevrolet -1.548e+03 4.117e+02 -3.759 0.000183 ***
## MakePontiac -2.436e+03 4.212e+02 -5.783 1.06e-08 ***
## MakeSAAB 1.138e+04 5.154e+02 22.075 < 2e-16 ***
## MakeSaturn -1.210e+03 5.382e+02 -2.248 0.024841 *
## TypeCoupe -1.095e+04 5.412e+02 -20.229 < 2e-16 ***
## TypeHatchback -1.433e+04 6.218e+02 -23.043 < 2e-16 ***
## TypeSedan -1.231e+04 4.745e+02 -25.938 < 2e-16 ***
## TypeWagon -8.010e+03 5.906e+02 -13.562 < 2e-16 ***
## Leather 8.340e+02 2.463e+02 3.386 0.000745 ***
## Cylinder 3.681e+03 1.055e+02 34.899 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2911 on 791 degrees of freedom
## Multiple R-squared: 0.9146, Adjusted R-squared: 0.9133
## F-statistic: 705.5 on 12 and 791 DF, p-value: < 2.2e-16
Adding the quadratic variable Cylinder to our model resulted in a significant increase in our R-squared value. Our model now explains 91% of the variance in the data.
plot(priceMultipleModel$fitted.values, priceMultipleModel$residuals,
xlab = 'Fitted Values', ylab = 'Residuals', main = 'Residuals vs. Fitted')
abline(h = 0)
qqnorm(priceMultipleModel$residuals)
qqline(priceMultipleModel$residuals)
Based on the \(R^2\) value of our final model, the model explains 91% of the variance in the data which is good.
The Residuals vs. Fitted plot above displays some outliers in the data, but overall the variability is fairly consistent. Additionally, the QQ plot, while not perfect, does show consistency with outliers at each end of the line. Therefore, It seems that our multiple linear model is an appropriate model for predicting the price of cars.