Assignment Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Assignment Response

My response to this week's discussion topic was inspired by a paper by Shonda Kuiper titled "Introduction to Multiple Regression: How Much Is Your Car Worth?" The dataset I will use for this discussion was collected from Kelley Blue Book and comprises 804 observations of used GM cars. The dataset can be downloaded via the Journal of Statistics Education data archive (the original file name is kuiper.xls).

Dataset Exploration

First, we will import the dataset into R and take a brief look at the data.

# Load the Kelley Blue Book used GM cars dataset into R. 7 of the columns in the data set
# contain only NA values, so we only import the first 12 columns that contain actual data.
cars_dataset <- read.csv('https://raw.githubusercontent.com/stephen-haslett/data605/data605-week-13/KellyBlueBookCarSales.csv')[ ,1:12]

# Attach the dataset so we can easily access the variables by name.
attach(cars_dataset)

# Load knitr for the kable() table-formatting function, then look at a sample
# of the dataset and count the number of observations.
library(knitr)
kable(head(cars_dataset), format = 'markdown')
| Price    | Mileage | Make  | Model   | Trim     | Type  | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---------:|--------:|:------|:--------|:---------|:------|---------:|------:|------:|-------:|------:|--------:|
| 17314.10 | 8221    | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 1     | 1       |
| 17542.04 | 9135    | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 1     | 0       |
| 16218.85 | 13196   | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 1     | 0       |
| 16336.91 | 16342   | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 0     | 0       |
| 16339.17 | 19832   | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 0     | 1       |
| 15709.05 | 22236   | Buick | Century | Sedan 4D | Sedan | 6        | 3.1   | 4     | 1      | 1     | 0       |
nrow(cars_dataset)
## [1] 804

As we can see from the above table, the dataset contains 804 observations of used GM cars, and 12 variables: Price, Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, and Leather.

The aim of this exercise is to establish which of the following variables are the best predictors of car price: Mileage, Make, Model, Trim, Type, Cylinder, Liter, Doors, Cruise, Sound, Leather.
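As a quick check on variable types (not shown in the original output), we can inspect the structure of the imported data frame to confirm which predictors are quantitative and which are categorical or dichotomous:

# Price, Mileage, and Liter are quantitative; Make, Model, Trim, and Type are
# categorical; Cruise, Sound, and Leather are 0/1 indicator variables.
str(cars_dataset)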


Linear Models


Mileage vs. Price Model

We will start with the Mileage variable, as this is the most obvious predictor of car price. We will fit a simple linear model to see whether this is indeed the case.

# Run a model to check if car mileage has a linear relationship with car price.
priceMileageModel <- lm(Price ~ Mileage, data = cars_dataset)
summary(priceMileageModel)
## 
## Call:
## lm(formula = Price ~ Mileage, data = cars_dataset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13905  -7254  -3520   5188  46091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.476e+04  9.044e+02  27.383  < 2e-16 ***
## Mileage     -1.725e-01  4.215e-02  -4.093 4.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9789 on 802 degrees of freedom
## Multiple R-squared:  0.02046,    Adjusted R-squared:  0.01924 
## F-statistic: 16.75 on 1 and 802 DF,  p-value: 4.685e-05

The p-value of 4.68e-05 reported in the model summary tells us that the relationship between a car's mileage and its price is statistically significant. The Mileage coefficient of -0.1725 means that, on average, each additional mile on the odometer is associated with a price decrease of roughly 17 cents. However, the R-squared value is only 0.02, which means mileage alone explains only about 2% of the variance in price. To explore this further, we can plot the data to get a better picture of the variance.
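Writing the fitted model out explicitly (rounding the coefficients from the summary above), the estimated regression line is \(\widehat{Price} \approx 24760 - 0.1725 \cdot Mileage\). As a quick worked example, a car with 20,000 miles on the odometer would be predicted to cost roughly \(24760 - 0.1725 \times 20000 \approx 21310\) dollars.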

# Scatter plot of price against mileage, with the fitted regression line overlaid.
plot(Price ~ Mileage, data = cars_dataset, main = 'Mileage Vs Car Price')
abline(priceMileageModel, col = "red", lwd = 3)

From the above plot, we can see that there is a lot of variation in the data, with many outliers. The relationship between mileage and price is statistically significant, but mileage alone clearly does not explain much of that variation. In order to account for more of it and increase the R-squared value of our model, let's pull in some more variables.
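Before fitting the larger model, a quick exploratory look (not part of the original output) at how price varies across Make and Type suggests why these two variables are worth adding:

# Boxplots of price within each Make and each Type. Large differences between the
# category medians indicate that these variables carry information about price.
boxplot(Price ~ Make, data = cars_dataset, main = 'Price by Make')
boxplot(Price ~ Type, data = cars_dataset, main = 'Price by Type')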

# Run a new Price ~ Mileage model, but add the Make and Type variables to
# the model to see if our R-squared value increases.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type, data = cars_dataset)
summary(priceMultipleModel)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Type, data = cars_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9889.1 -3190.3  -332.7  2720.5 20265.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.926e+04  1.006e+03  39.047  < 2e-16 ***
## Mileage       -1.791e-01  2.010e-02  -8.909  < 2e-16 ***
## MakeCadillac   1.800e+04  7.423e+02  24.251  < 2e-16 ***
## MakeChevrolet -5.217e+03  6.188e+02  -8.430  < 2e-16 ***
## MakePontiac   -3.221e+03  6.682e+02  -4.820 1.72e-06 ***
## MakeSAAB       4.534e+03  7.475e+02   6.065 2.04e-09 ***
## MakeSaturn    -7.494e+03  8.104e+02  -9.248  < 2e-16 ***
## TypeCoupe     -1.286e+04  8.576e+02 -15.002  < 2e-16 ***
## TypeHatchback -1.646e+04  9.863e+02 -16.684  < 2e-16 ***
## TypeSedan     -1.479e+04  7.492e+02 -19.740  < 2e-16 ***
## TypeWagon     -1.362e+04  9.063e+02 -15.024  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4653 on 793 degrees of freedom
## Multiple R-squared:  0.7812, Adjusted R-squared:  0.7785 
## F-statistic: 283.2 on 10 and 793 DF,  p-value: < 2.2e-16

After adding the Make and Type variables to the model, our R-squared value has jumped from 0.02 to 0.78, a substantial increase. We have gone from explaining 2% of the variance to explaining 78% of it.
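To interpret the Make and Type coefficients, note that R's default treatment coding absorbs one reference level of each factor into the intercept; judging from the summary above, those baselines are Buick and Convertible. Each coefficient is then the predicted price difference relative to its baseline, holding the other predictors fixed. For example, the MakeCadillac estimate of roughly 18,000 means a Cadillac is predicted to cost about $18,000 more than an otherwise comparable Buick. Assuming the factors are built directly from the raw columns, the reference levels can be confirmed like this:

# The first factor level (alphabetically) is the reference category that gets
# absorbed into the intercept under R's default treatment contrasts.
levels(factor(cars_dataset$Make))   # 'Buick' comes first, so it is the baseline Make
levels(factor(cars_dataset$Type))   # 'Convertible' comes first, so it is the baseline Type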

Next, we will add a dichotomous variable to the model to see what effect it has on price. We will use Leather, which equals 1 if the car has a leather interior and 0 otherwise.

# Add the dichotomous term Leather to the model.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type + Leather, data = cars_dataset)
summary(priceMultipleModel)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Type + Leather, data = cars_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9881.0 -3226.6  -145.7  2650.0 20123.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.889e+04  1.013e+03  38.412  < 2e-16 ***
## Mileage       -1.799e-01  2.003e-02  -8.979  < 2e-16 ***
## MakeCadillac   1.743e+04  7.728e+02  22.558  < 2e-16 ***
## MakeChevrolet -5.542e+03  6.299e+02  -8.798  < 2e-16 ***
## MakePontiac   -3.394e+03  6.694e+02  -5.071 4.93e-07 ***
## MakeSAAB       4.245e+03  7.536e+02   5.633 2.46e-08 ***
## MakeSaturn    -7.507e+03  8.076e+02  -9.295  < 2e-16 ***
## TypeCoupe     -1.301e+04  8.567e+02 -15.192  < 2e-16 ***
## TypeHatchback -1.661e+04  9.848e+02 -16.866  < 2e-16 ***
## TypeSedan     -1.484e+04  7.469e+02 -19.869  < 2e-16 ***
## TypeWagon     -1.371e+04  9.040e+02 -15.168  < 2e-16 ***
## Leather        9.952e+02  3.923e+02   2.537   0.0114 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4637 on 792 degrees of freedom
## Multiple R-squared:  0.783,  Adjusted R-squared:   0.78 
## F-statistic: 259.8 on 11 and 792 DF,  p-value: < 2.2e-16

After adding the dichotomous variable Leather to the model, the R-squared value increased only slightly, from 0.781 to 0.783. Leather is statistically significant (p = 0.0114), and its coefficient of about 995 means that, holding the other predictors fixed, a car with a leather interior is predicted to cost roughly $995 more than one without. However, it adds little explanatory power to the model.
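To quantify how much Leather actually adds beyond Mileage, Make, and Type, the two nested models can be compared with a partial F-test. A brief sketch (the models are given distinct names here so that both fits remain available):

# Fit the nested models under separate names so both objects are kept.
model_without_leather <- lm(Price ~ Mileage + Make + Type, data = cars_dataset)
model_with_leather    <- lm(Price ~ Mileage + Make + Type + Leather, data = cars_dataset)

# Partial F-test: does adding Leather significantly reduce the residual sum of squares?
anova(model_without_leather, model_with_leather)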

Finally, we will add the Cylinder variable to the model. Note that Cylinder enters the model as an ordinary linear term here; a true quadratic term would square one of the predictors, as sketched after the model summary below.

# Add the Cylinder variable to the model.
priceMultipleModel <- lm(Price ~ Mileage + Make + Type + Leather + Cylinder, data = cars_dataset)
summary(priceMultipleModel)
## 
## Call:
## lm(formula = Price ~ Mileage + Make + Type + Leather + Cylinder, 
##     data = cars_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9328.3 -1517.6   174.1  1467.3 13826.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.444e+04  9.461e+02  15.263  < 2e-16 ***
## Mileage       -1.845e-01  1.258e-02 -14.669  < 2e-16 ***
## MakeCadillac   1.231e+04  5.069e+02  24.285  < 2e-16 ***
## MakeChevrolet -1.548e+03  4.117e+02  -3.759 0.000183 ***
## MakePontiac   -2.436e+03  4.212e+02  -5.783 1.06e-08 ***
## MakeSAAB       1.138e+04  5.154e+02  22.075  < 2e-16 ***
## MakeSaturn    -1.210e+03  5.382e+02  -2.248 0.024841 *  
## TypeCoupe     -1.095e+04  5.412e+02 -20.229  < 2e-16 ***
## TypeHatchback -1.433e+04  6.218e+02 -23.043  < 2e-16 ***
## TypeSedan     -1.231e+04  4.745e+02 -25.938  < 2e-16 ***
## TypeWagon     -8.010e+03  5.906e+02 -13.562  < 2e-16 ***
## Leather        8.340e+02  2.463e+02   3.386 0.000745 ***
## Cylinder       3.681e+03  1.055e+02  34.899  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2911 on 791 degrees of freedom
## Multiple R-squared:  0.9146, Adjusted R-squared:  0.9133 
## F-statistic: 705.5 on 12 and 791 DF,  p-value: < 2.2e-16

Adding Cylinder to our model resulted in a large increase in the R-squared value. Its coefficient of about 3,681 means that each additional cylinder is associated with a price increase of roughly $3,681, holding the other predictors fixed, and the model now explains about 91% of the variance in the data.
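The assignment also calls for a true quadratic term and a dichotomous vs. quantitative interaction term. As a sketch of how those could be specified (this extended model was not fitted above, and the choice of terms is illustrative), both can be added directly in the model formula:

# Hypothetical extension of the final model (not fitted in the analysis above):
# - I(Mileage^2) adds a quadratic term in Mileage, letting the price-mileage
#   relationship curve rather than remain a straight line.
# - Leather:Mileage adds a dichotomous vs. quantitative interaction, letting the
#   mileage slope differ between cars with and without leather interiors.
priceExtendedModel <- lm(Price ~ Mileage + I(Mileage^2) + Make + Type +
                           Leather + Cylinder + Leather:Mileage,
                         data = cars_dataset)
summary(priceExtendedModel)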

Residual Analysis

# Plot residuals against fitted values to check for non-constant variance or
# non-linear patterns.
plot(priceMultipleModel$fitted.values, priceMultipleModel$residuals,
     xlab = 'Fitted Values', ylab = 'Residuals', main = 'Residuals vs. Fitted')
abline(h = 0)

# Normal Q-Q plot of the residuals to check the normality assumption.
qqnorm(priceMultipleModel$residuals)
qqline(priceMultipleModel$residuals)
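As an alternative, R's built-in diagnostics for lm objects produce these two plots, plus a scale-location plot and a residuals-vs-leverage plot, in a single call:

# Display the four standard lm diagnostic plots in a 2x2 grid, then reset the layout.
par(mfrow = c(2, 2))
plot(priceMultipleModel)
par(mfrow = c(1, 1))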

Summary

Based on the \(R^2\) value of our final model, the model explains approximately 91% of the variance in the data, which indicates a strong fit.

The Residuals vs. Fitted plot above displays some outliers, but overall the variability of the residuals is fairly consistent across the range of fitted values. Additionally, the Q-Q plot, while not perfect, shows the residuals following the reference line reasonably well, with heavier tails at each end. Therefore, it seems that our multiple linear regression model is an appropriate model for predicting the price of these cars.