Background

Dr. Saunders, at BYU-Idaho, created a multiple regression model to determine the price of Cadillac cars. He determined that mileage and car model were the two variables that best explained price. I want to see whether the type of car explains price better than the model of car does. I suspect it might, because the data appear to fall along roughly two lines and the Cadillacs come in two types.

The mathematical model for these multiple linear regressions is as follows:

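\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \epsilon_i\), where \(X_{1i}\) is mileage and \(X_{2i}, X_{3i}, \ldots, X_{pi}\) are indicator variables, each equal to \(1\) or \(0\), that encode the levels of a single categorical second explanatory variable (type or model). Only one second explanatory variable is used at a time, and at most one of its indicators equals \(1\) for any given car.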
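
As a rough illustration of how R builds these indicator columns, the sketch below calls `model.matrix()` on a tiny made-up data frame. The values and the level name "Convertible" are assumptions for illustration only; they are not taken from the CarPrices data.

```r
# Toy data invented for illustration; NOT the CarPrices data.
# The level name "Convertible" is an assumed placeholder for the baseline type.
toy <- data.frame(
  Mileage = c(10000, 20000, 30000, 40000),
  Type    = factor(c("Convertible", "Sedan", "Sedan", "Convertible"))
)

# model.matrix() shows the design matrix that lm() would use:
# an intercept column, Mileage, and one 0/1 indicator per non-baseline level.
X <- model.matrix(~ Mileage + Type, data = toy)
print(X)
```

By default the first factor level in alphabetical order becomes the baseline and gets no column of its own, which is why a two-level Type variable produces a single TypeSedan coefficient.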

Formally, the null and alternative hypotheses for every test conducted in this analysis are as follows:

\(H_0: \beta_j = 0\)
\(H_a: \beta_j \neq 0\)

The level of significance, for this analysis, is set at:
\(\alpha = 0.05\)
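
A minimal sketch of how this decision rule can be applied in R: pull the p-values out of a fitted model's summary and compare each one to \(\alpha\). The data here are simulated, and the names `toy`, `fit`, and `reject` are hypothetical.

```r
# Simulated data for illustration only; NOT the CarPrices data.
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 5 + 2 * toy$x + rnorm(20)

fit <- lm(y ~ x, data = toy)

alpha <- 0.05
# Column "Pr(>|t|)" of the coefficient table holds the two-sided p-value
# for the test of H0: beta_j = 0 on each coefficient.
p_values <- summary(fit)$coefficients[, "Pr(>|t|)"]
reject <- p_values < alpha   # TRUE means reject H0 for that coefficient
print(data.frame(p_value = p_values, reject_H0 = reject))
```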

The data for this analysis is provided by Dr. Saunders and is listed below:

datatable(CarPrices, options=list(lengthMenu = c(5, 20, 50)))

Analysis

Visualization

#Side by side
par(mfrow = c(1,3))

# Linear Regression Model
CarPrices1 <- subset(CarPrices, Make == "Cadillac")
myData.lm <- lm(Price ~ Mileage, data = CarPrices1)
#Plot of my Simple Linear Regression model
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = "sienna1", xlim=c(0,50000), main = "Simple Linear Regression\n Price Vs. Mileage")
abline(myData.lm$coefficients[1], myData.lm$coefficients[2], col = "sienna1")


# Multiple linear regression with Type of car as the second variable
myData.lm2 <- lm(Price ~ Mileage + Type, data = CarPrices1)

#Plots of my MLR Data
#Extra colors: ,"skyblue","sienna1","gray","sienna4"
palette(c("skyblue4","firebrick"))
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = CarPrices1$Type, xlim=c(0,50000), main = "Multiple Linear Regression with\n Type of Car As The Second Variable")

abline(myData.lm2$coefficients[1], myData.lm2$coefficients[2], col = palette()[1])

abline(myData.lm2$coefficients[1] +  myData.lm2$coefficients[3], myData.lm2$coefficients[2], col = palette()[2])

legend("topright", myData.lm2$xlevels$Type, lty = 1, lwd = 5, col = palette(), cex = 0.7)

#Plots of my second MLR Data

myData.lm3 <- lm(Price ~ Mileage + Model, data = CarPrices1)

palette(c("skyblue4","firebrick", "skyblue", "sienna1", "gray","sienna4"))
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = CarPrices1$Model, xlim=c(0,50000), main = "Multiple Linear Regression with\n Model As The Second Variable")

abline(myData.lm3$coefficients[1], myData.lm3$coefficients[2], col = palette()[1])
 
abline(myData.lm3$coefficients[1] + myData.lm3$coefficients[3], myData.lm3$coefficients[2], col = palette()[2])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[4], myData.lm3$coefficients[2], col = palette()[3])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[5], myData.lm3$coefficients[2], col = palette()[4])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[6], myData.lm3$coefficients[2], col = palette()[5])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[7], myData.lm3$coefficients[2], col = palette()[6])

legend("topright", myData.lm3$xlevels$Model, lty = 1, lwd = 5, col = palette(), cex = 0.7)
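
The repeated `abline()` calls above all follow one pattern: the baseline intercept plus one indicator coefficient, with a shared Mileage slope. A possible way to express that pattern as a loop is sketched below on simulated data. The data frame `toy`, the level names `M1` to `M3`, and the offsets are all made up for illustration.

```r
# Simulated data for illustration only; NOT the CarPrices data.
set.seed(7)
toy <- data.frame(
  Mileage = runif(30, 0, 50000),
  Model   = factor(rep(c("M1", "M2", "M3"), each = 10))
)
shift <- c(M1 = 0, M2 = -8000, M3 = 15000)   # made-up group offsets
toy$Price <- 50000 - 0.3 * toy$Mileage +
  unname(shift[as.character(toy$Model)]) + rnorm(30, sd = 1000)

fit <- lm(Price ~ Mileage + Model, data = toy)
b <- coef(fit)

# Baseline level uses the intercept alone; every other level adds its own
# indicator coefficient. The slope on Mileage is shared by all lines.
intercepts <- b["(Intercept)"] + c(0, b[-(1:2)])
names(intercepts) <- levels(toy$Model)

cols <- c("skyblue4", "firebrick", "skyblue")
plot(Price ~ Mileage, data = toy, pch = 16, col = cols[toy$Model])
for (k in seq_along(intercepts)) {
  abline(intercepts[k], b["Mileage"], col = cols[k])
}
legend("topright", levels(toy$Model), lty = 1, lwd = 5, col = cols, cex = 0.7)
```

Indexing the color vector by the factor keeps the point colors and line colors tied to the same level order, so the legend stays correct if levels are added.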


The plot on the left is a visualization of a simple linear regression with mileage as the only variable affecting the price of a Cadillac. The plot in the middle is a visualization of a multiple linear regression with the type of car as a second explanatory variable. The plot on the right is a multiple linear regression with the model of the car as the second explanatory variable.

The simple linear regression has a number of data points that sit well above its regression line. Visually, this suggests that a multiple linear regression might be more appropriate. The middle plot has two regression lines that go through both groups of data, which is an improvement over the simple linear regression. That said, the points in the Sedan group are widely spread around their regression line, so there might be a better way of explaining the price of a Cadillac. If we look at the last plot, there are 5 (almost 6) regression lines that cut cleanly through the data points. The last plot looks the most accurate. Let’s see what the data says.
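
To make the middle plot concrete, its two lines can be written out from the coefficient estimates reported in the Numeric Data section (intercept 68809, Mileage -0.3128, TypeSedan -25095). This is only a restatement of those estimates. The name of the baseline type does not appear in the output, so it is labeled "baseline" here.

\[
\hat{Y}_i = 68809 - 0.3128\,X_{1i} \quad \text{(baseline type)}
\]

\[
\hat{Y}_i = (68809 - 25095) - 0.3128\,X_{1i} = 43714 - 0.3128\,X_{1i} \quad \text{(Sedan)}
\]

Both lines share the same slope, so this model treats the difference between the two types as a constant shift of 25095 in price at every mileage.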

Numeric Data

Simple Linear Regression Price Vs. Mileage
pander(summary(myData.lm))

              Estimate   Std. Error   t value   Pr(>|t|)
-----------   --------   ----------   -------   ---------
(Intercept)   46957      2533         18.54     2.62e-30
Mileage       -0.3184    0.1212       -2.627    0.01036

Fitting linear model: Price ~ Mileage

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
------------   -------------------   -------   ----------------
80             9656                  0.0813    0.06952

Multiple Linear Regression with Type of Car As The Second Variable
pander(summary(myData.lm2))

              Estimate   Std. Error   t value   Pr(>|t|)
-----------   --------   ----------   -------   ---------
(Intercept)   68809      1887         36.47     2.462e-50
Mileage       -0.3128    0.06006      -5.208    1.546e-06
TypeSedan     -25095     1618         -15.51    2.097e-25

Fitting linear model: Price ~ Mileage + Type

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
------------   -------------------   -------   ----------------
80             4785                  0.7773    0.7715

Multiple Linear Regression with Model As The Second Variable
pander(summary(myData.lm3))

               Estimate   Std. Error   t value   Pr(>|t|)
------------   --------   ----------   -------   ---------
(Intercept)    51558      759.3        67.9      1.014e-67
Mileage        -0.3379    0.024        -14.08    1.693e-22
ModelCTS       -15488     852.3        -18.17    8.312e-29
ModelDeville   -8636      693.9        -12.45    1.01e-19
ModelSTS-V6    -7826      849.9        -9.208    7.595e-14
ModelSTS-V8    -2357      850.1        -2.773    0.007047
ModelXLR-V8    17722      849.9        20.85     1.796e-32

Fitting linear model: Price ~ Mileage + Model

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
------------   -------------------   -------   ----------------
80             1900                  0.9667    0.964


The summaries show that every p-value is smaller than our \(\alpha\) of 0.05, so we reject the null hypothesis for every coefficient in all three models. Each term therefore helps explain Cadillac price. So which model is the best? The adjusted \(R^2\) values are 0.06952, 0.7715, and 0.964 for the mileage-only, mileage-plus-type, and mileage-plus-model regressions, respectively. The last value is the highest, so the model that uses mileage and model of car as explanatory variables best describes the price of the Cadillacs.
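
The adjusted \(R^2\) values can be reproduced from the reported \(R^2\) values with the standard formula \(1 - (1 - R^2)\frac{n-1}{n-p-1}\), where \(n = 80\) observations and \(p\) is the number of estimated slope coefficients (1, 2, and 6 in the three summaries). The sketch below only re-derives numbers already shown in the output; the helper name `adj_r2` is hypothetical.

```r
# Adjusted R^2 from R^2, sample size n, and number of slope coefficients p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 80  # observations, as reported in each model summary

# R^2 values and coefficient counts copied from the summaries in this analysis.
reported <- data.frame(
  model = c("Mileage", "Mileage + Type", "Mileage + Model"),
  r2    = c(0.0813, 0.7773, 0.9667),
  p     = c(1, 2, 6)
)
reported$adj_r2 <- adj_r2(reported$r2, n, reported$p)
print(reported)
```

The penalty for the extra indicator coefficients is small here, so the three models rank in the same order under \(R^2\) and adjusted \(R^2\).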

Checking for Errors

#Side by side
par(mfrow = c(1,3))

# Added-variable check: do the simple-regression residuals differ by car model?
plot(myData.lm$residuals ~ Model, data = CarPrices1, ylab = "Residuals", main = "Residuals of Simple Regression\n Vs. Car Model")


#Checking assumption for MLR
plot(myData.lm3, which=1:2)


This section checks the assumptions of the multiple regression model that includes car model. The first plot, which shows the residuals of the simple regression by car model, demonstrates that the extra variable provides new information. The second plot (residuals vs. fitted values) shows that the relationship isn’t completely linear and that the spread of the residuals varies a little. The last plot (normal Q-Q) shows that the error terms are not normal in the lower tail, below about -0.5. Together, these plots show that this model pushes the boundaries of what a multiple linear regression can support.
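
The logic behind the first plot can be sketched on simulated data: if a grouping variable carries information that the simple model misses, the simple model's residuals will not average zero within each group. Everything below (the data frame `toy`, the groups `A` to `C`, and the offsets) is made up for illustration and is not the CarPrices data.

```r
# Simulated data for illustration only; NOT the CarPrices data.
set.seed(42)
toy <- data.frame(
  Mileage = runif(60, 5000, 45000),
  Group   = factor(rep(c("A", "B", "C"), each = 20))
)
shift <- c(A = 0, B = -8000, C = 15000)   # made-up group offsets
toy$Price <- 50000 - 0.3 * toy$Mileage +
  unname(shift[as.character(toy$Group)]) + rnorm(60, sd = 1500)

fit_simple <- lm(Price ~ Mileage, data = toy)          # ignores Group
fit_full   <- lm(Price ~ Mileage + Group, data = toy)  # includes Group

# If Group carried no extra information, the simple model's residuals
# would average about zero within every group.
toy$res_simple <- residuals(fit_simple)
group_means <- tapply(toy$res_simple, toy$Group, mean)
print(round(group_means))

par(mfrow = c(1, 3))
plot(res_simple ~ Group, data = toy, ylab = "Residuals",
     main = "Simple-model residuals\n by group")
plot(fit_full, which = 1:2)   # residuals vs. fitted, then normal Q-Q
```

In the last call, `which = 1:2` selects the residuals vs. fitted plot and the normal Q-Q plot, the same two diagnostics used above for the model that includes car model.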

Interpretation

While the type of car worked decently as a second explanatory variable, the model of the car does a better job of describing the price of the car. The regression model Brother Saunders used is the better fit. That said, I would add a word of caution: his model struggled with three of the assumptions required for a multiple linear regression. Perhaps there is a better mathematical model for predicting car prices. That question should be explored in a future analysis.