Car Prices

Background

Dr. Saunders, at BYU-Idaho, created a multiple regression model to determine the price of Cadillac cars. He determined that mileage and car model were the two variables that best explained price. I want to see whether the type of car explains price better than the model of car does. I suspect it might, because the data appear to fall along roughly two lines and there are two types of car among the Cadillacs.

The mathematical model for these multiple linear regressions is as follows:

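\(Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_p X_{pi} + \epsilon_i\), where \(X_{1i}\) is the mileage of car \(i\) and \(X_{2i}, X_{3i}, \ldots, X_{pi}\) are indicator variables, each equal to \(1\) or \(0\), for the levels of the second explanatory variable (type or model). At most one indicator equals \(1\) for any given car, so only one level of the second explanatory variable applies at a time.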
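As a minimal sketch of how these indicator variables arise in R, the toy data frame below shows the design matrix built for a formula with one numeric and one categorical variable. The data values and the level names "A", "B", and "C" are hypothetical and are not taken from CarPrices.

```r
# Hypothetical toy data; values and level names are made up for illustration.
toy <- data.frame(
  Mileage = c(10000, 20000, 15000, 30000, 25000),
  Model   = factor(c("A", "B", "C", "B", "A"))
)

# model.matrix() builds the design matrix that lm() would use for the same formula.
X <- model.matrix(~ Mileage + Model, data = toy)
print(X)

# Each row has at most one indicator equal to 1; rows for the baseline
# level "A" have every indicator equal to 0.
indicator_sums <- rowSums(X[, c("ModelB", "ModelC")])
print(indicator_sums)
```

The baseline level gets no column of its own because its effect is absorbed into the intercept \(\beta_0\).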

Formally, the null and alternative hypotheses for each coefficient \(\beta_j\), in all the tests conducted in this analysis, are as follows:

\(H_0: \beta_j = 0\)
\(H_a: \beta_j \neq 0\)

The level of significance, for this analysis, is set at:
\(\alpha = 0.05\)

The data for this analysis is provided by Dr. Saunders and is listed below:

library(DT)  # provides datatable()
datatable(CarPrices, options=list(lengthMenu = c(5, 20, 50)))

Analysis

Visualization

#Side by side
par(mfrow = c(1,3))

# Linear Regression Model
CarPrices1 <- subset(CarPrices, Make == "Cadillac")
myData.lm <- lm(Price ~ Mileage, data = CarPrices1)
#Plot of my Simple Linear Regression model
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = "sienna1", xlim=c(0,50000), main = "Simple Linear Regression\n Price Vs. Mileage")
abline(myData.lm$coefficients[1], myData.lm$coefficients[2], col = "sienna1")


#Multiple Linear Regression


#Multiple Linear Regression Data
myData.lm2 <- lm(Price ~ Mileage + Type, data = CarPrices1)

#Plots of my MLR Data
#Extra colors: ,"skyblue","sienna1","gray","sienna4"
palette(c("skyblue4","firebrick"))
# factor() drops unused levels so the color codes line up with the legend order
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = factor(CarPrices1$Type), xlim=c(0,50000), main = "Multiple Linear Regression with\n Type of Car As The Second Variable")

abline(myData.lm2$coefficients[1], myData.lm2$coefficients[2], col = palette()[1])

abline(myData.lm2$coefficients[1] +  myData.lm2$coefficients[3], myData.lm2$coefficients[2], col = palette()[2])

legend("topright", myData.lm2$xlevels$Type, lty = 1, lwd = 5, col = palette(), cex = 0.7)

#Plots of my second MLR Data

myData.lm3 <- lm(Price ~ Mileage + Model, data = CarPrices1)

palette(c("skyblue4","firebrick", "skyblue", "sienna1", "gray","sienna4"))
# factor() drops unused levels so the color codes line up with the legend order
plot(Price ~ Mileage, data = CarPrices1, pch=16, col = factor(CarPrices1$Model), xlim=c(0,50000), main = "Multiple Linear Regression with\n Model As The Second Variable")

abline(myData.lm3$coefficients[1], myData.lm3$coefficients[2], col = palette()[1])
 
abline(myData.lm3$coefficients[1] + myData.lm3$coefficients[3], myData.lm3$coefficients[2], col = palette()[2])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[4], myData.lm3$coefficients[2], col = palette()[3])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[5], myData.lm3$coefficients[2], col = palette()[4])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[6], myData.lm3$coefficients[2], col = palette()[5])

abline(myData.lm3$coefficients[1] +  myData.lm3$coefficients[7], myData.lm3$coefficients[2], col = palette()[6])

legend("topright", myData.lm3$xlevels$Model, lty = 1, lwd = 5, col = palette(), cex = 0.7)

The plot on the left is a visualization of a simple linear regression with mileage as the only variable affecting the price of a Cadillac. The plot in the middle is a visualization of a multiple linear regression with the type of car as a second explanatory variable. The plot on the right is a multiple linear regression with the model of the car as the second explanatory variable.

The simple linear regression has a number of data points that are well above its regression line. This visually indicates that a multiple linear regression might be more appropriate. The middle plot has two regression lines that go through both groups of data. This is an improvement over the simple linear regression. That said, the Sedan group has a wide spread around its regression line, so there might be a better way of explaining the price of a Cadillac. If we look at the last plot, there are five (almost six) regression lines that cut cleanly through the data points. The last plot looks the most accurate. Let's see what the data says.

Numeric Data

Simple Linear Regression Price Vs. Mileage

library(pander)  # provides pander()
pander(summary(myData.lm))

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	46957	2533	18.54	2.62e-30
Mileage	-0.3184	0.1212	-2.627	0.01036

Fitting linear model: Price ~ Mileage
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
80	9656	0.0813	0.06952
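As a rough check on how the t value and p-value in this table relate to the hypotheses, the sketch below recomputes them for the Mileage coefficient from the rounded numbers reported above. It assumes the usual two-sided t test with \(n - 2\) residual degrees of freedom for a simple linear regression; because the inputs are rounded, the results should only approximately match the reported -2.627 and 0.01036.

```r
# Values copied from the Price ~ Mileage summary table.
estimate  <- -0.3184
std_error <- 0.1212
n_obs     <- 80
n_coef    <- 2   # intercept and Mileage slope

# Test statistic for H0: beta_j = 0 is the estimate divided by its standard error.
t_value  <- estimate / std_error
df_resid <- n_obs - n_coef

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(signif(p_value, 3))
```

Since the p-value falls below \(\alpha = 0.05\), the null hypothesis for the Mileage slope is rejected.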

Multiple Linear Regression with Type of Car As The Second Variable

pander(summary(myData.lm2))

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	68809	1887	36.47	2.462e-50
Mileage	-0.3128	0.06006	-5.208	1.546e-06
TypeSedan	-25095	1618	-15.51	2.097e-25

Fitting linear model: Price ~ Mileage + Type
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
80	4785	0.7773	0.7715

Multiple Linear Regression with Model As The Second Variable

pander(summary(myData.lm3))

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	51558	759.3	67.9	1.014e-67
Mileage	-0.3379	0.024	-14.08	1.693e-22
ModelCTS	-15488	852.3	-18.17	8.312e-29
ModelDeville	-8636	693.9	-12.45	1.01e-19
ModelSTS-V6	-7826	849.9	-9.208	7.595e-14
ModelSTS-V8	-2357	850.1	-2.773	0.007047
ModelXLR-V8	17722	849.9	20.85	1.796e-32

Fitting linear model: Price ~ Mileage + Model
Observations	Residual Std. Error	\(R^2\)	Adjusted \(R^2\)
80	1900	0.9667	0.964
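To make these coefficients concrete, the sketch below turns them into one intercept per car model (each line shares the Mileage slope) and a predicted price at a given mileage. The coefficients are copied from the table above. The level absorbed into the intercept is not named in the output, so it is labeled `baseline` here, and the 20,000-mile value is a hypothetical input chosen for illustration.

```r
# Coefficients copied from the Price ~ Mileage + Model summary table.
b0      <- 51558
b_miles <- -0.3379
offsets <- c(baseline = 0, CTS = -15488, Deville = -8636,
             "STS-V6" = -7826, "STS-V8" = -2357, "XLR-V8" = 17722)

# Each model's regression line has intercept b0 + offset and slope b_miles.
intercepts <- b0 + offsets
print(intercepts)

# Predicted price at a hypothetical mileage of 20,000.
miles     <- 20000
predicted <- intercepts + b_miles * miles
print(round(predicted))
```

These are the same intercept sums passed to the `abline()` calls in the plotting code, so each printed intercept corresponds to one line in the right-hand plot.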

The output shows that every coefficient p-value is smaller than our \(\alpha\) (0.05), so we reject the null hypothesis for each coefficient in all 3 models: mileage, type, and model each have a significant relationship with Cadillac price. So which model is the best? The adjusted \(R^2\) values are 0.06952, 0.7715, and 0.964, respectively. The last value belongs to the model that includes mileage and model of car as explanatory variables. Since it is the highest, and that model also has the smallest residual standard error (1900, compared with 4785 and 9656), it best describes the price of the Cadillacs.
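The adjusted \(R^2\) values penalize each model for the number of coefficients it uses, which is why they are the fairer comparison here. As a sketch, the standard formula \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\) can be applied to the reported \(R^2\) values, where \(p\) is the number of non-intercept coefficients (1, 2, and 6, counted from the tables above). Results should agree with the reported values up to rounding.

```r
# Standard adjusted R^2 formula; p counts non-intercept coefficients.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 80  # observations, as reported in each summary
models <- data.frame(
  model = c("Mileage", "Mileage + Type", "Mileage + Model"),
  r2    = c(0.0813, 0.7773, 0.9667),  # R^2 values copied from the summaries
  p     = c(1, 2, 6)
)

models$adj_r2 <- adj_r2(models$r2, n, models$p)
print(models)
```

Even after paying the penalty for six coefficients, the Mileage + Model fit stays well ahead of the other two.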

Checking for Errors

#Side by side
par(mfrow = c(1,3))

#Added-variable check: residuals of the simple regression against car model
plot(myData.lm$residuals ~ Model, data = CarPrices1, ylab = "Residuals", main = "Simple Regression Residuals\n Vs. Car Model")


#Checking assumption for MLR
plot(myData.lm3, which=1:2)

These plots check the assumptions of the multiple regression model that includes car model. The first plot shows the residuals of the simple regression against car model; because the residuals differ clearly from one model to the next, the extra variable provides new information. The second plot, residuals versus fitted values, shows that the regression isn't completely linear and that the spread of the residuals varies a little. The last plot, the normal Q-Q plot, shows that the error terms are not normal for errors less than -0.5. Together, these plots show that this multiple linear regression model pushes the boundaries of what you can do with a multiple linear regression.
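For readers who want to see what `plot(..., which=1:2)` is drawing, the sketch below extracts the underlying quantities from a fitted model. It uses the `mtcars` data set that ships with R purely as a stand-in, since CarPrices is not reproduced here; the formula `mpg ~ wt + factor(cyl)` is an illustrative assumption that mirrors the one-numeric-plus-one-categorical structure of the Cadillac model.

```r
# mtcars ships with R; it stands in for CarPrices1 purely as an illustration.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# Quantities behind the residuals-vs-fitted plot (which = 1).
fitted_vals <- fitted(fit)
resid_vals  <- residuals(fit)

# Quantities behind the normal Q-Q plot (which = 2): standardized residuals
# paired with theoretical normal quantiles. plot.it = FALSE skips drawing.
qq <- qqnorm(rstandard(fit), plot.it = FALSE)

# Least-squares residuals from a model with an intercept sum to (numerically) zero.
print(sum(resid_vals))
print(head(data.frame(theoretical = qq$x, standardized = qq$y)))
```

A curved trend in `resid_vals` against `fitted_vals` points to a linearity problem, a changing spread points to non-constant variance, and points in the Q-Q pairs that stray from a straight line point to non-normal errors.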

Interpretation

While the type of car worked decently as a second explanatory variable, the model of the car works better to describe the price of the car. The regression model Brother Saunders used is a better fit. That said, I would add a word of caution. The model Brother Saunders suggested struggled with three assumptions that are required for a multiple linear regression: linearity, constant variance, and normality of the error terms. Perhaps there is a better mathematical model for predicting car prices. This is a question that should be explored in a future analysis.