Tanner Norton
# Load your libraries
library(car)
library(tidyverse)
library(mosaic)
library(DT)
library(readr)
library(pander)
# Load your data after saving the csv files in your Data folder:
CarPrices <- read.csv("CarPrices.csv", header=TRUE)
Cars_dev <- read.csv("Cars_dev.csv", header=TRUE)
Cars_dev1 <- read.csv("Cars_dev1.csv", header=TRUE)
The purpose of this analysis is to create a model that fits the Cadillac Deville data as well as possible. In the example analysis, every other Cadillac model appeared to have a very good fit except the Deville. After examining the dataset, I concluded that only two variables could add predictive power for the Deville's price: whether or not the car had a sound system (Sound) and what type of trim the car had (Trim). Other variables such as the number of cylinders, the engine size in liters, and the number of doors could not add any predictive power because they take the same value for every Deville; in other words, they have no variance.
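As a quick check of that claim, one could count the distinct values of those columns; a count of one means the variable cannot help predict price. This sketch assumes the Deville data keep the usual CarPrices column names Cylinder, Liter, and Doors:
# Hypothetical check (column names Cylinder, Liter, Doors are assumed):
# a single distinct value means the variable has no variance for the Deville.
sapply(Cars_dev1[, c("Cylinder", "Liter", "Doors")], function(x) length(unique(x)))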
I first created a model with Sound as the new predicting variable, but it did not yield significant results. I then created a new model for the Deville using a different explanatory variable, Trim. Trim has three levels: DHS (most luxurious), DTS (best performance), and Sedan (most basic). Because Trim is qualitative with three levels, the second and third levels are compared against the first, which serves as the baseline. This means the t-tests measure whether there is a statistical difference between the DTS and Sedan trims and the DHS trim. For this reason DHS is absorbed into the baseline and is always omitted from the regression output.
Mathematical Model: \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{1i} X_{2i} + \beta_5 X_{1i} X_{3i} + \epsilon_i \] where \(Y_i\) is the price of the \(i\)th Deville, \(X_{1i}\) is its mileage, \(X_{2i} = 1\) when the trim is DTS (0 otherwise), and \(X_{3i} = 1\) when the trim is Sedan (0 otherwise), so DHS is the baseline trim.
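To see how R encodes these indicator and interaction terms, the design matrix can be inspected; this is just a quick sketch using the same formula that is fit below.
# Sketch: peek at the design matrix for the three-lines model.
# Columns correspond to the intercept, Mileage, the DTS and Sedan
# indicators, and the two Mileage-by-trim interaction terms.
head(model.matrix(Price ~ Mileage + Trim + Mileage*Trim, data=Cars_dev1))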
datatable(Cars_dev, options=list(lengthMenu = c(10,50)), style = "default")
Dev.lm <- lm(Price ~ Mileage + Trim + Mileage*Trim, data=Cars_dev1)
b <- coef(Dev.lm)
plot(Price ~ Mileage, data=Cars_dev1, col=c("orange","skyblue","green")[as.factor(Cars_dev1$Trim)], pch=16,
main="Three-lines Model for Cadillac Deville", ylim=c(25000,50000), cex.main=1)
abline(v=c(0,10000, 20000,30000,40000), h=c(25000,30000,35000,40000,45000,50000),
col=rgb(.8,.8,.8,.4), lty=2)
palette(c("orange","skyblue","green"))
abline(b[1], b[2], col=palette()[1])
abline(b[1]+b[3], b[2]+b[5], col=palette()[2])
abline(b[1]+b[4], b[2]+b[6], col=palette()[3])
# Add text for each regression line:
text(38000, 30500, "$-0.4281/mile", cex=0.6, col="orange", pos=3)
text(35000, 36000, "$-0.3035/mile", cex=0.6, col="skyblue", pos=1)
text(30000, 30000, "$-0.2660/mile", cex=0.6, col="green", pos=1)
legend("topright",Dev.lm$xlevels$Trim, lty=1, lwd=5, col=palette(), cex=0.9)
The Cadillac Deville data show considerable spread between points. To best fit these data, it was determined that multiple regression lines would work best, one specific to each trim of the Deville.
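For instance, each line can be read as the predicted price of a Deville at a given mileage for that trim; a short sketch (20,000 miles is only an illustrative value):
# Predicted price at 20,000 miles for each trim level in the model.
predict(Dev.lm, data.frame(Mileage = 20000, Trim = Dev.lm$xlevels$Trim))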
Below are all the plots needed to verify that the assumptions for multiple linear regression were met for this analysis. All assumptions appear to be satisfied; see each of the plots below for the details. The first plot, residuals versus fitted values, shows that the relationship is linear and the variance is nearly constant.
plot(Dev.lm, which=1)
The Q-Q plot demonstrates that the error terms are normally distributed, since the residuals fall within the plotted bounds. (The printed values 14 and 20 are the row numbers of the two observations with the largest residuals, which qqPlot flags by default.)
qqPlot(Dev.lm)
## [1] 14 20
This residuals-versus-order plot shows no pattern in the error terms across the order of the data, which means the data are not serially correlated.
# Check for departure 5 (serial correlation) with a residuals vs. order plot:
plot(Dev.lm$residuals, ylab="Residuals", las=1, cex.axis=.8)
pander(summary(Dev.lm))
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 46151 | 885.2 | 52.13 | 3.272e-26 |
| Mileage | -0.4281 | 0.04261 | -10.05 | 4.49e-10 |
| TrimDTS Sedan 4D | -1731 | 975.6 | -1.774 | 0.08875 |
| TrimSedan 4D | -8132 | 1035 | -7.86 | 4.308e-08 |
| Mileage:TrimDTS Sedan 4D | 0.1246 | 0.04589 | 2.716 | 0.01205 |
| Mileage:TrimSedan 4D | 0.1621 | 0.05122 | 3.164 | 0.004185 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 30 | 595.3 | 0.9731 | 0.9675 |
The model that was created yielded an adjusted \(R^2\) of 0.9675; with \(R^2 = 0.9731\), it explains 97.31% of the variance in the Deville prices. The old model, which used mileage as the only predicting variable, had a much smaller \(R^2\) of 0.3475, so the new model is much better at explaining the Deville data. Five of the six terms were statistically significant at the 0.05 level. Although the TrimDTS term on its own was not significant, that does not mean it should be thrown out or that the model is invalid. The F-statistic for the model was 173.6 with a corresponding p-value well below the 0.05 level, which shows that at least one of the coefficients is not equal to zero.
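As a sketch of that comparison, the mileage-only model can be refit and tested against the three-lines model with a nested-model F-test (Dev.simple is just a hypothetical name for the refit):
# Sketch: compare the mileage-only model to the three-lines model
# with a nested-model F-test. (Dev.simple is a hypothetical name.)
Dev.simple <- lm(Price ~ Mileage, data=Cars_dev1)
anova(Dev.simple, Dev.lm)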
The purpose of this analysis was to better fit the Cadillac Deville data with an appropriate regression. The variable I found most helpful in predicting the price of the Deville was Trim, with its levels DHS, DTS, and Sedan. Because the original model did not include Trim, it lacked the predictive power carried by those levels. The new regression plot shows how the data are fit with three lines, one for each trim level.