In this project, we will use Linear Regression to predict the price of a car given its features.
The metric used to assess goodness of fit is R-squared, which measures the proportion of variability in the dependent variable (price) that can be explained by the independent variables.
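As a quick, self-contained illustration of the metric (using the built-in mtcars data rather than the car data loaded below), R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares:
fit <- lm(mpg ~ wt, data = mtcars)                  # simple example regression
ss_res <- sum(residuals(fit)^2)                     # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
1 - ss_res / ss_tot                                 # equals summary(fit)$r.squared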
library(ggplot2)
library(dplyr)
library(caret)
library(readxl)
library(leaps)
car <- read_excel("car.xls")
head(car)
## # A tibble: 6 x 12
## Price Mileage Make Model Trim Type Cylinder Liter Doors Cruise Sound
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17314. 8221 Buick Cent~ Seda~ Sedan 6 3.1 4 1 1
## 2 17542. 9135 Buick Cent~ Seda~ Sedan 6 3.1 4 1 1
## 3 16219. 13196 Buick Cent~ Seda~ Sedan 6 3.1 4 1 1
## 4 16337. 16342 Buick Cent~ Seda~ Sedan 6 3.1 4 1 0
## 5 16339. 19832 Buick Cent~ Seda~ Sedan 6 3.1 4 1 0
## 6 15709. 22236 Buick Cent~ Seda~ Sedan 6 3.1 4 1 1
## # ... with 1 more variable: Leather <dbl>
summary(car)
## Price Mileage Make Model
## Min. : 8639 Min. : 266 Length:804 Length:804
## 1st Qu.:14273 1st Qu.:14624 Class :character Class :character
## Median :18025 Median :20914 Mode :character Mode :character
## Mean :21343 Mean :19832
## 3rd Qu.:26717 3rd Qu.:25213
## Max. :70755 Max. :50387
## Trim Type Cylinder Liter
## Length:804 Length:804 Min. :4.000 Min. :1.600
## Class :character Class :character 1st Qu.:4.000 1st Qu.:2.200
## Mode :character Mode :character Median :6.000 Median :2.800
## Mean :5.269 Mean :3.037
## 3rd Qu.:6.000 3rd Qu.:3.800
## Max. :8.000 Max. :6.000
## Doors Cruise Sound Leather
## Min. :2.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:4.000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :4.000 Median :1.0000 Median :1.0000 Median :1.0000
## Mean :3.527 Mean :0.7525 Mean :0.6791 Mean :0.7239
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :4.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
# Convert the categorical variables to factors and price to a whole-dollar integer
car$Leather <- as.factor(car$Leather)
car$Price <- as.integer(car$Price)
car$Sound <- as.factor(car$Sound)
car$Make <- as.factor(car$Make)
car$Cylinder <- as.factor(car$Cylinder)
car$Doors <- as.factor(car$Doors)
car$Cruise <- as.factor(car$Cruise)
ggplot(car, aes(x = Mileage, y = Price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
xlab("Mileage") +
ylab("Price") +
ggtitle("Price of Car by Mileage")
There is a negative correlation between the mileage of a car and its price.
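To quantify this relationship, the Pearson correlation between mileage and price can be checked directly (a quick sketch; the exact value is not reported here, but its sign should be negative):
cor(car$Mileage, car$Price)  # expected to be negative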
ggplot(car, aes(x = Make, y = Price, fill = Make)) +
geom_boxplot(outlier.color = "red") +
theme(legend.position = "none") +
xlab("Make") +
ylab("Price") +
ggtitle("Price Distribution by Make")
Cadillac has the widest spread of prices, ranging from about $30,000 to upwards of $70,000. Chevrolet has the most outliers among these distributions. Saturn has the least variation and the least expensive cars, all of them priced below $20,000.
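A quick numeric summary by make backs up this reading of the boxplot (a sketch; the figures are not reproduced here):
car %>%
  group_by(Make) %>%
  summarise(min_price = min(Price), median_price = median(Price), max_price = max(Price))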
ggplot(car, aes(x = Cylinder, y = Price, fill = Make)) +
geom_boxplot(outlier.color = "red") +
xlab("Number of Cylinders") +
ylab("Price") +
ggtitle("Price Distribution of Engine Cylinders")
The greatest spread appears among Cadillacs with 8-cylinder engines; this group also has many outliers, all priced above $60,000. Not every make offers every engine size: Cadillac does not use 4-cylinder engines, Buick does not use 4- or 8-cylinder engines, SAAB does not use 6- or 8-cylinder engines, and Saturn does not use 8-cylinder engines.
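A simple cross-tabulation of make against cylinder count confirms which combinations are absent (zero cells):
table(car$Make, car$Cylinder)  # zero counts mark engine sizes a make does not offer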
ggplot(car, aes(x = Type, y = Price, fill = Type)) +
geom_boxplot(outlier.color = "red") +
theme(legend.position = "none") +
xlab("Type of Car") +
ylab("Price") +
ggtitle("Price Distribution of Car Types")
There is some disparity between types: convertibles have the highest average price, while hatchbacks cost the least on average.
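The average price by body type can be checked directly (a quick sketch):
car %>%
  group_by(Type) %>%
  summarise(mean_price = mean(Price)) %>%
  arrange(desc(mean_price))  # convertibles expected at the top, hatchbacks at the bottom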
ggplot(car, aes(x = Price, fill = Doors, color = Type)) +
geom_density(alpha = 0.3) +
xlab("Price") +
ylab("Density") +
labs(subtitle = "Highlighted by Number of Doors") +
ggtitle("Price Distribution of Car Types")
Cars with two doors are priced higher than cars with four doors. This is largely due to convertibles, which tend to command higher prices than four-door cars.
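A grouped summary by number of doors and body type makes the convertible effect visible (a sketch; the figures are not reproduced here):
car %>%
  group_by(Doors, Type) %>%
  summarise(median_price = median(Price), n = n(), .groups = "drop")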
ggplot(car, aes(x = Sound, y = Price, fill = Sound)) +
geom_boxplot(outlier.color = "red") +
theme(legend.position = "none") +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
xlab("Sound") +
ylab("Price") +
ggtitle("Price of Cars with or without Sound System")
The dispersion of prices is slightly greater for cars without upgraded speakers, though there are many outliers among cars with upgraded speakers. Cars with upgraded speakers are no more expensive than cars without them.
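Group summaries by the Sound factor give a quick numeric check of this claim (a sketch):
car %>%
  group_by(Sound) %>%
  summarise(median_price = median(Price), mean_price = mean(Price))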
ggplot(car, aes(x = Cruise, y = Price, fill = Make)) +
geom_boxplot(outlier.color = "red") +
scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
xlab("Cruise Control") +
ylab("Price") +
ggtitle("Price of Cars with or without Cruise Control")
Saab, Buick, and Cadillac do not offer cars without cruise control. The greatest variation in price is among Cadillacs with cruise control; the smallest is among Pontiacs without it. Cars without cruise control sell for less than $20,000, while a large proportion of cars with cruise control sell for more than $20,000.
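The share of cars selling above $20,000 in each cruise-control group can be computed as a quick check (a sketch; the proportions are not reproduced here):
car %>%
  group_by(Cruise) %>%
  summarise(share_above_20k = mean(Price > 20000), n = n())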
res <- regsubsets(Price ~ Mileage + Make + Cylinder + Liter + Doors + Cruise +
                  Sound + Leather + Type, data = car, nbest = 3,
                  method = "exhaustive", really.big = TRUE)
par(cex.axis = 1, cex.lab = 1)
plot(res, scale = "bic")
The BIC plot ranks the best-fitting models given the set of predictors available in the data. The best-fitting model by BIC includes mileage, make, the number of cylinders, liters, the number of doors, and the type of car; it excludes cruise control, upgraded sound, and leather. To reduce model complexity, we will also exclude the number of doors and cylinders.
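The predictors in the lowest-BIC subset can also be extracted programmatically from the regsubsets summary (a sketch of the same information shown in the plot):
res_sum <- summary(res)
res_sum$which[which.min(res_sum$bic), ]  # TRUE marks the variables in the best model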
set.seed(444)
data <- car %>%
select(-c(Model, Trim, Cylinder, Doors, Cruise, Sound, Leather))
data$Price <- log(data$Price)
colnames(data) <- make.names(colnames(data))
train_ind <- createDataPartition(y = data$Price, p = 0.8, list = FALSE)
training <- data[train_ind,]
testing <- data[-train_ind,]
The response variable is modeled as the natural log of the price.
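Because the model is fit on the log scale, exp() converts the logged values (or later predictions) back to dollars, for example:
head(exp(training$Price))  # back-transform the log prices to dollars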
lm_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitted_lm <- train(Price~., data=training,
trControl = lm_ctrl,
method = "lm",
metric = "Rsquared")
plot(fitted_lm$finalModel, which = 1)
The residual plot of the transformed model shows constant variance: the "random" scatter indicates that the spread of the residuals stays roughly constant as the fitted values increase. The log transformation fixes the non-constant variance that appeared at higher fitted values before the transformation.
plot(fitted_lm$finalModel, which = 2)
Transforming the response variable with the natural log also makes the Q-Q plot follow the normal line more closely.
The assumptions of linear regression are satisfied while also achieving an R-squared value of 0.9507 on the training data.
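The training R-squared reported above comes from the repeated cross-validation resamples, which caret stores in the model's results table:
fitted_lm$results  # resampled RMSE, Rsquared, and MAE for the final model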
predict_lm <- predict(fitted_lm, testing)
R2(predict_lm, testing$Price)
## [1] 0.9602989
Using the fitted linear model, the R-squared value on the testing data is 0.9602989.
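To express the test error in dollars rather than log units, the predictions can be exponentiated before computing the RMSE (a sketch; the dollar figure is not reported here):
RMSE(exp(predict_lm), exp(testing$Price))  # caret's RMSE on the original price scale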
Based on the training and testing results, the R-squared metric indicates a well-fitted model.