In this project, we will use Linear Regression to predict the price of a car given its features.

The metric used to assess the goodness of fit of the model is R-squared, which measures the proportion of variability in the dependent variable that can be explained by the independent variables.
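
As a reference for how this metric is computed, the sketch below (not part of the original analysis) expresses R-squared in terms of the residual and total sums of squares; actual and predicted are placeholder vectors.

# Sketch: R-squared as the proportion of variance explained.
# 'actual' and 'predicted' are hypothetical numeric vectors of equal length.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)     # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares
  1 - ss_res / ss_tot
}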

Setup

Load Packages

library(ggplot2)
library(dplyr)
library(caret)
library(readxl)
library(leaps)

Load dataset

car <- read_excel("car.xls")
head(car)
## # A tibble: 6 x 12
##    Price Mileage Make  Model Trim  Type  Cylinder Liter Doors Cruise Sound
##    <dbl>   <dbl> <chr> <chr> <chr> <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 17314.    8221 Buick Cent~ Seda~ Sedan        6   3.1     4      1     1
## 2 17542.    9135 Buick Cent~ Seda~ Sedan        6   3.1     4      1     1
## 3 16219.   13196 Buick Cent~ Seda~ Sedan        6   3.1     4      1     1
## 4 16337.   16342 Buick Cent~ Seda~ Sedan        6   3.1     4      1     0
## 5 16339.   19832 Buick Cent~ Seda~ Sedan        6   3.1     4      1     0
## 6 15709.   22236 Buick Cent~ Seda~ Sedan        6   3.1     4      1     1
## # ... with 1 more variable: Leather <dbl>
summary(car)
##      Price          Mileage          Make              Model          
##  Min.   : 8639   Min.   :  266   Length:804         Length:804        
##  1st Qu.:14273   1st Qu.:14624   Class :character   Class :character  
##  Median :18025   Median :20914   Mode  :character   Mode  :character  
##  Mean   :21343   Mean   :19832                                        
##  3rd Qu.:26717   3rd Qu.:25213                                        
##  Max.   :70755   Max.   :50387                                        
##      Trim               Type              Cylinder         Liter      
##  Length:804         Length:804         Min.   :4.000   Min.   :1.600  
##  Class :character   Class :character   1st Qu.:4.000   1st Qu.:2.200  
##  Mode  :character   Mode  :character   Median :6.000   Median :2.800  
##                                        Mean   :5.269   Mean   :3.037  
##                                        3rd Qu.:6.000   3rd Qu.:3.800  
##                                        Max.   :8.000   Max.   :6.000  
##      Doors           Cruise           Sound           Leather      
##  Min.   :2.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:4.000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :4.000   Median :1.0000   Median :1.0000   Median :1.0000  
##  Mean   :3.527   Mean   :0.7525   Mean   :0.6791   Mean   :0.7239  
##  3rd Qu.:4.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Clean dataset

Convert categorical variables to factors and Price to integer

car$Leather <- as.factor(car$Leather)
car$Price <- as.integer(car$Price)
car$Sound <- as.factor(car$Sound)
car$Make <- as.factor(car$Make)
car$Cylinder <- as.factor(car$Cylinder)
car$Doors <- as.factor(car$Doors)
car$Cruise <- as.factor(car$Cruise)

EDA

ggplot(car, aes(x = Mileage, y = Price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Mileage") +
  ylab("Price") +
  ggtitle("Price of Car by Mileage")

There is a negative correlation between the mileage of a car and its price.
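
As a quick numeric check of this claim (not shown in the original output), the Pearson correlation between the two variables can be computed directly; a negative value confirms the downward trend in the plot.

# Pearson correlation between mileage and price; expected to be negative.
cor(car$Mileage, car$Price)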

ggplot(car, aes(x = Make, y = Price, fill = Make)) + 
  geom_boxplot(outlier.color = "red") + 
  theme(legend.position = "none") +
  xlab("Make") +
  ylab("Price") +
  ggtitle("Price Distribution by Make")

Cadillac has the widest spread of prices, with cars costing as little as about $30,000 and as much as upwards of $70,000. Chevrolet has the most outliers among the makes. Saturn has the least variation and the least expensive cars, all of them costing below $20,000.

ggplot(car, aes(x = Cylinder, y = Price, fill = Make)) + 
  geom_boxplot(outlier.color = "red") +
  xlab("Number of Cylinders") +
  ylab("Price") +
  ggtitle("Price Distribution of Engine Cylinders")

The widest spread appears among Cadillacs with 8-cylinder engines. This group also has a multitude of outliers, all of them above $60,000. Not every make offers every engine size: Cadillac does not use 4-cylinder engines, Buick does not use 4-cylinder or 8-cylinder engines, Saab does not use 6-cylinder or 8-cylinder engines, and Saturn does not use 8-cylinder engines.

ggplot(car, aes(x = Type, y = Price, fill = Type)) + 
  geom_boxplot(outlier.color = "red") + 
  theme(legend.position = "none") +
  xlab("Type of Car") +
  ylab("Price") +
  ggtitle("Price Distribution of Car Types")

There is some disparity between types: convertibles have the highest median price of all types, while hatchbacks cost the least on average.

ggplot(car, aes(x = Price, fill = Doors, color = Type)) +
  geom_density(alpha = 0.3) +
  xlab("Price") +
  ylab("Density") + 
  labs(subtitle = "Highlighted by Number of Doors") +
  ggtitle("Price Distribution of Car Types")

Cars with two doors are priced higher than cars with four doors, largely due to the influence of convertible sports cars, which tend to command relatively high prices.

ggplot(car, aes(x = Sound, y = Price, fill = Sound)) + 
  geom_boxplot(outlier.color = "red") + 
  theme(legend.position = "none") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  xlab("Sound") +
  ylab("Price") +
  ggtitle("Price of Cars with or without Sound System")

The dispersion of prices is slightly greater for cars without upgraded speakers, though there are many outliers among cars with upgraded speakers. Cars with upgraded speakers are not noticeably more expensive than cars without them.

ggplot(car, aes(x = Cruise, y = Price, fill = Make)) +
  geom_boxplot(outlier.color = "red") +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
  xlab("Cruise Control") +
  ylab("Price") +
  ggtitle("Price of Cars with or without Cruise Control")

Saab, Buick, and Cadillac do not offer cars without cruise control. The greatest variation in price is among Cadillacs with cruise control, and the smallest is among Pontiacs without it. Cars without cruise control sell for less than $20,000, while a large proportion of cars with cruise control sell for more than $20,000.
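
A simple summary by cruise-control status can back up these observations (a sketch; the exact figures depend on the data and are not reported in the original output):

# Median and maximum price for cars with and without cruise control.
car %>%
  group_by(Cruise) %>%
  summarise(median_price = median(Price), max_price = max(Price))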

Variable Selection

Bayesian Information Criterion

res <- regsubsets(Price ~ Mileage + Make + Cylinder + Liter + Doors + Cruise + 
                    Sound + Leather + Type, data = car, nbest = 3, 
                  method = "exhaustive", really.big = TRUE)
par(cex.axis = 1, cex.lab = 1)
plot(res, scale = "bic")

The BIC table ranks some of the best-fitting models given the set of predictors available in the data. The best-fitting model by BIC includes mileage, make, the number of cylinders, liters, the number of doors, and the type of car; it excludes cruise control, upgraded sound, and leather. To reduce model complexity, we will also exclude the number of doors and cylinders.
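
For readers who prefer numbers to the plot, the BIC values and the coefficients of the lowest-BIC model can be pulled from the regsubsets object directly (a sketch of the lookup, not additional analysis):

# Inspect the BIC of each candidate model and the lowest-BIC model's terms.
res_summary <- summary(res)
res_summary$bic                         # BIC for each model in the search
coef(res, which.min(res_summary$bic))   # coefficients of the best model by BIC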

Train test split

set.seed(444)

data <- car %>%
  select(-c(Model, Trim, Cylinder, Doors, Cruise, Sound, Leather))
data$Price <- log(data$Price)
colnames(data) <- make.names(colnames(data))

train_ind <- createDataPartition(y = data$Price, p = 0.8, list = FALSE)

training <- data[train_ind,]
testing <- data[-train_ind,]

Linear Regression

The response variable is modeled as the natural log of the price.

lm_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitted_lm <- train(Price~., data=training, 
                   trControl = lm_ctrl,
                   method = "lm",
                   metric = "Rsquared")
plot(fitted_lm$finalModel, which = 1)

The residual plot of the transformed model shows roughly constant variance, indicated by the random scatter of points: as the fitted values increase, the spread of the residuals stays roughly constant. The log transformation fixes the non-constant variance that appeared at higher fitted values in the untransformed model.
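
The visual impression of constant variance can also be checked formally with a Breusch-Pagan test, assuming the lmtest package (not loaded above) is available; a large p-value is consistent with homoscedasticity.

# Assumption: the lmtest package is installed; it is not loaded earlier.
library(lmtest)
bptest(fitted_lm$finalModel)  # Breusch-Pagan test for non-constant variance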

plot(fitted_lm$finalModel, which = 2)

Transforming the response to the natural log of price brings the residuals in the Q-Q plot closer to the normal line.
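
Normality of the residuals can also be checked formally with a Shapiro-Wilk test on the final model's residuals (a supplementary check, not part of the original analysis); a large p-value is consistent with approximate normality.

# Shapiro-Wilk test on the residuals of the fitted linear model.
shapiro.test(residuals(fitted_lm$finalModel))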

The assumptions of linear regression are satisfied while also achieving an R-squared of 0.9507 on the training data.
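
The resampled training R-squared quoted above can be read directly from the caret train object (a sketch of where the value comes from):

# Cross-validated performance summary for the fitted model.
fitted_lm$results$Rsquared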

predict_lm <- predict(fitted_lm, testing)
R2(predict_lm, testing$Price)
## [1] 0.9602989

Using the fitted linear model, the R-squared value on the testing data is 0.9602989.
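
Because the model was fit on the log of price, predictions can be mapped back to the dollar scale with exp() when interpreting individual cars (a sketch; exact values depend on the random split):

# Back-transform predicted and observed log-prices to dollars.
head(exp(predict_lm))     # predicted prices in dollars
head(exp(testing$Price))  # observed prices in dollars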

Based on the training and testing results, the R-squared metric indicates a well-fitted model.