Problem Statement

We are given a data set containing information about various cars. A Chinese automobile company wants to enter the US market and produce cars locally in order to establish more competition to their US and European counterparts. Before they can do so, they want to understand which variables affect the pricing of cars in the American market. More specifically, they want to know:
Which variables are significant in predicting the price of the car?
How well do those variables describe the price of the car?

Our Goal

Our goal is to model the price of cars with the existing data set and variables. We want to learn how the variables interact and apply this knowledge to develop a business strategy.

Importing the Data

The data was sourced from Kaggle. It was downloaded in a csv format, so we need to use the read.csv() function to import the data set which is stored in a “data” folder in the working directory.

Link to the Kaggle Dataset

carPriceData <- read.csv('data/CarPrice_Assignment.csv')
head(carPriceData)

Exploratory Data Analysis

Before we can create any models, we need to examine the data for any missing values or errors.

Importing Libraries

We’re going to use these libraries to aid in visualization, cleaning, and analysis of the data set.

library(ggplot2)
library(plotly)
library(dplyr)
library(ggpubr)
library(stringr)
library(car)

Data Cleaning

Through a preliminary examination of the data, I found no missing data however there were a couple of spelling errors that we need to fix. We are also going to remove any special characters.

carPriceData$CarName <- str_replace(carPriceData$CarName, "alfa-romero", "alfa_romero")
carPriceData$CarName <- str_replace(carPriceData$CarName, "maxda", "mazda")
carPriceData$CarName <- str_replace(carPriceData$CarName, "Nissan", "nissan")
carPriceData$CarName <- str_replace(carPriceData$CarName, "porcshce", "porsche")
carPriceData$CarName <- str_replace(carPriceData$CarName, "toyouta", "toyota")
carPriceData$CarName <- str_replace(carPriceData$CarName, "vokswagen", "volkswagen")
carPriceData$CarName <- str_replace(carPriceData$CarName, "vw", "volkswagen")

The structure of the data:

str(carPriceData)
## 'data.frame':    205 obs. of  26 variables:
##  $ car_ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ symboling       : int  3 3 1 2 2 2 1 1 1 0 ...
##  $ CarName         : chr  "alfa_romero giulia" "alfa_romero stelvio" "alfa_romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype        : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration      : chr  "std" "std" "std" "std" ...
##  $ doornumber      : chr  "two" "two" "two" "four" ...
##  $ carbody         : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel      : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation  : chr  "front" "front" "front" "front" ...
##  $ wheelbase       : num  88.6 88.6 94.5 99.8 99.4 ...
##  $ carlength       : num  169 169 171 177 177 ...
##  $ carwidth        : num  64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
##  $ carheight       : num  48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
##  $ curbweight      : int  2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
##  $ enginetype      : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber  : chr  "four" "four" "six" "four" ...
##  $ enginesize      : int  130 130 152 109 136 136 136 136 131 131 ...
##  $ fuelsystem      : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...
##  $ boreratio       : num  3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
##  $ stroke          : num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
##  $ compressionratio: num  9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
##  $ horsepower      : int  111 111 154 102 115 110 110 110 140 160 ...
##  $ peakrpm         : int  5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
##  $ citympg         : int  21 21 19 24 18 19 19 19 17 16 ...
##  $ highwaympg      : int  27 27 26 30 22 25 25 25 20 22 ...
##  $ price           : num  13495 16500 16500 13950 17450 ...

We are going to remove the car_ID because it is not necessary for analysis.

carPriceData <- subset(carPriceData, select = -c(car_ID))

Correlation Between Variables

We are going to use the cor() function to compute the correlation coefficient between all numerical variables within the data set.

cor(carPriceData[, unlist(lapply(carPriceData, is.numeric))])
##                     symboling  wheelbase  carlength   carwidth   carheight
## symboling         1.000000000 -0.5319537 -0.3576115 -0.2329191 -0.54103820
## wheelbase        -0.531953682  1.0000000  0.8745875  0.7951436  0.58943476
## carlength        -0.357611523  0.8745875  1.0000000  0.8411183  0.49102946
## carwidth         -0.232919061  0.7951436  0.8411183  1.0000000  0.27921032
## carheight        -0.541038200  0.5894348  0.4910295  0.2792103  1.00000000
## curbweight       -0.227690588  0.7763863  0.8777285  0.8670325  0.29557173
## enginesize       -0.105789709  0.5693287  0.6833599  0.7354334  0.06714874
## boreratio        -0.130051360  0.4887499  0.6064544  0.5591499  0.17107092
## stroke           -0.008735141  0.1609590  0.1295326  0.1829417 -0.05530667
## compressionratio -0.178515084  0.2497858  0.1584137  0.1811286  0.26121423
## horsepower        0.070872724  0.3532945  0.5526230  0.6407321 -0.10880206
## peakrpm           0.273606245 -0.3604687 -0.2872422 -0.2200123 -0.32041072
## citympg          -0.035822628 -0.4704136 -0.6709087 -0.6427043 -0.04863963
## highwaympg        0.034606001 -0.5440819 -0.7046616 -0.6772179 -0.10735763
## price            -0.079978225  0.5778156  0.6829200  0.7593253  0.11933623
##                  curbweight  enginesize    boreratio       stroke
## symboling        -0.2276906 -0.10578971 -0.130051360 -0.008735141
## wheelbase         0.7763863  0.56932868  0.488749875  0.160959047
## carlength         0.8777285  0.68335987  0.606454358  0.129532611
## carwidth          0.8670325  0.73543340  0.559149909  0.182941693
## carheight         0.2955717  0.06714874  0.171070922 -0.055306674
## curbweight        1.0000000  0.85059407  0.648479749  0.168790035
## enginesize        0.8505941  1.00000000  0.583774327  0.203128588
## boreratio         0.6484797  0.58377433  1.000000000 -0.055908983
## stroke            0.1687900  0.20312859 -0.055908983  1.000000000
## compressionratio  0.1513617  0.02897136  0.005197339  0.186110110
## horsepower        0.7507393  0.80976865  0.573676823  0.080939536
## peakrpm          -0.2662432 -0.24465983 -0.254975528 -0.067963753
## citympg          -0.7574138 -0.65365792 -0.584531716 -0.042144754
## highwaympg       -0.7974648 -0.67746991 -0.587011784 -0.043930930
## price             0.8353049  0.87414480  0.553173237  0.079443084
##                  compressionratio  horsepower     peakrpm     citympg
## symboling            -0.178515084  0.07087272  0.27360625 -0.03582263
## wheelbase             0.249785845  0.35329448 -0.36046875 -0.47041361
## carlength             0.158413706  0.55262297 -0.28724220 -0.67090866
## carwidth              0.181128627  0.64073208 -0.22001230 -0.64270434
## carheight             0.261214226 -0.10880206 -0.32041072 -0.04863963
## curbweight            0.151361740  0.75073925 -0.26624318 -0.75741378
## enginesize            0.028971360  0.80976865 -0.24465983 -0.65365792
## boreratio             0.005197339  0.57367682 -0.25497553 -0.58453172
## stroke                0.186110110  0.08093954 -0.06796375 -0.04214475
## compressionratio      1.000000000 -0.20432623 -0.43574051  0.32470142
## horsepower           -0.204326226  1.00000000  0.13107251 -0.80145618
## peakrpm              -0.435740514  0.13107251  1.00000000 -0.11354438
## citympg               0.324701425 -0.80145618 -0.11354438  1.00000000
## highwaympg            0.265201389 -0.77054389 -0.05427481  0.97133704
## price                 0.067983506  0.80813882 -0.08526715 -0.68575134
##                   highwaympg       price
## symboling         0.03460600 -0.07997822
## wheelbase        -0.54408192  0.57781560
## carlength        -0.70466160  0.68292002
## carwidth         -0.67721792  0.75932530
## carheight        -0.10735763  0.11933623
## curbweight       -0.79746479  0.83530488
## enginesize       -0.67746991  0.87414480
## boreratio        -0.58701178  0.55317324
## stroke           -0.04393093  0.07944308
## compressionratio  0.26520139  0.06798351
## horsepower       -0.77054389  0.80813882
## peakrpm          -0.05427481 -0.08526715
## citympg           0.97133704 -0.68575134
## highwaympg        1.00000000 -0.69759909
## price            -0.69759909  1.00000000

When specifically looking at the price variable, we can make note of correlation coefficients that are closer to 1 or -1 indicating a stronger linear relationship with the price variable. Most notably, symboling, carheight, stroke, compressionratio, and peakrpm have really low correlation so we can remove them from the model equation. We will consider coefficients at 0.5 or greater to have a strong correlation. From this, we can see that carlength, carwidth, curbweight, enginesize, and horsepower have a strong positive correlation, while citympg and highwaympg have a strong negative correlation with price.

Correlation Visualization

To understand the correlation between non-numeric variables and price, we can visualize the data using boxplots.

# first we will split up the data into a categorical subset
carPriceData_categorical <- subset(carPriceData, select = !sapply(carPriceData, is.numeric))
str(carPriceData_categorical) # just so we can see all the categorical variables
## 'data.frame':    205 obs. of  10 variables:
##  $ CarName       : chr  "alfa_romero giulia" "alfa_romero stelvio" "alfa_romero Quadrifoglio" "audi 100 ls" ...
##  $ fueltype      : chr  "gas" "gas" "gas" "gas" ...
##  $ aspiration    : chr  "std" "std" "std" "std" ...
##  $ doornumber    : chr  "two" "two" "two" "four" ...
##  $ carbody       : chr  "convertible" "convertible" "hatchback" "sedan" ...
##  $ drivewheel    : chr  "rwd" "rwd" "rwd" "fwd" ...
##  $ enginelocation: chr  "front" "front" "front" "front" ...
##  $ enginetype    : chr  "dohc" "dohc" "ohcv" "ohc" ...
##  $ cylindernumber: chr  "four" "four" "six" "four" ...
##  $ fuelsystem    : chr  "mpfi" "mpfi" "mpfi" "mpfi" ...

Looking at all these box plots, we can see that on average diesel fuel types are more expensive than gas. Similarly, turbo aspiration tends to be more expensive. Convertible and hardtop car bodies tend to be higher priced. Rear wheel drive (rwd) also tends to be more expensive. Rear engine locations are much more expensive than cars with engines in the front. The ohcv engine type is higher priced than other engine types. The cylinder number generally increases proportionally to an increase in price. The mpfi fuel system tends to be higher priced but there isn’t much of an overall effect. The number of doors doesn’t have much of an effect on the pricing, so that variable can also be removed from the model equation.

We can also evaluate the car name and it’s relation to price.

# first we have to group the car names together
nameData <- carPriceData %>%
  mutate(grouped_brand_ = case_when(
    grepl("^alfa_romero", CarName) ~ "alfa",
    grepl("^audi", CarName) ~ "audi",
    grepl("^bmw", CarName) ~ "bmw",
    grepl("^buick", CarName) ~ "buick",
    grepl("^chevrolet", CarName) ~ "chevrolet",
    grepl("^dodge", CarName) ~ "dodge",
    grepl("^honda", CarName) ~ "honda",
    grepl("^isuzu", CarName) ~ "isuzu",
    grepl("^jaguar", CarName) ~ "jaguar",
    grepl("^mazda", CarName) ~ "mazda",
    grepl("^mercury", CarName) ~ "mercury",
    grepl("^mitsubishi", CarName) ~ "mitsubishi",
    grepl("^nissan", CarName) ~ "nissan",
    grepl("^peugeot", CarName) ~ "peugeot",
    grepl("^plymouth", CarName) ~ "plymouth",
    grepl("^porsche", CarName) ~ "porsche",
    grepl("^renault", CarName) ~ "renault",
    grepl("^saab", CarName) ~ "saab",
    grepl("^subaru", CarName) ~ "subaru",
    grepl("^toyota", CarName) ~ "toyota",
    grepl("^volkswagen", CarName) ~ "volkswagen",
    grepl("^volvo", CarName) ~ "volvo",
  ))

# group the brands and calculate the average price
avgPrice <- nameData %>% group_by(grouped_brand_) %>% summarise(avg_price = mean(price))

sortedPrice <- avgPrice %>% arrange(desc(avg_price))

g <- ggplot(sortedPrice, aes(x = reorder(grouped_brand_, -avg_price), y = avg_price)) + 
  geom_bar(stat = "identity", fill = "lightblue") + 
  labs(x = "Car Brand", y = "Average Price") + theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(g)

Data Preparation

Because we have many categorical variables, we need to encode them as dummy variables (using binary) so that we can use it in our model.

# first we can get all the numeric variables
carPriceData_numeric <- subset(carPriceData, select = sapply(carPriceData, is.numeric))

# we can also remove symboling, carheight, stroke, compressionratio, and peakrpm because they don't have high correlation
carPriceData_numeric <- subset(carPriceData_numeric, select = -c(symboling, carheight, stroke, compressionratio, peakrpm))

# group the car names and add it to the numeric variables
groups <- subset(nameData, select = c(grouped_brand_))

# now we are going to use the model.matrix function to encode all the dummy variables (grouped_brand, enginetype, fueltype, carbody, aspiration, cylindernumber, drivewheel)
dummy_engine <- model.matrix(~ enginetype - 1, data = carPriceData)
dummy_fuel <- model.matrix(~ fueltype - 1, data = carPriceData)
dummy_carbody <- model.matrix(~ carbody - 1, data = carPriceData)
dummy_asp <- model.matrix(~ aspiration - 1, data = carPriceData)
dummy_cyl <- model.matrix(~ cylindernumber - 1, data = carPriceData)
dummy_drive <- model.matrix(~ drivewheel - 1, data = carPriceData)
dummy_brand <- model.matrix(~ grouped_brand_ - 1, data = groups)

finalDf <- carPriceData_numeric

finalDf <- cbind(finalDf, dummy_engine)
finalDf <- cbind(finalDf, dummy_fuel)
finalDf <- cbind(finalDf, dummy_carbody)
finalDf <- cbind(finalDf, dummy_asp)
finalDf <- cbind(finalDf, dummy_cyl)
finalDf <- cbind(finalDf, dummy_drive)
finalDf <- cbind(finalDf, dummy_brand)

head(finalDf)
finalDf

Building the Model

Now that we have the final data set put together, we can finally build a multiple linear regression model

# first we're going to split the data set into 70% for training and 30% for testing
set.seed(1)

dfSample <- sample(c(TRUE, FALSE), nrow(finalDf), replace = TRUE, prob = c(0.7, 0.3))

trainSet <- finalDf[dfSample, ]
testSet <- finalDf[!dfSample, ]

print(paste("Dimension of the final data set: ", dim(finalDf)[1]))
## [1] "Dimension of the final data set:  205"
print(paste("Dimension of the training data set: ", dim(trainSet)[1]))
## [1] "Dimension of the training data set:  143"
print(paste("Dimension of the testing data set: ", dim(testSet)[1]))
## [1] "Dimension of the testing data set:  62"

The original data set had 205 rows. The training set has 143 rows which is approximately 69.8% of the original rows while the test set has 62 rows or about 30.2% of the original rows.

model <- lm(price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + citympg + highwaympg + enginetypedohcv + enginetypel + enginetypeohc +
              enginetypeohcf + enginetypeohcv +enginetyperotor + fueltypediesel + fueltypegas + carbodyconvertible + carbodyhardtop + carbodyhatchback + carbodysedan + carbodywagon +
              aspirationstd + aspirationturbo + cylindernumbereight + cylindernumberfive + cylindernumberfour + cylindernumbersix + cylindernumberthree + cylindernumbertwelve +
              cylindernumbertwo + drivewheel4wd + drivewheelfwd + drivewheelrwd + grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw + grouped_brand_buick +
              grouped_brand_chevrolet + grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury + 
              grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab + 
              grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen + grouped_brand_volvo, data = trainSet)
summary(model)
## 
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight + 
##     enginesize + boreratio + horsepower + citympg + highwaympg + 
##     enginetypedohcv + enginetypel + enginetypeohc + enginetypeohcf + 
##     enginetypeohcv + enginetyperotor + fueltypediesel + fueltypegas + 
##     carbodyconvertible + carbodyhardtop + carbodyhatchback + 
##     carbodysedan + carbodywagon + aspirationstd + aspirationturbo + 
##     cylindernumbereight + cylindernumberfive + cylindernumberfour + 
##     cylindernumbersix + cylindernumberthree + cylindernumbertwelve + 
##     cylindernumbertwo + drivewheel4wd + drivewheelfwd + drivewheelrwd + 
##     grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw + 
##     grouped_brand_buick + grouped_brand_chevrolet + grouped_brand_dodge + 
##     grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar + 
##     grouped_brand_mazda + grouped_brand_mercury + grouped_brand_mitsubishi + 
##     grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth + 
##     grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab + 
##     grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen + 
##     grouped_brand_volvo, data = trainSet)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2906.1  -829.1     0.0   700.3  4141.5 
## 
## Coefficients: (10 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -43092.842  14919.136  -2.888 0.004785 ** 
## wheelbase                   225.086     76.781   2.932 0.004216 ** 
## carlength                  -144.172     50.483  -2.856 0.005261 ** 
## carwidth                    930.937    269.600   3.453 0.000826 ***
## curbweight                    3.715      1.561   2.380 0.019307 *  
## enginesize                  105.541     35.537   2.970 0.003763 ** 
## boreratio                 -8034.254   3239.588  -2.480 0.014879 *  
## horsepower                    2.635     18.649   0.141 0.887922    
## citympg                    -281.149    142.248  -1.976 0.050971 .  
## highwaympg                  256.376    121.604   2.108 0.037610 *  
## enginetypedohcv           -5966.470   4981.010  -1.198 0.233927    
## enginetypel               -6455.592   1701.437  -3.794 0.000259 ***
## enginetypeohc             -2570.096   1072.819  -2.396 0.018530 *  
## enginetypeohcf            10947.271   2518.309   4.347 3.43e-05 ***
## enginetypeohcv            -1560.281   1225.843  -1.273 0.206155    
## enginetyperotor           10036.574   6483.229   1.548 0.124893    
## fueltypediesel              240.913   1013.072   0.238 0.812539    
## fueltypegas                      NA         NA      NA       NA    
## carbodyconvertible         3027.699   1433.372   2.112 0.037257 *  
## carbodyhardtop              285.589   1251.253   0.228 0.819943    
## carbodyhatchback           -387.202    680.913  -0.569 0.570920    
## carbodysedan                665.132    563.891   1.180 0.241098    
## carbodywagon                     NA         NA      NA       NA    
## aspirationstd             -1751.228    695.229  -2.519 0.013424 *  
## aspirationturbo                  NA         NA      NA       NA    
## cylindernumbereight        7045.901   3747.252   1.880 0.063101 .  
## cylindernumberfive         6018.004   5284.878   1.139 0.257653    
## cylindernumberfour         6761.511   5621.947   1.203 0.232051    
## cylindernumbersix          2028.775   3722.568   0.545 0.587022    
## cylindernumberthree       13444.000   6523.197   2.061 0.042012 *  
## cylindernumbertwelve             NA         NA      NA       NA    
## cylindernumbertwo                NA         NA      NA       NA    
## drivewheel4wd              -968.227   1136.728  -0.852 0.396463    
## drivewheelfwd             -1382.954    773.856  -1.787 0.077078 .  
## drivewheelrwd                    NA         NA      NA       NA    
## grouped_brand_alfa        -2239.428   2422.011  -0.925 0.357484    
## grouped_brand_audi         -823.776   2374.698  -0.347 0.729428    
## grouped_brand_bmw          6634.391   1200.122   5.528 2.78e-07 ***
## grouped_brand_buick         807.553   2573.531   0.314 0.754360    
## grouped_brand_chevrolet          NA         NA      NA       NA    
## grouped_brand_dodge       -3626.610   1474.970  -2.459 0.015733 *  
## grouped_brand_honda       -2690.889   1505.235  -1.788 0.076982 .  
## grouped_brand_isuzu       -1281.160   1395.728  -0.918 0.360964    
## grouped_brand_jaguar      -1338.616   3169.303  -0.422 0.673700    
## grouped_brand_mazda       -1364.558   1188.343  -1.148 0.253704    
## grouped_brand_mercury            NA         NA      NA       NA    
## grouped_brand_mitsubishi  -4137.431   1484.641  -2.787 0.006416 ** 
## grouped_brand_nissan      -2003.044   1271.953  -1.575 0.118597    
## grouped_brand_peugeot            NA         NA      NA       NA    
## grouped_brand_plymouth    -3343.088   1494.435  -2.237 0.027598 *  
## grouped_brand_porsche      5996.126   2064.065   2.905 0.004558 ** 
## grouped_brand_renault     -3142.850   1839.637  -1.708 0.090792 .  
## grouped_brand_saab         1818.306   1468.545   1.238 0.218673    
## grouped_brand_subaru     -13283.362   2698.950  -4.922 3.56e-06 ***
## grouped_brand_toyota      -3102.771   1127.279  -2.752 0.007073 ** 
## grouped_brand_volkswagen  -3200.139   1556.397  -2.056 0.042485 *  
## grouped_brand_volvo              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1536 on 96 degrees of freedom
## Multiple R-squared:  0.9751, Adjusted R-squared:  0.9632 
## F-statistic: 81.81 on 46 and 96 DF,  p-value: < 2.2e-16

Upon preliminary analysis based on the model summary, we can note that many of the variables have been marked as NA which indicates that the variable has been excluded from the model. This can be due to a lack of data within those variables or potential multicollinearity (high correlation to other variables). We can test for multicollinearity by looking at the correlation coefficients in the final data set. The results have been hidden for brevity but an example screenshot is shown.

cor(finalDf)
Image showing high correlation for ‘citympg’ and ‘highwaympg’ as well as ‘cylindernumbertwo’ and ‘engintyeprotor’
Image showing high correlation for ‘citympg’ and ‘highwaympg’ as well as ‘cylindernumbertwo’ and ‘engintyeprotor’

Upon looking at the correlations, we can see that citympg and highwaympg are highly correlated. Citympg and highwaympg represent almost the same thing and they can actually be combined to create a combined mpg. The fuel types for gas and diesel are also perfectly correlated, so we can remove one. The car brand peugot is also highly correlated with enginetypel and after reviewing the data, Peugot is the only car brand that has that specific engine type. Therefore, the engine type can be excluded because Peugot represents the same information in the model. The cylindernumbertwo is also directly correlated with enginetyperotor so we can remove the cylindernumbertwo variable because it is already represented by the rotor engine type. The variables aspirationturbo and aspirationstd are perfectly correlated, so one of those can be removed. The variable drivewheelrwd is highly correlated with drivewheelfwd so rear wheel drive can be removed from the model.

# combined fuel economy is calculated by weighing city mpg by 55% and highway mpg by 45% and taking the average (according to EPA)
# Resource link for calculating combined fuel economy:
# https://www.epa.gov/fueleconomy/text-version-gasoline-label#:~:text=Fuel%20Economy,-For%20gasoline%20vehicles&text=The%20Combined%20MPG%20value%20is,the%20Highway%20value%20by%2045%25.
finalDf$combinedFuelEconomy <- ((finalDf$citympg * 0.55) + (finalDf$highwaympg * 0.45)) / 2
head(finalDf)

After removing the variables that were highly correlated, we can now try updating the model.

# splitting the dataset again
set.seed(1)

dfSample <- sample(c(TRUE, FALSE), nrow(finalDf), replace = TRUE, prob = c(0.7, 0.3))

train2Set <- finalDf[dfSample, ]
test2Set <- finalDf[!dfSample, ]

# building the updated model
model2 <- lm(price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + combinedFuelEconomy + enginetypedohcv + enginetypeohc +
              enginetypeohcf + enginetypeohcv +enginetyperotor + fueltypegas + carbodyconvertible + carbodyhardtop + carbodyhatchback + carbodysedan + carbodywagon +
              aspirationstd + cylindernumbertwelve + cylindernumbereight + cylindernumberfive + cylindernumberfour + cylindernumbersix + cylindernumberthree +
              drivewheel4wd + drivewheelfwd + grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw + grouped_brand_buick +
              grouped_brand_chevrolet + grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury + 
              grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab + 
              grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen + grouped_brand_volvo, data = train2Set)
summary(model2)
## 
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight + 
##     enginesize + boreratio + horsepower + combinedFuelEconomy + 
##     enginetypedohcv + enginetypeohc + enginetypeohcf + enginetypeohcv + 
##     enginetyperotor + fueltypegas + carbodyconvertible + carbodyhardtop + 
##     carbodyhatchback + carbodysedan + carbodywagon + aspirationstd + 
##     cylindernumbertwelve + cylindernumbereight + cylindernumberfive + 
##     cylindernumberfour + cylindernumbersix + cylindernumberthree + 
##     drivewheel4wd + drivewheelfwd + grouped_brand_alfa + grouped_brand_audi + 
##     grouped_brand_bmw + grouped_brand_buick + grouped_brand_chevrolet + 
##     grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu + 
##     grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury + 
##     grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot + 
##     grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault + 
##     grouped_brand_saab + grouped_brand_subaru + grouped_brand_toyota + 
##     grouped_brand_volkswagen + grouped_brand_volvo, data = train2Set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2908.2  -827.5     0.0   730.8  4261.8 
## 
## Coefficients: (5 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -39992.335  15367.277  -2.602  0.01071 *  
## wheelbase                   231.562     78.064   2.966  0.00379 ** 
## carlength                  -144.305     51.368  -2.809  0.00601 ** 
## carwidth                    910.682    274.150   3.322  0.00126 ** 
## curbweight                    3.811      1.588   2.400  0.01829 *  
## enginesize                   91.288     35.497   2.572  0.01164 *  
## boreratio                 -6541.090   3216.341  -2.034  0.04471 *  
## horsepower                    8.624     18.754   0.460  0.64664    
## combinedFuelEconomy          33.409    128.389   0.260  0.79525    
## enginetypedohcv           -5555.758   5064.407  -1.097  0.27535    
## enginetypeohc             -1860.415   1036.296  -1.795  0.07573 .  
## enginetypeohcf            11496.410   2548.649   4.511 1.81e-05 ***
## enginetypeohcv            -1162.037   1232.375  -0.943  0.34806    
## enginetyperotor            2601.404   2975.488   0.874  0.38413    
## fueltypegas                 208.708   1007.644   0.207  0.83635    
## carbodyconvertible         2782.868   1453.680   1.914  0.05852 .  
## carbodyhardtop              198.989   1272.491   0.156  0.87606    
## carbodyhatchback           -376.423    692.826  -0.543  0.58816    
## carbodysedan                744.862    572.476   1.301  0.19630    
## carbodywagon                     NA         NA      NA       NA    
## aspirationstd             -1671.416    706.359  -2.366  0.01996 *  
## cylindernumbertwelve      -6446.552   6096.194  -1.057  0.29292    
## cylindernumbereight         337.800   5553.789   0.061  0.95163    
## cylindernumberfive        -1622.094   3237.337  -0.501  0.61747    
## cylindernumberfour         -947.986   2477.762  -0.383  0.70286    
## cylindernumbersix         -4792.633   3335.687  -1.437  0.15400    
## cylindernumberthree              NA         NA      NA       NA    
## drivewheel4wd              -972.717   1156.646  -0.841  0.40243    
## drivewheelfwd             -1489.273    785.738  -1.895  0.06102 .  
## grouped_brand_alfa         -546.661   2324.599  -0.235  0.81458    
## grouped_brand_audi          101.596   2374.534   0.043  0.96596    
## grouped_brand_bmw          7037.782   1205.479   5.838 7.01e-08 ***
## grouped_brand_buick        1570.937   2592.492   0.606  0.54596    
## grouped_brand_chevrolet          NA         NA      NA       NA    
## grouped_brand_dodge       -2726.769   1436.380  -1.898  0.06062 .  
## grouped_brand_honda       -2126.539   1507.114  -1.411  0.16144    
## grouped_brand_isuzu        -982.378   1412.821  -0.695  0.48851    
## grouped_brand_jaguar        -96.115   3168.400  -0.030  0.97586    
## grouped_brand_mazda        -559.086   1144.744  -0.488  0.62637    
## grouped_brand_mercury            NA         NA      NA       NA    
## grouped_brand_mitsubishi  -3131.220   1430.186  -2.189  0.03097 *  
## grouped_brand_nissan      -1309.019   1249.991  -1.047  0.29760    
## grouped_brand_peugeot     -5447.242   1661.198  -3.279  0.00145 ** 
## grouped_brand_plymouth    -2499.739   1464.941  -1.706  0.09114 .  
## grouped_brand_porsche      6738.404   2069.349   3.256  0.00156 ** 
## grouped_brand_renault     -1711.045   1739.176  -0.984  0.32765    
## grouped_brand_saab         2828.711   1412.176   2.003  0.04796 *  
## grouped_brand_subaru     -13183.511   2745.823  -4.801 5.73e-06 ***
## grouped_brand_toyota      -2453.917   1103.303  -2.224  0.02846 *  
## grouped_brand_volkswagen  -2186.571   1505.952  -1.452  0.14974    
## grouped_brand_volvo              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1563 on 97 degrees of freedom
## Multiple R-squared:  0.974,  Adjusted R-squared:  0.9619 
## F-statistic: 80.68 on 45 and 97 DF,  p-value: < 2.2e-16

Because we still have aliased variables in the data but they’re not highly correlated with other variables, we can try using a stepwise regression method to help with variable selection. Stepwise regression works to select the best subset of variables that result in the best model performance which is dictated by the lowest prediction error. The method operates by adding and removing variables to the model iteratively. The default function in r uses both forward and backward selection so the model starts with no predictor variables and, in a stepwise fashion, adds the predictor variables that best improve the model and then removes the variables that don’t contribute to improving the model. The results have been hidden for brevity.

Resource Link for Stepwise Regression in R

# using stepwise regression
step_model <- step(model2)

Summary of the Model

summary(step_model)
## 
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight + 
##     enginesize + boreratio + enginetypedohcv + enginetypeohc + 
##     enginetypeohcf + enginetypeohcv + carbodyconvertible + carbodysedan + 
##     aspirationstd + cylindernumbertwelve + cylindernumberfive + 
##     cylindernumberfour + cylindernumbersix + drivewheel4wd + 
##     drivewheelfwd + grouped_brand_bmw + grouped_brand_dodge + 
##     grouped_brand_honda + grouped_brand_mitsubishi + grouped_brand_peugeot + 
##     grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_saab + 
##     grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen, 
##     data = train2Set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2651.1  -899.2  -173.9   817.6  4654.1 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -45148.934  10037.169  -4.498 1.68e-05 ***
## wheelbase                   256.472     59.544   4.307 3.56e-05 ***
## carlength                  -146.047     40.267  -3.627 0.000433 ***
## carwidth                   1002.590    191.929   5.224 8.18e-07 ***
## curbweight                    4.108      1.177   3.490 0.000693 ***
## enginesize                   89.565      8.927  10.033  < 2e-16 ***
## boreratio                 -6691.797   1178.326  -5.679 1.08e-07 ***
## enginetypedohcv           -6684.841   2282.841  -2.928 0.004130 ** 
## enginetypeohc             -1913.435    714.372  -2.678 0.008509 ** 
## enginetypeohcf            13047.616   2011.111   6.488 2.45e-09 ***
## enginetypeohcv            -1261.598    874.550  -1.443 0.151933    
## carbodyconvertible         3189.471    881.216   3.619 0.000445 ***
## carbodysedan                994.585    299.162   3.325 0.001197 ** 
## aspirationstd             -1723.650    438.336  -3.932 0.000146 ***
## cylindernumbertwelve      -7543.193   1985.795  -3.799 0.000237 ***
## cylindernumberfive        -3119.113   1076.634  -2.897 0.004531 ** 
## cylindernumberfour        -3063.382    923.668  -3.317 0.001229 ** 
## cylindernumbersix         -7146.191    795.098  -8.988 7.12e-15 ***
## drivewheel4wd             -1359.410    848.221  -1.603 0.111826    
## drivewheelfwd             -2009.825    531.985  -3.778 0.000255 ***
## grouped_brand_bmw          7754.573    926.063   8.374 1.80e-13 ***
## grouped_brand_dodge       -1650.666    680.469  -2.426 0.016873 *  
## grouped_brand_honda       -1065.872    692.895  -1.538 0.126799    
## grouped_brand_mitsubishi  -2098.360    689.870  -3.042 0.002931 ** 
## grouped_brand_peugeot     -5770.251   1091.303  -5.287 6.20e-07 ***
## grouped_brand_plymouth    -1419.052    769.571  -1.844 0.067833 .  
## grouped_brand_porsche      7249.700   1727.733   4.196 5.46e-05 ***
## grouped_brand_saab         3771.957    877.167   4.300 3.66e-05 ***
## grouped_brand_subaru     -13762.164   2260.060  -6.089 1.63e-08 ***
## grouped_brand_toyota      -1611.072    430.980  -3.738 0.000294 ***
## grouped_brand_volkswagen  -1531.333    746.611  -2.051 0.042596 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1483 on 112 degrees of freedom
## Multiple R-squared:  0.9729, Adjusted R-squared:  0.9657 
## F-statistic: 134.3 on 30 and 112 DF,  p-value: < 2.2e-16

From the model summary, we can see that there are no more aliased variables and most of the remaining variables are significant. Looking at the r-squared value we can see that 97.3% of variance of the data can be accounted for by the model. This also means that the model is a good fit for the data.

We can now try to use our model for predicting to test the accuracy of our model.

mod_train <- predict(step_model, train2Set)
mod_test <- predict(step_model, test2Set)

# plotting the model
trainDf <- data.frame(train2Set$price, mod_train)
testDf <- data.frame(test2Set$price, mod_test)

Plot for the Training Data set

p = ggplot(data = trainDf, aes(x = train2Set$price, y = mod_train)) + geom_point() +
  geom_smooth(method = "lm", color = "blue") + theme_minimal() + 
  ggtitle("Car Price Prediction on the Training Set") + xlab("True Training Values") + ylab("Training Values Predicted by the Model") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))

ggplotly(p)

We can see from the plot that the model fits quite well to the data. The blue line represents the model prediction and all of the data points are quite close to the line indicating that the model is a good fit for the data.

Plot for the Testing Data set

p2 = ggplot(data = testDf, aes(x = test2Set$price, y = mod_test)) + geom_point() +
  geom_smooth(method = "lm", color = "blue") + theme_minimal() + 
  ggtitle("Car Price Prediction on the Test Set") + xlab("True Test Values") + ylab("Test Values Predicted by the Model") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))

ggplotly(p2)

Compared to the plot for the data set dedicated to training the model, we can see that the model doesn’t predict quite as well as. It seems that there are more points that tend to be farther from the blue line which represents the relationship between the model prediction and the actual values.

Ploting the Residuals

res <- resid(step_model)

qqnorm(res)
qqline(res)

plot(density(res))

When looking at both the QQ plot and the density plot of the residuals, we can see that the residuals tend to be pretty close to a normal distribution which indicates that the errors are random and independent. This suggests that the models performs well and accurately represents the patterns and variations within the data set. If errors were correlated or dependent, this could indicate a reduction in accuracy of the model predictions when given new data.

Conclusions

From our goals for analysis outlined by the problem set, we have found the significant variables in the data set that are crucial to predicting a car’s price which is given in the model summary by the listed coefficients. The r squared value also provides information about how well this collection of variables can describe the variation of a car’s price. For developing a business strategy, the company can look at the predictor variables and it’s relationship to the car’s price in order to gain a better understanding of the car market in the US. The variables with a positive coefficient indicate a positive relationship with the car’s price (increase in the variable leads to an increase in the price). For negative coefficients an increase in the variable leads to a decrease in the price (a negative relationship). The magnitude of the coefficient represents how much of an effect a change in the predictor will have on the car’s price.