We are given a data set containing information about various cars. A
Chinese automobile company wants to enter the US market and produce cars
locally in order to establish more competition to their US and European
counterparts. Before they can do so, they want to understand which
variables affect the pricing of cars in the American market. More
specifically, they want to know:
Which variables are significant in predicting the price of the
car?
How well do those variables describe the price of the
car?
Our goal is to model the price of cars with the existing data set and variables. We want to learn how the variables interact and apply this knowledge to develop a business strategy.
The data was sourced from Kaggle. It was downloaded in a csv format, so we need to use the read.csv() function to import the data set which is stored in a “data” folder in the working directory.
carPriceData <- read.csv('data/CarPrice_Assignment.csv')
head(carPriceData)
Before we can create any models, we need to examine the data for any missing values or errors.
We’re going to use these libraries to aid in visualization, cleaning, and analysis of the data set.
library(ggplot2)
library(plotly)
library(dplyr)
library(ggpubr)
library(stringr)
library(car)
Through a preliminary examination of the data, I found no missing data however there were a couple of spelling errors that we need to fix. We are also going to remove any special characters.
carPriceData$CarName <- str_replace(carPriceData$CarName, "alfa-romero", "alfa_romero")
carPriceData$CarName <- str_replace(carPriceData$CarName, "maxda", "mazda")
carPriceData$CarName <- str_replace(carPriceData$CarName, "Nissan", "nissan")
carPriceData$CarName <- str_replace(carPriceData$CarName, "porcshce", "porsche")
carPriceData$CarName <- str_replace(carPriceData$CarName, "toyouta", "toyota")
carPriceData$CarName <- str_replace(carPriceData$CarName, "vokswagen", "volkswagen")
carPriceData$CarName <- str_replace(carPriceData$CarName, "vw", "volkswagen")
The structure of the data:
str(carPriceData)
## 'data.frame': 205 obs. of 26 variables:
## $ car_ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
## $ CarName : chr "alfa_romero giulia" "alfa_romero stelvio" "alfa_romero Quadrifoglio" "audi 100 ls" ...
## $ fueltype : chr "gas" "gas" "gas" "gas" ...
## $ aspiration : chr "std" "std" "std" "std" ...
## $ doornumber : chr "two" "two" "two" "four" ...
## $ carbody : chr "convertible" "convertible" "hatchback" "sedan" ...
## $ drivewheel : chr "rwd" "rwd" "rwd" "fwd" ...
## $ enginelocation : chr "front" "front" "front" "front" ...
## $ wheelbase : num 88.6 88.6 94.5 99.8 99.4 ...
## $ carlength : num 169 169 171 177 177 ...
## $ carwidth : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
## $ carheight : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
## $ curbweight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
## $ enginetype : chr "dohc" "dohc" "ohcv" "ohc" ...
## $ cylindernumber : chr "four" "four" "six" "four" ...
## $ enginesize : int 130 130 152 109 136 136 136 136 131 131 ...
## $ fuelsystem : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
## $ boreratio : num 3.47 3.47 2.68 3.19 3.19 3.19 3.19 3.19 3.13 3.13 ...
## $ stroke : num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 3.4 ...
## $ compressionratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
## $ horsepower : int 111 111 154 102 115 110 110 110 140 160 ...
## $ peakrpm : int 5000 5000 5000 5500 5500 5500 5500 5500 5500 5500 ...
## $ citympg : int 21 21 19 24 18 19 19 19 17 16 ...
## $ highwaympg : int 27 27 26 30 22 25 25 25 20 22 ...
## $ price : num 13495 16500 16500 13950 17450 ...
We are going to remove the car_ID because it is not necessary for analysis.
carPriceData <- subset(carPriceData, select = -c(car_ID))
We are going to use the cor() function to compute the correlation coefficient between all numerical variables within the data set.
cor(carPriceData[, unlist(lapply(carPriceData, is.numeric))])
## symboling wheelbase carlength carwidth carheight
## symboling 1.000000000 -0.5319537 -0.3576115 -0.2329191 -0.54103820
## wheelbase -0.531953682 1.0000000 0.8745875 0.7951436 0.58943476
## carlength -0.357611523 0.8745875 1.0000000 0.8411183 0.49102946
## carwidth -0.232919061 0.7951436 0.8411183 1.0000000 0.27921032
## carheight -0.541038200 0.5894348 0.4910295 0.2792103 1.00000000
## curbweight -0.227690588 0.7763863 0.8777285 0.8670325 0.29557173
## enginesize -0.105789709 0.5693287 0.6833599 0.7354334 0.06714874
## boreratio -0.130051360 0.4887499 0.6064544 0.5591499 0.17107092
## stroke -0.008735141 0.1609590 0.1295326 0.1829417 -0.05530667
## compressionratio -0.178515084 0.2497858 0.1584137 0.1811286 0.26121423
## horsepower 0.070872724 0.3532945 0.5526230 0.6407321 -0.10880206
## peakrpm 0.273606245 -0.3604687 -0.2872422 -0.2200123 -0.32041072
## citympg -0.035822628 -0.4704136 -0.6709087 -0.6427043 -0.04863963
## highwaympg 0.034606001 -0.5440819 -0.7046616 -0.6772179 -0.10735763
## price -0.079978225 0.5778156 0.6829200 0.7593253 0.11933623
## curbweight enginesize boreratio stroke
## symboling -0.2276906 -0.10578971 -0.130051360 -0.008735141
## wheelbase 0.7763863 0.56932868 0.488749875 0.160959047
## carlength 0.8777285 0.68335987 0.606454358 0.129532611
## carwidth 0.8670325 0.73543340 0.559149909 0.182941693
## carheight 0.2955717 0.06714874 0.171070922 -0.055306674
## curbweight 1.0000000 0.85059407 0.648479749 0.168790035
## enginesize 0.8505941 1.00000000 0.583774327 0.203128588
## boreratio 0.6484797 0.58377433 1.000000000 -0.055908983
## stroke 0.1687900 0.20312859 -0.055908983 1.000000000
## compressionratio 0.1513617 0.02897136 0.005197339 0.186110110
## horsepower 0.7507393 0.80976865 0.573676823 0.080939536
## peakrpm -0.2662432 -0.24465983 -0.254975528 -0.067963753
## citympg -0.7574138 -0.65365792 -0.584531716 -0.042144754
## highwaympg -0.7974648 -0.67746991 -0.587011784 -0.043930930
## price 0.8353049 0.87414480 0.553173237 0.079443084
## compressionratio horsepower peakrpm citympg
## symboling -0.178515084 0.07087272 0.27360625 -0.03582263
## wheelbase 0.249785845 0.35329448 -0.36046875 -0.47041361
## carlength 0.158413706 0.55262297 -0.28724220 -0.67090866
## carwidth 0.181128627 0.64073208 -0.22001230 -0.64270434
## carheight 0.261214226 -0.10880206 -0.32041072 -0.04863963
## curbweight 0.151361740 0.75073925 -0.26624318 -0.75741378
## enginesize 0.028971360 0.80976865 -0.24465983 -0.65365792
## boreratio 0.005197339 0.57367682 -0.25497553 -0.58453172
## stroke 0.186110110 0.08093954 -0.06796375 -0.04214475
## compressionratio 1.000000000 -0.20432623 -0.43574051 0.32470142
## horsepower -0.204326226 1.00000000 0.13107251 -0.80145618
## peakrpm -0.435740514 0.13107251 1.00000000 -0.11354438
## citympg 0.324701425 -0.80145618 -0.11354438 1.00000000
## highwaympg 0.265201389 -0.77054389 -0.05427481 0.97133704
## price 0.067983506 0.80813882 -0.08526715 -0.68575134
## highwaympg price
## symboling 0.03460600 -0.07997822
## wheelbase -0.54408192 0.57781560
## carlength -0.70466160 0.68292002
## carwidth -0.67721792 0.75932530
## carheight -0.10735763 0.11933623
## curbweight -0.79746479 0.83530488
## enginesize -0.67746991 0.87414480
## boreratio -0.58701178 0.55317324
## stroke -0.04393093 0.07944308
## compressionratio 0.26520139 0.06798351
## horsepower -0.77054389 0.80813882
## peakrpm -0.05427481 -0.08526715
## citympg 0.97133704 -0.68575134
## highwaympg 1.00000000 -0.69759909
## price -0.69759909 1.00000000
When specifically looking at the price variable, we can make note of correlation coefficients that are closer to 1 or -1 indicating a stronger linear relationship with the price variable. Most notably, symboling, carheight, stroke, compressionratio, and peakrpm have really low correlation so we can remove them from the model equation. We will consider coefficients at 0.5 or greater to have a strong correlation. From this, we can see that carlength, carwidth, curbweight, enginesize, and horsepower have a strong positive correlation, while citympg and highwaympg have a strong negative correlation with price.
To understand the correlation between non-numeric variables and price, we can visualize the data using boxplots.
# first we will split up the data into a categorical subset
carPriceData_categorical <- subset(carPriceData, select = !sapply(carPriceData, is.numeric))
str(carPriceData_categorical) # just so we can see all the categorical variables
## 'data.frame': 205 obs. of 10 variables:
## $ CarName : chr "alfa_romero giulia" "alfa_romero stelvio" "alfa_romero Quadrifoglio" "audi 100 ls" ...
## $ fueltype : chr "gas" "gas" "gas" "gas" ...
## $ aspiration : chr "std" "std" "std" "std" ...
## $ doornumber : chr "two" "two" "two" "four" ...
## $ carbody : chr "convertible" "convertible" "hatchback" "sedan" ...
## $ drivewheel : chr "rwd" "rwd" "rwd" "fwd" ...
## $ enginelocation: chr "front" "front" "front" "front" ...
## $ enginetype : chr "dohc" "dohc" "ohcv" "ohc" ...
## $ cylindernumber: chr "four" "four" "six" "four" ...
## $ fuelsystem : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
Looking at all these box plots, we can see that on average diesel fuel types are more expensive than gas. Similarly, turbo aspiration tends to be more expensive. Convertible and hardtop car bodies tend to be higher priced. Rear wheel drive (rwd) also tends to be more expensive. Rear engine locations are much more expensive than cars with engines in the front. The ohcv engine type is higher priced than other engine types. The cylinder number generally increases proportionally to an increase in price. The mpfi fuel system tends to be higher priced but there isn’t much of an overall effect. The number of doors doesn’t have much of an effect on the pricing, so that variable can also be removed from the model equation.
We can also evaluate the car name and it’s relation to price.
# first we have to group the car names together
nameData <- carPriceData %>%
mutate(grouped_brand_ = case_when(
grepl("^alfa_romero", CarName) ~ "alfa",
grepl("^audi", CarName) ~ "audi",
grepl("^bmw", CarName) ~ "bmw",
grepl("^buick", CarName) ~ "buick",
grepl("^chevrolet", CarName) ~ "chevrolet",
grepl("^dodge", CarName) ~ "dodge",
grepl("^honda", CarName) ~ "honda",
grepl("^isuzu", CarName) ~ "isuzu",
grepl("^jaguar", CarName) ~ "jaguar",
grepl("^mazda", CarName) ~ "mazda",
grepl("^mercury", CarName) ~ "mercury",
grepl("^mitsubishi", CarName) ~ "mitsubishi",
grepl("^nissan", CarName) ~ "nissan",
grepl("^peugeot", CarName) ~ "peugeot",
grepl("^plymouth", CarName) ~ "plymouth",
grepl("^porsche", CarName) ~ "porsche",
grepl("^renault", CarName) ~ "renault",
grepl("^saab", CarName) ~ "saab",
grepl("^subaru", CarName) ~ "subaru",
grepl("^toyota", CarName) ~ "toyota",
grepl("^volkswagen", CarName) ~ "volkswagen",
grepl("^volvo", CarName) ~ "volvo",
))
# group the brands and calculate the average price
avgPrice <- nameData %>% group_by(grouped_brand_) %>% summarise(avg_price = mean(price))
sortedPrice <- avgPrice %>% arrange(desc(avg_price))
g <- ggplot(sortedPrice, aes(x = reorder(grouped_brand_, -avg_price), y = avg_price)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(x = "Car Brand", y = "Average Price") + theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)
Because we have many categorical variables, we need to encode them as dummy variables (using binary) so that we can use it in our model.
# first we can get all the numeric variables
carPriceData_numeric <- subset(carPriceData, select = sapply(carPriceData, is.numeric))
# we can also remove symboling, carheight, stroke, compressionratio, and peakrpm because they don't have high correlation
carPriceData_numeric <- subset(carPriceData_numeric, select = -c(symboling, carheight, stroke, compressionratio, peakrpm))
# group the car names and add it to the numeric variables
groups <- subset(nameData, select = c(grouped_brand_))
# now we are going to use the model.matrix function to encode all the dummy variables (grouped_brand, enginetype, fueltype, carbody, aspiration, cylindernumber, drivewheel)
dummy_engine <- model.matrix(~ enginetype - 1, data = carPriceData)
dummy_fuel <- model.matrix(~ fueltype - 1, data = carPriceData)
dummy_carbody <- model.matrix(~ carbody - 1, data = carPriceData)
dummy_asp <- model.matrix(~ aspiration - 1, data = carPriceData)
dummy_cyl <- model.matrix(~ cylindernumber - 1, data = carPriceData)
dummy_drive <- model.matrix(~ drivewheel - 1, data = carPriceData)
dummy_brand <- model.matrix(~ grouped_brand_ - 1, data = groups)
finalDf <- carPriceData_numeric
finalDf <- cbind(finalDf, dummy_engine)
finalDf <- cbind(finalDf, dummy_fuel)
finalDf <- cbind(finalDf, dummy_carbody)
finalDf <- cbind(finalDf, dummy_asp)
finalDf <- cbind(finalDf, dummy_cyl)
finalDf <- cbind(finalDf, dummy_drive)
finalDf <- cbind(finalDf, dummy_brand)
head(finalDf)
finalDf
Now that we have the final data set put together, we can finally build a multiple linear regression model
# first we're going to split the data set into 70% for training and 30% for testing
set.seed(1)
dfSample <- sample(c(TRUE, FALSE), nrow(finalDf), replace = TRUE, prob = c(0.7, 0.3))
trainSet <- finalDf[dfSample, ]
testSet <- finalDf[!dfSample, ]
print(paste("Dimension of the final data set: ", dim(finalDf)[1]))
## [1] "Dimension of the final data set: 205"
print(paste("Dimension of the training data set: ", dim(trainSet)[1]))
## [1] "Dimension of the training data set: 143"
print(paste("Dimension of the testing data set: ", dim(testSet)[1]))
## [1] "Dimension of the testing data set: 62"
The original data set had 205 rows. The training set has 143 rows which is approximately 69.8% of the original rows while the test set has 62 rows or about 30.2% of the original rows.
model <- lm(price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + citympg + highwaympg + enginetypedohcv + enginetypel + enginetypeohc +
enginetypeohcf + enginetypeohcv +enginetyperotor + fueltypediesel + fueltypegas + carbodyconvertible + carbodyhardtop + carbodyhatchback + carbodysedan + carbodywagon +
aspirationstd + aspirationturbo + cylindernumbereight + cylindernumberfive + cylindernumberfour + cylindernumbersix + cylindernumberthree + cylindernumbertwelve +
cylindernumbertwo + drivewheel4wd + drivewheelfwd + drivewheelrwd + grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw + grouped_brand_buick +
grouped_brand_chevrolet + grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury +
grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab +
grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen + grouped_brand_volvo, data = trainSet)
summary(model)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight +
## enginesize + boreratio + horsepower + citympg + highwaympg +
## enginetypedohcv + enginetypel + enginetypeohc + enginetypeohcf +
## enginetypeohcv + enginetyperotor + fueltypediesel + fueltypegas +
## carbodyconvertible + carbodyhardtop + carbodyhatchback +
## carbodysedan + carbodywagon + aspirationstd + aspirationturbo +
## cylindernumbereight + cylindernumberfive + cylindernumberfour +
## cylindernumbersix + cylindernumberthree + cylindernumbertwelve +
## cylindernumbertwo + drivewheel4wd + drivewheelfwd + drivewheelrwd +
## grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw +
## grouped_brand_buick + grouped_brand_chevrolet + grouped_brand_dodge +
## grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar +
## grouped_brand_mazda + grouped_brand_mercury + grouped_brand_mitsubishi +
## grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth +
## grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab +
## grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen +
## grouped_brand_volvo, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2906.1 -829.1 0.0 700.3 4141.5
##
## Coefficients: (10 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -43092.842 14919.136 -2.888 0.004785 **
## wheelbase 225.086 76.781 2.932 0.004216 **
## carlength -144.172 50.483 -2.856 0.005261 **
## carwidth 930.937 269.600 3.453 0.000826 ***
## curbweight 3.715 1.561 2.380 0.019307 *
## enginesize 105.541 35.537 2.970 0.003763 **
## boreratio -8034.254 3239.588 -2.480 0.014879 *
## horsepower 2.635 18.649 0.141 0.887922
## citympg -281.149 142.248 -1.976 0.050971 .
## highwaympg 256.376 121.604 2.108 0.037610 *
## enginetypedohcv -5966.470 4981.010 -1.198 0.233927
## enginetypel -6455.592 1701.437 -3.794 0.000259 ***
## enginetypeohc -2570.096 1072.819 -2.396 0.018530 *
## enginetypeohcf 10947.271 2518.309 4.347 3.43e-05 ***
## enginetypeohcv -1560.281 1225.843 -1.273 0.206155
## enginetyperotor 10036.574 6483.229 1.548 0.124893
## fueltypediesel 240.913 1013.072 0.238 0.812539
## fueltypegas NA NA NA NA
## carbodyconvertible 3027.699 1433.372 2.112 0.037257 *
## carbodyhardtop 285.589 1251.253 0.228 0.819943
## carbodyhatchback -387.202 680.913 -0.569 0.570920
## carbodysedan 665.132 563.891 1.180 0.241098
## carbodywagon NA NA NA NA
## aspirationstd -1751.228 695.229 -2.519 0.013424 *
## aspirationturbo NA NA NA NA
## cylindernumbereight 7045.901 3747.252 1.880 0.063101 .
## cylindernumberfive 6018.004 5284.878 1.139 0.257653
## cylindernumberfour 6761.511 5621.947 1.203 0.232051
## cylindernumbersix 2028.775 3722.568 0.545 0.587022
## cylindernumberthree 13444.000 6523.197 2.061 0.042012 *
## cylindernumbertwelve NA NA NA NA
## cylindernumbertwo NA NA NA NA
## drivewheel4wd -968.227 1136.728 -0.852 0.396463
## drivewheelfwd -1382.954 773.856 -1.787 0.077078 .
## drivewheelrwd NA NA NA NA
## grouped_brand_alfa -2239.428 2422.011 -0.925 0.357484
## grouped_brand_audi -823.776 2374.698 -0.347 0.729428
## grouped_brand_bmw 6634.391 1200.122 5.528 2.78e-07 ***
## grouped_brand_buick 807.553 2573.531 0.314 0.754360
## grouped_brand_chevrolet NA NA NA NA
## grouped_brand_dodge -3626.610 1474.970 -2.459 0.015733 *
## grouped_brand_honda -2690.889 1505.235 -1.788 0.076982 .
## grouped_brand_isuzu -1281.160 1395.728 -0.918 0.360964
## grouped_brand_jaguar -1338.616 3169.303 -0.422 0.673700
## grouped_brand_mazda -1364.558 1188.343 -1.148 0.253704
## grouped_brand_mercury NA NA NA NA
## grouped_brand_mitsubishi -4137.431 1484.641 -2.787 0.006416 **
## grouped_brand_nissan -2003.044 1271.953 -1.575 0.118597
## grouped_brand_peugeot NA NA NA NA
## grouped_brand_plymouth -3343.088 1494.435 -2.237 0.027598 *
## grouped_brand_porsche 5996.126 2064.065 2.905 0.004558 **
## grouped_brand_renault -3142.850 1839.637 -1.708 0.090792 .
## grouped_brand_saab 1818.306 1468.545 1.238 0.218673
## grouped_brand_subaru -13283.362 2698.950 -4.922 3.56e-06 ***
## grouped_brand_toyota -3102.771 1127.279 -2.752 0.007073 **
## grouped_brand_volkswagen -3200.139 1556.397 -2.056 0.042485 *
## grouped_brand_volvo NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1536 on 96 degrees of freedom
## Multiple R-squared: 0.9751, Adjusted R-squared: 0.9632
## F-statistic: 81.81 on 46 and 96 DF, p-value: < 2.2e-16
Upon preliminary analysis based on the model summary, we can note that many of the variables have been marked as NA which indicates that the variable has been excluded from the model. This can be due to a lack of data within those variables or potential multicollinearity (high correlation to other variables). We can test for multicollinearity by looking at the correlation coefficients in the final data set. The results have been hidden for brevity but an example screenshot is shown.
cor(finalDf)
Upon looking at the correlations, we can see that citympg and highwaympg are highly correlated. Citympg and highwaympg represent almost the same thing and they can actually be combined to create a combined mpg. The fuel types for gas and diesel are also perfectly correlated, so we can remove one. The car brand peugot is also highly correlated with enginetypel and after reviewing the data, Peugot is the only car brand that has that specific engine type. Therefore, the engine type can be excluded because Peugot represents the same information in the model. The cylindernumbertwo is also directly correlated with enginetyperotor so we can remove the cylindernumbertwo variable because it is already represented by the rotor engine type. The variables aspirationturbo and aspirationstd are perfectly correlated, so one of those can be removed. The variable drivewheelrwd is highly correlated with drivewheelfwd so rear wheel drive can be removed from the model.
# combined fuel economy is calculated by weighing city mpg by 55% and highway mpg by 45% and taking the average (according to EPA)
# Resource link for calculating combined fuel economy:
# https://www.epa.gov/fueleconomy/text-version-gasoline-label#:~:text=Fuel%20Economy,-For%20gasoline%20vehicles&text=The%20Combined%20MPG%20value%20is,the%20Highway%20value%20by%2045%25.
finalDf$combinedFuelEconomy <- ((finalDf$citympg * 0.55) + (finalDf$highwaympg * 0.45)) / 2
head(finalDf)
After removing the variables that were highly correlated, we can now try updating the model.
# splitting the dataset again
set.seed(1)
dfSample <- sample(c(TRUE, FALSE), nrow(finalDf), replace = TRUE, prob = c(0.7, 0.3))
train2Set <- finalDf[dfSample, ]
test2Set <- finalDf[!dfSample, ]
# building the updated model
model2 <- lm(price ~ wheelbase + carlength + carwidth + curbweight + enginesize + boreratio + horsepower + combinedFuelEconomy + enginetypedohcv + enginetypeohc +
enginetypeohcf + enginetypeohcv +enginetyperotor + fueltypegas + carbodyconvertible + carbodyhardtop + carbodyhatchback + carbodysedan + carbodywagon +
aspirationstd + cylindernumbertwelve + cylindernumbereight + cylindernumberfive + cylindernumberfour + cylindernumbersix + cylindernumberthree +
drivewheel4wd + drivewheelfwd + grouped_brand_alfa + grouped_brand_audi + grouped_brand_bmw + grouped_brand_buick +
grouped_brand_chevrolet + grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu + grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury +
grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot + grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault + grouped_brand_saab +
grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen + grouped_brand_volvo, data = train2Set)
summary(model2)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight +
## enginesize + boreratio + horsepower + combinedFuelEconomy +
## enginetypedohcv + enginetypeohc + enginetypeohcf + enginetypeohcv +
## enginetyperotor + fueltypegas + carbodyconvertible + carbodyhardtop +
## carbodyhatchback + carbodysedan + carbodywagon + aspirationstd +
## cylindernumbertwelve + cylindernumbereight + cylindernumberfive +
## cylindernumberfour + cylindernumbersix + cylindernumberthree +
## drivewheel4wd + drivewheelfwd + grouped_brand_alfa + grouped_brand_audi +
## grouped_brand_bmw + grouped_brand_buick + grouped_brand_chevrolet +
## grouped_brand_dodge + grouped_brand_honda + grouped_brand_isuzu +
## grouped_brand_jaguar + grouped_brand_mazda + grouped_brand_mercury +
## grouped_brand_mitsubishi + grouped_brand_nissan + grouped_brand_peugeot +
## grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_renault +
## grouped_brand_saab + grouped_brand_subaru + grouped_brand_toyota +
## grouped_brand_volkswagen + grouped_brand_volvo, data = train2Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2908.2 -827.5 0.0 730.8 4261.8
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39992.335 15367.277 -2.602 0.01071 *
## wheelbase 231.562 78.064 2.966 0.00379 **
## carlength -144.305 51.368 -2.809 0.00601 **
## carwidth 910.682 274.150 3.322 0.00126 **
## curbweight 3.811 1.588 2.400 0.01829 *
## enginesize 91.288 35.497 2.572 0.01164 *
## boreratio -6541.090 3216.341 -2.034 0.04471 *
## horsepower 8.624 18.754 0.460 0.64664
## combinedFuelEconomy 33.409 128.389 0.260 0.79525
## enginetypedohcv -5555.758 5064.407 -1.097 0.27535
## enginetypeohc -1860.415 1036.296 -1.795 0.07573 .
## enginetypeohcf 11496.410 2548.649 4.511 1.81e-05 ***
## enginetypeohcv -1162.037 1232.375 -0.943 0.34806
## enginetyperotor 2601.404 2975.488 0.874 0.38413
## fueltypegas 208.708 1007.644 0.207 0.83635
## carbodyconvertible 2782.868 1453.680 1.914 0.05852 .
## carbodyhardtop 198.989 1272.491 0.156 0.87606
## carbodyhatchback -376.423 692.826 -0.543 0.58816
## carbodysedan 744.862 572.476 1.301 0.19630
## carbodywagon NA NA NA NA
## aspirationstd -1671.416 706.359 -2.366 0.01996 *
## cylindernumbertwelve -6446.552 6096.194 -1.057 0.29292
## cylindernumbereight 337.800 5553.789 0.061 0.95163
## cylindernumberfive -1622.094 3237.337 -0.501 0.61747
## cylindernumberfour -947.986 2477.762 -0.383 0.70286
## cylindernumbersix -4792.633 3335.687 -1.437 0.15400
## cylindernumberthree NA NA NA NA
## drivewheel4wd -972.717 1156.646 -0.841 0.40243
## drivewheelfwd -1489.273 785.738 -1.895 0.06102 .
## grouped_brand_alfa -546.661 2324.599 -0.235 0.81458
## grouped_brand_audi 101.596 2374.534 0.043 0.96596
## grouped_brand_bmw 7037.782 1205.479 5.838 7.01e-08 ***
## grouped_brand_buick 1570.937 2592.492 0.606 0.54596
## grouped_brand_chevrolet NA NA NA NA
## grouped_brand_dodge -2726.769 1436.380 -1.898 0.06062 .
## grouped_brand_honda -2126.539 1507.114 -1.411 0.16144
## grouped_brand_isuzu -982.378 1412.821 -0.695 0.48851
## grouped_brand_jaguar -96.115 3168.400 -0.030 0.97586
## grouped_brand_mazda -559.086 1144.744 -0.488 0.62637
## grouped_brand_mercury NA NA NA NA
## grouped_brand_mitsubishi -3131.220 1430.186 -2.189 0.03097 *
## grouped_brand_nissan -1309.019 1249.991 -1.047 0.29760
## grouped_brand_peugeot -5447.242 1661.198 -3.279 0.00145 **
## grouped_brand_plymouth -2499.739 1464.941 -1.706 0.09114 .
## grouped_brand_porsche 6738.404 2069.349 3.256 0.00156 **
## grouped_brand_renault -1711.045 1739.176 -0.984 0.32765
## grouped_brand_saab 2828.711 1412.176 2.003 0.04796 *
## grouped_brand_subaru -13183.511 2745.823 -4.801 5.73e-06 ***
## grouped_brand_toyota -2453.917 1103.303 -2.224 0.02846 *
## grouped_brand_volkswagen -2186.571 1505.952 -1.452 0.14974
## grouped_brand_volvo NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1563 on 97 degrees of freedom
## Multiple R-squared: 0.974, Adjusted R-squared: 0.9619
## F-statistic: 80.68 on 45 and 97 DF, p-value: < 2.2e-16
Because we still have aliased variables in the data but they’re not highly correlated with other variables, we can try using a stepwise regression method to help with variable selection. Stepwise regression works to select the best subset of variables that result in the best model performance which is dictated by the lowest prediction error. The method operates by adding and removing variables to the model iteratively. The default function in r uses both forward and backward selection so the model starts with no predictor variables and, in a stepwise fashion, adds the predictor variables that best improve the model and then removes the variables that don’t contribute to improving the model. The results have been hidden for brevity.
Resource Link for Stepwise Regression in R
# using stepwise regression
step_model <- step(model2)
summary(step_model)
##
## Call:
## lm(formula = price ~ wheelbase + carlength + carwidth + curbweight +
## enginesize + boreratio + enginetypedohcv + enginetypeohc +
## enginetypeohcf + enginetypeohcv + carbodyconvertible + carbodysedan +
## aspirationstd + cylindernumbertwelve + cylindernumberfive +
## cylindernumberfour + cylindernumbersix + drivewheel4wd +
## drivewheelfwd + grouped_brand_bmw + grouped_brand_dodge +
## grouped_brand_honda + grouped_brand_mitsubishi + grouped_brand_peugeot +
## grouped_brand_plymouth + grouped_brand_porsche + grouped_brand_saab +
## grouped_brand_subaru + grouped_brand_toyota + grouped_brand_volkswagen,
## data = train2Set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2651.1 -899.2 -173.9 817.6 4654.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45148.934 10037.169 -4.498 1.68e-05 ***
## wheelbase 256.472 59.544 4.307 3.56e-05 ***
## carlength -146.047 40.267 -3.627 0.000433 ***
## carwidth 1002.590 191.929 5.224 8.18e-07 ***
## curbweight 4.108 1.177 3.490 0.000693 ***
## enginesize 89.565 8.927 10.033 < 2e-16 ***
## boreratio -6691.797 1178.326 -5.679 1.08e-07 ***
## enginetypedohcv -6684.841 2282.841 -2.928 0.004130 **
## enginetypeohc -1913.435 714.372 -2.678 0.008509 **
## enginetypeohcf 13047.616 2011.111 6.488 2.45e-09 ***
## enginetypeohcv -1261.598 874.550 -1.443 0.151933
## carbodyconvertible 3189.471 881.216 3.619 0.000445 ***
## carbodysedan 994.585 299.162 3.325 0.001197 **
## aspirationstd -1723.650 438.336 -3.932 0.000146 ***
## cylindernumbertwelve -7543.193 1985.795 -3.799 0.000237 ***
## cylindernumberfive -3119.113 1076.634 -2.897 0.004531 **
## cylindernumberfour -3063.382 923.668 -3.317 0.001229 **
## cylindernumbersix -7146.191 795.098 -8.988 7.12e-15 ***
## drivewheel4wd -1359.410 848.221 -1.603 0.111826
## drivewheelfwd -2009.825 531.985 -3.778 0.000255 ***
## grouped_brand_bmw 7754.573 926.063 8.374 1.80e-13 ***
## grouped_brand_dodge -1650.666 680.469 -2.426 0.016873 *
## grouped_brand_honda -1065.872 692.895 -1.538 0.126799
## grouped_brand_mitsubishi -2098.360 689.870 -3.042 0.002931 **
## grouped_brand_peugeot -5770.251 1091.303 -5.287 6.20e-07 ***
## grouped_brand_plymouth -1419.052 769.571 -1.844 0.067833 .
## grouped_brand_porsche 7249.700 1727.733 4.196 5.46e-05 ***
## grouped_brand_saab 3771.957 877.167 4.300 3.66e-05 ***
## grouped_brand_subaru -13762.164 2260.060 -6.089 1.63e-08 ***
## grouped_brand_toyota -1611.072 430.980 -3.738 0.000294 ***
## grouped_brand_volkswagen -1531.333 746.611 -2.051 0.042596 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1483 on 112 degrees of freedom
## Multiple R-squared: 0.9729, Adjusted R-squared: 0.9657
## F-statistic: 134.3 on 30 and 112 DF, p-value: < 2.2e-16
From the model summary, we can see that there are no more aliased variables and most of the remaining variables are significant. Looking at the r-squared value we can see that 97.3% of variance of the data can be accounted for by the model. This also means that the model is a good fit for the data.
We can now try to use our model for predicting to test the accuracy of our model.
mod_train <- predict(step_model, train2Set)
mod_test <- predict(step_model, test2Set)
# plotting the model
trainDf <- data.frame(train2Set$price, mod_train)
testDf <- data.frame(test2Set$price, mod_test)
p = ggplot(data = trainDf, aes(x = train2Set$price, y = mod_train)) + geom_point() +
geom_smooth(method = "lm", color = "blue") + theme_minimal() +
ggtitle("Car Price Prediction on the Training Set") + xlab("True Training Values") + ylab("Training Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))
ggplotly(p)
We can see from the plot that the model fits quite well to the data. The blue line represents the model prediction and all of the data points are quite close to the line indicating that the model is a good fit for the data.
p2 = ggplot(data = testDf, aes(x = test2Set$price, y = mod_test)) + geom_point() +
geom_smooth(method = "lm", color = "blue") + theme_minimal() +
ggtitle("Car Price Prediction on the Test Set") + xlab("True Test Values") + ylab("Test Values Predicted by the Model") +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15))
ggplotly(p2)
Compared to the plot for the data set dedicated to training the model, we can see that the model doesn’t predict quite as well as. It seems that there are more points that tend to be farther from the blue line which represents the relationship between the model prediction and the actual values.
res <- resid(step_model)
qqnorm(res)
qqline(res)
plot(density(res))
When looking at both the QQ plot and the density plot of the residuals, we can see that the residuals tend to be pretty close to a normal distribution which indicates that the errors are random and independent. This suggests that the models performs well and accurately represents the patterns and variations within the data set. If errors were correlated or dependent, this could indicate a reduction in accuracy of the model predictions when given new data.
From our goals for analysis outlined by the problem set, we have found the significant variables in the data set that are crucial to predicting a car’s price which is given in the model summary by the listed coefficients. The r squared value also provides information about how well this collection of variables can describe the variation of a car’s price. For developing a business strategy, the company can look at the predictor variables and it’s relationship to the car’s price in order to gain a better understanding of the car market in the US. The variables with a positive coefficient indicate a positive relationship with the car’s price (increase in the variable leads to an increase in the price). For negative coefficients an increase in the variable leads to a decrease in the price (a negative relationship). The magnitude of the coefficient represents how much of an effect a change in the predictor will have on the car’s price.