For this project, I aim to further expand my knowledge of linear regression that I was introduced to in taking STATS 3A03 and MATH 3MB3 at McMaster University. By applying techniques learnt from lectures to real-world datasets I am passionate about, like cars, I hope to gain deeper insight into what factors influence car prices.
Being passionate about cars, this project offers an opportunity to merge my interests with skills I’ve developed in math and statistics. I’ve always spent countless hours browsing through car listing sites like autotrader.ca, daydreaming about driving cars I can’t afford yet. Websites like Autotrader have a nifty system called “Price Analysis” that calculates price ranges for their vehicles listed into categories like “Great Price”, “Good Price”, “Fair Price”, and “Above Average”. I’ve always been curious about how they categorize these vehicles and what kind of algorithms they use. Now, after learning more about linear regression, I feel that I have a much better understanding of how they might do it.
Starting with cleaning the data, I’ll be exploring different correlations between factors and how they can influence car prices, and understanding which variables hold more weight in determining a vehicle value. At the end of my project, I will be applying my model to my dream car, the Lexus ISF, to see if it can accurately estimate its market price and compare it to ones I often find myself admiring on Autotrader.
Additionally, I want to do this entire project in R and RStudio to improve my skills in these tools. I’ve learnt basic R in the course HTHSCI1M03, and I want to further expand my knowledge in publishing a project using markdown. At the end of this project, I hope to become more comfortable with the R ecosystem.
Thanks to Austin Reese, I was able to easily collect my used cars
data from Kaggle
where he scrapes data from Craigslist. The CSV file consists of 426,880
entries and 26 columns.
Below shows the summary of the raw data set that was collected. In it’s current form, it is uncleaned and unusable for any meaningful analysis due to several issues. The presence of outliers are evident in the price column where the values range from zero to 3.7 billion, which is unrealistic for used car prices. Similarly the odometer values range from zero to 10 million miles which suggests data entry errors. There are also an abundance of missing values which will skew our model so they will be removed. Lastly, there are several redundant columns that are no use to this project which will also be dropped.
id url region region_url
Min. :7.207e+09 Length:426880 Length:426880 Length:426880
1st Qu.:7.308e+09 Class :character Class :character Class :character
Median :7.313e+09 Mode :character Mode :character Mode :character
Mean :7.311e+09
3rd Qu.:7.315e+09
Max. :7.317e+09
price year manufacturer model
Min. :0.000e+00 Min. :1900 Length:426880 Length:426880
1st Qu.:5.900e+03 1st Qu.:2008 Class :character Class :character
Median :1.395e+04 Median :2013 Mode :character Mode :character
Mean :7.520e+04 Mean :2011
3rd Qu.:2.649e+04 3rd Qu.:2017
Max. :3.737e+09 Max. :2022
NA's :1205
condition cylinders fuel odometer
Length:426880 Length:426880 Length:426880 Min. : 0
Class :character Class :character Class :character 1st Qu.: 37704
Mode :character Mode :character Mode :character Median : 85548
Mean : 98043
3rd Qu.: 133542
Max. :10000000
NA's :4400
title_status transmission VIN drive
Length:426880 Length:426880 Length:426880 Length:426880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
size type paint_color image_url
Length:426880 Length:426880 Length:426880 Length:426880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
description county state lat
Length:426880 Mode:logical Length:426880 Min. :-84.12
Class :character NA's:426880 Class :character 1st Qu.: 34.60
Mode :character Mode :character Median : 39.15
Mean : 38.49
3rd Qu.: 42.40
Max. : 82.39
NA's :6549
long posting_date
Min. :-159.83 Min. :2021-04-04 07:00:25.00
1st Qu.:-111.94 1st Qu.:2021-04-17 10:47:46.75
Median : -88.43 Median :2021-04-26 01:08:31.50
Mean : -94.75 Mean :2021-04-24 00:23:42.98
3rd Qu.: -80.83 3rd Qu.:2021-05-01 13:32:26.25
Max. : 173.89 Max. :2021-05-05 04:24:09.00
NA's :6549 NA's :68
Unfortunately, working with raw data comes with some flaws. There will always be some oddballs lurking around, like someone trying to sell their 2000 Civic for $1,000,000 or someone accidentally adding an extra zero or two to their odometer reading. To combat this, I used the 95th percentile for the price ($44,500) and the 99th percentile for odometer (280,000 mi.) as a cutoff point to get rid of the outliers, which is a reasonable cutoff for used cars. As for the fuel, transmission, type, and title, I limited my data to values that can have meaning in my model. Lastly, removing all listing entries with any NA values, I was left with something I could work with. The columns I kept are listed as follows:
Here is a sample of the presence of outliers in the prices column. Viewing the raw data with a histogram of the price distribution shows how skewed the data is prior to cleaning.
| Price | Year | Manufacturer | Model | Condition | Cylinders | Fuel | Odometer | Title Status | Transmission | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| $13,950 | 2007 | bmw | 335i convertible | excellent | 6 | gas | 56,000 | clean | automatic | convertible |
| $2,650 | 2001 | honda | accord | fair | 4 | gas | 213,000 | clean | automatic | sedan |
| $8,490 | 2006 | ford | f-150 | excellent | 6 | gas | 168,000 | clean | automatic | pickup |
| $1,999 | 2004 | ford | f150 | fair | 8 | gas | 131,000 | rebuilt | automatic | pickup |
| $6,900 | 2009 | honda | fit | good | 4 | gas | 84,475 | clean | manual | hatchback |
| $12,900 | 2008 | toyota | tundra | good | 8 | gas | 201,143 | clean | automatic | truck |
| $13,999 | 2011 | bmw | x5 xdrive35d | good | 6 | diesel | 100,675 | clean | automatic | SUV |
| $5,950 | 2003 | toyota | matrix xr hatchback | excellent | 4 | gas | 202,000 | clean | automatic | hatchback |
| $12,916 | 2013 | cadillac | xts | excellent | 6 | gas | 114,263 | clean | automatic | sedan |
| $24,500 | 2000 | ford | excursion limited 4x4 | excellent | 8 | diesel | 209,500 | clean | automatic | SUV |
price year manufacturer model
Min. : 500 Min. :1980 Length:103886 Length:103886
1st Qu.: 5900 1st Qu.:2006 Class :character Class :character
Median : 9950 Median :2011 Mode :character Mode :character
Mean :12871 Mean :2010
3rd Qu.:17499 3rd Qu.:2014
Max. :44500 Max. :2022
condition cylinders fuel odometer
Length:103886 Min. : 3.000 Length:103886 Min. : 537
Class :character 1st Qu.: 4.000 Class :character 1st Qu.: 77201
Mode :character Median : 6.000 Mode :character Median :115510
Mean : 5.783 Mean :117582
3rd Qu.: 8.000 3rd Qu.:155000
Max. :12.000 Max. :280000
title_status transmission type
Length:103886 Length:103886 Length:103886
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Before modeling and applying linear regression, I conducted a correlation analysis on the quantitative variables in the data set. Specifically, I was interested in understanding which variable out of year, number of cylinders, and odometer most effects the price of cars by applying cor() to these variables.
| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| odometer | price | 0.5077 |
| year | price | 0.5063 |
| odometer | year | 0.4501 |
| cylinders | price | 0.2869 |
| odometer | cylinders | -0.1505 |
| cylinders | year | -0.2328 |
Initially when producing this table, the odometer correlation read out a negative number, which means as the mileage of the car goes up, the price goes down. To make more sense of the data, I simply used a negative odometer reading. As expected, the price gets affected the most by the odometer and year of the vehicle. This makes sense as older cars that tend to have higher mileage typically suggests a lower value for the vehicle. These findings align with expectations of depreciation tends. Intuitively the high correlation value for odometer and year also makes perfect sense as older vehicles are driven more, racking up the odometer reading. The weak positive correlation value for cylinders to price suggests that vehicles with more cylinders (V6 or V8 engines) are slightly pricier due to some of these vehicles being sports cars. However, the weak correlation suggests it is not a strong determinant of price. One key reasoning behind this is because of the prevalence of non-sporty, everyday vehicles can be equipped with larger engines that don’t command for a markup on the prices.
After cleaning the data and exploring relationships between variables, I selected all variables except for the model of the vehicle for the model to predict the price, since the model variable has a high carnality of 4098 unique values. As for the year of the car, I changed the variable into the age of the car instead as it is more intuitive for predicting prices. I split the data set, with 80% to build the model and the remaining 20% to test the model. The model used is as follows:
\[\begin{aligned} &\text{price} = \beta_0 + \beta_1(\text{age}) + \beta_2(\text{odometer}) + \beta_3(\text{cylinders}) + \beta_4(\text{manufacturer}) + \\ &\quad \beta_5(\text{condition}) + \beta_6(\text{fuel}) + \beta_7(\text{title_status}) + \beta_8(\text{transmission}) + \beta_9(\text{type}) + \epsilon \end{aligned}\]`
set.seed(138)
cars_model <- cars_clean %>%
mutate(across(c(manufacturer, condition, fuel, title_status, transmission, type), as.factor))
cars_model$age <- 2024 - cars_model$year
cars_model <- cars_model %>% select(-year)
train_data <- cars_model %>% sample_frac(0.8)
test_data <- cars_model %>% anti_join(train_data)
lm_model <- lm(price ~ age + odometer + cylinders +
manufacturer + condition + fuel +
title_status + transmission + type,
data = train_data)
test_data$pred_price <- predict(lm_model, newdata = test_data)
| Metric | Value |
|---|---|
| MAE | $ 3780.97 |
| R^2 | 0.691 |
I opted in using two key evaluation metrics, MAE (Mean absolute
error), and R^2. The mean absolute error represents on average, my
model’s predictions are off by about ±$3780.97. This means if a
vehicle’s true value is priced at $20,000 the model might predict in the
range of ($16,219, $23,780). While I am proud of this number, due to
lack of factors that goes into pricing a vehicle for my model, I wish to
improve this model in the future with a data set involving more
variables.
The R-squared value represents the proportion of variance in price that
can be explained by the independent variables. In other words, R-squared
tells us how well the data fit the regression model, “the goodness of
fit”.
Therefore, the model explains 69.1% of the price variation in the data
set with an average prediction error of $3780.97.
This scatter plot compares the predicted prices from my model to the actual prices from the data set, where the red line represents predicted = actual. We can see that on the lower valued cars, the model is overestimating the value, while in the higher end, it is underestimating the values. Due to the nature of our data set being collected from craigslist, there are far fewer luxury cars, thus the model struggles to learn their pricing rules when training. Older cars often have other factors such as poor aesthetic conditions and mechanical issues that also do not get taken into account in my model.
| Manufacturer | Count | Average Price | Average Age |
|---|---|---|---|
| ferrari | 1 | 43000.00 | 38 |
| aston-martin | 3 | 38631.67 | 21 |
| land rover | 6 | 6990.00 | 19 |
| datsun | 13 | 12799.15 | 42 |
| harley-davidson | 28 | 15191.36 | 21 |
| alfa-romeo | 40 | 20089.15 | 19 |
| fiat | 173 | 8669.73 | 11 |
| porsche | 284 | 20181.35 | 20 |
| All Cars | 103886 | 12870.81 | 14 |
We can see that luxury brands like Ferrari, Porsche, Alpha-Romeo, and Aston-Martin are extremely underrepresented in our data set and can be considered “outliers” due to their average price compared to the rest of the data set. There are many factors that go into pricing a car that we cannot capture entirely with this data set, like underrepresented luxury cars to hidden damages, which can influence one another when modeling. I will dive more into this in the next section.
I discussed previously about luxury brands and their influence on the model. Using the coefficient from our model, we can see which factors impact the price of a vehicle the most, ranked by their estimated effect on prices. All these coefficients are statistically significant with a p-val < 0.001.
| Value | Estimate | STD Error | Statistic | P-Value |
|---|---|---|---|---|
| manufacturerferrari | 31403.296 | 5301.3815 | 5.923606 | 0.00e+00 |
| manufacturerdatsun | 16312.766 | 1610.3293 | 10.130081 | 0.00e+00 |
| manufactureraston-martin | 14985.896 | 3750.4078 | 3.995804 | 6.45e-05 |
| manufactureralfa-romeo | 5816.924 | 952.9894 | 6.103871 | 0.00e+00 |
| manufacturerporsche | 5546.499 | 397.5018 | 13.953394 | 0.00e+00 |
| typeoffroad | 3954.806 | 588.1318 | 6.724354 | 0.00e+00 |
| typetruck | 3608.994 | 506.1789 | 7.129879 | 0.00e+00 |
| manufactureraudi | 2677.824 | 223.2103 | 11.996864 | 0.00e+00 |
| typepickup | 2647.384 | 507.8446 | 5.212981 | 2.00e-07 |
| transmissionmanual | 2267.993 | 77.5666 | 29.239298 | 0.00e+00 |
| manufacturerlexus | 2059.275 | 215.0177 | 9.577236 | 0.00e+00 |
As mentioned previously, luxury cars are an obvious price boosters, however, we also see Datsun, a manufacture known for their classic/vintage appeal, driving prices beyond typical market value. There is also a trend of a type of vehicle in this table, off-roaders and trucks. With more powerful engines and larger sizes, and versatility for work use can boost their costs. Lastly, a finding I was surprised to find were manual transmissions. We are moving to an age where automatic cars are becoming the norm, and only sportier cars are keeping the manual transmission which may reflect enthusiast demand these cars.
To test the accuracy of my used car price prediction model, I’ve selected 3 cars from autotrader.ca that represents different segments of the market: a simple passenger sedan, a practical family SUV, and my dream car. Since the data set was collected from the American market, I converted the cars I found in my area from CAD -> USD, and KM -> MI.
predict_car_price <- function(age, odometer, cylinders,
manufacturer, fuel, transmission, type) {
new_data <- data.frame(
age = age,
odometer = odometer,
cylinders = cylinders,
manufacturer = factor(manufacturer, levels = levels(cars_model$manufacturer)),
condition = factor("good", levels = levels(cars_model$condition)),
fuel = fuel,
title_status = factor("clean", levels = levels(cars_model$title_status)),
transmission = transmission,
type = type
)
predict(lm_model, newdata = new_data)
}
* Listed Price $14,388 CAD ($10,048.44 USD)
* Age: 10
* Odometer: 131680 KM (81822 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan
corolla2015 <- predict_car_price(
age = 10,
odometer = 131680,
cylinders = 4,
manufacturer = "toyota",
fuel = "gas",
transmission = "automatic",
type = "sedan"
)
cat("Actual price: 10048 USD\nPredicted price:", corolla2015, "USD\nError of:",
round((abs(corolla2015 - 10084)/10084)*100, 2),"%")Actual price: 10048 USD
Predicted price: 11235.96 USD
Error of: 11.42 %
* Listed Price $29,950 CAD ($20,944)
* Age: 3
* Odometer: 89979 KM (55910 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: SUV
RAV42022 <- predict_car_price(
age = 3,
odometer = 55923,
cylinders = 4,
manufacturer = "toyota",
fuel = "gas",
transmission = "automatic",
type = "SUV"
)
cat("Actual price: 20944 USD\nPredicted price:", RAV42022, "USD\nError of:",
round((abs(RAV42022 - 20944)/20944)*100, 2),"%")Actual price: 20944 USD
Predicted price: 23754.07 USD
Error of: 13.42 %
* Listed Price $42,000 CAD ($29,332)
* Age: 16
* Odometer: 82000 KM (50952 MI)
* Cylinders: 8
* Manufacturer: Lexus
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan
ISF2009 <- predict_car_price(
age = 16,
odometer = 50952,
cylinders = 8,
manufacturer = "lexus",
fuel = "gas",
transmission = "automatic",
type = "sedan"
)
cat("Actual price: 29332 USD\nPredicted price:", ISF2009, "USD\nError of:",
round((abs(ISF2009 - 29332)/29332)*100, 2),"%")Actual price: 29332 USD
Predicted price: 20299.01 USD
Error of: 30.8 %
I am deeply proud of applying statistical principles learnt from my course work to real world problems I care about. Also I am pleased with the results of the model achieving an R-squared value of 69.1% explaining the variance of the used car prices, and the mean absolute error of $3780.97. For my first linear regression project the model performs reasonably well for the intended purpose, particularly for common vehicles like sedans and SUV’s of which the dataset is mostly made up of. For example, the Toyota Corolla and RAV4 cases demonstrated respectable errors of 11.42% and 13.42%. The accuracy of my model depends heavily on the heavy representation of mainstream vehicles in the market. However, for more luxury vehicles or sportier cars like the ISF case the error rocketed up to 30.8% as it struggles to capture an accurate predicted price since my dataset does not consist of many luxury vehicles. The model also lacks sufficient factors that are important in determining higher-end vehicles such as performance features, collector demand, and brand prestige. Alongside this, for any vehicle there are many features, damages, maintenance records, that gives lots of variance that affects the prices.
Past data limitations, there are different statistical ways to improve my model with the lack of factors. In STATS 3A03 we explored how to implement weights to underrepresented variables to combat them being drowned by higher volume variable entries. Also, finding that the price distribution is right skewed I would like to apply a natural logarithm to the price variable to reduce the skewness, which will make the distribution more normally distributed which allows linear regression assumptions more valid. I can also choose this by applying better choices for the cutoff points each qualitative variable.
Course materials from STATS 3A03, and MATH 3MB3 for Linear regression and modeling, and HTHS1M03 for R and RStudio