1. Introduction

For this project, I aim to further expand my knowledge of linear regression that I was introduced to in taking STATS 3A03 and MATH 3MB3 at McMaster University. By applying techniques learnt from lectures to real-world datasets I am passionate about, like cars, I hope to gain deeper insight into what factors influence car prices.

Being passionate about cars, this project offers an opportunity to merge my interests with skills I’ve developed in math and statistics. I’ve always spent countless hours browsing through car listing sites like autotrader.ca, daydreaming about driving cars I can’t afford yet. Websites like Autotrader have a nifty system called “Price Analysis” that calculates price ranges for their vehicles listed into categories like “Great Price”, “Good Price”, “Fair Price”, and “Above Average”. I’ve always been curious about how they categorize these vehicles and what kind of algorithms they use. Now, after learning more about linear regression, I feel that I have a much better understanding of how they might do it.

Starting with cleaning the data, I’ll be exploring different correlations between factors and how they can influence car prices, and understanding which variables hold more weight in determining a vehicle value. At the end of my project, I will be applying my model to my dream car, the Lexus ISF, to see if it can accurately estimate its market price and compare it to ones I often find myself admiring on Autotrader.

Additionally, I want to do this entire project in R and RStudio to improve my skills in these tools. I’ve learnt basic R in the course HTHSCI1M03, and I want to further expand my knowledge in publishing a project using markdown. At the end of this project, I hope to become more comfortable with the R ecosystem.

My dream car, the Lexus IS-F
My dream car, the Lexus IS-F




2. Data

Thanks to Austin Reese, I was able to easily collect my used cars data from Kaggle where he scrapes data from Craigslist. The CSV file consists of 426,880 entries and 26 columns.

Below shows the summary of the raw data set that was collected. In it’s current form, it is uncleaned and unusable for any meaningful analysis due to several issues. The presence of outliers are evident in the price column where the values range from zero to 3.7 billion, which is unrealistic for used car prices. Similarly the odometer values range from zero to 10 million miles which suggests data entry errors. There are also an abundance of missing values which will skew our model so they will be removed. Lastly, there are several redundant columns that are no use to this project which will also be dropped.


Click to view uncleaned dataset summary
       id                url               region           region_url       
 Min.   :7.207e+09   Length:426880      Length:426880      Length:426880     
 1st Qu.:7.308e+09   Class :character   Class :character   Class :character  
 Median :7.313e+09   Mode  :character   Mode  :character   Mode  :character  
 Mean   :7.311e+09                                                           
 3rd Qu.:7.315e+09                                                           
 Max.   :7.317e+09                                                           
                                                                             
     price                year      manufacturer          model          
 Min.   :0.000e+00   Min.   :1900   Length:426880      Length:426880     
 1st Qu.:5.900e+03   1st Qu.:2008   Class :character   Class :character  
 Median :1.395e+04   Median :2013   Mode  :character   Mode  :character  
 Mean   :7.520e+04   Mean   :2011                                        
 3rd Qu.:2.649e+04   3rd Qu.:2017                                        
 Max.   :3.737e+09   Max.   :2022                                        
                     NA's   :1205                                        
  condition          cylinders             fuel              odometer       
 Length:426880      Length:426880      Length:426880      Min.   :       0  
 Class :character   Class :character   Class :character   1st Qu.:   37704  
 Mode  :character   Mode  :character   Mode  :character   Median :   85548  
                                                          Mean   :   98043  
                                                          3rd Qu.:  133542  
                                                          Max.   :10000000  
                                                          NA's   :4400      
 title_status       transmission           VIN               drive          
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     size               type           paint_color         image_url        
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 description         county           state                lat        
 Length:426880      Mode:logical   Length:426880      Min.   :-84.12  
 Class :character   NA's:426880    Class :character   1st Qu.: 34.60  
 Mode  :character                  Mode  :character   Median : 39.15  
                                                      Mean   : 38.49  
                                                      3rd Qu.: 42.40  
                                                      Max.   : 82.39  
                                                      NA's   :6549    
      long          posting_date                   
 Min.   :-159.83   Min.   :2021-04-04 07:00:25.00  
 1st Qu.:-111.94   1st Qu.:2021-04-17 10:47:46.75  
 Median : -88.43   Median :2021-04-26 01:08:31.50  
 Mean   : -94.75   Mean   :2021-04-24 00:23:42.98  
 3rd Qu.: -80.83   3rd Qu.:2021-05-01 13:32:26.25  
 Max.   : 173.89   Max.   :2021-05-05 04:24:09.00  
 NA's   :6549      NA's   :68                      
  




3. Data Cleaning

Unfortunately, working with raw data comes with some flaws. There will always be some oddballs lurking around, like someone trying to sell their 2000 Civic for $1,000,000 or someone accidentally adding an extra zero or two to their odometer reading. To combat this, I used the 95th percentile for the price ($44,500) and the 99th percentile for odometer (280,000 mi.) as a cutoff point to get rid of the outliers, which is a reasonable cutoff for used cars. As for the fuel, transmission, type, and title, I limited my data to values that can have meaning in my model. Lastly, removing all listing entries with any NA values, I was left with something I could work with. The columns I kept are listed as follows:


Here is a sample of the presence of outliers in the prices column. Viewing the raw data with a histogram of the price distribution shows how skewed the data is prior to cleaning.



10 Random Vehicles From Clenaed Data
Price Year Manufacturer Model Condition Cylinders Fuel Odometer Title Status Transmission Type
$13,950 2007 bmw 335i convertible excellent 6 gas 56,000 clean automatic convertible
$2,650 2001 honda accord fair 4 gas 213,000 clean automatic sedan
$8,490 2006 ford f-150 excellent 6 gas 168,000 clean automatic pickup
$1,999 2004 ford f150 fair 8 gas 131,000 rebuilt automatic pickup
$6,900 2009 honda fit good 4 gas 84,475 clean manual hatchback
$12,900 2008 toyota tundra good 8 gas 201,143 clean automatic truck
$13,999 2011 bmw x5 xdrive35d good 6 diesel 100,675 clean automatic SUV
$5,950 2003 toyota matrix xr hatchback excellent 4 gas 202,000 clean automatic hatchback
$12,916 2013 cadillac xts excellent 6 gas 114,263 clean automatic sedan
$24,500 2000 ford excursion limited 4x4 excellent 8 diesel 209,500 clean automatic SUV
Click to view the cleaned dataset summary
     price            year      manufacturer          model          
 Min.   :  500   Min.   :1980   Length:103886      Length:103886     
 1st Qu.: 5900   1st Qu.:2006   Class :character   Class :character  
 Median : 9950   Median :2011   Mode  :character   Mode  :character  
 Mean   :12871   Mean   :2010                                        
 3rd Qu.:17499   3rd Qu.:2014                                        
 Max.   :44500   Max.   :2022                                        
  condition           cylinders          fuel              odometer     
 Length:103886      Min.   : 3.000   Length:103886      Min.   :   537  
 Class :character   1st Qu.: 4.000   Class :character   1st Qu.: 77201  
 Mode  :character   Median : 6.000   Mode  :character   Median :115510  
                    Mean   : 5.783                      Mean   :117582  
                    3rd Qu.: 8.000                      3rd Qu.:155000  
                    Max.   :12.000                      Max.   :280000  
 title_status       transmission           type          
 Length:103886      Length:103886      Length:103886     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
  




4. Correlation Analysis

Before modeling and applying linear regression, I conducted a correlation analysis on the quantitative variables in the data set. Specifically, I was interested in understanding which variable out of year, number of cylinders, and odometer most effects the price of cars by applying cor() to these variables.


Correlation Between Quantitative Variable
Variable 1 Variable 2 Correlation
odometer price 0.5077
year price 0.5063
odometer year 0.4501
cylinders price 0.2869
odometer cylinders -0.1505
cylinders year -0.2328


Initially when producing this table, the odometer correlation read out a negative number, which means as the mileage of the car goes up, the price goes down. To make more sense of the data, I simply used a negative odometer reading. As expected, the price gets affected the most by the odometer and year of the vehicle. This makes sense as older cars that tend to have higher mileage typically suggests a lower value for the vehicle. These findings align with expectations of depreciation tends. Intuitively the high correlation value for odometer and year also makes perfect sense as older vehicles are driven more, racking up the odometer reading. The weak positive correlation value for cylinders to price suggests that vehicles with more cylinders (V6 or V8 engines) are slightly pricier due to some of these vehicles being sports cars. However, the weak correlation suggests it is not a strong determinant of price. One key reasoning behind this is because of the prevalence of non-sporty, everyday vehicles can be equipped with larger engines that don’t command for a markup on the prices.




5. Modeling

After cleaning the data and exploring relationships between variables, I selected all variables except for the model of the vehicle for the model to predict the price, since the model variable has a high carnality of 4098 unique values. As for the year of the car, I changed the variable into the age of the car instead as it is more intuitive for predicting prices. I split the data set, with 80% to build the model and the remaining 20% to test the model. The model used is as follows:



\[\begin{aligned} &\text{price} = \beta_0 + \beta_1(\text{age}) + \beta_2(\text{odometer}) + \beta_3(\text{cylinders}) + \beta_4(\text{manufacturer}) + \\ &\quad \beta_5(\text{condition}) + \beta_6(\text{fuel}) + \beta_7(\text{title_status}) + \beta_8(\text{transmission}) + \beta_9(\text{type}) + \epsilon \end{aligned}\]`

set.seed(138)

cars_model <- cars_clean %>%
  mutate(across(c(manufacturer, condition, fuel, title_status, transmission, type), as.factor))
cars_model$age <- 2024 - cars_model$year
cars_model <- cars_model %>% select(-year)

train_data <- cars_model %>% sample_frac(0.8)
test_data <- cars_model %>% anti_join(train_data)
lm_model <- lm(price ~ age + odometer + cylinders + 
               manufacturer + condition + fuel + 
               title_status + transmission + type,
             data = train_data)

test_data$pred_price <- predict(lm_model, newdata = test_data)

6. Model Performance



Model Performance MAE & R^2
Metric Value
MAE $ 3780.97
R^2 0.691

I opted in using two key evaluation metrics, MAE (Mean absolute error), and R^2. The mean absolute error represents on average, my model’s predictions are off by about ±$3780.97. This means if a vehicle’s true value is priced at $20,000 the model might predict in the range of ($16,219, $23,780). While I am proud of this number, due to lack of factors that goes into pricing a vehicle for my model, I wish to improve this model in the future with a data set involving more variables.

The R-squared value represents the proportion of variance in price that can be explained by the independent variables. In other words, R-squared tells us how well the data fit the regression model, “the goodness of fit”.

Therefore, the model explains 69.1% of the price variation in the data set with an average prediction error of $3780.97.




7. Actual vs. Predicted Price Plot



This scatter plot compares the predicted prices from my model to the actual prices from the data set, where the red line represents predicted = actual. We can see that on the lower valued cars, the model is overestimating the value, while in the higher end, it is underestimating the values. Due to the nature of our data set being collected from craigslist, there are far fewer luxury cars, thus the model struggles to learn their pricing rules when training. Older cars often have other factors such as poor aesthetic conditions and mechanical issues that also do not get taken into account in my model.


Least Represented Manufacturer
Manufacturer Count Average Price Average Age
ferrari 1 43000.00 38
aston-martin 3 38631.67 21
land rover 6 6990.00 19
datsun 13 12799.15 42
harley-davidson 28 15191.36 21
alfa-romeo 40 20089.15 19
fiat 173 8669.73 11
porsche 284 20181.35 20
All Cars 103886 12870.81 14


We can see that luxury brands like Ferrari, Porsche, Alpha-Romeo, and Aston-Martin are extremely underrepresented in our data set and can be considered “outliers” due to their average price compared to the rest of the data set. There are many factors that go into pricing a car that we cannot capture entirely with this data set, like underrepresented luxury cars to hidden damages, which can influence one another when modeling. I will dive more into this in the next section.




8. Price Boosters

I discussed previously about luxury brands and their influence on the model. Using the coefficient from our model, we can see which factors impact the price of a vehicle the most, ranked by their estimated effect on prices. All these coefficients are statistically significant with a p-val < 0.001.


Coefficients in Model With Highest Increase in Price
Value Estimate STD Error Statistic P-Value
manufacturerferrari 31403.296 5301.3815 5.923606 0.00e+00
manufacturerdatsun 16312.766 1610.3293 10.130081 0.00e+00
manufactureraston-martin 14985.896 3750.4078 3.995804 6.45e-05
manufactureralfa-romeo 5816.924 952.9894 6.103871 0.00e+00
manufacturerporsche 5546.499 397.5018 13.953394 0.00e+00
typeoffroad 3954.806 588.1318 6.724354 0.00e+00
typetruck 3608.994 506.1789 7.129879 0.00e+00
manufactureraudi 2677.824 223.2103 11.996864 0.00e+00
typepickup 2647.384 507.8446 5.212981 2.00e-07
transmissionmanual 2267.993 77.5666 29.239298 0.00e+00
manufacturerlexus 2059.275 215.0177 9.577236 0.00e+00


As mentioned previously, luxury cars are an obvious price boosters, however, we also see Datsun, a manufacture known for their classic/vintage appeal, driving prices beyond typical market value. There is also a trend of a type of vehicle in this table, off-roaders and trucks. With more powerful engines and larger sizes, and versatility for work use can boost their costs. Lastly, a finding I was surprised to find were manual transmissions. We are moving to an age where automatic cars are becoming the norm, and only sportier cars are keeping the manual transmission which may reflect enthusiast demand these cars.




9. Model Use Examples

To test the accuracy of my used car price prediction model, I’ve selected 3 cars from autotrader.ca that represents different segments of the market: a simple passenger sedan, a practical family SUV, and my dream car. Since the data set was collected from the American market, I converted the cars I found in my area from CAD -> USD, and KM -> MI.


predict_car_price <- function(age, odometer, cylinders, 
                              manufacturer, fuel, transmission, type) {
  
  new_data <- data.frame(
    age = age,
    odometer = odometer,
    cylinders = cylinders,
    manufacturer = factor(manufacturer, levels = levels(cars_model$manufacturer)),
    condition = factor("good", levels = levels(cars_model$condition)),
    fuel = fuel,
    title_status = factor("clean", levels = levels(cars_model$title_status)),
    transmission = transmission,
    type = type
  )
  predict(lm_model, newdata = new_data)
}




Car #1: 2015 Toyota Corolla


* Listed Price $14,388 CAD ($10,048.44 USD)
* Age: 10
* Odometer: 131680 KM (81822 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan

corolla2015 <- predict_car_price(
  age = 10, 
  odometer = 131680, 
  cylinders = 4, 
  manufacturer = "toyota",
  fuel = "gas",
  transmission = "automatic",
  type = "sedan"
)

cat("Actual price: 10048 USD\nPredicted price:", corolla2015, "USD\nError of:", 
    round((abs(corolla2015 - 10084)/10084)*100, 2),"%")
Actual price: 10048 USD
Predicted price: 11235.96 USD
Error of: 11.42 %




Car #2: 2022 Toyota Rav4


* Listed Price $29,950 CAD ($20,944)
* Age: 3
* Odometer: 89979 KM (55910 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: SUV

RAV42022 <- predict_car_price(
  age = 3, 
  odometer = 55923, 
  cylinders = 4, 
  manufacturer = "toyota",
  fuel = "gas",
  transmission = "automatic",
  type = "SUV"
)
cat("Actual price: 20944 USD\nPredicted price:", RAV42022, "USD\nError of:", 
    round((abs(RAV42022 - 20944)/20944)*100, 2),"%")
Actual price: 20944 USD
Predicted price: 23754.07 USD
Error of: 13.42 %




Car #3: 2009 Lexus ISF


* Listed Price $42,000 CAD ($29,332)
* Age: 16
* Odometer: 82000 KM (50952 MI)
* Cylinders: 8
* Manufacturer: Lexus
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan

ISF2009 <- predict_car_price(
  age = 16, 
  odometer = 50952, 
  cylinders = 8, 
  manufacturer = "lexus",
  fuel = "gas",
  transmission = "automatic",
  type = "sedan"
)

cat("Actual price: 29332 USD\nPredicted price:", ISF2009, "USD\nError of:", 
    round((abs(ISF2009 - 29332)/29332)*100, 2),"%")
Actual price: 29332 USD
Predicted price: 20299.01 USD
Error of: 30.8 %




10. Conclusion

I am deeply proud of applying statistical principles learnt from my course work to real world problems I care about. Also I am pleased with the results of the model achieving an R-squared value of 69.1% explaining the variance of the used car prices, and the mean absolute error of $3780.97. For my first linear regression project the model performs reasonably well for the intended purpose, particularly for common vehicles like sedans and SUV’s of which the dataset is mostly made up of. For example, the Toyota Corolla and RAV4 cases demonstrated respectable errors of 11.42% and 13.42%. The accuracy of my model depends heavily on the heavy representation of mainstream vehicles in the market. However, for more luxury vehicles or sportier cars like the ISF case the error rocketed up to 30.8% as it struggles to capture an accurate predicted price since my dataset does not consist of many luxury vehicles. The model also lacks sufficient factors that are important in determining higher-end vehicles such as performance features, collector demand, and brand prestige. Alongside this, for any vehicle there are many features, damages, maintenance records, that gives lots of variance that affects the prices.




11. Future Improvements

Past data limitations, there are different statistical ways to improve my model with the lack of factors. In STATS 3A03 we explored how to implement weights to underrepresented variables to combat them being drowned by higher volume variable entries. Also, finding that the price distribution is right skewed I would like to apply a natural logarithm to the price variable to reduce the skewness, which will make the distribution more normally distributed which allows linear regression assumptions more valid. I can also choose this by applying better choices for the cutoff points each qualitative variable.




12. References


12.1 Data Source

Used Car Dataset, Vehicles listings from Craigslist. Retrieved from kaggle


12.2 Academic references

Course materials from STATS 3A03, and MATH 3MB3 for Linear regression and modeling, and HTHS1M03 for R and RStudio


12.3 R packages

  • tidyverse
  • dplyr
  • knitr
  • gridExtra
  • tidyr
  • reshape2
  • Metrics
  • readr