1. Introduction
2. Data
3. Data Cleaning
4. Correlation Analysis
5. Modeling
6. Model Performance
7. Actual vs. Predicted Price Plot
8. Price Boosters
9. Model Use Examples
10. Conclusion
11. Future Improvements
12. References

1. Introduction

For this project, I aim to further expand my knowledge of linear regression that I was introduced to in taking STATS 3A03 and MATH 3MB3 at McMaster University. By applying techniques learnt from lectures to real-world datasets I am passionate about, like cars, I hope to gain deeper insight into what factors influence car prices.

Being passionate about cars, this project offers an opportunity to merge my interests with skills I’ve developed in math and statistics. I’ve always spent countless hours browsing through car listing sites like autotrader.ca, daydreaming about driving cars I can’t afford yet. Websites like Autotrader have a nifty system called “Price Analysis” that calculates price ranges for their vehicles listed into categories like “Great Price”, “Good Price”, “Fair Price”, and “Above Average”. I’ve always been curious about how they categorize these vehicles and what kind of algorithms they use. Now, after learning more about linear regression, I feel that I have a much better understanding of how they might do it.

Starting with cleaning the data, I’ll be exploring different correlations between factors and how they can influence car prices, and understanding which variables hold more weight in determining a vehicle value. At the end of my project, I will be applying my model to my dream car, the Lexus ISF, to see if it can accurately estimate its market price and compare it to ones I often find myself admiring on Autotrader.

Additionally, I want to do this entire project in R and RStudio to improve my skills in these tools. I’ve learnt basic R in the course HTHSCI1M03, and I want to further expand my knowledge in publishing a project using markdown. At the end of this project, I hope to become more comfortable with the R ecosystem.

My dream car, the Lexus IS-F

2. Data

Thanks to Austin Reese, I was able to easily collect my used cars data from Kaggle where he scrapes data from Craigslist. The CSV file consists of 426,880 entries and 26 columns.

Below shows the summary of the raw data set that was collected. In it’s current form, it is uncleaned and unusable for any meaningful analysis due to several issues. The presence of outliers are evident in the price column where the values range from zero to 3.7 billion, which is unrealistic for used car prices. Similarly the odometer values range from zero to 10 million miles which suggests data entry errors. There are also an abundance of missing values which will skew our model so they will be removed. Lastly, there are several redundant columns that are no use to this project which will also be dropped.

Click to view uncleaned dataset summary

       id                url               region           region_url       
 Min.   :7.207e+09   Length:426880      Length:426880      Length:426880     
 1st Qu.:7.308e+09   Class :character   Class :character   Class :character  
 Median :7.313e+09   Mode  :character   Mode  :character   Mode  :character  
 Mean   :7.311e+09                                                           
 3rd Qu.:7.315e+09                                                           
 Max.   :7.317e+09                                                           
                                                                             
     price                year      manufacturer          model          
 Min.   :0.000e+00   Min.   :1900   Length:426880      Length:426880     
 1st Qu.:5.900e+03   1st Qu.:2008   Class :character   Class :character  
 Median :1.395e+04   Median :2013   Mode  :character   Mode  :character  
 Mean   :7.520e+04   Mean   :2011                                        
 3rd Qu.:2.649e+04   3rd Qu.:2017                                        
 Max.   :3.737e+09   Max.   :2022                                        
                     NA's   :1205                                        
  condition          cylinders             fuel              odometer       
 Length:426880      Length:426880      Length:426880      Min.   :       0  
 Class :character   Class :character   Class :character   1st Qu.:   37704  
 Mode  :character   Mode  :character   Mode  :character   Median :   85548  
                                                          Mean   :   98043  
                                                          3rd Qu.:  133542  
                                                          Max.   :10000000  
                                                          NA's   :4400      
 title_status       transmission           VIN               drive          
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     size               type           paint_color         image_url        
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 description         county           state                lat        
 Length:426880      Mode:logical   Length:426880      Min.   :-84.12  
 Class :character   NA's:426880    Class :character   1st Qu.: 34.60  
 Mode  :character                  Mode  :character   Median : 39.15  
                                                      Mean   : 38.49  
                                                      3rd Qu.: 42.40  
                                                      Max.   : 82.39  
                                                      NA's   :6549    
      long          posting_date                   
 Min.   :-159.83   Min.   :2021-04-04 07:00:25.00  
 1st Qu.:-111.94   1st Qu.:2021-04-17 10:47:46.75  
 Median : -88.43   Median :2021-04-26 01:08:31.50  
 Mean   : -94.75   Mean   :2021-04-24 00:23:42.98  
 3rd Qu.: -80.83   3rd Qu.:2021-05-01 13:32:26.25  
 Max.   : 173.89   Max.   :2021-05-05 04:24:09.00  
 NA's   :6549      NA's   :68

3. Data Cleaning

Unfortunately, working with raw data comes with some flaws. There will always be some oddballs lurking around, like someone trying to sell their 2000 Civic for $1,000,000 or someone accidentally adding an extra zero or two to their odometer reading. To combat this, I used the 95th percentile for the price ($44,500) and the 99th percentile for odometer (280,000 mi.) as a cutoff point to get rid of the outliers, which is a reasonable cutoff for used cars. As for the fuel, transmission, type, and title, I limited my data to values that can have meaning in my model. Lastly, removing all listing entries with any NA values, I was left with something I could work with. The columns I kept are listed as follows:

Price (USD)
Year
Manufacturer (Car brand)
Model (Make of car)
Condition (New, like new, fair, salvage, etc.)
Cylinders
Fuel (Gas, diesel, hybrid, or electric)
Odometer (Miles)
Title status (Clean or salvage)
Transmission (Automatic or manual)
Type (SUV, sedan, etc.)

Here is a sample of the presence of outliers in the prices column. Viewing the raw data with a histogram of the price distribution shows how skewed the data is prior to cleaning.

10 Random Vehicles From Clenaed Data
Price	Year	Manufacturer	Model	Condition	Cylinders	Fuel	Odometer	Title Status	Transmission	Type
$13,950	2007	bmw	335i convertible	excellent	6	gas	56,000	clean	automatic	convertible
$2,650	2001	honda	accord	fair	4	gas	213,000	clean	automatic	sedan
$8,490	2006	ford	f-150	excellent	6	gas	168,000	clean	automatic	pickup
$1,999	2004	ford	f150	fair	8	gas	131,000	rebuilt	automatic	pickup
$6,900	2009	honda	fit	good	4	gas	84,475	clean	manual	hatchback
$12,900	2008	toyota	tundra	good	8	gas	201,143	clean	automatic	truck
$13,999	2011	bmw	x5 xdrive35d	good	6	diesel	100,675	clean	automatic	SUV
$5,950	2003	toyota	matrix xr hatchback	excellent	4	gas	202,000	clean	automatic	hatchback
$12,916	2013	cadillac	xts	excellent	6	gas	114,263	clean	automatic	sedan
$24,500	2000	ford	excursion limited 4x4	excellent	8	diesel	209,500	clean	automatic	SUV

Click to view the cleaned dataset summary

     price            year      manufacturer          model          
 Min.   :  500   Min.   :1980   Length:103886      Length:103886     
 1st Qu.: 5900   1st Qu.:2006   Class :character   Class :character  
 Median : 9950   Median :2011   Mode  :character   Mode  :character  
 Mean   :12871   Mean   :2010                                        
 3rd Qu.:17499   3rd Qu.:2014                                        
 Max.   :44500   Max.   :2022                                        
  condition           cylinders          fuel              odometer     
 Length:103886      Min.   : 3.000   Length:103886      Min.   :   537  
 Class :character   1st Qu.: 4.000   Class :character   1st Qu.: 77201  
 Mode  :character   Median : 6.000   Mode  :character   Median :115510  
                    Mean   : 5.783                      Mean   :117582  
                    3rd Qu.: 8.000                      3rd Qu.:155000  
                    Max.   :12.000                      Max.   :280000  
 title_status       transmission           type          
 Length:103886      Length:103886      Length:103886     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character

4. Correlation Analysis

Before modeling and applying linear regression, I conducted a correlation analysis on the quantitative variables in the data set. Specifically, I was interested in understanding which variable out of year, number of cylinders, and odometer most effects the price of cars by applying cor() to these variables.

Correlation Between Quantitative Variable
Variable 1	Variable 2	Correlation
odometer	price	0.5077
year	price	0.5063
odometer	year	0.4501
cylinders	price	0.2869
odometer	cylinders	-0.1505
cylinders	year	-0.2328

Initially when producing this table, the odometer correlation read out a negative number, which means as the mileage of the car goes up, the price goes down. To make more sense of the data, I simply used a negative odometer reading. As expected, the price gets affected the most by the odometer and year of the vehicle. This makes sense as older cars that tend to have higher mileage typically suggests a lower value for the vehicle. These findings align with expectations of depreciation tends. Intuitively the high correlation value for odometer and year also makes perfect sense as older vehicles are driven more, racking up the odometer reading. The weak positive correlation value for cylinders to price suggests that vehicles with more cylinders (V6 or V8 engines) are slightly pricier due to some of these vehicles being sports cars. However, the weak correlation suggests it is not a strong determinant of price. One key reasoning behind this is because of the prevalence of non-sporty, everyday vehicles can be equipped with larger engines that don’t command for a markup on the prices.

5. Modeling

After cleaning the data and exploring relationships between variables, I selected all variables except for the model of the vehicle for the model to predict the price, since the model variable has a high carnality of 4098 unique values. As for the year of the car, I changed the variable into the age of the car instead as it is more intuitive for predicting prices. I split the data set, with 80% to build the model and the remaining 20% to test the model. The model used is as follows:

\[\begin{aligned} &\text{price} = \beta_0 + \beta_1(\text{age}) + \beta_2(\text{odometer}) + \beta_3(\text{cylinders}) + \beta_4(\text{manufacturer}) + \\ &\quad \beta_5(\text{condition}) + \beta_6(\text{fuel}) + \beta_7(\text{title_status}) + \beta_8(\text{transmission}) + \beta_9(\text{type}) + \epsilon \end{aligned}\]`

set.seed(138)

cars_model <- cars_clean %>%
  mutate(across(c(manufacturer, condition, fuel, title_status, transmission, type), as.factor))
cars_model$age <- 2024 - cars_model$year
cars_model <- cars_model %>% select(-year)

train_data <- cars_model %>% sample_frac(0.8)
test_data <- cars_model %>% anti_join(train_data)
lm_model <- lm(price ~ age + odometer + cylinders + 
               manufacturer + condition + fuel + 
               title_status + transmission + type,
             data = train_data)

test_data$pred_price <- predict(lm_model, newdata = test_data)

6. Model Performance

Model Performance MAE & R^2
Metric	Value
MAE	$ 3780.97
R^2	0.691

I opted in using two key evaluation metrics, MAE (Mean absolute error), and R^2. The mean absolute error represents on average, my model’s predictions are off by about ±$3780.97. This means if a vehicle’s true value is priced at $20,000 the model might predict in the range of ($16,219, $23,780). While I am proud of this number, due to lack of factors that goes into pricing a vehicle for my model, I wish to improve this model in the future with a data set involving more variables.

The R-squared value represents the proportion of variance in price that can be explained by the independent variables. In other words, R-squared tells us how well the data fit the regression model, “the goodness of fit”.

Therefore, the model explains 69.1% of the price variation in the data set with an average prediction error of $3780.97.

7. Actual vs. Predicted Price Plot

This scatter plot compares the predicted prices from my model to the actual prices from the data set, where the red line represents predicted = actual. We can see that on the lower valued cars, the model is overestimating the value, while in the higher end, it is underestimating the values. Due to the nature of our data set being collected from craigslist, there are far fewer luxury cars, thus the model struggles to learn their pricing rules when training. Older cars often have other factors such as poor aesthetic conditions and mechanical issues that also do not get taken into account in my model.

Least Represented Manufacturer
Manufacturer	Count	Average Price	Average Age
ferrari	1	43000.00	38
aston-martin	3	38631.67	21
land rover	6	6990.00	19
datsun	13	12799.15	42
harley-davidson	28	15191.36	21
alfa-romeo	40	20089.15	19
fiat	173	8669.73	11
porsche	284	20181.35	20
All Cars	103886	12870.81	14

We can see that luxury brands like Ferrari, Porsche, Alpha-Romeo, and Aston-Martin are extremely underrepresented in our data set and can be considered “outliers” due to their average price compared to the rest of the data set. There are many factors that go into pricing a car that we cannot capture entirely with this data set, like underrepresented luxury cars to hidden damages, which can influence one another when modeling. I will dive more into this in the next section.

8. Price Boosters

I discussed previously about luxury brands and their influence on the model. Using the coefficient from our model, we can see which factors impact the price of a vehicle the most, ranked by their estimated effect on prices. All these coefficients are statistically significant with a p-val < 0.001.

Coefficients in Model With Highest Increase in Price
Value	Estimate	STD Error	Statistic	P-Value
manufacturerferrari	31403.296	5301.3815	5.923606	0.00e+00
manufacturerdatsun	16312.766	1610.3293	10.130081	0.00e+00
manufactureraston-martin	14985.896	3750.4078	3.995804	6.45e-05
manufactureralfa-romeo	5816.924	952.9894	6.103871	0.00e+00
manufacturerporsche	5546.499	397.5018	13.953394	0.00e+00
typeoffroad	3954.806	588.1318	6.724354	0.00e+00
typetruck	3608.994	506.1789	7.129879	0.00e+00
manufactureraudi	2677.824	223.2103	11.996864	0.00e+00
typepickup	2647.384	507.8446	5.212981	2.00e-07
transmissionmanual	2267.993	77.5666	29.239298	0.00e+00
manufacturerlexus	2059.275	215.0177	9.577236	0.00e+00

As mentioned previously, luxury cars are an obvious price boosters, however, we also see Datsun, a manufacture known for their classic/vintage appeal, driving prices beyond typical market value. There is also a trend of a type of vehicle in this table, off-roaders and trucks. With more powerful engines and larger sizes, and versatility for work use can boost their costs. Lastly, a finding I was surprised to find were manual transmissions. We are moving to an age where automatic cars are becoming the norm, and only sportier cars are keeping the manual transmission which may reflect enthusiast demand these cars.

9. Model Use Examples

To test the accuracy of my used car price prediction model, I’ve selected 3 cars from autotrader.ca that represents different segments of the market: a simple passenger sedan, a practical family SUV, and my dream car. Since the data set was collected from the American market, I converted the cars I found in my area from CAD -> USD, and KM -> MI.

predict_car_price <- function(age, odometer, cylinders, 
                              manufacturer, fuel, transmission, type) {
  
  new_data <- data.frame(
    age = age,
    odometer = odometer,
    cylinders = cylinders,
    manufacturer = factor(manufacturer, levels = levels(cars_model$manufacturer)),
    condition = factor("good", levels = levels(cars_model$condition)),
    fuel = fuel,
    title_status = factor("clean", levels = levels(cars_model$title_status)),
    transmission = transmission,
    type = type
  )
  predict(lm_model, newdata = new_data)
}

Car #1: 2015 Toyota Corolla

* Listed Price $14,388 CAD ($10,048.44 USD)
* Age: 10
* Odometer: 131680 KM (81822 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan

corolla2015 <- predict_car_price(
  age = 10, 
  odometer = 131680, 
  cylinders = 4, 
  manufacturer = "toyota",
  fuel = "gas",
  transmission = "automatic",
  type = "sedan"
)

cat("Actual price: 10048 USD\nPredicted price:", corolla2015, "USD\nError of:", 
    round((abs(corolla2015 - 10084)/10084)*100, 2),"%")

Actual price: 10048 USD
Predicted price: 11235.96 USD
Error of: 11.42 %

Car #2: 2022 Toyota Rav4

* Listed Price $29,950 CAD ($20,944)
* Age: 3
* Odometer: 89979 KM (55910 MI)
* Cylinders: 4
* Manufacturer: Toyota
* Fuel: Gas
* Transmission: Automatic
* Type: SUV

RAV42022 <- predict_car_price(
  age = 3, 
  odometer = 55923, 
  cylinders = 4, 
  manufacturer = "toyota",
  fuel = "gas",
  transmission = "automatic",
  type = "SUV"
)
cat("Actual price: 20944 USD\nPredicted price:", RAV42022, "USD\nError of:", 
    round((abs(RAV42022 - 20944)/20944)*100, 2),"%")

Actual price: 20944 USD
Predicted price: 23754.07 USD
Error of: 13.42 %

Car #3: 2009 Lexus ISF

* Listed Price $42,000 CAD ($29,332)
* Age: 16
* Odometer: 82000 KM (50952 MI)
* Cylinders: 8
* Manufacturer: Lexus
* Fuel: Gas
* Transmission: Automatic
* Type: Sedan

ISF2009 <- predict_car_price(
  age = 16, 
  odometer = 50952, 
  cylinders = 8, 
  manufacturer = "lexus",
  fuel = "gas",
  transmission = "automatic",
  type = "sedan"
)

cat("Actual price: 29332 USD\nPredicted price:", ISF2009, "USD\nError of:", 
    round((abs(ISF2009 - 29332)/29332)*100, 2),"%")

Actual price: 29332 USD
Predicted price: 20299.01 USD
Error of: 30.8 %

10. Conclusion

I am deeply proud of applying statistical principles learnt from my course work to real world problems I care about. Also I am pleased with the results of the model achieving an R-squared value of 69.1% explaining the variance of the used car prices, and the mean absolute error of $3780.97. For my first linear regression project the model performs reasonably well for the intended purpose, particularly for common vehicles like sedans and SUV’s of which the dataset is mostly made up of. For example, the Toyota Corolla and RAV4 cases demonstrated respectable errors of 11.42% and 13.42%. The accuracy of my model depends heavily on the heavy representation of mainstream vehicles in the market. However, for more luxury vehicles or sportier cars like the ISF case the error rocketed up to 30.8% as it struggles to capture an accurate predicted price since my dataset does not consist of many luxury vehicles. The model also lacks sufficient factors that are important in determining higher-end vehicles such as performance features, collector demand, and brand prestige. Alongside this, for any vehicle there are many features, damages, maintenance records, that gives lots of variance that affects the prices.

11. Future Improvements

Past data limitations, there are different statistical ways to improve my model with the lack of factors. In STATS 3A03 we explored how to implement weights to underrepresented variables to combat them being drowned by higher volume variable entries. Also, finding that the price distribution is right skewed I would like to apply a natural logarithm to the price variable to reduce the skewness, which will make the distribution more normally distributed which allows linear regression assumptions more valid. I can also choose this by applying better choices for the cutoff points each qualitative variable.

12. References

12.1 Data Source

Used Car Dataset, Vehicles listings from Craigslist. Retrieved from kaggle

12.2 Academic references

Course materials from STATS 3A03, and MATH 3MB3 for Linear regression and modeling, and HTHS1M03 for R and RStudio

12.3 R packages

tidyverse
dplyr
knitr
gridExtra
tidyr
reshape2
Metrics
readr

Analyzing and Predicting Used Car Prices

A Data Science Project

Katsuma Onishi

March 20, 2025