I am analyzing car depreciation using vehicle listings from Craigslist. I have loved cars since I was a kid, and to this day I scroll through apps like CarGurus and Autotrader as if I were looking through a store window. My goal is to find which pieces of information in Craigslist car listings are correlated with one another and which are not.
I am using a large data set from Kaggle titled vehicle_depreciation. It has over 400,000 rows and 26 columns, and each row is a vehicle. If you would like to use the data set yourself, here is a link to it: Vehicle_Depreciation.
My purpose in analyzing used-car data is to learn where depreciation is most pronounced, which brands retain their value best, how depreciation differs between diesel and gas trucks, and how condition, mileage, and age play a role in the price of a vehicle.
Data
library(tidyverse) # Data wrangling and visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(httr) # HTTP requests for web scraping
library(rvest) # HTML scraping for secondary data source
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
library(lubridate) # Date handling
library(magrittr) # Additional pipe operators
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
library(jsonlite) # Parsing JSON from APIs
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(scales) # Axis label formatting (dollar, comma)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(knitr) # Table rendering in Quarto
Here I am loading in the .csv and filtering out cars that are irrelevant to what I plan to achieve.
car <- read_csv("vehicle_depreciation.csv")
New names:
• `` -> `...1`
Rows: 458213 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl   (7): ...1, id, price, year, odometer, lat, long
dttm  (1): posting_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
car <- car %>%
  filter(
    price > 500, price < 100000,
    year >= 2000, year <= 2026,
    odometer > 100, odometer < 300000,
    !is.na(manufacturer)
  ) %>%
  mutate(
    age = 2026 - year,
    fuel = factor(fuel),
    transmission = factor(transmission),
    condition = factor(condition, levels = c("new", "like new", "excellent", "good", "fair", "salvage")),
    type = factor(type)
  )
Now that the data is filtered down to the relevant listings, let’s start building visualizations and drawing conclusions. First, I want to show the distribution of prices for the vehicles in the data set.
car %>%
  ggplot(aes(x = price)) +
  geom_histogram(bins = 100, fill = "blue2") +
  geom_vline(aes(xintercept = median(price, na.rm = TRUE)),
             color = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Distribution of Vehicle Prices")
Here we can see that the median price of a vehicle in the data set is around $13,000. There is a high volume of cars priced under $10,000. The distribution is heavily right-skewed, with a long tail toward $100,000. This highlights how few luxury cars we are likely to see compared with the majority of “average” cars.
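One rough way to put a number on that right skew is to compare the mean with the median: a long tail of expensive cars pulls the mean up while the median stays put. The sketch below uses made-up prices, not the Craigslist data, so the figures are only illustrative.

```r
# Hypothetical prices (not from the data set): mostly cheaper cars plus a few expensive ones
toy_prices <- c(4000, 6500, 8000, 9500, 12000, 13000, 15000, 21000, 48000, 95000)

price_median <- median(toy_prices)
price_mean <- mean(toy_prices)

# In a right-skewed distribution the mean sits above the median
skew_gap <- price_mean - price_median
print(c(median = price_median, mean = price_mean, gap = skew_gap))
```

The same comparison could be run on the real data with median(car$price) and mean(car$price).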
car %>%
  group_by(year) %>%
  summarize(avg_price = mean(price, na.rm = TRUE)) %>%
  ggplot(aes(x = year, y = avg_price)) +
  geom_line(color = "red", linewidth = 1.2) +
  geom_smooth(method = "lm", se = TRUE, linetype = "dashed",
              color = "gray40", fill = "lightblue", alpha = 0.3) +
  scale_y_continuous(labels = dollar) +
  scale_x_continuous(breaks = seq(2000, 2026, by = 4)) +
  labs(
    title = "Average Listed Price by Model Year",
    subtitle = "Newer vehicles command significantly higher asking prices",
    x = "Model Year",
    y = "Average Listed Price"
  )
`geom_smooth()` using formula = 'y ~ x'
The line graph shows an almost exponential decrease in value as vehicles get older. At about 16 years old, depreciation starts to level out. The sharpest depreciation happens during the first 3 years of ownership.
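If the decline really is close to exponential, then the logarithm of price should fall roughly in a straight line with age, and the slope of that line gives an annual depreciation rate. The sketch below shows the idea on synthetic data; the starting price of 30,000 and the 12% rate are made-up illustration values, not estimates from the Craigslist listings.

```r
# Synthetic example: a car that loses exactly 12% of its value every year.
# Both the 30000 starting price and the 12% rate are made-up values.
toy <- data.frame(age = 0:15)
toy$price <- 30000 * (1 - 0.12)^toy$age

# If price = p0 * (1 - r)^age, then log(price) is linear in age
fit <- lm(log(price) ~ age, data = toy)

# Convert the fitted slope back into an annual depreciation rate
annual_rate <- 1 - exp(coef(fit)[["age"]])
print(round(annual_rate, 3))
```

On the real data, the same model could be fit with lm(log(price) ~ age, data = car), although a single rate would only approximate a curve that flattens out for older vehicles.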
mustang_v8_manual <- car %>%
  filter(
    str_detect(tolower(model), "mustang"),
    cylinders == "8 cylinders",
    transmission == "manual",
    odometer < 100000,
    age <= 10
  ) %>%
  mutate(car_name = "Ford Mustang V8")

camaro_v8_manual <- car %>%
  filter(
    str_detect(tolower(model), "camaro"),
    cylinders == "8 cylinders",
    transmission == "manual",
    odometer < 100000,
    age <= 10
  ) %>%
  mutate(car_name = "Chevrolet Camaro V8")

challenger_v8_manual <- car %>%
  filter(
    str_detect(tolower(model), "challenger"),
    cylinders == "8 cylinders",
    transmission == "manual",
    odometer < 100000,
    age <= 10
  ) %>%
  mutate(car_name = "Dodge Challenger V8")

muscle_cars <- bind_rows(mustang_v8_manual, camaro_v8_manual, challenger_v8_manual)

muscle_cars %>%
  ggplot(aes(x = odometer, y = price, color = car_name)) +
  geom_point(alpha = 0.2, size = 1.5) +
  geom_smooth(method = "loess", se = FALSE, linewidth = 1.4) +
  scale_y_continuous(labels = dollar) +
  scale_x_continuous(labels = comma) +
  scale_color_manual(values = c(
    "Ford Mustang V8" = "blue",
    "Chevrolet Camaro V8" = "yellow2",
    "Dodge Challenger V8" = "red3"
  )) +
  labs(
    title = "Price vs. Mileage — American Muscle V8",
    x = "Odometer (miles)",
    y = "Listed Price",
    color = "Vehicle"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
`geom_smooth()` using formula = 'y ~ x'
The muscle car is a staple of American car culture, and choosing between a Camaro, a Challenger, and a Mustang is hard. Looking only at V8, manual-transmission cars that are no more than 10 years old and have under 100,000 miles, which of the three holds its value best as miles accumulate on the odometer? In other words, which car gives you the most miles for your money? The visualization shows that the Dodge Challenger has the lowest initial value but finishes above its competitors. There are of course other factors to consider, such as trim level: a regular Challenger R/T is not going to hold its value as well as a Scat Pack or higher.
get_avg_price <- function(manufacturer_input, model_input, year_input) {
  result <- car %>%
    filter(
      manufacturer == tolower(manufacturer_input),
      str_detect(model, tolower(model_input)),
      year == year_input
    )

  if (nrow(result) == 0) {
    print(paste("No listings found for a", year_input, manufacturer_input, model_input))
    return(NULL)
  }

  avg <- mean(result$price, na.rm = TRUE)
  low <- min(result$price, na.rm = TRUE)
  high <- max(result$price, na.rm = TRUE)
  n <- nrow(result)

  print(paste("Vehicle: ", year_input, manufacturer_input, model_input))
  print(paste("Listings Found: ", n))
  print(paste("Average Price: ", dollar(avg)))
  print(paste("Lowest Listing: ", dollar(low)))
  print(paste("Highest Listing:", dollar(high)))
}

get_avg_price("ford", "fusion", 2007)
This is a simple lookup function that lets anyone query the data set by manufacturer, model, and year to get a quick price summary. For example, if you are curious what a 2013 Ford Mustang was listing for on Craigslist, you can call the function and instantly see the number of listings found, the average asking price, and the lowest and highest listings in the data set. This is useful because a single average can be misleading. Knowing that a car has 47 listings ranging from $8,000 to $22,000 tells a much more complete story than just seeing an average of $14,000. Rather than manually filtering the data set every time, this function does the work for you and makes the data approachable for anyone who wants to look up a specific vehicle. It is a small but practical example of how data analysis can be turned into a usable tool.
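A possible variation on this idea is to return the summary as a one-row data frame instead of printing it, so the result can be reused in later steps. The sketch below is only an illustration: summarize_listings is a hypothetical name, and toy_car is a handful of made-up rows standing in for the filtered car data.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the filtered `car` data frame (made-up rows)
toy_car <- tibble(
  manufacturer = c("ford", "ford", "ford", "honda"),
  model        = c("fusion se", "fusion", "mustang gt", "civic"),
  year         = c(2007, 2007, 2013, 2007),
  price        = c(3500, 4500, 18000, 5000)
)

# Same lookup idea, but the summary is returned rather than printed.
# `summarize_listings` is a hypothetical helper, not part of the original analysis.
summarize_listings <- function(data, manufacturer_input, model_input, year_input) {
  data %>%
    filter(
      manufacturer == tolower(manufacturer_input),
      str_detect(model, fixed(tolower(model_input))),
      year == year_input
    ) %>%
    summarize(
      listings  = n(),
      avg_price = mean(price, na.rm = TRUE),
      low       = min(price, na.rm = TRUE),
      high      = max(price, na.rm = TRUE)
    )
}

fusion_2007 <- summarize_listings(toy_car, "Ford", "Fusion", 2007)
print(fusion_2007)
```

Passing the data frame in as an argument keeps the helper independent of any global object. When nothing matches, listings is 0 and the price columns are not meaningful, so that case would still need the kind of check the original function performs.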
car %>%
  mutate(drive = str_replace(drive, "4wd", "awd")) %>%
  filter(
    !is.na(drive),
    !is.na(cylinders),
    drive %in% c("fwd", "awd", "rwd")
  ) %>%
  ggplot(aes(x = drive, fill = drive)) +
  geom_bar() +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c(
    "fwd" = "blue",
    "awd" = "red",
    "rwd" = "orange"
  )) +
  labs(
    title = "Number of Vehicles by Drive Type",
    x = "Drive Type",
    y = "Number of Listings",
    fill = "Drive Type"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")
The most common drive type for vehicles listed on Craigslist is all-wheel drive (four-wheel-drive listings are counted as all-wheel drive here). This is a nice feature, especially if your location gets lots of rain or snow. Next is front-wheel drive, which is common in lower-trim vehicles and helps keep them less expensive. Least common is rear-wheel drive, which is usually found in trucks and sports cars rather than in a regular everyday car.
car %>%
  mutate(
    type = str_replace(type, "pickup", "truck"),
    cylinders = str_remove(cylinders, " cylinders") # clean cylinders here
  ) %>%
  filter(
    type %in% c("sedan", "SUV", "truck"),
    cylinders %in% c("4", "6", "8"),
    !is.na(cylinders),
    !is.na(type)
  ) %>%
  ggplot(aes(x = cylinders, fill = cylinders)) +
  geom_bar() +
  facet_wrap(~ type) +
  scale_y_continuous(labels = comma) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none") +
  labs(
    title = "Number of Listings by Cylinders",
    subtitle = "Faceted by vehicle type",
    x = "Cylinders",
    y = "Number of Listings"
  )
The size of a vehicle correlates with the size of its engine. The majority of sedans have a 4-cylinder engine, SUVs most often have a 6-cylinder engine, and most trucks have an 8-cylinder engine.
# Top manufacturers for cleaner visual
top_manufacturers <- car %>%
  count(manufacturer) %>%
  slice_max(n, n = 15) %>%
  pull(manufacturer)

# Plot 1 - Overall listings by manufacturer
car %>%
  filter(manufacturer %in% top_manufacturers) %>%
  count(manufacturer) %>%
  ggplot(aes(x = reorder(manufacturer, n), y = n, fill = manufacturer)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Number of Listings by Manufacturer",
    subtitle = "Top 15 most listed brands on Craigslist",
    x = NULL,
    y = "Number of Listings"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
The number of listings by brand is what I would expect. Ford and Chevrolet are the most popular American brands and are commonly found in America. Next are Honda and Toyota, both Japanese manufacturers with a huge consumer base here in the United States, known for their reliability and good design.
Let’s now compare Audi S7s listed on CarGurus with our listings on Craigslist.
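For a comparison like this, the listings from both sites need to sit in one data frame with a column saying where each row came from. The sketch below shows one way that could look. The rows are made up, and the object names craigslist_s7, cargurus_s7, and combined_toy are hypothetical; only the column names price, mileage, year, and site are taken from the plots that follow.

```r
library(dplyr)

# Made-up rows standing in for the two sources (not real listings)
craigslist_s7 <- tibble(
  price   = c(28000, 31000),
  mileage = c(72000, 65000),
  year    = c(2014, 2015)
)
cargurus_s7 <- tibble(
  price   = c(22000, 45000, 60000),
  mileage = c(98000, 40000, 15000),
  year    = c(2013, 2018, 2021)
)

# Label each source, then stack the rows into a single data frame
combined_toy <- bind_rows(
  craigslist_s7 %>% mutate(site = "Craigslist"),
  cargurus_s7 %>% mutate(site = "CarGurus")
)

print(count(combined_toy, site))
```

With the real data, the Craigslist odometer column would presumably have to be renamed to mileage first so that the two sources line up.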
Now that the two data frames are combined, let’s compare how many cars we have at each price.
combined %>%
  ggplot(aes(x = price, fill = site)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
  labs(
    title = "Price Distribution by Site",
    x = "Price",
    y = "Count"
  ) +
  theme_minimal(base_size = 13)
There are far fewer Audi S7s listed on Craigslist than on CarGurus, but we can see that the Craigslist cars sit in the middle of the price range.
Next, let’s compare how cars are priced by mileage on each site, again using ggplot to visualize the comparison.
combined %>%
  ggplot(aes(x = mileage, y = price, color = site)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Price vs Mileage by Site",
    x = "Mileage",
    y = "Price"
  ) +
  theme_minimal(base_size = 13)
`geom_smooth()` using formula = 'y ~ x'
The results show that price is fairly steady on Craigslist, with only a slight decrease as mileage increases, while on CarGurus the drop is much more dramatic.
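The difference between the two trend lines can also be expressed as a number by fitting a straight line per site and reading off the slope in dollars per mile. The sketch below uses made-up points that mimic the pattern described (a gentle slope for Craigslist, a steep one for CarGurus); it is not computed from the actual listings.

```r
# Made-up points: Craigslist prices fall slowly with mileage, CarGurus prices fall fast
toy <- data.frame(
  site    = rep(c("Craigslist", "CarGurus"), each = 4),
  mileage = rep(c(20000, 40000, 60000, 80000), times = 2),
  price   = c(32000, 31000, 30000, 29000,   # loses 0.05 dollars per mile
              60000, 50000, 40000, 30000)   # loses 0.50 dollars per mile
)

# Fit one straight line per site and keep the mileage slope (dollars per mile)
slopes <- sapply(split(toy, toy$site), function(d) {
  coef(lm(price ~ mileage, data = d))[["mileage"]]
})
print(slopes)
```

A slope closer to zero means mileage has less influence on the asking price for that site.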
Lastly, let’s look at price against model year to see how much depreciation there is on an Audi S7.
combined %>%
  ggplot(aes(x = year, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(breaks = seq(2013, 2025, by = 1)) +
  labs(
    title = "Price vs Year",
    x = "Year",
    y = "Price"
  ) +
  theme_minimal(base_size = 13)
`geom_smooth()` using formula = 'y ~ x'
We can see that in 10 years, a $70,000 Audi S7 will be worth about $20,000. That is a compound depreciation rate of about 12% per year. Typical annual depreciation is between 10% and 15%, which means an Audi S7 depreciates like a normal car.
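The 12% figure can be reproduced from the two approximate prices, assuming the car loses the same percentage of its remaining value each year (compound depreciation). The inputs below are the rounded values quoted above, so the result is only as precise as they are.

```r
start_value <- 70000  # approximate starting price quoted in the text
end_value   <- 20000  # approximate value after 10 years quoted in the text
years       <- 10

# The compound annual rate r solves: end_value = start_value * (1 - r)^years
annual_rate <- 1 - (end_value / start_value)^(1 / years)
print(round(annual_rate * 100, 1))
```

This comes out a little under 12%. A straight-line calculation, which divides the total loss evenly across the years, would instead give about 7% of the original price per year, so the 12% figure only holds under the compound assumption.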