The dataset that I am working with was scraped from Craigslist by Austin Reese. It includes variables such as the date of the vehicle listing, price, region, year, model, make, and many more. I would like to explore how price relates to the age of the vehicle and how that changes as a vehicle goes from relatively new to extremely old. I am also interested in exploring the relationship between price, mileage, and the number of listings. I believe that would reveal how mileage impacts price, as well as how common it is for a car with a particular mileage to be listed on Craigslist. I will also explore how these variables relate to the transmission type and the type of fuel used. I chose this dataset because every car that I have owned has been a used car, and I unfortunately tend to have bad luck with cars and end up having to get another one. I was curious about the used car market on Craigslist and what can be learned from the scraped data, especially because I have no personal experience using Craigslist.
# Here I am calling in the packages I need.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl (6): id, price, year, odometer, lat, long
lgl (1): county
dttm (1): posting_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I pulled the summary to get a general idea of the data that I am working with.
summary(vehicles)
id url region region_url
Min. :7.207e+09 Length:426880 Length:426880 Length:426880
1st Qu.:7.308e+09 Class :character Class :character Class :character
Median :7.313e+09 Mode :character Mode :character Mode :character
Mean :7.311e+09
3rd Qu.:7.315e+09
Max. :7.317e+09
price year manufacturer model
Min. :0.000e+00 Min. :1900 Length:426880 Length:426880
1st Qu.:5.900e+03 1st Qu.:2008 Class :character Class :character
Median :1.395e+04 Median :2013 Mode :character Mode :character
Mean :7.520e+04 Mean :2011
3rd Qu.:2.649e+04 3rd Qu.:2017
Max. :3.737e+09 Max. :2022
NA's :1205
condition cylinders fuel odometer
Length:426880 Length:426880 Length:426880 Min. : 0
Class :character Class :character Class :character 1st Qu.: 37704
Mode :character Mode :character Mode :character Median : 85548
Mean : 98043
3rd Qu.: 133542
Max. :10000000
NA's :4400
title_status transmission VIN drive
Length:426880 Length:426880 Length:426880 Length:426880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
size type paint_color image_url
Length:426880 Length:426880 Length:426880 Length:426880
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
description county state lat
Length:426880 Mode:logical Length:426880 Min. :-84.12
Class :character NA's:426880 Class :character 1st Qu.: 34.60
Mode :character Mode :character Median : 39.15
Mean : 38.49
3rd Qu.: 42.40
Max. : 82.39
NA's :6549
long posting_date
Min. :-159.83 Min. :2021-04-04 07:00:25.00
1st Qu.:-111.94 1st Qu.:2021-04-17 10:47:46.75
Median : -88.43 Median :2021-04-26 01:08:31.50
Mean : -94.75 Mean :2021-04-24 00:23:42.98
3rd Qu.: -80.83 3rd Qu.:2021-05-01 13:32:26.25
Max. : 173.89 Max. :2021-05-05 04:24:09.00
NA's :6549 NA's :68
# This is where I clean my data and create another variable so that the age of each vehicle can be included. I wanted not only to remove NA values but also to refine my dataset so that I am not working with extreme outliers that may skew the data in a way that is not helpful.
vehicles1 <- vehicles %>%
  filter(
    !is.na(price), price > 50, price < 400000,
    !is.na(odometer), odometer > 0, odometer < 700000,
    !is.na(year), year >= 1960, year <= 2025,
    !is.na(condition), !is.na(fuel),
    !is.na(transmission), !is.na(type)
  ) %>%
  mutate(age = 2025 - year)
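To see which cleaning rule does the most work, it can help to count how many rows each one removes before combining them. The sketch below applies the same price, odometer, and year thresholds as the cleaning step above to a small made-up data frame; the `toy` values and the `keep_*` names are hypothetical and are not taken from the Craigslist data.

```r
# Hypothetical toy listings; the values are invented for illustration only.
toy <- data.frame(
  price    = c(0, 4500, 12000, 3736928711, NA, 25000, 18000),
  odometer = c(120000, 95000, 85000, 60000, 40000, 9999999, 30000),
  year     = c(2012, 2015, 1955, 2018, 2020, 2019, 2017)
)

# Same thresholds as the cleaning step, written as one logical vector per rule.
keep_price    <- !is.na(toy$price) & toy$price > 50 & toy$price < 400000
keep_odometer <- !is.na(toy$odometer) & toy$odometer > 0 & toy$odometer < 700000
keep_year     <- !is.na(toy$year) & toy$year >= 1960 & toy$year <= 2025

# Number of rows that each rule would drop on its own.
removed <- c(
  price    = sum(!keep_price),
  odometer = sum(!keep_odometer),
  year     = sum(!keep_year)
)
print(removed)

# Rows that survive all three rules together.
toy_clean <- toy[keep_price & keep_odometer & keep_year, ]
print(nrow(toy_clean))
```

Running the same counts on the real `vehicles` data would show whether missing values or the outlier limits account for most of the dropped rows.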
# Here I am creating a model for how price relates to age.
agemod <- lm(price ~ age, data = vehicles1)
summary(agemod)
Call:
lm(formula = price ~ age, data = vehicles1)
Residuals:
Min 1Q Median 3Q Max
-26602 -8739 -2995 6810 324556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29302.201 51.925 564.3 <2e-16 ***
age -771.913 3.301 -233.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12220 on 206768 degrees of freedom
Multiple R-squared: 0.2092, Adjusted R-squared: 0.2092
F-statistic: 5.47e+04 on 1 and 206768 DF, p-value: < 2.2e-16
This model shows that older vehicles tend to be cheaper than newer ones: each additional year of age is associated with a listing price about $772 lower on average. The p-value, which is less than 2.2e-16, indicates that the relationship between age and price is statistically significant. The R-squared value is 0.2092, which means that age alone explains approximately 21 percent of the variation in price. Age is clearly not the only factor in price, but it is still a substantial one.
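The slope and R-squared discussed above can also be pulled out of a fitted model directly instead of being read off the printed summary. The following is a minimal sketch on simulated data (not the Craigslist listings); the numbers used to generate `age` and `price` are assumptions chosen only to resemble the scale of the real model.

```r
set.seed(1)

# Simulated stand-in data; these values are assumptions, not the real listings.
age   <- runif(500, min = 0, max = 30)
price <- 29000 - 770 * age + rnorm(500, sd = 12000)

fit <- lm(price ~ age)

# Slope: the average change in price for one additional year of age.
slope <- coef(fit)[["age"]]

# R-squared: the share of the variation in price explained by age.
r2 <- summary(fit)$r.squared

print(slope)
print(r2)

# With a single predictor, R-squared equals the squared correlation.
print(all.equal(r2, cor(age, price)^2))
```

The last line is a useful reminder of what R-squared measures: it describes how much of the spread in price lines up with age, not how much of any single car's price is "caused" by its age.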
# This is where I use ggplot to create my linear regression plot.
ggplot(vehicles1, aes(x = age, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "darkred") +
  labs(
    title = "Price Compared to Vehicle Age",
    x = "Vehicle Age (Years)",
    y = "Price in US Dollars",
    caption = "Source: Craigslist (scraped by Austin Reese)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
# I created a density plot to be able to display a large amount of data by mileage and price.
ggplot(vehicles1, aes(x = odometer, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c(option = "inferno") +
  labs(
    title = "Density of Vehicle Listings by Mileage and Price",
    x = "Vehicle Mileage",
    y = "Price (US Dollars)",
    fill = "Count"
  ) +
  theme_light()
# This is where I am able to use plotly to create an interactive bubble chart using a sample of the data due to how large the dataset is.
vehicles_sample <- vehicles %>%
  filter(
    !is.na(price), price > 50, price < 100000,
    !is.na(odometer), odometer < 300000,
    !is.na(year), year >= 1990
  ) %>%
  mutate(age = 2025 - year) %>%
  slice_sample(n = 1000)

plot_ly(
  vehicles_sample,
  x = ~odometer,
  y = ~price,
  color = ~fuel,
  size = ~age,
  text = ~paste("Model:", model, "<br>Fuel:", fuel, "<br>Price: $", price),
  type = "scatter",
  mode = "markers",
  marker = list(opacity = 0.5)
) %>%
  layout(title = "Price vs. Odometer by Fuel Type")
Warning: `line.width` does not currently support multiple values.
Source 2: ChatGPT, “How to use sampling for a bubble chart in RStudio.”
Source 3: Referenced a project submission for project 2.
# I created a distribution using ggridges, which I had never done before.
library(ggridges)  # provides geom_density_ridges() and theme_ridges()
vehicles %>%
  filter(!is.na(price), price < 100000, !is.na(transmission)) %>%
  ggplot(aes(x = price, y = transmission, fill = transmission)) +
  geom_density_ridges(scale = 1.2, alpha = 0.7) +
  labs(
    title = "Price Distributions by Transmission Type",
    x = "Price",
    y = "Transmission Type"
  ) +
  theme_ridges() +
  theme(legend.position = "none")
Picking joint bandwidth of 1080
Source 4: ChatGPT, “How to use ggridges to create a distribution in RStudio.”
The topic of the dataset that I am working with is used cars, specifically used cars listed on Craigslist. The data was scraped from Craigslist by Austin Reese. The variables that I included are vehicle age, odometer (mileage), transmission type, price, and fuel type. Transmission and fuel type are categorical, while all the other variables I worked with are numerical. I cleaned the data by removing NA values from the columns I needed and by filtering out extreme prices, mileages, and years. I do not believe this should skew the data, given how many listings were scraped in total. I chose this topic because I personally have a decent amount of experience with used cars. All of my cars have been used, and both of my parents have also purchased many used cars over the years. The same goes for my siblings, and I thought it would be interesting to see whether the patterns my family and I have noticed are consistent with the data scraped from Craigslist.
There are so many factors that play a role in the price of a used car, as my plots show, and it is important for people to be aware of those factors for many reasons. Not only is the car market a huge indicator of the state of the economy, but also many people opt for used cars and should be able to make informed decisions when purchasing used cars. In the article “Why Used Car Prices Are Rising in 2025”, there is a detailed explanation regarding the recent spike in the cost of used cars due to tariffs, as well as references to previous economic states and how used car prices changed. In the article it says “Used car buyers have found themselves in the eye of a perfect storm created by a post-pandemic economy and a brewing trade war. Used car prices were declining in recent months because dealers were sitting on older inventory, which often gets less desirable over time for the average consumer” (Singh). Unfortunately, as tariffs kick in, the used car market becomes more expensive when it is supposed to be an affordable option. This is particularly unfortunate because the economy was just starting to recover from the impact of the pandemic which made used cars extremely expensive.
My first visualization is a linear model that represents the relationship between price and vehicle age. My second visualization is a density plot that shows the number of vehicle listings by mileage and price. My third visualization is an interactive bubble plot that shows the relationship between mileage and price by fuel type. My final visualization is a distribution that shows the relationship between price and the type of transmission. I found it interesting that in my second visualization there seems to be a small spike in prices for older cars, which makes me think of cars that are so old that they are seen as stylish and become desirable to own. I noticed the same pattern in the first visualization, where very old cars seem to become a little pricier than moderately old cars. The one thing I wish I had included is a plot showing whether cars sold during the height of the pandemic were more expensive, because I believe there could have been interesting findings.
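A straight line cannot bend back upward, so a pattern where very old cars become pricier than moderately old ones will not show up in a simple linear model. One common way to check for it is to add a squared age term. The sketch below uses simulated data with an assumed U-shape built in; the coefficients and noise level are invented for illustration and say nothing about the real Craigslist prices.

```r
set.seed(42)

# Simulated data with an assumed U-shaped relationship between age and price.
age   <- runif(800, min = 0, max = 60)
price <- 30000 - 1200 * age + 18 * age^2 + rnorm(800, sd = 4000)

# Straight-line fit versus a fit that allows curvature.
linear_fit <- lm(price ~ age)
quad_fit   <- lm(price ~ age + I(age^2))

print(summary(linear_fit)$r.squared)
print(summary(quad_fit)$r.squared)

# Age at which the fitted curve reaches its lowest price: -b1 / (2 * b2).
b <- coef(quad_fit)
turning_age <- -b[["age"]] / (2 * b[["I(age^2)"]])
print(turning_age)
```

If the same comparison on the cleaned listings gave a positive coefficient on the squared term and a noticeably higher R-squared, that would support the idea that prices stop falling and start rising again for the oldest vehicles.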
Works Cited
Singh, Charles. “Why Used Car Prices Are Rising in 2025: What Every Buyer Needs to Know.” USA Today, Gannett Satellite Information Network, 14 Apr. 2025, www.usatoday.com/story/money/2025/04/11/used-car-prices-are-rising-2025/83050309007/.
“Histograms and Density Plots in R.” DataCamp, www.datacamp.com/doc/r/histograms-and-density. Accessed 12 May 2025.
ChatGPT, “How to use sampling for a bubble chart in RStudio.”
ChatGPT, “How to use ggridges to create a distribution in RStudio.”