Project 3

Author

Marie-Anne Kemajou

The dataset that I am working with was scraped from Craigslist by Austin Reese. It includes variables such as the date of the vehicle listing, price, region, year, model, make, and many more. I would like to explore how price relates to the age of the vehicle and how that changes as a vehicle goes from relatively new to extremely old. I am also interested in exploring the relationship between price, mileage, and the number of listings. I believe that would reveal how mileage impacts price, as well as how common it is for a car with a particular mileage to be listed on Craigslist. I will also explore how these variables relate to the transmission type and the type of fuel used. I chose this dataset because every car that I have owned has been a used car, and I unfortunately tend to have bad luck with cars and end up having to get another one. I was curious about the used car market on Craigslist and what can be learned from the scraped data, especially because I have no personal experience using Craigslist.

# Here I am calling in the packages I need. 
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(ggridges)
vehicles <- read_csv("/Users/marieannekemajou/Documents/Data 110/vehicles.csv")
Rows: 426880 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (18): url, region, region_url, manufacturer, model, condition, cylinder...
dbl   (6): id, price, year, odometer, lat, long
lgl   (1): county
dttm  (1): posting_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# I pulled the summary to get a general idea of the data that I am working with. 
summary(vehicles)
       id                url               region           region_url       
 Min.   :7.207e+09   Length:426880      Length:426880      Length:426880     
 1st Qu.:7.308e+09   Class :character   Class :character   Class :character  
 Median :7.313e+09   Mode  :character   Mode  :character   Mode  :character  
 Mean   :7.311e+09                                                           
 3rd Qu.:7.315e+09                                                           
 Max.   :7.317e+09                                                           
                                                                             
     price                year      manufacturer          model          
 Min.   :0.000e+00   Min.   :1900   Length:426880      Length:426880     
 1st Qu.:5.900e+03   1st Qu.:2008   Class :character   Class :character  
 Median :1.395e+04   Median :2013   Mode  :character   Mode  :character  
 Mean   :7.520e+04   Mean   :2011                                        
 3rd Qu.:2.649e+04   3rd Qu.:2017                                        
 Max.   :3.737e+09   Max.   :2022                                        
                     NA's   :1205                                        
  condition          cylinders             fuel              odometer       
 Length:426880      Length:426880      Length:426880      Min.   :       0  
 Class :character   Class :character   Class :character   1st Qu.:   37704  
 Mode  :character   Mode  :character   Mode  :character   Median :   85548  
                                                          Mean   :   98043  
                                                          3rd Qu.:  133542  
                                                          Max.   :10000000  
                                                          NA's   :4400      
 title_status       transmission           VIN               drive          
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
     size               type           paint_color         image_url        
 Length:426880      Length:426880      Length:426880      Length:426880     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 description         county           state                lat        
 Length:426880      Mode:logical   Length:426880      Min.   :-84.12  
 Class :character   NA's:426880    Class :character   1st Qu.: 34.60  
 Mode  :character                  Mode  :character   Median : 39.15  
                                                      Mean   : 38.49  
                                                      3rd Qu.: 42.40  
                                                      Max.   : 82.39  
                                                      NA's   :6549    
      long          posting_date                   
 Min.   :-159.83   Min.   :2021-04-04 07:00:25.00  
 1st Qu.:-111.94   1st Qu.:2021-04-17 10:47:46.75  
 Median : -88.43   Median :2021-04-26 01:08:31.50  
 Mean   : -94.75   Mean   :2021-04-24 00:23:42.98  
 3rd Qu.: -80.83   3rd Qu.:2021-05-01 13:32:26.25  
 Max.   : 173.89   Max.   :2021-05-05 04:24:09.00  
 NA's   :6549      NA's   :68                      
# This is where I clean out my data and also create another variable so that the age of vehicles can be included. I wanted to not only remove NA values but to also be able to refine my dataset so that I am not working with extreme outliers that may skew the data in a way that is not helpful. 
vehicles1 <- vehicles %>%
  filter(!is.na(price), price > 50, price < 400000,
         !is.na(odometer), odometer > 0, odometer < 700000,
         !is.na(year), year >= 1960, year <= 2025,
         !is.na(condition), !is.na(fuel),
         !is.na(transmission), !is.na(type)) %>%
  mutate(age = 2025 - year)
# Here I am creating a model for how price relates to age.
agemod <- lm(price ~ age, data = vehicles1)
summary(agemod)

Call:
lm(formula = price ~ age, data = vehicles1)

Residuals:
   Min     1Q Median     3Q    Max 
-26602  -8739  -2995   6810 324556 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29302.201     51.925   564.3   <2e-16 ***
age          -771.913      3.301  -233.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12220 on 206768 degrees of freedom
Multiple R-squared:  0.2092,    Adjusted R-squared:  0.2092 
F-statistic: 5.47e+04 on 1 and 206768 DF,  p-value: < 2.2e-16

This model shows that older vehicles tend to be cheaper than newer vehicles: each additional year of age is associated with a price about $772 lower on average. The p-value, which is less than 2.2e-16, indicates that the relationship between age and price is statistically significant. The R-squared value is 0.2092, which means that approximately 21 percent of the variation in price is explained by the vehicle’s age. Although age is not the only factor in price, it is still a substantial one.
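To make the fitted line concrete, the coefficients printed in the model summary above can be plugged in by hand. This is only a sketch: the intercept and slope are copied from that output, while the helper name `predict_price` and the example ages are illustrative assumptions that are not part of the original analysis.

```r
# Coefficients copied from the summary(agemod) output above.
intercept <- 29302.201
slope <- -771.913

# Hypothetical helper: predicted price (USD) for a vehicle of a given age.
predict_price <- function(age) intercept + slope * age

# Example ages, chosen for illustration only.
ages <- c(5, 10, 20)
predicted <- predict_price(ages)
print(round(predicted))

# Age at which the straight line reaches a predicted price of $0.
zero_age <- -intercept / slope
print(round(zero_age, 1))
```

The line reaches $0 at just under 38 years of age. Since the cleaned data keeps vehicles from 1960 onward (ages up to 65), a straight line predicts negative prices for the oldest cars, which is one reason to read this model as a rough summary rather than a pricing rule.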

# This is where I use ggplot to create my linear regression plot.
ggplot(vehicles1, aes(x = age, y = price)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", color = "darkred") +
  labs(title = "Price Compared to Vehicle Age",
       x = "Vehicle Age (Years)",
       y = "Price (US Dollars)",
       caption = "Source: Craigslist (scraped by Austin Reese)") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# I created a hexbin density plot to display a large number of listings by mileage and price. 
ggplot(vehicles1, aes(x = odometer, y = price)) +
  geom_hex(bins = 50) +
  scale_fill_viridis_c(option = "inferno") +
  labs(title = "Density of Vehicle Listings by Mileage and Price",
       x = "Vehicle Mileage", y = "Price (US Dollars)", fill = "Count") +
  theme_light()

Source 1: DataCamp https://www.datacamp.com/doc/r/histograms-and-density

# This is where I am able to use plotly to create an interactive bubble chart using a sample of the data due to how large the dataset is. 
vehicles_sample <- vehicles %>%
  filter(!is.na(price), price > 50, price < 100000,
         !is.na(odometer), odometer < 300000,
         !is.na(year), year >= 1990) %>%
  mutate(age = 2025 - year) %>%
  slice_sample(n = 1000)

plot_ly(vehicles_sample,
        x = ~odometer,
        y = ~price,
        color = ~fuel,
        size = ~age,
        text = ~paste("Model:", model, "<br>Fuel:", fuel, "<br>Price: $", price),
        type = "scatter",
        mode = "markers",
        marker = list(opacity = 0.5)) %>%
  layout(title = "Price vs. Odometer by Fuel Type")
Warning: `line.width` does not currently support multiple values.

Source 2: ChatGPT “How to use sampling for a bubble chart in RStudio.”

Source 3: Referenced a project submission for project 2.
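Because the bubble chart is drawn from a random sample, it will show different points each time the document is rendered unless a seed is set. The sketch below illustrates the idea with base R on a made-up data frame; the toy values and the seed number are assumptions for illustration only. `slice_sample()` draws from the same random number generator, so calling `set.seed()` before it should have the same effect.

```r
# Toy data frame standing in for the listings (values are made up).
toy <- data.frame(
  price = c(4500, 12000, 8000, 23000, 15500, 9900),
  odometer = c(150000, 60000, 98000, 20000, 45000, 120000)
)

# Setting the seed before each draw makes the random sample repeatable.
set.seed(110)  # arbitrary seed, chosen only for this illustration
sample_a <- toy[sample(nrow(toy), 3), ]

set.seed(110)
sample_b <- toy[sample(nrow(toy), 3), ]

# Prints TRUE: both draws selected the same rows.
print(identical(sample_a, sample_b))
```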

# I created a distribution using ggridges which I had never done before. 
vehicles %>%
  filter(!is.na(price), price < 100000,
         !is.na(transmission)) %>%
  ggplot(aes(x = price, y = transmission, fill = transmission)) +
  geom_density_ridges(scale = 1.2, alpha = 0.7) +
  labs(title = "Price Distributions by Transmission Type",
       x = "Price", y = "Transmission Type") +
  theme_ridges() +
  theme(legend.position = "none")
Picking joint bandwidth of 1080

Source 4: ChatGPT “How to use ggridges to create a distribution in RStudio.”

The topic of the dataset that I am working with is used cars, specifically used cars listed on Craigslist. The data was scraped from Craigslist by Austin Reese. The variables that I included are vehicle age, odometer reading (mileage), transmission type, price, and fuel type. Transmission and fuel type are categorical, while the other variables I worked with are numerical. I cleaned the data by removing NA values from the columns I needed and by filtering out extreme prices, mileages, and model years; given how large the scraped dataset is, I do not believe this should skew the results. I chose this topic because I have a decent amount of personal experience with used cars. All of my cars have been used, and both of my parents have purchased many used cars over the years. The same goes for my siblings, and I thought it would be interesting to see whether the patterns my family and I have noticed are consistent with the data scraped from Craigslist.

There are so many factors that play a role in the price of a used car, as my plots show, and it is important for people to be aware of those factors for many reasons. Not only is the car market a huge indicator of the state of the economy, but also many people opt for used cars and should be able to make informed decisions when purchasing used cars. In the article “Why Used Car Prices Are Rising in 2025”, there is a detailed explanation regarding the recent spike in the cost of used cars due to tariffs, as well as references to previous economic states and how used car prices changed. In the article it says “Used car buyers have found themselves in the eye of a perfect storm created by a post-pandemic economy and a brewing trade war. Used car prices were declining in recent months because dealers were sitting on older inventory, which often gets less desirable over time for the average consumer” (Singh). Unfortunately, as tariffs kick in, the used car market becomes more expensive when it is supposed to be an affordable option. This is particularly unfortunate because the economy was just starting to recover from the impact of the pandemic which made used cars extremely expensive.

My first visualization is a scatter plot with a fitted linear regression line that represents the relationship between price and vehicle age. My second visualization is a hexbin density plot that shows the number of vehicle listings by mileage and price. My third visualization is an interactive bubble plot that shows the relationship between mileage and price by fuel type. My final visualization is a ridgeline plot that shows the distribution of price for each transmission type. I found it interesting that in my second visualization there seems to be a small spike in prices for older cars, which makes me think of cars that are so old that they are seen as stylish and become desirable to own. I noticed the same pattern in the first visualization, where very old cars seem to be a little pricier than moderately old cars. The only thing that I wish I had included is a plot that could show whether the cars sold during the height of the pandemic were more expensive, because I believe there could have been interesting findings.

Works Cited

Singh, Charles. “Why Used Car Prices Are Rising in 2025: What Every Buyer Needs to Know.” USA Today, Gannett Satellite Information Network, 14 Apr. 2025, www.usatoday.com/story/money/2025/04/11/used-car-prices-are-rising-2025/83050309007/. 

“Histograms and Density Plots in R.” DataCamp, www.datacamp.com/doc/r/histograms-and-density. Accessed 12 May 2025. 

ChatGPT “How to use sampling for a bubble chart in RStudio.”

ChatGPT “How to use ggridges to create a distribution in RStudio.”