1. Introduction.

New vehicles are typically sold at fixed and transparent prices, which renders purchasing decisions relatively straightforward. In contrast, the pricing of used vehicles is far less transparent, as it varies considerably with factors such as condition, usage history, and geographic location, thereby complicating informed decision-making. For instance, an identical vehicle may be offered at a substantially lower price in one state than in another, a discrepancy that may not be immediately apparent to prospective buyers.

To address this challenge, this project examines used car prices across the United States in order to identify underlying market trends, generate meaningful pricing insights, and determine the principal factors influencing vehicle valuation. The findings are presented through a series of visualizations designed to support prospective buyers in making more informed purchasing decisions.

2. Relevance of the project.

Analyzes real U.S. used cars data to understand market trends.
Examines how price, mileage, and car make influence vehicle value.
Uses visualizations to reveal pricing and distribution patterns.
Supports informed decision-making in the automotive market.

3. Methodology.

This section explains the steps taken to prepare and visualize the dataset. It is organized into four parts: dataset description, data cleaning, data transformation, and data visualization.

The dataset description outlines the structure of the data and its key variables.
Data cleaning addresses missing values and inconsistencies to improve accuracy.
Data transformation focuses on selecting and organizing the variables relevant to the analysis. Finally;
Data visualization uses charts—such as histograms, boxplots, and bar charts—to examine distributions and highlight patterns in key variables, including price, mileage, and vehicle make.

3.1. Data Set Information.

The dataset used in this analysis was sourced from Kaggle and comprises 426,880 observations, each described by the following attributes:

Make – The manufacturer of the vehicle.
Model – The specific model of the vehicle.
Year – The year the vehicle was manufactured.
Mileage – The total distance the vehicle has been driven.
Price – The listed selling price of the vehicle.
Title Status – The condition of the vehicle’s title (e.g., clean, salvage, rebuilt).
State – The state in which the vehicle is listed.
City – The city where the vehicle is located.
Fuel – The type of fuel the vehicle uses (e.g., gas, diesel).
Lat – The latitude coordinate indicating the vehicle’s geographic location.
Long – The longitude coordinate indicating the vehicle’s geographic location.

3.2. Loading data into R.

This project was conducted in R, where the dataset was systematically cleaned, transformed, and visualized to support analysis. The workflow focused on preparing the data for examination and generating visualizations to identify and interpret underlying patterns.

The dataset was imported from a CSV file and then subjected to structured preprocessing. This included renaming variables for clarity, assessing and handling missing values, removing duplicate records, and excluding irrelevant variables.

suppressMessages(library(dplyr))
suppressMessages(library(tidyverse))
suppressMessages(library(ggplot2))
suppressMessages(library(stringr))
suppressMessages(library(scales))
suppressMessages(library(maps))

## Warning: package 'maps' was built under R version 4.3.3

suppressMessages(library(gridExtra))

## Warning: package 'gridExtra' was built under R version 4.3.3

suppressMessages(library(GGally))
suppressMessages(library(plotly))

## Warning: package 'plotly' was built under R version 4.3.3

suppressMessages(library(leaflet))
suppressMessages(library(sf))

## Warning: package 'sf' was built under R version 4.3.3

#install.packages("tinytex")
#tinytex::install_tinytex()

Vehicles_df <- read.csv(file="C:/Users/Hill85/Desktop/Hill Docs/vehicles.csv")
head(Vehicles_df,3)

##           id
## 1 7222695916
## 2 7218891961
## 3 7221797935
##                                                                                      url
## 1        https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html
## 2 https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html
## 3        https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html
##         region                      region_url price year manufacturer model
## 1     prescott https://prescott.craigslist.org  6000   NA                   
## 2 fayetteville    https://fayar.craigslist.org 11900   NA                   
## 3 florida keys     https://keys.craigslist.org 21000   NA                   
##   condition cylinders fuel odometer title_status transmission VIN drive size
## 1                                NA                                         
## 2                                NA                                         
## 3                                NA                                         
##   type paint_color image_url description county state lat long posting_date
## 1                                            NA    az  NA   NA             
## 2                                            NA    ar  NA   NA             
## 3                                            NA    fl  NA   NA

nrow(Vehicles_df)

## [1] 426880

3.3. Data Exploration and Cleaning.

This crucial step entailed the following steps:

Data cleaning: The dataset was examined for missing values (NAs and empty strings), and irrelevant columns were removed to enhance data quality and ensure analytical reliability.
Filtering and removal: Rows containing missing values were removed, duplicate records were eliminated, and the dataset was restricted to vehicles manufactured between 2000 and 2025 to ensure relevance and consistency in the analysis.
Standardization and structure: Column names and text values were cleaned and standardized, and variables were reorganized for easier analysis.

3.3.1. Checking for Missing Values and Empty Strings in the Data.

This initial step involved identifying missing values (NA) and empty strings within the dataset to assess its overall completeness. Detecting these issues is important as it helps highlight data quality problems and guides the necessary steps for the data cleaning process.

#Obtain counts for missing values per column.
missing_summary <- data.frame(
  Column = names(Vehicles_df),
  N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
  Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
  Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)

missing_summary

##                    Column    N_A Blank_Strings Total_Missing
## id                     id      0             0             0
## url                   url      0             0             0
## region             region      0             0             0
## region_url     region_url      0             0             0
## price               price      0             0             0
## year                 year   1205            NA          1205
## manufacturer manufacturer      0         17646         17646
## model               model      0          5276          5276
## condition       condition      0        174104        174104
## cylinders       cylinders      0        177678        177678
## fuel                 fuel      0          3013          3013
## odometer         odometer   4400            NA          4400
## title_status title_status      0          8242          8242
## transmission transmission      0          2556          2556
## VIN                   VIN      0        161042        161042
## drive               drive      0        130567        130567
## size                 size      0        306361        306361
## type                 type      0         92858         92858
## paint_color   paint_color      0        130203        130203
## image_url       image_url      0            68            68
## description   description      0            69            69
## county             county 426880            NA        426880
## state               state      0             0             0
## lat                   lat   6549            NA          6549
## long                 long   6549            NA          6549
## posting_date posting_date      0            68            68

3.3.2. Removing Irrelevant Variables from the Data Set.

Irrelevant variables such as condition, VIN, drive type, and paint color were removed to improve data quality and focus on key analytical features. Non-essential fields like URLs, image links, identifiers, descriptions, and timestamps were also excluded as they were not relevant to the analysis.

#Removing Irrelevant Variables.
Vehicles_df <- Vehicles_df[, !(names(Vehicles_df) %in% c(
  "Condition", "cylinders", "VIN", "drive", "size",
  "type", "paint_color", "condition", "county", "url", "region_url", "id","image_url","description", "posting_date"
))]

#nrow(Vehicles_df)
#colSums(is.na(Vehicles_df))
head(Vehicles_df,3)

##         region price year manufacturer model fuel odometer title_status
## 1     prescott  6000   NA                               NA             
## 2 fayetteville 11900   NA                               NA             
## 3 florida keys 21000   NA                               NA             
##   transmission state lat long
## 1                 az  NA   NA
## 2                 ar  NA   NA
## 3                 fl  NA   NA

#Checking to see if there is any variable that still has NAs
Final_Missing_Summary <- data.frame(
  Column = names(Vehicles_df),
  N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
  Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
  Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)

#Final_Missing_Summary

3.3.3. Removing Records with Missing values.

Records with missing values or empty strings in key variables were removed to ensure complete and reliable data. This step focused on essential vehicle and geographic attributes to improve the quality and validity of the dataset for analysis.

#remove missing values/empty strings
Vehicles_df <- Vehicles_df %>%
  filter(
    !is.na(year) & year != "",
    !is.na(manufacturer) & manufacturer != "",
    !is.na(model) & model != "",
    !is.na(fuel) & fuel != "",
    !is.na(title_status) & title_status != "",
    !is.na(transmission) & transmission != "",
    !is.na(lat),
    !is.na(long),
    !is.na(odometer)
  )
nrow(Vehicles_df)

## [1] 383489

Final_Missing_Summary <- data.frame(
  Column = names(Vehicles_df),
  N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
  Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
  Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)

Final_Missing_Summary

##                    Column N_A Blank_Strings Total_Missing
## region             region   0             0             0
## price               price   0             0             0
## year                 year   0             0             0
## manufacturer manufacturer   0             0             0
## model               model   0             0             0
## fuel                 fuel   0             0             0
## odometer         odometer   0             0             0
## title_status title_status   0             0             0
## transmission transmission   0             0             0
## state               state   0             0             0
## lat                   lat   0             0             0
## long                 long   0             0             0

3.3.4. Renaming/Standardizing, and reordering Column Names.

The dataset was further reorganized by prioritizing key variables, such as Make, Model, Year, and Mileage, by positioning them at the beginning of the data frame to improve readability and facilitate interpretation, while the remaining variables were retained in their original order thereafter.

# Capitalize first letter
names(Vehicles_df) <- tools::toTitleCase(names(Vehicles_df))

# Renaming specific varaibles.
names(Vehicles_df) <- gsub("^Odometer$", "Mileage", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Title_status$", "TitleStatus", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Manufacturer$", "Make", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Region$", "City", names(Vehicles_df))
Vehicles_df$Price.1 <- NULL

#Reordering the column names in the data
Vehicles_df <- Vehicles_df %>%
  select(
    Make, Model, Year, Mileage, Price,
    TitleStatus, State, City,
    everything()
  )

#capitalizing the first letter of each word in every record.
Vehicles_df <- Vehicles_df %>%
  mutate(across(where(is.character), str_to_title))
#Capitalizing State Abbreviations.
Vehicles_df$State <- toupper(Vehicles_df$State)
head(Vehicles_df,3)

##        Make                    Model Year Mileage Price TitleStatus State
## 1       Gmc Sierra 1500 Crew Cab Slt 2014   57923 33590       Clean    AL
## 2 Chevrolet           Silverado 1500 2010   71229 22590       Clean    AL
## 3 Chevrolet      Silverado 1500 Crew 2020   19160 39590       Clean    AL
##     City Fuel Transmission   Lat   Long
## 1 Auburn  Gas        Other 32.59 -85.48
## 2 Auburn  Gas        Other 32.59 -85.48
## 3 Auburn  Gas        Other 32.59 -85.48

3.3.5. Removing Duplicated Records.

49,563 duplicate records were identified and removed from the dataset. This reduced the dataset to 333,926 unique observations, improving data quality and ensuring that each record represents a distinct vehicle listing.

#Establishing presence of duplicate records.
sum(duplicated(Vehicles_df))

## [1] 49563

#Dropping the duplicate values
Vehicles_df <- Vehicles_df[!duplicated(Vehicles_df), ]
nrow(Vehicles_df)

## [1] 333926

3.3.6 Retaining Relevant Vehicle Years.

The dataset was filtered to include only vehicles manufactured between 2000 and 2025, ensuring the analysis focused on modern and relevant observations. This further reduced the dataset to 315,000 records, resulting in a more focused and representative sample. The remaining observations were primarily concentrated in recent model years (2000–2022), which improved the relevance and consistency of the subsequent visual analysis.

#obtaining unique years in the dataset
sort(unique(Vehicles_df$Year))

##   [1] 1900 1905 1913 1915 1918 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930
##  [16] 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
##  [31] 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
##  [46] 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975
##  [61] 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
##  [76] 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
##  [91] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## [106] 2021 2022

table(Vehicles_df$Year)

## 
##  1900  1905  1913  1915  1918  1921  1922  1923  1924  1925  1926  1927  1928 
##     3     1     2     1     1     1     1    30     6     5     9    26    29 
##  1929  1930  1931  1932  1933  1934  1935  1936  1937  1938  1939  1940  1941 
##    38    53    26    46    17    22    14    20    41    14    32    47    36 
##  1942  1943  1944  1945  1946  1947  1948  1949  1950  1951  1952  1953  1954 
##     6     1     3     1    33    43    40    42    44    56    64    55    68 
##  1955  1956  1957  1958  1959  1960  1961  1962  1963  1964  1965  1966  1967 
##   135    98   118    49    55    67    49    98   160   196   273   270   230 
##  1968  1969  1970  1971  1972  1973  1974  1975  1976  1977  1978  1979  1980 
##   312   280   228   219   299   255   184   132   150   198   260   285   204 
##  1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993 
##   163   170   192   302   385   393   393   400   472   494   501   470   599 
##  1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006 
##   789  1004  1041  1400  1577  2400  2869  3574  4568  5827  7207  8332  9899 
##  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017  2018  2019 
## 11677 13348  9434 12333 15929 18230 23320 22787 23745 22783 28975 30541 21381 
##  2020  2021  2022 
## 16682  1526    33

#Filtering data to remain only with years 2000 through 2025
Vehicles_df <- Vehicles_df[Vehicles_df$Year >= 2000 & Vehicles_df$Year <= 2025, ]

#head(Vehicles_df, 5)
nrow(Vehicles_df)

## [1] 315000

3.4. Data Description/Distribution.

This subsection examines the statistical distributions of the variables within the dataset to describe their individual behavior and underlying relationships. Particular attention is given to the variability of these attributes and their association with the target variable, Price, which constitutes the primary outcome of interest in this analysis.

3.4.1. Yearly Counts of Used Cars.

2018 accounts for the highest proportion of vehicle listings in the dataset (9.70%), followed closely by 2017 (9.20%). Other notable years, including 2015 (7.54%), 2013 (7.40%), 2014 (7.23%), and 2016 (7.23%), reinforce a sustained concentration of observations within mid- to late-2010s vehicle models. Overall, the distribution is strongly skewed toward vehicles manufactured between 2011 and 2019, indicating that the dataset is predominantly composed of relatively recent model-year listings and, consequently, that newer vehicles are more heavily represented in the market structure.

This pattern is partly attributable to the post–recession recovery in the United States, which supported a gradual expansion in vehicle production and sustained consumer demand throughout the 2010s. In contrast, 2020 shows lower representation, reflecting disruptions associated with the COVID-19 pandemic, which significantly affected supply chains and temporarily reduced activity in the used vehicle market.

Car_count_per_year <- Vehicles_df %>%
  group_by(Year) %>%
  summarise(n = n()) %>%
  mutate(Freq = n/sum(n)*100) %>%
  arrange(desc(Freq)) %>%
  as.data.frame() %>%
  mutate(Freq = round(Freq, 2))

#head(Car_count_per_year,10)

ggplot(data=Car_count_per_year,aes(x=Year,y=Freq)) +
  geom_line(color="turquoise",linewidth=1) + 
   geom_point(size=1,color="blue") +
  labs(x="Year",y="Used Cars Sales(%)",
       title="Yearly Distribution of Used Car Sales Volumes.")+
  theme(plot.title =element_text(color="Black",size=12,
  face="bold",lineheight = 0.7),axis.text.x = element_text())

3.4.2. Distribution of Used Car Listings Across States.

California accounts for the largest proportion of listings (11.55%), followed by Florida (6.60%) and Texas (5.63%), indicating pronounced market concentration in high-population and economically active states. In contrast, states such as North Dakota (0.09%), Wyoming (0.15%), and Nebraska (0.24%) exhibit minimal representation, reflecting comparatively lower market activity in sparsely populated regions. Overall, the distribution underscores a strong geographic concentration of vehicle listings within major U.S. population centers.

# Top and bottom states with percentages
state_summary <- Vehicles_df %>% 
  group_by(State) %>% 
  summarize(Total_Sales = n()) %>% 
  mutate(Percentage = round(Total_Sales / sum(Total_Sales) * 100, 2))

top_states <- state_summary %>% 
  arrange(desc(Total_Sales)) %>% 
  slice(1:10)
#head(top_states,3)

bottom_states <- state_summary %>% 
  arrange(Total_Sales) %>% 
  slice(1:10)
#head(bottom_states,3)

# Top states plot
ggplot(top_states, aes(x = reorder(State, -Total_Sales), y = Total_Sales)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.3, size = 3) +
  ggtitle("States with the Most Used Car Sales") +
  xlab("State") +
  ylab("Number of Cars Sold") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Bottom states plot
ggplot(bottom_states, aes(x = reorder(State, Total_Sales), y = Total_Sales)) +
  geom_bar(stat = "identity", fill = "khaki2") +
  geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.3, size = 3) +
  ggtitle("States with the Least used Car Used Sales.") +
  xlab("State") +
  ylab("Number of Cars Sold") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

3.4.3. Variation in Title Status Across Vehicle Year of Manufacture.

Clear differences are observed in the distribution of vehicle manufacturing years across title status categories. Vehicles classified as Clean, Rebuilt, and Salvage are predominantly concentrated in more recent model years, largely within the 2010–2020 range, indicating a stronger representation of newer vehicles within these classifications. In contrast, Missing and Parts Only titles are associated with comparatively older model years, suggesting a tendency for these categories to be linked with aging or end-of-life vehicles. The Lien category exhibits a notably broader dispersion in manufacturing years, including some of the oldest observations in the dataset, which indicates substantial variability in vehicle age within this group.

#Variations in status of car titles.
ggplot(Vehicles_df, aes(x = TitleStatus, y = Year, fill = TitleStatus)) + 
  geom_boxplot() +
  labs(
    x = "Status of Car Titles",
    y = "Year",
    title = "Distribution of Car Title Status"
  ) +
  theme(
    plot.title = element_text(color = "black", size = 12, face = "bold", lineheight = 0.8),
    axis.text.x = element_text()
  )

3.4.4. Distribution of Car Makes.

Ford represents the largest share of listings at 16.9%, followed by Chevrolet (12.8%), Toyota (8.5%), and Honda (5.4%). Overall, the distribution of the top ten makes is concentrated among American and Japanese manufacturers, indicating a pronounced market preference for vehicles originating from these regions. This pattern is likely driven by their long-standing presence in the U.S. automotive market, extensive dealership networks, and high availability within the used vehicle segment.

# Distribution of Car Makes.
Car_Make_Count <- Vehicles_df %>%
  group_by(Make) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(Freq = n / sum(n) * 100) %>%
  arrange(desc(n)) %>%
  slice_head(n = 10)

ggplot(Car_Make_Count, aes(x = reorder(Make, n), y = n)) +
  geom_col(fill = "turquoise") +
  geom_text(aes(label = paste0(round(Freq, 1), "%")),
            vjust = -0.3, size = 3) +
  labs(x = "Car Make", y = "Number of Cars",
       title = "Top 10 Car Makes") +
  theme(plot.title = element_text(color = "black", size = 12, face = "bold"),
        axis.text.x = element_text(angle = 90, hjust = 1))

3.4.5. Average Prices of used Cars over the Last Ten Years.

Average used car prices peaked between 2018 and 2020, reflecting strong market demand and tightening supply conditions during this period, with 2020 particularly influenced by disruptions from the COVID-19 pandemic. In contrast, prices decline noticeably by 2022, likely due to market stabilization and normalization of supply chains following the pandemic-related disruption.

avg_prices_by_year <- Vehicles_df %>%
  filter(Year %in% sort(unique(Year), decreasing = TRUE)[1:10]) %>%
  group_by(Year) %>%
  summarise(avg_price = mean(Price, na.rm = TRUE)) %>%
  arrange(desc(Year))

ggplot(avg_prices_by_year, aes(x = reorder(Year, -avg_price), y = avg_price)) +
  geom_bar(stat = "identity", fill = "darkgray", color = "black") +
  labs(
    title = "Average Used Car Prices over the Last 10 Years",
    x = "Year",
    y = "Average Price"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

3.4.6. Median Prices of Different Car Models.

Ferrari and Aston Martin occupy the highest tier of median pricing, consistent with their position in the ultra-luxury, low-volume segment of the automotive market. Tesla, Jaguar, and Porsche follow, reflecting strong demand and high resale values associated with premium and performance-oriented vehicles. The price of Ram is comparatively lower, largely because its model range includes both commercial and consumer pickup trucks, which introduces greater variation and lowers the overall median price.

#Median Prices of Different Car Models.
median_car_prices <- Vehicles_df %>%
  group_by(Make) %>%
  summarise(median_car_price = median(Price, na.rm = TRUE)) %>%
  mutate(Make = fct_reorder(Make, median_car_price)) %>%
  arrange(desc(median_car_price))

ggplot(median_car_prices, aes(x = Make, y = median_car_price)) +
  geom_col(fill = "darkcyan") +
  labs(x = "Make", y = "Median Car Price",
       title = "Median Used Car Prices by Make") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

3.4.7. Distribution of Used cars mileages.

The distribution of vehicle mileage is positively skewed, indicating that a large proportion of vehicles fall within the higher mileage range, while comparatively fewer observations are concentrated at lower mileage levels. The mean mileage of approximately 91,642 reflects this asymmetry, as it is elevated by the presence of high-mileage vehicles within the dataset.

options(scipen = 999)

#omit na
Vehicles_clean <- Vehicles_df %>%
  filter(!is.na(Mileage), Mileage > 0)

avg_mileage <- mean(Vehicles_clean$Mileage)

ggplot(Vehicles_clean, aes(x = Mileage)) + 
  geom_histogram(color = "firebrick4", fill = "cornflowerblue", bins = 160) +
  
  geom_vline(xintercept = avg_mileage, color = "yellow", linetype = "dashed", linewidth = 1) +
  
  annotate(
    "text",
    x = avg_mileage,
    y = Inf,
    label = paste("Average:", round(avg_mileage, 0)),
    vjust = 2,
    hjust = -0.3,
    color = "black"
  ) +
  
  labs(
    x = "Car Mileage",
    y = "Number of Cars",
    title = "Distribution of Used Car Mileages"
  ) +
  
  scale_x_log10(labels = scales::comma)

3.4.8. Distribution of Used cars per State.

# Filter years
Vehicles_df <- Vehicles_df %>%
  filter(Year >= 2000 & Year <= 2025)

# Count vehicles per state.
state_counts <- Vehicles_df %>%
  group_by(State) %>%
  summarise(count = n(), .groups = "drop")

# Convert state abbreviations to full names (required for map)
state_counts$region <- tolower(state.name[match(state_counts$State, state.abb)])

# Get US map
us_map <- map_data("state")

# Merge map with car data
map_data_full <- left_join(us_map, state_counts, by = "region")

# Plotting the distribution.
ggplot(map_data_full, aes(x = long, y = lat, group = group, fill = count)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "gray90") +
  labs(title = "Distribution of used Cars Per State",
       fill = "Vehicle Count") +
  theme_minimal()

3.4.9. Distribution of Median prices of Used cars per State.

Comparatively lower median prices are concentrated in the Midwestern United States. This pattern can be partially attributed to the nature of the markets, including lower state-level taxes and registration fees, as well as less stringent or absent vehicle inspection requirements in several states. These regulatory conditions allow older and higher-mileage vehicles to remain in circulation for longer periods, thereby increasing the supply of lower-priced vehicles in the market.

In addition, regional market dynamics further contribute to reduced price levels. A relatively high density of dealerships, combined with lower population density in certain areas, results in a higher supply of vehicles relative to local demand, placing downward pressure on median prices.

options(scipen = 999)

Vehicles_df <- Vehicles_df %>%
  filter(Year >= 2000 & Year <= 2025) %>%
  
  # ensuring that price is numeric.
  mutate(Price = as.numeric(Price)) %>%
  
  # removing missing + unrealistic values
  filter(!is.na(Price),
         Price > 500,     # remove junk price values like 0 or 1 in data
         Price < 100000)  # removing extreme outliers

# median prices
state_price <- Vehicles_df %>%
  group_by(State) %>%
  summarise(avg_price = median(Price), .groups = "drop")

state_price$region <- tolower(state.name[match(state_price$State, state.abb)])

us_map <- map_data("state")

map_data_full <- left_join(us_map, state_price, by = "region")

ggplot(map_data_full, aes(x = long, y = lat, group = group, fill = avg_price)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_gradient(
    low = "lightyellow",
    high = "red",
    na.value = "gray90"
  ) +
  labs(title = "Median Prices of Used Cars by State",
       fill = "Price ($)") +
  theme_minimal()

#Interactive map showing Median prices in specific states.
# summary of your car dataset by state
state_price <- Vehicles_df %>%
  filter(Year >= 2000 & Year <= 2025) %>%
  group_by(State) %>%
  summarise(
    median_price = median(Price, na.rm = TRUE),
    count = n(),
    .groups = "drop"
  )

# Plotting interactive Choropleth map showing median car prices per state
Choropleth_Price_map <- plot_geo(state_price) %>%
  add_trace(
    locations = ~State,
    locationmode = "USA-states",
    z = ~median_price,
    text = ~paste(
      "State:", State,
      "<br>Median Price: $", round(median_price, 0),
      "<br>Count:", count
    ),
    hoverinfo = "text"
  ) %>%
  layout(
    title = "Median Prices of Used Cars by State",
    geo = list(
      scope = "usa",
      projection = list(type = "albers usa")
    )
  )

Choropleth_Price_map

4. Conclusion.

Newer used cars (especially those from 2011–2019) tend to have higher prices, indicating that buyers generally pay more for more recent vehicles. Furthermore, temporal trends show that prices peaked between 2018 and 2020, likely reflecting broader market disruptions during the COVID period, including supply constraints and shifts in demand.

Vehicle prices also vary by geographical location. Midwestern states show lower median prices compared to higher-priced markets such as California, Florida, and Texas, suggesting that regional demand and economic conditions influence pricing differences.

Overall, used car prices are shaped by vehicle age, regional differences, and time-related market conditions, particularly the disruptions observed between 2018 and 2020.

5. Limitations of the Study.

This study had several limitations including:

The dataset covers the period 2000–2022 and may not reflect more recent changes in the used car market.
Some states were underrepresented, which introduced geographic bias.
The analysis was primarily descriptive and did not include predictive modeling, limiting its ability to estimate more reliable vehicle prices.
Finally, the large size of the dataset limited the development of an interactive visualization that would have combined vehicle model, geographic location, and price due to computational and implementation constraints.

Visualization of Used Car Prices in the United States

Hillary Kemei