New vehicles are typically sold at fixed and transparent prices, which renders purchasing decisions relatively straightforward. In contrast, the pricing of used vehicles is far less transparent, as it varies considerably with factors such as condition, usage history, and geographic location, thereby complicating informed decision-making. For instance, an identical vehicle may be offered at a substantially lower price in one state than in another, a discrepancy that may not be immediately apparent to prospective buyers.
To address this challenge, this project examines used car prices across the United States in order to identify underlying market trends, generate meaningful pricing insights, and determine the principal factors influencing vehicle valuation. The findings are presented through a series of visualizations designed to support prospective buyers in making more informed purchasing decisions.
This section explains the steps taken to prepare and visualize the dataset. It is organized into four parts: dataset description, data cleaning, data transformation, and data visualization.
The dataset used in this analysis was sourced from Kaggle and comprises 426,880 observations, each described by the following attributes:
This project was conducted in R, where the dataset was systematically cleaned, transformed, and visualized to support analysis. The workflow focused on preparing the data for examination and generating visualizations to identify and interpret underlying patterns.
The dataset was imported from a CSV file and then subjected to structured preprocessing. This included renaming variables for clarity, assessing and handling missing values, removing duplicate records, and excluding irrelevant variables.
suppressMessages(library(dplyr))
suppressMessages(library(tidyverse))
suppressMessages(library(ggplot2))
suppressMessages(library(stringr))
suppressMessages(library(scales))
suppressMessages(library(maps))
## Warning: package 'maps' was built under R version 4.3.3
suppressMessages(library(gridExtra))
## Warning: package 'gridExtra' was built under R version 4.3.3
suppressMessages(library(GGally))
suppressMessages(library(plotly))
## Warning: package 'plotly' was built under R version 4.3.3
suppressMessages(library(leaflet))
suppressMessages(library(sf))
## Warning: package 'sf' was built under R version 4.3.3
#install.packages("tinytex")
#tinytex::install_tinytex()
Vehicles_df <- read.csv(file="C:/Users/Hill85/Desktop/Hill Docs/vehicles.csv")
head(Vehicles_df,3)
## id
## 1 7222695916
## 2 7218891961
## 3 7221797935
## url
## 1 https://prescott.craigslist.org/cto/d/prescott-2010-ford-ranger/7222695916.html
## 2 https://fayar.craigslist.org/ctd/d/bentonville-2017-hyundai-elantra-se/7218891961.html
## 3 https://keys.craigslist.org/cto/d/summerland-key-2005-excursion/7221797935.html
## region region_url price year manufacturer model
## 1 prescott https://prescott.craigslist.org 6000 NA
## 2 fayetteville https://fayar.craigslist.org 11900 NA
## 3 florida keys https://keys.craigslist.org 21000 NA
## condition cylinders fuel odometer title_status transmission VIN drive size
## 1 NA
## 2 NA
## 3 NA
## type paint_color image_url description county state lat long posting_date
## 1 NA az NA NA
## 2 NA ar NA NA
## 3 NA fl NA NA
nrow(Vehicles_df)
## [1] 426880
This crucial step entailed the following steps:
This initial step involved identifying missing values (NA) and empty strings within the dataset to assess its overall completeness. Detecting these issues is important as it helps highlight data quality problems and guides the necessary steps for the data cleaning process.
#Obtain counts for missing values per column.
missing_summary <- data.frame(
Column = names(Vehicles_df),
N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)
missing_summary
## Column N_A Blank_Strings Total_Missing
## id id 0 0 0
## url url 0 0 0
## region region 0 0 0
## region_url region_url 0 0 0
## price price 0 0 0
## year year 1205 NA 1205
## manufacturer manufacturer 0 17646 17646
## model model 0 5276 5276
## condition condition 0 174104 174104
## cylinders cylinders 0 177678 177678
## fuel fuel 0 3013 3013
## odometer odometer 4400 NA 4400
## title_status title_status 0 8242 8242
## transmission transmission 0 2556 2556
## VIN VIN 0 161042 161042
## drive drive 0 130567 130567
## size size 0 306361 306361
## type type 0 92858 92858
## paint_color paint_color 0 130203 130203
## image_url image_url 0 68 68
## description description 0 69 69
## county county 426880 NA 426880
## state state 0 0 0
## lat lat 6549 NA 6549
## long long 6549 NA 6549
## posting_date posting_date 0 68 68
Irrelevant variables such as condition, VIN, drive type, and paint color were removed to improve data quality and focus on key analytical features. Non-essential fields like URLs, image links, identifiers, descriptions, and timestamps were also excluded as they were not relevant to the analysis.
#Removing Irrelevant Variables.
Vehicles_df <- Vehicles_df[, !(names(Vehicles_df) %in% c(
"Condition", "cylinders", "VIN", "drive", "size",
"type", "paint_color", "condition", "county", "url", "region_url", "id","image_url","description", "posting_date"
))]
#nrow(Vehicles_df)
#colSums(is.na(Vehicles_df))
head(Vehicles_df,3)
## region price year manufacturer model fuel odometer title_status
## 1 prescott 6000 NA NA
## 2 fayetteville 11900 NA NA
## 3 florida keys 21000 NA NA
## transmission state lat long
## 1 az NA NA
## 2 ar NA NA
## 3 fl NA NA
#Checking to see if there is any variable that still has NAs
Final_Missing_Summary <- data.frame(
Column = names(Vehicles_df),
N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)
#Final_Missing_Summary
Records with missing values or empty strings in key variables were removed to ensure complete and reliable data. This step focused on essential vehicle and geographic attributes to improve the quality and validity of the dataset for analysis.
#remove missing values/empty strings
Vehicles_df <- Vehicles_df %>%
filter(
!is.na(year) & year != "",
!is.na(manufacturer) & manufacturer != "",
!is.na(model) & model != "",
!is.na(fuel) & fuel != "",
!is.na(title_status) & title_status != "",
!is.na(transmission) & transmission != "",
!is.na(lat),
!is.na(long),
!is.na(odometer)
)
nrow(Vehicles_df)
## [1] 383489
Final_Missing_Summary <- data.frame(
Column = names(Vehicles_df),
N_A = sapply(Vehicles_df, function(x) sum(is.na(x))),
Blank_Strings = sapply(Vehicles_df, function(x) sum(x == "")),
Total_Missing = sapply(Vehicles_df, function(x) sum(is.na(x) | x == ""))
)
Final_Missing_Summary
## Column N_A Blank_Strings Total_Missing
## region region 0 0 0
## price price 0 0 0
## year year 0 0 0
## manufacturer manufacturer 0 0 0
## model model 0 0 0
## fuel fuel 0 0 0
## odometer odometer 0 0 0
## title_status title_status 0 0 0
## transmission transmission 0 0 0
## state state 0 0 0
## lat lat 0 0 0
## long long 0 0 0
The dataset was further reorganized by prioritizing key variables, such as Make, Model, Year, and Mileage, by positioning them at the beginning of the data frame to improve readability and facilitate interpretation, while the remaining variables were retained in their original order thereafter.
# Capitalize first letter
names(Vehicles_df) <- tools::toTitleCase(names(Vehicles_df))
# Renaming specific varaibles.
names(Vehicles_df) <- gsub("^Odometer$", "Mileage", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Title_status$", "TitleStatus", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Manufacturer$", "Make", names(Vehicles_df))
names(Vehicles_df) <- gsub("^Region$", "City", names(Vehicles_df))
Vehicles_df$Price.1 <- NULL
#Reordering the column names in the data
Vehicles_df <- Vehicles_df %>%
select(
Make, Model, Year, Mileage, Price,
TitleStatus, State, City,
everything()
)
#capitalizing the first letter of each word in every record.
Vehicles_df <- Vehicles_df %>%
mutate(across(where(is.character), str_to_title))
#Capitalizing State Abbreviations.
Vehicles_df$State <- toupper(Vehicles_df$State)
head(Vehicles_df,3)
## Make Model Year Mileage Price TitleStatus State
## 1 Gmc Sierra 1500 Crew Cab Slt 2014 57923 33590 Clean AL
## 2 Chevrolet Silverado 1500 2010 71229 22590 Clean AL
## 3 Chevrolet Silverado 1500 Crew 2020 19160 39590 Clean AL
## City Fuel Transmission Lat Long
## 1 Auburn Gas Other 32.59 -85.48
## 2 Auburn Gas Other 32.59 -85.48
## 3 Auburn Gas Other 32.59 -85.48
49,563 duplicate records were identified and removed from the dataset. This reduced the dataset to 333,926 unique observations, improving data quality and ensuring that each record represents a distinct vehicle listing.
#Establishing presence of duplicate records.
sum(duplicated(Vehicles_df))
## [1] 49563
#Dropping the duplicate values
Vehicles_df <- Vehicles_df[!duplicated(Vehicles_df), ]
nrow(Vehicles_df)
## [1] 333926
The dataset was filtered to include only vehicles manufactured between 2000 and 2025, ensuring the analysis focused on modern and relevant observations. This further reduced the dataset to 315,000 records, resulting in a more focused and representative sample. The remaining observations were primarily concentrated in recent model years (2000–2022), which improved the relevance and consistency of the subsequent visual analysis.
#obtaining unique years in the dataset
sort(unique(Vehicles_df$Year))
## [1] 1900 1905 1913 1915 1918 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930
## [16] 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945
## [31] 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960
## [46] 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975
## [61] 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
## [76] 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [91] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## [106] 2021 2022
table(Vehicles_df$Year)
##
## 1900 1905 1913 1915 1918 1921 1922 1923 1924 1925 1926 1927 1928
## 3 1 2 1 1 1 1 30 6 5 9 26 29
## 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941
## 38 53 26 46 17 22 14 20 41 14 32 47 36
## 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954
## 6 1 3 1 33 43 40 42 44 56 64 55 68
## 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
## 135 98 118 49 55 67 49 98 160 196 273 270 230
## 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## 312 280 228 219 299 255 184 132 150 198 260 285 204
## 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993
## 163 170 192 302 385 393 393 400 472 494 501 470 599
## 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## 789 1004 1041 1400 1577 2400 2869 3574 4568 5827 7207 8332 9899
## 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## 11677 13348 9434 12333 15929 18230 23320 22787 23745 22783 28975 30541 21381
## 2020 2021 2022
## 16682 1526 33
#Filtering data to remain only with years 2000 through 2025
Vehicles_df <- Vehicles_df[Vehicles_df$Year >= 2000 & Vehicles_df$Year <= 2025, ]
#head(Vehicles_df, 5)
nrow(Vehicles_df)
## [1] 315000
This subsection examines the statistical distributions of the variables within the dataset to describe their individual behavior and underlying relationships. Particular attention is given to the variability of these attributes and their association with the target variable, Price, which constitutes the primary outcome of interest in this analysis.
2018 accounts for the highest proportion of vehicle listings in the dataset (9.70%), followed closely by 2017 (9.20%). Other notable years, including 2015 (7.54%), 2013 (7.40%), 2014 (7.23%), and 2016 (7.23%), reinforce a sustained concentration of observations within mid- to late-2010s vehicle models. Overall, the distribution is strongly skewed toward vehicles manufactured between 2011 and 2019, indicating that the dataset is predominantly composed of relatively recent model-year listings and, consequently, that newer vehicles are more heavily represented in the market structure.
This pattern is partly attributable to the post–recession recovery in the United States, which supported a gradual expansion in vehicle production and sustained consumer demand throughout the 2010s. In contrast, 2020 shows lower representation, reflecting disruptions associated with the COVID-19 pandemic, which significantly affected supply chains and temporarily reduced activity in the used vehicle market.
Car_count_per_year <- Vehicles_df %>%
group_by(Year) %>%
summarise(n = n()) %>%
mutate(Freq = n/sum(n)*100) %>%
arrange(desc(Freq)) %>%
as.data.frame() %>%
mutate(Freq = round(Freq, 2))
#head(Car_count_per_year,10)
ggplot(data=Car_count_per_year,aes(x=Year,y=Freq)) +
geom_line(color="turquoise",linewidth=1) +
geom_point(size=1,color="blue") +
labs(x="Year",y="Used Cars Sales(%)",
title="Yearly Distribution of Used Car Sales Volumes.")+
theme(plot.title =element_text(color="Black",size=12,
face="bold",lineheight = 0.7),axis.text.x = element_text())
California accounts for the largest proportion of listings (11.55%), followed by Florida (6.60%) and Texas (5.63%), indicating pronounced market concentration in high-population and economically active states. In contrast, states such as North Dakota (0.09%), Wyoming (0.15%), and Nebraska (0.24%) exhibit minimal representation, reflecting comparatively lower market activity in sparsely populated regions. Overall, the distribution underscores a strong geographic concentration of vehicle listings within major U.S. population centers.
# Top and bottom states with percentages
state_summary <- Vehicles_df %>%
group_by(State) %>%
summarize(Total_Sales = n()) %>%
mutate(Percentage = round(Total_Sales / sum(Total_Sales) * 100, 2))
top_states <- state_summary %>%
arrange(desc(Total_Sales)) %>%
slice(1:10)
#head(top_states,3)
bottom_states <- state_summary %>%
arrange(Total_Sales) %>%
slice(1:10)
#head(bottom_states,3)
# Top states plot
ggplot(top_states, aes(x = reorder(State, -Total_Sales), y = Total_Sales)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.3, size = 3) +
ggtitle("States with the Most Used Car Sales") +
xlab("State") +
ylab("Number of Cars Sold") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Bottom states plot
ggplot(bottom_states, aes(x = reorder(State, Total_Sales), y = Total_Sales)) +
geom_bar(stat = "identity", fill = "khaki2") +
geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.3, size = 3) +
ggtitle("States with the Least used Car Used Sales.") +
xlab("State") +
ylab("Number of Cars Sold") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Clear differences are observed in the distribution of vehicle manufacturing years across title status categories. Vehicles classified as Clean, Rebuilt, and Salvage are predominantly concentrated in more recent model years, largely within the 2010–2020 range, indicating a stronger representation of newer vehicles within these classifications. In contrast, Missing and Parts Only titles are associated with comparatively older model years, suggesting a tendency for these categories to be linked with aging or end-of-life vehicles. The Lien category exhibits a notably broader dispersion in manufacturing years, including some of the oldest observations in the dataset, which indicates substantial variability in vehicle age within this group.
#Variations in status of car titles.
ggplot(Vehicles_df, aes(x = TitleStatus, y = Year, fill = TitleStatus)) +
geom_boxplot() +
labs(
x = "Status of Car Titles",
y = "Year",
title = "Distribution of Car Title Status"
) +
theme(
plot.title = element_text(color = "black", size = 12, face = "bold", lineheight = 0.8),
axis.text.x = element_text()
)
Ford represents the largest share of listings at 16.9%, followed by Chevrolet (12.8%), Toyota (8.5%), and Honda (5.4%). Overall, the distribution of the top ten makes is concentrated among American and Japanese manufacturers, indicating a pronounced market preference for vehicles originating from these regions. This pattern is likely driven by their long-standing presence in the U.S. automotive market, extensive dealership networks, and high availability within the used vehicle segment.
# Distribution of Car Makes.
Car_Make_Count <- Vehicles_df %>%
group_by(Make) %>%
summarise(n = n(), .groups = "drop") %>%
mutate(Freq = n / sum(n) * 100) %>%
arrange(desc(n)) %>%
slice_head(n = 10)
ggplot(Car_Make_Count, aes(x = reorder(Make, n), y = n)) +
geom_col(fill = "turquoise") +
geom_text(aes(label = paste0(round(Freq, 1), "%")),
vjust = -0.3, size = 3) +
labs(x = "Car Make", y = "Number of Cars",
title = "Top 10 Car Makes") +
theme(plot.title = element_text(color = "black", size = 12, face = "bold"),
axis.text.x = element_text(angle = 90, hjust = 1))
Average used car prices peaked between 2018 and 2020, reflecting strong market demand and tightening supply conditions during this period, with 2020 particularly influenced by disruptions from the COVID-19 pandemic. In contrast, prices decline noticeably by 2022, likely due to market stabilization and normalization of supply chains following the pandemic-related disruption.
avg_prices_by_year <- Vehicles_df %>%
filter(Year %in% sort(unique(Year), decreasing = TRUE)[1:10]) %>%
group_by(Year) %>%
summarise(avg_price = mean(Price, na.rm = TRUE)) %>%
arrange(desc(Year))
ggplot(avg_prices_by_year, aes(x = reorder(Year, -avg_price), y = avg_price)) +
geom_bar(stat = "identity", fill = "darkgray", color = "black") +
labs(
title = "Average Used Car Prices over the Last 10 Years",
x = "Year",
y = "Average Price"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Ferrari and Aston Martin occupy the highest tier of median pricing, consistent with their position in the ultra-luxury, low-volume segment of the automotive market. Tesla, Jaguar, and Porsche follow, reflecting strong demand and high resale values associated with premium and performance-oriented vehicles. The price of Ram is comparatively lower, largely because its model range includes both commercial and consumer pickup trucks, which introduces greater variation and lowers the overall median price.
#Median Prices of Different Car Models.
median_car_prices <- Vehicles_df %>%
group_by(Make) %>%
summarise(median_car_price = median(Price, na.rm = TRUE)) %>%
mutate(Make = fct_reorder(Make, median_car_price)) %>%
arrange(desc(median_car_price))
ggplot(median_car_prices, aes(x = Make, y = median_car_price)) +
geom_col(fill = "darkcyan") +
labs(x = "Make", y = "Median Car Price",
title = "Median Used Car Prices by Make") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The distribution of vehicle mileage is positively skewed, indicating that a large proportion of vehicles fall within the higher mileage range, while comparatively fewer observations are concentrated at lower mileage levels. The mean mileage of approximately 91,642 reflects this asymmetry, as it is elevated by the presence of high-mileage vehicles within the dataset.
options(scipen = 999)
#omit na
Vehicles_clean <- Vehicles_df %>%
filter(!is.na(Mileage), Mileage > 0)
avg_mileage <- mean(Vehicles_clean$Mileage)
ggplot(Vehicles_clean, aes(x = Mileage)) +
geom_histogram(color = "firebrick4", fill = "cornflowerblue", bins = 160) +
geom_vline(xintercept = avg_mileage, color = "yellow", linetype = "dashed", linewidth = 1) +
annotate(
"text",
x = avg_mileage,
y = Inf,
label = paste("Average:", round(avg_mileage, 0)),
vjust = 2,
hjust = -0.3,
color = "black"
) +
labs(
x = "Car Mileage",
y = "Number of Cars",
title = "Distribution of Used Car Mileages"
) +
scale_x_log10(labels = scales::comma)
# Filter years
Vehicles_df <- Vehicles_df %>%
filter(Year >= 2000 & Year <= 2025)
# Count vehicles per state.
state_counts <- Vehicles_df %>%
group_by(State) %>%
summarise(count = n(), .groups = "drop")
# Convert state abbreviations to full names (required for map)
state_counts$region <- tolower(state.name[match(state_counts$State, state.abb)])
# Get US map
us_map <- map_data("state")
# Merge map with car data
map_data_full <- left_join(us_map, state_counts, by = "region")
# Plotting the distribution.
ggplot(map_data_full, aes(x = long, y = lat, group = group, fill = count)) +
geom_polygon(color = "white") +
coord_fixed(1.3) +
scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "gray90") +
labs(title = "Distribution of used Cars Per State",
fill = "Vehicle Count") +
theme_minimal()
Comparatively lower median prices are concentrated in the Midwestern United States. This pattern can be partially attributed to the nature of the markets, including lower state-level taxes and registration fees, as well as less stringent or absent vehicle inspection requirements in several states. These regulatory conditions allow older and higher-mileage vehicles to remain in circulation for longer periods, thereby increasing the supply of lower-priced vehicles in the market.
In addition, regional market dynamics further contribute to reduced price levels. A relatively high density of dealerships, combined with lower population density in certain areas, results in a higher supply of vehicles relative to local demand, placing downward pressure on median prices.
options(scipen = 999)
Vehicles_df <- Vehicles_df %>%
filter(Year >= 2000 & Year <= 2025) %>%
# ensuring that price is numeric.
mutate(Price = as.numeric(Price)) %>%
# removing missing + unrealistic values
filter(!is.na(Price),
Price > 500, # remove junk price values like 0 or 1 in data
Price < 100000) # removing extreme outliers
# median prices
state_price <- Vehicles_df %>%
group_by(State) %>%
summarise(avg_price = median(Price), .groups = "drop")
state_price$region <- tolower(state.name[match(state_price$State, state.abb)])
us_map <- map_data("state")
map_data_full <- left_join(us_map, state_price, by = "region")
ggplot(map_data_full, aes(x = long, y = lat, group = group, fill = avg_price)) +
geom_polygon(color = "white") +
coord_fixed(1.3) +
scale_fill_gradient(
low = "lightyellow",
high = "red",
na.value = "gray90"
) +
labs(title = "Median Prices of Used Cars by State",
fill = "Price ($)") +
theme_minimal()
#Interactive map showing Median prices in specific states.
# summary of your car dataset by state
state_price <- Vehicles_df %>%
filter(Year >= 2000 & Year <= 2025) %>%
group_by(State) %>%
summarise(
median_price = median(Price, na.rm = TRUE),
count = n(),
.groups = "drop"
)
# Plotting interactive Choropleth map showing median car prices per state
Choropleth_Price_map <- plot_geo(state_price) %>%
add_trace(
locations = ~State,
locationmode = "USA-states",
z = ~median_price,
text = ~paste(
"State:", State,
"<br>Median Price: $", round(median_price, 0),
"<br>Count:", count
),
hoverinfo = "text"
) %>%
layout(
title = "Median Prices of Used Cars by State",
geo = list(
scope = "usa",
projection = list(type = "albers usa")
)
)
Choropleth_Price_map
Newer used cars (especially those from 2011–2019) tend to have higher prices, indicating that buyers generally pay more for more recent vehicles. Furthermore, temporal trends show that prices peaked between 2018 and 2020, likely reflecting broader market disruptions during the COVID period, including supply constraints and shifts in demand.
Vehicle prices also vary by geographical location. Midwestern states show lower median prices compared to higher-priced markets such as California, Florida, and Texas, suggesting that regional demand and economic conditions influence pricing differences.
Overall, used car prices are shaped by vehicle age, regional differences, and time-related market conditions, particularly the disruptions observed between 2018 and 2020.
This study had several limitations including: