Introduction

The purpose of this analysis is to delve into the characteristics of used cars listed for sale at a static date in 2022, utilizing a dataset encompassing various attributes of these vehicles. This dataset is complemented by geographical data, offering a broad-ranging view of the used car market. The primary dataset includes a range of vehicle-specific details such as price, engine type, mileage, and consumer ratings, among others. These elements are critical in understanding the dynamics that influence the used car market, shedding light on factors that dictate car valuation and consumer preferences.

By integrating this dataset with latitude and longitude information for ZIP codes, the analysis achieves a multidimensional perspective, enabling a geographical breakdown of car listings. Such an approach allows for a granular examination of market distribution and trends across different regions.

The analysis aims to unearth patterns and preferences that influence the sale and purchase of used cars, offering valuable insights into the automotive secondary market. Through this exploration, stakeholders can gain a deeper understanding of the factors that drive consumer choices and market dynamics in the realm of used vehicles.

Dataset

The core attributes analyzed in the dataset include:

  • Price - The selling price of the car
  • Engine Type - The specification of the car’s engine
  • Mileage - The total miles the car has been driven
  • ZIP Code - The geographical area of the car listing
  • Consumer Ratings - The rating given by consumers, as surveyed by Consumer Reports
  • Consumer Reviews - A count of reviews given by consumers
  • Reliability Ratings - A measure of the car’s reliability based on the Consumer Reports survey

The reliability ratings, which are a key component of a few of the visuals, are based on a statistical model that estimates problem rates within the first five years of ownership.

Analysis

Car Listings by Zip Code

This map shows the distribution of car listings across various zip codes in the United States. It plots points on a map corresponding to the number of cars listed for sale within each zip code. The size of each point (blue circle) on the map is proportional to the square root of the number of car listings in that location, scaled up by a factor of 2 for visibility.

Observations

  • Car listings are spread across the country, but with a higher concentration in populated areas. Rural areas have fewer listings.
  • There are areas with a particularly high density of car listings, notably in urban and metropolitan areas. For example, large clusters are visible in California (Los Angeles, San Francisco Bay Area), the Northeastern corridor (New York City, Philadelphia, Boston), and other major cities like Chicago and Miami.
  • For businesses in the automotive industry, such as dealerships or online car marketplaces, the concentration in heavily populated areas can denote market saturation, while potential areas for expansion may exist in those regions that have fewer listings and as a result may be underserved.

Considerations

  • While this map provides an immediate visual cue about the distribution of car listings, it may be beneficial to expand further and additional layers or filters to display other relevant data such as average prices, car models distribution, or demographic information to further enhance analysis and decision-making utility.
import pandas as pd
import folium as f
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import plotly.io as pio
import os
from branca.element import Figure
from matplotlib.ticker import FuncFormatter
from matplotlib.colors import LinearSegmentedColormap

os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/ProgramData/Anaconda3/Library/plugins/platforms'

path = 'C:/Users/GregoryNeeley/Desktop/Python Visualizations/'

# Load the dataset for car listings

df = pd.read_csv(path + 'cars_raw.csv')

# Drop all Tesla rows, as the data has proven erroneous and unreliable in review
df = df[df['Make'] != 'Tesla']

# Load the dataset with latitude and longitude data for zip codes

zip_lat_long_df = pd.read_csv(path + 'zip_lat_long.csv')

### Data Preparation

# Drop rows without price data and convert prices to Float

df = df[df['Price'] != 'Not Priced']
df['Price'] = df['Price'].str.replace('$', '', regex=True).str.replace(',', '', regex=True).astype(float)

# Convert the ZIP code columns to string

df['Zipcode'] = df['Zipcode'].astype(str)
zip_lat_long_df['ZIP'] = zip_lat_long_df['ZIP'].astype(str)

### Car Listings by Zip Code

## Data Preparation

# Copy main dataframe for transformations
location_df = df[['Zipcode']].copy()

# Count the number of cars in each ZIP code
car_counts = location_df['Zipcode'].value_counts().reset_index()
car_counts.columns = ['Zipcode', 'CarCount']

# Merge the car counts back with the original dataframe
location_df = pd.merge(location_df, car_counts, on='Zipcode', how='left')

# Append Zip Codes to main dataframe
location_df = pd.merge(location_df, zip_lat_long_df, left_on='Zipcode', right_on='ZIP', how='left')

## Plot

map_title = "Car Listings by Zip Code"
title_html = f'<h1 style="position:absolute;z-index:100000;left:40vw" >{map_title}</h1>'

# Initialize the map at a central location
m = f.Map(location=[37.0902, -95.7129], zoom_start=4)
m.get_root().html.add_child(f.Element(title_html))
# Add points to the map for each car listing, scaled by car count
for idx, row in location_df.iterrows():
    if not pd.isnull(row['LAT']) and not pd.isnull(row['LNG']) and not pd.isnull(row['CarCount']):
        radius = row['CarCount'] ** 0.5 * 2
        f.CircleMarker(
            location=[row['LAT'], row['LNG']],
            radius=radius,
            fill=True,
            fill_color='blue',
            fill_opacity=0.1,
            color='blue',
            opacity=0.05,
            tooltip=f"Zip Code: {row['Zipcode']}, Cars: {row['CarCount']}"
        ).add_to(m)
m.save(path + 'Car_Listings_by_Zip_Code.html')

Car_Listings_by_Zip_Code

Deal Types by Consumer Rating Distribution

Here we see consumer ratings across three different types of deals: Fair (Red), Good (Yellow), and Great (Green) to show distribution of Deal Ratings compared to the Consumer Rating value for the vehicle. It is important to note that any listing with a vehicle whose Consumer Reviews count was less than 100 has been omitted. The reasoning behind these exclusions are detailed further below. Somewhat counterintuitively, “Good”” deals are rated the highest by consumers, whereas “Great” and “Fair” deals have lower and more varied consumer ratings.

Observations

  • The “Fair” deal type has the lowest median rating, although the specific value isn’t shown. The spread is similar to that of the ‘Great’ deals, with a wide range of ratings.
  • The “Good” deal type has a median rating of 4.8, with a lower quartile (Q1) of 4.7 and an upper quartile (Q3) close to 4.9. The ratings range from a minimum of 4.0 to a maximum of 4.9. The interquartile range (IQR) is narrow, indicating that most of the consumer ratings are closely packed around the median.
  • The “Great” deal type has a slightly lower median rating than the ‘Good’ deal type, at 4.7. The spread is wider for the ‘Great’ deals, as depicted by a longer box, which indicates considerably more variation in the consumer ratings.
  • All three categories are more or less consistent in having a few outliers.

Considerations

  • The analysis is limited to deals with at least 100 consumer reviews, which ensures that only deals with a significant number of reviews are considered. This may exclude new deals or those less frequently purchased, but it helps to avoid skewing the data with deals that have too few reviews to be statistically significant.
  • It has been shown through other consumer surveys, such as the Volume Conversion Impact analysis performed by Power Reviews, that customers are exponentially less likely to consider products with fewer reviews.

Volume Conversion Impact

## Comparison of Deal Type by Consumer Rating

# Data Preparation

# Filter for 
deal_df = df[df['ConsumerReviews'] >= 100]

# Define a color dictionary
color_discrete_map = {'Great': 'green', 'Good': 'yellow', 'Fair': 'red'}

# Create the plot
fig = px.box(deal_df, x='DealType', y='ConsumerRating', color='DealType',
             title='Distribution of Deal Types by Consumer Rating',
             color_discrete_map=color_discrete_map);

# Customizing the plot layout
fig.update_layout(
    xaxis_title='Deal Type',
    yaxis_title='Consumer Rating',
    height=600,
    title_x=0.5,
    legend_title_text='Deal Type',
    legend=dict(
        x=1,
        y=0.5,
        xanchor='left',
        bgcolor='rgba(255, 255, 255, 0.5)'
    )
);

fig.update_xaxes(tickangle=45);

# Update traces for marker properties
fig.update_traces(marker=dict(line=dict(width=2)))

pio.write_html(fig, path + "Deal_Type_by_Consumer_Rating.html", auto_open=False)

Price vs. Mileage by Cylinders

Below is a scatter plot visualizing the relationship between the price of a vehicle and its mileage, with data points colored based on the number of engine cylinders. The analysis of this chart focuses on identifying trends or patterns in how the price of a car correlates with its mileage and engine size. As is expected, the inverse relationship between price and mileage is shown, with engine size as a factor for price differentiation. Luxury and performance vehicles with more engine cylinders generally retain their value better, even at higher mileages.

Observations

  • There is a general trend showing that as mileage increases, the price of the vehicle tends to decrease, which is a common expectation in the used car market due to wear and tear and the age of the vehicle.
  • The scatter plot indicates that vehicles with higher numbers of cylinders (e.g., 10, 12) have data points that tend toward the higher price range, which might suggest these are more premium or performance-oriented vehicles.
  • The majority of the data points are clustered towards the bottom left of the chart, indicating a large number of vehicles with lower mileage and lower price.
  • The spread of data points becomes less dense as mileage and price increase.
  • There are a few outliers with high prices even at higher mileages, particularly for vehicles with higher numbers of engine cylinders. These tend to be specific models that retain their value better or are luxury/performance cars that are more expensive regardless of mileage.
  • Buyers looking for less expensive options might consider vehicles with higher mileage and fewer engine cylinders.

Considerations

  • Outliers in price and mileage beyond certain thresholds (150,000 USD and 150,000 miles) were removed to avoid skewing the analysis.
  • The scatter plot does not show the age of the vehicle, which is also a critical factor in determining the price of a vehicle.
  • The plot does not discriminate between car models, brands, or conditions, which can significantly impact their price.
### Define functions

## Price vs Mileage by Engine Type Functions

# Function to format the y-axis as currency

def currency_formatting(y, pos):
    return '${:,.0f}'.format(y)

# Function to format the x-axis as currency

def thousands_formatting(x, pos):
    return '{:,.0f}'.format(x)
    
# Function to categorize the engine type

def categorize_engine(engine):
    engine = engine.upper()
    cylinders = {
    "V2": "2", "V3": "3", "V4": "4", "V5": "5", "V6": "6", 
    "V7": "7", "V8": "8", "V9": "9", "V10": "10", "V11": "11", 
    "V12": "12", "V-2": "2", "V-3": "3", "V-4": "4", "V-5": "5",
    "V-6": "6", "V-7": "7", "V-8": "8", "V-9": "9", "V-10": "10", 
    "V-11": "11", "V-12": "12", "I2": "2", "I3": "3", "I4": "4", 
    "I5": "5", "I6": "6", "I7": "7", "I8": "8", "I9": "9", 
    "I10": "10", "I11": "11", "I12": "12", "I-2": "2", "I-3": "3", 
    "I-4": "4", "I-5": "5", "I-6": "6", "I-7": "7", "I-8": "8", 
    "I-9": "9", "I-10": "10", "I-11": "11", "I-12": "12", "1-CYL": "1", 
    "2-CYL": "2", "3-CYL": "3", "4-CYL": "4", "5-CYL": "5", 
    "6-CYL": "6", "7-CYL": "7", "8-CYL": "8", "9-CYL": "9", 
    "10-CYL": "10", "11-CYL": "11", "12-CYL": "12", "H4": "4", 
    "H6": "6", "W12": "12", "H-4": "4", "H-6": "6", "W-12": "12", 
    "1 Cylinder": "1", "2 Cylinder": "2", "3 Cylinder": "3", "4 Cylinder": "4", 
    "5 Cylinder": "5", "6 Cylinder": "6", "7 Cylinder": "7", "8 Cylinder": "8", 
    "9 Cylinder": "9", "10 Cylinder": "10", "11 Cylinder": "11", "12 Cylinder": "12", 
    "ELECTRIC": "Electric"
}
    for key, value in cylinders.items():
        if key in engine:
            return value
    return 'Other'
    
#### Price vs Mileage by Engine Type

### General Data Preparation

# List of valid engine cylinder categories
valid_categories = [
    "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"
]

# Copy dataframe and apply function to the Engine column to create a new Engine Cylinders column
engine_df = df.copy()
engine_df['Engine Cylinders'] = engine_df['Engine'].apply(categorize_engine)

# Filter data based on valid engine cylinder categories
engine_df = engine_df[engine_df['Engine Cylinders'].isin(valid_categories)]


### Price vs Mileage by Engine Type specific data preparation

# Drop rows with missing values in the relevant columns
pvm_engine_df = engine_df.dropna(subset=['Mileage', 'Price', 'Engine'])

# Drop rows with outlier values for Price and Mileage
pvm_engine_df = engine_df[(engine_df['Price'] <= 150000) & (engine_df['Mileage'] <= 150000)]

## Create the Price vs Mileage by Engine Type plot
fig = px.scatter(pvm_engine_df, x='Mileage', y='Price', color='Engine Cylinders',
                 title='Price vs. Mileage by Engine Cylinders',
                 labels={'Mileage': 'Mileage (miles)', 'Price': 'Price (USD)'},
                 category_orders={"Engine Cylinders": valid_categories});

# Update layout and legend settings
fig.update_layout(
    legend_title_text='Engine Cylinders',
    legend=dict(
        x=1,
        y=1,
        xanchor='left',
        bgcolor='rgba(255, 255, 255, 0.5)',
        traceorder='normal'
    ),
    height=600,
    title_x=0.5,
    margin=dict(l=30, r=30, t=60, b=30)
);

fig.update_xaxes(tickangle=45)

pio.write_html(fig, path + "Price_vs_Mileage.html", auto_open=False)

MPG by Cylinders and Reliability

This heatmap shows the relationship between the average miles per gallon (MPG) of vehicles, their engine cylinder count, and their reliability rating.

Observations

  • Generally speaking, vehicles with fewer cylinders have higher Miles Per Gallon (MPG).
  • The analysis indicates that vehicles with a higher reliability rating tend to have better MPG. For example, vehicles with a reliability rating of 5 have MPG ratings ranging from 18.0 to 38.0, which are the highest values on the chart.
  • Converse to the last point, as the reliability rating decreases to the 2 to 2.5 range, the MPG also drops with the lowest being 20.4 for the 6-cylinder vehicles.
  • The chart does not show a clear trend between the number of cylinders and reliability.
  • While 4-cylinder vehicles show a common range of reliability from 3 to 5 (with apparent outliers in the 0.5 to 1 range) and MPG from 23.7 to 28.0, the 6-cylinder vehicles show a lower reliability range of 2 to 3 with MPG also lower, ranging from 20.4 to 21.8.
  • For those prioritizing fuel efficiency, a vehicle with fewer cylinders and a higher reliability rating would be desirable.

Considerations

  • The data is not capturing all variables that affect MPG and reliability, such as vehicle model, weight, or type of vehicle
# Function to categorize the engine type
def categorize_engine(engine):
    engine = engine.upper()
    cylinders = {
    "V2": "2", "V3": "3", "V4": "4", "V5": "5", "V6": "6", 
    "V7": "7", "V8": "8", "V9": "9", "V10": "10", "V11": "11", 
    "V12": "12", "V-2": "2", "V-3": "3", "V-4": "4", "V-5": "5",
    "V-6": "6", "V-7": "7", "V-8": "8", "V-9": "9", "V-10": "10", 
    "V-11": "11", "V-12": "12", "I2": "2", "I3": "3", "I4": "4", 
    "I5": "5", "I6": "6", "I7": "7", "I8": "8", "I9": "9", 
    "I10": "10", "I11": "11", "I12": "12", "I-2": "2", "I-3": "3", 
    "I-4": "4", "I-5": "5", "I-6": "6", "I-7": "7", "I-8": "8", 
    "I-9": "9", "I-10": "10", "I-11": "11", "I-12": "12", "1-CYL": "1", 
    "2-CYL": "2", "3-CYL": "3", "4-CYL": "4", "5-CYL": "5", 
    "6-CYL": "6", "7-CYL": "7", "8-CYL": "8", "9-CYL": "9", 
    "10-CYL": "10", "11-CYL": "11", "12-CYL": "12", "H4": "4", 
    "H6": "6", "W12": "12", "H-4": "4", "H-6": "6", "W-12": "12", 
    "1 Cylinder": "1", "2 Cylinder": "2", "3 Cylinder": "3", "4 Cylinder": "4", 
    "5 Cylinder": "5", "6 Cylinder": "6", "7 Cylinder": "7", "8 Cylinder": "8", 
    "9 Cylinder": "9", "10 Cylinder": "10", "11 Cylinder": "11", "12 Cylinder": "12", 
    "ELECTRIC": "Electric"
}
    for key, value in cylinders.items():
        if key in engine:
            return value
    return 'Other'

# Function to bin reliability ratings
def bucket_reliability_rating(rating):
    return round(rating * 2) / 2
    
# General Data Preparation

# List of most common engine cylinder categories
filtered_valid_categories = [
    "2", "3", "4", "5", "6", "8"
]

# List to serve as buckets for ReliabilityRatings
reliability_range = [x * 0.5 for x in range(2, 11)]

# Copy dataframe and apply function to the Engine column to create a new Engine Cylinders column
engine_df = df.copy()
engine_df['Engine Cylinders'] = engine_df['Engine'].apply(categorize_engine)

# Filter data based on valid engine cylinder categories
engine_df = engine_df[engine_df['Engine Cylinders'].isin(filtered_valid_categories)]

# Apply the bucketing function to the ReliabilityRating column
df['ReliabilityRating'] = df['ReliabilityRating'].apply(bucket_reliability_rating)

# Calculate the Average MPG
engine_df['AverageMPG'] = (engine_df['MinMPG'] + engine_df['MaxMPG']) / 2
engine_df['ReliabilityRating'] = pd.Categorical(engine_df['ReliabilityRating'], categories=reliability_range)

# Sort the data frame
engine_df.sort_values(by=['ReliabilityRating', 'Engine Cylinders'], inplace=True)

# Group by Engine Cylinders and ReliabilityRating and then calculate the mean AverageMPG
mpg_reliability_data = engine_df.groupby(['ReliabilityRating', 'Engine Cylinders'])['AverageMPG'].mean().reset_index()

# Pivot the data for heatmap visualization
mpg_reliability_pivot = mpg_reliability_data.pivot(index='ReliabilityRating', columns='Engine Cylinders', values='AverageMPG')

# Create the Average Reliability Rating by Engine Type plot

# Define the color gradient for the heatmap
c = ["darkred","red","lightcoral","mistyrose","lightgreen", "palegreen","green","darkgreen"]
v = [0, 0.142857142857143, 0.285714285714286, 0.428571428571429, 0.571428571428571, 0.714285714285714, 0.857142857142857, 1]
l = list(zip(v,c))

# Set the Seaborn theme and plot data
sns.set_theme(style="whitegrid")

plt.figure(figsize=(10, 8))
hm = sns.heatmap(mpg_reliability_pivot,
            cbar_kws={'label': 'Average MPG', 'shrink': .5},
            annot=True,
            fmt=".1f",
            cmap=LinearSegmentedColormap.from_list('rg', l, N=256),
            linewidths=.5,
)

hm.invert_yaxis()

# Define values to format the y-axis
tick_values = np.arange(.5, 5.5, 0.5)
tick_positions = np.arange(len(tick_values))

# Format the plot
plt.title('Average MPG by Engine Cylinders and Reliability Rating')
plt.xlabel('Engine Cylinders')
plt.ylabel('Reliability Rating')
plt.yticks(tick_positions, tick_values, rotation=0)
plt.tight_layout()

plt.show()

Average Reliability

Our final analysis is a line graph displaying the average reliability rating of the most plentiful car makes in our dataset, limited to models from the year 2013 to 2022.It underscores the ebb and flow of quality in the automotive industry, with each make showing unique trends in reliability over time.

Observations

  • The ratings generally range between approximately 3.5 and 5.0, with most makes clustering around the 4.0 to 4.8 range.
  • Most makes show fluctuations in reliability over the years, with no consistent trend across all makes.
  • Some makes exhibit significant variability from year to year, suggesting possible changes in production quality, model refreshes, or changes in consumer expectations.
  • At various points in time, different makes take the lead in reliability ratings. There is no single make that consistently has the highest or lowest ratings throughout the entire period.
  • No make maintains a perfect or near-perfect average reliability rating over the entire period, emphasizing the competitive and dynamic nature of the auto industry.
  • Cadillac specifically shows a sharp decline in the average reliability rating from 2020 to 2021. This is primarily due to the ratings of the Escalade, which saw a significant decrease year-over-year and comprise the majority of the Cadillac vehicles posted for sale.
  • GMC, another General Motors manufactured model, saw a similar downtrend. Given that GM shares its platforms across brands, this is not surprising given the decline seen with Cadillac.

Considerations

  • The graph does not account for the number of models offered by each make or other factors such as the severity of the issues that affect reliability ratings.
path = 'C:/Users/GregoryNeeley/Desktop/Python Visualizations/'

### Average Reliability Rating by Make Over Time

## Data Preparation

# Filter out rows with a year prior to 2013
reliability_df = df[df['Year'] >= 2013]

# Drop rows without Reliability ratings
reliability_df = reliability_df.dropna(subset=['ReliabilityRating'])

# Count the number of occurrences for each Make and filter for those with 200 or more listings
make_counts = reliability_df['Make'].value_counts()
popular_makes = make_counts[make_counts >= 200].index.tolist()
reliability_df = reliability_df[reliability_df['Make'].isin(popular_makes)]

# Group by Year and Make, then calculate the mean reliability rating
reliability_data = reliability_df.groupby(['Year', 'Make'])['ReliabilityRating'].mean().reset_index()

# Create the line plot for average reliability rating over time by make
fig = px.line(reliability_data, x="Year", y="ReliabilityRating", color='Make',
              title='Average Reliability Rating Over Time by Make (2013 and Later)',
              labels={'ReliabilityRating': 'Average Reliability Rating', 'Year': 'Year'});

# Add labels and title to the heatmap

fig.update_layout(xaxis_title='Year',
                  yaxis_title='Average Reliability Rating',
                  height=600,
                  title_x=0.5,
                  legend_title='Make',
                  xaxis=dict(tickmode='linear'),
                  margin=dict(l=30, r=30, t=60, b=30));

fig.update_xaxes(tickangle=45)

pio.write_html(fig, path + "Average_Reliability.html", auto_open=False)

Conclusion

The analysis of the used car listings dataset offers valuable insights into the market dynamics, consumer preferences, and the interplay between various car attributes. These visualizations provide a view of the data, aiding users in understanding some of the many aspects involved in making informed decisions in the used car market. With vehicle prices continually rising due to external influences such as inflation and supply chain disruption, it is important to consider factors not only along the lines of price, but also reliability as more expensive purchases are accompanied by a desire to extend the lifetime of their utility.