Introduction to My Analysis

In this analysis, I explore vehicle sales trends and the factors that shape the market, using a dataset of vehicle sales records. I examine vehicle make, year, condition, state, and selling price through bar charts, a heatmap, a line plot, and a pie chart. These visualizations show the most popular makes, the states with the most sales, how average selling price varies with condition, and how total market value is split across the leading brands.

About My Dataset

The “Vehicle Sales and Market Trends Dataset” is a detailed collection of data on vehicle sales. It includes information like the year, make, model, trim, body type, transmission, VIN (Vehicle Identification Number), state of registration, condition rating, odometer reading, colors inside and out, seller info, Manheim Market Report (MMR) values, selling prices, and sale dates.

Findings

The analysis revealed several key findings. Ford, Chevrolet, Nissan, and Toyota were the most popular car brands in the dataset. California, Florida, Pennsylvania, and Texas were the states with the highest car sales. To look more closely at regional trends, I focused on the East Coast for one of my graphics, showing how many cars were sold in five neighboring states. Finally, Ford had the highest total MMR of any make, that is, the largest sum of MMR values across its vehicles.

Visualization 1

For my first visualization, I identified the top 10 car makes in the dataset. I created a bar chart of the number of sales for each make and added a dashed line at the mean of those ten counts. Bars are colored by whether the make is above the mean, within 1% of it, or below it. Note that the mean is taken over the top 10 makes only, not over every make in the dataset, so the chart shows which brands drive sales among the leaders rather than how they compare to a typical make.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as mpatches
import warnings
warnings.filterwarnings("ignore")

path = "C:/Python/Dataset/"
filename = path + "car_data.csv"
df = pd.read_csv(filename, usecols = ['make'])

top_makes = df['make'].value_counts().nlargest(10)
mean_value = top_makes.mean()

def pick_colors_according_to_mean_count(data):
    # Check the "within 1% of the mean" band first. If the "above" test ran
    # first, values just above the mean would never be colored gold.
    colors = []
    for value in data:
        if mean_value * 0.99 < value < mean_value * 1.01:
            colors.append('Gold')
        elif value > mean_value:
            colors.append('RoyalBlue')
        else:
            colors.append('DarkGray')
    return colors

my_colors = pick_colors_according_to_mean_count(top_makes)
Above = mpatches.Patch(color='#4169E1', label='Above Average')  
At = mpatches.Patch(color='#FFD700', label='Within 1% of the Average')  
Below = mpatches.Patch(color='#A9A9A9', label='Below Average')  
fig = plt.figure(figsize=(18, 16))
fig.suptitle('Top 10 Car Makes Distribution\n', fontsize=18, fontweight='bold')
ax1 = fig.add_subplot(2, 1, 1)
bars = ax1.bar(top_makes.index, top_makes.values, label='Count', color=my_colors)
plt.axhline(mean_value, color='black', linestyle='dashed')
ax1.text(len(top_makes) - 1, mean_value + 0.05, f'Mean = {round(mean_value, 2)}', rotation=0, fontsize=14)
ax1.legend(handles=[Above, At, Below], fontsize=14)
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
plt.xticks(rotation=45)
for bar in bars:
    # Label each bar with its count as a whole number with thousands separators.
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.05,
             f'{int(bar.get_height()):,}', ha='center', va='bottom', fontsize=12)
ax1.set_title('Top 10 Car Makes', size=20)
plt.show()
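
The color rule used for the bar chart above (above the mean, within 1% of it, or below it) can also be written as a small standalone function. This is only a sketch: the function name, the tolerance parameter, and the sample counts are hypothetical and are not taken from the dataset. It checks the 1% band first so that a value just above the mean is labeled as near the mean rather than above it, and it assumes the values are positive counts.

```python
def classify_against_mean(values, tolerance=0.01):
    """Label each value as 'above', 'near', or 'below' the mean of `values`.

    `tolerance` is the relative band around the mean that counts as 'near'
    (0.01 means within 1%). Assumes the mean is positive, as it is for counts.
    """
    values = list(values)
    mean = sum(values) / len(values)
    labels = []
    for v in values:
        if abs(v - mean) <= tolerance * mean:
            labels.append('near')      # inside the band, on either side
        elif v > mean:
            labels.append('above')
        else:
            labels.append('below')
    return labels


# Hypothetical counts, chosen only to exercise all three branches.
counts = [150, 100, 101, 49]           # mean is 100.0
labels = classify_against_mean(counts)
```

Mapping each label to a color (for example with a dictionary) then gives the list that is passed to the bar chart's color argument.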

Visualization 2

For my second visualization, I created a bar chart of the 10 states with the highest number of cars sold. Florida and California stood out, far outpacing the other states. I added a dashed line at the mean of the ten counts, which makes it easy to see which states sit above or below the average of this group and how large the gap is between Florida and California and the rest.

import matplotlib.pyplot as plt

df2 = pd.read_csv(filename, usecols=['state'])

state_counts = df2['state'].value_counts()
top_10_states = state_counts.head(10)
mean_value = top_10_states.mean()

plt.figure(figsize=(18, 10))

bars = plt.bar(top_10_states.index, top_10_states.values, color='steelblue', edgecolor='black')


for bar in bars:
    # Label each bar with its count as a whole number with thousands separators.
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.5,
             f'{int(bar.get_height()):,}', ha='center', va='bottom', fontsize=12)


plt.axhline(mean_value, color='red', linestyle='dashed', linewidth=2)


plt.text(len(top_10_states) - 1, mean_value + 0.5, f'Mean = {round(mean_value, 2)}', 
         color='red', fontsize=14)

plt.title('Top 10 States with the Most Cars Sold (With Mean)', fontsize=18, fontweight='bold')
plt.xlabel('State', fontsize=14)
plt.ylabel('Number of Cars Sold', fontsize=14)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.show()

Visualization 3

For my third visualization, I created a heatmap of car counts for five East Coast states: Maryland, Massachusetts, Pennsylvania, New Jersey, and New York. The rows are values of the dataset's year field from 2010 to 2015. This field describes the vehicle itself and is separate from the sale date, so each cell shows how many cars of a given year were sold in a given state. Cars from 2012 were especially common in Pennsylvania, where that year stood out compared to the others. The heatmap helps illustrate regional differences in which vehicle years were sold in these states.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

hm_df = pd.read_csv(filename, usecols=['year', 'state'])

states_of_interest = ['ma', 'md', 'pa', 'nj', 'ny']
hm_df_filtered = hm_df[(hm_df['year'] >= 2010) & (hm_df['year'] <= 2015) & (hm_df['state'].isin(states_of_interest))]

heatmap_data = hm_df_filtered.groupby(['year', 'state']).size().unstack(fill_value=0)

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))

ax = sns.heatmap(heatmap_data, linewidth=0.2, annot=True, cmap='coolwarm', fmt=',.0f',
                 square=True, annot_kws={'size': 11}, cbar_kws={'format': comma_fmt, 'orientation': 'vertical'})
plt.title('Heatmap of Cars Sold by Vehicle Year (2010 to 2015) and State (MA, MD, PA, NJ, NY)', fontsize=18, pad=15)
plt.xlabel('State', fontsize=18, labelpad=10)
plt.ylabel('Vehicle Year', fontsize=18, labelpad=10)
plt.yticks(rotation=0, size=14)
plt.xticks(size=14)
ax.invert_yaxis()
cbar = ax.collections[0].colorbar
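max_count = int(heatmap_data.to_numpy().max())
# Use at least 1 as the step so range() does not fail when the largest count is below 5.
my_colorbar_ticks = [*range(0, max_count + 1, max(1, max_count // 5))]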
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)

cbar.set_label('Number of Cars Sold', rotation=270, fontsize=14, color='black', labelpad=20)

plt.show()
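
The counting step behind the heatmap above is easier to follow on a small example. The sketch below uses a handful of made-up rows; only the column names (year, state) come from the analysis, and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    'year':  [2012, 2012, 2012, 2013, 2013, 2014],
    'state': ['pa', 'pa', 'nj', 'pa', 'ny', 'nj'],
})

# size() counts the rows for each (year, state) pair. unstack() then moves
# the states into columns, and fill_value=0 fills pairs that never occur.
counts = toy.groupby(['year', 'state']).size().unstack(fill_value=0)
print(counts)
```

The result has one row per year and one column per state, which is the shape that seaborn's heatmap function expects.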

Visualization 4

For my fourth visualization, I created a line plot of the average selling prices of Ford, Chevrolet, and Nissan vehicles. Since these three brands were the most popular in the dataset, I wanted to see how their prices varied with vehicle condition. The plot draws one line per brand, with the condition rating on the x-axis and the average selling price on the y-axis, so it shows both how condition affects price and how the three brands differ at the same condition rating.

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import pandas as pd

lp_df = pd.read_csv(filename, usecols=['make', 'condition', 'sellingprice'])
brands = ['Ford', 'Chevrolet', 'Nissan']
filtered_df = lp_df[lp_df['make'].isin(brands)]
brand_condition_sales = filtered_df.groupby(['make', 'condition'])['sellingprice'].mean().unstack()

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)
for brand in brands:
    ax.plot(brand_condition_sales.columns, brand_condition_sales.loc[brand], marker='o', label=brand)
    
plt.title('Average Selling Price by Condition for Ford, Chevrolet, and Nissan', fontsize=18)
ax.set_xlabel('Condition', fontsize=18)
ax.set_ylabel('Average Selling Price ($)', fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14)
ax.tick_params(axis='y', labelsize=14)
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'${x:,.0f}'))

plt.legend(title='Car Make', fontsize=14)
plt.tight_layout()
plt.show()

Visualization 5

For my fifth and final visualization, I created a pie chart of the top 5 car makes by total MMR. MMR, or Manheim Market Report, is a reference used to estimate the wholesale market value of a vehicle. The pie chart shows each make's share of the combined total MMR of these five makes, along with the dollar amount. Because total MMR is a sum over every vehicle of a make, it reflects how many vehicles were sold as well as how much each one was worth.

import pandas as pd
import matplotlib.pyplot as plt

stacked_df = pd.read_csv(filename, usecols=['make', 'mmr'])
top_5_makes = stacked_df.groupby('make')['mmr'].sum().nlargest(5)

fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)

# Function to display both percentage and total value with a dollar sign
def func(pct, all_values):
    absolute = pct / 100. * sum(all_values)
    return '{:.2f}%\n(${:,.0f})'.format(pct, absolute)

top_5_makes.plot(
    kind='pie', 
    ax=ax, 
    autopct=lambda pct: func(pct, top_5_makes), 
    startangle=90, 
    wedgeprops=dict(edgecolor='white', linewidth=1),  
    textprops={'fontsize': 12}  
)

ax.set_ylabel('')  # hide the default axis label that pandas takes from the column name
plt.title('Top 5 Car Makes by Total MMR', fontsize=18)
plt.show()
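
Since the pie chart above ranks makes by total MMR, it can help to look at the total next to the per-vehicle average and the vehicle count. The sketch below shows one possible way to do that. The rows are made up for illustration and the result column names (total, average, vehicles) are my own; only the make and mmr column names come from the analysis.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
toy = pd.DataFrame({
    'make': ['Ford', 'Ford', 'Ford', 'BMW'],
    'mmr':  [10000, 12000, 11000, 30000],
})

# Named aggregation: one output column per statistic.
summary = toy.groupby('make')['mmr'].agg(total='sum', average='mean', vehicles='count')
summary = summary.sort_values('total', ascending=False)
print(summary)
```

In this toy data, Ford has the larger total because it has more vehicles, while BMW has the higher average per vehicle. Checking both views on the real data would show whether a make's share of the pie comes mainly from volume or from per-vehicle value.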