Introduction

Cars are one of the most popular modes of transportation worldwide, especially in the United States. This report explores the trends throughout the years of cars in US dealerships with car manufacturing by country, the average price of both common and luxury vehicles, and which brands have the highest rating. Different visualizations, such as line plots, bar charts, scatterplots, bump charts, and heatmaps, illustrate key trends, including technological advancements, improved safety standards, and the transition to electric and hybrid vehicles.

Dataset

The dataset used provides insight on car listings in the United States, including 266,847 records and 19 variables, of which I’ve only used 6/19 variables. The variables used include: ‘MODEL’ (Year), ‘BRAND’, ‘PRICE($)’, ‘STOCK TYPE’, ‘RATING’, and ‘MAKE ORIGIN’. Out of the variables missing in this dataset, they were either dropped or replaced with 0. This dataset includes car models from 1919 to 2025.

Findings


import pandas as pd
import matplotlib.pyplot as plt
import warnings
import matplotlib.ticker as ticker
from matplotlib.ticker import FormatStrFormatter
import numpy as np
from matplotlib.ticker import FuncFormatter
import seaborn as sns

warnings.filterwarnings("ignore")

filename = "/Users/maikasa/Downloads/ALL_CARS_DATA.csv"

df = pd.read_csv(filename, usecols=['MODEL', 'BRAND', 'PRICE($)', 'STOCK TYPE', 'RATING', 'MAKE ORIGIN'])

Line Plot

Which Common Car Brands have the Highest Average Price?

This multiple line plot illustrates the price trends for popular car brands from 2015 to 2025, focusing on everyday cars like Nissan, Honda, and Toyota. In the early years, such as 2015 to 2020, the average price between car brands were relatively close, but over time, prices steadily increase, making the gap between them bigger. This could be a result of technological advancements, the shift toward electric and hybrid vehicles, and stricter safety standards. Also, a few brands, like Ford, Chevrolet, and Toyota, show a price decrease for the 2025 model year, which is possibly due to the limited availability of the newer models.

# Choose common brands
common_brands = ['Chevrolet', 'Ford', 'Honda', 'Jeep', 'Kia', 'Nissan', 'Subaru', 'Toyota']

# Filter for only those brands for the past 10 years
common_df = df[df['BRAND'].isin(common_brands)]
common_df = common_df[common_df['MODEL'] >= 2015]

# Find the average price throughout the years
avgprice = common_df.groupby(['BRAND', 'MODEL'])['PRICE($)'].mean().reset_index()

# Create line plot
fig1 = plt.figure(figsize=(12, 6))
ax1 = fig1.add_subplot(1, 1, 1)
for brand, grp in avgprice.groupby('BRAND'):
    ax1.plot(grp['MODEL'], grp['PRICE($)'], label=brand, marker='o')

# Format line plot
plt.title('Average Price of Common Car Brands (2015-)', fontsize=18, fontfamily='Times New Roman', fontweight='bold')

# Format legend
legend = plt.legend(loc="upper left", title='Brand', fontsize=9, title_fontsize=11, prop={'family': 'Times New Roman'})
legend.get_title().set_fontsize(11)
legend.get_title().set_fontfamily('Times New Roman')
legend.get_title().set_fontweight('bold')

# Format x-axis
ax1.set_xticks(avgprice['MODEL'].unique())
ax1.set_xlabel('Year', fontsize=16, fontfamily='Times New Roman', fontweight='bold')
ax1.tick_params(axis='x', labelsize=14, rotation=0)
ax1.set_xticklabels(ax1.get_xticks(), fontfamily='Times New Roman')
ax1.xaxis.set_major_formatter(FormatStrFormatter('%d'))

# Format y-axis
ax1.set_ylabel('Average Price', fontsize=16, labelpad=20, fontfamily='Times New Roman', fontweight='bold')
ax1.tick_params(axis='y', labelsize=14, rotation=0)
ax1.set_yticklabels(ax1.get_yticks(), fontfamily='Times New Roman')
ax1.get_yaxis().set_major_formatter(ticker.FuncFormatter(lambda x, _: f'${int(x):,}'))

plt.show()

Bar Chart

Which Brands have the Most Amount of New and Used Cars?

This stacked bar chart illustrates the number of new and used cars sold by the top twenty brands at dealerships across the United States. While several brands have more used cars than new, others, like Ford, Mercedes, and GMC, show a higher volume of new cars, which may suggest a greater demand for newer vehicles in these brands.

# Filter out new and used cars
new_used_cars = df[df['STOCK TYPE'].isin(['New', 'Used'])]

# Count the total cars in each brand and filter for those with new AND used
brandcount = new_used_cars.groupby(['BRAND', 'STOCK TYPE']).size().reset_index(name='count')
brands_new_used = brandcount.groupby('BRAND')['STOCK TYPE'].nunique()
brands_with_both = brands_new_used[brands_new_used == 2].index

# Filter for brands with new and used cars
filtered_brands = brandcount[brandcount['BRAND'].isin(brands_with_both)]

# Find the total car count to get the top 20 brands for graph
total_carcount = filtered_brands.groupby('BRAND')['count'].sum()
top20brands = total_carcount.nlargest(20)

# Count the new and used cars for the top 20 brands and separate them into different columns
top20brand_counts = brandcount[brandcount['BRAND'].isin(top20brands.index)]
top20brand_counts = top20brand_counts.pivot(index='BRAND', columns='STOCK TYPE', values='count')

# Sort by the total counts of new and used cars
top20brand_counts['Total'] = top20brand_counts.sum(axis=1)
top20brand_counts = top20brand_counts.sort_values(by='Total', ascending=False)

# Drop the total column to not show in graph
top20brand_counts = top20brand_counts.drop(columns='Total')

# Create the horizontal stacked bar chart
fig2, ax = plt.subplots(figsize=(12, 6))
top20brand_counts.plot(kind='barh', stacked=True, color=['pink', 'red'], ax=ax)

# Format title
ax.set_title('Car Count by Brand and Stock Type', fontsize=18, fontfamily='Times New Roman', fontweight='bold')

# Format x-axis
ax.set_xlabel('Number of Cars', fontsize=16, fontfamily='Times New Roman', fontweight='bold')
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.set_xticklabels(ax.get_xticks(), fontfamily='Times New Roman')
ax.get_xaxis().set_major_formatter(ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

# Format y-axis
ax.set_ylabel('Brand', fontsize=16, fontfamily='Times New Roman', fontweight='bold')
ax.tick_params(axis='y', labelsize=14, rotation=0)
ax.set_yticklabels(top20brand_counts.index, fontfamily='Times New Roman')

# Format legend
legend2 = plt.legend(loc="upper right", title='Stock Type', fontsize=12, title_fontsize=14, prop={'family': 'Times New Roman'})
legend2.get_title().set_fontfamily('Times New Roman')
legend2.get_title().set_fontweight('bold')

plt.tight_layout()
plt.show()

Scatterplot

Which Brands have the Highest Ratings?

This scatterplot illustrates color coded average customer ratings for the top ten brands that appear the most in this dataset. Each dot represents a brand’s average rating over the years, with colors indicating showing the rating levels: purple to blue for lower ratings, blue to green for average ratings, and green to yellow for higher ratings. Over the years, ratings for most brands initially started high, dropped slightly, then rose again.

# Count the brands and find the top 15 after 2000
brandcount2 = df['BRAND'].value_counts()
top15 = brandcount2.head(15).index.tolist()
df_top15 = df[(df['BRAND'].isin(top15)) & (df['MODEL'] >= 2000)][['BRAND', 'MODEL', 'RATING']]

# Find the average ratings
averageratings = df_top15.groupby(['BRAND', 'MODEL'])['RATING'].mean().reset_index(name='avg ratings')

# Create scatterplot
plt.figure(figsize=(14, 8))
plt.scatter(averageratings['MODEL'], averageratings['BRAND'], marker='8', cmap='viridis', c=averageratings['avg ratings'],
            s=averageratings['avg ratings'] * 25)

plt.title('Ratings by Brand and Year', fontsize=18, fontfamily='Times New Roman', fontweight='bold')

# Format x-axis
plt.xlabel('Year', fontsize=16, fontfamily='Times New Roman', fontweight='bold')
ax2 = plt.gca()
ax2.tick_params(axis='x', labelsize=14, rotation=0)
ax2.set_xticklabels(ax2.get_xticks(), fontfamily='Times New Roman')
ax2.xaxis.set_major_formatter(FormatStrFormatter('%d'))

# Format y-axis
plt.ylabel('Brand', fontsize=16, fontfamily='Times New Roman', fontweight='bold')
ax2.tick_params(axis='y', labelsize=14, rotation=0)
ax2.set_yticklabels(ax2.get_yticklabels(), fontname='Times New Roman', fontsize=14)

# Format color bar
cbar = plt.colorbar()
cbar.set_label('Average Rating', rotation=270, fontsize=16, color='black', labelpad=20,
               fontname='Times New Roman', fontweight='bold')
plt.show()

Bump Chart

Which Luxury Brands have the Highest Average Price?

This bump chart illustrates the ranking of five luxury car brands based on their average price over the past ten years. Some brands, like Lamborghini, Ferrari, and especially Rolls-Royce, remained steady in their rankings for a few years, while others fluctuated at the 4th and 5th rank. Also, most brands experienced small fluctuations, like moving one rank up or down, except Lamborghini, which dropped two ranks from 2021 to 2023.

# My choosing of luxury brands in the dataset
luxury_brands = ['Rolls-Royce', 'Lamborghini', 'Ferrari', 'Porsche', 'Maserati']

# Filter for only those brands after 2015
filtered_df = df[df['BRAND'].isin(luxury_brands)]
filtered_df = filtered_df[filtered_df['MODEL'] >= 2015]

# Find the total price for each brand in each year
avg_price = filtered_df.groupby(['BRAND', 'MODEL'])['PRICE($)'].mean().reset_index(name='avg_price')

# Pivot to change format of the layout for bump chart
filtered_df = avg_price.pivot(index='BRAND', columns='MODEL', values='avg_price')

# Create a new dataframe for rankings
filtered_df_ranked = filtered_df.rank(0, ascending=False, method='dense').astype(int)
filtered_df_ranked = filtered_df_ranked.T

# Create chart
fig4 = plt.figure(figsize=(14, 10))
ax4 = fig4.add_subplot(1, 1, 1)
filtered_df_ranked.plot(kind='line', ax=ax4, marker='o', markeredgewidth=1, linewidth=6, markersize=44,
                        markerfacecolor='white')

# Invert axis so 1 shows at the top
ax4.invert_yaxis()

# Format title
plt.title('Ranking by Average Price of Luxury Cars (2015-)', fontsize=18, pad=20, fontfamily='Times New Roman',
          fontweight='bold')

# Format x axis
plt.xlabel('Year', fontsize=16, labelpad=10, fontfamily='Times New Roman', fontweight='bold')
ax4.set_xticks(filtered_df_ranked.index)
ax4.set_xticklabels(ax4.get_xticks(), fontfamily='Times New Roman', fontsize=14)

# Format y axis
plt.ylabel('Yearly Ranking', fontsize=16, labelpad=10, fontfamily='Times New Roman', fontweight='bold',)
y_ticks = range(1, filtered_df_ranked.values.max() + 1)
plt.yticks(y_ticks)
ax4.set_yticklabels(ax4.get_yticks(), fontfamily='Times New Roman', fontsize=14)

# Format legend
ax4.legend(bbox_to_anchor=(1.01, 1.01), fontsize=16, labelspacing=1, markerscale=0.4, borderpad=1,
           handletextpad=0.8, prop={'family': 'Times New Roman', 'size': 14})

# Add numbers to the markers
i = 0
j = 0
for eachcol in filtered_df_ranked.columns:
    for eachrow in filtered_df_ranked.index:
        this_rank = filtered_df_ranked.iloc[i, j]
        ax4.text(eachrow, this_rank, '$' + str(round(filtered_df.iloc[j, i]/1e3,1)) + 'K',
        ha='center',va='center',fontsize=10, fontfamily='Times New Roman')
        i+=1
    j+=1
    i=0
    
plt.tight_layout()
plt.show()

Heat Map

Which Country Produces the Most Cars Driven by Americans?

This heatmap displays the number of cars produced by country of origin and year. Most countries show little to no car production from 2015 to 2025, except Germany, Japan, and the USA, with higher numbers than others. This is because many of the popular cars in the United States are manufactured in these three countries, including brands like Toyota, Nissan, Ford, BMW, Audi, and GMC. Cars produced in Italy and France, such as Bugatti and Ferrari, are less common in the US as they’re very expensive.

# Group USA and United States together
df['MAKE ORIGIN'] = df['MAKE ORIGIN'].replace({'United States': 'USA'})

# Group data by year and origin and count the number of cars from them
heatmap_df = df.groupby(['MODEL', 'MAKE ORIGIN']).size().reset_index(name='count')

# Remove NA
heatmap_df = heatmap_df[heatmap_df['MODEL'].notna() & heatmap_df['MAKE ORIGIN'].notna()]

# Filter to only include from after 2015
heatmap_df = heatmap_df[heatmap_df['MODEL'] >= 2015]

# Pivot table and fill NA with 0
heatmap_df = pd.pivot(heatmap_df, index='MAKE ORIGIN', columns='MODEL', values='count')
heatmap_df = heatmap_df.fillna(0)

# Create heatmap
fig3 = plt.figure(figsize=(17, 12))
ax3 = fig3.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
ax3 = sns.heatmap(heatmap_df, linewidth=0.2, annot=True, cmap='YlOrRd', fmt=',.0f', square=True,
                  annot_kws={'size': 13, 'fontname': 'Times New Roman'},
                  cbar_kws={'format': comma_fmt, 'orientation': 'vertical'})

# Format title
plt.title('Number of Cars Made by Country Origin and Year', fontsize=18, pad=15,
          fontfamily='Times New Roman', fontweight='bold')

# Format x-axis
plt.xlabel('Year', fontsize=16, labelpad=10, fontfamily='Times New Roman', fontweight='bold')
ax3.set_xticklabels(ax3.get_xticklabels(), fontname='Times New Roman', fontsize=14)

# Format y-axis
plt.ylabel('Country Origin', fontsize=16, labelpad=10, fontfamily='Times New Roman', fontweight='bold')
ax3.set_yticklabels(ax3.get_yticklabels(), fontname='Times New Roman', fontsize=14)

cbar = ax3.collections[0].colorbar

# Format and title the colorbar
cbar.set_label('Number of Cars', rotation=270, fontsize=16, labelpad=25,
               fontfamily='Times New Roman', fontweight='bold')
cbar.ax.set_yticklabels(cbar.ax.get_yticklabels(), fontname='Times New Roman', fontsize=12)

plt.show()

Conclusion

Overall, this report shows the trends in car availability in the United States, based on the car model, brand, country of origin, stock type, and rating. By looking into these variables, it was noted that the increase in ratings and price are likely due to technological advancements, the shift to hybrid and eletric vehicles, as well as increased safety standards.