Introduction

For module 2, I decided to look at a data set containing information about Walmart customers and their purchases. I wanted to uncover patterns in customer spending behavior by product category, customer age, date, city, and whether or not a discount was used. My goal is for this analysis to inform business decisions by showing how customers are spending their money and why.

Description of Project/Data Set

The first thing I did was look at some basic information about the data set, like what each of the variables was (and its type), and whether there were any missing/null values that I might have to deal with. After printing out that information, I found that there were no missing values, so I was set to continue working. I then looked at the basic summary statistics for the numerical variables in the data set, which were age, purchase amount, and rating. It was good to see how many numerical variables there were so I knew what kinds of visualizations I would be able to make with each of them. It was interesting to see that the average purchase amount was around 255 dollars with a max of around 500 dollars, as I thought there would be some higher numbers involved. The minimum age was 18, so there are no minors in this data set. The average rating was right around 3, which sits in the middle of the 1 to 5 scale.

Data Visualization

After looking at these statistics and seeing what kind of variables I was working with, it was time to start creating my visualizations to help tell the story of this data set, and try to explain what customers had been spending their money on and how/why.

The first visualization I decided to make was a simple bar chart showing the average purchase amount in each spending category. There are four categories: beauty, clothing, electronics, and home. I was expecting to see a lot of variation between the categories, but I instead found that all of them were around the same amount. That amount is right around the average of 255 dollars that I had previously found in the summary statistics for the variable as a whole. This was an easy visualization to make, but I think it provided some valuable information about what customers are spending their money on. No one category draws noticeably more spending per purchase than the others.

The second visualization I made was a scatter plot with a trend line that again shows average purchase amounts, but this time I wanted to see whether the average differed by the customer’s age group. The ages range from 18 to 60, and before making the chart I assumed there would be a negative association between purchase amount and age, meaning that as age increases, average purchase amount decreases. After making the chart, I did find a negative association, but it wasn’t as dramatic as I thought it would be. My first version of the chart didn’t have the trend line, but the points were so spread out across the plot that I needed the trend line to confirm the direction of the association. The most interesting part of the scatter plot for me was around the middle, with the age groups from 42 through 51. The two groups covering ages 42-48 had the two lowest points on the graph, but the 48-51 group had the highest point. It was interesting to see that much variation across such a small difference in age.
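To put a number on the direction of the trend line, the slope of a straight-line fit can be computed directly. This is only a minimal sketch: the function name `age_trend`, the `bin_width` parameter, and the toy data are illustrative assumptions rather than part of the original analysis, and `np.polyfit` is used here as a stand-in for the line that seaborn draws. Note that the chart code fits against the position of each age group, while this sketch fits against the lower edge of each age group, so the slope here is in dollars per year of age.

```python
import numpy as np
import pandas as pd


def age_trend(df, bin_width=3):
    """Fit a straight line to the average purchase amount per age bin."""
    # Assumed column names: "Age" and "Purchase_Amount", as listed by df.info().
    bins = list(range(15, 70, bin_width))
    groups = pd.cut(df["Age"], bins=bins, right=False)
    # observed=True drops age bins that contain no customers.
    means = df.groupby(groups, observed=True)["Purchase_Amount"].mean()
    # Use the left edge of each bin as the x value for the fit.
    x = np.array([interval.left for interval in means.index], dtype=float)
    slope, intercept = np.polyfit(x, means.to_numpy(), deg=1)
    return slope, intercept


# Toy data (hypothetical, not the Walmart file): spending falls by 2 dollars per year of age.
toy = pd.DataFrame({"Age": list(range(18, 60))})
toy["Purchase_Amount"] = 300 - 2 * toy["Age"]

slope, intercept = age_trend(toy)
print(f"slope per year of age: {slope:.2f}")
```

A negative slope corresponds to the negative association described above; calling `age_trend(df)` on the real data frame would show how steep that association actually is.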

The third visualization I made was a heat map, again showing average purchase amount, but this time by month. The darker rectangles are months where the average was lower, and the lighter ones are months where it was higher. What was most interesting on this chart was that there is only an 11 dollar difference between the highest value (259) and the lowest value (248). Walmart’s customers appear to be extremely consistent in their spending habits, because every variable I have compared with average purchase amount so far has shown little to no variation.

For the fourth visualization, I wanted to make a bump chart, so I needed to decide which part of my data set would work best for ranking. I wanted to examine the city variable to see if there was any variation in spending there, so I created a ranking chart to show which cities spent the most in each month covered by this data set. At first I was using all of the cities, but I quickly realized I needed to cut that number down in order to have a clean, effective visualization. I decided to keep just the top 10 cities by total spending. This chart showed a lot of variation. No one city stayed at the top or the bottom of the rankings for long, and each of the cities moved around in the rankings a great deal over the time period displayed. It was good to see variation here, because I had not seen much in my other three charts.

For the last visualization, I wanted to look at time again, and I decided to look at the discount applied variable this time. I wanted to know if there was a difference between how much money was being spent with discounts versus without discounts. I thought more money would be spent with discounts, because a discount usually inspires customers to spend more overall when they think they are getting a good deal. What I found was an almost exact 50/50 split every single month. There was never more than a one percentage point difference between the two categories. I also decided to display the total amount spent per month in the middle of each chart to get an idea of what was being spent. With 50,000 purchases averaging about 255 dollars over roughly a year, that amount works out to around 1 million dollars each month. The exception is the last month, which I am assuming is much lower because not all of its data had been collected when this data set was created.

import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from matplotlib.ticker import FuncFormatter

warnings.filterwarnings("ignore")
path = "/Users/rubysullivan/Desktop"
filename = path + '/Walmart_customer_purchases.csv'
df = pd.read_csv(filename)

print("First 5 rows:")
## First 5 rows:
print(df.head())
##                             Customer_ID  Age  ... Rating Repeat_Customer
## 0  84607c1f-910c-44d5-b89f-e1ee06dd34c0   49  ...      1             Yes
## 1  f2a81712-a73e-4424-8b39-4c615a0bd4ea   36  ...      1              No
## 2  da9be287-8b0e-4688-bccd-1a2cdd7567c6   52  ...      1              No
## 3  50ec6932-3ac7-492f-9e55-4b148212f302   47  ...      2             Yes
## 4  8fdc3098-fc75-4b0f-983c-d8d8168c6362   43  ...      2             Yes
## 
## [5 rows x 12 columns]
print("\nDataFrame Info:")
## 
## DataFrame Info:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 50000 entries, 0 to 49999
## Data columns (total 12 columns):
##  #   Column            Non-Null Count  Dtype  
## ---  ------            --------------  -----  
##  0   Customer_ID       50000 non-null  object 
##  1   Age               50000 non-null  int64  
##  2   Gender            50000 non-null  object 
##  3   City              50000 non-null  object 
##  4   Category          50000 non-null  object 
##  5   Product_Name      50000 non-null  object 
##  6   Purchase_Date     50000 non-null  object 
##  7   Purchase_Amount   50000 non-null  float64
##  8   Payment_Method    50000 non-null  object 
##  9   Discount_Applied  50000 non-null  object 
##  10  Rating            50000 non-null  int64  
##  11  Repeat_Customer   50000 non-null  object 
## dtypes: float64(1), int64(2), object(9)
## memory usage: 4.6+ MB
print("\nMissing Values Per Column:")
## 
## Missing Values Per Column:
print(df.isnull().sum())
## Customer_ID         0
## Age                 0
## Gender              0
## City                0
## Category            0
## Product_Name        0
## Purchase_Date       0
## Purchase_Amount     0
## Payment_Method      0
## Discount_Applied    0
## Rating              0
## Repeat_Customer     0
## dtype: int64
print("\nSummary Statistics:")
## 
## Summary Statistics:
print(df.describe())
##                 Age  Purchase_Amount        Rating
## count  50000.000000     50000.000000  50000.000000
## mean      38.945220       255.532230      2.998680
## std       12.398137       141.574416      1.417956
## min       18.000000        10.010000      1.000000
## 25%       28.000000       133.050000      2.000000
## 50%       39.000000       255.045000      3.000000
## 75%       50.000000       378.912500      4.000000
## max       60.000000       499.990000      5.000000
cat_columns = df.select_dtypes(include=['object']).columns
for col in cat_columns:
    unique_vals = df[col].nunique()
    print(f"Unique values in '{col}': {unique_vals}")
## Unique values in 'Customer_ID': 50000
## Unique values in 'Gender': 3
## Unique values in 'City': 25096
## Unique values in 'Category': 4
## Unique values in 'Product_Name': 16
## Unique values in 'Purchase_Date': 366
## Unique values in 'Payment_Method': 4
## Unique values in 'Discount_Applied': 2
## Unique values in 'Repeat_Customer': 2

Visualization 1: Bar chart showing average amount spent per category

# average amount spent per product category bar chart
category_avg_spent = df.groupby("Category", as_index=False)["Purchase_Amount"].mean()

plt.figure(figsize=(12, 6))
ax = sns.barplot(data=category_avg_spent, x="Category", y="Purchase_Amount", palette="viridis", edgecolor="black")
for bar in ax.patches:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2,
            f'${bar.get_height():.2f}', fontsize=12, ha='center', color='black')
plt.xlabel("Product Category", fontsize=14)
plt.ylabel("Average Amount Spent ($)", fontsize=14)
plt.title("Average Amount Spent per Product Category", fontsize=16)
plt.xticks(rotation=45, ha="right");

plt.show()


Visualization 2: Scatter plot of average purchase amount by age group with a trend line

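# scatter plot of average purchase amount per 3-year age group, with trendline
df["Age_Group"] = pd.cut(df["Age"], bins=range(15, 70, 3), right=False)
# observed=True leaves out age bins that contain no customers (e.g. 15-18)
age_group_avg = df.groupby("Age_Group", observed=True)["Purchase_Amount"].mean().reset_index()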
age_group_avg["Age_Group"] = age_group_avg["Age_Group"].astype(str)


plt.figure(figsize=(12, 6))
sns.regplot(data=age_group_avg, x=age_group_avg.index, y="Purchase_Amount", scatter=True,
            ci=None, marker="o", line_kws={"linestyle": "dashed"})
plt.xticks(ticks=range(len(age_group_avg)), labels=age_group_avg["Age_Group"], rotation=45, ha="right");
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Average Purchase Amount ($)", fontsize=14)
plt.title("Average Purchase Amount by Age Group (With Trend Line)", fontsize=16)

plt.show()
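The code for the third visualization (the heat map of average purchase amount by month) does not appear in this listing. The following is a minimal sketch of how such a chart could be built, not the original code: the function names, the single-row layout, the "Year-Month" labels, and the `viridis` colormap (in which darker cells are lower values) are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def monthly_average_table(df):
    """Return a one-row table of average purchase amount per calendar month."""
    dates = pd.to_datetime(df["Purchase_Date"])
    # Label each purchase with its year and month, e.g. "2024-03".
    month = dates.dt.to_period("M").astype(str)
    monthly_avg = df["Purchase_Amount"].groupby(month).mean()
    # A single-row frame lets seaborn draw one rectangle per month.
    return monthly_avg.to_frame(name="Average Purchase Amount ($)").T


def plot_monthly_heatmap(df):
    """Draw the one-row heat map (assumed layout and colormap)."""
    table = monthly_average_table(df)
    fig, ax = plt.subplots(figsize=(14, 3))
    # With viridis, darker cells are lower averages and lighter cells are higher.
    sns.heatmap(table, annot=True, fmt=".0f", cmap="viridis", cbar=True, ax=ax)
    ax.set_xlabel("Year-Month")
    ax.set_title("Average Purchase Amount by Month")
    plt.tight_layout()
    plt.show()


# Usage with the data frame loaded earlier:
# plot_monthly_heatmap(df)
```

Keeping the table-building step separate from the plotting step makes it easy to print the monthly averages and check the highest and lowest values before drawing anything.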

Visualization 4: Bump chart showing monthly spending rank over time for the top 10 cities by total purchase amount

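# bump chart showing ranking over time for the top 10 cities by total purchase amount
# Purchase_Date is read in as text (object dtype), so convert it before using the .dt accessor
df["Purchase_Date"] = pd.to_datetime(df["Purchase_Date"])
df["Year"] = df["Purchase_Date"].dt.year
df["Month"] = df["Purchase_Date"].dt.month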
bump_df = df.groupby(['Year', 'Month', 'City'])['Purchase_Amount'].sum().reset_index()
bump_df = bump_df.rename(columns={'Purchase_Amount': 'TotalSpending'})
top_cities = bump_df.groupby("City")["TotalSpending"].sum().nlargest(10).index
bump_df = bump_df[bump_df["City"].isin(top_cities)]

bump_df_pivot = bump_df.pivot(index=['Year', 'Month'], columns='City', values='TotalSpending')
# a month with no purchases in a city counts as zero spending, so that city ranks last rather than first
bump_df_pivot = bump_df_pivot.fillna(0)
bump_df_ranked = bump_df_pivot.rank(axis=1, ascending=False, method='min')
bump_df_ranked = bump_df_ranked.apply(pd.to_numeric, errors='coerce')
bump_df_ranked.index = [f"{year}-{month:02d}" for year, month in bump_df_ranked.index]
fig, ax = plt.subplots(figsize=(14, 7))
for city in bump_df_ranked.columns:
    ax.plot(bump_df_ranked.index, bump_df_ranked[city], marker='o', linestyle='-', markersize=5, label=city)
ax.invert_yaxis()
plt.ylabel('City Spending Rank (1 = Highest)', fontsize=14, labelpad=10)
plt.xlabel('Year-Month', fontsize=14)
plt.title('City Spending Rankings Over Time (Top 10 Cities Only)', fontsize=16, pad=15)
plt.xticks(rotation=45, ha="right", fontsize=10);
ax.legend(title="Cities", bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=10,
          labelspacing=0.5, markerscale=0.8, borderpad=0.8, handletextpad=0.8, frameon=False)
plt.subplots_adjust(bottom=0.2, right=0.75)

plt.show()
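The ranking step behind the bump chart can be seen on a very small table. This is a sketch with made-up city names and totals (assumptions for illustration only, not values from the data set), showing how `rank(axis=1, ascending=False, method="min")` turns monthly totals into ranks.

```python
import pandas as pd

# Hypothetical monthly totals for three made-up cities (not values from the data set).
spending = pd.DataFrame(
    {"City A": [300.0, 120.0], "City B": [250.0, 400.0], "City C": [250.0, 90.0]},
    index=["2024-01", "2024-02"],
)

# Rank across each row: the largest total in a month gets rank 1.
# method="min" gives tied cities the same (better) rank.
ranks = spending.rank(axis=1, ascending=False, method="min")
print(ranks)
```

In the first month City B and City C are tied, so both receive rank 2. In the second month City B moves to rank 1 and City A drops to rank 2, which is the kind of crossing line a bump chart displays.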

Visualization 5: Monthly pie charts of purchase amount split by discount usage

def plot_discount_pie_charts():
    """Draw one donut chart per month showing total spending with and without a discount."""

    df["Purchase_Date"] = pd.to_datetime(df["Purchase_Date"])
    df["Year-Month"] = df["Purchase_Date"].dt.to_period("M").astype(str)
    df["Month-Date"] = pd.to_datetime(df["Year-Month"])
    df["Month-Label"] = df["Month-Date"].dt.strftime("%m-%Y")
    sorted_months = df.drop_duplicates("Month-Label").sort_values("Month-Date")["Month-Label"].values

    cols = 5
    rows = 3
    colors = ['#1f77b4', '#20b2aa']
    label_map = {"No": "No Discount", "Yes": "Discount Applied"}

    fig, axes = plt.subplots(rows, cols, figsize=(cols * 2, rows * 2.5))
    axes = axes.flatten()

    for i, month in enumerate(sorted_months):
        ax = axes[i]
        month_data = df[df["Month-Label"] == month].groupby("Discount_Applied")["Purchase_Amount"].sum()
        values = month_data.values
        total = values.sum()

        wedges, _, _ = ax.pie(
            values,
            labels=None,
            colors=colors,
            startangle=90,
            autopct=lambda pct: f"{pct:.1f}%" if pct > 0 else "",
            textprops=dict(color="black", fontsize=7),
            wedgeprops=dict(width=0.4),
            pctdistance=0.8
        )

        ax.set_title(month, fontsize=9, pad=10)
        ax.text(0, 0, f"${int(total):,}", ha='center', va='center', fontsize=7, fontweight='bold')
        ax.set(aspect="equal")

    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    fig.legend(
        handles=wedges,
        labels=[label_map["No"], label_map["Yes"]],
        loc="lower right",
        bbox_to_anchor=(0.85, 0.20),
        fontsize=9,
        frameon=False
    )

    plt.suptitle("Monthly Purchase Amount Split by Discount Usage", fontsize=13)
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.show()

plot_discount_pie_charts()
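The percentages shown on the charts can also be tabulated, which makes the size of the monthly gap between the two groups easy to check. This is a minimal sketch under stated assumptions: the helper name `discount_share_by_month` and the toy data are hypothetical, and the column names are taken from the `df.info()` output above.

```python
import pandas as pd


def discount_share_by_month(df):
    """Percentage of each month's spending made with and without a discount."""
    month = pd.to_datetime(df["Purchase_Date"]).dt.to_period("M").astype(str)
    totals = (
        df.assign(Month=month)
        .groupby(["Month", "Discount_Applied"])["Purchase_Amount"]
        .sum()
        .unstack(fill_value=0.0)
    )
    # Divide each row by its own total so the shares in a month sum to 100.
    return totals.div(totals.sum(axis=1), axis=0) * 100


# Toy data (hypothetical, not the Walmart file).
toy = pd.DataFrame({
    "Purchase_Date": ["2024-01-03", "2024-01-19", "2024-02-07", "2024-02-21"],
    "Discount_Applied": ["No", "Yes", "No", "Yes"],
    "Purchase_Amount": [100.0, 300.0, 50.0, 50.0],
})

shares = discount_share_by_month(toy)
print(shares.round(1))
```

On the real data frame, `(shares["Yes"] - shares["No"]).abs().max()` would give the largest monthly gap between the two groups in percentage points.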

Conclusion

Overall, this analysis showed that Walmart customers are pretty consistent in how much they spend. The average purchase amount stayed right around $255 no matter the category, age group, or month. The one area where there was some variation was in the cities, where spending rankings moved around a lot from month to month. Discounts didn’t seem to make a big difference either, since total spending with and without them was split almost 50/50. These results could help Walmart plan inventory more evenly across departments, look into what’s causing the shifts in certain cities, and rethink how useful discounts really are when it comes to getting people to spend more.