For Module 2, I chose a data set containing information about Walmart customers and their purchases. My goal was to uncover patterns in customer spending by product category, customer age, purchase date, city, and whether or not a discount was applied. I want this analysis to inform business decisions by showing how customers are spending their money and why.
The first thing I did was look at some basic information about the data set: what each variable was (and its type), and whether there were any missing/null values I might have to deal with. The output showed no missing values in any of the 12 columns, so I was set to continue working. I then looked at the summary statistics for the three numerical variables in the data set: age, purchase amount, and rating. Knowing which variables were numerical told me what kinds of visualizations I would be able to make with each one. It was interesting to see that the average purchase amount was around 255 dollars with a maximum of around 500, as I thought there would be some higher numbers involved. The minimum age was 18, so there are no minors in this data set. The average rating was right around 3, the midpoint of the 1 to 5 scale.
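The two checks described above (missing values and summary statistics) can be sketched on a tiny example. The rows below are made up for illustration and are not taken from the Walmart file; the `missing_report` helper is a hypothetical name, and a missing `City` value is included only so the check has something to find.

```python
import pandas as pd

# Made-up rows for illustration only; the real data set has 50,000 rows
# and no missing values.
sample = pd.DataFrame({
    "Age": [18, 35, 60, 42],
    "Purchase_Amount": [10.0, 250.0, 500.0, 260.0],
    "Rating": [1, 3, 5, 3],
    "City": ["A", "B", "C", None],  # one deliberately missing value
})


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Return missing-value counts, keeping only columns that have any."""
    counts = frame.isnull().sum()
    return counts[counts > 0]


report = missing_report(sample)
print(report)

# describe() summarizes only the numeric columns by default.
numeric_summary = sample.describe()
print(numeric_summary.loc[["mean", "min", "max"]])
```

An empty report means there is nothing to clean, which is the situation the real data set was in.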
After looking at these statistics and seeing what kinds of variables I was working with, it was time to start creating visualizations to help tell the story of this data set and explain what customers had been spending their money on, and how and why.
The first visualization I made was a simple bar chart showing the average purchase amount in each spending category. There are four categories: beauty, clothing, electronics, and home. I was expecting to see a lot of variation between the categories, but instead found that each of them was around the same amount, right around the overall average of 255 dollars that I had found in the summary statistics. This was an easy visualization to make, but I think it provided valuable information about what customers are spending their money on: there is no one category where customers spend noticeably more per purchase than in the others.
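One way to put a number on "around the same amount" is to compare the gap between the highest and lowest category averages to the overall average. This is only a sketch: the purchase amounts below are made up, and only the column names `Category` and `Purchase_Amount` come from the data set.

```python
import pandas as pd

# Made-up purchases for illustration; not values from the Walmart file.
sample = pd.DataFrame({
    "Category": ["Beauty", "Beauty", "Clothing", "Clothing",
                 "Electronics", "Electronics", "Home", "Home"],
    "Purchase_Amount": [250.0, 260.0, 240.0, 270.0,
                        255.0, 257.0, 252.0, 256.0],
})

# Average purchase amount per category (the heights of the bars).
category_means = sample.groupby("Category")["Purchase_Amount"].mean()

# Gap between the highest and lowest bar, in dollars and relative
# to the overall average purchase amount.
spread = category_means.max() - category_means.min()
relative_spread = spread / sample["Purchase_Amount"].mean()

print(category_means)
print(f"Spread: ${spread:.2f} ({relative_spread:.1%} of the overall average)")
```

A relative spread of only a percent or two would back up the reading that no category stands out.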
The second visualization I made was a scatter plot with a trend line. It again shows average purchase amounts, but this time I wanted to see whether the average differed by the customer's age group. The ages range from 18 to 60, and before making the chart I assumed there would be a negative association between purchase amount and age, meaning that as age increases, average purchase amount decreases. After making the chart, I did find a negative association, but it wasn't as dramatic as I thought it would be. My first version of this chart didn't have the trend line, but the points were so spread out across the plot that the trend line was necessary to confirm the type of association between the variables. The most interesting part of the scatter plot for me was around the middle, with the age groups 42 through 51. Ages 42-48 had the two lowest points on the graph, but 48-51 had the highest point. It was interesting to see that much variation across such a small difference in age.
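The direction of the trend line can also be checked numerically by fitting a straight line to the age-group averages. The sketch below uses made-up ages and amounts and fits the line with NumPy's `polyfit`, which is a stand-in for the fit that seaborn draws, not a copy of it.

```python
import numpy as np
import pandas as pd

# Made-up customers for illustration; not values from the Walmart file.
sample = pd.DataFrame({
    "Age": [18, 19, 21, 23, 24, 26, 27, 29],
    "Purchase_Amount": [300.0, 280.0, 270.0, 250.0,
                        260.0, 240.0, 230.0, 210.0],
})

# Three-year age bins, closed on the left: [18, 21), [21, 24), ...
sample["Age_Group"] = pd.cut(sample["Age"], bins=range(18, 31, 3), right=False)
group_means = sample.groupby("Age_Group", observed=True)["Purchase_Amount"].mean()

# Fit a degree-1 polynomial (a straight line) to the group averages,
# using the position of each age group as the x value.
x = np.arange(len(group_means))
slope, intercept = np.polyfit(x, group_means.to_numpy(), 1)

print(group_means)
print(f"Slope per age group: {slope:.2f} dollars")
```

A negative slope means the average purchase amount falls as the age groups get older; its size shows how many dollars the average changes from one group to the next.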
The third visualization I made was a heat map, again of average purchase amount, but this time broken down by month. The darker rectangles are the months where the average purchase amount was lower, and the lighter ones are where it was higher. What was most interesting on this chart was that there is only an 11 dollar difference between the highest value (259) and the lowest value (248). Walmart's customers must be extremely consistent in their spending habits, because every variable I have compared with average purchase amount so far has shown little to no variation.
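The gap between the highest and lowest month can be pulled out of the same grouped averages that feed the heat map. This is a minimal sketch with made-up dates and amounts; only the column names `Purchase_Date` and `Purchase_Amount` come from the data set.

```python
import pandas as pd

# Made-up purchases for illustration; not values from the Walmart file.
sample = pd.DataFrame({
    "Purchase_Date": ["2024-02-10", "2024-02-20", "2024-03-05",
                      "2024-03-25", "2024-04-15"],
    "Purchase_Amount": [248.0, 250.0, 259.0, 257.0, 252.0],
})

# Label every purchase with its year and month, then average per month.
sample["Purchase_Date"] = pd.to_datetime(sample["Purchase_Date"])
sample["Year-Month"] = sample["Purchase_Date"].dt.strftime("%Y-%m")
monthly_means = sample.groupby("Year-Month")["Purchase_Amount"].mean()

# Which months are the extremes, and how far apart are they?
lowest_month = monthly_means.idxmin()
highest_month = monthly_means.idxmax()
gap = monthly_means.max() - monthly_means.min()

print(f"Lowest: {lowest_month}, highest: {highest_month}, gap: ${gap:.2f}")
```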
For the fourth visualization, I wanted to make a bump chart, so I needed to decide which part of my data set would be the most effective to rank. I wanted to examine the city variable to see if there was any variation in spending there, so I created a ranking chart to show which cities spent the most in each month covered by this data set. At first I used all of the cities, but I quickly realized I needed to cut that number down in order to have a clean, effective visualization. I decided to keep just the top 10 cities by total spending. This chart showed a lot of variation. No one city stayed at the top or the bottom of the rankings for long, and each city moved around in the rankings a great deal over the time period displayed. It was good to see variation here, because I had not seen much in my other three charts.
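The ranking behind a bump chart can be sketched as follows. The city names and amounts are made up, and counting a city with no purchases in a month as zero spending (so that it ranks last for that month) is an assumption of this sketch rather than something the data set dictates.

```python
import pandas as pd

# Made-up purchases for illustration; the city names are hypothetical.
sample = pd.DataFrame({
    "Month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "City": ["Avon", "Brook", "Cary", "Avon", "Cary"],
    "Purchase_Amount": [100.0, 300.0, 200.0, 400.0, 150.0],
})

# One row per month, one column per city, total spending in each cell.
# fill_value=0 treats "no purchases that month" as zero spending.
totals = sample.pivot_table(index="Month", columns="City",
                            values="Purchase_Amount", aggfunc="sum",
                            fill_value=0)

# Rank across each row: rank 1 is the city that spent the most that month.
ranks = totals.rank(axis=1, ascending=False, method="min")

print(ranks)
```

In this toy example Brook has no purchases in the second month, so it drops from rank 1 to rank 3. A city that only appears in a few months will jump around the rankings for this reason alone, which is worth keeping in mind when reading the chart.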
For the last visualization, I wanted to look at time again, this time together with the discount applied variable. I wanted to know if there was a difference between how much money was spent on purchases with discounts and how much was spent on purchases without them. I expected more money to be spent with discounts, because a discount usually inspires customers to spend more overall when they think they are getting a good deal. What I found was an almost exact 50/50 split every single month. There was never more than a full percentage point of difference between the two categories. I also displayed the total amount spent per month in the middle of each chart to give an idea of how much was being spent. That amount was right around 1000 dollars each month except for the last month, which I am assuming is much lower because not all of its data had been collected when this data set was created.
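The percentages shown in the monthly charts can be computed directly, which also makes it easy to check the largest gap between the two groups. This is a minimal sketch on made-up rows; the `Month` column here is a simplified stand-in for the month labels derived from `Purchase_Date`.

```python
import pandas as pd

# Made-up purchases for illustration; not values from the Walmart file.
sample = pd.DataFrame({
    "Month": ["2024-01"] * 4 + ["2024-02"] * 4,
    "Discount_Applied": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "No"],
    "Purchase_Amount": [100.0, 100.0, 150.0, 150.0, 330.0, 90.0, 90.0, 90.0],
})

# Total spending per month, split into "No" and "Yes" columns.
monthly = (sample.groupby(["Month", "Discount_Applied"])["Purchase_Amount"]
           .sum()
           .unstack(fill_value=0))

# Convert each month's totals into percentage shares of that month.
share = monthly.div(monthly.sum(axis=1), axis=0) * 100

# Largest gap, in percentage points, between the two groups in any month.
max_gap = (share["Yes"] - share["No"]).abs().max()

print(share.round(1))
print(f"Largest monthly gap: {max_gap:.1f} percentage points")
```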
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from matplotlib.ticker import FuncFormatter
warnings.filterwarnings("ignore")
path = "/Users/rubysullivan/Desktop"
filename = path + '/Walmart_customer_purchases.csv'
df = pd.read_csv(filename)
print("First 5 rows:")
## First 5 rows:
print(df.head())
## Customer_ID Age ... Rating Repeat_Customer
## 0 84607c1f-910c-44d5-b89f-e1ee06dd34c0 49 ... 1 Yes
## 1 f2a81712-a73e-4424-8b39-4c615a0bd4ea 36 ... 1 No
## 2 da9be287-8b0e-4688-bccd-1a2cdd7567c6 52 ... 1 No
## 3 50ec6932-3ac7-492f-9e55-4b148212f302 47 ... 2 Yes
## 4 8fdc3098-fc75-4b0f-983c-d8d8168c6362 43 ... 2 Yes
##
## [5 rows x 12 columns]
print("\nDataFrame Info:")
##
## DataFrame Info:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 50000 entries, 0 to 49999
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Customer_ID 50000 non-null object
## 1 Age 50000 non-null int64
## 2 Gender 50000 non-null object
## 3 City 50000 non-null object
## 4 Category 50000 non-null object
## 5 Product_Name 50000 non-null object
## 6 Purchase_Date 50000 non-null object
## 7 Purchase_Amount 50000 non-null float64
## 8 Payment_Method 50000 non-null object
## 9 Discount_Applied 50000 non-null object
## 10 Rating 50000 non-null int64
## 11 Repeat_Customer 50000 non-null object
## dtypes: float64(1), int64(2), object(9)
## memory usage: 4.6+ MB
print("\nMissing Values Per Column:")
##
## Missing Values Per Column:
print(df.isnull().sum())
## Customer_ID 0
## Age 0
## Gender 0
## City 0
## Category 0
## Product_Name 0
## Purchase_Date 0
## Purchase_Amount 0
## Payment_Method 0
## Discount_Applied 0
## Rating 0
## Repeat_Customer 0
## dtype: int64
print("\nSummary Statistics:")
##
## Summary Statistics:
print(df.describe())
## Age Purchase_Amount Rating
## count 50000.000000 50000.000000 50000.000000
## mean 38.945220 255.532230 2.998680
## std 12.398137 141.574416 1.417956
## min 18.000000 10.010000 1.000000
## 25% 28.000000 133.050000 2.000000
## 50% 39.000000 255.045000 3.000000
## 75% 50.000000 378.912500 4.000000
## max 60.000000 499.990000 5.000000
cat_columns = df.select_dtypes(include=['object']).columns
for col in cat_columns:
    unique_vals = df[col].nunique()
    print(f"Unique values in '{col}': {unique_vals}")
## Unique values in 'Customer_ID': 50000
## Unique values in 'Gender': 3
## Unique values in 'City': 25096
## Unique values in 'Category': 4
## Unique values in 'Product_Name': 16
## Unique values in 'Purchase_Date': 366
## Unique values in 'Payment_Method': 4
## Unique values in 'Discount_Applied': 2
## Unique values in 'Repeat_Customer': 2
# average amount spent per product category bar chart
category_avg_spent = df.groupby("Category", as_index=False)["Purchase_Amount"].mean()
plt.figure(figsize=(12, 6))
ax = sns.barplot(data=category_avg_spent, x="Category", y="Purchase_Amount", palette="viridis", edgecolor="black")
for bar in ax.patches:
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2,
            f'${bar.get_height():.2f}', fontsize=12, ha='center', color='black')
plt.xlabel("Product Category", fontsize=14)
plt.ylabel("Average Amount Spent ($)", fontsize=14)
plt.title("Average Amount Spent per Product Category", fontsize=16)
plt.xticks(rotation=45, ha="right");
plt.show()
# scatter plot of age versus purchase amount with trendline
df["Age_Group"] = pd.cut(df["Age"], bins=range(15, 70, 3), right=False)
age_group_avg = df.groupby("Age_Group")["Purchase_Amount"].mean().reset_index()
age_group_avg["Age_Group"] = age_group_avg["Age_Group"].astype(str)
plt.figure(figsize=(12, 6))
sns.regplot(data=age_group_avg, x=age_group_avg.index, y="Purchase_Amount", scatter=True,
            ci=None, marker="o", line_kws={"linestyle": "dashed"})
plt.xticks(ticks=range(len(age_group_avg)), labels=age_group_avg["Age_Group"], rotation=45, ha="right");
plt.xlabel("Age Group", fontsize=14)
plt.ylabel("Average Purchase Amount ($)", fontsize=14)
plt.title("Average Purchase Amount by Age Group (With Trend Line)", fontsize=16)
plt.show()
# heat map showing average spending trends over time by month and by year
df["Purchase_Date"] = pd.to_datetime(df["Purchase_Date"])
df["Year-Month"] = df["Purchase_Date"].dt.strftime("%Y-%m")
hm_df = df.groupby("Year-Month")["Purchase_Amount"].mean().reset_index()
hm_df = hm_df.set_index("Year-Month").T
fig = plt.figure(figsize=(18, 6))
ax = fig.add_subplot(1, 1, 1)
dollar_fmt = FuncFormatter(lambda x, p: f"${x:,.0f}")
ax = sns.heatmap(hm_df, linewidth=0.2, annot=True, cmap="mako", fmt=".0f",
                 square=False, annot_kws={"size": 11},
                 cbar_kws={"format": dollar_fmt, "orientation": "horizontal", "pad": 0.3})
for text in ax.texts:
    text.set_text(f"${text.get_text()}")  # Add $ sign to each annotation
plt.title("Average Purchase Amount by Month", fontsize=16, pad=20)
plt.xlabel("Year-Month", fontsize=14, labelpad=10)
plt.xticks(rotation=45, ha="right", fontsize=12);
plt.yticks([])
## ([], [])
plt.show()
# bump chart showing monthly spending rank over time for the top 10 cities by total purchase amount
df["Year"] = df["Purchase_Date"].dt.year
df["Month"] = df["Purchase_Date"].dt.month
bump_df = df.groupby(['Year', 'Month', 'City'])['Purchase_Amount'].sum().reset_index()
bump_df = bump_df.rename(columns={'Purchase_Amount': 'TotalSpending'})
top_cities = bump_df.groupby("City")["TotalSpending"].sum().nlargest(10).index
bump_df = bump_df[bump_df["City"].isin(top_cities)]
bump_df_pivot = bump_df.pivot(index=['Year', 'Month'], columns='City', values='TotalSpending')
# Treat months with no purchases as zero spending so those cities rank last, not first
bump_df_pivot = bump_df_pivot.fillna(0)
bump_df_ranked = bump_df_pivot.rank(axis=1, ascending=False, method='min')
bump_df_ranked = bump_df_ranked.apply(pd.to_numeric, errors='coerce')
bump_df_ranked.index = [f"{year}-{month:02d}" for year, month in bump_df_ranked.index]
fig, ax = plt.subplots(figsize=(14, 7))
for city in bump_df_ranked.columns:
    ax.plot(bump_df_ranked.index, bump_df_ranked[city], marker='o', linestyle='-', markersize=5, label=city)
ax.invert_yaxis()
plt.ylabel('City Spending Rank (1 = Highest)', fontsize=14, labelpad=10)
plt.xlabel('Year-Month', fontsize=14)
plt.title('City Spending Rankings Over Time (Top 10 Cities Only)', fontsize=16, pad=15)
plt.xticks(rotation=45, ha="right", fontsize=10);
ax.legend(title="Cities", bbox_to_anchor=(1.05, 1), loc="upper left", fontsize=10,
          labelspacing=0.5, markerscale=0.8, borderpad=0.8, handletextpad=0.8, frameon=False)
plt.subplots_adjust(bottom=0.2, right=0.75)
plt.show()
def plot_discount_pie_charts():
    df["Purchase_Date"] = pd.to_datetime(df["Purchase_Date"])
    df["Year-Month"] = df["Purchase_Date"].dt.to_period("M").astype(str)
    df["Month-Date"] = pd.to_datetime(df["Year-Month"])
    df["Month-Label"] = df["Month-Date"].dt.strftime("%m-%Y")
    sorted_months = df.drop_duplicates("Month-Label").sort_values("Month-Date")["Month-Label"].values
    cols = 5
    rows = 3
    colors = ['#1f77b4', '#20b2aa']
    label_map = {"No": "No Discount", "Yes": "Discount Applied"}
    fig, axes = plt.subplots(rows, cols, figsize=(cols * 2, rows * 2.5))
    axes = axes.flatten()
    for i, month in enumerate(sorted_months):
        ax = axes[i]
        month_data = df[df["Month-Label"] == month].groupby("Discount_Applied")["Purchase_Amount"].sum()
        values = month_data.values
        total = values.sum()
        wedges, _, _ = ax.pie(
            values,
            labels=None,
            colors=colors,
            startangle=90,
            autopct=lambda pct: f"{pct:.1f}%" if pct > 0 else "",
            textprops=dict(color="black", fontsize=7),
            wedgeprops=dict(width=0.4),
            pctdistance=0.8
        )
        ax.set_title(month, fontsize=9, pad=10)
        ax.text(0, 0, f"${int(total):,}", ha='center', va='center', fontsize=7, fontweight='bold')
        ax.set(aspect="equal")
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    fig.legend(
        handles=wedges,
        labels=[label_map["No"], label_map["Yes"]],
        loc="lower right",
        bbox_to_anchor=(0.85, 0.20),
        fontsize=9,
        frameon=False
    )
    plt.suptitle("Monthly Purchase Amount Split by Discount Usage", fontsize=13)
    plt.tight_layout(rect=[0, 0, 1, 0.93])
    plt.show()


plot_discount_pie_charts()
Overall, this analysis showed that Walmart customers are pretty consistent in how much they spend. The average purchase amount stayed right around $255 no matter the category, age group, or month. The one area where there was some variation was in the cities, where the spending rankings moved around a lot from month to month. Discounts didn't seem to make a big difference either, since the money spent with and without them was split almost 50/50. These results could help Walmart plan inventory more evenly across departments, look into what's causing the shifts in certain cities, and rethink how useful discounts really are when it comes to getting people to spend more.