Python Assignment

Customer Spending Analysis

For this project I selected a dataset that looked at sales data from a country. This information offered insights on the spending habits of the customers from many different countries. It provided us detailed information on customer age, gender, product categories, product subcategories, quantities of products purchased, unit cost, price, and revenue generated through sales of products listed in the dataset. When we look at the first few values of this dataset we can see the following:

#Importing packages needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import math
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go

#Defining Path and pointing to data
path = "/Users/chloesass/Desktop/Viz Making/"
filename = path + 'Sales_Table.csv'
df = pd.read_csv(filename) #reading in only first 5 rows to start
print(df.columns) #show columns

df.head()

##    index      Date    Year     Month  ...  Unit Price   Cost Revenue Column1
## 0      0  02/19/16  2016.0  February  ...  109.000000   80.0   109.0     NaN
## 1      1  02/20/16  2016.0  February  ...   28.500000   49.0    57.0     NaN
## 2      2  02/27/16  2016.0  February  ...    5.000000   11.0    15.0     NaN
## 3      3  03/12/16  2016.0     March  ...  116.500000  175.0   233.0     NaN
## 4      4  03/12/16  2016.0     March  ...   41.666667  105.0   125.0     NaN
## 
## [5 rows x 16 columns]

Data Cleaning

This immediately shows us that we have at least 1 column with NaN values and that we’ll need to clean up this data before moving ahead with any analysis. Our cleaning includes looking at the information pertaining to the dataset and from that we can see that there is one row that contains only a value for revenue and its index, so we dropped that row as well as the columns ‘index’ and ‘Column1’ as they don’t help us in our effort.

The cleansing of our data is complete and the next task we want to take on is breaking up the ‘Date’ column provided to us and glean more information such as quarter, day, month, year, weekday, and month name. Creating these new columns will allow us to look at how time may have affected shopping for this specific company.

df[df['Cost'].isnull()] #Finding row in cost column where theres a NA value & dropping it

df = df.dropna(subset=['Cost'])
df = df.drop(columns = ['index','Column1'], axis =1)#Dropping unnecessary columns
df.info() #Checking to see is columns were dropped

df['Date']= pd.to_datetime(df["Date"], format = '%m/%d/%y') #Adding more time columns
df['Quarter'] = df.Date.dt.quarter
df['Day'] = df.Date.dt.day
df['Month'] = df.Date.dt.month
df['Year'] = df.Date.dt.year
df['Weekday'] = df.Date.dt.strftime('%a')
df['MonthName'] = df.Date.dt.strftime('%b')

df.head() #looking at first 5 rows

##         Date  Year  Month  Customer Age  ... Quarter Day Weekday MonthName
## 0 2016-02-19  2016      2          29.0  ...       1  19     Fri       Feb
## 1 2016-02-20  2016      2          29.0  ...       1  20     Sat       Feb
## 2 2016-02-27  2016      2          29.0  ...       1  27     Sat       Feb
## 3 2016-03-12  2016      3          29.0  ...       1  12     Sat       Mar
## 4 2016-03-12  2016      3          29.0  ...       1  12     Sat       Mar
## 
## [5 rows x 18 columns]

Total Revenue and Profit by Country

Now that our data is at a point where we feel comfortable doing analysis the first question that we were interested in related to countries and profit. We know that this data represents a company who sells its products globally. We want to know which countries they sell to and what does the revenue and total profit look like for each of these countries. To do this we created side-by-side bar graphs analyzing revenue and profit by country as shown below:

df['Profit'] = df['Revenue'] - df['Cost'] #creating profit column
pm_df = df.groupby('Country')['Revenue'].sum().reset_index(name='TotalRevenue') #Revenue DF
pm_df1 = df.groupby('Country')['Profit'].sum().reset_index(name='TotalProfit') #Profit DF

fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols =2, figsize = (18,10)) #Fig properties

fig.suptitle('Total Revenue & Profit by Country', fontsize = 22) #large title

formatter = ticker.FormatStrFormatter('$%0.f') #formatting yaxis ticks
ax1.yaxis.set_major_formatter(formatter)
ax2.yaxis.set_major_formatter(formatter)

#Revenue/Country Graph
ax1.bar(pm_df['Country'], pm_df['TotalRevenue'])
ax1.set_xlabel('Country',fontsize = 12)
ax1.set_ylabel('Revenue', fontsize = 12)
ax1.set_title('Total Revenue by Country', fontsize = 18)

#Profit/Country Graph
ax2.bar(pm_df1['Country'], pm_df1['TotalProfit'], color ='lightblue')
ax2.set_xlabel('Country', fontsize = 12)
ax2.set_ylabel('Profit', fontsize = 12)
ax2.set_title('Total Profit by Country', fontsize = 18);

The graphs tell us that this company is selling to France, Germany, the United Kingdom, and the United States. We can see that the United States is the front runner when it comes to revenue, meaning that they are bringing in the company a lot of money comparatively. However, just because they are generating a lot of money doesn’t always mean that their making the company any profit, and that is exactly what we are seeing on the right. Germany is generating the company about one million dollars in profit where the United States is only generating about $700,000. This is most likely happening because consumers in the United States are buying items with a low profit margin for the company.

Profitability by Month and Country

Our dataset encompasses almost two full years of purchase history starting in January of 2015 and ending in July of 2016. We will assume that 2015 was their first year in business, and we are curious to see how the countries ranked over the first 12 months of business when it came to providing the company profit. From our last graph we could see that over the two-year period that Germany was the number 1 country when it came to profits, but the first 12 months of any business are typically trial and error when it comes to pricing goods to make a profit. Below, we have implemented a bump chart that ranks the countries who bought from the company based on how much profit they generated the company in the year 2015.

df['Profitability'] = (df['Profit']/ df['Revenue'])*100 #Creating New Column
#New Dataframe for the viz
bump_df = df[df['Year'] == 2015].groupby(['Country','MonthName'])['Profitability'].sum().reset_index(name = 'Profitability')
#creating Pivot Table
bump_df = bump_df.pivot(index = 'Country', columns = 'MonthName', values = 'Profitability')
month_order = ['Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
bump_df = bump_df.reindex(columns = month_order) #re indexing by Month
bump_df = bump_df.dropna() #unnecessary lines there are no NA in 2015 

bump_df_ranked = bump_df.rank(0, ascending = False, method = 'min')
bump_df_ranked = bump_df_ranked.T
bump_df_ranked

#Setting up figure

fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)

bump_df_ranked.plot(kind = 'line', ax = ax, marker = 'o', markeredgewidth = 1, linewidth = 6,
                   markersize = 44, 
                   markerfacecolor = 'white')
ax.invert_yaxis() #have rank 1 at top
num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]
plt.ylabel('Monthly Ranking', fontsize = 18, labelpad = 10)
plt.title('Ranking of Profitability by Month and Country\n Bump Chart', fontsize = 18, pad = 15)
plt.xticks(np.arange(num_rows), month_order, fontsize = 14)

plt.yticks(range(1, num_cols + 1, 1), fontsize = 14)

ax.set_xlabel('Month', fontsize = 18)

handles, labels = ax.get_legend_handles_labels()
handles = [handles[3], handles[2], handles[1], handles[0]] #correcting countries for legend
labels  = [ labels[3], labels[2],  labels[1],  labels[0]]
#formatting legend
ax.legend(handles, labels, bbox_to_anchor = (1.01,1.01), fontsize = 14,
         labelspacing = 1, 
         markerscale = 0.4,
         borderpad = 1,
         handletextpad = 0.8)

#Setting up inside data of Profit
i = 0
j = 0
for eachcol in bump_df_ranked.columns:
    for eachrow in bump_df_ranked.index:
        this_rank = bump_df_ranked.iloc[i, j]
        ax.text(i, this_rank, '$' + str(round(bump_df.iloc[j, i]/1e3, 1)) + 'K', ha = 'center', va = 'center', fontsize = 12.5)
        i += 1
    j += 1
    i = 0
    

plt.show()

From the bump chart we can see that in 2015 Germany remained at the number one ranking for all months except August and November where the United States took the top spot. The bump chart also provides how much profit the company made or lost by country each month. From this we can see that France, Germany, and the United States did not provide the company with any positive profits until July of 2015, meaning when looking at all the months and profits made the company was operating a loss until July.

Deviation between Actual and Bugeted Monthly Revenue

The data from the graph provided great insights on how the first year of the company’s sales went. Backing off that information we wanted to dive into the revenue they were receiving. We were told (well I made this up for the purpose of the project!) that the company had budgeted $500,000 each month for the revenue they needed to receive to stay afloat in this business. To report back to the company, we created a waterfall diagram as seen below. It documents the total revenue they made or lost each month in the year 2015. Additionally, it is equipped with tooltips that show both the revenue made or lost from that month and the cumulative deviation in revenue up to that month.

# Data Prep for Waterfall Diagram
wf_df = df[df['Year'] == 2015].groupby(['MonthName'])['Revenue'].sum().reset_index(name = 'TotalRevenue')
wf_df['Budget'] = 5e5 #Making up value for budget
wf_df['Deviation'] = wf_df.TotalRevenue - wf_df.Budget #creating new variable/column
wf_df.loc[wf_df.index.max()+1] = ['Total',
                                 wf_df.TotalRevenue.sum(),
                                 wf_df.Budget.sum(),
                                 wf_df.TotalRevenue.sum() - wf_df.Budget.sum()]
months = ['Jan', 'Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec', 'Total']
wf_df.MonthName = pd.Categorical(wf_df.MonthName, categories = months, ordered = True) #pointing months[] to MonthName
wf_df.sort_values(by = 'MonthName', inplace = True) #sorting df
wf_df.reset_index(inplace = True, drop = True)
wf_df

#Waterfall Diagram
#determining colors for when we are negative/positive on Revenue

if wf_df.loc[12, 'Deviation'] > 0:
    end_color = 'black'
elif wf_df.loc[12, 'Deviation'] < 0:
    end_color = 'red'
else:
    end_color = 'blue'
    
fig = go.Figure(go.Waterfall(name = '', orientation = 'v', x = wf_df['MonthName'], textposition = 'outside',
                            measure = ['relative','relative','relative','relative','relative','relative',
                                      'relative','relative','relative','relative','relative','relative',
                                      'total'], #setting up x-axis
                             y = wf_df['Deviation']/1e5,
                            text = ['${:.2f}K'.format(each/1e5) for each in wf_df['TotalRevenue']],
                            decreasing = {'marker':{'color':'red'}},
                            increasing = {'marker':{'color':'green'}},
                            totals = {'marker':{'color':end_color}},
                            hovertemplate = 'Cumulative Deviation to Date: ' + '$%{y:,.2f}K' + '<br>' +
                                            'Total Revenue in %{x}: %{text}'))
fig.layout = go.Layout(yaxis=dict(tickformat='.1f'))
fig.update_xaxes(title_text='Months', title_font={'size':18});
fig.update_yaxes(title_text='Total Revenue (Running Total $K)', title_font={'size':18}, 
                dtick = 0.5, tickprefix = '$', ticksuffix = 'K',zeroline = True, tickmode = 'auto', nticks = 15); 
fig.update_layout(title = dict(text='Deviation between Actual and Bugeted Monthly Revenue in 2015 (Waterfall Diagram) <br>' +
                              'Surpluses appear in Green, Deficits appear in Red',
                              font = dict(family = 'Arial', size = 18, color = 'black')),
                 template = 'simple_white',
                 title_x = 0.5,
                 showlegend = False,
                 autosize = True,
                 margin = dict(l = 30, r = 30, t = 60, b = 30));
fig.show()

From this graphic we see a similar trend in the previous bump chart where the company was working at loss in revenue through June of 2015. However, we can see that starting in July through December the revenue generated starts to skyrocket. This could be for a number or reasons such as a change in pricing or better marketing on the company’s side. We are able to see that at the end of their first calendar year in business the company was able to generate just shy of $100,000.

Number of Products Bought by Month and Day

Now that we have a decent understanding of the profit and revenue that company was making in 2015, we wanted to pivot our focus on the things that feed into profit and revenue, the products customers bought! Our first goal was to discover if there were any trends in the days of the week in which products were bought. We constructed a heatmap of the products bought in 2015 by month and by weekday shown below. Month/Day combinations that are of a red hue signify Month/Days where lots of products were bought.

x = df[df['Year']== 2015].groupby(['MonthName','Weekday'])['Quantity'].sum().reset_index(name = 'Products')
x = pd.DataFrame(x) #creating new dataframe
x

hm_df = pd.pivot_table(x, index = 'MonthName', columns = 'Weekday', values = 'Products') #coverting to pivot table
hm_df

fig = plt.figure(figsize=(8,12)) #formatting figure
ax = fig.add_subplot(1,1,1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x),',')) #comma formatting

ax = sns.heatmap(hm_df, linewidth = 0.2, annot = True, cmap = 'coolwarm',
                 fmt = ',.0f', square = True, annot_kws = {'size': 11},
                 cbar_kws = {'format': comma_fmt, 'orientation':'vertical'}) #setting up heatmap

plt.title('Number of Products Bought by Month and Day',
         fontsize = 16, pad = 15)
plt.xlabel('Month', fontsize = 18, labelpad =10)
plt.ylabel('Weekday', fontsize = 18, labelpad =10)
plt.yticks(rotation = 0, fontsize = 14)

plt.yticks(size = 14)

ax.invert_yaxis()
#setting up colorbar for legend
cbar = ax.collections[0].colorbar
max_count = int(hm_df.to_numpy().max())
my_colorbar_ticks = [*range(0, max_count,100)]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]
cbar.set_ticklabels(my_colorbar_tick_labels)
cbar.set_label('Number of Products Bought', rotation = 270, fontsize = 14, color = 'black', labelpad = 20)
plt.show();

From this heatmap you can see that August through December were the month in which the most products were purchased from the company. This trend also follows the patterns we saw in the bump chart and the waterfall diagram where profit and revenue started to increase in the month of July. Taking a closer look, we can see that Tuesday in December was our highest selling Month/Day combination with Thursdays in December as a close second. We could potentially theorize that the company held a year end sale trying to get rid of their inventory to stock up on new items, or that they had a Christmas sale.

Product Category Revenue by Month & Profits of Bike Subcategories by Month.

Now that we know the second half of the year was indeed beneficial for this company, we ought to answer the question of - what is this company even selling? We also took it one step further and looked at the product that generated them the most revenue and broke it up into its subcategories. We then investigated the subcategories and tried to determine the profits being made from each subcategory. Below is a side-by-side line graph of product category revenue by month and the profits of bike subcategories by month.

df['Year_Month'] = df['Date'].dt.strftime('%Y-%m')
Total = df.groupby('Year_Month')['Revenue'].sum() #creating data for plot in new varibales
Bikes = df[df['Product Category'] == 'Bikes'].groupby('Year_Month')['Revenue'].sum()#creating data for plot in new varibales
Access = df[df['Product Category'] == 'Accessories'].groupby('Year_Month')['Revenue'].sum()#creating data for plot in new varibales
Clothes = df[df['Product Category'] == 'Clothing'].groupby('Year_Month')['Revenue'].sum()#creating data for plot in new varibales
TotalBike_PM = df[df['Product Category'] == 'Bikes'].groupby('Year_Month')['Profit'].sum()#creating data for plot in new varibales
Mountain_PM =df[df['Sub Category'] == 'Mountain Bikes'].groupby('Year_Month')['Profit'].sum()#creating data for plot in new varibales
Touring_PM = df[df['Sub Category'] == 'Touring Bikes'].groupby('Year_Month')['Profit'].sum()#creating data for plot in new varibales
Road_PM = df[df['Sub Category'] == 'Road Bikes'].groupby('Year_Month')['Profit'].sum()#creating data for plot in new varibales
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols =2, figsize = (18,10)) #setting up figure

fig.suptitle('Revenue by Category & Bike Profit Margin', fontsize = 24) #title

ax1.plot(Total,label='Total Revenue', linestyle = '-') #plotting data w/ different line types
ax1.plot(Bikes,label='Bike Revenue', linestyle = ':')
ax1.plot(Access,label='Accessories Revenue',linestyle = '--')
ax1.plot(Clothes,label='Clothing Revenue',linestyle = '-.')
ax1.legend(loc = 'best')
ax1.set_xlabel('Year - Month')
ax1.set_ylabel('Revenue')
ax1.set_title('Product Category Revenue by Month', fontsize = 18)
ax1.grid()

for label in ax1.xaxis.get_ticklabels(): #formatting xaxis labels
    label.set_color('black')
    label.set_rotation(45)
    label.set_fontsize(10)
formatter = ticker.FormatStrFormatter('$%1.0f') #creating formatting
ax1.yaxis.set_major_formatter(formatter) #applying formatting to yaxis

ax2.plot()

ax2.plot(TotalBike_PM, label = 'Total Bike Profit')
ax2.plot(Mountain_PM, label = 'Mountain Bike Profit')
ax2.plot(Touring_PM, label = 'Touring Bike Profit')
ax2.plot(Road_PM, label = 'Road Bike Profit')
ax2.legend(loc= 'best')
ax2.axhline(y=0, color='red', lw=3, ls='--', alpha=0.6)
ax2.set_xlabel('Year - Month')
ax2.set_ylabel('Profit')
ax2.set_title('Profits of Bike Sub-Categories by Month', fontsize = 18)
ax2.grid()

for label in ax2.xaxis.get_ticklabels():#formatting xaxis labels
    label.set_color('black')
    label.set_rotation(45)
    label.set_fontsize(10)
ax2.yaxis.set_major_formatter(formatter)#applying formatting to yaxis

From our line graph we can see that this company sells three main types of products. They fall under the categories of bike, accessories, and clothing. From the graph on the left we can see that the blue line represents the total revenue made by the company over the two-year period. We can also see that of the three categories of products that bike products generate the most revenue. It’s also important to note that accessories and clothing weren’t included as categories until July of 2015. From the graph on the right, we can see that the bike category is made up of Mountain, Touring, and Road bikes. From these three subcategories that company did not make one dollar in profits for one whole year. The very first month they saw any profits from bikes came in January of 2016. It is hard to imagine this company was able to stay afloat and running for as long as they did making no profits, but we can see that in 2016 they were able to continue generating positive profits.

United States Consumption

To finish up our analysis we decided to pivot back to looking at the countries. As Americans we decided it would be interesting to look at the consumption the United States had on the company’s products. More specifically we were interested in looking at the total number of products that were purchased from the United States by quarter and by month. Below is the donut chart we created to investigate this.

df_US = df.loc[df.Country == 'United States'] #creating new dataframe for only US purchases
df_US

pie_df = df_US.groupby(['Quarter','MonthName','Month'])['Quantity'].sum().reset_index(name='TotalProductsBought')
pie_df.sort_values(by = ['Month'], inplace = True) #sorting by Month
pie_df.reset_index(inplace = True, drop = True)
del pie_df['Month']
pie_df

#set up inside and outside reference numbers for colors

number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_inside_colors = len(pie_df.MonthName.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)

labels = ['Q1', 'Q2', 'Q3', 'Q4'] #creating manual labels that include "Q"

fig = plt.figure(figsize=(10,10)) #creating figure
ax = fig.add_subplot(1,1,1)

colormap = plt.get_cmap("tab20") #color scheme
outer_colors = colormap(outside_color_ref_number)

all_prod = pie_df.TotalProductsBought.sum()

pie_df.groupby(['Quarter'])['TotalProductsBought'].sum().plot(
        kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85,
        labeldistance = 1.1, wedgeprops = dict(edgecolor='w'), textprops = {'fontsize' : 18},
        labels = labels,
        autopct = lambda p: '{:.2f}%\n({:.2f}K)'.format(p, (p/100)*all_prod/1e+3),
        startangle = 90) #outside portion of pie graph

inner_colors = colormap(inside_color_ref_number) #inside portion of donut
pie_df.TotalProductsBought.plot(
        kind = 'pie', radius = 0.7, colors = inner_colors, pctdistance = 0.55,
        labeldistance = 0.8, wedgeprops = dict(edgecolor='w'), textprops = {'fontsize' : 13},
        labels = pie_df.MonthName, 
        autopct = '%1.2f%%',
        startangle = 90)

hole = plt.Circle((0,0), 0.3, fc = 'white') #creating size of inner hole
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('United States Consumption \n Total Products Bought by Quarter', fontsize = 18)
ax.text(0,0, 'Total Products Bought \n' + str(round(all_prod/1e3,2)) + 'K', size = 16, ha = 'center', va = 'center')
ax.axis('equal')

plt.tight_layout()

plt.show()

The chart shows us that over the two-year period the United States contributed to 36,500 sales for the company. They made most of their purchases in the second quarter of the year, and most of the purchases in that quarter came from the month of May. Quarter 3 was the slowest for Americans where they only purchased 7,000 products with their slowest month being July. We can try and make sense of this data and presume that Americans are trying to buy bikes before summer to enjoy the nice weather but mater of factly we simply will not ever know this reasoning!