Using SQL to Analyze a Fictional Music Store's Sales Data

Executive Summary

In this guided project, we will be working with a database called Chinook which contains information about a fictional music shop. The database includes several tables on invoice information, track, album, artist and genre data, and employee and customer information related to the store's sales. We will use this database and the sqlite3 module in order to explore and analyze four fictional business questions and propositions.

Note: This guided project is part of DataQuest's Data Science in Python track.

Set up

We begin by importing the sqlite3, pandas and matplotlib modules, and by taking a look at the database's schema in order to understand which tables it contains and how these are related. We then proceed to creating some helper functions in order to run queries and commands more efficiently in sqlite3.

In [13]:
import sqlite3
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
In [14]:
schema = 'chinook-schema.png'
image = plt.imread(schema)
plt.figure(figsize = (10, 10))
plt.imshow(image)
plt.show()

In the following code we create a run_query() function which takes as an argument a SQL query and returns the pandas dataframe corresponding to this query.

In [33]:
def run_query(q):
    with sqlite3.connect('chinook.db') as conn:
        return pd.read_sql(q, conn)

As a next step, we create a run_command() function that takes a SQL command as argument and executes it using the sqlite module.

In [30]:
def run_command(c):
    with sqlite3.connect('chinook.db') as conn:
        conn.isolation_level = None
        conn.execute(c)

We now create a show_tables() function which uses our previous run_query() function to return a list of all the tables and views in our database.

In [34]:
def show_tables():
    view_query = 'SELECT name, type FROM sqlite_master WHERE type IN ("table","view");'
    return run_query(view_query)

Running this show_tables() function, we note that we have 11 tables including information on artists, albums, playlists, tracks and genre and business information such as customer, employee and invoices.

In [35]:
show_tables()
Out[35]:
name type
0 album table
1 artist table
2 customer table
3 employee table
4 genre table
5 invoice table
6 invoice_line table
7 media_type table
8 playlist table
9 playlist_track table
10 track table

Question 1.

Suppose the Chinook record store signs a deal with a new record label and needs to select the first three albums it will add to the store from the following list. These albums are by artists that do not currently have tracks in the store. The new record label specializes in American artists, and have given Chinook funds to advertise the new albums in the United States. To address this business issue we will now analyze which genres are the most popular in the US.

Artist Name Genre
Regal Hip-Hop
Red Tone Punk
Meteor and the Girls Pop
Slim Jim Bites Blues

Let us begin by writing a query which returns the number of tracks sold in the US in absolute and relative terms.

In [38]:
q = 'SELECT * FROM INVOICE LIMIT 5'
run_query(q)
Out[38]:
invoice_id customer_id invoice_date billing_address billing_city billing_state billing_country billing_postal_code total
0 1 18 2017-01-03 00:00:00 627 Broadway New York NY USA 10012-2612 15.84
1 2 30 2017-01-03 00:00:00 230 Elgin Street Ottawa ON Canada K2P 1L7 9.90
2 3 40 2017-01-05 00:00:00 8, Rue Hanovre Paris None France 75002 1.98
3 4 18 2017-01-06 00:00:00 627 Broadway New York NY USA 10012-2612 7.92
4 5 27 2017-01-07 00:00:00 1033 N Park Ave Tucson AZ USA 85719 16.83
In [83]:
q = '''
WITH usa_sales AS (
    SELECT 
        i.invoice_id AS invoice_id,
        il.track_id as track_id
    FROM invoice i
    INNER JOIN invoice_line il ON il.invoice_line_id = i.invoice_id
    WHERE i.billing_country = 'USA'
    ),
    usa_tracks_genres AS (
    SELECT
        usa_sales.invoice_id AS invoice_id,
        usa_sales.track_id AS track_id, 
        g.name AS genre
    FROM usa_sales 
    LEFT JOIN track t ON t.track_id = usa_sales.track_id
    LEFT JOIN genre g ON g.genre_id = t.genre_id),
    genre_totals AS (
    SELECT
        utg.genre as Genre, 
        COUNT(utg.track_id) AS Tracks_sold_total
    FROM usa_tracks_genres AS utg
    GROUP BY 1
        )

SELECT 
    gt.Genre AS Genre,
    gt.Tracks_sold_total AS Tracks_sold_total,
    CAST((CAST(gt.Tracks_sold_total as float) / (SELECT SUM(Tracks_sold_total)
                              FROM genre_totals)) * 100 as float) as Tracks_sold_percentage
FROM genre_totals AS gt
'''
genre_sales_usa = run_query(q)
In [84]:
genre_sales_usa
Out[84]:
Genre Tracks_sold_total Tracks_sold_percentage
0 Alternative 2 1.526718
1 Alternative & Punk 10 7.633588
2 Blues 1 0.763359
3 Classical 4 3.053435
4 Drama 1 0.763359
5 Easy Listening 1 0.763359
6 Electronica/Dance 2 1.526718
7 Jazz 3 2.290076
8 Latin 2 1.526718
9 Metal 8 6.106870
10 Pop 2 1.526718
11 R&B/Soul 6 4.580153
12 Reggae 1 0.763359
13 Rock 88 67.175573
In [85]:
genre_sales_usa = genre_sales_usa.set_index('Genre')
plt.figure()
genre_sales_usa['Tracks_sold_total'].plot.bar()
plt.title('Number of Tracks Sold in the US per Genre')
plt.show()

From the above graph, we observe that the most popular genre sold by Chinook in the US is by far the Rock genre (67% of track units sold). From the list of artists, we note that the most popular genres in order of popularity ar Alternative & Punk (7.6%), R&B/Soul (4.6%) and Pop (1.5%). We therefore recommend to add the albums by Red Tone, Meteor and the Girls, and Slim Jim Bites.

Question 2.

Each Chinook customer gets assigned to a sales support agent after completing their first purchase. We will now analyze customer purchases belonging to each employee to identify and compare the performance of sales support agents.

In [100]:
q = '''
WITH employee_sales AS (
        SELECT
            c.support_rep_id AS employee_id,
            SUM(i.total) AS total_purchased
        FROM invoice i
        INNER JOIN customer c ON c.customer_id = i.customer_id
        GROUP BY 1
        )

SELECT
    e.employee_id AS employee_id,
    e.first_name || ' ' || e.last_name AS employee_name,
    e.title AS title,
    e.birthdate AS birthdate,
    e.reports_to AS reports_to,
    e.hire_date AS hire_date,
    e.country AS country,
    es.total_purchased AS employee_sales
FROM employee e
INNER JOIN employee_sales es ON e.employee_id = es.employee_id
ORDER BY 2
'''
employee_sales = run_query(q)
In [90]:
employee_sales
Out[90]:
employee_id employee_name title birthdate reports_to hire_date country employee_sales
0 3 Jane Peacock Sales Support Agent 1973-08-29 00:00:00 2 2017-04-01 00:00:00 Canada 1731.51
1 4 Margaret Park Sales Support Agent 1947-09-19 00:00:00 2 2017-05-03 00:00:00 Canada 1584.00
2 5 Steve Johnson Sales Support Agent 1965-03-03 00:00:00 2 2017-10-17 00:00:00 Canada 1393.92

In order to check that our table contains all the sales support agents covering all of Chinook's sales, we run the following query and verify that Total Sales as reported in the invoice table equals the sum of employee_sales reported in the previous table.

In [95]:
q = '''
SELECT 
    SUM(total) as Total_Sales
FROM invoice
'''
run_query(q)
Out[95]:
Total_Sales
0 4709.43
In [101]:
employee_sales = employee_sales.set_index('employee_name')
plt.figure()
employee_sales['employee_sales'].plot.bar()
plt.title('Total Sales per Chinook Employee')
plt.show()

We note that there are three sales support agents covering Chinook's customers. Sales are distributed relatively evenly between the agents. We make the following observations: first, all of the agents are from Canada and share the same title. Secondly, the agents have approximately the same level of experience in the store as they were hired within a few months from each other. Third, the agent with the highest amout of sales is the youngest, which could potentially reflect an ability to connect to a younger customer market.

Question 3.

We will now analyze the sales data for customers from each country. Guidance has been provided on using the country value from the customer table as opposed to the billing_country from the invoice table. This could be due to the customer's billing address differing from the actual location where the customer completed their purchase.

For each country, we will calculate the total number of customers, total value of sales, average value of sales per customer and average order value.

In [127]:
c = '''
CREATE VIEW country_sales AS
    SELECT 
        c.country AS Country,
        SUM(i.total) AS Total_Purchases,
        COUNT(DISTINCT(c.customer_id)) AS Total_Customers,
        COUNT(i.invoice_id) AS Total_Orders
    FROM invoice i
    INNER JOIN customer c ON c.customer_id = i.customer_id
    GROUP BY 1    
'''
run_command(c)
In [128]:
q = '''
SELECT * FROM country_sales
'''
run_query(q)
Out[128]:
Country Total_Purchases Total_Customers Total_Orders
0 Argentina 39.60 1 5
1 Australia 81.18 1 10
2 Austria 69.30 1 9
3 Belgium 60.39 1 7
4 Brazil 427.68 5 61
5 Canada 535.59 8 76
6 Chile 97.02 1 13
7 Czech Republic 273.24 2 30
8 Denmark 37.62 1 10
9 Finland 79.20 1 11
10 France 389.07 5 50
11 Germany 334.62 4 41
12 Hungary 78.21 1 10
13 India 183.15 2 21
14 Ireland 114.84 1 13
15 Italy 50.49 1 9
16 Netherlands 65.34 1 10
17 Norway 72.27 1 9
18 Poland 76.23 1 10
19 Portugal 185.13 2 29
20 Spain 98.01 1 11
21 Sweden 75.24 1 10
22 USA 1040.49 13 131
23 United Kingdom 245.52 3 28

In order to verify that the above table is correct, let us calculate Chinook's total customers and total orders for the period and check that these numbers correspond to the sums in the above table. We already know that total sales amounted to $4,709 for the period, which coincides with the below.

In [130]:
q = '''
SELECT 
    SUM(total_purchases) AS Total_sales,
    SUM(Total_Customers) AS Total_customers,
    SUM(Total_Orders) AS Total_orders
FROM country_sales
'''
run_query(q)
Out[130]:
Total_sales Total_customers Total_orders
0 4709.43 59 614
In [131]:
q = '''
SELECT COUNT(invoice_id) AS Total_Orders 
FROM invoice
'''
run_query(q)
Out[131]:
Total_Orders
0 614
In [132]:
q = '''
SELECT COUNT(customer_id) AS Total_Customers
FROM customer
'''
run_query(q)
Out[132]:
Total_Customers
0 59

As noted above, our totals coincide and so we may confirm that the country_sales table is correct. We now continue to calculate our variables of interest.

In [142]:
q = '''
WITH country_sales_other AS (
        SELECT
            CASE
                WHEN Total_Customers = 1 THEN 'Other'
                ELSE Country
            END AS country_identification,
            Total_Purchases,
            Total_Customers,
            Total_Orders
        FROM country_sales),
    country_sales_agg AS (
        SELECT
            country_identification,
            SUM(Total_purchases) AS Total_Sales,
            SUM(Total_customers) AS Total_Customers,
            SUM(Total_Orders) AS Total_Orders
        FROM country_sales_other
        GROUP BY 1), 
    country_variables AS (
        SELECT
            country_identification as Country,
            Total_Customers,
            Total_Sales, 
            CAST(Total_Sales / Total_Customers as float) AS Average_sales_per_customer, 
            CAST(Total_Sales / Total_Orders as float) AS Average_order_value
        FROM country_sales_agg
            )

SELECT
    Country,
    Total_Sales,
    Average_sales_per_customer,
    Average_order_value
FROM (
      SELECT 
          cv.*,
          CASE
              WHEN cv.Country = 'Other' THEN 1
              ELSE 0
          END AS sort
      FROM country_variables cv
      )
ORDER BY sort ASC
'''
run_query(q)
Out[142]:
Country Total_Sales Average_sales_per_customer Average_order_value
0 Brazil 427.68 85.536000 7.011148
1 Canada 535.59 66.948750 7.047237
2 Czech Republic 273.24 136.620000 9.108000
3 France 389.07 77.814000 7.781400
4 Germany 334.62 83.655000 8.161463
5 India 183.15 91.575000 8.721429
6 Portugal 185.13 92.565000 6.383793
7 USA 1040.49 80.037692 7.942672
8 United Kingdom 245.52 81.840000 8.768571
9 Other 1094.94 72.996000 7.448571
In [143]:
c = '''
CREATE VIEW country_sales_data AS
     WITH country_sales_other AS (
            SELECT
                CASE
                    WHEN Total_Customers = 1 THEN 'Other'
                    ELSE Country
                END AS country_identification,
                Total_Purchases,
                Total_Customers,
                Total_Orders
            FROM country_sales),
        country_sales_agg AS (
            SELECT
                country_identification,
                SUM(Total_purchases) AS Total_Sales,
                SUM(Total_customers) AS Total_Customers,
                SUM(Total_Orders) AS Total_Orders
            FROM country_sales_other
            GROUP BY 1), 
        country_variables AS (
            SELECT
                country_identification as Country,
                Total_Customers,
                Total_Sales, 
                CAST(Total_Sales / Total_Customers as float) AS Average_sales_per_customer, 
                CAST(Total_Sales / Total_Orders as float) AS Average_order_value
            FROM country_sales_agg
                )

    SELECT
        Country,
        Total_Sales,
        Average_sales_per_customer,
        Average_order_value
    FROM (
          SELECT 
              cv.*,
              CASE
                  WHEN cv.Country = 'Other' THEN 1
                  ELSE 0
              END AS sort
          FROM country_variables cv
          )
    ORDER BY sort ASC  
'''
run_command(c)
In [148]:
q = '''
SELECT * FROM country_sales_data
'''
country_sales_data = run_query(q)
In [150]:
total_sales_country = country_sales_data.copy()
total_sales_country = total_sales_country[['Country', 'Total_Sales']]
total_sales_country = total_sales_country.set_index('Country')

plt.figure()
total_sales_country.plot.bar()
plt.title('Total Chinook Sales per Country')
plt.plot()
Out[150]:
[]
<Figure size 432x288 with 0 Axes>
In [152]:
average_order_value_country = country_sales_data.copy()
average_order_value_country = average_order_value_country[['Country', 'Average_order_value']]
average_order_value_country = average_order_value_country.set_index('Country')

plt.figure()
average_order_value_country.plot.bar()
plt.title('Average Order Value for each Country')
plt.plot()
Out[152]:
[]
<Figure size 432x288 with 0 Axes>
In [ ]:
average_sales_country_customer = country_sales_data.copy()
average_sales_country_customer = average_sales_country_customer[['Country', 'Average_sales_per_customer']]
average_sales_country_customer = average_sales_country_customer.set_index('Country')

plt.figure()
average_sales_country_customer.plot.bar()
plt.title('Average Sales per Customer for each Country')
plt.plot()

From the above table and visualizations, we can observe that, on its own, the US is Chinook's largest market, followed by Canada and Brazil. The 'Other' row includes sales from countries with a single customer. However, we see a clear difference between average sales in all of its countries (around 80 - 90 USD per customer on average) and average sales in Czech Republic of 136 USD per customer. This indicates that in this country average revenue per customer is significantly higher and targeted marketing could potentially be more profitable per customer compared to its other markets. Average order value tends to be approximately 8 USD per order; again, we notice that in the Czech Republic this value is slightly higher than in other countries at 9 USD per order.

Question 4.

Chinook's current sales strategy allows customers to make purchases in the following ways:

  • purchase a whole album
  • purchase a collection of one or more individual tracks

Customers are not allowed to purchase a whole album and then add individual tracks to that same purchase, unless they do so by manually choosing each track. Customers who purchase an album are charged the equivalent of the sum of the price of each of the tracks purchased.

Chinook's management is considering a change in their purchasing strategy by which they only purchase the most popular tracks from each album, instead of purchasing all of the tracks. To evaluate this proposal, we will analyze the percentage of purchases that are individual tracks compared to purchases of entire albums in order to understand the impact on revenue should management pursue this change of strategy.

To answer this question, we must identify which invoices include all the tracks from an album.

Let us begin by creating a table including album ID and track ID from the tracks included in that album.

In [170]:
c = '''
CREATE VIEW album_tracks_dictionary AS
    SELECT 
        a.album_id AS album_id,
        t.track_id AS track_id
    FROM album a
    INNER JOIN track t ON a.album_id = t.album_id
'''
run_command(c)
In [171]:
q = '''
SELECT * FROM album_tracks_dictionary
'''
run_query(q)
Out[171]:
album_id track_id
0 1 1
1 1 6
2 1 7
3 1 8
4 1 9
... ... ...
3498 343 3499
3499 344 3500
3500 345 3501
3501 346 3502
3502 347 3503

3503 rows × 2 columns

We next create a table of Chinook's sales including invoice ID, track ID and album ID.

In [169]:
c = '''
CREATE VIEW invoice_tracks_album AS
    SELECT 
        il.invoice_id as invoice_id,
        at.album_id AS album_id,
        at.track_id AS track_id
    FROM invoice_line il
    INNER JOIN album_tracks at ON at.track_id = il.track_id
'''
run_command(c)

q = '''
SELECT * FROM invoice_tracks_album
'''
run_query(q)
Out[169]:
invoice_id album_id track_id
0 1 91 1158
1 1 91 1159
2 1 91 1160
3 1 91 1161
4 1 91 1162
... ... ... ...
4752 614 215 2659
4753 614 215 2660
4754 614 215 2661
4755 614 215 2662
4756 614 215 2663

4757 rows × 3 columns

In [180]:
q = '''
WITH invoice_full_dictionary AS
(
SELECT 
    ita.invoice_id AS invoice_id,
    CASE 
        WHEN
            (
            SELECT track_id FROM (SELECT itainv.track_id AS track_id
                                  FROM invoice_tracks_album itainv
                                  WHERE itainv.invoice_id = ita.invoice_id)
            EXCEPT 
            SELECT track_id FROM (SELECT ald.track_id AS track_id
                                  FROM album_tracks_dictionary ald
                                  WHERE ald.album_id = ita.album_id)
            ) IS NULL
            AND
            (
            SELECT track_id FROM (SELECT ald.track_id AS track_id
                                  FROM album_tracks_dictionary ald
                                  WHERE ald.album_id = ita.album_id)
            EXCEPT 
            SELECT track_id FROM (SELECT itainv.track_id AS track_id
                                  FROM invoice_tracks_album itainv
                                  WHERE itainv.invoice_id = ita.invoice_id)
            ) IS NULL
            THEN 1
        ELSE 0
    END AS Full_Album
FROM invoice_tracks_album ita
),
invoice_dictionary AS
(
SELECT 
    invoice_id,
    MAX(Full_Album) as Full_Album
FROM invoice_full_dictionary 
GROUP BY 1
)

SELECT
    COUNT(id.invoice_id) AS Total_invoices,
    SUM(id.Full_Album) AS Total_Full_Album,
    (CAST(SUM(id.Full_Album) / CAST(COUNT(id.invoice_id) as float) * 100 as float)) AS Percentage_Full_Album
FROM invoice_dictionary id
'''
run_query(q)
Out[180]:
Total_invoices Total_Full_Album Percentage_Full_Album
0 614 114 18.566775

From the above, we note that only 19% of invoices are related to the purchase of whole albums. From this, we can infer that approximately 80% of purchases are made by customers seeking specific tracks, instead of purchasing whole albums. This seems to indicate that Chinook's strategy of only purchasing popular tracks could be a profitable one as purchases do not appear to be driven by less popular songs that are part of albums.

Conclusions

In this project, we explored Chinook's sales data using the sqlite3 and pandas modules to understand the nature of its sales and employees, and to evaluate strategic business questions in fictional scenarios such as the introduction of new artists to its product portfolio and a change in purchasing strategy. We also visualized some of the tables that were created using the matplotlib module in order to present a direct approach to responding these questions.