In this guided project, we will be working with a database called Chinook which contains information about a fictional music shop. The database includes several tables on invoice information, track, album, artist and genre data, and employee and customer information related to the store's sales. We will use this database and the sqlite3 module in order to explore and analyze four fictional business questions and propositions.
Note: This guided project is part of DataQuest's Data Science in Python track.
We begin by importing the sqlite3, pandas and matplotlib modules, and by taking a look at the database's schema in order to understand which tables it contains and how these are related. We then proceed to creating some helper functions in order to run queries and commands more efficiently in sqlite3.
import sqlite3
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
schema = 'chinook-schema.png'
image = plt.imread(schema)
plt.figure(figsize = (10, 10))
plt.imshow(image)
plt.show()
In the following code we create a run_query() function which takes as an argument a SQL query and returns the pandas dataframe corresponding to this query.
def run_query(q):
with sqlite3.connect('chinook.db') as conn:
return pd.read_sql(q, conn)
As a next step, we create a run_command() function that takes a SQL command as argument and executes it using the sqlite module.
def run_command(c):
with sqlite3.connect('chinook.db') as conn:
conn.isolation_level = None
conn.execute(c)
We now create a show_tables() function which uses our previous run_query() function to return a list of all the tables and views in our database.
def show_tables():
view_query = 'SELECT name, type FROM sqlite_master WHERE type IN ("table","view");'
return run_query(view_query)
Running this show_tables() function, we note that we have 11 tables including information on artists, albums, playlists, tracks and genre and business information such as customer, employee and invoices.
show_tables()
Suppose the Chinook record store signs a deal with a new record label and needs to select the first three albums it will add to the store from the following list. These albums are by artists that do not currently have tracks in the store. The new record label specializes in American artists, and have given Chinook funds to advertise the new albums in the United States. To address this business issue we will now analyze which genres are the most popular in the US.
| Artist Name | Genre |
|---|---|
| Regal | Hip-Hop |
| Red Tone | Punk |
| Meteor and the Girls | Pop |
| Slim Jim Bites | Blues |
Let us begin by writing a query which returns the number of tracks sold in the US in absolute and relative terms.
q = 'SELECT * FROM INVOICE LIMIT 5'
run_query(q)
q = '''
WITH usa_sales AS (
SELECT
i.invoice_id AS invoice_id,
il.track_id as track_id
FROM invoice i
INNER JOIN invoice_line il ON il.invoice_line_id = i.invoice_id
WHERE i.billing_country = 'USA'
),
usa_tracks_genres AS (
SELECT
usa_sales.invoice_id AS invoice_id,
usa_sales.track_id AS track_id,
g.name AS genre
FROM usa_sales
LEFT JOIN track t ON t.track_id = usa_sales.track_id
LEFT JOIN genre g ON g.genre_id = t.genre_id),
genre_totals AS (
SELECT
utg.genre as Genre,
COUNT(utg.track_id) AS Tracks_sold_total
FROM usa_tracks_genres AS utg
GROUP BY 1
)
SELECT
gt.Genre AS Genre,
gt.Tracks_sold_total AS Tracks_sold_total,
CAST((CAST(gt.Tracks_sold_total as float) / (SELECT SUM(Tracks_sold_total)
FROM genre_totals)) * 100 as float) as Tracks_sold_percentage
FROM genre_totals AS gt
'''
genre_sales_usa = run_query(q)
genre_sales_usa
genre_sales_usa = genre_sales_usa.set_index('Genre')
plt.figure()
genre_sales_usa['Tracks_sold_total'].plot.bar()
plt.title('Number of Tracks Sold in the US per Genre')
plt.show()
From the above graph, we observe that the most popular genre sold by Chinook in the US is by far the Rock genre (67% of track units sold). From the list of artists, we note that the most popular genres in order of popularity ar Alternative & Punk (7.6%), R&B/Soul (4.6%) and Pop (1.5%). We therefore recommend to add the albums by Red Tone, Meteor and the Girls, and Slim Jim Bites.
Each Chinook customer gets assigned to a sales support agent after completing their first purchase. We will now analyze customer purchases belonging to each employee to identify and compare the performance of sales support agents.
q = '''
WITH employee_sales AS (
SELECT
c.support_rep_id AS employee_id,
SUM(i.total) AS total_purchased
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
GROUP BY 1
)
SELECT
e.employee_id AS employee_id,
e.first_name || ' ' || e.last_name AS employee_name,
e.title AS title,
e.birthdate AS birthdate,
e.reports_to AS reports_to,
e.hire_date AS hire_date,
e.country AS country,
es.total_purchased AS employee_sales
FROM employee e
INNER JOIN employee_sales es ON e.employee_id = es.employee_id
ORDER BY 2
'''
employee_sales = run_query(q)
employee_sales
In order to check that our table contains all the sales support agents covering all of Chinook's sales, we run the following query and verify that Total Sales as reported in the invoice table equals the sum of employee_sales reported in the previous table.
q = '''
SELECT
SUM(total) as Total_Sales
FROM invoice
'''
run_query(q)
employee_sales = employee_sales.set_index('employee_name')
plt.figure()
employee_sales['employee_sales'].plot.bar()
plt.title('Total Sales per Chinook Employee')
plt.show()
We note that there are three sales support agents covering Chinook's customers. Sales are distributed relatively evenly between the agents. We make the following observations: first, all of the agents are from Canada and share the same title. Secondly, the agents have approximately the same level of experience in the store as they were hired within a few months from each other. Third, the agent with the highest amout of sales is the youngest, which could potentially reflect an ability to connect to a younger customer market.
We will now analyze the sales data for customers from each country. Guidance has been provided on using the country value from the customer table as opposed to the billing_country from the invoice table. This could be due to the customer's billing address differing from the actual location where the customer completed their purchase.
For each country, we will calculate the total number of customers, total value of sales, average value of sales per customer and average order value.
c = '''
CREATE VIEW country_sales AS
SELECT
c.country AS Country,
SUM(i.total) AS Total_Purchases,
COUNT(DISTINCT(c.customer_id)) AS Total_Customers,
COUNT(i.invoice_id) AS Total_Orders
FROM invoice i
INNER JOIN customer c ON c.customer_id = i.customer_id
GROUP BY 1
'''
run_command(c)
q = '''
SELECT * FROM country_sales
'''
run_query(q)
In order to verify that the above table is correct, let us calculate Chinook's total customers and total orders for the period and check that these numbers correspond to the sums in the above table. We already know that total sales amounted to $4,709 for the period, which coincides with the below.
q = '''
SELECT
SUM(total_purchases) AS Total_sales,
SUM(Total_Customers) AS Total_customers,
SUM(Total_Orders) AS Total_orders
FROM country_sales
'''
run_query(q)
q = '''
SELECT COUNT(invoice_id) AS Total_Orders
FROM invoice
'''
run_query(q)
q = '''
SELECT COUNT(customer_id) AS Total_Customers
FROM customer
'''
run_query(q)
As noted above, our totals coincide and so we may confirm that the country_sales table is correct. We now continue to calculate our variables of interest.
q = '''
WITH country_sales_other AS (
SELECT
CASE
WHEN Total_Customers = 1 THEN 'Other'
ELSE Country
END AS country_identification,
Total_Purchases,
Total_Customers,
Total_Orders
FROM country_sales),
country_sales_agg AS (
SELECT
country_identification,
SUM(Total_purchases) AS Total_Sales,
SUM(Total_customers) AS Total_Customers,
SUM(Total_Orders) AS Total_Orders
FROM country_sales_other
GROUP BY 1),
country_variables AS (
SELECT
country_identification as Country,
Total_Customers,
Total_Sales,
CAST(Total_Sales / Total_Customers as float) AS Average_sales_per_customer,
CAST(Total_Sales / Total_Orders as float) AS Average_order_value
FROM country_sales_agg
)
SELECT
Country,
Total_Sales,
Average_sales_per_customer,
Average_order_value
FROM (
SELECT
cv.*,
CASE
WHEN cv.Country = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_variables cv
)
ORDER BY sort ASC
'''
run_query(q)
c = '''
CREATE VIEW country_sales_data AS
WITH country_sales_other AS (
SELECT
CASE
WHEN Total_Customers = 1 THEN 'Other'
ELSE Country
END AS country_identification,
Total_Purchases,
Total_Customers,
Total_Orders
FROM country_sales),
country_sales_agg AS (
SELECT
country_identification,
SUM(Total_purchases) AS Total_Sales,
SUM(Total_customers) AS Total_Customers,
SUM(Total_Orders) AS Total_Orders
FROM country_sales_other
GROUP BY 1),
country_variables AS (
SELECT
country_identification as Country,
Total_Customers,
Total_Sales,
CAST(Total_Sales / Total_Customers as float) AS Average_sales_per_customer,
CAST(Total_Sales / Total_Orders as float) AS Average_order_value
FROM country_sales_agg
)
SELECT
Country,
Total_Sales,
Average_sales_per_customer,
Average_order_value
FROM (
SELECT
cv.*,
CASE
WHEN cv.Country = 'Other' THEN 1
ELSE 0
END AS sort
FROM country_variables cv
)
ORDER BY sort ASC
'''
run_command(c)
q = '''
SELECT * FROM country_sales_data
'''
country_sales_data = run_query(q)
total_sales_country = country_sales_data.copy()
total_sales_country = total_sales_country[['Country', 'Total_Sales']]
total_sales_country = total_sales_country.set_index('Country')
plt.figure()
total_sales_country.plot.bar()
plt.title('Total Chinook Sales per Country')
plt.plot()
average_order_value_country = country_sales_data.copy()
average_order_value_country = average_order_value_country[['Country', 'Average_order_value']]
average_order_value_country = average_order_value_country.set_index('Country')
plt.figure()
average_order_value_country.plot.bar()
plt.title('Average Order Value for each Country')
plt.plot()
average_sales_country_customer = country_sales_data.copy()
average_sales_country_customer = average_sales_country_customer[['Country', 'Average_sales_per_customer']]
average_sales_country_customer = average_sales_country_customer.set_index('Country')
plt.figure()
average_sales_country_customer.plot.bar()
plt.title('Average Sales per Customer for each Country')
plt.plot()
From the above table and visualizations, we can observe that, on its own, the US is Chinook's largest market, followed by Canada and Brazil. The 'Other' row includes sales from countries with a single customer. However, we see a clear difference between average sales in all of its countries (around 80 - 90 USD per customer on average) and average sales in Czech Republic of 136 USD per customer. This indicates that in this country average revenue per customer is significantly higher and targeted marketing could potentially be more profitable per customer compared to its other markets. Average order value tends to be approximately 8 USD per order; again, we notice that in the Czech Republic this value is slightly higher than in other countries at 9 USD per order.
Chinook's current sales strategy allows customers to make purchases in the following ways:
Customers are not allowed to purchase a whole album and then add individual tracks to that same purchase, unless they do so by manually choosing each track. Customers who purchase an album are charged the equivalent of the sum of the price of each of the tracks purchased.
Chinook's management is considering a change in their purchasing strategy by which they only purchase the most popular tracks from each album, instead of purchasing all of the tracks. To evaluate this proposal, we will analyze the percentage of purchases that are individual tracks compared to purchases of entire albums in order to understand the impact on revenue should management pursue this change of strategy.
To answer this question, we must identify which invoices include all the tracks from an album.
Let us begin by creating a table including album ID and track ID from the tracks included in that album.
c = '''
CREATE VIEW album_tracks_dictionary AS
SELECT
a.album_id AS album_id,
t.track_id AS track_id
FROM album a
INNER JOIN track t ON a.album_id = t.album_id
'''
run_command(c)
q = '''
SELECT * FROM album_tracks_dictionary
'''
run_query(q)
We next create a table of Chinook's sales including invoice ID, track ID and album ID.
c = '''
CREATE VIEW invoice_tracks_album AS
SELECT
il.invoice_id as invoice_id,
at.album_id AS album_id,
at.track_id AS track_id
FROM invoice_line il
INNER JOIN album_tracks at ON at.track_id = il.track_id
'''
run_command(c)
q = '''
SELECT * FROM invoice_tracks_album
'''
run_query(q)
q = '''
WITH invoice_full_dictionary AS
(
SELECT
ita.invoice_id AS invoice_id,
CASE
WHEN
(
SELECT track_id FROM (SELECT itainv.track_id AS track_id
FROM invoice_tracks_album itainv
WHERE itainv.invoice_id = ita.invoice_id)
EXCEPT
SELECT track_id FROM (SELECT ald.track_id AS track_id
FROM album_tracks_dictionary ald
WHERE ald.album_id = ita.album_id)
) IS NULL
AND
(
SELECT track_id FROM (SELECT ald.track_id AS track_id
FROM album_tracks_dictionary ald
WHERE ald.album_id = ita.album_id)
EXCEPT
SELECT track_id FROM (SELECT itainv.track_id AS track_id
FROM invoice_tracks_album itainv
WHERE itainv.invoice_id = ita.invoice_id)
) IS NULL
THEN 1
ELSE 0
END AS Full_Album
FROM invoice_tracks_album ita
),
invoice_dictionary AS
(
SELECT
invoice_id,
MAX(Full_Album) as Full_Album
FROM invoice_full_dictionary
GROUP BY 1
)
SELECT
COUNT(id.invoice_id) AS Total_invoices,
SUM(id.Full_Album) AS Total_Full_Album,
(CAST(SUM(id.Full_Album) / CAST(COUNT(id.invoice_id) as float) * 100 as float)) AS Percentage_Full_Album
FROM invoice_dictionary id
'''
run_query(q)
From the above, we note that only 19% of invoices are related to the purchase of whole albums. From this, we can infer that approximately 80% of purchases are made by customers seeking specific tracks, instead of purchasing whole albums. This seems to indicate that Chinook's strategy of only purchasing popular tracks could be a profitable one as purchases do not appear to be driven by less popular songs that are part of albums.
In this project, we explored Chinook's sales data using the sqlite3 and pandas modules to understand the nature of its sales and employees, and to evaluate strategic business questions in fictional scenarios such as the introduction of new artists to its product portfolio and a change in purchasing strategy. We also visualized some of the tables that were created using the matplotlib module in order to present a direct approach to responding these questions.