In this project, I use SQL to extract data from a database to answer important business questions. The database used is chinook.db, which is a sample database available in several database systems. The Chinook data represents a digital media store containing information about artists, their albums,sales information about the albums & tracks as well as customers and employee information. A schema of the database is provided below.
Using the chinook database, I answer questions such as which country recorded the highest sales, what are the popular genres of music, etc. The sample chinook database available in SQLite is used in this project. I will use the RSQLite
module to query the database from R
. Other modules (e.g., ggplot2
, dplyr
) are also used to manipulate and visualize query results.
Schema of Chinook Database (Src: Dataquest.io)
In this section, I import the modules that will allow connection to the chinook database, execute queries and fetch results. Additionally, modules that will help manipulate the query results and create appropriate visualizations are imported.
I also create 2 functions: run_query()
and show_tables()
. The run_query()
function will be used to execute all the queries in this project. It accepts a query as a parameter, connects to the chinook database, executes the query, disconnects from the database and returns the query results. The show_tables()
function, on the other hand, is used to test the run_query()
function buy displaying all tables
and views
in the chinook database when a connection to the database has been established.
library(RSQLite)
library(DBI)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyr)
db_file <- 'F:\\Data Science\\R\\DataQuest\\Step4 Working with Data Sources\\Course #2 Intermediate SQL in R\\Project1\\db\\chinook.db'
run_query <- function(query) {
conn <- dbConnect(SQLite(), db_file)
result <- dbGetQuery(conn, query)
dbDisconnect(conn)
return (result)
}
show_tables <- function() {
q <- "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view');"
run_query(q)
}
show_tables()
Confirming the above schema of the chinook database, the show_tables()
function indicate a total of 11 tables
in the database.Now, with a successful connection to the database, we can write queries to answer some interesting questions. Let’s start by analyzing album/tracks sales statistics per country.
As the first tast of this project, I want to understand the countries in which most sales occur. To address this, I need to collate data on sales made by customers from different countries. Specifically, for each country I need to understand:
Preliminary analysis shows there are a number of countries with only one customer. I will group the customers from these countries as Other
in the analysis. Let’s go ahead to write our query!
query1 <- '
wITH
country_with_one_cust AS
(
SELECT
country,
COUNT(customer_id) cust_count
FROM customer
GROUP BY country
HAVING cust_count = 1
),
country_or_other AS
(
SELECT
CASE
WHEN c.country IN
(SELECT country FROM country_with_one_cust) THEN "Other"
ELSE country
END
AS country,
i.*
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
)
SELECT
country,
customers,
total_sales,
avg_customer_sales,
avg_order
FROM
(
SELECT
country,
COUNT(DISTINCT customer_id) customers,
SUM(total) total_sales,
SUM(total)/COUNT(DISTINCT customer_id) avg_customer_sales,
SUM(total)/COUNT(DISTINCT invoice_id) avg_order,
CASE
WHEN country = "Other" THEN 1
ELSE 0
END AS sort
FROM country_or_other
GROUP BY country
ORDER BY sort ASC, total_sales DESC
);
'
run_query(query1)
Using the query results, let’s create some visualizations to see if any patterns emerge. First, we will plot the customers and total sales information for each country. Next, we will visualize the average customer order data.
#assign query1 results to sales_by_country maintaining the sorted order by sales
sales_by_coutries <- run_query(query1) %>%
mutate(country = factor(country, levels = c(country)))
g1 <- ggplot(data = sales_by_coutries,
aes(x = country, y = customers, fill = "#FF6666")) +
geom_bar(stat = "identity") +
labs(title = "Number of Customers By Country", x="", y="") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
panel.background = element_rect(fill = "white"),
legend.position = "none")
# percentage of customers and sales
cust_and_sales <- sales_by_coutries %>%
select(country : total_sales) %>%
mutate(customers = customers/sum(customers) * 100,
total_sales = total_sales/sum(total_sales) * 100) %>%
pivot_longer(cols=c(customers, total_sales),
names_to = "variable",
values_to = "value")
g2 <- ggplot(data = cust_and_sales,
aes(x= country, y=value, fill=variable)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Customers vs Total Sales (%)", x="", y="") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
panel.background = element_rect(fill = "white"),
legend.position = c(0.65, 0.8),
legend.title = element_blank())
#plot g1 and g2 in on the same row
grid.arrange(g1, g2, nrow = 1)
From the charts, We see that countries in the
Other
category together account for close to a quarter of all customers and sales. However, since each of these countries has only one customer, it is difficult to draw much conclusions from the aggregated sales information of these countries. Consequently, in the next analyses of average sales information, I will exclude the Other
category.
#filter out 'Other' category from results of query1
sales_by_coutries <- run_query(query1) %>%
filter(country != "Other")
#average sales per cutomer
g3 <- ggplot (data = sales_by_coutries,
aes(x = reorder(country, -avg_customer_sales),
y=avg_customer_sales)) +
geom_bar(stat = "identity") +
update_geom_defaults("bar", list(fill = "#4381AB")) +
labs(title = "Avg. Customer Order(USD)", x="", y="") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
panel.background = element_rect(fill = "white"),
legend.position = "none")
#average order difference from mean (%)
avg_order <- sales_by_coutries %>%
select(country, avg_order) %>%
mutate(avg_order_diff = ((avg_order - mean(avg_order))/
mean(avg_order)) * 100)
g4 <- ggplot (data = avg_order,
aes(x = reorder(country, -avg_order_diff),
y=avg_order_diff)) +
geom_bar(stat = "identity") +
update_geom_defaults("bar", list(fill = "#4381AB")) +
labs(title = "Avg.Order, % Difference from Mean", x="", y="") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
panel.background = element_rect(fill = "white"),
legend.position = "none") +
coord_cartesian(ylim = c(-20, 20))
#plot g4 and g3 on the same row
grid.arrange(g4, g3, nrow = 1)
In summary, the analyses of sales by countries show that a significant amount of sales occur in North America as USA and Canada together account for around 34% of all customers and sales. Though, there are only a few customers from countries including Czech Republic, United Kingdom and India, these countries customers’ orders are on average ones of the highest compared to other countries. This may suggest business opportunity for Chinook store in the 3 countries. However, given the small data sample used in the current analyses, the store may further collect and analyze customer data from small marketing campaigns in the 3 countries to see whether the trend observed in this project still holds.
Having understood the countries in which Chinook store sells most, let’s identify the popular music genres among customers in the top five countries which account for most customers and sales: USA, Canada, Brazil, France, and Germany.
query2 <- '
WITH
popular_genre AS
(
SELECT
g.name,
COUNT(*) tracks_sold
FROM genre g
INNER JOIN track t ON g.genre_id = t.genre_id
INNER JOIN invoice_line il ON il.track_id = t.track_id
INNER JOIN invoice i ON i.invoice_id = il.invoice_id
WHERE i.billing_country IN ("USA", "Canada", "Brazil", "France", "Germany")
GROUP BY 1
)
SELECT
name,
tracks_sold,
CAST (tracks_sold AS FLOAT)/
(SELECT SUM(tracks_sold)
FROM popular_genre) * 100 tracks_sold_pct
FROM popular_genre
ORDER BY 2 DESC
LIMIT 10;
'
run_query(query2)
Let’s use visualization to highlight the findings of the query results.
#plot a bar chart of tracks sold per genre
g1 <- ggplot(data=run_query(query2),
aes(x=reorder(name,-tracks_sold), y=tracks_sold)) +
geom_bar(stat="identity") +
update_geom_defaults("bar", list(fill = "#4381AB")) +
geom_text(aes(label = paste(as.integer(tracks_sold_pct), "%", sep=""), hjust=0)) +
labs(title = "Top 10 Popular Music Genres", x="", y = "Number of Tracks Sold") +
theme(panel.background = element_rect(fill = "white"))
#flips chart horizantally
g1 + coord_flip()
overall, the analyses show Rock as the most popular genre among customers; it accounts for more than half of (54%
) of all purchased tracks.Other popular genres include Metal and Alternative & Punk, which together account for 23%
of tracks sales. Thus, if Chinook store wants to market tracks which will sell well, it can focus more on these 3 popular genres.
The preceding analyses have revealed some countries where the Chinook media store can make potential investments. The analyses also help the store to identify the most popular music genres. To aid the stores future marketing and investiment decisions, it is important to understand whether individuals tend to buy whole albums or a collection of individual tracks.These are the two ways customers can purchase from the Chinook store.Customers cannot add individual tracks to an album purchase.
The focus of this section is thus to determing what percentage of purchases included individual tracks vs whole albums. To address this question, I used the following steps:
ID
of a purchased album to get all the tracks in the album.EXCEPT
operator to check whether an invoice included the purcase of a whole album or individual tracks.query3 <- '
WITH
purchased_tracks AS
(
SELECT
t.album_id,
t.track_id,
il.invoice_id
FROM track t
INNER JOIN invoice_line il ON il.track_id = t.track_id
),
album_and_tracks AS
(
SELECT
t.album_id,
t.track_id
FROM track t
INNER JOIN
(
SELECT DISTINCT album_id
FROM purchased_tracks
) ta
ON t.album_id = ta.album_id
),
album_or_track_sales AS
(
SELECT
DISTINCT p.invoice_id,
CASE
WHEN
(
SELECT track_id
FROM purchased_tracks
WHERE invoice_id = (SELECT DISTINCT p.invoice_id From purchased_tracks)
EXCEPT
SELECT track_id
FROM album_and_tracks
WHERE album_id = (SELECT DISTINCT p.album_id From purchased_tracks)
) IS NULL
AND
(
SELECT track_id
FROM album_and_tracks
WHERE album_id = (SELECT DISTINCT p.album_id From purchased_tracks)
EXCEPT
SELECT track_id
FROM purchased_tracks
WHERE invoice_id = (SELECT DISTINCT p.invoice_id From purchased_tracks)
) IS NULL THEN "Yes"
ELSE "No"
END
AS album_purchase
FROM purchased_tracks p
)
SELECT
album_purchase,
COUNT(invoice_id) num_of_invoice,
CAST(COUNT(invoice_id) AS FLOAT) / (SELECT COUNT(*)
FROM album_or_track_sales) percent
FROM album_or_track_sales
GROUP BY 1;
'
run_query(query3)
The results show that a little over 81%
of all purchases are individual tracks indicating that customers prefer buying a collection of tracks over whole albums. Chinook store’s strategy going forward thus may focus on promoting popular tracks.
It is important for Chinook store to understand the performance of its sales representatives regarding sales of albums/tracks. Therefore, as the final task in this project, we want to understand the total sales (in dollars) made by each sales representative.
query4 <- '
WITH
employee_sales AS
(
SELECT
c.support_rep_id,
COUNT(*) num_of_sales,
SUM(i.total) total_sales_value
FROM customer c
INNER JOIN invoice i ON c.customer_id = i.customer_id
GROUP BY 1
)
SELECT
e.first_name || " " || e.last_name name,
e.hire_date,
es.num_of_sales,
es.total_sales_value
FROM employee e
INNER JOIN employee_sales es ON e.employee_id = es.support_rep_id
ORDER BY es.total_sales_value DESC;
'
run_query(query4)
Jane Peacock who has been working for Chinook store the longest has the largest sales in dollars. However, Magaret Park, hired a month after Jane Peacock, has made 2 more sales than Jane. The results suggest that sales performance may not be due entirely to how long an employee has worked for the store but may also be due to other factors including the value of the sales made.