Introduction

In this project, I use SQL to extract data from a database to answer important business questions. The database used is chinook.db, which is a sample database available in several database systems. The Chinook data represents a digital media store containing information about artists, their albums,sales information about the albums & tracks as well as customers and employee information. A schema of the database is provided below.

Using the chinook database, I answer questions such as which country recorded the highest sales, what are the popular genres of music, etc. The sample chinook database available in SQLite is used in this project. I will use the RSQLite module to query the database from R. Other modules (e.g., ggplot2, dplyr) are also used to manipulate and visualize query results.

Schema of Chinook Database (Src: Dataquest.io)

Importing Modules & Creating Helper Functions

In this section, I import the modules that will allow connection to the chinook database, execute queries and fetch results. Additionally, modules that will help manipulate the query results and create appropriate visualizations are imported.

I also create 2 functions: run_query() and show_tables(). The run_query() function will be used to execute all the queries in this project. It accepts a query as a parameter, connects to the chinook database, executes the query, disconnects from the database and returns the query results. The show_tables() function, on the other hand, is used to test the run_query() function buy displaying all tables and views in the chinook database when a connection to the database has been established.

library(RSQLite)
library(DBI)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(tidyr)
db_file <- 'F:\\Data Science\\R\\DataQuest\\Step4 Working with Data Sources\\Course #2 Intermediate SQL in R\\Project1\\db\\chinook.db'

run_query <- function(query) {
   conn <- dbConnect(SQLite(), db_file)
  result <- dbGetQuery(conn, query)
  dbDisconnect(conn)
  return (result)
}

show_tables <- function() {
  q <- "SELECT name, type FROM sqlite_master WHERE type IN ('table', 'view');"
  run_query(q)
}

show_tables()

Confirming the above schema of the chinook database, the show_tables() function indicate a total of 11 tables in the database.Now, with a successful connection to the database, we can write queries to answer some interesting questions. Let’s start by analyzing album/tracks sales statistics per country.

Analyzing Sales By Country

As the first tast of this project, I want to understand the countries in which most sales occur. To address this, I need to collate data on sales made by customers from different countries. Specifically, for each country I need to understand:

Preliminary analysis shows there are a number of countries with only one customer. I will group the customers from these countries as Other in the analysis. Let’s go ahead to write our query!

query1 <- '
          wITH
              country_with_one_cust AS
              (
                  SELECT 
                    country, 
                    COUNT(customer_id) cust_count 
                  FROM customer 
                  GROUP BY country 
                  HAVING cust_count = 1
              ),
              
              country_or_other AS
              (
                  SELECT
                      CASE
                        WHEN c.country IN
                            (SELECT country FROM country_with_one_cust) THEN "Other"
                        ELSE country
                        END 
                        AS country,
                      i.*
                FROM customer c
                INNER JOIN invoice i ON c.customer_id = i.customer_id
              )
              
          SELECT
            country,
            customers,
            total_sales,
            avg_customer_sales,
            avg_order
          FROM
          (
           SELECT
            country,
            COUNT(DISTINCT customer_id) customers,
            SUM(total) total_sales,
            SUM(total)/COUNT(DISTINCT customer_id) avg_customer_sales,
            SUM(total)/COUNT(DISTINCT invoice_id) avg_order,
            CASE
              WHEN country = "Other" THEN 1
              ELSE 0
            END AS sort
          FROM country_or_other
          GROUP BY country
          ORDER BY sort ASC, total_sales DESC
          );
      '
run_query(query1)

Using the query results, let’s create some visualizations to see if any patterns emerge. First, we will plot the customers and total sales information for each country. Next, we will visualize the average customer order data.

#assign query1 results to sales_by_country maintaining the sorted order by sales
sales_by_coutries <- run_query(query1) %>%
                      mutate(country = factor(country, levels = c(country)))


g1 <- ggplot(data = sales_by_coutries, 
         aes(x = country, y = customers, fill = "#FF6666")) +
         geom_bar(stat = "identity") +
         labs(title = "Number of Customers By Country", x="", y="") +
         theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
                axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
                panel.background = element_rect(fill = "white"),
                legend.position = "none")


# percentage of customers and sales
cust_and_sales <- sales_by_coutries %>% 
                    select(country : total_sales) %>%
                    mutate(customers = customers/sum(customers) * 100,
                           total_sales = total_sales/sum(total_sales) * 100) %>%
                    pivot_longer(cols=c(customers, total_sales), 
                                 names_to = "variable", 
                                 values_to = "value")

g2 <- ggplot(data = cust_and_sales, 
         aes(x= country, y=value, fill=variable)) +
         geom_bar(stat = "identity", position = "dodge") +
         labs(title = "Customers vs Total Sales (%)", x="", y="") +
         theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
                axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
                panel.background = element_rect(fill = "white"),
                legend.position = c(0.65, 0.8),
                legend.title = element_blank())

#plot g1 and g2 in on the same row
grid.arrange(g1, g2, nrow = 1)

From the charts, We see that countries in the Other category together account for close to a quarter of all customers and sales. However, since each of these countries has only one customer, it is difficult to draw much conclusions from the aggregated sales information of these countries. Consequently, in the next analyses of average sales information, I will exclude the Other category.

#filter out 'Other' category from results of query1
sales_by_coutries <- run_query(query1) %>%
                      filter(country != "Other")

#average sales per cutomer
g3 <- ggplot (data = sales_by_coutries,
              aes(x = reorder(country, -avg_customer_sales), 
                  y=avg_customer_sales)) +
              geom_bar(stat = "identity") +
              update_geom_defaults("bar",   list(fill = "#4381AB")) +
              labs(title = "Avg. Customer Order(USD)", x="", y="") +
              theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
                axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
                panel.background = element_rect(fill = "white"),
                legend.position = "none")


#average order difference from mean (%)
avg_order <- sales_by_coutries %>%
                    select(country, avg_order) %>%
                    mutate(avg_order_diff = ((avg_order - mean(avg_order))/
                                               mean(avg_order)) * 100)

g4 <- ggplot (data = avg_order,
              aes(x = reorder(country, -avg_order_diff), 
                  y=avg_order_diff)) +
              geom_bar(stat = "identity") +
              update_geom_defaults("bar",   list(fill = "#4381AB")) +
              labs(title = "Avg.Order, % Difference from Mean", x="", y="") +
              theme(axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
                axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
                panel.background = element_rect(fill = "white"),
                legend.position = "none") +
              coord_cartesian(ylim = c(-20, 20))
              
          

#plot g4 and g3 on the same row
grid.arrange(g4, g3, nrow = 1)

In summary, the analyses of sales by countries show that a significant amount of sales occur in North America as USA and Canada together account for around 34% of all customers and sales. Though, there are only a few customers from countries including Czech Republic, United Kingdom and India, these countries customers’ orders are on average ones of the highest compared to other countries. This may suggest business opportunity for Chinook store in the 3 countries. However, given the small data sample used in the current analyses, the store may further collect and analyze customer data from small marketing campaigns in the 3 countries to see whether the trend observed in this project still holds.

Analyzing Album vs Tracks Sales

The preceding analyses have revealed some countries where the Chinook media store can make potential investments. The analyses also help the store to identify the most popular music genres. To aid the stores future marketing and investiment decisions, it is important to understand whether individuals tend to buy whole albums or a collection of individual tracks.These are the two ways customers can purchase from the Chinook store.Customers cannot add individual tracks to an album purchase.

The focus of this section is thus to determing what percentage of purchases included individual tracks vs whole albums. To address this question, I used the following steps:

query3 <- '
            WITH
                purchased_tracks AS
                (
                  SELECT
                      t.album_id, 
                      t.track_id, 
                      il.invoice_id
                  FROM track t
                  INNER JOIN invoice_line il ON il.track_id = t.track_id
                ),
                
                album_and_tracks AS
                (
                  SELECT
                      t.album_id,
                      t.track_id
                  FROM track t
                  INNER JOIN
                      (
                        SELECT DISTINCT album_id
                        FROM purchased_tracks
                      ) ta
                      ON t.album_id = ta.album_id
                ),
                
                album_or_track_sales AS
                (
                  SELECT 
                    DISTINCT p.invoice_id,
                    CASE
                      WHEN
                        (
                          SELECT track_id
                          FROM purchased_tracks
                          WHERE invoice_id = (SELECT DISTINCT p.invoice_id From purchased_tracks)
                          
                          EXCEPT
                          
                          SELECT track_id
                          FROM album_and_tracks
                          WHERE album_id = (SELECT DISTINCT p.album_id From purchased_tracks)
                        ) IS NULL
                        
                        AND
                        
                        (
                          SELECT track_id
                          FROM album_and_tracks
                          WHERE album_id = (SELECT DISTINCT p.album_id From purchased_tracks)
                          
                          EXCEPT
                          
                          SELECT track_id
                          FROM purchased_tracks
                          WHERE invoice_id = (SELECT DISTINCT p.invoice_id From purchased_tracks)
                        ) IS NULL THEN "Yes"
                     ELSE "No"
                   END
                   AS album_purchase
                FROM purchased_tracks p
                )
            
            SELECT 
              album_purchase,
              COUNT(invoice_id) num_of_invoice,
              CAST(COUNT(invoice_id) AS FLOAT) / (SELECT COUNT(*)
                                                    FROM album_or_track_sales) percent
            FROM album_or_track_sales
            GROUP BY 1;
    '
run_query(query3)

The results show that a little over 81% of all purchases are individual tracks indicating that customers prefer buying a collection of tracks over whole albums. Chinook store’s strategy going forward thus may focus on promoting popular tracks.

Analyzing Employee Sales Performance

It is important for Chinook store to understand the performance of its sales representatives regarding sales of albums/tracks. Therefore, as the final task in this project, we want to understand the total sales (in dollars) made by each sales representative.

query4 <- '
              WITH 
                  employee_sales AS
                  (
                    SELECT
                      c.support_rep_id,
                      COUNT(*) num_of_sales,
                      SUM(i.total) total_sales_value
                    FROM customer c
                    INNER JOIN invoice i ON c.customer_id = i.customer_id
                    GROUP BY 1
                  )
              
              SELECT
                  e.first_name || " " || e.last_name name,
                  e.hire_date,
                  es.num_of_sales,
                  es.total_sales_value
              FROM employee e
              INNER JOIN employee_sales es ON e.employee_id = es.support_rep_id
              ORDER BY es.total_sales_value DESC;
          '
run_query(query4)

Jane Peacock who has been working for Chinook store the longest has the largest sales in dollars. However, Magaret Park, hired a month after Jane Peacock, has made 2 more sales than Jane. The results suggest that sales performance may not be due entirely to how long an employee has worked for the store but may also be due to other factors including the value of the sales made.