Introduction and Set-up

Chinook is a data model that represents a digital media store including tables for albums, artists, invoices, customers, employees, genre and tracks. In this analysis we look into how business decisions can be made through the use of R and SQL for statistical insight.

Import libraries required to establish SQLite database engine and connect to the database within R.

library(RSQLite)
library(DBI)

Import the database

db <- 'chinook.db'

Create function to connect to the database, run a query and disconnect. We will repeatedly use this function to run all the queries written in R.

run_query <- function(q){
  conn <- dbConnect(SQLite(), db)
  result <- dbGetQuery(conn, q)
  dbDisconnect(conn)
  return(result)
}

Create function to define query string that lists all tables and views in database and use run_query() to execute the same.

show_tables <- function(){
  q <- 'SELECT
    name,
    type
FROM sqlite_master
WHERE type IN ("table","view");'
  
  return(run_query(q))
}

show_tables()
##              name  type
## 1           album table
## 2          artist table
## 3        customer table
## 4        employee table
## 5           genre table
## 6         invoice table
## 7    invoice_line table
## 8      media_type table
## 9        playlist table
## 10 playlist_track table
## 11          track table

Analysing New Albums to Purchase

The Chinook records store has to pick 3 albums out of 4 options as a deal with a new record label. In order to pick the albums of the most popular genre, we write a query to see what genres have sold the most tracks in USA.

q <- 'WITH usa_tracks AS (
      SELECT il.*
      FROM customer c
      INNER JOIN invoice i ON i.customer_id = c.customer_id
      INNER JOIN invoice_line il ON il.invoice_id = il.invoice_id
      WHERE c.country = "USA"
      )

      SELECT g.name as Genre,
        COUNT(t.track_id) as "Tracks Sold",
        CAST(COUNT(t.track_id) AS FLOAT)/(SELECT COUNT(*) from usa_tracks) as "Percent Sold"
      FROM usa_tracks u
      INNER JOIN track t ON t.track_id = u.track_id
      INNER JOIN genre g ON g.genre_id = t.genre_id
      GROUP BY g.genre_id
      ORDER BY "Tracks Sold" DESC
      LIMIT 10'
genre_tracks <- run_query(q)
print(genre_tracks)
##                 Genre Tracks Sold Percent Sold
## 1                Rock      345185   0.55392054
## 2               Metal       81089   0.13012403
## 3  Alternative & Punk       64452   0.10342653
## 4               Latin       21877   0.03510616
## 5            R&B/Soul       20829   0.03342443
## 6               Blues       16244   0.02606685
## 7                Jazz       15851   0.02543620
## 8         Alternative       15327   0.02459533
## 9      Easy Listening        9694   0.01555602
## 10                Pop        8253   0.01324364
library(ggplot2)
ggplot(data = genre_tracks, aes(x = reorder(Genre, -`Percent Sold`), y = `Percent Sold`)) + geom_col() + theme(axis.text.x = element_text(angle = 30)) + labs(x = "Genre", title = "Most Popular Music Genre in USA", subtitle = "Per the % of tracks sold, 10 of the most popular music genres are chosen for comparision")

We see that out of hip-hop, punk, pop and blues, the 3 most popular are is punk, blues and pop. Thus, the recommendation is to purchase - 1. Red Tone (Punk) 2. Slim Jim Bites (Blues) 3. Meteor and the Girls (Pop)

Analysing Employee Sales Performance

Next, let us take a look at the total sales assigned to each sales support agent in the company.

q <- "SELECT e.first_name || ' '|| e.last_name AS 'Employee Name',
        SUM(i.total) AS 'Total Sales($)',
        e.hire_date as 'Hire Date'
      FROM customer c
      INNER JOIN employee e ON e.employee_id = c.support_rep_id
      INNER JOIN invoice i ON i.customer_id = c.customer_id
      GROUP BY e.employee_id
      ORDER BY 2 DESC"

sales_rep <- run_query(q)
print(sales_rep)
##   Employee Name Total Sales($)           Hire Date
## 1  Jane Peacock        1731.51 2017-04-01 00:00:00
## 2 Margaret Park        1584.00 2017-05-03 00:00:00
## 3 Steve Johnson        1393.92 2017-10-17 00:00:00
ggplot(data = sales_rep, aes(x = `Employee Name`, y = `Total Sales($)`)) + geom_col() + labs(title = "Analysis of Employee Sales Performance", subtitle = "By gathering data of total amount spent by customers associated with each sales support agent\n")

Jane Peacock has the highest amount of sales but it is important to note that she is also been an employee at Chinook for the longest time.

Country-wise analysis

Let us look at sales data of different customers, grouping by country.

q <- "
      WITH country_prelim AS(
        SELECT c.country    AS Country_Name,
              COUNT(distinct c.customer_id)   AS 'Customers',
              SUM(il.unit_price)    AS 'Total Sales($)',
              ROUND(SUM(il.unit_price) / COUNT(distinct c.customer_id),2)    AS 'Avg. Sales Per Customer',
              ROUND(SUM(il.unit_price) / COUNT(distinct i.invoice_id),2)   AS 'Avg. Order Value',
              CASE
                WHEN COUNT(distinct c.customer_id) < 2 THEN 'Other'
                ELSE Country
              END as Country
        FROM invoice_line il
          INNER JOIN invoice i ON i.invoice_id = il.invoice_id
          INNER JOIN customer c ON c.customer_id = i.customer_id
        GROUP BY 1
        ORDER BY 1)

      SELECT
        Country,
        SUM(Customers) AS Customers,
        SUM(`Total Sales($)`) AS 'Total Sales($)',
        SUM(`Avg. Sales Per Customer`) AS 'Avg. Sales Per Customer',
        SUM(`Avg. Order Value`) AS 'Avg. Order Value'
      FROM (
            SELECT *,
              CASE
                WHEN Country = 'Other' THEN 1
                ELSE 0
              END AS sort
            FROM country_prelim
      )
      GROUP BY Country
      ORDER BY sort ASC, `Total Sales($)` DESC
      "
  
  
country_sales <- run_query(q)
print(country_sales)
##           Country Customers Total Sales($) Avg. Sales Per Customer
## 1             USA        13        1040.49                   80.04
## 2          Canada         8         535.59                   66.95
## 3          Brazil         5         427.68                   85.54
## 4          France         5         389.07                   77.81
## 5         Germany         4         334.62                   83.66
## 6  Czech Republic         2         273.24                  136.62
## 7  United Kingdom         3         245.52                   81.84
## 8        Portugal         2         185.13                   92.57
## 9           India         2         183.15                   91.58
## 10          Other        15        1094.94                 1094.94
##    Avg. Order Value
## 1              7.94
## 2              7.05
## 3              7.01
## 4              7.78
## 5              8.16
## 6              9.11
## 7              8.77
## 8              6.38
## 9              8.72
## 10           111.66
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
country_sales %>% filter(Country != "Other") %>% ggplot(aes(x = Country, y = Customers, fill = Country)) + geom_col() + theme(axis.text.x = element_text(angle = 30)) + labs(title = "Country-wise Distribution of Customers", subtitle = "All countries that have more than 1 customer are included in this analysis.\n")

country_sales %>% filter(Country != "Other") %>% ggplot(aes(x = Country, y = `Total Sales($)`, fill = Country)) + geom_col() + theme(axis.text.x = element_text(angle = 30)) + labs(title = "Country-wise Sales Analysis", subtitle = "All countries that have more than 1 customer are included in this analysis.\n")

country_sales %>% filter(Country != "Other") %>% ggplot(aes(x = Country, y = `Avg. Sales Per Customer`, fill = Country)) + geom_col() + theme(axis.text.x = element_text(angle = 30)) + labs(title = "Average Sales Per Customer by Country", subtitle = "All countries that have more than 1 customer are included in this analysis.\n")

Based on the analysis, we can see that USA is the country with the most number of customers and has the highest total sales value. It is interesting to note that the average value per order is the highest in Czech Republic.

q <- "SELECT c.country    AS Country_Name,
              COUNT(distinct c.customer_id)   AS 'Customers',
              SUM(il.unit_price)    AS 'Total Sales($)',
              ROUND(SUM(il.unit_price) / COUNT(distinct c.customer_id),2)    AS 'Avg. Sales Per Customer',
              ROUND(SUM(il.unit_price) / COUNT(distinct i.invoice_id),2)   AS 'Avg. Order Value'
              
        FROM invoice_line il
          INNER JOIN invoice i ON i.invoice_id = il.invoice_id
          INNER JOIN customer c ON c.customer_id = i.customer_id
          WHERE c.country in ('USA', 'Czech Republic')
        GROUP BY 1
        ORDER BY 1"

compare <- run_query(q)
print(compare)
##     Country_Name Customers Total Sales($) Avg. Sales Per Customer
## 1 Czech Republic         2         273.24                  136.62
## 2            USA        13        1040.49                   80.04
##   Avg. Order Value
## 1             9.11
## 2             7.94

We can see a clear comparison above highlighting the high values for average sales per customer and average order value in Czech Republic being higher. It might be a good idea for marketing team to consider Czech Republic as a country with potential for growth.