Introduction

  • https://www.theforage.com/ has hosted a virtual learning experience with KPMG, to simulate relevant work experience. Sprocket Central Pty Ltd needs help with its customer and transactions data. The company has a large dataset relating to its customers, but their team is unsure how to effectively analyze and optimize its marketing strategy.

Project Recap

  • CSV Cleaning Process On Excel

    • CustomerDemographic.csv
      • Row 1 had a “Note: The data . . .”, remove it. Highlight row and delete.
      • Column named [default], CTRL + SPACE twice to highlight. It is not needed
      • customer_id: 34, was born in 1843-12-21, remove.
      • In the gender column filter for rows that attempted to spell “Female” and spell it correctly. Now do the same for “Male”.

    • CustomerAddress.csv
      • Row 1 had a “Note: The data . . .”, remove it. Highlight row and delete.
      • Replace all state abbreviations. NSW as New South Wales, QLD as Queensland, VIC as Victoria

    • Transactions.csv
      • Row 1 had a “Note: The data . . .”, remove it. Highlight row and delete.
      • Convert [product_first_sold_date] 5-digit date into mm/dd/yyyy format.

Schema



SQL Queries and Analysis

SQL queries are written in RMarkdown.

  • Libraries used
    • library(odbc)
    • library(DT)


This is what our final table will look like. Transactions will be the main table that will be analyzed. It has unique [customer_id] and [transaction_id]s as well as the [listing_price], which will subtract [standard] cost to get the “profit”.

I will be left joining the [CustomerAddress] table by the customer_id to Transactions.customer_id.

And the [CustomerDemographic] table by the customer_id to Transactions.customer_id.

This is the data preview.

select 
    *
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id


Problem: What is the total number of canceled vs. approved orders within the dataset?


Query Process: Group by the [order_status], and count(order_status).



select 
    order_status
    , count(order_status) quantity
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
group by order_status


Conclusion: 19821 Approved Orders, and 179 canceled orders.




Problem: Lets calculate the total profit Sprocket Central has gained.


Query Process: Use Transaction table with the joins created earlier, filter for [order_status]=‘Approved’ orders only, leaving out ‘Canceled’ orders. Round the sum of [list_price] - [standard_cost] to 0 decimals.



select 
    round(sum(list_price - isnull(standard_cost,0)),0) Total_Profit
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'


Conclusion: 11047456$ was the total amount of profit within the dataset.




Problem: What brand type was the most popular. List the profit earned for each brand made.


Query Process: Count the total number of [customer_id], then use over() to create a new column called “Total_Customers” that has the total number of customers. Make this query into a common table expression named cte1.

Query Process: Group by the [brand], do a count() of ‘Approved’ orders, and do a sum of [list_price]-[standard_cost].



select 
    brand
    ,Count(*) Total_Customers
    ,round(sum(list_price-isnull(standard_cost,0)),0) Profit
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'
group by
    brand
order by
    Total_Customers desc


Conclusion: Solex seems to be the most popular brand, but WeareA2B seems to produce the most profit.




Problem: Lets give a percentage of the total profit each state produces.


Query Process: Count the total number of [customer_id], then use over() to create a new column called “Total_Profit” that has the total profit. Make this query into a common table expression named cte1.

Query from cte1, group by [state], do another aggregation * 100 / max[Total_Customers], name it Percentage.



with cte1 as(
select
    isnull(state,'Unknown') as state
    ,list_price
    ,standard_cost
        /*This Total_Profit column will be used to calculate the percentage
    in the next query.*/
    ,sum(list_price - isnull(standard_cost,0)) over () Total_Profit
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'
)

select 
    state
    ,round(((sum(list_price - isnull(standard_cost,0)))*100.0/max(Total_Profit)),2) as Percentage 
from 
    cte1
group by 
    state
order by Percentage desc


Conclusion: Majority of the profit comes from New South Wales.




Problem: For each month lets give the percentage increase in profit compared to the previous month.


Query Process: Group by month, and calculate the monthly profit, sum of [list_price] - [standard_cost], that new column will be “monthly_profit”. Then use the lag() function to accesses the value from the previous row of the new column “monthly_profit”, name it “prev_monthly_profit”. Name the query as cte1.

Query from cte1, create new column called “percent_change”. The equation for percent change is ([monthly_profit] - [prev_monthly_profit])/[prev_monthly_profit]. Standard percent change equation.



with cte1 as(
select
    datepart(MONTH,transaction_date) month
    ,sum(list_price - isnull(standard_cost,0)) monthly_profit
    ,lag(sum(list_price - isnull(standard_cost,0))) 
    over (order by datepart(year,transaction_date), datepart(MONTH,transaction_date)) prev_monthly_profit
    
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'
group by
    datepart(MONTH,transaction_date),datepart(year,transaction_date)
)

select
    month
    ,prev_monthly_profit
    ,monthly_profit
    ,round(((100*(monthly_profit-prev_monthly_profit))/prev_monthly_profit),2) percent_change
from cte1


Conclusion: Month of October, 2017, seems to have the highest profit earned, and the highest percent increase in revenue.




Problem: What are the top 10 customers.


Query Process: Group by [Transactions.customer_id], [first_name], [last_name], [job_title]. Order by the profit_earned descending.



select top 10
    Transactions.customer_id
    ,first_name
    ,last_name
    ,sum(list_price - isnull(standard_cost,0)) profit_earned
    ,job_title
    
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'
group by
    Transactions.customer_id
    ,first_name
    ,last_name
    ,job_title
order by 
    profit_earned desc


Conclusion: Seems a recruiting manager named Tye Doohan was the highest paying customer.




Problem: Label our customers into different age groups: young, middle-aged, elderly.


Query Process: Use a case statement to create a column called [Age_Label] to label our customers that are Young if the [Todays_date]-[DOB] <= 45, >= 65 as Middle-Aged, else then Elderly. Create a new column that count the total amount of customers then use over() clause. Make it cte1.

New query from cte1, group by [Age_Label], make a new column call Age_Percent that counts the total number of each age group then divide by the total number of customers to get a percentage.



with cte1 as(
select
    count(*) over () total_customers
    ,case when datediff(year,DOB,CAST( GETDATE() AS Date )) <= 45 then 'Young'
    when datediff(year,DOB,CAST( GETDATE() AS Date )) <= 65 then 'Middle Aged'
    else 'Elderly' 
    end Age_Label
from 
    Transactions
    left join CustomerAddress 
        on CustomerAddress.customer_id = Transactions.customer_id
    left join CustomerDemographic
        on CustomerDemographic.customer_id = Transactions.customer_id
where
    order_status = 'Approved'
)
select
    Age_Label
    ,round((100.*count(Age_Label))/max(total_customers),2) Age_Percent
from
    cte1
group by
    Age_Label
order by
    Age_Percent desc


Conclusion: Majority are Young and under 45.