Introduction

Project Recap

  • CSV Files

    • supply_chain_data.csv
      • Dataset collected from a Fashion and Beauty startup. The data set is based on the supply chain of Makeup products.

Data Preview

select * 
from supply_chain_data


SQL Queries and Analysis

SQL queries are written in RMarkdown.

  • Libraries used
    • library(odbc)
    • library(DT)



Problem: I would like to understand the customer demographic, and get a percentage of each.


Query Process: Make a sub-query labeled as temp, then create a new column call [Total_Customers], which is a count aggregation of the entire data set, then use the OVER() clause to place the aggregated result in all the rows.

Call onto the subquery temp (with the from clause), group by [Customer_demographics]. Do a count aggregation that will be grouped by [Customer_demographics], multiply by 100 divided by the aggregation MAX() of column [Total_Customers] to get the percentage.



select

  Customer_demographics
  ,(count(Customer_demographics)*100/max(Total_Customers)) as Percentage 

from

(
Select 
  Customer_demographics
  ,count(Customer_demographics) over () as Total_Customers

from supply_chain_data
) as temp

group by 
  Customer_demographics


Conclusion: Table gives a percentage of each customer demographic, if we ignore the unknown factor, we can assume majority of customers are female.




Problem: Lets see which [Product_type] produces the most revenue, and their average manufacturing cost.


Query Process:Grouped by [Product_type], then take the sum of the [Revenue_generated] and average() of [Manufacturing_costs]. Ordered by Revenue_generated.



select
    Product_type
    ,sum(Revenue_generated) as Revenue_generated
    ,avg(Manufacturing_costs) as Avg_Manufac_costs
from 
    supply_chain_data
group by 
    Product_type
order by
    2 desc


Conclusion: Skincare seems to produce the most revenue.




Problem: Lets check shipping carriers and their average costs.


Query Process: Group by [Shipping_carriers], take average() of [Shipping_costs].



select
    Shipping_carriers
    ,round(avg(Shipping_costs),2) as avg_Shipping_costs 
from 
    supply_chain_data
group by 
    Shipping_carriers
order by 
    2 asc


Conclusion: Carrier B has a lower average shipping cost.




Problem: I want to know who is at fault when it comes to defect rates, is it the Location or the Supplier.


Query Process: Created a first cte1, where we give the AVG() defect rate of each Location using partition by clause(). And the AVG() defect rate of each Supplier using partition by () clause.

Create a second cte2. From cte1, create a case statement, if the [location_defect_rate] is > than [supplier_defect_rate], we label that as a ‘Bad Location’, else ‘Bad Supplier’.

Create a third cte3. Group by Location, Supplier_name, label. Then get the max defect rate of supplier_defect_rate, and location_defect_rate to compare Location versus Supplier.

/*
Gives the average defect rate of Locations, and average defect rate of all Suppliers.
*/
with cte1 as(
Select
        Location
        ,Supplier_name
        ,Defect_rates
        ,round(avg(Defect_rates) over(partition by Location),2) as location_defect_rate
        ,round(avg(Defect_rates) over (partition by Supplier_name),2) as supplier_defect_rate
    from 
        supply_chain_data
)
/*
Create a case statement where if the location or the supplier has a higher defect rate
*/
, cte2 as(
select
    Location
    ,Supplier_name
    ,supplier_defect_rate
    ,location_defect_rate
    ,Case 
        when location_defect_rate > supplier_defect_rate then 'Bad Location' 
    else 'Bad Supplier' 
    end as label

from cte1
)
/*
Group by Location, Supplier_name, label to compare Location and Supplier.
*/
, cte3 as(
select
    Supplier_name
    ,Location
    ,max(supplier_defect_rate) supplier_defect_rate 
    ,max(location_defect_rate) location_defect_rate
    , label
from cte2
group by 
    Location
    , Supplier_name
    , label
)

select *
from cte3
Order by
    1,2

Here shows the list of locations the manufacturing supplies come from and their Supplier number, as well as this average supplier defect rate and location defect rate.



Here is a count of how many [Bad Location], and [Bad Suppliers].

/*
Gives the average defect rate of Locations, and average defect rate of all Suppliers.
*/
with cte1 as(
Select
        Location
        ,Supplier_name
        ,Defect_rates
        ,round(avg(Defect_rates) over(partition by Location),2) as location_defect_rate
        ,round(avg(Defect_rates) over (partition by Supplier_name),2) as supplier_defect_rate
    from 
        supply_chain_data
)
/*
Create a case statement where if the location or the supplier has a higher defect rate
*/
, cte2 as(
select
    Location
    ,Supplier_name
    ,supplier_defect_rate
    ,location_defect_rate
    ,Case 
        when location_defect_rate > supplier_defect_rate then 'Bad Location' 
    else 'Bad Supplier' 
    end as label

from cte1
)

, cte3 as(
select
    Supplier_name
    ,Location
    ,max(supplier_defect_rate) supplier_defect_rate 
    ,max(location_defect_rate) location_defect_rate
    , label
from cte2
group by 
    Location
    , Supplier_name
    , label
)

select 
    count(case when label = 'Bad Location' then 1 else null end) as Bad_Location_Count
    ,count(case when label = 'Bad Supplier' then 1 else null end) as Bad_Supplier_Count
from cte3


Conclusion: There are 17 rows where the Suppliers had a higher average defect rate than the location, and 8 rows where the Location has a higher average defect rate than the Supplier.