CSV Cleaning Process On Excel
SQL queries are written in RMarkdown.
This is what our final table will look like. Transactions
will be the main table that will be analyzed. It has unique
[customer_id] and [transaction_id]s as well as the [listing_price],
which will subtract [standard] cost to get the “profit”.
I will be left joining the [CustomerAddress] table by
the customer_id to Transactions.customer_id.
And the [CustomerDemographic] table by the customer_id to
Transactions.customer_id.
This is the data
preview.
select
*
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
Problem: What is the total number of canceled
vs. approved orders within the dataset?
Query Process: Group by the [order_status], and count(order_status).
select
order_status
, count(order_status) quantity
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
group by order_status
Conclusion: 19821 Approved Orders, and 179 canceled
orders.
Problem: Lets calculate the total profit Sprocket
Central has gained.
Query Process: Use Transaction table with the joins created earlier, filter for [order_status]=‘Approved’ orders only, leaving out ‘Canceled’ orders. Round the sum of [list_price] - [standard_cost] to 0 decimals.
select
round(sum(list_price - isnull(standard_cost,0)),0) Total_Profit
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
Conclusion: 11047456$ was the total amount of profit
within the dataset.
Problem: What brand type was the most popular. List the
profit earned for each brand made.
Query Process: Count the total number of [customer_id], then
use over() to create a new column called “Total_Customers” that has the
total number of customers. Make this query into a common table
expression named cte1.
Query Process: Group
by the [brand], do a count() of ‘Approved’ orders, and do a sum of
[list_price]-[standard_cost].
select
brand
,Count(*) Total_Customers
,round(sum(list_price-isnull(standard_cost,0)),0) Profit
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
group by
brand
order by
Total_Customers desc
Conclusion: Solex seems to be the most popular brand,
but WeareA2B seems to produce the most profit.
Problem: Lets give a percentage of the total profit
each state produces.
Query Process: Count the total number of [customer_id], then
use over() to create a new column called “Total_Profit” that has the
total profit. Make this query into a common table expression named
cte1.
Query from cte1, group by [state], do
another aggregation * 100 / max[Total_Customers], name it
Percentage.
with cte1 as(
select
isnull(state,'Unknown') as state
,list_price
,standard_cost
/*This Total_Profit column will be used to calculate the percentage
in the next query.*/
,sum(list_price - isnull(standard_cost,0)) over () Total_Profit
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
)
select
state
,round(((sum(list_price - isnull(standard_cost,0)))*100.0/max(Total_Profit)),2) as Percentage
from
cte1
group by
state
order by Percentage desc
Conclusion: Majority of the profit comes from New South
Wales.
Problem: For each month lets give the percentage
increase in profit compared to the previous month.
Query Process: Group by month, and calculate the monthly
profit, sum of [list_price] - [standard_cost], that new column will be
“monthly_profit”. Then use the lag() function to accesses the value from
the previous row of the new column “monthly_profit”, name it
“prev_monthly_profit”. Name the query as cte1.
Query from cte1, create new column called “percent_change”. The equation for percent change is ([monthly_profit] - [prev_monthly_profit])/[prev_monthly_profit]. Standard percent change equation.
with cte1 as(
select
datepart(MONTH,transaction_date) month
,sum(list_price - isnull(standard_cost,0)) monthly_profit
,lag(sum(list_price - isnull(standard_cost,0)))
over (order by datepart(year,transaction_date), datepart(MONTH,transaction_date)) prev_monthly_profit
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
group by
datepart(MONTH,transaction_date),datepart(year,transaction_date)
)
select
month
,prev_monthly_profit
,monthly_profit
,round(((100*(monthly_profit-prev_monthly_profit))/prev_monthly_profit),2) percent_change
from cte1
Conclusion: Month of October, 2017, seems to have the
highest profit earned, and the highest percent increase in
revenue.
Problem: What are the top 10 customers.
Query Process: Group by [Transactions.customer_id],
[first_name], [last_name], [job_title]. Order by the profit_earned
descending.
select top 10
Transactions.customer_id
,first_name
,last_name
,sum(list_price - isnull(standard_cost,0)) profit_earned
,job_title
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
group by
Transactions.customer_id
,first_name
,last_name
,job_title
order by
profit_earned desc
Conclusion: Seems a recruiting manager named Tye Doohan
was the highest paying customer.
Problem: Label our customers into different age groups:
young, middle-aged, elderly.
Query Process: Use a case statement to create a column called
[Age_Label] to label our customers that are Young if the
[Todays_date]-[DOB] <= 45, >= 65 as Middle-Aged, else then
Elderly. Create a new column that count the total amount of customers
then use over() clause. Make it cte1.
New
query from cte1, group by [Age_Label], make a new column call
Age_Percent that counts the total number of each age group then divide
by the total number of customers to get a percentage.
with cte1 as(
select
count(*) over () total_customers
,case when datediff(year,DOB,CAST( GETDATE() AS Date )) <= 45 then 'Young'
when datediff(year,DOB,CAST( GETDATE() AS Date )) <= 65 then 'Middle Aged'
else 'Elderly'
end Age_Label
from
Transactions
left join CustomerAddress
on CustomerAddress.customer_id = Transactions.customer_id
left join CustomerDemographic
on CustomerDemographic.customer_id = Transactions.customer_id
where
order_status = 'Approved'
)
select
Age_Label
,round((100.*count(Age_Label))/max(total_customers),2) Age_Percent
from
cte1
group by
Age_Label
order by
Age_Percent desc
Conclusion: Majority are Young and under
45.