product sales Analysis

Proffessional Data analyst assignment

library(tidyverse)
library(flextable)
library(gt)
prod_data<-readr::read_csv("product_sales.csv")

prod_data<- prod_data |> 
  mutate(sales_method= fct_collapse(sales_method,"Email + Call"=c("Email + Call","em + call"),
                                 "Email"=c("email","Email"))) |> 
  mutate(revenue=replace_na(revenue,mean(revenue,na.rm=TRUE))) |> 
  filter(!(years_as_customer>39))

prod_data |> head() %>% 
  gt() %>%
  tab_header(title = md("**First 6 rows of the data**"))

week	sales_method	customer_id	nb_sold	revenue	years_as_customer	nb_site_visits	state
First 6 rows of the data
2	Email	2e72d641-95ac-497b-bbf8-4861764a7097	10	93.93494	0	24	Arizona
6	Email + Call	3998a98d-70f5-44f7-942e-789bb8ad2fe7	15	225.47000	1	28	Kansas
5	Call	d1de9884-8059-4065-b10f-86eef57e4a44	11	52.55000	6	26	Wisconsin
4	Email	78aa75a4-ffeb-4817-b1d0-2f030783c5d7	11	93.93494	3	25	Indiana
3	Email	10e6d446-10a5-42e5-8210-1b5438f70922	9	90.49000	0	28	Illinois
6	Call	6489e678-40f2-4fed-a48e-d0dff9c09205	13	65.01000	10	24	Mississippi

Data validation

The dataset had 15000 rows and 8 columns
customer_id has 15000 unique customer ids without missing values thus matching the description
nb_sold is a numeric variable with no missing values and matches the data description
revenue has 1074 missing values that were replaced by the mean
sales method variable had 5 non unique values with no missing values in the column but had duplicate categories with non unique names so I lumped the categories em + call and Email + Call into one category called Email + Call , also lumbed the category email and Email to one category called Email. at the end i ended up with 3 categories as required by the description
there are 39 years since the company has been formed so for the variable years_as_customer i removed all rows or observations with years greater than 39 as these where outliers
week is made up of 6 unique categories that is 1-6 and it matches the description so no data cleaning was needed
nb_site_visits is a numeric variable with none missing values and does not need cleaning
state has 50 unique categories and no cleaning was required
after validating , the dataset remained with 14998 rows and 8 columns

How many customers were there for each approach?

library(scales)
prod_data |> 
  group_by(sales_method) |> 
  summarise(count=n()) |> 
  ggplot(aes(x=fct_reorder(sales_method,count),y=count))+
  geom_col()+
  geom_text(aes(label = comma(count)), position=position_dodge(width=1),
            hjust = 2, vjust=.5, colour = "black", size=3.5)+
  labs(x="sales method")

comments

the most sales were made through the email (7465 people) followed by call(4961) and lastly email+call(2572 people)

What does the spread of the revenue look like overall? And for each method?

prod_data |> 
  ggplot(aes(x=revenue))+
  geom_histogram(bins=30)

comments

from the histogram we note that, most sales fell between 40-60 and 80-115 with more sales concentrated on 60 and 90 specifically

prod_data |> 
  ggplot(aes(x=revenue,y=1))+
  geom_boxplot(outlier.colour = "green")

Comments

the plot suggests that the revenue per each individual is quite spread out ranging from approximately 50 to 200

correlations

## select only numeric values
cor_data<-prod_data %>%
  select(-customer_id) |> 
  keep(is.numeric)
## create a correlation matrix
corl<-cor(cor_data)
corrplot::corrplot(corl,method="color",addCoef.col = "black")

Comments

number of new products sold seem to increase as the weeks progress , we see that in the high correlation between the two variables
revenue also significantly increases with number of new products sold since there is a high correlation
with an increase in number of site visits , number of new products sold also increase and so does revenue to some extent

prod_data |> 
  ggplot(aes(x=sales_method,y= revenue,fill=sales_method))+
  geom_boxplot(outlier.colour = "green")+
  facet_wrap(~sales_method,scales="free")

comments

the plots suggest that calls were the least productive by looking and the range of sales revenue for the method
this also concludes our initial hypothesis of why calls had a smaller percentage of total revenue

proportion of revenue

prod_data |> 
  group_by(sales_method) |> 
  summarise(tot=sum(revenue)) |> 
  mutate(perc=tot/sum(tot)) |> 
  ggplot(aes(x=fct_reorder(sales_method,perc),y=perc))+
  geom_col(fill=ggthemes::tableau_color_pal()(1))+
  geom_text(aes(label = scales::percent(perc)), position=position_dodge(width=1),
            hjust = 2, vjust=.5, colour = "black", size=3.5)+
  labs(x="sales method",y="Proportion")

comments

the plot suggest that calls were the lowest revenue generating activity ,making up only around 17% of the total revenue produced this period
Email + call accounted for around 31% which was better than calls alone
Emails was responsible for the greatest proportion of revenue

Was there any difference in revenue over time for each of the methods?

prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(tot_revenue=sum(revenue,na.rm=T)) |> 
  ggplot(aes(x=week,y=tot_revenue,color=sales_method))+
  geom_line()+
  scale_x_continuous(breaks=1:6)+
  labs(y="revenue",color="method")

comments

we can note that email produced the highest amount only in the first week which accounted to a value of 247519.25. revenue gradually decreased over the next weeks
Email + call method started much lower but ended the period much higher than other methods claiming the top spot
the calls method started out lower and ended up lower than other other methods,there was litle to no change in the revenue generated from this method

prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(average_revenue=mean(revenue,na.rm=T)) |> 
  ggplot(aes(x=week,y=average_revenue,color=sales_method))+
  geom_line()+
  scale_x_continuous(breaks=1:6)+
  labs(y="average revenue",color="method")

comments

the graphs desplay average revenue for each sales method over time.
Email + call have the highest or up above in terms of the trend in average revenue over time
Email Alone comes in second then calls alone comes last

metrics to focus on

prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(average_revenue=mean(revenue,na.rm=T)) |> 
  group_by(week,sales_method) |> 
  summarise(`maximum average`=max(average_revenue)) |> 
  arrange(desc(`maximum average`)) |> 
  head(n=7) %>% 
  gt() %>%
  tab_header(title = md("** rows showing maximum values**"))

sales_method	maximum average
rows showing maximum values
6
Email + Call	205.9621
Email	127.9681
5
Email + Call	174.3793
4
Email + Call	169.9397
2
Email + Call	145.5017
3
Email + Call	144.9161
1
Email + Call	124.8799

since the company is focused on increasing revenue and profits by selling more products, I would suggest they focus on the metric of average weekly revenue generated from each method as their main performance indicator.
this would be the best metric to focus on because it can be measured over time and can be analyzed regularly on a weekly basis, which will also allow them to pivot should results and sales not fall in line with their goal.
looking at the weekly average revenue per sales method, we can see that the ‘Email + Call’ approach typically trended upwards week over week, with only a slight decline between week 2 and 3, and that was completely made up in the following weeks.
By analyzing this suggested metric of average weekly revenue per sales method, we could observe if this trend continues for the three different approaches.

Final Summary & Recommendations

First, I recommend the company increases the focus on the ‘Email + Call’ method because of the success it displayed in total average revenue generated per sale and also because it showed a consistent increase across most weeks, only decreasing slightly between weeks 2 and 3.
Next, because Email also displayed somewhat positive results that trended upwards across the 6 week period, I would suggest the company conveys stronger value in messages with more targeted emails into their strategy as this could be a way to add some extra revenue to the pipeline with little to no extra effort and a light lift on their end.
I would recommend the team focuses on the metric of average revenue generated on a weekly basis for each sales method as this can be measured over time, allowing trends to appear and be analyzed to predict future outcomes.
I would suggest that the team decreases the focus on the ‘Call’ method because it showed poor results that were significantly lower than the other approaches
Improve Data Collection practices controlling for getting missing data

product sales Analysis

Proffessional Data analyst assignment

product sales Analysis

Data validation

How many customers were there for each approach?

What does the spread of the revenue look like overall? And for each method?

correlations

proportion of revenue

Was there any difference in revenue over time for each of the methods?

Based on the Data, which method would you recommend we continue to use?

metrics to focus on

Final Summary & Recommendations