product sales Analysis

Proffessional Data analyst assignment

library(tidyverse)
library(flextable)
library(gt)
prod_data<-readr::read_csv("product_sales.csv")
 
prod_data<- prod_data |> 
  mutate(sales_method= fct_collapse(sales_method,"Email + Call"=c("Email + Call","em + call"),
                                 "Email"=c("email","Email"))) |> 
  mutate(revenue=replace_na(revenue,mean(revenue,na.rm=TRUE))) |> 
  filter(!(years_as_customer>39))

prod_data |> head() %>% 
  gt() %>%
  tab_header(title = md("**First 6 rows of the data**")) 
First 6 rows of the data
week sales_method customer_id nb_sold revenue years_as_customer nb_site_visits state
2 Email 2e72d641-95ac-497b-bbf8-4861764a7097 10 93.93494 0 24 Arizona
6 Email + Call 3998a98d-70f5-44f7-942e-789bb8ad2fe7 15 225.47000 1 28 Kansas
5 Call d1de9884-8059-4065-b10f-86eef57e4a44 11 52.55000 6 26 Wisconsin
4 Email 78aa75a4-ffeb-4817-b1d0-2f030783c5d7 11 93.93494 3 25 Indiana
3 Email 10e6d446-10a5-42e5-8210-1b5438f70922 9 90.49000 0 28 Illinois
6 Call 6489e678-40f2-4fed-a48e-d0dff9c09205 13 65.01000 10 24 Mississippi

Data validation

  • The dataset had 15000 rows and 8 columns
  • customer_id has 15000 unique customer ids without missing values thus matching the description
  • nb_sold is a numeric variable with no missing values and matches the data description
  • revenue has 1074 missing values that were replaced by the mean
  • sales method variable had 5 non unique values with no missing values in the column but had duplicate categories with non unique names so I lumped the categories em + call and Email + Call into one category called Email + Call , also lumbed the category email and Email to one category called Email. at the end i ended up with 3 categories as required by the description
  • there are 39 years since the company has been formed so for the variable years_as_customer i removed all rows or observations with years greater than 39 as these where outliers
  • week is made up of 6 unique categories that is 1-6 and it matches the description so no data cleaning was needed
  • nb_site_visits is a numeric variable with none missing values and does not need cleaning
  • state has 50 unique categories and no cleaning was required
  • after validating , the dataset remained with 14998 rows and 8 columns

How many customers were there for each approach?

library(scales)
prod_data |> 
  group_by(sales_method) |> 
  summarise(count=n()) |> 
  ggplot(aes(x=fct_reorder(sales_method,count),y=count))+
  geom_col()+
  geom_text(aes(label = comma(count)), position=position_dodge(width=1),
            hjust = 2, vjust=.5, colour = "black", size=3.5)+
  labs(x="sales method")

comments

  • the most sales were made through the email (7465 people) followed by call(4961) and lastly email+call(2572 people)

What does the spread of the revenue look like overall? And for each method?

prod_data |> 
  ggplot(aes(x=revenue))+
  geom_histogram(bins=30)

comments

  • from the histogram we note that, most sales fell between 40-60 and 80-115 with more sales concentrated on 60 and 90 specifically
prod_data |> 
  ggplot(aes(x=revenue,y=1))+
  geom_boxplot(outlier.colour = "green")

Comments

  • the plot suggests that the revenue per each individual is quite spread out ranging from approximately 50 to 200

correlations

## select only numeric values
cor_data<-prod_data %>%
  select(-customer_id) |> 
  keep(is.numeric)
## create a correlation matrix
corl<-cor(cor_data)
corrplot::corrplot(corl,method="color",addCoef.col = "black")

Comments

  • number of new products sold seem to increase as the weeks progress , we see that in the high correlation between the two variables
  • revenue also significantly increases with number of new products sold since there is a high correlation
  • with an increase in number of site visits , number of new products sold also increase and so does revenue to some extent
prod_data |> 
  ggplot(aes(x=sales_method,y= revenue,fill=sales_method))+
  geom_boxplot(outlier.colour = "green")+
  facet_wrap(~sales_method,scales="free")

comments

  • the plots suggest that calls were the least productive by looking and the range of sales revenue for the method
  • this also concludes our initial hypothesis of why calls had a smaller percentage of total revenue

proportion of revenue

prod_data |> 
  group_by(sales_method) |> 
  summarise(tot=sum(revenue)) |> 
  mutate(perc=tot/sum(tot)) |> 
  ggplot(aes(x=fct_reorder(sales_method,perc),y=perc))+
  geom_col(fill=ggthemes::tableau_color_pal()(1))+
  geom_text(aes(label = scales::percent(perc)), position=position_dodge(width=1),
            hjust = 2, vjust=.5, colour = "black", size=3.5)+
  labs(x="sales method",y="Proportion")

comments

  • the plot suggest that calls were the lowest revenue generating activity ,making up only around 17% of the total revenue produced this period
  • Email + call accounted for around 31% which was better than calls alone
  • Emails was responsible for the greatest proportion of revenue

Was there any difference in revenue over time for each of the methods?

prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(tot_revenue=sum(revenue,na.rm=T)) |> 
  ggplot(aes(x=week,y=tot_revenue,color=sales_method))+
  geom_line()+
  scale_x_continuous(breaks=1:6)+
  labs(y="revenue",color="method")

comments

  • we can note that email produced the highest amount only in the first week which accounted to a value of 247519.25. revenue gradually decreased over the next weeks
  • Email + call method started much lower but ended the period much higher than other methods claiming the top spot
  • the calls method started out lower and ended up lower than other other methods,there was litle to no change in the revenue generated from this method
prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(average_revenue=mean(revenue,na.rm=T)) |> 
  ggplot(aes(x=week,y=average_revenue,color=sales_method))+
  geom_line()+
  scale_x_continuous(breaks=1:6)+
  labs(y="average revenue",color="method")

comments

  • the graphs desplay average revenue for each sales method over time.
  • Email + call have the highest or up above in terms of the trend in average revenue over time
  • Email Alone comes in second then calls alone comes last

Based on the Data, which method would you recommend we continue to use?

  • Based on the data and charts above, I would recommend that this organization pivots away from using the calling approach on it’s own as it appears to be producing the worst results out of all of the sales methods.
  • I recommend the team doubles down on the Call + Email approach since from graphs the ‘Email + Call’ approach seems to be working really well and so I would recommend the business focuses on that outreach to remain relevant and timely while also decreasing the amount of time spent on the sales method.
  • The ROI for Email + Call can be seen on both the graphs as it appears to trend upward on both avg weekly revenue and total weekly revenue over time. Focus on Email + Calling! If possible focus on Email + Call, and then Email alone

metrics to focus on

prod_data |> 
  mutate(week=as.numeric(week)) |> 
  group_by(week,sales_method) |> 
  summarise(average_revenue=mean(revenue,na.rm=T)) |> 
  group_by(week,sales_method) |> 
  summarise(`maximum average`=max(average_revenue)) |> 
  arrange(desc(`maximum average`)) |> 
  head(n=7) %>% 
  gt() %>%
  tab_header(title = md("** rows showing maximum values**")) 
** rows showing maximum values**
sales_method maximum average
6
Email + Call 205.9621
Email 127.9681
5
Email + Call 174.3793
4
Email + Call 169.9397
2
Email + Call 145.5017
3
Email + Call 144.9161
1
Email + Call 124.8799
  • since the company is focused on increasing revenue and profits by selling more products, I would suggest they focus on the metric of average weekly revenue generated from each method as their main performance indicator.
  • this would be the best metric to focus on because it can be measured over time and can be analyzed regularly on a weekly basis, which will also allow them to pivot should results and sales not fall in line with their goal.
  • looking at the weekly average revenue per sales method, we can see that the ‘Email + Call’ approach typically trended upwards week over week, with only a slight decline between week 2 and 3, and that was completely made up in the following weeks.
  • By analyzing this suggested metric of average weekly revenue per sales method, we could observe if this trend continues for the three different approaches.

Final Summary & Recommendations

  • First, I recommend the company increases the focus on the ‘Email + Call’ method because of the success it displayed in total average revenue generated per sale and also because it showed a consistent increase across most weeks, only decreasing slightly between weeks 2 and 3.
  • Next, because Email also displayed somewhat positive results that trended upwards across the 6 week period, I would suggest the company conveys stronger value in messages with more targeted emails into their strategy as this could be a way to add some extra revenue to the pipeline with little to no extra effort and a light lift on their end.
  • I would recommend the team focuses on the metric of average revenue generated on a weekly basis for each sales method as this can be measured over time, allowing trends to appear and be analyzed to predict future outcomes.
  • I would suggest that the team decreases the focus on the ‘Call’ method because it showed poor results that were significantly lower than the other approaches
  • Improve Data Collection practices controlling for getting missing data