week: 6 unique values, without any missing data.
sales_method: had 5 unique values before validation: Email, Call, Email + Call, em + call, and email, which after validation were Email, Call, and Email + Call.
customer_id: 15,000 unique values. Needed no cleaning.
nb_sold: 10 unique values, no cleaning required and no missing values.
revenue: had 1074 missing values, of which the rows were removed from the data set.
years_as_customer: had two major values not corresponding: 47 and 63 which were way more than the number of years Pens and Printers has been in existence, 39 years. It made no sense having a customer when the business was not in existence. These rows were dropped.
nb_site_visits: Needed no cleaning.
state: Needed no cleaning too. At the end of the validation and cleaning process, the data that remained is 13,924 rows and 8 columns
# install.packages("DataExplorer")
# install.packages("timetk")
# install.packages('DT')
#import Libraries
library(tidyverse)
library(janitor)
library(DataExplorer)
library(DT)
library(patchwork)
library(timetk)
library(tinytex)
theme_set(theme_test())
library(readr)
sales <- read_csv("product_sales.csv")
head(sales)
dim(sales)
## [1] 15000 8
The data set contains 15,000 rows/observations and 8 columns/features before the cleaning and validation process.
#check duplicates
sum(duplicated(sales$customer_id))#no duplicates
## [1] 0
#Check mising values
colSums(is.na(sales)) %>% as.data.frame()#1074 missing values in #revenue column
#Remove Missing values
sales <- na.omit(sales)
# confirm if missing values still exist
colSums(is.na(sales)) %>% #No Missing values
as.data.frame()
names(sales)# check column names
## [1] "week" "sales_method" "customer_id"
## [4] "nb_sold" "revenue" "years_as_customer"
## [7] "nb_site_visits" "state"
#Remove Column customer_id- because we do not it in our analysis
sales$customer_id <- NULL
# Check the number of unque values in the column;week: 6 unique values, without any missing data.
unique(sales$week)# we've 6 unique values
## [1] 6 5 3 4 1 2
#Unique values in the sales_method
# sales_method: had 5 unique values before validation: Email, Call, Email + Call, em + call, and email, which after validation were Email, Call, and Email + Call.
unique(sales$sales_method)
## [1] "Email + Call" "Call" "Email" "em + call" "email"
#Sales method now has 3 uniques values as per description
#So we need to clean the data as per the description
sales <- sales %>%
mutate(
sales_method=ifelse(sales_method=="em + call","Email + Call",
ifelse(sales_method=="email","Email", sales_method)))
#Then we Check to confirm
unique(sales$sales_method)
## [1] "Email + Call" "Call" "Email"
# years_as_customer: had two major values not corresponding: 47 and 63 which were way more than the number of years Pens and Printers has been in existence, 39 years. It made no sense having a customer when the business was not in existence. These rows were dropped.
summary(sales$years_as_customer)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 4.979 7.000 63.000
boxplot(sales$years_as_customer)
sales1 <- sales %>% filter(!years_as_customer>39)#remove outliers(47,63)
Our Data is Clean Now
sales_clean <- sales1
sales_clean$sales_method <- as.factor(sales_clean$sales_method)
glimpse(sales_clean)
## Rows: 13,924
## Columns: 7
## $ week <dbl> 6, 5, 3, 6, 4, 1, 5, 5, 3, 2, 5, 2, 5, 4, 1, 1, 1, 1…
## $ sales_method <fct> Email + Call, Call, Email, Call, Email, Email, Email…
## $ nb_sold <dbl> 15, 11, 9, 13, 11, 10, 11, 11, 9, 9, 11, 10, 10, 10,…
## $ revenue <dbl> 225.47, 52.55, 90.49, 65.01, 113.38, 99.94, 108.34, …
## $ years_as_customer <dbl> 1, 6, 0, 10, 9, 1, 10, 7, 4, 2, 2, 1, 1, 2, 4, 2, 4,…
## $ nb_site_visits <dbl> 28, 26, 28, 24, 28, 22, 31, 23, 28, 23, 30, 28, 30, …
## $ state <chr> "Kansas", "Wisconsin", "Illinois", "Mississippi", "G…
#Reorder labels
# levels(sales_clean$sales_method) <- c("Email","Call" ,"Email + Call")
How many customers were there for each approach?
What does the spread of the revenue look like overall? And for each method?
Was there any difference in revenue over time for each of the methods?
Based on the data, which method would you recommend we continue to use?
sales_clean <- sales_clean %>%
mutate(percentage = revenue / sum(revenue) * 100)
A1 <- sales_clean %>%
ggplot(aes(x=fct_infreq(sales_method)))+
geom_bar(fill=rainbow(3))+
geom_text(aes(label=after_stat(count)),
stat='count',
position=position_dodge(1.0),
hjust=0.1,
vjust= 1.6,
size=5)+
theme(legend.position = 'bottom')+
labs(
x=' ',
y='Sum of Each Approach',
title = 'Number of Customers per Each Sales Method') +
theme(plot.title = element_text(face = 'bold',
hjust = 0.5))#bold and center title
####################################
A2 <- sales_clean %>%
ggplot(aes(x=fct_infreq(sales_method)))+
geom_bar(fill=rainbow(3)) +
geom_text(aes(label = scales::percent(..count../sum(..count..))),
stat = "count",
vjust = 1.8,
position = position_stack(1),
size = 6) + # Adjust the position of labels
theme(legend.position = 'bottom')+
labs(
x='',
y=' ',
title = 'Percentage(%) Count of Customers per Each Sales Method') +
theme(plot.title = element_text(face = 'bold',# title in bold
hjust = -1)) +# Center title
theme(plot.title = element_text(face = 'bold',
hjust = 0.5))#bold and center title
A1/A2
The Email sales method has the vast majority of 6921(50%) customers, followed by Call & Combination of Email and Call, with the respective count of 4780(34%) and 2223(16%).
x0 <- sales_clean %>%
ggplot(aes(revenue, fill=sales_method))+
geom_density()+
ggtitle('Revenue Distibution')+
theme(plot.title = element_text(hjust = 0.5,#Center Title
face ='bold')) +# in bold
theme(legend.position = 'none')
# install.packages("patchwork")
library(patchwork)
x1 <- sales_clean %>%
ggplot(aes(revenue, fill=sales_method)) +
geom_boxplot() +
facet_wrap(~sales_method, scales = "free")+
theme(legend.position = 'bottom')+
ggtitle('Revenue And Sales Method') +
theme(plot.title = element_text(face = 'bold',#title in bold
hjust = 0.5))#center title
x0/x1#combine plots
Calls is associated with lower Revenues in comparison to other methods. Both Email and (Email+Call) generates more revenues.
Low revenues were majorly generated by the calls method. This can be clearly seen on the Call graph above, with revenue ranging between 0-70
Email sales_approach generated revenues in the range of 80-120, with larger values start from 130-150.
Both (Email + Call) generate higher revenues ranging from 120-240 as seen from the histogram and boxplot for Email + call.
d <- sales_clean %>%
group_by(sales_method) %>%
summarise(Total_Revenue=sum(revenue),
Percen=round(sum(percentage), 1)) %>%
arrange(desc(Total_Revenue))
x2 <- DT::datatable(d)
x2
library(scales)
p1 <- d %>%
ggplot(aes(x=reorder(sales_method, desc(Total_Revenue)),
y=Total_Revenue, fill=rainbow(3)))+
geom_col()+
theme(legend.position = 'none') +
scale_y_continuous(labels = scales::comma)+
# theme(axis.text.x = element_text(angle = 90))+
geom_text(aes(label = paste0(round(Percen, 1), "%")),
vjust = -0.04,
position = position_stack(0.5),
hjust=1) + # Adjust the position of labels
xlab(' ')+
ggtitle('Percentage Count of Total Revenue Per Sales Method') +
theme(plot.title = element_text(hjust = 0.5,#center
face = 'bold'))# bold
p2 <- d %>%
ggplot(aes(x=reorder(sales_method, desc(Total_Revenue)),
y=Total_Revenue, fill=rainbow(3)))+
geom_col()+
theme(legend.position = 'none') +
scale_y_continuous(labels = scales::comma)+
#theme(axis.text.x = element_text(angle = 90))+
geom_text(
aes(label=Total_Revenue),
position = position_stack(0.5),
hjust=0.5,
vjust=-.3) +
xlab(' ') +
ggtitle('Total Revenue Per Sales Method') +
theme(plot.title = element_text(hjust = 0.5,
face = 'bold'))
p1/p2
library(scales)
d1 <- sales_clean %>%
arrange(week) %>%
group_by(week) %>%
summarise(Total_revenue = round(sum(revenue),0))
DT::datatable(d1)
z0 <- d1 %>%
ggplot(aes(x=week,
y=Total_revenue, fill=rainbow(6))) +
geom_col() +
scale_y_continuous(labels=scales::comma)+
geom_text(aes(label=Total_revenue),
position = position_stack(0.5),
# hjust=1,
vjust=-.2)+
theme(legend.position = 'none')+
xlab('Weekly Sales Total')+
ggtitle("Total Sales Per Week") +
theme(plot.title = element_text(hjust = 0.5,
face = "bold"))
##############################################################
library(timetk)
z1 <- d1 %>% plot_time_series(
week, Total_revenue, .interactive = FALSE)
z0/z1#combine plots
ab <- sales_clean %>%
arrange(week ) %>%
group_by(week, sales_method) %>%
summarise(Total= sum(revenue)) %>%
ggplot(
aes(x=week, y=Total, color=sales_method)) +
geom_line(linetype = 2) +
ggtitle("Weekly Sales Trend Per Sales Method") +
theme(plot.title = element_text(face="bold", hjust = 0.5)) +
theme(legend.position = "bottom")
ab
Briefly we can that there is a positive increase in the revenue from 1st to 5th week for the Call and (Email + Call) sales methods, with a decline on week 6. While for the Email sales method we can see the negative revenue trend on a weekly basis, from week 1 to 6.
sales_clean %>%
# select(years_as_customer, week, revenue) %>%
group_by(years_as_customer) %>%
mutate(Total_R=sum(revenue)) %>%
ggplot(
aes(x=years_as_customer, y= Total_R, col=week)
) +
geom_line() +
ggtitle("Customer Tenure And The Total Revenue Generated") +
theme(legend.position = " ",
plot.title = element_text(hjust = 0.5, face = "bold"))
The plot demonstrates the correlation between weekly revenue generated after product launch and customer tenure. The declining trend indicates that the majority of revenue comes from new customers who have been existing a period of 0 to 10 years.
sales_clean %>%
ggplot(aes(x=week, y=revenue, colour = sales_method))+
geom_line()
Based on the analysis, I recommend discontinuing the Calls method and focusing only on the Email and Email + Call sales method. This is due to the higher sales and revenue generated by these approaches, as well as the shorter average time required per sale compared to calls approach which is (30 minutes). However, the Calls approach can still be used on condition the customer doesn’t have an email address.
Based on the analysis conducted using the provided data, the following recommendations are proposed:
Monitor key metrics to track any changes in the sales approach.
It is recommended to utilize the Email method frequently to inform customers about new products. Additionally, follow-up calls in the second and third week can be made to discuss their needs and how the new product will assist them.
It is advisable to minimize the usage of the Call method or eliminate it altogether. This approach consumes more time for sales and ultimately generates the lowest revenue, despite having a higher number of sales.
The sales team should prioritize the Email and Email + Call approaches. As demonstrated in analysis, the Email sales approach yields the highest revenue during the initial three weeks, although it declines as the week progresses. To enhance sales and generate more revenue, a follow-up call should be made in the second or third week.
To broaden the customer segment, focus on enhancing marketing strategies and improving the conversion rate based on website visits. As indicated in the correlation graph, the longer customer tenure corresponds to lower revenue. To address this, onboard new customers and establish customer retention initiatives to boost sales and revenue from both new and existing customers.
Ensure accurate data collection to facilitate comprehensive analysis, particularly for revenue, which contains numerous missing values.