It can help businesses identify opportunities for growth and
improvement. For example, by identifying which products are performing
well, businesses can focus on these products and potentially develop new
products that are similar or complementary.
Based on the insights gained from sales analysis, businesses can develop
an action plan to improve sales performance. This may involve making
changes to pricing, marketing, product mix, or other factors that impact
sales.
DATASET DESCRIPTION:-
OrderDate:- Data of purchase
ProductID:- Unique product code
City:- Name of City
State: Name of State of order
Country:-
Country of order
Region: Regions of order
Market: Markets based
on regions
Segment: Product Segment
Department:- Product
Category
Division:- Product Sub category
OrderPriority : High ,medium or
low
ShipMode: Shipping methods
ProductPrice: Price of Product
per unit
Quantity:- Quantity purchased till date
Shipping_Cost_Each:- Cost associated with shipment per package
Returns :- no of orders returned
Loss_Per_Return: - Loss
inocurred per return
DESCRIPTIVE STATISTICS-
It is used to provide a summary of the characterstics of a sample
dataset like measures of central tendency, dispersion, shape, and
skewness.
There 3 measures that are used in descriptive
statistics:-
1.Measures of Central tendency like mean, median and
mode.
2.Measures of Dispersion like standard deviation ,variance and
Range
3.Measures of Positions like Quartiles , Interquatile
Range.
In this analysis we are going to use descriptive statistics to make our
analysis of sales dataset.
INFERENTIAL STATISTICS-
It is used to test hypotheses and make predictions about a larger population based on the characteristics of a smaller sample.
Inferential statistics are used in a wide range of applications, from
clinical trials and scientific research to business and finance.
R SCRIPT vs RMARKDOWN-
R markdown is a document format that combines plain text, R code, and
formatted output in a single file.
Task1:- Importing Libraries and
Dataset
In this
Task, we are going to import the important libraries and Dataset used
for our analysis
After Importing the dataset, we will describe the
fields of the dataset using a summary table and 2 barplots.
#libraries used
library(tidyverse)
library(readxl)
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(kableExtra)
library(ggplot2)
#datasets used
Sales_2023 <- read_excel("DataSets/M6_project_dataset(2023).xlsx")
Observations:
The dataset
contains 1000 number of rows and 17 number of
columns
#selecting Numerical columns only
Sales_2023_tbl<-summary(Sales_2023[c("OrderDate","Product_Price","Quantity","Shipping_Cost_Each","Returns","Loss_Per_Return")])%>%as.matrix()
##assigning appropriate column names
colnames(Sales_2023_tbl)<-c("OrderDate","Product Price","Quantity","Shipping cost Each","Returns","Loss Per Return")
##formatting the table
knitr::kable(Sales_2023_tbl,align = "c",
format = "markdown",
caption = "Descriptive analysis of Numerical variable ",
table.attr="style='width:30%;'")
| OrderDate | Product Price | Quantity | Shipping cost Each | Returns | Loss Per Return | |
|---|---|---|---|---|---|---|
| Min. :2020-01-01 00:00:00.0 | Min. : 151.9 | Min. : 20.0 | Min. : 0.000 | Min. : 6.00 | Min. : 18.23 | |
| 1st Qu.:2020-05-09 00:00:00.0 | 1st Qu.: 361.1 | 1st Qu.: 45.0 | 1st Qu.: 3.890 | 1st Qu.: 14.00 | 1st Qu.: 73.54 | |
| Median :2020-08-06 00:00:00.0 | Median : 501.4 | Median :133.0 | Median : 8.585 | Median : 40.00 | Median : 145.66 | |
| Mean :2020-07-24 05:12:28.7 | Mean : 548.9 | Mean :158.5 | Mean :14.325 | Mean : 47.59 | Mean : 168.92 | |
| 3rd Qu.:2020-10-19 00:00:00.0 | 3rd Qu.: 684.3 | 3rd Qu.:251.2 | 3rd Qu.:21.413 | 3rd Qu.: 75.25 | 3rd Qu.: 240.63 | |
| Max. :2020-12-25 00:00:00.0 | Max. :3558.4 | Max. :400.0 | Max. :59.710 | Max. :120.00 | Max. :1352.20 |
##grouping Segment and Market
Sales_2023_tbl1<-Sales_2023[c("Segment","Market")]%>% table()
##grouping Department and Division
Sales_2023_tbl2<-Sales_2023[c("Department","Division")]%>% table()
par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.2))
#Graph1
Sales_2023_plot1<-barplot(Sales_2023_tbl1,col=brewer.pal(3,"Set1"),width=0.8, beside=TRUE , legend.text=rownames(Sales_2023_tbl1), main="Chart Showing no of observations in each Market w.r.t Segment",ylim=c(0,400),ylab="Frequency",las=1, args.legends="Top",cex.names=0.8)
text(y=Sales_2023_tbl1,
Sales_2023_plot1,
Sales_2023_tbl1,pos=3,cex=0.8)
#graph2
Sales_2023_plot2<-barplot(Sales_2023_tbl2,col=brewer.pal(5,"Set2"),legend.text=rownames(Sales_2023_tbl2),main="Chart Showing no of observations in each Depatment w.r.t Divisions",ylab="Frequency",space=0.5,args.legends="Top",ylim=c(0,300),cex.names=0.8,las=2)
Observations:
1.The dataset
shows sales of products from 2020-01-01 till
2020-12-25.
2.The Asia Pacific has largest market and
Africa has smallest.
3.The maximum products belonged to Consumer
Segment .553 number of unique products were sold in
Consumer Segment.
4. The maximum product sold belong to Technology
division.355 number of unique products were sold in
Technology.
Task2:- Viewing the
Distributions
In
this Task, we are going to use histogram and boxplot to view the shape
and distribution of one of the variables which is Shipping_cost_each of
the dataset
##histogram and boxplot for Shipping cost each
mean_shipcost=round(mean(Sales_2023$Shipping_Cost_Each),2)
median_shipcost=round(median(Sales_2023$Shipping_Cost_Each),2)
par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.2))
#graph 1
hist(Sales_2023$Shipping_Cost_Each,breaks = 50,col="pink",
xlim=c(0,65),ylim=c(0,100),xlab="Shipping Cost each",
ylab="Frequency",main="Histogram of Shipping cost")
abline(v=mean_shipcost,col="red",lwd=3)
abline(v=median_shipcost,col="orange",lwd=3)
text(y=100,
x=mean_shipcost,col="red",
paste("mean=",mean_shipcost))
text(y=80,
x=median_shipcost,col="orange",
paste("median=",median_shipcost))
#graph2
boxplot(Sales_2023$Shipping_Cost_Each,col="lightgreen",
main="boxplot for Shipping Cost",
horizontal = TRUE,xlab="Frequency")
text(x=mean_shipcost,y=1.25,col="red",labels=paste("mean=",mean_shipcost))
text(x=median_shipcost,y=0.75,col="blue",labels=paste("median=",median_shipcost))
points(x=mean_shipcost,y=1,pch=10,col="red",lwd=3)
points(x=median_shipcost,y=1,pch=18,col="blue",lwd=4)
Observations:
1. From the
Histogram and Boxplot, we can observe that the data is not symmetrical
and positively skewed which means most of the data is concentrated
around the median of data .
2. There are considerable number of
outliers in data which can be observed from the boxplot.
Task3:- Viewing the outliers and
Distributions with respect to Market
In this Task, we are going to use boxplot
to view the shape and distribution of one of the variables which is
Shipping_cost_each of the dataset and also compare it in all of the
markets.
par(mai=c(1,1.2,0.2,0.2))
boxplot(Sales_2023$Shipping_Cost_Each~Sales_2023$Market,
col=brewer.pal(5,"Pastel1"),main="Boxplot for Shipping Cost w.r.t Markets",ylab="Frequency",xlab="Markets",las=1,cex.names=0.8)
Observations:
From the
Boxplot, we can observe Europe has maximum outliers as
compared to rest of the markets.
Task4:- Average Shipping Cost
and Total quantity soldin Various Markets
In this Task, we are going calculate the
average shipping cost and total quantity sold in each of the markets and
compare the results and effect of shipping cost on sales
mean_shipcost_market<-tapply(Sales_2023$Shipping_Cost_Each,Sales_2023$Market,mean)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()
colnames(mean_shipcost_market)<- c("Mean Ship cost")
#rownames(mean_shipcost_market)
total_Quantity_market<-tapply(Sales_2023$Quantity,Sales_2023$Market,sum)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()
colnames(total_Quantity_market)<- c("Total quantity sold")
#rownames(median_productprice_market)
knitr::kable(mean_shipcost_market,align = "c",
format = "markdown",
caption = "Mean shipping cost per market",
table.attr="style='width:30%;'")
| Mean Ship cost | |
|---|---|
| Africa | 32.82 |
| LATAM | 16.47 |
| Europe | 13.82 |
| Asia Pacific | 13.02 |
| USCA | 8.09 |
knitr::kable(total_Quantity_market,align = "c",
format = "markdown",
caption = "Total quantity sold",
table.attr="style='width:30%;'")
| Total quantity sold | |
|---|---|
| Europe | 44281 |
| Asia Pacific | 43738 |
| USCA | 36160 |
| LATAM | 24792 |
| Africa | 9549 |
Africa has highest mean shipping cost of
32.82.USCA has lowest mean shipping
cost of 8.09.Europe has highest total
sold quantity of products 4.4281\times 10^{4}.Africa has lowest total sold quantity of products
9549.
Task5:- Average Shipping Cost in
Various Shipping Mode
In this Task, we are going calculate the
average shipping cost in each of the Shipping modes and compare the
results and effect of shipping cost on sales.
mean_shipcost_shipmode<-tapply(Sales_2023$Shipping_Cost_Each,Sales_2023$ShipMode,mean)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()
#mean_shipcost_shipmode
par(mai=c(1,1.2,0.2,0.2))
boxplot(Sales_2023$Shipping_Cost_Each~Sales_2023$ShipMode,
col=brewer.pal(5,"Pastel2"),main="Boxplot for Shipping Cost w.r.t Ship Mode",ylab="Frequency",xlab="Ship Mode",las=1,cex.names=0.8)
Observations:
1.
Same Day shipping mode has highest mean shipping cost of
47.42 , which implies that it is expensive.
2.Standard Class shipping mode has lowest mean shipping
cost of 4.74, whih implies that it is cheapest shipping
mode.
3.First Class shipping mode has highest number of
outliers.
Task6:- Calculating Total Sales
In this Task, we are
going calculate the total Sales amount by adding a new column.
#adding a new column Total Sales
Sales_2023_new<- Sales_2023[c("Product_Price","Quantity")]%>% mutate(Total_Salesamt=Sales_2023$Product_Price*Sales_2023$Quantity)%>% round(digits = 2)
Sales_2023_new=cbind(Sales_2023,Sales_2023_new["Total_Salesamt"])%>% as.data.frame()
#str(Sales_2023_new)
#dplyr::glimpse(Sales_2023_new[1:4,])
Task7:- Department with highest
Sales
In this Task,
we are going calculate total sales per Department and find the depatment
with Highest sales.
max_sales_department <- Sales_2023_new%>%group_by(Department)%>%summarise(n=sum(Total_Salesamt)/1000000)
colnames(max_sales_department)<-c("Department","Total_Sales_in_Millions")
#str(max_sales_department)
sales_plot1=barplot(max_sales_department$Total_Sales_in_Millions,col=brewer.pal(3,"Set3"),names.arg=max_sales_department$Department,arg.legend="Topleft",legend.text=max_sales_department$Department, main="Barplot showing total sales in millions per department",ylim=c(0,60),ylab="Total sales in millions")
text(y=max_sales_department$Total_Sales_in_Millions,
sales_plot1,
max_sales_department$Total_Sales_in_Millions,pos=3,cex=0.8)
Observations:
From the table
below
`
| Department | Total_Sales_in_Millions |
|---|---|
| Furniture | 6.77232 |
| Office Supplies | 43.46151 |
| Technology | 27.76457 |
It can be observed that Office Supplies has
maximum total sales of 4.3461505\times 10^{7}.
Task8:- Product having High
losses
In this Task,
we are going to work on use case which involves finding the product
having highest losses in Africa Market.
We need to filter the
dataset for Africs market and then by using the total return and Losses
per Return column we will analyze which Department and which Product has
resulted in highest losses for the company.
#filter the records
sales_2023_Africa<- Sales_2023_new%>%filter(Market=="Africa")
#add new field
sales_2023_Africa_losses<- sales_2023_Africa[c("Returns","Loss_Per_Return","Total_Salesamt")]%>% mutate(total_loss=Loss_Per_Return*Returns) %>%
mutate(Net_profit=Total_Salesamt-total_loss)
#combining 2 datasets
sales_2023_Africa_new=cbind(sales_2023_Africa,sales_2023_Africa_losses[c("total_loss","Net_profit")])
#dplyr::glimpse(sales_2023_Africa_new[1:4,])
## Least profitable department
sales_2023_Africa_tbl1<- sales_2023_Africa_new%>%group_by(Department)%>% summarize(Net_profit_in_millions=sum(Net_profit)/1000000,
Total_returns=sum(Returns), Total_Losses_in_millions=sum(total_loss)/1000000)
##dipslay table
knitr::kable(sales_2023_Africa_tbl1,align = "c",
format = "markdown",
caption = "Tables showing Departments and mesures",
table.attr="style='width:30%;'")
| Department | Net_profit_in_millions | Total_returns | Total_Losses_in_millions |
|---|---|---|---|
| Furniture | 0.3386489 | 200 | 0.0397272 |
| Office Supplies | 2.6529214 | 1660 | 0.3115804 |
| Technology | 1.5971457 | 1006 | 0.1872725 |
par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.5))
#graph1
sales_plot2=barplot(sales_2023_Africa_tbl1$Total_Losses_in_millions,col=brewer.pal(3,"Set3"), names.arg=sales_2023_Africa_tbl1$Department,arg.legend="Topleft",legend.text=sales_2023_Africa_tbl1$Department, main="Barplot showing Net Loss in millions per Department",ylim=c(0,10),ylab="Net loss in millions")
text(y=sales_2023_Africa_tbl1$Total_Losses_in_millions,
sales_plot2,
sales_2023_Africa_tbl1$Total_Losses_in_millions,pos=3,cex=0.8)
#graph 2
sales_plot3=barplot(sales_2023_Africa_tbl1$Total_returns,col=brewer.pal(3,"Pastel2"), names.arg=sales_2023_Africa_tbl1$Department,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl1$Department, main="Barplot showing Total Returns in per Department",ylim=c(0,2400),ylab="Total Returns")
text(y=sales_2023_Africa_tbl1$Total_returns,
sales_plot3,
sales_2023_Africa_tbl1$Total_returns,pos=3,cex=0.8)
sales_2023_Africa_tbl2<- sales_2023_Africa_new%>%filter(Department=="Office Supplies")%>% group_by(Division)%>% summarize(Net_profit_in_millions=sum(Net_profit)/1000000,
Total_returns=sum(Returns), Total_Losses_in_millions=sum(total_loss)/1000000)
#display table
knitr::kable(sales_2023_Africa_tbl2,align = "c",
format = "markdown",
caption = "Tables showing Products and mesures in office supplies",
table.attr="style='width:30%;'")
| Division | Net_profit_in_millions | Total_returns | Total_Losses_in_millions |
|---|---|---|---|
| Appliances | 0.3506783 | 211 | 0.0412113 |
| Art | 0.4601541 | 266 | 0.0541859 |
| Binders | 0.1948582 | 199 | 0.0228775 |
| Paper | 0.1383892 | 159 | 0.0162457 |
| Storage | 1.2674238 | 618 | 0.1487888 |
| Supplies | 0.2414178 | 207 | 0.0282712 |
par(mfrow=c(1,1))
par(mai=c(1,1,0.2,0.1))
sales_plot4=barplot(sales_2023_Africa_tbl2$Total_Losses_in_millions,col=brewer.pal(3,"Set2"), horiz=TRUE , names.arg=sales_2023_Africa_tbl2$Division,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl2$Division, main="Barplot showing Total loss in millions in per Division",xlim=c(0,0.5),xlab="Total Losses In millions",
las=1,spaces=0.8,cex.names=0.8)
text(x=sales_2023_Africa_tbl2$Total_Losses_in_millions,
sales_plot4,
round(sales_2023_Africa_tbl2$Total_Losses_in_millions,2),pos=2,cex=0.8)
par(mfrow=c(1,1))
par(mai=c(1,1,0.2,0.1))
sales_plot5=barplot(sales_2023_Africa_tbl2$Total_returns,col=brewer.pal(3,"Set2"), horiz=TRUE , names.arg=sales_2023_Africa_tbl2$Division,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl2$Division, main="Barplot showing Total return in per Division",xlim=c(0,1000),xlab="Total Returns",
las=1,spaces=0.8,cex.names=0.8)
text(x=sales_2023_Africa_tbl2$Total_returns,
sales_plot4,
sales_2023_Africa_tbl2$Total_returns,pos=2,cex=0.8)
Observations:
1. It can be
observed that Department Office Supplies has resulted of
approx 0.3115804(in millions) to the company.
2. The
total number of products returned so far in Department
Office Supplies is equal to 1660.
3.In
Department Office Supplies the product Storage
has inoccured most of the losses to company and has most of the products
returned.
4. From the above results it can be said that the quality
for Storage is not good due to which the returns are
made.
CONCLUSION
From this analysis of Dataset we have
learned
1. The use of basic descriptive statistics in data
analysis.
2. The visual graphs and Plots to generate insights into
dataset.
3. Data aggregation methods like group by() and tapply() to
add meaningful insights to out analysis.
4.We have learned important
measures which can be used for sales analysis like Total sales
calculation , profit calculation.
5.In our analysis we have observed
the high shipping cost results in low sales and this is due to
inefficient supply chain management in Africa.
6. We have also
observed that low quality products can result in loss and returns on the
company.
REFERENCES
Barplots
https://rpubs.com/Dee_Chiluiza/barplots
Histograms
https://rpubs.com/Dee_Chiluiza/816756
Group
BY
https://sparkbyexamples.com/r-programming/group-by-sum-in-r/
https://www.marsja.se/how-to-add-a-column-to-dataframe-in-r-with-tibble-dplyr/
https://www.r-bloggers.com/2022/08/r-program-to-concatenate-two-strings/
https://digitaschools.com/data-types-in-r-tutorial/