ALY6000-20312 INTRODUCTION TO ANALYTICS
NORTHEASTERN UNIVERSITY
SANA YASIN
Date :18 February, 2023
Project M6 Report
Professor: Dr. Dee Chiluiza, PHD

INTRODUCTION:-
SALES ANALYTICS
Sales analysis involves analysing trends from data, provide insights about top performing products /services, market opportunities and measures to generate more revenue.

It can help businesses identify opportunities for growth and improvement. For example, by identifying which products are performing well, businesses can focus on these products and potentially develop new products that are similar or complementary.

Based on the insights gained from sales analysis, businesses can develop an action plan to improve sales performance. This may involve making changes to pricing, marketing, product mix, or other factors that impact sales.

DATASET DESCRIPTION:-
We are going to perform basic exploratory data analysis of Sales data of an anomynous company containing information like

OrderDate:- Data of purchase
ProductID:- Unique product code
City:- Name of City
State: Name of State of order
Country:- Country of order
Region: Regions of order
Market: Markets based on regions
Segment: Product Segment
Department:- Product Category

Division:- Product Sub category
OrderPriority : High ,medium or low
ShipMode: Shipping methods
ProductPrice: Price of Product per unit
Quantity:- Quantity purchased till date
Shipping_Cost_Each:- Cost associated with shipment per package
Returns :- no of orders returned
Loss_Per_Return: - Loss inocurred per return

DESCRIPTIVE STATISTICS-
Descriptive statistics is a branch of statistics that involves summarizing and describing sample dataset using numerical and graphical methods.

It is used to provide a summary of the characterstics of a sample dataset like measures of central tendency, dispersion, shape, and skewness.
There 3 measures that are used in descriptive statistics:-
1.Measures of Central tendency like mean, median and mode.
2.Measures of Dispersion like standard deviation ,variance and Range
3.Measures of Positions like Quartiles , Interquatile Range.

In this analysis we are going to use descriptive statistics to make our analysis of sales dataset.

INFERENTIAL STATISTICS-
Inferential statistics is a branch of statistics that involves making inferences or conclusions about a population based on a sample of data.

It is used to test hypotheses and make predictions about a larger population based on the characteristics of a smaller sample.

Inferential statistics are used in a wide range of applications, from clinical trials and scientific research to business and finance.

R SCRIPT vs RMARKDOWN-
R script is a file containing a series of R commands or functions that can be executed in R console or in R Studio.

R markdown is a document format that combines plain text, R code, and formatted output in a single file.

Task1:- Importing Libraries and Dataset
In this Task, we are going to import the important libraries and Dataset used for our analysis
After Importing the dataset, we will describe the fields of the dataset using a summary table and 2 barplots.

#libraries used
library(tidyverse)
library(readxl)
library(magrittr)
library(dplyr)
library(RColorBrewer)
library(kableExtra)
library(ggplot2)
#datasets used
Sales_2023 <- read_excel("DataSets/M6_project_dataset(2023).xlsx")

Observations:
The dataset contains 1000 number of rows and 17 number of columns

#selecting Numerical columns only
Sales_2023_tbl<-summary(Sales_2023[c("OrderDate","Product_Price","Quantity","Shipping_Cost_Each","Returns","Loss_Per_Return")])%>%as.matrix()
##assigning appropriate column names
colnames(Sales_2023_tbl)<-c("OrderDate","Product Price","Quantity","Shipping cost Each","Returns","Loss Per Return")
##formatting the table 
knitr::kable(Sales_2023_tbl,align = "c",
               format = "markdown",
               caption = "Descriptive analysis of Numerical variable ",
               table.attr="style='width:30%;'")
Descriptive analysis of Numerical variable
OrderDate Product Price Quantity Shipping cost Each Returns Loss Per Return
Min. :2020-01-01 00:00:00.0 Min. : 151.9 Min. : 20.0 Min. : 0.000 Min. : 6.00 Min. : 18.23
1st Qu.:2020-05-09 00:00:00.0 1st Qu.: 361.1 1st Qu.: 45.0 1st Qu.: 3.890 1st Qu.: 14.00 1st Qu.: 73.54
Median :2020-08-06 00:00:00.0 Median : 501.4 Median :133.0 Median : 8.585 Median : 40.00 Median : 145.66
Mean :2020-07-24 05:12:28.7 Mean : 548.9 Mean :158.5 Mean :14.325 Mean : 47.59 Mean : 168.92
3rd Qu.:2020-10-19 00:00:00.0 3rd Qu.: 684.3 3rd Qu.:251.2 3rd Qu.:21.413 3rd Qu.: 75.25 3rd Qu.: 240.63
Max. :2020-12-25 00:00:00.0 Max. :3558.4 Max. :400.0 Max. :59.710 Max. :120.00 Max. :1352.20
##grouping Segment and Market
Sales_2023_tbl1<-Sales_2023[c("Segment","Market")]%>% table()

##grouping Department and Division
Sales_2023_tbl2<-Sales_2023[c("Department","Division")]%>% table()

par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.2))
#Graph1
Sales_2023_plot1<-barplot(Sales_2023_tbl1,col=brewer.pal(3,"Set1"),width=0.8, beside=TRUE ,        legend.text=rownames(Sales_2023_tbl1),                          main="Chart Showing no of observations in each Market w.r.t Segment",ylim=c(0,400),ylab="Frequency",las=1,                  args.legends="Top",cex.names=0.8)

text(y=Sales_2023_tbl1,
     Sales_2023_plot1,
     Sales_2023_tbl1,pos=3,cex=0.8)
#graph2
Sales_2023_plot2<-barplot(Sales_2023_tbl2,col=brewer.pal(5,"Set2"),legend.text=rownames(Sales_2023_tbl2),main="Chart Showing no of observations in each Depatment w.r.t Divisions",ylab="Frequency",space=0.5,args.legends="Top",ylim=c(0,300),cex.names=0.8,las=2)

Observations:
1.The dataset shows sales of products from 2020-01-01 till 2020-12-25.
2.The Asia Pacific has largest market and Africa has smallest.
3.The maximum products belonged to Consumer Segment .553 number of unique products were sold in Consumer Segment.
4. The maximum product sold belong to Technology division.355 number of unique products were sold in Technology.

Task2:- Viewing the Distributions
In this Task, we are going to use histogram and boxplot to view the shape and distribution of one of the variables which is Shipping_cost_each of the dataset

##histogram and boxplot for Shipping cost each
mean_shipcost=round(mean(Sales_2023$Shipping_Cost_Each),2)
median_shipcost=round(median(Sales_2023$Shipping_Cost_Each),2)
par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.2))
#graph 1
hist(Sales_2023$Shipping_Cost_Each,breaks = 50,col="pink",
     xlim=c(0,65),ylim=c(0,100),xlab="Shipping Cost each",
     ylab="Frequency",main="Histogram of Shipping cost")
abline(v=mean_shipcost,col="red",lwd=3)
abline(v=median_shipcost,col="orange",lwd=3)
text(y=100,
     x=mean_shipcost,col="red",
     paste("mean=",mean_shipcost))
text(y=80,
     x=median_shipcost,col="orange",
     paste("median=",median_shipcost))
#graph2
boxplot(Sales_2023$Shipping_Cost_Each,col="lightgreen",
        main="boxplot for Shipping Cost",
        horizontal = TRUE,xlab="Frequency")
text(x=mean_shipcost,y=1.25,col="red",labels=paste("mean=",mean_shipcost))
text(x=median_shipcost,y=0.75,col="blue",labels=paste("median=",median_shipcost))
points(x=mean_shipcost,y=1,pch=10,col="red",lwd=3)
points(x=median_shipcost,y=1,pch=18,col="blue",lwd=4)

Observations:
1. From the Histogram and Boxplot, we can observe that the data is not symmetrical and positively skewed which means most of the data is concentrated around the median of data .
2. There are considerable number of outliers in data which can be observed from the boxplot.

Task3:- Viewing the outliers and Distributions with respect to Market
In this Task, we are going to use boxplot to view the shape and distribution of one of the variables which is Shipping_cost_each of the dataset and also compare it in all of the markets.

par(mai=c(1,1.2,0.2,0.2))
boxplot(Sales_2023$Shipping_Cost_Each~Sales_2023$Market,
        col=brewer.pal(5,"Pastel1"),main="Boxplot for Shipping Cost w.r.t Markets",ylab="Frequency",xlab="Markets",las=1,cex.names=0.8)

Observations:
From the Boxplot, we can observe Europe has maximum outliers as compared to rest of the markets.

Task4:- Average Shipping Cost and Total quantity soldin Various Markets
In this Task, we are going calculate the average shipping cost and total quantity sold in each of the markets and compare the results and effect of shipping cost on sales

mean_shipcost_market<-tapply(Sales_2023$Shipping_Cost_Each,Sales_2023$Market,mean)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()

colnames(mean_shipcost_market)<- c("Mean Ship cost")
#rownames(mean_shipcost_market)

total_Quantity_market<-tapply(Sales_2023$Quantity,Sales_2023$Market,sum)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()

colnames(total_Quantity_market)<- c("Total quantity sold")
#rownames(median_productprice_market)

knitr::kable(mean_shipcost_market,align = "c",
               format = "markdown",
               caption = "Mean shipping cost per market",
               table.attr="style='width:30%;'")
Mean shipping cost per market
Mean Ship cost
Africa 32.82
LATAM 16.47
Europe 13.82
Asia Pacific 13.02
USCA 8.09
knitr::kable(total_Quantity_market,align = "c",
               format = "markdown",
               caption = "Total quantity sold",
               table.attr="style='width:30%;'")
Total quantity sold
Total quantity sold
Europe 44281
Asia Pacific 43738
USCA 36160
LATAM 24792
Africa 9549
Observations:
1. Africa has highest mean shipping cost of 32.82.
2.USCA has lowest mean shipping cost of 8.09.
3..Europe has highest total sold quantity of products 4.4281\times 10^{4}.
4.Africa has lowest total sold quantity of products 9549.

5. The high Shipping cost in impacting Sales in Africa.

Task5:- Average Shipping Cost in Various Shipping Mode
In this Task, we are going calculate the average shipping cost in each of the Shipping modes and compare the results and effect of shipping cost on sales.

mean_shipcost_shipmode<-tapply(Sales_2023$Shipping_Cost_Each,Sales_2023$ShipMode,mean)%>% round(digits=2)%>% sort(decreasing = TRUE)%>% as.matrix()
#mean_shipcost_shipmode
par(mai=c(1,1.2,0.2,0.2))
boxplot(Sales_2023$Shipping_Cost_Each~Sales_2023$ShipMode,
        col=brewer.pal(5,"Pastel2"),main="Boxplot for Shipping Cost w.r.t Ship Mode",ylab="Frequency",xlab="Ship Mode",las=1,cex.names=0.8)

Observations:
1. Same Day shipping mode has highest mean shipping cost of 47.42 , which implies that it is expensive.
2.Standard Class shipping mode has lowest mean shipping cost of 4.74, whih implies that it is cheapest shipping mode.
3.First Class shipping mode has highest number of outliers.

Task6:- Calculating Total Sales
In this Task, we are going calculate the total Sales amount by adding a new column.

#adding a new column Total Sales 
Sales_2023_new<- Sales_2023[c("Product_Price","Quantity")]%>% mutate(Total_Salesamt=Sales_2023$Product_Price*Sales_2023$Quantity)%>% round(digits = 2)
Sales_2023_new=cbind(Sales_2023,Sales_2023_new["Total_Salesamt"])%>% as.data.frame()
#str(Sales_2023_new)
#dplyr::glimpse(Sales_2023_new[1:4,])

Task7:- Department with highest Sales
In this Task, we are going calculate total sales per Department and find the depatment with Highest sales.

max_sales_department <- Sales_2023_new%>%group_by(Department)%>%summarise(n=sum(Total_Salesamt)/1000000)
colnames(max_sales_department)<-c("Department","Total_Sales_in_Millions")
#str(max_sales_department)
sales_plot1=barplot(max_sales_department$Total_Sales_in_Millions,col=brewer.pal(3,"Set3"),names.arg=max_sales_department$Department,arg.legend="Topleft",legend.text=max_sales_department$Department, main="Barplot showing total sales in millions per department",ylim=c(0,60),ylab="Total sales in millions")

text(y=max_sales_department$Total_Sales_in_Millions,
     sales_plot1,
     max_sales_department$Total_Sales_in_Millions,pos=3,cex=0.8)

Observations:
From the table below
`

Total Sales in Millions per Department
Department Total_Sales_in_Millions
Furniture 6.77232
Office Supplies 43.46151
Technology 27.76457
It can be observed that Office Supplies has maximum total sales of 4.3461505\times 10^{7}.

Task8:- Product having High losses
In this Task, we are going to work on use case which involves finding the product having highest losses in Africa Market.
We need to filter the dataset for Africs market and then by using the total return and Losses per Return column we will analyze which Department and which Product has resulted in highest losses for the company.

#filter the records
sales_2023_Africa<- Sales_2023_new%>%filter(Market=="Africa")
#add new field
sales_2023_Africa_losses<-  sales_2023_Africa[c("Returns","Loss_Per_Return","Total_Salesamt")]%>% mutate(total_loss=Loss_Per_Return*Returns) %>%
  mutate(Net_profit=Total_Salesamt-total_loss)
#combining 2 datasets
sales_2023_Africa_new=cbind(sales_2023_Africa,sales_2023_Africa_losses[c("total_loss","Net_profit")])
#dplyr::glimpse(sales_2023_Africa_new[1:4,])

## Least profitable department 
sales_2023_Africa_tbl1<- sales_2023_Africa_new%>%group_by(Department)%>% summarize(Net_profit_in_millions=sum(Net_profit)/1000000,
                                                                                   Total_returns=sum(Returns),                                                                                  Total_Losses_in_millions=sum(total_loss)/1000000)
##dipslay table 
knitr::kable(sales_2023_Africa_tbl1,align = "c",
               format = "markdown",
               caption = "Tables showing Departments and mesures",
               table.attr="style='width:30%;'")
Tables showing Departments and mesures
Department Net_profit_in_millions Total_returns Total_Losses_in_millions
Furniture 0.3386489 200 0.0397272
Office Supplies 2.6529214 1660 0.3115804
Technology 1.5971457 1006 0.1872725
par(mfrow=c(2,1))
par(mai=c(1,1,0.2,0.5))
#graph1
sales_plot2=barplot(sales_2023_Africa_tbl1$Total_Losses_in_millions,col=brewer.pal(3,"Set3"),            names.arg=sales_2023_Africa_tbl1$Department,arg.legend="Topleft",legend.text=sales_2023_Africa_tbl1$Department, main="Barplot showing Net Loss in millions per Department",ylim=c(0,10),ylab="Net loss in millions")

text(y=sales_2023_Africa_tbl1$Total_Losses_in_millions,
     sales_plot2,
     sales_2023_Africa_tbl1$Total_Losses_in_millions,pos=3,cex=0.8)
#graph 2 
sales_plot3=barplot(sales_2023_Africa_tbl1$Total_returns,col=brewer.pal(3,"Pastel2"),            names.arg=sales_2023_Africa_tbl1$Department,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl1$Department, main="Barplot showing Total Returns in per Department",ylim=c(0,2400),ylab="Total Returns")

text(y=sales_2023_Africa_tbl1$Total_returns,
     sales_plot3,
     sales_2023_Africa_tbl1$Total_returns,pos=3,cex=0.8)

sales_2023_Africa_tbl2<- sales_2023_Africa_new%>%filter(Department=="Office Supplies")%>% group_by(Division)%>% summarize(Net_profit_in_millions=sum(Net_profit)/1000000,
                                                                                   Total_returns=sum(Returns),                                                                                  Total_Losses_in_millions=sum(total_loss)/1000000)
#display table
knitr::kable(sales_2023_Africa_tbl2,align = "c",
               format = "markdown",
               caption = "Tables showing Products and mesures in office supplies",
               table.attr="style='width:30%;'")
Tables showing Products and mesures in office supplies
Division Net_profit_in_millions Total_returns Total_Losses_in_millions
Appliances 0.3506783 211 0.0412113
Art 0.4601541 266 0.0541859
Binders 0.1948582 199 0.0228775
Paper 0.1383892 159 0.0162457
Storage 1.2674238 618 0.1487888
Supplies 0.2414178 207 0.0282712
par(mfrow=c(1,1))
par(mai=c(1,1,0.2,0.1))
sales_plot4=barplot(sales_2023_Africa_tbl2$Total_Losses_in_millions,col=brewer.pal(3,"Set2"), horiz=TRUE  ,        names.arg=sales_2023_Africa_tbl2$Division,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl2$Division, main="Barplot showing Total loss in millions in per Division",xlim=c(0,0.5),xlab="Total Losses In millions",
                    las=1,spaces=0.8,cex.names=0.8)

text(x=sales_2023_Africa_tbl2$Total_Losses_in_millions,
     sales_plot4,
     round(sales_2023_Africa_tbl2$Total_Losses_in_millions,2),pos=2,cex=0.8)

par(mfrow=c(1,1))
par(mai=c(1,1,0.2,0.1))
sales_plot5=barplot(sales_2023_Africa_tbl2$Total_returns,col=brewer.pal(3,"Set2"), horiz=TRUE  ,        names.arg=sales_2023_Africa_tbl2$Division,arg.legend="TopRight",legend.text=sales_2023_Africa_tbl2$Division, main="Barplot showing Total return in per Division",xlim=c(0,1000),xlab="Total Returns",
                    las=1,spaces=0.8,cex.names=0.8)

text(x=sales_2023_Africa_tbl2$Total_returns,
     sales_plot4,
     sales_2023_Africa_tbl2$Total_returns,pos=2,cex=0.8)

Observations:
1. It can be observed that Department Office Supplies has resulted of approx 0.3115804(in millions) to the company.
2. The total number of products returned so far in Department Office Supplies is equal to 1660.
3.In Department Office Supplies the product Storage has inoccured most of the losses to company and has most of the products returned.
4. From the above results it can be said that the quality for Storage is not good due to which the returns are made.

CONCLUSION
From this analysis of Dataset we have learned
1. The use of basic descriptive statistics in data analysis.
2. The visual graphs and Plots to generate insights into dataset.
3. Data aggregation methods like group by() and tapply() to add meaningful insights to out analysis.
4.We have learned important measures which can be used for sales analysis like Total sales calculation , profit calculation.
5.In our analysis we have observed the high shipping cost results in low sales and this is due to inefficient supply chain management in Africa.
6. We have also observed that low quality products can result in loss and returns on the company.

REFERENCES
Barplots
https://rpubs.com/Dee_Chiluiza/barplots
Histograms
https://rpubs.com/Dee_Chiluiza/816756
Group BY
https://sparkbyexamples.com/r-programming/group-by-sum-in-r/
https://www.marsja.se/how-to-add-a-column-to-dataframe-in-r-with-tibble-dplyr/
https://www.r-bloggers.com/2022/08/r-program-to-concatenate-two-strings/
https://digitaschools.com/data-types-in-r-tutorial/