Sales Data Analysis for Daraz

library(dplyr)
library(ggplot2)
library(knitr)

Reading data and transformation

The data file SalesData.csv consists of the daily sales data for the customers on one store location. We are required to analyze and visualize the sales data in useful ways possible. First we read the data that is available in CSV into a dataframe. Then we need to transform certain aspects of the data in order to analyze.

setwd('/Users/swaraj/Desktop/daraz')
dat <- read.csv('SalesData.csv')
kable(head(dat))

Fiscal.Season	SeasonDesc	Fiscal.Year	Fiscal.Week	Day	Dayofweek	Class	ClassDesc	Location	Locdesc	SalesUnits	SalesDollars
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	10	SUNGLASSES	2061	TIMES SQUARE	2	60
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	15	FASHION ACCESSORIES	2061	TIMES SQUARE	1	18
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	18	HATS	2061	TIMES SQUARE	1	12
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	25	FASHION JEWELRY	2061	TIMES SQUARE	1	35
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	26	NECKLACES	2061	TIMES SQUARE	11	231
20161	Spring 2016	2016	2/6/2016	1/31/2016	Sunday	27	SUNGLASSES	2061	TIMES SQUARE	3	63

str(dat)

## 'data.frame':    40932 obs. of  12 variables:
##  $ Fiscal.Season: int  20161 20161 20161 20161 20161 20161 20161 20161 20161 20161 ...
##  $ SeasonDesc   : Factor w/ 2 levels "Fall 2016","Spring 2016": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Fiscal.Year  : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Fiscal.Week  : Factor w/ 48 levels "10/1/2016","10/15/2016",..: 18 18 18 18 18 18 18 18 18 18 ...
##  $ Day          : Factor w/ 330 levels "1/31/2016","10/1/2016",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dayofweek    : Factor w/ 7 levels "Friday","Monday",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Class        : int  10 15 18 25 26 27 28 30 31 41 ...
##  $ ClassDesc    : Factor w/ 112 levels "111 BODYSUITS",..: 92 39 49 40 68 92 60 61 50 12 ...
##  $ Location     : int  2061 2061 2061 2061 2061 2061 2061 2061 2061 2061 ...
##  $ Locdesc      : Factor w/ 1 level "TIMES SQUARE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SalesUnits   : int  2 1 1 1 11 3 2 10 4 1 ...
##  $ SalesDollars : int  60 18 12 35 231 63 100 416 212 60 ...

We need to transform variables related to date to a date-specific format, and also transfrom some variables in to categorical. We also need to add a ‘month’ variable to analyze sales based on month.

dat$Class <- factor(dat$Class)
dy <- strptime(dat$Day, "%m/%d/%Y")
dat$Day <- as.Date(dy)

wk <- strptime(dat$Fiscal.Week, "%m/%d/%Y")
dat$Fiscal.Week <- as.Date(wk)

dat$Dayofweek <- factor(dat$Dayofweek, ordered = TRUE, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

month <- as.numeric(format(dat$Day, "%m"))
#dat$month <- month.abb[month]
dat$month <- factor(month.abb[month], ordered=TRUE, levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))

Now,let’s look at the data summary and distribution of some of the key variables.

kable(head(dat))

Fiscal.Season	SeasonDesc	Fiscal.Year	Fiscal.Week	Day	Dayofweek	Class	ClassDesc	Location	Locdesc	SalesUnits	SalesDollars	month
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	10	SUNGLASSES	2061	TIMES SQUARE	2	60	Jan
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	15	FASHION ACCESSORIES	2061	TIMES SQUARE	1	18	Jan
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	18	HATS	2061	TIMES SQUARE	1	12	Jan
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	25	FASHION JEWELRY	2061	TIMES SQUARE	1	35	Jan
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	26	NECKLACES	2061	TIMES SQUARE	11	231	Jan
20161	Spring 2016	2016	2016-02-06	2016-01-31	Sunday	27	SUNGLASSES	2061	TIMES SQUARE	3	63	Jan

ggplot(dat, aes(x=SalesDollars)) + geom_histogram()

ggplot(dat, aes(x=SalesUnits)) + geom_histogram()

summary(dat$SalesUnits)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   9.303  11.000 255.000

summary(dat$SalesDollars)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    72.0   180.0   311.3   393.0  7539.0

summary(dat$Fiscal.Year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2016    2016    2016    2016    2016

summary(dat$SeasonDesc)

##   Fall 2016 Spring 2016 
##       18213       22719

nlevels(dat$ClassDesc)

## [1] 112

As we can see from the distribution of SalesDollars and SalesUnits, they are heavily skewed towards right, which means that there are lots of data on the lower end of the value. It is typical of the sales data where number of customers who buy less items are greater than the customers who buy more items at once.

Looking at the data and the summary, we can exclude some of the variables from our analysis. We can remove the following variables: Fiscal.Season, Fiscal.Year, Class, Location and Locdesc. The var.iables of particular interests are SalesDollars, SalesUnits, month, Day and ClassDesc

rem <- names(dat) %in% c("Fiscal.Season", "Fiscal.Year", "Class", "Location","Locdesc")
df <- dat[!rem]
kable(head(df))

SeasonDesc	Fiscal.Week	Day	Dayofweek	ClassDesc	SalesUnits	SalesDollars	month
Spring 2016	2016-02-06	2016-01-31	Sunday	SUNGLASSES	2	60	Jan
Spring 2016	2016-02-06	2016-01-31	Sunday	FASHION ACCESSORIES	1	18	Jan
Spring 2016	2016-02-06	2016-01-31	Sunday	HATS	1	12	Jan
Spring 2016	2016-02-06	2016-01-31	Sunday	FASHION JEWELRY	1	35	Jan
Spring 2016	2016-02-06	2016-01-31	Sunday	NECKLACES	11	231	Jan
Spring 2016	2016-02-06	2016-01-31	Sunday	SUNGLASSES	3	63	Jan

Let’s also see if there are any missing values from the new dataset, and finally check the structure of the dataset.

missing <-sapply(df, function(y) sum(length(which(is.na(y)))))
missing <- data.frame(missing)
missing

##              missing
## SeasonDesc         0
## Fiscal.Week        0
## Day                0
## Dayofweek          0
## ClassDesc          0
## SalesUnits         0
## SalesDollars       0
## month              0

str(df)

## 'data.frame':    40932 obs. of  8 variables:
##  $ SeasonDesc  : Factor w/ 2 levels "Fall 2016","Spring 2016": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Fiscal.Week : Date, format: "2016-02-06" "2016-02-06" ...
##  $ Day         : Date, format: "2016-01-31" "2016-01-31" ...
##  $ Dayofweek   : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ClassDesc   : Factor w/ 112 levels "111 BODYSUITS",..: 92 39 49 40 68 92 60 61 50 12 ...
##  $ SalesUnits  : int  2 1 1 1 11 3 2 10 4 1 ...
##  $ SalesDollars: int  60 18 12 35 231 63 100 416 212 60 ...
##  $ month       : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...

There are no missing values from the dataset. Now that we have a transformed data with a proper structure, we can continue with our analysis.

Data Analysis

Let us start by analysing the main variables, which are SalesUnits and SalesDollars. We will group the items by Day and ClassDesc and analyze daily sales data for the items.

data.frame(df %>% group_by(Day) %>% summarise(daily_mean=mean(SalesUnits)) %>% summarise(DailyAverageUnits=mean(daily_mean)))

##   DailyAverageUnits
## 1          9.146721

data.frame(df %>% group_by(Day) %>% summarise(daily_mean=mean(SalesDollars)) %>% summarise(DailyAverageSales=mean(daily_mean)))

##   DailyAverageSales
## 1          307.4234

kable(data.frame(dat %>% group_by(ClassDesc) %>% summarise(DailyAverageUnits=mean(SalesUnits)) %>% top_n(n=5, wt=DailyAverageUnits)))

ClassDesc	DailyAverageUnits
EARRINGS	24.13636
LS SHIRTS	25.05167
LS SOLID DRESS SHIRTS	29.43505
NECKLACES	25.58182
PULLOVERS	36.66364

kable(data.frame(dat %>% group_by(ClassDesc) %>% summarise(DailyAverageSales=mean(SalesDollars)) %>% top_n(n=5, wt=DailyAverageSales)))

ClassDesc	DailyAverageSales
DAY/CASUAL	832.3090
LEGGING	962.5006
LS SHIRTS	961.5760
LS SOLID DRESS SHIRTS	1132.2387
PULLOVERS	1227.6576

As we can see from the summary statistics, 9.14 items were sold per day in the year 2016, while the average sales per day was $307.42. This includes sale of all items during that day. Top 5 items with highest daily average units sold were Earrings, LS Shirts, LS Solid Dress Shirts, Necklaces and Pullovers. Similarly, top 5 items with highest daily average revenue were Day/Casual, Legging, LS Shirts, LS Solid Dress Shirts and Pullovers.

We can also do a seasonal and weekly analysis on the sales of the top 5 items for Fall and Spring.

seasonCount <- aggregate(df$SalesUnits, list(df$ClassDesc, df$SeasonDesc), sum)
scount <- data.frame(seasonCount %>% group_by(Group.2) %>% top_n(n=5, wt=x))
colnames(scount) <- c("Items", "Season", "Total Item Count")
kable(scount)

Items	Season	Total Item Count
LEGGING	Fall 2016	9089
LS TEES	Fall 2016	8000
PULLOVERS	Fall 2016	8776
SOCKS	Fall 2016	7750
SS TEES	Fall 2016	10248
DAY/CASUAL	Spring 2016	10490
LEGGING	Spring 2016	8343
LS SHIRTS	Spring 2016	9151
SS TEES	Spring 2016	25304
TANKS	Spring 2016	12149

Top 5 items that were sold in Fall were Legging, LS Tees, Pullovers, Socks and SS Tees. Similary, top 5 items sold in the Spring were Day/Casual, Legging, LS Shirts, SS Tees and Tanks. Let us now see the distribution of the average sales amount by the day of the week.

avgDaySales <- aggregate(df$SalesDollars, list(df$Dayofweek), mean)
ggplot(avgDaySales, aes(x=Group.1, y=x)) + geom_bar(stat="identity") + labs(x="Day", y="Average Sales")

From the barplot above, we can see that Satuday has the highest average amount of sales, whereas Tuesday has the lowest average sales amount. That means, people spend most on Saturdays and least on Tuesdays. We can also analyse the same data by month.

avgMonthlySales <- aggregate(df$SalesDollars, list(df$month), mean)
ggplot(avgMonthlySales, aes(x=Group.1, y=x)) + geom_bar(stat="identity") + labs(x="Month", y="Average Sales")

From the barplot above, we can see that in an average, people spent most in November, and spent least in January. This can be attributed to the holiday season during the last 2 months. We can also visualize the time-series data for daily total sales. It shows the total dollar sales for a day for 2016.

daily_data <- df %>% group_by(Day) %>% summarise(sales=sum(SalesDollars))
ggplot(daily_data, aes(x=Day, y=sales)) + geom_line() + xlab("Day") + ylab("Sales Amount")

From the timeseries plot above, we can see that the daily sales amount fluctuates regularly, spikes at just beginning of the months, and has a sharp rise during the month of December. This migh suggest variability in sales during certain times of the month or year. There is a sudden rise in sales during holiday season in December, and people spend money at the beginning of the month. This graph can help predict the pattern of sales during a fiscal year.

Sales Data Analysis for Daraz

Swaraj Rimal

1/16/2021

Reading data and transformation

Data Analysis