library(dplyr)
library(ggplot2)
library(knitr)

Reading data and transformation

The data file SalesData.csv consists of the daily sales data for the customers on one store location. We are required to analyze and visualize the sales data in useful ways possible. First we read the data that is available in CSV into a dataframe. Then we need to transform certain aspects of the data in order to analyze.

setwd('/Users/swaraj/Desktop/daraz')
dat <- read.csv('SalesData.csv')
kable(head(dat))
Fiscal.Season SeasonDesc Fiscal.Year Fiscal.Week Day Dayofweek Class ClassDesc Location Locdesc SalesUnits SalesDollars
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 10 SUNGLASSES 2061 TIMES SQUARE 2 60
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 15 FASHION ACCESSORIES 2061 TIMES SQUARE 1 18
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 18 HATS 2061 TIMES SQUARE 1 12
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 25 FASHION JEWELRY 2061 TIMES SQUARE 1 35
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 26 NECKLACES 2061 TIMES SQUARE 11 231
20161 Spring 2016 2016 2/6/2016 1/31/2016 Sunday 27 SUNGLASSES 2061 TIMES SQUARE 3 63
str(dat)
## 'data.frame':    40932 obs. of  12 variables:
##  $ Fiscal.Season: int  20161 20161 20161 20161 20161 20161 20161 20161 20161 20161 ...
##  $ SeasonDesc   : Factor w/ 2 levels "Fall 2016","Spring 2016": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Fiscal.Year  : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Fiscal.Week  : Factor w/ 48 levels "10/1/2016","10/15/2016",..: 18 18 18 18 18 18 18 18 18 18 ...
##  $ Day          : Factor w/ 330 levels "1/31/2016","10/1/2016",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Dayofweek    : Factor w/ 7 levels "Friday","Monday",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Class        : int  10 15 18 25 26 27 28 30 31 41 ...
##  $ ClassDesc    : Factor w/ 112 levels "111 BODYSUITS",..: 92 39 49 40 68 92 60 61 50 12 ...
##  $ Location     : int  2061 2061 2061 2061 2061 2061 2061 2061 2061 2061 ...
##  $ Locdesc      : Factor w/ 1 level "TIMES SQUARE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SalesUnits   : int  2 1 1 1 11 3 2 10 4 1 ...
##  $ SalesDollars : int  60 18 12 35 231 63 100 416 212 60 ...

We need to transform variables related to date to a date-specific format, and also transfrom some variables in to categorical. We also need to add a ‘month’ variable to analyze sales based on month.

dat$Class <- factor(dat$Class)
dy <- strptime(dat$Day, "%m/%d/%Y")
dat$Day <- as.Date(dy)

wk <- strptime(dat$Fiscal.Week, "%m/%d/%Y")
dat$Fiscal.Week <- as.Date(wk)

dat$Dayofweek <- factor(dat$Dayofweek, ordered = TRUE, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

month <- as.numeric(format(dat$Day, "%m"))
#dat$month <- month.abb[month]
dat$month <- factor(month.abb[month], ordered=TRUE, levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"))

Now,let’s look at the data summary and distribution of some of the key variables.

kable(head(dat))
Fiscal.Season SeasonDesc Fiscal.Year Fiscal.Week Day Dayofweek Class ClassDesc Location Locdesc SalesUnits SalesDollars month
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 10 SUNGLASSES 2061 TIMES SQUARE 2 60 Jan
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 15 FASHION ACCESSORIES 2061 TIMES SQUARE 1 18 Jan
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 18 HATS 2061 TIMES SQUARE 1 12 Jan
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 25 FASHION JEWELRY 2061 TIMES SQUARE 1 35 Jan
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 26 NECKLACES 2061 TIMES SQUARE 11 231 Jan
20161 Spring 2016 2016 2016-02-06 2016-01-31 Sunday 27 SUNGLASSES 2061 TIMES SQUARE 3 63 Jan
ggplot(dat, aes(x=SalesDollars)) + geom_histogram()

ggplot(dat, aes(x=SalesUnits)) + geom_histogram()

summary(dat$SalesUnits)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   5.000   9.303  11.000 255.000
summary(dat$SalesDollars)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    72.0   180.0   311.3   393.0  7539.0
summary(dat$Fiscal.Year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2016    2016    2016    2016    2016    2016
summary(dat$SeasonDesc)
##   Fall 2016 Spring 2016 
##       18213       22719
nlevels(dat$ClassDesc)
## [1] 112

As we can see from the distribution of SalesDollars and SalesUnits, they are heavily skewed towards right, which means that there are lots of data on the lower end of the value. It is typical of the sales data where number of customers who buy less items are greater than the customers who buy more items at once.

Looking at the data and the summary, we can exclude some of the variables from our analysis. We can remove the following variables: Fiscal.Season, Fiscal.Year, Class, Location and Locdesc. The var.iables of particular interests are SalesDollars, SalesUnits, month, Day and ClassDesc

rem <- names(dat) %in% c("Fiscal.Season", "Fiscal.Year", "Class", "Location","Locdesc")
df <- dat[!rem]
kable(head(df))
SeasonDesc Fiscal.Week Day Dayofweek ClassDesc SalesUnits SalesDollars month
Spring 2016 2016-02-06 2016-01-31 Sunday SUNGLASSES 2 60 Jan
Spring 2016 2016-02-06 2016-01-31 Sunday FASHION ACCESSORIES 1 18 Jan
Spring 2016 2016-02-06 2016-01-31 Sunday HATS 1 12 Jan
Spring 2016 2016-02-06 2016-01-31 Sunday FASHION JEWELRY 1 35 Jan
Spring 2016 2016-02-06 2016-01-31 Sunday NECKLACES 11 231 Jan
Spring 2016 2016-02-06 2016-01-31 Sunday SUNGLASSES 3 63 Jan

Let’s also see if there are any missing values from the new dataset, and finally check the structure of the dataset.

missing <-sapply(df, function(y) sum(length(which(is.na(y)))))
missing <- data.frame(missing)
missing
##              missing
## SeasonDesc         0
## Fiscal.Week        0
## Day                0
## Dayofweek          0
## ClassDesc          0
## SalesUnits         0
## SalesDollars       0
## month              0
str(df)
## 'data.frame':    40932 obs. of  8 variables:
##  $ SeasonDesc  : Factor w/ 2 levels "Fall 2016","Spring 2016": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Fiscal.Week : Date, format: "2016-02-06" "2016-02-06" ...
##  $ Day         : Date, format: "2016-01-31" "2016-01-31" ...
##  $ Dayofweek   : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ClassDesc   : Factor w/ 112 levels "111 BODYSUITS",..: 92 39 49 40 68 92 60 61 50 12 ...
##  $ SalesUnits  : int  2 1 1 1 11 3 2 10 4 1 ...
##  $ SalesDollars: int  60 18 12 35 231 63 100 416 212 60 ...
##  $ month       : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 1 1 1 1 1 1 1 1 1 1 ...

There are no missing values from the dataset. Now that we have a transformed data with a proper structure, we can continue with our analysis.

Data Analysis

Let us start by analysing the main variables, which are SalesUnits and SalesDollars. We will group the items by Day and ClassDesc and analyze daily sales data for the items.

data.frame(df %>% group_by(Day) %>% summarise(daily_mean=mean(SalesUnits)) %>% summarise(DailyAverageUnits=mean(daily_mean)))
##   DailyAverageUnits
## 1          9.146721
data.frame(df %>% group_by(Day) %>% summarise(daily_mean=mean(SalesDollars)) %>% summarise(DailyAverageSales=mean(daily_mean)))
##   DailyAverageSales
## 1          307.4234
kable(data.frame(dat %>% group_by(ClassDesc) %>% summarise(DailyAverageUnits=mean(SalesUnits)) %>% top_n(n=5, wt=DailyAverageUnits)))
ClassDesc DailyAverageUnits
EARRINGS 24.13636
LS SHIRTS 25.05167
LS SOLID DRESS SHIRTS 29.43505
NECKLACES 25.58182
PULLOVERS 36.66364
kable(data.frame(dat %>% group_by(ClassDesc) %>% summarise(DailyAverageSales=mean(SalesDollars)) %>% top_n(n=5, wt=DailyAverageSales)))
ClassDesc DailyAverageSales
DAY/CASUAL 832.3090
LEGGING 962.5006
LS SHIRTS 961.5760
LS SOLID DRESS SHIRTS 1132.2387
PULLOVERS 1227.6576

As we can see from the summary statistics, 9.14 items were sold per day in the year 2016, while the average sales per day was $307.42. This includes sale of all items during that day. Top 5 items with highest daily average units sold were Earrings, LS Shirts, LS Solid Dress Shirts, Necklaces and Pullovers. Similarly, top 5 items with highest daily average revenue were Day/Casual, Legging, LS Shirts, LS Solid Dress Shirts and Pullovers.

We can also do a seasonal and weekly analysis on the sales of the top 5 items for Fall and Spring.

seasonCount <- aggregate(df$SalesUnits, list(df$ClassDesc, df$SeasonDesc), sum)
scount <- data.frame(seasonCount %>% group_by(Group.2) %>% top_n(n=5, wt=x))
colnames(scount) <- c("Items", "Season", "Total Item Count")
kable(scount)
Items Season Total Item Count
LEGGING Fall 2016 9089
LS TEES Fall 2016 8000
PULLOVERS Fall 2016 8776
SOCKS Fall 2016 7750
SS TEES Fall 2016 10248
DAY/CASUAL Spring 2016 10490
LEGGING Spring 2016 8343
LS SHIRTS Spring 2016 9151
SS TEES Spring 2016 25304
TANKS Spring 2016 12149

Top 5 items that were sold in Fall were Legging, LS Tees, Pullovers, Socks and SS Tees. Similary, top 5 items sold in the Spring were Day/Casual, Legging, LS Shirts, SS Tees and Tanks. Let us now see the distribution of the average sales amount by the day of the week.

avgDaySales <- aggregate(df$SalesDollars, list(df$Dayofweek), mean)
ggplot(avgDaySales, aes(x=Group.1, y=x)) + geom_bar(stat="identity") + labs(x="Day", y="Average Sales")

From the barplot above, we can see that Satuday has the highest average amount of sales, whereas Tuesday has the lowest average sales amount. That means, people spend most on Saturdays and least on Tuesdays. We can also analyse the same data by month.

avgMonthlySales <- aggregate(df$SalesDollars, list(df$month), mean)
ggplot(avgMonthlySales, aes(x=Group.1, y=x)) + geom_bar(stat="identity") + labs(x="Month", y="Average Sales")

From the barplot above, we can see that in an average, people spent most in November, and spent least in January. This can be attributed to the holiday season during the last 2 months. We can also visualize the time-series data for daily total sales. It shows the total dollar sales for a day for 2016.

daily_data <- df %>% group_by(Day) %>% summarise(sales=sum(SalesDollars))
ggplot(daily_data, aes(x=Day, y=sales)) + geom_line() + xlab("Day") + ylab("Sales Amount")

From the timeseries plot above, we can see that the daily sales amount fluctuates regularly, spikes at just beginning of the months, and has a sharp rise during the month of December. This migh suggest variability in sales during certain times of the month or year. There is a sudden rise in sales during holiday season in December, and people spend money at the beginning of the month. This graph can help predict the pattern of sales during a fiscal year.