INTRODUCTION
Rossmann is a Germany drug store chain with over 3790 stores in Europe. In this problem set obtained from Kaggle.com, participants are given records of sales of each store on different days, from 01/01/2013 to 31/07/2015. Sales can be affected by many factors, from holidays, promotions and competitions. Our goal is to explore these correlations using visualization tools in R, which will be helpful in predicting the sales of the store 6 weeks in advance, from 08/01/2015 to 09/17/2015. This RPub only includes exploratory data analysis only.
DATA CLEANSING
Load packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
library(sqldf)
## Warning: package 'sqldf' was built under R version 4.0.2
## Loading required package: gsubfn
## Warning: package 'gsubfn' was built under R version 4.0.2
## Loading required package: proto
## Warning: package 'proto' was built under R version 4.0.2
## Loading required package: RSQLite
## Warning: package 'RSQLite' was built under R version 4.0.2
library(zoo)
## Warning: package 'zoo' was built under R version 4.0.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(reshape2)
Import train data
train <- read.csv("train.csv")
Show the structure of train data
str(train)
## 'data.frame': 1017209 obs. of 9 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Date : chr "7/31/2015" "7/31/2015" "7/31/2015" "7/31/2015" ...
## $ Sales : int 5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
## $ Customers : int 555 625 821 1498 559 589 1414 833 687 681 ...
## $ Open : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StateHoliday : chr "0" "0" "0" "0" ...
## $ SchoolHoliday: int 1 1 1 1 1 1 1 1 1 1 ...
Some data fields should be converted to a more suitable data type for the convenience of the explanatory process.
str(train$Date)
## chr [1:1017209] "7/31/2015" "7/31/2015" "7/31/2015" "7/31/2015" ...
Change the data type of “Date” from “char” to “date”
train$Date <- as.Date(train$Date,format = "%m/%d/%y")
Factorize categorical data fields in train data
train$DayOfWeek <- as.factor(as.integer(train$DayOfWeek))
train$StateHoliday <- as.factor(as.character(train$StateHoliday))
train$Open <- as.factor(as.character(train$Open))
train$Promo <- as.factor(as.character(train$Promo))
train$SchoolHoliday <- as.factor(as.character(train$SchoolHoliday))
Check the data modification and identify NA cases in the test table
summary(train)
## Store DayOfWeek Date Sales
## Min. : 1.0 1:144730 Min. :2020-01-01 Min. : 0
## 1st Qu.: 280.0 2:145664 1st Qu.:2020-03-18 1st Qu.: 3727
## Median : 558.0 3:145665 Median :2020-06-02 Median : 5744
## Mean : 558.4 4:145845 Mean :2020-06-11 Mean : 5774
## 3rd Qu.: 838.0 5:145845 3rd Qu.:2020-08-29 3rd Qu.: 7856
## Max. :1115.0 6:144730 Max. :2020-12-31 Max. :41551
## 7:144730
## Customers Open Promo StateHoliday SchoolHoliday
## Min. : 0.0 0:172817 0:629129 0:986159 0:835488
## 1st Qu.: 405.0 1:844392 1:388080 a: 20260 1:181721
## Median : 609.0 b: 6690
## Mean : 633.1 c: 4100
## 3rd Qu.: 837.0
## Max. :7388.0
##
No NULL values found. Now the train data is ready for the analysis process. We will move to cleaning the store data.
Assign column names for store data
colNames <- c ("Store", "StoreType", "Assortment", "CompetitionDistance",
"CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
"PromoContinuation", "PromoParticipationSinceWeek",
"PromoParticipationSinceYear", "PromoInterval")
Import store data
store <- read.table ("store.csv", header = TRUE, sep = ",",
strip.white = TRUE, col.names = colNames,
na.strings = "?", stringsAsFactors = TRUE)
Show the structure of the store data
str (store)
## 'data.frame': 1115 obs. of 10 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : Factor w/ 4 levels "a","b","c","d": 3 1 1 3 1 1 1 1 1 1 ...
## $ Assortment : Factor w/ 3 levels "a","b","c": 1 1 1 3 1 1 3 1 3 1 ...
## $ CompetitionDistance : int 1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
## $ CompetitionOpenSinceMonth : int 9 11 12 9 4 12 4 10 8 9 ...
## $ CompetitionOpenSinceYear : int 2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
## $ PromoContinuation : int 0 1 1 0 0 0 0 0 0 0 ...
## $ PromoParticipationSinceWeek: int NA 13 14 NA NA NA NA NA NA NA ...
## $ PromoParticipationSinceYear: int NA 2010 2011 NA NA NA NA NA NA NA ...
## $ PromoInterval : Factor w/ 4 levels "","Feb,May,Aug,Nov",..: 1 3 3 1 1 1 1 1 1 1 ...
Identify NA cases
table (complete.cases (store))
##
## FALSE TRUE
## 750 365
Take a look at the table summary to identify the NAs
summary(store)
## Store StoreType Assortment CompetitionDistance
## Min. : 1.0 a:602 a:593 Min. : 20.0
## 1st Qu.: 279.5 b: 17 b: 9 1st Qu.: 717.5
## Median : 558.0 c:148 c:513 Median : 2325.0
## Mean : 558.0 d:348 Mean : 5404.9
## 3rd Qu.: 836.5 3rd Qu.: 6882.5
## Max. :1115.0 Max. :75860.0
## NA's :3
## CompetitionOpenSinceMonth CompetitionOpenSinceYear PromoContinuation
## Min. : 1.000 Min. :1900 Min. :0.0000
## 1st Qu.: 4.000 1st Qu.:2006 1st Qu.:0.0000
## Median : 8.000 Median :2010 Median :1.0000
## Mean : 7.225 Mean :2009 Mean :0.5121
## 3rd Qu.:10.000 3rd Qu.:2013 3rd Qu.:1.0000
## Max. :12.000 Max. :2015 Max. :1.0000
## NA's :354 NA's :354
## PromoParticipationSinceWeek PromoParticipationSinceYear PromoInterval
## Min. : 1.0 Min. :2009 :544
## 1st Qu.:13.0 1st Qu.:2011 Feb,May,Aug,Nov :130
## Median :22.0 Median :2012 Jan,Apr,Jul,Oct :335
## Mean :23.6 Mean :2012 Mar,Jun,Sept,Dec:106
## 3rd Qu.:37.0 3rd Qu.:2013
## Max. :50.0 Max. :2015
## NA's :544 NA's :544
Replace the NAs in Competition Distance by its median
store$CompetitionDistance[is.na(store$CompetitionDistance)] <- median(store$CompetitionDistance, na.rm=TRUE)
Replace the remaining NA’s by 0
store[is.na(store)] <- 0
Factorize categorical data fields in store data
store$Store <- as.factor(as.integer(store$Store))
store$CompetitionOpenSinceYear <- as.factor(as.integer(store$CompetitionOpenSinceYear))
store$CompetitionOpenSinceMonth <- as.factor(as.integer(store$CompetitionOpenSinceMonth))
store$PromoContinuation <- as.factor(as.integer(store$PromoContinuation))
store$PromoParticipationSinceWeek <- as.factor(as.integer(store$PromoParticipationSinceWeek))
store$PromoParticipationSinceYear <- as.factor(as.integer(store$PromoParticipationSinceYear))
Double check the store’s summary
summary(store)
## Store StoreType Assortment CompetitionDistance
## 1 : 1 a:602 a:593 Min. : 20
## 2 : 1 b: 17 b: 9 1st Qu.: 720
## 3 : 1 c:148 c:513 Median : 2325
## 4 : 1 d:348 Mean : 5397
## 5 : 1 3rd Qu.: 6875
## 6 : 1 Max. :75860
## (Other):1109
## CompetitionOpenSinceMonth CompetitionOpenSinceYear PromoContinuation
## 0 :354 0 :354 0:544
## 9 :125 2013 : 83 1:571
## 4 : 94 2012 : 82
## 11 : 92 2014 : 70
## 3 : 70 2005 : 62
## 7 : 67 2010 : 55
## (Other):313 (Other):409
## PromoParticipationSinceWeek PromoParticipationSinceYear PromoInterval
## 0 :544 0 :544 :544
## 14 : 81 2011 :128 Feb,May,Aug,Nov :130
## 40 : 77 2013 :120 Jan,Apr,Jul,Oct :335
## 31 : 44 2014 : 95 Mar,Jun,Sept,Dec:106
## 10 : 42 2012 : 81
## 5 : 39 2009 : 73
## (Other):288 (Other): 74
EXPLORATION
Join train and store tables to further explore other correlations between the data fields of 2 tables.
train_store <- merge(train, store, by = "Store")
For graph to display number in full (E.g. 1000000 instead of 10e6)
options("scipen" = 10)
For our exploratory data analysis, we will look into the relationship between Sales and other data fields.
Sales vs Store
Here we made a vector containing the mean sales of 1115 stores
MeanSalesPerStore <- vector(mode = "numeric",length = 1115)
for (i in 1:1115) {
MeanSalesPerStore[i] <- mean(train_store$Sales[train_store$Store==i])
}
hist(MeanSalesPerStore,xlab="Sales (€)")
summary(MeanSalesPerStore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2245 4412 5459 5763 6634 20719
Sales vary from stores to stores. The outliers are stores with over €20719 in mean sales.
Sales vs DayOfWeek
boxplot(Sales ~ DayOfWeek,data=train_store)
Sales of day 7 of the week (Sunday) is extremely low compared to other dates.
We will take a closer look at Day 7.
Day7Sales <- subset(train_store,DayOfWeek==7)
summary(Day7Sales)
## Store DayOfWeek Date Sales
## Min. : 1.0 1: 0 Min. :2020-01-04 Min. : 0.0
## 1st Qu.: 280.0 2: 0 1st Qu.:2020-03-17 1st Qu.: 0.0
## Median : 558.0 3: 0 Median :2020-06-01 Median : 0.0
## Mean : 558.4 4: 0 Mean :2020-06-11 Mean : 204.2
## 3rd Qu.: 838.0 5: 0 3rd Qu.:2020-08-31 3rd Qu.: 0.0
## Max. :1115.0 6: 0 Max. :2020-12-29 Max. :37376.0
## 7:144730
## Customers Open Promo StateHoliday SchoolHoliday StoreType
## Min. : 0.00 0:141137 0:144730 0:144421 0:142006 a:78484
## 1st Qu.: 0.00 1: 3593 1: 0 a: 309 1: 2724 b: 2252
## Median : 0.00 b: 0 c:19468
## Mean : 35.79 c: 0 d:44526
## 3rd Qu.: 0.00
## Max. :5145.00
##
## Assortment CompetitionDistance CompetitionOpenSinceMonth
## a:76472 Min. : 20 0 :46006
## b: 1180 1st Qu.: 710 9 :16256
## c:67078 Median : 2325 4 :12388
## Mean : 5422 11 :12016
## 3rd Qu.: 6880 3 : 9042
## Max. :75860 7 : 8458
## (Other):40564
## CompetitionOpenSinceYear PromoContinuation PromoParticipationSinceWeek
## 0 :46006 0:72272 0 :72272
## 2013 :10732 1:72458 14 :10386
## 2012 :10572 40 : 8914
## 2014 : 9068 31 : 5688
## 2005 : 8048 10 : 5524
## 2010 : 7292 5 : 5096
## (Other):53012 (Other):36850
## PromoParticipationSinceYear PromoInterval
## 0 :72272 :72272
## 2011 :16372 Feb,May,Aug,Nov :16874
## 2013 :15716 Jan,Apr,Jul,Oct :41718
## 2014 :11378 Mar,Jun,Sept,Dec:13866
## 2012 :10412
## 2009 : 9288
## (Other): 9292
Look at “Sales” data, we see that most stores have 0 sales on Sunday, and this is because 97.5% (141137/144730) of the records indicated that the stores were closed on those dates.
Now we will check if the stores that were open on Day 7 had sales or not.
summary(subset(Day7Sales,Open==1,select=c(Sales)))
## Sales
## Min. : 286
## 1st Qu.: 3314
## Median : 6876
## Mean : 8225
## 3rd Qu.:11418
## Max. :37376
All stores that were opened on Day 7 had sales.
boxplot(Sales ~ DayOfWeek,data=train_store[train_store$Sales!=0,])
Sales of stores that were opened on Sunday are higher than weekdays. We will use this insight to check whether other factors affect the high sales.
Holiday might be a factor of store closure, so we will check that as follow:
sqldf("select Open, sum(StateHoliday), sum(SchoolHoliday) from Day7Sales group by Open")
## Open sum(StateHoliday) sum(SchoolHoliday)
## 1 0 0 2642
## 2 1 0 82
On all Day 7 records, no stores open on State Holiday. 2642 records that was on School Holiday indicated that the store was closed, while in total we have 141137 closed stores. Therefore, holidays are not a strong factor of store closure like we assumed.
Sales vs Date
nrow(unique(train["Date"]))
## [1] 365
There’s 942 different dates in the train table.
ggplot(train_store, aes(x=Date,y=Sales)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Sales increased from 2013 to 2015, with fluctuations. Sales tend to decrease mid-year and then took off again at the end of the year.
Sales vs Customers
ggplot(train_store, aes(x=Date,y=Customers)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Customers increased from 2013 to 2015.
We can see that the Customers’ graph shows a similar trend as Sales. We will check for correlation between Sales and Customers.
Linear <- lm(Sales ~ Customers, data=train_store)
summary(Linear)
##
## Call:
## lm(formula = Sales ~ Customers, data = train_store)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28685.0 -1077.7 -253.6 882.4 27735.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1077.736503 2.882656 373.9 <2e-16 ***
## Customers 7.417062 0.003671 2020.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1720 on 1017207 degrees of freedom
## Multiple R-squared: 0.8005, Adjusted R-squared: 0.8005
## F-statistic: 4.082e+06 on 1 and 1017207 DF, p-value: < 2.2e-16
Customers and Sales are strongly correlated (Adjusted R-squared: 0.8005)
Sales vs StateHoliday
boxplot(Sales ~ StateHoliday,data=train_store)
Sales are significantly lower on holidays. –> Strong predictor value.
Now we want to see which holidays had the most sales.
OpenOnHoliday <- subset(train_store,Open==1)
mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="a"])
## [1] 8487.471
mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="b"])
## [1] 9887.89
mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="c"])
## [1] 9743.746
Within 910 stores that are open on holiday, Easter (b) saw the highest sale (9887.89), slightly higher than Christmas (c) (9743.746) and significantly higher than public holidays (a) (8487.471).
Sales vs SchoolHoliday
boxplot(Sales ~ SchoolHoliday,data=train_store)
Sales on School Holiday were just slightly higher than that of non-School Holiday. Therefore, School Holiday is not a strong predictor value.
Sales vs StoreType
boxplot(Sales ~ StoreType,data=train_store)
Type b has the highest mean sales.
Distribution of each assortment
boxplot(Sales ~ Assortment,data=train_store)
We can see that assortment b had the best sales among three assortments. Even its average sales is higher than the others’ average sales.
Determine the sales of each assortment by dates
ggplot(train_store["Sales" != 0],
aes(x = as.Date(Date), y = Sales, color = Assortment)) +
geom_smooth(size = 1.5) + xlab("Date")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Assortment b always had higher sales than the other assortments whereas c and a’s performances shared the same shape
Customers per assortment
cust_a <- sum(train_store$Customers[train_store$Assortment == "a"])
cust_b <- sum(train_store$Customers[train_store$Assortment == "b"])
cust_c <- sum(train_store$Customers[train_store$Assortment == "c"])
barplot(c(cust_a,cust_b,cust_c), main = "Customers per assortment", names.arg = c("a","b","c"))
Furthermore, the number of customers who bought b assortment is extremely low compare to other assortments’. Hence, b assortment could be a totally different type of product while a and c might be related to each other. Based on the sales trend of b, there are 2 assumptions considered. First, although having a low number of customers, b assortment could be a product type that could be bought with a large amount. Second, the price of b assortment is much more higher than those of a and c. These are interesting insights than can be reffered to later in the exploration process.
We move on to the correlation between Competition Distance and Sales on open days
CDopenday <- sqldf("Select CompetitionDistance, avg(Sales) as AvgSales from train_store where Open = 1 group by store")
CDmodel = lm(AvgSales ~ CompetitionDistance, data = CDopenday)
CDmodelsum = summary(CDmodel)
plot(AvgSales ~ CompetitionDistance,CDopenday)
abline(CDmodel, col = 'blue')
legend("topright", bty = "n", legend = paste("R2 =", format(CDmodelsum$adj.r.squared, digits = 4)))
There is nearly no correlation between sales and competition distance. Competition Distance may not be considered in the model later
Let’s see which month and year these competition opened
CompeteYear <- sqldf("select CompetitionOpenSinceYear as SinceYear, log(sum(Sales)) as Sales, log(count(CompetitionOpenSinceYear)) as CompetitionYearOpenFrequency from train_store where CompetitionOpenSinceYear <> 0 group by CompetitionOpenSinceYear")
CompeteYear$SinceYear <- as.numeric(as.character(CompeteYear$SinceYear))
CompeteYear <- melt(CompeteYear,id = "SinceYear")
ggplot(data=CompeteYear, aes(x = SinceYear, y= value, colour = variable), xlab="Since Year") + geom_line(size=1)
Sales slightly followed the trend of the number of competitions open from 1900 until 2015. However, the change of number of competitor do not affect much the sales of Rossmann stores. The sales could be impacted by the other elements.
Days since start of Promo2 (PromoContinuation)
Promo2Year <- sqldf("select PromoParticipationSinceYear as PromoSinceYear, log(sum(Sales)) as Sales, log(count(PromoParticipationSinceYear)) as PromoYearFrequency from train_store where PromoSinceYear <> 0 group by PromoSinceYear")
Promo2Year$PromoSinceYear <- as.numeric(as.character(Promo2Year$PromoSinceYear))
Promo2Year <- melt(Promo2Year,id = "PromoSinceYear")
ggplot(data=Promo2Year, aes(x = PromoSinceYear, y= value, colour = variable), xlab="Promo Since Year") + geom_line(size=1)
PromoContinuation vs Sales
boxplot(Sales ~ PromoContinuation, data = train_store,
main = "Sales based on the PromoContinuation",
xlab = "PromoContinuation", ylab = "Sales", col = "yellow")
Sales when having a 2nd Promo were less than without a 2nd Promo but it is not significant. The reason for this trend may because the 2nd Promo was not as effective as the first promo.
Since there is 0 sales on closed days, I want to specifically look at the PromoContinuation data on Open days. The number of two categories of PromoContinuation are nearly equal, which is good for the comparison of sales between the two.
row_to_keep = which(as.integer(train_store$Open) > 0)
openday <- train_store[row_to_keep,]
We compare sales between promo day and not promo day
ggplot(openday["Sales" != 0],
aes(x = as.Date(Date), y = Sales, color = factor(Promo))) +
geom_smooth(size = 1.5) + xlab("Date")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
promoY <- mean(train_store$Sales[train_store$Promo == 1])
promoN <- mean(train_store$Sales[train_store$Promo == 0])
barplot(c(promoY,promoN), main = "Average sales per Promo", names.arg = c("1","0"))
The graph follow the sales trend where sales dropped midyear and increase at the end of the year.Sales nearly doubled when there was a promo on that day. This is another trend that shoud be taken into consideration.
Determine the sales of each PromoInterval
IntervalsOnly <- subset(train_store, PromoInterval == 'Feb,May,Aug,Nov'| PromoInterval == 'Jan,Apr,Jul,Oct'|PromoInterval == 'Mar,Jun,Sept,Dec')
boxplot(Sales ~ PromoInterval, data = IntervalsOnly,
main = "Sales based on the Promo Interval",
xlab = "PromoInterval", ylab = "Sales", col = "blue")
Overall, all intervals share relatively same mean, quartiles, and minimum and maximum values. However, The “Feb,May,Aug,Nov” interval had the highest outlier
I will take a closer look at that row
sqldf("SELECT * FROM IntervalsOnly WHERE PromoInterval = 'Feb,May,Aug,Nov' ORDER BY Sales Desc LIMIT 1")
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## 1 909 1 2020-06-22 41551 1721 1 0 0
## SchoolHoliday StoreType Assortment CompetitionDistance
## 1 0 a c 1680
## CompetitionOpenSinceMonth CompetitionOpenSinceYear PromoContinuation
## 1 0 0 1
## PromoParticipationSinceWeek PromoParticipationSinceYear PromoInterval
## 1 45 2009 Feb,May,Aug,Nov
It can be clearly seen that this is the highest sales from the data. It is interesting that this store is a type a store, which sales performance wasn’t as outstanding as store type b and it sold assortment c, which did not contribute high sales as high as assortment b. However, it follows the 4 trends estabished, which futher support their importance to the model. Therefore, PromoInterval did not have significant impact on Sales.