This lab is based on an interesting kaggle competition to predict sales at a German drug store chain. Here is the description:
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!
Here are the variables available to us:
| Id | an Id that represents a (Store, Date) duple within the test set |
| Store | a unique Id for each store |
| Sales | the turnover for any given day (this is what you are predicting) |
| Customers | the number of customers on a given day |
| Open | an indicator for whether the store was open: 0 = closed, 1 = open |
| StateHoliday | indicates a state holiday. a = public holiday, b = Easter holiday, c = Christmas, 0 = None |
| SchoolHoliday | indicates if the (Store, Date) was affected by the closure of public schools |
| StoreType | differentiates between 4 different store models: a, b, c, d |
| Assortment | describes an assortment level: a = basic, b = extra, c = extended |
| CompetitionDistance | distance in meters to the nearest competitor store |
| CompetitionOpenSince[Month/Year] | gives the approximate year and month of the time the nearest competitor was opened |
| Promo | indicates whether a store is running a promo on that day |
| Promo2 | continuing and consecutive promotion: 0 = store is not participating, 1 = store is participating |
| Promo2Since[Year/Week] | describes the year and calendar week when the store started participating in Promo2 |
| PromoInterval | describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store |
There are three data sets: rossmann_train.csv, rossmann_store.csv, rossmann_test.csv
library(ggplot2)
library(dplyr)
library(lubridate)
library(stargazer)
sales <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/train.csv")
#sales <- read.csv("https://www.dropbox.com/s/r1iptvp4ddrz8m9/rossmann_train.csv?raw=1")
str(sales)
## 'data.frame': 1017209 obs. of 9 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 5 5 5 5 5 5 5 5 5 5 ...
## $ Date : Factor w/ 942 levels "2013-01-01","2013-01-02",..: 942 942 942 942 942 942 942 942 942 942 ...
## $ Sales : int 5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
## $ Customers : int 555 625 821 1498 559 589 1414 833 687 681 ...
## $ Open : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : int 1 1 1 1 1 1 1 1 1 1 ...
## $ StateHoliday : Factor w/ 4 levels "0","a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
## $ SchoolHoliday: int 1 1 1 1 1 1 1 1 1 1 ...
Date is a factor. We should change it into a proper date. DayOfWeek is also an integer though we are not sure what the numbers represent. (Week starts on Monday in Europe.) So we will use a function wday() from the lubridate package to get the day of the week from a date.
sales$Date <- as.Date(sales$Date)
sales$day <- wday(sales$Date, label = TRUE, abbr = TRUE) #creates 'ordered factor'
sales$day <- factor(sales$day, ordered=FALSE) # makes it a factor
sales <- select(sales, -DayOfWeek) #not needed anymore
Let’s see how the StateHoliday correlates with opening of stores.
table(sales$StateHoliday, sales$Open)
##
## 0 1
## 0 142677 843482
## a 19566 694
## b 6545 145
## c 4029 71
The table above shows us how observations fall into different categories based on whether a store is open (column headings) and whether there is a state holiday and if so what type. It looks like most of the observations fall into no holiday (0) and open (1) category. Holiday a (public holiday) is associate with a lot more stores open than either Easter (b) or Christmas (c). Still, shockingly we have 140 thousand observations when stores are closed.
IN-CLASS EXERCISE 1: Can you figure out when Rossemann stores are closed?
Let’s examine if there are any sales when the stores are not open. If there no sales when the store is closed (as there should be no sales), let’s just filter out days and stores that are closed.
summary(filter(sales,Open==0))
## Store Date Sales Customers Open
## Min. : 1.0 Min. :2013-01-01 Min. :0 Min. :0 Min. :0
## 1st Qu.: 279.0 1st Qu.:2013-08-18 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median : 560.0 Median :2014-04-13 Median :0 Median :0 Median :0
## Mean : 558.5 Mean :2014-04-11 Mean :0 Mean :0 Mean :0
## 3rd Qu.: 839.0 3rd Qu.:2014-12-21 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :1115.0 Max. :2015-07-31 Max. :0 Max. :0 Max. :0
##
## Promo StateHoliday SchoolHoliday day
## Min. :0.00000 0:142677 Min. :0.0000 Sun :141137
## 1st Qu.:0.00000 a: 19566 1st Qu.:0.0000 Mon : 7170
## Median :0.00000 b: 6545 Median :0.0000 Tues : 1703
## Mean :0.06472 c: 4029 Mean :0.1057 Wed : 3729
## 3rd Qu.:0.00000 3rd Qu.:0.0000 Thurs: 11201
## Max. :1.00000 Max. :1.0000 Fri : 7205
## Sat : 672
sales <- filter(sales, Open==1)
sales <- select(sales, -Open) #not needed anymore
Let’s calculate average sales for each store so that we have an idea on how the stores differ in size. Let’s then plot average sales in a bar graph and a histogram. Note that the bars in the bar graph are ordered by minus average sales to so that the stores are ordered from biggest to smallest.
avsales <- sales %>% group_by(Store) %>% summarise(av_sales = mean(Sales))
ggplot(avsales, aes(x=reorder(Store,-av_sales),y=av_sales)) + geom_bar(stat="identity")
## Warning: position_stack requires constant width: output may be incorrect
ggplot(avsales, aes(x=av_sales)) + geom_histogram(binwidth = 100)
summary(avsales)
## Store av_sales
## Min. : 1.0 Min. : 2704
## 1st Qu.: 279.5 1st Qu.: 5322
## Median : 558.0 Median : 6590
## Mean : 558.0 Mean : 6934
## 3rd Qu.: 836.5 3rd Qu.: 7964
## Max. :1115.0 Max. :21757
IN-CLASS EXERCISE 2: Plot sales of store number 101 over time.
We would like to see if the pattern of fluctuations over time is typical. One option we have is to average across all stores for each date. However, such average would be dominated by the fluctuations at the largest stores. The other option is that for each store we calculate percent deviation form that store’s average sales. (We’ll call it dsales). For each store and date this variable will tell us by what percentage are sales higher or lower relative to that store’s average sales. We can then average these deviations for each day.
sales <- sales %>% group_by(Store) %>% mutate(av_sales=mean(Sales))
sales$dsales <- (sales$Sales-sales$av_sales)/sales$av_sales*100
summary(sales$dsales)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -100.000 -19.250 -3.382 0.000 14.940 477.400
We see that the half of store sales are between 19 percent below the store average, and 14 percent above the store average. Let’s now plot these fluctuations by date.
bydate <- sales %>% group_by(Date) %>% summarise(av_dsales=mean(dsales))
ggplot(bydate, aes(x=Date, y=av_dsales)) + geom_line()
IN-CLASS EXERCISE 3: Let’s plot dsales by day of the week.
IN-CLASS EXERCISE 4: How about by Promo or StateHoliday?
store data set.stores <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/store.csv")
#stores <- read.csv("https://www.dropbox.com/s/1sgbjkbcx18ggwq/rossmann_store.csv?raw=1")
str(stores)
## 'data.frame': 1115 obs. of 10 variables:
## $ Store : int 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : Factor w/ 4 levels "a","b","c","d": 3 1 1 3 1 1 1 1 1 1 ...
## $ Assortment : Factor w/ 3 levels "a","b","c": 1 1 1 3 1 1 3 1 3 1 ...
## $ CompetitionDistance : int 1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
## $ CompetitionOpenSinceMonth: int 9 11 12 9 4 12 4 10 8 9 ...
## $ CompetitionOpenSinceYear : int 2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
## $ Promo2 : int 0 1 1 0 0 0 0 0 0 0 ...
## $ Promo2SinceWeek : int NA 13 14 NA NA NA NA NA NA NA ...
## $ Promo2SinceYear : int NA 2010 2011 NA NA NA NA NA NA NA ...
## $ PromoInterval : Factor w/ 4 levels "","Feb,May,Aug,Nov",..: 1 3 3 1 1 1 1 1 1 1 ...
We see that stores come in four different types and three different levels of assortment. Let’s see how average sales of each store vary with store type.
stores <- inner_join(avsales, stores, by="Store")
bytype <- stores %>% group_by(StoreType) %>% summarise(av_sales=mean(av_sales))
ggplot(bytype, aes(x=StoreType, y=av_sales)) + geom_bar(stat="identity")
#byassort <- stores %>% group_by(Assortment) %>% summarise(av_sales=mean(av_sales))
#ggplot(byassort, aes(x=Assortment, y=av_sales)) + geom_bar(stat="identity")
summary(select(stores, CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear))
## CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
## Min. : 20.0 Min. : 1.000 Min. :1900
## 1st Qu.: 717.5 1st Qu.: 4.000 1st Qu.:2006
## Median : 2325.0 Median : 8.000 Median :2010
## Mean : 5404.9 Mean : 7.225 Mean :2009
## 3rd Qu.: 6882.5 3rd Qu.:10.000 3rd Qu.:2013
## Max. :75860.0 Max. :12.000 Max. :2015
## NA's :3 NA's :354 NA's :354
Competition distance varies from 20 meters to 75 kilometers. There are also three stores with NAs for competition distance. It is not clear what this means. Let’s ignore these for now, and plot average store sales against the log of distance to competition.
ggplot(stores, aes(x=CompetitionDistance, y=av_sales)) + geom_point() +
scale_x_continuous(trans="log")
## Warning: Removed 3 rows containing missing values (geom_point).
Looking again at the summary for the competition variables, it looks like 25% of stores got their competition in 2013 or later. Let’s create a date to indicate when competition arrived. Since we don’t know the exact day only month, let’s assume it happened on the 15th of the month.
Function paste() below pastes/strings together the characters in the variables we list as arguments, with sep= character in between. We use this to create a date. If date since competition is missing, let’s assume it has always been there by setting the date to January 1st, 2000.
stores$compdate <- as.Date(paste(stores$CompetitionOpenSinceYear,stores$CompetitionOpenSinceMonth,15, sep="-"))
stores$compdate[is.na(stores$CompetitionOpenSinceYear)] <- as.Date("2000-01-01")
Let’s merge the compdate into our sales data set. This way we will be able to see if the entrance of competition has an effect on sales.
sales <- inner_join(sales, select(stores, Store, compdate, StoreType, Assortment), by="Store")
sales$competition <- ifelse(sales$Date<sales$compdate, "no", "yes")
bycompetition <- sales %>% group_by(competition) %>% summarize(av_dsales=mean(dsales))
ggplot(bycompetition, aes(x=competition,y=av_dsales)) + geom_bar(stat="identity")
## Warning: Stacking not well defined when ymin != 0
IN-CLASS EXERCISE: Let’s limit to only stores that got competition during the period spanning our data so that we can compare sales before and after competition.
Let’s plot the time series of sales for store number 30. To take out the short term fluctuations by plotting geom_smooth().
ggplot(filter(sales, Store==30), aes(x=Date, y=Sales, color=competition)) + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
EXTENDED IN-CLASS EXERCISE: Could we create a data set with sales 100 days before and 100 after competition arrived?
Let’s estimate how the deviation from each store’s mean daily sales depends on the day of the week.
reg1 <- lm(dsales ~ day , data=sales)
summary(reg1)
##
## Call:
## lm(formula = dsales ~ day, data = sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -118.73 -18.18 -1.42 15.03 480.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.7274 0.4308 -24.90 <2e-16 ***
## dayMon 29.4529 0.4364 67.49 <2e-16 ***
## dayTues 13.0159 0.4361 29.84 <2e-16 ***
## dayWed 7.7558 0.4362 17.78 <2e-16 ***
## dayThurs 8.2756 0.4365 18.96 <2e-16 ***
## dayFri 12.7930 0.4363 29.32 <2e-16 ***
## daySat -5.9413 0.4361 -13.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.82 on 844385 degrees of freedom
## Multiple R-squared: 0.1406, Adjusted R-squared: 0.1406
## F-statistic: 2.302e+04 on 6 and 844385 DF, p-value: < 2.2e-16
IN-CLASS EXERCISE 5: Estimate a regression with sales as a function of day of the week and month of the year.
Let’s now also use state and school holidays.
reg3 <- lm(dsales ~ day + month + StateHoliday + SchoolHoliday, data=sales)
#summary(reg3)
What about the effect of competition?
reg4 <- lm(dsales ~ day + month + StateHoliday + SchoolHoliday + competition, data=sales)
#summary(reg4)
IN-CLASS EXERCISE 6: Controlling for day of the week, month, state and school holidays, do promotions increase sales?
#stargazer(reg3, reg4, reg5, type="text")
Let’s load the test data set from kaggle, create date, day of the week and month variables.
test <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/test.csv")
#test <- read.csv("https://www.dropbox.com/s/atatbelumf8glhy/rossmann_test.csv?raw=1")
test$Date <- as.Date(test$Date)
test$day <- wday(test$Date, label = TRUE, abbr = TRUE)
test$day <- factor(test$day, ordered=FALSE)
test$month <- month(test$Date, label = TRUE)
test$month <- factor(test$month, ordered=FALSE)
summary(test)
## Id Store DayOfWeek Date
## Min. : 1 Min. : 1.0 Min. :1.000 Min. :2015-08-01
## 1st Qu.:10273 1st Qu.: 279.8 1st Qu.:2.000 1st Qu.:2015-08-12
## Median :20545 Median : 553.5 Median :4.000 Median :2015-08-24
## Mean :20545 Mean : 555.9 Mean :3.979 Mean :2015-08-24
## 3rd Qu.:30816 3rd Qu.: 832.2 3rd Qu.:6.000 3rd Qu.:2015-09-05
## Max. :41088 Max. :1115.0 Max. :7.000 Max. :2015-09-17
##
## Open Promo StateHoliday SchoolHoliday
## Min. :0.0000 Min. :0.0000 0:40908 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 a: 180 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8543 Mean :0.3958 Mean :0.4435
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :11
## day month
## Sun :5992 Aug:26536
## Mon :5992 Sep:14552
## Tues :5992
## Wed :5992
## Thurs:5992
## Fri :5136
## Sat :5992
We see we have to predict sales from August 1, 2015 to September 15, 2015. The kaggle competition says that days when a store is closed will not be counted. Our predictions will be evaluated using root mean square percentage error.
\(\textrm{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}\)
We’ll merge in average sales from the training data. This is because we will use average daily sales from the training data to make predictions for sales for the next 6 weeks.
test <- inner_join(test, select(stores, Store, av_sales), by="Store")
We can use function predict() to predict dsales. Let’s first use our reg3 model which used day of the week, month and state and school holiday as predictors.
test$dsales <- predict(reg3, test)
The above created a vector of predicted percent deviation from average sales. However, we need to predict total sales rather than the percent deviation from average sales. Fortunately, since we know average sales we can calculate total sales by multiplying average sales (av_sales) by the percent deviation from average sales (dsales).
test$Sales <- test$av_sales*(1+test$dsales/100)
write.csv(select(test,Id, Sales), file = "C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/submission.csv", row.names = FALSE)
This submission ranks about 2800.
IN-CLASS EXERCISE: Make a prediction using our reg5 model (it included Promo variable). DOes this model do better in the competition?