Lab 10: Regression Analysis

Learning objectives:

  • explore data
  • plot data
  • merge data
  • estimate regression
  • predict using linear regression model

1. Introduction

This lab is based on an interesting kaggle competition to predict sales at a German drug store chain. Here is the description:

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams!

Here are the variables available to us:

Id an Id that represents a (Store, Date) duple within the test set
Store a unique Id for each store
Sales the turnover for any given day (this is what you are predicting)
Customers the number of customers on a given day
Open an indicator for whether the store was open: 0 = closed, 1 = open
StateHoliday indicates a state holiday. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
SchoolHoliday indicates if the (Store, Date) was affected by the closure of public schools
StoreType differentiates between 4 different store models: a, b, c, d
Assortment describes an assortment level: a = basic, b = extra, c = extended
CompetitionDistance distance in meters to the nearest competitor store
CompetitionOpenSince[Month/Year] gives the approximate year and month of the time the nearest competitor was opened
Promo indicates whether a store is running a promo on that day
Promo2 continuing and consecutive promotion: 0 = store is not participating, 1 = store is participating
Promo2Since[Year/Week] describes the year and calendar week when the store started participating in Promo2
PromoInterval describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

There are three data sets: rossmann_train.csv, rossmann_store.csv, rossmann_test.csv

2. Explore the data

library(ggplot2)
library(dplyr)
library(lubridate)
library(stargazer)
sales <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/train.csv")
#sales <- read.csv("https://www.dropbox.com/s/r1iptvp4ddrz8m9/rossmann_train.csv?raw=1")
str(sales)
## 'data.frame':    1017209 obs. of  9 variables:
##  $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Date         : Factor w/ 942 levels "2013-01-01","2013-01-02",..: 942 942 942 942 942 942 942 942 942 942 ...
##  $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
##  $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
##  $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StateHoliday : Factor w/ 4 levels "0","a","b","c": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SchoolHoliday: int  1 1 1 1 1 1 1 1 1 1 ...

Date is a factor. We should change it into a proper date. DayOfWeek is also an integer though we are not sure what the numbers represent. (Week starts on Monday in Europe.) So we will use a function wday() from the lubridate package to get the day of the week from a date.

sales$Date <- as.Date(sales$Date)
sales$day <- wday(sales$Date, label = TRUE, abbr = TRUE) #creates 'ordered factor'
sales$day <- factor(sales$day, ordered=FALSE) # makes it a factor
sales <- select(sales, -DayOfWeek) #not needed anymore

2. Holidays

Let’s see how the StateHoliday correlates with opening of stores.

table(sales$StateHoliday, sales$Open)
##    
##          0      1
##   0 142677 843482
##   a  19566    694
##   b   6545    145
##   c   4029     71

The table above shows us how observations fall into different categories based on whether a store is open (column headings) and whether there is a state holiday and if so what type. It looks like most of the observations fall into no holiday (0) and open (1) category. Holiday a (public holiday) is associate with a lot more stores open than either Easter (b) or Christmas (c). Still, shockingly we have 140 thousand observations when stores are closed.

IN-CLASS EXERCISE 1: Can you figure out when Rossemann stores are closed?

Let’s examine if there are any sales when the stores are not open. If there no sales when the store is closed (as there should be no sales), let’s just filter out days and stores that are closed.

summary(filter(sales,Open==0))
##      Store             Date                Sales     Customers      Open  
##  Min.   :   1.0   Min.   :2013-01-01   Min.   :0   Min.   :0   Min.   :0  
##  1st Qu.: 279.0   1st Qu.:2013-08-18   1st Qu.:0   1st Qu.:0   1st Qu.:0  
##  Median : 560.0   Median :2014-04-13   Median :0   Median :0   Median :0  
##  Mean   : 558.5   Mean   :2014-04-11   Mean   :0   Mean   :0   Mean   :0  
##  3rd Qu.: 839.0   3rd Qu.:2014-12-21   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
##  Max.   :1115.0   Max.   :2015-07-31   Max.   :0   Max.   :0   Max.   :0  
##                                                                           
##      Promo         StateHoliday SchoolHoliday       day        
##  Min.   :0.00000   0:142677     Min.   :0.0000   Sun  :141137  
##  1st Qu.:0.00000   a: 19566     1st Qu.:0.0000   Mon  :  7170  
##  Median :0.00000   b:  6545     Median :0.0000   Tues :  1703  
##  Mean   :0.06472   c:  4029     Mean   :0.1057   Wed  :  3729  
##  3rd Qu.:0.00000                3rd Qu.:0.0000   Thurs: 11201  
##  Max.   :1.00000                Max.   :1.0000   Fri  :  7205  
##                                                  Sat  :   672
sales <- filter(sales, Open==1)
sales <- select(sales, -Open) #not needed anymore

3. Variation in sales across stores

Let’s calculate average sales for each store so that we have an idea on how the stores differ in size. Let’s then plot average sales in a bar graph and a histogram. Note that the bars in the bar graph are ordered by minus average sales to so that the stores are ordered from biggest to smallest.

avsales <- sales %>% group_by(Store) %>% summarise(av_sales = mean(Sales))
ggplot(avsales, aes(x=reorder(Store,-av_sales),y=av_sales)) + geom_bar(stat="identity")
## Warning: position_stack requires constant width: output may be incorrect

ggplot(avsales, aes(x=av_sales)) +  geom_histogram(binwidth = 100)

summary(avsales)
##      Store           av_sales    
##  Min.   :   1.0   Min.   : 2704  
##  1st Qu.: 279.5   1st Qu.: 5322  
##  Median : 558.0   Median : 6590  
##  Mean   : 558.0   Mean   : 6934  
##  3rd Qu.: 836.5   3rd Qu.: 7964  
##  Max.   :1115.0   Max.   :21757

4. Variation in sales over time

IN-CLASS EXERCISE 2: Plot sales of store number 101 over time.

We would like to see if the pattern of fluctuations over time is typical. One option we have is to average across all stores for each date. However, such average would be dominated by the fluctuations at the largest stores. The other option is that for each store we calculate percent deviation form that store’s average sales. (We’ll call it dsales). For each store and date this variable will tell us by what percentage are sales higher or lower relative to that store’s average sales. We can then average these deviations for each day.

sales <- sales %>% group_by(Store) %>% mutate(av_sales=mean(Sales))
sales$dsales <- (sales$Sales-sales$av_sales)/sales$av_sales*100
summary(sales$dsales)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -100.000  -19.250   -3.382    0.000   14.940  477.400

We see that the half of store sales are between 19 percent below the store average, and 14 percent above the store average. Let’s now plot these fluctuations by date.

bydate <- sales %>% group_by(Date) %>% summarise(av_dsales=mean(dsales))
ggplot(bydate, aes(x=Date, y=av_dsales)) + geom_line()

IN-CLASS EXERCISE 3: Let’s plot dsales by day of the week.

IN-CLASS EXERCISE 4: How about by Promo or StateHoliday?

5. Using store characteristics from the store data set.

stores <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/store.csv")
#stores <- read.csv("https://www.dropbox.com/s/1sgbjkbcx18ggwq/rossmann_store.csv?raw=1")
str(stores)
## 'data.frame':    1115 obs. of  10 variables:
##  $ Store                    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ StoreType                : Factor w/ 4 levels "a","b","c","d": 3 1 1 3 1 1 1 1 1 1 ...
##  $ Assortment               : Factor w/ 3 levels "a","b","c": 1 1 1 3 1 1 3 1 3 1 ...
##  $ CompetitionDistance      : int  1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
##  $ CompetitionOpenSinceMonth: int  9 11 12 9 4 12 4 10 8 9 ...
##  $ CompetitionOpenSinceYear : int  2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
##  $ Promo2                   : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ Promo2SinceWeek          : int  NA 13 14 NA NA NA NA NA NA NA ...
##  $ Promo2SinceYear          : int  NA 2010 2011 NA NA NA NA NA NA NA ...
##  $ PromoInterval            : Factor w/ 4 levels "","Feb,May,Aug,Nov",..: 1 3 3 1 1 1 1 1 1 1 ...

We see that stores come in four different types and three different levels of assortment. Let’s see how average sales of each store vary with store type.

stores <- inner_join(avsales, stores, by="Store")
bytype <- stores %>% group_by(StoreType) %>% summarise(av_sales=mean(av_sales))
ggplot(bytype, aes(x=StoreType, y=av_sales)) + geom_bar(stat="identity")

#byassort <- stores %>% group_by(Assortment) %>% summarise(av_sales=mean(av_sales))
#ggplot(byassort, aes(x=Assortment, y=av_sales)) + geom_bar(stat="identity")

6. The effect of competition

summary(select(stores, CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear))
##  CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear
##  Min.   :   20.0     Min.   : 1.000            Min.   :1900            
##  1st Qu.:  717.5     1st Qu.: 4.000            1st Qu.:2006            
##  Median : 2325.0     Median : 8.000            Median :2010            
##  Mean   : 5404.9     Mean   : 7.225            Mean   :2009            
##  3rd Qu.: 6882.5     3rd Qu.:10.000            3rd Qu.:2013            
##  Max.   :75860.0     Max.   :12.000            Max.   :2015            
##  NA's   :3           NA's   :354               NA's   :354

Competition distance varies from 20 meters to 75 kilometers. There are also three stores with NAs for competition distance. It is not clear what this means. Let’s ignore these for now, and plot average store sales against the log of distance to competition.

ggplot(stores, aes(x=CompetitionDistance, y=av_sales)) +  geom_point() +
    scale_x_continuous(trans="log")
## Warning: Removed 3 rows containing missing values (geom_point).

Looking again at the summary for the competition variables, it looks like 25% of stores got their competition in 2013 or later. Let’s create a date to indicate when competition arrived. Since we don’t know the exact day only month, let’s assume it happened on the 15th of the month.

Function paste() below pastes/strings together the characters in the variables we list as arguments, with sep= character in between. We use this to create a date. If date since competition is missing, let’s assume it has always been there by setting the date to January 1st, 2000.

stores$compdate <- as.Date(paste(stores$CompetitionOpenSinceYear,stores$CompetitionOpenSinceMonth,15, sep="-"))
stores$compdate[is.na(stores$CompetitionOpenSinceYear)] <- as.Date("2000-01-01")

Let’s merge the compdate into our sales data set. This way we will be able to see if the entrance of competition has an effect on sales.

sales <- inner_join(sales, select(stores, Store, compdate, StoreType, Assortment), by="Store")
sales$competition <- ifelse(sales$Date<sales$compdate, "no", "yes")

bycompetition <- sales %>% group_by(competition) %>% summarize(av_dsales=mean(dsales))
ggplot(bycompetition, aes(x=competition,y=av_dsales)) + geom_bar(stat="identity")
## Warning: Stacking not well defined when ymin != 0

IN-CLASS EXERCISE: Let’s limit to only stores that got competition during the period spanning our data so that we can compare sales before and after competition.

Let’s plot the time series of sales for store number 30. To take out the short term fluctuations by plotting geom_smooth().

ggplot(filter(sales, Store==30), aes(x=Date, y=Sales, color=competition)) + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

EXTENDED IN-CLASS EXERCISE: Could we create a data set with sales 100 days before and 100 after competition arrived?

7. Estimate a regression

Let’s estimate how the deviation from each store’s mean daily sales depends on the day of the week.

reg1 <- lm(dsales ~ day , data=sales)
summary(reg1)
## 
## Call:
## lm(formula = dsales ~ day, data = sales)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -118.73  -18.18   -1.42   15.03  480.34 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.7274     0.4308  -24.90   <2e-16 ***
## dayMon       29.4529     0.4364   67.49   <2e-16 ***
## dayTues      13.0159     0.4361   29.84   <2e-16 ***
## dayWed        7.7558     0.4362   17.78   <2e-16 ***
## dayThurs      8.2756     0.4365   18.96   <2e-16 ***
## dayFri       12.7930     0.4363   29.32   <2e-16 ***
## daySat       -5.9413     0.4361  -13.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.82 on 844385 degrees of freedom
## Multiple R-squared:  0.1406, Adjusted R-squared:  0.1406 
## F-statistic: 2.302e+04 on 6 and 844385 DF,  p-value: < 2.2e-16

IN-CLASS EXERCISE 5: Estimate a regression with sales as a function of day of the week and month of the year.

Let’s now also use state and school holidays.

reg3 <- lm(dsales ~ day + month + StateHoliday + SchoolHoliday, data=sales)
#summary(reg3)

What about the effect of competition?

reg4 <- lm(dsales ~ day + month + StateHoliday + SchoolHoliday + competition, data=sales)
#summary(reg4)

IN-CLASS EXERCISE 6: Controlling for day of the week, month, state and school holidays, do promotions increase sales?

#stargazer(reg3, reg4, reg5, type="text")

8. Make predictions

Let’s load the test data set from kaggle, create date, day of the week and month variables.

test <- read.csv("C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/test.csv")
#test <- read.csv("https://www.dropbox.com/s/atatbelumf8glhy/rossmann_test.csv?raw=1")
test$Date <- as.Date(test$Date)
test$day <- wday(test$Date, label = TRUE, abbr = TRUE)
test$day <- factor(test$day, ordered=FALSE)
test$month <- month(test$Date, label = TRUE)
test$month <- factor(test$month, ordered=FALSE)
summary(test)
##        Id            Store          DayOfWeek          Date           
##  Min.   :    1   Min.   :   1.0   Min.   :1.000   Min.   :2015-08-01  
##  1st Qu.:10273   1st Qu.: 279.8   1st Qu.:2.000   1st Qu.:2015-08-12  
##  Median :20545   Median : 553.5   Median :4.000   Median :2015-08-24  
##  Mean   :20545   Mean   : 555.9   Mean   :3.979   Mean   :2015-08-24  
##  3rd Qu.:30816   3rd Qu.: 832.2   3rd Qu.:6.000   3rd Qu.:2015-09-05  
##  Max.   :41088   Max.   :1115.0   Max.   :7.000   Max.   :2015-09-17  
##                                                                       
##       Open            Promo        StateHoliday SchoolHoliday   
##  Min.   :0.0000   Min.   :0.0000   0:40908      Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:0.0000   a:  180      1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000                Median :0.0000  
##  Mean   :0.8543   Mean   :0.3958                Mean   :0.4435  
##  3rd Qu.:1.0000   3rd Qu.:1.0000                3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000                Max.   :1.0000  
##  NA's   :11                                                     
##     day       month      
##  Sun  :5992   Aug:26536  
##  Mon  :5992   Sep:14552  
##  Tues :5992              
##  Wed  :5992              
##  Thurs:5992              
##  Fri  :5136              
##  Sat  :5992

We see we have to predict sales from August 1, 2015 to September 15, 2015. The kaggle competition says that days when a store is closed will not be counted. Our predictions will be evaluated using root mean square percentage error.

\(\textrm{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}\)

We’ll merge in average sales from the training data. This is because we will use average daily sales from the training data to make predictions for sales for the next 6 weeks.

test <- inner_join(test, select(stores, Store, av_sales), by="Store")

We can use function predict() to predict dsales. Let’s first use our reg3 model which used day of the week, month and state and school holiday as predictors.

test$dsales <- predict(reg3, test)

The above created a vector of predicted percent deviation from average sales. However, we need to predict total sales rather than the percent deviation from average sales. Fortunately, since we know average sales we can calculate total sales by multiplying average sales (av_sales) by the percent deviation from average sales (dsales).

test$Sales <- test$av_sales*(1+test$dsales/100)
write.csv(select(test,Id, Sales), file = "C:/Users/dvorakt/Google Drive/business analytics/labs/lab10/submission.csv", row.names = FALSE)

This submission ranks about 2800.

IN-CLASS EXERCISE: Make a prediction using our reg5 model (it included Promo variable). DOes this model do better in the competition?