Rossmann Store Sales - Exploratory Data Analysis & Forecasting using Linear Regression

INTRODUCTION

In Summer 2020, we, a Chemical Engineering/ Supply Chain major and a Quantitative Economics + Music Performance major teamed up to work on our first-ever data analytics project. While we came from completely different backgrounds, we shared a genuine interest in strengthening our skills in analyzing data and modeling. After spending some time on Kaggle.com, we came across this challenge that fitted both of our academic interests: Rossmann Store Sales (https://www.kaggle.com/c/rossmann-store-sales).

Rossmann is a Germany drug store chain with over 3790 stores in Europe. In this challenge, we are given records of sales of each store on different dates, from 01/01/2013 to 31/07/2015. Sales can be affected by many factors such as holidays, promotions and competitions. Our goal is to explore the correlations between these features using R and utilize the results of this exploratory data analysis to build a linear regression model that predicts the store sales from 08/01/2015 to 09/17/2015, 6 weeks in advance. Our project consist of 3 parts: data cleaning, exploratory data analysis and data modeling.

DATA CLEANSING

Load packages

library(ggplot2)
library(sqldf)

## Warning: package 'sqldf' was built under R version 4.0.2

## Loading required package: gsubfn

## Warning: package 'gsubfn' was built under R version 4.0.2

## Loading required package: proto

## Warning: package 'proto' was built under R version 4.0.2

## Loading required package: RSQLite

## Warning: package 'RSQLite' was built under R version 4.0.2

library(zoo)

## Warning: package 'zoo' was built under R version 4.0.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(reshape2)
library(readr)
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.0.2

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Import train data

train <- read.csv("train.csv")

Show the structure of train data

str(train)

## 'data.frame':    1017209 obs. of  9 variables:
##  $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Date         : chr  "2015-07-31" "2015-07-31" "2015-07-31" "2015-07-31" ...
##  $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
##  $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
##  $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StateHoliday : chr  "0" "0" "0" "0" ...
##  $ SchoolHoliday: int  1 1 1 1 1 1 1 1 1 1 ...

Some data fields should be converted to a more suitable data type for the convenience of the explanatory process.

str(train$Date)

##  chr [1:1017209] "2015-07-31" "2015-07-31" "2015-07-31" "2015-07-31" ...

Change the data type of “Date” from “char” to “date”

train$Date <- as.Date(train$Date,format = "%m/%d/%y")

Factorize categorical data fields in train data

train$DayOfWeek <- as.factor(as.integer(train$DayOfWeek))
train$StateHoliday <- as.factor(as.character(train$StateHoliday))
train$Open <- as.factor(as.integer(train$Open))
train$Promo <- as.factor(as.integer(train$Promo))
train$SchoolHoliday <- as.factor(as.integer(train$SchoolHoliday))

Check the data modification and identify NA cases in the test table

summary(train)

##      Store        DayOfWeek       Date             Sales         Customers     
##  Min.   :   1.0   1:144730   Min.   :NA        Min.   :    0   Min.   :   0.0  
##  1st Qu.: 280.0   2:145664   1st Qu.:NA        1st Qu.: 3727   1st Qu.: 405.0  
##  Median : 558.0   3:145665   Median :NA        Median : 5744   Median : 609.0  
##  Mean   : 558.4   4:145845   Mean   :NA        Mean   : 5774   Mean   : 633.1  
##  3rd Qu.: 838.0   5:145845   3rd Qu.:NA        3rd Qu.: 7856   3rd Qu.: 837.0  
##  Max.   :1115.0   6:144730   Max.   :NA        Max.   :41551   Max.   :7388.0  
##                   7:144730   NA's   :1017209                                   
##  Open       Promo      StateHoliday SchoolHoliday
##  0:172817   0:629129   0:986159     0:835488     
##  1:844392   1:388080   a: 20260     1:181721     
##                        b:  6690                  
##                        c:  4100                  
##                                                  
##                                                  
##

No NULL values found. Now the train data is ready for the analysis process. We will move to cleaning the store data.

Assign column names for store data

colNames <- c ("Store", "StoreType",    "Assortment",   "CompetitionDistance",  
              "CompetitionOpenSinceMonth", "CompetitionOpenSinceYear",
              "PromoContinuation",  "PromoParticipationSinceWeek",  
              "PromoParticipationSinceYear", "PromoInterval")

Import store data

store <- read.table ("store.csv", header = TRUE, sep = ",",
                       strip.white = TRUE, col.names = colNames,
                       na.strings = "?", stringsAsFactors = TRUE)

Show the structure of the store data

str (store)

## 'data.frame':    1115 obs. of  10 variables:
##  $ Store                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ StoreType                  : Factor w/ 4 levels "a","b","c","d": 3 1 1 3 1 1 1 1 1 1 ...
##  $ Assortment                 : Factor w/ 3 levels "a","b","c": 1 1 1 3 1 1 3 1 3 1 ...
##  $ CompetitionDistance        : int  1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
##  $ CompetitionOpenSinceMonth  : int  9 11 12 9 4 12 4 10 8 9 ...
##  $ CompetitionOpenSinceYear   : int  2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
##  $ PromoContinuation          : int  0 1 1 0 0 0 0 0 0 0 ...
##  $ PromoParticipationSinceWeek: int  NA 13 14 NA NA NA NA NA NA NA ...
##  $ PromoParticipationSinceYear: int  NA 2010 2011 NA NA NA NA NA NA NA ...
##  $ PromoInterval              : Factor w/ 4 levels "","Feb,May,Aug,Nov",..: 1 3 3 1 1 1 1 1 1 1 ...

Identify NA cases

table (complete.cases (store))

## 
## FALSE  TRUE 
##   750   365

Take a look at the table summary to identify the NAs

summary(store)

##      Store        StoreType Assortment CompetitionDistance
##  Min.   :   1.0   a:602     a:593      Min.   :   20.0    
##  1st Qu.: 279.5   b: 17     b:  9      1st Qu.:  717.5    
##  Median : 558.0   c:148     c:513      Median : 2325.0    
##  Mean   : 558.0   d:348                Mean   : 5404.9    
##  3rd Qu.: 836.5                        3rd Qu.: 6882.5    
##  Max.   :1115.0                        Max.   :75860.0    
##                                        NA's   :3          
##  CompetitionOpenSinceMonth CompetitionOpenSinceYear PromoContinuation
##  Min.   : 1.000            Min.   :1900             Min.   :0.0000   
##  1st Qu.: 4.000            1st Qu.:2006             1st Qu.:0.0000   
##  Median : 8.000            Median :2010             Median :1.0000   
##  Mean   : 7.225            Mean   :2009             Mean   :0.5121   
##  3rd Qu.:10.000            3rd Qu.:2013             3rd Qu.:1.0000   
##  Max.   :12.000            Max.   :2015             Max.   :1.0000   
##  NA's   :354               NA's   :354                               
##  PromoParticipationSinceWeek PromoParticipationSinceYear          PromoInterval
##  Min.   : 1.0                Min.   :2009                                :544  
##  1st Qu.:13.0                1st Qu.:2011                Feb,May,Aug,Nov :130  
##  Median :22.0                Median :2012                Jan,Apr,Jul,Oct :335  
##  Mean   :23.6                Mean   :2012                Mar,Jun,Sept,Dec:106  
##  3rd Qu.:37.0                3rd Qu.:2013                                      
##  Max.   :50.0                Max.   :2015                                      
##  NA's   :544                 NA's   :544

Replace the NAs in Competition Distance by its median

store$CompetitionDistance[is.na(store$CompetitionDistance)] <- median(store$CompetitionDistance, na.rm=TRUE)

Replace the remaining NA’s by 0

store[is.na(store)] <- 0

Factorize categorical data fields in store data

store$Store <- as.factor(as.integer(store$Store))
store$CompetitionOpenSinceYear <- as.factor(as.integer(store$CompetitionOpenSinceYear))
store$CompetitionOpenSinceMonth <- as.factor(as.integer(store$CompetitionOpenSinceMonth))
store$PromoContinuation <- as.factor(as.integer(store$PromoContinuation))
store$PromoParticipationSinceWeek <- as.factor(as.integer(store$PromoParticipationSinceWeek))
store$PromoParticipationSinceYear <- as.factor(as.integer(store$PromoParticipationSinceYear))

Double check the store’s summary

summary(store)

##      Store      StoreType Assortment CompetitionDistance
##  1      :   1   a:602     a:593      Min.   :   20      
##  2      :   1   b: 17     b:  9      1st Qu.:  720      
##  3      :   1   c:148     c:513      Median : 2325      
##  4      :   1   d:348                Mean   : 5397      
##  5      :   1                        3rd Qu.: 6875      
##  6      :   1                        Max.   :75860      
##  (Other):1109                                           
##  CompetitionOpenSinceMonth CompetitionOpenSinceYear PromoContinuation
##  0      :354               0      :354              0:544            
##  9      :125               2013   : 83              1:571            
##  4      : 94               2012   : 82                               
##  11     : 92               2014   : 70                               
##  3      : 70               2005   : 62                               
##  7      : 67               2010   : 55                               
##  (Other):313               (Other):409                               
##  PromoParticipationSinceWeek PromoParticipationSinceYear          PromoInterval
##  0      :544                 0      :544                                 :544  
##  14     : 81                 2011   :128                 Feb,May,Aug,Nov :130  
##  40     : 77                 2013   :120                 Jan,Apr,Jul,Oct :335  
##  31     : 44                 2014   : 95                 Mar,Jun,Sept,Dec:106  
##  10     : 42                 2012   : 81                                       
##  5      : 39                 2009   : 73                                       
##  (Other):288                 (Other): 74

EXPLORATION

Join train and store tables to further explore other correlations between the data fields of 2 tables.

train_store <- merge(train, store, by = "Store")

For graph to display number in full (E.g. 1000000 instead of 10e6)

options("scipen" = 10)

For our exploratory data analysis, we will look into the relationship between Sales and other data fields.

Sales vs Store

Here we made a vector containing the mean sales of 1115 stores

MeanSalesPerStore <- vector(mode = "numeric",length = 1115)
for (i in 1:1115) {
  MeanSalesPerStore[i] <- mean(train_store$Sales[train_store$Store==i])
}

hist(MeanSalesPerStore,xlab="Sales (€)")

summary(MeanSalesPerStore)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2245    4412    5459    5763    6634   20719

Sales vary from stores to stores. The outliers are stores with over €20719 in mean sales.

Sales vs DayOfWeek

boxplot(Sales ~ DayOfWeek,data=train_store)

Sales of day 7 of the week (Sunday) is extremely low compared to other dates.

We will take a closer look at Day 7.

Day7Sales <- subset(train_store,DayOfWeek==7)
summary(Day7Sales)

##      Store        DayOfWeek       Date            Sales        
##  Min.   :   1.0   1:     0   Min.   :NA       Min.   :    0.0  
##  1st Qu.: 280.0   2:     0   1st Qu.:NA       1st Qu.:    0.0  
##  Median : 558.0   3:     0   Median :NA       Median :    0.0  
##  Mean   : 558.4   4:     0   Mean   :NA       Mean   :  204.2  
##  3rd Qu.: 838.0   5:     0   3rd Qu.:NA       3rd Qu.:    0.0  
##  Max.   :1115.0   6:     0   Max.   :NA       Max.   :37376.0  
##                   7:144730   NA's   :144730                    
##    Customers       Open       Promo      StateHoliday SchoolHoliday StoreType
##  Min.   :   0.00   0:141137   0:144730   0:144421     0:142006      a:78484  
##  1st Qu.:   0.00   1:  3593   1:     0   a:   309     1:  2724      b: 2252  
##  Median :   0.00                         b:     0                   c:19468  
##  Mean   :  35.79                         c:     0                   d:44526  
##  3rd Qu.:   0.00                                                             
##  Max.   :5145.00                                                             
##                                                                              
##  Assortment CompetitionDistance CompetitionOpenSinceMonth
##  a:76472    Min.   :   20       0      :46006            
##  b: 1180    1st Qu.:  710       9      :16256            
##  c:67078    Median : 2325       4      :12388            
##             Mean   : 5422       11     :12016            
##             3rd Qu.: 6880       3      : 9042            
##             Max.   :75860       7      : 8458            
##                                 (Other):40564            
##  CompetitionOpenSinceYear PromoContinuation PromoParticipationSinceWeek
##  0      :46006            0:72272           0      :72272              
##  2013   :10732            1:72458           14     :10386              
##  2012   :10572                              40     : 8914              
##  2014   : 9068                              31     : 5688              
##  2005   : 8048                              10     : 5524              
##  2010   : 7292                              5      : 5096              
##  (Other):53012                              (Other):36850              
##  PromoParticipationSinceYear          PromoInterval  
##  0      :72272                               :72272  
##  2011   :16372               Feb,May,Aug,Nov :16874  
##  2013   :15716               Jan,Apr,Jul,Oct :41718  
##  2014   :11378               Mar,Jun,Sept,Dec:13866  
##  2012   :10412                                       
##  2009   : 9288                                       
##  (Other): 9292

Look at “Sales” data, we see that most stores have 0 sales on Sunday, and this is because 97.5% (141137/144730) of the records indicated that the stores were closed on those dates.

Now we will check if the stores that were open on Day 7 had sales or not.

summary(subset(Day7Sales,Open==1,select=c(Sales)))

##      Sales      
##  Min.   :  286  
##  1st Qu.: 3314  
##  Median : 6876  
##  Mean   : 8225  
##  3rd Qu.:11418  
##  Max.   :37376

All stores that were opened on Day 7 had sales.

boxplot(Sales ~ DayOfWeek,data=train_store[train_store$Sales!=0,])

Sales of stores that were opened on Sunday are higher than weekdays. We will use this insight to check whether other factors affect the high sales.

Holiday might be a factor of store closure, so we will check that as follow:

sqldf("select Open, sum(StateHoliday), sum(SchoolHoliday) from Day7Sales group by Open")

##   Open sum(StateHoliday) sum(SchoolHoliday)
## 1    0                 0               2642
## 2    1                 0                 82

On all Day 7 records, no stores open on State Holiday. 2642 records that was on School Holiday indicated that the store was closed, while in total we have 141137 closed stores. Therefore, holidays are not a strong factor of store closure like we assumed.

Sales vs Date

nrow(unique(train["Date"]))

## [1] 1

There’s 942 different dates in the train table.

ggplot(train_store, aes(x=Date,y=Sales)) + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1017209 rows containing non-finite values (stat_smooth).

Sales increased from 2013 to 2015, with fluctuations. Sales tend to decrease mid-year and then took off again at the end of the year.

Sales vs Customers

ggplot(train_store, aes(x=Date,y=Customers)) + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1017209 rows containing non-finite values (stat_smooth).

Customers increased from 2013 to 2015.

We can see that the Customers’ graph shows a similar trend as Sales. We will check for correlation between Sales and Customers.

Linear <- lm(Sales ~ Customers, data=train_store)
summary(Linear)

## 
## Call:
## lm(formula = Sales ~ Customers, data = train_store)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28685.0  -1077.7   -253.6    882.4  27735.9 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 1077.736503    2.882656   373.9   <2e-16 ***
## Customers      7.417062    0.003671  2020.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1720 on 1017207 degrees of freedom
## Multiple R-squared:  0.8005, Adjusted R-squared:  0.8005 
## F-statistic: 4.082e+06 on 1 and 1017207 DF,  p-value: < 2.2e-16

Customers and Sales are strongly correlated (Adjusted R-squared: 0.8005)

Sales vs StateHoliday

boxplot(Sales ~ StateHoliday,data=train_store)

Sales are significantly lower on holidays. –> Strong predictor value.

Now we want to see which holidays had the most sales.

OpenOnHoliday <- subset(train_store,Open==1)
mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="a"])

## [1] 8487.471

mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="b"])

## [1] 9887.89

mean(OpenOnHoliday$Sales[OpenOnHoliday$StateHoliday=="c"])

## [1] 9743.746

Within 910 stores that are open on holiday, Easter (b) saw the highest sale (9887.89), slightly higher than Christmas (c) (9743.746) and significantly higher than public holidays (a) (8487.471).

Sales vs SchoolHoliday

boxplot(Sales ~ SchoolHoliday,data=train_store)

Sales on School Holiday were just slightly higher than that of non-School Holiday. Therefore, School Holiday is not a strong predictor value.

Sales vs StoreType

boxplot(Sales ~ StoreType,data=train_store)

Type b has the highest mean sales.

Distribution of each assortment

boxplot(Sales ~ Assortment,data=train_store)

We can see that assortment b had the best sales among three assortments. Even its average sales is higher than the others’ average sales.

Determine the sales of each assortment by dates

ggplot(train_store["Sales" != 0], 
       aes(x = as.Date(Date), y = Sales, color = Assortment)) + 
  geom_smooth(size = 1.5) + xlab("Date")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1017209 rows containing non-finite values (stat_smooth).

Assortment b always had higher sales than the other assortments whereas c and a’s performances shared the same shape

Customers per assortment

cust_a <- sum(train_store$Customers[train_store$Assortment == "a"])
cust_b <- sum(train_store$Customers[train_store$Assortment == "b"])
cust_c <- sum(train_store$Customers[train_store$Assortment == "c"])
barplot(c(cust_a,cust_b,cust_c), main = "Customers per assortment", names.arg = c("a","b","c"))

Furthermore, the number of customers who bought b assortment is extremely low compare to other assortments’. Hence, b assortment could be a totally different type of product while a and c might be related to each other. Based on the sales trend of b, there are 2 assumptions considered. First, although having a low number of customers, b assortment could be a product type that could be bought with a large amount. Second, the price of b assortment is much more higher than those of a and c. These are interesting insights than can be reffered to later in the exploration process.

We move on to the correlation between Competition Distance and Sales on open days

CDopenday <- sqldf("Select CompetitionDistance, avg(Sales) as AvgSales from train_store where Open = 1 group by store")
CDmodel = lm(AvgSales ~ CompetitionDistance, data = CDopenday)
CDmodelsum = summary(CDmodel)
plot(AvgSales ~ CompetitionDistance,CDopenday)
abline(CDmodel, col = 'blue')
legend("topright", bty = "n", legend = paste("R2 =", format(CDmodelsum$adj.r.squared, digits = 4)))

There is nearly no correlation between sales and competition distance. Competition Distance may not be considered in the model later

Let’s see which month and year these competition opened

CompeteYear <- sqldf("select CompetitionOpenSinceYear as SinceYear, log(sum(Sales)) as Sales, log(count(CompetitionOpenSinceYear)) as CompetitionYearOpenFrequency from train_store where CompetitionOpenSinceYear <> 0 group by CompetitionOpenSinceYear")
CompeteYear$SinceYear <- as.numeric(as.character(CompeteYear$SinceYear))
CompeteYear <- melt(CompeteYear,id = "SinceYear")
ggplot(data=CompeteYear, aes(x = SinceYear, y= value, colour = variable), xlab="Since Year") + geom_line(size=1)

Sales slightly followed the trend of the number of competitions open from 1900 until 2015. However, the change of number of competitor do not affect much the sales of Rossmann stores. The sales could be impacted by the other elements.

Days since start of Promo2 (PromoContinuation)

Promo2Year <- sqldf("select PromoParticipationSinceYear as PromoSinceYear, log(sum(Sales)) as Sales, log(count(PromoParticipationSinceYear)) as PromoYearFrequency from train_store where PromoSinceYear <> 0 group by PromoSinceYear")
Promo2Year$PromoSinceYear <- as.numeric(as.character(Promo2Year$PromoSinceYear))
Promo2Year <- melt(Promo2Year,id = "PromoSinceYear")
ggplot(data=Promo2Year, aes(x = PromoSinceYear, y= value, colour = variable), xlab="Promo Since Year") + geom_line(size=1)

PromoContinuation vs Sales

boxplot(Sales ~ PromoContinuation, data = train_store,
        main = "Sales based on the PromoContinuation",
        xlab = "PromoContinuation", ylab = "Sales", col = "yellow")

Sales when having a 2nd Promo were less than without a 2nd Promo but it is not significant. The reason for this trend may because the 2nd Promo was not as effective as the first promo.

Since there is 0 sales on closed days, I want to specifically look at the PromoContinuation data on Open days. The number of two categories of PromoContinuation are nearly equal, which is good for the comparison of sales between the two.

row_to_keep = which(as.integer(train_store$Open) > 0)
openday <- train_store[row_to_keep,]

We compare sales between promo day and not promo day

ggplot(openday["Sales" != 0], 
       aes(x = as.Date(Date), y = Sales, color = factor(Promo))) + 
  geom_smooth(size = 1.5) + xlab("Date")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1017209 rows containing non-finite values (stat_smooth).

promoY <- mean(train_store$Sales[train_store$Promo == 1])
promoN <- mean(train_store$Sales[train_store$Promo == 0])

barplot(c(promoY,promoN), main = "Average sales per Promo", names.arg = c("1","0"))

The graph follow the sales trend where sales dropped midyear and increase at the end of the year.Sales nearly doubled when there was a promo on that day. This is another trend that shoud be taken into consideration.

Determine the sales of each PromoInterval

IntervalsOnly <- subset(train_store, PromoInterval == 'Feb,May,Aug,Nov'| PromoInterval == 'Jan,Apr,Jul,Oct'|PromoInterval == 'Mar,Jun,Sept,Dec')
boxplot(Sales ~ PromoInterval, data = IntervalsOnly,
        main = "Sales based on the Promo Interval",
        xlab = "PromoInterval", ylab = "Sales", col = "blue")

Overall, all intervals share relatively same mean, quartiles, and minimum and maximum values. However, The “Feb,May,Aug,Nov” interval had the highest outlier

I will take a closer look at that row

sqldf("SELECT * FROM IntervalsOnly WHERE PromoInterval = 'Feb,May,Aug,Nov' ORDER BY Sales Desc LIMIT 1")

##   Store DayOfWeek Date Sales Customers Open Promo StateHoliday SchoolHoliday
## 1   909         1 <NA> 41551      1721    1     0            0             0
##   StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
## 1         a          c                1680                         0
##   CompetitionOpenSinceYear PromoContinuation PromoParticipationSinceWeek
## 1                        0                 1                          45
##   PromoParticipationSinceYear   PromoInterval
## 1                        2009 Feb,May,Aug,Nov

It can be clearly seen that this is the highest sales from the data. It is interesting that this store is a type a store, which sales performance wasn’t as outstanding as store type b and it sold assortment c, which did not contribute high sales as high as assortment b. However, it follows the 4 trends estabished, which futher support their importance to the model. Therefore, PromoInterval did not have significant impact on Sales.

DATA MODELING - LINEAR REGRESSION

Import test data

test <- read.csv("test.csv")
test_store <- merge(test,store,by="Store")

Split train data: We will first test it with the train data first. We will split the train data so that 70% is used for modeling and 30% is used for testing.

index = sample(2,nrow(train_store),replace = TRUE, prob=c(0.7,0.3))
train70 <- train_store[index==1,]
train30 <- train_store[index==2,]

Build the model

train_store$Promo <- as.integer(as.factor(train_store$Promo))
train_store$DayOfWeek <- as.integer(as.factor(train_store$DayOfWeek))
train_store$Open <- as.integer(as.factor(train_store$Open))
lrMod <- lm(Sales ~ Store + Open + SchoolHoliday + PromoInterval + DayOfWeek + StateHoliday + DayOfWeek*StateHoliday + Promo + StoreType + Assortment + StoreType*Assortment + PromoContinuation, data=train70)

Use interactive terms: StoreType vs Assortment

pred <- predict(lrMod, train30)

## Warning in predict.lm(lrMod, train30): prediction from a rank-deficient fit may
## be misleading

summary (lrMod)

## 
## Call:
## lm(formula = Sales ~ Store + Open + SchoolHoliday + PromoInterval + 
##     DayOfWeek + StateHoliday + DayOfWeek * StateHoliday + Promo + 
##     StoreType + Assortment + StoreType * Assortment + PromoContinuation, 
##     data = train70)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9690.2 -1498.7  -224.1   890.6 28968.6 
## 
## Coefficients: (13 not defined because of singularities)
##                                   Estimate   Std. Error  t value Pr(>|t|)    
## (Intercept)                     241.605892    40.781132    5.924 3.13e-09 ***
## Store                             0.063787     0.009259    6.889 5.63e-12 ***
## Open1                          6521.426202    39.307740  165.907  < 2e-16 ***
## SchoolHoliday1                  232.152936     8.250166   28.139  < 2e-16 ***
## PromoIntervalFeb,May,Aug,Nov   -642.873125     9.714096  -66.179  < 2e-16 ***
## PromoIntervalJan,Apr,Jul,Oct   -437.424205     6.983496  -62.637  < 2e-16 ***
## PromoIntervalMar,Jun,Sept,Dec  -955.277959    10.545508  -90.586  < 2e-16 ***
## DayOfWeek2                    -1060.002302    11.286509  -93.918  < 2e-16 ***
## DayOfWeek3                    -1415.594044    11.328123 -124.963  < 2e-16 ***
## DayOfWeek4                    -1414.079377    11.476563 -123.215  < 2e-16 ***
## DayOfWeek5                    -1028.727818    11.395738  -90.273  < 2e-16 ***
## DayOfWeek6                    -1010.517741    12.127617  -83.324  < 2e-16 ***
## DayOfWeek7                     -323.861242    40.100131   -8.076 6.69e-16 ***
## StateHolidaya                  -496.190058    62.955097   -7.882 3.24e-15 ***
## StateHolidayb                  -531.574655    65.060691   -8.170 3.08e-16 ***
## StateHolidayc                   463.765566   106.099751    4.371 1.24e-05 ***
## Promo1                         2241.243587     7.136185  314.067  < 2e-16 ***
## StoreTypeb                     4317.281090    37.568832  114.917  < 2e-16 ***
## StoreTypec                      269.577255    12.396567   21.746  < 2e-16 ***
## StoreTyped                        6.061772    10.282032    0.590 0.555493    
## Assortmentb                   -2233.820516    49.359445  -45.256  < 2e-16 ***
## Assortmentc                     880.528561     8.386264  104.997  < 2e-16 ***
## PromoContinuation1                      NA           NA       NA       NA    
## DayOfWeek2:StateHolidaya        577.970972    93.482492    6.183 6.31e-10 ***
## DayOfWeek3:StateHolidaya        398.966714    79.838848    4.997 5.82e-07 ***
## DayOfWeek4:StateHolidaya        751.510050    59.434789   12.644  < 2e-16 ***
## DayOfWeek5:StateHolidaya       -366.418416    76.060501   -4.817 1.45e-06 ***
## DayOfWeek6:StateHolidaya       1357.807463   154.051795    8.814  < 2e-16 ***
## DayOfWeek7:StateHolidaya        674.873238   178.556202    3.780 0.000157 ***
## DayOfWeek2:StateHolidayb                NA           NA       NA       NA    
## DayOfWeek3:StateHolidayb                NA           NA       NA       NA    
## DayOfWeek4:StateHolidayb                NA           NA       NA       NA    
## DayOfWeek5:StateHolidayb      -1158.511470    74.377681  -15.576  < 2e-16 ***
## DayOfWeek6:StateHolidayb                NA           NA       NA       NA    
## DayOfWeek7:StateHolidayb                NA           NA       NA       NA    
## DayOfWeek2:StateHolidayc                NA           NA       NA       NA    
## DayOfWeek3:StateHolidayc        405.096637   133.828124    3.027 0.002470 ** 
## DayOfWeek4:StateHolidayc        393.207383   119.434939    3.292 0.000994 ***
## DayOfWeek5:StateHolidayc                NA           NA       NA       NA    
## DayOfWeek6:StateHolidayc                NA           NA       NA       NA    
## DayOfWeek7:StateHolidayc                NA           NA       NA       NA    
## StoreTypeb:Assortmentb                  NA           NA       NA       NA    
## StoreTypec:Assortmentb                  NA           NA       NA       NA    
## StoreTyped:Assortmentb                  NA           NA       NA       NA    
## StoreTypeb:Assortmentc         6025.889781   104.537029   57.644  < 2e-16 ***
## StoreTypec:Assortmentc         -651.491981    18.272482  -35.654  < 2e-16 ***
## StoreTyped:Assortmentc         -403.683663    13.933376  -28.972  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2506 on 712108 degrees of freedom
## Multiple R-squared:  0.5765, Adjusted R-squared:  0.5765 
## F-statistic: 2.937e+04 on 33 and 712108 DF,  p-value: < 2.2e-16

Calculating Prediction Accuracy

actuals_preds <- data.frame(cbind(actuals=train30$Sales, predicteds=pred))
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy

##              actuals predicteds
## actuals    1.0000000  0.7608356
## predicteds 0.7608356  1.0000000

Apply the model to the test data

lrMod <- lm(Sales ~ Store + Open + DayOfWeek + StateHoliday + DayOfWeek*StateHoliday + Promo + StoreType + Assortment + StoreType*Assortment + PromoContinuation, data=train_store)  
pred <- predict(lrMod, test_store)

## Warning in predict.lm(lrMod, test_store): prediction from a rank-deficient fit
## may be misleading

summary (lrMod)

## 
## Call:
## lm(formula = Sales ~ Store + Open + DayOfWeek + StateHoliday + 
##     DayOfWeek * StateHoliday + Promo + StoreType + Assortment + 
##     StoreType * Assortment + PromoContinuation, data = train_store)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10222  -1549   -234    844  34946 
## 
## Coefficients: (3 not defined because of singularities)
##                             Estimate   Std. Error  t value   Pr(>|t|)    
## (Intercept)             -6597.120608    23.710042 -278.242    < 2e-16 ***
## Store                       0.066140     0.007836    8.441    < 2e-16 ***
## Open                     5455.392921     9.100699  599.448    < 2e-16 ***
## DayOfWeek                -150.234801     1.661733  -90.409    < 2e-16 ***
## StateHolidaya           -1148.737127    48.051798  -23.906    < 2e-16 ***
## StateHolidayb            -480.888492    57.518090   -8.361    < 2e-16 ***
## StateHolidayc           -1016.769481   225.924243   -4.500 0.00000678 ***
## Promo                    2108.394634     5.732378  367.805    < 2e-16 ***
## StoreTypeb               4429.688196    31.604613  140.160    < 2e-16 ***
## StoreTypec                255.078421    10.463406   24.378    < 2e-16 ***
## StoreTyped                -12.764331     8.695770   -1.468    0.14214    
## Assortmentb             -2199.880656    41.886083  -52.521    < 2e-16 ***
## Assortmentc               857.771487     7.065528  121.402    < 2e-16 ***
## PromoContinuation1       -584.858257     5.063375 -115.508    < 2e-16 ***
## DayOfWeek:StateHolidaya   -13.025921    12.603593   -1.034    0.30137    
## DayOfWeek:StateHolidayb  -364.213134    15.700671  -23.197    < 2e-16 ***
## DayOfWeek:StateHolidayc   148.829924    56.147992    2.651    0.00803 ** 
## StoreTypeb:Assortmentb            NA           NA       NA         NA    
## StoreTypec:Assortmentb            NA           NA       NA         NA    
## StoreTyped:Assortmentb            NA           NA       NA         NA    
## StoreTypeb:Assortmentc   6047.237278    88.624642   68.234    < 2e-16 ***
## StoreTypec:Assortmentc   -692.027213    15.433317  -44.840    < 2e-16 ***
## StoreTyped:Assortmentc   -373.125543    11.782339  -31.668    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2536 on 1017189 degrees of freedom
## Multiple R-squared:  0.566,  Adjusted R-squared:  0.566 
## F-statistic: 6.983e+04 on 19 and 1017189 DF,  p-value: < 2.2e-16

Save the predicted sales to file .csv

sales_forecast <- data.frame(Id=test_store$Id, Sales=pred)
write.csv(sales_forecast,"LinearRegressionRossmann.csv")

CONCLUSION

We successfully generated sales predictions for the Rossmann store chain 6 weeks in advance using linear regression. While conducting exploratory data analysis for modeling, we found many interesting insights. Although many stores were closed on Sunday, those that were opened saw a great amount of sales revenue. Assortment b had the highest sales, but it was only available in tybe-b stores among 4 types of store in total. We hope that these findings will be helpful for business decision-makers to optimize their profits.

Rossmann Store Sales - Exploratory Data Analysis & Forecasting using Linear Regression

Nhi Huynh & Ngoc Ngo

8/24/2020