Introduction

Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

Overview of the data variables

Read your dataset in R and visualize the length and breadth of your dataset.

rossmann.df<-read.csv(paste("rossman.csv", sep=""))
View(rossmann.df)
dim(rossmann.df)
## [1] 1115    8

Create a descriptive statistics (min, max, median etc) of each variable.

summary(rossmann.df)
##     StoreId           Sales          CustomerCount        OpenDays    
##  Min.   :   1.0   Min.   : 2114322   Min.   : 187583   Min.   :592.0  
##  1st Qu.: 279.5   1st Qu.: 3949377   1st Qu.: 405391   1st Qu.:776.0  
##  Median : 558.0   Median : 4990259   Median : 509233   Median :779.0  
##  Mean   : 558.0   Mean   : 5267427   Mean   : 577616   Mean   :757.3  
##  3rd Qu.: 836.5   3rd Qu.: 6084148   3rd Qu.: 671544   3rd Qu.:782.0  
##  Max.   :1115.0   Max.   :19516842   Max.   :3206058   Max.   :942.0  
##                                                                       
##  CompetitionDistance    Promo2C       CompetitionDuration PromoDuration  
##  Min.   :   20.0     Min.   :0.0000   Min.   :  0.000     Min.   :0.000  
##  1st Qu.:  717.5     1st Qu.:0.0000   1st Qu.:  0.000     1st Qu.:0.000  
##  Median : 2325.0     Median :1.0000   Median :  3.000     Median :1.000  
##  Mean   : 5404.9     Mean   :0.5121   Mean   :  4.321     Mean   :1.657  
##  3rd Qu.: 6882.5     3rd Qu.:1.0000   3rd Qu.:  7.000     3rd Qu.:3.000  
##  Max.   :75860.0     Max.   :1.0000   Max.   :115.000     Max.   :6.000  
##  NA's   :3

Tidying up the data

From the above summary of data printed, we can observe few missing data in CompetitionDistance. We shall assign mean value in to the missing fields

attach(rossmann.df)
targetfield<-which(is.na(CompetitionDistance)) #identifying the location of missing values
print("before correction")
## [1] "before correction"
rossmann.df$CompetitionDistance[targetfield]
## [1] NA NA NA
rossmann.df$CompetitionDistance[targetfield]<-round(mean(CompetitionDistance, na.rm = TRUE))
print("after correction")
## [1] "after correction"
rossmann.df$CompetitionDistance[targetfield]
## [1] 5405 5405 5405

Exploring the data

As discussed above we shall investigate the relationship of sales with other variables. First, let us see the distribution of sales.

hist(Sales, 
      main="Sales distribution across stores",
      xlab="Sales",
      ylab="Count",
      breaks=50,
      col="gray", cex.lab=0.7, cex.axis=0.7, cex.main=0.7, cex.sub=0.7)

Sales vs. OpenDays

library(car)
scatterplot(OpenDays, Sales, main= "Scatterplot of Sales vs. Open Days", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

The above plot does not convey a definit relationship between the Sales and OpenDays

Sales vs. Customer Count

scatterplot(CustomerCount, Sales, main= "Scatterplot of Sales vs. Customer Count", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

There appears to be a clear positive correlation between the two variables.

Sales vs. Promotion

boxplot(Sales ~ Promo2C, horizontal=TRUE,
     ylab="Promo2C", xlab="Sales", las=1,
     main="Sales vs Promo2C")

It appears that the programme for consecutive promotion (Promo2C) is not performing expectedly. Is there a relation between the two?

t- test of Sales vs Promo2C

H1: Sales are getting affected by Promo2C program
sales_promo<-aggregate(Sales~Promo2C, list(Promo2C), mean)
sales_promo
##   Promo2C   Sales
## 1       0 5719747
## 2       1 4836494
t.test(Sales~Promo2C)
## 
##  Welch Two Sample t-test
## 
## data:  Sales by Promo2C
## t = 7.6872, df = 961.55, p-value = 3.707e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   657771 1108736
## sample estimates:
## mean in group 0 mean in group 1 
##         5719747         4836494

Since, the p value is less than 0.05, we can reject the null hypothesis that difference between means of sales within Promo2c and outside it is zero. Thus we infer that Promo2C program adversely affects Sales. So H1 is true.

Sales vs. Competition Distance

scatterplot(CompetitionDistance, Sales, main= "Scatterplot of Sales vs. Competition Distance", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter

This plot is not helpful in learning about the relation between the two variables.

Correlations

round(cor(rossmann.df),2)
##                     StoreId Sales CustomerCount OpenDays
## StoreId                1.00  0.01          0.04     0.02
## Sales                  0.01  1.00          0.85     0.33
## CustomerCount          0.04  0.85          1.00     0.41
## OpenDays               0.02  0.33          0.41     1.00
## CompetitionDistance   -0.03 -0.03         -0.14     0.08
## Promo2C                0.01 -0.23         -0.25    -0.32
## CompetitionDuration    0.00 -0.02         -0.01    -0.05
## PromoDuration         -0.01 -0.16         -0.21    -0.23
##                     CompetitionDistance Promo2C CompetitionDuration
## StoreId                           -0.03    0.01                0.00
## Sales                             -0.03   -0.23               -0.02
## CustomerCount                     -0.14   -0.25               -0.01
## OpenDays                           0.08   -0.32               -0.05
## CompetitionDistance                1.00   -0.15               -0.02
## Promo2C                           -0.15    1.00                0.01
## CompetitionDuration               -0.02    0.01                1.00
## PromoDuration                     -0.08    0.80               -0.01
##                     PromoDuration
## StoreId                     -0.01
## Sales                       -0.16
## CustomerCount               -0.21
## OpenDays                    -0.23
## CompetitionDistance         -0.08
## Promo2C                      0.80
## CompetitionDuration         -0.01
## PromoDuration                1.00

some curious correlations

round(cor(Sales,OpenDays), 2)
## [1] 0.33
round(cor(Sales,CompetitionDistance), 2)
## [1] NA
round(cor(Sales,CompetitionDuration), 2)
## [1] -0.02

These appear to be weak correlations

A corrgram plot is in order

library(corrgram)
corrgram(rossmann.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of Store Variables")

detach(rossmann.df)

Regression Analysis

Training set with 75% data

set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(rossmann.df), 0.75*nrow(rossmann.df))  # row indices for training data
trainingData <- rossmann.df[trainingRowIndex, ]  # model training data
testData  <- rossmann.df[-trainingRowIndex, ]   # test data

Modelling with the training data set

fit <- lm(Sales ~ CustomerCount + OpenDays + CompetitionDistance + Promo2C+ CompetitionDuration + PromoDuration, data=trainingData)  # build the model
summary(fit)
## 
## Call:
## lm(formula = Sales ~ CustomerCount + OpenDays + CompetitionDistance + 
##     Promo2C + CompetitionDuration + PromoDuration, data = trainingData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -7403607  -615332   -64510   550352  4703686 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.053e+06  4.554e+05   6.705 3.72e-11 ***
## CustomerCount        5.487e+00  1.192e-01  46.045  < 2e-16 ***
## OpenDays            -1.395e+03  6.104e+02  -2.286   0.0225 *  
## CompetitionDistance  2.427e+01  4.698e+00   5.166 3.00e-07 ***
## Promo2C             -2.732e+05  1.183e+05  -2.309   0.0212 *  
## CompetitionDuration -3.928e+03  5.346e+03  -0.735   0.4627    
## PromoDuration        6.375e+04  2.863e+04   2.226   0.0263 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 977400 on 829 degrees of freedom
## Multiple R-squared:  0.7572, Adjusted R-squared:  0.7555 
## F-statistic:   431 on 6 and 829 DF,  p-value: < 2.2e-16

The above analysis shows that sales dependent on multimple factors, with Customer Count and Competition Distance being highly significant factors. Other influencing factors include OpenDays, Promo2C and PromoDuration.

Testing the model on testdata

PredictSales <- predict(fit, testData)

Prediction accuracy

Correlation accuracy
actuals_preds <- data.frame(cbind(actuals=testData$Sales, predicteds=PredictSales))  # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy # 84.06$
##              actuals predicteds
## actuals    1.0000000  0.8406123
## predicteds 0.8406123  1.0000000
Min_Max accuracy
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))  
min_max_accuracy*100
## [1] 86.61408
Mean absolute percentage error
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape*100
## [1] 15.38447

Conclusion

  1. Rossmann stores should focus on opening and operating stores with the aim of increasing customer count. The associated factors could be visibility, population density etc.
  2. Competition’s nearness adversely affects sales and thus the location of store should accordingly be chosen.
  3. The Promo2C program needs a revision, as it is not yielding expected results. Although it is a weak variable influencing sales, the stores that signed up for it are not performing as expected. There may be other factors working against them such as staff training, customer satisfaction etc. Such data needs to be analysed and promotion program or other offers should thereupon be designed.