Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.
rossmann.df<-read.csv(paste("rossman.csv", sep=""))
View(rossmann.df)
dim(rossmann.df)
## [1] 1115 8
summary(rossmann.df)
## StoreId Sales CustomerCount OpenDays
## Min. : 1.0 Min. : 2114322 Min. : 187583 Min. :592.0
## 1st Qu.: 279.5 1st Qu.: 3949377 1st Qu.: 405391 1st Qu.:776.0
## Median : 558.0 Median : 4990259 Median : 509233 Median :779.0
## Mean : 558.0 Mean : 5267427 Mean : 577616 Mean :757.3
## 3rd Qu.: 836.5 3rd Qu.: 6084148 3rd Qu.: 671544 3rd Qu.:782.0
## Max. :1115.0 Max. :19516842 Max. :3206058 Max. :942.0
##
## CompetitionDistance Promo2C CompetitionDuration PromoDuration
## Min. : 20.0 Min. :0.0000 Min. : 0.000 Min. :0.000
## 1st Qu.: 717.5 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.000
## Median : 2325.0 Median :1.0000 Median : 3.000 Median :1.000
## Mean : 5404.9 Mean :0.5121 Mean : 4.321 Mean :1.657
## 3rd Qu.: 6882.5 3rd Qu.:1.0000 3rd Qu.: 7.000 3rd Qu.:3.000
## Max. :75860.0 Max. :1.0000 Max. :115.000 Max. :6.000
## NA's :3
From the above summary of data printed, we can observe few missing data in CompetitionDistance. We shall assign mean value in to the missing fields
attach(rossmann.df)
targetfield<-which(is.na(CompetitionDistance)) #identifying the location of missing values
print("before correction")
## [1] "before correction"
rossmann.df$CompetitionDistance[targetfield]
## [1] NA NA NA
rossmann.df$CompetitionDistance[targetfield]<-round(mean(CompetitionDistance, na.rm = TRUE))
print("after correction")
## [1] "after correction"
rossmann.df$CompetitionDistance[targetfield]
## [1] 5405 5405 5405
As discussed above we shall investigate the relationship of sales with other variables. First, let us see the distribution of sales.
hist(Sales,
main="Sales distribution across stores",
xlab="Sales",
ylab="Count",
breaks=50,
col="gray", cex.lab=0.7, cex.axis=0.7, cex.main=0.7, cex.sub=0.7)
library(car)
scatterplot(OpenDays, Sales, main= "Scatterplot of Sales vs. Open Days", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter
The above plot does not convey a definit relationship between the Sales and OpenDays
scatterplot(CustomerCount, Sales, main= "Scatterplot of Sales vs. Customer Count", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter
There appears to be a clear positive correlation between the two variables.
boxplot(Sales ~ Promo2C, horizontal=TRUE,
ylab="Promo2C", xlab="Sales", las=1,
main="Sales vs Promo2C")
It appears that the programme for consecutive promotion (Promo2C) is not performing expectedly. Is there a relation between the two?
sales_promo<-aggregate(Sales~Promo2C, list(Promo2C), mean)
sales_promo
## Promo2C Sales
## 1 0 5719747
## 2 1 4836494
t.test(Sales~Promo2C)
##
## Welch Two Sample t-test
##
## data: Sales by Promo2C
## t = 7.6872, df = 961.55, p-value = 3.707e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 657771 1108736
## sample estimates:
## mean in group 0 mean in group 1
## 5719747 4836494
Since, the p value is less than 0.05, we can reject the null hypothesis that difference between means of sales within Promo2c and outside it is zero. Thus we infer that Promo2C program adversely affects Sales. So H1 is true.
scatterplot(CompetitionDistance, Sales, main= "Scatterplot of Sales vs. Competition Distance", cex=0.6, pch=19,spread= FALSE, smoother.args = list(lty=2), data= rossmann.df)
## Warning in plot.window(...): "data" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "data" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "data" is not
## a graphical parameter
## Warning in box(...): "data" is not a graphical parameter
## Warning in title(...): "data" is not a graphical parameter
This plot is not helpful in learning about the relation between the two variables.
round(cor(rossmann.df),2)
## StoreId Sales CustomerCount OpenDays
## StoreId 1.00 0.01 0.04 0.02
## Sales 0.01 1.00 0.85 0.33
## CustomerCount 0.04 0.85 1.00 0.41
## OpenDays 0.02 0.33 0.41 1.00
## CompetitionDistance -0.03 -0.03 -0.14 0.08
## Promo2C 0.01 -0.23 -0.25 -0.32
## CompetitionDuration 0.00 -0.02 -0.01 -0.05
## PromoDuration -0.01 -0.16 -0.21 -0.23
## CompetitionDistance Promo2C CompetitionDuration
## StoreId -0.03 0.01 0.00
## Sales -0.03 -0.23 -0.02
## CustomerCount -0.14 -0.25 -0.01
## OpenDays 0.08 -0.32 -0.05
## CompetitionDistance 1.00 -0.15 -0.02
## Promo2C -0.15 1.00 0.01
## CompetitionDuration -0.02 0.01 1.00
## PromoDuration -0.08 0.80 -0.01
## PromoDuration
## StoreId -0.01
## Sales -0.16
## CustomerCount -0.21
## OpenDays -0.23
## CompetitionDistance -0.08
## Promo2C 0.80
## CompetitionDuration -0.01
## PromoDuration 1.00
round(cor(Sales,OpenDays), 2)
## [1] 0.33
round(cor(Sales,CompetitionDistance), 2)
## [1] NA
round(cor(Sales,CompetitionDuration), 2)
## [1] -0.02
These appear to be weak correlations
library(corrgram)
corrgram(rossmann.df, order = FALSE, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main= "Corrgram of Store Variables")
detach(rossmann.df)
set.seed(100) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(rossmann.df), 0.75*nrow(rossmann.df)) # row indices for training data
trainingData <- rossmann.df[trainingRowIndex, ] # model training data
testData <- rossmann.df[-trainingRowIndex, ] # test data
fit <- lm(Sales ~ CustomerCount + OpenDays + CompetitionDistance + Promo2C+ CompetitionDuration + PromoDuration, data=trainingData) # build the model
summary(fit)
##
## Call:
## lm(formula = Sales ~ CustomerCount + OpenDays + CompetitionDistance +
## Promo2C + CompetitionDuration + PromoDuration, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7403607 -615332 -64510 550352 4703686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.053e+06 4.554e+05 6.705 3.72e-11 ***
## CustomerCount 5.487e+00 1.192e-01 46.045 < 2e-16 ***
## OpenDays -1.395e+03 6.104e+02 -2.286 0.0225 *
## CompetitionDistance 2.427e+01 4.698e+00 5.166 3.00e-07 ***
## Promo2C -2.732e+05 1.183e+05 -2.309 0.0212 *
## CompetitionDuration -3.928e+03 5.346e+03 -0.735 0.4627
## PromoDuration 6.375e+04 2.863e+04 2.226 0.0263 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 977400 on 829 degrees of freedom
## Multiple R-squared: 0.7572, Adjusted R-squared: 0.7555
## F-statistic: 431 on 6 and 829 DF, p-value: < 2.2e-16
The above analysis shows that sales dependent on multimple factors, with Customer Count and Competition Distance being highly significant factors. Other influencing factors include OpenDays, Promo2C and PromoDuration.
PredictSales <- predict(fit, testData)
actuals_preds <- data.frame(cbind(actuals=testData$Sales, predicteds=PredictSales)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy # 84.06$
## actuals predicteds
## actuals 1.0000000 0.8406123
## predicteds 0.8406123 1.0000000
min_max_accuracy <- mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
min_max_accuracy*100
## [1] 86.61408
mape <- mean(abs((actuals_preds$predicteds - actuals_preds$actuals))/actuals_preds$actuals)
mape*100
## [1] 15.38447