The objective of our project is to develop a statistical model based on the available dataset. We use historical sales data for 45 Walmart stores located in different regions to predict the weekly sales for each store. Walmart runs several promotional markdown events throughout the year, and these markdowns precede prominent holidays.
## [1] "C:/Users/Sreejith Nair/Google Drive/UC/Course/Data Analytics Methods BANA 7038/Final Project"
#Loading required packages: MASS for boxcox, leaps for subset selection, dplyr/tidyr/ggplot2 for wrangling and plots
library(MASS)
library(leaps)
library(dplyr)
library(tidyr)
library(ggplot2)
#Reading dataset from csv files
#Features dataset consists of Unemployment, Fuel Price, Markdowns, CPI information
features <- read.csv('features.csv')
head(features)
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3
## 1 1 2010-02-05 42.31 2.572 NA NA NA
## 2 1 2010-02-12 38.51 2.548 NA NA NA
## 3 1 2010-02-19 39.93 2.514 NA NA NA
## 4 1 2010-02-26 46.63 2.561 NA NA NA
## 5 1 2010-03-05 46.50 2.625 NA NA NA
## 6 1 2010-03-12 57.79 2.667 NA NA NA
## MarkDown4 MarkDown5 CPI Unemployment IsHoliday
## 1 NA NA 211.0964 8.106 FALSE
## 2 NA NA 211.2422 8.106 TRUE
## 3 NA NA 211.2891 8.106 FALSE
## 4 NA NA 211.3196 8.106 FALSE
## 5 NA NA 211.3501 8.106 FALSE
## 6 NA NA 211.3806 8.106 FALSE
#Date split into Year, Month and Day.
#Retrieving Week# from Date
features$Date <- as.Date(features$Date, format = "%Y-%m-%d")
features$Week <- format(features$Date, "%V")
features <- separate(features, "Date", c("Year", "Month", "Day"), sep = "-")
train <- read.csv("train.csv")
#Dropping the Department information for now
train <- train %>%
group_by(Store, Date, IsHoliday) %>%
summarize(Weekly_Sales = sum(Weekly_Sales))
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
train <- separate(train, "Date", c("Year", "Month", "Day"), sep = '-')
trainM <- merge(train,features)
#Store Size information is available from Stores dataset
stores <- read.csv("stores.csv")
trainM <- merge(trainM, stores, by = "Store")
#Merging all the datasets into one training dataset which will be split into training and testing datasets
#Distribution of Variables
par(mfrow=c(3,2))
hist(trainM$Store, col = 'light green', main = "Stores")
hist(trainM$Temperature, col = 'light green', main = "Temperature")
hist(trainM$Fuel_Price, col = 'light green', main = "Fuel Price")
hist(trainM$CPI, col = 'light green', main = "CPI")
hist(trainM$Unemployment, col = 'light green', main = "Unemployment")
hist(trainM$Size, col = 'light green', main = "Store Size")
Temperature, Fuel Price and Unemployment are fairly normally distributed. Store Size peaks at both ends, indicating a large number of large and small stores; medium-sized stores are comparatively few in this dataset. We might need to transform this variable.
par(mfrow = c(1,2))
hist(trainM$Weekly_Sales, col = 'light green', main = "Weekly Sales Original", xlab = "Weekly Sales")
hist(log(trainM$Weekly_Sales), col = 'light green', main = "Weekly Sales Transformed", xlab ='log(Weekly Sales)')
Weekly Sales is positively skewed, indicating a few large values. This could be caused by the festive weeks, which are few in number but have very high sales compared to the other weeks.
We can transform it with a log transformation, as can be seen from the second plot. This may be required when we build the model too.
ggplot(trainM, aes(x = Month,y = Weekly_Sales )) +
geom_col() +
facet_wrap(~Year) +
ggtitle("Weekly Sales Distribution")
Training data is missing for January in 2010 and for November and December in 2012. Sales peak in the festive months of November and December. There also seems to be a dip in September and October, leading into the holiday weeks.
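As a quick check (a sketch; this tabulation is not part of the original code), we can count observations per Year/Month combination in the merged data to confirm the missing months:
#Sketch: observations per Year/Month; empty cells reveal the missing months
table(trainM$Year, trainM$Month)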
#Grouping Weeks based on Sales
#Using 2011 as the base year because data is available for the full year
Weeks2011A <- trainM %>%
filter(Year == 2011, Type == "A") %>%
group_by(Type, Week) %>%
summarize(Size = mean(Size), WeeklySales = sum(Weekly_Sales))
Weeks2011B <- trainM %>%
filter(Year == 2011, Type == "B") %>%
group_by(Type, Week) %>%
summarize(Size = mean(Size), WeeklySales = sum(Weekly_Sales))
Weeks2011C <- trainM %>%
filter(Year == 2011, Type == "C") %>%
group_by(Type, Week) %>%
summarize(Size = mean(Size), WeeklySales = sum(Weekly_Sales))
#Calculating upper and lower threshold based on 5% bounds
#Helper to add the mean and the 5% threshold columns to each type-wise table
addThresholds <- function(df) {
  m <- mean(df$WeeklySales)
  df$MeanWeeklySales <- m
  df$SalesUpperThreshold <- m * 1.05
  df$SalesLowerThreshold <- m * 0.95
  df
}
Weeks2011A <- addThresholds(Weeks2011A)
Weeks2011B <- addThresholds(Weeks2011B)
Weeks2011C <- addThresholds(Weeks2011C)
ggplot(Weeks2011A, aes(x = Week,y = WeeklySales)) +
geom_col()
We have around 52 weeks in the dataset, which could be used as a categorical variable since we can see trends in sales across weeks. But doing this would complicate the model with too many variables.
Instead, we can cluster the weeks based on sales. There are weeks with lows and peaks in sales, while the majority of weeks have sales around the median value.
We split the data by year and type to calculate an upper (5% above the mean) and lower (5% below the mean) threshold sales value. We cluster weeks with sales above the upper threshold as High weeks, weeks below the lower threshold as Low weeks, and everything else as Medium weeks. These clusters are shown in the plot below.
Weeks2011A$WeekType = "Medium"
Weeks2011A[which(Weeks2011A$WeeklySales > Weeks2011A$SalesUpperThreshold),8] = "High"
Weeks2011A[which(Weeks2011A$WeeklySales < Weeks2011A$SalesLowerThreshold),8] = "Low"
Weeks2011B$WeekType = "Medium"
Weeks2011B[which(Weeks2011B$WeeklySales > Weeks2011B$SalesUpperThreshold),8] = "High"
Weeks2011B[which(Weeks2011B$WeeklySales < Weeks2011B$SalesLowerThreshold),8] = "Low"
Weeks2011C$WeekType = "Medium"
Weeks2011C[which(Weeks2011C$WeeklySales > Weeks2011C$SalesUpperThreshold),8] = "High"
Weeks2011C[which(Weeks2011C$WeeklySales < Weeks2011C$SalesLowerThreshold),8] = "Low"
Weeks2011 <- rbind(Weeks2011A[, c("Type", "Week", "WeekType")],
Weeks2011B[, c("Type", "Week", "WeekType")],
Weeks2011C[, c("Type", "Week", "WeekType")])
#Add WeekType to Training Data
trainM <- merge(trainM, Weeks2011, by = c("Type", "Week"))
#Note: par(mfrow) has no effect on ggplot2 output, so the three type-wise plots are drawn one by one
ggplot(Weeks2011A, aes(x = Week,y = WeeklySales, fill = WeekType)) +
geom_col()
ggplot(Weeks2011B, aes(x = Week,y = WeeklySales, fill = WeekType)) +
geom_col()
ggplot(Weeks2011C, aes(x = Week,y = WeeklySales, fill = WeekType)) +
geom_col()
Plotting Store Size against Sales, we can observe a linear relation. We can also observe a possibility of confounding here, as we can see chunks for different sizes.
ggplot(trainM, aes(x = Size, y = Weekly_Sales))+
geom_point() +
ggtitle("Store Size vs Sales")
Based on this, we will try to group the store sizes, although we already have a “Type” variable in the dataset.
#Using Year = 2011 for these calculations as this is the only Year with data for all weeks
train_S_2011 <- trainM %>%
filter(Year == 2011) %>%
group_by(Store, Month, Week, WeekType, Type, IsHoliday) %>%
summarize(WeeklySales = sum(Weekly_Sales), Size = mean(Size),
Temp = mean(Temperature), FuelPrice = mean(Fuel_Price),
Unemployment = mean(Unemployment))
ggplot(train_S_2011, aes(x = Week,y = WeeklySales, fill = WeekType)) +
geom_col() +
facet_wrap(~Type)
ggplot(trainM, aes(x = Week,y = Weekly_Sales, fill = WeekType)) +
geom_col() +
facet_wrap(~Year)
#Handling Store Size Confounding
ggplot(trainM, aes(x = Store,y = Size/1000, fill = Type)) +
geom_col()
StoreSales <- trainM %>%
group_by(Store, Type) %>%
summarize(Size = mean(Size), WeeklySales = sum(Weekly_Sales))
#Store-level thresholds: 40% above and 60% below the mean annual sales
StoreUpperThreshold <- mean(StoreSales$WeeklySales) * 1.4
StoreLowerThreshold <- mean(StoreSales$WeeklySales) * 0.4
StoreSales$StoreUpperThreshold <- StoreUpperThreshold
StoreSales$StoreLowerThreshold <- StoreLowerThreshold
StoreSales$StoreGroup = "Medium"
StoreSales[which(StoreSales$WeeklySales>StoreUpperThreshold),7] = "High"
StoreSales[which(StoreSales$WeeklySales<StoreLowerThreshold),7] = "Low"
ggplot(StoreSales, aes(x = Store,y = WeeklySales, fill = StoreGroup)) +
geom_col()
trainM <- merge(trainM, StoreSales[, c("Store", "StoreGroup")], by = "Store")
ggplot(trainM, aes(x = Store,y = Size/1000, fill = StoreGroup)) +
geom_col() +
facet_wrap(~Type)
ggplot(trainM, aes(x = Store,y = Weekly_Sales/10000, fill = StoreGroup)) +
geom_col() +
facet_wrap(~Year)
When we plot Sales against Store with Type as the color dimension, we can observe clear groups among the stores, but these groups are not exactly captured by the Type field. Some stores have very high sales compared to the rest; these are a subset of Type A. There is also a bunch of stores with small sales, a mixture of Type B and Type C. The remaining stores make average sales in a year and fall into a third group; these are a mix of Type A and Type B.
We have grouped the stores into 3 clusters, like the week groups, without changing the original Type variable. This avoids using the Store variable as a categorical variable.
Unemployment, CPI and Fuel Price do influence Weekly Sales, but we cannot observe any major patterns here.
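The plots behind this observation are not shown above; a minimal sketch of how they could be drawn with base graphics:
#Sketch: Weekly Sales against the macroeconomic variables
par(mfrow = c(1, 3))
plot(trainM$Unemployment, trainM$Weekly_Sales, pch = 20, xlab = "Unemployment", ylab = "Weekly Sales")
plot(trainM$CPI, trainM$Weekly_Sales, pch = 20, xlab = "CPI", ylab = "Weekly Sales")
plot(trainM$Fuel_Price, trainM$Weekly_Sales, pch = 20, xlab = "Fuel Price", ylab = "Weekly Sales")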
#Handling Markdown
#Since we don't have much information on the missing values, only replacing NA's with 0
for (col in paste0("MarkDown", 1:5)) {
  trainM[[col]][is.na(trainM[[col]])] <- 0
}
Since we don’t have much information on the markdowns and the majority of their values are NA, we are not imputing missing values with a mean / median. Instead, we replace the NAs with 0 so the columns can be used for model building and other computations.
#Splitting into training and testing datasets
set.seed(7038) #assumed seed so that the split is reproducible
index <- sample(nrow(trainM), nrow(trainM) * 0.70)
testM <- trainM[-index, ] #take the test rows before trainM is overwritten
trainM <- trainM[index, ]
The original training dataset is randomly split into a 70% training subset and a 30% testing subset.
#Investigate Holiday
ggplot(trainM, aes(x = Week,y = Weekly_Sales, fill = IsHoliday)) +
geom_col() +
facet_wrap(~Year)
Model3 <- lm(formula = Weekly_Sales ~ Size + WeekType + StoreGroup + IsHoliday, data = trainM)
We can expect holidays to positively impact sales in general but, as can be seen from the above plot, we cannot strongly state that all holiday weeks result in higher sales. The Thanksgiving week shows a jump in sales, but the Christmas week, on the other hand, shows a drop.
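To quantify this, a small summary (a sketch, not part of the original code) comparing holiday and non-holiday weeks could look like:
#Sketch: average weekly sales for holiday vs non-holiday weeks
trainM %>%
group_by(IsHoliday) %>%
summarize(MeanSales = mean(Weekly_Sales), Weeks = n())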
We use a forward selection mechanism to pinpoint the most significant variables.
We repeated this with backward selection. In both cases we ended up with similar models, hence we proceed with this model.
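The selection calls themselves are not shown above; a minimal sketch using base R's step() function, where the candidate predictor list is our assumption:
#Sketch: forward selection from the intercept-only model, backward elimination from the full model
null_model <- lm(Weekly_Sales ~ 1, data = trainM)
full_model <- lm(Weekly_Sales ~ Type + Temperature + Fuel_Price + MarkDown1 + MarkDown2 + MarkDown3 + MarkDown4 + MarkDown5 + CPI + Unemployment + Size + WeekType + StoreGroup + IsHoliday, data = trainM)
forward <- step(null_model, scope = formula(full_model), direction = "forward", trace = 0)
backward <- step(full_model, direction = "backward", trace = 0)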
#The leaps package requires categorical data to be in encoded form
trainM$Type <- factor(trainM$Type, levels = c("A", "B", "C"),
labels = c(0,1,2))
trainM$WeekType <- factor(trainM$WeekType, levels = c("Low", "Medium", "High"),
labels = c(0,1,2))
trainM$StoreGroup <- factor(trainM$StoreGroup, levels = c("Low", "Medium", "High"),
labels = c(0,1,2))
Using the leaps package, we can check different combinations of the independent variables and select the best combination on the basis of R-squared / adjusted R-squared. We use adjusted R-squared here.
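A minimal sketch of the leaps search (regsubsets() is the leaps entry point; the candidate predictor list is our assumption):
#Sketch: exhaustive subset search, picking the size with the best adjusted R-squared
subsets <- regsubsets(Weekly_Sales ~ Type + Temperature + Fuel_Price + MarkDown1 + MarkDown2 + MarkDown3 + MarkDown4 + MarkDown5 + CPI + Unemployment + Size + WeekType + StoreGroup + IsHoliday, data = trainM, nvmax = 17)
summ <- summary(subsets)
summ$which[which.max(summ$adjr2), ]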
#Model based on the outcome of Leaps
ModelL <- lm(formula = Weekly_Sales~Type + Temperature + MarkDown3 + MarkDown5 + CPI + Unemployment + Size + WeekType + StoreGroup, data = trainM)
summary(ModelL)
##
## Call:
## lm(formula = Weekly_Sales ~ Type + Temperature + MarkDown3 +
## MarkDown5 + CPI + Unemployment + Size + WeekType + StoreGroup,
## data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -642263 -125063 -3451 111785 1813161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.386e+05 2.857e+04 8.351 < 2e-16 ***
## Type1 2.970e+04 1.067e+04 2.784 0.0054 **
## Type2 2.074e+04 1.650e+04 1.257 0.2089
## Temperature 8.118e+02 1.852e+02 4.384 1.19e-05 ***
## MarkDown3 4.362e+00 5.956e-01 7.324 2.84e-13 ***
## MarkDown5 6.051e+00 7.854e-01 7.704 1.62e-14 ***
## CPI -1.227e+03 8.457e+01 -14.504 < 2e-16 ***
## Unemployment -3.225e+03 1.791e+03 -1.800 0.0719 .
## Size 4.953e+00 1.167e-01 42.429 < 2e-16 ***
## WeekType1 6.903e+04 7.172e+03 9.625 < 2e-16 ***
## WeekType2 3.808e+05 1.415e+04 26.916 < 2e-16 ***
## StoreGroup1 1.406e+05 1.305e+04 10.778 < 2e-16 ***
## StoreGroup2 7.986e+05 1.647e+04 48.494 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 204700 on 4491 degrees of freedom
## Multiple R-squared: 0.8684, Adjusted R-squared: 0.8681
## F-statistic: 2471 on 12 and 4491 DF, p-value: < 2.2e-16
Based on the outcome of the analysis using leaps, we build ModelL with an adjusted R-squared of 86.8%.
par(mfrow=c(3,2))
plot(trainM$Temperature, ModelL$residuals, pch = 20)
plot(trainM$MarkDown3, ModelL$residuals, pch = 20)
plot(trainM$MarkDown5, ModelL$residuals, pch = 20)
plot(trainM$CPI, ModelL$residuals, pch = 20)
plot(trainM$Size, ModelL$residuals, pch = 20)
We had noticed that store sizes are concentrated at the large and small ends with few in between, which is probably adding to the positive skewness. Hence we will further enhance the current model with a log transformation of Size.
## Transforming Size
#Model with a log transformation of Size
ModelP <- lm(formula = Weekly_Sales~Type + Temperature + MarkDown3 + MarkDown5 + CPI + Unemployment + I(log(Size)) + WeekType + StoreGroup, data = trainM)
plot(trainM$Size,trainM$Weekly_Sales)
points(trainM$Size, ModelL$fitted.values, pch = 20, col= 'blue')
points(trainM$Size, ModelP$fitted.values, pch = 20, col= 'green')
plot(trainM$Weekly_Sales, ModelP$residuals, pch = 20)
abline(h = 0, col = "grey")
qqnorm(ModelP$residuals)
qqline(ModelP$residuals)
We had observed in EDA that Weekly Sales is positively skewed. There is still skewness to the right, and hence we will try a transformation of the response variable.
Initially we tried a square-root transformation, but it did not bring a major improvement in skewness. Using Box-Cox, we see a lambda close to 0 and hence use a log transformation.
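The square-root attempt is not shown above; a sketch of what it could have looked like (the predictor set mirrors the later ModelT and is our assumption):
#Sketch: square-root transformation of the response, tried and discarded
ModelS <- lm(sqrt(Weekly_Sales) ~ Size + Temperature + MarkDown3 + MarkDown5 + CPI + WeekType + StoreGroup, data = trainM)
hist(ModelS$residuals, col = 'light green', main = "Residuals of sqrt model")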
We also check for interaction between variables.
ModelT <- lm(formula = log(Weekly_Sales)~Temperature + MarkDown3 + MarkDown5 + CPI + WeekType + StoreGroup+Type:WeekType, data = trainM)
Using ModelT, we predicted on the training data, and the trend is plotted against the original Weekly Sales.
## Cross Validation
Using K = 10 folds, we performed cross-validation to check the model's response to new data; all responses are very close to the interval line and the residuals are evenly spread.
## [1] 0.0608
## [1] 0.825
## [1] 274
We get a Predicted R-square of 87.6%
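The cross-validation code is not shown above; a minimal sketch of a 10-fold scheme for ModelT's formula (the fold assignment and seed are our assumptions):
#Sketch: 10-fold cross-validation giving a predicted R-squared on the log scale
set.seed(7038) #assumed seed
k <- 10
folds <- sample(rep(1:k, length.out = nrow(trainM)))
sse <- 0
for (i in 1:k) {
fit <- lm(log(Weekly_Sales) ~ Size + Temperature + MarkDown3 + MarkDown5 + CPI + WeekType + StoreGroup, data = trainM[folds != i, ])
pred <- predict(fit, newdata = trainM[folds == i, ])
sse <- sse + sum((log(trainM$Weekly_Sales[folds == i]) - pred)^2)
}
sst <- sum((log(trainM$Weekly_Sales) - mean(log(trainM$Weekly_Sales)))^2)
1 - sse / sst #predicted R-squared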
bcx <- boxcox(ModelL)
#lambda close to 0 supports a log transformation of the response
lambda <- bcx$x[which.max(bcx$y)]
ModelT <- lm(formula = log(Weekly_Sales)~Size+Temperature + MarkDown3 + MarkDown5 + CPI + WeekType + StoreGroup, data = trainM)
summary(ModelT)
##
## Call:
## lm(formula = log(Weekly_Sales) ~ Size + Temperature + MarkDown3 +
## MarkDown5 + CPI + WeekType + StoreGroup, data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6249 -0.1269 -0.0007 0.1253 0.9879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.26e+01 1.89e-02 666.13 < 2e-16 ***
## Size 5.32e-06 6.60e-08 80.60 < 2e-16 ***
## Temperature 9.05e-04 1.79e-04 5.06 4.5e-07 ***
## MarkDown3 2.42e-06 5.86e-07 4.12 3.8e-05 ***
## MarkDown5 5.66e-06 7.70e-07 7.35 2.3e-13 ***
## CPI -1.21e-03 7.79e-05 -15.57 < 2e-16 ***
## WeekType1 5.33e-02 6.90e-03 7.72 1.4e-14 ***
## WeekType2 2.93e-01 1.38e-02 21.24 < 2e-16 ***
## StoreGroup1 4.68e-01 1.08e-02 43.19 < 2e-16 ***
## StoreGroup2 9.05e-01 1.45e-02 62.49 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.202 on 4494 degrees of freedom
## Multiple R-squared: 0.884, Adjusted R-squared: 0.883
## F-statistic: 3.79e+03 on 9 and 4494 DF, p-value: <2e-16
plot(trainM$Weekly_Sales, ModelT$residuals, pch = 20)
abline(h = 0, col = "grey")
qqnorm(ModelT$residuals)
qqline(ModelT$residuals)
With this, we have built our final model, ModelT, with an adjusted R-squared of 88.3%; all variables are highly significant.
trainM$Prediction <- exp(ModelT$fitted.values)
trainM2011 <- trainM %>% filter(Year == 2011) %>% group_by(Week) %>%
summarize(WeeklySales = sum(Weekly_Sales), WeeklyPrediction = sum(Prediction))
#Week is stored as character, so convert it for plotting
plot(x = as.numeric(trainM2011$Week), y = trainM2011$WeeklySales, col = 'red', main = "Actual vs Prediction", xlab = "Week", ylab = "Weekly Sales")
lines(x = as.numeric(trainM2011$Week), y = trainM2011$WeeklyPrediction, col = 'blue')
#Preparing Test data
#Encoding the test data the same way as the training data so that predict() sees matching factor levels
testM$Type <- factor(testM$Type, levels = c("A", "B", "C"),
labels = c(0,1,2))
testM$WeekType <- factor(testM$WeekType, levels = c("Low", "Medium", "High"),
labels = c(0,1,2))
testM$StoreGroup <- factor(testM$StoreGroup, levels = c("Low", "Medium", "High"),
labels = c(0,1,2))
pred <- predict(ModelT, newdata = testM, interval = 'confidence', level = 0.95, type = 'response')
testM$Prediction <- exp(pred[, "fit"]) #use the point estimate ("fit"), not the lower bound
testMSum <- testM %>% group_by(Week) %>%
summarize(WeeklySales = sum(Weekly_Sales), WeeklyPrediction = sum(Prediction))
plot(x = as.numeric(testMSum$Week), y = testMSum$WeeklySales, col = 'red', main = "Actual vs Prediction - Testing Data", xlab = "Week", ylab = "Weekly Sales")
lines(x = as.numeric(testMSum$Week), y = testMSum$WeeklyPrediction, col = 'blue')
We were able to build this model with an adjusted R-squared of 88% and a predicted R-squared of 87%. The variables “StoreGroup” and “WeekType” constructed from the dataset are the most significant variables. With this model, we can predict a good estimate of the weekly sales for Walmart stores.
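As a usage sketch (a hypothetical example using the first row of the test set), a point estimate for a single store-week can be obtained by back-transforming the model's log-scale prediction:
#Sketch: back-transformed point prediction for one store-week
newobs <- testM[1, ]
exp(predict(ModelT, newdata = newobs))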