The data contains historical sales data for 45 stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store. The data is stored in 3 different csv files.
stores.csv This file contains anonymized information about the 45 stores, indicating the type and size of store.
sales.csv This is the historical sales data, which covers to 2010-02-05 to 2013-07-26. Within this file you will find the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store
IsHoliday - whether the week is a special holiday week
Features.csv This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week
| Store | Date | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 05/02/2010 | 42.31 | 2.572 | NA | NA | NA | NA | NA | 211.0964 | 8.106 | FALSE |
| 1 | 12/02/2010 | 38.51 | 2.548 | NA | NA | NA | NA | NA | 211.2422 | 8.106 | TRUE |
| 1 | 19/02/2010 | 39.93 | 2.514 | NA | NA | NA | NA | NA | 211.2891 | 8.106 | FALSE |
| 1 | 26/02/2010 | 46.63 | 2.561 | NA | NA | NA | NA | NA | 211.3196 | 8.106 | FALSE |
| 1 | 05/03/2010 | 46.50 | 2.625 | NA | NA | NA | NA | NA | 211.3501 | 8.106 | FALSE |
| 1 | 12/03/2010 | 57.79 | 2.667 | NA | NA | NA | NA | NA | 211.3806 | 8.106 | FALSE |
| Store | Dept | Date | Weekly_Sales | IsHoliday |
|---|---|---|---|---|
| 1 | 1 | 05/02/2010 | 24924.50 | FALSE |
| 1 | 1 | 12/02/2010 | 46039.49 | TRUE |
| 1 | 1 | 19/02/2010 | 41595.55 | FALSE |
| 1 | 1 | 26/02/2010 | 19403.54 | FALSE |
| 1 | 1 | 05/03/2010 | 21827.90 | FALSE |
| 1 | 1 | 12/03/2010 | 21043.39 | FALSE |
| Store | Type | Size |
|---|---|---|
| 1 | A | 151315 |
| 2 | A | 202307 |
| 3 | B | 37392 |
| 4 | A | 205863 |
| 5 | B | 34875 |
| 6 | A | 202505 |
In this section, We are going to do some Data Exploration. First of all, We are going to split the data into Year, Month and Day.
Now, We created a feature Weekly sales from sales data set.
Now, We must merge the two data sets features and sales.
There is the information about Store Size from Stores data set. So, We are now, merge the SalesM with Storesdata sets also. Merging all the data sets into one sales data set which will be split into sales and testing data sets for data modeling.
Let´s, first of all, look at the Variables distribution.
Temperature and Unemployment are fairly normally distributed. Store Sizes have brackets on both ends indicating large number of large and small stores. The medium sized stores are very few in comparison in this data set. We might need to do some transformation of this data.
Weekly Sales is positively skewed indicating few large values. This could be caused by the festive Weeks which are few but Sales value for these weeks would be very high in comparison to other weeks. We can transform it using a log transformation as can be seem from the second plot. This might be required when we build the model too.
Sales Data is missing for January in 2010, November and December in 2012.There are weeks when Sales peaks in the festive months of November and December.Also seems like there is a dip in September - October leading towards the holiday weeks.
Let´s now grouping the Weeks based on Sales. We are going to use from 2011, because the data is available for full year. And let´s summarize the Weekly Sales.
Now, calculating the upper and lower threshold based on 5% bounds.
We have around 52 weeks in the data set which can be used as a Categorical Variable as we can see trends in Sales across weeks. But doing this would complicate the model with too many variables.
Instead, we can try to cluster the weeks based on Sales. There are weeks with lows and peaks in Sales and then there are majority of the weeks with Sales around the median value
We split the data year wise and Type wise to calculate an Upper (5% higher than Mean) and Lower (5% lower than Mean) Threshold Sales Value. We cluster weeks with Sales higher than Upper threshold as High weeks, lower than Lower threshold as Low weeks and everything else as Medium weeks. These Clusters are shown in below plot.
Plotting Store Size against Sales, we can observe Linear relation. We could observer possibility of confounding here as we can see chunks for different Sizes.
## $title
## [1] "Store Size vs Sales"
##
## attr(,"class")
## [1] "labels"
Based on this, we will try to group the Store Sizes but we already have a Type variable in the data set. Below We see how is the sales level behavior in 2011.
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
Here, We have the Week Type per Year and Week.
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
Handling with Store Size Confounding.
Let´s see the distribution of the Weekly Sales by level of the Stores. We created a Upper and Lower threshold.
Below, We have the distribution of the Store by Size Store and grouping by Type.
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
We see in the graphic below, the distribution of Weekly Sales by Store, Level of the Store by Year.
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
When we plot Sales against Store with Type as color dimension, we can observe clear groups in the stores but this is not exactly captured by the Type field. There are some Stores with really high Sales compared to rest. These Stores are a subset of Type A. Then there are bunch of stores with small sales and is a mixture of Type B and Type C. Rest of the stores makes average sales in a year and falls in a different group. These stores are mix of Type A and Type B
We have grouped the stores into 3 clusters like Week Groups without changing the original Type variable. This avoids using the Store variable as a categorical variable
Unemployment, CPI and Fuel Price does influence Weekly Sales but we cannot observe any major patterns here.
Since we don’t have much information on the Markdowns and major part of them are N/A, we are not imputing missing values with a mean / median. Instead, we are replacing N/As with 0 so that it can be used for Model building and other computations.
Let´s see the behavior of Weekly Sales by Holiday features analysis distribution.
Holiday x Week
## <ggproto object: Class FacetWrap, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetWrap, Facet, gg>
Holiday x Weekly_Sales
Obviously, from the different variable perspective, weekly sales is higher during holidays.
Now, it’s time to see the Markdown features.
Markdown x month x Holiday
Markdown 1 and Markdown 4 are mainly offered on February, March and August and most of them offered during non-holiday. This discount is possibly related to Super bowl season, Mother’s day and back-to-school season.
Markdown 2 is offered mainly during Dec holiday and Jan, but also has its peaks during November. This discount is possibly related to Christmas and New year.
Markdown 3 is offered only during Nov holiday and Dec. This discount is possibly related to Thanksgiving.
Markdown 5 is mainly offered during non-holiday season.
Let´s Categorized the Weeklysalesand see what We figure it out.
Let´s now start the Machine Learning development.
Original Training data set is split into two random training and testing data sets.
##
## Call:
## lm(formula = Weekly_Sales ~ Size + StoreGroup + IsHoliday, data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -676576 -131333 -12932 105772 2173374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.728e+05 1.649e+04 58.98 < 2e-16 ***
## Size 4.766e+00 7.632e-02 62.44 < 2e-16 ***
## StoreGroupLow -8.222e+05 1.693e+04 -48.58 < 2e-16 ***
## StoreGroupMedium -6.647e+05 9.978e+03 -66.62 < 2e-16 ***
## IsHolidayTRUE 6.813e+04 1.349e+04 5.05 4.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 235900 on 4499 degrees of freedom
## Multiple R-squared: 0.8288, Adjusted R-squared: 0.8287
## F-statistic: 5445 on 4 and 4499 DF, p-value: < 2.2e-16
We can expect Holidays to positively impact Sales in general but as can be seem from above plot, we cannot strongly state that all Holiday Weeks result in higher Sales. Thanksgiving week seems to have jump in sales by Xmas week on the other hand shows a drop
We use Forward selection mechanism to pin point the most significant variables
Repeated with Backward Selection. In both cases, we ended up with similar models. Hence proceeding with this Model.
Using leaps package, we can check for different combinations of the independent variables and select the best combination on the basis or R-sq / Adjusted R-sq. We are using Adjusted R-sq here.
##
## Call:
## lm(formula = Weekly_Sales ~ Type + Temperature + MarkDown3 +
## MarkDown5 + CPI + Unemployment + Size + WeekType + StoreGroup,
## data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -696154 -123359 -1768 111566 1784907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.578e+05 2.900e+04 8.892 < 2e-16 ***
## Type1 4.242e+04 1.083e+04 3.916 9.14e-05 ***
## Type2 3.658e+04 1.686e+04 2.169 0.030122 *
## Temperature 7.270e+02 1.870e+02 3.887 0.000103 ***
## MarkDown3 4.027e+00 6.398e-01 6.295 3.38e-10 ***
## MarkDown5 5.878e+00 8.070e-01 7.285 3.79e-13 ***
## CPI -1.294e+03 8.549e+01 -15.135 < 2e-16 ***
## Unemployment -5.231e+03 1.847e+03 -2.832 0.004645 **
## Size 5.072e+00 1.187e-01 42.742 < 2e-16 ***
## WeekType1 6.507e+04 7.266e+03 8.955 < 2e-16 ***
## WeekType2 3.997e+05 1.398e+04 28.597 < 2e-16 ***
## StoreGroup1 1.312e+05 1.338e+04 9.806 < 2e-16 ***
## StoreGroup2 7.924e+05 1.676e+04 47.278 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207500 on 4491 degrees of freedom
## Multiple R-squared: 0.8678, Adjusted R-squared: 0.8674
## F-statistic: 2456 on 12 and 4491 DF, p-value: < 2.2e-16
We had noticed that Size of the Store is high on large & Small sizes and few in between which is probably adding to the positive skewness. Hence we will further enhance the current Model with a log transformation of Size.
Polynomial Regression
##
## Call:
## lm(formula = Weekly_Sales ~ Type + Temperature + MarkDown3 +
## MarkDown5 + CPI + Unemployment + I(log(Size)) + WeekType +
## StoreGroup, data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -699985 -148795 3265 131375 1759117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.631e+06 1.512e+05 -30.633 < 2e-16 ***
## Type1 -6.443e+04 1.002e+04 -6.430 1.41e-10 ***
## Type2 1.676e+04 1.861e+04 0.901 0.3677
## Temperature 7.732e+02 1.952e+02 3.960 7.61e-05 ***
## MarkDown3 4.055e+00 6.679e-01 6.071 1.37e-09 ***
## MarkDown5 6.299e+00 8.422e-01 7.479 8.93e-14 ***
## CPI -1.049e+03 9.024e+01 -11.629 < 2e-16 ***
## Unemployment -4.608e+03 1.929e+03 -2.389 0.0169 *
## I(log(Size)) 4.811e+05 1.331e+04 36.152 < 2e-16 ***
## WeekType1 6.335e+04 7.585e+03 8.353 < 2e-16 ***
## WeekType2 3.959e+05 1.459e+04 27.142 < 2e-16 ***
## StoreGroup1 7.428e+04 1.608e+04 4.619 3.97e-06 ***
## StoreGroup2 7.689e+05 1.921e+04 40.026 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216600 on 4491 degrees of freedom
## Multiple R-squared: 0.8559, Adjusted R-squared: 0.8555
## F-statistic: 2223 on 12 and 4491 DF, p-value: < 2.2e-16
#################################
# Comparasion of the Two Models #
#################################
plot(trainM$Size,trainM$Weekly_Sales)
points(trainM$Size, Model_full$fitted.values, pch = 20, col= 'blue')
points(trainM$Size, Model_P$fitted.values, pch = 20, col= 'green')
Transformation
We had observed in EDA, that Weekly Sales is positively skewed. There is still skewness to the right and hence we will try transformation on the response variable.
Initially we tried Square Root transformation but not a major improvement in skewness.
Using Boxcox, we see lambda close to 0 and hence using log transformation.
##
## Call:
## lm(formula = log(Weekly_Sales) ~ Size + Temperature + MarkDown3 +
## MarkDown5 + CPI + WeekType + StoreGroup, data = trainM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62833 -0.12823 -0.00425 0.12403 0.97004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.261e+01 1.897e-02 664.939 < 2e-16 ***
## Size 5.329e-06 6.638e-08 80.271 < 2e-16 ***
## Temperature 8.137e-04 1.795e-04 4.532 5.98e-06 ***
## MarkDown3 2.313e-06 6.247e-07 3.703 0.000216 ***
## MarkDown5 5.363e-06 7.844e-07 6.837 9.16e-12 ***
## CPI -1.251e-03 7.840e-05 -15.957 < 2e-16 ***
## WeekType1 4.745e-02 6.953e-03 6.825 9.99e-12 ***
## WeekType2 3.033e-01 1.347e-02 22.514 < 2e-16 ***
## StoreGroup1 4.708e-01 1.099e-02 42.822 < 2e-16 ***
## StoreGroup2 9.111e-01 1.457e-02 62.541 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2027 on 4494 degrees of freedom
## Multiple R-squared: 0.8834, Adjusted R-squared: 0.8831
## F-statistic: 3782 on 9 and 4494 DF, p-value: < 2.2e-16
With this, we have build our final Model: Model with an adjusted R-sq = 88.24 and all variables are highly significant.
Doing the prediction using the Test Data set, We got very good predictions and pretty closer with the real values.
For whole data set.
Other Models
Using now the ts data frame.
We were able to build this model with Adjusted R-sq = 88% and Predicted R-Sq = 88%. The variables “Store Group” & “Week Type” constructed from the data set are the most significant variables. With this Model, we can predict a good estimate of the weekly Sales for Stores.
With this Pivot Table We can manipulate and see the distribution for whatever variables combination. It´s totally interactive Pivot Table. We can drag the variables to the columns or rows, choose the categories, make filters, choose which metrics or graphs.