DATA PREP FOR MODELING
Data Summary: the variables Store and Holiday_Flag are stored as numeric,
but should be categorical
##    Store         Date            Weekly_Sales      Holiday_Flag
## Min.   : 1   Length:6435        Min.   : 209986   Min.   :0.00000
## 1st Qu.:12   Class :character   1st Qu.: 553350   1st Qu.:0.00000
## Median :23   Mode  :character   Median : 960746   Median :0.00000
## Mean   :23                      Mean   :1046965   Mean   :0.06993
## 3rd Qu.:34                      3rd Qu.:1420159   3rd Qu.:0.00000
## Max.   :45                      Max.   :3818686   Max.   :1.00000
##  Temperature      Fuel_Price          CPI         Unemployment
## Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879
## 1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891
## Median : 62.67   Median :3.445   Median :182.6   Median : 7.874
## Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999
## 3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622
## Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313
Data Summary: categorical variables converted to factors, plus a random
variable for splitting the data
##     Store          Date            Weekly_Sales     Holiday_Flag
## 1      : 143   Length:6435        Min.   : 209986   0:5985
## 2      : 143   Class :character   1st Qu.: 553350   1: 450
## 3      : 143   Mode  :character   Median : 960746
## 4      : 143                      Mean   :1046965
## 5      : 143                      3rd Qu.:1420159
## 6      : 143                      Max.   :3818686
## (Other):5577
##  Temperature      Fuel_Price          CPI         Unemployment
## Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879
## 1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891
## Median : 62.67   Median :3.445   Median :182.6   Median : 7.874
## Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999
## 3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622
## Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313
##
##     random
## Min.   :0.0000653
## 1st Qu.:0.2527042
## Median :0.4945870
## Mean   :0.4980825
## 3rd Qu.:0.7453083
## Max.   :0.9999414
##
## tibble [6,435 × 9] (S3: tbl_df/tbl/data.frame)
## $ Store : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr [1:6435] "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
## $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
## $ Holiday_Flag: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
## $ CPI : num [1:6435] 211 211 211 211 211 ...
## $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
## $ random : num [1:6435] 0.288 0.788 0.409 0.883 0.94 ...
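The recoding summarized above can be sketched in base R as follows. This is a minimal illustration on a toy data frame, not the code that produced the output: the object name `sales`, the seed, and the toy values are assumptions, and the original works on a tibble rather than a plain data frame.

```r
# Toy stand-in for the sales data (the real data set has 6435 rows).
set.seed(1)  # assumed seed, only so the sketch is reproducible
sales <- data.frame(
  Store        = rep(1:3, each = 4),
  Holiday_Flag = rep(c(0, 1, 0, 0), times = 3),
  Weekly_Sales = runif(12, 2e5, 2e6)
)

# Recode the numeric identifiers as categorical variables.
sales$Store        <- factor(sales$Store)
sales$Holiday_Flag <- factor(sales$Holiday_Flag)

# One Uniform(0, 1) draw per row, used later to split the data.
sales$random <- runif(nrow(sales))

str(sales)
```

After the conversion, summary() reports level counts for Store and Holiday_Flag instead of quartiles, which is the change visible between the two summaries.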
Data split 70/30 into training and validation datasets
- the training data has 4533 observations
- the validation data has 1902 observations
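One way such a split could be done is to threshold the random column at 0.7. The sketch below assumes that approach. `trainlm` is the training-data name that appears in the model call, while `sales`, `validlm`, the seed, and the toy data are hypothetical.

```r
# Toy data with a Uniform(0, 1) column, standing in for the real data set.
set.seed(1)  # assumed seed for reproducibility
sales <- data.frame(Weekly_Sales = runif(1000, 2e5, 2e6))
sales$random <- runif(nrow(sales))

# Rows with random < 0.7 go to training, the rest to validation.
trainlm <- sales[sales$random <  0.7, ]
validlm <- sales[sales$random >= 0.7, ]

# The split is only approximately 70/30 because it depends on the draws.
c(train = nrow(trainlm), valid = nrow(validlm))
```

Because the split depends on random draws, the row counts change from run to run unless a seed is fixed first.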
LINEAR MODEL TO PREDICT SALES
- A linear model predicting Weekly_Sales was fit using every available
predictor except Date. This model shows the individual effect of each
predictor.
- A second model with all possible interactions was then fit, again
excluding Date.
- Backward selection with alpha = 0.01 was applied to obtain a more
parsimonious model.
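The first two steps above might look like the following sketch. It uses toy data and only a subset of the predictors. The names `fit_main` and `fit_int` are hypothetical, and limiting the second model to two-way interactions is an assumption, since the text does not say which interaction order was used.

```r
# Toy training data with the same kinds of columns as the real set.
set.seed(1)  # assumed seed
n <- 300
trainlm <- data.frame(
  Store        = factor(sample(1:3, n, replace = TRUE)),
  Date         = sample(c("05-02-2010", "12-02-2010", "19-02-2010"),
                        n, replace = TRUE),
  Holiday_Flag = factor(rbinom(n, 1, 0.1)),
  Temperature  = runif(n, 0, 100),
  Fuel_Price   = runif(n, 2.5, 4.5)
)
trainlm$Weekly_Sales <- 1e6 + 2e5 * as.integer(trainlm$Store) -
  4e4 * trainlm$Fuel_Price + rnorm(n, sd = 5e4)

# Main-effects model: every column except Date, as in the call shown below.
fit_main <- lm(Weekly_Sales ~ . - Date, data = trainlm)

# Interaction model (assumed two-way): main effects plus all pairwise
# interactions among the same predictors.
fit_int <- lm(
  Weekly_Sales ~ (Store + Holiday_Flag + Temperature + Fuel_Price)^2,
  data = trainlm
)

summary(fit_main)
```

With 45 store levels, the interaction model estimates many more coefficients than the main-effects model, which is the motivation for a selection step afterwards.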
##
## Call:
## lm(formula = Weekly_Sales ~ . - Date, data = trainlm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513804 -69546 -11882 39468 1823766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1357543.2 278781.6 4.870 1.16e-06 ***
## Store2 351630.2 22573.7 15.577 < 2e-16 ***
## Store3 -1180716.9 23170.7 -50.957 < 2e-16 ***
## Store4 716535.6 116599.9 6.145 8.68e-10 ***
## Store5 -1282576.3 24140.1 -53.131 < 2e-16 ***
## Store6 -42348.2 23255.9 -1.821 0.068679 .
## Store7 -946003.2 35215.5 -26.863 < 2e-16 ***
## Store8 -699172.6 23983.0 -29.153 < 2e-16 ***
## Store9 -1068955.1 23877.8 -44.768 < 2e-16 ***
## Store10 628887.4 112929.0 5.569 2.71e-08 ***
## Store11 -217449.3 23756.1 -9.153 < 2e-16 ***
## Store12 -160159.8 105614.7 -1.516 0.129475
## Store13 665390.8 114607.5 5.806 6.85e-09 ***
## Store14 541155.6 43009.0 12.582 < 2e-16 ***
## Store15 -711503.8 105846.8 -6.722 2.02e-11 ***
## Store16 -1033264.4 39281.3 -26.304 < 2e-16 ***
## Store17 -482891.3 115651.4 -4.175 3.03e-05 ***
## Store18 -236188.4 103207.7 -2.288 0.022156 *
## Store19 88916.7 105944.5 0.839 0.401359
## Store20 568352.9 25871.2 21.969 < 2e-16 ***
## Store21 -815160.0 22958.8 -35.505 < 2e-16 ***
## Store22 -323211.6 100196.3 -3.226 0.001265 **
## Store23 -29770.2 113150.0 -0.263 0.792482
## Store24 27496.8 104872.5 0.262 0.793185
## Store25 -847649.9 25399.5 -33.373 < 2e-16 ***
## Store26 -359445.5 105525.4 -3.406 0.000664 ***
## Store27 439229.1 101050.9 4.347 1.41e-05 ***
## Store28 139953.5 105779.4 1.323 0.185880
## Store29 -750276.3 101676.3 -7.379 1.89e-13 ***
## Store30 -1122743.3 22425.9 -50.065 < 2e-16 ***
## Store31 -158464.1 22737.4 -6.969 3.65e-12 ***
## Store32 -332244.3 35092.7 -9.468 < 2e-16 ***
## Store33 -1023329.2 112745.7 -9.076 < 2e-16 ***
## Store34 -312719.9 107897.4 -2.898 0.003770 **
## Store35 -415109.3 98516.6 -4.214 2.56e-05 ***
## Store36 -1180433.3 23588.0 -50.044 < 2e-16 ***
## Store37 -1035121.3 22767.4 -45.465 < 2e-16 ***
## Store38 -796167.5 105624.2 -7.538 5.75e-14 ***
## Store39 -86778.8 23088.3 -3.759 0.000173 ***
## Store40 -455869.6 113041.5 -4.033 5.60e-05 ***
## Store41 -279251.2 37950.0 -7.358 2.20e-13 ***
## Store42 -730734.7 113110.0 -6.460 1.16e-10 ***
## Store43 -857065.6 25926.2 -33.058 < 2e-16 ***
## Store44 -1056135.9 115347.2 -9.156 < 2e-16 ***
## Store45 -676430.4 43141.6 -15.679 < 2e-16 ***
## Holiday_Flag1 75352.5 9602.9 7.847 5.30e-15 ***
## Temperature -904.9 159.4 -5.677 1.46e-08 ***
## Fuel_Price -42007.4 8681.7 -4.839 1.35e-06 ***
## CPI 2692.5 1262.7 2.132 0.033030 *
## Unemployment -23933.5 5245.3 -4.563 5.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162300 on 4475 degrees of freedom
## Multiple R-squared: 0.9183, Adjusted R-squared: 0.9174
## F-statistic: 1027 on 49 and 4475 DF, p-value: < 2.2e-16
## Backward Elimination Method
## ---------------------------
##
## Candidate Terms:
##
## 1. Store
## 2. Holiday_Flag
## 3. Temperature
## 4. Fuel_Price
## 5. CPI
## 6. Unemployment
##
##
## Step => 0
## Model => Weekly_Sales ~ Store + Holiday_Flag + Temperature + Fuel_Price + CPI + Unemployment
## R2 => 0.918
##
## Initiating stepwise selection...
##
##
## No more variables to be removed.
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513804 -69546 -11882 39468 1823766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1357543.2 278781.6 4.870 1.16e-06 ***
## Store2 351630.2 22573.7 15.577 < 2e-16 ***
## Store3 -1180716.9 23170.7 -50.957 < 2e-16 ***
## Store4 716535.6 116599.9 6.145 8.68e-10 ***
## Store5 -1282576.3 24140.1 -53.131 < 2e-16 ***
## Store6 -42348.2 23255.9 -1.821 0.068679 .
## Store7 -946003.2 35215.5 -26.863 < 2e-16 ***
## Store8 -699172.6 23983.0 -29.153 < 2e-16 ***
## Store9 -1068955.1 23877.8 -44.768 < 2e-16 ***
## Store10 628887.4 112929.0 5.569 2.71e-08 ***
## Store11 -217449.3 23756.1 -9.153 < 2e-16 ***
## Store12 -160159.8 105614.7 -1.516 0.129475
## Store13 665390.8 114607.5 5.806 6.85e-09 ***
## Store14 541155.6 43009.0 12.582 < 2e-16 ***
## Store15 -711503.8 105846.8 -6.722 2.02e-11 ***
## Store16 -1033264.4 39281.3 -26.304 < 2e-16 ***
## Store17 -482891.3 115651.4 -4.175 3.03e-05 ***
## Store18 -236188.4 103207.7 -2.288 0.022156 *
## Store19 88916.7 105944.5 0.839 0.401359
## Store20 568352.9 25871.2 21.969 < 2e-16 ***
## Store21 -815160.0 22958.8 -35.505 < 2e-16 ***
## Store22 -323211.6 100196.3 -3.226 0.001265 **
## Store23 -29770.2 113150.0 -0.263 0.792482
## Store24 27496.8 104872.5 0.262 0.793185
## Store25 -847649.9 25399.5 -33.373 < 2e-16 ***
## Store26 -359445.5 105525.4 -3.406 0.000664 ***
## Store27 439229.1 101050.9 4.347 1.41e-05 ***
## Store28 139953.5 105779.4 1.323 0.185880
## Store29 -750276.3 101676.3 -7.379 1.89e-13 ***
## Store30 -1122743.3 22425.9 -50.065 < 2e-16 ***
## Store31 -158464.1 22737.4 -6.969 3.65e-12 ***
## Store32 -332244.3 35092.7 -9.468 < 2e-16 ***
## Store33 -1023329.2 112745.7 -9.076 < 2e-16 ***
## Store34 -312719.9 107897.4 -2.898 0.003770 **
## Store35 -415109.3 98516.6 -4.214 2.56e-05 ***
## Store36 -1180433.3 23588.0 -50.044 < 2e-16 ***
## Store37 -1035121.3 22767.4 -45.465 < 2e-16 ***
## Store38 -796167.5 105624.2 -7.538 5.75e-14 ***
## Store39 -86778.8 23088.3 -3.759 0.000173 ***
## Store40 -455869.6 113041.5 -4.033 5.60e-05 ***
## Store41 -279251.2 37950.0 -7.358 2.20e-13 ***
## Store42 -730734.7 113110.0 -6.460 1.16e-10 ***
## Store43 -857065.6 25926.2 -33.058 < 2e-16 ***
## Store44 -1056135.9 115347.2 -9.156 < 2e-16 ***
## Store45 -676430.4 43141.6 -15.679 < 2e-16 ***
## Holiday_Flag1 75352.5 9602.9 7.847 5.30e-15 ***
## Temperature -904.9 159.4 -5.677 1.46e-08 ***
## Fuel_Price -42007.4 8681.7 -4.839 1.35e-06 ***
## CPI 2692.5 1262.7 2.132 0.033030 *
## Unemployment -23933.5 5245.3 -4.563 5.18e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 162300 on 4475 degrees of freedom
## Multiple R-squared: 0.9183, Adjusted R-squared: 0.9174
## F-statistic: 1027 on 49 and 4475 DF, p-value: < 2.2e-16
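For readers who want to see the mechanics, backward elimination by p-value can be sketched with base R's drop1(). This is a stand-in written for illustration, not the routine that produced the output above: the helper `backward_p`, the toy data, and the `noise` predictor are all hypothetical.

```r
# Toy data: Store and Fuel_Price drive the response, noise does not.
set.seed(1)  # assumed seed
n <- 400
d <- data.frame(
  Store      = factor(sample(1:3, n, replace = TRUE)),
  Fuel_Price = runif(n, 2.5, 4.5),
  noise      = rnorm(n)
)
d$Weekly_Sales <- 1e6 + 2e5 * as.integer(d$Store) -
  4e4 * d$Fuel_Price + rnorm(n, sd = 5e4)

# Repeatedly drop the term with the largest partial F-test p-value
# until every remaining term has p <= alpha.
backward_p <- function(fit, alpha = 0.01) {
  repeat {
    if (length(attr(terms(fit), "term.labels")) == 0) break
    tab   <- drop1(fit, test = "F")
    p     <- tab[["Pr(>F)"]][-1]   # first row is "<none>"
    terms <- rownames(tab)[-1]
    if (max(p) <= alpha) break
    worst <- terms[which.max(p)]
    fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full    <- lm(Weekly_Sales ~ Store + Fuel_Price + noise, data = d)
reduced <- backward_p(full, alpha = 0.01)
formula(reduced)
```

A factor such as Store is tested and dropped as a whole term, so individual store levels with large p-values do not by themselves cause the term to be removed.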
INTERPRET 3 PREDICTORS
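- Fuel_Price: when fuel prices go up, sales go down. Each one-unit
increase in Fuel_Price is associated with about 42,000 lower weekly
sales, holding the other predictors constant.
- Holiday_Flag: sales increase during holidays. Holiday weeks average
about 75,350 more in weekly sales than non-holiday weeks, holding the
other predictors constant.
- CPI: as the CPI goes up, so do sales, by about 2,690 per one-point
increase. This is the weakest of the three effects (p = 0.033).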
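As a worked example of reading these coefficients, the snippet below multiplies each estimate from the model summary by a change in the predictor. The changes in `delta` are hypothetical values chosen only for illustration.

```r
# Estimates copied from the coefficient table of the fitted model.
b <- c(Holiday_Flag1 = 75352.5, Fuel_Price = -42007.4, CPI = 2692.5)

# Hypothetical changes: a holiday week, fuel price up by 0.50, CPI up by 10.
delta <- c(Holiday_Flag1 = 1, Fuel_Price = 0.50, CPI = 10)

# Expected change in Weekly_Sales for each change, other predictors fixed.
effect <- b * delta
effect
```

Each product is the model's predicted shift in weekly sales for that change alone, with every other predictor held constant.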
WHAT IS THE R^2 OF THE MODEL
- The model, using the training data, has an R^2 of 0.9345
- The model, using the validation data, has an R^2 of 0.912
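A minimal sketch of computing R^2 on both data sets follows, assuming the definition 1 - SSE/SST applied to predictions on each set. The helper `r_squared`, the name `validlm`, and the toy data are hypothetical, and the report may have used a different definition for the validation figure.

```r
# Toy data split into training and validation sets.
set.seed(1)  # assumed seed
n <- 500
d <- data.frame(Fuel_Price = runif(n, 2.5, 4.5), random = runif(n))
d$Weekly_Sales <- 1e6 - 4e4 * d$Fuel_Price + rnorm(n, sd = 2e4)
trainlm <- d[d$random <  0.7, ]
validlm <- d[d$random >= 0.7, ]

fit <- lm(Weekly_Sales ~ Fuel_Price, data = trainlm)

# R^2 = 1 - SSE / SST, evaluated on whichever data set is passed in.
r_squared <- function(fit, data, response = "Weekly_Sales") {
  y   <- data[[response]]
  sse <- sum((y - predict(fit, newdata = data))^2)
  sst <- sum((y - mean(y))^2)
  1 - sse / sst
}

r2_train <- r_squared(fit, trainlm)  # matches summary(fit)$r.squared
r2_valid <- r_squared(fit, validlm)  # out-of-sample fit
c(train = r2_train, valid = r2_valid)
```

A validation R^2 close to the training R^2, as reported above, suggests the model is not badly overfit.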