Project Overview

This project analyzes the drivers of Walmart’s weekly store-level sales using linear regression and predictive modeling in R.

The analysis progresses from simple models to a robust log-linear specification, balancing interpretability with predictive performance.

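Key findings:

* Store size is the dominant driver of weekly sales.
* Holiday weeks are associated with roughly 6–7% higher sales.
* CPI has a small but statistically significant negative relationship with sales.
* A log-linear specification gives the best fit among the models tested (adjusted R² of 0.71).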

This report is designed to be read end-to-end without running any code.

The analysis uses the tidyverse (2.0.0), tidymodels (1.2.0), and tidylog packages.

## Rows: 6435 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (7): Store, Temperature, Fuel_Price, CPI, Unemployment, Size, Weekly_Sales
## lgl  (1): IsHoliday
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## rename_with: renamed 9 variables (store, date, isholiday, temperature, fuel_price, …)
## # A tibble: 6 × 9
##   store date       isholiday temperature fuel_price   cpi unemployment   size
##   <dbl> <date>     <lgl>           <dbl>      <dbl> <dbl>        <dbl>  <dbl>
## 1     1 2010-04-16 FALSE            66.3       2.81  210.         7.81 151315
## 2     1 2012-04-06 FALSE            70.4       3.89  221.         7.14 151315
## 3     1 2010-08-06 FALSE            87.2       2.63  212.         7.79 151315
## 4     1 2010-02-05 FALSE            42.3       2.57  211.         8.11 151315
## 5     1 2012-08-17 FALSE            84.8       3.57  222.         6.91 151315
## 6     1 2011-02-04 FALSE            42.3       2.99  213.         7.74 151315
## # ℹ 1 more variable: weekly_sales <dbl>
## spc_tbl_ [6,435 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
##  $ date        : Date[1:6435], format: "2010-04-16" "2012-04-06" ...
##  $ isholiday   : logi [1:6435] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ temperature : num [1:6435] 66.3 70.4 87.2 42.3 84.8 ...
##  $ fuel_price  : num [1:6435] 2.81 3.89 2.63 2.57 3.57 ...
##  $ cpi         : num [1:6435] 210 221 212 211 222 ...
##  $ unemployment: num [1:6435] 7.81 7.14 7.79 8.11 6.91 ...
##  $ size        : num [1:6435] 151315 151315 151315 151315 151315 ...
##  $ weekly_sales: num [1:6435] 1105515 1505325 837329 1112467 1085133 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Store = col_double(),
##   ..   Date = col_date(format = ""),
##   ..   IsHoliday = col_logical(),
##   ..   Temperature = col_double(),
##   ..   Fuel_Price = col_double(),
##   ..   CPI = col_double(),
##   ..   Unemployment = col_double(),
##   ..   Size = col_double(),
##   ..   Weekly_Sales = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Simple Linear Regression Model

We will begin by running a simple linear model that regresses weekly sales on the Consumer Price Index (CPI).
## 
## Call:
## stats::lm(formula = weekly_sales ~ cpi, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -662386 -318443  -73868  258442 2095880 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 827280.5    21778.4  37.986  < 2e-16 ***
## cpi           -732.7      123.7  -5.923 3.33e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 390600 on 6433 degrees of freedom
## Multiple R-squared:  0.005423,   Adjusted R-squared:  0.005269 
## F-statistic: 35.08 on 1 and 6433 DF,  p-value: 3.332e-09
In this model, the intercept implies expected weekly sales of ~$827,280 for a store facing a theoretical CPI of 0. We also observe that the relationship between Weekly_Sales and CPI is negative: each one-unit increase in CPI is associated with a decrease of ~$733 in weekly sales, and each one-unit decrease with an increase of ~$733.
In evaluating the model statistics, we see an Adjusted R-Squared value of 0.005269. In other words, this model explains only about 0.5% of the variance in Walmart’s weekly sales. So, while our interpretation of the CPI coefficient is still valid, we must conclude that this model fails to explain the variance in our target variable.
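A minimal sketch of how a fit like this can be produced and read in base R. The data frame is simulated purely for illustration: the names `toy` and `fit_toy` and all values in it are assumptions, not the Walmart data, so the numbers will not match the output above.

```r
# Simulated stand-in for the store-week data; NOT the Walmart dataset.
set.seed(1)
toy <- data.frame(cpi = runif(200, min = 126, max = 227))
toy$weekly_sales <- 800000 - 700 * toy$cpi + rnorm(200, sd = 400000)

# Regress weekly sales on CPI alone.
fit_toy <- lm(weekly_sales ~ cpi, data = toy)

# Intercept: expected sales at CPI = 0; slope: change in sales per unit of CPI.
coefs <- coef(fit_toy)

# Share of variance explained, penalized for the number of predictors.
adj_r2 <- summary(fit_toy)$adj.r.squared

print(coefs)
print(adj_r2)
```

A statistically significant slope and a very low Adjusted R-Squared can coexist, as they do in the report's CPI-only model: the slope can be estimated precisely even when most of the variation in sales comes from other sources.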
What we observe here is that the impact of CPI can vary greatly by store/region. This is consistent with our evaluation of fit_cpi: that model explained only a small share (~0.5%) of the variance in Weekly_Sales, so we would expect to see these kinds of swings. With a (much) higher Adjusted R-Squared, these variations would look unusual.
## group_by: one grouping variable (store)
## filter (grouped): removed 315 rows (88%), 45 rows remaining (removed 0 groups, 45 groups remaining)
## # A tibble: 45 × 6
## # Groups:   store [45]
##    store term  estimate std.error statistic p.value
##    <dbl> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
##  1     1 cpi    -15806.    15501.   -1.02     0.310
##  2     2 cpi     30013.    28889.    1.04     0.301
##  3     3 cpi      9663.     7616.    1.27     0.207
##  4     4 cpi    -20934.    61271.   -0.342    0.733
##  5     5 cpi      6166.     5817.    1.06     0.291
##  6     6 cpi     10838.    19566.    0.554    0.581
##  7     7 cpi     -1555.    17927.   -0.0867   0.931
##  8     8 cpi     -7967.    10842.   -0.735    0.464
##  9     9 cpi     10544.     8460.    1.25     0.215
## 10    10 cpi    -79325.    58475.   -1.36     0.177
## # ℹ 35 more rows
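A table of per-store CPI slopes like the one above can be sketched in base R as follows. The per-store model specification is not shown in this report, so `weekly_sales ~ cpi` is an assumption here, and the three-store data frame `toy` is simulated for illustration only.

```r
# Simulated stand-in: 3 hypothetical stores, 50 weeks each; NOT the Walmart data.
set.seed(2)
toy <- data.frame(
  store = rep(1:3, each = 50),
  cpi   = c(runif(50, 210, 222), runif(50, 126, 131), runif(50, 182, 192))
)
toy$weekly_sales <- 1e6 + rnorm(150, sd = 1e5)

# Fit weekly_sales ~ cpi separately within each store and keep the CPI row
# of each coefficient table (estimate, std. error, t value, p value).
by_store <- split(toy, toy$store)
cpi_terms <- lapply(by_store, function(d) {
  coef(summary(lm(weekly_sales ~ cpi, data = d)))["cpi", ]
})
store_slopes <- do.call(rbind, cpi_terms)

print(store_slopes)
```

Each row of `store_slopes` corresponds to one store, which makes it easy to see how the sign and size of the CPI slope change from store to store.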
We see an interesting effect when we filter for one specific year. The clusters are nearly vertical because CPI is calculated geographically, with either Core Based Statistical Area (CBSA) or Metropolitan Statistical Area (MSA). CPI might be the same in a particular region, but different stores in that region will have different sales volume, hence the vertical clusters.
Although CPI varies by region, the deviation in CPI across time for a single region tends to be much lower, which is why we see such a slim range here. Since CPI is a measure of inflation, we expect to see these regional effects.
## 
## Call:
## stats::lm(formula = weekly_sales ~ cpi + size, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -563750 -167145  -29612  112172 1912650 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.828e+05  1.497e+04  12.216   <2e-16 ***
## cpi         -6.570e+02  7.692e+01  -8.542   <2e-16 ***
## size         4.847e+00  4.796e-02 101.048   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 242800 on 6432 degrees of freedom
## Multiple R-squared:  0.6156, Adjusted R-squared:  0.6155 
## F-statistic:  5151 on 2 and 6432 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_sales ~ cpi
## Model 2: weekly_sales ~ cpi + size
##   Res.Df        RSS Df  Sum of Sq     F    Pr(>F)    
## 1   6433 9.8128e+14                                  
## 2   6432 3.7924e+14  1 6.0204e+14 10211 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model that includes size as a predictor variable (fit_cpi_size) performs substantially better than fit_cpi. Its Adjusted R-Squared shows that it explains ~62% of the variance in weekly sales, and the ANOVA test confirms that the improvement from including size is statistically significant.
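The F-statistic in the ANOVA table can be reproduced, up to rounding of the printed values, from the residual sums of squares (RSS) and residual degrees of freedom (df) reported for the two models. Here the subscript 1 denotes fit_cpi and 2 denotes fit_cpi_size; the notation is introduced only for this worked example.

```latex
F = \frac{(\mathrm{RSS}_1 - \mathrm{RSS}_2)/(\mathrm{df}_1 - \mathrm{df}_2)}
         {\mathrm{RSS}_2/\mathrm{df}_2}
  = \frac{6.0204\times10^{14}/1}{3.7924\times10^{14}/6432}
  \approx 1.02\times10^{4}
```

This agrees with the reported value of 10211: the reduction in unexplained variation from adding size is about ten thousand times the average residual variation per degree of freedom.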
## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  827280.    21778.     38.0  4.65e-285
## 2 cpi            -733.      124.     -5.92 3.33e-  9
## # A tibble: 3 × 5
##   term         estimate  std.error statistic  p.value
##   <chr>           <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept) 182832.   14967.         12.2  6.08e-34
## 2 cpi           -657.      76.9        -8.54 1.63e-17
## 3 size             4.85     0.0480    101.   0
Note also that the magnitude of the CPI coefficient has shrunk from ~$733 to ~$657 in the revised model. This is because size now accounts for variance that was left unexplained by the previous model, which included only CPI.
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557148 -165608  -24125  112851 1918479 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.133e+05  3.546e+04   8.834  < 2e-16 ***
## isholidayTRUE  6.012e+04  1.196e+04   5.026 5.14e-07 ***
## temperature    1.002e+03  1.739e+02   5.761 8.72e-09 ***
## fuel_price    -1.333e+04  6.822e+03  -1.954   0.0507 .  
## cpi           -9.461e+02  8.445e+01 -11.203  < 2e-16 ***
## unemployment  -1.252e+04  1.725e+03  -7.258 4.40e-13 ***
## size           4.840e+00  4.802e-02 100.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241200 on 6428 degrees of freedom
## Multiple R-squared:  0.621,  Adjusted R-squared:  0.6206 
## F-statistic:  1755 on 6 and 6428 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_sales ~ cpi + size
## Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6432 3.7924e+14                                   
## 2   6428 3.7394e+14  4 5.3028e+12 22.789 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We observe a further, though slight, improvement in the Adjusted R-Squared value (from 0.6155 to 0.6206) in the new model (fit_full), which adds the remaining predictors while excluding the store and date columns. The ANOVA test confirms that the improvement in explanatory power is statistically significant.

More Linear Regression

We hypothesize that the effect of good weather is increased on holidays. We can test this by revising fit_full and including an interaction term.
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + isholiday * 
##     temperature, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557499 -165415  -24493  112914 1918376 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.148e+05  3.565e+04   8.830  < 2e-16 ***
## isholidayTRUE              4.745e+04  3.265e+04   1.453   0.1462    
## temperature                9.809e+02  1.808e+02   5.424 6.04e-08 ***
## fuel_price                -1.342e+04  6.826e+03  -1.966   0.0493 *  
## cpi                       -9.460e+02  8.446e+01 -11.200  < 2e-16 ***
## unemployment              -1.251e+04  1.725e+03  -7.254 4.53e-13 ***
## size                       4.840e+00  4.802e-02 100.779  < 2e-16 ***
## isholidayTRUE:temperature  2.473e+02  5.932e+02   0.417   0.6768    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241200 on 6427 degrees of freedom
## Multiple R-squared:  0.621,  Adjusted R-squared:  0.6206 
## F-statistic:  1504 on 7 and 6427 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date
## Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date + isholiday * temperature
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   6428 3.7394e+14                            
## 2   6427 3.7393e+14  1 1.0112e+10 0.1738 0.6768
Although the interaction coefficient in our fit_full_int model is positive (~$247 per degree), which is the direction we hypothesized, it is not statistically significant (p = 0.68), and the ANOVA test shows no statistically significant improvement over fit_full. We cannot conclude that the effect of temperature differs on holidays, nor that the model with the interaction term is an improvement.
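The comparison above can be sketched on simulated data as follows. The data frame `toy` and the model names `fit_main` and `fit_int` are hypothetical, and the simulated interaction is set to zero, so the numbers are illustrative only.

```r
# Simulated stand-in; NOT the Walmart data. The true interaction is zero.
set.seed(3)
n <- 500
toy <- data.frame(
  isholiday   = rbinom(n, 1, 0.07) == 1,
  temperature = runif(n, 0, 100)
)
toy$weekly_sales <- 1e6 + 6e4 * toy$isholiday + 1e3 * toy$temperature +
  rnorm(n, sd = 2e5)

# Main-effects model and the same model plus the interaction term.
fit_main <- lm(weekly_sales ~ isholiday + temperature, data = toy)
fit_int  <- lm(weekly_sales ~ isholiday * temperature, data = toy)

# Partial F-test: does the interaction reduce residual variance significantly?
cmp <- anova(fit_main, fit_int)
print(cmp)

# With a single added term, the F-test p-value equals the p-value of the
# t-test on the interaction coefficient.
p_f <- cmp[["Pr(>F)"]][2]
p_t <- coef(summary(fit_int))["isholidayTRUE:temperature", "Pr(>|t|)"]
```

This equivalence is why the ANOVA p-value (0.6768) and the p-value on isholidayTRUE:temperature in the model summary are identical in the output above.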
We’ll also test whether the effect of temperature on weekly sales is linear by adding a squared temperature term to the model.
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date + I(temperature^2), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -561455 -165260  -24674  112058 1911166 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.610e+05  4.111e+04   6.350 2.30e-10 ***
## isholidayTRUE     6.230e+04  1.199e+04   5.197 2.09e-07 ***
## temperature       3.294e+03  9.301e+02   3.542   0.0004 ***
## fuel_price       -1.471e+04  6.841e+03  -2.151   0.0315 *  
## cpi              -9.547e+02  8.449e+01 -11.300  < 2e-16 ***
## unemployment     -1.253e+04  1.724e+03  -7.268 4.09e-13 ***
## size              4.831e+00  4.811e-02 100.420  < 2e-16 ***
## I(temperature^2) -1.982e+01  7.901e+00  -2.509   0.0121 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 241100 on 6427 degrees of freedom
## Multiple R-squared:  0.6214, Adjusted R-squared:  0.621 
## F-statistic:  1507 on 7 and 6427 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date
## Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date + I(temperature^2)
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   6428 3.7394e+14                               
## 2   6427 3.7357e+14  1 3.6586e+11 6.2943 0.01214 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model output indicates a curvilinear, inverted U-shaped relationship (visualized below): the temperature coefficient is positive and the squared term is negative. People are less likely to shop retail on a freezing cold day. Increasing temperatures are associated with increased sales, but only to a point. As temperatures become excessive and dangerous, sales start to decrease.
If we were managing Walmart’s promotions, we could offer larger discounts when the weather is at either extreme and perhaps even increase the price of certain products when the temperature is mild.
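The temperature at which the fitted inverted U peaks can be worked out from the two temperature coefficients reported above. The symbols below are notation introduced only for this example: T is temperature, with coefficient 3294 on the linear term and -19.82 on the squared term.

```latex
\frac{\partial\,\widehat{\text{sales}}}{\partial T} = \beta_T + 2\,\beta_{T^2}\,T = 0
\quad\Longrightarrow\quad
T^{*} = -\frac{\beta_T}{2\,\beta_{T^2}} = \frac{3294}{2 \times 19.82} \approx 83.1
```

The result is in the dataset's temperature units (the values suggest degrees Fahrenheit, though the report does not state this). Predicted sales rise with temperature up to roughly this point and decline beyond it, so the peak sits toward the warm end of the range.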

Predictive Analytics

Now that we have a fairly robust model, we will use it to predict weekly sales revenue. The model is refit on a training subset of 4,826 rows and evaluated on the remaining 1,609 rows.
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - date - store + I(temperature^2), 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -557260 -165114  -25112  115048 1913671 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.546e+05  4.725e+04   5.389 7.42e-08 ***
## isholidayTRUE     6.038e+04  1.397e+04   4.323 1.57e-05 ***
## temperature       3.056e+03  1.068e+03   2.861  0.00424 ** 
## fuel_price       -1.939e+04  7.819e+03  -2.480  0.01316 *  
## cpi              -9.217e+02  9.640e+01  -9.561  < 2e-16 ***
## unemployment     -1.058e+04  1.992e+03  -5.312 1.14e-07 ***
## size              4.826e+00  5.496e-02  87.809  < 2e-16 ***
## I(temperature^2) -1.628e+01  9.058e+00  -1.797  0.07237 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239000 on 4818 degrees of freedom
## Multiple R-squared:  0.6248, Adjusted R-squared:  0.6242 
## F-statistic:  1146 on 7 and 4818 DF,  p-value: < 2.2e-16
## # A tibble: 8 × 5
##   term              estimate  std.error statistic  p.value
##   <chr>                <dbl>      <dbl>     <dbl>    <dbl>
## 1 (Intercept)      254647.   47252.          5.39 7.42e- 8
## 2 isholidayTRUE     60380.   13967.          4.32 1.57e- 5
## 3 temperature        3056.    1068.          2.86 4.24e- 3
## 4 fuel_price       -19393.    7819.         -2.48 1.32e- 2
## 5 cpi                -922.      96.4        -9.56 1.81e-21
## 6 unemployment     -10580.    1992.         -5.31 1.14e- 7
## 7 size                  4.83     0.0550     87.8  0       
## 8 I(temperature^2)    -16.3      9.06       -1.80 7.24e- 2
## rename: renamed one variable (Predicted_Sales)
## # A tibble: 1,609 × 10
##    Predicted_Sales store date       isholiday temperature fuel_price   cpi
##              <dbl> <dbl> <date>     <lgl>           <dbl>      <dbl> <dbl>
##  1         754860.     1 2010-02-05 FALSE            42.3       2.57  211.
##  2         215324.     3 2010-02-05 FALSE            45.7       2.57  214.
##  3        1006246.     6 2010-02-05 FALSE            40.4       2.57  213.
##  4         774226.     8 2010-02-05 FALSE            34.1       2.57  214.
##  5         585828.    12 2010-02-05 FALSE            49.5       2.96  126.
##  6         978757.    14 2010-02-05 FALSE            27.3       2.78  182.
##  7         528790.    17 2010-02-05 FALSE            23.1       2.67  126.
##  8         693424.    21 2010-02-05 FALSE            39.0       2.57  211.
##  9        1032041.    24 2010-02-05 FALSE            22.4       2.95  132.
## 10         220072.    36 2010-02-05 FALSE            46.0       2.54  210.
## # ℹ 1,599 more rows
## # ℹ 3 more variables: unemployment <dbl>, size <dbl>, weekly_sales <dbl>
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     247631.
## 2 mae     standard     181493.
These metrics indicate that our model’s predictions are off by ~$247,631 according to RMSE and ~$181,493 according to MAE. These numbers appear alarming until one recalls that weekly sales by Walmart store location range from about $70k to $2.8 million, with a mean of $740k and a median of $689k.
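For reference, both metrics are straightforward to compute by hand. The sketch below uses base R in place of the metric functions that produced the table above, and the four actual and predicted values are hypothetical.

```r
# Root mean squared error: penalizes large misses more heavily.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: the average size of a miss, in dollars.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Hypothetical actual and predicted weekly sales for four store-weeks.
actual    <- c(1000000, 500000, 750000, 1250000)
predicted <- c( 900000, 600000, 750000, 1050000)

rmse_val <- rmse(actual, predicted)
mae_val  <- mae(actual, predicted)
print(c(rmse = rmse_val, mae = mae_val))
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and the gap between the two widens when a few predictions miss badly.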
## 
## Call:
## stats::lm(formula = weekly_sales ~ . - store - date, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -553813 -165778  -24194  114613 1919644 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.981e+05  4.060e+04   7.344 2.42e-13 ***
## isholidayTRUE  5.846e+04  1.393e+04   4.197 2.76e-05 ***
## temperature    1.170e+03  1.998e+02   5.857 5.02e-09 ***
## fuel_price    -1.842e+04  7.802e+03  -2.360   0.0183 *  
## cpi           -9.137e+02  9.632e+01  -9.486  < 2e-16 ***
## unemployment  -1.056e+04  1.992e+03  -5.302 1.20e-07 ***
## size           4.832e+00  5.486e-02  88.094  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 239000 on 4819 degrees of freedom
## Multiple R-squared:  0.6245, Adjusted R-squared:  0.624 
## F-statistic:  1336 on 6 and 4819 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - date - store + I(temperature^2)
## Model 2: weekly_sales ~ (store + date + isholiday + temperature + fuel_price + 
##     cpi + unemployment + size) - store - date
##   Res.Df        RSS Df   Sum of Sq      F  Pr(>F)  
## 1   4818 2.7514e+14                                
## 2   4819 2.7533e+14 -1 -1.8445e+11 3.2299 0.07237 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## rename: renamed one variable (Predicted_Sales)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard     247845.
## 2 mae     standard     181757.
When we remove the squared temperature term, something notable occurs. First, the Adjusted R-Squared value diminishes very slightly (from 0.6242 to 0.6240), making the model marginally less appealing in terms of explaining the variance in weekly sales. However, the test-set error is essentially unchanged (RMSE of ~$247,845 versus ~$247,631, MAE of ~$181,757 versus ~$181,493), and the ANOVA test shows that the squared term is not statistically significant on the training data (p = 0.072). Since we are trying to build a reliable predictive model and the simpler specification predicts about as well, we exclude the term and conclude that fit_nosq is better for that purpose.

More Predictive Modeling

We are fairly pleased with both the explanatory and predictive power of fit_nosq but of course we would like to improve upon both metrics. One issue that we have not yet discussed is the variability in weekly sales across Walmart locations, as shown below.

Viewing the bar chart, we can see that weekly sales differ widely in scale across stores, so rescaling the weekly_sales variable could improve our model. One way to do this is to replace each value with its natural logarithm. We’ll use the log() function to accomplish this.
## mutate: new variable 'log_sales' (double) with 6,435 unique values and 0% NA
## # A tibble: 6,435 × 10
##    store date       isholiday temperature fuel_price   cpi unemployment   size
##    <dbl> <date>     <lgl>           <dbl>      <dbl> <dbl>        <dbl>  <dbl>
##  1     1 2010-04-16 FALSE            66.3       2.81  210.         7.81 151315
##  2     1 2012-04-06 FALSE            70.4       3.89  221.         7.14 151315
##  3     1 2010-08-06 FALSE            87.2       2.63  212.         7.79 151315
##  4     1 2010-02-05 FALSE            42.3       2.57  211.         8.11 151315
##  5     1 2012-08-17 FALSE            84.8       3.57  222.         6.91 151315
##  6     1 2011-02-04 FALSE            42.3       2.99  213.         7.74 151315
##  7     1 2012-08-03 FALSE            86.1       3.42  222.         6.91 151315
##  8     1 2012-04-20 FALSE            66.8       3.88  222.         7.14 151315
##  9     1 2012-07-06 FALSE            81.6       3.23  222.         6.91 151315
## 10     1 2010-09-03 FALSE            81.2       2.58  212.         7.79 151315
## # ℹ 6,425 more rows
## # ℹ 2 more variables: weekly_sales <dbl>, log_sales <dbl>
## 
## Call:
## stats::lm(formula = log_sales ~ . - store - date - weekly_sales, 
##     data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27631 -0.22829 -0.01924  0.22924  1.48007 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.247e+01  5.575e-02 223.594  < 2e-16 ***
## isholidayTRUE  6.378e-02  1.913e-02   3.334 0.000863 ***
## temperature    4.764e-04  2.744e-04   1.736 0.082580 .  
## fuel_price    -7.008e-03  1.072e-02  -0.654 0.513151    
## cpi           -1.185e-03  1.323e-04  -8.958  < 2e-16 ***
## unemployment  -4.849e-03  2.736e-03  -1.772 0.076401 .  
## size           8.095e-06  7.534e-08 107.444  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3283 on 4819 degrees of freedom
## Multiple R-squared:  0.7109, Adjusted R-squared:  0.7106 
## F-statistic:  1975 on 6 and 4819 DF,  p-value: < 2.2e-16
This final model uses a log-linear regression to explain weekly Walmart sales using store-level, economic, and seasonal predictors. Applying a log transformation to weekly sales substantially improves model performance, yielding an adjusted R² of 0.71, and stabilizes variance across stores with vastly different revenue scales. Overall model fit is strong, with well-behaved residuals and a highly significant F-statistic, indicating that the included predictors jointly explain a meaningful share of sales variation.
Results show that store size is the dominant driver of weekly sales, dwarfing macroeconomic effects and confirming that physical scale largely determines revenue potential. Holiday weeks are associated with an average 6–7% increase in sales, validating the importance of seasonal demand spikes. Inflation, proxied by CPI, has a small but statistically significant negative relationship with sales, even after controlling for store characteristics. Temperature and unemployment exhibit modest, only marginally significant effects (p ≈ 0.08) whose signs are consistent with economic intuition, while fuel prices do not appear to meaningfully impact sales once other factors are accounted for.
This specification represents the best balance between interpretability, explanatory power, and robustness among the models tested. The log transformation enables clear percentage-based interpretations while materially improving fit relative to linear alternatives, making the model suitable for both analytical insight and downstream forecasting.
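The percentage-based reading of the log-linear coefficients can be sketched as follows. The coefficient values are copied from the model output above; the helper `pct_change` and the choice of a 10,000-unit change in size are illustrative assumptions.

```r
# Coefficients copied from the log-linear model output in this report.
b_holiday <- 6.378e-02   # isholidayTRUE
b_cpi     <- -1.185e-03  # per one-unit increase in CPI
b_size    <- 8.095e-06   # per additional unit of size

# In a log-linear model, a change of `delta` in a predictor multiplies
# predicted sales by exp(beta * delta), holding the other predictors fixed,
# which is a percentage change of 100 * (exp(beta * delta) - 1).
pct_change <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

pct_holiday <- pct_change(b_holiday)
pct_cpi     <- pct_change(b_cpi)
pct_size10k <- pct_change(b_size, delta = 10000)  # illustrative step size

print(round(c(holiday = pct_holiday, cpi = pct_cpi, size_10k = pct_size10k), 2))
```

The holiday coefficient translates to an increase of about 6.6%, in line with the 6–7% range quoted above. For small coefficients such as CPI's, the percentage change is very close to 100 times the coefficient itself.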

Limitations

* The model does not explicitly account for store-level fixed effects or regional hierarchies, which may mask persistent location-specific dynamics.
* Temporal structure is handled implicitly; autocorrelation and seasonality are not directly modeled.
* The analysis assumes linear relationships on the log scale and may understate nonlinear or interaction effects beyond those tested.
* CPI and unemployment are measured at broader geographic levels and may not fully capture local economic conditions.

Next Steps

* Implement mixed-effects (hierarchical) models to capture store-specific variation.
* Explore time-series approaches for improved short-term forecasting.
* Incorporate promotional data, foot traffic, or local demographic variables to enhance predictive accuracy.