2024-04-07

Walmart Sales Data Set

Walmart Sales is a Kaggle data set containing sales, temperature, fuel price, consumer price index, and unemployment data. The following slides will explore the relationships within the data using linear regression.

Here is a snippet:

  Weekly_Sales Holiday_Flag Temperature Fuel_Price      CPI
1      1643691            0       42.31      2.572 211.0964
2      1641957            1       38.51      2.548 211.2422
3      1611968            0       39.93      2.514 211.2891
4      1409728            0       46.63      2.561 211.3196
5      1554807            0       46.50      2.625 211.3501
6      1439542            0       57.79      2.667 211.3806

Effect of Temperature and Holidays on Sales

The slope of this linear regression appears to be negative, indicating a negative relationship between Sales and Temperature. (Note: the Holiday_Flag grouping is included only as a visual aid within this slide.)

Regression Insights

To further investigate, the function lm() allows for insights into the relationship between Sales and Temperature.

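# Weekly_Sales is the response (dependent) variable; Temperature is the predictor.
DependentV <- "Weekly_Sales"
IndependentV <- "Temperature"
SalesTemp_df <- WalmartSales_DATA[, c(DependentV, IndependentV)]

# Regress weekly sales on temperature.
lm_model <- lm(SalesTemp_df$Weekly_Sales ~ SalesTemp_df$Temperature)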

Code Chunk Output:

Call:
lm(formula = SalesTemp_df$Weekly_Sales ~ SalesTemp_df$Temperature)

Residuals:
    Min      1Q  Median      3Q     Max
-871164 -488496  -91696  386226 2713005

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              1165406.0    24139.0  48.279  < 2e-16 ***
SalesTemp_df$Temperature   -1952.4      380.7  -5.128 3.01e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 563300 on 6433 degrees of freedom
Multiple R-squared:  0.004072,  Adjusted R-squared:  0.003917
F-statistic:  26.3 on 1 and 6433 DF,  p-value: 3.008e-07

Output Explanation

Linear regression: \(WeeklySales = 1165406.0 - 1952.4 \times Temp. + \epsilon\)
Correlation coefficient: \(|r| = \sqrt{R^2} = \sqrt{0.004072} \approx 0.0638\); because the slope is negative, \(r \approx -0.0638\)
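
The equation and correlation coefficient above can be read off a fitted model object rather than copied by hand. The following is a minimal sketch, assuming a stand-in data frame built from only the six rows shown in the snippet earlier (not the full data set), so its numbers will not match the slide output. It uses the `data =` form of lm() purely for readability.

```r
# Stand-in data: the six snippet rows only, NOT the full Kaggle data set.
WalmartSales_DATA <- data.frame(
  Weekly_Sales = c(1643691, 1641957, 1611968, 1409728, 1554807, 1439542),
  Temperature  = c(42.31, 38.51, 39.93, 46.63, 46.50, 57.79)
)

lm_model <- lm(Weekly_Sales ~ Temperature, data = WalmartSales_DATA)

# Pull the fitted coefficients and R-squared from the model object.
intercept <- unname(coef(lm_model)["(Intercept)"])
slope     <- unname(coef(lm_model)["Temperature"])
r_squared <- summary(lm_model)$r.squared

# In simple linear regression, r carries the sign of the slope.
r <- sign(slope) * sqrt(r_squared)

# Cross-check against the sample correlation computed directly.
r_direct <- cor(WalmartSales_DATA$Temperature, WalmartSales_DATA$Weekly_Sales)
```

Here `r` and `r_direct` should agree up to floating-point error, which is a quick way to confirm the sign of the correlation.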

Sales vs. Temp Relationship Conclusion:

  • The slope of the line, -1952.4, indicates a negative relationship between Temperature and Sales: Sales decrease by about $1,952 for every one-unit increase in Temperature.
  • The coefficient of determination, 0.004072, and the correlation coefficient indicate a weak linear relationship between the variables, with only 0.4% of the variability in Sales explained by Temperature.
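
To make the slope's interpretation concrete, here is a small sketch that plugs two temperatures into the fitted line. The coefficients are copied from the output above; the temperatures 50 and 60 and the helper name `predict_sales` are hypothetical choices for illustration only.

```r
# Fitted line from the slide output (coefficients copied from the summary).
predict_sales <- function(temperature) {
  1165406.0 - 1952.4 * temperature
}

# Hypothetical temperatures chosen only for illustration.
sales_at_50 <- predict_sales(50)
sales_at_60 <- predict_sales(60)

# Predicted drop in weekly sales for a 10-unit rise in temperature.
drop_per_10 <- sales_at_50 - sales_at_60
```

A 10-unit rise in temperature corresponds to a predicted drop of 10 × 1952.4 = $19,524, which is small next to the weekly sales of roughly $1.4 to $1.6 million seen in the snippet.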

Continuing, Sales will be examined against other variables to determine if they have stronger linear relationships.

Effect of Unemployment on Sales

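For each one-unit increase in Unemployment, Sales decrease by $31,944 on average. The linear relationship, indicated by r, is weak, with a magnitude of only .106, yet the t-value (-8.564) is large. The two do not conflict: r describes how much of the variation in Sales is associated with Unemployment, while the t-value tests whether the slope differs from zero, and with several thousand observations even a weak relationship is statistically significant.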

\(WeeklySales = 1302485 - 31944 \times Unemploy. + \epsilon\)
\(|r| = \sqrt{.01127} \approx .1062\); because the slope is negative, \(r \approx -.1062\)
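
In simple linear regression the slope's t-value is tied to r and the sample size by t = r·sqrt(n - 2) / sqrt(1 - r²), which is why a weak r can still come with a large t-value. The sketch below checks this using the figures on these slides. The helper name `t_from_r` is hypothetical, and n is taken from the Temperature output (6433 residual degrees of freedom plus 2 coefficients), on the assumption that the Unemployment model was fit on the same rows.

```r
# Convert the correlation coefficient r of a simple linear regression
# with n observations into the t statistic of the slope.
t_from_r <- function(r, n) {
  r * sqrt(n - 2) / sqrt(1 - r^2)
}

# Assumption: same n as the Temperature model (6433 residual df + 2 coefficients).
n <- 6433 + 2

r_unemployment <- -sqrt(0.01127)    # negative because the slope is negative
t_unemployment <- t_from_r(r_unemployment, n)

r_temperature <- -sqrt(0.004072)    # negative because the slope is negative
t_temperature <- t_from_r(r_temperature, n)
```

Both values should land close to the reported t-values (-8.564 and -5.128), with any small difference due to the rounding of R-squared.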

Effect of Consumer Price Index on Sales

Notably, the data set contains no observations with a CPI between 150 and 175. With the data available, the fitted slope is -1041.6: Sales decrease by about $1,042 for each one-unit increase in CPI. As in the other examples, r shows weak linearity while statistical significance is high.

\(WeeklySales = 1225673.7 - 1041.6 \times CPI + \epsilon\)
\(|r| = \sqrt{.005276} \approx .0726\); because the slope is negative, \(r \approx -.0726\)

Data Set Conclusion

All three independent variables show weak linear relationships with Sales that are nonetheless highly statistically significant. Of the three, Unemployment has the strongest relationship with Sales, with the largest r. Next steps include identifying which products are most affected by the decreased purchasing.
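
The per-variable comparison behind this conclusion could be automated along the following lines. This is a sketch only: the helper name `compare_predictors` is hypothetical, and the stand-in data frame holds just the six snippet rows, which do not include an Unemployment column, so Fuel_Price stands in as the third predictor here.

```r
# Fit one simple regression per predictor and collect slope, R-squared and t-value.
compare_predictors <- function(df, predictors, response = "Weekly_Sales") {
  rows <- lapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = df)
    s   <- summary(fit)
    data.frame(
      predictor = p,
      slope     = unname(coef(fit)[2]),
      r_squared = s$r.squared,
      t_value   = s$coefficients[2, "t value"]
    )
  })
  out <- do.call(rbind, rows)
  out[order(-out$r_squared), ]   # strongest linear relationship first
}

# Stand-in data: the six snippet rows only, NOT the full data set.
snippet_df <- data.frame(
  Weekly_Sales = c(1643691, 1641957, 1611968, 1409728, 1554807, 1439542),
  Temperature  = c(42.31, 38.51, 39.93, 46.63, 46.50, 57.79),
  Fuel_Price   = c(2.572, 2.548, 2.514, 2.561, 2.625, 2.667),
  CPI          = c(211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.3806)
)

comparison <- compare_predictors(snippet_df, c("Temperature", "Fuel_Price", "CPI"))
```

On the full data set the call would presumably list Temperature, Unemployment, and CPI instead, assuming the unemployment column is named `Unemployment`.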