Problem Statement

One of the leading retail stores in the US, Walmart, would like to predict the sales and demand accurately. There are certain events and holidays which impact sales on each day. There are sales data available for 45 stores of Walmart. The business is facing a challenge due to unforeseen demands and runs out of stock some times, due to the inappropriate machine learning algorithm.

An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of all, which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Analysis Tasks

Basic Statistics tasks • Which store has maximum sales • Which store has maximum standard deviation i.e., the sales vary a lot. Also, find out the coefficient of mean to standard deviation • Which store/s has good quarterly growth rate in Q3’2012 • Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together • Provide a monthly and semester view of sales in units and give insights

Statistical Model

For Store 1 – Build prediction models to forecast demand • Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order). Hypothesize if CPI, unemployment, and fuel price have any impact on sales. • Change dates into days by creating new variable. Select the model which gives best accuracy

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2) #data visualization
library(tibble) #handle tibbles
library(dplyr) #data manipulation
library(caret) # for data splitting, pre-processing,feature selection etc.
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(lubridate) # to work in ease with dates
library(lmtest) #Testing linear regression Model
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(corrplot)
## corrplot 0.92 loaded
#For multicollinearity (VIF)
library(sp)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
library(raster)
## 
## Attaching package: 'raster'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(usdm) 
## Loading required package: terra
## terra 1.7.46
## 
## Attaching package: 'terra'
## 
## The following object is masked from 'package:zoo':
## 
##     time<-
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:usdm':
## 
##     vif
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

walmart_sales Description

1.This is the historical data which covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales. Within this file you will find the following fields: 2.Store - the store number. 3.Date - the week of sales. 4.Weekly_Sales - sales for the given store. 5.Holiday_Flag - whether the week is a special holiday week 1 – Holiday week 0 – Non-holiday week. 6.Temperature - Temperature on the day of sale. 7.Fuel_Weekly_Sales - Cost of fuel in the region. 8.CPI – Prevailing consumer Weekly_Sales index. 9.Unemployment - Prevailing unemployment rate.

get the working directory

getwd()
## [1] "F:/cli/Data_science_portfolios/simplilearn_data_scientist/Data Science with R Programming/projects"

set the working directory

setwd("F:/cli/Data_science_portfolios/R_projects/")

read csv or import data as a dataframe

walmart_sales<- read.csv('F:/cli/Data_science_portfolios/R_projects/Walmart_Store_sales.csv',header=TRUE,",")

view the walmart sales data

#View(walmart_sales)
head(walmart_sales)
##   Store       Date Weekly_Sales Holiday_Flag Temperature Fuel_Price      CPI
## 1     1 05-02-2010      1643691            0       42.31      2.572 211.0964
## 2     1 12-02-2010      1641957            1       38.51      2.548 211.2422
## 3     1 19-02-2010      1611968            0       39.93      2.514 211.2891
## 4     1 26-02-2010      1409728            0       46.63      2.561 211.3196
## 5     1 05-03-2010      1554807            0       46.50      2.625 211.3501
## 6     1 12-03-2010      1439542            0       57.79      2.667 211.3806
##   Unemployment
## 1        8.106
## 2        8.106
## 3        8.106
## 4        8.106
## 5        8.106
## 6        8.106

Structure of Data Frame

#View(walmart_sales)
str(walmart_sales)
## 'data.frame':    6435 obs. of  8 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
##  $ Weekly_Sales: num  1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num  211 211 211 211 211 ...
##  $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...

Observations

The Data frame has 6435 observations of 8 Variables.

Get the Storewise aggregate sales in a Data Frame

aggregate_sales <- aggregate(Weekly_Sales~ Store, data = walmart_sales, FUN = sum, na.rm = TRUE)
aggregate_sales
##    Store Weekly_Sales
## 1      1    222402809
## 2      2    275382441
## 3      3     57586735
## 4      4    299543953
## 5      5     45475689
## 6      6    223756131
## 7      7     81598275
## 8      8    129951181
## 9      9     77789219
## 10    10    271617714
## 11    11    193962787
## 12    12    144287230
## 13    13    286517704
## 14    14    288999911
## 15    15     89133684
## 16    16     74252425
## 17    17    127782139
## 18    18    155114734
## 19    19    206634862
## 20    20    301397792
## 21    21    108117879
## 22    22    147075649
## 23    23    198750618
## 24    24    194016021
## 25    25    101061179
## 26    26    143416394
## 27    27    253855917
## 28    28    189263681
## 29    29     77141554
## 30    30     62716885
## 31    31    199613906
## 32    32    166819246
## 33    33     37160222
## 34    34    138249763
## 35    35    131520672
## 36    36     53412215
## 37    37     74202740
## 38    38     55159626
## 39    39    207445542
## 40    40    137870310
## 41    41    181341935
## 42    42     79565752
## 43    43     90565435
## 44    44     43293088
## 45    45    112395341

Get the store having maximum Sales

which.max(aggregate_sales$Weekly_Sales)
## [1] 20

Get the Storewise standard deviation in a Data Frame.

aggregate_sales_sd <- aggregate(Weekly_Sales~ Store, data = walmart_sales, FUN = sd , na.rm = TRUE)
aggregate_sales_sd
##    Store Weekly_Sales
## 1      1    155980.77
## 2      2    237683.69
## 3      3     46319.63
## 4      4    266201.44
## 5      5     37737.97
## 6      6    212525.86
## 7      7    112585.47
## 8      8    106280.83
## 9      9     69028.67
## 10    10    302262.06
## 11    11    165833.89
## 12    12    139166.87
## 13    13    265507.00
## 14    14    317569.95
## 15    15    120538.65
## 16    16     85769.68
## 17    17    112162.94
## 18    18    176641.51
## 19    19    191722.64
## 20    20    275900.56
## 21    21    128752.81
## 22    22    161251.35
## 23    23    249788.04
## 24    24    167745.68
## 25    25    112976.79
## 26    26    110431.29
## 27    27    239930.14
## 28    28    181758.97
## 29    29     99120.14
## 30    30     22809.67
## 31    31    125855.94
## 32    32    138017.25
## 33    33     24132.93
## 34    34    104630.16
## 35    35    211243.46
## 36    36     60725.17
## 37    37     21837.46
## 38    38     42768.17
## 39    39    217466.45
## 40    40    119002.11
## 41    41    187907.16
## 42    42     50262.93
## 43    43     40598.41
## 44    44     24762.83
## 45    45    130168.53

Get the store having maximum standard deviation.

which.max(aggregate_sales_sd$Weekly_Sales)
## [1] 14

Get the Storewise mean Sales in a Data Frame.

aggregate_sales_mean <- aggregate(Weekly_Sales~ Store, data = walmart_sales, FUN = mean , na.rm = TRUE)

aggregate_sales_mean
##    Store Weekly_Sales
## 1      1    1555264.4
## 2      2    1925751.3
## 3      3     402704.4
## 4      4    2094713.0
## 5      5     318011.8
## 6      6    1564728.2
## 7      7     570617.3
## 8      8     908749.5
## 9      9     543980.6
## 10    10    1899424.6
## 11    11    1356383.1
## 12    12    1009001.6
## 13    13    2003620.3
## 14    14    2020978.4
## 15    15     623312.5
## 16    16     519247.7
## 17    17     893581.4
## 18    18    1084718.4
## 19    19    1444999.0
## 20    20    2107676.9
## 21    21     756069.1
## 22    22    1028501.0
## 23    23    1389864.5
## 24    24    1356755.4
## 25    25     706721.5
## 26    26    1002911.8
## 27    27    1775216.2
## 28    28    1323522.2
## 29    29     539451.4
## 30    30     438579.6
## 31    31    1395901.4
## 32    32    1166568.2
## 33    33     259861.7
## 34    34     966781.6
## 35    35     919725.0
## 36    36     373512.0
## 37    37     518900.3
## 38    38     385731.7
## 39    39    1450668.1
## 40    40     964128.0
## 41    41    1268125.4
## 42    42     556403.9
## 43    43     633324.7
## 44    44     302748.9
## 45    45     785981.4

Get the store having maximum mean sales.

max(aggregate_sales_mean$Weekly_Sales)
## [1] 2107677

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.