I’m using relative paths, where the root folder is the location of the R project file for this repository.
To access files in the repo, use paths of the form: “tutorials/tutorial1/data/file.csv”
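An alternative (not used in this repository) is the here package, which builds paths relative to the project root regardless of the current working directory. A minimal sketch, assuming the package is installed:
library(here) # resolves paths from the R project root
# e.g. read a file in the repo with a root-relative path:
# read.csv(here("tutorials", "tutorial1", "data", "file.csv"))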
Importing libraries:
library(dplyr) # for performing data joins
[You can skip to Question 1]
Data available at: https://www.kaggle.com/datasets/manjeetsingh/retaildataset/discussion?sort=hotness
The data is publicly available on Kaggle.
There are three related datasets: sales, features, and stores.
Importing Data:
sales = read.csv("assignments/assignment1/data/sales_data.csv")
features = read.csv("assignments/assignment1/data/features_data.csv")
stores = read.csv("assignments/assignment1/data/stores_data.csv")
Joining data:
str(sales)
## 'data.frame': 421570 obs. of 5 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Dept : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
## $ Weekly_Sales: num 24924 46039 41596 19404 21828 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
str(features)
## 'data.frame': 8190 obs. of 12 variables:
## $ Store : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : chr "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
## $ Temperature : num 42.3 38.5 39.9 46.6 46.5 ...
## $ Fuel_Price : num 2.57 2.55 2.51 2.56 2.62 ...
## $ MarkDown1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ MarkDown5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ CPI : num 211 211 211 211 211 ...
## $ Unemployment: num 8.11 8.11 8.11 8.11 8.11 ...
## $ IsHoliday : logi FALSE TRUE FALSE FALSE FALSE FALSE ...
str(stores)
## 'data.frame': 45 obs. of 3 variables:
## $ Store: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Type : chr "A" "A" "B" "A" ...
## $ Size : int 151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...
Sales is the main dataset, onto which we can join Features using Store and Date.
sales_and_features = left_join(sales, features, by = c("Store","Date"))
head(sales_and_features)
## Store Dept Date Weekly_Sales IsHoliday.x Temperature Fuel_Price
## 1 1 1 05/02/2010 24924.50 FALSE 42.31 2.572
## 2 1 1 12/02/2010 46039.49 TRUE 38.51 2.548
## 3 1 1 19/02/2010 41595.55 FALSE 39.93 2.514
## 4 1 1 26/02/2010 19403.54 FALSE 46.63 2.561
## 5 1 1 05/03/2010 21827.90 FALSE 46.50 2.625
## 6 1 1 12/03/2010 21043.39 FALSE 57.79 2.667
## MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
## 1 NA NA NA NA NA 211.0964 8.106
## 2 NA NA NA NA NA 211.2422 8.106
## 3 NA NA NA NA NA 211.2891 8.106
## 4 NA NA NA NA NA 211.3196 8.106
## 5 NA NA NA NA NA 211.3501 8.106
## 6 NA NA NA NA NA 211.3806 8.106
## IsHoliday.y
## 1 FALSE
## 2 TRUE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
The Stores dataset just gives information about individual stores, so we can join it on the Store number.
retail_data = left_join(sales_and_features, stores, by = "Store")
head(retail_data)
## Store Dept Date Weekly_Sales IsHoliday.x Temperature Fuel_Price
## 1 1 1 05/02/2010 24924.50 FALSE 42.31 2.572
## 2 1 1 12/02/2010 46039.49 TRUE 38.51 2.548
## 3 1 1 19/02/2010 41595.55 FALSE 39.93 2.514
## 4 1 1 26/02/2010 19403.54 FALSE 46.63 2.561
## 5 1 1 05/03/2010 21827.90 FALSE 46.50 2.625
## 6 1 1 12/03/2010 21043.39 FALSE 57.79 2.667
## MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment
## 1 NA NA NA NA NA 211.0964 8.106
## 2 NA NA NA NA NA 211.2422 8.106
## 3 NA NA NA NA NA 211.2891 8.106
## 4 NA NA NA NA NA 211.3196 8.106
## 5 NA NA NA NA NA 211.3501 8.106
## 6 NA NA NA NA NA 211.3806 8.106
## IsHoliday.y Type Size
## 1 FALSE A 151315
## 2 TRUE A 151315
## 3 FALSE A 151315
## 4 FALSE A 151315
## 5 FALSE A 151315
## 6 FALSE A 151315
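As a quick sanity check (not part of the original workflow), we can confirm that the left joins preserved every row of the sales data; the sketch below assumes the objects created above:
nrow(sales) # 421570 rows in the original sales data
nrow(sales_and_features) # should match, if features has one row per Store-Date pair
nrow(retail_data) # should also match, since stores has one row per Store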
Adjusting the data for plotting: converting Date from character to the Date class.
retail_data$Date = as.Date(retail_data$Date, format = "%d/%m/%Y")
Now we have the final dataset, called retail_data, which will be used throughout the assignments.
Motivate your choice of variables. Why are you interested in analyzing them? (“For-fun” purposes are also acceptable given that this is a non-graded assignment.)
I used to work in marketing for some time and often got involved with the retail data of my clients. Back then I ran several projects involving statistics/forecasting (e.g. Causal Impact studies).
Since I already have experience with this kind of data, I find it comfortable and fun to work with.
Sometimes my former colleagues come to me for advice, so it will be exciting to have something to share with them.
Which variable is primarily interesting for you?
The variable of primary interest is Weekly_Sales. Predicting future sales can be crucial for resource planning, inventory management, and promotional strategies, especially in the retail industry where demand fluctuates due to various factors.
What other related variables are you using?
In addition to Weekly_Sales, I am using the following related variables:
Store: Identifies the store number.
Dept: Identifies the department number.
Date: The start date of the week.
Temperature: Average temperature in the region.
Fuel_Price: Cost of fuel in the region.
MarkDown1-5: Anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011 and is not available for all stores at all times; any missing value is marked with an NA (see the missingness sketch after this list).
CPI: The Consumer Price Index.
Unemployment: The unemployment rate.
IsHoliday: Indicates whether a particular week includes a major holiday.
Type: Indicates the category of the store.
Size: The size of the store (units are not mentioned on the Kaggle page).
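Since the MarkDown columns are only partially observed, here is a short sketch (assuming retail_data from the joins above) to quantify their missingness:
markdown_cols <- paste0("MarkDown", 1:5) # MarkDown1 ... MarkDown5
colSums(is.na(retail_data[, markdown_cols])) # NA count per markdown column
mean(!is.na(retail_data$MarkDown1)) # share of rows where MarkDown1 is observed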
Give an intuition as to why the other variables you use can explain the variable of primary interest.
Store: Different stores may experience varying sales trends due to their geographical location, customer base, and local market conditions.
Dept: Sales patterns can differ across departments. For example, some departments may have steady sales year-round, while others (e.g., seasonal departments) see peaks during specific times of the year, such as holidays or promotions.
Date: The date is crucial for understanding the seasonality of sales. Retail sales typically show clear seasonal patterns, with higher sales around holidays and specific times of the year (e.g., back-to-school season, Black Friday).
Temperature: Weather conditions can significantly affect foot traffic and consumer behavior. For example, colder weather might reduce store visits, while warmer temperatures could increase sales of seasonal products (e.g., summer clothing, outdoor goods).
Fuel_Price: Higher fuel prices can reduce consumers’ disposable income, leading to reduced spending on non-essential items, and vice versa.
MarkDown1-5: Promotions and markdowns are key drivers of increased sales. Understanding how markdowns affect different departments can help in forecasting spikes in sales during promotional periods.
CPI (Consumer Price Index): The CPI measures inflation. Higher CPI values indicate that goods are becoming more expensive, which could reduce consumer purchasing power and decrease sales, especially for non-essential items.
Unemployment: Higher unemployment rates usually correlate with lower consumer confidence and reduced spending, which would likely lead to lower sales.
IsHoliday: Holidays are crucial for forecasting sales spikes, as consumers tend to spend more during these periods, particularly around major holidays like Thanksgiving and Christmas.
Type: The type of store likely reflects different store formats, product ranges, or customer bases. Some types may have higher sales due to broader product availability and customer appeal.
Size: Larger stores generally have more capacity for inventory and may serve a broader customer base, leading to higher sales. Smaller stores, on the other hand, may focus on specific customer segments or products, leading to different sales trends.
Would it be possible to look at another set of explanatory variables instead?
Foot traffic data: If available, data on the number of visitors to the store would provide direct insight into consumer behavior and sales patterns.
Competitor prices: Information about competitors’ prices or promotional strategies could help in understanding how competition affects sales.
Online sales: If available, integrating online sales data with in-store sales would give a more complete picture of the company’s total sales performance.
Product prices: The actual price of products in each department could be used to analyze how price changes influence sales, particularly in relation to markdowns.
Interest Rates: When interest rates are high, borrowing becomes more expensive, reducing consumers’ disposable income and potentially leading to lower spending on non-essential goods.
Describe the data characteristics: time span, frequency, etc.
Time span: The sales and features data spans from February 5, 2010, to November 1, 2012, which gives approximately 2 years and 9 months of weekly data.
Frequency: The data is recorded on a weekly basis for each store and department, allowing for detailed time-series analysis (see the quick check below).
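These characteristics can be checked directly from the data; a minimal sketch, assuming retail_data$Date has already been converted to the Date class as above:
range(retail_data$Date) # first and last observation dates
table(diff(sort(unique(retail_data$Date)))) # gaps between consecutive dates (7 days = weekly)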
Use ts() function to convert the numeric values to a time series object. Plot your main variable of interest and describe the characteristics in the visual.
aggregated_data <- aggregate(Weekly_Sales ~ Date, data = retail_data, sum) # total weekly sales across all stores and departments
ts.plot(aggregated_data$Weekly_Sales) # plot the aggregated series against the observation index
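To follow the question literally and use ts(), the aggregated series can also be converted to a time series object first. This is only a sketch: the start point (week 6 of 2010, for the first observation on 5 February 2010) and frequency = 52 (approximate number of weeks per year) are assumptions.
sales_ts <- ts(aggregated_data$Weekly_Sales, start = c(2010, 6), frequency = 52)
plot(sales_ts, xlab = "Year", ylab = "Total Weekly_Sales") # same series, now with a calendar time axis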
Trend
The series appears to be relatively flat over time, with no clear long-term increasing or decreasing trend in the aggregated weekly sales.
Seasonality
There are recurring peaks visible at roughly regular intervals. These could correspond to seasonal events like holidays (e.g., Black Friday, Christmas) or promotional markdown periods, where sales tend to spike due to increased consumer spending.
Outside of the high peaks, there are smaller regular movements as well, which could be influenced by promotional events, markdowns, or smaller seasonal changes such as summer vacations and minor holidays.
Plot the correlogram and argue whether the variable is stationary or nonstationary. How does your conclusion compare to the data plot?
acf(aggregated_data$Weekly_Sales)
The ACF plot shows significant autocorrelation at lag 1, which then quickly diminishes after a few lags. This suggests that the series might have some short-term dependencies but does not show long-run autocorrelation.
pacf(aggregated_data$Weekly_Sales)
The first lag shows a significant positive partial autocorrelation. After lag 1, most lags are insignificant, meaning there is no strong additional structure beyond lag 1.
Conclusion:
Based on the fast drop-off in autocorrelation in the ACF and the significant spike only at lag 1 in the PACF, the data appears close to stationary, but with some short-term dependencies.
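As an optional formal check that is not part of the original analysis, an Augmented Dickey-Fuller test (adf.test() from the tseries package, assumed installed) could complement the visual inspection; rejecting the null hypothesis of a unit root would support the stationarity conclusion.
library(tseries) # for adf.test()
adf.test(aggregated_data$Weekly_Sales) # H0: unit root (nonstationary); a small p-value suggests stationarity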
If your variable is nonstationary, plot the correlogram in first differences using diff() function. (Read the help file if necessary.)
Since the analysis in Q4 indicates that the series is stationary, differencing the data using the diff() function is not necessary.
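For completeness, this is how the first-difference correlogram could be produced if it were needed; a purely illustrative sketch, since the series is judged stationary above.
sales_diff <- diff(aggregated_data$Weekly_Sales) # first differences of aggregated weekly sales
acf(sales_diff) # correlogram of the differenced series
pacf(sales_diff) # partial correlogram of the differenced series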