Setup:

I’m using relative paths, where the root folder is the location of the R project file for this repository.

To access files in the repo, use a path such as: “tutorials/tutorial1/data/file.csv” [example]

Importing libraries:

library(dplyr) # for performing data joins

Assignment 1

Introduction and Data Import

[You can skip to Question 1]

Data available at: https://www.kaggle.com/datasets/manjeetsingh/retaildataset/discussion?sort=hotness

  • the data is publicly available on Kaggle

  • there are three related datasets (sales, features, stores)

Importing Data:

sales = read.csv("assignments/assignment1/data/sales_data.csv")

features = read.csv("assignments/assignment1/data/features_data.csv")

stores = read.csv("assignments/assignment1/data/stores_data.csv")
  • since we have three different datasets, we need to join them together

Joining data:

str(sales)
## 'data.frame':    421570 obs. of  5 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Dept        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
##  $ Weekly_Sales: num  24924 46039 41596 19404 21828 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
str(features)
## 'data.frame':    8190 obs. of  12 variables:
##  $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr  "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
##  $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
##  $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CPI         : num  211 211 211 211 211 ...
##  $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...
##  $ IsHoliday   : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
str(stores)
## 'data.frame':    45 obs. of  3 variables:
##  $ Store: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Type : chr  "A" "A" "B" "A" ...
##  $ Size : int  151315 202307 37392 205863 34875 202505 70713 155078 125833 126512 ...

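Sales is the main dataset, onto which we can join Features using Store and Date. Because IsHoliday appears in both tables but is not part of the join key, the result contains IsHoliday.x (from Sales) and IsHoliday.y (from Features).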

sales_and_features = left_join(sales, features, by = c("Store","Date"))

head(sales_and_features)
##   Store Dept       Date Weekly_Sales IsHoliday.x Temperature Fuel_Price
## 1     1    1 05/02/2010     24924.50       FALSE       42.31      2.572
## 2     1    1 12/02/2010     46039.49        TRUE       38.51      2.548
## 3     1    1 19/02/2010     41595.55       FALSE       39.93      2.514
## 4     1    1 26/02/2010     19403.54       FALSE       46.63      2.561
## 5     1    1 05/03/2010     21827.90       FALSE       46.50      2.625
## 6     1    1 12/03/2010     21043.39       FALSE       57.79      2.667
##   MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5      CPI Unemployment
## 1        NA        NA        NA        NA        NA 211.0964        8.106
## 2        NA        NA        NA        NA        NA 211.2422        8.106
## 3        NA        NA        NA        NA        NA 211.2891        8.106
## 4        NA        NA        NA        NA        NA 211.3196        8.106
## 5        NA        NA        NA        NA        NA 211.3501        8.106
## 6        NA        NA        NA        NA        NA 211.3806        8.106
##   IsHoliday.y
## 1       FALSE
## 2        TRUE
## 3       FALSE
## 4       FALSE
## 5       FALSE
## 6       FALSE
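
The IsHoliday.x and IsHoliday.y columns in the output above arise because IsHoliday exists in both tables without being a join key. Below is a minimal sketch of how one might check that the two flags agree and then keep a single copy. The data frames toy_sales and toy_features are hypothetical stand-ins with made-up values, and base R merge() is used in place of left_join() only so that the snippet runs without extra packages.

```r
# Toy stand-ins for the real tables (values are made up for illustration)
toy_sales <- data.frame(
  Store = c(1, 1, 2),
  Date = c("05/02/2010", "12/02/2010", "05/02/2010"),
  Weekly_Sales = c(100, 200, 150),
  IsHoliday = c(FALSE, TRUE, FALSE)
)
toy_features <- data.frame(
  Store = c(1, 1, 2),
  Date = c("05/02/2010", "12/02/2010", "05/02/2010"),
  Temperature = c(42.3, 38.5, 40.1),
  IsHoliday = c(FALSE, TRUE, FALSE)
)

# Left join on Store and Date; the shared non-key column gets .x/.y suffixes
toy_joined <- merge(toy_sales, toy_features, by = c("Store", "Date"), all.x = TRUE)

# Check whether the two holiday flags agree on every row (NA counts as "no")
flags_agree <- isTRUE(all(toy_joined$IsHoliday.x == toy_joined$IsHoliday.y))

# If they agree, keep one copy under the original name
if (flags_agree) {
  toy_joined$IsHoliday <- toy_joined$IsHoliday.x
  toy_joined$IsHoliday.x <- NULL
  toy_joined$IsHoliday.y <- NULL
}
```

If the flags are known to agree, an alternative would be to add IsHoliday to the join key, i.e. by = c("Store", "Date", "IsHoliday"), which avoids the suffixes altogether.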

The Stores dataset only provides information about individual stores, so we can join it on the Store number.

retail_data = left_join(sales_and_features, stores, by = "Store")

head(retail_data)
##   Store Dept       Date Weekly_Sales IsHoliday.x Temperature Fuel_Price
## 1     1    1 05/02/2010     24924.50       FALSE       42.31      2.572
## 2     1    1 12/02/2010     46039.49        TRUE       38.51      2.548
## 3     1    1 19/02/2010     41595.55       FALSE       39.93      2.514
## 4     1    1 26/02/2010     19403.54       FALSE       46.63      2.561
## 5     1    1 05/03/2010     21827.90       FALSE       46.50      2.625
## 6     1    1 12/03/2010     21043.39       FALSE       57.79      2.667
##   MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5      CPI Unemployment
## 1        NA        NA        NA        NA        NA 211.0964        8.106
## 2        NA        NA        NA        NA        NA 211.2422        8.106
## 3        NA        NA        NA        NA        NA 211.2891        8.106
## 4        NA        NA        NA        NA        NA 211.3196        8.106
## 5        NA        NA        NA        NA        NA 211.3501        8.106
## 6        NA        NA        NA        NA        NA 211.3806        8.106
##   IsHoliday.y Type   Size
## 1       FALSE    A 151315
## 2        TRUE    A 151315
## 3       FALSE    A 151315
## 4       FALSE    A 151315
## 5       FALSE    A 151315
## 6       FALSE    A 151315

Adjusting data for plotting:

retail_data$Date = as.Date(retail_data$Date, format = "%d/%m/%Y")
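
The format string matters here because the raw dates are stored as day/month/year. The small sketch below shows how the parse could be verified. The example vector is copied from the str() output above, while the checks themselves are an illustrative addition.

```r
# First few dates as they appear in the raw data
raw_dates <- c("05/02/2010", "12/02/2010", "19/02/2010", "26/02/2010")

# Parse as day/month/year
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# A mismatched format yields NA, so count the failures
n_failed <- sum(is.na(parsed))

# Consecutive weekly observations should be exactly 7 days apart
gaps <- as.numeric(diff(parsed))
```

On the full data, sum(is.na(retail_data$Date)) after the conversion would serve the same purpose as n_failed.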

Now we have the final dataset, called retail_data, which will be used throughout the assignments.

Question 1

Motivate your choice of variables. Why are you interested in analyzing them? (“For-fun” purposes are also acceptable given that this is a non-graded assignment.)

I used to work in marketing for some time and was often involved with my clients’ retail data. Back then I attempted to run several projects involving statistics/forecasting (e.g. Causal Impact Studies).

Since I already have experience with this kind of data, I find it comfortable and fun to work with.

Sometimes my former colleagues come to me for advice, so it will be exciting to have something to share with them.

Question 2

Which variable is primarily interesting for you?

The variable of primary interest is Weekly_Sales. Predicting future sales can be crucial for resource planning, inventory management, and promotional strategies, especially in the retail industry where demand fluctuates due to various factors.

What other related variables are you using?

In addition to Weekly_Sales, I am using the following related variables:

Store: Identifies the store number.

Dept: Identifies the department number.

Date: The start date of the week.

Temperature: Average temperature in the region.

Fuel_Price: Cost of fuel in the region.

MarkDown1-5: Anonymized data related to promotional markdowns. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.

CPI: The Consumer Price Index.

Unemployment: The unemployment rate.

IsHoliday: Indicates whether a particular week includes a major holiday.

Type: Indicates the category of the store.

Size: The size of the store (units are not mentioned on the Kaggle page).
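
Since the MarkDown columns contain many missing values, it can help to quantify how much is missing before using them. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in that only reuses the column names).

```r
# Toy data frame with the same MarkDown column names (values are made up)
toy <- data.frame(
  MarkDown1 = c(NA, NA, 10.5, 3.2),
  MarkDown2 = c(NA, NA, NA, 1.0),
  Weekly_Sales = c(100, 200, 150, 120)
)

# Select the MarkDown columns by name prefix
markdown_cols <- grep("^MarkDown", names(toy), value = TRUE)

# Share of missing values per MarkDown column
na_share <- colMeans(is.na(toy[markdown_cols]))
```

Applied to the real data, the same two lines would run on names(retail_data) and retail_data[markdown_cols].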

Give an intuition as to why the other variables you use can explain the variable of primary interest.

Store: Different stores may experience varying sales trends due to their geographical location, customer base, and local market conditions.

Dept: Sales patterns can differ across departments. For example, some departments may have steady sales year-round, while others (e.g., seasonal departments) see peaks during specific times of the year, such as holidays or promotions.

Date: The date is crucial for understanding the seasonality of sales. Retail sales typically show clear seasonal patterns, with higher sales around holidays and specific times of the year (e.g., back-to-school season, Black Friday).

Temperature: Weather conditions can significantly affect foot traffic and consumer behavior. For example, colder weather might reduce store visits, while warmer temperatures could increase sales of seasonal products (e.g., summer clothing, outdoor goods).

Fuel_Price: Higher fuel prices can reduce consumers’ disposable income, leading to reduced spending on non-essential items, and vice versa.

MarkDown1-5: Promotions and markdowns are key drivers of increased sales. Understanding how markdowns affect different departments can help in forecasting spikes in sales during promotional periods.

CPI (Consumer Price Index): The CPI measures the price level of consumer goods, so a rising CPI indicates inflation. Higher CPI values indicate that goods are becoming more expensive, which could reduce consumer purchasing power and decrease sales, especially for non-essential items.

Unemployment: Higher unemployment rates usually correlate with lower consumer confidence and reduced spending, which would likely lead to lower sales.

IsHoliday: Holidays are crucial for forecasting sales spikes, as consumers tend to spend more during these periods, particularly around major holidays like Thanksgiving and Christmas.

Type: The type of store likely reflects different store formats, product ranges, or customer bases. Some of those types may have higher sales due to broader product availability and customer appeal.

Size: Larger stores generally have more capacity for inventory and may serve a broader customer base, leading to higher sales. Smaller stores, on the other hand, may focus on specific customer segments or products, leading to different sales trends.

Would it be possible to look at another set of explanatory variables instead?

Foot traffic data: If available, data on the number of visitors to the store would provide direct insight into consumer behavior and sales patterns.

Competitor prices: Information about competitors’ prices or promotional strategies could help in understanding how competition affects sales.

Online sales: If available, integrating online sales data with in-store sales would give a more complete picture of the company’s total sales performance.

Product prices: The actual price of products in each department could be used to analyze how price changes influence sales, particularly in relation to markdowns.

Interest Rates: When interest rates are high, borrowing becomes more expensive, reducing consumers’ disposable income and potentially leading to lower spending on non-essential goods.

Describe the data characteristics: time span, frequency, etc.

Time span: The sales and features data span February 5, 2010, to November 1, 2012, which gives approximately 2 years and 9 months of weekly data.

Frequency: The data is recorded on a weekly basis for each store and department, allowing for detailed time-series analysis.
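
The time span and the weekly frequency can also be checked directly from the Date column. The sketch below uses a short made-up date vector (toy_dates is a hypothetical stand-in for retail_data$Date, with repeated dates mimicking several store/department rows per week).

```r
# Toy weekly dates standing in for retail_data$Date (made up for illustration)
toy_dates <- as.Date("2010-02-05") + 7 * c(0, 0, 1, 1, 2, 3)

# First and last observation date
span <- range(toy_dates)

# Spacing in days between consecutive distinct dates
spacing <- as.numeric(diff(sort(unique(toy_dates))))

# A purely weekly series has a spacing of 7 days everywhere
is_weekly <- all(spacing == 7)
```

Running range(retail_data$Date) on the real data would confirm the first and last week actually present after the join.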

Question 3

Use ts() function to convert the numeric values to a time series object. Plot your main variable of interest and describe the characteristics in the visual.

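  • We cannot plot the data directly at this point because we have a weekly sales value for each store and department, so we first have to aggregate the data at the date level, i.e. sum the sales across all stores and departments for each week
# sum Weekly_Sales over all stores and departments for each date
aggregated_data <- aggregate(Weekly_Sales ~ Date, data = retail_data, sum)
# plot the weekly totals as a time series (the x-axis is the week index)
ts.plot(aggregated_data$Weekly_Sales)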
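
The question asks for ts() explicitly, so the following sketch shows what that conversion could look like. It runs on synthetic numbers (toy_sales is made up), and both frequency = 52 and start = c(2010, 6) are assumptions: 52 approximates the number of weeks per year, and 6 is only roughly the week of the year of the first observation.

```r
# Synthetic weekly totals standing in for aggregated_data$Weekly_Sales
set.seed(42)
toy_sales <- 45e6 + rnorm(104, sd = 2e6)

# Convert to a time series object with an (assumed) yearly cycle of 52 weeks
sales_ts <- ts(toy_sales, start = c(2010, 6), frequency = 52)
```

Calling plot(sales_ts) then draws the series with calendar years on the x-axis instead of a plain observation index.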

Trend

The series appears to be relatively flat over time, with no clear long-term increasing or decreasing trend in the aggregated weekly sales.

Seasonality

There are recurring peaks visible at roughly regular intervals. These could correspond to seasonal events like holidays (e.g., Black Friday, Christmas) or promotional markdown periods, where sales tend to spike due to increased consumer spending.

Outside of the high peaks, there are smaller regular movements as well, which could be influenced by promotional events, markdowns, or smaller seasonal changes like summer vacations or smaller holidays. Since the data is weekly, any weekend vs. weekday variation is not visible here.
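
One way to look at trend and seasonality separately is a classical decomposition. The sketch below applies decompose() to a synthetic weekly series (toy_sales is made up, and frequency = 52 is an assumption). decompose() needs at least two full periods, which the roughly 2 years and 9 months of data should provide.

```r
# Synthetic weekly series with a yearly pattern (3 years, made up for illustration)
set.seed(1)
weeks <- 1:156
toy_sales <- 100 + 10 * sin(2 * pi * weeks / 52) + rnorm(156)

toy_ts <- ts(toy_sales, frequency = 52)

# Classical additive decomposition into trend, seasonal and random parts
parts <- decompose(toy_ts)

# One estimated seasonal effect per week of the year
seasonal_effects <- parts$figure
```

Calling plot(parts) shows the observed series together with the three estimated components.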

Question 4

Plot the correlogram and argue whether the variable is stationary or nonstationary. How does your conclusion compare to the data plot?

acf(aggregated_data$Weekly_Sales)

The ACF plot shows significant autocorrelation at lag 1, which then quickly diminishes after a few lags. This suggests that the series might have some short-term dependencies but does not show long-term autocorrelations.
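
"Significant" here refers to the dashed bounds drawn by acf(), which are approximately ±1.96/√n under the assumption of white noise. The sketch below reproduces that check numerically on a synthetic series (x is a made-up AR(1) process, not the sales data).

```r
# Synthetic series with short-term dependence (AR(1), made up for illustration)
set.seed(1)
x <- arima.sim(model = list(ar = 0.5), n = 150)

# Sample autocorrelations without drawing the plot
acf_obj <- acf(x, plot = FALSE)
acf_values <- as.numeric(acf_obj$acf)[-1]  # drop lag 0, which is always 1

# Approximate 95% bounds under the white-noise null
bound <- qnorm(0.975) / sqrt(length(x))

# Lags whose sample autocorrelation falls outside the bounds
significant_lags <- which(abs(acf_values) > bound)
```

Replacing x with aggregated_data$Weekly_Sales would list the significant lags for the actual series.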

pacf(aggregated_data$Weekly_Sales)

The first lag shows a significant positive partial autocorrelation. After lag 1, most lags are insignificant, meaning there is no strong additional structure beyond lag 1.

Conclusion:

Based on the fast drop-off in autocorrelation in the ACF and the significant spike only at lag 1 in the PACF, the data appears close to stationary, but with some short-term dependencies.
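
The visual conclusion could be complemented by a formal unit-root test. As a sketch, the Phillips-Perron test from base R's stats package is applied below to a synthetic stationary series (x is made up). Choosing this particular test is an assumption for illustration, not part of the assignment.

```r
# Synthetic stationary series (AR(1), made up for illustration)
set.seed(1)
x <- arima.sim(model = list(ar = 0.5), n = 150)

# Phillips-Perron test: the null hypothesis is a unit root (nonstationarity)
pp <- PP.test(x)
p_value <- pp$p.value
```

A small p-value would count as evidence against a unit root, which would be in line with the reading of the correlogram.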

Question 5

If your variable is nonstationary, plot the correlogram in first differences using diff() function. (Read the help file if necessary.)

Since the analysis in Q4 indicates that the series is close to stationary, differencing the data using the diff() function is not necessary.
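
For reference, had the series looked nonstationary, the correlogram in first differences would be obtained with acf(diff(aggregated_data$Weekly_Sales)). The short sketch below, on synthetic white noise (x is made up), illustrates why differencing an already stationary series is best avoided: it tends to induce a negative autocorrelation at lag 1.

```r
# Synthetic white noise standing in for an already stationary series
set.seed(1)
x <- rnorm(1000)

# First differences
dx <- diff(x)

# Lag-1 sample autocorrelation of the differenced series
lag1 <- acf(dx, plot = FALSE)$acf[2]
```

For white noise the theoretical lag-1 autocorrelation of the differenced series is -0.5, so lag1 should come out clearly negative.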