WQD7004

OCC1-GROUP1

2025-01-07

Forecasting Walmart Sales and Segmenting Customers: Seasonal Trends, Economic Factors

Matric_ID-Name[Contribution]

23110894-HUANG PENG [Xgboost Model]

22106608-YANG QIULIN [Randomforest/K-mean Segmenting/Presentation]

23120355-FENG FEIRAN [EDA]

24059144-MUHAMAD AIMAN BIN JAMAL ABD NASIR [Model Comparision]

23073037-YANG XU [AIRIAM MODEL/Order files]

#Contents #

1.Introduction and Objectives

2.Processes and Results

2.1 Data Understanding

2.2 Data Preparation

2.3 Prediction Model Modeling and Evaluation

2.4 Prediction Results(EVALUATION COMPARISON)

2.5 Customer Segmentation Modeling

2.6 Customer Segmentation Results

3. Conclusion

1.Introduction and Objectives

1.1 Project Introduction

E-commerce faces increasing challenges in managing and predicting sales performance due to a variety of influencing factors such as seasonal trends, holidays, and weather conditions. This study selects the best predictive model through a comparison of three models and develops a customer segmentation model based on store data, helping Walmart optimize promotional activities and better understand consumer behavior.

1.2 Project Objectives

Objective 1:To identify patterns and trends in sales data

Objective 2:To model sales performance using predictive techniques

Objective 3:To evaluate and improve the accuracy of sales forecasts

Objective 4:To build a customer segmentation model and summarize the characteristics of the customer groups.

2.Processes and Results

2.1 Data Understanding

2.1.1.Data Introduction:

  • Data Source:

https://www.kaggle.com/datasets/rutuspatel/walmart-dataset-retail

  • Title: Walmart_Store_sales
  • Year:2010.02-2012.10

2.1.2.Dataset Overview

library(readr)
data <- read_csv("C:/Users/24319/Desktop/WQD7004数据科学编程-OCC1/小组作业/Walmart_Store_sales.csv")
head(data)
## # A tibble: 6 × 8
##   Store Date       Weekly_Sales Holiday_Flag Temperature Fuel_Price   CPI
##   <dbl> <chr>             <dbl>        <dbl>       <dbl>      <dbl> <dbl>
## 1     1 05-02-2010     1643691.            0        42.3       2.57  211.
## 2     1 12-02-2010     1641957.            1        38.5       2.55  211.
## 3     1 19-02-2010     1611968.            0        39.9       2.51  211.
## 4     1 26-02-2010     1409728.            0        46.6       2.56  211.
## 5     1 05-03-2010     1554807.            0        46.5       2.62  211.
## 6     1 12-03-2010     1439542.            0        57.8       2.67  211.
## # ℹ 1 more variable: Unemployment <dbl>
  • Dataset Dimensions: Rows: 6435 Columns: 8
  • Structure: Numeric type(7); Character type (1): Data
colnames(data)
## [1] "Store"        "Date"         "Weekly_Sales" "Holiday_Flag" "Temperature" 
## [6] "Fuel_Price"   "CPI"          "Unemployment"
  • Content:Here are the 8 column names in the dataset

1.Store: Store ID.

2.Date: Date of the sales data.

3.Weekly_Sales: Weekly sales.

4.Holiday_Flag: Holiday flag (0 or 1).

5.Temperature: Temperature, representing the temperature for that week.

6.Fuel_Price: Fuel price, representing the fuel price for that week.

7. CPI: Consumer Price Index, typically used to measure changes in the price level.

8. Unemployment: Unemployment rate, representing the unemployment rate in the region.

  • Summary:
summary(data)
##      Store        Date            Weekly_Sales      Holiday_Flag    
##  Min.   : 1   Length:6435        Min.   : 209986   Min.   :0.00000  
##  1st Qu.:12   Class :character   1st Qu.: 553350   1st Qu.:0.00000  
##  Median :23   Mode  :character   Median : 960746   Median :0.00000  
##  Mean   :23                      Mean   :1046965   Mean   :0.06993  
##  3rd Qu.:34                      3rd Qu.:1420159   3rd Qu.:0.00000  
##  Max.   :45                      Max.   :3818686   Max.   :1.00000  
##   Temperature       Fuel_Price         CPI         Unemployment   
##  Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891  
##  Median : 62.67   Median :3.445   Median :182.6   Median : 7.874  
##  Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999  
##  3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622  
##  Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313

2.2 Data Preparation

install.packages("lubridate",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'lubridate'打开成功,MD5和检查也通过
install.packages("dplyr",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'dplyr'打开成功,MD5和检查也通过
install.packages("ggplot2",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'ggplot2'打开成功,MD5和检查也通过
install.packages("gridExtra",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'gridExtra'打开成功,MD5和检查也通过
install.packages("ggcorrplot",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'ggcorrplot'打开成功,MD5和检查也通过

2.2.1.Data Preprocessing:

①Convert Date Format:

library(data.table)
library(lubridate)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(ggcorrplot)


str(data)
## spc_tbl_ [6,435 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Store       : num [1:6435] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date        : chr [1:6435] "05-02-2010" "12-02-2010" "19-02-2010" "26-02-2010" ...
##  $ Weekly_Sales: num [1:6435] 1643691 1641957 1611968 1409728 1554807 ...
##  $ Holiday_Flag: num [1:6435] 0 1 0 0 0 0 0 0 0 0 ...
##  $ Temperature : num [1:6435] 42.3 38.5 39.9 46.6 46.5 ...
##  $ Fuel_Price  : num [1:6435] 2.57 2.55 2.51 2.56 2.62 ...
##  $ CPI         : num [1:6435] 211 211 211 211 211 ...
##  $ Unemployment: num [1:6435] 8.11 8.11 8.11 8.11 8.11 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Store = col_double(),
##   ..   Date = col_character(),
##   ..   Weekly_Sales = col_double(),
##   ..   Holiday_Flag = col_double(),
##   ..   Temperature = col_double(),
##   ..   Fuel_Price = col_double(),
##   ..   CPI = col_double(),
##   ..   Unemployment = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
sapply(data, class)
##        Store         Date Weekly_Sales Holiday_Flag  Temperature   Fuel_Price 
##    "numeric"  "character"    "numeric"    "numeric"    "numeric"    "numeric" 
##          CPI Unemployment 
##    "numeric"    "numeric"
# Convert Date Format
data$Date <- dmy(data$Date)

# Descriptive statistical information
summary(data)
##      Store         Date             Weekly_Sales      Holiday_Flag    
##  Min.   : 1   Min.   :2010-02-05   Min.   : 209986   Min.   :0.00000  
##  1st Qu.:12   1st Qu.:2010-10-08   1st Qu.: 553350   1st Qu.:0.00000  
##  Median :23   Median :2011-06-17   Median : 960746   Median :0.00000  
##  Mean   :23   Mean   :2011-06-17   Mean   :1046965   Mean   :0.06993  
##  3rd Qu.:34   3rd Qu.:2012-02-24   3rd Qu.:1420159   3rd Qu.:0.00000  
##  Max.   :45   Max.   :2012-10-26   Max.   :3818686   Max.   :1.00000  
##   Temperature       Fuel_Price         CPI         Unemployment   
##  Min.   : -2.06   Min.   :2.472   Min.   :126.1   Min.   : 3.879  
##  1st Qu.: 47.46   1st Qu.:2.933   1st Qu.:131.7   1st Qu.: 6.891  
##  Median : 62.67   Median :3.445   Median :182.6   Median : 7.874  
##  Mean   : 60.66   Mean   :3.359   Mean   :171.6   Mean   : 7.999  
##  3rd Qu.: 74.94   3rd Qu.:3.735   3rd Qu.:212.7   3rd Qu.: 8.622  
##  Max.   :100.14   Max.   :4.468   Max.   :227.2   Max.   :14.313

② Extract Specified Columns

walmart_data <- subset(data, select = c(Weekly_Sales, Temperature, Fuel_Price, CPI, Unemployment))

head(walmart_data)
## # A tibble: 6 × 5
##   Weekly_Sales Temperature Fuel_Price   CPI Unemployment
##          <dbl>       <dbl>      <dbl> <dbl>        <dbl>
## 1     1643691.        42.3       2.57  211.         8.11
## 2     1641957.        38.5       2.55  211.         8.11
## 3     1611968.        39.9       2.51  211.         8.11
## 4     1409728.        46.6       2.56  211.         8.11
## 5     1554807.        46.5       2.62  211.         8.11
## 6     1439542.        57.8       2.67  211.         8.11

③heck for missing and duplicate values

# Checking for null and duplicate values
colSums(is.na(data)) # Empty value statistics
##        Store         Date Weekly_Sales Holiday_Flag  Temperature   Fuel_Price 
##            0            0            0            0            0            0 
##          CPI Unemployment 
##            0            0
sum(duplicated(data)) # Repeat value statistics
## [1] 0
# Summarize total sales by Store group
total_sales <- data %>%
  group_by(Store) %>%
  summarise(Weekly_Sales = sum(Weekly_Sales)) %>%
  arrange(desc(Weekly_Sales))

print(head(total_sales, 1))
## # A tibble: 1 × 2
##   Store Weekly_Sales
##   <dbl>        <dbl>
## 1    20   301397792.

2.2.2. Exploratory Data Analysis (EDA)

① Distribution of Different Features

# Plotting histograms
plots <- lapply(names(walmart_data), function(var) {
  ggplot(walmart_data, aes_string(x = var)) +
    geom_histogram(bins = 10, fill = "skyblue", color = "black") +
    theme_minimal() +
    ggtitle(paste("Histogram of", var))
})
grid.arrange(grobs = plots, ncol = 2)

# Generate Subgraph Functions
create_boxplot <- function(data, col_name, x_label, title) {
  ggplot(data, aes_string(y = col_name)) +
    geom_boxplot(fill = "skyblue", color = "black") +
    labs(x = x_label, y = "Values", title = title) +
    theme_minimal()
}
  • Weekly Sales: The histogram is right-skewed, indicating a few weeks had very high sales.
  • Temperature: A normal distribution suggests consistent temperatures, while skewness could show seasonal changes or outliers.
  • Fuel Price: Concentration at certain price ranges may indicate stability or volatility in the fuel market.
  • Unemployment: A peak at lower rates suggests a healthy job market, while a wider spread may indicate economic challenges.
  • CPI: A narrow distribution suggests stable prices, while a wide one indicates price fluctuations.
# Plotting box-and-line plots of raw data
p1 <- create_boxplot(walmart_data, "Temperature", "Temperature", "Boxplot of Temperature")
p2 <- create_boxplot(walmart_data, "Fuel_Price", "Fuel_Price", "Boxplot of Fuel_Price")
p3 <- create_boxplot(walmart_data, "CPI", "CPI", "Boxplot of CPI")
p4 <- create_boxplot(walmart_data, "Unemployment", "Unemployment", "Boxplot of Unemployment")
# Setting the image layout and drawing
grid.arrange(p1, p2, p3, p4, ncol = 2)

# Removing Exceptions
walmart_data <- walmart_data %>%
  filter(Temperature > 10,
         Unemployment > 4.5,
         Unemployment < 10)
# Plotting box plots of updated data
p1_updated <- create_boxplot(walmart_data, "Temperature", "Temperature", "Boxplot of Temperature (Updated)")
p2_updated <- create_boxplot(walmart_data, "Fuel_Price", "Fuel_Price", "Boxplot of Fuel_Price (Updated)")
p3_updated <- create_boxplot(walmart_data, "CPI", "CPI", "Boxplot of CPI (Updated)")
p4_updated <- create_boxplot(walmart_data, "Unemployment", "Unemployment", "Boxplot of Unemployment (Updated)")

# Setting the image layout and drawing
grid.arrange(p1_updated, p2_updated, p3_updated, p4_updated, ncol = 2)

  • Temperature: The median is around 63°, with most of the time the temperature ranging between 50° and 75°.

  • Fuel Price: The median is 3.4, with most values falling between 2.9 and 3.7.

  • CPI: The median is approximately 188, with most values ranging from 130 to 215, showing a relatively large variation.

  • Unemployment: The median is around 8, with most values concentrated between 7 and 8.5, showing a smaller range.

② Distribution of Different Features

# Plotting total sales histograms
ggplot(total_sales, aes(x = reorder(Store, -Weekly_Sales), y = Weekly_Sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Total Sales by Store", x = "Store", y = "Total Sales") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Viewpoint:The bar chart of total sales by store shows that a few stores account for most of the sales, while many others have much lower sales. This uneven distribution indicates that certain stores drive most of the sales, which could impact business planning and resource allocation.

std_sales <- data %>%
  group_by(Store) %>%
  summarise(SD_Weekly_Sales = sd(Weekly_Sales, na.rm = TRUE)) %>%
  arrange(desc(SD_Weekly_Sales))

save(std_sales, file = "std_sales.RData")
print(head(std_sales, 1))
## # A tibble: 1 × 2
##   Store SD_Weekly_Sales
##   <dbl>           <dbl>
## 1    14         317570.
# Plotting standard deviation histograms
ggplot(std_sales, aes(x = SD_Weekly_Sales)) +
  geom_histogram(binwidth = 10, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Standard Deviation of Weekly Sales", x = "Standard Deviation", y = "Frequency")

Viewpoint:The histogram of weekly sales standard deviation shows little variation, meaning most weeks have similar performance.

# Conversion of holiday date format
Super_Bowl <- as.Date(c("2010-02-12", "2011-02-11", "2012-02-10"))
Labour_Day <- as.Date(c("2010-09-10", "2011-09-09", "2012-09-07"))
Thanksgiving <- as.Date(c("2010-11-26", "2011-11-25", "2012-11-23"))
Christmas <- as.Date(c("2010-12-31", "2011-12-30", "2012-12-28"))
# Calculate average sales for holidays and non-holidays
super_bowl_sales <- data %>%
  filter(Date %in% Super_Bowl) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  pull(mean_sales)
labour_day_sales <- data %>%
  filter(Date %in% Labour_Day) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  pull(mean_sales)
thanksgiving_sales <- data %>%
  filter(Date %in% Thanksgiving) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  pull(mean_sales)
christmas_sales <- data %>%
  filter(Date %in% Christmas) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  pull(mean_sales)
non_holidays_sales <- data %>%
  filter(Holiday_Flag == 0) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE)) %>%
  pull(mean_sales)
cat("Super Bowl Sales: ", super_bowl_sales, "\n")
## Super Bowl Sales:  1079128
cat("Labour Day Sales: ", labour_day_sales, "\n")
## Labour Day Sales:  1042427
cat("Thanksgiving Sales: ", thanksgiving_sales, "\n")
## Thanksgiving Sales:  1471273
cat("Christmas Sales: ", christmas_sales, "\n")
## Christmas Sales:  960833.1
cat("Non-Holidays Sales: ", non_holidays_sales, "\n")
## Non-Holidays Sales:  1041256

Viewpoint:This result displays the average sales for specific holidays (such as Super Bowl, Labor Day, etc.) and non-holidays.

# Plotting average sales comparisons between holiday and non-holiday weeks
holiday_dates <- c(Super_Bowl, Labour_Day, Thanksgiving, Christmas)
data$Is_Holiday_Week <- ifelse(data$Date %in% holiday_dates, "Holiday Week", "Non-Holiday Week")
avg_sales <- data %>%
  group_by(Is_Holiday_Week) %>%
  summarise(mean_sales = mean(Weekly_Sales, na.rm = TRUE))
ggplot(avg_sales, aes(x = Is_Holiday_Week, y = mean_sales, fill = Is_Holiday_Week)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Weekly Sales for Holiday and Non-Holiday Weeks",
       x = "Week Type",
       y = "Average Weekly Sales") +
  scale_fill_manual(values = c("Holiday Week" = "red", "Non-Holiday Week" = "blue")) +
  theme_minimal()

Viewpoint:We can see that the average sales for holiday weeks are higher than the average sales for non-holiday weeks.

# Add month and year columns
data$Month <- as.numeric(format(as.Date(data$Date), "%m"))
data$Year <- as.numeric(format(as.Date(data$Date), "%Y"))
# Plotting a bar graph of sales by month (using 2010 as an example)
ggplot(data %>% filter(Year == 2010), aes(x = Month, y = Weekly_Sales)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue") +
  labs(title = "Monthly Sales Review in 2010", x = "Month", y = "Weekly Sales")

Viewpoint: We can see that the sales in 2010 remained steady throughout the year, but started to rise in October, reaching the highest average sales in December.

# Semester division
data$Semester <- ifelse(data$Month %in% 1:6, "1st Semester", "2nd Semester")
# Summary of sales by semester
semester_sales <- data %>%
  group_by(Semester) %>%
  summarise(total_sales = sum(Weekly_Sales, na.rm = TRUE))
# Plotting semester sales histograms
ggplot(semester_sales, aes(x = Semester, y = total_sales, fill = Semester)) +
  geom_bar(stat = "identity") +
  labs(title = "Semester Sales Review", x = "Semester", y = "Total Sales")

Viewpoint: We can see that when sales are divided into the first half (January to June) and the second half (July to December), the sales in the second half are higher than in the first half.

# Heat mapping to demonstrate relevance
walmart_data <- data %>% select(Weekly_Sales, Holiday_Flag, Temperature, Fuel_Price, CPI, Unemployment)
cor_matrix <- cor(walmart_data, use = "complete.obs")
ggcorrplot(cor_matrix, lab = TRUE, colors = c("red", "white", "blue"), outline.color = "black")

Viewpoint:The correlation matrix provides valuable insights into the relationships between various factors influencing weekly sales. Each value indicates the strength and direction of the correlation, ranging from -1 to 1, allowing for a nuanced understanding of how different variables interact.

  • One notable observation is the weak negative correlation between the Consumer Price Index (CPI) and weekly sales, with a value of -0.07. This suggests that as CPI increases, weekly sales may slightly decrease, indicating that higher consumer prices could impact purchasing behavior.

  • In contrast, the correlation between unemployment and weekly sales is stronger and negative at -0.30. This highlights that higher unemployment rates are associated with lower sales figures, emphasizing the significant economic impact that employment levels have on consumer spending.

  • Other factors, such as temperature and fuel prices, exhibit weak correlations with weekly sales, suggesting they may not significantly influence sales trends.

  • The Holiday Flag shows a slight positive correlation of 0.04 with weekly sales, indicating that holidays may have a minimal positive effect on sales performance.

  • Overall, the matrix underscores the importance of economic indicators like unemployment and CPI in shaping sales performance. Understanding these correlations can help businesses develop more effective strategies to enhance sales during varying economic conditions.

# Temperature versus sales (line and scatter plots)
ggplot(walmart_data, aes(x = Temperature, y = Weekly_Sales)) +
  geom_line(color = "blue") +
  labs(title = "Weekly Sales vs Temperature (Line Plot)", x = "Temperature", y = "Weekly Sales")

Viewpoint:The line plot of sales versus temperature shows a fluctuating relationship, where warmer temperatures are generally linked to higher sales, likely driven by seasonal products and promotions. However, sales begin to stabilize or decline beyond certain temperature thresholds, indicating market saturation.

ggplot(walmart_data, aes(x = Temperature, y = Weekly_Sales)) +
  geom_point(color = "red") +
  labs(title = "Weekly Sales vs Temperature (Scatter Plot)", x = "Temperature", y = "Weekly Sales")

Viewpoint:The scatter plot also shows that temperature affects purchasing behavior.

# Relationship between oil prices and sales (line and scatter plots)
ggplot(walmart_data, aes(x = Fuel_Price, y = Weekly_Sales)) +
  geom_line(color = "blue") +
  labs(title = "Weekly Sales vs Fuel Price (Line Plot)", x = "Fuel Price", y = "Weekly Sales")

Viewpoint: The line plot of sales versus fuel price shows a stable trend, with sales remaining high despite fluctuations in fuel costs, indicating that fuel prices have little effect on consumer behavior.

ggplot(walmart_data, aes(x = Fuel_Price, y = Weekly_Sales)) +
  geom_point(color = "red") +
  labs(title = "Weekly Sales vs Fuel Price (Scatter Plot)", x = "Fuel Price", y = "Weekly Sales")

Viewpoint:The scatter plot further illustrates this weak relationship, indicating that most sales are unaffected by fluctuations in fuel prices.

str(data$Date)
##  Date[1:6435], format: "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" "2010-03-05" ...
head(data$Date)
## [1] "2010-02-05" "2010-02-12" "2010-02-19" "2010-02-26" "2010-03-05"
## [6] "2010-03-12"
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(lubridate)

# Ensure the 'Date' column is in Date format
data$Date <- as.Date(data$Date, format = "%d/%m/%Y")

# Extract Year-Month for grouping and summarize the sales
data$Month <- floor_date(data$Date, "month")  # Group by month

monthly_sales <- data %>%
  group_by(Month) %>%
  summarise(Total_Sales = sum(Weekly_Sales),
            Holiday_Flag = max(Holiday_Flag))  # Assuming the presence of any holiday

# Create the bar chart
ggplot(monthly_sales, aes(x = Month, y = Total_Sales, fill = factor(Holiday_Flag))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red")) +  # Color bars based on Holiday_Flag
  labs(title = "Monthly Sales with Holiday Flag",
       x = "Month",
       y = "Total Sales",
       fill = "Holiday Flag") +
  scale_x_date(labels = function(x) format(x, "%Y %b"), breaks = "1 month") +  # Display Year-Month (e.g., 2010 Jan, 2011 Jan)
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

Viewpoint: The analysis highlights significant differences in consumer spending, showing that sales during holiday months (marked as 1) are consistently higher than those in non-holiday months (marked as 0)

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(lubridate)

# Ensure the 'Date' column is in Date format
data$Date <- as.Date(data$Date, format = "%d/%m/%Y")

# Group by Month and summarize Total Sales and Average Temperature
data$Month <- floor_date(data$Date, "month")

monthly_summary <- data %>%
  group_by(Month) %>%
  summarise(Total_Sales = sum(Weekly_Sales),
            Avg_Temperature = mean(Temperature),
            Holiday_Flag = max(Holiday_Flag))  # Assuming Holiday_Flag indicates any holiday

# Plot 1: Bar chart for Total Sales
sales_plot <- ggplot(monthly_summary) +
  geom_bar(aes(x = Month, y = Total_Sales, fill = factor(Holiday_Flag)),
           stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("0" = "blue", "1" = "red")) +
  labs(title = "Monthly Total Sales", x = "Month", y = "Total Sales", fill = "Holiday Flag") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Plot 2: Line chart for Average Temperature
temperature_plot <- ggplot(monthly_summary) +
  geom_line(aes(x = Month, y = Avg_Temperature), color = "green", size = 1) +
  labs(title = "Monthly Average Temperature", x = "Month", y = "Average Temperature (°C)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Print both plots (you can use gridExtra or patchwork to arrange them side by side)
library(gridExtra)
grid.arrange(sales_plot, temperature_plot, ncol = 1)

Viewpoint: Complementing this, the examination of “Monthly Total Sales” and “Monthly Average Temperature” from 2010 to 2012 reveals that while higher temperatures may correlate with increased sales due to seasonal demand, the pronounced spikes in sales during holidays suggest that promotional events play a more critical role. Overall, the data underscores the impact of holidays on consumer behavior while also acknowledging the influence of seasonal weather patterns.

2.3 Prediction Model Modeling and Evaluation

XGBoost and Random Forest were employed to forecast weekly sales. The dataset was divided into training, validation, and test subsets. ARIMA was applied to forecast both total sales across all stores and sales for individual stores.

2.3.1.XGBOOST

install.packages("caret",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'caret'打开成功,MD5和检查也通过
install.packages("gridExtra",repos = "https://cloud.r-project.org/",quiet = TRUE)
install.packages("xgboost",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'xgboost'打开成功,MD5和检查也通过
# Load necessary libraries
library(xgboost)
library(caret)
library(ggplot2)

# Load and preprocess the data
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
data <- na.omit(data)

# Split the data into training, validation, and test sets
set.seed(123)
train_index <- createDataPartition(data$Weekly_Sales, p = 0.6, list = FALSE)
train_data <- data[train_index, ]
remaining_data <- data[-train_index, ]
val_index <- createDataPartition(remaining_data$Weekly_Sales, p = 0.5, list = FALSE)
val_data <- remaining_data[val_index, ]
test_data <- remaining_data[-val_index, ]

cat("Training Set Size:", nrow(train_data), "\n")
## Training Set Size: 3863
cat("Validation Set Size:", nrow(val_data), "\n")
## Validation Set Size: 1288
cat("Test Set Size:", nrow(test_data), "\n")
## Test Set Size: 1284
# Convert the data to matrix format for XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]),
                            label = train_data$Weekly_Sales)
val_matrix <- xgb.DMatrix(data = as.matrix(val_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]),
                          label = val_data$Weekly_Sales)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]),
                           label = test_data$Weekly_Sales)

# Train the XGBoost model
xgb_model <- xgboost(data = train_matrix,
                     max.depth = 6,
                     eta = 0.1,
                     nrounds = 500,
                     objective = "reg:squarederror",
                     verbose = 0)

# Calculate feature importance  -- Moved to after model training
importance <- xgb.importance(
  feature_names = colnames(train_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]),
  model = xgb_model  # xgb_model is now defined
)

# Display feature importance in the console
print(importance)
##         Feature        Gain       Cover  Frequency
##          <char>       <num>       <num>      <num>
## 1:        Store 0.711576878 0.168903894 0.22193699
## 2:          CPI 0.146596990 0.236938961 0.19169195
## 3: Unemployment 0.098212896 0.105982193 0.09096849
## 4:   Fuel_Price 0.021730245 0.246974911 0.20396733
## 5:  Temperature 0.019182315 0.231661994 0.26669778
## 6: Holiday_Flag 0.002700677 0.009538048 0.02473746
# Plot feature importance
xgb.plot.importance(
  importance_matrix = importance,
  top_n = 6, # Show top 6 features
  main = "Feature Importance"
)

This Feature Importance visualization helps identify critical areas needing improvement, allowing for targeted strategies to enhance model performance and achieve more accurate forecasts.

# Train the XGBoost model
xgb_model <- xgboost(data = train_matrix,
                     max.depth = 6,
                     eta = 0.1,
                     nrounds = 500,
                     objective = "reg:squarederror",
                     verbose = 0)

# Function to calculate performance metrics
evaluate_model <- function(true, predicted) {
  rmse <- sqrt(mean((true - predicted)^2))
  mae <- mean(abs(true - predicted))
  mape <- mean(abs((true - predicted) / true)) * 100
  mase <- mae / mean(abs(diff(true)))
  list(RMSE = rmse, MAE = mae, MAPE = mape, MASE = mase)
}

# Validation set predictions
val_predictions <- predict(xgb_model, val_matrix)
val_actuals <- val_data$Weekly_Sales

# Plot actual vs predicted values for the validation set
plot(val_actuals, val_predictions,
     xlab = "Actual Values", ylab = "Predicted Values",
     main = "Validation Set: Actual vs Predicted",
     col = "blue", pch = 16)

# Add ideal diagonal line
abline(0, 1, col = "red", lwd = 2)  # Red diagonal line

# Calculate performance metrics for the validation set
val_metrics <- evaluate_model(val_actuals, val_predictions)
cat("Validation Metrics:\n")
## Validation Metrics:
print(val_metrics)
## $RMSE
## [1] 130689.9
## 
## $MAE
## [1] 71296.53
## 
## $MAPE
## [1] 6.92197
## 
## $MASE
## [1] 0.5651222
# Test set predictions
test_predictions <- predict(xgb_model, test_matrix)
test_actuals <- test_data$Weekly_Sales

# Plot actual vs predicted values for the test set
plot(test_actuals, test_predictions,
     xlab = "Actual Values", ylab = "Predicted Values",
     main = "Test Set: Actual vs Predicted",
     col = "green", pch = 16)

# Add ideal diagonal line
abline(0, 1, col = "red", lwd = 2)  # Red diagonal line

# Calculate performance metrics for the test set
test_metrics <- evaluate_model(test_actuals, test_predictions)
cat("Test Metrics:\n")
## Test Metrics:
print(test_metrics)
## $RMSE
## [1] 120367.1
## 
## $MAE
## [1] 67386.9
## 
## $MAPE
## [1] 6.640543
## 
## $MASE
## [1] 0.5199285

Result Analysis : MAPE and MASE: Both metrics are well within acceptable limits, showing that the model has high predictive accuracy and is reliable. The model performs consistently across both the validation and test sets, with slightly better performance on the test set, suggesting good generalization and effectiveness.

2.3.2.Random Forest

install.packages("randomForest",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'randomForest'打开成功,MD5和检查也通过
install.packages("caret",repos = "https://cloud.r-project.org/",quiet = TRUE)
install.packages("MLmetrics",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'MLmetrics'打开成功,MD5和检查也通过
library(randomForest)
library(caret)
library(MLmetrics)
#check data
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")

data <- na.omit(data)

# split data train:test:validation=0.6:0.2:0.2
set.seed(123)
train_index <- createDataPartition(data$Weekly_Sales, p = 0.6, list = FALSE)
train_data <- data[train_index, ]
remaining_data <- data[-train_index, ]
val_index <- createDataPartition(remaining_data$Weekly_Sales, p = 0.5, list = FALSE)
val_data <- remaining_data[val_index, ]
test_data <- remaining_data[-val_index, ]

cat("Training Set Size:", nrow(train_data), "\n")
## Training Set Size: 3863
cat("Validation Set Size:", nrow(val_data), "\n")
## Validation Set Size: 1288
cat("Test Set Size:", nrow(test_data), "\n")
## Test Set Size: 1284
#set model
rf_model <- randomForest(
  Weekly_Sales ~ Store + Holiday_Flag + Temperature + Fuel_Price + CPI + Unemployment,
  data = train_data,
  ntree = 500,
  mtry = 3,
  importance = TRUE
)

print(rf_model)
## 
## Call:
##  randomForest(formula = Weekly_Sales ~ Store + Holiday_Flag +      Temperature + Fuel_Price + CPI + Unemployment, data = train_data,      ntree = 500, mtry = 3, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 19870155786
##                     % Var explained: 93.8
# importance
importance(rf_model)
##                 %IncMSE IncNodePurity
## Store        248.615701  7.435914e+14
## Holiday_Flag   9.470042  4.838308e+12
## Temperature   29.052360  4.649784e+13
## Fuel_Price    45.628960  4.664705e+13
## CPI          116.583896  2.262146e+14
## Unemployment  66.249083  1.582732e+14
varImpPlot(rf_model)

# evaluation
evaluate_model <- function(true, predicted, train_values) {
  # RMSE
  rmse <- sqrt(mean((true - predicted)^2))

  # MAE
  mae <- mean(abs(true - predicted))

  # MAPE
  mape <- mean(abs((true - predicted) / true)) * 100

  # MASE (Mean Absolute Scaled Error)
  naive_forecast <- mean(abs(diff(train_values)))
  mase <- mae / naive_forecast

  return(list(RMSE = rmse, MAE = mae, MAPE = mape, MASE = mase))
}
#validation prediction
val_predictions <- predict(rf_model, val_data)

#validation true
val_true <- val_data$Weekly_Sales

# validation performance
val_metrics <- evaluate_model(val_true, val_predictions, train_data$Weekly_Sales)
cat("Validation Metrics:\n")
## Validation Metrics:
print(val_metrics)
## $RMSE
## [1] 146595.4
## 
## $MAE
## [1] 80816.92
## 
## $MAPE
## [1] 7.815409
## 
## $MASE
## [1] 0.8287885
# predict on the test set
test_predictions <- predict(rf_model, test_data)

# true values from the test set
test_true <- test_data$Weekly_Sales

# test set performance
test_metrics <- evaluate_model(test_true, test_predictions, train_data$Weekly_Sales)
cat("Test Metrics:\n")
## Test Metrics:
print(test_metrics)
## $RMSE
## [1] 148187
## 
## $MAE
## [1] 78804.86
## 
## $MAPE
## [1] 7.478865
## 
## $MASE
## [1] 0.8081545
cat("Model Performance Comparison:\n")
## Model Performance Comparison:
cat("Validation Metrics:\n")
## Validation Metrics:
print(val_metrics)
## $RMSE
## [1] 146595.4
## 
## $MAE
## [1] 80816.92
## 
## $MAPE
## [1] 7.815409
## 
## $MASE
## [1] 0.8287885
cat("Test Metrics:\n")
## Test Metrics:
print(test_metrics)
## $RMSE
## [1] 148187
## 
## $MAE
## [1] 78804.86
## 
## $MAPE
## [1] 7.478865
## 
## $MASE
## [1] 0.8081545
# validation set: actual vs predicted
plot(val_true, val_predictions, col = "blue", pch = 20,
     main = "Validation Set: Actual vs Predicted",
     xlab = "Actual Values", ylab = "Predicted Values")
abline(0, 1, col = "red")

# test set:actual vs prediced
plot(test_true, test_predictions, col = "green", pch = 20,
     main = "Test Set: Actual vs Predicted",
     xlab = "Actual Values", ylab = "Predicted Values")
abline(0, 1, col = "red")

Result Analysis : The random forest model performs well, with consistent metrics across the validation and test sets. A MAPE below 8% indicates strong predictive accuracy relative to actual sales, while a MASE below 1 confirms the model’s superiority over naive benchmarks. The minimal difference between RMSE and MAE highlights controlled error distribution, and the close alignment of validation and test metrics shows good generalization. Overall, the model is robust and reliable for sales forecasting, with potential for further improvement through hyperparameter tuning and feature engineering.

2.3.3.ARIMA Model

①Forecast the total sales for all stores

install.packages("forecast",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'forecast'打开成功,MD5和检查也通过
library(dplyr) 
data$Date <- as.Date(data$Date, format="%Y-%m-%d")

data$Week <- format(data$Date, "%Y-%U")

total_weekly_sales <- data %>%
  group_by(Week) %>%
  summarise(Total_Weekly_Sales = sum(Weekly_Sales))

# Convert data to a time series object
ts_sales <- ts(total_weekly_sales$Total_Weekly_Sales, frequency = 52)  # 52 weeks per year

# Install and load necessary packages
if (!require("forecast")) install.packages("forecast")
library(forecast)

# Use the first 80% of the data as the training set
train_data <- ts(ts_sales[1:floor(0.8 * length(ts_sales))], frequency = 52)

# Use the remaining 20% of the data as the test set
test_data <- ts(ts_sales[(floor(0.8 * length(ts_sales)) + 1):length(ts_sales)], frequency = 52)

# Fit an ARIMA model using the training set
fit <- auto.arima(train_data)

# Forecast the next 20% of the data using the trained model
forecast_test <- forecast(fit, h = length(test_data))

# Combine the data for plotting
full_forecast <- ts(c(train_data, forecast_test$mean), start = start(ts_sales), frequency = frequency(ts_sales))

# Plot the full time series, including the actual values for the first 80% and the predicted values for the last 20%
plot(ts_sales, col = "blue", lty = 1, type = "l",
     main = "Total Weekly Sales: Actual vs Predicted",
     xlab = "Year", ylab = "Total Weekly Sales", ylim = range(c(ts_sales, forecast_test$mean)),xaxt = "n")

lines(full_forecast, col = "red", lty = 2, type = "l")  # Add predicted values
abline(v = length(train_data), col = "gray", lty = 3)  # Use a dashed line to mark the boundary between training and prediction

# Add legend
legend("topright", legend = c("Actual", "Predicted", "Training Boundary"),
       col = c("blue", "red", "gray"), lty = c(1, 2, 3), cex = 0.8)

Viewpoint:We can see that when using the last 20% of the data for prediction, the predicted values and actual values show a small difference, and the fluctuation trends are generally consistent.

# Select the first 80% of the data as the training set, and the remaining 20% as the test set.
train_data <- ts_sales[1:floor(0.8 * length(ts_sales))]
test_data <- ts_sales[(floor(0.8 * length(ts_sales)) + 1):length(ts_sales)]

# Fit the ARIMA model using the training set.
fit_train <- auto.arima(train_data)

# Evaluate the model performance using the test set.
forecast_test <- forecast(fit_train, h = length(test_data))

# Calculate the accuracy of the test set.
accuracy(forecast_test, test_data)
##                       ME    RMSE     MAE       MPE     MAPE      MASE
## Training set   -9082.085 5503920 3079490 -1.036141 6.034515 0.8727615
## Test set     -727883.005 2002380 1533399 -1.703293 3.331537 0.4345821
##                    ACF1
## Training set 0.02117658
## Test set             NA

Result Analysis :

  • RMSE:2002380,The value is relatively large, indicating that there is some error in the predictions. However, given the large scale of the data (sales amounting to tens of millions), such an RMSE is reasonable.

  • MAE:1533399, indicates that the model’s average prediction error is approximately this value, reflecting the overall error level.

  • MAPE:3.33, It falls within a relatively low error range, and the model performs well in terms of relative error assessment.

  • MASE:0.4346, MASE is less than 1, indicating that the ARIMA model is clearly superior to the benchmark model.

②Forecast the sales of a single store

#  load necessary packages
if (!require("forecast")) install.packages("forecast")
library(forecast)
store_1_data <- data[data$Store == 1, ]
ts_store_1 <- ts(store_1_data$Weekly_Sales, frequency = 52)
# Use the first 80% of the data as the training set
train_data_1 <- ts(ts_store_1[1:floor(0.8 * length(ts_store_1))], frequency = 52)

# Use the remaining 20% of the data as the test set
test_data_1 <- ts(ts_store_1[(floor(0.8 * length(ts_store_1)) + 1):length(ts_store_1)], frequency = 52)

# Fit an ARIMA model using the training set
fit <- auto.arima(train_data_1)

# Forecast the next 20% of the data using the trained model
forecast_test_1 <- forecast(fit, h = length(test_data_1))

# Combine the data for plotting
full_forecast_1 <- ts(c(train_data_1, forecast_test_1$mean), start = start(ts_store_1), frequency = frequency(ts_sales))

# Plot the full time series, including the actual values for the first 80% and the predicted values for the last 20%
plot(ts_store_1, col = "blue", lty = 1, type = "l",
     main = "store_1 Weekly Sales: Actual vs Predicted",
     xlab = "Year", ylab = "Total Weekly Sales", ylim = range(c(ts_store_1, forecast_test_1$mean)),xaxt = "n")
lines(full_forecast_1, col = "red", lty = 2, type = "l")  # Add predicted values
abline(v = length(train_data), col = "gray", lty = 3)  # Use a dashed line to mark the boundary between training and prediction

# Add legend
legend("topright", legend = c("Actual", "Predicted", "Training Boundary"),
       col = c("blue", "red", "gray"), lty = c(1, 2, 3), cex = 0.8)

Viewpoint: We can clearly see that when predicting for a single store, the predicted values are much closer to the actual values, showing better performance compared to forecasting the total sales.

# Select the first 80% of the data as the training set, and the remaining 20% as the test set.
train_data_1 <- ts_store_1[1:floor(0.8 * length(ts_store_1))]
test_data_1 <- ts_store_1[(floor(0.8 * length(ts_store_1)) + 1):length(ts_store_1)]

# Fit the ARIMA model using the training set.
fit_train_1 <- auto.arima(train_data_1)

# Evaluate the model performance using the test set.
forecast_test_1 <- forecast(fit_train_1, h = length(test_data_1))

# Calculate the accuracy of the test set.
accuracy(forecast_test_1, test_data_1)
##                      ME     RMSE       MAE        MPE     MAPE      MASE
## Training set   9326.826 148255.6 101358.55 -0.1383518 6.289358 0.7957775
## Test set     -76061.358 112995.7  94264.25 -5.1051813 6.161850 0.7400794
##                     ACF1
## Training set -0.01314101
## Test set              NA

Result Analysis:

  • Both RMSE and MAE are smaller than those of the training set. Although the errors are larger, they still indicate that the difference between the predicted values and the actual values is relatively small.

  • The MASE for the test set is less than 1, meaning the model still performs better than the baseline model, showing good effectiveness.

Conclusion: We can see from the numbers that the ARIMA model demonstrates strong forecasting accuracy when predicting both total sales and sales for Store 1. The forecast values are closer to the actual values for Store 1, likely due to the smaller sales figures of individual stores, resulting in a smaller difference between the predicted and actual values.

2.4 Prediction Results(EVALUATION COMPARISON)

The purpose of conducting an evaluation first in a machine learning or statistical modeling task is to assess and compare the performance of different models before selecting the best one for the problem at hand. Evaluation allows for an objective comparison between various algorithms, helping to determine which model provides the most accurate predictions, best generalizes to unseen data, and performs optimally with the available features

In this evaluation below,two machine learning models XGBoost and Random Forest are being compared on their ability to predict outcomes using a training dataset, and their performance is measured using four key metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Absolute Scaled Error (MASE). These metrics provide valuable insights into the models’ predictive accuracy, error magnitudes, and ability to generalize to new data.

# Load necessary libraries
library(xgboost)
library(randomForest)
library(caret)
library(ggplot2)
library(gridExtra)
library(MLmetrics)

# Function to generate evaluation table
generate_evaluation_table <- function(model_name, rmse, mae, mape, mase) {
  data.frame(
    Model = model_name,
    RMSE = rmse,
    MAE = mae,
    MAPE = mape,
    MASE = mase
  )
}

# Load and preprocess the data
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
data <- na.omit(data)

# Split the data into training, validation, and test sets
set.seed(123)
train_index <- createDataPartition(data$Weekly_Sales, p = 0.6, list = FALSE)
train_data <- data[train_index, ]
remaining_data <- data[-train_index, ]
val_index <- createDataPartition(remaining_data$Weekly_Sales, p = 0.5, list = FALSE)
val_data <- remaining_data[val_index, ]
test_data <- remaining_data[-val_index, ]

cat("Training Set Size:", nrow(train_data), "\n")
## Training Set Size: 3863
cat("Validation Set Size:", nrow(val_data), "\n")
## Validation Set Size: 1288
cat("Test Set Size:", nrow(test_data), "\n")
## Test Set Size: 1284
# Prepare data matrices for XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]), label = train_data$Weekly_Sales)
val_matrix <- xgb.DMatrix(data = as.matrix(val_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]), label = val_data$Weekly_Sales)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c("Store", "Holiday_Flag", "Temperature", "Fuel_Price", "CPI", "Unemployment")]), label = test_data$Weekly_Sales)

# Train the XGBoost model
xgb_model <- xgboost(data = train_matrix,
                     max.depth = 6,
                     eta = 0.1,
                     nrounds = 500,
                     objective = "reg:squarederror",
                     verbose = 0)

# Function to calculate performance metrics
evaluate_model <- function(true, predicted, train_values = NULL) {
  rmse <- sqrt(mean((true - predicted)^2))
  mae <- mean(abs(true - predicted))
  mape <- mean(abs((true - predicted) / true)) * 100
  mase <- if (!is.null(train_values)) {
    mae / mean(abs(diff(train_values)))  # Calculate MASE if train values are provided
  } else {
    NA
  }
  list(RMSE = rmse, MAE = mae, MAPE = mape, MASE = mase)
}

# XGBoost Validation and Test Metrics
val_predictions_xgb <- predict(xgb_model, val_matrix)
val_actuals_xgb <- val_data$Weekly_Sales
val_metrics_xgb <- evaluate_model(val_actuals_xgb, val_predictions_xgb, train_data$Weekly_Sales)

test_predictions_xgb <- predict(xgb_model, test_matrix)
test_actuals_xgb <- test_data$Weekly_Sales
test_metrics_xgb <- evaluate_model(test_actuals_xgb, test_predictions_xgb, train_data$Weekly_Sales)

# Generate evaluation table for XGBoost
xgb_val_table <- generate_evaluation_table("XGBoost (Validation)", val_metrics_xgb$RMSE, val_metrics_xgb$MAE, val_metrics_xgb$MAPE, val_metrics_xgb$MASE)
xgb_test_table <- generate_evaluation_table("XGBoost (Test)", test_metrics_xgb$RMSE, test_metrics_xgb$MAE, test_metrics_xgb$MAPE, test_metrics_xgb$MASE)

# Train the Random Forest model
rf_model <- randomForest(Weekly_Sales ~ Store + Holiday_Flag + Temperature + Fuel_Price + CPI + Unemployment,
                         data = train_data,
                         ntree = 500,
                         mtry = 3,
                         importance = TRUE)

# Random Forest Validation and Test Metrics
val_predictions_rf <- predict(rf_model, val_data)
val_actuals_rf <- val_data$Weekly_Sales
val_metrics_rf <- evaluate_model(val_actuals_rf, val_predictions_rf, train_data$Weekly_Sales)

test_predictions_rf <- predict(rf_model, test_data)
test_actuals_rf <- test_data$Weekly_Sales
test_metrics_rf <- evaluate_model(test_actuals_rf, test_predictions_rf, train_data$Weekly_Sales)

# Generate evaluation table for Random Forest
rf_val_table <- generate_evaluation_table("Random Forest (Validation)", val_metrics_rf$RMSE, val_metrics_rf$MAE, val_metrics_rf$MAPE, val_metrics_rf$MASE)
rf_test_table <- generate_evaluation_table("Random Forest (Test)", test_metrics_rf$RMSE, test_metrics_rf$MAE, test_metrics_rf$MAPE, test_metrics_rf$MASE)

# Combine all tables into one
evaluation_table <- rbind(xgb_val_table, xgb_test_table, rf_val_table, rf_test_table)

# Print the final evaluation table
print(evaluation_table)
##                        Model     RMSE      MAE     MAPE      MASE
## 1       XGBoost (Validation) 130689.9 71296.53 6.921970 0.7311555
## 2             XGBoost (Test) 120367.1 67386.90 6.640543 0.6910618
## 3 Random Forest (Validation) 146614.5 80248.24 7.787111 0.8229566
## 4       Random Forest (Test) 147832.7 78657.74 7.468135 0.8066457

Conclusion:

The table presents the performance metrics for two machine learning models, XGBoost and Random Forest, evaluated on validation and test datasets. The training set consists of 3,863 data points, while the validation and test sets contain 1,288 and 1,284 data points, respectively. The metrics used to evaluate the models are RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and MASE (Mean Absolute Scaled Error). These metrics assess the models’ accuracy and generalization ability.

XGBoost outperforms Random Forest on both validation and test sets across all metrics. On the validation set, XGBoost has an RMSE of 130,689.9, an MAE of 71,296.53, a MAPE of 6.92%, and a MASE of 0.731, indicating that its predictions are closer to the actual values compared to Random Forest, which has a higher RMSE of 146,614.5, an MAE of 80,248.24, a MAPE of 7.79%, and a MASE of 0.823. On the test set, XGBoost continues to perform better with an RMSE of 120,367.1, an MAE of 67,386.90, a MAPE of 6.64%, and a MASE of 0.691, while Random Forest’s RMSE is 147,832.7, MAE is 78,657.74, MAPE is 7.47%, and MASE is 0.807.

Overall, XGBoost demonstrates superior performance compared to Random Forest in terms of prediction accuracy and generalization on both the validation and test datasets. This indicates that XGBoost is the more suitable model for the given task.

Why not include ARIMA?

The reason ARIMA (AutoRegressive Integrated Moving Average) is not included in the comparison table could be due to a few factors. First, ARIMA is a statistical model mainly used for forecasting single time series data, while XGBoost and Random Forest are machine learning models that can handle both regression and classification tasks.In situations where the dataset is large or the data shows trends or patterns that change over time (non-stationary), ARIMA might not work as well. Machine learning models like XGBoost and Random Forest are often better at handling complex patterns in data and can work well with various types of data. These models are also more flexible, especially when dealing with many different features or variables.

Therefore, ARIMA might have been left out of the comparison because XGBoost and Random Forest were seen as more suitable for the specific problem, offering better performance and flexibility for the task.

2.5 Customer Segmentation Modeling

library(dplyr)
data <- na.omit(data)
# # Calculate the average CPI, unemployment rate, and fuel price for each store
average_economic_data <- data %>%
  group_by(Store) %>%
  summarise(
    Avg_CPI = mean(CPI, na.rm = TRUE),
    Avg_Unemployment = mean(Unemployment, na.rm = TRUE),
    Avg_Fuel_Price = mean(Fuel_Price, na.rm = TRUE)
  )
# drop Store
average_data_scaled <- scale(average_economic_data[, -1])

# K-means
set.seed(123)
kmeans_result <- kmeans(average_data_scaled, centers=3, nstart=25)
average_economic_data$Cluster <- as.factor(kmeans_result$cluster)

2.6 Customer Segmentation Results

install.packages("ggrepel",repos = "https://cloud.r-project.org/",quiet = TRUE)
## 程序包'ggrepel'打开成功,MD5和检查也通过
library(ggrepel)

ggplot(average_economic_data, aes(x=Avg_CPI, y=Avg_Unemployment, label=Store)) +
  geom_point(aes(color=factor(Cluster)), alpha=0.5, size=3) +
  geom_text_repel(aes(label=Store), size=3, box.padding = 0.35, point.padding = 0.5) +
  scale_color_manual(values=c("red", "yellow", "blue")) +
  labs(title = "Repelled Labels in Cluster Plot",
       x = "Average CPI",
       y = "Average Unemployment Rate",
       color = "Cluster Category") +
  theme_minimal()

#table
store_cluster_table <- data.frame(
  Store = average_economic_data$Store,
  Cluster = average_economic_data$Cluster
)

print(store_cluster_table)
##    Store Cluster
## 1      1       2
## 2      2       2
## 3      3       2
## 4      4       1
## 5      5       2
## 6      6       2
## 7      7       2
## 8      8       2
## 9      9       2
## 10    10       1
## 11    11       2
## 12    12       3
## 13    13       1
## 14    14       1
## 15    15       1
## 16    16       2
## 17    17       1
## 18    18       1
## 19    19       1
## 20    20       2
## 21    21       2
## 22    22       1
## 23    23       1
## 24    24       1
## 25    25       2
## 26    26       1
## 27    27       1
## 28    28       3
## 29    29       1
## 30    30       2
## 31    31       2
## 32    32       2
## 33    33       1
## 34    34       1
## 35    35       1
## 36    36       2
## 37    37       2
## 38    38       3
## 39    39       2
## 40    40       1
## 41    41       2
## 42    42       1
## 43    43       2
## 44    44       1
## 45    45       1

Conclusion:

In the dataset, different stores have varying economic indicators, which can roughly infer that the stores are located in different regions. Consumers living in these different regions have different purchasing power, so stores can formulate marketing strategies based on local economic factors.

  • Cluster 1 (Red Zone): The CPI is relatively low, indicating a lower cost of living, likely in rural areas or economically underdeveloped regions. The unemployment rate is low to moderate, suggesting relatively stable employment, but household income levels may be lower. Consumers in these areas are likely to focus on purchasing basic necessities rather than luxury or high-end products. They prioritize discounted or promotional items.

  • Cluster 2 (Yellow Zone): The CPI is relatively high, indicating a higher cost of living, likely in urban or economically developed regions. The unemployment rate is low, which implies more job opportunities and higher household income levels. Consumers in these areas may prefer high-quality, high-brand-value products. They are more attentive to the shopping experience and additional services, such as membership programs or home delivery. Their consumption habits are more diverse and may include luxury goods, organic products, and environmentally friendly items.

  • Cluster 3 (Blue Zone): The unemployment rate is the highest, indicating potential economic difficulties in these regions, with household income being unstable. The CPI is moderate, so the cost of living is not high, but the high unemployment rate adds economic pressure on residents. Consumers in these areas are likely to focus their spending on essentials such as food and clothing and prefer low-priced products. They prioritize promotions and discounted items

3. Conclusion

This project demonstrates the effectiveness of machine learning models in forecasting sales performance and identifying key drivers of consumer behavior. XGBoost proved to be the most reliable model, offering high accuracy and robust predictions across datasets. Insights into economic indicators and seasonal factors provide actionable guidance for optimizing marketing strategies and resource planning.

At the same time, this project achieves customer segmentation by segmenting stores, as consumers living in these different areas have varying purchasing power and characteristics. Therefore, marketing strategies can be tailored based on local economic factors in the future.