This capstone addresses the crucial challenge of predicting sales in businesses by applying advanced data science techniques.
This project focuses on the development and implementation of a predictive model aimed at improving sales management in a company with multiple points of sale. The company operates in Colombia (South America) and has more than 500 stores distributed across different cities of the country.
The study focuses on the development of accurate and efficient predictive models that allow companies to anticipate sales trends, optimize strategic planning and improve business decision making.
The methodology covers data collection and cleaning and implements machine learning algorithms and predictive analysis techniques. Historical sales data sets are used, as well as relevant information on external factors that may influence business performance. The goal is to identify patterns and correlations that allow us to predict sales trends.
The data collected include historical sales, seasonal factors, and relevant external variables. Advanced predictive analysis techniques, such as machine learning, are then applied to identify patterns and correlations that allow us to predict sales trends in each store.
The predictive model developed is integrated with the business management system, providing the managers of each store with effective tools for making data-driven decisions. This approach is expected to contribute to inventory optimization, reducing losses due to stockouts or excess inventory, and ultimately to increase the profitability of the company. Additionally, scaling the predictive model to adapt to new locations or changes in market conditions will be considered.
This project not only seeks to improve operational efficiency, but also to establish a sustainable and adaptable framework for sales management in a company with a multi-location presence. This work contributes to the field of data science by providing a guide for addressing the specific challenge of sales prediction in companies, highlighting the importance of applying advanced techniques to strategic decision making.
In today’s business environment, effective management of multiple retail stores has become a crucial challenge for optimizing operations and maximizing profits. Companies with multiple stores often face the complex task of anticipating and adapting to variations in demand at each location.
The project aims to develop a predictive model for forecasting sales trends across the different stores, thus providing the company with the tools needed to make data-driven, strategic decisions.
The need for a predictive approach arises from increasing competition and the diversity of factors that influence customer purchasing behavior. The gathering and analysis of historical data, combined with seasonal factors and relevant external variables, will allow the identification of sales behavior patterns. Using data science techniques that apply predictive analysis, future product demand can be estimated.
The implementation of predictive models not only seeks to improve operational efficiency but also contributes to the overall profitability of the company by avoiding losses due to excess or lack of inventory. Additionally, the ability to scale the model to adapt to new locations or changes in market conditions will be considered, establishing a sustainable and adaptable approach to sales management in a multi-store company.
This project is proposed as a strategic tool to strengthen the company’s ability to anticipate and satisfy market demand across its stores, with the objective of improving its competitive position and maximizing operational efficiency.
Existing literature provides an essential foundation for understanding the importance and complexities associated with sales prediction in the business environment. Sales management has evolved significantly with the emergence of data science.
The literature on traditional sales prediction models has established solid knowledge around approaches based on classic statistical methods such as time series analysis and linear regression. These approaches have proven effective in certain contexts, but their ability to handle the complexity of today’s data and adapt to dynamic changes in the market is limited, and their precision can be quite low.
Predictive analytics is an essential process that relies on using past and present data to make predictions, especially by examining data trends.
In recent years, and given the rise of big data, it has emerged as a fundamental trend, offering transformative potential in the way companies organize and use information relevant to them, their clients, as well as their potential clients.
Effective data storage is essential, as without it, businesses would lose the ability to gain valuable insights. Without information and analysis, new benefits and opportunities would be lost.
Technology currently allows companies to use data as a fundamental pillar to understand their customers, create new products and satisfy their needs. Implementing a data-driven approach to decision making has been shown to increase the profitability of companies regardless of industry.
Predictive sales analytics is widely used by companies across industries, from food service and retail to healthcare and government services. Companies can optimally allocate their resources, while adequately meeting expected demand.
Some of the predictive analysis techniques are:
Decision trees: Classify data into different groups. The model is shaped like a tree; each branch represents a possible choice, and the result is shown at the leaf.
Random forest: An ensemble of decision trees whose individual predictions are combined into a single prediction.
Neural networks: Models inspired by the human brain, used to make predictions when the relationships between variables are complex.
Data mining: Exploring large databases to find patterns.
Predictive analytics is extremely useful in sales prediction for several key reasons:
Inventory Optimization: Allows companies to anticipate future demand, which facilitates inventory management. By forecasting product needs, companies can avoid excess or shortages of stock, thus optimizing their inventory levels.
Improving Pricing Strategies: Facilitates the identification of consumer behavior patterns and market trends, which helps to adjust pricing strategies more precisely. This may include price discounts, special offers, and dynamic pricing strategies.
Offer Personalization: Allows companies to personalize their offers and promotions based on customers’ preferences and past behaviors. This not only improves customer satisfaction, but also increases the chances of making more sales.
Optimization of Marketing Strategies: Helps efficiently direct marketing strategies by identifying the customer segments most likely to make purchases. This may include audience segmentation, and personalization of advertising messages.
Improved Strategic Planning: Provides valuable information for long-term strategic planning, understanding sales trends and patterns. In this way, companies can adjust their goals and objectives more realistically.
Reducing Financial Risks: By forecasting demand and adjusting operations accordingly, companies can reduce the financial risks associated with unsold excess inventory or shortages of critical products.
By integrating these key aspects of the literature, this research seeks to contribute to the updated understanding of sales prediction. In the context of data science, the goal is to provide a solid foundation for the development of more advanced and effective approaches in the field of enterprise sales.
In short, predictive analytics in sales forecasting not only helps businesses anticipate demand but also provides a solid foundation for strategic decision making. This results in a more efficient operation, and better customer satisfaction.
We use the R programming language and load several libraries or packages that contain specific functions and tools for data analysis and visualization in R. These libraries provide a variety of tools for performing data analysis, visualizations, and statistical modeling in R.
library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)
library(corrplot)
library(lubridate)
library(heatmaply)
library(visdat)
library(gridExtra)
library(GGally)
library(caret)
library(caTools)
library(randomForest)
library(tree)
library(Metrics)
library(rpart)
library(glmnet)
The quality and relevance of the data set play a vital role in a data science project, especially in the context of the organization’s sales prediction.
The review of the data set focuses on the key aspects that have influenced the selection, preparation, and use of the data to develop sales prediction models.
The company provided us with its sales database for three years: 2021, 2022, and 2023. The data set selected for this analysis is a compilation of the sales of each company store over that period.
The historical sales data set spans a significant period, allowing it to capture seasonal patterns, long-term trends, and changes in consumer behavior.
The set contains the number of sales for each store, as well as characteristics of each store, such as whether the point of sale was open and whether it had any promotions running.
It is a very complete data set: we can also see whether the date was a holiday, the number of customers at each store, and other fields that will allow us to conduct a good analysis.
In the following table we can see the variables and the data type:
## # A tibble: 6 × 9
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## # ℹ 1 more variable: SchoolHoliday <dbl>
## spc_tbl_ [510,776 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date[1:510776], format: "2021-01-04" "2021-01-11" ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ StateHoliday : chr [1:510776] "0" "0" "0" "0" ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. DayOfWeek = col_double(),
## .. Date = col_date(format = ""),
## .. Sales = col_double(),
## .. Customers = col_double(),
## .. Open = col_double(),
## .. Promo = col_double(),
## .. StateHoliday = col_character(),
## .. SchoolHoliday = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
There is another dataset for the stores, where we can observe the type of store, whether the store has carried out promotions, and whether it has competition nearby that could affect its sales.
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 c a 1270 2018 0
## 2 2 a a 570 2007 1
## 3 3 a a 14130 2006 1
## 4 4 c c 620 2009 0
## 5 5 a a 29910 2015 0
## 6 6 a a 310 2013 0
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 555 d a 1560 2014 1
## 2 556 d c 1140 2014 0
## 3 557 a a 250 NA 0
## 4 558 a a 3000 2010 0
## 5 559 d a 3910 2006 1
## 6 560 c c 1910 2013 0
We have paid special attention to the selection of predictor variables within the data set.
This review addresses the importance of data quality and cleanliness throughout the process. Identification and handling of outliers, imputation of missing data, and verification of temporal consistency are critical aspects to ensure the reliability of the results.
Visual data exploration has been used as an integral part of the review of the data set. Visualization of patterns, distributions, and correlations between variables has provided valuable information for initial understanding of the structure of the data set.
The objective of data preparation and exploration is to ensure a solid and reliable basis for the development of sales prediction models in the field of data science.
The dataset has 510,776 observations and 9 variables. We will check whether there are missing values in the dataset.
## [1] 510776 9
## [1] FALSE
Checking Null Values:
## Store DayOfWeek Date Sales Customers
## 0 0 0 0 0
## Open Promo StateHoliday SchoolHoliday
## 0 0 0 0
## Store DayOfWeek Date Sales
## Min. : 1.0 Min. :1.000 Min. :2021-01-01 Min. : 0
## 1st Qu.:141.0 1st Qu.:2.000 1st Qu.:2021-08-17 1st Qu.: 3785
## Median :281.0 Median :4.000 Median :2022-04-02 Median : 5778
## Mean :280.8 Mean :4.003 Mean :2022-04-11 Mean : 5807
## 3rd Qu.:420.0 3rd Qu.:6.000 3rd Qu.:2022-12-12 3rd Qu.: 7896
## Max. :560.0 Max. :7.000 Max. :2023-07-31 Max. :38722
## Customers Open Promo StateHoliday
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Length:510776
## 1st Qu.: 405.0 1st Qu.:1.0000 1st Qu.:0.0000 Class :character
## Median : 607.0 Median :1.0000 Median :0.0000 Mode :character
## Mean : 629.6 Mean :0.8306 Mean :0.3815
## 3rd Qu.: 834.0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5494.0 Max. :1.0000 Max. :1.0000
## SchoolHoliday
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1785
## 3rd Qu.:0.0000
## Max. :1.0000
Unique rows in sales_df:
## # A tibble: 510,776 × 9
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## 7 1 1 2021-02-15 6407 710 1 1 0
## 8 1 1 2021-02-22 4038 534 1 0 0
## 9 1 1 2021-03-01 7675 840 1 1 0
## 10 1 1 2021-03-08 4949 618 1 0 0
## # ℹ 510,766 more rows
## # ℹ 1 more variable: SchoolHoliday <dbl>
unique_count <- sales_df %>%
summarise(UniqueCount = n_distinct(Store))
print(unique_count$UniqueCount)
## [1] 560
Value counts for StateHoliday:
##
## 0 a b c
## 495254 10104 3360 2058
Print the initial (first) date:
## [1] "2021-01-04"
Print the final (last) date:
## [1] "2023-07-30"
Assuming the data frame is ‘sales_df’, the date values must be organized so that only columns with numerical values remain:
sales_df$Date <- as.Date(sales_df$Date)
sales_df$Year <- year(sales_df$Date)
sales_df$Month <- month(sales_df$Date)
sales_df$Day <- day(sales_df$Date)
sales_df$WeekOfYear <- week(sales_df$Date)
print(sales_df)## # A tibble: 510,776 × 13
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## 7 1 1 2021-02-15 6407 710 1 1 0
## 8 1 1 2021-02-22 4038 534 1 0 0
## 9 1 1 2021-03-01 7675 840 1 1 0
## 10 1 1 2021-03-08 4949 618 1 0 0
## # ℹ 510,766 more rows
## # ℹ 5 more variables: SchoolHoliday <dbl>, Year <dbl>, Month <dbl>, Day <int>,
## # WeekOfYear <dbl>
## # A tibble: 4 × 13
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2023-07-31 0 0 0 0 a
## 2 2 1 2023-07-31 0 0 0 0 a
## 3 3 1 2023-07-31 0 0 0 0 a
## 4 4 1 2023-07-31 0 0 0 0 a
## # ℹ 5 more variables: SchoolHoliday <dbl>, Year <dbl>, Month <dbl>, Day <int>,
## # WeekOfYear <dbl>
The columns to be removed: Date, StateHoliday:
## # A tibble: 510,776 × 11
## Store DayOfWeek Sales Customers Open Promo SchoolHoliday Year Month Day
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 1 7176 785 1 1 1 2021 1 4
## 2 1 1 4717 616 1 0 0 2021 1 11
## 3 1 1 5394 607 1 1 0 2021 1 18
## 4 1 1 4055 549 1 0 0 2021 1 25
## 5 1 1 7032 762 1 1 0 2021 2 1
## 6 1 1 4409 599 1 0 0 2021 2 8
## 7 1 1 6407 710 1 1 0 2021 2 15
## 8 1 1 4038 534 1 0 0 2021 2 22
## 9 1 1 7675 840 1 1 0 2021 3 1
## 10 1 1 4949 618 1 0 0 2021 3 8
## # ℹ 510,766 more rows
## # ℹ 1 more variable: WeekOfYear <dbl>
## tibble [510,776 × 11] (S3: tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num [1:510776] 2021 2021 2021 2021 2021 ...
## $ Month : num [1:510776] 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int [1:510776] 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num [1:510776] 1 2 3 4 5 6 7 8 9 10 ...
columns_remove_numeric_day <- colums_removed %>%
mutate(Day = as.numeric(Day))
str(columns_remove_numeric_day)
## tibble [510,776 × 11] (S3: tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num [1:510776] 2021 2021 2021 2021 2021 ...
## $ Month : num [1:510776] 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : num [1:510776] 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num [1:510776] 1 2 3 4 5 6 7 8 9 10 ...
To select the best variables for our analysis, the correlation of each independent variable with the dependent variable will be measured, and variables with a low correlation with the dependent variable will be eliminated.
Let’s now create correlation plots to evaluate the most important variables in predicting the sales.
corr_plot_vals <- cor(columns_remove_numeric_day)
corrplot(corr_plot_vals, method="shade",shade.col=NA, tl.col="blue", tl.srt=50, addCoef.col = "black", number.cex = 0.7, col = COL2('RdYlBu', 10))
There are some variables that influence the number of sales: Customers; Open (whether the store is open); and Promo (whether the store had a promotion running).
In the following graph we can see how many stores were open and closed on each day of the week:
day_labels <- c("Mon","Tue", "Wed", "Thu", "Fri","Sat","Sun")
sales_df$DayOfWeek <- factor(sales_df$DayOfWeek, labels = day_labels)
ggplot(sales_df, aes(x = DayOfWeek, fill = factor(Open))) +
geom_bar(position = "dodge") +
labs(title = "Sales vs Day of week", fill = "Open") +
xlab("Day of the Week") +
theme_gray() +
scale_fill_manual(values = c("blue", "darkorange"))In the following graph we can see how sales increase when the store has a promotion:
Promo_sales <- sales_df %>%
group_by(Promo) %>%
summarize(mean_sales = mean(Sales))
ggplot(Promo_sales, aes(x = Promo, y = mean_sales)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Mean sales vs Promo", x = "Promo", y = "Mean Sales") +
theme_gray()
It can be observed that in the months of November and December, sales increase every year.
months_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
sales_df$Month <- factor(sales_df$Month, labels = months_labels)
aggregatevalue <- aggregate(Sales ~ Month + Year, data=sales_df, FUN = sum)
ggplot(aggregatevalue, aes(x = Month, y = Sales, fill = factor(Year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total sales vs month and year",
x = "Month",
y = "Sales",
fill = "Year") +
theme_gray() +
scale_fill_manual(values = c("blue", "darkorange", "lightgreen")) School Holiday Column Values:
##
## 0 1
## 419623 91153
## num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
It can be observed from the graph that sales are largely unaffected by school holidays: only 18% of the sales observations fall on a school holiday.
labels <- c('Not-Affected', 'Affected')
sizes <- table(sales_df$SchoolHoliday)
colors <- c('lightgreen', 'blue')
explode <- c(0.1, 0.0)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data <- data.frame(labels, sizes, percent_labels)
ggplot(data, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Sales vs Schoolholiday') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 1.5), position = position_stack(vjust = 0.5), size = 6) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Transforming Variable StateHoliday:
##
## 0 1
## 495254 15522
It can be observed in the graph that sales are largely unaffected by state holidays: only 3% of the sales observations fall on a state holiday.
labels <- c('Not-Affected', 'Affected')
sizes <- table(sales_df$StateHoliday)
colors <- c('lightgreen', 'blue')
explode <- c(0.1, 0.0)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data1 <- data.frame(labels, sizes, percent_labels)
ggplot(data1, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Sales vs State holiday') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 3), position = position_stack(vjust = 0.5), size = 4) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Since sales are barely affected by state holidays, I will remove this column from the analysis.
Histogram representation of Sales. The spike at 0 appears because stores report zero sales on the days they were closed.
ggplot(sales_df, aes(x = Sales)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "white") +
labs(title = "Distribution of sales", x = "Sales", y = "Frequency") +
theme_gray() +
theme(plot.title = element_text(size = 15))
Add a density curve:
ggplot(sales_df, aes(x = Sales)) +
geom_histogram(binwidth = 1000, fill = "lightblue", color = "white") +
geom_density(aes(y = ..count.. * 1000), color = "red") +
labs(x = "Sales", y = "Density") +
theme_gray() +
theme(plot.title = element_text(size = 20))
In the graph we see the relationship between the number of customers and sales.
ggplot(sales_df, aes(x = Sales, y = Customers)) +
geom_point(color = "lightgreen", size = 2) +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Sales vs Customers", x = "Sales", y = "Customers") +
theme_gray() +
theme(plot.title = element_text(size = 15))
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 c a 1270 2018 0
## 2 2 a a 570 2007 1
## 3 3 a a 14130 2006 1
## 4 4 c c 620 2009 0
## 5 5 a a 29910 2015 0
## 6 6 a a 310 2013 0
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 555 d a 1560 2014 1
## 2 556 d c 1140 2014 0
## 3 557 a a 250 NA 0
## 4 558 a a 3000 2010 0
## 5 559 d a 3910 2006 1
## 6 560 c c 1910 2013 0
Checking Information about Dataset. Print the dimensions:
n_rows <- nrow(store_df) # Get the number of rows
n_cols <- ncol(store_df) # Get the number of columns
cat("Number of rows:", n_rows, "\n")## Number of rows: 560
## Number of columns: 6
## spc_tbl_ [560 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:560] 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr [1:560] "c" "a" "a" "c" ...
## $ Assortment : chr [1:560] "a" "a" "a" "c" ...
## $ CompetitionDistance : num [1:560] 1270 570 14130 620 29910 ...
## $ CompetitionOpenSinceYear: num [1:560] 2018 2007 2006 2009 2015 ...
## $ Promo2 : num [1:560] 0 1 1 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. StoreType = col_character(),
## .. Assortment = col_character(),
## .. CompetitionDistance = col_double(),
## .. CompetitionOpenSinceYear = col_double(),
## .. Promo2 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
It can be observed that there are missing values in the dataset; we are going to convert these entries to ‘NA’ in order to carry out our analysis.
## [1] TRUE
Display the count of missing values for each column:
## Store StoreType Assortment
## 0 0 0
## CompetitionDistance CompetitionOpenSinceYear Promo2
## 1 184 0
vis_dat(store_df) + scale_fill_manual(
values = c(
"character" = "#2297E6",
"numeric" = "lightgreen",
"NA" = "lightgray"
)) +
theme(axis.text.x = element_text(angle = 60, hjust = 0, size=12, face="bold"))
Distribution of the different store types:
labels <- c('a', 'b', 'c', 'd')
sizes <- table(store_df$StoreType)
colors <- c('orange', 'lightgreen', 'lightblue', 'pink')
explode <- c(0.1, 0.0, 0.15, 0.5)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data2 <- data.frame(labels, sizes, percent_labels)
ggplot(data2, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Analysis sales per Store type') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 3), position = position_stack(vjust = 0.5), size = 4) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Removing features with high percentages of missing values: some features have a high percentage of missing values and will not be accurate indicators, so we will remove features with more than 30% missing values.
Replace missing values in features with low percentages of missing values.
We review the distribution of CompetitionDistance, which is the distance in meters to the nearest competitor store.
ggplot(store_df_30_percent, aes(x = CompetitionDistance)) +
geom_histogram(fill = "lightblue", color = "white") +
geom_density(aes(y = ..count.. * 1900), color = "red") +
labs(x = "Competition Distance", y = "Density") +
theme_gray() +
theme(plot.title = element_text(size = 15)) +
ggtitle("Distribution Competition Distance")The distribution is right skewed, so we’ll replace missing values with the median.
store_df$CompetitionDistance <- ifelse(is.na(store_df$CompetitionDistance), median(store_df$CompetitionDistance, na.rm = TRUE), store_df$CompetitionDistance)
Pair plot for the store dataset: we can observe the different graphs to perform a quick analysis of the store dataset.
theme_set(theme_minimal())
theme_update(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
pp <- ggplot(store_df, aes(color = StoreType)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x) +
facet_grid(. ~ StoreType, scales = "free") +
theme(legend.position = "right")We can observe that the Store dataset data has a very weak or practically non-existent correlation between the variables represented.In this case, a significant relationship between the variables cannot be established from observing the data in the pair plot.
ggpairs(store_df, aes(color = StoreType)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.9, size=8))
Checking stores by their assortment type:
theme_set(theme_gray())
store_type <- ggplot(store_df, aes(x = StoreType, fill = Assortment)) +
geom_bar(position = "dodge") +
theme_gray() +
labs(fill = "Assortment") +
theme(legend.position = "right") +
labs(title = 'Assortment vs Store type', x = "StoreType") +
scale_fill_manual(values = c("#41B7C4", "orange", "#CCEDB1")) +
theme(plot.title = element_text(size = 15))
store_type_data <- store_df %>%
group_by(StoreType, Assortment) %>%
summarise(n = n())
store_type_data <- store_type_data %>%
group_by(StoreType) %>%
mutate(percentage = n / sum(n) * 100)
store_type <- store_type +
geom_text(data = store_type_data,
aes(label = paste0(n, " (", round(percentage, 1), "%)"), y = n),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 4,
color = "black")
print(store_type)
We can see in the graph that there are no significant differences in sales across these three years:
ggplot(sales_df, aes(x = as.factor(Year), y = Sales)) +
geom_bar(stat = "summary", fun = "mean", position = position_dodge(width = 0.8), fill = "blue") +
geom_point(aes(group = Store, color = Store), position = position_jitterdodge(jitter.width = 0.3), size = 3) +
theme_minimal() +
labs(title = "Average Sales vs Year") +
xlab("Year") +
ylab("Average Sales") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "top")We take the dataframes are named sales_df and store_df.
## Store DayOfWeek Date Sales Customers Open Promo SchoolHoliday Year
## 1 1 Mon 2021-01-04 7176 785 1 1 1 2021
## 2 1 Mon 2021-01-11 4717 616 1 0 0 2021
## 3 1 Mon 2021-01-18 5394 607 1 1 0 2021
## 4 1 Mon 2021-01-25 4055 549 1 0 0 2021
## 5 1 Mon 2021-02-01 7032 762 1 1 0 2021
## 6 1 Mon 2021-02-08 4409 599 1 0 0 2021
## Month Day WeekOfYear StoreType Assortment CompetitionDistance Promo2
## 1 Jan 4 1 c a 1270 0
## 2 Jan 11 2 c a 1270 0
## 3 Jan 18 3 c a 1270 0
## 4 Jan 25 4 c a 1270 0
## 5 Feb 1 5 c a 1270 0
## 6 Feb 8 6 c a 1270 0
## Store DayOfWeek Date Sales Customers Open Promo SchoolHoliday Year
## 510771 560 Sun 2023-06-25 0 0 0 0 0 2023
## 510772 560 Sun 2023-07-02 0 0 0 0 0 2023
## 510773 560 Sun 2023-07-09 0 0 0 0 0 2023
## 510774 560 Sun 2023-07-16 0 0 0 0 0 2023
## 510775 560 Sun 2023-07-23 0 0 0 0 0 2023
## 510776 560 Sun 2023-07-30 0 0 0 0 0 2023
## Month Day WeekOfYear StoreType Assortment CompetitionDistance Promo2
## 510771 Jun 25 26 c c 1910 0
## 510772 Jul 2 27 c c 1910 0
## 510773 Jul 9 28 c c 1910 0
## 510774 Jul 16 29 c c 1910 0
## 510775 Jul 23 30 c c 1910 0
## 510776 Jul 30 31 c c 1910 0
## 'data.frame': 510776 obs. of 16 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Checking Information about Dataset. Print the dimensions:
## [1] 510776
## [1] 16
We are going to remove the following columns Date, StoreType, Assortment, CompetitionOpenSinceYear, because these variables are not relevant for the analysis:
colums_removed_merged <- subset(merged_df, select = -c(Date, StoreType, Assortment, CompetitionOpenSinceYear))
## 'data.frame': 510776 obs. of 13 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Convert DayOfWeek and Month to numeric:
colums_removed_merged$DayOfWeek <- as.numeric(colums_removed_merged$DayOfWeek)
colums_removed_merged$Month <- as.numeric(colums_removed_merged$Month)
## 'data.frame': 510776 obs. of 13 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : num 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Calculate the absolute correlation matrix:
We can see that the correlations between the variables do not change when we merge the datasets.
We can observe which variables influence the amount of sales: Customers; Open (whether the store is open); and Promo (whether the store had a promotion running).
We create a new column, Avg_Customer_Sales, in the merged_df dataset, and then build graphs to observe the relationship between sales and store types:
plot1 <- ggplot(store_df, aes(x = StoreType)) +
geom_bar(fill="steelblue") +
labs(title = "Quantity StoreTypes") +
theme(plot.title = element_text(size = 12))
plot2 <- ggplot(merged_df, aes(x = StoreType, y = Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Sales vs StoreTypes") +
theme(plot.title = element_text(size = 12))
plot3 <- ggplot(merged_df, aes(x = StoreType, y = Customers)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Number Customers vs StoreTypes") +
theme(plot.title = element_text(size = 12))
grid.arrange(plot1, plot2, plot3, ncol = 2)
Mean values per store type:
mean_Sales <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Sales = mean(Sales))
plot4 <- ggplot(mean_Sales, aes(x = StoreType, y = mean_Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Sales vs StoreTypes") +
theme(plot.title = element_text(size = 12))
mean_Avg_Customer_Sales <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Avg_Customer_Sales = mean(Avg_Customer_Sales, na.rm = TRUE))
plot5 <-ggplot(mean_Avg_Customer_Sales, aes(x = StoreType, y = mean_Avg_Customer_Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Spending vs StoreTypes") +
theme(plot.title = element_text(size = 12))
mean_Customer <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Customer = mean(Customers))
plot6 <- ggplot(mean_Customer, aes(x = StoreType, y = mean_Customer)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Customers vs StoreType") +
theme(plot.title = element_text(size = 12))
grid.arrange(plot4, plot5, plot6, ncol = 2)
We can see in the graphs that store type A has the greatest number of stores; it also has the greatest total sales and the greatest number of customers.
Type D stores have the best sales averages per customer.
Type B stores have the highest number of average customers.
Let’s analyze the impact of sales with promotions:
colors <- c('coral', 'steelblue')
ggplot(merged_df, aes(x = Month, y = Sales, fill = factor(Promo2))) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(. ~ Year + Promo, nrow=2, scales = "free") +
scale_fill_manual(values = colors) +
labs(title = "Sales vs Promo", x = "Month", y = "Sales") +
theme_gray() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that when promotions are applied, sales are higher. According to this result, there are advantages to applying promotions to boost sales.
colors <- c('coral2', 'steelblue')
ggplot(merged_df, aes(x = DayOfWeek, y = Sales, fill = factor(Promo))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = colors) +
labs(x = "Month", y = "Sales") +
theme_gray() +
theme(axis.text.x = element_text(angle = 0, hjust = 1))
When stores offer promotions, sales are higher. On Sundays there are no promotions, yet sales remain high.
We are going to analyze whether there are variations in sales between the company’s store types. For this we graph the sales of the different store types for each month and year.
colors <- c('green3', 'indianred', 'deepskyblue', 'blueviolet')
ggplot(merged_df, aes(x = Month, y = Sales, fill = StoreType)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(Year ~ .) +
scale_fill_manual(values = colors) +
labs(x = "Month", y = "Sales") +
theme_minimal()
We can see that the sales trend is the same for each type of store, with sales increasing in the month of December. In general, over the years store type B generates the highest average sales relative to the number of stores of this type.
Let’s categorize the competition distance from nearest to farthest.
merged_df$CompetitionDistance_Cat <- cut(merged_df$CompetitionDistance, breaks = 5, labels = c("near_1mile", "near_2mile", "near_3mile", "near_4mile", "near_5mile"))
summary(merged_df$CompetitionDistance_Cat)
## near_1mile near_2mile near_3mile near_4mile near_5mile
## 459680 40734 7536 1884 942
summary_df_merged <- merged_df %>%
group_by(CompetitionDistance_Cat) %>%
summarize(Avg_Sales = mean(Sales), Avg_Customers = mean(Customers))
plot7 <- ggplot(summary_df_merged, aes(x = CompetitionDistance_Cat, y = Avg_Sales)) +
geom_bar(stat = "identity", fill = "#41B7C4") +
labs(title = "Average Sales vs Competition Distance") + theme(axis.text.x = element_text(angle = 90, hjust = 0.9, size=8)) + theme(plot.title = element_text(size = 10))
plot8 <- ggplot(summary_df_merged, aes(x = CompetitionDistance_Cat, y = Avg_Customers)) +
geom_bar(stat = "identity", fill = "#CCEDB1") +
labs(title = "Average Customers vs Competition Distance") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=9)) + theme(plot.title = element_text(size = 10))
grid.arrange(plot7, plot8, ncol = 2)
We can observe that sales are not noticeably affected by the distance to the competition.
For our data management we are going to delete the columns Avg_Customer_Sales and CompetitionDistance_Cat
merged_df_drop <- merged_df[, !(names(merged_df) %in% c("Avg_Customer_Sales", "CompetitionDistance_Cat"))]
## 'data.frame': 510776 obs. of 17 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
The boxplot provides a compact, easy-to-interpret representation of the distribution of the data, and provides information about the presence of outliers.
The box plot shows that there are very high outliers in sales.
ggplot(sales_df, aes(x = Sales)) +
geom_boxplot(fill="lightblue") +
labs(x = "Sales") +
theme_gray()
Clearing outliers from merged_df:
remove_outlier <- function(merged_df_in, col_name) {
q1 <- quantile(merged_df_in[[col_name]], 0.25)
q3 <- quantile(merged_df_in[[col_name]], 0.75)
iqr <- q3 - q1
fence_low <- q1 - 1.5 * iqr
fence_high <- q3 + 1.5 * iqr
merged_df_out <- merged_df_in[merged_df_in[[col_name]] > fence_low & merged_df_in[[col_name]] < fence_high, ]
return(merged_df_out)
}
We are going to remove from the dataset the rows where stores are closed, since they do not generate sales.
We will also remove the Open column, since after this filtering it no longer varies.
We are going to check if there is any open store that does not report sales.
## [1] 5423
Let’s review the percentage of open stores with zero sales.
## [1] 0
Let’s remove this part of data to avoid bias.
## 'data.frame': 410825 obs. of 18 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s review the “Assortment” and “StoreType” columns of the data frame. We define the columns to one-hot encode and create dummy variables.
columns_to_encode <- c("Assortment", "StoreType")
for (col in columns_to_encode) {
dummies <- model.matrix(~0 + factor(new_merged_df[[col]]))
colnames(dummies) <- paste0(col, "_", colnames(dummies))
new_merged_df <- cbind(new_merged_df, dummies)
}
new_merged_df <- new_merged_df[, !(names(new_merged_df) %in% columns_to_encode)]
## 'data.frame': 410825 obs. of 23 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear : num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_factor(new_merged_df[[col]])a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_factor(new_merged_df[[col]])b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_factor(new_merged_df[[col]])c: num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_factor(new_merged_df[[col]])d : num 0 0 0 0 0 0 0 0 0 0 ...
new_merged_df <- new_merged_df %>%
rename(
Assortment_a = 'Assortment_factor(new_merged_df[[col]])a',
Assortment_b = 'Assortment_factor(new_merged_df[[col]])b',
Assortment_c = 'Assortment_factor(new_merged_df[[col]])c',
StoreType_a = 'StoreType_factor(new_merged_df[[col]])a',
StoreType_b = 'StoreType_factor(new_merged_df[[col]])b',
StoreType_c = 'StoreType_factor(new_merged_df[[col]])c',
StoreType_d = 'StoreType_factor(new_merged_df[[col]])d'
)
## 'data.frame': 410825 obs. of 23 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
## chr [1:2] "Assortment" "StoreType"
Count of rows where each Assortment dummy has value 0:
sum(!is.na(new_merged_df$Assortment_a) & is.numeric(new_merged_df$Assortment_a) & new_merged_df$Assortment_a == 0)
## [1] 201076
sum(!is.na(new_merged_df$Assortment_b) & is.numeric(new_merged_df$Assortment_b) & new_merged_df$Assortment_b == 0)
## [1] 407427
sum(!is.na(new_merged_df$Assortment_c) & is.numeric(new_merged_df$Assortment_c) & new_merged_df$Assortment_c == 0)
## [1] 213147
Count of rows where each Assortment dummy has value 1:
sum(!is.na(new_merged_df$Assortment_a) & is.numeric(new_merged_df$Assortment_a) & new_merged_df$Assortment_a == 1)
## [1] 209749
sum(!is.na(new_merged_df$Assortment_b) & is.numeric(new_merged_df$Assortment_b) & new_merged_df$Assortment_b == 1)
## [1] 3398
sum(!is.na(new_merged_df$Assortment_c) & is.numeric(new_merged_df$Assortment_c) & new_merged_df$Assortment_c == 1)
## [1] 197678
Count of rows where each StoreType dummy has value 0:
sum(!is.na(new_merged_df$StoreType_a) & is.numeric(new_merged_df$StoreType_a) & new_merged_df$StoreType_a == 0)
## [1] 189777
sum(!is.na(new_merged_df$StoreType_b) & is.numeric(new_merged_df$StoreType_b) & new_merged_df$StoreType_b == 0)
## [1] 404118
sum(!is.na(new_merged_df$StoreType_c) & is.numeric(new_merged_df$StoreType_c) & new_merged_df$StoreType_c == 0)
## [1] 361081
sum(!is.na(new_merged_df$StoreType_d) & is.numeric(new_merged_df$StoreType_d) & new_merged_df$StoreType_d == 0)
## [1] 277499
Count of rows where each StoreType dummy has value 1:
sum(!is.na(new_merged_df$StoreType_a) & is.numeric(new_merged_df$StoreType_a) & new_merged_df$StoreType_a == 1)
## [1] 221048
sum(!is.na(new_merged_df$StoreType_b) & is.numeric(new_merged_df$StoreType_b) & new_merged_df$StoreType_b == 1)
## [1] 6707
sum(!is.na(new_merged_df$StoreType_c) & is.numeric(new_merged_df$StoreType_c) & new_merged_df$StoreType_c == 1)
## [1] 49744
sum(!is.na(new_merged_df$StoreType_d) & is.numeric(new_merged_df$StoreType_d) & new_merged_df$StoreType_d == 1)
## [1] 133326
The data frame:
## Store DayOfWeek Date Sales Customers Promo SchoolHoliday Year Month Day
## 1 1 Mon 2021-01-04 7176 785 1 1 2021 Jan 4
## 2 1 Mon 2021-01-11 4717 616 0 0 2021 Jan 11
## 3 1 Mon 2021-01-18 5394 607 1 0 2021 Jan 18
## 4 1 Mon 2021-01-25 4055 549 0 0 2021 Jan 25
## 5 1 Mon 2021-02-01 7032 762 1 0 2021 Feb 1
## 6 1 Mon 2021-02-08 4409 599 0 0 2021 Feb 8
## WeekOfYear CompetitionDistance CompetitionOpenSinceYear Promo2
## 1 1 1270 2018 0
## 2 2 1270 2018 0
## 3 3 1270 2018 0
## 4 4 1270 2018 0
## 5 5 1270 2018 0
## 6 6 1270 2018 0
## Avg_Customer_Sales CompetitionDistance_Cat Assortment_a Assortment_b
## 1 9.141401 near_1mile 1 0
## 2 7.657468 near_1mile 1 0
## 3 8.886326 near_1mile 1 0
## 4 7.386157 near_1mile 1 0
## 5 9.228346 near_1mile 1 0
## 6 7.360601 near_1mile 1 0
## Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 0 1 0
We are going to carry out a sales analysis by day of the week:
Colors2 <- c("steelblue", "orange2", "forestgreen", "purple2", "royalblue2", "lightgreen", "skyblue2")
average_sales <- new_merged_df %>%
group_by(DayOfWeek) %>%
summarize(AvgSales = mean(Sales))
ggplot(average_sales, aes(x = DayOfWeek, y = AvgSales, fill = factor(DayOfWeek))) +
geom_bar(stat = "identity") +
labs(x = "DayOfWeek", y = "Average Sales") +
scale_fill_manual(values = Colors2)
From the plot it can be seen that the highest average sales occur on Mondays and Fridays.
new_merged_df <- subset(new_merged_df, select = -c(CompetitionOpenSinceYear,Avg_Customer_Sales,CompetitionDistance_Cat))
Creation of the target and features:
X <- new_merged_df[, !(names(new_merged_df) %in% c("Sales", "Store", "Date", "Year"))]
y <- new_merged_df$Sales
## 'data.frame': 410825 obs. of 16 variables:
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
## [1] 410825 16
## DayOfWeek Customers Promo SchoolHoliday Month Day WeekOfYear
## 1 Mon 785 1 1 Jan 4 1
## 2 Mon 616 0 0 Jan 11 2
## 3 Mon 607 1 0 Jan 18 3
## 4 Mon 549 0 0 Jan 25 4
## 5 Mon 762 1 0 Feb 1 5
## 6 Mon 599 0 0 Feb 8 6
## CompetitionDistance Promo2 Assortment_a Assortment_b Assortment_c StoreType_a
## 1 1270 0 1 0 0 0
## 2 1270 0 1 0 0 0
## 3 1270 0 1 0 0 0
## 4 1270 0 1 0 0 0
## 5 1270 0 1 0 0 0
## 6 1270 0 1 0 0 0
## StoreType_b StoreType_c StoreType_d
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
## [1] 7176 4717 5394 4055 7032 4409
## num [1:410825] 7176 4717 5394 4055 7032 ...
Create a vector with the names of days of the week in the correct order. Then, convert the factor to numerical values.
day_week <- c("Monday" = 1, "Tuesday" = 2, "Wednesday" = 3, "Thursday" = 4, "Friday" = 5, "Saturday" = 6, "Sunday" = 7)
X$DayOfWeek <- day_week[X$DayOfWeek]
## 'data.frame': 410825 obs. of 16 variables:
## $ DayOfWeek : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Month : num 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
Upon review we can see that sales are highly correlated with the number of Customers.
The store type that sold the most and was most visited is type A, which is also the type with the largest number of stores.
Store type B has the lowest total sales and the fewest stores, yet compared to the other types it has the highest average number of customers, and its average sales per store are the highest.
Store Type D has higher Average Spending.
When promotions are applied, sales are high.
Currently the company applies promotions only from Monday to Saturday. For all stores, the promotion leads to an increase in both sales and customers.
82.2% of sales observations are unaffected by school holidays, and 97% are unaffected by state holidays.
Sales increase during Christmas week; this could be because people buy more products for the Christmas celebration.
Promo2 does not appear to be correlated with any significant change in sales amount.
We can also observe that the distance from the competition is not considerably affecting the amount of sales of the company’s stores.
Sales increase in the months of November and December, for each year analyzed.
Therefore, data subsets that could introduce bias were eliminated.
We are going to divide the data into training and test sets. With the split ratio of 0.3 used below, 30% of the data goes to the training set and the remaining 70% is held out as the test set.
set.seed(11)
split <- sample.split(y, SplitRatio = 0.3)
X_train <- X[split, ]
X_test <- X[!split, ]
y_train <- y[split]
y_test <- y[!split]
We now review both sets for missing values (NA):
## DayOfWeek Customers Promo SchoolHoliday
## 0 0 0 0
## Month Day WeekOfYear CompetitionDistance
## 0 0 0 0
## Promo2 Assortment_a Assortment_b Assortment_c
## 0 0 0 0
## StoreType_a StoreType_b StoreType_c StoreType_d
## 0 0 0 0
## DayOfWeek Customers Promo SchoolHoliday
## 0 0 0 0
## Month Day WeekOfYear CompetitionDistance
## 0 0 0 0
## Promo2 Assortment_a Assortment_b Assortment_c
## 0 0 0 0
## StoreType_a StoreType_b StoreType_c StoreType_d
## 0 0 0 0
Data preprocessing is a crucial phase in developing machine learning models. It is done with the goal of improving data quality and suitability before training a model. We fit the same preprocessing on the training data and then apply the resulting transformation to both the training and test sets using the predict function.
preproc <- preProcess(X_train, method = c("range"))
X_train_scaled <- predict(preproc, X_train)
X_test_scaled <- predict(preproc, X_test)
## Call:
## lm(formula = y_train ~ ., data = X_train_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6996.6 -729.5 -94.0 632.1 7492.5
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1618.907 15.797 102.485 < 2e-16 ***
## DayOfWeek -115.523 11.498 -10.047 < 2e-16 ***
## Customers 22130.647 35.850 617.305 < 2e-16 ***
## Promo 1107.367 6.791 163.067 < 2e-16 ***
## SchoolHoliday 58.115 8.123 7.155 8.43e-13 ***
## Month -11915.725 498.389 -23.909 < 2e-16 ***
## Day -1044.695 45.875 -22.772 < 2e-16 ***
## WeekOfYear 13136.440 542.196 24.228 < 2e-16 ***
## CompetitionDistance 1565.008 30.114 51.969 < 2e-16 ***
## Promo2 357.700 6.507 54.970 < 2e-16 ***
## Assortment_a -230.387 6.667 -34.556 < 2e-16 ***
## Assortment_b -3329.081 49.859 -66.769 < 2e-16 ***
## Assortment_c NA NA NA NA
## StoreType_a -1189.127 7.487 -158.830 < 2e-16 ***
## StoreType_b -3568.618 37.231 -95.850 < 2e-16 ***
## StoreType_c -1178.287 10.847 -108.631 < 2e-16 ***
## StoreType_d NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1105 on 123656 degrees of freedom
## Multiple R-squared: 0.8034, Adjusted R-squared: 0.8033
## F-statistic: 3.609e+04 on 14 and 123656 DF, p-value: < 2.2e-16
## (Intercept)
## 1618.907
Excluding the intercept:
## DayOfWeek Customers Promo SchoolHoliday
## -115.52287 22130.64738 1107.36704 58.11541
## Month Day WeekOfYear CompetitionDistance
## -11915.72507 -1044.69535 13136.43988 1565.00824
## Promo2 Assortment_a Assortment_b Assortment_c
## 357.69983 -230.38737 -3329.08093 NA
## StoreType_a StoreType_b StoreType_c StoreType_d
## -1189.12678 -3568.61756 -1178.28671 NA
## R^2: 0.8033675
An R-squared of 0.8034 means that the model captures 80.34% of the variability in the response variable. Since this falls short of the accuracy we are aiming for, we will not evaluate this model further or use it to make predictions.
PLS model reduces the predictors to a smaller set of uncorrelated components and then performs least squares regression on these components, instead of on the original data. PLS finds linear combinations of the predictors called components.
set.seed(12)
pls_model <- train(x=X_train,
y=y_train,
method="pls",
metric="Rsquared",
tuneLength=10,
trControl=trainControl(method = "cv")
)
pls_model
## Partial Least Squares
##
## 123671 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111305, 111304, 111304, 111304, 111304, 111303, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 2484.084 0.006062206 1987.5969
## 2 1496.154 0.639459181 1136.2643
## 3 1493.744 0.640621783 1135.0339
## 4 1415.372 0.677320835 1071.9181
## 5 1402.585 0.683127710 1061.3198
## 6 1255.875 0.745961756 926.4743
## 7 1200.774 0.767781425 895.3679
## 8 1156.929 0.784429022 890.0878
## 9 1112.353 0.800729627 858.7937
## 10 1107.309 0.802534733 853.9247
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
## ncomp
## 10 10
pls_model$results %>%
  filter(ncomp == pls_model$bestTune$ncomp) %>%
  dplyr::select(ncomp, RMSE, Rsquared)
## ncomp RMSE Rsquared
## 1 10 1107.309 0.8025347
data.frame(Rsquared = pls_model[["results"]][["Rsquared"]][as.numeric(rownames(pls_model$bestTune))],
           RMSE = pls_model[["results"]][["RMSE"]][as.numeric(rownames(pls_model$bestTune))])
## Rsquared RMSE
## 1 0.8025347 1107.309
The final value used for the model was ncomp = 10, which corresponds to the best-tuned model.
We see that R² is 0.802 and RMSE is 1107. An R² of 0.8025 means that 80.25% of the variability in the response variable has been captured by the model.
The Root Mean Square Error (RMSE) is a measure of model accuracy, representing the square root of the mean of the squared errors between the predictions and the actual values.
An RMSE of 1107 indicates the average size of the errors in the model predictions.
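As a quick illustration of the calculation on small hypothetical vectors:
actual <- c(5000, 6200, 4800)       # hypothetical observed sales
predicted <- c(5300, 6000, 5100)    # hypothetical predicted sales
sqrt(mean((actual - predicted)^2))  # RMSE, on the same scale as sales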
LARS Lasso is an algorithm that uses linear regression with L1 regularization (Lasso) to fit linear regression models. It combines LARS's ability to handle feature-rich data sets with the L1 (Lasso) penalty, which favors sparse coefficients and feature selection.
The value of alpha is 1 for Lasso (L1 regularization), and lambda is the regularization hyperparameter.
set.seed(13)
lasreg <- glmnet(X_train, y_train, alpha = 1, lambda = 0.3, standardize = TRUE, intercept = FALSE)
coef(lasreg)
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) .
## DayOfWeek -1.928252e+01
## Customers 7.439976e+00
## Promo 1.087095e+03
## SchoolHoliday 4.195243e+01
## Month 2.528475e+01
## Day 2.235604e+00
## WeekOfYear -2.164461e+00
## CompetitionDistance 2.100767e-02
## Promo2 3.656447e+02
## Assortment_a 3.668940e+02
## Assortment_b -2.734978e+03
## Assortment_c 5.957255e+02
## StoreType_a -3.416052e+02
## StoreType_b -2.718628e+03
## StoreType_c -3.301008e+02
## StoreType_d 8.479856e+02
Let’s convert X_test to a matrix and make predictions on the test set.
X_test_matrix <- as.matrix(X_test)
test_predictions <- predict(lasreg, s = 0, newx = X_test_matrix)
r_squared <- 1 - sum((y_test - test_predictions)^2) / sum((y_test - mean(y_test))^2)
rmse <- sqrt(mean((y_test - test_predictions)^2))
cat("R-squared:", r_squared, "\n")## R-squared: 0.7970162
## RMSE: 1109.792
With an R-squared of only 0.797 on the test set, this model does not perform well enough for our purposes, so we will not use it for predictions.
Let's use regression trees. Regression trees partition a data set into smaller groups and then fit a simple model to each subgroup.
set.seed(14)
st_model <- train(x=X_train,
y=y_train,
method = "rpart",
tuneLength = 5,
trControl = trainControl(method = "cv"))
st_model
## CART
##
## 123671 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111304, 111303, 111304, 111305, 111304, 111303, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.02265544 1507.687 0.6337774 1179.722
## 0.02776454 1553.565 0.6111829 1218.469
## 0.07775306 1669.504 0.5502088 1316.335
## 0.08166468 1769.907 0.4946644 1401.160
## 0.43848287 2058.366 0.4354218 1637.793
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02265544.
## cp
## 1 0.02265544
data.frame(Rsquared = st_model[["results"]][["Rsquared"]][as.numeric(rownames(st_model$bestTune))],
           RMSE = st_model[["results"]][["RMSE"]][as.numeric(rownames(st_model$bestTune))])
## Rsquared RMSE
## 1 0.6337774 1507.687
RMSE was used to select the optimal model using the smallest value. The final value used for the model was cp = 0.02265544. We see an R² value of 0.6338 and an RMSE of 1507. With such a low R-squared (0.63), we are not going to use this model.
set.seed(15)
rdf_model <- randomForest(y_train ~ ., data = cbind(X_train, y_train),
ntree = 30, mtry = sqrt(ncol(X_train)), nodesize = 1)
We evaluate the model on the training and testing datasets: we make predictions on both sets and calculate the Root Mean Squared Error (RMSE) for each.
train_score <- cor(predict(rdf_model, X_train), y_train)^2
test_score <- cor(predict(rdf_model, X_test), y_test)^2
cat("Regression Model Score (R^2):", train_score, ", Test Score:", test_score, "\n")## Regression Model Score (R^2): 0.9572492 , Test Score: 0.9031334
y_predicted_train <- predict(rdf_model, X_train)
y_predicted_test <- predict(rdf_model, X_test)
train_rmse <- sqrt(mean((y_train - y_predicted_train)^2))
test_rmse <- sqrt(mean((y_test - y_predicted_test)^2))
cat("Training RMSE:", train_rmse, "Testing RMSE:", test_rmse, "\n")## Training RMSE: 550.5013 Testing RMSE: 783.3986
train_mape <- mean(ifelse(y_train != 0, abs((y_train - y_predicted_train) / y_train), 0)) * 100
test_mape <- mean(ifelse(y_test != 0, abs((y_test - y_predicted_test) / y_test), 0)) * 100
cat("Training MAPE:", train_mape, "Testing MAPE:", test_mape, "\n")## Training MAPE: 7.072235 Testing MAPE: 9.535869
## R-squared (Training): 0.9572492
## R-squared (Testing): 0.9031334
## RMSE (Training): 550.5013
## RMSE (Testing): 783.3986
## MAPE (Training): 7.072235
The results are performance metrics of the regression model, and we can see that they are quite reliable at first glance.
The R^2 on the training set is 0.9572, which is very high. It indicates that 95.72% of the variability in the dependent variable is explained by the model, which is quite good.
The R^2 on the test set is 0.90313, also quite high. An R^2 of 90.31% suggests that the model generalizes well to unseen data.
The RMSE on the training set is 550.5013. The RMSE measures the average magnitude of the errors between the predictions and the actual values. A value of 550.5013 indicates that, on average, the predictions are off by about 550 units from the dependent variable.
The RMSE on the test set is 783.3986. Although this value is higher than on the training set, it is still reasonably low. An RMSE of 783.3986 suggests that the predictions on the test set have an average error of about 783 units.
A MAPE of 7% means that, on average, the model's predictions have an absolute error of 7% with respect to the actual values. This is a good metric for our model.
Overall, these results suggest that the model performs well on both the training set and the test set, compared to the other models used.
Let's make predictions on the test set and look at the first few predictions.
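The prediction call itself is not shown; a minimal sketch, assuming the predictions are stored in y_test_predicted_2 (the name used further below):
y_test_predicted_2 <- predict(rdf_model, X_test)  # random forest predictions on the test set
head(y_test_predicted_2)                          # first few predicted sales values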
## 1 2 3 4 5 6
## 7665.388 4503.070 5941.009 4372.896 7435.233 4686.428
Let’s select the first 100 observations for the predictions and the actual values.
We create a DataFrame (dataset_rf):
rf_prd <- y_test_predicted_2[1:100]
rf_real <- y_test[1:100]
dataset_rf <- data.frame(Real = rf_real, PredictedRF = rf_prd)
Let's calculate the 'diff' column, which shows us the difference between the predicted values and the actual values.
Let’s select 15 random rows from dataset_rf:
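The code for these two steps is not shown; a minimal sketch, assuming diff is the absolute difference between actual and predicted sales:
dataset_rf$diff <- abs(dataset_rf$Real - dataset_rf$PredictedRF)  # assumed definition of the diff column
dataset_rf[sample(nrow(dataset_rf), 15), ]                        # inspect 15 random rows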
## Real PredictedRF diff
## 110 3965 4239.891 274.89112
## 101 4884 4933.362 49.36174
## 40 6004 5861.139 142.86060
## 127 4071 4050.015 20.98535
## 137 3900 4054.959 154.95903
## 114 3701 4118.196 417.19583
## 12 6729 7707.581 978.58103
## 126 5774 6044.958 270.95809
## 24 5337 5727.175 390.17495
## 59 5393 5855.572 462.57236
## 53 6194 6124.092 69.90771
## 100 7380 7186.384 193.61601
## 49 4462 4926.202 464.20176
## 23 4291 4349.157 58.15749
## 65 6008 5981.406 26.59431
## Real PredictedRF diff
## Min. :3414 Min. :3787 Min. : 4.747
## 1st Qu.:4028 1st Qu.:4218 1st Qu.: 138.011
## Median :4836 Median :5446 Median : 296.683
## Mean :5086 Mean :5345 Mean : 324.762
## 3rd Qu.:5786 3rd Qu.:6038 3rd Qu.: 462.980
## Max. :9528 Max. :9616 Max. :1020.181
Let’s create a histogram for each column:
theme_set(theme_minimal())
plots <- lapply(names(dataset_rf), function(col) {
ggplot(dataset_rf, aes_string(x = col)) +
geom_histogram(fill = "steelblue", color = "white") +
ggtitle(paste("Histogram of", col)) +
theme_minimal()
})
gridExtra::grid.arrange(grobs = plots, ncol = 2)
Now the scatter plot with regression line:
ggplot(dataset_rf, aes(x = Real, y = PredictedRF)) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
labs(title = "Scatter Plot with Regression Line", x = "Real", y = "PredictedRF") +
theme_minimal()
We can see that the Random Forest with hyperparameter tuning model predicts the company's sales values very well.
Feature Importance in a Random Forest-based regression model is a measure that indicates the relative contribution of each feature to the model’s ability to make predictions.
This metric is particularly useful to understand which features are most influential in the model and can be used for decision making.
In the context of a Random Forest Regressor, the importance of features is calculated during model training and is based on how each feature contributes to reducing impurity or error in the nodes of the forest.
Features that are most effective in reducing impurity or error have greater importance.
importance_values <- randomForest::importance(rdf_model)
importance_df <- data.frame(
Feature = rownames(importance_values),
Importance = importance_values[, "IncNodePurity"]
)
importance_df <- importance_df[order(-importance_df$Importance), ]
ggplot(importance_df, aes(x = reorder(Feature, -Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Random Forest Feature Importance",
x = "Feature",
y = "Importance") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
And, still not fully satisfied even with this good result, let's look at another model.
This code fits a decision tree regression model in R and then calculates the R-squared, RMSE, and MAPE evaluation metrics for both the training set and the test set.
set.seed(16)
tree <- rpart(y_train ~ ., data = data.frame(y_train, X_train),
method = "anova", minsplit = 5, minbucket = 8)
print(tree)
## n= 123671
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 123671 767733800000 6651.433
## 2) Customers< 705.5 70181 178606400000 5211.062
## 4) Customers< 472.5 23083 31019230000 3893.687
## 8) Customers< 346.5 6520 5039379000 2801.352 *
## 9) Customers>=346.5 16563 15137770000 4323.683 *
## 5) Customers>=472.5 47098 87893560000 5856.715
## 10) StoreType_d< 0.5 27821 36499280000 5350.863 *
## 11) StoreType_d>=0.5 19277 34000930000 6586.773 *
## 3) Customers>=705.5 53490 252489300000 8541.256
## 6) Customers< 1064.5 39142 131947600000 7885.775
## 12) Promo< 0.5 17821 41735210000 7078.601 *
## 13) Promo>=0.5 21321 68896630000 8560.445
## 26) StoreType_d< 0.5 15319 40876700000 8124.863 *
## 27) StoreType_d>=0.5 6002 17695140000 9672.187 *
## 7) Customers>=1064.5 14348 57844910000 10329.440
## 14) Customers< 1328.5 9044 29989700000 9765.539 *
## 15) Customers>=1328.5 5304 20075670000 11290.960 *
Fit a decision tree regression model:
treereg <- rpart(y_train ~ ., data = data.frame(y_train, X_train), method = "anova", minsplit = 5, minbucket = 8)
train_score_4 <- 1 - sum((y_train - predict(treereg, newdata = data.frame(X_train)))^2) / sum((y_train - mean(y_train))^2)
test_score_4 <- 1 - sum((y_test - predict(treereg, newdata = data.frame(X_test)))^2) / sum((y_test - mean(y_train))^2)
rmse_train <- sqrt(mean((y_train - predict(treereg, newdata = data.frame(X_train)))^2))
rmse_test <- sqrt(mean((y_test - predict(treereg, newdata = data.frame(X_test)))^2))
mape_train <- mean(ifelse(y_train != 0, abs((y_train - y_predicted_train) / y_train), 0)) * 100
mape_test <- mean(ifelse(y_test != 0, abs((y_test - y_predicted_test) / y_test), 0)) * 100
cat("Regresion Model Score:", train_score_4, "Test Score:", test_score_4, "\n")## Regresion Model Score: 0.6860243 Test Score: 0.680879
## Training RMSE: 1396.109 Testing RMSE: 1391.535
## Training MAPE: 7.072235 Testing MAPE: 9.535869
With an R-squared of only 0.68, this is not a good model, so we will not use it.
model_comparison_df <- read_csv("C:/Users/jzw1jlf/Documents/Personal/CUNY SPS/DATA 698/Model_comparison.csv")
## # A tibble: 6 × 5
## Model `R-squared` RMSE Train_Score Test_Score
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Simple Linear Regression 0.803 1619. 0.803 0.803
## 2 Tune PLS 0.803 1107. 0 0
## 3 LARS Lasso Regression 0.797 1110. 0 0
## 4 Single model Tree 0.634 1508. 0 0
## 5 Random Forest With Hyper Parameter T… 0.957 783. 0 0.957
## 6 Regression Model Tree 0.686 1392. 0 0.681
This research has shown that the application of advanced data science techniques to sales prediction in companies constitutes a valuable and effective approach to improving strategic decision making. Through the development and implementation of predictive models based on historical data sets and important external factors, it has been possible to obtain significant results that overcome the limitations of traditional methods.
The robustness of the proposed models is reflected in their ability to accurately anticipate sales trends, providing powerful tools to optimize strategic planning and respond proactively to changes in the business environment. The use of data visualization tools has made the results easier to interpret and the predictions easier to understand.
During this project, advanced data science techniques and regression models have been applied to address the company's current problem by providing an accurate sales prediction. The underlying problem was that the company had not identified the variables that may be increasing or decreasing its sales. The prediction helps the company make decisions in advance and avoid lost sales in the warehouses.
This work has made it possible not only to predict store sales but also to identify the different aspects that positively or negatively influence sales. It was determined that weekday promotions are justifiable to increase sales.
Store type B generates the highest average sales, even though it accounts for the smallest share of the company's stores. Therefore, in the short or medium term, the company could transfer the type B store model to other store types to increase sales.
The best model was Random Forest with hyperparameter tuning. It achieved the best accuracy of all the models when compared on evaluation metrics such as R² and RMSE. Being an ensemble model that combines multiple decision trees, it can capture non-linear and complex relationships in the data more effectively.
The data set had outliers, and the Random Forest model, being robust to outliers, was able to handle those values better compared to other models.
Throughout this process, we have gained valuable insights and faced significant challenges.
We implemented regression models such as linear regression, decision trees, LARS Lasso regression, and Random Forest, among others. Not all of them achieved good or acceptable metrics for generating a correct and reliable prediction; in the end, the acceptable models were compared in order to choose the best regression model for the sales prediction.
Data science techniques were used to analyze and process the data, such as feature engineering to transform categorical variables into dummy variables so that the model could interpret them correctly and capture non-linear relationships.
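As an illustration of the dummy-variable encoding described above, a minimal sketch on a hypothetical toy data frame (the column names only mimic the real Assortment and StoreType variables):
library(caret)
toy <- data.frame(Assortment = c("a", "b", "c"), StoreType = c("a", "d", "b"))  # hypothetical categories
dmy <- dummyVars(~ Assortment + StoreType, data = toy)  # learn the encoding
predict(dmy, newdata = toy)                             # one 0/1 column per category level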
Handling outliers was essential to improve the robustness and accuracy of our models. We used statistical and graphical techniques to identify outliers in the data set and then removed or transformed those observations, preventing them from negatively affecting the predictive capacity of the model.
These feature engineering and outlier management techniques positively influenced the quality of our predictions by providing the model with a more informative and robust representation of the data.
Model performance evaluation methods were implemented to ensure generalization. The Mean Squared Error (MSE) was calculated as the mean of the squared differences between the predictions and the actual values, and the Root Mean Squared Error (RMSE), the square root of the MSE, provides a metric on the same scale as the variable of interest.
The first big challenge was obtaining the data, because the company did not have complete and unified data. Although the company uses external software with an SQL database, we were able to extract the data to start the analysis. Not all the information in the data files was recorded correctly: required fields were left unfilled and the data were not parameterized, so a great deal of data cleaning had to be performed.
We then faced the challenge of handling missing data and outliers, one of the common challenges in building and evaluating regression models. The strategy was to assess whether the amount of missing data was small or large; since it was small, we chose to remove the observations or variables with missing data.
Outliers can bias the interpretation of the model and hurt its performance, so these observations were removed, preventing them from degrading the predictive capacity of the model.
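A generic sketch of the kind of filtering described, on a hypothetical data frame with a Sales column (names and thresholds are illustrative, not the project's exact code):
df <- data.frame(Sales = c(4200, 5100, 4800, 60000, NA, 5300))  # hypothetical sales values
df <- na.omit(df)                                               # drop rows with missing values
q <- quantile(df$Sales, probs = c(0.25, 0.75))                  # first and third quartiles
iqr <- q[2] - q[1]
df <- df[df$Sales >= q[1] - 1.5 * iqr & df$Sales <= q[2] + 1.5 * iqr, ]  # remove 1.5*IQR outliers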
Feature Selection: Various feature selection and transformation techniques were experimented with to improve the predictive capacity of the model.
It was considered necessary to collect additional data to enrich the data set and allow the model to learn more complex patterns.
This work provides the company with advanced tools and predictions that will allow it to improve decision making, resource optimization, and above all, avoid losing sales due to an incorrect estimate of future sales.
In the future, we could build an HTML-based software tool for the company that automatically generates the sales predictions in an easy and agile way to support purchasing decisions.
In practical terms, this research highlights the importance of adopting data science approaches in business management, especially around sales prediction. The recommendations derived from this study provide guidance for the successful implementation of predictive solutions in enterprise environments, underscoring the need for careful integration with existing processes and the importance of continuous updating of models to adapt to changes in the market.
In summary, this work contributes significantly to the advancement of data science applied to business management. It provides a solid foundation for future research and highlights the relevance of sales prediction as a key strategic tool for business success.
This project has demonstrated the positive impact of the application of regression models and data science techniques in solving the problem of correctly estimating the company’s future sales. As we advance, we are excited to explore new opportunities and continue to improve our capabilities in this exciting field of study.
Finally, the .csv file is generated so that the company can start using it to anticipate sales trends. This file will help the company optimize strategic planning and improve data-driven business decision making.
Sheather, S. (2009). A Modern Approach to Regression with R. Springer.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
Knight, M. (August 2017). What Is Predictive Analytics? DATAVERSITY. Retrieved from: https://www.dataversity.net/what-is-predictive-analytics/#:~:text=Predictive%20analytics%20defines%20outcomes%20through,business%20goals%20and%20ameliorating%20risks.
Lander, J. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Addison-Wesley.
Provost, F., & Fawcett, T. (2013). Data Science for Business. O'Reilly.
Sivarajah, U., Mustafa, M., & Irani, Z. (January 2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research. ScienceDirect. Retrieved from: https://www.sciencedirect.com/science/article/pii/S014829631630488X
Karaman, B. (June 2019). Predicting Sales: Forecasting the monthly sales with LSTM. Medium. Retrieved from: https://towardsdatascience.com/predicting-sales-611cb5a252de
Tiefenthaler, D. (September 2023). Forecasting Sales & Demand: The Path to an Accurate Prediction. Medium. Retrieved from: https://medium.com/@david.tiefenthaler/forecasting-sales-demand-forecast-the-path-to-an-accurate-prediction-a317748192f3
Riveroll, F. (July 2020). The Demand Sales Forecast Technique Every Data Scientist Should Be Using to Reduce Error. Medium. Retrieved from: https://towardsdatascience.com/the-demand-sales-forecast-technique-every-data-scientist-should-be-using-to-reduce-error-1c6f25add9cb
Zhang, B., & Tseng, M. (February 2023). A comparative online sales forecasting analysis: Data mining techniques. ScienceDirect. Retrieved from: https://www.sciencedirect.com/science/article/abs/pii/S0360835222009238
Code in RStudio: