This capstone addresses the crucial challenge of predicting sales in businesses by applying advanced data science techniques.
This project focuses on the development and implementation of a predictive model aimed at improving sales management in a company with multiple points of sale. The company operates in Colombia (South America) and has more than 500 stores distributed across different cities of the country.
The study focuses on the development of accurate and efficient predictive models that allow companies to anticipate sales trends, optimize strategic planning and improve business decision making.
The methodology covers data collection and cleaning and implements machine learning algorithms and predictive analysis techniques. Historical sales data sets are used, as well as relevant information on external factors that may influence business performance. The goal is to identify patterns and correlations that allow us to predict sales trends.
The data collected include historical sales, seasonal factors, and relevant external variables. Advanced predictive analysis techniques, such as machine learning, are then applied to identify patterns and correlations that allow us to predict sales trends in each store.
The predictive model developed is integrated with the business management system, providing the managers of each store with effective tools for making data-driven decisions. This approach is expected to contribute to inventory optimization, reducing losses due to stockouts or excess inventory, and ultimately to increase the profitability of the company. Additionally, scaling the predictive model to adapt to new locations or changes in market conditions will be considered.
This project not only seeks to improve operational efficiency, but also to establish a sustainable and adaptable framework for sales management in a company with a multi-location presence. This work contributes to the field of data science by providing a guide for addressing the specific challenge of sales prediction in companies, highlighting the importance of applying advanced techniques to strategic decision making.
In today’s business environment, effective management of multiple retail stores has become a crucial challenge for optimizing operations and maximizing profits. Companies with multiple stores often face the complex task of anticipating and adapting to variations in demand at each location.
The project aims to develop a predictive model for forecasting sales trends across the different stores, thus providing the company with the tools needed to make data-driven, strategic decisions.
The need for a predictive approach arises from increasing competition and the diversity of factors that influence customer purchasing behavior. The gathering and analysis of historical data, combined with seasonal factors and relevant external variables, will allow the identification of sales behavior patterns. Using data science techniques that apply predictive analysis, future product demand can be estimated.
The implementation of predictive models not only seeks to improve operational efficiency but also contributes to the overall profitability of the company by avoiding losses due to excess or lack of inventory. Additionally, the ability to scale the model to adapt to new locations or changes in market conditions will be considered, establishing a sustainable and adaptable approach to sales management in a multi-store company.
This project is proposed as a strategic tool to strengthen the company’s ability to anticipate and satisfy market demand across its stores, with the objective of improving its competitive position and maximizing operational efficiency.
Existing literature provides an essential foundation for understanding the importance and complexities associated with sales prediction in the business environment. Sales management has evolved significantly with the emergence of data science.
The literature on traditional sales prediction models has established solid knowledge around approaches based on classic statistical methods such as time series analysis and linear regression. These approaches have proven effective in certain contexts, but their ability to handle the complexity of today’s data and adapt to dynamic changes in the market is limited, and their precision can be quite low.
Predictive analytics is an essential process that relies on using past and present data to make predictions, especially by examining data trends.
In recent years, and given the rise of big data, it has emerged as a fundamental trend, offering transformative potential in the way companies organize and use information relevant to them, their clients, as well as their potential clients.
Effective data storage is essential, as without it, businesses would lose the ability to gain valuable insights. Without information and analysis, new benefits and opportunities would be lost.
Technology currently allows companies to use data as a fundamental pillar to understand their customers, create new products and satisfy their needs. Implementing a data-driven approach to decision making has been shown to increase the profitability of companies regardless of industry.
Predictive sales analytics is widely used by companies across industries, from food service and retail to healthcare and government services. Companies can optimally allocate their resources, while adequately meeting expected demand.
Some of the predictive analysis techniques are:
Decision trees: Classify data into different groups. The model is shaped like a tree; each branch represents a possible choice, and the result is shown at the leaf.
Random forest: An ensemble of decision trees whose individual predictions are combined into a single prediction.
Neural networks: Models inspired by the human brain, used to make predictions when the relationships between variables are complex.
Data mining: Exploring large databases to find patterns.
Predictive analytics is extremely useful in sales prediction for several key reasons:
Inventory Optimization: Allows companies to anticipate future demand, which facilitates inventory management. By forecasting product needs, companies can avoid excess or shortages of stock, thus optimizing their inventory levels.
Improving Pricing Strategies: Facilitates the identification of consumer behavior patterns and market trends, which helps to adjust pricing strategies more precisely. This may include price discounts, special offers, and dynamic pricing strategies.
Offer Personalization: Allows companies to personalize their offers and promotions based on customers’ preferences and past behaviors. This not only improves customer satisfaction, but also increases the chances of making more sales.
Optimization of Marketing Strategies: Helps efficiently direct marketing strategies by identifying the customer segments most likely to make purchases. This may include audience segmentation, and personalization of advertising messages.
Improved Strategic Planning: Provides valuable information for long-term strategic planning, understanding sales trends and patterns. In this way, companies can adjust their goals and objectives more realistically.
Reducing Financial Risks: By forecasting demand and adjusting operations accordingly, companies can reduce the financial risks associated with unsold excess inventory or shortages of critical products.
By integrating these key aspects of the literature, this research seeks to contribute to the updated understanding of sales prediction. In the context of data science, the goal is to provide a solid foundation for the development of more advanced and effective approaches in the field of enterprise sales.
In short, predictive analytics in sales forecasting not only helps businesses anticipate demand but also provides a solid foundation for strategic decision making. This results in a more efficient operation, and better customer satisfaction.
We use the R programming language and load several libraries or packages that contain specific functions and tools for data analysis and visualization in R. These libraries provide a variety of tools for performing data analysis, visualizations, and statistical modeling in R.
library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)
library(corrplot)
library(lubridate)
library(heatmaply)
library(visdat)
library(gridExtra)
library(GGally)
library(caret)
library(caTools)
library(randomForest)
library(tree)
library(Metrics)
library(rpart)
library(glmnet)
The quality and relevance of the data set play a vital role in a data science project, especially in the context of the organization’s sales prediction.
The review of the data set focuses on the key aspects that have influenced the selection, preparation, and use of the data to develop sales prediction models.
The company provided us with its sales database for three years: 2021, 2022, and 2023. The data set selected for this analysis is a compilation of the sales of each company store over that period.
The historical sales data set spans a significant period, allowing it to capture seasonal patterns, long-term trends, and changes in consumer behavior.
The set contains the number of sales for each store, as well as characteristics of each store, such as whether the point of sale was open and whether it had any promotions running.
It is a very complete data set: we can also see whether the date was a holiday, the number of customers at each store, and other fields that will allow us to conduct a good analysis.
In the following table we can see the variables and the data type:
## # A tibble: 6 × 9
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## # ℹ 1 more variable: SchoolHoliday <dbl>
## spc_tbl_ [510,776 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date[1:510776], format: "2021-01-04" "2021-01-11" ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ StateHoliday : chr [1:510776] "0" "0" "0" "0" ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. DayOfWeek = col_double(),
## .. Date = col_date(format = ""),
## .. Sales = col_double(),
## .. Customers = col_double(),
## .. Open = col_double(),
## .. Promo = col_double(),
## .. StateHoliday = col_character(),
## .. SchoolHoliday = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
There is another dataset for the stores, where we can observe the type of store, whether the store has carried out promotions, and whether it has competition nearby that could affect its sales.
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 c a 1270 2018 0
## 2 2 a a 570 2007 1
## 3 3 a a 14130 2006 1
## 4 4 c c 620 2009 0
## 5 5 a a 29910 2015 0
## 6 6 a a 310 2013 0
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 555 d a 1560 2014 1
## 2 556 d c 1140 2014 0
## 3 557 a a 250 NA 0
## 4 558 a a 3000 2010 0
## 5 559 d a 3910 2006 1
## 6 560 c c 1910 2013 0
We have paid special attention to the selection of predictor variables within the data set.
This review addresses the importance of data quality and cleanliness throughout the process. Identification and handling of outliers, imputation of missing data, and verification of temporal consistency are critical aspects to ensure the reliability of the results.
Visual data exploration has been used as an integral part of the review of the data set. Visualization of patterns, distributions, and correlations between variables has provided valuable information for initial understanding of the structure of the data set.
The objective of data preparation and exploration is to ensure a solid and reliable basis for the development of sales prediction models in the field of data science.
The dataset has 510,776 observations and 9 variables. We will check whether there are missing values in the dataset.
## [1] 510776 9
## [1] FALSE
Checking Null Values:
## Store DayOfWeek Date Sales Customers
## 0 0 0 0 0
## Open Promo StateHoliday SchoolHoliday
## 0 0 0 0
## Store DayOfWeek Date Sales
## Min. : 1.0 Min. :1.000 Min. :2021-01-01 Min. : 0
## 1st Qu.:141.0 1st Qu.:2.000 1st Qu.:2021-08-17 1st Qu.: 3785
## Median :281.0 Median :4.000 Median :2022-04-02 Median : 5778
## Mean :280.8 Mean :4.003 Mean :2022-04-11 Mean : 5807
## 3rd Qu.:420.0 3rd Qu.:6.000 3rd Qu.:2022-12-12 3rd Qu.: 7896
## Max. :560.0 Max. :7.000 Max. :2023-07-31 Max. :38722
## Customers Open Promo StateHoliday
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Length:510776
## 1st Qu.: 405.0 1st Qu.:1.0000 1st Qu.:0.0000 Class :character
## Median : 607.0 Median :1.0000 Median :0.0000 Mode :character
## Mean : 629.6 Mean :0.8306 Mean :0.3815
## 3rd Qu.: 834.0 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :5494.0 Max. :1.0000 Max. :1.0000
## SchoolHoliday
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.1785
## 3rd Qu.:0.0000
## Max. :1.0000
Unique rows in sales_df:
## # A tibble: 510,776 × 9
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## 7 1 1 2021-02-15 6407 710 1 1 0
## 8 1 1 2021-02-22 4038 534 1 0 0
## 9 1 1 2021-03-01 7675 840 1 1 0
## 10 1 1 2021-03-08 4949 618 1 0 0
## # ℹ 510,766 more rows
## # ℹ 1 more variable: SchoolHoliday <dbl>
unique_count <- sales_df %>%
summarise(UniqueCount = n_distinct(Store))
print(unique_count$UniqueCount)
## [1] 560
Value counts for StateHoliday:
##
## 0 a b c
## 495254 10104 3360 2058
Print the initial (first) date:
## [1] "2021-01-04"
Print the final (last) date:
## [1] "2023-07-30"
Assuming the data frame is ‘sales_df’, the date values must be organized so that only columns with numerical values remain:
sales_df$Date <- as.Date(sales_df$Date)
sales_df$Year <- year(sales_df$Date)
sales_df$Month <- month(sales_df$Date)
sales_df$Day <- day(sales_df$Date)
sales_df$WeekOfYear <- week(sales_df$Date)
print(sales_df)## # A tibble: 510,776 × 13
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2021-01-04 7176 785 1 1 0
## 2 1 1 2021-01-11 4717 616 1 0 0
## 3 1 1 2021-01-18 5394 607 1 1 0
## 4 1 1 2021-01-25 4055 549 1 0 0
## 5 1 1 2021-02-01 7032 762 1 1 0
## 6 1 1 2021-02-08 4409 599 1 0 0
## 7 1 1 2021-02-15 6407 710 1 1 0
## 8 1 1 2021-02-22 4038 534 1 0 0
## 9 1 1 2021-03-01 7675 840 1 1 0
## 10 1 1 2021-03-08 4949 618 1 0 0
## # ℹ 510,766 more rows
## # ℹ 5 more variables: SchoolHoliday <dbl>, Year <dbl>, Month <dbl>, Day <int>,
## # WeekOfYear <dbl>
## # A tibble: 4 × 13
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <dbl> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1 1 2023-07-31 0 0 0 0 a
## 2 2 1 2023-07-31 0 0 0 0 a
## 3 3 1 2023-07-31 0 0 0 0 a
## 4 4 1 2023-07-31 0 0 0 0 a
## # ℹ 5 more variables: SchoolHoliday <dbl>, Year <dbl>, Month <dbl>, Day <int>,
## # WeekOfYear <dbl>
The columns to be removed: Date, StateHoliday:
## # A tibble: 510,776 × 11
## Store DayOfWeek Sales Customers Open Promo SchoolHoliday Year Month Day
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 1 7176 785 1 1 1 2021 1 4
## 2 1 1 4717 616 1 0 0 2021 1 11
## 3 1 1 5394 607 1 1 0 2021 1 18
## 4 1 1 4055 549 1 0 0 2021 1 25
## 5 1 1 7032 762 1 1 0 2021 2 1
## 6 1 1 4409 599 1 0 0 2021 2 8
## 7 1 1 6407 710 1 1 0 2021 2 15
## 8 1 1 4038 534 1 0 0 2021 2 22
## 9 1 1 7675 840 1 1 0 2021 3 1
## 10 1 1 4949 618 1 0 0 2021 3 8
## # ℹ 510,766 more rows
## # ℹ 1 more variable: WeekOfYear <dbl>
## tibble [510,776 × 11] (S3: tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num [1:510776] 2021 2021 2021 2021 2021 ...
## $ Month : num [1:510776] 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int [1:510776] 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num [1:510776] 1 2 3 4 5 6 7 8 9 10 ...
columns_remove_numeric_day <- colums_removed %>%
mutate(Day = as.numeric(Day))
str(columns_remove_numeric_day)
## tibble [510,776 × 11] (S3: tbl_df/tbl/data.frame)
## $ Store : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num [1:510776] 7176 4717 5394 4055 7032 ...
## $ Customers : num [1:510776] 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num [1:510776] 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num [1:510776] 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday: num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num [1:510776] 2021 2021 2021 2021 2021 ...
## $ Month : num [1:510776] 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : num [1:510776] 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num [1:510776] 1 2 3 4 5 6 7 8 9 10 ...
To select the best variables for our analysis, the correlation of each independent variable with the dependent variable will be measured, and variables with a low correlation with the dependent variable will be eliminated.
Let’s now create correlation plots to evaluate the most important variables in predicting the sales.
corr_plot_vals <- cor(columns_remove_numeric_day)
corrplot(corr_plot_vals, method="shade",shade.col=NA, tl.col="blue", tl.srt=50, addCoef.col = "black", number.cex = 0.7, col = COL2('RdYlBu', 10))
There are some variables that influence the number of sales: Customers; Open (whether the store is open); and Promo (whether the store had a promotion running).
In the following graph we can see how many stores were open and closed on each day of the week:
day_labels <- c("Mon","Tue", "Wed", "Thu", "Fri","Sat","Sun")
sales_df$DayOfWeek <- factor(sales_df$DayOfWeek, labels = day_labels)
ggplot(sales_df, aes(x = DayOfWeek, fill = factor(Open))) +
geom_bar(position = "dodge") +
labs(title = "Sales vs Day of week", fill = "Open") +
xlab("Day of the Week") +
theme_gray() +
scale_fill_manual(values = c("blue", "darkorange"))In the following graph we can see how sales increase when the store has a promotion:
Promo_sales <- sales_df %>%
group_by(Promo) %>%
summarize(mean_sales = mean(Sales))
ggplot(Promo_sales, aes(x = Promo, y = mean_sales)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Mean sales vs Promo", x = "Promo", y = "Mean Sales") +
theme_gray()
It can be observed that in the months of November and December, sales increase every year.
months_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
sales_df$Month <- factor(sales_df$Month, labels = months_labels)
aggregatevalue <- aggregate(Sales ~ Month + Year, data=sales_df, FUN = sum)
ggplot(aggregatevalue, aes(x = Month, y = Sales, fill = factor(Year))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total sales vs month and year",
x = "Month",
y = "Sales",
fill = "Year") +
theme_gray() +
scale_fill_manual(values = c("blue", "darkorange", "lightgreen")) School Holiday Column Values:
##
## 0 1
## 419623 91153
## num [1:510776] 1 0 0 0 0 0 0 0 0 0 ...
It can be observed from the graph that sales are largely unaffected by school holidays: only 18% of the sales observations fall on a school holiday.
labels <- c('Not-Affected', 'Affected')
sizes <- table(sales_df$SchoolHoliday)
colors <- c('lightgreen', 'blue')
explode <- c(0.1, 0.0)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data <- data.frame(labels, sizes, percent_labels)
ggplot(data, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Sales vs Schoolholiday') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 1.5), position = position_stack(vjust = 0.5), size = 6) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Transforming Variable StateHoliday:
##
## 0 1
## 495254 15522
It can be observed in the graph that sales are largely unaffected by state holidays: only 3% of the sales observations fall on a state holiday.
labels <- c('Not-Affected', 'Affected')
sizes <- table(sales_df$StateHoliday)
colors <- c('lightgreen', 'blue')
explode <- c(0.1, 0.0)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data1 <- data.frame(labels, sizes, percent_labels)
ggplot(data1, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Sales vs State holiday') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 3), position = position_stack(vjust = 0.5), size = 4) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Since sales are barely affected by state holidays, I will remove this column from the analysis.
Histogram representation of Sales. The spike at 0 appears because stores report zero sales on the days they were closed.
ggplot(sales_df, aes(x = Sales)) +
geom_histogram(binwidth = 1000, fill = "blue", color = "white") +
labs(title = "Distribution of sales", x = "Sales", y = "Frequency") +
theme_gray() +
theme(plot.title = element_text(size = 15))
Add a density curve:
ggplot(sales_df, aes(x = Sales)) +
geom_histogram(binwidth = 1000, fill = "lightblue", color = "white") +
geom_density(aes(y = ..count.. * 1000), color = "red") +
labs(x = "Sales", y = "Density") +
theme_gray() +
theme(plot.title = element_text(size = 20))
In the graph we see the relationship between the number of customers and sales.
ggplot(sales_df, aes(x = Sales, y = Customers)) +
geom_point(color = "lightgreen", size = 2) +
geom_smooth(method = "lm", color = "blue", se = FALSE) +
labs(title = "Sales vs Customers", x = "Sales", y = "Customers") +
theme_gray() +
theme(plot.title = element_text(size = 15))
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 c a 1270 2018 0
## 2 2 a a 570 2007 1
## 3 3 a a 14130 2006 1
## 4 4 c c 620 2009 0
## 5 5 a a 29910 2015 0
## 6 6 a a 310 2013 0
## # A tibble: 6 × 6
## Store StoreType Assortment CompetitionDistance CompetitionOpenSinceYear Promo2
## <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 555 d a 1560 2014 1
## 2 556 d c 1140 2014 0
## 3 557 a a 250 NA 0
## 4 558 a a 3000 2010 0
## 5 559 d a 3910 2006 1
## 6 560 c c 1910 2013 0
Checking Information about Dataset. Print the dimensions:
n_rows <- nrow(store_df) # Get the number of rows
n_cols <- ncol(store_df) # Get the number of columns
cat("Number of rows:", n_rows, "\n")## Number of rows: 560
## Number of columns: 6
## spc_tbl_ [560 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Store : num [1:560] 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr [1:560] "c" "a" "a" "c" ...
## $ Assortment : chr [1:560] "a" "a" "a" "c" ...
## $ CompetitionDistance : num [1:560] 1270 570 14130 620 29910 ...
## $ CompetitionOpenSinceYear: num [1:560] 2018 2007 2006 2009 2015 ...
## $ Promo2 : num [1:560] 0 1 1 0 0 0 0 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. Store = col_double(),
## .. StoreType = col_character(),
## .. Assortment = col_character(),
## .. CompetitionDistance = col_double(),
## .. CompetitionOpenSinceYear = col_double(),
## .. Promo2 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
It can be observed that there are missing values in the dataset; we are going to convert these entries to ‘NA’ in order to carry out our analysis.
## [1] TRUE
Display the count of missing values for each column:
## Store StoreType Assortment
## 0 0 0
## CompetitionDistance CompetitionOpenSinceYear Promo2
## 1 184 0
vis_dat(store_df) + scale_fill_manual(
values = c(
"character" = "#2297E6",
"numeric" = "lightgreen",
"NA" = "lightgray"
)) +
theme(axis.text.x = element_text(angle = 60, hjust = 0, size=12, face="bold"))
Distribution of the different store types:
labels <- c('a', 'b', 'c', 'd')
sizes <- table(store_df$StoreType)
colors <- c('orange', 'lightgreen', 'lightblue', 'pink')
explode <- c(0.1, 0.0, 0.15, 0.5)
percent_labels <- round((sizes / sum(sizes)) * 100, 1)
data2 <- data.frame(labels, sizes, percent_labels)
ggplot(data2, aes(x = "", y = sizes, fill = labels)) +
geom_bar(stat = "identity", width = 3) +
coord_polar(theta = "y") +
scale_fill_manual(values = colors) +
labs(title = 'Analysis sales per Store type') +
theme_void() +
geom_text(aes(label = paste0(percent_labels, "%"), x = 3), position = position_stack(vjust = 0.5), size = 4) +
theme(legend.position = "right") +
theme(plot.title = element_text(hjust = 0.5, size = 15))
Removing features with high percentages of missing values: some features have a high percentage of missing values and will not be accurate indicators, so we will remove features with more than 30% missing values.
Replace missing values in features with low percentages of missing values.
We review the distribution of CompetitionDistance, which is the distance in meters to the nearest competitor store.
ggplot(store_df_30_percent, aes(x = CompetitionDistance)) +
geom_histogram(fill = "lightblue", color = "white") +
geom_density(aes(y = ..count.. * 1900), color = "red") +
labs(x = "Competition Distance", y = "Density") +
theme_gray() +
theme(plot.title = element_text(size = 15)) +
ggtitle("Distribution Competition Distance")The distribution is right skewed, so we’ll replace missing values with the median.
store_df$CompetitionDistance <- ifelse(is.na(store_df$CompetitionDistance), median(store_df$CompetitionDistance, na.rm = TRUE), store_df$CompetitionDistance)
Pair plot for the store dataset: we can observe the different graphs to perform a quick analysis of the store dataset.
theme_set(theme_minimal())
theme_update(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
pp <- ggplot(store_df, aes(color = StoreType)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x) +
facet_grid(. ~ StoreType, scales = "free") +
theme(legend.position = "right")We can observe that the Store dataset data has a very weak or practically non-existent correlation between the variables represented.In this case, a significant relationship between the variables cannot be established from observing the data in the pair plot.
ggpairs(store_df, aes(color = StoreType)) + theme(axis.text.x = element_text(angle = 90, hjust = 0.9, size=8))
Checking stores by their assortment type:
theme_set(theme_gray())
store_type <- ggplot(store_df, aes(x = StoreType, fill = Assortment)) +
geom_bar(position = "dodge") +
theme_gray() +
labs(fill = "Assortment") +
theme(legend.position = "right") +
labs(title = 'Assortment vs Store type', x = "StoreType") +
scale_fill_manual(values = c("#41B7C4", "orange", "#CCEDB1")) +
theme(plot.title = element_text(size = 15))
store_type_data <- store_df %>%
group_by(StoreType, Assortment) %>%
summarise(n = n())
store_type_data <- store_type_data %>%
group_by(StoreType) %>%
mutate(percentage = n / sum(n) * 100)
store_type <- store_type +
geom_text(data = store_type_data,
aes(label = paste0(n, " (", round(percentage, 1), "%)"), y = n),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 4,
color = "black")
print(store_type)
We can see in the graph that there are no significant differences in sales across these three years:
ggplot(sales_df, aes(x = as.factor(Year), y = Sales)) +
geom_bar(stat = "summary", fun = "mean", position = position_dodge(width = 0.8), fill = "blue") +
geom_point(aes(group = Store, color = Store), position = position_jitterdodge(jitter.width = 0.3), size = 3) +
theme_minimal() +
labs(title = "Average Sales vs Year") +
xlab("Year") +
ylab("Average Sales") +
theme(plot.title = element_text(hjust = 0.5)) +
theme(legend.position = "top")We take the dataframes are named sales_df and store_df.
## Store DayOfWeek Date Sales Customers Open Promo SchoolHoliday Year
## 1 1 Mon 2021-01-04 7176 785 1 1 1 2021
## 2 1 Mon 2021-01-11 4717 616 1 0 0 2021
## 3 1 Mon 2021-01-18 5394 607 1 1 0 2021
## 4 1 Mon 2021-01-25 4055 549 1 0 0 2021
## 5 1 Mon 2021-02-01 7032 762 1 1 0 2021
## 6 1 Mon 2021-02-08 4409 599 1 0 0 2021
## Month Day WeekOfYear StoreType Assortment CompetitionDistance Promo2
## 1 Jan 4 1 c a 1270 0
## 2 Jan 11 2 c a 1270 0
## 3 Jan 18 3 c a 1270 0
## 4 Jan 25 4 c a 1270 0
## 5 Feb 1 5 c a 1270 0
## 6 Feb 8 6 c a 1270 0
## Store DayOfWeek Date Sales Customers Open Promo SchoolHoliday Year
## 510771 560 Sun 2023-06-25 0 0 0 0 0 2023
## 510772 560 Sun 2023-07-02 0 0 0 0 0 2023
## 510773 560 Sun 2023-07-09 0 0 0 0 0 2023
## 510774 560 Sun 2023-07-16 0 0 0 0 0 2023
## 510775 560 Sun 2023-07-23 0 0 0 0 0 2023
## 510776 560 Sun 2023-07-30 0 0 0 0 0 2023
## Month Day WeekOfYear StoreType Assortment CompetitionDistance Promo2
## 510771 Jun 25 26 c c 1910 0
## 510772 Jul 2 27 c c 1910 0
## 510773 Jul 9 28 c c 1910 0
## 510774 Jul 16 29 c c 1910 0
## 510775 Jul 23 30 c c 1910 0
## 510776 Jul 30 31 c c 1910 0
## 'data.frame': 510776 obs. of 16 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Checking Information about Dataset. Print the dimensions:
## [1] 510776
## [1] 16
We are going to remove the following columns Date, StoreType, Assortment, CompetitionOpenSinceYear, because these variables are not relevant for the analysis:
colums_removed_merged <- subset(merged_df, select = -c(Date, StoreType, Assortment, CompetitionOpenSinceYear))
## 'data.frame': 510776 obs. of 13 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Convert DayOfWeek and Month to numeric:
colums_removed_merged$DayOfWeek <- as.numeric(colums_removed_merged$DayOfWeek)
colums_removed_merged$Month <- as.numeric(colums_removed_merged$Month)
## 'data.frame': 510776 obs. of 13 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : num 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
Calculate the absolute correlation matrix:
We can see that the correlations between the variables do not change when we merge the datasets.
We can observe which variables influence the amount of sales: Customers; Open (whether the store is open); and Promo (whether the store had a promotion running).
We create a new column, Avg_Customer_Sales, in the merged_df dataset, and then build graphs to observe the relationship between sales and store types:
plot1 <- ggplot(store_df, aes(x = StoreType)) +
geom_bar(fill="steelblue") +
labs(title = "Quantity StoreTypes") +
theme(plot.title = element_text(size = 12))
plot2 <- ggplot(merged_df, aes(x = StoreType, y = Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Sales vs StoreTypes") +
theme(plot.title = element_text(size = 12))
plot3 <- ggplot(merged_df, aes(x = StoreType, y = Customers)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Number Customers vs StoreTypes") +
theme(plot.title = element_text(size = 12))
grid.arrange(plot1, plot2, plot3, ncol = 2)
Mean values per store type:
mean_Sales <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Sales = mean(Sales))
plot4 <- ggplot(mean_Sales, aes(x = StoreType, y = mean_Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Sales vs StoreTypes") +
theme(plot.title = element_text(size = 12))
mean_Avg_Customer_Sales <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Avg_Customer_Sales = mean(Avg_Customer_Sales, na.rm = TRUE))
plot5 <-ggplot(mean_Avg_Customer_Sales, aes(x = StoreType, y = mean_Avg_Customer_Sales)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Spending vs StoreTypes") +
theme(plot.title = element_text(size = 12))
mean_Customer <- merged_df %>%
group_by(StoreType) %>%
summarise(mean_Customer = mean(Customers))
plot6 <- ggplot(mean_Customer, aes(x = StoreType, y = mean_Customer)) +
geom_bar(stat = "identity", fill="steelblue") +
labs(title = "Average Customers vs StoreType") +
theme(plot.title = element_text(size = 12))
grid.arrange(plot4, plot5, plot6, ncol = 2)
We can see in the graphs that store type A has the greatest number of stores; it also has the greatest total sales and the greatest number of customers.
Type D stores have the best sales averages per customer.
Type B stores have the highest number of average customers.
Let’s analyze the impact of sales with promotions:
colors <- c('coral', 'steelblue')
ggplot(merged_df, aes(x = Month, y = Sales, fill = factor(Promo2))) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(. ~ Year + Promo, nrow=2, scales = "free") +
scale_fill_manual(values = colors) +
labs(title = "Sales vs Promo", x = "Month", y = "Sales") +
theme_gray() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
We can see that when promotions are applied, sales are higher. According to this result, there are advantages to applying promotions to boost sales.
colors <- c('coral2', 'steelblue')
ggplot(merged_df, aes(x = DayOfWeek, y = Sales, fill = factor(Promo))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = colors) +
labs(x = "Month", y = "Sales") +
theme_gray() +
theme(axis.text.x = element_text(angle = 0, hjust = 1))
When stores offer promotions, sales are higher. On Sundays there are no promotions, yet sales remain high.
We are going to analyze whether there are variations in sales between the company’s store types. For this we graph the sales of the different store types for each month and year.
colors <- c('green3', 'indianred', 'deepskyblue', 'blueviolet')
ggplot(merged_df, aes(x = Month, y = Sales, fill = StoreType)) +
geom_bar(stat = "identity", position = "dodge") +
facet_grid(Year ~ .) +
scale_fill_manual(values = colors) +
labs(x = "Month", y = "Sales") +
theme_minimal()
We can see that the sales trend is the same for each type of store, with sales increasing in the month of December. In general, over the years store type B generates the highest average sales relative to the number of stores of this type.
Let’s categorize the competition distance from nearest to farthest.
merged_df$CompetitionDistance_Cat <- cut(merged_df$CompetitionDistance, breaks = 5, labels = c("near_1mile", "near_2mile", "near_3mile", "near_4mile", "near_5mile"))
summary(merged_df$CompetitionDistance_Cat)
## near_1mile near_2mile near_3mile near_4mile near_5mile
## 459680 40734 7536 1884 942
summary_df_merged <- merged_df %>%
group_by(CompetitionDistance_Cat) %>%
summarize(Avg_Sales = mean(Sales), Avg_Customers = mean(Customers))
plot7 <- ggplot(summary_df_merged, aes(x = CompetitionDistance_Cat, y = Avg_Sales)) +
geom_bar(stat = "identity", fill = "#41B7C4") +
labs(title = "Average Sales vs Competition Distance") + theme(axis.text.x = element_text(angle = 90, hjust = 0.9, size=8)) + theme(plot.title = element_text(size = 10))
plot8 <- ggplot(summary_df_merged, aes(x = CompetitionDistance_Cat, y = Avg_Customers)) +
geom_bar(stat = "identity", fill = "#CCEDB1") +
labs(title = "Average Customers vs Competition Distance") + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=9)) + theme(plot.title = element_text(size = 10))
grid.arrange(plot7, plot8, ncol = 2)
We can observe that sales are not noticeably affected by the distance to the competition.
For our data management we are going to delete the columns Avg_Customer_Sales and CompetitionDistance_Cat
merged_df_drop <- merged_df[, !(names(merged_df) %in% c("Avg_Customer_Sales", "CompetitionDistance_Cat"))]
## 'data.frame': 510776 obs. of 17 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Open : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
The boxplot provides a compact, easy-to-interpret representation of the distribution of the data, and provides information about the presence of outliers.
The box plot shows that there are very high outliers in sales.
ggplot(sales_df, aes(x = Sales)) +
geom_boxplot(fill="lightblue") +
labs(x = "Sales") +
theme_gray()
Clearing outliers from merged_df:
remove_outlier <- function(merged_df_in, col_name) {
q1 <- quantile(merged_df_in[[col_name]], 0.25)
q3 <- quantile(merged_df_in[[col_name]], 0.75)
iqr <- q3 - q1
fence_low <- q1 - 1.5 * iqr
fence_high <- q3 + 1.5 * iqr
merged_df_out <- merged_df_in[merged_df_in[[col_name]] > fence_low & merged_df_in[[col_name]] < fence_high, ]
return(merged_df_out)
}
We are going to remove from the dataset the rows where stores are closed, since they do not generate sales.
We will also remove the Open column, since after this filtering it no longer varies.
We are going to check if there is any open store that does not report sales.
## [1] 5423
Let’s review the percentage of open stores with zero sales.
## [1] 0
Let’s remove this part of data to avoid bias.
## 'data.frame': 410825 obs. of 18 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ StoreType : chr "c" "c" "c" "c" ...
## $ Assortment : chr "a" "a" "a" "a" ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s review the “Assortment” and “StoreType” columns of the data frame. We define the columns to one-hot encode and create dummy variables.
columns_to_encode <- c("Assortment", "StoreType")
for (col in columns_to_encode) {
dummies <- model.matrix(~0 + factor(new_merged_df[[col]]))
colnames(dummies) <- paste0(col, "_", colnames(dummies))
new_merged_df <- cbind(new_merged_df, dummies)
}
new_merged_df <- new_merged_df[, !(names(new_merged_df) %in% columns_to_encode)]
## 'data.frame': 410825 obs. of 23 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear : num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_factor(new_merged_df[[col]])a: num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_factor(new_merged_df[[col]])b: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_factor(new_merged_df[[col]])c: num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_factor(new_merged_df[[col]])c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_factor(new_merged_df[[col]])d : num 0 0 0 0 0 0 0 0 0 0 ...
new_merged_df <- new_merged_df %>%
rename(
Assortment_a = 'Assortment_factor(new_merged_df[[col]])a',
Assortment_b = 'Assortment_factor(new_merged_df[[col]])b',
Assortment_c = 'Assortment_factor(new_merged_df[[col]])c',
StoreType_a = 'StoreType_factor(new_merged_df[[col]])a',
StoreType_b = 'StoreType_factor(new_merged_df[[col]])b',
StoreType_c = 'StoreType_factor(new_merged_df[[col]])c',
StoreType_d = 'StoreType_factor(new_merged_df[[col]])d'
)
## 'data.frame': 410825 obs. of 23 variables:
## $ Store : num 1 1 1 1 1 1 1 1 1 1 ...
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Date, format: "2021-01-04" "2021-01-11" ...
## $ Sales : num 7176 4717 5394 4055 7032 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Year : num 2021 2021 2021 2021 2021 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance : num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ CompetitionOpenSinceYear: num 2018 2018 2018 2018 2018 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Avg_Customer_Sales : num 9.14 7.66 8.89 7.39 9.23 ...
## $ CompetitionDistance_Cat : Factor w/ 5 levels "near_1mile","near_2mile",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
## chr [1:2] "Assortment" "StoreType"
Count of rows where each Assortment dummy has value 0:
sum(!is.na(new_merged_df$Assortment_a) & is.numeric(new_merged_df$Assortment_a) & new_merged_df$Assortment_a == 0)
## [1] 201076
sum(!is.na(new_merged_df$Assortment_b) & is.numeric(new_merged_df$Assortment_b) & new_merged_df$Assortment_b == 0)
## [1] 407427
sum(!is.na(new_merged_df$Assortment_c) & is.numeric(new_merged_df$Assortment_c) & new_merged_df$Assortment_c == 0)
## [1] 213147
Count of rows where each Assortment dummy has value 1:
sum(!is.na(new_merged_df$Assortment_a) & is.numeric(new_merged_df$Assortment_a) & new_merged_df$Assortment_a == 1)
## [1] 209749
sum(!is.na(new_merged_df$Assortment_b) & is.numeric(new_merged_df$Assortment_b) & new_merged_df$Assortment_b == 1)
## [1] 3398
sum(!is.na(new_merged_df$Assortment_c) & is.numeric(new_merged_df$Assortment_c) & new_merged_df$Assortment_c == 1)
## [1] 197678
Count of rows where each StoreType dummy has value 0:
sum(!is.na(new_merged_df$StoreType_a) & is.numeric(new_merged_df$StoreType_a) & new_merged_df$StoreType_a == 0)
## [1] 189777
sum(!is.na(new_merged_df$StoreType_b) & is.numeric(new_merged_df$StoreType_b) & new_merged_df$StoreType_b == 0)
## [1] 404118
sum(!is.na(new_merged_df$StoreType_c) & is.numeric(new_merged_df$StoreType_c) & new_merged_df$StoreType_c == 0)
## [1] 361081
sum(!is.na(new_merged_df$StoreType_d) & is.numeric(new_merged_df$StoreType_d) & new_merged_df$StoreType_d == 0)
## [1] 277499
Count of rows where each StoreType dummy has value 1:
sum(!is.na(new_merged_df$StoreType_a) & is.numeric(new_merged_df$StoreType_a) & new_merged_df$StoreType_a == 1)
## [1] 221048
sum(!is.na(new_merged_df$StoreType_b) & is.numeric(new_merged_df$StoreType_b) & new_merged_df$StoreType_b == 1)
## [1] 6707
sum(!is.na(new_merged_df$StoreType_c) & is.numeric(new_merged_df$StoreType_c) & new_merged_df$StoreType_c == 1)
## [1] 49744
sum(!is.na(new_merged_df$StoreType_d) & is.numeric(new_merged_df$StoreType_d) & new_merged_df$StoreType_d == 1)
## [1] 133326
The data frame:
## Store DayOfWeek Date Sales Customers Promo SchoolHoliday Year Month Day
## 1 1 Mon 2021-01-04 7176 785 1 1 2021 Jan 4
## 2 1 Mon 2021-01-11 4717 616 0 0 2021 Jan 11
## 3 1 Mon 2021-01-18 5394 607 1 0 2021 Jan 18
## 4 1 Mon 2021-01-25 4055 549 0 0 2021 Jan 25
## 5 1 Mon 2021-02-01 7032 762 1 0 2021 Feb 1
## 6 1 Mon 2021-02-08 4409 599 0 0 2021 Feb 8
## WeekOfYear CompetitionDistance CompetitionOpenSinceYear Promo2
## 1 1 1270 2018 0
## 2 2 1270 2018 0
## 3 3 1270 2018 0
## 4 4 1270 2018 0
## 5 5 1270 2018 0
## 6 6 1270 2018 0
## Avg_Customer_Sales CompetitionDistance_Cat Assortment_a Assortment_b
## 1 9.141401 near_1mile 1 0
## 2 7.657468 near_1mile 1 0
## 3 8.886326 near_1mile 1 0
## 4 7.386157 near_1mile 1 0
## 5 9.228346 near_1mile 1 0
## 6 7.360601 near_1mile 1 0
## Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d
## 1 0 0 0 1 0
## 2 0 0 0 1 0
## 3 0 0 0 1 0
## 4 0 0 0 1 0
## 5 0 0 0 1 0
## 6 0 0 0 1 0
We are going to carry out a sales analysis by day of the week:
Colors2 <- c("steelblue", "orange2", "forestgreen", "purple2", "royalblue2", "lightgreen", "skyblue2")
average_sales <- new_merged_df %>%
group_by(DayOfWeek) %>%
summarize(AvgSales = mean(Sales))
ggplot(average_sales, aes(x = DayOfWeek, y = AvgSales, fill = factor(DayOfWeek))) +
geom_bar(stat = "identity") +
labs(x = "DayOfWeek", y = "Average Sales") +
scale_fill_manual(values = Colors2)
From the plot it can be seen that the highest average sales occur on Mondays and Fridays.
new_merged_df <- subset(new_merged_df, select = -c(CompetitionOpenSinceYear,Avg_Customer_Sales,CompetitionDistance_Cat))
Creation of the target and features:
X <- new_merged_df[, !(names(new_merged_df) %in% c("Sales", "Store", "Date", "Year"))]
y <- new_merged_df$Sales
## 'data.frame': 410825 obs. of 16 variables:
## $ DayOfWeek : Factor w/ 7 levels "Mon","Tue","Wed",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Month : Factor w/ 12 levels "Jan","Feb","Mar",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
## [1] 410825 16
## DayOfWeek Customers Promo SchoolHoliday Month Day WeekOfYear
## 1 Mon 785 1 1 Jan 4 1
## 2 Mon 616 0 0 Jan 11 2
## 3 Mon 607 1 0 Jan 18 3
## 4 Mon 549 0 0 Jan 25 4
## 5 Mon 762 1 0 Feb 1 5
## 6 Mon 599 0 0 Feb 8 6
## CompetitionDistance Promo2 Assortment_a Assortment_b Assortment_c StoreType_a
## 1 1270 0 1 0 0 0
## 2 1270 0 1 0 0 0
## 3 1270 0 1 0 0 0
## 4 1270 0 1 0 0 0
## 5 1270 0 1 0 0 0
## 6 1270 0 1 0 0 0
## StoreType_b StoreType_c StoreType_d
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 1 0
## 5 0 1 0
## 6 0 1 0
## [1] 7176 4717 5394 4055 7032 4409
## num [1:410825] 7176 4717 5394 4055 7032 ...
Create a vector with the names of days of the week in the correct order. Then, convert the factor to numerical values.
day_week <- c("Monday" = 1, "Tuesday" = 2, "Wednesday" = 3, "Thursday" = 4, "Friday" = 5, "Saturday" = 6, "Sunday" = 7)
X$DayOfWeek <- day_week[X$DayOfWeek]
## 'data.frame': 410825 obs. of 16 variables:
## $ DayOfWeek : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Customers : num 785 616 607 549 762 599 710 534 840 618 ...
## $ Promo : num 1 0 1 0 1 0 1 0 1 0 ...
## $ SchoolHoliday : num 1 0 0 0 0 0 0 0 0 0 ...
## $ Month : num 1 1 1 1 2 2 2 2 3 3 ...
## $ Day : int 4 11 18 25 1 8 15 22 1 8 ...
## $ WeekOfYear : num 1 2 3 4 5 6 7 8 9 10 ...
## $ CompetitionDistance: num 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
## $ Promo2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_a : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Assortment_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Assortment_c : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_a : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_b : num 0 0 0 0 0 0 0 0 0 0 ...
## $ StoreType_c : num 1 1 1 1 1 1 1 1 1 1 ...
## $ StoreType_d : num 0 0 0 0 0 0 0 0 0 0 ...
Upon review we can see that sales are highly correlated with the number of Customers.
The store type that sold the most and was most visited is type A, which is also the type with the largest number of stores.
Store type B has the lowest total sales and the fewest stores, yet compared to the other types it has the highest average number of customers, and its average sales per store are the highest.
Store Type D has higher Average Spending.
When promotions are applied, sales are high.
Currently the company applies promotions only from Monday to Saturday. For all stores, the promotion leads to an increase in both sales and customers.
82.2% of sales observations are unaffected by school holidays, and 97% are unaffected by state holidays.
Sales increase during Christmas week; this could be because people buy more products for the Christmas celebration.
Promo2 does not appear to be correlated with any significant change in sales amount.
We can also observe that the distance from the competition is not considerably affecting the amount of sales of the company’s stores.
Sales increase in the months of November and December, for each year analyzed.
Therefore, data subsets that could introduce bias were eliminated.
We are going to divide the data into training and test sets. With the split ratio of 0.3 used below, 30% of the data goes to the training set and the remaining 70% is held out as the test set.
set.seed(11)
split <- sample.split(y, SplitRatio = 0.3)
X_train <- X[split, ]
X_test <- X[!split, ]
y_train <- y[split]
y_test <- y[!split]
We now review both sets for missing values (NA):
## DayOfWeek Customers Promo SchoolHoliday
## 0 0 0 0
## Month Day WeekOfYear CompetitionDistance
## 0 0 0 0
## Promo2 Assortment_a Assortment_b Assortment_c
## 0 0 0 0
## StoreType_a StoreType_b StoreType_c StoreType_d
## 0 0 0 0
## DayOfWeek Customers Promo SchoolHoliday
## 0 0 0 0
## Month Day WeekOfYear CompetitionDistance
## 0 0 0 0
## Promo2 Assortment_a Assortment_b Assortment_c
## 0 0 0 0
## StoreType_a StoreType_b StoreType_c StoreType_d
## 0 0 0 0
Data preprocessing is a crucial phase in developing machine learning models. It is done with the goal of improving data quality and suitability before training a model. We fit the same preprocessing on the training data and then apply the resulting transformation to both the training and test sets using the predict function.
preproc <- preProcess(X_train, method = c("range"))
X_train_scaled <- predict(preproc, X_train)
X_test_scaled <- predict(preproc, X_test)
## Call:
## lm(formula = y_train ~ ., data = X_train_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6996.6 -729.5 -94.0 632.1 7492.5
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1618.907 15.797 102.485 < 2e-16 ***
## DayOfWeek -115.523 11.498 -10.047 < 2e-16 ***
## Customers 22130.647 35.850 617.305 < 2e-16 ***
## Promo 1107.367 6.791 163.067 < 2e-16 ***
## SchoolHoliday 58.115 8.123 7.155 8.43e-13 ***
## Month -11915.725 498.389 -23.909 < 2e-16 ***
## Day -1044.695 45.875 -22.772 < 2e-16 ***
## WeekOfYear 13136.440 542.196 24.228 < 2e-16 ***
## CompetitionDistance 1565.008 30.114 51.969 < 2e-16 ***
## Promo2 357.700 6.507 54.970 < 2e-16 ***
## Assortment_a -230.387 6.667 -34.556 < 2e-16 ***
## Assortment_b -3329.081 49.859 -66.769 < 2e-16 ***
## Assortment_c NA NA NA NA
## StoreType_a -1189.127 7.487 -158.830 < 2e-16 ***
## StoreType_b -3568.618 37.231 -95.850 < 2e-16 ***
## StoreType_c -1178.287 10.847 -108.631 < 2e-16 ***
## StoreType_d NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1105 on 123656 degrees of freedom
## Multiple R-squared: 0.8034, Adjusted R-squared: 0.8033
## F-statistic: 3.609e+04 on 14 and 123656 DF, p-value: < 2.2e-16
## (Intercept)
## 1618.907
Excluding the intercept:
## DayOfWeek Customers Promo SchoolHoliday
## -115.52287 22130.64738 1107.36704 58.11541
## Month Day WeekOfYear CompetitionDistance
## -11915.72507 -1044.69535 13136.43988 1565.00824
## Promo2 Assortment_a Assortment_b Assortment_c
## 357.69983 -230.38737 -3329.08093 NA
## StoreType_a StoreType_b StoreType_c StoreType_d
## -1189.12678 -3568.61756 -1178.28671 NA
## R^2: 0.8033675
An R-squared of 0.8034 means that the model captures 80.34% of the variability in the response variable. Since this falls short of the accuracy we are aiming for, we will not evaluate this model further or use it to make predictions.
PLS model reduces the predictors to a smaller set of uncorrelated components and then performs least squares regression on these components, instead of on the original data. PLS finds linear combinations of the predictors called components.
set.seed(12)
pls_model <- train(x=X_train,
y=y_train,
method="pls",
metric="Rsquared",
tuneLength=10,
trControl=trainControl(method = "cv")
)
pls_model
## Partial Least Squares
##
## 123671 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111305, 111304, 111304, 111304, 111304, 111303, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 2484.084 0.006062206 1987.5969
## 2 1496.154 0.639459181 1136.2643
## 3 1493.744 0.640621783 1135.0339
## 4 1415.372 0.677320835 1071.9181
## 5 1402.585 0.683127710 1061.3198
## 6 1255.875 0.745961756 926.4743
## 7 1200.774 0.767781425 895.3679
## 8 1156.929 0.784429022 890.0878
## 9 1112.353 0.800729627 858.7937
## 10 1107.309 0.802534733 853.9247
##
## Rsquared was used to select the optimal model using the largest value.
## The final value used for the model was ncomp = 10.
## ncomp
## 10 10
pls_model$results %>%
  filter(ncomp == pls_model$bestTune$ncomp) %>%
  dplyr::select(ncomp, RMSE, Rsquared)
## ncomp RMSE Rsquared
## 1 10 1107.309 0.8025347
data.frame(Rsquared = pls_model[["results"]][["Rsquared"]][as.numeric(rownames(pls_model$bestTune))],
           RMSE = pls_model[["results"]][["RMSE"]][as.numeric(rownames(pls_model$bestTune))])
## Rsquared RMSE
## 1 0.8025347 1107.309
The final value used for the model was ncomp = 10, which corresponds to the best-tuned model.
We see that R² is 0.802 and RMSE is 1107. An R² of 0.8025 means that 80.25% of the variability in the response variable has been captured by the model.
The Root Mean Square Error (RMSE) is a measure of model accuracy, representing the square root of the mean of the squared errors between the predictions and the actual values.
An RMSE of 1107 indicates the average size of the errors in the model predictions.
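As a quick illustration of the calculation on small hypothetical vectors:
actual <- c(5000, 6200, 4800)       # hypothetical observed sales
predicted <- c(5300, 6000, 5100)    # hypothetical predicted sales
sqrt(mean((actual - predicted)^2))  # RMSE, on the same scale as sales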
LARS Lasso is an algorithm that uses linear regression with L1 regularization (Lasso) to fit linear regression models. It combines LARS's ability to handle feature-rich data sets with the L1 (Lasso) penalty, which favors sparse coefficients and feature selection.
The value of alpha is 1 for Lasso (L1 regularization), and lambda is the regularization hyperparameter.
set.seed(13)
lasreg <- glmnet(X_train, y_train, alpha = 1, lambda = 0.3, standardize = TRUE, intercept = FALSE)
coef(lasreg)
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) .
## DayOfWeek -1.928252e+01
## Customers 7.439976e+00
## Promo 1.087095e+03
## SchoolHoliday 4.195243e+01
## Month 2.528475e+01
## Day 2.235604e+00
## WeekOfYear -2.164461e+00
## CompetitionDistance 2.100767e-02
## Promo2 3.656447e+02
## Assortment_a 3.668940e+02
## Assortment_b -2.734978e+03
## Assortment_c 5.957255e+02
## StoreType_a -3.416052e+02
## StoreType_b -2.718628e+03
## StoreType_c -3.301008e+02
## StoreType_d 8.479856e+02
Let’s convert X_test to a matrix and make predictions on the test set.
X_test_matrix <- as.matrix(X_test)
test_predictions <- predict(lasreg, s = 0, newx = X_test_matrix)
r_squared <- 1 - sum((y_test - test_predictions)^2) / sum((y_test - mean(y_test))^2)
rmse <- sqrt(mean((y_test - test_predictions)^2))
cat("R-squared:", r_squared, "\n")## R-squared: 0.7970162
## RMSE: 1109.792
With an R-squared of only 0.797 on the test set, this model does not perform well enough for our purposes, so we will not use it for predictions.
Let's use regression trees. Regression trees partition a data set into smaller groups and then fit a simple model to each subgroup.
set.seed(14)
st_model <- train(x=X_train,
y=y_train,
method = "rpart",
tuneLength = 5,
trControl = trainControl(method = "cv"))
st_model
## CART
##
## 123671 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111304, 111303, 111304, 111305, 111304, 111303, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.02265544 1507.687 0.6337774 1179.722
## 0.02776454 1553.565 0.6111829 1218.469
## 0.07775306 1669.504 0.5502088 1316.335
## 0.08166468 1769.907 0.4946644 1401.160
## 0.43848287 2058.366 0.4354218 1637.793
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.02265544.
## cp
## 1 0.02265544
data.frame(Rsquared = st_model[["results"]][["Rsquared"]][as.numeric(rownames(st_model$bestTune))],
           RMSE = st_model[["results"]][["RMSE"]][as.numeric(rownames(st_model$bestTune))])
## Rsquared RMSE
## 1 0.6337774 1507.687
RMSE was used to select the optimal model using the smallest value. The final value used for the model was cp = 0.02265544. We see an R² value of 0.6338 and an RMSE of 1507. With such a low R-squared (0.63), we are not going to use this model.
set.seed(15)
rdf_model <- randomForest(y_train ~ ., data = cbind(X_train, y_train),
ntree = 30, mtry = sqrt(ncol(X_train)), nodesize = 1)
We evaluate the model on the training and testing datasets: we make predictions on both sets and calculate the Root Mean Squared Error (RMSE) for each.
train_score <- cor(predict(rdf_model, X_train), y_train)^2
test_score <- cor(predict(rdf_model, X_test), y_test)^2
cat("Regression Model Score (R^2):", train_score, ", Test Score:", test_score, "\n")## Regression Model Score (R^2): 0.9572492 , Test Score: 0.9031334
y_predicted_train <- predict(rdf_model, X_train)
y_predicted_test <- predict(rdf_model, X_test)
train_rmse <- sqrt(mean((y_train - y_predicted_train)^2))
test_rmse <- sqrt(mean((y_test - y_predicted_test)^2))
cat("Training RMSE:", train_rmse, "Testing RMSE:", test_rmse, "\n")## Training RMSE: 550.5013 Testing RMSE: 783.3986
train_mape <- mean(ifelse(y_train != 0, abs((y_train - y_predicted_train) / y_train), 0)) * 100
test_mape <- mean(ifelse(y_test != 0, abs((y_test - y_predicted_test) / y_test), 0)) * 100
cat("Training MAPE:", train_mape, "Testing MAPE:", test_mape, "\n")## Training MAPE: 7.072235 Testing MAPE: 9.535869
## R-squared (Training): 0.9572492
## R-squared (Testing): 0.9031334
## RMSE (Training): 550.5013
## RMSE (Testing): 783.3986
## MAPE (Training): 7.072235
The results are performance metrics of the regression model, and we can see that they are quite reliable at first glance.
The R^2 on the training set is 0.9572, which is very high. It indicates that 95.72% of the variability in the dependent variable is explained by the model, which is quite good.
The R^2 on the test set is 0.90313, also quite high. An R^2 of 90.31% suggests that the model generalizes well to unseen data.
The RMSE on the training set is 550.5013. The RMSE measures the average magnitude of the errors between the predictions and the actual values. A value of 550.5013 indicates that, on average, the predictions are off by about 550 units from the dependent variable.
The RMSE on the test set is 783.3986. Although this value is higher than on the training set, it is still reasonably low. An RMSE of 783.3986 suggests that the predictions on the test set have an average error of about 783 units.
A MAPE of 7% means that, on average, the model's predictions have an absolute error of 7% with respect to the actual values. This is a good metric for our model.
Overall, these results suggest that the model performs well on both the training set and the test set, compared to the other models used.
Let's make predictions on the test set and look at the first few predictions.
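The prediction call itself is not shown; a minimal sketch, assuming the predictions are stored in y_test_predicted_2 (the name used further below):
y_test_predicted_2 <- predict(rdf_model, X_test)  # random forest predictions on the test set
head(y_test_predicted_2)                          # first few predicted sales values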
## 1 2 3 4 5 6
## 7665.388 4503.070 5941.009 4372.896 7435.233 4686.428
Let’s select the first 100 observations for the predictions and the actual values.
We create a DataFrame (dataset_rf):
rf_prd <- y_test_predicted_2[1:100]
rf_real <- y_test[1:100]
dataset_rf <- data.frame(Real = rf_real, PredictedRF = rf_prd)
Let's calculate the 'diff' column, which shows us the difference between the predicted values and the actual values.
Let’s select 15 random rows from dataset_rf:
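The code for these two steps is not shown; a minimal sketch, assuming diff is the absolute difference between actual and predicted sales:
dataset_rf$diff <- abs(dataset_rf$Real - dataset_rf$PredictedRF)  # assumed definition of the diff column
dataset_rf[sample(nrow(dataset_rf), 15), ]                        # inspect 15 random rows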
## Real PredictedRF diff
## 110 3965 4239.891 274.89112
## 101 4884 4933.362 49.36174
## 40 6004 5861.139 142.86060
## 127 4071 4050.015 20.98535
## 137 3900 4054.959 154.95903
## 114 3701 4118.196 417.19583
## 12 6729 7707.581 978.58103
## 126 5774 6044.958 270.95809
## 24 5337 5727.175 390.17495
## 59 5393 5855.572 462.57236
## 53 6194 6124.092 69.90771
## 100 7380 7186.384 193.61601
## 49 4462 4926.202 464.20176
## 23 4291 4349.157 58.15749
## 65 6008 5981.406 26.59431
## Real PredictedRF diff
## Min. :3414 Min. :3787 Min. : 4.747
## 1st Qu.:4028 1st Qu.:4218 1st Qu.: 138.011
## Median :4836 Median :5446 Median : 296.683
## Mean :5086 Mean :5345 Mean : 324.762
## 3rd Qu.:5786 3rd Qu.:6038 3rd Qu.: 462.980
## Max. :9528 Max. :9616 Max. :1020.181
Let’s create a histogram for each column:
theme_set(theme_minimal())
plots <- lapply(names(dataset_rf), function(col) {
ggplot(dataset_rf, aes_string(x = col)) +
geom_histogram(fill = "steelblue", color = "white") +
ggtitle(paste("Histogram of", col)) +
theme_minimal()
})
gridExtra::grid.arrange(grobs = plots, ncol = 2)
Now the scatter plot with regression line:
ggplot(dataset_rf, aes(x = Real, y = PredictedRF)) +
geom_point() +
geom_smooth(method = "lm", col = "red") +
labs(title = "Scatter Plot with Regression Line", x = "Real", y = "PredictedRF") +
theme_minimal()
We can see that the Random Forest with hyperparameter tuning model predicts the company's sales values very well.
Feature Importance in a Random Forest-based regression model is a measure that indicates the relative contribution of each feature to the model’s ability to make predictions.
This metric is particularly useful to understand which features are most influential in the model and can be used for decision making.
In the context of a Random Forest Regressor, the importance of features is calculated during model training and is based on how each feature contributes to reducing impurity or error in the nodes of the forest.
Features that are most effective in reducing impurity or error have greater importance.
importance_values <- randomForest::importance(rdf_model)
importance_df <- data.frame(
Feature = rownames(importance_values),
Importance = importance_values[, "IncNodePurity"]
)
importance_df <- importance_df[order(-importance_df$Importance), ]
ggplot(importance_df, aes(x = reorder(Feature, -Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Random Forest Feature Importance",
x = "Feature",
y = "Importance") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
And, still not fully satisfied even with this good result, let's look at another model.
This code fits a decision tree regression model in R and then calculates the R-squared, RMSE, and MAPE evaluation metrics for both the training set and the test set.
set.seed(16)
tree <- rpart(y_train ~ ., data = data.frame(y_train, X_train),
method = "anova", minsplit = 5, minbucket = 8)
print(tree)
## n= 123671
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 123671 767733800000 6651.433
## 2) Customers< 705.5 70181 178606400000 5211.062
## 4) Customers< 472.5 23083 31019230000 3893.687
## 8) Customers< 346.5 6520 5039379000 2801.352 *
## 9) Customers>=346.5 16563 15137770000 4323.683 *
## 5) Customers>=472.5 47098 87893560000 5856.715
## 10) StoreType_d< 0.5 27821 36499280000 5350.863 *
## 11) StoreType_d>=0.5 19277 34000930000 6586.773 *
## 3) Customers>=705.5 53490 252489300000 8541.256
## 6) Customers< 1064.5 39142 131947600000 7885.775
## 12) Promo< 0.5 17821 41735210000 7078.601 *
## 13) Promo>=0.5 21321 68896630000 8560.445
## 26) StoreType_d< 0.5 15319 40876700000 8124.863 *
## 27) StoreType_d>=0.5 6002 17695140000 9672.187 *
## 7) Customers>=1064.5 14348 57844910000 10329.440
## 14) Customers< 1328.5 9044 29989700000 9765.539 *
## 15) Customers>=1328.5 5304 20075670000 11290.960 *
Fit a decision tree regression model:
treereg <- rpart(y_train ~ ., data = data.frame(y_train, X_train), method = "anova", minsplit = 5, minbucket = 8)
train_score_4 <- 1 - sum((y_train - predict(treereg, newdata = data.frame(X_train)))^2) / sum((y_train - mean(y_train))^2)
test_score_4 <- 1 - sum((y_test - predict(treereg, newdata = data.frame(X_test)))^2) / sum((y_test - mean(y_train))^2)
rmse_train <- sqrt(mean((y_train - predict(treereg, newdata = data.frame(X_train)))^2))
rmse_test <- sqrt(mean((y_test - predict(treereg, newdata = data.frame(X_test)))^2))
mape_train <- mean(ifelse(y_train != 0, abs((y_train - y_predicted_train) / y_train), 0)) * 100
mape_test <- mean(ifelse(y_test != 0, abs((y_test - y_predicted_test) / y_test), 0)) * 100
cat("Regresion Model Score:", train_score_4, "Test Score:", test_score_4, "\n")## Regresion Model Score: 0.6860243 Test Score: 0.680879
## Training RMSE: 1396.109 Testing RMSE: 1391.535
## Training MAPE: 7.072235 Testing MAPE: 9.535869
With an R-squared of only 0.68, this is not a good model, so we will not use it.
model_comparison_df <- read_csv("C:/Users/jzw1jlf/Documents/Personal/CUNY SPS/DATA 698/Model_comparison.csv")
## # A tibble: 6 × 5
## Model `R-squared` RMSE Train_Score Test_Score
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Simple Linear Regression 0.803 1619. 0.803 0.803
## 2 Tune PLS 0.803 1107. 0 0
## 3 LARS Lasso Regression 0.797 1110. 0 0
## 4 Single model Tree 0.634 1508. 0 0
## 5 Random Forest With Hyper Parameter T… 0.957 783. 0 0.957
## 6 Regression Model Tree 0.686 1392. 0 0.681
This research has shown that the application of advanced data science techniques to sales prediction in companies constitutes a valuable and effective approach to improving strategic decision making. Through the development and implementation of predictive models based on historical data sets and important external factors, it has been possible to obtain significant results that overcome the limitations of traditional methods.
The robustness of the proposed models is reflected in their ability to accurately anticipate sales trends, providing powerful tools to optimize strategic planning and respond proactively to changes in the business environment. The use of data visualization tools has made the results easier to interpret and the predictions easier to understand.
During this project, advanced data science techniques and regression models have been applied to address the company's current problem by providing an accurate sales prediction. The underlying problem was that the company had not identified the variables that may be increasing or decreasing its sales. The prediction helps the company make decisions in advance and avoid lost sales in the warehouses.
This work has made it possible not only to predict store sales but also to identify the different aspects that positively or negatively influence sales. It was determined that weekday promotions are justifiable to increase sales.
Store type B generates the highest average sales, even though it accounts for the smallest share of the company's stores. Therefore, in the short or medium term, the company could transfer the type B store model to other store types to increase sales.
The best model was Random Forest with hyperparameter tuning. It achieved the best accuracy of all the models when compared on evaluation metrics such as R² and RMSE. Being an ensemble model that combines multiple decision trees, it can capture non-linear and complex relationships in the data more effectively.
The data set had outliers, and the Random Forest model, being robust to outliers, was able to handle those values better compared to other models.
Throughout this process, we have gained valuable insights and faced significant challenges.
We implemented regression models such as linear regression, decision trees, LARS Lasso regression, and Random Forest, among others. Not all of them achieved good or acceptable metrics for generating a correct and reliable prediction; in the end, the acceptable models were compared in order to choose the best regression model for the sales prediction.
Data science techniques were used to analyze and process the data, such as feature engineering to transform categorical variables into dummy variables so that the model could interpret them correctly and capture non-linear relationships.
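As an illustration of the dummy-variable encoding described above, a minimal sketch on a hypothetical toy data frame (the column names only mimic the real Assortment and StoreType variables):
library(caret)
toy <- data.frame(Assortment = c("a", "b", "c"), StoreType = c("a", "d", "b"))  # hypothetical categories
dmy <- dummyVars(~ Assortment + StoreType, data = toy)  # learn the encoding
predict(dmy, newdata = toy)                             # one 0/1 column per category level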
Handling outliers was essential to improve the robustness and accuracy of our models. We used statistical and graphical techniques to identify outliers in the data set and then removed or transformed those observations, preventing them from negatively affecting the predictive capacity of the model.
These feature engineering and outlier management techniques positively influenced the quality of our predictions by providing the model with a more informative and robust representation of the data.
Model performance evaluation methods were implemented to ensure generalization. The Mean Squared Error (MSE) was calculated as the mean of the squared differences between the predictions and the actual values, and the Root Mean Squared Error (RMSE), the square root of the MSE, provides a metric on the same scale as the variable of interest.
The first big challenge was obtaining the data, because the company did not have complete and unified data. Although the company uses external software with an SQL database, we were able to extract the data to start the analysis. Not all the information in the data files was recorded correctly: required fields were left unfilled and the data were not parameterized, so a great deal of data cleaning had to be performed.
We then faced the challenge of handling missing data and outliers, one of the common challenges in building and evaluating regression models. The strategy was to assess whether the amount of missing data was small or large; since it was small, we chose to remove the observations or variables with missing data.
Outliers can bias the interpretation of the model and hurt its performance, so these observations were removed, preventing them from degrading the predictive capacity of the model.
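A generic sketch of the kind of filtering described, on a hypothetical data frame with a Sales column (names and thresholds are illustrative, not the project's exact code):
df <- data.frame(Sales = c(4200, 5100, 4800, 60000, NA, 5300))  # hypothetical sales values
df <- na.omit(df)                                               # drop rows with missing values
q <- quantile(df$Sales, probs = c(0.25, 0.75))                  # first and third quartiles
iqr <- q[2] - q[1]
df <- df[df$Sales >= q[1] - 1.5 * iqr & df$Sales <= q[2] + 1.5 * iqr, ]  # remove 1.5*IQR outliers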
Feature Selection: Various feature selection and transformation techniques were experimented with to improve the predictive capacity of the model.
It was considered necessary to collect additional data to enrich the data set and allow the model to learn more complex patterns.
This work provides the company with advanced tools and predictions that will allow it to improve decision making, resource optimization, and above all, avoid losing sales due to an incorrect estimate of future sales.
In the future, we could build an HTML-based software tool for the company that automatically generates the sales predictions in an easy and agile way to support purchasing decisions.
In practical terms, this research highlights the importance of adopting data science approaches in business management, especially around sales prediction. The recommendations derived from this study provide guidance for the successful implementation of predictive solutions in enterprise environments, underscoring the need for careful integration with existing processes and the importance of continuous updating of models to adapt to changes in the market.
In summary, this work contributes significantly to the advancement of data science applied to business management. It provides a solid foundation for future research and highlights the relevance of sales prediction as a key strategic tool for business success.
This project has demonstrated the positive impact of the application of regression models and data science techniques in solving the problem of correctly estimating the company’s future sales. As we advance, we are excited to explore new opportunities and continue to improve our capabilities in this exciting field of study.
Finally, the .csv file is generated so that the company can start using it to anticipate sales trends. This file will help the company optimize strategic planning and improve data-driven business decision making.
Sheather, S. (2009). A Modern Approach to Regression with R. Springer.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
Knight, M. (August 2017). What Is Predictive Analytics? DATAVERSITY. Retrieved from: https://www.dataversity.net/what-is-predictive-analytics/#:~:text=Predictive%20analytics%20defines%20outcomes%20through,business%20goals%20and%20ameliorating%20risks.
Lander, J. (2017). R for Everyone: Advanced Analytics and Graphics (2nd ed.). Addison-Wesley.
Provost, F., & Fawcett, T. (2013). Data Science for Business. O'Reilly.
Sivarajah, U., Mustafa, M., & Irani, Z. (January 2017). Critical analysis of Big Data challenges and analytical methods. Journal of Business Research. ScienceDirect. Retrieved from: https://www.sciencedirect.com/science/article/pii/S014829631630488X
Karaman, B. (June 2019). Predicting Sales: Forecasting the monthly sales with LSTM. Medium. Retrieved from: https://towardsdatascience.com/predicting-sales-611cb5a252de
Tiefenthaler, D. (September 2023). Forecasting Sales & Demand: The Path to an Accurate Prediction. Medium. Retrieved from: https://medium.com/@david.tiefenthaler/forecasting-sales-demand-forecast-the-path-to-an-accurate-prediction-a317748192f3
Riveroll, F. (July 2020). The Demand Sales Forecast Technique Every Data Scientist Should Be Using to Reduce Error. Medium. Retrieved from: https://towardsdatascience.com/the-demand-sales-forecast-technique-every-data-scientist-should-be-using-to-reduce-error-1c6f25add9cb
Zhang, B., & Tseng, M. (February 2023). A comparative online sales forecasting analysis: Data mining techniques. ScienceDirect. Retrieved from: https://www.sciencedirect.com/science/article/abs/pii/S0360835222009238
Code in RStudio: