Regression - Coffee Shop Daily Revenue Prediction

About dataset

This dataset contains 2,000 rows of data from coffee shops, offering detailed insights into factors that influence daily revenue. It includes key operational and environmental variables that provide a comprehensive view of how business activities and external conditions affect sales performance. Designed for use in predictive analytics and business optimization, this dataset is a valuable resource for anyone looking to understand the relationship between customer behavior, operational decisions, and revenue generation in the food and beverage industry. A link to the main dataset can be found here

Data Description

Column (data type): Description

Number_of_Customers_Per_Day (int): The total number of customers visiting the coffee shop on a given day.
Average_Order_Value (float): The average dollar amount spent by each customer during their visit.
Operating_Hours_Per_Day (int): The total number of hours the coffee shop is open for business each day.
Number_of_Employees (int): The number of employees working on a given day; this can influence service speed, customer satisfaction, and ultimately sales.
Marketing_Spend_Per_Day (float): The amount of money spent on marketing campaigns or promotions on a given day.
Location_Foot_Traffic (int): The number of people passing by the coffee shop per hour, indicative of the shop's location and its potential to attract customers.
Daily_Revenue (float): The dependent variable, representing the total revenue generated by the coffee shop each day.

Step 1 - Install packages and libraries

## install packages and libraries

#install.packages("tidyverse")
#install.packages("ggplot2")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # already attached via tidyverse; kept for clarity

Step 2 - Read the data

# read the data
data <- read.csv("coffee_shop_revenue.csv")
head(data)
  Number_of_Customers_Per_Day Average_Order_Value Operating_Hours_Per_Day
1                         152                6.74                      14
2                         485                4.50                      12
3                         398                9.09                       6
4                         320                8.48                      17
5                         156                7.44                      17
6                         121                8.88                       6
  Number_of_Employees Marketing_Spend_Per_Day Location_Foot_Traffic
1                   4                  106.62                    97
2                   8                   57.83                   744
3                   6                   91.76                   636
4                   4                  462.63                   770
5                   2                  412.52                   232
6                   9                  183.49                   484
  Daily_Revenue
1       1547.81
2       2084.68
3       3118.39
4       2912.20
5       1663.42
6       1155.18
dim(data)
[1] 2000    7
sum(duplicated(data)) # no duplicates
[1] 0
sum(is.na(data)) # no nulls
[1] 0

This dataset is already clean, so for the sake of this project we will introduce some null values and then clean them up.

# let's see how we can introduce some nulls 

set.seed(123)  # For reproducibility

# Columns where we want to introduce NAs
columns_to_na <- c("Average_Order_Value", "Operating_Hours_Per_Day", "Number_of_Employees")

# Total number of NA positions to draw: 5% of rows for each of the 3 columns (300 draws)
num_na <- round(0.05 * nrow(data) * length(columns_to_na))  

# Randomly select rows (with replacement, so some of the 300 draws repeat)
row_indices <- sample(nrow(data), num_na, replace = TRUE)

# Introduce NAs only in the selected columns
for (col in columns_to_na) {
  data[row_indices, col] <- NA
}

colSums(is.na(data))
Number_of_Customers_Per_Day         Average_Order_Value 
                          0                         279 
    Operating_Hours_Per_Day         Number_of_Employees 
                        279                         279 
    Marketing_Spend_Per_Day       Location_Foot_Traffic 
                          0                           0 
              Daily_Revenue 
                          0 
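
Note that each column ends up with 279 NAs rather than 300: sample() was called with replace = TRUE, so some of the 300 drawn row indices repeat. A quick check:

length(unique(row_indices))  # 279 unique rows out of 300 draws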

Step 3 - Data Cleaning

Common methods of handling missing data include:

1. Dropping rows with nulls - if the missing values are few and the data loss is acceptable

2. Dropping columns with too many nulls - if a column is more than 50% null, we can delete that column

3. Filling missing values with a fixed value

4. Filling missing numeric values with the mean, median, or mode, according to the distribution of the data

5. Imputing missing values using predictive models (a sketch follows this list)
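
Method 5 is not demonstrated in this project, but a minimal sketch using the mice package (one option among several for model-based imputation) could look like this:

# predictive imputation sketch with the mice package (not run here)
# install.packages("mice")
library(mice)
imp <- mice(data, m = 5, method = "pmm", seed = 123)  # predictive mean matching
data_imputed <- complete(imp, 1)                      # extract one completed dataset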

# for this data, let's impute the three columns with nulls using the mean, median, and mode

# before choosing a method, we need to look at how each column is distributed.
# Use the mean for normally distributed data.

# Histogram to check the distribution of the column
hist(data$Average_Order_Value, main = "Histogram for Average Order Value", 
     xlab = "Average Order Value", col = "lightgreen", breaks = 10)

### the data looks fairly normally distributed, so let's stick with the mean
data$Average_Order_Value[is.na(data$Average_Order_Value)] <- mean(data$Average_Order_Value, na.rm = TRUE)
# Use the median if the data is skewed (left or right).

# Histogram to check the distribution of the column
hist(data$Operating_Hours_Per_Day, main = "Histogram for Operating Hours Per Day", 
     xlab = "Operating Hours Per Day", col = "steelblue", breaks = 10)

### the data is right-skewed, so we will use the median here.
data$Operating_Hours_Per_Day[is.na(data$Operating_Hours_Per_Day)] <- median(data$Operating_Hours_Per_Day, na.rm = TRUE)
# Use the mode for categorical or discrete data.

# Bar plot to check the frequency of values (if categorical data)
barplot(table(data$Number_of_Employees), main = "Bar Plot for Number of Employees", 
        xlab = "Number of Employees", col = "lightcoral")

### the mode is 8, so we will fill missing values with that.

# Calculate the mode (most frequent value) for Number_of_Employees
mode_value <- as.numeric(names(sort(table(data$Number_of_Employees), decreasing = TRUE)[1]))

# Replace NAs in the column with the mode
data$Number_of_Employees[is.na(data$Number_of_Employees)] <- mode_value
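
As a final check on this step, confirm that no missing values remain after the three imputations:

# verify that all NAs have been filled
colSums(is.na(data))   # expect all zeros
stopifnot(!anyNA(data))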

Step 4 - Data Manipulation - Data wrangling

glimpse(data) # the datatypes look good
Rows: 2,000
Columns: 7
$ Number_of_Customers_Per_Day <int> 152, 485, 398, 320, 156, 121, 238, 70, 152…
$ Average_Order_Value         <dbl> 6.740000, 4.500000, 9.090000, 8.480000, 7.…
$ Operating_Hours_Per_Day     <int> 14, 12, 6, 17, 17, 6, 11, 10, 14, 7, 13, 1…
$ Number_of_Employees         <dbl> 4, 8, 6, 4, 2, 9, 4, 3, 2, 5, 14, 9, 8, 4,…
$ Marketing_Spend_Per_Day     <dbl> 106.62, 57.83, 91.76, 462.63, 412.52, 183.…
$ Location_Foot_Traffic       <int> 97, 744, 636, 770, 232, 484, 156, 237, 825…
$ Daily_Revenue               <dbl> 1547.81, 2084.68, 3118.39, 2912.20, 1663.4…
colnames(data)
[1] "Number_of_Customers_Per_Day" "Average_Order_Value"        
[3] "Operating_Hours_Per_Day"     "Number_of_Employees"        
[5] "Marketing_Spend_Per_Day"     "Location_Foot_Traffic"      
[7] "Daily_Revenue"              
# let's rename some variables
data<- data %>% 
  rename("Daily_Revenue_Generated" = "Daily_Revenue", 
         "Number_of_People_Per_Hour" = "Location_Foot_Traffic")
colnames(data)
[1] "Number_of_Customers_Per_Day" "Average_Order_Value"        
[3] "Operating_Hours_Per_Day"     "Number_of_Employees"        
[5] "Marketing_Spend_Per_Day"     "Number_of_People_Per_Hour"  
[7] "Daily_Revenue_Generated"    

Step 5 - Summarise data

# check the stats of the data 
summary(data)
 Number_of_Customers_Per_Day Average_Order_Value Operating_Hours_Per_Day
 Min.   : 50.0               Min.   : 2.500      Min.   : 6.00          
 1st Qu.:164.0               1st Qu.: 4.710      1st Qu.: 9.00          
 Median :275.0               Median : 6.276      Median :12.00          
 Mean   :274.3               Mean   : 6.276      Mean   :11.76          
 3rd Qu.:386.0               3rd Qu.: 7.850      3rd Qu.:14.00          
 Max.   :499.0               Max.   :10.000      Max.   :17.00          
 Number_of_Employees Marketing_Spend_Per_Day Number_of_People_Per_Hour
 Min.   : 2.000      Min.   : 10.12          Min.   : 50.0            
 1st Qu.: 5.000      1st Qu.:130.12          1st Qu.:302.0            
 Median : 8.000      Median :251.00          Median :540.0            
 Mean   : 7.955      Mean   :252.61          Mean   :534.9            
 3rd Qu.:11.000      3rd Qu.:375.35          3rd Qu.:767.0            
 Max.   :14.000      Max.   :499.74          Max.   :999.0            
 Daily_Revenue_Generated
 Min.   : -58.95        
 1st Qu.:1140.09        
 Median :1770.78        
 Mean   :1917.33        
 3rd Qu.:2530.45        
 Max.   :5114.60        

Insights - The summary gives us statistical information for all the numerical variables. The spread looks reasonable for most variables, but note that the minimum of Daily_Revenue_Generated is negative (-58.95), which deserves a closer look before calling the data outlier-free.
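
Given the negative minimum revenue, it is worth checking how many days are affected before moving on (a quick sketch):

# count days with negative recorded revenue
sum(data$Daily_Revenue_Generated < 0)

# visual outlier check
boxplot(data$Daily_Revenue_Generated,
        main = "Boxplot of Daily Revenue", ylab = "Daily Revenue")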

Step 6 - Data Visualization

# let's start with our prediction variable

colnames(data)
[1] "Number_of_Customers_Per_Day" "Average_Order_Value"        
[3] "Operating_Hours_Per_Day"     "Number_of_Employees"        
[5] "Marketing_Spend_Per_Day"     "Number_of_People_Per_Hour"  
[7] "Daily_Revenue_Generated"    
data %>% 
  ggplot(aes(Daily_Revenue_Generated)) + 
  geom_histogram(colour = "black", fill = "orange")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

data %>% 
  ggplot(aes(Marketing_Spend_Per_Day)) + 
  geom_histogram(colour = "black", fill = "orange")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
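
The stat_bin() message appears because the default of 30 bins was used; setting a bin width explicitly silences it (a sketch, where the binwidth of 25 is an arbitrary choice):

data %>% 
  ggplot(aes(Marketing_Spend_Per_Day)) + 
  geom_histogram(binwidth = 25, colour = "black", fill = "orange")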

# we need to see the relationship between the dependent variable and each independent variable.
# This is important for building our regression model.

# we are going to create a pairplot for this


pairs(data[c("Number_of_Customers_Per_Day", "Average_Order_Value" ,      
             "Operating_Hours_Per_Day", "Number_of_Employees",      
             "Marketing_Spend_Per_Day", "Number_of_People_Per_Hour",   
             "Daily_Revenue_Generated")],
      col = "steelblue",
      pch = 16,
      labels = c("Number of Customers Per Day", "Average Order Value" ,      
                 "Operating Hours Per Day", "Number of Employees",      
                 "Marketing Spend Per Day", "Number of People Per Hour",   
                 "Daily Revenue Generated"),
      main = "Pairplots for the data")

data %>% 
  ggplot(aes(Average_Order_Value, Daily_Revenue_Generated))+
  geom_point(size = 3, colour = "steelblue")+
  geom_smooth(method = lm, se = F)
`geom_smooth()` using formula = 'y ~ x'

Insight - Average order value and number of customers per day seem correlated with the daily revenue generated.
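
To put numbers behind the pairplot impression, a correlation matrix is a quick check (a sketch; all columns are numeric at this point):

# pairwise Pearson correlations, rounded for readability
round(cor(data), 2)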

Step 7 - Simple Linear Regression

# we will start by creating a training and a testing dataset

set.seed(123)
row_number <- sample(1:nrow(data), 0.8*nrow(data))
train <- data[row_number,]
dim(train)
[1] 1600    7
test <- data[-row_number,]
dim(test)
[1] 400   7
# train the model
model_train <- lm(Daily_Revenue_Generated ~ Number_of_Customers_Per_Day, data = train)
summary(model_train)

Call:
lm(formula = Daily_Revenue_Generated ~ Number_of_Customers_Per_Day, 
    data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-2096.29  -447.92   -20.05   447.19  2210.69 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 384.9863    38.6506   9.961   <2e-16 ***
Number_of_Customers_Per_Day   5.5605     0.1275  43.600   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 664 on 1598 degrees of freedom
Multiple R-squared:  0.5433,    Adjusted R-squared:  0.543 
F-statistic:  1901 on 1 and 1598 DF,  p-value: < 2.2e-16
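
Reading the coefficients directly, the fitted line is: predicted revenue ≈ 384.99 + 5.56 × customers per day. As an illustration for a hypothetical 300-customer day:

# point prediction for a hypothetical 300-customer day
predict(model_train, newdata = data.frame(Number_of_Customers_Per_Day = 300))
# roughly 384.9863 + 5.5605 * 300 = 2053.1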
# test the data
pred <- predict(model_train, newdata = test)

error <- pred - test$Daily_Revenue_Generated # predicted - the actual value
rmse <- sqrt(mean(error^2))

c(RMSE = rmse, R2 = summary(model_train)$r.squared)
       RMSE          R2 
646.8582778   0.5432951 

Insights - The error is high and the R² is only 54.3%, so let's see what happens when we include all the variables.

Step 8 - Multiple Linear Regression

# create the model

model_train2 <- lm(Daily_Revenue_Generated ~ Number_of_Customers_Per_Day +
                     Average_Order_Value + Operating_Hours_Per_Day +
                     Number_of_Employees + Marketing_Spend_Per_Day +
                     Number_of_People_Per_Hour, data = train)

summary(model_train2)

Call:
lm(formula = Daily_Revenue_Generated ~ Number_of_Customers_Per_Day + 
    Average_Order_Value + Operating_Hours_Per_Day + Number_of_Employees + 
    Marketing_Spend_Per_Day + Number_of_People_Per_Hour, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1776.66  -220.21     3.35   224.94  1658.58 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -1.472e+03  6.422e+01 -22.923   <2e-16 ***
Number_of_Customers_Per_Day  5.526e+00  7.712e-02  71.659   <2e-16 ***
Average_Order_Value          2.391e+02  5.067e+00  47.185   <2e-16 ***
Operating_Hours_Per_Day     -1.868e+00  3.163e+00  -0.590    0.555    
Number_of_Employees         -4.650e+00  2.939e+00  -1.582    0.114    
Marketing_Spend_Per_Day      1.636e+00  7.101e-02  23.045   <2e-16 ***
Number_of_People_Per_Hour    3.526e-02  3.683e-02   0.957    0.338    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 401.3 on 1593 degrees of freedom
Multiple R-squared:  0.8337,    Adjusted R-squared:  0.8331 
F-statistic:  1331 on 6 and 1593 DF,  p-value: < 2.2e-16

Number_of_Customers_Per_Day, Marketing_Spend_Per_Day, and Average_Order_Value show high significance compared to the other variables.
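
Before leaning on these p-values, it is worth a quick look at the standard residual diagnostics (a sketch using base R's plot method for lm objects):

# 2x2 grid of residual diagnostics for the multiple regression model
par(mfrow = c(2, 2))
plot(model_train2)
par(mfrow = c(1, 1))  # reset the plotting layout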

# test the data
pred2 <- predict(model_train2, newdata = test)

error2 <- pred2 - test$Daily_Revenue_Generated # predicted - the actual value
rmse2 <- sqrt(mean(error2^2))

c(RMSE = rmse2, R2 = summary(model_train2)$r.squared)
       RMSE          R2 
324.7314918   0.8337169 

Insights - The error has been reduced significantly and the R² has increased to 83.3%.

This indicates that Number_of_Customers_Per_Day, Average_Order_Value, and Marketing_Spend_Per_Day play an important role in generating the daily revenue, while Operating_Hours_Per_Day, Number_of_Employees, and Number_of_People_Per_Hour add little (their p-values are well above 0.05).
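
A natural follow-up (a sketch, not run here) is to refit with only the significant predictors and compare the adjusted R² against the full model:

# reduced model with only the significant predictors
model_train3 <- lm(Daily_Revenue_Generated ~ Number_of_Customers_Per_Day +
                     Average_Order_Value + Marketing_Spend_Per_Day,
                   data = train)
summary(model_train3)$adj.r.squared  # compare with model_train2's 0.8331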

To access the code file and dataset, check the GitHub repository.