Import Packages

The requisite libraries are loaded for later processing.

library(tidyverse)
library(GGally)
library(stats)
library(MASS)
library(dplyr)
library(caret)
library(fastDummies)
library(glmnet)
cat("All the required packages are successfully imported.")
## All the required packages are successfully imported.

1. Preliminary Analysis

1.1 Load & Display the Dataset

1.1.1 Load Dataset

The restaurants data is loaded for analysis.

# Load the data into R as a tibble object
restaurant_data <- read_csv("data31970001.csv")
cat("The Restaurant dataset is successfully loaded")
## The Restaurant dataset is successfully loaded

1.1.2 Display Dataset

A chunk of the loaded Restaurant data is displayed.

# Display the first 10 rows of the dataset
head(restaurant_data, 10)
## # A tibble: 10 × 12
##       id open_date type     P1    P2    P3    P4    P5    P6    P7    P8 revenue
##    <dbl> <chr>     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1 97175 06/01/20… FC        3     6     0     2     2     3     5     3  1.21e7
##  2 68467 07/14/20… IL        0    12     0    25     0     0     5     6  2.22e7
##  3  8522 09/14/20… FC        0     3     0    20     0     0    25     0  1.33e7
##  4 97066 07/28/20… DT        0     3     0     1     0     0     1     0  1.30e7
##  5 66928 03/27/20… FC        0     1     5     2     0     0     5     3  1.25e7
##  6  8593 08/14/20… IL        5     1     0     5     0     4     2     0  1.29e7
##  7 55361 02/11/19… FC        0     2     0     1     0    18     4     0  9.52e6
##  8 36162 08/30/20… FC        1     1     0     3     4     4     1     0  1.31e7
##  9 51016 05/01/19… IL        0     4     5     5     0    18    25     3  6.00e6
## 10 41370 08/19/20… FC        0    12    25    20     1     0     1     0  2.03e7

1.2 Calculate Age of Restaurants & Print Histogram

1.2.1 Calculate Age of Restaurants (in years) at 1st January, 2015

Here, the age of each restaurant is calculated using the their opening date on 1st January 2015. In order to perform this calculation the datatype of theopen_date attribute is changed into Date format and subtract the opening date from the selected date.

# Convert the "open_date" variable to a Date format
restaurant_data$open_date <- as.Date(restaurant_data$open_date, "%m/%d/%Y")
# Calculate the age of each restaurant in years on 1st January 2015
restaurant_data$age <- as.numeric(difftime(as.Date("2015-01-01"),
                                           restaurant_data$open_date, units = "days"))/365
print(restaurant_data)
## # A tibble: 2,000 × 13
##       id open_date  type     P1    P2    P3    P4    P5    P6    P7    P8
##    <dbl> <date>     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 97175 2012-06-01 FC        3     6     0     2     2     3     5     3
##  2 68467 2000-07-14 IL        0    12     0    25     0     0     5     6
##  3  8522 2009-09-14 FC        0     3     0    20     0     0    25     0
##  4 97066 2000-07-28 DT        0     3     0     1     0     0     1     0
##  5 66928 2013-03-27 FC        0     1     5     2     0     0     5     3
##  6  8593 2009-08-14 IL        5     1     0     5     0     4     2     0
##  7 55361 1998-02-11 FC        0     2     0     1     0    18     4     0
##  8 36162 2013-08-30 FC        1     1     0     3     4     4     1     0
##  9 51016 1997-05-01 IL        0     4     5     5     0    18    25     3
## 10 41370 2006-08-19 FC        0    12    25    20     1     0     1     0
## # ℹ 1,990 more rows
## # ℹ 2 more variables: revenue <dbl>, age <dbl>

1.3 Correlation Analysis & Vusualizations

1.3.1 Pair-wise scatter plot of each numerical variable against the response

The pair-wise scatter plot graphically represents the correlations between all numerical variables to understand their linear correlations. Although considering the number of variables the pair chat is not enough to select the best features for the ML modeling for predicting the revenue of the restaurants based on their correlation coefficients.

# Visualize a pair-wise scatter plot of each numerical variable against the response ("revenue")
ggpairs(restaurant_data, columns = c("age", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "revenue"),
  title = "Pair-wise Scatter Plots of Numerical Variables and Revenue",
  lower = list(continuous = "points"), diag = list(continuous = "density"),
  upper = list(continuous = "blank", combo = "box"),
  mapping = aes(color = type))

1.3.2 Scatter plot of each numerical variable against the response (“revenue”)

1.3.2.1 Scatter plot between age and revenue

The scatter plot represents the linear relationship between the age of the restaurant and total revenue considering different types of restaurants and it is evident that there is a strong relationship between these two columns.

# Create a scatter plot between age and revenue based on restaurant type
ggplot(restaurant_data, aes(x = age, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between Age and Revenue by Restaurant Type",
       x = "Age (Years)", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.2 Scatter plot between P1 and revenue

The scatter plot represents the linear relationship between the P1 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-6 years.

# Create a scatter plot between P1 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P1, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P1 and Revenue by Restaurant Type",
       x = "P1", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.3 Scatter plot between P2 and revenue

The scatter plot represents the linear relationship between the P2 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-12 years.

# Create a scatter plot between P2 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P2, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P2 and Revenue by Restaurant Type",
       x = "P2", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.4 Scatter plot between P3 and revenue

The scatter plot represents the linear relationship between the P3 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.

# Create a scatter plot between P3 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P3, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P3 and Revenue by Restaurant Type",
       x = "P3", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.5 Scatter plot between P4 and revenue

The scatter plot represents the linear relationship between the P4 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.

# Create a scatter plot between P4 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P4, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P4 and Revenue by Restaurant Type",
       x = "P4", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.6 Scatter plot between P5 and revenue

The scatter plot represents the linear relationship between the P5 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-15 years.

# Create a scatter plot between P5 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P5, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P5 and Revenue by Restaurant Type",
       x = "P5", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.7 Scatter plot between P6 and revenue

The scatter plot represents the linear relationship between the P6 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-10 years.

# Create a scatter plot between P6 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P6, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P6 and Revenue by Restaurant Type",
       x = "P6", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.3.2.8 Scatter plot between P7 and revenue

The scatter plot represents the linear relationship between the P7 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.

# Create a scatter plot between P7 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P7, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P7 and Revenue by Restaurant Type",
       x = "P7", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

Comparing the resultant visualizations from the section 1.3.1 and 1.3.2 it is evident that the individual scatter plots presented in section 1.3.2 are more insightful to understand the linear relationships between each numerical variables and the revenue from the dataset. Additionally the individual plots presented the relationships for each type of restaurants which further helps to understand the importance of each individual feature with respect to the revenue column.

1.3.2.9 Scatter plot between P8 and revenue

The scatter plot represents the linear relationship between the P8 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-10 years.

# Create a scatter plot between P8 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P8, y = revenue, color = type)) +
  geom_point() +
  labs(title = "Relationship between P8 and Revenue by Restaurant Type",
       x = "P8", y = "Revenue") +
  facet_wrap(~type) +
  scale_color_manual(values = c("DT" = "blue",
                                "FC" = "red",
                                "IL" = "green",
                                "MB" = "orange"),
                     labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
                     breaks = c("DT", "FC", "IL", "MB"))

1.4 Numerical Summary

# Produce a numerical summary of all the variables in the data set
summary(restaurant_data)
##        id          open_date              type                 P1        
##  Min.   :   42   Min.   :1995-05-08   Length:2000        Min.   : 0.000  
##  1st Qu.:25390   1st Qu.:2007-02-16   Class :character   1st Qu.: 0.000  
##  Median :49566   Median :2009-12-16   Mode  :character   Median : 0.000  
##  Mean   :50130   Mean   :2008-08-10                      Mean   : 1.113  
##  3rd Qu.:75325   3rd Qu.:2012-02-16                      3rd Qu.: 2.000  
##  Max.   :99974   Max.   :2014-01-04                      Max.   :12.500  
##        P2             P3               P4              P5        
##  Min.   : 1.0   Min.   : 0.000   Min.   : 1.00   Min.   : 0.000  
##  1st Qu.: 2.0   1st Qu.: 0.000   1st Qu.: 2.00   1st Qu.: 0.000  
##  Median : 4.0   Median : 0.000   Median : 3.00   Median : 0.000  
##  Mean   : 4.3   Mean   : 2.125   Mean   : 5.17   Mean   : 2.022  
##  3rd Qu.: 4.0   3rd Qu.: 3.000   3rd Qu.: 5.00   3rd Qu.: 3.000  
##  Max.   :15.0   Max.   :25.000   Max.   :25.00   Max.   :25.000  
##        P6               P7               P8            revenue        
##  Min.   : 0.000   Min.   : 1.000   Min.   : 0.000   Min.   : 4347275  
##  1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.: 0.000   1st Qu.:12567480  
##  Median : 0.000   Median : 2.000   Median : 0.000   Median :13480819  
##  Mean   : 2.094   Mean   : 3.542   Mean   : 1.185   Mean   :13669680  
##  3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.:14359973  
##  Max.   :30.000   Max.   :25.000   Max.   :15.000   Max.   :23914765  
##       age         
##  Min.   : 0.9918  
##  1st Qu.: 2.8767  
##  Median : 5.0466  
##  Mean   : 6.3956  
##  3rd Qu.: 7.8795  
##  Max.   :19.6658

1.5 Insights & Summary

During the preliminary analysis the following insights are found -

  1. Most of the restaurants are opened 1-8 years before on 1st January, 2015.
  2. The food courts and inline restaurants earned more revenue compared to the other restaurant types.

The id and open_date columns should not be used during predictive modeling. The id column contains the unique identification number for all 2000 restaurants which would be a random variable for the ML-based model. However the open_date variable represents the opening date of each restaurants which has an importance for further analysis but ML-models does not take the date type variables as a feature during predictive analysis. Moreover the age of each restaurant (age) is already calculated from the open_date variable which is a numeric variable so that open_date variable is not significant for Ml-modeling.

2. Regression

2.1 Remove Irrelevant Features and Prepare Train-Test Data

2.1.1 Remove id and open_date columns from the data

# Remove the variables id and open_date from the data
restaurant_data2 <- restaurant_data %>%
  dplyr::select(-id, -open_date)
print (restaurant_data2)
## # A tibble: 2,000 × 11
##    type     P1    P2    P3    P4    P5    P6    P7    P8  revenue   age
##    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 FC        3     6     0     2     2     3     5     3 12106102  2.59
##  2 IL        0    12     0    25     0     0     5     6 22228036 14.5 
##  3 FC        0     3     0    20     0     0    25     0 13293597  5.30
##  4 DT        0     3     0     1     0     0     1     0 12953367 14.4 
##  5 FC        0     1     5     2     0     0     5     3 12502340  1.77
##  6 IL        5     1     0     5     0     4     2     0 12861584  5.39
##  7 FC        0     2     0     1     0    18     4     0  9524520 16.9 
##  8 FC        1     1     0     3     4     4     1     0 13143449  1.34
##  9 IL        0     4     5     5     0    18    25     3  6004547 17.7 
## 10 FC        0    12    25    20     1     0     1     0 20325925  8.38
## # ℹ 1,990 more rows

2.1.2 Split the data into train & test datasets

The cleaned restaurant data is divided into 70:30 ratio where 70% data will be used for training purpose and 30% data will be used for testing purpose.

# Randomly split the data into training and testing data
training_data <- restaurant_data2 %>% 
  sample_frac(0.7)
testing_data <- restaurant_data2 %>% 
  anti_join(training_data, by = c("type", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "revenue", 'age'))
cat("Training dataset contains", dim(training_data)[1], "observations.")
## Training dataset contains 1400 observations.
cat("Testing dataset contains", dim(testing_data)[1], "observations.")
## Testing dataset contains 600 observations.

2.2 Multiple Linear Regression Model based on Entire Training Data

# Fit a multiple linear regression model using the training sample
full_lin_reg_model <- lm(revenue ~ ., data = training_data)

# Show the summary of the model fit
summary(full_lin_reg_model)
## 
## Call:
## lm(formula = revenue ~ ., data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -661635 -160227  -95582  171969 1575234 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12747699.8    61557.7  207.085   <2e-16 ***
## typeFC         -9694.8    57801.5   -0.168    0.867    
## typeIL        -10823.7    58203.9   -0.186    0.853    
## typeMB         43854.4   199701.4    0.220    0.826    
## P1           -178944.7     4261.0  -41.996   <2e-16 ***
## P2             98739.6     2998.3   32.932   <2e-16 ***
## P3               336.4     2060.6    0.163    0.870    
## P4            344619.1     1511.3  228.030   <2e-16 ***
## P5             -2128.6     2136.7   -0.996    0.319    
## P6           -137563.9     1862.4  -73.866   <2e-16 ***
## P7           -276295.9     1941.8 -142.285   <2e-16 ***
## P8            160831.6     3838.2   41.903   <2e-16 ***
## age              189.3     1802.7    0.105    0.916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326600 on 1387 degrees of freedom
## Multiple R-squared:  0.9836, Adjusted R-squared:  0.9835 
## F-statistic:  6942 on 12 and 1387 DF,  p-value: < 2.2e-16

2.3 Multiple Linear Regression Model based on Training Data except type Variable

# Fit a multiple linear regression model for revenue with all predictors except type
reduced_lin_reg_model <- lm(revenue ~ . - type, data = training_data)

# Show the summary of the model fit
summary(reduced_lin_reg_model)
## 
## Call:
## lm(formula = revenue ~ . - type, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -662772 -159800  -95352  171326 1575133 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12737483.1    23565.0  540.525   <2e-16 ***
## P1           -178915.1     4252.5  -42.073   <2e-16 ***
## P2             98749.1     2987.2   33.058   <2e-16 ***
## P3               314.9     2056.0    0.153    0.878    
## P4            344615.0     1508.7  228.420   <2e-16 ***
## P5             -2153.0     2133.2   -1.009    0.313    
## P6           -137569.6     1859.9  -73.965   <2e-16 ***
## P7           -276217.2     1918.4 -143.986   <2e-16 ***
## P8            160808.6     3832.9   41.955   <2e-16 ***
## age              221.2     1785.0    0.124    0.901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326300 on 1390 degrees of freedom
## Multiple R-squared:  0.9836, Adjusted R-squared:  0.9835 
## F-statistic:  9275 on 9 and 1390 DF,  p-value: < 2.2e-16
# Perform an ANOVA test to compare the two models
anova_test <- anova(reduced_lin_reg_model, full_lin_reg_model)

# Show the results of the ANOVA test
print(anova_test)
## Analysis of Variance Table
## 
## Model 1: revenue ~ (type + P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 + age) - 
##     type
## Model 2: revenue ~ type + P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 + age
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1390 1.4800e+14                            
## 2   1387 1.4799e+14  3 1.2269e+10 0.0383   0.99

2.4 Comparitive Analysis of two ML-Regressors

In order to determine the best regression model for predicting the new records the hypothesis testing is used based on their performance. Initially, the F-test helps to analyze whether the full model (full_lin_reg_model) significantly performs better the reduced model (reduced_lin_reg_model) based on the following hypothesis.

  1. Null Hypothesis: The reduced model is not significantly worse than the full model.
  2. Alternative Hypothesis: The full model is significantly better than the reduced model.

During ANOVA test it is found that the p-value of the F-test is 0.7423, which is greater than the significance level of 0.05. Therefore, the null hypothesis can not be rejected and it can be concluded that the reduced model is not significantly worse than the full model.

Furthermore, the coefficients of the two models are compared to find if there are any major differences in their predictions. The reduced model excludes the type variable, which means it does not take into account the effect of different store types on revenue. However, the coefficients for the remaining variables are similar between the two models.

Based on these results, it can be concluded that the reduced model is likely to perform just as well as the full model at predicting the revenue for new records, and it may be preferred due to its simplicity and lack of dependence on the type variable.

3. Subset Selection

3.1 Backward Elimination using BIC

# Compute the BIC for the full model
full_lin_reg_model_bic <- BIC(full_lin_reg_model)

# Perform backward elimination using BIC
while (TRUE) {
  # Compute the BIC for the reduced model
  reduced_lin_reg_model_bic <- BIC(reduced_lin_reg_model)
  # Compare the BICs of the full and reduced models
  if (reduced_lin_reg_model_bic < full_lin_reg_model_bic) {
    # Keep the reduced model and update the BIC
    full_lin_reg_model <- reduced_lin_reg_model
    full_lin_reg_model_bic <- reduced_lin_reg_model_bic
  } else {
    # Stop and select the full model as the final model
    break
  }
}

# Print the final model
summary(full_lin_reg_model)
## 
## Call:
## lm(formula = revenue ~ . - type, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -662772 -159800  -95352  171326 1575133 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12737483.1    23565.0  540.525   <2e-16 ***
## P1           -178915.1     4252.5  -42.073   <2e-16 ***
## P2             98749.1     2987.2   33.058   <2e-16 ***
## P3               314.9     2056.0    0.153    0.878    
## P4            344615.0     1508.7  228.420   <2e-16 ***
## P5             -2153.0     2133.2   -1.009    0.313    
## P6           -137569.6     1859.9  -73.965   <2e-16 ***
## P7           -276217.2     1918.4 -143.986   <2e-16 ***
## P8            160808.6     3832.9   41.955   <2e-16 ***
## age              221.2     1785.0    0.124    0.901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326300 on 1390 degrees of freedom
## Multiple R-squared:  0.9836, Adjusted R-squared:  0.9835 
## F-statistic:  9275 on 9 and 1390 DF,  p-value: < 2.2e-16

3.2 Step-wise Regression on Multiple LR Model based on Entire Training Data

# Fit a multiple linear regression model using the training sample
full_lin_reg_model <- lm(revenue ~ ., data = training_data)

# Perform step wise regression using AIC
stepwise_model <- stepAIC(full_lin_reg_model, direction = "both", trace = FALSE)

# Show the summary of the model fit
summary(stepwise_model)
## 
## Call:
## lm(formula = revenue ~ P1 + P2 + P4 + P6 + P7 + P8, data = training_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -657395 -158240  -95701  173899 1571790 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12735490      20096  633.73   <2e-16 ***
## P1           -178897       4249  -42.10   <2e-16 ***
## P2             98841       2981   33.16   <2e-16 ***
## P4            344606       1504  229.10   <2e-16 ***
## P6           -137660       1856  -74.17   <2e-16 ***
## P7           -276375       1910 -144.67   <2e-16 ***
## P8            160799       3830   41.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 326100 on 1393 degrees of freedom
## Multiple R-squared:  0.9836, Adjusted R-squared:  0.9835 
## F-statistic: 1.393e+04 on 6 and 1393 DF,  p-value: < 2.2e-16

The second model (from section 3.2) obtained through step-wise regression has fewer predictor variables than the first model (from section 3.1) obtained through backward elimination. In the second model, only P1, P2, P4, P6, P7, and P8 were retained as significant predictors of revenue. On the other hand, the first model had an additional predictor variable, age, which was not significant in predicting revenue.

Both models have a high adjusted R-squared value, indicating that they can explain a large portion of the variability in the response variable. However, the second model has a higher F-statistic value, indicating that it fits the data better than the first model. Additionally, the second model has a lower residual standard error, indicating that it has less error variance and is a better fit to the data.

4. Regularization

4.1 Training Data Transformation & Cross-Validation

In order to transform the data several steps are followed -

  1. During the analysis, it is found that both the datasets contain a categorical column type which can not be used as a feature for predictive analysis. So, the One Hot encoding is performed on the column from both of the datasets to convert the same into encoded numerical columns.
  2. The original type column is discarded to make the dataset numerical for predictive analysis.
  3. The training data is scaled further to normalize the numerical values for each selected feature variable.

Furthermore, k-fold cross validation is performed on training data taking k=5. The cross-validated data will be used for further ML-based predictive analysis.

# Create one-hot encoded variables
training_data <- fastDummies::dummy_cols(training_data, select_columns = "type", remove_first_dummy = TRUE)
testing_data <- fastDummies::dummy_cols(testing_data, select_columns = "type", remove_first_dummy = TRUE)

# remove the original 'type' column
training_data <- training_data[, !colnames(training_data) %in% "type"]
testing_data <- testing_data[, !colnames(testing_data) %in% "type"]
cat("Training data sample -\n")
## Training data sample -
print(training_data)
## # A tibble: 1,400 × 13
##       P1    P2    P3    P4    P5    P6    P7    P8 revenue   age type_FC type_IL
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>   <int>   <int>
##  1     0     4     0     2     0     3     5     0  1.19e7  7.17       0       1
##  2     5     2     0     2     5     4     3     2  1.13e7  1.13       1       0
##  3     4     3     0     4     0     0     3     0  1.27e7  4.42       1       0
##  4     0     2     0     1    10     0     4     0  1.26e7  9.03       0       1
##  5     0    12     4     4     0     0     2     0  1.49e7 15.1        1       0
##  6     5     2     0     5    15     0     2     0  1.31e7 16.0        1       0
##  7     0     9     0     1     3     3     2     0  1.29e7  2.93       1       0
##  8     0     4    25    25     3     0     5     0  2.03e7 15.3        0       0
##  9     0     4     2     2     0     0     3     0  1.29e7  4.33       1       0
## 10     0    12     5     5     0     0     5     9  1.58e7  7.57       1       0
## # ℹ 1,390 more rows
## # ℹ 1 more variable: type_MB <int>
cat("Testing data sample -\n")
## Testing data sample -
print(testing_data)
## # A tibble: 600 × 13
##       P1    P2    P3    P4    P5    P6    P7    P8 revenue   age type_FC type_IL
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>   <int>   <int>
##  1   0       1     5     2     0     0     5     3  1.25e7  1.77       1       0
##  2   1       1     0     3     4     4     1     0  1.31e7  1.34       1       0
##  3   0       4     5     5     0    18    25     3  6.00e6 17.7        0       1
##  4   1       2     5     3     0     0     1     3  1.39e7  9.03       0       1
##  5   5       4     3     5     0     0     3     0  1.41e7  8.80       0       1
##  6   2       4     3     2     0     3     1     0  1.31e7  4.42       0       1
##  7   0       4    15     5     0     0    25     0  8.22e6  7.46       1       0
##  8   0       4     5     3     3     0     1     0  1.37e7  5.78       1       0
##  9   7.5     2     0     2     0    24     1     0  8.62e6  4.04       0       1
## 10   5       2     3     3     5     3     2     0  1.15e7  8.05       1       0
## # ℹ 590 more rows
## # ℹ 1 more variable: type_MB <int>
# Create 5-fold cross-validation dataset
set.seed(123) # for reproducibility
folds <- createFolds(training_data$revenue, k = 5)
cv_data <- lapply(folds, function(x) list(train = training_data[-x,], 
                                          test = training_data[x,]))
cat("The 5-Fold Cross-Validation dataset is created.")
## The 5-Fold Cross-Validation dataset is created.

4.2 Optimal Tuning Parameter Selection using Lasso Regression

The Lasso Regression model is trained to determine the optimal value of the tuning parameter (λ) using k-fold cross-validation (taking k=5) dataset. During the analysis, the search range for λ is defined as [1, 10^10] taking the length as 50.

# Define search range for lambda
lambda_seq <- 10^seq(0, 10^10, length.out = 50)

# Create empty vector to store RMSE values for each lambda
rmse_values <- rep(0, length(lambda_seq))

# Perform k-fold cross-validation for each lambda value
k <- 5
for (i in 1:length(lambda_seq)) {
  rmse <- numeric(k)
  for (fold in 1:k) {
    # Extract training and test sets from cross-validation dataset
    train_x <- as.matrix(cv_data[[fold]]$train[-9])
    train_y <- cv_data[[fold]]$train$revenue
    test_x <- as.matrix(cv_data[[fold]]$test[-9])
    test_y <- cv_data[[fold]]$test$revenue
    # Fit lasso regression model using glmnet package
    lasso_reg_model <- glmnet(x = train_x, y = train_y, lambda = lambda_seq[i])
    # Predict on test set and calculate RMSE
    pred <- predict(lasso_reg_model, newx = test_x)
    rmse[fold] <- sqrt(mean((test_y - pred)^2))
  }
  # Calculate average RMSE across all folds
  rmse_values[i] <- mean(rmse)
}

# Find optimal value of lambda that minimizes RMSE
optimal_lambda <- lambda_seq[which.min(rmse_values)]

# Print results
cat("Optimal lambda:", format(optimal_lambda, scientific = TRUE), "\n")
## Optimal lambda: 1e+00
cat("Root mean square error:", mean(rmse), "\n")
## Root mean square error: 2533206

4.3 Optimal Tuning Parameter Selection using Elastic Net modeling

The Elastic Net model is trained using the k-fold cross-validation (taking k=5) dataset and optimal λ to select the optimal value for α. Since α ∈ [0, 1] so the search range for α is defined as [0, 1] taking the length as 21.

# Define search range for alpha
alpha_seq <- seq(0, 1, length = 21)

# Create empty vector to store RMSE values for each alpha
rmse_values <- rep(0, length(alpha_seq))

# Perform k-fold cross-validation for each alpha
k <- 5
for (i in 1:length(alpha_seq)) {
  rmse <- numeric(k)
  for (fold in 1:k) {
    # Extract training and test sets from cross-validation dataset
    train_x <- as.matrix(cv_data[[fold]]$train[-9])
    train_y <- cv_data[[fold]]$train$revenue
    test_x <- as.matrix(cv_data[[fold]]$test[-9])
    test_y <- cv_data[[fold]]$test$revenue
    # Fit elastic net model using glmnet package
    elastic_net_model <- glmnet(x = train_x, 
                                y = train_y, 
                                alpha = alpha_seq[i], 
                                lambda = optimal_lambda)
    # Predict on test set and calculate RMSE
    pred <- predict(elastic_net_model, newx = test_x)
    rmse[fold] <- sqrt(mean((test_y - pred)^2))
  }
  # Calculate average RMSE across all folds
  rmse_values[i] <- mean(rmse)
}
              
# Find optimal value of alpha that minimizes RMSE
optimal_alpha <- alpha_seq[which.min(rmse_values)]

# Fit elastic net model with optimal parameters
optimized_elastic_net_model <- glmnet(x = train_x, 
                                      y = train_y, 
                                      alpha = optimal_alpha, 
                                      lambda = optimal_lambda)

# Predict on test set and calculate RMSE
pred <- predict(optimized_elastic_net_model, newx = test_x, s = optimal_lambda)
rmse <- sqrt(mean((test_y - pred)^2))

# Print results
cat("Optimal lambda:", format(optimal_lambda, scientific = TRUE), "\n")
## Optimal lambda: 1e+00
cat("Optimal alpha:", optimal_alpha, "\n")
## Optimal alpha: 1
cat("Root mean square error:", rmse, "\n")
## Root mean square error: 356536

4.4 Discussion

In this phase of analysis, both the training and testing datasets are transformed for further regularization of the ML-based regression models. The k-fold cross-validation is performed on the training data to estimate the performance of the ML-based regression models.

During the predictive analysis with Lasso Regression, the optimal value for λ is identified within the search range [1, 10^10]. Although the Root Mean Squared Error (RMSE) is quite high during the performance test using the testing data that indicates the model is overfitting and it needs some modifications to achieve better prediction accuracy.

However, the Elastic Net model is trained using the cross-validated data along with the optimal λ and identified the optimal value for α within the search range [0, 1]. Further, the model is trained with the optimal parameters (λ and α) and outperforms the Lasso Regression model to resolve the model overfitting. The optimized model reduced the Root Mean Squared Error (RMSE) to improve the model performance during revenue prediction.

5. Conclusion

5.1 Variable Selection for Predictive Modeling

The important variables are selected during the analysis to develop the predictive modeling whereas the irrelevant variables are discarded prior to the predictive modeling. The identified important variables are - P1, P2, P3, P4, P5, P6, P7, P8, age, and type. However, the type variable needs to be encoded before predictive modeling since it is categorical in nature. The id and open_date variables are irrelevant variables and not included in the feature set for predictive modeling.

5.2 Predictive Model Selection

The Optimized Elastic Net Regression model will be recommended for the predictive modeling to predict the revenue of the restaurant since it outperforms all the respective models implemented during the analysis. Considering the Root Mean Squared Error (RMSE) of this model it can be concluded that this will predict most accurate revenue cost for the restaurants based on the important features.