The requisite libraries are loaded for later processing.
library(tidyverse)
library(GGally)
library(stats)
library(MASS)
library(dplyr)
library(caret)
library(fastDummies)
library(glmnet)
cat("All the required packages are successfully imported.")
## All the required packages are successfully imported.
The restaurants data is loaded for analysis.
# Load the data into R as a tibble object
restaurant_data <- read_csv("data31970001.csv")
cat("The Restaurant dataset is successfully loaded")
## The Restaurant dataset is successfully loaded
A chunk of the loaded Restaurant data is displayed.
# Display the first 10 rows of the dataset
head(restaurant_data, 10)
## # A tibble: 10 × 12
## id open_date type P1 P2 P3 P4 P5 P6 P7 P8 revenue
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 97175 06/01/20… FC 3 6 0 2 2 3 5 3 1.21e7
## 2 68467 07/14/20… IL 0 12 0 25 0 0 5 6 2.22e7
## 3 8522 09/14/20… FC 0 3 0 20 0 0 25 0 1.33e7
## 4 97066 07/28/20… DT 0 3 0 1 0 0 1 0 1.30e7
## 5 66928 03/27/20… FC 0 1 5 2 0 0 5 3 1.25e7
## 6 8593 08/14/20… IL 5 1 0 5 0 4 2 0 1.29e7
## 7 55361 02/11/19… FC 0 2 0 1 0 18 4 0 9.52e6
## 8 36162 08/30/20… FC 1 1 0 3 4 4 1 0 1.31e7
## 9 51016 05/01/19… IL 0 4 5 5 0 18 25 3 6.00e6
## 10 41370 08/19/20… FC 0 12 25 20 1 0 1 0 2.03e7
Here, the age of each restaurant is calculated using the their
opening date on 1st January 2015. In order to perform this calculation
the datatype of theopen_date attribute is changed into Date
format and subtract the opening date from the selected date.
# Convert the "open_date" variable to a Date format
restaurant_data$open_date <- as.Date(restaurant_data$open_date, "%m/%d/%Y")
# Calculate the age of each restaurant in years on 1st January 2015
restaurant_data$age <- as.numeric(difftime(as.Date("2015-01-01"),
restaurant_data$open_date, units = "days"))/365
print(restaurant_data)
## # A tibble: 2,000 × 13
## id open_date type P1 P2 P3 P4 P5 P6 P7 P8
## <dbl> <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 97175 2012-06-01 FC 3 6 0 2 2 3 5 3
## 2 68467 2000-07-14 IL 0 12 0 25 0 0 5 6
## 3 8522 2009-09-14 FC 0 3 0 20 0 0 25 0
## 4 97066 2000-07-28 DT 0 3 0 1 0 0 1 0
## 5 66928 2013-03-27 FC 0 1 5 2 0 0 5 3
## 6 8593 2009-08-14 IL 5 1 0 5 0 4 2 0
## 7 55361 1998-02-11 FC 0 2 0 1 0 18 4 0
## 8 36162 2013-08-30 FC 1 1 0 3 4 4 1 0
## 9 51016 1997-05-01 IL 0 4 5 5 0 18 25 3
## 10 41370 2006-08-19 FC 0 12 25 20 1 0 1 0
## # ℹ 1,990 more rows
## # ℹ 2 more variables: revenue <dbl>, age <dbl>
The histogram plot represents the restaurant count for each restaurant’s age to identify the age of most of the restaurants. As a result of this analysis it has been found that out of 2000 restaurants all most 500+ restaurants are 2 to 3 years old.
# Show the histogram of the "age" variable
ggplot(restaurant_data, aes(x = age)) +
geom_histogram(binwidth = 1, fill = "blue", color = "white") +
labs(title = "Histogram of Restaurant Age", x = "Age (Years)", y = "Count")
The pair-wise scatter plot graphically represents the correlations between all numerical variables to understand their linear correlations. Although considering the number of variables the pair chat is not enough to select the best features for the ML modeling for predicting the revenue of the restaurants based on their correlation coefficients.
# Visualize a pair-wise scatter plot of each numerical variable against the response ("revenue")
ggpairs(restaurant_data, columns = c("age", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "revenue"),
title = "Pair-wise Scatter Plots of Numerical Variables and Revenue",
lower = list(continuous = "points"), diag = list(continuous = "density"),
upper = list(continuous = "blank", combo = "box"),
mapping = aes(color = type))
The scatter plot represents the linear relationship between the age of the restaurant and total revenue considering different types of restaurants and it is evident that there is a strong relationship between these two columns.
# Create a scatter plot between age and revenue based on restaurant type
ggplot(restaurant_data, aes(x = age, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between Age and Revenue by Restaurant Type",
x = "Age (Years)", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P1 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-6 years.
# Create a scatter plot between P1 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P1, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P1 and Revenue by Restaurant Type",
x = "P1", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P2 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-12 years.
# Create a scatter plot between P2 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P2, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P2 and Revenue by Restaurant Type",
x = "P2", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P3 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.
# Create a scatter plot between P3 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P3, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P3 and Revenue by Restaurant Type",
x = "P3", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P4 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.
# Create a scatter plot between P4 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P4, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P4 and Revenue by Restaurant Type",
x = "P4", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P5 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-15 years.
# Create a scatter plot between P5 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P5, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P5 and Revenue by Restaurant Type",
x = "P5", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P6 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-10 years.
# Create a scatter plot between P6 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P6, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P6 and Revenue by Restaurant Type",
x = "P6", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
The scatter plot represents the linear relationship between the P7 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-5 years.
# Create a scatter plot between P7 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P7, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P7 and Revenue by Restaurant Type",
x = "P7", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
Comparing the resultant visualizations from the section 1.3.1 and 1.3.2 it is evident that the individual scatter plots presented in section 1.3.2 are more insightful to understand the linear relationships between each numerical variables and the revenue from the dataset. Additionally the individual plots presented the relationships for each type of restaurants which further helps to understand the importance of each individual feature with respect to the revenue column.
The scatter plot represents the linear relationship between the P8 (obfuscated data) and total revenue considering different types of restaurants and it is evident that there is a good relationship between these two columns considering the restaurants having age between 0-10 years.
# Create a scatter plot between P8 and revenue based on restaurant type
ggplot(restaurant_data, aes(x = P8, y = revenue, color = type)) +
geom_point() +
labs(title = "Relationship between P8 and Revenue by Restaurant Type",
x = "P8", y = "Revenue") +
facet_wrap(~type) +
scale_color_manual(values = c("DT" = "blue",
"FC" = "red",
"IL" = "green",
"MB" = "orange"),
labels = c("Drive Thru", "Food Court", "Inline", "Mobile"),
breaks = c("DT", "FC", "IL", "MB"))
# Produce a numerical summary of all the variables in the data set
summary(restaurant_data)
## id open_date type P1
## Min. : 42 Min. :1995-05-08 Length:2000 Min. : 0.000
## 1st Qu.:25390 1st Qu.:2007-02-16 Class :character 1st Qu.: 0.000
## Median :49566 Median :2009-12-16 Mode :character Median : 0.000
## Mean :50130 Mean :2008-08-10 Mean : 1.113
## 3rd Qu.:75325 3rd Qu.:2012-02-16 3rd Qu.: 2.000
## Max. :99974 Max. :2014-01-04 Max. :12.500
## P2 P3 P4 P5
## Min. : 1.0 Min. : 0.000 Min. : 1.00 Min. : 0.000
## 1st Qu.: 2.0 1st Qu.: 0.000 1st Qu.: 2.00 1st Qu.: 0.000
## Median : 4.0 Median : 0.000 Median : 3.00 Median : 0.000
## Mean : 4.3 Mean : 2.125 Mean : 5.17 Mean : 2.022
## 3rd Qu.: 4.0 3rd Qu.: 3.000 3rd Qu.: 5.00 3rd Qu.: 3.000
## Max. :15.0 Max. :25.000 Max. :25.00 Max. :25.000
## P6 P7 P8 revenue
## Min. : 0.000 Min. : 1.000 Min. : 0.000 Min. : 4347275
## 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.:12567480
## Median : 0.000 Median : 2.000 Median : 0.000 Median :13480819
## Mean : 2.094 Mean : 3.542 Mean : 1.185 Mean :13669680
## 3rd Qu.: 3.000 3rd Qu.: 4.000 3rd Qu.: 2.000 3rd Qu.:14359973
## Max. :30.000 Max. :25.000 Max. :15.000 Max. :23914765
## age
## Min. : 0.9918
## 1st Qu.: 2.8767
## Median : 5.0466
## Mean : 6.3956
## 3rd Qu.: 7.8795
## Max. :19.6658
During the preliminary analysis the following insights are found -
The id and open_date columns should not be
used during predictive modeling. The id column contains the
unique identification number for all 2000 restaurants which would be a
random variable for the ML-based model. However the
open_date variable represents the opening date of each
restaurants which has an importance for further analysis but ML-models
does not take the date type variables as a feature during predictive
analysis. Moreover the age of each restaurant (age) is
already calculated from the open_date variable which is a
numeric variable so that open_date variable is not
significant for Ml-modeling.
# Remove the variables id and open_date from the data
restaurant_data2 <- restaurant_data %>%
dplyr::select(-id, -open_date)
print (restaurant_data2)
## # A tibble: 2,000 × 11
## type P1 P2 P3 P4 P5 P6 P7 P8 revenue age
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FC 3 6 0 2 2 3 5 3 12106102 2.59
## 2 IL 0 12 0 25 0 0 5 6 22228036 14.5
## 3 FC 0 3 0 20 0 0 25 0 13293597 5.30
## 4 DT 0 3 0 1 0 0 1 0 12953367 14.4
## 5 FC 0 1 5 2 0 0 5 3 12502340 1.77
## 6 IL 5 1 0 5 0 4 2 0 12861584 5.39
## 7 FC 0 2 0 1 0 18 4 0 9524520 16.9
## 8 FC 1 1 0 3 4 4 1 0 13143449 1.34
## 9 IL 0 4 5 5 0 18 25 3 6004547 17.7
## 10 FC 0 12 25 20 1 0 1 0 20325925 8.38
## # ℹ 1,990 more rows
The cleaned restaurant data is divided into 70:30 ratio where 70% data will be used for training purpose and 30% data will be used for testing purpose.
# Randomly split the data into training and testing data
training_data <- restaurant_data2 %>%
sample_frac(0.7)
testing_data <- restaurant_data2 %>%
anti_join(training_data, by = c("type", "P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "revenue", 'age'))
cat("Training dataset contains", dim(training_data)[1], "observations.")
## Training dataset contains 1400 observations.
cat("Testing dataset contains", dim(testing_data)[1], "observations.")
## Testing dataset contains 600 observations.
# Fit a multiple linear regression model using the training sample
full_lin_reg_model <- lm(revenue ~ ., data = training_data)
# Show the summary of the model fit
summary(full_lin_reg_model)
##
## Call:
## lm(formula = revenue ~ ., data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -661635 -160227 -95582 171969 1575234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12747699.8 61557.7 207.085 <2e-16 ***
## typeFC -9694.8 57801.5 -0.168 0.867
## typeIL -10823.7 58203.9 -0.186 0.853
## typeMB 43854.4 199701.4 0.220 0.826
## P1 -178944.7 4261.0 -41.996 <2e-16 ***
## P2 98739.6 2998.3 32.932 <2e-16 ***
## P3 336.4 2060.6 0.163 0.870
## P4 344619.1 1511.3 228.030 <2e-16 ***
## P5 -2128.6 2136.7 -0.996 0.319
## P6 -137563.9 1862.4 -73.866 <2e-16 ***
## P7 -276295.9 1941.8 -142.285 <2e-16 ***
## P8 160831.6 3838.2 41.903 <2e-16 ***
## age 189.3 1802.7 0.105 0.916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 326600 on 1387 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9835
## F-statistic: 6942 on 12 and 1387 DF, p-value: < 2.2e-16
# Fit a multiple linear regression model for revenue with all predictors except type
reduced_lin_reg_model <- lm(revenue ~ . - type, data = training_data)
# Show the summary of the model fit
summary(reduced_lin_reg_model)
##
## Call:
## lm(formula = revenue ~ . - type, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -662772 -159800 -95352 171326 1575133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12737483.1 23565.0 540.525 <2e-16 ***
## P1 -178915.1 4252.5 -42.073 <2e-16 ***
## P2 98749.1 2987.2 33.058 <2e-16 ***
## P3 314.9 2056.0 0.153 0.878
## P4 344615.0 1508.7 228.420 <2e-16 ***
## P5 -2153.0 2133.2 -1.009 0.313
## P6 -137569.6 1859.9 -73.965 <2e-16 ***
## P7 -276217.2 1918.4 -143.986 <2e-16 ***
## P8 160808.6 3832.9 41.955 <2e-16 ***
## age 221.2 1785.0 0.124 0.901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 326300 on 1390 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9835
## F-statistic: 9275 on 9 and 1390 DF, p-value: < 2.2e-16
# Perform an ANOVA test to compare the two models
anova_test <- anova(reduced_lin_reg_model, full_lin_reg_model)
# Show the results of the ANOVA test
print(anova_test)
## Analysis of Variance Table
##
## Model 1: revenue ~ (type + P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 + age) -
## type
## Model 2: revenue ~ type + P1 + P2 + P3 + P4 + P5 + P6 + P7 + P8 + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1390 1.4800e+14
## 2 1387 1.4799e+14 3 1.2269e+10 0.0383 0.99
In order to determine the best regression model for predicting the new records the hypothesis testing is used based on their performance. Initially, the F-test helps to analyze whether the full model (full_lin_reg_model) significantly performs better the reduced model (reduced_lin_reg_model) based on the following hypothesis.
During ANOVA test it is found that the p-value of the F-test is 0.7423, which is greater than the significance level of 0.05. Therefore, the null hypothesis can not be rejected and it can be concluded that the reduced model is not significantly worse than the full model.
Furthermore, the coefficients of the two models are compared to find
if there are any major differences in their predictions. The reduced
model excludes the type variable, which means it does not
take into account the effect of different store types on revenue.
However, the coefficients for the remaining variables are similar
between the two models.
Based on these results, it can be concluded that the reduced model is
likely to perform just as well as the full model at predicting the
revenue for new records, and it may be preferred due to its simplicity
and lack of dependence on the type variable.
# Compute the BIC for the full model
full_lin_reg_model_bic <- BIC(full_lin_reg_model)
# Perform backward elimination using BIC
while (TRUE) {
# Compute the BIC for the reduced model
reduced_lin_reg_model_bic <- BIC(reduced_lin_reg_model)
# Compare the BICs of the full and reduced models
if (reduced_lin_reg_model_bic < full_lin_reg_model_bic) {
# Keep the reduced model and update the BIC
full_lin_reg_model <- reduced_lin_reg_model
full_lin_reg_model_bic <- reduced_lin_reg_model_bic
} else {
# Stop and select the full model as the final model
break
}
}
# Print the final model
summary(full_lin_reg_model)
##
## Call:
## lm(formula = revenue ~ . - type, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -662772 -159800 -95352 171326 1575133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12737483.1 23565.0 540.525 <2e-16 ***
## P1 -178915.1 4252.5 -42.073 <2e-16 ***
## P2 98749.1 2987.2 33.058 <2e-16 ***
## P3 314.9 2056.0 0.153 0.878
## P4 344615.0 1508.7 228.420 <2e-16 ***
## P5 -2153.0 2133.2 -1.009 0.313
## P6 -137569.6 1859.9 -73.965 <2e-16 ***
## P7 -276217.2 1918.4 -143.986 <2e-16 ***
## P8 160808.6 3832.9 41.955 <2e-16 ***
## age 221.2 1785.0 0.124 0.901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 326300 on 1390 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9835
## F-statistic: 9275 on 9 and 1390 DF, p-value: < 2.2e-16
# Fit a multiple linear regression model using the training sample
full_lin_reg_model <- lm(revenue ~ ., data = training_data)
# Perform step wise regression using AIC
stepwise_model <- stepAIC(full_lin_reg_model, direction = "both", trace = FALSE)
# Show the summary of the model fit
summary(stepwise_model)
##
## Call:
## lm(formula = revenue ~ P1 + P2 + P4 + P6 + P7 + P8, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -657395 -158240 -95701 173899 1571790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12735490 20096 633.73 <2e-16 ***
## P1 -178897 4249 -42.10 <2e-16 ***
## P2 98841 2981 33.16 <2e-16 ***
## P4 344606 1504 229.10 <2e-16 ***
## P6 -137660 1856 -74.17 <2e-16 ***
## P7 -276375 1910 -144.67 <2e-16 ***
## P8 160799 3830 41.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 326100 on 1393 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9835
## F-statistic: 1.393e+04 on 6 and 1393 DF, p-value: < 2.2e-16
The second model (from section 3.2)
obtained through step-wise regression has fewer predictor variables than
the first model (from section 3.1) obtained
through backward elimination. In the second model, only P1,
P2, P4, P6, P7, and
P8 were retained as significant predictors of revenue. On
the other hand, the first model had an additional predictor variable,
age, which was not significant in predicting revenue.
Both models have a high adjusted R-squared value, indicating that they can explain a large portion of the variability in the response variable. However, the second model has a higher F-statistic value, indicating that it fits the data better than the first model. Additionally, the second model has a lower residual standard error, indicating that it has less error variance and is a better fit to the data.
In order to transform the data several steps are followed -
type which can not be used as a feature
for predictive analysis. So, the One Hot encoding is performed on the
column from both of the datasets to convert the same into encoded
numerical columns.type column is discarded to make the
dataset numerical for predictive analysis.Furthermore, k-fold cross validation is performed on training data taking k=5. The cross-validated data will be used for further ML-based predictive analysis.
# Create one-hot encoded variables
training_data <- fastDummies::dummy_cols(training_data, select_columns = "type", remove_first_dummy = TRUE)
testing_data <- fastDummies::dummy_cols(testing_data, select_columns = "type", remove_first_dummy = TRUE)
# remove the original 'type' column
training_data <- training_data[, !colnames(training_data) %in% "type"]
testing_data <- testing_data[, !colnames(testing_data) %in% "type"]
cat("Training data sample -\n")
## Training data sample -
print(training_data)
## # A tibble: 1,400 × 13
## P1 P2 P3 P4 P5 P6 P7 P8 revenue age type_FC type_IL
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0 4 0 2 0 3 5 0 1.19e7 7.17 0 1
## 2 5 2 0 2 5 4 3 2 1.13e7 1.13 1 0
## 3 4 3 0 4 0 0 3 0 1.27e7 4.42 1 0
## 4 0 2 0 1 10 0 4 0 1.26e7 9.03 0 1
## 5 0 12 4 4 0 0 2 0 1.49e7 15.1 1 0
## 6 5 2 0 5 15 0 2 0 1.31e7 16.0 1 0
## 7 0 9 0 1 3 3 2 0 1.29e7 2.93 1 0
## 8 0 4 25 25 3 0 5 0 2.03e7 15.3 0 0
## 9 0 4 2 2 0 0 3 0 1.29e7 4.33 1 0
## 10 0 12 5 5 0 0 5 9 1.58e7 7.57 1 0
## # ℹ 1,390 more rows
## # ℹ 1 more variable: type_MB <int>
cat("Testing data sample -\n")
## Testing data sample -
print(testing_data)
## # A tibble: 600 × 13
## P1 P2 P3 P4 P5 P6 P7 P8 revenue age type_FC type_IL
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 0 1 5 2 0 0 5 3 1.25e7 1.77 1 0
## 2 1 1 0 3 4 4 1 0 1.31e7 1.34 1 0
## 3 0 4 5 5 0 18 25 3 6.00e6 17.7 0 1
## 4 1 2 5 3 0 0 1 3 1.39e7 9.03 0 1
## 5 5 4 3 5 0 0 3 0 1.41e7 8.80 0 1
## 6 2 4 3 2 0 3 1 0 1.31e7 4.42 0 1
## 7 0 4 15 5 0 0 25 0 8.22e6 7.46 1 0
## 8 0 4 5 3 3 0 1 0 1.37e7 5.78 1 0
## 9 7.5 2 0 2 0 24 1 0 8.62e6 4.04 0 1
## 10 5 2 3 3 5 3 2 0 1.15e7 8.05 1 0
## # ℹ 590 more rows
## # ℹ 1 more variable: type_MB <int>
# Create 5-fold cross-validation dataset
set.seed(123) # for reproducibility
folds <- createFolds(training_data$revenue, k = 5)
cv_data <- lapply(folds, function(x) list(train = training_data[-x,],
test = training_data[x,]))
cat("The 5-Fold Cross-Validation dataset is created.")
## The 5-Fold Cross-Validation dataset is created.
The Lasso Regression model is trained to determine the optimal value of the tuning parameter (λ) using k-fold cross-validation (taking k=5) dataset. During the analysis, the search range for λ is defined as [1, 10^10] taking the length as 50.
# Define search range for lambda
lambda_seq <- 10^seq(0, 10^10, length.out = 50)
# Create empty vector to store RMSE values for each lambda
rmse_values <- rep(0, length(lambda_seq))
# Perform k-fold cross-validation for each lambda value
k <- 5
for (i in 1:length(lambda_seq)) {
rmse <- numeric(k)
for (fold in 1:k) {
# Extract training and test sets from cross-validation dataset
train_x <- as.matrix(cv_data[[fold]]$train[-9])
train_y <- cv_data[[fold]]$train$revenue
test_x <- as.matrix(cv_data[[fold]]$test[-9])
test_y <- cv_data[[fold]]$test$revenue
# Fit lasso regression model using glmnet package
lasso_reg_model <- glmnet(x = train_x, y = train_y, lambda = lambda_seq[i])
# Predict on test set and calculate RMSE
pred <- predict(lasso_reg_model, newx = test_x)
rmse[fold] <- sqrt(mean((test_y - pred)^2))
}
# Calculate average RMSE across all folds
rmse_values[i] <- mean(rmse)
}
# Find optimal value of lambda that minimizes RMSE
optimal_lambda <- lambda_seq[which.min(rmse_values)]
# Print results
cat("Optimal lambda:", format(optimal_lambda, scientific = TRUE), "\n")
## Optimal lambda: 1e+00
cat("Root mean square error:", mean(rmse), "\n")
## Root mean square error: 2533206
The Elastic Net model is trained using the k-fold cross-validation (taking k=5) dataset and optimal λ to select the optimal value for α. Since α ∈ [0, 1] so the search range for α is defined as [0, 1] taking the length as 21.
# Define search range for alpha
alpha_seq <- seq(0, 1, length = 21)
# Create empty vector to store RMSE values for each alpha
rmse_values <- rep(0, length(alpha_seq))
# Perform k-fold cross-validation for each alpha
k <- 5
for (i in 1:length(alpha_seq)) {
rmse <- numeric(k)
for (fold in 1:k) {
# Extract training and test sets from cross-validation dataset
train_x <- as.matrix(cv_data[[fold]]$train[-9])
train_y <- cv_data[[fold]]$train$revenue
test_x <- as.matrix(cv_data[[fold]]$test[-9])
test_y <- cv_data[[fold]]$test$revenue
# Fit elastic net model using glmnet package
elastic_net_model <- glmnet(x = train_x,
y = train_y,
alpha = alpha_seq[i],
lambda = optimal_lambda)
# Predict on test set and calculate RMSE
pred <- predict(elastic_net_model, newx = test_x)
rmse[fold] <- sqrt(mean((test_y - pred)^2))
}
# Calculate average RMSE across all folds
rmse_values[i] <- mean(rmse)
}
# Find optimal value of alpha that minimizes RMSE
optimal_alpha <- alpha_seq[which.min(rmse_values)]
# Fit elastic net model with optimal parameters
optimized_elastic_net_model <- glmnet(x = train_x,
y = train_y,
alpha = optimal_alpha,
lambda = optimal_lambda)
# Predict on test set and calculate RMSE
pred <- predict(optimized_elastic_net_model, newx = test_x, s = optimal_lambda)
rmse <- sqrt(mean((test_y - pred)^2))
# Print results
cat("Optimal lambda:", format(optimal_lambda, scientific = TRUE), "\n")
## Optimal lambda: 1e+00
cat("Optimal alpha:", optimal_alpha, "\n")
## Optimal alpha: 1
cat("Root mean square error:", rmse, "\n")
## Root mean square error: 356536
In this phase of analysis, both the training and testing datasets are transformed for further regularization of the ML-based regression models. The k-fold cross-validation is performed on the training data to estimate the performance of the ML-based regression models.
During the predictive analysis with Lasso Regression, the optimal value for λ is identified within the search range [1, 10^10]. Although the Root Mean Squared Error (RMSE) is quite high during the performance test using the testing data that indicates the model is overfitting and it needs some modifications to achieve better prediction accuracy.
However, the Elastic Net model is trained using the cross-validated data along with the optimal λ and identified the optimal value for α within the search range [0, 1]. Further, the model is trained with the optimal parameters (λ and α) and outperforms the Lasso Regression model to resolve the model overfitting. The optimized model reduced the Root Mean Squared Error (RMSE) to improve the model performance during revenue prediction.
The important variables are selected during the analysis to develop
the predictive modeling whereas the irrelevant variables are discarded
prior to the predictive modeling. The identified important variables are
- P1, P2, P3, P4,
P5, P6, P7, P8,
age, and type. However, the type
variable needs to be encoded before predictive modeling since it is
categorical in nature. The id and open_date
variables are irrelevant variables and not included in the feature set
for predictive modeling.
The Optimized Elastic Net Regression model will be recommended for the predictive modeling to predict the revenue of the restaurant since it outperforms all the respective models implemented during the analysis. Considering the Root Mean Squared Error (RMSE) of this model it can be concluded that this will predict most accurate revenue cost for the restaurants based on the important features.