Author: Daniele Melotti
Date: Dec 2023
See the related GitHub repository for code, data and list of tasks.
We start by loading and taking a peek at the dataset:
bookings <- read.table("../data/first_bookings_datetime_sample.txt",
header=TRUE) # loading the data
str(bookings$datetime) # understanding what data type we are dealing with
## chr [1:100000] "4/16/2014 17:30" "1/11/2014 20:00" "3/24/2013 12:00" ...
Our data is contained in a character vector, which is not ideal for performing statistical operations. Therefore, we will convert the content of datetime to a POSIXlt date-time format and extract the components that interest us.
Since we know that EZTABLE would like to start a promotion for new members to make their bookings earlier in the day, we focus on extracting the hours and minutes after converting datetime to a POSIXlt object. We do not focus on date.
# Converting datetime to a POSIXlt object, extracting hours and minutes
dt <- as.POSIXlt(bookings$datetime, format="%m/%d/%Y %H:%M") # parse once, reuse below
hours <- dt$hour
mins <- dt$min
Afterwards, we calculate the time of each booking in minutes since the start of the day. So, we multiply the hours by 60 and add the minutes, resulting in a single numerical value that represents the minute of the day:
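A minimal sketch of this step, assuming the result is stored in a vector named minday (the name the later code uses). Three timestamps taken from the str() output above stand in for the full column, and tz = "UTC" is an added assumption that keeps daylight-saving gaps from producing NA values:

```r
# Stand-in for bookings$datetime: the three values visible in the str() output
datetime <- c("4/16/2014 17:30", "1/11/2014 20:00", "3/24/2013 12:00")

# Parse once, then reuse the parsed object (tz = "UTC" is an assumption)
dt <- as.POSIXlt(datetime, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Minute of the day: hours * 60 + minutes
minday <- dt$hour * 60 + dt$min
minday
```

For example, a booking made at 17:30 becomes 17 * 60 + 30 = 1050.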
Now, we shall see the density of minday so as to get an idea of how the values of this vector are distributed:
plot(density(minday), main="Minute of the day of first ever booking",
ylab = "", xlab = "Minutes", col="cornflowerblue", lwd=2)
We see a roughly bimodal distribution, with one peak of booking times around 700 minutes and another around 1100 minutes, which likely represent booking requests before lunch and dinner time. Next, we are going to conduct an analysis of the mean booking time.
We shall start by computing the mean booking time and its standard error. The standard error is computed by dividing the standard deviation of the booking time by the square root of the total number of bookings:
## [1] 942.4964
## [1] 0.5997673
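The two values above are the sample mean and its standard error, the sample standard deviation divided by the square root of the sample size. A minimal sketch of that computation, assuming the names minday_mean (used later in this report) and minday_se (hypothetical). A simulated vector stands in for the real minday, so its output will differ from the figures above:

```r
set.seed(1)
# Simulated stand-in for minday: minutes of the day in [0, 1439]
minday <- sample(0:1439, size = 1000, replace = TRUE)

# Sample mean of the booking time
minday_mean <- mean(minday)

# Standard error of the mean: sd / sqrt(n)
minday_se <- sd(minday) / sqrt(length(minday))

minday_mean
minday_se
```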
The mean booking time corresponds to 942.4964 minutes. To convert the time back to a familiar format, we use a custom function, located in the functions.R script:
source(file = "../R/functions.R")
mins_to_time(minday_mean) # custom function that converts number of minutes to time of the day
## [1] "15:42"
So, we know that the time at which a first booking is made on average is 15:42.
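The body of mins_to_time() lives in functions.R and is not shown here. The sketch below is a hypothetical stand-in, not the author's implementation. It assumes the function rounds to the nearest whole minute, which is consistent with the conversions reported in this analysis:

```r
# Hypothetical stand-in for mins_to_time() from functions.R
mins_to_time_sketch <- function(mins) {
  total <- as.integer(round(mins))   # nearest whole minute (assumption)
  # Integer division gives the hour, the remainder gives the minutes
  sprintf("%02d:%02d", total %/% 60L, total %% 60L)
}

mins_to_time_sketch(942.4964)        # "15:42"
mins_to_time_sketch(c(1020, 1050))   # "17:00" "17:30"
```

Because sprintf() is vectorized, the same helper converts a single value or both ends of an interval in one call.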
Recall that the data we have available is only a sample; now that we know mean and standard error, we are able to build a confidence interval and get a clearer idea about the true average booking time:
## [1] 941.3208
## [1] 943.6719
The interval we just built suggests that we can be 95% confident that the true average booking time is between 941.3208 and 943.6719 minutes, that is, between 15:41 and 15:44.
## [1] "15:41"
## [1] "15:44"
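The interval above can be reproduced from the reported mean and standard error. The sketch below assumes a normal approximation with the 97.5% quantile taken from qnorm(); the variable names ci_lower and ci_upper are illustrative:

```r
# Values reported earlier in this analysis
minday_mean <- 942.4964
minday_se <- 0.5997673

# 95% interval: mean +/- z * SE, with z = qnorm(0.975) (about 1.96)
z <- qnorm(0.975)
ci_lower <- minday_mean - z * minday_se
ci_upper <- minday_mean + z * minday_se

c(lower = ci_lower, upper = ci_upper)
```

With 100000 observations, using a t quantile instead of the normal one would change the bounds only negligibly.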
Next, we are going to employ bootstrapping. This is a technique that
resamples a dataset to create several simulated samples. To implement
bootstrapping, we use the replicate() function, paired with
sample(). Let’s generate 2000 new samples from the original
sample data:
# Setting seed
set.seed(100)
# Creating 2000 new samples (bootstrapped samples) from the starting data
new_samples <- replicate(n = 2000,
sample(minday, length(minday), replace = TRUE))
We might be interested in seeing what the 2000 bootstrapped samples
look like in comparison with the starting sample. We can display them on
the same plot. We achieve this by applying the custom function
plot_resamples(), included in the functions.R
script. This function plots the density of a sample and then applies a
function to the sample; paired with apply(), it allows
us to plot each bootstrapped sample’s density and compute its
mean.
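The source of plot_resamples() is in functions.R and is not reproduced here. Going only by the description above, a function of that shape might look like the hypothetical sketch below; the name plot_resamples_sketch, the line color and the simulated data are all assumptions:

```r
# Hypothetical stand-in for plot_resamples() from functions.R
plot_resamples_sketch <- function(sample_i, fun) {
  # Overlay this resample's density on the currently open plot
  lines(density(sample_i), col = rgb(0.0, 0.4, 0.0, 0.05))
  # Return the statistic so that apply() collects one value per resample
  fun(sample_i)
}

set.seed(1)
x <- sample(0:1439, size = 500, replace = TRUE)   # simulated stand-in data
resamples <- replicate(20, sample(x, length(x), replace = TRUE))

# Open an empty plotting region, then draw every resample and collect its mean
plot(density(x), type = "n", xlab = "Minutes", ylab = "", main = "Sketch")
stats <- apply(resamples, 2, function(col) plot_resamples_sketch(col, mean))
length(stats)
```

Doing the drawing as a side effect while returning the statistic is what lets a single apply() call both fill the plot and build the vector of bootstrapped statistics.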
# Create an empty plotting space
plot(density(minday), lwd=0, ylim=c(0, 0.004), xlab = "Minutes", ylab = "",
main = "Original sample vs. Bootstrapped samples with Mean indicators")
# Plot and get means of all bootstrapped samples
sample_means <- apply(new_samples, 2,
FUN = function(x) plot_resamples(x, mean))
# Add starting sample density to the plot
lines(density(minday), lwd = 2)
# Add vertical lines for the means of the original sample and of each new sample
abline(v = sample_means, col = rgb(0.0, 0.4, 0.0, 0.05))
abline(v = minday_mean, lwd = 1, col = "red")
We notice that the bootstrapped samples’ densities are all very close to the original sample’s density. The same goes for the bootstrapped means, which suggests that the confidence interval is very narrow. Now, let’s visualize the density of the 2000 bootstrapped means and see how they compare with the mean of the starting sample.
plot(density(sample_means), lwd = 2, col = "cornflowerblue",
main = "Density of Bootstrapped samples with Original mean indicator",
ylab = "", xlab = "Minutes")
abline(v = minday_mean, lwd = 2, col = "tomato") # adding vertical line for original sample mean
We see that the bootstrapped means do not deviate much from the mean of the original sample. Most values lie between 941 and 944 minutes, with only a few means falling outside this interval.
We already know the 95% confidence interval for the original sample’s
mean; now we can build a corresponding interval from the resampled means
using the quantile() function:
## 2.5% 97.5%
## 941.2731 943.7660
The 95% CI of the bootstrapped means ranges from 941.2731 to 943.766 (or from 15:41 to 15:44), which is very similar to the CI of the original sample’s mean.
## [1] "15:41" "15:44"
This means that we are 95% confident that the true average booking time falls between 15:41 and 15:44. With this confidence interval, we complete the analysis of the average booking time.
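The percentile interval used in this section takes the 2.5% and 97.5% quantiles of the bootstrapped means. A minimal sketch on simulated stand-in data (so the numbers will not match the report), with boot_ci as an illustrative name:

```r
set.seed(100)
# Simulated stand-in for minday: minutes of the day in [0, 1439]
minday <- sample(0:1439, size = 1000, replace = TRUE)

# 2000 bootstrap resamples, one per column
new_samples <- replicate(n = 2000,
                         sample(minday, length(minday), replace = TRUE))

# Mean of each resample, then the percentile interval of those means
sample_means <- apply(new_samples, 2, mean)
boot_ci <- quantile(sample_means, probs = c(0.025, 0.975))
boot_ci
```

Swapping mean for median in the apply() call gives the corresponding percentile interval for the median.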
Now, we focus on the median booking time, in other words, the time of the day by which half of the customers have made their first booking.
## [1] 1040
## [1] "17:20"
We see that the median booking time is 1040 minutes, that is, 17:20.
We can display the bootstrapped medians and compare them with the
original sample’s median booking time. We shall use
plot_resamples() again.
# Create an empty plotting space
plot(density(minday), lwd=0, ylim=c(0, 0.004), xlab = "Minutes", ylab = "",
main = "Original sample vs. Bootstrapped samples with Median indicators")
# Plot and get medians of all bootstrapped samples
sample_medians <- apply(new_samples, 2,
FUN = function(x) plot_resamples(x, median))
# Add starting sample density to the plot
lines(density(minday), lwd = 2)
# Add vertical lines for the medians of the original sample and of each new sample
abline(v = sample_medians, col = rgb(0.0, 0.4, 0.0, 0.05))
abline(v = minday_median, lwd = 1, col = "red")
Interestingly, we see that the bootstrapped medians are close to the original sample’s median; however, there are gaps between these values. This is because every bootstrapped median takes one of the following discrete values:
## [1] 1020.0 1025.0 1030.0 1032.5 1035.0 1037.5 1040.0 1045.0 1050.0
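The half-minute values listed above (1032.5, 1037.5) are what one would expect if booking times fall on a 5-minute grid and the sample size is even, since median() then averages the two middle values. This is an inference from the output, not something the data description confirms. A minimal sketch on simulated stand-in data that shows the same effect:

```r
set.seed(100)
# Simulated stand-in: times on a 5-minute grid, even sample size (assumptions)
minday <- sample(seq(0, 1435, by = 5), size = 1000, replace = TRUE)

# Bootstrap resamples (one per column) and the median of each
new_samples <- replicate(n = 500,
                         sample(minday, length(minday), replace = TRUE))
sample_medians <- apply(new_samples, 2, median)

# Distinct medians: grid points or midpoints between neighbouring grid points
sort(unique(sample_medians))

# Every median is therefore a multiple of 2.5 minutes
all(sample_medians %% 2.5 == 0)
```

Because the medians can only land on this coarse set of values, their density is lumpy rather than smooth.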
Now, let’s visualize the density of the 2000 bootstrapped medians:
plot(density(sample_medians), lwd = 2, col = "cornflowerblue",
main = "Density of Bootstrapped samples with Original median indicator",
ylab = "", xlab = "Minutes")
abline(v = minday_median, lwd = 2, col = "tomato") # add vertical line for original sample median
As we can see, the bootstrapped medians are distributed quite widely and non-normally. This makes it hard to use them for inference.
Let’s compute the 95% CI:
## 2.5% 97.5%
## 1020 1050
The confidence interval shows that we can be 95% confident that the true median booking time lies between 1020 and 1050 minutes, which corresponds to the following clock times:
## [1] "17:00" "17:30"
That is, between 17:00 and 17:30. This is a window of 30 minutes, which is quite wide. The calculation of the confidence interval for the median concludes the median analysis.
The bimodal distribution seen at the end of the data exploration section suggests two peak periods, likely corresponding to lunch and dinner times. This pattern indicates that new members of EZTABLE prefer these periods for their first bookings, which presents an opportunity to shift some of this demand to earlier times. For example, offering special promotions or incentives for bookings made before the usual lunch peak could help redistribute the customer flow.
The narrow confidence interval for the mean booking time seen in the mean analysis section shows that the average is estimated precisely. Since the mean is later in the day (15:42), EZTABLE can design targeted promotions aimed at times just before the average, gradually encouraging customers to book earlier. On the other hand, the median analysis showed us a distribution that doesn’t allow us to conduct inference easily.
Considering what we have discovered so far, EZTABLE could employ the following strategies in order to encourage early bookings:
These are a few simple recommendations for EZTABLE, which are designed to leverage the insights from the booking time analysis to effectively encourage early-day bookings, thereby aiding EZTABLE in optimizing restaurant reservations and enhancing customer flow.