Context
The Bread Basket¹ is a bakery nestled in the historic center of Edinburgh, Scotland. It is open from mid-morning to early afternoon and specializes in Spanish and Argentinian foods.
Source
The bakery cataloged every transaction made over a 6-month period. Each transaction record includes the date, the time after the bakery opened, and the number of items purchased.
Research Question
Given transactions spanning such a long stretch of time, we hope to observe trends in sales, specifically which hours and days of the week are the most profitable for the bakery.
Limitations
We were not able to break the data down further by item. Although the information was available, we do not possess the abilities to work with multiple categorical variables.
After cleaning the dataset, we applied time constraints to the transactions, fixing 8 am as the average opening time and 6 pm as the average closing time in order to prevent skewing of the data. As a result, some points are dropped because they have “negative” time values.
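The time constraint described above can be sketched roughly as follows. This is a minimal illustration, assuming each transaction carries a full timestamp and using the 8 am and 6 pm cut-offs stated in the text; the function names and the toy timestamps are hypothetical and are not taken from the bakery dataset or from our actual cleaning script.

```python
from datetime import datetime

OPEN_HOUR = 8    # assumed average opening time (8 am), as stated above
CLOSE_HOUR = 18  # assumed average closing time (6 pm), as stated above


def hours_since_opening(timestamp: datetime) -> float:
    """Return hours elapsed since the assumed 8 am opening (may be negative)."""
    opening = timestamp.replace(hour=OPEN_HOUR, minute=0, second=0, microsecond=0)
    return (timestamp - opening).total_seconds() / 3600


def within_opening_hours(timestamp: datetime) -> bool:
    """Keep only transactions between the assumed opening and closing times."""
    elapsed = hours_since_opening(timestamp)
    return 0 <= elapsed <= CLOSE_HOUR - OPEN_HOUR


# Made-up timestamps for illustration only; not rows from the bakery dataset.
toy_timestamps = [
    datetime(2017, 1, 2, 7, 45),   # before opening: negative value, dropped
    datetime(2017, 1, 2, 9, 30),   # 1.5 hours after opening, kept
    datetime(2017, 1, 2, 18, 30),  # after closing, dropped
]
kept = [t for t in toy_timestamps if within_opening_hours(t)]
```

Transactions recorded before the assumed opening time get a negative elapsed value, which is why they fall out of the analysis.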
| Hours_Open | Total_Transactions | Mean_Items_Per_Transaction |
|---|---|---|
| 4.485 | 13257 | 2.139 |

| week_day | mean_items |
|---|---|
| Mon | 2.109 |
| Tue | 2.062 |
| Wed | 2.167 |
| Thu | 2.172 |
| Fri | 2.174 |
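A grouped summary like the weekday table above could be computed along these lines. This is only a sketch: the `(weekday, items)` pairs are invented toy values, and the function name is hypothetical rather than part of our original analysis.

```python
from collections import defaultdict
from statistics import mean

# Made-up (weekday, items) pairs for illustration; not the bakery's data.
toy_transactions = [
    ("Mon", 1), ("Mon", 3),
    ("Tue", 2), ("Tue", 2), ("Tue", 1),
    ("Fri", 4), ("Fri", 2),
]


def mean_items_by_weekday(transactions):
    """Group item counts by weekday and return the mean for each day."""
    items_by_day = defaultdict(list)
    for week_day, n_items in transactions:
        items_by_day[week_day].append(n_items)
    return {day: round(mean(counts), 3) for day, counts in items_by_day.items()}


weekday_means = mean_items_by_weekday(toy_transactions)
```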
Observations
We can see that the majority of people each day buy between one and two items; few people buy more than three items per transaction.
The number of people who buy each number of items stays relatively consistent throughout the week.
We can see that Friday has a small spike in the number of items per transaction. This is analyzed further in the Inference for Multiple Regression section.
In our regression model, the outcome is the number of items per transaction at the bakery. Our numerical explanatory variable is the number of hours the bakery has been open, and our categorical explanatory variable is the weekday of each transaction.
We fit a parallel slopes model², so each day of the week has the same slope. For every additional hour since the bakery opened, there is an associated increase of, on average, 0.175 items per transaction.
We faceted the scatterplots by weekday to show the similarity of the intercepts without the clutter of having them all on the same plot.
The intercept is the estimated average number of items per transaction when the time since opening is 0 hours. Our baseline for comparison is Monday. The intercept for Tuesday is lower than the baseline, while the intercepts for Wednesday, Thursday, and Friday are all slightly higher.
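The structure of the parallel slopes model can be sketched as a small prediction function: one shared slope plus a separate intercept offset for each weekday. The coefficients below are approximated as the midpoints of the 95% confidence intervals reported in the Confidence Intervals table later in this report, which assumes those intervals are symmetric; they are not copied from the fitted regression output, and the names are hypothetical.

```python
# Point estimates approximated as midpoints of the 95% confidence intervals
# reported later in this report (an assumption: symmetric intervals).
INTERCEPT = (1.825 + 1.556) / 2   # baseline (Monday) intercept
SLOPE = (0.222 + 0.128) / 2       # shared slope, about 0.175 as quoted above
DAY_OFFSETS = {
    "Mon": 0.0,                   # baseline day, no offset
    "Tue": (0.050 + -0.155) / 2,
    "Wed": (0.156 + -0.052) / 2,
    "Thu": (0.161 + -0.042) / 2,
    "Fri": (0.167 + -0.028) / 2,
}


def predicted_items(week_day: str, hours_open: float) -> float:
    """Parallel slopes prediction: one slope, a different intercept per day."""
    return INTERCEPT + DAY_OFFSETS[week_day] + SLOPE * hours_open
```

Because the slope is shared, the predicted difference between any two days is the same at every hour; only the starting height of each line changes.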
In the eyeball test, we noticed that the mean number of items per transaction was about 2.1 and that the most common number of items per transaction was around 1 to 2. This gave a visual check on the median number of items per transaction computed prior to fitting this regression.
Potential limitations of our analysis:
The values in our number-of-items column are all integers, causing the striations in the graph; one does not simply buy half an item. A linear regression model may also not be the best representation of the relationship between our variables. The model shows a positive correlation between the time since opening and the number of items in a transaction; however, approximately an hour before closing, sales drop off completely. (Once we do a residual analysis, we will see that a line is not the best fit for the data.)
Every day of the week has the same slope, so the model predicts that the number of items in each transaction increases the longer the bakery is open, regardless of the day of the week. The intercepts did not have any practical application, since there would be no sales while the bakery is closed (time since opening = 0).
This regression is telling us two general points for our data:
There is a positive correlation between the number of items purchased per transaction and the time since opening, which suggests that the longer the bakery is open, the more items are purchased per transaction. However, based on our inspection of the original dataset and the EDA we performed, we believe this correlation may have no practical meaning. We will be able to tell more after a residual analysis.
The number of items per transaction does not vary much depending on the day. This was also seen in our EDA when we calculated the mean number of items per transaction grouped by day, and the means were all around 2.1 items per transaction.
P-Values
| term | p_value |
|---|---|
| intercept | 0.000 |
| time_since_opening_hrs | 0.000 |
| week_dayTue | 0.313 |
| week_dayWed | 0.326 |
| week_dayThu | 0.248 |
| week_dayFri | 0.160 |
All of the p-values for the differences in intercepts between each day of the week and the Monday baseline are larger than 0.05. This means that there is no statistically significant difference in the intercepts for each day of the week. In other words, there is no statistically significant difference between the number of items sold over time for each day of the week, even though we saw a spike in our data for Friday.
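The comparison against the 0.05 threshold can be written out explicitly. This is a minimal sketch that uses the rounded p-values from the table above; the variable names are hypothetical.

```python
ALPHA = 0.05  # significance level used in the text

# p-values copied from the table above (rounded to three decimals there).
p_values = {
    "intercept": 0.000,
    "time_since_opening_hrs": 0.000,
    "week_dayTue": 0.313,
    "week_dayWed": 0.326,
    "week_dayThu": 0.248,
    "week_dayFri": 0.160,
}

# True where the term is statistically significant at the chosen level.
significant = {term: p < ALPHA for term, p in p_values.items()}
```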
Confidence Intervals
| term | upper_ci | lower_ci |
|---|---|---|
| intercept | 1.825 | 1.556 |
| time_since_opening_hrs | 0.222 | 0.128 |
| week_dayTue | 0.050 | -0.155 |
| week_dayWed | 0.156 | -0.052 |
| week_dayThu | 0.161 | -0.042 |
| week_dayFri | 0.167 | -0.028 |
The regression table shows the upper and lower bounds of the confidence intervals for the intercept, the slope, and each day of the week compared to Monday. In general, a confidence interval acts as a “net”: a range of plausible values for the coefficient being estimated.
The 95% confidence interval for the fitted slope of this regression, the ‘time_since_opening_hrs’ term, does not include zero. This suggests that there is indeed a positive trend between time since opening and number of items sold at the bakery.
The confidence intervals for the different intercepts corresponding to the different days of the week are all 95% confidence intervals that do include the value of zero. This means that it is possible that there is no difference between the baseline intercept for Monday and the rest of the days of the week.
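The "does the interval include zero" check described above can be sketched as follows. The bounds are copied from the confidence interval table; the helper name `contains_zero` is hypothetical.

```python
# (lower, upper) bounds copied from the confidence interval table above.
confidence_intervals = {
    "intercept": (1.556, 1.825),
    "time_since_opening_hrs": (0.128, 0.222),
    "week_dayTue": (-0.155, 0.050),
    "week_dayWed": (-0.052, 0.156),
    "week_dayThu": (-0.042, 0.161),
    "week_dayFri": (-0.028, 0.167),
}


def contains_zero(lower: float, upper: float) -> bool:
    """An interval containing zero is consistent with no effect."""
    return lower <= 0 <= upper


zero_in_interval = {
    term: contains_zero(lower, upper)
    for term, (lower, upper) in confidence_intervals.items()
}
```

This agrees with the p-value results: the slope interval excludes zero, while every weekday interval includes it.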
Residual Analysis
The histogram shown above of the residual values is centered at zero, but has a large right skew, which suggests that the predicted values from the model are further and further from the observed values as time goes on during the day.
The scatterplot of our residuals has a clear pattern to it and shows that the fitted regression is not a good predictor of the observed values. The fitted regression model shows an increase in the number of items bought per transaction over time. However, our residuals suggest that as the model predicts larger numbers, its error relative to the observed values grows.
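For reference, a residual is the observed value minus the fitted value, and a right skew shows up as a positive skewness statistic. The sketch below illustrates both on invented numbers; the observed and fitted values are toy data, not our regression output, and the moment-based skewness formula is one common choice rather than the one used by any particular package.

```python
from statistics import mean


def residuals(observed, fitted):
    """Residual = observed value minus the model's fitted value."""
    return [obs - fit for obs, fit in zip(observed, fitted)]


def skewness(values):
    """Moment-based sample skewness; positive values indicate a right skew."""
    centre = mean(values)
    n = len(values)
    second = sum((v - centre) ** 2 for v in values) / n
    third = sum((v - centre) ** 3 for v in values) / n
    return third / second ** 1.5


# Made-up observed item counts and fitted values, for illustration only.
toy_observed = [1, 1, 2, 2, 1, 6]
toy_fitted = [1.8, 1.9, 2.0, 2.1, 2.2, 2.3]
toy_residuals = residuals(toy_observed, toy_fitted)
toy_skew = skewness(toy_residuals)
```

In this toy example, one unusually large purchase produces a single large positive residual, which is enough to pull the skewness above zero.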
We decided to investigate whether the average number of items purchased per transaction increased more over the course of the day for different days of the week. Based on the multiple regression model, confidence intervals, and p-values, it appears that there is a positive trend between the number of hours the bakery has been open and the number of items sold per transaction. However, after examining the p-values and confidence intervals for each day of the week, we found no statistically significant difference in the number of items sold for each day of the week.
From this information, we can conclude that the bakery sells a consistent number of items each weekday.
One caveat to keep in mind is that this data takes on the appearance of a normal bell curve, so a linear regression is not the best model for it. We drew this conclusion from our residual graphs. There was a large right skew in our residual histogram. There was also a clear pattern in the residual scatterplot, indicating that the values predicted by the linear model showed increasing error relative to the actual data the longer the bakery had been open on a given day. This means that the model predicted that more items would be sold at a later hour, but this wasn’t always true. We could see that after the “breakfast rush”, there weren’t nearly as many transactions or as many items bought per transaction.
If we were to work with this data further, we would want to break the transactions down by item to see whether sales of certain items increased throughout the day, allowing us to find the more popular items the store supplies. We would also try to fit our data to a bell-curve model instead of a linear model, because we think it may represent our data better than the linear model we fit here. This may provide a more useful analysis of which hours are most profitable for the bakery on a given weekday.