1 Analyzing Growth Drivers in Coffee Shops

2 Introduction

This report presents a preliminary analysis of a dataset of coffee shops in three U.S. cities. The objective is to identify patterns in shop location, quality ratings, and growth that may inform strategic decisions about future locations and marketing approaches. I find that location, specifically the growth rate of a shop's city and of its street address, is the best predictor of coffee shop growth. On the test data, my linear regression model achieves an R² of 0.85 and my decision tree model an R² of 0.82.

3 Data Overview

The dataset contains the following columns:

  • name: Coffee shop name
  • city: City location
  • street: Street address
  • years_in_business: Years the shop has operated
  • coffee_quality: Rated as bad, ok, or good
  • growing: 1 = growing, 0 = not growing

Derived variables added for analysis:

  • city_growth_rate: Average growth rate for all shops in the same city
  • city_growth_rank: Rank of the city based on growth rate (higher is better)
  • smoothed_growth_rate: Growth rate for the street, adjusted using empirical Bayes smoothing to reduce the impact of low-frequency streets
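The report does not state the exact smoothing formula, so the following is only a minimal sketch of one common shrinkage form: each street's raw rate is pulled toward the overall growth rate, with more pull for streets that have few shops. The toy data and the `prior_strength` value are assumptions for illustration; a fuller empirical Bayes treatment would estimate the prior from the data instead of fixing it.

```r
# Toy data (hypothetical, not the report's dataset)
shops <- data.frame(
  street  = c("A St", "A St", "A St", "A St", "B Rd", "C Ave"),
  growing = c(1, 0, 1, 0, 1, 0)
)

# Growing shops and total shops per street
n_grow <- tapply(shops$growing, shops$street, sum)
n_shop <- tapply(shops$growing, shops$street, length)

# Prior: overall growth rate across all shops
prior_rate <- mean(shops$growing)

# Assumed tuning value: acts like this many "pseudo-shops" at the prior rate
prior_strength <- 2

# Shrink each street's raw rate toward the prior
smoothed <- (n_grow + prior_strength * prior_rate) / (n_shop + prior_strength)

street_stats <- data.frame(
  street               = names(n_shop),
  count                = as.integer(n_shop),
  growth_rate          = as.numeric(n_grow / n_shop),
  smoothed_growth_rate = as.numeric(smoothed)
)
print(street_stats)
```

In this sketch a single-shop street with a raw rate of 1 or 0 ends up closer to the overall rate, while the four-shop street is barely moved.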

There are 30 records and 6 variables.

Methods

Data Cleaning and Transformation

## # A tibble: 22 × 3
##    street         growth_rate count
##    <chr>                <dbl> <int>
##  1 101 First Ave         0        1
##  2 1010 Willow Dr        0        1
##  3 123 Main St           0.5      6
##  4 135 Pine Rd           0        1
##  5 147 Birch Way         1        1
##  6 159 Willow Ln         1        1
##  7 258 Spruce Ct         1        1
##  8 303 Third Rd          0        1
##  9 334 Pine St           0.25     4
## 10 369 Cedar Blvd        1        1
## # ℹ 12 more rows

To enable correlation analysis and modeling, categorical variables were transformed into numeric format; for example, coffee_quality was recoded as an ordinal variable. Growth rates were calculated by street and by city, and the street-level rate was smoothed to reduce the influence of streets with only one or a few shops. These new features were then joined back to the main dataset. A numeric-only version of the dataset (t_numeric) was created to support correlation matrix visualization and predictive modeling.
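The transformation steps described above can be sketched roughly as follows. This uses base R and hypothetical toy rows; the 1/2/3 coding for coffee_quality and the column name `coffee_quality_num` are assumptions, since the report does not list the exact values used.

```r
# Toy data (hypothetical, not the report's dataset)
shops <- data.frame(
  name              = c("Shop 1", "Shop 2", "Shop 3", "Shop 4"),
  city              = c("X", "X", "Y", "Y"),
  coffee_quality    = c("bad", "good", "ok", "good"),
  years_in_business = c(2, 5, 1, 8),
  growing           = c(0, 1, 1, 1),
  stringsAsFactors  = FALSE
)

# Ordinal recode (assumed coding): bad = 1, ok = 2, good = 3
shops$coffee_quality_num <- as.integer(
  factor(shops$coffee_quality, levels = c("bad", "ok", "good"), ordered = TRUE)
)

# City-level growth rate: mean of the binary growing flag per city
city_rates <- aggregate(growing ~ city, data = shops, FUN = mean)
names(city_rates)[2] <- "city_growth_rate"

# Join the derived feature back to the main dataset
shops <- merge(shops, city_rates, by = "city")

# Numeric-only version for correlation analysis and modeling
t_numeric <- shops[sapply(shops, is.numeric)]
print(names(t_numeric))
```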

3.1 Correlations

A correlation matrix was generated to explore relationships between numeric variables. Of all the variables, city_growth_rate and smoothed_growth_rate showed the strongest positive correlations with the growing variable. These two features were therefore used in the linear regression and decision tree models to assess how well they predict coffee shop growth.
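As a rough illustration of this step, the sketch below computes a correlation matrix on a small made-up numeric table and ranks the other variables by their correlation with growing. The values are hypothetical and only show the mechanics, not the report's results.

```r
# Toy numeric data (hypothetical, not the report's t_numeric)
t_numeric <- data.frame(
  growing              = c(0, 0, 1, 1, 0, 1),
  smoothed_growth_rate = c(0.2, 0.3, 0.7, 0.8, 0.4, 0.6),
  city_growth_rate     = c(0.3, 0.3, 0.6, 0.6, 0.3, 0.6),
  years_in_business    = c(3, 10, 5, 2, 7, 4)
)

# Pearson correlation matrix of all numeric columns
cor_mat <- cor(t_numeric)

# Correlation of each other variable with growing, strongest first
growing_cor <- cor_mat[, "growing"]
growing_cor <- growing_cor[names(growing_cor) != "growing"]
print(sort(growing_cor, decreasing = TRUE))
```

Because the rate features are built from the growing flag itself, a high correlation here is partly expected by construction.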

4 Histograms

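The 30 shops are split evenly across three cities (10 each). At the street level, only two addresses host more than one shop: 123 Main St (6 shops) and 334 Pine St (4 shops). The remaining 20 addresses have a single shop each, so street-level growth rates for most locations rest on one observation.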

## 
##    Oakville    Riverton Springfield 
##          10          10          10
## 
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way 
##               1               1               6               1               1 
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd 
##               1               1               1               4               1 
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way 
##               1               1               1               1               1 
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln 
##               1               1               1               1               1 
##   909 Ninth Ave   990 Spruce Pl 
##               1               1
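Counts like the ones above can be produced with `table()`. The sketch below uses hypothetical rows and also computes the share of shops located on streets with more than one shop, which is one simple way to quantify concentration.

```r
# Toy data (hypothetical, not the report's dataset)
shops <- data.frame(
  city   = c("X", "X", "Y", "Z", "Z", "Z"),
  street = c("1 A St", "1 A St", "2 B Rd", "1 A St", "3 C Ave", "3 C Ave")
)

# Shops per city and per street
city_counts   <- table(shops$city)
street_counts <- sort(table(shops$street), decreasing = TRUE)

# Streets with more than one shop, and the share of all shops they hold
multi_shop_streets <- street_counts[street_counts > 1]
share_on_multi     <- sum(multi_shop_streets) / nrow(shops)

print(city_counts)
print(street_counts)
print(share_on_multi)
```

Passing either table to `barplot()` would give a bar chart of the same counts.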

5 Predicting growth

Growth was predicted using a linear model based on the smoothed street-level and city-level growth rates. Both predictors are statistically significant. The model achieved an R² of 0.85 and an RMSE of 0.19 on the test set, compared with an R² of 0.70 on the training data.

## 
## Call:
## lm(formula = growing ~ smoothed_growth_rate + city_growth_rate, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49696 -0.14808  0.03807  0.19280  0.86986 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.9793     0.2656  -3.687 0.001370 ** 
## smoothed_growth_rate   2.5119     0.6520   3.853 0.000923 ***
## city_growth_rate       0.6205     0.2128   2.916 0.008258 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2907 on 21 degrees of freedom
## Multiple R-squared:  0.7021, Adjusted R-squared:  0.6738 
## F-statistic: 24.75 on 2 and 21 DF,  p-value: 3.001e-06
## [1] "Test RMSE: 0.19"
## [1] "Test R-squared: 0.854"
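The fit-and-evaluate workflow behind the output above can be sketched as follows. The data here are simulated and the 24/6 split is an assumption (24 training rows matches the 21 residual degrees of freedom reported, but the report does not describe how the split was made), so the numbers this produces will not match the report.

```r
set.seed(42)

# Simulated data (hypothetical): two rate features and a binary outcome
n <- 30
d <- data.frame(
  smoothed_growth_rate = runif(n),
  city_growth_rate     = runif(n)
)
d$growing <- as.integer(
  d$smoothed_growth_rate + d$city_growth_rate + rnorm(n, sd = 0.2) > 1
)

# Assumed split: 24 rows for training, the remaining 6 for testing
train_idx <- sample(seq_len(n), size = 24)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Same formula as in the report's model summary
fit  <- lm(growing ~ smoothed_growth_rate + city_growth_rate, data = train)
pred <- predict(fit, newdata = test)

# Test-set RMSE and R-squared (1 - SSE / SST on the test rows)
rmse <- sqrt(mean((test$growing - pred)^2))
r2   <- 1 - sum((test$growing - pred)^2) /
            sum((test$growing - mean(test$growing))^2)

print(paste("Test RMSE:", round(rmse, 2)))
print(paste("Test R-squared:", round(r2, 3)))
```

With a test set this small, the R² value can shift noticeably from one random split to another.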

6 Decision Tree

A decision tree was built using smoothed_growth_rate and city_growth_rate to predict the growing variable. The model provides an interpretable, rule-based structure for identifying growth patterns. The tree visualization highlights key thresholds in the data, showing how location-based growth rates influence the likelihood of a shop growing.

6.0.1 Evaluate the model

## Test RMSE: 0.21
## Test R-squared: 0.824

The decision tree model achieved an R² of 0.82 (RMSE 0.21) on the test set, indicating good predictive accuracy. The linear regression model performed slightly better, with a test R² of 0.85 (RMSE 0.19).
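A tree of this kind could be fit and scored along the following lines. This is a sketch only: it assumes the `rpart` package with `method = "anova"` (a regression tree, which fits the RMSE and R² metrics reported), and the data, split, and control settings are hypothetical.

```r
library(rpart)

set.seed(42)

# Simulated data (hypothetical): two rate features and a binary outcome
n <- 30
d <- data.frame(
  smoothed_growth_rate = runif(n),
  city_growth_rate     = runif(n)
)
d$growing <- as.integer(
  d$smoothed_growth_rate + d$city_growth_rate + rnorm(n, sd = 0.2) > 1
)

# Assumed split: 24 training rows, 6 test rows
train_idx <- sample(seq_len(n), size = 24)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Regression tree; minsplit is lowered (assumed value) because the data are small
tree_fit <- rpart(
  growing ~ smoothed_growth_rate + city_growth_rate,
  data    = train,
  method  = "anova",
  control = rpart.control(minsplit = 5, cp = 0.01)
)

# Predicted values on the held-out rows
tree_pred <- predict(tree_fit, newdata = test)

tree_rmse <- sqrt(mean((test$growing - tree_pred)^2))
tree_r2   <- 1 - sum((test$growing - tree_pred)^2) /
                 sum((test$growing - mean(test$growing))^2)

cat("Test RMSE:", round(tree_rmse, 2), "\n")
cat("Test R-squared:", round(tree_r2, 3), "\n")
```

Printing `tree_fit` lists the split thresholds, which is the rule-based structure the text refers to.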

7 Limitations

  • Sample Size and Distribution: The dataset includes a small number of records, with several streets appearing only once. This may introduce noise and limit the reliability of location-based insights.
  • Simplified Categorical Encoding: Coffee quality was transformed into an ordinal numeric scale, which may oversimplify subjective or qualitative aspects of customer experience.
  • Model Overfitting Risk: The strong test R² scores come from a small dataset and a small test set, so they may not generalize. Overfitting is a particular risk for the decision tree model.
  • Omitted Variables: Important external factors such as competition, marketing, pricing, and seasonal trends were not included, though they likely affect growth.
  • Binary Growth Definition: Modeling growth as a binary outcome (growing) does not account for degrees of business success or longitudinal performance.

8 Recommendations

  • Expand the Dataset: Collect additional data across more coffee shops and time periods to improve model robustness and generalizability.
  • Include External Factors: Incorporate variables such as pricing, customer reviews, competition density, marketing activity, and seasonal trends to better capture growth dynamics.
  • Refine Growth Measurement: Use a more granular growth metric (e.g., revenue, foot traffic, or sales over time) rather than a binary indicator.
  • Address Low-Frequency Locations: Beyond the smoothing already applied, consider aggregating single-shop streets into larger areas (e.g., neighborhoods) to reduce noise.
  • Model Validation: Apply cross-validation rather than relying on a single train/test split, and test the model on newly collected data to better assess its predictive power and minimize overfitting.
  • Explore Additional Models: Compare performance across various algorithms (e.g., logistic regression, random forests, gradient boosting) to ensure the best fit for the problem.
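As one way to act on the model validation recommendation, the sketch below runs 5-fold cross-validation for the linear model in base R. The data are simulated and the choice of `k = 5` is an assumption; it is meant to show the mechanics rather than a definitive validation setup.

```r
set.seed(42)

# Simulated data (hypothetical): two rate features and a binary outcome
n <- 30
d <- data.frame(
  smoothed_growth_rate = runif(n),
  city_growth_rate     = runif(n)
)
d$growing <- as.integer(
  d$smoothed_growth_rate + d$city_growth_rate + rnorm(n, sd = 0.2) > 1
)

# Randomly assign each row to one of k folds of equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, score on the held-out fold
cv_rmse <- sapply(seq_len(k), function(fold) {
  train <- d[fold_id != fold, ]
  test  <- d[fold_id == fold, ]
  fit   <- lm(growing ~ smoothed_growth_rate + city_growth_rate, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$growing - pred)^2))
})

print(round(cv_rmse, 3))
cat("Mean CV RMSE:", round(mean(cv_rmse), 3), "\n")
```

Because the rate features are derived from the growing flag, recomputing them from the training rows inside each fold would keep the held-out shops' outcomes from leaking into their own predictors.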

9 References

  • ChatGPT used to proofread text and generate and enhance code.
  • Cheatsheet
  • Previous Rmd files given in class
  • R documentation