This report presents a preliminary analysis of a dataset containing information about coffee shops across various U.S. cities. The objective is to identify patterns in shop density, ratings, and pricing that may inform strategic decisions about future locations and marketing approaches. I find that location, specifically the city-level growth rate and the smoothed street-level growth rate, is the best predictor of whether a coffee shop is growing. On the test data, the linear regression model reaches an R² of 0.85 and the decision tree model reaches 0.82.
The dataset contains the following columns:

- name: Coffee shop name
- city: City location
- street: Street address
- years_in_business: Years the shop has operated
- coffee_quality: Rated as bad, ok, or good
- growing: 1 = growing, 0 = not growing

Derived variables added for analysis:

- city_growth_rate: Average growth rate for all shops in the same city
- city_growth_rank: Rank of the city based on growth rate (higher is better)
- smoothed_growth_rate: Growth rate for the street, adjusted using empirical Bayes smoothing to reduce the impact of low-frequency streets

There are 30 records and 6 variables.

# Methods

## Data Cleaning and Transformation

## # A tibble: 22 × 3
## street growth_rate count
## <chr> <dbl> <int>
## 1 101 First Ave 0 1
## 2 1010 Willow Dr 0 1
## 3 123 Main St 0.5 6
## 4 135 Pine Rd 0 1
## 5 147 Birch Way 1 1
## 6 159 Willow Ln 1 1
## 7 258 Spruce Ct 1 1
## 8 303 Third Rd 0 1
## 9 334 Pine St 0.25 4
## 10 369 Cedar Blvd 1 1
## # ℹ 12 more rows
To enable correlation analysis and modeling, categorical variables
were transformed into numeric format. For example,
coffee_quality was recoded as an ordinal variable. Growth
rates were calculated as the share of growing shops on each street and
in each city, with a smoothed street-level growth rate applied to reduce
the influence of low-frequency streets. These new features were then
joined back to the main dataset. A numeric-only version of the dataset
(t_numeric) was created to support correlation matrix
visualization and predictive modeling.
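
The transformation steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this report: the toy data frame `shops`, the `quality_score` column, and the `prior_strength` pseudo-count are assumptions, and the smoothing shown is a simple shrinkage toward the overall growth rate that stands in for the empirical Bayes step (a full empirical Bayes fit would estimate the prior from the data).

```r
# Toy data standing in for the real dataset (values are illustrative only).
shops <- data.frame(
  street = c("123 Main St", "123 Main St", "123 Main St", "123 Main St",
             "334 Pine St", "334 Pine St", "101 First Ave", "147 Birch Way"),
  coffee_quality = c("good", "ok", "bad", "good", "ok", "good", "bad", "good"),
  growing = c(1, 0, 1, 0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Recode quality as an ordered integer score (assumed coding: bad = 1, ok = 2, good = 3).
shops$quality_score <- as.integer(
  factor(shops$coffee_quality, levels = c("bad", "ok", "good"), ordered = TRUE)
)

# Raw street-level growth rate and number of shops per street.
street_sum <- tapply(shops$growing, shops$street, sum)
street_n <- tapply(shops$growing, shops$street, length)
street_stats <- data.frame(
  street = names(street_n),
  growth_rate = as.numeric(street_sum / street_n),
  count = as.integer(street_n),
  stringsAsFactors = FALSE
)

# Shrink each street's rate toward the overall rate. Streets with few shops
# move the most. prior_strength is a hypothetical pseudo-count.
prior_strength <- 2
global_rate <- mean(shops$growing)
street_stats$smoothed_growth_rate <-
  (street_stats$growth_rate * street_stats$count + prior_strength * global_rate) /
  (street_stats$count + prior_strength)

# Join the street-level features back to the shop-level data.
shops <- merge(shops, street_stats, by = "street")
print(street_stats)
```

With this kind of shrinkage, a street with a single shop no longer gets an extreme rate of exactly 0 or 1, which is the behavior the smoothed feature is meant to provide.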
A correlation matrix was generated to explore relationships between
numeric variables. Of these,
city_growth_rate and
smoothed_growth_rate showed the strongest positive
correlations with the growing variable. These two features
were subsequently used in the linear regression and decision tree models
to assess their predictive impact on coffee shop growth.
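
A correlation screen of this kind might look like the sketch below. The values in `t_numeric` here are made up for illustration, and the `quality_score` column name is an assumption. Only the column names taken from the dataset description come from this report.

```r
# Illustrative numeric-only data; these are not the report's actual values.
t_numeric <- data.frame(
  years_in_business = c(2, 5, 8, 3, 10, 1, 7, 4),
  quality_score = c(1, 2, 3, 2, 3, 1, 2, 3),
  city_growth_rate = c(0.3, 0.3, 0.6, 0.6, 0.6, 0.3, 0.6, 0.3),
  smoothed_growth_rate = c(0.2, 0.4, 0.7, 0.5, 0.8, 0.3, 0.6, 0.4),
  growing = c(0, 0, 1, 1, 1, 0, 1, 0)
)

# Pearson correlation matrix across all numeric columns.
cor_matrix <- cor(t_numeric)

# Correlation of each feature with the outcome, strongest first.
growing_cor <- sort(
  cor_matrix["growing", colnames(cor_matrix) != "growing"],
  decreasing = TRUE
)
print(round(growing_cor, 2))
```

Sorting the `growing` row makes it easy to see which features are candidates for the models.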
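The shops are split evenly across the three cities, with 10 in each. Street-level counts are uneven: 123 Main St has 6 shops and 334 Pine St has 4, while each of the remaining 20 streets has a single shop. This clustering on two streets may point to localized competition as well as growth opportunities.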
##
## Oakville Riverton Springfield
## 10 10 10
##
## 101 First Ave 1010 Willow Dr 123 Main St 135 Pine Rd 147 Birch Way
## 1 1 6 1 1
## 159 Willow Ln 258 Spruce Ct 303 Third Rd 334 Pine St 369 Cedar Blvd
## 1 1 1 4 1
## 404 Fourth Blvd 445 Cedar Rd 456 Elm St 556 Elm Blvd 667 Maple Way
## 1 1 1 1 1
## 707 Seventh Ct 778 Birch Ct 789 Oak Ave 808 Eighth Way 889 Hickory Ln
## 1 1 1 1 1
## 909 Ninth Ave 990 Spruce Pl
## 1 1
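Growth was predicted using a linear model based on the smoothed street-level and city-level growth rates. The model achieved an R² of 0.85 on the test set (RMSE 0.19), compared with 0.70 on the training data, and both coefficients are positive and statistically significant (p < 0.01). Because the model was fit on 24 observations, the test set holds only about six records, so the test R² should be read as a rough estimate.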
##
## Call:
## lm(formula = growing ~ smoothed_growth_rate + city_growth_rate,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49696 -0.14808 0.03807 0.19280 0.86986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9793 0.2656 -3.687 0.001370 **
## smoothed_growth_rate 2.5119 0.6520 3.853 0.000923 ***
## city_growth_rate 0.6205 0.2128 2.916 0.008258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2907 on 21 degrees of freedom
## Multiple R-squared: 0.7021, Adjusted R-squared: 0.6738
## F-statistic: 24.75 on 2 and 21 DF, p-value: 3.001e-06
## [1] "Test RMSE: 0.19"
## [1] "Test R-squared: 0.854"
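
For readers who want to reproduce this kind of evaluation, the sketch below fits the same model formula on synthetic data and computes test RMSE and R². The data, the threshold rule that generates `growing`, the fixed hold-out rows, and the helper `test_metrics` are all assumptions for illustration. In particular, R² is computed here as 1 - SSE/SST on the test set, which may differ from the calculation used for the figures above.

```r
# Synthetic stand-in for the modeling data (30 rows, deterministic).
toy <- data.frame(
  smoothed_growth_rate = rep(c(0.2, 0.35, 0.5, 0.65, 0.8), times = 6),
  city_growth_rate = rep(c(0.3, 0.5, 0.6), each = 10)
)
# Invented rule so the outcome depends on both features.
toy$growing <- as.integer(
  toy$smoothed_growth_rate + 0.5 * toy$city_growth_rate > 0.75
)

# Fixed hold-out of 6 rows, leaving 24 for training (the split is hypothetical).
test_idx <- c(1, 7, 13, 19, 25, 28)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Same formula as the model summary in this report.
fit <- lm(growing ~ smoothed_growth_rate + city_growth_rate, data = train)
pred <- predict(fit, newdata = test)

# Hypothetical helper: RMSE and R-squared on held-out data.
test_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  c(rmse = sqrt(sse / length(actual)), r_squared = 1 - sse / sst)
}

metrics <- test_metrics(test$growing, pred)
print(round(metrics, 3))
```

Note that R² computed this way can be negative on a test set when the model predicts worse than the test-set mean, so it is not bounded below by zero.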
A decision tree was built using
smoothed_growth_rate and
city_growth_rate to predict the growing variable. The model
provides an interpretable, rule-based structure for identifying growth
patterns. The tree visualization highlights key thresholds in the data,
showing how location-based growth rates influence the likelihood of a
shop growing.
## Test RMSE: 0.21
## Test R-squared: 0.824
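
A tree of this kind could be fit as in the sketch below. It assumes the `rpart` package and a regression ("anova") tree on the 0/1 outcome, which is consistent with reporting RMSE and R² but is not confirmed as the setup used for this report. The synthetic data, the hold-out rows, and the control settings are likewise illustrative.

```r
library(rpart)

# Synthetic stand-in for the modeling data (30 rows, deterministic).
toy <- data.frame(
  smoothed_growth_rate = rep(c(0.2, 0.35, 0.5, 0.65, 0.8), times = 6),
  city_growth_rate = rep(c(0.3, 0.5, 0.6), each = 10)
)
# Invented rule so the outcome depends on both features.
toy$growing <- as.integer(
  toy$smoothed_growth_rate + 0.5 * toy$city_growth_rate > 0.75
)

# Hypothetical fixed hold-out of 6 rows, leaving 24 for training.
test_idx <- c(1, 7, 13, 19, 25, 28)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Regression tree on the binary outcome. A small minsplit lets the tree
# split at all on a dataset this small (the settings are assumptions).
fit_tree <- rpart(
  growing ~ smoothed_growth_rate + city_growth_rate,
  data = train,
  method = "anova",
  control = rpart.control(minsplit = 5, cp = 0.01)
)

# Each prediction is the mean of growing in the matching leaf.
pred_tree <- predict(fit_tree, newdata = test)

sse <- sum((test$growing - pred_tree)^2)
sst <- sum((test$growing - mean(test$growing))^2)
tree_rmse <- sqrt(sse / nrow(test))
tree_r2 <- 1 - sse / sst
print(fit_tree)
```

Printing the fitted object lists the split thresholds on each feature, which is the rule-based view that the tree plot shows graphically.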
The decision tree model achieved an R² of 0.82 on the test set (RMSE 0.21), indicating good predictive accuracy. The linear regression model performed slightly better, with a test R² of 0.85 and an RMSE of 0.19.
The binary outcome variable (growing) does not account for degrees of
business success or longitudinal performance.