1 Introduction

This report presents an analysis of data from 30 coffee shops across three cities (Springfield, Riverton, and Oakville). The objective is to build a predictive model that identifies factors associated with growing coffee shops and can generalize to new locations. We’ll examine how variables such as years in business, coffee quality, location, and other factors correlate with a shop’s growth status.

## Section Changes
#   Fixed typo in introduction
#   Rewrote the introduction to clearly state the objective
#   Removed the unsupported claim about r^2 value of .77
#   Added proper data overview and visualization

2 Data Overview

The dataset contains information on 30 coffee shops with the following variables:

## Section Changes 
#    Created a proper data overview with a table showing variable types and descriptions
#    This helps to understand the data structure before analysis

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

variable_info <- tribble(
  ~Variable, ~Type, ~Description,
  "name", "Character", "Name of the coffee shop",
  "street", "Character", "Street address of the shop",
  "city", "Factor", "City where the shop is located (Springfield, Riverton, or Oakville)",
  "years_in_business", "Numeric", "Number of years the shop has been operating",
  "coffee_quality", "Ordered Factor", "Quality rating of coffee (bad < ok < good)",
  "growing", "Factor", "Whether the shop is growing (Yes) or not (No)",
  "address_count", "Numeric", "Number of shops sharing the same street address"
)

kable(variable_info, caption = "Variables in the Coffee Shop Dataset")

Variables in the Coffee Shop Dataset
Variable	Type	Description
name	Character	Name of the coffee shop
street	Character	Street address of the shop
city	Factor	City where the shop is located (Springfield, Riverton, or Oakville)
years_in_business	Numeric	Number of years the shop has been operating
coffee_quality	Ordered Factor	Quality rating of coffee (bad < ok < good)
growing	Factor	Whether the shop is growing (Yes) or not (No)
address_count	Numeric	Number of shops sharing the same street address
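
The report does not show how address_count was derived. As a minimal sketch, assuming it is simply the number of rows that share the same street value, it could be computed in base R as follows. The shops data frame and its shop and street names are invented for illustration and are not taken from the dataset.

```r
# Hypothetical toy data: shop names and streets are invented for illustration
shops <- data.frame(
  name   = c("Bean A", "Bean B", "Bean C", "Bean D"),
  street = c("1 Main St", "1 Main St", "2 Oak Ave", "1 Main St"),
  stringsAsFactors = FALSE
)

# For every row, count how many rows share the same street value
shops$address_count <- ave(rep(1, nrow(shops)), shops$street, FUN = sum)

print(shops)
```

A shop that shares its street value with no other shop gets a count of 1 under this definition.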

2.1 Data Distribution

Let’s examine the distribution of coffee shops across cities and their characteristics:

## Section Changes
#    Added proper visualization of city distribution with ggplot
#    Used geom_bar, which counts rows per category, instead of table/hist
#    Added proper titles and labels
#    These are all sections I added for the histograms
#    Better to use ggplot2 than base R graphics

ggplot(t, aes(x = city, fill = city)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(
    title = "Number of Coffee Shops by City",
    x = "City",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

## Section Changes
#   Added visualization of coffee quality by city
#   This shows the distribution of quality ratings across different locations
#   Using facets to allow for comparison

ggplot(t, aes(x = coffee_quality, fill = coffee_quality)) +
  geom_bar() +
  facet_wrap(~city) +
  labs(
    title = "Coffee Quality Distribution by City",
    x = "Coffee Quality",
    y = "Count"
  ) +
  theme_minimal()

## Section Changes
# Created a Years in Business histogram
# With proper labels and formatting

ggplot(t, aes(x = years_in_business)) +
  geom_histogram(bins = 10, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of Years in Business",
    x = "Years in Business",
    y = "Number of Shops"
  ) +
  theme_minimal()

## Section Changes
# Created a proper visualization for the growing variable
# Added breakdown by city for more insight
# Used appropriate bar chart instead of histogram for categorical data


ggplot(t, aes(x = growing, fill = city)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Growing vs. Non-Growing Coffee Shops by City",
    x = "Growing Status",
    y = "Count"
  ) +
  theme_minimal()

2.2 Correlations

## Section Changes
# Fixed the correlation analysis
# Only including numeric variables to avoid errors
# Created a proper correlation plot with visualization
# Added interpretation of the results

# Corrected correlation approach
# First, let's inspect what values are actually in the growing column
print("Values in growing column:")

## [1] "Values in growing column:"

print(table(t$growing))

## 
##  No Yes 
##  16  14

# Create properly converted numeric variables
numeric_data <- data.frame(
  years_in_business = t$years_in_business,
  address_count = as.numeric(as.character(t$address_count)), # as.character first, so a factor yields its labels rather than level codes
  growing_numeric = ifelse(t$growing == "Yes", 1, 0)
)

# Verify all columns are numeric
str(numeric_data)

## 'data.frame':    30 obs. of  3 variables:
##  $ years_in_business: num  5 8 3 6 2 4 1 7 9 5 ...
##  $ address_count    : num  6 1 1 1 6 1 1 1 6 1 ...
##  $ growing_numeric  : num  1 1 1 0 1 1 1 1 1 1 ...

# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
print(cor_matrix)

##                   years_in_business address_count growing_numeric
## years_in_business        1.00000000   -0.05405859      0.01210747
## address_count           -0.05405859    1.00000000     -0.05191741
## growing_numeric          0.01210747   -0.05191741      1.00000000

The correlation analysis shows:

Years in business vs. growth status: r = 0.012, a negligible positive correlation, so there is essentially no linear relationship between time in operation and growth.

Address count vs. growth status: r = -0.052, a negligible negative correlation. Shops that share an address with other shops are not measurably less likely to be growing; with 30 shops, a correlation this small is indistinguishable from zero.
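
To put these coefficients in context, the sketch below computes the smallest absolute correlation that would be significant at the 0.05 level (two-sided) with 30 observations, using the standard t-test for a Pearson correlation. The two observed values are copied from the correlation matrix above; the variable names are illustrative.

```r
n  <- 30          # number of coffee shops
df <- n - 2       # degrees of freedom for a Pearson correlation test

# Critical t value for a two-sided test at alpha = 0.05
t_crit <- qt(0.975, df)

# Convert the critical t value into the equivalent critical correlation
r_crit <- t_crit / sqrt(t_crit^2 + df)

# Correlations with growing_numeric, copied from the matrix above
r_observed <- c(years_in_business = 0.01210747, address_count = -0.05191741)

print(round(r_crit, 3))
print(abs(r_observed) > r_crit)
```

Both observed correlations fall far below this threshold, which supports reading them as no detectable linear relationship.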

## Section Changes
# Added analysis of coffee quality vs. growing status
# This helps understand if better coffee quality is associated with growth
# Identified what the analysis shows. 

ggplot(t, aes(x = coffee_quality, fill = growing)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Growing Shops by Coffee Quality",
    x = "Coffee Quality",
    y = "Proportion"
  ) +
  theme_minimal()

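The plot suggests that coffee quality is associated with shop growth: a higher proportion of “good” quality shops are growing than “bad” quality shops. With only 30 shops split across three quality levels, each bar rests on a small number of shops, so the pattern is descriptive rather than conclusive.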
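
The proportions drawn by a filled bar chart can also be tabulated directly, which makes the underlying counts visible. The sketch below uses a small made-up data frame (toy), not the report's data, and assumes the same column names and factor levels.

```r
# Hypothetical toy data with the same column names and levels as the report
toy <- data.frame(
  coffee_quality = factor(
    c("bad", "bad", "ok", "ok", "good", "good", "good", "bad"),
    levels = c("bad", "ok", "good"), ordered = TRUE
  ),
  growing = factor(
    c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
    levels = c("No", "Yes")
  )
)

# Cross-tabulate quality (rows) against growth status (columns)
counts <- table(toy$coffee_quality, toy$growing)

# Row-wise proportions: share of growing and non-growing shops per quality level
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 2))
```

Printing the counts next to the proportions shows how many shops sit behind each bar.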

3 Predictive Modeling

Before building our model, we need to properly split our data into training and testing sets to ensure our model can generalize to new data:

## Section Changes
# Fixed typo in predictive modeling and fixed sentence to sound better
# Added proper train/test split
# This is critical for model validation and ensuring generalizability
# Used the caret package for proper stratified sampling
library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

set.seed(123) # For reproducibility
train_index <- createDataPartition(t$growing, p = 0.7, list = FALSE)
train_data <- t[train_index, ]
test_data <- t[-train_index, ]

cat("Training set size:", nrow(train_data), "shops\n")

## Training set size: 22 shops

cat("Testing set size:", nrow(test_data), "shops\n")

## Testing set size: 8 shops
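
The 22/8 split follows from sampling within each class rather than from the data as a whole. The base-R sketch below illustrates that idea on made-up labels with the same class counts as the report (16 No, 14 Yes). It is an approximation of stratified sampling, not caret's implementation, and rounding up within each class is an assumption chosen to match the sizes reported above; it will not select the same rows as createDataPartition.

```r
set.seed(123)  # for reproducibility of this sketch only

# Hypothetical labels with the same class counts as the report
growing <- factor(rep(c("No", "Yes"), times = c(16, 14)), levels = c("No", "Yes"))
p <- 0.7

# Within each class, sample ceiling(p * class size) row indices
train_index <- unlist(lapply(split(seq_along(growing), growing), function(idx) {
  idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
}))

train_labels <- growing[train_index]
test_labels  <- growing[-train_index]

print(table(train_labels))
print(table(test_labels))
```

Sampling per class keeps the No/Yes balance similar in both sets, which is why the test set contains equal numbers of growing and non-growing shops.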

4 Logistic Regression Model

## Section Changes
# Changed from linear to logistic regression for binary outcome
# Removed problematic 'street' variable (too many levels)
# Used more generalizable predictors
# Added proper model diagnostics
# Added what the logistic regression model is identifying and showing

# Build logistic regression model
log_model <- glm(
  growing ~ years_in_business + coffee_quality + city + address_count,
  data = train_data,
  family = "binomial"
)

# Model summary
summary(log_model)

## 
## Call:
## glm(formula = growing ~ years_in_business + coffee_quality + 
##     city + address_count, family = "binomial", data = train_data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -1.4322     2.1408  -0.669    0.504  
## years_in_business  -0.0658     0.3523  -0.187    0.852  
## coffee_quality.L    1.6726     1.3035   1.283    0.199  
## coffee_quality.Q    0.3141     1.2614   0.249    0.803  
## cityRiverton        2.0595     1.6022   1.285    0.199  
## citySpringfield     3.2087     1.8082   1.775    0.076 .
## address_count4     -0.5854     1.6303  -0.359    0.720  
## address_count6     -0.9683     2.2941  -0.422    0.673  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 30.316  on 21  degrees of freedom
## Residual deviance: 21.206  on 14  degrees of freedom
## AIC: 37.206
## 
## Number of Fisher Scoring iterations: 5

# Evaluate on test data
test_pred_prob <- predict(log_model, newdata = test_data, type = "response")
test_pred <- ifelse(test_pred_prob > 0.5, "Yes", "No")
test_pred <- factor(test_pred, levels = c("No", "Yes"))

# Create confusion matrix
conf_mat <- confusionMatrix(test_pred, test_data$growing)
conf_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No   4   1
##        Yes  0   3
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.4735, 0.9968)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.03516         
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 1.000           
##             Specificity : 0.750           
##          Pos Pred Value : 0.800           
##          Neg Pred Value : 1.000           
##              Prevalence : 0.500           
##          Detection Rate : 0.500           
##    Detection Prevalence : 0.625           
##       Balanced Accuracy : 0.875           
##                                           
##        'Positive' Class : No              
##

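The logistic regression model shows several patterns in coffee shop growth, although no coefficient is statistically significant at the p < 0.05 level.

Coffee quality has a positive relationship with growth (linear term 1.67, p = 0.199), which is consistent with better coffee contributing to business success. Shops in Springfield have the largest positive coefficient relative to the reference city, Oakville (3.21, p = 0.076), which is suggestive at the 0.10 level only. The model treats address_count as a categorical variable; shops with an address count of 4 or 6 have lower estimated odds of growth than shops with an address count of 1 (coefficients -0.59 and -0.97), but neither effect is significant (p = 0.720 and p = 0.673). Years in business has a coefficient close to zero (-0.066, p = 0.852).

On the test data the model classified 7 of 8 shops correctly (87.5% accuracy). This is encouraging, but the 95% confidence interval for the accuracy runs from 47.4% to 99.7%, so a test set of this size cannot confirm that the model generalizes to new coffee shops.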
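
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for three coefficients, with estimates and standard errors copied from the summary output above. The intervals are Wald approximations (estimate plus or minus about 1.96 standard errors), which is an assumption made for simplicity; with only 22 training shops they are rough.

```r
# Estimates and standard errors copied from the summary(log_model) output
coefs <- data.frame(
  term     = c("coffee_quality.L", "cityRiverton", "citySpringfield"),
  estimate = c(1.6726, 2.0595, 3.2087),
  se       = c(1.3035, 1.6022, 1.8082)
)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Exponentiate the log-odds estimate and its Wald interval limits
coefs$odds_ratio <- exp(coefs$estimate)
coefs$or_lower   <- exp(coefs$estimate - z * coefs$se)
coefs$or_upper   <- exp(coefs$estimate + z * coefs$se)

print(coefs, digits = 3)
```

Every interval contains 1, which matches the p-values above 0.05, and the widths show how imprecisely the effects are estimated.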

5 Decision Tree

Decision trees can capture non-linear relationships and provide easily interpretable rules:

## Section Changes
#  Changed sentence to sound better
#  Implemented the missing decision tree model
#  Used the rpart.plot library
#  Used appropriate variables that can generalize
#  Added visualization and interpretation
#  Added model validation on test data
#  Added what the decision tree model is identifying and showing 

# Build decision tree model
library(rpart.plot)

## Loading required package: rpart

tree_model <- rpart(
  growing ~ years_in_business + coffee_quality + city + address_count,
  data = train_data,
  method = "class",
  control = rpart.control(cp = 0.01) # cp = 0.01 is rpart's default; larger values prune more aggressively
)

# Visualize the tree
rpart.plot(tree_model, extra = 1, under = TRUE, box.palette = "RdBu")

# Evaluate on test data
tree_pred <- predict(tree_model, newdata = test_data, type = "class")
tree_conf_mat <- confusionMatrix(tree_pred, test_data$growing)
tree_conf_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No   1   1
##        Yes  3   3
##                                         
##                Accuracy : 0.5           
##                  95% CI : (0.157, 0.843)
##     No Information Rate : 0.5           
##     P-Value [Acc > NIR] : 0.6367        
##                                         
##                   Kappa : 0             
##                                         
##  Mcnemar's Test P-Value : 0.6171        
##                                         
##             Sensitivity : 0.250         
##             Specificity : 0.750         
##          Pos Pred Value : 0.500         
##          Neg Pred Value : 0.500         
##              Prevalence : 0.500         
##          Detection Rate : 0.125         
##    Detection Prevalence : 0.250         
##       Balanced Accuracy : 0.500         
##                                         
##        'Positive' Class : No            
##

The decision tree model performed poorly on the test data: its 50% accuracy equals the no-information rate and its kappa is 0, so it did no better than chance. With the available variables and only 22 training shops, it was not able to identify reliable patterns for predicting coffee shop growth. By contrast, the logistic regression model classified 7 of the same 8 test shops correctly (87.5% accuracy).
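
The headline statistics in both confusion matrices can be recomputed by hand, which makes clear what each one measures. The sketch below copies the counts from the two matrices above and treats "No" as the positive class, as caret reports. The helper cm_metrics is a hypothetical name introduced here, not a caret function.

```r
# Counts copied from the confusion matrices (rows = prediction, columns = reference)
labels <- list(Prediction = c("No", "Yes"), Reference = c("No", "Yes"))
logit_cm <- matrix(c(4, 0, 1, 3), nrow = 2, dimnames = labels)
tree_cm  <- matrix(c(1, 3, 1, 3), nrow = 2, dimnames = labels)

# Hypothetical helper: accuracy, sensitivity, specificity, and Cohen's kappa,
# with "No" as the positive class
cm_metrics <- function(cm) {
  n        <- sum(cm)
  accuracy <- sum(diag(cm)) / n
  # Agreement expected by chance, from the row and column totals
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2
  c(
    accuracy    = accuracy,
    sensitivity = cm["No", "No"] / sum(cm[, "No"]),
    specificity = cm["Yes", "Yes"] / sum(cm[, "Yes"]),
    kappa       = (accuracy - expected) / (1 - expected)
  )
}

print(round(rbind(logistic = cm_metrics(logit_cm), tree = cm_metrics(tree_cm)), 3))
```

Kappa rescales accuracy against chance agreement, so a kappa of 0 means the tree's predictions agree with the truth no more often than random guessing would.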

6 Model Comparison

## Section Changes
# Added model comparison to select the best model
# This ensures we choose the model with better generalization
# Added Findings and Recommendations as well as Model Limitations and Further Research sections

# Create a comparison table
model_comparison <- data.frame(
  Model = c("Logistic Regression", "Decision Tree"),
  Accuracy = c(conf_mat$overall["Accuracy"], tree_conf_mat$overall["Accuracy"]),
  Sensitivity = c(conf_mat$byClass["Sensitivity"], tree_conf_mat$byClass["Sensitivity"]),
  Specificity = c(conf_mat$byClass["Specificity"], tree_conf_mat$byClass["Specificity"])
)

kable(model_comparison, digits = 3, caption = "Model Performance Comparison")

Model Performance Comparison
Model	Accuracy	Sensitivity	Specificity
Logistic Regression	0.875	1.00	0.75
Decision Tree	0.500	0.25	0.75

6.1 Findings

Based on our analysis, we can draw several conclusions about factors associated with coffee shop growth:

City Location May Matter: In our logistic regression model, shops in Springfield were more likely to be growing than shops in the reference city, Oakville (p = 0.076). The effect does not reach significance at the 0.05 level, so it should be read as suggestive.

Coffee Quality Shows Promise: While not statistically significant at the p < 0.05 level (p = 0.199 for the linear term), our model indicates a positive relationship between coffee quality and growth. Improving coffee quality may contribute to business success.

Location Competition: Shops sharing an address with other shops had lower estimated odds of growth, though the effect is not statistically significant (p = 0.720 and p = 0.673). The direction is consistent with the business intuition that unique locations may perform better.

Experience Not a Key Factor: Contrary to initial expectations, years in business showed a slight negative and non-significant relationship with growth (coefficient -0.066, p = 0.852), so there is no evidence that established shops grow more often than newer ones.

6.2 Recommendations

For coffee shop owners and investors:

Location Selection: Consider Springfield as a potentially favorable location, since shops there showed the strongest growth pattern in our analysis, while keeping in mind that the effect was not significant at the 0.05 level.

Focus on Quality: Coffee quality was not statistically significant in our model, but it shows a positive relationship with growth and remains worth investing in.

Avoid Clustering: When possible, choose locations without multiple existing coffee shops at the same address to reduce direct competition.

Data-Driven Decisions: Use predictive modeling as one input when evaluating potential new locations. Our logistic regression model achieved 87.5% accuracy on test data, but that figure rests on only 8 shops and should be validated on more data before it drives decisions.

6.3 Model Limitations and Further Research

Our analysis has several limitations:

Small Sample Size: With only 30 shops and a 22/8 train/test split, our findings should be considered preliminary.

Model Selection: The logistic regression model (87.5% accuracy) outperformed the decision tree model (50% accuracy) on the test set, which may indicate that linear relationships better capture growth patterns in this dataset. With 8 test shops, the gap amounts to 3 additional correct predictions, so it is not strong evidence on its own.

Limited Variables: Other factors like pricing, shop size, and marketing efforts were not captured in our dataset.

Cross-sectional Data: Our dataset provides only a snapshot in time rather than tracking changes over time.
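
The effect of the small test set can be quantified with an exact binomial test. The sketch below uses the counts of correct test predictions (7 of 8 for logistic regression, 4 of 8 for the decision tree) and the no-information rate of 0.5 from the confusion matrix output. The results should match the interval and p-values printed there, though that correspondence is stated here as an expectation rather than taken from caret's documentation.

```r
n_test <- 8     # shops in the test set
nir    <- 0.5   # no-information rate from the confusion matrix output

# Logistic regression: 7 of 8 correct
logit_p  <- binom.test(7, n_test, p = nir, alternative = "greater")$p.value
logit_ci <- binom.test(7, n_test)$conf.int   # exact 95% interval for accuracy

# Decision tree: 4 of 8 correct
tree_p <- binom.test(4, n_test, p = nir, alternative = "greater")$p.value

print(round(c(logit_p = logit_p, tree_p = tree_p), 4))
print(round(as.numeric(logit_ci), 4))
```

The wide interval for the logistic model shows that 8 test shops cannot pin down its true accuracy, even though the point estimate is high.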

6.4 Future Research

Future research could include collecting longitudinal data to track growth over time; including additional variables such as pricing, menu diversity, and customer demographics; expanding the dataset to include more shops and cities for better generalizability; and exploring more complex modeling techniques with larger datasets.

Coffee Shop Analysis

Exam 2 - 2025 Spring ACCT 426/BUDA 451

2025-05-01