This report presents an analysis of data from 30 coffee shops across three cities (Springfield, Riverton, and Oakville). The objective is to build a predictive model that identifies factors associated with growing coffee shops and can generalize to new locations. We examine how variables such as years in business, coffee quality, and location relate to a shop’s growth status.
## Section Changes
# Fixed typo in introduction
# Rewrote the introduction to clearly state the objective
# Removed the unsupported claim about r^2 value of .77
# Added proper data overview and visualization
The dataset contains information on 30 coffee shops with the following variables:
## Section Changes
# Created a proper data overview with a table showing variable types and descriptions
# This helps to understand the data structure before analysis
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
variable_info <- tribble(
~Variable, ~Type, ~Description,
"name", "Character", "Name of the coffee shop",
"street", "Character", "Street address of the shop",
"city", "Factor", "City where the shop is located (Springfield, Riverton, or Oakville)",
"years_in_business", "Numeric", "Number of years the shop has been operating",
"coffee_quality", "Ordered Factor", "Quality rating of coffee (bad < ok < good)",
"growing", "Factor", "Whether the shop is growing (Yes) or not (No)",
"address_count", "Numeric", "Number of shops sharing the same street address"
)
kable(variable_info, caption = "Variables in the Coffee Shop Dataset")
| Variable | Type | Description |
|---|---|---|
| name | Character | Name of the coffee shop |
| street | Character | Street address of the shop |
| city | Factor | City where the shop is located (Springfield, Riverton, or Oakville) |
| years_in_business | Numeric | Number of years the shop has been operating |
| coffee_quality | Ordered Factor | Quality rating of coffee (bad < ok < good) |
| growing | Factor | Whether the shop is growing (Yes) or not (No) |
| address_count | Numeric | Number of shops sharing the same street address |
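The table lists coffee_quality as an ordered factor, and R's modeling functions encode ordered factors with orthogonal polynomial contrasts by default. As a small illustration, the contrast matrix below shows the linear (`.L`) and quadratic (`.Q`) codes that model summaries report for a three-level ordered factor. The `quality` vector is a stand-in built from the three levels in the table, not the report's data.

```r
# Stand-in ordered factor with the three quality levels (not the report's data)
quality <- factor(c("bad", "ok", "good"),
                  levels = c("bad", "ok", "good"),
                  ordered = TRUE)

# Default coding for ordered factors: orthogonal polynomial contrasts
quality_codes <- contrasts(quality)
print(round(quality_codes, 3))
```

The `.L` column increases steadily from bad to good, so a positive `.L` coefficient in a model means the outcome tends to rise with quality. The `.Q` column captures curvature, that is, whether "ok" sits above or below the straight line between "bad" and "good".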
Let’s examine the distribution of coffee shops across cities and their characteristics:
## Section Changes
# Added proper visualization of city distribution with ggplot
# Used geom_bar to count shops instead of table/hist
# Added proper titles and labels
# These chart sections replace the earlier histograms
# ggplot2 is used here instead of base R plotting
ggplot(t, aes(x = city, fill = city)) +
geom_bar() +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(
title = "Number of Coffee Shops by City",
x = "City",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none")
## Section Changes
# Added visualization of coffee quality by city
# This shows the distribution of quality ratings across different locations
# Using facets to allow for comparison
ggplot(t, aes(x = coffee_quality, fill = coffee_quality)) +
geom_bar() +
facet_wrap(~city) +
labs(
title = "Coffee Quality Distribution by City",
x = "Coffee Quality",
y = "Count"
) +
theme_minimal()
## Section Changes
# Created a Years in Business histogram
# With proper labels and formatting
ggplot(t, aes(x = years_in_business)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white") +
labs(
title = "Distribution of Years in Business",
x = "Years in Business",
y = "Number of Shops"
) +
theme_minimal()
## Section Changes
# Created a proper visualization for the growing variable
# Added breakdown by city for more insight
# Used appropriate bar chart instead of histogram for categorical data
ggplot(t, aes(x = growing, fill = city)) +
geom_bar(position = "dodge") +
labs(
title = "Growing vs. Non-Growing Coffee Shops by City",
x = "Growing Status",
y = "Count"
) +
theme_minimal()
## Section Changes
# Fixed the correlation analysis
# Only including numeric variables to avoid errors
# Created a proper correlation plot with visualization
# Added interpretation of the results
# Corrected correlation approach
# First, let's inspect what values are actually in the growing column
print("Values in growing column:")
## [1] "Values in growing column:"
print(table(t$growing))
##
## No Yes
## 16 14
# Create properly converted numeric variables
numeric_data <- data.frame(
years_in_business = t$years_in_business,
address_count = as.numeric(t$address_count),
growing_numeric = ifelse(t$growing == "Yes", 1, 0)
)
# Verify all columns are numeric
str(numeric_data)
## 'data.frame': 30 obs. of 3 variables:
## $ years_in_business: num 5 8 3 6 2 4 1 7 9 5 ...
## $ address_count : num 6 1 1 1 6 1 1 1 6 1 ...
## $ growing_numeric : num 1 1 1 0 1 1 1 1 1 1 ...
# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
print(cor_matrix)
## years_in_business address_count growing_numeric
## years_in_business 1.00000000 -0.05405859 0.01210747
## address_count -0.05405859 1.00000000 -0.05191741
## growing_numeric 0.01210747 -0.05191741 1.00000000
The correlation analysis shows:
- A very weak positive correlation (0.012) between years in business and growth status, suggesting almost no linear relationship between experience and growth.
- A very weak negative correlation (-0.052) between address count (the number of shops sharing the same address) and growth status. This value is too close to zero to show that shops at unique addresses perform better.
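To gauge how weak these correlations are, the reported coefficients can be turned into a t statistic and a two-sided p-value for the null hypothesis of zero correlation. The sketch below uses only the two values from the matrix above and n = 30. It assumes the standard Pearson test is an acceptable approximation for a 0/1 outcome.

```r
# Correlations with growth status, copied from the matrix above; n = 30 shops
r <- c(years_in_business = 0.01210747, address_count = -0.05191741)
n <- 30

# t statistic and two-sided p-value for H0: the true correlation is zero
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(cbind(r, t_stat, p_value), 3))
```

Both p-values are far above 0.05, so neither correlation is distinguishable from zero with 30 shops.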
## Section Changes
# Added analysis of coffee quality vs. growing status
# This helps understand if better coffee quality is associated with growth
# Identified what the analysis shows.
ggplot(t, aes(x = coffee_quality, fill = growing)) +
geom_bar(position = "fill") +
labs(
title = "Proportion of Growing Shops by Coffee Quality",
x = "Coffee Quality",
y = "Proportion"
) +
theme_minimal()
This plot suggests that coffee quality is associated with shop growth: a higher proportion of “good” quality shops are growing than “bad” quality shops.
Before building our model, we need to properly split our data into training and testing sets to ensure our model can generalize to new data:
## Section Changes
# Fixed typo in predictive modeling and fixed sentence to sound better
# Added proper train/test split
# This is critical for model validation and ensuring generalizability
# Used the caret package for proper stratified sampling
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123) # For reproducibility
train_index <- createDataPartition(t$growing, p = 0.7, list = FALSE)
train_data <- t[train_index, ]
test_data <- t[-train_index, ]
cat("Training set size:", nrow(train_data), "shops\n")
## Training set size: 22 shops
cat("Testing set size:", nrow(test_data), "shops\n")
## Testing set size: 8 shops
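For readers unfamiliar with stratified sampling, the idea behind the split can be sketched in base R. This is not caret's implementation. It samples 70% of the row indices within each class and rounds up, a rounding rule assumed here because it reproduces the 22/8 sizes reported above. The `growing` vector is a stand-in that uses only the class counts from the report (16 "No", 14 "Yes").

```r
set.seed(123)  # the split sizes do not depend on the seed

# Stand-in outcome vector: only the class counts come from the report
growing <- factor(rep(c("No", "Yes"), times = c(16, 14)))
p <- 0.7

# Within each class, draw ceiling(p * class size) row indices for training
train_idx <- unlist(lapply(split(seq_along(growing), growing), function(idx) {
  idx[sample.int(length(idx), ceiling(p * length(idx)))]
}))
test_idx <- setdiff(seq_along(growing), train_idx)

print(table(growing[train_idx]))  # class counts in the training set
print(table(growing[test_idx]))   # class counts in the test set
```

Sampling within each class keeps the share of growing shops similar in both sets, which matters when the test set has only 8 shops.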
## Section Changes
# Changed from linear to logistic regression for binary outcome
# Removed problematic 'street' variable (too many levels)
# Used more generalizable predictors
# Added proper model diagnostics
# Added what the logistic regression model is identifying and showing
# Build logistic regression model
log_model <- glm(
growing ~ years_in_business + coffee_quality + city + address_count,
data = train_data,
family = "binomial"
)
# Model summary
summary(log_model)
##
## Call:
## glm(formula = growing ~ years_in_business + coffee_quality +
## city + address_count, family = "binomial", data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4322 2.1408 -0.669 0.504
## years_in_business -0.0658 0.3523 -0.187 0.852
## coffee_quality.L 1.6726 1.3035 1.283 0.199
## coffee_quality.Q 0.3141 1.2614 0.249 0.803
## cityRiverton 2.0595 1.6022 1.285 0.199
## citySpringfield 3.2087 1.8082 1.775 0.076 .
## address_count4 -0.5854 1.6303 -0.359 0.720
## address_count6 -0.9683 2.2941 -0.422 0.673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 30.316 on 21 degrees of freedom
## Residual deviance: 21.206 on 14 degrees of freedom
## AIC: 37.206
##
## Number of Fisher Scoring iterations: 5
# Evaluate on test data
test_pred_prob <- predict(log_model, newdata = test_data, type = "response")
test_pred <- ifelse(test_pred_prob > 0.5, "Yes", "No")
test_pred <- factor(test_pred, levels = c("No", "Yes"))
# Create confusion matrix
conf_mat <- confusionMatrix(test_pred, test_data$growing)
conf_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 4 1
## Yes 0 3
##
## Accuracy : 0.875
## 95% CI : (0.4735, 0.9968)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.03516
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 1.000
## Specificity : 0.750
## Pos Pred Value : 0.800
## Neg Pred Value : 1.000
## Prevalence : 0.500
## Detection Rate : 0.500
## Detection Prevalence : 0.625
## Balanced Accuracy : 0.875
##
## 'Positive' Class : No
##
The logistic regression model shows several patterns in coffee shop growth, although no predictor is statistically significant at the p < 0.05 level. Coffee quality has a positive relationship with growth (linear term, p = 0.199), suggesting better coffee may contribute to business success. Location has the strongest estimated effect: shops in Springfield show a higher growth likelihood than shops in Oakville, the reference city (p = 0.076). Shops that share an address with other shops tend to grow less, though this effect is not statistically significant. Note that the summary shows separate terms for address counts of 4 and 6, so the model treated address_count as categorical. The model classifies 7 of the 8 test shops correctly (87.5% accuracy). With so few test shops, the 95% confidence interval for accuracy is wide (0.47 to 0.997), so this is an encouraging but preliminary sign that the model can generalize to new coffee shops.
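Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for three terms, using the estimates and standard errors copied from the summary above. The intervals are simple Wald intervals (estimate ± 1.96 standard errors), an approximation chosen for illustration that can be unreliable with only 22 training shops.

```r
# Estimates and standard errors copied from the model summary above
coefs <- data.frame(
  term     = c("coffee_quality.L", "cityRiverton", "citySpringfield"),
  estimate = c(1.6726, 2.0595, 3.2087),
  std_err  = c(1.3035, 1.6022, 1.8082)
)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Exponentiate the log-odds scale to get odds ratios and Wald limits
coefs$odds_ratio <- exp(coefs$estimate)
coefs$ci_low     <- exp(coefs$estimate - z * coefs$std_err)
coefs$ci_high    <- exp(coefs$estimate + z * coefs$std_err)

print(coefs[, c("term", "odds_ratio", "ci_low", "ci_high")], digits = 3)
```

Every interval includes 1, which matches the p-values above 0.05, and the intervals are very wide. The direction of each effect is therefore more trustworthy than its size.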
Decision trees can capture non-linear relationships and provide easily interpretable rules:
## Section Changes
# Changed sentence to sound better
# Implemented the missing decision tree model
# used rpart.plot library
# Used appropriate variables that can generalize
# Added visualization and interpretation
# Added model validation on test data
# Added what the decision tree model is identifying and showing
# Build decision tree model
library(rpart.plot)
## Loading required package: rpart
tree_model <- rpart(
growing ~ years_in_business + coffee_quality + city + address_count,
data = train_data,
method = "class",
control = rpart.control(cp = 0.01) # rpart's default complexity parameter; larger values prune more
)
# Visualize the tree
rpart.plot(tree_model, extra = 1, under = TRUE, box.palette = "RdBu")
# Evaluate on test data
tree_pred <- predict(tree_model, newdata = test_data, type = "class")
tree_conf_mat <- confusionMatrix(tree_pred, test_data$growing)
tree_conf_mat
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1 1
## Yes 3 3
##
## Accuracy : 0.5
## 95% CI : (0.157, 0.843)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.6367
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.6171
##
## Sensitivity : 0.250
## Specificity : 0.750
## Pos Pred Value : 0.500
## Neg Pred Value : 0.500
## Prevalence : 0.500
## Detection Rate : 0.125
## Detection Prevalence : 0.250
## Balanced Accuracy : 0.500
##
## 'Positive' Class : No
##
The decision tree model performed poorly on the test data, classifying only 4 of the 8 shops correctly (50% accuracy, equal to the no-information rate). This suggests it wasn’t able to identify reliable patterns for predicting coffee shop growth from the available variables and limited sample size. In contrast, the logistic regression model achieved 87.5% accuracy on the same test data.
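The headline metrics for both models can be recomputed directly from the confusion matrix counts, which makes the uncertainty behind each accuracy figure explicit. The sketch below uses only the counts printed above. `summarise_cm` is a hypothetical helper written for this illustration, and the interval is the exact binomial interval from `binom.test`.

```r
# Counts copied from the two confusion matrices above
# (rows = prediction, columns = reference; matrix() fills column by column)
lvl <- c("No", "Yes")
logit_cm <- matrix(c(4, 0, 1, 3), nrow = 2,
                   dimnames = list(Prediction = lvl, Reference = lvl))
tree_cm  <- matrix(c(1, 3, 1, 3), nrow = 2,
                   dimnames = list(Prediction = lvl, Reference = lvl))

# Hypothetical helper: accuracy with an exact 95% interval, plus
# sensitivity and specificity with "No" as the positive class
summarise_cm <- function(cm) {
  correct <- sum(diag(cm))
  ci <- binom.test(correct, sum(cm))$conf.int
  c(
    accuracy    = correct / sum(cm),
    ci_low      = ci[1],
    ci_high     = ci[2],
    sensitivity = cm["No", "No"] / sum(cm[, "No"]),
    specificity = cm["Yes", "Yes"] / sum(cm[, "Yes"])
  )
}

res <- rbind(logistic = summarise_cm(logit_cm), tree = summarise_cm(tree_cm))
print(round(res, 3))
```

The two accuracy intervals overlap substantially, so the 8-shop test set cannot firmly establish that one model is better than the other.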
## Section Changes
# Added model comparison to select the best model
# This ensures we choose the model with better generalization
# Added Findings and Recommendations as well as Model Limitations and Further Research sections
# Create a comparison table
model_comparison <- data.frame(
Model = c("Logistic Regression", "Decision Tree"),
Accuracy = c(conf_mat$overall["Accuracy"], tree_conf_mat$overall["Accuracy"]),
Sensitivity = c(conf_mat$byClass["Sensitivity"], tree_conf_mat$byClass["Sensitivity"]),
Specificity = c(conf_mat$byClass["Specificity"], tree_conf_mat$byClass["Specificity"])
)
kable(model_comparison, digits = 3, caption = "Model Performance Comparison")
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Logistic Regression | 0.875 | 1.00 | 0.75 |
| Decision Tree | 0.500 | 0.25 | 0.75 |
Based on our analysis, we can draw several conclusions about factors associated with coffee shop growth:
- City Location Matters: Our logistic regression model suggests that shops in Springfield are more likely to be growing than shops in Oakville, the reference city (p = 0.076). This effect is not significant at the p < 0.05 level, but it is the strongest signal in the model and suggests that location plays a role in business success.
- Coffee Quality Shows Promise: While not statistically significant at the p < 0.05 level, our model indicates a positive relationship between coffee quality and growth. This suggests improving coffee quality may contribute to business success.
- Location Competition: The number of shops sharing the same address shows a negative relationship with growth, though not statistically significant. This aligns with business intuition that unique locations may perform better.
- Experience Not a Key Factor: Contrary to initial expectations, years in business showed a slight, non-significant negative relationship with growth in our model (p = 0.852), suggesting newer shops may be growing at rates similar to established ones.
For coffee shop owners and investors:
- Location Selection: Consider Springfield as a potentially favorable location for growing coffee shops, as our analysis shows better growth patterns in this city.
- Focus on Quality: While not statistically significant in our model, coffee quality shows a positive relationship with growth and remains worth investing in.
- Avoid Clustering: When possible, choose locations without multiple existing coffee shops at the same address to reduce direct competition.
- Data-Driven Decisions: Use predictive modeling to help evaluate potential new locations. Our logistic regression model achieved 87.5% accuracy on test data, although that test set contained only 8 shops.
Our analysis has several limitations:
- Small Sample Size: With only 30 shops and a 22/8 train/test split, our findings should be considered preliminary.
- Model Selection: The logistic regression model (87.5% accuracy) outperformed the decision tree model (50% accuracy) on the 8-shop test set. A test set this small cannot make the comparison conclusive, but the gap suggests that linear relationships may better capture growth patterns in this dataset.
- Limited Variables: Other factors like pricing, shop size, and marketing efforts were not captured in our dataset.
- Cross-sectional Data: Our dataset provides only a snapshot in time rather than tracking changes over time.
Further research could include:
- Collecting longitudinal data to track growth over time
- Including additional variables such as pricing, menu diversity, and customer demographics
- Expanding the dataset to include more shops and cities for better generalizability
- Exploring more complex modeling techniques with larger datasets