2 Data Cleanup & Transformation

3 Introduction

This report presents a preliminary analysis of a dataset containing information about coffee shops in three U.S. cities. The objective is to identify patterns in shop location, coffee quality, and growth that may inform strategic decisions about future locations and marketing approaches. We explore whether years in business, coffee quality, or city best predicts growth, and we evaluate these predictors using linear regression and a decision tree.

4 Data Overview

The dataset contains the following columns:

  • name: Name of the coffee shop
  • city: City where the shop is located
  • street: Street address
  • years_in_business: Number of years the shop has been operating
  • coffee_quality: Quality of the coffee (good, ok, or bad); treated as a numeric score in the analyses below
  • growing: 1 if the shop is growing, 0 if not

There are 30 records and 6 variables.

4.1 Correlations

##                   years_in_business coffee_quality    growing
## years_in_business        1.00000000      0.2375221 0.01210747
## coffee_quality           0.23752214      1.0000000 0.40505554
## growing                  0.01210747      0.4050555 1.00000000
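A matrix like the one above can be produced with cor() on the numeric columns. The sketch below uses a small made-up data frame (the name `shops` and all of its values are hypothetical, not the report's data). It also assumes coffee_quality is coded bad = 1, ok = 2, good = 3; the report does not state the coding it used.

```r
# Hypothetical toy data; not the report's dataset.
shops <- data.frame(
  years_in_business = c(2, 5, 8, 3, 10, 7),
  coffee_quality    = c("bad", "ok", "good", "ok", "good", "bad"),
  growing           = c(0, 1, 1, 0, 1, 0)
)

# Assumed coding: bad = 1, ok = 2, good = 3.
quality_levels <- c("bad", "ok", "good")
shops$coffee_quality <- as.integer(factor(shops$coffee_quality, levels = quality_levels))

# Pearson correlations between all pairs of numeric columns.
cor_mat <- cor(shops[, c("years_in_business", "coffee_quality", "growing")])
print(round(cor_mat, 3))
```

Because growing is a 0/1 variable, its Pearson correlation with another column is a point-biserial correlation; it describes a linear association only.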

4.2 Correlation Matrix

[Figure: correlation matrix plot drawn with ggcorrplot; it displays the same values as the table in Section 4.1.]

4.3 Histograms

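We next examine where the shops are located, tabulating counts by city and then by street address. The shops are split evenly across the three cities (10 each). Most street addresses appear only once; the exceptions are 123 Main St (6 shops) and 334 Pine St (4 shops).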

## 
##    Oakville    Riverton Springfield 
##          10          10          10
## 
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way 
##               1               1               6               1               1 
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd 
##               1               1               1               4               1 
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way 
##               1               1               1               1               1 
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln 
##               1               1               1               1               1 
##   909 Ninth Ave   990 Spruce Pl 
##               1               1

5 Principal Component Analysis (PCA)

We apply PCA to identify underlying patterns among the numeric variables.

## Importance of components:
##                           PC1    PC2    PC3
## Standard deviation     1.2145 0.9947 0.7319
## Proportion of Variance 0.4916 0.3298 0.1785
## Cumulative Proportion  0.4916 0.8215 1.0000

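The first principal component captures about 49% of the variance, the largest share of any single component but not a majority; the first two components together capture about 82%. PC1 is driven primarily by coffee_quality and growing, which is consistent with their pairwise correlation of about 0.41. This suggests a shared structure across these variables that may relate to shop success.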
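The "Proportion of Variance" row follows directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The check below uses the rounded standard deviations printed in the summary, so it agrees with the reported proportions only to about three decimal places.

```r
# Standard deviations of the three components, copied from the summary above.
sdev <- c(PC1 = 1.2145, PC2 = 0.9947, PC3 = 0.7319)

# The variance of each component is its squared standard deviation.
variances <- sdev^2

# Share of total variance per component, and the running total.
prop_var <- variances / sum(variances)
cum_var  <- cumsum(prop_var)

print(round(rbind(prop_var, cum_var), 3))
```

The variances sum to roughly 3, the number of variables. That is what one would expect if the PCA was run on standardized variables (for example with prcomp(..., scale. = TRUE)); the report does not show the call, so this is an inference rather than a confirmed detail.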

6 Predicting Growth

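I predict growth using years in business, coffee quality, and city. The model explains about 49% of the variance in growth (adjusted R-squared 0.41), so it is moderately predictive. Oakville does not appear in the coefficient table because it is the reference level for city; the Riverton and Springfield coefficients are differences relative to Oakville.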

m <- lm(growing ~ years_in_business + coffee_quality + city, data = t)

summary(m)
## 
## Call:
## lm(formula = growing ~ years_in_business + coffee_quality + city, 
##     data = t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83615 -0.27474  0.00422  0.16300  0.88265 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.1989328  0.2515354  -0.791 0.436457    
## years_in_business -0.0004254  0.0288360  -0.015 0.988348    
## coffee_quality     0.1585644  0.1007563   1.574 0.128119    
## cityRiverton       0.3629578  0.1802407   2.014 0.054919 .  
## citySpringfield    0.7205051  0.1830572   3.936 0.000584 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 25 degrees of freedom
## Multiple R-squared:  0.4905, Adjusted R-squared:  0.409 
## F-statistic: 6.017 on 4 and 25 DF,  p-value: 0.001558
# Took street out of the linear regression model so that there weren't so many variables/values. Adjusted R-squared was also negative with street included.
# I included city in the model because it is a factor variable and I wanted to see whether it was significant. It increased R-squared.
# Fixed the typo in the header from "gorwth" to "Growth".
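
The coefficient table lists cityRiverton and citySpringfield but no Oakville term. That is what R's default treatment contrasts produce when Oakville is the first factor level. The sketch below reproduces the behavior on a small made-up data frame (`shops` and its values are hypothetical) and shows how relevel() changes the baseline.

```r
# Hypothetical toy data; city names match the report, values are made up.
shops <- data.frame(
  city    = factor(c("Oakville", "Riverton", "Springfield",
                     "Oakville", "Riverton", "Springfield")),
  growing = c(0, 0, 1, 0, 1, 1)
)

# factor() orders levels alphabetically, so Oakville is the reference level.
print(levels(shops$city))

# With the default treatment contrasts, the reference level is absorbed into the intercept.
fit <- lm(growing ~ city, data = shops)
print(names(coef(fit)))

# relevel() picks a different baseline; Springfield then drops out instead.
shops$city2 <- relevel(shops$city, ref = "Springfield")
fit2 <- lm(growing ~ city2, data = shops)
print(names(coef(fit2)))
```

Read this way, the citySpringfield estimate of about 0.72 in the model above is the estimated difference in growing between a Springfield shop and an Oakville shop with the same years in business and coffee quality.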

# Evaluation model

# Added a held-out evaluation to test how well the model predicts growth on data it was not fit to.

set.seed(123)

# Split data
train_idx <- sample(seq_len(nrow(t)), size = 0.7 * nrow(t))
train <- t[train_idx, ]
test <- t[-train_idx, ]

# Fit on training
model <- lm(growing ~ years_in_business + coffee_quality + city, data = train)

# Predict on test
pred <- predict(model, newdata = test)

# Calculate Mean Squared Error (MSE)
mse <- mean((test$growing - pred)^2)
mse
## [1] 0.1049407
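
An MSE is easier to judge next to a baseline. One common choice is to compare the model's test MSE with the MSE from always predicting the training-set mean of growing. The sketch below does this on simulated data; the data frame `shops`, its values, and the growth probabilities are all made up for illustration.

```r
set.seed(123)

# Simulated stand-in for the report's data: 30 shops, 3 cities (values are made up).
n <- 30
shops <- data.frame(
  years_in_business = sample(1:15, n, replace = TRUE),
  coffee_quality    = sample(1:3, n, replace = TRUE),
  city              = factor(rep(c("Oakville", "Riverton", "Springfield"), each = 10))
)
# Growth is made more likely in Springfield so the model has something to find.
shops$growing <- rbinom(n, 1, ifelse(shops$city == "Springfield", 0.8, 0.3))

# Same 70/30 split proportions as in the report.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- shops[train_idx, ]
test  <- shops[-train_idx, ]

model <- lm(growing ~ years_in_business + coffee_quality + city, data = train)

# Test error of the model versus a constant prediction (the training mean).
mse_model    <- mean((test$growing - predict(model, newdata = test))^2)
mse_baseline <- mean((test$growing - mean(train$growing))^2)

print(c(model = mse_model, baseline = mse_baseline, rmse_model = sqrt(mse_model)))
```

For the report's own result, an MSE of 0.105 corresponds to an RMSE of about 0.32 on a 0/1 outcome, estimated from only 9 held-out shops (30% of 30), so the figure is likely to vary noticeably with the random split.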

7 Decision Tree

I fit a decision tree to predict the growing variable.

[Figure: decision tree for growing, drawn with rpart.plot.]
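
The tree-fitting code is not shown in this report. As a minimal sketch of how such a tree might be fit, the example below uses rpart (the package whose models rpart.plot draws) on simulated data. The data frame `shops`, its values, and the control settings are assumptions for illustration, not the report's actual configuration.

```r
library(rpart)

set.seed(123)

# Simulated stand-in for the report's data (values are made up).
n <- 30
shops <- data.frame(
  years_in_business = sample(1:15, n, replace = TRUE),
  coffee_quality    = sample(1:3, n, replace = TRUE),
  city              = factor(rep(c("Oakville", "Riverton", "Springfield"), each = 10))
)
shops$growing <- factor(rbinom(n, 1, ifelse(shops$city == "Springfield", 0.8, 0.3)))

# Classification tree. minsplit is lowered from its default of 20 because
# 30 rows would otherwise allow very few splits (an assumed setting).
tree <- rpart(
  growing ~ years_in_business + coffee_quality + city,
  data    = shops,
  method  = "class",
  control = rpart.control(minsplit = 5)
)

print(tree)

# Predicted class for each shop, and in-sample accuracy.
pred_class <- predict(tree, newdata = shops, type = "class")
print(mean(as.character(pred_class) == as.character(shops$growing)))
```

If rpart.plot is installed, rpart.plot::rpart.plot(tree) would draw the fitted tree. In-sample accuracy is optimistic on a dataset this small, so a held-out split like the one used for the regression would give a fairer comparison.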

8 Final Prediction Plot

We visualize how the model’s predicted values compare to actual growth outcomes.

9 Conclusion

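This analysis examined factors influencing coffee shop growth, focusing on years in business, coffee quality, and city. In the regression, city was the strongest predictor: Springfield shops were significantly more likely to be growing than Oakville shops (p < 0.001), and the Riverton difference was marginal (p = 0.055). Coffee quality had a positive coefficient that was not statistically significant (p = 0.13), although it had the highest pairwise correlation with growth (0.41). Years in business added essentially nothing (coefficient near zero, p = 0.99). With only 30 records these results are preliminary. Future work could include more shops and more detailed variables to improve accuracy, but this analysis provides a starting point for understanding growth trends.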