1 Introduction

This report presents a preliminary analysis of a dataset describing coffee shops in three U.S. cities. The objective is to identify patterns in shop density, ratings, and business growth that may inform strategic decisions about future locations and marketing approaches. Based on the initial analysis, location (city) appears to be the strongest predictor of whether a coffee shop is growing. The R² value of the linear model in Section 4 is approximately 0.49, meaning the model explains roughly half of the variation in growth status, which suggests a moderate relationship.

# Introduction Section Changes:
#   Fixed typo in introduction
#   Corrected grammar and improved clarity to make the text more professional
#   Made the introduction more specific so readers know what they are looking into
#   Created more context to my R² value for better understanding

2 Data Overview

The dataset contains the following columns:

name: Name of the coffee shop
street: Street address
city: City where the shop is located
years_in_business: Number of years the shop has been operating
coffee_quality: Quality rating of the coffee (good, ok, or bad)
growing: Indicator if the shop is growing (1 = growing, 0 = not growing)

There are 30 records (rows) and 6 variables (columns).

# Data Overview Section Changes:
#   Put columns into the order they appear in the dataset
#   Clarified variable descriptions

2.1 Correlations

library(dplyr)

# Map the ordinal quality labels to integers (bad = 1, ok = 2, good = 3)
t <- t %>%
  mutate(coffee_quality_num = case_when(
    coffee_quality == "bad" ~ 1,
    coffee_quality == "ok" ~ 2,
    coffee_quality == "good" ~ 3
  ))

cor(select(t, years_in_business, coffee_quality_num, growing))

##                    years_in_business coffee_quality_num    growing
## years_in_business         1.00000000          0.2375221 0.01210747
## coffee_quality_num        0.23752214          1.0000000 0.40505554
## growing                   0.01210747          0.4050555 1.00000000
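
Because coffee_quality_num is an ordinal code rather than a true measurement, a rank-based correlation may be a useful cross-check on the Pearson values above. The following is a minimal sketch, assuming a small hypothetical data frame named toy whose values are invented for illustration and are not taken from the dataset.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  years_in_business = c(2, 5, 7, 1, 10, 4, 8, 3),
  coffee_quality    = c("bad", "ok", "good", "bad", "good", "ok", "good", "ok"),
  growing           = c(0, 0, 1, 0, 1, 1, 1, 0)
)

# Make the bad < ok < good ordering explicit with an ordered factor,
# then take the integer codes (1, 2, 3).
toy$coffee_quality_ord <- factor(toy$coffee_quality,
                                 levels = c("bad", "ok", "good"),
                                 ordered = TRUE)
toy$coffee_quality_num <- as.integer(toy$coffee_quality_ord)

# Spearman correlation works on ranks, so it does not assume the
# quality levels are equally spaced.
spearman <- cor(toy[, c("years_in_business", "coffee_quality_num", "growing")],
                method = "spearman")
print(round(spearman, 3))
```

If the Spearman and Pearson matrices tell a similar story, the 1/2/3 coding is probably not distorting the result much.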

# Correlations Section Changes:
#   Fixed error in cor() as it only works on numeric data
#   Changed coffee_quality to numeric from categorical
#   Created a new column called 'coffee_quality_num' to the dataset
#   Calculated correlations using the new numeric column

3 Histograms

I begin by examining where most coffee shops are located and their business characteristics. Below I show:
- A bar plot of the number of shops in each city
- A bar plot of the most common street locations
- A histogram of the years in business
- A bar plot showing the number of shops that are growing vs. not growing

library(ggplot2)


ggplot(t, aes(x = city)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Coffee Shops by City", x = "City", y = "Count")

t %>%
  count(street, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(street, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Most Common Streets", x = "Street", y = "Count")

ggplot(t, aes(x = years_in_business)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "white") +
  labs(title = "Histogram of Years in Business", x = "Years", y = "Count")

ggplot(t, aes(x = factor(growing, levels = c(0, 1), labels = c("Not Growing", "Growing")))) +
  geom_bar(fill = "orange") +
  labs(title = "Business Growth Status", x = "Growth Status", y = "Count")

# Histograms Section Changes:
#   Replaced R tables with clearer ggplot2 visualizations to make them easier to understand
#   Added more descriptive titles to the plots in order to know what the plots represent
#   Reordered street bar plot and limited it to top 10 in order to give a glimpse of the data but not bombard with too much information
#   Added color to plots and cleaned up names to make it easier to understand

4 Predicting Growth

To predict whether a coffee shop is growing, I used a linear regression model. The model uses the following variables:
- city: City where the shop is located
- years_in_business: Number of years the shop has been operating
- coffee_quality_num: Quality rating of the coffee (bad, ok, or good) as a number (1, 2, or 3)

Also, the model may not generalize well to cities with fewer or no growing businesses, such as Oakville.
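
One way to check the concern about Oakville is to cross-tabulate growth status by city. The sketch below assumes a hypothetical data frame named toy with invented values; the same calls could presumably be applied to the real data.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  city = rep(c("Oakville", "Riverton", "Springfield"), each = 3),
  growing = c(0, 0, 0,
              0, 1, 1,
              1, 1, 0)
)

# Counts of not growing (0) and growing (1) shops per city.
counts <- table(toy$city, toy$growing, dnn = c("city", "growing"))
print(counts)

# Row-wise proportions: the share of shops in each city that are growing.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))

# Cities with no growing shops at all, where the model has nothing to learn from.
no_growth <- rownames(counts)[counts[, "1"] == 0]
print(no_growth)
```

A city that appears in no_growth contributes only zeros to the outcome, so its estimated effect rests entirely on the absence of growth.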

model <- lm(growing ~ city + years_in_business + coffee_quality_num, data = t)

summary(model)

## 
## Call:
## lm(formula = growing ~ city + years_in_business + coffee_quality_num, 
##     data = t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83615 -0.27474  0.00422  0.16300  0.88265 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -0.1989328  0.2515354  -0.791 0.436457    
## cityRiverton        0.3629578  0.1802407   2.014 0.054919 .  
## citySpringfield     0.7205051  0.1830572   3.936 0.000584 ***
## years_in_business  -0.0004254  0.0288360  -0.015 0.988348    
## coffee_quality_num  0.1585644  0.1007563   1.574 0.128119    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3901 on 25 degrees of freedom
## Multiple R-squared:  0.4905, Adjusted R-squared:  0.409 
## F-statistic: 6.017 on 4 and 25 DF,  p-value: 0.001558
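
Because growing is a 0/1 outcome, the fitted values of lm() are not constrained to lie between 0 and 1. For instance, the coefficients above imply a slightly negative fitted value for an Oakville shop with coffee_quality_num = 1. Logistic regression is a common alternative for binary outcomes. The following is a minimal sketch, assuming a hypothetical data frame named toy with invented values; it is not the model used in this report.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  years_in_business  = 1:12,
  coffee_quality_num = c(1, 2, 1, 3, 2, 1, 3, 2, 3, 1, 2, 3),
  growing            = c(0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1)
)

# Logistic regression: models the log-odds of growing as a linear
# function of the predictors.
logit_fit <- glm(growing ~ years_in_business + coffee_quality_num,
                 data = toy, family = binomial)

# Predicted probabilities are always strictly between 0 and 1.
p_hat <- predict(logit_fit, type = "response")
print(round(p_hat, 2))

# Coefficients are on the log-odds scale; exponentiate for odds ratios.
print(round(exp(coef(logit_fit)), 2))
```

The coefficients of a logistic model are read as changes in log-odds, so they are not directly comparable to the lm() estimates above.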

# Predicting Growth Section Changes:
#   Changed model name from 'm' to 'model' for clarity
#   Replaced 'coffee_quality' (character) with `coffee_quality_num` (numeric) so that quality enters the model as a single ordinal term
#   Clarified the description of the predictors so they are easier to understand
#   Cleaned up section title and chunk name for consistency
#   Added context explaining Oakville’s impact on the model
#   Changed and tried different variables to put into the model in order to find a better p-value

5 Decision Tree

In this section, I use a decision tree to predict whether a coffee shop is growing (growing = 1) or not (growing = 0). Decision trees are helpful for understanding how different variables (like street, business age, and coffee quality) split the data to classify growth outcomes.

library(rpart)
library(rpart.plot)


tree_model <- rpart(growing ~ street + years_in_business + coffee_quality_num, 
                    data = t, method = "class")


rpart.plot(tree_model, type = 2, extra = 106, box.palette = "BuGn", fallen.leaves = TRUE)
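
A plotted tree shows the splits but not how often the tree classifies shops correctly. One possible check is a confusion matrix of actual versus predicted classes. The sketch below assumes a hypothetical 30-row data frame named toy with invented values. It measures accuracy on the same data used for fitting, which tends to be optimistic.

```r
library(rpart)

# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  years_in_business  = rep(1:10, times = 3),
  coffee_quality_num = rep(c(1, 2, 3), each = 10),
  growing = c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
              0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
              1, 1, 1, 1, 1, 1, 1, 0, 1, 1)
)

# Store the outcome as a labelled factor so predictions are readable.
toy$growing <- factor(toy$growing, levels = c(0, 1),
                      labels = c("not growing", "growing"))

# Fit a classification tree, then predict the class for each row.
fit <- rpart(growing ~ years_in_business + coffee_quality_num,
             data = toy, method = "class")
pred <- predict(fit, newdata = toy, type = "class")

# Confusion matrix and in-sample accuracy.
confusion <- table(actual = toy$growing, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

With only 30 records, holding out a test set is difficult, so cross-validated error (for example, the table reported by printcp() in rpart) may be a more honest guide than in-sample accuracy.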

# Decision Tree Section Changes:
#   Added `rpart` and `rpart.plot` libraries to enable decision tree modeling and visualization
#   Created `tree_model` using `rpart()` to predict the `growing` variable
#   Used `rpart.plot()` in order to visualize the tree with readable labels and color styling
#   Provided context on why a decision tree is useful for this type of problem

6 Conclusion

In conclusion, location (city) is the strongest predictor of coffee shop growth in this dataset.

Key Findings:

Coffee shops in Springfield were significantly more likely to be growing than those in Oakville (p < 0.001); Riverton showed a similar but only marginally significant difference (p ≈ 0.055)
Coffee quality showed a positive but not statistically significant trend (p ≈ 0.13), suggesting better-rated shops may be more likely to grow
Years in business had almost no effect on growth in this dataset
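
Claims about significance can be checked directly against the coefficient table of a fitted model. The following is a minimal sketch, assuming a hypothetical data frame named toy with invented values and an arbitrary 0.05 threshold; it does not reproduce the figures in this report.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  city = rep(c("Oakville", "Riverton", "Springfield"), each = 4),
  years_in_business = c(1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12),
  growing = c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1)
)

fit <- lm(growing ~ city + years_in_business, data = toy)

# The coefficient table is a matrix with one row per term.
coefs <- coef(summary(fit))
pvals <- coefs[, "Pr(>|t|)"]

# Terms whose p-value falls below the chosen threshold.
alpha <- 0.05
significant <- names(pvals)[pvals < alpha]
print(round(pvals, 3))
print(significant)

# Confidence intervals show the uncertainty around each estimate.
print(confint(fit, level = 0.95))
```

Listing the p-values next to the confidence intervals makes it easier to separate effects that clear the threshold from those that only show a trend.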

Coffee Shop Analysis Exam 2

Grace Bowersox

2025-05-01