This report presents a preliminary analysis of a dataset containing information about coffee shops across three U.S. cities. The objective is to identify patterns in shop density, ratings, and business growth that may inform strategic decisions about future locations and marketing approaches. Based on initial analysis, location (city) appears to be a strong predictor of whether a coffee shop is growing. The R² value of the linear model is approximately 0.49, suggesting a moderate relationship.
# Introduction Section Changes:
# Fixed typo in introduction
# Corrected grammar and improved clarity for a more professional tone
# Made the introduction more specific so readers know what they are looking into
# Added more context for my R² value for better understanding
The dataset contains the following columns:
- name: Name of the coffee shop
- street: Street address
- city: City where the shop is located
- years_in_business: Number of years the shop has been operating
- coffee_quality: Quality rating of the coffee (good, ok, or bad)
- growing: Indicator if the shop is growing (1 = growing, 0 = not growing)

There are 30 records (rows) and 6 variables (columns).
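A quick structural check can confirm the column list and dimensions described above. The sketch below uses a small hypothetical data frame, `toy`, with invented rows and shop names; it stands in for the real dataset, which is assumed to be loaded as `t` elsewhere in the report.

```r
# Hypothetical stand-in for the real data; all values are invented for illustration.
toy <- data.frame(
  name = c("Bean There", "Daily Grind", "Cup o' Joe"),
  street = c("Main St", "Oak Ave", "Main St"),
  city = c("Springfield", "Riverton", "Oakville"),
  years_in_business = c(3, 7, 1),
  coffee_quality = c("good", "ok", "bad"),
  growing = c(1, 1, 0),
  stringsAsFactors = FALSE
)

dim(toy)  # number of rows, then number of columns
str(toy)  # column names and types

# Confirm the categorical and indicator columns contain only the expected values
stopifnot(all(toy$coffee_quality %in% c("good", "ok", "bad")))
stopifnot(all(toy$growing %in% c(0, 1)))
```

Running the same checks on `t` should report 30 rows and 6 columns if the description above holds.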
# Data Overview Section Changes:
# Put columns into the order they appear in the dataset
# Clarified variable descriptions
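library(dplyr)

# Map the ordered quality labels to numbers so cor() can use them
t <- t %>%
  mutate(coffee_quality_num = case_when(
    coffee_quality == "bad" ~ 1,
    coffee_quality == "ok" ~ 2,
    coffee_quality == "good" ~ 3
  ))

# Pairwise correlations between the numeric columns
cor(select(t, years_in_business, coffee_quality_num, growing))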
##                    years_in_business coffee_quality_num    growing
## years_in_business         1.00000000          0.2375221 0.01210747
## coffee_quality_num        0.23752214          1.0000000 0.40505554
## growing                   0.01210747          0.4050555 1.00000000
# Correlations Section Changes:
# Fixed error in cor() as it only works on numeric data
# Changed coffee_quality to numeric from categorical
# Added a new column called 'coffee_quality_num' to the dataset
# Calculated correlations using the new numeric column
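The fix described in these notes can be reproduced on a small scale. The sketch below uses a hypothetical data frame, `toy`, with invented values, and a base-R named-vector lookup as an alternative to the `case_when()` recoding used in the report. It first shows `cor()` rejecting a character column, then succeeding once the quality labels are numeric.

```r
# Hypothetical rows, invented for illustration only.
toy <- data.frame(
  years_in_business = c(1, 4, 6, 2, 9, 5),
  coffee_quality = c("bad", "ok", "good", "ok", "good", "bad"),
  growing = c(0, 1, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# cor() requires numeric input, so including the character column raises an error
result <- tryCatch(
  cor(toy[, c("years_in_business", "coffee_quality", "growing")]),
  error = function(e) "error"
)

# Base-R alternative to case_when(): look up each label in a named vector
quality_codes <- c(bad = 1, ok = 2, good = 3)
toy$coffee_quality_num <- unname(quality_codes[toy$coffee_quality])

# With only numeric columns, cor() returns a 3 x 3 correlation matrix
cor_matrix <- cor(toy[, c("years_in_business", "coffee_quality_num", "growing")])
cor_matrix
```

Because the quality rating is ordinal, a rank-based measure such as `cor(..., method = "spearman")` may also be worth comparing. That is a suggestion, not something the original analysis used.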
I begin by examining where most coffee shops are located and their business characteristics. Below I show:
- A bar plot of the number of shops in each city
- A bar plot of the most common street locations
- A histogram of the years in business
- A bar plot showing the number of shops that are growing vs. not growing
library(ggplot2)

# Number of shops in each city
ggplot(t, aes(x = city)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Number of Coffee Shops by City", x = "City", y = "Count")

# Ten most common streets, ordered by count
t %>%
  count(street, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  ggplot(aes(x = reorder(street, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 10 Most Common Streets", x = "Street", y = "Count")

# Distribution of years in business, using one-year bins
ggplot(t, aes(x = years_in_business)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "white") +
  labs(title = "Histogram of Years in Business", x = "Years", y = "Count")

# Shops that are growing vs. not growing
ggplot(t, aes(x = factor(growing, labels = c("Not Growing", "Growing")))) +
  geom_bar(fill = "orange") +
  labs(title = "Business Growth Status", x = "Growth Status", y = "Count")
# Histograms Section Changes:
# Replaced R tables with clearer ggplot2 visualizations to make them easier to understand
# Added more descriptive titles to the plots in order to know what the plots represent
# Reordered street bar plot and limited it to top 10 in order to give a glimpse of the data but not bombard with too much information
# Added color to plots and cleaned up names to make it easier to understand
To predict whether a coffee shop is growing, I used a linear regression model. The model uses the following variables:
- city: City where the shop is located
- years_in_business: Number of years the shop has been operating
- coffee_quality_num: Quality rating of the coffee as a number (bad = 1, ok = 2, good = 3)

The model may not generalize well to cities with fewer or no growing businesses, such as Oakville.
model <- lm(growing ~ city + years_in_business + coffee_quality_num, data = t)
summary(model)
##
## Call:
## lm(formula = growing ~ city + years_in_business + coffee_quality_num,
## data = t)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.83615 -0.27474  0.00422  0.16300  0.88265
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -0.1989328  0.2515354  -0.791 0.436457
## cityRiverton        0.3629578  0.1802407   2.014 0.054919 .
## citySpringfield     0.7205051  0.1830572   3.936 0.000584 ***
## years_in_business  -0.0004254  0.0288360  -0.015 0.988348
## coffee_quality_num  0.1585644  0.1007563   1.574 0.128119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3901 on 25 degrees of freedom
## Multiple R-squared: 0.4905, Adjusted R-squared: 0.409
## F-statistic: 6.017 on 4 and 25 DF, p-value: 0.001558
# Predicting Growth Section Changes:
# Changed model name from 'm' to 'model' for clarity
# Replaced `coffee_quality` (character) with `coffee_quality_num` (numeric), since `lm()` requires numeric or factor predictors
# Clarified the description of the predictors used
# Cleaned up section title and chunk name for consistency
# Added context explaining Oakville's impact on the model
# Changed and tried different variables to put into the model in order to find a better p-value
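To make the coefficient table easier to read, the sketch below plugs the reported estimates into the model equation for two hypothetical shops. The helper `predict_growth()` is an illustrative name, not part of the original analysis. Oakville is treated as the reference city because it has no coefficient of its own in the output.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(
  intercept = -0.1989328,
  cityRiverton = 0.3629578,
  citySpringfield = 0.7205051,
  years_in_business = -0.0004254,
  coffee_quality_num = 0.1585644
)

# Hypothetical helper: linear prediction for a single shop.
# Oakville is assumed to be the reference level, so it adds no city term.
predict_growth <- function(city, years, quality_num) {
  city_effect <- switch(city,
    Oakville = 0,
    Riverton = coefs[["cityRiverton"]],
    Springfield = coefs[["citySpringfield"]],
    stop("unknown city: ", city)
  )
  coefs[["intercept"]] + city_effect +
    coefs[["years_in_business"]] * years +
    coefs[["coffee_quality_num"]] * quality_num
}

# A five-year-old Springfield shop with good coffee (quality 3)
springfield_good <- predict_growth("Springfield", years = 5, quality_num = 3)
springfield_good  # about 0.995

# A five-year-old Oakville shop with bad coffee (quality 1)
oakville_bad <- predict_growth("Oakville", years = 5, quality_num = 1)
oakville_bad      # about -0.042
```

The second value falls below zero. Because `growing` is a 0/1 indicator, a linear model does not keep predictions within that range, so the fitted values are best read as rough scores, not probabilities.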
In this section, I use a decision tree to predict whether a coffee
shop is growing (growing = 1) or not (growing
= 0). Decision trees are helpful for understanding how different
variables (like street, business age, and coffee quality) split the data
to classify growth outcomes.
library(rpart)
library(rpart.plot)
tree_model <- rpart(growing ~ street + years_in_business + coffee_quality_num,
data = t, method = "class")
rpart.plot(tree_model, type = 2, extra = 106, box.palette = "BuGn", fallen.leaves = TRUE)
# Decision Tree Section Changes:
# Added `rpart` and `rpart.plot` libraries to enable decision tree modeling and visualization
# Created `tree_model` using `rpart()` to predict the `growing` variable
# Used `rpart.plot()` in order to visualize the tree with readable labels and color styling
# Provided context on why a decision tree is useful for this type of problem
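One way to check how well a fitted tree classifies the shops is to compare its predictions with the observed outcomes. The sketch below is a minimal illustration on a hypothetical data frame, `toy`, with invented values. It omits `street` to keep the example small, so it is not a reproduction of `tree_model`.

```r
library(rpart)

# Hypothetical data, invented for illustration (not the report's dataset)
toy <- data.frame(
  years_in_business = rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), times = 4),
  coffee_quality_num = rep(c(1, 2, 3, 3, 2, 1, 3, 2), times = 5),
  growing = factor(rep(c(0, 0, 1, 1, 0, 0, 1, 0), times = 5))
)

# Classification tree on the two numeric predictors
tree_toy <- rpart(growing ~ years_in_business + coffee_quality_num,
                  data = toy, method = "class")

# Predicted class for each training row
pred <- predict(tree_toy, newdata = toy, type = "class")

# Confusion matrix and share of rows classified correctly
confusion <- table(predicted = pred, actual = toy$growing)
accuracy <- mean(pred == toy$growing)

confusion
accuracy
```

Accuracy measured on the same rows used to fit the tree tends to be optimistic. With only 30 records in the real dataset, a held-out sample or cross-validation would likely give a more realistic picture.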
In conclusion, location matters significantly in predicting the growth of coffee shops.
Key Findings: