This section loads the data used for this project, along with two libraries.
This report presents a preliminary analysis of a data set containing information about coffee shops across various U.S. cities. The objective is to identify patterns in location, years in business, and quality that may inform strategic decisions about future locations and marketing approaches. I find that city is the strongest predictor of coffee shop growth.
The data set contains the following columns:
name: Name of the coffee shop
city: City where the shop is located
street: Street address
years in business: Number of years the shop has been in business
coffee quality: good, ok, or bad
growing: 1 if the shop is growing, 0 if not

These are columns I added to the data set to make it easier to analyze:

coffee quality number: good = 2, ok = 1, bad = 0
city number: Oakville = 2, Riverton = 1, Springfield = 0
predicted: predicted growth
is_correct: TRUE if the prediction was correct, FALSE if not

This plot shows how each variable correlates with the others. A positive correlation means that as one variable increases, the other tends to increase too. A negative correlation means that as one variable increases, the other tends to decrease. The closer the value is to 1 or -1, the stronger the correlation. You can see here that city and coffee quality are the two variables most strongly correlated with growth.
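To make the idea of correlation strength concrete, here is a minimal sketch in base R. The vectors `x`, `y_up`, and `y_down` and the helper `pearson` are made up for illustration and are not part of the coffee shop data. The sketch computes the Pearson correlation from its definition and compares it with R's built-in `cor()`.

```r
# Toy vectors, invented for illustration only (not from the coffee shop data).
x <- c(1, 2, 3, 4, 5)
y_up <- c(2, 4, 5, 4, 6)     # tends to rise as x rises
y_down <- c(10, 8, 6, 4, 2)  # falls in a straight line as x rises

# Pearson correlation from its definition:
# the summed co-deviation, scaled by the spread of each variable.
pearson <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) /
    sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
}

r_up <- pearson(x, y_up)
r_down <- pearson(x, y_down)

# The hand-written version should agree with base R's cor().
print(c(by_hand = r_up, builtin = cor(x, y_up)))
print(r_down)
```

The first pair gives a positive value below 1 because the points rise but do not fall on a perfect line. The second pair sits exactly on a falling line, so its correlation is -1.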
# NOTE by Joe. Not sure why this doesn't work?
# CHANGED NOTE by Nolan. I added the missing values to the correlation matrix. I also made the quality and city into a numerical value to be able to show correlation. I also made it into a visual correlation matrix.
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.4.3
t <- t %>%
  mutate(
    coffee_quality_num = case_when(
      coffee_quality == "good" ~ 2,
      coffee_quality == "ok" ~ 1,
      coffee_quality == "bad" ~ 0,
      TRUE ~ NA_real_
    )
  )
t <- t %>%
  mutate(
    city_num = case_when(
      city == "Oakville" ~ 2,
      city == "Riverton" ~ 1,
      city == "Springfield" ~ 0,
      TRUE ~ NA_real_
    )
  )
ggcorrplot(
  cor(select(t, years_in_business, coffee_quality_num, growing, city_num),
      use = "complete.obs"),
  colors = c('green', 'white', 'yellow'),
  lab = TRUE,
  title = "Correlation Matrix of Coffee Shop Data"
)
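This section examines where the coffee shops are located. The tables below count shops by city and by street address: each city has 10 shops, and the most common addresses are 123 Main St (6 shops) and 334 Pine St (4 shops).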
##
##    Oakville    Riverton Springfield
##          10          10          10
##
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way
##               1               1               6               1               1
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd
##               1               1               1               4               1
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way
##               1               1               1               1               1
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln
##               1               1               1               1               1
##   909 Ninth Ave   990 Spruce Pl
##               1               1
Using a linear regression model, I test how well city and coffee quality predict growth.
m <- lm(growing ~ coffee_quality_num + city_num, data = t)
summary(m)
##
## Call:
## lm(formula = growing ~ coffee_quality_num + city_num, data = t)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.83778 -0.27413  0.00466  0.16222  0.88344
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         0.68021    0.14947   4.551 0.000102 ***
## coffee_quality_num  0.15756    0.08700   1.811 0.081264 .
## city_num           -0.36061    0.08671  -4.159 0.000290 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 27 degrees of freedom
## Multiple R-squared: 0.4905, Adjusted R-squared: 0.4527
## F-statistic: 13 on 2 and 27 DF, p-value: 0.0001114
# CHANGED NOTE by Nolan. City and quality are the two biggest factors, and their p-values are close to 0.05 or below.
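As a rough illustration of how to read the coefficient table, the sketch below plugs the reported estimates into the regression equation. The helper `predict_growth` and the two example shops are hypothetical and are not part of the original analysis. The numeric codes follow the `case_when` mapping used earlier (good = 2, bad = 0, Springfield = 0, Oakville = 2).

```r
# Coefficients copied from the summary(m) output.
intercept <- 0.68021
b_quality <- 0.15756
b_city <- -0.36061

# Fitted value of `growing` for a given quality code and city code.
# predict_growth is a hypothetical helper written for this illustration.
predict_growth <- function(quality_num, city_num) {
  intercept + b_quality * quality_num + b_city * city_num
}

best_case <- predict_growth(quality_num = 2, city_num = 0)   # good coffee, Springfield
worst_case <- predict_growth(quality_num = 0, city_num = 2)  # bad coffee, Oakville
print(c(best_case = best_case, worst_case = worst_case))
```

The first fitted value is close to 1 and the second is slightly below 0. Because `growing` is a 0/1 variable, a linear model's fitted values are not constrained to that range, so they are best read as rough scores rather than exact probabilities.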
I created a decision tree to predict the growth variable. As the tree's buckets and the table show, most predictions were correct: the accuracy is 0.833 (25 of 30 shops).
# NOTE by Joe. Not sure why this doesn't work?
# CHANGED NOTE by Nolan. I created a decision tree by using the two factors I said I would use. I then plotted it so you can see the buckets and then checked the accuracy through a prediction model and showed the results through a table.
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
m2 <- rpart(formula = growing ~ coffee_quality_num + city_num,
            data = t,
            minsplit = 3,
            minbucket = 3,
            method = 'class')
rpart.plot(m2)
predicted <- predict(m2, t, type = 'class')
t_new <- mutate(t,
                predicted = predicted,
                is_correct = predicted == growing)
accuracy <- sum(t_new$is_correct) / nrow(t_new)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.833333333333333"
table(str_to_upper(t_new$predicted), t_new$growing)
##
##      0  1
##   0 14  3
##   1  2 11
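The accuracy can also be recovered from the table itself, which is a useful cross-check. The sketch below rebuilds the table by hand in base R (rows are predictions, columns are actual values of `growing`). The names `confusion`, `accuracy_from_table`, and `per_class` are assumptions made for this illustration.

```r
# Confusion matrix copied from the table output: rows = predicted, columns = actual.
# matrix() fills column by column, so the first column is c(14, 2).
confusion <- matrix(c(14, 2, 3, 11), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("0", "1")))

# Accuracy = correct predictions (the diagonal) divided by all predictions.
accuracy_from_table <- sum(diag(confusion)) / sum(confusion)

# Share of each actual class that the tree predicted correctly.
per_class <- diag(confusion) / colSums(confusion)

print(accuracy_from_table)
print(per_class)
```

The diagonal holds 14 + 11 = 25 correct predictions out of 30, which matches the reported accuracy of 0.833. The per-class shares show that the tree is right for 14 of the 16 shops that are not growing and 11 of the 14 that are.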
In conclusion, the analysis of the coffee shop data set indicates that city is a significant predictor of growth (p < 0.001), while coffee quality is only marginally significant (p ≈ 0.08). The decision tree model achieved an accuracy of 0.833 on the same 30 shops used to fit it, which suggests these two factors are useful for predicting growth. This information can inform strategic decisions about future locations and marketing approaches for coffee shops.