1 Introduction

This section loads the data used for this project, along with two libraries.

2 Introduction

This report presents a preliminary analysis of a data set containing information about coffee shops across various U.S. cities. The objective is to identify patterns in location, years in business, and quality that may inform strategic decisions about future locations and marketing approaches. I find that city is the best predictor of coffee shop growth.

3 Data Overview

The data set contains the following columns:

  • name: Name of the coffee shop
  • city: City where the shop is located
  • street: Street address
  • years in business: Number of years the shop has been in business
  • coffee quality: Rating of the coffee (good, ok, or bad)
  • growing: 1 if the shop is growing, 0 if not

These are things I added to the data set to make it easier to analyze:

  • coffee quality number: good = 2, ok = 1, bad = 0
  • city number: Oakville = 2, Riverton = 1, Springfield = 0
  • predicted: predicted growth
  • is_correct: 1 if the prediction was correct, 0 if not

3.1 Correlations

This plot shows how each variable correlates with the others. A positive correlation means that as one variable increases, the other does too. A negative correlation means that as one variable increases, the other decreases. The closer the value is to 1 or -1, the stronger the correlation. You can see here that city and coffee quality are the two variables most strongly correlated with growth.

# NOTE by Joe. Not sure why this doesn't work?

# CHANGED NOTE by Nolan. I added the missing values to the correlation matrix. I also made the quality and city into a numerical value to be able to show correlation. I also made it into a visual correlation matrix.

library(ggcorrplot)
t <- t %>%
  mutate(
    coffee_quality_num = case_when(
      coffee_quality == "good" ~ 2,
      coffee_quality == "ok" ~ 1,
      coffee_quality == "bad" ~ 0,
      TRUE ~ NA_real_
    )
  )

t <- t %>% 
  mutate(
    city_num = case_when(
      city == "Oakville" ~ 2,
      city == "Riverton" ~ 1,
      city == "Springfield" ~ 0,
      TRUE ~ NA_real_
    )
  )


ggcorrplot(cor(select(t, years_in_business, coffee_quality_num, growing, city_num), use = "complete.obs"),
            colors = c('green', 'white', 'yellow'),
            lab = TRUE,
            title = "Correlation Matrix of Coffee Shop Data")
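
To make the reading of the correlation values concrete, here is a minimal sketch on made-up vectors. The names x, y_up, and y_down are hypothetical and are not columns of the coffee shop data. It only illustrates that a perfectly increasing relationship gives a correlation of 1 and a perfectly decreasing one gives -1.

```r
# Hypothetical toy vectors, not taken from the coffee shop data.
x <- c(1, 2, 3, 4, 5)
y_up <- 2 * x + 1      # rises whenever x rises
y_down <- 10 - 3 * x   # falls whenever x rises

r_up <- cor(x, y_up)       # perfectly positive linear relationship
r_down <- cor(x, y_down)   # perfectly negative linear relationship

print(c(r_up = r_up, r_down = r_down))
```

Values in the correlation matrix fall between these two extremes, and values near 0 indicate little linear relationship.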

4 Histograms

This section examines where most coffee shops are located, as well as the distribution of years in business and growth. The tables below count the shops by city and by street address.

## 
##    Oakville    Riverton Springfield 
##          10          10          10
## 
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way 
##               1               1               6               1               1 
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd 
##               1               1               1               4               1 
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way 
##               1               1               1               1               1 
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln 
##               1               1               1               1               1 
##   909 Ninth Ave   990 Spruce Pl 
##               1               1

5 Predicting growth

Using a linear regression model, I test how well city and coffee quality predict growth.

m <- lm(growing ~ coffee_quality_num + city_num, data = t)

summary(m)
## 
## Call:
## lm(formula = growing ~ coffee_quality_num + city_num, data = t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83778 -0.27413  0.00466  0.16222  0.88344 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.68021    0.14947   4.551 0.000102 ***
## coffee_quality_num  0.15756    0.08700   1.811 0.081264 .  
## city_num           -0.36061    0.08671  -4.159 0.000290 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3754 on 27 degrees of freedom
## Multiple R-squared:  0.4905, Adjusted R-squared:  0.4527 
## F-statistic:    13 on 2 and 27 DF,  p-value: 0.0001114
# CHANGED NOTE by Nolan. The city and the quality are the two biggest factors. City is significant (p = 0.00029), while quality is only marginal (p = 0.081), just above the 0.05 level.
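
As a rough illustration of what the fitted coefficients imply, the sketch below plugs the estimates from the summary(m) output into the regression equation by hand. The helper fitted_growth is a hypothetical name introduced only for this example, and the city codes follow the case_when coding in the correlation code (Springfield = 0, Oakville = 2).

```r
# Coefficients copied from the summary(m) output above.
b0 <- 0.68021         # (Intercept)
b_quality <- 0.15756  # coffee_quality_num
b_city <- -0.36061    # city_num

# Hypothetical helper: fitted value of `growing` for one shop profile.
fitted_growth <- function(quality_num, city_num) {
  b0 + b_quality * quality_num + b_city * city_num
}

good_springfield <- fitted_growth(2, 0)  # good coffee, city_num = 0
bad_oakville <- fitted_growth(0, 2)      # bad coffee, city_num = 2

print(round(c(good_springfield = good_springfield,
              bad_oakville = bad_oakville), 3))
```

Because growing is coded 0/1, these fitted values read roughly as the estimated chance that a shop is growing. The second value comes out slightly below 0, which shows that a straight-line model is not bounded to the 0 to 1 range.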

6 Decision Tree

I created a decision tree to predict the growth variable. As the buckets in the plot and the table show, the tree predicted most shops correctly: 25 of 30, for an accuracy of 0.833.

# NOTE by Joe. Not sure why this doesn't work?

# CHANGED NOTE by Nolan. I created a decision tree by using the two factors I said I would use. I then plotted it so you can see the buckets and then checked the accuracy through a prediction model and showed the results through a table.

library(rpart)
library(rpart.plot)
m2 <- rpart(formula = growing ~ coffee_quality_num + city_num,
            data = t,
            minsplit = 3,
            minbucket = 3,
            method = 'class')

rpart.plot(m2)

predicted <- predict(m2, t, type = 'class')
t_new <- mutate( t, 
                  predicted = predicted,
                  is_correct = predicted == growing)

accuracy <- sum(t_new$is_correct) / nrow(t_new)
print(paste("Accuracy:", accuracy))
## [1] "Accuracy: 0.833333333333333"
table(t_new$predicted, t_new$growing)  # rows: predicted, columns: actual
##    
##      0  1
##   0 14  3
##   1  2 11
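
The reported accuracy can be recovered directly from the table. The sketch below rebuilds the table by hand, with rows as predictions and columns as actual values, and divides the correct predictions on the diagonal by the total number of shops. The object names cm, correct, and total are introduced here for illustration only.

```r
# Counts copied from the table above: rows are predictions, columns are actual values.
cm <- matrix(c(14, 2, 3, 11), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

correct <- sum(diag(cm))     # 14 + 11 = 25 correct predictions
total <- sum(cm)             # 30 shops in total
accuracy <- correct / total

print(round(accuracy, 3))    # [1] 0.833
```

The off-diagonal cells are the errors: 3 growing shops predicted as not growing, and 2 non-growing shops predicted as growing.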

7 Conclusion

In conclusion, the analysis of the coffee shop data set indicates that city is a significant predictor of growth, while coffee quality is only marginally significant (p = 0.081). The decision tree model built on these two factors achieved an accuracy of 0.833, measured on the same 30 shops used to fit it. This information can be valuable for strategic decision-making regarding future locations and marketing approaches for coffee shops.