========================================

==============Exam Bad========================== ====================end of instructions and start of test====================

1 Introduction

This report presents a preliminary analysis of a dataset containing information about coffee shops across various U.S. cities. The objective is to identify patterns in shop density, ratings, and pricing that may inform strategic decisions about future locations and marketing approaches. I find that street address is the best predictor of coffee shop quality, with an R^2 of 0.77.

2 Data Overview

The dataset contains the following columns:

  • name: Name of the coffee shop
  • city: City where the shop is located
  • street: Street address
  • years in business: Number of years the shop has been in business
  • coffee quality: Rated good, ok, or bad
  • growing: 1 if the shop is growing, 0 if not

There are 30 records and 8 variables.

2.1 Correlations

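library(dplyr)      # provides %>%, select(), and where()
library(ggcorrplot)
# NOTE by Joe. Not sure why this doesn't work?
# Keep only the numeric columns, since cor() requires numeric input.
tcorr <- t %>%
  select(where(is.numeric))

ggcorrplot(cor(tcorr),
           lab = TRUE)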

### Next I fixed the ggcorrplot call to see how my variables correlate with each other. Before fixing it, I made a new tibble called "tcorr" that holds only the numeric columns. I then passed the correlation matrix of this tibble to ggcorrplot and got some promising results. To fix the original code, I first loaded the library. Then I typed out the full function name, ggcorrplot, not just corr. Finally, I added labels to the plot to make it easier to read.
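
A minimal base-R sketch of why the numeric-only step matters: cor() refuses non-numeric columns, so they have to be dropped first. The data frame `shops` and all of its values are made up for illustration and are not the report's data.

```r
# Hypothetical toy data; not the report's dataset.
shops <- data.frame(
  name              = c("A", "B", "C", "D", "E"),
  years_in_business = c(2, 5, 9, 12, 20),
  quality_num       = c(0, 1, 1, 2, 2),
  growing           = c(0, 0, 1, 1, 1)
)

# cor() stops with an error when a column is not numeric.
result <- try(cor(shops), silent = TRUE)
inherits(result, "try-error")

# Keep only the numeric columns, then compute the correlation matrix.
numeric_cols <- shops[, sapply(shops, is.numeric), drop = FALSE]
corr_matrix  <- cor(numeric_cols)
round(corr_matrix, 2)
```

The same filtering is what `select(where(is.numeric))` does in the dplyr version.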

3 Histograms

We begin by examining where most coffee shops are located.

table(t$city)
## 
##    Oakville    Riverton Springfield 
##          10          10          10
table(t$street)
## 
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way 
##               1               1               6               1               1 
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd 
##               1               1               1               4               1 
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way 
##               1               1               1               1               1 
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln 
##               1               1               1               1               1 
##   909 Ninth Ave   990 Spruce Pl 
##               1               1
hist(t$years_in_business)

hist(t$growing)

hist(t$quality_num)

ggplot(t, aes(x = factor(quality_num, levels = 0:2, labels = c("Bad", "OK", "Good")))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Coffee Quality Distribution",
       x = "Coffee Quality",
       y = "Number of Shops") +
  scale_y_continuous(limits = c(0, 12)) +
  theme_minimal()

### Here I added two plots to show the distribution of coffee quality. One shows it by the numeric quality_num rating, and the other labels the ratings as Bad, OK, and Good.

4 Predicting growth

I predict growth by using a number of variables. The model is highly predictive. While it does not show Oakville, it does work for the other two cities.

test01 <- sample(x = 0:1,
                 size = 30,
                 replace = TRUE,
                 prob = c(0.6, 0.4))
table(test01)
## test01
##  0  1 
## 23  7
t <- t %>%
  mutate(is_test = test01)

t_train <- t %>% 
  filter(is_test == 0) %>% 
  select(-is_test)

t_test  <- t %>% 
  filter(is_test == 1) %>% 
  select(-is_test)

train_m <- lm(growing ~ town_id + quality_num, data = t_train)

summary(train_m)
## 
## Call:
## lm(formula = growing ~ town_id + quality_num, data = t_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.85306 -0.09116  0.05034  0.11156  0.63401 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78231    0.15706   4.981 7.18e-05 ***
## town_id     -0.41633    0.09036  -4.607  0.00017 ***
## quality_num  0.07075    0.09405   0.752  0.46068    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3536 on 20 degrees of freedom
## Multiple R-squared:  0.5435, Adjusted R-squared:  0.4979 
## F-statistic: 11.91 on 2 and 20 DF,  p-value: 0.0003928
t_test_predictions <- t_test %>% 
  mutate(predicted = predict(train_m, newdata = t_test),
         predicted = round(predicted))


### First I split my data into training and testing sets. For my linear regression model I chose town_id and quality_num as my independent variables. The variables the intern chose were not very good: they were only weakly correlated with the growing variable, and they had high p-values, meaning there was little evidence that they were related to growth. With my new variables, the summary above shows an Adjusted R-squared of 0.4979. town_id was the most significant predictor (p = 0.00017), while quality_num was not significant (p = 0.46). I then made a predicted column in my test data. When the model was exposed to new data it did very well. It got only one prediction wrong: it predicted that Steam Beans was not growing when in fact it is.
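
One way to check a claim such as "only one prediction wrong" is to tabulate the rounded predictions against the actual values on the held-out rows. The sketch below is illustrative only: the data frame `toy` is made up, and the fixed seed is an assumption (the code above does not set one), so its numbers are not the report's results.

```r
set.seed(42)  # assumption: fixed seed so the random split is repeatable

# Hypothetical toy data; not the report's dataset.
toy <- data.frame(
  town_id     = rep(0:2, each = 10),
  quality_num = rep(c(0, 1, 2, 1, 2), times = 6)
)
toy$growing <- as.integer(toy$town_id == 0 |
                          (toy$town_id == 1 & toy$quality_num == 2))

# Randomly flag roughly 40% of rows as test data.
is_test <- sample(0:1, size = nrow(toy), replace = TRUE, prob = c(0.6, 0.4))
train   <- toy[is_test == 0, ]
test    <- toy[is_test == 1, ]

# Fit on the training rows only, then predict the held-out rows.
fit       <- lm(growing ~ town_id + quality_num, data = train)
predicted <- round(predict(fit, newdata = test))

# Confusion table and accuracy on the test rows.
table(predicted = predicted, actual = test$growing)
accuracy <- mean(predicted == test$growing)
accuracy
```

Without a seed, every run draws a different split, so the coefficients, R-squared, and test errors change from one knit to the next.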

5 Decision Tree

I create a decision tree to predict the growth variable.

# NOTE by Joe. Not sure why this doesn't work?
library(rpart.plot)
## Loading required package: rpart
dt <- rpart(formula = growing ~ town_id + quality_num,
           data = t,
           minsplit = 2,
           minbucket = 3,
           method = 'class')

rpart.plot(dt)

### Since the intern had no code here, I had to build my decision tree from scratch. I used the same formula as my regression, with minsplit = 2 and minbucket = 3. The tree found that if your town_id is 0, meaning your coffee shop is located in Springfield, you are a growing shop. This leaf held 33% of the data and was incorrect only once. At the other extreme is Oakville: the tree found that if your town_id is 2, meaning the shop is located in Oakville, it is not growing. As with predicting growth in Springfield, it was wrong only once. Lastly, the tree found that if you are located in Riverton and your coffee quality is good, meaning your quality_num is 2, you are growing. If your coffee is ok or bad, you are most likely not growing.
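
The per-leaf error counts described above can be checked by comparing the tree's class predictions with the actual labels. This is a minimal sketch that assumes only the rpart package; the data frame `toy` is made up to follow the rules described and is not the report's dataset.

```r
library(rpart)

# Hypothetical toy data; not the report's dataset.
toy <- data.frame(
  town_id     = rep(0:2, each = 10),
  quality_num = rep(c(0, 1, 2, 1, 2), times = 6)
)
toy$growing <- as.integer(toy$town_id == 0 |
                          (toy$town_id == 1 & toy$quality_num == 2))

# Same formula and control settings as the tree above.
tree <- rpart(growing ~ town_id + quality_num,
              data = toy,
              method = "class",
              control = rpart.control(minsplit = 2, minbucket = 3))

# Class predictions on the data the tree was fitted to.
tree_pred <- predict(tree, newdata = toy, type = "class")

# Rows are predictions, columns are actual labels.
table(predicted = tree_pred, actual = toy$growing)

# Share of shops classified correctly (in-sample, so optimistic).
tree_accuracy <- mean(as.character(tree_pred) == as.character(toy$growing))
tree_accuracy
```

Because the tree is scored on the same rows it was fitted to, this accuracy overstates how well it would do on new shops; a train/test split like the one used for the regression would give a fairer estimate.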