This report presents a preliminary analysis of a dataset of coffee shops across various U.S. cities. The objective is to identify patterns in shop density, ratings, and pricing that may inform strategic decisions about future locations and marketing approaches. I find that address is the best predictor of coffee shop quality, with an R-squared of 0.77.
The dataset contains the following columns:
name: Name of the coffee shop
city: City where the shop is located
street: Street address
years in business: Years in business
coffee quality: good, ok, or bad
growing: 1 if the shop is growing, 0 if not
There are 30 records and 8 variables.
library(ggcorrplot)
# NOTE by Joe. Not sure why this doesn't work?
tcorr <- t %>%
  select(where(is.numeric))
ggcorrplot(cor(tcorr),
           lab = TRUE)
### Next I fixed the ggcorrplot call to see how my variables correlate with each other. Before fixing the plot, I first made a new tibble called "tcorr" that holds only my numeric columns. I then passed the correlation matrix of this tibble to ggcorrplot and got some promising results. To fix the original code, I first loaded the library. Then I typed out the full function name, ggcorrplot, not just corr. Finally, I added labels to the plot to make it easier to read.
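The numeric-column selection and correlation step described above can be sketched in base R. This is a minimal illustration, not the report's actual data: the `shops` data frame and all of its values are made up, and only the column names are borrowed from the report.

```r
# Hypothetical toy data standing in for the coffee shop tibble `t`;
# the values are invented for illustration only.
shops <- data.frame(
  name              = c("A", "B", "C", "D", "E", "F"),
  years_in_business = c(1, 3, 5, 7, 9, 11),
  quality_num       = c(0, 1, 1, 2, 2, 2),
  growing           = c(0, 0, 1, 1, 1, 1)
)

# Keep only the numeric columns (a base R analogue of select(where(is.numeric)))
numeric_cols <- shops[, sapply(shops, is.numeric), drop = FALSE]

# Pairwise Pearson correlations; a matrix like this is what ggcorrplot() draws
corr_matrix <- cor(numeric_cols)
print(round(corr_matrix, 2))
```

Dropping the non-numeric columns first matters because `cor()` stops with an error if it is handed a character column such as `name`.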
We begin by examining where most coffee shops are located.
table(t$city)
##
## Oakville Riverton Springfield
## 10 10 10
table(t$street)
##
## 101 First Ave 1010 Willow Dr 123 Main St 135 Pine Rd 147 Birch Way
## 1 1 6 1 1
## 159 Willow Ln 258 Spruce Ct 303 Third Rd 334 Pine St 369 Cedar Blvd
## 1 1 1 4 1
## 404 Fourth Blvd 445 Cedar Rd 456 Elm St 556 Elm Blvd 667 Maple Way
## 1 1 1 1 1
## 707 Seventh Ct 778 Birch Ct 789 Oak Ave 808 Eighth Way 889 Hickory Ln
## 1 1 1 1 1
## 909 Ninth Ave 990 Spruce Pl
## 1 1
hist(t$years_in_business)
hist(t$growing)
hist(t$quality_num)
ggplot(t, aes(x = factor(quality_num, levels = 0:2, labels = c("Bad", "OK", "Good")))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Coffee Quality Distribution",
       x = "Coffee Quality",
       y = "Number of Shops") +
  scale_y_continuous(limits = c(0, 12)) +
  theme_minimal()
### All I did here was add two plots to show the distribution of coffee quality. One shows the numeric quality_num rating, and the other shows the same data labeled as Bad, OK, and Good.
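The report does not show how quality_num was created from the text quality column. One possible way, sketched here as an assumption, is to build a factor with ordered levels and subtract one, which yields the same 0 = Bad, 1 = OK, 2 = Good coding used in the bar chart. The `quality` vector is hypothetical.

```r
# Hypothetical quality labels; the real column is not reproduced here
quality <- c("good", "ok", "bad", "good", "ok")

# Fix the level order so that bad < ok < good
quality_f <- factor(quality, levels = c("bad", "ok", "good"))

# as.integer() returns level positions 1..3, so subtract 1 for a 0..2 coding
quality_num <- as.integer(quality_f) - 1L

# Counts per label, in level order rather than alphabetical order
print(table(quality_f))
```

Setting the levels explicitly avoids R's default alphabetical ordering, which would place "good" before "ok".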
I predict growth using a number of variables. The model is highly predictive. While it does not show Oakville, it does work for the other two cities.
test01 <- sample(x = 0:1,
                 size = 30,
                 replace = TRUE,
                 prob = c(0.6, 0.4))
table(test01)
## test01
## 0 1
## 23 7
t <- t %>%
  mutate(is_test = test01)
t_train <- t %>%
  filter(is_test == 0) %>%
  select(-is_test)
t_test <- t %>%
  filter(is_test == 1) %>%
  select(-is_test)
train_m <- lm(growing ~ town_id + quality_num, data = t_train)
summary(train_m)
##
## Call:
## lm(formula = growing ~ town_id + quality_num, data = t_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.85306 -0.09116 0.05034 0.11156 0.63401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.78231 0.15706 4.981 7.18e-05 ***
## town_id -0.41633 0.09036 -4.607 0.00017 ***
## quality_num 0.07075 0.09405 0.752 0.46068
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3536 on 20 degrees of freedom
## Multiple R-squared: 0.5435, Adjusted R-squared: 0.4979
## F-statistic: 11.91 on 2 and 20 DF, p-value: 0.0003928
t_test_predictions <- t_test %>%
  mutate(predicted = predict(train_m, newdata = t_test),
         predicted = round(predicted))
### First I split my data into training and testing sets. For my linear regression model I chose town_id and quality_num as my independent variables. The variables the intern chose were not very good: they were only weakly correlated with the growing variable, and they had higher p-values, which means there was little evidence that they were related to growth. With my new model I got an Adjusted R-squared of 0.4979. Town_id was the most significant variable in the regression (p = 0.00017), while quality_num was not significant (p = 0.46). I then added a predicted column to my test data. When the model was exposed to new data it did very well. It got only one prediction wrong: it predicted that Steam Beans was not growing when in fact it is.
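The split-and-evaluate workflow described above can be sketched end to end as follows. Everything in this block is an illustrative assumption: the `shops` data are randomly generated, the seed of 42 is arbitrary, and the accuracy it prints says nothing about the real coffee shop data. The seed is added only so that the random split can be repeated.

```r
set.seed(42)  # arbitrary seed (an assumption) so the split is repeatable

# Hypothetical data: 30 shops with randomly generated values
shops <- data.frame(
  town_id     = rep(0:2, each = 10),
  quality_num = sample(0:2, 30, replace = TRUE),
  growing     = rbinom(30, 1, 0.5)
)

# Same 60/40 sampling scheme as in the report
is_test <- sample(0:1, size = nrow(shops), replace = TRUE, prob = c(0.6, 0.4))
train <- shops[is_test == 0, ]
test  <- shops[is_test == 1, ]

# Linear model fitted on the training rows only
fit <- lm(growing ~ town_id + quality_num, data = train)

# Round predictions to 0/1 and clamp them, since lm() can predict outside [0, 1]
pred <- round(predict(fit, newdata = test))
pred <- pmin(pmax(pred, 0), 1)

# Confusion table and share of correct predictions on the held-out shops
print(table(actual = test$growing, predicted = pred))
accuracy <- mean(pred == test$growing)
print(accuracy)
```

Because growing is a 0/1 outcome, a logistic regression fitted with `glm(..., family = binomial)` would be a common alternative to `lm()`; the linear model is kept here to mirror the report.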
I create a decision tree to predict the growth variable.
# NOTE by Joe. Not sure why this doesn't work?
library(rpart.plot)
## Loading required package: rpart
dt <- rpart(formula = growing ~ town_id + quality_num,
            data = t,
            minsplit = 2,
            minbucket = 3,
            method = 'class')
rpart.plot(dt)
### Since the intern had no code here, I had to build my decision tree from scratch. I used the same formula as my regression, with minsplit = 2 and minbucket = 3. The tree found that if your town_id is 0, meaning your coffee shop is located in Springfield, your shop is growing. This branch held 33% of the data and was incorrect only once. At the other extreme is Oakville: the tree found that if your town_id is 2, meaning your shop is located in Oakville, your shop is not growing. As with predicting growth in Springfield, it was wrong only once. Lastly, the tree found that if you are located in Riverton and your coffee quality is good, meaning your quality_num is 2, you are growing. If your coffee is ok or bad, you are most likely not growing.
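To check a tree like this numerically rather than by reading the plot, the fitted model can be asked for class predictions and compared against the actual labels. The sketch below uses made-up data in which growth depends only on town_id, so the pattern is deliberately easy to find; the `shops` data frame and its values are hypothetical. The control parameters match the ones above but are passed through `rpart.control()`.

```r
library(rpart)

# Hypothetical data, built so that only shops with town_id 0 are growing
shops <- data.frame(
  town_id     = rep(0:2, each = 10),
  quality_num = rep(c(0, 1, 2, 2, 1), times = 6),
  growing     = factor(rep(c(1, 0, 0), each = 10))
)

# Classification tree with the same formula and control settings as the report
tree <- rpart(growing ~ town_id + quality_num,
              data = shops,
              method = "class",
              control = rpart.control(minsplit = 2, minbucket = 3))

# Predicted class for each shop, compared with the actual label
pred <- predict(tree, newdata = shops, type = "class")
print(table(actual = shops$growing, predicted = pred))
```

Note that this evaluates the tree on the same rows it was trained on, as the report's tree does with `data = t`, so the agreement it shows is optimistic compared with a held-out test set.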