========================================
Exam
You have received the report below from a junior analyst. She is a new hire and has been working on this report for a few weeks. She has asked you to review it. You should make corrections to the document and explain each of your changes in a comment (in a code block).
Upload the finished Rmd to eCampus. You should be able to publish the document; put the link at the top of the document.
Delete all of these instructions (everything between the === bars).
========================================
This report presents a preliminary analysis of a dataset containing information about coffee shops across three U.S. cities. The objective is to identify patterns in shop location, coffee quality, and growth that may inform strategic decisions about future locations and marketing approaches. I find that city is the best predictor of coffee shop growth. The multiple R² of the linear model is 0.49.
#Section changes
# I corrected the spelling error in Introduction
# I changed pricing to growth because our tribble doesn't deal with pricing
# I changed ratings to the actual variable coffee quality
# I changed shop density to location because we are dealing with city and street
# In the last sentence I changed address to city because city_numeric is the location variable used in the model
# I changed R^2 to R² and updated its value to match the model summary
# I changed the last sentence to say coffee shop growth instead of ratings
The dataset contains the following columns:
- name: Name of the coffee shop
- city: City where the shop is located
- street: Street address
- years_in_business: Years in business
- coffee_quality: good, ok, or bad
- growing: 1 if the shop is growing, 0 if not

There are 30 records and 6 variables.
#Section changes
# I added underscores in the years_in_business and coffee_quality variables to match the actual data
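The record and variable counts stated above can be confirmed with a quick structural check. The sketch below is only an illustration: `shops_demo` is a hypothetical three-row stand-in with made-up rows, not the real 30-row data, and `expected_cols` simply restates the column list from the data dictionary.

```r
# Hypothetical stand-in for the real data; the rows are made up for illustration.
shops_demo <- data.frame(
  name = c("Shop A", "Shop B", "Shop C"),
  city = c("Springfield", "Riverton", "Oakville"),
  street = c("1 Example St", "2 Example St", "3 Example St"),
  years_in_business = c(2, 5, 9),
  coffee_quality = c("good", "ok", "bad"),
  growing = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Column names as listed in the data dictionary.
expected_cols <- c("name", "city", "street",
                   "years_in_business", "coffee_quality", "growing")

# Rows are records, columns are variables.
dim(shops_demo)

# Column names should match the data dictionary exactly.
cols_match <- identical(names(shops_demo), expected_cols)

# Coded variables should only take the documented values.
quality_ok <- all(shops_demo$coffee_quality %in% c("good", "ok", "bad"))
growing_ok <- all(shops_demo$growing %in% c(0, 1))
```

Running the same checks on the real data frame would catch a misspelled column name or an unexpected category before any modeling starts.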
# NOTE by Joe. Not sure why this doesn't work?
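library(dplyr)       # provides %>%, mutate(), case_when(), and select()
library(ggcorrplot)

# Recode the categorical variables as numbers so they can enter cor() and lm()
t2 <- t %>%
  mutate(
    # Lower code means better quality (good = 1, ok = 2, bad = 3)
    coffee_quality_numeric = case_when(
      coffee_quality == "good" ~ 1,
      coffee_quality == "ok"   ~ 2,
      coffee_quality == "bad"  ~ 3
    ),
    # City is a nominal variable, so the ordering of these codes is arbitrary
    city_numeric = case_when(
      city == "Springfield" ~ 1,
      city == "Riverton"    ~ 2,
      city == "Oakville"    ~ 3
    )
  )

# Correlation matrix of the numeric variables
correlation <- cor(select(t2, years_in_business, growing, coffee_quality_numeric, city_numeric))

# Plot the lower triangle with the coefficient printed in each cell
ggcorrplot(correlation,
           lab = TRUE,
           lab_size = 4,
           method = "square",
           type = "lower",
           title = "Correlation Matrix of Coffee Shop Variables")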
#Section changes
# I changed coffee_quality to a numeric variable to be used in the model if needed
# I changed up some of the spaces in the data and ran the correlation
# I added correlation to my environment
# I added the library ggcorrplot and made a correlation matrix of the numeric variables
# I made city a numeric variable so I could add it to the correlation too
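Because city is a nominal variable, the codes 1 to 3 impose an ordering that the cities do not have, and any correlation involving `city_numeric` depends on that arbitrary ordering. One possible alternative, sketched here on made-up rows (`demo` is a hypothetical data frame, not the report's data), is to build one 0/1 indicator column per city with base R's `model.matrix()` and correlate each indicator with growth.

```r
# Made-up rows for illustration only.
demo <- data.frame(
  city = c("Springfield", "Riverton", "Oakville",
           "Springfield", "Riverton", "Oakville"),
  growing = c(1, 1, 0, 1, 0, 0)
)

# One 0/1 column per city; "- 1" drops the intercept so every city keeps its own column.
city_dummies <- model.matrix(~ city - 1, data = demo)
colnames(city_dummies)

# Correlation of growth with each city indicator (a 3 x 1 matrix).
city_cor <- cor(city_dummies, demo$growing)
city_cor
```

Each row of `city_cor` then describes one city against the other two, which does not depend on how the cities happen to be numbered.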
We begin by examining where most coffee shops are located.
table(t$city)
##
##    Oakville    Riverton Springfield
##          10          10          10
table(t$street)
##
##   101 First Ave  1010 Willow Dr     123 Main St     135 Pine Rd   147 Birch Way
##               1               1               6               1               1
##   159 Willow Ln   258 Spruce Ct    303 Third Rd     334 Pine St  369 Cedar Blvd
##               1               1               1               4               1
## 404 Fourth Blvd    445 Cedar Rd      456 Elm St    556 Elm Blvd   667 Maple Way
##               1               1               1               1               1
##  707 Seventh Ct    778 Birch Ct     789 Oak Ave  808 Eighth Way  889 Hickory Ln
##               1               1               1               1               1
##   909 Ninth Ave   990 Spruce Pl
##               1               1
hist(t$years_in_business,
     main = "Histogram of Years in Business",
     xlab = "Years in Business",
     col = "lightblue",
     border = "black")
barplot(table(t$growing),
        main = "Barplot of Growing Shops",
        names.arg = c("Not Growing", "Growing"),
        col = "lightgreen",
        border = "black",
        ylab = "Number of Shops")
#Section changes
# I made the years in business histogram cleaner and better to look at
# I changed the shop growth to a bar plot to more clearly show the difference between growing and non growing shops
# I left the tables as is because I thought they were good to read like that
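The tables and plots above describe location and growth separately. One way to look at them together is a cross-tabulation with row-wise shares, sketched below on made-up rows (`demo` is a hypothetical data frame, so the numbers say nothing about the real shops).

```r
# Made-up rows for illustration only.
demo <- data.frame(
  city = rep(c("Springfield", "Riverton", "Oakville"), each = 4),
  growing = c(1, 1, 1, 0,  1, 0, 1, 0,  0, 0, 1, 0)
)

# Counts of not growing (0) and growing (1) shops within each city.
counts <- table(demo$city, demo$growing)

# Row-wise proportions: within each city, the share of shops in each growth group.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

With `margin = 1` each row sums to 1, so the cities can be compared directly even when they have different numbers of shops.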
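I predict growth from city and coffee quality, both coded numerically. The model explains about 49% of the variance in growth (multiple R² = 0.49) and is significant overall (F-test p-value = 0.0001). City is a significant predictor (p < 0.001), while coffee quality is only marginally significant (p = 0.08). Because city enters as a single numeric code (Springfield = 1, Riverton = 2, Oakville = 3), the output shows one city coefficient rather than a separate row for each city.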
m <- lm(growing ~ city_numeric + coffee_quality_numeric, data = t2)
summary(m)
##
## Call:
## lm(formula = growing ~ city_numeric + coffee_quality_numeric,
##     data = t2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.83778 -0.27413  0.00466  0.16222  0.88344
##
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)             1.51352    0.22684   6.672 3.67e-07 ***
## city_numeric           -0.36061    0.08671  -4.159  0.00029 ***
## coffee_quality_numeric -0.15756    0.08700  -1.811  0.08126 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3754 on 27 degrees of freedom
## Multiple R-squared: 0.4905, Adjusted R-squared: 0.4527
## F-statistic: 13 on 2 and 27 DF, p-value: 0.0001114
#Section changes
#changed growth to be spelled correctly
#changed Oakfield to Oakville
#used city_numeric instead of street
#used coffee_quality_numeric instead of coffee_quality
#took years_in_business out of my model because its p-value was high and the adjusted R-squared was higher without it
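For comparison, when city is entered as a factor rather than a numeric code, `lm()` fits one coefficient for each city except the reference level. By default the reference is the first level alphabetically (Oakville for these three city names), and it is absorbed into the intercept instead of getting its own row. The sketch below shows this on made-up rows; `demo` and `m_factor` are hypothetical names and the numbers are not from the report's data.

```r
# Made-up rows for illustration only.
demo <- data.frame(
  city = rep(c("Springfield", "Riverton", "Oakville"), each = 4),
  growing = c(1, 1, 1, 0,  1, 0, 1, 0,  0, 0, 1, 0)
)

# City as a factor: one coefficient per non-reference city.
m_factor <- lm(growing ~ factor(city), data = demo)

# Oakville has no row of its own because it is the reference level.
coef_names <- names(coef(m_factor))
coef_names

# The intercept equals the mean of growing in the reference city.
oakville_mean <- mean(demo$growing[demo$city == "Oakville"])
intercept <- coef(m_factor)[["(Intercept)"]]
```

The other two coefficients are then differences from the reference city, which is why a factor-coded model can look as if one city is missing from the output.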
I create a decision tree to predict the growth variable.
# NOTE by Joe. Not sure why this doesn't work?
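library(rpart)
library(rpart.plot)

# Classification tree for growth; these loose control settings let the tree split down to single shops
tree_model <- rpart(
  growing ~ city + years_in_business + coffee_quality,
  data = t2,
  method = "class",
  control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1)
)

# Draw the tree; extra = 106 shows the probability of the second class and the percentage of observations in each node
rpart.plot(tree_model,
           type = 2,
           extra = 106,
           fallen.leaves = TRUE,
           main = "Decision Tree Predicting Growth")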
#Section changes
# I added the necessary libraries
# I built the entire decision tree model
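A fitted classification tree can be checked by comparing its predicted classes with the observed ones in a confusion matrix. The sketch below uses made-up rows (`demo`, `tree_demo`, and the values in them are hypothetical) with the same control settings as the tree above. It is scored on the data it was trained on, so the accuracy is likely to be optimistic.

```r
library(rpart)

# Made-up rows for illustration only.
demo <- data.frame(
  city = factor(rep(c("Springfield", "Riverton", "Oakville"), each = 6)),
  years_in_business = c(1, 3, 5, 7, 9, 11,  2, 4, 6, 8, 10, 12,  1, 2, 3, 4, 5, 6),
  growing = c(1, 1, 1, 1, 0, 1,  1, 0, 1, 0, 1, 0,  0, 0, 0, 1, 0, 0)
)

# Make the outcome a labeled factor so the two classes are explicit.
demo$growing <- factor(demo$growing, levels = c(0, 1),
                       labels = c("not growing", "growing"))

tree_demo <- rpart(
  growing ~ city + years_in_business,
  data = demo,
  method = "class",
  control = rpart.control(cp = 0.001, minsplit = 2, minbucket = 1)
)

# Predicted class for each training row.
pred <- predict(tree_demo, newdata = demo, type = "class")

# Rows are observed classes, columns are predicted classes.
confusion <- table(actual = demo$growing, predicted = pred)

# Share of rows on the diagonal, i.e. classified correctly.
accuracy <- sum(diag(confusion)) / sum(confusion)
```

With `minsplit = 2` and `minbucket = 1` the tree is free to isolate individual shops, so holding out some rows or raising `cp` would give a more honest picture of how well it generalizes.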