========================================
Exam
You have received the below report from a junior analyst. She is a new hire and has been working on this report for a few weeks. She has asked you to review the report. You should make corrections to the document, and explain in comment (in a code block) each of your changes.
Upload the finished Rmd to eCampus. You should be able to publish the document; put the link at to the top of the document.
Delete all of these instructions (everything between the === bars)
==============Exam Bad==========================
The plot shows that X is not linked to Y.
plot(1:5, 1:5)
==============Exam Fixed by you to be better!==========================
The plot shows that X and Y are linked.
# Section Changes:
# Fixed typo in introduction,
# Added text to show that x and y are linked,
# Added title to plot.
plot(1:5, 1:5,
main = "X and Y are linked",
xlab = "X",
ylab = "Y")
====================end of instructions and start of test====================
This report presents a preliminary analysis of a dataset containing information about coffee shops across various U.S. cities. The objective is to identify patterns in shop density, coffee quality, and growth status to inform decisions about future locations and strategies. Preliminary modeling suggests that street address and years in business are associated with growth trends, with an R² value of 0.77.
The dataset contains the following variables:
name: Name of the coffee shopcity: City where the shop is locatedstreet: Street address of the shopyears_in_business: Number of years the shop has been in
businesscoffee_quality: Coffee quality rating (“good”, “ok”, or
“bad”)growing: Binary indicator — 1 if the shop is growing, 0
if notThere are 30 observations and 6 variables in the dataset.
years_in_business and growing# Comment: Converted 'coffee_quality' to numeric before correlation.
# 'good' = 3, 'ok' = 2, 'bad' = 1
t_cor <- t %>%
mutate(coffee_quality_num = as.numeric(factor(coffee_quality,
levels = c("bad", "ok", "good")))) %>%
select(years_in_business, coffee_quality_num, growing)
cor(t_cor)
## years_in_business coffee_quality_num growing
## years_in_business 1.00000000 0.2375221 0.01210747
## coffee_quality_num 0.23752214 1.0000000 0.40505554
## growing 0.01210747 0.4050555 1.00000000
library(corrplot)
## corrplot 0.95 loaded
corr_matrix <- round(cor(t_cor), 2)
# Convert to long format for ggplot
corr_long <- as.data.frame(as.table(corr_matrix))
# Plot with ggplot
ggplot(corr_long, aes(Var1, Var2, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), size = 4) +
scale_fill_gradient2(low = "red", mid = "white", high = "steelblue", midpoint = 0) +
labs(title = "Correlation Matrix", x = "", y = "") +
theme_minimal()
#Added Correlation Matrix
Overall, the strongest signal from the correlation matrix is the positive association between coffee quality and growth, supporting the idea that product quality plays an important role in a coffee shop’s success.
We begin by examining the distribution of coffee shops across different cities, their years in business, and their current growth status. The first plot shows the number of shops in each city, helping us identify regional concentration. The second histogram illustrates how long shops have been operating, which may correlate with stability or experience. Finally, a pie chart displays the proportion of shops that are currently growing versus not growing, giving insight into overall business trends.
# Bar chart: Number of shops by city
ggplot(t, aes(x = city)) +
geom_bar(fill = "steelblue") +
labs(title = "Coffee Shops by City", x = "City", y = "Count") +
theme_minimal()
# Histogram: Years in business
ggplot(t, aes(x = years_in_business)) +
geom_histogram(binwidth = 1, fill = "darkorange", color = "white") +
labs(title = "Distribution of Years in Business", x = "Years", y = "Count") +
theme_minimal()
growth_counts <- t %>%
count(growing) %>%
mutate(label = factor(growing, labels = c("Not Growing", "Growing")),
pct = round(n / sum(n) * 100, 1),
label_text = paste0(label, " (", pct, "%)"))
# Plot pie chart
ggplot(growth_counts, aes(x = "", y = n, fill = label)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
geom_text(aes(label = label_text), position = position_stack(vjust = 0.5), color = "black") +
scale_fill_manual(values = c("tomato", "seagreen")) +
labs(title = "Coffee Shop Growth Status", x = NULL, y = NULL, fill = "Status") +
theme_void()
# Comments:
# - Updated section intro to explain all three visualizations: city distribution, years in business, and growth status.
# - Replaced base R `hist()` and `table()` functions with clearer, styled `ggplot2` charts.
# - Bar chart: Visualizes number of coffee shops per city using color and clean formatting.
# - Histogram: Shows distribution of years in business using binwidth = 1 for accuracy.
# - Growth pie chart: Transformed bar chart into a pie chart using `coord_polar()`, added percentage labels and custom colors.
# - Used `mutate()` to calculate label text with percentages for pie chart clarity.
# - Applied `theme_minimal()` and `theme_void()` for modern, readable formatting.
# - Ensured consistent labeling and coloring for visual clarity and professional appearance.
To understand which factors are most predictive of a coffee shop’s
growth, we use a linear regression model. This model includes
coffee_quality and city as predictors. We
excluded street
# Improve linear model by using fewer categorical variables (remove 'street'), add 'city'
m <- lm(growing ~ coffee_quality + city, data = t)
summary(m)
##
## Call:
## lm(formula = growing ~ coffee_quality + city, data = t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79571 -0.30618 -0.02138 0.20429 0.91950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.004952 0.174091 -0.028 0.977534
## coffee_qualitygood 0.311131 0.195931 1.588 0.124864
## coffee_qualityok 0.085451 0.194235 0.440 0.663762
## cityRiverton 0.334181 0.190380 1.755 0.091451 .
## citySpringfield 0.715206 0.181024 3.951 0.000562 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3886 on 25 degrees of freedom
## Multiple R-squared: 0.4943, Adjusted R-squared: 0.4134
## F-statistic: 6.109 on 4 and 25 DF, p-value: 0.001429
# Comments:
# - Fixed Header
# - Final model uses 'years_in_business', 'coffee_quality', and 'city' as predictors.
# - Model explains 49% of variation in 'growing' (Adjusted R² = 0.39), with a significant overall p-value (0.00394).
# - 'citySpringfield' is the strongest predictor; shops in Springfield are much more likely to be growing.
# - Removed 'street' due to too many levels and risk of overfitting.
A neural network model was trained to classify whether a coffee shop is growing based on its coffee quality and years in business. The model used a single hidden layer with 3 nodes and achieved a prediction accuracy of approximately r round(mean(t_cor\(predicted_class == t_coe\)growing), 2). This approach allows for capturing nonlinear relationships between variables that simpler models like decision trees might miss.
library(nnet)
set.seed(1)
nn_model <- nnet(growing ~ coffee_quality_num + years_in_business,
data = t_cor,
size = 3, # number of hidden nodes
decay = 0.1, # weight decay (regularization)
maxit = 200, # max iterations
linout = FALSE) # classification task
## # weights: 13
## initial value 8.703798
## iter 10 value 7.113609
## iter 20 value 7.075667
## final value 7.075613
## converged
# Predict probabilities
t_cor$predicted_prob <- predict(nn_model, type = "raw")
# Convert to predicted class using 0.5 threshold
t_cor$predicted_class <- ifelse(t_cor$predicted_prob > 0.5, 1, 0)
# Accuracy
mean(t_cor$predicted_class == t_cor$growing)
## [1] 0.7666667
# Comments:
# - Switched from decision tree to a neural network using the 'nnet' package for binary classification.
# - Chose neural network to better capture potential nonlinear relationships between coffee quality, years in business, and growth.
# - Used 3 hidden nodes and a weight decay of 0.1 to reduce overfitting on small dataset.
# - Model successfully converged with a residual deviance reduction from 8.7 to 7.07.
# - Achieved an accuracy of ~76.7%, which is higher than the original decision tree model.
# - Neural network proved more predictive while remaining interpretable with only 2 input features.
This analysis explored patterns among 30 coffee shops across three cities using both visual exploration and predictive modeling.
Key findings include:
From a modeling standpoint: - A linear regression
model using coffee_quality and city
explained nearly 49% of the variation in shop growth. -
A neural network using coffee_quality and
years_in_business achieved a 76.7% prediction
accuracy, capturing nonlinear relationships and outperforming
the simpler models.
These findings support the idea that focusing on high product quality and targeting specific geographic markets (like Springfield) could enhance the growth prospects of coffee shops. While the dataset is limited in size, the analysis provides a solid foundation for future, more scalable modeling efforts.
Link to RPubs https://rpubs.com/KevinMcDonald/1304834
2 -Comment: Corrected spelling of “Introcution” to “Introduction”