library(tidyverse)
library(ggplot2)
library(dplyr)
library(scales)
library(broom)
library(knitr)
Presentation link: Recording-20260427_232549.webm
Client Question: What characteristics of a coffee — origin, roast level, roaster country — best predict its price per 100g, and can we use these to inform pricing strategy?
This analysis is directed at specialty coffee roasting companies worldwide. Whether a roaster is setting prices for a new single-origin release, deciding which coffee origins to source, or benchmarking prices against the broader market, the data-driven insights presented here can be applied to everyday operations. Specifically, this report is most relevant to:
Throughout this semester, I have taken a deep dive into uncovering trends and insights from a global coffee reviews dataset, primarily by conducting statistical analyses. Often, I found myself analyzing the correlation between two variables: 100g_USD (price) and rating. All too often, I used price as a predictor of rating, seldom as the outcome.
In this final statistical analysis, I am flipping the lens and asking:
What predicts whether a coffee is expensive? Origin, roast type, roaster country, rating?
I believe this statistical analysis will be of high interest to coffee roasting companies trying to price their products competitively. A linear regression model will serve as the centerpiece of this investigation, with price as the response variable. Ultimately, this unexplored angle of the dataset offers clear, actionable insights as well as relevance to coffee roasting companies in regard to setting their pricing strategy.
coffee_clean <- readRDS("coffee_clean.rds")
# Confirm factor structure carried over from prior work
coffee_clean <- coffee_clean %>%
mutate(
roast = as.factor(roast),
loc_country = as.factor(loc_country),
origin_1 = as.factor(origin_1)
)
# Recreate price_bucket from WK3 for use in EDA groupings
coffee_clean <- coffee_clean %>%
mutate(
price_bucket = cut(
`100g_USD`,
breaks = c(-Inf, 4.99, 9.99, 19.99, Inf),
labels = c("$0–5", "$5–10", "$10–20", "$20+"),
right = TRUE
)
)
Before running any models, it is critical to understand the shape of 100g_USD. A heavily skewed distribution may violate regression assumptions and warrant a transformation.
p1 <- ggplot(coffee_clean, aes(x = `100g_USD`)) +
geom_histogram(bins = 50, fill = "#6F4E37", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Coffee Price (per 100g USD)",
subtitle = "Raw price is right-skewed, driven by a small number of premium coffees",
x = "Price (USD per 100g)",
y = "Count"
) +
theme_minimal()
p2 <- ggplot(coffee_clean, aes(x = log(`100g_USD`))) +
geom_histogram(bins = 50, fill = "#C8A882", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Log-Transformed Price",
subtitle = "Log(price) is approximately normal — appropriate for linear regression",
x = "log(Price)",
y = "Count"
) +
theme(plot.subtitle = element_text(face = "bold"))
gridExtra::grid.arrange(p1, p2, ncol = 2)
Explanation: The raw price distribution is strongly right-skewed. A small number of ultra-premium coffees (above $20/100g) pull the mean upward and stretch the right tail. I cannot formally classify these coffees as outliers without running a statistical test, but coffees priced above $20 per 100g are the most likely candidates.
The log-transformed price is approximately bell-shaped. Linear regression assumes normally distributed residuals rather than a normal response, but a roughly symmetric response makes that assumption far more plausible than it would be on the raw scale. From this point forward, log(100g_USD) will be used as the response variable in the regression model.
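The visual comparison of the two histograms can also be backed by a number. The sketch below computes sample skewness before and after a log transform. It uses simulated log-normal prices as a stand-in, since the real `coffee_clean` data is not reproduced here; the `skewness()` helper and the simulation parameters are assumptions for illustration only.

```r
# Sample skewness (third standardized moment); base R has no built-in for this
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Simulated stand-in for `100g_USD` (assumed parameters, not the real data)
set.seed(42)
sim_price <- rlnorm(2000, meanlog = 2, sdlog = 0.6)

skew_raw <- skewness(sim_price)       # clearly positive for right-skewed data
skew_log <- skewness(log(sim_price))  # close to 0 after the log transform

round(c(raw = skew_raw, log = skew_log), 2)
```

On the real data, the same helper could be applied to coffee_clean$`100g_USD` and its log to confirm that the transformation removes most of the skew.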
roast_price <- coffee_clean %>%
filter(!is.na(roast)) %>%
group_by(roast) %>%
summarise(
avg_price = mean(`100g_USD`, na.rm = TRUE),
median_price = median(`100g_USD`, na.rm = TRUE),
n = n()
) %>%
arrange(desc(avg_price))
roast_colors <- c(
"Light" = "#E8D9BF", # lightest tan
"Medium-Light" = "#D2B892", # light brown
"Medium" = "#B8946A", # medium brown
"Medium-Dark" = "#8B5E3C", # darker brown
"Dark" = "#4B2E14" # darkest roast
)
ggplot(coffee_clean %>% filter(!is.na(roast)),
aes(x = reorder(roast, `100g_USD`, FUN = median),
y = `100g_USD`,
fill = roast)) +
geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
scale_y_log10(labels = dollar_format()) +
scale_fill_manual(values = roast_colors) +
coord_flip() +
labs(
title = "Coffee Price by Roast Level (Log Scale)",
subtitle = "Light roasts tend to command higher prices than darker roasts",
x = "Roast Level",
y = "Price per 100g (USD, log scale)",
fill = "Roast"
) +
theme_minimal() +
theme(legend.position = "none")
Explanation: Light roasts generally command higher prices, which aligns with the specialty coffee market’s emphasis on single-origin, lightly roasted beans that preserve complex flavors. Dark roasts, while least common in this dataset, have the lowest average price, consistent with their association with commodity-grade or blended coffees. Medium-Light dominates the dataset in volume but spans a wide price range.
# Filter to countries with at least 10 reviews for reliable estimates
country_price <- coffee_clean %>%
group_by(loc_country) %>%
summarise(
avg_price = mean(`100g_USD`, na.rm = TRUE),
n = n()
) %>%
filter(n >= 10) %>%
arrange(desc(avg_price))
ggplot(country_price,
aes(x = reorder(loc_country, avg_price), y = avg_price)) +
geom_col(aes(fill = avg_price == max(avg_price)),
alpha = 0.85,
show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#3B1E08", "FALSE" = "#C8A882")) +
scale_y_continuous(labels = dollar_format()) +
coord_flip() +
labs(
title = "Average Coffee Price by Roaster Country/Region",
subtitle = "Countries with fewer than 10 reviews excluded to reduce variance",
x = "Roaster Country",
y = "Average Price (USD per 100g)",
fill = "Avg Price"
) +
theme_minimal()
Explanation: There is meaningful variation in average price by roaster country. Countries with historically strong specialty coffee cultures (such as the United States, Taiwan, and Japan) tend to have higher average prices. This may reflect both the premium positioning of their products and the higher cost of living and operating in those markets.
origin_price <- coffee_clean %>%
group_by(origin_1) %>%
summarise(
avg_price = mean(`100g_USD`, na.rm = TRUE),
n = n()
) %>%
filter(n >= 15) %>%
arrange(desc(avg_price))
ggplot(origin_price %>% slice_head(n = 20),
aes(x = reorder(origin_1, avg_price), y = avg_price)) +
geom_col(aes(fill = avg_price == max(avg_price)),
alpha = 0.85,
show.legend = FALSE) +
scale_fill_manual(values = c("TRUE" = "#3B1E08", "FALSE" = "#C8A882")) +
scale_y_continuous(labels = dollar_format()) +
coord_flip() +
labs(
title = "Top 20 Origins by Average Price (min 15 reviews)",
subtitle = "Certain origins are associated with significantly higher market prices",
x = "Origin",
y = "Average Price (USD per 100g)",
fill = "Avg Price"
) +
theme_minimal()
Explanation: Origin plays a meaningful role in pricing. Certain origins (e.g., Panama or Ethiopia, known for Gesha/Geisha varieties) are strongly associated with premium pricing, reflecting their rarity, flavor complexity, and reputation in the specialty market. This visual motivates considering origin_1 for the regression model, although it may ultimately be left out because the dataset contains so many distinct origins. Within the top 20 origins, prices are closely clustered apart from a few origins with exceptionally high prices. The top 20 also represent only a fraction of the origins in the dataset, so restricting a multivariate model to them would discard most origin information.
ggplot(coffee_clean, aes(x = rating, y = `100g_USD`)) +
geom_point(alpha = 0.25, color = "#6F4E37") +
geom_smooth(method = "lm", se = TRUE, color = "#C8A882", linewidth = 1.2) +
scale_y_log10(labels = dollar_format()) +
labs(
title = "Rating vs. Price",
subtitle = "Higher-rated coffees tend to be priced higher, but with substantial spread",
x = "Rating",
y = "Price (USD per 100g, log scale)"
) +
theme_minimal()
Explanation: There is a positive association between rating and price, which is expected: better coffees command higher prices. Note that price is plotted on a log scale. The substantial spread indicates that rating alone does not fully explain price variation, which motivates a multi-variable regression that incorporates roast, origin, and roaster country alongside rating. These four variables are the candidate drivers of coffee price examined in this analysis.
The following assumptions are made in this analysis, along with justifications and mitigation strategies for each:
1. Log-normality of price. We assume that log(100g_USD) is approximately normally distributed. This is supported by the EDA histogram above, which shows a roughly symmetric distribution after transformation. This assumption is acceptable and standard for price data in econometric modeling.
2. Independence of observations. We assume that each coffee review represents an independent data point. The dataset contains reviews across many different roasters, origins, and time periods, making this assumption reasonable. A risk is that the same roaster may appear multiple times — this could be partially mitigated by including loc_country as a grouping variable in the model.
3. Linearity. We assume a linear relationship between the predictors and log(price). The scatterplot of rating vs. log(price) supports this for rating. For categorical variables, linear regression estimates the mean log-price per group, which is valid. Linearity is harder to justify on the raw price scale, because price is heavily skewed by ultra-premium, high-priced coffees. Log price is therefore used in the regression model and the ANOVAs, while raw price is kept for descriptive summaries and the t-test, where dollar units are easier to interpret.
4. Sufficient sample size by group. For origin and country estimates to be reliable, groups should have adequate observations. We mitigate this by filtering to roaster countries with at least 10 reviews and origins with at least 15 reviews in visualizations, and by interpreting regression coefficients for small groups with caution. Groups with very few observations will have wide confidence intervals. More filtering may be necessary later in the analysis to exclude categories with too few observations to produce reliable estimates.
5. No causal claims. This analysis identifies statistical associations, not causal relationships. A higher rating is associated with higher price — but this does not mean improving a coffee’s rating will cause its price to rise. Roasters should treat model outputs as signals rather than firm rules.
# One-way ANOVA: Does roast level predict log(price)?
coffee_roast <- coffee_clean %>% filter(!is.na(roast))
aov_roast <- aov(log(`100g_USD`) ~ roast, data = coffee_roast)
summary(aov_roast)
              Df Sum Sq Mean Sq F value  Pr(>F)
roast          4   23.7   5.932   15.73 1.1e-12 ***
Residuals   2075  782.6   0.377
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(coffee_roast, aes(x = reorder(roast, `100g_USD`, FUN = median),
y = log(`100g_USD`), fill = roast)) +
geom_violin(alpha = 0.6) +
geom_boxplot(width = 0.15, fill = "white", outlier.alpha = 0.2) +
scale_fill_brewer(palette = "YlOrBr") +
coord_flip() +
labs(
title = "Log(Price) Distribution by Roast Level",
subtitle = "Violin + boxplot shows full spread and median by roast type",
x = "Roast Level",
y = "log(Price per 100g USD)"
) +
theme_minimal() +
theme(legend.position = "none")
Interpretation: This one-way ANOVA tests whether the mean log-price differs across roast levels. The omnibus result is highly significant (F(4, 2075) = 15.73, p = 1.1e-12), so at least one roast level has a different mean log-price, which justifies including roast in the regression model. The ANOVA by itself does not identify which levels differ; the regression coefficients later in this report show that only Light differs significantly from the Medium-Light reference (p < 0.05), while Medium, Medium-Dark, and Dark have p-values above 0.05. The violin plot reinforces the visual differences seen in the EDA, particularly the tendency for Light roasts to occupy the highest price territory among all roast levels.
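A one-way ANOVA reports a single omnibus F-test, so a post-hoc comparison is one way to see which roast levels actually differ from each other. The sketch below runs Tukey's HSD on simulated data with three roast levels; the group means, sample sizes, and the `sim` data frame are assumptions for illustration and are not drawn from the coffee dataset.

```r
# Simulated stand-in for coffee_roast (assumed group means; not the real data)
set.seed(1)
sim <- data.frame(
  roast = factor(rep(c("Light", "Medium-Light", "Medium"), each = 100)),
  log_price = c(rnorm(100, mean = 2.3, sd = 0.6),
                rnorm(100, mean = 2.0, sd = 0.6),
                rnorm(100, mean = 2.0, sd = 0.6))
)

fit <- aov(log_price ~ roast, data = sim)
summary(fit)                     # omnibus F-test: is there any difference at all?

tukey <- TukeyHSD(fit, "roast")  # all pairwise comparisons, family-wise adjusted
tukey$roast                      # one row per pair: diff, lwr, upr, p adj
```

Applied to the real analysis, the same call on `aov_roast` would show which pairs of roast levels drive the significant F-statistic.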
This hypothesis test directly ties to the main objective: understanding what drives coffee price.
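Hypotheses (Neyman-Pearson Framework):
\(H_0\): \(\mu_{\text{High}} = \mu_{\text{Low}}\), i.e., mean price per 100g is the same for coffees rated 94 or above and coffees rated below 94.
\(H_1\): \(\mu_{\text{High}} \neq \mu_{\text{Low}}\) (two-sided), tested at \(\alpha = 0.05\) with a target power of 0.80.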
coffee_hyp <- coffee_clean %>%
mutate(
rating_group = if_else(rating >= 94, "High (≥94)", "Low (<94)")
)
# Sample size check (power analysis)
power.t.test(
delta = 2, # $2 meaningful difference in avg price
sd = 8, # approximate SD of price from EDA
sig.level = 0.05,
power = 0.80,
type = "two.sample",
alternative = "two.sided"
)
Two-sample t test power calculation
n = 252.1281
delta = 2
sd = 8
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
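The required sample size reported above can be sanity-checked by hand. The sketch below compares the standard normal-approximation formula with the exact t-based result from `power.t.test()`, using the same inputs as the call above (a $2 difference, SD of 8, 5% significance, 80% power). The variable names are illustrative.

```r
delta <- 2      # smallest price difference worth detecting (USD per 100g)
sigma <- 8      # assumed SD of price
alpha <- 0.05
power <- 0.80

# Normal approximation:
# n per group = 2 * (z_{1 - alpha/2} + z_{power})^2 * sigma^2 / delta^2
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 * sigma^2 / delta^2

# Exact t-based calculation, same inputs as the call above
n_exact <- power.t.test(delta = delta, sd = sigma, sig.level = alpha,
                        power = power, type = "two.sample",
                        alternative = "two.sided")$n

c(approx = n_approx, exact = n_exact)
```

The approximation lands slightly below the exact value because it ignores the extra uncertainty from estimating the standard deviation. Either way, each rating group needs roughly 253 coffees, which can be compared with the group sizes in the data.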
t_result <- t.test(
`100g_USD` ~ rating_group,
data = coffee_hyp,
var.equal = FALSE
)
t_result
Welch Two Sample t-test
data: 100g_USD by rating_group
t = 9.6536, df = 1158.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group High (≥94) and group Low (<94) is not equal to 0
95 percent confidence interval:
3.830612 5.784885
sample estimates:
mean in group High (≥94) mean in group Low (<94)
11.908510 7.100762
rating_price_summary <- coffee_hyp %>%
group_by(rating_group) %>%
summarise(
avg_price = mean(`100g_USD`, na.rm = TRUE),
se = sd(`100g_USD`, na.rm = TRUE) / sqrt(n()),
n = n()
)
ggplot(rating_price_summary, aes(x = rating_group, y = avg_price, fill = rating_group)) +
geom_col(alpha = 0.85, width = 0.5) +
geom_errorbar(aes(ymin = avg_price - 1.96 * se,
ymax = avg_price + 1.96 * se),
width = 0.15, linewidth = 0.8) +
geom_text(aes(label = dollar(round(avg_price, 2))), hjust = -.59, vjust = -.70, fontface = "bold") +
scale_fill_manual(values = c("High (≥94)" = "#6F4E37", "Low (<94)" = "#C8A882")) +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Average Price by Rating Group",
subtitle = "Error bars represent 95% confidence intervals",
x = "Rating Group",
y = "Average Price (USD per 100g)"
) +
theme_minimal() +
theme(legend.position = "none")
Interpretation: The two-sample Welch t-test directly tests whether the premium pricing of high-rated coffees is statistically significant. For this test, premium coffees are those that received a rating of at least 94.
Average price of coffee by rating group:
High rating: $11.91
Low rating: $7.10
Decision: The p-value (p < 2.2e-16) is below 0.05, so we reject \(H_0\) and conclude that rating group membership is associated with price. High-rated coffees cost about $4.81 more per 100g on average (95% CI: $3.83 to $5.78), a finding with direct relevance to roasters who invest in quality to justify premium price points. These results support premium pricing for roasters whose coffees earn top ratings, although the test shows an association rather than a causal effect.
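Hypotheses:
\(H_0\): mean log(100g_USD) is the same across all roaster countries / regions.
\(H_1\): at least one roaster country / region has a different mean log(100g_USD).
Test: one-way ANOVA of log(100g_USD) by loc_country (countries with ≥ 10 reviews)
coffee_country <- coffee_clean %>%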
group_by(loc_country) %>%
filter(n() >= 10) %>%
ungroup()
aov_country <- aov(log(`100g_USD`) ~ loc_country, data = coffee_country)
summary(aov_country)
              Df Sum Sq Mean Sq F value Pr(>F)
loc_country    6   83.2  13.869   40.66 <2e-16 ***
Residuals   2052  699.9   0.341
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(coffee_country,
aes(x = reorder(loc_country, `100g_USD`, FUN = median),
y = `100g_USD`)) +
geom_boxplot(fill = "#6F4E37", alpha = 0.65, outlier.alpha = 0.2) +
scale_y_log10(labels = dollar_format()) +
coord_flip() +
labs(
title = "Price Distribution by Roaster Country (Log Scale)",
subtitle = "Countries with < 10 reviews excluded",
x = "Roaster Country",
y = "Price (USD per 100g, log scale)"
) +
theme_minimal()
Interpretation: The ANOVA tests whether roaster country / region explains a significant portion of price variance. I used log price rather than raw price because the log transformation reduces the right skew and lets group differences be read as approximate percentage differences. The primary goal of this test is to inform roasters about the competitive pricing landscape across the globe.
Decision: Reject \(H_0\). With F(6, 2052) = 40.66 and p < \(2 \times 10^{-16}\), there is overwhelming evidence that at least one roaster country / region has a significantly different mean log price from the others.
This statistically significant result affirms that where a coffee is roasted matters for its price, which may reflect labor costs and import/export economics. Currency is not a factor, as all prices have been converted to US dollars. This finding informs roasters about the competitive pricing landscape by geography: roaster location is a statistically significant predictor of price. A one-way ANOVA cannot show that loc_country acts independently of other explanatory variables; that question is addressed by the multiple regression, which controls for rating and roast. The practical meaning for clients (coffee roasters) is that operating in a high-price market carries a structural pricing advantage.
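Statistical significance does not say how much of the variation a factor explains. A rough effect size, eta squared, can be computed directly from the sums of squares in the two ANOVA tables reported in this analysis. The sketch below copies those values by hand; the `ss` data frame is an illustrative construct, not part of the original workflow.

```r
# Sums of squares copied from the roast and loc_country ANOVA tables
ss <- data.frame(
  test        = c("roast", "loc_country"),
  ss_effect   = c(23.7, 83.2),
  ss_residual = c(782.6, 699.9)
)

# Eta squared = SS_effect / (SS_effect + SS_residual):
# the share of variance in log(price) associated with the factor
ss$eta_sq <- ss$ss_effect / (ss$ss_effect + ss$ss_residual)
ss
```

By this measure roaster country accounts for roughly a tenth of the variation in log price and roast level for about 3%, so both effects are real but leave most of the variation unexplained.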
The centerpiece of this analysis is a multiple linear regression predicting log(100g_USD) from:
rating: continuous numeric predictor
roast: categorical (5 levels, Medium-Light as reference)
loc_country: categorical (filtered to countries with ≥ 10 reviews, United States as reference)
Origin (origin_1) has hundreds of levels and would inflate the model; it is better summarized through the EDA visualization above. This regression focuses on the most actionable predictors for a roasting company’s pricing strategy.
Impact of each predictor: Why were rating, roast, and loc_country selected for this model?
Rating is included because the scatterplot (EDA Visual 3) highlighted a clear positive relationship with price. Additionally, rating has proven throughout this analysis to be the most reliable and direct measure of coffee quality.
Roast is incorporated because the boxplots (EDA Visual 2A) showed evident price differences across roast levels, which was supported in Analysis 1 (see above). Practically speaking, choosing what type of coffee roasts to import and brew for customers is a key business decision that coffee roasting companies control directly, making roast a valuable predictor.
Loc_country is included in large part because the ANOVA in Analysis 3 confirmed notable price variance among countries and regions. While coffee roasters cannot control the market they are located in, understanding the geographic context in which they operate is essential for setting prices and catering to key market segments.
Model reference level: Medium-Light coffee roasted in the United States
I chose Medium-Light roast coffee from the United States as the baseline, so the coefficients of the other four roast levels and the other roaster locations are interpreted relative to it. This was a clear decision because Medium-Light is the most common roast level, representing 1,490 of the 2,080 total reviews, while the United States has the most reviews of any roasting location with 1,331. Together, Medium-Light coffees roasted in the United States account for 995 observations in the full dataset.
This is important for two reasons:
# Filter to complete cases and countries with >= 10 reviews
coffee_model <- coffee_clean %>%
filter(!is.na(roast)) %>%
group_by(loc_country) %>%
filter(n() >= 10) %>%
ungroup() %>%
filter(!is.na(`100g_USD`), !is.na(rating))
# Relevel roast so Medium-Light is the reference (most common)
coffee_model <- coffee_model %>%
mutate(roast = relevel(factor(roast, ordered = FALSE), ref = "Medium-Light"))
# Relevel location country so United States is the reference (most common)
coffee_model <- coffee_model %>%
mutate(loc_country = relevel(factor(loc_country), ref = "United States"))
# Fit the model
model <- lm(log(`100g_USD`) ~ rating + roast + loc_country,
data = coffee_model)
summary(model)
Call:
lm(formula = log(`100g_USD`) ~ rating + roast + loc_country,
data = coffee_model)
Residuals:
Min 1Q Median 3Q Max
-4.5821 -0.3081 -0.1183 0.1632 2.6649
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.63712 0.76933 -12.527 < 2e-16 ***
rating 0.12395 0.00826 15.006 < 2e-16 ***
roastLight 0.18981 0.03596 5.279 1.44e-07 ***
roastMedium -0.06661 0.03782 -1.761 0.07835 .
roastMedium-Dark 0.03746 0.08975 0.417 0.67648
roastDark 0.12604 0.24772 0.509 0.61095
loc_countryCanada -0.24007 0.09921 -2.420 0.01561 *
loc_countryGuatemala -0.44265 0.10389 -4.261 2.13e-05 ***
loc_countryHawai'i 0.79631 0.06061 13.138 < 2e-16 ***
loc_countryHong Kong 0.57207 0.12281 4.658 3.40e-06 ***
loc_countryJapan 0.45510 0.15985 2.847 0.00446 **
loc_countryTaiwan -0.04583 0.02814 -1.629 0.10351
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5451 on 2047 degrees of freedom
Multiple R-squared: 0.2233, Adjusted R-squared: 0.2191
F-statistic: 53.5 on 11 and 2047 DF, p-value: < 2.2e-16
# Clean coefficient table
tidy(model, conf.int = TRUE) %>%
mutate(
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 3),
p.value = round(p.value, 4),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)
) %>%
kable(caption = "Regression Coefficients: Predictors of log(Price per 100g USD)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -9.637 | 0.769 | -12.527 | 0.0000 | -11.146 | -8.128 |
| rating | 0.124 | 0.008 | 15.006 | 0.0000 | 0.108 | 0.140 |
| roastLight | 0.190 | 0.036 | 5.279 | 0.0000 | 0.119 | 0.260 |
| roastMedium | -0.067 | 0.038 | -1.761 | 0.0784 | -0.141 | 0.008 |
| roastMedium-Dark | 0.037 | 0.090 | 0.417 | 0.6765 | -0.139 | 0.213 |
| roastDark | 0.126 | 0.248 | 0.509 | 0.6109 | -0.360 | 0.612 |
| loc_countryCanada | -0.240 | 0.099 | -2.420 | 0.0156 | -0.435 | -0.046 |
| loc_countryGuatemala | -0.443 | 0.104 | -4.261 | 0.0000 | -0.646 | -0.239 |
| loc_countryHawai’i | 0.796 | 0.061 | 13.138 | 0.0000 | 0.677 | 0.915 |
| loc_countryHong Kong | 0.572 | 0.123 | 4.658 | 0.0000 | 0.331 | 0.813 |
| loc_countryJapan | 0.455 | 0.160 | 2.847 | 0.0045 | 0.142 | 0.769 |
| loc_countryTaiwan | -0.046 | 0.028 | -1.629 | 0.1035 | -0.101 | 0.009 |
model_tidy <- tidy(model, conf.int = TRUE) %>%
filter(term != "(Intercept)") %>%
mutate(
significant = p.value < 0.05,
term_label = gsub("roast|loc_country", "", term)
)
ggplot(model_tidy, aes(x = reorder(term_label, estimate),
y = estimate,
color = significant)) +
geom_point(size = 2.5) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
scale_color_manual(values = c("TRUE" = "#6F4E37", "FALSE" = "gray"),
labels = c("Not significant", "Significant (p < 0.05)")) +
coord_flip() +
labs(
title = "Regression Coefficients: Predictors of log(Price)",
subtitle = "Points to the right = associated with higher prices; reference = Medium-Light USA roasts",
x = "Predictors",
y = "Coefficient on log(price); exp(coefficient) - 1 gives the % effect",
color = NULL
) +
theme_minimal()
In the first model I tested, R selected Canada as the default reference level for loc_country because it comes first alphabetically. To make the regression easier to interpret, I changed the reference to the United States, which has the most reviews and gives the global client audience a familiar baseline for comparison.
Fitted Regression Model:
\[ \log(\text{price_100g}) = \beta_0 + \beta_1(\text{rating}) + \beta_2(\text{roast}) + \beta_3(\text{loc_country}) \]
The multiple linear regression model answers the core question: What predicts coffee price? The F-statistic (53.5, p < \(2.2 \times 10^{-16}\)) confirms that the model as a whole is highly significant. The adjusted \(R^2\) = 0.22 means that rating, roast, and loc_country together explain roughly 22% of the variation in log(100g_USD). The model uses Medium-Light coffee roasted in the United States as the baseline. Choosing different reference levels would change the individual coefficients and their p-values, since each is a comparison against the baseline, but not the overall F-statistic or adjusted \(R^2\). These references are kept as benchmarks because the ample observations of Medium-Light roasts and United States roasters give stable comparisons. The remaining 78% of price variation is not explained by this model; it likely reflects other variables in the dataset, such as coffee name, roaster, and origin_1, as well as outside factors such as brand reputation or processing method.
Rating
The strongest and most consistent predictor in the linear regression model. Before building the model, the EDA and earlier analyses showed a strong positive relationship between rating and 100g_USD: as rating increases, so does price. The linear model quantifies this relationship, holding roast and loc_country constant.
A 1-point rating increase is associated with a 13.2% increase in price:
\[ e^{0.124} \approx 1.132 \]
In this dataset, coffee rating is highly significant (p < \(2 \times 10^{-16}\)). This is good news for clients, as it indicates that expert-reviewed quality is reliably priced into the market. Roasters can use this positive association between rating and price to make informed investments in sourcing and roasting quality.
Roast
Roast Level: Reference = Medium-Light
| Predictor | Estimate | Significant (p < 0.05)? | Interpretation |
|---|---|---|---|
| Light (roast) | +0.190 | Yes | 21% more expensive than Medium-Light |
| Medium (roast) | -0.067 | No (p = 0.078) | Marginally cheaper; not conclusive |
| Medium-Dark (roast) | +0.037 | No (p = 0.677) | No meaningful difference |
| Dark (roast) | +0.126 | No (p = 0.611) | No meaningful difference |
From the reference baseline price, Light roasts are 21% more expensive than Medium-Light.
\[ e^{0.190} \approx 1.209 \]
Although the non-significant coefficients for darker roasts limit the overall impact that roast level has on price, Light roast is clearly the most expensive roast level once rating and roaster country are held constant. Roasters aiming to set premium prices, and to justify them to consumers, can use this result. The key actionable insight for clients is clear: if premium pricing is the goal, lean into sourcing high-quality beans suited to Light roasting, as this is the roast-level pricing strategy that the data most heavily supports. Roasters who adopt this strategy can reasonably expect stronger pricing power, provided demand and enthusiasm for Light roasts remain high.
Loc_country
Roaster Country / Region: Reference = United States
| Country / Region | Estimate | Significant (p < 0.05)? | Interpretation |
|---|---|---|---|
| Hawaii (loc_country) | +0.796 | Yes | 122% more expensive than US |
| Hong Kong (loc_country) | +0.572 | Yes | 77% more expensive than US |
| Japan (loc_country) | +0.455 | Yes | 58% more expensive than US |
| Taiwan (loc_country) | -0.046 | No (p = 0.104) | No meaningful difference from US |
| Canada (loc_country) | -0.240 | Yes | 21% cheaper than US |
| Guatemala (loc_country) | -0.443 | Yes | 36% cheaper than US |
To quantify this variance, let's compare Guatemala and Hawaii coffee pricing relative to the United States:
\[ \textbf{Guatemala}: e^{-0.443} \approx 0.642 \hspace{2cm} \textbf{Hawaii}: e^{0.796} \approx 2.217 \]
Higher prices may reflect stronger, more reputable brand names or higher cost structures in these markets, but other regional factors likely contribute to their dominant pricing power as well. The gap of 158 percentage points between Guatemala (36% below the US) and Hawaii (122% above the US) indicates how strongly price varies globally. Hawaii is a major standout: its extreme coffee prices can be attributed to the premium coffee brands in the region and high local costs of production. From exploring the data, I found that 100% of the coffees grown and roasted in Hawaii in this dataset are Medium, Medium-Light, or Light roasts. This aligns with the insights above that lighter roasts tend to have higher prices and ratings. Additionally, Hawaii does not have a single coffee rated below 90, highlighting a remarkable collective performance across roasting companies in the region. Ultimately, this supports the belief that specialty roasting companies have the internal ability to optimize their service and coffee to meet customers' needs.
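The percentage interpretations in the roast and country tables all come from the same conversion: exponentiate the log-scale coefficient and subtract 1. The sketch below applies it to the rounded estimates from the coefficient table. The `coefs` vector is typed in by hand for illustration; with the fitted object available, `coef(model)` would supply the same values.

```r
# Coefficients copied from the regression table (log-price scale)
coefs <- c(
  rating = 0.124, Light = 0.190,
  "Hawai'i" = 0.796, "Hong Kong" = 0.572, Japan = 0.455,
  Canada = -0.240, Guatemala = -0.443
)

# Percent difference relative to the reference: (exp(beta) - 1) * 100
pct_effect <- (exp(coefs) - 1) * 100
round(pct_effect, 1)

# Hawai'i versus Guatemala directly: ratio of the two multipliers,
# which equals exp of the difference between the coefficients
hawaii_vs_guatemala <- exp(coefs[["Hawai'i"]] - coefs[["Guatemala"]])
hawaii_vs_guatemala
```

Note that percentages measured against the US baseline cannot simply be subtracted to compare two other locations. Compared directly, the model implies that a Hawai'i-roasted coffee is priced at roughly 3.5 times a comparable Guatemala-roasted coffee.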
This analysis set out to answer one question for specialty coffee roasters worldwide: What predicts coffee price? The findings are clear and actionable:
1. Rating is a significant predictor of price. Coffees rated higher by expert reviewers consistently command higher prices. This validates the business case for investing in quality — the market rewards it, as many consumers are willing to pay more for premium coffees.
2. Roast type matters. Light roasts are associated with higher prices, while Dark roasts have the lowest average price in this dataset (although the Dark coefficient is not statistically significant in the regression). For roasters considering expanding their product lines, leaning into light-to-medium-light roasting of quality beans will likely yield stronger pricing power.
3. Roaster country affects price independent of quality. After controlling for rating and roast type, coffee roasting location still influences price. This suggests that market-level factors such as consumer willingness to pay, cost structures, and brand premiumization vary geographically.
4. Origin matters, even beyond what the model captures. The EDA clearly shows that certain origins command price premiums. Roasters who source from high-prestige origins (e.g., Ethiopian specialty regions, areas of Panama) have a market basis for premium pricing.
5. Excellent internal production, service, and coffee quality are paramount. The variables analyzed from the dataset and external factors are major indicators of coffee price, but ultimately everything circles back to the roasting companies themselves. Roasters must control the controllables: deliver the best coffee they can, build their brand, earn good ratings, and ensure high customer satisfaction, so that they can make pricing decisions from a position of strength.
| Recommendation | Supporting Evidence |
|---|---|
| Invest in quality — it pays. Prioritize sourcing and roasting practices that improve ratings. The regression confirms rating is a significant positive predictor of price. | Hypothesis Test (Analysis 2), Regression coefficient on rating |
| Lead with Light roasts for premium positioning. Light roasts are associated with higher market prices. | Roast-level ANOVA, Regression coefficient on roastLight |
| Source from high-prestige origins. Origins with strong reputations command price premiums. Use the origin-price EDA to identify high-opportunity sourcing regions. | Analysis 1 (EDA), Origin price visualization |
| Consider geographic market when pricing. If your roaster operates in a market where prices tend to be lower, consider online/export channels to reach markets with higher willingness to pay. | Country ANOVA, Regression coefficient on loc_country |
| Use the regression model as a pricing benchmark. Given a new coffee’s rating, roast type, and roaster country, plug values into the model to generate a log(price) prediction, then exponentiate to get a dollar estimate. This gives a data-driven starting point for pricing decisions. | Full multiple linear regression model |
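The last recommendation can be illustrated with a worked example. The sketch below uses coefficients copied from the `summary(model)` output to price a hypothetical coffee (rated 94, Light roast, roasted in the United States). The coffee itself and the rough range calculation are assumptions for illustration: the range uses only the residual standard error and ignores uncertainty in the coefficients.

```r
# Coefficients copied from the summary(model) output
b <- c(intercept = -9.63712, rating = 0.12395, roastLight = 0.18981)
resid_se <- 0.5451   # residual standard error on the log scale

# Hypothetical coffee: rated 94, Light roast, roasted in the United States
# (United States is the reference level for loc_country, so it adds 0)
log_pred <- b[["intercept"]] + b[["rating"]] * 94 + b[["roastLight"]]

price_pred <- exp(log_pred)   # back-transformed estimate, USD per 100g

# Rough 95% range from residual spread only (ignores coefficient uncertainty)
price_range <- exp(log_pred + c(-1.96, 1.96) * resid_se)

round(c(estimate = price_pred, low = price_range[1], high = price_range[2]), 2)
```

The point estimate comes out at about $9 per 100g, but the range around it is wide, which is consistent with the model explaining roughly 22% of price variation. With the fitted object available, `predict(model, newdata, interval = "prediction")` followed by `exp()` would give a more careful interval. Because exponentiating a log-scale prediction yields a typical (median-like) price rather than a mean, the output is best treated as a starting benchmark.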