What Predicts Coffee Price? A Statistical Analysis for Specialty Roasters

Author

Woods Procise

Published

April 27, 2026

Code

library(tidyverse)
library(ggplot2)
library(dplyr)
library(scales)
library(broom)
library(knitr)

Slide Deck: https://docs.google.com/presentation/d/12HCW6tA_tEXzRdmIUixRXRM8OQKUCucyOHZySQ66-cc/edit?slide=id.g3d6ffdf285a_0_560#slide=id.g3d6ffdf285a_0_560

Presentation link: Recording-20260427_232549.webm

Problem Statement

Client Question: What characteristics of a coffee — origin, roast level, roaster country — best predict its price per 100g, and can we use these to inform pricing strategy?

Target Audience

This analysis is directed at specialty coffee roasting companies worldwide. Whether a roaster is setting prices for a new single-origin release, deciding which coffee origins to source, or benchmarking prices against the broader market, the data-driven insights presented here can be applied to everyday operations. Specifically, this report is most relevant to:

Pricing and product teams deciding how to position coffees in the market
Green coffee buyers evaluating which origins offer the most premium potential
Brand strategists who want to understand what signals quality — and price — to consumers

Background

Throughout this semester, I have taken a deep dive into uncovering trends and insights from a global coffee reviews dataset, primarily by conducting statistical analyses. Often, I found myself analyzing the correlation between two variables: 100g_USD (price) and rating. All too often, I used price as a predictor of rating, seldom as the outcome.

In this final statistical analysis, I am flipping the lens and asking:

What predicts whether a coffee is expensive? Origin, roast type, roaster country, rating?

I believe this statistical analysis will be of high interest to coffee roasting companies trying to price their products competitively. A linear regression model will serve as the centerpiece of this investigation, with price as the response variable. Ultimately, this unexplored angle of the dataset offers clear, actionable insights as well as relevance to coffee roasting companies in regard to setting their pricing strategy.

Load & Prepare Data

Code

coffee_clean <- readRDS("coffee_clean.rds")

# Confirm factor structure carried over from prior work
coffee_clean <- coffee_clean %>%
  mutate(
    roast       = as.factor(roast),
    loc_country = as.factor(loc_country),
    origin_1    = as.factor(origin_1)
  )

Code

# Recreate price_bucket from WK3 for use in EDA groupings
coffee_clean <- coffee_clean %>%
  mutate(
    price_bucket = cut(
      `100g_USD`,
      breaks = c(-Inf, 5, 10, 20, Inf),   # bucket edges in USD per 100g
      labels = c("$0–5", "$5–10", "$10–20", "$20+"),
      right  = FALSE                      # intervals are [lower, upper), so $4.995 stays in "$0–5"
    )
  )

Initial EDA

1. Distribution of Price (Response Variable)

Before running any models, it is critical to understand the shape of 100g_USD. A heavily skewed distribution may violate regression assumptions and warrant a transformation.

Code

p1 <- ggplot(coffee_clean, aes(x = `100g_USD`)) +
  geom_histogram(bins = 50, fill = "#6F4E37", color = "white", alpha = 0.85) +
  labs(
    title = "Distribution of Coffee Price (per 100g USD)",
    subtitle = "Raw price is right-skewed, driven by a small number of premium coffees",
    x     = "Price (USD per 100g)",
    y     = "Count"
  ) +
  theme_minimal()

p2 <- ggplot(coffee_clean, aes(x = log(`100g_USD`))) +
  geom_histogram(bins = 50, fill = "#C8A882", color = "white", alpha = 0.85) +
  labs(
    title = "Distribution of Log-Transformed Price",
    subtitle = "Log(price) is roughly symmetric, a better-behaved response for linear regression",
    x     = "log(Price)",
    y     = "Count"
  ) +
  theme_minimal() +
  theme(plot.subtitle = element_text(face = "bold"))
  
gridExtra::grid.arrange(p1, p2, ncol = 2)

Explanation: The raw price distribution is strongly right-skewed. A small number of ultra-premium coffees (above $20/100g) pull the mean upward and would exert outsized influence on a model fit to raw prices. I have not formally tested these observations as outliers, but coffees priced above $20 per 100g are the most likely candidates.

The log-transformed price is approximately bell-shaped. Linear regression does not strictly require a normal response (the normality assumption applies to the residuals), but a roughly symmetric response makes well-behaved residuals far more likely. From this point forward, log(100g_USD) will be used as the response variable in the regression model.
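To put a number on "right-skewed" versus "approximately bell-shaped", sample skewness can be compared before and after the log transform. The sketch below uses simulated log-normal prices, not the coffee data; `sim_price` and the `skewness()` helper are hypothetical names introduced only for illustration.

```r
# Hypothetical stand-in data: 2,000 simulated prices from a log-normal distribution.
set.seed(42)
sim_price <- rlnorm(2000, meanlog = 2, sdlog = 0.6)

# Sample skewness: the third standardized moment (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(sim_price)       # clearly positive: long right tail
skew_log <- skewness(log(sim_price))  # close to zero after the transform

c(raw = skew_raw, log = skew_log)
```

Running the same helper on `100g_USD` and its log would give a quick numeric companion to the two histograms.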

2A. Price by Roast Level

Code

roast_price <- coffee_clean %>%
  filter(!is.na(roast)) %>%
  group_by(roast) %>%
  summarise(
    avg_price    = mean(`100g_USD`, na.rm = TRUE),
    median_price = median(`100g_USD`, na.rm = TRUE),
    n            = n()
  ) %>%
  arrange(desc(avg_price))

roast_colors <- c(
  "Light"        = "#E8D9BF",  # lightest tan
  "Medium-Light" = "#D2B892",  # light brown
  "Medium"       = "#B8946A",  # medium brown
  "Medium-Dark"  = "#8B5E3C",  # darker brown
  "Dark"         = "#4B2E14"   # darkest roast
)

ggplot(coffee_clean %>% filter(!is.na(roast)),
       aes(x = reorder(roast, `100g_USD`, FUN = median),
           y = `100g_USD`,
           fill = roast)) +
  geom_boxplot(alpha = 0.75, outlier.alpha = 0.3) +
  scale_y_log10(labels = dollar_format()) +
  scale_fill_manual(values = roast_colors) +
  coord_flip() +
  labs(
    title    = "Coffee Price by Roast Level (Log Scale)",
    subtitle = "Light roasts tend to command higher prices than darker roasts",
    x        = "Roast Level",
    y        = "Price per 100g (USD, log scale)",
    fill     = "Roast"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Explanation: Light roasts generally command higher prices, which aligns with the specialty coffee market’s emphasis on single-origin, lightly roasted beans that preserve complex flavors. Dark roasts, while least common in this dataset, have the lowest average price — consistent with their association with commodity-grade or blended coffees. Medium-Light dominates the dataset in volume but spans a wide price range.

2B. Price by Roaster Country

Code

# Filter to countries with at least 10 reviews for reliable estimates
country_price <- coffee_clean %>%
  group_by(loc_country) %>%
  summarise(
    avg_price = mean(`100g_USD`, na.rm = TRUE),
    n         = n()
  ) %>%
  filter(n >= 10) %>%
  arrange(desc(avg_price))

ggplot(country_price,
       aes(x = reorder(loc_country, avg_price), y = avg_price)) +
  geom_col(aes(fill = avg_price == max(avg_price)),
           alpha = 0.85,
           show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "#3B1E08", "FALSE" = "#C8A882")) +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(
    title    = "Average Coffee Price by Roaster Country/Region",
    subtitle = "Countries with fewer than 10 reviews excluded to reduce variance",
    x        = "Roaster Country",
    y        = "Average Price (USD per 100g)",
    fill     = "Avg Price"
  ) +
  theme_minimal()

Explanation: There is meaningful variation in average price by roaster country. Countries with historically strong specialty coffee cultures — such as the United States, Taiwan, and Japan — tend to have higher average prices. This may reflect both the premium positioning of their products and the higher cost of living/operating in those markets.

2C. Price by Coffee Origin

Code

origin_price <- coffee_clean %>%
  group_by(origin_1) %>%
  summarise(
    avg_price = mean(`100g_USD`, na.rm = TRUE),
    n         = n()
  ) %>%
  filter(n >= 15) %>%
  arrange(desc(avg_price))

ggplot(origin_price %>% slice_head(n = 20),
       aes(x = reorder(origin_1, avg_price), y = avg_price)) +
  geom_col(aes(fill = avg_price == max(avg_price)),
           alpha = 0.85,
           show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "#3B1E08", "FALSE" = "#C8A882")) +
  scale_y_continuous(labels = dollar_format()) +
  coord_flip() +
  labs(
    title    = "Top 20 Origins by Average Price (min 15 reviews)",
    subtitle = "Certain origins are associated with significantly higher market prices",
    x        = "Origin",
    y        = "Average Price (USD per 100g)",
    fill     = "Avg Price"
  ) +
  theme_minimal()

Explanation: Origin plays a meaningful role in pricing. Certain producing regions (e.g., those known for Gesha/Geisha varieties, such as Panama or Ethiopia) are strongly associated with premium pricing, reflecting their rarity, flavor complexity, and reputation in the specialty market. This visual motivates considering origin_1 for the regression model, although it may ultimately not be included because of the large number of distinct origins in the dataset. Among the top 20 origins, aside from a few with exceptionally high prices, average prices are closely clustered. Restricting a multivariate model to only these 20 origins would also cover just a fraction of the origins in the dataset.

3. Price vs. Rating Scatterplot

Code

ggplot(coffee_clean, aes(x = rating, y = `100g_USD`)) +
  geom_point(alpha = 0.25, color = "#6F4E37") +
  geom_smooth(method = "lm", se = TRUE, color = "#C8A882", linewidth = 1.2) +
  scale_y_log10(labels = dollar_format()) +
  labs(
    title    = "Rating vs. Price",
    subtitle = "Higher-rated coffees tend to be priced higher, but with substantial spread",
    x        = "Rating",
    y        = "Price (USD per 100g, log scale)"
  ) +
  theme_minimal()

Explanation: There is a positive association between rating and price, which is expected: better coffees command higher prices. Note that price is plotted on a log scale. However, the substantial spread indicates that rating alone does not fully explain price variation. This motivates a multi-variable regression that considers roast, origin, and roaster country alongside rating, the four candidate drivers of coffee price examined in this report.

Assumptions & Interpretation Risks

The following assumptions are made in this analysis, along with justifications and mitigation strategies for each:

1. Log-normality of price. We assume that log(100g_USD) is approximately normally distributed, so that the regression residuals are plausibly close to normal. This is supported by the EDA histogram above, which shows a roughly symmetric distribution after transformation. Log-transforming price is a standard choice in econometric modeling.

2. Independence of observations. We assume that each coffee review represents an independent data point. The dataset contains reviews across many different roasters, origins, and time periods, making this assumption reasonable. A risk is that the same roaster may appear multiple times — this could be partially mitigated by including loc_country as a grouping variable in the model.

3. Linearity. We assume a linear relationship between the predictors and log(price). The scatterplot of rating vs. price on a log scale supports this for rating. For categorical variables, linear regression estimates a mean log-price shift per group, which requires no linearity assumption. Linearity is much less plausible for raw price, which is heavily skewed by ultra-premium, high-priced coffees. For that reason, log price is used in the ANOVAs and the regression model, while raw price is retained for descriptive summaries and the rating-group t-test, where results in dollars are easier to interpret.

4. Sufficient sample size by group. For origin and country estimates to be reliable, groups should have adequate observations. We mitigate this by filtering to roaster countries with at least 10 reviews and origins with at least 15 reviews in the visualizations, and by interpreting regression coefficients for small groups with caution. Groups with very few observations will have wide confidence intervals. More filtering may be necessary later in the analysis to exclude categories with too few observations to produce reliable estimates.

5. No causal claims. This analysis identifies statistical associations, not causal relationships. A higher rating is associated with higher price — but this does not mean improving a coffee’s rating will cause its price to rise. Roasters should treat model outputs as signals rather than firm rules.
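Several of the assumptions above concern the residuals of a fitted model rather than the raw data. As a rough illustration of how they could be checked numerically, the sketch below fits a simple regression to simulated data; the data frame `sim` and its coefficients are hypothetical and are not taken from the coffee dataset.

```r
# Hypothetical data: log-price depends linearly on rating plus normal noise.
set.seed(123)
n   <- 500
sim <- data.frame(rating = rnorm(n, mean = 93, sd = 1.5))
sim$log_price <- -9.6 + 0.12 * sim$rating + rnorm(n, sd = 0.5)

fit <- lm(log_price ~ rating, data = sim)
res <- resid(fit)

# Normality check: correlation between residual quantiles and normal quantiles
# (values near 1 indicate approximately normal residuals).
qq     <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

# Constant-variance check: residual size should not trend with fitted values.
spread_cor <- cor(abs(res), fitted(fit))

c(qq_cor = qq_cor, spread_cor = spread_cor)
```

The same two quantities could be computed from the regression model fitted later in this report; calling `plot()` on an `lm` object with `which = 1:2` gives the graphical versions of these checks.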

Analyses & Support

Analysis 1: Does Roast Level Significantly Affect Price?

Code

# One-way ANOVA: Does roast level predict log(price)?
coffee_roast <- coffee_clean %>% filter(!is.na(roast))

aov_roast <- aov(log(`100g_USD`) ~ roast, data = coffee_roast)
summary(aov_roast)

              Df Sum Sq Mean Sq F value  Pr(>F)    
roast          4   23.7   5.932   15.73 1.1e-12 ***
Residuals   2075  782.6   0.377                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

ggplot(coffee_roast, aes(x = reorder(roast, `100g_USD`, FUN = median),
                          y = log(`100g_USD`), fill = roast)) +
  geom_violin(alpha = 0.6) +
  geom_boxplot(width = 0.15, fill = "white", outlier.alpha = 0.2) +
  scale_fill_brewer(palette = "YlOrBr") +
  coord_flip() +
  labs(
    title    = "Log(Price) Distribution by Roast Level",
    subtitle = "Violin + boxplot shows full spread and median by roast type",
    x        = "Roast Level",
    y        = "log(Price per 100g USD)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: This one-way ANOVA tests whether the mean log-price differs across roast levels. The result is statistically significant, F(4, 2075) = 15.73, p < 0.001, so at least one roast level has a different mean log-price, which justifies including roast in the regression model. The ANOVA is an omnibus test: on its own it does not identify which roast levels differ, so level-by-level comparisons are deferred to the regression coefficients. The violin plot reinforces the visual differences seen in the EDA, particularly the tendency for Light roasts to occupy higher price territory than the other roast levels.
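A one-way ANOVA yields a single F-test, so identifying which roast levels differ from each other requires a follow-up. One common option, which was not run for this report, is Tukey's HSD. The sketch below runs it on simulated stand-in data (`sim` is hypothetical); in this report, the same call could be applied to `aov_roast`.

```r
# Hypothetical stand-in data with three roast levels and slightly different means.
set.seed(1)
sim <- data.frame(
  roast     = factor(rep(c("Light", "Medium-Light", "Medium"), each = 50)),
  log_price = c(rnorm(50, 2.20, 0.5), rnorm(50, 2.00, 0.5), rnorm(50, 1.95, 0.5))
)

fit <- aov(log_price ~ roast, data = sim)

# Pairwise comparisons with family-wise 95% confidence intervals.
tukey <- TukeyHSD(fit, "roast", conf.level = 0.95)
tukey$roast   # one row per pair; columns: diff, lwr, upr, p adj
```

Each row reports the difference in mean log-price for one pair of roast levels, with an adjusted p-value that accounts for making several comparisons at once.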

Analysis 2: Do High-Rated Coffees Cost More? (Hypothesis Test)

This hypothesis test directly ties to the main objective: understanding what drives coffee price.

Hypotheses (Neyman-Pearson Framework):

H₀: The mean price of high-rated coffees (rating ≥ 94) equals the mean price of lower-rated coffees (rating < 94)
H₁: The mean price of high-rated coffees differs from lower-rated coffees
α = 0.05, two-sided Welch t-test

Code

coffee_hyp <- coffee_clean %>%
  mutate(
    rating_group = if_else(rating >= 94, "High (≥94)", "Low (<94)")
  )

# Sample size check (power analysis)
power.t.test(
  delta     = 2,        # $2 meaningful difference in avg price
  sd        = 8,        # approximate SD of price from EDA
  sig.level = 0.05,
  power     = 0.80,
  type      = "two.sample",
  alternative = "two.sided"
)


     Two-sample t test power calculation 

              n = 252.1281
          delta = 2
             sd = 8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
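As a rough hand check on the required sample size, the usual normal-approximation formula for a two-sided, two-sample comparison can be applied to the same inputs (a $2 difference, SD of 8, α = 0.05, power = 0.80). This is an approximation rather than the exact calculation that power.t.test performs.

```latex
\[ n \approx \frac{2\,(z_{0.975} + z_{0.80})^2\,\sigma^2}{\delta^2} = \frac{2\,(1.960 + 0.842)^2 \times 8^2}{2^2} \approx 251 \]
```

The function reports a slightly larger value (252.13 per group) because it works with the t-distribution rather than the normal approximation.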

Code

t_result <- t.test(
  `100g_USD` ~ rating_group,
  data      = coffee_hyp,
  var.equal = FALSE
)
t_result


    Welch Two Sample t-test

data:  100g_USD by rating_group
t = 9.6536, df = 1158.4, p-value < 2.2e-16
alternative hypothesis: true difference in means between group High (≥94) and group Low (<94) is not equal to 0
95 percent confidence interval:
 3.830612 5.784885
sample estimates:
mean in group High (≥94)  mean in group Low (<94) 
               11.908510                 7.100762

Code

rating_price_summary <- coffee_hyp %>%
  group_by(rating_group) %>%
  summarise(
    avg_price = mean(`100g_USD`, na.rm = TRUE),
    n         = sum(!is.na(`100g_USD`)),                 # count only non-missing prices
    se        = sd(`100g_USD`, na.rm = TRUE) / sqrt(n)   # standard error of the mean
  )

ggplot(rating_price_summary, aes(x = rating_group, y = avg_price, fill = rating_group)) +
  geom_col(alpha = 0.85, width = 0.5) +
  geom_errorbar(aes(ymin = avg_price - 1.96 * se,
                    ymax = avg_price + 1.96 * se),
                width = 0.15, linewidth = 0.8) +
  geom_text(aes(label = dollar(round(avg_price, 2))), hjust = -.59, vjust = -.70, fontface = "bold") +
  scale_fill_manual(values = c("High (≥94)" = "#6F4E37", "Low (<94)" = "#C8A882")) +
  scale_y_continuous(labels = dollar_format()) +
  labs(
    title    = "Average Price by Rating Group",
    subtitle = "Error bars represent 95% confidence intervals",
    x        = "Rating Group",
    y        = "Average Price (USD per 100g)"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The two-sample Welch t-test directly tests whether the premium pricing of high-rated coffees is statistically significant. For this particular test, premium coffees fall into the grouping that received a rating of at least 94.

Average price of coffee based on ratings groups:

High rating: $11.91
Low rating: $7.10

Decision: The p-value is below 0.05, which leads us to reject $H_0$ and conclude that rating group membership is meaningfully associated with price, a finding with direct relevance to roasters who invest in quality to justify premium price points. On average, coffees rated 94 or higher sell for about $4.81 more per 100g (95% CI: $3.83 to $5.78). Overall, these results support premium pricing for roasters whose coffees earn top ratings, though the association is not proof that a higher rating causes a higher price.
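The confidence interval in the t-test output can be reconstructed from the reported group means and t-statistic. The critical value of 1.962 is my assumption for a two-sided 95% interval with roughly 1,158 degrees of freedom; everything else comes from the output above.

```latex
\[ \bar{x}_{\text{High}} - \bar{x}_{\text{Low}} = 11.909 - 7.101 = 4.808 \]
\[ \text{SE} = \frac{4.808}{t} = \frac{4.808}{9.654} \approx 0.498 \]
\[ 4.808 \pm 1.962 \times 0.498 \approx (3.83,\ 5.78) \]
```

Because the whole interval sits well above zero, the price gap between the two rating groups is unlikely to be a sampling artifact.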

Analysis 3: Does Roaster Country Affect Price? (Hypothesis Test)

Hypotheses:

H₀: Mean coffee price is the same across all roaster countries
H₁: At least one roaster country differs in mean coffee price
Test: One-way ANOVA on log(100g_USD) by loc_country (countries with ≥ 10 reviews)

Code

coffee_country <- coffee_clean %>%
  group_by(loc_country) %>%
  filter(n() >= 10) %>%
  ungroup()

aov_country <- aov(log(`100g_USD`) ~ loc_country, data = coffee_country)
summary(aov_country)

              Df Sum Sq Mean Sq F value Pr(>F)    
loc_country    6   83.2  13.869   40.66 <2e-16 ***
Residuals   2052  699.9   0.341                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

ggplot(coffee_country,
       aes(x = reorder(loc_country, `100g_USD`, FUN = median),
           y = `100g_USD`)) +
  geom_boxplot(fill = "#6F4E37", alpha = 0.65, outlier.alpha = 0.2) +
  scale_y_log10(labels = dollar_format()) +
  coord_flip() +
  labs(
    title    = "Price Distribution by Roaster Country (Log Scale)",
    subtitle = "Countries with < 10 reviews excluded",
    x        = "Roaster Country",
    y        = "Price (USD per 100g, log scale)"
  ) +
  theme_minimal()

Interpretation: The ANOVA tests whether roaster country / region explains a significant portion of price variance. For this analysis, I used log price rather than raw price because the log scale reduces the right skew and keeps the test consistent with the regression model. The test's primary goal is to inform roasters about how the competitive pricing landscape varies geographically.

Decision: Reject $H_0$. With F(6, 2052) = 40.66 and p < $2 \times 10^{-16}$, there is overwhelming evidence that at least one roaster country / region has a different mean log price than the others.

This statistically significant result affirms that where a coffee is roasted matters for its price, which may reflect labor costs and import/export economics. Currency is not a factor, as all prices have been converted to US dollars. The finding informs roasters about the competitive pricing landscape by geography. Note that a one-way ANOVA considers loc_country on its own; whether the country effect persists after accounting for rating and roast is addressed by the regression model below. The practical meaning for clients (coffee roasters) is that operating in a high-price market may carry a structural pricing advantage.
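Statistical significance does not say how much of the price variation a factor accounts for. A simple effect size, eta-squared, can be computed directly from the sums of squares in the two ANOVA tables above. The two tests use slightly different samples, so the comparison is approximate.

```latex
\[ \eta^2_{\text{loc\_country}} = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{residual}}} = \frac{83.2}{83.2 + 699.9} \approx 0.106 \]
\[ \eta^2_{\text{roast}} = \frac{23.7}{23.7 + 782.6} \approx 0.029 \]
```

On its own, roaster country accounts for roughly 11% of the variation in log price, compared with roughly 3% for roast level.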

Regression Model

Model Building

The centerpiece of this analysis is a multiple linear regression predicting log(100g_USD) from:

rating — continuous numeric predictor
roast — categorical (5 levels, Medium-Light as reference)
loc_country — categorical (filtered to countries with ≥ 10 reviews, United States as reference)

Origin (origin_1) has hundreds of levels and would inflate the model; it is better summarized through the EDA visualization above. This regression focuses on the most actionable predictors for roasting companies' pricing strategy.

Impact of each predictor: Why were rating, roast, and loc_country selected for this model?

Rating is included because the scatterplot (EDA Visual 3) highlighted a clear positive relationship with price. Rating is also the most direct measure of coffee quality available in this dataset.
Roast is incorporated because the boxplots (EDA Visual 2A) showed evident price differences across roast levels, which Analysis 1 supported. Practically speaking, roast level is a key business decision that coffee roasting companies control directly, making roast a valuable predictor.
Loc_country is included in large part because the ANOVA in Analysis 3 confirmed notable price differences among countries and regions. While coffee roasters cannot easily change the market they are located in, understanding the geographic context in which they operate is essential for setting prices and targeting key market segments.

Model reference level: Medium-Light coffee roasted in the United States

I chose Medium-Light roast coffee from the United States as the baseline, so the coefficients of the other four roast levels and the other six roaster locations are interpreted relative to it. This was a clear decision because Medium-Light is the most common roast level in the dataset, representing 1,490 of the 2,080 total reviews, while the United States has the most reviews of any roasting location with 1,331. Together, Medium-Light coffees roasted in the United States account for 995 observations in the full dataset.

This is important for two reasons:

Statistical stability - Medium-Light roasts in the United States account for 47.8% of the total observations, so comparisons against this baseline are estimated from ample data.
Interpretability - For coffee roasters, it makes practical sense to compare the other roast levels against the dominant market standard rather than a far rarer category such as Dark roasts. Similarly, the United States is a natural baseline for a global business audience.

Code

# Filter to complete cases and countries with >= 10 reviews
coffee_model <- coffee_clean %>%
  filter(!is.na(roast)) %>%
  group_by(loc_country) %>%
  filter(n() >= 10) %>%
  ungroup() %>%
  filter(!is.na(`100g_USD`), !is.na(rating))

# Relevel roast so Medium-Light is the reference (most common)
coffee_model <- coffee_model %>%
  mutate(roast = relevel(factor(roast, ordered = FALSE), ref = "Medium-Light"))

# Relevel location country so United States is the reference (most common)
coffee_model <- coffee_model %>%
  mutate(loc_country = relevel(factor(loc_country), ref = "United States")) 

# Fit the model
model <- lm(log(`100g_USD`) ~ rating + roast + loc_country,
            data = coffee_model)

summary(model)


Call:
lm(formula = log(`100g_USD`) ~ rating + roast + loc_country, 
    data = coffee_model)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5821 -0.3081 -0.1183  0.1632  2.6649 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -9.63712    0.76933 -12.527  < 2e-16 ***
rating                0.12395    0.00826  15.006  < 2e-16 ***
roastLight            0.18981    0.03596   5.279 1.44e-07 ***
roastMedium          -0.06661    0.03782  -1.761  0.07835 .  
roastMedium-Dark      0.03746    0.08975   0.417  0.67648    
roastDark             0.12604    0.24772   0.509  0.61095    
loc_countryCanada    -0.24007    0.09921  -2.420  0.01561 *  
loc_countryGuatemala -0.44265    0.10389  -4.261 2.13e-05 ***
loc_countryHawai'i    0.79631    0.06061  13.138  < 2e-16 ***
loc_countryHong Kong  0.57207    0.12281   4.658 3.40e-06 ***
loc_countryJapan      0.45510    0.15985   2.847  0.00446 ** 
loc_countryTaiwan    -0.04583    0.02814  -1.629  0.10351    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5451 on 2047 degrees of freedom
Multiple R-squared:  0.2233,    Adjusted R-squared:  0.2191 
F-statistic:  53.5 on 11 and 2047 DF,  p-value: < 2.2e-16

Code

# Clean coefficient table
tidy(model, conf.int = TRUE) %>%
  mutate(
    estimate   = round(estimate, 3),
    std.error  = round(std.error, 3),
    statistic  = round(statistic, 3),
    p.value    = round(p.value, 4),
    conf.low   = round(conf.low, 3),
    conf.high  = round(conf.high, 3)
  ) %>%
  kable(caption = "Regression Coefficients: Predictors of log(Price per 100g USD)")

Regression Coefficients: Predictors of log(Price per 100g USD)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-9.637	0.769	-12.527	0.0000	-11.146	-8.128
rating	0.124	0.008	15.006	0.0000	0.108	0.140
roastLight	0.190	0.036	5.279	0.0000	0.119	0.260
roastMedium	-0.067	0.038	-1.761	0.0784	-0.141	0.008
roastMedium-Dark	0.037	0.090	0.417	0.6765	-0.139	0.213
roastDark	0.126	0.248	0.509	0.6109	-0.360	0.612
loc_countryCanada	-0.240	0.099	-2.420	0.0156	-0.435	-0.046
loc_countryGuatemala	-0.443	0.104	-4.261	0.0000	-0.646	-0.239
loc_countryHawai’i	0.796	0.061	13.138	0.0000	0.677	0.915
loc_countryHong Kong	0.572	0.123	4.658	0.0000	0.331	0.813
loc_countryJapan	0.455	0.160	2.847	0.0045	0.142	0.769
loc_countryTaiwan	-0.046	0.028	-1.629	0.1035	-0.101	0.009

Coefficient Plot

Code

model_tidy <- tidy(model, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    significant = p.value < 0.05,
    term_label  = gsub("roast|loc_country", "", term)
  )

ggplot(model_tidy, aes(x = reorder(term_label, estimate),
                        y = estimate,
                        color = significant)) +
  geom_point(size = 2.5) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  scale_color_manual(values = c("TRUE" = "#6F4E37", "FALSE" = "gray"),
                     labels = c("Not significant", "Significant (p < 0.05)")) +
  coord_flip() +
  labs(
    title    = "Regression Coefficients: Predictors of log(Price)",
    subtitle = "Points to the right = associated with higher prices; reference = Medium-Light USA roasts",
    x        = "Predictors",
    y        = "Coefficient on log(price); % effect = 100 * (exp(coef) - 1)",
    color    = NULL
  ) +
  theme_minimal()

Interpretation of Results

In the first model I tested, Canada was the default reference for loc_country selected by R, which uses the first factor level by default. To improve interpretability, I changed the country reference to the United States, which is a more useful baseline for a global client audience.

Fitted Regression Model:

\[ \log(\text{price}_{100g}) = \beta_0 + \beta_1(\text{rating}) + \beta_2(\text{roast}) + \beta_3(\text{loc\_country}) \]
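Written out with the estimates from the coefficient table, the fitted model looks as follows. Square brackets denote indicator variables (1 if the coffee belongs to that category, 0 otherwise); this notation is my own shorthand, and the reference categories (Medium-Light, United States) have no term.

```latex
\[
\begin{aligned}
\widehat{\log(\text{price}_{100g})} = {} & -9.637 + 0.124\,\text{rating} \\
& + 0.190\,[\text{Light}] - 0.067\,[\text{Medium}] + 0.037\,[\text{Medium-Dark}] + 0.126\,[\text{Dark}] \\
& - 0.240\,[\text{Canada}] - 0.443\,[\text{Guatemala}] + 0.796\,[\text{Hawai'i}] \\
& + 0.572\,[\text{Hong Kong}] + 0.455\,[\text{Japan}] - 0.046\,[\text{Taiwan}]
\end{aligned}
\]
```

As a worked example, a Medium-Light coffee roasted in the United States with a rating of 94 has a predicted log-price of $-9.63712 + 0.12395 \times 94 \approx 2.014$, which back-transforms to $e^{2.014} \approx 7.49$, or about $7.50 per 100g.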

Overall Model

The multiple linear regression model answers the core question: What predicts coffee price? The F-statistic (53.5 on 11 and 2047 degrees of freedom, p < $2.2 \times 10^{-16}$) confirms that the model as a whole is highly significant. The adjusted $R^2$ = 0.22 means that rating, roast, and loc_country together explain roughly 22% of the variation in log(100g_USD). The model uses Medium-Light coffee roasted in the United States as the baseline level. Changing the reference levels would change the individual roast and loc_country coefficients and their p-values, since each is a comparison against the baseline, but not the overall F-statistic or adjusted $R^2$. These references are retained as benchmarks because the ample observations of Medium-Light roasts from the United States yield stable comparisons. The remaining 78% of price variation is not explained by the model and may reflect other variables in the dataset, such as coffee name, roaster, and origin_1, as well as outside factors such as brand reputation or processing method.
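One property worth verifying is how the choice of reference level affects the model. The sketch below, on simulated data rather than the coffee dataset, fits the same regression with two different roast baselines; `sim` and its data-generating coefficients are hypothetical.

```r
# Hypothetical data: rating plus a three-level roast factor.
set.seed(7)
n   <- 300
sim <- data.frame(
  rating = round(rnorm(n, mean = 93, sd = 1.5)),
  roast  = factor(sample(c("Light", "Medium-Light", "Medium"), n, replace = TRUE))
)
sim$log_price <- -9 + 0.12 * sim$rating + 0.2 * (sim$roast == "Light") +
  rnorm(n, sd = 0.5)

# Same model, two different reference levels for roast.
fit_a <- lm(log_price ~ rating + relevel(roast, ref = "Medium-Light"), data = sim)
fit_b <- lm(log_price ~ rating + relevel(roast, ref = "Light"), data = sim)

# Overall fit is identical; only the baseline-relative coefficients change.
same_r2     <- isTRUE(all.equal(summary(fit_a)$r.squared, summary(fit_b)$r.squared))
same_fitted <- isTRUE(all.equal(unname(fitted(fit_a)), unname(fitted(fit_b))))
c(same_r2 = same_r2, same_fitted = same_fitted)
```

The two fits produce the same predictions and the same $R^2$, so the reference level is purely a choice about which comparisons the coefficient table reports.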

Predictors

1. `Rating`

The strongest and most consistent predictor in the linear regression model. Before building the model, the EDA and analyses showed a clear positive relationship between rating and 100g_USD: as rating increases, so does price. The linear model quantifies this relationship, holding roast and loc_country constant.

A 1-point rating increase is associated with a 13.2% increase in price:

\[ e^{0.124} \approx 1.132 \]

In this dataset, coffee rating is highly significant (p < $2 \times 10^{-16}$). This is good news for clients, as it indicates that expert-reviewed quality is reliably priced into the market. The positive association between rating and price is an actionable insight that coffee roasting companies can use to guide pricing strategy and to make informed investments in sourcing and roasting quality.

2. `Roast`

Roast Level: Reference = Medium-Light

In comparison to *Medium-Light* coffee roasts, holding rating and roaster country constant, **only Light roast is statistically significant.** In other words, only Light roast commands a meaningful price premium over Medium-Light. Medium, Medium-Dark, and Dark roasts do not differ significantly from Medium-Light; Medium is the only one with a negative estimate, and it falls just short of significance.
Predictor	Estimate	Significant (p < 0.05)?	Interpretation
Light (`roast`)	+0.190	Yes	21% more expensive than Medium-Light
Medium (`roast`)	-0.067	No (p = 0.078)	Marginally cheaper; not conclusive
Medium-Dark (`roast`)	+0.037	No (p = 0.677)	No meaningful difference
Dark (`roast`)	+0.126	No (p = 0.611)	No meaningful difference

Relative to the Medium-Light baseline, Light roasts are about 21% more expensive.

\[ e^{0.190} \approx 1.209 \]

Although the non-significance of the darker roasts limits the overall impact that roast level has on price, the model shows that Light roast carries the clearest price premium, holding rating and roaster country constant. Coffee roasters can use this information not only to set premium prices but also to justify them to consumers. The key actionable insight for clients is clear: if premium pricing is the goal, offering high-quality Light roasts is the roast-level strategy the data most strongly supports. Because this is an association rather than a causal effect, roasters should treat a move toward lighter roasts as a promising signal of pricing power, not a guarantee.

3. `Loc_country`

Roaster Country / Region: Reference = United States

The United States serving as the reference country / region in this model maximizes interpretability for clients. Hawaii, Hong Kong, and Japan all differentiate themselves significantly from the United States from a pricing standpoint: **Hawaii, Hong Kong, and Japan each command large price premiums in comparison to US-roasted coffees**, even after accounting for `rating` and `roast`. The most striking pattern in the table is the wide spread in coffee prices across roasting countries / regions. The coefficient plot also shows wide confidence intervals for several `loc_country` estimates, likely reflecting the small number of reviews from those locations.
Country / Region	Estimate	Significant (p < 0.05)?	Interpretation
Hawaii (`loc_country`)	+0.796	Yes	122% more expensive than US
Hong Kong (`loc_country`)	+0.572	Yes	77% more expensive than US
Japan (`loc_country`)	+0.455	Yes	58% more expensive than US
Taiwan (`loc_country`)	-0.046	No (p = 0.104)	No meaningful difference from US
Canada (`loc_country`)	-0.240	Yes	21% cheaper than US
Guatemala (`loc_country`)	-0.443	Yes	36% cheaper than US

To quantify this variance, let's compare Guatemala and Hawaii coffee pricing relative to the United States:

\[ \textbf{Guatemala: } e^{-0.443} \approx 0.642 \hspace{2cm} \textbf{Hawaii: } e^{0.796} \approx 2.217 \]

Higher prices could reflect markets with stronger, more reputable brand names or higher cost structures, either of which could justify elevated prices, but there is reason to believe that other regional factors also contribute to these regions' dominant pricing power. Relative to the US baseline, Guatemala sits 36% below and Hawaii 122% above, a 158-percentage-point spread that illustrates how strongly price varies globally. Hawaii is the major standout: its extreme coffee prices can be attributed to the premium coffee brands in the region and high local costs of production. From exploring the data, I found that 100% of the coffees grown and roasted in Hawaii in this dataset are Medium, Medium-Light, or Light roasts. This finding aligns with the insight above that lighter roasts consistently have higher prices and ratings. Additionally, Hawaii does not have a single coffee rated below 90, highlighting a remarkable collective performance across roasting companies in the region. Ultimately, this supports the belief that specialty roasting companies have the internal ability to optimize their service and coffee to meet customers' needs.
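The conversion from log-scale estimates to percentage differences can be sketched in a few lines of R. This is a minimal illustration that uses only the `loc_country` estimates printed in the table above; the object names `coefs`, `pct_vs_us`, and `hawaii_vs_guatemala` are introduced here for illustration and are not part of the original analysis code.

```r
# Log-scale loc_country estimates copied from the table above
# (reference level: United States).
coefs <- c(
  Hawaii = 0.796, `Hong Kong` = 0.572, Japan = 0.455,
  Taiwan = -0.046, Canada = -0.240, Guatemala = -0.443
)

# With log(price) as the response, exp(estimate) is a price multiplier
# relative to the US, so (exp(estimate) - 1) * 100 is the percent difference.
pct_vs_us <- (exp(coefs) - 1) * 100
print(round(pct_vs_us))

# Direct comparison of two non-reference levels: the difference of their
# estimates, exponentiated, is the ratio of their predicted prices.
hawaii_vs_guatemala <- exp(coefs[["Hawaii"]] - coefs[["Guatemala"]])
print(round(hawaii_vs_guatemala, 2))
```

Rounded, the first result reproduces the table's 122%, 77%, 58%, -21%, and -36% figures. The second shows that, holding rating and roast fixed, the model's predicted Hawaii price is roughly 3.45 times the predicted Guatemala price.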

Conclusions

This analysis set out to answer one question for specialty coffee roasters worldwide: What predicts coffee price? The findings are clear and actionable:

1. Rating is a significant predictor of price. Coffees rated higher by expert reviewers consistently command higher prices. This validates the business case for investing in quality — the market rewards it, as many consumers are willing to pay more for premium coffees.

2. Roast type matters. Light roasts are associated with higher prices, while Dark roasts command the lowest prices in this dataset. For roasters considering expanding their product lines, leaning into light-to-medium-light roasting of quality beans will likely yield stronger pricing power.

3. Roaster country affects price independent of quality. After controlling for rating and roast type, coffee roasting location still influences price. This suggests that market-level factors such as consumer willingness to pay, cost structures, and brand premiumization vary geographically.

4. Origin matters, even beyond what the model captures. The EDA clearly shows that certain origins command price premiums. Roasters who source from high-prestige origins (e.g., Ethiopian specialty regions, areas of Panama) have a market basis for premium pricing.

5. Excellent internal production, service, and coffee quality are paramount. The variables analyzed from the dataset, along with external factors, are major indicators of coffee price, but ultimately everything circles back to the roasting companies themselves. Roasters must control the controllables: deliver the best coffee they can, build their brand, earn good ratings, and ensure high customer satisfaction, so that they are positioned to make pricing decisions successfully.

Actionable Recommendations

| Recommendation | Supporting Evidence |
|---|---|
| Invest in quality: it pays. Prioritize sourcing and roasting practices that improve ratings. The regression confirms rating is a significant positive predictor of price. | Hypothesis Test (Analysis 2), Regression coefficient on `rating` |
| Lead with Light roasts for premium positioning. Light roasts are associated with higher market prices. | Roast-level ANOVA, Regression coefficient on `roastLight` |
| Source from high-prestige origins. Origins with strong reputations command price premiums. Use the origin-price EDA to identify high-opportunity sourcing regions. | Analysis 1 (EDA), Origin price visualization |
| Consider geographic market when pricing. If your roaster operates in a market where prices tend to be lower, consider online/export channels to reach markets with higher willingness to pay. | Country ANOVA, Regression coefficient on `loc_country` |
| Use the regression model as a pricing benchmark. Given a new coffee’s rating, roast type, and roaster country, plug values into the model to generate a log(price) prediction, then exponentiate to get a dollar estimate. This gives a data-driven starting point for pricing decisions. | Full multiple linear regression model |
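
The pricing-benchmark recommendation can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the report's fitted model: `predict_price()` and the `toy` data frame are hypothetical, the toy values are synthetic and exist only so the example runs, and the column name `price` stands in for the dataset's price-per-100g column.

```r
# Hypothetical helper: back-transform a log(price) prediction to dollars.
# `fit` is assumed to be an lm() object with log(price) as the response.
predict_price <- function(fit, new_coffee) {
  log_pred <- predict(fit, newdata = new_coffee, interval = "prediction")
  exp(log_pred)  # columns: fit, lwr, upr (dollars per 100g)
}

# Synthetic data for illustration only; NOT values from the coffee dataset.
set.seed(1)
toy <- data.frame(
  rating      = rep(c(90, 92, 94, 96), times = 6),
  roast       = rep(c("Medium", "Light", "Medium-Light"), each = 8),
  loc_country = rep(c("United States", "Japan"), times = 12)
)
toy$price <- exp(-5 + 0.07 * toy$rating +
                   0.2 * (toy$roast == "Light") +
                   rnorm(nrow(toy), sd = 0.1))

# Same model form the report describes: log price on rating, roast, country.
fit <- lm(log(price) ~ rating + roast + loc_country, data = toy)

# Benchmark for a hypothetical new coffee.
new_coffee <- data.frame(rating = 93, roast = "Light", loc_country = "Japan")
est <- predict_price(fit, new_coffee)
print(est)
```

Because the model is fitted on log(price), the exponentiated point estimate is better read as a typical (median) price than as a mean, and the back-transformed prediction interval gives a plausible range rather than a single target.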
