1. Create New Numeric Variables

# Engineer 'price_per_rating' and 'log_price'
coffee_clean <- coffee_clean %>%
  mutate(
    price_per_rating = `100g_USD` / rating,  # USD per 100g for each rating point
    log_price = log(`100g_USD`)              # natural log of price per 100g
  )

2. Pair 1

Pair 1: Rating vs Price per Rating

Visual

ggplot(coffee_clean, aes(x = price_per_rating, y = rating)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  labs(
    title = "Rating vs. Price per Rating Point",
    x = "Price per Rating Point (USD)",
    y = "Rating"
  ) +
  theme_minimal()

Visual Conclusions

I created the ‘price_per_rating’ variable to examine whether coffee is priced in line with its value, as measured by rating. The scatter plot puts ‘price_per_rating’ on the x-axis as the explanatory variable and ‘rating’ on the y-axis as the response. Barring a few outliers, the plot shows a weak positive association between the two: price per rating point in USD rises slightly as ratings increase. This raises the question: are higher-rated coffees actually worth their price? Based on this plot, I would conclude that any coffee with a price per rating point above 0.50 is substantially overvalued, especially given the number of highly rated coffees with a far more reasonable price per rating point. I would classify all coffees with a price per rating point over 0.50, regardless of rating, as outliers that represent overvalued coffee. The extreme values of some outliers likely result from the dataset recording prices per 100g of coffee rather than per cup.
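The 0.50 cutoff described above can be applied directly as a flag. The following is a minimal sketch on a small made-up data frame: the `toy` values are invented for illustration and are not taken from the dataset, and the `overvalued` column name is an assumption introduced here.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  rating     = c(93, 95, 97, 90),
  `100g_USD` = c(5.00, 12.00, 80.00, 30.00),
  check.names = FALSE  # keep the non-syntactic column name as-is
)

toy_flagged <- toy %>%
  mutate(
    price_per_rating = `100g_USD` / rating,
    overvalued       = price_per_rating > 0.50  # assumed cutoff from the text
  )

toy_flagged
```

The same `mutate()` step could be applied to `coffee_clean` to count how many coffees fall above the cutoff.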

Correlation Coefficient

cor_rating_priceper <- cor(
  coffee_clean$rating,
  coffee_clean$price_per_rating,
  use = "complete.obs",
  method = "pearson"
)

cor_rating_priceper
## [1] 0.2609673

Correlation coefficient = 0.261

The correlation coefficient is much closer to 0 than to 1, which indicates only a weak positive correlation between rating and price per rating. There is no clear trend, and no obvious explanation for why some coffees are expensive relative to their rating while other highly rated coffees cost far less. The major question that still needs to be explored is whether value is meaningfully tied to rating. The main limitation of this dataset is that rating is the only continuous indicator of coffee value aside from price; in the future, I will explore categorical factors that could determine the overall value of coffee. Additionally, the correlation could change if I restrict the analysis to individual countries.
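To gauge how much uncertainty surrounds a coefficient of this size, base R's `cor.test()` reports a confidence interval for the Pearson correlation alongside the estimate. This is only a sketch on synthetic data: the vectors `x` and `y` are simulated stand-ins, not the coffee variables.

```r
# Synthetic data for illustration only (not the coffee data)
set.seed(7)
x <- rnorm(300)
y <- 0.25 * x + rnorm(300)

res <- cor.test(x, y, method = "pearson", conf.level = 0.95)

res$estimate   # sample correlation
res$conf.int   # 95% confidence interval for the population correlation
```

Passing `coffee_clean$rating` and `coffee_clean$price_per_rating` in place of `x` and `y` would give the corresponding interval for this pair.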

Confidence Interval

rating_mean  <- mean(coffee_clean$rating, na.rm = TRUE)
rating_sd    <- sd(coffee_clean$rating, na.rm = TRUE)
rating_n     <- sum(!is.na(coffee_clean$rating))
rating_se    <- rating_sd / sqrt(rating_n)
alpha        <- 0.05
t_crit       <- qt(1 - alpha/2, df = rating_n - 1)

lower_rating <- rating_mean - t_crit * rating_se
upper_rating <- rating_mean + t_crit * rating_se

c(lower_rating, upper_rating)
## [1] 93.04378 93.17834

Major Conclusion: Based on the sample in this coffee reviews dataset, I estimate the population mean rating of coffees to be between 93.04 and 93.18 with 95% confidence. While this interval may seem very tight, the minimum coffee rating is 84 and the maximum is 98, so ratings have far less room to vary than prices do. With an average rating of approximately 93/100 and all ratings between 84 and 98, the interval shows that the mean is estimated precisely. It does not by itself show that the dataset represents all coffees, since it contains no coffees rated below 84.
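The manual interval above (mean plus or minus the t critical value times the standard error) can be cross-checked against base R's `t.test()`, which computes the same one-sample interval. A minimal sketch, using a simulated vector `x` as a stand-in for the ratings:

```r
# Synthetic ratings for illustration only (not the coffee data)
set.seed(1)
x <- round(rnorm(200, mean = 93, sd = 1.5))

n      <- sum(!is.na(x))
se     <- sd(x, na.rm = TRUE) / sqrt(n)
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

# Manual interval: mean +/- t critical value * standard error
manual_ci  <- mean(x, na.rm = TRUE) + c(-1, 1) * t_crit * se

# Built-in interval from the one-sample t-test
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual_ci, builtin_ci)
```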

3. Pair 2

Pair 2: Price per 100g vs Log price

Visual

ggplot(coffee_clean, aes(x = `100g_USD`, y = log_price)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  labs(
    title = "Price vs. Log-Transformed Price",
    x = "Price (100g USD)",
    y = "Log Price"
  ) +
  theme_minimal()

Visual Conclusions

I took the log of price to represent ‘100g_USD’ in a different way, so that changes can be compared as percentage movements rather than in US dollars. The question this plot aims to answer is how log-transformed price relates to the raw coffee price per 100g. The major takeaway is the smooth curve: ‘log_price’ increases as ‘100g_USD’ increases, but far more slowly. The only identifiable outliers are coffees priced over 100 USD per 100g, which sit far from the majority of the data points. Most of the values are compressed into the low-price end of the plot, which makes the visual highly effective at showing outliers.
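The "percentage movements" reading follows from a basic property of the logarithm. As a short worked equation, with p denoting the price per 100g and r a relative price change (both symbols are introduced here for illustration):

```latex
\log\bigl(p\,(1 + r)\bigr) - \log(p) = \log(1 + r)
```

The difference does not depend on p, so a 10% price increase shifts ‘log_price’ by log(1.1), roughly 0.095, whether the coffee costs 5 USD or 100 USD per 100g.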

Correlation Coefficient

cor_price_log <- cor(
  coffee_clean$`100g_USD`,
  coffee_clean$log_price,
  use = "complete.obs",
  method = "pearson"
)

cor_price_log
## [1] 0.8470088

Correlation coefficient = 0.847

As expected, there is a strong positive correlation between price and log price. The coefficient falls short of 1 because the Pearson correlation measures linear association, while the log is a curved (though strictly increasing) function of price; a few very high-priced outliers make that curvature more pronounced. The strong coefficient is consistent with log price being a stable transformation of the actual price.
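One way to probe why the coefficient falls short of 1 is to compare the Pearson correlation with a rank-based measure such as Spearman's, which depends only on ordering. The sketch below uses simulated right-skewed prices (the `price` vector is synthetic, not the coffee data); because the log is strictly increasing, it preserves ranks.

```r
# Synthetic right-skewed prices for illustration only (not the coffee data)
set.seed(42)
price <- rlnorm(500, meanlog = 2, sdlog = 0.8)

# Pearson measures linear association; Spearman measures monotonic association
pearson  <- cor(price, log(price), method = "pearson")
spearman <- cor(price, log(price), method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Running the same comparison with `method = "spearman"` on ‘100g_USD’ and ‘log_price’ would show whether the gap from 1 comes from the curvature of the log rather than from any loss of information.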

Confidence Interval

price_mean  <- mean(coffee_clean$`100g_USD`, na.rm = TRUE)
price_sd    <- sd(coffee_clean$`100g_USD`, na.rm = TRUE)
price_n     <- sum(!is.na(coffee_clean$`100g_USD`))
price_se    <- price_sd / sqrt(price_n)
alpha       <- 0.05
t_crit_p    <- qt(1 - alpha/2, df = price_n - 1)

lower_price <- price_mean - t_crit_p * price_se
upper_price <- price_mean + t_crit_p * price_se

c(lower_price, upper_price)
## [1] 8.641764 9.530775

Major Conclusion: I estimate the population mean price per 100g of coffee to be between 8.64 and 9.53 USD with 95% confidence. In other words, the average price of coffee in this dataset is roughly 9 USD per 100g. In the future, I would like to investigate which categorical variables have the most impact on coffee prices, and analyze potential sources of variability such as the origin of the coffee, the type of roast, or the country where the roaster is located.