# Engineering 'price_per_rating' and 'log_price'
coffee_clean <- coffee_clean %>%
  mutate(
    price_per_rating = `100g_USD` / rating,
    log_price = log(`100g_USD`)
  )
Pair 1: Rating vs Price per Rating
Response Variable: ‘rating’
Explanatory Variable: ‘price_per_rating’
ggplot(coffee_clean, aes(x = price_per_rating, y = rating)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  labs(
    title = "Rating vs. Price per Rating Point",
    x = "Price per Rating Point (USD)",
    y = "Rating"
  ) +
  theme_minimal()
In creating the ‘price_per_rating’ variable, my goal was to analyze whether coffee is priced appropriately for its actual value, as measured by its rating. The scatter plot puts ‘rating’ on the Y-axis as the response and ‘price_per_rating’ on the X-axis as the explanatory variable. Barring a few outliers, I believe the plot shows a weak to moderate positive correlation between ‘rating’ and ‘price_per_rating’: the price per rating point in USD rises slightly as coffee ratings increase. This raises the question: are higher-rated coffees actually worth their price? Based on the plot, I would conclude that any coffee with a price per rating point above 0.50 USD is very overvalued, especially given the number of highly rated coffees that sit at a far more reasonable price per rating point. I would classify all coffees with a price per rating point over 0.50 USD, regardless of rating, as outliers that represent overvalued coffee. The vast gap for certain outliers likely results from the dataset reporting prices per 100g of coffee rather than per cup.
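The 0.50 USD cutoff described above can be turned into an explicit flag. The following is a minimal sketch on a small made-up data frame (the `toy` values and the `overvalued` column are hypothetical, not taken from the coffee dataset), assuming the same `100g_USD / rating` definition used for ‘price_per_rating’.

```r
# Hypothetical toy data standing in for coffee_clean (values are made up)
toy <- data.frame(
  rating = c(92, 94, 96, 93, 97),
  `100g_USD` = c(5.00, 6.50, 60.00, 4.80, 120.00),
  check.names = FALSE  # keep the non-syntactic column name `100g_USD`
)

# Same engineered variable as in the report
toy$price_per_rating <- toy$`100g_USD` / toy$rating

# Flag coffees above the 0.50 USD-per-rating-point threshold
threshold <- 0.50
toy$overvalued <- toy$price_per_rating > threshold

toy[, c("rating", "100g_USD", "price_per_rating", "overvalued")]
```

Applied to the real data, the same comparison could be added inside the existing `mutate()` call, and `sum(overvalued, na.rm = TRUE)` would count how many coffees fall above the cutoff.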
cor_rating_priceper <- cor(
  coffee_clean$rating,
  coffee_clean$price_per_rating,
  use = "complete.obs",
  method = "pearson"
)
cor_rating_priceper
## [1] 0.2609673
Correlation coefficient = 0.261
The correlation coefficient is much closer to 0 than to 1, which supports the initial impression of a weak positive correlation between rating and price per rating point. A positive linear association is present, but it is not strong. In the future, I may look into other variables that could have a stronger correlation with rating than price alone. There is no clear trend that explains why some coffees are expensive relative to their rating while other highly rated coffees cost far less. The major question that still needs to be explored is whether value is meaningfully tied to rating. The major limitation of this dataset is that rating is the only major continuous indicator of coffee value aside from price; in the future, I will explore categorical factors that could determine the overall value of coffee. Additionally, the correlation could be stronger if I focus the analysis on individual countries.
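Because `cor()` alone returns only a point estimate, one way to attach a p-value and confidence interval to the rating vs. price-per-rating relationship is `cor.test()`. The sketch below runs on simulated data (the `sim` data frame and its `price_100g` column are hypothetical and not the coffee dataset). The Spearman rank correlation is an addition not used in the report, shown only because it is less sensitive to the extreme price outliers.

```r
# Simulated stand-in data; none of these values come from the coffee dataset
set.seed(1)
n <- 200
sim <- data.frame(rating = round(rnorm(n, mean = 93, sd = 1.5)))
# Right-skewed prices that rise slightly with rating (an assumption for illustration)
sim$price_100g <- exp(rnorm(n, mean = 2, sd = 0.6) + 0.1 * (sim$rating - 93))
sim$price_per_rating <- sim$price_100g / sim$rating

# Pearson correlation with a p-value and 95% confidence interval
pearson <- cor.test(sim$rating, sim$price_per_rating, method = "pearson")

# Spearman rank correlation; exact = FALSE because integer ratings produce ties
spearman <- cor.test(sim$rating, sim$price_per_rating,
                     method = "spearman", exact = FALSE)

pearson$estimate
pearson$conf.int
pearson$p.value
spearman$estimate
```

With the real data, replacing `sim` with `coffee_clean` would show whether the 0.261 coefficient is statistically distinguishable from zero, which is a separate question from how strong it is.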
rating_mean <- mean(coffee_clean$rating, na.rm = TRUE)
rating_sd <- sd(coffee_clean$rating, na.rm = TRUE)
rating_n <- sum(!is.na(coffee_clean$rating))
rating_se <- rating_sd / sqrt(rating_n)
alpha <- 0.05
t_crit <- qt(1 - alpha/2, df = rating_n - 1)
lower_rating <- rating_mean - t_crit * rating_se
upper_rating <- rating_mean + t_crit * rating_se
c(lower_rating, upper_rating)
## [1] 93.04378 93.17834
Major Conclusion: Based on the sample in this coffee reviews dataset, I estimate with 95% confidence that the population mean rating of coffees is between 93.04 and 93.18. While this window may seem very tight, the minimum coffee rating is 84 and the maximum is 98, so there is far less room for variance than there is for price. Given that the average coffee rating is approximately 93/100, this suggests the dataset gives a precise estimate of the mean for coffees rated between 84 and 98, although a tight interval alone does not rule out bias towards higher-rated coffees.
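The manual interval above can be cross-checked against base R's `t.test()`, which computes the same one-sample t interval. This is a minimal sketch on simulated ratings (the `ratings` vector is hypothetical, not the dataset's values).

```r
# Simulated ratings; not the real coffee ratings
set.seed(42)
ratings <- round(rnorm(500, mean = 93, sd = 1.5))

# Manual 95% t interval, following the same steps as the report
n <- length(ratings)
se <- sd(ratings) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
manual_ci <- c(mean(ratings) - t_crit * se, mean(ratings) + t_crit * se)

# Built-in equivalent; as.numeric() drops the conf.level attribute
builtin_ci <- as.numeric(t.test(ratings, conf.level = 0.95)$conf.int)

all.equal(manual_ci, builtin_ci)
```

The interval width shrinks with the square root of the sample size, which is why a dataset of about two thousand reviews with a small standard deviation produces such a narrow window.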
Pair 2: Review date vs reviews per month
Original Variable: ‘review_date’
Engineered Variable: ‘reviews_per_month’
library(dplyr)
library(lubridate)
# Parse review date
coffee_clean <- coffee_clean %>%
  mutate(
    review_date = as.Date(paste0(review_date, " 01"), format = "%B %Y %d"),
    review_month = floor_date(review_date, unit = "month")
  )
# Create continuous time variable
coffee_clean <- coffee_clean %>%
  mutate(
    time_months = as.numeric(difftime(review_month, min(review_month), units = "days")) / 30
  )
# Engineer 'reviews_per_month' column
reviews_monthly <- coffee_clean %>%
  count(review_month, name = "reviews_per_month")
# Join columns back together
coffee_clean <- coffee_clean %>%
  left_join(reviews_monthly, by = "review_month")
cor(coffee_clean$time_months, coffee_clean$reviews_per_month, use = "complete.obs")
## [1] 0.5718173
Correlation coefficient = 0.572
This correlation coefficient indicates a moderate positive relationship between time and the number of reviews per month. The confidence interval and the visualization below give more detail, but it appears that the number of reviews per month does increase somewhat over time. A correlation is evident, although it is not exceptionally strong.
cor.test(coffee_clean$time_months, coffee_clean$reviews_per_month)
##
## Pearson's product-moment correlation
##
## data: coffee_clean$time_months and coffee_clean$reviews_per_month
## t = 31.773, df = 2078, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5421621 0.6000497
## sample estimates:
## cor
## 0.5718173
95% Confidence Interval = [0.542, 0.600]
The confidence interval is a tight window around the correlation coefficient. Additionally, the p-value in the output is very small (p < 2.2e-16), so the correlation is statistically distinguishable from zero; its size, 0.572, is what makes it moderate. The point estimate always lies inside its own confidence interval, so the useful information is the interval's width: the true correlation is likely between 0.542 and 0.600. This leads me to conclude that a strong positive correlation between reviews per month and time since the first review is highly unlikely within the time range of this data. The coffee reviews in this dataset are from 2017 to 2022, and while reviews per month do increase over time, which is a positive sign that people are reviewing more coffees, the correlation is statistically significant but only moderate in strength.
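The test above runs on one row per review (df = 2078), so each month is counted as many times as it has reviews. An alternative, not used in the report, is to correlate at the month level with one row per month. The sketch below compares the two on a small made-up set of dates (`toy`, `monthly`, and `row_level` are hypothetical names and values).

```r
# Hypothetical review months; one row per review
toy <- data.frame(
  review_month = as.Date(c(
    "2017-11-01", "2017-11-01",
    "2017-12-01",
    "2018-01-01", "2018-01-01", "2018-01-01",
    "2018-02-01", "2018-02-01", "2018-02-01", "2018-02-01"
  ))
)

# One row per month with its review count
counts <- table(toy$review_month)
monthly <- data.frame(
  review_month = as.Date(names(counts)),
  reviews_per_month = as.integer(counts)
)
# Approximate months since the first review, as in the report (days / 30)
monthly$time_months <-
  as.numeric(monthly$review_month - min(monthly$review_month)) / 30

# Row-level version: monthly values repeated once per review
row_level <- merge(toy, monthly, by = "review_month")

cor_rows <- cor(row_level$time_months, row_level$reviews_per_month)
cor_months <- cor(monthly$time_months, monthly$reviews_per_month)
c(row_level = cor_rows, month_level = cor_months)
```

The two coefficients generally differ because the row-level version gives busier months more weight; which one is appropriate depends on whether the unit of analysis is the review or the month.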
ggplot(coffee_clean, aes(x = time_months, y = reviews_per_month)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(
    title = "Reviews per Month Over Time",
    x = "Months since First Review (Nov 2017)",
    y = "Number of Reviews by Month"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The visual supports the correlation coefficient and the 95% confidence interval: there is a moderately positive relationship between ‘time_months’ and ‘reviews_per_month’. This reflects the slight increase in coffee reviews over time, which is good news for someone like myself who is analyzing this dataset, since it means a larger sample size to work with in the more recent years. Statistically speaking, a moderate correlation does not support a strong takeaway about the relationship between the original and engineered variable in this situation. All in all, that is okay, mostly because I am exploring variables that I have not examined much so far with this dataset, so I did not expect a strong correlation here.