This data dive focuses on a time-based analysis. I first convert ‘review_date’ to a date format in R by creating a new column, then analyze a response variable over time, with the goal of detecting underlying trends in that variable. The coffee reviews dataset contains dates from 2017 to 2023, so the analysis covers reviews within these years.
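The conversion step is not shown in this write-up, so here is a minimal sketch of one way it could look. The toy data, the raw column name `review_date_raw`, and the "Month YYYY" string format are all assumptions for illustration; the real raw format may differ.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the raw reviews; the "Month YYYY" format is an assumption.
coffee_raw <- tibble(
  review_date_raw = c("November 2017", "March 2020", "January 2023"),
  rating = c(92, 93, 94)
)

# Parse the strings into a new Date column (first day of each month),
# keeping the original column untouched.
coffee_clean <- coffee_raw %>%
  mutate(review_date = my(review_date_raw))

class(coffee_clean$review_date)
```

If the raw strings use a different layout, another lubridate parser such as `mdy()` or `ymd()` would take the place of `my()`.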
Response Variable: Rating over time
Ratings reflect quality perception
Trends in roasting, weather patterns in origin countries, or drift in consumer preferences can all impact rating
Importantly, rating is continuous, interpretable, and clean
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.5.2
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
coffee_monthly_ts <- coffee_clean %>%
  mutate(month = floor_date(review_date, "month")) %>%
  group_by(month) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  as_tsibble(index = month)
I aggregated to monthly averages because a tsibble requires a unique index. Raw review dates can repeat across reviews, whereas each month appears only once after grouping, so the monthly series converts cleanly.
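The uniqueness point can be checked directly. The sketch below uses hypothetical toy data (not the coffee dataset) in which two reviews share a date, and compares the index before and after the same monthly aggregation.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: two reviews share a date, so the raw index is not unique.
toy <- tibble(
  review_date = as.Date(c("2020-01-05", "2020-01-05", "2020-01-20", "2020-02-10")),
  rating = c(92, 94, 93, 95)
)

# Non-zero result means at least one repeated date in the raw data.
anyDuplicated(toy$review_date)

# Same aggregation as above: one row per month, mean rating per month.
toy_monthly <- toy %>%
  mutate(month = floor_date(review_date, "month")) %>%
  group_by(month) %>%
  summarise(rating = mean(rating, na.rm = TRUE))

# Zero means every month is unique, which is what as_tsibble(index = month) needs.
anyDuplicated(toy_monthly$month)
```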
ggplot(coffee_monthly_ts, aes(x = month, y = rating)) +
  geom_point(alpha = 0.3) +
  geom_line(alpha = 0.4) +
  labs(title = "Coffee Ratings Over Time",
       x = "Review Date",
       y = "Rating") +
  theme_minimal()
A slight upward drift in ratings over time
No obvious seasonal pattern, although there are more spikes around the winter months
High variability month‑to‑month
Goal: Detect any upward or downward trends
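# as.numeric() on a Date gives days since 1970-01-01, so the slope is per day
coffee_monthly_ts <- coffee_monthly_ts %>%
  mutate(time_num = as.numeric(month))
# Simple linear trend of monthly mean rating on time
trend_mod <- lm(rating ~ time_num, data = coffee_monthly_ts)
summary(trend_mod)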
##
## Call:
## lm(formula = rating ~ time_num, data = coffee_monthly_ts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42642 -0.23490 0.06365 0.26562 1.04467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.876e+01 2.060e+00 43.079 <2e-16 ***
## time_num 2.355e-04 1.120e-04 2.102 0.0398 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4689 on 59 degrees of freedom
## Multiple R-squared: 0.06966, Adjusted R-squared: 0.05389
## F-statistic: 4.418 on 1 and 59 DF, p-value: 0.03985
Slope: \(2.355 \times 10^{-4}\)
p-value: 0.0398
Adjusted R²: 0.0539
Residual SE: 0.469
Coefficient for ‘time_num’:
\[ \beta_1=0.0002355 \]
For each unit increase in ‘time_num’ (one day), the expected monthly average rating increases by 0.0002355 points.
Even after multiplying this number by 365 days to evaluate the annual trend, it comes to about 0.086 points per year, less than a tenth of a point, which shows how small this trend really is.
p = 0.0398 → statistically significant at the 5% level
The trend is statistically detectable, but very small in magnitude: time explains only about 7% of the variance in monthly ratings (R² = 0.0697)
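To put the slope on a more readable scale, the arithmetic can be sketched in R using only the numbers printed in the model summary. The confidence interval here is an addition for illustration, computed from the rounded estimate and standard error, so treat it as approximate.

```r
# Values copied from the summary(trend_mod) output above.
slope_per_day <- 2.355e-04   # estimate for time_num (rating points per day)
se_per_day    <- 1.120e-04   # its standard error
df_resid      <- 59          # residual degrees of freedom

# Convert the daily slope to a yearly one (365 days, as in the text).
slope_per_year <- slope_per_day * 365

# Approximate 95% confidence interval for the slope, rescaled to points per year.
t_crit <- qt(0.975, df_resid)
ci_per_year <- (slope_per_day + c(-1, 1) * t_crit * se_per_day) * 365

round(slope_per_year, 3)
round(ci_per_year, 3)
```

The interval stays above zero but runs from close to zero up to under 0.2 points per year, which fits the reading that the trend is detectable but small. With the fitted model in memory, `confint(trend_mod)` would give the unrounded per-day interval.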
ggplot(coffee_monthly_ts, aes(month, rating)) +
  geom_line(alpha = 0.5) +
  geom_smooth(span = 0.3, color = "red") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The LOESS smoother captures the trend more flexibly than a straight line, although even with smoothing the curve shows a gentle waviness in coffee ratings over time. Just after 2020, ratings dip noticeably for a few months. I would be curious whether this was related to the supply chain shortages caused by the global pandemic. Aside from this, I do not see any meaningful seasonal pattern.
ACF to Check for Seasonality
acf(coffee_monthly_ts$rating)
ACF Insights: There are no significant spikes in the ACF plot above, and no repeating seasonal cycles appear. Based on the linear trend model, the LOESS smooth, and the ACF, I conclude that there are no significant seasonal patterns in coffee ratings from 2017 to 2022. As a result, I have no major new insights to offer here, although I am highly interested in investigating the price of coffee (100g_USD) over time in the future to see how it fluctuates. Overall, understanding the time-based structure matters because it shows whether ratings are stable, fluctuate, or drift; if ratings inflate over time, comparisons between reviews from different years could be biased.
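For reference, "significant" in an ACF plot usually means a bar that crosses the dashed bounds, which for white noise sit at roughly ±1.96/√n. The sketch below computes that bound and flags any lags outside it. It runs on a simulated noise series, not the coffee data; the series length of 61 is inferred from the 59 residual degrees of freedom plus two fitted coefficients.

```r
set.seed(42)  # arbitrary seed, only to make the toy series reproducible

# Hypothetical stand-in for the monthly series: 61 values of pure noise around 93.
n <- 61
toy_rating <- 93 + rnorm(n, sd = 0.5)

# Autocorrelations without plotting; drop lag 0, which is always 1.
toy_acf <- acf(toy_rating, lag.max = 24, plot = FALSE)
r <- as.numeric(toy_acf$acf)[-1]

# Approximate 95% bounds for white noise: +/- 1.96 / sqrt(n).
bound <- qnorm(0.975) / sqrt(n)

# Lags whose autocorrelation falls outside the bounds (a few can occur by chance).
which(abs(r) > bound)
```

With 61 monthly values the bound is about 0.25, so only autocorrelations larger than that in absolute value would count as spikes. A seasonal series would typically show them at lags 12 and 24.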