1. Introduction

This data dive focuses on a time-based data analysis. I first convert ‘review_date’ to a date format in R by creating a new column. I then analyze a response variable over time, with the goal of detecting underlying trends in that variable. This coffee reviews dataset consists of dates from 2017 to 2023, so the analysis includes reviews within these years.
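
A minimal sketch of the date conversion described above. The raw format of ‘review_date’ is not shown in this report, so the month-year strings, the toy table coffee_raw, and the new column name review_date_parsed are assumptions for illustration only.

```r
library(dplyr)
library(lubridate)

# Hypothetical raw data: the real column format is not shown in this report
coffee_raw <- tibble(
  review_date = c("November 2017", "March 2019", "January 2023"),
  rating      = c(92, 94, 93)
)

# my() parses month-year strings and returns a Date set to the first of the month
coffee_parsed <- coffee_raw %>%
  mutate(review_date_parsed = my(review_date))

class(coffee_parsed$review_date_parsed)
```

If the real column uses a different layout (for example day-month-year), a different lubridate parser such as dmy() or mdy() would be the natural swap.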

2. Choose Response Variable

Response Variable: Rating over time

  1. Ratings reflect quality perception

  2. Trends in roasting, weather patterns in origin countries, or drift in consumer preferences can all impact rating

  3. Importantly, rating is continuous, interpretable, and clean

3. Create Tsibble

library(tsibble)
coffee_monthly_ts <- coffee_clean %>%
  mutate(month = floor_date(review_date, "month")) %>%
  group_by(month) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  as_tsibble(index = month)

In this tsibble, I chose to aggregate to monthly averages because a tsibble requires every index value to be unique. Individual review dates can repeat when several reviews share a date, whereas after grouping by month each month appears exactly once.
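
The uniqueness requirement can be checked directly. The sketch below uses a simulated stand-in for coffee_clean (coffee_demo is hypothetical) and, as an alternative to floor_date(), builds the index with tsibble's yearmonth() so that the tsibble can recognise the series as monthly. This is a variation on the code above, not the code that produced the results in this report.

```r
library(dplyr)
library(tsibble)

set.seed(1)
# Hypothetical stand-in for coffee_clean: 200 reviews on random days
coffee_demo <- tibble(
  review_date = as.Date("2017-01-01") + sample(0:700, 200, replace = TRUE),
  rating      = round(rnorm(200, mean = 93, sd = 1.5))
)

# Raw review dates repeat, so they cannot act as a tsibble index on their own
any(duplicated(coffee_demo$review_date))

# One row per month, indexed by a yearmonth value
monthly_demo <- coffee_demo %>%
  mutate(month = yearmonth(review_date)) %>%
  group_by(month) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  as_tsibble(index = month)

tsibble::interval(monthly_demo)  # interval detected by tsibble
has_gaps(monthly_demo)           # flags months with no reviews
```

Note that as.numeric() on a yearmonth index counts months rather than days, so a regression slope fitted on it would be per month, not per day.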

4. Plot Data over Time

ggplot(coffee_monthly_ts, aes(x = month, y = rating)) +
  geom_point(alpha = 0.3) +
  geom_line(alpha = 0.4) +
  labs(title = "Coffee Ratings Over Time",
       x = "Review Date",
       y = "Rating") +
  theme_minimal()

Plot Insights

  • A slight upward drift in ratings over time

  • No obvious seasonal pattern, although there are more spikes around the winter months

  • High variability month‑to‑month
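
One way to follow up on the winter-spike impression is to average the monthly ratings by calendar month. The sketch below runs on simulated data (monthly_demo is a hypothetical stand-in for coffee_monthly_ts), so its numbers say nothing about the coffee dataset itself.

```r
library(dplyr)
library(lubridate)

set.seed(7)
# Hypothetical monthly series: 5 years of simulated average ratings
monthly_demo <- tibble(
  month  = seq(as.Date("2017-01-01"), by = "month", length.out = 60),
  rating = 93 + rnorm(60, sd = 0.45)
)

# Mean and spread of the monthly averages for each calendar month
by_calendar_month <- monthly_demo %>%
  mutate(calendar_month = lubridate::month(month, label = TRUE)) %>%
  group_by(calendar_month) %>%
  summarise(
    mean_rating = mean(rating),
    sd_rating   = sd(rating),
    n           = n()
  )

by_calendar_month
```

With only a handful of observations per calendar month, differences in this table would need to be large relative to sd_rating before reading them as seasonality.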

5. Linear Regression Model

Goal: Detect any upward or downward trends

coffee_monthly_ts <- coffee_monthly_ts %>%
  mutate(time_num = as.numeric(month))

trend_mod <- lm(rating ~ time_num, data = coffee_monthly_ts)
summary(trend_mod)
## 
## Call:
## lm(formula = rating ~ time_num, data = coffee_monthly_ts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42642 -0.23490  0.06365  0.26562  1.04467 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.876e+01  2.060e+00  43.079   <2e-16 ***
## time_num    2.355e-04  1.120e-04   2.102   0.0398 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4689 on 59 degrees of freedom
## Multiple R-squared:  0.06966,    Adjusted R-squared:  0.05389 
## F-statistic: 4.418 on 1 and 59 DF,  p-value: 0.03985

Model Interpretation

  • Slope: 2.355 × 10⁻⁴

  • p-value: 0.0398

  • Adjusted R²: 0.0539

  • Residual SE: 0.469

Coefficient for ‘time_num’:

\[ \beta_1=0.0002355 \]

For each unit increase in ‘time_num’ (one day), the expected monthly average rating increases by 0.0002355 points.

Multiplying by 365 days gives an annual trend of about 0.086 points per year (0.0002355 × 365 ≈ 0.086). That is less than 1/10 of a point per year, which shows how minimal this trend really is.

Statistical Significance

  • p = 0.0398 → statistically significant at the 5% level

  • The trend is statistically detectable but very small in magnitude: time explains only about 7% of the variation in monthly ratings (R² = 0.0697)
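
A per-day slope is hard to read, so it can help to rescale it to a per-year figure together with its confidence interval. The sketch below does this on simulated data. The data frame demo, its start date, and the simulated slope are all hypothetical, and days_per_year = 365.25 is a convenience assumption.

```r
set.seed(42)
# Hypothetical monthly series standing in for coffee_monthly_ts
demo <- data.frame(
  month = seq(as.Date("2017-01-01"), by = "month", length.out = 61)
)
demo$time_num <- as.numeric(demo$month)  # days since 1970-01-01
demo$rating   <- 93 + 0.0002 * (demo$time_num - min(demo$time_num)) +
  rnorm(61, sd = 0.45)

mod <- lm(rating ~ time_num, data = demo)

# Rescale the per-day slope and its 95% interval to rating points per year
days_per_year  <- 365.25
slope_per_year <- coef(mod)[["time_num"]] * days_per_year
ci_per_year    <- confint(mod, "time_num", level = 0.95) * days_per_year

slope_per_year
ci_per_year
```

Applying the same two lines to trend_mod would give the annual drift and its interval for the coffee ratings.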

6. Data Smoothing

ggplot(coffee_monthly_ts, aes(month, rating)) +
  geom_line(alpha = 0.5) +
  geom_smooth(span = 0.3, color = "red") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Visual Insights

Smoothing captures the trend more effectively, although even with this, the line shows a gentle waviness as coffee ratings change over time. Just after 2020, there is a noticeable dip in ratings for a few months. I would be curious whether this was due to the supply chain shortages that resulted from the global pandemic. Aside from this, I am unable to detect any seasonal pattern of major significance.
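
To see what span = 0.3 does, the loess fit can be run directly with two span values. This is a sketch on simulated data (demo is hypothetical). The value 0.75 is the default span of loess(), and roughness() is an ad hoc helper defined here, not a standard function.

```r
set.seed(3)
# Hypothetical monthly series with a slow wave plus noise
demo <- data.frame(t = 1:61)
demo$rating <- 93 + 0.3 * sin(demo$t / 6) + rnorm(61, sd = 0.4)

# A smaller span uses fewer neighbouring points per local fit, so the curve bends more
fit_local   <- loess(rating ~ t, data = demo, span = 0.3)
fit_default <- loess(rating ~ t, data = demo, span = 0.75)

# Ad hoc roughness measure: spread of step-to-step changes in the fitted curve
roughness <- function(fit) sd(diff(fitted(fit)))

c(span_0.3 = roughness(fit_local), span_0.75 = roughness(fit_default))
```

A wavier curve at a small span can reflect noise as much as signal, which is one reason to treat dips in the smoothed line with some caution.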

ACF Check for Seasonality

acf(coffee_monthly_ts$rating)

ACF Insights: There are no significant spikes in the ACF plot above, and no correlation pattern repeats at seasonal lags. Based on the time plot, the loess smooth, and the ACF, I conclude that there is no significant seasonal pattern in coffee ratings from 2017 to 2022. The only trend I found is the small upward drift from the regression, so I have no major new insights to offer.

In the future, I am interested in investigating the price of coffee (100g_USD) over time to see how it fluctuates. Overall, understanding the time-based structure matters because it shows whether ratings are stable, fluctuate, or drift. If ratings inflate over time, comparisons between reviews from different periods could be biased.
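
The visual ACF check can be complemented with explicit numbers. The sketch below uses simulated white noise (rating_demo is hypothetical, not the coffee data), compares the autocorrelations at lags 12 and 24 with the approximate 95% bound, and runs a Ljung-Box test from base R as one possible formal check. The choice of 12 lags is an assumption suited to monthly data.

```r
set.seed(11)
# Hypothetical monthly ratings with no seasonality built in
rating_demo <- 93 + rnorm(61, sd = 0.45)

# Autocorrelations up to two years of lags, without drawing the plot
acf_demo <- acf(rating_demo, lag.max = 24, plot = FALSE)

# Approximate 95% bound for a white-noise series
bound <- qnorm(0.975) / sqrt(length(rating_demo))

# Element 1 of $acf is lag 0, so lag k sits at position k + 1
seasonal_lags   <- c(12, 24)
acf_at_seasonal <- acf_demo$acf[seasonal_lags + 1]
abs(acf_at_seasonal) > bound

# Ljung-Box test: joint test that the first 12 autocorrelations are zero
lb <- Box.test(rating_demo, lag = 12, type = "Ljung-Box")
lb$p.value
```

Running the same steps on coffee_monthly_ts$rating would show whether the lag-12 autocorrelation stays inside the bound, which is the specific sign of yearly seasonality to look for.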