Week 12 Time Analysis

This notebook answers the Week 12 time-series questions using the COVID dataset. The response variable is new_cases_smoothed_per_million because it changes over time and is easy to interpret.

Question 1

Which column encodes time, and how is it converted into a proper Date in R?

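# Packages used throughout the notebook.
library(dplyr)      # %>%, select, filter, group_by, summarise, mutate
library(tsibble)    # as_tsibble
library(ggplot2)    # plotting
library(lubridate)  # years, floor_date, year, month
library(knitr)      # kable

covid <- read.csv("covid_combined_groups.csv")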

covid$date <- as.Date(covid$date)

covid_time <- covid %>%
  select(date, new_cases_smoothed_per_million) %>%
  filter(!is.na(date), !is.na(new_cases_smoothed_per_million)) %>%
  group_by(date) %>%
  summarise(
    avg_new_cases_smoothed_per_million = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(date)

summary(covid_time)
##       date            avg_new_cases_smoothed_per_million
##  Min.   :2020-03-01   Min.   :  0.2921                  
##  1st Qu.:2020-08-15   1st Qu.: 38.0929                  
##  Median :2021-01-30   Median :132.3614                  
##  Mean   :2021-01-30   Mean   :130.4883                  
##  3rd Qu.:2021-07-16   3rd Qu.:191.3260                  
##  Max.   :2021-12-31   Max.   :422.0129

Answer: The time column is date. read.csv() loads it as text, so I convert it with as.Date(), which parses ISO-style strings (for example 2020-03-01) into a real Date. That matters because as_tsibble() needs a proper time index, and ggplot2 date axes and smoothing methods rely on the values being ordered dates rather than strings.

Because the dataset has many rows per day across countries, I average new_cases_smoothed_per_million by date so that each day appears once. That gives a clean daily time series.
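A minimal sketch of the two steps described above, using a small made-up data frame (the toy rows are hypothetical and are not taken from covid_combined_groups.csv). It passes an explicit format to as.Date() so that strings in another layout become NA and can be counted, and then checks that averaging by date leaves one row per day. The aggregation uses base R's aggregate() as a stand-in for the group_by() and summarise() pipeline.

```r
# Hypothetical toy rows standing in for the real CSV (two groups per day).
toy <- data.frame(
  date = c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"),
  new_cases_smoothed_per_million = c(0.2, 0.4, 1.0, 3.0),
  stringsAsFactors = FALSE
)

# An explicit format documents the expected layout; other layouts become NA.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")
n_bad_dates <- sum(is.na(toy$date))

# Base-R equivalent of group_by(date) followed by summarise(mean(...)).
toy_daily <- aggregate(
  new_cases_smoothed_per_million ~ date,
  data = toy,
  FUN = mean
)

# After aggregation every date should appear exactly once.
one_row_per_day <- !any(duplicated(toy_daily$date))
```

If n_bad_dates were greater than zero on the real file, the offending strings would need a different format before the series could be trusted.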

Create the tsibble

covid_tsibble <- covid_time %>%
  as_tsibble(index = date)

covid_tsibble
## # A tsibble: 671 x 2 [1D]
##    date       avg_new_cases_smoothed_per_million
##    <date>                                  <dbl>
##  1 2020-03-01                              0.292
##  2 2020-03-02                              0.292
##  3 2020-03-03                              0.292
##  4 2020-03-04                              0.292
##  5 2020-03-05                              0.292
##  6 2020-03-06                              0.292
##  7 2020-03-07                              0.292
##  8 2020-03-08                              1.40 
##  9 2020-03-09                              1.40 
## 10 2020-03-10                              1.40 
## # ℹ 661 more rows

The tsibble now contains one observation per day. That makes it possible to study the overall shape of the response over time.
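A daily index can still skip calendar days, so it is worth checking that "one observation per day" also means "no missing days". The sketch below shows the idea in base R on a hypothetical four-date vector, and then computes how many rows a gap-free series from 2020-03-01 to 2021-12-31 should contain, which can be compared with the 671 rows printed above.

```r
# Hypothetical daily index with one missing day (2020-03-04).
dates <- as.Date(c("2020-03-01", "2020-03-02", "2020-03-03", "2020-03-05"))

# Full calendar between the first and last observation.
full_calendar <- seq(min(dates), max(dates), by = "day")

# Calendar days that are absent from the data.
missing_days <- full_calendar[!(full_calendar %in% dates)]
has_gaps <- length(missing_days) > 0

# Number of calendar days in the real series' date range (inclusive).
expected_rows <- as.integer(as.Date("2021-12-31") - as.Date("2020-03-01")) + 1L
```

On an actual tsibble, the tsibble package's has_gaps() and fill_gaps() serve the same purpose.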

Plot the response over time

ggplot(covid_tsibble, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(linewidth = 0.5) +
  labs(
    title = "Average New COVID Cases Smoothed Per Million Over Time",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()

Interpretation:

  • The time series rises overall, but in waves rather than steadily. The daily average starts below 1 case per million in early March 2020, rises sharply toward late 2020, fluctuates through 2021, and surges again toward the end of the period (the series maximum is about 422).

  • These repeated rises and falls indicate multiple waves of infection rather than a steady increase.

Different windows of time

recent_cutoff <- max(covid_tsibble$date, na.rm = TRUE) - years(2)
recent_window <- covid_tsibble %>%
  filter(date >= recent_cutoff)

ggplot(recent_window, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(linewidth = 0.5) +
  labs(
    title = "Recent Two-Year Window of Average New Cases Smoothed Per Million",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()

Interpretation:

  • The series runs from 2020-03-01 to 2021-12-31, which is less than two years, so the two-year cutoff (2019-12-31) falls before the first observation and this "recent" window contains the full series. A shorter window, such as one year, would be needed to zoom in.

  • The plot still shows distinct peaks followed by declines, so cases rise and fall in waves. The most recent portion shows a strong upward surge, a new wave that exceeds earlier peaks. This pattern looks like successive waves rather than random noise.
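As a quick check on how much data a window keeps, the sketch below recomputes the cutoff from the first and last dates reported by summary(covid_time). It uses base R's seq() as a stand-in for max(date) - years(2); the one-year cutoff is only an illustrative alternative, not something the notebook uses.

```r
# Date range reported by summary(covid_time).
first_date <- as.Date("2020-03-01")
last_date  <- as.Date("2021-12-31")

# Step back two calendar years from the last date.
cutoff_2y <- seq(last_date, by = "-2 years", length.out = 2)[2]

# TRUE means the filter date >= cutoff keeps every row of the series.
window_is_full_series <- cutoff_2y <= first_date

# Illustrative alternative: a one-year cutoff and the days it would keep.
cutoff_1y <- seq(last_date, by = "-1 year", length.out = 2)[2]
days_in_1y_window <- as.integer(last_date - cutoff_1y) + 1L
```

Comparing the cutoff with min(date) before plotting is a cheap way to confirm that a "recent" window really is a subset.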

Question 2

Can linear regression detect an upward or downward trend over time, and how strong is that trend?

trend_data <- covid_tsibble %>%
  mutate(time_index = as.numeric(date - min(date)))

trend_model <- lm(avg_new_cases_smoothed_per_million ~ time_index, data = trend_data)
summary(trend_model)
## 
## Call:
## lm(formula = avg_new_cases_smoothed_per_million ~ time_index, 
##     data = trend_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.18  -34.83  -13.07   34.15  172.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.79394    4.84816    2.02   0.0438 *  
## time_index   0.36028    0.01253   28.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 62.86 on 669 degrees of freedom
## Multiple R-squared:  0.5528, Adjusted R-squared:  0.5521 
## F-statistic:   827 on 1 and 669 DF,  p-value: < 2.2e-16

ggplot(trend_data, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Average New Cases Smoothed Per Million With Linear Trend",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()

Answer: Yes.

Interpretation:

  • The slope is 0.36 cases per million per day (roughly 130 per year), so the fitted line points clearly upward. The p-value (< 2e-16) says a slope this large is very unlikely under a flat trend, although daily values are autocorrelated, which typically makes the reported standard error too optimistic.

  • The trend is moderate rather than complete: R-squared is 0.55, so a straight line explains about 55% of the variation. The rest comes from the recurring surges and declines, which is why the residuals range from about -120 to +173.

  • In general, a positive slope means the average response increases over time and a negative slope means it decreases. The p-value indicates whether the slope is distinguishable from zero, and R-squared shows how much of the variation a simple straight-line trend explains.
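To make the regression output easier to read, the short calculation below re-expresses the reported coefficients on longer time scales. The three numbers are copied from the summary(trend_model) output above; the variable names are only illustrative.

```r
# Coefficients copied from the summary(trend_model) output.
intercept <- 9.79394   # fitted value on day 0 (2020-03-01)
slope     <- 0.36028   # change in cases per million per day
r_squared <- 0.5528

# Re-express the daily slope on longer time scales.
slope_per_week <- slope * 7
slope_per_year <- slope * 365

# Fitted value on the last day: 671 daily rows means time_index runs 0 to 670.
fitted_last_day <- intercept + slope * 670

# Share of the variation the straight line leaves unexplained.
unexplained_share <- 1 - r_squared
```

The fitted value on the last day can be compared with the observed maximum of about 422 to see how far the final surge sits above the straight line.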

Question 3

Do I need to subset the data for multiple trends?

trend_periods <- trend_data %>%
  as_tibble() %>%
  mutate(
    period = case_when(
      date < as.Date("2021-01-01") ~ "2020",
      date < as.Date("2023-01-01") ~ "2021-2022",
      TRUE ~ "2023+"
    )
  )

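# Fit one linear model per period, then pull the statistics out of each fit.
period_models <- trend_periods %>%
  group_by(period) %>%
  summarise(
    model = list(lm(avg_new_cases_smoothed_per_million ~ time_index)),
    .groups = "drop"
  ) %>%
  mutate(
    slope = sapply(model, function(m) coef(m)[["time_index"]]),
    intercept = sapply(model, function(m) coef(m)[["(Intercept)"]]),
    r_squared = sapply(model, function(m) summary(m)$r.squared)
  ) %>%
  select(period, slope, intercept, r_squared)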

kable(head(period_models, 10), digits = 4,
      caption = "Trend slopes and R-squared values by period")
Trend slopes and R-squared values by period

period      slope    intercept   r_squared
2020        0.7382   -40.3982    0.7265
2021-2022   0.3260    20.3116    0.1950

ggplot(trend_periods, aes(x = date, y = avg_new_cases_smoothed_per_million, color = period)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Trend of COVID Cases by Time Period",
    x = "Date",
    y = "Average New Cases (Smoothed per Million)"
  ) +
  theme_minimal()

Answer: Yes. If the slope changes a lot across periods, a single global trend is not enough; the series has multiple phases and should be analyzed in smaller time blocks. This matters because major events can change the direction of the series.

Interpretation:

  • The trend differs across periods. In 2020 the slope is 0.74 cases per million per day with R-squared 0.73, a steep and fairly linear rise during the initial phase of the pandemic. In the second period the slope drops to 0.33 and R-squared to 0.20, so the line still points upward but describes the data poorly because of repeated waves of increases and decreases.

  • The data end on 2021-12-31, so the period labeled "2021-2022" contains only 2021, and the "2023+" branch of case_when() matches no rows. That is why the table has two rows.

  • The 2020 slope is more than twice the later slope, which confirms that the rate of change is not constant over time. A single linear model would not adequately capture this behavior; subsetting the data into periods represents the changing trends more accurately.
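One way to put a number on "the slopes differ" is to fit a single model with a time-by-period interaction instead of comparing separate fits by eye. The sketch below does this on simulated data with made-up slopes (0.8 and 0.3), so every value in it is illustrative; it is not a re-analysis of the COVID series. Because real daily data are autocorrelated, the p-value from such a model should be read as optimistic.

```r
set.seed(1)

# Simulated two-phase series: steeper slope in phase "A" than in phase "B".
# All numbers here are made up for illustration.
time_index <- 0:199
period <- factor(ifelse(time_index < 100, "A", "B"))
true_slope <- ifelse(period == "A", 0.8, 0.3)
y <- 5 + true_slope * time_index + rnorm(200, sd = 2)
sim <- data.frame(time_index, period, y)

# The time_index:periodB term estimates how much the slope in B
# differs from the slope in A.
fit <- lm(y ~ time_index * period, data = sim)
slope_A <- coef(fit)[["time_index"]]
slope_diff <- coef(fit)[["time_index:periodB"]]
slope_B <- slope_A + slope_diff

# p-value for the hypothesis that both periods share one slope.
p_diff <- coef(summary(fit))["time_index:periodB", "Pr(>|t|)"]
```

Applied to trend_periods, the same formula would use avg_new_cases_smoothed_per_million as the response.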

Question 4

Can smoothing detect at least one season in the data, and what does that season mean?

# Convert to a plain tibble first: summarise() on a tsibble keeps the daily
# index, so the rows would not collapse to one per month.
monthly_data <- covid_tsibble %>%
  as_tibble() %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(monthly_avg = mean(avg_new_cases_smoothed_per_million, na.rm = TRUE), .groups = "drop") %>%
  arrange(month)

monthly_ts <- ts(
  monthly_data$monthly_avg,
  frequency = 12,
  start = c(year(min(monthly_data$month)), month(min(monthly_data$month)))
)

# 12-month moving average; the ends are NA where the window is incomplete.
smoothed_monthly <- stats::filter(monthly_ts, rep(1 / 12, 12), sides = 2)

smooth_df <- tibble(
  month = monthly_data$month,
  raw = as.numeric(monthly_ts),
  smoothed = as.numeric(smoothed_monthly)
)

ggplot(smooth_df, aes(x = month)) +
  geom_line(aes(y = raw), alpha = 0.5) +
  geom_line(aes(y = smoothed), linewidth = 1) +
  labs(
    title = "Monthly Average New Cases Smoothed Per Million",
    subtitle = "Raw series with a 12-month moving average",
    x = "Month",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()

# stl() needs more than two full periods (more than 24 monthly values),
# so only run the decomposition when the series is long enough.
if (length(monthly_ts) > 2 * frequency(monthly_ts)) {
  stl_fit <- stl(monthly_ts, s.window = "periodic")
  plot(stl_fit)
}

Answer: Not reliably with this much data.

Interpretation:

  • The monthly series covers March 2020 to December 2021, which is 22 values and less than two full years. A yearly season can only be identified when each calendar month is observed several times, so this series is too short to establish one.

  • The 12-month moving average is defined only for the middle 11 months, because the window is incomplete at both ends. It averages out within-year variation, so what remains reflects the long-term level rather than any season.

  • STL decomposition splits a series into trend, seasonal, and remainder components, but stl() requires more than two complete periods (more than 24 monthly values at frequency = 12), so it cannot be applied to this series. In addition, s.window = "periodic" forces the seasonal component to repeat identically every year, so a regular seasonal panel would not by itself confirm seasonality.

  • For this COVID series, any seasonal effect would likely be weaker than the big trend and wave patterns.
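The sketch below illustrates how series length affects stl(), using simulated monthly data with a known yearly cycle (all values are made up). It tries a 22-month series, the length of a monthly series from March 2020 to December 2021, and a 48-month series, and then checks what s.window = "periodic" does to the seasonal component. make_monthly() is a hypothetical helper defined only for this example.

```r
set.seed(42)

# Hypothetical helper: simulated monthly series with trend plus yearly cycle.
make_monthly <- function(n_months) {
  t <- seq_len(n_months)
  values <- 50 + 0.5 * t + 10 * sin(2 * pi * t / 12) + rnorm(n_months)
  ts(values, frequency = 12, start = c(2020, 3))
}

# 22 months: fewer than two full yearly cycles, so stl() refuses to run.
short_series <- make_monthly(22)
short_result <- tryCatch(
  stl(short_series, s.window = "periodic"),
  error = function(e) conditionMessage(e)
)

# 48 months: four full yearly cycles, enough for stl() to run.
long_series <- make_monthly(48)
long_fit <- stl(long_series, s.window = "periodic")

# With s.window = "periodic" the seasonal component repeats exactly each year.
seasonal <- long_fit$time.series[, "seasonal"]
repeats_exactly <- isTRUE(all.equal(
  as.numeric(seasonal[1:12]), as.numeric(seasonal[13:24])
))
```

Here short_result holds the error message instead of a fit, and repeats_exactly shows that a perfectly regular seasonal panel is produced by construction rather than discovered in the data.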

Question 5

Can I illustrate the seasonality using ACF or PACF?

acf(monthly_ts, lag.max = 36, main = "ACF of Monthly Average New Cases Smoothed Per Million")

pacf(monthly_ts, lag.max = 36, main = "PACF of Monthly Average New Cases Smoothed Per Million")

Answer: Only partly. The plots show strong persistence, but they do not clearly isolate a seasonal pattern.

Interpretation:

  • The ACF shows strong positive correlations across many lags, which decrease gradually. This slow decay means current values depend heavily on past values, which is the signature of trend and persistence. It is not by itself evidence of seasonality: a yearly season would appear as a distinct peak at lag 12 (labeled 1.0 on the axis, because acf() reports lags in years for a ts with frequency = 12).

  • A monthly series from March 2020 to December 2021 has 22 values, so acf() can report at most 21 lags regardless of lag.max = 36, and the estimate at lag 12 would rest on only 10 pairs of months.

  • The PACF shows a very strong spike at lag 1, followed by much smaller values. The most recent time point has the strongest direct influence on the current value, and once it is accounted for, additional lags contribute little new information.

  • Together, the plots confirm strong short-term dependence, but they do not confirm that the series repeats across seasons.
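To show what each pattern looks like in an ACF, the sketch below compares two simulated monthly series: one with only a trend and one with only a 12-month cycle. Both series and the acf_at() helper are made up for illustration. A trend gives positive autocorrelations that shrink slowly, while a yearly season gives a negative value half a cycle away and a positive peak at lag 12.

```r
set.seed(7)
n <- 120
t <- seq_len(n)

# Made-up series: one with only a trend, one with only a 12-month cycle.
trend_only    <- ts(0.5 * t + rnorm(n), frequency = 12)
seasonal_only <- ts(10 * sin(2 * pi * t / 12) + rnorm(n), frequency = 12)

# Hypothetical helper: acf() returns lag 0 first, so lag k is at position k + 1.
acf_at <- function(x, k) {
  as.numeric(acf(x, lag.max = 36, plot = FALSE)$acf[k + 1])
}

# Trend: positive at both lags, shrinking slowly.
trend_lags <- c(acf_at(trend_only, 6), acf_at(trend_only, 12))

# Seasonality: negative half a cycle away, positive again one full cycle away.
seasonal_lags <- c(acf_at(seasonal_only, 6), acf_at(seasonal_only, 12))
```

Comparing trend_lags with seasonal_lags makes the difference concrete: only the seasonal series changes sign between lag 6 and lag 12.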

Key insights

  • date is the time column and it converts cleanly into a Date object in R.
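  • The response rises from below 1 to a maximum of about 422 cases per million between March 2020 and December 2021, so time is an important dimension of the data.
  • Linear regression gives a basic trend summary (slope 0.36 per day, R-squared 0.55), but the slope differs by period (0.74 in 2020 versus 0.33 afterward), so one line does not capture all phases of the series.
  • Smoothing and STL decomposition can reveal seasonal structure only when the series spans several full cycles; 22 months is too short to establish a yearly season.
  • ACF and PACF show strong persistence; a yearly pattern would need a distinct ACF peak at lag 12.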

Further questions

  • Do different countries show different trends if analyzed separately?
  • Does seasonality become stronger or weaker after vaccination rollout?
  • Would country-level time series show clearer seasonal structure than the aggregated daily series?