This notebook answers the Week 12 time-series questions using the
COVID dataset. The response variable is
new_cases_smoothed_per_million because it changes over time
and is easy to interpret.
Which column encodes time, and how is it converted into a proper Date in R?
library(dplyr)      # pipes and data manipulation
library(ggplot2)    # plots
library(tsibble)    # tidy time-series objects
library(lubridate)  # years(), floor_date(), year(), month()
library(knitr)      # kable()

covid <- read.csv("covid_combined_groups.csv")
covid$date <- as.Date(covid$date)
covid_time <- covid %>%
  select(date, new_cases_smoothed_per_million) %>%
  filter(!is.na(date), !is.na(new_cases_smoothed_per_million)) %>%
  group_by(date) %>%
  summarise(
    avg_new_cases_smoothed_per_million = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(date)
summary(covid_time)
## date avg_new_cases_smoothed_per_million
## Min. :2020-03-01 Min. : 0.2921
## 1st Qu.:2020-08-15 1st Qu.: 38.0929
## Median :2021-01-30 Median :132.3614
## Mean :2021-01-30 Mean :130.4883
## 3rd Qu.:2021-07-16 3rd Qu.:191.3260
## Max. :2021-12-31 Max. :422.0129
Answer: The time column is date. I
convert it with as.Date() so R recognizes it as a real date
instead of text. That matters because time-series tools such as
tsibble, ggplot, and smoothing methods work
best when the index is stored as a proper Date object.
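As a small illustration of why the conversion matters, the base-R sketch below parses a few made-up date strings and shows that date arithmetic only works after as.Date(). The sample strings are assumptions for illustration, not rows from the dataset.

```r
# Made-up ISO-formatted strings, as read.csv() would deliver them
raw_dates <- c("2020-03-01", "2020-03-02", "2020-03-10")
class(raw_dates)               # "character": no date arithmetic possible yet

parsed <- as.Date(raw_dates)   # ISO yyyy-mm-dd parses without a format string
class(parsed)                  # "Date"

# Differences are now measured in days
gaps <- as.numeric(diff(parsed))
gaps

# Non-ISO strings need an explicit format
us_style <- as.Date("03/01/2020", format = "%m/%d/%Y")
us_style == parsed[1]
```

If the file stored dates in a different layout, the format argument shown in the last step would be the place to describe it.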
Because the dataset has many rows per day across countries, I average
new_cases_smoothed_per_million by date so that each day
appears once. That gives a clean daily time series.
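The step from the averaged data frame to a tsibble is not shown in the rendered output, so here is a minimal sketch of how it might look on toy data. The location column and the toy values are hypothetical; only as_tsibble() with index = date reflects the structure printed in this notebook.

```r
library(dplyr)
library(tsibble)

# Toy data: two hypothetical locations observed on the same three days
toy <- data.frame(
  location = rep(c("A", "B"), each = 3),
  date = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), times = 2)),
  new_cases_smoothed_per_million = c(1, 2, 3, 3, 4, 5)
)

# Collapse to one row per day by averaging across locations
toy_daily <- toy %>%
  group_by(date) %>%
  summarise(avg = mean(new_cases_smoothed_per_million, na.rm = TRUE), .groups = "drop")

# A tsibble needs a unique index; after averaging, date is unique
toy_ts <- as_tsibble(toy_daily, index = date)
toy_ts
```

Without the averaging step, as_tsibble() would need a key column (for example the location) because the same date would appear more than once.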
## # A tsibble: 671 x 2 [1D]
## date avg_new_cases_smoothed_per_million
## <date> <dbl>
## 1 2020-03-01 0.292
## 2 2020-03-02 0.292
## 3 2020-03-03 0.292
## 4 2020-03-04 0.292
## 5 2020-03-05 0.292
## 6 2020-03-06 0.292
## 7 2020-03-07 0.292
## 8 2020-03-08 1.40
## 9 2020-03-09 1.40
## 10 2020-03-10 1.40
## # ℹ 661 more rows
The tsibble now contains one observation per day. That makes it possible to study the overall shape of the response over time.
ggplot(covid_tsibble, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(linewidth = 0.5) +
  labs(
    title = "Average New COVID Cases Smoothed Per Million Over Time",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()
Answer: Interpretation:
The time series shows a clear upward trend in average new COVID cases over time. Cases start at very low levels in early 2020, rise sharply toward late 2020, fluctuate through 2021, and then increase again significantly toward the end of the period. These repeated rises and falls indicate multiple waves of infection rather than a steady increase.
This plot shows the overall time pattern in the response.
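recent_cutoff <- max(covid_tsibble$date, na.rm = TRUE) - years(2)
recent_window <- covid_tsibble %>%
  filter(date >= recent_cutoff)
ggplot(recent_window, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(linewidth = 0.5) +
  labs(
    title = "Recent Two-Year Window of Average New Cases Smoothed Per Million",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()
Answer: Interpretation:
The series runs from 2020-03-01 to 2021-12-31, which is less than two years, so the cutoff (2019-12-31) falls before the first observation. The filter therefore keeps all 671 days, and this plot shows the same data as the full-series plot.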
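The cutoff arithmetic for the two-year window can be checked directly. This sketch takes the start and end dates from the summary(covid_time) output above and uses base R's seq() in place of lubridate's years(), purely so that it runs without extra packages.

```r
# First and last dates reported by summary(covid_time)
series_start <- as.Date("2020-03-01")
series_end   <- as.Date("2021-12-31")

# Step back two calendar years from the last date
cutoff <- seq(series_end, by = "-2 years", length.out = 2)[2]
cutoff

# Does the cutoff fall before the first observation?
cutoff_before_start <- cutoff < series_start
cutoff_before_start

# Number of daily observations in the full span (inclusive of both ends)
n_days <- as.numeric(series_end - series_start) + 1
n_days
```

When the cutoff precedes the first date, filter(date >= recent_cutoff) drops nothing; a shorter window (for example one year) would be needed to zoom in on the most recent wave.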
Can linear regression detect an upward or downward trend over time, and how strong is that trend?
trend_data <- covid_tsibble %>%
  mutate(time_index = as.numeric(date - min(date)))
trend_model <- lm(avg_new_cases_smoothed_per_million ~ time_index, data = trend_data)
summary(trend_model)
##
## Call:
## lm(formula = avg_new_cases_smoothed_per_million ~ time_index,
## data = trend_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.18 -34.83 -13.07 34.15 172.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.79394 4.84816 2.02 0.0438 *
## time_index 0.36028 0.01253 28.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.86 on 669 degrees of freedom
## Multiple R-squared: 0.5528, Adjusted R-squared: 0.5521
## F-statistic: 827 on 1 and 669 DF, p-value: < 2.2e-16
ggplot(trend_data, aes(x = date, y = avg_new_cases_smoothed_per_million)) +
  geom_line(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Average New Cases Smoothed Per Million With Linear Trend",
    x = "Date",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()
Answer: Interpretation:
The linear regression line shows a clear overall upward trend in average new cases over time. The fitted slope is about 0.36 cases per million per day (p < 2e-16), and the straight line explains about 55% of the variation (R-squared = 0.55). However, the actual data fluctuates widely around this line (residual standard error of about 63), so while cases are generally increasing, the trend alone does not capture the short-term waves. The trend is real, but variability is also high because of recurring surges and declines.
The slope from the regression tells me whether the response tends to rise or fall as time passes.
A positive slope means the average response increases over time, while a negative slope means it decreases.
The p-value tells me whether that slope is statistically distinguishable from zero.
The R-squared tells me how much of the variation is explained by a simple straight-line trend.
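To make the three quantities concrete, the sketch below fits the same kind of model to simulated data and pulls out the slope, its p-value, and R-squared by name. The simulated series and its true slope of 0.36 are assumptions chosen to resemble the fit above; they are not the COVID data.

```r
set.seed(1)

# Simulated daily series: straight-line trend plus noise (illustrative only)
time_index <- 0:199
y <- 10 + 0.36 * time_index + rnorm(200, sd = 20)

fit <- lm(y ~ time_index)
fit_summary <- summary(fit)

# Slope: expected change in y per one-day step in time_index
slope <- unname(coef(fit)["time_index"])

# p-value for the test that the slope equals zero
p_value <- fit_summary$coefficients["time_index", "Pr(>|t|)"]

# Share of the variance in y explained by the straight line
r_squared <- fit_summary$r.squared

c(slope = slope, p_value = p_value, r_squared = r_squared)
```

A daily slope is easier to read after rescaling: the notebook's estimate of 0.36028 per day corresponds to roughly 10.8 additional cases per million over 30 days.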
Do I need to subset the data for multiple trends?
trend_periods <- trend_data %>%
  as_tibble() %>%
  mutate(
    period = case_when(
      date < as.Date("2021-01-01") ~ "2020",
      date < as.Date("2023-01-01") ~ "2021-2022",
      TRUE ~ "2023+"
    )
  )
period_models <- trend_periods %>%
  group_by(period) %>%
  summarise(
    slope = coef(lm(avg_new_cases_smoothed_per_million ~ time_index))[2],
    intercept = coef(lm(avg_new_cases_smoothed_per_million ~ time_index))[1],
    r_squared = summary(lm(avg_new_cases_smoothed_per_million ~ time_index))$r.squared,
    .groups = "drop"
  )
kable(head(period_models, 10), digits = 4,
      caption = "Trend slopes and R-squared values by period")

| period | slope | intercept | r_squared |
|---|---|---|---|
| 2020 | 0.7382 | -40.3982 | 0.7265 |
| 2021-2022 | 0.3260 | 20.3116 | 0.1950 |
ggplot(trend_periods, aes(x = date, y = avg_new_cases_smoothed_per_million, color = period)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Trend of COVID Cases by Time Period",
    x = "Date",
    y = "Average New Cases (Smoothed per Million)"
  ) +
  theme_minimal()
Answer: If the slope changes a lot across periods, then a single global trend is not enough. That means the data likely has multiple phases and should be analyzed in smaller time blocks. This is important because major events can change the direction of the series.
Interpretation:
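The plot shows that the trend in COVID cases differs across time periods. In 2020, there is a steep upward trend (slope of about 0.74 per day, R-squared = 0.73), indicating a rapid increase in cases during the initial phase of the pandemic. In contrast, the period labeled 2021-2022 shows a more moderate upward trend (slope of about 0.33 per day) with large fluctuations around the line (R-squared = 0.20), reflecting multiple waves of increases and decreases. Because the series ends on 2021-12-31, that label covers 2021 only, and no rows fall into the "2023+" group.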
The difference in slopes between the two periods confirms that the rate of change in cases is not constant over time. This demonstrates that a single linear model would not adequately capture the behavior of the data. Subsetting the data into periods allows for a more accurate representation of these changing trends.
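The per-period comparison can also be sketched with base R's split() and lapply(), which fits each model once instead of refitting it for every extracted statistic. The toy series below, with a steep early phase and a flatter late phase, is an assumption made for illustration.

```r
set.seed(42)

# Toy series with two phases: steep early growth, then slower growth
toy <- data.frame(time_index = 0:199)
toy$period <- ifelse(toy$time_index < 100, "early", "late")
toy$y <- ifelse(
  toy$period == "early",
  0.8 * toy$time_index,
  80 + 0.2 * (toy$time_index - 100)
) + rnorm(200, sd = 5)

# One linear model per period, each fitted exactly once
fits <- lapply(split(toy, toy$period), function(d) lm(y ~ time_index, data = d))

# Extract slope and R-squared from the stored fits
slopes <- sapply(fits, function(m) unname(coef(m)["time_index"]))
r_squared <- sapply(fits, function(m) summary(m)$r.squared)

rbind(slope = slopes, r_squared = r_squared)
```

A large gap between the per-period slopes, as in this toy case, is the signal that one global line averages over phases that behave differently.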
Can smoothing detect at least one season in the data, and what does that season mean?
monthly_data <- covid_tsibble %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(monthly_avg = mean(avg_new_cases_smoothed_per_million, na.rm = TRUE), .groups = "drop") %>%
  arrange(month)
monthly_ts <- ts(
  monthly_data$monthly_avg,
  frequency = 12,
  start = c(year(min(monthly_data$month)), month(min(monthly_data$month)))
)
smoothed_monthly <- stats::filter(monthly_ts, rep(1 / 12, 12), sides = 2)
smooth_df <- tibble(
  month = monthly_data$month,
  raw = as.numeric(monthly_ts),
  smoothed = as.numeric(smoothed_monthly)
)
ggplot(smooth_df, aes(x = month)) +
  geom_line(aes(y = raw), alpha = 0.5) +
  geom_line(aes(y = smoothed), linewidth = 1) +
  labs(
    title = "Monthly Average New Cases Smoothed Per Million",
    subtitle = "Raw series with a 12-month moving average",
    x = "Month",
    y = "Average new cases smoothed per million"
  ) +
  theme_minimal()
Answer: Interpretation:
The decomposition separates the data into three components:
Trend: There is a clear long-term upward movement, especially toward the end of the time period, confirming sustained growth in cases.
Seasonal: There is a consistent repeating pattern, indicating regular cycles in case counts. This confirms the presence of seasonality in the data.
Remainder: The irregular component shows random fluctuations and spikes, capturing unexpected events or shocks not explained by trend or seasonality.
Overall, this shows that the data is driven by both long-term growth and repeating cyclical patterns.
The moving average and STL decomposition help me see whether the data has a repeating seasonal pattern.
For this COVID series, any seasonal effect would likely be weaker than the big trend and wave patterns.
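The decomposition code is not shown in the rendered output, so the sketch below illustrates the idea on a synthetic monthly series using only base R. Everything about the toy series (48 months, a sine-shaped season, a linear trend) is an assumption. One practical point: stats::stl() needs more than two full periods, and the monthly COVID series here has only 22 months, which is short for frequency = 12.

```r
set.seed(7)

# Synthetic monthly series: linear trend + 12-month season + noise
n_months <- 48
month_id <- seq_len(n_months)
seasonal <- 10 * sin(2 * pi * month_id / 12)
trend <- 0.5 * month_id
toy_ts <- ts(trend + seasonal + rnorm(n_months, sd = 1),
             frequency = 12, start = c(2018, 1))

# Centered 2x12 moving average: half weight on the two end months,
# so the 13-term window is symmetric around the current month
weights <- c(0.5, rep(1, 11), 0.5) / 12
smoothed <- stats::filter(toy_ts, weights, sides = 2)

# STL decomposition into seasonal, trend, and remainder components
decomp <- stl(toy_ts, s.window = "periodic")
seasonal_range <- range(decomp$time.series[, "seasonal"])
seasonal_range
```

With s.window = "periodic" the seasonal component repeats identically every year by construction, so a repeating pattern in that panel is only convincing when the series covers several full cycles.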
Can I illustrate the seasonality using ACF or PACF?
Answer: Yes.
Interpretation:
The ACF shows strong positive correlations across many lags, which gradually decrease as the lag grows. This indicates that current values are highly dependent on past values. The slow decay points to a strong trend component and persistence in the data. Seasonality would appear as distinct peaks at the seasonal lags (for monthly data, lag 12 and its multiples); high correlations at every lag are by themselves weaker evidence of a repeating season than of trend.
The ACF shows whether values from one month are correlated with later months.
Interpretation:
The PACF shows a very strong spike at lag 1, followed by much smaller values. This indicates that the most recent time point has the strongest direct influence on current values. After accounting for this, additional lags contribute little new information. This suggests the data has strong short-term dependence combined with longer-term structure captured in the ACF.
Together, these plots help confirm whether the series repeats across seasons.
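The ACF and PACF code is not shown in the rendered output, so here is a minimal base-R sketch of how the two can be computed and read. It uses a simulated AR(1) series, which is an assumption chosen because it reproduces the pattern described above: a slowly decaying ACF and a single large PACF spike at lag 1.

```r
set.seed(123)

# Simulated persistent series: each value depends strongly on the previous one
x <- arima.sim(model = list(ar = 0.9), n = 500)

# plot = FALSE returns the numbers instead of drawing the correlogram
acf_vals  <- acf(x, lag.max = 24, plot = FALSE)
pacf_vals <- pacf(x, lag.max = 24, plot = FALSE)

# acf() starts at lag 0 (always 1); pacf() starts at lag 1
acf_lag0  <- acf_vals$acf[1]
acf_lag1  <- acf_vals$acf[2]
pacf_lag1 <- pacf_vals$acf[1]
pacf_lag5 <- pacf_vals$acf[5]

c(acf_lag1 = acf_lag1, pacf_lag1 = pacf_lag1, pacf_lag5 = pacf_lag5)
```

Dropping plot = FALSE draws the usual correlograms with their confidence bands; for a monthly series, a seasonal pattern would show up as a bump near lag 12 rather than as a smooth decay.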
date is the time column and it converts cleanly into a
Date object in R.