Summary of Analysis

The time series model used to forecast daily COVID-19 cases in California is a seasonal ARIMA model, ARIMA(2,2,1)(1,0,2)[7], selected automatically by the ARIMA() function in the fable package. The non-seasonal component applies second-order differencing (d = 2) to achieve stationarity and contains two autoregressive terms (AR1 and AR2) and one moving average term (MA1). The seasonal component uses no seasonal differencing (D = 0) and contains one seasonal autoregressive term (SAR1) and two seasonal moving average terms (SMA1 and SMA2), which capture the weekly pattern (period = 7).
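
For readers who prefer the equation form, the specification can be written with the backshift operator B (where B y_t = y_{t-1}). The symbols below are notation introduced here, not taken from the report, and the signs assume the convention of R's stats::arima (AR terms subtracted, MA terms added). No constant appears because the report lists none:

```latex
(1 - \phi_1 B - \phi_2 B^2)\,(1 - \Phi_1 B^7)\,(1 - B)^2\, y_t
  = (1 + \theta_1 B)\,(1 + \Theta_1 B^7 + \Theta_2 B^{14})\,\varepsilon_t
```

Here \phi_1, \phi_2 are AR1 and AR2, \theta_1 is MA1, \Phi_1 is SAR1, \Theta_1 and \Theta_2 are SMA1 and SMA2, and \varepsilon_t is white noise with variance \sigma^2.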

The coefficient estimates were as follows: AR1 = -0.397, AR2 = -0.2727, MA1 = -0.5414, SAR1 = 0.5224, SMA1 = 0.2040, and SMA2 = 0.1794. Every estimate is more than three times its standard error in absolute value (the smallest ratio is about 3.2, for SMA1), so all six coefficients are statistically significant at conventional levels. Together, the terms capture both short-term dependence and the recurring weekly fluctuations in the case counts.
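
The significance of the coefficients can be checked directly from the reported estimates and standard errors. The following is a minimal base R sketch, assuming an approximate normal (Wald) test. The variable names are illustrative, and the numbers are copied from the model summary at the end of this report:

```r
# Estimates and standard errors copied from the report(model_fit) output
est <- c(ar1 = -0.397, ar2 = -0.2727, ma1 = -0.5414,
         sar1 = 0.5224, sma1 = 0.2040, sma2 = 0.1794)
se  <- c(0.056, 0.0468, 0.0532, 0.0603, 0.0631, 0.0519)

# Wald z-ratio and two-sided p-value under a normal approximation
z <- est / se
p <- 2 * pnorm(-abs(z))

coef_table <- data.frame(estimate = est, std_error = se,
                         z = round(z, 2), p_value = signif(p, 3))
print(coef_table)
```

A ratio beyond about 1.96 in absolute value corresponds to significance at the 5% level under this approximation.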

The model’s fit was summarized with standard time series metrics. The residual variance (σ²) was estimated at approximately 88,272,350, which corresponds to a residual standard deviation of roughly 9,400 cases. The log-likelihood was -8855.08, and the corresponding AIC, AICc, and BIC values were 17,724.15, 17,724.29, and 17,757.27, respectively. Information criteria have no absolute scale: they are meaningful only when compared across candidate models fitted to the same data, where lower values indicate a better trade-off between fit and complexity.
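
The reported AIC can be reproduced from the log-likelihood. This base R sketch assumes the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k log(n), with k counting the six coefficients plus σ². The implied sample size is backed out from the reported BIC and is an inference under those assumptions, not a figure stated in the report:

```r
log_lik <- -8855.08        # from the model summary
k <- 6 + 1                 # six coefficients plus the innovation variance (assumed convention)

# AIC from its definition; small differences come from rounding of log_lik
aic <- -2 * log_lik + 2 * k
print(aic)

# Solve BIC = -2*logLik + k*log(n) for n, using the reported BIC
bic_reported <- 17757.27
n_implied <- exp((bic_reported + 2 * log_lik) / k)
print(round(n_implied))
```

The recomputed AIC agrees with the reported 17,724.15 up to rounding, which supports the assumption that σ² is counted as a parameter.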

A 30-day forecast was generated using this ARIMA model. The forecast includes both point predictions and prediction intervals for future daily COVID-19 case counts in California. The forecast plot shows a continuation of the existing trend, with prediction intervals that widen as the horizon increases. This widening is typical of forecasts from differenced (integrated) models, because uncertainty accumulates with each step ahead.
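
The widening of the intervals can be illustrated on simulated data. The sketch below is not the report's model: it uses base R's stats::arima on a simulated random walk (an assumed stand-in for an integrated series) and builds 95% intervals from the forecast standard errors:

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
y <- cumsum(rnorm(200))           # simulated random walk (hypothetical data)

# Fit a simple integrated model and forecast 30 steps ahead
fit <- arima(y, order = c(1, 1, 0))
fc  <- predict(fit, n.ahead = 30)

# 95% prediction interval: point forecast +/- z * forecast standard error
z_crit <- qnorm(0.975)
lower  <- fc$pred - z_crit * fc$se
upper  <- fc$pred + z_crit * fc$se
width  <- as.numeric(upper - lower)

# Interval width at a few horizons
print(round(width[c(1, 10, 20, 30)], 2))
```

The interval width grows with the horizon because the forecast error variance accumulates the innovations from every step between the last observation and the target date.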

This analysis illustrates how ARIMA modeling can be used to describe and project infectious disease patterns. The forecast can help policymakers, public health authorities, and hospital administrators prepare for and respond to short-term changes in infection trends.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.4.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(fable)
## Warning: package 'fable' was built under R version 4.4.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.4.3
library(feasts)
## Warning: package 'feasts' was built under R version 4.4.3
library(readr)
library(lubridate)

Introduction

This project analyzes and forecasts daily COVID-19 case counts in California using ARIMA time series modeling.

Data Preparation and Time Series Construction

covid_data <- read_csv("us-counties.csv")
## Rows: 2502832 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): county, state, fips
## dbl  (2): cases, deaths
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
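# Aggregate the county-level rows into one statewide total per date
california_data <- covid_data %>%
  filter(state == "California") %>%
  group_by(date = as.Date(date)) %>%
  summarise(total_cases = sum(cases), .groups = "drop")

# Convert to a tsibble indexed by date so that fable can model it
covid_ts <- california_data %>%
  as_tsibble(index = date)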
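
One point worth verifying before modeling: if the cases column in us-counties.csv holds cumulative counts (an assumption here; the report does not say), then total_cases is a running total, and daily new cases would be its first difference. A minimal base R sketch with hypothetical toy numbers:

```r
# Hypothetical cumulative totals for five consecutive days (toy data, not from the source)
toy <- data.frame(
  date = as.Date("2020-03-01") + 0:4,
  total_cases = c(10, 15, 15, 27, 40)
)

# The first difference turns a running total into daily new cases;
# the first day has no predecessor, so its value is NA
toy$new_cases <- c(NA, diff(toy$total_cases))

print(toy)
```

If the column already holds daily counts, no differencing step is needed and the series can be used as constructed above.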

Time Series Visualization

autoplot(covid_ts, total_cases) + 
  labs(title = "Daily COVID-19 Cases in California", y = "Cases")

ARIMA Modeling and Forecast

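# With no explicit specification, ARIMA() searches for the model orders automatically
model_fit <- covid_ts %>% model(ARIMA(total_cases))
# Forecast 30 days beyond the last observed date
forecast_30 <- forecast(model_fit, h = "30 days")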

autoplot(forecast_30, covid_ts) + 
  labs(title = "30-Day Forecast of COVID-19 Cases in California", y = "Cases")

Model Summary

report(model_fit)
## Series: total_cases 
## Model: ARIMA(2,2,1)(1,0,2)[7] 
## 
## Coefficients:
##          ar1      ar2      ma1    sar1    sma1    sma2
##       -0.397  -0.2727  -0.5414  0.5224  0.2040  0.1794
## s.e.   0.056   0.0468   0.0532  0.0603  0.0631  0.0519
## 
## sigma^2 estimated as 88272350:  log likelihood=-8855.08
## AIC=17724.15   AICc=17724.29   BIC=17757.27