Based on Week 12 Lab
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(xts)
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
##
## The following object is masked from 'package:zoo':
##
## index
##
## The following object is masked from 'package:lubridate':
##
## interval
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/quakes/quakes.csv"
quakes <- read_delim(url_, delim = ",")
## Rows: 18334 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): magType, net, id, place, type, status, locationSource, magSource
## dbl (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm (2): time, updated
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
quakes_ <- quakes |>
select(time, latitude, longitude, mag) |>
distinct()
quakes_ts <- as_tsibble(quakes_, index=time) |>
index_by(date = date(time)) |>
summarise(num_quakes = sum(!is.na(mag))) |>
fill_gaps(num_quakes = 0)
The lab used a fixed MA(7) model, but I improved it by using
auto.arima()
restricted to MA terms, which selected a
better MA(5) model. Although the residual diagnostics showed
autocorrelation which means that the model may still be missing
something.
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/pageviews_ts/pageviews_ts.csv"
hits <- read_delim(url_, delim = ",") |>
select(Date, `Time series`) |>
rename(date = Date,
views = `Time series`)
## Rows: 1918 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (1): Time series
## date (1): Date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hits_ts <- as_tsibble(hits, index = date)
hits_xts <- xts(hits_ts$views,
order.by = hits_ts$date,
frequency = 7)
hits_xts <- setNames(hits_xts, "views")
auto_ma <- auto.arima(hits_xts, d = 0, max.p = 0, seasonal = FALSE)
summary(auto_ma)
## Series: hits_xts
## ARIMA(0,0,5) with non-zero mean
##
## Coefficients:
## ma1 ma2 ma3 ma4 ma5 mean
## 0.7567 0.5983 0.4847 0.2347 -0.0511 999.8757
## s.e. 0.0250 0.0323 0.0348 0.0324 0.0232 17.5892
##
## sigma^2 = 65216: log likelihood = -13350.02
## AIC=26714.04 AICc=26714.09 BIC=26752.95
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.06998605 254.9747 158.2422 -5.592384 17.4898 0.9498189
## ACF1
## Training set 0.009854837
checkresiduals(auto_ma)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,5) with non-zero mean
## Q* = 537.85, df = 5, p-value < 2.2e-16
##
## Model df: 5. Total lags used: 10
The lab tested for stationarity using the ADF test, but continued with an MA(7) model regardless of the result. If the test suggests non-stationarity, a more appropriate step would be to difference the data before fitting an ARIMA model to capture underlying trends.
library(tseries)
## Warning: package 'tseries' was built under R version 4.3.3
library(fable)
## Warning: package 'fable' was built under R version 4.3.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.3.3
library(tsibble)
library(feasts)
## Warning: package 'feasts' was built under R version 4.3.3
adf.test(quakes_ts$num_quakes)
## Warning in adf.test(quakes_ts$num_quakes): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: quakes_ts$num_quakes
## Dickey-Fuller = -13.04, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
quakes_diff <- quakes_ts |>
mutate(diff_quakes = difference(num_quakes))
model_diff <- quakes_diff |>
model(ARIMA(diff_quakes))
report(model_diff)
## Series: diff_quakes
## Model: ARIMA(5,0,0)
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5
## -0.5785 -0.4163 -0.3266 -0.2288 -0.1432
## s.e. 0.0157 0.0179 0.0183 0.0179 0.0157
##
## sigma^2 estimated as 17.03: log likelihood=-11227.95
## AIC=22467.9 AICc=22467.92 BIC=22505.6
glance(model_diff)
## # A tibble: 1 × 8
## .model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
## 1 ARIMA(diff_quakes) 17.0 -11228. 22468. 22468. 22506. <cpl [5]> <cpl [0]>
augment(model_diff) |>
ACF(.resid) |>
autoplot()
The lab primarily uses basic line plots which may not have fully captured patters like seasonality or long term trends. By enhancing the visualizations it reveals many hidden structures in our data.
quakes_ts |>
mutate(roll_avg = slider::slide_dbl(num_quakes, mean, .before = 3, .after = 3, .complete = TRUE)) |>
ggplot(aes(x = date)) +
geom_line(aes(y = num_quakes), color = "gray", alpha = 0.5) +
geom_line(aes(y = roll_avg), color = "blue", size = 1) +
labs(title = "Daily Earthquakes with 7-Day Rolling Average", y = "Count")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).
quakes_ts_ts <- ts(quakes_ts$num_quakes, frequency = 365)
decomp <- stl(quakes_ts_ts, s.window = "periodic")
autoplot(decomp)
Dependence on Trends: Although we try to prepare based on historical cycles it’s well known that this could be a problem. Earthquakes do not follow some sort of “calendar” regularly.
False Data Representation: We made an assumption that earthquakes are recorded regularly no matter what but we do have to take into account that some regions may be more heavily monitored which could skew our data.
Action from Results: If our results are misinterpreted or outright incorrect, it could lead to ineffective disaster relief planning by people in power.
False Confidence: Similar to our first point, by modeling these earthquakes it can give a false confidence as to when they may or may not appear when in reality it’s very hard to predict them.