Noor Hassuneh

Data Dive 14

Based on Week 12 Lab

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(xts)
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.3
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
theme_set(theme_minimal())
options(scipen = 6)

Goal 1: Business Scenario

Goal 2: Model Critique

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/quakes/quakes.csv"
quakes <- read_delim(url_, delim = ",")
## Rows: 18334 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (8): magType, net, id, place, type, status, locationSource, magSource
## dbl  (12): latitude, longitude, depth, mag, nst, gap, dmin, rms, horizontalE...
## dttm  (2): time, updated
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
quakes_ <- quakes |>
  select(time, latitude, longitude, mag) |>
  distinct()

quakes_ts <- as_tsibble(quakes_, index=time) |>
  index_by(date = date(time)) |>
  summarise(num_quakes = sum(!is.na(mag))) |>
  fill_gaps(num_quakes = 0) 

Critique One: Model Refinement

The lab used a fixed MA(7) model, but I improved it by using auto.arima() restricted to MA terms, which selected a better MA(5) model. Although the residual diagnostics showed autocorrelation which means that the model may still be missing something.

library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
url_ <- "https://raw.githubusercontent.com/leontoddjohnson/datasets/main/data/pageviews_ts/pageviews_ts.csv"

hits <- read_delim(url_, delim = ",") |>
  select(Date, `Time series`) |>
  rename(date = Date,
         views = `Time series`)
## Rows: 1918 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl  (1): Time series
## date (1): Date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hits_ts <- as_tsibble(hits, index = date)

hits_xts <- xts(hits_ts$views, 
                order.by = hits_ts$date,
                frequency = 7)

hits_xts <- setNames(hits_xts, "views")
auto_ma <- auto.arima(hits_xts, d = 0, max.p = 0, seasonal = FALSE)
summary(auto_ma)
## Series: hits_xts 
## ARIMA(0,0,5) with non-zero mean 
## 
## Coefficients:
##          ma1     ma2     ma3     ma4      ma5      mean
##       0.7567  0.5983  0.4847  0.2347  -0.0511  999.8757
## s.e.  0.0250  0.0323  0.0348  0.0324   0.0232   17.5892
## 
## sigma^2 = 65216:  log likelihood = -13350.02
## AIC=26714.04   AICc=26714.09   BIC=26752.95
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE    MAPE      MASE
## Training set 0.06998605 254.9747 158.2422 -5.592384 17.4898 0.9498189
##                     ACF1
## Training set 0.009854837
checkresiduals(auto_ma)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,5) with non-zero mean
## Q* = 537.85, df = 5, p-value < 2.2e-16
## 
## Model df: 5.   Total lags used: 10
  • The residual plot shows a large spike around the middle of the series, suggesting a potential outlier or structural change not accounted for by the model.
  • The ACF plot (bottom left) reveals significant autocorrelation at multiple lags, which means the residuals are not behaving like white noise.
  • This implies the ARIMA(0,0,5) model may be missing important time-dependent structure.

Critique Two: Checking for Stationary Data and Differencing

The lab tested for stationarity using the ADF test, but continued with an MA(7) model regardless of the result. If the test suggests non-stationarity, a more appropriate step would be to difference the data before fitting an ARIMA model to capture underlying trends.

library(tseries)
## Warning: package 'tseries' was built under R version 4.3.3
library(fable)
## Warning: package 'fable' was built under R version 4.3.3
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.3.3
library(tsibble)
library(feasts)
## Warning: package 'feasts' was built under R version 4.3.3
adf.test(quakes_ts$num_quakes)
## Warning in adf.test(quakes_ts$num_quakes): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  quakes_ts$num_quakes
## Dickey-Fuller = -13.04, Lag order = 15, p-value = 0.01
## alternative hypothesis: stationary
quakes_diff <- quakes_ts |> 
  mutate(diff_quakes = difference(num_quakes))

model_diff <- quakes_diff |> 
  model(ARIMA(diff_quakes))

report(model_diff)
## Series: diff_quakes 
## Model: ARIMA(5,0,0) 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4      ar5
##       -0.5785  -0.4163  -0.3266  -0.2288  -0.1432
## s.e.   0.0157   0.0179   0.0183   0.0179   0.0157
## 
## sigma^2 estimated as 17.03:  log likelihood=-11227.95
## AIC=22467.9   AICc=22467.92   BIC=22505.6
glance(model_diff)
## # A tibble: 1 × 8
##   .model             sigma2 log_lik    AIC   AICc    BIC ar_roots  ma_roots 
##   <chr>               <dbl>   <dbl>  <dbl>  <dbl>  <dbl> <list>    <list>   
## 1 ARIMA(diff_quakes)   17.0 -11228. 22468. 22468. 22506. <cpl [5]> <cpl [0]>
augment(model_diff) |> 
  ACF(.resid) |> 
  autoplot()

Critique Three: Change in Plots

The lab primarily uses basic line plots which may not have fully captured patters like seasonality or long term trends. By enhancing the visualizations it reveals many hidden structures in our data.

quakes_ts |>
  mutate(roll_avg = slider::slide_dbl(num_quakes, mean, .before = 3, .after = 3, .complete = TRUE)) |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = num_quakes), color = "gray", alpha = 0.5) +
  geom_line(aes(y = roll_avg), color = "blue", size = 1) +
  labs(title = "Daily Earthquakes with 7-Day Rolling Average", y = "Count")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_line()`).

  • The blue line represents the 7-day rolling average, which smooths out day-to-day fluctuations and helps highlight broader trends.
  • The graph shows several spikes in the smoothed line, especially around 2015, 2021, and 2022, which could correspond to clusters of seismic activity.
  • Overall, the rolling average suggests that while daily counts vary a lot, there are periods of increased activity that may be worth further investigation.
quakes_ts_ts <- ts(quakes_ts$num_quakes, frequency = 365)
decomp <- stl(quakes_ts_ts, s.window = "periodic")
autoplot(decomp)

  • The trend line shows a slow upward movement between months 7–9, indicating a potential long-term increase in earthquake frequency during that period.
  • The seasonal component reveals repeating spikes, suggesting some periodic pattern in the data that may align with environmental or reporting cycles.
  • The remainder (bottom panel) still contains several sharp peaks, showing that certain irregular or unexpected earthquake spikes are not explained by the model.

Goal 3: Ethical and Epistemological Concerns