This report summarizes a lesson by Harrison Brown (DataCamp).

1. What is Time Series Data?

What is time series data?

AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
summary(AirPassengers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   104.0   180.0   265.5   280.3   360.5   622.0

Time series objects:

  • Work better with specialized tools
  • Keep track of dates and times
  • Support a smoother workflow

Plotting with autoplot

  • autoplot()
autoplot(zoo(maunaloa))

Temporal data classes

Use class() to check the class of an object.

  • numeric
    • integer, floating point
    • August 9, 2022 = 19213
    • Number of days since Jan.1, 1970
  • character
    • String of text, names
    • August 9, 2022 = "2022-08-09"
    • August 9, 2022 = "August 9, 2022"
  • Date
    • Dates, days of the year
    • August 9, 2022 = "2022-08-09"
    • lubridate::as_date()
    • can allow us to do math
  • POSIXct
    • Dates and times, time zones
    • August 9, 2022, 4:17 p.m. (UTC-4) = "2022-08-09 20:17:00 UTC"
    • as.POSIXct()
    • is.POSIXct()
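
As a minimal sketch of the classes listed above, the following base R snippet stores the same day as character, Date, numeric, and POSIXct. The variable names are illustrative and do not come from the lesson.

```r
# The same calendar day stored in four different classes (base R only)
day_chr  <- "2022-08-09"                 # character: plain text
day_date <- as.Date(day_chr)             # Date: supports date arithmetic
day_num  <- as.numeric(day_date)         # numeric: days since 1970-01-01
day_time <- as.POSIXct("2022-08-09 20:17:00", tz = "UTC")  # date, time, and time zone

class(day_chr)    # "character"
class(day_date)   # "Date"
class(day_time)   # "POSIXct" "POSIXt"

# Date arithmetic works on Date objects but not on character strings
day_date + 30                       # the date 30 days later
as.Date("2022-12-25") - day_date    # time difference in days
```

Keeping dates in the Date or POSIXct class rather than as text is what makes this kind of arithmetic possible.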

Formatting dates

Order of time element

The order of time elements can differ by country and region.

  • U.S.: 12/20/2022
  • U.K.: 20/12/2022
  • Ambiguous most of the year:
    • 6/4/2010: June 4th or April 6th?

ISO 8601

  • Time elements arranged largest-to-smallest
    • Year -> month -> day -> …
    • 2022-06-04
    • 2022-06-04 = June 4th, 2022
  • Time elements separated by specific characters
    • Hyphens(-) between date elements
  • Ensure legibility and clarity
    • 2022-06-04 vs. 20220604
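
The ambiguity of regional formats and the benefit of ISO 8601 can be illustrated with a short base R sketch. The variable names are made up for this example.

```r
# "6/4/2010" is ambiguous: the format string decides the meaning
us_style <- as.Date("6/4/2010", format = "%m/%d/%Y")  # read as June 4th, 2010
uk_style <- as.Date("6/4/2010", format = "%d/%m/%Y")  # read as April 6th, 2010

# By default, a Date is printed in the ISO 8601 layout
format(us_style)   # "2010-06-04"
format(uk_style)   # "2010-04-06"

# ISO 8601 strings sort chronologically even as plain text
iso_strings <- c("2022-12-20", "2010-06-04", "2010-04-06")
sort(iso_strings)
```

Because the elements run from largest to smallest, sorting the text and sorting the dates give the same order.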

Formatting dates and times

lubridate::parse_date_time()

dates_vector <- c("12/20/2022", "2022-12-21", "December 22, 2022")
dates_vector
## [1] "12/20/2022"        "2022-12-21"        "December 22, 2022"
parse_date_time(dates_vector,
                orders = c("%m/%d/%Y",
                           "%Y-%m-%d",
                           "%B %d, %Y"))
## [1] "2022-12-20 UTC" "2022-12-21 UTC" "2022-12-22 UTC"

2. Manipulating Time Series with zoo

Temporal attributes

  • Start point
    • start()
  • End point
    • end()
  • Frequency
    • frequency()
  • \(\Delta t\): 1/frequency; the interval between observations
    • deltat()
  • Time index
    • time()
  • Position in the cycle
    • cycle()
start(AirPassengers)
## [1] 1949    1
# Decimal date
end(ftse)
## [1] 1998.646
end(ftse) %>% 
  lubridate::date_decimal()
## [1] "1998-08-24 20:18:27 UTC"
frequency(ftse)
## [1] 260
# weekly
frequency(maunaloa)
## [1] 52.17855
# delta t
deltat(ftse)
## [1] 0.003846154
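
For a self-contained illustration of the same attributes, the sketch below uses AirPassengers, which ships with base R, so it runs without the course datasets.

```r
# AirPassengers is a monthly ts object from the datasets package
start(AirPassengers)      # first period: year 1949, month 1
end(AirPassengers)        # last period: year 1960, month 12
frequency(AirPassengers)  # 12 observations per year
deltat(AirPassengers)     # 1/12 of a year between observations

# time() gives the decimal date of each observation;
# cycle() gives its position within the year (1 = January, ..., 12 = December)
head(time(AirPassengers), 3)
head(cycle(AirPassengers), 3)
```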

Regular vs. irregular time series

Regular
  • Evenly-spaced intervals
  • No missing values
  • Uses decimal date for ‘irregular’ intervals
Irregular
  • Spacing can be irregular
    • weekdays, random days
  • Missing observations
  • Decimal date or Date/POSIXct data
# Save the start point of maunaloa: maunaloa_start
maunaloa_start <- start(maunaloa)

# Assign the formatted date to start_iso
start_iso <- date_decimal(maunaloa_start)

# Convert to Date class
as_date(start_iso)
## [1] "1974-05-17"

zoo

  • zoo(x = ..., order.by = ...)
  • as.zoo(): converts a ts object to zoo
  • index(): extracts the time index
  • coredata(): extracts the observations without the index
  • c(): joins zoo objects
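
A minimal sketch of these functions, assuming the zoo package is installed. The readings and dates are invented for illustration.

```r
library(zoo)

# Hypothetical readings on irregularly spaced days
values <- c(10.2, 11.5, 9.8, 12.1)
dates  <- as.Date(c("2022-01-03", "2022-01-04", "2022-01-06", "2022-01-10"))

readings <- zoo(x = values, order.by = dates)  # irregular spacing is allowed

index(readings)     # the Date index
coredata(readings)  # the numeric values without the index

# Convert a regular ts object into a zoo object
air_zoo <- as.zoo(AirPassengers)
class(air_zoo)
```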

Finding overlapping indices

# Determine the overlapping indexes
overlapping_index <-
  index(coffee_overlap) %in% index(coffee)

# Create a subset of the elements which do not overlap
coffee_subset <- coffee_overlap[!overlapping_index]

# Combine the coffee time series and the new subset
coffee_combined <- c(coffee, coffee_subset)

autoplot(coffee_combined)
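
The same pattern can be tried without the course data. The sketch below assumes the zoo package is installed and uses two made-up series, first_part and second_part, that share two dates.

```r
library(zoo)

# Two made-up series whose date ranges overlap on Jan 4 and Jan 5
first_part  <- zoo(1:5, order.by = as.Date("2022-01-01") + 0:4)
second_part <- zoo(4:8, order.by = as.Date("2022-01-04") + 0:4)

# TRUE where an index of second_part already exists in first_part
is_duplicate <- index(second_part) %in% index(first_part)

# Keep only the new observations, then join the two series
new_obs  <- second_part[!is_duplicate]
combined <- c(first_part, new_obs)

length(combined)                 # number of observations after joining
anyDuplicated(index(combined))   # 0 means no date appears twice
```

Dropping the overlapping observations first matters because a zoo index is expected to contain each time point only once.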

Converting between zoo and data frame

  • fortify.zoo(): from zoo to data frame
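
A short sketch of the round trip, assuming the zoo package is installed. The series temps is invented, and rebuilding the zoo object with zoo() is just one possible way back.

```r
library(zoo)

# Made-up series used only for illustration
temps <- zoo(c(3.1, 4.0, 2.7), order.by = as.Date("2022-01-01") + 0:2)

# zoo -> data frame: the index becomes a regular column called "Index"
temps_df <- fortify.zoo(temps)
str(temps_df)

# data frame -> zoo: pass the value column and the index column back to zoo()
temps_again <- zoo(temps_df[[2]], order.by = temps_df$Index)
all.equal(coredata(temps_again), coredata(temps))
```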

3. Indexing Time Series Objects

Subsetting a window of observations

Time series windows

Windows:

  • Subset of time series
  • Inherits frequency from parent time series
  • Defined by start and end points

Purpose of windows:

  • View a specified range of data

  • Focus on years/events of interest

  • Ignore observations at the “edge” of the data

  • stats::window(x = ..., start = ..., end = ...)
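
As a minimal base R sketch of window(), the example below cuts two years out of the built-in monthly AirPassengers series. For a ts object, the start and end points are given as c(year, period).

```r
# Keep only 1955 and 1956 from the monthly AirPassengers series
air_window <- window(x = AirPassengers, start = c(1955, 1), end = c(1956, 12))

length(air_window)     # 24 monthly observations
frequency(air_window)  # 12, inherited from the parent series
start(air_window)      # 1955 1
end(air_window)        # 1956 12
```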

Selecting a window from a time series
# Create a window from ftse (its index holds decimal dates, so the bounds are numeric)
ftse_window <- window(ftse, start = 1995, end = 1997)

# Create an autoplot from the original ftse
autoplot(ftse) + 
  labs(y = "Closing price")

# Create an autoplot from ftse_window
autoplot(ftse_window) + 
  labs(y = "Closing price")

Logical expressions and subsets
# Complete the logical expression (the index holds decimal dates, so compare with numbers)
subset <- index(maunaloa) >= 1990 &
          index(maunaloa) <= 2010

# Extract the subset of maunaloa
maunaloa_subset <- maunaloa[subset]

# Autoplot the subsetted maunaloa dataset
autoplot(zoo(maunaloa_subset))

Monthly and quarterly data

Dates and aggregated data

Aggregation:

  • Monthly mean
  • Weekly maximum
  • Daily median
  • Shows general trend and patterns in the data

Example: monthly data. Which date to use?

  • Data for January, 2003:
    • 2003-01-01?
    • 2003-01-31?
    • 2003-01-15?
    • 2003-01?
  • zoo::as.yearmon
  • zoo::as.yearqtr
laborday_2022 <- as_date("2022-09-05")
as.yearmon(laborday_2022)
## [1] "9 2022"
as.yearqtr(laborday_2022)
## [1] "2022 Q3"
as.yearqtr(2018.639)
## [1] "2018 Q3"
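
The question of which day should represent a month can be sketched as follows, assuming the zoo package is installed. The variable names are illustrative.

```r
library(zoo)

obs_date <- as.Date("2003-01-15")

# Collapse a daily date to its month or quarter
obs_month   <- as.yearmon(obs_date)
obs_quarter <- as.yearqtr(obs_date)
format(obs_quarter)   # "2003 Q1"

# Convert back to a Date: frac = 0 gives the first day, frac = 1 the last day
as.Date(obs_month, frac = 0)   # "2003-01-01"
as.Date(obs_month, frac = 1)   # "2003-01-31"

# Internally a yearmon is year + (month - 1) / 12
as.numeric(obs_month)          # 2003
```

Storing the month itself avoids having to pick an arbitrary day, while the frac argument makes the choice explicit when a Date is needed again.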

Resampling and aggregating observations

Sampling frequency

Frequency:

  • Number of observations per year
  • e.g., weekly, daily, monthly, …

Temporal resolution:

  • “High resolution” sampled often
  • “Low resolution” sampled infrequently
  • “High” and “Low” are subjective

Aggregation

  • High resolution -> Low resolution
  • Applies a function like mean, sum, max to the chosen interval
  • e.g.:
    • Monthly sum of daily data
    • Weekly mean of hourly values
  • Cannot "reverse" aggregation
    • Monthly total -> daily values? NO
  • Provides statistics to describe patterns in the data
  • Aggregation reduces information
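
The points above can be checked with base R alone. This sketch aggregates the monthly AirPassengers series to yearly totals and quarterly means using stats::aggregate(). It is one possible approach, not the one used later in the lesson.

```r
# Aggregate monthly AirPassengers (frequency 12) to lower resolutions
yearly_total   <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)
quarterly_mean <- aggregate(AirPassengers, nfrequency = 4, FUN = mean)

frequency(yearly_total)    # 1 observation per year
length(yearly_total)       # 12 years
yearly_total[1]            # 1520, the sum of the twelve 1949 values
length(quarterly_mean)     # 48 quarters

# The monthly detail is gone: 12 yearly totals cannot recover 144 monthly values
length(AirPassengers)      # 144
```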
xts

xts:

  • eXtensible Time Series

  • Extends the zoo package and the zoo class of objects

  • apply.*(x = ..., FUN = ...) functions, e.g., apply.monthly()

  • endpoints(x = ..., on = ..., k = ...)

    • on: “weeks”, “months”, “days”, …
    • k: integer; number of on units per period (e.g., on = “months”, k = 3 gives three-month periods)
  • period.apply()

zoo_maunaloa <- zoo(maunaloa)
index(zoo_maunaloa) <- date_decimal(index(zoo_maunaloa))

autoplot(zoo_maunaloa)

# Aggregate to the monthly max and autoplot
monthly_max <- apply.monthly(zoo_maunaloa, FUN = max)
autoplot(monthly_max)

# Create the index from every third month
three_month_index <- endpoints(x = zoo_maunaloa,
                               on = "months",
                               k = 3)
# Apply the maximum to the time series using the index
three_month_max <- period.apply(x = zoo_maunaloa,
                                INDEX = three_month_index,
                                FUN = max)
# Autoplot the three-month maximum
autoplot(three_month_max)
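
To see what endpoints() and period.apply() return without the course data, here is a small sketch that assumes the xts package is installed. The series daily is made up, with each value equal to the day of the year.

```r
library(xts)

# Made-up daily series for 2022: the value is simply the day of the year
days  <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
daily <- xts(seq_along(days), order.by = days)

# One value per calendar month
monthly_max <- apply.monthly(daily, FUN = max)
nrow(monthly_max)     # 12

# endpoints() returns the row positions that close each three-month block,
# starting with 0; period.apply() then summarises each block
quarter_ends  <- endpoints(x = daily, on = "months", k = 3)
quarterly_max <- period.apply(x = daily, INDEX = quarter_ends, FUN = max)
nrow(quarterly_max)   # 4
```

Because the values increase every day, each maximum is simply the last day of its block, which makes the grouping easy to verify by eye.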

Imputing missing values

Imputing values with zoo

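na.* functions from zoo:

  • na.fill(object = ..., fill = ...): replaces missing values with the value given in fill
  • na.locf(): replaces missing values with the most recent observation (last observation carried forward)
  • na.approx(): replaces missing values using linear interpolation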
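
The difference between the three imputation functions is easiest to see on a tiny example. The sketch assumes the zoo package is installed, and the series gappy is invented.

```r
library(zoo)

# Made-up series with gaps
gappy <- zoo(c(1, NA, 3, NA, NA, 6), order.by = 1:6)

na.fill(gappy, fill = 0)  # 1 0 3 0 0 6: constant replacement
na.locf(gappy)            # 1 1 3 3 3 6: last observation carried forward
na.approx(gappy)          # 1 2 3 4 5 6: linear interpolation
```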

4. Rolling and Expanding Windows

What is a rolling window?

A rolling window measures how a statistic changes as the window moves through time.

Rolling with zoo

  • zoo::rollmean(x = ..., k = ..., align = ..., fill = ...)

    • k: size of window
    • align: alignment of window; “right”, “left”, “center”
    • fill: value used to pad positions where the window is incomplete
  • zoo::rollsum()

  • zoo::rollmax()

  • zoo::rollapply(data = ..., width = ..., FUN = ..., align = ..., fill = ...): accepts user-defined functions

    • data: Time series object
    • width: Width of window(k)
    • FUN: Summary function
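
A minimal sketch of the rolling functions, assuming the zoo package is installed. The series x and the helper value_range() are hypothetical and only serve to show a user-defined summary function.

```r
library(zoo)

# Made-up series
x <- zoo(c(1, 2, 3, 4, 5, 6), order.by = 1:6)

# 3-point rolling mean; right alignment uses the current and two previous values
roll_mean <- rollmean(x, k = 3, align = "right", fill = NA)
roll_mean    # NA NA 2 3 4 5

# rollapply() accepts any summary function, here a hypothetical range helper
value_range <- function(w) max(w) - min(w)
roll_range  <- rollapply(data = x, width = 3, FUN = value_range,
                         align = "right", fill = NA)
roll_range   # NA NA 2 2 2 2
```

With align = "right" the first two positions have no complete window, so they take the fill value.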

Expanding windows

  • Use base::seq_along() to generate the width argument of rollapply()
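
An expanding mean built this way can be sketched as follows, assuming the zoo package is installed. The series x is made up, and the cumsum() line is only a cross-check.

```r
library(zoo)

# Made-up series
x <- zoo(c(2, 4, 6, 8, 10), order.by = 1:5)

# The width grows by one at each step: 1, 2, 3, 4, 5
expanding_mean <- rollapply(data = x, width = seq_along(x),
                            FUN = mean, align = "right")
expanding_mean   # 2 3 4 5 6

# Cross-check against the running mean computed with cumsum()
cumsum(coredata(x)) / seq_along(x)
```

Each value summarises all observations up to that point, which is why later values react less to a single new observation.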

Expanding window inferences

  • Statistics approach global summaries
  • Expanding mean becomes less sensitive to change
  • Earlier observations are more sensitive to change