Data Dive 12

In this Data Dive we will explore time series trends in our data set and look for seasonal patterns in the ‘sales’ column.

library(xts)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first()  masks xts::first()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::last()   masks xts::last()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
##    city     year month sales   volume median listings inventory  date
##    <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
##  2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
##  3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
##  4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
##  5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
##  6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
##  7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
##  8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
##  9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
## 10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
## # ℹ 8,592 more rows
# Work with a plain data.frame copy of ggplot2's txhousing data
texas = data.frame(txhousing)

Selecting a time series column

We will use the ‘date’ column as our time series index. In txhousing this column is numeric (the year plus a fraction for the month), so we first need to convert it to a proper date type.
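One way to do this is to rebuild a proper Date from the existing ‘year’ and ‘month’ columns (a minimal sketch; it anchors each observation to the first day of its month, which matches the dates shown in the tsibble below):

# Rebuild a calendar date from the year and month columns, anchored to the
# first day of each month.
texas$date = as.Date(paste(texas$year, texas$month, 1, sep = "-"))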

Selecting a response variable

For this Data Dive we will explore how ‘sales’ behaves over time.

library(tsibble)
## 
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
## 
##     interval
## The following object is masked from 'package:zoo':
## 
##     index
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
# Keep one row per (date, sales) pair so the tsibble key/index is unique
unique_txhousing = texas |> select(date, sales) |> distinct()


# summarise() aggregates over the sales key, so num_sales counts the distinct
# non-missing sales values recorded across cities in each month
texas_ts = as_tsibble(unique_txhousing, key = sales, index = date) |>
  summarise(num_sales = sum(!is.na(sales)))


texas_ts
## # A tsibble: 187 x 2 [1D]
##    date       num_sales
##    <date>         <int>
##  1 2000-01-01        36
##  2 2000-02-01        37
##  3 2000-03-01        37
##  4 2000-04-01        37
##  5 2000-05-01        35
##  6 2000-06-01        39
##  7 2000-07-01        36
##  8 2000-08-01        38
##  9 2000-09-01        38
## 10 2000-10-01        35
## # ℹ 177 more rows

Graphing our Time Series Data

library("ggplot2")
library(ggthemes)


texas_ts |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_line() +
  geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

From the above graph, we can observe a general upward trend from 2003 to 2010, which plateaus from 2010 to 2015.

However, to confirm this relationship, let us run a linear regression and see whether there is a real relationship between sales and date.

model = lm(num_sales ~ date, data = texas_ts)

model$coefficients
##  (Intercept)         date 
## 17.814196362  0.001692918
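
Before interpreting the slope, it is worth checking whether it is statistically distinguishable from zero. One quick check (output omitted here) is the full model summary:

# Inspect the standard error and p-value of the date coefficient.
summary(model)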

From the above model we observe a coefficient of 0.00169. Since the index is daily, this means that for each additional day we expect about 0.00169 more recorded sales. This supports our initial observation of a positive relationship between date and sales, though the effect is fairly weak.
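
The per-day slope can be rescaled to a more interpretable per-year change (a small sketch; 365.25 is just the average number of days in a year):

# Convert the per-day slope to an approximate per-year change:
# 0.001692918 * 365.25 is roughly 0.62 more recorded sales per year.
coef(model)[["date"]] * 365.25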

Subsetting the Data

2010 - 2015

texas_ts |> filter_index("2010-01" ~ "2015-01") |> ggplot(mapping = aes(x=date, y= num_sales)) +
  geom_line() + geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

From the blue trend line above, we still observe a slight positive trend.

January to December 2014

texas_ts |> filter_index("2014-01" ~ "2014-12") |> ggplot(mapping = aes(x=date, y= num_sales)) +
  geom_line() + geom_smooth(method = 'lm', color = 'blue', se=FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'

It appears that this trend holds even across the months of a single year.

subset_2014 = texas_ts |> filter_index("2014-01" ~ "2014-12")

model_2014 = lm(num_sales ~ date, data = subset_2014)
model_2014$coefficients
##   (Intercept)          date 
## -41.024461388   0.005267095

As observed above, even in the single-year model the coefficient is positive, at about 0.00527 per day. This is actually stronger than for our entire data set, but it is still a relatively weak relationship.

Looking for Seasonality

In order to look for seasons in our data, we will first smooth it using the LOWESS method. LOWESS (LOcally Weighted Scatterplot Smoothing), also known as LOESS (LOcally Estimated Scatterplot Smoothing), is a method by which we can visualize the general trend of noisy data, and we’ve already used it with geom_smooth. The idea is to consider a moving window of (relative) size span and fit a weighted linear regression to the data within each window.

texas_ts |>
  filter_index("2009" ~ "2013") |>
  drop_na() |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_point(size = 1, shape = 'O') +
  geom_smooth(span = 0.2, color = 'blue', se = FALSE) +
  labs(title = "Sales between 2009 and 2013") +
  theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Identifying Seasons with Autocorrelation

If correlation is the amount to which two different variables change together (i.e., think \(\text{corr}(x_1, x_2)\)), then autocorrelation is the amount to which a single variable changes with itself over time. In other words, we measure the correlation between \(y\) at some time \(t\), \(y_t\), and \(y\) at an earlier “lag” \(t-h\), \(y_{t-h}\). Autocorrelation plots are useful for identifying seasons in the data.

This autocorrelation can be calculated using the formula below:

\[ \text{AC}(h) = \frac{\text{Cov}(y_t, y_{t-h})}{\text{Var}(y_t)} \]
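As a sanity check, this formula can be applied by hand for a single lag (a small sketch; the acf() call below computes the same quantity for every lag at once):

# Sample autocorrelation at lag h = 1, mirroring the formula above: the
# covariance of the series with its lagged self, divided by its variance.
y = texas_ts$num_sales
h = 1
n = length(y)
sum((y[(h + 1):n] - mean(y)) * (y[1:(n - h)] - mean(y))) / sum((y - mean(y))^2)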

We convert the series to an xts object, tagging it as monthly data with the frequency parameter, and then plot the autocorrelation of our sales counts across a range of lags.

# Convert to an xts time series ordered by date, flagged as monthly data
texas_xts <- xts(texas_ts$num_sales,
                 order.by = texas_ts$date,
                 frequency = 12)

# Plot the autocorrelation function with a 95% confidence band, excluding
# any missing values from the calculation
acf(texas_xts, ci = 0.95, na.action = na.exclude)

As seen in the ACF graph above, the ACF values are highest at the earliest lags and decrease steadily as the lag increases, though they largely stay between 0.6 and 0.8 across the lag periods. A slow, steady decay like this is the signature of a persistent trend; a strongly seasonal series would instead show pronounced peaks at multiples of the 12-month lag.
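
To look for a seasonal pattern more directly, one option (a sketch going one step beyond the ACF plot) is to build a regular monthly ts object and decompose it, so that any repeating 12-month cycle is isolated in its own seasonal component:

# Build a monthly series starting January 2000 and split it into trend,
# seasonal, and remainder components with STL.
texas_monthly = ts(texas_ts$num_sales, start = c(2000, 1), frequency = 12)
plot(stl(texas_monthly, s.window = "periodic"))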