In this Data Dive we will explore time series trends in our data set and look for seasonal patterns in the ‘sales’ column.
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks xts::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks xts::last()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
txhousing
## # A tibble: 8,602 × 9
## city year month sales volume median listings inventory date
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abilene 2000 1 72 5380000 71400 701 6.3 2000
## 2 Abilene 2000 2 98 6505000 58700 746 6.6 2000.
## 3 Abilene 2000 3 130 9285000 58100 784 6.8 2000.
## 4 Abilene 2000 4 98 9730000 68600 785 6.9 2000.
## 5 Abilene 2000 5 141 10590000 67300 794 6.8 2000.
## 6 Abilene 2000 6 156 13910000 66900 780 6.6 2000.
## 7 Abilene 2000 7 152 12635000 73500 742 6.2 2000.
## 8 Abilene 2000 8 131 10710000 75000 765 6.4 2001.
## 9 Abilene 2000 9 104 7615000 64500 771 6.5 2001.
## 10 Abilene 2000 10 101 7040000 59300 764 6.6 2001.
## # ℹ 8,592 more rows
texas = data.frame(txhousing)
We will use the ‘date’ column as our time series index. Since it is currently stored as a numeric value, we first need to convert it to a date datatype.
For this Data Dive we will explore how ‘sales’ behaves over time.
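One way to do that, sketched below, is to rebuild the date from the year and month columns with lubridate; texas_dated and the choice of the first of the month are just illustrative.
library(lubridate)
# Sketch: build a proper Date from the existing year/month columns
# (texas_dated is a hypothetical name; day 1 is an arbitrary anchor)
texas_dated = texas |>
  mutate(date = make_date(year, month, 1))
head(texas_dated$date)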
library(tsibble)
##
## Attaching package: 'tsibble'
## The following object is masked from 'package:lubridate':
##
## interval
## The following object is masked from 'package:zoo':
##
## index
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
# Keep one row per distinct (date, sales) pair
unique_txhousing = texas |> select(date, sales) |> distinct()
#unique_dates = txhousing$date |> distinct()
# Count the non-missing sales records for each date
texas_ts = as_tsibble(unique_txhousing, key = sales, index = date) |>
  summarise(num_sales = sum(!is.na(sales)))
texas_ts
## # A tsibble: 187 x 2 [1D]
## date num_sales
## <date> <int>
## 1 2000-01-01 36
## 2 2000-02-01 37
## 3 2000-03-01 37
## 4 2000-04-01 37
## 5 2000-05-01 35
## 6 2000-06-01 39
## 7 2000-07-01 36
## 8 2000-08-01 38
## 9 2000-09-01 38
## 10 2000-10-01 35
## # ℹ 177 more rows
library("ggplot2")
library(ggthemes)
texas_ts |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_line() + geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
From the graph above, we can observe a general upward trend from 2003 to 2010, which plateaus from 2010 to 2015.
However, to confirm this relationship, let us run a linear regression to see whether there is a real relationship between sales and date.
model = lm(num_sales ~ date, data = texas_ts)
model$coefficients
## (Intercept) date
## 17.814196362 0.001692918
From the above model we observe a slope of about 0.00169. Because the index is a daily date, this means that for each additional day we expect about 0.00169 more sales, or roughly 0.6 per year. This supports our initial observation of a positive relationship between date and sales, although the trend is fairly weak.
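To make that per-day slope easier to interpret, the short sketch below rescales it to an approximate change per year (assuming roughly 365.25 days per year).
# Sketch: the date index is daily, so the lm slope is change per day;
# scaling by ~365.25 gives an approximate change per year (about 0.62)
coef(model)["date"] * 365.25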
texas_ts |>
  filter_index("2010-01" ~ "2015-01") |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_line() + geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
From the blue trend line above, we still observe a slight positive
trend.
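To put a number on that slope, the same regression can be fit on just this window; subset_2010_2015 is simply an illustrative name.
# Sketch: slope (per day) of the 2010-2015 window
subset_2010_2015 = texas_ts |> filter_index("2010-01" ~ "2015-01")
coef(lm(num_sales ~ date, data = subset_2010_2015))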
texas_ts |>
  filter_index("2014-01" ~ "2014-12") |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_line() + geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
  theme_hc()
## `geom_smooth()` using formula = 'y ~ x'
It appears that this trend holds even between months of the same year.
subset_2014 = texas_ts |> filter_index("2014-01" ~ "2014-12")
model_2014 = lm(num_sales ~ date, data = subset_2014)
model_2014$coefficients
## (Intercept) date
## -41.024461388 0.005267095
As observed above, even in the single-year 2014 model our coefficient is about 0.0053, indicating a positive relationship. This is actually steeper than for our entire dataset, but it is still a relatively weak relationship.
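Since both slopes are measured per day, the sketch below puts them on a common per-year scale for a direct comparison (the 365.25 factor is an approximation).
# Sketch: per-year equivalents of the full-period and 2014-only slopes
c(full = unname(coef(model)["date"]),
  y2014 = unname(coef(model_2014)["date"])) * 365.25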
In order to look for seasons in our data, we will first smooth it using the LOWESS method. LOcally WEighted Scatterplot Smoothing (LOWESS), also known as LOcally Estimated Scatterplot Smoothing (LOESS), is a method by which we can visualize the general trend of noisy data, and we've already used it before with geom_smooth. The idea is to slide a moving window of (relative) size span across the data and fit a weighted linear regression to the points within each window.
texas_ts |>
  filter_index("2009" ~ "2013") |>
  drop_na() |>
  ggplot(mapping = aes(x = date, y = num_sales)) +
  geom_point(size = 1, shape = 'O') +
  geom_smooth(span = 0.2, color = 'blue', se = FALSE) +
  labs(title = "Sales between 2009 and 2013") +
  theme_hc()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
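The message above confirms that geom_smooth() is fitting a loess smoother. As a sketch of what that does under the hood, the same kind of curve can be fit directly with loess(); here the fit is over the full series rather than just 2009-2013, and fit / smoothed are illustrative names.
# Sketch: fit the loess smoother directly; span plays the same role as in geom_smooth()
fit = loess(num_sales ~ as.numeric(date), data = texas_ts, span = 0.2)
texas_ts |>
  mutate(smoothed = predict(fit)) |>
  ggplot(aes(x = date)) +
  geom_point(aes(y = num_sales), size = 1, shape = 'O') +
  geom_line(aes(y = smoothed), color = 'blue') +
  theme_hc()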
If correlation is the degree to which two different variables change together (i.e., think \(\text{corr}(x_1, x_2)\)), then autocorrelation is the degree to which a single variable changes with itself over time. In other words, we measure the correlation between \(y\) at some time \(t\), \(y_t\), and \(y\) at an earlier “lag” \(t-h\), \(y_{t-h}\). Autocorrelation plots are useful for identifying seasonality in the data.
This autocorrelation can be calculated using the formula below:
\[ \text{AC}(h) = \frac{\text{Cov}(y_t, y_{t-h})}{\text{Var}(y_t)} \]
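As a quick sketch, this formula can be computed by hand for a single lag and compared against what acf() reports; h = 12 is chosen here purely as an example.
# Sketch: AC(h) for one lag, following the formula above
y = texas_ts$num_sales
h = 12
n = length(y)
cov_h = sum((y[(h + 1):n] - mean(y)) * (y[1:(n - h)] - mean(y))) / n
var_y = sum((y - mean(y))^2) / n
cov_h / var_y   # should match acf(y, plot = FALSE)$acf[h + 1]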
We calculate the autocorrelation of our sales column, setting the frequency parameter to 12 to reflect the monthly structure of the data (12 observations per year).
texas_xts <- xts(texas_ts$num_sales,
                 order.by = texas_ts$date,
                 frequency = 12)
acf(texas_xts, ci = 0.95, na.action = na.exclude)
As seen in the ACF plot above, we notice a weak trend where the ACF values are higher for earlier lags and consistently decrease as the time lag increases. However, the ACF values largely stay between 0.6 and 0.8 across the lag periods.
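To read the values more precisely than off the plot, the same call can be run without plotting; acf_vals is an illustrative name.
# Sketch: numeric ACF values for lags 0 through 12
acf_vals = acf(texas_xts, na.action = na.exclude, plot = FALSE)
head(round(drop(acf_vals$acf), 2), 13)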