Assignment 1 - Rinad Salkham

Question 2.8:

1) Total Private Employes from us_employement

There is some seasonality in the data, whereby the number employed is at its lowest in January, increases to peak in June - August, then decreases again till the end of the year. Additionally, The number of people employed was on an increasing trend, despite decreasing in some years, which were during the big recessions. The recession years also went against the seasonality rules, for example, in 2009 the number of people employed in January was the highest during the year.

The autocorrelation graph shows very high correlation across all of the years. It starts almost equal to one at a lag of 1 month, and continues to decrease as the lags increase.

library('fpp3')

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## ── Attaching packages ──────────────────────────────────────────── fpp3 1.0.1 ──

## ✔ tibble      3.2.1     ✔ tsibble     1.1.5
## ✔ dplyr       1.1.4     ✔ tsibbledata 0.4.1
## ✔ tidyr       1.3.1     ✔ feasts      0.4.1
## ✔ lubridate   1.9.3     ✔ fable       0.4.1
## ✔ ggplot2     3.5.1

## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## ✖ lubridate::date()    masks base::date()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ tsibble::intersect() masks base::intersect()
## ✖ tsibble::interval()  masks lubridate::interval()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ tsibble::setdiff()   masks base::setdiff()
## ✖ tsibble::union()     masks base::union()

library(ggplot2)

us_emp <- us_employment %>% filter(Title == "Total Private")
us_emp |> autoplot(Employed)

us_emp |> gg_season(Employed, labels="both")

us_emp |> gg_subseries(Employed)

us_emp |> gg_lag(Employed)

us_emp |> ACF(Employed) |> autoplot()

2) Bricks from aus_production

The general trend is one where the production increased from 1960 till 1980, but then started to decrease. There is a seasonal trend whereby the first quarter has slightly lower values, and then the 2nd - 4th quarter of the year are higher. However, the seasonality is not consistent across all of the years, and is quite variable year-on-year. As a result of the seasonality, the autocorrelation graph has a general decreasing trend but peaks at the multiples of 4.

aus_production |> autoplot(Bricks)

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production |> gg_season(Bricks, labels="both")

## Warning: Removed 20 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_text()`).

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_text()`).

aus_production |> gg_subseries(Bricks)

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

aus_production |> gg_lag(Bricks, geom="point")

## Warning: Removed 20 rows containing missing values (gg_lag).

aus_production |> ACF(Bricks) |> autoplot()

3) Hare from pelt

There is seasonality in the data, whereby the value of Hare is increasing dramatically in certain years, but then dropping in others. However, there is no general trend in the data, it’s fluctuating around a similar mean across the years, except for the years around 1865 and 1885 where the increase was much more dramatic.

The autocorrelation graph shows the seasonality of the data, where the correlation is highly positive, then becomes highly negative, and then repeats.

pelt |> autoplot(Hare)

#pelt |> gg_season(Hare, labels="both")
pelt |> gg_subseries(Hare)

pelt |> gg_lag(Hare, geom="point")

pelt |> ACF(Hare) |> autoplot()

4) “H02” Cost from PBS

There is a general increasing trend in the data as well as clear seasonality, whereby the cost in January is high then decreases significantly in February and starts to gradually increase till December. All of the years follow the general trend and seasonality with some fluctuation in the year 2008.

The seasonality is also clear in the autocorrelation plot, which remains positive across the years, but continues to increase and decrease.

PBS_filter <- PBS %>% filter(ATC2 == "H02") %>% summarise(Cost=sum(Cost))
PBS_filter |> autoplot(Cost)

PBS_filter |> gg_season(Cost, labels="both")

PBS_filter |> gg_subseries(Cost)

PBS_filter |> gg_lag(Cost, geom="point")

PBS_filter |> ACF(Cost) |> autoplot()

5) Barrels from us_gasline

The general trend of the data was that it was increasing steadily over the years, but the trend shifted around 2008/2009 and decreased, but it looks to be in a general increasing trend again. There is seasonality in the data, where the summer months are slightly higher than the winter months, but the values fluctuate heavily week on week. The autocorrelation plot is positive across the years, but has a decreasing trend, with slight peaks at some lags due to seasonality.

us_gasoline |> autoplot(Barrels)

us_gasoline |> gg_season(Barrels, labels="both")

us_gasoline |> gg_subseries(Barrels)

us_gasoline |> gg_lag(Barrels, geom="point")

us_gasoline |> ACF(Barrels) |> autoplot()

Question 7.4:

The plot shows a general increasing trend in sales, with a significant increase in 1993 and 1994. There is also seasonality within the year, where sales tend to peak in the summer and again toward the end of the year.

souvenirs |> autoplot()

## Plot variable not specified, automatically selected `.vars = Sales`

This is done to normalize skewed data, making it more normally distributed, and to linearize non-linear relationships between variables, which allows for better model fit.

souvenirs_log <- souvenirs %>%
  mutate(log_sales = log(Sales))
souvenirs_log |> autoplot(log_sales)

I took the log of sales, creates seasonal dummies to represent the different seasons in the year, and a festival dummy for the surfing festival which is equal to 1 if it is January, February, March, or April.

#Mutating the dataset: took the log of sales and created a new dummy variable called festival that assigned 1 if the month is Jan, Feb, or Mar
souvenirs_question <- souvenirs %>%
  mutate(
    log_sales = log(Sales),              
    month = as.character(month(Month, label=TRUE)),   
    festival = if_else(month %in% c("Jan", "Feb", "Mar", "Apr"), 1, 0),
    winter = if_else(month %in% c("Dec", "Jan", "Feb"), 1, 0),
    spring = if_else(month %in% c("Mar", "Apr", "May"), 1, 0),
    summer = if_else(month %in% c("Jun", "Jul", "Aug"), 1, 0),
    fall = if_else(month %in% c("Sep", "Oct", "Nov"), 1, 0)
  )
souvenirs_question

#fitted the model to the data
model <- souvenirs_question %>% model(lm = TSLM(log_sales ~ Month + winter + spring + summer + fall + festival))

#plotting the fitted line
augment(model) |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = log_sales)) +
  geom_line(aes(y = .fitted, colour = "Fitted"))

The histogram of the residuals shows that there remains a some skewness in the data even after taking the log.

#Modeling the residulas against the fitted valeus
augment(model) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() + labs(x = "Fitted", y = "Residuals")

#assessment of the residuals
model |> gg_tsresiduals()

Increasing residuals as time passes indicate heteroscedasticity, where the variability of the errors is not constant across the range of the predictor variables and seems to be impacted by seasonality. This violates the assumption of homoscedasticity in regression models, which can lead to inefficient coefficient estimates and unreliable hypothesis tests, such as biased p-values and confidence intervals.

augmented_data <- augment(model)
augmented_data$residuals <- augmented_data$.resid
augmented_data$Month <- as.factor(augmented_data$Month)
ggplot(augmented_data, aes(x = Month, y = residuals)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplot of Residuals by Month", x = "Month", y = "Residuals")

The coefficients suggest that the intercept (3.698) is the baseline sales value. The Month coefficient (7.665e-04) shows a slight positive trend in sales over time. Seasonal variables indicate that sales are higher in winter (0.433) and lower in summer (-0.320), while spring does not have a significant impact. The festival coefficient (-0.846) shows that the festival period reduces sales significantly. The fall variable is not included in the model output as it is the reference for the seasonal categories.

report(model)

## Series: log_sales 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00284 -0.26948  0.01035  0.22375  0.83185 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.698e+00  4.741e-01   7.800 2.29e-11 ***
## Month        7.665e-04  6.142e-05  12.480  < 2e-16 ***
## winter       4.334e-01  1.558e-01   2.781  0.00678 ** 
## spring       1.943e-01  1.559e-01   1.247  0.21612    
## summer      -3.209e-01  1.273e-01  -2.520  0.01376 *  
## fall                NA         NA      NA       NA    
## festival    -8.455e-01  1.354e-01  -6.245 2.07e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4122 on 78 degrees of freedom
## Multiple R-squared: 0.7442,  Adjusted R-squared: 0.7278
## F-statistic: 45.39 on 5 and 78 DF, p-value: < 2.22e-16

The Ljung-Box test result indicates that there is significant autocorrelation in the residuals of the model, as the p-value is extremely small (less than 2.2e-16) while the X-squared value is 107.61 with 10 degrees of freedom. This suggests that the residuals are not independent, implying that the model may need adjustments to account for autocorrelation in the data.

resids <- residuals(model)$.resid
Box.test(resids, lag = 10, type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  resids
## X-squared = 107.61, df = 10, p-value < 2.2e-16

#forecast(model, h = 1)

I would improve the model by collecting additional data for different variables to include that are more accurate.

** Question 8.1:**

The optimal level for alpha = 0.09821581 and the optimal level for l0 = 422028. The forecasts for the next 4 months are: 419019.8 for January 2019, 422344.6 for February 2019, 467223.3 for March 2019, and 429610.8 for April 2019.

aus_pigs <- aus_livestock |> filter(Animal=="Pigs") |> summarise(Count = sum(Count))
aus_pigs |> autoplot(aus_pigs$Count)

## Warning: Use of `aus_pigs$Count` is discouraged.
## ℹ Use `Count` instead.

model_ets <- aus_pigs |> model(ETS(Count))
model_ets |> report()

## Series: Count 
## Model: ETS(M,Ad,M) 
##   Smoothing parameters:
##     alpha = 0.09821581 
##     beta  = 0.08632586 
##     gamma = 0.144023 
##     phi   = 0.8000286 
## 
##   Initial states:
##    l[0]      b[0]     s[0]    s[-1]     s[-2]     s[-3]     s[-4]     s[-5]
##  422028 -4053.052 1.043194 1.068446 0.9533266 0.9723318 0.8927822 0.8661272
##     s[-6]  s[-7]    s[-8]     s[-9]   s[-10]  s[-11]
##  1.038676 1.0529 1.040113 0.9965679 1.019755 1.05578
## 
##   sigma^2:  0.0027
## 
##      AIC     AICc      BIC 
## 14615.56 14616.83 14693.40

model_ets |> forecast(h=4, level=.95) |> autoplot(aus_pigs)

forecast <- forecast(model_ets, h=4, level=.95)
forecast

The interval I got was: lower bound = 419019.7; mean = 419019.8; upper bound = 419019.9)

forecast_1 <- forecast$.mean[1]

residuals <- residuals(model_ets)
residual_sd <- sd(residuals$.resid)
residual_sd

## [1] 0.05103259

lower_bound <- forecast_1 - (1.96 * residual_sd)
upper_bound <- forecast_1 + (1.96 * residual_sd)

cat(lower_bound, forecast_1, upper_bound)

## 419019.7 419019.8 419019.9