Load packages and data

library(fpp3)
library(tidyverse)
library(seasonal)

Questions

Question 1

Part A

ts <- vic_elec %>% 
  filter(year(Date)==2014) %>% 
  filter(month(Date)==01) %>% 
  index_by(Date=as_date(Time)) %>% 
  summarise(
    Demand=sum(Demand),
    Temperature=max(Temperature)
  )
ts %>% 
  autoplot(Demand)+
  labs(x='Date',title="Electricity Demand")

ts %>% 
  ggplot(aes(x = Temperature, y = Demand)) +
  labs(y = "Electricity Demand",
       x = "Maximum Temperature") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

fit_mod <- ts %>%
  model(TSLM(Demand ~ Temperature)) %>%
  report()
## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -49978.2 -10218.9   -121.3  18533.2  35440.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832,  Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11

There is a positive relationship because people use more electricity when the temperature is high for air conditioning.

fit_mod %>% gg_tsresiduals()

This model leaves residuals that, while having no significant autocorrelation, do not exhibit normally distributed residuals and do not seem visually random. There appear to be outliers on the 26th and 27th

future_scenarios <- scenarios(
  Decrease = new_data(ts, 1) %>%
    mutate(Temperature=15),
  Increase = new_data(ts, 1) %>%
    mutate(Temperature=35),
  names_to = "Scenario")

fc <- forecast(fit_mod, new_data = future_scenarios)

ts %>%
  autoplot(Demand) +
  autolayer(fc) +
  labs(title = "Electricity Demand", y = "Demand")

I believe the 35 degrees prediction, because it seems to be somewhat in line with the data I don’t believe the 15 degrees prediction, it seems to far outside the norm.

hilo(fc) %>% print(n=4)
## # A tsibble: 2 x 8 [1D]
## # Key:       Scenario, .model [2]
##   Scenario .model                   Date                   Demand  .mean Tempe…¹
##   <chr>    <chr>                    <date>                 <dist>  <dbl>   <dbl>
## 1 Decrease TSLM(Demand ~ Temperatu… 2014-02-01 N(151398, 6.8e+08) 1.51e5      15
## 2 Increase TSLM(Demand ~ Temperatu… 2014-02-01 N(274484, 6.4e+08) 2.74e5      35
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>, and abbreviated variable
## #   name ¹​Temperature
## # ℹ Use `colnames()` to see all variable names

My prediction interval for 15 degrees is [117908.1, 184888.6] at 80% and [100179.4, 202617.3] at 95% My prediction interval for 35 degrees is [242088.4, 306880.1] at 80% and [224939.1, 324029.4] at 95%

bts <- vic_elec %>% 
  index_by(Date=as_date(Time)) %>% 
  summarise(
    Demand=sum(Demand),
    Temperature=max(Temperature)
  )
bts %>% 
  ggplot(aes(x = Temperature, y = Demand)) +
  labs(y = "Electricity Demand",
       x = "Maximum Temperature") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

This says that it is incorrectly specified. The relationship is clearly nonlinear, and not strictly positive like my model indicated

Question 2

oly <- olympic_running %>% 
  na.omit()
oly %>% 
  autoplot(Time)

This chart shows that men tend to be slightly faster than women. This also shows that times have been slightly decreasing over time. Finally, it shows that longer races have longer times.

oly %>%
  ggplot(aes(x = Year, y = Time)) +
  facet_grid(rows ='Sex')+
  labs(y = "Finishing Time",
       x = "Year") +
  geom_point()

This graph illustrates the main points of the first graph.

fit_oly <- lm(Time ~., oly)
fit_oly
## 
## Call:
## lm(formula = Time ~ ., data = oly)
## 
## Coefficients:
## (Intercept)         Year       Length     Sexwomen  
##    734.9863      -0.3905       0.1766      32.9732

They have been decreasing by .3905 per year

forecast::checkresiduals(fit_oly)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 179.8, df = 10, p-value < 2.2e-16

These might be the least uncorrelated values I have ever seen

Prediction for men’s 100

men100 <- oly %>% 
  filter(Sex=='men',Length==100) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  )
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.3443995 -0.1081360  0.0007715  0.0750701  0.9015537 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.011581   2.347437   14.91 2.94e-14 ***
## Year        -0.012612   0.001198  -10.52 7.24e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2321 on 26 degrees of freedom
## Multiple R-squared: 0.8099,  Adjusted R-squared: 0.8026
## F-statistic: 110.8 on 1 and 26 DF, p-value: 7.2403e-11
men100 %>% 
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men100)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model            Year          Time .mean                  `80%`
##    <int> <chr> <chr>            <dbl>        <dist> <dbl>                 <hilo>
## 1    100 men   TSLM(Time ~ Yea…  2020 N(9.5, 0.061)  9.53 [9.217343, 9.851671]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 200

men200 <- oly %>% 
  filter(Sex=='men',Length==200) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  )
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6112 -0.1872 -0.1046  0.1935  0.6959 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 69.376498   3.553448   19.52  < 2e-16 ***
## Year        -0.024881   0.001812  -13.73  3.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3315 on 25 degrees of freedom
## Multiple R-squared: 0.8829,  Adjusted R-squared: 0.8782
## F-statistic: 188.5 on 1 and 25 DF, p-value: 3.7956e-13
men200%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men200)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year        Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>      <dist> <dbl>                 <hilo>
## 1    200 men   TSLM(Time ~ Year)  2020 N(19, 0.13)  19.1 [18.66343, 19.57145]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 400

men400 <- oly %>% 
  filter(Sex=='men',Length==400) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  )
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6001 -0.5747 -0.2858  0.5751  4.1505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 172.481477  11.487522   15.02 2.52e-14 ***
## Year         -0.064574   0.005865  -11.01 2.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.136 on 26 degrees of freedom
## Multiple R-squared: 0.8234,  Adjusted R-squared: 0.8166
## F-statistic: 121.2 on 1 and 26 DF, p-value: 2.7524e-11
men400%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men400)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year       Time .mean                 `80%`
##    <int> <chr> <chr>             <dbl>     <dist> <dbl>                <hilo>
## 1    400 men   TSLM(Time ~ Year)  2020 N(42, 1.5)  42.0 [40.49022, 43.5944]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 800

men800 <- oly %>% 
  filter(Sex=='men',Length==800) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7306 -1.8734 -0.8449  0.6851 12.9408 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 405.8457    34.6506  11.712 1.21e-11 ***
## Year         -0.1518     0.0177  -8.576 6.47e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.387 on 25 degrees of freedom
## Multiple R-squared: 0.7463,  Adjusted R-squared: 0.7361
## F-statistic: 73.54 on 1 and 25 DF, p-value: 6.4705e-09
men800%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men800)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year      Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>    <dist> <dbl>                 <hilo>
## 1    800 men   TSLM(Time ~ Year)  2020 N(99, 13)  99.2 [94.59505, 103.8803]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 1500

men1500 <- oly %>% 
  filter(Sex=='men',Length==1500) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.302  -4.585  -1.215   1.925  27.133 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 843.43666   81.81773  10.309 1.12e-10 ***
## Year         -0.31507    0.04177  -7.543 5.23e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.09 on 26 degrees of freedom
## Multiple R-squared: 0.6864,  Adjusted R-squared: 0.6743
## F-statistic:  56.9 on 1 and 26 DF, p-value: 5.2345e-08
men1500%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men1500)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year       Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>     <dist> <dbl>                 <hilo>
## 1   1500 men   TSLM(Time ~ Year)  2020 N(207, 74)  207. [195.9438, 218.0527]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 5000

men5000 <- oly %>% 
  filter(Sex=='men',Length==5000) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.311 -11.668  -1.096   7.515  40.596 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2853.1995   205.1246  13.910 2.22e-12 ***
## Year          -1.0299     0.1042  -9.881 1.50e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.66 on 22 degrees of freedom
## Multiple R-squared: 0.8161,  Adjusted R-squared: 0.8078
## F-statistic: 97.64 on 1 and 22 DF, p-value: 1.4995e-09
men5000%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men5000)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year        Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>      <dist> <dbl>                 <hilo>
## 1   5000 men   TSLM(Time ~ Year)  2020 N(773, 285)  773. [751.1888, 794.4603]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for men’s 10000

men10000 <- oly %>% 
  filter(Sex=='men',Length==10000) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.964 -24.222  -4.091  11.911  58.859 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6965.4594   378.4635   18.41 7.52e-15 ***
## Year          -2.6659     0.1923  -13.86 2.37e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.89 on 22 degrees of freedom
## Multiple R-squared: 0.8973,  Adjusted R-squared: 0.8926
## F-statistic: 192.2 on 1 and 22 DF, p-value: 2.3727e-12
men10000%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(men10000)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year         Time .mean                 `80%`
##    <int> <chr> <chr>             <dbl>       <dist> <dbl>                <hilo>
## 1  10000 men   TSLM(Time ~ Year)  2020 N(1580, 970) 1580. [1540.432, 1620.27]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 100

women100 <- oly %>% 
  filter(Sex=='women',Length==100) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44833 -0.11159  0.05775  0.11731  0.36057 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.188164   3.697686  10.598 2.05e-09 ***
## Year        -0.014185   0.001872  -7.577 3.72e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2232 on 19 degrees of freedom
## Multiple R-squared: 0.7513,  Adjusted R-squared: 0.7382
## F-statistic:  57.4 on 1 and 19 DF, p-value: 3.7226e-07
women100%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women100)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year         Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>       <dist> <dbl>                 <hilo>
## 1    100 women TSLM(Time ~ Year)  2020 N(11, 0.059)  10.5 [10.22224, 10.84658]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 200

women200 <- oly %>% 
  filter(Sex=='women',Length==200) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93598 -0.39344  0.08043  0.37624  0.78230 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 89.137613  11.267724   7.911 6.41e-07 ***
## Year        -0.033633   0.005685  -5.916 2.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5005 on 16 degrees of freedom
## Multiple R-squared: 0.6863,  Adjusted R-squared: 0.6667
## F-statistic:    35 on 1 and 16 DF, p-value: 2.1706e-05
women200%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women200)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year        Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>      <dist> <dbl>                 <hilo>
## 1    200 women TSLM(Time ~ Year)  2020 N(21, 0.31)  21.2 [20.48494, 21.91454]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 400

women400 <- oly %>% 
  filter(Sex=='women',Length==400) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1593 -1.0142  0.1024  0.7749  1.4797 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 129.39165   33.89755   3.817  0.00245 **
## Year         -0.04008    0.01703  -2.353  0.03652 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.028 on 12 degrees of freedom
## Multiple R-squared: 0.3157,  Adjusted R-squared: 0.2587
## F-statistic: 5.536 on 1 and 12 DF, p-value: 0.036523
women400%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women400)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year       Time .mean                 `80%`
##    <int> <chr> <chr>             <dbl>     <dist> <dbl>                <hilo>
## 1    400 women TSLM(Time ~ Year)  2020 N(48, 1.4)  48.4 [46.9239, 49.94863]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 800

women800 <- oly %>% 
  filter(Sex=='women',Length==800) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  )
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5.90197 -1.54437 -0.08512  1.55387  7.10393 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 511.36950   76.14729   6.716 9.85e-06 ***
## Year         -0.19796    0.03837  -5.159 0.000145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.401 on 14 degrees of freedom
## Multiple R-squared: 0.6553,  Adjusted R-squared: 0.6307
## F-statistic: 26.61 on 1 and 14 DF, p-value: 0.00014512
women800%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women800)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year       Time .mean                 `80%`
##    <int> <chr> <chr>             <dbl>     <dist> <dbl>                <hilo>
## 1    800 women TSLM(Time ~ Year)  2020 N(111, 14)  111. [106.659, 116.3079]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 1500

women1500 <- oly %>% 
  filter(Sex=='women',Length==1500) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  )
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8003 -3.9012  0.7371  3.3202  6.4808 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.9926   211.3851  -0.241    0.814
## Year          0.1468     0.1060   1.384    0.196
## 
## Residual standard error: 5.071 on 10 degrees of freedom
## Multiple R-squared: 0.1608,  Adjusted R-squared: 0.07691
## F-statistic: 1.917 on 1 and 10 DF, p-value: 0.19635
women1500%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women1500)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year       Time .mean                `80%`
##    <int> <chr> <chr>             <dbl>     <dist> <dbl>               <hilo>
## 1   1500 women TSLM(Time ~ Year)  2020 N(245, 35)  245. [237.826, 253.087]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Prediction for women’s 5000

women5000 <- oly %>% 
  filter(Sex=='women',Length==5000) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1504.175   3467.321   0.434    0.687
## Year          -0.303      1.728  -0.175    0.869
## 
## Residual standard error: 28.92 on 4 degrees of freedom
## Multiple R-squared: 0.007624,    Adjusted R-squared: -0.2405
## F-statistic: 0.03073 on 1 and 4 DF, p-value: 0.86936
women5000%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women5000)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year         Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>       <dist> <dbl>                 <hilo>
## 1   5000 women TSLM(Time ~ Year)  2020 N(892, 1562)  892. [841.4729, 942.7565]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
women10000 <- oly %>% 
  filter(Sex=='women',Length==10000) %>% 
  model(TSLM(Time ~ Year)) %>%
  report() %>% 
  forecast(
    new_data(olympic_running, 1) %>%
      mutate(Year = 2020)
  ) 
## Series: Time 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.550 -11.594  -2.285   7.731  29.765 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8825.2600  1402.4211   6.293  0.00075 ***
## Year          -3.4962     0.7005  -4.991  0.00247 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.16 on 6 degrees of freedom
## Multiple R-squared: 0.8059,  Adjusted R-squared: 0.7735
## F-statistic: 24.91 on 1 and 6 DF, p-value: 0.0024746
women10000%>%
  autoplot(olympic_running) + 
  labs(title = "Forecasted Winning Race Time")

hilo(women10000)
## # A tsibble: 1 x 8 [?]
## # Key:       Length, Sex, .model [1]
##   Length Sex   .model             Year         Time .mean                  `80%`
##    <int> <chr> <chr>             <dbl>       <dist> <dbl>                 <hilo>
## 1  10000 women TSLM(Time ~ Year)  2020 N(1763, 530) 1763. [1733.513, 1792.518]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

The prediction intervals can be seen on the graphs and in the output. I am assuming that the trends will continue uninterrupted. Basically, that the past is a good guide to the future.

Question 3

souvenirs %>% 
  autoplot(Sales)

There is heteroskedasticity in this dataset. The variance increases drastically.Obviously there is high seasonality as well

logsouv <- souvenirs %>% 
  mutate(logs=log(Sales)) %>% 
  select(logs)
logsouv %>% 
  autoplot(logs)

Logging it gets rid of the heteroskedasticity

I genuinely could not figure out how to put a surfing dummy in the model. The textbook and slides gave no examples. I found methods of doing it on the internet, but I didn’t understand what was happening after reading documentation, so I neglected to put those methods in this homework.

logfit <- logsouv %>%
  model(TSLM(logs ~ trend() + season()))

augment(logfit) %>%
  ggplot(aes(x = Month)) +
  geom_line(aes(y = logs, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Sales",
       title = "Souvenir Sales") +
  guides(colour = guide_legend(title = "Series"))

logfit %>% gg_tsresiduals()

logsouv %>%
  left_join(residuals(logfit), by = "Month") %>%
  ggplot(aes(x = logs, y = .resid)) +
  geom_point() +
  labs(y = "Residuals", x = "Logged Values")

augment(logfit) %>%
  ggplot(aes(x = .fitted)) +
  geom_line(aes(y = .resid, colour = "Data"))

They seem to be relatively random when plotted against the fitted and actual values. Across time, there is significant autocorrelation.

model_augmt <- augment(logfit, souvenirs$Month)
## Warning: The `type` argument of `augment()` is deprecated as of fabletools 0.2.1.
## The type argument is now deprecated for changes to broom v0.7.0.
## Response residuals are now always found in `.resid` and innovation residuals are now found in `.innov`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
model_augmt$Month <- month(model_augmt$Month, label = FALSE)
model_res_month <- data.frame(inno=model_augmt$.innov, month=model_augmt$Month)
boxplot(inno ~ month, data=model_res_month)

The textbook doesn’t say anything about how to do this.

coefficients(logfit)
## # A tibble: 13 × 6
##    .model                          term         estim…¹ std.e…² stati…³  p.value
##    <chr>                           <chr>          <dbl>   <dbl>   <dbl>    <dbl>
##  1 TSLM(logs ~ trend() + season()) (Intercept)   7.61   7.69e-2   98.9  8.19e-78
##  2 TSLM(logs ~ trend() + season()) trend()       0.0224 8.45e-4   26.5  1.48e-38
##  3 TSLM(logs ~ trend() + season()) season()yea…  0.251  9.93e-2    2.53 1.37e- 2
##  4 TSLM(logs ~ trend() + season()) season()yea…  0.695  9.93e-2    7.00 1.18e- 9
##  5 TSLM(logs ~ trend() + season()) season()yea…  0.383  9.94e-2    3.85 2.52e- 4
##  6 TSLM(logs ~ trend() + season()) season()yea…  0.408  9.94e-2    4.11 1.06e- 4
##  7 TSLM(logs ~ trend() + season()) season()yea…  0.447  9.94e-2    4.50 2.63e- 5
##  8 TSLM(logs ~ trend() + season()) season()yea…  0.608  9.95e-2    6.12 4.69e- 8
##  9 TSLM(logs ~ trend() + season()) season()yea…  0.585  9.95e-2    5.88 1.21e- 7
## 10 TSLM(logs ~ trend() + season()) season()yea…  0.666  9.96e-2    6.69 4.27e- 9
## 11 TSLM(logs ~ trend() + season()) season()yea…  0.744  9.96e-2    7.47 1.61e-10
## 12 TSLM(logs ~ trend() + season()) season()yea…  1.20   9.97e-2   12.1  7.22e-19
## 13 TSLM(logs ~ trend() + season()) season()yea…  1.96   9.98e-2   19.6  2.12e-30
## # … with abbreviated variable names ¹​estimate, ²​std.error, ³​statistic

The trend coefficient tells the general amount of change for any given month. Basically, it is the drift method/moving average. The seasonal coefficients tell how much of an extra change for any given month.

augment(logfit) %>%
  features(.innov, ljung_box, lag = 6, dof = 0)
## # A tibble: 1 × 3
##   .model                          lb_stat     lb_pvalue
##   <chr>                             <dbl>         <dbl>
## 1 TSLM(logs ~ trend() + season())    52.6 0.00000000144
augment(logfit) %>%
  features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model                          lb_stat lb_pvalue
##   <chr>                             <dbl>     <dbl>
## 1 TSLM(logs ~ trend() + season())    64.7  4.60e-10
augment(logfit) %>%
  features(.innov, ljung_box, lag = 10, dof = 2)
## # A tibble: 1 × 3
##   .model                          lb_stat lb_pvalue
##   <chr>                             <dbl>     <dbl>
## 1 TSLM(logs ~ trend() + season())    64.7  5.50e-11

This means that the residuals are distinguishable from a white noise at virtually any significance level

logfc <- logfit %>% 
  forecast(new_data(logsouv,36))
hilo(logfc) %>% 
  print(n=36)
## # A tsibble: 36 x 6 [1M]
## # Key:       .model [1]
##    .model                     Month          logs .mean                    `80%`
##    <chr>                      <mth>        <dist> <dbl>                   <hilo>
##  1 TSLM(logs ~ trend() + … 1994 Jan N(9.5, 0.041)  9.51 [ 9.249436,  9.769091]80
##  2 TSLM(logs ~ trend() + … 1994 Feb N(9.8, 0.041)  9.78 [ 9.522873, 10.042528]80
##  3 TSLM(logs ~ trend() + … 1994 Mar  N(10, 0.041) 10.2  [ 9.989428, 10.509084]80
##  4 TSLM(logs ~ trend() + … 1994 Apr  N(10, 0.041)  9.96 [ 9.699549, 10.219204]80
##  5 TSLM(logs ~ trend() + … 1994 May  N(10, 0.041) 10.0  [ 9.747002, 10.266658]80
##  6 TSLM(logs ~ trend() + … 1994 Jun  N(10, 0.041) 10.1  [ 9.808363, 10.328019]80
##  7 TSLM(logs ~ trend() + … 1994 Jul  N(10, 0.041) 10.3  [ 9.992009, 10.511665]80
##  8 TSLM(logs ~ trend() + … 1994 Aug  N(10, 0.041) 10.3  [ 9.991539, 10.511195]80
##  9 TSLM(logs ~ trend() + … 1994 Sep  N(10, 0.041) 10.4  [10.094924, 10.614580]80
## 10 TSLM(logs ~ trend() + … 1994 Oct  N(10, 0.041) 10.5  [10.195006, 10.714662]80
## 11 TSLM(logs ~ trend() + … 1994 Nov  N(11, 0.041) 10.9  [10.676382, 11.196038]80
## 12 TSLM(logs ~ trend() + … 1994 Dec  N(12, 0.041) 11.7  [11.453895, 11.973551]80
## 13 TSLM(logs ~ trend() + … 1995 Jan N(9.8, 0.042)  9.78 [ 9.515245, 10.040714]80
## 14 TSLM(logs ~ trend() + … 1995 Feb  N(10, 0.042) 10.1  [ 9.788682, 10.314151]80
## 15 TSLM(logs ~ trend() + … 1995 Mar  N(11, 0.042) 10.5  [10.255237, 10.780707]80
## 16 TSLM(logs ~ trend() + … 1995 Apr  N(10, 0.042) 10.2  [ 9.965358, 10.490827]80
## 17 TSLM(logs ~ trend() + … 1995 May  N(10, 0.042) 10.3  [10.012811, 10.538280]80
## 18 TSLM(logs ~ trend() + … 1995 Jun  N(10, 0.042) 10.3  [10.074172, 10.599641]80
## 19 TSLM(logs ~ trend() + … 1995 Jul  N(11, 0.042) 10.5  [10.257818, 10.783287]80
## 20 TSLM(logs ~ trend() + … 1995 Aug  N(11, 0.042) 10.5  [10.257348, 10.782817]80
## 21 TSLM(logs ~ trend() + … 1995 Sep  N(11, 0.042) 10.6  [10.360733, 10.886202]80
## 22 TSLM(logs ~ trend() + … 1995 Oct  N(11, 0.042) 10.7  [10.460815, 10.986284]80
## 23 TSLM(logs ~ trend() + … 1995 Nov  N(11, 0.042) 11.2  [10.942191, 11.467660]80
## 24 TSLM(logs ~ trend() + … 1995 Dec  N(12, 0.042) 12.0  [11.719704, 12.245173]80
## 25 TSLM(logs ~ trend() + … 1996 Jan  N(10, 0.043) 10.0  [ 9.780451, 10.312939]80
## 26 TSLM(logs ~ trend() + … 1996 Feb  N(10, 0.043) 10.3  [10.053888, 10.586376]80
## 27 TSLM(logs ~ trend() + … 1996 Mar  N(11, 0.043) 10.8  [10.520444, 11.052932]80
## 28 TSLM(logs ~ trend() + … 1996 Apr  N(10, 0.043) 10.5  [10.230564, 10.763052]80
## 29 TSLM(logs ~ trend() + … 1996 May  N(11, 0.043) 10.5  [10.278017, 10.810506]80
## 30 TSLM(logs ~ trend() + … 1996 Jun  N(11, 0.043) 10.6  [10.339378, 10.871867]80
## 31 TSLM(logs ~ trend() + … 1996 Jul  N(11, 0.043) 10.8  [10.523025, 11.055513]80
## 32 TSLM(logs ~ trend() + … 1996 Aug  N(11, 0.043) 10.8  [10.522554, 11.055043]80
## 33 TSLM(logs ~ trend() + … 1996 Sep  N(11, 0.043) 10.9  [10.625939, 11.158428]80
## 34 TSLM(logs ~ trend() + … 1996 Oct  N(11, 0.043) 11.0  [10.726021, 11.258510]80
## 35 TSLM(logs ~ trend() + … 1996 Nov  N(11, 0.043) 11.5  [11.207397, 11.739886]80
## 36 TSLM(logs ~ trend() + … 1996 Dec  N(12, 0.043) 12.3  [11.984910, 12.517399]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
logfc%>% 
  autoplot(logsouv)

Honestly, I think this forecast is pretty good. Maybe if the textbook gave any indication on how to actually include the surfing dummy, I could have done that and made it better. Of course, it is entirely possible that I just missed an example.

Question 4

earlyg <- us_gasoline %>% 
  filter(year(Week)<=2004)
earlyg %>% 
  autoplot(Barrels)

K=2

gasfit2 <- earlyg %>% 
  model(TSLM(Barrels~trend()+fourier(K=2)))
augment(gasfit2) %>%
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Barrels",
       title = "Gasoline") +
  guides(colour = guide_legend(title = "Fourier, K=2"))

K=15

gasfit15 <- earlyg %>% 
  model(TSLM(Barrels~trend()+fourier(K=4)))
augment(gasfit15) %>%
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Barrels",
       title = "Gasoline") +
  guides(colour = guide_legend(title = "Fourier, K=15"))

K=26

gasfit26 <- earlyg %>% 
  model(TSLM(Barrels~trend()+fourier(K=26)))
augment(gasfit26) %>%
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Barrels",
       title = "Gasoline") +
  guides(colour = guide_legend(title = "Fourier, K=26"))

It seems to get more accurate as the fourier terms went up, but can’t ever match the full variance. This is a for loop that goes through all possible K values between 1 and 26

for (i in 1:26){
  print(glance(earlyg %>% 
    model(TSLM(Barrels~trend()+fourier(K=i)))) %>% 
    select(AICc))
  out <- paste0("This is fourier= ", i, ".")
  print(out)
}
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1809.
## [1] "This is fourier= 1."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1814.
## [1] "This is fourier= 2."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1853.
## [1] "This is fourier= 3."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1856.
## [1] "This is fourier= 4."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1862.
## [1] "This is fourier= 5."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1884.
## [1] "This is fourier= 6."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1887.
## [1] "This is fourier= 7."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1885.
## [1] "This is fourier= 8."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1882.
## [1] "This is fourier= 9."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1881.
## [1] "This is fourier= 10."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1885.
## [1] "This is fourier= 11."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1884.
## [1] "This is fourier= 12."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1881.
## [1] "This is fourier= 13."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1879.
## [1] "This is fourier= 14."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1876.
## [1] "This is fourier= 15."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1872.
## [1] "This is fourier= 16."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1871.
## [1] "This is fourier= 17."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1868.
## [1] "This is fourier= 18."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1864.
## [1] "This is fourier= 19."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1859.
## [1] "This is fourier= 20."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1855.
## [1] "This is fourier= 21."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1851.
## [1] "This is fourier= 22."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1850.
## [1] "This is fourier= 23."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1847.
## [1] "This is fourier= 24."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1843.
## [1] "This is fourier= 25."
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1851.
## [1] "This is fourier= 26."

The lowest value is K=25

gasfit25 <- earlyg %>% 
  model(TSLM(Barrels~trend()+fourier(K=25)))

glance(gasfit25) %>%
  select(AICc)
## # A tibble: 1 × 1
##     AICc
##    <dbl>
## 1 -1843.
gasfit25 %>% gg_tsresiduals()

These look quite uncorrelated to me, and normally distributed, but the ACF plot shows significant autocorrelation at various lags

augment(gasfit25) %>%
  features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
##   .model                                    lb_stat lb_pvalue
##   <chr>                                       <dbl>     <dbl>
## 1 TSLM(Barrels ~ trend() + fourier(K = 25))    20.3    0.0263

This is distinguishable from a white noise at the 5% level

gasfit25 %>% 
  forecast(new_data(earlyg,52)) %>% 
  autoplot(us_gasoline %>% filter(year(Week)<=2005) %>% filter(year(Week)>2000))

For easy viewing, I cut off the back end.

Question 5

afg_pop <- global_economy %>% 
  filter(Country=='Afghanistan') %>% 
  select(Population)
afg_pop %>% 
  autoplot(Population/1e6)+
  labs(y="Population (millions)",title="Population of Afghanistan")

Given that the Soviet-Afghan War happened in the 80s, and there is a consistent population drop during that period, I would say I can observe it.

fit_pop <- afg_pop %>%
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
  )
afg_pop %>%
  autoplot(Population) +
  geom_line(data = fitted(fit_pop),
            aes(y = .fitted, colour = .model)) +
  labs(y = "Population",
       title = "Population of Afghanistan")

fc_fit <- fit_pop %>% forecast(h=5)
afg_pop %>%
  autoplot(Population) +
  geom_line(data = fitted(fit_pop),
            aes(y = .fitted, colour = .model)) +
  autolayer(fc_fit)+
  labs(y = "Population",
       title = "Population of Afghanistan")

The piecewise forecast appears to be MUCH more accurate.

hilo(fc_fit)
## # A tsibble: 10 x 6 [1Y]
## # Key:       .model [2]
##    .model     Year          Population     .mean                  `80%`
##    <chr>     <dbl>              <dist>     <dbl>                 <hilo>
##  1 linear     2018   N(3e+07, 1.1e+13) 29919575. [25691350, 34147800]80
##  2 linear     2019   N(3e+07, 1.1e+13) 30345349. [26109842, 34580856]80
##  3 linear     2020 N(3.1e+07, 1.1e+13) 30771123. [26528105, 35014142]80
##  4 linear     2021 N(3.1e+07, 1.1e+13) 31196897. [26946139, 35447655]80
##  5 linear     2022 N(3.2e+07, 1.1e+13) 31622671. [27363947, 35881396]80
##  6 piecewise  2018   N(3.6e+07, 1e+11) 35977656. [35566510, 36388803]80
##  7 piecewise  2019   N(3.7e+07, 1e+11) 36828007. [36414420, 37241593]80
##  8 piecewise  2020 N(3.8e+07, 1.1e+11) 37678357. [37262199, 38094516]80
##  9 piecewise  2021 N(3.9e+07, 1.1e+11) 38528708. [38109847, 38947568]80
## 10 piecewise  2022 N(3.9e+07, 1.1e+11) 39379058. [38957369, 39800748]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names

Question 6

Assumption 1 - The model is linear, correctly specified, and has an additive error term This means that the coefficients are related to the variables linearly, the model includes only the variables it needs to according to theory, and adds the error term.

Assumption 2 - No Multicollinearity This means that none of the independent/exogenous/explanatory variables should be correlated highly with each other Usually, this one isn’t a problem assuming the correlation is somewhat low.

Assumption 3 - Error term has a population mean of zero If they don’t have mean of zero, the model will be systematically biased

Assumption 4 - No heteroskedasticity This means that the error terms should have constant variance If this isn’t fulfilled, the standard errors will be biased

Assumption 5 - No serial correlation/autocorrelation This means the error terms/residuals cannot be correlated with each other If this happens, this means that there is more information that can be utilized to fit the variables together

Assumption 6 - The independent/exogenous/explanatory variables must be uncorrelated with the error term If this happens, this means there is more information that can be utilized

Assumption 7 - The error terms must follow a normal distribution If this isn’t true, there is systematic bias in the model

Unbiased means that the expected value of your coefficients is equal to the true value of the coefficient Consistent means that as you add sample data, the coefficients get closer and closer to the true value Unbiased is more about specification in the model, whereas consistency is more about sensitivity to the sample

R-squared is explanatory power, what percentage of the variation in Y is explained by the model Adjusted R-squared is also explanatory power but it addresses a major weakness of R-squared, which is the ability to distort the R-squared by adding many predictor variables Adjusted R-squared accounts for the degrees of freedom, or the sample size-independent variables-y,intercent