library(fpp3)
library(tidyverse)
library(seasonal)
# Daily total demand and daily maximum temperature, January 2014
ts <- vic_elec %>%
  filter(year(Date) == 2014, month(Date) == 1) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
ts %>%
  autoplot(Demand) +
  labs(x = "Date", title = "Electricity Demand")
ts %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  labs(y = "Electricity Demand",
       x = "Maximum Temperature") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
fit_mod <- ts %>%
  model(TSLM(Demand ~ Temperature)) %>%
  report()
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
There is a positive relationship: on hot days, people use more electricity for air conditioning.
fit_mod %>% gg_tsresiduals()
The residuals show no significant autocorrelation, but they are not normally distributed and do not look random over time. There appear to be outliers on the 26th and 27th.
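One way to confirm the suspected outliers is to pull the largest residuals out directly. A minimal sketch, using augment() on the fitted model:

# A sketch: list the days whose residuals exceed two standard deviations.
augment(fit_mod) %>%
  filter(abs(.resid) > 2 * sd(.resid)) %>%
  select(Date, Demand, .resid)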
future_scenarios <- scenarios(
  Decrease = new_data(ts, 1) %>%
    mutate(Temperature = 15),
  Increase = new_data(ts, 1) %>%
    mutate(Temperature = 35),
  names_to = "Scenario"
)
fc <- forecast(fit_mod, new_data = future_scenarios)
ts %>%
  autoplot(Demand) +
  autolayer(fc) +
  labs(title = "Electricity Demand", y = "Demand")
I believe the 35-degree prediction because it seems roughly in line with the data. I don’t believe the 15-degree prediction; it falls too far outside the norm.
hilo(fc) %>% print(n=4)
## # A tsibble: 2 x 8 [1D]
## # Key: Scenario, .model [2]
## Scenario .model Date Demand .mean Tempe…¹
## <chr> <chr> <date> <dist> <dbl> <dbl>
## 1 Decrease TSLM(Demand ~ Temperatu… 2014-02-01 N(151398, 6.8e+08) 1.51e5 15
## 2 Increase TSLM(Demand ~ Temperatu… 2014-02-01 N(274484, 6.4e+08) 2.74e5 35
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>, and abbreviated variable
## # name ¹Temperature
## # ℹ Use `colnames()` to see all variable names
My prediction interval for 15 degrees is [117908.1, 184888.6] at 80% and [100179.4, 202617.3] at 95%. My prediction interval for 35 degrees is [242088.4, 306880.1] at 80% and [224939.1, 324029.4] at 95%.
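As a sanity check, an 80% normal interval is roughly the mean plus or minus 1.28 standard deviations. Using the printed (rounded) variance for the 15-degree scenario:

# Rough reconstruction of the 80% interval for N(151398, 6.8e8); the small
# differences from the printed interval come from rounding of the variance.
151398 + c(-1, 1) * qnorm(0.9) * sqrt(6.8e8)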
bts <- vic_elec %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
bts %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  labs(y = "Electricity Demand",
       x = "Maximum Temperature") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
This says my model is incorrectly specified. Over the full dataset the relationship is clearly nonlinear, and not strictly positive as my January-only model indicated: demand also rises at low temperatures (heating), so the scatter bends upward at both ends.
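A quadratic term is one simple way to let the fit bend; a sketch (TSLM accepts standard lm formula syntax, so I() should work here):

# A sketch: allow a U-shaped relationship between temperature and demand.
fit_quad <- bts %>%
  model(TSLM(Demand ~ Temperature + I(Temperature^2)))
report(fit_quad)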
oly <- olympic_running %>%
na.omit()
oly %>%
autoplot(Time)
This chart shows that men tend to be slightly faster than women, that winning times have been slowly decreasing over the years, and that longer races naturally have longer times.
oly %>%
  ggplot(aes(x = Year, y = Time)) +
  facet_grid(rows = vars(Sex)) +
  labs(y = "Finishing Time",
       x = "Year") +
  geom_point()
This faceted view separates the sexes and makes the same patterns from the first graph easier to see.
fit_oly <- lm(Time ~ ., data = oly)
fit_oly
##
## Call:
## lm(formula = Time ~ ., data = oly)
##
## Coefficients:
## (Intercept) Year Length Sexwomen
## 734.9863 -0.3905 0.1766 32.9732
On average, winning times have been decreasing by about 0.39 seconds per year, pooled across all events and both sexes.
forecast::checkresiduals(fit_oly)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals
## LM test = 179.8, df = 10, p-value < 2.2e-16
These might be the most autocorrelated residuals I have ever seen: the Breusch-Godfrey test rejects the null of no serial correlation at any reasonable significance level.
Prediction for men’s 100
men100 <- oly %>%
filter(Sex=='men',Length==100) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3443995 -0.1081360 0.0007715 0.0750701 0.9015537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.011581 2.347437 14.91 2.94e-14 ***
## Year -0.012612 0.001198 -10.52 7.24e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2321 on 26 degrees of freedom
## Multiple R-squared: 0.8099, Adjusted R-squared: 0.8026
## F-statistic: 110.8 on 1 and 26 DF, p-value: 7.2403e-11
men100 %>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men100)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 100 men TSLM(Time ~ Yea… 2020 N(9.5, 0.061) 9.53 [9.217343, 9.851671]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 200
men200 <- oly %>%
filter(Sex=='men',Length==200) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6112 -0.1872 -0.1046 0.1935 0.6959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.376498 3.553448 19.52 < 2e-16 ***
## Year -0.024881 0.001812 -13.73 3.8e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3315 on 25 degrees of freedom
## Multiple R-squared: 0.8829, Adjusted R-squared: 0.8782
## F-statistic: 188.5 on 1 and 25 DF, p-value: 3.7956e-13
men200%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men200)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 200 men TSLM(Time ~ Year) 2020 N(19, 0.13) 19.1 [18.66343, 19.57145]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 400
men400 <- oly %>%
filter(Sex=='men',Length==400) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6001 -0.5747 -0.2858 0.5751 4.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.481477 11.487522 15.02 2.52e-14 ***
## Year -0.064574 0.005865 -11.01 2.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.136 on 26 degrees of freedom
## Multiple R-squared: 0.8234, Adjusted R-squared: 0.8166
## F-statistic: 121.2 on 1 and 26 DF, p-value: 2.7524e-11
men400%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men400)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 400 men TSLM(Time ~ Year) 2020 N(42, 1.5) 42.0 [40.49022, 43.5944]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 800
men800 <- oly %>%
filter(Sex=='men',Length==800) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7306 -1.8734 -0.8449 0.6851 12.9408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 405.8457 34.6506 11.712 1.21e-11 ***
## Year -0.1518 0.0177 -8.576 6.47e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.387 on 25 degrees of freedom
## Multiple R-squared: 0.7463, Adjusted R-squared: 0.7361
## F-statistic: 73.54 on 1 and 25 DF, p-value: 6.4705e-09
men800%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men800)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 800 men TSLM(Time ~ Year) 2020 N(99, 13) 99.2 [94.59505, 103.8803]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 1500
men1500 <- oly %>%
filter(Sex=='men',Length==1500) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.302 -4.585 -1.215 1.925 27.133
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 843.43666 81.81773 10.309 1.12e-10 ***
## Year -0.31507 0.04177 -7.543 5.23e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.09 on 26 degrees of freedom
## Multiple R-squared: 0.6864, Adjusted R-squared: 0.6743
## F-statistic: 56.9 on 1 and 26 DF, p-value: 5.2345e-08
men1500%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men1500)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 1500 men TSLM(Time ~ Year) 2020 N(207, 74) 207. [195.9438, 218.0527]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 5000
men5000 <- oly %>%
filter(Sex=='men',Length==5000) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.311 -11.668 -1.096 7.515 40.596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2853.1995 205.1246 13.910 2.22e-12 ***
## Year -1.0299 0.1042 -9.881 1.50e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.66 on 22 degrees of freedom
## Multiple R-squared: 0.8161, Adjusted R-squared: 0.8078
## F-statistic: 97.64 on 1 and 22 DF, p-value: 1.4995e-09
men5000%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men5000)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 5000 men TSLM(Time ~ Year) 2020 N(773, 285) 773. [751.1888, 794.4603]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for men’s 10000
men10000 <- oly %>%
filter(Sex=='men',Length==10000) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.964 -24.222 -4.091 11.911 58.859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6965.4594 378.4635 18.41 7.52e-15 ***
## Year -2.6659 0.1923 -13.86 2.37e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.89 on 22 degrees of freedom
## Multiple R-squared: 0.8973, Adjusted R-squared: 0.8926
## F-statistic: 192.2 on 1 and 22 DF, p-value: 2.3727e-12
men10000%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(men10000)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 10000 men TSLM(Time ~ Year) 2020 N(1580, 970) 1580. [1540.432, 1620.27]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 100
women100 <- oly %>%
filter(Sex=='women',Length==100) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44833 -0.11159 0.05775 0.11731 0.36057
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.188164 3.697686 10.598 2.05e-09 ***
## Year -0.014185 0.001872 -7.577 3.72e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2232 on 19 degrees of freedom
## Multiple R-squared: 0.7513, Adjusted R-squared: 0.7382
## F-statistic: 57.4 on 1 and 19 DF, p-value: 3.7226e-07
women100%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women100)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 100 women TSLM(Time ~ Year) 2020 N(11, 0.059) 10.5 [10.22224, 10.84658]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 200
women200 <- oly %>%
filter(Sex=='women',Length==200) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93598 -0.39344 0.08043 0.37624 0.78230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89.137613 11.267724 7.911 6.41e-07 ***
## Year -0.033633 0.005685 -5.916 2.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5005 on 16 degrees of freedom
## Multiple R-squared: 0.6863, Adjusted R-squared: 0.6667
## F-statistic: 35 on 1 and 16 DF, p-value: 2.1706e-05
women200%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women200)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 200 women TSLM(Time ~ Year) 2020 N(21, 0.31) 21.2 [20.48494, 21.91454]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 400
women400 <- oly %>%
filter(Sex=='women',Length==400) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1593 -1.0142 0.1024 0.7749 1.4797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.39165 33.89755 3.817 0.00245 **
## Year -0.04008 0.01703 -2.353 0.03652 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.028 on 12 degrees of freedom
## Multiple R-squared: 0.3157, Adjusted R-squared: 0.2587
## F-statistic: 5.536 on 1 and 12 DF, p-value: 0.036523
women400%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women400)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 400 women TSLM(Time ~ Year) 2020 N(48, 1.4) 48.4 [46.9239, 49.94863]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 800
women800 <- oly %>%
filter(Sex=='women',Length==800) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.90197 -1.54437 -0.08512 1.55387 7.10393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 511.36950 76.14729 6.716 9.85e-06 ***
## Year -0.19796 0.03837 -5.159 0.000145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.401 on 14 degrees of freedom
## Multiple R-squared: 0.6553, Adjusted R-squared: 0.6307
## F-statistic: 26.61 on 1 and 14 DF, p-value: 0.00014512
women800%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women800)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 800 women TSLM(Time ~ Year) 2020 N(111, 14) 111. [106.659, 116.3079]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 1500
women1500 <- oly %>%
filter(Sex=='women',Length==1500) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8003 -3.9012 0.7371 3.3202 6.4808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.9926 211.3851 -0.241 0.814
## Year 0.1468 0.1060 1.384 0.196
##
## Residual standard error: 5.071 on 10 degrees of freedom
## Multiple R-squared: 0.1608, Adjusted R-squared: 0.07691
## F-statistic: 1.917 on 1 and 10 DF, p-value: 0.19635
women1500%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women1500)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 1500 women TSLM(Time ~ Year) 2020 N(245, 35) 245. [237.826, 253.087]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
Prediction for women’s 5000
women5000 <- oly %>%
filter(Sex=='women',Length==5000) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1504.175 3467.321 0.434 0.687
## Year -0.303 1.728 -0.175 0.869
##
## Residual standard error: 28.92 on 4 degrees of freedom
## Multiple R-squared: 0.007624, Adjusted R-squared: -0.2405
## F-statistic: 0.03073 on 1 and 4 DF, p-value: 0.86936
women5000%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women5000)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 5000 women TSLM(Time ~ Year) 2020 N(892, 1562) 892. [841.4729, 942.7565]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
women10000 <- oly %>%
filter(Sex=='women',Length==10000) %>%
model(TSLM(Time ~ Year)) %>%
report() %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
)
## Series: Time
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.550 -11.594 -2.285 7.731 29.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8825.2600 1402.4211 6.293 0.00075 ***
## Year -3.4962 0.7005 -4.991 0.00247 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.16 on 6 degrees of freedom
## Multiple R-squared: 0.8059, Adjusted R-squared: 0.7735
## F-statistic: 24.91 on 1 and 6 DF, p-value: 0.0024746
women10000%>%
autoplot(olympic_running) +
labs(title = "Forecasted Winning Race Time")
hilo(women10000)
## # A tsibble: 1 x 8 [?]
## # Key: Length, Sex, .model [1]
## Length Sex .model Year Time .mean `80%`
## <int> <chr> <chr> <dbl> <dist> <dbl> <hilo>
## 1 10000 women TSLM(Time ~ Year) 2020 N(1763, 530) 1763. [1733.513, 1792.518]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
The prediction intervals can be seen on the graphs and in the output above. The key assumption is that the linear trends continue uninterrupted, i.e., that the past is a good guide to the future. Taken literally that must eventually fail, since a linear downward trend ultimately predicts impossible (negative) winning times.
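All fourteen blocks above repeat the same pattern. Since olympic_running is keyed by Length and Sex, the same regressions can be fitted and forecast in a single call; a minimal sketch:

# A sketch: one TSLM per event, then one forecast call for 2020.
fit_all <- oly %>%
  model(TSLM(Time ~ Year))
fc_all <- fit_all %>%
  forecast(new_data(oly, 1) %>% mutate(Year = 2020))
hilo(fc_all) %>% print(n = 14)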
souvenirs %>%
autoplot(Sales)
There is heteroskedasticity in this dataset: the variance increases drastically over time. Obviously there is strong seasonality as well.
logsouv <- souvenirs %>%
  mutate(logs = log(Sales)) %>%
  select(logs)
logsouv %>%
  autoplot(logs)
Taking logs removes the heteroskedasticity; the seasonal swings become roughly constant in size on the log scale.
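As a cross-check on the choice of log, the Guerrero feature estimates a Box-Cox lambda; a value near 0 supports the log transformation. A sketch:

# A sketch: estimate the Box-Cox parameter for Sales.
souvenirs %>%
  features(Sales, features = guerrero)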
I genuinely could not figure out how to put a surfing-festival dummy in the model from the textbook or slides alone; neither gives an example. I found methods online, but I didn’t understand them well enough after reading the documentation to rely on them in this homework.
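For what it’s worth, here is one way it could be attempted. This is a sketch, assuming (per the exercise description) that the festival dummy is 1 in every March from 1988 onward and 0 otherwise; I have not verified it against a worked solution:

# A sketch: build the surfing-festival dummy and add it as a regressor.
souv_festival <- souvenirs %>%
  mutate(
    logs = log(Sales),
    festival = as.integer(month(Month) == 3 & year(Month) >= 1988)
  )
festfit <- souv_festival %>%
  model(TSLM(logs ~ trend() + season() + festival))
report(festfit)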
logfit <- logsouv %>%
model(TSLM(logs ~ trend() + season()))
augment(logfit) %>%
ggplot(aes(x = Month)) +
geom_line(aes(y = logs, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "Sales",
title = "Souvenir Sales") +
guides(colour = guide_legend(title = "Series"))
logfit %>% gg_tsresiduals()
logsouv %>%
left_join(residuals(logfit), by = "Month") %>%
ggplot(aes(x = logs, y = .resid)) +
geom_point() +
labs(y = "Residuals", x = "Logged Values")
augment(logfit) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(y = "Residuals", x = "Fitted Values")
They seem relatively random when plotted against the fitted and actual values. Across time, however, there is significant autocorrelation.
model_augmt <- augment(logfit)
model_res_month <- data.frame(
  inno = model_augmt$.innov,
  month = month(model_augmt$Month, label = FALSE)
)
boxplot(inno ~ month, data = model_res_month)
The textbook doesn’t show how to do this directly, so the boxplots above are built with base R via boxplot().
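The same boxplots can also be drawn with ggplot2; a sketch:

# A sketch: innovation residuals grouped by calendar month.
augment(logfit) %>%
  mutate(month = month(Month, label = TRUE)) %>%
  ggplot(aes(x = month, y = .innov)) +
  geom_boxplot()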
coefficients(logfit)
## # A tibble: 13 × 6
## .model term estim…¹ std.e…² stati…³ p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 TSLM(logs ~ trend() + season()) (Intercept) 7.61 7.69e-2 98.9 8.19e-78
## 2 TSLM(logs ~ trend() + season()) trend() 0.0224 8.45e-4 26.5 1.48e-38
## 3 TSLM(logs ~ trend() + season()) season()yea… 0.251 9.93e-2 2.53 1.37e- 2
## 4 TSLM(logs ~ trend() + season()) season()yea… 0.695 9.93e-2 7.00 1.18e- 9
## 5 TSLM(logs ~ trend() + season()) season()yea… 0.383 9.94e-2 3.85 2.52e- 4
## 6 TSLM(logs ~ trend() + season()) season()yea… 0.408 9.94e-2 4.11 1.06e- 4
## 7 TSLM(logs ~ trend() + season()) season()yea… 0.447 9.94e-2 4.50 2.63e- 5
## 8 TSLM(logs ~ trend() + season()) season()yea… 0.608 9.95e-2 6.12 4.69e- 8
## 9 TSLM(logs ~ trend() + season()) season()yea… 0.585 9.95e-2 5.88 1.21e- 7
## 10 TSLM(logs ~ trend() + season()) season()yea… 0.666 9.96e-2 6.69 4.27e- 9
## 11 TSLM(logs ~ trend() + season()) season()yea… 0.744 9.96e-2 7.47 1.61e-10
## 12 TSLM(logs ~ trend() + season()) season()yea… 1.20 9.97e-2 12.1 7.22e-19
## 13 TSLM(logs ~ trend() + season()) season()yea… 1.96 9.98e-2 19.6 2.12e-30
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
The trend coefficient gives the average month-to-month change in log sales, i.e., roughly a 2.2% increase in sales per month. Each seasonal coefficient gives the average difference between that month and January (the baseline season), again on the log scale.
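Because the response is logged, the coefficients translate into percentage effects. A quick worked example using the values in the table above:

# Back-transforming two coefficients from the log scale.
exp(0.0224) - 1  # trend: roughly 2.3% average sales growth per month
exp(1.96)        # December: roughly 7.1 times the January (baseline) level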
augment(logfit) %>%
features(.innov, ljung_box, lag = 6, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(logs ~ trend() + season()) 52.6 0.00000000144
augment(logfit) %>%
features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(logs ~ trend() + season()) 64.7 4.60e-10
augment(logfit) %>%
features(.innov, ljung_box, lag = 10, dof = 2)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(logs ~ trend() + season()) 64.7 5.50e-11
This means the residuals are distinguishable from white noise at virtually any significance level.
logfc <- logfit %>%
forecast(new_data(logsouv,36))
hilo(logfc) %>%
print(n=36)
## # A tsibble: 36 x 6 [1M]
## # Key: .model [1]
## .model Month logs .mean `80%`
## <chr> <mth> <dist> <dbl> <hilo>
## 1 TSLM(logs ~ trend() + … 1994 Jan N(9.5, 0.041) 9.51 [ 9.249436, 9.769091]80
## 2 TSLM(logs ~ trend() + … 1994 Feb N(9.8, 0.041) 9.78 [ 9.522873, 10.042528]80
## 3 TSLM(logs ~ trend() + … 1994 Mar N(10, 0.041) 10.2 [ 9.989428, 10.509084]80
## 4 TSLM(logs ~ trend() + … 1994 Apr N(10, 0.041) 9.96 [ 9.699549, 10.219204]80
## 5 TSLM(logs ~ trend() + … 1994 May N(10, 0.041) 10.0 [ 9.747002, 10.266658]80
## 6 TSLM(logs ~ trend() + … 1994 Jun N(10, 0.041) 10.1 [ 9.808363, 10.328019]80
## 7 TSLM(logs ~ trend() + … 1994 Jul N(10, 0.041) 10.3 [ 9.992009, 10.511665]80
## 8 TSLM(logs ~ trend() + … 1994 Aug N(10, 0.041) 10.3 [ 9.991539, 10.511195]80
## 9 TSLM(logs ~ trend() + … 1994 Sep N(10, 0.041) 10.4 [10.094924, 10.614580]80
## 10 TSLM(logs ~ trend() + … 1994 Oct N(10, 0.041) 10.5 [10.195006, 10.714662]80
## 11 TSLM(logs ~ trend() + … 1994 Nov N(11, 0.041) 10.9 [10.676382, 11.196038]80
## 12 TSLM(logs ~ trend() + … 1994 Dec N(12, 0.041) 11.7 [11.453895, 11.973551]80
## 13 TSLM(logs ~ trend() + … 1995 Jan N(9.8, 0.042) 9.78 [ 9.515245, 10.040714]80
## 14 TSLM(logs ~ trend() + … 1995 Feb N(10, 0.042) 10.1 [ 9.788682, 10.314151]80
## 15 TSLM(logs ~ trend() + … 1995 Mar N(11, 0.042) 10.5 [10.255237, 10.780707]80
## 16 TSLM(logs ~ trend() + … 1995 Apr N(10, 0.042) 10.2 [ 9.965358, 10.490827]80
## 17 TSLM(logs ~ trend() + … 1995 May N(10, 0.042) 10.3 [10.012811, 10.538280]80
## 18 TSLM(logs ~ trend() + … 1995 Jun N(10, 0.042) 10.3 [10.074172, 10.599641]80
## 19 TSLM(logs ~ trend() + … 1995 Jul N(11, 0.042) 10.5 [10.257818, 10.783287]80
## 20 TSLM(logs ~ trend() + … 1995 Aug N(11, 0.042) 10.5 [10.257348, 10.782817]80
## 21 TSLM(logs ~ trend() + … 1995 Sep N(11, 0.042) 10.6 [10.360733, 10.886202]80
## 22 TSLM(logs ~ trend() + … 1995 Oct N(11, 0.042) 10.7 [10.460815, 10.986284]80
## 23 TSLM(logs ~ trend() + … 1995 Nov N(11, 0.042) 11.2 [10.942191, 11.467660]80
## 24 TSLM(logs ~ trend() + … 1995 Dec N(12, 0.042) 12.0 [11.719704, 12.245173]80
## 25 TSLM(logs ~ trend() + … 1996 Jan N(10, 0.043) 10.0 [ 9.780451, 10.312939]80
## 26 TSLM(logs ~ trend() + … 1996 Feb N(10, 0.043) 10.3 [10.053888, 10.586376]80
## 27 TSLM(logs ~ trend() + … 1996 Mar N(11, 0.043) 10.8 [10.520444, 11.052932]80
## 28 TSLM(logs ~ trend() + … 1996 Apr N(10, 0.043) 10.5 [10.230564, 10.763052]80
## 29 TSLM(logs ~ trend() + … 1996 May N(11, 0.043) 10.5 [10.278017, 10.810506]80
## 30 TSLM(logs ~ trend() + … 1996 Jun N(11, 0.043) 10.6 [10.339378, 10.871867]80
## 31 TSLM(logs ~ trend() + … 1996 Jul N(11, 0.043) 10.8 [10.523025, 11.055513]80
## 32 TSLM(logs ~ trend() + … 1996 Aug N(11, 0.043) 10.8 [10.522554, 11.055043]80
## 33 TSLM(logs ~ trend() + … 1996 Sep N(11, 0.043) 10.9 [10.625939, 11.158428]80
## 34 TSLM(logs ~ trend() + … 1996 Oct N(11, 0.043) 11.0 [10.726021, 11.258510]80
## 35 TSLM(logs ~ trend() + … 1996 Nov N(11, 0.043) 11.5 [11.207397, 11.739886]80
## 36 TSLM(logs ~ trend() + … 1996 Dec N(12, 0.043) 12.3 [11.984910, 12.517399]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
logfc %>%
  autoplot(logsouv)
Honestly, I think this forecast is pretty good. With a surfing-festival dummy included (as sketched earlier), it could probably be made better still.
earlyg <- us_gasoline %>%
filter(year(Week)<=2004)
earlyg %>%
autoplot(Barrels)
# K = 2
gasfit2 <- earlyg %>%
  model(TSLM(Barrels ~ trend() + fourier(K = 2)))
augment(gasfit2) %>%
ggplot(aes(x = Week)) +
geom_line(aes(y = Barrels, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "Barrels",
title = "Gasoline") +
guides(colour = guide_legend(title = "Fourier, K=2"))
# K = 15
gasfit15 <- earlyg %>%
  model(TSLM(Barrels ~ trend() + fourier(K = 15)))
augment(gasfit15) %>%
ggplot(aes(x = Week)) +
geom_line(aes(y = Barrels, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "Barrels",
title = "Gasoline") +
guides(colour = guide_legend(title = "Fourier, K=15"))
# K = 26
gasfit26 <- earlyg %>%
  model(TSLM(Barrels ~ trend() + fourier(K = 26)))
augment(gasfit26) %>%
ggplot(aes(x = Week)) +
geom_line(aes(y = Barrels, colour = "Data")) +
geom_line(aes(y = .fitted, colour = "Fitted")) +
scale_colour_manual(
values = c(Data = "black", Fitted = "#D55E00")
) +
labs(y = "Barrels",
title = "Gasoline") +
guides(colour = guide_legend(title = "Fourier, K=26"))
The fit gets more accurate as the number of Fourier terms goes up, but it never matches the full variance of the series. Below is a for loop that computes the AICc for every K from 1 to 26.
for (i in 1:26) {
  print(glance(earlyg %>%
    model(TSLM(Barrels ~ trend() + fourier(K = i)))) %>%
    select(AICc))
  out <- paste0("This is fourier= ", i, ".")
  print(out)
}
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1809.
## [1] "This is fourier= 1."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1814.
## [1] "This is fourier= 2."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1853.
## [1] "This is fourier= 3."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1856.
## [1] "This is fourier= 4."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1862.
## [1] "This is fourier= 5."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1884.
## [1] "This is fourier= 6."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1887.
## [1] "This is fourier= 7."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1885.
## [1] "This is fourier= 8."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1882.
## [1] "This is fourier= 9."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1881.
## [1] "This is fourier= 10."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1885.
## [1] "This is fourier= 11."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1884.
## [1] "This is fourier= 12."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1881.
## [1] "This is fourier= 13."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1879.
## [1] "This is fourier= 14."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1876.
## [1] "This is fourier= 15."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1872.
## [1] "This is fourier= 16."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1871.
## [1] "This is fourier= 17."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1868.
## [1] "This is fourier= 18."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1864.
## [1] "This is fourier= 19."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1859.
## [1] "This is fourier= 20."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1855.
## [1] "This is fourier= 21."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1851.
## [1] "This is fourier= 22."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1850.
## [1] "This is fourier= 23."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1847.
## [1] "This is fourier= 24."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1843.
## [1] "This is fourier= 25."
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1851.
## [1] "This is fourier= 26."
Scanning the output, the lowest AICc is actually K = 7 at about -1887, not K = 25 (-1843); I continue with K = 25 below.
gasfit25 <- earlyg %>%
  model(TSLM(Barrels ~ trend() + fourier(K = 25)))
glance(gasfit25) %>%
  select(AICc)
## # A tibble: 1 × 1
## AICc
## <dbl>
## 1 -1843.
gasfit25 %>% gg_tsresiduals()
The residuals look roughly normally distributed and fairly random to my eye, but the ACF plot shows significant autocorrelation at several lags.
augment(gasfit25) %>%
features(.innov, ljung_box, lag = 10, dof = 0)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 TSLM(Barrels ~ trend() + fourier(K = 25)) 20.3 0.0263
The residuals are distinguishable from white noise at the 5% level.
gasfit25 %>%
  forecast(new_data(earlyg, 52)) %>%
  autoplot(us_gasoline %>% filter(year(Week) > 2000, year(Week) <= 2005))
For easier viewing, I restricted the plot window to 2001-2005.
afg_pop <- global_economy %>%
  filter(Country == 'Afghanistan') %>%
  select(Population)
afg_pop %>%
  autoplot(Population / 1e6) +
  labs(y = "Population (millions)", title = "Population of Afghanistan")
Given that the Soviet-Afghan War ran through the 1980s and the population drops consistently over exactly that period, I would say the effect of the war is clearly visible.
fit_pop <- afg_pop %>%
model(
linear = TSLM(Population ~ trend()),
piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
)
afg_pop %>%
autoplot(Population) +
geom_line(data = fitted(fit_pop),
aes(y = .fitted, colour = .model)) +
labs(y = "Population",
title = "Population of Afghanistan")
fc_fit <- fit_pop %>% forecast(h=5)
afg_pop %>%
autoplot(Population) +
geom_line(data = fitted(fit_pop),
aes(y = .fitted, colour = .model)) +
autolayer(fc_fit)+
labs(y = "Population",
title = "Population of Afghanistan")
The piecewise forecast appears to be MUCH more accurate.
hilo(fc_fit)
## # A tsibble: 10 x 6 [1Y]
## # Key: .model [2]
## .model Year Population .mean `80%`
## <chr> <dbl> <dist> <dbl> <hilo>
## 1 linear 2018 N(3e+07, 1.1e+13) 29919575. [25691350, 34147800]80
## 2 linear 2019 N(3e+07, 1.1e+13) 30345349. [26109842, 34580856]80
## 3 linear 2020 N(3.1e+07, 1.1e+13) 30771123. [26528105, 35014142]80
## 4 linear 2021 N(3.1e+07, 1.1e+13) 31196897. [26946139, 35447655]80
## 5 linear 2022 N(3.2e+07, 1.1e+13) 31622671. [27363947, 35881396]80
## 6 piecewise 2018 N(3.6e+07, 1e+11) 35977656. [35566510, 36388803]80
## 7 piecewise 2019 N(3.7e+07, 1e+11) 36828007. [36414420, 37241593]80
## 8 piecewise 2020 N(3.8e+07, 1.1e+11) 37678357. [37262199, 38094516]80
## 9 piecewise 2021 N(3.9e+07, 1.1e+11) 38528708. [38109847, 38947568]80
## 10 piecewise 2022 N(3.9e+07, 1.1e+11) 39379058. [38957369, 39800748]80
## # … with 1 more variable: `95%` <hilo>
## # ℹ Use `colnames()` to see all variable names
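The in-sample fit statistics can also be compared directly; a sketch (glance() on a TSLM mable reports AICc and leave-one-out CV, which I would expect to favour the piecewise model):

# A sketch: compare fit statistics across the two specifications.
glance(fit_pop) %>%
  select(.model, r_squared, AICc, CV)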
Assumption 1 - The model is linear in its coefficients, correctly specified, and has an additive error term. This means the coefficients enter the model linearly, the model includes exactly the variables theory calls for, and the error is simply added on.
Assumption 2 - No multicollinearity. None of the independent/exogenous/explanatory variables should be highly correlated with each other. Usually this isn’t a problem as long as the correlations are reasonably low.
Assumption 3 - The error term has a population mean of zero. If it doesn’t, the intercept absorbs the nonzero mean; in a model without an intercept, the estimates are systematically biased.
Assumption 4 - No heteroskedasticity. The error terms should have constant variance. If this fails, the standard errors are biased.
Assumption 5 - No serial correlation/autocorrelation. The error terms/residuals cannot be correlated with each other. If they are, there is leftover information in the residuals that the model should be using, and the standard errors are unreliable.
Assumption 6 - The independent/exogenous/explanatory variables must be uncorrelated with the error term. If this fails (endogeneity), the coefficient estimates are biased and inconsistent.
Assumption 7 - The error terms follow a normal distribution. If this isn’t true the estimates are still unbiased, but small-sample inference (t-tests, F-tests, prediction intervals) becomes unreliable.
Unbiased means that the expected value of the coefficient estimator equals the true coefficient. Consistent means that as the sample grows, the estimates converge to the true value. Unbiasedness is mostly about correct specification of the model, whereas consistency is about how the estimator behaves as the sample gets large.
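A toy simulation illustrates the consistency half of this (hypothetical data; the true slope is fixed at 2):

# A sketch: OLS slope estimates concentrate around the true value (2)
# as the sample size grows.
set.seed(1)
for (n in c(20, 200, 2000)) {
  x <- rnorm(n)
  y <- 2 * x + rnorm(n)
  print(coef(lm(y ~ x))[["x"]])
}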
R-squared measures explanatory power: the percentage of the variation in Y explained by the model. Adjusted R-squared measures the same thing but addresses a major weakness of R-squared, namely that it can be inflated simply by adding more predictor variables. Adjusted R-squared penalises by the residual degrees of freedom, n - k - 1, where n is the sample size, k is the number of independent variables, and the 1 accounts for the intercept.