Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
global_economy %>%
tsibble(key = Code, index = Year)%>%
autoplot(GDP/Population, show.legend = FALSE) +
labs(title= "GDP per capita",
y = "USD")
## Warning: Removed 3242 row(s) containing missing values (geom_path).
maximun_value <- global_economy %>%
mutate(sum = GDP/Population)
grouped <- setkey(setDT(maximun_value), Country)[,list(sum=sum(sum)),by=list(Country)]
maximun_value <- grouped[which.max(grouped$sum),]
maximun_value
## Country sum
## 1: Luxembourg 2398911
Luxembourgh has the highest GDP per capita considering the sum of whole historic data, now in the last year (2017).
a<-as.data.frame(global_economy)
a$percapita <- a$GDP/a$Population
just_2017<-a[a$Year=="2017",]
just_2017_ordered=just_2017[order(just_2017$percapita, decreasing = TRUE), ]
head(just_2017_ordered)
## Country Code Year GDP Growth CPI Imports
## 8404 Luxembourg LUX 2017 62404461275 2.297941 111.41323 193.96982
## 8462 Macao SAR, China MAC 2017 50361201096 9.096104 136.09909 32.00998
## 13440 Switzerland CHE 2017 678887336848 1.086792 98.26686 53.92747
## 10492 Norway NOR 2017 398831956478 1.918746 114.55072 33.05191
## 6084 Iceland ISL 2017 23909289979 3.642227 121.95696 42.84878
## 6606 Ireland IRL 2017 333730764773 7.802382 105.07959 87.87832
## Exports Population percapita
## 8404 230.01643 599449 104103.04
## 8462 79.39530 622567 80892.82
## 13440 64.97673 8466017 80189.70
## 10492 35.47242 5282223 75504.57
## 6084 46.96383 341284 70056.87
## 6606 120.01253 4813608 69330.69
Now in 2016.
a<-as.data.frame(global_economy)
a$percapita <- a$GDP/a$Population
just_2016<-a[a$Year=="2016",]
just_2016_ordered=just_2016[order(just_2016$percapita, decreasing = TRUE), ]
head(just_2016_ordered)
## Country Code Year GDP Growth CPI Imports
## 9505 Monaco MCO 2016 6468252212 3.2138488 NA NA
## 8113 Liechtenstein LIE 2016 6214633651 NA NA NA
## 8403 Luxembourg LUX 2016 58631324559 3.0826433 109.5177 186.16333
## 13439 Switzerland CHE 2016 668745279605 1.3758845 97.7451 54.58890
## 6663 Isle of Man IMN 2016 6592627599 7.4000000 NA NA
## 8461 Macao SAR, China MAC 2016 45310877913 -0.8631188 134.4500 34.51534
## Exports Population percapita
## 9505 NA 38499 168010.91
## 8113 NA 37666 164993.19
## 8403 221.26778 582014 100738.68
## 13439 65.81131 8373338 79866.03
## 6663 NA 83737 78730.16
## 8461 76.07221 612167 74017.18
The world leadership in terms of GDP per capita has been a competition between 4 countries,Monaco, Liechtenstein, Luxembourg and Switzerland.
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
-United States GDP from global_economy. -Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock. -Victorian Electricity Demand from vic_elec. -Gas production from aus_production.
First we plot the GDP without any adjustment, then we adjust by population
global_economy %>%
filter(Country == "United States") %>%
autoplot(GDP)+
labs(title = "United States GDP", y = "USD")+
theme_replace()+
geom_line(col = "#8DC4E9")
global_economy %>%
filter(Country == "United States") %>%
tsibble(key = Code, index = Year)%>%
autoplot(GDP/Population, show.legend = FALSE)+
labs(title = "United States GDP per capita", y = "USD")+
theme_replace()+
geom_line(col = "#8DC4E9")
As we are just graphing US is not important the population adjustment, but we applied it because it lets to compare the value directly with other counstries.
aus_livestock %>%
filter(Animal == "Bulls, bullocks and steers",
State == "Victoria")%>%
autoplot(Count)+
labs(title = "Slaughter of Victorian “Bulls, bullocks and steers”")+
theme_replace()+
geom_line(col = "#1B89D3")
No adjustment was considered.
vic_elec %>%
autoplot(Demand)+
labs(title = "Victorian Electricity Demand",
y = "Demand")+
theme_replace()+
geom_line(col = "#D63B22")
No adjustment was considered.
aus_production %>%
autoplot(Gas)+
labs(title = "Gas production")+
theme_replace()+
geom_line(col = "#30D622")
Here, as the book says, it will be useful if we apply a transformation because the variation increases with the level of the serie.
lambda <- aus_production %>%
features(Gas, features = guerrero) %>%
pull(lambda_guerrero)
aus_production %>%
autoplot(box_cox(Gas, lambda)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed gas production with $\\lambda$ = ",
round(lambda,2))))
Here it is being applied the Guerrero Method to select the lambda, which minimises the coefficient of variation for subseries of x. We can see that the transformation made the variation steady along the serie.
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
canadian_gas %>%
autoplot(Volume)+
labs(title = "Canadian Gas Production",
y = "Monthly Gas Production (billions of cubic meter)")+
theme_replace()+
geom_line(col = "#CE22D6")
The Gas production shows a seasonal variance that is low between 1960 and 1978, then is large between 1978 and 1988 and is low again between 1988 and 2005, so is expected that the Box Cox transformation will be unhelpful. Here the result of the transformation.
lambda_cangas <- canadian_gas %>%
features(Volume, features = guerrero) %>%
pull(lambda_guerrero)
canadian_gas %>%
autoplot(box_cox(Volume, lambda = lambda_cangas))+
labs(title = latex2exp::TeX(paste0(
"Box Cox Transformation of Gas Production with $\\lambda$ = ",
round(lambda_cangas,2))))+
theme_replace()+
geom_line(col = "#1B89D3")
The graph shows that the transformation was unhelpful.
What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?
set.seed(569973)
myseries <- aus_retail %>%
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
autoplot(Turnover)+
labs(title = "Australian Retail Trade Turnover")+
theme_replace()+
geom_line(col = "#C70039")
The graph shows that the seasonal variation increases with time, so it will be expected that a Box Cox transformation will be helpful in this case. Here the result.
lambda_retail <- myseries %>%
features(Turnover, features = guerrero) %>%
pull(lambda_guerrero)
myseries %>%
autoplot(box_cox(Turnover, lambda_retail))+
labs(title = latex2exp::TeX(paste0(
"Box Cox Transformation of Australian Retail Trade Turnover with $\\lambda$ = ",
round(lambda_retail,2))))+
theme_replace()+
geom_line(col = "#C70039")
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.
aus_production %>%
autoplot(Tobacco)+
labs(title = "Tobacco")+
theme_replace()+
geom_line(col = "#FFC300")
## Warning: Removed 24 row(s) containing missing values (geom_path).
## Removed 24 row(s) containing missing values (geom_path).
Box Cox transformation
lambda_tobacco <- aus_production %>%
features(Tobacco, features = guerrero) %>%
pull(lambda_guerrero)
aus_production %>%
autoplot(box_cox(Tobacco, lambda_tobacco)) +
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Tobacco with $\\lambda$ = ",
round(lambda_tobacco,2))))+
theme_replace()+
geom_line(col = "#FFC300")
## Warning: Removed 24 row(s) containing missing values (geom_path).
## Removed 24 row(s) containing missing values (geom_path).
ansett %>%filter(Class == "Economy",Airports == "MEL-SYD")%>%
mutate(Passengers = Passengers/1000)%>%
autoplot(Passengers)+
labs(title = "Ansett Airlines Economy Class Passengers")+
theme_replace()+
geom_line(col = "#FFC300")
Box Cox transformation
lambda_class <- ansett %>%
filter(Class == "Economy",
Airports == "MEL-SYD")%>%
features(Passengers, features = guerrero) %>%
pull(lambda_guerrero)
ansett %>%
filter(Class == "Economy",
Airports == "MEL-SYD")%>%
mutate(Passengers = Passengers/1000) %>%
autoplot(box_cox(Passengers, lambda = lambda_class)) +
labs(y = "Passengers ('000)",
title = latex2exp::TeX(paste0(
"Transformed Ansett Airlines Economy Class with $\\lambda$ = ",
round(lambda_class,2))),
subtitle = "Melbourne-Sydney")+
theme_replace()+
geom_line(col = "#FFC300")
pedestrian %>%filter(Sensor == "Southern Cross Station")%>%
autoplot(Count)+
labs(title = "Pedestrian Counts at Southern Cross Station")+
theme_replace()+
geom_line(col = "#FFC300")
Box Cox Transformation
lambda_count <- pedestrian %>%
filter(Sensor == "Southern Cross Station") %>%
features(Count, features = guerrero) %>%
pull(lambda_guerrero)
pedestrian %>%
filter(Sensor == "Southern Cross Station") %>%
autoplot(box_cox(Count,lambda_count))+
labs(y = "",
title = latex2exp::TeX(paste0(
"Transformed Pedestrian Counts at Southern Cross Station with $\\lambda$ = ",
round(lambda_count,2))))+
theme_replace()+
geom_line(col = "#FFC300")
Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
The demonstration is as follows.
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) %>% dplyr::select(Gas)
head(gas)
## # A tsibble: 6 x 2 [1Q]
## Gas Quarter
## <dbl> <qtr>
## 1 221 2005 Q3
## 2 180 2005 Q4
## 3 171 2006 Q1
## 4 224 2006 Q2
## 5 233 2006 Q3
## 6 192 2006 Q4
gas %>%
autoplot()+
labs(title = "Last Five Years of The Gas Data")+
theme_replace()+
geom_line(col = "#581845")
## Plot variable not specified, automatically selected `.vars = Gas`
There is a increasing trend and a seasonality of 1 year.
gas %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components() %>%
autoplot() +
ggtitle("Last Five Years of The Gas Data")
## Warning: Removed 2 row(s) containing missing values (geom_path).
descom<-gas %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components()
a=mean(descom$seasonal)
b =mean(descom$trend,na.rm=TRUE)
cat(" The calculated seasonality is ", a, " and the trend is", b)
## The calculated seasonality is 1 and the trend is 216.3203
Yes, with the graphical interpretation we saw a seasonality of 1 the same that we obtained in the numerical decomposition. Also we obtained a positive trend, the same that we saw in the graphical interpretation.
gas_decom <- gas %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components()
gas_decom %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Gas, colour = "Data")) +
geom_line(aes(y = season_adjust,
colour = "Seasonally Adjusted")) +
geom_line(aes(y = trend, colour = "Trend")) +
labs(y = "",
title = "Last Five Years of The Gas Data") +
scale_colour_manual(
values = c("gray", "#0072B2", "#D55E00"),
breaks = c("Data", "Seasonally Adjusted", "Trend")
)
## Warning: Removed 4 row(s) containing missing values (geom_path).
gas2 <- gas
gas2$Gas[11] <- gas2$Gas[11]+300
gas2 %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components() %>%
autoplot() +
ggtitle("Last Five Years of The Gas Data with outlier")
## Warning: Removed 2 row(s) containing missing values (geom_path).
gas2 %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components() %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Gas, colour = "Data")) +
geom_line(aes(y = season_adjust,
colour = "Seasonally Adjusted")) +
geom_line(aes(y = trend, colour = "Trend")) +
labs(y = "",
title = "Last Five Years of The Gas Data with 300 added ") +
scale_colour_manual(
values = c("gray", "#0072B2", "#D55E00"),
breaks = c("Data", "Seasonally Adjusted", "Trend")
)
## Warning: Removed 4 row(s) containing missing values (geom_path).
It is observed that the outlier caused a spike in the seasonally adjusted data.The addition of 300 to the 11th observation has not effect in the seasonality of the sere, this is because the seasonal component is uniform for each year and only one data point has changed.
In terms of the trend, it caused a decreasing trend from early 2008 until middle 2008
gas3 <- gas
gas3$Gas[20] <- gas3$Gas[10]+300
gas3 %>%
model(classical_decomposition(Gas,type = "multiplicative")) %>%
components() %>%
autoplot() +
ggtitle("Last Five Years of The Gas Data with outlier ")
## Warning: Removed 2 row(s) containing missing values (geom_path).
The seasonal data is not affected, also the trend is more like the graph without any outlier, but it shows an increasing trend at the end of the serie.
Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
myseries %>%
model(classical_decomposition(Turnover,type = "multiplicative")) %>%
components() %>%
autoplot() +
ggtitle("Multiplicative decomposition of my retail time series")
## Warning: Removed 6 row(s) containing missing values (geom_path).
#### X11 decomposition.
Here is the X11 decomposition, the ggfortify package is detached since it is over-writing the autoplot() function.
data<-ts(myseries$Turnover,start=c(1982,4), end=c(2000,12), frequency=12)
#detach(ggfortify)
library(quantmod)
## Warning: package 'quantmod' was built under R version 4.0.5
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
##
## Attaching package: 'zoo'
## The following object is masked from 'package:tsibble':
##
## index
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:data.table':
##
## first, last
## The following objects are masked from 'package:dplyr':
##
## first, last
## Loading required package: TTR
## Warning: package 'TTR' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(forecast)
## Warning: package 'forecast' was built under R version 4.0.5
## Registered S3 methods overwritten by 'forecast':
## method from
## autoplot.Arima ggfortify
## autoplot.acf ggfortify
## autoplot.ar ggfortify
## autoplot.bats ggfortify
## autoplot.decomposed.ts ggfortify
## autoplot.ets ggfortify
## autoplot.forecast ggfortify
## autoplot.stl ggfortify
## autoplot.ts ggfortify
## fitted.ar ggfortify
## fortify.ts ggfortify
## residuals.ar ggfortify
##
## Attaching package: 'forecast'
## The following object is masked _by_ '.GlobalEnv':
##
## gas
library(ggplot2)
library(seasonal)
## Warning: package 'seasonal' was built under R version 4.0.5
##
## Attaching package: 'seasonal'
## The following object is masked from 'package:tibble':
##
## view
data %>% seas(x11="") -> fit
autoplot(fit) +
ggtitle("X11 decomposition of my retail time series")
In terms of time, the multiplicative was less computational expense than the X11 decomposition. On the other hand, the X 11 decomposition trend has captured the sudden fall in the 2000-2010, and also a little change in the seasonality.
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.
1.The trend component shows that the trend has increased thought the majority of the time. 2. The seasonal component shows that it is an increase employment at the end of the year and a decrease behavior at the beginning. 3. There is a random valley in 1991.
Yes, we see a decrease in the serie of employment during 1991/1992 that is not explained by seasonality or the positive trend, because it is explained by the random component.
This exercise uses the canadian_gas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).
autoplot()
canadian_gas %>%
autoplot(Volume)+
labs(title = "Monthly Canadian Gas Production",
subtitle = "autoplot()",
y = "billions of cubic meter")+
theme_replace()+
geom_line(col = "#C94FC7")
gg_subseries()
canadian_gas %>%
gg_subseries(Volume)+
labs(title = "Monthly Canadian Gas Production",
subtitle = "gg_subseries()",
y = "billions of cubic meter")
gg_season
canadian_gas %>%
gg_season(Volume)+
labs(title = "Monthly Canadian Gas Production",
subtitle = "gg_season()",
y = "billions of cubic meter")
The plots show a clear seasonality and a trend, Season plot shows this behavior decreasing between march and June and increasing between June and February.
canadian_gas %>%
model(
STL(Volume ~ trend(window = 21) +
season(window = 13),
robust = TRUE)) %>%
components() %>%
autoplot()+
labs(title = "STL decomposition of Canadian Gas Production")
The trend represent the the trend of the original data. Between 1975 and 1985 the seasonal component increases and then decreases.
As shown above, the seasonal shape is flat from beginning and then as the time goes by the seasonal shape increases. Between 1980 and 1990 it increases, and then decrease but changing the timing.
canadian_gas %>%
model(
STL(Volume ~ trend(window = 21) +
season(window = 13),
robust = TRUE)) %>%
components() %>%
ggplot(aes(x = Month)) +
geom_line(aes(y = Volume, colour = "Data")) +
geom_line(aes(y = season_adjust,
colour = "Seasonally Adjusted")) +
geom_line(aes(y = trend, colour = "Trend")) +
labs(title = "STL decomposition of Canadian Gas Production") +
scale_colour_manual(
values = c("gray", "#0072B2", "#D55E00"),
breaks = c("Data", "Seasonally Adjusted", "Trend")
)
canadian_gas %>%
model(x11 = X_13ARIMA_SEATS(Volume ~ x11())) %>%
components() %>%
autoplot()+
labs(title = "X-11 decomposition of Canadian Gas Production")
canadian_gas %>%
model(seats = X_13ARIMA_SEATS(Volume ~ seats())) %>%
components() %>%
autoplot() +
labs(title ="SEATS Decomposition of Canadian Gas Production")
The decomposed trend and seasonal components are similar. The random component of SEAT is greater than of the X11.
Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.
jan14_vic_elec <- vic_elec %>%
filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
index_by(Date = as_date(Time)) %>%
summarise(
Demand = sum(Demand),
Temperature = max(Temperature)
)
jan14_vic_elec %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
ggtitle("Electricity Demand for Victoria, Australia (2014)") +
ylab("Demand") +
xlab("Temperature (Celsius)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
jan14_vic_elec %>%
model(TSLM(Demand~Temperature)) %>%
report()
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
The report displays a positive coefficient for temperature, this means that Demand demand grows if temperature grows, therefore there is a positive relationship. This happens because when you have high temperature people tend to use more air conditioners and are ofter more in home.
jan14_vic_elec %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
ylab("Electricity Demand") +
xlab("Temperature") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
tslm_Dem_Temp<-jan14_vic_elec %>%
model(TSLM(Demand~Temperature)) %>%
report()
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
tslm_Dem_Temp %>% gg_tsresiduals()
The variability of the residual increases as time goes by. Also, The majority of residual are around 0, but also is shown some outliers with values of -30000 and 30000.
#scenarios creation.
new_cons <- scenarios(
"Temperature of 15C" = new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 15),
"Temperature of 35C" = new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 35),
names_to = "Scenario"
)
#forescast.
fcast <- forecast(tslm_Dem_Temp, new_cons)
fcast$.mean
## [1] 151398.4 274484.2
print("Demand")
## [1] "Demand"
print(mean(jan14_vic_elec$Demand))
## [1] 231622.6
print("Temperature")
## [1] "Temperature"
print(mean(jan14_vic_elec$Temperature))
## [1] 28.03548
Here we apply a graphic approach.
jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
forecast(
new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 15)
) %>%
autoplot(jan14_vic_elec)
jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature)) %>%
forecast(
new_data(jan14_vic_elec, 1) %>%
mutate(Temperature = 35)
) %>%
autoplot(jan14_vic_elec)
I trust the forecast of the demand with a temperature of 35, but I do not trust the forecast of the demand with a temperature of 15. In the image you can see that the demands less than 20000 are outliers.
plot(jan14_vic_elec$Temperature)
And plotting the temperature we can see that a temperature of 15C is an outlier, thus I partially believe in the model.
#scenarios creation.
fit <- lm(Demand ~ Temperature, data=jan14_vic_elec)
forecast(fit, newdata=data.frame(Temperature=c(15,35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 151398.35 151398.4 117127.2 185669.5 97951.22 204845.5
## 274484.25 274484.2 241333.0 307635.5 222783.69 326184.8
#scenarios creation.
plot(Demand~Temperature, data=vic_elec, main= "Demand vs. Temperature for All Data in vic_elec")
There is a clear relationship between Temperature and Demand. More temperature more demand from 20C, before the relationship seems neutral or a little bit negative.
Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.
autoplot(olympic_running)
## Plot variable not specified, automatically selected `.vars = Time`
How is expected the winning time is decreasing as the time goes. Also, between some year there is not data so we can assume that it was for world wars.
Here we are going to plot 3 events. This because the events have the same behaviour.
mens=olympic_running[olympic_running$Sex=="men",]
women2=olympic_running[olympic_running$Sex=="women",]
mens100=mens[mens$Length=="100",]
women100=women2[women2$Length=="100",]
mens800=mens[mens$Length=="800",]
Mens100
mens100<-ts(mens100$Time,start=c(1896), end=c(2015), frequency=1)
#Interpolate missing values (No transformation is required)
df <- na.interp(mens100, lambda = NULL)
t <- time(df)
fit.mens100 <- tslm(df~t, data = mens100)
autoplot(mens100, series = "Data") +
autolayer(fitted(fit.mens100), series = "Fitted") + ggtitle({"Winning Times in Olympic Men's 100m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds") + scale_color_manual(values = c("black", "red"))
Mens800
mens800<-ts(mens800$Time,start=c(1896), end=c(2015), frequency=1)
#Interpolate missing values (No transformation is required)
df <- na.interp(mens800, lambda = NULL)
t <- time(df)
fit.mens800 <- tslm(df~t, data = mens800)
autoplot(mens800, series = "Data") +
autolayer(fitted(fit.mens800), series = "Fitted") + ggtitle({"Winning Times in Olympic Men's 800m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds") + scale_color_manual(values = c("black", "red"))
Women100
Women100<-ts(women100$Time,start=c(1896), end=c(2015), frequency=1)
#Interpolate missing values (No transformation is required)
df <- na.interp(Women100, lambda = NULL)
t <- time(df)
fit.women100 <- tslm(df~t, data = women100)
autoplot(women100, series = "Data") +
autolayer(fitted(fit.women100), series = "Fitted") + ggtitle({"Winning Times in Olympic Women 100m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds") + scale_color_manual(values = c("black", "red"))
## Plot variable not specified, automatically selected `.vars = Time`
## Warning: Ignoring unknown parameters: series
This three events show a negative slope of the linear model, so this demonstrate that in each year the winning time is decreasing.
Here we are going to see the residual plots of the three events that were used previously.
Mens100
fit <- olympic_running %>%
filter(Sex=="men") %>%
filter(Length=="100") %>%
model(TSLM(Time ~ Year))
fit %>% gg_tsresiduals()
## Warning: Removed 3 rows containing missing values (geom_point).
## Warning: Removed 3 rows containing non-finite values (stat_bin).
Mens800
fit <- olympic_running %>%
filter(Sex=="men") %>%
filter(Length=="800") %>%
model(TSLM(Time ~ Year))
fit %>% gg_tsresiduals()
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing non-finite values (stat_bin).
Women100
fit <- olympic_running %>%
filter(Sex=="women") %>%
filter(Length=="100") %>%
model(TSLM(Time ~ Year))
fit %>% gg_tsresiduals()
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Mens100
olympic_running %>%
filter(Sex=="men") %>%
filter(Length=="100") %>%
model(TSLM(Time ~ Year)) %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
) %>%
autoplot(olympic_running)
women100
olympic_running %>%
filter(Sex=="women") %>%
filter(Length=="100") %>%
model(TSLM(Time ~ Year)) %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
) %>%
autoplot(olympic_running)
Men800
olympic_running %>%
filter(Sex=="men") %>%
filter(Length=="800") %>%
model(TSLM(Time ~ Year)) %>%
forecast(
new_data(olympic_running, 1) %>%
mutate(Year = 2020)
) %>%
autoplot(olympic_running)
The result shows that the forecasted winning time will less for for each event in 2020.the assumption is that whole events will have the same beahivour orrelatively similar.
An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x).Mathematically, the elasticity is defined as (dy/dx) X (x/y). Consider the log-log model.
Express y as a function of x and show that coefficient B1 is the elasticity coefficient.
### Exercise 4.
The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
####a.Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(souvenirs)
## Plot variable not specified, automatically selected `.vars = Sales`
There is a clear trend and seasonality, in terms of unexpected fluctuations, there is a fluctuation between 1993 and 1994 that seems to be unusual.
Variability increases with time, so applying a logarithmic transformation can give good results trying to control this variation and normalizing it
souvenirs2<- ts(souvenirs$Sales,start=c(1987,1), end=c(1993,6), frequency=12)
log.souvenirs = log(souvenirs2)
dummy.fest = rep(0, length(souvenirs2))
dummy.fest[seq_along(dummy.fest)%%12 == 3] = 1
dummy.fest[3] = 0
dummy.fest = ts(dummy.fest, freq = 12, start=c(1987,1))
new.data = data.frame(log.souvenirs, dummy.fest)
fit = tslm(log.souvenirs ~ trend + season + dummy.fest, data=new.data)
fore.data = data.frame(dummy.fest = rep(0, 12))
fore.data[3,] = 1
forecast(fit, newdata=fore.data)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jul 1993 9.883136 9.637329 10.128943 9.503919 10.262353
## Aug 1993 9.867772 9.621965 10.113579 9.488555 10.246989
## Sep 1993 10.531942 10.191069 10.872815 10.006062 11.057822
## Oct 1993 10.092605 9.846798 10.338412 9.713388 10.471822
## Nov 1993 10.585189 10.339382 10.830995 10.205971 10.964406
## Dec 1993 11.357556 11.111749 11.603363 10.978339 11.736773
## Jan 1994 9.430933 9.186093 9.675774 9.053207 9.808659
## Feb 1994 9.704370 9.459530 9.949210 9.326644 10.082096
## Mar 1994 9.695742 9.365756 10.025727 9.186658 10.204825
## Apr 1994 9.881046 9.636206 10.125887 9.503320 10.258772
## May 1994 9.928500 9.683659 10.173340 9.550774 10.306226
## Jun 1994 9.989861 9.745020 10.234701 9.612135 10.367587
autoplot(fit$residuals)
The plot shows that the residuals are around 0, therefore the logarithmic transformation was a useful transformation.
boxplot(residuals(fit)~cycle(residuals(fit)))
Yes, some months have more residual variance so the model is not fit to certain months, for example month 8 and 10.
fit$coefficients
## (Intercept) trend season2 season3 season4 season5
## 7.6662400 0.0207611 0.2526756 0.2232860 0.3878297 0.4145219
## season6 season7 season8 season9 season10 season11
## 0.4551219 0.5767690 0.5406444 0.6296713 0.7239552 1.1957774
## season12 dummy.fest
## 1.9473841 0.5543818
There is a positive relationship of all season and there are significant jumps in season 11 and 12.
checkresiduals(fit)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 31.535, df = 17, p-value = 0.01718
There is a p-value lower than 0.05 , which implies that the data can be modeled with a logarithm.
fore.data = data.frame(dummy.fest = rep(0, 36))
preds = forecast(fit, newdata=fore.data)
preds
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jul 1993 9.883136 9.637329 10.128943 9.503919 10.262353
## Aug 1993 9.867772 9.621965 10.113579 9.488555 10.246989
## Sep 1993 9.977560 9.731753 10.223367 9.598343 10.356777
## Oct 1993 10.092605 9.846798 10.338412 9.713388 10.471822
## Nov 1993 10.585189 10.339382 10.830995 10.205971 10.964406
## Dec 1993 11.357556 11.111749 11.603363 10.978339 11.736773
## Jan 1994 9.430933 9.186093 9.675774 9.053207 9.808659
## Feb 1994 9.704370 9.459530 9.949210 9.326644 10.082096
## Mar 1994 9.695742 9.365756 10.025727 9.186658 10.204825
## Apr 1994 9.881046 9.636206 10.125887 9.503320 10.258772
## May 1994 9.928500 9.683659 10.173340 9.550774 10.306226
## Jun 1994 9.989861 9.745020 10.234701 9.612135 10.367587
## Jul 1994 10.132269 9.883394 10.381144 9.748319 10.516219
## Aug 1994 10.116905 9.868031 10.365780 9.732955 10.500856
## Sep 1994 10.226693 9.977819 10.475568 9.842743 10.610644
## Oct 1994 10.341738 10.092864 10.590613 9.957788 10.725689
## Nov 1994 10.834322 10.585447 11.083197 10.450372 11.218272
## Dec 1994 11.606690 11.357815 11.855564 11.222739 11.990640
## Jan 1995 9.680067 9.431764 9.928369 9.296999 10.063134
## Feb 1995 9.953503 9.705201 10.201806 9.570436 10.336570
## Mar 1995 9.944875 9.610605 10.279144 9.429182 10.460567
## Apr 1995 10.130180 9.881877 10.378482 9.747112 10.513247
## May 1995 10.177633 9.929330 10.425935 9.794566 10.560700
## Jun 1995 10.238994 9.990691 10.487296 9.855927 10.622061
## Jul 1995 10.381402 10.128745 10.634059 9.991617 10.771188
## Aug 1995 10.366039 10.113381 10.618696 9.976253 10.755824
## Sep 1995 10.475827 10.223169 10.728484 10.086041 10.865612
## Oct 1995 10.590872 10.338214 10.843529 10.201086 10.980657
## Nov 1995 11.083455 10.830798 11.336112 10.693669 11.473240
## Dec 1995 11.855823 11.603165 12.108480 11.466037 12.245608
## Jan 1996 9.929200 9.676730 10.181669 9.539704 10.318696
## Feb 1996 10.202636 9.950167 10.455106 9.813141 10.592132
## Mar 1996 10.194008 9.854949 10.533067 9.670926 10.717090
## Apr 1996 10.379313 10.126843 10.631782 9.989817 10.768809
## May 1996 10.426766 10.174296 10.679236 10.037270 10.816262
## Jun 1996 10.488127 10.235658 10.740597 10.098631 10.877623
As was presented, doing a logarithmic transformation. Because twe have to create a model that has the capacity to increase the variability over the time.
The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.
autoplot(us_gasoline)
## Plot variable not specified, automatically selected `.vars = Barrels`
head(us_gasoline)
## # A tsibble: 6 x 2 [1W]
## Week Barrels
## <week> <dbl>
## 1 1991 W06 6.62
## 2 1991 W07 6.43
## 3 1991 W08 6.58
## 4 1991 W09 7.22
## 5 1991 W10 6.88
## 6 1991 W11 6.95
gas <- ts(us_gasoline$Barrels, start=(1991-1), end=(2017-1),frequency = 52)
#y <- ts(z, start=2003, frequency=12)
gas <- window(gas, end = 2005)
for(num in c(1, 2, 3, 5, 10)){
var_name <- paste("fit",
as.character(num),
"_gasoline_2004",
sep = "")
assign(var_name,
tslm(gas ~ trend + fourier(
gas,
K = num
))
)
print(
autoplot(gas) +
autolayer(get(var_name)$fitted.values,
series = as.character(num)) +
ggtitle(var_name) +
ylab("gasoline") +
guides(colour = guide_legend(title = "Number of Fourier Transform pairs"))
)
}
Using higher K values in the fourier transformationthe result displays that the lines are more fitted.
As the book says we have to minimize AICc and CV so we will calculate it.
for(i in c(1, 2, 3, 5, 10)){
fit_gasoline_2004.name <- paste(
"fit", as.character(i), "_gasoline_2004",
sep = ""
)
writeLines(
paste(
"\n", fit_gasoline_2004.name, "\n"
)
)
print(CV(get(fit_gasoline_2004.name)))
}
##
## fit1_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 8.012058e-02 -1.969129e+03 -1.969052e+03 -1.945827e+03 8.427750e-01
##
## fit2_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.919464e-02 -1.978217e+03 -1.978072e+03 -1.945593e+03 8.449887e-01
##
## fit3_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.470266e-02 -2.023838e+03 -2.023604e+03 -1.981892e+03 8.541546e-01
##
## fit5_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.365185e-02 -2.034993e+03 -2.034518e+03 -1.974405e+03 8.569479e-01
##
## fit10_gasoline_2004
##
## CV AIC AICc BIC AdjR2
## 7.173572e-02 -2.056054e+03 -2.054595e+03 -1.948861e+03 8.624864e-01
Fourier K value of 10 minimize CV and AICc.
checkresiduals(fit10_gasoline_2004)
##
## Breusch-Godfrey test for serial correlation of order up to 104
##
## data: Residuals from Linear regression model
## LM test = 128.54, df = 104, p-value = 0.05168
Using the best model we forecast the next data
fc_gasoline_2005 <- forecast(
fit10_gasoline_2004,
newdata=data.frame(fourier(
gas, K = 10, h = 52)
)
)
fc_gasoline_2005
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2005.019 8.901276 8.557367 9.245186 8.374931 9.427622
## 2005.038 8.940132 8.596126 9.284138 8.413638 9.466626
## 2005.058 8.982049 8.638038 9.326060 8.455548 9.508551
## 2005.077 9.056949 8.712952 9.400946 8.530469 9.583429
## 2005.096 9.143925 8.799930 9.487920 8.617448 9.670401
## 2005.115 9.192581 8.848582 9.536579 8.666099 9.719062
## 2005.135 9.185356 8.841349 9.529363 8.658862 9.711851
## 2005.154 9.160913 8.816906 9.504919 8.634419 9.687406
## 2005.173 9.169559 8.825559 9.513560 8.643075 9.696044
## 2005.192 9.216866 8.872868 9.560865 8.690384 9.743348
## 2005.212 9.264222 8.920221 9.608223 8.737736 9.790708
## 2005.231 9.281951 8.937946 9.625957 8.755458 9.808444
## 2005.250 9.285826 8.941821 9.629831 8.759334 9.812318
## 2005.269 9.312227 8.968226 9.656228 8.785741 9.838713
## 2005.288 9.367159 9.023159 9.711159 8.840675 9.893643
## 2005.308 9.416957 9.072955 9.760958 8.890470 9.943444
## 2005.327 9.433266 9.089261 9.777272 8.906774 9.959758
## 2005.346 9.433725 9.089720 9.777729 8.907234 9.960216
## 2005.365 9.463951 9.119950 9.807952 8.937465 9.990437
## 2005.385 9.540677 9.196677 9.884678 9.014193 10.067162
## 2005.404 9.625983 9.281981 9.969985 9.099496 10.152471
## 2005.423 9.665636 9.321631 10.009640 9.139144 10.192127
## 2005.442 9.647059 9.303055 9.991063 9.120569 10.173550
## 2005.462 9.610131 9.266130 9.954132 9.083645 10.136617
## 2005.481 9.601794 9.257794 9.945795 9.075310 10.128279
## 2005.500 9.629756 9.285754 9.973759 9.103268 10.156244
## 2005.519 9.664433 9.320428 10.008438 9.137941 10.190924
## 2005.538 9.676302 9.332298 10.020306 9.149812 10.202792
## 2005.558 9.658713 9.314712 10.002714 9.132227 10.185199
## 2005.577 9.615773 9.271772 9.959773 9.089288 10.142258
## 2005.596 9.544821 9.200818 9.888823 9.018333 10.071309
## 2005.615 9.445303 9.101298 9.789309 8.918811 9.971795
## 2005.635 9.341433 8.997429 9.685437 8.814943 9.867923
## 2005.654 9.279412 8.935411 9.623412 8.752926 9.805897
## 2005.673 9.290336 8.946336 9.634336 8.763851 9.816820
## 2005.692 9.357646 9.013643 9.701648 8.831157 9.884134
## 2005.712 9.428370 9.084364 9.772376 8.901877 9.954863
## 2005.731 9.457460 9.113457 9.801464 8.930971 9.983950
## 2005.750 9.437719 9.093720 9.781719 8.911235 9.964203
## 2005.769 9.389996 9.045997 9.733996 8.863513 9.916480
## 2005.788 9.337513 8.993510 9.681516 8.811024 9.864001
## 2005.808 9.298876 8.954869 9.642883 8.772382 9.825371
## 2005.827 9.296233 8.952230 9.640237 8.769744 9.822723
## 2005.846 9.346559 9.002561 9.690557 8.820078 9.873040
## 2005.865 9.430551 9.086553 9.774549 8.904070 9.957032
## 2005.885 9.480414 9.136411 9.824418 8.953925 10.006904
## 2005.904 9.423219 9.079208 9.767229 8.896719 9.949719
## 2005.923 9.251762 8.907762 9.595762 8.725278 9.778246
## 2005.942 9.047458 8.703470 9.391446 8.520993 9.573924
## 2005.962 8.918927 8.574940 9.262915 8.392462 9.445392
## 2005.981 8.911344 8.567381 9.255307 8.384917 9.437771
## 2006.000 8.978870 8.634891 9.322849 8.452418 9.505323
We see that the gasolin is increasing but we are not considering that periods that gasoline is affected by geopolitical problems.
The annual population of Afghanistan is available in the global_economy data set.
global_economy %>%
filter(Country=="Afghanistan") %>%
tsibble(key = Code, index = Year)%>%
autoplot(Population, show.legend = FALSE) +
labs(title= "Afghanistan Population",
y = "Population")
There is a valley produced by the Soviet-Afghan war in the total population in Afghanistan.
global_economy %>%
filter(Country=="Afghanistan")%>%
model(TSLM(Population ~ Year)) %>%
report()
## Series: Population
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -5794518 -2582559 744761 2259222 6036280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -829292529 49730866 -16.68 <2e-16 ***
## Year 425774 25008 17.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3188000 on 56 degrees of freedom
## Multiple R-squared: 0.8381, Adjusted R-squared: 0.8352
## F-statistic: 289.9 on 1 and 56 DF, p-value: < 2.22e-16
global_economy %>%
filter(Country=="Afghanistan")%>%
filter(Year<1980)%>%
model(TSLM(Population ~ Year)) %>%
report()
## Series: Population
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -146380.8 -110290.6 -451.2 105877.8 202881.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -470734527 8787062 -53.57 <2e-16 ***
## Year 244657 4462 54.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 115100 on 18 degrees of freedom
## Multiple R-squared: 0.994, Adjusted R-squared: 0.9937
## F-statistic: 3007 on 1 and 18 DF, p-value: < 2.22e-16
global_economy %>%
filter(Country=="Afghanistan")%>%
filter(Year>1989)%>%
model(TSLM(Population ~ Year)) %>%
report()
## Series: Population
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -619234 -212927 6598 234280 612277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.670e+09 1.640e+07 -101.8 <2e-16 ***
## Year 8.451e+05 8.184e+03 103.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 349800 on 26 degrees of freedom
## Multiple R-squared: 0.9976, Adjusted R-squared: 0.9975
## F-statistic: 1.066e+04 on 1 and 26 DF, p-value: < 2.22e-16
The reports display that the coefficient of the second model is higher.
m1<-global_economy %>%
filter(Country=="Afghanistan")%>%
model(TSLM(Population ~ Year))
m2<-global_economy %>%
filter(Country=="Afghanistan")%>%
filter(Year>1989)%>%
model(TSLM(Population ~ Year))
forecast(m1, h=5)
## # A fable: 5 x 5 [1Y]
## # Key: Country, .model [1]
## Country .model Year Population .mean
## <fct> <chr> <dbl> <dist> <dbl>
## 1 Afghanistan TSLM(Population ~ Year) 2018 N(3e+07, 1.1e+13) 29919575.
## 2 Afghanistan TSLM(Population ~ Year) 2019 N(3e+07, 1.1e+13) 30345349.
## 3 Afghanistan TSLM(Population ~ Year) 2020 N(3.1e+07, 1.1e+13) 30771123.
## 4 Afghanistan TSLM(Population ~ Year) 2021 N(3.1e+07, 1.1e+13) 31196897.
## 5 Afghanistan TSLM(Population ~ Year) 2022 N(3.2e+07, 1.1e+13) 31622671.
forecast(m2, h=5)
## # A fable: 5 x 5 [1Y]
## # Key: Country, .model [1]
## Country .model Year Population .mean
## <fct> <chr> <dbl> <dist> <dbl>
## 1 Afghanistan TSLM(Population ~ Year) 2018 N(3.6e+07, 1.4e+11) 35925602.
## 2 Afghanistan TSLM(Population ~ Year) 2019 N(3.7e+07, 1.4e+11) 36770747.
## 3 Afghanistan TSLM(Population ~ Year) 2020 N(3.8e+07, 1.4e+11) 37615892.
## 4 Afghanistan TSLM(Population ~ Year) 2021 N(3.8e+07, 1.5e+11) 38461037.
## 5 Afghanistan TSLM(Population ~ Year) 2022 N(3.9e+07, 1.5e+11) 39306182.
With the second model, where we did not consider this data that was part of the valley caused by the Soviet-Afghan war, we saw that the mean forecast population is higher. Therefore we can conclude that ignoring the data of certain events can improve our model.