Here, the R data set of monthly totals of international airline passengers from 1949 to 1960 is analyzed, and a simple model is applied to forecast 3-point estimates for 1961's monthly totals.
There are several packages used in this example:
library(forecast)
library(tseries)
library(zoo)
library(dtwclust)
library(ggplot2)
library(plotly)
library(dplyr)
library(tidyr)
library(tibble)
library(factoextra)
library(NbClust)
library(akmedoids)

Loading the data set AirPassengers (a built-in R data set from the datasets package) into the workspace and assigning it to AP_year (for convenience):
data("AirPassengers")
AP_year <- AirPassengers

Reshaping the time series AP_year into a year-by-month table (12 monthly series):
# Air passenger data. ts converted to long matrix:
dfAP_month <- data.frame(Year = floor(time(AirPassengers) + .01),
Month = cycle(AirPassengers),
Passengers = AirPassengers)
# convert month numbers to names, using a built-in constant:
dfAP_month$Month <- factor(dfAP_month$Month)
levels(dfAP_month$Month) <- month.abb
rmarkdown::paged_table(dfAP_month)

# Long matrix is converted to wide matrix:
dfAP_month.matrix <- spread(dfAP_month, key = Month, value = Passengers) %>% column_to_rownames("Year")
knitr::kable(dfAP_month.matrix)

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1949 | 112 | 118 | 132 | 129 | 121 | 135 | 148 | 148 | 136 | 119 | 104 | 118 |
| 1950 | 115 | 126 | 141 | 135 | 125 | 149 | 170 | 170 | 158 | 133 | 114 | 140 |
| 1951 | 145 | 150 | 178 | 163 | 172 | 178 | 199 | 199 | 184 | 162 | 146 | 166 |
| 1952 | 171 | 180 | 193 | 181 | 183 | 218 | 230 | 242 | 209 | 191 | 172 | 194 |
| 1953 | 196 | 196 | 236 | 235 | 229 | 243 | 264 | 272 | 237 | 211 | 180 | 201 |
| 1954 | 204 | 188 | 235 | 227 | 234 | 264 | 302 | 293 | 259 | 229 | 203 | 229 |
| 1955 | 242 | 233 | 267 | 269 | 270 | 315 | 364 | 347 | 312 | 274 | 237 | 278 |
| 1956 | 284 | 277 | 317 | 313 | 318 | 374 | 413 | 405 | 355 | 306 | 271 | 306 |
| 1957 | 315 | 301 | 356 | 348 | 355 | 422 | 465 | 467 | 404 | 347 | 305 | 336 |
| 1958 | 340 | 318 | 362 | 348 | 363 | 435 | 491 | 505 | 404 | 359 | 310 | 337 |
| 1959 | 360 | 342 | 406 | 396 | 420 | 472 | 548 | 559 | 463 | 407 | 362 | 405 |
| 1960 | 417 | 391 | 419 | 461 | 472 | 535 | 622 | 606 | 508 | 461 | 390 | 432 |
Plot of AirPassengers time series by month:
# plotting reference lines across each facet:
refLines <- dfAP_month # copy of the data; the Month column is renamed below
colnames(refLines)[2] <- "MonthGroup"
zp <- ggplot(dfAP_month,
aes(x = Year, y = Passengers))
zp <- zp + geom_line(data = refLines, # Plotting the "underlayer"
aes(x = Year, y = Passengers, group = MonthGroup),
colour = "GRAY", alpha = 1 / 2, size = 1 / 2)
zp <- zp + geom_line(size = 1) # Drawing the "overlayer"
zp <- zp + facet_wrap(~Month)
zp <- zp + theme_bw()
ggplotly(zp)

Heat map of air passengers by month:
heatmap(t(zscore(dfAP_month.matrix)), Colv = NA, Rowv = NA, scale = "row",
main = "Heat map of Monthly Air Passenger numbers by year",
xlab = "Year",
ylab = "Month")For choosing the Optimal Number of Clusters, please refer the following link for suggestion: link
In this section, we will describe two functions for determining the optimal number of clusters:
fviz_nbclust() function [in the factoextra R package]: it can be used to compute three different methods [elbow, silhouette and gap statistic] for any partitioning clustering method [K-means, K-medoids (PAM), CLARA, HCUT]. Note that the hcut() function is available only in the factoextra package; it computes hierarchical clustering and cuts the tree into k pre-specified clusters.
NbClust() function [in the NbClust package] (Charrad et al. 2014): it provides 30 indices for determining the relevant number of clusters and proposes the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. It can compute all the indices and determine the number of clusters in a single function call (a single-call sketch is shown further below, just before the per-index loop).
We will use the following R packages:
factoextra to determine the optimal number of clusters for a given clustering method and for data visualization.
NbClust for computing about 30 indices at once, in order to find the optimal number of clusters.
We start by standardizing the data to make variables comparable.
# Standardize the data by z-score
dfAP_month_scaled <- zscore(dfAP_month.matrix)
knitr::kable(dfAP_month_scaled, digits = 3)

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1949 | -1.069 | -0.632 | 0.389 | 0.170 | -0.413 | 0.607 | 1.555 | 1.555 | 0.680 | -0.559 | -1.652 | -0.632 |
| 1950 | -1.293 | -0.717 | 0.070 | -0.245 | -0.769 | 0.489 | 1.591 | 1.591 | 0.961 | -0.350 | -1.346 | 0.017 |
| 1951 | -1.365 | -1.094 | 0.425 | -0.389 | 0.099 | 0.425 | 1.564 | 1.564 | 0.750 | -0.443 | -1.311 | -0.226 |
| 1952 | -1.132 | -0.740 | -0.174 | -0.697 | -0.610 | 0.914 | 1.437 | 1.959 | 0.523 | -0.261 | -1.089 | -0.131 |
| 1953 | -1.019 | -1.019 | 0.386 | 0.351 | 0.141 | 0.632 | 1.370 | 1.651 | 0.422 | -0.492 | -1.581 | -0.843 |
| 1954 | -1.000 | -1.458 | -0.112 | -0.341 | -0.141 | 0.718 | 1.806 | 1.549 | 0.575 | -0.284 | -1.028 | -0.284 |
| 1955 | -0.997 | -1.210 | -0.403 | -0.356 | -0.332 | 0.736 | 1.898 | 1.495 | 0.664 | -0.237 | -1.115 | -0.142 |
| 1956 | -0.925 | -1.071 | -0.235 | -0.319 | -0.214 | 0.956 | 1.771 | 1.604 | 0.559 | -0.465 | -1.196 | -0.465 |
| 1957 | -0.923 | -1.165 | -0.214 | -0.353 | -0.232 | 0.926 | 1.668 | 1.703 | 0.615 | -0.370 | -1.095 | -0.560 |
| 1958 | -0.635 | -0.976 | -0.294 | -0.511 | -0.279 | 0.837 | 1.705 | 1.922 | 0.356 | -0.341 | -1.100 | -0.682 |
| 1959 | -0.979 | -1.236 | -0.320 | -0.463 | -0.119 | 0.625 | 1.714 | 1.871 | 0.496 | -0.306 | -0.950 | -0.334 |
| 1960 | -0.761 | -1.096 | -0.735 | -0.195 | -0.054 | 0.757 | 1.876 | 1.670 | 0.409 | -0.195 | -1.108 | -0.568 |
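A quick sanity check (a small sketch): after z-scoring, every year (row) should have mean 0 and standard deviation 1 across its 12 months.

# Row-wise means and standard deviations of the scaled matrix
round(apply(dfAP_month_scaled, 1, mean), 3)
round(apply(dfAP_month_scaled, 1, sd), 3)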
p = fviz_nbclust(dfAP_month_scaled, hcut, method = "wss")
k = round(elbow_point(as.numeric(p$data[[1]]), p$data[[2]])$x,0) # Identify the elbow point of the curve
p + geom_vline(xintercept = k, linetype = 2) +
labs(subtitle = "Elbow method")fviz_nbclust(dfAP_month_scaled, hcut, method = "silhouette") +
labs(subtitle = "Silhouette method")# Gap statistic
# nboot = 50 to keep the function speedy.
# recommended value: nboot= 500 for your analysis.
# Use verbose = FALSE to hide computing progression.
set.seed(123)
fviz_nbclust(dfAP_month_scaled, hcut, nstart = 25, method = "gap_stat", nboot = 50, print.summary = F) +
labs(subtitle = "Gap statistic method")Filtering which appropriate indices for each distance measure that can be used to computes the function NbClust() for hierarchical clustering. To compute NbClust() for hierarchical clustering, method should be one of: ward.D, ward.D2, single, complete, average:
lista.methods = c("kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "sdindex", "sdbw", "all")
lista.distance <- c("Indices","euclidean", "maximum", "manhattan", "canberra", "binary")
tabla <- as.data.frame(matrix(ncol = length(lista.distance), nrow = length(lista.methods)))
names(tabla) <- lista.distance
for (j in 2:length(lista.distance)) {
for (i in 1:length(lista.methods)) {
nb <- try(NbClust(t(dfAP_month_scaled),
distance = lista.distance[j],
min.nc = 2,
max.nc = 11,
method = "ward.D",
index = lista.methods[i])
)
if ((inherits(nb, "try-error"))) next
if (!is.na(nb$Best.nc[1])) {
tabla[i,j] <- nb$Best.nc[1]
tabla[i,1] <- lista.methods[i]
}
}
}

tabla <- subset(tabla, !is.na(tabla$Indices))
knitr::kable(tabla)

|   | Indices | euclidean | maximum | manhattan | canberra | binary |
|---|---|---|---|---|---|---|
| 1 | kl | 4 | 4 | 4 | 2 | 11 |
| 2 | ch | 4 | 4 | 4 | 6 | 11 |
| 3 | hartigan | 11 | 11 | 11 | 11 | 11 |
| 11 | cindex | 11 | 11 | 3 | 11 | 2 |
| 12 | db | 11 | 11 | 11 | 11 | 11 |
| 13 | silhouette | 11 | 11 | 11 | 11 | -Inf |
| 14 | duda | 4 | 4 | 4 | 2 | 2 |
| 15 | pseudot2 | 4 | 4 | 4 | 2 | 2 |
| 16 | beale | 5 | 5 | 5 | 2 | 2 |
| 17 | ratkowsky | 2 | 2 | 2 | 2 | 8 |
| 18 | ball | 3 | 3 | 3 | 3 | 3 |
| 19 | ptbiserial | 2 | 2 | 2 | 2 | -Inf |
| 20 | gap | 2 | 2 | 2 | 2 | 2 |
| 22 | mcclain | 2 | 2 | 2 | 2 | Inf |
| 23 | gamma | 4 | 7 | 4 | 3 | -Inf |
| 24 | gplus | 4 | 7 | 4 | 3 | 2 |
| 25 | tau | 2 | 2 | 2 | 3 | 11 |
| 26 | dunn | 4 | 11 | 4 | 11 | -Inf |
| 27 | sdindex | 4 | 4 | 4 | 2 | 3 |
| 28 | sdbw | 11 | 11 | 11 | 11 | 11 |
Plotting the frequency of each suggested number of clusters for each distance measure:
tabla %>%
gather(key = "disMethod","Ncluster", -Indices) %>%
count(disMethod, Ncluster) -> PlotData
ggplot(PlotData, aes(x = factor(Ncluster), y = n)) +
theme_bw() +
geom_bar(fill = "#0073C2FF", stat = "identity") +
geom_text(aes(label = n), nudge_y = 1) +
xlab("Number of clusters k") +
ylab("Frequency") +
facet_wrap(vars(disMethod), scales = "fixed", ncol = 2)

Elbow method: a 4-cluster solution is suggested
Silhouette method: a 2-cluster solution is suggested
Gap statistic method: a 1-cluster solution is suggested
Computing 30 indices: if the distance measure is euclidean or manhattan, then k = 4 is the most frequently suggested number of clusters in the data.
==> According to these observations, it is possible to define k = 4 as the optimal number of clusters in the data.
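A quick tally over the euclidean and manhattan columns of the table above (a small sketch) confirms that k = 4 is the most frequent suggestion under those two distance measures:

# Count how often each k is proposed under the euclidean and manhattan distances
table(unlist(tabla[, c("euclidean", "manhattan")]))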
nclust <- 4
hcc_sbd <- tsclust(t(dfAP_month.matrix),
type = "h", # type of clustering analysis: hierarchical clustering
k = nclust, # number of clusters
preproc = zscore, # normalize data by z-score
seed = 999,
distance = "sbd", # type of distance measurement: shape-based distance
centroid = shape_extraction, # type of centroid computation method: shape extraction
control = hierarchical_control(method = "ward.D") # hierarchical clustering method: ward.D
)
plot(hcc_sbd)
rect.hclust(hcc_sbd, k = nclust, border = 2:(2 + (nclust - 1)))

cutree(hcc_sbd, k = nclust)

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 1 2 3 3 1 4 4 4 4 4 4
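For readability, the months can be grouped by their cluster assignment (a small sketch using the cutree() result above):

# Month labels grouped by cluster id; e.g. cluster 4 contains Jul through Dec
memb <- cutree(hcc_sbd, k = nclust)
split(names(memb), memb)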
plot(hcc_sbd, type = "sc", labels = list(nudge_x = -1, nudge_y = 1))sum(is.na(AP_year))## [1] 0
Check the cycle component. The cycle of this time series is 12 months per year:
frequency(AP_year)

## [1] 12
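The span of the series can be checked the same way (a quick sketch):

start(AP_year) # 1949 1: January 1949
end(AP_year)   # 1960 12: December 1960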
p <- autoplot(AP_year) + labs(x = "Date",
y = "Passenger numbers (1000's)",
title = "Air Passengers from 1949 to 1961")
ggplotly(p)

p <- ggseasonplot(AP_year) +
ylab("Passenger numbers (1000's)") +
ggtitle("Air Passenger numbers from 1949 to 1961") +
theme_bw()
ggplotly(p)

ggseasonplot(AP_year, polar = TRUE) +
ylab("Passenger numbers (1000's)") +
ggtitle("Air Passenger numbers from 1949 to 1961") +
theme_bw()

boxplot(AP_year ~ cycle(AP_year),
xlab = "Month",
ylab = "Passenger Numbers (1000's)",
main = "Monthly Air Passengers from 1949 to 1961")Augmented Dickey-Fuller Test using the adf.test function from the tseries R package
adf.test(AP_year, alternative = "stationary", k = 12)

##
## Augmented Dickey-Fuller Test
##
## data: AP_year
## Dickey-Fuller = -1.5094, Lag order = 12, p-value = 0.7807
## alternative hypothesis: stationary
With a p-value of 0.78, we cannot reject the null hypothesis of non-stationarity. We therefore difference the log of the series with diff(log()) and test the stationarity of the transformed series:

adf.test(diff(log(AP_year)), alternative = "stationary", k = 12)

##
## Augmented Dickey-Fuller Test
##
## data: diff(log(AP_year))
## Dickey-Fuller = -3.3656, Lag order = 12, p-value = 0.06313
## alternative hypothesis: stationary
The differenced log series has a much smaller p-value (about 0.063), so it is close to stationary.

p <- autoplot(diff(log(AP_year))) +
xlab("Year") +
ylab("Differences in log(Passenger numbers (1000's))")
ggplotly(p)

ACF of the original series:
p <- ggAcf(AP_year)
ggplotly(p)

ACF of the differenced log series:
p <- ggAcf(diff(log(AP_year)))
ggplotly(p)
Decompose the data into trend, seasonal, and random components:
p <- autoplot(decompose(AP_year)) + xlab("Year")
ggplotly(p)
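Because the seasonal swings in AirPassengers grow with the level of the series, a multiplicative decomposition (or, equivalently, decomposing the log series) arguably describes the data better than the default additive model; a sketch:

# Multiplicative decomposition: observed = trend * seasonal * random
p <- autoplot(decompose(AP_year, type = "multiplicative")) + xlab("Year")
ggplotly(p)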
Fit a seasonal ARIMA(0,1,1)(0,1,1)[12] model to the log-transformed series:

fit <- arima(log(AP_year), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
fit

##
## Call:
## arima(x = log(AP_year), order = c(0, 1, 1), seasonal = list(order = c(0, 1,
## 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.4018 -0.5569
## s.e. 0.0896 0.0731
##
## sigma^2 estimated as 0.001348: log likelihood = 244.7, aic = -483.4
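The (0,1,1)(0,1,1)[12] specification used above is the classical "airline model". As an optional cross-check, auto.arima() from the forecast package can search for an order automatically on the log series; a sketch:

# Automatic order selection on the log-transformed series (cross-check only)
auto_fit <- auto.arima(log(AP_year))
summary(auto_fit)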
Predict the next 10 years:
pred <- predict(fit, n.ahead = 10 * 12)
pred

## $pred
## Jan Feb Mar Apr May Jun Jul Aug
## 1961 6.110186 6.053775 6.171715 6.199300 6.232556 6.368779 6.507294 6.502906
## 1962 6.206435 6.150025 6.267964 6.295550 6.328805 6.465028 6.603543 6.599156
## 1963 6.302684 6.246274 6.364213 6.391799 6.425054 6.561277 6.699792 6.695405
## 1964 6.398933 6.342523 6.460463 6.488048 6.521304 6.657526 6.796042 6.791654
## 1965 6.495183 6.438772 6.556712 6.584297 6.617553 6.753776 6.892291 6.887903
## 1966 6.591432 6.535022 6.652961 6.680547 6.713802 6.850025 6.988540 6.984153
## 1967 6.687681 6.631271 6.749210 6.776796 6.810051 6.946274 7.084789 7.080402
## 1968 6.783930 6.727520 6.845460 6.873045 6.906301 7.042523 7.181039 7.176651
## 1969 6.880180 6.823769 6.941709 6.969294 7.002550 7.138773 7.277288 7.272900
## 1970 6.976429 6.920019 7.037958 7.065544 7.098799 7.235022 7.373537 7.369150
## Sep Oct Nov Dec
## 1961 6.324698 6.209008 6.063487 6.168025
## 1962 6.420947 6.305257 6.159737 6.264274
## 1963 6.517197 6.401507 6.255986 6.360523
## 1964 6.613446 6.497756 6.352235 6.456773
## 1965 6.709695 6.594005 6.448484 6.553022
## 1966 6.805944 6.690254 6.544734 6.649271
## 1967 6.902194 6.786504 6.640983 6.745520
## 1968 6.998443 6.882753 6.737232 6.841770
## 1969 7.094692 6.979002 6.833481 6.938019
## 1970 7.190941 7.075251 6.929731 7.034268
##
## $se
## Jan Feb Mar Apr May Jun
## 1961 0.03671562 0.04278291 0.04809072 0.05286830 0.05724856 0.06131670
## 1962 0.09008475 0.09549708 0.10061869 0.10549195 0.11014981 0.11461854
## 1963 0.14650643 0.15224985 0.15778435 0.16313118 0.16830825 0.17333075
## 1964 0.20896657 0.21513653 0.22113442 0.22697386 0.23266679 0.23822371
## 1965 0.27748210 0.28408309 0.29053414 0.29684503 0.30302451 0.30908048
## 1966 0.35174476 0.35876289 0.36564634 0.37240257 0.37903840 0.38556004
## 1967 0.43142043 0.43883816 0.44613258 0.45330963 0.46037481 0.46733319
## 1968 0.51620376 0.52400376 0.53168935 0.53926541 0.54673651 0.55410688
## 1969 0.60582584 0.61399203 0.62205103 0.63000694 0.63786363 0.64562471
## 1970 0.70005133 0.70856907 0.71698563 0.72530453 0.73352910 0.74166246
## Jul Aug Sep Oct Nov Dec
## 1961 0.06513124 0.06873441 0.07215787 0.07542612 0.07855851 0.08157070
## 1962 0.11891946 0.12307018 0.12708540 0.13097758 0.13475740 0.13843405
## 1963 0.17821177 0.18296261 0.18759318 0.19211216 0.19652727 0.20084534
## 1964 0.24365393 0.24896574 0.25416656 0.25926308 0.26426132 0.26916676
## 1965 0.31502004 0.32084967 0.32657525 0.33220217 0.33773535 0.34317933
## 1966 0.39197318 0.39828307 0.40449455 0.41061207 0.41663978 0.42258152
## 1967 0.47418947 0.48094803 0.48761291 0.49418791 0.50067658 0.50708223
## 1968 0.56138049 0.56856106 0.57565206 0.58265678 0.58957827 0.59641945
## 1969 0.65329361 0.66087351 0.66836746 0.67577831 0.68310877 0.69036139
## 1970 0.74970759 0.75766731 0.76554426 0.77334099 0.78105989 0.78870326
The predicted values above are on the log scale; to convert them back to the original scale, we exponentiate them:
pred1 <- round(2.718^pred$pred, 0)
pred1

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477
## 1962 496 468 527 542 560 642 737 734 614 547 473 525
## 1963 546 516 580 597 617 707 812 808 676 602 521 578
## 1964 601 568 639 657 679 778 894 890 745 663 573 637
## 1965 661 625 703 723 748 857 984 980 820 730 631 701
## 1966 728 688 775 796 823 943 1083 1079 903 804 695 772
## 1967 802 758 853 877 906 1039 1193 1188 994 885 765 850
## 1968 883 834 939 965 998 1143 1313 1308 1094 975 843 935
## 1969 972 919 1034 1063 1099 1259 1446 1440 1205 1073 928 1030
## 1970 1070 1012 1138 1170 1210 1386 1592 1585 1326 1181 1021 1134
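Note that exp() is the exact inverse of the natural log used above; 2.718 only approximates e, so the larger rounded values can be slightly understated. A sketch of the exact back-transform:

# Exact inverse of the log transform
pred1_exact <- round(exp(pred$pred), 0)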
Plot the model's forecast:
fitted <- forecast(fit, h = 10 * 12)
autoplot(fitted) +
theme_bw() +
ylab("log(Passenger numbers (1000's))") +
xlab("Year")Compare predicted values with original values in 1960:
Compare predicted values with original values in 1960:

predicted_1960 <- round(head(pred1, 12)) # first 12 forecast months (these are for 1961, as the model was fitted on data through 1960)
predicted_1960

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1961 450 425 479 492 509 583 670 667 558 497 430 477
original_1960 <- tail(AP_year, 12) # For Original data in 1960
original_1960

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
Because the model above was fitted on the full series, its first forecast year is 1961 and cannot be compared directly with 1960. We therefore refit the model on data up to 1959 and forecast from 1960 onwards:

datawide <- ts(AP_year, frequency = 12, start = c(1949, 1), end = c(1959, 12))
datawide

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
fit1 <- arima(log(datawide), c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 12))
pred <- predict(fit1, n.ahead = 10 * 12) # forecasts now run from 1960 to 1969
pred1 <- round(2.718^pred$pred, 0)
pred1

## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1960 419 399 466 454 473 547 622 630 526 462 406 452
## 1961 470 447 522 509 530 613 697 706 590 518 455 506
## 1962 526 501 585 570 594 687 781 791 661 580 510 568
## 1963 590 561 656 639 665 769 875 886 741 650 572 636
## 1964 661 629 735 716 746 862 980 993 830 728 641 713
## 1965 740 704 824 802 836 966 1098 1112 930 816 718 798
## 1966 830 789 923 899 936 1082 1231 1247 1042 915 804 895
## 1967 930 884 1034 1007 1049 1213 1379 1397 1168 1025 901 1003
## 1968 1042 991 1159 1129 1176 1359 1545 1565 1308 1148 1010 1123
## 1969 1167 1110 1299 1265 1317 1523 1732 1754 1466 1287 1132 1259
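Before plotting, the 1960 forecast error can be summarized numerically (a small sketch using forecast::accuracy(); head(pred1, 12) holds the 12 forecast months of 1960 and tail(AP_year, 12) the observed values):

# ME, RMSE, MAE, MPE, MAPE of the 1960 forecasts against the observed values
accuracy(as.numeric(head(pred1, 12)), as.numeric(tail(AP_year, 12)))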
# Predicted and observed monthly values for 1960, indexed by month (1 to 12)
Predicted <- zoo(round(head(pred1, 12), 0), 1:12) # forecasts for the 12 months of 1960
Original <- zoo(round(tail(AP_year, 12), 0), 1:12) # observed values for 1960
CbindData <- merge.zoo(Predicted, Original)
p <- autoplot.zoo(CbindData, facets = NULL) +
scale_x_continuous(breaks = 1:12, labels = month.abb) +
theme_bw() +
ylab("Passenger numbers (1000's)") +
xlab("Month")
ggplotly(p)

Check the residuals of the model with a normal Q-Q plot:
df <- as.data.frame(residuals(fit))
p <- ggplot(df, aes(sample = x)) +
stat_qq(color = "gray") +
stat_qq_line(colour = "red") +
ggtitle("Normal Q-Q Plot") +
xlab("Theoretical Quantiles") +
ylab("Sample Quantiles")
ggplotly(p)

adf.test(fit$residuals, alternative = "stationary")

##
## Augmented Dickey-Fuller Test
##
## data: fit$residuals
## Dickey-Fuller = -4.7907, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
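The ADF test above only checks that the residuals are stationary. A complementary diagnostic (a sketch, not part of the analysis above) is the Ljung-Box test for remaining autocorrelation; fitdf = 2 accounts for the two estimated MA coefficients (ma1 and sma1):

# Ljung-Box test on the residuals of the seasonal ARIMA fit
Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)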