Load Libraries

library(fma)  # dataset
## Warning: package 'fma' was built under R version 4.3.3
## Loading required package: forecast
## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(zoo) #  na.approx
## Warning: package 'zoo' was built under R version 4.3.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tseries)   # statistical tests
## Warning: package 'tseries' was built under R version 4.3.2
library(forecast)  # auto.arima and forecasting accuracy measures
options(warn=0)

Select the R data set with my ID 20228034

34%%6
## [1] 4

My result got 4 that means i AirPassengers data set for this assignment

Question 1

Plot the series (use time-plot, ACF plot and seasonal-subseries plot) and describe its main features.

Load AirPassengers data set

data(AirPassengers)
#Check the missing values in the entire dataset
missing_values <- sum(is.na(AirPassengers))
cat("Number of missing values:", missing_values, "\n")
## Number of missing values: 0

There are no missing values for this data set, if dataset have missing values then i can handle this with the linear interpolation using ’’’ na.approx(AirPassengers) ’’’

AirPassengers <- na.approx(AirPassengers)
#AirPassengers <- ts(start = c(1949, 1),frequency = 12)
AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432

Outliers finding

outliers <- boxplot.stats(AirPassengers)$out


# Plot the AirPassengers data
plot(AirPassengers, main="AirPassengers Data with Outliers Highlighted", 
     xlab="Time", ylab="Number of Passengers", pch=20, type='o', col="blue")

# Highlight outliers with a different color
points(which(AirPassengers %in% outliers), AirPassengers[which(AirPassengers %in% outliers)], 
       col="red", pch=20)

# Add a legend
legend("topright", legend=c("Outliers", "Regular observations"),
       col=c("red", "blue"), pch=20)

No outliers seen in the data set, using summary library to see the details about the dataset.

# Apply outlier handling
suppressWarnings({
  q_05 <- quantile(AirPassengers, 0.05)
  q_95 <- quantile(AirPassengers, 0.95)

  AirPassengers[AirPassengers < q_05] <- q_05
  AirPassengers[AirPassengers > q_95] <- q_95
  
  summary(AirPassengers)
})
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   121.6   180.0   265.5   277.4   360.5   488.1
# Plot the modified data
plot(AirPassengers, main="Modified AirPassengers Data", 
     xlab="Time", ylab="Number of Passengers", pch=20, type='o', col="green")

Plot the time series

plot(AirPassengers, main = "Time Plot of AirPassengers data", ylab = "Value", xlab = "Time")

This time series plot displays a rising trend in air passenger numbers from around 1949 to 1961, with evident seasonal fluctuations. The increasing variability over time suggests a growing market for air travel, emphasizing the importance of reliable forecasting methods to anticipate future demand.

The autocorrelation function (ACF) plot

acf(AirPassengers, main = "ACF Plot of Air Passenger", ylab = "Autocorrelation", xlab = "Lag")

The ACF plot indicates significant autocorrelation at various lags. At lag 0, the autocorrelation is 1, suggesting a strong correlation with itself. As the lag increases, autocorrelation decreases, but all lags are still significant. This suggests a strong persistence in the values over time. Also there are spike in 12 number lags it suggest that data set have monthly data with seasonality. all lags are cross the confidence interval which suggest their are no stationary in the data set. also this plot suggest to outliers in the data.

Seasonal-subseries plot

monthplot(AirPassengers, main = "Seasonal-Subseries Plot of Air Passenger", ylab = "Value", xlab 
= "Month")

The seasonal subseries plot indicates a general upward trend across the year with variations in data points suggesting seasonal patterns. Some months, like January and August, show higher variability, whereas others, like February and November, appear more consistent. The plot does not reveal extreme outliers, indicating a relatively stable seasonal trend over the years observed.

Split the data into training (70%) and test (30%) data sets. Use the training data for answering questions 2, and 3. Use the test data for answering question 6.
#Set the seed for reproducibility
set.seed(20228034)
train_size <- ceiling(0.7 * length(AirPassengers)) # 70% training size

# Calculate the year and month for the end of training data
end_year <- 1949 + (floor((train_size - 1) / 12))
end_month <- ((train_size - 1) %% 12) + 1

suppressWarnings(train_data <- window(AirPassengers, start = c(1949, 1), end = c(end_year, end_month))) #train data

# Calculate the year and month for the start of test data
start_year <- 1949 + (floor(train_size / 12))
start_month <- (train_size %% 12) + 1

suppressWarnings(test_data <- window(AirPassengers, start = c(start_year, start_month))) #test data

# Print the training and test set
print(train_data)
##        Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
## 1949 121.6 121.6 132.0 129.0 121.6 135.0 148.0 148.0 136.0 121.6 121.6 121.6
## 1950 121.6 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 121.6 140.0
## 1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0
## 1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0
## 1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0
## 1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0
## 1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0
## 1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0
## 1957 315.0 301.0 356.0 348.0 355.0
print(test_data)
##         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
## 1957                                    422.00 465.00 467.00 404.00 347.00
## 1958 340.00 318.00 362.00 348.00 363.00 435.00 488.15 488.15 404.00 359.00
## 1959 360.00 342.00 406.00 396.00 420.00 472.00 488.15 488.15 463.00 407.00
## 1960 417.00 391.00 419.00 461.00 472.00 488.15 488.15 488.15 488.15 461.00
##         Nov    Dec
## 1957 305.00 336.00
## 1958 310.00 337.00
## 1959 362.00 405.00
## 1960 390.00 432.00

I set the seed for reproducibility using set.seed(20228034). The training set (train_data) includes the first 70% of the data. The test set (test_data) includes the remaining 30% of the data starting from the next observation after the training set.

Question 2

Take the suitable number of non-seasonal and/or seasonal differences (if required) to get a stationary series.
Hints: Test for stationarity can be checked by Augmented Dickey-Fuller Test (using adf.test function in tseries package) and The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test (using kpss.test function in tseries package). In the KPSS test, accepting the null hypothesis means that the series is stationarity, and a small p-value suggests that the series is not stationary, and a differencing is required. Whereas a small p-value suggests that the series is stationary in the Augmented Dickey-Fuller Test.
Choose a model using the ACF & PACF plots of the stationary series, for instance, M1 model. Fit the M1 model, and write the mathematical expression of the M1 model. Also, test for the residuals diagnostic checking. Repeat the process until all the underlying assumptions of residuals are fulfilled.
adf_result <- adf.test(train_data)
## Warning in adf.test(train_data): p-value smaller than printed p-value
kpss_result <- kpss.test(train_data)
## Warning in kpss.test(train_data): p-value smaller than printed p-value
print("ADF Result ")
## [1] "ADF Result "
print(adf_result$statistic)
## Dickey-Fuller 
##     -4.623752
print(adf_result$p.value)
## [1] 0.01
print("KPSS Result ")
## [1] "KPSS Result "
print(kpss_result$statistic)
## KPSS Level 
##   1.945694
print(kpss_result$p.value)
## [1] 0.01

For ADF Result , (H): Non Stationary)

The test statistic is -4.6238, and the p-value is 0.01. Since the p-value is smaller than the significance level of 0.05, we reject the null hypothesis.This suggests that the series is stationary

For KPSS Test, (H0: Stationary)

The null hypothesis of the KPSS test is that the series is stationary around a deterministic trend. The test statistic is 1.9457, and the p-value is 0.01. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.This suggests evidence of Non stationarity in the series.