Load Libraries
library(fma) # dataset
## Warning: package 'fma' was built under R version 4.3.3
## Loading required package: forecast
## Warning: package 'forecast' was built under R version 4.3.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(zoo) # na.approx
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(tseries) # statistical tests
## Warning: package 'tseries' was built under R version 4.3.2
library(forecast) # auto.arima and forecasting accuracy measures
options(warn=0)
Select the R data set with my ID 20228034
34%%6
## [1] 4
My result got 4 that means i AirPassengers data set for this assignment
Load AirPassengers data set
data(AirPassengers)
#Check the missing values in the entire dataset
missing_values <- sum(is.na(AirPassengers))
cat("Number of missing values:", missing_values, "\n")
## Number of missing values: 0
There are no missing values for this data set, if dataset have missing values then i can handle this with the linear interpolation using ’’’ na.approx(AirPassengers) ’’’
AirPassengers <- na.approx(AirPassengers)
#AirPassengers <- ts(start = c(1949, 1),frequency = 12)
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
Outliers finding
outliers <- boxplot.stats(AirPassengers)$out
# Plot the AirPassengers data
plot(AirPassengers, main="AirPassengers Data with Outliers Highlighted",
xlab="Time", ylab="Number of Passengers", pch=20, type='o', col="blue")
# Highlight outliers with a different color
points(which(AirPassengers %in% outliers), AirPassengers[which(AirPassengers %in% outliers)],
col="red", pch=20)
# Add a legend
legend("topright", legend=c("Outliers", "Regular observations"),
col=c("red", "blue"), pch=20)
No outliers seen in the data set, using summary library to see the
details about the dataset.
# Apply outlier handling
suppressWarnings({
q_05 <- quantile(AirPassengers, 0.05)
q_95 <- quantile(AirPassengers, 0.95)
AirPassengers[AirPassengers < q_05] <- q_05
AirPassengers[AirPassengers > q_95] <- q_95
summary(AirPassengers)
})
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 121.6 180.0 265.5 277.4 360.5 488.1
# Plot the modified data
plot(AirPassengers, main="Modified AirPassengers Data",
xlab="Time", ylab="Number of Passengers", pch=20, type='o', col="green")
Plot the time series
plot(AirPassengers, main = "Time Plot of AirPassengers data", ylab = "Value", xlab = "Time")
This time series plot displays a rising trend in air passenger numbers
from around 1949 to 1961, with evident seasonal fluctuations. The
increasing variability over time suggests a growing market for air
travel, emphasizing the importance of reliable forecasting methods to
anticipate future demand.
acf(AirPassengers, main = "ACF Plot of Air Passenger", ylab = "Autocorrelation", xlab = "Lag")
The ACF plot indicates significant autocorrelation at various lags. At
lag 0, the autocorrelation is 1, suggesting a strong correlation with
itself. As the lag increases, autocorrelation decreases, but all lags
are still significant. This suggests a strong persistence in the values
over time. Also there are spike in 12 number lags it suggest that data
set have monthly data with seasonality. all lags are cross the
confidence interval which suggest their are no stationary in the data
set. also this plot suggest to outliers in the data.
monthplot(AirPassengers, main = "Seasonal-Subseries Plot of Air Passenger", ylab = "Value", xlab
= "Month")
The seasonal subseries plot indicates a general upward trend across the year with variations in data points suggesting seasonal patterns. Some months, like January and August, show higher variability, whereas others, like February and November, appear more consistent. The plot does not reveal extreme outliers, indicating a relatively stable seasonal trend over the years observed.
#Set the seed for reproducibility
set.seed(20228034)
train_size <- ceiling(0.7 * length(AirPassengers)) # 70% training size
# Calculate the year and month for the end of training data
end_year <- 1949 + (floor((train_size - 1) / 12))
end_month <- ((train_size - 1) %% 12) + 1
suppressWarnings(train_data <- window(AirPassengers, start = c(1949, 1), end = c(end_year, end_month))) #train data
# Calculate the year and month for the start of test data
start_year <- 1949 + (floor(train_size / 12))
start_month <- (train_size %% 12) + 1
suppressWarnings(test_data <- window(AirPassengers, start = c(start_year, start_month))) #test data
# Print the training and test set
print(train_data)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 121.6 121.6 132.0 129.0 121.6 135.0 148.0 148.0 136.0 121.6 121.6 121.6
## 1950 121.6 126.0 141.0 135.0 125.0 149.0 170.0 170.0 158.0 133.0 121.6 140.0
## 1951 145.0 150.0 178.0 163.0 172.0 178.0 199.0 199.0 184.0 162.0 146.0 166.0
## 1952 171.0 180.0 193.0 181.0 183.0 218.0 230.0 242.0 209.0 191.0 172.0 194.0
## 1953 196.0 196.0 236.0 235.0 229.0 243.0 264.0 272.0 237.0 211.0 180.0 201.0
## 1954 204.0 188.0 235.0 227.0 234.0 264.0 302.0 293.0 259.0 229.0 203.0 229.0
## 1955 242.0 233.0 267.0 269.0 270.0 315.0 364.0 347.0 312.0 274.0 237.0 278.0
## 1956 284.0 277.0 317.0 313.0 318.0 374.0 413.0 405.0 355.0 306.0 271.0 306.0
## 1957 315.0 301.0 356.0 348.0 355.0
print(test_data)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct
## 1957 422.00 465.00 467.00 404.00 347.00
## 1958 340.00 318.00 362.00 348.00 363.00 435.00 488.15 488.15 404.00 359.00
## 1959 360.00 342.00 406.00 396.00 420.00 472.00 488.15 488.15 463.00 407.00
## 1960 417.00 391.00 419.00 461.00 472.00 488.15 488.15 488.15 488.15 461.00
## Nov Dec
## 1957 305.00 336.00
## 1958 310.00 337.00
## 1959 362.00 405.00
## 1960 390.00 432.00
I set the seed for reproducibility using set.seed(20228034). The training set (train_data) includes the first 70% of the data. The test set (test_data) includes the remaining 30% of the data starting from the next observation after the training set.
adf_result <- adf.test(train_data)
## Warning in adf.test(train_data): p-value smaller than printed p-value
kpss_result <- kpss.test(train_data)
## Warning in kpss.test(train_data): p-value smaller than printed p-value
print("ADF Result ")
## [1] "ADF Result "
print(adf_result$statistic)
## Dickey-Fuller
## -4.623752
print(adf_result$p.value)
## [1] 0.01
print("KPSS Result ")
## [1] "KPSS Result "
print(kpss_result$statistic)
## KPSS Level
## 1.945694
print(kpss_result$p.value)
## [1] 0.01
For ADF Result , (H): Non Stationary)
The test statistic is -4.6238, and the p-value is 0.01. Since the p-value is smaller than the significance level of 0.05, we reject the null hypothesis.This suggests that the series is stationary
For KPSS Test, (H0: Stationary)
The null hypothesis of the KPSS test is that the series is stationary around a deterministic trend. The test statistic is 1.9457, and the p-value is 0.01. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis.This suggests evidence of Non stationarity in the series.