library(tidyverse)
library(ggplot2)
library(forecast)
library(tseries)
library(lubridate)
df <- read.csv("D:/Downloads/PRSA2017_Data_20130301-20170228/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv")
#looking at the first 5 observations
head(df, 5)
## No year month day hour PM2.5 PM10 SO2 NO2 CO O3 TEMP PRES DEWP RAIN wd
## 1 1 2013 3 1 0 4 4 4 7 300 77 -0.7 1023.0 -18.8 0 NNW
## 2 2 2013 3 1 1 8 8 4 7 300 77 -1.1 1023.2 -18.2 0 N
## 3 3 2013 3 1 2 7 7 5 10 300 73 -1.1 1023.5 -18.2 0 NNW
## 4 4 2013 3 1 3 6 6 11 11 300 72 -1.4 1024.5 -19.4 0 NW
## 5 5 2013 3 1 4 3 3 12 12 300 72 -2.0 1025.2 -19.5 0 N
## WSPM station
## 1 4.4 Aotizhongxin
## 2 4.7 Aotizhongxin
## 3 5.6 Aotizhongxin
## 4 3.1 Aotizhongxin
## 5 2.0 Aotizhongxin
# uniting the year, month and day into a single column - date
df <- df %>%
unite('date',year:day, sep = "-", remove = FALSE)
# converting degrees Celsius to kelvin
df <- df %>% dplyr::mutate(TEMP_abs = TEMP + 273.15)
# converting to datetime object
df[['date']] <- as.POSIXct(df[['date']],format = "%Y-%m-%d")
# aggregating data at daily level
df <- df %>%
group_by(date) %>%
summarise(daily_mean_TEMP_abs = (mean(TEMP_abs, na.rm = TRUE)))
# plotting the time-series
(plot1 <- df %>%
ggplot() +
geom_line(aes(date, daily_mean_TEMP_abs)) +
ggtitle("Daily mean temperature vs date") +
ylab("Temperature (K)") +
xlab("Date"))
Upon visual inspection, the time series (daily granularity) appears mean-stationary but possibly variance non-stationary. However, we cannot rely on visual inspection alone, so we apply formal tests for mean stationarity.
adf.test(df$daily_mean_TEMP_abs) # Raw temp Value
##
## Augmented Dickey-Fuller Test
##
## data: df$daily_mean_TEMP_abs
## Dickey-Fuller = -1.9566, Lag order = 11, p-value = 0.5967
## alternative hypothesis: stationary
kpss.test(df$daily_mean_TEMP_abs) # Raw temp value
##
## KPSS Test for Level Stationarity
##
## data: df$daily_mean_TEMP_abs
## KPSS Level = 0.48883, Truncation lag parameter = 7, p-value = 0.04418
The Augmented Dickey-Fuller test fails to reject its unit-root null hypothesis (p = 0.60), while the KPSS test rejects its null of level stationarity (p = 0.044); both results indicate that the raw temperature series is not mean-stationary.
First, let us apply a Box-Cox transformation to address the possible variance non-stationarity. All temperature values (in kelvin) are positive, so the transformation is well defined.
df_boxcox <- df %>%
mutate(temp_boxcox = BoxCox(daily_mean_TEMP_abs,lambda='auto'))
# plot the transformed time series
df_boxcox %>%
ggplot() +
geom_line(aes(date,temp_boxcox))+
ggtitle("date vs time: boxcox transformation")+
ylab("abs_ temp (boxcox transformation)")+
xlab("date")
Re-running the stationarity tests on the Box-Cox transformed series:
adf.test(df_boxcox$temp_boxcox) # BoxCox temp Value
##
## Augmented Dickey-Fuller Test
##
## data: df_boxcox$temp_boxcox
## Dickey-Fuller = -1.9481, Lag order = 11, p-value = 0.6003
## alternative hypothesis: stationary
kpss.test(df_boxcox$temp_boxcox) # BoxCox temp value
##
## KPSS Test for Level Stationarity
##
## data: df_boxcox$temp_boxcox
## KPSS Level = 0.4821, Truncation lag parameter = 7, p-value = 0.0457
The Augmented Dickey-Fuller and KPSS tests check mean stationarity, and their p-values for the Box-Cox transformed series are essentially the same as (slightly larger than) those for the raw series, suggesting the transformation adds little. Comparing the raw and Box-Cox transformed plots also indicates that the raw series already has an approximately constant variance. Hence, we stick with the raw time series.
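To support the constant-variance claim, a rough check (not part of the original analysis) is to look at the spread of the daily series within each month; roughly flat monthly standard deviations would back up keeping the raw data.
# monthly standard deviation of the daily mean temperature (illustrative check)
df %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(monthly_sd = sd(daily_mean_TEMP_abs, na.rm = TRUE)) %>%
  ggplot() +
  geom_line(aes(month, monthly_sd)) +
  ylab("Monthly SD of daily mean temperature (K)") +
  xlab("Month")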
Next, we apply a first difference to make the time series mean-stationary.
df1 <- df %>%
filter(date >= ymd('2013-03-01') & date <= ymd("2017-02-28")) %>%
mutate(temp_diff1 = daily_mean_TEMP_abs - lag(daily_mean_TEMP_abs))
#plotting the time series
df1 %>%
ggplot()+
geom_line(aes(date,temp_diff1))+
theme_bw()+
ggtitle("date vs time: first difference")+
ylab("abs_ temp (first difference)")+
xlab("date")
## Warning: Removed 1 row(s) containing missing values (geom_path).
Looking at this plot, the differenced series appears to have a constant mean around 0. Let us verify the mean-stationarity criterion with the same tests.
# removing the first data point: NA removal
df1 <- df1 %>%
filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))
# mean stationarity tests
adf.test(df1$temp_diff1) # fd1 temp Value
## Warning in adf.test(df1$temp_diff1): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: df1$temp_diff1
## Dickey-Fuller = -12.773, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(df1$temp_diff1) # fd1 temp value
## Warning in kpss.test(df1$temp_diff1): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: df1$temp_diff1
## KPSS Level = 0.068418, Truncation lag parameter = 7, p-value = 0.1
Thus, after applying the first difference, the time series is mean-stationary: the ADF test now rejects the unit root (p < 0.01) and the KPSS test no longer rejects level stationarity (p > 0.1).
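As a cross-check on the choice of d = 1, forecast::ndiffs() applies unit-root tests to suggest the number of differences required; its output is not shown in the original analysis, so this is purely an illustrative sketch.
# number of differences suggested by unit-root tests (KPSS by default)
ndiffs(df$daily_mean_TEMP_abs)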
#printing raw time series to view seasonality
plot1
Looking at the raw time series plot again, there is clear seasonality with a period of roughly one year.
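To make this annual cycle explicit, a quick decomposition can be run on the daily series. This is a minimal sketch, not part of the original analysis, assuming a seasonal period of 365 days and using forecast::na.interp() to fill any missing daily means before decomposition.
# treat the daily means as a ts with one cycle per year and decompose it
temp_ts <- ts(na.interp(df$daily_mean_TEMP_abs), frequency = 365)
plot(stl(temp_ts, s.window = "periodic"))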
Let us produce ACF and PACF plots for the differenced (stationary) series.
#ACF and PACF plots
par(mfrow = c(1,2))
acf(df1$temp_diff1,lag.max=20)
pacf(df1$temp_diff1,lag.max=20)
The ACF plot hints at some residual seasonality, and the PACF shows oscillations after lag 7, suggesting an MA component. The actual orders, however, are better chosen by fitting several candidate ARIMA models and comparing them with AIC or BIC as the criterion.
# comparing candidate ARIMA models by AIC
AIC(
arima(df1$daily_mean_TEMP_abs ,order=c(0,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(0,1,2)),
arima(df1$daily_mean_TEMP_abs,order=c(0,1,3)),
arima(df1$daily_mean_TEMP_abs,order=c(1,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(1,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(2,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,2))
)
## df AIC
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1)) 2 6290.070
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2)) 3 6166.603
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3)) 4 6158.679
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0)) 2 6295.277
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1)) 3 6199.492
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0)) 3 6210.754
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0)) 4 6184.981
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1)) 5 6158.656
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2)) 6 6149.699
BIC(
arima(df1$daily_mean_TEMP_abs ,order=c(0,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(0,1,2)),
arima(df1$daily_mean_TEMP_abs,order=c(0,1,3)),
arima(df1$daily_mean_TEMP_abs,order=c(1,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(1,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(2,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,0)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,1)),
arima(df1$daily_mean_TEMP_abs,order=c(3,1,2))
)
## df BIC
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1)) 2 6300.641
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2)) 3 6182.459
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3)) 4 6179.821
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0)) 2 6305.848
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1)) 3 6215.348
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0)) 3 6226.610
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0)) 4 6206.123
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1)) 5 6185.083
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2)) 6 6181.412
Among the candidates above, ARIMA(3,1,2) has the lowest AIC and BIC, so it is the chosen model. Let us cross-check whether auto.arima arrives at the same result.
#auto arima for best model
auto.arima(df1$daily_mean_TEMP_abs,stationary=FALSE,allowdrift=FALSE,
seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
## Series: df1$daily_mean_TEMP_abs
## ARIMA(3,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 mean
## 0.9187 -0.1672 0.2344 286.0401
## s.e. 0.0255 0.0348 0.0255 3.5514
##
## sigma^2 = 4.104: log likelihood = -3102.13
## AIC=6214.26 AICc=6214.3 BIC=6240.69
auto.arima instead selects ARIMA(3,0,0) with a non-zero mean. Note that its AIC is not directly comparable with the values in the table above, because AIC can only be compared across models with the same order of differencing. Given the unit-root evidence from the ADF and KPSS tests on the raw series, we proceed with ARIMA(3,1,2) as our model for fitting.
Let us compute the in-sample predictions and compare them with the observed values.
#prediction
best_mod = arima(df1$daily_mean_TEMP_abs,order = c(3,1,2))
resid = best_mod$residuals
pred = df1$daily_mean_TEMP_abs - resid
df2 <- df1 %>%
mutate(pred_temp = pred)
# ensuring both timeseries have same dates
df1 <- df1 %>%
filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))
df2 <- df2 %>%
filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))
#plotting the prediction vs observed
ggplot()+
geom_line(aes(df1$date,df1$daily_mean_TEMP_abs),color='black') +
geom_line(aes(df2$date,df2$pred_temp),color ='blue',alpha = 0.3) +
theme_bw() +
xlab("Date") +
ylab("actual vs predicted")
From the plot, the in-sample predictions track the observed series closely.
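As an aside, the same one-step in-sample fits can be obtained more directly by refitting the model with forecast::Arima(), which stores the series and exposes fitted(). A minimal sketch, assumed to give values matching pred computed above:
# alternative: fit with forecast::Arima() and use fitted() for in-sample predictions
best_mod_f <- Arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2))
head(fitted(best_mod_f))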
Let us perform the residual analysis for the ARIMA(3,1,2) model.
# residuals
best_mod = arima(df1$daily_mean_TEMP_abs,order=c(3,1,2))
forecast::checkresiduals(best_mod)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,1,2)
## Q* = 1.8491, df = 5, p-value = 0.8696
##
## Model df: 5. Total lags used: 10
par(mfrow=c(1,2))
resid = best_mod$residuals
acf(resid,lag.max=20)
pacf(resid,lag.max=20)
The residuals appear to have constant variance and a mean of approximately 0, and their distribution looks close to normal; however, some seasonality remains visible.
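The normality claim can be checked a little more formally. A small sketch, not part of the original output, with a QQ-plot and a Shapiro-Wilk test:
# QQ-plot and Shapiro-Wilk test on the residuals (shapiro.test handles up to 5000 points)
qqnorm(as.numeric(resid)); qqline(as.numeric(resid))
shapiro.test(as.numeric(resid))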
# Box-Ljung test
for (i in 1:10){
print(Box.test(resid,type='Ljung-Box',lag = i))
}
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.0017368, df = 1, p-value = 0.9668
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.12168, df = 2, p-value = 0.941
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.21235, df = 3, p-value = 0.9756
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.37196, df = 4, p-value = 0.9847
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.78432, df = 5, p-value = 0.978
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.1725, df = 6, p-value = 0.9782
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.1831, df = 7, p-value = 0.9913
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.3314, df = 8, p-value = 0.9952
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.4669, df = 9, p-value = 0.9974
##
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.8491, df = 10, p-value = 0.9974
From the above Ljung-Box tests for up to 10 lags, every p-value is well above 0.05, indicating no significant autocorrelation at any lag up to 10. We can therefore treat the residuals as white noise.
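Note that the loop above does not adjust the degrees of freedom for the 5 estimated ARMA coefficients. Doing so at lag 10 via the fitdf argument should reproduce the Q* statistic reported by checkresiduals() earlier:
# Ljung-Box test at lag 10 with df adjusted for the 5 fitted ARMA parameters
Box.test(resid, type = 'Ljung-Box', lag = 10, fitdf = 5)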
Let us evaluate the in-sample performance of the model, using RMSE as the criterion.
# RMSE in kelvin
RMSE = sqrt(mean((df2$pred_temp - df2$daily_mean_TEMP_abs)^2))
RMSE
## [1] 1.981746
The result suggests that, in-sample, the model predicts the daily mean temperature to within about 2 K on average.
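For completeness, other in-sample error summaries can be computed in the same way. A small sketch, not reported in the original, adding MAE and MAPE:
# mean absolute error (kelvin) and mean absolute percentage error (percent)
MAE  <- mean(abs(df2$pred_temp - df2$daily_mean_TEMP_abs))
MAPE <- 100 * mean(abs(df2$pred_temp - df2$daily_mean_TEMP_abs) / df2$daily_mean_TEMP_abs)
c(MAE = MAE, MAPE = MAPE)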
Let us now produce a forecast for the next five periods (days).
#forecast
best_mod %>%
forecast(h=5) %>%
autoplot()
On visual inspection, the forecast values seem reasonable.
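The numeric forecasts and prediction intervals can also be inspected directly. A minimal sketch labelling them with the next five calendar dates, assuming the daily index simply continues after the last observation:
# point forecasts with 95% prediction intervals for the next five days
fc <- forecast(best_mod, h = 5)
data.frame(date  = max(df1$date) + days(1:5),
           point = as.numeric(fc$mean),
           lo95  = as.numeric(fc$lower[, "95%"]),
           hi95  = as.numeric(fc$upper[, "95%"]))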