Section 1 - Exploratory Data Analysis

This data set includes hourly air pollutants data from 12 nationally-controlled air-quality monitoring sites. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. The time period is from March 1st, 2013 to February 28th, 2017. The station of interest and hence the region of interest is Aotizhongxin.

The air pollutants tracked are PM2.5, PM10, SO2, NO2, CO, O3. Along with that other air important indicators such as TEMP, PRES, DEWP, RAIN, wd (winddirection), WSPM (windspeed in m/s) are also recorded.

# Load libraries
library(dplyr)
library(ggplot2)
library(janitor)
library(readr)
library(zoo)
library(forecast)
library(tseries)
library(lubridate)
library(tidyverse)
library(prophet)

#Loading data
df <- read.csv("D:/Downloads/PRSA2017_Data_20130301-20170228/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv")

head(df)
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     4    4   4   7 300 77 -0.7 1023.0 -18.8    0 NNW
## 2  2 2013     3   1    1     8    8   4   7 300 77 -1.1 1023.2 -18.2    0   N
## 3  3 2013     3   1    2     7    7   5  10 300 73 -1.1 1023.5 -18.2    0 NNW
## 4  4 2013     3   1    3     6    6  11  11 300 72 -1.4 1024.5 -19.4    0  NW
## 5  5 2013     3   1    4     3    3  12  12 300 72 -2.0 1025.2 -19.5    0   N
## 6  6 2013     3   1    5     5    5  18  18 400 66 -2.2 1025.6 -19.6    0   N
##   WSPM      station
## 1  4.4 Aotizhongxin
## 2  4.7 Aotizhongxin
## 3  5.6 Aotizhongxin
## 4  3.1 Aotizhongxin
## 5  2.0 Aotizhongxin
## 6  3.7 Aotizhongxin

Temperature is the variable of interest. Logically speaking, the temperature is lower at night and goes higher around the noon and then drops as evening approaches. Two most plausible factors that could cause this variation are presence of Sun and vehicular activity due to humans.

Particulates like SO2, NO2 etc. are a function of human activities like vehicular emmissions, factory emmissions etc. Since, humans usually are involved in such activities during the normal working hours, there is a usual rise in the concentrations of those particulates in such hours.

For this project, Temperature is the variable of interest. However, there are many significant indicators of air quality. This dataset has many null values which makes it difficult to forecast as important information is lost at many time instances. Hence, only for this assignment, Temperature, which has only 20 null values, is chosen for forecasting.

Initial basic EDA

attach(df)
boxplot(TEMP , names = "Temp")

#looks like a bimodal distribution
hist(TEMP)

#summarising by hour
#avg. temp increases as day approaches and decreases from evening to night till early morning
by_hour <- group_by(df, hour)
df1 <- summarise(by_hour, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df1, aes(x=hour, y=T1)) +
  geom_line()+
  ggtitle("abs_temp vs hour")+
  ylab("abs_ temp ")+
  xlab("hour")

#summarising by day
by_day <- group_by(df, day)
df2 <- summarise(by_day, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df2, aes(x=day, y=T1)) +
  geom_line() +
  ggtitle("avg_abs_temp vs day")+
  ylab("avg_abs_ temp ")+
  xlab("day")

#summarising by month
by_month <- group_by(df, month)
df3 <- summarise(by_month, T1 = mean(TEMP, na.rm = TRUE)) 
ggplot(df3, aes(x=month, y=T1)) +
  geom_line() +
  ggtitle("avg_abs_temp vs month")+
  ylab("avg_abs_ temp")+
  xlab("month")

#range, mean, median
summary(df$TEMP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -16.80    3.10   14.50   13.58   23.30   40.50      20
#standard deviation
sd(df$TEMP, na.rm = TRUE)
## [1] 11.3991

The intended variable of interest TEMP has no outliers. This can be confirmed from the previous boxplot visualization and summaries.

Linear Regression between TEMP and date

# prepping date and Temp variables

df <- read.csv("D:/Downloads/PRSA2017_Data_20130301-20170228/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv")

# uniting the date, month and year into a single column - date
df <- df %>%
  unite('date',year:day, sep = "-", remove = FALSE)

#converting degree Celsius to kelvin
df <- df %>% dplyr::mutate(df, TEMP_abs = TEMP + 273.15)

# converting to datetime object
df[['date']] <- as.POSIXct(df[['date']],format = "%Y-%m-%d")

# aggregating data at daily level 
df <- df %>% 
  group_by(date) %>%
  summarise(daily_mean_TEMP_abs = (mean(TEMP_abs, na.rm = TRUE)))

# fitting linear regression model as per avg temp vs hour

LR <- lm(daily_mean_TEMP_abs ~ date, data = df)
summary(LR)
## 
## Call:
## lm(formula = daily_mean_TEMP_abs ~ date, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.854 -10.373   1.082   9.774  19.971 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.350e+02  1.098e+01  30.518  < 2e-16 ***
## date        -3.388e-08  7.700e-09  -4.399 1.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.72 on 1459 degrees of freedom
## Multiple R-squared:  0.01309,    Adjusted R-squared:  0.01242 
## F-statistic: 19.36 on 1 and 1459 DF,  p-value: 1.164e-05

Here we can see that this model(daily_mean_TEMP_abs vs date) has a p- value less than 0.05. However,the model is insignificant as adj-Rsq is really low.

One conclusion that can be drawn here is that fitting a linear regression model is not adequately explaining the movement of variable TEMP i.e. the temperature.

Section 2 - ARIMA Modeling

Let us look at the Time series plot

# plotting the time-series
(plot1 <- df %>%
  ggplot() +
  geom_line(aes (date, daily_mean_TEMP_abs)) + 
  ggtitle("TEMP vs day-month-year") +
  ylab("Temperature") +
  xlab("Date"))

Upon visual inspection, the time series (granularity: day) seems to be mean-stationary but variance non-stationary . However, we cannot rely on visual inspections alone for mean stationary check.

adf.test(df$daily_mean_TEMP_abs) # Raw temp Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df$daily_mean_TEMP_abs
## Dickey-Fuller = -1.9566, Lag order = 11, p-value = 0.5967
## alternative hypothesis: stationary
kpss.test(df$daily_mean_TEMP_abs) # Raw temp value
## 
##  KPSS Test for Level Stationarity
## 
## data:  df$daily_mean_TEMP_abs
## KPSS Level = 0.48883, Truncation lag parameter = 7, p-value = 0.04418

Augmented Dickey-Fuller Test and KPSS Test for Level Stationarity show us that the raw temp values are not mean stationary.

First, let us apply BoxCox transformation to make the time series variance stationary. All the temperature values are positive and hence we can proceed to apply BoxCox transformation.

df_boxcox <- df %>%
  mutate(temp_boxcox = BoxCox(daily_mean_TEMP_abs,lambda='auto'))
# plot the transformed time series
df_boxcox %>%
  ggplot() +
  geom_line(aes(date,temp_boxcox))+
  ggtitle("date vs time: boxcox transformation")+
  ylab("abs_ temp (boxcox transformation)")+
  xlab("date")

Applying Boxcox transformation to check variance stationarity

adf.test(df_boxcox$temp_boxcox) # BoxCox temp Value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df_boxcox$temp_boxcox
## Dickey-Fuller = -1.9481, Lag order = 11, p-value = 0.6003
## alternative hypothesis: stationary
kpss.test(df_boxcox$temp_boxcox) # BoxCox temp value
## 
##  KPSS Test for Level Stationarity
## 
## data:  df_boxcox$temp_boxcox
## KPSS Level = 0.4821, Truncation lag parameter = 7, p-value = 0.0457

Even though Augmented Dickey-Fuller and KPSS tests are used for mean stationarity check, the p-values for BoxCox transformed are a bit greater than that of raw time series suggesting that transformation is redundant. Also,relatively observing the raw and BoxCox transformed time series, we come to a conclusion that the raw has approximately a constant variance. Hence, we shall stick with the raw time series data.

Applying first difference to make time series mean stationary

df1 <- df %>%
  filter(date >= ymd('2013-03-01') & date <= ymd("2017-02-28")) %>%
  mutate(temp_diff1 = daily_mean_TEMP_abs - lag(daily_mean_TEMP_abs))

#plotting the time series
df1 %>%
  ggplot()+
  geom_line(aes(date,temp_diff1))+
  theme_bw()+
  ggtitle("date vs time: first difference")+
  ylab("abs_ temp (first difference)")+
  xlab("date")

Looking at this plot, we can see that the time series has a constant mean around 0. Let us check the mean stationarity criterion

# removing the first data point: NA removal
df1 <- df1 %>%
  filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))

# mean stationarity tests
adf.test(df1$temp_diff1) # fd1 temp Value
## Warning in adf.test(df1$temp_diff1): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df1$temp_diff1
## Dickey-Fuller = -12.773, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(df1$temp_diff1) # fd1  temp value
## Warning in kpss.test(df1$temp_diff1): p-value greater than printed p-value
## 
##  KPSS Test for Level Stationarity
## 
## data:  df1$temp_diff1
## KPSS Level = 0.068418, Truncation lag parameter = 7, p-value = 0.1

Thus, after applying first difference, our time series has become a stationary time series.

#printing raw time series to view seasonality
plot1

Looking at this raw time series plot again, one can clearly observe seasonality with a period of almost 1 year

Let us produce ACF and PACF plots on the created stationary time series

#ACF and PACF plots
par(mfrow = c(1,2))
acf(df1$temp_diff1,lag.max=20)
pacf(df1$temp_diff1,lag.max=20)

The ACF plot shows a bit seasonality. PACF plot does show oscillations after lag 7 indicating presence of MA process. However, the actual model can be understood by generating various ARIMA models and comparing them using AIC or BIC as the criterion.

Let us fit ARIMA models and check AIC and BIC values to select the best model. Here we add the seasonality parameter because it is clearly evident that seasonality is present atleast on yearly level.

#ACF and PACF plots
AIC(
  arima(df1$daily_mean_TEMP_abs ,order=c(0,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(0,1,2), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(0,1,3), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(1,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(1,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(2,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,2), seasonal = list(order = c(0, 1, 1), period = NA))
)
##                                                                                                      df
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  3
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 1), period = NA))  5
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  3
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  5
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  6
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA))  7
##                                                                                                           AIC
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6562.468
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA)) 6173.903
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 1), period = NA)) 6150.467
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6301.349
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6175.680
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6217.319
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6191.821
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6143.168
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA)) 6137.561
BIC(
  arima(df1$daily_mean_TEMP_abs ,order=c(0,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(0,1,2), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(0,1,3), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(1,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(1,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(2,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,0), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,1), seasonal = list(order = c(0, 1, 1), period = NA)),
  arima(df1$daily_mean_TEMP_abs,order=c(3,1,2), seasonal = list(order = c(0, 1, 1), period = NA))
)
##                                                                                                      df
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  3
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 1), period = NA))  5
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  3
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  4
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA))  5
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA))  6
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA))  7
##                                                                                                           BIC
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6578.323
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA)) 6195.043
## arima(df1$daily_mean_TEMP_abs, order = c(0, 1, 3), seasonal = list(order = c(0, 1, 1), period = NA)) 6176.891
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6317.203
## arima(df1$daily_mean_TEMP_abs, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6196.820
## arima(df1$daily_mean_TEMP_abs, order = c(2, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6238.459
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 6218.245
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 1), seasonal = list(order = c(0, 1, 1), period = NA)) 6174.877
## arima(df1$daily_mean_TEMP_abs, order = c(3, 1, 2), seasonal = list(order = c(0, 1, 1), period = NA)) 6174.554

Looking at the above outputs, ARIMA (3,1,0)(0,1,1) shows the optimum AIC and BIC values. Thus, ARIMA (3,1,0)(0,1,1) is the chosen model. Let us cross check with auto arima if it provides us with the same results.

#auto arima for best model
auto.arima(df1$daily_mean_TEMP_abs,stationary=TRUE, allowdrift=TRUE,
           seasonal=TRUE,stepwise=FALSE,approximation=FALSE)
## Series: df1$daily_mean_TEMP_abs 
## ARIMA(3,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      mean
##       0.9187  -0.1672  0.2344  286.0401
## s.e.  0.0255   0.0348  0.0255    3.5514
## 
## sigma^2 = 4.104:  log likelihood = -3102.13
## AIC=6214.26   AICc=6214.3   BIC=6240.69

Auto arima confirms that the ARIMA(3,0,0) is the best model for this time frame of this series.Here, auto arima has not provided us with any seasonal component. Hence, we will go with ARIMA (3,1,0)(0,1,1) as our desired model for fitting.

Let us calculate in-sample prediction and compare with the observed values

#prediction
best_mod = arima(df1$daily_mean_TEMP_abs,order = c(3,1,0),seasonal = list(order = c(0, 1, 1), period = NA) )
resid = best_mod$residuals
pred = resid + df1$daily_mean_TEMP_abs


df2 <- df1 %>%
  mutate(pred_temp = pred)

# ensuring both timeseries have same dates
df1 <- df1 %>%
  filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))
df2 <- df2 %>%
  filter(date >= ymd('2013-03-02') & date <= ymd("2017-02-28"))

#plotting the prediction vs observed
ggplot()+
  geom_line(aes(df1$date,df1$daily_mean_TEMP_abs),color='black') +
  geom_line(aes(df2$date,df2$pred_temp),color ='blue',alpha = 0.3) +
  theme_bw() +
  xlab("Date") +
  ylab("actual vs predicted")

From the above plot, we can see that the predicted values for in sample data do follow the trend exhibited by actual values of the time series.

Let us perform the residual analysis for the ARIMA (3,1,0)(0,1,1) model.

# residuals
best_mod = arima(df1$daily_mean_TEMP_abs,order=c(3,1,0), seasonal = list(order = c(0, 1, 1), period = NA))
forecast::checkresiduals(best_mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(3,1,0)
## Q* = 30.673, df = 6, p-value = 2.927e-05
## 
## Model df: 4.   Total lags used: 10
par(mfrow=c(1,2))
resid = best_mod$resid
acf(resid,lag.max=20)
pacf(resid,lag.max=20)

# Box-Ljung test 
for(i in 1:10){
  print(Box.test(resid,type='Ljung-Box',lag = i))
} 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.41323, df = 1, p-value = 0.5203
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 1.7272, df = 2, p-value = 0.4216
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 2.1692, df = 3, p-value = 0.538
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 26.699, df = 4, p-value = 2.286e-05
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 26.817, df = 5, p-value = 6.192e-05
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 26.817, df = 6, p-value = 0.0001567
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 27.393, df = 7, p-value = 0.0002832
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 28.016, df = 8, p-value = 0.0004713
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 28.775, df = 9, p-value = 0.0007072
## 
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 30.673, df = 10, p-value = 0.0006644

From the above L-jung Box tests for upto 10 lags, p-value for each lag is greater than 0.05, except the few ones. This is an indication that some autocorrelation does exist at some lags . Thus, we cannot conclude that the residuals are not white-noise. However, this seems to be the best model as AIC and BIC values are optimal for this model and auto.arima does not provide us with any seasonal components, depsite setting the seasonal parameter as TRUE.

Let us calculate the insample model performance and the criterion to be used is RMSE.

#RMSE in degree kelvin
RMSE = sqrt(mean((df2$pred_temp - df2$daily_mean_TEMP_abs)^2))
RMSE
## [1] 2.008763

The results suggest that our model predicts the temperature index within ~ 2 degree kelvin on average in-sample.

Let us now create a forecast for five time periods.

#forecast
best_mod %>%
  forecast(h=5) %>% 
  autoplot()

Visually inspecting, the forecast values do seem reasonable.

Section 3 - Facebook Prophet Model

From the past assignment, we know that the TS is variance stationary but mean non-stationary. However, fitting a prophet model can be done irrespective of the stationarity state of the TS.

Fitting a Prophet model + plotting

#fitting a prophet model
prophet_data = df %>% 
  rename(ds = date, # Have to name our date variable "ds"
         y = daily_mean_TEMP_abs)  # Have to name our time series "y"
train = prophet_data %>% 
  filter(ds <= ymd("2017-01-28"))
test = prophet_data %>%
  filter(ds > ymd("2017-01-28"))
model = prophet(train, daily.seasonality = "auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = 30)
forecast = predict(model,future)

A prophet model has been fitted with forecast period as 30 days. This means that beyond the date:2017-01-28, we will make predictions out 30 days. The following plots will help us visualize the forecast.

#plotting the prophet model
plot(model,forecast)+
  ylab("daily_mean_TEMP_abs")+xlab("Date") + ggtitle("Prophet model plot") + 
  theme_bw()

#plotting the prophet model: interactive charts
dyplot.prophet(model,forecast)

The above plots indicate that prophet model has been able to capture the seasonality really well. Also, upon close inspection, it seems that the prophet model has been able to capture the trend too.(Very low but increasing trend)

#visualizing the components
prophet_plot_components(model,forecast)

Trend: The trend seems to have a cycle of ascent and descent over a period of 1.5 years. However, beyond June 2016, the trend has shown a plateau till early months of 2017. This maybe due to factors like global warming, that are not allowing the gradual descent that was observed for the past years.

Weekly: Observing seasonality is difficult at weekly level. However, there is gradual ascent and descent with Tuesday having the lowest value.

Yearly: Seasonal and cyclical nature can be clearly observed here.

Changepoints

#plotting the change points
plot(model,forecast)+
  add_changepoints_to_plot(model)+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs")+ ggtitle("Default model")

# number of change points: Manual number of change points
model = prophet(train,n.changepoints=50,daily.seasonality="auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future)
plot(model,forecast)+
  add_changepoints_to_plot(model)+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs") +
  ggtitle("New model with 'n.changepoints = 50' ")

#observing the components again
prophet_plot_components(model,forecast)

With respect to the previous trend (refer to the previous component figure), the one with added change points to 50 has shown to smooth out the trend a bit. However, upon visualizing the two component plots, we can say that not much change is observed in weekly, yearly and daily components. Hence, incorporation of additional change points (to 50) has not added much value to the initially fitted prophet model, which has by default 25 changepoints.

Forecast

Let us forecast out 30 days using our train data. Also, let us visually compare how our forecast varies from the observed values of test data.

#forecast
forecast_plot_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.POSIXct(ds),format = "%Y-%m-%d") %>% 
  filter(ds > ymd("2017-01-28"))

forecast_not_scaled = ggplot()+
  geom_line(aes(test$ds,test$y))+
  geom_line(aes(forecast_plot_data$ds,forecast_plot_data$yhat),color='red')

forecast_scaled = forecast_not_scaled + 
  ylim(250,315)

forecast_scaled

Saturation Points

Ideally saturation points are not required for this TS. Since, we are tracking the absolute value of temperature, ideally the lowest could be 0 Kelvin. But, we all know that it is impossible to achieve 0 Kelvin.The highest temperature recorded in China is around 50.5 degree Celsius. Hence, setting this as the cap for saturation points. For convenience, the saturation points are set to: Lowest: 0 Kelvin, Highest: 323.15 Kelvin.

#saturation points
one_year_future = make_future_dataframe(model,periods = 365)

one_year_forecast = predict(model,one_year_future)

plot(model,one_year_forecast)+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs") + 
  ggtitle("one year future forecast without saturation points")

#specifying saturation points

# Set "floor" in training data
train$floor = 0 # no
train$cap = 323.15
future$floor = 0
future$cap = 323.15
# Set floor in forecast data
one_year_future$floor = 0
one_year_future$cap = 323.15
sat_model = prophet(train,growth='logistic',daily.seasonality="auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
sat_one_year_forecast = predict(sat_model,one_year_future)

#plotting with saturation points
plot(sat_model, sat_one_year_forecast) +
  ylim(0,330) +
  theme_bw() +
  xlab("Date") +
  ylab("daily_mean_TEMP_abs") +
  ggtitle("one year future forecast with saturation points")

Seasonality

# creating new identical test and train data having no saturation points to see trends properly
train_iden = prophet_data %>% 
  filter(ds <= ymd("2017-01-28"))
test_iden = prophet_data %>%
  filter(ds > ymd("2017-01-28"))
model_iden = prophet(train_iden, daily.seasonality = "auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future_iden = make_future_dataframe(model_iden,periods = 30)
forecast_iden = predict(model_iden,future)

#additive seasonality
additive = prophet(train_iden, daily.seasonality = "auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
add_fcst = predict(additive,future_iden)
plot(additive,add_fcst) +
  xlab("Date") +
  ylab("daily_mean_TEMP_abs") +
  ggtitle("additive seasonality")+
  ylim(250,323.15)

#multiplicative seasonality
multi = prophet(train_iden,seasonality.mode = 'multiplicative',daily.seasonality = "auto")
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
multi_fcst = predict(multi,future_iden)
plot(multi,multi_fcst) + 
  xlab("Date") +
  ylab("daily_mean_TEMP_abs") +
  ggtitle("multiplicative seasonality")+
  ylim(250,323.15)

#additive seasonality: plotting components
prophet_plot_components(additive,add_fcst)

#multiplicative seasonality: plotting components
prophet_plot_components(multi, multi_fcst)

The component plots for additive and multiplicative seasonality incporporated models do not have much difference in the components except the trend part. There are minute changes in the trend before and after 2016. After 2016, as we head towards 2017, the additive model shows increasing (slope is very low but positive) trend whereas the multiplicative one shows constant(slope is almost zero) trend.

Holidays

Let us involve and model holidays to learn if some association can be understood. Since the data was recorded in the province of China, setting the country name to China.

#impact of holidays on TEMP
model = prophet(train_iden,fit = FALSE, daily.seasonality = "auto")
model = add_country_holidays(model,country_name = 'CN')
model = fit.prophet(model,train_iden)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
forecast = predict(model,future_iden)
prophet_plot_components(model,forecast)

Upon comparing with the previous component plots, there is hardly any difference in the components except the trend components between the model involving holidays and additive model. The trend components of the model involving holidays and multiplicative model are identical in nature.

#holiday effects
forecast %>%
  filter(holidays != 0) %>%
  dplyr::select(-ds:-additive_terms_upper, -holidays:-holidays_upper, -weekly:-yhat, -contains("upper"), -contains("lower")) %>%
  mutate_all(~ if_else(. == 0, as.numeric(NA), .)) %>%
  summarize_if(is.numeric, ~ max(., na.rm = T)) %>%
  pivot_longer(
    cols = model$train.holiday.names,
    names_to = 'holiday', 
    values_to = 'effect'
  ) %>%
  ggplot() +
  geom_col(aes(effect,holiday)) +
  theme_bw()

Here, we can see the relative impacts of the major holidays that are celebrated in China on the mean absolute Temperature.

Overall, inclusion of holidays does seem to create a positive impact by involving more information in the model and we can logically add them as the grnularity of this data is daily level.

Performance of prophet

In-sample performance:

# performance on in-sample data
forecast_insample = predict(model,train_iden)
forecast_metric_data = forecast_insample %>% 
  as_tibble() %>% 
  mutate(ds = as.POSIXct(ds),format = "%Y-%m-%d") %>% 
  filter(ds <= ymd("2017-01-28"))
RMSE = sqrt(mean((train_iden$y - forecast_metric_data$yhat)^2))

print(paste("In-sample RMSE:",round(RMSE,2)))
## [1] "In-sample RMSE: 2.45"
#visualization
ggplot()+
  geom_line(aes(train_iden$ds ,train_iden$y), color = 'green')+
  geom_line(aes(forecast_metric_data$ds,forecast_metric_data$yhat),color='red',
            alpha = 1)+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs") +
  ggtitle("In sample performance visualization")

Out-of-sample performance:

# performance on in-sample data
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.POSIXct(ds),format = "%Y-%m-%d") %>% 
  filter(ds > ymd("2017-01-28"))
RMSE = sqrt(mean((test$y - forecast_metric_data$yhat)^2))

print(paste("Out-of-sample RMSE:",round(RMSE,2)))
## [1] "Out-of-sample RMSE: 3.23"
#visualization
ggplot()+
  geom_line(aes(test$ds ,test$y), color = 'green')+
  geom_line(aes(forecast_metric_data$ds,forecast_metric_data$yhat),color='red',
            alpha = 1)+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs")+
  ggtitle("Out of sample performance visualization")

It seems that our additive model with holiday inclusion does well both in-sample and out-of-sample in capturing the overall trend. When it comes to out of sample performance, the overall spikes are not effectively achieved by our fitted model, but the trend is captured.

Cross Validation

Cross Validation: Additive model

#cross-validation with prophet
df.cv <- cross_validation(model, initial = 365*3, period = 15, horizon = 30, units = 'days')
## Making 21 forecasts with cutoffs between 2016-03-04 and 2016-12-29
head(df.cv)
## # A tibble: 6 x 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  282. 2016-03-05 00:00:00  277.       274.       281. 2016-03-04 00:00:00
## 2  278. 2016-03-06 00:00:00  278.       274.       281. 2016-03-04 00:00:00
## 3  279. 2016-03-07 00:00:00  278.       275.       281. 2016-03-04 00:00:00
## 4  275. 2016-03-08 00:00:00  278.       275.       281. 2016-03-04 00:00:00
## 5  274. 2016-03-09 00:00:00  279.       276.       282. 2016-03-04 00:00:00
## 6  276. 2016-03-10 00:00:00  279.       276.       283. 2016-03-04 00:00:00
unique(df.cv$cutoff)
##  [1] "2016-03-04 GMT" "2016-03-19 GMT" "2016-04-03 GMT" "2016-04-18 GMT"
##  [5] "2016-05-03 GMT" "2016-05-18 GMT" "2016-06-02 GMT" "2016-06-17 GMT"
##  [9] "2016-07-02 GMT" "2016-07-17 GMT" "2016-08-01 GMT" "2016-08-16 GMT"
## [13] "2016-08-31 GMT" "2016-09-15 GMT" "2016-09-30 GMT" "2016-10-15 GMT"
## [17] "2016-10-30 GMT" "2016-11-14 GMT" "2016-11-29 GMT" "2016-12-14 GMT"
## [21] "2016-12-29 GMT"
#plotting Cross-Val Actual vs Pred
df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs") +
  scale_color_discrete(name = 'Cutoff')

head(performance_metrics(df.cv))
##   horizon      mse     rmse      mae        mape       mdape       smape
## 1  3 days 7.210890 2.685310 2.060706 0.007150614 0.005808667 0.007144014
## 2  4 days 6.370263 2.523938 1.961022 0.006783966 0.005178127 0.006779074
## 3  5 days 5.670068 2.381190 1.901887 0.006588856 0.005551978 0.006588966
## 4  6 days 4.670644 2.161167 1.755290 0.006082927 0.005580515 0.006081695
## 5  7 days 4.839387 2.199861 1.768499 0.006141514 0.005658155 0.006138169
## 6  8 days 6.238315 2.497662 1.949593 0.006790940 0.005658155 0.006783447
##    coverage
## 1 0.7619048
## 2 0.7460317
## 3 0.7777778
## 4 0.8571429
## 5 0.8571429
## 6 0.7936508
plot_cross_validation_metric(df.cv, metric = 'rmse',rolling_window = 0.1)

plot_cross_validation_metric(df.cv, metric = 'mae',rolling_window = 0.1)

plot_cross_validation_metric(df.cv, metric = 'mape',rolling_window = 0.1)

With respect to MAPE for additive model(with holiday inclusion), the interpretation is that our fitted model is able to forecast with around 0.8% error out 10 days, with around 0.65% error out 20 days and with around 0.65% error out 30 days into the future.

Cross Validation: Multiplicative model

#cross-validation with prophet
model = prophet(train_iden,fit = FALSE, daily.seasonality = "auto", seasonality.mode = 'multiplicative')
model = add_country_holidays(model,country_name = 'CN')
model = fit.prophet(model,train_iden)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
df.cv <- cross_validation(model, initial = 365*3, period = 15, horizon = 30, units = 'days')
## Making 21 forecasts with cutoffs between 2016-03-04 and 2016-12-29
head(df.cv)
## # A tibble: 6 x 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  282. 2016-03-05 00:00:00  277.       274.       281. 2016-03-04 00:00:00
## 2  278. 2016-03-06 00:00:00  278.       275.       281. 2016-03-04 00:00:00
## 3  279. 2016-03-07 00:00:00  278.       275.       281. 2016-03-04 00:00:00
## 4  275. 2016-03-08 00:00:00  278.       275.       281. 2016-03-04 00:00:00
## 5  274. 2016-03-09 00:00:00  279.       276.       282. 2016-03-04 00:00:00
## 6  276. 2016-03-10 00:00:00  279.       276.       282. 2016-03-04 00:00:00
unique(df.cv$cutoff)
##  [1] "2016-03-04 GMT" "2016-03-19 GMT" "2016-04-03 GMT" "2016-04-18 GMT"
##  [5] "2016-05-03 GMT" "2016-05-18 GMT" "2016-06-02 GMT" "2016-06-17 GMT"
##  [9] "2016-07-02 GMT" "2016-07-17 GMT" "2016-08-01 GMT" "2016-08-16 GMT"
## [13] "2016-08-31 GMT" "2016-09-15 GMT" "2016-09-30 GMT" "2016-10-15 GMT"
## [17] "2016-10-30 GMT" "2016-11-14 GMT" "2016-11-29 GMT" "2016-12-14 GMT"
## [21] "2016-12-29 GMT"
#plotting Cross-Val Actual vs Pred
df.cv %>% 
  ggplot()+
  geom_point(aes(ds,y)) +
  geom_point(aes(ds,yhat,color=factor(cutoff)))+
  theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs") +
  scale_color_discrete(name = 'Cutoff')

head(performance_metrics(df.cv))
##   horizon      mse     rmse      mae        mape       mdape       smape
## 1  3 days 7.299973 2.701846 2.078623 0.007212499 0.005810397 0.007206289
## 2  4 days 6.458589 2.541375 1.976870 0.006838353 0.005228743 0.006833891
## 3  5 days 5.752520 2.398441 1.915562 0.006635079 0.005557420 0.006635744
## 4  6 days 4.732280 2.175380 1.763935 0.006111348 0.005419114 0.006110686
## 5  7 days 4.849743 2.202213 1.768447 0.006139838 0.005659796 0.006137124
## 6  8 days 6.162435 2.482425 1.937480 0.006747551 0.005659796 0.006740866
##    coverage
## 1 0.7936508
## 2 0.7777778
## 3 0.8095238
## 4 0.8571429
## 5 0.8571429
## 6 0.7936508
plot_cross_validation_metric(df.cv, metric = 'rmse',rolling_window = 0.1)

plot_cross_validation_metric(df.cv, metric = 'mae',rolling_window = 0.1)

plot_cross_validation_metric(df.cv, metric = 'mape',rolling_window = 0.1)

With respect to MAPE, for multiplicative model(with holiday inclusion), the interpretation is that our fitted model is able to forecast with around 0.85% error out 10 days, with around 0.7% error out 20 days and with around 0.65% error out 30 days into the future.

For both the models(with inclusion of holidays), the metrics give almost the same error values with the ones obtained from multiplicative to be slightly higher. Thus, for model selection, Additive model with holiday inclusion is the chosen one.

Section 4 - Model Comparison and Validation

We will use test-train split approach to compare the prophet and arima models using metrics like RMSE and MAE.

# Test train split at this date
(df$date[(round(nrow(df)*0.8))])
## [1] "2016-05-12 EDT"
#fitting a prophet model
prophet_data = df %>% 
  rename(ds = date, # Have to name our date variable "ds"
         y = daily_mean_TEMP_abs)  # Have to name our time series "y"
train = prophet_data %>% 
  filter(ds <= ymd("2016-05-12"))
test = prophet_data %>%
  filter(ds > ymd("2016-05-12"))


model = prophet(train ,fit = FALSE, daily.seasonality = "auto")
model = add_country_holidays(model,country_name = 'CN')
model = fit.prophet(model,train)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
future = make_future_dataframe(model,periods = nrow(test))
forecast = predict(model,future)

# performance on out-of-sample data
forecast_metric_data = forecast %>% 
  as_tibble() %>% 
  mutate(ds = as.POSIXct(ds),format = "%Y-%m-%d") %>% 
  filter(ds > ymd("2016-05-12"))

RMSE_p = sqrt(mean((test$y - forecast_metric_data$yhat)^2))
MAE_p = mean(abs(test$y - forecast_metric_data$yhat))
MAPE_p = mean(abs(1 - (forecast_metric_data$yhat/test$y)))

print(paste("Out-of-sample RMSE for prophet:",round(RMSE_p,2)))
## [1] "Out-of-sample RMSE for prophet: 2.44"
print(paste("Out-of-sample MAE for prophet:",round(MAE_p,2)))
## [1] "Out-of-sample MAE for prophet: 1.85"
print(paste("Out-of-sample MAPE for prophet:",round(MAPE_p,2)))
## [1] "Out-of-sample MAPE for prophet: 0.01"
# fitting arima
arima.mod <- arima(train$y, order = c(3, 1, 0), seasonal = list(order = c(0, 1, 1), period = NA)) 
arima.predict <- arima.mod %>%
  forecast(h=nrow(test))
#arima.predict$mean[1:nrow(test)]

arima.df <- data.frame(test$ds,arima.predict$mean[1:nrow(test)])
RMSE_a = sqrt(mean((test$y - arima.df$arima.predict.mean.1.nrow.test..)^2))
MAE_a = mean(abs(test$y - arima.df$arima.predict.mean.1.nrow.test..))
MAPE_a = mean(abs(1 - (arima.df$arima.predict.mean.1.nrow.test../test$y)))

print(paste("Out-of-sample RMSE for arima:",round(RMSE_a,2)))
## [1] "Out-of-sample RMSE for arima: 14.32"
print(paste("Out-of-sample MAE for arima:",round(MAE_a,2)))
## [1] "Out-of-sample MAE for arima: 12.14"
print(paste("Out-of-sample MAPE for prophet:",round(MAPE_a,2)))
## [1] "Out-of-sample MAPE for prophet: 0.04"

Thus, we can say that prophet is able to do a better job out of sample than arima. Thus, we choose Prophet model (Additive seasonality with holiday inclusion). Arima seems to fit the train model properly. However, when it comes to forecasting for test values, it does not do well. This may be a classic case of over-fitting, where the model significantly loses prediction value on the test data. Addition of external regressors, might improve the arima model.

Let us visually observe the forecast from Prophet model for the test dataset.

#visualization
ggplot()+
  geom_line(aes(test$ds ,test$y), color = 'green')+
  geom_line(aes(forecast_metric_data$ds,forecast_metric_data$yhat),color='red',
            alpha = 1)+
   theme_bw()+
  xlab("Date")+
  ylab("daily_mean_TEMP_abs")+
  ggtitle("Out of sample performance visualization")

Here, we can see that Prophet model is doing a really good job at capturing the seasonality and trend too.