Time Series-Prophet

Mihir

3/15/2022

1 INTRODUCTION

The dataset is obtained from UCI Machine Learning Repository website. The dataset includes hourly air pollutants data from the 12 nationally-controlled air-quality monitoring sites in China. The air-quality data are from the Beijing Municipal Environmental Monitoring Center. The meteorological data in each air-quality site are matched with the nearest weather station from the China Meteorological Administration. This project deals with data collected from the weather station named Aotizhongxin.

Data are recorded at each hour from March 1st, 2013 to February 28th, 2017. The recorded data include concentrations (in ug/m^3) of fine particulates like PM2.5, PM10; air pollutants like SO2, NO2, CO, O3; and other climate data like TEMP (i.e. Temperature in degree Celsius), PRES (i.e. pressure in hPA), DEWP (i.e. dew point temperature in degree Celsius), RAIN (i.e. precipitation in mm), wd (i.e. wind direction), WSPM (i.e. wind speed (m/s)). Missing data are denoted as NA.

The main purpose of this project is to forecast climate variables and understand the association of air pollutants and other climate variables. This can be of of immense importance where weather drives demand of certain products. Estimation of weather driven demand is significant for projection of Sales and Revenue for coming weeks or months. Such a forecast of climate variables finds application in agricultural industries manufacturing lawn, garden, fertilizers, insecticides, pest control products, etc.

The scope of this project is limited to forecasting of TEMP variable. This is a typical time series data and hence, I have decided to use Facebook’s Prophet Model to forecast TEMP variable. > Normal Regression strategies like Linear Regression cannot be used here as the order of recorded observations is of importance. Also, for this project, climate variables other than TEMP are not considered for forecasting and are used as external regressors for forecasting TEMP. However, each of the climate variables is a Time Series in itself.

The project has been divided into the following main components.

  1. Exploratory Data Analyses
  2. Prophet Model
  3. Forecast and Inference (Stakeholder Communication)

The Exploratory Data Analyses section deals with exploring various variables of interest and understanding their association with the target variable TEMP. The main aim is to understand and figure out the data generating process for the target variable TEMP and creating a subset of variables that could be added as external regressors.

The Prophet Model section includes fitting an improved model involving external regressors. Finally, the improved model will be hyperparemeter tuned for identifying the best model to be used for forecast.

The final section, Forecast and Inference, creates forecast and assesses the model performance. It also provides a summary and important inferences, drawn from modelling, to be used for stakeholder communication.

2 EXPLORATORY DATA ANALYSES

2.1 Data Loading

# importing data stored on my github
df <- read.csv(file = "https://raw.githubusercontent.com/mpleo17/Project/main/PRSA_Data_Aotizhongxin_20130301-20170228.csv", sep = ",", header = TRUE)

#looking at first 5 observations
head(df, 5)
##   No year month day hour PM2.5 PM10 SO2 NO2  CO O3 TEMP   PRES  DEWP RAIN  wd
## 1  1 2013     3   1    0     4    4   4   7 300 77 -0.7 1023.0 -18.8    0 NNW
## 2  2 2013     3   1    1     8    8   4   7 300 77 -1.1 1023.2 -18.2    0   N
## 3  3 2013     3   1    2     7    7   5  10 300 73 -1.1 1023.5 -18.2    0 NNW
## 4  4 2013     3   1    3     6    6  11  11 300 72 -1.4 1024.5 -19.4    0  NW
## 5  5 2013     3   1    4     3    3  12  12 300 72 -2.0 1025.2 -19.5    0   N
##   WSPM      station
## 1  4.4 Aotizhongxin
## 2  4.7 Aotizhongxin
## 3  5.6 Aotizhongxin
## 4  3.1 Aotizhongxin
## 5  2.0 Aotizhongxin

2.2 Data Wrangling

From the above few observations, we can see that year, month, day and hour are in separate columns. We can create a date-time stamp to make our case easy. Also, we would convert TEMP from degree Celsisus to Fahrenheit and rename it to TEMP_abs. Finally, we drop the non-important variables like No, year, month, day, hour, etc. and look at the summary of resultant dataframe.

library(tidyverse) # for playing with data,dataframes and more
library(lubridate) # for playing with dates

# uniting the date, month and year into a single column - date
df <- df %>%
  mutate(date = make_datetime(year, month, day, hour))

#converting degree Celsius to kelvin
df <- df %>% dplyr::mutate(df, TEMP_abs = TEMP + 273.15)

# creating required dataframe
df <- df %>% dplyr::select(date, PM2.5, PM10, SO2, NO2, CO, O3, TEMP_abs, PRES, DEWP, RAIN, wd, WSPM, station)

# converting to datetime object
df[['date']] <- as.POSIXct(df[['date']],format = "%Y-%m-%d %H")

# summary
summary(df) # NA columns: PM2.5 + PM10 + SO2 + NO2 + CO + O3 + TEMP_abs + PRES + DEWP + RAIN + WSPM
##       date                         PM2.5             PM10      
##  Min.   :2013-03-01 00:00:00   Min.   :  3.00   Min.   :  2.0  
##  1st Qu.:2014-03-01 05:45:00   1st Qu.: 22.00   1st Qu.: 38.0  
##  Median :2015-03-01 11:30:00   Median : 58.00   Median : 87.0  
##  Mean   :2015-03-01 11:30:00   Mean   : 82.77   Mean   :110.1  
##  3rd Qu.:2016-02-29 17:15:00   3rd Qu.:114.00   3rd Qu.:155.0  
##  Max.   :2017-02-28 23:00:00   Max.   :898.00   Max.   :984.0  
##                                NA's   :925      NA's   :718    
##       SO2                NO2               CO              O3          
##  Min.   :  0.2856   Min.   :  2.00   Min.   :  100   Min.   :  0.2142  
##  1st Qu.:  3.0000   1st Qu.: 30.00   1st Qu.:  500   1st Qu.:  8.0000  
##  Median :  9.0000   Median : 53.00   Median :  900   Median : 42.0000  
##  Mean   : 17.3759   Mean   : 59.31   Mean   : 1263   Mean   : 56.3534  
##  3rd Qu.: 21.0000   3rd Qu.: 82.00   3rd Qu.: 1500   3rd Qu.: 82.0000  
##  Max.   :341.0000   Max.   :290.00   Max.   :10000   Max.   :423.0000  
##  NA's   :935        NA's   :1023     NA's   :1776    NA's   :1719      
##     TEMP_abs          PRES             DEWP              RAIN         
##  Min.   :256.4   Min.   : 985.9   Min.   :-35.300   Min.   : 0.00000  
##  1st Qu.:276.2   1st Qu.:1003.3   1st Qu.: -8.100   1st Qu.: 0.00000  
##  Median :287.6   Median :1011.4   Median :  3.800   Median : 0.00000  
##  Mean   :286.7   Mean   :1011.8   Mean   :  3.123   Mean   : 0.06742  
##  3rd Qu.:296.4   3rd Qu.:1020.1   3rd Qu.: 15.600   3rd Qu.: 0.00000  
##  Max.   :313.6   Max.   :1042.0   Max.   : 28.500   Max.   :72.50000  
##  NA's   :20      NA's   :20       NA's   :20        NA's   :20        
##       wd                 WSPM          station         
##  Length:35064       Min.   : 0.000   Length:35064      
##  Class :character   1st Qu.: 0.900   Class :character  
##  Mode  :character   Median : 1.400   Mode  :character  
##                     Mean   : 1.708                     
##                     3rd Qu.: 2.200                     
##                     Max.   :11.200                     
##                     NA's   :14

Variables PM2.5, PM10, SO2, NO2, CO, O3, TEMP_abs, PRES, DEWP, RAIN and WSPM have missing values. Let us calculate the percentage of missing values and then decide on how to deal with them.

# proportion of NA values for these columns
attach(df) # attaching to avoid mentioning df$var everytime

col1 = c() 
col2 = c()
col3 = c()

#loop 
for (i in (1:dim(df)[2])){
  col1[i] = colnames(df)[i]
  col2[i] =  sum(is.na(df[i]))
  col3[i] = round(col2[i]/dim(df)[1],4)
}

#creating dataframe for compiling NA values 
(df_NA <- tibble("Column" = col1, "NA_values" = col2, "NA_proportion" = col3))
## # A tibble: 14 x 3
##    Column   NA_values NA_proportion
##    <chr>        <int>         <dbl>
##  1 date             0        0     
##  2 PM2.5          925        0.0264
##  3 PM10           718        0.0205
##  4 SO2            935        0.0267
##  5 NO2           1023        0.0292
##  6 CO            1776        0.0507
##  7 O3            1719        0.049 
##  8 TEMP_abs        20        0.0006
##  9 PRES            20        0.0006
## 10 DEWP            20        0.0006
## 11 RAIN            20        0.0006
## 12 wd              81        0.0023
## 13 WSPM            14        0.0004
## 14 station          0        0

Let us now observe the missingness pattern before dealing with the missing values.

library(naniar) # for missing points
library(ggpubr) #for dealing with multiple ggplots

#generating plots for variables having missing values
P1 <- ggplot(df,aes(y = PM2.5, x = date)) + geom_miss_point() +
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P2 <- ggplot(df,aes(y = PM10, x = date)) + geom_miss_point() +
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P3 <- ggplot(df,aes(y = SO2, x = date)) + geom_miss_point() +
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P4 <- ggplot(df,aes(y = NO2, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P5 <- ggplot(df,aes(y = CO, x = date)) + geom_miss_point()+ 
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P6 <- ggplot(df,aes(y = O3, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P7 <- ggplot(df,aes(y = TEMP_abs, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P8 <- ggplot(df,aes(y = PRES, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P9 <- ggplot(df,aes(y = DEWP, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P10 <- ggplot(df,aes(y = RAIN, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')
P11 <- ggplot(df,aes(y = WSPM, x = date)) + geom_miss_point()+
  geom_vline(aes(xintercept=df[round(dim(df)[1]*0.8),1]),color='red')

# 11 figures arranged in 4 rows and 3 columns
annotate_figure(ggarrange(P1, P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, 
                          ncol = 3, nrow = 4),
                top=text_grob("Missingness Pattern in Variables"))
Plot indicating missingness pattern for variables. The red dots indicate missing values and blue dots indicate non missing values.

Figure 2.1: Plot indicating missingness pattern for variables. The red dots indicate missing values and blue dots indicate non missing values.

Since the proportion of missing values is very small (max ~ 2.9 %) and the data are missing completely at random, we will proceed with filling in of missing values using spline interpolation. There are many methods available to fill in the missing values, however, we shall use spline interpolation for this project.

library(zoo)# for spline interpolation

# Missing values using spline interpolation. Here 2,3...indicate column index
for (i in c(2,3,4,5,6,7,8,9,10,11,13)){
 df[i] <- na.spline(df[i]) 
}

# creating train-test split with split having exactly 60 days
train <- df %>% slice(1:(dim(df)[1] - (60*24)))  # train having the rest
test <- df %>% slice((dim(df)[1] - (60*24)) + 1:dim(df)[1]) # test having last 60 days

The train and test split have been made and solely train data shall be used for model building. Final model shall be run only on test data. Forecast results and performance will be calculated on test data and will be communicated with the stake-holders.

2.3 ANALYZING TEMP_abs

# visualizing `TEMP_abs` time series
library(xts)
library(dygraphs)

data <- xts(train$TEMP_abs , order.by = train$date)

dygraph(data = data , main = "Time Series of TEMP_abs") %>%
  dyAxis("y", label = "Temp in Kelvin", valueRange = c(250,320)) %>% dyAxis("x", label = "Date-Time") %>% 
  dyRangeSelector(height = 20) %>% dyLegend(show = "follow")

Figure 2.2: Time Series of TEMP_abs from 2013-03-01 00:00:00 to 2017-02-28 23:00:00. The data was recorded every hour.

The time series seems to have a seasonality at yearly level. Let us look at following plots to better understand the observed seasonalities.

#grouping by hour for daily effect
data <- train %>% dplyr::select(date,TEMP_abs) %>% 
  mutate(hour = hour(date)) %>%
  group_by(hour) %>%
  summarise(avg_TEMP_abs = mean(TEMP_abs)) 

dygraph(data = data , main = "Hour of the Day effect") %>%
  dyAxis("y", label = "Temp in Kelvin", valueRange = c(283,293)) %>% 
  dyAxis("x", label = "Hours in a Day", valueRange = c(1,24)) %>% 
  dyRangeSelector(height = 20) %>% dyLegend(show = "follow")

Figure 2.3: This plot indicates seasonal variation at daily level.

The time series is grouped at hour level for all the days in this dataset. Thus, we can observe, on an average, seasonal pattern of TEMP_abs at daily level. It logically agrees with the fact that temperature rises as the Sun rises and lowers as the Sun sets. Thus, this seasonal pattern on a daily basis is due to the rotation of the Earth. Hence, the rotation of the Earth impacts the TEMP_abs and forms a part of data generating process.

# grouping by days for weekly effect
data <- train %>% dplyr::select(date,TEMP_abs) %>% 
  mutate(day = format(as.POSIXct(date,format = '%Y-%m-%d %H:%M:%S'),format = '%Y%m%d')) %>%
  group_by(day_of_week = wday(ymd(day)), label = FALSE) %>%
  summarise(avg_TEMP_abs = mean(TEMP_abs)) 

dygraph(data = data , main = "Day of the Week effect") %>%
  dyAxis("y", label = "Temp in Kelvin", valueRange = c(286,289)) %>% 
  dyAxis("x", label = "Days in a Week",valueRange = c(1,7)) %>% 
  dyRangeSelector(height = 20) %>% dyLegend(show = "follow")

Figure 2.4: This plot indicates variation (hardly any) at weekly level with 1 as Sunday.

The time series is grouped at day of week level to observe any patterns at week level. Looking at the plot, we can infer that there is hardly any evidence that supports minimal change in TEMP_abs for different days of the week. Thus, we can safely assume that there is hardly any seasonality at weekly level.

# grouping by month to see yearly effect
data <- train %>% select(date,TEMP_abs) %>% 
  mutate(month = month(date)) %>%
  group_by(month) %>%
  summarise(avg_TEMP_abs = mean(TEMP_abs)) 

dygraph(data = data , main = "Month of the Year effect") %>%
  dyAxis("y", label = "Temp in Kelvin") %>% dyAxis("x", label = "Months in a Year", valueRange = c(1,12)) %>% dyRangeSelector(height = 20) %>% dyLegend(show = "follow")

Figure 2.5: This plot indicates seasonal variation at yearly level.

The time series is grouped at month of the year level to observe any noticeable pattern at yearly level. This plot indicates the presence of yearly seasonality. This can also be confirmed from the raw time series plot above. Logically, this is coherent with the fact that temperatures are usually the highest during the summer season, gradually reduce till the winter season and then rise again. This can be attributed to the revolution of the Earth around the Sun and it takes a year to revolve. Thus, this revolution influences the temperature and could be a potential part of the data generating process. This explains yearly seasonality for TEMP_abs.

Rotation and revolution of the Earth are two of the most important factors affecting the temperature values.

Hence, they form an integral part of the data generating process.

2.4 Analyzing association of rest of the variables with TEMP_abs

For simplicity and a high level view of linear association, a correlation plot is made. To dive in deep, with an aim to observe almost independent effect of each variable on Temp_abs, a random forest model are fit between TEMP_abs and rest of the variables. Variable Importance plots are created for the Random Forest model. This plot shows us the relative importance of variables with TEMP_abs. A small subset of important variables is selected and their nature of association with TEMP_abs could be observed using partial dependence plots(not shown).

2.4.1 Pearsons’ correlation

library(ggcorrplot) #for plotting correlations
#calculating and plotting correlation 
corr <- cor(df %>% select(TEMP_abs, PM2.5, PM10, SO2, NO2, CO, O3, PRES, DEWP, RAIN, WSPM))
ggcorrplot(corr , hc.order = TRUE, method = "circle")
Pearson Correlation plot

Figure 2.6: Pearson Correlation plot

TEMP_abs shows positive correlation with O3, DEWP and negatively with PRES.

2.4.2 Random Forest

Regression trees can be powerful tools for EDA. Let us fit a Random Forest model and select a subset of variables from the generated Variable Importance by Random Forest.

#partial dependence plot approach
library(randomForest) # for randomForest, partialPlot, and varImpPlot functions

set.seed(100) # for reproducibility

train.rf <- randomForest(TEMP_abs ~ PM2.5 + PM10 + SO2 + NO2 + CO + O3 + PRES + DEWP + RAIN + WSPM , data = train, importance = TRUE)

#plotting relative importance of variables
varImpPlot(train.rf)
Variable Importance Plot generated by Random Forest fit

Figure 2.7: Variable Importance Plot generated by Random Forest fit

Looking at the above variable importance plots, O3, DEWP, PRES along with WSPM seem to be significant predictor variables having strong association with TEMP_abs.

The Correlation plot, Variable Importance plot from Random Forest model indicate that DEWP, PRES, O3 and WSPM show maximum levels of association with TEMP_abs.

Thus, for the models where external regressors can be included (such as Prophet model), DEWP, PRES, O3 and WSPM shall form a part of external regressors.

3 PROPHET MODEL

Prophet model is a univariate forecasting technique developed by Facebook. It breaks down a univariate time series into the following components:

  1. Trend
  2. Seasonality
  3. Holidays/ External Regressors
  4. Residuals that cannot be mapped by the above components

Considering the default case of additive seasonality, which is our case too, the following is the equation for the Time series, in our case: TEMP_abs.

\[Time Series = Trend + Seasonality + ExternalRegressors + Residuals\]

3.1 Prophet data preparation

For avoiding computational complexity, I am aggregating hourly level data to daily data. Hence, daily seasonality will be of no use, however, yearly seasonality will still be present.

# aggregating data at daily level
train_day <- train %>%  
  mutate(date = date(date)) %>%
  group_by(date) %>%
  summarise(PM2.5 = mean(PM2.5),
            PM10 = mean(PM10),
            SO2 = mean(SO2),
            NO2 = mean(NO2),
            CO = mean(CO),
            O3 = mean(O3),
            TEMP_abs = mean(TEMP_abs),
            PRES = mean(PRES),
            DEWP = mean(DEWP),
            RAIN = mean(RAIN),
            WSPM = mean(WSPM))

test_day <- test %>%  
  mutate(date = date(date)) %>%
  group_by(date) %>%
  summarise(PM2.5 = mean(PM2.5),
            PM10 = mean(PM10),
            SO2 = mean(SO2),
            NO2 = mean(NO2),
            CO = mean(CO),
            O3 = mean(O3),
            TEMP_abs = mean(TEMP_abs),
            PRES = mean(PRES),
            DEWP = mean(DEWP),
            RAIN = mean(RAIN),
            WSPM = mean(WSPM))

# creating prophet train and train_validation sets
prophet_train <- train_day[1:(nrow(train_day) - 60),] %>%
  rename(ds = date, y = TEMP_abs)
prophet_train_val <- train_day[(nrow(train_day) - 60 + 1):nrow(train_day),] %>%
  rename(ds = date, y = TEMP_abs)

# prophet test set
prophet_test <- test_day %>%
  rename(ds = date, y = TEMP_abs)

3.2 Initial improved model fitting

Let us begin and fit a prophet model. The explanation of values of parameters are shown in form of comments in the code below.

  1. Trend : Real time series frequently have abrupt changes in their trajectories. By default, Prophet will automatically detect these changepoints and will allow the trend to adapt appropriately. The number of changepoints has been increased to 40 (might overfit). An attempt to manually add changepoint dates was made, but it didn’t improve the model significantly. Hence, I stuck myself to 40 changepoint number.

  2. Seasonality: This is the part which repeats itself after a particular calendar time period. The seasonality mode is set to additive for Yearly seasonality with default fourier order.

  3. External Regressors: WSPM, DEWP, O3 and PRES have been added as external regressors. Future values of these regressors have also been added to the created future dataframe.

# library for prophet model
library(prophet)

#creating an improved prophet model
im <- (prophet(
df = NULL,   # Dataframe containing the history
growth = "linear",    # trend change/growth can't be logistic or flat for this TS
#changepoints = c('2013-08-10','2014-01-10', '2014-08-02', '2015-01-13', '2015-07-13',  
#'2016-01-22', '2016-08-04'), 
n.changepoints = 40, #more than default, might overfit 
changepoint.range = 0.80, # Proportion of history in which trend changepoints will be estimated
yearly.seasonality = TRUE, # Default Fourier Order
weekly.seasonality = FALSE, # no evidence for temp change for days of a week
daily.seasonality = FALSE, # Daily seasonality locked as off as data is daily leveled
holidays = NULL, # no evidence that holidays affect temp 
seasonality.mode = "additive", # by observation
seasonality.prior.scale = 10, # default
holidays.prior.scale = 10, # default for regressors too
changepoint.prior.scale = 0.05, # default
mcmc.samples = 0, # default
interval.width = 0.80, #default
uncertainty.samples = 1000, #default
fit = TRUE
))

# adding external regressors
im = add_regressor(im,'WSPM',standardize = FALSE) # added WSPM as an external regressor
im = add_regressor(im,'DEWP', standardize = FALSE) # added DEWP as an external regressor
im = add_regressor(im,'O3', standardize = FALSE) # added O3 as an external regressor
im = add_regressor(im,'PRES', standardize = FALSE) # added PRES as an external regressor

# fitting a prophet model
im = fit.prophet(im, df = prophet_train)

#making future dataframes
future_im <- make_future_dataframe(im, periods = 60, 
                                   freq = "day", include_history = TRUE) #predictions for two months

future_im$WSPM = head(train_day$WSPM ,nrow(future_im))
future_im$DEWP = head(train_day$DEWP ,nrow(future_im))
future_im$O3 = head(train_day$O3, nrow(future_im))
future_im$PRES = head(train_day$PRES, nrow(future_im))

Once the model has been fitted and future dataframe has been created, we can now proceed to forecasting.

Prediction of TEMP_abs has been made for 60 days. The forecast gives us the dates and its corresponding point estimation along with upper and lower values of TEMP_abs. The fitted vales and forecast can be visually seen too. A plot indicating components has also been created.

# prediction
fcst_im <- predict(im,future_im) # creating forecast for 60 days
#tail(fcst_im[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')]) #observing tail observations
dyplot.prophet(im,fcst_im , uncertainty = TRUE) # creating interactive plots for the forecast

Figure 3.1: Interactive plot indicating fit on prophet_train data and showing forecast

# plotting components of the forecast
prophet_plot_components(
im,
fcst_im,
uncertainty = TRUE,
plot_cap = TRUE,
yearly_start = 0,
render_plot = TRUE
)
A ggplot indicating components of the Time series as per Prophet model fit

Figure 3.2: A ggplot indicating components of the Time series as per Prophet model fit

It can be seen from the first plot that trend increases from 2013 to mid 2014 and then decreases for almost a year till mid 2015. There is a rise till 2016 and then it remains constant henceforth. The second plot shows the seasonal component. This means that the seasonal graph repeats itself after a period of one year. The third plot can be roughly understood as variation explained by external regressors. The remaining unexplained part forms a part of residuals.

The addition of the above three components along with the residuals would sum up to the fit created by prophet model .

Let us now look at the model performance. The assessment is done on the validation set of the train data. RMSE, MAE, and MAPE are the metrics used to assess the model.

# out of sample (validation set) assessment
RMSE_im = sqrt(mean((tail(train_day$TEMP_abs,60) - tail(fcst_im$yhat,60))^2))
MAE_im = mean(abs( tail(train_day$TEMP_abs,60) - tail(fcst_im$yhat,60) ))
MAPE_im = mean(abs(  (tail(train_day$TEMP_abs,60) - tail(fcst_im$yhat,60))  / tail(train_day$TEMP_abs,60) ))

tibble("RMSE"= c(round(RMSE_im,4)), "MAE" = round(MAE_im,4), "MAPE" = round(MAPE_im,4))
## # A tibble: 1 x 3
##    RMSE   MAE   MAPE
##   <dbl> <dbl>  <dbl>
## 1  2.32  1.87 0.0068

3.3 Hyperparameter Tuning

Fitting a prophet model is easy and that was the intention behind development of the model for usage by non-experts. However, it is extremely important to tune the parameters before deploying the model. The paramters to be tuned here include:

  1. changepoint_prior_scale: It determines the flexibility of the trend, and in particular how much the trend changes at the trend changepoints. As described in this documentation, if it is too small, the trend will be underfit and variance that should have been modeled with trend changes will instead end up being handled with the noise term. If it is too large, the trend will overfit and in the most extreme case you can end up with the trend capturing yearly seasonality. The default, 0.05, works for many time series.

  2. seasonality_prior_scale (for yearly seasonality): This parameter controls the flexibility of the seasonality. Similarly, a large value allows the seasonality to fit large fluctuations, a small value shrinks the magnitude of the seasonality. The default is 10, which applies basically no regularization.

  3. fourier_order_yearly: Seasonalities are estimated using a partial Fourier sum. The number of terms in the partial sum (the order) is a parameter that determines how quickly the seasonality can change. The default Fourier order for yearly seasonality is 10.

  4. holiday_prior_scale: The scale of external regressors is the same as this scale. Since we are not adding the effect of holiday in this project, this scale will control the dampening effect of external regressors.

The codes below provide a subset of values to be chosen for each of the above mentioned parameters. This is usually obtained by playing and fitting the Prophet model and observing the performance.

changepoint_prior_scale <- c(0.01, 0.1, 0.5)
seasonality_prior_scale <- c(8.0, 10.0, 12.0)
fourier_order_yearly <- c(10, 12, 17)
holidays_prior_scale <- c(8, 10)

iter <- expand.grid(changepoint_prior_scale = changepoint_prior_scale,
                    seasonality_prior_scale = seasonality_prior_scale,
                    fourier_order_yearly = fourier_order_yearly,
                    holidays_prior_scale = holidays_prior_scale)

a <- NULL
b <- NULL
f <- NULL
h <- NULL
c <- NULL
d <- NULL
e <- NULL
 

for (i in (1:nrow(iter))){
  im <- NULL # null model
  #creating an improved prophet model
  im <- (prophet(
  df = NULL,   # Dataframe containing the history
  growth = "linear",    # trend change/growth can't be logistic or flat for this TS
  n.changepoints = 40, #more than default, might overfit 
  changepoint.range = 0.80, # Proportion of history in which trend changepoints will be estimated
  yearly.seasonality = iter$fourier_order_yearly[i], # need to be tuned 
  weekly.seasonality = FALSE, # no evidence for temp change for days of a week
  daily.seasonality = FALSE, # Daily seasonality locked as off as data is daily leveled
  holidays = NULL, # no evidence that holidays affect temp 
  seasonality.mode = "additive", # by observation
  holidays.prior.scale = iter$holidays_prior_scale[i], # default
  seasonality.prior.scale = iter$seasonality_prior_scale[i], # need to be tuned 
  changepoint.prior.scale = iter$changepoint_prior_scale[i], # need to be tuned 
  mcmc.samples = 0, # default
  interval.width = 0.80, #default
  uncertainty.samples = 1000, #default
  fit = TRUE
  ))

  # adding external regressors
  im = add_regressor(im,'WSPM',standardize = FALSE) # added WSPM as an external regressor
  im = add_regressor(im,'DEWP', standardize = FALSE) # added DEWP as an external regressor
  im = add_regressor(im,'O3', standardize = FALSE) # added O3 as an external regressor
  im = add_regressor(im,'PRES', standardize = FALSE) # added PRES as an external regressor

  # fitting a prophet model
  im = fit.prophet(im, df = prophet_train)
  
  # performing cross validation
  df.cv <- cross_validation(im, initial = 365*2.4, period = 30, horizon = 60, units = "days")
  
  #performance metrics
  df_p = performance_metrics(df.cv, rolling_window = 1)

  a[i] <- iter$changepoint_prior_scale[i]
  b[i] <- iter$seasonality_prior_scale[i]
  f[i] <- iter$fourier_order_yearly[i]
  h[i] <- iter$holidays_prior_scale[i]
  c[i] <- df_p$rmse
  d[i] <- df_p$mae
  e[i] <- df_p$mape
}

# summarizing results
pm <- tibble("changepoint_prior_scale" = a,
             "seasonality_prior_scale" = b,
             "fourier_order_yearly" = f,
             "holidays_prior_scale" = h,
             "RMSE" = c,
             "MAE" = d,
             "MAPE" = e)

# arranging in ascending order of RMSE
(pm %>% arrange(RMSE))
## # A tibble: 54 x 7
##    changepoint_prior~ seasonality_prior~ fourier_order_y~ holidays_prior_~  RMSE
##                 <dbl>              <dbl>            <dbl>            <dbl> <dbl>
##  1                0.5                 10               17                8  2.83
##  2                0.5                  8               17               10  2.85
##  3                0.5                  8               12               10  2.85
##  4                0.5                 10               10                8  2.86
##  5                0.5                  8               17                8  2.88
##  6                0.5                 12               10               10  2.89
##  7                0.5                 10               12                8  2.89
##  8                0.5                  8               12                8  2.89
##  9                0.5                 10               17               10  2.90
## 10                0.5                 12               12               10  2.90
## # ... with 44 more rows, and 2 more variables: MAE <dbl>, MAPE <dbl>

Looking at the above summary table, we will select the top 10 rows, since they have lower RMSE scores, and utilize those parameters’ values to calculate out of sample (validation set) performance. The set of good hyperparameter tuned models (top 10 models) is then tested on validation set. This means that we shall select the best model from these top 10 models after looking at their out-of-sample performance on the validation set. This step is important as forecast in future is more related to near data than farther data, and validation data is more near to the test data.

# top 10 rows having the best parameter values as per RMSE
pm1 <- pm %>% arrange(RMSE) %>% slice(1:10) 

# creating null vectors to store values 
a <- NULL
b <- NULL
f <- NULL
h <- NULL
c <- NULL
d <- NULL
e <- NULL

# loop for out_of_sample (validation) set assessment

for (i in 1:nrow(pm1)){
  
  im <- NULL # null model
  
  #creating an improved prophet model
  im <- (prophet(
  df = NULL,   # Dataframe containing the history
  growth = "linear",    # trend change/growth can't be logistic or flat for this TS
  n.changepoints = 40, #more than default, might overfit 
  changepoint.range = 0.80, # Proportion of history in which trend changepoints will be estimated
  yearly.seasonality = pm1$fourier_order_yearly[i], # need to be tuned  
  weekly.seasonality = FALSE, # no evidence for temp change for days of a week
  daily.seasonality = FALSE, # Daily seasonality locked as off as data is daily leveled
  holidays = NULL, # no evidence that holidays affect temp 
  seasonality.mode = "additive", # by observation
  holidays.prior.scale = pm1$holidays_prior_scale[i], # default
  seasonality.prior.scale = pm1$seasonality_prior_scale[i], # need to be tuned 
  changepoint.prior.scale = pm1$changepoint_prior_scale[i], # need to be tuned 
  mcmc.samples = 0, # default
  interval.width = 0.80, #default
  uncertainty.samples = 1000, #default
  fit = TRUE
  )) 
  
  # adding external regressors
  im = add_regressor(im,'WSPM',standardize = FALSE) # added WSPM as an external regressor
  im = add_regressor(im,'DEWP', standardize = FALSE) # added DEWP as an external regressor
  im = add_regressor(im,'O3', standardize = FALSE) # added O3 as an external regressor
  im = add_regressor(im,'PRES', standardize = FALSE) # added PRES as an external regressor

  # fitting a prophet model
  im = fit.prophet(im, df = prophet_train)
  
  #making future dataframes
  future_im <- make_future_dataframe(im, periods = 60, freq = "day", include_history = FALSE) #predictions for    two months
  future_im$WSPM = prophet_train_val$WSPM
  future_im$DEWP = prophet_train_val$DEWP
  future_im$O3 = prophet_train_val$O3
  future_im$PRES = prophet_train_val$PRES
  
  # prediction
  fcst_im <- predict(im,future_im) # creating forecast for 60 days
  
  # out of sample (validation set) assessment
  RMSE_im = sqrt(mean((prophet_train_val$y - fcst_im$yhat)^2))
  MAE_im = mean(abs( prophet_train_val$y - fcst_im$yhat ))
  MAPE_im = mean(abs(  (prophet_train_val$y - fcst_im$yhat)  / prophet_train_val$y ))

  #appending values to vectors
  a[i] <- pm1$changepoint_prior_scale[i]
  b[i] <- pm1$seasonality_prior_scale[i]
  f[i] <- pm1$fourier_order_yearly[i]
  h[i] <- pm1$holidays_prior_scale[i]
  c[i] <- RMSE_im
  d[i] <- MAE_im
  e[i] <- MAPE_im
  
}

# summarizing results
pm3 <- data_frame("changepoint_prior_scale" = a,
                  "seasonality_prior_scale" = b,
                  "fourier_order_yearly" = f,
                  "holiday_prior_scale" = h,
                  "out_of_sample_RMSE" = c,
                  "out_of_sample_MAE" = d,
                  "out_of_sample_MAPE" = e)

# arranging rows as per best "out_of_sample_RMSE"
(pm3 <- pm3 %>% arrange(out_of_sample_RMSE))
## # A tibble: 10 x 7
##    changepoint_prior_scale seasonality_prior~ fourier_order_ye~ holiday_prior_s~
##                      <dbl>              <dbl>             <dbl>            <dbl>
##  1                     0.5                  8                17                8
##  2                     0.5                 10                17                8
##  3                     0.5                 10                17               10
##  4                     0.5                  8                12                8
##  5                     0.5                 10                10                8
##  6                     0.5                  8                17               10
##  7                     0.5                 12                10               10
##  8                     0.5                 10                12                8
##  9                     0.5                 12                12               10
## 10                     0.5                  8                12               10
## # ... with 3 more variables: out_of_sample_RMSE <dbl>, out_of_sample_MAE <dbl>,
## #   out_of_sample_MAPE <dbl>

It can be seen that the first row of parameter values has the lowest out-of-sample (validation set) RMSE, MAE and MAPE. Thus, we go and choose those parameter values as the best parameter values for prophet model.

4 STAKEHOLDER COMMUNICATION

At last, let us perform cross validation on the entire train (prophet train + validation) dataset and interpret the performance metrics through plots. Cross Validaion performance metrics shall be shown to the clients for the training data that they had provided.

changepoint_prior_scale = 0.1
seasonality_prior_scale = 12
fourier_order_yearly = 17
holidays_prior_scale = 8

#creating an improved prophet model
fm <- (prophet(
df = NULL,   # Dataframe containing the history
growth = "linear",    # trend change/growth can't be logistic or flat for this TS
#changepoints = c('2013-08-10','2014-01-10', '2014-08-02', '2015-01-13', '2015-07-13',  
#'2016-01-22', '2016-08-04'), 
n.changepoints = 40, #more than default, might overfit 
changepoint.range = 0.80, # Proportion of history in which trend changepoints will be estimated
yearly.seasonality = fourier_order_yearly, # best tuned value
weekly.seasonality = FALSE, # no evidence for temp change for days of a week
daily.seasonality = FALSE, # Daily seasonality locked as off as data is daily leveled
holidays = NULL, # no evidence that holidays affect temp 
seasonality.mode = "additive", # by observation
seasonality.prior.scale = seasonality_prior_scale, # best tuned value
holidays.prior.scale = 10, # default
changepoint.prior.scale = changepoint_prior_scale, # best tuned value
mcmc.samples = 0, # default
interval.width = 0.80, #default
uncertainty.samples = 1000, #default
fit = TRUE
))

# adding external regressors
fm = add_regressor(fm,'WSPM',standardize = FALSE) 
fm = add_regressor(fm,'DEWP', standardize = FALSE) 
fm = add_regressor(fm,'O3', standardize = FALSE) 
fm = add_regressor(fm,'PRES', standardize = FALSE) 

# fitting a prophet model on full (train + train_validation data)
fm = fit.prophet(fm, df = train_day %>% rename(ds = date, y = TEMP_abs))

# performing cross validation
df.cv <- cross_validation(fm, initial = 365*2.4, period = 30, horizon = 60, units = "days")

# plot performance metric: MAPE
plot_cross_validation_metric(df.cv, metric = 'mape')
Performance metric MAPE from Cross Validation results over a horizon of 60 days

Figure 4.1: Performance metric MAPE from Cross Validation results over a horizon of 60 days

INTERPRETATION : Cross validation performance metrics can be visualized, here shown for MAPE. Dots show the absolute percent error for each prediction in df_cv. The blue line shows the MAPE, where the mean is taken over a rolling window of the dots. We see for this forecast that errors around 0.7% are typical for predictions one month into the future, and that errors increase up to around 0.9% for predictions that are a two months out.

4.1 Forecast and Inference

#making future dataframes
future_fm <- make_future_dataframe(fm, periods = 60, 
                                   freq = "day", include_history = FALSE) #predictions for two months

future_fm$WSPM = prophet_test$WSPM 
future_fm$DEWP = prophet_test$DEWP
future_fm$O3 = prophet_test$O3
future_fm$PRES = prophet_test$PRES

# prediction
fcst_fm <- predict(fm,future_fm) # creating forecast for 60 days

#regressor coefficients 
regressor_coefficients(fm)
##   regressor regressor_mode center   coef_lower         coef   coef_upper
## 1      WSPM       additive      0 1.449885e-03 1.449885e-03 1.449885e-03
## 2      DEWP       additive      0 6.037463e-04 6.037463e-04 6.037463e-04
## 3        O3       additive      0 9.377208e-05 9.377208e-05 9.377208e-05
## 4      PRES       additive      0 2.334914e-05 2.334914e-05 2.334914e-05
#visualizing forecast
ggplot(data = tibble(fcst_fm$ds, fcst_fm$yhat, prophet_test$y), aes(x = ymd(fcst_fm$ds))) +
  geom_path( aes(y = fcst_fm$yhat), color = "green", size = 1.2, show.legend = c("predicted")) + 
  geom_line( aes(y = prophet_test$y), color = "orange", size = 1.2, show.legend = c("actual")) + 
  ggtitle("Actual (orange) and Predicted (green)") +
  xlab("Date") + ylab("TEMP_abs in Kelvin")
Actual Temp_abs values vs Forecast values from the Prophet fit over a horizon of 60 days

Figure 4.2: Actual Temp_abs values vs Forecast values from the Prophet fit over a horizon of 60 days

# out of sample (test set) assessment
RMSE_fm = sqrt(mean((prophet_test$y - fcst_fm$yhat)^2))
MAE_fm = mean(abs( prophet_test$y - fcst_fm$yhat ))
MAPE_fm = mean(abs(  (prophet_test$y - fcst_fm$yhat)  / prophet_test$y ))

# summarizing
tibble("RMSE" = c(round(RMSE_fm,4)), "MAE" = round(MAE_fm,4), "MAPE" = round(MAPE_fm,4))
## # A tibble: 1 x 3
##    RMSE   MAE   MAPE
##   <dbl> <dbl>  <dbl>
## 1  2.15  1.82 0.0067

OBSERVATIONS:

  1. From the above plot of forecast, we can see that the the forecast follows the local trend of actual values.

  2. The RMSE (2.1524 Kelvin), MAE (1.8249 Kelvin) and MAPE (0.67 %) seem to be reasonable values.

  3. The impact of external regressors seems to be less significant than that due to seasonality as the coefficients for respective regressors are in the order of \[10^-4\]. But we would still like to retain them in the model.