Forecasting Construction Costs

## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## Warning: package 'dplyr' was built under R version 4.0.3

## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Warning: package 'caret' was built under R version 4.0.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## Warning: package 'strucchange' was built under R version 4.0.5

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.0.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 4.0.5

## 
## Attaching package: 'strucchange'

## The following object is masked from 'package:stringr':
## 
##     boundary

## Warning: package 'plyr' was built under R version 4.0.3

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following object is masked from 'package:purrr':
## 
##     compact

## Warning: package 'anomalize' was built under R version 4.0.3

## == Use anomalize to improve your Forecasts by 50%! ============================================================
## Business Science offers a 1-hour course - Lab #18: Time Series Anomaly Detection!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>

## Warning: package 'forecast' was built under R version 4.0.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Warning: package 'readxl' was built under R version 4.0.3

## Warning: package 'janitor' was built under R version 4.0.5

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

## Warning: package 'tseries' was built under R version 4.0.5

## Warning: package 'corrgram' was built under R version 4.0.5

## 
## Attaching package: 'corrgram'

## The following object is masked from 'package:plyr':
## 
##     baseball

## The following object is masked from 'package:lattice':
## 
##     panel.fill

## Warning: package 'feasts' was built under R version 4.0.5

## Loading required package: fabletools

## Warning: package 'fabletools' was built under R version 4.0.5

## 
## Attaching package: 'fabletools'

## The following objects are masked from 'package:forecast':
## 
##     accuracy, forecast

## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE

## Warning: package 'tsibble' was built under R version 4.0.5

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:zoo':
## 
##     index

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:tsibble':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

## Warning: package 'TTR' was built under R version 4.0.5

## Warning: package 'dynlm' was built under R version 4.0.5

I will study the quarterly data provided on the construction cost indices for civil engineering projects from 1995 till 2020. Using time series analysis, I will follow an empirical strategy that will analyze the provided data and use the ETS and ARIMA models. Using ETS and ARIMA, this investigation will forecast the percentage change compared to the same quarter the year before for the next 5-10 years after 2020.

The data sheet provided contains indices that are representative of civil engineering projects in Denmark from 1995 to 2020. The statistical unit is indices. The data is quarterly where: quarter 1 refers to mid-March, quarter 2 refers to mid-June, quarter 3 refers to mid-September and quarter 4 to mid-December. Moreover, as of quarter 1 of 2016 the base year has been changed to 2015 = 100.

We begin by reading the data and removing unnecessary rows and columns

data <- read_excel("C:/Users/hibal/Documents/SUBMIT/data.xlsx",skip=2)

## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4

data<- data[-c(1:3)]
data<- data[-c(4:6),]

## Warning: The `i` argument of ``[.tbl_df`()` must lie in [-rows, 0] if negative, as of tibble 3.0.0.
## Use `NA_integer_` as row index to obtain a row full of `NA` values.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

The data set needs to be tranposed and the first row was set as the column names respectively

t_data<-t(data)
t_data <- t_data %>%row_to_names(row_number = 1)

Let’s take a look on the data and inspect the types

summary(t_data)

##     Index           Percentage change compared to previous quarter
##  Length:104         Length:104                                    
##  Class :character   Class :character                              
##  Mode  :character   Mode  :character                              
##  Percentage change compared to same quarter the year before
##  Length:104                                                
##  Class :character                                          
##  Mode  :character

str(t_data)

##  chr [1:104, 1:3] "57.23" "57.31" "57.95" "58.2" "59.39" "59.15" "59.85" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:104] "1995Q1" "1995Q2" "1995Q3" "1995Q4" ...
##   ..$ : chr [1:3] "Index" "Percentage change compared to previous quarter" "Percentage change compared to same quarter the year before"

ncol(t_data)

## [1] 3

There are 3 columns and the type is character, but they should be numeric. ## So we convert to numeric

t_data <- as.data.frame(apply(t_data, 2, as.numeric))

## Warning in apply(t_data, 2, as.numeric): NAs introduced by coercion

Check for na’s and replace with zero

colSums(is.na(t_data))

##                                                      Index 
##                                                          0 
##             Percentage change compared to previous quarter 
##                                                          0 
## Percentage change compared to same quarter the year before 
##                                                          2

t_data[is.na(t_data)] <- 0

We convert the dataframe to a time series

datatimeseries <- ts(t_data, frequency=4, start=c(1995,1), end =c(2020,4))

Lets look at the summary of the Data

summary(datatimeseries)

##      Index        Percentage change compared to previous quarter
##  Min.   : 57.23   Min.   :-7.3500                               
##  1st Qu.: 71.97   1st Qu.:-0.0175                               
##  Median : 91.06   Median : 0.7250                               
##  Mean   : 87.27   Mean   : 0.6422                               
##  3rd Qu.:103.19   3rd Qu.: 1.6675                               
##  Max.   :112.81   Max.   : 3.7700                               
##  Percentage change compared to same quarter the year before
##  Min.   :-6.2500                                           
##  1st Qu.: 0.2125                                           
##  Median : 2.7050                                           
##  Mean   : 2.4930                                           
##  3rd Qu.: 4.5150                                           
##  Max.   :10.5700

The mean and median values are fairly very close indicating the data displays a symmetrical distribution

Lets Check the QQ plot for a normal distribution check for the third variable. (main focus)

pcs <- datatimeseries[, c(3)]
qqnorm(pcs, pch = 1, frame = FALSE)
qqline(pcs, col = "steelblue", lwd = 2)

As seen in the normal Q-Q plot that indicates a normally distributed dataset.

Let’s plot the Time Series

plot(datatimeseries, main= "Time Series Plot for the three variables", ylab = "", xlab = "Year")

Immediate Observation in the index is that there seems to be an overall construction cost inflation for civil engineering projects. The first plot seems to be non-stationary unlike the bottom two, but further tests would confirm.

Simple Moving Average

pcsSMA3 <- SMA(pcs,n=11)
autoplot(pcsSMA3)+ xlab("Year")+ ggtitle("After Simple Moving Average")

After smoothing the 3rd variable, one observes a general increase with 4 major drops: 2005, 2010, 2015, and 2020. We would expect another drop around 2025 as the trend seems to happen every 5 years.

ADF TEST

adf.test(datatimeseries[,c(1)])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  datatimeseries[, c(1)]
## Dickey-Fuller = -2.0858, Lag order = 4, p-value = 0.5411
## alternative hypothesis: stationary

adf.test(datatimeseries[,c(2)])

## Warning in adf.test(datatimeseries[, c(2)]): p-value smaller than printed p-
## value

## 
##  Augmented Dickey-Fuller Test
## 
## data:  datatimeseries[, c(2)]
## Dickey-Fuller = -5.0574, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary

adf.test(datatimeseries[,c(3)])

## 
##  Augmented Dickey-Fuller Test
## 
## data:  datatimeseries[, c(3)]
## Dickey-Fuller = -3.3912, Lag order = 4, p-value = 0.05973
## alternative hypothesis: stationary

KPSS TEST

kpss.test(datatimeseries[,c(1)])

## Warning in kpss.test(datatimeseries[, c(1)]): p-value smaller than printed p-
## value

## 
##  KPSS Test for Level Stationarity
## 
## data:  datatimeseries[, c(1)]
## KPSS Level = 2.1195, Truncation lag parameter = 4, p-value = 0.01

kpss.test(datatimeseries[,c(2)])

## Warning in kpss.test(datatimeseries[, c(2)]): p-value greater than printed p-
## value

## 
##  KPSS Test for Level Stationarity
## 
## data:  datatimeseries[, c(2)]
## KPSS Level = 0.28339, Truncation lag parameter = 4, p-value = 0.1

kpss.test(datatimeseries[,c(3)])

## 
##  KPSS Test for Level Stationarity
## 
## data:  datatimeseries[, c(3)]
## KPSS Level = 0.35737, Truncation lag parameter = 4, p-value = 0.09553

First the two Unit Tests ADF and KPSS tests were performed. A p-value < 0.05 in the ADF and > 0.05 in KPSS test indicates the time series is stationary. Therefore, the first time series does not pass this test and can be concluded to be non-stationary. Our main variable (third column) is already stationary and does not need any differencing to transform the data.

Furthermore, to detect and study seasonality in the time series a correlogram and STL decomposition was applied. The first correlogram on the that for increasing lags there is a general decreasing trend till lag 15 and then an increasing trend towards zero till lag 25. Meanwhile, the last plot for corelogram the data degradation happens more slowly and there are more variations in the plot. Furthermore, the STL decomposition was applied for the variable being studied (percentage change compared to the same quarter the year before). The STL decomposition was chosen due to its robustness and versatility compared to other methods. The correlogram and STL confirm a seasonality trend in the data.

Auto Correlation Plots

datatimeseries[, c(1)] %>%acf(lag.max = 300,main = "Autocorrelation Plot - R- indexes")

datatimeseries[, c(2)] %>%acf(lag.max = 300,main = "Autocorrelation Plot - R - change 1")

datatimeseries[, c(3)] %>%acf(lag.max = 300,main = "Autocorrelation Plot - R - change 3")

STL Decomposition Plot

pcs%>%stl(t.window=11, s.window="periodic", robust=FALSE) %>% autoplot() + xlab("Year") +ggtitle("STL Decompostion plot")

After careful analysis, the ETS and ARIMA models were chosen for forecasting the percent change in percentage change compared to the same quarter the year before. The two models are suggested as adaptable for seasonality and popular models for forecasting that are known to give accurate results. As we are not working with time series regression, and we concluded data is stationary structural breaks and QLR testing would not be necessary. However, for educational and curios reasons the chow test for structural breaks was applied and returned 5 structural breaks as seen in the screenshot below. We would choose the lowest BIC being the 2016(2) data point.

Breakpoints

MFG_bp <- breakpoints(pcs ~ 1)
summary(MFG_bp)

## 
##   Optimal (m+1)-segment partition: 
## 
## Call:
## breakpoints.formula(formula = pcs ~ 1)
## 
## Breakpoints at observation number:
##                       
## m = 1         55      
## m = 2      37 55      
## m = 3      37 55 71   
## m = 4      37 55 71 86
## m = 5   18 40 55 71 86
## 
## Corresponding to breakdates:
##                                                
## m = 1                   2008(3)                
## m = 2           2004(1) 2008(3)                
## m = 3           2004(1) 2008(3) 2012(3)        
## m = 4           2004(1) 2008(3) 2012(3) 2016(2)
## m = 5   1999(2) 2004(4) 2008(3) 2012(3) 2016(2)
## 
## Fit:
##                                              
## m   0      1      2      3      4      5     
## RSS 1134.2  921.6  819.8  795.4  749.0  732.3
## BIC  552.9  540.6  537.7  543.9  546.9  553.9

MFG_ci <- confint(MFG_bp)
plot(pcs)
lines(MFG_bp)
lines(MFG_ci)

ETS

The trend that was observed in the STL decomposition can be classified as a decreasing trend with some seasonality and thus I chose the Exponential Smoothing ETS method to make short-term forecasts. Through the STL decomposition, the model for ETS chosen was the AAA model. ETS AAA model is a seasonal algorithm that models a time series using an equation that accounts for additive error, additive trend, and additive seasonality. The algorithm is also popularly known as the Holt-Winters algorithm. The SES and other types of exponential smoothing would be too simple and would not detect the trend and seasonality in the data as the Winter Holt. We see from the plot that the exponential method is very successful in predicting the seasonal peaks and the fitted line is very close to the original line. However, the predicted line does not seem to capture the future trend properly and predicting a line. This indicated the choice of model could be improved.

model.ets1 = ets(pcs,model="AAA",beta=0.2)
fc.ets1 = forecast(model.ets1,h=20)
autoplot(fc.ets1)

autoplot(pcs) + autolayer(fitted(model.ets1), series="Fitted") + ggtitle("Percentage change compared to same quarter the year before")+ xlab("Year")

summary(model.ets1 )

## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = pcs, model = "AAA", beta = 0.2) 
## 
##   Smoothing parameters:
##     alpha = 0.9997 
##     beta  = 0.2 
##     gamma = 1e-04 
##     phi   = 0.8 
## 
##   Initial states:
##     l = -1.8401 
##     b = 1.0092 
##     s = -0.1974 -0.081 0.1535 0.125
## 
##   sigma:  2.0799
## 
##      AIC     AICc      BIC 
## 643.9287 645.8436 667.7283 
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE      ACF1
## Training set 0.01634377 1.987897 1.360348 NaN  Inf 0.3849205 0.2114789

We can investigate whether the predictive model can be improved upon by checking whether the in-sample forecast errors show non-zero autocorrelations at lags 1-20, by making a correlogram and carrying out the Ljung-Box test: The p value below as well as forecast errors exceeding significance bounds indicates there is evidence of non-zero autocorrelations. From the histogram of forecast errors, it seems plausible that the forecast errors are normally distributed with mean zero. Therefore, this confirms the conclusion that this model could be improved on and is not the best for our dataset. Later, the error measurements such as MSE will be compared with ARIMA to confirm this claim.

Autocorrelation Plot

acf(model.ets1$residuals, lag.max=20, main='corelogram Residuals')

## Ljung-Box test

Box.test(model.ets1$residuals, lag=20, type="Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  model.ets1$residuals
## X-squared = 39.486, df = 20, p-value = 0.005799

ETS forecast time series plot

autoplot(model.ets1$residuals) + ggtitle("ETS forecast time series plot")+ xlab("Year")

## Error Plot

plotForecastErrors <- function(forecasterrors)
{
  # make a histogram of the forecast errors:
  mybinsize <- IQR(forecasterrors)/4
  mysd   <- sd(forecasterrors)
  mymin  <- min(forecasterrors) - mysd*5
  mymax  <- max(forecasterrors) + mysd*3
  # generate normally distributed data with mean 0 and standard deviation mysd
  mynorm <- rnorm(10000, mean=0, sd=mysd)
  mymin2 <- min(mynorm)
  mymax2 <- max(mynorm)
  if (mymin2 < mymin) { mymin <- mymin2 }
  if (mymax2 > mymax) { mymax <- mymax2 }
  # make a red histogram of the forecast errors, with the normally distributed data overlaid:
  mybins <- seq(mymin, mymax, mybinsize)
  hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
  # freq=FALSE ensures the area under the histogram = 1
  # generate normally distributed data with mean 0 and standard deviation mysd
  myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
  # plot the normal curve as a blue line on top of the histogram of forecast errors:
  points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}

plotForecastErrors(model.ets1$residuals)

ARIMA Model

There are two options for choosing the ARIMA model. For practice purposes, I’ve decided to plot the PCF and ACF graphs and conclude the appropriate model and later confirming this using auto.arima() function in R.

DIFFERENCING THE FIRST COLUMN

index <- datatimeseries[, c(1)]

indexdiff <- diff(index, differences=2)
plot.ts(indexdiff)

## Selecting a candidate ARIMA model

acf(pcs, lag.max=20,main = "Autocorrelation Plot")             # plot a correlogram

acf(pcs, lag.max=20, plot=FALSE)

## 
## Autocorrelations of series 'pcs', by lag
## 
##   0.00   0.25   0.50   0.75   1.00   1.25   1.50   1.75   2.00   2.25   2.50 
##  1.000  0.821  0.533  0.236 -0.020 -0.109 -0.116 -0.089 -0.055 -0.013  0.001 
##   2.75   3.00   3.25   3.50   3.75   4.00   4.25   4.50   4.75   5.00 
## -0.024 -0.043 -0.057 -0.046 -0.016  0.007  0.016  0.030  0.072  0.121

pacf(pcs, lag.max=20,main = "Partial Autocorrelation Plot")

pacf(pcs, lag.max=20, plot=FALSE)

## 
## Partial autocorrelations of series 'pcs', by lag
## 
##   0.25   0.50   0.75   1.00   1.25   1.50   1.75   2.00   2.25   2.50   2.75 
##  0.821 -0.431 -0.133 -0.109  0.313 -0.142 -0.038 -0.077  0.216 -0.165 -0.088 
##   3.00   3.25   3.50   3.75   4.00   4.25   4.50   4.75   5.00 
##  0.028  0.128 -0.029 -0.076 -0.008  0.068  0.072  0.082 -0.005

As observed in the correlogram, the autocorrelations for lags 1, 2 and 3 go beyond the significance bounds, and that the autocorrelations drops off to zero after lag 3. Another attribute for lags 1,2,3 is that they are positive with a decreasing magnitude with respect to increasing lags.

As observed in partial autocorrelogram, the partial autocorrelation at lag 1 is positive while the partial autocorrelation at lag 2 is negative. They both exceed the significance bounds and the partial autocorrelations drops off to zero after lag 2.

Due to the following observations: the correlogram drops off to zero after lag 3 and the partial correlogram is zero after lag 2, the following ARMA models are possible for the time series: • an ARMA(2,0) model, since the partial autocorrelogram is zero after lag 2, and the correlogram tails off to zero after lag 3, and the partial correlogram is zero after lag 2. • an ARMA(0,3) model, since the autocorrelogram is zero after lag 3, and the partial correlogram tails off to zero • an ARMA(p,q) mixed model, since the correlogram and partial correlogram tail off to zero

We will now evaluate the parameters and choose the best model. The ARMA (2,0) model has 2 parameters, the ARMA (0,3) model has 3 parameters, and the ARMA (p, q) model has at least 2 parameters. Therefore, using the principle of parsimony, the ARMA (2,0) model and ARMA (p, q) model are equally adequate candidate models.

The summary below shows that the ARIMA function chose the correct model. The forecast plot for ARIMA is show and has captured the 5th drop in percent change around 2025 as anticipated.

pcs_arima<- forecast::auto.arima(pcs)
pcs_arima_fit<- auto.arima(pcs)
autoplot(forecast(pcs_arima_fit,h=20))

Plotting residuals

checkresiduals(pcs_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,0,0)(2,0,0)[4] with non-zero mean
## Q* = 2.7043, df = 3, p-value = 0.4395
## 
## Model df: 5.   Total lags used: 8

The correlogram shows that none of the sample autocorrelations for lags 1-20 exceed the significance bounds. Moreover, the reported p-value for the Ljung-Box test is 0.9. Therefore, we can conclude that there is very little evidence for non-zero autocorrelations in the forecast errors at lags 1-20.

The histogram of the time series shows that the forecast errors are roughly normally distributed, and the mean seems to be close to zero. Hence, it is likely that the forecast errors are normally distributed with mean zero and constant variance. Successive forecast errors do not seem to be correlated, and the forecast errors seem to be normally distributed with mean zero and constant variance. Therefore, a valid conclusion can be made that the ARIMA (2,0,0) is an adequate predictive model for the percentage change compared to the same quarter the year before time series data.

LETS COMPARE MODELS!

The results below shows that for all error measures the ARIMA performs much better and is a better model for our dataset. Moreover, the forecast plot for the ARIMA portrayed the changes in the trends much better than the ETS.

summary(pcs_arima)

## Series: pcs 
## ARIMA(2,0,0)(2,0,0)[4] with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     sar1     sar2    mean
##       1.2209  -0.3548  -0.5540  -0.3097  2.5814
## s.e.  0.0920   0.0943   0.1048   0.0984  0.5824
## 
## sigma^2 estimated as 2.352:  log likelihood=-191.03
## AIC=394.06   AICc=394.93   BIC=409.93
## 
## Training set error measures:
##                        ME   RMSE      MAE  MPE MAPE      MASE        ACF1
## Training set 0.0004520438 1.4962 1.186112 -Inf  Inf 0.3356191 0.001730779

summary(model.ets1)

## ETS(A,Ad,A) 
## 
## Call:
##  ets(y = pcs, model = "AAA", beta = 0.2) 
## 
##   Smoothing parameters:
##     alpha = 0.9997 
##     beta  = 0.2 
##     gamma = 1e-04 
##     phi   = 0.8 
## 
##   Initial states:
##     l = -1.8401 
##     b = 1.0092 
##     s = -0.1974 -0.081 0.1535 0.125
## 
##   sigma:  2.0799
## 
##      AIC     AICc      BIC 
## 643.9287 645.8436 667.7283 
## 
## Training set error measures:
##                      ME     RMSE      MAE MPE MAPE      MASE      ACF1
## Training set 0.01634377 1.987897 1.360348 NaN  Inf 0.3849205 0.2114789

Conclusion

The aim of this study was to perform time series analysis on the percentage change of the cost construction index compared to the same quarter the year before. The investigation was to choose the best model to forecast the behaviour of the percent change over the next 5 -10 years. After applying different data identification techniques, root tests, correlograms, and decomposition tests, the ETS and ARIMA were chosen as the best fit models. However, after comparing results and residuals graphs the ARIMA performed much better on the data. The trend shows a slightly decreasing percent change with a significant drop almost every year as confirmed by the forecasting. The results suggest that in this significant drop, construction project the previous year of the same quarter should anticipate that the costs are lower and best.