1 Introduction

1.1 What is Drought?

The dictionary defines drought as “a period of dry weather” or “an extended shortage.” Drought has always been difficult to define because it varies based on location and associated impacts to the human and natural environment. In all cases, drought is temporary in nature, tends to be cyclical and results in an abnormal extended decrease in water available to humans, plants and wildlife. There are generally four types of drought: Hydrological, Socioeconomic, Meteorological and Agricultural. There is some debate if socioeconomic and hydrological types are independent or two faces of the same coin. But in this article we are specifically focuses on hydrological drought that is occurring in the US.

1.2 US Drought

The most recent nationwide drought in the United States was in 2012 and many states are still experiencing a severe drought in 2014. More than 60% of the United States was in a severe drought covering over 1,000 counties. According to the United States Department of Agriculture (USDA), over 80% of productive agricultural land was affected by this drought with half of the total value of livestock being ravaged by severe drought conditions.

As of September 2021, nationally the US is still affected by drought from the charge above we can see the western hemisphere is hugely affected by drought especially California.

California is no stranger to drought; it is a recurring feature of our climate. We recently experienced the 5-year event of 2012-2016, and other notable historical droughts included 2007-09, 1987-92, 1976-77, and off-and-on dry conditions spanning more than a decade in the 1920s and 1930s. So in this article we are mainly focuses on the drought problem in California

1.3 USDM

The U.S. Drought Monitor (USDM) is a map that is updated each Thursday to show the location and intensity of drought across the country. The USDM uses a five-category system, labeled Abnormally Dry or D0, (a precursor to drought, not actually drought), and Moderate (D1), Severe (D2), Extreme (D3) and Exceptional (D4) Drought. Drought categories show experts’ assessments of conditions related to dryness and drought including observations of how much water is available in streams, lakes, and soils compared to usual for the same time of year. Below are the example of what USDM data looks like.

data <- read.csv("data_input/usdm-prcnt-area_cali.csv")
reactable::reactable(head(data)) 

1.4 DSCI

The Drought Severity and Coverage Index is an experimental method for converting drought levels from the U.S. Drought Monitor map to a single value for an area. DSCI values are part of the U.S. Drought Monitor data tables. If you want to compute it yourself, using cumulative Drought Monitor data, add the values for D0 through D4 to get the Drought Severity and Coverage Index. Possible values of the DSCI are from 0 to 500. Zero means that none of the area is abnormally dry or in drought, and 500 means that all of the area is in D4, exceptional drought. Or, to see more math, use categorical (not cumulative) Drought Monitor data, and compute a weighted sum:

\[ 1*(D0) + 2(D1) + 3(D2) + 4(D3) + 5(D4) = DSCI \]

1.5 Target

This article aims to do a prediction using time series and forecasting approach. We are going to explore different methods of time series and how the outcome would be. By the end of the article we hope to give the most accurate forecasting model there is. The datasets we are going to use in this article are DSCI number as it gives a single vector value from different severities of the USDM value. We are going to use USDM data when necessary, such condition are; adding context to the

The methods we are going to use in this article are Simple Moving Average, Exponential Smoothing and ARIMA model.

2 Analysis

2.1 Import library

library(dplyr) # for data wrangling
library(lubridate) # to deal with date
library(padr) # for padding
library(forecast) # time series library
library(tseries) # for adf.test
library(MLmetrics) # calculate error

2.2 Read Dataset and Data Engineering

data <- read.csv("data_input/dsci_cali.csv")
data <- data %>% mutate(MapDate = ymd(MapDate)) %>% 
        select(-Name) %>% 
        filter(MapDate >= ymd("2010-01-01"))
str(data)
## 'data.frame':    610 obs. of  2 variables:
##  $ MapDate: Date, format: "2010-01-05" "2010-01-12" ...
##  $ DSCI   : int  175 188 169 79 78 69 51 50 50 50 ...

From this data set we are only going to use the DSCI value and the relative date of it so we are going to deselect column Name as it only contains California in it. And then since the data from 20 years ago would not make a significant difference, we are only restricting our use of data from 2010.

2.3 Data Cleaning and Interval check

Check for missing values:

colSums(is.na(data))
## MapDate    DSCI 
##       0       0

** Check for missing interval: **

library(zoo)
data_clean <- data %>% pad()
## pad applied on the interval: week
data_clean %>% mutate(DSCI = na.fill(DSCI, fill = "extend")) %>% head(10)
##       MapDate DSCI
## 1  2010-01-05  175
## 2  2010-01-12  188
## 3  2010-01-19  169
## 4  2010-01-26   79
## 5  2010-02-02   78
## 6  2010-02-09   69
## 7  2010-02-16   51
## 8  2010-02-23   50
## 9  2010-03-02   50
## 10 2010-03-09   50

2.4 Time Series Object & EDA

2.4.1 Making Time Series Object and Data vizualization

# making ts object
data_ts <- ts(data = data$DSCI, 
              start = c(2010,1,5),
              frequency = 52) # Yearly Seasonality

# Viz
data_ts %>% TSstudio::ts_plot(slider = T)

From the graph above we can see there is an incline in drought levels nearing 2021

2.4.2 Decomposition

In this section, we are going to find whether our timseries object has a trend and seasonal properties

# decompose object
data_decom <- decompose(data_ts)
autoplot(data_decom)

The plot shows that trend are showing some fluctuations of up and downs but it should not be a problem considering it is not a wild fluctuation and through time there are some efforts to be made to reduce droughts in California. But to make sure that our data is stationary we can use Augmented Dickey-Fuller Test

# Check for stationary
adf.test(data_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_ts
## Dickey-Fuller = -1.3478, Lag order = 8, p-value = 0.8544
## alternative hypothesis: stationary

Apparently the p-value > 0.05 indicating the trend is not stationary. So we can keep a note for the future, if we are going to use SARIMA method, we have to perform differencing on the data.

2.5 Cross Validation

We are going to cross validate by using the data from this year (2021) as the unseen data. The data includes 36 weeks of test data and the previous 11 years as our train data.

data_train <- data_ts %>% head(length(data_ts) - 36)
data_test <- data_ts %>% tail(36)

2.6 Model Building

For model building, We will use two of the widely used time series modeling in business and industry: Exponential Smoothing (ETS) Holt-Winters and Seasonal ARIMA. We use ETS Holt-Winters because our data contains both seasonal and trend. I also want to compare it between seasonal ARIMA to check whether seasonal ARIMA can give better forecasting performance.

2.6.1 Holt-Winters

# automatic Holt-Winters 
hw_auto <- HoltWinters(x = data_train, seasonal = "additive")
hw_forecast_auto <- forecast(object = hw_auto, h = 36)
data_train %>% 
  autoplot() +
  autolayer(data_test, series = "test") +
  autolayer(hw_forecast_auto$mean, series = "forecast") +
  autolayer(hw_auto$fitted[,1], series = "fitted")

By the graph above, it shows that our forecast is not great on predicting our test data. to change this we have to tweak our smoothing coefficient. There are three smoothing coefficient that we can tweak to give a better forecast.

  1. alpha: specifies the coefficient for the level smoothing

  2. beta : specifies the coefficient for the trend smoothing.

  3. gamma: specifies the coefficient for the seasonal smoothing.

# manual tuning Holt-Winters
hw_manual1 <- HoltWinters(x = data_train, seasonal = "additive",
                          alpha = 0.2, beta = 0.52, gamma = 0.5)
hw_manual2 <- HoltWinters(x = data_train, seasonal = "additive",
                          alpha = 0.2, beta = 0.52, gamma = F)

For this tuning process, we are using 0.2 as our alpha value. the reasoning behind this is that the weight of the current time series value is not significantly different than the weight of the previous value.

hw_forecast_manual1 <- forecast(object = hw_manual1, h = 36)
hw_forecast_manual2 <- forecast(object = hw_manual2, h = 36)
data_train %>% 
  autoplot() +
  autolayer(data_test, series = "test") +
  autolayer(hw_forecast_manual1$mean, series = "forecast1") +
  autolayer(hw_forecast_manual2$mean, series = "forecast2") +
  autolayer(hw_manual1$fitted[,1], series = "fitted")

From the results, we can see that our Holt-Winters model make a great prediction to our test data but a relatively worse prediction on our train data. From that plot we can see a relatively high fluctuations on our train data. From this we can conclude that two of our Holt-Winters model is not good enough. So we can use SARIMA model to see whether it makes a better prediction or not.

Another thing to also point out that by the blue line that is interpreted in the plot, it suggested that our time series data have a seasonality in it.

2.6.2 Model Evaluation

accuracy(hw_forecast_auto$mean, data_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -144.0089 156.0936 144.0089 -39.47626 39.47626 0.8493939  13.70486
accuracy(hw_forecast_manual1$mean, data_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -9.889612 22.08971 15.66834 -2.825329 4.247168 0.8266654  1.832203

2.7 ARIMA

data_ts %>% decompose() %>%  autoplot()

Another method that can be use for our data is Seasonal ARIMA model. Seasonal ARIMA is the same as ARIMA which includes Auto-Regressive model combined with Moving Average model but includes a seasonal component into it. As we know from the adf.test() that we have conducted above, we can see that our data is not stationary. A stationary time series is one whose properties do not depend on the time at which the series is observed, so a time series with trends or seasonality are not stationary - the trend and seasonality will affect the value of the time series at different times. A white noise series on the other hand is stationary.

2.7.1 Differencing

As we know, our time series data has seasonality and trend included into it. By using differences we can get rid of those effects in order to make ARIMA and interpret the ACF and PACF value. To use differencing function, we will use diff() command. diff() has a parameter called lag that indicates how many steps that each data has to subtract itself to. Subtraction is a part of differencing that we will not discuss at this point but you should find the definition when you study ARIMA models.

For this data differencing we are going to use lag = 52 that indicates an annual seasonality, the number 52 comes from the number of weeks in a year. After that we use lag = 1 to get rid of trend bias. The result of the differencing process should give us a time series object full of noise, without any seasonality or trend, or in other words we called stationary.

# remove seasonal and trend effects.
data_train %>% diff(lag = 52) %>% diff() %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -6.5792, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
data_train %>% diff(lag = 52) %>% diff() %>% autoplot()

2.7.2 ACF and PACF Model Fitting

After we transform our data using diff() we would like to see what info we could get for our model building using ARIMA. To do this, we are going to use tsdisplay(). tsdisplay() will let us see the ACF and PACF values to find the P and Q values as our ARIMA parameters.

data_train %>% diff(lag = 52) %>% diff() %>% tsdisplay()

The Basic principle of how we choose the P and Q value of our data is to find out the pattern of ACF and PACF, whether it’s cut off or dies down, here are the

Assumption 1:

Seasonal:

  • P : 0 (dies down)
  • D : 1
  • Q : 0 (dies down)

Overall:

  • p : 0 (dies down)
  • d : 1
  • q : 0 (dies down)

Assumption 2:

Seasonal: * P : 0 (dies down) * D : 1 * Q : 0 (dies down)

Overall: * p : 3 (cut off) * d : 1 * q : 0 (dies down)

2.7.3 Model Creation

We will use Arima from the forecast library to use as our tool. we are going to use the two assumptions that we found earlier and put it in the order and seasonal parameter. As a comparison we will also use the automatic Seasonal ARIMA to find the best model.

# Manual SARIMA model
sarima1 <- Arima(y = data_train, order = c(0,1,0), seasonal = c(0,1,0))
sarima2 <- Arima(y = data_train, order = c(0,1,0), seasonal = c(3,1,0))

# Auto SARIMA model
# sarima.auto <- Arima(y = data_train, seasonal = TRUE)
sarima1_forecast <- forecast(sarima1, h = 36)
sarima2_forecast <- forecast(sarima2, h = 36)
data_train %>%autoplot() +
  autolayer(data_test, series = "test") +
  autolayer(sarima1$fitted, series = "fitted") +
  autolayer(sarima1_forecast$mean, series = "forecast sarima1") +
  autolayer(sarima2_forecast$mean, series = "forecast sarima2") 

As we can see, the blue line which is our sarima2 forecast dives down which makes our second assumption is wrong. So after this, the model we are going to observe further is the sarima1 model.

2.7.4 Model Correctness

  1. Akaike Information Criterion
sarima1$aic
## [1] 4451.885
  1. Errors
# Accuracy of Forecast
accuracy(sarima1_forecast$mean, data_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -51.94444 69.24433 53.16667 -16.50707 16.89622 0.9348997  7.665544

2.7.5 Assumption

Another thing that we must look at is the residual value of our model. These assumptions of our time series model acts as a justification that our data is well-made or not.

The first assumption that we are going to look at is the No-Autocorrelation residual which means every residuals that we produce cannot have any correlation. Secondly we will have to see that our residuals is equal to zero or not.

  1. No-Autocorrelation residual

\(H_0\): residual has no-autocorrelation \(H_1\): residual has autocorrelation

acf(sarima1$residuals)

Box.test(sarima1$residuals, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  sarima1$residuals
## X-squared = 45.181, df = 1, p-value = 1.797e-11
  1. Normality Residual

\(H_0\): residual is normally distributed \(H_1\): residual is not normally distributed

shapiro.test(sarima1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  sarima1$residuals
## W = 0.68036, p-value < 2.2e-16

Our SARIMA model failed two fulfill both of the assumptions, which means we are rejecting our SARIMA model as it is not perfect.

2.8 Model Comparison

data_test %>% autoplot() +
  autolayer(hw_forecast_manual1$mean, series = "Holt-Winters") +
  autolayer(sarima1_forecast$mean, series = "SARIMA")

# Holt-Winters Accuracy
accuracy(hw_forecast_manual1$mean, data_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -9.889612 22.08971 15.66834 -2.825329 4.247168 0.8266654  1.832203
# SARIMA Accuracy
accuracy(sarima1_forecast$mean, data_test)
##                 ME     RMSE      MAE       MPE     MAPE      ACF1 Theil's U
## Test set -51.94444 69.24433 53.16667 -16.50707 16.89622 0.9348997  7.665544

The Model of Forecasting we are going to use is our Holt-Winters model because it has a very low errors. We accept this model noting that it could only predict our model as far as 6 months forward.