library(dplyr)
library(lubridate)
library(padr)
library(zoo)
library(forecast)
library(TTR)
library(MLmetrics) 
library(tseries)
library(fpp)
library(TSstudio)
library(ggplot2)
library(tidyr)

Climate Time Series

Today we’ll analyze a data set of Delhi’s daily climate measurements and try to build a prediction model from it. The data can be found on Kaggle: https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data

Data Pre-processing

Let’s read the data and check the variables using glimpse.

climate <- read.csv("DailyDelhiClimateTrain.csv")
glimpse(climate)
## Rows: 1,462
## Columns: 5
## $ date         <chr> "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "…
## $ meantemp     <dbl> 10.000000, 7.400000, 7.166667, 8.666667, 6.000000, 7.0000…
## $ humidity     <dbl> 84.50000, 92.00000, 87.00000, 71.33333, 86.83333, 82.8000…
## $ wind_speed   <dbl> 0.0000000, 2.9800000, 4.6333333, 1.2333333, 3.7000000, 1.…
## $ meanpressure <dbl> 1015.667, 1017.800, 1018.667, 1017.167, 1016.500, 1018.00…

There are 4 variables aside from date. Today I’ll focus on meantemp, the mean temperature for each date.

For now let’s use lubridate’s functions to convert the dates. Since our dates are in year-month-day format, we can simply use ymd().

climate <- climate %>%
  mutate(date=ymd(date))

Let’s check if the dates are complete.

# build the full daily sequence between the first and last date, then compare
complete_dates <- seq.Date(from = min(climate$date),
                           to = max(climate$date),
                           by = "day")
all(climate$date == complete_dates)
## [1] TRUE

Let’s also check for NA values.

anyNA(climate)
## [1] FALSE
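
That’s not the case here, but if some dates had been missing or NA values had shown up, the padr and zoo packages we loaded earlier could fill them in. A minimal sketch of that hypothetical scenario (not needed for this data set):

# pad() inserts rows for any dates missing from the sequence,
# and na.locf() carries the last observed value forward into the NAs
climate_filled <- climate %>%
  pad(interval = "day") %>%
  mutate(meantemp = na.locf(meantemp, na.rm = FALSE))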

Seems like our data is already quite clean, so we can turn it into a time series object now. We’ll keep only the meantemp variable and set the frequency to 365, since our data is daily and we want to look at it on a yearly cycle.

climate_ts <- ts(data = climate$meantemp,
                 start = c(2013, 1),  # first observation: day 1 of 2013
                 frequency = 365)     # daily data with a yearly cycle

Analysis

Let’s plot out the trend and seasonality of our time series object so that we can analyze it. We can do this by using decompose() and plotting it out.

climate_ts %>% decompose() %>% autoplot()


According to the trend panel, the mean temperature of Delhi has been steadily increasing over the years. Seasonally, the temperature is rather low at the start and end of the year and peaks towards the middle.
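
To put a rough number on that upward trend, we can compare the average level of the trend component in the first and last year of the series. A quick sketch of that check (the exact figures depend on the decomposition):

climate_decomp <- decompose(climate_ts)
# the trend component has NAs at both ends from the centred moving average,
# so we drop them when averaging
mean(head(climate_decomp$trend, 365), na.rm = TRUE)
mean(tail(climate_decomp$trend, 365), na.rm = TRUE)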

Cross-validation

Before we create our time series prediction model, let’s split the data first. With time series models we can’t just randomly assign observations to the training or testing set, since the data has to stay in chronological order. So let’s simply take the last 365 days (one year) as the test set and use the rest for training.

climate_test <- climate_ts %>% tail(365)
climate_train <- climate_ts %>% head(-365)

Holt-Winters Model

Since our data has seasonality, we’ll be using the Holt-Winters model.

climateHolt <- HoltWinters(climate_train)
holtForecast <- forecast(climateHolt, h=365)
holtForecast %>% as.data.frame()

Now, let’s check the accuracy of our model.

accuracy(holtForecast$mean, climate_test)
##                 ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 0.9592066 3.000553 2.382942 3.183632 9.445904 0.7288183  1.714971
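
As a quick sanity check, we can also recompute the MAE and MAPE with the MLmetrics package we loaded at the start. The numbers should line up with the accuracy() output above, keeping in mind that MLmetrics reports MAPE as a proportion rather than a percentage.

MAE(y_pred = as.numeric(holtForecast$mean), y_true = as.numeric(climate_test))
MAPE(y_pred = as.numeric(holtForecast$mean), y_true = as.numeric(climate_test))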

With an MAE of about 2.38, it’s not too bad. Let’s see if we can get better values by tuning the hyperparameters.

Hyperparameter tuning

We can adjust our Holt-Winters model by setting the alpha (level), beta (trend), and gamma (seasonal) smoothing parameters manually; fixing beta and gamma at 0 keeps the initial trend and seasonal estimates as they are rather than updating them over time. We’ll also set seasonal to additive, since the size of the seasonal swings stays relatively constant.

climateHoltTuning <- HoltWinters(climate_train, alpha=0.35, beta=0, gamma=0, seasonal = "additive")
tuningForecast <- forecast(climateHoltTuning, h=365)
tuningForecast %>% as.data.frame()

Now, let’s compare the two models’ accuracy.

accuracy(holtForecast$mean, climate_test)
##                 ME     RMSE      MAE      MPE     MAPE      ACF1 Theil's U
## Test set 0.9592066 3.000553 2.382942 3.183632 9.445904 0.7288183  1.714971
accuracy(tuningForecast$mean, climate_test)
##                ME     RMSE      MAE        MPE     MAPE      ACF1 Theil's U
## Test set 0.169967 2.945899 2.339756 0.03180694 9.362847 0.6987164   1.73176

Seems like we managed to lower the MAE in the new model by around 0.04.
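
Instead of picking alpha by hand, we could also loop over a grid of candidate values and keep the one with the lowest test-set MAE. A rough sketch of that idea, keeping beta and gamma fixed at 0 as before:

# try a range of alpha values and compute each model's MAE on the test set
alphas <- seq(0.05, 0.95, by = 0.05)
maes <- sapply(alphas, function(a) {
  fit <- HoltWinters(climate_train, alpha = a, beta = 0, gamma = 0, seasonal = "additive")
  fc  <- forecast(fit, h = 365)
  mean(abs(as.numeric(climate_test) - as.numeric(fc$mean)))
})
alphas[which.min(maes)]  # alpha with the lowest test-set MAE

Strictly speaking, tuning against the test set is a bit optimistic; a separate validation split would be the safer choice.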

Visualization

Let’s visualize the forecast in order to compare it with the original test data.

climate_train %>% 
  autoplot() +
  autolayer(climate_test, series = "Data Test") +
  autolayer(climateHoltTuning$fitted[,1], series = "Model") +
  autolayer(tuningForecast$mean, series = "Forecast")


Visually, our tuned model’s forecast also tracks the test data reasonably well.

Conclusion

We can use time series models not just to predict future values based on how the data changes over time, but also to analyze the pattern of change itself, whether that’s the general trend or the seasonality. In our analysis of Delhi’s mean temperature over the years, we can see one of the signs of climate change in effect: the overall (trend-wise) increase in temperature. The same analysis can be applied to the climate data of other regions and, hopefully, can be used to assess the risks of climate change over the next couple of years.