This assessment consists of three questions: 1. Spatial modelling 2. Time series modelling 3. A combination of both
The total number of marks available is 160, worth 80%, split 55/50/55 between the three questions. Marks indicated for individual parts suggest the relative amount of detail required to answer questions. Commented R code (and any resulting output/plots) should form part of the answers; however, only include R output that is helpful for answering the questions. It should be clear from your answers which models you are fitting and why (i.e. do not only include code/plots), and ensure that plots are properly labelled and explained.
1. Spatial modelling
You have been given a dataset with 220 measurements of total monthly precipitation in the Netherlands in September 2019. This dataset was downloaded from the Copernicus Climate Data Store [1]. In the file netherlands.csv, each row contains a station name, longitude and latitude of the observation station, and total precipitation for the month in millimetres.
#Import the netherlands data
library(readr)
Netherlands <- read_csv("netherlands.csv")
# Load required packages
library(tidyverse) # for data manipulation and visualization
library(sf)
library(sp) # for handling spatial data
library(ggplot2) # for data visualization
library(RColorBrewer) # for color palettes
# Convert the data frame to a SpatialPointsDataFrame (sp) and set its coordinate reference system
coordinates(Netherlands) <- c("longitude", "latitude")
proj4string(Netherlands) <- CRS("+init=epsg:4326")
Netherlands_sf <- st_as_sf(Netherlands) # also convert to an sf object for plotting with geom_sf
# Plot the data
ggplot() +
geom_sf(data = Netherlands_sf, aes(color = precip)) +
scale_color_gradientn(colors = brewer.pal(9, "Blues")) +
theme_classic()
# Produce numerical summaries
summary(Netherlands) # summary statistics for non-spatial data
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## longitude 3.500 7.033
## latitude 50.767 53.483
## Is projected: FALSE
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Number of points: 220
## Data attributes:
## station_name precip
## Length:220 Min. : 33.90
## Class :character 1st Qu.: 80.85
## Mode :character Median :100.15
## Mean :106.57
## 3rd Qu.:137.35
## Max. :185.60
# Load the Netherlands dataset
Netherlands <- read.csv("netherlands.csv")
# Select 3 random locations
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)
# Label the chosen locations as A, B, and C
chosen_stations <- Netherlands[random_locs, ]
chosen_stations$location <- c("A", "B", "C")
chosen_stations <- chosen_stations[c("location", "station_name", "longitude", "latitude", "precip")]
cat("Chosen locations:\n")
## Chosen locations:
print(chosen_stations)
## location station_name longitude latitude precip
## 159 A SCHAGEN 4.800 52.783 124.8
## 207 B VOORSCHOTEN 4.436 52.139 111.1
## 179 C WEERT 5.700 51.250 55.7
# Remove the chosen locations from the Netherlands data frame
Netherlands <- Netherlands[-random_locs, ]
# Select a further 3 random locations to hold out
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)
# Remove these additional hold-out locations to create the training dataset
netherlands_train <- Netherlands[-random_locs, ]
library(gstat)
# Convert data to spatial object
coordinates(netherlands_train) <- c("longitude", "latitude")
proj4string(netherlands_train) <- CRS("+proj=longlat +datum=WGS84")
# Calculate sample variogram
variogram_obj <- variogram(precip ~ 1, netherlands_train, cutoff=200, width=10)
summary(variogram_obj)
## np dist gamma dir.hor dir.ver
## Min. : 126 Min. : 7.499 Min. : 143.1 Min. :0 Min. :0
## 1st Qu.: 779 1st Qu.: 52.594 1st Qu.: 445.1 1st Qu.:0 1st Qu.:0
## Median :1125 Median : 99.963 Median : 962.3 Median :0 Median :0
## Mean :1010 Mean :100.149 Mean :1008.3 Mean :0 Mean :0
## 3rd Qu.:1332 3rd Qu.:147.510 3rd Qu.:1455.7 3rd Qu.:0 3rd Qu.:0
## Max. :1407 Max. :194.713 Max. :2127.0 Max. :0 Max. :0
## id
## var1:20
# Plot sample variogram
ggplot(data.frame(variogram_obj), aes(x = dist, y = gamma)) +
geom_point() +
geom_line() +
xlab("Distance") +
ylab("Semi-variance") +
ggtitle("Sample Variogram")+theme_bw()
Based on the summary statistics, the mean distance between pairs of points in the sample variogram is around 100 km and the maximum binned distance is around 195 km (gstat reports great-circle distances in kilometres when the coordinates are unprojected longitude/latitude). The semivariance generally increases with distance over this range, indicating spatial dependence in the data, although it may not extend much beyond the distances observed. It is therefore sensible to set a maximum distance (here a cutoff of 200 km) before fitting a model, so that the fit concentrates on the range of distances actually represented in the data. The sample variogram also appears to approach a nonzero semivariance as the distance tends to zero, which suggests a nugget effect. The nugget represents variation at distances smaller than the minimum spacing between observations, caused for example by measurement error or microscale variability. Taking the nugget into account when fitting a spatial model prevents overestimating the range of the spatial dependence, so a nugget parameter is likely to be required in the variogram model.
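Before fixing variogram parameters, rough starting values can be read off the sample variogram itself. A minimal sketch, assuming the variogram_obj computed above (the 95%-of-sill rule for the range is a common heuristic, not part of gstat):
library(gstat)
# Rough starting values read off the sample variogram (a heuristic, not a fit):
# - nugget: semivariance in the first (shortest-distance) bin
# - sill: average semivariance in the last few bins, where the variogram levels off
# - range: distance at which the semivariance first reaches ~95% of that sill
nugget_start <- variogram_obj$gamma[1]
sill_start <- mean(tail(variogram_obj$gamma, 3))
range_start <- variogram_obj$dist[which(variogram_obj$gamma >= 0.95 * sill_start)[1]]
c(nugget = nugget_start, sill = sill_start, range = range_start)
# These can then be passed to vgm() / fit.variogram() as initial values
init_vgm <- vgm(psill = sill_start - nugget_start, model = "Sph", range = range_start, nugget = nugget_start)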
Assumptions:
Trend/mean function: the mean function is assumed to be constant, i.e. the expected precipitation is the same across the Netherlands, with all spatial structure captured by the covariance.
Covariance function: a spherical variogram model is assumed, so the correlation between precipitation values decreases as the distance between locations increases and levels off beyond the range. The model is defined by three parameters: the partial sill (the amount of spatially structured variance, i.e. the sill minus the nugget), the range (the distance beyond which observations are effectively uncorrelated), and the nugget (the spatially uncorrelated variability, such as measurement error or microscale variation), which is included because the covariance structure alone cannot account for all of the variability in precipitation.
Parameter values supplied to the model:
Partial sill: 0.7; range: 50000 (intended as 50 km); nugget: 0.1. These act as initial values for fit.variogram, and the fitted parameter values should be read from the returned variogram model object (e.g. by printing sph_vgm).
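For reference, the spherical semivariogram these parameters refer to, with nugget $c_0$, partial sill $c$ and range $a$, is
$$
\gamma(h) \;=\;
\begin{cases}
0, & h = 0,\\[4pt]
c_0 + c\left(\dfrac{3h}{2a} - \dfrac{h^3}{2a^3}\right), & 0 < h \le a,\\[4pt]
c_0 + c, & h > a.
\end{cases}
$$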
Netherlands <- read_csv("netherlands.csv")
library(gstat)
# Convert the Netherlands dataframe to a spatial points dataframe
coordinates(Netherlands) <- c("longitude", "latitude")
# Calculate the empirical variogram of the precipitation data
# (note: no CRS has been set here, so distances are in coordinate units (degrees); a width of 100000 is far wider than the study area and effectively gives a single distance bin)
Netherlands_vgm <- variogram(precip ~ 1, Netherlands, width = 100000)
# Fit a spherical variogram model to the empirical variogram, supplying initial values for the partial sill (0.7), range (50000) and nugget (0.1)
sph_vgm <- fit.variogram(Netherlands_vgm, model = vgm("Sph", psill = 0.7, range = 50000, nugget = 0.1))
# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]
# Fit a spherical variogram model to the training data, specifying the same parameters as before
sph_vgm <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Sph", psill = 0.7, range = 50000, nugget = 0.1))
# Perform kriging to interpolate precipitation values at the locations of the test data, using the fitted spherical variogram model
pred <- krige(precip ~ 1, train, test, model = sph_vgm)
## [using ordinary kriging]
# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_Spherical_model <- sqrt(mean((test$precip - pred$var1.pred)^2))
# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))
# Fit an exponential variogram model to the training data, using the same initial parameter values as before
sph_Exp <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Exp", psill = 0.7, range = 50000, nugget = 0.1))
# Perform kriging to interpolate precipitation values at the test locations, using the fitted exponential variogram model
pred <- krige(precip ~ 1, train, test, model = sph_Exp)
## [using ordinary kriging]
# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_exponential_model <- sqrt(mean((test$precip - pred$var1.pred)^2))
# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))
# Fit a Gaussian variogram model to the training data, using the same initial parameter values as before
sph_Gau <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Gau", psill = 0.7, range = 50000, nugget = 0.1))
# Perform kriging to interpolate precipitation values at the test locations, using the fitted Gaussian variogram model
pred <- krige(precip ~ 1, train, test, model = sph_Gau)
## [using ordinary kriging]
# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_Gaussian_model <- sqrt(mean((test$precip - pred$var1.pred)^2))
# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))
# Print the RMSE values
cat("Spherical Variogram Model:", rmse_Spherical_model, "\n")
## Spherical Variogram Model: 38.25382
cat("Exponential Variogram model:", rmse_exponential_model, "\n")
## Exponential Variogram model: 38.3487
cat("Gaussian Variogram Model:", rmse_Gaussian_model, "\n")
## Gaussian Variogram Model: 38.54071
Assumptions and fitted parameters:
The Gaussian Process model assumes the response is described by a mean (trend) function plus a zero-mean Gaussian process whose covariance function captures the spatial dependence structure of the data.
Here the trend is taken to be linear in longitude and latitude (trend model ~longitude + latitude in the km call), and the covariance is a stationary Gaussian ('gauss', squared-exponential) kernel, so the dependence between two locations decays smoothly as the distance between them grows. A small nugget is included to account for measurement error or other variability not captured by the spatial dependence structure.
The fitted model reports:
Covariance function: Gaussian ('gauss'); length scale: 128.9 km (derived from the fitted covariance parameters); nugget: 1e-06 (fixed by the user); likelihood variance: 2.293 (from the maximum likelihood estimation); log-likelihood: -678.443.
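For reference, the model structure fitted below (linear trend in longitude and latitude, Gaussian kernel with per-coordinate length scales $\theta_d$, process variance $\sigma^2$ and a small fixed nugget $\tau^2$) is
$$
Z(\mathbf{x}) = \beta_0 + \beta_1\,\mathrm{lon} + \beta_2\,\mathrm{lat} + \varepsilon(\mathbf{x}), \qquad
\operatorname{Cov}\!\left(\varepsilon(\mathbf{x}),\varepsilon(\mathbf{x}')\right)
= \sigma^2 \exp\!\left(-\sum_{d=1}^{2}\frac{(x_d - x'_d)^2}{2\theta_d^2}\right) + \tau^2\,\mathbf{1}\{\mathbf{x}=\mathbf{x}'\}.
$$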
# Load necessary libraries
library(readr)
library(DiceKriging)
# Read in data
Netherlands <- read_csv("netherlands.csv")
# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]
## Define the Gaussian Process model using the 'gauss' covariance function
model <- km(formula = precip ~ longitude + latitude,
design = train[, c("longitude", "latitude")],
response = train$precip,
covtype = "gauss",
nugget = 1e-06)
##
## optimisation start
## ------------------
## * estimation method : MLE
## * optimisation method : BFGS
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model :
## - type : gauss
## - nugget : 1e-06
## - parameters lower bounds : 1e-10 1e-10
## - parameters upper bounds : 6.966 5.432
## - variance bounds : 23.15 2542.507
## - best initial criterion value(s) : -9573.557
##
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate 0 f= 9573.6 |proj g|= 28.209
## At iterate 1 f = 740.36 |proj g|= 0.067698
## At iterate 2 f = 737.11 |proj g|= 0.032161
## At iterate 3 f = 736.6 |proj g|= 0.0015168
## At iterate 4 f = 736.6 |proj g|= 0.00037335
## At iterate 5 f = 736.6 |proj g|= 3.2808e-06
## At iterate 6 f = 736.6 |proj g|= 7.0191e-09
##
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
##
## F = 736.604
## final value 736.603976
## converged
# Predict precipitation values for the test data using the Gaussian Process model
pred <- predict.km(model, newdata = test[, c("longitude", "latitude")], type = "UK")
# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse <- sqrt(mean((test$precip - pred$mean)^2))
# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))
# Print the RMSE of the GP predictions and the baseline of predicting the training-set mean everywhere
cat("RMSE of Gaussian Process model:", rmse, "\n")
cat("Baseline RMSE (training-set mean):", mean_rmse, "\n")
## Baseline RMSE (training-set mean): 38.54072
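As an additional validation check on the GP, leave-one-out cross-validation is available in DiceKriging. A minimal sketch, assuming the model object fitted above:
# Leave-one-out cross-validation for the fitted GP
loo <- leaveOneOut.km(model, type = "UK", trend.reestim = TRUE)
# RMSE of the leave-one-out predictions over the training locations
sqrt(mean((train$precip - loo$mean)^2))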
Netherlands <- read_csv("netherlands.csv")
library(gstat)
# Convert the Netherlands dataframe to a spatial points dataframe
coordinates(Netherlands) <- c("longitude", "latitude")
# Calculate the empirical variogram of the precipitation data with a distance band of 100 kilometers
Netherlands_vgm <- variogram(precip ~ 1, Netherlands, width = 100)
# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]
# Fit a spherical variogram model to the training data, using the same initial parameter values as before
sph_vgm <- fit.variogram(variogram(precip ~ 1, train, width = 100), model = vgm("Sph", psill = 0.7, range = 50000, nugget = 1.08))
# Create a data frame with the longitude and latitude of the held-out stations A, B and C, where we want to predict
# (the kriging prediction at these locations is sketched after the true values below)
predict_locs <- data.frame(longitude = chosen_stations$longitude, latitude = chosen_stations$latitude)
# Print the prediction locations
print(predict_locs)
##   longitude latitude
## 1     4.800   52.783
## 2     4.436   52.139
## 3     5.700   51.250
#TRUE values
# Load the Netherlands dataset
Netherlands <- read.csv("netherlands.csv")
# Select 3 random locations
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)
# Label the chosen locations as A, B, and C
chosen_stations <- Netherlands[random_locs, ]
chosen_stations$location <- c("A", "B", "C")
chosen_stations <- chosen_stations[c("location", "station_name", "longitude", "latitude", "precip")]
cat("Chosen locations:\n")
## Chosen locations:
print(chosen_stations)
## location station_name longitude latitude precip
## 159 A SCHAGEN 4.800 52.783 124.8
## 207 B VOORSCHOTEN 4.436 52.139 111.1
## 179 C WEERT 5.700 51.250 55.7
# Remove the chosen locations from the Netherlands data frame
Netherlands <- Netherlands[-random_locs, ]
# Select 3 random locations to remove
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)
# Remove the chosen locations from the Netherlands data frame to create training dataset
netherlands_train <- Netherlands[-random_locs, ]
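With the true values available for comparison, the kriging predictions at stations A, B and C can be obtained from the variogram model fitted above. A minimal sketch, assuming the train, sph_vgm and predict_locs objects from the previous chunks:
# Convert the prediction locations to a spatial object matching the training data
pred_locs_sp <- predict_locs
coordinates(pred_locs_sp) <- c("longitude", "latitude")
# Ordinary kriging at the held-out stations using the fitted spherical variogram
krige_ABC <- krige(precip ~ 1, train, pred_locs_sp, model = sph_vgm)
# Compare the kriging predictions with the true precipitation values at A, B and C
data.frame(location = chosen_stations$location,
           observed = chosen_stations$precip,
           predicted = krige_ABC$var1.pred,
           kriging_variance = krige_ABC$var1.var)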
# Load necessary libraries
library(readr)
library(DiceKriging)
# Read in data
Netherlands <- read_csv("netherlands.csv")
# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]
## Define the Gaussian Process model using the 'gauss' covariance function
model <- km(formula = precip ~ longitude + latitude,
design = train[, c("longitude", "latitude")],
response = train$precip,
covtype = "gauss",
nugget = 1e-06)
##
## optimisation start
## ------------------
## * estimation method : MLE
## * optimisation method : BFGS
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model :
## - type : gauss
## - nugget : 1e-06
## - parameters lower bounds : 1e-10 1e-10
## - parameters upper bounds : 6.966 5.432
## - variance bounds : 23.15 2542.507
## - best initial criterion value(s) : -9573.557
##
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate 0 f= 9573.6 |proj g|= 28.209
## At iterate 1 f = 740.36 |proj g|= 0.067698
## At iterate 2 f = 737.11 |proj g|= 0.032161
## At iterate 3 f = 736.6 |proj g|= 0.0015168
## At iterate 4 f = 736.6 |proj g|= 0.00037335
## At iterate 5 f = 736.6 |proj g|= 3.2808e-06
## At iterate 6 f = 736.6 |proj g|= 7.0191e-09
##
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
##
## F = 736.604
## final value 736.603976
## converged
# Create a grid of points to make predictions on
lon_grid <- seq(min(Netherlands$longitude), max(Netherlands$longitude), by = 0.05)
lat_grid <- seq(min(Netherlands$latitude), max(Netherlands$latitude), by = 0.05)
grid <- expand.grid(longitude = lon_grid, latitude = lat_grid)
# Make predictions on the grid points (standard errors are computed by default)
pred_grid <- predict.km(model, newdata = grid, type = "UK")
# Extract the mean and variance values
mean_grid <- pred_grid$mean
var_grid <- pred_grid$sd^2
# Create a heatmap of the mean values
image(lon_grid, lat_grid, matrix(mean_grid, nrow = length(lon_grid)), col = heat.colors(100), xlab = "Longitude", ylab = "Latitude", main = "Mean Precipitation")
# Create a heatmap of the variance values
image(lon_grid, lat_grid, matrix(var_grid, nrow = length(lon_grid)), col = heat.colors(100), xlab = "Longitude", ylab = "Latitude", main = "Variance of Precipitation")
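To help interpret the variance map, the station locations can be overlaid: the predictive variance should be lowest near the observation sites. A small base-graphics addition, assuming the objects created above:
# Redraw the variance heatmap and overlay the station locations
image(lon_grid, lat_grid, matrix(var_grid, nrow = length(lon_grid)),
      col = heat.colors(100), xlab = "Longitude", ylab = "Latitude",
      main = "Variance of Precipitation with station locations")
points(Netherlands$longitude, Netherlands$latitude, pch = 16, cex = 0.5)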
# Load necessary libraries
library(readr)
library(DiceKriging)
# Read in data
Netherlands <- read_csv("netherlands.csv")
# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]
## Define the Gaussian Process model using the 'gauss' covariance function
## (a very small fixed nugget of 1e-06 is included for numerical stability)
model1 <- km(formula = precip ~ longitude + latitude,
design = train[, c("longitude", "latitude")],
response = train$precip,
covtype = "gauss",
nugget = 1e-06)
##
## optimisation start
## ------------------
## * estimation method : MLE
## * optimisation method : BFGS
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model :
## - type : gauss
## - nugget : 1e-06
## - parameters lower bounds : 1e-10 1e-10
## - parameters upper bounds : 6.966 5.432
## - variance bounds : 23.15 2542.507
## - best initial criterion value(s) : -9573.557
##
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate 0 f= 9573.6 |proj g|= 28.209
## At iterate 1 f = 740.36 |proj g|= 0.067698
## At iterate 2 f = 737.11 |proj g|= 0.032161
## At iterate 3 f = 736.6 |proj g|= 0.0015168
## At iterate 4 f = 736.6 |proj g|= 0.00037335
## At iterate 5 f = 736.6 |proj g|= 3.2808e-06
## At iterate 6 f = 736.6 |proj g|= 7.0191e-09
##
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
##
## F = 736.604
## final value 736.603976
## converged
summary(model1)
## Length Class Mode
## 1 km S4
## Define a second Gaussian Process model with the same specification
## (as written this refits the same model; km() starts its MLE optimisation from different random initial values, so this gives a rough check on the stability of the fit)
model2 <- km(formula = precip ~ longitude + latitude,
design = train[, c("longitude", "latitude")],
response = train$precip,
covtype = "gauss",
nugget = 1e-06)
##
## optimisation start
## ------------------
## * estimation method : MLE
## * optimisation method : BFGS
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model :
## - type : gauss
## - nugget : 1e-06
## - parameters lower bounds : 1e-10 1e-10
## - parameters upper bounds : 6.966 5.432
## - variance bounds : 23.15 2542.507
## - best initial criterion value(s) : -249043937
##
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate 0 f= 2.4904e+08 |proj g|= 2326.8
## At iterate 1 f = 860.48 |proj g|= 0.03117
## Bad direction in the line search;
## refresh the lbfgs memory and restart the iteration.
## At iterate 2 f = 741.66 |proj g|= 0.07334
## ys=-9.195e+01 -gs= 6.796e+01, BFGS update SKIPPED
## At iterate 3 f = 739.92 |proj g|= 0.065254
## At iterate 4 f = 736.73 |proj g|= 0.017474
## At iterate 5 f = 736.61 |proj g|= 0.0017564
## At iterate 6 f = 736.6 |proj g|= 0.00020501
## At iterate 7 f = 736.6 |proj g|= 2.0923e-06
## At iterate 8 f = 736.6 |proj g|= 2.461e-09
##
## iterations 8
## function evaluations 48
## segments explored during Cauchy searches 11
## BFGS updates skipped 1
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 2.46099e-09
## final function value 736.604
##
## F = 736.604
## final value 736.603976
## converged
summary(model2)
## Length Class Mode
## 1 km S4
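summary() on a km object only reports its S4 structure, so to compare the two fits directly the estimated parameters and maximised log-likelihoods can be inspected. A minimal sketch, assuming the coef() method and logLik slot provided by DiceKriging:
# Estimated trend coefficients, range (length-scale) parameters and process variance
coef(model1)
coef(model2)
# Maximised log-likelihoods of the two fits
c(model1 = model1@logLik, model2 = model2@logLik)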
2. Time series modelling
In this question we model the strength of the Atlantic Meridional Overturning Circulation (AMOC) ocean current, measured at 26.5°N. Data from the RAPID AMOC monitoring project, which is funded by the Natural Environment Research Council, are freely available from www.rapid.ac.uk/rapidmoc. The file AMOCdata.csv contains daily measurements of the AMOC strength between January 2010 and December 2020, measured in Sverdrups (Sv), where 1 Sv = 10^6 m^3 s^-1 (one million cubic metres per second).
#Import the data
library(readr)
AMOCdata <- read_csv("AMOCdata.csv")
# Load necessary libraries
library(tidyverse)
# Convert Date column to Date format
AMOCdata$Date <- as.Date(AMOCdata$Date, format = "%Y-%m-%d")
# Calculate quarterly means, ordered chronologically
quarterly_means <- AMOCdata %>%
group_by(Year, Quarter) %>%
summarize(mean_strength = mean(Strength)) %>%
arrange(Year, Quarter)
# Convert to time series object (quarterly data starting in 2010 Q1)
ts_data <- ts(quarterly_means$mean_strength, start = c(2010, 1), frequency = 4)
# Plot the data using ggplot
ggplot(quarterly_means, aes(x = as.factor(Year), y = mean_strength, group = Quarter)) +
geom_point() +
geom_line() +
facet_wrap(~ Quarter, nrow = 2) +
labs(x = "Year", y = "Quarterly Mean AMOC Strength",
title = "Quarterly Mean AMOC Strength (2004-2021)")+theme_bw()
# Load the necessary packages
library(forecast)
# Convert Date column to a date format
AMOCdata$Date <- as.Date(AMOCdata$Date)
# Create a quarterly mean dataset using dplyr, ordered chronologically
AMOC_quarterly <- AMOCdata %>%
group_by(Year, Quarter) %>%
summarize(mean_strength = mean(Strength)) %>%
arrange(Year, Quarter)
# Convert the quarterly data to a time series object
AMOC_ts <- ts(AMOC_quarterly$mean_strength, start = c(2010, 1), frequency = 4)
# Fit an ARIMA model
fit <- auto.arima(AMOC_ts)
# Print the model summary
summary(fit)
## Series: AMOC_ts
## ARIMA(2,1,0)
##
## Coefficients:
## ar1 ar2
## -0.8449 -0.5262
## s.e. 0.1509 0.1476
##
## sigma^2 = 4.293: log likelihood = -91.82
## AIC=189.64 AICc=190.26 BIC=194.93
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.4233391 2.00003 1.49681 1.767469 9.245776 0.7317868 0.004059784
# Forecast the next 4 quarters
forecast <- forecast(fit, h = 4)
# Print the forecast
print(forecast)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2021 Q1 18.14229 15.48703 20.79755 14.08142 22.20316
## 2021 Q2 17.61971 14.93268 20.30674 13.51025 21.72916
## 2021 Q3 18.69756 15.86058 21.53454 14.35877 23.03635
## 2021 Q4 18.06190 14.77010 21.35370 13.02752 23.09627
# Load the necessary packages
library(forecast)
library(dlm)
# Convert Date column to a date format
AMOCdata$Date <- as.Date(AMOCdata$Date)
# Create a quarterly mean dataset using dplyr, ordered chronologically
AMOC_quarterly <- AMOCdata %>%
group_by(Year, Quarter) %>%
summarize(mean_strength = mean(Strength)) %>%
arrange(Year, Quarter)
# Convert to a quarterly time series starting in 2010 Q1
AMOC_ts <- ts(AMOC_quarterly$mean_strength,
start = c(2010, 1),
frequency = 4)
# Fit DLM model
dlm_model <- dlmModPoly(order = 2)
fit_DLM <- dlmFilter(AMOC_ts, dlm_model)
summary(fit_DLM)
## Length Class Mode
## y 44 ts numeric
## mod 10 dlm list
## m 90 mts numeric
## U.C 45 -none- list
## D.C 90 -none- numeric
## a 88 mts numeric
## U.R 44 -none- list
## D.R 88 -none- numeric
## f 44 ts numeric
The fitted quarterly ARIMA(2,1,0) model has no constant term, so any trend or drift in the series is handled implicitly through the first differencing. The DLM, in contrast, represents the level and slope explicitly through a second-order polynomial (local linear trend) component whose state estimates are updated at each time step. The DLM is therefore more flexible for tracking a slowly changing trend, and can adapt to more intricate patterns, whereas the ARIMA model treats the differenced series as stationary.
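Rather than using the default variances in dlmModPoly(), the observation and system variances of the local linear trend DLM could be estimated by maximum likelihood before filtering. A minimal sketch, assuming the quarterly AMOC_ts series defined above:
library(dlm)
# Build a local linear trend (order-2 polynomial) DLM whose observation and
# system variances are parameterised on the log scale
build_lltrend <- function(par) {
  dlmModPoly(order = 2, dV = exp(par[1]), dW = exp(par[2:3]))
}
# Estimate the variances by maximum likelihood, then refit the Kalman filter
mle <- dlmMLE(AMOC_ts, parm = c(0, 0, 0), build = build_lltrend)
dlm_fitted <- build_lltrend(mle$par)
filtered <- dlmFilter(AMOC_ts, dlm_fitted)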
#Import the data
library(readr)
AMOCdata <- read_csv("AMOCdata.csv")
# Calculate monthly averages
AMOCdata$Date <- as.Date(AMOCdata$Date)
AMOCdata_monthly <- aggregate(AMOCdata$Strength, by = list(format(AMOCdata$Date, "%Y-%m")), mean)
names(AMOCdata_monthly) <- c("Date", "Strength")
AMOCdata_monthly$Date <- as.Date(paste0(AMOCdata_monthly$Date, "-01"))
# Check for stationarity and autocorrelation
library(ggplot2)
library(forecast)
library(tseries)
ggtsdisplay(diff(AMOCdata_monthly$Strength), lag.max = 12, main = "ACF/PACF of Monthly AMOC Strength (Differenced)")
adf.test(AMOCdata_monthly$Strength)
##
## Augmented Dickey-Fuller Test
##
## data: AMOCdata_monthly$Strength
## Dickey-Fuller = -15.306, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary
# Fit an ARIMA model
fit_arima <- auto.arima(AMOCdata_monthly$Strength, trace = TRUE)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : Inf
## ARIMA(0,0,0) with non-zero mean : 1463.81
## ARIMA(1,0,0) with non-zero mean : 1386.731
## ARIMA(0,0,1) with non-zero mean : 1398.24
## ARIMA(0,0,0) with zero mean : 3092.326
## ARIMA(2,0,0) with non-zero mean : 1389.441
## ARIMA(1,0,1) with non-zero mean : 1388.526
## ARIMA(2,0,1) with non-zero mean : 1391.462
## ARIMA(1,0,0) with zero mean : Inf
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(1,0,0) with non-zero mean : 1386.411
##
## Best model: ARIMA(1,0,0) with non-zero mean
summary(fit_arima)
## Series: AMOCdata_monthly$Strength
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.4414 16.3910
## s.e. 0.0468 0.1489
##
## sigma^2 = 2.556: log likelihood = -690.17
## AIC=1386.35 AICc=1386.41 BIC=1398.05
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.001831722 1.594377 1.316014 -0.965543 8.192178 0.8370753
## ACF1
## Training set -0.01234874
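A standard validation step for the fitted monthly ARIMA model is residual diagnostics, for example (assuming fit_arima from above):
# Residual ACF, histogram and Ljung-Box test for the fitted ARIMA(1,0,0) model
checkresiduals(fit_arima)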
#Fit a dlm Model
# Ensure the Date column is in Date format (it is already "%Y-%m-%d" in the file)
AMOCdata$Date <- as.Date(AMOCdata$Date)
# Calculate monthly averages
library(dplyr)
AMOC_monthly <- AMOCdata %>%
group_by(Year, Month) %>%
summarise(Monthly_Avg = mean(Strength))
# Create a time series object
library(zoo)
AMOC_ts <- zoo(AMOC_monthly$Monthly_Avg, order.by = as.yearmon(paste(AMOC_monthly$Year, AMOC_monthly$Month, "01", sep = "-")))
# Convert to a vector
AMOC_vec <- as.vector(coredata(AMOC_ts))
# Fit a DLM model
library(dlm)
dlm_AMOC <- dlmModPoly(order = 2, dV = 0.1)
dlm_fit <- dlmFilter(AMOC_vec, dlm_AMOC)
# Forecast the next 10 months
dlm_pred <- dlmForecast(dlm_fit, nAhead = 10)
# View the forecast
print(dlm_pred$f)
## [,1]
## [1,] 20.97722
## [2,] 21.08521
## [3,] 21.19320
## [4,] 21.30120
## [5,] 21.40919
## [6,] 21.51718
## [7,] 21.62517
## [8,] 21.73317
## [9,] 21.84116
## [10,] 21.94915
The monthly ARIMA and DLM forecasts are of a similar order, although the DLM forecasts trend steadily upwards while the ARIMA(1,0,0) forecasts revert towards the series mean, so the DLM projections look somewhat more erratic. The quarterly forecasts are much smoother and less variable, but correspondingly less responsive to rapid changes in the data; the comparison below puts the two sets of monthly forecasts side by side.
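A quick way to tabulate this comparison, assuming fit_arima and dlm_pred from the chunks above:
# 10-step-ahead monthly forecasts from both models, side by side
arima_fc <- forecast(fit_arima, h = 10)
data.frame(step = 1:10,
           ARIMA = as.numeric(arima_fc$mean),
           DLM = as.numeric(dlm_pred$f))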
3. A combination of both
This question considers modelling maximum daily temperature in California. You have 2 files:
• metadataCA.csv containing longitude, latitude, elevation and place name for 11 sites in California.
• MaxTempCalifornia.csv containing maximum daily temperatures in degrees Celsius for each site from Jan 1, 2012 to Dec 30, 2012.
There are fewer individual parts in this question, but note that more marks are available for b) and c), and you should expect to carry out all the usual stages of modelling, e.g. making clear which model you are fitting and to which data, which assumptions you are making, etc. You should also perform appropriate validation checks for each model.
library(readr)
#Import the MaxTempCalifornia data
MaxTempCalifornia <- read_csv("MaxTempCalifornia.csv")
#Import the metadataCA data
metadataCA <- read_csv("metadataCA.csv")
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
# Convert Date column to date format
MaxTempCalifornia$Date <- ymd(MaxTempCalifornia$Date)
# Melt the dataset to long format
MaxTempCalifornia_long <- pivot_longer(MaxTempCalifornia, cols = -Date, names_to = "City", values_to = "Max_Temp")
# Merge the MaxTempCalifornia and metadataCA datasets
MaxTempCalifornia_long <- inner_join(MaxTempCalifornia_long, metadataCA, by = c("City" = "Location"))
# Create spatial plot
ggplot(MaxTempCalifornia_long, aes(x = Long, y = Lat, fill = Max_Temp)) +
geom_point(size = 3, shape = 21) +
scale_fill_gradient(low = "blue", high = "red") +
labs(title = "Maximum Daily Temperature in California in 2012 (Spatial Plot)",
subtitle = "Color indicates maximum daily temperature (°F)",
x = "Longitude", y = "Latitude")+theme_bw()
# Create time series plot with facet_wrap
ggplot(MaxTempCalifornia_long, aes(x = Date, y = Max_Temp, color = City)) +
geom_line() +
labs(title = "Maximum Daily Temperature in California in 2012 (Time Series Plot)",
subtitle = "Color indicates city",
x = "Date", y = "Maximum Daily Temperature (°F)") +
facet_wrap(~City, nrow = 4, scales = "free_y") +
theme(legend.position = "bottom") +
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
(i) Temperatures are low in January 2012 and rise steadily until around April; they then remain at their higher spring/summer levels until about October, when they drop back towards the values observed in early January.
library(sp)
library(gstat)
# Read in data
MaxTempCalifornia <- read.csv("MaxTempCalifornia.csv")
metadataCA <- read.csv("metadataCA.csv")
# Subset data to include only Napa and Death Valley on November 13th 2012
subset_data <- MaxTempCalifornia[, c(1, 3, 7)]
subset_data <- subset_data[subset_data$Date == "2012-11-13", ]
# Merge the subset of temperature data with the metadata
# (note: merging on the "Napa" temperature column rather than the station name returns no rows; the join key should be the location name, as sketched below)
subset_data1 <- merge(subset_data, metadataCA, by.x = "Napa", by.y = "Location")
subset_data1
## [1] Napa         Date         Death.Valley Long         Lat
## [6] Elev
## <0 rows> (or 0-length row.names)
# Define a Gaussian model
gaussian_model <- gstat(formula = Temperature ~ 1, data = subset_data, model = vgm(psill = 15, model = "Gau", range = 0.1, nugget = 0.1))
# Print the model summary
summary(gaussian_model)
## Length Class Mode
## data 1 -none- list
## model 1 -none- list
## call 4 -none- call
# Predict the temperature at a new location
new_location <- data.frame(Longitude = -118.2437, Latitude = 36.5323)
coordinates(new_location) <- ~Longitude + Latitude
new_location
## SpatialPoints:
## Longitude Latitude
## 1 -118.2437 36.5323
## Coordinate Reference System (CRS) arguments: NA
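The merge above returns zero rows because the join key is a temperature column rather than the station name. A minimal sketch of how the 13 November data could instead be assembled and used to krige at the new location, assuming the column names shown earlier (Location/Long/Lat in metadataCA), that read.csv has replaced spaces in station names with dots, and using placeholder variogram parameters rather than fitted values:
library(tidyr)
# Reshape the 13 November 2012 temperatures to long format (one row per station)
temps_1113 <- MaxTempCalifornia[MaxTempCalifornia$Date == "2012-11-13", ]
temps_long <- pivot_longer(temps_1113, cols = -Date, names_to = "Location", values_to = "Temperature")
# read.csv turns spaces in station names into dots; undo this so the names match metadataCA$Location
temps_long$Location <- gsub(".", " ", temps_long$Location, fixed = TRUE)
# Join the temperatures to the station metadata by station name
spatial_1113 <- merge(temps_long, metadataCA, by = "Location")
# Convert to a spatial object and krige at the new location defined above
coordinates(spatial_1113) <- c("Long", "Lat")
placeholder_vgm <- vgm(psill = 15, model = "Gau", range = 1, nugget = 0.1)
pred_new <- krige(Temperature ~ 1, spatial_1113, new_location, model = placeholder_vgm)
pred_new$var1.pred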
# Subset the temperature data for Napa and Death Valley
napa_temp <- MaxTempCalifornia$Napa
death_valley_temp <- MaxTempCalifornia$Death.Valley
# Create daily time series objects for 2012
napa_temp_ts <- ts(napa_temp, start = c(2012, 1), frequency = 365)
death_valley_temp_ts <- ts(death_valley_temp, start = c(2012, 1), frequency = 365)
# Fit an ARIMA model to the Napa temperature time series
napa_temp_arima <- arima(napa_temp_ts, order = c(1,1,1))
# Fit an ARIMA model to the Death Valley temperature time series
death_valley_temp_arima <- arima(death_valley_temp_ts, order = c(1,1,1))
# Forecast 5 days ahead from the end of each fitted series (note: to target 9-13 November the models would need to be fitted only to data up to 8 November; see the sketch below)
napa_temp_fcst1 <- forecast(napa_temp_arima, h = 5)
napa_temp_fcst1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1962.0000 14.15709 10.314132 18.00004 8.279794 20.03438
## 1962.0027 14.01086 9.381413 18.64031 6.930729 21.09100
## 1962.0055 13.92284 8.947770 18.89792 6.314126 21.53156
## 1962.0082 13.86986 8.710974 19.02874 5.980026 21.75969
## 1962.0110 13.83796 8.565260 19.11067 5.774060 21.90187
death_valley_temp_fcst1 <- forecast(death_valley_temp_arima, h = 5)
death_valley_temp_fcst1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1962.0000 16.57453 13.104727 20.04433 11.267925 21.88113
## 1962.0027 16.64780 11.486329 21.80927 8.754010 24.54159
## 1962.0055 16.60501 10.302943 22.90708 6.966829 26.24320
## 1962.0082 16.63000 9.303864 23.95614 5.425641 27.83436
## 1962.0110 16.61541 8.423305 24.80751 4.086667 29.14415
# Extract the 5th forecast value (intended to correspond to 13 November) from each model's output
forecast_napa_13th <- napa_temp_fcst1$mean[5]
forecast_deathvalley_13th <- death_valley_temp_fcst1$mean[5]
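To actually target 9-13 November, the models could be refitted using only the data up to 8 November and the forecasts compared with the held-out observations. A minimal sketch, assuming the daily rows of MaxTempCalifornia are in date order (2012 is a leap year, so 8 November is day 313 and 9-13 November are days 314-318):
# Hold out 9-13 November 2012 and fit to the data before it
napa_train_ts <- ts(napa_temp[1:313], start = c(2012, 1), frequency = 365)
napa_holdout <- napa_temp[314:318]
napa_fit_holdout <- arima(napa_train_ts, order = c(1, 1, 1))
napa_fc_holdout <- forecast(napa_fit_holdout, h = 5)
# Out-of-sample accuracy for 9-13 November
accuracy(napa_fc_holdout, napa_holdout)
# The same steps can be repeated for Death Valley using death_valley_temp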
# Calculate accuracy metrics for Napa model
Napa_accuracy <- accuracy(napa_temp_fcst1)
print("Napa model accuracy:")
## [1] "Napa model accuracy:"
print(Napa_accuracy)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.04853043 2.994563 2.358193 -2.263218 11.75444 NaN 0.1242719
# Calculate accuracy metrics for DeathValley model
DeathValley_accuracy <- accuracy(death_valley_temp_fcst1)
print("DeathValley model accuracy:")
## [1] "DeathValley model accuracy:"
print(DeathValley_accuracy)
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -0.01019214 2.70379 2.00063 -0.4804107 6.683989 NaN -0.04709208
Based on the training-set accuracy measures above, the Death Valley model has the better estimated performance: its MAE and MAPE (the average absolute and percentage prediction errors) are lower than those of the Napa model, suggesting its fitted values are closer to the observed temperatures. Note, however, that these are in-sample measures; a fairer comparison would evaluate the forecasts against held-out observations, as in the sketch above.
To improve the predictions, we could try alternative modelling techniques, such as exponential smoothing or neural networks, incorporate additional covariates that may be related to maximum temperature (for example humidity, wind speed or cloud cover), and experiment with different model orders and hyperparameters to see whether they improve forecast accuracy.