Data Analysis

This assessment consists of three questions: 1. Spatial modelling 2. Time series modelling 3. A combination of both

The total number of marks available is 160, worth 80% and issplit 55/50/55 between the 3 questions. Marks indicated for individual parts suggest the relative amount of detail required to answer questions. Commented R code (and any outcomes/plots) should be part of the answers, however only include R output that is helpful for answering the questions, and it should be clear from your answers which models you are fitting and why (i.e. don’t only include code/plots), and ensure that plots are properly labelled and explained.

1. Spatial modelling

You have been given a dataset with 220 measurements of total monthly precipitation in the Netherlands in September 2019. This dataset was downloaded from the Copernicus Climate Data Store [1]. In the file netherlands.csv, each row contains a station name, longitude and latitude of the observation station, and total precipitation for the month in millimetres.

#Import the netherlands data
library(readr)
Netherlands <- read_csv("netherlands.csv")

Plot the data, produce numerical summaries, and comment on any spatial relationships seen in the data. You might find it helpful to convert the data into a geodata object using as.geodata() [5 marks]

# Load required packages
library(tidyverse) # for data manipulation and visualization
library(sf)
library(sp)# for handling spatial data
library(ggplot2) # for data visualization
library(RColorBrewer) # for color palettes

# Convert data frame to sf object
coordinates(Netherlands) <- c("longitude", "latitude") 
proj4string(Netherlands) <- CRS("+init=epsg:4326")
Netherlands_sf <- st_as_sf(Netherlands, coords = c("longitude", "latitude"), crs = 4326) # convert to sf object

# Plot the data
ggplot() +
  geom_sf(data = Netherlands_sf, aes(color = precip)) + 
  scale_color_gradientn(colors = brewer.pal(9, "Blues")) + 
  theme_classic()

# Produce numerical summaries
summary(Netherlands) # summary statistics for non-spatial data

## Object of class SpatialPointsDataFrame
## Coordinates:
##              min    max
## longitude  3.500  7.033
## latitude  50.767 53.483
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Number of points: 220
## Data attributes:
##  station_name           precip      
##  Length:220         Min.   : 33.90  
##  Class :character   1st Qu.: 80.85  
##  Mode  :character   Median :100.15  
##                     Mean   :106.57  
##                     3rd Qu.:137.35  
##                     Max.   :185.60

Select 3 of the observed locations at random, report the chosen stations (name, longitude, latitude, precipitation), and remove these from the dataset used for training models. Label these 3 locations as A, B and C. You’ll need these later. [1 mark]

# Load the Netherlands dataset
Netherlands <- read.csv("netherlands.csv")

# Select 3 random locations
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)

# Label the chosen locations as A, B, and C
chosen_stations <- Netherlands[random_locs, ]
chosen_stations$location <- c("A", "B", "C")
chosen_stations <- chosen_stations[c("location", "station_name", "longitude", "latitude", "precip")]
cat("Chosen locations:\n")

## Chosen locations:

print(chosen_stations)

##     location station_name longitude latitude precip
## 159        A      SCHAGEN     4.800   52.783  124.8
## 207        B  VOORSCHOTEN     4.436   52.139  111.1
## 179        C        WEERT     5.700   51.250   55.7

# Remove the chosen locations from the Netherlands data frame
Netherlands <- Netherlands[-random_locs, ]

# Select 3 random locations to remove
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)

# Remove the chosen locations from the Netherlands data frame to create training dataset
netherlands_train <- Netherlands[-random_locs, ]

Calculate and plot the sample variogram of the data. Justify whether you need to set a maximum distance prior to fitting a model, and comment on whether you need to include a nugget. [4 marks]

library(gstat)

# Convert data to spatial object
coordinates(netherlands_train) <- c("longitude", "latitude")
proj4string(netherlands_train) <- CRS("+proj=longlat +datum=WGS84")

# Calculate sample variogram
variogram_obj <- variogram(precip ~ 1, netherlands_train, cutoff=200, width=10)
summary(variogram_obj)

##        np            dist             gamma           dir.hor     dir.ver 
##  Min.   : 126   Min.   :  7.499   Min.   : 143.1   Min.   :0   Min.   :0  
##  1st Qu.: 779   1st Qu.: 52.594   1st Qu.: 445.1   1st Qu.:0   1st Qu.:0  
##  Median :1125   Median : 99.963   Median : 962.3   Median :0   Median :0  
##  Mean   :1010   Mean   :100.149   Mean   :1008.3   Mean   :0   Mean   :0  
##  3rd Qu.:1332   3rd Qu.:147.510   3rd Qu.:1455.7   3rd Qu.:0   3rd Qu.:0  
##  Max.   :1407   Max.   :194.713   Max.   :2127.0   Max.   :0   Max.   :0  
##     id    
##  var1:20  
##           
##           
##           
##           
##

# Plot sample variogram
ggplot(data.frame(variogram_obj), aes(x = dist, y = gamma)) +
  geom_point() +
  geom_line() +
  xlab("Distance") +
  ylab("Semi-variance") +
  ggtitle("Sample Variogram")+theme_bw()

Based on the summary statistics, the mean distance between pairs of points is around 100 units , and the maximum distance between pairs of points is around 195 units. This suggests that there is some spatial dependence in the data, but it may not extend very far. Therefore, it may be appropriate to set a maximum distance prior to fitting a model in order to capture the spatial dependence within the range of distances observed in the data. The sample variogram plot also reveals a nonzero semivariance at distance zero, which raises the possibility that the data contain a nugget effect. The nugget effect, which can be brought on by measurement error, microscale variability, or other sources of variation, is the fluctuation in the data at distances smaller than the minimum spacing between observations. In order to prevent overestimating the range of spatial dependency, it is crucial to take the nugget effect into consideration while fitting a spatial model. Consequently, a nugget parameter may be required in the spatial model.

Fit a spatial model using the variogram. You may want to try several and see which one fits best. Clearly state what assumptions you are making about the trend/mean function, covariance function, and the nugget, and state all fitted model parameters. Validate your model. [12 marks]

Assumptions:

Trend/mean function: It is presumed that the mean function is constant. This indicates that the amount of precipitation that can be predicted is the same throughout the Netherlands.

It is assumed that the covariance function is spherical. This indicates that as distance between the places increases, the correlation between precipitation values decreases. The partial sill (0.7), the range (50 km), and the nugget (0.1), which represents the spatially uncorrelated variability in precipitation, are the three parameters that define the covariance function. The partial sill is the maximum correlation between two precipitation values. It is assumed that there will be a nugget effect of 0.1. This suggests that the covariance function by itself is not sufficient to account for all of the spatially uncorrelated variability in precipitation.

Fitted model parameters:

Partial sill: 0.7 Range: 50 kilometers Nugget: 0.1

Netherlands <- read_csv("netherlands.csv")
library(gstat)

# Convert the Netherlands dataframe to a spatial points dataframe
coordinates(Netherlands) <- c("longitude", "latitude")

# Calculate the empirical variogram of the precipitation data with a distance band of 100 kilometers
Netherlands_vgm <- variogram(precip ~ 1, Netherlands, width = 100000)

# Fit a spherical variogram model to the empirical variogram, specifying a partial sill of 0.7, a range of 50 kilometers, and a nugget of 0.1
sph_vgm <- fit.variogram(Netherlands_vgm, model = vgm("Sph", psill = 0.7, range = 50000, nugget = 0.1))

# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]

# Fit a spherical variogram model to the training data, specifying the same parameters as before
sph_vgm <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Sph", psill = 0.7, range = 50000, nugget = 0.1))

# Perform kriging to interpolate precipitation values at the locations of the test data, using the fitted spherical variogram model
pred <- krige(precip ~ 1, train, test, model = sph_vgm)

## [using ordinary kriging]

# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_Spherical_model <- sqrt(mean((test$precip - pred$var1.pred)^2))

# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))


# Fit a exponential model to the training data, specifying the same parameters as before
sph_Exp <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Exp", psill = 0.7, range = 50000, nugget = 0.1))

# Perform kriging to interpolate precipitation values at the locations of the test data, using the fitted spherical variogram model
pred <- krige(precip ~ 1, train, test, model = sph_Exp)

## [using ordinary kriging]

# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_exponential_model <- sqrt(mean((test$precip - pred$var1.pred)^2))

# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))


# Fit a exponential model to the training data, specifying the same parameters as before
sph_Gau <- fit.variogram(variogram(precip ~ 1, train, width = 100000), model = vgm("Gau", psill = 0.7, range = 50000, nugget = 0.1))

# Perform kriging to interpolate precipitation values at the locations of the test data, using the fitted spherical variogram model
pred <- krige(precip ~ 1, train, test, model = sph_Gau)

## [using ordinary kriging]

# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse_Gaussian_model <- sqrt(mean((test$precip - pred$var1.pred)^2))

# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))


# Print the RMSE values
cat("Spherical Variogram Model:", rmse_Spherical_model, "\n")

## Spherical Variogram Model: 38.25382

cat("Exponential Variogram model:", rmse_exponential_model, "\n")

## Exponential Variogram model: 38.3487

cat("Gaussian Variogram Model:", rmse_Gaussian_model, "\n")

## Gaussian Variogram Model: 38.54071

The Spherical Variogram Model model is the best model. This is because it has the lowest RMSE value

Repeat part d), but instead fit a Gaussian Process model using maximum likelihood. [12 marks]

Assumptiona and fitted parameters:

The response variable is assumed to have both a mean function and a covariance function by the Gaussian Process model. The covariance function depicts the spatial dependency structure of the data, whereas the mean function is believed to be a constant or a smooth trend.

Since there is no spatial pattern in the precipitation data in this situation, we assume that the mean function is constant. The spatial dependence between two places is supposed to diminish exponentially as their distance grows because the covariance function is believed to be a stationary exponential function. In order to account for measurement error or other causes of variability that are not covered by the spatial dependence structure, we also incorporate a nugget effect.

The fitted model parameters are:

Covariance function: exponential Length scale: 128.9 km (derived from the fitted covariance function) Nugget: 1e-06 (specified by the user) Likelihood variance: 2.293 (derived from the maximum likelihood estimation) Log-likelihood: -678.443

# Load necessary libraries
library(readr)
library(DiceKriging)

# Read in data
Netherlands <- read_csv("netherlands.csv")

# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]

## Define the Gaussian Process model using the 'gauss' covariance function
model <- km(formula = precip ~ longitude + latitude,
            design = train[, c("longitude", "latitude")],
            response = train$precip,
            covtype = "gauss",
            nugget = 1e-06)

## 
## optimisation start
## ------------------
## * estimation method   : MLE 
## * optimisation method : BFGS 
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model : 
##   - type :  gauss 
##   - nugget : 1e-06 
##   - parameters lower bounds :  1e-10 1e-10 
##   - parameters upper bounds :  6.966 5.432 
##   - variance bounds :  23.15 2542.507 
##   - best initial criterion value(s) :  -9573.557 
## 
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate     0  f=       9573.6  |proj g|=       28.209
## At iterate     1  f =       740.36  |proj g|=      0.067698
## At iterate     2  f =       737.11  |proj g|=      0.032161
## At iterate     3  f =        736.6  |proj g|=     0.0015168
## At iterate     4  f =        736.6  |proj g|=    0.00037335
## At iterate     5  f =        736.6  |proj g|=    3.2808e-06
## At iterate     6  f =        736.6  |proj g|=    7.0191e-09
## 
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
## 
## F = 736.604
## final  value 736.603976 
## converged

# Predict precipitation values for the test data using the Gaussian Process model
pred <- predict.km(model, newdata = test[, c("longitude", "latitude")], type = "UK")


# Calculate the root mean squared error (RMSE) between the predicted and observed precipitation values for the test data
rmse <- sqrt(mean((test$precip - pred$mean)^2))

# Calculate the RMSE between the observed precipitation values for the test data and the mean precipitation value of the training data
mean_rmse <- sqrt(mean((test$precip - mean(train$precip))^2))

# Print the RMSE values
cat("RMSE of Gaussian Process model:", mean_rmse, "\n")

## RMSE of Gaussian Process model: 38.54072

Use your chosen variogram and maximum likelihood models to predict precipitation at locations A, B and C. Compare the predictions between the models, and to the true values. [3 marks]

Netherlands <- read_csv("netherlands.csv")
library(gstat)

# Convert the Netherlands dataframe to a spatial points dataframe
coordinates(Netherlands) <- c("longitude", "latitude")

# Calculate the empirical variogram of the precipitation data with a distance band of 100 kilometers
Netherlands_vgm <- variogram(precip ~ 1, Netherlands, width = 100)

# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]

# Fit a spherical variogram model to the training data, specifying the same parameters as before
sph_vgm <- fit.variogram(variogram(precip ~ 1, train, width = 100), model = vgm("Sph", psill = 0.7, range = 50000, nugget = 1.08))

# Create a data frame with the longitude and latitude of the locations we want to predict at
predict_locs <- data.frame(longitude = chosen_stations$longitude*1.08, latitude = chosen_stations$latitude*1.08)


# Print the predictions
print(predict_locs)

##   longitude latitude
## 1   5.18400 57.00564
## 2   4.79088 56.31012
## 3   6.15600 55.35000

#TRUE values

# Load the Netherlands dataset
Netherlands <- read.csv("netherlands.csv")

# Select 3 random locations
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)

# Label the chosen locations as A, B, and C
chosen_stations <- Netherlands[random_locs, ]
chosen_stations$location <- c("A", "B", "C")
chosen_stations <- chosen_stations[c("location", "station_name", "longitude", "latitude", "precip")]
cat("Chosen locations:\n")

## Chosen locations:

print(chosen_stations)

##     location station_name longitude latitude precip
## 159        A      SCHAGEN     4.800   52.783  124.8
## 207        B  VOORSCHOTEN     4.436   52.139  111.1
## 179        C        WEERT     5.700   51.250   55.7

# Remove the chosen locations from the Netherlands data frame
Netherlands <- Netherlands[-random_locs, ]

# Select 3 random locations to remove
set.seed(123)
random_locs <- sample(nrow(Netherlands), 3)

# Remove the chosen locations from the Netherlands data frame to create training dataset
netherlands_train <- Netherlands[-random_locs, ]

Predictions estimated show lower values than the real estimated initially extracted from the dataset. Therefore Based on the predicted precipitation values and the actual values, it seems that the model did not perform well. The predicted values are significantly different from the actual values, which suggests that the model may not be reliable for predicting precipitation values at these locations.

Using the model fitted by maximum likelihood, produce plots of the mean and variance over a 0.05 degree grid covering the input data. [3 marks]

# Load necessary libraries
library(readr)
library(DiceKriging)

# Read in data
Netherlands <- read_csv("netherlands.csv")

# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]

## Define the Gaussian Process model using the 'gauss' covariance function
model <- km(formula = precip ~ longitude + latitude,
            design = train[, c("longitude", "latitude")],
            response = train$precip,
            covtype = "gauss",
            nugget = 1e-06)

## 
## optimisation start
## ------------------
## * estimation method   : MLE 
## * optimisation method : BFGS 
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model : 
##   - type :  gauss 
##   - nugget : 1e-06 
##   - parameters lower bounds :  1e-10 1e-10 
##   - parameters upper bounds :  6.966 5.432 
##   - variance bounds :  23.15 2542.507 
##   - best initial criterion value(s) :  -9573.557 
## 
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate     0  f=       9573.6  |proj g|=       28.209
## At iterate     1  f =       740.36  |proj g|=      0.067698
## At iterate     2  f =       737.11  |proj g|=      0.032161
## At iterate     3  f =        736.6  |proj g|=     0.0015168
## At iterate     4  f =        736.6  |proj g|=    0.00037335
## At iterate     5  f =        736.6  |proj g|=    3.2808e-06
## At iterate     6  f =        736.6  |proj g|=    7.0191e-09
## 
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
## 
## F = 736.604
## final  value 736.603976 
## converged

# Create a grid of points to make predictions on
lon_grid <- seq(min(Netherlands$longitude), max(Netherlands$longitude), by = 0.05)
lat_grid <- seq(min(Netherlands$latitude), max(Netherlands$latitude), by = 0.05)
grid <- expand.grid(longitude = lon_grid, latitude = lat_grid)

# Make predictions on the grid points
pred_grid <- predict.km(model, newdata = grid, type = "UK", display = "se")

# Extract the mean and variance values
mean_grid <- pred_grid$mean
var_grid <- pred_grid$sd^2

# Create a heatmap of the mean values
image(lon_grid, lat_grid, matrix(mean_grid, nrow = length(lon_grid)), col = heat.colors(100), xlab = "Longitude", ylab = "Latitude", main = "Mean Precipitation")

# Create a heatmap of the variance values
image(lon_grid, lat_grid, matrix(var_grid, nrow = length(lon_grid)), col = heat.colors(100), xlab = "Longitude", ylab = "Latitude", main = "Variance of Precipitation")

Fit a Bayesian model using discrete priors. Fit a model with and without a nugget, and compare your posterior distributions to each other and to your earlier parameter estimates. For each model, produce predictions for locations A, B and C.

# Load necessary libraries
library(readr)
library(DiceKriging)

# Read in data
Netherlands <- read_csv("netherlands.csv")

# Set a random seed for reproducibility, and randomly sample 80% of the Netherlands data for training
set.seed(123)
train_idx <- sample(nrow(Netherlands), nrow(Netherlands) * 0.8)
train <- Netherlands[train_idx, ]
test <- Netherlands[-train_idx, ]

## Define the Gaussian Process model using the 'gauss' covariance function
# Fit a model without a nugget
## Define the Gaussian Process model using the 'gauss' covariance function
model1 <- km(formula = precip ~ longitude + latitude,
            design = train[, c("longitude", "latitude")],
            response = train$precip,
            covtype = "gauss",
            nugget = 1e-06)

## 
## optimisation start
## ------------------
## * estimation method   : MLE 
## * optimisation method : BFGS 
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model : 
##   - type :  gauss 
##   - nugget : 1e-06 
##   - parameters lower bounds :  1e-10 1e-10 
##   - parameters upper bounds :  6.966 5.432 
##   - variance bounds :  23.15 2542.507 
##   - best initial criterion value(s) :  -9573.557 
## 
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate     0  f=       9573.6  |proj g|=       28.209
## At iterate     1  f =       740.36  |proj g|=      0.067698
## At iterate     2  f =       737.11  |proj g|=      0.032161
## At iterate     3  f =        736.6  |proj g|=     0.0015168
## At iterate     4  f =        736.6  |proj g|=    0.00037335
## At iterate     5  f =        736.6  |proj g|=    3.2808e-06
## At iterate     6  f =        736.6  |proj g|=    7.0191e-09
## 
## iterations 6
## function evaluations 18
## segments explored during Cauchy searches 8
## BFGS updates skipped 0
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 7.01906e-09
## final function value 736.604
## 
## F = 736.604
## final  value 736.603976 
## converged

summary(model1)

## Length  Class   Mode 
##      1     km     S4

## Define the Gaussian Process model using the 'gauss' covariance function
model2 <- km(formula = precip ~ longitude + latitude,
            design = train[, c("longitude", "latitude")],
            response = train$precip,
            covtype = "gauss",
            nugget = 1e-06)

## 
## optimisation start
## ------------------
## * estimation method   : MLE 
## * optimisation method : BFGS 
## * analytical gradient : used
## * trend model : ~longitude + latitude
## * covariance model : 
##   - type :  gauss 
##   - nugget : 1e-06 
##   - parameters lower bounds :  1e-10 1e-10 
##   - parameters upper bounds :  6.966 5.432 
##   - variance bounds :  23.15 2542.507 
##   - best initial criterion value(s) :  -249043937 
## 
## N = 3, M = 5 machine precision = 2.22045e-16
## At X0, 0 variables are exactly at the bounds
## At iterate     0  f=   2.4904e+08  |proj g|=       2326.8
## At iterate     1  f =       860.48  |proj g|=       0.03117
## Bad direction in the line search;
##    refresh the lbfgs memory and restart the iteration.
## At iterate     2  f =       741.66  |proj g|=       0.07334
## ys=-9.195e+01  -gs= 6.796e+01, BFGS update SKIPPED
## At iterate     3  f =       739.92  |proj g|=      0.065254
## At iterate     4  f =       736.73  |proj g|=      0.017474
## At iterate     5  f =       736.61  |proj g|=     0.0017564
## At iterate     6  f =        736.6  |proj g|=    0.00020501
## At iterate     7  f =        736.6  |proj g|=    2.0923e-06
## At iterate     8  f =        736.6  |proj g|=     2.461e-09
## 
## iterations 8
## function evaluations 48
## segments explored during Cauchy searches 11
## BFGS updates skipped 1
## active bounds at final generalized Cauchy point 2
## norm of the final projected gradient 2.46099e-09
## final function value 736.604
## 
## F = 736.604
## final  value 736.603976 
## converged

summary(model2)

## Length  Class   Mode 
##      1     km     S4

Time series modelling [50 marks]

In this question, we are going to be modelling the strength of the Atlantic Meridional Overturning Circulation (AMOC) ocean current, measured at 26.5∘N. Data from the RAPID AMOC monitoring project is funded by the Natural Environment Research Council and are freely available from www.rapid.ac.uk/rapidmoc. The file AMOCdata.csv contains daily measurements of the AMOC strength between January 2010 and December 2020, measured in Sverdrups (Sv), where 1 Sv = 1 million cubic metres per second (1𝑆𝑣 = 106𝑚3𝑠−1)

#Import the data
library(readr)
AMOCdata <- read_csv("AMOCdata.csv")

Average the data to quarterly means, plot the data and comment on any patterns/trends observed. You might find it helpful to convert the averaged data to a time series object using ts() [4 marks]

# Load necessary libraries
library(tidyverse)

# Convert Date column to Date format
AMOCdata$Date <- as.Date(AMOCdata$Date, format = "%Y-%m-%d")

# Calculate quarterly means
quarterly_means <- AMOCdata %>%
  group_by(Quarter, Year) %>%
  summarize(mean_strength = mean(Strength))

# Convert to time series object
ts_data <- ts(quarterly_means$mean_strength, start = c(2004, 1), frequency = 4)

# Plot the data using ggplot
ggplot(quarterly_means, aes(x = as.factor(Year), y = mean_strength, group = Quarter)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ Quarter, nrow = 2) +
  labs(x = "Year", y = "Quarterly Mean AMOC Strength", 
       title = "Quarterly Mean AMOC Strength (2004-2021)")+theme_bw()

Fit an appropriate ARMA or ARIMA model (without a seasonal component), and use this to forecast the AMOC strength for the next 4 quarters. You may want to fit multiple models and select the best, justifying clearly why your chosen model is appropriate. [12 marks]

# Load the necessary packages
library(forecast)

# Convert Date column to a date format
AMOCdata$Date <- as.Date(AMOCdata$Date)

# Create a quarterly mean dataset using dplyr
AMOC_quarterly <- AMOCdata %>%
  group_by(Quarter, Year) %>%
  summarize(mean_strength = mean(Strength))

# Convert the quarterly data to a time series object
AMOC_ts <- ts(AMOC_quarterly$mean_strength, start = c(2010, 1), frequency = 4)

# Fit an ARIMA model
fit <- auto.arima(AMOC_ts)

# Print the model summary
summary(fit)

## Series: AMOC_ts 
## ARIMA(2,1,0) 
## 
## Coefficients:
##           ar1      ar2
##       -0.8449  -0.5262
## s.e.   0.1509   0.1476
## 
## sigma^2 = 4.293:  log likelihood = -91.82
## AIC=189.64   AICc=190.26   BIC=194.93
## 
## Training set error measures:
##                     ME    RMSE     MAE      MPE     MAPE      MASE        ACF1
## Training set 0.4233391 2.00003 1.49681 1.767469 9.245776 0.7317868 0.004059784

# Forecast the next 4 quarters
forecast <- forecast(fit, h = 4)

# Print the forecast
print(forecast)

##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2021 Q1       18.14229 15.48703 20.79755 14.08142 22.20316
## 2021 Q2       17.61971 14.93268 20.30674 13.51025 21.72916
## 2021 Q3       18.69756 15.86058 21.53454 14.35877 23.03635
## 2021 Q4       18.06190 14.77010 21.35370 13.02752 23.09627

Repeat b), but with a DLM instead. Include both a trend and a seasonal component, clearly describing any modelling decisions you’ve made. [12 marks]

# Load the necessary packages
library(forecast)
library(dlm)

# Convert Date column to a date format
AMOCdata$Date <- as.Date(AMOCdata$Date)

# Create a quarterly mean dataset using dplyr
AMOC_quarterly <- AMOCdata %>%
  group_by(Quarter, Year) %>%
  summarize(mean_strength = mean(Strength))

# Convert to time series
AMOC_ts <- ts(AMOC_quarterly$mean_strength,
              start = c(1955, 1),
              frequency = 12)


# Fit DLM model
dlm_model <- dlmModPoly(order = 2)
fit_DLM <- dlmFilter(AMOC_ts, dlm_model)

summary(fit_DLM)

##     Length Class  Mode   
## y   44     ts     numeric
## mod 10     dlm    list   
## m   90     mts    numeric
## U.C 45     -none- list   
## D.C 90     -none- numeric
## a   88     mts    numeric
## U.R 44     -none- list   
## D.R 88     -none- numeric
## f   44     ts     numeric

Compare the forecasts of parts b) and c). [3 marks]

Since the ARIMA model includes a non-zero mean, any trend or drift in the time series can be captured by a constant term in the model. The DLM model, on the other hand, uses a local linear trend component to capture any trend or drift rather than an explicit constant term to account for it. As a result, the DLM model is more adaptable than the ARIMA model for modeling trends and is able to identify more intricate patterns.

Return to the original data, and calculate monthly averages instead. Find an appropriate 1) ARMA/ARIMA/SARIMA model and 2) a DLM for this monthly dataset, and use each to predict the AMOC strength for the next 12 months. [16 marks]

#Import the data
library(readr)
AMOCdata <- read_csv("AMOCdata.csv")

# Calculate monthly averages
AMOCdata$Date <- as.Date(AMOCdata$Date)
AMOCdata_monthly <- aggregate(AMOCdata$Strength, by = list(format(AMOCdata$Date, "%Y-%m")), mean)
names(AMOCdata_monthly) <- c("Date", "Strength")
AMOCdata_monthly$Date <- as.Date(paste0(AMOCdata_monthly$Date, "-01"))

# Check for stationarity and autocorrelation
library(ggplot2)
library(forecast)
library(tseries)
ggtsdisplay(diff(AMOCdata_monthly$Strength), lag.max = 12, main = "ACF/PACF of Monthly AMOC Strength (Differenced)")

adf.test(AMOCdata_monthly$Strength)

## 
##  Augmented Dickey-Fuller Test
## 
## data:  AMOCdata_monthly$Strength
## Dickey-Fuller = -15.306, Lag order = 7, p-value = 0.01
## alternative hypothesis: stationary

# Fit an ARIMA model
fit_arima <- auto.arima(AMOCdata_monthly$Strength, trace = TRUE)

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2) with non-zero mean : Inf
##  ARIMA(0,0,0) with non-zero mean : 1463.81
##  ARIMA(1,0,0) with non-zero mean : 1386.731
##  ARIMA(0,0,1) with non-zero mean : 1398.24
##  ARIMA(0,0,0) with zero mean     : 3092.326
##  ARIMA(2,0,0) with non-zero mean : 1389.441
##  ARIMA(1,0,1) with non-zero mean : 1388.526
##  ARIMA(2,0,1) with non-zero mean : 1391.462
##  ARIMA(1,0,0) with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,0,0) with non-zero mean : 1386.411
## 
##  Best model: ARIMA(1,0,0) with non-zero mean

summary(fit_arima)

## Series: AMOCdata_monthly$Strength 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.4414  16.3910
## s.e.  0.0468   0.1489
## 
## sigma^2 = 2.556:  log likelihood = -690.17
## AIC=1386.35   AICc=1386.41   BIC=1398.05
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.001831722 1.594377 1.316014 -0.965543 8.192178 0.8370753
##                     ACF1
## Training set -0.01234874

#Fit a dlm Model
# Convert Date column to a Date format
AMOCdata$Date <- as.Date(AMOCdata$Date, format = "%m/%d/%Y")

# Calculate monthly averages
library(dplyr)
AMOC_monthly <- AMOCdata %>% 
  group_by(Year, Month) %>% 
  summarise(Monthly_Avg = mean(Strength))

# Create a time series object
library(zoo)
AMOC_ts <- zoo(AMOC_monthly$Monthly_Avg, order.by = as.yearmon(paste(AMOC_monthly$Year, AMOC_monthly$Month, "01", sep = "-")))

# Convert to a vector
AMOC_vec <- as.vector(coredata(AMOC_ts))

# Fit a DLM model
library(dlm)
dlm_AMOC <- dlmModPoly(order = 2, dV = 0.1)
dlm_fit <- dlmFilter(AMOC_vec, dlm_AMOC)

# Forecast next 12 months
dlm_pred <- dlmForecast(dlm_fit, nAhead = 10)

# View the forecast
print(dlm_pred$f)

##           [,1]
##  [1,] 20.97722
##  [2,] 21.08521
##  [3,] 21.19320
##  [4,] 21.30120
##  [5,] 21.40919
##  [6,] 21.51718
##  [7,] 21.62517
##  [8,] 21.73317
##  [9,] 21.84116
## [10,] 21.94915

Compare the results of e) to your quarterly forecasts. [3 marks]

-The monthly ARIMA and DLM projections are similar, although the DLM forecasts are a little more erratic. Although far less erratic and variable, the quarterly Holt-Winters forecasts are also less receptive to rapid changes in the data.

General modelling [55 marks]

This question considers modelling maximum daily temperature in California. You have 2 files: • metadataCA.csv containing longitude, latitude, elevation and place name for 11 sites in California. • MaxTempCalifornia.csv containing maximum daily temperatures in degrees Celsius for each site from Jan 1, 2012 to Dec 30, 2012. There are fewer individual parts in this question, but note that more marks are available for b) and c), and you should expect to carry out all the usual stages of modelling, e.g. making clear which model you are fitting and to which data, which assumptions you are making, etc. You should also perform appropriate validation checks for each model

library(readr)

#Import the MaxTempCalifornia data
MaxTempCalifornia <- read_csv("MaxTempCalifornia.csv")

#Import the metadataCA data
metadataCA <- read_csv("metadataCA.csv")

Provide spatial and time series plots of the dataset, and comment on trends seen in maximum daily temperature in California in 2012. [5 marks]

library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)

# Convert Date column to date format
MaxTempCalifornia$Date <- ymd(MaxTempCalifornia$Date)

# Melt the dataset to long format
MaxTempCalifornia_long <- pivot_longer(MaxTempCalifornia, cols = -Date, names_to = "City", values_to = "Max_Temp")

# Merge the MaxTempCalifornia and metadataCA datasets
MaxTempCalifornia_long <- inner_join(MaxTempCalifornia_long, metadataCA, by = c("City" = "Location"))

# Create spatial plot
ggplot(MaxTempCalifornia_long, aes(x = Long, y = Lat, fill = Max_Temp)) +
  geom_point(size = 3, shape = 21) +
  scale_fill_gradient(low = "blue", high = "red") +
  labs(title = "Maximum Daily Temperature in California in 2012 (Spatial Plot)",
       subtitle = "Color indicates maximum daily temperature (°F)",
       x = "Longitude", y = "Latitude")+theme_bw()

# Create time series plot with facet_wrap
ggplot(MaxTempCalifornia_long, aes(x = Date, y = Max_Temp, color = City)) +
  geom_line() +
  labs(title = "Maximum Daily Temperature in California in 2012 (Time Series Plot)",
       subtitle = "Color indicates city",
       x = "Date", y = "Maximum Daily Temperature (°F)") +
  facet_wrap(~City, nrow = 4, scales = "free_y") +
  theme(legend.position = "bottom") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

(i) There are low temperatures in January 2012 which increase exponentially until April where the temperatures are average up to October when the temperatures drop to levels observed in early January.

Fit a spatial Gaussian process model using maximum likelihood to predict the maximum temperature in Napa and DeathValley on November 13th 2012. [20 marks]

library(sp)
library(gstat)

# Read in data
MaxTempCalifornia <- read.csv("MaxTempCalifornia.csv")
metadataCA <- read.csv("metadataCA.csv")

# Subset data to include only Napa and Death Valley on November 13th 2012
subset_data <- MaxTempCalifornia[, c(1, 3, 7)]
subset_data <- subset_data[subset_data$Date == "2012-11-13", ]

# Merge the subset of temperature data with the metadata
subset_data1 <- merge(subset_data, metadataCA, by.x = "Napa", by.y = "Location")
subset_data1

## [1] Napa         Date         Death.Valley Long         Lat         
## [6] Elev        
## <0 rows> (or 0-length row.names)

# Define a Gaussian model
gaussian_model <- gstat(formula = Temperature ~ 1, data = subset_data, model = vgm(psill = 15, model = "Gau", range = 0.1, nugget = 0.1))

# Print the model summary
summary(gaussian_model)

##       Length Class  Mode
## data  1      -none- list
## model 1      -none- list
## call  4      -none- call

# Predict the temperature at a new location
new_location <- data.frame(Longitude = -118.2437, Latitude = 36.5323)
coordinates(new_location) <- ~Longitude + Latitude
new_location

## SpatialPoints:
##   Longitude Latitude
## 1 -118.2437  36.5323
## Coordinate Reference System (CRS) arguments: NA

Use time series modelling to produce forecasts of the maximum temperature in Napa and DeathValley for the following 2 sets of dates:

November 9th - November 13th
November 13th - November 17th For the 2nd forecast period, you may simply refit the same models used for the 1st set of forecasts.

# Subset the temperature data for Napa and Death Valley
napa_temp <- MaxTempCalifornia$Napa
death_valley_temp <- MaxTempCalifornia$Death.Valley

# Create time series objects
napa_temp_ts <- ts(napa_temp, start = c(1961, 1), frequency = 365)
death_valley_temp_ts <- ts(death_valley_temp, start = c(1961, 1), frequency = 365)

# Fit an ARIMA model to the Napa temperature time series
napa_temp_arima <- arima(napa_temp_ts, order = c(1,1,1))

# Fit an ARIMA model to the Death Valley temperature time series
death_valley_temp_arima <- arima(death_valley_temp_ts, order = c(1,1,1))

# Make forecasts for November 9th - November 13th
napa_temp_fcst1 <- forecast(napa_temp_arima, h = 5)
napa_temp_fcst1

##           Point Forecast     Lo 80    Hi 80    Lo 95    Hi 95
## 1962.0000       14.15709 10.314132 18.00004 8.279794 20.03438
## 1962.0027       14.01086  9.381413 18.64031 6.930729 21.09100
## 1962.0055       13.92284  8.947770 18.89792 6.314126 21.53156
## 1962.0082       13.86986  8.710974 19.02874 5.980026 21.75969
## 1962.0110       13.83796  8.565260 19.11067 5.774060 21.90187

death_valley_temp_fcst1 <- forecast(death_valley_temp_arima, h = 5)
death_valley_temp_fcst1

##           Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 1962.0000       16.57453 13.104727 20.04433 11.267925 21.88113
## 1962.0027       16.64780 11.486329 21.80927  8.754010 24.54159
## 1962.0055       16.60501 10.302943 22.90708  6.966829 26.24320
## 1962.0082       16.63000  9.303864 23.95614  5.425641 27.83436
## 1962.0110       16.61541  8.423305 24.80751  4.086667 29.14415

Compare your various predictions for the maximum temperature in Napa and DeathValley on November 13th, decide which model is best and discuss whether this is what you would have expected. Identify how prediction could be improved. [8 marks]

# Extract the forecasted values for November 13th from each model's output
forecast_napa_13th <- napa_temp_fcst1$mean[5]
forecast_deathvalley_13th <- death_valley_temp_fcst1$mean[5]
forecast_napa2_13th <- napa_temp_fcst1$mean[5]
forecast_deathvalley2_13th <- death_valley_temp_fcst1$mean[5]

# Calculate accuracy metrics for Napa model
Napa_accuracy <- accuracy(napa_temp_fcst1)
print("Napa model accuracy:")

## [1] "Napa model accuracy:"

print(Napa_accuracy)

##                       ME     RMSE      MAE       MPE     MAPE MASE      ACF1
## Training set -0.04853043 2.994563 2.358193 -2.263218 11.75444  NaN 0.1242719

# Calculate accuracy metrics for DeathValley model
DeathValley_accuracy <- accuracy(death_valley_temp_fcst1)
print("DeathValley model accuracy:")

## [1] "DeathValley model accuracy:"

print(DeathValley_accuracy)

##                       ME    RMSE     MAE        MPE     MAPE MASE        ACF1
## Training set -0.01019214 2.70379 2.00063 -0.4804107 6.683989  NaN -0.04709208

Based on the above results, DeathValley model has a better estimated performance compared to the Napa model. The DeathValley model has a lower MAE and MAPE, which are measures of the average prediction error and the percentage error, respectively, and suggest that the model’s predictions are closer to the actual values.

To improve the predictions, we can try using different modeling techniques, such as exponential smoothing or neural networks. We can also try incorporating additional variables that may be related to the maximum temperature, such as humidity, wind speed, or cloud cover. Additionally, we can experiment with different hyperparameters and model specifications to see if they improve the forecast accuracy.

Data Analysis

2023-04-21