Project Objective

The objective of this code is to predict rainfall using simulated climate variables (temperature, pressure, humidity, wind) through various modeling approaches, ranging from linear and generalized regression to advanced models like M5P regression trees. The focus is on building and comparing predictive models, validating their performance using train/test splits and cross-validation, and quantitatively evaluating predictions with metrics such as RMSE and R². This code enables the exploration of model robustness, identification of the most influential variables, and visualization of model fit through plots comparing observed and predicted values, making the analysis both educational and applicable to real climate datasets.

Packages

The following packages are required for the analysis. Install them using: install.packages('package_name'). Note: For the M5P model, ensure Java is installed (e.g., JDK from https://www.oracle.com/java/technologies/downloads/) and configure rJava accordingly.

library(rJava)
library(RWeka)
library(Cubist)
library(party)
library(modeldata)
library(rpart)
library(caret)
library(klaR)
library(olsrr)
library(leaps)
library(glmulti)
library(datasets)
library(car)
library(randomForest)
library(mlbench)
library(caret)
library(readxl)
library(dplyr)
library(ggplot2)
library(lattice)
library(rpart)
library(Metrics)
library(glmnet)
library(Matrix)
library(kableExtra)
library(hydroGOF)
library(neuralnet)
system("java -version")
## [1] 0

Data Preparation

A synthetic climate dataset is generated with 100 observations and 20 variables, including rainfall, temperature, humidity, and wind speed. A subset of key variables is selected for modeling.

set.seed(123)   # For reproducibility
n <- 100        # Number of observations
data <- data.frame(
  rainfall       = rnorm(n, mean = 50, sd = 20),   # mm
  temperature    = rnorm(n, mean = 27, sd = 5),    # °C
  mslp           = rnorm(n, mean = 1013, sd = 10), # hPa
  humidity       = rnorm(n, mean = 70, sd = 15),   # %
  wind_speed     = rnorm(n, mean = 5, sd = 2),     # m/s
  wind_gust      = rnorm(n, mean = 10, sd = 4),
  wind_dir       = runif(n, min = 0, max = 360),   # degrees
  sunshine       = rnorm(n, mean = 8, sd = 3),     # hours
  cloud_cover    = rnorm(n, mean = 50, sd = 20),   # %
  evapotransp    = rnorm(n, mean = 4, sd = 1.2),   # mm
  soil_moisture  = rnorm(n, mean = 30, sd = 10),   # %
  dew_point      = rnorm(n, mean = 20, sd = 3),    # °C
  vapor_pressure = rnorm(n, mean = 25, sd = 5),    # hPa
  radiation      = rnorm(n, mean = 200, sd = 40),  # W/m²
  snowfall       = rnorm(n, mean = 2, sd = 1),     # mm
  cyclone_freq   = rpois(n, lambda = 1),           # count/year
  drought_index  = rnorm(n, mean = 0, sd = 1),     # normalized index
  heatwaves      = rpois(n, lambda = 3),           # count/year
  storm_surge    = rnorm(n, mean = 0.5, sd = 0.2), # m
  sea_temp       = rnorm(n, mean = 28, sd = 2)     # °C
)

# Select key variables for modeling
data1 <- data[, c("rainfall", "temperature", "mslp", "humidity", "wind_speed")]
head(data1)
##   rainfall temperature     mslp humidity wind_speed
## 1 38.79049    23.44797 1034.988 59.27137   4.852888
## 2 45.39645    28.28442 1026.124 58.70967   2.662697
## 3 81.17417    25.76654 1010.349 55.92192   3.730503
## 4 51.41017    25.26229 1018.432 54.21230   4.942317
## 5 52.58575    22.24191 1008.857 63.44261   6.341392
## 6 84.30130    26.77486 1008.238 74.96769   1.698907
str(data1)
## 'data.frame':    100 obs. of  5 variables:
##  $ rainfall   : num  38.8 45.4 81.2 51.4 52.6 ...
##  $ temperature: num  23.4 28.3 25.8 25.3 22.2 ...
##  $ mslp       : num  1035 1026 1010 1018 1009 ...
##  $ humidity   : num  59.3 58.7 55.9 54.2 63.4 ...
##  $ wind_speed : num  4.85 2.66 3.73 4.94 6.34 ...

Train/Test Split

The dataset is split into training (73.68%) and validation (26.32%) sets to evaluate model performance.

set.seed(123)
partition <- createDataPartition(y = data1$rainfall, p = 0.7368, list = FALSE)
trainingdata <- data1[partition, ]
validationdata <- data1[-partition, ]

Model Training

Linear and generalized linear regression models are trained on the training data, and predictions are made on the validation set.

# Linear regression
model <- lm(rainfall ~ ., data = trainingdata)
p <- predict(model, validationdata)

# Generalized linear regression
modelg <- glm(rainfall ~ ., data = trainingdata)
pg <- predict(modelg, validationdata)

Cross-Validation

A 10-fold cross-validation is performed on the linear regression model to assess its robustness.

modelcv <- train(rainfall ~ ., data = trainingdata,
                 method = "lm",
                 trControl = trainControl(method = "cv", number = 10))
pcv <- predict(modelcv, validationdata)

M5P Regression Tree

An M5P regression tree model is trained on the full dataset, and its performance is evaluated.

fit <- M5P(rainfall ~ ., data = data1)
predictions <- predict(fit, data1)
summary(fit)
## 
## === Summary ===
## 
## Correlation coefficient                  0     
## Mean absolute error                     16.6471
## Root mean squared error                 20.3919
## Relative absolute error                114.1338 %
## Root relative squared error            112.2606 %
## Total Number of Instances              100
# Evaluation
obs <- data1$rainfall
gof(predictions, obs)
##           [,1]
## ME        9.27
## MAE      16.65
## MSE     415.83
## RMSE     20.39
## NRMSE % 111.70
## PBIAS %  17.90
## RSR       1.12
## rSD       0.00
## NSE      -0.26
## mNSE     -0.14
## rNSE    -22.14
## d         0.39
## md        0.30
## rd      -10.13
## cp        0.39
## r           NA
## R2          NA
## bR2         NA
## KGE         NA
## VE        0.68
# Comparison of predictions and observations
cbind(rownames(data1), predictions, obs)
##              predictions        obs               
##   [1,] "1"   "61.0747382902458" "38.7904870689557"
##   [2,] "2"   "61.0747382902458" "45.3964502103344"
##   [3,] "3"   "61.0747382902458" "81.1741662829825"
##   [4,] "4"   "61.0747382902458" "51.4101678284915"
##   [5,] "5"   "61.0747382902458" "52.5857547032189"
##   [6,] "6"   "61.0747382902458" "84.3012997376656"
##   [7,] "7"   "61.0747382902458" "59.218324119784" 
##   [8,] "8"   "61.0747382902458" "24.6987753078693"
##   [9,] "9"   "61.0747382902458" "36.2629429621295"
##  [10,] "10"  "61.0747382902458" "41.0867605980008"
##  [11,] "11"  "61.0747382902458" "74.4816359487892"
##  [12,] "12"  "61.0747382902458" "57.1962765411473"
##  [13,] "13"  "61.0747382902458" "58.015429011881" 
##  [14,] "14"  "61.0747382902458" "52.2136543189024"
##  [15,] "15"  "61.0747382902458" "38.8831773049185"
##  [16,] "16"  "61.0747382902458" "85.7382627360616"
##  [17,] "17"  "61.0747382902458" "59.9570095645848"
##  [18,] "18"  "61.0747382902458" "10.6676568674072"
##  [19,] "19"  "61.0747382902458" "64.0271180312737"
##  [20,] "20"  "61.0747382902458" "40.5441718454413"
##  [21,] "21"  "61.0747382902458" "28.6435258802631"
##  [22,] "22"  "61.0747382902458" "45.6405017068341"
##  [23,] "23"  "61.0747382902458" "29.4799110338552"
##  [24,] "24"  "61.0747382902458" "35.4221754141772"
##  [25,] "25"  "61.0747382902458" "37.4992146430149"
##  [26,] "26"  "61.0747382902458" "16.2661337851517"
##  [27,] "27"  "61.0747382902458" "66.7557408898905"
##  [28,] "28"  "61.0747382902458" "53.0674623567303"
##  [29,] "29"  "61.0747382902458" "27.237261259761" 
##  [30,] "30"  "61.0747382902458" "75.0762984213985"
##  [31,] "31"  "61.0747382902458" "58.5292844295363"
##  [32,] "32"  "61.0747382902458" "44.0985703401546"
##  [33,] "33"  "61.0747382902458" "67.9025132209004"
##  [34,] "34"  "61.0747382902458" "67.5626697506608"
##  [35,] "35"  "61.0747382902458" "66.4316216327497"
##  [36,] "36"  "61.0747382902458" "63.7728050820018"
##  [37,] "37"  "61.0747382902458" "61.0783530707518"
##  [38,] "38"  "61.0747382902458" "48.7617657884656"
##  [39,] "39"  "61.0747382902458" "43.8807467252017"
##  [40,] "40"  "61.0747382902458" "42.3905799797523"
##  [41,] "41"  "61.0747382902458" "36.1058604215897"
##  [42,] "42"  "61.0747382902458" "45.841654439608" 
##  [43,] "43"  "61.0747382902458" "24.6920729686347"
##  [44,] "44"  "61.0747382902458" "93.3791193067702"
##  [45,] "45"  "61.0747382902458" "74.1592399660998"
##  [46,] "46"  "61.0747382902458" "27.537828335933" 
##  [47,] "47"  "61.0747382902458" "41.9423032940185"
##  [48,] "48"  "61.0747382902458" "40.6668929275356"
##  [49,] "49"  "61.0747382902458" "65.5993023667264"
##  [50,] "50"  "61.0747382902458" "48.3326186705634"
##  [51,] "51"  "61.0747382902458" "55.0663702798951"
##  [52,] "52"  "61.0747382902458" "49.4290648930259"
##  [53,] "53"  "61.0747382902458" "49.1425908541737"
##  [54,] "54"  "61.0747382902458" "77.3720456802891"
##  [55,] "55"  "61.0747382902458" "45.4845802868146"
##  [56,] "56"  "61.0747382902458" "80.3294120885908"
##  [57,] "57"  "61.0747382902458" "19.0249439153956"
##  [58,] "58"  "61.0747382902458" "61.6922749927214"
##  [59,] "59"  "61.0747382902458" "52.4770848768923"
##  [60,] "60"  "61.0747382902458" "54.3188313748795"
##  [61,] "61"  "61.0747382902458" "57.5927896551976"
##  [62,] "62"  "61.0747382902458" "39.953530937814" 
##  [63,] "63"  "61.0747382902458" "43.3358523266116"
##  [64,] "64"  "61.0747382902458" "29.6284923378582"
##  [65,] "65"  "61.0747382902458" "28.5641754704884"
##  [66,] "66"  "61.0747382902458" "56.0705728280852"
##  [67,] "67"  "61.0747382902458" "58.9641955725885"
##  [68,] "68"  "61.0747382902458" "51.0600845346101"
##  [69,] "69"  "61.0747382902458" "68.4453493575948"
##  [70,] "70"  "61.0747382902458" "91.0016937125429"
##  [71,] "71"  "61.0747382902458" "40.1793766788693"
##  [72,] "72"  "61.0747382902458" "3.81662248718375"
##  [73,] "73"  "61.0747382902458" "70.1147704892451"
##  [74,] "74"  "61.0747382902458" "35.8159847483522"
##  [75,] "75"  "61.0747382902458" "36.2398276706528"
##  [76,] "76"  "61.0747382902458" "70.511427393934" 
##  [77,] "77"  "61.0747382902458" "44.3045398589798"
##  [78,] "78"  "61.0747382902458" "25.5856457549093"
##  [79,] "79"  "61.0747382902458" "53.626069594983" 
##  [80,] "80"  "61.0747382902458" "47.2221727512191"
##  [81,] "81"  "61.0747382902458" "50.1152837179977"
##  [82,] "82"  "61.0747382902458" "57.7056080225266"
##  [83,] "83"  "61.0747382902458" "42.5867993641518"
##  [84,] "84"  "61.0747382902458" "62.8875309703767"
##  [85,] "85"  "61.0747382902458" "45.590268763625" 
##  [86,] "86"  "61.0747382902458" "56.6356392783139"
##  [87,] "87"  "61.0747382902458" "71.936780262987" 
##  [88,] "88"  "61.0747382902458" "58.7036298166761"
##  [89,] "89"  "61.0747382902458" "43.4813682893755"
##  [90,] "90"  "61.0747382902458" "72.9761523690219"
##  [91,] "91"  "61.0747382902458" "69.8700771192424"
##  [92,] "92"  "61.0747382902458" "60.9679391901614"
##  [93,] "93"  "61.0747382902458" "54.7746347022288"
##  [94,] "94"  "61.0747382902458" "37.4418784792126"
##  [95,] "95"  "61.0747382902458" "77.2130489706002"
##  [96,] "96"  "61.0747382902458" "37.9948082570575"
##  [97,] "97"  "61.0747382902458" "93.7466598603315"
##  [98,] "98"  "61.0747382902458" "80.6522125237038"
##  [99,] "99"  "61.0747382902458" "45.2859928179905"
## [100,] "100" "61.0747382902458" "29.4715819938644"

Model Visualization

Visualizations are created to compare observed and predicted rainfall values and to assess model residuals.

# Observed vs Predicted (M5P)
df_pred <- data.frame(obs = obs, pred = predictions)
ggplot(df_pred, aes(x = obs, y = pred)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Observed vs Predicted Rainfall (M5P)",
       x = "Observed Rainfall (mm)",
       y = "Predicted Rainfall (mm)") +
  theme_minimal(base_size = 14)

# Residual plots for linear model
par(mfrow = c(1,2))
plot(model, which = 1)  # Residuals vs Fitted
plot(model, which = 2)  # QQ-plot of residuals

Pedagogical Value

This project serves as an educational exercise, demonstrating the application of various modeling techniques, from linear regression to M5P regression trees, in a controlled environment. The methodology, including train/test splits, cross-validation, and performance evaluation, can be applied to real-world climate data analysis, providing insights into predictive modeling and variable importance.author: “Abdi-Basid ADAN” email: “